Computational Bayesian Statistics
An Introduction
1 Bayesian Inference 1
1.1 The Classical Paradigm 2
1.2 The Bayesian Paradigm 5
1.3 Bayesian Inference 8
1.3.1 Parametric Inference 8
1.3.2 Predictive Inference 12
1.4 Conclusion 13
Problems 14
5 Model Assessment 72
5.1 Model Criticism and Adequacy 72
5.2 Model Selection and Comparison 78
5.2.1 Measures of Predictive Performance 79
5.2.2 Selection by Posterior Predictive Performance 83
5.2.3 Model Selection Using Bayes Factors 85
5.3 Further Notes on Simulation in Model Assessment 87
5.3.1 Evaluating Posterior Predictive Distributions 87
5.3.2 Prior Predictive Density Estimation 88
5.3.3 Sampling from Predictive Distributions 89
Problems 90
9 Software 174
9.1 Application Example 175
9.2 The BUGS Project: WinBUGS and OpenBUGS 176
9.2.1 Application Example: Using R2OpenBUGS 177
9.3 JAGS 183
9.3.1 Application Example: Using R2jags 183
9.4 Stan 187
9.4.1 Application Example: Using RStan 188
9.5 BayesX 194
9.5.1 Application Example: Using R2BayesX 196
9.6 Convergence Diagnostics: the Programs CODA and BOA 200
9.6.1 Convergence Diagnostics 201
9.6.2 The CODA and BOA Packages 203
9.6.3 Application Example: CODA and BOA 205
9.7 R-INLA and the Application Example 215
9.7.1 Application Example 217
Problems 224
Preface
This book is based on lecture notes for a short course that was given at
the XXII Congresso da Sociedade Portuguesa de Estatística. In the trans-
lation from the original Portuguese text we have added some additional
material on sequential Monte Carlo, Hamiltonian Monte Carlo, transdi-
mensional Markov chain Monte Carlo (MCMC), and variational Bayes,
and we have introduced problem sets. The inclusion of problems makes
the book suitable as a textbook for a first graduate-level class in Bayesian
computation with a focus on Monte Carlo methods. The extensive discus-
sion of Bayesian software makes it useful also for researchers and graduate
students from beyond statistics.
The core of the text lies in Chapters 4, 6, and 9 on Monte Carlo meth-
ods, MCMC methods, and Bayesian software. Chapters 5, 7, and 8 include
additional material on model validation and comparison, transdimensional
MCMC, and conditionally Gaussian models. Chapters 1 through 3 intro-
duce the basics of Bayesian inference, and could be covered fairly quickly
by way of introduction; these chapters are intended primarily for review
and to introduce notation and terminology. For a more in-depth introduc-
tion we recommend the textbooks by Carlin and Louis (2009), Christensen
et al (2011), Gelman et al (2014a) or Hoff (2009).
statistics and for the opportunity to give a minicourse at the 22nd confer-
ence of the society. We also acknowledge the institutional support from
the Universidade de Lisboa through the Centro de Estatística e Aplicações
(PEst-OE/MAT/UI0006/2014, UID/MAT/00006/2013), in the Department
of Statistics and Operations Research in the Faculdade de Ciências and
of the Department of Mathematics in the Instituto Superior Técnico. We
would like to acknowledge that the partial support by the Fundação para a
Ciência e Tecnologia through various projects over many years enabled us
to build up this expertise in Bayesian statistics.
Finally, we would like to dedicate this book to Professor Bento Murteira
to whom the development of Bayesian statistics in Portugal owes a lot. In
fact, Chapter 1 in this book reflects in many ways the flavor of his writings.
1
Bayesian Inference
Before discussing Bayesian inference, we recall the fundamental problem
of statistics: “The fundamental problem towards which the study of Statis-
tics is addressed is that of inference. Some data are observed and we wish
to make statements, inferences, about one or more unknown features of
the physical system which gave rise to these data” (O’Hagan, 2010). Upon
more careful consideration of the foundations of statistics we find many
different schools of thought. Even leaving aside those that are collectively
known as classical statistics, this leaves several choices: objective and sub-
jective Bayes, fiducialist inference, likelihood-based methods, and more.2
This diversity is not unexpected! Deriving the desired inference on pa-
rameters and models from the data is a problem of induction, which is one
of the most controversial problems in philosophy. Each school of thought
follows its own principles and methods to lead to statistical inference.
Berger (1984) describes this as: “Statistics needs a ‘foundation’, by which
I mean a framework of analysis within which any statistical investigation
can theoretically be planned, performed, and meaningfully evaluated. The
words ‘any’ and ‘theoretically’ are key, in that the framework should ap-
ply to any situation but may only theoretically be implementable. Practical
difficulties or time limitations may prevent complete (or even partial) utili-
sation of such framework, but the direction in which ‘truth’ could be found
would at least be known”. The foundations of Bayesian inference are bet-
1 This material will be published by Cambridge University Press as Computational Bayesian Statistics, by M.A. Amaral Turkman, C.D. Paulino, and P. Müller (https://tinyurl.com/CompBayes). This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © Maria Antónia Amaral Turkman, Carlos Daniel Paulino & Peter Müller, 2019.
2 Subjective Bayes is essentially the subject of this volume. In addition to these schools of thought, there are even half-Bayesians who accept the use of a priori information but believe that probability calculus is inadequate to combine prior information with data, which should instead be replaced by a notion of causal inference.
lower and upper limits are functions of (X1 , X2 , . . . , Xn ) and, therefore, ran-
dom variables. However, once a specific sample is observed (i.e., post-
experimentally) as n real values, (x1 , x2 , . . . , xn ), this becomes a specific
interval on the real line (now with real numbers as lower and upper limits).
(T∗(x1 , x2 , . . . , xn ), T∗∗(x1 , x2 , . . . , xn )),
and the probability
P{T∗(x1 , x2 , . . . , xn ) < θ < T∗∗(x1 , x2 , . . . , xn ) | θ} = 1 − α,
0 < α < 1, is no longer meaningful. In fact, since θ has an unknown, but
fixed, value, this probability can only be 1 or 0, depending upon whether
the true value of θ is or is not in the real interval
(T∗(x1 , x2 , . . . , xn ), T∗∗(x1 , x2 , . . . , xn )).
Of course, since θ is unknown, the investigator does not know which situ-
ation applies. However, a classical statistician accepts the frequentist inter-
pretation of probability and invokes the principle of repeated sampling in
the following way: If one imagines a repetition of the sampling and infer-
ence process (each sample with n observations) a large number of times,
then in (1 − α)100% of the repetitions the numerical interval will include
the value of θ.
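This repeated-sampling reasoning is easy to check by simulation. The following sketch (in Python; the book's software examples use R, and the values θ = 2, n = 25 are made up for illustration) draws many samples and counts how often the realized 95% interval for a normal mean with known variance covers the fixed θ:

```python
import math
import random

# Illustration only (theta, n, reps are made-up values, not from the text):
# repeated-sampling coverage of the classical 95% interval for a normal mean
# with known variance 1.
random.seed(1)
theta = 2.0                # the fixed but "unknown" parameter value
n, reps, z = 25, 4000, 1.96
covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(theta, 1.0) for _ in range(n)) / n
    half = z / math.sqrt(n)                 # half-width of the 95% interval
    if xbar - half < theta < xbar + half:
        covered += 1
print(covered / reps)
```

The long-run proportion comes out close to 0.95, even though for any single realized interval the coverage statement is simply true or false.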
Another instance of classical statistical inference is a parametric hypoth-
esis test. In the course of scientific investigation one frequently encounters,
in the context of a certain theory, the concept of a hypothesis about the
value of one (or multiple) parameter(s), expressed in symbols, for example, as
H0 : θ = θ0 .
This raises the following fundamental question: Do the data (x1 , x2 , . . . , xn )
support or not support the proposed hypothesis? This hypothesis is tradi-
tionally referred to as the null hypothesis. Here, too, the classical solution
is based on the principle of repeated sampling if one follows the
Neyman–Pearson theory. It aims to find a rejection region W (critical region)
defined as a subset of the sample space, W ⊂ X, such that
(X1 , X2 , . . . , Xn ) ∈ W ⇒ rejection of H0 ,
of θ,
L(θ | x) = k f (x | θ), θ ∈ Θ, or
L(θ | x1 , . . . , xn ) = k Πi f (xi | θ), θ ∈ Θ, (1.2)
which expresses for every θ ∈ Θ its likelihood or plausibility when X = x
or (X1 = x1 , X2 = x2 , . . . , Xn = xn ) is observed. The symbol k represents a
factor that does not depend on θ. The likelihood function – it is not a prob-
ability, and therefore, for example, it is not meaningful to add likelihoods –
plays an important role in Bayes’ theorem as it is the factor through which
the data, x, updates prior knowledge about θ; that is, the likelihood can be
interpreted as quantifying the information about θ that is provided by the
data x.
In summary, for a Bayesian the posterior distribution contains, by way
of Bayes’ theorem, all available information about a parameter:
prior information + information from the sample.
It follows that all Bayesian inference is based on h(θ | x) [or h(θ | x1 , x2 ,
. . . , xn )].
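This updating can be carried out numerically even when no closed form is at hand. A minimal grid sketch (Python for illustration; the Be(2, 2) prior and the data are made up, not taken from the text) multiplies prior and likelihood on a grid and normalizes:

```python
import math

# A grid sketch of Bayes' theorem with made-up ingredients (not an example
# from the text): n = 10 Bernoulli trials with s = 3 successes and a Be(2, 2)
# prior, so the exact posterior is Be(5, 9) with mean 5/14.
n, s, m = 10, 3, 1000
grid = [(i + 0.5) / m for i in range(m)]
prior = [6 * t * (1 - t) for t in grid]          # Be(2, 2) density
lik = [t**s * (1 - t)**(n - s) for t in grid]    # likelihood kernel
post = [p * l for p, l in zip(prior, lik)]
norm = sum(post) / m                             # normalizing constant on the grid
post = [p / norm for p in post]                  # h(theta | x) on the grid
post_mean = sum(t * p for t, p in zip(grid, post)) / m
print(post_mean)
```

Here the conjugate beta result gives the exact answer Be(5, 9), with mean 5/14, which the grid approximation reproduces.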
When θ is a parameter vector, that is, θ = (γ, φ) ∈ Γ × Φ, it can be
the case that the desired inference is restricted to a subvector of θ, say
γ. In this case, in contrast to the classical paradigm, the elimination of
the nuisance parameter φ under the Bayesian paradigm always follows the
same principle, namely through the marginalization of the joint posterior
distribution,
h(γ | x) = ∫Φ h(γ, φ | x) dφ = ∫Φ h(γ | φ, x) h(φ | x) dφ. (1.3)
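Given simulated draws of φ from h(φ | x), the second integral in (1.3) suggests a simple Monte Carlo estimate of the marginal posterior density of γ: average the conditional density h(γ | φ, x) over the draws. A toy Python sketch (the Ga(3, 3) posterior for a precision φ and the N(0, 1/φ) conditional are invented for illustration):

```python
import math
import random

# Toy sketch of (1.3) by Monte Carlo (the model is invented for illustration):
# nuisance parameter phi is a precision with posterior h(phi | x) = Ga(3, 3)
# (shape 3, rate 3), and gamma | phi, x ~ N(0, 1/phi).  The marginal posterior
# density of gamma at a point g is the average of the conditional density
# h(gamma | phi, x) over draws of phi.
random.seed(2)
a, b = 3.0, 3.0
phis = [random.gammavariate(a, 1 / b) for _ in range(20000)]  # scale = 1/rate
g = 0.0
dens = sum(math.sqrt(phi / (2 * math.pi)) * math.exp(-phi * g * g / 2)
           for phi in phis) / len(phis)
print(dens)
```

In this toy case the exact marginal is a Student-t density, and the average of conditional normal densities at γ = 0 agrees with it to Monte Carlo accuracy.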
in the implementation the two approaches differ. On the one hand, classical
inference is based on probabilities associated with different samples, x, that
could be observed under some fixed but unknown value of θ. That is, in-
ference is based on sampling distributions that “weigh” probabilistically
the values that a variable X or statistic T (X) can assume across the sample
space. On the other hand, Bayesian inference is based on subjective prob-
abilities or a posteriori credibilities associated with different values of the
parameter θ and conditional on the particular observed x value. The main
point is that x is fixed and known and θ is uncertain.
For example, once x is observed, a Bayesian being asked about the
hypothesis {θ ≤ 0.5} would directly address the question by calculating
P(θ ≤ 0.5 | x) based on h(θ | x), i.e., without leaving probability cal-
culus. In contrast, a classical statistician would not directly answer the
question. Stating, for example, that the hypothesis H0 : θ ≤ 0.5 is re-
jected at significance level 5% does not mean that its probability is less
than 0.05, but that if the hypothesis H0 were true, (i.e., if in fact θ ≤ 0.5),
then the probability of X falling into a given rejection region W would be
P(X ∈ W | θ ≤ 0.5) < 0.05, and if in fact x ∈ W, then the hypothesis is
rejected.
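Numerically, the Bayesian answer is a single integral of the posterior density. A sketch with assumed ingredients (uniform prior and 9 successes in 12 Bernoulli trials, hence posterior Be(10, 4); none of this is from the text):

```python
import math

# Sketch with assumed ingredients (not from the text): uniform prior and 9
# successes in 12 Bernoulli trials, so h(theta | x) = Be(10, 4).  The Bayesian
# answers "is theta <= 0.5?" directly via P(theta <= 0.5 | x).
def beta_pdf(t, a, b):
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)

def simpson(f, lo, hi, m=2000):          # composite Simpson rule, m even
    h = (hi - lo) / m
    s = f(lo) + f(hi) + sum((4 if k % 2 else 2) * f(lo + k * h)
                            for k in range(1, m))
    return s * h / 3

p = simpson(lambda t: beta_pdf(t, 10, 4), 0.0, 0.5)
print(p)   # P(theta <= 0.5 | x), about 0.046
```

The probability P(θ ≤ 0.5 | x) is a direct statement about θ, with no reference to samples that could have been observed but were not.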
In O’Hagan’s words (O’Hagan, 2010), while a Bayesian can state prob-
abilities about the parameters, which are considered random variables, this
is not possible for a classical statistician, who uses probabilities on data
and not on parameters and needs to restate such probabilities so that they
seem to say something about the parameter. The question is also related to a
different view of the sample space. For a classical statistician, the concept
of the sample space is fundamental, as repeated sampling would explore
the entire space. A Bayesian would start by objecting to the reliance on
repeated sampling and would assert that only the actually observed value
x is of interest and not the space that x belongs to, which could be totally
arbitrary, and which contains, besides x, observations that could have been
observed, but were not.4
In estimation problems a classical statistician has several alternatives for
functions of the data – estimators – whose sampling properties are inves-
tigated under different perspectives (consistency, unbiasedness, etc.). For
a Bayesian there is only one estimator, which specifically is the posterior
distribution h(θ | x).
4 The irrelevance of the sample space also leads to the same issue about stopping rules, something which Mayo and Kruse (2001) note, recalling Armitage, could cause problems for Bayesians.
One can, of course, summarize this distribution in different ways, using
mode, mean, median, or variance. But this is unrelated
to the problem facing a classical statistician, who has to find a so-called op-
timal estimator. For a Bayesian such a problem only exists in the context of
decision theory, an area in which the Bayesian view has a clear advantage
over the classical view. Related to this, Savage claims that in past decades
the central problem in the face of uncertainty has been shifting from which
inference one should report to which decision should be taken. As individual
decisions have been considered outdated by some philosophers, we have
also recently seen a resurgence of the Bayesian approach in the context of
group decisions.
Under a Bayesian approach, confidence intervals are replaced by credi-
ble intervals (or regions). Given x, and once a posterior distribution is deter-
mined, one finds a credible interval for a parameter θ (assume, for the mo-
ment, a scalar). The interval is formed by two values of θ, say [θ̲(x), θ̄(x)],
or, more simply, (θ̲, θ̄), such that
P(θ̲ < θ < θ̄ | x) = ∫_θ̲^θ̄ h(θ | x) dθ = 1 − α, (1.4)
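A common concrete choice in (1.4) is the equal-tail interval, with α/2 posterior probability in each tail. A numerical sketch (Python; the Be(5, 9) posterior is assumed purely for illustration) inverts a quadrature-based CDF by bisection:

```python
import math

# Equal-tail version of (1.4) with an assumed posterior Be(5, 9) (illustration
# only): invert a quadrature-based CDF by bisection to get 2.5% in each tail.
def beta_pdf(t, a=5, b=9):
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)

def cdf(x, m=2000):                       # Simpson approximation of P(theta <= x)
    h = x / m
    s = beta_pdf(0.0) + beta_pdf(x) + sum((4 if k % 2 else 2) * beta_pdf(k * h)
                                          for k in range(1, m))
    return s * h / 3

def quantile(p):
    lo, hi = 0.0, 1.0
    for _ in range(60):                   # bisection on the monotone CDF
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

t_low, t_up = quantile(0.025), quantile(0.975)
print(t_low, t_up)
```

An HPD interval would instead be the shortest interval with the required probability; for a unimodal symmetric posterior the two coincide.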
ical aspects of the theory of confidence intervals (or regions), there are the
ironic comments of Lindley (1990), who says that he knows various
axiomatizations of probability – for example, those due to Savage, de Finetti,
or Kolmogorov – but no axiomatic definition of confidence.
For example, when a Bayesian investigates a composite hypothesis H0 :
θ ∈ Θ0 versus a composite alternative H1 : θ ∈ Θ1 , with Θ0 ∩ Θ1 =
∅, Θ0 ∪ Θ1 = Θ, she or he uses expressions in terms of probabilities on θ.
When the investigator possesses a distribution h(θ), θ ∈ Θ, representing the
initial credibility attributed to different parameter values, her or his prior
probabilities of the competing hypotheses are determined by
P(Θ0 ) = ∫Θ0 h(θ) dθ, P(Θ1 ) = ∫Θ1 h(θ) dθ.
The ratio P(Θ0 )/P(Θ1 ) is known as the prior odds for H0 versus H1 . Af-
ter the experiment resulting in the observations x, and after determining
h(θ | x), a Bayesian statistician calculates the corresponding posterior prob-
abilities
P(Θ0 | x) = ∫Θ0 h(θ | x) dθ, P(Θ1 | x) = ∫Θ1 h(θ | x) dθ,
and usually also the posterior odds for H0 versus H1 , that is,
P(Θ0 | x)/P(Θ1 | x). One can therefore say that in the Bayesian framework
the inference outcome is not so much the acceptance or rejection of the
hypothesis H0 –
as is the case in the Neyman–Pearson framework – but rather the updating
of the plausibility that is attributed to the competing hypotheses. Bayesian
inference can be described as a comparison of posterior odds versus prior
odds through
B(x) = [P(Θ0 | x)/P(Θ1 | x)] / [P(Θ0 )/P(Θ1 )], (1.6)
which is known as the Bayes factor in favor of H0 (or Θ0 ). The Bayes factor
quantifies the evidence in the data x in favor of H0 . Of course, the larger
the Bayes factor, the larger is the increase of the posterior odds relative
to the prior odds and thus the support that the data give to the hypothesis
H0 . In general, the Bayes factor depends on the prior distribution and can
be expressed as a ratio of likelihoods weighted by the prior distributions
conditional on the respective hypothesis on Θ0 and Θ1 (see also Paulino et
al., 2003). In this sense one cannot say that the Bayes factor is a measure
of the support for H0 based only on the data.
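A small numerical illustration of (1.6), with every ingredient assumed rather than taken from the text: H0 : θ ≤ 0.5 versus H1 : θ > 0.5, prior θ ∼ Be(1, 1) (so the prior odds equal 1), and posterior Be(10, 4) after 9 successes in 12 trials:

```python
import math

# Sketch of (1.6) with every ingredient assumed (not the book's example):
# H0: theta <= 0.5 vs H1: theta > 0.5, prior theta ~ Be(1, 1) (prior odds 1),
# posterior Be(10, 4) after 9 successes in 12 trials.
def beta_pdf(t, a, b):
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)

def prob_leq(x, a, b, m=2000):            # Simpson approximation of P(theta <= x)
    h = x / m
    s = beta_pdf(0.0, a, b) + beta_pdf(x, a, b) + \
        sum((4 if k % 2 else 2) * beta_pdf(k * h, a, b) for k in range(1, m))
    return s * h / 3

prior0 = prob_leq(0.5, 1, 1)              # P(Theta_0) = 0.5 under Be(1, 1)
post0 = prob_leq(0.5, 10, 4)              # P(Theta_0 | x)
bf = (post0 / (1 - post0)) / (prior0 / (1 - prior0))
print(bf)
```

Here B(x) comes out well below 1, so the data lower the odds in favor of H0 by roughly a factor of twenty.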
When the hypothesis about θ is specific to the point of being defined
as H0 : θ = θ0 , evaluation of a Bayes factor or of posterior odds requires
that the prior distribution be consistent with this conjecture in the sense
of avoiding zero probability for H0 , implying in general that the prior is a
mixture model. This implication is considered natural by Bayesians such
as Jeffreys, with the argument that a prior distribution needs to integrate
probabilistic judgments that are inherent in the statement of competing hy-
potheses, which in this case attribute some importance to θ0 , as opposed to
other values of θ.
Other Bayesians such as Lindley and Zellner advocate a different ap-
proach, with a certain formal analogy with classical significance tests, in a
way in which the statement of point hypotheses does not interfere with the
prior distribution. Their approach can be described as a quantification of
relative plausibility under the posterior for the value θ0 , via the evaluation
of P = P(θ ∉ R0 (x) | x), where R0 (x) = {θ ∈ Θ : h(θ | x) ≥ h(θ0 | x)} is
the smallest HPD region that contains θ0 . Large (small) values of the pos-
terior relative plausibility P for H0 are evidence in favor of (against) this
hypothesis.
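The measure P is straightforward to evaluate on a grid. A sketch with assumed ingredients (posterior Be(10, 4) and point hypothesis θ0 = 0.5; not an example from the text):

```python
import math

# Sketch of the Lindley-Zellner measure with assumed ingredients (not from
# the text): posterior Be(10, 4) and point hypothesis theta0 = 0.5.
def beta_pdf(t, a=10, b=4):
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)

theta0, m = 0.5, 100000
h0 = beta_pdf(theta0)
# P = posterior probability of {theta : h(theta | x) < h(theta0 | x)},
# the complement of the smallest HPD region containing theta0.
vals = [beta_pdf((k + 0.5) / m) for k in range(m)]
P = sum(v for v in vals if v < h0) / m
print(P)   # a small P counts as evidence against H0: theta = theta0
```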
The fundamental tool of the Bayesian approach and the way the joint
model M = { f (x | θ)h(θ), x ∈ X, θ ∈ Θ} is used in the implementation
of inference already suggest that the question of evaluating the adequacy
of a conjectured model in absolute terms might not have an answer in the
sense of Popper (reject/not reject) of the type that is guaranteed by classical
goodness-of-fit tests.
Bayes factors can be used if it is possible to extend the model M (or
parts of it) to a larger family that contains the true model as an unknown
quantity, and that allows comparing models within it. Otherwise, one can
only define various measures of model adequacy for a relative analysis of
a reference model in the context of a class of suitably defined competing
models (see Chapter 5). The unsatisfactory nature of such options has led
some statisticians to defend a Bayesian approach only when the underlying
model is not put into question, a condition which Gillies (2001) refers to
as fixity of the theoretical framework.
1.4 Conclusion
In summary, from a Bayesian point of view:
• The classical approach to statistical inference proceeds by arguments of
an inductive type, such as, the notion of confidence intervals, which do
not have a direct interpretation as probabilities. The difficulty or impos-
Problems
1.1 Suppose there are N cable cars in San Francisco, numbered 1 to N. You
don’t know the value of N, so this is the unknown parameter. Your prior
distribution on N is a geometric distribution with mean 100; that is
h(N) = (1/100) (99/100)N−1 ,
N = 1, 2, . . . . You see a cable car at random. It is numbered x = 203. Assume
that x = number on a randomly picked car has the probability distribution
f (x | N) = 1/N for x = 1, . . . , N, and f (x | N) = 0 for x > N.
a. Find the posterior distribution h(N | x). Find the Bayes estimate of N, i.e.,
the posterior mean of N, and the posterior standard deviation of N (use
results for a geometric series Σ_{x=k}^∞ a^x ; or use a numerical approximation).
b. Find a 95% HPD credible interval for N (you will not be able to exactly
match the 95% – get as close as possible).
1.2 Recording the number of bacteria (yi ) found in n = 6 water samples (of the
same volume), we find yi = 2, 3, 8, 6, 4, and 1 (you might need S = Σ yi =
a. Find the predictive distribution (i.e., the marginal distribution) for the first
toss, under each of the three hypotheses, i.e., find p(x1 = 1 | H1 ), p(x1 =
1 | H2 ), and p(x1 = 1 | H3 ).
b. Find the predictive distribution for the first toss, p(x1 = 1).
c. Still using the data x1 from one coin toss, given x1 = 1,
(i) find the Bayes factors for H1 versus H2 and for H3 versus H2 ;
(ii) find the posterior probability that the coin is unbiased.
(iii) In general it is not meaningful to compute a Bayes factor with a
non-informative uniform prior h(θ) ∝ c (because the choice of c is
arbitrary). Why is it okay here?
2
Representation of Prior Information
The mechanics of the inference process require that all basic ingredients
be properly specified. The first one is a sampling model that can explain
(with more or less accuracy) the data as they arise from some experiment
or observational process and which we wish to analyze. This model en-
compasses a set of unknown aspects about which one could have prior
information that should be included in the analysis, no matter how vague
or significant this information is, and that need to be somehow represented
and quantified.
The process of representing prior information is often complicated by
the involvement of subjective elements that need to be elicited. We address
this here for two cases:
• The first case is when no prior information is available, neither objective
nor subjective (sometimes referred to as “a priori ignorance”) or when
prior knowledge is of little significance relative to the information from
the sampling model (“vague” or “diffuse” prior information). We review
some of the principal methods to derive prior distributions that are in
some sense minimally informative, and which are commonly referred to
as non-informative priors.
• The second case assumes a convenient parametric family and then chooses
a member of that family using carefully elicited summary measures for
the desired distribution. An example of this elicitation process arises in
the medical inference problem that is discussed by Paulino et al (2003).
This is the context of so-called natural conjugate priors. Such priors
not like un-normalized measures (even if this does not necessarily prevent
the use of Bayes’ theorem, as the posterior distribution, the source of all
inference, might often still be proper).
Another and perhaps more serious critique against the argument that the
lack of information, which some refer to as ignorance, should be repre-
sented by a uniform distribution, is the fact that this is not invariant with
respect to non-linear transformations, thus leading to contradictions. To il-
lustrate, take the model {Ber(θ), θ ∈ (0, 1)} which is part of the exponential
family with natural parameter ψ = ln [θ/(1 − θ)] ∈ R. The use of uniform
distributions for θ (proper) and ψ (improper) is probabilistically inconsis-
tent. In fact, θ ∼ U(0, 1) ≡ Be(1, 1) is equivalent to the standard logistic
distribution for ψ, with density h(ψ) = eψ /(1 + eψ )2 , ψ ∈ R.
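This equivalence is easy to confirm by simulation (an illustration, not part of the text's argument):

```python
import math
import random

# Monte Carlo check of the stated equivalence (illustration only): draw
# theta ~ U(0, 1), set psi = ln[theta/(1 - theta)], and compare the empirical
# distribution of psi with the standard logistic CDF F(x) = e^x/(1 + e^x).
random.seed(3)
psi = [math.log(t / (1 - t))
       for t in (random.random() for _ in range(100000)) if 0 < t < 1]
emp = sum(1 for v in psi if v <= 1.0) / len(psi)    # empirical P(psi <= 1)
logistic_cdf = math.exp(1.0) / (1 + math.exp(1.0))  # about 0.731
print(emp, logistic_cdf)
```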
Jeffreys’ Prior
One of the approaches that ensure invariance under one-to-one transfor-
mations is the one proposed by Jeffreys. Jeffreys’ prior is based on Fisher
information for θ ∈ R, defined as
I(θ) = Eθ [ (∂ ln f (X | θ)/∂θ)2 ].
For the univariate case Jeffreys’ prior is defined by h(θ) ∝ [I(θ)]1/2 . The fact
that for any real-valued one-to-one transformation ψ of θ ∈ R,
I(ψ) = I(θ(ψ)) (dθ/dψ)2 ,
shows that Jeffreys’ prior for the univariate case has the desired invariance
property and thus ensures invariance of inference under arbitrary transformations of the parameter space (that is, under reparametrization).
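For the Bernoulli model used earlier these facts can be checked directly; the sketch below computes I(θ) as the expectation of the squared score, recovers the standard Be(1/2, 1/2) Jeffreys kernel [θ(1 − θ)]−1/2 , and verifies the invariance identity for ψ = ln[θ/(1 − θ)]:

```python
# Numerical check for the Bernoulli model (the Be(1/2, 1/2) Jeffreys prior
# for this model is a standard result):
def fisher_info(theta):
    # I(theta) as the expectation of the squared score over x in {0, 1}
    score1 = 1 / theta              # d/dtheta ln f(x | theta) at x = 1
    score0 = -1 / (1 - theta)       # ... at x = 0
    return theta * score1**2 + (1 - theta) * score0**2

for theta in (0.1, 0.3, 0.5, 0.8):
    # I(theta) = 1/[theta(1 - theta)], so Jeffreys' prior has the
    # Be(1/2, 1/2) kernel [theta(1 - theta)]^(-1/2)
    assert abs(fisher_info(theta) - 1 / (theta * (1 - theta))) < 1e-9
    # invariance: for psi = logit(theta), dtheta/dpsi = theta(1 - theta),
    # and I(psi) = I(theta) (dtheta/dpsi)^2 = theta(1 - theta)
    dtheta_dpsi = theta * (1 - theta)
    assert abs(fisher_info(theta) * dtheta_dpsi**2 - theta * (1 - theta)) < 1e-9
print("checks passed")
```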
To understand the non-informative nature of Jeffreys’ prior, note that I(θ)
grows with the square of the change (in expectation over the sample space)
with respect to θ of ln f (X | θ). Also note that the better the model can
differentiate θ from θ+dθ, the larger I(θ), that is, the sample information on
positive, that is, Θ = (0, +∞), and to have mean u. Using as the non-
informative, translation invariant prior distribution the “uniform” distri-
bution on Θ, we get h(θ) ∝ exp(λ1 θ), θ > 0, which implies
h(θ) = −λ1 exp(λ1 θ) I(0,+∞) (θ),
with λ1 < 0, that is, an exponential distribution. Keeping in mind that
the fixed mean is −1/λ1 = u, we find the maximum entropy distribution
θ ∼ Exp(1/u).
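As a numerical sanity check (with u = 2 assumed for illustration), one can compare the differential entropy of the Exp(1/u) solution, which is 1 + ln u, with that of another positive distribution having the same mean, here a Ga(2, u/2) density:

```python
import math

# Numeric sanity check with u = 2 assumed (illustration): the Exp(1/u) density
# has differential entropy 1 + ln u, larger than that of another positive
# distribution with the same mean, here a Ga(2, u/2) density.
u = 2.0

def entropy(pdf, hi=80.0, m=200000):
    # simple rectangle-rule approximation of -integral p ln p dt over (0, hi)
    h = hi / m
    s = 0.0
    for k in range(1, m):
        p = pdf(k * h)
        if p > 0:
            s -= p * math.log(p) * h
    return s

exp_pdf = lambda t: math.exp(-t / u) / u
gam_pdf = lambda t: t * math.exp(-2 * t / u) * (2 / u) ** 2   # Ga(2, scale u/2)
ent_exp, ent_gam = entropy(exp_pdf), entropy(gam_pdf)
print(ent_exp, ent_gam)   # the exponential's entropy is the larger one
```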
which is proportional to the kernel of a Be(Σi xi + 1, n − Σi xi + 1) distribution
in θ, which is closed under products. The natural conjugate model for a
Bernoulli sampling model is therefore a family of beta distributions, with
well-known versatility. In summary,
θ ∼ Be(a, b) ⇒ θ | x ∼ Be(A, B), A = a + Σi xi , B = b + n − Σi xi ,
which shows how sampling information enters the easily computed poste-
rior distribution, through the number of successes and failures (that is, the
minimum sufficient statistic), additively with respect to the prior hyperpa-
rameters (a, b).
Since a, b > 0, prior information vanishes (relative to the information
in the likelihood) when a, b → 0. Thus a non-informative (or vague) prior
obtained from the natural conjugate family is the improper Haldane prior,
“Be(0, 0),” defined as h(θ) ∝ θ−1 (1 − θ)−1 , θ ∈ (0, 1), which corresponds
to the “uniform” prior for ψ = ln [θ/(1 − θ)] ∈ R. Consequently, the prior
distribution Be(a, b) can be interpreted as a posterior distribution resulting
from updating a non-informative prior based on a hypothetical sample of
size a + b with a successes.
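The update rule above is one line of code; the sketch below is a direct transcription (the numbers in the usage line are made up):

```python
# Direct transcription of the Be(a, b) -> Be(A, B) update (the numbers in the
# usage line are made up):
def beta_update(a, b, x):
    """Posterior hyperparameters for theta ~ Be(a, b) and Bernoulli data x."""
    s = sum(x)                        # number of successes
    return a + s, b + len(x) - s      # (A, B)

A, B = beta_update(2, 2, [1, 0, 1, 1, 0, 1])   # 4 successes in 6 trials
print(A, B)   # -> 6 4
```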
In conjugate families, the process of Bayesian updating, that is, the com-
bination of prior and sample information by Bayes’ theorem, is carried out
entirely within them. This allows us to symbolically represent the updating
by a transformation in the space A of hyperparameters, e.g.,
a ∈ A →F A = a + (A − a) ∈ A,
as in Example 2.5. The transformation shows the relative weights of the two
types of information and highlights the interpretative and analytical sim-
plicity of the Bayesian mechanism in the context of a natural conjugate
family. In the above form, A − a represents the role of the sample infor-
mation in the updating of prior information, represented as a. This is illus-
trated in Example 2.5.
Problems
2.1 Consider a Poisson sampling model f (n | λ) = Poi(λ). Find Jeffreys’ prior
for λ.
2.2 In order to diagnose the cause of a certain symptom, a patient’s physician
analyzes the tetrahydrocortisone content (THE) in the urine of the patient
over 24 hours. The laboratory returns a result of THE = 13 mg/24 h. There
are two possible causes, adenoma (z = 1) and carcinoma (z = 2). It is known
that the distribution of y = ln(THE), i.e., the logarithm of the amount of
THE in the urine, is approximately normal. We assume
M1 : y ∼ N(µ1 , σ21 )
M2 : y ∼ N(µ2 , σ22 )
for adenoma (M1 ) and carcinoma (M2 ), respectively. A data bank of THE
measures is available with information about six patients with adenoma and
five patients with carcinoma as follows (in mg/24 h):
3 Bayesian inference satisfies the likelihood principle since it is entirely based on the
posterior distribution which in turn depends on the data only through the likelihood
f (x | θ).
3
Bayesian Inference in Basic Problems
Having understood the fundamental ideas of the Bayesian approach to sta-
tistical inference and the main methods to represent prior information, it is
now time to continue with some illustrations. The basic Bayesian paradigm
is best illustrated in problems in which inference can be represented as
exactly as possible, preferably analytically or at least by simulation from
well-known posterior distributions.
This is done in this chapter by discussing the analysis of eight Bayesian
models which, together with Examples 2.5 – 2.8 and some of the exercises
in this chapter, cover the majority of textbook problems in a basic statistics
course. Since a Bayesian model is defined by a joint distribution of a vector
(X) of observations and of parameters (θ), the statement of each model
includes a sampling model { f (x | θ)} (the model of classical statistics)
and a prior distribution h(θ), which we will indicate by notation like f (x |
θ) ∧ h(θ). See Appendix A for a statement of all the parametric families that
are used in the following discussion.
“successes” and the probability of each of them. Compared with the pre-
vious expression, we can rewrite it as p(y1 , . . . , yk | x) Π j y j ! k^y• /y• !, and
therefore
p(y1 , . . . , yk | x) = pMk−1 (y• , (1/k)1k ) (y1 , . . . , yk | y• ) × pBiN(A, B/(B+k)) (y• | x).
That is, similar to the binomial ∧ beta case, the posterior predictive proba-
bility function can be written as a product of a homogeneous multinomial
probability function and a Poi–Ga marginal probability function. This rep-
resentation highlights the nature of the dependence in the posterior predic-
tive distribution, which reduces to a negative binomial when k = 1.
Comparison of Means
Following an analogous argument as in the previous section, we easily find
that (µ1 , σ21 ) and (µ2 , σ22 ) are a posteriori again independent with univariate
marginal distributions:
µ j | x j ∼ t(k j ) ( x̄ j , s2j /n j ) ⇔ ν j = (µ j − x̄ j )/(s j /√n j ) | x j ∼ t(k j ) ,
σ2j | x j ∼ IGa(k j /2, k j s2j /2),
where k j = n j − 1 and k j s2j = Σ_{i=1}^{n j} (x ji − x̄ j )2 .
The standardized difference of λ = µ1 − µ2 , written as
τ = (λ − ( x̄1 − x̄2 )) / √(s21 /n1 + s22 /n2 ) ≡ ν1 sin u + ν2 cos u,
Comparing Variances
Assume now that the parameter of interest is ψ = σ21 /σ22 . Based on the
independent gamma posteriors for {1/σ2j } we find that ψ | x1 , x2 ≡d
(s21 /s22 ) F(k2 ,k1 ) , which allows easy implementation of inferences about ψ.
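A quick Monte Carlo check of this scaled-F result (a Python sketch; the sample summaries k1 = 9, s1² = 2.0, k2 = 14, s2² = 1.5 are invented for illustration):

```python
import random

# Monte Carlo check of the scaled-F result (the sample summaries are invented
# for illustration): k1 = 9, s1^2 = 2.0, k2 = 14, s2^2 = 1.5.
random.seed(4)
k1, s1sq = 9, 2.0
k2, s2sq = 14, 1.5

def psi_draw():
    # 1/sigma_j^2 | x_j ~ Ga(k_j/2, rate k_j s_j^2/2); gammavariate takes scale
    g1 = random.gammavariate(k1 / 2, 2 / (k1 * s1sq))
    g2 = random.gammavariate(k2 / 2, 2 / (k2 * s2sq))
    return g2 / g1                    # psi = sigma1^2 / sigma2^2

draws = [psi_draw() for _ in range(200000)]
mc_mean = sum(draws) / len(draws)
theory = (s1sq / s2sq) * k1 / (k1 - 2)   # E(psi | x) from the F(k2, k1) form
print(mc_mean, theory)
```

The simulated mean of ψ agrees with the mean implied by the (s1²/s2²) F(k2, k1) posterior to Monte Carlo accuracy.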
which is then the basic inference result for the comparison of two normal
populations under homoskedasticity.
This, together with the distributional result for (1/2) ln[θ j /(1 − θ j )] that was
stated at the beginning of this subsection, gives the approximate posterior
distribution
ln ∆ | t1 , t2 ∼approx N( ln [ ((A1 − 1/2)/(B1 − 1/2)) / ((A2 − 1/2)/(B2 − 1/2)) ], Σ j=1,2 (A−1j + B−1j ) ).
This can be used to construct one-sided or two-sided Bayesian tests for the
comparison of proportions.
probability function
p(x | a) = [N!/Π_{i=1}^{c+1} xi !] · [B(a1 + x1 , . . . , ac+1 + xc+1 )/B(a1 , . . . , ac+1 )], x ∈ X.
The first two moments can be expressed as (using, e.g., the properties of
conditional expectations):
E(X | a) = N a/a• ; V(X | a) = N [(a• + N)/(a• + 1)] (Da/a• − a a′ /a•2 ),
where a• = Σi ai and Da/a• denotes the diagonal matrix with entries ai /a• .
Using a segmentation of X as before, one easily finds that
M = (M1 , . . . , M s ) | a ∼ MD s (N; Σi∈Ck ai , k = 1, . . . , s + 1),
X (k) | M, a, k = 1, . . . , s + 1 ∼ind MDdk −1 (Mk ; ai , i ∈ Ck ).
Application to Inference
Assume x = (x1 , . . . , xc ) is a realization of a random vector X | θ ∼
Mc (N, θ). Since f (x | θ) is proportional to the kernel of a Dc (xi + 1, i =
1, . . . , c + 1) distribution that is closed under products, we conclude that
the Dirichlet family is the natural conjugate for a multinomial sampling
model. Thus, if θ | a ∼ Dc (a), then θ | a, x ∼ Dc (A), A = (Ai = ai + xi , i =
1, . . . , c + 1).
Bayesian estimates of (θi , i = 1, . . . , c + 1) can be derived in particular
from the components of the posterior mode (A − 1c+1 )/(A• − c − 1) (if Ai > 1, ∀i),
where 1c+1 is a (c + 1)-vector of all 1s, or the posterior mean A/A• . Note how
this is a weighted mean of the prior mean, a/a• , and the vector of sample
proportions p = (xi /N, i = 1, . . . , c + 1).
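The weighted-mean structure of the posterior mean is transparent in code (the counts and hyperparameters below are made up for illustration):

```python
# Numerical illustration (counts and hyperparameters are made up) of the
# posterior mean as a weighted average of prior mean and sample proportions:
a = [2.0, 3.0, 5.0]              # prior hyperparameters, a_dot = 10
x = [18, 12, 10]                 # multinomial counts, N = 40
a_dot, N = sum(a), sum(x)
post_mean = [(ai + xi) / (a_dot + N) for ai, xi in zip(a, x)]
blend = [(a_dot / (a_dot + N)) * (ai / a_dot) + (N / (a_dot + N)) * (xi / N)
         for ai, xi in zip(a, x)]
print(post_mean)   # matches `blend` component by component
```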
In the analysis of contingency tables, inference of interest is often related
to independence structures (or other log-linear models) in which parametric
functions Σi bi ln θi with Σi bi = 0 often play a critical role (see, e.g.,
Paulino and Singer, 2006). When the components of A are large, one can
invoke approximate normality of the posterior distribution and
use the resulting χ2 distribution for appropriate quadratic forms, allowing
tests for such structures. For more detail see, e.g., Paulino et al (2018, ch.
6). In the special case of a 2 × 2 table the use of such an approach to the
test of independence, which reduces simply to a test of homogeneity of
two binomials, leads to the approach mentioned at the end of the previous
section.
Assume now the aim is to predict a vector Y with Y | m, θ ∼ Mc (m, θ).
The corresponding posterior predictive distribution is a multinomial-Dirichlet,
Y | m, x ∼ MDc (m, A), whose summary by the first two moments can be
obtained by the formulas in Appendix A.
Problems
3.1 Exponential ∧ gamma model. Let xi ∼ Exp(θ), i = 1, . . . , n, i.i.d.
a. Assuming θ is distributed as Ga(α, β), with α and β fixed, find the poste-
rior distribution h(θ | x1 , . . . , xn ).
b. Under the same assumptions, compute the predictive distribution p(xn+1 |
x1 , . . . , xn ).
c. Consider now a slightly different situation. Imagine that the exponential
random variables represent waiting times for the recurrence of disease in
n patients surgically treated at time 0. We are at time a, and all patients are
still alive. Some of the n patients may have experienced recurrence. We
record their recurrence time xi (assume patients are not followed up after
the recurrence, i.e., we have no further data). Other patients may still be
healthy, so we don’t know exactly what the recurrence time is. We only
know xi > a. Repeat steps (a) and (b) for this case.
3.2 Binomial (unknown θ, n) ∧ beta/Poisson model. Consider a binomial sam-
pling model x | θ, n ∼ Bi(n, θ).
a. For fixed n and θ ∼ Be(a0 , b0 ), find the marginal distribution p(x). This
distribution is known as beta-binomial (cf. Section 3.1).
b. Assume now that also n is unknown, with h(n) ∝ 1/n². Plot the joint posterior distribution h(n, θ | x) for x = 50 and (a0 , b0 ) = (1, 4).
Hint: Use a grid on 0.01 ≤ θ ≤ 0.99 and x ≤ n ≤ 500. Use the R function
lgamma(n+1) to evaluate ln(n!).
3.3 Hierarchical Poisson ∧ gamma model. An animal experiment records tu-
mor counts for mice of two different strains, A and B. Tumor counts are
Jeffreys prior. It is quite flat over most of its support, with the exception of small positive values. Find the Bayes factor B(x) = P(H0 | x)/P(H1 | x).
d. Finally, we consider prediction for Y, with Y | δ ∼ Ray(δ) independently
of (X1 , . . . , Xn ). Find the posterior predictive density function p(y | x), the
point estimate E(Y | x) and P(Y > 1 | x).
Hint: Use properties of the √ conditional expectation and integration by
parts to prove E(Y | δ) = π/2 δ−1/2 ).
3.5 Uniform ∧ Pareto model. Let x = (xi , i = 1, . . . , n) denote a random sample from a uniform U(0, θ) model, with density
f (x1 , . . . , xn | θ) = θ^{−n} I_{[t,+∞)} (θ), t = x(n) ≡ max_{1≤i≤n} xi .
The vast majority of statistical inference problems involve complex mod-
els that often leave it unrealistic to consider analytical (or even numerical)
solutions for any inference summaries of interest, which in Bayesian infer-
ence often take the form of integrals. In this context, classical Monte Carlo
methods arise as an attractive alternative. Monte Carlo methods evaluate
the relevant inference based on calculations involving simulated random
samples from probability distributions, which in turn can be generated from
(pseudo-)random variate generators [realizations of a uniform U(0, 1)]. Re-
lated methods of stochastic simulation have been the subject of an extensive literature and can now be implemented with several available statistical and mathematical software packages.
This chapter describes the general idea of traditional Monte Carlo methods in their basic version and with importance sampling, as well as some of the specialized versions and variations pertaining to Bayesian inference.
some integrable function g(θ). This is the case for posterior moments of
the elements of θ, posterior probabilities of subsets of the parameter space,
and posterior predictive densities, where g(θ) is, respectively, θi (for the
mean of the i-th coordinate of θ), IA (θ) for A ⊂ Θ and f (y | θ) for fixed y.
Other quantities of interest can also be expressed using appropriate inte-
grals, including the normalizing constant of a posterior distribution, marginal
posterior densities, Bayes factors, and posterior model probabilities.
If one can simulate a random sample θ1 , . . . , θn from the posterior dis-
tribution h(θ | x), then the simplest Monte Carlo method approximates the
integral (4.1) by the sample mean
Ê[g(θ) | x] = (1/n) Σ_{i=1}^{n} g(θi ),   (4.2)
which by the strong law of large numbers converges almost surely to E[g(θ) | x]. The precision of the estimator can be evaluated by the (estimated) standard error of the Monte Carlo average, given by
{ (1/(n(n − 1))) Σ_{i=1}^{n} [ g(θi ) − (1/n) Σ_{i=1}^{n} g(θi ) ]² }^{1/2},   (4.3)
when E{[g(θ)]² | x} < ∞.
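As a minimal illustration of (4.2) and (4.3), the following Python sketch estimates a posterior mean; the Ga(2, 1) "posterior" below is an illustrative stand-in for draws from h(θ | x):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
theta = rng.gamma(shape=2.0, scale=1.0, size=n)   # stand-in for draws from h(theta | x)

g = theta                        # g(theta) = theta, i.e., the posterior mean
est = g.mean()                   # Monte Carlo estimate (4.2); true value is 2.0 here
se = g.std(ddof=1) / np.sqrt(n)  # standard error (4.3), the sample sd over sqrt(n)
print(f"estimate {est:.3f}, MC standard error {se:.4f}")
```

Note that (4.3) is exactly the sample standard deviation of the g(θi ) divided by √n.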
The integral of interest (4.1) can be represented by infinitely many vari-
ations involving consistent changes of the triple of parameter space, inte-
grand, and target distribution, (Θ, g, h) (Ripley, 1987). The Monte Carlo
estimators related to each of these representations come with different pre-
cisions, and corresponding implications for the computational effort (eas-
ier or more difficult random variate generation and larger or smaller Monte
Carlo sample size) required for getting reliable estimates. This suggests op-
portunities for more efficient tools to obtain highly precise estimates with
a relatively low Monte Carlo sample size.3
In summary, if one can generate from the posterior distribution h(θ | x),
then the evaluation of integrals of the type (4.1) is straightforward. Note
also that a simulated Monte Carlo sample considerably simplifies many
otherwise analytically difficult inference summaries. This is the case, for
example, for reparametrizations and marginalization, which are simply car-
ried out by the corresponding transformations of the simulated sample and
the selection of the components of interest, respectively. The following
subsections discuss in more detail the evaluation of posterior probabilities,
3 For details on variance reduction techniques for Monte Carlo estimates see, for example, Rubinstein (1981) and Robert and Casella (2004).
Table 4.1 CRM: dose levels, prior probabilities of toxicity π0k , and standardized doses δk .

                          Dose level
  k                      1      2      3      4      5      6
  Dose (mg/m²)           10     20     40     60     75     90
  Prob(toxicity) π0k     0.05   0.10   0.20   0.30   0.50   0.70
  Standardized dose δk   −1.47  −1.10  −0.69  −0.42  0      0.42
[Figure 4.1: Left: posterior probabilities of toxicity, p(Y = 1), by dose level. Right: posterior densities h(πk | y) for doses k = 1, . . . , 6.]
threshold such that the posterior probability of R0 (γ) is at least γ. The def-
inition makes this interval more difficult to evaluate than intervals with
fixed, predetermined tail areas, even if a closed-form expression for the
posterior density of θ is available. Chen and Shao (1999) propose a Monte
Carlo approach to approximate R0 (γ) which is extremely simple to imple-
ment. Based on an ordered Monte Carlo sample (θ(i) , 1 ≤ i ≤ n), credible
intervals at level γ can be determined by
R̂i (γ) = (θ(i) , θ(i+[nγ]) ),   i = 1, . . . , n − [nγ],
where [nγ] denotes the integer part of nγ. Taking into account the minimum length property of HPD intervals, the Monte Carlo approximation of R0 (γ) proposed by Chen and Shao is defined as R̂0 (γ) = R̂i0 (γ), with i0 determined by i0 = arg min_i { θ(i+[nγ]) − θ(i) , 1 ≤ i ≤ n − [nγ] }.4
Note that the method is straightforward to modify for HPD intervals for
parametric functions ψ(θ). It suffices to apply it to the transformed Monte
Carlo sample (ψ(θi ), 1 ≤ i ≤ n). However, recall that the HPD nature is not
invariant under non-linear transformations.
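A sketch of this approximation in Python, with an illustrative Gamma "posterior" sample, makes the index search explicit:

```python
import numpy as np

def hpd_chen_shao(draws, gamma=0.95):
    """Chen and Shao's Monte Carlo approximation of the HPD interval:
    the shortest interval (theta_(i), theta_(i+[n*gamma])) over the order statistics."""
    srt = np.sort(draws)
    n = len(srt)
    k = int(gamma * n)                   # [n * gamma]
    widths = srt[k:] - srt[: n - k]      # theta_(i+k) - theta_(i), i = 1, ..., n - k
    i0 = int(np.argmin(widths))
    return srt[i0], srt[i0 + k]

rng = np.random.default_rng(1)
draws = rng.gamma(shape=3.0, scale=1.0, size=50_000)   # illustrative "posterior" sample
lo, hi = hpd_chen_shao(draws, 0.95)
```

For a parametric function ψ(θ), the same routine is applied to the transformed sample ψ(θi ).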
which constitutes the support of h(θ1 , θ2 | x) for fixed θ1 , and let Θ1 (θ2 ) =
{θ1 : (θ1 , θ2 ) ∈ Θ} denote the support of the conditional density h(θ1 | θ2 , x).
For a fixed value θ1∗ of θ1 , let (assuming Fubini's theorem applies)
h(θ1∗ | x) = ∫_{Θ−1(θ1∗)} h(θ1∗ | θ2 , x) h(θ2 | x) dθ2
           = ∫_{Θ−1(θ1∗)} h(θ1∗ | θ2 , x) { ∫_{Θ1(θ2)} h(θ1 , θ2 | x) dθ1 } dθ2   (4.7)
           = ∫_Θ h(θ1∗ | θ2 , x) h(θ | x) dθ,
which shows that the ordinates of the marginal posterior density for θ1 can
be interpreted as posterior expected values (with respect to θ, including in
particular θ2 ) of the corresponding ordinates of the conditional posterior
density for θ1 .
Generalizing this argument for θ = (θ^(m), θ^(−m)), with θ^(−m) = (θm+1 , . . . , θk ), we get
h(θ∗^(m) | x) = ∫_Θ h(θ∗^(m) | θ^(−m), x) h(θ | x) dθ.   (4.8)
The expression implies that the marginal posterior density for θ^(m) = (θ1 , . . . , θm ) can be approximated with a Monte Carlo estimate based on a random sample of h(θ | x), θ(i) = (θ(i)^(m), θ(i)^(−m)) with θ(i)^(m) = (θ(i)1 , . . . , θ(i)m ) and θ(i)^(−m) = (θ(i)m+1 , . . . , θ(i)k ), i = 1, . . . , n, as
ĥ(θ∗^(m) | x) = (1/n) Σ_{i=1}^{n} h(θ∗^(m) | θ(i)^(−m), x).   (4.9)
This estimate was proposed by Gelfand and Smith (1990; see also Gelfand et al., 1992). It does not make use of the part θ(i)^(m), i = 1, . . . , n, of the simulated values on which the kernel density estimate is based, but instead requires complete knowledge of the conditional posterior density of θ^(m) given θ^(−m).
Assuming this condition is met, the estimate (4.9) turns out to be more
efficient than the estimate obtained by the kernel method, as was shown by
Gelfand and Smith (1990). This is because it exploits the knowledge of the
model structure as it is embodied in the required conditional distribution.
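The gain from exploiting a known conditional can be seen in a toy example where the joint posterior is a standard bivariate normal with correlation ρ, so that h(θ1 | θ2 , x) is available in closed form. Everything below is an illustrative Python sketch, not an example from the book:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
rho, n = 0.7, 20_000

# Draws from a toy joint "posterior" h(theta1, theta2 | x): standard bivariate normal
draws = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
theta2 = draws[:, 1]

# Known conditional h(theta1* | theta2, x) = N(rho * theta2, 1 - rho^2)
def cond_dens(t1_star, t2):
    return norm.pdf(t1_star, loc=rho * t2, scale=np.sqrt(1 - rho ** 2))

# Estimate (4.9): average the conditional ordinates over the theta2 draws
grid = np.linspace(-3.0, 3.0, 7)
est = np.array([cond_dens(t, theta2).mean() for t in grid])
true = norm.pdf(grid)                      # exact marginal here is N(0, 1)
```

The estimate tracks the exact marginal closely, without any kernel bandwidth choice.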
Something similar happens with the following estimate of the posterior mean of θ^(m) which, accounting for (4.9), is given by
θ̂^(m) = (1/n) Σ_{i=1}^{n} E(θ^(m) | θ(i)^(−m), x).   (4.10)
This estimator, using the availability of the indicated conditional expec-
tation, is more precise than the classical Monte Carlo estimator obtained
The form of this estimator shows that importance sampling could be seen
as weighted sampling, with the weights wi for g(θi ) known as importance
sampling weights. Under appropriate assumptions, in particular that the support of p(θ) includes the support of h(θ | x) and that the integral ∫ g(θ)h(θ | x) dθ exists and is finite, Geweke (1989) shows that, with an i.i.d. importance sample θi from p(θ),
(1/Σ_{i=1}^{n} wi ) Σ_{i=1}^{n} wi g(θi ) → ∫ g(θ)h(θ | x) dθ   a.s.
The result assumes a finite variance of the Monte Carlo estimator, that is, finiteness of the posterior expected value of the product of [g(θ)]² and the importance weight h(θ | x)/p(θ) (identical to the expected value with respect to p of the square of g(θ)h(θ | x)/p(θ)).
The rate of convergence for the importance sampling estimator depends
on the ratio between the importance distribution and the target distribution
h(θ | x). Note that the estimator gives more weight to θ values with p(θ) <
h(θ | x) and less when the opposite inequality holds. If the importance
ratio is unbounded, as happens if the tails of p are lighter than the tails of
h(· | x), then the weights can vary substantially and give great importance to
few simulated values in a range (tails) with low probability under the target
distribution. Depending on the function g, the variance of the estimator
could even be infinite, implying far worse performance than an estimator
based directly on the target distribution, if that were possible.
In summary, whether Monte Carlo with importance sampling is a promis-
ing technique for variance reduction depends on the selected importance
sampling density (for more details, see Robert and Casella, 2004). Desir-
able properties for a good importance sampling density are: (1) easy ran-
dom variate generation; (2) heavier tails than the target distribution h(· | x);
and (3) good approximation of h(· | x). Shaw (1988) developed a class of
univariate distributions that are designed to be suitable as importance func-
tions (see also Smith, 1991). In the case of multivariate parameters, often
multivariate normals or Student t distributions are used.6
6 The program BAYESPACK provides an R interface for Fortran subroutines that implement numerical integration by importance sampling, based on methods described
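The weight behavior discussed above can be illustrated with a small Python sketch: a heavy-tailed Student t target, known only up to a constant, paired with an even heavier-tailed importance density. The effective sample size 1/Σ pi² used below is a common diagnostic for weight degeneracy, introduced here for illustration rather than taken from the text:

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(3)
# Toy "posterior": Student t with 3 d.f. (heavy tails), known only up to a constant
log_h = lambda x: student_t.logpdf(x, df=3)

M = 50_000
# Heavier-tailed importance density: Student t with 2 d.f. (the recommended case)
theta = student_t.rvs(df=2, size=M, random_state=rng)
w = np.exp(log_h(theta) - student_t.logpdf(theta, df=2))
w /= w.sum()                           # normalized weights p_i

post_mean = np.sum(w * theta)          # self-normalized IS estimate of E(theta | x)
ess = 1.0 / np.sum(w ** 2)             # effective sample size: small when few weights dominate
```

Rerunning the sketch with a lighter-tailed importance density (e.g., a normal) shows the weights concentrating on a few tail draws and the effective sample size collapsing.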
Example 4.2 Recall the setup from Example 4.1. The single-parameter
model (4.4) was deliberately chosen to allow meaningful inference also
with small sample sizes in a phase I trial. Still keeping a parsimonious
model, but allowing for some more flexibility Neuenschwander et al (2008)
propose a design based on a two-parameter probit regression,
f (yi = 1 | di = δk ) = πk with πk = 1 − Φ [−a − b ln(δk /δo )] . (4.15)
Here, a and b are probit regression coefficients and δo is a reference dose.
The model allows easy interpretation of the parameters, with a determined
by the prior odds at the reference dose δo , and b determines the shift in
log odds for doses away from δo . The model is completed with a bivariate
normal prior (a, b) ∼ N(µ, Σ). We set the prior mean to µ = (1.65, 1.327)
to match a prior expectation E(π1 ) = 0.01 and E(πK ) = 0.95, respectively,
and Σ = I to have wide prior support over a range of plausible dose–
response curves.
Noting that it is impossible to define a single precise target toxicity π⋆ , Neuenschwander et al (2008) extend the notion of a single target toxicity level to an ordinal scale over four sub-intervals of toxicity probabilities πk : under-dosing, Iu = (0, 0.20]; targeted toxicity, It = (0.20, 0.35]; excessive toxicity, Ie = (0.35, 0.60]; and unacceptable toxicity, Ia = (0.60, 1.00]. Based on the posterior probabilities of πk , k = 1, . . . , K, being in these four intervals, they propose a pragmatic design that prescribes de-escalation, escalation, and continued enrollment at the current dose, depending on these four probabilities.
Consider a study with a dose grid δ = (12.5, 25, 50, 100, 150, 200, 250)
with reference dose δ0 = 250. Assume that the first n = 7 patients were as-
signed doses ki = 1, 1, 2, 2, 3, 2, 3, with recorded outcomes yi = 0, 0, 0, 0, 1, 0, 1.
We use importance sampling to find P(πk ∈ It | Dn ), and similarly for Ie and Ia , for k = 1, . . . , K. Let θ = (a, b). We use a bivariate normal importance function p(θ) = N(m, V), with m being the posterior mode (MAP) and V being the negative inverse of the Hessian of the log posterior at the mode,7 i.e., V = −H −1 . Figure 4.2 summarizes posterior inference. The trial design
would then determine the dose with maximum p(πk ∈ It | Dn ), restricting
the search to all doses with p(πk ∈ Ie ∪ Ia | Dn ) < 0.25.
[Figure 4.2: Left: histogram of importance sampling weights w. Center: posterior mean toxicities E(πk | y) by dose. Right: posterior probabilities p(πk ∈ Ix | y) by dose, for under-dosing (U), targeted (T), excessive (E), and unacceptable (A) toxicity.]
is an (unbiased and consistent) Monte Carlo estimate of B(x). Note that this approximation is most efficient when the tails of h1 (θ) are heavier than the tails of h0 (θ), and less efficient when there is little overlap of h0 (θ) and h1 (θ) (that is, when Eh1 [h0 (θ)] is small).
In many problems, the Bayes factor involves densities of different di-
mension and therefore requires different strategies. We will discuss some
in Chapter 7. For a more general discussion of this problem, see the books
by Chen et al (2000) and Paulino et al (2018).
where wi = h(θ(i)^(−m) | x)/p(θ(i)^(−m)), i = 1, . . . , n. Alternatively, with an argument as in (4.8), one could use wi = h(θ(i) | x)/p(θ(i) ), i = 1, . . . , n, with an importance sampling density p(·) for h(θ | x).
As yet another alternative, one could generate θ(i)^(m), i = 1, . . . , n, by the composition method, using p(θ^(−m)) and h(θ^(m) | θ^(−m), x). However, since we used p(·) instead of the marginal posterior for θ^(−m), the samples θ(i)^(m) are not a random sample from h(θ^(m) | x). Instead, the θ(i)^(m) are a weighted sample from an approximation of h(θ^(m) | x), with weights pi = wi / Σ_{j=1}^{n} wj , i = 1, . . . , n. To get from this a random sample of size L, say, one could simulate, in the spirit of a bootstrap method, L values θ(j)^(m)∗ , j = 1, . . . , L, from the discrete distribution (θ(i)^(m), pi ), i = 1, . . . , n. Smith and Gelfand (1992) therefore used the name weighted bootstrap.
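A minimal Python sketch of the weighted bootstrap, with an illustrative normal target and importance density (not taken from the text):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
M, L = 20_000, 5_000

# Illustrative target h(theta | x) = N(1, 0.5^2) and importance density p = N(0, 2^2)
theta = rng.normal(0.0, 2.0, size=M)
w = norm.pdf(theta, 1.0, 0.5) / norm.pdf(theta, 0.0, 2.0)
p = w / w.sum()                                   # normalized weights p_i

# Weighted bootstrap: resample L values from the discrete distribution (theta_i, p_i)
resampled = rng.choice(theta, size=L, replace=True, p=p)
```

The resampled values behave approximately like a sample from the target, so their mean and standard deviation should be close to 1 and 0.5.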
This method to evaluate marginal posterior densities, similar to Gelfand and Smith (1990), does not apply when the normalization constant for the conditional posterior density h(θ^(m) | θ^(−m), x) is unknown. In this case, Chen (1994) proposes a weighted estimator for h(θ∗^(m) | x), based on the following identity:
h(θ∗^(m) | x) = ∫_{Θ−m(θ∗^(m))} h(θ∗^(m), θ^(−m) | x) dθ^(−m)
             = ∫_Θ w(θ^(m) | θ^(−m)) [ h(θ∗^(m), θ^(−m) | x) / h(θ^(m), θ^(−m) | x) ] h(θ | x) dθ,   (4.21)
where Θ−m (θ^(m)) = {θ^(−m) : (θ^(m), θ^(−m)) ∈ Θ} is the subspace of Θ with fixed θ^(m), and w(θ^(m) | θ^(−m)) is a completely known conditional density whose support, Θm (θ^(−m)) = {θ^(m) : (θ^(m), θ^(−m)) ∈ Θ}, is the same as or greater than that of h(θ^(m) | θ^(−m), x). The criteria for selecting the function w are the same as for selecting an importance sampling density.
The form of (4.21) highlights that the normalization constants of the joint posterior distribution (and, a fortiori, of the conditional) cancel out, and that the marginal density in question can be unbiasedly estimated, based on a sample θ(i) = (θ(i)^(m), θ(i)^(−m)), 1 ≤ i ≤ n, of h(θ | x), by
ĥ(θ∗^(m) | x) = (1/n) Σ_{i=1}^{n} w(θ(i)^(m) | θ(i)^(−m)) h(θ∗^(m), θ(i)^(−m) | x) / h(θ(i)^(m), θ(i)^(−m) | x).   (4.22)
likelihood factor for yt . This is essentially the idea of the auxiliary variable
particle filter, except that by combining steps (1) and (3), the algorithm
makes the sampling more efficient.
Similar to (4.30), we define
ht ≈ ĥt ∝ Σi w_{t−1,i} f (yt | θt ) p(θt | θ_{t−1,i} )   (4.31)
without the likelihood approximation in g(θt , i). Instead, the normal assumption allows us to use Bayes' theorem and replace the last two factors to obtain
ĥ(θt , i) ∝ w_{t−1,i} N(θt | mi , vi ) λi ,   λi = N(yt | µ(θ_{t−1,i} ), σ²(θ_{t−1,i} ) + 1),   (4.35)
completed with a prior on the static parameters and the initial state (φ, θ0 ) ∼
h(φ, θ0 ). For example, in the ARCH model, φ = (β0 , β1 , σ2 ).
There are several approaches to generalizing particle filter methods to
include parameter learning. Liu and West (2001) reduce static parameter
learning to the earlier problem by allowing a negligibly small but positive
evolution noise for φ, thus including it as φt in the state vector. Polson et al (2008) use an (approximate) sufficient statistic to define the “practical filter.” For a review and more detailed discussion see, for example, Prado
and West (2010).
Problems
4.1 Simple Monte Carlo integration. Implement inference for Example 4.1.
a. Verify the posterior means π̄k shown in Figure 4.1, and add central 50% credible intervals for πk , k = 1, . . . , K.
b. Assume the next patient is treated at dose k5 = 3, and we record no toxicity, y5 = 0. Find the updated posterior means π̄k = E(πk | D5 ) and determine the treatment allocation k⋆ for the next patient.
4.2 Importance sampling: probit regression. Refer to Example 4.2.
a. Based on Figure 4.2 and the design rule as it is summarized in the exam-
ple, which dose k⋆ is assigned to the next, (n + 1)-st patient?
b. Let θ = (a, b). Evaluate the marginal posterior distributions h(πk | Dn )
using importance sampling, using a bivariate normal importance function
p(θ) = N(m, V), as in the example. Plot h(πk | Dn ), k = 1, . . . , K. In the
plot indicate the four intervals for under-dosing, target dose, excessive
dose, and unacceptable toxicity by vertical dotted lines on the interval
boundaries.
c. Let {θ(m) ; m = 1, . . . , M} denote the importance sample from p(θ). Figure
4.2(a) shows the importance weights w(m) ≡ h(θ(m) | y)/p(θ(m) ) for an
importance Monte Carlo sample {θ(m) ; m = 1, . . . , M} of size M = 100.
Redo the importance sampling estimate, now with M = 1000, and plot
the histogram of weights. What do you observe?
d. Now consider an alternative bivariate Student t importance function, p2 (θ) =
t2 (m, V; ν) with ν = 4. Plot a histogram of the importance sampling
weights w2(m) = h(θ(m) | y)/p2 (θ(m) ) and compare.
4.3 Importance sampling: nonlinear regression. Let x denote the number of
plants per unit area and y denote the yield per plant. To model the relation-
ship between yield and planting density, we use the sampling model
f (yi | θ) = N(µi , σ²) with µi = 1/(α + βxi + γxi²).
Let τ = ln(σ2 ) and θ = (α, β, γ, τ). In this model, 1/α = lim x→0 y is inter-
preted as “genetic potential” and 1/β can be interpreted as “environmental
potential.”
The model is completed with a non-informative prior h(θ) ∝ 1. Use the data set onion in the R package SemiPar. The data set includes the variables dens, yield, and location. We use data from location 0 (Purnong Landing) only. Scale the response as y = yield/100, and standardize x as x = (d − d̄)/sd , where d = dens and d̄, s²d are the sample mean and variance.
a. Set up a multivariate normal importance sampling density p(θ). Use a
N(m, V) with m being the posterior mode (MAP) and V = −H −1 being
the negative inverse Hessian of the log posterior at the mode.9 Carry out
importance sampling to estimate the posterior means and standard de-
viations of θ. Report numerical uncertainties and show a histogram of
importance sampling weights.
b. Repeat the same importance sampling estimation with a multivariate Stu-
dent t distribution with ν = 4 d.f. (see Appendix A). See also the hint for
Problem 4.4(b).
c. Why would one recommend the Student t importance function? Explain.
4.4 Importance sampling: hierarchical model. Carlin and Gelfand (1991) con-
sider a hierarchical event rate model for the number of pump failures (yi )
over given exposure times (ti ), i = 1, . . . , n = 10. The data are shown in the
following table.
i    1      2      3      4       5
yi   5      1      5      14      5
ti   94.32  15.72  62.88  125.76  5.24

i    6      7      8      9      10
yi   19     1      1      4      22
ti   31.44  1.048  1.048  2.096  10.48
We assume a Poisson regression,
yi ∼ Poi(λi ti ),
with a hierarchical prior for Li = ln(λi ):
Li ∼ N(η, σ2 ), and η ∼ N(µ, τ2 ), γ ≡ 1/σ2 ∼ Ga(a, b).
We fix the hyperparameters at (µ, τ2 , a, b) = (−1, 1, 1, 1).
Let g = ln(γ) and let θ = (η, g, L1 , . . . , Ln ) denote the complete parameter vector. Let θ̂ and S denote mean and covariance matrix
of a multivariate normal approximation of the joint posterior distribution
h(θ | y) (for example, posterior mode and negative inverse Hessian of the
log posterior at the mode, as in Problem 4.3). Under the multivariate N(θ̂, S )
9 We will discuss the use of such multivariate normal approximations in Chapter 8.
distribution for θ let m = (θ̂1 , θ̂2 ) and V denote the corresponding marginal
moments for ψ = (η, g).
Let p1 (ψ) = t2 (m, V; ν) (use ν = 2), and p2 (Li | η, γ) ∝ N(Li | η, σ2 )Poi(yi |
ti eLi ). Here, Poi(y | µ) denotes the Poisson probability mass function with
rate µ evaluated for y and similar for N(x | µ, σ2 ). Use an importance sam-
pling density
p(θ) = p1 (η, g) ∏_{i=1}^{n} p2 (Li | η, g).   (4.38)
We can generate from p(·) by first generating ψ = (η, g) from the bivariate
t distribution p1 (ψ), and then Li from p2 using, for example, a grid-based
method using evaluation of p2 on a grid (using the R function sample();
see Appendix B).
a. Using the importance sampling function (4.38), find the expression for
the importance sampling weights wi .
Hint: Do not forget the normalization constant in p2 (Li | η, g).
b. Let y = (y1 , . . . , yn ) denote the data. Implement importance sampling
to evaluate posterior moments, λ̄i = E(λi | y) and s2i = Var(λi | y),
i = 1, . . . , n. Show a histogram of the importance sampling weights, a
table of estimated posterior moments λ̄i (si ), and corresponding numerical
uncertainties.
Hint: To generate from the bivariate t distribution tν (m, V) with V = LL′, you could use: z ∼ N2 (0, I), y = Lz, u ∼ χ²(ν), and ψ = m + √(ν/u) y. Then,
p(ψ) ∝ [1 + (1/ν)(ψ − m)′ V⁻¹ (ψ − m)]^{−(ν+2)/2} = [1 + z′z/u]^{−(ν+2)/2}.
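The recipe in the hint can be sketched as follows; the location m and scale matrix V below are placeholders, not values from the problem:

```python
import numpy as np

def rmvt(m, V, nu, size, rng):
    """Draws from t_nu(m, V) via the scale-mixture recipe in the hint:
    z ~ N(0, I), y = L z with V = L L', u ~ chi2(nu), psi = m + sqrt(nu/u) y."""
    L = np.linalg.cholesky(V)
    z = rng.standard_normal((size, len(m)))
    u = rng.chisquare(nu, size=size)
    return m + np.sqrt(nu / u)[:, None] * (z @ L.T)

rng = np.random.default_rng(5)
m = np.array([0.0, 1.0])                 # placeholder location
V = np.array([[1.0, 0.3], [0.3, 2.0]])   # placeholder scale matrix
psi = rmvt(m, V, nu=4, size=100_000, rng=rng)
# For nu > 2 the covariance of t_nu(m, V) is V * nu / (nu - 2)
```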
c. Use suitable Monte Carlo methods to evaluate and plot the marginal pos-
terior distributions h(λi | y), i = 1, . . . , n.
d. The problem greatly simplifies with λi ∼ Ga(c, ηc ). Why?
4.5 The skewed normal distribution defines a skewed continuous random vari-
able. Let ϕm,s denote a normal p.d.f. with moments (m, s) and let Φ(z) denote
a standard normal c.d.f. The skewed normal can be defined by its p.d.f.,
f (x | µ, σ, α) ∝ ϕµ,σ (x) Φ(α (x − µ)/σ).
For α > 0 (α < 0), the distribution is right (left) skewed. We write X ∼
SN(µ, σ, α) for a skewed normal random variable X. For the following ques-
tions use the data set stockreturns.txt10 of (simulated) stock returns.
We assume the sampling model
xi ∼ SN(µ, σ, α),
10 Available on the book’s homepage at sites.google.com/view/computationalbayes/home
for the quadratic approximation ℓ(θt ) ≈ −1/(2σ2t )(θt −µt )2 , using a Taylor
series expansion of ℓ(θt ).
b. Use this quadratic approximation to derive a variation of (4.35).
c. Using that variation of (4.35), work out appropriate variations (1′′ ), (2′′ ),
and (3′′ ) of the adapted particle filter (the weights in step (3′′ ) are no
longer constant).
4.10 Consider a stochastic volatility model,
yt = εt β exp(θt /2) and θt+1 = φθt + ηt ,
with εt ∼ N(0, 1) and ηt ∼ N(0, σ²η ). For this problem we fix φ = 0.9702, ση = 0.178, and β = 0.5992 (Pitt and Shephard, 1999: section 4.2).
a. Simulate a data set y1 , . . . , yT , using θ0 = 0 and T = 200. State the random
variate seed that you use. For example, if you use R, use set.seed(1963).
Plot yt versus t as a time series.
b. Let ℓ(θt ) = ln f (yt | θt ) denote the log-likelihood factor for θt . Find (µt , σt )
for the quadratic approximation ℓ(θt ) ≈ −1/(2σ2t )(θt − µt )2 (see the previ-
ous problem).
c. Implement the adapted auxiliary particle filter from the previous problem.
As in Section 4.3, let Dt = (y1 , . . . , yt ) denote the data up to time t. Plot the posterior means θ̄t = E(θt | Dt ) against time, and add posterior medians and 25% and 75% quantiles.
5
Model Assessment
This chapter deals with approaches to model assessment, aiming at selection and comparison that lead to the choice of the best model(s).
We start in Section 5.1 with a description of model diagnostics such as
Bayesian p-values associated with various discrepancy measures, resid-
uals, conditional predictive ordinates, and corresponding pseudo-Bayes
factors. In Section 5.2 we then consider several measures of predictive
performance, known by their acronyms as AIC, SIC (or BIC), DIC, and
WAIC, and ways of using Bayes factors for selection and comparison tasks.
Finally, in Section 5.3, using Monte Carlo samples to represent posterior
distributions in complex models, we review ways of estimating relevant
quantities for model assessment in this context.
Bayesian P-values
Box (1980, 1983) proposed to base a critical analysis of a model on the
comparison of the data x and the marginal (or prior predictive) p(x) =
1 This material will be published by Cambridge University Press as Computational Bayesian Statistics, by M.A. Amaral Turkman, C.D. Paulino, and P. Müller (https://tinyurl.com/CompBayes). This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © Maria Antónia Amaral Turkman, Carlos Daniel Paulino & Peter Müller, 2019.
Eh [ f (x | θ)], by way of marginal p-values P[p(X) ≥ p(x)]. This approach
does not work with improper priors. In particular, it prevents the use of
hypothetical replicate data based on simulation from h(θ). Even when h(θ)
is proper, Monte Carlo methods tend to be in general unstable (Gelfand,
1996), and therefore not reliable, for the evaluation of p(x).
One of the most widely used strategies for a critical model examination
is based on the posterior predictive distribution under the model,
p(y | x) = ∫_Θ f (y | θ) h(θ | x) dθ.   (5.1)
Here, y are hypothetical future data (see the following for details) and x are the observed data. Under a good fit of the model, the data y should resemble the observed data; under a bad fit, they should not. Any systematic discrepancies between the two data sets, y and x, are evidence for some flaw in the model.
The discrepancy between the observed data and data that are observable
under the model can be evaluated by summary variables V(x, θ), which
should be chosen to relate to the specific model aspects that the investigator
wishes to evaluate. Examples of such variables include standardized mean
squared deviations, defined as
V(x, θ) = Σi [xi − E(Xi | θ)]² / Var(Xi | θ),   (5.2)
and the actual log-likelihood, ln f (x | θ), as well as the special cases of
statistics V(x) and functions g(θ) of the parameters only.
The posterior predictive distribution is typically determined by simula-
tion from the joint distribution of (y, θ), conditional on x. Consider, then,
{(yk , θk ), k = 1, . . . , m)}, a set of values generated from this joint distribu-
tion. The comparison of the real data x and predictive data via the sum-
mary variable V can be shown graphically by a scatter plot of the val-
ues {(V(yk , θk ), V(x, θk ), k = 1, . . . , m)}, or by a histogram of {V(yk , θk ) −
V(x, θk ), k = 1, . . . , m}. For a model that fits the data well, the points in the
scatter plot should be symmetric with respect to the 45-degree line and the
histogram should include 0.
One of the summary measures of discrepancy between V(x, θ) and the
distribution of V(Y, θ) is the posterior predictive p-value, sometimes re-
ferred to as the Bayesian p-value, defined as
PB = P[V(Y, θ) ≥ V(x, θ) | x] = E_{(Y,θ)} [ I{V(Y, θ) ≥ V(x, θ)} | x ].   (5.3)
Note that a very small or very large value of PB (for example, below 1% or
above 99%) implies that V(x, θ) falls into one or the other extreme tail of
the posterior distribution of V(Y, θ), given x (in both cases, being in general
implausible under the model). In this way, both cases are evidence for a
bad fit of the model to the data with respect to the characteristics that are
reflected in the chosen summary variable. For this reason, the Bayesian p-
value can, alternatively, be defined as PB = P [V(Y, θ) ≤ V(x, θ) | x], which
can be regarded as resulting from (5.3) by taking the negative of the chosen
variable. Typically, the Bayesian p-value, PB , is calculated as a proportion
of simulated values from the posterior predictive distribution of (Y, θ) given
x which satisfy V(yk , θk ) ≥ V(x, θk ). Such simulated values can be obtained
by first generating from h(θ | x) and then future data y from the sampling
model.
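The simulation recipe just described can be sketched for a toy conjugate Poisson model, using the discrepancy (5.2); the data and model below are illustrative, not one of the book's examples:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.array([3, 1, 4, 2, 0, 3, 2, 5])     # observed counts (illustrative)
n, m = len(x), 5_000

# Toy conjugate model: x_i ~ Poi(theta), theta ~ Ga(1, 1), so theta | x ~ Ga(1 + sum x_i, scale 1/(1 + n))
theta = rng.gamma(1 + x.sum(), 1.0 / (1 + n), size=m)

def V(data, th):
    # standardized mean squared deviation (5.2); for the Poisson, E(X_i | theta) = Var(X_i | theta) = theta
    return np.sum((data - th) ** 2 / th, axis=-1)

y = rng.poisson(theta[:, None], size=(m, n))            # replicate data y^k ~ f(y | theta_k)
p_B = np.mean(V(y, theta[:, None]) >= V(x, theta[:, None]))
print(f"posterior predictive p-value: {p_B:.2f}")
```

Since the data were matched to the model here, the resulting p-value should be moderate; an extreme value would indicate the kind of misfit discussed above.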
When the discrepancy measure is a statistic V(X), the posterior predic-
tive evaluation can alternatively be applied separately for each observation,
making use of the marginal predictive distributions p(yi | x). The corre-
sponding marginal p-values with respect to V(Xi ) are defined as PBi =
P [V(Yi ) ≥ V(xi ) | x], i = 1, . . . , n, which reduces to PBi = P(Yi ≥ xi | x) if
V is the identity. This is useful to detect outliers and to check whether there
is a global correspondence between the model and observed data. p-values
centered at extremes or too concentrated in the middle of the range of val-
ues are evidence for overdispersion and underdispersion, respectively, rel-
ative to the predictive data.
Extreme values of a measure of statistical significance, such as PB , need
not necessarily prompt one to completely abandon the model if the char-
acteristic that is expressed by the corresponding variable V is not of great
practical relevance. Even if it were, the features that are subject to criticism
resulting from this model inspection should suggest possible directions for
model revision to find models that are more adequate for the context of the
actual data.
Residuals and Other Measures of Model Adequacy and Diagnostics
Other measures can be constructed by comparing characteristics of the
predictive distribution under the model conditional on observed data with
other also observed data. This is possible in the context of cross-validation,
where one uses a training sample x = (x1 , . . . , xn ) to generate a posterior
distribution h(θ | x), and another sample y = (y1 , . . . , yℓ ), independent of
x, to evaluate the validity of the model by inspection of the corresponding
posterior predictive p(y | x) or some of its characteristics. For example,
the predictive mean and variance for each component Y j of the vector Y, of
(LPML)
LPML = Σ_{i=1}^{n} ln CPOi = ln ∏_{i=1}^{n} p(xi | x(−i) ),   (5.7)
the more adequate is a model. For other diagnostic instruments see, in par-
ticular, Gelfand (1996) and Gelman and Meng (1996).
Example 5.1 Henderson and Velleman (1981) describe a study of the per-
formance of car models in terms of fuel consumption. We use a subset of
the data from this study. The data report efficiency, Ef, measured in miles per gallon, weight in pounds (X1), power in horsepower (X4∗), and the number of gears in the transmission, at levels 3, 4, or 5, jointly represented by indicator variables for 4 (X2) and 5 (X3), respectively. After some preliminary
analysis they consider normal linear regression models for the transformed
response variable Y = 100/E f , gallons per 100 miles (Y here is unrelated
to the earlier notation y for the predictive data).
One of the models involves the explanatory variables X1 , (X2 , X3 ), and
X4 = X4∗ /X1 (power per unit weight) in a multiple regression model:
M1 : μ ≡ E(Y) = β0 + Σ_{j=1}^4 βj Xj + β5 X2 X4 + β6 X3 X4,
from (5.2), calculated with the simulated values θk , generated from the pos-
terior distribution of θ (given the observed values for Y and {X j }) for the
actual data versus the predicted data.
[Figure: scatterplots of the discrepancy measure V(x, θ) for the predicted data (vertical axis, 0–150) against the actual data (horizontal axis, 0–80), under models M1 (left) and M2 (right).]
based on the same sampling model and differ only in the parametric struc-
ture. Otherwise, one must be careful because the DIC measure depends on
the chosen function g(·). In this regard, it is by no means straightforward to determine the maximum likelihood for the saturated structure in most multiparameter models.
Just like Ã(x) previously, the complexity penalty term also involves an expression in terms of the individual data. One of the proposals for pW is similar to the one used in the DIC, expressed as

pW1 = −2 Σ_{i=1}^n { E_{θ|x}[ln f(xi | θ)] − ln E_{θ|x}[f(xi | θ)] }
    ≃ −2 Σ_{i=1}^n { (1/m) Σ_{k=1}^m ln f(xi | θk) − ln[ (1/m) Σ_{k=1}^m f(xi | θk) ] },  (5.19)

with pW1 ≥ 0 by construction.
Another proposal for pW is based on the posterior variance of ln f(xi | θ) for all data points, and is defined as

pW2 = Σ_{i=1}^n Var_{θ|x}[ln f(xi | θ)]
    ≃ Σ_{i=1}^n (1/(m−1)) Σ_{k=1}^m [ lk(xi) − l̄(xi) ]^2,  (5.20)

where lk(xi) = ln f(xi | θk) and l̄(xi) = (1/m) Σ_{k=1}^m lk(xi).
Using any of these proposals for the complexity penalty term, inter-
preted as “effective model dimensionality,” depending on either data or
prior information, Gelman et al. (2014a, 2014b) propose a transformation
of Watanabe’s information criterion to the same scale as other proposals,
being expressed by

WAIC = −2 Σ_{i=1}^n ln E_{θ|x}[f(xi | θ)] + 2 pW.  (5.21)
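As a sketch (ours, not from the text), (5.19)–(5.21) can be computed from an (m × n) matrix of pointwise log-likelihoods ln f(xi | θk); the data below are simulated purely for illustration, and log-sum-exp is used for numerical stability:

```python
import numpy as np

def waic(loglik):
    # loglik is (m, n): rows index posterior draws theta_k, columns observations x_i.
    m, n = loglik.shape
    log_mean_f = np.logaddexp.reduce(loglik, axis=0) - np.log(m)  # ln E f(x_i | theta)
    lppd = log_mean_f.sum()                                       # sum_i ln E f
    # pW1 as in (5.19) and pW2 as in (5.20)
    pw1 = -2.0 * np.sum(loglik.mean(axis=0) - log_mean_f)
    pw2 = np.sum(loglik.var(axis=0, ddof=1))
    return -2.0 * lppd + 2.0 * pw2, pw1, pw2                      # WAIC as in (5.21)

# Normal model with unknown mean: log-likelihoods at approximate posterior draws.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
theta = rng.normal(x.mean(), 1 / np.sqrt(20), size=2000)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - theta[:, None]) ** 2
w, pw1, pw2 = waic(loglik)
```

For this one-parameter model both penalty terms come out close to 1, the "effective model dimensionality."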
≃ −2 ln[ pj(x) / po(x) ] = −2 ln Bjo(x).  (5.23)
That is, the differences rj(BIC) are related to the Bayes factor between
the two models under comparison, say H j and Ho , in the sense that for large
sample sizes they can be approximated by −2 ln B jo (x), using an argument
of Schwarz (1978).
Example 5.2 Continuing with the earlier example, consider now compar-
ative evaluation of the three multiple regression models with respect to
their predictive performance. The PBF for the pairwise comparisons are
PBF(M1 /M2 ) = 0.809; PBF(M1 /M3 ) = 0.941, and PBF(M2 /M3 ) = 1.164.
These values support the earlier result that M2 was the best and M1 the
worst of the three models, in terms of this criterion based on conditional
predictive ordinates.
A comparison with respect to the information criteria shown in Table 5.2 reports model M2 as the best in terms of the Bayesian summaries DIC and WAIC; M2 is, however, dominated by M3 under the BIC. The latter
is no surprise, knowing that BIC favors the more parsimonious models.
Table 5.2 DIC, BIC and WAIC criteria for models M1 , M2 and M3
Model DIC (pD ) BIC (p) WAIC (pW2 )
M1 48.69 (8.27) 67.36 (8) 46.77 (5.38)
M2 47.32 (6.19) 61.33 (6) 46.70 (4.78)
M3 47.58 (4.12) 56.93 (4) 47.40 (3.48)
Pooling the results here and in the previous example, the applied criteria
show that the overall best of the three models is the one with intermediate
complexity as measured by the number of parameters.
available information. But there is no pretense that any model stands for the reality under study.
In contrast, some methods correspond to a perspective (possibly con-
troversial) that the set of considered models must include a so-called true
model, in the sense of being the one that generated the observed data (or,
at least, which constitutes a sufficiently good approximation of the true
model). The method of using Bayes factors to select one of a finite number
of models fits into this perspective. This and other aspects that could be
criticized should not prevent one from using this method in some contexts,
with due caution.
Any set of Bayesian models M = {M j , j ∈ J} (with different sampling
models and prior distributions) implies a set of corresponding prior predic-
tive models
p(x | Mj) ≡ pj(x) = ∫ fj(x | θj) hj(θj) dθj,  j ∈ J.  (5.24)
where P(M j ) is the prior probability of M j being the true model and is
updated in the face of data x to
P(Mj | x) = P(Mj) p(x | Mj)/p(x),  j ∈ J.  (5.26)
The Bayes factor in favor of Mk versus M j is thus the ratio
Bkj(x) = [ P(Mk | x)/P(Mj | x) ] / [ P(Mk)/P(Mj) ] = p(x | Mk) / p(x | Mj),  (5.27)
and the posterior odds for Mk are
P(Mk | x) / [ 1 − P(Mk | x) ] = P(Mk) / [ Σ_{j≠k} P(Mj) Bjk(x) ],  (5.28)
where Bjk(x) = 1/Bkj(x) is the Bayes factor in favor of Mj versus Mk. The formula for these odds becomes ( Σ_{j≠k} Bjk(x) )^{−1} when one uses a uniform prior over the space of models. Simple examples of the application of this
method can be found in Paulino et al (2018).
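Once the prior predictive values p(x | Mj) are available, the updating (5.26)–(5.28) is a short computation; a minimal sketch (ours, not from the text), working on the log scale for numerical stability:

```python
import numpy as np

def posterior_model_probs(log_marglik, prior=None):
    # P(M_j | x) proportional to P(M_j) p(x | M_j), as in (5.26);
    # a uniform prior over the model space is used by default.
    log_marglik = np.asarray(log_marglik, dtype=float)
    if prior is None:
        prior = np.full(log_marglik.size, 1.0 / log_marglik.size)
    w = np.log(prior) + log_marglik
    w -= w.max()                      # normalize on the log scale
    p = np.exp(w)
    return p / p.sum()

# Two models whose marginal likelihoods differ by a factor of e^2,
# so the Bayes factor is B_12 = e^2 and P(M_1 | x) = e^2 / (1 + e^2).
p = posterior_model_probs([-10.0, -12.0])
```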
In general, the evaluation of prior predictive distributions, and thus of
Bayes factors, requires simulation – see the preceding chapter and the next
section. Model comparison based on Bayes factors requires the use of prac-
tical rules for the choice of thresholds on the strength of evidence in favor
of a model. One such rule is proposed by Kass and Raftery (1995).
we get

p(xi | x(−i)) = p(x)/p(x(−i)) = [ ∫ (1/f(xi | x(−i), θ)) h(θ | x) dθ ]^{−1},

and thus, if {θ⋆(j); j = 1, . . . , m} is a sample from h(θ | x), we have

p̂(xi | x(−i)) = [ (1/m) Σ_{j=1}^m 1/f(xi | x(−i), θ⋆(j)) ]^{−1}.  (5.29)
This estimate of p(x) has been shown, however, to be in general very inef-
ficient, as we already mentioned in Section 4.2.2.
Newton and Raftery (1994) suggested the harmonic mean of { f(x | θ⋆(j)) }, that is,

p̂(x) = [ (1/m) Σ_{j=1}^m 1/f(x | θ⋆(j)) ]^{−1}.  (5.30)
More discussion of this and other prior predictive density estimates can
be found in Kass and Raftery (1995).
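The estimators (5.29), (5.7), and (5.30) can be sketched as follows (our illustration, not from the text; it assumes conditionally independent observations, so that f(xi | x(−i), θ) = f(xi | θ), and works on the log scale):

```python
import numpy as np

def cpo_lpml(loglik):
    # loglik is (m, n): ln f(x_i | theta_j) for m posterior draws theta_j.
    # (5.29): 1/CPO_i = (1/m) sum_j exp(-ln f(x_i | theta_j)).
    m = loglik.shape[0]
    log_cpo = -(np.logaddexp.reduce(-loglik, axis=0) - np.log(m))
    # (5.7): LPML = sum_i ln CPO_i
    return log_cpo, log_cpo.sum()

def log_harmonic_mean_marglik(loglik_full):
    # (5.30): harmonic mean of f(x | theta_j); known to be very unstable.
    m = loglik_full.shape[0]
    return -(np.logaddexp.reduce(-loglik_full, axis=0) - np.log(m))

# Degenerate check: if all posterior draws are equal, CPO_i = f(x_i | theta).
loglik = np.tile(np.array([-1.0, -2.0, -0.5]), (100, 1))
log_cpo, lpml = cpo_lpml(loglik)
log_px = log_harmonic_mean_marglik(loglik.sum(axis=1))
```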
Problems
5.1 Replicate Example 5.1. The data are available in R as the data frame mtcars.
Evaluate PB and the log pseudo-Bayes factor (5.22).
5.2 Let yi be the number of misprints on page i of the book Bayesian Theory by
Bernardo and Smith (2000). Consider two alternative models:
Model H1 : yi | λ ∼ Poi(λ · Ni ), i = 1, . . . , n
λ ∼ Ga(1, β)
Model H2 : yi | θ ∼ Bin(Ni , θ)
θ ∼ Be(1, β − 1),
where Ni is the number of characters per page, β > 1 is a fixed hyperparam-
eter, and the gamma distribution is parametrized such that E(λ) = 1/β. That
is, under both models 1/β is interpreted as the expected number of typos per
word.
a. Find the Bayes factor B = p(y | H2 )/p(y | H1 ) for choosing between
competing models H1 and H2 .
Model H1:
yi = β1 + β2 xi + εi, i = 1, . . . , n;  εi ∼ N(0, 1),
β1 ∼ N(0, 1), β2 ∼ N(1, 1),
with β1 and β2 a priori independent.
Model H2:
yi = γ1 + γ2 xi + γ3 xi^2 + εi, i = 1, . . . , n;  εi ∼ N(0, 1),
γ1 ∼ N(0, 1), γ2 ∼ N(1, 1), γ3 ∼ N(0, 1),
with γ1, γ2, γ3 a priori independent.
a. Find the marginal distributions p(y | H1) = ∫ f(y | β, H1) h(β) dβ and p(y | H2) = ∫ f(y | γ, H2) h(γ) dγ.
b. Write down the Bayes factor B = p(y | H2 )/p(y | H1 ) for comparing
model H1 versus model H2 and evaluate it for the given data set.
c. We now replace the prior distributions by improper constant priors: h(β) = c1 in model H1, and h(γ) = c2 in model H2. We can still formally evaluate the integrals3 ∫ f(y | β, H1) h(β) dβ and ∫ f(y | γ, H2) h(γ) dγ and define a Bayes factor,

B̃ = ∫ f(y | γ) h(γ) dγ / ∫ f(y | β) h(β) dβ.
Show that the value of the Bayes factor B̃ depends on the – arbitrarily
chosen – constants c1 and c2 .
d. Evaluate the Bayes factor using the harmonic mean estimator (5.31) and
compare with the exact evaluation from b.
5.4 Refer to Problem 5.3. Evaluate the log pseudo-Bayes factor (5.22), BIC,
AIC, WAIC, and DIC to compare the two models.
3 Although the marginal distributions might be improper, i.e., meaningless.
6
Markov Chain Monte Carlo Methods1
As discussed in Chapter 4, the implementation of Bayesian inference of-
ten involves the use of simulation-based methods, based on a Monte Carlo
sample of values that are generated from a typically multivariate posterior
distribution, h(θ | x), θ ∈ Θ. The use of simulation-based methods is one
way of dealing with the analytically often intractable form of Bayesian
inference summaries. Depending on the complexity of the posterior distribution h(·), the evaluation of posterior summaries like E[g(θ) | x] can be
carried out by classical Monte Carlo (MC) methods, by generating i.i.d.
samples from the target distribution itself, or from some appropriate im-
portance sampling distribution, whose construction involves the target dis-
tribution.
For more complex problems it has become more common, especially
since the 1990s, to use more general MC methods based on the simulation
of a (homogeneous) Markov chain that is constructed to have an ergodic
distribution π(θ) equal to the target distribution, π(θ) ≡ h(θ | x). Such meth-
ods, known as Markov chain Monte Carlo (MCMC), thus use dependent
samples of θ, implying also more complex asymptotics and the need for
larger simulation sample sizes compared to classical MC methods.
The rediscovery of MCMC methods2 by statisticians in the 1990s (in particular, Gelfand and Smith, 1990) led to considerable progress in simulation-based inference methods and, in particular, Bayesian analysis for models that were too complex for earlier methods.
Given the nature of MCMC methods, it is not possible to fully understand them and the details of their application in Bayesian statistics without a knowledge of basic results for Markov chains, which are therefore summarized in the following section, in a very brief form given the nature of this text.3 For simplicity, we use in this section (and beyond) a generic notation for the states of a chain. In the following sections we describe the most commonly used methods, including Metropolis–Hastings chains, Gibbs samplers, slice sampling, and Hamiltonian Monte Carlo. The last section is dedicated to implementation questions related to MCMC methods, including the question of one versus multiple parallel chains and convergence diagnostics.
1 This material will be published by Cambridge University Press as Computational Bayesian Statistics, by M.A. Amaral Turkman, C.D. Paulino, and P. Müller (https://tinyurl.com/CompBayes). This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © Maria Antónia Amaral Turkman, Carlos Daniel Paulino & Peter Müller, 2019.
2 The earlier literature includes Metropolis et al (1953), Hastings (1970), and Geman and Geman (1984).
chains, and will therefore henceforth not explicitly indicate the qualifier
homogeneous.
For a discrete state space, the Markov chain is entirely defined by the
conditional probabilities P(u, {v}), i.e.
P(Un+1 = v | U0 = u0 , . . . , Un = u) =
P(Un+1 = v | Un = u) ≡ p(u, v), ∀n ≥ 0, u, v ∈ U.
In the case of a finite state space the transition probabilities p(·, ·) can be
recorded as a matrix P of probabilities for one step. When U is uncountably infinite and F(u, v) is absolutely continuous, the transition function can be defined by a density p(u, v) = ∂F(u, v)/∂v.
For the moment we assume a discrete Markov chain. We have

P(Un+1 = v) = Σ_u P(Un = u) p(u, v) = Σ_u P(U0 = u) p_{n+1}(u, v),

where p_n(u, v) = P(Un = v | U0 = u) = Σ_z p_{n−1}(u, z) p(z, v), n ≥ 1, defines a transition function for n steps (in matrix form, the product P^n).
The construction of a Markov chain is therefore completely determined by
the transition function, as long as an initial distribution is given.
We say that a probability distribution π(u), u ∈ U, is a stationary distribution if

π(v) = Σ_u π(u) p(u, v).
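For a finite state space the defining identity π = πP can be checked directly; a small sketch (ours, not from the text), with an invented 3-state chain:

```python
import numpy as np

# An invented 3-state transition matrix; row u holds p(u, .), each row sums to one.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# pi is stationary iff pi = pi P, i.e. pi is a left eigenvector of P
# for eigenvalue 1, normalized to sum to one.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

# For this ergodic chain, the rows of the n-step matrix P^n converge to pi.
Pn = np.linalg.matrix_power(P, 50)
```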
5 Note that for k = 1, U(t) corresponds to θt in these earlier chapters, reserving θ(t) for the t-th smallest value of a sample (θ1, . . . , θn).
6 For convenience, we will use the term "density function" to refer to a distribution independently of the nature of its support. That is, in the case of discrete random variables it represents the probability mass function.
one should always start with a preliminary analysis of π, such that q could
be chosen to approximate the target distribution as well as possible.
Given the generic nature of the M-H algorithm, we now describe two of the most commonly used special cases.8
U_j^(t+1) ∼ π(u_j | u_1^(t+1), . . . , u_{j−1}^(t+1), u_{j+1}^(t), . . . , u_k^(t)), for j = 1, 2, . . . , k.
2. At the end of the k steps, take u^(t+1) = (u_1^(t+1), . . . , u_k^(t+1)) and repeat step 1 with t ≡ t + 1.
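A minimal sketch of this systematic-sweep Gibbs sampler (ours, not from the text), for an invented bivariate normal target whose full conditionals are available in closed form:

```python
import numpy as np

def gibbs_bivariate_normal(rho, T=5000, seed=0):
    # Target (U1, U2) ~ N(0, [[1, rho], [rho, 1]]); each full conditional is
    # U_j | u_{-j} ~ N(rho * u_{-j}, 1 - rho^2).
    rng = np.random.default_rng(seed)
    u = np.zeros(2)
    s = np.sqrt(1.0 - rho ** 2)
    out = np.empty((T, 2))
    for t in range(T):
        u[0] = rng.normal(rho * u[1], s)   # step j = 1 uses the current u2
        u[1] = rng.normal(rho * u[0], s)   # step j = 2 uses the new u1
        out[t] = u
    return out

sample = gibbs_bivariate_normal(rho=0.8, T=20000)
# Marginal means near 0, variances near 1, and correlation near 0.8.
```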
which depends on the past only through the previous value of U2 . In other
words, the definition of the marginal densities and the transition density for
U2 implies
π2(ũ2) = ∫ π2(ũ2 | w) π1(w) dw = ∫ [ ∫ π2(ũ2 | w) π1(w | u2) dw ] π2(u2) du2 = ∫ P(u2, ũ2) π2(u2) du2,
π(u) remains the marginal distribution of the joint f(u, Z).13 In some cases,
the augmented model f (u, Z) allows a much easier implementation of the
Gibbs sampler, now involving the distributions f (u | z) and f (z | u).
Example 6.2 Consider a diagnostic test for some disease with a binary
outcome (positive or negative). Taking a random sample of N patients, let
X be the number of positive results, and assume that X | φ ∼ Bi(N, φ).
Most diagnostic tests are subject to classification errors implying that the
probability of a positive can be written as φ = ασ + (1 − α)(1 − ε), where
θ = (α, σ, ε) and α is the prevalence of the disease, σ is the sensitivity of
the test (probability of a true positive), and ε is the specificity of the test
(probability of a true negative).
The vector θ is typically unknown and is the parameter vector of interest
in the inference problem. However, given the obvious overparametrization of the sampling model, it is not possible to report inference on θ (or functions of θ other than φ) without additional data (e.g., from a further test regarded as a gold standard) or a priori information related to the
type of diagnostic test and the disease under consideration. Suppose that
one only has access to such prior information, represented as independent
beta distributions for the components of θ, with fixed hyperparameters. The
posterior distribution takes the following analytically intractable form:
h(θ | x) ∝ f(x | θ) α^{ap−1} (1 − α)^{bp−1} σ^{cs−1} (1 − σ)^{ds−1} ε^{ce−1} (1 − ε)^{de−1},
θ ∈ (0, 1)^3. Intending to implement posterior inference using MCMC, the
Gibbs sampler is not a particularly easy approach for this posterior dis-
tribution because the conditional posterior distributions are complicated,
requiring specialized random variate generation methods. However, the
implementation of a Gibbs sampler is greatly facilitated by the following
model augmentation with latent data. Let Y = (X, Z1 , Z2 ), where Z1 (resp.
Z2 ) are unobserved (latent) data that report the number of individuals with
true positive (negative) results. A model for Y that remains consistent with
the observed data X is defined as
f (y | θ) = f (x | θ) f (z1 | x, θ) f (z2 | x, θ),
where now Z1 | x, θ ∼ Bi(x, ασ/φ) and Z2 | x, θ ∼ Bi(N −x, (1−α)ε/(1−φ)).
Note that the parameters of the conditional distributions for the latent
variables correspond to the so-called positive predicted values V+ = ασ/φ
13 The construction of f(u, Z) could also be referred to as demarginalization or augmentation of π(u).
This algorithm was introduced, still without any reference to the Gibbs sampler, by Tanner and Wong (1987) under the aforementioned title, where they prove that h(θ | x, z_1^(t), z_2^(t)) converges as t → ∞ to h(θ | x), under quite general conditions.
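A sketch of the resulting data augmentation Gibbs sampler for Example 6.2 (ours, not from the text; the complete-data beta updates follow from conjugacy, and the hyperparameter values and data below are invented for illustration):

```python
import numpy as np

def gibbs_diagnostic_test(x, N, a=(1, 1), s=(21, 4), e=(21, 4), T=4000, seed=0):
    # theta = (alpha, sigma, eps): prevalence, sensitivity, specificity;
    # x positives out of N, with phi = alpha*sigma + (1 - alpha)*(1 - eps).
    rng = np.random.default_rng(seed)
    alpha, sigma, eps = 0.5, 0.8, 0.8
    draws = np.empty((T, 3))
    for t in range(T):
        phi = alpha * sigma + (1 - alpha) * (1 - eps)
        # Latent true positives among the x positives and true negatives among N - x.
        z1 = rng.binomial(x, alpha * sigma / phi)
        z2 = rng.binomial(N - x, (1 - alpha) * eps / (1 - phi))
        d = z1 + (N - x - z2)                      # complete-data diseased count
        alpha = rng.beta(a[0] + d, a[1] + N - d)
        sigma = rng.beta(s[0] + z1, s[1] + d - z1)
        eps = rng.beta(e[0] + z2, e[1] + (x - z1))
        draws[t] = alpha, sigma, eps
    return draws

draws = gibbs_diagnostic_test(x=30, N=100)
```

Marginally, φ = ασ + (1 − α)(1 − ε) should track the observed positive rate x/N.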
[Figure 6.1 Slice sampler for a univariate distribution π(u): the auxiliary value z^(t+1) and the horizontal slice that yields u^(t+1) from u^(t).]
value z(t+1) is generated. The intersection of the line Z = z(t+1) with π(u)
defines the point(s) that delimit the horizontal slice S(z(t+1) ) (an interval
or union of intervals) over which u(t+1) is generated. In practice, the main
difficulty with this algorithm is the second step since the support S(z) of the
distribution on the horizontal slice could be complicated for a multimodal
π(u), which might need the use of other simulation methods in this step
Leapfrog Approximation
To implement Hamiltonian dynamics we use a discretization of the differential equation system (6.8) known as the leapfrog method. Starting from (θ(t), p(t)) we generate (θ(t + ε), p(t + ε)) by using a discrete approximation over two subintervals of length ε/2 each:

p_i(t + ε/2) = p_i(t) + (ε/2) ∂ ln π(θ(t))/∂θ_i
θ_i(t + ε) = θ_i(t) + ε p_i(t + ε/2)
p_i(t + ε) = p_i(t + ε/2) + (ε/2) ∂ ln π(θ(t + ε))/∂θ_i.  (6.9)
Let T_ε(θ(t), p(t)) = (θ(t + ε), p(t + ε)) denote the discrete approximation implemented in (6.9). It is easily verified that the approximation is perfectly reversible, that is, T_{−ε}(θ(t + ε), p(t + ε)) = (θ(t), p(t)) again, or T_{−ε}(θ, p) = T_ε^{−1}(θ, p). Even easier than turning back time, all we have to do to send the ball back to where it came from is to turn it around. That is, p ≡ −p, or T_ε^{−1}(θ, p) = T_ε(θ, −p). Note that the time index in θ(t) and p(t) is only used for the implementation of T_ε(·). After the last step in (6.9), we drop the time index again.
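A direct transcription of (6.9) (ours, not from the text), illustrated on an invented standard normal target, including a check that flipping the momentum sends the ball back, T_ε(θ̃, −p̃) = (θ, −p):

```python
import numpy as np

def leapfrog(theta, p, grad_log_pi, eps, M=1):
    # M leapfrog steps T_eps of (6.9): half step in p, full step in theta,
    # half step in p.
    theta, p = np.array(theta, float), np.array(p, float)
    for _ in range(M):
        p = p + 0.5 * eps * grad_log_pi(theta)
        theta = theta + eps * p
        p = p + 0.5 * eps * grad_log_pi(theta)
    return theta, p

# Standard normal target: ln pi(theta) = -theta^2/2, so the gradient is -theta.
grad = lambda th: -th
th1, p1 = leapfrog(0.0, 1.0, grad, eps=0.1, M=8)
# Reversal: run forward again with flipped momentum to recover the start.
th0, p0 = leapfrog(th1, -p1, grad, eps=0.1, M=8)
```

The "energy" θ²/2 + p²/2 is nearly conserved along the trajectory, up to the O(ε²) discretization error.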
Example 6.3 Logistic regression. In a toxicity study for some new com-
pound, various dose levels of the compound are administered to batches
of animals. Let i = 1, . . . , k index the batches, let xi index the dose for the
i-th batch, let ni denote the number of animals in the i-th batch, and yi the
number of animals in the i-th batch that show a response. We assume that
yi is a binomial random variable, yi ∼ Bi(ni, πi), where πi is the probability of response at dose xi. We assume πi = 1/(1 + e^{α+βxi}). This is known as a logistic regression model. The probabilities πi define the sampling model (likelihood).
We complete the model with a prior h(α, β) = N(0, cI), where I is a
(2 × 2) identity matrix, and c = 100, i.e., a very vague prior. We observe
with gradient

∇ ln h(θ | y) = (−α/c, −β/c)′ − Σ_i y_i (1 − π_i) (1, x_i)′ + Σ_i ȳ_i π_i (1, x_i)′,

where ȳ_i = n_i − y_i.
Using (6.9) with target distribution π(θ) = h(θ | y), we implement M = 8
steps of size ǫ = 0.1. The transitions are deterministic, and we expect to
stay on a constant level of the augmented model (6.7). The algorithm re-
quires an initial value of p. Think of θ as the location of a ball on log pos-
terior contours, and p as the momentum that you give the ball by kicking it.
Hamiltonian mechanics and the approximation in (6.9) describe the trajectory of the ball over M small time intervals of length ε. In this analogy
you have to think of the posterior distribution as a valley, with the posterior
mode being the lowest point. This is because ln h(θ | y) appears in (6.7)
with negative sign. Figure 6.2 shows a contour plot of the log posterior
“valley,” and three possible trajectories. All start at θ = (−2, −4), kicking
the ball with p = (1, 0), (1, −3), and (1, 1.4). On the four-dimensional con-
tours of the augmented model H(θ, p) the trajectories would move along
contour lines. Next we will introduce a random transition function, build-
ing on this deterministic transition of the leapfrog approximation, which we
will then use in Problem 6.12 to implement MCMC posterior simulation for
this example.
[Figure 6.2 Contour plot of the log posterior "valley" ln h(α, β | y) and three leapfrog trajectories, all starting at θ = (−2, −4), with initial momenta p = (1, 0), (1, −3), and (1, 1.4).]
In summary, the MCMC algorithm proceeds with the following two transition probabilities.
1. Generate p_i ∼ N(0, 1), i = 1, . . . , d.
2. Generate (θ̃, p̃) = T_ε(θ, p) with probability 1/2, or T_{−ε}(θ, p) with probability 1/2. With probability α = min{1, π(θ̃)/π(θ)}, accept the proposal, θ ≡ θ̃.
To justify the acceptance probability imagine a minor modification of step
2. After generating (θ̃, p̃) as described we replace p̃ by p† ∼ π(p | θ̃) and
recognize α as the M-H acceptance probability for (θ̃, p† ). Finally, we do
not need to record p in step 2 since p is immediately overwritten in step 1 of
the next iteration. Also, if desired, T ǫ can be changed to implement multiple
leapfrog steps. The algorithm remains unchanged (and the approximation
error increases with multiple steps).
The algorithm can be further simplified by noting that pi (t + ǫ) in the
last step in the leapfrog approximation (6.9), that is T ǫ or T −ǫ in step 2, is
not needed. The value pi (t + ǫ) is never used in the MCMC described here.
It is immediately replaced by a new value, pi ∼ N(0, 1) in step 1 of the
following iteration. The first two lines of (6.9) can be collapsed into

θ_i(t + ε) = θ_i(t) + ε { p_i(t) + (ε/2) ∂ ln π(θ(t))/∂θ_i }.  (6.11)
And, even easier, we can further collapse (6.11) with p_i ∼ N(0, 1) in step 1 to get

θ_i(t + ε) | θ_i(t) ∼ N( θ_i(t) + (ε²/2) ∂ ln π(θ(t))/∂θ_i, ε² ).
That is, we replace steps 1 and 2 with a single step.
1′. Generate a proposal,

θ̃_i | θ ∼ N( θ_i + (ε²/2) ∂ ln π(θ)/∂θ_i, ε² ).

Accept with probability α = min{1, R} with acceptance ratio

R = [ π(θ̃)/π(θ) ] ∏_{i=1}^d N( θ_i | θ̃_i + (ε²/2) ∂ ln π(θ̃)/∂θ̃_i, ε² ) / N( θ̃_i | θ_i + (ε²/2) ∂ ln π(θ)/∂θ_i, ε² ),
the first factor being the ratio of target distributions and the second fac-
tor the ratio of proposal distributions for the proposed and the reciprocal
moves. This and other simplifications are discussed by Welling and Teh
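Step 1′ is a Langevin-type proposal; a minimal sketch (ours, not from the text), applied to an invented standard normal target, with the acceptance ratio R evaluated on the log scale:

```python
import numpy as np

def langevin_mcmc(log_pi, grad_log_pi, theta0, eps, T=5000, seed=0):
    # Proposal: theta~_i | theta ~ N(theta_i + (eps^2/2) d ln pi/d theta_i, eps^2),
    # accepted with the ratio R of target and proposal densities.
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.array(theta0, float))
    d = theta.size
    out = np.empty((T, d))
    def log_q(to, frm):  # ln of the Gaussian proposal density, up to a constant
        mean = frm + 0.5 * eps ** 2 * grad_log_pi(frm)
        return -0.5 * np.sum((to - mean) ** 2) / eps ** 2
    for t in range(T):
        prop = theta + 0.5 * eps ** 2 * grad_log_pi(theta) + eps * rng.normal(size=d)
        log_R = log_pi(prop) - log_pi(theta) + log_q(theta, prop) - log_q(prop, theta)
        if np.log(rng.uniform()) < log_R:
            theta = prop
        out[t] = theta
    return out

# Standard normal target, started away from the mode.
sample = langevin_mcmc(lambda th: -0.5 * np.sum(th ** 2), lambda th: -th,
                       theta0=3.0, eps=0.9, T=20000)
```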
Multiple Chains
• Generate m chains with t∗ (usually, t∗ ≪ t) iterations each, starting
with m initial values, usually generated to be distinct and well separated
across the support of the target distribution.
• Use the final iteration θ^(t∗) of each chain to form an MC sample θ(1), . . . , θ(m) – the choice of t∗, beyond some initial burn-in, and the choice of m are subject to similar considerations as before.
Both approaches have their advantages and limitations, which explains
the great variety of choices that are found in the literature.16 The first
scheme allows us to reduce the computational cost and can lead to a chain
closer to h(θ | x) than schemes using multiple chains, using the same total
number of iterations. The second scheme makes it easier to control conver-
gence to h(θ | x) by reducing dependence on the initial states and allows a
more direct exploration of the target support. It is more likely to detect if
the apparent convergence of a chain is merely an indication that it is trapped
around some local mode, far from the target distribution. If this is the case,
one can reparametrize the model (difficult in general) or one can change
the initial values. In summary, the scheme with multiple chains seems by
and large appropriate for an exploratory evaluation before running one long
chain for definitive inference.
Convergence diagnostics
A variety of methods have been proposed to diagnose convergence. Some
are already included as default in specific or more general software for
Bayesian inference. Others can be added by writing problem-specific code
for a given inference problem. Keeping with the scope of this text, we limit
the discussion to a brief review of some of the most widely used methods.
The most widely used tools to monitor convergence to the stationary
distribution are plots of simulated values across iterations, known as tra-
jectory (or trace) plots, and the analysis of such plots across different time
windows to inspect any changes in the mixing of the chain over the sup-
port of the target posterior distribution. Another tool is the superposition of
16 The issue is discussed at length by Geyer (1992) and in the following discussion, where many pros and cons of each method are presented.
Problems
Most of the following problems require programming to implement MCMC
simulation. See Appendix B for some suggestions on how to implement
such simulation in R. Later, in Chapter 9, we will discuss public domain
software to implement MCMC for most problems. However, it is useful to
implement MCMC and understand implementation details for at least some
more stylized problems. It is therefore recommended not to use MCMC
software, such as OpenBUGS, JAGS, or Stan in the following problems.
A large number of good case studies can be found in the manuals for
BUGS and R-INLA, including some substantially more complex inference problems than the following exercises. See www.openbugs.net/Examples/ and also the problem set in Chapter 9.
6.1 Metropolis–Hastings. Verify that the transition function (6.4) does indeed
satisfy the detailed balance condition (6.1).
6.2 Data augmentation. The table below shows the observed frequencies y j re-
lated to the observed phenotype defined by the blood group of an individual,
for a sample of n = 435 individuals. Here j ∈ {1, 2, 3, 4} indexes the four
blood groups O, A, B, AB.
j   Blood group   Frequency y_j   Probability p_j
1   O             176             r^2
2   A             182             p^2 + 2pr
3   B             60              q^2 + 2qr
4   AB            17              2pq
(The answer need not be perfect – any reasonable, practical, creative sug-
gestions is fine. See Section 9.6 for more discussion of convergence diag-
nostics.)
6.5 We consider a hierarchical event rate model with a Poisson sampling model
yi ∼ Poi(λi ti ), and a prior model
λi ∼ Ga(α, β) and β ∼ Ga(c, d),
where (α, c, d) are fixed hyperparameters.17
a. Find the conditional posterior distributions:
1. h(λi | β, λ j , j , i, y).
2. h(β | λ1 , . . . , λn , y).
b. Using the data from Problem 4.4, implement a Gibbs sampler to simulate
from the posterior in the above model.
c. Alternatively, marginalize with respect to λi to find the marginal likeli-
hood f (yi | β) and h(β | y).
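The conditionals in (a) are conjugate: λi | β, y ∼ Ga(α + yi, β + ti) and β | λ ∼ Ga(c + nα, d + Σi λi). A sketch of the Gibbs sampler of part (b) (ours, not from the text; the counts and exposure times below are invented for illustration and are not the data of Problem 4.4):

```python
import numpy as np

def gibbs_event_rates(y, t, alpha, c, d, T=4000, seed=0):
    # y_i ~ Poi(lambda_i t_i), lambda_i ~ Ga(alpha, beta), beta ~ Ga(c, d);
    # alternate lambda_i | beta, y ~ Ga(alpha + y_i, beta + t_i) and
    # beta | lambda ~ Ga(c + n*alpha, d + sum_i lambda_i).
    rng = np.random.default_rng(seed)
    y, t = np.asarray(y, float), np.asarray(t, float)
    n = y.size
    beta = c / d if d > 0 else 1.0            # initialize near the prior mean
    lam_draws, beta_draws = np.empty((T, n)), np.empty(T)
    for m in range(T):
        lam = rng.gamma(alpha + y, 1.0 / (beta + t))        # numpy scale = 1/rate
        beta = rng.gamma(c + n * alpha, 1.0 / (d + lam.sum()))
        lam_draws[m], beta_draws[m] = lam, beta
    return lam_draws, beta_draws

# Invented counts y_i over exposure times t_i:
y = [5, 1, 5, 14, 3]
t = [94.0, 16.0, 63.0, 126.0, 5.0]
lam_draws, beta_draws = gibbs_event_rates(y, t, alpha=1.8, c=0.01, d=1.0)
```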
6.6 Assume that x, y are jointly distributed random variables with support (0, B),
with conditional p.d.f.
f(x | y) ∝ e^{−xy}, 0 < x < B,  (6.12)
f(y | x) ∝ e^{−xy}, 0 < y < B.  (6.13)
a. Propose a Gibbs sampler to generate an MC sample from f (x, y) (this is
easy – there are no tricks).
b. Now consider B = ∞, that is, the support for x and y is the entire posi-
tive real line. Show that there is no (proper) joint distribution f (x, y) with
conditionals (6.12) and (6.13).
c. Justify the statement “Under the setup of (b) we can not apply the Gibbs
sampler.”
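Both conditionals in (6.12)–(6.13) are exponentials truncated to (0, B), which can be drawn by inverting the cdf; a sketch of part (a) (ours, not from the text):

```python
import numpy as np

def sample_trunc_exp(rate, B, rng):
    # Inverse-cdf draw from the density proportional to exp(-rate * x) on (0, B).
    u = rng.uniform()
    return -np.log1p(-u * (1.0 - np.exp(-rate * B))) / rate

def gibbs_trunc(B=5.0, T=10000, seed=0):
    # Alternate draws from f(x | y) and f(y | x) of (6.12)-(6.13).
    rng = np.random.default_rng(seed)
    x, y = 1.0, 1.0
    out = np.empty((T, 2))
    for t in range(T):
        x = sample_trunc_exp(y, B, rng)
        y = sample_trunc_exp(x, B, rng)
        out[t] = x, y
    return out

sample = gibbs_trunc()
```

By symmetry of the joint density on (0, B)², the two marginals agree, which gives a simple sanity check on the output.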
6.7 Probit regression. Consider a probit regression model for a binary response
yi on covariates xi , i = 1, . . . , n. Let Φ(·) denote the standard normal c.d.f.
P(yi = 1 | xi ) = Φ(xi′ β), (6.14)
where β = (β0 , β1 , β2 ) and xi = (1, xi1 , xi2 ). We observe the following data.18
These are historical data – do you recognize them?
17 Compare with Problem 4.4 for a variation of the same problem.
18 The (identical) data are available as oring.dta on the book homepage, sites.google.com/view/computationalbayes/home.
[Figure 6.3 Galaxy data. The histogram (VELOCITY, 10–35) shows the data points; the curve shows a kernel density estimate.]
6.10 This problem describes a real inference problem that arises in bioinformatics
research. The probability model is more complex than in other examples, but does not introduce any serious difficulties.19
Likelihood (sampling model). Let xi , i = 1, . . . , n, denote a set of multinomial
random variables, with outcomes xi ∈ {1, . . . , N}. Let y = (y1 , . . . , yN ) denote
a contingency table summarizing the multinomial experiment, i.e., y j is the
frequency of outcome j. Let πj = P(xi = j) denote the unknown probability of observing outcome j, with Σj πj = 1. We write

y ∼ M_{N−1}(n; π1, . . . , πN).
Prior. Due to the nature of the experiment we believe a priori that some of
the π j are much larger than others. We refer to this subset of outcomes with
much larger probability as the “prevalent” outcomes A1 , and to the set of not
so likely outcomes as “rare” outcomes A0 . We will formally define A0 and
A1 in the following.
Parametrization. To define a prior probability model for θ = (π1 , . . . , πN ) it
is convenient to introduce latent indicators z j , and a change of variables for
19 The problem is a stylized form of the analysis for SAGE data discussed by Morris et al (2003).
the πj:

πj = π∗ qj if zj = 1, and πj = (1 − π∗) rj if zj = 0,  (6.18)

with Σj qj = 1 and Σj rj = 1 (let qj = 0 if zj = 0, and rj = 0 if zj = 1). In
words, the latent variable z j is an indicator for outcome j being a prevalent
outcome, i.e., A1 = { j : z j = 1} and A0 = { j : z j = 0}. The probability
of observing some prevalent outcome is π∗ , with π∗ close to 1.0; q j is the
probability of outcome j given that we observe a prevalent outcome, and r j
is the probability of j given a rare outcome. For later reference we define
M1 = #A1 and M0 = N − M1
as the number of prevalent and rare outcomes, respectively.
Prior probability model h(z, π∗ , q, r). We assume
P(z j = 1) = ρ, (6.19)
a beta prior on the total probability of prevalent outcomes:
π∗ ∼ Be(a∗ , b∗ ), (6.20)
and a Dirichlet prior for the partitioning of π∗ into the cell probabilities π j ,
j ∈ A1. Let q̃h, h = 1, . . . , M1, denote the nonzero weights qj:
(q̃1 , . . . , q̃ M1 ) ∼ D M1 −1 (a1 , . . . , a1 ). (6.21)
The use of equal Dirichlet parameters a1 reflects the symmetric nature of
prior beliefs about (q1 , . . . , q M1 ). Similarly for r j , j ∈ A0 :
(r̃1 , . . . , r̃ M0 ) ∼ D M0 −1 (a0 , . . . , a0 ). (6.22)
y = Xβ + ε.  (6.23)
We complete the model with a prior. Let τ^2 = 1/σ^2 denote the residual precision:
h(τ^2) = Ga(ν0/2, ν0 σ0^2/2) and h(β) = N(β0, Σ0),  (6.24)
independently. That is, we assume an informative prior for β (for example,
this could be based on historical data from a related earlier study).
Here, X ∼ Ga(a, b) denotes a gamma distribution with mean E(X) = a/b.
a. State the joint posterior h(β , τ2 | y) (up to a normalizing constant is okay).
No need to simplify.
b. Find the conditional distribution h(β | τ2 , y). Use notation β b for the least
squares solution. Recognize the distribution as a well-known parametric
family.
c. Find h(τ2 | β , y). You can use notation RSS(β ) = (y − X β )′ (y − X β ).
Recognize the distribution as a well-known parametric family.
d. Propose a Gibbs sampling posterior MCMC scheme based on the condi-
tionals from (b) to generate a Monte Carlo sample,
(β (m) , τ2(m) ) ∼ h(β , τ2 | y), m = 1, . . . , M.
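A sketch of the Gibbs sampler of part (d) (ours, not from the text; it uses the standard conjugate conditionals for (6.23)–(6.24), with simulated data and invented hyperparameters):

```python
import numpy as np

def gibbs_linreg(X, y, beta0, Sigma0, nu0=1.0, s20=1.0, T=4000, seed=0):
    # Alternate beta | tau2, y ~ N(m, V) with V = (tau2 X'X + Sigma0^{-1})^{-1},
    # m = V (tau2 X'y + Sigma0^{-1} beta0), and
    # tau2 | beta, y ~ Ga((nu0 + n)/2, (nu0 s20 + RSS(beta))/2).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S0inv = np.linalg.inv(Sigma0)
    XtX, Xty = X.T @ X, X.T @ y
    tau2 = 1.0
    B, Tau = np.empty((T, p)), np.empty(T)
    for m in range(T):
        V = np.linalg.inv(tau2 * XtX + S0inv)
        mean = V @ (tau2 * Xty + S0inv @ beta0)
        beta = rng.multivariate_normal(mean, V)
        rss = np.sum((y - X @ beta) ** 2)
        tau2 = rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s20 + rss))  # scale = 1/rate
        B[m], Tau[m] = beta, tau2
    return B, Tau

# Simulated data with true coefficients (1, 2) and unit error variance:
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
B, Tau = gibbs_linreg(X, y, beta0=np.zeros(2), Sigma0=10.0 * np.eye(2))
```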
6.12 Hamiltonian Monte Carlo. Refer to Example 6.3 in Section 6.5.1.
a. Posterior. Plot the posterior distribution h(α, β | y). Use a contour plot
(or heatplot, or any other format that allows you to add the points for the
simulations from part (b)).
b. Hamiltonian Monte Carlo. Let θ = (α, β). Propose a Hamiltonian Monte
Carlo algorithm to simulate a Monte Carlo sample:
θm ∼ h(θ | y),
m = 1, . . . , M (the θm need not be independent, as in any MCMC). Use
M = 100.
1. Describe the algorithm by stating all transition probabilities.
2. Implement the algorithm.
3. Plot θm by adding them to the plot from part (a).
c. Metropolis–Hastings. Implement a Metropolis–Hastings MCMC to gen-
erate θm ∼ h(θ | y), m = 1, . . . , M.
1. Describe the algorithm by stating all transition probabilities.
2. Plot θm by adding them to the plot from part (a).
with

A(θi, θ̃) = [ h(θ̃) / h(θi) ] · [ I{d(z̃, y) < ε} / 1 ] · [ q(θi | θ̃) / q(θ̃ | θi) ],
where d(z, y) is some distance measure for two (hypothetical or actually ob-
served) data sets z and y, for example, d(z, y) = ||z − y||2 .
Find the stationary distribution π(θ) for this Markov chain.
Hint: Introduce a variable zi , initializing with z0 ∼ f (z0 | θ0 ), and then modify (6.25) to a proposal distribution for an augmented state vector (θi , zi );
find the stationary distribution π(θ, z) for the Markov chain with this augmented state vector. Finally, argue that the desired π(θ) is the marginal under
π(θ, z).22
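A toy Python sketch of this chain (ABC-MCMC) for a hypothetical normal-mean model with a symmetric random-walk proposal; under symmetric q, accepting only when d(z̃, y) < ǫ and the prior-ratio Metropolis test passes mirrors the ratio A(θi , θ̃) above. All model and tuning choices here are illustrative assumptions:

```python
import random, math

random.seed(3)

# observed data (hypothetical): n draws from N(theta_true, 1)
n, theta_true = 50, 1.0
y = [random.gauss(theta_true, 1) for _ in range(n)]
ybar = sum(y) / n

def log_prior(t):          # h(theta): standard normal prior
    return -0.5 * t * t

def simulate(t):           # z ~ f(z | theta): fresh pseudo-data
    return [random.gauss(t, 1) for _ in range(n)]

def abc_mcmc(M=3000, eps=0.1, step=0.5):
    theta = ybar           # initialize near the data to shorten burn-in
    chain = []
    for _ in range(M):
        prop = theta + random.gauss(0, step)       # symmetric q
        z = simulate(prop)
        d = abs(sum(z) / n - ybar)                 # d(z, y): distance of means
        # A = h(prop) I{d < eps} / h(theta) for symmetric q
        if d < eps and math.log(random.random()) < log_prior(prop) - log_prior(theta):
            theta = prop
        chain.append(theta)
    return chain

chain = abc_mcmc()
est = sum(chain[1000:]) / len(chain[1000:])
print(round(est, 2))  # ABC estimate of E(theta | y)
```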
6.14 Thall et al (2003) consider inference for a phase I oncology trial for the
combination of two cytotoxic agents. Here, we focus on estimating the dose–
response surface π(x; θ) = P(y = 1 | x = (x1 , x2 ), θ) for the probability of
toxicity (y = 1) as a function of the doses of the two agents (x1 and x2 ). We
assume standardized doses, 0 ≤ x j ≤ 1, and let yi ∈ {0, 1} denote an indicator
of the i-th patient recording toxicity after being treated with combination
therapy with doses xi1 and xi2 of the two agents. The response surface is
indexed by unknown parameters θ = (a1 , b1 , a2 , b2 , a3 , b3 ),
π(x, θ) = [a1 x1^b1 + a2 x2^b2 + a3 (x1^b1 x2^b2 )^b3 ] / [1 + a1 x1^b1 + a2 x2^b2 + a3 (x1^b1 x2^b2 )^b3 ]. (6.26)
The model is chosen to allow easy incorporation of information about single-
agent toxicities. For x2 = 0 the model reduces to the single-agent dose–
toxicity curve π((x1 , 0), θ) ≡ π1 (x1 , θ) and similarly for π2 . The parameters
(a3 , b3 ) characterize the two-agent interactions.
22 This is one of the variations of ABC. See Marin et al (2012) for a review.
under this chain. Show that P(n) → pθ,y in total variation norm. That is,
lim_{n→∞} ||P(n) (x0 , ·) − pθ,y || = 0
for all x0 . You may assume that q(θm | θm−1 , ym−1 ) > 0 for all θm ∈ Θ in
the parameter space Θ (and all θm−1 , ym−1 ); and similarly f (y | θ) > 0 for
all y ∈ X in the sample space X and θ ∈ Θ.
Use at most three sentences.
c. Can you suggest an alternative way of generating an MC sample xn ∼
pθ,y (θ, y)? 24
Hint: There is a very easy answer.
24 Geweke (2004) uses ergodic Monte Carlo averages under (a) and under (b) to construct an interesting diagnostic.
7

Model Selection and Trans-dimensional MCMC1
Earlier (Sections 4.2.2 and 4.2.3) we discussed how the evaluation of the
marginal likelihood function (or prior predictive distribution) for model se-
lection can be implemented by means of importance sampling. Since the
1990s, various methods have been proposed in the literature to implement
such schemes.2 This works for problems with moderate-dimensional parameter vectors, but breaks down in higher-dimensional problems. More
complex problems usually require the use of Markov chain Monte Carlo
(MCMC) methods to generate posterior Monte Carlo samples (or, to be
more precise, approximate posterior samples), as discussed in previous
chapters. Using such posterior Monte Carlo samples, one can then approx-
imately evaluate the marginal likelihood function. It is convenient to dis-
tinguish two cases. The first case is when the simulated values are gener-
ated from the parameter space for each model separately. The other case
is when the simulated values are generated from the model space, that is,
model indicators (separately or jointly with the parameter space). A popu-
lar algorithm to implement the latter is reversible jump MCMC (RJ).
In the following discussion, M j (or for short, j, when the context clarifies
the use) will denote competing models, and θ j will denote parameters that
index the sampling distribution under model M j , and we will use notation
f (y | θ j , M j ) or for short f j (y | θ j ) for the sampling model under model M j ,
h(θ j | M j ), or h j (θ j ) for the prior under model M j , p(x | M j ), or p j (x) for
the marginal under model M j , and h(M j ) or just h( j) for the prior model
probability.
1 This material will be published by Cambridge University Press as Computational
Bayesian Statistics, by M.A. Amaral Turkman, C.D. Paulino, and P. Müller
(https://tinyurl.com/CompBayes). This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative
works. © Maria Antónia Amaral Turkman, Carlos Daniel Paulino & Peter Müller, 2019.
2 See Chen et al (2000) and Andrieu et al (2004) for useful reviews of these methods.
where b j and a j are fixed values of small and large magnitude, respectively.
Assuming conditional independence, the implied prior distribution for β =
(β j , j = 1, . . . , p) in the first level of the hierarchical model is a multivariate
normal, β | υ ∼ N_p (0, Bυ ), with Bυ = diag[(1 − υ j )b_j^2 + υ j a_j^2 b_j^2 ] being a
diagonal matrix.3
At the second level of the hierarchy, the SSVS model assumes an inverse
gamma prior distribution for σ2 with fixed hyperparameters. More impor-
tantly, for the indicators υ j the model assumes υ j ∼ Ber(w j ), implying the
mixture prior
h(β j | w j ) = (1 − w j )h1 (β j ) + w j h2 (β j ),
where w j = P(υ j = 1) is the prior probability for a non-zero estimate
for β j , i.e., for including x j in the model. This way, each of the 2 p mod-
els is identified by a vector of latent indicators υ = (υ1 , . . . , υ p ). The prior
model probabilities are written as the product of p Bernoulli probabilities,
Ber(w j ), with {w j } representing the prior beliefs about including each co-
variate.
Once the hierarchical model is completely specified, SSVS proceeds by
using a Gibbs sampler 4 to generate a sequence β(t) , σ(t) , υ(t) , t = 1, . . . , T,
that approximates the joint posterior distribution, with mainly the subse-
quence υ(t) , t = 1, . . . , T , being of interest. Inspection of this sequence
allows in particular the identification of a subset of covariates with the
highest inclusion probabilities, which indicate the most promising models
in the light of the data and the assumed prior. Note that any specific value
of υ might not occur with high frequency in the sample, simply due to the
limited size of the sample (T can be much less than 2 p when p is large). In
summary, the nature of SSVS is mainly an exploratory tool to restrict the
class of models for further consideration. Only when a small number of
covariates makes it practicable (say, p < 7), can a similar process be used
to assess the posterior distribution for υ, rather than just restricting the set
of models under consideration.
Several approaches that are in some way related have appeared in the
literature. For example, Kuo and Mallick (1998) use a linear predictor of
the form η = Σ_{j=1}^p υ j β j x j , making the variable selection explicit, and a
prior distribution for β that is independent of υ. Dellaportas et al (2002)
keep a dependent prior for β and υ, but use the latter type of linear predictor.
3 George and McCulloch (1993, 1997) consider more general multivariate normal prior
distributions, including correlations in Bυ and dependence on a sampling variance σ2 .
4 See George and McCulloch (1993, 1997) for recommendations about the choice of the
tuning parameters, the complete conditional posterior distributions, and extension to
other families such as generalized linear models.
The earlier-described hierarchical prior in the SSVS is not conjugate for
normal linear models (given any subset of included covariates). A conjugate
prior is achieved by scaling Bυ with σ2 (Hoff, 2009: ch. 9). With this
change in the prior p(β | υ, σ2 ), it becomes possible to evaluate the ker-
nel of the marginal posterior distribution for υ, h∗ (υ | y) = c−1 h(υ | y)
(George and McCulloch, 1997), from which one can then directly gener-
ate a Markov chain in υ, without involving (β, σ2 ). For example, one could
implement a Gibbs sampler using the complete conditional distributions
h(υ j | υ− j , y).
Relying on a predefined subset of models, for example the subset MG of
models that occur in a preliminary sample υ(k) , k = 1, . . . , K, one can then
obtain an estimate of the posterior probability for each model via h∗ using
ĥ(υ | y) = ĉ h∗ (υ | y) ≡ [ĥ(MG | y) / h∗ (MG | y)] h∗ (υ | y), (7.3)
where
h∗ (MG | y) = Σ_{υ∈MG} h∗ (υ | y) and ĥ(MG | y) = (1/K) Σ_{k=1}^K I_{MG}(υ(k) ).
MC3 Method
Generating a Markov chain with a stationary distribution given by the ker-
nel h∗ (υ | y) can be done by an M-H algorithm with some proposal dis-
tribution5 q(υ̃ | υ(t) ) for any υ(t) . Using the special case of a Metropolis
algorithm with symmetric proposal q, the acceptance ratio for a proposal υ̃
generated from q simplifies to R(υ(t) , υ̃) = h∗ (υ̃ | y)/h∗ (υ(t) | y). A simple
example of this class of Metropolis algorithms uses a proposal distribution
that is only non-zero for υ̃ that differs from υ(t) in only one component,
q(υ̃ | υ(t) ) = 1/p, where p is the dimension of the least parsimonious model
corresponding to υ = 1p with all 1s. This algorithm was proposed for models with discrete data by Madigan and York (1995) under the name MC3
(Markov chain Monte Carlo model composition). Raftery et al (1997) used
5 Under suitable conditions, this algorithm can be implemented very efficiently computationally, similar to what happens in a Gibbs sampler (George and McCulloch, 1997).
implying that the prior distributions h(θk | M j ) for the parameters corresponding to models k ≠ j are not required to specify (M j , θ j ). Further, one
can easily show that the marginal likelihood for each model can be written
as
f (y | M j ) = Eθ [ f (y | M j , θ) | M j ] = Eθ j [ f (y | M j , θ j ) | M j ],
from which one can see that for the purpose of model comparison via Bayes
factors the choice of the pseudo-priors is irrelevant.
The pseudo-prior method makes use of the complete conditional distri-
butions for the J + 1 blocks
∀M ∈ M: h(M | θ, y) = h∗ (M, θ | y) / Σ_{k∈J} h∗ (Mk , θ | y), (7.6)
∀ j ∈ J: h(θ j | θ− j , M, y) = h(θ j | M j , y) if M = M j , and h(θ j | Mk ) if M = Mk , k ≠ j. (7.7)
The simulation based on these distributions is more efficient if, for each
model M j , one has conjugacy of the likelihood and the respective prior
h(θ j | M j ), and if one chooses pseudo-priors h(θk | M j ), k ≠ j, that are
similar to h(θk | Mk , y). Note that under conjugacy for model M j one can
exactly evaluate the posterior distribution for the respective parameter and
the marginal likelihood f (y | M j ) and, on the basis of the latter, evaluate the
posterior probability of M j , which allows a separate and easy evaluation of
the posterior quantities.
Once a Monte Carlo sample {(M (t) , θ(t) )} from h(M, θ | y) is available,
it can be used to evaluate posterior probabilities for each model h(M j | y)
(as relative sample frequencies of M = M j ) and, therefore, Bayes factors
between any pair of models. Inference related to h(θ j | M j , y) is obtained
from the subsample of θ(t) consisting of the components θ(t) j associated with
model M j .
For details about the implementation, see Carlin and Louis (2009: sec-
tion 4.5.1). Dellaportas et al (2002) propose a modification of the pseudo-
prior method for the special case of variable selection.
which is composed of two factors related to the jump between models (qm )
followed by a generation of a parameter for the proposed model (q p ), de-
fined as
q( M̃, θ̃ M̃ | M j , θ j ) = qm ( M̃ | M j , θ j ) q p (θ̃ M̃ | M j , θ j , M̃).
Assume M̃ = M j′ . The proposal distribution qm is in many cases selected
to be independent of the parameters under the current model, θ j . That is,
qm ( M̃ | M j , θ j ) = q j j′ is a function of only j, j′ . The distinguishing feature
of the RJ method is in the proposal distribution q p , which has to accommodate the transition from model M j to M̃, including a possible change in
dimension of the parameter vector. Details are introduced in the following.
In short, the reversible jump builds on an M-H algorithm by adding two
important elements. First, we introduce dimension padding by adding aux-
iliary variables. Second, each transition probability can include a determin-
istic transformation. The latter is important to create a well-mixing Markov
chain. For example, consider an RJ for a normal linear regression that al-
lows the use of a line with (M2 ) or without (M1 ) intercept, that is E(yi | xi )
is α + βxi under M2 or βxi under M1 . It is important to adjust the currently
imputed value of the slope β when we propose to add a new intercept α ≠ 0
to the line. That is essentially all there is to say about the RJ. The rest
is details.
We first introduce the RJ for the simple problem of regression with and
without an intercept. Of course, one would not really use RJ in this prob-
lem. But it provides an easily understood context with all the same details
that are needed in any RJ. Let f (y | α, β, M2 ) and f (y | β, M1 ) denote the
likelihood function under the model with and without intercept, respec-
tively, and let h(α, β | M2 ), h(β | M1 ), and h(M) denote the prior
distributions. In terms of the earlier general notation, the model-specific
parameter vectors are θ1 = (β) and θ2 = (α, β).
Move up: Assume the current state is ω = (M1 , θ1 ). Propose a new model
M̃ with P( M̃ = 2) = q12 and P( M̃ = 1) = q11 = 1−q12 . If M̃ = M1 , continue
with the usual MCMC step, without change of dimension. Otherwise M̃ =
M2 and:
1. Generate (β̃, α̃) by:
(a) generating an auxiliary variable u ∼ q(u);
(b) using a deterministic mapping (β̃, α̃) = T (β, u).
2. Accept the proposal (β̃, α̃) with probability Aup = min{ρup , 1} and ratio
ρup (β, u, β̃, α̃) = [h( M̃) h(β̃, α̃ | M̃) f (y | α̃, β̃) q21 ] / [h(M1 ) h(β | M1 ) f (y | β) q12 q(u)] × |∂T (β, u)/∂(β, u)|,
i.e., prior ratio × likelihood ratio × proposal ratio × Jacobian.
Note how we had to carefully record all moves in steps 1 through 3 and
account for the corresponding proposal probabilities in (7.8).
Nothing much changes if the “down” move also involves an auxiliary
v ∈ Rq . It only complicates notation.
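A compact Python sketch of this two-model RJ, under several simplifying assumptions not in the text: simulated data, known unit residual variance, N(0, 10) priors, equal prior model probabilities, q12 = q21, and the identity-style mapping (β̃, α̃) = (β, u) with Jacobian 1 (for roughly centered covariates the slope needs no adjustment):

```python
import random, math

random.seed(5)

# illustrative data: y_i = 1 + 2 x_i + eps_i  (the true model has an intercept)
n = 100
x = [random.uniform(-1, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

S0 = 10.0  # prior variance: alpha, beta ~ N(0, S0), independently

def log_lik(alpha, beta):
    return sum(-0.5 * (yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

def log_prior(*pars):
    return sum(-0.5 * p * p / S0 for p in pars)

def rj(M=2000, q12=0.5, su=0.5):
    model, alpha, beta = 1, 0.0, 0.0   # start in M1 (no intercept)
    counts = {1: 0, 2: 0}
    for _ in range(M):
        # within-model random-walk update (alpha only exists under M2)
        bp = beta + random.gauss(0, 0.2)
        ap = alpha + random.gauss(0, 0.2) if model == 2 else 0.0
        num = log_lik(ap, bp) + (log_prior(ap, bp) if model == 2 else log_prior(bp))
        den = log_lik(alpha, beta) + (log_prior(alpha, beta) if model == 2 else log_prior(beta))
        if math.log(random.random()) < num - den:
            alpha, beta = ap, bp
        if model == 1 and random.random() < q12:
            # up move: u ~ q(u) = N(0, su^2), map (beta, u) -> (beta, alpha=u);
            # Jacobian 1; model priors and q12 = q21 cancel in rho_up
            u = random.gauss(0, su)
            lq = -0.5 * u * u / su ** 2 - math.log(su * math.sqrt(2 * math.pi))
            rho = (log_prior(u, beta) + log_lik(u, beta)
                   - log_prior(beta) - log_lik(0.0, beta) - lq)
            if math.log(random.random()) < rho:
                model, alpha = 2, u
        elif model == 2 and random.random() < q12:
            # down move: drop alpha (the exact reverse of the up move)
            lq = -0.5 * alpha * alpha / su ** 2 - math.log(su * math.sqrt(2 * math.pi))
            rho = (log_prior(beta) + log_lik(0.0, beta) + lq
                   - log_prior(alpha, beta) - log_lik(alpha, beta))
            if math.log(random.random()) < rho:
                model, alpha = 1, 0.0
        counts[model] += 1
    return counts

counts = rj()
print(counts[2] / (counts[1] + counts[2]))  # time spent in the intercept model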
Problems
As in Chapter 6, most of the problems require some programming to im-
plement MCMC simulation. See Appendix B for some suggestions on how
to implement such simulation in R.
7.1 Normal linear regression – variable selection. Use the setup and notation of
Problem 6.11.
a. Let β̄ = E(β | τ2 , y) and V = Var(β | τ2 , y) denote the posterior mean and
covariance matrix for β conditional on τ2 . Let RSS(β ) = Σi (yi − x′i β )^2
denote the residual sum of squares for β . Prove the following result:
f (y | τ2 ) ∝ |V|^{1/2} h(β̄ ) τ^n exp(−(τ2 /2) RSS(β̄ )), (7.9)
as a function of τ2 . That is, the proportionality constant is a function of y.
Hint: You could use Bayes' theorem for h(β | y, τ2 ) and substitute
β = β̄ .
b. Now consider variable selection. Let γ = (γ1 , . . . , γ p ) denote a vector of
indicators γ j ∈ {0, 1}, let pγ = Σ j γ j , and let Xγ denote the (n × pγ )
submatrix of X with the columns selected by γ j = 1. Similarly, let βγ
denote the subvector of β selected by γ , and similarly for β0γ and Σ0,γ .
We change model (6.23) to include variable selection by using
y = Xγ βγ + ǫ, (7.10)
with ǫ = (ǫ1 , . . . , ǫn ), and a modified prior h(βγ | γ) = N(β0,γ , Σ0,γ ) and
hyperprior h(γ j = 1) = π, j = 1, . . . , p, independently across j.
Use (7.9) to find h(γ | τ2 , y).
c. Propose a posterior MCMC scheme to generate a posterior Monte Carlo
sample (γ (m) , τ2(m) ) ∼ h(γ , τ2 | y).
How would you augment the Monte Carlo sample with β(m) ∼ h(β |
γ (m) , τ2(m) , y), if desired?
7.2 Variable selection. Ročková and George (2014) propose the following im-
plementation of SSVS for a Gaussian linear model:
f (y | β, σ) = N(Xβ, σ2 I),
where y = (y1 , . . . , yn ) is a response vector, and X is an (n × p) matrix of
covariates.
We use the following SSVS prior for the regression coefficients β. Let
cov(β) = Dσ,γ = σ2 diag(a1 , . . . , a p ) with ai = (1 − γi )ν0 + γi ν1
denote the prior covariance matrix for β. Here, γ j ∈ {0, 1}, ν0 is small, such
that β j ≈ √ν0 practically drops the jth covariate; and ν1 is large.
We now complete the model with a prior on J. Let Poi+ (λ) denote a Poisson
distribution restricted to positive integers. We assume
J ∼ Poi+ (λ).
Use λ = 5.
a. Joint posterior. Let µ = (µ1 , . . . , µ J ), w = (w1 , . . . , w J ), and x = (x1 , . . . , xn ),
and let θ = (µ, w, σ2 , J) denote the complete parameter vector. Find the
joint posterior distribution h(θ | x).
b. Hierarchical model. Rewrite the mixture as a hierarchical model using
indicators si ∈ {1, . . . , J}.
f (xi | si = j) = N(µ j , σ2 ) and P(si = j | w, J) = w j . (7.13)
Let s−i = (s1 , . . . , si−1 , si+1 , . . . , sn ).
Find h(µ j | s, w, x), h(si | s−i , µ, w, J, xi ), and h(w | s, J, x).
c. Reversible jump MCMC. Propose a RJ MCMC for posterior simulation
across variable J.
1. Split move. Propose a transition probability for incrementing J, i.e.,
J˜ = J + 1.
Describe the construction of a proposal (step by step), and state the
acceptance probability for the proposal.
Hint: You could: (1) select a term j to split; assume without loss of
generality j = J (rearrange indexes if needed, to simplify notation);
(2) generate a bivariate auxiliary variable (u, v); (3) define T (θ, u, v) by
µ̃ J = µ J + u, µ̃ J+1 = µ J − u, w̃ J = w J v and w̃ J+1 = w J (1 − v).
2. Merge move. Propose a transition probability for decrementing J, i.e.,
J˜ = J − 1 (“merge move”). Describe the construction of a proposal
(step by step) and state the acceptance probability for the proposal.
d. Implementation. Implement an RJ using the complete conditional distri-
butions from part (b) and the RJ move from (c).
Use the data set galaxy in R. You can get it, for example, in the R pack-
age ElemStatLearn. The data is shown in Figure 6.3.
• Plot J against iteration (trajectory of imputed J).
• Estimate g(x) = E[gθ (x) | x] for x on a grid x.
• Evaluate convergence diagnostics for σ2 , g10 = gθ (10), g20 = gθ (20),
and G30 = ∫_{30}^{∞} gθ (s) ds.6
7.4 Pseudo-prior. The following example is about inference in a clinical trial us-
ing a historical control. The trial is a study of a proposed treatment for uterine
papillary serous carcinoma, a very rare disease. The study drug is adminis-
tered in three doses, including 0 (control), 1, and 2. To facilitate a study in a
6 Use the R package boa or coda (or any other) for convergence diagnostics. See Section 9.6.
realistic time frame we consider pooling with historical data on control. Let
yi denote the outcome (PFS, progression free survival, in months) for the ith
patient, numbering all patients, including patients from the historical study
as i = 1, . . . , n, with n = 56, including n0 = 40 patients in the historical data
and n1 = 16 in the new study. Let Ii ∈ {0, 1} denote an indicator for a patient
to be in the historical study and zi ∈ {0, 1, 2} for the dose of the study drug.
All patients in the historical data are on control, i.e., zi = 0, i = 1, . . . , n0 .
We use a Weibull regression with (or without in part (b)) a study effect. That
is, a sampling model
yi ∼ Weib(λ, a), with ln λ = β0 + β1 zi + β2 Ii .
a. Set up a prior β j ∼ N(m j , s2j ), j = 0, 1, 2, with m2 = 0, s0 = 0.1 and
s1 = s2 = 0.5. Determine suitable values for m0 and m1 to match expert
opinion that PFS at doses z = 0, 1, 2 should be around 7, 11, and 14
months (you might not be able to exactly match these means). The data
are in the file uterine.txt which is linked on the book’s homepage.7
In addition to zi , yi , and Ii , the file reports an indicator si for observed
event time (si = 1) versus censoring time (si = 0). Carry out posterior
inference.
b. Now we change the prior on β2 , the study effect. Let δ x denote a point
mass at x. We assume
β2 ∼ δ0 if M = 0, and β2 ∼ N(m2 , s2^2 ) if M = 1,
split move (and simplify the answer to the following question), we label the
current state vector ω̃ and the proposal ω:
w1 = w̃1 + w̃2 (7.15)
w1 µ1 = w̃1 µ̃1 + w̃2 µ̃2
w1 (κ1d + µ⋆1d^2 ) = w̃1 (κ̃1d + µ̃1d^2 ) + w̃2 (κ̃2d + µ̃2d^2 ),
use indicators for Fn and Fn+1 to replace the range of integration. (4) Use a
change of variables to make both sides into integrals over (θn , u). (5) Finally,
argue that equality of the integrands is a sufficient condition for equality of
the integrals, and verify that the RJ acceptance probability satisfies the latter
condition.
8

Methods Based on Analytic Approximations1
Since the 1980s, investigators have aimed to find efficient, and preferably
simple, approaches to overcome the technical problems of calculus that
arise in Bayesian inference. A variety of strategies have been proposed, in-
cluding in particular multivariate normal approximations of posterior dis-
tributions, the Laplace approach, numerical quadrature methods, classical
Monte Carlo methods, and Markov chain Monte Carlo (MCMC) methods.
Advances in data collection give rise to the need for modeling ever more
complex data structures. This includes spatio-temporal models, dynamic
linear models, generalized linear mixed models, generalized additive mod-
els, log-Gaussian Cox processes, geo-additive models, and more. All these
models are part of a much larger class of models known as latent Gaus-
sian models (LGMs). See, for example, Blangiardo and Cameletti (2015).
Theoretically, it is always possible to implement MCMC algorithms for
such LGMs. However, such implementations come with many problems in
terms of convergence and computation time.
Recently, Rue et al (2009) developed an analytical approach based on in-
tegrated and nested Laplace approximations (INLAs), which allows deter-
ministic approximation of marginal posterior distributions in these models.
The method provides a particularly efficient implementation of Bayesian
inference in LGMs. The INLA approach has two major advantages over
MCMC techniques. The first is computation time. Using INLA one can get
results within seconds or minutes in models that would take hours or even
days for inference using MCMC algorithms. The second advantage is that
INLA treats LGMs in a unified way, allowing substantial automation of
inference, independent of the specific model.
therefore
∫_{−∞}^{+∞} exp(−(nψ′′ (θ̂)/2)(θ − θ̂)^2 ) dθ = [2π(nψ′′ (θ̂))^{−1} ]^{1/2} .
Thus the integral I in (8.9) can be approximated by
I ≈ Î {1 + O(n^{−1} )}, (8.10)
where
Î = (2π)^{1/2} n^{−1/2} σ̂ f (θ̂) exp(−nψ(θ̂))
and σ̂ = [ψ′′ (θ̂)]^{−1/2} .
• In the k-dimensional case a similar argument leads to Î of the form
Î = (2π)^{k/2} n^{−k/2} det(Σ̂)^{1/2} f (θ̂) exp(−nψ(θ̂)),
where Σ̂^{−1} = ∇2 ψ(θ̂) is the Hessian matrix of ψ at θ̂.
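As a quick numeric check of Î, one can apply the one-dimensional formula to Γ(n + 1) = n! = n^{n+1} ∫₀^∞ e^{−nψ(θ)} dθ with ψ(θ) = θ − ln θ (so θ̂ = 1, ψ(θ̂) = 1, σ̂ = 1, f ≡ 1), which reproduces Stirling's formula; this worked example is an illustration added here, not taken from the text:

```python
import math

# Laplace approximation of Gamma(n+1) = n!:
# n! = ∫_0^∞ t^n e^{-t} dt = n^{n+1} ∫_0^∞ e^{-n ψ(θ)} dθ with t = nθ and
# ψ(θ) = θ - ln θ, minimized at θ̂ = 1, where ψ(θ̂) = 1, ψ''(θ̂) = 1, σ̂ = 1.
def laplace_factorial(n):
    # Î = √(2π) n^{-1/2} σ̂ f(θ̂) exp(-n ψ(θ̂)) with f ≡ 1, scaled by n^{n+1}
    log_I_hat = 0.5 * math.log(2 * math.pi) - 0.5 * math.log(n) - n
    return math.exp((n + 1) * math.log(n) + log_I_hat)

n = 20
approx = laplace_factorial(n)
exact = math.factorial(n)
print(round(approx / exact, 4))  # relative error is O(1/n), roughly 1/(12n)
```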
Of course, higher-order expansions of f and ψ would obtain better approximations, for example:
∫ f (θ) e^{−nψ(θ)} dθ = (2π)^{1/2} n^{−1/2} σ e^{−nψ̂} { f̂ + (1/(2n))[σ^2 f̂ ′′ − σ^4 f̂ ′ ψ̂′′′ + (5/12) f̂ (ψ̂′′′ )^2 σ^6 − (1/4) f̂ ψ̂^{(4)} σ^4 ]} + O(n^{−2} ), (8.11)
where f̂ , ψ̂, etc., are the respective functions evaluated at θ̂, which is, as
already mentioned, a value where the maximum of −ψ(θ) is achieved, and
σ^2 = [ψ′′ (θ̂)]^{−1} .
Assume, then, that one wants to evaluate the posterior expected value of
some function g(θ) of the parameters. By (8.8) we see that E[g(θ) | x] can
be obtained as a ratio of two integrals, that is,
E[g(θ) | x] = ∫ g(θ) f (x | θ)h(θ) dθ / ∫ f (x | θ)h(θ) dθ. (8.12)
The basic idea is to apply separate Laplace approximations to the numer-
ator and denominator integrals and consider the ratio of these approxima-
tions. Tierney and Kadane (1986) obtain the following two corresponding
approximations for E[g(θ) | x].
• Use the Laplace approximation for exp(−nψ(θ)) = f (x | θ)h(θ), in
numerator and denominator, with f (θ) = g(θ) in the numerator and
f (θ) = 1 in the denominator of (8.12) to get
E[g(θ) | x] = g(θ̂)[1 + O(n−1 )]. (8.13)
Assuming that ψ(·), ψ⋆ (·) are sufficiently regular functions, the Laplace
approximations for the numerator and denominator integrals in (8.14)
(using in both cases f (θ) = 1) are, respectively,
(2π)^{1/2} σ⋆ n^{−1/2} exp{−nψ⋆ (θ⋆ )} and (2π)^{1/2} σ̂ n^{−1/2} exp{−nψ(θ̂)}.
From there we get the following approximation for E[g(θ) | x],
E[g(θ) | x] ≈ (σ⋆ /σ̂) exp{−n[ψ⋆ (θ⋆ ) − ψ(θ̂)]}. (8.17)
The approximation errors for the two integrals are of order n−1 . However,
the relevant terms of the two errors are identical and therefore cancel
out in the ratio. Therefore the final approximation has a relative error of
order n−2 .
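A small Python check of (8.17) for a case with a known answer: g(θ) = θ under a Beta(a, b) posterior, where E(θ | x) = a/(a + b). The maximizers and curvatures are found numerically (crude grid search plus finite differences), an illustrative shortcut rather than the book's procedure; note that the n's cancel in the ratio σ⋆/σ̂ and in the exponent:

```python
import math

# Tierney–Kadane ratio approximation (8.17) for E(θ | x) with a Beta(a, b)
# posterior; g(θ) = θ is positive on (0, 1).  Exact answer: a / (a + b).
a, b = 20.0, 10.0

def argmax_and_curv(logf):
    """Crude 1-d maximizer on (0,1) plus finite-difference -(log f)''."""
    grid = [i / 100000.0 for i in range(1, 100000)]
    t = max(grid, key=logf)
    h = 1e-4
    dd = (logf(t - h) - 2 * logf(t) + logf(t + h)) / (h * h)
    return t, -dd            # -dd plays the role of n ψ'' at the maximizer

logL = lambda t: (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)  # = -n ψ
logLs = lambda t: logL(t) + math.log(t)                             # = -n ψ*

t_hat, c_hat = argmax_and_curv(logL)
t_str, c_str = argmax_and_curv(logLs)
# (8.17): E ≈ (σ*/σ̂) exp{-n[ψ*(θ*) - ψ(θ̂)]}
approx = math.sqrt(c_hat / c_str) * math.exp(logLs(t_str) - logL(t_hat))
print(round(approx, 3), round(a / (a + b), 3))
```

The approximation agrees with the exact mean to roughly O(n⁻²), as the error analysis above predicts.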
To arrive at the previous approximation, we imposed a quite restrictive
condition, namely that g be positive almost everywhere. One can proceed
in various ways for approximations with a real-valued function g in more
general problems. Tierney et al (1989) suggest starting with an approximation of the moment-generating function of g(θ) (E[exp{sg(θ)}]), using the
Laplace approximation for positive functions, and then from there obtaining an approximation of the expected value of g(θ) as the derivative of the
Example 8.2 Getting back to the first example, if one wishes to estimate
g(θ) = ρ = ln(θ/(1 − θ)), one could use a Laplace approximation. Since g(θ) can
take negative values, we either use the first approximation that corresponds
to estimating the mean by the mode, obtaining
E(ρ | x) ≈ ln((an − 1)/(bn − 1)),
or we use one of the alternative approaches proposed by Tierney et al
(1989).
Using the approach in (8.18) with ψN = ψD = ψ, that is, −nψ = ln h(θ) +
ln f (x | θ); fN (θ) = g(θ); fD (θ) = 1, we get
E(ρ | x) = ln((an − 1)/(bn − 1)) + (1/(2n)) (an − bn )/((an − 1)(bn − 1))
− (1/n^2 ) [(an − 1)(bn − 1)/(an + bn − 2)] [(an − 1)^2 − (bn − 1)^2 ]. (8.19)
Compare this result, for specific values of an and bn , with the exact value.
Tanner (1996) suggests obtaining an approximation of the mean of ρ
based on the transformation
λ = (1/2) ln(bn θ/(an (1 − θ))).
It is easily seen that the distribution of λ is the distribution of Fisher's z
with probability density function
h(λ) ∝ e^{2 an λ} / (2bn + 2an e^{2λ} )^{(an + bn )} ,
4 For example, if one were to use ψN = ψD ; fN (θ) = g(θ); fD (θ) = 1, then (8.18) would reduce to (8.13) only. See Tierney et al (1989) or Robert (1994) for more details.
These ideas are easily extended to the multiparameter case, with im-
mediate application to the evaluation of marginal posterior distributions,
posterior moments, and predictive densities.
Assume θ ∈ Θ = Rk and that we wish to find the marginal posterior
distribution for θ1 . Partition the parameter vector as θ = (θ1 , θ−(1) ) where
θ−(1) = (θ2 , ..., θk ) ∈ Θ−(1) . The marginal posterior distribution for θ1 can be
written as the ratio of two integrals:
h1 (θ1 | x) = ∫_{Θ−(1)} h(θ1 , θ−(1) | x) dθ−(1) (8.20)
= ∫_{Θ−(1)} h(θ1 , θ−(1) ) f (x | θ1 , θ−(1) ) dθ−(1) / ∫_{Θ} h(θ) f (x | θ) dθ. (8.21)
Applying Laplace approximation to the numerator and denominator inte-
grals in (8.20), we obtain the approximation
h1 (θ1 | x) ≈ [det(Σ̂∗ (θ1 ))/(2πn det(Σ̂))]^{1/2} · [h(θ1 , θ̂−(1) ) f (x | θ1 , θ̂−(1) )] / [h(θ̂) f (x | θ̂)], (8.22)
where θ̂ maximizes h(θ) f (x | θ), and Σ̂ is the negative inverse of the cor-
responding Hessian matrix evaluated in θ̂, θ̂−(1) maximizes h(θ1 , θ−(1) ) f (x |
θ1 , θ−(1) ) for fixed θ1 , and Σ̂∗ (θ1 ) is the negative inverse of the corresponding
Hessian matrix evaluated in θ̂−(1) .
In Section 8.3 we will see how this result becomes useful in the INLA
approach proposed by Rue and collaborators.
Using similar arguments, one can further show that (8.17) remains valid
when θ ∈ Rk with
σ̂ = |∇2 ψ(θ̂)|−1/2 and σ⋆ = |∇2 ψ⋆ (θ⋆ )|−1/2 ,
where
[∇2 ψ(θ)]_{ij} = ∂^2 ψ(θ)/(∂θi ∂θ j ) and [∇2 ψ⋆ (θ)]_{ij} = ∂^2 ψ⋆ (θ)/(∂θi ∂θ j ).
For more discussion of this method, see the references cited in the book
by Paulino et al (2018). Despite being a powerful technique, the approach
has some limitations. In the multivariate case the application can become
difficult and impractical, in particular when the integrands are multimodal
and/or the derivatives are difficult to obtain. In many problems, even with
moderate-dimensional parameter vectors, it is convenient to consider reparametrizations for better approximations.
at three different times after treatment. The aim of the study was to evaluate
whether the disease developed differently between the two groups, whether
this depends on age, duration of the disease before treatment, or on an-
other time-dependent covariate that was recorded at the same three times
as the clinical evaluations of the disease.
A Bayesian analysis of this regression problem with repeat measure-
ments is easily implemented as an LGM. Let X jk denote the result of the
clinical evaluation of patient j (with j = 1, ..., n) at time k (with k = 1, 2, 3).
Let z j = (z1, j , z2, j , z3, j ) denote the covariate vector of treatment, age, and
duration of the disease, respectively, for patient j, and let z jk denote the
time-dependent covariate.
We assume the following model for X jk :
• X jk ∼ N(µ jk , σ2 ), independent conditional on the parameters
• µ jk = β0 + β1 z1, j + β2 z2, j + β3 z3, j + β4 z jk + a j + b jk
• a j ∼ N(0, σa^2 ) and b jk ∼ N(0, σb^2 ), iid, where a = (a j , j = 1, ..., n) and b =
(b jk , j = 1, ..., n, k = 1, 2, 3) are random effects at the level of patients
and at the level of repeat measurements within a patient (see Section
9.2.1 on including b jk ). The random effects are introduced to induce de-
pendence that arises from the longitudinal nature of the data;
• β = (β1 , β2 , β3 , β4 ), with β0 and βi , i = 1, . . . , 4, iid ∼ N(0, σβ^2 );
• Let τ = 1/σ2 , τa = 1/σ2a , τb = 1/σ2b , and τβ = 1/σ2β denote the preci-
sions. We assume τ ∼ Ga(c, d) and τ x ∼ Ga(c x , d x ) for x = a, b, β.
The model can be written as a three-level hierarchical model. Let N(x |
m, s2 ) indicate an N(m, s2 ) density for the random variable X.
1. X | z, θ, ψ ∼ f (x | z, θ, ψ) = Π j,k N(x jk | µ jk , 1/ψ2 );
2. θ = (β0 , β, a, b) with θ ∼ N(0, Σ(ψ1 )), i.e. θ | ψ ∼ GMRF(ψ1 );
3. ψ = (ψ1 , ψ2 ) with ψ1 = (τβ , τa , τb ), assuming that the parameters are a
priori independent and ψ2 = τ.
Note that the assumed prior independence of the fixed effects parameters
is not needed. One of the attractive properties of the GMRF (see Rue and
Held, 2005) is that the precision matrix Q = Σ−1 is sparse. In fact, one can
show
θl ⊥ θm | θ−(lm) ⇔ Qlm = 0,
where Qlm is the lm element of the precision matrix Q. This has computa-
tional advantages when using numerical methods that are specifically de-
signed for sparse matrices.
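A small sketch of this sparsity, assuming a stationary AR(1) chain as the GMRF (an example added here, not taken from the text): its covariance matrix is dense, yet the precision matrix Q is tridiagonal, so Qlm = 0 — and hence θl ⊥ θm given the rest — whenever |l − m| > 1:

```python
# Conditional independence ⇔ zero precision entries, for the GMRF given by a
# stationary AR(1) chain theta_t = rho*theta_{t-1} + eps_t, eps_t ~ N(0, 1):
# Sigma_ij = rho^{|i-j|}/(1 - rho^2) is dense, but Q = Sigma^{-1} is tridiagonal.
n, rho = 6, 0.7

Sigma = [[rho ** abs(i - j) / (1 - rho ** 2) for j in range(n)] for i in range(n)]

# the known tridiagonal precision matrix of the AR(1) model
Q = [[0.0] * n for _ in range(n)]
for i in range(n):
    Q[i][i] = 1.0 + rho ** 2 if 0 < i < n - 1 else 1.0
    if i > 0:
        Q[i][i - 1] = Q[i - 1][i] = -rho

# verify Sigma * Q = I, i.e. Q really is the inverse of the dense covariance
prod = [[sum(Sigma[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)]
ok = all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-9
         for i in range(n) for j in range(n))
print(ok)  # True: dense covariance, sparse (tridiagonal) precision
```

Sparse-matrix routines exploit exactly this structure, which is the computational advantage mentioned above.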
An alternative way to state the LGM, and which is central to the INLA
approach, is based on recognizing the model as a special case of a re-
gression model with additive structure (Fahrmeir and Tutz, 2001). In these
models, the dependent variable (Xi ) is assumed to have an exponential fam-
ily distribution with mean µi that is linked to a predictor ηi with additive
structure (see the following) via a link function g(µi ) = ηi , and the sampling
model can still additionally be controlled by hyperparameters ψ2 . The gen-
eral form of the predictor is
ηi = β0 + Σ_{j=1}^{nβ} β j z ji + Σ_{k=1}^{nf} wki f^{(k)} (uki ) + ǫi , (8.25)
where ψ−(k) denotes the vector ψ without the component ψk . The evaluation
of these distributions requires first estimates of h(ψ | x) and h(θi | ψ, x) to
then obtain
h̃(θi | x) = ∫ h̃(ψ | x) h̃(θi | ψ, x) dψ, (8.26)
h̃(ψk | x) = ∫ h̃(ψ | x) dψ−(k) , (8.27)
h̃(ψ | x) ∝ [h(ψ) h(θ | ψ) f (x | θ, ψ) / h̃(θ | ψ, x)] |_{θ=θ̂(ψ)} ,
hyperprior
τy ∼ Ga(γy , δy ), τα ∼ Ga(γα , δα ), τβ ∼ Ga(γβ , δβ ), and p(µα ) = p(µβ ) = c,
with fixed hyperparameters (γy , δy , γα , δα , γβ , δβ ), and assuming independence across parameters in the hyperprior.
Let ω = (µ_α, µ_β, τ_y, τ_α, τ_β, α_i, β_i, i = 1, …, n) denote the complete parameter vector, and let θ_i = (α_i, β_i) denote the animal-specific random effects. In a slight abuse of notation, let N(x | m, P) indicate a multivariate normal distribution for the random variable x with mean m and precision matrix P, and similarly for Ga(x | c, d) (note the use of the precision matrix instead of the covariance matrix as the second parameter in the normal distribution). We use a variational family D with variational distributions of the form

q(ω) = ∏_i N(θ_i | m_i, P_i) ∏_{x=y,α,β} Ga(τ_x | c_x, d_x) ∏_{x=α,β} N(µ_x | m_x, P_x).
The rationale for this choice will become evident in a moment, when we work out the updates (8.32). As before, let E_{−x} denote an expectation under q with respect to all parameters except x (where the symbol x here is a placeholder for any of the parameters). Let h(ω | y) denote the joint posterior distribution, and let h(x | ω_{−x}, y) denote the complete conditional posterior for parameter x. For reference, we note the joint posterior. Let y_i = (y_{ij}, j = 1, …, J) denote the repeat measurements for animal i. Let X denote the (J × 2) design matrix for the linear regression for one animal, let H = XX′, and let θ̂_i denote the least squares fit for θ_i. Also, let T = diag(τ_α, τ_β) and µ = (µ_α, µ_β), and let N = nJ denote the total sample size. We have
Problems
8.1 Normal approximation. Under a particular genetic model, animals of a given species should be one of four specific phenotypes, with probabilities p_1 = (2 + θ)/4, p_2 = (1 − θ)/4, p_3 = (1 − θ)/4, and p_4 = θ/4, respectively. Let y_1, …, y_4 denote the number of animals of each phenotype, and N = y_1 + … + y_4. Assuming multinomial sampling and a Be(a, b) prior on θ, the posterior density h(θ | y) takes the form
h(θ | y) ∝ (2 + θ)^{y_1} (1 − θ)^{y_2+y_3+b−1} θ^{y_4+a−1},  0 ≤ θ ≤ 1,

that is,

L(θ | y) ≡ ln h(θ | y) = C + y_1 ln(2 + θ) + (y_2 + y_3 + b − 1) ln(1 − θ) + (y_4 + a − 1) ln(θ),

L′(θ) = y_1/(2 + θ) − (y_2 + y_3 + b − 1)/(1 − θ) + (y_4 + a − 1)/θ,

−L″(θ) = y_1/(2 + θ)² + (y_2 + y_3 + b − 1)/(1 − θ)² + (y_4 + a − 1)/θ²,
where C is the log normalization constant. Answer (a) through (d) below
using the following two data sets: N = 197, y = (125, 18, 20, 34) and N = 20,
y = (14, 0, 1, 5).
a. Find a (truncated) normal approximation, N(m, V), 0 ≤ θ ≤ 1, by solving L′(θ) = 0 to determine m and using V^{−1} = −L″(m). Let p_1(θ) denote the p.d.f. Let m_1 and V_1 denote the posterior mean and variance under the truncated normal approximation (use simulation to find the moments of the truncated normal; note that m_1 ≠ m, due to the truncation).
b. As an alternative approximation, find a beta distribution, Be(a, b), by matching mean and variance (obtained as in part (a)). Let p_2(θ) denote the beta p.d.f. Let m_2 (= m, by construction) and V_2 = V denote the posterior moments under the beta approximation.
c. Now use p_1 and p_2 as importance sampling densities to evaluate the posterior mean and variance. Let (m_3, V_3) and (m_4, V_4) denote the approximate posterior moments, using importance sampling densities p_1 and p_2, respectively.
d. With p_2 as the importance sampling density, carry out importance sampling to find the normalization constant for h(θ | y). Then use numerical integration on a grid over 0 ≤ θ ≤ 1 (for example, as a Riemann sum, or using the trapezoidal rule) to evaluate the posterior mean and variance. Let (m_0, V_0) denote the estimates.
Make a table to compare (m_j, V_j), j = 0, 1, …, 4.
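As a hedged sketch of part (a) for the first data set, with a uniform Be(1, 1) prior, the mode m and curvature-based variance V can be found numerically (in Python for illustration; the book's computations are in R):

```python
# Sketch of Problem 8.1(a): posterior mode and curvature for the
# genetic-linkage data N = 197, y = (125, 18, 20, 34), with a = b = 1.
y1, y2, y3, y4 = 125, 18, 20, 34
a, b = 1, 1

def dL(t):  # L'(theta)
    return y1 / (2 + t) - (y2 + y3 + b - 1) / (1 - t) + (y4 + a - 1) / t

def neg_d2L(t):  # -L''(theta)
    return (y1 / (2 + t) ** 2 + (y2 + y3 + b - 1) / (1 - t) ** 2
            + (y4 + a - 1) / t ** 2)

# L' is strictly decreasing on (0, 1) since -L'' > 0, so bisection works:
# dL is positive near 0 and negative near 1
lo, hi = 1e-6, 1 - 1e-6
for _ in range(100):
    mid = (lo + hi) / 2
    if dL(mid) > 0:
        lo = mid
    else:
        hi = mid
m = (lo + hi) / 2      # posterior mode, roughly 0.63 for these data
V = 1 / neg_d2L(m)     # curvature-based variance
```

The truncated-normal moments (m_1, V_1) would then be obtained by simulating from N(m, V) restricted to (0, 1).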
8.2 LGM. Refer to Example 8.3. Identify the three levels of the LGM. That is, state the probability models f(x_i | θ_i, ψ_2), p(θ | ψ), and h(ψ).
8.3 LGM (Rue et al, 2009). Consider a stochastic volatility model for daily pound–dollar exchange rates, with sampling model

y_t | η_t ∼ N(0, exp(η_t)),

t = 1, …, n_d, log variances η_t = µ + f_t, prior

f_t | f_1, …, f_{t−1}, φ ∼ N(φ f_{t−1}, 1/τ),

and hyperpriors

τ ∼ Ga(1, 0.1), µ ∼ N(0, 1), and φ = 2 e^{φ′}/(1 + e^{φ′}) − 1, with φ′ ∼ N(3, 1).

Show how the model can be stated as an LGM. Identify the three levels of the LGM. That is, state the probability models f(x_i | θ_i, ψ_2), p(θ | ψ), and h(ψ).
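To get a feel for the sampling model before casting it as an LGM, one can forward-simulate it. The Python sketch below uses arbitrary illustrative values for (µ, φ, τ), which are of course unknown in the inference problem:

```python
# Forward simulation of the stochastic volatility model; (mu, phi, tau)
# are illustrative choices, not estimates.
import math
import random

random.seed(3)
nd, mu, phi, tau = 500, 0.0, 0.9, 10.0
f = [0.0]                                  # latent AR(1) log-variance terms
for t in range(1, nd):
    f.append(phi * f[-1] + random.gauss(0, 1.0 / math.sqrt(tau)))
eta = [mu + ft for ft in f]                # eta_t = mu + f_t
y = [random.gauss(0, math.exp(e / 2)) for e in eta]  # sd(y_t | eta_t) = exp(eta_t / 2)
```

The simulated series shows the characteristic volatility clustering: periods with large |y_t| coincide with stretches of high η_t.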
8.4 (Rue et al, 2009). Let y_i ∈ {0, 1} denote indicators for pre-malignant (y_i = 1) versus malignant (y_i = 0) incidences of cervical cancer, i = 1, …, n_d. Let d_i ∈ {1, …, 216} and a_i ∈ {1, …, 15} index the geographical region and age group of the ith case. Rue and Held (2005) use a logistic binary regression for p_i ≡ P(y_i = 1),

logit(p_i) = η_i = µ + f^a_{a_i} + f^s_{d_i} + f^u_{d_i},

where f^a = (f^a_1, …, f^a_A), A = 15, is a smooth effect of age groups, f^s = (f^s_1, …, f^s_S), S = 216, is a smooth spatial field of geographic region effects, and f^u = (f^u_1, …, f^u_S) are district-specific random effects. For f^a we assume a second-order random walk model with precision κ_a,

f^a | κ_a ∝ (κ_a)^{(15−2)/2} exp( −(κ_a/2) Σ_{j=3}^{15} (f^a_j − 2 f^a_{j−1} + f^a_{j−2})² ).
For f^s we assume a conditional autoregressive model, f^s ∼ CAR(0, κ_s, G), where G is an (S × S) binary adjacency matrix with G_{st} = 1 if district s neighbors district t. Subject to some constraints, the CAR model implies a multivariate normal prior f^s | κ_s ∼ N(0, Σ(κ_s, G)). This is all that is needed in this problem. The random effects f^u_d are independent N(0, 1/κ_u). The model is completed with independent Ga(1, 0.01) priors for κ_a, κ_s, and κ_u, and µ ∼ N(0, 0.01), with the second parameter of the normal distribution being a precision here (matching the parametrization used in INLA).
Show how the model can be stated as an LGM. Identify θ, ψ, and the three levels of the LGM. That is, state the probability models f(x_i | θ_i, ψ_2), p(θ | ψ), and h(ψ).
8.5 Variational Bayes. For Example 8.4 in Section 8.4.2, find the updating equations for q(τ_y), q(τ_α), q(τ_β), q(µ_α), and q(µ_β).
8.6 Variational Bayes: mixture model. Consider a mixture of normals model f(x | θ) = Σ_{k=1}^{K} π_k N(x | µ_k, σ²), with θ = (π, µ, σ²), where µ = (µ_1, …, µ_K) and π = (π_1, …, π_K). Let γ = 1/σ². We complete the model with conditionally conjugate priors h(µ_k) = N(0, τ), h(γ) = Ga(a, b), and π ∼ D_{K−1}(α, …, α), with fixed hyperparameters (τ, a, b, α) and prior independence.
a. Find the complete conditional posterior h(θ_j | θ_{−j}, x) for µ_k, π_k, and γ.
b. For the moment, fix σ² = 1 and π = (1/K, …, 1/K). Consider the mean-field variational family D with q(µ_k) = N(m_k, s_k), q(c_i = k) = φ_ik, and derive one iteration of CAVI; that is, find the updating equations for (m_k, s_k) and φ_ik.
Hint: See the discussion in Blei et al (2017).
c. Now include γ and π in the parameter vector. Extend D by defining q(γ) = Ga(c, d) and q(π) = D(e_1, …, e_K). Find the updating equations for c, d, and e.
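A hedged sketch of part (b) in Python (the derivation itself is the exercise; this mirrors the normal-mixture CAVI updates discussed in Blei et al, 2017, reading τ as the prior *variance* of µ_k and s_k as the variational variance of q(µ_k)):

```python
# One CAVI iteration for the normal mixture with sigma^2 = 1 and equal
# weights pi_k = 1/K; illustration only, with made-up data.
import math
import random

def cavi_step(x, m, s, tau=1.0):
    K, n = len(m), len(x)
    # update responsibilities: phi_ik ∝ exp(m_k x_i - (s_k + m_k^2)/2)
    phi = []
    for xi in x:
        logits = [mk * xi - (sk + mk * mk) / 2 for mk, sk in zip(m, s)]
        mx = max(logits)                    # stabilize the softmax
        wts = [math.exp(l - mx) for l in logits]
        tot = sum(wts)
        phi.append([wk / tot for wk in wts])
    # update (m_k, s_k): s_k = 1/(1/tau + n_k), m_k = s_k * sum_i phi_ik x_i
    m_new, s_new = [], []
    for k in range(K):
        nk = sum(phi[i][k] for i in range(n))
        sk = 1.0 / (1.0 / tau + nk)
        m_new.append(sk * sum(phi[i][k] * x[i] for i in range(n)))
        s_new.append(sk)
    return m_new, s_new, phi

random.seed(2)
x = ([random.gauss(-2, 1) for _ in range(50)]
     + [random.gauss(2, 1) for _ in range(50)])
m, s = [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    m, s, phi = cavi_step(x, m, s)
```

After a few iterations the variational means m settle near the two cluster centers, and each s_k shrinks roughly like 1/n_k.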
8.7 In R, load the package MASS, which provides the data set galaxies. Scale the data as x = c(scale(galaxies)). Assume a mixture of K = 5 normal models, f(x_i | θ) = Σ_{k=1}^{K} π_k N(µ_k, σ²), with θ = (µ, π, σ²). Complete the model with priors as in the previous problem, using τ = a = b = c = d = e = 1. Use variational inference to generate an approximate Monte Carlo posterior sample, Θ = {θ_m, m = 1, …, M}, with θ_m ∼ h(θ | x), approximately.
a. At each iteration, evaluate the ELBO. Plot it against the iteration.
b. Use Θ to approximate the posterior predictive distribution

f̂(x) = (1/M) Σ_m Σ_k π_k^m N(µ_k^m, (σ^m)²).
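The Monte Carlo average in part (b) can be sketched as follows (in Python for illustration; the two "posterior draws" below are made up, standing in for the sample Θ):

```python
# Posterior predictive density as a Monte Carlo average of mixture
# densities over posterior draws; illustration with fabricated draws.
import math

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def post_pred(v, draws):
    # draws: list of (pi, mu, sigma2) tuples, one tuple per posterior draw
    total = 0.0
    for pi, mu, s2 in draws:
        total += sum(pk * normal_pdf(v, mk, s2) for pk, mk in zip(pi, mu))
    return total / len(draws)

draws = [([0.5, 0.5], [-1.0, 1.0], 1.0),
         ([0.4, 0.6], [-1.2, 0.9], 1.2)]
fx = post_pred(0.0, draws)   # estimated predictive density at x = 0
```

Evaluating post_pred on a grid of x values and plotting gives the usual smoothed density estimate overlaying the data histogram.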
Software
Using computational methods, like those discussed in previous chapters,
Bayesian data analysis can be implemented for problems arising from a
wide variety of scientific areas. This development has been importantly
supported by the generosity of many investigators who have made soft-
ware implementations of such methods freely available to the scientific
community. The software R includes a variety of packages, and sometimes just functions, that can be used for Bayesian inference. Interested readers should consult the page http://cran.r-project.org/web/views/Bayesian.html, where one can find, for example, the package bayesSurv, built specifically for Bayesian inference in survival models.
Besides these R packages and the four software packages discussed below,
there are several other public domain programs for Bayesian inference, in-
cluding NIMBLE (de Valpine et al, 2017; Rickert, 2018), bayesm (Rossi
et al, 2005), and the Bayesian regression software by George Karabatsos
(Karabatsos, 2015). Rickert (2018) includes an implementation of inference for Problems 4.4 and 6.5 in NIMBLE.
For this chapter we selected four general-purpose public domain soft-
ware packages that implement methods based on stochastic simulation. Al-
though they are independent of R, all four can be used through an interface
in R. These are OpenBUGS (Thomas et al, 2006), JAGS (Plummer, 2003),
Stan (Carpenter et al, 2017), and BayesX (Brezger et al, 2005; Belitz et al,
2013). The chapter includes a brief summary of the principal characteristics
of each software and illustrates each in the same example. We also show
how to monitor and diagnose convergence of chains using the software
This material will be published by Cambridge University Press as Computational Bayesian Statistics, by M.A. Amaral Turkman, C.D. Paulino, and P. Müller (https://tinyurl.com/CompBayes). This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © Maria Antónia Amaral Turkman, Carlos Daniel Paulino & Peter Müller, 2019.
packages CODA and BOA. Both are R packages. Finally, we show how
the same example can be analyzed using R-INLA (www.r-inla.org).
There are several books that show the use of these software packages, in-
cluding those by Ntzoufras (2009), Kruschke (2011, 2014), Korner-Nievergelt
et al (2015) and Blangiardo and Cameletti (2015). Each of these packages
comes with extensive documentation.
To simplify the commands we will always assume that all files that are
needed for the program are saved in the working directory of the R session
(use setwd() if needed). The steps to analyze the model in R2OpenBUGS
are then the following:
1. Write the code for the model and save it in a file with extension .txt.
In this case, we saved it in the file Cexemplo1BUGS.txt.
model{
for(i in 1:147){
X[i]~dnorm(mu[i],tau)
mu[i]<-beta0+beta[1]*z1[i]+beta[2]*z2[i]
+beta[3]*z3[i]+beta[4]*z[i]+a[ID[i]]+b[i]
b[i]~dnorm(0,tau_b)
}
for(j in 1:49){
a[j]~dnorm(0,tau_a)
}
for(k in 1:4){
beta[k]~dnorm(0,0.0001)
}
beta0~dnorm(0,0.0001)
tau~dgamma(0.05,0.05)
tau_a~dgamma(0.05,0.05)
tau_b~dgamma(0.05,0.05)
sigma<-1/sqrt(tau)
sigma_a<-1/sqrt(tau_a)
sigma_b<-1/sqrt(tau_b)
}
The first for loop corresponds to the statement of the probability model given in item 1 of Section 9.1. In BUGS, the second parameter (tau) of the normal distribution is the precision (inverse variance). The loop also includes the definition of the linear predictor mu and the statement of the model for the random effects b that appear in items 2 and 3 of the same section. The number of patients is n = 49, but since all were observed at three distinct time points, the total number of observations is 147.
The second for loop is the model for the random effects a. Those effects are specific to each patient; patients are identified by the variable ID in the first loop.
The third for loop defines the prior distributions for the fixed effect parameters. After this, the prior distributions for the remaining model parameters are given. All prior distributions reflect vague prior information. Finally, to allow monitoring of the standard deviations, those are defined as functions of the respective precisions. Note that the sequence of these commands is completely arbitrary.
To verify the file that defines the model to be used by OpenBUGS, one
can use the command
file.show("Cexemplo1BUGS.txt")
If one wishes to monitor mu, or any other parameter that was defined in the
model specification, for example a, or b, those must appear in the above
list.
5. Define the initial values of parameters and hyperparameters in the
model. In this case we need to define initial values for beta0, beta, tau,
tau_a, tau_b, a, b. Those are defined in a list that needs to include initial
values for each chain.
Inits<-list(tau=1, tau_a=1, tau_b=1, beta=c(0,0,0,0),
            b=rep(0,147), a=rep(0,49), beta0=0)
Using more than one chain, one needs to create a list of the same type for each chain, with different initial values, arranged as named objects, for example "Inits1" and "Inits2", and then define a list of lists using
Inits<-list(Inits1,Inits2,...)
6. The program OpenBUGS can then be run through R using the func-
tion bugs(), after having verified that the R2OpenBUGS package is loaded
in the current R session.
library(R2OpenBUGS)
Cexemplo1_openBUGS.fit<- bugs(data=Cexemplo1.data, inits=list(Inits),
    parameters.to.save=Cexemplo1.params,
    "Cexemplo1BUGS.txt", n.chains=1, n.iter=40000,
    n.burnin=20000, debug=FALSE, save.history=FALSE, DIC=TRUE)
which generates in this case the output (the final column for the 97.5%
quantile is cut to fit the page)
mean sd 2.5% 25% 50% 75%
beta0 17.4290 1.9208 13.9200 16.1000 17.2800 18.6600
beta[1] 4.2329 2.8568 -2.6160 2.5750 4.4480 6.0600
beta[2] 0.1422 0.1391 -0.1303 0.0474 0.1448 0.2352
beta[3] 4.3456 0.9455 2.3960 3.7600 4.3430 4.9100
beta[4] -0.1029 0.0366 -0.1726 -0.1281 -0.1046 -0.0798
tau 1.5607 3.6155 0.0636 0.0847 0.1521 1.1012
tau_a 0.0162 0.0038 0.0097 0.0135 0.0159 0.0185
tau_b 2.2989 5.4749 0.0638 0.0870 0.1604 1.4452
sigma_a 8.0226 0.9547 6.4140 7.3560 7.9320 8.5990
sigma_b 2.1790 1.2991 0.2279 0.8318 2.4965 3.3910
sigma 2.2725 1.2624 0.2750 0.9528 2.5640 3.4360
deviance 583.7390 241.1451 37.2990 403.3750 695.3000 782.2000
[1] 20000 12
head(A)# shows the first 6 rows of A
beta0 beta[1] beta[2] beta[3] beta[4] tau
[1,] 15.45 8.683 0.066 4.086 -0.118 0.422
[2,] 16.44 5.370 0.123 4.398 -0.117 0.071
[3,] 19.35 2.387 0.214 4.675 -0.037 2.613
[4,] 19.12 -0.014 0.152 4.254 -0.084 0.361
[5,] 17.49 3.548 0.068 5.744 -0.104 2.452
[6,] 19.98 3.522 0.384 3.826 -0.106 0.092
tau_a tau_b sigma_a sigma_b sigma deviance
[1,] 0.017 0.078 7.640 3.589 1.539 536.6
[2,] 0.016 2.567 7.895 0.624 3.749 790.8
[3,] 0.019 0.074 7.169 3.665 0.619 249.5
[4,] 0.016 0.107 7.853 3.061 1.665 570.3
[5,] 0.015 0.079 8.067 3.558 0.639 286.1
[6,] 0.020 3.179 7.058 0.561 3.298 791.9
Note that only parameters that were declared in the object Cexemplo1.params appear. To carry out residual analysis, for example, it is important to monitor mu.
If the random effects b are removed (note that the file Cexemplo1BUGS.txt with the model definition and the initial values need to be changed accordingly) we get
> exemplo1_OpenBUGS1.fit<- bugs(data=Cexemplo1.data,
inits=list(Inits1), parameters.to.save=Cexemplo1.params1,
"Cexemplo1BUGS_semb.txt", n.chains=1,
n.iter=40000,n.burnin=20000, debug=FALSE,save.history=FALSE,
DIC=TRUE)
> exemplo1_OpenBUGS1.fit$summary
mean sd 2.5% 25% 50% 75% 97.5%
beta0 17.210 1.763 13.760 16.040 17.190 18.360 20.790
beta[1] 4.781 2.393 0.063 3.186 4.731 6.417 9.495
beta[2] 0.152 0.145 -0.117 0.052 0.149 0.249 0.450
beta[3] 4.146 0.923 2.384 3.527 4.121 4.751 6.053
beta[4] -0.107 0.036 -0.177 -0.131 -0.107 -0.083 -0.036
tau 0.077 0.011 0.057 0.070 0.077 0.085 0.101
tau_a 0.016 0.004 0.010 0.014 0.016 0.019 0.025
sigma_a 7.974 0.936 6.363 7.319 7.897 8.544 10.020
sigma 3.625 0.265 3.150 3.439 3.608 3.794 4.190
deviance 795.126 12.624 772.600 786.200 794.300 803.200 822.000
> exemplo1_OpenBUGS1.fit$DIC
[1] 843.2
> exemplo1_OpenBUGS1.fit$pD
[1] 48.03
> A1<-exemplo1_OpenBUGS1.fit$sims.matrix
> dim(A1)
[1] 20000 10
> head(A1)
beta0 beta[1] beta[2] beta[3] beta[4] tau
[1,] 20.62 -0.8067 0.1487 4.631 -0.1005 0.0838
[2,] 18.39 2.8580 0.0692 5.767 -0.0569 0.0826
[3,] 19.56 2.5330 0.3619 3.045 -0.1277 0.0784
[4,] 16.61 6.1660 -0.1078 4.284 -0.1168 0.0827
[5,] 18.11 2.8790 0.1680 4.372 -0.2055 0.0636
[6,] 15.06 5.1330 0.0461 3.707 -0.1142 0.0639
tau_a sigma_a sigma deviance
[1,] 0.0188 7.291 3.455 773.9
[2,] 0.0141 8.431 3.479 781.4
[3,] 0.0247 6.368 3.572 796.0
[4,] 0.0162 7.844 3.478 783.5
[5,] 0.0251 6.310 3.964 819.5
[6,] 0.0168 7.704 3.955 818.8
> Cexemplo1_OpenBUGS1$DIC
[1] 843.1
> Cexemplo1_OpenBUGS1$pD
[1] 48.12
9.3 JAGS
In 2003 Martyn Plummer, of the International Agency for Research on
Cancer, created JAGS (Just Another Gibbs Sampler) as a clone of BUGS
written in C++, which fixed certain limitations of BUGS. Models writ-
ten in BUGS syntax can be used in JAGS, practically without any change.
This has the advantage that the same code can be run on both platforms.
There are two parts to a model definition in JAGS: the model description
(model{}), as in BUGS, and the definition of the data (data{}). The lat-
ter can be used, for example, to define transformations of the data, define
summary statistics, simulate data sets, etc. The latest version of JAGS was
launched in July 2017 (JAGS 4.3.0). To understand how JAGS works, it
is important to read the manual (Plummer, 2012). The R package R2jags
(https://cran.r-project.org/web/packages/R2jags/R2jags.pdf) serves
as an interface between R and JAGS. One of the main advantages of JAGS
compared to OpenBUGS is the faster run time.
install.packages("R2jags", dependencies=TRUE,
repos="http://cran.us.r-project.org")
library(R2jags)
2. Reading in the data, the definition of the model variables, the decla-
ration of parameters that are to be monitored, and the definition of initial
values is done exactly as before.
Cexemplo1.data <- list("X","ID","z3","z1","z2","z")
Cexemplo1.params1 <- c("beta0","beta","tau","tau_a",
"sigma_a","sigma")
Inits1<-list("tau"=1,"tau_a"=1,"beta"=c(0,0,0,0),
"a"=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"beta0"=0)
3. Before using R2jags for the first time, we need to set a random variate
seed, for example, by the R command
set.seed(123)
> print(exemplo1_JAGS.fit2)
Inference for Bugs model at
"C:/Users.../model1f5477c03ee1.txt",
fit using jags,
2 chains, each with 40000 iterations (first 20000 discarded),
n.thin = 20
n.sims = 2000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
beta[1] 4.859 2.596 -0.346 3.166 4.939 6.604 9.950
beta[2] 0.157 0.139 -0.113 0.068 0.151 0.242 0.453
beta[3] 4.191 0.918 2.422 3.595 4.173 4.805 6.059
beta[4] -0.105 0.036 -0.175 -0.130 -0.106 -0.081 -0.035
beta0 17.218 1.816 13.857 15.996 17.207 18.435 20.716
Note that there are now two extra columns, “Rhat” and “n.eff,” whose
meanings are explained in the output.
A plot of simulated parameter values can be obtained by
traceplot(exemplo1_JAGS.fit2)
print(exemplo1_JAGS.fit2.upd)
Inference for Bugs model at
"C:/Users.../model1f5477c03ee1.txt",
fit using jags,
2 chains, each with 1000 iterations (first 0 discarded)
n.sims = 2000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
beta[1] 4.681 2.562 -0.227 2.972 4.702 6.345 9.832
beta[2] 0.156 0.143 -0.117 0.059 0.151 0.253 0.440
beta[3] 4.179 0.924 2.410 3.575 4.181 4.771 5.951
beta[4] -0.106 0.037 -0.178 -0.132 -0.106 -0.081 -0.036
beta0 17.288 1.777 13.735 16.117 17.333 18.443 20.676
Note the decrease of the values for pD and DIC. Another function that can
be used for the same purpose, and which requires only one chain, is the
function update().
In Section 9.6 we will discuss how to evaluate convergence diagnostics
for this example, including also for posterior simulation using other pack-
ages.
9.4 Stan
In 2010, Andrew Gelman, Columbia University, New York, and collabora-
tors were working on a Bayesian analysis of multi-level generalized linear
models, as described by Gelman and Hill (2006). The implementation of
these models in WinBUGS or JAGS turned out to be extremely challeng-
ing due to the complex model structure. For example, Matt Schofield noted
that simulation for a multi-level time series model that was used for a cli-
mate model using measurements of growth rings on trees, did not converge
even after hundreds of thousands of iterations (Schofield et al, 2016). To
resolve the problem, Gelman and collaborators developed a new Bayesian
software, which they called Stan, in honor of Stanislaw Ulam, one of the
creators of the Monte Carlo methods. The first version was made available
for users in August 2012. Stan does not use Gibbs sampling, but instead uses Hamiltonian Monte Carlo (Neal, 2011). With the no-U-turn sampler algorithm that they developed (Hoffman and Gelman, 2014), all parameters are simulated in one block. Using this strategy, convergence problems were substantially mitigated. In contrast to BUGS, the Stan modeling language is imperative.
Stan allows the use of all basic operators from C++, in addition to a large number of other special functions, including in particular Bessel functions, the gamma and digamma functions, and a variety of inverse link functions used in generalized linear models. A complete list of the basic operators and special functions that are implemented in Stan is given by Carpenter et al (2017). The list of implemented probability distributions, which also appears in the same reference, is extensive. This allows great flexibility in model construction.
A bit of history of the development of Stan, as well as implementation
details, can be found in Stan Modeling Language: User’s Guide and Ref-
erence Manual, which corresponds to version 2.6.2. An interface between
R and Stan is implemented in RStan (Stan Development Team, 2014).
real beta0;
real beta[4];
real a[N];
}
transformed parameters {
real<lower=0> sigma_a;
real<lower=0> sigma;
sigma_a = sqrt(1/tau_a);
sigma = sqrt(1/tau);
}
model {
  vector[J] mu;
  for(i in 1:J){
    mu[i] = beta0+beta[1]*z1[i]+beta[2]*z2[i]
            +beta[3]*z3[i]+a[ID[i]]+beta[4]*z[i];
  }
  beta0 ~ normal(0,100);
  beta ~ normal(0,100);
  tau ~ gamma(0.05,0.05);
  tau_a ~ gamma(0.05,0.05);
  a ~ normal(0, sigma_a);
  X ~ normal(mu, sigma);
}
generated quantities {
  vector[J] log_lik;
  vector[J] m;
  for(i in 1:J) {
    m[i] = beta0+beta[1]*z1[i]+beta[2]*z2[i]
           +beta[3]*z3[i]+a[ID[i]]+beta[4]*z[i];
    log_lik[i] = normal_log(X[i] | m[i], sigma);
  }
}
be used later. In this case we define sigma and sigma_a as the square
root of the inverse precision parameters.
• The model block defines the actual model. The vector mu is defined
as a function of the covariates and the random effects a. We define the
model for the response variable (note that the second parameter of the
normal distribution is the standard deviation here). We also define prior
distributions for the parameters and hyperparameters in the model. In
Stan the definition of prior distributions is not obligatory. In the absence
of such definitions, Stan proceeds with uniform prior distributions.
• The generated quantities block is not required. Here we define the log-likelihood in a way that the individual terms can be saved by Stan. These terms are collected in an object with the name log_lik; the simulated values of this object can then be used to calculate the widely applicable information criterion (WAIC) or the LOO (Vehtari et al, 2017), as we shall see.
2. The data are introduced as a list, as before, or if the data are read from
a file it suffices to create an object with the variable names, as follows:
Cexemplo1<-read.table("Cexemplo1.txt",header=T)
attach(Cexemplo1)
J<-nrow(Cexemplo1) #J=147
N<-length(unique(ID)) #N=49
# object to be used by Stan
Cexemplo1_data<-c("N","J","X","ID","z3","z2","z1","z")
3. Next we can call the function stan() from the package rstan to
simulate from the posterior distribution:
library(rstan)
exemplo1.fit_stan <- stan(file="exemplo1.stan.txt",
data=Cexemplo1_data, iter=40000, chains=2)
Call ?stan to see the arguments of the function stan. Particularly use-
ful are the arguments sample_file and diagnostic_file that allow
indicating the names of files where simulated samples of all model pa-
rameters and convergence diagnostics can be saved. If these names are not
given, these elements are not saved. They can, however, be extracted later.
rstan provides information about the execution time for each chain in
the following form (here only for one chain):
COMPILING THE C++ CODE FOR MODEL ’example1’ NOW.
The execution time for 40,000 iterations is much longer than the ex-
ecution time in OpenBUGS or JAGS, which for this particular problem
is practically instantaneous. However, the number of iterations needed to
achieve convergence is lower using Stan. Above we used the same number
of iterations, but this was not necessary. In fact, if the number of iterations
is not specified in iter=, we find convergence after only 2000 iterations
in this example.
4. To get summary statistics for the marginal posterior distributions for
the parameters of interest, use the command
print(exemplo1.fit_stan,
pars=c("beta0","beta","sigma","sigma_a","tau","tau_a","lp__"))
to get (a final column for the 97.5% quantile is cut to fit the page) the
following:
Inference for Stan model: example1.
2 chains, each with iter=40000; warmup=20000; thin=1;
post-warmup draws per chain=20000, total post-warmup draws=40000.
beta[2] 7261 1
beta[3] 7555 1
beta[4] 10087 1
sigma 21771 1
sigma_a 23821 1
tau 21546 1
tau_a 25151 1
lp__ 11599 1
Here, se_mean is the Monte Carlo standard error of the mean. If no param-
eters are specified then the summary statistics are given for the marginal
posterior distributions of all unknown quantities, including in particular
also a, log_lik, m.
The quantity lp__, which appears as the last element in pars= and whose summary statistics are found in the last row of the output, is the (un-normalized) log posterior density, which is calculated by Stan for the implementation of Hamiltonian Monte Carlo. It can also be used for model assessment (see, for example, Vehtari and Ojanen, 2012).
The print() function provides summary statistics based on all chains
that are generated, whereas summary() provides summary statistics for
each chain separately.
5. To obtain the simulated parameter values, use the function extract(). With the argument permuted=TRUE, a list of all simulated values for all parameters is created (if only simulated values for some parameters are desired, those need to be given in the argument pars=); using permuted=FALSE, an array is created with the first dimension corresponding to iterations, the second dimension corresponding to chains, and the third dimension corresponding to the parameters. See, for example, the following:
#####using permuted=TRUE##############
samples_stan<-extract(exemplo1.fit_stan,
pars=c("beta0", "beta", "sigma", "sigma_a"),
permuted = TRUE, inc_warmup = FALSE, include = TRUE)
> class(samples_stan)
[1] "list"
> names(samples_stan)
[1] "beta0" "beta" "sigma" "sigma_a"
> length(samples_stan$beta0)
[1] 40000 #20000 each chain
> head(samples_stan$beta0)
[1] 16.18767 18.64417 20.43510 16.69809 14.35278 15.39996
> dim(samples_stan$beta)
> head(round(samples_stan$beta,3))
chains
iterations chain:1 chain:2
[1,] 16.29099 17.81893
[2,] 16.68243 17.31063
[3,] 16.49383 17.31063
[4,] 16.20388 16.70740
, , parameters = beta[1]
chains
iterations chain:1 chain:2
[1,] 6.530125 5.718621
[2,] 4.949012 6.479835
[3,] 6.000288 6.479835
[4,] 6.204705 7.421142
, , parameters = beta[2]
chains
iterations chain:1 chain:2
[1,] 0.1718956 0.07575568
[2,] 0.1657402 0.18286167
[3,] 0.1793824 0.18286167
[4,] 0.1347633 0.15160846
Estimate SE
elpd_waic -422.4 9.9
p_waic 40.5 4.7
waic 844.8 19.8
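For reference, WAIC can be computed directly from the saved log_lik draws; in R this is what loo::waic() does. Below is a minimal Python sketch following the formulas in Vehtari et al (2017), applied to a made-up M × J matrix of pointwise log-likelihood values:

```python
# WAIC from pointwise log-likelihood draws: waic = -2 * (lppd - p_waic),
# with lppd_j = log mean_m exp(log_lik[m][j]) and p_waic_j the sample
# variance of log_lik[.][j] over draws. Fabricated matrix for illustration.
import math
import random

def waic(log_lik):
    M, J = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for j in range(J):
        col = [log_lik[m][j] for m in range(M)]
        mx = max(col)                  # log-sum-exp for numerical stability
        lppd += mx + math.log(sum(math.exp(v - mx) for v in col) / M)
        mean = sum(col) / M
        p_waic += sum((v - mean) ** 2 for v in col) / (M - 1)
    return -2.0 * (lppd - p_waic)      # on the deviance scale

random.seed(0)
ll = [[-1.0 + 0.1 * random.gauss(0, 1) for _ in range(5)] for _ in range(100)]
w = waic(ll)
```

In practice the M × J matrix is exactly the log_lik object extracted from the Stan fit.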
9.5 BayesX
BayesX is a program that was developed in the Department of Statistics
at Ludwig Maximilian Universität München by Andreas Brezger, Thomas
Kneib, and Stefan Lang, with the first versions appearing in 2002. The
software is written specifically for structured additive regression models
(Brezger et al, 2005). This family of models (STAR), which was introduced
2. Using this code, we get the following output of summary statistics for
the marginal distributions of the parameters:
> summary(exemplo1_BayesX)
### Chain_1
Call:
bayesx(formula = formula, data = data, weights = weights,
subset = subset, offset = offset, na.action = na.action,
contrasts = contrasts, control = control, model = model,
chains = NULL, cores = NULL)
Parametric coefficients:
Mean Sd 2.5% 50% 97.5%
(Intercept) 17.2313 1.8036 13.6702 17.2744 20.7688
z2 0.1557 0.1371 -0.1174 0.1566 0.4250
z3 4.2146 0.9665 2.3040 4.2043 6.2029
z1 4.7691 2.4910 -0.1460 4.7646 9.6957
z -0.1043 0.0366 -0.1756 -0.1055 -0.0319
Scale estimate:
Mean Sd 2.5% 50% 97.5%
Sigma2 13.2389 1.9332 9.9936 13.0804 17.373
Parametric coefficients:
Mean Sd 2.5% 50% 97.5%
(Intercept) 17.1458 1.7125 13.9773 17.0971 20.3820
z2 0.1591 0.1438 -0.1282 0.1612 0.4407
z3 4.1544 0.9413 2.3008 4.1405 6.0312
z1 4.9990 2.5100 -0.2337 5.0116 9.6973
z -0.1025 0.0351 -0.1751 -0.1016 -0.0367
Scale estimate:
Mean Sd 2.5% 50% 97.5%
Sigma2 13.2323 1.9739 9.9179 13.0191 17.518
When the argument control gives a file name for the output, folders are
automatically created in the indicated directory, one for each chain, con-
taining various files with the simulated parameter values (with file exten-
sion .raw) and text files with the summary statistics. In the current exam-
ple, two folders are created with the names Chain_1_bayes2x.estim
and Chain_2_bayes2x.estim. All these data files can later be used for
graphical summaries, to evaluate diagnostics, etc.
3. The command
getscript(exemplo1_BayesX)
[1] 10000
> length(AA[[2]])
[1] 10000
plot(AA)
The object AA is a list that contains in AA[[1]] a sample from the poste-
rior distribution of the fixed effect parameters from chain 1, and similarly in
AA[[2]] for chain 2. Since we save 20,000 iterations and thin out to every tenth, there are 2000 simulated values for each of the five parameters. The order in which the simulated values appear is the same as the order in which they were introduced in the formula. Thus AA[[1]][1:2000] contains the posterior simulated values for beta0, AA[[1]][2001:4000] contains those for beta1, the coefficient for the covariate z2, etc.
5. The command plot(AA) plots the superimposed series of simulated
values for each parameter and corresponding marginal posterior density
estimates.
6. To get the posterior simulated values of the variances (in this case, the
variance of a normal distribution), and the corresponding graphical sum-
maries, write
> Va<-samples(exemplo1_BayesX,term = "var-samples")
> length(Va[[1]])
[1] 2000
> plot(Va)
The information from this LaTeX document can also be obtained with the command
bayesx_logfile(exemplo1_BayesX)
Geweke’s Diagnostic
Let θ^t, t = 1, …, N, be a sequence of states generated by an MCMC simulation, and let g(θ) be a function of θ that is to be estimated. The trajectory g_1, g_2, … formed by g_t = g(θ^t) defines a time series.
The method of Geweke (1992) is based on the application of standard time series methods to test for convergence of the simulated sequence. Assuming sufficiently large N, we calculate the sample average ḡ_a = (1/n_a) Σ g(θ^t) using the first n_a iterations, as well as the average ḡ_b = (1/n_b) Σ g(θ^t) using the last n_b iterations. If the chain is stationary, then the mean of the first part of the chain should be similar to the mean of the latter part of the chain.
Letting N → ∞ with fixed na /N and nb /N, one can show that
(ga − gb )
q → N(0, 1),
(s2a /na ) + (s2b /nb )
where s2a and s2b are independent estimates of the asymptotic variances of
ga and gb , adjusted for the autocorrelation of the time series. Using this
statistic, we can now assess whether or not there is evidence for conver-
gence.
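As an illustration, the statistic can be computed directly from the simulated values; the following Python sketch (the function name is mine) uses naive variance estimates s^2/n instead of the spectral, autocorrelation-adjusted estimates that CODA's geweke.diag employs:

```python
import numpy as np

def geweke_z(g, first=0.1, last=0.5):
    """Z-score comparing the mean of the first `first` fraction of a
    chain with the mean of the last `last` fraction.  The asymptotic
    variances are estimated naively by s^2/n, without the spectral
    (autocorrelation) adjustment used by CODA's geweke.diag."""
    g = np.asarray(g, dtype=float)
    n = len(g)
    ga = g[: int(first * n)]         # first n_a iterations
    gb = g[n - int(last * n):]       # last n_b iterations
    num = ga.mean() - gb.mean()
    den = np.sqrt(ga.var(ddof=1) / len(ga) + gb.var(ddof=1) / len(gb))
    return num / den

# a chain still drifting toward its target yields a huge |Z|
drifting = np.linspace(0.0, 10.0, 1000)
print(abs(geweke_z(drifting)) > 1.96)   # True
```

For an approximately stationary chain the Z-score behaves like a standard normal draw, so values outside (−1.96, 1.96) signal lack of convergence at the 5% level.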
Heidelberger and Welch’s Diagnostic
1. Apply the stationarity test of Heidelberger and Welch (1983) to the full
sequence of N iterations.
2. If the null hypothesis of stationarity is rejected, repeat the test after
discarding the first 10% of iterations.
3. Repeat the process while the null hypothesis is rejected.
4. If we continue to reject the null hypothesis as the remaining number of
iterations reaches 50% of the initial N iterations, then the Markov chain
simulation has to continue, as the chain has not yet reached equilibrium.
In that case CODA reports the test statistic and indicates that the chain
has failed the test of stationarity.
5. Otherwise, the part of the chain that passed the stationarity test is used to
estimate the mean (m) and asymptotic standard error (s) of the average
and a half-width test is applied to that part of the chain as follows. If
1.96s < mǫ, for small ǫ (CODA uses a default of α = 0.05 and ǫ = 0.1),
then the chain passes the overall test. Otherwise, the condition 1.96s ≥
mǫ means that the Markov chain simulation needs to be continued.
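The half-width rule in step 5 is simple to express in code; a Python sketch (the function name is mine, and it uses the naive standard error s/√N in place of the spectral estimate used by heidel.diag):

```python
import numpy as np

def halfwidth_pass(draws, eps=0.1, z=1.96):
    """Half-width test: the chain passes if the 95% Monte Carlo
    confidence half-width z*s is below eps times the estimated mean.
    Here s is the naive asymptotic standard error of the average."""
    draws = np.asarray(draws, dtype=float)
    m = draws.mean()
    s = draws.std(ddof=1) / np.sqrt(len(draws))
    return bool(z * s < eps * abs(m))

# 10,000 draws tightly spread around 10 easily pass the test
print(halfwidth_pass(np.tile([9.0, 11.0], 5000)))   # True
```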
The CRAN packages CODA and BOA provide the functions codamenu()
and boa.menu(), respectively, which allow running the programs through a
menu interface, aimed at casual users with limited knowledge of R. For exam-
ple, in CODA:
> library(coda)
> codamenu()
CODA startup menu
1: Read BUGS output files
2: Use an mcmc object
3: Quit
Selection: 2
1: Output Analysis
2: Diagnostics
3: List/Change Options
4: Quit
After reading in the data, the user is presented with a list of analysis op-
tions, summary statistics, graphical representations, and convergence diag-
nostics.
The BOA menu is quite similar; see Smith (2007) for details of
boa.menu().
As an alternative to the menu mode, it is possible to carry out an analysis
using R commands. To provide an interface between CODA and R such that
its functions can be used from the R command line, a new R class, mcmc, was
created. BOA accepts the simulated values as a vector or matrix input
parameter and can save them as an mcmc object.
The latest versions of the manuals for CODA and BOA are available from
https://cran.r-project.org/web/packages/coda/coda.pdf
https://cran.r-project.org/web/packages/boa/boa.pdf.
A. Using R2OpenBUGS
> A1<-Cexemplo1_OpenBUGS.fit$sims.matrix
> dim(A1)
[1] 40000 10
> head(round(A1,4))
beta0 beta[1] beta[2] beta[3] beta[4] tau tau_a
[1,] 15.73 6.349 0.0098 3.843 -0.1046 0.0819 0.0192
[2,] 18.55 2.689 0.0214 4.742 -0.1315 0.0953 0.0195
[3,] 16.41 6.330 0.2284 4.585 -0.0643 0.0664 0.0218
[4,] 14.18 5.653 -0.1744 4.911 -0.1551 0.0793 0.0127
[5,] 19.86 2.291 0.0826 4.259 -0.1209 0.0875 0.0180
[6,] 19.00 1.449 -0.0214 5.277 -0.0964 0.0778 0.0190
sigma_a sigma deviance
[1,] 7.207 3.495 797.3
[2,] 7.168 3.240 792.8
[3,] 6.779 3.882 800.5
[4,] 8.883 3.552 781.2
[5,] 7.452 3.381 793.0
[6,] 7.247 3.585 783.9
> class(A1)
[1] "matrix"
We see that A1 has 40,000 rows (20,000 iterations for each of two chains)
and ten columns named after the monitored parameters. Note that the
states for the first chain are in rows 1 through 20,000, and for the second
chain in rows 20,001 through 40,000.
Thus, to use CODA with two chains, we need to define two objects of
type mcmc, one for each chain, and then combine them into a single object
using the function as.mcmc.list, as is shown in the following example.
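The same row-slicing logic can be sketched in Python/numpy with toy dimensions (20 iterations, 3 parameters, rather than the 40,000 × 10 matrix above):

```python
import numpy as np

# stacked simulation matrix in the style of sims.matrix:
# rows 1..n_iter hold chain 1, rows n_iter+1..2*n_iter hold chain 2
n_iter, n_par = 20, 3
stacked = np.arange(2 * n_iter * n_par, dtype=float).reshape(2 * n_iter, n_par)

# split into per-chain matrices, as as.mcmc.list expects one object per chain
chains = [stacked[:n_iter, :], stacked[n_iter:, :]]
print(chains[0].shape, chains[1].shape)   # (20, 3) (20, 3)
```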
> library(coda)
> A1_1chain<-as.mcmc(A1[1:20000,])
> A1_2chain<-as.mcmc(A1[20001:40000,])
> A1_2_mcmc<-as.mcmc.list(list(A1_1chain,A1_2chain))
We can plot superimposed trajectories for the two chains, posterior den-
sities (illustrated in Figure 9.3 for the parameters β1 and β2 ), autocorre-
lations, trajectories of quantiles, and the correlation matrix between the
components. The plots are obtained using the following commands:
plot(A1_2_mcmc)
plot(A1_2_mcmc[,2:3]) #only for cols that appear in the figure
autocorr.plot(A1_2_mcmc) #autocorrelation
cumuplot(A1_2_mcmc) #evaluate quantiles (0.025,0.5,0.975)
crosscorr.plot(A1_2_mcmc) #plot correlation matrix
The convergence diagnostics from Section 9.6.1 are easily evaluated with
the respective CODA functions.
1. Gelman and Rubin’s Diagnostic
The function gelman.diag(A1_2_mcmc) reports the potential scale reduction
factors, with a multivariate psrf of 1, and gelman.plot(A1_2_mcmc) displays
the evolution of the shrink factor (median and 97.5% quantile) for each
parameter, here for beta[1] and beta[2].
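The shrink factor monitored by gelman.diag and gelman.plot is the potential scale reduction factor; a minimal Python sketch of the basic R̂ formula (without the sampling-variability correction that the CODA functions add):

```python
import numpy as np

def psrf(chains):
    """Basic potential scale reduction factor (Gelman-Rubin R-hat) for
    an array of m equal-length chains, shape (m, n)."""
    x = np.asarray(chains, dtype=float)
    m, n = x.shape
    W = x.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * x.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n     # pooled variance estimate
    return float(np.sqrt(var_hat / W))

# chains exploring the same region give R-hat near 1;
# chains stuck in different regions give R-hat far above 1
same = np.array([np.tile([0.0, 1.0], 500), np.tile([1.0, 0.0], 500)])
apart = np.array([np.tile([0.0, 1.0], 500), np.tile([10.0, 11.0], 500)])
print(psrf(same) < 1.01, psrf(apart) > 1.2)   # True True
```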
2. Geweke’s Diagnostic
> geweke.diag(A1_2_mcmc)
[[1]]
[[2]]
The output of this function is the Z-scores for the tests of equal means
between the earlier and later part of the series for each variable. Since the
values for the first chain are all within the interval (−1.96, 1.96), we do
not reject the null hypothesis of equal means for all monitored parameters.
However, the same is not true for β0 and β1 in the second chain. Keep
in mind that any parameter failing the test is evidence for the chain as a
whole not having reached stationarity. This is the case since convergence
is a property of the chain, not of individual parameters.
3. Raftery and Lewis’ Diagnostic
This method was designed for short pilot sequences of Markov chains.
Thus, we will apply it to the first 4000 iterates only of each chain (iterates
during the initial burn-in, in this case 20,000, were not saved in the
object A1). The method requires the designation of a quantile that is to be
estimated. By default, CODA uses the 0.025 quantile. We use the method
assuming one wishes to estimate the median, that is, the 0.5 quantile.
> raftery.diag(A1_1chain[1:4000,],q=0.5,r=0.01,s=0.95,
converge.eps=0.001)
> raftery.diag(A1_2chain[1:4000,],q=0.5,r=0.01,s=0.95)
[[2]]
With r = 0.01, the parameter β2 passed the test of stationarity (for both
chains) but failed the half-width test, indicating the need to continue the
simulation to achieve the desired precision. If we increase r to 0.05, all
parameters pass both the stationarity and half-width tests.
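The Nmin value in the raftery.diag output — the number of independent draws needed to estimate the chosen quantile q to within ±r with probability s — can be reproduced from the normal-approximation formula; a Python sketch:

```python
from math import ceil
from statistics import NormalDist

def raftery_nmin(q=0.025, r=0.005, s=0.95):
    """Minimum number of independent draws so that the q quantile is
    estimated to within +/- r with probability s (the Nmin reported by
    raftery.diag; the full diagnostic then inflates this by a
    dependence factor estimated from the chain)."""
    z = NormalDist().inv_cdf((1 + s) / 2)
    return ceil(z * z * q * (1 - q) / r ** 2)

print(raftery_nmin(q=0.5, r=0.01, s=0.95))   # 9604
```

With CODA's defaults (q = 0.025, r = 0.005, s = 0.95) the same formula gives 3746.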
5. HPD Intervals
Finally, recall that in CODA one can obtain HPD intervals for all mon-
itored parameters, using the function HPDinterval(). This is illustrated
here with 95% HPD intervals for each of the chains.
> HPDinterval(A1_2_mcmc, prob = 0.95)
[[1]]
lower upper
beta0 13.860000 20.80000
beta[1] 0.036880 9.84600
beta[2] -0.128400 0.42490
beta[3] 2.487000 6.08700
beta[4] -0.178200 -0.03300
tau 0.056580 0.10000
tau_a 0.009279 0.02371
sigma_a 6.246000 9.88700
sigma 3.122000 4.14500
deviance 771.400000 820.30000
attr(,"Probability")
[1] 0.95
[[2]]
lower upper
beta0 13.960000 20.87000
beta[1] 0.002304 9.64300
beta[2] -0.114500 0.42770
beta[3] 2.402000 6.02500
beta[4] -0.178100 -0.03235
tau 0.055750 0.09912
tau_a 0.009449 0.02374
sigma_a 6.223000 9.83100
sigma 3.122000 4.15300
deviance 771.100000 819.80000
attr(,"Probability")
[1] 0.95
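HPDinterval follows the approach of Chen and Shao (1999): for a unimodal marginal, the empirical HPD interval is the shortest interval among those spanning the desired probability mass of the sorted draws. A Python sketch of the idea (the function name is mine):

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing a fraction `prob` of the sorted
    draws -- an empirical HPD interval for a unimodal marginal."""
    x = np.sort(np.asarray(draws, dtype=float))
    n = len(x)
    k = int(np.floor(prob * n))        # number of draws each interval spans
    widths = x[k:] - x[: n - k]        # widths of all candidate intervals
    j = int(np.argmin(widths))         # index of the shortest one
    return float(x[j]), float(x[j + k])

# for a right-skewed sample the HPD interval sits closer to the mode
# than the equal-tail interval does
rng = np.random.default_rng(0)
lo, hi = hpd_interval(rng.gamma(2.0, 1.0, size=20_000))
print(lo < 0.2 and hi < 7.0)   # True
```

This asymmetry is why, for skewed marginals such as precision parameters, HPD and equal-tail intervals can differ visibly.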
B. Using R2jags
1. The output from jags() can be converted into an mcmc object with the
command:
exemplo1_JAGS.fit2.mcmc <- as.mcmc(exemplo1_JAGS.fit2 )
2. As before, with the mcmc object one can then use a variety of com-
mands for convergence diagnostics in CODA:
library(coda)
plot(exemplo1_JAGS.fit2.mcmc)
autocorr.plot(exemplo1_JAGS.fit2.mcmc)
gelman.plot(exemplo1_JAGS.fit2.mcmc)
gelman.diag(exemplo1_JAGS.fit2.mcmc)
geweke.diag(exemplo1_JAGS.fit2.mcmc)
raftery.diag(exemplo1_JAGS.fit2.mcmc)
heidel.diag(exemplo1_JAGS.fit2.mcmc)
To use BOA, we proceed exactly as we did in the case of two
chains with output from R2OpenBUGS.
C. Using RStan
As mentioned before (item 5, section 9.4.1), the simulated parameter values
for a chain that is set up using the command stan() can be obtained by
applying the function extract() to the output of stan().
> samples_stan_array<-extract(exemplo1.fit_stan,
pars=c("beta0", "beta", "sigma", "sigma_a", "tau", "tau_a"),
permuted = FALSE, inc_warmup = FALSE, include = TRUE)
> class(samples_stan_array)
[1] "array"
> dim(samples_stan_array)
[1] 20000 2 9 # 20000 iterations per chain, 2 chains, 9 parameters
For BOA, we need to first define dimnames for the row and column names:
samples_coda_1<-as.matrix(samples_stan_array[1:20000,1,1:9])
samples_coda_2<-as.matrix(samples_stan_array[1:20000,2,1:9])
dimnames(samples_coda_1)<- list(1:20000,
c("beta0", "beta[1]","beta[2]", "beta[3]", "beta[4]",
"sigma", "sigma_a", "tau", "tau_a"))
dimnames(samples_coda_2)<- list(1:20000,
c("beta0", "beta[1]","beta[2]", "beta[3]", "beta[4]",
"sigma", "sigma_a", "tau", "tau_a"))
D. Using R2BayesX
Recall (item 4, Subsection 9.5.1) that the function samples() applied to
the output of bayesx() returns the simulated values from the posterior
distribution, including the parameters that are indicated in the argument
term. The argument CODA controls the type and class of the output object.
Thus with the following commands we get a list of type mcmc, which
can be used for convergence diagnostics in CODA:
> AA_coda<-samples(exemplo1_BayesX,
term=c("linear-samples","var-samples","sd(ID)"),coda=TRUE)
> class(AA_coda)
[1] "mcmc.list"
> names(AA_coda)
[1] "Chain_1" "Chain_2"
#illustrating
> gelman.diag(AA_coda)
Potential scale reduction factors:
Multivariate psrf
1
#illustrating
> library(boa)
> boa.geweke(AA_boa_1,p.first=0.1,p.last=0.5)
Z-Score p-value
Chain_1.Param.Intercept -0.2281478 0.8195313
Chain_1.Param.z2 1.2278951 0.2194863
Chain_1.Param.z3 -0.2358216 0.8135711
Chain_1.Param.z1 1.1215734 0.2620438
Chain_1.Param.z 0.8195813 0.4124548
Chain_1.Var 0.6110576 0.5411614
As with any other R package, to load R-INLA in each work session, write
library(INLA)
The most recent and most stable versions of the package are available from
www.r-inla.org.
With R-INLA, a large number of probability distributions can be used
for the response variable. The list can be seen with the com-
mand
> names(inla.models()$likelihood)
As before, fixed effects that appear linearly in the model have been cen-
tered. Using the function f(), which appears in the definition of the for-
mula, we define the structural effects (the various types are defined at
www.r-inla.org/models/latent-models). In the case at hand we
have only the patient-specific (variable ID) random effects, which are intro-
duced in the model as a. The model "iid" specifies a normal distribution
with zero mean and precision τa. The prior distribution that is specified in
the argument hyper is for ln(τa); being a log-gamma, it corresponds to a
gamma distribution for the precision τa.
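The correspondence between the log-gamma prior for ln(τa) and a gamma prior for τa is a one-line change of variables; as a sketch (with generic hyperparameters a, b):

```latex
% log-gamma density for \theta = \ln\tau_a with parameters (a,b):
p_{\theta}(\theta) = \frac{b^{a}}{\Gamma(a)}\, e^{a\theta - b e^{\theta}},
  \qquad \theta \in \mathbb{R};
% change of variables \tau_a = e^{\theta}, with |d\theta/d\tau_a| = 1/\tau_a:
p(\tau_a) = \frac{b^{a}}{\Gamma(a)}\, \tau_a^{a}\, e^{-b\tau_a}\cdot\frac{1}{\tau_a}
          = \frac{b^{a}}{\Gamma(a)}\, \tau_a^{a-1}\, e^{-b\tau_a},
% which is the \mathrm{Ga}(a,b) density.
```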
2. Next we call the inla() function to run the INLA algorithm and
obtain the desired results to eventually proceed with Bayesian inference,
as follows:
> ?inla
> resultado_INLA <- inla(INLA_formula,family="normal",
control.predictor=list(compute=TRUE),
control.compute =list(waic=TRUE,dic=TRUE,cpo=TRUE),
data = Cexemplo1,
control.family=
list(hyper=list(prec=list(prior="loggamma",param=c(1,0.005)))))
The first line in the above code lists all arguments of the inla() func-
tion, of which the only required ones are the object that states the formula,
in this case INLA_formula, and the object that contains the data, in this
case data = Cexemplo1. All other arguments, when not specified, take the
default values assumed by R-INLA.
The directive control.predictor=list(compute=TRUE) instructs inla()
to compute the marginal distributions of the linear predictor. There are
other arguments for this function, which can be listed by the command
?control.predictor
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 17.250 1.737 13.820 17.251 20.669 17.253 0
z1 4.772 2.448 -0.050 4.770 9.599 4.766 0
z2 0.154 0.138 -0.118 0.154 0.427 0.153 0
z3 4.169 0.908 2.394 4.164 5.972 4.154 0
z -0.106 0.036 -0.176 -0.106 -0.035 -0.107 0
Random effects:
Name Model
ID IID model
Model hyperparameters:
mean sd 0.025quant 0.5qu
Precision for the Gaussian observations 0.0789 0.0113 0.0587 0.07
Precision for ID 0.0170 0.0039 0.0106 0.01
0.975quant mode
Compare these results with the corresponding results from simulation meth-
ods. In particular, compare the WAIC values with those obtained from
RStan.
4. Note that the results also include, for each configuration of hyperpa-
rameters, an estimate of the effective number of parameters. This estimate
essentially corresponds to the expected number of independent parameters
in the model. In our case, we have 7 + 49 = 56 parameters, but since the
random effects are correlated, the effective number of parameters is lower,
≈ 47, as can be seen. As mentioned, this is one of the strategies, proposed
in Rue et al (2009), to evaluate the accuracy of the approximation. In par-
ticular, if the effective number of parameters is low compared to the sample
size, then one expects the approximation to be good. In this case the ratio
of sample size (147) and effective number of parameters (46.38) is approxi-
mately 3.17, suggesting a reasonably good approximation. In fact, the ratio
can be interpreted as the number of “equivalent replicates” corresponding
to the number of observations for each expected number of effective pa-
rameters.
Another reported quantity is the mean Kullback–Leibler divergence (in
the column kld). This value describes the difference between the normal
approximation and the simplified Laplace approximation (recall the dis-
cussion in Chapter 8 about the various approximation strategies used in
INLA) for the marginal posterior distributions. Small values indicate that
the posterior distribution is well-approximated by a normal.
5. The default approximation strategy in inla() is the simplified Laplace
approach. Other approximation and integration methods can be defined us-
ing the argument control.inla in the function inla(). For example, the
complete Laplace approach, recommended when more accurate approxima-
tions are required, is selected with control.inla=list(strategy="laplace").
6. Besides the results shown earlier, R-INLA can also report two types
of goodness-of-fit measures, namely the conditional predictive ordinates
(CPO), p(yi | y−i), and the probability integral transforms (PIT),
P(Yi^new ≤ yi | y−i). To add these to the output, just add cpo=TRUE in
the list of the argument control.compute for the function inla. The
values are then returned as part of the output from inla. A list of all
possible values can be obtained by the command names(resultado_INLA).
Of the 51 possible ones, we list only a few:
> names(resultado_INLA)
[1] "names.fixed"
[2] "summary.fixed"
[3] "marginals.fixed"
[4] "summary.lincomb"
[5] "marginals.lincomb"
[6] "size.lincomb"
[7] "summary.lincomb.derived"
[8] "marginals.lincomb.derived"
[9] "size.lincomb.derived"
[10] "mlik"
[11] "cpo"
[12] "po"
[13] "waic"
...
[18] "summary.linear.predictor"
[19] "marginals.linear.predictor"
[20] "summary.fitted.values"
[21] "marginals.fitted.values"
...
[27] "offset.linear.predictor"
...
[51] "model.matrix"
One can then save the values of these two measures of fit in an object, for
example CPO_PIT below, and use it for further analysis, graphs, etc.
> CPO_PIT<-resultado_INLA$cpo
> names(CPO_PIT)
[1] "cpo" "pit" "failure"
> class(CPO_PIT)
[1] "list"
> summary(CPO_PIT$cpo)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003652 0.0486900 0.0741300 0.0673100 0.0905200 0.0924000
> summary(CPO_PIT$pit)
Min. 1st Qu. Median Mean 3rd Qu. Max.
[Figure: graphical summaries of the CPO and PIT values.]
> tau<-resultado_INLA$marginals.hyperpar$
+"Precision for the Gaussian observations"
> sigma<-inla.emarginal(function(x) 1/sqrt(x), tau)
> sigma
[1] 3.588387
> tau_a<-resultado_INLA$marginals.hyperpar$"Precision for ID"
> sigma_a<-inla.emarginal(function(x) 1/sqrt(x), tau_a)
> sigma_a
[1] 7.813781
More information about the functions that are used to operate on marginal
distributions is obtained via the help command ?inla.marginal.
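In essence, inla.emarginal integrates a function numerically against a marginal stored as a grid of support points and density values. A rough Python sketch of the idea (function and variable names are mine; R-INLA interpolates the marginal more carefully):

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule for samples y of a function on grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def emarginal(fun, x, dens):
    """E[fun(X)] for a marginal given as support points x with density
    values dens, renormalising the grid first -- a rough stand-in for
    what inla.emarginal does."""
    x = np.asarray(x, dtype=float)
    dens = np.asarray(dens, dtype=float)
    return trap(fun(x) * dens, x) / trap(dens, x)

# example: an (unnormalised) N(0,1) marginal; E[X^2] is close to 1
xs = np.linspace(-8.0, 8.0, 2001)
ds = np.exp(-xs**2 / 2.0)
print(round(emarginal(lambda t: t**2, xs, ds), 3))   # 1.0
```

Replacing the integrand by 1/√x over a marginal for a precision gives posterior summaries of the corresponding standard deviation, as in the transcript above.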
10. One of these functions, inla.hpdmarginal(), allows getting HPD
credible intervals for the model parameters. To obtain these intervals for the
parameters corresponding to the fixed effects in the linear predictor, one
can either get them individually, or get all in one command, as illustrated
here for a 95% interval,
> HPD<-NULL
> for(i in 1:5){
+ HPD[[i]]<-inla.hpdmarginal(0.95,
+ resultado_INLA$marginals.fixed[[i]])}
> HPD
[[1]]
low high
level:0.95 13.82332 20.66469
[[2]]
low high
level:0.95 -0.05260815 9.58653
[[3]]
low high
level:0.95 -0.1184688 0.4263571
[[4]]
low high
level:0.95 2.384431 5.958083
[[5]]
low high
level:0.95 -0.1759522 -0.03584334
[[2]]
low high
level:0.95 0.009948865 0.02469588
As expected, the HPD intervals for the fixed effect parameters practically
coincide with the equal tail intervals in the summary results. The same is
not true for the precision parameters.
Problems
9.1 FEV Data. Rosner (1999) reports on a study of lung capacity and smoking
in n = 654 youths aged 3–19 years. Lung capacity yi is measured as forced
expiratory volume (FEV). Covariates include age (xi1 ) and an indicator of
smoking (xi2 ). The same data are also analyzed by Christensen et al. (2011:
ch. 7).
Set up a regression of yi on age (xi1 ), adjusting intercept and slope for smok-
ing (xi2 ), that is, a normal linear regression for yi = b0 + b1 xi1 + b2 xi2 +
b3 xi1 xi2 + ǫi . The data are available as the data set fev in the R package
tmle. Set up inference using OpenBUGS, JAGS, Stan, or INLA. Does FEV
differ by smoking? Does smoking affect the increase of lung capacity with
age? Is the interaction term needed?
9.2 Consider any of the examples in the OpenBUGS manual,
www.openbugs.net/w/Examples. Implement inference in JAGS or Stan or OpenBUGS.
9.3 Consider the Epilepsy study reported in the OpenBUGS manual,
www.openbugs.net/Examples/Epil.html. Implement inference using R-INLA. Show
the marginal posterior distributions for the logistic regression coefficients
a j . Compare versus a simplified model without extra-Poisson variation (that
is, without the b jk random effect).
9.4 Consider the Salmonella dose–response study in the OpenBUGS manual,
Probability Distributions
Discrete Distributions
Binomial: x ∼ Bi(n, p), x ∈ {0, . . . , n};
f(x) = C(n, x) p^x (1 − p)^(n−x), n ∈ N, 0 < p < 1;
µ = np, σ^2 = np(1 − p).
Bernoulli: Ber(p) = Bi(n = 1, p).
Beta-binomial: x ∼ BeBin(n, a, b), x ∈ {0, . . . , n}
f(x) = C(n, x) B(a + x, b + n − x)/B(a, b), a > 0, b > 0, n ∈ N;
µ = na/(a + b), σ^2 = nab(a + b + n)/[(a + b)^2 (a + b + 1)].
Note: A BeBin(n, a, b) distribution is a mixture of a Bi(n, θ) with respect
to θ ∼ Be(a, b).
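The moment formulas above are easy to verify numerically from the pmf; a quick Python check (parameter values arbitrary):

```python
from math import comb, gamma

def beta_fn(a, b):
    """Euler beta function B(a, b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def bebin_pmf(x, n, a, b):
    """Beta-binomial pmf C(n, x) B(a + x, b + n - x) / B(a, b)."""
    return comb(n, x) * beta_fn(a + x, b + n - x) / beta_fn(a, b)

n, a, b = 10, 2.0, 3.0
pmf = [bebin_pmf(x, n, a, b) for x in range(n + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
# the pmf sums to 1 and the mean equals na/(a+b) = 10*2/5 = 4
print(round(sum(pmf), 6), round(mean, 6))   # 1.0 4.0
```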
Negative binomial: x ∼ NBin(r, θ), x ∈ N0 ;
f(x) = C(x + r − 1, x) (1 − θ)^r θ^x, 0 < θ < 1, r ∈ N;
µ = rθ/(1 − θ) and σ^2 = rθ/(1 − θ)^2.
Geometric: Geo(θ) = NBin(1, θ).
Hypergeometric:
f(x) = C(M, x) C(N − M, n − x)/C(N, n), N ∈ N, M ∈ {0, . . . , N}, n ∈ {1, . . . , N};
µ = nM/N and σ^2 = n [(N − n)/(N − 1)] (M/N)(1 − M/N).
Continuous Distributions
Uniform: x ∼ U(α, β), α ≤ x ≤ β;
f (x) = 1/(β − α), α < β, α, β ∈ R;
µ = (β + α)/2 and σ^2 = (β − α)^2/12.
Note: We also write U(S ) for a uniform distribution over a set S .
Beta: x ∼ Be(α, β), 0 ≤ x ≤ 1;
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), α > 0, β > 0;
µ = α/(α + β) and σ^2 = αβ/[(α + β)^2 (α + β + 1)].
Normal: x ∼ N(µ, σ2 ), x ∈ R;
f(x) = [1/(√(2π) σ)] e^(−(x−µ)^2/(2σ^2)), µ ∈ R, σ^2 > 0;
E(x) = µ and Var(x) = σ^2.
Note: Alternative parametrizations use the precision, 1/σ2 , or the stan-
dard deviation, σ, as the second parameter. WinBUGS, JAGS, and INLA
use the precision; Stan and R use the standard deviation.
Gamma: x ∼ Ga(α, β), x ≥ 0;
f(x) = [β^α/Γ(α)] e^(−βx) x^(α−1), α > 0, β > 0;
µ = α/β and σ^2 = α/β^2.
Exponential: Exp(λ) = Ga(1, λ).
Chi-square: χ2 (n) = Ga(n/2, 1/2).
Erlang: Erl(k, λ) = Ga(k, λ), k ∈ N.
Inverse gamma: x ∼ IGa(α, β), x ≥ 0;
f(x) = [β^α/Γ(α)] e^(−β/x) x^(−α−1), α > 0, β > 0;
µ = β/(α − 1), α > 1, and σ^2 = β^2/[(α − 1)^2 (α − 2)], α > 2.
Weibull: x ∼ Weib(λ, k), x ≥ 0;
f(x) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k), λ > 0, k > 0;
µ = λΓ(1 + 1/k) and σ^2 = λ^2 Γ(1 + 2/k) − µ^2.
F: x ∼ F(α, β), x ≥ 0;
f(x) = [α^(α/2) β^(β/2)/B(α/2, β/2)] x^(α/2−1) (αx + β)^(−(α+β)/2), α > 0, β > 0;
µ = β/(β − 2), β > 2, and σ^2 = 2β^2 (α + β − 2)/[α(β − 2)^2 (β − 4)], β > 4.
Note: x ∼ F(α, β) can be derived as the distribution of
1. x = (x1 /α)/(x2 /β), where x1 ∼ χ2(α) independently of x2 ∼ χ2(β) ; or,
2. x = βy/[α(1 − y)] where y ∼ Be(α/2, β/2).
Multivariate Distributions
Dirichlet: θ ∼ Dc(a), θ ∈ Sc;
f(θ) = [Γ(a•)/Π_{i=1}^{c+1} Γ(ai)] Π_{i=1}^{c+1} θi^(ai−1), ai ∈ R+, i = 1, . . . , c + 1, θc+1 = 1 − θ•;
µi = ai/a•, σ_i^2 = µi(1 − µi)/(a• + 1) and σij = −µi µj/(a• + 1), i ≠ j.
Multinomial: x ∼ Mc(n, θ), x = (x1, . . . , xc+1), xi ∈ N0, Σ_{i=1}^{c} xi ≤ n and
xc+1 = n − Σ_{i=1}^{c} xi;
f(x) = [n!/Π_{i=1}^{c+1} xi!] Π_{i=1}^{c+1} θi^(xi), θ ∈ Sc, θc+1 = 1 − θ•.
Programming Notes
return(x)
}
#######################################################
# simulating from a bivariate distribution
MCMC Simulation
The last code fragment shows a typical Gibbs sampler implementation.
The main function gibbs() implements the iteration over the steps of
the MCMC simulation. The code implements a Gibbs sampler with two
transition probabilities, generating parameters b and s2 from the respective
complete conditional posterior distributions.
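Since that code listing is only partially reproduced here, the following self-contained Python sketch illustrates the same structure for a normal model y_i ∼ N(b, s2) with conjugate priors b ∼ N(m0, v0) and s2 ∼ IGa(a0, b0); all names and prior settings are illustrative:

```python
import numpy as np

def gibbs(y, n_iter=5000, m0=0.0, v0=100.0, a0=2.0, b0=2.0, seed=0):
    """Two-block Gibbs sampler: each iteration draws b and s2 from
    their complete conditional posterior distributions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    b, s2 = ybar, y.var()                  # crude starting values
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # b | s2, y ~ N(mean_n, var_n): normal prior x normal likelihood
        var_n = 1.0 / (1.0 / v0 + n / s2)
        mean_n = var_n * (m0 / v0 + n * ybar / s2)
        b = rng.normal(mean_n, np.sqrt(var_n))
        # s2 | b, y ~ IGa(a0 + n/2, b0 + sum((y - b)^2)/2)
        rate = b0 + np.sum((y - b) ** 2) / 2.0
        s2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / rate)
        out[t] = b, s2
    return out

# synthetic data centred at 5 with variance 4
rng = np.random.default_rng(42)
y = rng.normal(5.0, 2.0, size=200)
draws = gibbs(y, n_iter=3000)
print(draws.shape)   # (3000, 2)
```

Each pass through the loop is one application of the two transition probabilities described above; the rows of the returned matrix form the simulated Markov chain.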
Brezger, A., Kneib, T., and Lang, S. 2005. BayesX: analyzing Bayesian structural
additive regression models. Journal of Statistical Software, Articles, 14(11), 1–22.
(Cited on pages 174, 194, and 195.)
Burnham, K. P., and Anderson, D. R. 2002. Model Selection and Multimodel Inference:
A Practical Information-Theoretic Approach. 2nd edn. Springer.
Carlin, B. P., and Chib, S. 1995. Bayesian model choice via Markov chain Monte Carlo.
Journal of the Royal Statistical Society, B, 57(3), 473–484. (Cited on page 138.)
Carlin, B. P., and Gelfand, A. E. 1991. An iterative Monte Carlo method for non-
conjugate Bayesian analysis. Statistics and Computing, 1(2), 119–128. (Cited on
page 67.)
Carlin, B. P., and Louis, T. A. 2009. Bayesian Methods for Data Analysis. CRC Press.
(Cited on pages viii and 80.)
Carpenter, B., Gelman, A., Hoffman, M., et al. 2017. Stan: a probabilistic programming
language. Journal of Statistical Software, Articles, 76(1), 1–32. (Cited on pages 174
and 188.)
Carvalho, C. M., Polson, N. G., and Scott, J. G. 2010. The horseshoe estimator for
sparse signals. Biometrika, 97(2), 465–480. (Cited on page 136.)
Celeux, G., Forbes, F., Robert, C. P., and Titterington, D. M. 2006. Deviance informa-
tion criteria for missing data models. Bayesian Analysis, 1(4), 651–673. (Cited on
page 81.)
Chen, M.-H. 1994. Importance-weighted marginal Bayesian posterior density esti-
mation. Journal of the American Statistical Association, 89, 818–824. (Cited on
page 60.)
Chen, M.-H., and Shao, Q. 1999. Monte Carlo estimation of Bayesian credible and
HPD intervals. Journal of Computational and Graphical Statistics, 8, 69–92. (Cited
on page 50.)
Chen, M.-H., Shao, Q., and Ibrahim, J. G. 2000. Monte Carlo Methods in Bayesian
Computation. Springer. (Cited on pages 56, 59, and 131.)
Chib, S. 1995. Marginal likelihood from the Gibbs output. Journal of the American
Statistical Association, 90(432), 1313–1321. (Cited on page 132.)
Chib, S., and Jeliazkov, I. 2001. Marginal likelihood from the Metropolis–Hastings
output. Journal of the American Statistical Association, 96(453), 270–281. (Cited
on pages 132 and 133.)
Christensen, R., Johnson, W., Hanson, T., and Branscum, A. 2011. Bayesian Ideas and
Data Analysis: An Introduction for Scientists and Statisticians. CRC Press. (Cited
on page viii.)
Cowles, M. K. 1994. Practical issues in Gibbs sampler implementation with applica-
tion to Bayesian hierarchical modelling of clinical trial data. PhD thesis, University
of Minnesota. (Cited on page 203.)
Cowles, M. K., and Carlin, B. P. 1996. Markov chain Monte Carlo convergence diag-
nostics: a comparative review. Journal of the American Statistical Association, 91,
883–904. (Cited on pages 118 and 201.)
Damien, P., Wakefield, J., and Walker, S. 1999. Gibbs sampling for Bayesian non-
conjugate and hierarchical models by using auxiliary variables. Journal of the Royal
Statistical Society, B, 61(2), 331–344. (Cited on page 108.)
Dawid, A. P. 1985. The impossibility of inductive inference. (invited discussion of
Gelman, A., Hwang, J., and Vehtari, A. 2014b. Understanding predictive information
criterion for Bayesian models. Statistics and Computing, 24, 997–1016.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distribution and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6, 721–741. (Cited on pages 92 and 100.)
Gentle, J. E. 2004. Random Number Generation and Monte Carlo Methods. 2nd edn.
Springer. (Cited on page 45.)
Genz, A., and Kass, R. E. 1997. Subregion adaptative integration of functions having
a dominant peak. Journal of Computational and Graphical Statistics, 6, 92–111.
(Cited on page 55.)
George, E. I., and McCulloch, R. 1997. Approaches for Bayesian variable selection.
Statistica Sinica, 7, 339–373. (Cited on pages 134 and 135.)
George, E. I., and McCulloch, R. E. 1993. Variable selection via Gibbs sampling. Jour-
nal of the American Statistical Association, 88(423), 881–889. (Cited on pages 133
and 134.)
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo inte-
gration. Econometrica, 57, 1317–1339. (Cited on pages 54 and 70.)
Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating
posterior moments. In: Bayesian Statistics 4. Clarendon Press. (Cited on page 201.)
Geweke, J. 2004. Getting it right. Journal of the American Statistical Association,
99(467), 799–804. (Cited on pages 129 and 130.)
Geyer, C. J. 1992. Practical Markov chain Monte Carlo (with discussion). Statistical
Science, 7, 473–511. (Cited on page 117.)
Gillies, D. 2001. Bayesianism and the fixity of the theoretical framework. Pages 363–
379 of: Corfield, J., and Williamson, J. (eds.), Foundations of Bayesianism. Kluwer
Academic Publishers. (Cited on page 12.)
Givens, G. H., and Hoeting, J. A. 2005. Computational Statistics. Wiley. (Cited on
page 99.)
Gradshteyn, I., and Ryzhik, I. 2007. Table of Integrals, Series, and Products, Jeffrey,
A., and Zwillinger, D. (eds.). Academic Press.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika, 82, 711–732. (Cited on page 140.)
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57, 97–109. (Cited on page 92.)
Heidelberger, P., and Welch, P. 1983. Simulation run length control in the presence of
an initial transient. Operations Research, 31, 1109–1144. (Cited on page 201.)
Henderson, H. V., and Velleman, P. F. 1981. Building multiple regression models inter-
actively. Biometrics, 37, 391–411. (Cited on page 76.)
Hoff, P. D. 2009. A First Course in Bayesian Statistical Methods. Springer. (Cited on
page viii.)
Hoffman, M. D., and Gelman, A. 2014. The No-U-Turn sampler: Adaptively setting
path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research,
15(1), 1593–1623. (Cited on page 188.)
Jaynes, E. T. 1968. Prior probabilities. IEEE Transactions on Systems, Science and
Cybernetics, 4, 227–291. (Cited on page 22.)
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge University
Press. (Cited on pages 14 and 21.)
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1999. An introduction
to variational methods for graphical models. Machine Learning, 37(2), 183–233.
(Cited on page 166.)
Karabatsos, G. 2015. A menu-driven software package for Bayesian regression analy-
sis. The ISBA Bulletin, 22, 13–16. (Cited on page 174.)
Kass, R. E., and Raftery, A. E. 1995. Bayes factors. Journal of the American Statistical
Association, 90, 773–795. (Cited on page 87.)
Kass, R. E., and Wasserman, L. 1996. The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91, 1343–1370. (Cited on
page 18.)
Kempthorne, O., and Folks, L. 1971. Probability, Statistics and Data Analysis. Iowa
State University Press. (Cited on page 7.)
Kneib, T., Heinzl, F., Brezger, A., Bove, D., and Klein, N. 2014. BayesX: R Utilities
Accompanying the Software Package BayesX. R package version 0.2-9. (Cited on
page 195.)
Korner-Nievergelt, F., von Felten, S., Roth, T., et al. 2015. Bayesian Data Analysis in
Ecology Using Linear Models with R, BUGS, and Stan. Academic Press. (Cited on
page 175.)
Kruschke, J. 2011. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Aca-
demic Press/Elsevier. (Cited on page 175.)
Kruschke, J. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier. (Cited on page 175.)
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. 2017. Automatic
differentiation variational inference. Journal of Machine Learning Research, 18(1),
430–474. (Cited on page 170.)
Kuhn, T. S. 1962. The Structure of Scientific Revolutions. University of Chicago Press.
(Cited on page 5.)
Kuo, L., and Mallick, B. 1998. Variable selection for regression models. Sankhya: The
Indian Journal of Statistics, Series B, 60(1), 65–81. (Cited on page 134.)
Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities
on graphical structures and their application to expert systems. Journal of the Royal
Statistical Society, B, 50(2), 157–224. (Cited on page 176.)
Lin, D. 2013. Online learning of nonparametric mixture models via sequential vari-
ational approximation. Pages 395–403 of: Proceedings of the 26th International
Conference on Neural Information Processing Systems. USA: Curran Associates
Inc. (Cited on page 170.)
Lindgren, F., and Rue, H. 2015. Bayesian spatial modelling with R-INLA. Journal of
Statistical Software, Articles, 63(19), 1–25. (Cited on page 216.)
Lindgren, F., Rue, H., and Lindstrom, J. 2011. An explicit link between Gaussian
fields and Gaussian Markov random fields: the stochastic partial differential equation
approach. Journal of the Royal Statistical Society, B, 73(4), 423–498. (Cited on
page 216.)
Lindley, D. V. 1990. The 1988 Wald memorial lectures: the present position in Bayesian
statistics. Statistical Science, 5, 44–89. (Cited on page 11.)
Liu, J., and West, M. 2001. Combined parameter and state estimation in simulation-
based filtering. Pages 197–223 of: Doucet, A., de Freitas, N., and Gordon, N. (eds.),
Sequential Monte Carlo Methods in Practice. Springer. (Cited on page 66.)
Lunn, D., Spiegelhalter, D., Thomas, A., and Best, N. 2009. The BUGS project: Evo-
lution, critique and future directions. Statistics in Medicine, 28(25), 3049–3067.
(Cited on page 176.)
MacEachern, S., and Berliner, L. 1994. Subsampling the Gibbs sampler. The American
Statistician, 48, 188–190.
Madigan, D., and York, J. 1995. Bayesian graphical models for discrete data. Interna-
tional Statistical Review, 63, 215–232. (Cited on page 135.)
Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. 2012. Approximate Bayesian
computational methods. Statistics and Computing, 22(6), 1167–1180. (Cited on
page 128.)
Mayo, D., and Kruse, M. 2001. Principles of inference and their consequences. Pages
381–403 of: Corfield, J., and Williamson, J. (eds.), Foundations of Bayesianism.
Kluwer Academic Publishers. (Cited on page 9.)
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.
1953. Equation of state calculations by fast computing machines. Journal of Chemical
Physics, 21, 1087–1092. (Cited on pages 92 and 99.)
Morris, J. S., Baggerly, K. A., and Coombes, K. R. 2003. Bayesian shrinkage estimation
of the relative abundance of mRNA transcripts using SAGE. Biometrics, 59, 476–
486. (Cited on page 124.)
Neal, R. M. 1997. Markov Chain Monte Carlo methods based on “slicing” the density
function. Technical Report. University of Toronto. (Cited on page 108.)
Neal, R. M. 2003. Slice sampling (with discussion). Annals of Statistics, 31, 705–767.
(Cited on page 108.)
Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Chap. 5 of: Brooks, S.,
Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte
Carlo. Chapman & Hall / CRC Press. (Cited on pages 109 and 188.)
Neuenschwander, B., Branson, M., and Gsponer, T. 2008. Critical aspects of the
Bayesian approach to phase I cancer trials. Statistics in Medicine, 27, 2420–2439.
(Cited on page 55.)
Newton, M. A., and Raftery, A. E. 1994. Approximate Bayesian inference by the
weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical
Society, B, 56, 1–48. (Cited on page 88.)
Ntzoufras, I. 2009. Bayesian Modeling Using WinBUGS. Wiley. (Cited on page 175.)
O’Hagan, A. 2010. Bayesian Inference, Vol. 2B. 3rd edn. Arnold. (Cited on pages 1, 9,
14, and 18.)
O’Quigley, J., Pepe, M., and Fisher, L. 1990. Continual reassessment method: A prac-
tical design for phase 1 clinical trials in cancer. Biometrics, 46(1), 33–48. (Cited on
page 47.)
Park, T., and Casella, G. 2008. The Bayesian lasso. Journal of the American Statistical
Association, 103(482), 681–686. (Cited on page 136.)
Patil, V. H. 1964. The Behrens–Fisher problem and its Bayesian solution. Journal of
the Indian Statistical Association, 2, 21. (Cited on page 35.)
Paulino, C. D., and Singer, J. M. 2006. Análise de Dados Categorizados. Editora
Edgard Blücher.
Paulino, C. D., Soares, P., and Neuhaus, J. 2003. Binomial regression with misclassifi-
cation. Biometrics, 59, 670–675. (Cited on page 17.)
Paulino, C. D., Amaral Turkman, M. A., Murteira, B., and Silva, G. 2018. Estatística
Bayesiana. 2nd edn. Fundação Calouste Gulbenkian. (Cited on pages 18, 39, 56,
59, 86, 95, 161, and 201.)
Pitt, M. K., and Shephard, N. 1999. Filtering via simulation: Auxiliary particle fil-
ters. Journal of the American Statistical Association, 94(446), 590–599. (Cited on
pages 63, 64, and 65.)
Plummer, M. 2003. JAGS: a program for analysis of Bayesian graphical models using
Gibbs sampling. In: Hornik, K., Leisch, F., and Zeileis, A. (eds.), 3rd International
Workshop on Distributed Statistical Computing (DSC 2003). (Cited on page 174.)
Plummer, M. 2012. JAGS Version 3.3.0 User Manual. http://mcmc-jags.
sourceforge.net, accessed on July 22, 2018. (Cited on page 183.)
Plummer, M., Best, N. G., Cowles, M. K., and Vines, S. K. 2006. CODA: Conver-
gence diagnostics and output analysis for MCMC. R News, 6(1), 7–11. (Cited on
pages 118, 200, and 203.)
Polson, N. G., Stroud, J. R., and Müller, P. 2008. Practical filtering with sequential
parameter learning. Journal of the Royal Statistical Society, B, 70(2), 413–428.
(Cited on page 66.)
Prado, R., and West, M. 2010. Time Series: Modeling, Computation, and Inference.
Chapman & Hall/CRC Press. (Cited on page 66.)
Raftery, A. E., and Lewis, S. 1992. How many iterations in the Gibbs sampler? Pages
763–74 of: Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.), Bayesian Statis-
tics IV. Oxford University Press. (Cited on page 201.)
Raftery, A. E., Madigan, D., and Hoeting, J. A. 1997. Bayesian model averaging for
linear regression models. Journal of the American Statistical Association, 92(437),
179–191. (Cited on page 135.)
Richardson, S., and Green, P. J. 1997. On Bayesian analysis of mixtures with an un-
known number of components (with discussion). Journal of the Royal Statistical
Society, B, 59(4), 731–792. (Cited on page 143.)
Rickert, J. 2018. A first look at NIMBLE. Blog: https://rviews.rstudio.com/
2018/07/05/a-first-look-at-nimble/, accessed on July 16, 2018. (Cited on
page 174.)
Ripley, B. D. 1987. Stochastic Simulation. Wiley. (Cited on pages 45 and 46.)
Robert, C. P. 1994. The Bayesian Choice. Springer. (Cited on pages 27 and 159.)
Robert, C. P., and Casella, G. 2004. Monte Carlo Statistical Methods. 2nd edn. New
York: Springer. (Cited on pages 46 and 98.)
Rosner, B. 1999. Fundamentals of Biostatistics. Duxbury. (Cited on page 224.)
Ross, S. M. 2014. Introduction to Probability Models. 11th edn. Academic Press. (Cited
on page 93.)
Rossi, P. E., Allenby, G. M., and McCulloch, R. 2005. Bayesian Statistics and Market-
ing. Wiley. (Cited on page 174.)
Ročková, V., and George, E. I. 2014. EMVS: The EM approach to Bayesian variable se-
lection. Journal of the American Statistical Association, 109(506), 828–846. (Cited
on page 145.)
Rubinstein, R. Y. 1981. Simulation and the Monte Carlo Method. 1st edn. Wiley. (Cited
on page 46.)
Rue, H., and Held, L. 2005. Gaussian Markov Random Fields: Theory and Applica-
tions. Chapman & Hall. (Cited on page 161.)
Rue, H., Martino, S., and Chopin, N. 2009. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
Royal Statistical Society, B, 71(2), 319–392. (Cited on pages 152, 153, 164, 165,
166, 172, 216, and 219.)
Schofield, M. R., Barker, R. J., Gelman, A., Cook, E. R., and Briffa, K. 2016. A model-
based approach to climate reconstruction using tree-ring data. Journal of the Amer-
ican Statistical Association, 111, 93–106. (Cited on page 187.)
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6, 461–
466. (Cited on pages 79 and 85.)
Scott, S., Blocker, A., Bonassi, F., et al. 2016. Bayes and big data: the consensus Monte
Carlo algorithm. International Journal of Management Science and Engineering
Management, 11(2), 78–88. (Cited on page 70.)
Shaw, J. E. H. 1988. Aspects of numerical integration and summarization. Pages 625–
631 of: Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M. (eds.),
Bayesian Statistics 3. Oxford University Press. (Cited on page 54.)
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. London:
Chapman and Hall.
Smith, A. F. M. 1991. Bayesian computation methods. Philosophical Transactions of
the Royal Society of London, A, 337, 369–386.
Smith, A. F. M., and Gelfand, A. E. 1992. Bayesian statistics without tears. The Amer-
ican Statistician, 46, 84–88. (Cited on page 60.)
Smith, B. 2007. BOA: An R package for MCMC output convergence assessment and
posterior inference. Journal of Statistical Software, 21, 1–37. (Cited on pages 118,
200, and 204.)
Spiegelhalter, D. J. 1986. Probabilistic prediction in patient management and clinical
trials. Statistics in Medicine, 5(5), 421–433. (Cited on page 176.)
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. 2002. Bayesian
measures of model complexity and fit (with discussion). Journal of the Royal Sta-
tistical Society, B, 64, 583–639. (Cited on pages 80 and 81.)
Stan Development Team. 2014. RStan: The R Interface to Stan, Version 2.5.0. (Cited
on page 188.)
Sturtz, S., Ligges, U., and Gelman, A. 2005. R2WinBUGS: a package for running
WinBUGS from R. Journal of Statistical Software, 12(3), 1–16. (Cited on page 177.)
Tanner, M. A. 1996. Tools for Statistical Inference. 3rd edn. New York: Springer-Verlag.
(Cited on page 159.)
Tanner, M. A., and Wong, W. H. 1987. The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association, 82(398), 528–540.
(Cited on page 107.)
Thall, P. F., Millikan, R. E., Müller, P., and Lee, S.-J. 2003. Dose-finding with two
agents in phase I oncology trials. Biometrics, 59(3), 487–496. (Cited on pages 128
and 129.)
Thomas, A., O’Hara, B., Ligges, U., and Sturtz, S. 2006. Making BUGS open. R News,
6(01), 12–17. (Cited on page 174.)
Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the
Royal Statistical Society, B, 58(1), 267–288. (Cited on page 136.)
Tierney, L. 1994. Markov chains for exploring posterior distributions. Annals of Statis-
tics, 22, 1701–1728. (Cited on page 98.)
Tierney, L. 1996. Introduction to general state-space Markov chain theory. Pages 61–74
of: Gilks, W., Richardson, S., and Spiegelhalter, D. (eds.), Markov Chain Monte
Carlo in Practice. Chapman & Hall. (Cited on page 93.)
Tierney, L., and Kadane, J. 1986. Accurate approximations for posterior moments and
marginal densities. Journal of the American Statistical Association, 81(393), 82–86.
(Cited on pages 156 and 164.)
Tierney, L., Kass, R., and Kadane, J. 1989. Fully exponential Laplace approximations
to expectations and variances of nonpositive functions. Journal of the American
Statistical Association, 84(407), 710–716. (Cited on pages 158 and 159.)
Umlauf, N., Adler, D., Kneib, T., Lang, S., and Zeileis, A. 2015. Structured additive re-
gression models: An R interface to BayesX. Journal of Statistical Software,
63(21), 1–46. (Cited on pages 195, 196, and 197.)
Vehtari, A., and Ojanen, J. 2012. A survey of Bayesian predictive methods for model
assessment, selection and comparison. Statist. Surv., 6, 142–228.
Vehtari, A., Gelman, A., and Gabry, J. 2017. Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–
1432. (Cited on pages 190 and 194.)
Walker, A. M. 1969. On the asymptotic behaviour of posterior distributions. Journal of
the Royal Statistical Society, B, 31(1), 80–88. (Cited on page 153.)
Wasserman, L. 2004. All of Statistics. Springer-Verlag. (Cited on page 14.)
Watanabe, S. 2010. Asymptotic equivalence of Bayes cross validation and widely appli-
cable information criterion in singular learning theory. Journal of Machine Learning
Research, 11(Dec.), 3571–3594. (Cited on page 82.)
Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin
dynamics. Pages 681–688 of: Proceedings of the 28th International Conference on
International Conference on Machine Learning. Omnipress. (Cited on page 114.)
Zhang, Z., Chan, K. L., Wu, Y., and Chen, C. 2004. Learning a multivariate Gaussian
mixture model with the reversible jump MCMC algorithm. Statistics and Comput-
ing, 14(4), 343–355. (Cited on page 148.)
Index
probability
frequentist, 3
subjective, 6
probit regression, 121
pseudo-Bayes factor, 84
pseudo-prior, 138
metropolized, 139
R-INLA, 216
Rao–Blackwellization, 52
Rayleigh distribution, 42, 228
repeated sampling, 3
slice sampler, 107
Stan, 187
BOA, 213
CODA, 213
state space model, 62
stationary distribution, 94
Student t distribution, 68, 228
multivariate, 230
t distribution, see Student t distribution
transition function, 93
uniform distribution, 43, 227
variable selection
Bayesian lasso, 136
Dirichlet–Laplace, 136
horseshoe, 136
MC³, 135
SSVS, 133, 145
variational Bayes, 166
Stan, 225
variational family, 166
variational inference, 166
mean-field, 166
Weibull distribution, 228
WinBUGS, 176
BOA, 205
CODA, 205
Wishart distribution, 230