An Introduction to Bayesian Inference, Methods and Computation

Nick Heard
Imperial College London
London, UK
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2021, corrected publication 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The aim of writing this text was to provide a fast, accessible introduction to Bayesian
statistical inference. The content is directed at postgraduate students with a back-
ground in a numerate discipline, including some experience in basic probability
theory and statistical estimation. The text accompanies a module of the same name,
Bayesian Methods and Computation, which forms part of the online Master of
Machine Learning and Data Science degree programme at Imperial College London.
Starting from an introduction to the fundamentals of subjective probability, the
course quickly advances to modelling principles, computational approaches and then
advanced modelling techniques. Whilst this rapid development necessitates a light
treatment of some advanced theoretical concepts, the benefit is to fast track the reader
to an exciting wealth of modelling possibilities whilst still providing a key grounding
in the fundamental principles.
To make possible this rapid transition from basic principles to advanced modelling,
the text makes extensive use of the probabilistic programming language Stan, which
is the product of a worldwide initiative to make Bayesian inference on user-defined
statistical models more accessible. Stan is written in C++, meaning it is computa-
tionally fast and can be run in parallel, but the interface is modular and simple. The
future of applied Bayesian inference arguably relies on the broadening development
of such software platforms.
Chapter 1 introduces the core ideas of Bayesian reasoning: Decision-making under
uncertainty, specifying subjective probabilities and utility functions and identifying
optimal decisions as those which maximise expected utility. Prediction and estima-
tion, the two core tasks in statistical inference, are shown to be special cases of this
broader decision-making framework. The application-driven reader may choose to
skip this chapter, although philosophically it sets the foundation for everything that
follows.
Chapter 2 presents representation theorems which justify the prior × likelihood
formulation synonymous with Bayesian probability models. Simply believing that
unknown variables are exchangeable, meaning probability beliefs are invariant to
relabelling of the variables, is sufficient to guarantee that this construction must hold.
The prior distribution distinguishes Bayesian inference from frequentist statistical
methods, and several approaches to specifying prior distributions are discussed. The
prior and likelihood construction leads naturally to consideration of the posterior
distribution, including useful results on asymptotic consistency and normality which
suggest large sample robustness to the choice of prior distribution.
Chapter 3 shows how graphical models can be used to specify dependencies in
probability distributions. Graphical representations are most useful when the depen-
dency structure is a primary target of inferential interest. Different types of graphical
model are introduced, including belief networks and Markov networks, highlighting
that the same graph structure can have different interpretations for different models.
Chapter 4 discusses parametric statistical models. Attention is focused on conju-
gate models, which present the most mathematically convenient parametric approx-
imations of true, but possibly hard to specify, underlying beliefs. Although these
models might appear relatively simplistic, later chapters will show how these basic
models can form the basis of very flexible modelling frameworks.
Chapter 5 introduces the computational techniques which revolutionised the appli-
cation of Bayesian statistical modelling, enabling routine performance of infer-
ential tasks which had previously appeared infeasible. Relatively simple Markov
chain Monte Carlo methods were at the heart of this development, and these are
explained in some detail. A higher level description of Hamiltonian Monte Carlo
methods is also provided, since these methods are becoming increasingly popular
for performing simulation-based computational inference more efficiently. For high-
dimensional inference problems, some useful analytic approximations are presented
which sacrifice the theoretical accuracy of Monte Carlo methods for computational
speed.
Chapter 6 discusses probabilistic programming languages specifically designed
for easing some of the complexities of implementing Bayesian inferential methods.
Particular attention is given to Stan, which has experienced rapid growth in deploy-
ment. Stan automates parallel Hamiltonian Monte Carlo sampling for statistical infer-
ence on any suitably specified Bayesian model on a continuous parameter space. In
the subsequent chapters which introduce more advanced statistical models, Stan is
used for demonstration wherever possible.
Chapter 7 is concerned with model checking. There are no expectations for subjec-
tive probability models to be correct, but it can still be useful to consider how well
observed data appear to fit with an assumed model before making any further predic-
tions using the same model assumptions; it may make sense to reconsider alter-
natives. Posterior predictive checking provides one framework for model checking
in the Bayesian framework, and its application is easily demonstrated in Stan. For
comparing rival models, Bayes factors are shown to be a well-calibrated statistic for
quantifying evidence in favour of one model or the other, providing a vital Bayesian
analogue to Neyman-Pearson likelihood ratios.
Chapter 8 presents the Bayesian linear model as the cornerstone of regression
modelling. Extensions from the standard linear model to other basis functions such
as polynomial and spline regression highlight the flexibility of this fundamental
model structure. Further extensions to generalised linear models, such as logistic
and Poisson regression, are demonstrated through implementation in Stan.
1.1.1 Subjectivism
In the seminal work of de Finetti (see the English translation of de Finetti 2017),
the central idea for the Bayesian paradigm is to address decision-making in the
face of uncertainty from a subjective viewpoint. Given the same set of uncertain
circumstances, two decision-makers could differ in the following ways:
• How desirable different potential outcomes might seem to them.
• How likely they consider the various outcomes to be.
• How they feel their actions might affect the eventual outcome.
The Bayesian decision-making paradigm is most easily viewed through the lens of
an individual making choices (“decisions”) in the face of (personal) uncertainty. For
this reason, certain illustrative elements of this section will be purposefully written
in the first person.
This decision-theoretic view of the Bayesian paradigm represents a mathematical
ideal of how a coherent non-self-contradictory individual should aspire to behave.
This is a non-trivial requirement, made easier with various mathematical formalisms
which will be introduced in the modelling sections of this text. Whilst these for-
malisms might not exactly match my beliefs for specific decision problems, the aim
is to present sufficiently many classes of models that one of them might adequately
reflect my opinions up to some acceptable level of approximation.
Coherence is also the most that will be expected from a decision-maker; there
will be no requirement for me to choose in any sense the right decisions from any
perspective other than my own at that time. Everything within the paradigm is sub-
jective, even apparently absolute concepts such as truth. Statements of certainty such
as “The true value of the parameter is x” should be considered shorthand for “It is my
understanding that the true value of the parameter is x”. This might seem pedantic, but it simply emphasises that every such statement is an expression of personal belief.
There are numerous sources of individual uncertainty which can complicate decision-
making. These could include:
• Events which have not yet happened, but might happen some time in the future
• Events which have happened which I have not yet learnt about
• Facts which may yet be undiscovered, such as the truth of some mathematical
conjecture
• Facts which may have been discovered elsewhere, but remain unknown to me
• Facts which I have partially or completely forgotten
In the Bayesian paradigm, these and other sources of uncertainty are treated equally.
If there are matters on which I am unsure, then these uncertainties must be acknowl-
edged and incorporated into a rational decision process. Whether or not I perhaps
should know them is immaterial.
Example 1.1 If rolling a die, I might understandably assume that the outcome will be in Ω = {1, 2, 3, 4, 5, 6}. Alternatively, I could take a more conservative viewpoint and extend the space of outcomes to include some unintended or potentially unforeseen outcomes; for example, Ω = {Die roll does not take place, No valid outcome, 1, 2, 3, 4, 5, 6}.
Neither viewpoint in Example 1.1 could irrefutably be said to be right or wrong.
But if I am making a decision which I consider to be affected by the future outcome of
the intended die roll, I would possibly adopt different positions according to which
set of possible outcomes I chose to focus on. The only requirement for Ω is that it
should contain every outcome I currently conceive to be possible and meaningful to
the decision problem under consideration.
Definition 1.2 (Decision problem) Following Bernardo and Smith (1994), a decision
problem will be composed of three elements:
1. An action a, to be chosen from a set A of possible actions.
2. An uncertain outcome ω, thought to lie within a set Ω of envisaged possible
outcomes.
3. An identifiable consequence, assumed to lie within a set C of possible conse-
quences, resulting from the combination of both the action taken and the ensuing
outcome which occurs.
Axiom 1 C will be totally ordered, meaning there exists an ordering relation ≤C
on C such that for any pair of consequences c1 , c2 ∈ C , necessarily c1 ≤C c2 or
c2 ≤C c1 .
If both c1 ≤C c2 and c2 ≤C c1 , then we write c1 =C c2 . This provides definitions
of (subjective) preference and indifference between consequences.
Remark 1.1 Crucially, the ordering ≤C is assumed to be subjective; my perceived
ordering of the different consequences must be allowed to differ from that of other
decision-makers.
Definition 1.3 (Preferences on consequences) Suppose c1 , c2 ∈ C . If c1 ≤C c2 and
c1 ≠C c2 , then c2 is said to be a preferable consequence to c1 , written c1 <C c2 . If
c1 =C c2 , then I am indifferent between the two consequences.
Definition 1.4 (Action) An action defines a function which maps outcomes to con-
sequences. For simplicity of presentation, until Section 1.5.1 the actions in A will
be assumed to be discrete, meaning that each can be represented by a generic form
a = {(E 1 , c1 ), (E 2 , c2 ), . . .}, where c1 , c2 , . . . ∈ C , and E 1 , E 2 , . . . are referred to as
fundamental events which form a partition of Ω, meaning Ω = ∪i E i and E i ∩ E j = ∅
for i ≠ j. Then, for example, if I take action a, then I anticipate that any outcome
ω ∈ E 1 would lead to consequence c1 , and so on.
Remark 1.2 When actions are identified, in this way, by the perceived consequences
they will lead to under different outcomes, they are subjective.
Consider two actions

$$a = \{(E_1, c_1), (E_2, c_2), \ldots\}, \qquad a' = \{(E'_1, c'_1), (E'_2, c'_2), \ldots\}.$$

The overall desirability of each action will depend entirely on the uncertainty surrounding the fundamental events E 1 , E 2 , . . . and E′1 , E′2 , . . . and the desirability of the corresponding consequences c1 , c2 , . . . and c′1 , c′2 , . . .. This can be exploited in
two ways, which will be developed in later sections:
1. If I innately prefer action a to a′, then this preference can be used to quantify my
beliefs about the uncertainty surrounding the fundamental events characterising
each action. This will form the basis for eliciting subjective probabilities (see
Sect. 1.3).
2. Reversing the same argument, once I have elicited my probabilities for certain
events then these can be used to obtain preferences between corresponding actions
through the principle of maximising expected utility (see Sect. 1.4.1).
Remark 1.3 The two actions $\{(F, c_1), (\bar F, c_2)\}$ and $\{(E, c_1), (\bar E, c_2)\}$ only differ in the consequences anticipated for outcomes on which E and F disagree; for example, the event $\bar E \cap F$ would lead to a consequence of c1 under the first action and c2 under the second.
Central to the definition given by Bernardo and Smith (1994) for subjective probabil-
ity is the abstract concept of a continuous-indexed family of standard events, denoted
Sx for x ∈ [0, 1]. These standard events are constructed in relation to a hypotheti-
cal, abstract experiment, such that under the classical perspective of probability one
would assign probability x to the standard event Sx occurring, for 0 ≤ x ≤ 1.
As an illustrative example, consider the hypothetical spinning wheel depicted in
Fig. 1.1. This wheel is assumed to have unit circumference and to be plain in colour
apart from a shaded sector of arc length x ∈ [0, 1], creating an angle of 2π x radians
from a horizontal axis. A fixed needle is mounted above the wheel as shown. It could
be imagined that the wheel is to be spun (perhaps vigorously) from some arbitrary
starting orientation; when the wheel has come to rest, one observes whether the fixed
needle is lying within the shaded area of arc length x.
For each x ∈ [0, 1], define the corresponding standard event Sx to be the event that the needle lies within the shaded region once the wheel comes to rest. Under the classical perspective of probability, one would assign probability

$$\frac{\text{arc length}}{\text{circumference}} = \frac{x}{1} = x$$
to the event Sx . Later, these standard events will be used to form the basis of a
definition of subjective probability for quantifying individual uncertainty. Briefly, an
individual will assign probability x to an event E ⊆ Ω if they would be indifferent
between receiving a reward if E occurs or alternatively receiving the same reward if
Sx occurs.
Remark 1.5 Axiom 3 uses the continuity in x of the collection of standard events
{Sx : x ∈ [0, 1]}. It states that each event E can be mapped to a unique number
x ∈ [0, 1] through equivalence between E and the standard event Sx when imagined
as alternative opportunities to improve consequences (from c1 to c2 ). This provides
the definition of subjective probability.
Remark 1.7 Subjective probabilities can be elicited in two stages: First, a contin-
uous family of hypothetical standard events are constructed by the decision-maker,
such that for each x ∈ [0, 1] there is a corresponding standard event Sx with classical
probability x. Second, a probability P(E) ∈ [0, 1] is assigned to an uncertain event
E ⊆ Ω of interest by identifying equal preference between the two dichotomies
$\{(\bar E, c_1), (E, c_2)\}$ and $\{(\bar S_{P(E)}, c_1), (S_{P(E)}, c_2)\}$.
Remark 1.8 In some circumstances, the subjective assessment of the range of possi-
ble outcomes and the probabilities of events within that range may vary according to
which action is being considered; for example, the decision problem may be choos-
ing to roll either one or two dice, with corresponding consequences resulting from
the outcome. This presents no contradiction to the above definition, but all subjec-
tive probabilities should be regarded as conditional probabilities which implicitly
condition on a particular action.
For further reading, see Sect. 5.3 of Gelman and Hennig (2017) for a discussion of
subjective Bayesian reasoning within an interesting, wider discussion on objectivity
and subjectivity in science.
It is worth noting the contrast of Definition 1.6 with frequentist probability. Under
the frequentist interpretation, there exists a single probability of event E occurring,
equal to the long run relative frequency at which E would occur in a potentially
unlimited number of repetitions of the uncertain outcome.
Whilst these two interpretations of probability are fundamentally opposed, the two
could easily coincide when subjective probabilities are determined by an individual
using frequentist reasoning to arrive upon their own subjective beliefs.
Remark 1.9 This axiom says that once we condition on an event G occurring, for
any other event E we can still find an equivalent standard event.
Proposition 1.1 For events E, G ⊆ Ω such that P(G) > 0, the conditional proba-
bility of E given the assumed occurrence of G must necessarily be
$$P_{|G}(E \mid G) := \frac{P(E \cap G)}{P(G)}. \qquad (1.2)$$
The updating Eq. (1.2) provides the unique recipe for how beliefs must be updated
when additional information becomes available, and this can be further refined in the
following theorem.
Theorem 1.1 (Bayes’ theorem) For events E, G ⊆ Ω such that P(G) > 0,
$$P_{|G}(E \mid G) = \frac{P_{|E}(G \mid E)\, P(E)}{P(G)}.$$
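For example, returning to the die roll of Example 1.1 with Ω = {1, . . . , 6} and subjective probabilities P({i}) = 1/6 for each face: taking G = {2, 4, 6}, the event of an even outcome, and E = {1, 2}, the updating equation (1.2) gives P|G(E | G) = P({2})/P(G) = (1/6)/(1/2) = 1/3, and Bayes' theorem then recovers P|E(G | E) = (1/3 × 1/2)/(1/3) = 1/2, consistent with exactly one of the two outcomes in E being even.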
1.4 Utility
Remark 1.10 A utility function assigns a subjective numerical value to each of the
possible consequences.
$$a \le a' \iff \bar u(a) \le \bar u(a'),$$
implying one action will be preferable to another if and only if it has higher expected
utility.
Exercise 1.1 (Linear transformations of utilities) Show that decision problems are
unaffected by positive-gradient linear transformations to the utility function.
Example 1.3 Continuing Example 1.2, but now assuming a utility function, the
expected utilities of the two actions are
If the decision problem is bounded by a worst consequence $c_*$ and a best consequence $c^*$, then for simplicity and without loss of generality it can be assumed that $u(c_*) = 0$, $u(c^*) = 1$.
Then for any c ∈ C , the order-preserving requirement of a utility function determines that u(c) ∈ [0, 1] is the index of the standard event $S_{u(c)}$ such that $\{(\Omega, c)\} \sim \{(S_{u(c)}, c^*), (\bar S_{u(c)}, c_*)\}$ (Axiom 4).
If the decision problem is not bounded, then for some c1 <C c2 , (perhaps after linear
rescaling) it could be assumed without loss of generality that u(c1 ) = 0, u(c2 ) = 1.
Again, Axiom 4 and the order-preserving requirement then determine the rest of the
utility function; specifically, for c ∈ C :
1. If c1 ≤C c ≤C c2 , then $\{(\Omega, c)\} \sim \{(S_{u(c)}, c_2), (\bar S_{u(c)}, c_1)\}$.
2. If c <C c1 , then u(c) < 0 and if $\{(\Omega, c_1)\} \sim \{(\bar S_x, c), (S_x, c_2)\}$, then u(c) = −x/(1 − x).
3. If c2 <C c, then u(c) > 1 and if $\{(\Omega, c_2)\} \sim \{(\bar S_x, c_1), (S_x, c)\}$, then u(c) = 1/x.
Exercise 1.3 Unbounded utility. Suppose u(c1 ) = 0, u(c2 ) = 1. Show that by Axiom 6, the following must hold.
(i) If c <C c1 and $\{(\Omega, c_1)\} \sim \{(\bar S_x, c), (S_x, c_2)\}$, then u(c) = −x/(1 − x) < 0.
(ii) If c2 <C c and $\{(\Omega, c_2)\} \sim \{(\bar S_x, c_1), (S_x, c)\}$, then u(c) = 1/x > 1.
Remark 1.11 Randomised actions are a useful extension of the deterministic actions
considered until now. Although sometimes counter-intuitive, in many circumstances
they can be shown to correspond to optimal or near-optimal behaviours.
where $\bar u_{|G_i}(a_{G_i} \mid G_i) := \sum_j P_{|G_i}(E_{ij} \mid G_i)\, u(c_{ij})$ is the conditional expected utility
of action aG i given the occurrence of event G i .
Remark 1.12 Definition 1.13 simply says that the expected utility of a randomised
action is the expectation of the conditional expected utilities of the individual actions.
By considering randomised actions, it can now be shown that the equation for condi-
tional probability (1.2) is necessary when specifying subjective probabilities if those
probabilities are to yield coherent expected utilities, and therefore coherent decisions.
Consider a randomised action $a = \{(G, a_G), (\bar G, a_{\bar G})\}$ such that $a_G = \{(E, c^*), (\bar E, c_*)\}$ and $a_{\bar G} = \{(\Omega, c_*)\}$, where $u(c_*) = 0$ and $u(c^*) = 1$. Then by Definition
1.13, a has expected utility
As noted in Definition 1.4, the initial notation used for actions has presumed dis-
creteness, with a countable partition of Ω leading to countably many consequences
and associated utilities. This section will consider cases where Ω and the space of
actions might be uncountable.
Definition 1.14 (Decision space) A decision space (or continuous action space) is
a set of mappings D = {d : Ω → C } such that the consequence of taking a decision
d ∈ D and observing outcome ω is d(ω) ∈ C .
Definition 1.15 (Expected utility of a decision) For a utility function u : C → R,
the expected utility of a decision d : Ω → C is the usual expectation

$$\bar u(d) = \int_\Omega u(d(\omega))\, dP(\omega).$$
Consider the special case of the decision problem which is to estimate the future
realised value of the unknown outcome ω ∈ Ω. In the typical notation of statisti-
cal estimation, a decision constitutes providing an estimated value ω̂. The eventual consequence is the pair (ω̂, ω). Writing ℓ(ω̂, ω) for a loss function quantifying the undesirability of each such consequence, each estimate ω̂ defines a decision

$$d_{\hat\omega} = (\hat\omega, \cdot)$$

such that for ω ∈ Ω, $d_{\hat\omega}(\omega) = (\hat\omega, \omega)$, and the expected utility is the negative expected loss,

$$\bar u(d_{\hat\omega}) = -\int_\Omega \ell(\hat\omega, \omega)\, f(\omega)\, d\omega,$$

where f denotes the density of P.
Exercise 1.6 Absolute loss (also known as L 1 loss). If ℓ(ω̂, ω) = |ω̂ − ω|, show that the Bayes optimal decision is to estimate ω by the median of P.

Exercise 1.7 Squared loss (also known as L 2 loss). If ℓ(ω̂, ω) = (ω̂ − ω)², show that the Bayes optimal decision is to estimate ω by the mean of P.

Exercise 1.8 Zero-one loss (also known as L ∞ loss). If ℓ(ω̂, ω) = 1 − 1{ω̂} (ω), show that the Bayes optimal decision is to estimate ω by the mode of P.
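The optimality results asserted in Exercises 1.6 and 1.7 can be checked numerically. The sketch below (Python; the Gamma distribution simply stands in for some skewed belief distribution P, an assumption made for illustration) approximates each expected loss over a grid of candidate estimates ω̂ and compares the minimisers with the sample median and mean.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # draws representing P

grid = np.linspace(0.0, 8.0, 401)                      # candidate estimates
abs_loss = np.abs(grid[:, None] - omega[None, :]).mean(axis=1)   # E|w_hat - w|
sq_loss = ((grid[:, None] - omega[None, :]) ** 2).mean(axis=1)   # E(w_hat - w)^2

print("L1 minimiser:", grid[abs_loss.argmin()], "  median:", np.median(omega))
print("L2 minimiser:", grid[sq_loss.argmin()], "  mean:  ", omega.mean())
```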
1.5.3 Prediction
In the preceding sections it would have been natural to envisage ω as a scalar number,
such as the outcome from rolling a die. However, this need not be the case. Bayesian
prediction is the task of estimating an entire probability distribution, rather than a
scalar; correspondingly, in this case ω is an unknown probability distribution on a
space X and Ω is a space of probability distributions on X which I believe contains
ω.
As discussed throughout this chapter, in a Bayesian setting I will have my own
beliefs about ω, characterised by my own subjective probability distribution P(E)
for my probability that ω lies in a subset of probability distribution space E ⊆ Ω.
To avoid self-contradiction, for coherent prediction it should be a requirement
that my optimal decision when estimating ω should be to state my own beliefs. That
is, the optimal decision dω̂ should satisfy, for events F ⊆ X ,
$$\hat\omega(F) = \int_\Omega \omega(F)\, dP(\omega), \qquad (1.3)$$
where the right-hand side is my marginal probability for the event F, obtained as an
expectation of the probability of F, ω(F), with respect to my uncertainty about ω
encapsulated by P.
Satisfying (1.3) clearly places constraints on what are allowable loss functions to
lead to coherence. In fact, it can be shown (Bernardo and Smith 1994, Section 2.7)
that the only proper loss functions for coherent prediction have a canonical form
which is the well-known Kullback-Leibler divergence from information theory for
measuring the difference of one probability distribution from another.
If p, q are corresponding density functions satisfying p(x) > 0 =⇒ q(x) > 0, then

$$\mathrm{KL}(p \,\|\, q) := \int p(x) \log \frac{p(x)}{q(x)}\, dx = \mathbb{E}_p\!\left[\log \frac{p(x)}{q(x)}\right].$$
Using this definition of KL-divergence, the necessary form for a proper loss function is

$$\ell(\hat\omega, \omega) = \mathrm{KL}(\omega \,\|\, \hat\omega) = \int_{\mathcal X} \log \frac{d\omega}{d\hat\omega}\, d\omega. \qquad (1.4)$$
This justifies the use of KL-divergence for measuring discrepancy between two
probability distributions from a Bayesian perspective.
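As a small numerical illustration of this discrepancy measure, the sketch below (Python; the two discrete distributions are arbitrary examples) computes KL(p ‖ q), which is non-negative and zero only when p = q.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; requires q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # strictly positive since p != q
print(kl_divergence(p, p))   # 0.0
```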
The first chapter introduced the philosophy of Bayesian statistics: when making
individual decisions in the face of uncertainty, probability should be treated as a
subjective measure of beliefs, where all quantities unknown to the individual should
be treated as random quantities.
Eliciting individual probability assessments is a non-trivial endeavour. Even if
I have a relatively well-formed opinion about some uncertain quantity, coherently
assigning precise numerical values (probabilities) to all potential outcomes of inter-
est for that quantity can be particularly challenging when there are infinitely many
possible outcomes.
To counter these difficulties, it can be helpful to consider mathematical models to
represent an individual’s beliefs. There is no presumption that these models should
be somehow correct in terms of representing true underlying dynamics; nonetheless,
they can provide structure for representing beliefs coherently to a good enough degree
of approximation to enable valid decision-making.
The main simplification which will be considered, exchangeability, occurs in con-
texts where a sequence of random quantities are to be observed and a joint probability
distribution for the sequence is required. Symmetries in one’s beliefs about sequences
lead to familiar specifications of probability models which are often considered to be
the hallmark of Bayesian thinking: a likelihood distribution combined with a prior
distribution.
Let X 1 , X 2 , . . . be a sequence of real-valued random variables to be observed, which are
mappings of an underlying unknown outcome ω ∈ Ω with probability distribution P.
Remark 2.2 Theorem 2.1 shows that any infinitely exchangeable sequence of binary
random variables must arise as a sequence of independent and identically distributed
Bernoulli(θ ) random variables, with a single probability parameter θ drawn from
some distribution Q.
The same property does not extend to finitely exchangeable sequences.
Under the representation of Theorem 2.1, the conditional probability mass function for future elements X m+1 , . . . , X n after observing x1 , . . . , xm has the form

$$p_{X_{m+1},\ldots,X_n \mid x_1,\ldots,x_m}(x_{m+1}, \ldots, x_n) = \int_{\theta=0}^{1} \prod_{i=m+1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}\, dQ(\theta \mid x_1, \ldots, x_m),$$

where

$$dQ(\theta \mid x_1, \ldots, x_m) = \frac{\prod_{i=1}^{m} \theta^{x_i} (1-\theta)^{1-x_i}\, dQ(\theta)}{\int_{\theta=0}^{1} \prod_{i=1}^{m} \theta^{x_i} (1-\theta)^{1-x_i}\, dQ(\theta)}.$$
Remark 2.3 Observing part of the sequence does not affect exchangeability, and
therefore Theorem 2.1. The initial prior distribution Q(θ ) is simply updated to the
current posterior distribution Q(θ | x1 , . . . , xm ).
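Remark 2.3 can be made concrete by choosing Q to be a Beta distribution, an assumption made here purely for illustration (conjugate choices of this kind are developed in Chap. 4). The sketch below (Python) updates the prior Q(θ) to the posterior Q(θ | x1 , . . . , xm) after observing a binary sequence and reports the implied predictive probability that the next observation equals 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50)        # an observed exchangeable binary sequence

a, b = 1.0, 1.0                          # Q(theta) = Beta(a, b), a uniform prior
a_post = a + x.sum()                     # Beta posterior parameters after
b_post = b + len(x) - x.sum()            # observing x_1, ..., x_m

# Posterior predictive probability for the next element of the sequence
print("P(X_{m+1} = 1 | x) =", a_post / (a_post + b_post))
```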
Theorem 2.1 extends to non-binary, infinitely exchangeable sequences.
Theorem 2.2 Let X 1 , X 2 , . . . be a sequence of real-valued random variables,
X i ∈ R, which are believed to be infinitely exchangeable and let R be the space
of all probability distributions on R. Then for any n ≥ 1, necessarily there exists a
probability measure Q on R such that
$$P_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \int_{F \in \mathcal{R}} \prod_{i=1}^{n} F(x_i)\, dQ(F). \qquad (2.2)$$
where

$$dQ(\theta \mid x_1, \ldots, x_m) = \frac{\prod_{i=1}^{m} F(x_i; \theta)\, dQ(\theta)}{\int_\Theta \prod_{i=1}^{m} F(x_i; \theta)\, dQ(\theta)}. \qquad (2.4)$$
Theorem 2.2 justifies the standard “prior × likelihood” approach commonly applied
to Bayesian statistical modelling of real-valued data: assuming a sampling distribu-
tion comprising identically distributed observables which are conditionally indepen-
dent given an unknown parameter, where the parameter is assumed to be an initial
draw from some “prior” probability distribution.
The likelihood component in this mixture is common to both Bayesian and fre-
quentist statistical approaches, and so more scepticism and attention is often directed
towards how the prior component is specified in Bayesian methods. It is referred to as
the “prior distribution” because it reflects one’s beliefs about the generative mecha-
nism, F, before observing any of the variables X 1 , X 2 , . . .; in contrast the “posterior
distribution” (2.4) reflects updated beliefs about F after observing those data.
See Sect. 5.4 of Bernardo and Smith (1994) for discussion of
reference priors and reference decisions; the latter identify the optimal decisions
under a least informative prior—not for operational use for any individual decision-
maker, but to serve as an illustrative benchmark for comparison.
2.3.3 Hyperpriors
For example, given a discrete hyperparameter φ ∈ {1, . . . , k} with prior weights wi = P(φ = i), the implied marginal prior distribution for θ is the mixture

$$Q_\theta(\theta) = \sum_{i=1}^{k} w_i\, Q_{\theta \mid \phi=i}(\theta).$$
Given an initial probability distribution reflecting prior beliefs about F and then
observing X 1 , . . . , X n as draws from F, Exercise 2.2 demonstrated the transition
from prior distribution, through the likelihood function, to the posterior distribution
(in this case for infinitely exchangeable random variables). This transformation was a
simple application of Theorem 1.1, Bayes’ theorem, and represents the only coherent
mechanism for updating subjective probabilities.
In principle, the Bayesian paradigm for reporting scientific conclusions from a
fixed collection of data suggests repeating this prior to posterior transformation for a
range of different prior distributions, selected to cover a broad range of prior beliefs
which may plausibly be held by the reader; for each prior distribution, the author
would present the consequent posterior distribution and perhaps a corresponding
optimal decision. However, in practice this procedure is often truncated, with authors
preferring to show a single analysis under a non-informative prior (cf. Sect. 2.3.2),
with the implication that inputting any more informative prior information would
simply bias the conclusions in that direction, albeit by an unspecified amount.
Proposition 2.1 Suppose Q(θ ) is a discrete distribution with dQ(θ ∗ ) > 0, and that the data x1 , x2 , . . . arise as independent draws from F(·; θ ∗ ). If, for all other θ ≠ θ ∗ satisfying dQ(θ ) > 0, KL(F(·; θ ) ‖ F(·; θ ∗ )) > 0, then

$$\lim_{n \to \infty} dQ(\theta^* \mid x_1, \ldots, x_n) = 1.$$
Remark 2.7 It was remarked in Sect. 1.1.1 that the subjective Bayesian paradigm
attaches no particular importance to absolute truths. From that perspective, Proposi-
tion 2.1 might appear to lack any operational significance; in subjective probability,
there is no true likelihood and no true parameter value, nor will there be infinite
random samples to observe.
However, there is a useful conclusion to draw: If you and I agree on exchange-
ability, the form of the sampling distribution F(·; θ) and the range of values which
θ reasonably might take, then even if we disagree on a form for the prior Q(θ ), as
we observe more data our posterior beliefs will uniformly converge. So for reporting
scientific inference on “big data” applications, only the likelihood function really
matters.
The maximum likelihood estimate of θ is

$$\hat\theta_n = \arg\max_\theta \prod_{i=1}^{n} F(x_i; \theta). \qquad (2.6)$$
For the maximum likelihood estimator, asymptotically $\hat\theta_n \sim \text{Normal}_k(\theta^*, I_n^{-1}(\theta^*))$, where In (θ) is the so-called Fisher information matrix of the likelihood function,

$$I_n(\theta) = -\frac{d^2}{d\theta^2} \sum_{i=1}^{n} \log F(x_i; \theta). \qquad (2.7)$$
Proposition 2.2 Let $m_0 = \arg\max_\theta dQ(\theta)$ be the mode of the prior distribution, and let $I_0(\theta) = -\frac{d^2}{d\theta^2} \log dQ(\theta)$. Then, defining

$$H_n = I_0(m_0) + I_n(\hat\theta_n), \qquad (2.8)$$
$$m_n = H_n^{-1}\left\{ I_0(m_0)\, m_0 + I_n(\hat\theta_n)\, \hat\theta_n \right\}, \qquad (2.9)$$

the posterior distribution of θ is approximately $\text{Normal}_k(m_n, H_n^{-1})$ for large n.
Proof For a sketch proof involving a Taylor series expansion, see Bernardo and
Smith (1994, p. 287).
Remark 2.8 Proposition 2.2 states that a large sample posterior distribution can be
well approximated by a Gaussian; as n → ∞ the mean of that Gaussian tends to
the true value θ ∗ and the variance shrinks toward zero provided θ ∗ is identifiable,
implying posterior consistency.
3.1 Graphs
Definition 3.2 (Directed and undirected graphs) For a graph G = (V, E), if E is
symmetric such that (v, v′) ∈ E ⇐⇒ (v′, v) ∈ E, then the graph G is said to be
undirected. Otherwise, the edges and graph are directed.
Remark 3.1 Figures 3.1 and 3.2 provide diagrammatic examples of directed and
undirected graphs. Each link drawn between nodes corresponds to an edge; in the
directed graph, these links must have arrows to indicate their direction.
In the context of graphical modelling, the set of nodes in the graph will corre-
spond to a finite set of random variables V = {X 1 , . . . , X n } for which a probability
model must be constructed. The set of edges will correspond to proposed dependen-
cies between variables, defined in different ways according to different modelling
constructs.
Definition 3.3 (Adjacency matrix) For a finite graph G = (V, E), where V =
{X 1 , . . . , X n }, the adjacency matrix of the graph is a binary n × n matrix AG with
entries in {0, 1}, such that (AG )i j = 1 ⇐⇒ (X i , X j ) ∈ E.
Definition 3.4 (Parents and children) In a directed graph G = (V, E), the parents
of node X i ∈ V is the set of nodes which connect to X i through an edge in E,
parents(X i ) = {X j ∈ V : (X j , X i ) ∈ E}. Similarly, the children of X i is the subset
of V connected to by X i , children(X i ) = {X j ∈ V : (X i , X j ) ∈ E}.
Exercise 3.1 (Identifying parents and children) For the directed graph in Fig. 3.1,
find the parents and children of each node in V = {X 1 , X 2 , X 3 , X 4 }.
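Definitions 3.3 and 3.4 translate directly into code. The sketch below (Python) builds the adjacency matrix of a small directed graph and reads the parents and children of each node off its columns and rows; the edge set is hypothetical, since the edges of Fig. 3.1 are not reproduced here.

```python
import numpy as np

nodes = ["X1", "X2", "X3", "X4"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]      # hypothetical directed edges (i -> j)

A = np.zeros((len(nodes), len(nodes)), dtype=int)
for i, j in edges:
    A[i, j] = 1                               # (A_G)_{ij} = 1  <=>  (X_i, X_j) in E

for j, name in enumerate(nodes):
    parents = [nodes[i] for i in np.flatnonzero(A[:, j])]    # edges into X_j
    children = [nodes[k] for k in np.flatnonzero(A[j, :])]   # edges out of X_j
    print(name, "parents:", parents, "children:", children)
```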
Definition 3.9 (Clique) In an undirected graph G = (V, E), a clique is a fully con-
nected subset of V . Furthermore, a clique is said to be maximal in the graph if there
is no superset which is also a clique.
Exercise 3.3 (Identifying cliques) For the graph of Fig. 3.2, identify the maximal
cliques.
Definition 3.10 (Separation through a set) For an undirected graph G = (V, E) and
disjoint node subsets A , B, C ⊂ V = {X 1 , . . . , X n }, if every path from an element
of A to an element of B contains an element of C , then C is said to separate A
from B.
Exercise 3.4 (Identifying separating sets) For the graph in Fig. 3.2, find the sepa-
rating sets.
Definition 3.11 (Separation) For A , B ⊂ V , A is separated from B in G =
(V, E) if there is no path in G between an element of A and an element of B.
Definition 3.12 (Belief network) Let G be a DAG on the node set of random variables
V = {X 1 , . . . , X n }. A belief network (also known as a causal graph) with graph G
assumes the joint probability distribution factorises as
$$P_G(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{parents}_G(X_i)). \qquad (3.1)$$
Remark 3.4 In a belief network, the set of DAG edges imply a collection of condi-
tional independence statements, although not uniquely; one joint probability distri-
bution can often be represented by multiple alternative DAGs.
Exercise 3.5 Interpreting the graph in Fig. 3.1 as a belief network, state the form of
the implied joint probability distribution using the notation of (3.1).
[Fig. 3.3: three directed graphs, each on the nodes X 1 , X 2 , X 3 ]
Proposition 3.1 For a belief network on a directed graph G = (V, E) and disjoint
node subsets A , B, C ⊂ V , if C d-separates A from B then A ⊥⊥ B | C in the
joint distribution PG of the belief network.
Corollary 3.1 Trivially from Proposition 3.1, the connected components of a graph
in a belief network are independent.
Exercise 3.8 (Identifying conditional independencies in a belief network) For each
of the graphs in Fig. 3.3, state the dependence between X 1 and X 3 (i) marginally;
(ii) conditionally given X 2 .
Definition 3.17 (Markov network) Let G be an undirected graph on the node set
{X 1 , . . . , X n }. A Markov network with graph G assumes the joint probability distri-
bution factorises as

$$P_G(X_1, \ldots, X_n) \propto \prod_{i=1}^{C} \phi_i(\boldsymbol X_i), \qquad (3.2)$$

where $\boldsymbol X_1, \ldots, \boldsymbol X_C$ are the maximal cliques of G and $\phi_1, \ldots, \phi_C$ are corresponding non-negative potential functions.
[Fig. 3.4: an undirected chain graph X 1 - X 2 - X 3 ]
Exercise 3.10 (Pairwise Markov network distribution) Interpreting the graph in Fig.
3.2 as a pairwise Markov network, state the form of the implied joint probability
distribution using the notation of (3.3).
Remark 3.7 The definitions of a Markov network and pairwise Markov network
coincide if and only if the maximal cliques are all edges, meaning there are no
triangles in the graph.
Remark 3.8 For the graph in Fig. 3.4, the definitions of Markov networks and
pairwise Markov networks coincide, both implying PG (X 1 , X 2 , X 3 ) ∝ φ1 (X 1 , X 2 )
φ2 (X 2 , X 3 ). In general, this simple graph would imply X 1 and X 3 are dependent, but
conditionally independent given X 2 .
Definition 3.19 (Markov random field) Let G be an undirected graph on the node set
{X 1 , . . . , X n }. A Markov random field on G assumes the full conditional probability
distributions satisfy

$$P(X_i \mid X_{-i}) = P(X_i \mid \text{neighbours}_G(X_i)), \qquad i = 1, \ldots, n,$$

where $X_{-i} = \{X_1, \ldots, X_n\} \setminus \{X_i\}$.
Remark 3.10 The definition of a Markov random field is equivalent to the earlier
definition of a Markov network, as characterised by (3.2).
Exercise (Gaussian Markov random field) Suppose X 1 , . . . , X n are jointly multivariate normal with covariance matrix Σ, and let G = (V, E) be the undirected graph defined by

$$(\Sigma^{-1})_{ij} = 0 \iff (X_i, X_j) \notin E.$$
Show that a GMRF satisfies Definition 3.19 for a Markov random field.
Figure 3.5 shows an example of a lattice graph. Lattice graphs provide another case
where the definitions of a Markov network/random field and a pairwise Markov
network coincide. As Markov random fields, these structures are known as lattice
models.
Definition 3.20 (Factor graph) Let G be an (undirected) graph on the extended node
set {X 1 , . . . , X n } ∪ {θ1 , . . . , θk }. A factor graph model assumes the joint probability
distribution for X 1 , . . . , X n factorises as
$$P_G(X_1, \ldots, X_n) \propto \prod_{i=1}^{k} \phi_i(\text{neighbours}_G(\theta_i) \cap \{X_1, \ldots, X_n\}). \qquad (3.4)$$
Remark 3.11 There should be no edges between pairs of factor nodes or between pairs of variable nodes in a factor graph, since such edges would have no bearing on (3.4).
[Fig. 3.6: a factor graph on variable nodes X 1 , . . . , X 5 with factor nodes θ1 , θ2 ]
Remark 3.12 By introducing additional nodes, factor graphs can represent richer dependency structures than either belief networks or Markov networks; one belief network or Markov network could correspond to multiple possible factor graphs.
Figure 3.6 shows an example factor graph, where the shaded nodes indicate latent
factors.
The edge structure in Fig. 3.6 implies a factorisation of the joint distribution
PG (X 1 , X 2 , X 3 , X 4 , X 5 ) ∝ φ1 (X 1 , X 3 , X 5 )φ2 (X 2 , X 4 , X 5 ).
Section 2.3.3 introduced the idea of specifying probability distributions for unknowns
through hierarchies. Such hierarchies can be interpreted as graphical models.
Fig. 3.7 Exchangeability for X 1 , . . . , X n represented as (a) a belief network or (b) a factor graph
Remark 3.14 Hierarchical model formulations are equivalent to both belief net-
works (Sect. 3.2.1) and factor graphs (Sect. 3.2.3). They can be represented graphi-
cally in either way.
Example 3.1 De Finetti’s representation Eqs. (2.2) and (2.3) for exchangeable vari-
ables X 1 , . . . , X n are simple hierarchical models. This representation is depicted
graphically as both a belief network and a factor model in Fig. 3.7. The differences
are the undirected edges and the explicit interpretation of θ as a latent parameter in
the factor graph. The shaded nodes indicate latent random variables which will not
be observed.
Fig. 3.8 A belief network representation of a hierarchical model for an n × p matrix of random
variables (X i j ) with two layers of exchangeability: firstly in the rows, secondly in the row entries
Chapter 4
Parametric Models
For observed values x = (x 1 , . . . , x n ), the joint likelihood under the De Finetti representation (4.1) is

$$p(x \mid \theta) := \prod_{i=1}^{n} p(x_i \mid \theta),$$
and the posterior density for θ (2.4) can be expressed most simply as

$$\pi(\theta) := p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta), \qquad (4.2)$$

with normalised form given by Bayes' theorem,

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}, \qquad p(x) = \int_\Theta p(x \mid \theta)\, p(\theta)\, d\theta. \qquad (4.3)$$

¹Integration, here and elsewhere, refers to Lebesgue integration for densities of continuous random variables, and summation for densities (or mass functions) of discrete random variables.
Generalisations of the results in Tables 4.1 and 4.2 to more than one observation
from the likelihood model are straightforward; for a second observation, the posterior
p(θ | x) from the right hand column adopts the role of the prior in the middle column,
simply updating the parameter values within the same parametric family.
Proposition 4.1 For conjugate parametric models, the marginal likelihood p(x)
will have a closed-form equation.
Proof For conjugate models the posterior density will have a closed analytic form;
the marginal likelihood could therefore be obtained through rearranging (4.3),
36 4 Parametric Models
p(x | θ ) p(θ )
p(x) = . (4.4)
π(θ )
Any terms involving θ in (4.4) will necessarily cancel, leaving a ratio of normalising
constants from the likelihood and prior densities and the posterior density which do
not depend on θ .
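Proposition 4.1 can be verified numerically. The sketch below (Python) uses a Beta-Bernoulli model, an assumed example anticipating Tables 4.1 and 4.2: the closed-form marginal likelihood is compared with the ratio (4.4) evaluated at an arbitrary θ, whose θ-dependent terms cancel exactly.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

x = [1, 0, 1, 1, 0, 1, 1, 1]            # binary observations
n, s = len(x), sum(x)
a, b = 2.0, 2.0                         # Beta(a, b) prior, chosen for illustration

# Closed-form marginal likelihood: p(x) = B(a + s, b + n - s) / B(a, b)
log_px = log_beta(a + s, b + n - s) - log_beta(a, b)

# The ratio (4.4), evaluated at an arbitrary theta
theta = 0.3
log_lik = s * log(theta) + (n - s) * log(1 - theta)
log_prior = (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b)
log_post = (a + s - 1) * log(theta) + (b + n - s - 1) * log(1 - theta) \
           - log_beta(a + s, b + n - s)

print(exp(log_px), exp(log_lik + log_prior - log_post))   # identical values
```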
Proposition 4.2 If p(x | θ ) is an exponential family of the form (4.5), then any
normalisable density satisfying
Remark 4.4 Proposition 4.2 provides another justification for the popularity of
exponential family models in statistics; they all have conjugate Bayesian priors.
All of the likelihood models in Tables 4.1 and 4.2 are exponential families.
For a given likelihood model in the De Finetti representation (4.1), adopting a conju-
gate prior distribution is certainly attractive for mathematical convenience. However,
outside of simple exponential family examples, most likelihood models will not have
a conjugate prior distribution.
Besides conjugate models, a tractable posterior distribution will always be theo-
retically available whenever the prior distribution is discrete and has finite support
Θ = {θ : p(θ ) > 0}; in this case, the marginal likelihood p(x) which serves as the
normalising constant of (4.3) is simply the finite sum
$$p(x) = \sum_{\theta \in \Theta} p(x \mid \theta)\, p(\theta). \qquad (4.7)$$
However, in practice, if the number of support points |Θ| is very large (for example,
if θ is high-dimensional) then the summation (4.7) may still be too expensive to
compute.
Moreover, a Bayesian model specification should aim to reflect subjective beliefs,
and adopting a prior distribution simply for reasons of mathematical convenience is
not consistent with this objective. Therefore, in many applications analysts will be
faced with performing inference with non-conjugate statistical models with analyt-
ically intractable posterior distributions which can be identified only up to propor-
tionality through (4.2).
Given a known posterior density π(θ ) (4.2) obtained under an assumed parametric
model, a decision-maker might be interested in visualising or quantifying some lower-
dimensional summaries of this density; this can be particularly useful if the parameter
θ is multi-dimensional, perhaps with high dimension, meaning the density π(θ )
cannot simply be plotted.
A 100α% credible region Rα ⊆ Θ for θ is any region satisfying

$$P(\theta \in R_\alpha) = \alpha.$$
Remark 4.5 For a given probability distribution and coverage probability 0 < α <
1, infinitely many valid credible intervals may exist.
Exercise 4.6 Let π(θ ) = λe−λθ 1[0,∞) (θ ) with λ > 0. For 0 ≤ α ≤ 1, calculate the
equal-tailed 100α% credible interval for θ .
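A sketch of the calculation required in Exercise 4.6 (Python): the exponential quantile function is available in closed form, F⁻¹(p) = −log(1 − p)/λ, so the equal-tailed interval follows directly from the two tail quantiles.

```python
from math import log

def exp_equal_tailed(lam, alpha):
    """Equal-tailed 100*alpha% credible interval for theta with density
    pi(theta) = lam * exp(-lam * theta) on [0, infinity)."""
    inv_cdf = lambda p: -log(1.0 - p) / lam        # exponential quantile function
    return inv_cdf((1.0 - alpha) / 2), inv_cdf((1.0 + alpha) / 2)

print(exp_equal_tailed(lam=2.0, alpha=0.95))
```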
Chapter 5
Computational Inference
In Sect. 1.5, estimation and prediction were presented as Bayesian decision problems.
Given a subjective probability distribution for an unknown quantity and a subjec-
tively chosen utility or loss function, the Bayes estimate was shown to be the value
which maximises expected utility or equivalently minimises expected loss. Obtain-
ing this estimate apparently requires two stages of calculation: obtaining an analytic
expression for the subjective probability distribution and then using this distribution
to calculate expectations.
In the first stage, suppose I assume exchangeability for observable random vari-
ables X 1 , . . . , X n and a parametric representation (2.3) with an unknown parameter
θ ∈ Θ. After observing values x1 , . . . , xn , my posterior distribution π(θ) can the-
oretically be obtained through updating my prior beliefs via Bayes’ theorem (4.3);
however, the denominator of (4.3) is the result of a definite integral which was noted
in Sect. 4.4 to be analytically intractable in many cases, leaving the posterior den-
sity only available up to an unknown constant of proportionality (4.2). Section 2.3.7
noted that almost any posterior distribution will asymptotically resemble a multi-
variate Gaussian, and in some large sample cases, this might provide an adequate
approximation to the normalised posterior if the necessary maximum likelihood esti-
mates can be calculated, but in general, these asymptotic arguments cannot be relied
upon.
In the second stage, taking a squared error loss function as an example, it is under-
stood from Sect. 1.5.2 that the Bayes estimate for θ under this loss function would
be the mean value with respect to my (updated) subjective probability distribution,
π(θ), and the required calculation therefore requires a second integral

$$\mathbb{E}_\pi(\theta) := \int_\Theta \theta\, \pi(\theta)\, d\theta, \qquad (5.1)$$
Definition 5.1 (Monte Carlo estimate of an expectation) For samples θ (1) , . . . , θ (M)
from π , the Monte Carlo estimate (MC) of Eπ {g(θ )} (5.2) is
$$\hat{\mathbb{E}}_\pi\{g(\theta)\} := \frac{1}{M} \sum_{i=1}^{M} g(\theta^{(i)}). \qquad (5.3)$$
Remark 5.1 MC methods are well suited to addressing the case of Item 2 from
Sect. 5.1, where a target density π might be fully known but the integrals required
for calculating expectations with respect to π are not tractable.
Exercise 5.1 (Monte Carlo probabilities) Suppose θ (1) , . . . , θ (M) are random sam-
ples from a density π(θ ) over Θ. State the Monte Carlo estimate of Pπ (θ ∈ A) for
a region A ⊂ Θ.
Exercise 5.3 (Monte Carlo credible interval) For a univariate, real-valued param-
eter θ ∈ R, suppose θ (1) , . . . , θ (M) are random samples from a density π(θ ) and
θ(1) ≤ . . . ≤ θ(M) are the corresponding order statistics. For 0 ≤ α ≤ 1, use the order
statistics to state a Monte Carlo approximated 100α% credible region for θ (cf.
Sect. 4.5.2).
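The quantities in Definition 5.1 and Exercises 5.1 and 5.3 can all be computed from a single set of samples. A minimal sketch (Python; the Gamma target is an assumption made so that the answers can be sanity-checked against known values):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.gamma(shape=3.0, scale=2.0, size=100_000)    # samples from pi

print("E_hat(theta)       :", theta.mean())              # (5.3) with g(theta) = theta
print("P_hat(theta > 10)  :", (theta > 10).mean())       # Exercise 5.1, A = (10, inf)
print("95% credible region:", np.quantile(theta, [0.025, 0.975]))   # Exercise 5.3
```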
The standard error of an estimator is the standard deviation of the sampling distribu-
tion of the estimate, or more generally an estimate of that standard deviation.
Definition 5.2 (Monte Carlo standard error) For independent samples θ (1) , . . . ,
θ (M) ∼ π(θ ), the (estimated) standard error of the Monte Carlo estimate (5.3) is
$$\widehat{\mathrm{s.e.}}\{\hat{\mathbb{E}}_\pi\{g(\theta)\}\} := \left[ \frac{1}{M(M-1)} \sum_{i=1}^{M} \left( g(\theta^{(i)}) - \hat{\mathbb{E}}_\pi\{g(\theta)\} \right)^2 \right]^{1/2}. \qquad (5.4)$$
Remark 5.3 (5.4) is useful for assessing convergence of the MC estimate (5.3) to (5.2). The standard error shrinks to zero at a rate proportional to $1/\sqrt{M}$.
Second, the Bayes estimate is the value θ̂ ∈ Θ which minimises the (estimated)
expected loss,
$$\arg\min_{\hat\theta \in \Theta} \hat{\mathbb{E}}_\pi\{\ell(\hat\theta, \theta)\},$$
Exercise 5.4 (Monte Carlo optimal decision estimation) Suppose just three sam-
ples θ (1) = 2, θ (2) = 5, θ (3) = 11 are obtained from a target density π(θ ) describ-
ing uncertainty about an unknown parameter θ . Assuming a Gaussian kernel loss
function

$$\ell(\hat\theta, \theta) = -\exp\{-(\hat\theta - \theta)^2/10\},$$

plot the Monte Carlo expected loss function $\hat{\mathbb{E}}_\pi\{\ell(\hat\theta, \theta)\}$ for θ̂ over the interval
[0, 12] and numerically evaluate an approximate Bayes estimate of θ .
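One possible numerical treatment of Exercise 5.4 is sketched below (Python), evaluating the Monte Carlo expected loss on a grid over [0, 12] and minimising there; plotting the resulting curve is left to any preferred library.

```python
import numpy as np

samples = np.array([2.0, 5.0, 11.0])          # theta^(1), theta^(2), theta^(3)
grid = np.linspace(0.0, 12.0, 1201)           # candidate estimates theta_hat

# Monte Carlo expected Gaussian kernel loss at each candidate estimate
exp_loss = -np.exp(-(grid[:, None] - samples[None, :]) ** 2 / 10).mean(axis=1)

print("approximate Bayes estimate:", grid[exp_loss.argmin()])
```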
Defining a so-called importance function as the ratio of the two densities, w(θ ) =
π(θ )/ h(θ ), the identity (5.5) implies
for any dominating density h, thereby expressing a general expectation with respect
to π as a different expectation with respect to h. It immediately follows that a Monte
Carlo approximation of (5.1) can be obtained using samples θ (1) , . . . , θ (m) drawn
from h.
Definition 5.3 (Importance sampling) For samples θ (1) , . . . , θ (M) from h, the impor-
tance sampling Monte Carlo estimate of Eπ {g(θ )} (5.2), or equivalently (5.6), is
$$\hat{\mathbb{E}}^{\mathrm{IS}}_\pi\{g(\theta)\} := \frac{1}{M} \sum_{i=1}^{M} w_i\, g(\theta^{(i)}), \qquad (5.7)$$
where wi = w(θ (i) ) = π(θ (i) )/ h(θ (i) ) are the importance weights.
Exercise 5.5 (Importance sampling Monte Carlo standard error) For independent
samples θ (1) , . . . , θ (M) ∼ h(θ ), state a formula for the standard error of the importance sampling Monte Carlo estimate $\hat{\mathbb{E}}^{\mathrm{IS}}_\pi\{g(\theta)\}$ from (5.7).
Remark 5.5 The rate at which the importance sampling standard error from Exer-
cise 5.5 shrinks to zero and the estimate (5.7) converges to the true value depends
upon the functional ratio π/ h. Good convergence can be obtained when h well
approximates π and h possibly has heavier tails (Amaral Turkman et al. 2019).
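A minimal importance sampling sketch (Python), assuming a standard normal target π and a heavier-tailed Student-t sampling density h, and estimating Eπ(θ²) together with its standard error (cf. Exercise 5.5); the exact answer here is 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M = 100_000

theta = stats.t.rvs(df=3, size=M, random_state=rng)     # samples from h
w = stats.norm.pdf(theta) / stats.t.pdf(theta, df=3)    # importance weights pi/h

wg = w * theta**2                                       # w(theta) * g(theta)
print("estimate:", wg.mean())                           # (5.7), exact value 1
print("std error:", wg.std(ddof=1) / np.sqrt(M))        # Monte Carlo standard error
```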
Suppose now that the target density is known only through an unnormalised function γ(θ), with

$$\pi(\theta) = \frac{\gamma(\theta)}{\gamma_*}, \qquad (5.9)$$

where $\gamma_* = \int_\Theta \gamma(\theta)\, d\theta$.
Proposition 5.1 Let h(θ ) be a known density which dominates π(θ ), such that h
can be easily sampled from and let θ (1) , . . . , θ (M) be a random sample drawn from
h. Then an importance sampling Monte Carlo estimate for the normalising constant
γ∗ is
$$\hat\gamma_* = \frac{1}{M} \sum_{i=1}^{M} \frac{\gamma(\theta^{(i)})}{h(\theta^{(i)})}. \qquad (5.10)$$
The unknown normalising constant of (5.9) in this case is the marginal likelihood,
γ∗ = p(x).
In the simplest implementation, the prior p(θ ) could be used as the sampling
density; given prior samples θ (1) , . . . , θ (M) , the Monte Carlo estimate (5.10) of the
normalising constant is
$$\hat p(x) = \frac{1}{M} \sum_{i=1}^{M} p(x \mid \theta^{(i)}). \qquad (5.11)$$
Although sampling from the prior leads to a simplified equation for Monte Carlo
estimation of the marginal likelihood, the standard error of (5.11) can be large if the
likelihood is calculated on a large sample x which strongly outweighs the effects of the
prior (cf. Sect. 2.3.6). As noted above, low variance estimates can be obtained when
the sampling density closely resembles the target. Therefore, in large sample cases,
a better importance sampling density could be the asymptotic normal distribution
approximation of a posterior from Sect. 2.3.7.
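A sketch of estimate (5.11) (Python), for an assumed Beta-Bernoulli model in which the exact marginal likelihood is available for comparison:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)              # observed binary data
n, s = len(x), int(x.sum())
a, b = 1.0, 1.0                                # Beta(a, b) prior

theta = rng.beta(a, b, size=200_000)           # samples from the prior
lik = theta**s * (1.0 - theta)**(n - s)        # p(x | theta^(i))
print("Monte Carlo estimate:", lik.mean())     # (5.11)

log_beta = lambda p, q: lgamma(p) + lgamma(q) - lgamma(p + q)
print("exact marginal      :", np.exp(log_beta(a + s, b + n - s) - log_beta(a, b)))
```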
If sampling directly from a particular target distribution π(θ) (for the purpose of
performing Monte Carlo integration) does not seem possible, and when it is not clear
how to identify a suitable importance sampling density (cf. Sect. 5.2.3), Markov chain
Monte Carlo (MCMC) methods provide a general solution for obtaining approxi-
mate samples from any target density. Conceptually, the idea is straightforward: A
discrete-time homogeneous Markov chain of parameter values θ (1) , θ (2) , . . . is sam-
pled according to a transition probability density function p(θ (i+1) | θ (i) ), chosen
such that the limiting (stationary) distribution of the parameter value sequence has
density π(θ ).
Throughout, it will be supposed that an initial value θ (0) is drawn from an initial probability distribu-
tion (possibly a point mass at some particular value) and then subsequent values
θ (1) , θ (2) , . . . are drawn from the transition density p(θ (i+1) | θ (i) ).
Definition 5.5 (π -irreducible Markov chain) A Markov chain with transition den-
sity p(θ (i+1) | θ (i) ) is said to be π -irreducible if for each π-measurable set A ⊂ Θ
with π(A) > 0 and for each θ ∈ Θ, there exists n > 0 such that P n (A | θ ) > 0.
Remark 5.6 Informally, a π -irreducible Markov chain can eventually reach any
neighbourhood of Θ where the target distribution has positive probability.
Definition 5.6 (Aperiodic Markov chain) A Markov chain with transition density
p(θ (i+1) | θ (i) ) is said to be aperiodic if, for each initial value θ (0) ∈ Θ and each
π-measurable set A ⊂ Θ with π(A) > 0, the set {n : P n (A | θ (0) ) > 0} has greatest
common divisor equal to 1.
Remark 5.7 Informally, an aperiodic Markov chain does not have a cyclic pattern
to how it can arrive at different states.
Remark 5.8 The condition (5.12) required for reversibility is sometimes referred
to as detailed balance.
Proposition 5.2 If the transition density p(θ (i+1) | θ (i) ) of a Markov chain satis-
fies detailed balance (is reversible) with respect to π(θ ), then π(θ ) is a stationary
distribution.
Proof Integrating the detailed balance condition (5.12) over θ′,

$$\int_\Theta \pi(\theta')\, p(\theta \mid \theta')\, d\theta' = \int_\Theta \pi(\theta)\, p(\theta' \mid \theta)\, d\theta' = \pi(\theta).$$
The Gibbs transition density for updating component j of θ,

$$p(\theta' \mid \theta) = \mathbb{1}_{\theta_{-j}}(\theta'_{-j})\, \pi(\theta'_j \mid \theta_{-j}), \qquad (5.14)$$

where

$$\pi(\theta_j \mid \theta_{-j}) = \frac{\pi(\theta)}{\pi(\theta_{-j})}, \qquad \pi(\theta_{-j}) := \int_{\Theta_j} \pi(\theta)\, d\theta_j, \qquad (5.13)$$

is π(θ)-reversible.

Proof Since θ′−j = θ−j with probability 1 under (5.14), then for all such θ, θ′,

$$\pi(\theta)\, p(\theta' \mid \theta) = \pi(\theta_{-j})\, \pi(\theta_j \mid \theta_{-j})\, \pi(\theta'_j \mid \theta_{-j}) = \pi(\theta')\, p(\theta \mid \theta'),$$

where the first equality derives from (5.13) and the second from (5.14).
Remark 5.10 Since the full conditional distributions are each π -reversible, a
Markov chain which updates θ by successively sampling new component values
from the full conditionals has stationary distribution π .
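As an illustration of Remark 5.10, the sketch below (Python) runs a Gibbs sampler on an assumed bivariate normal target with correlation ρ, for which both full conditional distributions are univariate normal.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 50_000
s = np.sqrt(1.0 - rho**2)               # full conditional standard deviation

draws = np.empty((M, 2))
t1 = t2 = 0.0                           # arbitrary initial value theta^(0)
for i in range(M):
    t1 = rng.normal(rho * t2, s)        # sample from pi(theta_1 | theta_2)
    t2 = rng.normal(rho * t1, s)        # sample from pi(theta_2 | theta_1)
    draws[i] = t1, t2

burned = draws[M // 10:]                # discard an initial burn-in period
print("sample correlation:", np.corrcoef(burned.T)[0, 1])   # approximately rho
```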
Fig. 5.1 Mixture density of two bivariate normal distributions with identity covariance matrix and
means (μ, μ) and (−μ, −μ)
$$\alpha(\theta, \theta') = \min\left\{ 1,\ \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} \right\} \qquad (5.15)$$
and otherwise keeping the chain in its current state. The full algorithm is stated in
Algorithm 2.
Remark 5.11 Since the transition density (5.16) is π -reversible, the Markov chain
obtained from the Metropolis-Hastings algorithm has stationary distribution π .
Remark 5.12 The target density π only enters Algorithm 2 through the ratio
π(θ′)/π(θ) in the acceptance probability (5.15); consequently, π (and also the pro-
posal density q) only needs to be known up to proportionality to utilise the Metropolis-
Hastings algorithm. This is a very useful property in Bayesian inference, where it
has earlier been noted in Sect. 4.4 that a target posterior distribution can often only
be identified up to an unknown normalising constant.
Remark 5.13 As with importance sampling (cf. Sect. 5.2.3), convergence of the
Metropolis-Hastings algorithm depends upon the choice of the proposal density q,
with good performance achieved when q closely resembles the target density π .
The extreme case where q(θ′ | θ) = π(θ′) would lead to a sequence of independent
samples drawn directly from π (all accepted with probability 1) and the algorithm
reverts to straightforward Monte Carlo sampling (cf. Sect. 5.2).
Exercise 5.9 Gibbs sampling as Metropolis-Hastings special case. Show that Gibbs
sampling (Sect. 5.3.2) is a special case of the Metropolis-Hastings algorithm with
proposal density

$$q(\theta' \mid \theta) = \mathbb{1}_{\theta_{-j}}(\theta'_{-j})\, \pi(\theta'_j \mid \theta_{-j}),$$

for which the acceptance probability (5.15) is identically equal to 1.

In random walk Metropolis-Hastings, proposals are centred on the current state, θ′ = θ + ε z, for an innovation z drawn either from a standard Gaussian or from a symmetric uniform proposal. In either case, the parameter ε > 0 can be tuned to
influence the acceptance rate of the proposed moves; as ε → 0, the acceptance rate
tends to 1, but at the expense of proper exploration of Θ. In practice, different values
of ε can be explored to get a good trade-off between exploration and acceptance, with
published research (Roberts et al. 1997) suggesting an acceptance ratio of 0.234 can
optimise the efficiency of the algorithm under some quite general conditions. The
consequent advice from the authors is to “tune the proposal variance so that the
average acceptance rate is roughly 1/4”.
Whilst a random walk Metropolis-Hastings algorithm avoids the difficulty of find-
ing a proposal density that globally matches the target, these methods can sometimes
perform poorly in practice by being slow to explore the parameter space, getting
stuck in local modes of the target density. This phenomenon is sometimes described
as poor mixing.
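A random walk Metropolis-Hastings sketch (Python), targeting a density known only up to proportionality (here an assumed two-component normal mixture, echoing Fig. 5.1) with a Gaussian proposal of scale eps that can be tuned toward the acceptance rate discussed above.

```python
import numpy as np

def log_target(theta):
    """Unnormalised log density of a two-component normal mixture."""
    return np.logaddexp(-0.5 * (theta - 1.0) ** 2, -0.5 * (theta + 1.0) ** 2)

rng = np.random.default_rng(0)
M, eps = 50_000, 2.5
chain = np.empty(M)
theta, accepted = 0.0, 0

for i in range(M):
    proposal = theta + eps * rng.normal()               # symmetric random walk move
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta, accepted = proposal, accepted + 1        # accept via (5.15)
    chain[i] = theta                                    # otherwise keep current state

print("acceptance rate:", accepted / M)                 # tune eps toward ~0.25
```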
Prior beliefs about the synthetic momentum variable p are usually assumed to be described by
a standard multivariate normal distribution that is statistically independent of θ; the
joint density can then be written up to proportionality as $\tilde\pi(\theta, p) \propto \exp\{-H(\theta, p)\}$, where

$$H(\theta, p) := -\log \pi(\theta) + \frac{p^\top p}{2}. \qquad (5.17)$$
Continuing the mechanics analogy, the first term of (5.17) corresponds to the potential
energy held by the body, proportional to the height of the body on a surface which
has contours of − log π(θ ) at each location θ ; and the second term corresponds to
the kinetic energy of the body, proportional to the squared momentum, $p^\top p$. The
lowest point on the surface, corresponding to minimal potential energy and therefore
maximal kinetic energy, is the mode of the target density π(θ ).
Returning to the Metropolis-Hastings algorithm, to propose new values in Θ from
a current position, denoted here as θ (0), the idea is to consider a trajectory of the body
through time after applying some momentum. Let θ (t) be the location of the body
at time t, and p(t) the corresponding momentum. The principle of conservation of
energy implies that when the extended target density is interpreted as the Hamiltonian
of a closed dynamical system, the dynamics of that system should require H (θ, p)
to be preserved. This leads to the Hamiltonian equations for the system:
dθ(t)/dt = ∂H/∂p,    dp(t)/dt = −∂H/∂θ.
Evolving the extended parameters θ (t), p(t) according to these equations would
keep (5.17) constant, which corresponds to the body travelling along contours of
the extended target density π̃ . Therefore, proposing new θ values in approximate
accordance with these dynamics can lead to proposals which are far away from
the starting (previous) value but have similar target density π(θ ), leading to good
exploration and high acceptance rates.
In practice, the Hamiltonian dynamics are numerically approximated at inter-
leaved time points using leapfrog integration; for an incremental time step ε > 0,
the updates take the form

p(t + ε/2) = p(t) + (ε/2) ∇_θ log π(θ(t)),
θ(t + ε) = θ(t) + ε p(t + ε/2),
p(t + ε) = p(t + ε/2) + (ε/2) ∇_θ log π(θ(t + ε)).  (5.18)

The partial derivatives of the target density are required in (5.18), implying the
technique is only appropriate for continuous-valued parameters.
The Hamiltonian MCMC algorithm follows the Metropolis-Hastings algorithm
(Algorithm 2), with proposal density q(θ | θ (i−1) ) derived from a sampling pro-
cedure of first obtaining a new starting momentum p(0) from the standard mul-
tivariate normal, and second evolving the Hamiltonian dynamics implied by this
initial momentum via the leapfrog algorithm for some number of time steps L > 0,
beginning at θ (0) = θ (i−1) . The algorithm for this proposal mechanism is given in
Algorithm 3.
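The following Python sketch illustrates this proposal mechanism under the quadratic kinetic energy in (5.17); the step size ε, the number of leapfrog steps L and the standard normal target used in the usage example are illustrative assumptions rather than recommended settings.

import numpy as np

def leapfrog(theta, p, grad_log_pi, eps, L):
    # Approximate the Hamiltonian dynamics of H(theta, p) over L steps
    theta, p = theta.copy(), p.copy()
    p += 0.5 * eps * grad_log_pi(theta)         # initial half step for momentum
    for _ in range(L - 1):
        theta += eps * p                        # full step for position
        p += eps * grad_log_pi(theta)           # full step for momentum
    theta += eps * p
    p += 0.5 * eps * grad_log_pi(theta)         # final half step for momentum
    return theta, p

def hmc_proposal(theta, log_pi, grad_log_pi, eps=0.1, L=20, rng=None):
    rng = rng or np.random.default_rng()
    p0 = rng.standard_normal(theta.shape)       # fresh momentum from Normal(0, I)
    theta_new, p_new = leapfrog(theta, p0, grad_log_pi, eps, L)
    # Accept or reject on the change in total energy H = -log pi + p'p/2
    H0 = -log_pi(theta) + 0.5 * p0 @ p0
    H1 = -log_pi(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < H0 - H1:
        return theta_new
    return theta

log_pi = lambda t: -0.5 * t @ t                 # standard normal target
grad_log_pi = lambda t: -t
theta = np.zeros(2)
for _ in range(5):
    theta = hmc_proposal(theta, log_pi, grad_log_pi)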
Recall Proposition 2.2 from Sect. 2.3.7, which stated that for increasing sample
sizes almost every target posterior distribution approaches an asymptotic normal
distribution,

π(θ | x) ≈ Normal_k(m_n, H_n^{−1})  (5.19)

as n → ∞, where m_n (2.9) and H_n (2.8) are respectively the posterior mode and
information matrix. For approximate inference, this large sample property (5.19)
can be exploited in several ways.
Most straightforwardly, the approximated normal distribution density could be
directly substituted in place of the true target density π(θ ), for example if this sim-
plifies an expectation calculation (5.2). However, lower error approximations can be
obtained using a so-called Laplace approximation.
Combining (5.2) with the expression for the posterior distribution (4.3) obtained
from Bayes’ theorem, it follows that a posterior expectation for a function of interest
g(θ) can be expressed as a ratio of two integrals,

E{g(θ) | x} = ∫_Θ g(θ) p_{x|θ}(x | θ) p_θ(θ) dθ / ∫_Θ p_{x|θ}(x | θ) p_θ(θ) dθ.  (5.20)
A Laplace approximation (Tierney and Kadane 1986) for (5.20) assumes a normal
approximation to both the denominator and the numerator of this ratio. In general, the
Laplace method of integration uses a second-order application of Taylor’s theorem
to approximate positive function integrands with normal distribution densities: let θ ∗
be the global maximum of a twice-differentiable function h(· ), and H (· ) the Hessian
matrix of h(· ), then
h(θ) ≈ h(θ*) + (1/2)(θ − θ*)ᵀ H(θ*)(θ − θ*)

⟹ ∫ e^{h(θ)} dθ ≈ e^{h(θ*)} (2π)^{k/2} / |−H(θ*)|^{1/2},  (5.21)

by comparison with the density of a normal distribution with mean vector θ* and
covariance matrix {−H(θ*)}^{−1}. To apply Laplace’s method to (5.20), it must be supposed
that the function of interest g(θ) is positive almost everywhere. For the logarithm of
the integrands in the denominator and numerator of (5.20), define

h(θ) := log p_{x|θ}(x | θ) + log p_θ(θ),
h̃(θ) := log g(θ) + h(θ).
The mode and Hessian of h are the posterior density mode and information matrix
(m n , Hn ). Denoting the corresponding mode and Hessian of h̃ by (m̃ n , H̃n ), the
Laplace approximation of (5.20) by application of (5.21) is
E{g(θ) | x} = ∫_Θ e^{h̃(θ)} dθ / ∫_Θ e^{h(θ)} dθ
            ≈ |H_n|^{1/2} g(m̃_n) p_{x|θ}(x | m̃_n) p_θ(m̃_n) / ( |H̃_n|^{1/2} p_{x|θ}(x | m_n) p_θ(m_n) ).  (5.22)
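As a concrete illustration of (5.21) and (5.22), the following Python sketch computes a Laplace approximation to a posterior expectation in a one-parameter beta-binomial model, where the exact answer is available for comparison; the data, the flat prior and the numerical differentiation step are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

y, n = 7, 10                       # observed successes out of n trials

def h(t):                          # log p(x | theta) + log p(theta), flat prior
    return y * np.log(t) + (n - y) * np.log(1 - t)

def h_tilde(t):                    # log g(theta) + h(theta), with g(theta) = theta
    return np.log(t) + h(t)

def laplace_integral(f):
    # Maximise f and approximate its integral via (5.21), with k = 1
    opt = minimize_scalar(lambda t: -f(t), bounds=(1e-6, 1 - 1e-6), method='bounded')
    mode, eps = opt.x, 1e-5
    hess = (f(mode + eps) - 2 * f(mode) + f(mode - eps)) / eps**2  # numerical Hessian
    return np.exp(f(mode)) * np.sqrt(2 * np.pi / -hess)

approx = laplace_integral(h_tilde) / laplace_integral(h)   # the ratio (5.22)
exact = (y + 1) / (n + 2)          # posterior mean under a Beta(1, 1) prior
print(approx, exact)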
Given a partition of the parameter k-vector θ = (φ, ψ), such that φ is a k′-vector
with 1 ≤ k′ < k, a Laplace approximation can be used to approximate marginal
distributions from the target density π(θ) ≡ π(φ, ψ),

π(φ) = ∫ π(φ, ψ) dψ.  (5.23)
For fixed φ, define h̃_φ(ψ) := h(φ, ψ), and let (ψ̃_{n,φ}, H̃_{n,φ}) be the mode and Hessian of h̃_φ(ψ), again conditioning on the
fixed value φ. Then using (5.21) in a similar way to deriving (5.22), a Laplace
approximation for the marginal density (5.23) is

π(φ) = ∫ e^{h̃_φ(ψ)} dψ / ∫_Θ e^{h(θ)} dθ
     ≈ |−H_n|^{1/2} p_{x|θ}(x | φ, ψ̃_{n,φ}) p_θ(φ, ψ̃_{n,φ}) / ( (2π)^{k′/2} |−H̃_{n,φ}|^{1/2} p_{x|θ}(x | m_n) p_θ(m_n) ).  (5.24)
The integrated nested Laplace approximation (INLA) applies these ideas to latent
Gaussian Markov random field (GMRF) models of the form

x_i ∼ p(x_i | θ, φ), i = 1, …, n,
θ | φ ∼ Normal_k(0, Σ(φ)),
φ ∼ p(φ),
where Σ(φ) is a non-singular covariance matrix which can depend upon the hyper-
parameter φ and whose inverse contains zeros according to the GMRF model. The
normal distribution prior for θ makes models in this class well-suited to Laplace
approximations. Noting the posterior density can be expressed as
π(θ, φ | x) ∝ p(φ) |Σ(φ)|^{−1/2} exp( −(1/2) θᵀ Σ^{−1}(φ) θ + Σ_{i=1}^n log p(x_i | θ, φ) ),
the INLA approach combines multiple Laplace approximations for conditional dis-
tributions involving θ with numerical integration techniques for φ, and can therefore
enable inference for problems with a very high dimensional θ parameter, provided
φ has low dimension. In particular, using (5.24) the marginal posterior density for φ
is approximated by
π̂(φ | x) ∝ p(φ) |Σ(φ)|^{−1/2} exp( −(1/2) θ̃_φᵀ Σ^{−1}(φ) θ̃_φ + Σ_{i=1}^n log p(x_i | θ̃_φ, φ) ) / π̂(θ̃_φ | φ, x),

where θ̃_φ is the mode of a Laplace (Gaussian) approximation π̂(θ | φ, x) to the
conditional posterior density of θ.
Not all posterior distribution densities can be well approximated with normal distri-
butions, and so variational inference methods (Blei et al. 2017) explore alternative
classes of approximating densities. Let Q be such a class of densities, referred to
as the variational family. Then variational inference seeks to approximate the tar-
get density π(θ) with the closest member of the variational family, typically using
Kullback-Leibler divergence (cf. Definition 1.16),

q* := argmin_{q ∈ Q} KL(q(θ) ‖ π(θ)).  (5.25)

The KL-divergence in (5.25) is taken in the reverse direction to the usual order,
presented in (1.4), for comparing an estimated density with the truth. In (5.25),
expectations are taken with respect to the estimating density q rather than the target
density π.
1 https://www.r-project.org.
2 https://www.r-inla.org.
Fig. 5.2 Approximating a bivariate normal distribution with correlation .95, π(θ1, θ2), with the
closest independent bivariate normal distribution, q(θ1, θ2), minimising (a) KL(q ‖ π) or (b) KL(π ‖ q)
With π(θ ) ∝ p(x, θ ) = p(x | θ ) p(θ ), minimising the KL-divergence (5.25) is equiv-
alent to maximising the so-called evidence lower bound.
Definition 5.9 (Evidence lower bound) For fixed x and a probability density q(θ )
satisfying q(θ ) > 0 =⇒ p(x, θ ) > 0, the evidence lower bound (ELBO) is defined
by
ELBO(q) := Eq log p(x, θ ) − Eq log q(θ ). (5.26)
Later in Sect. 7.1 which considers model uncertainty, the marginal likelihood p(x)
will be referred to as the evidence in favour of that particular probability model. This
provides the reasoning behind the name of evidence lower bound: by (5.27),

log p(x) = ELBO(q) + KL(q(θ) ‖ π(θ)) ≥ ELBO(q),  (5.27)

since KL-divergence is non-negative (cf. Exercise 1.9). Note the lower bound
becomes an equality if q = π , corresponding to the approximation matching the
target distribution.
Exercise 5.12 (ELBO identity) Show that ELBO(q) = E_q log p(x | θ) − KL(q(θ) ‖ p(θ)).
Remark 5.15 To see the implicit trade-off implied by optimising the ELBO crite-
rion, notice (5.26) is the sum of the expected log value, with respect to q, of the
joint target density, p(x, θ), plus a quantity referred to in information theory as the
entropy of the approximating density, − Eq log q(θ ). Without the entropy term, the
maximisation would (in the limit) assign probability 1 to the posterior mode for θ ;
however, that approximating density would have minimum entropy, and so instead
the optimal q will distribute mass more widely around Θ, but still in areas where
π(θ ) is high.
A common choice of variational family Q is the mean-field family, in which the
approximating density factorises over the components of θ,

q(θ) = ∏_{j=1}^k q_j(θ_j).  (5.29)
Each factor q j in (5.29) can assume a different parametric form, which might be
necessary when some components of θ are unconstrained and continuous and others
are possibly discrete. The assumption of independence implicit in (5.29) makes
mean-field variational inference well-suited to optimisation using a technique called
coordinate ascent. Once optimised, the mean-field variational estimate will take the
same form,
q*(θ) = ∏_{j=1}^k q*_j(θ_j),
The coordinate ascent variational inference (CAVI) algorithm cycles through the
components j = 1, …, k, at each step updating the factor q_j to maximise ELBO(q)
with the remaining factors held fixed,

q_j(θ_j) ∝ exp{ E_{q_{−j}} log p(x, θ) },

where q_{−j} is the marginal density for θ_{−j} (4.8) of the current mean-field approxima-
tion,

q_{−j}(θ_{−j}) = ∏_{ℓ≠j} q_ℓ(θ_ℓ).
The CAVI method is summarised in Algorithm 4. Each step of the algorithm main-
tains or increases the objective function ELBO(q), and since ELBO(q) is bounded
above by log p(x), eventual convergence at a chosen tolerance threshold is guaranteed.
For example, for a bivariate normal target with mean μ and covariance Σ, approximated
with mean-field factors q_j = Normal(m_j, s_j²), the coordinate updates are

m_j = μ_j + (Σ_{j j̃} / Σ_{j̃ j̃})(m_{j̃} − μ_{j̃}),    s_j² = Σ_{jj} − Σ_{j j̃}² / Σ_{j̃ j̃},

where j̃ denotes the other component.
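These updates are easily verified numerically; the following Python sketch iterates the coordinate updates above for the correlated bivariate normal target of Fig. 5.2, with the mean, covariance and convergence tolerance chosen as illustrative assumptions.

import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.95], [0.95, 1.0]])

m = np.array([1.0, -1.0])                      # arbitrary initialisation
for _ in range(100):
    m_old = m.copy()
    for j in range(2):
        jt = 1 - j                             # the other coordinate
        m[j] = mu[j] + Sigma[j, jt] / Sigma[jt, jt] * (m[jt] - mu[jt])
    if np.max(np.abs(m - m_old)) < 1e-12:      # converged
        break

s2 = np.array([Sigma[j, j] - Sigma[j, 1 - j] ** 2 / Sigma[1 - j, 1 - j] for j in range(2)])
print(m, s2)   # factor means converge to mu; variances fall below Sigma_jj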
1 https://mc-stan.org
2 https://docs.pymc.io
3 http://edwardlib.org
4 https://www.python.org
Chapter 6
Bayesian Software Packages

As a running example throughout this chapter, consider the following hierarchical
model for student test grades:
μ ∼ Normal(0, σ 2 /4)
σ −2 ∼ Gamma(1, 1/2)
z i ∼ Normal(μ, σ 2 ), i = 1, . . . , n
θi = 1/{1 + e−zi }, i = 1, . . . , n
X i j ∼ Binomial(100, θi ), i = 1, . . . , n; j = 1, . . . , p. (6.1)
Briefly, this model assumes a matrix, X , of student grades measured as integer per-
centage scores ranging between 0 and 100, such that the ith row corresponds to the ith
student in a class. Each student grade X i j is modelled by a binomial distribution with
a student-specific probability parameter which is assumed to be the same for each
test. This parameter is derived through a logistic transformation of an unobserved
(latent), real-valued aptitude level z i which is assumed to be normally distributed
with unknown mean and variance which are assigned conjugate priors (cf. Sect. 4.2).
Below is some example Python code for simulating from this model, by default
assuming thirty students sitting five tests.
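The original simulation listing is not reproduced here, but a sketch consistent with model (6.1) and with the function name sample_student_grades imported by the later PyStan listing might look as follows; the default random seed is an assumption.

## student_grade_simulation.py
import numpy as np

def sample_student_grades(n=30, p=5, seed=0):
    gen = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(gen.gamma(shape=1.0, scale=2.0))  # sigma^{-2} ~ Gamma(1, 1/2)
    mu = gen.normal(0, sigma / 2)                           # mu ~ Normal(0, sigma^2/4)
    z = gen.normal(mu, sigma, size=n)                       # latent aptitudes
    theta = 1.0 / (1.0 + np.exp(-z))                        # inverse logit transformation
    X = gen.binomial(100, theta[:, None], size=(n, p))      # integer percentage scores
    return X, mu, sigma

if __name__ == '__main__':
    X, mu, sigma = sample_student_grades()
    print(X)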
6.2 Stan
Stan, named after the Polish mathematician Stanislaw Ulam, is a probabilistic pro-
gramming language written in C++ which uses sophisticated Monte Carlo and vari-
ational inference algorithms (see Chap. 5) for performing automated Bayesian infer-
ence. In particular, the default inferential method uses the No-U-Turn Sampler of
Hoffman and Gelman (2014), which is an extension of Hamiltonian Monte Carlo (see
Sect. 5.4). The derivatives required for performing HMC and other related inferential
methods are calculated within Stan using automatic differentiation. The user simply
has to declare the statistical model, import any data and then call a sampling routine.
Remark 6.1 Stan does not support sampling of discrete parameters, due to the
reliance of the software on Hamiltonian Monte Carlo sampling methods. For prob-
lems involving discrete parameters, the Stan documentation recommends marginal-
ising any discrete parameters where possible.
The data{} code block contains the quantities which are considered to be known.
The quantities n (the number of students) and p (the number of tests) are declared
as positive integers, and the student test scores (X i j ) are declared to be an n × p
integer matrix taking values between 0 and 100. In the remainder of this block, the
remaining required model hyperparameters and their constraints are listed.
The parameters{} code block declares the unknown quantities in (6.1): the
n-vector of real number student aptitude values z i , and the unknown mean and vari-
ance parameters (μ, σ 2 ) of the normal distribution for the z i values.
The transformed parameters{} code block contains any parameter transforma-
tions which are helpful for stating the prior and likelihood models in the final
model{} block. In this case, the aptitude parameters z i are converted
to binomial parameters θi using the inverse logit function θi = 1/{1 + e−zi } as in
(6.1), and also the aptitude standard deviation σ is obtained as the square root of the
variance.
The model{} code block states the probability distributional assumptions from
(6.1): An inverse-gamma distribution for σ 2 ; normal distributions for μ and the
latent parameters z i ; and a binomial distribution for each individual percentage test
score, using the student-specific transformed parameter θi .
6.2.1 PyStan
Stan can be accessed from a range of computing environments. In this text, it will
be accessed using the Python interface, PyStan.5 The following PyStan 3 code
(student_grade_inference_stan.py) uses the Python simulation code
and Stan model declaration code from above to simulate student test score data
and then fit the underlying model to the data.
5 https://pystan.readthedocs.io.
After importing the necessary packages, the code first simulates student grade data
for n = 30 students taking p = 5 tests. Second, the code loads in the Stan probability
model from student_grade_model.stan. The third block determines that
four separate parallel Hamiltonian MCMC chains are to be run, each requesting
10,000 samples after discarding the first 1000; the call to stan.model.sample()
then obtains the posterior samples.
The final code block creates plots from the posterior samples. The top two cells
show trace plots of the log posterior density of the sampled parameters and the values
of the parameter μ from (6.1). The chains demonstrate stability and good mixing. The
bottom row contains a diagnostic plot for the four chains showing the convergence of
the sample mean for estimating the posterior expectation of μ, and finally a histogram
of the sampled values of μ, pooled across the four chains, for estimating the marginal
density π(μ). The true value of μ used to simulate the test scores is indicated with a
dashed line; note the relatively small sample size (of students and test scores) means
this true value is not yet well estimated by π(μ).
6.3.2 Edward
Edward is a Python library for probabilistic modelling and inference, named after
the statistician George Edward Pelham Box who pioneered iterative approaches to
statistical modelling. Edward was developed using TensorFlow as a back end. Ten-
sorFlow is designed for developing and training so-called machine learning models,
and Edward builds upon this to offer modelling using neural networks (including
popular deep learning techniques) but also supports graphical models (cf. Chap. 3)
and Bayesian nonparametrics (cf. Chap. 9).
Bayesian inference can be performed using variational methods and Hamilto-
nian Monte Carlo, along with some other advanced techniques. Another strength of
Edward is model criticism, using posterior predictive checks which assess how well
data generated from the model under the posterior distribution agree with the realised
data.
6 https://www.tensorflow.org.
Chapter 7
Criticism and Model Choice
If the decision-maker is prepared to average across models to obtain marginal probability distributions (7.2), then this model-
averaging approach is the correct method for managing model uncertainty under the
Bayesian paradigm; the model uncertainty is simply one component of a mixture
prior formulation.
Once the outcome variable x is observed, then if prior probabilities over M have
been specified, the updated posterior model probabilities can be obtained via Bayes’
theorem,
dQ(M | x) ∝ dQ(M) P(x | M).
In particular, when M = {M_1, …, M_k} is a finite set,

P(M_i | x) = P(M_i) P(x | M_i) / Σ_{i′=1}^k P(M_{i′}) P(x | M_{i′}).  (7.3)
If the decision problem is to determine which model was the underlying generative
process which gave rise to x, then the decision-maker should proceed in the manner
described in Chap. 1: specifying a utility or loss function which evaluates the con-
sequences of estimating the model correctly or incorrectly and reporting the model
which maximises expected utility with respect to the model posterior distribution
(7.3).
Example 7.1 If choosing a model m from a finite set of models M using a zero-one
utility function (cf. Exercise 1.8) with the following utility if the true model were M,

u(m, M) = 1 if m = M, and 0 otherwise,  (7.4)

then the optimal Bayesian decision would be to report the posterior mode,

m* = argmax_{M ∈ M} P(M | x).
Suppose the decision-maker wishes to compare the relative suitability of two partic-
ular models, Mi and M j ; in this case, the comparison can be suitably encapsulated
by the ratio of the posterior probabilities attributed to the two models.
Definition 7.1 (Posterior odds ratio) The posterior odds ratio of model Mi over
model M j is
P(M_i | x) / P(M_j | x) = P(M_i) / P(M_j) × P(x | M_i) / P(x | M_j).  (7.5)
Remark 7.1 The first term on the right-hand side of (7.5) is known as the prior
odds ratio, and the second term is known as the Bayes Factor.
Definition 7.2 (Bayes factor) The Bayes factor in favour of model Mi over model
M j is
B_{ij}(x) := P(x | M_i) / P(x | M_j) = ( P(M_i | x) / P(M_j | x) ) / ( P(M_i) / P(M_j) ).
Remark 7.2 The Bayes factor represents the evidence provided by the data x in
favour of model Mi over M j , measured by the multiplicative change observed in the
odds ratio of the two models upon observing x.
If Bi j > 1, this suggests Mi has become more plausible relative to M j after observ-
ing x, whereas Bi j < 1 suggests the opposite. Bayes factors are non-negative but
have no upper bound, and although a larger Bayes factor presents stronger evidence
in favour of Mi , there is no objective interpretation for any non-degenerate value.
To provide interpretability, Jeffreys (1961) provided some subjective categorisations,
which were later refined by Kass and Raftery (1995); the latter are shown in Table 7.1.
Table 7.1 Bayes factor interpretations according to Kass and Raftery (1995)
Bayes factor Bi j Evidence in favour of Mi
1 to 3 Not worth more than a bare mention
3 to 20 Positive
20 to 150 Strong
>150 Very strong
For a hypothesis test between a null model M_0 and an alternative model M_1, the
posterior favours the alternative when

P(M_1 | x) > P(M_0 | x) ⟺ B_{10}(x) > P(M_0) / P(M_1).  (7.6)
The test procedure in (7.6) implies rejection of the null model M0 in favour of M1
if the Bayes factor B10 (x) exceeds the prior ratio in favour of the null model. In this
way, the prior ratio can be seen to determine the desired significance level of the test.
A threshold value could be chosen by referring to the Bayes factor interpretations
from Table 7.1.
Exercise 7.1 (Bayes factors for Gaussian distributions) Consider the following
model for two exchangeable groups of random samples x = (x1 , . . . , xn ), y =
(y1 , . . . , yn ):
xi ∼ N(θ X , 1), i = 1, . . . , n,
yi ∼ N(θY , 1), i = 1, . . . , n,
θ_X, θ_Y ∼ N(0, σ²).  (7.7)

Consider two rival models for the mean parameters,

M_0 : θ_X = θ_Y;
M_1 : θ_X ⊥⊥ θ_Y.  (7.8)
(i) Derive an equation for the Bayes factor B01 (x, y) in favour of M0 over M1 .
(ii) For fixed observed samples x and y, show that B01 (x, y) → ∞ as the assumed
variance for the mean parameters θ X and θY , σ 2 , tends to infinity. Comment.
Remark 7.3 The phenomenon mentioned in Exercise 7.1 Item ii is known as Lind-
ley’s paradox, named after the Bayesian decision theorist Dennis V. Lindley (1923–
2013), and is further discussed in Proposition 8.3 of Chap. 8. For Bayesian hypothesis
testing, there is no useful concept of a totally uninformative prior for model selec-
tion. If beliefs about unknown parameters are made arbitrarily vague, then the simpler
model will always be preferred, regardless of the data.
One issue with using posterior probabilities and Bayes factors for choosing amongst
models is that these quantities rely upon calculation of the marginal likelihoods of
observed data for each model. It was noted in Sect. 4.1 that the marginal likelihood
will not be analytically calculable for most models; and although Sect. 5.2.4.1 pro-
posed numerical importance sampling methods for estimating marginal likelihoods,
reliable low variance estimates may not be available.
Suppose x = (x_1, …, x_n). When the number of samples n is large, Schwarz
(1978) showed that for exponential family (cf. Sect. 4.3) models with a k-dimensional
parameter θ,

log p(x) ≈ log p(x | θ̂) − (k/2) log n,

where k is the dimension of θ and θ̂ maximises p(x | θ). This motivates the Bayesian
information criterion (BIC),

BIC := k log n − 2 log p(x | θ̂).  (7.9)

Low BIC values correspond to good model fit.
Remark 7.4 For a given likelihood model, the BIC (7.9) is twice the negative loga-
rithm of an asymptotic approximation of a corresponding Bayesian marginal likeli-
hood for n samples as n → ∞; this asymptotic marginal likelihood does not depend
on the choice of prior p(θ ), besides requiring appropriate support for the maximum
likelihood estimate. The BIC is therefore only suitable for comparing different for-
mulations of the likelihood component of a parametric model, and not for comparing
prior distributions.
Proposition 7.1 (BIC approximated Bayes factors) If BICi and BIC j denote the
Bayesian information criterion for two models Mi and M j , an approximate Bayes
factor in favour of model i over model j is
B_{ij} ≈ exp{ −(BIC_i − BIC_j)/2 }.
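As an illustration of Proposition 7.1, the following Python sketch compares two Gaussian mean models via their BIC values; the simulated data and the two models are assumptions chosen purely for demonstration.

import numpy as np
from scipy.stats import norm

def bic(log_lik_hat, k, n):
    # BIC = k log n - 2 log p(x | theta_hat), as in (7.9)
    return k * np.log(n) - 2 * log_lik_hat

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=50)
n = len(x)

# M0: known zero mean (no free parameters); M1: mean estimated by maximum likelihood
bic0 = bic(norm.logpdf(x, 0.0, 1.0).sum(), k=0, n=n)
bic1 = bic(norm.logpdf(x, x.mean(), 1.0).sum(), k=1, n=n)

B10 = np.exp(-0.5 * (bic1 - bic0))   # approximate Bayes factor for M1 over M0
print(B10)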
Exercise 7.2 (BIC for Gaussian distributions) Consider the sampling model (7.7)
for two groups of random samples x = (x1 , . . . , xn ), y = (y1 , . . . , yn ) presented in
Exercise 7.1, and the two alternative models M0 and M1 from (7.8) for the respective
mean parameters θ X and θY .
(i) Derive equations for the Bayesian information criterion values BIC_0 and BIC_1
for the two models M_0 and M_1.
(ii) Use these BIC values to give an approximate expression for the corresponding
Bayes factor B01 .
Posterior probabilities and Bayes factors are useful for assessing the relative merits of
rival models. However, it can also be desirable to assess the quality of a single model
in absolute terms, without reference to any proposed alternatives which may not yet
have been identified. Posterior predictive checking (PPC) methods aim to quantify
how well a proposed model structure fits the observed data, using the following
logic: if the model is a good approximation to the generating mechanism for the
observed data, then the posterior distribution of the model parameters should assign
high probability to parameter values which in turn would generate further data similar
to the observed data with high probability if the sampling process was repeated.
Consider a single parametric model with model parameter θ ∈ Θ, prior density
p(θ ) and likelihood density p(x | θ ) for the observed data x ∈ X . Let π(θ ) be the
corresponding posterior density (4.3).
Definition 7.4 (Posterior predictive distribution) The posterior predictive distribu-
tion is the marginal distribution of a second draw xrep ∈ X from the likelihood model
with the same (unknown) parameter, implying a density
π(x_rep) := ∫_Θ p(x_rep | θ) π(θ) dθ
          ∝ ∫_Θ p(x_rep | θ) p(x | θ) p(θ) dθ.  (7.10)
For full generality, let T (x, θ ) be a test statistic for measuring discrepancy between
a data-generating parameter θ and observing data x.
Definition 7.5 (Posterior predictive p-value) A posterior predictive p-value for
T (x, θ ) is the upper tail probability
p := ∫_Θ ∫_X 1_{[T(x,θ),∞)}{T(x_rep, θ)} π(θ) p(x_rep | θ) dx_rep dθ.  (7.11)
Remark 7.5 If the test statistic is simply a function of the data, T (x, θ ) ≡ T (x),
then (7.11) simplifies to
p = ∫_X 1_{[T(x),∞)}{T(x_rep)} π(x_rep) dx_rep,
which is the familiar one-sided p-value for an observed statistic T (x), calculated
with respect to the posterior predictive distribution (7.10).
Remark 7.6 More generally, the posterior predictive p-value (7.11) measures how
a joint sample of parameter and new data from the posterior would compare with
sampling a parameter from the posterior and pairing this with the observed data.
Given (possibly approximate) samples θ (1) , . . . , θ (m) obtained from the posterior
density π , a Monte Carlo estimate (cf. Sect. 5.2) of the posterior predictive p-value
(7.11) can be obtained relatively easily provided it is also possible to sample from
the likelihood distribution p(x | θ): for each parameter value θ^{(i)} sampled from the
posterior density π, randomly draw new data x_rep^{(i)} from the generative likelihood
model with that parameter,

x_rep^{(i)} ∼ p(x_rep | θ^{(i)});  (7.12)

the p-value (7.11) is then estimated by the proportion of replicates exceeding the
observed discrepancy,

p̂ := (1/m) Σ_{i=1}^m 1_{[T(x,θ^{(i)}),∞)}{T(x_rep^{(i)}, θ^{(i)})}.  (7.13)
When fitting Bayesian models numerically in Stan (cf. Sect. 6.2), it is relatively simple
to carry out posterior predictive checking using a generated quantities{} code block.
This will be illustrated for the student grades example in Sect. 6.2, by considering
two possible test statistics: The first test statistic uses the negative log likelihood as
a measure of discrepancy,

T(x, θ) = − log p(x | θ).

The second statistic does not depend on the model parameters, simply obtaining the
average score for each student,

x̄_i = (1/p) Σ_{j=1}^p x_{ij},

and then taking the variance of these averages across students as the measure of
discrepancy.
// student_grade_model_ppc.stan

data {
  int<lower=0> n;                  // number of students
  int<lower=0> p;                  // number of tests
  int<lower=0, upper=100> X[n, p]; // student test grades
  real<lower=0> tau;
  real<lower=0> a;
  real<lower=0> b;
}
parameters {
  real z[n];
  real mu;
  real<lower=0> sigma_sq;
}
transformed parameters {
  real<lower=0, upper=1> theta[n];
  real sigma;
  theta = inv_logit(z);
  sigma = sqrt(sigma_sq);
}
model {
  sigma_sq ~ inv_gamma(a,b);
  mu ~ normal(0, sigma * tau);
  z ~ normal(mu, sigma);
  for (i in 1:n)
    X[i] ~ binomial(100,theta[i]);
}
generated quantities{
  int<lower=0, upper=100> X_rep[n, p];
  real log_lhd = 0;
  real log_lhd_rep = 0;
  real ppp;
  for (i in 1:n){
    for (j in 1:p){
      log_lhd += binomial_lpmf(X[i][j] | 100,theta[i]);
      X_rep[i][j] = binomial_rng(100,theta[i]);
      log_lhd_rep += binomial_lpmf(X_rep[i][j] | 100,theta[i]);
    }
  }
  ppp = log_lhd >= log_lhd_rep ? 1 : 0;
}
#! /usr/bin/env python
## student_grade_inference_stan_ppc.py

import stan
import numpy as np
import matplotlib.pyplot as plt

# Simulate data
from student_grade_simulation import sample_student_grades
n, p = 30, 5
X, mu, sigma = sample_student_grades(n, p)
sm_data = {'n':n, 'p':p, 'tau':0.5, 'a':1, 'b':0.5, 'X':X}

# Initialise stan object
with open('student_grade_model_ppc.stan','r',newline='') as f:
    sm = stan.build(f.read(),sm_data,random_seed=1)

# Select the number of MCMC chains and iterations, then sample
chains, samples, burn = 4, 10000, 1000
fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)

def T(x): #Variance of student average scores
    return(np.var(np.mean(x,axis=1)))

t_obs = T(X) #Value of test statistic for observed data
x_rep = fit['X_rep'].reshape(n,p,samples,chains)
t_rep = [[T(x_rep[:,:,i,j]) for i in range(samples)] for j in range(chains)]

# Plot posterior predictive distributions of T from each chain
def posterior_predictive_plots(t_rep,true_val):
    nc = np.matrix(t_rep).shape[0]
    fig,axs=plt.subplots(1,nc,figsize=(10,3),constrained_layout=True)
    fig.canvas.manager.set_window_title('Posterior predictive')
    for j in range(nc):
        axs[j].autoscale(enable=True, axis='x', tight=True)
        axs[j].set_title('Chain '+str(j+1))
        axs[j].hist(np.array(t_rep[j]),200, density=True)
        axs[j].axvline(true_val, color='c', lw=2, linestyle='--')
    plt.show()

posterior_predictive_plots(t_rep,t_obs)

# Calculate and print posterior predictive p-values for T
print("Posterior predictive p-values from variance of means:")
print([np.mean(t_obs > t_rep[j]) for j in range(chains)])

# Print posterior predictive p-values for lhd calculated in Stan
print("Posterior predictive p-values from likelihood:")
print(np.mean(fit['ppp'].reshape(samples,chains),axis=0))
The p-values show no significant discrepancy for either test statistic, indicating
a good model fit; indeed, the data were generated from the assumed probability
model.
Chapter 8
Linear Models
The regression function relating the response to the covariates is specified through
the likelihood density p(yi | θ, xi ) in (8.1). Figure 8.1 shows a belief network rep-
resentation of regression exchangeability.
The linear model is a special case of (8.1), where the parameter is a pair θ = (β, σ ),
with β ∈ R p and σ > 0, and the likelihood density p(yi | θ, xi ) is specified by
yi | θ, xi ∼ Normal(xi · β, σ 2 ). (8.2)
E(yi | xi , β) = xi · β. (8.3)
y_i = f(x_i) + ε_i,
ε_i ∼ Normal(0, σ²),
where
f (x) = x · β (8.4)
In matrix notation, stacking the covariate vectors into the n × p design matrix X
with rows x_1, …, x_n, the model for y = (y_1, …, y_n) is

y | X, β, σ ∼ Normal_n(Xβ, σ² I_n).  (8.5)

Since (8.5) is an exponential family distribution (cf. Sect. 4.3), it follows from Propo-
sition 4.2 that there is a conjugate prior for (β, σ). This takes a canonical form
β | σ ∼ Normal p (0, σ 2 V ),
σ −2 ∼ Gamma(a, b), (8.6)
Exercise 8.1 (Marginal density for regression coefficients) Suppose the conjugate
prior distribution (8.6) for the normal linear model.
(i) Show that, marginally,

p(β) = Γ(a + p/2) / ( (2πb)^{p/2} |V|^{1/2} Γ(a) ) × ( 1 + βᵀV^{−1}β / (2b) )^{−(a + p/2)},
A common specification of the prior covariance parameter is

V = λ^{−1} I_p  (8.8)

for a scalar precision parameter λ > 0. This implies a joint prior probability density
function

p(β, σ^{−2}) = b^a λ^{p/2} exp{ −σ^{−2}(2b + λ β·β)/2 } / ( (2π)^{p/2} Γ(a) σ^{2(a−1)+p} ).  (8.9)
Proposition 8.1 For the linear model (8.5) with conjugate prior (8.6), the posterior
distribution for (β, σ ) after observing responses y = (y1 , . . . , yn ) corresponds to
β | σ, X, y ∼ Normal_p(m_n, σ² V_n),
σ^{−2} | X, y ∼ Gamma(a_n, b_n),

where

V_n = (V^{−1} + XᵀX)^{−1},   m_n = V_n Xᵀ y,
a_n = a + n/2,   b_n = b + (yᵀy − yᵀX m_n)/2.  (8.10)
Proof Completing the square in β,

p(β | σ, X, y) ∝ p(y | X, β, σ) p(β | σ) ∝ exp{ −(β − m_n)ᵀ V_n^{−1} (β − m_n) / (2σ²) }
⟹ β | σ, X, y ∼ Normal_p(m_n, σ² V_n).

For the error precision, first marginalising β,

β | σ ∼ Normal_p(0, σ² V)
⟹ Xβ | σ, X ∼ Normal_n(0, σ² X V Xᵀ)
⟹ y | σ, X ∼ Normal_n(0, σ² (X V Xᵀ + I_n)),  (8.11)

where the last step follows from standard rules for summing Gaussian random vari-
ables. Then by the matrix inversion lemma,

(X V Xᵀ + I_n)^{−1} = I_n − X(V^{−1} + XᵀX)^{−1}Xᵀ = I_n − X V_n Xᵀ
⟹ y | σ, X ∼ Normal_n(0, σ² (I_n − X V_n Xᵀ)^{−1}).

By Bayes’ theorem,

p(σ^{−2} | X, y) ∝ p(y | σ, X) p(σ^{−2}) ∝ (σ^{−2})^{a_n − 1} exp(−b_n σ^{−2})
⟹ σ^{−2} | X, y ∼ Gamma(a_n, b_n).
Exercise 8.2 (Linear model matrix inverse) The matrix inversion lemma
states that for an n × n matrix A, a k × k matrix V and n × k matrices U, W,
(A + U V Wᵀ)^{−1} = A^{−1} − A^{−1}U(V^{−1} + WᵀA^{−1}U)^{−1}WᵀA^{−1}. Using this
result, show that (X V Xᵀ + I_n)^{−1} = I_n − X V_n Xᵀ.
From Proposition 4.1, it follows that the Bayes linear model with conjugate prior
has a closed-form marginal likelihood.
Proposition 8.2 Suppose the Bayes linear model (8.5) with y ∈ Rn , X ∈ Rn× p and
conjugate prior (8.6). The marginal likelihood for y | X is
p(y | X) = Γ(a_n) |V_n|^{1/2} b^a / ( (2π)^{n/2} Γ(a) |V|^{1/2} b_n^{a_n} ).  (8.12)
Equivalently,

y | X ∼ St_n(2a, 0, b(X V Xᵀ + I_n)/a).
Exercise 8.3 (Linear model matrix determinant) The matrix determinant lemma
states that for an n × n matrix A, a k × k matrix V and n × k matrices U, W,
|A + U V Wᵀ| = |V^{−1} + WᵀA^{−1}U| |V| |A|. Using this result, show that |X V Xᵀ +
I_n| = |V| |V^{−1} + XᵀX|.
Proposition 8.3 (Lindley’s paradox) For the linear model under the conjugate prior
(8.6) and assuming (8.8), as λ → 0 the marginal likelihood (8.12) p(y | X ) → 0.
Remark 8.1 Lindley’s paradox in Proposition 8.3 (cf. Proposition 7.1) states that
making prior beliefs increasingly diffuse will eventually lead to diminishingly small
predictive probability density for any possible observation y. Consequently, when
comparing against any fixed alternative model, the Bayes factor in favour of the
alternative model will become arbitrarily large.
Exercise 8.4 (Linear model code) Write computer code (using a language such
as Python) to calculate the marginal likelihood under the linear model. For a matrix
of covariates X and a vector of responses y, write a single function which returns both
the marginal likelihood and the posterior mean for the regression coefficients.
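A minimal sketch of such a function, implementing (8.10) and the log of (8.12) under illustrative default hyperparameters a, b and V = I_p, might be:

import numpy as np
from scipy.special import gammaln

def bayes_linear_model(X, y, a=1.0, b=0.5, V=None):
    n, p = X.shape
    V = np.eye(p) if V is None else V
    Vn = np.linalg.inv(np.linalg.inv(V) + X.T @ X)      # posterior quantities (8.10)
    mn = Vn @ X.T @ y
    an = a + n / 2
    bn = b + (y @ y - y @ X @ mn) / 2
    # log marginal likelihood, from (8.12)
    log_ml = (gammaln(an) - gammaln(a) + 0.5 * np.linalg.slogdet(Vn)[1]
              - 0.5 * np.linalg.slogdet(V)[1] + a * np.log(b)
              - an * np.log(bn) - (n / 2) * np.log(2 * np.pi))
    return log_ml, mn

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=40)
log_ml, mn = bayes_linear_model(X, y)
print(log_ml, mn)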
Exercise 8.6 (Zellner’s g-prior) Suppose the n × p covariate matrix X has rank
p, with n > p, and the matrix V in (8.6) satisfies V = g · (XᵀX)^{−1} for some constant
g > 0; this formulation is known as Zellner’s g-prior (Zellner 1986). Derive a
simplified expression for the linear model marginal likelihood p(y | X ) under this
prior distribution.
A common choice of uninformative prior for the linear model is the reference prior,

p(β, σ²) ∝ σ^{−2},  (8.14)

corresponding to “uniform” prior beliefs for log σ² and each component of the coef-
ficient vector β.
Remark 8.2 The prior density (8.14) is said to be improper since it does not have
a finite integral over the parameter space. It can therefore only be meaningfully
considered as the limiting argument of a sequence of increasingly diffuse, proper
prior densities.
The reference prior can be viewed as a limiting case of the conjugate prior (8.9)
as the hyperparameters a, b, λ → 0. Consequently, the posterior distribution result
from Sect. 8.2.1 carries across as follows.
Proposition 8.4 For the linear model (8.5) with reference prior (8.14), if n > p and
X has rank p, then the posterior distribution for (β, σ) is obtained as the a, b, λ → 0
limit of Proposition 8.1,

β | σ, X, y ∼ Normal_p(m_n, σ² V_n),
σ^{−2} | X, y ∼ Gamma(a_n, b_n),  (8.15)

where now V_n = (XᵀX)^{−1}, m_n = (XᵀX)^{−1}Xᵀy, a_n = n/2 and b_n = (yᵀy − yᵀX m_n)/2.

Remark 8.3 The reference posterior (8.15) is only proper when n > p and the rank
of X is equal to p, so that X has full rank.
Remark 8.5 Bayesian inference for the linear model under the reference prior cor-
responds to the standard estimation procedures from classical statistics. For example,
the posterior mean for β in (8.15) is the usual least squares or maximum likelihood
estimate.
Aside from the reference prior analysis in Sect. 8.2.2, the theory of the Bayes linear
model with conjugate prior required no assumptions about the nature of the covariates
xi = (xi1 , . . . , xi p ) ∈ R p which make up the rows of the matrix X . This observation
allows the following abstraction of the so-called design matrix X , which provides a
valuable generalisation in the use of the linear model.
Most generally, suppose that for each response variable y_i there is an observed
p′-vector of related measurements z_i ∈ R^{p′}, for p′ ≥ 1. Setting p = p′ and x_i = z_i
returns the standard linear model (8.5), which is linear in both β and the measurements
z_i.
More generally, suppose a list of p functions ψ = (ψ1 , . . . , ψ p ),
ψ_j : R^{p′} → R, j = 1, …, p,
such that each function specifies a covariate for the linear model, leading to the
p-vector covariate
xi = ψ(z i ) = (ψ1 (z i ), . . . , ψ p (z i )).
The functions ψ are referred to as basis functions, since the regression function
(8.4) is constructed by linear combinations of the components of ψ,
f(x) = Σ_{j=1}^p β_j ψ_j(z).
Remark 8.6 The linear model (8.5) with design matrix X specified by
X i j = ψ j (z i )
is still a linear model with respect to the regression coefficients β, and so all of
the preceding theory for conjugate posterior distributions and closed-form marginal
likelihoods still applies.
For example, suppose a single observable measurement z ∈ R and consider the
polynomial basis functions

ψ_j(z) = z^{j−1}, j = 1, …, p,

giving the covariate vector x = (1, z, …, z^{p−1}).
Again suppose a single observable measurement z ∈ R and now consider the basis
functions
ψ j (z) = (z − τ j )+ , j = 1, . . . , p, (8.16)
for a sequence of p real values τ1 < τ2 < . . . < τ p , referred to as knot points. Basis
functions of the type (8.16) are known as linear splines, since ψ j (z) is zero up until
the value τ j , and a linear function of z − τ j thereafter.
Taking a linear combination of linear spline basis functions gives a regression
function which is piecewise linear, with changes in gradient occurring at each of the
knot points but no discontinuities. Spline regression models are explored in more
detail in Sect. 10.3.
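As a small illustration, the following Python sketch constructs the design matrix combining an intercept, a linear term and the linear spline basis (8.16); the knot locations are illustrative assumptions.

import numpy as np

def linear_spline_design(z, knots):
    # Columns: 1, z, (z - tau_1)_+, ..., (z - tau_m)_+
    z = np.asarray(z)
    cols = [np.ones_like(z), z] + [np.clip(z - tau, 0, None) for tau in knots]
    return np.column_stack(cols)

z = np.linspace(0, 10, 6)
X = linear_spline_design(z, knots=[2.5, 5.0, 7.5])
print(X)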
The Bayes linear model presented in Sects. 8.2 and 8.3 is mathematically very con-
venient, but is only suitable for cases where the response variables can be assumed to
be normally distributed and where the linear regression function xi · β corresponds
directly to the expected value of the response variable yi (8.3).
Generalised linear models extend the linear model to other exponential family dis-
tributions for the response variable through the introduction of an invertible function
g called the link function, such that
g {E(yi | xi , β)} = xi · β,
or, equivalently,
E(yi | xi , β) = g −1 (xi · β). (8.17)
Remark 8.7 The link function generalises the standard linear regression expectation
(8.3), which is clearly a special case of (8.17) with the identity link function.
Remark 8.8 One advantage of using a link function is to guarantee the expected
value of the response (8.17) lies in the correct domain, without requiring any con-
straints on the possible values which the covariates xi or the regression coefficients
might take.
Two examples of generalised linear models are now briefly presented, where the
response variable is either a non-negative integer count or a binary indicator. In both
cases, a zero-mean normal distribution prior (8.6) with V = I p is assumed for the
regression coefficients, which was shown in (8.7) to imply a t-distribution on the
Euclidean norm of the coefficients.
The first example, Poisson regression for count-valued responses, uses a logarithmic
link function,

log E(y_i | x_i, β) = x_i · β,

or equivalently

y_i | x_i, β ∼ Poisson(exp(x_i · β))
⟹ p(y_i | x_i, β) = exp{ y_i (x_i · β) − exp(x_i · β) } / y_i!.  (8.18)
The Stan implementation of Poisson regression extends the model (8.18) slightly by
assuming the presence of a variable intercept term in the linear model,
yi | xi , α, β ∼ Poisson(exp(αi + xi · β)),
// poisson_regression.stan
data {
int<lower=0> n; // number of observations
int<lower=0> p; // number of covariates
int<lower=0> m; // number of grid points
int<lower=0> y[n]; // response variables
matrix[n,p] X; // matrix of covariates
matrix[m,p] grid; // matrix of grid points
real<lower=0> a;
real<lower=0> b;
}
transformed data {
real t_c = (2*a+p-1)/(2*b);
}
parameters {
vector[p] beta;
}
model {
sqrt(dot_self(beta)*t_c) ~ student_t(2*a, 0, 1);
target += poisson_log_glm_lpmf( y | X, 0, beta );
}
generated quantities {
vector[m] fn_vals;
for (i in 1:m)
fn_vals[i] = exp( dot_product(beta,grid[i]) );
}
The generated quantities{} block declares a vector of values for evaluating the
regression function pointwise over a vector of grid points which are inputs in the
data{} block.
The following PyStan code (poisson_regression_stan.py) simulates
data from a Poisson regression model with a single covariate and then seeks to infer
posterior beliefs about the value of the regression coefficient using
poisson_regression.stan. Both here and in Sect. 8.4.2.1, the plots show
the sampled data and the posterior mean regression function obtained from point-
wise evaluation during posterior sampling, and then the posterior density of the single
coefficient β.
#! /usr/bin/env python
## poisson_regression_stan.py
import stan
import numpy as np
import matplotlib.pyplot as plt
# Simulate data
gen = np.random.default_rng(seed=0)
n = 25
m = 50
T = 10
x = np.linspace(start=0, stop=T, num=n)
grid = np.linspace(start=0, stop=T, num=m)
beta = .5#gen.normal()
y = [gen.poisson(np.exp(x_i*beta)) for x_i in x]
sm_data = {'n':n, 'p':1, 'a':1, 'b':0.5, 'X':x.reshape((n,1)), 'y':y, 'm':m, 'grid':grid.reshape((m,1))}
# Initialise stan object
with open('poisson_regression.stan','r',newline='') as f:
sm = stan.build(f.read(),sm_data,random_seed=1)
# Select the number of MCMC chains and iterations, then sample
chains, samples, burn = 2, 10000, 1000
fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)
# Plot regression function and posterior for beta
fig,axs=plt.subplots(1,2,figsize=(10,4),constrained_layout=True)
fig.canvas.manager.set_window_title('Poisson regression posterior')
f = np.mean(fit['fn_vals'],axis=1)
true_f = [np.exp(beta*x_i) for x_i in grid]
b = fit['beta'][0]
axs[0].plot(grid,f)
axs[0].plot(grid,true_f, color='c', lw=2, linestyle='--')
axs[0].scatter(x,y, color='black')
axs[0].set_title('Posterior mean regression function')
axs[0].set_xlabel(r'$x$')
h = axs[1].hist(b,200, density=True);
axs[1].axvline(beta, color='c', lw=2, linestyle='--')
axs[1].set_title('Approximate posterior density of '+r'$\beta$')
axs[1].set_xlabel(r'$\beta$')
plt.show()
The second example, logistic regression for binary responses, uses the logit link
function,

log [ E(y_i | x_i, β) / (1 − E(y_i | x_i, β)) ] = x_i · β,

or equivalently y_i | x_i, β ∼ Bernoulli( 1 / {1 + exp(−x_i · β)} ).
// logistic_regression.stan
data {
int<lower=0> n; // number of observations
int<lower=0> p; // number of covariates
int<lower=0> m; // number of grid points
int<lower=0,upper=1> y[n]; // response variables
matrix[n,p] X; // matrix of covariates
matrix[m,p] grid; // matrix of grid points
real<lower=0> a;
real<lower=0> b;
}
transformed data {
real t_c = (2*a+p-1)/(2*b);
}
parameters {
vector[p] beta;
}
model {
sqrt(dot_self(beta)*t_c) ~ student_t(2*a, 0, 1);
y ~ bernoulli_logit( X * beta );
}
generated quantities {
vector[m] fn_vals;
for (i in 1:m)
fn_vals[i] = inv_logit( dot_product(beta,grid[i]) );
}
#! /usr/bin/env python
## logistic_regression_stan.py
import stan
import numpy as np
import matplotlib.pyplot as plt
# Simulate data
gen = np.random.default_rng(seed=0)
n = 25
m = 50
T = 5
x = np.linspace(start=-T, stop=T, num=n)
grid = np.linspace(start=-T, stop=T, num=m)
beta = .5#gen.normal()
y = [gen.binomial(1,1/(1+np.exp(-x_i*beta))) for x_i in x]
sm_data = {'n':n, 'p':1, 'a':1, 'b':0.5, 'X':x.reshape((n,1)), 'y':y, 'm':m, 'grid':grid.reshape((m,1))}
# Initialise stan object
with open('logistic_regression.stan','r',newline='') as f:
sm = stan.build(f.read(),sm_data,random_seed=1)
# Select the number of MCMC chains and iterations, then sample
chains, samples, burn = 2, 10000, 1000
fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)
# Plot regression function and posterior for beta
fig,axs=plt.subplots(1,2,figsize=(10,4),constrained_layout=True)
fig.canvas.manager.set_window_title('Logistic regression posterior')
f = np.mean(fit['fn_vals'],axis=1)
true_f = [1.0/(1+np.exp(-beta*x_i)) for x_i in grid]
b = fit['beta'][0]
axs[0].plot(grid,f)
axs[0].plot(grid,true_f, color='c', lw=2, linestyle='--')
axs[0].scatter(x,y, color='black')
axs[0].set_title('Posterior mean regression function')
axs[0].set_xlabel(r'$x$')
h = axs[1].hist(b,200, density=True);
axs[1].axvline(beta, color='c', lw=2, linestyle='--')
axs[1].set_title('Approximate posterior density of '+r'$\beta$')
axs[1].set_xlabel(r'$\beta$')
plt.show()
Chapter 9
Nonparametric Models
The Dirichlet process (Ferguson 1973) is a conjugate prior for Bayesian inference
about an unknown probability distribution function F.
Definition 9.1 (Dirichlet process) Let α > 0 and let P_0 be a probability measure on
X (with distribution function F_0). A random probability measure P (with distribution
function F) is said to be a Dirichlet process with base measure α · P_0, written P ∼
DP(α · P_0), if for every (measurable) finite partition B_1, …, B_k of X,

(P(B_1), …, P(B_k)) ∼ Dirichlet(α P_0(B_1), …, α P_0(B_k)).
Remark 9.1 The base measure P0 of the Dirichlet process is also the mean, such
that for every (measurable) subset B ⊆ X ,
E{P(B)} = P_0(B).

The concentration parameter α controls the variability of P about its mean,

V{P(B)} = P_0(B){1 − P_0(B)} / (α + 1).  (9.3)
[Fig. 9.1: stick-breaking construction, showing the first weights π_1 = γ_1 and π_2 = γ_2(1 − γ_1), with remaining stick lengths (1 − γ_1) and (1 − γ_1)(1 − γ_2).]
Remark 9.2 A draw from any Dirichlet process is a discrete distribution with prob-
ability 1, even if the base measure is continuous.
Remark 9.3 The stick-breaking analogy for Definition 9.2 envisages successively
breaking into pieces a stick of unit length, each time snapping off and laying down
a section and then continuing to break the remaining piece of stick. For a GEM(α)
distribution in Definition 9.3, at each break point, the proportion of remaining stick
broken off and placed down follows a Beta(1, α) distribution. The procedure is
illustrated in Fig. 9.1.
A draw P ∼ DP(α · P_0) admits the stick-breaking representation as a discrete
measure

P = Σ_{j=1}^∞ w_j δ_{θ_j},

where the atoms of mass are independently drawn from the base measure,

θ_1, θ_2, … ∼ P_0,

and the masses w_j are obtained from a stick-breaking process with a Griffiths-Engen-
McCloskey distribution,

(w_1, w_2, …) ∼ GEM(α).  (9.6)
It was noted above that the Dirichlet process is a conjugate prior for an unknown
probability distribution. This is now demonstrated in the following proposition.
Suppose x = (x_1, …, x_n) are independent samples from P ∼ DP(α · P_0). Let

P̂_n(B) = (1/n) Σ_{i=1}^n 1_B(x_i)

be the empirical measure of the samples x, and let α_n* = α + n, π_n* = α/α_n* and

P_n* = π_n* P_0 + (1 − π_n*) P̂_n.

Then

P | x ∼ DP(α_n* · P_n*).

Proof For any (measurable) finite partition B_1, …, B_k of X,

p(P(B_1), …, P(B_k)) = Γ(α) / ∏_{j=1}^k Γ{α P_0(B_j)} × ∏_{j=1}^k P(B_j)^{α P_0(B_j) − 1}.
Let n_j = Σ_{i=1}^n 1_{B_j}(x_i) be the number of samples falling inside B_j. Then the joint
density of x is

p(x | P) = ∏_{j=1}^k P(B_j)^{n_j},

so p(P(B_1), …, P(B_k) | x) ∝ ∏_{j=1}^k P(B_j)^{α P_0(B_j) + n_j − 1}, a Dirichlet density with
parameters α P_0(B_j) + n_j = α_n* P_n*(B_j), matching the finite-partition law of DP(α_n* · P_n*).
It follows from Proposition 4.1 that independent observations from an unknown dis-
tribution function have a closed-form marginal likelihood under a Dirichlet process
prior. The form of the marginal likelihood is most straightforward when the base
measure P0 is discrete, with corresponding probability mass function p0 .
Perhaps the most revealing formulation of the Dirichlet process arises from con-
sidering the predictive distribution of a further random sample.
Corollary 9.1 The predictive distribution for a new observation xn+1 drawn from
the same unknown distribution is
p(x_{n+1} | x) = ( α p_0(x_{n+1}) + Σ_{i=1}^n 1_{{x_{n+1}}}(x_i) ) / (α + n).  (9.9)
Proof This follows immediately by expressing the predictive distribution as the ratio
of the respective joint distributions (9.8) for (x, xn+1 ) and x.
Remark 9.5 The form of the predictive distribution (9.9) has a clear interpretation
that a further sample xn+1 can be viewed as a draw from the following mixture distri-
bution: with probability α/(α + n), a new value is sampled from the base distribution
P0 , and with the remaining probability a repeated value is sampled from the empirical
distribution of values observed so far. The concentration parameter can therefore be
interpreted as a notional prior sample size reflecting the base measure.
Remark 9.6 Following the sequential sampling procedure (9.9), the number of sam-
ples drawn directly from the base distribution P_0 follows a so-called Chinese restaurant table dis-
tribution. After n samples, from Teh (2010), for example, this distribution is known
to have expected value

α{ψ(α + n) − ψ(α)} ≈ α log(1 + n/α),

where ψ denotes the digamma function.
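The predictive representation (9.9) gives a simple simulation scheme; the following Python sketch draws sequentially from the implied Polya urn, with a standard normal base distribution chosen as an illustrative assumption.

import numpy as np

def dp_samples(alpha, n, rng=None):
    rng = rng or np.random.default_rng(0)
    x = []
    for i in range(n):
        if rng.uniform() < alpha / (alpha + i):
            x.append(rng.standard_normal())        # new draw from the base P0
        else:
            x.append(x[rng.integers(len(x))])      # repeat an earlier value
    return np.array(x)

x = dp_samples(alpha=5.0, n=1000)
print(len(np.unique(x)))   # on average close to alpha * log(1 + n/alpha)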
Polya trees (Mauldin et al. 1992) are a more general class of nonparametric models
for random measures which can support both continuous and discrete distributions.
For real-valued random variables, Polya trees are defined on an infinite sequence of
recursive partitions of a subset of the real line B ⊆ R.
Definition 9.4 (Binary sequences) Let E_0 = {∅} and for m ≥ 1, define

E_m := {0, 1}^m,
E := ∪_{m=0}^∞ E_m,

such that E_m is the set of all length-m binary sequences and E is the set of all
finite-length binary sequences.
Definition 9.5 (Binary tree of partitions) A set Π = {π_0, π_1, …} of nested partitions
of B is said to be a binary tree of partitions if |π_m| = 2^m. Clearly π_0 = {B}, and since
the partitions are nested, the sets in each partition πm can be indexed by elements of
E m in such a way that, for all e ∈ E m−1 ,
Be = Be0 ∪ Be1
for the set Be ∈ πm−1 and the two sets Be0 , Be1 ∈ πm .
Remark 9.7 A natural binary tree of partitions of the unit interval B = (0, 1) is
π0 = {(0, 1)} and
π_m = { ( Σ_{j=1}^m e_j/2^j, 1/2^m + Σ_{j=1}^m e_j/2^j ) | (e_1, …, e_m) ∈ E_m }, m > 0,
Fig. 9.2 The first five layers of an infinite sequence of recursive partitions, from π_0 : B down to
π_4. The shaded regions show the path through the preceding layers to an example set B_{0110} in π_4
Definition 9.7 (Polya tree) Let Π be a binary tree of partitions and suppose
A = {α_e | e ∈ E} is a corresponding set of positive constants α_e > 0 defined for
all partition layers in Π. For a random probability measure P, if for all m > 0 and
(e_1, …, e_m) ∈ E_m the splitting probabilities satisfy

P(B_{e_1…e_m} | B_{e_1…e_{m−1}}) ∼ Beta(α_{e_1…e_m}, α_{e_1…e_{m−1}(1−e_m)}),

independently, then P is said to be a Polya tree with respect to (Π, A), written
P ∼ PT(Π, A).
Remark 9.9 The Dirichlet process from Sect. 9.2 is a special case of the Polya tree
satisfying αe0 + αe1 = αe for all e ∈ E.
By construction, for any e = (e_1, …, e_m) ∈ E_m,

P(B_{e_1…e_m}) = ∏_{j=1}^m P(B_{e_1…e_j} | B_{e_1…e_{j−1}}),

where each term (splitting probability) has an independent Beta distribution with
parameters corresponding to that path.
Proof If, for all e, αe0 = αe1 then by symmetry, for all m > 0 and e ∈ E m , E{P(Be )} =
1/2m . The result follows from the usual inversion rule for continuous distribution
functions.
The conjugacy of the Polya tree prior follows immediately from the conjugacy of
the beta distribution for Bernoulli observations noted in Table 4.1 of Sect. 4.2.
Suppose x = (x_1, …, x_n) are independent samples from P ∼ PT(Π, A). Let

n_e = Σ_{i=1}^n 1_{B_e}(x_i)

be the number of samples which fall inside the set B_e, and let A_n* = {α_e + n_e | e ∈
E}. Then

P | x ∼ PT(Π, A_n*).
Proof For each sample x_i and each non-trivial partition level m > 0, recall ε_{m−1}(x_i)
as the unique binary sequence of length m − 1 such that x_i ∈ B_{ε_{m−1}(x_i)}. Conditional
on ε_{m−1}(x_i), x_i must fall in either B_{ε_{m−1}(x_i)0} or B_{ε_{m−1}(x_i)1}; from these two possibilities,
x_i falls in B_{ε_{m−1}(x_i)e_m} with an unknown, Beta(α_{ε_{m−1}(x_i)e_m}, α_{ε_{m−1}(x_i)(1−e_m)}) distributed
probability. Denoting this probability θ,
As noted above, Polya trees can be constructed to give probability one to either dis-
crete or continuous distributions. The special case of the Dirichlet process obtained
when αe0 + αe1 = αe for all e exemplifies the discrete case. For guaranteeing con-
tinuous probability distributions, Lavine (1992) showed that “as long as the αe ’s do
not decrease too rapidly with m”, P will be continuous; a commonly used choice is
α_{e_1…e_m} = α m² for some single parameter α > 0.
As with the Dirichlet process, it follows from Proposition 4.1 that independent
observations from an unknown distribution function have a closed-form marginal
likelihood under a Polya tree prior; in this case, this marginal likelihood is most
straightforward when the base measure P0 is continuous.
where n_{e,j} = Σ_{i<j} 1_{B_e}(x_i) and m*(x, j) = min{m > 0 | ε_m(x_i) ≠ ε_m(x_j) for all i < j}
is the first partition level at which none of the first (j − 1) samples in x lies
within the same set as x_j.
Exercise 9.4 (Polya tree sampling) Write computer code (using a language such
as Python) to sample a random probability density function from a Polya tree
model with α_{e_1…e_m} = α m² and a binary tree of partitions centred on F_0(x) =
Φ(x), the standard normal distribution function. Plot three sampled densities obtained
from setting α = 1, 100, 10000, respectively.
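One possible sketch for this exercise truncates the tree at an assumed finite depth M (deeper levels are left at the base measure) and approximates the sampled density at the midpoints of the level-M sets:

import numpy as np
from scipy.stats import norm, beta

def sample_polya_tree_density(alpha, M=10, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.ones(1)                       # masses P(B_e) for all e at the current level
    for m in range(1, M + 1):
        theta = beta.rvs(alpha * m**2, alpha * m**2, size=len(probs), random_state=rng)
        # split each set's mass between its two children
        probs = np.column_stack([probs * theta, probs * (1 - theta)]).ravel()
    # level-M sets are (Phi^{-1}(i/2^M), Phi^{-1}((i+1)/2^M)); evaluate at midpoints
    u = np.linspace(0, 1, 2**M, endpoint=False) + 0.5 / 2**M
    x = norm.ppf(u)
    # density is set mass / set width; width of set i is approximately 2^{-M}/phi(x)
    density = probs * 2**M * norm.pdf(x)
    return x, density

x, d = sample_polya_tree_density(alpha=100)   # e.g. plt.plot(x, d) for each alpha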
Definition 9.9 (Bayesian histogram) Let α > 0 and let P_0 be a probability measure
on [a, b]. Let τ = (τ_1, …, τ_m) be an increasing sequence of m cut points partitioning
[a, b] into (m + 1) segments, writing τ_0 = a and τ_{m+1} = b, with corresponding
segment probabilities θ = (θ_1, …, θ_{m+1}) satisfying Σ_{j=1}^{m+1} θ_j = 1. A Bayesian
histogram model for a random probability measure P assumes the following
representation for the density p:

p(x | m, τ, θ) = Σ_{j=1}^{m+1} 1_{[τ_{j−1}, τ_j)}(x) θ_j / (τ_j − τ_{j−1}),

θ | m, τ ∼ Dirichlet{α P_0([τ_0, τ_1)), …, α P_0([τ_m, τ_{m+1}))}.  (9.12)
Remark 9.11 The base probability measure P0 from Definition 9.9 is the prior
expectation for P, such that for (P0 -measurable) A ⊆ [a, b], E{P(A)} = P0 (A).
p(x | m, τ) = Γ(α) / Γ(α + n) × ∏_{j=1}^{m+1} Γ{α P_0([τ_{j−1}, τ_j)) + n_j} / ( Γ{α P_0([τ_{j−1}, τ_j))} (τ_j − τ_{j−1})^{n_j} ),  (9.14)

where n_j = Σ_{i=1}^n 1_{[τ_{j−1}, τ_j)}(x_i) is the number of samples lying in the segment
[τ_{j−1}, τ_j).
Since

p(x | m, τ, θ) = ∏_{j=1}^{m+1} ( θ_j / (τ_j − τ_{j−1}) )^{n_j},

the results simply follow from the conjugacy of the multinomial and Dirichlet
distributions (cf. Table 4.1).
For any choice of prior density p(m, τ ), the corresponding posterior density for
the number of cut points is given up to proportionality by
p(m, τ | x) ∝ p(m, τ) Γ(α) / Γ(α + n) × ∏_{j=1}^{m+1} Γ{α P_0([τ_{j−1}, τ_j)) + n_j} / ( Γ{α P_0([τ_{j−1}, τ_j))} (τ_j − τ_{j−1})^{n_j} ).  (9.16)
Now consider three simplifications of the histogram model (9.12). First, for simplicity
of presentation and without loss of generality, suppose that the interval of interest is
the unit interval B = [0, 1].
Second, suppose the base measure P0 in Definition 9.9 is the natural default choice
for the unit interval, Lebesgue measure, such that P0 ([τ j−1 , τ j )) = τ j − τ j−1 .
Third, suppose the unknown distribution P is characterised by a partition model
with an unknown number of equally spaced cut points on [0, 1]. To simplify sub-
sequent notation, let m now denote the number of equally sized segments rather
than the number of cut points. Making this assumption is then equivalent to speci-
fying p(m, τ ) through a non-degenerate probability model p(m) for the unknown
number of segments, whilst for m > 1 the conditional distribution p(τ | m) assigns
probability one to the vector τ* with jth element

τ_j* = j/m, j = 1, …, m − 1.
With these three conditions, the posterior density (9.16) simplifies to

p(m | x) ∝ p(x, m) = p(m) m^n Γ(α) / ( Γ(α + n) Γ(α/m)^m ) × ∏_{j=1}^m Γ( α/m + n_j^{(m)} ),  (9.17)

where n_j^{(m)} is the number of samples lying between (j − 1)/m and j/m.
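The following Python sketch evaluates (9.17) numerically up to its common normalising constant; the simulated data, the uniform prior p(m) and the truncation point m_max are illustrative assumptions.

import numpy as np
from scipy.special import gammaln

def log_posterior_m(x, alpha=1.0, m_max=30):
    x = np.asarray(x)                  # data assumed to lie in [0, 1)
    n = len(x)
    log_post = np.full(m_max, -np.inf)
    for m in range(1, m_max + 1):
        # n_j^{(m)}: counts of samples in each of the m equal-width bins
        counts = np.bincount(np.minimum((m * x).astype(int), m - 1), minlength=m)
        log_post[m - 1] = (-np.log(m_max)                    # uniform prior p(m)
                           + n * np.log(m)
                           + gammaln(alpha) - gammaln(alpha + n)
                           - m * gammaln(alpha / m)
                           + gammaln(alpha / m + counts).sum())
    return log_post - np.logaddexp.reduce(log_post)   # normalise over m <= m_max

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=200)
print(np.argmax(log_posterior_m(x)) + 1)   # posterior modal number of segments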
Also, under this simplified model it follows from (9.13) that after marginalising
θ, the posterior predictive density conditional on m satisfies

p(x | m, x) = Σ_{j=1}^m 1_{[j−1, j)}(m x) ( α + m n_j^{(m)} ) / (α + n).  (9.18)
Model averaging (9.18) with respect to the prior distribution for m obtains the
marginal predictive density

p(x | x) ∝ Σ_{m=1}^∞ p(m) m^n Γ(α) / ( Γ(α + n) Γ(α/m)^m ) × ( ∏_{j′=1}^m Γ(α/m + n_{j′}^{(m)}) ) Σ_{j=1}^m 1_{[j−1, j)}(m x) ( α + m n_j^{(m)} ) / (α + n),  (9.19)

normalised by the marginal likelihood p(x) in (9.20) below. The predictive density
(9.19) could be estimated by a finite approximation of the outer sum.
Remark 9.13 Relaxing the first two assumptions of this section and returning to a
general base measure P0 on [a, b] with corresponding distribution function F0 , the
same principle of equal bin-width histogram modelling could equally be applied on
the F_0-scale, such that segment j is the interval [τ*_{j−1}, τ_j*) where τ_j* = F_0^{−1}(j/m).
This is somewhat analogous to the Polya tree partition of (9.10). Then, for example,

p(m | x) ∝ p(m) Γ(α) / ( Γ(α + n) Γ(α/m)^m ) × ∏_{j=1}^m Γ( α/m + n_j^{(m)} ) / (τ_j* − τ*_{j−1})^{n_j^{(m)}}.
The simplicity of inference for the Bayesian histogram with equal bin widths (and
using Lebesgue measure as the base measure) was illustrated by the joint density
p(x, m) (9.17). With just a single unknown parameter m, it is feasible to take a finite
sum approximation of the marginal likelihood,

p(x) = Σ_{m=1}^∞ p(x, m).  (9.20)
Chapter 10
Nonparametric Regression
for all n ≥ 1, x_1, …, x_n ∈ X and c_1, …, c_n ∈ R,

Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0.
Example 10.1 (Example kernel functions) The following examples satisfy the pos-
itive semidefinite requirement of Definition 10.1 for α, ρ > 0.
• Squared exponential:
k(x, x′) = α² exp(−‖x − x′‖²/(2ρ²)).  (10.1)
• Dot product/linear:
k(x, x′) = α²(x · x′).
• Exponential:
k(x, x′) = α² exp(−‖x − x′‖/ρ).  (10.2)
where K(x, x) denotes the n × n matrix of pairwise kernel evaluations,

K(x, x) = [ k(x_i, x_j) ]_{i,j=1}^n.  (10.3)
Remark 10.1 The squared exponential kernel (10.1) is the most commonly used
kernel in Gaussian process modelling; samples from processes with this kernel are
infinitely differentiable (Rasmussen and Williams 2005, p. 83). For the exponential
kernel (10.2), samples are continuous but not differentiable (Rasmussen and Williams
2005, p. 86).
Exercise 10.1 (Gaussian process closure) Suppose f ∼ GP(m, k) and m ∼ GP
(m_0, k_0), where m_0 is any function and k and k_0 are positive semidefinite kernels.
Show that marginally,
f ∼ GP(m 0 , k + k0 ). (10.4)
Proposition 10.1 (Gaussian process posterior) Suppose f ∼ GP(m, k), and that
noisy observations y = (y_1, …, y_n) of the function at points x = (x_1, …, x_n) are
obtained according to

y_i = f(x_i) + ε_i,  (10.5)

with independent errors ε_i ∼ Normal(0, σ²). Then

f | x, y ∼ GP(m*, k*),

where

m*(x̃) = m(x̃) + K(x̃, x){K(x, x) + σ² I_n}^{−1}(y − m(x)),
k*(x̃, x̃′) = k(x̃, x̃′) − K(x̃, x){K(x, x) + σ² I_n}^{−1} K(x, x̃′).  (10.6)
Proof The result follows from the conjugacy of normal-normal mixtures exploited
in Chap. 8.
It follows from Proposition 4.1 that observations of an unknown function with
independent Gaussian errors have a closed-form marginal likelihood under a Gaus-
sian process prior.
Proposition 10.2 (Gaussian process marginal likelihood) If f ∼ GP(m, k) and
independently y_i ∼ Normal(f(x_i), σ²), i = 1, …, n, then y | x has a marginal like-
lihood satisfying

p(y | x) = Normal_n(y; m(x), K(x, x) + σ² I_n).  (10.7)

Proof As noted in Rasmussen and Williams (2005, p. 19), the likelihood (10.7) is
obtained directly from observing that y | x ∼ Normal_n(m(x), K(x, x) + σ² I_n).
In a further duality with the linear model, the inverse-gamma distribution can
provide a conjugate prior distribution for the error variance σ² if the kernel k can
be satisfactorily factorised as k(x, x′) = σ² k′(x, x′) for a kernel k′, such that beliefs
about the parameters of k′ do not depend on σ.

Corollary 10.1 Under the conditions of Proposition 10.2, suppose k(x, x′) = σ²
k′(x, x′) and correspondingly the matrix K(x, x) = σ² K′(x, x). Assuming the con-
jugate prior,

σ^{−2} ∼ Gamma(a, b),

the marginal likelihood is

p(y | x, k, m) = 1 / ( (2π)^{n/2} |K′(x, x) + I_n|^{1/2} ) × Γ(a_n) b^a / ( Γ(a) b_n^{a_n} ),  (10.8)

where

a_n = a + n/2,
b_n = b + (y − m(x))ᵀ {K′(x, x) + I_n}^{−1} (y − m(x))/2.
y ∼ Normaln (Xβ, σ 2 In ),
β ∼ Normal p (0, σ 2 λ−1 I p ),
10.2.2 Inference
With normally distributed observation errors leading to closed-form expressions for
the marginal likelihood in Proposition 10.2 and Corollary 10.1, inferential attention
is often primarily focused on the selection of the covariance kernel and the associated
parameters, and secondarily on the mean function (which is often simply assumed
to be zero everywhere).
Remark 10.2 There are two related reasons why a zero mean might safely be
assumed, without significant loss of generality, for Gaussian process modelling of
an unknown function f .
10.2 Gaussian Processes 111
First, if a non-zero mean function m(x) is assumed to be known, then attention can
switch to quantifying uncertainty about the deviation ( f − m) ∼ GP(0, k); inference
about f − m is then based on correspondingly detrended observations (xi , yi ) where
yi = yi − m(xi ), i = 1, . . . , n.
Second, if the mean function m(x) is unknown but can be assumed to also have
a Gaussian process prior with known mean function m 0 (x), then by Exercise 10.1
the marginal distribution for f is again a Gaussian process with mean m 0 (x) and
an additively modified covariance kernel (10.4). The known mean function m 0 (x)
could be subtracted from the observation process and inference about f − m 0 could
again proceed with the assumption of a zero-mean Gaussian process.
Example 10.2 Consider the normal error model from Corollary 10.1 which assumes
an inverse-gamma prior for σ 2 , and suppose the assumed covariance function is the
popular squared exponential covariance kernel (10.1). For simplicity of presentation,
the following Stan code (gp_regression.stan) assumes a univariate unknown
function following a Gaussian process with zero mean function and uninformative,
improper priors for the amplitude parameter α and the length-scale parameter ρ
which determine the squared exponential kernel.
block for placing user-defined functions, where the covariance function is required
for obtaining the posterior mean regression function (10.6), and second within the
block for calculating the covariance matrix fac-
tor (K (x, x) + In ) needed for evaluating the likelihood (10.8). For the latter use
case, the observation in Corollary 10.1 that y is marginally multivariate Student’s
t-distributed is utilised to obtain a very simple statement for the block.
The following PyStan code (gp_regression_stan.py) simulates univariate
functional data with independent standard normal errors, where the true underlying
function is
x2
f (x) = 10 + 5 sin(x) + , 0 ≤ x ≤ 10. (10.9)
5
The code then calls gp_regression.stan in order to make posterior inference
about the two parameters of the squared exponential kernel. Two inferential summary
plots are provided for illustration: First, the posterior mean regression function, eval-
uated at 50 equally spaced grid points; second, the posterior distribution of the most
interesting length-scale parameter ρ, which determines the smoothness of the regres-
sion function by controlling the rate at which covariance decreases with increasing
distance between input points.
10.3 Spline Models 113
m
f (x) = α0 + α1 x + β j (x − τ j )+ , (10.10)
j=1
d
m
f (x) = αjx +
j
β j (x − τ j )d+ . (10.11)
j=0 j=1
with the left-hand side emphasising the dependency of the design matrix X of this
linear model on the m-vector of knot points τ . The quantities an , bn , Vn were defined
in (8.10).
Continuing on from an earlier remark in Sect. 10.2, a consequence of spline
regression being a linear model is that it must therefore be a (degenerate) special case
of a Gaussian process. However, for fixed m, spline regression is not a nonparametric
model. To endow spline regression with the properties of a nonparametric model,
the number of knots must be allowed to increase without upper bound.
Following the same construction as the Bayesian histogram model in Sect. 9.4.1,
a suitable prior distribution p(m, τ ) is required for the number and location of the
knot points, with the Poisson process prior (9.15) being a default choice. Inference
from the posterior density for the knot locations,
1
|Vn | 2
p(m, τ | y, x) ∝ p(m, τ ) 1 ,
|V | 2 bn an
can be achieved using (reversible jump) Markov chain Monte Carlo sampling (Green
1995) (cf. Chap. 5). Further, in-depth coverage of inference for nonparametric spline
models is provided within (Denison et al. 2002).
Exercises 10.3 Spline regression as a Gaussian process. Suppose the spline regres-
sion function f from (10.11) with coefficients (α0 , . . . , αd , β1 , . . . , βm ) ∼
Normalm+d+1 (0, v Im+d+1 ). Express f as a Gaussian process.
T
τ j∗ = j , j = 1, . . . , m. (10.12)
m+1
As in Sect. 9.4.2, posterior expectations with respect to (10.13) can then be calculated
directly by taking a finite sum approximation over a sufficiently large range of values
for m.
10.3 Spline Models 115
Example 10.3 Consider a spline regression function with equally spaced knot
points, partitioning [0, T ] into an unknown, geometrically distributed number of
segments,
d
m
j d
f (x) = αjx +
j
βj x − T ,
j=0 j=1
m+1 +
p(m) = (1 − λ)λm ,
where x j denotes the predictor values from x which lie inside B j , and y j denotes the
corresponding responses. More generally, a partition regression model may incorpo-
rate additional global parameters ψ, suggesting a more general likelihood function
p(y | x, π ) = dQ(ψ) p(y j | x j , θ j , ψ) dQ(θ j | ψ). (10.15)
j Θ
Two illustrative examples of this modelling paradigm are presented in this section:
univariate changepoint models and multivariate classification and regression trees.
If a conjugate regression model is assumed for each segment, (10.14) and (10.15)
can each provide closed-form expressions for the corresponding changepoint likeli-
hood function p(y | x, m, τ, ). In such cases, assuming a suitable prior distribution
p(m, τ ) for the number and location of the changepoints such as the Poisson process
prior (9.15), inference from the posterior density for the changepoints,
118 10 Nonparametric Regression
can be achieved via (reversible jump) Markov chain Monte Carlo sampling (Green
1995). Python code implementing MCMC inference for some standard cases of the
segment regression model p(y | x, θ j , ψ) can be found in Heard (2020); this software
also considers an extended modelling paradigm of changepoint regimes, allowing an
additional complexity that regression parameters θ j might be shared between several
changepoint segments.
Exercises 10.4 (Normal changepoint model as a Gaussian process) Suppose m > 0
known changepoints τ = (τ1 , . . . , τm ) ∈ (0, T )m in a piecewise constant regression
function,
m
f (x) = 1[τ j ,τ j+1 ) (x) · μ j ,
j=0
Even though the assumed regression function has m discontinuities for each value
of m, the model-averaged posterior mean regression function is continuous. Con-
versely, because the number of observations is small (n = 10) the posterior mode
for the number of changepoints is found at m = 2, despite the underlying regression
function begin a smooth, non-constant function.
Y N
x2 ≤ a
Y N Y N
x1 ≤ b x3 ≤ c
Y N
x1 ≤ d
Fig. 10.1 An example of a classification and regression tree model on three variables
120 10 Nonparametric Regression
Any descendant splitting node label s is uniquely defined given its parent’s label s ,
setting s = 2s if the node acts on data for which the query at the parent node is true,
and s = 2s + 1 otherwise.
Exercises 10.5 (CART notation and partition) Consider the tree in Fig. 10.1.
(i) Express the tree as a set of triples (10.17) according to the notation of Denison
et al. (1998).
(ii) State the partition of R3 implied by the tree.
Hierarchical models were previously discussed in Sect. 3.3. This chapter gives fur-
ther details of practical Bayesian modelling with hierarchies. In some application
contexts, the hierarchies are understood to be known during the data collection pro-
cess. For example, in the student-grade model of Sect. 6.1, the hierarchical structure
recognised that each row of the data matrix X corresponded to test grades from the
same student.
In other contexts, the hierarchies may be a subjective construct with associated
uncertainty. These hierarchies are characterised by additional unknown parameters,
sometimes formulated as discrete clusters and otherwise as continuous latent factors.
This chapter considers some more advanced modelling techniques commonly applied
in such cases.
where the integral is taken over some suitable space of density functions for the
unknown density p.
A flexible class of density functions can be obtained by considering families of
mixture distributions. As with the partition models considered in Sect. 9.4, each
component density might be a relatively standard parametric model and yet still give
rise to a mixture which is very adaptable to different underlying density shapes. The
following sections present finite and infinite mixture representations, although the
difference between the two can be fairly limited in practice.
The original version of this chapter has been revised due to typographic errors. The corrections to
this chapter can be found at https://doi.org/10.1007/978-3-030-82808-0_12
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, 121
corrected publication 2022
N. Heard, An Introduction to Bayesian Inference,
Methods and Computation, https://doi.org/10.1007/978-3-030-82808-0_11
122 11 Clustering and Latent Factor Models
Suppose the assumed density p is a mixture of m component densities from the same
parametric family, with a general formulation
m
p(x) = w j f (x | θ j , ψ), (11.1)
j=1
z i ∼ Categoricalm (w),
xi ∼ f (· | θzi , ψ), (11.2)
such that z i takes value j ∈ {1, . . . , m} with probability p(z i = j) = w j , and then
xi is sampled from the z i th-component density.
Inferring the latent variables z equates to clustering the observed variables x into at
most m non-empty groups, where only samples within the same cluster are assumed
to be drawn from the same population. The conditional likelihood function for x
given the latent cluster allocations z is simply
n
p(x | z, θ, ψ) = f (xi | θzi , ψ). (11.3)
i=1
n
nj = 1{ j} (z i ) (11.6)
i=1
be the number of samples attributed to the jth cluster. Under the categorical model
(11.2),
m
n
p(z | w) = wj j . (11.7)
j=1
Marginalising (11.7) with respect to the Dirichlet prior (11.5) for the unknown mix-
ture weights yields a marginal distribution for the cluster allocations,
Γ (α) Γ (α j + n j )
m
p(z) = . (11.8)
Γ (α + n) j=1 Γ (α j )
Under the Dirichlet prior, the joint conditional distribution for x and z can be
conveniently written up to proportionality as
⎧ ⎫
m ⎨
⎬
p(x, z | θ, ψ) ∝ Γ (α j + n j ) f (xi | θ j , ψ) .
⎩ ⎭
j=1 i:z i = j
Alternatively, by first marginalising the unknown parameters (θ, ψ), the expression
(11.8) can be combined with (11.4) to yield
124 11 Clustering and Latent Factor Models
⎧ ⎫
m ⎨ ⎬
p(x, z) ∝ p(z) p(x | z) ∝ Γ (α j + n j ) f (xi | θ j , ψ) dQ(θ j | ψ) dQ(ψ).
j=1 ⎩ Θ i:z = j ⎭
i
(11.9)
For densities which require support over the whole real line, f in (11.1) is commonly
assumed to be the density of a normal distribution with parameter pair θ j = (μ j , σ j )
denoting the mean and standard deviation, respectively, for the jth mixture compo-
nent, implying
m
p(x) = w j φ{(x − μ j )/σ j }/σ j ,
j=1
with a, b, λ > 0, the parameters μ j and σ j2 can be integrated out according to (11.9)
to obtain the joint distribution (11.9) of x and z,
Γ (α)
m 1
Γ (α j + n j )Γ (a + n j /2) λ 2 ba
p(x, z) = n nj ,
Γ (α + n)(2π ) 2 1 a+ 2
j=1 Γ (α j )Γ (a)(λ + n j ) 2 b + 21 {ẍ j − ẋ 2j /(λ + n j )}
(11.10)
where ẋ j = i:zi = j xi and ẍ j = i:zi = j xi2 .
A simpler but less flexible implementation of the mixture of Gaussians model
can be obtained by assuming a single variance parameter which is common to each
mixture component density, such that θ j = μ j and ψ = σ . The corresponding joint
distribution for x and z is
Γ (α)Γ (a + n/2) λ 2 ba
1
m
Γ (α j + n j )
p(x, z) = ,
Γ (α j )
n
n 1 a+ 2j
Γ (α + n)Γ (a)(2π) 2 (λ + n) 2 b + 21 ẍ − mj=1 ẋ 2j /(λ + n j ) j=1
n
where ẍ = i=1 xi2 .
11.1 Mixture Models 125
For a fixed number of mixture components m, the posterior distribution of the cluster
allocations z ∈ {1, . . . , m}n can be obtained up to proportionality from the joint
distribution (11.9),
p(z | x) ∝ p(x, z). (11.11)
Exercises 11.1 (Mixture of normals full conditionals) For the finite mixture of nor-
mal density model (11.10) with component-specific mean and variance parameters
assuming conjugate priors, state an equation, up to proportionality, for the full con-
ditional distribution p(z i | z−i , x) for i ∈ {1, . . . , n}.
A natural evolution from the potentially infinite mixture model considered in the
previous section is to consider infinite mixture models. In contrast to (11.1), suppose
∞
p(x) = w j f (x | θ j , ψ) (11.12)
j=1
xi | θi ∼ f (· | θi , ψ), i = 1, . . . , n,
θi ∼ G, i = 1, . . . , n,
G ∼ DP(α · P0 ),
For the infinite mixture model (11.12), the latent cluster allocation variables z =
(z 1 , . . . , z n ) could naturally assume an infinite range of values {1, 2, . . .} with no
upper bound. However, the labels assigned to clusters are arbitrary and at most n
clusters can be non-empty. Instead, z can be more useful defined by revealing the
samples sequentially according to the predictive distribution (9.9). Let z 1 = 1 and
denote the number of non-empty clusters formed. Combining the terms from (11.14),
the DPM induces a marginal prior distribution for z,
Γ (α)
m(z)
p(z) = αΓ (n j ). (11.15)
Γ (α + n) j=1
Remark 11.6 The sequential consideration of the samples used to derive (11.15)
does not contradict the exchangeability property from Proposition 11.1, since (11.15)
is symmetric in the indices of z.
128 11 Clustering and Latent Factor Models
Remark 11.7 The Dirichlet process mixture marginal distribution for z (11.15) is
actually very similar to the multinomial-Dirichlet prior (11.8). Although assuming
an infinite number of clusters is mathematically elegant, there is little practical dif-
ference between assuming infinitely many clusters and assuming an unbounded but
finite number of clusters; since when inferring cluster assignments z, the former
specification simply guarantees an infinite number of empty clusters.
Inference for the DPM is analogous to that for the multinomial Dirichlet model.
The joint distribution for (x, z) can be obtained analogously to (11.9),
⎧ ⎫
m(z)
⎨ ⎬
p(x, z) = p(z) p(x | z) ∝ αΓ (n j ) f (xi | θ j , ψ) dQ(θ j ) dQ(ψ),
j=1 ⎩ Θ i:z = j ⎭
i
which has closed-form expression for conjugate parametric models. Posterior infer-
ence for the distribution p(z | x) ∝ p(x, z) again requires Markov chain Monte Carlo
simulation techniques.
n
pi
p(x | p1 , . . . , pn ) = pi (xi, ),
i=1 =1
m
pi (xi, ) = wi, j f (xi, | θ j , ψ), (11.16)
j=1
W · 1m = 1n .
Remark 11.8 There are two points to note about the mixed-membership model
(11.16).
11.2 Mixed-Membership Models 129
X1 X2 X... Xn
w1 w2 w... wn
Definition 11.2 (Latent Dirichlet allocation) The Latent Dirichlet allocation (LDA)
model assumes the following m-component mixture model sampling procedure:
pi ∼ Poisson(ζ ),
xi, | z i, , θ ∼ Categorical|V | (θzi, ),
z i, | wi ∼ Categoricalm (wi ),
θ j ∼ Dirichlet|V | (α),
wi ∼ Dirichletm (γ ),
where the rows of both W and θ are all Dirichlet-distributed random vectors.
The LDA model in Definition 11.2 is often referred to as topic modelling. Under this
interpretation, the m components of the mixture distribution (11.16) represent latent
topics. Each of the m topics is characterised by a specific probability distribution on
the vocabulary of words, parameterised by θ j for topic j.
Similarly, each document is characterised by a specific mixture of the topics,
parameterised by the weights wi . The model assumes two levels of exchangeability:
first amongst documents and second amongst words within a document. The latter
assumption is often referred to as a bag-of-words model, as the ordering of words
within a document is deemed unimportant by the model.
11.2.1.2 Inference
The following Stan code (lda.stan) is based on the example for making infer-
ence on the LDA model provided in the User’s Guide of the Stan documenta-
tion,1 adapted to the notation of this text. Stan does not support ragged array
1 https://mc-stan.org/users/documentation.
11.2 Mixed-Membership Models 131
data formats, and so the documents are concatenated into a single vector x; con-
sequently, an additional variable doc is used to store the starting index of each
document within x. The two probability vectors wi and θ j use the convenient
variable constraint.
In Sect. 11.1, Dirichlet process mixture models (DPMs, Sect. 11.1.2) were presented
as an infinite-dimensional extension of finite mixture models (Sect. 11.1.1). Similarly,
the finite mixed-membership model (11.16) can also naturally extend to an infinite
mixture using a hierarchy of Dirichlet processes.
Pi ∼ DP(α · P), i = 1, . . . , n,
P ∼ DP(γ · P0 ).
Remark 11.10 Under the hierarchical Dirichlet process, each unknown measure Pi
has expected value P0 .
j−1
wi, j = wi, j (1 − wi, ),
=1
wi, j ∼ Beta(γβ j , α), i = 1, . . . , n; j = 1, 2, . . . ,
β1 , β2 , . . . ∼ GEM(α).
f (x | θ j , ψ) = θ j,x .
11.2 Mixed-Membership Models 133
11.2.2.2 Inference
Inference for HDPM has added complexity over LDA due to the unlimited number
of topics. However, open-source software implementations are available, such as the
Python package Gensim.2 This package uses online variational inference as described
in Wang et al. (2011).
xi = Λ · ηi + i , i = 1, . . . , n, (11.18)
where
The elements of the vector ηi ∈ Rk are referred to as the latent factors for sample
i. Typically, in dimension-reduction applications, the latent dimension k p. The
global parameter Λ is a ( p × k) matrix of factor loadings which project the latent
factors into the higher dimensional space R p . As ηi varies over Rk , Λ · ηi defines a
linear subspace of R p , but the observable variables xi lie just outside that subspace
due to the observation error i .
Remark 11.12 For each sample, the latent factors ηi ∈ Rk can be interpreted as the
unobserved measurements of k features which are believed to be linearly related to
the expected value of the response.
Since (11.18) is a linear model, assuming (11.19) implies the latent factors can
be marginalised out similarly to (8.11), yielding
2 https://radimrehurek.com/gensim/models/hdpmodel.html.
134 11 Clustering and Latent Factor Models
Remark 11.13 The marginal distribution (11.20) gives insight into the latent factor
model; the Gram matrix ΛΛ of the rows of the latent factor loadings Λ provides a
low-rank (k < p) additive contribution to the covariance matrix for each exchange-
able data row xi .
p(Λ) = p(ΛU )
To see how well the underlying latent structure is recovered, the posterior distri-
bution for Λ is compared with the true value used to generate X . This comparison
is not completely straightforward, since it was noted above that the model (11.20) is
136 11 Clustering and Latent Factor Models
yi = xi · β + z i · γ + i ,
i ∼ Normal(0, σ 2 ),
β ∼ Normal p (0, σ 2 V ),
γ ∼ Normalq (0, σ 2 U ),
Exercises 11.4 (Latent factor linear model code) Write Stan code to fit the model
from Exercise 11.3 with V = v I p and U = u Iq for known v, u > 0. Assume a
reference prior for the latent factor matrix Z .
Correction to: An Introduction
to Bayesian Inference, Methods
and Computation
Correction to:
N. Heard, An Introduction to Bayesian Inference, Methods
and Computation, https://doi.org/10.1007/978-3-030-82808-0
After publication of the book, the author noticed that the typeset of ‘p’ was incorrectly
processed in Chaps. 2 and 9 during production. Therefore corrections have been
incorporated in these two chapters: the typesetting of ‘P’ was corrected.
For each probability model below, x = (x1 , . . . , xn ) are n independent samples from a
likelihood distribution p(x|θ, ψ), for which there exists a conjugate prior distribution
p(θ ) for θ .
Each of the tables for discrete and continuous parametric models provides the
following details:
• Ranges for x and θ .
• Likelihood distribution p(x | θ, ψ) and the conjugate prior density p(θ ).
• Marginal likelihood p(x) and the posterior density p(θ | x), denoted π(θ ).
• Posterior predictive distribution p(x | x) for a new observation x.
A.1 Notation
n
n
ẋ = xi , ẍ = xi · xi , (A.1)
i=1 i=1
respectively, denote the sum and the sum of squared values in x. Let x(1) ≤ . . . ≤ x(n)
denote the order statistics of x. Finally, for discrete random variables on {1, . . . , k},
let
n
nj = 1{ j} (xi ) (A.2)
i=1
• ζ (a, b) = ∞ x=0 (x + b)
−a
is the Hurwitz zeta function.
• Δ(k)
k denotes the standard (or probability) simplex {u ∈ Rk : u i ≥ 0,
i=1 u i = 1}. k
• For the Dirichlet distribution, α ∈ {u ∈ Rk : u i ≥ 0, i=1 u i > 0}.
• For the normal and inverse Wishart equations, m ∈ Rk and the matrices V , ψ and
B are assumed positive definite.
Uniform(x | 0, θ) Pareto(θ | a, b)
x ∈ [0, ∞) θ ∈ (0, ∞)
1[0,θ] (x) aba 1[b,∞) (θ)
p(x | θ) = p(θ) =
θ θ a+1
aba (a + n)b∗a+n 1[b∗ ,∞) (θ)
p(x) = , b∗ = max{b, x(n) } π(θ) =
(a + n)b∗ a+n θ a+n+1
(a + n)b∗ a+n
p(x | x) = ≡ Pareto(θ | a + n, b∗ )
(a + n + 1) max{b∗ , x}a+n+1
Exponential(x | θ) Gamma(θ | a, b)
x ∈ [0, ∞) θ ∈ (0, ∞)
ba a−1 −bθ
p(x | θ) = θe−θ x p(θ) = θ e
Γ (a)
Γ (a + n) ba (b + ẋ)a+n a+n−1 −(b+ẋ)θ
p(x) = π(θ) = θ e
Γ (a) (b + ẋ)a+n Γ (a + n)
(a + n)(b + ẋ)a+n
p(x | x) = ≡ Gamma(θ | a + n, b + ẋ)
(b + ẋ + x)a+n+1
Gamma(x | ψ, θ) Gamma(θ | a, b)
x ∈ [0, ∞) θ ∈ (0, ∞)
θ ψ ψ−1 −θ x ba a−1 −bθ
p(x | θ) = x e p(θ) = θ e
Γ (ψ) Γ (a)
Γ (a + nψ) ba (b + ẋ)a+nψ a+nψ−1 −(b+ẋ)θ
p(x) = π(θ) = θ e
Γ (a) (b + ẋ)a+nψ Γ (a + nψ)
Γ (a + (n + 1)ψ) (b + ẋ)a+nψ
p(x | x) = ≡ Gamma(θ | a + nψ, b + ẋ)
Γ (a + nψ) (b + ẋ + x)a+(n+1)ψ
Normalk (x | θ, ψ) Normal(θ | m, V )
x ∈ Rk θ ∈ Rk
n
exp{− 21 i=1 (xi − θ) ψ −1 (xi − θ)} exp{− 21 (θ − m) V −1 (θ − m)}
p(x | θ) = p(θ) =
nk n k 1
(2π ) 2 |ψ| 2 (2π ) 2 |V | 2
1 n
|V∗ | 2 exp{− 21 m V −1 m − 21 i=1 xi ψ −1 xi } exp{− 21 (θ − m ∗ ) V∗−1 (θ − m ∗ )}
p(x) = π(θ) =
nk n 1 k 1
(2π ) 2 |ψ| 2 |V | 2 exp{− 21 m ∗ V∗−1 m ∗ } (2π ) 2 |V∗ | 2
m ∗ = V∗ (V −1 m + ψ −1 ẋ), V∗ = (V −1 + nψ −1 )−1 ≡ Normal(θ | m ∗ , V∗ )
p(x | x) = Normal(x | m ∗ , ψ + V∗ )
Normalk (x | 0, θ) Inverse Wishart(θ | a, B)
x ∈ Rk θ ∈ Rk×k , positive definite
n −1 a
exp{− 21 i=1 xi θ xi } |B| 2 exp{− 21 tr(Bθ −1 )}
p(x | θ) = p(θ) =
nk n ak a+k+1 k(k−1)
(2π ) 2 |θ| 2 2 2 |θ| 2 π 4 k a+1−
=1 Γ ( 2 )
a a+n
|B| 2
k
Γ ( a+1−
2 ) |B + ẍ| 2 exp{− 21 tr((B + ẍ)θ −1 )}
p(x) = π(θ) =
a+n a+n+1− ) (a+n)k a+n+k+1 k(k−1)
|B + ẍ| 2 =1 Γ ( 2 2 |θ| π 4 k a+n+1− )
=1 Γ (
2 2
2
a+n
|B + ẍ| 2
k
Γ ( a+n+1− )
p(x | x) = 2 ≡ Inverse Wishart(θ | a + n, B + ẍ)
a+n+1 a+n+2− )
|B + ẍ + x · x | 2 =1 Γ ( 2
Appendix B
Solutions to Exercises
Solution 1.1 Linear transformations of utilities. Let u(· ) be a utility function with
corresponding expected utility ū(· ), and consider a linear transformation
u (c) = α + β u(c),
If β > 0, then for two actions a, a ∈ A, ū (a) < ū (a ) ⇐⇒ ū(a) < ū(a ).
Solution 1.2 Bounded utility. Let a = {(Ω, c)} and a = {(Su(c) , c∗ ), (Su(c) , c∗ )}.
Since P(Ω) = 1, ū(a) = u(c). For the dichotomy a , ū(a ) = P(Su(c) )u(c∗ ) + (1 −
P(Su(c) ))u(c∗ ) = u(c).1 + (1 − u(c)).0 = u(c). Hence ū(a) = ū(a ) and therefore
a ∼ a.
Solution 1.3 Unbounded utility.
(i) If {Ω, c1 )} ∼ {(Sx , c), (Sx , c2 )}, then 0 = u(c1 ) = (1 − x) u(c) + x u(c2 ) =
(1 − x) u(c) + x.1 =⇒ u(c) = −x/(1 − x) < 0.
(ii) If {Ω, c2 )} ∼ {(Sx , c1 ), (Sx , c)}, then 1 = u(c2 ) = (1 − x) u(c1 ) + x u(c) =
(1 − x).0 + x u(c) =⇒ u(c) = 1/x > 1.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer 141
Nature Switzerland AG 2021
N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0
142 Appendix B: Solutions to Exercises
Differentiating the right-hand side with respect to ω̂ and setting equal to zero yields
ω̂ ∞
f (ω) dω + ω̂ f (ω̂) − ω̂ f (ω̂) = f (ω) dω − ω̂ f (ω̂) + ω̂ f (ω̂)
−∞ ω̂
ω̂ ∞
⇐⇒ f (ω) dω = f (ω) dω
−∞ ω̂
ω̂
1
⇐⇒ f (ω) dω = ,
−∞ 2
∞
since necessarily −∞ f (ω) dω = 1. Hence ω̂ is the median.
Solution 1.7 Squared loss (also known as L 2 loss).
For a univariate, continuous-valued ω ∈ R, the squared loss function gives
expected utility ∞
ū(dω̂ ) = − (ω̂ − ω)2 f (ω) dω.
−∞
=⇒ ω̂ = E(ω).
Solution 1.8 Zero-one loss (also known as L ∞ loss). For a univariate, continuous-
valued ω ∈ R, and for > 0 define the -ball zero-one loss function
As → 0 to obtain the zero-one loss function, the right-hand side tends to f (ω̂)
which is clearly maximised by the mode of f .
Appendix B: Solutions to Exercises 143
Therefore, when KL-divergence is used as a loss function for prediction, the smallest
expected loss (zero) is incurred when reporting genuine beliefs.
Solution 2.1 Finitely exchangeable binary sequences. Suppose X 1 , . . . , X n are
assumed to be independent and identically distributed Bernoulli( 21 ) random vari-
ables, and it is observed that X i = s. Conditional on this information, X 1 , . . . , X n
are still exchangeable with constant probability mass function
1{s} ( i xi )
p X 1 ,...,X n |
i X i =s
(x1 , . . . , xn ) = n .
s
However, for 0 < s < n, this constant mass function cannot be reconciled with a gen-
erative process (2.1) where a probability parameter θ is sampled from a probability
Bernoulli(θ ) trials X 1 , . . . , X n ; a
measure (Q), followed by a sample of n independent
degenerate value of θ ∈ {0, 1} would not admit X i = s, whilst any non-degenerate
value 0 < θ < 1 would admit positive probability to X i = s.
Solution 2.2 Predictive distribution for exchangeable binary sequences. The result
follows from substituting Theorem 2.1 into the conditional probability identity
P X 1 ,...,X n (x1 , . . . , xn )
P X m+1 ,...,X n |x1 ,...,xm (xm+1 , . . . , xn ) = .
P X 1 ,...,X m (x1 , . . . , xm )
n
log F(xi ; θ ) = ẋ log θ + (n − ẋ) log(1 − θ )
i=1
d ẋ n − ẋ
=⇒
log F(xi ; θ ) = −
dθ θ 1−θ
d2 ẋ n − ẋ
=⇒ In (θ ) = − 2 log F(xi ; θ ) = + . (B.1)
dθ θ2 (1 − θ )2
144 Appendix B: Solutions to Exercises
Setting the first derivative (B.1) equal to zero yields θ̂n = ẋ/n = x̄. Similarly for
Q(θ ) = Beta(θ | a, b), the prior mode and m 0 = (a − 1)/(a + b − 2) and
d2 a−1 b−1
I0 (θ) = − log dQ(θ) = +
dθ 2 θ2 (1 − θ)2
=⇒ Hn = I0 (m 0 ) + In (θ̂n )
(a − 1)(b − 1)x̄(1 − x̄)
=⇒ Hn−1 =
(a + b − 2)3 x̄(1 − x̄) + (a − 1)(b − 1)n
(a − 1)x̄{(a + b − 2)2 (1 − x̄) + (b − 1)n}
=⇒ m n = Hn−1 (I0 (m 0 )m 0 + In (θ̂n )θ̂n ) = .
(a + b − 2)3 x̄(1 − x̄) + (a − 1)(b − 1)n
Asymptotically, as n → ∞,
x̄(1 − x̄)
˙ Normal m n , Hn−1 → Normal x̄,
θ | x1 , . . . , xn ∼ .
n
parents(X 1 ) = ∅, children(X 1 ) = {X 2 , X 4 };
parents(X 2 ) = {X 1 }, children(X 2 ) = {X 4 };
parents(X 3 ) = {X 4 }, children(X 3 ) = ∅;
parents(X 4 ) = {X 1 , X 2 }, children(X 4 ) = {X 3 }.
PG (X 1 , X 2 , X 3 , X 4 ) = φ1 (X 1 , X 2 , X 4 )φ2 (X 3 , X 4 ).
when considered as a function of x . The components x j which affect this density are
those j for which Λ j = 0, which by construction are those j for which ( , j) ∈ E.
Hence
f (x | x− ) = f (x | neighboursG (x )).
Solution 4.1 Conjugacy of Bernoulli and beta distributions. Under the Bernoulli
likelihood model,
p(x | θ ) = θ ẋ (1 − θ )n−ẋ ,
n
where ẋ = i=1 xi . If θ ∼ Beta(a, b) then
By (4.2),
π(θ ) ∝ p(x | θ ) p(θ ) ∝ θ a+ẋ−1 (1 − θ )b+n−1 ,
Solution 4.2 Conjugacy of Poisson and gamma distributions. Under the Poisson
likelihood model,
p(x | θ ) ∝ θ ẋ e−nθ ,
n
where ẋ = i=1 xi . If θ ∼ Gamma(a, b) then
By (4.2),
π(θ ) ∝ p(x | θ ) p(θ ) ∝ θ a+ẋ−1 e−(b+n)θ ,
1[b,∞) (θ )
p(θ ) ∝ .
θ a+1
By (4.2),
By (4.2),
π(θ ) ∝ p(x | θ ) p(θ ) ∝ θ a+n−1 e−(b+ẋ)θ ,
and
∞ ∞
ba ba
π(θ2 ) = θ1a e−(b+θ2 )θ1 dθ1 = x a e−x dx
Γ (a) 0 Γ (a)(b + θ2 )a+1 0
Γ (a + 1)ba a ba
= = .
Γ (a)(b + θ2 )a+1 (b + θ2 )a+1
∗
and e−λθ = (1 − α)/2 ⇐⇒ θ ∗ = − log{(1 − α)/2}/λ.
Hence a 100α% credible interval for θ is
1
M
P̂π (θ ∈ A) = 1 A (θ (i) ).
M i=1
Solution 5.2 Monte Carlo estimate of a conditional expectation. Using the identity,
M
provided i=1 1 A (θ (i) ) > 0 (meaning there are samples lying in A).
Solution 5.3 Monte Carlo credible interval. From Exercise 5.1, it follows that
1
3
Êπ { (θ̂,θ)} = − exp{−(θ̂ − θ (i) )2 /10},
3 i=1
−0.3
Êπ { (θ̂,θ)}
−0.4
−0.5
0 2 4 6 8 10 12
θ̂
The following Python code then identifies the minimising value numerically using
the SciPy1 library function .
1 https://www.scipy.org.
Appendix B: Solutions to Exercises 149
where
φ(θi − μ)
w(θi ) =
φ(θi − μ) + φ(θi + μ)
= (1 + e−2θi μ )−1 .
By symmetry,
(iii) As μ increases, the target density becomes bimodal and the mixture weight
w(θi ) → 0 if θi is negative, and w(θi ) → 1 if θi is positive, and therefore θ
becomes stuck near either (−μ, −μ) or (μ, μ).
150 Appendix B: Solutions to Exercises
Starting the chain from (0, 0), the left-hand plot shows good mixing when μ = 1,
whereas in right-hand plot when μ = 3 there is only one transition between the two
modes during 100 iterations.
Solution 5.8 Detailed balance of Metropolis-Hastings algorithm. If θ = θ , then
by symmetry (5.12) trivially holds. If θ = θ , then (5.16) simplifies to p(θ | θ ) =
α(θ, θ )q(θ | θ ) and it remains to show
If π(θ ) = 0 then from (5.15), α(θ, θ ) = 0 and the equality holds. So now suppose
π(θ ) > 0,
π(θ )q(θ | θ )
π(θ )q(θ | θ )α(θ, θ ) = π(θ )q(θ | θ ) min 1,
π(θ )q(θ | θ )
= min{π(θ )q(θ | θ ), π(θ )q(θ | θ )}
π(θ )q(θ | θ )
= π(θ )q(θ | θ ) min 1,
π(θ )q(θ | θ )
= π(θ )q(θ | θ )α(θ , θ ).
Appendix B: Solutions to Exercises 151
Solution 5.9 Gibbs sampling as Metropolis-Hastings special case. Then the ratio of
posterior densities when θ− j = θ− j is
which cancels with the ratio of proposal densities in the Metropolis-Hastings accep-
tance probability (5.15), and hence α(θ, θ ) = 1 and all such Metropolis-Hastings
proposals are accepted with probability 1.
Solution 5.10 Metropolis-Hastings implementation.
In comparison with Gibbs sampling, there are fewer than 100 unique samples in
each case.
152 Appendix B: Solutions to Exercises
Solution 5.11 ELBO equivalence. Since log p(x, θ ) = log π(θ ) + log p(x),
q(θ)
KL(q(θ) π(θ)) = q(θ) log dθ
π(θ)
Θ
= q(θ) log q(θ) dθ − q(θ) log p(x, θ) dθ + q(θ) log p(x) dθ
Θ Θ Θ
= Eq log q(θ) − Eq log p(x, θ) + log p(x)
= − ELBO(q) + log p(x).
The log p(x) term does not depend on the density q, and so minimising this expression
corresponds to maximising ELBO(q).
Solution 5.12 ELBO identity. Since log p(x, θ ) = log p(x | θ ) + log p(θ ),
Solution 5.13 CAVI derivation. Using the identity p(x, θ ) = π(θ ) p(x),
ELBO(q) = Eq log π(θ ) − Eq log q(θ ).
ELBO(q) = Eq− j log π(θ− j ) + Eq j Eq− j log π(θ j | θ− j ) − Eq− j log q− j (θ− j ) − Eq j log q j (θ j ).
Maximising ELBO(q) with respect to q j is equivalent to maximising
(ii) The variances of the component densities q j are fixed, and the algorithm will
converge when each mean value m j = μ j .
(iii) The following Python code implements coordinate ascent variational inference
for the bivariate normal distribution with correlation .95. The printed output
gives the fitted mean and variance values. The contour plot mirrors Fig. 5.2(a).
154 Appendix B: Solutions to Exercises
Similarly,
exp[− 21 { ÿ − (σ −2 + n)−1 ẏ 2 }]
p1 (y) = n 1 .
(2π ) 2 (nσ 2 + 1) 2
Under model M0
p0 (x, y)
B01 (x, y) =
p1 (x) p1 (y)
(nσ 2 + 1)
= 1
exp{− 21 (σ −2 + n)−1 (ẋ 2 + ẏ 2 ) + 21 (σ −2 + 2n)−1 (ẋ + ẏ)2 }.
(2nσ 2 + 1) 2
For large σ ,
1
B01 (x, y) ≈ σ ( n2 ) 2 exp{− 2n
1
(ẋ 2 + ẏ 2 ) + 1
4n
(ẋ + ẏ)2 }
1
= σ ( n2 ) 2 exp{− 4n
1
(ẋ − ẏ)2 }.
x̄ + ȳ
M0 : θ̂ X = θ̂Y = ;
2
M1 : θ̂ X = x̄, θ̂Y = ȳ,
n n
where x̄ = 1
n i=1 xi , ȳ = 1
n i=1 yi . Then for model M1 ,
Under model M0 ,
1
n
x̄ + ȳ 2 x̄ + ȳ 2
log p0 (x, y | θ̂ X = θ̂Y ) = −n log(2π) − xi − + yi −
2 2 2
i=1
n
x̄ + ȳ 2 x̄ + ȳ 2
=⇒ BIC0 = 2n log(2π) + xi + + yi − + log(2n).
2 2
i=1
−(a+ 2p )
Γ (a + 2p ) β β
p(β) = p 1+ .
(2π b) 2 Γ (a) 2b
Regarded as a function of β, this density takes the same form as the density
function of t2a (0, 2a+2bp−1 I p ), except that β can only take non-negative values.
This expression does not require a matrix inversion, unlike the evaluation of the
matrix Vn (8.10) required for the general case.
Solution 8.6 Zellner’s g-prior. If V = g(X X )−1 , then
Γ (a + n/2) ba
p(y | X ) = n p n .
(2π ) Γ (a) (1 + g) (b + 21 y y −
2 2
g
2(1+g)
y X (X X )−1 X y)a+ 2
Solution 9.1 Dirichlet process marginal likelihood. nLet x = (x1 , . . . , xk ) be the
k ≤ n unique values which occur in x, and let n j = i=1 1{x j } (xi ) be the number of
k
k
p(x | P) = p(x j )n j = P(B j )n j
j=1 j=1
and
Γ (α)
k
p(P(B1 ), . . . , P(Bk+1 )) = k P(B j )α P0 (B j )
j=1 Γ {α P0 (B j )} j=1
Γ (α)
k
= k
P(B j )α P0 (x j )
j=1 Γ {α P0 (x j )} j=1
Γ (α)
k
=⇒ p(x, P(B1 ), . . . , P(Bk+1 )) = k
P(B j )α P0 (x j )+n j .
j=1 Γ {α p0 (x j )} j=1
(B.2)
For larger values of α, the sampled mass functions more closely resemble the base
geometric distribution.
Solution 9.3 Binary partition index For x ∈ R, the index of x at any level m is
obtained by calculating the m-digit binary representation of F0 (x). This is achieved
by the following Python code:
For larger values of α, the sampled densities more closely resemble the standard
normal base measure density.
Solution 10.1 Gaussian process closure. For any x = (x1 , . . . , xn ), independently
f (x) − m(x) ∼ Normaln (0, K (x, x)) and m(x) ∼ Normaln (m 0 (x), K 0 (x, x)),
where K 0 (x, x) is the corresponding covariance matrix (10.3) for the kernel k0 .
Since the sum of two independent normal distributions is again normal,
It therefore follows that this regression function can be written a Gaussian process
f ∼ GP(0, k) where the covariance kernel k is the dot product
k(x, x ) = σ 2 λ−1 x · x .
Solution 10.3 Spline regression as a Gaussian process. It follows from Exercise 10.2
that f ∼ GP(0, v · b(· , · )) where
d
m
b(x, x ) = 1 + (x x ) j + {(x − τ j )+ (x − τ j )+ }d .
j=1 j=1
m
b(x, x ) = 1[τ j ,τ j+1 ) (x) · 1[τ j ,τ j+1 ) (x )
j=0
defines an indicator function determining whether x and x lie in the same τ -segment.
Solution 10.5 CART notation and partition.
(i) T = {(1, 2, a), (2, 1, b), (3, 3, c), (6, 1, d)}.
(ii) π = {B1 , . . . , B5 } where
B1 = (−∞, b] × (−∞, a] × R
B2 = (b, ∞) × (−∞, a] × R
B3 = (−∞, d] × (a, ∞) × (−∞, c]
B4 = (d, ∞) × (a, ∞) × (−∞, c]
B5 = R × (a, ∞) × (c, ∞).
Solution 11.1 Mixture of normals full conditionals. Let n j,−i = i =i 1{ j} (z i ) be
the number
of data points aside from xi currently allocated to cluster j. Similarly let
ẋ j,−i = i =i;zi = j xi and ẍ j,−i = i =i;zi = j xi2 . Then for j = 1, . . . , m,
1
(α j + n j,−i )Γ (a + (n j,−i + 1)/2) (λ + n j,−i ) 2
p(z i = j | z−i , x) ∝ 1
Γ (a + n j,−i /2) (λ + n j,−i + 1) 2
% &a+ n j,−i2 +1
b + 21 {ẍ j,−i + xi2 − (ẋ j,−i + xi )2 /(n j + 1 + λ)}
× ' (a+ n j,−i .
2
b + 21 {ẍ j,−i − ẋ 2j,−i /(n j + λ)}
Appendix B: Solutions to Exercises 161
Solution 11.2 Gibbs sampling mixture of normals. The following code is one pos-
sible Python implementation. The hyperparameters are chosen to be α = 0.1, a =
0.1, b = 0.1, λ = 1.
Solution 11.4 Latent factor linear model code. From Exercise 11.4 with V = v I p
and U = u Iq ,
P probability
E expectation
V variance
:= definition
∝ proportional to
→ converges to
∼ distributed as
=⇒ implies
⇐⇒ equivalent to
⊥
⊥ independent
· dot product
R real numbers
N natural numbers, starting at zero
A set complement, {x|x ∈ / A}
1 A (x) indicator, 1 if x ∈ A, 0 otherwise
In n × n identity matrix
B transpose of matrix B
|B| Determinant of matrix B
||v|| Euclidean norm of vector v
|x| Absolute value of real value x
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer 163
Nature Switzerland AG 2021
N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0
References
Amaral Turkman MA, Paulino CD, Müller P (2019) Computational Bayesian statistics: an intro-
duction. Institute of Mathematical Statistics Textbooks. Cambridge University Press, Cambridge
Barber D (2012) Bayesian reasoning and machine learning. Cambridge University Press, Cambridge
Berger JO, Guglielmi A (2001) Bayesian and conditional frequentist testing of a parametric model
versus nonparametric alternatives. J Am Stat Assoc 96(453):174–184
Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley Series in Probability & Statistics. Wiley
Betancourt M (2017) A conceptual introduction to Hamiltonian Monte Carlo. arXiv: Methodology
Bhattacharya A, Dunson DB (2011) Sparse Bayesian infinite factor models. Biometrika, pp 291–306
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J
Am Stat Assoc 112(518):859–877
Chipman HA, George EI, McCulloch RE (1998) Bayesian CART model search. J Am Stat Assoc
93(443):935–948
Chipman HA, George EI, McCulloch RE (2010) Bart: Bayesian additive regression trees. Ann Appl
Stat 4(1):266–298
de Finetti B (2014) Theory of probability: a critical introductory treatment. Wiley Series in Proba-
bility and Statistics, Wiley
Denison DGT, Mallick BK, Smith AFM (1998) A Bayesian CART algorithm. Biometrika
85(2):363–377
Denison DGT, Holmes CC, Mallick BK, Smith AFM (2002) Bayesian methods for nonlinear clas-
sification and regression. Wiley Series in Probability and Statistics, Wiley
Doucet A, de Freitas N, Gordon N (2001) An introduction to sequential Monte Carlo methods.
Springer, New York, pp 3–14
Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1(2):209–230
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation.
J Am Stat Assoc 97(458):611–631
Gelman A, Hennig C (2017) Beyond subjective and objective in statistics. J R Stat Soc Ser A
(Statistics in Society) 180(4):967–1033
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2013) Bayesian data analysis, 3rd
edn. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis
Green PJ (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82(4):711–732
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer 165
Nature Switzerland AG 2021
N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0
166 References
Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models: a rough-
ness penalty approach. Chapman and Hall, United Kingdom
Heard NA (2020) Naheard/changepoints: Bayesian changepoint modelling code. https://doi.org/
10.5281/zenodo.4158489
Hoffman MD, Gelman A (2014) The No-U-Turn sampler: adaptively setting path lengths in Hamil-
tonian Monte Carlo. J Mach Learn Res 15(47):1593–1623
Holmes CC, Denison DGT, Ray S, Mallick BK (2005) Bayesian prediction via partitioning. J
Comput Graph Stat 14(4):811–830
Jeffreys H (1961) Theory of probability, 3rd edn. Englan, Oxford
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
Kucukelbir A, Tran D, Ranganath R, Gelman A, Blei DM (2017) Automatic differentiation varia-
tional inference. J Mach Learn Res 18(14):1–45
Lau JW, Green PJ (2007) Bayesian model-based clustering procedures. J Comput Graph Stat
16(3):526–558
Lavine M (1992) Some aspects of Polya tree distributions for statistical modelling. Ann Stat
20(3):1222–1235
Leonard T (1973) A Bayesian method for histograms. Biometrika 60(2):297–308
Mauldin RD, Sudderth WD, Williams SC (1992) Polya trees and random distributions. Ann Stat
20(3):1203–1221
Minka TP (2001) Expectation propagation for approximate Bayesian inference. In: Proceedings of
the seventeenth conference on uncertainty in artificial intelligence, UAI’01. Morgan Kaufmann
Publishers Inc., pp 362–369
Neal RM (2011) Mcmc using hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo.
Chapman and Hall/CRC, pp 113–162
Oh M-S, Raftery AE (2007) Model-based clustering with dissimilarities: a Bayesian approach. J
Comput Graph Stat 16(3):559–585
Rasmussen CE, Williams CKI (2005) Gaussian processes for machine learning (Adaptive Compu-
tation and Machine Learning). The MIT Press
Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of
components (with discussion). J R Stat Soc: Ser B (Statistical Methodology) 59(4):731–792
Roberts GO, Rosenthal JS (2004) General state space Markov chains and MCMC algorithms.
Probab Surv 1:20–71
Roberts GO, Gelman A, Gilks WR (1997) Weak convergence and optimal scaling of random walk
Metropolis algorithms. Ann Appl Probab 7(1):110–120
Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models
by using integrated nested Laplace approximations. J R Stat Soc: Ser B (Statistical Methodology)
71(2):319–392
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sinica 4(2):639–650
Teh YW (2010) Dirichlet processes. In: Encyclopedia of Machine Learning. Springer, Berlin
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc
101(476):1566–1581
Tierney L, Kadane JB (1986) Accurate approximations for posterior moments and marginal densi-
ties. J Am Stat Assoc 81(393):82–86
Wang C, Paisley J, Blei D (2011) Online variational inference for the hierarchical Dirichlet process.
In: Proceedings of the fourteenth international conference on artificial intelligence and statistics,
vol 15. JMLR Workshop and Conference Proceedings, pp 752–760
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g prior
distributions. Elsevier, New York, pp 233–243
Index
A Connected, 26
Action, 3 d-connected, 26
randomised, 11 components, 26
Adjacency matrix, 24 Consequence, 3
Aperiodic, 45 Consistency, 20
Asymptotic normality, 21, 52 Continuous random measure, 101
Coordinate ascent, 58
variational inference, 58
B Covariance function, 108
Bag-of-words, 130 Covariates, 79
Base measure, 94 Credible region, 38
Basis function, 86
Bayes factor, 70
interpretation, 70 D
Bayes’ theorem, 8 De Finetti
Bayesian additive regression tree, 120 representation theorem, 16, 31, 33, 36,
Bayesian histogram, 102 67, 94, 121
marginal likelihood, 103 Decision problem, 3
Bayesian information criterion, 72 Decision space, 12
Belief network, 26, 80, 129 Design matrix, 86
Binary sequences, 98 Detailed balance, 45
Binary tree, 98 Digamma function, 98
Dirichlet process, 94
conjugacy, 96
C marginal likelihood, 97
Changepoint model, 117 mixture, 126
regimes, 118 predictive distribution, 97
Chinese restaurant table distribution, 98
Classification and regression tree, 119
Clique, 25 E
Clustering, 122, 125, 127 Edward, 65
Coherence, 1, 11 Entropy, 57
Collider node, 26 Errors, 79, 109
Conditional probability, 7 Estimation, 12
Conjugacy, 34, 123 Event, 2
Conjugate prior, 34, 137 standard, 5
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer 167
Nature Switzerland AG 2021
N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0
168 Index
F L
Factor graph, 29 Laplace approximation, 53
Fisher information matrix, 21 integrated nested, 54
Frequentist probability, 7 Laplace method, 53
Full conditional distribution, 46 Latent Dirichlet allocation, 129
Fundamental events, 3 Latent factor models, 133
Latent Gaussian model, 54
Lebesgue measure, 104
G Lindley’s paradox, 71, 84
Gaussian process, 108, 114 Linear model, 79, 80
conjugacy, 109 conjugate prior, 81
inference, 110
reference prior, 85
marginal likelihood, 109
Link function, 88
Generalised linear models, 87
Logistic regression, 90
Gibbs sampling, 46
Loss, 13
g-prior, 85
absolute, 13
Gram matrix, 134
expected, 13
Graph, 23
function, 12, 122
directed, 23
squared, 13, 39
directed acyclic, 25
zero-one, 13, 71
undirected, 23
Griffiths-Engen-McCloskey distribution, 95
M
H Marginal likelihood, 35, 44, 68, 83, 97, 109
Hierarchical Dirichlet process, 131 Markov chain Monte Carlo, 44
mixture, 132 Hamiltonian, 50
Hierarchical model, 30, 121 reversible jump, 60, 103
Hyperparameter, 19, 54 Markov network, 27, 28
Hyperprior, 19 Markov random field, 28
Hypothesis testing, 71 Gaussian, 29, 54
Matrix inversion lemma, 83
Mean-field variational
I family, 57
Identifiability, 20 inference, 58
Importance sampling Metropolis-Hastings algorithm, 48
estimation, 43 Mixed-membership model, 128
standard error, 43 Mixing, 50, 64
Intractable integral, 39 Mixture model, 19, 121
Irreducible, 45 finite, 122
infinite, 126
Mixture of Gaussians, 124
K Model averaging, 68
Kernel Model selection, 67, 69
Index 169
S
O
Semi-orthogonal matrix, 134
Odds ratio
Separated, 25
posterior, 70
d-separated, 26
prior, 70
Spline regression, 87, 113
Outcome, 3
marginal likelihood, 113
Splitting probabilities, 99, 100
Stan, 62, 74, 89, 91, 111, 130, 134
P
PyStan, 63, 76, 89, 91, 112, 134
Parametric model, 33
Standard events, 5
Parametric regression, 79
Stationary distribution, 45
Partition model, 102
Stick-breaking process, 95, 127, 132
regression, 116
Poisson regression, 88 Student grades example, 61
Polya tree, 98, 99 Subjective probability, 5
conjugacy, 100 Subjectivism, 1
marginal likelihood, 101
Polynomial regression, 87
Posterior T
distribution, 17, 18 Topic modelling, 130, 132
information matrix, 21 Transition density, 44
matrix, 53
mode, 21, 53
Posterior predictive U
p-value, 73 Utility, 8
checking, 65, 73 expected, 9, 12
distribution, 73 function, 8
Prediction, 13
Predictors, 79
Preference, 3 V
Prior Variational
distribution, 15, 18 family, 55
elicitation, 18 inference, 55, 60, 133