
Nick Heard

An Introduction to Bayesian Inference, Methods and Computation

Imperial College London
London, UK

ISBN 978-3-030-82807-3    ISBN 978-3-030-82808-0 (eBook)
https://doi.org/10.1007/978-3-030-82808-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2021, corrected publication 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The aim of writing this text was to provide a fast, accessible introduction to Bayesian
statistical inference. The content is directed at postgraduate students with a back-
ground in a numerate discipline, including some experience in basic probability
theory and statistical estimation. The text accompanies a module of the same name,
Bayesian Methods and Computation, which forms part of the online Master of
Machine Learning and Data Science degree programme at Imperial College London.
Starting from an introduction to the fundamentals of subjective probability, the
course quickly advances to modelling principles, computational approaches and then
advanced modelling techniques. Whilst this rapid development necessitates a light
treatment of some advanced theoretical concepts, the benefit is to fast track the reader
to an exciting wealth of modelling possibilities whilst still providing a key grounding
in the fundamental principles.
To make possible this rapid transition from basic principles to advanced modelling,
the text makes extensive use of the probabilistic programming language Stan, which
is the product of a worldwide initiative to make Bayesian inference on user-defined
statistical models more accessible. Stan is written in C++, meaning it is computa-
tionally fast and can be run in parallel, but the interface is modular and simple. The
future of applied Bayesian inference arguably relies on the broadening development
of such software platforms.
Chapter 1 introduces the core ideas of Bayesian reasoning: decision-making under
uncertainty, specifying subjective probabilities and utility functions, and identifying
optimal decisions as those which maximise expected utility.
tion, the two core tasks in statistical inference, are shown to be special cases of this
broader decision-making framework. The application-driven reader may choose to
skip this chapter, although philosophically it sets the foundation for everything that
follows.
Chapter 2 presents representation theorems which justify the prior × likelihood
formulation synonymous with Bayesian probability models. Simply believing that
unknown variables are exchangeable, meaning probability beliefs are invariant to
relabelling of the variables, is sufficient to guarantee that this construction must hold.
The prior distribution distinguishes Bayesian inference from frequentist statistical


methods, and several approaches to specifying prior distributions are discussed. The
prior and likelihood construction leads naturally to consideration of the posterior
distribution, including useful results on asymptotic consistency and normality which
suggest large sample robustness to the choice of prior distribution.
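As a foretaste of the prior-to-posterior updating and large-sample robustness described above, consider a conjugate Beta-Bernoulli sketch (an illustrative aside, not an example from the book; the priors and data are made up): two analysts with quite different Beta priors report nearly identical posterior means once the sample is large.

```python
# Beta(a, b) prior on a Bernoulli success probability theta: observing
# s successes in n trials gives the posterior Beta(a + s, b + n - s),
# a closed-form instance of the prior x likelihood construction.
def posterior_mean(a, b, s, n):
    """Posterior mean of theta under a Beta(a, b) prior."""
    return (a + s) / (a + b + n)

# A flat Beta(1, 1) prior versus an opinionated Beta(10, 2) prior.
small_n = posterior_mean(1, 1, 7, 10), posterior_mean(10, 2, 7, 10)
large_n = posterior_mean(1, 1, 700, 1000), posterior_mean(10, 2, 700, 1000)
```

With 10 observations the two posterior means disagree noticeably; with 1,000 they agree to about two decimal places, echoing the asymptotic results cited above.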
Chapter 3 shows how graphical models can be used to specify dependencies in
probability distributions. Graphical representations are most useful when the depen-
dency structure is a primary target of inferential interest. Different types of graphical
model are introduced, including belief networks and Markov networks, highlighting
that the same graph structure can have different interpretations for different models.
Chapter 4 discusses parametric statistical models. Attention is focused on conju-
gate models, which present the most mathematically convenient parametric approx-
imations of true, but possibly hard to specify, underlying beliefs. Although these
models might appear relatively simplistic, later chapters will show how these basic
models can form the basis of very flexible modelling frameworks.
Chapter 5 introduces the computational techniques which revolutionised the appli-
cation of Bayesian statistical modelling, enabling routine performance of infer-
ential tasks which had previously appeared infeasible. Relatively simple Markov
chain Monte Carlo methods were at the heart of this development, and these are
explained in some detail. A higher level description of Hamiltonian Monte Carlo
methods is also provided, since these methods are becoming increasingly popular
for performing simulation-based computational inference more efficiently. For high-
dimensional inference problems, some useful analytic approximations are presented
which sacrifice the theoretical accuracy of Monte Carlo methods for computational
speed.
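The flavour of the Markov chain Monte Carlo methods mentioned here can be conveyed in a few lines (an illustrative random-walk Metropolis sampler, not code from the book; the target is a standard normal density known only up to a constant):

```python
import math
import random

def metropolis(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose x' ~ Normal(x, step^2) and accept
    with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal            # accept the proposal
        samples.append(x)           # on rejection the chain repeats x
    return samples

# Unnormalised log density of a standard normal target.
draws = metropolis(lambda x: -0.5 * x * x, x0=3.0, n_samples=20000)
```

The sample mean and variance of `draws` approximate the target's moments; the normalising constant is never needed, which is exactly why these methods made routine Bayesian computation feasible.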
Chapter 6 discusses probabilistic programming languages specifically designed
for easing some of the complexities of implementing Bayesian inferential methods.
Particular attention is given to Stan, which has experienced rapid growth in deploy-
ment. Stan automates parallel Hamiltonian Monte Carlo sampling for statistical infer-
ence on any suitably specified Bayesian model on a continuous parameter space. In
the subsequent chapters which introduce more advanced statistical models, Stan is
used for demonstration wherever possible.
Chapter 7 is concerned with model checking. There are no expectations for subjec-
tive probability models to be correct, but it can still be useful to consider how well
observed data appear to fit with an assumed model before making any further predic-
tions using the same model assumptions; it may make sense to reconsider alter-
natives. Posterior predictive checking provides one framework for model checking
in the Bayesian framework, and its application is easily demonstrated in Stan. For
comparing rival models, Bayes factors are shown to be a well-calibrated statistic for
quantifying evidence in favour of one model or the other, providing a vital Bayesian
analogue to Neyman-Pearson likelihood ratios.
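The Bayes factor idea can be previewed with a hypothetical coin-tossing comparison (illustrative only, not an example from the book): M1 fixes θ = 1/2, while M2 places a uniform prior on θ, under which the marginal likelihood of observing s heads in n tosses is exactly 1/(n + 1).

```python
from math import comb

def bayes_factor_fair_vs_uniform(s, n):
    """BF_12 = p(data | M1) / p(data | M2) for s heads in n coin tosses,
    with M1: theta = 1/2 and M2: theta ~ Uniform(0, 1)."""
    p_m1 = comb(n, s) * 0.5 ** n    # Binomial likelihood at theta = 1/2
    p_m2 = 1.0 / (n + 1)            # Beta-Binomial marginal likelihood
    return p_m1 / p_m2
```

Five heads in ten tosses gives a Bayes factor above 1 (evidence for the fair coin); nine heads in ten gives a factor well below 1.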
Chapter 8 presents the Bayesian linear model as the cornerstone of regression
modelling. Extensions from the standard linear model to other basis functions such
as polynomial and spline regression highlight the flexibility of this fundamental
model structure. Further extensions to generalised linear models, such as logistic
and Poisson regression, are demonstrated through implementation in Stan.

Chapter 9 characterises nonparametric models as more flexible parametric models
with a potentially infinite number of parameters, distributing probability mass across
larger function spaces. Dirichlet process and Polya tree models are presented as
respective nonparametric models for discrete and continuous random probability
measures. Partition models such as Bayesian histograms are also included in this
class of models.
Chapter 10 covers nonparametric regression. Particular attention is given to Gaus-
sian processes, which can be regarded as generalisations of the Bayes linear model.
Spline models and partition models are also re-examined in this context.
Chapter 11 combines clustering and latent factor models. Both classes of model
assume a latent underlying structure, which is either discrete or continuous, respec-
tively. Finite and infinite mixture models are considered for clustering data into
homogeneous groupings. Topic modelling of text and other unstructured data is
considered as both a finite and infinite mixture problem. Finally, continuous latent
factor models are presented as an extension of linear regression modelling, through
the inclusion of unobserved covariates. Again, example Stan code is used to illustrate
this class of models.
Throughout the text, there are exercises which should form an important compo-
nent of following this course. Exercises which require access to a computer are
indicated with a symbol; these become increasingly prevalent as the chapters
progress, reflecting the transition within the text from laying fundamental principles
to applied practice.

London, UK
June 2021

Nick Heard

The original version of the book was revised. Copyright page text has been updated. The correction
to the book is available at https://doi.org/10.1007/978-3-030-82808-0_12.
Contents

1 Uncertainty and Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 Subjective Uncertainty and Possibilities . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Subjectivism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Subjective Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Possible Outcomes and Events . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Decisions: Actions, Outcomes, Consequences . . . . . . . . . . . . . . . . 3
1.2.1 Elements of a Decision Problem . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Preferences on Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Subjective Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Standard Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Equivalent Standard Events . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Definition of Subjective Probability . . . . . . . . . . . . . . . . . 6
1.3.4 Contrast with Frequentist Probability . . . . . . . . . . . . . . . . 7
1.3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.6 Updating Beliefs: Bayes Theorem . . . . . . . . . . . . . . . . . . . 8
1.4 Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Principle of Maximising Expected Utility . . . . . . . . . . . . 9
1.4.2 Utilities for Bounded Decision Problems . . . . . . . . . . . . . 10
1.4.3 Utilities for Unbounded Decision Problems . . . . . . . . . . . 10
1.4.4 Randomised Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.5 Conditional Probability as a Consequence
of Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Estimation and Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Continuous Random Variables and Decision
Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.2 Estimation and Loss Functions . . . . . . . . . . . . . . . . . . . . . . 12
1.5.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Prior and Likelihood Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Exchangeability and Infinite Exchangeability . . . . . . . . . . . . . . . . . 15
2.2 De Finetti’s Representation Theorem . . . . . . . . . . . . . . . . . . . . . . . . 16


2.3 Prior, Likelihood and Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.3.1 Prior Elicitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Non-informative Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Hyperpriors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Mixture Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Bayesian Paradigm for Prior to Posterior Reporting . . . . 20
2.3.6 Asymptotic Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.7 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Graphical Modelling and Hierarchical Models . . . . . . . . . . . . . . . . . . . 23
3.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Specifying a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Neighbourhoods of Graph Nodes . . . . . . . . . . . . . . . . . . . 24
3.1.3 Paths, Cycles and Directed Acyclic Graphs . . . . . . . . . . . 25
3.1.4 Cliques and Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Parametric Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Conjugate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Non-conjugate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Posterior Summaries for Parametric Models . . . . . . . . . . . . . . . . . . 37
4.5.1 Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5.2 Credible Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Computational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 Intractable Integrals in Bayesian Inference . . . . . . . . . . . . . . . . . . . 39
5.2 Monte Carlo Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Standard Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.2 Estimation Under a Loss Function . . . . . . . . . . . . . . . . . . . 41
5.2.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.4 Normalising Constant Estimation . . . . . . . . . . . . . . . . . . . 43
5.3 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 Technical Requirements of Markov Chains
in MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.3 Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . 48
5.4 Hamiltonian Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . 50
5.5 Analytic Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5.1 Normal Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5.2 Laplace Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.5.3 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


5.6 Further Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Bayesian Software Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1 Illustrative Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Stan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 PyStan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Other Software Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3.1 PyMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3.2 Edward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7 Criticism and Model Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1 Model Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.1 Selecting From a Set of Models . . . . . . . . . . . . . . . . . . . . . 69
7.3.2 Pairwise Comparisons: Bayes Factors . . . . . . . . . . . . . . . . 70
7.3.3 Bayesian Information Criterion . . . . . . . . . . . . . . . . . . . . . 72
7.4 Posterior Predictive Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.4.1 Posterior Predictive p-Values . . . . . . . . . . . . . . . . . . . . . . . 73
7.4.2 Monte Carlo Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.4.3 PPC with Stan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.1 Parametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2 Bayes Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.2.1 Conjugate Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.2.2 Reference Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.3 Generalisation of the Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3.1 General Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.4 Generalised Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4.1 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.4.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9 Nonparametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.1 Random Probability Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
9.2 Dirichlet Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
9.2.1 Discrete Base Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.3 Polya Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.3.1 Continuous Random Measures . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Partition Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.4.1 Partition Models: Bayesian Histograms . . . . . . . . . . . . . . 102
9.4.2 Bayesian Histograms with Equal Bin Widths . . . . . . . . . 104

10 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


10.1 Nonparametric Regression Modelling . . . . . . . . . . . . . . . . . . . . . . . 107
10.2 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
10.2.1 Normal Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
10.3 Spline Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.3.1 Spline Regression with Equally Spaced Knots . . . . . . . . 114
10.4 Partition Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.4.1 Changepoint Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.4.2 Classification and Regression Trees . . . . . . . . . . . . . . . . . 119
11 Clustering and Latent Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.1 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.1.1 Finite Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
11.1.2 Dirichlet Process Mixture Models . . . . . . . . . . . . . . . . . . . 126
11.2 Mixed-Membership Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
11.2.1 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . 129
11.2.2 Hierarchical Dirichlet Processes . . . . . . . . . . . . . . . . . . . . 131
11.3 Latent Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
11.3.1 Stan Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Correction to: An Introduction to Bayesian Inference, Methods
and Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C1

Appendix A: Conjugate Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . 137


Appendix B: Solutions to Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Chapter 1
Uncertainty and Decisions

1.1 Subjective Uncertainty and Possibilities

1.1.1 Subjectivism

In the seminal work of de Finetti (see the English translation of de Finetti 2017),
the central idea for the Bayesian paradigm is to address decision-making in the
face of uncertainty from a subjective viewpoint. Given the same set of uncertain
circumstances, two decision-makers could differ in the following ways:
• How desirable different potential outcomes might seem to them.
• How likely they consider the various outcomes to be.
• How they feel their actions might affect the eventual outcome.
The Bayesian decision-making paradigm is most easily viewed through the lens of
an individual making choices (“decisions”) in the face of (personal) uncertainty. For
this reason, certain illustrative elements of this section will be purposefully written
in the first person.
This decision-theoretic view of the Bayesian paradigm represents a mathematical
ideal of how a coherent non-self-contradictory individual should aspire to behave.
This is a non-trivial requirement, made easier with various mathematical formalisms
which will be introduced in the modelling sections of this text. Whilst these for-
malisms might not exactly match my beliefs for specific decision problems, the aim
is to present sufficiently many classes of models that one of them might adequately
reflect my opinions up to some acceptable level of approximation.
Coherence is also the most that will be expected from a decision-maker; there
will be no requirement for me to choose in any sense the right decisions from any
perspective other than my own at that time. Everything within the paradigm is sub-
jective, even apparently absolute concepts such as truth. Statements of certainty such
as “The true value of the parameter is x” should be considered shorthand for “It is my
understanding that the true value of the parameter is x”. This might seem pedantic,
but crucially allows contradictions between individuals, and between perspectives
and reality: the decision-making machinery will still function.

1.1.2 Subjective Uncertainty

There are numerous sources of individual uncertainty which can complicate decision-
making. These could include:
• Events which have not yet happened, but might happen some time in the future
• Events which have happened which I have not yet learnt about
• Facts which may yet be undiscovered, such as the truth of some mathematical
conjecture
• Facts which may have been discovered elsewhere, but remain unknown to me
• Facts which I have partially or completely forgotten
In the Bayesian paradigm, these and other sources of uncertainty are treated equally.
If there are matters on which I am unsure, then these uncertainties must be acknowl-
edged and incorporated into a rational decision process. Whether or not I perhaps
should know them is immaterial.

1.1.3 Possible Outcomes and Events

Suppose I, the decision-maker, am interested in a currently unknown outcome ω,
and believe that it will eventually assume a single realised value from an exhaustive
set of possibilities Ω. When considering uncertain outcomes, the assumed set of
possibilities will also be chosen subjectively, as illustrated in the following example.

Example 1.1 If rolling a die, I might understandably assume that the outcome
will be in Ω = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}. Alternatively, I could take a more conserva-
tive viewpoint and extend the space of outcomes to include some unintended or
potentially unforeseen outcomes; for example, Ω = {Dice roll does not take place,
No valid outcome, ⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.
Neither viewpoint in Example 1.1 could irrefutably be said to be right or wrong.
But if I am making a decision which I consider to be affected by the future outcome of
the intended dice roll, I would possibly adopt different positions according to which
set of possible outcomes I chose to focus on. The only requirement for Ω is that it
should contain every outcome I currently conceive to be possible and meaningful to
the decision problem under consideration.

Definition 1.1 (Event) An event is a subset of the possible outcomes. An event


E ⊆ Ω is said to occur if and only if the realised outcome ω ∈ E.
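These notions translate directly into code. A minimal sketch (an illustrative aside; integer labels stand in for the die faces of Example 1.1):

```python
# Two subjective choices of outcome space for the die roll.
OMEGA = {1, 2, 3, 4, 5, 6}
OMEGA_WIDE = OMEGA | {"roll does not take place", "no valid outcome"}

def occurs(event, omega):
    """An event E occurs if and only if the realised outcome lies in E."""
    return omega in event

E_even = {2, 4, 6}   # the event "an even number is rolled"
```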

1.2 Decisions: Actions, Outcomes, Consequences

1.2.1 Elements of a Decision Problem

Definition 1.2 (Decision problem) Following Bernardo and Smith (1994), a decision
problem will be composed of three elements:
1. An action a, to be chosen from a set A of possible actions.
2. An uncertain outcome ω, thought to lie within a set Ω of envisaged possible
outcomes.
3. An identifiable consequence, assumed to lie within a set C of possible conse-
quences, resulting from the combination of both the action taken and the ensuing
outcome which occurs.
Axiom 1 C will be totally ordered, meaning there exists an ordering relation ≤C
on C such that for any pair of consequences c1, c2 ∈ C, necessarily c1 ≤C c2 or
c2 ≤C c1.
If both c1 ≤C c2 and c2 ≤C c1 , then we write c1 =C c2 . This provides definitions
of (subjective) preference and indifference between consequences.
Remark 1.1 Crucially, the ordering ≤C is assumed to be subjective; my perceived
ordering of the different consequences must be allowed to differ from that of other
decision-makers.
Definition 1.3 (Preferences on consequences) Suppose c1, c2 ∈ C. If c1 ≤C c2 and
c1 ≠C c2, then c2 is said to be a preferable consequence to c1, written c1 <C c2. If
c1 =C c2, then I am indifferent between the two consequences.
Definition 1.4 (Action) An action defines a function which maps outcomes to con-
sequences. For simplicity of presentation, until Section 1.5.1 the actions in A will
be assumed to be discrete, meaning that each can be represented by a generic form
a = {(E1, c1), (E2, c2), . . .}, where c1, c2, . . . ∈ C, and E1, E2, . . . are referred to as
fundamental events which form a partition of Ω, meaning Ω = ∪i Ei and Ei ∩ Ej = ∅
for i ≠ j. Then, for example, if I take action a, then I anticipate that any outcome
ω ∈ E1 would lead to consequence c1, and so on.
Remark 1.2 When actions are identified, in this way, by the perceived consequences
they will lead to under different outcomes, they are subjective.
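Definition 1.4 can be sketched concretely (an illustrative aside, with a made-up bet as the action; integer labels again stand in for die faces):

```python
def is_partition(events, omega_set):
    """Check that the fundamental events are pairwise disjoint and
    together exhaust the outcome space."""
    union = set()
    for e in events:
        if union & e:       # overlap: events are not disjoint
            return False
        union |= e
    return union == omega_set

def consequence(action, omega):
    """Return the consequence the action leads to under outcome omega."""
    for event, c in action:
        if omega in event:
            return c

OMEGA = {1, 2, 3, 4, 5, 6}
# A hypothetical bet: gain 10 on an even roll, lose 5 otherwise.
bet = [({2, 4, 6}, "+10"), ({1, 3, 5}, "-5")]
```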

1.2.2 Preferences on Actions

Rational decision-making requires well-founded preferences between possible
actions. Let a, a′ ∈ A be two possible actions, which for illustration could be written
as

a = {(E1, c1), (E2, c2), . . .},
a′ = {(E′1, c′1), (E′2, c′2), . . .}.

The overall desirability of each action will depend entirely on the uncertainty sur-
rounding the fundamental events E1, E2, . . . and E′1, E′2, . . . and the desirability of
the corresponding consequences c1, c2, . . . and c′1, c′2, . . .. This can be exploited in
two ways, which will be developed in later sections:
1. If I innately prefer action a to a′, then this preference can be used to quantify my
beliefs about the uncertainty surrounding the fundamental events characterising
each action. This will form the basis for eliciting subjective probabilities (see
Sect. 1.3).
2. Reversing the same argument, once I have elicited my probabilities for certain
events then these can be used to obtain preferences between corresponding actions
through the principle of maximising expected utility (see Sect. 1.4.1).
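Point 2 can be previewed with a small numerical sketch (utilities are defined formally in Sect. 1.4; the events, consequences and numbers below are hypothetical): given probabilities for the fundamental events and utilities for the consequences, actions are ranked by expected utility.

```python
def expected_utility(action, prob, utility):
    """Expected utility of an action represented as (event, consequence)
    pairs, given event probabilities and consequence utilities."""
    return sum(prob[event] * utility[c] for event, c in action)

# Two actions for an illustrative umbrella decision.
a1 = [("rain", "wet"), ("no rain", "dry")]           # leave umbrella home
a2 = [("rain", "dry"), ("no rain", "encumbered")]    # carry umbrella
prob = {"rain": 0.3, "no rain": 0.7}
utility = {"wet": 0.0, "dry": 1.0, "encumbered": 0.8}

best = max([a1, a2], key=lambda a: expected_utility(a, prob, utility))
```

Here expected_utility(a1, ...) = 0.7 and expected_utility(a2, ...) ≈ 0.86, so carrying the umbrella is the preferred action under these subjective inputs.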

Definition 1.5 (Preferences on actions) For actions a, a′ ∈ A, a subjective decision-
maker regarding a not to be a preferable action to a′ is written a ≤ a′. For actions
a, a′ ∈ A, if both a ≤ a′ and a′ ≤ a, then a and a′ are said to be equivalent actions,
written a ∼ a′.

Axiom 2 Preferences on actions must be compatible with preferences on conse-
quences. Let E, F be events such that ∅ ⊆ E ⊆ F ⊆ Ω, and let c1, c2 ∈ C be such
that c1 ≤C c2. Then the following preference on actions must hold:

{(F, c1), (F̄, c2)} ≤ {(E, c1), (Ē, c2)}.

Remark 1.3 The two actions {(F, c1), (F̄, c2)} and {(E, c1), (Ē, c2)} only differ in
the consequences anticipated from any ω ∈ Ē ∩ F; that is, the event Ē ∩ F would
lead to a consequence of c1 under the first action and c2 under the second.

Remark 1.4 By Axiom 2, for ∅ ⊆ E ⊆ Ω and c1, c2 ∈ C, if c1 ≤C c2 then

{(Ω, c1)} ≤ {(E, c1), (Ē, c2)} ≤ {(Ω, c2)}.

That is, if consequence c2 is preferable to consequence c1, then I should prefer a
strategy which guarantees a consequence c2 against carrying any risk of exposure to
consequence c1 through the occurrence of event E. Similarly, rather than guarantee-
ing the lesser consequence c1, I should prefer a strategy whereby the occurrence of
event Ē will improve the consequence to c2.

1.3 Subjective Probability

1.3.1 Standard Events

Central to the definition given by Bernardo and Smith (1994) for subjective probability is the abstract concept of a continuously indexed family of standard events, denoted
Sx for x ∈ [0, 1]. These standard events are constructed in relation to a hypotheti-
cal, abstract experiment, such that under the classical perspective of probability one
would assign probability x to the standard event Sx occurring, for 0 ≤ x ≤ 1.
As an illustrative example, consider the hypothetical spinning wheel depicted in
Fig. 1.1. This wheel is assumed to have unit circumference and to be plain in colour
apart from a shaded sector of arc length x ∈ [0, 1], creating an angle of 2π x radians
from a horizontal axis. A fixed needle is mounted above the wheel as shown. It could
be imagined that the wheel is to be spun (perhaps vigorously) from some arbitrary
starting orientation; when the wheel has come to rest, one observes whether the fixed
needle is lying within the shaded area of arc length x.
For each x ∈ [0, 1], define the corresponding standard event

Sx = {Needle lies in the shaded area of arc length x}.

Fig. 1.1 A spinning wheel with unit circumference and a fixed needle to depict a class of standard events Sx indexed by the arc length parameter x ∈ [0, 1]

Classical probability would assign probability

(arc length x) / (circumference) = x / 1 = x
to the event Sx . Later, these standard events will be used to form the basis of a
definition of subjective probability for quantifying individual uncertainty. Briefly, an
individual will assign probability x to an event E ⊆ Ω if they would be indifferent
between receiving a reward if E occurs or alternatively receiving the same reward if
Sx occurs.
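This hypothetical experiment is easy to mimic computationally. The Python sketch below is purely illustrative (the resting position of the wheel is simply modelled as uniform on the unit circumference); it checks by simulation that the needle settles over the shaded arc with relative frequency approximately x.

```python
import random

def spin_wheel_event(x, n_spins=100_000, seed=0):
    """Estimate the classical probability of the standard event S_x: after a
    vigorous spin, the needle rests over a shaded arc of length x on a wheel
    of unit circumference, with the resting point modelled as uniform."""
    rng = random.Random(seed)
    hits = sum(rng.random() < x for _ in range(n_spins))
    return hits / n_spins

for x in (0.1, 0.5, 0.9):
    print(x, spin_wheel_event(x))
```

For each x, the estimated relative frequency agrees with the classical assignment P(Sx) = x up to Monte Carlo error.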

1.3.2 Equivalent Standard Events

Recall the standard events {Sx | 0 ≤ x ≤ 1}, introduced in Sect. 1.3.1.


Axioms 3 If E ⊆ Ω, there exists a unique standard event Sx, x ∈ [0, 1], such that for any c1, c2 ∈ C such that c1 <C c2,

{(E, c1), (Ē, c2)} ∼ {(Sx, c1), (S̄x, c2)}.

Remark 1.5 Axiom 3 uses the continuity in x of the collection of standard events
{Sx : x ∈ [0, 1]}. It states that each event E can be mapped to a unique number
x ∈ [0, 1] through equivalence between E and the standard event Sx when imagined
as alternative opportunities to improve consequences (from c1 to c2 ). This provides
the definition of subjective probability.

Axioms 4 Let c1, c, c2 ∈ C such that c1 ≤C c ≤C c2. Then there exists a standard event Sx with x ∈ [0, 1] satisfying

{(Ω, c)} ∼ {(S̄x, c1), (Sx, c2)}. (1.1)

Remark 1.6 For c1 ≤C c ≤C c2, clearly {(Ω, c1), (∅, c2)} ≤ {(Ω, c)} ≤ {(∅, c1), (Ω, c2)}. Using the continuity in x of the standard events {Sx : x ∈ [0, 1]}, it is reasonable to assume that between ∅ and Ω there should exist an event satisfying the equivalence (1.1).

1.3.3 Definition of Subjective Probability

Definition 1.6 (Subjective probability) For E ⊆ Ω, let Sx be the standard event satisfying {(E, c1), (Ē, c2)} ∼ {(Sx, c1), (S̄x, c2)} (Axiom 3). Then define the probability of E to be the classical probability of Sx, written P(E) = x.

Remark 1.7 Subjective probabilities can be elicited in two stages: first, a continuous family of hypothetical standard events is constructed by the decision-maker, such that for each x ∈ [0, 1] there is a corresponding standard event Sx with classical probability x. Second, a probability P(E) ∈ [0, 1] is assigned to an uncertain event E ⊆ Ω of interest by identifying equal preference between the two dichotomies {(E, c1), (Ē, c2)} and {(SP(E), c1), (S̄P(E), c2)}.

Remark 1.8 In some circumstances, the subjective assessment of the range of possi-
ble outcomes and the probabilities of events within that range may vary according to
which action is being considered; for example, the decision problem may be choosing to roll either one or two dice, with corresponding consequences resulting from
the outcome. This presents no contradiction to the above definition, but all subjec-
tive probabilities should be regarded as conditional probabilities which implicitly
condition on a particular action.

For further reading, see Sect. 5.3 of Gelman and Hennig (2017) for a discussion of
subjective Bayesian reasoning within an interesting, wider discussion on objectivity
and subjectivity in science.

1.3.4 Contrast with Frequentist Probability

It is worth noting the contrast of Definition 1.6 with frequentist probability. Under
the frequentist interpretation, there exists a single probability of event E occurring,
equal to the long run relative frequency at which E would occur in a potentially
unlimited number of repetitions of the uncertain outcome.
Whilst these two interpretations of probability are fundamentally opposed, the two
could easily coincide when subjective probabilities are determined by an individual
using frequentist reasoning to arrive upon their own subjective beliefs.

1.3.5 Conditional Probability

Having started from an initial state of information, a decision-maker may need to update preferences and beliefs when additional information becomes available, encapsulated by the occurrence of some event G ⊂ Ω. Such considerations require a notational extension for denoting consequently revised preferences on actions.

Definition 1.7 (Conditional preferences on actions) For actions a, a′ ∈ A, conditional preferences and equivalences assuming an event G has occurred will be denoted a ≤|G a′ and a ∼|G a′, respectively.

Using this notion of conditionally equivalent actions, Axiom 3 on equivalent standard events can be suitably extended.

Axioms 5 If E, G ⊆ Ω, there exists a unique standard event Sx, x ∈ [0, 1], such that for any c1, c2 ∈ C such that c1 <C c2,

{(E, c1), (Ē, c2)} ∼|G {(Sx, c1), (S̄x, c2)}.

Remark 1.9 This axiom says that once we condition on an event G occurring, for
any other event E we can still find an equivalent standard event.

Definition 1.8 (Subjective conditional probability) For E, G ⊆ Ω, the conditional probability of E given G, written P|G(E | G), is the index x of the standard event Sx satisfying {(E, c1), (Ē, c2)} ∼|G {(Sx, c1), (S̄x, c2)}.

Proposition 1.1 For events E, G ⊆ Ω such that P(G) > 0, the conditional proba-
bility of E given the assumed occurrence of G must necessarily be

P|G(E | G) := P(E ∩ G) / P(G). (1.2)

Proof See Sect. 1.4.5.

1.3.6 Updating Beliefs: Bayes’ Theorem

The updating Eq. (1.2) provides the unique recipe for how beliefs must be updated
when additional information becomes available, and this can be further refined in the
following theorem.

Theorem 1.1 (Bayes’ theorem) For events E, G ⊆ Ω such that P(G) > 0,

P|G(E | G) = P|E(G | E) P(E) / P(G).

Proof From Proposition 1.1, P(E ∩ G) = P(G) P|G(E | G) and by symmetry, it must also hold that P(E ∩ G) = P(E) P|E(G | E). Hence P(G) P|G(E | G) = P(E) P|E(G | E), and the result follows on dividing both sides by P(G).
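A small numerical illustration of Theorem 1.1 in Python, using hypothetical values for P(E) and the two conditional probabilities of G (the numbers carry no special meaning):

```python
# All numerical values here are hypothetical, purely for illustration.
p_E = 0.3             # P(E)
p_G_given_E = 0.5     # P(G | E)
p_G_given_notE = 0.2  # P(G | not E)

# Law of total probability for P(G), then Bayes' theorem for P(E | G).
p_G = p_G_given_E * p_E + p_G_given_notE * (1 - p_E)
p_E_given_G = p_G_given_E * p_E / p_G

print(round(p_G, 4), round(p_E_given_G, 4))
```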

1.4 Utility

Definition 1.9 (Utility function) A utility function is a subjective, order-preserving mapping u : C → R such that c1 ≤C c2 ⇐⇒ u(c1) ≤ u(c2).

Remark 1.10 A utility function assigns a subjective numerical value to each of the
possible consequences.

Since each action-outcome pair (a, ω) in a decision problem leads to a consequence in C, a utility function can equivalently be defined as a function u : A × Ω → R, with

u(a, ω) ≡ u(c)

for the corresponding consequence c for that action-outcome pair.

1.4.1 Principle of Maximising Expected Utility

In complex decision problems with uncertain outcomes, an additional principle on how to combine uncertainty with utilities is required to identify optimal decisions. This can be illustrated by a simple example.
Example 1.2 Consider two actions

a1 = {(Ē, c0), (E, c1)},
a2 = {(F̄, c0), (F, c2)},

for consequences c0 <C c1 <C c2 and events E, F ⊂ Ω with 0 < P(F) < P(E) < 1.
Without a method to trade off between utility and uncertainty, there would be no basis on which to prefer either action. Action a2 offers the opportunity of a superior consequence to a1, but that superior consequence arrives with lower probability.
This leads to the following axiom for preferences being determined by expected
utility.

Definition 1.10 (Expected utility of a deterministic action) For a probability measure P and utility function u, the expected utility ū(a) of an action a = {(E1, c1), (E2, c2), . . .} ∈ A is defined to be

ū(a) := Σᵢ P(Ei) u(ci).

Axioms 6 For two actions a, a′ ∈ A,

a ≤ a′ ⇐⇒ ū(a) ≤ ū(a′),

implying one action will be preferable to another if and only if it has higher expected utility.

Exercise 1.1 (Linear transformations of utilities) Show that decision problems are
unaffected by positive-gradient linear transformations to the utility function.

Example 1.3 Continuing Example 1.2, but now assuming a utility function, the expected utilities of the two actions are

ū(a1) = {1 − P(E)} u(c0) + P(E) u(c1),
ū(a2) = {1 − P(F)} u(c0) + P(F) u(c2).

The action a1 is preferable to a2 if and only if

P(E) / P(F) > {u(c2) − u(c0)} / {u(c1) − u(c0)}.
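The comparison in Example 1.3 can be verified numerically. In the following Python sketch, the utilities and probabilities are hypothetical choices; the direct comparison of expected utilities agrees with the equivalent threshold condition on the ratio P(E)/P(F).

```python
# Hypothetical utilities (c0 < c1 < c2) and event probabilities (P(F) < P(E)).
u_c0, u_c1, u_c2 = 0.0, 0.6, 1.0
p_E, p_F = 0.5, 0.25

# Expected utilities of a1 = {(not E, c0), (E, c1)} and a2 = {(not F, c0), (F, c2)}.
eu_a1 = (1 - p_E) * u_c0 + p_E * u_c1
eu_a2 = (1 - p_F) * u_c0 + p_F * u_c2

# Threshold condition: a1 is preferred iff P(E)/P(F) > {u(c2)-u(c0)}/{u(c1)-u(c0)}.
prefer_a1 = p_E / p_F > (u_c2 - u_c0) / (u_c1 - u_c0)

print(eu_a1, eu_a2, prefer_a1)
```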

The following two sections on bounded and unbounded decision problems together demonstrate that Axioms 4 and 6 ensure that the form of the utility function will be uniquely determined by the total ordering of C, up to any positive-gradient linear rescaling (cf. Exercise 1.1).

1.4.2 Utilities for Bounded Decision Problems

Definition 1.11 (Bounded decision problem) A decision problem is said to be bounded if there exist worst and best consequences c_∗, c^∗ ∈ C such that ∀c ∈ C, c_∗ ≤C c ≤C c^∗.

If the decision problem is bounded, then for simplicity and without loss of generality it can be assumed that u(c_∗) = 0, u(c^∗) = 1.
Then for any c ∈ C, the order-preserving requirement of a utility function determines that u(c) ∈ [0, 1] is the index of the standard event Su(c) such that {(Ω, c)} ∼ {(S̄u(c), c_∗), (Su(c), c^∗)} (Axiom 4).

Exercise 1.2 (Bounded utility) Show that if u(c_∗) = 0, u(c^∗) = 1 and c_∗ ≤C c ≤C c^∗, then necessarily by Axiom 6, {(Ω, c)} ∼ {(S̄u(c), c_∗), (Su(c), c^∗)}.

1.4.3 Utilities for Unbounded Decision Problems

If the decision problem is not bounded, then for some c1 <C c2 (perhaps after linear rescaling), it could be assumed without loss of generality that u(c1) = 0, u(c2) = 1. Again, Axiom 4 and the order-preserving requirement then determine the rest of the utility function; specifically, for c ∈ C:
1. If c1 ≤C c ≤C c2, {(Ω, c)} ∼ {(Su(c), c2), (S̄u(c), c1)}.
2. If c <C c1, then u(c) < 0 and if {(Ω, c1)} ∼ {(S̄x, c), (Sx, c2)}, then u(c) = −x/(1 − x).
3. If c2 <C c, then u(c) > 1 and if {(Ω, c2)} ∼ {(S̄x, c1), (Sx, c)}, then u(c) = 1/x.

Exercise 1.3 (Unbounded utility) Suppose u(c1) = 0, u(c2) = 1. Show that by Axiom 6, the following must hold:
(i) If c <C c1 and {(Ω, c1)} ∼ {(S̄x, c), (Sx, c2)}, then u(c) = −x/(1 − x) < 0.
(ii) If c2 <C c and {(Ω, c2)} ∼ {(S̄x, c1), (Sx, c)}, then u(c) = 1/x > 1.

Exercise 1.4 (Transitivity of preference) Show that for a, a′, a′′ ∈ A, if a ≤ a′ and a′ ≤ a′′, then a ≤ a′′.

Exercise 1.5 (Coherence with probabilities) For events E, F ⊆ Ω and consequences c1 ≤C c2, show that if P(E) ≤ P(F) then {(F, c1), (F̄, c2)} ≤ {(E, c1), (Ē, c2)}.

1.4.4 Randomised Strategies

Definition 1.12 (Randomised action) Let G1, G2, . . . be a partition of Ω. For each partition event Gi, suppose there is a corresponding action aGi = {(Ei1, ci1), (Ei2, ci2), . . .} which is determined to be taken if and only if Gi occurs. Denote this randomised action a = {(G1, aG1), (G2, aG2), . . .}.

Remark 1.11 Randomised actions are a useful extension of the deterministic actions considered until now. Although perhaps counter-intuitive, in many circumstances they can be shown to correspond to optimal or near-optimal behaviours.

Definition 1.13 (Expected utility of a randomised action) The expected utility of a randomised action a = {(G1, aG1), (G2, aG2), . . .} is

ū(a) := Σᵢ P(Gi) ū|Gi(aGi | Gi),

where ū|Gi(aGi | Gi) := Σⱼ P|Gi(Eij | Gi) u(cij) is the conditional expected utility of action aGi given the occurrence of event Gi.

Remark 1.12 Definition 1.13 simply says that the expected utility of a randomised
action is the expectation of the conditional expected utilities of the individual actions.

1.4.5 Conditional Probability as a Consequence of Coherence

By considering randomised actions, it can now be shown that the equation for condi-
tional probability (1.2) is necessary when specifying subjective probabilities if those
probabilities are to yield coherent expected utilities, and therefore coherent decisions.
Consider a randomised action a = {(G, aG), (Ḡ, aḠ)} such that aG = {(E, c^∗), (Ē, c_∗)} and aḠ = {(Ω, c_∗)}, where u(c_∗) = 0 and u(c^∗) = 1. Then by Definition 1.13, a has expected utility

ū(a) = P(G) ū|G(aG | G) + P(Ḡ) ū|Ḡ(aḠ | Ḡ)
= P(G)[P|G(E | G) u(c^∗) + P|G(Ē | G) u(c_∗)] + P(Ḡ) P|Ḡ(Ω | Ḡ) u(c_∗)
= P(G) P|G(E | G).

But equivalently, a could be rewritten as a deterministic action, a = {(E ∩ G, c^∗), (Ω \ (E ∩ G), c_∗)}. Then from Definition 1.10, it must also hold that

ū(a) = P(E ∩ G) u(c^∗) + P(Ω \ (E ∩ G)) u(c_∗) = P(E ∩ G).

Hence for coherence in expected utility, P(E ∩ G) = P(G) P|G(E | G).

1.5 Estimation and Prediction

1.5.1 Continuous Random Variables and Decision Spaces

As noted in Definition 1.4, the initial notation used for actions has presumed dis-
creteness, with a countable partition of Ω leading to countably many consequences
and associated utilities. This section will consider cases where Ω and the space of
actions might be uncountable.
Definition 1.14 (Decision space) A decision space (or continuous action space) is
a set of mappings D = {d : Ω → C } such that the consequence of taking a decision
d ∈ D and observing outcome ω is d(ω) ∈ C .
Definition 1.15 (Expected utility of a decision) For a utility function u : C → R, the expected utility of a decision d : Ω → C is the usual expectation

ū(d) = ∫Ω u(d(ω)) dP(ω).

Remark 1.13 If my probability distribution P on Ω admits a density function representation f such that P(E) = ∫E f(ω) dω for all events E ⊆ Ω, then

ū(d) = ∫Ω u(d(ω)) f(ω) dω.

1.5.2 Estimation and Loss Functions

Consider the special case of the decision problem which is to estimate the future
realised value of the unknown outcome ω ∈ Ω. In the typical notation of statisti-
cal estimation, a decision constitutes providing an estimated value ω̂. The eventual

performance of that estimate is evaluated by a loss function ℓ(·, ·), where ℓ(ω̂, ω) quantifies an assumed penalty incurred by estimating the outcome with ω̂ when the true value transpires to be ω.
In this presentation of decision problems:
• The loss function value ℓ(ω̂, ω) is the real-valued consequence of estimating ω with ω̂.
• The utility of the above consequence is simply the negative loss, u(ℓ(ω̂, ω)) = −ℓ(ω̂, ω).
The decision of estimating ω by ω̂ could therefore be denoted

dω̂ = ℓ(ω̂, ·)

such that for ω ∈ Ω, dω̂(ω) = ℓ(ω̂, ω), and the expected utility is the negative expected loss,

ū(dω̂) = −∫Ω ℓ(ω̂, ω) f(ω) dω.

Exercise 1.6 (Absolute loss; also known as L1 loss) If ℓ(ω̂, ω) = |ω̂ − ω|, show that the Bayes optimal decision is to estimate ω by the median of P.

Exercise 1.7 (Squared loss; also known as L2 loss) If ℓ(ω̂, ω) = (ω̂ − ω)², show that the Bayes optimal decision is to estimate ω by the mean of P.

Exercise 1.8 (Zero-one loss; also known as L∞ loss) If ℓ(ω̂, ω) = 1 − 1{ω̂}(ω), show that the Bayes optimal decision is to estimate ω by the mode of P.
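Exercises 1.6–1.8 can be checked empirically for a simple discrete distribution by minimising the expected loss over a grid of candidate estimates; the distribution of equally weighted atoms below is hypothetical.

```python
# Hypothetical discrete distribution: equally weighted atoms for the outcome.
outcomes = [1, 2, 2, 3, 3, 3, 14]
grid = [x / 10 for x in range(0, 141)]  # candidate estimates

def expected_loss(loss, estimate):
    return sum(loss(estimate, w) for w in outcomes) / len(outcomes)

best_abs = min(grid, key=lambda e: expected_loss(lambda a, b: abs(a - b), e))   # median
best_sq = min(grid, key=lambda e: expected_loss(lambda a, b: (a - b) ** 2, e))  # mean
best_01 = max(set(outcomes), key=outcomes.count)                                # mode

print(best_abs, best_sq, best_01)  # 3.0 4.0 3
```

The minimisers recover the median (3), the mean (4) and the mode (3) of the distribution, as the exercises assert.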

1.5.3 Prediction

In the preceding sections it would have been natural to envisage ω as a scalar number,
such as the outcome from rolling a die. However, this need not be the case. Bayesian
prediction is the task of estimating an entire probability distribution, rather than a
scalar; correspondingly, in this case ω is an unknown probability distribution on a
space X and Ω is a space of probability distributions on X which I believe contains
ω.
As discussed throughout this chapter, in a Bayesian setting I will have my own
beliefs about ω, characterised by my own subjective probability distribution P(E)
for my probability that ω lies in a subset of probability distribution space E ⊆ Ω.
To avoid self-contradiction, for coherent prediction it should be a requirement
that my optimal decision when estimating ω should be to state my own beliefs. That
is, the optimal decision dω̂ should satisfy, for events F ⊆ X ,

ω̂(F) = ∫Ω ω(F) dP(ω), (1.3)

where the right-hand side is my marginal probability for the event F, obtained as an
expectation of the probability of F, ω(F), with respect to my uncertainty about ω
encapsulated by P.
Satisfying (1.3) clearly places constraints on which loss functions can coherently be used for prediction. In fact, it can be shown (Bernardo and Smith 1994, Section 2.7) that the only proper loss functions for coherent prediction have a canonical form based on the well-known Kullback-Leibler divergence from information theory, which measures the difference of one probability distribution from another.

Definition 1.16 (Kullback-Leibler divergence) For two probability distributions P, Q where P is absolutely continuous with respect to Q, the Kullback-Leibler divergence (or simply, the KL-divergence) of Q from P is

KL(P ∥ Q) := ∫ log(dP/dQ) dP = EP[log(dP/dQ)].

If p, q are corresponding density functions satisfying p(x) > 0 =⇒ q(x) > 0, then

KL(p ∥ q) := ∫ p(x) log{p(x)/q(x)} dx = Ep[log{p(x)/q(x)}].

Using this definition of KL-divergence, the necessary form for a proper loss function is

ℓ(ω̂, ω) = KL(ω ∥ ω̂) = ∫X log(dω/dω̂) dω. (1.4)

This justifies the use of KL-divergence for measuring discrepancy between two
probability distributions from a Bayesian perspective.

Exercise 1.9 (KL-divergence non-negative) For two probability density functions p, q, show that KL(p ∥ q) ≥ 0 with equality when p = q, and therefore KL-divergence is a proper loss function for prediction.
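A minimal computational sketch of Definition 1.16 for discrete distributions, with a check of Exercise 1.9 on a hypothetical pair p, q:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors,
    assuming q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # strictly positive, since p differs from q
print(kl_divergence(p, p))  # exactly zero
```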
Chapter 2
Prior and Likelihood Representation

The first chapter introduced the philosophy of Bayesian statistics: when making
individual decisions in the face of uncertainty, probability should be treated as a
subjective measure of beliefs, where all quantities unknown to the individual should
be treated as random quantities.
Eliciting individual probability assessments is a non-trivial endeavour. Even if
I have a relatively well-formed opinion about some uncertain quantity, coherently
assigning precise numerical values (probabilities) to all potential outcomes of inter-
est for that quantity can be particularly challenging when there are infinitely many
possible outcomes.
To counter these difficulties, it can be helpful to consider mathematical models to
represent an individual’s beliefs. There is no presumption that these models should
be somehow correct in terms of representing true underlying dynamics; nonetheless,
they can provide structure for representing beliefs coherently to a good enough degree
of approximation to enable valid decision-making.
The main simplification which will be considered, exchangeability, occurs in contexts where a sequence of random quantities is to be observed and a joint probability distribution for the sequence is required. Symmetries in one’s beliefs about sequences
lead to familiar specifications of probability models which are often considered to be
the hallmark of Bayesian thinking: a likelihood distribution combined with a prior
distribution.

2.1 Exchangeability and Infinite Exchangeability

Let X1, X2, . . . be a sequence of real-valued random variables to be observed, which are mappings of an underlying unknown outcome ω ∈ Ω with probability distribution P.

The original version of this chapter has been revised due to typographic errors. The corrections to
this chapter can be found at https://doi.org/10.1007/978-3-030-82808-0_12
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, 15
corrected publication 2022
N. Heard, An Introduction to Bayesian Inference,
Methods and Computation, https://doi.org/10.1007/978-3-030-82808-0_2

Definition 2.1 (Exchangeability) For n ≥ 1, the finite sequence X1, . . . , Xn is said to be exchangeable if, for any permutation σ on n symbols, their induced probability distribution satisfies

PX1,...,Xn = PXσ(1),...,Xσ(n).

Definition 2.2 (Infinite exchangeability) An infinite sequence X1, X2, . . . is said to be infinitely exchangeable if, for all n ≥ 1 and all choices of n indices 1 ≤ i1 < . . . < in < ∞, the subsequence Xi1, . . . , Xin is exchangeable.

Remark 2.1 Exchangeability for a probability measure on n random variables simply implies that their probability distribution is invariant to the order in which they have been defined. Infinite exchangeability extends the concept to infinite sequences of random variables, requiring that any finite subsequence must be exchangeable.
Exchangeability can be a very natural (perhaps approximate) assumption in prac-
tical reasoning about uncertainty, such as assigning no importance to the order of
observed outcomes from a (possibly unending) sequence of tosses of a coin.

2.2 De Finetti’s Representation Theorem

On exchangeability, the Italian probability theorist Bruno de Finetti (1906–1985) is credited with the following theorem, which might be regarded as astonishing for its generality and impact.
Theorem 2.1 (De Finetti’s representation theorem) Let X1, X2, . . . be an infinitely exchangeable sequence of binary random variables, Xi ∈ {0, 1}. Then for any n ≥ 1, there must exist a probability measure Q on [0, 1] such that the corresponding mass function pX1,...,Xn of the probability distribution PX1,...,Xn satisfies

pX1,...,Xn(x1, . . . , xn) = ∫₀¹ ∏ᵢ₌₁ⁿ θ^xi (1 − θ)^(1−xi) dQ(θ). (2.1)

Proof See Bernardo and Smith (1994, p. 172).

Remark 2.2 Theorem 2.1 shows that any infinitely exchangeable sequence of binary
random variables must arise as a sequence of independent and identically distributed
Bernoulli(θ ) random variables, with a single probability parameter θ drawn from
some distribution Q.
The same property does not extend to finitely exchangeable sequences.

Exercise 2.1 (Finitely exchangeable binary sequences) Find an example of a finite sequence of binary random variables for which (2.1) does not hold.

Exercise 2.2 (Predictive distribution for exchangeable binary sequences) Suppose X1, X2, . . . is an infinitely exchangeable binary sequence. For 1 ≤ m < n, show that the conditional probability mass function for future elements Xm+1, . . . , Xn after observing x1, . . . , xm has the form

pXm+1,...,Xn|x1,...,xm(xm+1, . . . , xn) = ∫₀¹ ∏ᵢ₌ₘ₊₁ⁿ θ^xi (1 − θ)^(1−xi) dQ(θ | x1, . . . , xm),

where

dQ(θ | x1, . . . , xm) = ∏ᵢ₌₁ᵐ θ^xi (1 − θ)^(1−xi) dQ(θ) / ∫₀¹ ∏ᵢ₌₁ᵐ θ^xi (1 − θ)^(1−xi) dQ(θ).

Remark 2.3 Observing part of the sequence does not affect exchangeability, and
therefore Theorem 2.1. The initial prior distribution Q(θ ) is simply updated to the
current posterior distribution Q(θ | x1 , . . . , xm ).
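The updating formula in Exercise 2.2 has a convenient closed form when the prior Q is a Beta distribution; this conjugate choice is used here purely as an illustration. A Beta(a, b) prior on θ updates to a Beta(a + s, b + m − s) posterior, where s is the number of ones among x1, . . . , xm.

```python
def beta_posterior(a, b, xs):
    """Posterior parameters of Q(theta | x1, ..., xm) under a Beta(a, b) prior,
    by the updating formula of Exercise 2.2 (conjugate Beta-Bernoulli case)."""
    s = sum(xs)
    return a + s, b + len(xs) - s

xs = [1, 0, 1, 1, 0, 1, 1, 1]              # hypothetical observed binary sequence
a_post, b_post = beta_posterior(1, 1, xs)  # uniform Beta(1, 1) prior
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post, posterior_mean)  # 7 3 0.7
```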
Theorem 2.1 extends to non-binary, infinitely exchangeable sequences.
Theorem 2.2 Let X1, X2, . . . be a sequence of real-valued random variables, Xi ∈ R, which are believed to be infinitely exchangeable, and let R be the space of all probability distributions on R. Then for any n ≥ 1, necessarily there exists a probability measure Q on R such that

PX1,...,Xn(x1, . . . , xn) = ∫F∈R ∏ᵢ₌₁ⁿ F(xi) dQ(F). (2.2)

Proof See Bernardo and Smith (1994, p. 177).


Remark 2.4 From Bernardo and Smith (1994), the probability distribution Q has an
operational interpretation, representing the uncertainty surrounding “what we believe
the empirical distribution function would look like for a large sample”.
Remark 2.5 In parametric statistical modelling, the probability function F in Theorem 2.2 is assumed to have a set parametric form F(·; θ) for an unknown vector of parameters θ ∈ Rᵏ. The representation then simplifies to

PX1,...,Xn(x1, . . . , xn) = ∫θ∈Rᵏ ∏ᵢ₌₁ⁿ F(xi; θ) dQ(θ). (2.3)

Similar to Exercise 2.2, the predictive distribution satisfies

PXm+1,...,Xn|x1,...,xm(xm+1, . . . , xn) = ∫θ∈Rᵏ ∏ᵢ₌ₘ₊₁ⁿ F(xi; θ) dQ(θ | x1, . . . , xm),

where

dQ(θ | x1, . . . , xm) = ∏ᵢ₌₁ᵐ F(xi; θ) dQ(θ) / ∫θ∈Rᵏ ∏ᵢ₌₁ᵐ F(xi; θ) dQ(θ). (2.4)

2.3 Prior, Likelihood and Posterior

Theorem 2.2 justifies the standard “prior × likelihood” approach commonly applied
to Bayesian statistical modelling of real-valued data: assuming a sampling distribu-
tion comprising identically distributed observables which are conditionally indepen-
dent given an unknown parameter, where the parameter is assumed to be an initial
draw from some “prior” probability distribution.
The likelihood component in this mixture is common to both Bayesian and fre-
quentist statistical approaches, and so more scepticism and attention is often directed
towards how the prior component is specified in Bayesian methods. It is referred to as
the “prior distribution” because it reflects one’s beliefs about the generative mecha-
nism, F, before observing any of the variables X 1 , X 2 , . . .; in contrast the “posterior
distribution” (2.4) reflects updated beliefs about F after observing those data.

2.3.1 Prior Elicitation

Eliciting the prior beliefs of an individual as a fully coherent probability distribution, obeying all the axioms of probability, presents a daunting challenge which has regularly been offered as a criticism of Bayesian reasoning. However, the difficulty in achieving this objective does not undermine its logical necessity. Rather than conceding defeat at the impossibility of exactly quantifying beliefs, various mathematical devices, such as exchangeability and different modelling ideas introduced in later chapters, are typically deployed to propose probability distributions which might hopefully reflect the degrees of belief of an individual to an acceptable degree of approximation.

2.3.2 Non-informative Priors

The difficulties of accurate prior elicitation for an individual, or perhaps a desire to identify decisions which might generalise to other individuals, often lead practitioners to try to propose vague or non-informative prior distributions, so that the observed data “may speak for themselves”.
The word vague is often translated to mean high variance; assuming probability
to be more widely spread around the mean will typically assign less mass to any one
particular neighbourhood. However, without care this distinction may be artificial,
as higher variance under one parameterisation may, under certain transformations,
imply lower variance for a different parameterisation.
The word non-informative can assume a more specific interpretation: the prior
distribution which would maximise the observed change between the prior and the
corresponding posterior distribution, given observed data and a chosen distributional

discrepancy measure. See Sect. 5.4 of Bernardo and Smith (1994) for discussion of
reference priors and reference decisions; the latter identify the optimal decisions
under a least informative prior—not for operational use for any individual decision-
maker, but to serve as an illustrative benchmark for comparison.

Exercise 2.3 (Variances under transformations) Show that if θ ∼ Gamma(a, b) (see Sect. A.3), then choices of (a, b) implying high variance for θ correspond to low variance for 1/θ.
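Exercise 2.3 can be explored with the standard closed forms: Var(θ) = a/b² for θ ∼ Gamma(a, b) with rate b, and, since 1/θ then has an Inverse-Gamma(a, b) distribution, Var(1/θ) = b²/{(a − 1)²(a − 2)} whenever a > 2. A quick Python check with hypothetical values of the rate b:

```python
def gamma_and_inverse_variances(a, b):
    """Var(theta) for theta ~ Gamma(a, b) with shape a and rate b, and
    Var(1/theta) via the Inverse-Gamma(a, b) distribution of 1/theta (a > 2)."""
    assert a > 2
    var_theta = a / b ** 2
    var_inverse = b ** 2 / ((a - 1) ** 2 * (a - 2))
    return var_theta, var_inverse

for b in (0.1, 10.0):
    print(b, gamma_and_inverse_variances(3.0, b))
```

With a = 3, a small rate b = 0.1 gives a high variance for θ but a low variance for 1/θ, while a large rate b = 10 reverses the ordering.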

2.3.3 Hyperpriors

For some applications, it can be convenient to specify a hierarchy of prior distributions. For example, it might seem easier for a practitioner to specify a prior distribution for a parameter θ conditional on the value of some other unknown parameter φ, Qθ|φ(θ). A marginal prior for θ can then be recovered through specifying a prior measure for this hyperparameter (in frequentist statistics, nuisance parameter) φ, Qφ(φ), as then

Qθ(θ) = ∫φ Qθ|φ(θ) dQφ(φ). (2.5)

The additional level of prior modelling, Qφ(φ), is sometimes referred to as a hyperprior. By noting that a similar construction could be proposed for Qφ(φ), this hierarchical structure can be applied recursively to arbitrarily many nested levels.
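The hierarchy is straightforward to simulate: draw the hyperparameter φ from Qφ, then θ | φ from Qθ|φ, which yields draws from the marginal prior (2.5). The Python sketch below uses a hypothetical Normal–Gamma choice, for which the marginal prior for θ is a scaled Student-t distribution with 2a = 6 degrees of freedom and variance E(1/φ) = 1/2.

```python
import random
import statistics

rng = random.Random(0)

def draw_theta_marginal():
    """One draw from the marginal prior (2.5): phi ~ Gamma(3, 1) hyperprior,
    then theta | phi ~ Normal(0, 1/phi)."""
    phi = rng.gammavariate(3.0, 1.0)           # shape 3, scale 1 (i.e. rate 1)
    return rng.gauss(0.0, (1.0 / phi) ** 0.5)  # conditional prior given phi

draws = [draw_theta_marginal() for _ in range(100_000)]
print(round(statistics.mean(draws), 3), round(statistics.variance(draws), 3))
```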

2.3.4 Mixture Priors

A special case of (2.5) occurs when the hyperparameter φ is assumed to take one of only finitely many possible values; without loss of generality, suppose φ ∈ {1, 2, . . . , k}. Writing wi = dQφ(i) for i = 1, . . . , k, with Σᵢ₌₁ᵏ wi = 1, (2.5) simplifies to the finite mixture prior

Qθ(θ) = Σᵢ₌₁ᵏ wi Qθ|φ=i(θ)

on a finite collection of distributions Qθ|φ=1, . . . , Qθ|φ=k.


Similarly, (2.5) is sometimes referred to as an infinite mixture model.

2.3.5 Bayesian Paradigm for Prior to Posterior Reporting

Given an initial probability distribution reflecting prior beliefs about F and then
observing X 1 , . . . , X n as draws from F, Exercise 2.2 demonstrated the transition
from prior distribution, through the likelihood function, to the posterior distribution
(in this case for infinitely exchangeable random variables). This transformation was a
simple application of Theorem 1.1, Bayes’ theorem, and represents the only coherent
mechanism for updating subjective probabilities.
In principle, the Bayesian paradigm for reporting scientific conclusions from a
fixed collection of data suggests repeating this prior to posterior transformation for a
range of different prior distributions, selected to cover a broad range of prior beliefs
which may plausibly be held by the reader; for each prior distribution, the author
would present the consequent posterior distribution and perhaps a corresponding
optimal decision. However, in practice this procedure is often truncated, with authors
preferring to show a single analysis under a non-informative prior (cf. Sect. 2.3.2),
with the implication that inputting any more informative prior information would
simply bias the conclusions in that direction, albeit by an unspecified amount.

2.3.6 Asymptotic Consistency

The sensitivity of posterior inferences to different prior distribution specifications (Sect. 2.3.5) is determined by the relative amount of information contained in the sample likelihood function. Suppose, as in (2.3), that X1, . . . , Xn are assumed to be conditionally independent draws from the parametric distribution F(·; θ) with a prior probability distribution Q(θ) for the unknown parameter θ. If the parametric form F(·; θ) were true, and the true parameter which gave rise to these samples was θ∗, then the following proposition typifies several results which exist on posterior consistency.

Proposition 2.1 Suppose Q(θ) is a discrete distribution with dQ(θ∗) > 0. If, for all other θ ≠ θ∗ satisfying dQ(θ) > 0, KL(F(·; θ) ∥ F(·; θ∗)) > 0, then

limₙ→∞ dQ(θ | x1, . . . , xn) = 1{θ∗}(θ).

Proof See Bernardo and Smith (1994, p. 286).

Remark 2.6 The requirement KL(F(·; θ) ∥ F(·; θ∗)) > 0 is sometimes referred to as identifiability. Under this condition, Proposition 2.1 states that as n → ∞ the posterior distribution will converge to a single atom of mass located at the true value, provided that value had non-zero prior mass. In this sense, asymptotically the form of the prior Q(θ) does not matter beyond its support, {θ : dQ(θ) > 0}.

Remark 2.7 It was remarked in Sect. 1.1.1 that the subjective Bayesian paradigm
attaches no particular importance to absolute truths. From that perspective, Proposi-
tion 2.1 might appear to lack any operational significance; in subjective probability,
there is no true likelihood and no true parameter value, nor will there be infinite
random samples to observe.
However, there is a useful conclusion to draw: If you and I agree on exchange-
ability, the form of the sampling distribution F(·; θ ) and the range of values which
θ reasonably might take, then even if we disagree on a form for the prior Q(θ ), as
we observe more data our posterior beliefs will uniformly converge. So for reporting
scientific inference on “big data” applications, only the likelihood function really
matters.

2.3.7 Asymptotic Normality

For continuous-valued parameters θ ∈ Rk , under some minor regularity conditions


the posterior distribution is asymptotically normal, analogous to the result in classical
statistics concerning the maximum likelihood estimator


θ̂n = arg maxθ ∏_{i=1}^{n} F(xi ; θ ). (2.6)

For the maximum likelihood estimator, asymptotically θ̂n ∼ Normalk (θ ∗ , In⁻¹ (θ ∗ )),
where In (θ ) is the so-called Fisher information matrix of the likelihood function,

In (θ ) = −(d²/dθ²) Σ_{i=1}^{n} log F(xi ; θ ). (2.7)

Proposition 2.2 Let m 0 = arg maxθ dQ(θ ) be the mode of the prior distribution,
and let I0 (θ ) = −(d²/dθ²) log dQ(θ ). Then

Hn = I0 (m 0 ) + In (θ̂n ) (2.8)

is the posterior information matrix and

m n = Hn⁻¹ (I0 (m 0 )m 0 + In (θ̂n )θ̂n ) (2.9)

the posterior mode, and asymptotically as n → ∞,

Q(θ | x1 , . . . , xn ) → Normalk (θ | m n , Hn⁻¹) → Normalk (θ | θ ∗ , In⁻¹ (θ ∗ )).



Proof For a sketch proof involving a Taylor series expansion, see Bernardo and
Smith (1994, p. 287).

Remark 2.8 Proposition 2.2 states that a large sample posterior distribution can be
well approximated by a Gaussian; as n → ∞ the mean of that Gaussian tends to
the true value θ ∗ and the variance shrinks toward zero provided θ ∗ is identifiable,
implying posterior consistency.
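The two asymptotic results can be watched numerically in the conjugate Beta–Bernoulli model (anticipating Sect. 4.2; the prior hyperparameters and "true" θ∗ below are arbitrary illustrative choices): the posterior mean approaches θ∗ and the posterior variance shrinks towards zero as n grows.

```python
import random

random.seed(1)

a0, b0 = 2.0, 2.0        # Beta(a, b) prior hyperparameters (illustrative)
theta_star = 0.3         # "true" success probability generating the data

def posterior_params(xs, a, b):
    """Conjugate Beta posterior after Bernoulli observations xs: Beta(a + s, b + n - s)."""
    s = sum(xs)
    return a + s, b + len(xs) - s

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

for n in (10, 100, 10000):
    xs = [1 if random.random() < theta_star else 0 for _ in range(n)]
    an, bn = posterior_params(xs, a0, b0)
    mean, var = beta_mean_var(an, bn)
    print(n, round(mean, 3), round(var, 6))  # mean -> theta_star, variance -> 0
```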

Exercise 2.4 Asymptotic normality. Let x1 , . . . , xn be n observations from an


infinitely exchangeable sequence of binary random variables as specified in
Theorem 2.1. Suppose Q(θ ) = Beta(θ | a, b) (see Sect. A.2). Find the asymptotic
normal distribution of θ as n → ∞.
Chapter 3
Graphical Modelling and Hierarchical
Models

In many contexts, straightforward exchangeability can be a useful simplifying


assumption for specifying the joint probability distribution of random variables.
But sometimes, an individual will require more complex structures of statistical
dependence between random quantities to properly represent their beliefs. Graph-
ical models provide a useful framework for characterising joint distributions for
random variables, putting primary focus on characterising uncertainty in the depen-
dency structure amongst the variables. Much of the material in this chapter is drawn
from Barber (2012) and related resources.
Before introducing graphical models, some basic graph concepts and definitions
are required to provide a language for relating probability distributions to graphs.

3.1 Graphs

3.1.1 Specifying a Graph

Definition 3.1 (Graph) A graph is a pair G = (V, E) where V is a non-empty set


of entities, referred to as nodes, and E ⊂ V × V is a set of ordered pairs of nodes
referred to as edges. The subset notation is strict, since for any v ∈ V it is assumed
here that (v, v) ∉ E (there are no self loops).

Definition 3.2 (Directed and undirected graphs) For a graph G = (V, E), if E is
symmetric such that (v, v  ) ∈ E ⇐⇒ (v  , v) ∈ E, then the graph G is said to be
undirected. Otherwise, the edges and graph are directed.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 23


N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0_3

Fig. 3.1 An example graph with directed edges X1 → X2, X1 → X4, X2 → X4 and X4 → X3, and
the corresponding adjacency matrix

AG = [ 0 1 0 1
       0 0 0 1
       0 0 0 0
       0 0 1 0 ]

Fig. 3.2 An example graph with undirected edges X1 —X2, X1 —X4, X2 —X4 and X3 —X4, and
the corresponding adjacency matrix

AG = [ 0 1 0 1
       1 0 0 1
       0 0 0 1
       1 1 1 0 ]

Remark 3.1 Figures 3.1 and 3.2 provide diagrammatic examples of directed and
undirected graphs. Each link drawn between nodes corresponds to an edge; in the
directed graph, these links must have arrows to indicate their direction.

In the context of graphical modelling, the set of nodes in the graph will corre-
spond to a finite set of random variables V = {X 1 , . . . , X n } for which a probability
model must be constructed. The set of edges will correspond to proposed dependen-
cies between variables, defined in different ways according to different modelling
constructs.

Definition 3.3 (Adjacency matrix) For a finite graph G = (V, E), where V =
{X 1 , . . . , X n }, the adjacency matrix of the graph is a binary n × n matrix AG with
entries in {0, 1}, such that (AG )i j = 1 ⇐⇒ (X i , X j ) ∈ E.

Remark 3.2 An adjacency matrix provides an alternative characterisation of the


edges in a graph. Figures 3.1 and 3.2 show the corresponding adjacency matrices
implied by the example directed and undirected graphs. The diagonal elements will
always be zero, and for an undirected graph the matrix is necessarily symmetric.
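The two adjacency matrices can be transcribed and the properties noted in Remark 3.2 checked programmatically (a small sketch; matrices copied from Figs. 3.1 and 3.2):

```python
# Adjacency matrices transcribed from Fig. 3.1 (directed) and Fig. 3.2 (undirected)
A_directed = [
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
A_undirected = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

def is_symmetric(A):
    """An undirected graph's adjacency matrix must be symmetric."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

def has_self_loops(A):
    """Definition 3.1 excludes self loops, so the diagonal must be zero."""
    return any(A[i][i] != 0 for i in range(len(A)))

print(is_symmetric(A_directed), is_symmetric(A_undirected))  # False True
```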

3.1.2 Neighbourhoods of Graph Nodes

Definition 3.4 (Parents and children) In a directed graph G = (V, E), the parents
of node X i ∈ V is the set of nodes which connect to X i through an edge in E,
parents(X i ) = {X j ∈ V : (X j , X i ) ∈ E}. Similarly, the children of X i is the subset
of V connected to by X i , children(X i ) = {X j ∈ V : (X i , X j ) ∈ E}.
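Definition 3.4 translates directly into row and column lookups in the adjacency matrix (a sketch with 0-indexed nodes, using the matrix from Fig. 3.1):

```python
def parents(A, i):
    """Nodes j with an edge (j, i): the non-zero entries of column i."""
    return [j for j in range(len(A)) if A[j][i] == 1]

def children(A, i):
    """Nodes j with an edge (i, j): the non-zero entries of row i."""
    return [j for j in range(len(A)) if A[i][j] == 1]

# Adjacency matrix of the directed graph in Fig. 3.1 (node X_{k+1} is index k)
A = [[0, 1, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 1, 0]]

print(parents(A, 3), children(A, 0))  # [0, 1] [1, 3]
```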

Exercise 3.1 (Identifying parents and children) For the directed graph in Fig. 3.1,
find the parents and children of each node in V = {X 1 , X 2 , X 3 , X 4 }.

Definition 3.5 (Neighbours) In an undirected graph G = (V, E), the neighbours of


a node X i , written neighbours(X i ), is simply the set of nodes in V connected to X i
by an edge in E, neighbours(X i ) = {X j ∈ V : (X i , X j ) ∈ E}.
Exercise 3.2 (Identifying neighbours) For the undirected graph in Fig. 3.2, find the
neighbours of each node in V = {X 1 , X 2 , X 3 , X 4 }.

3.1.3 Paths, Cycles and Directed Acyclic Graphs

Definition 3.6 (Path) A sequence of distinct nodes X i1 , X i2 , . . . , X in in V is a


directed path in a graph G = (V, E) if, for each 1 ≤ j < n, (X i j , X i j+1 ) ∈ E. The
same sequence is an undirected path in the graph if, for each j, (X i j , X i j+1 ) ∈ E or
(X i j+1 , X i j ) ∈ E.
Definition 3.7 (Cycle) A cycle is a directed path X i1 , X i2 , . . . , X in such that X i1 =
X in .
Definition 3.8 (Directed acyclic graph) A directed acyclic graph (DAG) is a directed
graph containing no cycles.
Remark 3.3 DAGs provide an important link between graph theory and probability
modelling. In Sect. 3.2.1, they will be used to define a class of graphical models
known as Bayesian belief networks. The direction of the links indicate an assumption
of causal dependence. Figure 3.1 is an example of a DAG.
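Acyclicity can be tested with a topological sort (Kahn's algorithm), a standard check before interpreting a directed graph as a DAG. A sketch on 0-indexed edge lists, using Fig. 3.1's edge set:

```python
from collections import deque

def is_dag(n, edges):
    """Kahn's algorithm: repeatedly remove nodes with no incoming edges;
    the graph is acyclic iff every node eventually gets removed."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed == n

# Fig. 3.1's edge set (0-indexed) is a DAG; adding the edge (2, 0) creates a cycle
edges = [(0, 1), (0, 3), (1, 3), (3, 2)]
print(is_dag(4, edges), is_dag(4, edges + [(2, 0)]))  # True False
```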

3.1.4 Cliques and Separation

Definition 3.9 (Clique) In an undirected graph G = (V, E), a clique is a fully con-
nected subset of V . Furthermore, a clique is said to be maximal in the graph if there
is no superset which is also a clique.
Exercise 3.3 (Identifying cliques) For the graph of Fig. 3.2, identify the maximal
cliques.
Definition 3.10 (Separation through a set) For an undirected graph G = (V, E) and
disjoint node subsets A , B, C ⊂ V = {X 1 , . . . , X n }, if every path from an element
of A to an element of B contains an element of C , then C is said to separate A
from B.
Exercise 3.4 (Identifying separating sets) For the graph in Fig. 3.2, find the sepa-
rating sets.
Definition 3.11 (Separation) For A , B ⊂ V , A is separated from B in G =
(V, E) if there is no path in G between an element of A and an element of B.

3.2 Graphical Models

3.2.1 Belief Networks

Definition 3.12 (Belief network) Let G be a DAG on the node set of random variables
V = {X 1 , . . . , X n }. A belief network (also known as a causal graph) with graph G
assumes the joint probability distribution factorises as


PG (X 1 , . . . , X n ) = ∏_{i=1}^{n} P(X i | parentsG (X i )). (3.1)

Remark 3.4 In a belief network, the set of DAG edges imply a collection of condi-
tional independence statements, although not uniquely; one joint probability distri-
bution can often be represented by multiple alternative DAGs.
Exercise 3.5 Interpreting the graph in Fig. 3.1 as a belief network, state the form of
the implied joint probability distribution using the notation of (3.1).
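The factorisation (3.1) is mechanical to evaluate once conditional probability tables are supplied. A minimal sketch for binary variables on an illustrative three-node chain (the graph and all probability values here are made up for the example, not taken from Fig. 3.1):

```python
def joint_prob(x, parent_sets, cpts):
    """Evaluate the belief network factorisation (3.1) for binary variables.
    cpts[i] maps a tuple of parent values to P(X_i = 1 | parents)."""
    p = 1.0
    for i, pa in enumerate(parent_sets):
        p1 = cpts[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

# Illustrative three-node chain X0 -> X1 -> X2 with made-up probabilities
parent_sets = [(), (0,), (1,)]
cpts = [
    {(): 0.6},               # P(X0 = 1)
    {(0,): 0.2, (1,): 0.9},  # P(X1 = 1 | X0 = 0), P(X1 = 1 | X0 = 1)
    {(0,): 0.3, (1,): 0.7},  # P(X2 = 1 | X1 = 0), P(X2 = 1 | X1 = 1)
]

# A correctly factorised joint must sum to 1 over all 2^3 configurations
total = sum(joint_prob((x0, x1, x2), parent_sets, cpts)
            for x0 in (0, 1) for x1 in (0, 1) for x2 in (0, 1))
print(round(total, 12))  # 1.0
```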

3.2.1.1 Connectedness and Direct Separation

Definition 3.13 (Connected graph) A (directed or undirected) graph is said to be


connected if there exists an undirected path between any two nodes in the graph.
Definition 3.14 (Connected components) If graph G = (V, E) is not connected,
the nodes V can be uniquely partitioned into separated (see Definition 3.11) subsets
V1 , . . . , Vk , such that each subgraph Gi = (Vi , E ∩ (Vi × Vi )) is connected and there
are no connections in E between the subgraphs. The subgraphs G1 , . . . , Gk are said
to be the connected components of G .
Definition 3.15 (Collider node) In an undirected path within a directed graph, a
node within the path is said to be a collider for that path if the edges on either side
are both directed towards that node.
Exercise 3.6 (Identifying colliders) Figure 3.3 shows the three possible three-node
paths that can exist within a directed graph. For each case, identify any colliders.

Definition 3.16 (d-connected and d-separated) Let G = (V, E) be a directed graph


and A , B, C ⊂ V be disjoint node subsets.
A is said to be d-connected to B by C if there exists an undirected path between
an element of A and an element of B such that each element on the path is either
1. a non-collider which lies outside C ; or
2. a collider which either lies inside C or has a descendant in C .
Otherwise, C is said to d-separate A from B.

Fig. 3.3 The three possible directed graphs (up to label changes) with |V| = 3 and |E| = 2; each
panel (a)–(c) is an undirected path X1 –X2 –X3 through the middle node X2, with a different choice
of edge directions

Remark 3.5 The term d-separation is shorthand for “directional separation”.


Remark 3.6 If ∅ d-separates A from B, A and B are simply said to be d-
separated.
Exercise 3.7 (Identifying d-separated and d-connected nodes) For each path in Fig.
3.3, identify any d-separated or d-connected nodes.

3.2.1.2 Independence and Conditional Independence

Proposition 3.1 For a belief network on a directed graph G = (V, E) and disjoint
node subsets A , B, C ⊂ V , if C d-separates A from B then A ⊥⊥ B | C in the
joint distribution PG of the belief network.
Corollary 3.1 Trivially from Proposition 3.1, the connected components of a graph
in a belief network are independent.
Exercise 3.8 (Identifying conditional independencies in a belief network) For each
of the graphs in Fig. 3.3, state the dependence between X 1 and X 3 (i) marginally;
(ii) conditionally given X 2 .

3.2.2 Markov Networks

Definition 3.17 (Markov network) Let G be an undirected graph on the node set
{X 1 , . . . , X n }. A Markov network with graph G assumes the joint probability distri-
bution factorises as

PG (X 1 , . . . , X n ) ∝ ∏_{i=1}^{C} φi (Xi ), (3.2)

where X1 , . . . , XC are the maximal cliques of G . The non-negative functions φi are


sometimes referred to as potentials.
Exercise 3.9 (Markov network distribution) Interpreting the graph in Fig. 3.2 as a
Markov network, state the form of the implied joint probability distribution using
the notation of (3.2).

Fig. 3.4 A three-node graph with two undirected edges: X1 —X2 and X2 —X3

Definition 3.18 (Pairwise Markov network) Let G = (V, E) be an undirected graph


on the node set {X 1 , . . . , X n }. A pairwise Markov network with graph G assumes
the joint probability distribution factorises as

PG (X 1 , . . . , X n ) ∝ ∏_{(X i ,X j )∈E} φi, j (X i , X j ). (3.3)

Exercise 3.10 (Pairwise Markov network distribution) Interpreting the graph in Fig.
3.2 as a pairwise Markov network, state the form of the implied joint probability
distribution using the notation of (3.3).

Remark 3.7 The definitions of a Markov network and pairwise Markov network
coincide if and only if the maximal cliques are all edges, meaning there are no
triangles in the graph.

Remark 3.8 For the graph in Fig. 3.4, the definitions of Markov networks and
pairwise Markov networks coincide, both implying PG (X 1 , X 2 , X 3 ) ∝ φ1 (X 1 , X 2 )
φ2 (X 2 , X 3 ). In general, this simple graph would imply X 1 and X 3 are dependent, but
conditionally independent given X 2 .
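Remark 3.8's conditional independence claim can be verified by brute force for binary variables: normalise the product of potentials over all 2³ configurations and compare P(X1, X3 | X2) with P(X1 | X2) P(X3 | X2). The potential values below are arbitrary positive numbers chosen for illustration:

```python
from itertools import product

# Arbitrary positive potentials on the two edges of Fig. 3.4
phi1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}  # phi1(x1, x2)
phi2 = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.7}  # phi2(x2, x3)

# Brute-force normalisation of P(x1, x2, x3) ∝ phi1(x1, x2) phi2(x2, x3)
Z = sum(phi1[x1, x2] * phi2[x2, x3] for x1, x2, x3 in product((0, 1), repeat=3))

def p(x1, x2, x3):
    return phi1[x1, x2] * phi2[x2, x3] / Z

def p_marginal(**fixed):
    """Sum the joint over all variables not named in `fixed`."""
    total = 0.0
    for x1, x2, x3 in product((0, 1), repeat=3):
        x = {"x1": x1, "x2": x2, "x3": x3}
        if all(x[k] == v for k, v in fixed.items()):
            total += p(x1, x2, x3)
    return total

# Check P(X1=1, X3=1 | X2=0) = P(X1=1 | X2=0) * P(X3=1 | X2=0)
lhs = p(1, 0, 1) / p_marginal(x2=0)
rhs = (p_marginal(x1=1, x2=0) / p_marginal(x2=0)) * \
      (p_marginal(x2=0, x3=1) / p_marginal(x2=0))
print(abs(lhs - rhs) < 1e-12)  # True
```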

3.2.2.1 Conditional Independence

Proposition 3.2 Markov property. For disjoint A , B, C ⊂ V , if C separates A


from B in a graph G = (V, E) then A ⊥⊥ B | C in any Markov network on graph
G.

Remark 3.9 In Fig. 3.4, X 1 ⊥⊥ X 3 | X 2 . More generally, a node will be conditionally
independent of any other nodes in the graph given its neighbours.

Definition 3.19 (Markov random field) Let G be an undirected graph on the node set
{X 1 , . . . , X n }. A Markov random field on G assumes the full conditional probability
distributions satisfy

PG (X i | X 1 , . . . , X i−1 , X i+1 , . . . , X n ) = PG (X i | neighboursG (X i )).

Remark 3.10 The definition of a Markov random field is equivalent to the earlier
definition of a Markov network, as characterised by (3.2).

Fig. 3.5 An example lattice graph on nodes X1 , X2 , X3 , X4 arranged in a 2 × 2 grid

Exercise 3.11 Let G = (V, E) be a graph on V = {X 1 , . . . , X n }. A multivariate


normal distribution Nn (μ, Σ) is a Gaussian Markov random field (GMRF) with
respect to G if, for i ≠ j, the covariance matrix satisfies the condition

(Σ⁻¹)i j = 0 ⇐⇒ (X i , X j ) ∉ E.

Show that a GMRF satisfies Definition 3.19 for a Markov random field.
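A pointer for Exercise 3.11: in a zero-mean Gaussian with precision matrix Q = Σ⁻¹, the full conditional mean of Xi given the rest is −(1/Qii) Σ_{j≠i} Qij xj, so variables with Qij = 0 (non-neighbours) drop out. A small sketch with an assumed chain-graph precision matrix (values chosen arbitrarily):

```python
# Precision matrix Q = Sigma^{-1} for a three-node chain X1 - X2 - X3:
# Q[0][2] = 0 because (X1, X3) is not an edge.
Q = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

def conditional_mean_coeffs(Q, i):
    """Coefficients of the other variables in E[X_i | rest] for a
    zero-mean Gaussian: -Q[i][j] / Q[i][i] for each j != i."""
    return {j: -Q[i][j] / Q[i][i] for j in range(len(Q)) if j != i}

coeffs = conditional_mean_coeffs(Q, 0)
print(coeffs[2] == 0.0)  # True: the non-neighbour X3 has no influence on X1's conditional
```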

3.2.2.2 Lattice Models

Figure 3.5 shows an example of a lattice graph. Lattice graphs provide another case
where the definitions of a Markov network/random field and a pairwise Markov
network coincide. As Markov random fields, these structures are known as lattice
models.

3.2.3 Factor Graphs

Factor graphs provide a further generalisation to (3.2), by allowing products of poten-


tials (or factors) on arbitrary node subsets through the inclusion of additional (latent)
factor nodes θ1 , . . . , θk .

Definition 3.20 (Factor graph) Let G be an (undirected) graph on the extended node
set {X 1 , . . . , X n } ∪ {θ1 , . . . , θk }. A factor graph model assumes the joint probability
distribution for X 1 , . . . , X n factorises as


PG (X 1 , . . . , X n ) ∝ ∏_{i=1}^{k} φi (neighboursG (θi ) ∩ {X 1 , . . . , X n }). (3.4)

Remark 3.11 There should be no edges between factor nodes or variable nodes in
a factor graph, since these would have no bearing on (3.4).

Fig. 3.6 A factor graph model for variables X 1 , . . . , X 5 , with factor node θ1 connected to
X 1 , X 3 , X 5 and factor node θ2 connected to X 2 , X 4 , X 5

Remark 3.12 By introducing additional nodes, factor graphs can represent richer
dependency structures than either belief networks or Markov networks; one belief
network or Markov network could correspond to multiple possible factor graphs.

Figure 3.6 shows an example factor graph, where the shaded nodes indicate latent
factors.
The edge structure in Fig. 3.6 implies a factorisation of the joint distribution

PG (X 1 , X 2 , X 3 , X 4 , X 5 ) ∝ φ1 (X 1 , X 3 , X 5 )φ2 (X 2 , X 4 , X 5 ).

3.2.3.1 Conditional Independence

Proposition 3.3 For disjoint A , B, C ⊂ {X 1 , . . . , X n }, if C separates A from B
in a factor graph G = ({X 1 , . . . , X n } ∪ {θ1 , . . . , θk }, E) then A ⊥⊥ B | C .

Remark 3.13 In Fig. 3.6, {X 1 , X 3 } and {X 2 , X 4 } are conditionally independent


given X 5 .

3.3 Hierarchical Models

Section 2.3.3 introduced the idea of specifying probability distributions for unknowns
through hierarchies. Such hierarchies can be interpreted as graphical models.

Definition 3.21 (Hierarchical model) A Bayesian hierarchical model for random


variables X 1 , . . . , X n is a multiply-layered expression for the joint probability dis-
tribution with one or more hyperparameters.

Fig. 3.7 Exchangeability for X 1 , . . . , X n represented as (a) a belief network or (b) a factor graph;
in both panels a single node θ is linked to each of X 1 , . . . , X n

Remark 3.14 Hierarchical model formulations are equivalent to both belief net-
works (Sect. 3.2.1) and factor graphs (Sect. 3.2.3). They can be represented graphi-
cally in either way.

Example 3.1 De Finetti’s representation Eqs. (2.2) and (2.3) for exchangeable vari-
ables X 1 , . . . , X n are simple hierarchical models. This representation is depicted
graphically as both a belief network and a factor model in Fig. 3.7. The differences
are the undirected edges and the explicit interpretation of θ as a latent parameter in
the factor graph. The shaded nodes indicate latent random variables which will not
be observed.

Example 3.2 Hierarchical models can be used to incorporate groups of dependen-


cies into probability models. Again consider X 1 , . . . , X n to be exchangeable, but now
suppose each X i is a p-vector X i = (X i,1 , . . . , X i, p ) which can also be assumed to
be exchangeable.
For example, on a degree course there could be n students who each sit p tests, such
that X i j ∈ [0, 100] corresponds to the percentage score obtained by the ith student
in the jth test. The implied n × p matrix (X i j ) could be regarded as a spreadsheet
recording the student grades, where each row corresponds to a different student and
each column to a different test.
Without further information about the students and the relative difficulty of the
tests, a doubly-exchangeable assumption (for X 1 , . . . , X n and X i,1 , . . . , X i, p ) could
seem reasonable. (In contrast, assuming full exchangeability between all n × p test
scores would be less comfortable, since each student might be expected to perform
comparably in each of the different tests, according to their aptitude.)
Figure 3.8 shows the hierarchical model resulting from these two nested layers
of exchangeability. The root node at the top of the hierarchy, F, is a probability
distribution on the space of probability distributions. Each child node Fi is a draw
from F corresponding to the grade probability distribution of student i. Finally, each
individual test score X i j is an independent random draw from the grade distribution
Fi for that student.
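The two-layer hierarchy of Example 3.2 can be simulated top-down. Here, purely for illustration, drawing Fi from F is collapsed to drawing a student-specific mean: μi ~ Normal(m, τ) and Xij ~ Normal(μi, σ), with scores clipped to [0, 100]; all distributional choices and numbers are assumptions, not from the text.

```python
import random

random.seed(0)

n, p = 5, 4            # students, tests
m, tau = 60.0, 10.0    # F: population-level mean and spread of student means
sigma = 5.0            # within-student test-to-test variability

def clip(score):
    """Keep simulated percentage scores in [0, 100]."""
    return max(0.0, min(100.0, score))

# Draw each student's grade distribution F_i (here: just its mean) from F,
# then draw that student's p test scores independently from F_i.
scores = []
for i in range(n):
    mu_i = random.gauss(m, tau)
    scores.append([clip(random.gauss(mu_i, sigma)) for _ in range(p)])

row_means = [sum(row) / p for row in scores]
print([round(r, 1) for r in row_means])
```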

Fig. 3.8 A belief network representation of a hierarchical model for an n × p matrix of random
variables (X i j ) with two layers of exchangeability: firstly in the rows, secondly in the row entries.
The root node F has children F1 , . . . , Fn , and each Fi has children X i1 , . . . , X i p
Chapter 4
Parametric Models

This chapter introduces examples of parametric inferential models commonly used in


the representation framework for exchangeable random variables from Chap. 2, and
also the conditional distributions of more general dependency structures considered
in Chap. 3.

4.1 Parametric Modelling

Suppose x = (x1 , . . . , xn ) are the observed values of exchangeable random variables


which are assumed to be conditionally independent given an unknown parameter
θ ∈ Θ (see Sect. 2.2). To simplify the notation of (2.3), the density of the joint
distribution P_{X 1 ,...,X n} (x1 , . . . , xn ) will now be written as p(x); the prior density for
θ , dQ(θ )/ dθ will be written simply as p(θ ); F(x; θ ) will be denoted p(x | θ ); and
the posterior density dQ(θ | x1 , . . . , xn )/ dθ will simply be written as π(θ ). In this
simplified notation, defining the joint likelihood


p(x | θ ) := ∏_{i=1}^{n} p(xi | θ ),

De Finetti’s representation theorem becomes


p(x) = ∫_{Θ} ∏_{i=1}^{n} p(xi | θ ) p(θ ) dθ (4.1)


and the posterior density for θ (2.4) can be expressed most simply as

π(θ ) ∝ p(x | θ ) p(θ ). (4.2)

Remark 4.1 In Bayesian inference it is common to see (posterior) probability


densities being specified only up to a constant of proportionality. If π(θ ) ∝ g(θ ),
then since all probability densities must integrate¹ to 1, it necessarily follows that
π(θ ) = g(θ )/{∫_{Θ} g(θ ′) dθ ′}. So (4.2) is simply a shortening of the full expression

π(θ ) = p(x | θ ) p(θ ) / ∫_{Θ} p(x | θ ′) p(θ ′) dθ ′ = p(x | θ ) p(θ ) / p(x). (4.3)

However, a note of caution is required: Although the transition from an equation


of proportionality (4.2) to equality (4.3) for the posterior density is automatic from
a theoretical viewpoint, this normalisation requires evaluation of an integral in the
denominator of (4.3) which may not always be analytically tractable.

4.2 Conjugate Models

Definition 4.1 (Conjugacy) A likelihood-prior representation (2.3) is said to be con-


jugate if the prior and posterior densities p(θ ) and π(θ ) from (4.2) are from the same
parametric family.
Remark 4.2 For conjugacy to occur, the likelihood terms p(xi | θ ) must also resem-
ble a density from the same parametric family as the prior p(θ ), up to a constant of
proportionality.
Tables 4.1 and 4.2 give examples of conjugate models for discrete and continuous
random variables. In each row of either table, there is a likelihood model for which
there exists a conjugate prior for one of the parameters, each time denoted θ . The
right hand column shows the transformation from prior p(θ ) to posterior p(θ | x)
implied by a single observation x from the likelihood model p(x | θ ).
Remark 4.3 In Table 4.1, the negative binomial distribution refers to the parame-
terisation

p(x | θ ) = ( (r + x − 1) choose (r − 1) ) θ^r (1 − θ )^x ,

corresponding to the number of “failures” observed before seeing r “successes” in a


sequence of independent Bernoulli(θ ) variables. Similarly the geometric distribution
corresponds to Negative Binomial(1, θ ), the distribution for the number of failures
before the first success.

¹ Integration, here and elsewhere, refers to Lebesgue integration for densities of continuous random
variables, and summation for densities (or mass functions) of discrete random variables.

Table 4.1 Conjugate parametric models for discrete random variables


Likelihood, p(x | θ) Conjugate prior, p(θ) Posterior, p(θ | x)
Bernoulli(θ) Beta(a, b) Beta(a + x, b + 1 − x)
Geometric(θ) Beta(a, b) Beta(a + 1, b + x)
Binomial(m, θ) Beta(a, b) Beta(a + x, b + m − x)
Negative Binomial(r, θ) Beta(a, b) Beta(a + r, b + x)
Multinomialk (m, θ) Dirichlet k (α) Dirichletk (α + x)
Poisson(θ) Gamma(a, b) Gamma(a + x, b + 1)

Table 4.2 Conjugate parametric models for continuous random variables

Likelihood, p(x | θ)    Conjugate prior, p(θ)      Posterior, p(θ | x)
Uniform(0, θ)           Pareto(a, b)               Pareto(a + 1, max{b, x})
Exponential(θ)          Gamma(a, b)                Gamma(a + 1, b + x)
Gamma(ψ, θ)             Gamma(a, b)                Gamma(a + ψ, b + x)
Normalk (θ, Λ⁻¹)        Normalk (μ, P⁻¹)           Normalk ((Λ + P)⁻¹ (Λx + Pμ), (Λ + P)⁻¹)
Normalk (μ, θ)          Inverse Wishartk (a, b)    Inverse Wishartk (a + 1, b + (x − μ)(x − μ)ᵀ)

Generalisations of the results in Tables 4.1 and 4.2 to more than one observation
from the likelihood model are straightforward; for a second observation, the posterior
p(θ | x) from the right hand column adopts the role of the prior in the middle column,
simply updating the parameter values within the same parametric family.
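The single-observation updates compose sequentially; e.g. applying the Bernoulli–Beta row of Table 4.1 observation by observation reproduces the batch posterior Beta(a + Σxi, b + n − Σxi). A quick sketch of the bookkeeping:

```python
def beta_bernoulli_update(a, b, x):
    """Single-observation update from Table 4.1: Beta(a + x, b + 1 - x)."""
    return a + x, b + 1 - x

a, b = 1.0, 1.0          # start from a Beta(1, 1) (uniform) prior
data = [1, 0, 1, 1, 0, 1]
for x in data:
    a, b = beta_bernoulli_update(a, b, x)

# Sequential updating matches the batch formula Beta(a0 + sum(x), b0 + n - sum(x))
print(a, b)  # 5.0 3.0
```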

Exercise 4.1 Suppose x = (x1 , . . . , xn ) are n independent Bernoulli(θ ) random


samples, and θ ∼ Beta(a, b). Derive the posterior distribution for θ | x.

Exercise 4.2 Suppose x = (x1 , . . . , xn ) are n independent Poisson(θ ) random sam-


ples, and θ ∼ Gamma(a, b). Derive the posterior distribution for θ | x.

Exercise 4.3 Suppose x = (x1 , . . . , xn ) are n independent Uniform(0, θ ) random


samples, and θ ∼ Pareto(a, b). Derive the posterior distribution for θ | x.

Exercise 4.4 Suppose x = (x1 , . . . , xn ) are n independent Exponential(θ ) random


samples, and θ ∼ Gamma(a, b). Derive the posterior distribution for θ | x.

Proposition 4.1 For conjugate parametric models, the marginal likelihood p(x)
will have a closed-form equation.

Proof For conjugate models the posterior density will have a closed analytic form;
the marginal likelihood could therefore be obtained through rearranging (4.3),

p(x) = p(x | θ ) p(θ ) / π(θ ). (4.4)

Any terms involving θ in (4.4) will necessarily cancel, leaving a ratio of normalising
constants from the likelihood and prior densities and the posterior density which do
not depend on θ .
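The cancellation in the proof can be seen numerically for the Bernoulli likelihood with Beta prior (first row of Table 4.1): evaluating (4.4) at several different values of θ returns the same marginal likelihood. A sketch using math.gamma for the Beta normalising constants:

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

def marginal_likelihood(x, a, b, theta):
    """Evaluate (4.4) for one Bernoulli observation x with a Beta(a, b) prior:
    p(x) = p(x | theta) p(theta) / pi(theta), for ANY theta in (0, 1)."""
    lik = theta if x == 1 else 1 - theta
    return lik * beta_pdf(theta, a, b) / beta_pdf(theta, a + x, b + 1 - x)

# The ratio does not depend on theta; for x = 1 it equals the prior mean a/(a+b)
vals = [marginal_likelihood(1, 2.0, 3.0, t) for t in (0.1, 0.5, 0.9)]
print([round(v, 10) for v in vals])  # all equal 2/5 = 0.4
```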

4.3 Exponential Families

Definition 4.2 (Exponential family) A density p(x | θ ) belongs to an exponential


family if there are functions g, h, η, T such that

p(x | θ ) = h(x)g(θ ) exp{η(θ ) · T (x)}. (4.5)

Proposition 4.2 If p(x | θ ) is an exponential family of the form (4.5), then any
normalisable density satisfying

p(θ ) ∝ g(θ )r exp(η(θ ) · s) (4.6)

for r > 0 is a conjugate prior distribution for θ .

Proof From (4.2), the posterior density would be given up to proportionality by

p(θ | x) ∝ g(θ )r +1 exp{η(θ ) · (s + T (x))},

which is from the same parametric family as (4.6).

Remark 4.4 Proposition 4.2 provides another justification for the popularity of
exponential family models in statistics; they all have conjugate Bayesian priors.
All of the likelihood models in Tables 4.1 and 4.2 are exponential families.
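As a concrete check of Definition 4.2, the Poisson(θ) mass function can be put in the form (4.5) with h(x) = 1/x!, g(θ) = e^{−θ}, η(θ) = log θ and T(x) = x (a small numerical sketch):

```python
from math import exp, log, factorial

def poisson_pmf(x, theta):
    """Standard Poisson mass function."""
    return theta**x * exp(-theta) / factorial(x)

def exp_family_form(x, theta):
    """Poisson written as (4.5): h(x) g(theta) exp{eta(theta) * T(x)}."""
    h = 1.0 / factorial(x)
    g = exp(-theta)
    eta = log(theta)
    T = x
    return h * g * exp(eta * T)

# The two parameterisations agree numerically
vals = [(poisson_pmf(x, 2.5), exp_family_form(x, 2.5)) for x in range(6)]
print(all(abs(u - v) < 1e-12 for u, v in vals))  # True
```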

4.4 Non-conjugate Models

For a given likelihood model in the De Finetti representation (4.1), adopting a conju-
gate prior distribution is certainly attractive for mathematical convenience. However,
outside of simple exponential family examples, most likelihood models will not have
a conjugate prior distribution.
Besides conjugate models, a tractable posterior distribution will always be theo-
retically available whenever the prior distribution is discrete and has finite support
Θ = {θ : p(θ ) > 0}; in this case, the marginal likelihood p(x) which serves as the
normalising constant of (4.3) is simply the finite sum

p(x) = Σ_{θ∈Θ} p(x | θ ) p(θ ). (4.7)

However, in practice, if the number of support points |Θ| is very large (for example,
if θ is high-dimensional) then the summation (4.7) may still be too expensive to
compute.
Moreover, a Bayesian model specification should aim to reflect subjective beliefs,
and adopting a prior distribution simply for reasons of mathematical convenience is
not consistent with this objective. Therefore, in many applications analysts will be
faced with performing inference with non-conjugate statistical models with analyt-
ically intractable posterior distributions which can be identified only up to propor-
tionality through (4.2).

4.5 Posterior Summaries for Parametric Models

Given a known posterior density π(θ ) (4.2) obtained under an assumed parametric
model, a decision-maker might be interested in visualising or quantifying some lower-
dimensional summaries of this density; this can be particularly useful if the parameter
θ is multi-dimensional, perhaps with high dimension, meaning the density π(θ )
cannot simply be plotted.

4.5.1 Marginal Distributions

Suppose the parameter θ is a k-dimensional vector for k > 1, such that θ =


(θ1 , . . . , θk ), and consider a decision problem to predict the value of a single com-
ponent θ j from that vector. In such circumstances the (k − 1)-vector of remaining
parameters, denoted

θ− j = (θ1 , . . . , θ j−1 , θ j+1 , . . . , θk ), (4.8)

are often referred to as nuisance parameters.


Recall from Sect. 1.5.3 that in the Bayesian paradigm, prediction corresponds
to reporting one’s subjective probability distribution for that parameter. Predicting
the parameter component θ j corresponds to reporting the marginal posterior density
obtained from integrating out the nuisance parameters,

π(θ j ) = ∫_{Θ− j} π(θ ) dθ− j .

Exercise 4.5 Consider a bivariate target density function

π(θ ) = b^a θ1^a e^{−(b+θ2 )θ1 } 1_{[0,∞)²} (θ ) / Γ(a),

for θ = (θ1 , θ2 ) ∈ [0, ∞)² and constants a, b > 0. Calculate the marginal densities
of θ1 and θ2 . [Γ(z) = ∫_{0}^{∞} x^{z−1} e^{−x} dx.]

4.5.2 Credible Regions

Alternatively, it may be of interest to identify a representative interval or region in


which the parameter is believed to lie with some specified high probability, analogous
to a confidence interval or region in frequentist statistics.
In subjective probability, the corresponding notion is referred to as a credible
region.

Definition 4.3 (Credible regions) For 0 ≤ α ≤ 1, a 100α% credible region of a


parameter θ with probability distribution P is a subset Rα ⊆ Θ such that

P(θ ∈ Rα ) = α.

Remark 4.5 For a given probability distribution and coverage probability 0 < α <
1, infinitely many valid credible intervals may exist.

For summarising a (marginal) posterior density π(θ ) for a univariate continuous-


valued parameter θ , a simple procedure for identifying a 100α% credible interval
[θ_* , θ^*] for θ is to identify interval boundaries such that

∫_{−∞}^{θ_*} π(θ ) dθ = ∫_{θ^*}^{∞} π(θ ) dθ = (1 − α)/2.

This particular choice of interval is sometimes referred to as an equal-tailed credible


interval.
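Computationally, once the quantile function of the (marginal) posterior is available, an equal-tailed interval is just two quantile evaluations. A sketch using Python's statistics.NormalDist as a stand-in posterior (the Normal(2, 0.5) below is an arbitrary illustrative choice):

```python
from statistics import NormalDist

def equal_tailed_interval(dist, alpha):
    """Equal-tailed 100*alpha% credible interval: probability (1 - alpha)/2 in each tail."""
    tail = (1.0 - alpha) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Illustrative posterior: theta | x ~ Normal(2.0, 0.5)
posterior = NormalDist(mu=2.0, sigma=0.5)
lo, hi = equal_tailed_interval(posterior, 0.95)
print(round(lo, 3), round(hi, 3))  # 1.02 2.98
```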

Exercise 4.6 Let π(θ ) = λe−λθ 1[0,∞) (θ ) with λ > 0. For 0 ≤ α ≤ 1, calculate the
equal-tailed 100α% credible interval for θ .
Chapter 5
Computational Inference

5.1 Intractable Integrals in Bayesian Inference

In Sect. 1.5, estimation and prediction were presented as Bayesian decision problems.
Given a subjective probability distribution for an unknown quantity and a subjec-
tively chosen utility or loss function, the Bayes estimate was shown to be the value
which maximises expected utility or equivalently minimises expected loss. Obtain-
ing this estimate apparently requires two stages of calculation: obtaining an analytic
expression for the subjective probability distribution and then using this distribution
to calculate expectations.
In the first stage, suppose I assume exchangeability for observable random vari-
ables X 1 , . . . , X n and a parametric representation (2.3) with an unknown parameter
θ ∈ Θ. After observing values x1 , . . . , xn , my posterior distribution Π(θ ) can the-
oretically be obtained through updating my prior beliefs via Bayes’ theorem (4.3);
however, the denominator of (4.3) is the result of a definite integral which was noted
in Sect. 4.4 to be analytically intractable in many cases, leaving the posterior den-
sity only available up to an unknown constant of proportionality (4.2). Section 2.3.7
noted that almost any posterior distribution will asymptotically resemble a multi-
variate Gaussian, and in some large sample cases, this might provide an adequate
approximation to the normalised posterior if the necessary maximum likelihood esti-
mates can be calculated, but in general, these asymptotic arguments cannot be relied
upon.
In the second stage, taking a squared error loss function as an example, it is under-
stood from Sect. 1.5.2 that the Bayes estimate for θ under this loss function would
be the mean value with respect to my (updated) subjective probability distribution,
Π(θ ), and the required calculation therefore requires a second integral

Eπ (θ ) := ∫_{Θ} θ π(θ ) dθ, (5.1)

where π(θ ) = dΠ(θ )/ dθ is the corresponding density function.


Slightly more generally, I might wish to estimate a transformation g(θ). Under squared error loss, the Bayes estimate would be the expectation of g(θ) with respect to my current beliefs about θ,

E_π{g(θ)} := ∫_Θ g(θ) π(θ) dθ.    (5.2)

Like the denominator of (4.3) when calculating a posterior distribution, in general the integration required for calculating expectations (5.1) or (5.2) with respect to any target density π(θ) is also likely to be analytically intractable.
In summary, two sources of intractability have been identified:
1. Intractable posterior distribution calculation (4.3), where the normalising constant
for π(θ ) cannot be computed.
2. Intractable posterior expectation calculations (5.2), due to either the posterior π(θ )
not being calculable (Item 1), or the integral of g(θ ) × π(θ ) not being tractable.

5.2 Monte Carlo Estimation

Monte Carlo methods are approximate, sampling-based approaches for evaluating expectations. They exploit the fact that if M ≥ 1 random samples θ^(1), ..., θ^(M) can be obtained from the density π, then by linearity of expectation

E_π{ (1/M) ∑_{i=1}^M g(θ^(i)) } = E_π{g(θ)}.

Definition 5.1 (Monte Carlo estimate of an expectation) For samples θ^(1), ..., θ^(M) from π, the Monte Carlo estimate (MC) of E_π{g(θ)} (5.2) is

Ê_π{g(θ)} := (1/M) ∑_{i=1}^M g(θ^(i)).    (5.3)

Remark 5.1 MC methods are well suited to addressing the case of Item 2 from
Sect. 5.1, where a target density π might be fully known but the integrals required
for calculating expectations with respect to π are not tractable.

Remark 5.2 As indicated above, by linearity of expectation (5.3) is an unbiased


estimate of (5.2). By the strong law of large numbers, (5.3) converges to (5.2) almost
surely.
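As a minimal sketch of Definition 5.1 (the function names and the N(0, 1) example target below are hypothetical choices, not from the text), the estimate (5.3) is simply a sample average:

```python
import random

def mc_estimate(g, draws):
    # Monte Carlo estimate (5.3): the sample average of g over draws from pi.
    return sum(g(t) for t in draws) / len(draws)

# Example target pi = N(0, 1) with g(theta) = theta^2, so E_pi{g(theta)} = 1.
rng = random.Random(1)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
est = mc_estimate(lambda t: t * t, samples)
```

With M = 100,000 draws the estimate is typically within a few hundredths of the true value 1, consistent with the almost sure convergence noted in Remark 5.2.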

Exercise 5.1 (Monte Carlo probabilities) Suppose θ^(1), ..., θ^(M) are random samples from a density π(θ) over Θ. State the Monte Carlo estimate of P_π(θ ∈ A) for a region A ⊂ Θ.

Exercise 5.2 (Monte Carlo estimate of a conditional expectation) Suppose θ^(1), ..., θ^(M) are random samples from a density π(θ) over Θ, g(θ) is a transformation of interest and A ⊆ Θ. State a Monte Carlo estimate for the conditional expectation E_{π|A}(g(θ) | θ ∈ A).

Exercise 5.3 (Monte Carlo credible interval) For a univariate, real-valued parameter θ ∈ R, suppose θ^(1), ..., θ^(M) are random samples from a density π(θ) and θ_(1) ≤ ... ≤ θ_(M) are the corresponding order statistics. For 0 ≤ α ≤ 1, use the order statistics to state a Monte Carlo approximated 100α% credible region for θ (cf. Sect. 4.5.2).

5.2.1 Standard Error

The standard error of an estimator is the standard deviation of the sampling distribution of the estimate, or more generally an estimate of that standard deviation.

Definition 5.2 (Monte Carlo standard error) For independent samples θ^(1), ..., θ^(M) ∼ π(θ), the (estimated) standard error of the Monte Carlo estimate (5.3) is

s.e.{Ê_π{g(θ)}} := √( (1/(M(M−1))) ∑_{i=1}^M ( g(θ^(i)) − Ê_π{g(θ)} )² ).    (5.4)

Remark 5.3 (5.4) is useful for assessing convergence of the MC estimate (5.3) to (5.2). The standard error shrinks to zero at a rate proportional to 1/√M.
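A short sketch of the estimated standard error (5.4), with a hypothetical N(0, 1) target and g(θ) = θ², illustrating the 1/√M decay:

```python
import math
import random

def mc_standard_error(gvals):
    # Estimated standard error (5.4) of the Monte Carlo estimate.
    M = len(gvals)
    mean = sum(gvals) / M
    return math.sqrt(sum((v - mean) ** 2 for v in gvals) / (M * (M - 1)))

rng = random.Random(2)
g_small = [rng.gauss(0.0, 1.0) ** 2 for _ in range(1_000)]
g_large = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
se_small = mc_standard_error(g_small)
se_large = mc_standard_error(g_large)
# A 100-fold increase in M shrinks the standard error by roughly a factor of 10.
```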

5.2.2 Estimation Under a Loss Function

Suppose the quality of an estimate θ̂ of an unknown parameter θ is quantified by a loss function ℓ(θ̂, θ). From Exercise 1.7, under a squared loss function ℓ(θ̂, θ) = (θ̂ − θ)² the optimal Bayesian estimate is known to correspond to the posterior mean for θ; in this case, the Monte Carlo estimate (5.3) would be directly applicable as the Bayes estimate for θ.
More generally, for an arbitrary loss function, the Bayes estimate may not take such a convenient form. However, by the same principle of minimising expected loss with respect to π, the Bayes estimate can still be identified in principle via Monte Carlo sampling by first using (5.3) to evaluate the expected loss for any proposed estimate θ̂,

Ê_π{ℓ(θ̂, θ)} = (1/M) ∑_{i=1}^M ℓ(θ̂, θ^(i)).

Second, the Bayes estimate is the value θ̂ ∈ Θ which minimises the (estimated) expected loss,

arg min_{θ̂ ∈ Θ} Ê_π{ℓ(θ̂, θ)},

which will typically need to be identified through numerical optimisation.
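The two steps above can be sketched as follows; the absolute-error loss is a hypothetical example chosen because its Bayes estimate is known to be the posterior median, giving a check on the grid search:

```python
import random

def bayes_estimate(loss, draws, grid):
    # Minimise the Monte Carlo expected loss over a grid of candidate estimates.
    def expected_loss(t_hat):
        return sum(loss(t_hat, t) for t in draws) / len(draws)
    return min(grid, key=expected_loss)

# Draws from a hypothetical target pi = N(0, 1); under absolute-error loss
# the Bayes estimate is the posterior median, here approximately 0.
rng = random.Random(3)
draws = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
grid = [i / 100 for i in range(-300, 301)]
est = bayes_estimate(lambda a, b: abs(a - b), draws, grid)
```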

Exercise 5.4 (Monte Carlo optimal decision estimation) Suppose just three samples θ^(1) = 2, θ^(2) = 5, θ^(3) = 11 are obtained from a target density π(θ) describing uncertainty about an unknown parameter θ. Assuming a Gaussian kernel loss function

ℓ(θ̂, θ) = −exp{−(θ̂ − θ)²/10},

plot the Monte Carlo expected loss function Ê_π{ℓ(θ̂, θ)} for θ̂ over the interval [0, 12] and numerically evaluate an approximate Bayes estimate of θ.

5.2.3 Importance Sampling

Sometimes, it may not be possible or convenient to draw random samples directly from π(θ) in order to calculate a Monte Carlo estimate, even when the density is fully known. Importance sampling generalises Monte Carlo estimation by supposing instead that samples θ^(1), ..., θ^(M) can be drawn from some other density h(θ); a weighted average of the corresponding values g(θ^(1)), ..., g(θ^(M)) is then taken to approximate (5.2), where the weights are chosen to precisely counterbalance the discrepancy between the sampling density h and the target density π.
For any density h(θ) which dominates π(θ), in the sense π(θ) > 0 =⇒ h(θ) > 0, the expectation (5.2) can be rewritten as

E_π{g(θ)} = ∫_Θ (g(θ)π(θ)/h(θ)) h(θ) dθ = E_h{ g(θ)π(θ)/h(θ) }.    (5.5)

Defining a so-called importance function as the ratio of the two densities, w(θ) = π(θ)/h(θ), the identity (5.5) implies

E_π{g(θ)} = E_h{w(θ)g(θ)}    (5.6)

for any dominating density h, thereby expressing a general expectation with respect to π as a different expectation with respect to h. It immediately follows that a Monte Carlo approximation of (5.2) can be obtained using samples θ^(1), ..., θ^(M) drawn from h.

Definition 5.3 (Importance sampling) For samples θ^(1), ..., θ^(M) from h, the importance sampling Monte Carlo estimate of E_π{g(θ)} (5.2), or equivalently (5.6), is

Ê_π^IS{g(θ)} := (1/M) ∑_{i=1}^M w_i g(θ^(i)),    (5.7)

where w_i = w(θ^(i)) = π(θ^(i))/h(θ^(i)) are the importance weights.

Remark 5.4 Importance sampling Monte Carlo estimation with respect to π is equivalent to Monte Carlo estimation with respect to h,

Ê_π^IS{g(θ)} = Ê_h{w(θ)g(θ)}.    (5.8)

Exercise 5.5 (Importance sampling Monte Carlo standard error) For independent samples θ^(1), ..., θ^(M) ∼ h(θ), state a formula for the standard error of the importance sampling Monte Carlo estimate Ê_π^IS{g(θ)} from (5.7).

Remark 5.5 The rate at which the importance sampling standard error from Exercise 5.5 shrinks to zero and the estimate (5.7) converges to the true value depends upon the functional ratio π/h. Good convergence can be obtained when h approximates π well, ideally with somewhat heavier tails (Amaral Turkman et al. 2019).
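The estimate (5.7) can be sketched directly; the target, sampling density and g below are hypothetical choices for which the true answer is known:

```python
import math
import random

def norm_pdf(x, sd):
    # Density of a zero-mean normal distribution with standard deviation sd.
    return math.exp(-x * x / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def importance_sampling_estimate(g, pi, h, h_sampler, M=100_000, seed=4):
    # Importance sampling estimate (5.7): average w(theta) g(theta) over h-samples.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        t = h_sampler(rng)
        total += (pi(t) / h(t)) * g(t)
    return total / M

# Target pi = N(0, 1), heavier-tailed sampling density h = N(0, 2^2),
# g(theta) = theta^2: the true expectation is 1.
est = importance_sampling_estimate(
    g=lambda t: t * t,
    pi=lambda t: norm_pdf(t, 1.0),
    h=lambda t: norm_pdf(t, 2.0),
    h_sampler=lambda rng: rng.gauss(0.0, 2.0),
)
```

Here h dominates π and has heavier tails, the situation Remark 5.5 identifies as favourable.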

5.2.4 Normalising Constant Estimation

Suppose π(θ) is known only up to proportionality by π(θ) ∝ γ(θ) for some known function γ, such that

π(θ) = γ(θ)/γ∗,    (5.9)

where γ∗ = ∫_Θ γ(θ) dθ.

Proposition 5.1 Let h(θ) be a known density which dominates π(θ), such that h can be easily sampled from, and let θ^(1), ..., θ^(M) be a random sample drawn from h. Then an importance sampling Monte Carlo estimate for the normalising constant γ∗ is

γ̂∗ = (1/M) ∑_{i=1}^M γ(θ^(i))/h(θ^(i)).    (5.10)

Proof This follows immediately from the identity (5.7), taking g ≡ 1 with the unnormalised weights γ(θ^(i))/h(θ^(i)).



5.2.4.1 Marginal Likelihood Estimation in Bayesian Inference

A simple application of estimating normalising constants occurs frequently within Bayesian inference, where the target density π(θ) is a posterior distribution known up to proportionality (4.2) by the product of two known functions, the likelihood and the prior,

γ(θ) = p(x | θ) p(θ).

The unknown normalising constant of (5.9) in this case is the marginal likelihood, γ∗ = p(x).
In the simplest implementation, the prior p(θ) could be used as the sampling density; given prior samples θ^(1), ..., θ^(M), the Monte Carlo estimate (5.10) of the normalising constant is

p̂(x) = (1/M) ∑_{i=1}^M p(x | θ^(i)).    (5.11)

Although sampling from the prior leads to a simplified equation for Monte Carlo
estimation of the marginal likelihood, the standard error of (5.11) can be large if the
likelihood is calculated on a large sample x which strongly outweighs the effects of the
prior (cf. Sect. 2.3.6). As noted above, low variance estimates can be obtained when
the sampling density closely resembles the target. Therefore, in large sample cases,
a better importance sampling density could be the asymptotic normal distribution
approximation of a posterior from Sect. 2.3.7.
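A quick conjugate check of (5.11), using a hypothetical data set for which the marginal likelihood is available exactly: s = 3 successes in n = 5 Bernoulli trials under a Uniform(0, 1) prior, where p(x) = 3! 2!/6! = 1/60:

```python
import random

def marginal_likelihood_mc(likelihood, prior_sampler, M=500_000, seed=5):
    # Estimate p(x) by (5.11): the average likelihood over prior draws.
    rng = random.Random(seed)
    return sum(likelihood(prior_sampler(rng)) for _ in range(M)) / M

n, s = 5, 3
est = marginal_likelihood_mc(
    likelihood=lambda p: p**s * (1 - p) ** (n - s),
    prior_sampler=lambda rng: rng.random(),  # Uniform(0, 1) prior draws
)
# The exact Beta-Bernoulli marginal likelihood here is 1/60.
```

With a larger sample x the likelihood values would become far more variable under the prior, illustrating why the standard error of (5.11) can then be large.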

5.3 Markov Chain Monte Carlo

If sampling directly from a particular target distribution Π(θ) (for the purpose of performing Monte Carlo integration) does not seem possible, and when it is not clear how to identify a suitable importance sampling density (cf. Sect. 5.2.3), Markov chain Monte Carlo (MCMC) methods provide a general solution for obtaining approximate samples from any target density. Conceptually, the idea is straightforward: a discrete-time homogeneous Markov chain of parameter values θ^(1), θ^(2), ... is sampled according to a transition probability density function p(θ^(i+1) | θ^(i)), chosen such that the limiting (stationary) distribution of the parameter value sequence has density π(θ).

5.3.1 Technical Requirements of Markov Chains in MCMC

The following concepts of irreducibility, reversibility and stationarity are key to MCMC methods, described in more detail in Roberts and Rosenthal (2004). It should be supposed that an initial value θ^(0) is drawn from an initial probability distribution (possibly a point mass at some particular value) and then subsequent values θ^(1), θ^(2), ... are drawn from the transition density p(θ^(i+1) | θ^(i)).

Definition 5.4 (n-step transition probability distribution) For A ⊆ Θ and n ≥ 1, the n-step transition probability distribution, P^n, is the distribution of the state θ^(n) after n iterations of the Markov chain starting from θ^(0) ∈ Θ,

P^n(A | θ^(0)) := ∫_{θ^(n) ∈ A} ∫_{(θ^(1), ..., θ^(n−1)) ∈ Θ^{n−1}} p(θ^(1) | θ^(0)) ⋯ p(θ^(n) | θ^(n−1)) dθ^(1) ⋯ dθ^(n).

Definition 5.5 (π-irreducible Markov chain) A Markov chain with transition density p(θ^(i+1) | θ^(i)) is said to be π-irreducible if for each Π-measurable set A ⊂ Θ with Π(A) > 0 and for each θ ∈ Θ, there exists n > 0 such that P^n(A | θ) > 0.

Remark 5.6 Informally, a π-irreducible Markov chain can eventually reach any neighbourhood of Θ where the target distribution has positive probability.

Definition 5.6 (Aperiodic Markov chain) A Markov chain with transition density p(θ^(i+1) | θ^(i)) is said to be aperiodic if, for each initial value θ^(0) ∈ Θ and each Π-measurable set A ⊂ Θ with Π(A) > 0, {n | P^n(A | θ^(0)) > 0} has greatest common divisor equal to 1.

Remark 5.7 Informally, an aperiodic Markov chain does not have a cyclic pattern
to how it can arrive at different states.

Definition 5.7 (π-reversible Markov chain) A Markov chain transition density p(θ^(i+1) | θ^(i)) is said to be π-reversible if and only if

π(θ) p(θ′ | θ) = π(θ′) p(θ | θ′).    (5.12)

Remark 5.8 The condition (5.12) required for reversibility is sometimes referred to as detailed balance.

Definition 5.8 (Stationary distribution) The density π(θ) is said to be a stationary distribution for the transition density p(θ′ | θ) if and only if

π(θ′) = ∫_Θ π(θ) p(θ′ | θ) dθ.

Proposition 5.2 If the transition density p(θ^(i+1) | θ^(i)) of a Markov chain satisfies detailed balance (is reversible) with respect to π(θ), then π(θ) is a stationary distribution.

Proof
∫_Θ π(θ) p(θ′ | θ) dθ = ∫_Θ π(θ′) p(θ | θ′) dθ = π(θ′).

Remark 5.9 A consequence of Proposition 5.2 is that if an aperiodic Markov chain


can be constructed which is irreducible and reversible with respect to a target density
π , then samples from that Markov chain would eventually converge to be (dependent)
samples from π .
In MCMC methods, a large number of samples are obtained from an aperiodic, π-irreducible, π-reversible Markov chain, perhaps discarding some initial burn-in samples before the chain is deemed to have sufficiently converged towards the target. The retained samples are treated as an approximate sample from π for the purposes of Monte Carlo estimation (Sect. 5.2). (The standard error formula of Definition 5.2 will not apply, even approximately, for MCMC samples, since it was based on an assumption of independent samples.)
The next two sections introduce the most commonly used mechanisms for constructing a π-irreducible, π-reversible Markov chain required for MCMC: Gibbs sampling and the Metropolis-Hastings algorithm.

5.3.2 Gibbs Sampling

Suppose θ = (θ_1, ..., θ_k) is a k-vector of parameters with k > 1. Then, for 1 ≤ j ≤ k, following (4.8) let θ_{−j} denote the (k−1)-vector comprising the entries of θ with the jth component removed.
Gibbs sampling operates by selecting an index j (either randomly, or through a deterministic cycle) and sampling a new value for the component θ_j from the full conditional distribution

π(θ_j | θ_{−j}) := π(θ)/π(θ_{−j}),    (5.13)

where

π(θ_{−j}) := ∫_{Θ_j} π(θ) dθ_j

is the marginal density for θ_{−j}.


Proposition 5.3 A Markov chain with transition density

p(θ′ | θ) = 1_{θ_{−j}}(θ′_{−j}) π(θ′_j | θ_{−j})    (5.14)

is π(θ)-reversible.

Proof Since θ′_{−j} = θ_{−j} with probability 1 under (5.14), then for all such θ, θ′,

π(θ′)/π(θ) = π(θ′_j | θ_{−j})/π(θ_j | θ_{−j}) = p(θ′ | θ)/p(θ | θ′),

where the first equality derives from (5.13) and the second from (5.14).

Remark 5.10 Since the full conditional distributions are each π -reversible, a
Markov chain which updates θ by successively sampling new component values
from the full conditionals has stationary distribution π .

A cyclic implementation of the Gibbs sampling algorithm for obtaining approximate


samples from π proceeds according to Algorithm 1.

Algorithm 1: Gibbs sampling
Result: M approximate samples from π(θ)
1 Initialisation: Draw θ^(0) ∈ Θ from an initial distribution;
2 for i ← 1 to M do
3   Set θ^(i) = θ^(i−1);
4   for j ← 1 to k do
5     Draw θ^(i)_j ∼ π(θ_j | θ^(i)_{−j}) (see (5.13));
6   end
7 end

Gibbs sampling can be particularly convenient within certain classes of Bayesian hierarchical models (cf. Sect. 3.3); in such cases, the full conditional distributions can have tractable forms due to the hierarchical parameterisation. However, Gibbs sampling should be used with caution, particularly with high-dimensional (large k) models with strong dependencies between variables; in such cases, the variances of the individual parameter full conditional distributions can become relatively small. Consequently, the sampler can fail to traverse multimodal target distributions, instead becoming stuck in local modes.
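A sketch of Algorithm 1 for a hypothetical bivariate normal target with unit variances and correlation ρ, whose full conditionals are exactly θ_j | θ_{−j} ∼ N(ρθ_{−j}, 1 − ρ²):

```python
import random

def gibbs_bivariate_normal(rho, M=20_000, seed=6):
    # Cyclic Gibbs sampling (Algorithm 1) using the exact full conditionals.
    rng = random.Random(seed)
    sd = (1 - rho * rho) ** 0.5
    t1, t2 = 0.0, 0.0
    draws = []
    for _ in range(M):
        t1 = rng.gauss(rho * t2, sd)  # draw theta_1 | theta_2
        t2 = rng.gauss(rho * t1, sd)  # draw theta_2 | theta_1
        draws.append((t1, t2))
    return draws

draws = gibbs_bivariate_normal(0.5)
M = len(draws)
mean1 = sum(a for a, _ in draws) / M
mean2 = sum(b for _, b in draws) / M
corr = sum(a * b for a, b in draws) / M  # zero means, unit variances
```

With moderate correlation the sampler mixes well; as the dependence between the components strengthens, the conditional variances 1 − ρ² shrink and the moves become increasingly local, the failure mode described above.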

Exercise 5.6 (Gibbs sampling) Consider a mixture target distribution for θ = (θ_1, θ_2) where θ_1, θ_2 are independent, identically normally distributed random variables with variance 1 and mean which is equal to μ with probability 1/2 and equal to −μ otherwise, for some value μ > 0.
The target density is depicted in Fig. 5.1 for two different values of the mean parameter, μ = 1 and μ = 3.
(i) State the target density π(θ_1, θ_2) in terms of the standard normal density φ(z) = e^{−z²/2}/√(2π).
(ii) Calculate the full conditional densities π(θ_1 | θ_2) and π(θ_2 | θ_1).
(iii) Show that Gibbs sampling will become less likely to move between the two local modes as μ increases.

Exercise 5.7 (Gibbs sampling implementation) Implement M = 100 iterations of Gibbs sampling (Algorithm 1) for the target distribution from Exercise 5.6, for the two cases (i) μ = 1 and (ii) μ = 3 depicted in Fig. 5.1. For each case, plot the trace of sampled values θ^(1), ..., θ^(M) ∈ R² to demonstrate the mixing of the Markov chain.

[Fig. 5.1 Mixture density of two bivariate normal distributions with identity covariance matrix and means (μ, μ) and (−μ, −μ); left panel μ = 1, right panel μ = 3]

5.3.3 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm provides a more general framework for constructing π-reversible Markov chains.
Let q(θ′ | θ) be the transition density of any irreducible Markov chain on Θ. Then the Metropolis-Hastings algorithm modifies the dynamics of this Markov chain by only accepting the moves proposed by q with probability

α(θ, θ′) = min{ 1, (π(θ′) q(θ | θ′)) / (π(θ) q(θ′ | θ)) }    (5.15)

and otherwise keeping the chain in its current state. The full algorithm is stated in Algorithm 2.

Algorithm 2: The Metropolis-Hastings algorithm
Result: M approximate samples from π(θ)
1 Initialisation: Draw θ^(0) ∈ Θ from an initial distribution;
2 for i ← 1 to M do
3   Draw θ′ ∼ q(θ′ | θ^(i−1));
4   Draw u ∼ Uniform(0, 1);
5   if u < α(θ^(i−1), θ′) (see (5.15)) then
6     θ^(i) = θ′;
7   else
8     θ^(i) = θ^(i−1);
9   end
10 end

Proposition 5.4 The Markov chain transition density

p(θ′ | θ) = α(θ, θ′) q(θ′ | θ) + ( 1 − ∫_Θ α(θ, θ̃) q(θ̃ | θ) dθ̃ ) 1_θ(θ′)    (5.16)

implied by the Metropolis-Hastings algorithm is π-reversible.

Exercise 5.8 (Detailed balance of Metropolis-Hastings algorithm) Prove Proposition 5.4 by checking the detailed balance equation (5.12) for the transition density (5.16), considering separately the two cases θ′ = θ and θ′ ≠ θ.

Remark 5.11 Since the transition density (5.16) is π-reversible, the Markov chain obtained from the Metropolis-Hastings algorithm has stationary distribution π.

Remark 5.12 The target density π only enters Algorithm 2 through the ratio π(θ′)/π(θ) in the acceptance probability (5.15); consequently, π (and also the proposal density q) only needs to be known up to proportionality to utilise the Metropolis-Hastings algorithm. This is a very useful property in Bayesian inference, where it has earlier been noted in Sect. 4.4 that a target posterior distribution can often only be identified up to an unknown normalising constant.

Remark 5.13 As with importance sampling (cf. Sect. 5.2.3), convergence of the Metropolis-Hastings algorithm depends upon the choice of the proposal density q, with good performance achieved when q closely resembles the target density π. The extreme case where q(θ′ | θ) = π(θ′) would lead to a sequence of independent samples drawn directly from π (all accepted with probability 1), and the algorithm reverts to straightforward Monte Carlo sampling (cf. Sect. 5.2).

Exercise 5.9 (Gibbs sampling as a Metropolis-Hastings special case) Show that Gibbs sampling (Sect. 5.3.2) is a special case of the Metropolis-Hastings algorithm with proposal density

q(θ′ | θ) = 1_{θ_{−j}}(θ′_{−j}) π(θ′_j | θ_{−j})

for updating the jth component of θ.

5.3.3.1 Random Walk

The most common implementations of the Metropolis-Hastings algorithm propose new values of θ using local moves generated by a simple random walk with a symmetric proposal density, such that q(θ′ | θ) = q(θ | θ′). Under this symmetry, the Metropolis-Hastings acceptance probability (5.15) conveniently simplifies to the posterior ratio

α(θ, θ′) = min{ 1, π(θ′)/π(θ) }.

For example, in a univariate setting, commonly used symmetric proposals for local moves include

q(θ′ | θ) ∝ exp(−(θ′ − θ)²/(2ε))

for a symmetric Gaussian proposal, or

q(θ′ | θ) ∝ 1_{(θ−ε, θ+ε)}(θ′)/(2ε)

for a symmetric uniform proposal. In either case, the parameter ε > 0 can be tuned to influence the acceptance rate of the proposed moves; as ε → 0, the acceptance rate tends to 1, but at the expense of proper exploration of Θ. In practice, different values of ε can be explored to get a good trade-off between exploration and acceptance, with published research (Roberts et al. 1997) suggesting an acceptance ratio of 0.234 can optimise the efficiency of the algorithm under some quite general conditions. The consequent advice from the authors is to "tune the proposal variance so that the average acceptance rate is roughly 1/4".
Whilst a random walk Metropolis-Hastings algorithm avoids the difficulty of finding a proposal density that globally matches the target, these methods can sometimes perform poorly in practice by being slow to explore the parameter space, getting stuck in local modes of the target density. This phenomenon is sometimes described as poor mixing.
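A minimal random walk Metropolis sketch (a hypothetical univariate example; the target is the standard normal, supplied only through its unnormalised log density, as Remark 5.12 permits):

```python
import math
import random

def rw_metropolis(log_target, theta0, eps, M, seed=7):
    # Random walk Metropolis: symmetric Gaussian proposals, so the
    # acceptance probability reduces to min(1, pi(theta') / pi(theta)).
    rng = random.Random(seed)
    theta, lp = theta0, log_target(theta0)
    chain = []
    for _ in range(M):
        prop = theta + rng.gauss(0.0, eps)
        lp_prop = log_target(prop)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain

# Target known only up to proportionality: pi(theta) proportional to exp(-theta^2 / 2).
chain = rw_metropolis(lambda t: -t * t / 2, theta0=0.0, eps=2.4, M=20_000)
kept = chain[2_000:]  # discard burn-in samples
mean = sum(kept) / len(kept)
var = sum((t - mean) ** 2 for t in kept) / len(kept)
```

The step size eps = 2.4 is a hypothetical tuning choice in the spirit of the Roberts et al. (1997) advice; smaller values accept more often but explore less.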

Exercise 5.10 (Metropolis-Hastings implementation) Using a bivariate Gaussian proposal density

q(θ′ | θ) ∝ exp{−(θ′ − θ)⊤(θ′ − θ)/8},

implement M = 100 iterations of the Metropolis-Hastings algorithm for the target distribution from Exercise 5.6. Address the two cases (i) μ = 1 and (ii) μ = 3 depicted in Fig. 5.1. For each case, plot the trace of sampled values θ^(1), ..., θ^(M) ∈ R² to demonstrate the mixing of the Markov chain.

5.4 Hamiltonian Markov Chain Monte Carlo

A more sophisticated implementation of the Metropolis-Hastings algorithm which can avoid the low acceptance rates or poor exploration of simplistic random walks is to generate proposals using dynamics inspired from Hamiltonian mechanics. The resulting algorithms are referred to as Hamiltonian Monte Carlo (HMC) methods (Neal 2011; Betancourt 2017).
To begin the mechanical analogy, the parameter θ of the target density π(θ) is first imagined to be the location of a body (typically a small ball) in a frictionless dynamical system. Second, the target density is augmented with a second parameter vector p, which acts as the momentum of the body. For elegance and simplicity, prior beliefs about the synthetic variable p are usually assumed to be described by a standard multivariate normal distribution that is statistically independent of θ; the joint density can then be written up to proportionality as

π̃(θ, p) ∝ π(θ) exp(−p⊤p/2).

The negative logarithm of the augmented density (ignoring normalising constants) is assumed to correspond to the total energy of the body in this dynamical system, referred to as the Hamiltonian,

H(θ, p) := −log π(θ) + p⊤p/2.    (5.17)

Continuing the mechanics analogy, the first term of (5.17) corresponds to the potential energy held by the body, proportional to the height of the body on a surface which has contours of −log π(θ) at each location θ; and the second term corresponds to the kinetic energy of the body, proportional to the squared momentum, p⊤p. The lowest point on the surface, corresponding to minimal potential energy and therefore maximal kinetic energy, is the mode of the target density π(θ).
Returning to the Metropolis-Hastings algorithm, to propose new values in Θ from a current position, denoted here as θ(0), the idea is to consider a trajectory of the body through time after applying some momentum. Let θ(t) be the location of the body at time t, and p(t) the corresponding momentum. The principle of conservation of energy implies that when the extended target density is interpreted as the Hamiltonian of a closed dynamical system, the dynamics of that system should require H(θ, p) to be preserved. This leads to the Hamiltonian equations for the system:

dθ(t)/dt = ∂H/∂p,    dp(t)/dt = −∂H/∂θ.

Evolving the extended parameters θ(t), p(t) according to these equations would keep (5.17) constant, which corresponds to the body travelling along contours of the extended target density π̃. Therefore, proposing new θ values in approximate accordance with these dynamics can lead to proposals which are far away from the starting (previous) value but have similar target density π(θ), leading to good exploration and high acceptance rates.
In practice, the Hamiltonian dynamics are numerically approximated at interleaved time points using leapfrog integration; for an incremental time step ε > 0,

p_j(t + ε/2) = p_j(t) + (ε/2) ∂log π(θ(t))/∂θ_j,
θ_j(t + ε) = θ_j(t) + ε p_j(t + ε/2),
p_j(t + ε) = p_j(t + ε/2) + (ε/2) ∂log π(θ(t + ε))/∂θ_j.    (5.18)

The partial derivatives of the target density are required in (5.18), implying the technique is only appropriate for continuous-valued parameters.
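The leapfrog updates (5.18) can be sketched for a hypothetical univariate standard normal target, where ∂log π/∂θ = −θ and H(θ, p) = θ²/2 + p²/2 up to a constant; the discretisation keeps the Hamiltonian nearly constant along the trajectory:

```python
def leapfrog(theta, p, grad_log_pi, eps, L):
    # L leapfrog steps of (5.18) for a univariate parameter.
    for _ in range(L):
        p += (eps / 2) * grad_log_pi(theta)
        theta += eps * p
        p += (eps / 2) * grad_log_pi(theta)
    return theta, p

def hamiltonian(theta, p):
    # H (5.17) for the standard normal target, up to an additive constant.
    return theta * theta / 2 + p * p / 2

theta0, p0 = 1.0, 0.5
theta1, p1 = leapfrog(theta0, p0, grad_log_pi=lambda t: -t, eps=0.1, L=20)
energy_drift = abs(hamiltonian(theta1, p1) - hamiltonian(theta0, p0))
```

The position moves a substantial distance while the energy drift stays small, which is what yields distant proposals with high acceptance probability.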
The Hamiltonian MCMC algorithm follows the Metropolis-Hastings algorithm (Algorithm 2), with proposal density q(θ′ | θ^(i−1)) derived from a sampling procedure of first obtaining a new starting momentum p(0) from the standard multivariate normal, and second evolving the Hamiltonian dynamics implied by this initial momentum via the leapfrog algorithm for some number of time steps L > 0, beginning at θ(0) = θ^(i−1). The algorithm for this proposal mechanism is given in Algorithm 3.

Algorithm 3: Hamiltonian Monte Carlo sampling
Result: A Metropolis-Hastings algorithm proposal θ′ | θ^(i−1)
1 Initialisation: Set θ(0) = θ^(i−1);
2 for j ← 1 to k do
3   Draw p_j(0) ∼ N(0, 1);
4 end
5 for l ← 1 to L do
6   Update p((l−1)ε), θ((l−1)ε) → p(lε), θ(lε) via (5.18);
7 end
8 Set proposal θ′ = θ(Lε).
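Putting the Algorithm 3 proposal inside Algorithm 2 gives the following univariate sketch (the standard normal target is again a hypothetical example; with the standard normal momentum, the accept step uses min{1, exp(H(θ, p) − H(θ′, p′))}):

```python
import math
import random

def hmc(log_pi, grad_log_pi, theta0, eps, L, M, seed=8):
    # Hamiltonian Monte Carlo for a univariate target with N(0, 1) momentum.
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(M):
        p = rng.gauss(0.0, 1.0)                # fresh starting momentum
        t_new, p_new = theta, p
        for _ in range(L):                     # leapfrog steps (5.18)
            p_new += (eps / 2) * grad_log_pi(t_new)
            t_new += eps * p_new
            p_new += (eps / 2) * grad_log_pi(t_new)
        h_old = -log_pi(theta) + p * p / 2     # Hamiltonian (5.17)
        h_new = -log_pi(t_new) + p_new * p_new / 2
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta = t_new
        chain.append(theta)
    return chain

chain = hmc(lambda t: -t * t / 2, lambda t: -t, theta0=0.0, eps=0.3, L=10, M=5_000)
mean = sum(chain) / len(chain)
var = sum((t - mean) ** 2 for t in chain) / len(chain)
```

The step size ε and trajectory length L are hypothetical tuning choices; because the leapfrog scheme nearly conserves (5.17), acceptance rates remain high even for long trajectories.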

5.5 Analytic Approximations

Markov chain Monte Carlo methods provide a general-purpose solution to computational Bayesian inference, providing estimates of arbitrarily high accuracy for any inference problem given sufficiently many iterations. However, in high-dimensional applications the time until reaching suitable convergence can be prohibitively long; in these circumstances, analytic approximate solutions have become increasingly popular. These approximations trade off the theoretical convergence guarantees of simulation-based inference methods for much faster inference procedures.
In this section, suppose the target density π(θ) corresponds to a posterior distribution density for a k-vector of parameters θ = (θ_1, ..., θ_k) after observing n likelihood samples x = (x_1, ..., x_n), such that π is known up to proportionality by (4.2).

5.5.1 Normal Approximation

Recall Proposition 2.2 from Sect. 2.3.7, which stated that for increasing sample sizes almost every target posterior distribution approaches an asymptotic normal distribution,

π(θ) → Normal_k(θ | m_n, H_n^{−1})    (5.19)

as n → ∞, where m_n (2.9) and H_n (2.8) are respectively the posterior mode and information matrix. For approximate inference, this large sample property (5.19) can be exploited in several ways.
Most straightforwardly, the approximated normal distribution density could be directly substituted in place of the true target density π(θ), for example if this simplifies an expectation calculation (5.2). However, lower error approximations can be obtained using a so-called Laplace approximation.

5.5.2 Laplace Approximations

Combining (5.2) with the expression for the posterior distribution (4.3) obtained from Bayes' theorem, it follows that a posterior expectation for a function of interest g(θ) can be expressed as a ratio of two integrals,

E{g(θ) | x} = ∫_Θ g(θ) p(x | θ) p(θ) dθ / ∫_Θ p(x | θ) p(θ) dθ.    (5.20)

A Laplace approximation (Tierney and Kadane 1986) for (5.20) assumes a normal approximation to both the denominator and the numerator of this ratio. In general, the Laplace method of integration uses a second-order application of Taylor's theorem to approximate positive function integrands with normal distribution densities: let θ∗ be the global maximum of a twice-differentiable function h(·), and H(·) the Hessian matrix of h(·); then

h(θ) ≈ h(θ∗) + (1/2)(θ − θ∗)⊤ H(θ∗) (θ − θ∗)

=⇒ ∫ e^{h(θ)} dθ ≈ e^{h(θ∗)} (2π)^{k/2} / |−H(θ∗)|^{1/2},    (5.21)

by comparison with the density of a normal distribution with mean vector θ∗ and covariance matrix (−H(θ∗))^{−1}. To apply Laplace's method to (5.20), it must be supposed that the function of interest g(θ) is positive almost everywhere. For the logarithms of the integrands in the denominator and numerator of (5.20), define

h(θ) = log p(x | θ) + log p(θ),
h̃(θ) = log g(θ) + log p(x | θ) + log p(θ).

The mode and Hessian of h are the posterior density mode and information matrix (m_n, H_n). Denoting the corresponding mode and Hessian of h̃ by (m̃_n, H̃_n), the Laplace approximation of (5.20) by application of (5.21) is

E{g(θ) | x} = ∫_Θ e^{h̃(θ)} dθ / ∫_Θ e^{h(θ)} dθ ≈ ( |H_n|^{1/2} g(m̃_n) p_{x|θ}(x | m̃_n) p_θ(m̃_n) ) / ( |H̃_n|^{1/2} p_{x|θ}(x | m_n) p_θ(m_n) ).    (5.22)

Remark 5.14 The Laplace approximation (5.22) is not invariant to transformations


of the parameterisation θ . At least in principle, improved approximations can be
achieved by using alternative parameterisations such that the resulting integrands in
(5.20) more closely resemble normal distribution densities.
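Laplace's method (5.21) can be checked in one dimension against a known integral; the Gamma function example below is a hypothetical illustration, not from the text: with h(θ) = (a − 1) log θ − θ, ∫₀^∞ e^{h(θ)} dθ = Γ(a), the mode is θ∗ = a − 1 and h″(θ∗) = −1/(a − 1):

```python
import math

def laplace_integral_1d(h, h2, theta_star):
    # One-dimensional Laplace approximation (5.21):
    # exp(h(theta*)) * sqrt(2 pi / -h''(theta*)).
    return math.exp(h(theta_star)) * math.sqrt(2 * math.pi / -h2(theta_star))

a = 10
approx = laplace_integral_1d(
    h=lambda t: (a - 1) * math.log(t) - t,
    h2=lambda t: -(a - 1) / (t * t),
    theta_star=a - 1.0,
)
exact = math.factorial(a - 1)  # Gamma(10) = 9! = 362880
rel_err = abs(approx - exact) / exact
```

The approximation here recovers Stirling's formula and is within about 1% of the exact value, the kind of relative error that largely cancels in the ratio (5.22).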

5.5.2.1 Approximating Marginal Distributions

Given a partition of the parameter k-vector θ = (φ, ψ), such that φ is a k′-vector with 1 ≤ k′ < k, a Laplace approximation can be used to approximate marginal distributions from the target density π(θ) ≡ π(φ, ψ),

π(φ) = ∫ π(φ, ψ) dψ.    (5.23)

For a fixed value of φ, define

h̃_φ(ψ) = log p_{x|θ}(x | φ, ψ) + log p_θ(φ, ψ).

Let (ψ̃_{n,φ}, H̃_{n,φ}) be the mode and Hessian of h̃_φ(ψ), again conditioning on the fixed value φ. Then using (5.21) in a similar way to deriving (5.22), a Laplace approximation for the marginal density (5.23) is

π(φ) = ∫ e^{h̃_φ(ψ)} dψ / ∫_Θ e^{h(θ)} dθ ≈ ( |−H_n|^{1/2} p_{x|θ}(x | φ, ψ̃_{n,φ}) p_θ(φ, ψ̃_{n,φ}) ) / ( |−H̃_{n,φ}|^{1/2} p_{x|θ}(x | m_n) p_θ(m_n) (2π)^{k′/2} ).    (5.24)

5.5.2.2 Integrated Nested Laplace Approximation

The integrated nested Laplace approximation (INLA), introduced by Rue et al. (2009), provides a particularly useful implementation of Laplace approximations for an important model class known as latent Gaussian models (LGMs). An LGM has a likelihood function which assumes conditional independence given unobserved parameters θ and hyperparameters φ, such that the prior distribution for θ is a Gaussian Markov random field (GMRF, cf. Exercise 3.11). Accordingly,

x_i ∼ p(x_i | θ, φ), i = 1, ..., n,
θ | φ ∼ Normal_k(0, Σ(φ)),
φ ∼ p(φ),

where Σ(φ) is a non-singular covariance matrix which can depend upon the hyperparameter φ and whose inverse contains zeros according to the GMRF model. The normal distribution prior for θ makes models in this class well-suited to Laplace approximations. Noting the posterior density can be expressed as

π(θ, φ | x) ∝ p(φ) |Σ(φ)|^{−1/2} exp( −(1/2) θ⊤Σ^{−1}(φ)θ + ∑_{i=1}^n log p(x_i | θ, φ) ),

the INLA approach combines multiple Laplace approximations for conditional distributions involving θ with numerical integration techniques for φ, and can therefore enable inference for problems with a very high dimensional θ parameter, provided φ has low dimension. In particular, using (5.24) the marginal posterior density for φ is approximated by

π̂(φ | x) ∝ ( p(φ) |Σ(φ)|^{−1/2} exp( −(1/2) θ̃_φ⊤Σ^{−1}(φ)θ̃_φ + ∑_{i=1}^n log p(x_i | θ̃_φ, φ) ) ) / π̂(θ̃_φ | φ, x),

where π̂(θ | φ, x) is the normal approximation to the corresponding full conditional distribution and θ̃_φ is the constrained mode of that full conditional density for the fixed value φ.
Full details of the INLA method are beyond the scope of this text, but can be found in Rue et al. (2009). An open source implementation of the method is freely available, written in the statistical language R (https://www.r-project.org), called R-INLA (https://www.r-inla.org).

5.5.3 Variational Inference

Not all posterior distribution densities can be well approximated with normal distri-
butions, and so variational inference methods (Blei et al. 2017) explore alternative
classes of approximating densities. Let Q be such a class of densities, referred to
as the variational family. Then variational inference seeks to approximate the tar-
get density π(θ ) with the closest member of the variational family, typically using
Kullback-Leibler divergence (cf. Definition 1.16),

q∗(θ) = arg min_{q∈Q} KL(q(θ) ‖ π(θ)).   (5.25)
The KL-divergence in (5.25) is taken in the reverse direction to the usual order, presented in (1.4), for comparing an estimated density with the truth. In (5.25), expectations are taken with respect to the estimating density q rather than the target
1 https://www.r-project.org.
2 https://www.r-inla.org.
Fig. 5.2 Approximating a bivariate normal distribution with correlation .95, π(θ1, θ2), with the closest independent bivariate normal distribution, q(θ1, θ2), minimising (a) KL(q ‖ π) or (b) KL(π ‖ q)
π. This can lead to advantages in tractability, with freedom to choose a convenient form for the approximating density q.
An alternative algorithmic framework introduced by Minka (2001), known as expectation propagation, instead minimises the forward-direction KL-divergence, KL(π(θ) ‖ q(θ)). There are important differences between these two formulations, which become particularly apparent when approximating multimodal target distributions.
Specifically, KL(q(θ) ‖ π(θ)) is more critical of discrepancies where q(θ) is large and π(θ) is small. Consequently, variational inference concentrates mass around a local mode of π(θ), and is said to be zero-forcing for q. In contrast, KL(π(θ) ‖ q(θ)) is sensitive to q(θ) being small wherever π(θ) is large, and therefore must provide coverage to all modes of π(θ); it is therefore said to be zero-avoiding for q.
Following the example of Bishop (2006, p. 468), the two plots in Fig. 5.2 illustrate
the contrasting approximations, obtained under the two alternative KL-divergence
formulations, for a simple example where a bivariate normal distribution with corre-
lation coefficient 0.95 is approximated by two independent univariate normal distri-
butions. Both approximations correctly fit the mean of the target distribution, but the
variational inference estimate in the left-hand plot focuses on the mode of the target
distribution, whereas the expectation-propagation approximation in the right-hand plot has higher variance, providing better coverage of the high target density region but also spanning large areas of very low target density.
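The contrast in Fig. 5.2 can be reproduced numerically from the closed-form KL divergence between Gaussians. The sketch below (an illustration, not from the text) restricts the independent approximation to a common variance v for both components, which suffices here by symmetry: the reverse-direction minimiser recovers the conditional variance 1 − 0.95² = 0.0975, while the forward-direction minimiser recovers the marginal variance 1.

```python
import numpy as np

rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])  # target covariance, correlation .95

def kl_gauss(S0, S1):
    """KL( N(0, S0) || N(0, S1) ) for zero-mean bivariate normal distributions."""
    S1inv = np.linalg.inv(S1)
    return 0.5 * (np.trace(S1inv @ S0) - 2
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# grid over the common variance v of the independent approximation q = N(0, v I)
vs = np.linspace(0.01, 2.0, 2000)
rev = [kl_gauss(v * np.eye(2), Sigma) for v in vs]   # KL(q || pi): zero-forcing
fwd = [kl_gauss(Sigma, v * np.eye(2)) for v in vs]   # KL(pi || q): zero-avoiding

v_rev = vs[int(np.argmin(rev))]   # minimiser close to 1 - rho**2 = 0.0975
v_fwd = vs[int(np.argmin(fwd))]   # minimiser close to 1
```

The zero-forcing approximation is therefore much narrower than the zero-avoiding one, matching the two panels of Fig. 5.2.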
5.5.3.1 Evidence Lower Bound
With π(θ) ∝ p(x, θ) = p(x | θ) p(θ), minimising the KL-divergence (5.25) is equivalent to maximising the so-called evidence lower bound.
Definition 5.9 (Evidence lower bound) For fixed x and a probability density q(θ )
satisfying q(θ ) > 0 =⇒ p(x, θ ) > 0, the evidence lower bound (ELBO) is defined
by
ELBO(q) := Eq log p(x, θ ) − Eq log q(θ ). (5.26)

Exercise 5.11 (ELBO equivalence) Show that

KL(q(θ) ‖ π(θ)) = − ELBO(q) + log p(x),   (5.27)

and hence minimising KL(q(θ) ‖ π(θ)) is equivalent to maximising ELBO(q).

Later in Sect. 7.1 which considers model uncertainty, the marginal likelihood p(x)
will be referred to as the evidence in favour of that particular probability model. This
provides the reasoning behind the name of evidence lower bound: by (5.27),

log p(x) = ELBO(q) + KL(q(θ) ‖ π(θ)) ≥ ELBO(q)

since KL-divergence is non-negative (cf. Exercise 1.9). Note the lower bound
becomes an equality if q = π , corresponding to the approximation matching the
target distribution.
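The equality case can be checked numerically in a conjugate example (a sketch, not from the text): with a Beta(1, 1) prior for a Bernoulli success probability and q taken to be the exact Beta posterior, every Monte Carlo draw satisfies log p(x, θ) − log q(θ) = log p(x), so the estimated ELBO coincides with the log evidence.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)
a0, b0 = 1.0, 1.0                     # Beta(1, 1) prior
x = rng.binomial(1, 0.6, size=20)     # Bernoulli data
s = int(x.sum())
f = len(x) - s

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# exact log evidence: log p(x) = log B(a0 + s, b0 + f) - log B(a0, b0)
log_evidence = log_beta(a0 + s, b0 + f) - log_beta(a0, b0)

# Monte Carlo ELBO (5.26) with q set to the exact posterior Beta(a0 + s, b0 + f)
a, b = a0 + s, b0 + f
theta = rng.beta(a, b, size=200_000)
log_joint = (s * np.log(theta) + f * np.log(1 - theta)                     # log p(x | theta)
             + (a0 - 1) * np.log(theta) + (b0 - 1) * np.log(1 - theta)     # log p(theta)
             - log_beta(a0, b0))
log_q = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta) - log_beta(a, b)
elbo = float(np.mean(log_joint - log_q))   # equals log_evidence when q = posterior
```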

Exercise 5.12 (ELBO identity) Show that ELBO(q) = Eq log p(x | θ) − KL(q(θ) ‖ p(θ)).

The specification of a variational inference method is completed by deciding upon the variational family Q over which (5.26) should be maximised. The approximating density (5.25) for π is then

q ∗ (θ ) = arg maxq∈Q ELBO(q). (5.28)

The computational difficulty of performing the optimisation (5.28) depends upon the complexity of the variational family. For tractable inference, the most common choice for Q is the so-called mean-field variational family.

Remark 5.15 To see the implicit trade-off implied by optimising the ELBO criterion, notice (5.26) is the sum of the expected log value, with respect to q, of the joint density p(x, θ), plus a quantity referred to in information theory as the entropy of the approximating density, − Eq log q(θ). Without the entropy term, the maximisation would (in the limit) assign probability 1 to the posterior mode for θ;
however, that approximating density would have minimum entropy, and so instead
the optimal q will distribute mass more widely around Θ, but still in areas where
π(θ ) is high.

5.5.3.2 Mean-Field Variational Inference
Definition 5.10 (Mean-field variational family) A mean-field variational family Q on Θ comprises probability density functions q with independent factors,

q(θ) = ∏_{j=1}^{k} q_j(θ_j).   (5.29)

Each factor q j in (5.29) can assume a different parametric form, which might be
necessary when some components of θ are unconstrained and continuous and others
are possibly discrete. The assumption of independence implicit in (5.29) makes
mean-field variational inference well-suited to optimisation using a technique called
coordinate ascent. Once optimised, the mean-field variational estimate will take the
same form,

q∗(θ) = ∏_{j=1}^{k} q∗_j(θ_j),

where q∗_j(θ_j) provides a local variational approximation of the marginal target density π(θ_j).

Remark 5.16 An unwanted consequence of the independence assumption of mean-field approximations is that component variances of π(θ) will typically be underestimated, as the main body of elongated, ellipsoidal covariance contours is approximated by smaller circles (cf. Fig. 5.2a).

5.5.3.3 Coordinate Ascent Variational Inference
Coordinate ascent algorithms discover local maxima of objective functions by sequentially optimising with respect to one parameter component whilst keeping the other components fixed. Coordinate ascent variational inference (CAVI) assumes a current mean-field approximation q(θ) (5.29), and then updates the jth component q_j(θ_j) to a locally optimal solution

q_j(θ_j) ∝ exp{E_{q_{−j}} log π(θ_j | θ_{−j})} ∝ exp{E_{q_{−j}} log p(x, θ)},

where q_{−j} is the marginal density for θ_{−j} (4.8) of the current mean-field approximation,

q_{−j}(θ_{−j}) = ∏_{ℓ≠j} q_ℓ(θ_ℓ).

Exercise 5.13 (CAVI derivation) In coordinate ascent variational inference, show that

arg max_{q_j} ELBO(q) ∝ exp{E_{q_{−j}} log π(θ_j | θ_{−j})}.

The CAVI method is summarised in Algorithm 4. Each step of the algorithm maintains or increases the objective function ELBO(q), and since ELBO(q) is bounded above by log p(x), eventual convergence at a chosen tolerance threshold is guaranteed.

Algorithm 4: Coordinate ascent variational inference

Result: A mean-field variational estimate q(θ) = ∏_j q_j(θ_j) for π(θ)
1 Initialisation: Choose initial distributions q_j(θ_j), j = 1, . . . , k; calculate ELBO(∏_j q_j) using (5.26);
2 while ELBO(∏_j q_j) has not converged do
3     for j ← 1 to k do
4         Set q_j ∝ exp{E_{q_{−j}} log p(x, θ)};
5     end
6     Calculate ELBO(∏_j q_j) using (5.26)
7 end

Exercise 5.14 (CAVI Gaussian approximation) Suppose π(θ) = Normal₂(θ | μ, Σ) with μ ∈ R² and Σ ∈ R²ˣ² positive-definite, and let Q_j = {Normal(θ_j | m, s²) : m ∈ R, s > 0} be the variational family for component j ∈ {1, 2} (cf. Fig. 5.2).
(i) Show that the CAVI algorithm local approximation for component j is q_j(θ_j) = Normal(θ_j | m_j, s_j²), where

m_j = μ_j + (Σ_{jj̃} / Σ_{j̃j̃})(m_{j̃} − μ_{j̃}),    s_j² = Σ_{jj} − Σ²_{jj̃} / Σ_{j̃j̃},

and j̃ = 3 − j is the other component.
(ii) When will the algorithm converge?
(iii) Implement 200 iterations of the CAVI algorithm with target distribution mean
μ = (0, 0), unit variances and correlation coefficient .95. Use starting values
m 1 = 2, m 2 = 3, s1 = .1, s2 = 9. Make a contour plot of the target density and
the mean-field variational approximation.
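A possible implementation sketch for part (iii), using the component updates stated in part (i); the requested contour plot is omitted here. With these starting values, the means contract geometrically towards μ = (0, 0), and each variance settles at the conditional variance 1 − 0.95².

```python
import numpy as np

mu = np.array([0.0, 0.0])                      # target mean
Sigma = np.array([[1.0, 0.95], [0.95, 1.0]])   # unit variances, correlation .95

m = np.array([2.0, 3.0])                       # starting means m1, m2
s2 = np.array([0.1 ** 2, 9.0 ** 2])            # starting variances s1^2, s2^2

for _ in range(200):                           # 200 CAVI sweeps
    for j in (0, 1):
        jt = 1 - j                             # index of the other component
        m[j] = mu[j] + Sigma[j, jt] / Sigma[jt, jt] * (m[jt] - mu[jt])
        s2[j] = Sigma[j, j] - Sigma[j, jt] ** 2 / Sigma[jt, jt]
```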
5.6 Further Topics
This chapter has presented some of the fundamental computational methods for performing Bayesian inference. Much of the ongoing research activity in Bayesian statistics is focused in this area, with a broad range of sophisticated methods being developed. Further advanced
topics include reversible jump Markov chain Monte Carlo, for transdimensional sampling
from variable-dimension target distributions (Amaral Turkman et al. 2019, Chap. 7);
sequential Monte Carlo sampling from a sequence of target distributions (Doucet
et al. 2001); automatic differentiation for variational inference (Kucukelbir et al.
2017).
Such advanced techniques are well beyond the scope of this text, but the key
principles introduced in this chapter provide the foundations for understanding the
purpose behind these more advanced methods. The next chapter will illustrate com-
putational packages which are openly available for users wishing to carry out different
kinds of Bayesian analysis without addressing these research-level difficulties, where
the complex sampling issues are kept “under the hood”. Nonetheless, in diagnosing
the performance of these (unavoidably imperfect) software tools it is important to
possess this basic level of understanding of how their internal inferential processes
operate.
Chapter 6
Bayesian Software Packages

The research-level complexity of performing Bayesian inference with the statistical models typically encountered in practical decision problems can provide a barrier
to these methods being widely deployed. To alleviate this problem, a number of
probabilistic programming languages have been developed specifically to automate
Bayesian inference. This text will focus on the language Stan,1 due to its widespread
adoption and the depth of tutorial resources available. Brief details will also be given
for two alternative libraries, PyMC 2 and Edward.3 All three can be accessed through
the general-purpose, interpreted programming language Python.4
To illustrate the use of computer software packages in performing Bayesian infer-
ence, the following hypothetical statistical model will be used to provide a working
example.

6.1 Illustrative Statistical Model
Consider an example which further develops the graphical model structure presented
in Example 3.2 and Fig. 3.8, which envisaged two layers of exchangeability for a
hypothetical class of n students obtaining grades from p tests. Consider the following
parametric model, consistent with Fig. 3.8, which assumes some exponential family
distributions mentioned in Chap. 4:

1 https://mc-stan.org
2 https://docs.pymc.io
3 http://edwardlib.org
4 https://www.python.org

The original version of this chapter has been revised due to typographic errors. The corrections to
this chapter can be found at https://doi.org/10.1007/978-3-030-82808-0_12

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, corrected publication 2022
N. Heard, An Introduction to Bayesian Inference, Methods and Computation, https://doi.org/10.1007/978-3-030-82808-0_2
μ ∼ Normal(0, σ 2 /4)
σ −2 ∼ Gamma(1, 1/2)
z i ∼ Normal(μ, σ 2 ), i = 1, . . . , n
θi = 1/{1 + e−zi }, i = 1, . . . , n
X i j ∼ Binomial(100, θi ), i = 1, . . . , n; j = 1, . . . , p. (6.1)

Briefly, this model assumes a matrix, X , of student grades measured as integer per-
centage scores ranging between 0 and 100, such that the ith row corresponds to the ith
student in a class. Each student grade X i j is modelled by a binomial distribution with
a student-specific probability parameter which is assumed to be the same for each
test. This parameter is derived through a logistic transformation of an unobserved
(latent), real-valued aptitude level z i which is assumed to be normally distributed
with unknown mean and variance which are assigned conjugate priors (cf. Sect. 4.2).
Below is some example Python code for simulating from this model, by default
assuming thirty students sitting five tests.
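The listing can be sketched with NumPy as follows (an illustration consistent with (6.1), assuming Gamma(1, 1/2) is parameterised by shape and rate; the function name is hypothetical):

```python
import numpy as np

def simulate_grades(n=30, p=5, seed=0):
    """Simulate an n x p grade matrix X from model (6.1); a sketch, not the book's listing."""
    rng = np.random.default_rng(seed)
    # sigma^{-2} ~ Gamma(1, 1/2), i.e. shape 1 and rate 1/2 (NumPy scale = 1/rate = 2)
    precision = rng.gamma(shape=1.0, scale=2.0)
    sigma2 = 1.0 / precision
    # mu ~ Normal(0, sigma^2 / 4)
    mu = rng.normal(0.0, np.sqrt(sigma2) / 2)
    # z_i ~ Normal(mu, sigma^2), theta_i = 1 / (1 + exp(-z_i))
    z = rng.normal(mu, np.sqrt(sigma2), size=n)
    theta = 1.0 / (1.0 + np.exp(-z))
    # X_ij ~ Binomial(100, theta_i): one row per student, one column per test
    X = rng.binomial(100, theta[:, None], size=(n, p))
    return X

X = simulate_grades()
```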

6.2 Stan
Stan, named after the Polish mathematician Stanislaw Ulam, is a probabilistic pro-
gramming language written in C++ which uses sophisticated Monte Carlo and vari-
ational inference algorithms (see Chap. 5) for performing automated Bayesian infer-
ence. In particular, the default inferential method uses the No-U-Turn Sampler of
Hoffman and Gelman (2014), which is an extension of Hamiltonian Monte Carlo (see
Sect. 5.4). The derivatives required for performing HMC and other related inferential
methods are calculated within Stan using automatic differentiation. The user simply
has to declare the statistical model, import any data and then call a sampling routine.

Remark 6.1 Stan does not support sampling of discrete parameters, due to the
reliance of the software on Hamiltonian Monte Carlo sampling methods. For prob-
lems involving discrete parameters, the Stan documentation recommends marginal-
ising any discrete parameters where possible.

The following code (student_grade_model.stan) is written in the Stan language, declaring the statistical model (6.1) for student test grades from Sect. 6.1.
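A declaration consistent with the block-by-block description below might read as follows (a sketch rather than a verbatim reproduction of the book's listing; here the binomial size and prior hyperparameters are hardcoded rather than passed through the data block):

```stan
data {
  int<lower=1> n;                          // number of students
  int<lower=1> p;                          // number of tests
  array[n, p] int<lower=0, upper=100> X;   // percentage test scores
}
parameters {
  real mu;                  // aptitude mean
  real<lower=0> sigma2;     // aptitude variance
  vector[n] z;              // latent student aptitudes
}
transformed parameters {
  vector<lower=0, upper=1>[n] theta = inv_logit(z);  // binomial probabilities
  real<lower=0> sigma = sqrt(sigma2);
}
model {
  sigma2 ~ inv_gamma(1, 0.5);       // sigma^{-2} ~ Gamma(1, 1/2)
  mu ~ normal(0, sigma / 2);        // Normal(0, sigma^2 / 4)
  z ~ normal(mu, sigma);
  for (i in 1:n)
    X[i] ~ binomial(100, theta[i]); // one row of test scores per student
}
```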
The data code block contains the quantities which are considered to be known. The quantities n (the number of students) and p (the number of tests) are declared as positive integers, and the student test scores (X i j ) are declared to be an n × p integer matrix taking values between 0 and 100. In the remainder of this block, the remaining required model hyperparameters and their constraints are listed.
The parameters code block declares the unknown quantities in (6.1): the n-vector of real number student aptitude values z i , and the unknown mean and variance parameters (μ, σ²) of the normal distribution for the z i values.
The transformed parameters code block contains any parameter transformations which are helpful for stating the prior and likelihood models in the final model block. In this case, the aptitude parameters z i are converted to binomial parameters θi using the inverse logit function θi = 1/{1 + e^{−zi}} as in (6.1), and also the aptitude standard deviation σ is obtained as the square root of the variance.
The model code block states the probability distributional assumptions from (6.1): an inverse-gamma distribution for σ²; normal distributions for μ and the latent parameters z i ; and a binomial distribution for each individual percentage test score, using the student-specific transformed parameter θi .

6.2.1 PyStan
Stan can be accessed from a range of computing environments. In this text, it will
be accessed using the Python interface, PyStan.5 The following PyStan 3 code
(student_grade_inference_stan.py) uses the Python simulation code
and Stan model declaration code from above to simulate student test score data
and then fit the underlying model to the data.

5 https://pystan.readthedocs.io.
After importing the necessary packages, the code first simulates student grade data
for n = 30 students taking p = 5 tests. Second, the code loads in the Stan probability
model from student_grade_model.stan. The third block determines that
four separate parallel Hamiltonian MCMC chains are to be run, each requesting
10,000 samples after discarding the first 1000; the call to stan.model.sample()
then obtains the posterior samples.
The final code block creates plots from the posterior samples. The top two cells
show trace plots of the log posterior density of the sampled parameters and the values
of the parameter μ from (6.1). The chains demonstrate stability and good mixing. The
bottom row contains a diagnostic plot for the four chains showing the convergence of
the sample mean for estimating the posterior expectation of μ, and finally a histogram
of the sampled values of μ, pooled across the four chains, for estimating the marginal
density π(μ). The true value of μ used to simulate the test scores is indicated with a
dashed line; note the relatively small sample size (of students and test scores) means
this true value is not yet well estimated by π(μ).

6.3 Other Software Libraries


6.3.1 PyMC
PyMC is a probabilistic programming package for Python which focuses on simplify-
ing Bayesian inference, primarily through Markov chain Monte Carlo and variational
methods. The fourth version (PyMC4) is being built upon a back end of the C++
and Python based symbolic mathematics library TensorFlow,6 after the Python back
end Theano used by previous versions was discontinued. PyMC offers similar func-
tionality to Stan, also using the No-U-Turn Sampler (Hoffman and Gelman 2014)
Hamiltonian Monte Carlo algorithm as the default inference tool for continuous
parameters. Some users prefer PyMC to Stan for its native Python implementation,
whilst others prefer Stan for its enhanced computation speed (through implementa-
tion in C++) and extensive documentation.

6.3.2 Edward
Edward is a Python library for probabilistic modelling and inference, named after
the statistician George Edward Pelham Box who pioneered iterative approaches to
statistical modelling. Edward was developed using TensorFlow as a back end. Ten-
sorFlow is designed for developing and training so-called machine learning models,
and Edward builds upon this to offer modelling using neural networks (including
popular deep learning techniques) but also supports graphical models (cf. Chap. 3)
and Bayesian nonparametrics (cf. Chap. 9).
Bayesian inference can be performed using variational methods and Hamilto-
nian Monte Carlo, along with some other advanced techniques. Another strength of
Edward is model criticism, using posterior predictive checks which assess how well
data generated from the model under the posterior distribution agree with the realised
data.

6 https://www.tensorflow.org.
Chapter 7
Criticism and Model Choice

In subjective probability, there are no right or wrong systems of beliefs, provided they are coherently specified; I have my own individual measures of uncertainty
concerning any quantities that I am unsure of, and it is fully admissible that these
could be arbitrarily different from the probability beliefs held by others.
However, it was noted in the introductions to Chaps. 1 and 2 that mathematically
specifying probability distributions which accurately represent systems of beliefs is
a non-trivial exercise, and arguably always carries some degree of imprecision. The
use of probability models, for example incorporating assumptions of exchangeability
and making the choice of the prior measure Q in de Finetti’s representation theorem
(Sect. 2.2), provides tractable approximations of underlying beliefs which at least
possess the necessary coherence properties for rational decision-making.
Furthermore, there is no philosophical requirement for subjective probability dis-
tributions to endure. They need only apply to the specific decision problem being
addressed. Indeed, Bayes’ theorem provides the coherent procedure for updating
beliefs with new information with respect to a previously stated belief system. But
for the next decision, there are other alternatives. In particular, I might want to review
my previous decisions and the consequent outcomes, and call into question whether I
should adopt a different perspective. Such considerations can be referred to as model
criticism and model selection.
In this chapter, it is supposed that the decision-maker may be considering a range
of modelling strategies for representing probabilistic beliefs about a random variable
X for an uncertain outcome ω, where for sufficient generality X : Ω → Rn could
represent a sequence of n ≥ 1 real-valued observations. After observing the realised
value x = X (ω), the decision-maker may want to re-evaluate which modelling strat-
egy might have been most appropriate for capturing the true underlying dynamics
which gave rise to x.

7.1 Model Uncertainty

Let M denote a set of models under consideration. Each proposed model M ∈ M corresponds to a probability distribution P(x | M) for the random outcome X(ω).
Given an observed value of x, the quantity P(x | M) is known as the evidence for
model M.
If M is a parametric model (4.1) with corresponding unknown parameter θ M ,
then the model evidence P(x | M) can be regarded as a marginal likelihood under model M,

P(x | M) = ∫_{Θ_M} p_M(x | θ_M) p_M(θ_M) dθ_M,   (7.1)

averaging the parametric likelihood p_M(x | θ_M) for model M with respect to the corresponding prior parameter density p_M(θ_M) for model M. Individual parametric models may therefore correspond to differences in one or more of the following:
• Underlying parameterisations, Θ M ;
• Likelihood model, p M (x | θ M );
• Prior distribution, p M (θ M ).
Furthermore, it should be noted that not all rival models need be parametric, assume
exchangeability, or presume any other structural similarities.
Two contrasting viewpoints can be adopted for handling Bayesian model uncer-
tainty; the first allows all models in M to be considered simultaneously, known as
model averaging, whilst the second, model selection, proposes a single chosen model
from M for use in further analyses.

7.2 Model Averaging

To coherently consider all models in M simultaneously, the decision-maker must assert a subjective probability distribution Q over M. Combining the probability distributions for the random variable X implied by the individual models, weighted according to these prior model probabilities, yields a marginal probability for X,

P(x) = ∫_M P(x | M) dQ(M).   (7.2)

The expression (7.2) can also be viewed as a marginal likelihood, averaging individual model marginal likelihoods (7.1) with respect to prior model uncertainty. In this sense, (7.2) is an example of using a mixture prior distribution (cf. Sect. 2.3.4).
The averaging over a mixture of probability models (7.2) is known as Bayesian model
averaging.
If the decision-maker is comfortable assigning probabilities across M , the set of
candidate models, and prepared to carry the extra computational burden of averaging
across models to obtain marginal probability distributions (7.2), then this model-
averaging approach is the correct method for managing model uncertainty under the
Bayesian paradigm; the model uncertainty is simply one component of a mixture
prior formulation.

7.3 Model Selection

Once the outcome variable x is observed, then if prior probabilities over M have
been specified, the updated posterior model probabilities can be obtained via Bayes’
theorem,
dQ(M | x) ∝ dQ(M) P(x | M).

In particular, if M = {M1, . . . , Mk} is a finite collection of k candidate models with prior probabilities P(M1), . . . , P(Mk), then the posterior probability for the ith model can be expressed as

P(Mi | x) = P(Mi) P(x | Mi) / Σ_{i′=1}^{k} P(Mi′) P(x | Mi′),   (7.3)

where the denominator is the model-averaged marginal likelihood (7.2).
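Equation (7.3) is a straightforward normalisation, best computed on the log scale for numerical stability; a sketch with hypothetical evidence values:

```python
import numpy as np

def posterior_model_probs(log_evidence, prior):
    """Posterior model probabilities (7.3) from log P(x | M_i) and prior P(M_i)."""
    a = np.log(np.asarray(prior, dtype=float)) + np.asarray(log_evidence, dtype=float)
    a -= a.max()                  # log-sum-exp stabilisation before exponentiating
    w = np.exp(a)
    return w / w.sum()

# two models with equal prior probability and hypothetical log evidences
probs = posterior_model_probs([-10.0, -12.0], [0.5, 0.5])
```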

7.3.1 Selecting From a Set of Models

If the decision problem is to determine which model was the underlying generative
process which gave rise to x, then the decision-maker should proceed in the manner
described in Chap. 1: specifying a utility or loss function which evaluates the con-
sequences of estimating the model correctly or incorrectly and reporting the model
which maximises expected utility with respect to the model posterior distribution
(7.3).

Example 7.1 If choosing a model m from a finite set of models M using a zero-one
utility function (cf. Exercise 1.8) with the following utility if the true model were M,

u(m, M) = 1 if m = M, and 0 if m ≠ M,   (7.4)

then the posterior expected utility of choosing model m is

ū(m) = Σ_{M∈M} u(m, M) P(M | x) = P(m | x)
and the optimal Bayesian decision would be to report the posterior mode, arg max_{M∈M} P(M | x).

7.3.2 Pairwise Comparisons: Bayes Factors

Suppose the decision-maker wishes to compare the relative suitability of two partic-
ular models, Mi and M j ; in this case, the comparison can be suitably encapsulated
by the ratio of the posterior probabilities attributed to the two models.

Definition 7.1 (Posterior odds ratio) The posterior odds ratio of model Mi over model M j is

P(Mi | x) / P(M j | x) = [ P(Mi) / P(M j) ] × [ P(x | Mi) / P(x | M j) ].   (7.5)

Remark 7.1 The first term on the right-hand side of (7.5) is known as the prior
odds ratio, and the second term is known as the Bayes Factor.

Definition 7.2 (Bayes factor) The Bayes factor in favour of model Mi over model M j is

Bi j (x) := P(x | Mi) / P(x | M j) = [ P(Mi | x) / P(M j | x) ] ÷ [ P(Mi) / P(M j) ].
Remark 7.2 The Bayes factor represents the evidence provided by the data x in
favour of model Mi over M j , measured by the multiplicative change observed in the
odds ratio of the two models upon observing x.

If Bi j > 1, this suggests Mi has become more plausible relative to M j after observ-
ing x, whereas Bi j < 1 suggests the opposite. Bayes factors are non-negative but
have no upper bound, and although a larger Bayes factor presents stronger evidence
in favour of Mi , there is no objective interpretation for any non-degenerate value.
To provide interpretability, Jeffreys (1961) provided some subjective categorisations,
which were later refined by Kass and Raftery (1995); the latter are shown in Table 7.1.

Table 7.1 Bayes factor interpretations according to Kass and Raftery (1995)
Bayes factor Bi j Evidence in favour of Mi
1 to 3 Not worth more than a bare mention
3 to 20 Positive
20 to 150 Strong
>150 Very strong
7.3.2.1 Bayesian Hypothesis Testing

If one model M0 is a special case of an alternative model M1 (for example, M0 assigns all probability mass to certain values of one or more free parameters in M1),
then the selection of a model is analogous to the frequentist statistical paradigm of
hypothesis testing. The Bayes factor corresponds to the uniformly most powerful
likelihood ratio test statistic.
Consider the zero-one utility function (7.4), for which the Bayes optimal decision
is to report the most probable model. In hypothesis testing language, this implies the
null model M0 being rejected in favour of M1 if and only if

P(M1 | x) > P(M0 | x) ⇐⇒ B10 (x) > P(M0) / P(M1).   (7.6)

The test procedure in (7.6) implies rejection of the null model M0 in favour of M1
if the Bayes factor B10 (x) exceeds the prior ratio in favour of the null model. In this
way, the prior ratio can be seen to determine the desired significance level of the test.
A threshold value could be chosen by referring to the Bayes factor interpretations
from Table 7.1.
Exercise 7.1 (Bayes factors for Gaussian distributions) Consider the following
model for two exchangeable groups of random samples x = (x1 , . . . , xn ), y =
(y1 , . . . , yn ):

xi ∼ N(θ X , 1), i = 1, . . . , n,
yi ∼ N(θY , 1), i = 1, . . . , n,
θ X , θY ∼ N(0, σ 2 ). (7.7)

The samples x1, . . . , xn and y1, . . . , yn are all assumed to be conditionally independent given θ X and θY , and the model specification is completed by specifying the dependency between θ X and θY in one of two ways:

M 0 : θ X = θY ;
M1 : θ X ⊥⊥ θY . (7.8)

(i) Derive an equation for the Bayes factor B01 (x, y) in favour of M0 over M1 .
(ii) For fixed observed samples x and y, show that B01 (x, y) → ∞ as the assumed
variance for the mean parameters θ X and θY , σ 2 , tends to infinity. Comment.
Remark 7.3 The phenomenon mentioned in Exercise 7.1 Item ii is known as Lind-
ley’s paradox, named after the Bayesian decision theorist Dennis V. Lindley (1923–
2013), and is further discussed in Proposition 8.3 of Chap. 8. For Bayesian hypothesis
testing, there is no useful concept of a totally uninformative prior for model selec-
tion. If beliefs about unknown parameters are made arbitrarily vague, then the simpler
model will always be preferred, regardless of the data.
7.3.3 Bayesian Information Criterion

One issue with using posterior probabilities and Bayes factors for choosing amongst
models is that these quantities rely upon calculation of the marginal likelihoods of
observed data for each model. It was noted in Sect. 4.1 that the marginal likelihood
will not be analytically calculable for most models; and although Sect. 5.2.4.1 pro-
posed numerical importance sampling methods for estimating marginal likelihoods,
reliable low variance estimates may not be available.
Suppose x = (x1, . . . , xn). When the number of samples n is large, Schwarz (1978) showed that for exponential family (cf. Sect. 4.3) models with a k-dimensional parameter θ,

log p(x) ≈ log p(x | θ̂) − (k/2) log n,
where θ̂ is the maximum likelihood estimate of θ maximising p(x | θ ).


On this basis, a popular method for comparing rival models (even outside of
exponential families) is the so-called Bayesian information criterion.
Definition 7.3 The Bayesian information criterion for model selection is defined to
be
BIC := −2 log p(x | θ̂ ) + k log n (7.9)

where k is the dimension of θ and θ̂ maximises p(x | θ ). Low BIC values correspond
to good model fit.
Remark 7.4 For a given likelihood model, the BIC (7.9) is twice the negative loga-
rithm of an asymptotic approximation of a corresponding Bayesian marginal likeli-
hood for n samples as n → ∞; this asymptotic marginal likelihood does not depend
on the choice of prior p(θ ), besides requiring appropriate support for the maximum
likelihood estimate. The BIC is therefore only suitable for comparing different for-
mulations of the likelihood component of a parametric model, and not for comparing
prior distributions.
Proposition 7.1 (BIC approximated Bayes factors) If BICi and BIC j denote the Bayesian information criterion for two models Mi and M j , an approximate Bayes factor in favour of model i over model j is

Bi j ≈ exp{ −(BICi − BIC j) / 2 }.
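As an illustration of Proposition 7.1 (a sketch, not from the text), the following compares a fixed-variance Gaussian likelihood Mi : Normal(θ, 1), with k = 1, against Mj : Normal(θ, σ²) with σ² free, so k = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100)   # simulated data, consistent with M_i
n = len(x)
theta_hat = x.mean()                 # MLE of theta under both models
rss = np.sum((x - theta_hat) ** 2)

# M_i: x ~ Normal(theta, 1), so k = 1
ll_i = -0.5 * n * np.log(2 * np.pi) - 0.5 * rss
bic_i = -2 * ll_i + 1 * np.log(n)

# M_j: x ~ Normal(theta, sigma^2) with MLE sigma^2 = rss / n, so k = 2
sigma2_hat = rss / n
ll_j = -0.5 * n * np.log(2 * np.pi * sigma2_hat) - 0.5 * n
bic_j = -2 * ll_j + 2 * np.log(n)

# Proposition 7.1: approximate Bayes factor in favour of M_i over M_j
B_ij = np.exp(-0.5 * (bic_i - bic_j))
```

Since the models are nested, the maximised log likelihood of the larger model is never smaller, and the BIC penalty k log n is what allows the simpler model to be favoured.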

Exercise 7.2 (BIC for Gaussian distributions) Consider the sampling model (7.7)
for two groups of random samples x = (x1 , . . . , xn ), y = (y1 , . . . , yn ) presented in
Exercise 7.1, and the two alternative models M0 and M1 from (7.8) for the respective
mean parameters θ X and θY .
(i) Derive equations for the Bayesian information criterion values BIC0 and BIC1 for the two models M0 and M1.
(ii) Use these BIC values to give an approximate expression for the corresponding
Bayes factor B01 .

7.4 Posterior Predictive Checking

Posterior probabilities and Bayes factors are useful for assessing the relative merits of
rival models. However, it can also be desirable to assess the quality of a single model
in absolute terms, without reference to any proposed alternatives which may not yet
have been identified. Posterior predictive checking (PPC) methods aim to quantify
how well a proposed model structure fits the observed data, using the following
logic: if the model is a good approximation to the generating mechanism for the
observed data, then the posterior distribution of the model parameters should assign
high probability to parameter values which in turn would generate further data similar
to the observed data with high probability if the sampling process was repeated.
Consider a single parametric model with model parameter θ ∈ Θ, prior density
p(θ ) and likelihood density p(x | θ ) for the observed data x ∈ X . Let π(θ ) be the
corresponding posterior density (4.3).
Definition 7.4 (Posterior predictive distribution) The posterior predictive distribution is the marginal distribution of a second draw xrep ∈ X from the likelihood model with the same (unknown) parameter, implying a density

π(xrep) := ∫_Θ p(xrep | θ) π(θ) dθ ∝ ∫_Θ p(xrep | θ) p(x | θ) p(θ) dθ.   (7.10)

Using techniques from frequentist statistical hypothesis testing, posterior predictive checking is concerned with establishing whether the observed data x could be regarded as being somehow extreme with respect to the posterior predictive density (7.10).

7.4.1 Posterior Predictive p-Values

For full generality, let T (x, θ ) be a test statistic for measuring discrepancy between a data-generating parameter θ and observed data x.
Definition 7.5 (Posterior predictive p-value) A posterior predictive p-value for
T (x, θ ) is the upper tail probability

p := ∫Θ ∫X 1[T (x,θ),∞) {T (xrep , θ )} π(θ ) p(xrep | θ ) dxrep dθ. (7.11)

Remark 7.5 If the test statistic is simply a function of the data, T (x, θ ) ≡ T (x),
then (7.11) simplifies to

p = ∫X 1[T (x),∞) {T (xrep )} π(xrep ) dxrep ,

which is the familiar one-sided p-value for an observed statistic T (x), calculated
with respect to the posterior predictive distribution (7.10).

Remark 7.6 More generally, the posterior predictive p-value (7.11) measures how
a joint sample of parameter and new data from the posterior would compare with
sampling a parameter from the posterior and pairing this with the observed data.

7.4.2 Monte Carlo Estimation

Given (possibly approximate) samples θ (1) , . . . , θ (m) obtained from the posterior
density π , a Monte Carlo estimate (cf. Sect. 5.2) of the posterior predictive p-value
(7.11) can be obtained relatively easily provided it is also possible to sample from
the likelihood distribution p(x | θ ): For each parameter value θ (i) sampled from the posterior density π , randomly draw new data xrep^(i) from the generative likelihood model with that parameter,

xrep^(i) ∼ p(xrep | θ (i) ); (7.12)

then the Monte Carlo estimated posterior predictive p-value is

p̂ := (1/m) ∑i=1,…,m 1[T (x,θ (i) ),∞) {T (xrep^(i) , θ (i) )}. (7.13)
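As a concrete illustration (not from the text), the following Python sketch estimates (7.13) for a toy conjugate model xi ∼ Normal(θ, 1) with prior θ ∼ Normal(0, 1), whose posterior θ | x ∼ Normal(n x̄/(n + 1), 1/(n + 1)) is available in closed form; the model, seed and choice of test statistic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x_i ~ Normal(theta, 1) with theta ~ Normal(0, 1) a priori,
# so that theta | x ~ Normal(n * xbar / (n + 1), 1 / (n + 1))
x = rng.normal(1.0, 1.0, size=50)
n, xbar = len(x), x.mean()

def T(data):               # data-only test statistic: the sample maximum
    return data.max()

m = 5000
count = 0
for _ in range(m):
    theta = rng.normal(n * xbar / (n + 1), np.sqrt(1.0 / (n + 1)))  # theta^(i)
    x_rep = rng.normal(theta, 1.0, size=n)                          # x_rep^(i)
    count += T(x_rep) >= T(x)
p_hat = count / m
print(p_hat)   # a moderate value is expected, since the model is well specified
```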

7.4.3 PPC with Stan

When fitting Bayesian models numerically in Stan (cf. Sect. 6.2), it is relatively simple
to carry out posterior predictive checking using a generated quantities{} code block.
This will be illustrated for the student grades example in Sect. 6.2, by considering
two possible test statistics: The first test statistic uses the negative log likelihood as
a measure of discrepancy,

T (x, θ ) = − log p(x | θ ). (7.14)


7.4 Posterior Predictive Checking 75

The second statistic does not depend on the model parameters, simply obtaining the average score for each student,

x̄i = (1/ p) ∑ j=1,…, p xi j ,

and then calculating the variance of these scores

T (x) = {n ∑i=1,…,n x̄i^2 − (∑i=1,…,n x̄i )^2 }/{n(n − 1)}. (7.15)
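As a quick check (not in the text), (7.15) is simply the unbiased sample variance of the student average scores; note that the PyStan listing later in this section uses np.var with its default divisor n rather than n − 1, a constant factor which cancels when the same statistic is applied to both observed and replicated data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 101, size=(30, 5))   # hypothetical grade matrix: 30 students, 5 tests

def T(x):
    # Statistic (7.15): variance of the row (student) means
    xbar = x.mean(axis=1)
    n = len(xbar)
    return (n * np.sum(xbar**2) - np.sum(xbar)**2) / (n * (n - 1))

# (7.15) agrees with the usual unbiased sample variance of the row means
assert np.isclose(T(X), np.var(X.mean(axis=1), ddof=1))
print(T(X))
```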

The following Stan programming code (student_grade_model_ppc.stan) extends the example from Sect. 6.2 with the inclusion of a generated quantities{} code block to facilitate posterior predictive checks using the test statistics (7.14) and (7.15).

1 // student_grade_model_ppc.stan
2
3 data {
4 int<lower=0> n; // number of students
5 int<lower=0> p; // number of tests
6 int<lower=0, upper=100> X[n, p]; // student test grades
7 real<lower=0> tau;
8 real<lower=0> a;
9 real<lower=0> b;
10 }
11 parameters {
12 real z[n];
13 real mu;
14 real<lower=0> sigma_sq;
15 }
16 transformed parameters {
17 real<lower=0, upper=1> theta[n];
18 real sigma;
19 theta = inv_logit(z);
20 sigma = sqrt(sigma_sq);
21 }
22 model {
23 sigma_sq ˜ inv_gamma(a,b);
24 mu ˜ normal(0, sigma * tau);
25 z ˜ normal(mu, sigma);
26 for (i in 1:n)
27 X[i] ˜ binomial(100,theta[i]);
28 }
29 generated quantities{
30 int<lower=0, upper=100> X_rep[n, p];
31 real log_lhd = 0;
32 real log_lhd_rep = 0;
33 real ppp;
34 for (i in 1:n){
35 for (j in 1:p){
36 log_lhd += binomial_lpmf(X[i][j] | 100,theta[i]);
37 X_rep[i][j] = binomial_rng(100,theta[i]);
38 log_lhd_rep += binomial_lpmf(X_rep[i][j] | 100,theta[i]);
39 }
40 }
41 ppp = log_lhd >= log_lhd_rep ? 1 : 0;
42 }

Line number 37 of student_grade_model_ppc.stan generates a replicate data matrix X rep from the binomial model with the current sampled parameter
vector θ ; this is required for both test statistics (7.14) and (7.15). To calculate the
test statistic (7.14) within Stan, lines 36 and 38 calculate the likelihood function for
the original and replicated data matrices, respectively, and these are compared on
line 41, yielding an indicator to contribute towards the estimated posterior predictive
p-value (7.13).

The following PyStan code (student_grade_inference_stan_ppc.py) uses the Stan model code from above to fit the model and perform posterior predictive checking.

1 #! /usr/bin/env python
2 ## student_grade_inference_stan_ppc.py
3
4 import stan
5 import numpy as np
6 import matplotlib.pyplot as plt
7
8 # Simulate data
9 from student_grade_simulation import sample_student_grades
10 n, p = 30, 5
11 X, mu, sigma = sample_student_grades(n, p)
12 sm_data = {'n':n, 'p':p, 'tau':0.5, 'a':1, 'b':0.5, 'X':X}
13
14 # Initialise stan object
15 with open('student_grade_model_ppc.stan','r',newline='') as f:
16 sm = stan.build(f.read(),sm_data,random_seed=1)
17
18 # Select the number of MCMC chains and iterations, then sample
19 chains, samples, burn = 4, 10000, 1000
20 fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)
21
22 def T(x): #Variance of student average scores
23 return(np.var(np.mean(x,axis=1)))
24
25 t_obs = T(X) #Value of test statistic for observed data
26 x_rep = fit['X_rep'].reshape(n,p,samples,chains)
27 t_rep = [[T(x_rep[:,:,i,j]) for i in range(samples)] for j in range(chains)]
28
29 # Plot posterior predictive distributions of T from each chain
30 def posterior_predictive_plots(t_rep,true_val):
31 nc = np.matrix(t_rep).shape[0]
32 fig,axs=plt.subplots(1,nc,figsize=(10,3),constrained_layout=True)
33 fig.canvas.manager.set_window_title('Posterior predictive')
34 for j in range(nc):
35 axs[j].autoscale(enable=True, axis='x', tight=True)
36 axs[j].set_title('Chain '+str(j+1))
37 axs[j].hist(np.array(t_rep[j]),200, density=True)
38 axs[j].axvline(true_val, color='c', lw=2, linestyle='--')
39 plt.show()
40
41 posterior_predictive_plots(t_rep,t_obs)
42
43 # Calculate and print posterior predictive p-values for T
44 print("Posterior predictive p-values from variance of means:")
45 print([np.mean(t_obs > t_rep[j]) for j in range(chains)])
46
47 # Print posterior predictive p-values for lhd calculated in Stan
48 print("Posterior predictive p-values from likelihood:")
49 print(np.mean(fit['ppp'].reshape(samples,chains),axis=0))

Line numbers 22–23 of student_grade_inference_stan_ppc.py define the student variance test statistic (7.15); this is evaluated for the observed data matrix
at line 25, and for each of the data matrix replicates sampled from the posterior predic-
tive distribution at line 27. At line 41, the estimated posterior predictive distribution
of the variance for a new student cohort is plotted for each MCMC chain, and com-
pared with the observed value. Finally, lines 45 and 49 print the estimated posterior
predictive p-values for each of the two test statistics, obtained from each of the four
MCMC chains.
The following outputs from the code were obtained:

Posterior predictive p-values from variance of means:
[0.5093, 0.5052, 0.5146, 0.5086]
Posterior predictive p-values from likelihood:
[0.1573 0.158 0.1615 0.1538]

The p-values indicate no statistical significance for either test statistic, suggesting a good model fit; indeed, the data were generated from the assumed probability model.
Chapter 8
Linear Models

Infinite exchangeability of a sequence of random variables, here denoted y1 , y2 , . . ., is a useful simplifying assumption for illustrating many of the fundamental ideas
presented in the preceding chapters. However, in many practical situations, this would
be too limiting as a modelling assumption; often there will be additional available
information xi pertaining to each random quantity yi which affects probabilistic
beliefs about the value which yi is likely to take.
In the language of statistical regression modelling, the random variables of interest
y1 , y2 , . . . are referred to as response variables; they are believed to have a statistical
dependence on the corresponding element of the sequence of so-called covariates
or predictors x1 , x2 , . . . which have either been determined or observed. Regression
modelling is concerned with building statistical models for the conditional distribu-
tion of each yi given xi , primarily through specifying the mean value for yi having
some functional relationship to xi (referred to as the regression function).
The simplest functional relationship is the linear model. With assumed Gaussian
errors in the response variable, the elegant least squares estimation equations from
non-Bayesian statistical linear models extend naturally to the Bayesian case. Despite
the apparent rigidity of a linearity assumption, consideration of different transformations of either the covariates or the response variable can provide a surprisingly
flexible modelling framework.

8.1 Parametric Regression

Let y = (y1 , . . . , yn ) be an n-vector of real-valued response variables. For each response variable yi ∈ R, suppose there is a corresponding p-vector of covariates
response variable yi ∈ R, suppose there is a corresponding p-vector of covariates
xi = (xi1 , . . . , xi p ) ∈ R p , p ≥ 1, which are thought to provide information about


the probability distribution of yi . Let X = (xi j ) be an n × p matrix with ith row xi corresponding to the ith response.

Fig. 8.1 A belief network representation of regression exchangeability for responses y1 , . . . , yn given covariates x1 , . . . , xn
In parametric regression modelling, it is common to assume a relaxation of
exchangeability called regression exchangeability.

Definition 8.1 (Regression exchangeability) Regression exchangeability assumes


that the joint density for y conditional on X has a representation

p(y | X ) = ∫Θ ∏i=1,…,n p(yi | θ, xi ) dQ(θ ) (8.1)

for a prior distribution Q on some parameter space Θ.

The regression function relating the response to the covariates is specified through
the likelihood density p(yi | θ, xi ) in (8.1). Figure 8.1 shows a belief network representation of regression exchangeability.

8.2 Bayes Linear Model

The linear model is a special case of (8.1), where the parameter is a pair θ = (β, σ ),
with β ∈ R p and σ > 0, and the likelihood density p(yi | θ, xi ) is specified by

yi | θ, xi ∼ Normal(xi · β, σ 2 ). (8.2)

The parameter σ is the standard deviation of each response variable, and β is a p-vector of regression coefficients such that

E(yi | xi , β) = xi · β. (8.3)

The model (8.2) is often written in an equivalent regression form:

yi = f (xi ) + εi , εi ∼ Normal(0, σ 2 ),

where
f (x) = x · β (8.4)

is the regression function and ε = (ε1 , . . . , εn ) are referred to as independent error variables.
Finally, a third way of expressing the same model, which can be particularly
convenient for mathematical manipulation, is in matrix form

y ∼ Normaln (Xβ, σ 2 In ), (8.5)

where the conditional independence of the response variables in (8.1) is represented by the diagonal covariance matrix in (8.5).
For a Bayesian parametric regression model (8.1), the specification of the linear
model is completed by a prior distribution for the parameters (β, σ ). Two choices of
prior are commonly considered, which are presented in the next two subsections.

8.2.1 Conjugate Prior

Since (8.5) is an exponential family distribution (cf. Sect. 4.3), it follows from Proposition 4.2 that there is a conjugate prior for (β, σ ). This takes a canonical form

β | σ ∼ Normal p (0, σ 2 V ),
σ −2 ∼ Gamma(a, b), (8.6)

where V is a symmetric, positive semidefinite p × p covariance matrix and a, b > 0.

Exercise 8.1 (Marginal density for regression coefficients) Suppose the conjugate
prior distribution (8.6) for the normal linear model.
(i) Show that, marginally,

p(β) = [Γ (a + p/2)/{(2π b)^{p/2} |V |^{1/2} Γ (a)}] {1 + β ⊤V −1 β/(2b)}^{−(a+p/2)} ,

corresponding to the density of the multivariate t-distribution, β ∼ t2a (0, (b/a)V ).


(ii) If V = I p , show that the prior density function for β depends only on the Euclidean norm ‖β‖ = √(β · β) of the regression coefficients and that

√{(2a + p − 1)/(2b)} ‖β‖ ∼ |t2a |, (8.7)

known as a half-Student’s t-distribution with 2a degrees of freedom.



Typically the regression coefficients will be assumed to be independent and identically distributed, which corresponds to assuming

V = λ−1 I p (8.8)

for a scalar precision parameter λ > 0. This implies a joint prior probability density function

p(β, σ −2 ) = b^a λ^{p/2} exp{−σ −2 (2b + λβ ⊤β)/2}/{(2π )^{p/2} Γ (a) σ^{2(a−1)+p} }. (8.9)

Proposition 8.1 For the linear model (8.5) with conjugate prior (8.6), the posterior
distribution for (β, σ ) after observing responses y = (y1 , . . . , yn ) corresponds to

β | σ, X, y ∼ Normal p (m n , σ 2 Vn ),
σ −2 | X, y ∼ Gamma(an , bn ),

where

Vn = (V −1 + X ⊤X )−1 , m n = Vn X ⊤ y,
an = a + n/2, bn = b + (y ⊤ y − y ⊤X m n )/2. (8.10)
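The update equations (8.10) translate directly into NumPy; the following is a minimal sketch on simulated data, where the generating coefficients, noise level and hyperparameter values are illustrative assumptions, not from the text.

```python
import numpy as np

def linear_model_posterior(X, y, V, a, b):
    # Posterior hyperparameters (8.10) for the conjugate Bayes linear model
    Vn = np.linalg.inv(np.linalg.inv(V) + X.T @ X)
    mn = Vn @ X.T @ y
    an = a + len(y) / 2
    bn = b + 0.5 * (y @ y - y @ X @ mn)
    return Vn, mn, an, bn

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=40)
Vn, mn, an, bn = linear_model_posterior(X, y, np.eye(2), a=1.0, b=0.5)
print(mn)   # posterior mean of beta, close to the generating coefficients
```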
Proof

p(β | σ, X, y) ∝ p(β | σ ) p(y | β, σ, X )
∝ exp[−{β ⊤V −1 β + (y − Xβ)⊤(y − Xβ)}/(2σ 2 )]
∝ exp[−{β ⊤(V −1 + X ⊤X )β − y ⊤Xβ − β ⊤X ⊤ y}/(2σ 2 )]
= exp[−{β ⊤Vn−1 β − m n ⊤Vn−1 β − β ⊤Vn−1 m n }/(2σ 2 )],

according to (8.10) since Vn−1 m n = X ⊤ y. Completing the square,

p(β | σ, X, y) ∝ exp{−(β − m n )⊤Vn−1 (β − m n )/(2σ 2 )}
=⇒ β | σ, y ∼ Normal p (m n , σ 2 Vn ).

It remains to derive the posterior distribution of σ . The regression coefficients β can be marginalised,

β | σ ∼ Normal p (0, σ 2 V )
=⇒ Xβ | σ, X ∼ Normaln (0, σ 2 X V X  )
=⇒ y | σ, X ∼ Normaln (0, σ 2 (X V X  + In )), (8.11)

where the last step follows from standard rules for summing Gaussian random variables. Then by the matrix inversion lemma,

(X V X ⊤ + In )−1 = In − X (V −1 + X ⊤X )−1 X ⊤ = In − X Vn X ⊤
=⇒ y | σ, X ∼ Normaln (0, σ 2 (In − X Vn X ⊤)−1 ).

By Bayes’ theorem,

p(σ −2 | X, y) ∝ p(σ −2 ) p(y | σ, X )
∝ σ^{−2(a−1)} exp(−bσ −2 ) · σ^{−n} exp{−σ −2 y ⊤(In − X Vn X ⊤)y/2}
= σ^{−2(an −1)} exp(−bn σ −2 ), since y ⊤(In − X Vn X ⊤)y = y ⊤ y − y ⊤X m n ,
=⇒ σ −2 | X, y ∼ Gamma(an , bn ).

Exercise 8.2 (Linear model matrix inverse) The matrix inversion lemma states that for an n × n matrix A, a k × k matrix V and n × k matrices U, W ,

(A + U V W ⊤)−1 = A−1 − A−1 U (V −1 + W ⊤A−1 U )−1 W ⊤A−1 .

Using this result, show that (X V X ⊤ + In )−1 = In − X Vn X ⊤ where Vn = (V −1 + X ⊤X )−1 .

From Proposition 4.1, it follows that the Bayes linear model with conjugate prior
has a closed-form marginal likelihood.

Proposition 8.2 Suppose the Bayes linear model (8.5) with y ∈ Rn , X ∈ Rn× p and
conjugate prior (8.6). The marginal likelihood for y | X is

p(y | X ) = Γ (an ) |Vn |^{1/2} b^a /{(2π )^{n/2} Γ (a) |V |^{1/2} bn^{an} }. (8.12)

Equivalently,
y | X ∼ Stn (2a, 0, b(X V X  + In )/a),

where Stn (ν, μ, Σ) is an n-dimensional Student’s t-distribution with ν degrees of


freedom, mean μ and covariance Σ.

Proof From the proof of Proposition 8.1,

y | σ, X ∼ Normaln (0, σ 2 (In − X Vn X ⊤)−1 )
=⇒ p(y | σ, X ) = exp{−y ⊤ y/(2σ 2 ) + y ⊤X (V −1 + X ⊤X )−1 X ⊤ y/(2σ 2 )}/{(2π )^{n/2} σ^n |V |^{1/2} |V −1 + X ⊤X |^{1/2} }. (8.13)
(2π ) 2 σ n |V | 2 |V −1 + X  X | 2

The denominator uses the identity |X V X ⊤ + In | = |V ||V −1 + X ⊤X |, which follows from the matrix determinant lemma (Exercise 8.3).
Marginalising (8.13) over the inverse-gamma prior for σ 2 ,

p(y | X ) = {b^a /Γ (a)} ∫0^∞ σ^{−2(a−1)} exp{−bσ −2 } p(y | σ, X ) dσ −2
= Γ (a + n/2) |Vn |^{1/2} b^a /[(2π )^{n/2} Γ (a) |V |^{1/2} {b + (y ⊤ y − y ⊤X m n )/2}^{a+n/2} ],

by comparison of the integrand with the Gamma(a + n/2, b + (y ⊤ y − y ⊤X m n )/2) density function.

Exercise 8.3 (Linear model matrix determinant) The matrix determinant lemma states that for an n × n matrix A, a k × k matrix V and n × k matrices U, W , |A + U V W ⊤| = |V −1 + W ⊤A−1 U ||V ||A|. Using this result, show that |X V X ⊤ + In | = |V ||V −1 + X ⊤X |.

Proposition 8.3 (Lindley’s paradox) For the linear model under the conjugate prior
(8.6) and assuming (8.8), as λ → 0 the marginal likelihood (8.12) p(y | X ) → 0.

Proof As λ → 0, |V | → ∞ whilst |Vn | → 1/|X  X |. Hence p(y | X ) → 0.

Remark 8.1 Lindley’s paradox in Proposition 8.3 (cf. Proposition 7.1) states that
making prior beliefs increasingly diffuse will eventually lead to diminishingly small
predictive probability density for any possible observation y. Consequently, when
comparing against any fixed alternative model, the Bayes factor in favour of the
alternative model will become arbitrarily large.

 Exercise 8.4 (Linear model code) Write computer code (using a language such
as Python) to calculate the marginal likelihood under the linear model. For a matrix
of covariates X and a vector of responses y, write a single function which returns both
the marginal likelihood and the posterior mean for the regression coefficients.

Exercise 8.5 (Orthogonal covariate matrix marginal likelihood) Suppose the


columns of the matrix X are orthonormal. Then under model (8.9) where the regression coefficients are assumed to be independent, derive a simplified expression for
the linear model marginal likelihood p(y | X ). Comment on why this expression
should be easier to evaluate than the general expression (8.12).

Exercise 8.6 (Zellner’s g-prior) Suppose the n × p covariate matrix X has rank
p, with n > p, and the matrix V in (8.6) satisfies V = g · (X  X )−1 for some constant
g > 0; this formulation is known as Zellner’s g-prior (Zellner 1986). Derive a
simplified expression for the linear model marginal likelihood p(y | X ) under this
prior distribution.

8.2.2 Reference Prior

A commonly used alternative prior distribution is the uninformative reference prior (cf. Sect. 2.3.2),

p(β, σ 2 ) ∝ 1/σ 2 , (8.14)

corresponding to “uniform” prior beliefs for log σ 2 and each component of the coefficient vector β.
Remark 8.2 The prior density (8.14) is said to be improper since it does not have
a finite integral over the parameter space. It can therefore only be meaningfully
considered as the limiting argument of a sequence of increasingly diffuse, proper
prior densities.
The reference prior can be viewed as a limiting case of the conjugate prior (8.9)
as the hyperparameters a, b, λ → 0. Consequently, the posterior distribution result
from Sect. 8.2.1 carries across as follows.

Proposition 8.4 For the linear model (8.5) with reference prior (8.14), if n > p and
X has rank p, then the posterior distribution for (β, σ ) is

β | σ, X, y ∼ Normal p ((X ⊤X )−1 X ⊤ y, σ 2 (X ⊤X )−1 ),
σ −2 | X, y ∼ Gamma(n/2, (y ⊤ y − y ⊤X (X ⊤X )−1 X ⊤ y)/2). (8.15)

Remark 8.3 The reference posterior (8.15) is only proper when n > p and the rank
of X is equal to p, so that X has full rank.

Remark 8.4 As a direct consequence of Lindley’s paradox in Proposition 8.3, the marginal likelihood is not well defined under the improper reference prior (8.14); the corresponding equation takes value 0 for all values of y. Consequently, reference priors cannot be used when performing model choice using Bayes factors (cf. Sect. 7.3.2).

Remark 8.5 Bayesian inference for the linear model under the reference prior corresponds to the standard estimation procedures from classical statistics. For example, the posterior mean for β in (8.15) is the usual least squares or maximum likelihood estimate.
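Remark 8.5 is easy to verify numerically; a short sketch on simulated data (the data-generating choices are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=50)

# Posterior mean of beta under the reference prior, (X'X)^{-1} X'y ...
beta_post_mean = np.linalg.solve(X.T @ X, X.T @ y)
# ... coincides with the classical least squares estimate
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_post_mean, beta_ls)
print(beta_post_mean)
```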

8.3 Generalisation of the Linear Model

Aside from the reference prior analysis in Sect. 8.2.2, the theory of the Bayes linear
model with conjugate prior required no assumptions about the nature of the covariates
xi = (xi1 , . . . , xi p ) ∈ R p which make up the rows of the matrix X . This observation
allows the following abstraction of the so-called design matrix X , which provides a
valuable generalisation in the use of the linear model.

8.3.1 General Basis Functions

Most generally, suppose that for each response variable yi there is an observed p′-vector of related measurements z i ∈ R p′ , for p′ ≥ 1. Setting p = p′ and xi = z i returns the standard linear model (8.5), which is linear in both β and the measurements z i .
More generally, suppose a list of p functions ψ = (ψ1 , . . . , ψ p ),

ψ j : R p′ → R, j = 1, . . . , p,

such that each function specifies a covariate for the linear model, leading to the p-vector covariate

xi = ψ(z i ) = (ψ1 (z i ), . . . , ψ p (z i )).

The functions ψ are referred to as basis functions, since the regression function (8.4) is constructed by linear combinations of the components of ψ,

f (x) = ∑ j=1,…,p β j ψ j (z).

Each basis function ψ j corresponds to a column of the resulting covariate matrix


X . Given the flexibility available in the choice of the columns of X , the matrix X is
often referred to as the design matrix for a regression model.

Remark 8.6 The linear model (8.5) with design matrix X specified by

X i j = ψ j (z i )

is still a linear model with respect to the regression coefficients β, and so all of
the preceding theory for conjugate posterior distributions and closed-form marginal
likelihoods still applies.

8.3.1.1 Polynomial Regression

Suppose a single observable measurement z ∈ R and basis functions

ψ j (z) = z^{j−1} , j = 1, . . . , p,

implying a covariate vector

x = (1, z, . . . , z^{p−1} ).

This construction implies a degree p − 1 polynomial regression function,

f (x) = ∑ j=1,…,p β j z^{j−1} ,

as a special case of the Bayes linear model.
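A polynomial design matrix of this form can be built in one line with np.vander; a brief sketch (the function name is hypothetical):

```python
import numpy as np

def polynomial_design(z, p):
    # X_ij = psi_j(z_i) = z_i^(j-1), j = 1, ..., p
    return np.vander(np.asarray(z), N=p, increasing=True)

print(polynomial_design([0.0, 1.0, 2.0], 3))
# rows are (1, z, z^2): [[1, 0, 0], [1, 1, 1], [1, 2, 4]]
```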

8.3.1.2 Linear Spline Regression

Let the notation (· )+ denote the positive part of a real number,

(t)+ = max{0, t}.

Again suppose a single observable measurement z ∈ R and now consider the basis
functions
ψ j (z) = (z − τ j )+ , j = 1, . . . , p, (8.16)

for a sequence of p real values τ1 < τ2 < . . . < τ p , referred to as knot points. Basis
functions of the type (8.16) are known as linear splines, since ψ j (z) is zero up until
the value τ j , and a linear function of z − τ j thereafter.
Taking a linear combination of linear spline basis functions gives a regression
function which is piecewise linear, with changes in gradient occurring at each of the
knot points but no discontinuities. Spline regression models are explored in more
detail in Sect. 10.3.
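The linear spline basis (8.16) is equally simple to assemble into a design matrix; a sketch with hypothetical knot points:

```python
import numpy as np

def spline_design(z, knots):
    # X_ij = psi_j(z_i) = (z_i - tau_j)_+, one column per knot point
    z = np.asarray(z, dtype=float)[:, None]
    tau = np.asarray(knots, dtype=float)[None, :]
    return np.maximum(z - tau, 0.0)

print(spline_design([0.0, 1.5, 3.0], [1.0, 2.0]))
# [[0.  0. ]
#  [0.5 0. ]
#  [2.  1. ]]
```

Columns of this matrix can be combined with an intercept and a global linear term if a non-zero function below the first knot is required.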

8.4 Generalised Linear Models

The Bayes linear model presented in Sects. 8.2 and 8.3 is mathematically very convenient, but is only suitable for cases where the response variables can be assumed to be normally distributed and where the linear regression function xi · β corresponds directly to the expected value of the response variable yi (8.3).

Generalised linear models extend the linear model to other exponential family distributions for the response variable through the introduction of an invertible function
g called the link function, such that

g {E(yi | xi , β)} = xi · β,

or, equivalently,
E(yi | xi , β) = g −1 (xi · β). (8.17)

Remark 8.7 The link function generalises the standard linear regression expectation
(8.3), which is clearly a special case of (8.17) with the identity link function.

Remark 8.8 One advantage of using a link function is to guarantee the expected
value of the response (8.17) lies in the correct domain, without requiring any constraints on the possible values which the covariates xi or the regression coefficients
might take.

Two examples of generalised linear models are now briefly presented, where the
response variable is either a non-negative integer count or a binary indicator. In both
cases, a zero-mean normal distribution prior (8.6) with V = I p is assumed for the
regression coefficients, which was shown in (8.7) to imply a half-t distribution on the Euclidean norm of the coefficients.

8.4.1 Poisson Regression

Suppose each response yi ∈ N = {0, 1, 2, . . .} is a non-negative integer count. Further assume each count follows a Poisson distribution, with an expected value which is believed to be linearly dependent on p ≥ 1 covariates xi ∈ R p through the link function log(· ). These assumptions imply

log E(yi | xi , β) = xi · β,

for some β ∈ R p and

yi | xi , β ∼ Poisson(exp(xi · β))
=⇒ p(yi | xi , β) = exp{yi (xi · β) − exp(xi · β)}/yi !. (8.18)

8.4.1.1 Stan Implementation

The Stan implementation of Poisson regression extends the model (8.18) slightly by
assuming the presence of a variable intercept term in the linear model,

yi | xi , α, β ∼ Poisson(exp(αi + xi · β)),

for α = (α1 , . . . , αn ) ∈ Rn . However, these parameters can be fixed at zero to recover the standard representation (8.18).

Remark 8.9 In some non-Bayesian statistical texts, the intercept parameters αi would be referred to as random effects, since they differ between individual response
variables, whilst the slope parameters β would be referred to as fixed effects.

The following Stan code (poisson_regression.stan) implements a Poisson regression model with p covariates and no intercept.

// poisson_regression.stan
data {
int<lower=0> n; // number of observations
int<lower=0> p; // number of covariates
int<lower=0> m; // number of grid points
int<lower=0> y[n]; // response variables
matrix[n,p] X; // matrix of covariates
matrix[m,p] grid; // matrix of grid points
real<lower=0> a;
real<lower=0> b;
}
transformed data {
real t_c = (2*a+p-1)/(2*b);
}
parameters {
vector[p] beta;
}
model {
sqrt(dot_self(beta)*t_c) ˜ student_t(2*a, 0, 1);
target += poisson_log_glm_lpmf( y | X, 0, beta );
}
generated quantities {
vector[m] fn_vals;
for (i in 1:m)
fn_vals[i] = exp( dot_product(beta,grid[i]) );
}

The generated quantities{} block declares a vector of values for evaluating the
regression function pointwise over a vector of grid points which are inputs in the
data{} block.
The following PyStan code (poisson_regression_stan.py) simulates
data from a Poisson regression model with a single covariate and then seeks to infer
posterior beliefs about the value of the regression coefficient using
poisson_regression.stan. Both here and in Sect. 8.4.2.1, the plots show
the sampled data and the posterior mean regression function obtained from point-
wise evaluation during posterior sampling, and then the posterior density of the single
coefficient β.

#! /usr/bin/env python
## poisson_regression_stan.py
import stan
import numpy as np
import matplotlib.pyplot as plt
# Simulate data
gen = np.random.default_rng(seed=0)
n = 25
m = 50
T = 10
x = np.linspace(start=0, stop=T, num=n)
grid = np.linspace(start=0, stop=T, num=m)
beta = .5#gen.normal()
y = [gen.poisson(np.exp(x_i*beta)) for x_i in x]
sm_data = {'n':n, 'p':1, 'a':1, 'b':0.5, 'X':x.reshape((n,1)), 'y':y, 'm':m, 'grid':grid.reshape((m,1))}
# Initialise stan object
with open('poisson_regression.stan','r',newline='') as f:
sm = stan.build(f.read(),sm_data,random_seed=1)
# Select the number of MCMC chains and iterations, then sample
chains, samples, burn = 2, 10000, 1000
fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)
# Plot regression function and posterior for beta
fig,axs=plt.subplots(1,2,figsize=(10,4),constrained_layout=True)
fig.canvas.manager.set_window_title('Poisson regression posterior')
f = np.mean(fit['fn_vals'],axis=1)
true_f = [np.exp(beta*x_i) for x_i in grid]
b = fit['beta'][0]
axs[0].plot(grid,f)
axs[0].plot(grid,true_f, color='c', lw=2, linestyle='--')
axs[0].scatter(x,y, color='black')
axs[0].set_title('Posterior mean regression function')
axs[0].set_xlabel(r'$x$')
h = axs[1].hist(b,200, density=True);
axs[1].axvline(beta, color='c', lw=2, linestyle='--')
axs[1].set_title('Approximate posterior density of '+r'$\beta$')
axs[1].set_xlabel(r'$\beta$')
plt.show()

8.4.2 Logistic Regression

Suppose each response yi ∈ {0, 1} is a Bernoulli indicator variable with a “success” probability (or equivalently, expected value) which is believed to be linearly dependent on p ≥ 1 covariates xi ∈ R p through a logistic link function log{· /(1 − · )}. These assumptions imply

log[E(yi | xi , β)/{1 − E(yi | xi , β)}] = xi · β,

for some β ∈ R p and

yi | xi , β ∼ Bernoulli({1 + exp(−xi · β)}−1 )
=⇒ p(yi | xi , β) = {1 + exp((−1)^{yi} xi · β)}−1 .

8.4.2.1 Stan Implementation

The following Stan code (logistic_regression.stan) and PyStan code (logistic_regression_stan.py) implement the logistic regression model in a directly analogous way to Poisson regression in Sect. 8.4.1.1.

// logistic_regression.stan
data {
int<lower=0> n; // number of observations
int<lower=0> p; // number of covariates
int<lower=0> m; // number of grid points
int<lower=0,upper=1> y[n]; // response variables
matrix[n,p] X; // matrix of covariates
matrix[m,p] grid; // matrix of grid points
real<lower=0> a;
real<lower=0> b;
}
transformed data {
real t_c = (2*a+p-1)/(2*b);
}
parameters {
vector[p] beta;
}
model {
sqrt(dot_self(beta)*t_c) ˜ student_t(2*a, 0, 1);
y ˜ bernoulli_logit( X * beta );
}
generated quantities {
vector[m] fn_vals;
for (i in 1:m)
fn_vals[i] = inv_logit( dot_product(beta,grid[i]) );
}

#! /usr/bin/env python
## logistic_regression_stan.py
import stan
import numpy as np
import matplotlib.pyplot as plt
# Simulate data
gen = np.random.default_rng(seed=0)
n = 25
m = 50
T = 5
x = np.linspace(start=-T, stop=T, num=n)
grid = np.linspace(start=-T, stop=T, num=m)
beta = .5#gen.normal()
y = [gen.binomial(1,1/(1+np.exp(-x_i*beta))) for x_i in x]
sm_data = {'n':n, 'p':1, 'a':1, 'b':0.5, 'X':x.reshape((n,1)), 'y':y, 'm':m, 'grid':grid.reshape((m,1))}
# Initialise stan object
with open('logistic_regression.stan','r',newline='') as f:
sm = stan.build(f.read(),sm_data,random_seed=1)
# Select the number of MCMC chains and iterations, then sample
chains, samples, burn = 2, 10000, 1000
fit=sm.sample(num_chains=chains, num_samples=samples, num_warmup=burn, save_warmup=False)
# Plot regression function and posterior for beta
fig,axs=plt.subplots(1,2,figsize=(10,4),constrained_layout=True)
fig.canvas.manager.set_window_title('Logistic regression posterior')
f = np.mean(fit['fn_vals'],axis=1)
true_f = [1.0/(1+np.exp(-beta*x_i)) for x_i in grid]
b = fit['beta'][0]
axs[0].plot(grid,f)
axs[0].plot(grid,true_f, color='c', lw=2, linestyle='--')
axs[0].scatter(x,y, color='black')
axs[0].set_title('Posterior mean regression function')
axs[0].set_xlabel(r'$x$')
h = axs[1].hist(b,200, density=True);
axs[1].axvline(beta, color='c', lw=2, linestyle='--')
axs[1].set_title('Approximate posterior density of '+r'$\beta$')
axs[1].set_xlabel(r'$\beta$')
plt.show()
Chapter 9
Nonparametric Models

Parametric probability models provide convenient mathematical structures for approximating an individual’s uncertain beliefs. For example, simple probability
distributions with a small number of parameters for modelling exchangeable ran-
dom quantities (Chap. 4) or a linear model for regression-exchangeable observations
of a response variable (Chap. 8). The appealing simplicity of parametric models
also carries a severe limitation: having assumed a parametric model, no amount of
observed data can undermine the assumed certainty that the probability distribution
or regression function takes that parametric form with probability one. For small sample size problems, this limitation can often seem acceptable, but for larger sample
sizes the opportunity for learning potentially more complex underlying relationships
grows and parametric models can become prohibitively restrictive.
More flexible modelling paradigms with the capacity to increase in complexity
with increasing sample size are often referred to as nonparametric methods. This
name can appear somewhat misleading, as these methods typically allow access to
a potentially infinite number of parameters to provide this growth in complexity.
However, the term is used to imply modelling freedom away from assuming a fixed,
finite-dimensional parametric form.
The contrast between the two modelling paradigms is stark. Parametric models
place probability one on a particular parametric functional form being true. Non-
parametric models assume no such fixed relationship, but instead seek to spread
probability mass across a much larger region of appropriate function space, such that
positive mass will be assigned to arbitrarily small neighbourhoods surrounding any
unknown true underlying function belonging to a much broader function class.
The higher complexity of nonparametric models can lead to a loss of analytic
tractability or an increase in computational burden when performing Bayesian infer-
ence. However, there are some notable exceptions, and the next two chapters pro-
vide an overview of some popular nonparametric formulations which can be readily

The original version of this chapter has been revised due to typographic errors. The corrections to
this chapter can be found at https://doi.org/10.1007/978-3-030-82808-0_12
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021,
corrected publication 2022
N. Heard, An Introduction to Bayesian Inference,
Methods and Computation, https://doi.org/10.1007/978-3-030-82808-0_9

deployed in practical applications, either for modelling probability distributions in


the present chapter or regression functions in Chap. 10.

9.1 Random Probability Measures

Recall back in Chap. 2 the generalisation of De Finetti’s representation theorem given


in Theorem 2.2 for an infinitely exchangeable sequence X 1 , X 2 , . . . taking values in
a space X. Necessarily,

$$P_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i)\,\mathrm{d}Q(F) \qquad (9.1)$$

for some probability measure Q on probability distributions F on X .


Section 2.2 immediately proceeded to consider a parametric interpretation, with
F = F(· ; θ ) and Q = Q(θ ) for some finite-dimensional parameter. However, a non-
parametric interpretation is also possible, with Q interpreted as a probability measure
over a wider class of probability distribution functions F.
The following Bayesian nonparametric models for random measures provide dif-
ferent specifications for the prior measure Q in (9.1), each placing mass on probability
distributions with a potentially infinite number of parameters. In each case, Bayesian
inference will be examined for an exchangeable sample x 1 , . . . , xn drawn from the
unknown distribution F on the space X .

9.2 Dirichlet Processes

The Dirichlet process (Ferguson 1973) is a conjugate prior for Bayesian inference
about an unknown probability distribution function F.
Definition 9.1 (Dirichlet process) Let α > 0 and let P0 be a probability measure on
X (with distribution function F0 ). A random probability measure P (with distribution
function F) is said to be a Dirichlet process with base measure α · P0 , written P ∼
DP(α · P0 ), if for every (measurable) finite partition B1 , . . . , Bk of X ,

(P(B1 ), . . . , P(Bk )) ∼ Dirichlet(α P0 (B1 ), . . . , α P0 (Bk )). (9.2)

Remark 9.1 The base measure P0 of the Dirichlet process is also the mean, such
that for every (measurable) subset B ⊆ X ,

E{P(B)} = P0 (B).

The concentration parameter α determines the variance

$$\mathbb{V}\{P(B)\} = \frac{P_0(B)\{1 - P_0(B)\}}{\alpha + 1}. \qquad (9.3)$$

Fig. 9.1 Illustration of first three iterations of the stick-breaking process

Remark 9.2 A draw from any Dirichlet process is a discrete distribution with prob-
ability 1, even if the base measure is continuous.

Since all sampled probability distributions from DP(α · P0 ) are discrete, it is


possible to equivalently state the conditions for a Dirichlet process as a generative
model for random probability mass functions. This generating process uses a so-called stick-breaking construction.

Definition 9.2 (Stick-breaking process) Let π = (π₁, π₂, . . .) be an infinite random sequence of probabilities such that $\sum_{j=1}^{\infty} \pi_j = 1$. Then π is defined as a stick-breaking process if

$$\pi_j = \gamma_j \prod_{k=1}^{j-1} (1 - \gamma_k), \qquad (9.4)$$

where γ1 , γ2 , . . . are an infinite sequence of independent random variables in [0, 1].

Definition 9.3 (Griffiths-Engen-McCloskey distribution) A stick-breaking process


π (9.4) follows a Griffiths-Engen-McCloskey distribution with parameter α > 0,
written π ∼ GEM(α), if γk ∼ Beta(1, α) for all k.

Remark 9.3 The stick-breaking analogy for Definition 9.2 envisages successively
breaking into pieces a stick of unit length, each time snapping off and laying down
a section and then continuing to break the remaining piece of stick. For a GEM(α)
distribution in Definition 9.3, at each break point, the proportion of remaining stick
broken off and placed down follows a Beta(1, α) distribution. The procedure is
illustrated in Fig. 9.1.
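The recursion in (9.4) is straightforward to simulate. The following sketch (not from the book; the truncation length is an arbitrary choice) draws the first few weights of a GEM(α) sequence using only the standard library:

```python
import random

def gem_weights(alpha, num_weights, rng=random):
    """Draw the first `num_weights` stick-breaking weights of a GEM(alpha) sequence."""
    weights = []
    remaining = 1.0  # length of stick still unbroken
    for _ in range(num_weights):
        gamma = rng.betavariate(1.0, alpha)  # proportion of remaining stick snapped off
        weights.append(gamma * remaining)
        remaining *= 1.0 - gamma
    return weights

random.seed(1)
w = gem_weights(alpha=5.0, num_weights=50)
```

The truncated weights sum to less than one; the deficit is exactly the mass still carried by the unbroken remainder of the stick.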

Proposition 9.1 If P ∼ DP(α · P₀), then the corresponding mass function satisfies

$$p(x) = \sum_{j=1}^{\infty} w_j\, 1_{\{\theta_j\}}(x), \qquad (9.5)$$

where the atoms of mass are independently drawn from the base measure,

θ1 , θ2 , . . . ∼ P0 ,

and the masses w j are obtained from a stick-breaking process with a Griffiths-Engen-
McCloskey distribution,
(w1 , w2 , . . .) ∼ GEM(α). (9.6)

Proof See Sethuraman (1994).

It was noted above that the Dirichlet process is a conjugate prior for an unknown
probability distribution. This is now demonstrated in the following proposition.

Proposition 9.2 Conjugacy of Dirichlet process. Suppose x = (x1 , . . . , xn ) are n


independent samples from P and P ∼ DP(α · P₀). For (P₀-measurable) B ⊆ X, let

$$\hat{P}_n(B) = \frac{1}{n} \sum_{i=1}^{n} 1_B(x_i)$$

be the empirical measure of the samples x, and let αn∗ = α + n, πn∗ = α/αn∗ and

P∗n (B) = πn∗ P0 (B) + (1 − πn∗ )P̂n (B). (9.7)

Then
P | x ∼ DP(αn∗ · P∗n ).

Proof For a finite partition of (measurable) X subsets B1 , . . . , Bk , the Dirichlet


distribution prior (9.2) has density function

$$p(P(B_1),\ldots,P(B_k)) = \frac{\Gamma(\alpha)}{\prod_{j=1}^{k} \Gamma\{\alpha P_0(B_j)\}} \prod_{j=1}^{k} P(B_j)^{\alpha P_0(B_j) - 1}.$$

Let $n_j = \sum_{i=1}^{n} 1_{B_j}(x_i)$ be the number of samples falling inside B_j. Then the joint density of x is

$$p(x \mid P) = \prod_{j=1}^{k} P(B_j)^{n_j}$$

and hence the posterior density is


$$p(P(B_1),\ldots,P(B_k) \mid x) \propto p(P(B_1),\ldots,P(B_k))\, p(x \mid P) \propto \prod_{j=1}^{k} P(B_j)^{\alpha P_0(B_j) + n_j - 1},$$

corresponding to the Dirichlet(α_n^* P_n^*(B₁), . . . , α_n^* P_n^*(B_k)) distribution.

Remark 9.4 In Proposition 9.2, as n → ∞ then αn∗ → ∞, meaning the variance


(9.3) of the Dirichlet process posterior shrinks to zero. Furthermore, the weight
πn∗ → 0 and hence the posterior mean (9.7) converges to the empirical measure,
P∗n → P̂n .

9.2.1 Discrete Base Measure

It follows from Proposition 4.1 that independent observations from an unknown dis-
tribution function have a closed-form marginal likelihood under a Dirichlet process
prior. The form of the marginal likelihood is most straightforward when the base
measure P0 is discrete, with corresponding probability mass function p0 .

Exercise 9.1 (Dirichlet process marginal likelihood) Suppose x = (x1 , . . . , xn )


are n independent samples from P and P ∼ DP(α · P0 ). If P0 is discrete, show
that x has marginal probability mass function

$$p(x) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{i=1}^{n} \Big\{ \alpha\, p_0(x_i) + \sum_{j<i} 1_{\{x_i\}}(x_j) \Big\}. \qquad (9.8)$$

Perhaps the most revealing formulation of the Dirichlet process arises from con-
sidering the predictive distribution of a further random sample.

Corollary 9.1 The predictive distribution for a new observation xn+1 drawn from
the same unknown distribution is

$$p(x_{n+1} \mid x) = \frac{\alpha\, p_0(x_{n+1}) + \sum_{i=1}^{n} 1_{\{x_{n+1}\}}(x_i)}{\alpha + n}. \qquad (9.9)$$

Proof This follows immediately by expressing the predictive distribution as the ratio
of the respective joint distributions (9.8) for (x, xn+1 ) and x.

Remark 9.5 The form of the predictive distribution (9.9) has a clear interpretation
that a further sample xn+1 can be viewed as a draw from the following mixture distri-
bution: with probability α/(α + n), a new value is sampled from the base distribution
P0 , and with the remaining probability a repeated value is sampled from the empirical
distribution of values observed so far. The concentration parameter can therefore be
interpreted as a notional prior sample size reflecting the base measure.

Remark 9.6 Following the sequential sampling procedure (9.9), the number of sam-
ples from the base distribution P0 follows a so-called Chinese restaurant table dis-
tribution. After n samples, from Teh (2010), for example, this distribution is known
to have expected value

α{ψ0 (α + n) − ψ0 (α)} ≈ α log(1 + n/α),

where ψ0 (· ) denotes the digamma function, defined to be the gradient of log Γ (· ).
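The approximation in Remark 9.6 can be checked by simulating the sequential scheme of (9.9) and counting how often a fresh draw from P₀ is taken; a rough sketch follows, where the simulation sizes are arbitrary choices:

```python
import math
import random

def simulate_num_base_draws(alpha, n, rng=random):
    """Simulate the sequential scheme (9.9): the i-th sample is fresh from P0
    with probability alpha/(alpha + i - 1), else a repeat of an earlier value."""
    count = 0
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            count += 1
    return count

random.seed(0)
alpha, n, reps = 5.0, 1000, 200
mean_draws = sum(simulate_num_base_draws(alpha, n) for _ in range(reps)) / reps
approx = alpha * math.log(1 + n / alpha)  # ≈ 26.5 for these values
```

The Monte Carlo average of the number of base-measure draws should land close to the α log(1 + n/α) approximation quoted above.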


 Exercise 9.2 (Dirichlet process sampling) Write computer code (using a
language such as Python) to sample a random probability mass function from
a Dirichlet process using a geometric distribution base measure with parameter
0.01. Plot three sampled probability mass functions obtained from setting
α = 10, 1000, 100000, respectively.
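A plotting-free sketch of the sampling step asked for in this exercise might truncate the stick-breaking representation (9.5) once the unbroken stick carries negligible mass; the tolerance and seed below are illustrative choices, and the geometric base measure is sampled by direct trial:

```python
import random

def sample_dp_pmf(alpha, p_geom=0.01, tol=1e-6, rng=random):
    """Sample an (approximate) pmf from DP(alpha * Geometric(p_geom)) by
    truncating the stick-breaking construction of Proposition 9.1 once the
    unbroken stick has mass below `tol`."""
    pmf = {}
    remaining = 1.0
    while remaining > tol:
        gamma = rng.betavariate(1.0, alpha)   # GEM(alpha) stick break
        weight = gamma * remaining
        remaining *= 1.0 - gamma
        atom = 1                               # draw atom from Geometric(p_geom), support 1, 2, ...
        while rng.random() >= p_geom:
            atom += 1
        pmf[atom] = pmf.get(atom, 0.0) + weight
    return pmf

random.seed(2)
pmf = sample_dp_pmf(alpha=10.0)
```

Repeating this for α = 10, 1000, 100000 and plotting `pmf` would show the sampled distributions concentrating ever more tightly around the geometric base measure as α grows.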

9.3 Polya Trees

Polya trees (Mauldin et al. 1992) are a more general class of nonparametric models
for random measures which can support both continuous and discrete distributions.
For real-valued random variables, Polya trees are defined on an infinite sequence of
recursive partitions of a subset of the real line B ⊆ R.
Definition 9.4 (Binary sequences) Let E₀ = ∅ and for m ≥ 1, define

$$E_m := \{0,1\}^m, \qquad E := \cup_{m=0}^{\infty} E_m,$$

such that E_m is the set of all length-m binary sequences and E is the set of all finite-length binary sequences.
Definition 9.5 (Binary tree of partitions) A set Π = {π₀, π₁, . . .} of nested partitions of B is said to be a binary tree of partitions if |π_m| = 2^m. Clearly π₀ = {B}, and since
the partitions are nested, the sets in each partition πm can be indexed by elements of
E m in such a way that, for all e ∈ E m−1 ,

Be = Be0 ∪ Be1

for the set Be ∈ πm−1 and the two sets Be0 , Be1 ∈ πm .
Remark 9.7 A natural binary tree of partitions of the unit interval B = (0, 1) is π₀ = {(0, 1)} and

$$\pi_m = \Big\{ \Big( \sum_{j=1}^{m} e_j/2^j,\; 1/2^m + \sum_{j=1}^{m} e_j/2^j \Big) \;\Big|\; (e_1,\ldots,e_m) \in E_m \Big\}, \quad m > 0,$$

illustrated in Fig. 9.2.



π0 : B

π1 : B0 B1

π2 : B00 B01 B10 B11

π3 : B000 B001 B010 B011 B100 B101 B110 B111

π4 : B0000 B0001 B0010 B0011 B0100 B0101 B0110 B0111 B1000 B1001 B1010 B1011 B1100 B1101 B1110 B1111

Fig. 9.2 The first five layers of an infinite sequence of recursive partitions. The shaded regions
show the path through the preceding layers to an example set B0110 in π4

Remark 9.8 If B = R and F₀ is a continuous distribution function, corresponding partitions of the real line can be obtained by applying an inverse transformation to partitions of the unit interval:

$$\pi_m = \Big\{ \Big( F_0^{-1}\Big(\sum_{j=1}^{m} e_j/2^j\Big),\; F_0^{-1}\Big(1/2^m + \sum_{j=1}^{m} e_j/2^j\Big) \Big) \;\Big|\; (e_1,\ldots,e_m) \in E_m \Big\}, \quad m > 0. \qquad (9.10)$$

Given a binary tree of partitions {π₀, π₁, . . .} of B and an element x ∈ B, for each partition level m define ε_m(x) to be the unique length-m binary sequence e ∈ E_m such that x ∈ B_e ∈ π_m.

Exercise 9.3 (Binary partition index) Suppose an F₀-centred sequence of partitions (9.10) with F₀(x) = Φ(x), the standard normal cumulative distribution function. Evaluate ε₆(1.5).
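Since ε_m(x) is just the first m binary digits of F₀(x), the exercise can be checked numerically. The sketch below uses the standard library's `math.erf` for Φ; the helper names are invented for illustration:

```python
import math

def std_normal_cdf(x):
    """Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def partition_index(x, m, cdf=std_normal_cdf):
    """Return epsilon_m(x): the length-m binary sequence indexing the set in the
    m-th partition layer of (9.10) that contains x."""
    u = cdf(x)
    digits = []
    for _ in range(m):
        u *= 2.0
        bit = int(u >= 1.0)  # next binary digit of F0(x)
        digits.append(bit)
        u -= bit
    return tuple(digits)

eps = partition_index(1.5, 6)  # first six binary digits of Phi(1.5) ≈ 0.9332
```

Running this gives ε₆(1.5) = (1, 1, 1, 0, 1, 1), the binary expansion of Φ(1.5) truncated at six digits.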

Definition 9.6 (Splitting probabilities) Suppose P is a probability measure on B, and Π a binary tree of partitions of B. Let e = (e₁, . . . , e_m) ∈ E_m and B_e ∈ π_m ∈ Π. Then since B_{e₁...e_j} ⊆ B_{e₁...e_{j−1}} for all j ≤ m, the probability P(B_e) can be factorised as

$$P(B_e) = \prod_{j=1}^{m} P(B_{e_1 \ldots e_j} \mid B_{e_1 \ldots e_{j-1}}). \qquad (9.11)$$

The conditional probabilities in (9.11) are known as splitting probabilities.

Definition 9.7 (Polya tree) Let Π be a binary tree of partitions and suppose A = {α_e | e ∈ E} is a corresponding set of positive constants α_e > 0 defined for all partition layers in Π. For a random probability measure P, if for all m > 0 and (e₁, . . . , e_m) ∈ E_m the splitting probabilities satisfy

$$P(B_{e_1 \ldots e_m} \mid B_{e_1 \ldots e_{m-1}}) \sim \mathrm{Beta}(\alpha_{e_1 \ldots e_{m-1} e_m},\, \alpha_{e_1 \ldots e_{m-1} (1-e_m)}),$$

then P is said to have a Polya tree distribution, written P ∼ PT(Π, A).




Remark 9.9 The Dirichlet process from Sect. 9.2 is a special case of the Polya tree
satisfying αe0 + αe1 = αe for all e ∈ E.

Remark 9.10 Polya tree probabilities can be interpreted as products of conditional


probabilities determining the path of a particle cascading down the layers of parti-
tions, with B ⊇ Be1 ⊇ Be1 e2 ⊇ . . .. For example, in Fig. 9.2, the probability of the
highlighted set B0110 is obtained through a product,

P(B0110 ) = P(B0 ) P(B01 | B0 ) P(B011 | B01 ) P(B0110 | B011 ),

where each term (splitting probability) has an independent Beta distribution with
parameters corresponding to that path.

The specification of a binary tree of partitions according to a base probability


measure (9.10) allows the Polya tree distribution to be easily centred around that
distribution.
Proposition 9.3 For a chosen probability measure P₀ with distribution function F₀, suppose P ∼ PT(Π, A) where each π_m ∈ Π satisfies (9.10). If the positive constants A are chosen to be symmetric, such that α_{e0} = α_{e1} for all e ∈ E, then E(P) = P₀.

Proof If α_{e0} = α_{e1} for all e, then by symmetry E{P(B_e)} = 1/2^m for all m > 0 and e ∈ E_m. The result follows from the usual inversion rule for continuous distribution functions.

The conjugacy of the Polya tree prior follows immediately from the conjugacy of
the beta distribution for Bernoulli observations noted in Table 4.1 of Sect. 4.2.

Proposition 9.4 Conjugacy of Polya tree. Suppose x = (x₁, . . . , x_n) are n independent samples from P and P ∼ PT(Π, A). For e ∈ E, let

$$n_e = \sum_{i=1}^{n} 1_{B_e}(x_i)$$

be the number of samples which fall inside the set B_e, and let A_n^* = {α_e + n_e | e ∈ E}. Then

P | x ∼ PT(Π, A_n^*).

Proof For each sample x_i and each non-trivial partition level m > 0, recall ε_{m−1}(x_i) as the unique binary sequence of length m − 1 such that x_i ∈ B_{ε_{m−1}(x_i)}. Conditional on ε_{m−1}(x_i), x_i must fall in either B_{ε_{m−1}(x_i)0} or B_{ε_{m−1}(x_i)1}; from these two possibilities, x_i falls in B_{ε_{m−1}(x_i)e_m} with an unknown, Beta(α_{ε_{m−1}(x_i)e_m}, α_{ε_{m−1}(x_i)(1−e_m)})-distributed probability. Denoting this probability θ,

$$p(\theta \mid x_i \in B_{\varepsilon_{m-1}(x_i) e_m}) \propto p(x_i \in B_{\varepsilon_{m-1}(x_i) e_m} \mid \theta) \times p(\theta) \propto \theta \cdot \theta^{\alpha_{\varepsilon_{m-1}(x_i) e_m} - 1} (1-\theta)^{\alpha_{\varepsilon_{m-1}(x_i)(1-e_m)} - 1} = \theta^{\alpha_{\varepsilon_{m-1}(x_i) e_m}} (1-\theta)^{\alpha_{\varepsilon_{m-1}(x_i)(1-e_m)} - 1},$$

which is proportional to the density of Beta(α_{ε_{m−1}(x_i)e_m} + 1, α_{ε_{m−1}(x_i)(1−e_m)}). The result follows.

9.3.1 Continuous Random Measures

As noted above, Polya trees can be constructed to give probability one to either dis-
crete or continuous distributions. The special case of the Dirichlet process obtained
when αe0 + αe1 = αe for all e exemplifies the discrete case. For guaranteeing con-
tinuous probability distributions, Lavine (1992) showed that “as long as the αe ’s do
not decrease too rapidly with m", P will be continuous; a commonly used choice is α_{e₁...e_m} = αm² for some single parameter α > 0.
As with the Dirichlet process, it follows from Proposition 4.1 that independent
observations from an unknown distribution function have a closed-form marginal
likelihood under a Polya tree prior; in this case, this marginal likelihood is most
straightforward when the base measure P0 is continuous.

Proposition 9.5 Polya tree marginal likelihood. Suppose x = (x1 , . . . , xn ) are n


independent samples from P and P ∼ PT(Π, A). If P₀ = E(P) is continuous with corresponding probability density function p₀, then x has marginal probability density function

$$p(x) = \prod_{i=1}^{n} p_0(x_i) \prod_{j=2}^{n} \prod_{m=1}^{m^*(x,j)} \frac{\big(\alpha_{\varepsilon_m(x_j)} + n_{\varepsilon_m(x_j),\,j}\big)\big(\alpha_{\varepsilon_{m-1}(x_j)0} + \alpha_{\varepsilon_{m-1}(x_j)1}\big)}{\alpha_{\varepsilon_m(x_j)} \big(\alpha_{\varepsilon_{m-1}(x_j)0} + \alpha_{\varepsilon_{m-1}(x_j)1} + n_{\varepsilon_{m-1}(x_j),\,j}\big)},$$

where $n_{e,j} = \sum_{i<j} 1_{B_e}(x_i)$ and $m^*(x,j) = \min\{m > 0 \mid \varepsilon_m(x_i) \neq \varepsilon_m(x_j)\ \forall\, i < j\}$ is the first partition level at which none of the first (j − 1) samples in x lie within the same partition set as x_j.

Proof See Berger and Guglielmi (2001).

Exercise 9.4 (Polya tree sampling) Write computer code (using a language such as Python) to sample a random probability density function from a Polya tree model with α_{e₁...e_m} = αm² and a binary tree of partitions Π centred on F₀(x) = Φ(x). Plot three sampled densities obtained from setting α = 1, 100, 10000, respectively.

9.4 Partition Models

A Polya tree defines a random probability measure P on a fixed collection of nested


partitions of a space B, specifying consistent probabilities at each layer. Partition
models are somewhat simpler, specifying P on a single, unknown partition π .
For each B-subset of the partition π, a relatively simple statistical model is typi-
cally assumed. The nonparametric flexibility of a partition model comes from allow-
ing uncertainty about the partition to extend to the dimension |π|; by not assuming an
upper bound for the size of the partition, P can assume a potentially infinite number
of parameters. A simple analogy is approximating an arbitrarily complex function
with a step function with an unlimited number of steps.

9.4.1 Partition Models: Bayesian Histograms

For simplicity of exposition assume B = [a, b] ⊂ R is an interval on the real line,


and that P is an unknown continuous probability measure on [a, b] with density
p. A histogram on [a, b] can be viewed as a partition model: the interval [a, b] is
divided into bins by a sequence of m ≥ 0 cut points τ , where τ = ∅ when m = 0
and otherwise τ = (τ1 , . . . , τm ) with a ≡ τ0 < τ1 < . . . < τm < τm+1 ≡ b. The cut
points define a corresponding partition πτ = {B1 , . . . , Bm+1 } where B j = [τ j−1 , τ j )
is the jth bin of the histogram.
A histogram assumes constant density within each bin, leading to an overall
piecewise constant density on [a, b] with m steps. Leonard (1973) and Gelman et al.
(2013, p. 545) presented the following Bayesian model for such a density.

Definition 9.9 (Bayesian histogram) Let α > 0 and let P₀ be a probability measure on [a, b]. Let τ be an increasing sequence of m cut points partitioning [a, b] into (m + 1) segments, with corresponding segment probabilities θ = (θ₁, . . . , θ_{m+1}) satisfying $\sum_{j=1}^{m+1} \theta_j = 1$. A Bayesian histogram model for a random probability measure P assumes the following representation for the density p:

$$p(x \mid m, \tau, \theta) = \sum_{j=1}^{m+1} 1_{[\tau_{j-1}, \tau_j)}(x)\, \frac{\theta_j}{\tau_j - \tau_{j-1}},$$
$$\theta \mid m, \tau \sim \mathrm{Dirichlet}\{\alpha P_0([\tau_0, \tau_1)), \ldots, \alpha P_0([\tau_m, \tau_{m+1}))\}. \qquad (9.12)$$

Remark 9.11 The base probability measure P0 from Definition 9.9 is the prior
expectation for P, such that for (P0 -measurable) A ⊆ [a, b], E{P(A)} = P0 (A).

Given samples x1 , . . . , xn from an unknown continuous probability distribution


P, the Bayesian histogram model (9.12) provides another conjugate model for P with
closed-form marginal likelihood.

Proposition 9.6 (Bayesian histogram marginal likelihood) Suppose x = (x₁, . . . , x_n) are n independent samples from an unknown continuous distribution P with density defined by (9.12). Conditional on m and τ, the posterior distribution of θ given x is

$$\theta \mid m, \tau, x \sim \mathrm{Dirichlet}\{\alpha P_0([\tau_0, \tau_1)) + n_1, \ldots, \alpha P_0([\tau_m, \tau_{m+1})) + n_{m+1}\} \qquad (9.13)$$

and x has marginal probability density function

$$p(x \mid m, \tau) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{j=1}^{m+1} \frac{\Gamma\{\alpha P_0([\tau_{j-1}, \tau_j)) + n_j\}}{\Gamma\{\alpha P_0([\tau_{j-1}, \tau_j))\}\, (\tau_j - \tau_{j-1})^{n_j}}, \qquad (9.14)$$

where $n_j = \sum_{i=1}^{n} 1_{[\tau_{j-1}, \tau_j)}(x_i)$ is the number of samples lying in the segment [τ_{j−1}, τ_j).

Proof The likelihood function for (9.12) is

$$p(x \mid m, \tau, \theta) = \prod_{j=1}^{m+1} \left( \frac{\theta_j}{\tau_j - \tau_{j-1}} \right)^{n_j}$$

and the results simply follow from the conjugacy of the multinomial and Dirichlet distributions (cf. Table 4.1).
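The marginal likelihood (9.14) is most safely evaluated on the log scale with `math.lgamma`. The following sketch (not the book's code) assumes, for illustration, a uniform base measure P₀ on [a, b], so that αP₀([τ_{j−1}, τ_j)) = α(τ_j − τ_{j−1})/(b − a):

```python
import math
from bisect import bisect_right

def log_marginal_likelihood(x, cuts, a=0.0, b=1.0, alpha=1.0):
    """Log of (9.14) for data x in [a, b), cut points `cuts`, and a uniform
    (Lebesgue) base measure P0 on [a, b]."""
    edges = [a] + sorted(cuts) + [b]
    counts = [0] * (len(edges) - 1)
    for xi in x:
        counts[bisect_right(edges, xi) - 1] += 1  # bin count n_j
    n = len(x)
    out = math.lgamma(alpha) - math.lgamma(alpha + n)
    for j, n_j in enumerate(counts):
        width = edges[j + 1] - edges[j]
        a_j = alpha * width / (b - a)  # alpha * P0 of segment j
        out += math.lgamma(a_j + n_j) - math.lgamma(a_j) - n_j * math.log(width)
    return out

lml = log_marginal_likelihood([0.1, 0.2, 0.8], cuts=[0.5])
```

With no cut points the log marginal likelihood is exactly zero under this base measure (the data are then uniform on [0, 1] a priori and a posteriori), which gives a convenient sanity check.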

Remark 9.12 Assuming α to be relatively small, the marginal likelihood (9.14) is


highest when the bin counts {n j } are each either very large or very small. Therefore
equal bin counts would not correspond to a good partition; these would be better
modelled with a single bin.

To complete the nonparametric formulation of the Bayesian histogram, a prior


distribution must be assigned to the number and location of the cut points, (m, τ ).
The canonical choice for this assignment is a Poisson process on [a, b] with rate
ν > 0 for the arrivals of cut points, leading to a joint prior density

$$p(m, \tau) = \nu^m \exp\{-\nu(b-a)\}. \qquad (9.15)$$

For any choice of prior density p(m, τ ), the corresponding posterior density for
the number of cut points is given up to proportionality by

$$p(m, \tau \mid x) \propto p(m, \tau)\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{j=1}^{m+1} \frac{\Gamma\{\alpha P_0([\tau_{j-1}, \tau_j)) + n_j\}}{\Gamma\{\alpha P_0([\tau_{j-1}, \tau_j))\}\, (\tau_j - \tau_{j-1})^{n_j}}. \qquad (9.16)$$

Estimation of posterior expectations taken with respect to (9.16) can straightfor-


wardly proceed using (reversible jump) Markov chain Monte Carlo sampling (Green
1995) (cf. Chap. 5).

9.4.2 Bayesian Histograms with Equal Bin Widths

Now consider three simplifications of the histogram model (9.12). First, for simplicity
of presentation and without loss of generality, suppose that the interval of interest is
the unit interval B = [0, 1].
Second, suppose the base measure P0 in Definition 9.9 is the natural default choice
for the unit interval, Lebesgue measure, such that P0 ([τ j−1 , τ j )) = τ j − τ j−1 .
Third, suppose the unknown distribution P is characterised by a partition model
with an unknown number of equally spaced cut points on [0, 1]. To simplify sub-
sequent notation, let m now denote the number of equally sized segments rather
than the number of cut points. Making this assumption is then equivalent to speci-
fying p(m, τ ) through a non-degenerate probability model p(m) for the unknown
number of segments, whilst for m > 1 the conditional distribution p(τ | m) assigns
probability one to the vector τ* with jth element

$$\tau_j^* = \frac{j}{m}, \quad j = 1, \ldots, m-1.$$
With these three conditions, the posterior density (9.16) simplifies to

$$p(m \mid x) \propto p(x, m) = \frac{p(m)\, m^n\, \Gamma(\alpha)}{\Gamma(\alpha + n)\, \Gamma(\alpha/m)^m} \prod_{j=1}^{m} \Gamma\big(\alpha/m + n_j^{(m)}\big), \qquad (9.17)$$

where $n_j^{(m)}$ is the number of samples lying between (j − 1)/m and j/m.
Also, under this simplified model it follows from (9.13) that after marginalising θ, the posterior predictive density conditional on m satisfies

$$p(x \mid m, \mathbf{x}) = \sum_{j=1}^{m} 1_{[j-1,\, j)}(m x)\, \frac{\alpha + m\, n_j^{(m)}}{\alpha + n}. \qquad (9.18)$$

Model averaging (9.18) with respect to the posterior distribution for m obtains the marginal predictive density

$$p(x \mid \mathbf{x}) \propto \sum_{m=1}^{\infty} \frac{p(m)\, m^n\, \Gamma(\alpha) \prod_{j'=1}^{m} \Gamma\big(\alpha/m + n_{j'}^{(m)}\big)}{\Gamma(\alpha + n)\, \Gamma(\alpha/m)^m} \sum_{j=1}^{m} 1_{[j-1,\, j)}(m x)\, \frac{\alpha + m\, n_j^{(m)}}{\alpha + n}, \qquad (9.19)$$

with constant of proportionality given by the reciprocal of the marginal likelihood p(x). The predictive density (9.19) could be estimated by a finite approximation of the outer sum.

Remark 9.13 Relaxing the first two assumptions of this section and returning to a general base measure P₀ on [a, b] with corresponding distribution function F₀, the same principle of equal bin-width histogram modelling could equally be applied on the F₀-scale, such that segment j is the interval [τ*_{j−1}, τ*_j) where τ*_j = F₀⁻¹(j/m). This is somewhat analogous to the Polya tree partition of (9.10). Then, for example,

$$p(m \mid x) \propto \frac{p(m)\, \Gamma(\alpha)}{\Gamma(\alpha + n)\, \Gamma(\alpha/m)^m} \prod_{j=1}^{m} \frac{\Gamma\big(\alpha/m + n_j^{(m)}\big)}{(\tau_j^* - \tau_{j-1}^*)^{n_j^{(m)}}}.$$

9.4.2.1 Approximate Inference

The simplicity of inference for the Bayesian histogram with equal bin widths (and
using Lebesgue measure as the base measure) was illustrated by the joint density
p(x, m) (9.17). With just a single unknown parameter m, it is feasible to take a finite
sum approximation of the marginal likelihood,

$$p(x) = \sum_{m=1}^{\infty} p(x, m), \qquad (9.20)$$

by terminating the summation at a suitably large value of m. A useful approximation


of p(m | x) is thereby obtained from the ratio of (9.17) and (9.20). Access to the
discrete posterior distribution for m allows direct posterior inference without resort-
ing to computational methods (cf. Chap. 5). For example, it is straightforward to
calculate posterior expectations for functions of interest as simple weighted sums.
Assuming a relatively uninformative geometric prior distribution for m, the fol-
lowing Python code (bayesian_histogram.py) obtains the maximum a poste-
riori value of m in this setting, and more importantly illustrates model averaging (cf.
Sect. 7.2), marginalising over m to obtain the posterior expectation of the unknown
density function.
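The listing itself is not reproduced in this extract. A minimal sketch of the computations it is described as performing — the joint density (9.17) under a geometric prior on m, and the posterior over m from a truncation of (9.20) — could look as follows; the prior parameter `q` and the truncation point are illustrative choices:

```python
import math

def log_joint(x, m, alpha=1.0, q=0.1):
    """log p(x, m) from (9.17), with geometric prior p(m) = q(1-q)^(m-1) and data in [0, 1)."""
    n = len(x)
    counts = [0] * m
    for xi in x:
        counts[min(int(xi * m), m - 1)] += 1  # n_j^(m)
    out = math.log(q) + (m - 1) * math.log(1 - q)
    out += n * math.log(m) + math.lgamma(alpha) - math.lgamma(alpha + n) - m * math.lgamma(alpha / m)
    out += sum(math.lgamma(alpha / m + c) for c in counts)
    return out

def posterior_m(x, m_max=100, alpha=1.0, q=0.1):
    """Approximate p(m | x) by truncating the sum (9.20) at m_max."""
    logs = [log_joint(x, m, alpha, q) for m in range(1, m_max + 1)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

The maximum a posteriori m is then `1 + post.index(max(post))`, and the model-averaged density follows by weighting (9.18) with these posterior probabilities.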

Inference under this model is illustrated by the following Python code


(bayes_histogram_simulate.py), where 10,000 observations are simulated
from a mixture of two beta distributions. The three plots generated by the code show
the true mixture density (dashed line) and the model-averaged posterior expected
density (solid line), the approximated posterior distribution for m, and the histogram
density which contributed most to the model-averaged density function, correspond-
ing to the maximum a posteriori value of m which was equal to 27.
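The simulation listing is likewise not reproduced in this extract. A plotting-free sketch of a comparable experiment — the mixture weights and beta parameters below are guesses, not the book's — simulates the data and compares an equal-width empirical density on 27 bins (the maximum a posteriori m reported above) with the true mixture density:

```python
import math
import random

random.seed(4)
# Simulate 10,000 observations from an equal-weight mixture of two beta
# distributions (illustrative parameters; the book's exact mixture is not shown here).
data = [random.betavariate(2, 6) if random.random() < 0.5 else random.betavariate(7, 3)
        for _ in range(10000)]

def true_density(x):
    def beta_pdf(x, a, b):
        return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                        + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    return 0.5 * beta_pdf(x, 2, 6) + 0.5 * beta_pdf(x, 7, 3)

# Crude empirical density over m = 27 equal bins
m = 27
counts = [0] * m
for xi in data:
    counts[min(int(xi * m), m - 1)] += 1
empirical = [m * c / len(data) for c in counts]
```

Plotting `empirical` against `true_density` on a grid would reproduce the qualitative picture described in the text.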
Chapter 10
Nonparametric Regression

Chapter 9 introduced the concept of nonparametric modelling, with a focus on


infinite-dimensional parameter representations for unknown probability measures.
In this chapter, attention turns to regression modelling, introduced in Chap. 8 with
the linear model.

10.1 Nonparametric Regression Modelling


Recall from Chap. 8 the regression problem of expressing probabilistic beliefs about
real-valued response variables y1 , y2 , . . . which are thought to statistically depend
on corresponding known p-dimensional covariates x1 , x2 , . . .. Regression exchange-
ability (Definition 8.1) of the response-covariate pairs (yi , xi ) was noted to be the
natural extension of standard exchangeability in this setting, which simplifies the
regression task to learning a common parametric form for the conditional distribu-
tion of yi | xi .
The linear model was shown in Chap. 8 to be a highly flexible parametric like-
lihood model in the regression-exchangeable framework. The models presented in
this chapter provide further flexibility through infinite-dimensional representations
of regression functions f : x → E(y | x); these will include natural extensions of
the linear model which have unbounded dimension.
As with nonparametric probability models, nonparametric regression models
should allow arbitrarily close representations of functions f from a wider class
of regression functions than fixed parametric forms allow. The first such example
considered is the Gaussian process, popularised for its analytical tractability and its
close relationship with the linear model.


10.2 Gaussian Processes


Consider the standard regression problem of making inference about an unknown
function f : X → R defined on a space X ⊆ R p for some p ≥ 1. A Gaussian
process prior distribution for f assumes a multivariate normal distribution for the
function values f (x) = ( f (x1 ), . . . , f (xn )) at any finite collection of points x =
(x1 , . . . , xn ) in X , according to the following specification.

Definition 10.1 (Kernel function) A symmetric function k : X × X → R is a positive semidefinite kernel if, for all x₁, . . . , x_n ∈ X and c₁, . . . , c_n ∈ R,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j\, k(x_i, x_j) \geq 0.$$

If k(x, x′) is a function of x − x′ then the kernel is said to be stationary; if k(x, x′) is a function of |x − x′| then the kernel is isotropic.

Example 10.1 (Example kernel functions) The following examples satisfy the positive semidefinite requirement of Definition 10.1 for α, ρ > 0.
• Dot product/linear:
$$k(x, x') = \alpha^2 (x \cdot x').$$
• Squared exponential/radial basis function:
$$k(x, x') = \alpha^2 \exp\big(-\|x - x'\|^2 / (2\rho^2)\big). \qquad (10.1)$$
• Exponential:
$$k(x, x') = \alpha^2 \exp(-\|x - x'\| / \rho). \qquad (10.2)$$
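These kernels are one-liners in code; a sketch for scalar inputs follows (for vector inputs the differences would be replaced by Euclidean norms):

```python
import math

def linear_kernel(x, y, alpha=1.0):
    """Dot product kernel (10.1's companion), scalar inputs."""
    return alpha ** 2 * (x * y)

def squared_exponential_kernel(x, y, alpha=1.0, rho=1.0):
    """Squared exponential / RBF kernel (10.1)."""
    return alpha ** 2 * math.exp(-0.5 * (x - y) ** 2 / rho ** 2)

def exponential_kernel(x, y, alpha=1.0, rho=1.0):
    """Exponential kernel (10.2)."""
    return alpha ** 2 * math.exp(-abs(x - y) / rho)
```

Each function is symmetric in its two arguments and equals α² on the diagonal (for the two stationary kernels), matching the variance interpretation of α.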

Definition 10.2 (Gaussian process) Let m : X → R be any function and k : X × X → R be a positive semidefinite kernel. Then { f(x) | x ∈ X } is a Gaussian process with mean function m and covariance function k, written f ∼ GP(m, k), if for any x = (x₁, . . . , x_n),

$$f(\mathbf{x}) \sim \mathrm{Normal}_n\big(m(\mathbf{x}),\, K(\mathbf{x}, \mathbf{x})\big),$$

where

$$K(\mathbf{x}, \mathbf{x}) = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}. \qquad (10.3)$$

Remark 10.1 The squared exponential kernel (10.1) is the most commonly used
kernel in Gaussian process modelling; samples from processes with this kernel are

infinitely differentiable (Rasmussen and Williams 2005, p. 83). For the exponential
kernel (10.2), samples are continuous but not differentiable (Rasmussen and Williams
2005, p. 86).
Exercise 10.1 (Gaussian process closure) Suppose f ∼ GP(m, k) and m ∼ GP(m₀, k₀), where m₀ is any function and k and k₀ are positive semidefinite kernels. Show that marginally,

$$f \sim GP(m_0, k + k_0). \qquad (10.4)$$

10.2.1 Normal Errors


Available information about the function is commonly assumed to be limited
to a finite number of pointwise, typically noisy, real-valued observations y =
(y1 , . . . , yn ) of the function values at domain points x = (x1 , . . . , xn ) in X . If those
observations can be assumed to satisfy

yi = f (xi ) + εi , (10.5)

where the observation errors (ε1 , . . . , εn ) are independent Normal(0, σ 2 ) variables,


then the Gaussian process is a conjugate prior for f .
Proposition 10.1 (Conjugacy of Gaussian process) If independently y_i ∼ Normal(f(x_i), σ²), i = 1, . . . , n, and f ∼ GP(m, k), then the posterior distribution for f is again a Gaussian process,

f | x, y ∼ GP(m*, k*),

where

$$m^*(x) = m(x) + k(x, \mathbf{x})\{K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\}^{-1}(y - m(\mathbf{x})),$$
$$k^*(x, x') = k(x, x') - k(x, \mathbf{x})\{K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\}^{-1} k(\mathbf{x}, x'). \qquad (10.6)$$

Proof The result follows from the conjugacy of normal-normal mixtures exploited in Chap. 8.
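Equations (10.6) amount to a few lines of linear algebra. The following sketch (zero prior mean, squared exponential kernel, all parameter values illustrative, not the book's code) uses numpy:

```python
import numpy as np

def sq_exp(x1, x2, alpha=1.0, rho=1.0):
    """Squared exponential kernel matrix between two 1-d point sets."""
    d = x1[:, None] - x2[None, :]
    return alpha ** 2 * np.exp(-0.5 * d ** 2 / rho ** 2)

def gp_posterior(x, y, x_new, sigma=0.1, alpha=1.0, rho=1.0):
    """Posterior mean and covariance (10.6) at x_new, assuming zero prior mean."""
    K = sq_exp(x, x, alpha, rho) + sigma ** 2 * np.eye(len(x))
    K_sx = sq_exp(x_new, x, alpha, rho)
    K_inv = np.linalg.solve(K, np.eye(len(x)))  # {K + sigma^2 I}^{-1}
    mean = K_sx @ K_inv @ y
    cov = sq_exp(x_new, x_new, alpha, rho) - K_sx @ K_inv @ K_sx.T
    return mean, cov

x = np.array([-1.0, 0.0, 1.0])
y = np.sin(x)
mean, cov = gp_posterior(x, y, np.array([0.0]), sigma=0.01)
```

With small observation noise the posterior mean nearly interpolates the data and the posterior variance collapses towards σ² at the observed points.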
It follows from Proposition 4.1 that observations of an unknown function with
independent Gaussian errors have a closed-form marginal likelihood under a Gaus-
sian process prior.
Proposition 10.2 (Gaussian process marginal likelihood) If f ∼ GP(m, k) and independently y_i ∼ Normal(f(x_i), σ²), i = 1, . . . , n, then y | x has a marginal likelihood satisfying

$$p(y \mid \mathbf{x}, k, m, \sigma) = \frac{\exp\big[-(y - m(\mathbf{x}))^{\top} \{K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\}^{-1} (y - m(\mathbf{x}))/2\big]}{|K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n|^{1/2}\, (2\pi)^{n/2}}. \qquad (10.7)$$

Proof As noted in Rasmussen and Williams (2005, p. 19), the likelihood (10.7) is
obtained directly from observing that y | x ∼ Normaln (m(x), K (x, x) + σ 2 In ).
In a further duality with the linear model, the inverse-gamma distribution can provide a conjugate prior distribution for the error variance σ² if the kernel k can be satisfactorily factorised as k(x, x′) = σ² k′(x, x′) for a kernel k′, such that beliefs about the parameters of k′ do not depend on σ.
Corollary 10.1 Under the conditions of Proposition 10.2, suppose k(x, x′) = σ² k′(x, x′) and correspondingly the matrix K(x, x) = σ² K′(x, x). Assuming the conjugate prior σ⁻² ∼ Gamma(a, b),

a further marginalisation of the likelihood (10.7) is

$$p(y \mid \mathbf{x}, k, m) = \frac{1}{(2\pi)^{n/2}\, |K'(\mathbf{x}, \mathbf{x}) + I_n|^{1/2}} \cdot \frac{\Gamma(a_n)\, b^a}{\Gamma(a)\, b_n^{a_n}}, \qquad (10.8)$$

where

$$a_n = a + n/2, \qquad b_n = b + (y - m(\mathbf{x}))^{\top} \{K'(\mathbf{x}, \mathbf{x}) + I_n\}^{-1} (y - m(\mathbf{x}))/2.$$

Hence (cf. Proposition 8.2),

$$y \mid \mathbf{x}, k, m \sim \mathrm{St}_n\big(2a,\, m(\mathbf{x}),\, b\{K'(\mathbf{x}, \mathbf{x}) + I_n\}/a\big).$$

Exercise 10.2 (Linear model as a Gaussian process) Conditional on σ, express the Bayes linear model with simplified conjugate prior (Sect. 8.2.1),

$$y \sim \mathrm{Normal}_n(X\beta, \sigma^2 I_n), \qquad \beta \sim \mathrm{Normal}_p(0, \sigma^2 \lambda^{-1} I_p),$$

as normal error observations (10.5) of a Gaussian process.

10.2.2 Inference
With normally distributed observation errors leading to closed-form expressions for
the marginal likelihood in Proposition 10.2 and Corollary 10.1, inferential attention
is often primarily focused on the selection of the covariance kernel and the associated
parameters, and secondarily on the mean function (which is often simply assumed
to be zero everywhere).
Remark 10.2 There are two related reasons why a zero mean might safely be
assumed, without significant loss of generality, for Gaussian process modelling of
an unknown function f .

First, if a non-zero mean function m(x) is assumed to be known, then attention can
switch to quantifying uncertainty about the deviation ( f − m) ∼ GP(0, k); inference
about f − m is then based on correspondingly detrended observations (xi , yi ) where
yi = yi − m(xi ), i = 1, . . . , n.
Second, if the mean function m(x) is unknown but can be assumed to also have
a Gaussian process prior with known mean function m 0 (x), then by Exercise 10.1
the marginal distribution for f is again a Gaussian process with mean m 0 (x) and
an additively modified covariance kernel (10.4). The known mean function m 0 (x)
could be subtracted from the observation process and inference about f − m 0 could
again proceed with the assumption of a zero-mean Gaussian process.

For inference on covariance kernel parameters, there are no further analytical


results and computational inferential methods are required, such as Markov chain
Monte Carlo (Sect. 5.3). Fortunately, implementation in the probabilistic program-
ming language Stan (Sect. 6.2) is straightforward, as demonstrated by the following
synthetic regression data example.

Example 10.2 Consider the normal error model from Corollary 10.1 which assumes
an inverse-gamma prior for σ 2 , and suppose the assumed covariance function is the
popular squared exponential covariance kernel (10.1). For simplicity of presentation,
the following Stan code (gp_regression.stan) assumes a univariate unknown
function following a Gaussian process with zero mean function and uninformative,
improper priors for the amplitude parameter α and the length-scale parameter ρ
which determine the squared exponential kernel.

Notably, Stan has an in-built squared exponential covariance function,
cov_exp_quad, which is used twice in the code: first within the functions
block for placing user-defined functions, where the covariance function is required
for obtaining the posterior mean regression function (10.6), and second within the
model block, where the covariance matrix factor (K′(x, x) + I_n) is calculated for
evaluating the likelihood (10.8). For the latter use case, the observation in
Corollary 10.1 that y is marginally multivariate Student's t-distributed is utilised
to obtain a very simple statement for the model block.
The following PyStan code (gp_regression_stan.py) simulates univariate
functional data with independent standard normal errors, where the true underlying
function is
f(x) = 10 + 5 sin(x) + x²/5,   0 ≤ x ≤ 10.   (10.9)
The code then calls gp_regression.stan in order to make posterior inference
about the two parameters of the squared exponential kernel. Two inferential summary
plots are provided for illustration: First, the posterior mean regression function, eval-
uated at 50 equally spaced grid points; second, the posterior distribution of the most
interesting length-scale parameter ρ, which determines the smoothness of the regres-
sion function by controlling the rate at which covariance decreases with increasing
distance between input points.
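As a complementary sketch to the Stan implementation (the helper names and all numerical values below are illustrative assumptions, not the book's listing), the log of the marginal likelihood (10.8) can also be evaluated directly in Python and maximised over a grid; here the length-scale plays the role of ρ with the amplitude, standing in for α, held fixed.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(y, x, amp, ls, a=1.0, b=1.0):
    # Log of (10.8) with zero mean function and k' the squared exponential
    # kernel; (amp, ls) play the roles of (alpha, rho), and the values of the
    # hyperparameters a, b are illustrative.
    n = len(y)
    d = np.subtract.outer(x, x)
    Kp = amp ** 2 * np.exp(-0.5 * (d / ls) ** 2) + np.eye(n)
    _, logdet = np.linalg.slogdet(Kp)
    a_n = a + n / 2
    b_n = b + 0.5 * y @ np.linalg.solve(Kp, y)
    return (gammaln(a_n) - gammaln(a) + a * np.log(b) - a_n * np.log(b_n)
            - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 10 + 5 * np.sin(x) + x ** 2 / 5 + rng.standard_normal(40)
y = y - y.mean()                       # crude detrending for the zero-mean assumption
ls_grid = np.linspace(0.2, 5.0, 25)
ll = np.array([log_marginal(y, x, amp=5.0, ls=l) for l in ls_grid])
ls_map = ls_grid[int(np.argmax(ll))]   # maximum marginal likelihood length-scale
```

Maximising over a grid in this way gives a simple point estimate for the length-scale; MCMC, as in the Stan implementation, additionally quantifies its posterior uncertainty.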

10.3 Spline Models


Linear spline regression models were introduced in Sect. 8.3.1.2 as an interesting
special case of the normal linear model. For an increasing sequence of m ≥ 0 knot
points τ = (τ₁, …, τ_m), the linear spline basis functions (8.16) together with intercept and slope terms give the following piecewise linear regression model:


f(x) = α₀ + α₁x + Σ_{j=1}^m β_j (x − τ_j)₊,   (10.10)

for α j , β j ∈ R. The regression function (10.10) is a continuous function made up


of (m + 1) linear segments, and can be generalised straightforwardly to continuous
piecewise polynomials of degree d ≥ 1 with (d − 1) continuous derivatives:


f(x) = Σ_{j=0}^d α_j x^j + Σ_{j=1}^m β_j (x − τ_j)₊^d.   (10.11)

The special case of the regression function (10.11) where d = 3, corresponding to


cubic splines, is known to present an optimal trade-off between smoothness (squared
second derivative) and fidelity to fitted data points (squared residuals) (Green and
Silverman 1994, see, for example, p. 11).
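For illustration (the helper name is an assumption), the truncated power basis underlying (10.11) maps directly to a linear model design matrix:

```python
import numpy as np

def spline_design(x, knots, d=3):
    # Columns 1, x, ..., x^d followed by (x - tau_j)_+^d for each knot,
    # matching the regression function (10.11).
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, d + 1, increasing=True)
    trunc = np.maximum(np.subtract.outer(x, np.asarray(knots, dtype=float)), 0.0) ** d
    return np.hstack([poly, trunc])

X = spline_design([0.0, 1.0, 2.5], knots=[1.0, 2.0], d=3)
# X has 3 rows and (d + 1) + m = 4 + 2 = 6 columns
```

Each row holds the basis function evaluations for one input point, so any conjugate linear model machinery applies unchanged.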
The general spline regression function (10.11) is a linear model with respect to
the regression coefficients (cf. Sect. 8.3.1), and the closed-form marginal likelihood
(8.12) still applies. With V now denoting the covariance matrix of the parameter
vector (α0 , . . . , αd , β1 , . . . , βm ) under the conjugate prior (cf. Sect. 8.2.1), here the
same likelihood equation is written
p(y | x, m, τ) = [Γ(a_n) |V_n|^{1/2} bᵃ] / [(2π)^{n/2} Γ(a) |V|^{1/2} b_n^{a_n}],

with the left-hand side emphasising the dependency of the design matrix X of this
linear model on the m-vector of knot points τ . The quantities an , bn , Vn were defined
in (8.10).
Continuing on from an earlier remark in Sect. 10.2, a consequence of spline
regression being a linear model is that it must therefore be a (degenerate) special case
of a Gaussian process. However, for fixed m, spline regression is not a nonparametric
model. To endow spline regression with the properties of a nonparametric model,
the number of knots must be allowed to increase without upper bound.
Following the same construction as the Bayesian histogram model in Sect. 9.4.1,
a suitable prior distribution p(m, τ ) is required for the number and location of the
knot points, with the Poisson process prior (9.15) being a default choice. Inference
from the posterior density for the knot locations,
p(m, τ | y, x) ∝ p(m, τ) |V_n|^{1/2} / (|V|^{1/2} b_n^{a_n}),

can be achieved using (reversible jump) Markov chain Monte Carlo sampling (Green
1995) (cf. Chap. 5). Further, in-depth coverage of inference for nonparametric spline
models is provided within (Denison et al. 2002).

Exercises 10.3 Spline regression as a Gaussian process. Suppose the spline regres-
sion function f from (10.11) with coefficients (α0 , . . . , αd , β1 , . . . , βm ) ∼
Normalm+d+1 (0, v Im+d+1 ). Express f as a Gaussian process.

10.3.1 Spline Regression with Equally Spaced Knots


Analogous to the equal bin-size histogram from Sect. 9.4.2, the spline regression
inference problem can be further simplified through an assumption of equally spaced
knot points on an observation interval, say [0, T ]. Writing p(m) for the prior proba-
bility mass function for the number of knots m, the conditional distribution p(τ | m)
then assigns probability one to the m-vector τ ∗ with jth element

τ*_j = jT/(m + 1),   j = 1, …, m.   (10.12)

Posterior inference then concentrates on the single, unknown parameter m,


p(m | y, x) ∝ p(m) |V_n|^{1/2} / (|V|^{1/2} b_n^{a_n}).   (10.13)

As in Sect. 9.4.2, posterior expectations with respect to (10.13) can then be calculated
directly by taking a finite sum approximation over a sufficiently large range of values
for m.
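As a sketch of this finite-sum approximation (the helper names, hyperparameter values and simulated data are illustrative assumptions, not the book's listing), the posterior mass function (10.13) for a cubic spline with equally spaced knots can be computed as follows.

```python
import numpy as np
from scipy.special import gammaln

def log_marg_lik(y, X, a=1.0, b=1.0, lam=1.0):
    # Log marginal likelihood of the conjugate Bayes linear model with zero
    # prior mean and V = lam^{-1} I_p, as used in (10.13).
    n, p = X.shape
    prec_n = lam * np.eye(p) + X.T @ X            # V_n^{-1}
    mu_n = np.linalg.solve(prec_n, X.T @ y)
    a_n, b_n = a + n / 2, b + 0.5 * (y @ y - mu_n @ prec_n @ mu_n)
    _, logdet_n = np.linalg.slogdet(prec_n)
    return (gammaln(a_n) - gammaln(a) + a * np.log(b) - a_n * np.log(b_n)
            - 0.5 * n * np.log(2 * np.pi) + 0.5 * p * np.log(lam) - 0.5 * logdet_n)

def design(x, m, d, T):
    # Equally spaced knots (10.12) and the truncated power basis of (10.11).
    knots = np.arange(1, m + 1) * T / (m + 1)
    poly = np.vander(x, d + 1, increasing=True)
    return np.hstack([poly, np.maximum(np.subtract.outer(x, knots), 0.0) ** d])

rng = np.random.default_rng(2)
T, d = 10.0, 3
x = np.sort(rng.uniform(0, T, 30))
y = 10 + 5 * np.sin(x) + x ** 2 / 5 + rng.standard_normal(30)
geo = 0.5                                         # geometric prior p(m) = (1 - geo) geo^m
logs = np.array([np.log(1 - geo) + m * np.log(geo) + log_marg_lik(y, design(x, m, d, T))
                 for m in range(15)])
post_m = np.exp(logs - logs.max())
post_m /= post_m.sum()                            # finite-sum approximation to p(m | y, x)
```

Truncating the sum at fifteen knots is itself part of the approximation; in practice the range is extended until the omitted tail mass is negligible.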

Example 10.3 Consider a spline regression function with equally spaced knot
points, partitioning [0, T ] into an unknown, geometrically distributed number of
segments,


d 
m
j d
f (x) = αjx +
j
βj x − T ,
j=0 j=1
m+1 +

p(m) = (1 − λ)λm ,

where 0 ≤ λ ≤ 1. The following Python code (spline_regression.py) demon-


strates Bayesian model averaging over the number of knot points to estimate the pos-
terior mean regression function under the conjugate prior. The code builds upon the
code linear_regression.py (see page 156), presented as a solution to Exercise 8.4, which required a marginal likelihood function for the conjugate Bayes linear model. For each number of knot points, the code obtains the implied linear model design matrix, in order to obtain the marginal likelihood and
posterior mean regression function. Following (8.8), the prior covariance matrix for
the regression coefficients (α0 , . . . , αd , β1 , . . . , βm ) is assumed to take the simplified
form V = λ−1 I p for some scalar value λ > 0, where p = m + d + 1 is the number
of regression coefficients.

Next, the Python code (splines_regression_simulate.py) provides a


simulated example, sampling ten noisy observations from the function (10.9) used

in Sect. 10.2.2. Cubic spline regression is determined by choosing d = 3. Three


plots are generated by the code: the true regression function (dashed line) compared
against the model-averaged posterior expectation (solid line), evaluated at 50 equally
spaced grid points; the approximated posterior distribution for m; the fitted spline
function for the maximum a posteriori value of m, which is seen from the middle
plot to be m = 3.

10.4 Partition Regression Models


Partition models were introduced in Sect. 9.4.1 for estimating probability distribu-
tions, demonstrating a fundamental idea that an arbitrarily complex global model can
be arrived upon by adaptively partitioning the model space and assuming relatively
simple statistical models within each region of the partition. This idea extends very

naturally to the regression setting, suggesting an adaptive partition of the covariate


space whilst assuming straightforward parametric regression models for each region
of the partition. Excellent expositions of this principle are given by Holmes et al.
(2005) and Denison et al. (2002, Chap. 7), the latter noting that partition models
build on a premise that “points nearby in predictor space come from the same local
model”. Again, by assuming no upper bound to the size of the partition, partition
models qualify as nonparametric models with the flexibility to approximate a broad
class of regression functions.
Formally, let π = {B1 , B2 , . . .} be a partition of X . The same parametric regres-
sion model p(y | x, θ j ) can be independently applied to each X -subset B j of the
partition π , with subset-specific parameters θ j . A simple partition model for y | x
thereby assumes a likelihood function of the form

p(y | x, π) = ∏_j ∫_Θ p(y_j | x_j, θ_j) dQ(θ_j),   (10.14)

where x j denotes the predictor values from x which lie inside B j , and y j denotes the
corresponding responses. More generally, a partition regression model may incorpo-
rate additional global parameters ψ, suggesting a more general likelihood function
 
p(y | x, π) = ∫_Ψ dQ(ψ) ∏_j ∫_Θ p(y_j | x_j, θ_j, ψ) dQ(θ_j | ψ).   (10.15)

Two illustrative examples of this modelling paradigm are presented in this section:
univariate changepoint models and multivariate classification and regression trees.

10.4.1 Changepoint Models


Partition models in one dimension, such that X ⊆ R are also known as changepoint
models. The covariate space X can be divided into intervals B j = [τ j−1 , τ j ) implied
by an m-vector of ordered changepoints τ = (τ1 , . . . , τm ) (cf. Sects. 9.4.1 and 10.3).
Describing a model specification as a changepoint model is often suggestive of some
discontinuity between segments in the global regression function. In contrast, the
local models within each segment will often assume exchangeability of responses,
such that p(y | x, θ ) = p(y | θ ). In such cases, (10.14), for example, implies

p(y | x, π) = p(y | x, m, τ) = ∏_j p(y_j).
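For instance, with exchangeable Normal observations and conjugate normal-inverse-gamma priors within each segment, the product of segment marginal likelihoods implied by (10.14) can be sketched as follows; the helper names, hyperparameter values and simulated step data are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def seg_log_marginal(y, a=1.0, b=1.0, lam=1.0):
    # Log marginal likelihood of exchangeable Normal(mu, sigma^2) data under
    # mu | sigma ~ Normal(0, sigma^2 / lam) and 1/sigma^2 ~ Gamma(a, b).
    n = len(y)
    b_n = b + 0.5 * (np.sum(y ** 2) - np.sum(y) ** 2 / (lam + n))
    return (gammaln(a + n / 2) - gammaln(a) + a * np.log(b)
            - (a + n / 2) * np.log(b_n)
            + 0.5 * (np.log(lam) - np.log(lam + n) - n * np.log(2 * np.pi)))

def changepoint_log_lik(y, x, tau):
    # log p(y | x, m, tau): independent conjugate segments, as in (10.14).
    edges = np.concatenate([[-np.inf], tau, [np.inf]])
    seg = np.digitize(x, edges) - 1
    return sum(seg_log_marginal(y[seg == j]) for j in range(len(tau) + 1))

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = np.where(x < 4, 0.0, 3.0) + 0.5 * rng.standard_normal(50)
# A changepoint at the true location should score higher than a misplaced one
well_placed = changepoint_log_lik(y, x, tau=np.array([4.0]))
misplaced = changepoint_log_lik(y, x, tau=np.array([7.0]))
```

A function of this form is the likelihood building block for the changepoint posterior discussed next, whether explored by finite summation or by MCMC.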

If a conjugate regression model is assumed for each segment, (10.14) and (10.15)
can each provide closed-form expressions for the corresponding changepoint likelihood function p(y | x, m, τ). In such cases, assuming a suitable prior distribution
p(m, τ ) for the number and location of the changepoints such as the Poisson process
prior (9.15), inference from the posterior density for the changepoints,

p(m, τ | y, x) ∝ p(m, τ) p(y | x, m, τ),

can be achieved via (reversible jump) Markov chain Monte Carlo sampling (Green
1995). Python code implementing MCMC inference for some standard cases of the
segment regression model p(y | x, θ j , ψ) can be found in Heard (2020); this software
also considers an extended modelling paradigm of changepoint regimes, allowing an
additional complexity that regression parameters θ j might be shared between several
changepoint segments.
Exercises 10.4 (Normal changepoint model as a Gaussian process) Suppose m > 0
known changepoints τ = (τ1 , . . . , τm ) ∈ (0, T )m in a piecewise constant regression
function,
f(x) = Σ_{j=0}^m 1_{[τ_j, τ_{j+1})}(x) · μ_j,

where τ₀ ≡ 0, τ_{m+1} ≡ T and independently μ₀, …, μ_m ∼ Normal(0, v). Conditional on τ, express this changepoint model for f as a Gaussian process.

10.4.1.1 Changepoint Regression with Equally Spaced Changepoints


Analogously to Sects. 9.4.2 and 10.3.1, inference about an unknown number of
unknown changepoint locations can be largely simplified (although possibly over-
simplified) by assuming the changepoints are equally spaced on a given interval, say
[0, T ], meaning that only the number of changepoints m is considered unknown.
Writing the prior probability mass function for m as p(m), and again denoting the
equally spaced changepoints as τ ∗ (10.12), this leads to a univariate posterior mass
function
p(m | y, x) ∝ p(m) p(y | x, m, τ ∗ ). (10.16)

Posterior inference with (10.16) can typically be well approximated analytically by


taking a finite sum with a suitably large number of terms, as noted in the aforemen-
tioned sections.
Example 10.4 (Piecewise constant normal changepoint regression) Consider the
equally spaced changepoint model (10.16), with exchangeable, normally distributed
observations y j within each segment j. Suppose the normal distribution mean param-
eters μ j vary between segments, but a single, global error variance parameter σ 2 is
shared by all the segments.
Assuming conjugate priors for all unknowns implies a piecewise constant regres-
sion with closed-form marginal likelihood of the form (10.15), where σ corresponds
to the nuisance parameter ψ. This particular changepoint model is actually equivalent
to the spline regression model in Exercise 10.3 when the degree d = 0, corresponding
to a piecewise constant regression function.
Repeating the simulated data analysis from Exercise 10.3 with the same Python
code (spline_regression_simulate.py) but setting d = 0 gives the fol-
lowing output plots for changepoint inference.

Even though the assumed regression function has m discontinuities for each value
of m, the model-averaged posterior mean regression function is continuous. Conversely, because the number of observations is small (n = 10), the posterior mode for the number of changepoints is found at m = 2, despite the underlying regression function being a smooth, non-constant function.

10.4.2 Classification and Regression Trees


Binary trees provide an interpretable class of models for recursively partitioning
a multivariate predictor space X ⊂ R p with p ≥ 2. Intuitively, they are a natural
multivariate extension of univariate changepoint models. Figure 10.1 shows an illus-
trative tree; within each of the square terminal nodes, a separate regression model
could be fit to the response data falling within that category, combining for an overall
likelihood model (10.14).
Denison et al. (1998) parameterise a tree T as a set of triples of the form

(splitting node label, variable index, splitting value). (10.17)

[Figure: a binary tree; the root node splits on x₂ ≤ a (branches Y/N), the second-level nodes split on x₁ ≤ b and x₃ ≤ c, and a third-level node splits on x₁ ≤ d; the square terminal nodes are the regions of the partition.]

Fig. 10.1 An example of a classification and regression tree model on three variables

Any descendant splitting node label s is uniquely defined given its parent's label s′, setting s = 2s′ if the node acts on data for which the query at the parent node is true, and s = 2s′ + 1 otherwise.
 Exercises 10.5 (CART notation and partition) Consider the tree in Fig. 10.1.

(i) Express the tree as a set of triples (10.17) according to the notation of Denison
et al. (1998).
(ii) State the partition of R3 implied by the tree.

Bayesian implementations of partition modelling with trees for classification and


regression problems (CART) are described in Chipman et al. (1998) and Denison
et al. (1998); each proposes a different prior distribution p(T ) for the partitioning
tree T , and uses Markov chain Monte Carlo methods to sample from the posterior
distribution of the tree. Both articles openly discuss how MCMC sampling of the
trees is fraught with difficulties, due to the nested structure of the partitions. Perhaps
more significantly, Chipman et al. (2010) presents a Bayesian additive regression tree
(BART) model which provides further flexibility and better MCMC mixing; some
Python implementations of BART can be found online.
Chapter 11
Clustering and Latent Factor Models

Hierarchical models were previously discussed in Sect. 3.3. This chapter gives fur-
ther details of practical Bayesian modelling with hierarchies. In some application
contexts, the hierarchies are understood to be known during the data collection pro-
cess. For example, in the student-grade model of Sect. 6.1, the hierarchical structure
recognised that each row of the data matrix X corresponded to test grades from the
same student.
In other contexts, the hierarchies may be a subjective construct with associated
uncertainty. These hierarchies are characterised by additional unknown parameters,
sometimes formulated as discrete clusters and otherwise as continuous latent factors.
This chapter considers some more advanced modelling techniques commonly applied
in such cases.

11.1 Mixture Models

Suppose x = (x1 , . . . , xn ) are n sampled continuous random variables which are


assumed to be exchangeable. By De Finetti’s representation theorem (Theorem 2.2),
necessarily
p(x) = ∫ ∏_{i=1}^n p(x_i) dQ(p),

where the integral is taken over some suitable space of density functions for the
unknown density p.
A flexible class of density functions can be obtained by considering families of
mixture distributions. As with the partition models considered in Sect. 9.4, each
component density might be a relatively standard parametric model and yet still give
rise to a mixture which is very adaptable to different underlying density shapes. The
following sections present finite and infinite mixture representations, although the
difference between the two can be fairly limited in practice.

The original version of this chapter has been revised due to typographic errors. The corrections to
this chapter can be found at https://doi.org/10.1007/978-3-030-82808-0_12

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, 121
corrected publication 2022
N. Heard, An Introduction to Bayesian Inference,
Methods and Computation, https://doi.org/10.1007/978-3-030-82808-0_11

Remark 11.1 Mixture distributions can be regarded as clustering models (Fraley


and Raftery 2002), implicitly partitioning the n variables according to the mixture
component from which they were drawn. Estimating this underlying cluster structure
can sometimes be a primary inferential objective, requiring specification of a suitable
loss function (Sect. 1.5.2) as exemplified by Lau and Green (2007).

11.1.1 Finite Mixture Models

Suppose the assumed density p is a mixture of m component densities from the same
parametric family, with a general formulation


p(x) = Σ_{j=1}^m w_j f(x | θ_j, ψ),   (11.1)

where θ = (θ1 , . . . , θm ) are unknown parameters specific to each mixture compo-


nent. In contrast, ψ is a global unknown parameter shared across all components,
which in some settings will be redundant. The mixture weights w = (w1 , . . . , wm )
are non-negative and sum to one.
Let z = (z 1 , . . . , z n ) ∈ {1, . . . , m}n denote latent variables representing the mix-
ture components from which each sample is drawn. Formally, xi ∼ p can be equiv-
alently expressed as

z i ∼ Categoricalm (w),
xi ∼ f (· | θzi , ψ), (11.2)

such that z i takes value j ∈ {1, . . . , m} with probability p(z i = j) = w j , and then
xi is sampled from the z i th-component density.
Inferring the latent variables z equates to clustering the observed variables x into at
most m non-empty groups, where only samples within the same cluster are assumed
to be drawn from the same population. The conditional likelihood function for x
given the latent cluster allocations z is simply


p(x | z, θ, ψ) = ∏_{i=1}^n f(x_i | θ_{z_i}, ψ).   (11.3)

Marginalising the unknown parameters θ, ψ from (11.3) with respect to assumed


prior distributions yields
p(x | z) = ∫_Ψ ∏_{j=1}^m { ∫_Θ ∏_{i: z_i = j} f(x_i | θ_j, ψ) dQ(θ_j | ψ) } dQ(ψ).   (11.4)

This calculation will be straightforward when assuming conjugate parametric models


(cf. Sect. 4.2).

11.1.1.1 Dirichlet Prior for Mixture Weights

The conjugate prior for the mixture weights w is a Dirichlet distribution,

w ∼ Dirichletm (α1 , . . . , αm ), (11.5)

for non-negative hyperparameters α_1, …, α_m, chosen such that α = Σ_{j=1}^m α_j represents a notional prior sample size (cf. Sect. 9.2.1). To obtain symmetry, the Dirichlet
sents a notional prior sample size (cf. Sect. 9.2.1). To obtain symmetry, the Dirichlet
hyperparameters α j are typically assumed to be identical with each α j = α/m for a
chosen value of α > 0.
For a given vector of cluster allocations z and for each j ∈ {1, . . . , m}, let


n_j = Σ_{i=1}^n 1_{j}(z_i)   (11.6)

be the number of samples attributed to the jth cluster. Under the categorical model
(11.2),
p(z | w) = ∏_{j=1}^m w_j^{n_j}.   (11.7)

Marginalising (11.7) with respect to the Dirichlet prior (11.5) for the unknown mix-
ture weights yields a marginal distribution for the cluster allocations,

p(z) = [Γ(α) / Γ(α + n)] ∏_{j=1}^m [Γ(α_j + n_j) / Γ(α_j)].   (11.8)

Remark 11.2 The probability distribution (11.8) is known as the multinomial-


Dirichlet distribution.
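The mass function (11.8) is simple to evaluate; the following sketch (zero-indexed cluster labels and symmetric hyperparameters α_j = α/m are implementation choices, and the helper name is an assumption) also checks numerically that (11.8) sums to one over all allocation vectors.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def log_md(z, m, alpha=1.0):
    # Log of the multinomial-Dirichlet mass (11.8) with alpha_j = alpha/m;
    # clusters are labelled 0, ..., m-1 here.
    z = np.asarray(z)
    n_j = np.bincount(z, minlength=m)        # cluster counts (11.6)
    a_j = alpha / m
    return (gammaln(alpha) - gammaln(alpha + len(z))
            + np.sum(gammaln(a_j + n_j) - gammaln(a_j)))

# The probabilities over all m^n allocation vectors sum to one:
total = sum(np.exp(log_md(zz, m=2)) for zz in product([0, 1], repeat=4))
```

Working with log masses via gammaln avoids the overflow that direct evaluation of the gamma functions would cause for realistic sample sizes.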

Under the Dirichlet prior, the joint conditional distribution for x and z can be
conveniently written up to proportionality as
p(x, z | θ, ψ) ∝ ∏_{j=1}^m { Γ(α_j + n_j) ∏_{i: z_i = j} f(x_i | θ_j, ψ) }.

Alternatively, by first marginalising the unknown parameters (θ, ψ), the expression
(11.8) can be combined with (11.4) to yield
p(x, z) ∝ p(z) p(x | z) ∝ ∫_Ψ ∏_{j=1}^m { Γ(α_j + n_j) ∫_Θ ∏_{i: z_i = j} f(x_i | θ_j, ψ) dQ(θ_j | ψ) } dQ(ψ).   (11.9)

11.1.1.2 Mixture of Gaussians

For densities which require support over the whole real line, f in (11.1) is commonly
assumed to be the density of a normal distribution with parameter pair θ j = (μ j , σ j )
denoting the mean and standard deviation, respectively, for the jth mixture compo-
nent, implying
p(x) = Σ_{j=1}^m w_j φ{(x − μ_j)/σ_j}/σ_j,

where φ is the standard normal density.


Assuming conjugate normal and inverse-gamma priors for {(μ j , σ j ) | j =
1, . . . , m},

μ_j | σ_j ∼ Normal(0, σ_j²λ⁻¹),

σ_j⁻² ∼ Gamma(a, b),

with a, b, λ > 0, the parameters μ_j and σ_j² can be integrated out according to (11.9) to obtain the joint distribution of x and z,

p(x, z) = [Γ(α) / (Γ(α + n)(2π)^{n/2})] ∏_{j=1}^m [Γ(α_j + n_j) Γ(a + n_j/2) λ^{1/2} bᵃ] / [Γ(α_j) Γ(a) (λ + n_j)^{1/2} {b + ½(ẍ_j − ẋ_j²/(λ + n_j))}^{a + n_j/2}],   (11.10)

where ẋ_j = Σ_{i: z_i = j} x_i and ẍ_j = Σ_{i: z_i = j} x_i².
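A direct evaluation of the log of (11.10) can be sketched in Python; the zero-indexed cluster labels, symmetric α_j = α/m and the hyperparameter defaults are implementation choices.

```python
import numpy as np
from scipy.special import gammaln

def log_joint(x, z, m, alpha=1.0, a=1.0, b=1.0, lam=1.0):
    # Log of the collapsed joint density (11.10).
    x, z = np.asarray(x, dtype=float), np.asarray(z)
    n, a_j = len(x), alpha / m
    n_j = np.bincount(z, minlength=m)
    xdot = np.bincount(z, weights=x, minlength=m)        # per-cluster sums
    xddot = np.bincount(z, weights=x ** 2, minlength=m)  # per-cluster sums of squares
    b_j = b + 0.5 * (xddot - xdot ** 2 / (lam + n_j))
    return (gammaln(alpha) - gammaln(alpha + n) - 0.5 * n * np.log(2 * np.pi)
            + np.sum(gammaln(a_j + n_j) - gammaln(a_j)
                     + gammaln(a + n_j / 2) - gammaln(a)
                     + 0.5 * (np.log(lam) - np.log(lam + n_j))
                     + a * np.log(b) - (a + n_j / 2) * np.log(b_j)))

# Well-separated data favour the matching allocation:
x_demo = [-5.0, -5.1, 5.0, 5.2]
better = log_joint(x_demo, [0, 0, 1, 1], m=2)
worse = log_joint(x_demo, [0, 1, 0, 1], m=2)
```

Evaluating this collapsed joint density is the key computational step for cluster inference, since full conditionals for each z_i are proportional to it.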
A simpler but less flexible implementation of the mixture of Gaussians model
can be obtained by assuming a single variance parameter which is common to each
mixture component density, such that θ j = μ j and ψ = σ . The corresponding joint
distribution for x and z is

p(x, z) = [Γ(α) Γ(a + n/2) bᵃ] / [Γ(α + n) Γ(a) (2π)^{n/2} {b + ½(ẍ − Σ_{j=1}^m ẋ_j²/(λ + n_j))}^{a + n/2}] ∏_{j=1}^m (λ/(λ + n_j))^{1/2} Γ(α_j + n_j)/Γ(α_j),

where ẍ = Σ_{i=1}^n x_i².

11.1.1.3 Inferring the Clustering and Number of Clusters

For a fixed number of mixture components m, the posterior distribution of the cluster
allocations z ∈ {1, . . . , m}n can be obtained up to proportionality from the joint
distribution (11.9),
p(z | x) ∝ p(x, z). (11.11)

The posterior distribution (11.11) can be explored using straightforward Markov


chain Monte Carlo simulation techniques, such as Gibbs sampling, introduced in
Sect. 5.3.

Exercises 11.1 (Mixture of normals full conditionals) For the finite mixture of nor-
mal density model (11.10) with component-specific mean and variance parameters
assuming conjugate priors, state an equation, up to proportionality, for the full con-
ditional distribution p(z i | z−i , x) for i ∈ {1, . . . , n}.

 Exercises 11.2 (Gibbs sampling mixture of normals) Write code to implement


Gibbs sampling on the finite mixture of normal density model (11.10) with
component-specific mean and variance parameters assuming conjugate priors.
Initialise the Markov chain by ordering the samples and dividing them into m
equal-sized groups.
Run the code with 10,000 sampled data points generated from the mixture of two
beta distributions simulated in Sect. 9.4.2.1, assuming m = 2. After M = 100
iterations, show the proportion of data points assigned to each cluster and the
corresponding sample means.

More commonly the number of mixture components m will be considered


unknown, requiring specification of an additional prior distribution component p(m);
the corresponding posterior distribution of interest extends to

p(m, z | x) ∝ p(m) p(x, z).

In particular, if p(m) is assumed to have unbounded support on the natural numbers


(for example, assuming m ∼ Poisson(λ) for λ > 0), the finite mixture model (11.1)
becomes a potentially infinite mixture model, and can therefore be regarded as another
nonparametric inferential model akin to those considered in this chapter. Richardson
and Green (1997) demonstrated inference for mixture distributions, such as mixtures
of Gaussians, with an unknown number of components using reversible jump Markov
chain Monte Carlo.

Remark 11.3 As with other nonparametric models, admitting an unbounded num-


ber of mixture components allows the finite mixture model (11.1) to fit increasingly
complex density functions as the number of samples n increases.

11.1.2 Dirichlet Process Mixture Models

A natural evolution from the potentially infinite mixture model considered in the
previous section is to consider infinite mixture models. In contrast to (11.1), suppose


p(x) = Σ_{j=1}^∞ w_j f(x | θ_j, ψ)   (11.12)

for an infinite sequence of positive-valued mixture weights w1 , w2 , . . . summing to 1,


and corresponding mixture component density parameters θ1 , θ2 , . . .. A convenient
nonparametric model for obtaining infinite mixtures of type (11.12) is the Dirichlet
process, introduced in Sect. 9.2.
Definition 11.1 (Dirichlet process mixture) The Dirichlet process mixture (DPM)
model for x = (x1 , . . . , xn ) assumes a sampling procedure where each sample xi
is drawn independently from the assumed parametric model f with a sample-
specific parameter θi . Furthermore, each parameter θi is drawn independently from
an unknown discrete distribution with Dirichlet process prior:

xi | θi ∼ f (· | θi , ψ), i = 1, . . . , n,
θi ∼ G, i = 1, . . . , n,
G ∼ DP(α · P0 ),

for α > 0 and some base probability distribution P0 . More formally,


p(x) = ∫_Ψ ∫_G ∏_{i=1}^n { ∫_Θ f(x_i | θ_i, ψ) dG(θ_i) } dQ(G) dQ(ψ),   (11.13)

where Q(G) is a Dirichlet process.


Remark 11.4 To ease inference, the base probability distribution P0 is typically
assumed to be continuous and conjugate to the parametric density f .
Proposition 11.1 Samples x = (x1 , . . . , xn ) drawn from a DPM are exchangeable.
Proof This property propagates automatically from the exchangeability of θ1 , θ2 , . . .
from a distribution following a Dirichlet process (see Exercise 9.1).
Remark 11.5 It was noted in Sect. 9.2 that distributions sampled from a Dirichlet
process are discrete with probability 1, and it is this discreteness which makes (11.13)
a clustering model: the parameter θi for a sample value xi has positive probability
of matching the parameter of other samples in x, and consequently clusters of x
can be defined by equivalence classes of samples with the same parameter value.
Similarly, because samples from a Dirichlet process with continuous base measure
have countably infinite support, the model implies an infinite number of clusters.

The representation of (11.13) as a countably infinite mixture model (11.12) can


be directly obtained using the stick-breaking interpretation of the Dirichlet process,
presented in Proposition 9.1. Immediately from (9.5), the DPM model (11.13) implies
a random probability density function of the form (11.12) where the atoms θ1 , θ2 , . . .
are drawn from P0 and the mixture weights are determined by (9.6).

11.1.2.1 Inferring Clusters

For the infinite mixture model (11.12), the latent cluster allocation variables z =
(z 1 , . . . , z n ) could naturally assume an infinite range of values {1, 2, . . .} with no
upper bound. However, the labels assigned to clusters are arbitrary and at most n
clusters can be non-empty. Instead, z can be more usefully defined by revealing the
samples sequentially according to the predictive distribution (9.9). Let z 1 = 1 and

z i ∈ {1, . . . , z i−1 , z i−1 + 1}, i > 1,

where setting z i = z i−1 + 1 corresponds to forming a new cluster, drawing a


new parameter θzi from the base measure P0 ; otherwise, setting z i = j for j ∈
{1, …, z_{i−1}} corresponds to reuse of an already drawn parameter θ_j. Further, let

p(z_i = j) = { α/(α + i − 1)       if j = z_{i−1} + 1,
            { n_{j,i}/(α + i − 1)   if 1 ≤ j ≤ z_{i−1},   (11.14)

where n_{j,i} = Σ_{ℓ=1}^{i−1} 1_{j}(z_ℓ) is the number of samples allocated to cluster j prior to


sample i. Then assuming (11.14) corresponds exactly to Dirichlet process sampling
(see (9.9) and the subsequent remark).
At the end of the sequence, let n j be the number of samples allocated to cluster
j (11.6), and


m(z) = Σ_{j=1}^∞ 1(n_j > 0)

denote the number of non-empty clusters formed. Combining the terms from (11.14),
the DPM induces a marginal prior distribution for z,

p(z) = [Γ(α) / Γ(α + n)] ∏_{j=1}^{m(z)} αΓ(n_j).   (11.15)

Remark 11.6 The sequential consideration of the samples used to derive (11.15)
does not contradict the exchangeability property from Proposition 11.1, since (11.15)
is symmetric in the indices of z.
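Sampling from the sequential scheme (11.14) gives a simple check on the implied clustering behaviour; the sketch below (function name and parameter values are illustrative assumptions) draws allocation vectors and compares the average number of non-empty clusters with its exact expectation Σ_{i=1}^n α/(α + i − 1).

```python
import numpy as np

def sample_crp(n, alpha, rng):
    # Sequentially draw z_1, ..., z_n according to (11.14).
    z = [1]
    for i in range(2, n + 1):
        counts = np.bincount(z)[1:]                    # n_{j,i} for existing clusters
        probs = np.append(counts, alpha) / (alpha + i - 1)
        z.append(int(rng.choice(len(probs), p=probs)) + 1)
    return np.array(z)

rng = np.random.default_rng(4)
draws = [len(np.unique(sample_crp(100, 2.0, rng))) for _ in range(200)]
# E[m(z)] = sum_i alpha / (alpha + i - 1), which is about 8.4 for n = 100, alpha = 2
expected = sum(2.0 / (2.0 + i) for i in range(100))
```

The logarithmic growth of this expectation in n is a characteristic feature of Dirichlet process clustering.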

Remark 11.7 The Dirichlet process mixture marginal distribution for z (11.15) is
actually very similar to the multinomial-Dirichlet prior (11.8). Although assuming
an infinite number of clusters is mathematically elegant, there is little practical dif-
ference between assuming infinitely many clusters and assuming an unbounded but
finite number of clusters; since when inferring cluster assignments z, the former
specification simply guarantees an infinite number of empty clusters.

Inference for the DPM is analogous to that for the multinomial Dirichlet model.
The joint distribution for (x, z) can be obtained analogously to (11.9),
p(x, z) = p(z) p(x | z) ∝ ∫_Ψ ∏_{j=1}^{m(z)} { αΓ(n_j) ∫_Θ ∏_{i: z_i = j} f(x_i | θ_j, ψ) dQ(θ_j) } dQ(ψ),

which has closed-form expression for conjugate parametric models. Posterior infer-
ence for the distribution p(z | x) ∝ p(x, z) again requires Markov chain Monte Carlo
simulation techniques.

11.2 Mixed-Membership Models

Consider a hierarchical random sample with two assumed layers of exchangeability, as previously considered in Example 3.2. For full generality here, first suppose x = (x_1, …, x_n) is a vector of exchangeable random variables, each of varying dimension such that x_i = (x_{i,1}, …, x_{i,p_i}), p_i ≥ 1. Second, conditional on p_i, suppose the ith sample values x_{i,1}, …, x_{i,p_i} are also exchangeable. In this multivariate setting, mixed-membership clustering models can extend the mixture modelling frameworks from Sect. 11.1, by assuming a fixed but unknown distribution over mixture components for each sample x_i.
Formally, for a finite mixed-membership model formulation, x has an assumed random density function

    p(x | p_1, …, p_n) = ∏_{i=1}^n ∏_{ℓ=1}^{p_i} p_i(x_{i,ℓ}),

    p_i(x_{i,ℓ}) = ∑_{j=1}^m w_{i,j} f(x_{i,ℓ} | θ_j, ψ),    (11.16)

where W = (w_{i,j}) is an n × m non-negative matrix with row sums equal to 1,

    W · 1_m = 1_n.

Remark 11.8 There are two points to note about the mixed-membership model
(11.16).

Fig. 11.1 Belief network for a mixed-membership model: the global parameters (θ, ψ) and the sample-specific weights w_1, …, w_n are parents of the observation nodes X_1, …, X_n.

1. The mixture component densities f(· | θ_j, ψ), j = 1, …, m, are common to each of the samples.
2. Each sample x_i has specific mixture weights w_i = (w_{i,1}, …, w_{i,m}) for the distribution of its components x_{i,ℓ}, 1 ≤ ℓ ≤ p_i.

Item 1 allows the learning of shared underlying populations between the samples x_1, …, x_n, whilst Item 2 allows for different population representations in each sample.
The mixed-membership model (11.16) is illustrated schematically as a belief
network in Fig. 11.1.
Figure 11.1 is structurally identical to the belief network diagram for regression modelling in Fig. 3.7. The key difference is that the shaded nodes for the mixture weights w_1, …, w_n indicate that these quantities are unknown, in contrast to the measurable covariates (or factors) in a standard regression model; the mixture weights can therefore be regarded as latent factors.

11.2.1 Latent Dirichlet Allocation

Mixed-membership models are frequently encountered in statistical analyses of


textual data for determining similarity amongst a collection of documents. There,
each sample xi corresponds to a particular document of length pi , such that
(xi,1 , . . . , xi, pi ) ∈ V pi is the sequence of words in that document, which are drawn
from an overall vocabulary V . (Without loss of generality, it is convenient to abstract
the language of the documents by setting V = {1, . . . , |V |}.)
The most popular mixed-membership model for text analysis is the latent Dirichlet allocation model for categorical data, proposed by Blei et al. (2003). This finite mixture model assumes a latent allocation variable z of the same dimensions as x, such that z_{i,ℓ} ∈ {1, …, m} attributes a mixture component to word ℓ in document i, 1 ≤ i ≤ n and 1 ≤ ℓ ≤ p_i.

Definition 11.2 (Latent Dirichlet allocation) The Latent Dirichlet allocation (LDA) model assumes the following m-component mixture model sampling procedure:

    p_i ∼ Poisson(ζ),
    x_{i,ℓ} | z_{i,ℓ}, θ ∼ Categorical_{|V|}(θ_{z_{i,ℓ}}),
    z_{i,ℓ} | w_i ∼ Categorical_m(w_i),
    θ_j ∼ Dirichlet_{|V|}(α),
    w_i ∼ Dirichlet_m(γ),

for 1 ≤ j ≤ m, 1 ≤ i ≤ n, 1 ≤ ℓ ≤ p_i, and positive-valued parameters ζ, α = (α_1, …, α_{|V|}), γ = (γ_1, …, γ_m).
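The sampling procedure in Definition 11.2 can be sketched directly in Python (standard library only; the dimensions and hyperparameter values below are illustrative, and Dirichlet draws are formed from normalised gamma variables):

```python
import math
import random

rng = random.Random(1)

def dirichlet(params):
    """Dirichlet draw via normalised independent Gamma(a_j, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def poisson(lam):
    """Knuth's multiplicative method for a Poisson draw."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_corpus(n=5, m=3, V=20, zeta=40.0, a=0.5, g=1.0):
    """Generate n documents from the LDA sampling procedure of Definition 11.2."""
    theta = [dirichlet([a] * V) for _ in range(m)]        # topic-word distributions
    docs = []
    for _ in range(n):
        w_i = dirichlet([g] * m)                          # topic weights for document i
        p_i = max(1, poisson(zeta))                       # document length
        doc = []
        for _ in range(p_i):
            z = rng.choices(range(m), weights=w_i)[0]     # latent allocation z_{i,l}
            doc.append(rng.choices(range(V), weights=theta[z])[0])  # word x_{i,l}
        docs.append(doc)
    return docs

docs = sample_corpus()
print([len(d) for d in docs])
```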
Remark 11.9 LDA can be regarded as a latent linear model for factorising a matrix of multinomial probabilities. Suppose A = (a_{i,v}) is an n × |V| matrix such that a_{i,v} is the probability that a randomly selected, exchangeable element (word) of x_i is equal to v ∈ V; each row i of the matrix A therefore corresponds to a vector of multinomial word probabilities for sample i. Taking W as the n × m matrix of weights W = (w_{i,j}) defined above and writing θ = (θ_{j,v}) as an (m × |V|) matrix, then under the LDA model of Definition 11.2,

    A = W · θ,

where the rows of both W and θ are all Dirichlet-distributed random vectors.
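The factorisation A = W · θ is easy to illustrate numerically: when the rows of W and of θ are probability vectors, each row of the product is again a probability vector. A toy check with arbitrarily chosen values:

```python
def matmul(W, theta):
    """A = W . theta for row-stochastic W (n x m) and theta (m x |V|)."""
    return [[sum(W[i][j] * theta[j][v] for j in range(len(theta)))
             for v in range(len(theta[0]))]
            for i in range(len(W))]

W = [[0.7, 0.3], [0.2, 0.8]]                  # document-topic weights (n = 2, m = 2)
theta = [[0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]    # topic-word probabilities (m = 2, |V| = 3)
A = matmul(W, theta)
print(A)  # each row of A is a probability vector over the vocabulary
```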

11.2.1.1 Topic Modelling

The LDA model in Definition 11.2 is often referred to as topic modelling. Under this
interpretation, the m components of the mixture distribution (11.16) represent latent
topics. Each of the m topics is characterised by a specific probability distribution on
the vocabulary of words, parameterised by θ j for topic j.
Similarly, each document is characterised by a specific mixture of the topics,
parameterised by the weights wi . The model assumes two levels of exchangeability:
first amongst documents and second amongst words within a document. The latter
assumption is often referred to as a bag-of-words model, as the ordering of words
within a document is deemed unimportant by the model.

11.2.1.2 Inference

The following Stan code (lda.stan) is based on the example for making inference on the LDA model provided in the User's Guide of the Stan documentation,¹ adapted to the notation of this text. Stan does not support ragged array data formats, and so the documents are concatenated into a single vector x; consequently, an additional variable doc is used to store the starting index of each document within x. The two probability vectors w_i and θ_j use the convenient simplex variable constraint.

1 https://mc-stan.org/users/documentation.
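A sketch of how lda.stan is likely structured, based on the description above and the Stan User's Guide example, marginalising the discrete allocations z_{i,ℓ} with log_sum_exp (the exact variable names and data layout are assumptions):

```stan
data {
  int<lower=2> m;                   // number of topics
  int<lower=2> V;                   // vocabulary size
  int<lower=1> n;                   // number of documents
  int<lower=1> N;                   // total number of words
  int<lower=1, upper=V> x[N];       // all documents concatenated
  int<lower=1, upper=N> doc[n];     // starting index of each document within x
  vector<lower=0>[V] alpha;
  vector<lower=0>[m] gamma;
}
parameters {
  simplex[m] w[n];                  // document-specific topic weights
  simplex[V] theta[m];              // topic-specific word probabilities
}
model {
  for (i in 1:n)
    w[i] ~ dirichlet(gamma);
  for (j in 1:m)
    theta[j] ~ dirichlet(alpha);
  for (i in 1:n) {
    int last = (i == n) ? N : doc[i + 1] - 1;
    for (l in doc[i]:last) {
      vector[m] lp;                 // marginalise the discrete allocation z_{i,l}
      for (j in 1:m)
        lp[j] = log(w[i, j]) + log(theta[j, x[l]]);
      target += log_sum_exp(lp);
    }
  }
}
```

Marginalising z in this way is required because Stan's Hamiltonian Monte Carlo sampler cannot handle discrete parameters directly.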

11.2.2 Hierarchical Dirichlet Processes

In Sect. 11.1, Dirichlet process mixture models (DPMs, Sect. 11.1.2) were presented
as an infinite-dimensional extension of finite mixture models (Sect. 11.1.1). Similarly,
the finite mixed-membership model (11.16) can also naturally extend to an infinite
mixture using a hierarchy of Dirichlet processes.

Definition 11.3 (Hierarchical Dirichlet processes) Let P_0 be a known probability measure and P_1, …, P_n be an exchangeable sequence of unknown probability measures. Then for concentration parameters α, γ > 0, a hierarchical Dirichlet process (Teh et al. 2006) model for P_1, …, P_n, here denoted HDP(α, γ, P_0), assumes

    P_i ∼ DP(α · P), i = 1, …, n,
    P ∼ DP(γ · P_0).

Remark 11.10 Under the hierarchical Dirichlet process, each unknown measure Pi
has expected value P0 .

Remark 11.11 The hierarchy introduces an additional unknown (latent) probability


measure P which encapsulates similarities between P1 , . . . , Pn . As γ → ∞, the
HDP model approaches n independent DP(α · P0 ) draws; as α → ∞, the n unknown
probability models tend towards a single draw from DP(γ · P0 ).

Definition 11.4 (Hierarchical Dirichlet processes mixture) A hierarchical Dirichlet processes mixture (HDPM) model assumes a sampling procedure where each sample component x_{i,ℓ} is drawn independently from a parametric model f with sample-specific parameters θ_{i,ℓ} drawn independently from unknown discrete distributions:

    x_{i,ℓ} | θ_{i,ℓ} ∼ f(· | θ_{i,ℓ}, ψ), i = 1, …, n; ℓ = 1, …, p_i,
    θ_{i,ℓ} ∼ G_i, i = 1, …, n; ℓ = 1, …, p_i,
    G_1, …, G_n ∼ HDP(α, γ, P_0),

for α, γ > 0 and some base probability distribution P_0.

Proposition 11.2 The HDPM corresponds to an infinite mixed-membership model constructed by stick-breaking process (9.4) representations for the probability density functions,

    p_i(x) = ∑_{j=1}^∞ w_{i,j} f(x | θ_j, ψ), i = 1, …, n,    (11.17)

where θ_1, θ_2, … are draws from the base measure P_0 and

    w_{i,j} = w′_{i,j} ∏_{ℓ=1}^{j−1} (1 − w′_{i,ℓ}),
    w′_{i,j} ∼ Beta(γ β_j, α), i = 1, …, n; j = 1, 2, …,
    β_1, β_2, … ∼ GEM(α).

Proof See Teh et al. (2006).
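The stick-breaking recursion in Proposition 11.2 is straightforward to simulate. The sketch below (truncated to finitely many sticks, an approximation, and following the Beta(γβ_j, α) form printed above) draws the shared weights β and one document's weights w_{i,·}:

```python
import random

rng = random.Random(2)

def stick_breaking(props):
    """Turn stick proportions b_1', b_2', ... into weights b_j = b_j' * prod_{l<j}(1 - b_l')."""
    weights, remaining = [], 1.0
    for b in props:
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights

def gem(alpha, truncation):
    """Truncated GEM(alpha): stick proportions drawn independently from Beta(1, alpha)."""
    return stick_breaking([rng.betavariate(1.0, alpha) for _ in range(truncation)])

alpha, gamma_, T = 1.0, 2.0, 200
beta = gem(alpha, T)                                          # shared top-level weights
w_prime = [rng.betavariate(max(gamma_ * b, 1e-12), alpha) for b in beta]
w = stick_breaking(w_prime)                                   # one document's weights w_{i,j}
print(sum(beta), sum(w))  # both at most 1, approaching 1 as T grows
```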

11.2.2.1 Topic Modelling

The hierarchical Dirichlet process (11.17) is a mixed-membership model with an


infinite number of mixture components, in contrast to the finite mixture assumed in
latent Dirichlet allocation. The HDPM can be applied to topic modelling (cf. Sect.
11.2.1.1) on a vocabulary V = {1, . . . , |V |} with the following model specification:
1. P0 should be a |V |-dimensional Dirichlet distribution, such that draws θ1 , θ2 , . . .
from P0 are topics (probability distributions on the vocabulary); then the topic
distribution G i for each document i has different atoms of mass (topic weights)
located at the same infinite list of candidate topics.
2. Word ℓ in document i has a Categorical_{|V|}(θ_{i,ℓ}) distribution, where θ_{i,ℓ} is an independent draw from the topic distribution G_i specific to document i. Following the stick-breaking construction in (11.17), this corresponds to

    f(x | θ_j, ψ) = θ_{j,x}.

11.2.2.2 Inference

Inference for HDPM has added complexity over LDA due to the unlimited number
of topics. However, open-source software implementations are available, such as the
Python package Gensim.2 This package uses online variational inference as described
in Wang et al. (2011).

11.3 Latent Factor Models

Suppose X = (xi j ) ∈ Rn× p is an (n × p) matrix of random variables, such that the


rows of X , denoted x1 , x2 , . . . , xn , are assumed to be exchangeable p-vectors. On
some occasions, particularly when the dimension p > 1 may be large, it might be
believed that the vectors xi lie close to a lower dimensional subspace of R p . In this
case, probabilistic beliefs about X may be more easily characterised by specifying
probability distributions in the lower dimensional space. One approach for modelling
in alternative dimensions is to deploy latent factor models.
The canonical example of latent factor modelling assumes the following latent
linear model (Bhattacharya and Dunson 2011):

    x_i = Λ · η_i + ε_i, i = 1, …, n,    (11.18)

where

    ε_i ∼ Normal_p(0, Σ),
    η_i ∼ Normal_k(0, I_k).    (11.19)

The elements of the vector η_i ∈ R^k are referred to as the latent factors for sample i. Typically, in dimension-reduction applications, the latent dimension k ≪ p. The global parameter Λ is a (p × k) matrix of factor loadings which project the latent factors into the higher dimensional space R^p. As η_i varies over R^k, Λ · η_i defines a linear subspace of R^p, but the observable variables x_i lie just outside that subspace due to the observation error ε_i.
Remark 11.12 For each sample, the latent factors ηi ∈ Rk can be interpreted as the
unobserved measurements of k features which are believed to be linearly related to
the expected value of the response.
Since (11.18) is a linear model, assuming (11.19) implies the latent factors can be marginalised out similarly to (8.11), yielding

    x_i | Λ, Σ ∼ Normal_p(0, Λ Λᵀ + Σ).    (11.20)

2 https://radimrehurek.com/gensim/models/hdpmodel.html.

Remark 11.13 The marginal distribution (11.20) gives insight into the latent factor model; the Gram matrix Λ Λᵀ of the rows of the latent factor loadings Λ provides a low-rank (k < p) additive contribution to the covariance matrix for each exchangeable data row x_i.
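The marginal covariance in (11.20) can be confirmed by simulation: repeatedly generating x = Λη + ε and forming the second-moment matrix recovers ΛΛᵀ + Σ. A standard-library sketch with illustrative values p = 2, k = 1:

```python
import random

rng = random.Random(3)
Lam = [1.0, -2.0]           # loadings Lambda with p = 2, k = 1
sigma2 = [0.5, 0.25]        # diagonal of Sigma

n = 50000
s = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(n):
    eta = rng.gauss(0.0, 1.0)                                        # latent factor
    x = [Lam[r] * eta + rng.gauss(0.0, sigma2[r] ** 0.5) for r in range(2)]
    for r in range(2):
        for c in range(2):
            s[r][c] += x[r] * x[c]
cov = [[s[r][c] / n for c in range(2)] for r in range(2)]            # second moments (mean is 0)
theory = [[Lam[r] * Lam[c] + (sigma2[r] if r == c else 0.0)
           for c in range(2)] for r in range(2)]
print(cov)      # close to theory = [[1.5, -2.0], [-2.0, 4.25]]
```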

For any semi-orthogonal matrix U satisfying U Uᵀ = I_k, (Λ U) · (Λ U)ᵀ = Λ Λᵀ, and so the covariance factorisation in (11.20) is not unique. In determining a prior distribution for Λ, it is therefore natural to choose a distribution which is invariant to these rotations and reflections, satisfying

    p(Λ) = p(Λ U)

for any semi-orthogonal matrix U.
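For k = 2, this invariance is easy to verify numerically: rotating the columns of Λ leaves the Gram matrix ΛΛᵀ unchanged. A small pure-Python check with arbitrary values:

```python
import math

Lam = [[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]]     # loadings with p = 3, k = 2

phi = 0.9
U = [[math.cos(phi), -math.sin(phi)],
     [math.sin(phi), math.cos(phi)]]            # a rotation, so U U^T = I_2

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def gram(A):
    """A A^T for a p x k matrix A."""
    return [[sum(A[i][t] * A[j][t] for t in range(len(A[0]))) for j in range(len(A))]
            for i in range(len(A))]

G1, G2 = gram(Lam), gram(matmul(Lam, U))
print(max(abs(G1[i][j] - G2[i][j]) for i in range(3) for j in range(3)))  # essentially zero
```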

11.3.1 Stan Implementation

The following Stan code (latent_factors.stan) implements the latent factor model from (11.20). For simplicity, Σ is assumed to be a diagonal matrix of independent inverse-gamma distributed random variables, and a reference prior is assumed for the factor loadings.
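A plausible shape for latent_factors.stan under the stated assumptions (diagonal Σ with independent inverse-gamma entries, flat reference prior on Λ; the hyperparameter values are placeholders):

```stan
data {
  int<lower=1> n;                  // number of observations
  int<lower=1> p;                  // observed dimension
  int<lower=1> k;                  // latent dimension
  matrix[n, p] X;
}
parameters {
  matrix[p, k] Lambda;             // factor loadings (reference/flat prior)
  vector<lower=0>[p] sigma2;       // diagonal of Sigma
}
model {
  matrix[p, p] Omega = tcrossprod(Lambda) + diag_matrix(sigma2);
  sigma2 ~ inv_gamma(2, 1);        // placeholder hyperparameters
  for (i in 1:n)
    X[i]' ~ multi_normal(rep_vector(0, p), Omega);
}
```

Working with the marginal form (11.20) avoids declaring the latent factors η_i as parameters at all, at the cost of a p × p covariance computation per observation.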

To illustrate inference for the latent factor model using latent_factors.stan,


the following PyStan code (latent_factors_stan.py) simulates a (50 × 8)
data matrix X from the model and performs posterior inference on Λ and Σ.

To see how well the underlying latent structure is recovered, the posterior distribution for Λ is compared with the true value used to generate X. This comparison is not completely straightforward, since it was noted above that the model (11.20) is invariant to semi-orthogonal transformations. This invariance to certain transformations, such as rotations of Λ, implies taking a simple average of the posterior samples of Λ would not give a meaningful estimate.
To enable posterior averaging, the MCMC samples are first aligned using Procrustes alignment, as advocated in Oh and Raftery (2007). Each sample is transformed by a different semi-orthogonal matrix optimised to be as close as possible to a fixed target, here chosen to be the final MCMC sample; the aligned samples are then averaged to obtain a posterior mean value, and finally this posterior mean is transformed in order to be aligned as closely as possible to the true value Λ. The resulting estimate from this post-processing procedure is here denoted Λ̂.

Heat map plots of Λ̂ and the true value are compared side by side in the plot generated by the code. To demonstrate the value of this alignment procedure, the plots also show the crude estimate, denoted Λ̄, obtained from directly averaging the posterior samples of Λ and then finding the closest alignment to the true Λ. The estimate obtained from mutually aligning the samples is much closer to the true matrix of factor loadings.
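In two dimensions the optimal Procrustes rotation has a closed form, which gives a minimal illustration of the alignment idea (the general case uses a singular value decomposition over semi-orthogonal matrices): the angle maximising ∑_i b_iᵀ R(φ) a_i is φ = atan2(∑_i (a_{i,1} b_{i,2} − a_{i,2} b_{i,1}), ∑_i a_i · b_i). A sketch:

```python
import math

def procrustes_angle_2d(A, B):
    """Rotation angle phi minimising sum_i ||R(phi) a_i - b_i||^2 in two dimensions."""
    C = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A, B))   # sum of dot products
    S = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(A, B))   # sum of cross products
    return math.atan2(S, C)

def rotate(points, phi):
    c, s = math.cos(phi), math.sin(phi)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

A = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
B = rotate(A, 0.7)                         # a rotated copy of A
print(procrustes_angle_2d(A, B))           # recovers 0.7
```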

Exercises 11.3 (Latent factor linear model) Let y = (y_1, …, y_n) be an n-vector of real-valued response variables, with an associated n × p matrix of covariates X with rows x_1, …, x_n ∈ R^p. Consider the latent factor linear model,

    y_i = x_i · β + z_i · γ + ε_i,

which presumes an n × q matrix Z of further, unobserved covariates z_1, …, z_n ∈ R^q with corresponding regression coefficients γ ∈ R^q. Suppose the following independent distributions:

    ε_i ∼ Normal(0, σ²),
    β ∼ Normal_p(0, σ² V),
    γ ∼ Normal_q(0, σ² U),

for σ > 0 and symmetric, positive semidefinite p × p and q × q matrices V and U.

(i) State the conditional distribution [y | σ, X, Z].
(ii) Suppose σ⁻² ∼ Gamma(a, b) for a, b > 0. State the conditional distribution [y | X, Z].

 Exercises 11.4 (Latent factor linear model code) Write Stan code to fit the model
from Exercise 11.3 with V = v I p and U = u Iq for known v, u > 0. Assume a
reference prior for the latent factor matrix Z .
Correction to: An Introduction
to Bayesian Inference, Methods
and Computation

Correction to:
N. Heard, An Introduction to Bayesian Inference, Methods
and Computation, https://doi.org/10.1007/978-3-030-82808-0

After publication of the book, the author noticed that the typeset of ‘p’ was incorrectly
processed in Chaps. 2 and 9 during production. Therefore corrections have been
incorporated in these two chapters: the typesetting of ‘P’ was corrected.

The updated version of these chapters can be found at


https://doi.org/10.1007/978-3-030-82808-0,
https://doi.org/10.1007/978-3-030-82808-0_2,
https://doi.org/10.1007/978-3-030-82808-0_5,
https://doi.org/10.1007/978-3-030-82808-0_6,
https://doi.org/10.1007/978-3-030-82808-0_8,
https://doi.org/10.1007/978-3-030-82808-0_9,
https://doi.org/10.1007/978-3-030-82808-0_11

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 C1


N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0_12
Appendix A
Conjugate Parametric Models

For each probability model below, x = (x1 , . . . , xn ) are n independent samples from a
likelihood distribution p(x|θ, ψ), for which there exists a conjugate prior distribution
p(θ ) for θ .
Each of the tables for discrete and continuous parametric models provides the
following details:
• Ranges for x and θ .
• Likelihood distribution p(x | θ, ψ) and the conjugate prior density p(θ ).
• Marginal likelihood p(x) and the posterior density p(θ | x), denoted π(θ ).
• Posterior predictive distribution p(x | x) for a new observation x.

A.1 Notation

To make notation concise, let

    ẋ = ∑_{i=1}^n x_i,    ẍ = ∑_{i=1}^n x_i · x_iᵀ,    (A.1)

respectively, denote the sum and the sum of squared values in x. Let x_(1) ≤ … ≤ x_(n) denote the order statistics of x. Finally, for discrete random variables on {1, …, k}, let

    n_j = ∑_{i=1}^n 1_{{j}}(x_i)    (A.2)

be the number of samples for which x_i = j, and let n = (n_1, …, n_k).


Some other items appearing in the following tables:
• Hyperparameters a, b represent positive real numbers unless otherwise stated.


• ζ(a, b) = ∑_{x=0}^∞ (x + b)^{−a} is the Hurwitz zeta function.
• Δ(k) denotes the standard (or probability) simplex {u ∈ R^k : u_i ≥ 0, ∑_{i=1}^k u_i = 1}.
• For the Dirichlet distribution, α ∈ {u ∈ R^k : u_i ≥ 0, ∑_{i=1}^k u_i > 0}.
• For the normal and inverse Wishart equations, m ∈ R^k and the matrices V, ψ and B are assumed positive definite.

A.2 Discrete Models

Uniform(x | {1, …, θ}) with Zeta(θ | a, b) prior
  x ∈ {1, …, θ};  θ ∈ {b, b + 1, …}, a > 1, b ∈ {1, 2, …}
  p(x | θ) = 1_{1,…,θ}(x) / θ
  p(θ) = 1_{b,b+1,…}(θ) / (ζ(a, b) θ^a)
  p(x) = ζ(a + n, b*) / ζ(a, b),  b* = max{b, x_(n)}
  π(θ) = 1_{b*,b*+1,…}(θ) / (ζ(a + n, b*) θ^{a+n})  ≡  Zeta(θ | a + n, b*)
  p(x | x) = ζ(a + n + 1, max{b*, x}) / ζ(a + n, b*)

Bernoulli(x | θ) with Beta(θ | a, b) prior
  x ∈ {0, 1};  θ ∈ [0, 1]
  p(x | θ) = θ^x (1 − θ)^{1−x}
  p(θ) = Γ(a + b) / (Γ(a) Γ(b)) · θ^{a−1} (1 − θ)^{b−1}
  p(x) = Γ(a + b) Γ(a + ẋ) Γ(b + n − ẋ) / (Γ(a) Γ(b) Γ(a + b + n))
  π(θ) = Γ(a + b + n) / (Γ(a + ẋ) Γ(b + n − ẋ)) · θ^{a+ẋ−1} (1 − θ)^{b+n−ẋ−1}  ≡  Beta(a + ẋ, b + n − ẋ)
  p(x | x) = (a + ẋ)^x (b + n − ẋ)^{1−x} / (a + b + n)

Geometric(x | θ) with Beta(θ | a, b) prior
  x ∈ {0, 1, 2, …};  θ ∈ [0, 1]
  p(x | θ) = θ (1 − θ)^x
  p(θ) = Γ(a + b) / (Γ(a) Γ(b)) · θ^{a−1} (1 − θ)^{b−1}
  p(x) = Γ(a + b) Γ(a + n) Γ(b + ẋ) / (Γ(a) Γ(b) Γ(a + b + n + ẋ))
  π(θ) = Γ(a + b + n + ẋ) / (Γ(a + n) Γ(b + ẋ)) · θ^{a+n−1} (1 − θ)^{b+ẋ−1}  ≡  Beta(a + n, b + ẋ)
  p(x | x) = (a + n) Γ(b + ẋ + x) Γ(a + b + n + ẋ) / (Γ(b + ẋ) Γ(a + b + n + 1 + ẋ + x))

Poisson(x | θ) with Gamma(θ | a, b) prior
  x ∈ {0, 1, 2, …};  θ ∈ [0, ∞)
  p(x | θ) = θ^x e^{−θ} / x!
  p(θ) = b^a / Γ(a) · θ^{a−1} e^{−bθ}
  p(x) = Γ(a + ẋ) b^a / (Γ(a) (b + n)^{a+ẋ})
  π(θ) = (b + n)^{a+ẋ} / Γ(a + ẋ) · θ^{a+ẋ−1} e^{−(b+n)θ}  ≡  Gamma(a + ẋ, b + n)
  p(x | x) = Γ(a + ẋ + x) (b + n)^{a+ẋ} / (Γ(a + ẋ) (b + n + 1)^{a+ẋ+x})

Multinomial_k(x | 1, θ) with Dirichlet_k(θ | α) prior
  x ∈ {(1, 0, …, 0), …, (0, …, 0, 1)};  θ ∈ Δ(k)
  p(x | θ) = ∏_{j=1}^k θ_j^{x_j}
  p(θ) = Γ(∑_{j=1}^k α_j) / ∏_{j=1}^k Γ(α_j) · ∏_{j=1}^k θ_j^{α_j−1}
  p(x) = Γ(∑_{j=1}^k α_j) ∏_{j=1}^k Γ(α_j + n_j) / (Γ(∑_{j=1}^k α_j + n) ∏_{j=1}^k Γ(α_j))
  π(θ) = Γ(∑_{j=1}^k α_j + n) / ∏_{j=1}^k Γ(α_j + n_j) · ∏_{j=1}^k θ_j^{α_j+n_j−1}  ≡  Dirichlet_k(θ | α + n)
  p(x | x) = ∏_{j=1}^k ( (α_j + n_j) / (∑_{j'=1}^k α_{j'} + n) )^{x_j}

A.3 Continuous Models

Uniform(x | 0, θ) with Pareto(θ | a, b) prior
  x ∈ [0, ∞);  θ ∈ (0, ∞)
  p(x | θ) = 1_{[0,θ]}(x) / θ
  p(θ) = a b^a 1_{[b,∞)}(θ) / θ^{a+1}
  p(x) = a b^a / ((a + n) b*^{a+n}),  b* = max{b, x_(n)}
  π(θ) = (a + n) b*^{a+n} 1_{[b*,∞)}(θ) / θ^{a+n+1}  ≡  Pareto(θ | a + n, b*)
  p(x | x) = (a + n) b*^{a+n} / ((a + n + 1) max{b*, x}^{a+n+1})

Exponential(x | θ) with Gamma(θ | a, b) prior
  x ∈ [0, ∞);  θ ∈ (0, ∞)
  p(x | θ) = θ e^{−θx}
  p(θ) = b^a / Γ(a) · θ^{a−1} e^{−bθ}
  p(x) = Γ(a + n) b^a / (Γ(a) (b + ẋ)^{a+n})
  π(θ) = (b + ẋ)^{a+n} / Γ(a + n) · θ^{a+n−1} e^{−(b+ẋ)θ}  ≡  Gamma(θ | a + n, b + ẋ)
  p(x | x) = (a + n) (b + ẋ)^{a+n} / (b + ẋ + x)^{a+n+1}

Gamma(x | ψ, θ) with Gamma(θ | a, b) prior (shape ψ known)
  x ∈ [0, ∞);  θ ∈ (0, ∞)
  p(x | θ) = θ^ψ x^{ψ−1} e^{−θx} / Γ(ψ)
  p(θ) = b^a / Γ(a) · θ^{a−1} e^{−bθ}
  p(x) = Γ(a + nψ) b^a / (Γ(a) (b + ẋ)^{a+nψ})
  π(θ) = (b + ẋ)^{a+nψ} / Γ(a + nψ) · θ^{a+nψ−1} e^{−(b+ẋ)θ}  ≡  Gamma(θ | a + nψ, b + ẋ)
  p(x | x) = Γ(a + (n + 1)ψ) (b + ẋ)^{a+nψ} / (Γ(a + nψ) (b + ẋ + x)^{a+(n+1)ψ})

Normal_k(x | θ, ψ) with Normal(θ | m, V) prior (covariance ψ known)
  x ∈ R^k;  θ ∈ R^k
  p(x | θ) = exp{−(1/2) ∑_{i=1}^n (x_i − θ)ᵀ ψ^{−1} (x_i − θ)} / ((2π)^{nk/2} |ψ|^{n/2})
  p(θ) = exp{−(1/2) (θ − m)ᵀ V^{−1} (θ − m)} / ((2π)^{k/2} |V|^{1/2})
  p(x) = |V*|^{1/2} exp{−(1/2) mᵀ V^{−1} m − (1/2) ∑_{i=1}^n x_iᵀ ψ^{−1} x_i + (1/2) m*ᵀ V*^{−1} m*} / ((2π)^{nk/2} |ψ|^{n/2} |V|^{1/2})
  m* = V* (V^{−1} m + ψ^{−1} ẋ),  V* = (V^{−1} + n ψ^{−1})^{−1}
  π(θ) = exp{−(1/2) (θ − m*)ᵀ V*^{−1} (θ − m*)} / ((2π)^{k/2} |V*|^{1/2})  ≡  Normal(θ | m*, V*)
  p(x | x) = Normal(x | m*, ψ + V*)

Normal_k(x | 0, θ) with Inverse Wishart(θ | a, B) prior
  x ∈ R^k;  θ ∈ R^{k×k} positive definite
  p(x | θ) = exp{−(1/2) ∑_{i=1}^n x_iᵀ θ^{−1} x_i} / ((2π)^{nk/2} |θ|^{n/2})
  p(θ) = |B|^{a/2} exp{−(1/2) tr(B θ^{−1})} / (2^{ak/2} |θ|^{(a+k+1)/2} π^{k(k−1)/4} ∏_{ℓ=1}^k Γ((a + 1 − ℓ)/2))
  p(x) = |B|^{a/2} ∏_{ℓ=1}^k Γ((a + n + 1 − ℓ)/2) / (π^{nk/2} |B + ẍ|^{(a+n)/2} ∏_{ℓ=1}^k Γ((a + 1 − ℓ)/2))
  π(θ) = |B + ẍ|^{(a+n)/2} exp{−(1/2) tr((B + ẍ) θ^{−1})} / (2^{(a+n)k/2} |θ|^{(a+n+k+1)/2} π^{k(k−1)/4} ∏_{ℓ=1}^k Γ((a + n + 1 − ℓ)/2))  ≡  Inverse Wishart(θ | a + n, B + ẍ)
  p(x | x) = |B + ẍ|^{(a+n)/2} ∏_{ℓ=1}^k Γ((a + n + 2 − ℓ)/2) / (π^{k/2} |B + ẍ + x xᵀ|^{(a+n+1)/2} ∏_{ℓ=1}^k Γ((a + n + 1 − ℓ)/2))
Appendix B
Solutions to Exercises

Solution 1.1 Linear transformations of utilities. Let u(·) be a utility function with corresponding expected utility ū(·), and consider a linear transformation

    u′(c) = α + β u(c),

where α, β ∈ R. Under utility function u′(·), the corresponding expected utility for an action a = {(E_1, c_1), (E_2, c_2), …} ∈ A is

    ū′(a) = ∑_i P(E_i) u′(c_i) = α + β ∑_i P(E_i) u(c_i) = α + β ū(a).

If β > 0, then for two actions a, a′ ∈ A, ū′(a) < ū′(a′) ⟺ ū(a) < ū(a′).
Solution 1.2 Bounded utility. Let a = {(Ω, c)} and a′ = {(S_{u(c)}, c^*), (S̄_{u(c)}, c_*)}. Since P(Ω) = 1, ū(a) = u(c). For the dichotomy a′,

    ū(a′) = P(S_{u(c)}) u(c^*) + (1 − P(S_{u(c)})) u(c_*) = u(c) · 1 + (1 − u(c)) · 0 = u(c).

Hence ū(a) = ū(a′) and therefore a′ ∼ a.
Solution 1.3 Unbounded utility.
(i) If {(Ω, c_1)} ∼ {(S̄_x, c), (S_x, c_2)}, then 0 = u(c_1) = (1 − x) u(c) + x u(c_2) = (1 − x) u(c) + x · 1 ⟹ u(c) = −x/(1 − x) < 0.
(ii) If {(Ω, c_2)} ∼ {(S̄_x, c_1), (S_x, c)}, then 1 = u(c_2) = (1 − x) u(c_1) + x u(c) = (1 − x) · 0 + x u(c) ⟹ u(c) = 1/x > 1.

Solution 1.4 Transitivity of preference. Show that for a, a′, a″ ∈ A, if a ≤ a′ and a′ ≤ a″, then a ≤ a″.
Solution 1.5 Coherence with probabilities. By Axiom 3, for any c_1 <_C c_2, there exist x, x′ ∈ [0, 1] such that {(Ē, c_1), (E, c_2)} ∼ {(S̄_x, c_1), (S_x, c_2)} and {(F̄, c_1), (F, c_2)} ∼ {(S̄_{x′}, c_1), (S_{x′}, c_2)}, namely x = P(E), x′ = P(F), and hence x ≤ x′. Since x ≤ x′,

S_x ⊆ S_{x′}, and therefore by Axiom 2, {(S̄_x, c_1), (S_x, c_2)} ≤ {(S̄_{x′}, c_1), (S_{x′}, c_2)}, and the result follows from the initial equivalences and transitivity of preferences.
An alternative proof could have used the principle of maximising expected utility.
Solution 1.6 Absolute loss (also known as L_1 loss). For a univariate, continuous-valued ω ∈ R, the absolute loss function gives expected utility

    ū(d_ω̂) = −∫_{−∞}^∞ |ω̂ − ω| f(ω) dω = −∫_{−∞}^{ω̂} (ω̂ − ω) f(ω) dω + ∫_{ω̂}^∞ (ω̂ − ω) f(ω) dω.

Differentiating the right-hand side with respect to ω̂ and setting equal to zero yields

    ∫_{−∞}^{ω̂} f(ω) dω + ω̂ f(ω̂) − ω̂ f(ω̂) = ∫_{ω̂}^∞ f(ω) dω − ω̂ f(ω̂) + ω̂ f(ω̂)
    ⟺ ∫_{−∞}^{ω̂} f(ω) dω = ∫_{ω̂}^∞ f(ω) dω
    ⟺ ∫_{−∞}^{ω̂} f(ω) dω = 1/2,

since necessarily ∫_{−∞}^∞ f(ω) dω = 1. Hence ω̂ is the median.
Solution 1.7 Squared loss (also known as L_2 loss). For a univariate, continuous-valued ω ∈ R, the squared loss function gives expected utility

    ū(d_ω̂) = −∫_{−∞}^∞ (ω̂ − ω)² f(ω) dω.

Differentiating with respect to ω̂ and setting equal to zero yields

    0 = −2 ∫ (ω̂ − ω) f(ω) dω = −2 {ω̂ − E(ω)}
    ⟹ ω̂ = E(ω).

Solution 1.8 Zero-one loss (also known as L_∞ loss). For a univariate, continuous-valued ω ∈ R, and for ε > 0, define the ε-ball zero-one loss function

    ℓ_ε(ω̂, ω) = 1 − 1_{B_ε(ω̂)}(ω),

where B_ε(ω) = (ω − ε, ω + ε). This loss function implies an expected utility

    ū_ε(d_ω̂) = E[1_{B_ε(ω̂)}(ω)] = P{B_ε(ω̂)}.

As ε → 0 to obtain the zero-one loss function, the right-hand side (rescaled by 2ε, since P{B_ε(ω̂)} ≈ 2ε f(ω̂)) tends to f(ω̂), which is clearly maximised by the mode of f.

Solution 1.9 KL-divergence non-negative. If p = q then KL(p ∥ q) = ∫ p(x) log 1 dx = 0.

For p ≠ q, the non-negativity of KL-divergence can be demonstrated using the logarithmic inequality log(a) ≥ 1 − a^{−1} for any a > 0. This rule gives

    log(p(x)/q(x)) ≥ 1 − q(x)/p(x)
    ⟹ KL(p ∥ q) = ∫ p(x) log(p(x)/q(x)) dx ≥ ∫ p(x) (1 − q(x)/p(x)) dx = ∫ {p(x) − q(x)} dx = 0.

Therefore, when KL-divergence is used as a loss function for prediction, the smallest expected loss (zero) is incurred when reporting genuine beliefs.
Solution 2.1 Finitely exchangeable binary sequences. Suppose X_1, …, X_n are assumed to be independent and identically distributed Bernoulli(1/2) random variables, and it is observed that ∑_i X_i = s. Conditional on this information, X_1, …, X_n are still exchangeable with constant probability mass function

    p_{X_1,…,X_n | ∑_i X_i = s}(x_1, …, x_n) = 1_{{s}}(∑_i x_i) / C(n, s),

where C(n, s) is the binomial coefficient. However, for 0 < s < n, this constant mass function cannot be reconciled with a generative process (2.1) where a probability parameter θ is sampled from a probability measure (Q), followed by a sample of n independent Bernoulli(θ) trials X_1, …, X_n; a degenerate value of θ ∈ {0, 1} would not admit ∑_i X_i = s, whilst any non-degenerate value 0 < θ < 1 would admit positive probability to ∑_i X_i ≠ s.
Solution 2.2 Predictive distribution for exchangeable binary sequences. The result follows from substituting Theorem 2.1 into the conditional probability identity

    P_{X_{m+1},…,X_n | x_1,…,x_m}(x_{m+1}, …, x_n) = P_{X_1,…,X_n}(x_1, …, x_n) / P_{X_1,…,X_m}(x_1, …, x_m).
Solution 2.3 Variances under transformations. If θ ∼ Gamma(a, b), then θ −1 ∼


Inverse-Gamma(a, b). The variances for these respective distributions are a/b2 and
b2 /{(a − 1)2 (a − 2)} for a > 2. Consequently, either large a or small b increase the
variance of θ , but reduce the variance of 1/θ .
Solution 2.4 Asymptotic normality. Let ẋ = ∑_{i=1}^n x_i and x̄ = ẋ/n. Then

    ∑_{i=1}^n log F(x_i; θ) = ẋ log θ + (n − ẋ) log(1 − θ)
    ⟹ d/dθ ∑_{i=1}^n log F(x_i; θ) = ẋ/θ − (n − ẋ)/(1 − θ)
    ⟹ I_n(θ) = −d²/dθ² ∑_{i=1}^n log F(x_i; θ) = ẋ/θ² + (n − ẋ)/(1 − θ)².    (B.1)

Setting the first derivative (B.1) equal to zero yields θ̂_n = ẋ/n = x̄. Similarly, for Q(θ) = Beta(θ | a, b), the prior mode is m_0 = (a − 1)/(a + b − 2) and

    I_0(θ) = −d²/dθ² log dQ(θ) = (a − 1)/θ² + (b − 1)/(1 − θ)²
    ⟹ H_n = I_0(m_0) + I_n(θ̂_n)
    ⟹ H_n^{−1} = (a − 1)(b − 1) x̄(1 − x̄) / [(a + b − 2)³ x̄(1 − x̄) + (a − 1)(b − 1) n]
    ⟹ m_n = H_n^{−1} (I_0(m_0) m_0 + I_n(θ̂_n) θ̂_n) = (a − 1) x̄ {(a + b − 2)² (1 − x̄) + (b − 1) n} / [(a + b − 2)³ x̄(1 − x̄) + (a − 1)(b − 1) n].

Asymptotically, as n → ∞,

    θ | x_1, …, x_n ∼̇ Normal(m_n, H_n^{−1}) → Normal(x̄, x̄(1 − x̄)/n).

Alternatively, from Section A.2, Q(θ | x_1, …, x_n) = Beta(θ | a + ẋ, b + n − ẋ). The beta distribution is known to be approximately normal when both parameters grow large. Using the moments of the beta distribution, approximately

    θ | x_1, …, x_n ∼̇ Normal( (a + ẋ)/(a + b + n), (a + ẋ)(b + n − ẋ)/[(a + b + n)² (a + b + n + 1)] ) → Normal(x̄, x̄(1 − x̄)/n).

Solution 3.1 Identifying parents and children.

parents(X 1 ) = ∅, children(X 1 ) = {X 2 , X 4 };
parents(X 2 ) = {X 1 }, children(X 2 ) = {X 4 };
parents(X 3 ) = {X 4 }, children(X 3 ) = ∅;
parents(X 4 ) = {X 1 , X 2 }, children(X 4 ) = {X 3 }.

Solution 3.2 Identifying neighbours. neighbours(X 1 ) = {X 2 , X 4 }, neighbours(X 2 )


= {X 1 , X 4 }, neighbours(X 3 ) = {X 4 }, neighbours(X 4 ) = {X 1 , X 2 , X 3 }.
Solution 3.3 Identifying cliques. {X 1 , X 2 , X 4 } and {X 3 , X 4 } are both maximal
cliques.
Solution 3.4 Identifying separating sets. {X 4 } separates {X 1 , X 2 } from {X 3 }; {X 1 , X 4 }
separates {X 2 } from {X 3 }; {X 2 , X 4 } separates {X 1 } from {X 3 }.
Solution 3.5 Belief network distribution. PG (X 1 , X 2 , X 3 , X 4 ) = P(X 1 ) P(X 2 | X 1 )
P(X 4 | X 1 , X 2 ) P(X 3 | X 4 ).
Solution 3.6 Identifying colliders. The paths in (a) and (b) have no colliders, but in
path (c) the node X 2 is a collider.

Solution 3.7 Identifying d-separated and d-connected nodes. X 1 and X 3 are d-


separated by X 2 in (a) and (b) of Fig. 3.3, and d-connected by X 2 in (c).
Solution 3.8 Identifying conditional independencies in a belief network.
(i) X_1 ⊥̸⊥ X_3 in (a) and (b), and X_1 ⊥⊥ X_3 in (c).
(ii) X_1 ⊥⊥ X_3 | X_2 in (a) and (b), and X_1 ⊥̸⊥ X_3 | X_2 in (c).

Solution 3.9 Markov network distribution.

PG (X 1 , X 2 , X 3 , X 4 ) = φ1 (X 1 , X 2 , X 4 )φ2 (X 3 , X 4 ).

Solution 3.10 Pairwise Markov network distribution.

PG (X 1 , X 2 , X 3 , X 4 ) = φ1,2 (X 1 , X 2 )φ1,4 (X 1 , X 4 )φ2,4 (X 2 , X 4 )φ3,4 (X 3 , X 4 ).

Solution 3.11 Gaussian Markov random field. Suppose X = (X_1, …, X_n) ∼ N_n(μ, Σ), and let Λ = Σ^{−1}. Without loss of generality, to simplify notation assume μ = 0.

For x ∈ R^n, let x_{−ℓ} = (x_1, …, x_{ℓ−1}, x_{ℓ+1}, …, x_n) be the (n − 1)-vector with component ℓ removed. From the density function of the multivariate normal distribution,

    f(x_ℓ | x_{−ℓ}) ∝ f(x_1, …, x_n) ∝ exp{−(1/2) ∑_{i=1}^n ∑_{j=1}^n x_i Λ_{i,j} x_j} ∝ exp{−(1/2) Λ_{ℓ,ℓ} x_ℓ² − ∑_{j≠ℓ} Λ_{ℓ,j} x_ℓ x_j}

when considered as a function of x_ℓ. The components x_j which affect this density are those j for which Λ_{ℓ,j} ≠ 0, which by construction are those j for which (ℓ, j) ∈ E. Hence

    f(x_ℓ | x_{−ℓ}) = f(x_ℓ | neighbours_G(x_ℓ)).

Solution 4.1 Conjugacy of Bernoulli and beta distributions. Under the Bernoulli likelihood model,

    p(x | θ) = θ^ẋ (1 − θ)^{n−ẋ},

where ẋ = ∑_{i=1}^n x_i. If θ ∼ Beta(a, b) then

    p(θ) ∝ θ^{a−1} (1 − θ)^{b−1}.

By (4.2),

    π(θ) ∝ p(x | θ) p(θ) ∝ θ^{a+ẋ−1} (1 − θ)^{b+n−ẋ−1},

which is proportional to the density of Beta(a + ẋ, b + n − ẋ). Hence θ | x ∼ Beta(a + ẋ, b + n − ẋ).

Solution 4.2 Conjugacy of Poisson and gamma distributions. Under the Poisson likelihood model,

    p(x | θ) ∝ θ^ẋ e^{−nθ},

where ẋ = ∑_{i=1}^n x_i. If θ ∼ Gamma(a, b) then

    p(θ) ∝ θ^{a−1} e^{−bθ}.

By (4.2),

    π(θ) ∝ p(x | θ) p(θ) ∝ θ^{a+ẋ−1} e^{−(b+n)θ},

which is proportional to the density of Gamma(a + ẋ, b + n). Hence θ | x ∼ Gamma(a + ẋ, b + n).
Solution 4.3 Conjugacy of uniform and Pareto distributions. Under the uniform likelihood model, for x_1, …, x_n > 0,

    p(x | θ) = ∏_{i=1}^n 1_{[0,θ]}(x_i) / θ^n = 1_{[0,θ]}(x_(n)) / θ^n = 1_{[x_(n),∞)}(θ) / θ^n,

where x_(n) = max{x_1, …, x_n}. If θ ∼ Pareto(a, b) then

    p(θ) ∝ 1_{[b,∞)}(θ) / θ^{a+1}.

By (4.2),

    π(θ) ∝ p(x | θ) p(θ) ∝ 1_{[b,∞)}(θ) 1_{[x_(n),∞)}(θ) / θ^{a+n+1} = 1_{[max{b,x_(n)},∞)}(θ) / θ^{a+n+1},

which is proportional to the density of Pareto(a + n, max{b, x_(n)}). Hence θ | x ∼ Pareto(a + n, max{b, x_(n)}).
Solution 4.4 Conjugacy of exponential and gamma distributions. Under the exponential likelihood model,

    p(x | θ) = θ^n e^{−θẋ},

where ẋ = ∑_{i=1}^n x_i. If θ ∼ Gamma(a, b) then

    p(θ) ∝ θ^{a−1} e^{−bθ}.

By (4.2),

    π(θ) ∝ p(x | θ) p(θ) ∝ θ^{a+n−1} e^{−(b+ẋ)θ},

which is proportional to the density of Gamma(a + n, b + ẋ). Hence θ | x ∼ Gamma(a + n, b + ẋ).

Solution 4.5 Calculating a marginal distribution. For θ_1, θ_2 > 0,

    π(θ_1) = (b^a θ_1^a e^{−bθ_1} / Γ(a)) ∫_0^∞ e^{−θ_1 θ_2} dθ_2 = (b^a θ_1^a e^{−bθ_1} / Γ(a)) [−e^{−θ_1 θ_2} / θ_1]_0^∞ = b^a θ_1^{a−1} e^{−bθ_1} / Γ(a)

and

    π(θ_2) = (b^a / Γ(a)) ∫_0^∞ θ_1^a e^{−(b+θ_2)θ_1} dθ_1 = (b^a / (Γ(a) (b + θ_2)^{a+1})) ∫_0^∞ x^a e^{−x} dx
           = Γ(a + 1) b^a / (Γ(a) (b + θ_2)^{a+1}) = a b^a / (b + θ_2)^{a+1}.

Solution 4.6 Credible interval for the exponential distribution.

    ∫_{−∞}^{θ_*} π(θ) dθ = ∫_0^{θ_*} λ e^{−λθ} dθ = [−e^{−λθ}]_0^{θ_*} = 1 − e^{−λθ_*}.

Then 1 − e^{−λθ_*} = (1 − α)/2 ⟺ θ_* = −log{(1 + α)/2}/λ. Similarly,

    ∫_{θ^*}^∞ π(θ) dθ = ∫_{θ^*}^∞ λ e^{−λθ} dθ = [−e^{−λθ}]_{θ^*}^∞ = e^{−λθ^*}

and e^{−λθ^*} = (1 − α)/2 ⟺ θ^* = −log{(1 − α)/2}/λ.

Hence a 100α% credible interval for θ is

    [−log{(1 + α)/2}/λ, −log{(1 − α)/2}/λ].

Solution 5.1 Monte Carlo probabilities. The probability of θ lying inside A ⊂ Θ can be expressed as an expectation using the indicator function,

    P_π(θ ∈ A) = ∫_A π(θ) dθ = ∫_Θ π(θ) 1_A(θ) dθ = E_π{1_A(θ)}.

Hence a Monte Carlo estimate for P_π(θ ∈ A) can be obtained by

    P̂_π(θ ∈ A) = (1/M) ∑_{i=1}^M 1_A(θ^{(i)}).

Solution 5.2 Monte Carlo estimate of a conditional expectation. Using the identity

    E_{π|A}(g(θ) | θ ∈ A) = E_π{1_A(θ) g(θ)} / E_π{1_A(θ)},

it follows that conditional expectations can be approximated by

    Ê_{π|A}(g(θ) | θ ∈ A) = ∑_{i=1}^M 1_A(θ^{(i)}) g(θ^{(i)}) / ∑_{i=1}^M 1_A(θ^{(i)}),

provided ∑_{i=1}^M 1_A(θ^{(i)}) > 0 (meaning there are samples lying in A).
Solution 5.3 Monte Carlo credible interval. From Exercise 5.1, it follows that

    P̂π(θ ∈ [θ(M(1−α)/2), θ(M(1+α)/2)]) = α,

and therefore Rα = [θ(M(1−α)/2), θ(M(1+α)/2)] is a Monte Carlo approximated 100α% credible region for θ.
Solution 5.4 Monte Carlo optimal decision estimation. The Monte Carlo estimate of the expected loss function is

    Êπ{ℓ(θ̂, θ)} = −(1/3) Σ_{i=1}^3 exp{−(θ̂ − θ^(i))²/10},

which is plotted below.

[Figure: Êπ{ℓ(θ̂, θ)} plotted against θ̂ over the range 0 to 12, taking values between roughly −0.3 and −0.5.]

The following Python code then identifies the minimising value numerically using an optimisation routine from the SciPy¹ library.

¹ https://www.scipy.org.

Hence, an approximate Bayesian estimate of θ under this loss function is θ̂ ≈ 3.532. [This estimate differs from the Monte Carlo estimate of the mean of π, Êπ(θ) = (2 + 5 + 11)/3 = 6.]
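The book's listing is not reproduced here; a minimal sketch of the same computation (the choice of scipy.optimize.minimize_scalar is mine) recovers the quoted value:

```python
import numpy as np
from scipy.optimize import minimize_scalar

theta_samples = np.array([2.0, 5.0, 11.0])  # Monte Carlo samples from pi

def expected_loss(theta_hat):
    """Monte Carlo estimate of the expected loss at theta_hat."""
    return -np.mean(np.exp(-(theta_hat - theta_samples) ** 2 / 10))

result = minimize_scalar(expected_loss, bounds=(0, 12), method="bounded")
print(result.x)  # approximately 3.532
```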
Solution 5.5 Importance sampling Monte Carlo standard error. Using the identity (5.8), it follows from (5.4) that

    s.e.{Ê_π^IS{g(θ)}} = √[ {1/(M(M−1))} Σ_{i=1}^M {w_i g(θ^(i)) − (1/M) Σ_{m=1}^M w_m g(θ^(m))}² ].

Solution 5.6 Gibbs sampling.

(i) π(θ1, θ2) = (1/2) φ(θ1 − μ) φ(θ2 − μ) + (1/2) φ(θ1 + μ) φ(θ2 + μ).

(ii) Marginally,

    π(θ2) = ∫ π(θ1, θ2) dθ1 = (1/2) φ(θ2 − μ) + (1/2) φ(θ2 + μ)

    ⟹ π(θ1 | θ2) = π(θ1, θ2) / π(θ2) = {φ(θ1 − μ)φ(θ2 − μ) + φ(θ1 + μ)φ(θ2 + μ)} / {φ(θ2 − μ) + φ(θ2 + μ)}
                 = w(θ2) φ(θ1 − μ) + {1 − w(θ2)} φ(θ1 + μ),

where

    w(θi) = φ(θi − μ) / {φ(θi − μ) + φ(θi + μ)} = (1 + e^(−2θiμ))^(−1).

By symmetry,

    π(θ2 | θ1) = w(θ1) φ(θ2 − μ) + {1 − w(θ1)} φ(θ2 + μ).

(iii) As μ increases, the target density becomes bimodal and the mixture weight w(θi) → 0 if θi is negative, and w(θi) → 1 if θi is positive, and therefore θ becomes stuck near either (−μ, −μ) or (μ, μ).

Solution 5.7 Gibbs sampling implementation.

Starting the chain from (0, 0), the left-hand plot shows good mixing when μ = 1, whereas in the right-hand plot, when μ = 3, there is only one transition between the two modes during 100 iterations.
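The implementation itself does not survive in this extraction; the following sketch (function and variable names are my own) runs the two full conditionals derived in Solution 5.6:

```python
import numpy as np

def gibbs_mixture(mu, n_iter=100, seed=0):
    """Gibbs sampler for the symmetric two-component normal mixture of Solution 5.6.

    Alternately draws theta_j | theta_-j from
    w(theta_-j) N(mu, 1) + {1 - w(theta_-j)} N(-mu, 1),
    where w(t) = 1 / (1 + exp(-2 t mu)).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    samples = np.empty((n_iter, 2))
    for it in range(n_iter):
        for j in range(2):
            w = 1.0 / (1.0 + np.exp(-2.0 * theta[1 - j] * mu))
            mean = mu if rng.random() < w else -mu  # pick a mixture component
            theta[j] = rng.normal(mean, 1.0)
        samples[it] = theta
    return samples

chain = gibbs_mixture(mu=1.0)
print(chain.mean(axis=0))
```

Running with mu=1.0 and mu=3.0 reproduces the qualitative behaviour described above: frequent mode switching in the first case, very rare switching in the second.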
Solution 5.8 Detailed balance of Metropolis-Hastings algorithm. If θ′ = θ, then by symmetry (5.12) trivially holds. If θ′ ≠ θ, then (5.16) simplifies to p(θ′ | θ) = α(θ, θ′) q(θ′ | θ) and it remains to show

    π(θ) α(θ, θ′) q(θ′ | θ) = π(θ′) α(θ′, θ) q(θ | θ′).

If π(θ′) = 0 then from (5.15), α(θ, θ′) = 0 and the equality holds. So now suppose π(θ′) > 0:

    π(θ) q(θ′ | θ) α(θ, θ′) = π(θ) q(θ′ | θ) min[1, {π(θ′) q(θ | θ′)} / {π(θ) q(θ′ | θ)}]
                            = min{π(θ) q(θ′ | θ), π(θ′) q(θ | θ′)}
                            = π(θ′) q(θ | θ′) min[1, {π(θ) q(θ′ | θ)} / {π(θ′) q(θ | θ′)}]
                            = π(θ′) q(θ | θ′) α(θ′, θ).

Solution 5.9 Gibbs sampling as Metropolis-Hastings special case. The ratio of posterior densities when the proposal satisfies θ′_{−j} = θ_{−j} is

    π(θ′) / π(θ) = {π(θ′_j | θ_{−j}) π(θ_{−j})} / {π(θ_j | θ_{−j}) π(θ_{−j})} = π(θ′_j | θ_{−j}) / π(θ_j | θ_{−j}),

which cancels with the ratio of proposal densities in the Metropolis-Hastings acceptance probability (5.15), and hence α(θ, θ′) = 1 and all such Metropolis-Hastings proposals are accepted with probability 1.
Solution 5.10 Metropolis-Hastings implementation.

In comparison with Gibbs sampling, there are fewer than 100 unique samples in
each case.
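The book's listing is likewise not reproduced in this extraction; a random-walk Metropolis-Hastings sketch for the same two-component mixture target (step size, seed and names are my own choices) is:

```python
import numpy as np

def log_target(theta, mu):
    """Log density (up to a constant) of the symmetric two-component normal mixture."""
    a = -0.5 * np.sum((theta - mu) ** 2)
    b = -0.5 * np.sum((theta + mu) ** 2)
    return np.logaddexp(a, b)  # log(e^a + e^b), numerically stable

def metropolis_hastings(mu, n_iter=100, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    samples = np.empty((n_iter, 2))
    accepted = 0
    for it in range(n_iter):
        proposal = theta + step * rng.normal(size=2)  # symmetric random walk
        log_alpha = log_target(proposal, mu) - log_target(theta, mu)
        if np.log(rng.random()) < log_alpha:
            theta, accepted = proposal, accepted + 1
        samples[it] = theta
    return samples, accepted / n_iter

chain, rate = metropolis_hastings(mu=1.0)
print(rate)
```

Because rejected proposals repeat the current state, the number of unique samples is the acceptance count, illustrating the comparison with Gibbs sampling made above.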

Solution 5.11 ELBO equivalence. Since log p(x, θ) = log π(θ) + log p(x),

    KL(q(θ) ‖ π(θ)) = ∫_Θ q(θ) log{q(θ)/π(θ)} dθ
                    = ∫_Θ q(θ) log q(θ) dθ − ∫_Θ q(θ) log p(x, θ) dθ + ∫_Θ q(θ) log p(x) dθ
                    = Eq log q(θ) − Eq log p(x, θ) + log p(x)
                    = − ELBO(q) + log p(x).

The log p(x) term does not depend on the density q, and so minimising this expression corresponds to maximising ELBO(q).
Solution 5.12 ELBO identity. Since log p(x, θ) = log p(x | θ) + log p(θ),

    ELBO(q) = Eq log p(x, θ) − Eq log q(θ)
            = Eq log p(x | θ) + Eq log p(θ) − Eq log q(θ)
            = Eq log p(x | θ) − Eq log{q(θ)/p(θ)}
            = Eq log p(x | θ) − KL(q(θ) ‖ p(θ)).

Solution 5.13 CAVI derivation. Using the identity p(x, θ) = π(θ) p(x),

    ELBO(q) = Eq log p(x, θ) − Eq log q(θ) = Eq log π(θ) + log p(x) − Eq log q(θ).

Since p(x) does not depend on q, maximising ELBO(q) is equivalent to maximising

    Eq log π(θ) − Eq log q(θ).

Writing π(θ) = π(θ−j) π(θj | θ−j) and q(θ) = q−j(θ−j) qj(θj), this objective becomes

    Eq−j log π(θ−j) + Eqj Eq−j log π(θj | θ−j) − Eq−j log q−j(θ−j) − Eqj log qj(θj).

Maximising with respect to qj is therefore equivalent to maximising

    Eqj Eq−j log π(θj | θ−j) − Eqj log qj(θj) = − KL[qj(θj) ‖ exp{Eq−j log π(θj | θ−j)}].

This KL-divergence is minimised by setting qj(θj) ∝ exp{Eq−j log π(θj | θ−j)}.


Solution 5.14 CAVI Gaussian approximation.

(i) With just two components, θ−j = θj̃. Taking the conditional distribution of a bivariate normal,

    θj | θj̃ ∼ Normal(μj + (Σjj̃/Σj̃j̃)(θj̃ − μj̃), Σjj − Σ²jj̃/Σj̃j̃)

    ⟹ π(θj | θj̃) ∝ exp[−{θj − μj − (Σjj̃/Σj̃j̃)(θj̃ − μj̃)}² / (2s²j)]

    ⟹ log π(θj | θj̃) = −{θj − μj − (Σjj̃/Σj̃j̃)(θj̃ − μj̃)}² / (2s²j) + constant

    ⟹ Eqj̃ log π(θj | θj̃) = −[θ²j − 2θj{μj + (Σjj̃/Σj̃j̃)(mj̃ − μj̃)}] / (2s²j) + constant
                          = −(θ²j − 2θj mj) / (2s²j) + constant

    ⟹ exp{Eqj̃ log π(θj | θj̃)} ∝ exp{−(θ²j − 2θj mj) / (2s²j)} ∝ exp{−(θj − mj)² / (2s²j)}.

(ii) The variances of the component densities qj are fixed, and the algorithm will converge when each mean value mj = μj.

(iii) The following Python code implements coordinate ascent variational inference for the bivariate normal distribution with correlation .95. The printed output gives the fitted mean and variance values. The contour plot mirrors Fig. 5.2(a).
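That listing does not survive in this extraction; a sketch of the updates derived in part (i), with each factor's fixed variance s²j = Σjj − Σ²jj̃/Σj̃j̃ and mean update mj = μj + (Σjj̃/Σj̃j̃)(mj̃ − μj̃) (the initial values are my own choice), is:

```python
import numpy as np

def cavi_bivariate_normal(mu, Sigma, n_iter=100):
    """Coordinate ascent variational inference for a bivariate normal target.

    Each factor q_j is N(m_j, s_j^2); the variances are fixed and the
    means are updated in turn until the fixed point m_j = mu_j is reached.
    """
    m = np.array([10.0, -10.0])  # deliberately poor initialisation
    s2 = np.empty(2)
    for j in range(2):
        k = 1 - j
        s2[j] = Sigma[j, j] - Sigma[j, k] ** 2 / Sigma[k, k]
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j
            m[j] = mu[j] + Sigma[j, k] / Sigma[k, k] * (m[k] - mu[k])
    return m, s2

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.95], [0.95, 1.0]])
m, s2 = cavi_bivariate_normal(mu, Sigma)
print(m, s2)  # means converge to mu = (0, 0); variances 1 - 0.95^2 = 0.0975
```

Each full sweep contracts the means toward μ by the factor 0.95², so convergence to the fixed point is rapid, while the fixed variances 0.0975 understate the true marginal variances of 1.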

Solution 7.1 Bayes factors for Gaussian distributions.

(i) Let ẍ = Σ_{i=1}^n xi², ẋ = Σ_{i=1}^n xi. From Section A.3, under model M1,

    p1(x) = exp[−{ẍ − (σ^(−2) + n)^(−1) ẋ²}/2] / {(2π)^(n/2) (nσ² + 1)^(1/2)}.

Similarly,

    p1(y) = exp[−{ÿ − (σ^(−2) + n)^(−1) ẏ²}/2] / {(2π)^(n/2) (nσ² + 1)^(1/2)}.

Under model M0,

    p0(x, y) = exp[−{ẍ + ÿ − (σ^(−2) + 2n)^(−1) (ẋ + ẏ)²}/2] / {(2π)^n (2nσ² + 1)^(1/2)}.

Consequently, the Bayes factor in favour of M0 is

    B01(x, y) = p0(x, y) / {p1(x) p1(y)}
              = {(nσ² + 1) / (2nσ² + 1)^(1/2)} exp{−(σ^(−2) + n)^(−1) (ẋ² + ẏ²)/2 + (σ^(−2) + 2n)^(−1) (ẋ + ẏ)²/2}.

(ii) From the previous expression,

    B01(x, y) = √{1 + n²σ⁴/(2nσ² + 1)} exp{−(σ^(−2) + n)^(−1) (ẋ² + ẏ²)/2 + (σ^(−2) + 2n)^(−1) (ẋ + ẏ)²/2}.

For large σ,

    B01(x, y) ≈ σ (n/2)^(1/2) exp{−(ẋ² + ẏ²)/(2n) + (ẋ + ẏ)²/(4n)}
              = σ (n/2)^(1/2) exp{−(ẋ − ẏ)²/(4n)}.

Clearly B01(x, y) → ∞ as σ → ∞. This means the simpler model, where θX = θY, will always be preferred.
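A quick numerical check of this limit, using the exact expression from part (i) on log scale (the data values are my own):

```python
import numpy as np

def log_b01(x, y, sigma):
    """Log Bayes factor in favour of M0, from Solution 7.1(i)."""
    n = len(x)
    xdot, ydot = np.sum(x), np.sum(y)
    log_const = np.log(n * sigma ** 2 + 1) - 0.5 * np.log(2 * n * sigma ** 2 + 1)
    quad = (-0.5 * (xdot ** 2 + ydot ** 2) / (sigma ** -2 + n)
            + 0.5 * (xdot + ydot) ** 2 / (sigma ** -2 + 2 * n))
    return log_const + quad

x = np.array([0.1, -0.4, 0.3])
y = np.array([0.2, 0.5, -0.1])
for sigma in (1.0, 10.0, 1000.0):
    print(sigma, log_b01(x, y, sigma))  # grows without bound in sigma
```

Whatever the data, the log Bayes factor eventually grows like log σ, which is the Lindley paradox behaviour discussed in the text.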

Solution 7.2 BIC for Gaussian distributions.

(i) It is easily shown that the maximum likelihood estimates for the mean parameters θX and θY under the two models are the corresponding sample means:

    M0: θ̂X = θ̂Y = (x̄ + ȳ)/2;    M1: θ̂X = x̄, θ̂Y = ȳ,

where x̄ = (1/n) Σ_{i=1}^n xi, ȳ = (1/n) Σ_{i=1}^n yi. Then for model M1,

    log p1(x, y | θ̂X, θ̂Y) = log p1(x | θ̂X) + log p1(y | θ̂Y)
                           = −n log(2π) − (1/2) Σ_{i=1}^n (xi − x̄)² − (1/2) Σ_{i=1}^n (yi − ȳ)²

    ⟹ BIC1 = 2n log(2π) + Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (yi − ȳ)² + 2 log(2n).

Under model M0,

    log p0(x, y | θ̂X = θ̂Y) = −n log(2π) − (1/2) Σ_{i=1}^n {xi − (x̄ + ȳ)/2}² − (1/2) Σ_{i=1}^n {yi − (x̄ + ȳ)/2}²

    ⟹ BIC0 = 2n log(2π) + Σ_{i=1}^n {xi − (x̄ + ȳ)/2}² + Σ_{i=1}^n {yi − (x̄ + ȳ)/2}² + log(2n).

(ii) The Bayes factor can be approximated using (7.1):

    BIC0 − BIC1 = n x̄² + n ȳ² − (n/2)(x̄ + ȳ)² − log(2n)
                = (n/2) x̄² + (n/2) ȳ² − n x̄ ȳ − log(2n)
                = (n/2)(x̄ − ȳ)² − log(2n)

    ⟹ B01 ≈ exp{−(BIC0 − BIC1)/2} = √(2n) exp{−(n/4)(x̄ − ȳ)²}.
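A direct numerical check of this difference against the closed form (the data values are my own; the common 2n log(2π) term cancels and is omitted):

```python
import numpy as np

def bic_difference(x, y):
    """BIC0 - BIC1 computed from the definitions in part (i),
    omitting the shared 2n log(2 pi) term, which cancels."""
    n = len(x)
    c = (np.mean(x) + np.mean(y)) / 2
    bic0 = np.sum((x - c) ** 2) + np.sum((y - c) ** 2) + np.log(2 * n)
    bic1 = (np.sum((x - np.mean(x)) ** 2) + np.sum((y - np.mean(y)) ** 2)
            + 2 * np.log(2 * n))
    return bic0 - bic1

x = np.array([1.2, 0.8, 1.1, 0.9])
y = np.array([1.6, 1.4, 1.7, 1.3])
n = len(x)
closed_form = (n / 2) * (np.mean(x) - np.mean(y)) ** 2 - np.log(2 * n)
print(bic_difference(x, y), closed_form)  # equal
```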

Solution 8.1 Marginal density for regression coefficients.

(i) Starting from the joint prior probability density function,

    p(β, σ²) = b^a exp[−σ^(−2) {b + β′V^(−1)β/2}] / {(2π)^(p/2) |V|^(1/2) Γ(a) σ^(2(a−1)+p)}

    ⟹ p(β) = ∫_0^∞ p(β, σ²) dσ^(−2) = b^a Γ(a + p/2) / [(2π)^(p/2) |V|^(1/2) Γ(a) (b + β′V^(−1)β/2)^(a+p/2)],

by comparison of the integrand as a function of σ^(−2) with the probability density function for a Gamma(a + p/2, b + β′V^(−1)β/2) random variable.

(ii) If V = Ip, then

    p(β) = {Γ(a + p/2) / ((2πb)^(p/2) Γ(a))} (1 + β′β/(2b))^(−(a+p/2)).

Regarded as a function of β, this density takes the same form as the density function of t_2a(0, (b/a) Ip), except that β can only take non-negative values.

Solution 8.2 Linear model matrix inverse. Setting A = In and U = W = X, the matrix inversion lemma gives (X V X′ + In)^(−1) = In − X(V^(−1) + X′X)^(−1) X′ = In − X Vn X′.

Solution 8.3 Linear model matrix determinant. Setting A = In and U = W = X, the matrix determinant lemma gives |In + X V X′| = |V^(−1) + X′X| |V|.
Solution 8.4 Linear model code. The function lm_log_likelihood is one possible Python implementation:
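The listing itself is not reproduced in this extraction; one possible sketch (the function name follows the text, the exact signature is my assumption) uses the standard conjugate-prior marginal likelihood together with the identities of Solutions 8.2 and 8.3:

```python
import numpy as np
from scipy.special import gammaln

def lm_log_likelihood(y, X, V, a, b):
    """Log marginal likelihood log p(y | X) for the conjugate normal linear
    model, under which y | X ~ St_n(2a, 0, b (X V X' + I_n) / a).

    Uses (X V X' + I_n)^{-1} = I_n - X V_n X' (Solution 8.2) and
    |I_n + X V X'| = |V^{-1} + X' X| |V| (Solution 8.3), so only
    p x p matrices are inverted.
    """
    n, p = X.shape
    V_inv = np.linalg.inv(V)
    precision = V_inv + X.T @ X
    V_n = np.linalg.inv(precision)
    _, logdet_prec = np.linalg.slogdet(precision)
    _, logdet_V = np.linalg.slogdet(V)
    quad = y @ y - (y @ X) @ V_n @ (X.T @ y)  # y' (I_n - X V_n X') y
    return (a * np.log(b) + gammaln(a + n / 2) - gammaln(a)
            - (n / 2) * np.log(2 * np.pi)
            - 0.5 * (logdet_prec + logdet_V)
            - (a + n / 2) * np.log(b + 0.5 * quad))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
print(lm_log_likelihood(y, X, np.eye(2), a=2.0, b=1.5))
```

As a check, the value agrees with evaluating the St_n(2a, 0, b(XVX′ + In)/a) log density directly, e.g. via scipy.stats.multivariate_t.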

Solution 8.5 Orthogonal covariate matrix marginal likelihood. If V = λ^(−1) Ip and X′X = Ip, then

    p(y | X) = λ^(p/2) Γ(a + n/2) b^a / [(1 + λ)^(p/2) (2π)^(n/2) Γ(a) {b + y′y/2 − y′XX′y/(2(1 + λ))}^(a+n/2)].

This expression does not require a matrix inversion, unlike the evaluation of the matrix Vn (8.10) required for the general case.
Solution 8.6 Zellner's g-prior. If V = g(X′X)^(−1), then

    p(y | X) = Γ(a + n/2) b^a / [(2π)^(n/2) Γ(a) (1 + g)^(p/2) {b + y′y/2 − g y′X(X′X)^(−1)X′y/(2(1 + g))}^(a+n/2)].

  
Solution 9.1 Dirichlet process marginal likelihood. Let x′ = (x′1, . . . , x′k) be the k ≤ n unique values which occur in x, and let nj = Σ_{i=1}^n 1_{x′j}(xi) be the number of occurrences of x′j. Also, for j = 1, . . . , k let Bj = {x′j} and Bk+1 = X \ ∪j Bj, so that P0(Bj) = p0(x′j) for j ≤ k. Then

    p(x | P) = ∏_{j=1}^k p(x′j)^(nj) = ∏_{j=1}^k P(Bj)^(nj)

and

    p(P(B1), . . . , P(Bk+1)) = {Γ(α) / ∏_{j=1}^{k+1} Γ(α P0(Bj))} ∏_{j=1}^{k+1} P(Bj)^(α P0(Bj) − 1)

    ⟹ p(x, P(B1), . . . , P(Bk+1)) = {Γ(α) / ∏_{j=1}^{k+1} Γ(α P0(Bj))} ∏_{j=1}^{k+1} P(Bj)^(α P0(Bj) + nj − 1),   (B.2)

with nk+1 := 0. Marginalising (B.2) with respect to (P(B1), . . . , P(Bk+1)),

    p(x) = ∫ p(x, P(B1), . . . , P(Bk+1)) d(P(B1), . . . , P(Bk+1))
         = {Γ(α) / Γ(α + n)} ∏_{j=1}^k [Γ{α p0(x′j) + nj} / Γ{α p0(x′j)}],

by comparison with the normalising constant of the corresponding Dirichlet distribution.
Solution 9.2 Dirichlet process sampling. The following function dirichlet_process_sample is one possible Python implementation.

For larger values of α, the sampled mass functions more closely resemble the base
geometric distribution.
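The listing is not reproduced here; a minimal sketch (my own construction: by the definition of the Dirichlet process, the masses assigned to any finite partition are jointly Dirichlet distributed) is:

```python
import numpy as np

def dirichlet_process_sample(alpha, p0, k_max, rng):
    """Sample the masses a DP(alpha, P0) random measure assigns to the
    partition {0}, {1}, ..., {k_max - 1}, [k_max, infinity).

    These masses are jointly
    Dirichlet(alpha p0(0), ..., alpha p0(k_max - 1), alpha P0([k_max, inf))).
    """
    base = p0(np.arange(k_max))
    params = alpha * np.append(base, 1.0 - base.sum())  # tail gets leftover mass
    return rng.dirichlet(params)

rng = np.random.default_rng(0)
geom = lambda k: 0.5 ** (k + 1)  # Geometric(1/2) base measure on {0, 1, ...}
masses = dirichlet_process_sample(alpha=10.0, p0=geom, k_max=20, rng=rng)
print(masses.sum())  # 1.0
```

Increasing alpha concentrates the sampled masses around the base probabilities, matching the behaviour described above.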
Solution 9.3 Binary partition index. For x ∈ R, the index of x at any level m is obtained by calculating the m-digit binary representation of F0(x). This is achieved by the following Python code:
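The code referred to did not survive extraction; one way to compute the m-digit binary expansion (the function name is my own) is:

```python
def binary_partition_index(u, m):
    """Level-m index of a point whose base-measure CDF value is u in (0, 1):
    the first m binary digits of u."""
    digits = []
    for _ in range(m):
        u *= 2
        bit = int(u)  # next binary digit of u
        digits.append(bit)
        u -= bit
    return digits

print(binary_partition_index(0.625, 3))  # [1, 0, 1]
```

For x ∈ R one passes u = F0(x), for example the standard normal CDF when F0 = Φ.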

Solution 9.4 Polya tree sampling. The following function polya_tree_sample is one possible Python implementation. For a user-chosen depth m, the code samples the bin probabilities P(Be) for each set Be ∈ πm, and then estimates the density at the mid point of each bin to be equal to the average density of the bin.

For larger values of α, the sampled densities more closely resemble the standard normal base measure density.
Solution 10.1 Gaussian process closure. For any x = (x1, . . . , xn), independently f(x) − m(x) ∼ Normal_n(0, K(x, x)) and m(x) ∼ Normal_n(m0(x), K0(x, x)), where K0(x, x) is the corresponding covariance matrix (10.3) for the kernel k0. Since the sum of two independent normal distributions is again normal,

    f(x) ∼ Normal_n(m0(x), K0(x, x) + K(x, x)),

and hence f ∼ GP(m0, k + k0).


Solution 10.2 Linear model as a Gaussian process.

    β ∼ Normal_p(0, σ²λ^(−1) Ip) ⟹ Xβ ∼ Normal_n(0, σ²λ^(−1) XX′).

It therefore follows that this regression function can be written as a Gaussian process f ∼ GP(0, k), where the covariance kernel k is the dot product

    k(x, x′) = σ²λ^(−1) x · x′.

Solution 10.3 Spline regression as a Gaussian process. It follows from Exercise 10.2 that f ∼ GP(0, v · b(· , ·)) where

    b(x, x′) = 1 + Σ_{j=1}^d (x x′)^j + Σ_{j=1}^m {(x − τj)+ (x′ − τj)+}^d.

Solution 10.4 Normal changepoint model as a Gaussian process. The changepoint model is equivalent to a zero-mean Gaussian process GP(0, v · b(· , ·)) where

    b(x, x′) = Σ_{j=0}^m 1_[τj, τj+1)(x) · 1_[τj, τj+1)(x′)

defines an indicator function determining whether x and x′ lie in the same τ-segment.
Solution 10.5 CART notation and partition.
(i) T = {(1, 2, a), (2, 1, b), (3, 3, c), (6, 1, d)}.
(ii) π = {B1 , . . . , B5 } where

B1 = (−∞, b] × (−∞, a] × R
B2 = (b, ∞) × (−∞, a] × R
B3 = (−∞, d] × (a, ∞) × (−∞, c]
B4 = (d, ∞) × (a, ∞) × (−∞, c]
B5 = R × (a, ∞) × (c, ∞).


Solution 11.1 Mixture of normals full conditionals. Let n_{j,−i} = Σ_{i′≠i} 1_{j}(z_{i′}) be the number of data points aside from xi currently allocated to cluster j. Similarly, let ẋ_{j,−i} = Σ_{i′≠i: z_{i′}=j} x_{i′} and ẍ_{j,−i} = Σ_{i′≠i: z_{i′}=j} x²_{i′}. Then for j = 1, . . . , m,

    p(zi = j | z−i, x) ∝ (αj + n_{j,−i}) · {Γ(a + (n_{j,−i} + 1)/2) / Γ(a + n_{j,−i}/2)} · {(λ + n_{j,−i}) / (λ + n_{j,−i} + 1)}^(1/2)
        × [b + {ẍ_{j,−i} − ẋ²_{j,−i}/(n_{j,−i} + λ)}/2]^(a + n_{j,−i}/2)
        / [b + {ẍ_{j,−i} + xi² − (ẋ_{j,−i} + xi)²/(n_{j,−i} + 1 + λ)}/2]^(a + (n_{j,−i} + 1)/2).

Solution 11.2 Gibbs sampling mixture of normals. The following code is one possible Python implementation. The hyperparameters are chosen to be α = 0.1, a = 0.1, b = 0.1, λ = 1.

Solution 11.3 Latent factor linear model.

(i) y | σ, X, Z ∼ Normal_n(0, σ²(X V X′ + Z U Z′ + In)).
(ii) y | X, Z ∼ St_n(2a, 0, b(X V X′ + Z U Z′ + In)/a).

Solution 11.4 Latent factor linear model code. From Exercise 11.3 with V = v Ip and U = u Iq,

    y | X, Z ∼ St_n(2a, 0, b(v X X′ + u Z Z′ + In)/a).

The following Stan code is one possible implementation.


Glossary

P probability
E expectation
V variance
:= definition
∝ proportional to
→ converges to
∼ distributed as
=⇒ implies
⇐⇒ equivalent to

⊥ independent
· dot product
R real numbers
N natural numbers, starting at zero
Ā set complement, {x | x ∉ A}
1 A (x) indicator, 1 if x ∈ A, 0 otherwise
In n × n identity matrix
B′ transpose of matrix B
|B| determinant of matrix B
||v|| Euclidean norm of vector v
|x| absolute value of real value x

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer 163
Nature Switzerland AG 2021
N. Heard, An Introduction to Bayesian Inference, Methods and Computation,
https://doi.org/10.1007/978-3-030-82808-0