Patterns of Scalable Bayesian Inference


Foundations and Trends® in Machine Learning

Vol. 9, No. 2-3 (2016) 119–247


© 2016 E. Angelino, M. J. Johnson, and R. P. Adams
DOI: 10.1561/2200000052

Patterns of Scalable Bayesian Inference

Elaine Angelino∗, UC Berkeley, [email protected]
Matthew James Johnson∗, Harvard University, [email protected]
Ryan P. Adams, Harvard University and Twitter, [email protected]

∗ Authors contributed equally.

Contents

1 Introduction 120
1.1 Why be Bayesian with big data? . . . . . . . . . . . . . . 121
1.2 The accuracy of approximate integration . . . . . . . . . . 123
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

2 Background 125
2.1 Exponential families . . . . . . . . . . . . . . . . . . . . . 125
2.2 Markov Chain Monte Carlo inference . . . . . . . . . . . . 130
2.2.1 Bias and variance of estimators . . . . . . . . . . . 131
2.2.2 Monte Carlo estimates from independent samples . 132
2.2.3 Markov chains . . . . . . . . . . . . . . . . . . . . 133
2.2.4 Markov chain Monte Carlo (MCMC) . . . . . . . . 135
2.2.5 Metropolis-Hastings (MH) sampling . . . . . . . . 140
2.2.6 Gibbs sampling . . . . . . . . . . . . . . . . . . . 142
2.3 Mean field variational inference . . . . . . . . . . . . . . . 143
2.4 Expectation propagation variational inference . . . . . . . 145
2.5 Stochastic gradient optimization . . . . . . . . . . . . . . 147

3 MCMC with data subsets 151


3.1 Factoring the joint density . . . . . . . . . . . . . . . . . 151
3.2 Adaptive subsampling for Metropolis–Hastings . . . . . . . 152


3.2.1 An approximate MH test based on a data subset . 153


3.2.2 Approximate MH with an adaptive stopping rule . 154
3.2.3 Using a t-statistic hypothesis test . . . . . . . . . . 155
3.2.4 Using concentration inequalities . . . . . . . . . . 157
3.2.5 Error bounds on the stationary distribution . . . . 161
3.3 Sub-selecting data via a lower bound on the likelihood . . 163
3.4 Stochastic gradients of the log joint density . . . . . . . . 165
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 170

4 Parallel and distributed MCMC 173


4.1 Parallelizing standard MCMC algorithms . . . . . . . . . . 174
4.1.1 Conditional independence and graph structure . . . 174
4.1.2 Speculative execution and prefetching . . . . . . . 176
4.2 Defining new parallel dynamics . . . . . . . . . . . . . . . 179
4.2.1 Aggregating from subposteriors . . . . . . . . . . . 181
Embarrassingly parallel consensus of subposteriors . 181
Weighted averaging of subposterior samples . . . . 184
Subposterior density estimation . . . . . . . . . . . 185
Weierstrass samplers . . . . . . . . . . . . . . . . 188
4.2.2 Hogwild Gibbs . . . . . . . . . . . . . . . . . . . . 193
Defining Hogwild Gibbs variants . . . . . . . . . . 194
Theoretical analysis . . . . . . . . . . . . . . . . . 196
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 202

5 Scaling variational algorithms 204


5.1 Stochastic optimization and mean field methods . . . . . . 205
5.1.1 SVI for complete-data conjugate models . . . . . . 206
5.1.2 Stochastic gradients with general nonconjugate models . . . . . 210
5.1.3 Exploiting reparameterization for some nonconjugate models . . 214
5.2 Streaming variational Bayes (SVB) . . . . . . . . . . . . . 216
5.3 Scalable expectation propagation . . . . . . . . . . . . . . 219
5.3.1 Parallel expectation propagation (PEP) . . . . . . 219

5.3.2 Stochastic expectation propagation (SEP) . . . . . 222


5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 225

6 Challenges and questions 228

Acknowledgements 235

References 236
Abstract

Datasets are growing not just in size but in complexity, creating a de-
mand for rich models and quantification of uncertainty. Bayesian meth-
ods are an excellent fit for this demand, but scaling Bayesian inference
is a challenge. In response to this challenge, there has been consider-
able recent work based on varying assumptions about model structure,
underlying computational resources, and the importance of asymptotic
correctness. As a result, there is a zoo of ideas with a wide range of
assumptions and applicability.
In this paper, we seek to identify unifying principles, patterns, and
intuitions for scaling Bayesian inference. We review existing work on
utilizing modern computing resources with both MCMC and varia-
tional approximation techniques. From this taxonomy of ideas, we char-
acterize the general principles that have proven successful for designing
scalable inference procedures and comment on the path forward.

E. Angelino, M. J. Johnson, and R. P. Adams. Patterns of Scalable Bayesian Inference. Foundations and Trends® in Machine Learning, vol. 9, no. 2-3, pp. 119–247, 2016. DOI: 10.1561/2200000052.
1
Introduction

We have entered a new era of scientific discovery, in which compu-


tational insights are being integrated with large-scale statistical data
analysis to enable researchers to ask both grander and more subtle ques-
tions about our natural world. This viewpoint asserts that we need not
be limited to the narrow hypotheses that can be framed by traditional
small-scale analysis techniques. Supporting new kinds of data-driven
queries, however, requires that new methods be developed for statisti-
cal inference that can scale up along multiple axes — more samples,
more dimensions, and greater model complexity — as well as scale out
by taking advantage of modern parallel compute environments.
There are a variety of methodological frameworks for statistical
inference; here we are concerned with the Bayesian formalism. In the
Bayesian setting, inference queries are framed as interrogations of a pos-
terior distribution over parameters, missing data, and other unknowns.
By treating these unobserved quantities as random variables and con-
ditioning on observed data, the Bayesian aims to make inferences and
quantify uncertainty in a way that can coherently incorporate new data
and other sources of information.


Coherently managing probabilistic uncertainty is central to


Bayesian analysis, and so the computations associated with most infer-
ence tasks — estimation, prediction, hypothesis testing — are typically
integrations. In some special situations it is possible to perform such
integrations exactly, for example by taking advantage of tractable prior
distributions and conjugacy in the prior-likelihood pair, or by using dy-
namic programming when the dependencies between random variables
are relatively simple. Unfortunately, many inference problems are not
amenable to these exact integration procedures, and so most of the
interest in Bayesian computation focuses on methods of approximate
inference.
There are two dominant paradigms for approximate inference in
Bayesian models: Monte Carlo sampling methods and variational ap-
proximations. The Monte Carlo approach observes that integrations
performed to query posterior distributions can be framed as expecta-
tions, and thus estimated with samples; such samples are most often
generated via simulation from carefully designed Markov chains. Vari-
ational inference instead seeks to compute these integrals by approx-
imating the posterior distribution with a more tractable alternative,
finding the best approximation with powerful optimization algorithms.
In this paper, we examine how these techniques can be scaled up to
larger problems and scaled out across parallel computational resources.
This is not an exhaustive survey of a rapidly-evolving area of research;
rather, we seek to identify the main ideas and themes that are emerging
in this area, and articulate what we believe are some of the significant
open questions and challenges.

1.1 Why be Bayesian with big data?

The Bayesian paradigm is fundamentally about integration: integra-


tion computes posterior estimates and measures of uncertainty, elimi-
nates nuisance variables or missing data, and averages models to com-
pute predictions or perform model comparison. While some statistical
methods, such as MAP estimation, can be described from a Bayesian
perspective, in which case the prior serves simply as a regularizer in an

optimization problem, such methods are not inherently or exclusively


Bayesian. Posterior integration is the distinguishing characteristic of
Bayesian statistics, and so a defense of Bayesian ideas in the big data
regime rests on the utility of integration.
The big data setting might seem to be precisely where integration
isn’t so important: as the dataset grows, shouldn’t the posterior dis-
tribution concentrate towards a point mass? If big data means we end
up making predictions using concentrated posteriors, why not focus on
point estimation and avoid the specification of priors and the burden of
approximate integration? These objections certainly apply to settings
where the number of parameters is small and fixed (“tall data”). How-
ever, many models of interest have many parameters (“wide data”), or
indeed have a number of parameters that grows along with the amount
of data.
For example, an Internet company making inferences about its
users’ viewing and buying habits may have terabytes of data in to-
tal but only a few observations for its newest customers, the ones most
important to impress with personalized recommendations. Moreover,
it may wish to adapt its model in an online way as data arrive, a task
that benefits from calibrated posterior uncertainties [Stern et al., 2009].
As another example, consider a healthcare company. As its dataset
grows, it might hope to make more detailed and complex inferences
about populations while also making careful predictions with calibrated
uncertainty for each patient, even in the presence of massive missing
data [Lawrence, 2015]. These scaling issues also arise in astronomy,
where hundreds of billions of light sources, such as stars, galaxies, and
quasars, each have latent variables that must be estimated from very
weak observations, and are coupled in a large hierarchical model [Regier
et al., 2015]. In Microsoft Bing’s sponsored search advertising, predic-
tive probabilities inform the pricing in the keyword auction mechanism.
This problem nevertheless must be solved at scale, with tens of millions
of impressions per hour [Graepel et al., 2010].
These are the regimes where big data can be small [Lawrence, 2015]
and the number and complexity of statistical hypotheses grows with

the data. The Bayesian inference methods we survey in this paper may
provide solutions to these challenges.

1.2 The accuracy of approximate integration

Bayesian inference may be important in some modern big data regimes,


but exact integration in general is computationally out of reach. While
decades of research in Bayesian inference in both statistics and ma-
chine learning have produced many powerful approximate inference
algorithms, the big data setting poses some new challenges. Iterative
algorithms that read the entire dataset before making each update
become prohibitively expensive. Sequential computation is at a sig-
nificant and growing disadvantage compared to computation that can
leverage parallel and distributed computing resources. Insisting on zero
asymptotic bias from Monte Carlo estimates of expectations may leave
us swamped in errors from high variance [Korattikara et al., 2014] or
transient bias.
These challenges, and the tradeoffs that may be necessary to address
them, can be viewed in terms of how accurate the integration in our
approximate inference algorithms must be. Markov chain Monte Carlo
(MCMC) algorithms that admit the exact posterior as a stationary dis-
tribution may be the gold standard for generically estimating posterior
expectations, but if standard MCMC algorithms become intractable
in the big data regime we must find alternatives and understand their
tradeoffs. Indeed, someone using Bayesian methods for machine learn-
ing may be less constrained than a classical Bayesian statistician: if the
ultimate goal is to form predictions that perform well according to a
specific loss function, computational gains at the expense of the inter-
nal posterior representation may be worthwhile. The methods studied
here cover a range of such approximate integration tradeoffs.

1.3 Outline

The remainder of this review is organized as five chapters. In Chap-


ter 2, we provide relevant background material on exponential fami-
lies, MCMC inference, mean field variational inference, and stochastic

gradient optimization. The next three chapters survey recent algorith-


mic ideas for scaling Bayesian inference, highlighting theoretical results
where possible. Each of these central technical chapters ends with a
summary and discussion, identifying emergent themes and patterns as
well as open questions. Chapters 3 and 4 focus on MCMC algorithms,
which are inherently serial and often slow to converge; the algorithms
in the first of these use various forms of data subsampling to scale
up serial MCMC and in the second use a diverse array of strategies to
scale out on parallel resources. In Chapter 5 we discuss two recent tech-
niques for scaling variational mean field algorithms. Both process data
in minibatches: the first applies stochastic gradient optimization meth-
ods and the second is based on incremental posterior updating. Finally,
in Chapter 6 we provide an overarching discussion of the ideas we sur-
vey, focusing on challenges and open questions in large-scale Bayesian
inference.
2
Background

In this chapter we summarize background material on which the ideas


in subsequent chapters are based. This chapter also serves to fix some
common notation. Throughout the chapter, we mostly avoid measure-
theoretic definitions and instead assume that any density exists with
respect to either Lebesgue measure or counting measure, depending on
its context.
First, we cover some relevant aspects of exponential families. Sec-
ond, we cover the foundations of Markov chain Monte Carlo (MCMC)
algorithms, which are the workhorses of Bayesian statistics and are
common in Bayesian machine learning. Indeed, the algorithms dis-
cussed in Chapters 3 and 4 either are MCMC algorithms or aim to
approximate MCMC algorithms. Next, we describe the basics of mean
field variational inference, expectation propagation, and stochastic gra-
dient optimization, which are used extensively in Chapter 5.

2.1 Exponential families

Exponential families of densities play a key role in Bayesian analysis


and many practical Bayesian methods. In particular, likelihoods that


are exponential families yield natural conjugate prior families, which


can provide analytical and computational advantages in both MCMC
and variational inference algorithms. These ideas are useful not only to
perform exact inference in some restricted settings, but also to build
approximate inference algorithms for more general models. Exponential
families are also particularly relevant in the context of large datasets:
in a precise sense, among all families of densities in which the sup-
port does not depend on the parameter, exponential families are the
only families that admit a finite-dimensional sufficient statistic. Thus
only exponential families allow arbitrarily large amounts of data to be
summarized with a fixed-size description.
In this section we give basic definitions, notation, and results con-
cerning exponential families. For additional perspectives from convex
analysis see Wainwright and Jordan [2008], and for perspectives from
differential geometry see Amari and Nagaoka [2007].
Throughout this section we take all densities to be absolutely con-
tinuous with respect to the appropriate Lebesgue measure (when the
underlying set X is Euclidean space) or counting measure (when X is
discrete), and denote the Borel σ-algebra of a set X as B(X ) (gener-
ated by Euclidean and discrete topologies, respectively). We assume
measurability of all functions as necessary.
Given a statistic function tx : X → Rn and a base measure νX , we
can define an exponential family of probability densities on X relative
to νX and indexed by natural parameter ηx ∈ Rn by
p(x | ηx) ∝ exp{⟨ηx, tx(x)⟩},   ∀ηx ∈ Rn,   (2.1)

where ⟨·, ·⟩ is the standard inner product on Rn. We also define the partition function as

Zx(ηx) ≜ ∫ exp{⟨ηx, tx(x)⟩} νX(dx)   (2.2)

and define H ⊆ Rn to be the set of all normalizable natural parameters,

H ≜ {η ∈ Rn : Zx(η) < ∞}.   (2.3)

We say that an exponential family is regular if H is open. We can write the normalized probability density as

p(x | η) = exp{⟨ηx, tx(x)⟩ − log Zx(ηx)}.   (2.4)

Finally, when we parameterize the family with some other coordinates


θ, we write the natural parameter as a continuous function ηx (θ) and
write the density as

p(x | θ) = exp{⟨ηx(θ), tx(x)⟩ − log Zx(ηx(θ))}   (2.5)

and take Θ = ηx−1 (H) to be the open set of parameters that correspond
to normalizable densities. We summarize this notation in the following
definition.

Definition 2.1 (Exponential family of densities). Given a measure space


(X , B(X ), νX ), a statistic function tx : X → Rn , and a natural param-
eter function ηx : Θ → Rn , the corresponding exponential family of
densities relative to νX is

p(x | θ) = exp{⟨ηx(θ), tx(x)⟩ − log Zx(ηx(θ))},   (2.6)

where
log Zx(ηx) ≜ log ∫ exp{⟨ηx, tx(x)⟩} νX(dx)   (2.7)

is the log partition function.

When we write exponential families of densities for different ran-


dom variables, we change the subscripts on the statistic function, natu-
ral parameter function, and log partition function to correspond to the
symbol used for the random variable. When the corresponding ran-
dom variable is clear from context, we drop the subscripts to simplify
notation.
The statistic tx (x) is sufficient in the sense of the Fisher-Neyman
Factorization [Keener, 2010, Theorem 3.6]. By construction,

p(x | θ) ∝ exp{⟨η(θ), tx(x)⟩},

and hence tx (x) contains all the information about x that is relevant
for the parameter θ. The Koopman-Pitman-Darmois Theorem shows
that among all families in which the support does not depend on the
parameter, exponential families are the only families which provide
this powerful summarization property, under some mild smoothness
conditions [Hipp, 1974].

Exponential families have many convenient analytical and compu-


tational properties. In particular, the next proposition shows that the
log partition function of an exponential family generates cumulants of
the statistic.

Proposition 2.1 (Gradients of log Z and expected statistics). The gra-


dient of the log partition function of an exponential family gives the
expected sufficient statistic,

∇ log Z(η) = Ep(x | η) [t(x)] , (2.8)

where the expectation is over the random variable x with density


p(x | η). More generally, the moment generating function of t(x) can
be written
Mt(x)(s) ≜ Ep(x | η)[ e^{⟨s, t(x)⟩} ] = e^{log Z(η+s) − log Z(η)}   (2.9)

and so derivatives of log Z give cumulants of t(x), where the first cu-
mulant is the mean and the second and third cumulants are the second
and third central moments, respectively.

For members of an exponential family, many quantities can be ex-


pressed generically in terms of the natural parameter, expected statis-
tics under that parameter, and the log partition function. In particular,
using a natural parameterization the Fisher information matrix of an
exponential family can be computed as the Hessian matrix of its log
partition function, as we summarize next.

Definition 2.2 (Score vector and Fisher information matrix). Given a


family of densities p(x | θ) indexed by a parameter θ, the score vector
v(x, θ) is the gradient of the log density with respect to the parameter,

v(x, θ) , ∇θ log p(x | θ), (2.10)

and the Fisher information matrix is the covariance of the score,


I(θ) ≜ Cov[v(x, θ)] = E[ v(x, θ) v(x, θ)ᵀ ],   (2.11)

where the expectation is taken over the random variable x with density
p(x | θ), and where we have used the identity E[v(x, θ)] = 0.

Proposition 2.2 (Score and Fisher information for exponential families).


Given an exponential family of densities p(x | η) indexed by the natural
parameter η, as in Eq. (2.4), the score with respect to the natural
parameter is given by
v(x, η) = ∇η log p(x | η) = t(x) − ∇ log Z(η) (2.12)
and the Fisher information matrix is given by
I(η) = ∇² log Z(η).   (2.13)
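As a concrete check of Propositions 2.1 and 2.2 (an illustration added here, not part of the original text), the Python sketch below evaluates these identities numerically for the Bernoulli family in its natural parameterization, where t(x) = x and log Z(η) = log(1 + e^η); the finite-difference step and the choice of family are our own assumptions.

```python
import numpy as np

# Bernoulli family in natural parameterization: p(x | eta) = exp(eta * x - log Z(eta)),
# with statistic t(x) = x and log partition function log Z(eta) = log(1 + exp(eta)).
def log_Z(eta):
    return np.log1p(np.exp(eta))

def expected_statistic(eta):
    # E[t(x)] = sigmoid(eta), the mean of the Bernoulli distribution.
    return 1.0 / (1.0 + np.exp(-eta))

eta, eps = 0.7, 1e-4

# Eq. (2.8): the gradient of log Z equals the expected sufficient statistic.
grad_log_Z = (log_Z(eta + eps) - log_Z(eta - eps)) / (2 * eps)
print(grad_log_Z, expected_statistic(eta))           # nearly equal

# Eq. (2.13): the Hessian of log Z equals the Fisher information, which for a
# scalar statistic is Var[t(x)] = mu * (1 - mu).
hess_log_Z = (log_Z(eta + eps) - 2 * log_Z(eta) + log_Z(eta - eps)) / eps**2
mu = expected_statistic(eta)
print(hess_log_Z, mu * (1 - mu))                     # nearly equal
```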
Finally, we summarize conjugacy properties of exponential families
that are particularly useful for Bayesian inference. Given an exponential
family of densities on X as in Definition 2.1, we can define a related
exponential family of densities on Θ by defining a statistic function
tθ (θ) in terms of the functions ηx (θ) and log Zx (ηx (θ)).
Definition 2.3 (Natural exponential family conjugate prior). Given the
exponential family p(x | θ) of Definition 2.1, define the statistic function
tθ : Θ → Rn+1 as the concatenation
tθ (θ) , (ηx (θ), − log Zx (ηx (θ))) , (2.14)
where the first n coordinates of tθ (θ) are given by ηx (θ) and the last
coordinate is given by − log Zx (ηx (θ)). We call the exponential family
with statistic tθ (θ) the natural exponential family conjugate prior to
the density p(x | θ) and write the density as
p(θ) = exp{⟨ηθ, tθ(θ)⟩ − log Zθ(ηθ)},   (2.15)
where ηθ ∈ Rn+1 and the density is taken relative to some measure νΘ
on (Θ, B(Θ)).
Notice that using tθ (θ) we can rewrite the original density p(x | θ),
p(x | θ) = exp{⟨ηx(θ), tx(x)⟩ − log Zx(ηx(θ))}   (2.16)
         = exp{⟨tθ(θ), (tx(x), 1)⟩}.   (2.17)
This relationship is useful in Bayesian inference: when the exponential
family p(x | θ) is a likelihood function and the family p(θ) is used as a
prior, the pair enjoy a convenient conjugacy property, as summarized
in the next proposition.

Proposition 2.3 (Conjugacy). Let the densities p(x | θ) and p(θ) be de-
fined as in Definitions 2.1 and 2.3, respectively. We have the relations

p(θ, x) = exp{⟨ηθ + (tx(x), 1), tθ(θ)⟩ − log Zθ(ηθ)}   (2.18)
p(θ | x) = exp{⟨ηθ + (tx(x), 1), tθ(θ)⟩ − log Zθ(ηθ + (tx(x), 1))}   (2.19)

and hence in particular the posterior p(θ | x) is in the same exponential


family as p(θ) with the natural parameter ηθ +(tx (x), 1). Similarly, with
multiple likelihood terms p(xi | θ) for i = 1, 2, . . . , N we have

p(θ) ∏_{i=1}^N p(xi | θ) = exp{⟨ηθ + ∑_{i=1}^N (tx(xi), 1), tθ(θ)⟩ − log Zθ(ηθ)}.   (2.20)

Conjugate pairs are particularly useful in Bayesian analysis because


as we observe data the posterior remains in the same family as the
prior, with a parameter that is easy to compute in terms of sufficient
statistics. In particular, if inference in the prior family is tractable, then
inference in the posterior is also tractable.
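To make Proposition 2.3 and this conjugacy property concrete, the following sketch (our own illustration; the Beta–Bernoulli pair, hyperparameters, and synthetic data are assumed, not taken from the text) performs the posterior update purely in terms of fixed-size sufficient statistics, regardless of how large N is.

```python
import numpy as np

# Beta(a, b) prior on the success probability of Bernoulli observations.
# In the language of Proposition 2.3, the prior parameter is updated by adding
# the summed sufficient statistics (sum_n t(x_n), N).
a0, b0 = 2.0, 2.0                       # prior hyperparameters (assumed values)

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100_000)  # a large synthetic dataset

# Fixed-size sufficient statistics: (number of successes, number of trials).
stats = (x.sum(), x.size)

# Conjugate posterior: Beta(a0 + sum_n x_n, b0 + N - sum_n x_n).
a_post = a0 + stats[0]
b_post = b0 + stats[1] - stats[0]
print("posterior mean:", a_post / (a_post + b_post))  # close to the true 0.3
```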

2.2 Markov Chain Monte Carlo inference

Markov chain Monte Carlo (MCMC) is a class of algorithms for es-


timating expectations with respect to intractable probability distribu-
tions, such as most posterior distributions arising in Bayesian inference.
Given a target distribution, a standard MCMC algorithm proceeds by
simulating an ergodic random walk that admits the target distribution
as its stationary distribution. As we develop in the following subsec-
tions, by collecting samples from the simulated trajectory and forming
Monte Carlo estimates, expectations of many functions can be approx-
imated to arbitrary accuracy. Thus MCMC is employed when samples
or expectations from a distribution cannot be obtained directly, as is
often the case with complex, high-dimensional systems arising across
disciplines.
In this section, we first review the two underlying ideas behind
MCMC algorithms: Monte Carlo methods and Markov chains. First we

define the bias and variance of estimators. Next, we introduce Monte


Carlo estimators based on independent and identically distributed sam-
ples. We then describe how Monte Carlo estimates can be formed using
mutually dependent samples generated by a Markov chain simulation.
Finally, we introduce two general MCMC algorithms commonly applied
to Bayesian posterior inference, the Metropolis-Hastings and Gibbs
sampling algorithms. Our exposition here mostly follows the standard
treatment, such as in Brooks et al. [2011, Chapter 1], Geyer [1992], and
Robert and Casella [2004].

2.2.1 Bias and variance of estimators


Notions of bias and variance are fundamental to understanding and
comparing estimator performance, and much of our discussion of
MCMC methods is framed in these terms.
Consider using a scalar-valued random variable θ̂ to estimate a fixed
scalar quantity of interest θ. The bias and variance of the estimator θ̂
are defined as

Bias[θ̂] = E[θ̂] − θ   (2.21)
Var[θ̂] = E[(θ̂ − E[θ̂])²].   (2.22)

The mean squared error E[(θ̂ − θ)²] can be decomposed in terms of the variance and the square of the bias:

E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]   (2.23)
            = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²   (2.24)
            = Var[θ̂] + Bias[θ̂]².   (2.25)

This decomposition provides a basic language for evaluating estima-


tors and thinking about tradeoffs. Among unbiased estimators, those
with lower variance are generally preferable. However, when an unbi-
ased estimator has high variance, a biased estimator that achieves low
variance can have a lower overall mean squared error.
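This tradeoff is easy to see in simulation. The sketch below (an added illustration with arbitrarily chosen sample size and shrinkage factor) compares the unbiased sample mean with a deliberately biased, shrunk estimator and checks the decomposition (2.25) empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, trials = 1.0, 3.0, 10, 200_000

# Draw many datasets; compute an unbiased estimator and a shrunk (biased) one.
data = rng.normal(theta, sigma, size=(trials, n))
unbiased = data.mean(axis=1)          # sample mean: zero bias, higher variance
shrunk = 0.7 * unbiased               # shrinkage toward 0: biased, lower variance

for name, est in [("unbiased", unbiased), ("shrunk", shrunk)]:
    bias = est.mean() - theta
    var = est.var()
    mse = ((est - theta) ** 2).mean()
    # Up to simulation noise, mse equals var + bias**2, as in Eq. (2.25);
    # the shrunk estimator attains the lower mean squared error here.
    print(name, bias, var, mse, var + bias ** 2)
```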
As we describe in the following sections, a substantial amount of
the study of Bayesian statistical computation has focused on algorithms

that produce asymptotically unbiased estimates of posterior expecta-


tions, in which the bias due to initialization is transient and is washed
out relatively quickly. In this setting, the error is typically considered
to be dominated by the variance term, which can be made as small
as desired by increasing computation time without bound. When com-
putation becomes expensive as in the big data setting, errors under a
realistic computational budget may in fact be dominated by variance,
as observed by Korattikara et al. [2014], or, as we argue in Chapter 6,
transient bias. Several of the new algorithms we examine in Chapters 3
and 4 aim to adjust this tradeoff by allowing some asymptotic bias
while effectively reducing the variance and transient bias contributions
through more efficient computation.

2.2.2 Monte Carlo estimates from independent samples


Let X be a random variable with finite expectation E[X] = µ, and let
(Xi : i ∈ N) be a sequence of i.i.d. random variables each with the same
distribution as X. The Strong Law of Large Numbers (LLN) states
that the sample average converges almost surely to the expectation µ
as n → ∞:
P( lim_{n→∞} (1/n) ∑_{i=1}^n Xi = µ ) = 1.   (2.26)
This convergence immediately suggests the Monte Carlo method: to
approximate the expectation of X, which to compute exactly may in-
volve an intractable integral, one can use i.i.d. samples and compute a
sample average. In addition, because for any measurable function f the
sequence (f (Xi ) : i ∈ N) is also a sequence of i.i.d. random variables,
we can form the Monte Carlo estimate
E[f(X)] ≈ (1/n) ∑_{i=1}^n f(Xi).   (2.27)

Monte Carlo estimates of this form are unbiased by construction,


and so the quality of a Monte Carlo estimate can be evaluated in terms
of its variance as a function of the number of samples n, which in
turn can be understood with the Central Limit Theorem (CLT), at
least in the asymptotic regime. Indeed, the CLT provides not only

the asymptotic scaling of the error but also describes the asymptotic distribution of those errors. If X is real-valued and has finite variance E[(X − µ)²] = σ² < ∞, then the CLT states that the deviation (1/n) ∑_{i=1}^n Xi − µ, rescaled appropriately, converges in distribution and is asymptotically normal:

lim_{n→∞} P( (1/√n) ∑_{i=1}^n (Xi − µ) < α ) = P(Z < α)   (2.28)

where Z ∼ N(0, σ²). In particular, as n grows, the standard deviation of (1/n) ∑_{i=1}^n Xi − µ converges to zero at an asymptotic rate proportional to 1/√n. More generally, for any real-valued measurable function f, the Monte Carlo standard error (MCSE) in the estimate (2.27) asymptotically scales as 1/√n.
Monte Carlo estimators effectively reduce the problem of computing
expectations to the problem of generating samples. However, the pre-
ceding statements require the samples used in the Monte Carlo estimate
to be independent, and independent samples can be computationally
difficult to generate. Instead of relying on independent samples, Markov
chain Monte Carlo algorithms compute estimates using mutually de-
pendent samples generated by simulating a Markov chain.
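As a minimal numerical illustration of the estimate (2.27) and its 1/√n error scaling (our own example; the choice f(x) = x² with X standard normal, whose exact expectation is 1, is an assumption), consider the following sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(n):
    # Monte Carlo estimate of E[f(X)] with f(x) = x**2 and X ~ N(0, 1);
    # the exact value is 1.
    x = rng.standard_normal(n)
    return np.mean(x ** 2)

# The absolute error of the estimate shrinks roughly like 1 / sqrt(n).
for n in [100, 10_000, 1_000_000]:
    est = mc_estimate(n)
    print(n, est, abs(est - 1.0))
```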

2.2.3 Markov chains


Let X be a discrete or continuous state space and let x, x′ ∈ X denote states. A time-homogeneous Markov chain is a discrete-time stochastic process (Xt : t ∈ N) governed by a transition operator T(x → x′) that specifies the probability density of transitioning to a state x′ from a given state x:

P(Xt+1 ∈ A | Xt = x) = ∫_A T(x → x′) dx′,   ∀t ∈ N,   (2.29)

for all measurable sets A. A Markov chain is memoryless in the sense


that given the current state its future behavior is independent of its
past history.
Given an initial density π0 (x) for X0 , a Markov chain evolves this
density from one time point to the next through iterative application

of the transition operator. We write the application of the transition


operator to a density π0 to yield a new density π1 as
π1(x′) = (T π0)(x′) = ∫_X T(x → x′) π0(x) dx.   (2.30)

Writing T^t to denote t repeated applications of the transition operator T, the density of Xt induced by π0 and T is then given by πt = T^t π0.
Markov chain simulation follows this iterative definition by itera-
tively sampling the next state using the current state and the transition
operator. That is, after first sampling X0 from π0 ( · ), Markov chain
simulation proceeds at time step t by sampling Xt+1 according to the
density T (xt → · ) induced by the fixed sample xt .
We are interested in Markov chains that converge in total variation
to a unique stationary density π(x) in the sense that

lim_{t→∞} ‖πt − π‖TV = 0   (2.31)

for any initial distribution π0, where ‖ · ‖TV denotes the total variation norm on densities:

‖p − q‖TV = (1/2) ∫_X |p(x) − q(x)| dx.   (2.32)
The total variation distance is also useful as an error metric for the
approximate MCMC we discuss in the sequel. For a transition operator T(x → x′) to admit π(x) as a stationary density, its application
must leave π(x) invariant:

π = T π. (2.33)

For a discussion of general conditions that guarantee a Markov chain


converges to a unique stationary distribution, i.e., that the chain is
ergodic, see Meyn and Tweedie [2009].
In some cases it is easy to show that a transition operator admits a particular stationary distribution. In particular, it is clear that π is a stationary distribution when a transition operator T(x → x′) is reversible with respect to π, i.e., it satisfies the detailed balance (reversibility) condition with respect to a density π(x),

T(x → x′) π(x) = T(x′ → x) π(x′)   ∀x, x′ ∈ X,   (2.34)



which is a pointwise condition over X × X. Integrating over x on both sides gives

∫_X T(x → x′) π(x) dx = ∫_X T(x′ → x) π(x′) dx
                      = π(x′) ∫_X T(x′ → x) dx
                      = π(x′),

which is precisely the required condition from (2.33). We can inter-


pret (2.34) as stating that, for a reversible Markov chain starting from
its stationary distribution, any transition x → x0 is equilibrated by the
corresponding reverse transition x0 → x. Many MCMC methods are
based on deriving reversible transition operators.
For a thorough introduction to Markov chains, see Robert and
Casella [2004, Chapter 6] and Meyn and Tweedie [2009].
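For intuition, the following sketch (an added finite-state example; the three-state target distribution and the Metropolis-style construction of the kernel are our own choices) builds a small reversible transition matrix, checks the detailed balance condition (2.34) entrywise, and verifies convergence to π in total variation as in (2.31)–(2.32).

```python
import numpy as np

# Target stationary distribution on three states and a reversible transition
# matrix built by Metropolis-style acceptance from a uniform proposal.
pi = np.array([0.2, 0.3, 0.5])
n = len(pi)
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: pi_i * T_ij == pi_j * T_ji for all pairs (i, j).
print(np.allclose(pi[:, None] * T, (pi[:, None] * T).T))

# Convergence in total variation from a point mass at state 0.
rho = np.array([1.0, 0.0, 0.0])
for t in range(50):
    rho = rho @ T
print(0.5 * np.abs(rho - pi).sum())   # close to zero
```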

2.2.4 Markov chain Monte Carlo (MCMC)


Markov chain Monte Carlo (MCMC) methods simulate a Markov chain
for which the stationary distribution is equal to a target distribution
of interest, and use the simulated samples to form Monte Carlo es-
timates of expectations. That is, consider simulating a Markov chain
with unique stationary density π(x), as in Section 2.2.3, and collecting
its trajectory into a set of samples {Xi }ni=1 . These collected samples can
be used to form a Monte Carlo estimate for a function f of a random
variable X with density π(x) via
E[f(X)] = ∫_X f(x) π(x) dx ≈ (1/n) ∑_{i=1}^n f(Xi).   (2.35)

Even though this Markov chain Monte Carlo estimate is not con-
structed from independent samples, under some mild conditions it can
asymptotically satisfy analogs of the Law of Large Numbers (LLN) and
Central Limit Theorem (CLT) that were used to justify ordinary Monte
Carlo methods in Section 2.2.2. We sketch these important results here.
The MCMC analog of the LLN states that for a chain satisfying
basic recurrence conditions and admitting an invariant distribution π,

for all functions f that are absolutely integrable with respect to π, i.e. all f : X → R that satisfy ∫_X |f(x)| π(x) dx < ∞, we have

lim_{n→∞} (1/n) ∑_{i=1}^n f(Xi) = ∫_X f(x) π(x) dx   (a.s.),   (2.36)

for any initial distribution π0 . This result is the basic motivation to col-
lect samples from a Markov chain trajectory and use those samples to
compute estimates of expectations with respect to the invariant distri-
bution π. For a more detailed statement, see Meyn and Tweedie [2009,
Section 17.1].
Given that Markov chain Monte Carlo estimators can satisfy a law
of large numbers, we’d also like to understand the distribution of es-
timator errors and how the error distribution changes as we collect
more samples. To quantify these errors, the analog of the CLT must
take into account both the Markov dependency structure among the
samples used in the Monte Carlo estimate and also the initial state in
which the chain was started. However, under some additional condi-
tions on both the Markov chain’s convergence rate and the function
f , the sample average for any initial distribution π0 is asymptotically
normal in distribution (with appropriate scaling):
lim_{n→∞} P( (1/√n) ∑_{i=1}^n (f(Xi) − µ) < α ) = P(Z < α),   (2.37)

Z ∼ N(0, σ²),   (2.38)

σ² = Varπ[f(X0)] + 2 ∑_{t=1}^∞ Covπ[f(X0), f(Xt)],   (2.39)

where µ = ∫_X f(x) π(x) dx and where Varπ and Covπ denote the variance and covariance with respect to the stationary distribution π. Thus the standard error in the MCMC estimate also scales asymptotically as 1/√n,
with a constant that depends on the autocovariance function of the sta-
tionary version of the chain. See Meyn and Tweedie [2009, Chapter 17]
and Robert and Casella [2004, Section 6.7] for precise statements of
both the LLN and CLT for Markov chain Monte Carlo estimates and
for conditions on the Markov chain which guarantee that these theo-
rems hold.

These results show that the asymptotic behavior of MCMC esti-


mates of the form (2.35) is generally comparable to that of ordinary
Monte Carlo estimates as discussed in Section 2.2.2. However, in the
non-asymptotic regime, MCMC estimates differ from ordinary Monte
Carlo estimates in an important respect: there is a transient bias due to
initializing the Markov chain out of stationarity. That is, the initial dis-
tribution π0 from which the first iterate is sampled is generally not the
chain’s stationary distribution π, since if it were then ordinary Monte
Carlo could be performed directly. While the marginal distribution of
each Markov chain iterate converges to the stationary distribution, the
effects of initialization on the initial iterates of the chain contribute an
error term to Eq. (2.35) in the form of a transient bias.
This transient bias does not factor into the asymptotic behavior
described by the MCMC analogs of the LLN and the CLT; asymptoti-
cally, it decreases at a rate of at least O( n1 ) and is hence dominated by
the Monte Carlo standard error which decreases only at rate O( √1n ).
However, its effects can be significant in practice, especially in machine
learning. Whenever a sampled chain seems “unmixed” because its iter-
ates are too dependent on the initialization, errors in MCMC estimates
are dominated by this transient bias.
The simulation in Figure 2.1 illustrates these error terms in MCMC
estimates and how they can behave as more Markov chain samples are
collected. The LLN and CLT for MCMC describe the regime on the far
right of the plot: the total error can be driven arbitrarily small because
the MCMC estimates are asymptotically unbiased, and the total error is
asymptotically dominated by the Monte Carlo standard error. However,
before reaching the asymptotic regime, the error is often dominated by
the transient initialization bias. Several of the new methods we survey
can be understood as attempts to alter the traditional MCMC tradeoffs,
as we discuss further in Chapter 6.
Transient bias can be traded off against Monte Carlo standard error
by choosing different subsets of Markov chain samples in the MCMC
estimator. As an extreme choice, instead of using the MCMC estima-
tor (2.35) with the full set of Markov chain samples {Xi }ni=1 , transient
bias can be minimized by forming estimates using only the last Markov

[Figure 2.1 appears here: two stacked panels sharing a horizontal axis of iteration n (log scale); the top panel plots ‖π0 T^n − π‖TV and the bottom panel plots estimator error (log scale), with curves for transient bias, standard error, and total error.]

Figure 2.1: A simulation illustrating error terms in the MCMC estimator (2.35) as a function of the number of Markov chain iterations (log scale). The marginal distributions of the Markov chain iterates converge to the target distribution (top panel), while the errors in MCMC estimates due to transient bias and Monte Carlo standard error are eventually driven arbitrarily small at rates of O(1/n) and O(1/√n), respectively (bottom panel). The horizontal axis is shared between the two panels.

chain sample:
E[f (X)] ≈ f (Xn ). (2.40)
However, this choice of MCMC estimator maximizes the Monte Carlo
standard error, which asymptotically cannot be decreased below the
posterior variance of the estimand. A practical choice is to form MCMC
estimates using the last ⌈n/2⌉ Monte Carlo samples, discarding the other samples as warm-up samples, resulting in an estimator

E[f(X)] ≈ (1/⌈n/2⌉) ∑_{i=⌊n/2⌋}^{n} f(Xi).   (2.41)

[Figure 2.2 appears here: estimator error (log scale) versus iteration n (log scale), with curves for transient bias, standard error, and total error.]

Figure 2.2: A simulation illustrating error terms in the MCMC estimator (2.41) as a function of the number of Markov chain iterations (log scale). Because the first half of the Markov chain samples are not used in the estimate, the error due to transient bias is reduced much more quickly than in Figure 2.1 at the cost of shifting up the standard error curve.

With this choice, once the marginal distribution of the Markov chain
iterates approaches the stationary distribution the error due to tran-
sient bias is reduced at up to exponential rates. See Figure 2.2 for an
illustration. With any choice of MCMC estimator, transient bias can
be asymptotically decreased at least as fast as O(1/n), and potentially
much faster, while MCSE can decrease only as fast as O(1/√n).
Using these ideas, MCMC algorithms provide a general means for
estimating posterior expectations of interest: first construct an algo-
rithm to simulate an ergodic Markov chain that admits the intended
posterior density as its stationary distribution, and then simply run
the simulation, collect samples, and form Monte Carlo estimates from
the samples. The task then is to design an algorithm to simulate from

Algorithm 1 Metropolis-Hastings for posterior sampling

Input: Initial state θ0, number of iterations T, joint density p(θ, x), proposal density q(θ′ | θ)
Output: Samples θ1, . . . , θT
for t in 0, . . . , T − 1 do
    θ′ ∼ q(θ′ | θt)    ▷ Generate proposal
    α(θt, θ′) ← min{1, [p(θ′, x) q(θt | θ′)] / [p(θt, x) q(θ′ | θt)]}    ▷ Acceptance probability
    u ∼ Unif(0, 1)    ▷ Set stochastic threshold
    if α(θt, θ′) > u then
        θt+1 ← θ′    ▷ Accept proposal
    else
        θt+1 ← θt    ▷ Reject proposal

such a Markov chain with the intended stationary distribution. In the


following sections, we briefly review two canonical procedures for con-
structing such algorithms: Metropolis-Hastings and Gibbs sampling.
For a thorough treatment, see Robert and Casella [2004] and Brooks
et al. [2011, Chapter 1].

2.2.5 Metropolis-Hastings (MH) sampling


In the context of Bayesian posterior inference, the Metropolis-Hastings (MH) algorithm simulates a reversible Markov chain over a state space Θ that admits the posterior density p(θ | x) as its stationary distribution. The algorithm depends on a user-specified proposal density, q(θ′ | θ), which can be evaluated numerically and sampled from efficiently, and also requires that the joint density p(θ, x) can be evaluated (up to proportionality). The MH algorithm then generates a sequence of states θ1, . . . , θT ∈ Θ according to Algorithm 1.
In each iteration, a proposal for the next state θ′ is drawn from the proposal distribution, conditioned on the current state θ. The proposal is stochastically accepted with probability given by the acceptance probability,

α(θ, θ′) = min{1, [p(θ′, x) q(θ | θ′)] / [p(θ, x) q(θ′ | θ)]},   (2.42)

via comparison to a random variate u drawn uniformly from the interval [0, 1]. If u < α(θ, θ′), then the next state is set to the proposal; otherwise, the proposal is rejected and the next state is set to the current state. MH is a generalization of the Metropolis algorithm [Metropolis et al., 1953], which requires the proposal distribution to be symmetric, i.e., q(θ′ | θ) = q(θ | θ′), in which case the acceptance probability is simply

α(θ, θ′) = min{1, p(θ′, x) / p(θ, x)}.   (2.43)

Hastings [1970] later relaxed this by showing that the proposal distribution could be arbitrary.
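The following Python sketch (a minimal illustration of Algorithm 1 added here; the conjugate normal model, the random-walk proposal, and the step size are assumed choices) makes the accept/reject mechanics concrete. Only the unnormalized log joint density is required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: theta ~ N(0, 1) prior, observations x_n ~ N(theta, 1).
x = rng.normal(1.5, 1.0, size=50)

def log_joint(theta):
    # log p(theta, x) up to an additive constant: log prior + log likelihood.
    return -0.5 * theta ** 2 - 0.5 * np.sum((x - theta) ** 2)

def metropolis_hastings(T=5000, step=0.5, theta0=0.0):
    samples = np.empty(T)
    theta = theta0
    for t in range(T):
        prop = theta + step * rng.standard_normal()   # symmetric random-walk proposal
        # With a symmetric proposal, the acceptance ratio (2.43) uses only the joint.
        log_alpha = log_joint(prop) - log_joint(theta)
        if np.log(rng.uniform()) < log_alpha:
            theta = prop                              # accept proposal
        samples[t] = theta                            # otherwise keep current state
    return samples

samples = metropolis_hastings()
# Discard the first half as warm-up, as in Eq. (2.41).
print(samples[len(samples) // 2:].mean())             # near the posterior mean
```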
One can show that the stationary distribution is indeed p(θ | x)
by showing that the MH transition operator satisfies detailed bal-
ance (2.34). The MH transition operator density is a two-component
mixture corresponding to the ‘accept’ event and the ‘reject’ event:

T(θ → θ′) = α(θ, θ′) q(θ′ | θ) + (1 − β(θ)) δθ(θ′)   (2.44)

β(θ) = ∫_Θ α(θ, θ′) q(θ′ | θ) dθ′.   (2.45)

To show detailed balance, it suffices to show the two balance conditions

α(θ, θ′) q(θ′ | θ) p(θ | x) = α(θ′, θ) q(θ | θ′) p(θ′ | x)   (2.46)
(1 − β(θ)) δθ(θ′) p(θ | x) = (1 − β(θ′)) δθ′(θ) p(θ′ | x).   (2.47)

To show (2.46) we write

α(θ, θ′) q(θ′ | θ) p(θ | x)
  = min{1, [p(θ′, x) q(θ | θ′)] / [p(θ, x) q(θ′ | θ)]} q(θ′ | θ) p(θ | x)
  = min{q(θ′ | θ) p(θ | x), p(θ′, x) q(θ | θ′) p(θ | x) / p(θ, x)}
  = min{q(θ′ | θ) p(θ | x), p(θ′ | x) q(θ | θ′)}
  = min{q(θ′ | θ) p(θ, x) p(θ′ | x) / p(θ′, x), p(θ′ | x) q(θ | θ′)}
  = min{1, [p(θ, x) q(θ′ | θ)] / [p(θ′, x) q(θ | θ′)]} q(θ | θ′) p(θ′ | x)
  = α(θ′, θ) q(θ | θ′) p(θ′ | x).   (2.48)

Algorithm 2 Gibbs sampling

Input: A collection of random variables X = {Xi : i ∈ [n]} and subroutines to sample Xi | X¬i for each i ∈ [n]
Output: Samples {x̂(t)}
Initialize x = (x1, x2, . . . , xn)
for t = 1, 2, . . . do
    for i = 1, 2, . . . , n do
        xi ← sample Xi | X¬i = x¬i
    x̂(t) ← (x1, x2, . . . , xn)

To show (2.47), note that if θ ≠ θ′ then both sides are zero, and if
θ = θ′ then both sides are trivially equal.
See Robert and Casella [2004, Section 7.3] for a more detailed treat-
ment of the Metropolis-Hastings algorithm.

2.2.6 Gibbs sampling

Given a collection of n random variables X = {Xi : i ∈ [n]}, the Gibbs


sampling algorithm iteratively samples each variable Xi conditioned on
the sampled values of the others, X¬i ≜ X \ {Xi}. The algorithm is
summarized in Algorithm 2. When the random variables have a non-
trivial probabilistic graphical model, the conditioning can be reduced to
each variable’s respective Markov blanket [Koller and Friedman, 2009,
Section 12.3.1].
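As a small worked instance of Algorithm 2 (our own example, not from the text), the sketch below runs a systematic-scan Gibbs sampler on a bivariate Gaussian with correlation ρ, for which each full conditional is a one-dimensional Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # assumed correlation of the target bivariate Gaussian
T = 20_000

x1, x2 = 0.0, 0.0
samples = np.empty((T, 2))
for t in range(T):
    # Full conditionals of a standard bivariate Gaussian with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples[t] = (x1, x2)

# The empirical correlation of the second half of the samples is close to rho.
print(np.corrcoef(samples[T // 2:].T)[0, 1])
```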
A variant of the systematic scan of Algorithm 2, in which nodes
are traversed in a fixed order for each outer iteration, is the random
scan, in which nodes are traversed according to a random permutation
sampled for each outer iteration. An advantage of the random scan
(and other variants) is that the chain becomes reversible and therefore
simpler to analyze [Robert and Casella, 2004, Section 10.1.2].
The Gibbs sampling algorithm can be analyzed as a special case
of the Metropolis-Hastings algorithm, where the proposal distribution
is based on the conditional distributions and the acceptance probabil-
ity is always one. If the Markov chain produced by a Gibbs sampling
algorithm is ergodic, then the stationary distribution is the target dis-

tribution of X [Robert and Casella, 2004, Theorem 10.6]. The Markov


chain for a Gibbs sampler can fail to be ergodic if, for example, the
support of the target distribution is disconnected [Robert and Casella,
2004, Example 10.7]. A sufficient condition for Gibbs sampling to be er-
godic is that all conditional densities exist and are positive everywhere
[Robert and Casella, 2004, Theorem 10.8].
For a more detailed treatment of Gibbs sampling theory, see Robert
and Casella [2004, Chapters 6 and 10].

2.3 Mean field variational inference

In mean field, and variational inference more generally, the task is to


approximate an intractable distribution, such as a complex posterior,
with a distribution from a tractable family. This tractable approxi-
mating distribution can then be used as a proxy for the intractable
distribution, and we can estimate expectations of interest with respect
to the approximating distribution. In this section we define the mean
field optimization problem, which we revisit in Chapter 5.
To set up posterior inference as an optimization problem, we first
define an objective function that measures the accuracy of an approxi-
mating distribution. The mean field variational inference approach de-
fines an objective function using the following variational inequality.

Proposition 2.4 (Mean field variational inequality). For a probability


density p with respect to a base measure ν of the form
p(x) = (1/Z) p̄(x)   with   Z ≜ ∫ p̄(x) ν(dx),   (2.49)

where p̄ is the unnormalized density, for all densities q with respect to ν we have

log Z = L[q] + KL(q ‖ p) ≥ L[q],   (2.50)

where

L[q] ≜ Eq(x)[log (p̄(x)/q(x))] = Eq(x)[log p̄(x)] + H[q],   (2.51)

KL(q ‖ p) ≜ Eq(x)[log (q(x)/p(x))],   H[q] ≜ −Eq(x)[log q(x)].   (2.52)

Proof. To show the equality, with x ∼ q we write

L[q] + KL(q ‖ p) = Eq(x)[log (p̄(x)/q(x))] + Eq(x)[log (q(x)/p(x))]   (2.53)
                 = Eq(x)[log (p̄(x)/p(x))]   (2.54)
                 = log Z.   (2.55)

The inequality follows from the property KL(q ‖ p) ≥ 0, known as Gibbs's inequality, which follows from Jensen's inequality and the fact that the logarithm is concave:

−KL(q ‖ p) = Eq(x)[log (p(x)/q(x))] ≤ log ∫ q(x) (p(x)/q(x)) ν(dx) = 0   (2.56)

with equality if and only if q = p (ν-a.e.).

We call L[q] the variational lower bound on the log partition func-
tion log Z. Due to the statistical physics origins of variational infer-
ence methods, the negative log partition function − log Z is also called
the free energy (or proportional to the free energy), and L[q] is also
sometimes called the (negative) variational free energy [MacKay, 2003,
Section 33.1]. For two densities q and p with respect to the same base
measure, KL(qkp) is the Kullback-Leibler divergence from q to p, used
as a score of dissimilarity between pairs of densities [Amari and Na-
gaoka, 2007].
The variational lower bound of Proposition 2.4 is useful in inference
because if we wish to approximate an intractable p with a tractable q
by minimizing KL(qkp), we can equivalently choose q to maximize L[q],
which is possible to evaluate since it does not include the partition func-
tion Z. The mean field variational inference problem is then to choose
the approximating distribution q(x) over the family Q to optimize the
objective L[q], as we summarize in the next definition.
Definition 2.4 (Mean field variational inference problem). Given a target probability density p(x) = (1/Z) p̄(x) and an approximating family Q, the mean field variational inference problem is

max_{q∈Q} L[q]   or   max_{q∈Q} Eq(x)[log (p̄(x)/q(x))].   (2.57)

The variational family Q is often made tractable by enforcing fac-


torization structure. This factorization structure in q(x), along with
how factors can be updated one at a time in terms of expectations with
respect to the other factors, gives the mean field method its name.
In the context of Bayesian inference, p is usually an intractable pos-
terior distribution, p̄ is the joint distribution, and Z is the marginal like-
lihood, which plays a central role in Bayesian model selection and the
minimum description length (MDL) criterion [MacKay, 2003, Chapter
28] [Hastie et al., 2001, Chapter 7]. Thus the value of the variational
lower bound L[q] provides a lower bound on the log marginal likelihood,
or an evidence lower bound (ELBO), and can serve as an approximate
model selection criterion [Grosse et al., 2015].
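To make the bound (2.50) concrete, the following toy sketch (added here; the unnormalized mixture target, the Gaussian form of q, and the sample sizes are assumptions) estimates L[q] by Monte Carlo for two choices of q and compares it with the exact log Z, which is log 10 by construction.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Unnormalized target: a two-component Gaussian mixture with total mass Z = 10,
# so the true log partition function is log(10).
def log_pbar(x):
    return np.log(3 * norm.pdf(x, -2, 1) + 7 * norm.pdf(x, 2, 1))

def elbo(mean, std, n_samples=100_000):
    # Monte Carlo estimate of L[q] = E_q[log pbar(x) - log q(x)], Eq. (2.51).
    x = rng.normal(mean, std, size=n_samples)
    return np.mean(log_pbar(x) - norm.logpdf(x, mean, std))

print("log Z      :", np.log(10.0))
print("L[q] broad :", elbo(0.0, 3.0))   # broad q covering both modes
print("L[q] mode  :", elbo(2.0, 1.0))   # q matched to the dominant mode
# Both estimates lie below log Z, consistent with the variational inequality.
```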
In general mean field variational inference poses a challenging non-
linear programming problem, and standard local optimization methods,
such as (block) coordinate ascent or gradient-based ascent methods, can
only find local optima or stationary points. However, framing posterior
inference as an optimization problem has many advantages. In partic-
ular, it allows us to draw on scalable optimization algorithms
for inference, such as the stochastic gradient optimization methods we
summarize in Section 2.5.
See Wainwright and Jordan [2008, Chapter 5] for a convex anal-
ysis perspective on mean field variational inference with exponential
family graphical models, and see Bishop [2006, Chapter 10] and Mur-
phy [2012, Chapter 22] for treatments that discuss Bayesian posterior
inference and conjugacy. Koller and Friedman [2009, Section 11.5.1] de-
velops these methods in the context of alternative variational inference
approaches for probabilistic graphical models.

2.4 Expectation propagation variational inference

Expectation propagation (EP) [Minka, 2001, Murphy, 2012, Wain-


wright and Jordan, 2008] is another variational inference method that
can be used for posterior inference in Bayesian models. It is a gener-
alization of Assumed Density Filtering (ADF) [Maybeck, 1982, Opper
and Winther, 1998] and loopy belief propagation (LBP) [Murphy et al.,

1999], and can provide more accurate posterior approximations than


mean field methods in some cases.
As a variational inference method, EP aims to find a tractable dis-
tribution q(θ) that approximates a target distribution p(θ) by solv-
ing an optimization problem. The EP optimization problem is derived
from a variational representation of the log partition function of p(θ),
using a Bethe-like entropy approximation and a relaxed set of con-
vex constraints [Wainwright and Jordan, 2008, Section 4.3]. The EP
algorithm is then a particular Lagrangian method for solving this con-
strained optimization problem, where local moment-matching corre-
sponds to updating Lagrange multipliers. However, because this La-
grangian approach has no clear merit function on which to make guar-
anteed progress [Bertsekas, 2016, Section 5.4], the algorithm has no con-
vergence guarantees. Alternative first-order Lagrangian methods can
lead to locally-convergent algorithms [Heskes and Zoeter, 2002, Arrow
et al., 1959], though we do not discuss them here.
To develop the standard EP updates heuristically, consider a tar-
get posterior distribution p(θ) with factorization structure and a cor-
responding factorized variational family q(θ),
p(θ) ∝ ∏_{k=1}^K fk(θ),   q(θ) ∝ ∏_{k=1}^K qk(θ),   (2.58)

where each factor qk (θ) is fixed to be in a parametric model class, such


as a Gaussian, with parameters η̃k. Note that this representation is not
necessarily factorized across random variables as in mean field; instead,
each factor might be used to approximate a complex likelihood term
on θ with a simpler form. For any factor index k, define the cavity
distribution q¬k (θ) formed by the product of the other approximating
factors as
q¬k(θ) ∝ q(θ) / qk(θ).   (2.59)

We'd like to update the factor qk(θ) so that the approximating distribution q(θ) ∝ qk(θ) q¬k(θ) is closer to the tilted distribution p̃k(θ) defined by

p̃k(θ) ∝ fk(θ) q¬k(θ),   (2.60)

which is likely to be a better approximation to the target distribution.


Therefore we update the parameters η̃k of qk(θ) to minimize the KL divergence from the tilted distribution to the approximation, writing

η̃k = arg min_{η̃k} KL(p̃k(θ) ‖ q(θ))   (2.61)
    = arg min_{η̃k} KL((1/Zk) fk(θ) q¬k(θ) ‖ qk(θ) q¬k(θ)),   (2.62)
where we have introduced the tilted distribution normalizing con-
stant Zk . To accomplish this update, we compute the moments of
the tilted distribution p̃k(θ) using a method such as exact inference,
MCMC, or nested EP, and then choose the parameters of qk (θ) to ap-
proximately match the estimated moments.
We summarize the resulting algorithm in Algorithm 3. Note that
defining the cavity distribution q¬k (θ) and symbolically forming the
tilted distribution p̃k(θ) do not require real computational work. In-
stead, the computational effort is in computing or estimating the mo-
ments of p̃k(θ) and then locally optimizing the parameters η̃k of qk(θ)
to approximately match those moments. Because there are many al-
ternative choices for performing these computations, EP serves as an
algorithm template or strategy, and specific EP algorithms fill in these
inference and optimization steps. Expectation propagation can also be
used to provide an estimate of the marginal likelihood.
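The sketch below is a minimal, heuristic instance of the EP template in Algorithm 3 for a toy one-dimensional model; everything in it is our own construction (the logistic likelihood factors, Gaussian site approximations in natural parameters, and brute-force grid quadrature for the tilted moments are assumptions, not a reference implementation).

```python
import numpy as np

# EP for a 1-D posterior p(theta) ∝ N(theta; 0, 1) * prod_k sigmoid(y_k * theta),
# with Gaussian sites stored as natural parameters (precision, precision * mean).
# Tilted moments are computed by quadrature on a dense grid.
y = np.array([1.0, 1.0, -1.0, 1.0])            # assumed binary observations
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]

prior_nat = np.array([1.0, 0.0])               # N(0, 1) prior kept exact
site_nat = np.zeros((len(y), 2))               # initialize sites to flat factors

def moments_to_nat(mean, var):
    return np.array([1.0 / var, mean / var])

for sweep in range(20):
    for k in range(len(y)):
        q_nat = prior_nat + site_nat.sum(axis=0)
        cavity = q_nat - site_nat[k]           # remove site k: the cavity q_{not k}
        prec, pm = cavity
        # Tilted distribution ∝ f_k(theta) * cavity(theta), evaluated on the grid.
        log_sigmoid = -np.log1p(np.exp(-y[k] * grid))
        log_tilted = log_sigmoid - 0.5 * prec * grid ** 2 + pm * grid
        w = np.exp(log_tilted - log_tilted.max())
        w /= w.sum() * dx
        mean = np.sum(grid * w) * dx
        var = np.sum((grid - mean) ** 2 * w) * dx
        # Moment matching: new site = tilted natural parameters minus cavity.
        site_nat[k] = moments_to_nat(mean, var) - cavity

q_nat = prior_nat + site_nat.sum(axis=0)
print("approximate posterior mean:", q_nat[1] / q_nat[0])
print("approximate posterior var :", 1.0 / q_nat[0])
```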

2.5 Stochastic gradient optimization

In this section we briefly review some basic ideas in stochastic gradi-


ent optimization, which are used most directly in the stochastic vari-
ational inference algorithms of Chapter 5. Related stochastic gradient
ideas are also used in Chapter 3, particularly in developing Stochastic
Gradient Langevin Dynamics. The basic stochastic gradient optimiza-
tion algorithm is given in Algorithm 4 and sufficient conditions for its
convergence to a stationary point are given in Theorem 2.1.
Given a dataset ȳ = {ȳ(k)}_{k=1}^K, where each ȳ(k) is a data minibatch, consider the optimization problem

φ* = arg max_φ f(φ)   (2.63)

Algorithm 3 Expectation propagation (EP)

Input: Target distribution p(θ) = ∏_{k=1}^K fk(θ), parameterized approximating family q(θ) = ∏_{k=1}^K qk(θ)
Output: Approximate distribution q(θ)
Initialize parameters of each approximating factor qk(θ)
for t = 1, 2, . . . until convergence do
    for k = 1, 2, . . . , K do
        Define cavity distribution q¬k(θ) ∝ q(θ)/qk(θ)
        Define tilted distribution p̃k(θ) ∝ fk(θ) q¬k(θ)
        Set parameters of qk(θ) to minimize KL(p̃k(θ) ‖ q(θ)) by computing and matching moments
        Define the updated approximation q(θ) ∝ qk(θ) q¬k(θ)

where the objective function f decomposes according to


f(φ) = ∑_{k=1}^K g(φ, ȳ(k)).   (2.64)
In the context of variational Bayesian inference, the objective f may be
a variational lower bound on the model log evidence and φ may be the
parameters of the variational family. Alternatively, in MAP inference,
f may be the log joint density and φ may be the model parameters.
Using the decomposition of f , we can compute unbiased Monte
Carlo estimates of its gradient. In particular, if the random index k̂
is sampled from {1, 2, . . . , K}, denoting the probability of sampling
index k as pk > 0, we have
∇φ f(φ) = ∑_{k=1}^K pk (1/pk) ∇φ g(φ, ȳ(k)) = Ek̂[ (1/pk̂) ∇φ g(φ, ȳ(k̂)) ].   (2.65)

Thus by considering a Monte Carlo approximation to the expectation


over k̂, we can generate stochastic approximate gradients of the objec-
tive f using only a single ȳ (k) at a time.
A stochastic gradient ascent algorithm uses these approximate gra-
dients to perform updates and find a stationary point of the objective.
At each iteration, such an algorithm samples a data minibatch, com-
putes a gradient with respect to that minibatch, and takes a step in

Algorithm 4 Stochastic gradient ascent

Input: f : Rn → R of the form (2.65), sequences ρ(t) and G(t)
Initialize φ(0) ∈ Rn
for t = 0, 1, 2, . . . do
    k̂(t) ← sample index k with probability pk
    φ(t+1) ← φ(t) + ρ(t) (1/pk̂(t)) G(t) ∇φ g(φ(t), ȳ(k̂(t)))

that direction. In particular, for a sequence of stepsizes ρ(t) and a se-


quence of positive definite matrices G(t) , a typical stochastic gradient
ascent algorithm is given in Algorithm 4.
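To make the procedure concrete, the following is a minimal Python sketch of Algorithm 4, assuming uniform minibatch sampling (pk = 1/K) and G(t) = I; the function grad_g and the step size constants are illustrative placeholders supplied by the user.

import numpy as np

def stochastic_gradient_ascent(grad_g, minibatches, phi0, num_steps,
                               alpha=0.1, beta=1.0, gamma=0.75, seed=0):
    """Sketch of Algorithm 4: grad_g(phi, batch) returns the gradient of
    g(phi, batch) with respect to phi; minibatches is a list of K batches."""
    rng = np.random.default_rng(seed)
    K = len(minibatches)
    phi = np.asarray(phi0, dtype=float)
    for t in range(num_steps):
        rho = alpha * (beta + t) ** (-gamma)   # Robbins-Monro step size sequence
        k = rng.integers(K)                    # sample an index with probability 1/K
        phi = phi + rho * K * grad_g(phi, minibatches[k])   # 1/pk = K
    return phi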
Stochastic gradient algorithms have very general convergence guar-
antees, requiring only weak conditions on the step size sequence and
even the accuracy of the gradients themselves (along with standard
Lipschitz smoothness of the objective). We summarize a common set
of sufficient conditions in Theorem 2.1. Proofs of this result, along with
more general versions, can be found in Bertsekas and Tsitsiklis [1989]
and Bottou [1998]. The conditions on the step size sequence were origi-
nally developed in Robbins and Monro [1951]. Note also that while the
construction here has assumed that the stochasticity in the gradients
arises only from randomly subsampling a finite sum, more general ver-
sions allow for other sources of stochasticity, typically requiring only
bounded variance and allowing some degree of bias [Bertsekas and Tsit-
siklis, 1989, Section 7.8].

Theorem 2.1. Given a continuously differentiable function f : Rn → R
of the form (2.64), if

1. there exists a constant C0 such that f (φ) ≤ C0 for all φ ∈ Rn ,

2. there exists a constant C1 such that
   ‖∇f (φ) − ∇f (φ0 )‖2 ≤ C1 ‖φ − φ0 ‖2   ∀φ, φ0 ∈ Rn ,

3. there are positive constants C2 and C3 such that
   ∀t   C2 I ≺ G(t) ≺ C3 I,
4. and the stepsize sequence ρ(t) satisfies

   ∑_{t=0}^{∞} ρ(t) = ∞   and   ∑_{t=0}^{∞} (ρ(t) )2 < ∞,

then Algorithm 4 converges to a stationary point in the sense that

   lim inf_{t→∞} ‖∇f (φ(t) )‖ = 0   (almost surely).                      (2.66)

While stochastic optimization theory provides convergence guar-
antees, there is no general theory to analyze global rates of parameter
convergence for nonconvex problems such as those that commonly arise
in posterior inference. Indeed, the empirical rate of convergence often
depends strongly on the variance of the stochastic gradient updates and
on the choice of step size sequence. There are automatic methods to
tune or adapt the sequence of stepsizes [Snoek et al., 2012, Ranganath
et al., 2013], though we do not discuss them here.
3  MCMC with data subsets

In MCMC sampling for Bayesian inference, the task is to simulate a
Markov chain that admits as its stationary distribution the posterior
distribution of interest. While there are many standard procedures for
constructing and simulating from such Markov chains, when the dataset
is large many of these algorithms’ updates become computationally
expensive. This growth in complexity naturally suggests the question
of whether there are MCMC procedures that can generate approximate
posterior samples without using the full dataset in each update. In
this chapter, we focus on recent MCMC sampling schemes that scale
Bayesian inference by operating on only subsets of data at a time.

3.1 Factoring the joint density

In most Bayesian inference problems, the fundamental object of interest
is the posterior density, which for fixed data is proportional to the
product of the prior and the likelihood:

p(θ | x) ∝ p(θ, x) = p(θ)p(x | θ).                                        (3.1)
When the data x = {xn }_{n=1}^{N} are conditionally independent given the
model parameters θ, the likelihood can be decomposed into a product
of terms:

p(θ | x) ∝ p(θ)p(x | θ) = p(θ) ∏_{n=1}^{N} p(xn | θ).                     (3.2)

When N is large, this factorization can be exploited to construct
MCMC algorithms in which the updates depend only on subsets of
the data.
In particular, we can use subsets of data to form an unbiased Monte
Carlo estimate of the log likelihood and consequently the log joint den-
sity. The log likelihood is a sum of terms:

log p(x | θ) = ∑_{n=1}^{N} log p(xn | θ),                                  (3.3)

and we can approximate this sum using a random subset of m < N
terms,

log p(x | θ) ≈ (N/m) ∑_{n=1}^{m} log p(x∗n | θ),                           (3.4)

where {x∗n }_{n=1}^{m} is a subset of {xn }_{n=1}^{N} sampled uniformly at random
and with replacement. This approximation is an unbiased estimator
and yields an unbiased estimate of the log joint density:

log p(θ)p(x | θ) ≈ log p(θ) + (N/m) ∑_{n=1}^{m} log p(x∗n | θ).            (3.5)

Several of the methods reviewed in this chapter exploit this estimator
to perform MCMC updates.
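As a concrete illustration, the following is a minimal Python sketch of the estimator in Equation (3.5); the functions log_prior and log_lik are illustrative placeholders for the model's log prior and per-datum log likelihoods.

import numpy as np

def log_joint_estimate(theta, x, log_prior, log_lik, m, rng):
    """Unbiased minibatch estimate of log p(theta) + log p(x | theta), as in
    Equation (3.5). log_lik(x_sub, theta) returns per-datum log likelihoods."""
    N = len(x)
    idx = rng.integers(N, size=m)      # sample uniformly at random, with replacement
    return log_prior(theta) + (N / m) * np.sum(log_lik(x[idx], theta))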
3.2 Adaptive subsampling for Metropolis–Hastings

In traditional Metropolis–Hastings (MH), we evaluate the joint density
to decide whether to accept or reject a proposal. As noted by Ko-
rattikara et al. [2014], because the value of the joint density depends
on the full dataset, when N is large this is an unappealing amount of
computation to reach a binary decision. In this section, we survey ideas
for using approximate MH tests that depend on only a subset of the
full dataset. The resulting approximate MCMC algorithms proceed in
each iteration by reading only as much data as required to satisfy some
estimated error tolerance.
While there are several variations, the common idea is to model
the probability that the outcome of such an approximate MH test
differs from the exact MH test. This probability model allows us to
construct an approximate MCMC sampler, outlined in Section 3.2.2,
where the user specifies some tolerance for the error in an MH test
and the amount of data evaluated is controlled by an adaptive stopping
rule. Different models for the MH test error lead to different stopping
rules. Korattikara et al. [2014] use a normal model to construct a t-
statistic hypothesis test, which we describe in Section 3.2.3. Bardenet
et al. [2014] instead use concentration inequalities, which we describe
in Section 3.2.4. Given an error model and resulting stopping rule, both
schemes rely on an MH test based on a Monte Carlo estimate of the
log joint density, which we summarize in Section 3.2.1. Our notation
in this section follows Bardenet et al. [2014].
Bardenet et al. [2014] observe that similar ideas have been devel-
oped both in the context of simulated annealing1 by the operations
research community [Bulgak and Sanders, 1988, Alkhamis et al., 1999,
Wang and Zhang, 2006], and in the context of MCMC inference for
factor graphs [Singh et al., 2012].

3.2.1 An approximate MH test based on a data subset

In the Metropolis–Hastings algorithm (§2.2.5), the proposal is stochas-
tically accepted when

p(θ0 | x)q(θ | θ0 ) / [p(θ | x)q(θ0 | θ)] > u,                             (3.6)

where u ∼ Unif(0, 1). Rearranging and using log probabilities gives

log [p(x | θ0 )/p(x | θ)] > log [u q(θ0 | θ)p(θ) / (q(θ | θ0 )p(θ0 ))].    (3.7)

Scaling both sides by 1/N gives an equivalent threshold,

Λ(θ, θ0 ) > ψ(u, θ, θ0 ),                                                  (3.8)

1
  Simulated annealing is a stochastic optimization heuristic that is operationally
similar to MH.
where on the left, Λ(θ, θ0 ) is the average log likelihood ratio,

Λ(θ, θ0 ) = (1/N) ∑_{n=1}^{N} log [p(xn | θ0 )/p(xn | θ)] ≡ (1/N) ∑_{n=1}^{N} ℓn ,   (3.9)

where

ℓn = log p(xn | θ0 ) − log p(xn | θ),                                      (3.10)

and on the right,

ψ(u, θ, θ0 ) = (1/N) log [u q(θ0 | θ)p(θ) / (q(θ | θ0 )p(θ0 ))].           (3.11)
We can form an approximate threshold by subsampling the ℓn .
Let {ℓ∗n }_{n=1}^{m} be a subsample of size m < N , without replacement,
from {ℓn }_{n=1}^{N} . This gives the following approximate test:

Λ̂m (θ, θ0 ) > ψ(u, θ, θ0 ),                                               (3.12)

where

Λ̂m (θ, θ0 ) = (1/m) ∑_{n=1}^{m} log [p(x∗n | θ0 )/p(x∗n | θ)] ≡ (1/m) ∑_{n=1}^{m} ℓ∗n .   (3.13)
This subsampled average log likelihood ratio Λ̂m (θ, θ0 ) is an unbiased
estimate of the average log likelihood ratio Λ(θ, θ0 ). However, an error
is made in the event that the approximate test (3.12) disagrees with
the exact test (3.8), and the probability of such an error event depends
on the distribution of Λ̂m (θ, θ0 ) and not just its mean.
Note that because the proposal θ0 is usually a small perturbation
of θ, we expect log p(xn | θ0 ) to be similar to log p(xn | θ). In this case,
we expect the log likelihood ratios ℓn to have a smaller variance compared
to the variance of log p(xn | θ) across data terms.
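The following Python sketch computes the subsampled test quantities and the approximate accept decision of (3.12); log_lik, log_prior, and log_q are illustrative placeholders, with log_q(a, b) denoting log q(a | b).

import numpy as np

def approximate_mh_test(theta, theta_prop, x, m, log_lik, log_prior, log_q, u, rng):
    """Approximate MH test: compare the subsampled average log likelihood
    ratio (3.13) against the threshold psi of Equation (3.11)."""
    N = len(x)
    psi = (np.log(u) + log_q(theta_prop, theta) + log_prior(theta)
           - log_q(theta, theta_prop) - log_prior(theta_prop)) / N
    idx = rng.choice(N, size=m, replace=False)    # subsample without replacement
    ell = log_lik(x[idx], theta_prop) - log_lik(x[idx], theta)
    return ell.mean() > psi                       # approximate accept/reject decision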
3.2.2 Approximate MH with an adaptive stopping rule

A nested sequence of data subsets, sampled without replacement, that
converges to the complete dataset gives us a sequence of approximate
MH tests that converges to the exact MH test. Modeling the error of
such an approximate MH test gives us a mechanism for designing an
approximate MH algorithm in which, at each iteration, we incremen-
tally read more data until an adaptive stopping rule informs us that
Algorithm 5 Approximate MH with an adaptive stopping rule
Input: Initial state θ0 , number of iterations T , data x = {xn }_{n=1}^{N} ,
    posterior p(θ | x), proposal q(θ0 | θ)
Output: Samples θ1 , . . . , θT
for t in 0, . . . , T − 1 do
    θ0 ∼ q(θ0 | θt )                              ▷ Generate proposal
    u ∼ Unif(0, 1)                                ▷ Draw random number
    ψ(u, θt , θ0 ) ← (1/N) log [u q(θ0 | θt )p(θt ) / (q(θt | θ0 )p(θ0 ))]
    Λ̂(θt , θ0 ) ← AvgLogLikeRatioEstimate(θt , θ0 , ψ(u, θt , θ0 ))
    if Λ̂(θt , θ0 ) > ψ(u, θt , θ0 ) then           ▷ Approximate MH test
        θt+1 ← θ0                                 ▷ Accept proposal
    else
        θt+1 ← θt                                 ▷ Reject proposal

our error is less than some user-specified tolerance. Algorithm 5 out-
lines this approach. The function AvgLogLikeRatioEstimate com-
putes Λ̂(θ, θ0 ) according to an adaptive stopping rule that depends on
an error model, i.e., a way to approximate or bound the probability
that the approximate outcome disagrees with the full-data outcome:

P[ (Λ̂m (θ, θ0 ) > ψ(u, θ, θ0 )) ≠ (Λ(θ, θ0 ) > ψ(u, θ, θ0 )) ].           (3.14)

We describe two possible error models in Sections 3.2.3 and 3.2.4.
A practical issue with adaptive subsampling is choosing the sizes of
the data subsets. One approach, taken by Korattikara et al. [2014], is to
use a fixed batch size b and read b more data points at a time. Bardenet
et al. [2014] instead geometrically increase the total subsample size, and
also discuss connections between adaptive stopping rules and related
ideas such as bandit problems, racing algorithms and boosting.

3.2.3 Using a t-statistic hypothesis test

Korattikara et al. [2014] propose an approximate MH acceptance prob-
ability that uses a parametric test of significance as its error model.
By assuming a normal model for the log likelihood estimate Λ̂(θ, θ0 ),
a t-statistic hypothesis test then provides an estimate of whether the
approximate outcome agrees with the full-data outcome, i.e., the ex-
pression in Equation (3.14). This leads to an adaptive framework as
in Section 3.2.2 where, at each iteration, the data are processed incre-
mentally until the t-test satisfies some user-specified tolerance ε.
Let us model the ℓn as i.i.d. from a normal distribution with
bounded variance σ 2 :

ℓn ∼ N (µ, σ 2 ).                                                          (3.15)

The mean estimate µ̂m for µ based on the subset of size m is equal
to Λ̂m (θ, θ0 ):

µ̂m = Λ̂m (θ, θ0 ) = (1/m) ∑_{n=1}^{m} ℓ∗n .                                (3.16)

The error estimate σ̂m for σ may be derived from sm /√m, where sm is
the empirical standard deviation of the m subsampled ℓn terms, i.e.,

sm = √( m/(m − 1) ( Λ̂2m (θ, θ0 ) − (Λ̂m (θ, θ0 ))2 ) ),                    (3.17)

where

Λ̂2m (θ, θ0 ) = (1/m) ∑_{n=1}^{m} (ℓ∗n )2 .                                 (3.18)

To obtain a confidence interval, we multiply this estimate by the finite
population correction, giving:

σ̂m = (sm /√m) √( (N − m)/(N − 1) ).                                        (3.19)
The test statistic

t = [Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )] / σ̂m                                     (3.20)

follows a Student’s t-distribution with m − 1 degrees of freedom
when Λ(θ, θ0 ) = ψ(u, θ, θ0 ). The tail probability for |t| then gives the
probability that the approximate and actual outcomes agree, and thus

ρ = 1 − φm−1 (|t|)                                                         (3.21)
Algorithm 6 Estimate of the average log likelihood ratio. The adap-
tive stopping rule uses a t-statistic hypothesis test.
Parameters: batch size b, user-defined error tolerance ε
function AvgLogLikeRatioEstimate(θ, θ0 , ψ(u, θ, θ0 ))
    m, Λ̂(θ, θ0 ), Λ̂2 (θ, θ0 ) ← 0, 0, 0
    while True do
        c ← min(b, N − m)
        Λ̂(θ, θ0 ) ← [ mΛ̂(θ, θ0 ) + ∑_{n=m+1}^{m+c} log (p(xn | θ0 )/p(xn | θ)) ] / (m + c)
        Λ̂2 (θ, θ0 ) ← [ mΛ̂2 (θ, θ0 ) + ∑_{n=m+1}^{m+c} (log (p(xn | θ0 )/p(xn | θ)))2 ] / (m + c)
        m ← m + c
        s ← √( m/(m − 1) ( Λ̂2 (θ, θ0 ) − (Λ̂(θ, θ0 ))2 ) )
        σ̂ ← (s/√m) √( (N − m)/(N − 1) )
        ρ ← 1 − φm−1 ( |Λ̂(θ, θ0 ) − ψ(u, θ, θ0 )| / σ̂ )
        if ρ ≤ ε or m = N then
            return Λ̂(θ, θ0 )

is the probability that they disagree, where φm−1 (·) is the CDF of
the Student’s t-distribution with m − 1 degrees of freedom. The t-test
thus gives an adaptive stopping rule, i.e., for any user-provided toler-
ance ε ≥ 0, we can incrementally increase m until ρ ≤ ε. We illustrate
this approach in Algorithm 6.
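The following is a minimal Python sketch of this stopping rule, assuming the N log likelihood ratios have been pre-permuted so that reading them sequentially corresponds to sampling without replacement; the function and argument names are illustrative.

import numpy as np
from scipy import stats

def avg_log_lik_ratio_estimate(ell, psi, batch_size, eps):
    """Adaptive t-test stopping rule of Algorithm 6. ell holds all N log
    likelihood ratios; psi is the threshold (3.11); eps is the tolerance."""
    N = len(ell)
    m, sum1, sum2 = 0, 0.0, 0.0
    while True:
        c = min(batch_size, N - m)
        batch = ell[m:m + c]
        sum1 += batch.sum()
        sum2 += (batch ** 2).sum()
        m += c
        lam = sum1 / m                      # running mean, Eq. (3.16)
        lam2 = sum2 / m                     # running second moment, Eq. (3.18)
        if m == N:
            return lam
        s = np.sqrt(m / (m - 1) * (lam2 - lam ** 2))
        sigma = s / np.sqrt(m) * np.sqrt((N - m) / (N - 1))     # Eq. (3.19)
        rho = 1 - stats.t.cdf(abs(lam - psi) / sigma, df=m - 1)
        if rho <= eps:
            return lam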
3.2.4 Using concentration inequalities

Bardenet et al. [2014] propose an adaptive subsampling method that
is mechanically similar to using a t-test but instead uses concentration
inequalities. In addition to a bound on the error (of the approximate
acceptance probability) that is local to each iteration, concentration
bounds yield a bound on the total variation distance between the ap-
proximate and true stationary distributions. The method is further
refined with variance reduction techniques in Bardenet et al. [2015],
though we only discuss the basic version here.
As in Section 3.2.3, we evaluate an approximate MH threshold based
on a data subset of size m, given in Equation (3.12). We bound the
probability that the approximate binary outcome is incorrect via con-
centration inequalities that characterize the quality of Λ̂m (θ, θ0 ) as an
estimate for Λ(θ, θ0 ). Such a concentration inequality is a probabilistic
statement that, for δm ∈ (0, 1) and some constant cm ,

P( |Λ̂m (θ, θ0 ) − Λ(θ, θ0 )| ≤ cm ) ≥ 1 − δm .                              (3.22)

For example, in Hoeffding’s inequality without replacement [Serfling, 1974],

cm = Cθ,θ0 √( (2/m) (1 − (m − 1)/N ) log(2/δm ) ),                          (3.23)
where

Cθ,θ0 = max_{1≤n≤N} |log p(xn | θ0 ) − log p(xn | θ)| = max_{1≤n≤N} |ℓn |,   (3.24)

using ℓn as in Equation (3.10). Alternatively, if the empirical standard
deviation sm of the m subsampled ℓ∗n terms is small, then the empirical
Bernstein bound,

cm = sm √( 2 log(3/δm )/m ) + 6 Cθ,θ0 log(3/δm )/m,                          (3.25)

is tighter [Audibert et al., 2009], where sm is given in Equation (3.17).
While Cθ,θ0 can be obtained via all the ℓn , this is precisely the compu-
tation we want to avoid. Therefore, the user must provide an estimate
of Cθ,θ0 .
Bardenet et al. [2014] use a concentration bound to construct an
adaptive stopping rule based on a strategy called empirical Bernstein
stopping [Mnih et al., 2008]. Let cm be a concentration bound as in
Equation (3.23) or (3.25) and let δm be the associated error. This con-
centration bound states that |Λ̂m (θ, θ0 ) − Λ(θ, θ0 )| ≤ cm with probabil-
ity 1 − δm . If |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| > cm , then the approximate MH
[Figure 3.1 graphic: a one-dimensional sketch showing ψ, Λ̂m , and Λ, with an
interval of width 2cm centered at Λ̂m .]

Figure 3.1: Reproduction of Figure 2 from Bardenet et al. [2014].
If |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| > cm , then the adaptive stopping rule using a con-
centration bound is satisfied and we use the approximate MH test based
on Λ̂m (θ, θ0 ).

test agrees with the exact MH test with probability 1 − δm . We repro-
duce a helpful illustration of this scenario from Bardenet et al. [2014]
in Figure 3.1. If instead |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| ≤ cm , then we want to
increase m until this is no longer the case. Let M be the stopping time,
i.e., the number of data points evaluated using this criterion,

M = min{ N, inf{ m ≥ 1 : |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| > cm } }.             (3.26)
We can set δm according to a user-defined parameter ε ∈ (0, 1) so
that ε gives an upper bound on the error of the approximate acceptance
probability. Let p > 1 and set

δm = (p − 1)ε / (p m^p ),   thus   ∑_{m≥1} δm ≤ ε.                           (3.27)

A union bound argument gives

P( ∩_{m≥1} { |Λ̂m (θ, θ0 ) − Λ(θ, θ0 )| ≤ cm } ) ≥ 1 − ε,                     (3.28)

under sampling without replacement. Hence, with probability 1 − ε,
the approximate MH test based on Λ̂M (θ, θ0 ) agrees with the exact
MH test. In other words, the stopping rule for computing Λ̂m (θ, θ0 ) in
Algorithm 5 is satisfied once we observe |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| > cm .
We illustrate this approach in Algorithm 7, using Hoeffding’s inequality
without replacement.
In their actual implementation, Bardenet et al. [2014] modify δm
to reflect the number of batches processed instead of the subsample
Algorithm 7 Estimate of the average log likelihood ratio. The adap-
tive stopping rule uses Hoeffding’s inequality without replacement.
Parameters: batch size b, user-defined error tolerance ε, estimate
    of Cθ,θ0 = maxn |ℓn |, p > 1
function AvgLogLikeRatioEstimate(θ, θ0 , ψ(u, θ, θ0 ))
    m, Λ̂(θ, θ0 ) ← 0, 0
    while True do
        c ← min(b, N − m)
        Λ̂(θ, θ0 ) ← [ mΛ̂(θ, θ0 ) + ∑_{n=m+1}^{m+c} log (p(xn | θ0 )/p(xn | θ)) ] / (m + c)
        m ← m + c
        δ ← (p − 1)ε / (p m^p )
        cm ← Cθ,θ0 √( (2/m) (1 − (m − 1)/N ) log(2/δ) )
        if |Λ̂(θ, θ0 ) − ψ(u, θ, θ0 )| > cm or m = N then
            return Λ̂(θ, θ0 )
size m. For example, suppose we use the concentration bound in Equa-
tion (3.23), i.e., Hoeffding’s inequality without replacement. Then after
processing a subsample of size m in k batches, the adaptive stopping
rule checks whether |Λ̂m (θ, θ0 ) − ψ(u, θ, θ0 )| > cm , where

cm = Cθ,θ0 √( (2/m) (1 − (m − 1)/N ) log(2/δk ) )                            (3.29)

and

δk = (p − 1)ε / (p k^p ).                                                    (3.30)

Also, as mentioned in Section 3.2.2, Bardenet et al. [2014] geometrically
increase the subsample size by a factor γ. In their experiments, they
use the empirical Bernstein-Serfling bound [Bardenet and Maillard,
2015]. For the hyperparameters, they set p = 2, γ = 2, and ε = 0.01,
and remark that they empirically found their algorithm to be robust
to the choice of ε.
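For concreteness, the following Python sketch implements the Hoeffding-based stopping rule of Algorithm 7 (with the δ of Equation (3.27) indexed by the subsample size rather than the batch count); the names are illustrative, and C is a user-supplied estimate of Cθ,θ0.

import numpy as np

def avg_log_lik_ratio_estimate_hoeffding(ell, psi, batch_size, eps, C, p=2.0):
    """Adaptive stopping rule using Hoeffding's inequality without replacement.
    ell holds the N log likelihood ratios in a pre-permuted order; psi is the
    threshold of Equation (3.11)."""
    N = len(ell)
    m, total = 0, 0.0
    while True:
        c = min(batch_size, N - m)
        total += ell[m:m + c].sum()
        m += c
        lam = total / m                                   # subsampled mean (3.13)
        if m == N:
            return lam
        delta = (p - 1) * eps / (p * m ** p)              # Eq. (3.27)
        c_m = C * np.sqrt(2.0 / m * (1.0 - (m - 1) / N)   # Hoeffding bound (3.23)
                          * np.log(2.0 / delta))
        if abs(lam - psi) > c_m:                          # stopping criterion (3.26)
            return lam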
3.2.5 Error bounds on the stationary distribution

In this section, we reproduce some theoretical results from Korattikara
et al. [2014] and Bardenet et al. [2014]. After setting up some notation,
we emphasize the most general aspects of these results, which apply to
pairs of transition kernels whose differences are bounded, and thus are
not specific to adaptive subsampling procedures. The central theorem
is an upper bound on the difference between the stationary distribu-
tions of such pairs of kernels in the case of Metropolis–Hastings. Its
proof depends on the ability to bound the difference in the acceptance
probabilities, at each iteration, of the two MH transition kernels.

Preliminaries and notation. Let P and Q be probability measures
(distributions) with Radon–Nikodym derivatives (densities) fP and fQ ,
respectively, and absolutely continuous with respect to measure ν. The
total variation distance between P and Q is

‖P − Q‖TV ≡ (1/2) ∫_{θ∈Θ} |fP (θ) − fQ (θ)| dν(θ).                          (3.31)

Let T denote an MH transition kernel with stationary distribution π
and acceptance probability α(θ, θ0 ). Let T̃ denote an approximation
to T , using the same proposal distribution q(θ0 | θ) but with stationary
distribution π̃ and acceptance probability α̃(θ, θ0 ). Throughout this sec-
tion, we specify when T̃ is constructed from T via an adaptive stopping
rule; some of the results are more general. Let

E(θ, θ0 ) = α̃(θ, θ0 ) − α(θ, θ0 )                                           (3.32)

be the acceptance probability error of the approximate MH test, with
respect to the exact test. Finally, let

Emax = sup_{θ,θ0} |E(θ, θ0 )|                                                (3.33)

be the worst case absolute acceptance probability error.
The following theorem shows that, under strong ergodicity assump-
tions on the exact MH chain, we have similar ergodicity guarantees for
the approximate MH chain. In addition, the total variation distance
between the stationary distribution of the approximate MH chain and
the target distribution can also be bounded in terms of the ergodicity
parameters of the exact chain and the worst case acceptance probability
error given in Eq. (3.33).

Theorem 3.1 (Total variation bound under uniform geometric ergodic-
ity [Bardenet et al., 2014]). Let T be uniformly geometrically ergodic,
so that there exist real numbers A < ∞, λ < 1, and h < ∞ such that for
all initial densities π0 and integers k > 0,

‖T^k π0 − π‖TV ≤ A λ^⌊k/h⌋ .                                                 (3.34)

Then T̃ is uniformly geometrically ergodic with stationary distribu-
tion π̃, and we have for some constant C < ∞,

‖T̃^k π0 − π̃‖TV ≤ C ( 1 − (1 − ε)^h (1 − λ) )^⌊k/h⌋ ,                        (3.35)

‖π − π̃‖TV ≤ A h Emax / (1 − λ),                                              (3.36)

where ε is the user-defined error tolerance for the algorithm and Emax
is defined in (3.33).
Instead of proving Theorem 3.1, we briefly outline a proof from Ko-
rattikara et al. [2014] of a similar theorem that exploits a slightly
stronger assumption on T . Specifically, assume T satisfies the strong
contraction condition,

‖T P − π‖TV ≤ η ‖P − π‖TV ,                                                  (3.37)

for all probability distributions P and some constant η ∈ [0, 1). For
approximate MH with an adaptive stopping rule, the maximum ac-
ceptance probability error Emax directly gives an upper bound on the
single-step error ‖T̃ P − T P ‖TV . Combining the single-step error bound
with the contraction condition shows that T̃ is also a strong contraction
and yields a bound on ‖π̃ − π‖TV .
Finally, we note that an adaptive subsampling scheme using a
concentration inequality enables an upper bound on the stopping
time [Bardenet et al., 2014].
3.3 Sub-selecting data via a lower bound on the likelihood

Maclaurin and Adams [2014] introduce Firefly Monte Carlo (FlyMC),
an auxiliary variable MCMC sampling procedure that operates on only
subsets of data in each iteration. At each iteration, the algorithm dy-
namically selects what data to evaluate based on the random indicators
included in the Markov chain state. In addition, it generates samples
from the exact target posterior rather than an approximation. How-
ever, FlyMC requires a lower bound on the likelihood with a particular
“collapsible” structure (essentially an exponential family lower bound)
and because such bounds are not readily available or tight for many
models it is not as generally applicable. The algorithm’s performance
depends on the tightness of the bound; it can achieve impressive gains
in performance when model structure allows.
FlyMC samples from an augmented posterior that eliminates po-
tentially many likelihood factors. Define

Ln (θ) = p(xn | θ)                                                           (3.38)

and let Bn (θ) be a strictly positive lower bound on Ln (θ),
i.e., 0 < Bn (θ) ≤ Ln (θ). For each datum, we introduce a binary aux-
iliary variable zn ∈ {0, 1} conditionally distributed according to a
Bernoulli distribution,

p(zn | xn , θ) = [(Ln (θ) − Bn (θ))/Ln (θ)]^{zn} [Bn (θ)/Ln (θ)]^{1−zn} ,     (3.39)

where the zn are independent for different n. When the bound is tight,
i.e., Bn (θ) = Ln (θ), then zn = 0 with probability 1. More generally, a
tighter bound results in a higher probability that zn = 0. Augmenting
the density with z = {zn }_{n=1}^{N} gives:

p(θ, z | x) ∝ p(θ | x)p(z | x, θ)
            = p(θ) ∏_{n=1}^{N} p(xn | θ)p(zn | xn , θ).                      (3.40)
Using Equations (3.38) and (3.39), we can now write:

p(θ, z | x) ∝ p(θ) ∏_{n=1}^{N} Ln (θ) [(Ln (θ) − Bn (θ))/Ln (θ)]^{zn} [Bn (θ)/Ln (θ)]^{1−zn}
            = p(θ) ∏_{n=1}^{N} (Ln (θ) − Bn (θ))^{zn} Bn (θ)^{1−zn}
            = p(θ) ∏_{n:zn =1} (Ln (θ) − Bn (θ)) ∏_{n:zn =0} Bn (θ).          (3.41)

Thus for any fixed configuration of z we can evaluate the joint density
using only the likelihood terms Ln (θ) where zn = 1 and the bound
values Bn (θ) for each n = 1, 2, . . . , N .
While Equation (3.41) still involves a product of N terms, if the
product of the bound terms ∏_{n:zn =0} Bn (θ) can be evaluated without
reading each corresponding data point then the joint density can be
evaluated reading only the data xn for which zn = 1. In particular,
if the form of Bn (θ) is an exponential family density, then the prod-
uct ∏_{n:zn =0} Bn (θ) can be evaluated using only a finite-dimensional suf-
ficient statistic for the data {xn : zn = 0}. Thus by exploiting lower
bounds in the exponential family, FlyMC can reduce the amount of
data required at each iteration of the algorithm while maintaining the
exact posterior as its stationary distribution. Maclaurin and Adams
[2014] show an application of this methodology to Bayesian logistic
regression.
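The following Python sketch illustrates the two FlyMC ingredients described above: resampling the indicators zn from Equation (3.39) and evaluating the augmented log joint of Equation (3.41) using only the "bright" data; log_L, log_B, log_B_total, and log_prior are illustrative placeholders supplied by the user.

import numpy as np

def resample_brightness(theta, x, log_L, log_B, rng):
    """Sample z_n ~ Bernoulli(1 - B_n(theta)/L_n(theta)), Equation (3.39).
    log_L and log_B return per-datum log likelihoods and log lower bounds."""
    p_bright = 1.0 - np.exp(log_B(x, theta) - log_L(x, theta))
    return rng.random(len(x)) < p_bright            # True means z_n = 1

def augmented_log_joint(theta, x, z, log_prior, log_L, log_B, log_B_total):
    """Log of Equation (3.41). Only the bright points x[z] need likelihood
    evaluations; log_B_total(theta) returns sum_n log B_n(theta), e.g. from
    cached exponential-family sufficient statistics."""
    bright = x[z]
    lB = log_B(bright, theta)
    lL = log_L(bright, theta)
    log_diff = lB + np.log(np.expm1(lL - lB))       # log(L_n - B_n), computed stably
    return log_prior(theta) + log_diff.sum() + log_B_total(theta) - lB.sum()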
FlyMC presents three main challenges. The first is constructing a
collapsible lower bound, such as an exponential family, that is suf-
ficiently tight. The second is designing an efficient implementation.
Maclaurin and Adams [2014] discuss these issues and, in particular,
design a cache-like data structure for managing the relationship be-
tween the N indicator values and the data. Finally, it is likely that the
inclusion of these auxiliary variables slows the mixing of the Markov
chain, but Maclaurin and Adams [2014] only provide empirical evidence
that this effect is small relative to the computational savings from using
data subsets.
Further analysis of FlyMC, and in particular an explicit connection
to pseudo-marginal techniques and an analysis of how bound tightness
can affect performance, can be found in Bardenet et al. [2015, Section
4.3].

3.4 Stochastic gradients of the log joint density

In this section, we review recent efforts to develop MCMC algorithms
inspired by stochastic optimization techniques. This is motivated by
the existence of, first, MCMC algorithms that can be thought of as the
sampling analogues of optimization algorithms, and second, scalable
stochastic versions of these optimization algorithms.
Traditional gradient ascent or descent performs optimization by
iteratively computing and following a local gradient [Dennis and Schn-
abel, 1983]. In Bayesian MAP inference, the objective function is typi-
cally a log joint density and the update rule for gradient ascent is given
by
θt+1 = θt + (εt /2) ∇ log p(θt , x)                                          (3.42)

for t = 1, . . . , ∞ and some step size sequence (εt ). As discussed in Sec-
tion 2.5, stochastic gradient descent (SGD) is a simple modification of
gradient descent that can exploit situations where the objective func-
tion decomposes into a sum of many terms. While the traditional gra-
dient descent update depends on all the data, i.e.,

θt+1 = θt + (εt /2) ( ∇ log p(θt ) + ∑_{n=1}^{N} ∇ log p(xn | θt ) ),         (3.43)

SGD forms an update based on only a data subset,

θt+1 = θt + (εt /2) ( ∇ log p(θt ) + (N/m) ∑_{n=1}^{m} ∇ log p(xn | θt ) ).   (3.44)

The iterates converge to a local extreme point of the log joint density in
the sense that lim_{t→∞} ∇ log p(θt | x) = 0 if the step size sequence {εt }_{t=1}^{∞}
satisfies

∑_{t=1}^{∞} εt = ∞   and   ∑_{t=1}^{∞} εt^2 < ∞.                              (3.45)

A common choice of step size sequence is εt = α(β + t)^{−γ} for
some β > 0 and γ ∈ (0.5, 1].
Welling and Teh [2011] propose stochastic gradient Langevin dy-
namics (SGLD), an approximate MCMC procedure that combines
SGD with a simple kind of Langevin dynamics (Langevin Monte
Carlo) [Neal, 1994]. They extend the Metropolis-adjusted Langevin al-
gorithm (MALA) that uses noisy gradient steps to generate proposals
for a Metropolis–Hastings chain [Roberts and Tweedie, 1996]. At iter-
ation t, the MH proposal is

θ0 = θt + (ε/2) ∇ log p(θt , x) + ηt ,                                        (3.46)

where the injected noise ηt ∼ N (0, ε) is Gaussian. Notice that the
scale of the noise is ε, i.e., it is constant and set by the gradient step
size parameter. The MALA proposal is thus a stochastic gradient step,
constructed by adding noise to a step in the direction of the gradient.
SGLD modifies the Langevin dynamics in Equation (3.46) by using
stochastic gradients based on data subsets, as in Equation (3.44), and
requiring that the step size parameter satisfies Equation (3.45). Thus,
at iteration t, the proposal is

θ0 = θt + (εt /2) ( ∇ log p(θt ) + (N/m) ∑_{n=1}^{m} ∇ log p(xn | θt ) ) + ηt ,   (3.47)

where ηt ∼ N (0, εt ). Notice that the injected noise decays with the
gradient step size parameter, but at a slower rate. Specifically, if εt
decays as t^{−γ} , then ηt decays as t^{−γ/2} . As in MALA, the SGLD proposal
is a stochastic gradient step, where the noise comes from subsampling
as well as the injected noise.
An actual Metropolis–Hastings algorithm would accept or reject the
proposal in Equation (3.47) by evaluating the full (log) joint density
at θ0 and θt , but this is precisely the computation we wish to avoid.
Welling and Teh [2011] observe that as εt → 0, θ0 → θt in both Equa-
tions (3.46) and (3.47). In this limit, the probability of accepting the
proposal converges to 1, but the chain stops completely. The authors
suggest that εt can be decayed to a value that is large enough for effi-
cient sampling, yet small enough for the acceptance probability to be
high. These assumptions lead to a scheme where εt > ε∞ > 0, for all t,
and all proposals are accepted, therefore the acceptance probability is
Algorithm 8 Stochastic gradient Langevin dynamics (SGLD).
Input: Initial state θ0 , number of iterations T , data x, grad log prior
    ∇ log p(θ), grad log likelihood ∇ log p(x | θ), batch size m, step size
    tuning parameters (e.g., α, β, γ)
Output: Samples θ1 , . . . , θT
J = N/m
for τ in 0, . . . , T /J − 1 do
    x ← Permute(x)                         ▷ For sampling without replacement
    for k in 0, . . . , J − 1 do
        t = τ J + k
        εt ← α(β + t)^{−γ}                 ▷ Example step size
        ηt ∼ N (0, εt )                    ▷ Draw noise to inject
        θ0 ← θt + (εt /2) ( ∇ log p(θt ) + (N/m) ∑_{n=km+1}^{km+m} ∇ log p(xn | θt ) ) + ηt
        θt+1 ← θ0                          ▷ Accept proposal with probability 1

never evaluated. We show this scheme in Algorithm 8. Without the
stochastic MH acceptance step, however, asymptotic samples are no
longer guaranteed to represent the target distribution.
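The following is a minimal Python sketch of SGLD as in Algorithm 8; the gradient functions, step-size constants, and other names are illustrative placeholders.

import numpy as np

def sgld(x, grad_log_prior, grad_log_lik, theta0, num_sweeps, m,
         alpha=1e-3, beta=1.0, gamma=0.55, seed=0):
    """Stochastic gradient Langevin dynamics. grad_log_lik(batch, theta)
    returns the sum of per-datum gradients over the minibatch."""
    rng = np.random.default_rng(seed)
    N = len(x)
    J = N // m
    theta = np.asarray(theta0, dtype=float)
    samples = []
    t = 0
    for _ in range(num_sweeps):
        x = x[rng.permutation(N)]           # permute for sampling without replacement
        for k in range(J):
            eps = alpha * (beta + t) ** (-gamma)
            batch = x[k * m:(k + 1) * m]
            grad = grad_log_prior(theta) + (N / m) * grad_log_lik(batch, theta)
            noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
            theta = theta + 0.5 * eps * grad + noise   # every proposal is accepted
            samples.append(theta)
            t += 1
    return np.array(samples)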
The SGLD algorithm and its analysis have been extended in a num-
ber of directions. Stochastic gradient Fisher scoring modifies the in-
jected noise to interpolate between fast, SGLD-like mixing when the
posterior is close to Gaussian, and slower mixing capable of capturing
non-Gaussian features [Ahn et al., 2012]. In more recent work, Patter-
son and Teh [2013] apply SGLD to Riemann manifold Langevin dynam-
ics [Girolami and Calderhead, 2011]. There has also been some recent
analysis of the asymptotic behavior of SGLD [Teh et al., 2016].
Several recent works [Chen et al., 2014, Shang et al., 2015, Ma
et al., 2015] also combine the idea of SGD with Hamiltonian Monte
Carlo (HMC) [Neal, 1994, 2010]. HMC improves and generalizes
Langevin dynamics with coherent long-range exploration, though there
is disagreement as to whether these stochastic gradient versions sacri-
fice the main advantages of HMC [Betancourt, 2015]. These methods
are also related to recent approaches that combine the idea of ther-
Table 3.1: Summary of recent MCMC methods for Bayesian inference that
operate on data subsets. Error refers to the total variation distance between
the stationary distribution of the Markov chain and the target posterior dis-
tribution.

                   Adaptive subsampling         FlyMC                SGLD
Approach           Approximate MH test          Auxiliary variables  Optimization plus noise
Requirements       Error model, e.g., t-test    Likelihood bound     Gradients ∇ log p(θ, x)
Data access        Mini-batches                 Random               Mini-batches
Hyperparameters    Batch size, error            None                 Batch size,
                   tolerance per iteration                           annealing schedule
Asymptotic bias    Bounded TV                   None                 Biased for all ε > 0

mostats with stochastic gradients [Leimkuhler and Shang, 2016, Ding
et al., 2014].
Finally, we note that all the methods in this section require gradi-
ent information, and in particular require the sampled variables to be
continuous.

3.5 Summary

In this chapter, we have surveyed three recent approaches to scal-
ing MCMC that operate on subsets of data. Below and in Table 3.1,
we summarize and compare adaptive subsampling approaches (§3.2),
FlyMC (§3.3), and SGLD (§3.4) along several axes.

Approaches. Adaptive subsampling approaches replace the
Metropolis–Hastings (MH) test, a function of all the data, with
an approximate test that depends on only a subset. FlyMC is an
auxiliary variable method that stochastically replaces likelihood
computations with a collapsible lower bound. Stochastic gradient
Langevin dynamics (SGLD) replaces gradients in a Metropolis-
adjusted Langevin algorithm (MALA) with stochastic gradients based
on data subsets and eliminates the Metropolis–Hastings test.
Generality, requirements, and assumptions. Each of the methods ex-
ploits assumptions or additional problem structure. Adaptive subsam-
pling methods require an error model that accurately represents the
probability that an approximate MH test will disagree with the exact
MH test. A normal model [Korattikara et al., 2014] or concentration
bounds [Bardenet et al., 2014] represent natural choices; under certain
conditions, tighter concentration bounds may apply. FlyMC requires a
strictly positive collapsible lower bound on the likelihood, essentially an
exponential family lower bound, which may not in general be available.
SGLD requires the log gradients of the prior and likelihood.

Data access patterns. While all the methods use subsets of data,
their access patterns differ. Adaptive subsampling and SGLD require
randomization to avoid issues of bias due to data order, but this ran-
domization can be achieved by permuting the data before each pass
and hence these algorithms allow data access that is mostly sequential.
In contrast, FlyMC operates on random subsets of data determined by
the Markov chain itself, leading to a random access pattern. However,
subsets from one iteration to the next tend to be correlated, and moti-
vate implementation details such as the proposed cache data structure.

Hyperparameters. FlyMC does not introduce additional hyperpa-
rameters that require tuning. Both adaptive subsampling methods and
SGLD introduce hyperparameters that can significantly affect perfor-
mance. Both are mini-batch methods, and thus have the batch size as a
tuning parameter. In adaptive subsampling methods, the stopping cri-
terion is evaluated potentially more than once before it is satisfied. This
motivates schemes that geometrically increase the amount of data pro-
cessed whenever the stopping criterion is not satisfied, which introduces
additional hyperparameters. Adaptive subsampling methods addition-
ally provide a single tuning parameter that allows the user to control
the error at each iteration. Finally, since these adaptive methods de-
fine an approximate MH test, they implicitly also require that the user
specify a proposal distribution. For SGLD, the user must specify an
annealing schedule for the step size parameter; in particular, it should
converge to a small positive value so that the injected noise term dom-
inates, while not being too large compared to the scale of the posterior
distribution.

Error. FlyMC is exact in the sense that the target posterior distribu-
tion is a marginal of its augmented state space. The adaptive subsam-
pling approaches and SGLD (for any positive step size) are approximate
methods in that neither has a stationary distribution equal to the tar-
get posterior. The adaptive subsampling approaches bound the error
of the MH test at each iteration, and for MH transition kernels with
uniform ergodicity this one-step error bound leads to an upper bound
on the total variation distance between the approximate stationary dis-
tribution and the target posterior distribution. The theoretical analysis
of SGLD is less clear [Sato and Nakagawa, 2014].

3.6 Discussion

Data subsets. The methods surveyed in this chapter achieve com-
putational gains by using data subsets in place of an entire dataset of
interest. The adaptive subsampling algorithms (§3.2) are more success-
ful when a small subsample leads to an accurate estimator for the exact
MH test’s accept/reject decision. Intuitively, such an estimator is easier
to construct when the log posterior values at the proposed and current
states are significantly different. This tends to be true far away from
the mode(s) of the posterior, e.g., in the tails of a distribution that
decay exponentially fast, compared to the area around a mode, which
is locally more flat. Thus, these algorithms tend to evaluate more data
when the chain is in the vicinity of a mode, and less data when the
chain is far away (which tends to be the case for an arbitrary initial
condition). SGLD (§3.4) exhibits somewhat related behavior. Recall
that SGLD behaves more like SGD when the update rule is dominated
by the gradient term, which tends to be true during the initial exe-
cution phase. Similar to SGD, the chain progresses toward a mode at
a rate that depends on the accuracy of the stochastic gradients. For
a log posterior target, stochastic gradients tend to be more accurate
estimators of true gradients far away from the mode(s). In contrast,
the MAP-tuned version of FlyMC (§3.3) requires the fewest data eval-
uations when the chain is close to the MAP, since by design, the lower
likelihood bounds are tightest there. Meanwhile, the untuned version
of FlyMC tends to exhibit the opposite behavior.

Adaptive proposal distributions. The Metropolis-Hastings algorithm
requires the user to specify a proposal distribution. Fixing the proposal
distribution can be problematic, because the behavior of MH is sen-
sitive to the proposal distribution and can furthermore change as the
chain converges. A common solution, employed e.g., by Bardenet et al.
[2014], is to use an adaptive MH scheme [Haario et al., 2001, Andrieu
and Moulines, 2006]. These algorithms tune the proposal distribution
during execution, using information from the samples as they are gen-
erated, in a way that provably converges asymptotically. Often, it is
desirable for the proposal distribution to be close to the target. This
motivates adaptive schemes that fit a distribution to the observed sam-
ples and use this fitted model as the proposal distribution. For example,
a simple online procedure can update the mean µ and covariance Σ of
a multidimensional Gaussian model,
µt+1 = µt + γt+1 (θt+1 − µt ),   t ≥ 0,
Σt+1 = Σt + γt+1 ((θt+1 − µt )(θt+1 − µt )⊤ − Σt ),
where t indexes the MH iterations and γt+1 controls the speed with
which the adaptation vanishes. An appropriate choice is γt = t−α
for α ∈ (1/2, 1]. The tutorial by Andrieu and Thoms [2008] provides a
review of this and other, more sophisticated, adaptive MH algorithms.
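A minimal Python sketch of this online adaptation step is given below; the function name and the choice α = 0.6 are illustrative.

import numpy as np

def adapt_proposal(mu, Sigma, theta_new, t, alpha=0.6):
    """One step of the vanishing-adaptation update for a Gaussian proposal:
    gamma_{t+1} = (t + 1)^{-alpha} with alpha in (1/2, 1]."""
    gamma = (t + 1) ** (-alpha)
    diff = theta_new - mu
    mu = mu + gamma * diff
    Sigma = Sigma + gamma * (np.outer(diff, diff) - Sigma)
    return mu, Sigma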
Combining methods. The subsampling-based methods in this chap-
ter are conceptually modular, and some may be combined. For example,
it might be of interest to consider a ‘tunable’ version of FlyMC that
achieves even greater computational efficiency at the cost of its orig-
inal exactness. In particular, we might use an adaptive subsampling
scheme (§3.2) to evaluate only a subset of terms in Equation (3.41);
this subset would need to represent terms corresponding to both possi-
ble values of zn . As another example, Korattikara et al. [2014] suggest
using adaptive subsampling as a way to ‘fix up’ SGLD. Recall that
the original SGLD algorithm completely eliminates the MH test and
blindly accepts all proposals, in order to avoid evaluating the full pos-
terior. A reasonable compromise is to instead evaluate a fraction of the
data within the adaptive subsampling framework, since this bounds the
per-iteration error.

Unbiased likelihood estimators. A basic difficulty with building
MCMC algorithms that operate on data mini-batches is that while
it is straightforward to construct unbiased Monte Carlo estimators
of the log likelihood, as in Eq. (3.4), it is not clear how to
construct such estimators for the likelihood itself, which is the quantity
of direct interest in many standard sampler constructions like MH. To
be explicit, the natural Monte Carlo estimator of the likelihood given
by

exp{ (N/m) ∑_{n=1}^{m} log p(x∗n | θ) }                                       (3.48)
is not unbiased. While it is possible to transform Equation (3.48) into
an unbiased likelihood estimate, e.g., using a Poisson estimator [Wag-
ner, 1987, Papaspiliopoulos, 2009, Fearnhead et al., 2010], the resulting
estimators are not always non-negative, which is a requirement to incor-
porate the estimator into a Metropolis-Hastings algorithm. In general,
we cannot derive unweighted Monte Carlo estimators that are both
unbiased and nonnegative [Jacob and Thiery, 2015, Lyne et al., 2015],
though it is possible if, for example, we have unbiased estimators of the
log likelihood with support in a fixed interval [Jacob and Thiery, 2015,
Theorem 3.1] or if we have auxiliary variables as in FlyMC [Bardenet
et al., 2015, Section 4.3]. Pseudo-marginal MCMC algorithms [Andrieu
and Roberts, 2009],2 first introduced by Lin et al. [2000], use such unbi-
ased nonnegative likelihood estimators, often based on importance sam-
pling [Beaumont, 2003] or particle filters [Andrieu et al., 2010, Doucet
et al., 2015], to construct exact MCMC algorithms.

2
  Pseudo-marginal MCMC is also known as exact-approximate sampling.

4  Parallel and distributed MCMC

MCMC procedures that take advantage of parallel computing resources
form another broad approach to scaling Bayesian inference. Because the
computational requirements of inference often scale with the amount of
data involved, and because large datasets may not even fit on a single
machine, these approaches often leverage data parallelism. Some also
leverage model parallelism, distributing latent variables in the model
across many machines. In this chapter we consider several approaches to
scaling MCMC by exploiting parallel computation, either by adapting
classical MCMC algorithms or by defining new simulation dynamics
that are inherently parallel.
One way to use parallel computing resources is to run multiple se-
quential MCMC algorithms at once. However, running identical chains
in parallel does not reduce the transient bias in MCMC estimates of
posterior expectations, though it would reduce their variance. Instead
of using parallel computation only to collect more MCMC samples and
thus reduce only estimator variance without improving transient bias,
it is often preferable to use computational resources to speed up the
simulation of the chain itself. Section 4.1 surveys several methods that
use parallel computation to speed up the execution of MCMC proce-
dures, including both basic methods and more recent ideas.
Alternatively, instead of adapting serial MCMC procedures to ex-
ploit parallel resources, another approach is to design new approximate
algorithms that are inherently parallel. Section 4.2 summarizes some
recent ideas for simulations that can be executed in a parallel manner
and have their results aggregated or corrected to represent posterior
samples. Some additional parallel inference strategies that can involve
MCMC inference within an expectation propagation (EP) framework
are also described in Section 5.3.1.

4.1 Parallelizing standard MCMC algorithms

An advantage to parallelizing standard MCMC algorithms is that they
retain their theoretical guarantees and analyses. Indeed, a common goal
is to produce identical samples under serial and parallel execution,
so that parallel resources enable speedups without introducing new
approximations. This section first summarizes some basic opportunities
for parallelism in MCMC and then surveys the speculative execution
framework for MH.

4.1.1 Conditional independence and graph structure

The MH algorithm has a straightforward opportunity for parallelism.
In particular, if the target posterior can be written as

π(θ | x) ∝ π0 (θ)π(x | θ) = π0 (θ) ∏_{n=1}^{N} π(xn | θ),                     (4.1)

then when the number of likelihood terms N is large it may be ben-
eficial to parallelize the evaluation of the product of likelihoods. The
communication between processors is limited to transmitting the value
of the parameter and the scalar values of likelihood products. This ba-
sic parallelization, which naturally fits in a bulk synchronous parallel
(BSP) computational model, exploits conditional independence in the
probabilistic model, namely that the data are independent given the
parameter.
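A minimal Python sketch of this data-parallel likelihood evaluation is shown below, using a process pool to compute per-shard log likelihood sums; the function names are illustrative, and log_lik is assumed to be a module-level (picklable) function.

import numpy as np
from multiprocessing import Pool

def shard_log_lik(args):
    """Log likelihood contribution of a single data shard."""
    shard, theta, log_lik = args
    return log_lik(shard, theta).sum()

def parallel_log_joint(theta, shards, log_prior, log_lik, pool):
    """Evaluate log pi0(theta) plus the sum of per-shard log likelihoods,
    matching the factorization in Equation (4.1)."""
    partial_sums = pool.map(shard_log_lik, [(s, theta, log_lik) for s in shards])
    return log_prior(theta) + sum(partial_sums)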
[Figure 4.1 graphic: (a) a directed plate diagram for a mixture model with
parameters θk (plate of size K), labels zi , and data yi (plate of size N);
(b) an undirected grid with a red-black checkerboard coloring.]
Figure 4.1: Graphical models and graph colorings can expose opportunities
for parallelism in Gibbs samplers. (a) In this directed graphical model for a
discrete mixture, each label (red) can be sampled in parallel conditioned on
the parameters (blue) and data (gray) and similarly each parameter can be
resampled in parallel conditioned on the labels. (b) This undirected grid has
a classical “red-black” coloring, emphasizing that the variables corresponding
to red nodes can be resampled in parallel given the values of black nodes and
vice-versa.

Gibbs sampling algorithms can exploit more fine-grained condi-
tional independence structure, and are thus a natural fit for graphical
models which express such structure. Given a graphical model and a
corresponding graph coloring with K colors that partitions the set of
random variables into K groups, the random variables in each color
group can be resampled in parallel while conditioning on the values in
the other K − 1 groups [Gonzalez et al., 2011]. When models have a hi-
erarchical exponential family structure, conditioning on a few variables
can lead to many disconnected subgraphs, corresponding to inference
problems to be solved separately and to have their sufficient statis-
tics aggregated; this structure naturally lends itself to Map-Reduce
parallelization [Pradier et al., 2014]. Thus graphical models provide a
natural perspective on opportunities for parallelism. See Figure 4.1 for
some examples.
These opportunities for parallelism, while powerful in some cases,
are limited by the fact that they require frequent global synchroniza-
tion and communication. Indeed, at each iteration it is often the case
that every element of the dataset is read by some processor and many
processors must mutually communicate. The methods we survey in the
remainder of this chapter aim to mitigate these limitations by adjusting
the allocation of parallel resources or by reducing communication.

4.1.2 Speculative execution and prefetching

Another class of parallel MCMC algorithms uses speculative parallel
execution to accelerate individual chains. This idea is called prefetching
in some of the literature and appears to have received only limited
attention.
As shown in Algorithm 1, the body of a serial MH implementation
is a loop containing a single conditional statement and two associated
branches. We can thus view the possible execution paths as a binary
tree, illustrated in Figure 4.2. The vanilla version of parallel prefetching
speculatively evaluates all paths in this binary tree on parallel proces-
sors [Brockwell, 2006]. A serial simulation corresponds to exactly one
of these paths, so with J processors this approach achieves a speedup
of log2 J with respect to single core execution, ignoring communication
and bookkeeping overheads.
Naïve prefetching can be improved by observing that the two
branches in Algorithm 1 are not taken with equal probability. On av-
erage, the left-most branch, corresponding to a sequence of rejected
proposals, tends to be more probable; the classic result for the optimal
MH acceptance rate is 0.234 [Roberts et al., 1997], so most prefetching
scheduling policies have been built around the expectation of rejection.
Let α ≤ 0.5 be the expected probability of accepting a proposal. Byrd
et al. [2008] introduced a procedure, called speculative moves, that spec-
ulatively evaluates only along the “reject” branch of the binary tree; in
Figure 4.2, this corresponds to the left-most branch. In each round of
their algorithm, only the first k out of J − 1 extra cores perform use-
ful work, where k is the number of rejected proposals before the first
accepted proposal, starting from the root of the tree. The expected
[Figure 4.2 graphic: a binary tree with root θt , children θ0^{t+1} and θ1^{t+1} ,
grandchildren θ00^{t+2} , . . . , θ11^{t+2} , and leaves θ000^{t+3} , . . . , θ111^{t+3} .]

Figure 4.2: Schematic of a MH simulation superimposed on the binary tree
of all possible chains. Nodes at depth d correspond to iteration t + d, where
the root is at depth 0. Branching to the right indicates that the proposal is
accepted, and branching to the left indicates that the proposal is rejected.
Thick blue arrows highlight one simulated chain starting at the root θt ; the
first proposal is accepted and the next two are rejected, yielding as output:
θ1^{t+1} , θ10^{t+2} , θ100^{t+3} . Each subscript is a sequence, of length d, of 0’s and 1’s, cor-
responding to the history of rejected and accepted proposals with respect to
the root. Filled circles indicate states where the target density is evaluated
during simulation. Those not on the chain’s path correspond to rejected pro-
posals. Their siblings are open circles on this path; since each is a copy of its
parent, the target density does not need to be reevaluated to compute the
next transition.

speedup is then:

1 + E(k) < 1 + ∑_{k=0}^{∞} k(1 − α)^k α = 1 + (1 − α)/α = 1/α.
Note that the first term on the left is due to the core at the root of
the tree, which always performs useful computation in the prefetch-
ing scheme. When α = 0.23, this scheme yields a maximum expected
speedup of about 4.3; it achieves an expected speedup of about 4
with 16 cores. If only a few cores are available, this may be a reasonable
policy, but if many cores are available, their work is essentially wasted.
In contrast, the naïve prefetching policy achieves speedup that grows
as the log of the number of cores. Byrd et al. [2010] later considered the
special case where the evaluation of the likelihood function occurs on
two timescales, slow and fast. They call this method speculative chains;
it modifies the speculative moves approach so that whenever the eval-
uation of the likelihood function is slow, any available cores are used
to speculatively evaluate the subsequent chain, assuming the slow step
resulted in an accept.
Strid [2010] extends the naïve prefetching scheme to allocate cores
according to the optimal “tree shape” with respect to various assump-
tions about the probability of rejecting a proposal, i.e., by greedily
allocating cores to nodes that maximize the depth of speculative com-
putation expected to be correct. The author presents both static and
dynamic prefetching schemes. The static scheme assumes a fixed ac-
ceptance rate; versions of this were proposed earlier in the context of
simulated annealing [Witte et al., 1991]. The dynamic scheme estimates
acceptance probabilities, for example, at each level of the tree by draw-
ing empirical MH samples, or at each branch in the tree by comput-
ing min{β, r} where β is a constant (e.g., β = 1) and r is an estimate
of the MH ratio based on a fast approximation to the target function.
Strid also proposes using the approximate target function to identify
the single most likely path on which to perform speculative compu-
tation, and combines prefetching with other sources of parallelism to
obtain a multiplicative effect.
In closely related work, parallel predictive prefetching makes effi-
cient use of parallel resources by dynamically predicting the outcome
of each MH test [Angelino et al., 2014]. In the case of Bayesian infer-
ence, these predictions can be constructed in the same manner as the
approximate MH algorithms based on subsets of data, as discussed in
Section 3.2.2. Furthermore, these predictions can be made in the con-
text of an error model, e.g., with the concentration inequalities used
by Bardenet et al. [2014]. This yields a straightforward and rational
mechanism for allocating parallel cores to computations most likely to
fall along the true execution path. Algorithms 9 and 10 sketch pseu-
docode for an implementation of parallel predictive prefetching that
follows a master-worker pattern. See the dissertation by Angelino [2014]
for a formal description of the algorithm and implementation details.
Algorithm 9 Parallel predictive prefetching master process
repeat
Receive message from worker j
if worker j wants work then
Find highest utility node ρ in tree with work left to do
Send worker j the computational state of ρ
else if message contains state θρ at proposal node ρ then
Record state θρ at ρ
else if message contains update at ρ then
Update estimate of π(θρ | x) at ρ
for node α in {ρ and its descendants} do
Update utility of α
if utility of α below threshold and worker k at α then
Send worker k message to stop current computation
if posterior computation at ρ and its parent complete then
Know definitively whether to accept or reject θρ
Delete subtree corresponding to branch not taken
if node ρ is the root’s child then
repeat
Trim old root so that new root points to child
Output state at root, the next state in the chain
until posterior computation at root’s child incomplete
until master has output T Metropolis–Hastings chain states
Terminate all worker processes

4.2 Defining new parallel dynamics
In this section we survey two ideas for performing inference using new
parallel dynamics. These algorithms define new dynamics in the sense
that their iterates do not form ergodic Markov chains which admit the
posterior distribution as an invariant distribution, and thus they do not
qualify as classical MCMC schemes. Instead, while some of the updates
in these algorithms resemble standard MCMC updates, the overall dy-
namics are designed to exploit parallel and distributed computation.
A unifying theme of these new methods is to perform local computa-
Algorithm 10 Parallel predictive prefetching worker process
repeat
Send master request for work
Receive work assignment at node ρ from master
if the corresponding state θρ has not yet been generated then
Generate proposal θρ
repeat
Advance the computation of π(θρ | x)
Send update at ρ to master
if receive message to stop current computation then
break
until computation of π(θρ | x) is complete
until terminated by master

tion on data while controlling the amount of global synchronization or
communication.
One such family of ideas involves the definition of subposteriors,
defined using only subsets of the full dataset. Inference in the subpos-
teriors can be performed in parallel, and the results are then globally
aggregated into an approximate representation of the full posterior.
Because the synchronization and communication costs—as well as the
approximation quality—are determined by the aggregation step, sev-
eral such aggregation procedures have been proposed. In Section 4.2.1
we summarize some of these proposals.
Another class of parallel dynamics does not define independent sub-
posteriors but instead, motivated by Gibbs sampling, focuses on sim-
ulating from local conditional distributions with out-of-date informa-
tion. In standard Gibbs sampling, updates can be parallelized in mod-
els with conditional independence structure (Section 4.1), but without
such structure the Gibbs updates may depend on the full dataset and
all latent variables, and thus must be performed sequentially. These se-
quential updates can be especially expensive with large or distributed
datasets. A natural approximation to consider is to run the same local
Gibbs updates in parallel with out-of-date global information and only
infrequent communication. While such a procedure loses the theoret-
ical guarantees provided by standard Gibbs sampling analysis, some
empirical and theoretical results are promising. We refer to this broad
class of methods as Hogwild Gibbs algorithms, and we survey some
particular algorithms and analyses in Section 4.2.2.

4.2.1 Aggregating from subposteriors

Suppose we want to divide the evaluation of the posterior across J
parallel cores. We can divide the data into J partition ele-
ments, x(1) , . . . , x(J) , also called shards, and factor the posterior into J
corresponding subposteriors, as

π(θ | x) = ∏_{j=1}^{J} π(j) (θ | x(j) ),                                       (4.2)

where

π(j) (θ | x(j) ) = π0 (θ)^{1/J} ∏_{x∈x(j)} π(x | θ),   j = 1, . . . , J.        (4.3)

The contribution from the original prior is down-weighted so that
the posterior is equal to the product of the J subposteriors,
i.e., π(θ | x) = ∏_{j=1}^{J} π(j) (θ | x(j) ). Note that a subposterior is not the
same as the posterior formed from the corresponding partition, i.e.,

π(j) (θ | x(j) ) ≠ π(θ | x(j) ) = π0 (θ) ∏_{x∈x(j)} π(x | θ).                   (4.4)
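As a small illustration, the following Python sketch constructs the subposterior log density of Equation (4.3) for one shard; log_prior and log_lik are illustrative placeholders.

import numpy as np

def make_subposterior_logpdf(shard, log_prior, log_lik, J):
    """Log of the subposterior (4.3): the prior is raised to the power 1/J so
    that the J subposteriors multiply back to the full posterior."""
    def log_subposterior(theta):
        return log_prior(theta) / J + log_lik(shard, theta).sum()
    return log_subposterior

# Example usage: split the data into J shards and build one subposterior each.
# shards = np.array_split(x, J)
# subposts = [make_subposterior_logpdf(s, log_prior, log_lik, J) for s in shards]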
Embarrassingly parallel consensus of subposteriors

Once a large dataset has been partitioned across multiple machines, a
Once a large dataset has been partitioned across multiple machines, a
natural alternative is to try running MCMC inference on each parti-
tion element separately and in parallel. This yields samples from each
subposterior in Equation 4.3, but there is no obvious choice for how to
combine them in a coherent fashion to form approximate samples of the
full posterior. In this section, we survey various proposals for forming
such a consensus solution from the subposterior samples. Algorithm 11
outlines the structure of consensus strategies for embarrassingly paral-
lel posterior sampling.

Algorithm 11 Embarrassingly parallel consensus of subposteriors
Input: Initial state θ0, number of samples T, data partitions
    x^(1), . . . , x^(J), subposteriors π^(1)(θ | x^(1)), . . . , π^(J)(θ | x^(J))
Output: Approximate samples θ̂1, . . . , θ̂T
for j = 1, 2, . . . , J in parallel do
    Initialize θj,0
    for t = 1, 2, . . . , T do
        Simulate MCMC sample θj,t from subposterior π^(j)(θ | x^(j))
    Collect θj,1, . . . , θj,T
θ̂1, . . . , θ̂T ← ConsensusSamples({θj,1, . . . , θj,T}_{j=1}^{J})

This terminology, used by Huang and Gelman [2005] and Scott
et al. [2016], invokes related notions of consensus, notably those
that have existed for decades in the optimization litera-
ture on parallel algorithms in decentralized or distributed settings. We
discuss this topic briefly in Section 4.2.1.
Below, we present two recent consensus strategies for combining
subposterior samples, through weighted averaging and density estima-
tion, respectively. The earlier report by Huang and Gelman [2005] pro-
poses four consensus strategies, based either on normal approximations
or importance resampling; the authors focus on Gibbs sampling for hi-
erarchical models and do not evaluate any actual parallel implementa-
tions. Another consensus strategy is the recently proposed variational
consensus Monte Carlo (VCMC) algorithm, which casts the consensus
problem within a variational Bayes framework [Rabinovich et al., 2015].
Throughout this section, Gaussian densities provide a useful refer-
ence point and motivate some of the consensus strategies. Consider the
jointly Gaussian model

    θ ∼ N(0, Σ0)                                                              (4.5)
    x^(j) | θ ∼ N(θ, Σj).                                                     (4.6)

The joint density is:

    p(θ, x) = p(θ) ∏_{j=1}^{J} p(x^(j) | θ)

            ∝ exp{ −(1/2) θᵀ Σ0⁻¹ θ } ∏_{j=1}^{J} exp{ −(1/2) (x^(j) − θ)ᵀ Σj⁻¹ (x^(j) − θ) }

            ∝ exp{ −(1/2) θᵀ (Σ0⁻¹ + ∑_{j=1}^{J} Σj⁻¹) θ + (∑_{j=1}^{J} Σj⁻¹ x^(j))ᵀ θ }.

Thus the posterior is Gaussian:

    θ | x ∼ N(µ, Σ),                                                          (4.7)

where

    Σ = (Σ0⁻¹ + ∑_{j=1}^{J} Σj⁻¹)⁻¹                                           (4.8)

    µ = Σ (∑_{j=1}^{J} Σj⁻¹ x^(j)).                                           (4.9)

To arrive at an expression for the subposteriors, we begin by factoring
the joint distribution into an appropriate product:

    p(θ, x) ∝ ∏_{j=1}^{J} fj(θ),                                              (4.10)

where

    fj(θ) = p(θ)^{1/J} p(x^(j) | θ)

          = exp{ −(1/2) θᵀ (Σ0⁻¹/J) θ } exp{ −(1/2) (x^(j) − θ)ᵀ Σj⁻¹ (x^(j) − θ) }

          ∝ exp{ −(1/2) θᵀ (Σ0⁻¹/J + Σj⁻¹) θ + (Σj⁻¹ x^(j))ᵀ θ }.

Thus the subposteriors are also Gaussian:

    θj ∼ N(µ̃j, Σ̃j) ∝ fj(θ)                                                   (4.11)

where

    Σ̃j = (Σ0⁻¹/J + Σj⁻¹)⁻¹                                                    (4.12)

    µ̃j = (Σ0⁻¹/J + Σj⁻¹)⁻¹ (Σj⁻¹ x^(j)).                                      (4.13)
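
As a quick numerical check of Equations 4.8 through 4.13, the following
Python/NumPy sketch (added here as an illustration, with arbitrary example
covariances and observations) verifies that multiplying the Gaussian
subposteriors recovers the full posterior precision and mean.

    import numpy as np

    np.random.seed(0)
    d, J = 2, 4
    Sigma0 = 4.0 * np.eye(d)                                # example prior covariance
    Sigmas = [(1.0 + j) * np.eye(d) for j in range(J)]      # example per-shard covariances
    xs = [np.random.randn(d) for _ in range(J)]             # one observation per shard

    # Full posterior, Equations 4.8 and 4.9
    post_prec = np.linalg.inv(Sigma0) + sum(np.linalg.inv(S) for S in Sigmas)
    Sigma = np.linalg.inv(post_prec)
    mu = Sigma @ sum(np.linalg.inv(S) @ x for S, x in zip(Sigmas, xs))

    # Subposteriors, Equations 4.12 and 4.13, and their Gaussian product
    sub_precs = [np.linalg.inv(Sigma0) / J + np.linalg.inv(S) for S in Sigmas]
    sub_means = [np.linalg.solve(P, np.linalg.inv(S) @ x)
                 for P, S, x in zip(sub_precs, Sigmas, xs)]
    prod_prec = sum(sub_precs)
    prod_mean = np.linalg.solve(prod_prec, sum(P @ m for P, m in zip(sub_precs, sub_means)))

    assert np.allclose(prod_prec, post_prec) and np.allclose(prod_mean, mu)
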

Weighted averaging of subposterior samples

One approach is to combine the subposterior samples via weighted av-


eraging [Scott et al., 2016]. For simplicity, we assume that we obtain T
samples in Rd from each subposterior, and let {θj,t }Tt=1 denote the sam-
ples from the jth subposterior. The goal is to construct T consensus
posterior samples {θ̂t }Tt=1 , that (approximately) represent the full pos-
terior, from the JT subposterior samples, where each θ̂t combines sub-
posterior samples {θj,t }Jj=1 . We associate with each subposterior j a
matrix Wj ∈ Rd×d and assume that each consensus posterior sample is
a weighted¹ average:

    θ̂t = ∑_{j=1}^{J} Wj θj,t .                                                (4.14)

The challenge now is to design an appropriate set of weights.


Following Scott et al. [2016], we consider the special case of Gaus-
sian subposteriors, as in Equation 4.11. Our presentation is slightly
different, as we also account for the effect of having a prior. We also
drop the subscript t from our notation for simplicity. Let {θj}_{j=1}^{J} be
a set of draws from the J subposteriors. Each θj is an independent
Gaussian and thus θ̂ = ∑_{j=1}^{J} Wj θj is Gaussian. From Equation 4.13,
its mean is

    E[θ̂] = ∑_{j=1}^{J} Wj E[θj] = ∑_{j=1}^{J} Wj µ̃j

         = ∑_{j=1}^{J} Wj (Σ0⁻¹/J + Σj⁻¹)⁻¹ (Σj⁻¹ x^(j)).                     (4.15)

¹ Our notation differs slightly from that of Scott et al. [2016] in that our
weights Wj are normalized.
Thus, if we choose

    Wj = Σ (Σ0⁻¹/J + Σj⁻¹) = (Σ0⁻¹ + ∑_{j=1}^{J} Σj⁻¹)⁻¹ (Σ0⁻¹/J + Σj⁻¹)      (4.16)

where Σ is the posterior covariance in Equation 4.8, then

    E[θ̂] = Σ (∑_{j=1}^{J} Σj⁻¹ x^(j)) = µ,                                    (4.17)

where µ is the posterior mean in Equation 4.9. A similar calculation
shows that Cov(θ̂) = Σ.
Thus for the Gaussian model, θ̂ is distributed according to the pos-
terior distribution, indicating that Equation 4.16 gives the appropriate
weights. Each weight matrix Wj is a function of Σ0 , the prior covari-
ance, and the subposterior covariances {Σj }Jj=1 . We can form a Monte
Carlo estimate of each Σj using the empirical sample covariance Σ̄j .
Algorithm 12 summarizes this consensus approach with weighted av-
eraging. While this weighting is optimal in the Gaussian setting, Scott
et al. [2016] show it to be effective in some non-Gaussian models. Scott
et al. [2016] also suggest weighting each dimension of a sample θ by the
reciprocal of its marginal posterior variance, effectively restricting the
weight matrices Wj to be diagonal.
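
The following Python/NumPy sketch illustrates this weighted-averaging
consensus with the Gaussian weights of Equation 4.16, using sample covariances
in place of the true subposterior covariances; the placeholder subposterior
samples and prior covariance are assumptions made only for the example.

    import numpy as np

    def consensus_weighted_average(subposterior_samples, Sigma0):
        # subposterior_samples: list of J arrays, each of shape (T, d).
        # Returns T consensus samples via the weighted average of Equation 4.14
        # with the weights of Equation 4.16, estimating each subposterior
        # covariance by its sample covariance.
        J = len(subposterior_samples)
        prior_prec = np.linalg.inv(Sigma0)
        sub_precs = [np.linalg.inv(np.cov(s, rowvar=False)) for s in subposterior_samples]
        Sigma = np.linalg.inv(prior_prec + sum(sub_precs))          # estimate of Equation 4.8
        weights = [Sigma @ (prior_prec / J + P) for P in sub_precs]  # Equation 4.16
        T = subposterior_samples[0].shape[0]
        return np.stack([sum(W @ s[t] for W, s in zip(weights, subposterior_samples))
                         for t in range(T)])

    # Usage with placeholder subposterior samples:
    J, T, d = 4, 500, 2
    samples = [np.random.multivariate_normal(np.zeros(d), np.eye(d), size=T) for _ in range(J)]
    consensus = consensus_weighted_average(samples, Sigma0=10.0 * np.eye(d))
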

Subposterior density estimation

Another consensus strategy relies on density estimation [Neiswanger


et al., 2014]. A related approach, developed in Minsker et al. [2014],
forms a posterior distribution estimate out of weighted samples by
computing a median of subset posteriors, though we do not discuss
it here. First, use the subposterior samples to separately fit a density
estimator, π̃ (j) (θ | x(j) ), to each subposterior. The product of these den-
sity estimators then represents a density estimator for the full posterior
target, i.e.,

    π(θ | x) ≈ π̃(θ | x) = ∏_{j=1}^{J} π̃^(j)(θ | x^(j)).                      (4.18)
Algorithm 12 Consensus of subposteriors via weighted averaging.
Parameters: Prior covariance Σ0
function ConsensusSamples({θj,1, . . . , θj,T}_{j=1}^{J})
    for j = 1, 2, . . . , J do
        Σ̄j ← Sample covariance of {θj,1, . . . , θj,T}
    Σ ← (Σ0⁻¹ + ∑_{j=1}^{J} Σ̄j⁻¹)⁻¹
    for j = 1, 2, . . . , J do
        Wj ← Σ (Σ0⁻¹/J + Σ̄j⁻¹)              ⊲ Compute weight matrices
    for t = 1, 2, . . . , T do
        θ̂t ← ∑_{j=1}^{J} Wj θj,t             ⊲ Compute weighted averages
    return θ̂1, . . . , θ̂T
Finally, one can sample from this posterior density estimator using
MCMC; ideally, this density is straightforward to obtain and sample.
In general, however, density estimation can yield complex models that
are not amenable to efficient sampling.
Neiswanger et al. [2014] explore three density estimation approaches
of various complexities. Their first approach assumes a parametric
model and is therefore approximate. Specifically, they fit a Gaussian to
each set of subposterior samples, yielding

    π̃(θ | x) = ∏_{j=1}^{J} N(µ̄j, Σ̄j),                                        (4.19)
j=1

where µ̄j and Σ̄j are the empirical mean and covariance, respectively,
of the samples from the jth subposterior. This product of Gaussians
Algorithm 13 Consensus of subposteriors via fits to Gaussians.
function ConsensusSamples({θj,1, . . . , θj,T}_{j=1}^{J})
    for j = 1, 2, . . . , J do
        µ̄j ← Sample mean of {θj,1, . . . , θj,T}
        Σ̄j ← Sample covariance of {θj,1, . . . , θj,T}
    Σ̂J ← (∑_{j=1}^{J} Σ̄j⁻¹)⁻¹                ⊲ Covariance of product of Gaussians
    µ̂J ← Σ̂J (∑_{j=1}^{J} Σ̄j⁻¹ µ̄j)            ⊲ Mean of product of Gaussians
    for t = 1, 2, . . . , T do
        θ̂t ∼ N(µ̂J, Σ̂J)                       ⊲ Sample from fitted Gaussian
    return θ̂1, . . . , θ̂T

simplifies to a single Gaussian N(µ̂J, Σ̂J), where

    Σ̂J = (∑_{j=1}^{J} Σ̄j⁻¹)⁻¹                                                 (4.20)

    µ̂J = Σ̂J (∑_{j=1}^{J} Σ̄j⁻¹ µ̄j).                                            (4.21)

These parameters are straightforward to compute and the overall den-


sity estimate can be sampled with reasonable efficiency and even in
parallel, if desired. Algorithm 13 summarizes this consensus strategy
based on fits to Gaussians.
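
A minimal Python/NumPy sketch of this Gaussian-fit consensus (Algorithm 13)
follows; the subposterior sample arrays are illustrative placeholders.

    import numpy as np

    def consensus_gaussian_fit(subposterior_samples, num_draws):
        # Fit a Gaussian to each subposterior sample set and draw from the
        # product Gaussian of Equations 4.20 and 4.21.
        means = [s.mean(axis=0) for s in subposterior_samples]
        precs = [np.linalg.inv(np.cov(s, rowvar=False)) for s in subposterior_samples]
        Sigma_hat = np.linalg.inv(sum(precs))                           # Equation 4.20
        mu_hat = Sigma_hat @ sum(P @ m for P, m in zip(precs, means))   # Equation 4.21
        return np.random.multivariate_normal(mu_hat, Sigma_hat, size=num_draws)

    # Usage with placeholder subposterior samples:
    samples = [np.random.multivariate_normal(np.zeros(2), np.eye(2), size=500) for _ in range(4)]
    draws = consensus_gaussian_fit(samples, num_draws=500)
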
In the case when the model is jointly Gaussian, the parametric den-
sity estimator we form is N (µ̂J , Σ̂J ), with µ̂J and Σ̂J given in Equa-
tions 4.21 and 4.20, respectively. In this special case, the estimator ex-
actly represents the Gaussian posterior. However, recall that we could
have instead written the exact posterior directly as N (µ, Σ), where µ
and Σ are in Equations 4.9 and 4.8, respectively. Thus computing the
density estimator is about as expensive as computing the exact poste-
rior, requiring about J local matrix inversions.
The second approach proposed by Neiswanger et al. [2014] is to use
a nonparametric kernel density estimate (KDE) for each subposterior.
Suppose we obtain T samples {θj,t}_{t=1}^{T} from the jth subposterior; then
its KDE with bandwidth parameter h has the following functional form:

    π̃^(j)(θ | x^(j)) = (1/T) ∑_{t=1}^{T} (1/h^d) K(‖θ − θj,t‖ / h),           (4.22)

i.e., the KDE is a mixture of T kernels, each centered at one of the
samples. If we use T samples from each subposterior, then the den-
sity estimator for the full posterior is a complicated function with T^J
terms, since it is the product of J such mixtures, and is therefore very
challenging to sample from. Neiswanger et al. [2014] use a Gaussian
KDE for each subposterior, and from this derive a density estimator
for the full posterior that is a mixture of T^J Gaussians with unnor-
malized mixture weights. They also consider a third, semi-parametric
approach to density estimation given by the product of a parametric
(Gaussian) model and a nonparametric (Gaussian KDE) correction. As
the number of samples T → ∞, the nonparametric and semi-parametric
density estimates exactly represent the subposterior densities and are
therefore asymptotically exact. Unfortunately, their complex mixture
representations grow exponentially in size, rendering them somewhat
unwieldy in practice.

Weierstrass samplers

The consensus strategies surveyed so far are embarrassingly parallel.


These methods obtain samples from each subposterior independently
and in parallel, and from these attempt to construct samples that (ap-
proximately) represent the posterior post-hoc. The methods in this
section proceed similarly, but introduce some amount of information
sharing between the parallel samplers. This communication pattern is
reminiscent of the alternating direction method of multipliers (ADMM)
algorithm for parallel convex optimization; for a detailed treatment of
ADMM, see the review by Boyd et al. [2011].
Weierstrass samplers [Wang and Dunson, 2013] are named for the
Weierstrass transform [Weierstrass, 1885]: for h > 0, we write

    Wh f(θ) = ∫_{−∞}^{∞} (1/(√(2π) h)) exp{ −(θ − ξ)²/(2h²) } f(ξ) dξ.        (4.23)

The transformed function Wh f (θ) is the convolution of a one-


dimensional function f (θ) with a Gaussian density of standard devia-
tion h, and so converges pointwise to f (θ) as h → 0,

    lim_{h→0} Wh f(θ) = ∫_{−∞}^{∞} δ(θ − ξ) f(ξ) dξ = f(θ),

where δ(τ ) is the Dirac delta function. For h > 0, Wh f (θ) can be
thought of as a smoothed approximation to f (θ). Equivalently, if f (θ)
is the density of a random variable θ, then Wh f (θ) is the density of a
noisy measurement of θ, where the noise is an additive Gaussian with
zero mean and standard deviation h.
Wang and Dunson [2013] analyzes a more general class of Weier-
strass transforms by defining a multivariate version and also allowing
non-Gaussian kernels:


    Wh^(K) f(θ1, . . . , θd) = ∫_{−∞}^{∞} f(ξ1, . . . , ξd) ∏_{i=1}^{d} hi⁻¹ Ki((θi − ξi)/hi) dξi.

For simplicity, we restrict our attention to the one-dimensional Weier-


strass transform.
Weierstrass samplers use Weierstrass transforms on subposterior
densities to define an augmented model. Let fj (θ) denote the j-th sub-
posterior,

    fj(θ) = π^(j)(θ | x^(j)) = π0(θ)^{1/J} ∏_{x ∈ x^(j)} π(x | θ),            (4.24)
so that the full posterior can be approximated as

    π(θ | x) ∝ ∏_{j=1}^{J} fj(θ) ≈ ∏_{j=1}^{J} Wh fj(θ)

             = ∏_{j=1}^{J} ∫ (1/(√(2π) h)) exp{ −(θ − ξj)²/(2h²) } fj(ξj) dξj

             ∝ ∫ ∏_{j=1}^{J} exp{ −(θ − ξj)²/(2h²) } fj(ξj) dξj.              (4.25)

The integrand of (4.25) defines the joint density of an augmented model


that includes ξ = {ξj}_{j=1}^{J} as auxiliary variables:

    πh(θ, ξ | x) ∝ ∏_{j=1}^{J} exp{ −(θ − ξj)²/(2h²) } fj(ξj).                (4.26)

The posterior of interest can then be approximated by the marginal


distribution of θ in the augmented model,
    ∫ πh(θ, ξ | x) dξ ≈ π(θ | x)                                              (4.27)

with pointwise equality in the limit as h → 0. Thus by running MCMC


in the augmented model, producing Markov chain samples of both θ
and ξ, we can generate approximate samples of the posterior. Further-
more, the augmented model is more amenable to parallelization due to
its conditional independence structure: conditioned on θ, the subpos-
terior parameters ξ are rendered independent.
The same augmented model construction can be motivated without
explicit reference to the Weierstrass transform of densities. Consider the
factor graph model of the posterior in Figure 4.3a, which represents the
definition of the posterior in terms of subposterior factors,

    π(θ | x) ∝ ∏_{j=1}^{J} fj(θ).                                             (4.28)

This model can be equivalently expressed as a model where each sub-


posterior depends on an exact local copy of θ. That is, writing ξj as
[Figure 4.3: Factor graphs defining the augmented model of the Weierstrass
sampler. (a) Factor graph for π(θ | x) in terms of subposterior factors fj(θ).
(b) Factor graph for the Weierstrass augmented model π(θ, ξ | x).]

the local copy of θ for subposterior j, the posterior is the marginal of


a new augmented model given by

    π(θ, ξ | x) ∝ ∏_{j=1}^{J} fj(ξj) δ(ξj − θ).                               (4.29)

This new model can be represented by the factor graph in Figure 4.3b,
with potentials ψ(ξj , θ) = δ(ξj − θ). Finally, rather than taking the ξj
to be exact local copies of θ, we can instead relax them to be noisy
Gaussian measurements of θ:

    πh(θ, ξ | x) ∝ ∏_{j=1}^{J} fj(ξj) ψh(ξj, θ)                               (4.30)

    ψh(ξj, θ) = exp{ −(θ − ξj)²/(2h²) }.                                      (4.31)

Thus the potentials ψh (ξj , θ) enforce some consistency across the noisy
local copies of the parameter but allow them to be decoupled, where
the amount of decoupling depends on h. With smaller values of h the
approximate model is more accurate, but the local copies are more
coupled and hence sampling in the augmented model is less efficient.
We can construct a Gibbs sampler for the joint distribu-
tion π(θ, ξ | x) in Equation 4.25 by alternately sampling from p(θ | ξ)
and p(ξj | θ, x^(j)), for j = 1, . . . , J. It follows from Equation 4.25 that

    p(θ | ξ1, . . . , ξJ, x) ∝ ∏_{j=1}^{J} exp{ −(θ² − 2θξj)/(2h²) }.         (4.32)

Rearranging terms gives

    p(θ | ξ1, . . . , ξJ, x) ∝ exp{ −(θ − ξ̄)²/(2h²/J) },                      (4.33)

where ξ̄ = J⁻¹ ∑_{j=1}^{J} ξj. The remaining Gibbs updates follow from
Equation 4.25, which directly yields

    p(ξj | θ, x^(j)) ∝ exp{ −(θ − ξj)²/(2h²) } fj(ξj),   j = 1, . . . , J.    (4.34)
Algorithm 14 Weierstrass Gibbs sampling. For simplicity, θ ∈ R.
Input: Initial state θ0, number of samples T, data partitions
    x^(1), . . . , x^(J), subposteriors f1(θ), . . . , fJ(θ), tuning parameter h
Output: Samples θ1, . . . , θT
Initialize θ0
for t = 0, 1, . . . , T − 1 do
    Send θt to each processor
    for j = 1, 2, . . . , J in parallel do
        ξj,t+1 ∼ p(ξj,t+1 | θt, x^(j)) ∝ N(ξj,t+1 | θt, h²) fj(ξj,t+1)
    Collect ξ1,t+1, . . . , ξJ,t+1
    ξ̄t+1 ← (1/J) ∑_{j=1}^{J} ξj,t+1
    θt+1 ∼ N(θt+1 | ξ̄t+1, h²/J)
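
For concreteness, the following Python/NumPy sketch runs the updates of
Algorithm 14 in the special case where each subposterior fj is itself a known
univariate Gaussian, so that the ξj updates are conjugate and can be drawn in
closed form; the subposterior means, variances, and tuning parameter h below
are illustrative assumptions.

    import numpy as np

    def weierstrass_gibbs(sub_means, sub_vars, h, num_iters, theta0=0.0):
        # One-dimensional Weierstrass Gibbs sampling (Algorithm 14) when the
        # j-th subposterior is N(sub_means[j], sub_vars[j]), so that
        # p(xi_j | theta, x^(j)) is proportional to N(xi_j | theta, h^2) N(xi_j | m_j, s_j^2),
        # which is itself Gaussian and can be sampled exactly.
        m = np.asarray(sub_means, dtype=float)
        s2 = np.asarray(sub_vars, dtype=float)
        J = len(m)
        theta, samples = theta0, []
        for _ in range(num_iters):
            # Local updates (run on separate processors in a real implementation)
            prec = 1.0 / h**2 + 1.0 / s2
            mean = (theta / h**2 + m / s2) / prec
            xi = mean + np.random.randn(J) / np.sqrt(prec)
            # Global update, Equation 4.33: theta | xi ~ N(mean(xi), h^2 / J)
            theta = xi.mean() + np.random.randn() * h / np.sqrt(J)
            samples.append(theta)
        return np.array(samples)

    draws = weierstrass_gibbs(sub_means=[1.8, 2.1, 2.0], sub_vars=[0.3, 0.25, 0.4],
                              h=0.1, num_iters=2000)
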

This Gibbs sampler allows for parallelism but requires communica-


tion at every round. A straightforward parallel implementation, shown
in Algorithm 14, generates the updates for ξ1 , . . . , ξJ in parallel, but
the update for θ depends on the most recent values of all the ξj . Wang
and Dunson [2013] describes an approximate variant of the full Gibbs
procedure that avoids frequent communication by only occasionally
updating θ. In other efforts to exploit parallelism while avoiding com-
munication, the authors propose alternate Weierstrass samplers based
on importance sampling and rejection sampling, though the tradeoffs
of these strategies require more empirical evaluation.

4.2.2 Hogwild Gibbs


Instead of designing new parallel algorithms from scratch, another ap-
proach is to take an existing MCMC algorithm and execute its up-
dates in parallel at the expense of accuracy or theoretical guarantees.
In particular, Hogwild Gibbs algorithms take a Gibbs sampling algo-
rithm (§2.2.6) with interdependent sequential updates (e.g., due to col-
lapsed parameters or lack of graphical model structure) and simply
run the updates in parallel anyway, using only occasional communica-
tion and out-of-date (stale) information from other processors. Because
these strategies take existing algorithms and let the updates run ‘hog-
wild’ in the spirit of Hogwild! stochastic gradient descent in convex
optimization [Recht et al., 2011], we refer to these methods as Hogwild
Gibbs.
Similar approaches have a long history. Indeed, Gonzalez et al.
[2011] attribute a version of this strategy, Synchronous Gibbs, to the
original Gibbs sampling paper [Geman and Geman, 1984]. However,
these strategies have seen renewed interest, particularly due to exten-
sive empirical work on Approximate Distributed Latent Dirichlet Al-
location (AD-LDA) [Newman et al., 2007, 2009, Asuncion et al., 2008,
Liu et al., 2011, Ihler and Newman, 2012], which showed that running
collapsed Gibbs sampling updates in parallel allowed for near-perfect
parallelism without a loss in predictive likelihood performance. With
the growing challenge of scaling MCMC not only to big datasets
but also to big models, it is increasingly important to understand when
and how these approaches may be useful.
In this section, we first define some variations of Hogwild Gibbs
based on examples in the literature. Next, we survey the empirical
results and summarize the current state of theoretical understanding.

Defining Hogwild Gibbs variants

Here we define some Hogwild Gibbs methods and related schemes,


such as the stale synchronous parameter server. In particular, we con-
sider bulk-synchronous parallel and asynchronous variations. We also
fix some notation used for the remainder of the section.
For all of the Hogwild Gibbs algorithms, as with standard Gibbs
sampling, we are given a collection of n random variables, {xi : i ∈ [n]}
where [n] ≜ {1, 2, . . . , n}, and we assume that we can sample from the
conditional distributions xi | x¬i , where x¬i denotes {xj : j ≠ i}. For
the Hogwild Gibbs algorithms, we also assume we have K processors,
each of which is assigned a set of variables on which to perform MCMC
updates. We represent an assignment of variables to processors by fixing
a partition {I1 , I2 , . . . , IK } of [n], so that the kth processor performs
updates on the state values indexed by Ik .
Algorithm 15 Bulk-synchronous parallel (BSP) Hogwild Gibbs
Input: Joint distribution over x = (x1, . . . , xn), partition {I1, . . . , IK}
    of {1, 2, . . . , n}, iteration schedule q(t, k)
Initialize x̄^(1)
for t = 1, 2, . . . do
    for k = 1, 2, . . . , K in parallel do
        x̄^(t+1)_{Ik} ← LocalGibbs(x̄^(t), Ik, q(t, k))
    Synchronize
function LocalGibbs(x̄, I, q)
    for j = 1, 2, . . . , q do
        for i ∈ I in order do
            x̄i ← sample xi | x¬i = x̄¬i
    return x̄

Bulk-synchronous parallel Hogwild Gibbs

A bulk-synchronous parallel (BSP) Hogwild Gibbs algorithm assigns


variables to processors and alternates between performing paral-
lel processor-local updates and global synchronization steps. During
epoch t, the kth processor performs q(t, k) MCMC updates, such as
Gibbs updates, on the variables {xi : i ∈ Ik } without communicating
with the other processors; in particular, these updates are computed
using out-of-date values for all {xj : j 6∈ Ik }. After all processors have
completed their local updates, all processors communicate the updated
state values in a global synchronization step and the system advances
to the next epoch. We summarize this Hogwild Gibbs variant in Al-
gorithm 15, in which the local MCMC updates are taken to be Gibbs
updates.
Several special cases of the BSP Hogwild Gibbs scheme have been
of interest. The Synchronous Gibbs scheme of Gonzalez et al. [2011]
associates one variable with each processor, so that |Ik | = 1 for each
k = 1, 2, . . . , K (in which case we may take q = 1 since no local iter-
ations are needed with a single variable). One may also consider the
case where the partition is arbitrary and q is very large, in which case
the local MCMC iterations may converge and exact block samples are
drawn on each processor using old statistics from other processors for
each outer iteration. Finally, note that setting K = 1 and q(t, k) = 1
reduces to standard Gibbs sampling on a single processor.
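
As an illustration of the BSP scheme (not an implementation from the cited
references), the following Python/NumPy sketch runs Hogwild Gibbs on a small
zero-mean Gaussian model, the setting analyzed theoretically later in this
section: each block of variables plays the role of a processor, performs q
local Gibbs sweeps against stale values of the other blocks, and then
synchronizes. The precision matrix and block assignment are assumptions made
only for the example.

    import numpy as np

    def hogwild_gibbs_gaussian(Lam, blocks, num_epochs, q=2):
        # BSP Hogwild Gibbs (Algorithm 15) for a zero-mean Gaussian with
        # precision matrix Lam.  Within an epoch every block runs q local
        # Gibbs sweeps using stale values of the other blocks, then all
        # blocks synchronize.
        n = Lam.shape[0]
        x = np.zeros(n)
        samples = []
        for _ in range(num_epochs):
            stale = x.copy()                      # values visible to every "processor"
            new = stale.copy()
            for block in blocks:                  # in parallel in a real implementation
                local = stale.copy()
                for _ in range(q):
                    for i in block:
                        cond_var = 1.0 / Lam[i, i]
                        cond_mean = -cond_var * (Lam[i] @ local - Lam[i, i] * local[i])
                        local[i] = cond_mean + np.sqrt(cond_var) * np.random.randn()
                new[block] = local[block]
            x = new                               # global synchronization step
            samples.append(x.copy())
        return np.array(samples)

    # A diagonally dominant precision matrix, so the process is stable.
    Lam = np.array([[2.0, 0.3, 0.0], [0.3, 2.0, 0.3], [0.0, 0.3, 2.0]])
    draws = hogwild_gibbs_gaussian(Lam, blocks=[[0, 1], [2]], num_epochs=5000)
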

Asynchronous Hogwild Gibbs

Another Hogwild Gibbs pattern involves performing updates asyn-


chronously. That is, processors might communicate only by sending
messages to one another instead of by a global synchronization. Ver-
sions of this Hogwild Gibbs pattern have proven effective both for col-
lapsed latent Dirichlet allocation topic model inference [Asuncion et al.,
2008], and for Indian Buffet Process inference [Doshi-Velez et al., 2009].
A version was also explored in the Gibbs sampler of the Stale Syn-
chronous Parameter (SSP) server of Ho et al. [2013], which placed an
upper bound on the staleness of the entries of the state vector on each
processor.
There are many possible communication strategies in the asyn-
chronous setting, and so we follow a version of the random commu-
nication strategy employed by Asuncion et al. [2008]. In this approach,
after performing some number of local updates, a processor sends its
updated state information to a set of randomly-chosen processors and
receives updates from other processors. The processor then updates its
state representation and performs another round of local updates. A
version of this asynchronous Hogwild Gibbs strategy is summarized in
Algorithm 16.

Theoretical analysis

Despite its empirical successes, theoretical understanding of Hogwild


Gibbs algorithms is limited. There are three settings in which some
analysis has been offered: first, in a variant of AD-LDA, i.e., Hogwild
Gibbs applied to Latent Dirichlet Allocation models [Ihler and New-
man, 2012]; second, in the jointly Gaussian case [Johnson et al., 2013,
Johnson, 2014]; and third, in the discrete variable setting with sparse
factor graph structure [De Sa et al., 2016].
Algorithm 16 Asynchronous Hogwild Gibbs
Initialize x̄^(1)
for each processor k = 1, 2, . . . , K in parallel do
    for t = 1, 2, . . . do
        x̄^(t+1)_{Ik} ← LocalGibbs(x̄^(t), Ik, q(t, k))
        Send x̄^(t+1)_{Ik} to K′ randomly-chosen processors
        for each k′ ≠ k do
            if update x̄_{Ik′} received from processor k′ then
                x̄^(t+1)_{Ik′} ← x̄_{Ik′}
            else
                x̄^(t+1)_{Ik′} ← x̄^(t)_{Ik′}

The work of Ihler and Newman [2012] provides some understand-


ing of the effectiveness of a variant of AD-LDA by bounding in terms
of run-time quantities the one-step error probability induced by pro-
ceeding with sampling steps in parallel, thereby allowing an AD-LDA
user to inspect the computed error bound after inference [Ihler and
Newman, 2012, Section 4.2]. In experiments, the authors empirically
demonstrate very small upper bounds on these one-step error proba-
bilities, e.g., a value of their parameter ε = 10−4 meaning that at least
99.99% of samples are expected to be drawn just as if they were sampled
sequentially. However, this per-sample error does not necessarily pro-
vide a direct understanding of the effectiveness of the overall algorithm
because errors might accumulate over sampling steps; indeed, under-
standing this potential error accumulation is of critical importance in
iterative systems. Furthermore, the bound is in terms of empirical run-
time quantities, and thus it does not provide guidance on which other
models the Hogwild strategy may be effective. Ihler and Newman [2012,
Section 4.3] also provides approximate scaling analysis by estimating
the order of the one-step bound in terms of a Gaussian approximation
and some distributional assumptions.
The jointly Gaussian case is more tractable for analysis [Johnson
et al., 2013, Johnson, 2014]. In particular, Johnson [2014, Theorem
7.6.6] shows that for the BSP Hogwild Gibbs process to be stable,
i.e., to form an ergodic Markov chain with a well-defined stationary
distribution, for any variable partition and any iteration schedule it suf-
fices for the model’s joint Gaussian precision matrix to satisfy a gen-
eralized diagonal dominance condition. Because the precision matrix
contains the coefficients of the log potentials in a Gaussian graphical
model, the diagonal dominance condition captures the intuition that
Hogwild Gibbs should be stable when variables do not interact too
strongly. Johnson [2014, Proposition 7.6.8] gives a more refined condi-
tion for the case where the number of processor-local Gibbs iterations
is large.
When a bulk-synchronous parallel Gaussian Hogwild Gibbs pro-
cess defines an ergodic Markov chain and has a stationary distribution,
Johnson [2014, Chapter 7] also provides an understanding of how that
stationary distribution relates to the model distribution. Because both
the model distribution and the Hogwild Gibbs process stationary dis-
tribution are Gaussian, accuracy can be measured in terms of the mean
vector and covariance matrix. Proposition 7.6.1 of Johnson [2014] shows
that the mean of a stable Gaussian Hogwild Gibbs process is always
correct. Propositions 7.7.2 and 7.7.3 of Johnson [2014] identify a trade-
off in the accuracy of the process covariance matrix as a function of
the number of processor-local Gibbs iterations: at least when the pro-
cessor interactions are sufficiently weak, more processor-local iterations
between synchronization steps increase the accuracy of the covariances
among variables within each processor but decrease the accuracy of the
covariances between variables on different processors. Johnson [2014,
Proposition 7.7.4] also gives a more refined error bound as well as an
inexpensive way to correct covariance estimates for the case where the
number of processor-local Gibbs iterations is large.
Finally, De Sa et al. [2016] analyze asynchronous Hogwild Gibbs
sampling on a discrete state space when the model has a sparse factor
graph structure. Using a total influence condition to control the mutual
dependence between variables and Dobrushin’s condition, the authors
bound some measures of bias and mixing time. However, their analysis
(and indeed the asynchronous setting they consider) does not ensure
the existence of a stationary distribution; instead, bias and mixing time
are measured using the first time the distribution of the iterates comes
sufficiently close to the target distribution. De Sa et al. [2016] also
provide quantitative rate bounds for these measures of bias and mixing
time, and show that these estimates closely track empirical simulations.

4.3 Summary

Many ideas for parallelizing MCMC have been proposed, exhibiting


many tradeoffs. These ideas vary in generality, in faithfulness to the
posterior, and in the parallel computation architectures for which they
are best suited. Here we summarize the surveyed methods, emphasizing
their relative strengths on these criteria. See Table 4.1 for an overview.

Simulating independent Markov chains Independent instances of se-


rial MCMC algorithms can be run in an embarrassingly parallel man-
ner, requiring only minimal communication between processors to en-
sure distinct initializations and to collect samples. This approach can
reduce Monte Carlo variance by increasing the number of samples col-
lected in any time budget, achieving an ideal parallel speedup, but does
nothing to accelerate the warm-up period of the chains during which
the transient bias is eliminated (see Section 2.2.4 and Chapter 6). That
is, using parallel resources to run independent chains does nothing to
improve mixing unless there is some mechanism for information shar-
ing as in Nishihara et al. [2014]. In addition, running an independent
MCMC chain on each processor requires each processor to access the
full dataset, which may be problematic for especially large datasets.
These considerations motivate both subposterior methods and Hogwild
Gibbs.

Direct parallelization of standard updates Some MCMC algorithms


applied to models with particular structure allow for straightforward
parallel implementation. In particular, when the likelihood is factor-
ized across data points, the computation of the Metropolis–Hastings
acceptance probability can be parallelized. This strategy lends itself to
a bulk-synchronous parallel (BSP) computational model.

Table 4.1: Summary of recent approaches to parallel MCMC.

Requirements. Parallel density evaluation (§4.1.1): conditional independence.
Prefetching (§4.1.2): none. Consensus (§4.2.1): approximate factorization.
Weierstrass (§4.2.1): approximate factorization. Hogwild Gibbs (§4.2.2): weak
dependencies across processors.

Parallel model. Parallel density evaluation: BSP. Prefetching: speculative
execution. Consensus: MapReduce. Weierstrass: BSP. Hogwild Gibbs: BSP and
asynchronous message passing variants.

Communication. Parallel density evaluation: each iteration. Prefetching:
master scheduling. Consensus: once. Weierstrass: tuneable. Hogwild Gibbs:
tuneable.

Design choices. Parallel density evaluation: none. Prefetching: scheduling
policy. Consensus: data partition, consensus algorithm. Weierstrass: data
partition, synchronization frequency, Weierstrass h. Hogwild Gibbs: data
partition, communication frequency.

Computational overheads. Parallel density evaluation: none. Prefetching:
scheduling, bookkeeping. Consensus: consensus step. Weierstrass: auxiliary
variable sampling. Hogwild Gibbs: none.

Approximation error. Parallel density evaluation: none. Prefetching: none.
Consensus: depends on number of processors and consensus algorithm.
Weierstrass: depends on number of processors and Weierstrass h. Hogwild
Gibbs: depends on number of processors and staleness.

Parallelizing
MH in this way yields exact MCMC updates and can be effective at re-
ducing the mixing time required by serial MH, but it requires a simple
likelihood function and its implementation requires frequent synchro-
nization and communication, mitigating parallel speedups unless the
likelihood function is very expensive.
Gibbs sampling also presents an opportunity for direct paralleliza-
tion for particular graphical model structures. In particular, given
a graph coloring of the graphical model, variables corresponding to
nodes assigned a particular color are conditionally mutually indepen-
dent and can be updated in parallel without communication. How-
ever, frequent synchronization and significant communication can be
required to transmit sampled values to neighbors after each update.
Relaxing both the strict conditional independence requirements and
synchronization requirements motivates Hogwild Gibbs.

Prefetching and speculative execution The prefetching algorithms


studied in Section 4.1.2 use speculative execution to transform tra-
ditional (serial) Metropolis–Hastings into a parallel algorithm with-
out incurring approximate updates or requiring any model structure.
The implementation naturally follows a master-worker pattern, where
the master allocates (possibly speculative) computational work, such
as proposal generation or (partial) density evaluation, to worker pro-
cessors. Ignoring overheads, basic prefetching algorithms achieve at
least logarithmic speedup in the number of processors available. More
sophisticated scheduling by the master, such as predictive prefetch-
ing [Angelino et al., 2014], can increase speedup significantly. While
this method is very general and yields the same iterates as serial MH,
the speedup can be limited.

Subposterior consensus and Weierstrass samplers Subposterior


methods, such as the consensus Monte Carlo algorithms and the Weier-
strass samplers of Section 4.2.1, allow for data parallelism and minimal
communication because each subposterior Markov chain can be allo-
cated to a processor and simulation can proceed independently. Com-
munication is required only for final sample aggregation in consensus
Monte Carlo or the periodic resampling of the global parameter in
the Weierstrass sampler. In consensus Monte Carlo, the quality of the
approximate inference depends on both the effectiveness of the aggre-
gation strategy and the extent to which dependencies in the posterior
can be factorized into subposteriors. The Weierstrass samplers directly
trade off approximation quality and the amount of decoupling between
subposteriors.
The consensus Monte Carlo approach originated at Google [Scott
et al., 2016] and naturally fits the MapReduce programming model,
allowing it to be executed on large computational clusters. Recent work
has extended consensus Monte Carlo and provides tools for designing
simple consensus strategies [Rabinovich et al., 2015], but the generality
and approximation quality of subposterior methods remain unclear.
The Weierstrass sampler fits well into a BSP model.

Hogwild Gibbs Hogwild Gibbs of Section 4.2.2 also allows for data
parallelism but avoids factorizing the posterior as in consensus Monte
Carlo or instantiating coupled copies of a global parameter as in the
Weierstrass sampler. Instead, processor-local sampling steps (such as
local Gibbs updates) are performed with each processor treating other
processors’ states as fixed at stale values; processors can communicate
updated states less frequently, either via synchronous or asynchronous
communication. Hogwild Gibbs variants span a range of parallel com-
putation paradigms from fully synchronous BSP to fully asynchronous
message passing. While Hogwild Gibbs has proven effective in practice
for several models, its applicability and approximation tradeoffs remain
unclear.

4.4 Discussion

The methods studied here rely to different degrees on redundancy in


the data. In particular, averaging subposterior samples relies on the
subposteriors being relatively similar. Consider two extreme cases: if
the subposteriors are identical then averaging their samples can pro-
duce reasonable results, but if the subposteriors have disjoint support
then averaging their samples could produce iterates that look nothing
like the true posterior. Furthermore, averaging may only make sense in
continuous spaces that are closed under convex combinations; these av-
eraging strategies do not apply to samples that take values in discrete
spaces. Hence these subposterior methods may be most appropriate in
the “tall data” regime, where the data are redundant and subposteriors
can be expected to agree.
Performing density estimation on the subposteriors and forming the
product of these densities avoids the problems with direct averaging,
but may itself be computationally expensive and is unlikely to scale
well with dimensionality. Indeed, applying this strategy to a Gaussian
model provides no computational advantage to parallelization.
Unlike the subposterior methods, in Weierstrass samplers the pro-
cessors try to account for the other processors’ data by communicating
through the shared global variable. In this way, the Weierstrass sam-
plers are similar to the expectation propagation methods surveyed in
the next chapter. However, because processors can affect each others’
iterates only through the global variable bottleneck, their communica-
tion is limited. In addition, it is not clear how to extend the Weierstrass
strategies to discrete variables.
Hogwild Gibbs algorithms also include influence between proces-
sors, and without an information bottleneck. Communication between
processors is limited instead by an algorithm’s update frequency. In
addition, Hogwild Gibbs doesn’t inherently rely on data redundancy:
instead, it relies on variables not being too dependent across proces-
sors. For this reason, while Hogwild Gibbs has its own restrictions, it
is a promising method for scaling out to big models with many latent
variables rather than just the “tall data” regime. Hogwild Gibbs also
readily applies to discrete variables.
5
Scaling variational algorithms

Variational inference is another paradigm for posterior inference in


Bayesian models. Because variational methods pose inference as an
optimization problem, ideas in scalable optimization can in principle
yield scalable posterior inference algorithms. In this chapter we consider
such algorithms both for mean field variational inference, which is often
simply called variational Bayes, and for expectation propagation.
Variational inference is typically performed using a family of distri-
butions that does not include the exact posterior, and as a result vari-
ational methods are inherently biased. In particular, tractable varia-
tional families typically provide only unimodal approximations, and of-
ten cannot represent some posterior correlations near particular modes.
As a result, MCMC methods can in principle provide better approxi-
mations even when the Markov chain only explores a single mode in a
reasonable number of iterations. However, while traditional MCMC is
asymptotically unbiased, some of the scalable MCMC algorithms dis-
cussed in Chapters 3 and 4 are not, and it is unclear how to compare
these biases to the biases in variational methods.
Despite its inherent bias, variational inference is widely used in ma-
chine learning because the computational advantage over MCMC can

be significant, particularly when scaling inference to large datasets.


The big data context may also inform the relative cost of performing
inference in a constrained variational family rather than attempting to
represent the posterior exactly: when the posterior is concentrated, a
variational approximation, even a Gaussian approximation, may suf-
fice [Bardenet et al., 2015]. While such questions ultimately need to be
explored empirically on a case-by-case basis, the scalable variational
inference methods surveyed in this chapter provide the tools for such
an exploration.
In this chapter we summarize three patterns of scalable variational
inference. First, in Section 5.1, we discuss the application of stochastic
gradient optimization methods to mean field variational inference prob-
lems. Second, in Section 5.2, we describe an alternative approach that
instead leverages the idea of incremental posterior updating to develop
an inference algorithm with minibatch-based updates. Finally, in Sec-
tion 5.3, we describe two scalable versions of expectation propagation
(EP), one based on data parallelism and distributed computation and
another based on stochastic optimization and minibatch updates.

5.1 Stochastic optimization and mean field methods

Stochastic gradient optimization is a powerful tool for scaling opti-


mization algorithms to large datasets, and it has been applied to mean
field variational inference problems to great effect. While many tradi-
tional algorithms for optimizing mean field objective functions, includ-
ing both gradient-based and coordinate optimization methods, require
re-reading the entire dataset in each iteration, the stochastic gradient
framework allows each update to be computed with respect to mini-
batches of the dataset while providing very general asymptotic conver-
gence guarantees.
In this section we first summarize the stochastic variational infer-
ence (SVI) framework of Hoffman et al. [2013], which applies to models
with complete-data conjugacy. Next, we discuss alternatives and exten-
sions which can handle more general models at the cost of updates with
greater variance and, hence, slower convergence.
[Figure 5.1: Prototypical graphical model for stochastic variational inference
(SVI). The global latent variables are represented by φ and the local latent
variables by z^(k).]

5.1.1 SVI for complete-data conjugate models


This section follows the development in Hoffman et al. [2013]. It de-
pends on results from stochastic gradient optimization theory; see Sec-
tion 2.5 for a review. For notational simplicity we consider each mini-
batch to consist of only a single observation; the generalization to mini-
batches of arbitrary sizes is immediate.
Many common probabilistic models are hierarchical: they can be
written in terms of global latent variables (or parameters), local latent
variables, and observations. That is, many models can be written as

    p(φ, z, y) = p(φ) ∏_{k=1}^{K} p(z^(k) | φ) p(y^(k) | z^(k), φ)            (5.1)

where φ denotes global latent variables, z = {z^(k)}_{k=1}^{K} denotes local la-
tent variables, and y = {y^(k)}_{k=1}^{K} denotes observations. See Figure 5.1
for a graphical model. Given such a class of models, the mean field vari-
ational inference problem is to approximate the posterior p(φ, z | y) for
fixed data y with a distribution of the form q(φ)q(z) = q(φ) ∏_k q(z^(k))
by finding a local minimum of the KL divergence from the approx-
imating distribution to the posterior or, equivalently, finding a local
maximum of the log marginal likelihood lower bound

    L[q(φ)q(z)] ≜ E_{q(φ)q(z)}[ log ( p(φ, z, y) / (q(φ)q(z)) ) ] ≤ log p(y).   (5.2)
Hoffman et al. [2013] develops a stochastic gradient ascent algo-
rithm for such models that leverages complete-data conjugacy. Gra-
dients of L with respect to the parameters of q(φ) have a conve-
nient form if we assume the prior p(φ) and each complete-data like-
lihood p(z (k) , y (k) | φ) are a conjugate pair of exponential family densi-
ties. That is, if we have

    log p(φ) = ⟨ηφ, tφ(φ)⟩ − log Zφ(ηφ)                                       (5.3)

    log p(z^(k), y^(k) | φ) = ⟨ηzy(φ), tzy(z^(k), y^(k))⟩ − log Zzy(ηzy(φ))   (5.4)

then conjugacy identifies the statistic of the prior with the natural
parameter and log partition function of the likelihood via

    tφ(φ) = (ηzy(φ), − log Zzy(ηzy(φ))),                                      (5.5)

so that

    p(φ, z^(k), y^(k)) ∝ exp{ ⟨ηφ + (tzy(z^(k), y^(k)), 1), tφ(φ)⟩ }.         (5.6)

Conjugacy implies the optimal variational factor q(φ) has the same
form as the prior; that is, without loss of generality we can write q(φ)
in the same form as (5.3),

    q(φ) = exp{ ⟨η̃φ, tφ(φ)⟩ − log Zφ(η̃φ) },                                   (5.7)

for some variational parameter η̃φ.


Given this conjugacy structure, we can find a simple expression for
the gradient of L with respect to the global variational parameter ηeφ ,
optimizing out the local variational factor q(z). That is, we write the
variational objective over global parameters as

    L(η̃φ) = max_{q(z)} L[q(φ)q(z)].                                           (5.8)

Writing the optimal parameters of q(z) as η̃z*, note that when η̃z* is
partially optimized to a stationary point¹ of L, so that ∂L/∂η̃z* = 0
at η̃z*, the chain rule implies that the gradient² with respect to the global
variational parameters simplifies:

    (∂L/∂η̃φ)(η̃φ) = (∂L/∂η̃φ)(η̃φ, η̃z*) + (∂L/∂η̃z*)(η̃φ, η̃z*) (∂η̃z*/∂η̃φ)      (5.9)

                  = (∂L/∂η̃φ)(η̃φ, η̃z*).                                       (5.10)

¹ More generally, when η̃z is regularly constrained (typically the constraint set
is linear [Wainwright and Jordan, 2008]), the same result holds because at η̃z* the
gradient of the objective is orthogonal to the feasible variations in η̃z.
² For a discussion of differentiability and smoothness issues that can arise when
there is more than one optimizer, see Danskin [1967], Fiacco [1984, Section 2.4], and
Bonnans and Shapiro [2000, Chapter 4]. Here we simply assume ∂η̃z*/∂η̃φ exists.

Because a locally optimal local factor q(z) can be computed with local
mean field updates for a fixed value of the global variational parameter
η̃φ, we need only find an expression for the gradient ∇_{η̃φ} L(η̃φ) in terms
of the optimized local factors.
To find an expression for the gradient ∇_{η̃φ} L(η̃φ) that exploits con-
jugacy structure, using (5.6) we can substitute

    p(φ, z, y) ∝ exp{ ⟨ηφ + ∑_k (tzy(z^(k), y^(k)), 1), tφ(φ)⟩ },             (5.11)

into the definition of L in (5.2). Using the optimal form of q(φ), we
have

    L(η̃φ) = E_{q(φ)q(z)}[ ⟨ηφ + ∑_k tzy(z^(k), y^(k)) − η̃φ, tφ(φ)⟩ ]
             + log Zφ(η̃φ) + const                                            (5.12)

           = ⟨ηφ + ∑_k E_{q(z^(k))}[tzy(z^(k), y^(k))] − η̃φ, E_{q(φ)}[tφ(φ)]⟩
             + log Zφ(η̃φ) + const,                                           (5.13)

where the constant does not depend on η̃φ. Using the identity for nat-
ural exponential families that

    ∇ log Zφ(η̃φ) = E_{q(φ)}[tφ(φ)],                                           (5.14)

we can write the same expression as

    L(η̃φ) = ⟨ηφ + ∑_k E_{q(z^(k))}[tzy(z^(k), y^(k))] − η̃φ, ∇ log Zφ(η̃φ)⟩
             + log Zφ(η̃φ) + const.                                           (5.15)
Thus we can compute the gradient of L(η̃φ) with respect to the global
variational parameters η̃φ as

    ∇_{η̃φ} L(η̃φ) = ⟨∇² log Zφ(η̃φ), ηφ + ∑_k E_{q(z^(k))}[tzy(z^(k), y^(k))] − η̃φ⟩
                    − ∇ log Zφ(η̃φ) + ∇ log Zφ(η̃φ)                            (5.16)

                  = ⟨∇² log Zφ(η̃φ), ηφ + ∑_k E_{q(z^(k))}[tzy(z^(k), y^(k))] − η̃φ⟩,

where the first two terms come from applying the product rule.
The matrix ∇² log Zφ(η̃φ) is the Fisher information of the varia-
tional family, since

    −E_{q(φ)}[ ∇²_{η̃φ} log q(φ) ] = ∇² log Zφ(η̃φ).                           (5.17)

In the context of stochastic gradient ascent, we can cancel the multipli-
cation by the matrix ∇² log Zφ(η̃φ) simply by choosing the sequence of
positive definite matrices in Algorithm 4 to be G^(t) ≜ ∇² log Zφ(η̃φ^(t))⁻¹.
This choice yields a stochastic natural gradient ascent algorithm [Amari
and Nagaoka, 2007], where the updates are stochastic approximations
to the natural gradient

    ∇̃_{η̃φ} L = ηφ + ∑_k E_{q(z^(k))}[tzy(z^(k), y^(k))] − η̃φ.                (5.18)
Natural gradients effectively include a second-order quasi-Newton cor-
rection for local curvature in the variational family [Martens and
Grosse, 2015, Martens, 2015], making the updates invariant to repa-
rameterization of the variational family and thus often improving per-
formance of the algorithm. More importantly, at least for the case
of complete-data conjugate families considered here, natural gradient
steps are in fact easier to compute than ‘flat’ gradient steps.
Therefore a stochastic natural gradient ascent algorithm on the
global variational parameter ηeφ proceeds at iteration t by sampling a
minibatch y (k) and taking a step of some size ρ(t) in an approximate
natural gradient direction via
 
    η̃φ ← (1 − ρ^(t)) η̃φ + ρ^(t) ( ηφ + K E_{q(z^(k))}[t(z^(k), y^(k))] )     (5.19)
where we have assumed the minibatches are of equal size to simplify
notation. The local variational factor q(z (k) ) is computed using a local
mean field update on the data minibatch and the global variational
factor. That is, if q(z^(k)) is not further factorized in the mean field
approximation, it is computed according to

    q(z^(k)) ∝ exp{ E_{q(φ)}[ log p(z^(k), y^(k) | φ) ] }.                    (5.20)

We summarize the general SVI algorithm in Algorithm 17.

Algorithm 17 Stochastic Variational Inference (SVI)
Initialize global variational parameter η̃φ^(0)
for t = 0, 1, 2, . . . until convergence do
    k̂ ← sample index k with probability pk > 0, for k = 1, 2, . . . , K
    q(z^(k̂)) ← LocalMeanField(η̃φ^(t), y^(k̂))
    η̃φ^(t+1) ← (1 − ρ^(t)) η̃φ^(t) + ρ^(t) ( ηφ + (1/p_{k̂}) E_{q(z^(k̂))}[ t(z^(k̂), y^(k̂)) ] )
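
The following Python sketch shows the shape of the update (5.19) in
Algorithm 17; the function expected_stats stands in for the model-specific
step that optimizes the local factor q(z^(k)) and returns its expected
sufficient statistics, and the toy instantiation (a conjugate normal model with
known unit variance, no local latent variables, and uniformly sampled
minibatches so that pk = 1/K) is an assumption made only for illustration.

    import numpy as np

    def svi(prior_natparam, minibatches, expected_stats, num_epochs, step0=1.0, kappa=0.6):
        # Generic skeleton of Algorithm 17 with uniform minibatch sampling.
        # expected_stats(minibatch, global_natparam) is a user-supplied
        # placeholder for the local mean field step and its expected statistics.
        natparam = prior_natparam.copy()
        K = len(minibatches)
        t = 0
        for _ in range(num_epochs):
            for _ in range(K):
                k = np.random.randint(K)                     # sample a minibatch index
                rho = step0 * (t + 1) ** (-kappa)            # decreasing step size
                target = prior_natparam + K * expected_stats(minibatches[k], natparam)
                natparam = (1 - rho) * natparam + rho * target   # Equation 5.19
                t += 1
        return natparam

    # Toy instantiation: normal likelihood with known unit variance and no local
    # latent variables, so the expected statistics reduce to (sum(y), len(y)).
    data = np.random.randn(1000) + 3.0
    minibatches = np.array_split(data, 100)
    prior = np.array([0.0, 1.0])           # (precision * mean, precision) of a N(0, 1) prior
    stats = lambda y, _: np.array([y.sum(), float(len(y))])
    posterior_natparam = svi(prior, minibatches, stats, num_epochs=5)
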


To ensure asymptotic convergence to a stationary point of the objec-
tive, the step size sequence (ρ^(t))_{t=0}^{∞} is typically chosen to be decreasing
as in Section 2.5. However, Mandt et al. [2016] analyzes the algorithm
with a fixed step size and compares its stationary distribution to the
SGLD inference algorithm discussed in Section 3.4.

5.1.2 Stochastic gradients with general nonconjugate models

The development of SVI in the preceding section assumes that p(φ)


and p(z, y | φ) are a conjugate pair of exponential families. This as-
sumption led to a particularly convenient form for the natural gradient
of the mean field variational objective and hence an efficient stochas-
tic gradient ascent algorithm. However, when models do not have this
conjugacy structure, more general algorithms are required.
In this section we review Black Box Variational Inference (BBVI),
which is a stochastic gradient algorithm for variational inference that
can be applied at scale [Ranganath et al., 2014]. The “black box” name
suggests its generality: while the stochastic variational inference of Sec-
tion 5.1.1 requires particular model structure, BBVI only requires that
the model’s log joint distribution can be evaluated. It also makes few
demands of the variational family, since it only requires that the family
can be sampled and that the gradient of its log joint density with re-
spect to the variational parameters can be computed efficiently. With
these minimal requirements, BBVI is not only useful in the big-data
setting but also a tool for handling nonconjugate variational inference
more generally. Because BBVI uses Monte Carlo approximation to com-
pute stochastic gradient updates, it fits naturally into a stochastic gra-
dient optimization framework, and hence it has the additional benefit
of yielding a scalable algorithm simply by adding minibatch sampling
to its updates at the cost of increasing their variance. In this subsection
we review the general BBVI algorithm and then compare it to the SVI
algorithm of Section 5.1.1. For a review of Monte Carlo estimation, see
Section 2.2.2.
We consider a general model p(θ, y) = p(θ) ∏_{k=1}^{K} p(y^(k) | θ), includ-
ing parameters θ and observations y = {y^(k)}_{k=1}^{K} divided into K mini-
batches. The distribution of interest is the posterior p(θ | y) and we
write the variational family as q(θ) = q(θ | ηeθ ), where we suppress the
particular mean field factorization structure of q(θ) from the notation.
The mean field variational lower bound is then
    L = E_{q(θ)}[ log ( p(θ, y) / q(θ) ) ].                                   (5.21)

Taking the gradient with respect to the variational parameter η̃θ and
expanding the expectation into an integral, we have

    ∇_{η̃θ} L = ∇_{η̃θ} ∫ q(θ) log ( p(θ, y) / q(θ) ) dθ                        (5.22)

             = ∫ ∇_{η̃θ}[ log ( p(θ, y) / q(θ) ) ] q(θ) dθ
               + ∫ log ( p(θ, y) / q(θ) ) ∇_{η̃θ} q(θ) dθ                      (5.23)

where we have moved the gradient into the integrand and applied the
product rule to yield two terms. The first term is identically zero:

    ∫ ∇_{η̃θ}[ log ( p(θ, y) / q(θ) ) ] q(θ) dθ = − ∫ (1/q(θ)) ∇_{η̃θ}[q(θ)] q(θ) dθ   (5.24)

                                               = − ∫ ∇_{η̃θ} q(θ) dθ                  (5.25)

                                               = −∇_{η̃θ} ∫ q(θ) dθ = 0               (5.26)

where we have used ∇_{η̃θ} log p(θ, y) = 0. To write the second term of
(5.23) in a form that allows convenient Monte Carlo approximation, we
first note the identity

    ∇_{η̃θ} log q(θ) = ∇_{η̃θ} q(θ) / q(θ)   =⇒   ∇_{η̃θ} q(θ) = q(θ) ∇_{η̃θ} log q(θ)   (5.27)

and hence we can write the second term of (5.23) as

    ∫ log ( p(θ, y) / q(θ) ) ∇_{η̃θ} q(θ) dθ
        = ∫ log ( p(θ, y) / q(θ) ) ∇_{η̃θ}[ log q(θ) ] q(θ) dθ                (5.28)

        = E_{q(θ)}[ log ( p(θ, y) / q(θ) ) ∇_{η̃θ} log q(θ) ]                  (5.29)

        ≈ (1/|S|) ∑_{θ̂∈S} log ( p(θ̂, y) / q(θ̂) ) ∇_{η̃θ} log q(θ̂)            (5.30)

where in the final line we have written the expectation as a Monte


iid
Carlo estimate using a set of samples S, where θ̂ ∼ q(θ) for θ̂ ∈ S.
Notice that the gradient is written as a weighted sum of gradients of
the variational log density with respect to the variational parameters,
where the weights depend on the model log joint density.
The BBVI algorithm uses the Monte Carlo estimate (5.30) to com-
pute stochastic gradient updates. This gradient estimator is also known
as the score function estimator [Kleijnen and Rubinstein, 1996, Gelman
and Meng, 1998] and REINFORCE gradient [Williams, 1992]. The vari-
ance of these updates, and hence the convergence of the overall stochas-
tic gradient algorithm, depends both on the sizes of the gradients of
the variational log density and on the variance of q(θ). Large variance
in the gradient estimates can lead to very slow optimization, and so
Ranganath et al. [2014] proposes and evaluates two variance reduc-
tion schemes, including a control variate method as well as a Rao-
Blackwellization method that can exploit factorization structure in the
variational family. The control variate method for variance reduction
is also studied in Paisley et al. [2012].
Algorithm 18 Minibatch Black-Box Variational Inference (BBVI)
Initialize η̃θ^(0)
for t = 0, 1, 2, . . . do
    S ← {θ̂s} where θ̂s ∼ q( · | η̃θ^(t))
    k̂ ∼ Uniform({1, 2, . . . , K})
    η̃θ^(t+1) ← η̃θ^(t) + (1/|S|) ∑_{θ̂∈S} ( log ( p(θ̂) / q(θ̂) ) + K log p(y^(k̂) | θ̂) ) ∇_{η̃θ} log q(θ̂)

To provide a scalable version of BBVI, gradients can be fur-
ther approximated by subsampling minibatches of data. That is,
using log p(θ, y) = log p(θ) + ∑_{k=1}^{K} log p(y^(k) | θ), we write (5.29) and

(5.30) as

    ∇_{η̃θ} L = E_{k̂}[ E_{q(θ)}[ ( log ( p(θ) / q(θ) ) + K log p(y^(k̂) | θ) ) ∇_{η̃θ} log q(θ) ] ]   (5.31)

             ≈ (1/|S|) ∑_{θ̂∈S} ( log ( p(θ̂) / q(θ̂) ) + K E_{k̂}[ log p(y^(k̂) | θ̂) ] ) ∇_{η̃θ} log q(θ̂)   (5.32)

with the minibatch index k̂ distributed uniformly over {1, 2, . . . , K}


and the minibatches are assumed to be the same size for simpler nota-
tion. This subsampling over minibatches further increases the variance
of the updates and thus may further limit the rate of convergence of
the algorithm. We summarize this version of the BBVI algorithm in
Algorithm 18.
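
The following Python/NumPy sketch, added for illustration, runs the minibatch
BBVI update of Algorithm 18 with a Gaussian variational family over a scalar
parameter and the score function estimator of (5.32); the normal prior and
likelihood, step size, and Monte Carlo sample count are assumptions, and no
variance reduction is applied, so convergence is slow in practice.

    import numpy as np

    def log_prior(theta):                     # assumed N(0, 10^2) prior on theta
        return -0.5 * theta**2 / 100.0

    def log_lik(y, theta):                    # assumed N(theta, 1) likelihood
        return -0.5 * np.sum((y - theta)**2)

    def bbvi(minibatches, num_iters, num_mc=10, step=1e-3):
        # Minibatch BBVI (Algorithm 18) with a N(m, s^2) variational family,
        # using the score function gradient estimator of Equation 5.32.
        K = len(minibatches)
        m, log_s = 0.0, 0.0
        for _ in range(num_iters):
            y = minibatches[np.random.randint(K)]
            grad_m, grad_log_s = 0.0, 0.0
            for _ in range(num_mc):
                s = np.exp(log_s)
                theta = m + s * np.random.randn()             # theta ~ q
                log_q = -0.5 * ((theta - m) / s)**2 - log_s - 0.5 * np.log(2 * np.pi)
                weight = log_prior(theta) + K * log_lik(y, theta) - log_q
                score_m = (theta - m) / s**2                  # d log q / d m
                score_log_s = ((theta - m) / s)**2 - 1.0      # d log q / d log s
                grad_m += weight * score_m / num_mc
                grad_log_s += weight * score_log_s / num_mc
            m += step * grad_m
            log_s += step * grad_log_s
        return m, np.exp(log_s)

    data = np.random.randn(2000) + 1.0
    m, s = bbvi(np.array_split(data, 200), num_iters=5000)
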
It is instructive to compare the fully general BBVI algorithm ap-
plied to hierarchical models to the SVI algorithm of Section 5.1.1; this
comparison not only shows the benefits of exploiting conjugacy struc-
ture but also suggests a potential Rao-Blackwellization scheme. Tak-
ing θ = (φ, z) and q(θ) = q(φ)q(z) and starting from (5.22) and (5.29),
we can write the gradient as
p(φ, z, y)
 
∇e
ηθ L = Eq(φ)q(z) log ∇ηφ log q(φ)q(z) (5.33)
q(φ)q(z) e
!
1 X p(φ̂, z, y)
≈ Eq(z) log − Eq(z) log q(z) ∇e
ηφ log q(φ̂)
|S| q(φ̂)
φ̂∈S
(5.34)
iid
where S is a set of samples with φ̂ ∼ q(φ) for φ̂ ∈ S. Thus if the
entropy of the local variational distribution q(z) and the expectations
214 Scaling variational algorithms

with respect to q(z) of the log density log p(φ̂, z, y) can be computed
without resorting to Monte Carlo estimation, then the resulting update
would likely have a lower variance than the BBVI update that requires
sampling over both q(φ) and q(z).
This comparison also makes clear the advantages of exploiting con-
jugacy in SVI: when the updates of Section 5.1.1 can be used, nei-
ther q(φ) nor q(z) needs to be sampled. Furthermore, while BBVI uses
stochastic gradients in its updates, the SVI algorithm of Section 5.1.1
uses stochastic natural gradients, adapting to the local curvature of the
variational family. Computing stochastic natural gradients in BBVI
would require both computing the Fisher information matrix of the
variational family and solving a linear system with it.
The main weakness of the score function (REINFORCE) gradient
estimates used in BBVI is their high variance, a weakness that scales
poorly with dimensionality. Next we study an alternative gradient es-
timator that applies to some nonconjugate models but can yield lower
variance estimates.

5.1.3 Exploiting reparameterization for some nonconjugate models


While the score function estimator of BBVI in Section 5.1.2 is suf-
ficiently general to handle essentially any model, some nonconjugate
models admit convenient stochastic gradient estimators that can have
lower variance. In particular, when the latent variables are continuous
(or any discrete latent variables that can be marginalized efficiently),
samples from some variational distributions can be reparameterized in
a way that enables an alternative stochastic gradient estimator. This
technique was described in Williams [1992, Section 7.2] and is related
to non-centered reparameterizations [Papaspiliopoulos et al., 2007]. In
the context of variational inference it is referred to as the reparameter-
ization trick [Kingma and Welling, 2014, Rezende et al., 2014].
Consider again the variational density q(θ) with parameter η̃θ. The
reparameterization trick applies when the random variable θ ∼ q(θ) can
be written as a deterministic function of the parameter η̃θ and another
random variable ε with a distribution that does not depend on η̃θ,

    θ = f(η̃θ, ε),                                                             (5.35)

where ε ∼ p(ε) and the partial derivative ∇_{η̃θ} f(η̃θ, ε) exists and can be
computed efficiently for almost every value of ε. Using this reparame-
terization, we can write the variational objective as

    L(η̃θ) = E_{q(θ)}[ log ( p(θ, y) / q(θ) ) ] = E_{p(ε)}[ log ( p(f(η̃θ, ε), y) / q(f(η̃θ, ε)) ) ].   (5.36)
That is, the distribution over which the expectation is taken does not
depend on η̃_θ, and so when we can exchange differentiation and expectation³ we have
\[ \nabla \mathcal{L}(\tilde\eta_\theta) = \mathbb{E}_{p(\epsilon)}\left[ \nabla_{\tilde\eta_\theta} \log \frac{p(f(\tilde\eta_\theta, \epsilon), y)}{q(f(\tilde\eta_\theta, \epsilon))} \right] \quad (5.38) \]
\[ \approx \frac{1}{|S|} \sum_{\hat\epsilon \in S} \nabla_{\tilde\eta_\theta} \log \frac{p(f(\tilde\eta_\theta, \hat\epsilon), y)}{q(f(\tilde\eta_\theta, \hat\epsilon))}, \quad (5.39) \]
where ε̂ ∼ p(ε) i.i.d. for each ε̂ ∈ S. This Monte Carlo approximation gives
an unbiased estimate of the gradient of the variational objective, and it
often has lower variance than the more general score function estimator
[Kingma and Welling, 2014]. Moreover, it can be straightforward to
compute using automatic differentiation tools [Kucukelbir et al., 2015,
Duvenaud and Adams, 2015]. There are several proposed improvements
to this estimator based on variance reduction and alternative Monte
Carlo objectives [Mnih and Gregor, 2014, Burda et al., 2016, Mnih and
Rezende, 2016].
As a concrete example, if q(θ) is a multivariate Gaussian density
with parameters η̃_θ = (µ, Σ), then we can take ε ∼ N(0, I) and write
the reparameterization as
\[ f(\mu, \Sigma, \epsilon) = \mu + \Sigma^{1/2} \epsilon, \quad (5.40) \]
where Σ^{1/2} is a matrix satisfying Σ^{1/2}(Σ^{1/2})^T = Σ.
3 For example, Leibniz’s rule states that given a function f : X × Ω → R where
X ⊂ R is open, if f(x, ω) is a Lebesgue-integrable function of ω for each x ∈ X,
the partial derivative ∂f(x, ω)/∂x exists for almost all ω, and there is an
integrable function G : Ω → R such that |∂f(x, ω)/∂x| ≤ G(ω), then
\[ \frac{d}{dx} \int_\Omega f(x, \omega)\, d\omega = \int_\Omega \frac{\partial}{\partial x} f(x, \omega)\, d\omega. \quad (5.37) \]
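To make the reparameterized gradient estimator (5.39) concrete in the Gaussian case (5.40), the following is a minimal Python/NumPy sketch, assuming a diagonal Gaussian variational family and a toy isotropic Gaussian target whose log-density gradient is available in closed form; the entropy gradient is used in closed form rather than sampled, a common simplification, and all function and variable names are illustrative rather than taken from the text.

```python
import numpy as np

def grad_log_target(theta, prec=2.0):
    # Gradient of log p(theta, y) for a toy isotropic Gaussian target N(0, (1/prec) I).
    return -prec * theta

def elbo_grad(mu, log_sigma, num_samples=100, rng=np.random.default_rng(0)):
    # Monte Carlo reparameterization gradient of the ELBO, as in (5.39), for
    # q(theta) = N(mu, diag(sigma^2)) with theta = f(eta, eps) = mu + sigma * eps.
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((num_samples, mu.size))   # eps ~ p(eps) = N(0, I)
    theta = mu + sigma * eps
    g = grad_log_target(theta)                          # d log p / d theta
    grad_mu = g.mean(axis=0)                            # d theta / d mu = I
    # d theta / d log_sigma = sigma * eps; the +1 is the closed-form entropy gradient.
    grad_log_sigma = (g * eps * sigma).mean(axis=0) + 1.0
    return grad_mu, grad_log_sigma

mu, log_sigma = np.zeros(2), np.zeros(2)
for _ in range(500):                                    # stochastic gradient ascent
    gmu, gls = elbo_grad(mu, log_sigma)
    mu += 0.05 * gmu
    log_sigma += 0.05 * gls
print(mu, np.exp(2 * log_sigma))                        # approaches mean 0, variance 0.5
```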
5.2 Streaming variational Bayes (SVB)

Streaming variational Bayes (SVB) provides an alternative framework


in which to derive minibatch-based scalable variational inference [Brod-
erick et al., 2013]. While the methods of Section 5.1 generally ap-
ply stochastic gradient optimization algorithms to a fixed variational
mean field objective, SVB instead considers the streaming data set-
ting, in which case there may be no fixed dataset size and hence no
fixed variational objective. To handle streaming data, the SVB ap-
proach is based on the classical idea of Bayesian updating, in which
a posterior is updated to reflect new data as they become available.
This sequence of posteriors is approximated by a sequence of varia-
tional models, and each variational model is computed from the pre-
vious variational model via an incremental update on new data. Tank
et al. [2015] develops a related streaming variational inference algo-
rithm that applies to Bayesian nonparametric models using ideas from
assumed density filtering [Minka, 2001]. Another streaming data ap-
proach, which we do not discuss here, is studied in McInerney et al.
[2015].
More concretely, given a prior p(θ) over a parameter θ and a (possi-
bly infinite) sequence of data minibatches y (1) , y (2) , . . ., each distributed
independently according to a likelihood distribution p(y (k) | θ), we con-
sider the sequence of posteriors

p(θ | y (1) , . . . , y (t) ), t = 1, 2, . . . . (5.41)

Given an approximation updating algorithm A one can compute a cor-


responding sequence of approximations
 
\[ p(\theta \mid y^{(1)}, \ldots, y^{(t)}) \approx q_t(\theta) = A\big(y^{(t)}, q_{t-1}(\theta)\big), \qquad t = 1, 2, \ldots \quad (5.42) \]

with q_0(θ) = p(θ). This sequential updating view naturally suggests an


online or one-pass algorithm in which the update (5.42) is applied suc-
cessively to each of a sequence of minibatches.
A sequence of such updates may also exploit parallel or distributed
computing resources. For example, the sequence of approximations may
5.2. Streaming variational Bayes (SVB) 217

be computed as
\[ p(\theta \mid y^{(1)}, \ldots, y^{(K_t)}) \approx q_t(\theta) \quad (5.43) \]
\[ = q_{t-1}(\theta) \prod_{k=K_{t-1}+1}^{K_t} \left[ A\big(y^{(k)}, q_{t-1}(\theta)\big)\, q_{t-1}(\theta)^{-1} \right] \quad (5.44) \]
where K_{t−1} + 1, K_{t−1} + 2, . . . , K_t index a set of data minibatches for


which each update is computed in parallel before being combined in
the final update from qt−1 (θ) to qt (θ).
This combination of partial results is especially appealing when the
prior p(θ) and the family of approximating distributions q(θ) are in the
same exponential family,

\[ p(\theta) \propto \exp\{\langle \eta, t(\theta) \rangle\} \qquad q_0(\theta) \propto \exp\{\langle \tilde\eta_0, t(\theta) \rangle\} \quad (5.45) \]
\[ q_t(\theta) = A(y^{(k)}, q_{t-1}(\theta)) \propto \exp\{\langle \tilde\eta_t, t(\theta) \rangle\} \quad (5.46) \]

for a prior natural parameter η and a sequence of variational parameters η̃_t. In the exponential family case, the updates (5.44) can be

eters ηet . In the exponential family case, the updates (5.44) can be
written

\[ p(\theta \mid y^{(1)}, \ldots, y^{(K_t)}) \approx q_t(\theta) \propto \exp\{\langle \tilde\eta_t, t(\theta) \rangle\} \quad (5.47) \]
\[ = \exp\Big\{ \big\langle \tilde\eta_{t-1} + \textstyle\sum_k (\tilde\eta_k - \tilde\eta_{t-1}),\; t(\theta) \big\rangle \Big\} \quad (5.48) \]

where we may take the algorithm A to return an updated natural parameter, η̃_k = A(y^{(k)}, η̃_t).
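As an illustration of the combined update (5.48), the sketch below applies it to a toy conjugate model, a Gaussian mean with a N(0, 1) prior and unit-variance observations, for which the updating algorithm A is exact conjugate updating of the natural parameters; the model and names are illustrative choices, not part of the SVB framework itself.

```python
import numpy as np

def update_A(y_batch, eta):
    # A(y^(k), eta): exact conjugate update of the natural parameters
    # (eta1, eta2) = (mu/var, -1/(2 var)); each y_i contributes (y_i, -1/2).
    return np.array([eta[0] + y_batch.sum(), eta[1] - 0.5 * len(y_batch)])

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=1000)
batches = np.split(data, 10)

eta_prior = np.array([0.0, -0.5])             # natural parameters of the N(0, 1) prior
eta = eta_prior.copy()

# one "parallel" round: compute all increments against the same q_{t-1}, then combine
increments = [update_A(b, eta) - eta for b in batches]
eta = eta + np.sum(increments, axis=0)        # the combination step (5.48)

mean, var = -eta[0] / (2 * eta[1]), -1.0 / (2 * eta[1])
print(mean, var)                              # exact posterior: mean near 3, var = 1/1001
```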
Finally, similar updates can be performed in an asynchronous dis-
tributed master-worker setting. Each worker can process a minibatch
and send the corresponding natural parameter increment to a mas-
ter process, which updates the global variational parameter and trans-
mits back the updated variational parameter along with a new data
minibatch. In symbols, we can write that a worker operating on mini-
batch y^{(k)} for some minibatch index k computes the update increment ∆η̃_k according to
\[ \Delta\tilde\eta_k = A(y^{(k)}, \tilde\eta_{\tau(k)}) \quad (5.49) \]
Algorithm 19 Streaming Variational Bayes (SVB)
  Initialize η̃_0
  for each worker p = 1, 2, . . . , P do
    Send task (y^{(p)}, η̃_0) to worker p
  as workers send updates do
    Receive update ∆η̃_k from worker p
    η̃_{t+1} ← η̃_t + ∆η̃_k
    Retrieve new data minibatch index k′
    Send new task (y^{(k′)}, η̃_{t+1}) to worker p

Algorithm 20 SVB Worker Process
  repeat
    Receive task (y^{(k)}, η̃_t) from master
    ∆η̃_k ← A(y^{(k)}, η̃_t) − η̃_t
    Send update ∆η̃_k to master
  until no tasks remain

where τ (k) is the index of the global variational parameter used in the
worker’s computation. Upon receiving an update, the master updates
its global variational parameter synchronously according to

\[ \tilde\eta_{t+1} = \tilde\eta_t + \Delta\tilde\eta_k. \quad (5.50) \]

We summarize a version of this process in Algorithms 19 and 20.
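A thread-based sketch of the master-worker pattern in Algorithms 19 and 20 is shown below for the same toy conjugate Gaussian model used above; in this conjugate case the increment happens not to depend on the possibly stale parameter snapshot, but the communication pattern, with increments applied as they arrive via (5.50), is the same. This is an illustrative sketch, not a distributed implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker(batch, eta_snapshot):
    # Delta eta_k = A(y^(k), eta_tau(k)) - eta_tau(k); for this conjugate model the
    # increment does not depend on the snapshot, but we keep the signature of (5.49).
    return np.array([batch.sum(), -0.5 * len(batch)])

rng = np.random.default_rng(2)
data = rng.normal(3.0, 1.0, size=1000)
batches = np.split(data, 10)

eta = np.array([0.0, -0.5])                    # global variational parameter (N(0, 1) prior)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker, b, eta.copy()) for b in batches]
    for fut in as_completed(futures):          # master applies updates as they arrive
        eta = eta + fut.result()               # Eq. (5.50)

print(-eta[0] / (2 * eta[1]), -1.0 / (2 * eta[1]))   # posterior mean and variance
```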


A related algorithm, which we do not detail here, is Memoized Vari-
ational Inference (MVI) [Hughes and Sudderth, 2013, Hughes et al.,
2015]. While this algorithm is designed for the fixed dataset setting
rather than the streaming setting, the updates can be similar to those
of SVB. In particular, MVI optimizes the mean field objective in the
conjugate exponential family setting using the mean field coordinate
descent algorithm but with an atypical update order, in which only
some local variational factors are updated at a time. This update order
enables minibatch-based updating but, unlike the stochastic gradient
algorithms, does not optimize out the other local variational factors
not included in the minibatch and instead leaves them fixed.
5.3 Scalable expectation propagation

Expectation propagation (EP) is another variational inference strat-


egy distinct from mean field methods. As discussed in Section 2.4, the
EP algorithm is a Lagrangian method and is not guaranteed to make
progress on its objective or any other merit function, and as a result
its stability is more difficult to analyze than that of mean field meth-
ods. However, EP can be very effective in many problems and applica-
tions [Minka, 2001, Murphy, 2012], especially in the important domain
of approximate inference in Gaussian Processes (GPs), which motivates
investigating new versions of EP that are particularly scalable to large
datasets.
Here we summarize two recent approaches to developing scalable
EP algorithms: Parallel EP (PEP) and Stochastic EP (SEP). Parallel
EP [Gelman et al., 2014b, Xu et al., 2014, Hasenclever et al., 2016]
distributes factors across the dataset and runs standard EP updates
in parallel, making the basic strategy analogous to the parallel MCMC
algorithms of Chapter 4. As a result, PEP may be especially relevant
for complex models with many interactions, such as hierarchical mod-
els. Stochastic EP [Li et al., 2015] instead updates a global approx-
imation using stochastic updates on minibatches, and is hence more
analogous to the minibatch-based algorithms of Chapter 3 and Sec-
tions 5.1 and 5.2. Hence SEP provides an EP-based alternative to the
stochastic-update mean field algorithms like SVI and BBVI.

5.3.1 Parallel expectation propagation (PEP)

Gelman et al. [2014b] and Xu et al. [2014] observe that EP provides


a natural way to incorporate parallel computation, and in particular
that for hierarchical models the cavity distribution idea allows parallel
workers to efficiently communicate their inferences and inform each
other. Here we develop the resulting parallel EP algorithm following
Gelman et al. [2014b, Section 3] using the hierarchical model notation
developed in Section 5.1.1 and Figure 5.1. Because these EP methods
can use MCMC for local inference, these methods are strongly related
to the parallel MCMC algorithms developed in Chapter 4.
For some dataset y = {y_k}_{k=1}^K partitioned into K subsets, consider
the hierarchical joint model
\[ p(\phi, z, y) = p(\phi) \prod_{k=1}^{K} p(y_k, z_k \mid \phi), \quad (5.51) \]

with the target posterior distribution defined as either p(φ, z | y) or


p(φ | y). Consider a corresponding approximating distribution on the
global parameter, q(φ), written as
\[ q(\phi) \propto p(\phi) \prod_{k=1}^{K} q_k(\phi). \quad (5.52) \]

As in Section 2.4, we can define the cavity distribution q¬k (φ) as


\[ q_{\neg k}(\phi) \propto \frac{q(\phi)}{q_k(\phi)}. \quad (5.53) \]
Using the hierarchical model structure, we define the local tilted distribution p̃_k(φ, z_k) on both the global parameter φ and the local latent
variable z_k as
\[ \tilde p_k(\phi, z_k) \propto q_{\neg k}(\phi)\, p(y_k, z_k \mid \phi), \quad (5.54) \]
so that the tilted distribution on the global parameter, denoted p̃_k(φ),
is simply the appropriate marginal of Eq. (5.54). Note that the cav-
ity distribution q¬k (φ) acts as a kind of local prior, summarizing the
influence of the model prior p(φ) as well as the influence of the other
data. The moments of p̃_k(φ) can then be estimated with an appropri-
ate inference method, such as MCMC or a local EP algorithm, and
we can update the parameters of the factor qk (φ) so that the resulting
q(φ) ∝ qk (φ)q¬k (φ) matches the estimated moments.
A serial EP algorithm would run these updates for each factor
qk (φ) sequentially and, after each update, the global approximation
q(φ) would be updated as q(φ) ∝ qk (φ)q¬k (φ). However, Gelman et al.
[2014b] instead suggest running these updates in parallel, setting the
global factor to q(φ) ∝ p(φ) ∏_k q_k(φ) after each round of updates. Because the bulk of the computational effort is in estimating the moments


of p̃_k(φ) and setting the parameters of q_k(φ) so as to match those mo-
ments, this strategy parallelizes the most expensive part of inference
Algorithm 21 Parallel Expectation Propagation (PEP)
  Input: Joint distribution p(φ, z, y) = p(φ) ∏_{k=1}^K p(y_k, z_k | φ),
    parameterized approximating family q(φ) = ∏_{k=1}^K q_k(φ)
  Output: Approximate marginal posterior q(φ) ≈ p(φ | y)
  Initialize parameters of each approximating factor q_k(φ)
  for t = 1, 2, . . . until convergence do
    for k = 1, 2, . . . , K in parallel do
      Define cavity distribution q_{¬k}(φ) ∝ q(φ)/q_k(φ)
      Define local tilted distribution p̃_k(φ, z_k) ∝ q_{¬k}(φ) p(y_k, z_k | φ)
      Set parameters of q_k(φ) to minimize KL(p̃_k(φ) ‖ q(φ))
        by computing and matching marginal moments
    Synchronize the updated approximation q(φ) ∝ p(φ) ∏_{k=1}^K q_k(φ)

while only requiring the parameters of the qk (φ) to be communicated.


At convergence, the posteriors of interest are approximated as
\[ p(\phi \mid y) \approx q(\phi) = p(\phi) \prod_{k=1}^{K} q_k(\phi) \quad (5.55) \]
\[ p(\phi, z \mid y) \approx q(\phi) \prod_{k=1}^{K} p(z_k \mid \phi)\, p(y_k \mid z_k, \phi). \quad (5.56) \]

The algorithm is summarized in Algorithm 21.
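To illustrate the structure of Algorithm 21, here is a minimal sketch for a toy setting with no local latent variables and closed-form moments: a scalar Gaussian mean φ with a N(0, 10) prior and unit-variance observations split into K subsets. Because each subset likelihood is exactly Gaussian, the moment-matching step is exact and the loop effectively converges after one round; in realistic models the tilted moments would come from MCMC or a nested approximation. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
data = np.split(rng.normal(2.0, 1.0, size=800), K)

r0, b0 = 1.0 / 10.0, 0.0                    # N(0, 10) prior as (precision, precision * mean)
r_site, b_site = np.zeros(K), np.zeros(K)   # site factors q_k(phi), initialized flat

for _ in range(5):                    # outer rounds of Algorithm 21
    r_new, b_new = np.empty(K), np.empty(K)
    for k in range(K):                # these K updates could run in parallel
        # cavity q_{-k}(phi): current global approximation with site k removed
        r_cav = r0 + r_site.sum() - r_site[k]
        b_cav = b0 + b_site.sum() - b_site[k]
        # tilted distribution: cavity times the exact likelihood of subset k
        yk = data[k]
        r_tilt, b_tilt = r_cav + len(yk), b_cav + yk.sum()
        # moment matching: new site = tilted / cavity (natural parameters subtract)
        r_new[k], b_new[k] = r_tilt - r_cav, b_tilt - b_cav
    r_site, b_site = r_new, b_new     # synchronize q(phi) after the round

r_post, b_post = r0 + r_site.sum(), b0 + b_site.sum()
print(b_post / r_post, 1.0 / r_post)  # matches the exact posterior mean and variance
```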


Note that mean field algorithms offer similar opportunities for par-
allelization; if the mean field family is factorized as q(φ) ∏_k q(z_k) then
updates to the local factors q(zk ) can be performed in parallel for a
fixed global factor q(φ). However, as with comparing EP with mean
field methods in the sequential setting, in some cases EP can provide
more accurate posterior approximations, and so it is useful to develop
a corresponding parallelization strategy for EP. Gelman et al. [2014b]
also study several algorithmic considerations that arise, including sta-
bility, approximate inference subroutine alternatives, and ways to reuse
some parts of the computation.
While Algorithm 21 uses a global synchronization step for commu-
nication, the EP framework allows more flexible communication strate-
gies based on message passing. Xu et al. [2014] also develop a paral-
lel EP algorithm, called sampling via moment sharing (SMS), that


uses MCMC for local inference, and the authors highlight that the
processors can even communicate their local moment statistics asyn-
chronously.
Hasenclever et al. [2016] further develop these ideas, noting that
while PEP and SMS work well for simpler models like logistic regression and hierarchical linear models, these strategies are less effective
on complex models like Bayesian deep neural networks.
They instead propose stochastic natural-gradient EP (SNEP), which
leverages exponential family structure and a locally-convergent version
of EP [Heskes and Zoeter, 2002] as well as both asynchronous commu-
nication and a posterior server.

5.3.2 Stochastic expectation propagation (SEP)

While the parallel EP algorithm of the preceding section uses the struc-
ture of EP to construct a parallel algorithm, Stochastic Expectation
Propagation (SEP) [Li et al., 2015] instead builds an EP algorithm with
updates that can be computed using only one minibatch of data at a
time. Thus SEP is closer to an EP analog of the other minibatch-based
variational inference algorithms described in Sections 5.1.1 and 5.2, and
has similar advantages. Specifically, parallel EP still requires the pa-
rameters of the K approximating factors to be stored in (distributed)
memory, and also requires global communication to synchronize the ap-
proximating distribution q(φ). In contrast, SEP needs only an amount
of memory that is constant with respect to the size of the dataset
and only performs local computations. However, it also makes stronger
approximating assumptions which might make it less appropriate for
complex hierarchical models.
For some dataset y = {y_k}_{k=1}^K partitioned into K minibatches, consider the joint model
\[ p(\theta, y) = p(\theta) \prod_{k=1}^{K} p(y_k \mid \theta). \quad (5.57) \]
In standard EP, the posterior would be approximated with a factorized


distribution as
\[ q(\theta) \propto p(\theta) \prod_{k=1}^{K} q_k(\theta). \quad (5.58) \]
At convergence, the product of the q_k(θ) factors is meant to approximate the effect of the likelihood terms, so that ∏_k q_k(θ) ≈ ∏_k p(y_k | θ).

The key idea in SEP is that we can instead directly parameterize a


factor f (θ) that models the (geometric) average effect of the likelihood
terms, writing
\[ q(\theta) \propto p(\theta) f(\theta)^K, \quad (5.59) \]
so that f(θ)^K ≈ ∏_k q_k(θ). We can then compute stochastic updates to
the factor f(θ) directly using data minibatches.


The SEP update proceeds on the tth iteration by sampling a minibatch index k, then defining a cavity distribution q_{¬k}(θ) ∝ p(θ)f(θ)^{1−1/K}
and tilted distribution p̃_k(θ) ∝ p(y_k | θ) q_{¬k}(θ). Next, an intermediate
factor f_k(θ) parameterized like f(θ) is chosen to approximately minimize the KL divergence from the tilted distribution p̃_k(θ) to the distribution proportional to q_{¬k}(θ)f_k(θ). As with the other EP algorithms,
this step is carried out using an inference subroutine to compute moments of the tilted distribution and then setting the parameters of f_k(θ)
to approximately match those moments. Finally, the average likelihood
factor f(θ) is updated according to f(θ) ∝ f(θ)^{1−ε_t} f_k(θ)^{ε_t} for some step
size ε_t, with one natural choice being ε_t = 1/K. The resulting algorithm
is summarized in Algorithm 22.
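The same toy setting gives a compact sketch of the SEP loop: a scalar Gaussian mean θ with a N(0, 10) prior and unit-variance observations in K minibatches, with a single Gaussian average-likelihood factor f(θ) held as natural parameters. Because each minibatch likelihood is exactly Gaussian here, the fitted intermediate factor f_k equals the exact minibatch likelihood whatever the cavity, so the sketch only illustrates the update pattern; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
data = np.split(rng.normal(2.0, 1.0, size=800), K)

r0, b0 = 1.0 / 10.0, 0.0            # N(0, 10) prior as (precision, precision * mean)
r_f, b_f = 0.0, 0.0                 # average likelihood factor f(theta), initially flat
step = 1.0 / K                      # step size eps_t = 1/K

for t in range(500):
    k = rng.integers(K)                          # sample a minibatch index
    r_cav = r0 + (1.0 - 1.0 / K) * r_f           # cavity distribution
    b_cav = b0 + (1.0 - 1.0 / K) * b_f
    yk = data[k]                                 # tilted: cavity times exact likelihood
    r_tilt, b_tilt = r_cav + len(yk), b_cav + yk.sum()
    r_k, b_k = r_tilt - r_cav, b_tilt - b_cav    # moment-matched intermediate factor f_k
    r_f = (1 - step) * r_f + step * r_k          # damped update of f(theta)
    b_f = (1 - step) * b_f + step * b_k

r_post, b_post = r0 + K * r_f, b0 + K * b_f      # q(theta) built from p(theta) f(theta)^K
print(b_post / r_post, 1.0 / r_post)             # fluctuates near the exact posterior
```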

5.4 Summary

Stochastic gradients vs. streaming. The methods of Section 5.1 ap-


ply stochastic optimization to variational mean field inference objec-
tives. In optimization literature and practice, stochastic gradient meth-
ods have a large body of both theoretical and empirical support, and
so these methods offer a compelling framework for scalable inference.
The streaming ideas surveyed in Section 5.2 are less well understood,
but by treating the streaming setting, rather than the setting of a large
fixed-size dataset, they may be more relevant for some applications.
Algorithm 22 Stochastic Expectation Propagation (SEP)
  Input: Joint distribution p(θ, y) = p(θ) ∏_{k=1}^K p(y_k | θ), parameterized
    average likelihood factor f(θ), step size sequence (ε_t)_{t=1}^∞
  Output: Approximate posterior q(θ) = p(θ)f(θ)^K
  Initialize parameters of the average likelihood factor f(θ)
  for t = 1, 2, . . . until convergence do
    Choose an observation index k from {1, 2, . . . , K}
    Define cavity distribution q_{¬k}(θ) ∝ p(θ)f(θ)^{1−1/K}
    Define tilted distribution p̃_k(θ) ∝ p(y_k | θ) q_{¬k}(θ)
    Set parameters of f_k(θ) to minimize KL(p̃_k(θ) ‖ q_{¬k}(θ)f_k(θ))
      by computing and matching moments
    Update f(θ) ∝ f(θ)^{1−ε_t} f_k(θ)^{ε_t}

Minibatching vs data parallelism. All of this chapter’s scalable ap-


proaches to mean field variational inference, as well as Stochastic Ex-
pectation Propagation’s approach to EP inference, are based on pro-
cessing minibatches of data, analogous to the MCMC methods of Chap-
ter 3. Minibatches are motivated in three ways: through conditionally-
i.i.d. models and stochastic gradient optimization (§5.1), through
streaming data (§5.2), and through the factor updates of EP (§5.3.2).
In both SVI and BBVI, stochastic gradients arise from randomly sam-
pling data minibatches, while in BBVI there is additional stochasticity
due to the Monte Carlo approximation required to handle nonconju-
gate structure. In contrast to SVI and BBVI, SVB (§5.2) processes
data minibatches to drive incremental posterior updates, constructing
a sequence of approximate posterior distributions that correspond to
classical sequential Bayesian updating without having a single fixed ob-
jective to optimize. SEP also directly defines an algorithm with mini-
batch updates. The only method in this chapter that exploits data
parallelism and distributed computation instead of minibatch data ac-
cess is Parallel EP (§5.3.1), making it more analogous to the MCMC
methods of Chapter 4.
Generality, requirements, and assumptions. As with most ap-


proaches to scaling MCMC samplers for Bayesian inference,
the minibatch-based variational inference methods depend on
conditionally-i.i.d. model structure. In SVI, BBVI, and SEP mini-
batches map to terms in a factorization of the joint probability. In
SVB, minibatches map to a sequence of likelihoods to be incorporated
into the variational posterior. Some of these methods further depend on
and exploit exponential family and conjugacy structure. SVI is based
on complete-data conjugacy, while BBVI was specifically developed for
nonconjugate models. SVB is a general framework, but in the conju-
gate exponential family case the updates can be written in terms of
simple updates to natural parameters. A direction for future research
for mean field methods might be to develop new methods based on
identifying and exploiting some ‘middle ground’ between the structural
requirements of SVI and BBVI, or similarly of SVB with and without
exponential family structure.

5.5 Discussion

Model complexity and data redundancy. As with many other scal-


able inference methods, the variational inference strategies discussed in
this chapter are likely to work very well in the “tall data” regime, where
the data are modeled as conditionally independent given relatively sim-
ple global parameters. However, it is less clear how these methods can
be expected to work on complex models. In particular, the minibatch-
based algorithms rely to some extent on data redundancy, so that a
sampled minibatch can be used to make global inferences. However,
the optimization perspective provided by variational inference gives us
a clearer understanding of these minibatch effects than in the case of
MCMC. Because variational inference algorithms generally only need
to find local optima, and we don’t require algorithm trajectories to
satisfy the stationary distribution conditions of MCMC, the subsam-
pling errors from minibatches can only lead to slower convergence, at
least in principle. In contrast, the minibatch-based MCMC algorithms
of Chapter 3 introduced new posterior approximation errors. To handle
complex models with less data redundancy, the data-parallel versions of


EP may provide more flexibility: instead of requiring small minibatches
to be conditionally independent, these EP methods may only require
the data subsets on different machines to be conditionally independent,
and those data subsets may be much larger than standard minibatches.

Parallel variants of minibatch updating. The minibatch-based vari-


ational inference methods developed in this chapter suggest parallel
and asynchronous variants. In the case of SVB, distributed and asyn-
chronous versions, such as the master-worker pattern depicted by Algo-
rithms 19 and 20, have been empirically studied [Broderick et al., 2013].
However, we lack theoretical understanding about these procedures,
and it is unclear how to define and track notions of convergence or
stability. Methods based on stochastic gradients, such as SVI, can nat-
urally be extended to exploit parallel and asynchronous (or “Hogwild”)
variants of stochastic gradient ascent. In such parallel settings, these
optimization-based techniques benefit from powerful gradient conver-
gence results [Bertsekas and Tsitsiklis, 1989, Section 7.8], though tun-
ing such algorithms is still a challenge. Other parallel versions of these
ideas and algorithms have also been developed in Campbell and How
[2014] and Campbell et al. [2015].

Inference networks and bridging the gap to MCMC. Recently, new


strategies for variational inference have provided both computational
efficiency and novel variational families. In stochastic variational in-
ference, instead of optimizing the local variational factor q(z (k) ) in
Algorithm 17, the parameters of a (suboptimal) local factor can be
computed from the data minibatch y (k) using a simpler, non-iterative
computation, like the application of a feed-forward neural network.
This strategy of using inference networks [Rezende et al., 2014, Kingma
and Welling, 2014], also referred to as amortized inference, avoids the
iterative local optimization which can be expensive for general noncon-
jugate models and has led to many new ideas based on the variational
autoencoder [Kingma and Welling, 2014]. Another new idea is to pa-
rameterize variational families in terms of a fixed number of MCMC
transition steps [Salimans et al., 2015], “unrolling” the computation and


computing gradients through it to optimize the approximation. These
new ideas, which exploit the convenience of automatic differentiation
and the expressive function approximators provided by deep neural
networks, are likely to further influence scalable variational inference.

Understanding and correcting the biases in variational inference.


Variational methods produce biased estimates of posterior expecta-
tions when the variational family does not include the true posterior.
However, the ways in which these biases affect posterior expectations
of interest are often unclear. Because mean field variational inference
tends to underestimate uncertainty, Giordano et al. [2015] develop the
linear response variational Bayes (LRVB) method to improve poste-
rior uncertainty estimates using a sensitivity analysis of the mean field
optimization problem based on the implicit function theorem. Further
research is needed into how the biases found in variational methods can
be analyzed, corrected, or compared to those of approximate MCMC
algorithms.
6 Challenges and questions

In this review, we have examined a variety of different views on scal-


ing Bayesian inference up to large datasets and greater model com-
plexity and out to parallel compute resources. Several different themes
have emerged, from techniques that exploit subsets of data for com-
putational savings to proposals for distributing inference computations
across multiple machines. Progress is being made, but there remain
significant open questions and outstanding challenges to be tackled as
this research programme moves forward.

Trading off errors in MCMC One of the key insights underpinning


much of the recent work on scaling Bayesian inference can be framed
in terms of a kind of bias-variance tradeoff. Traditional MCMC the-
ory provides asymptotically unbiased estimators for which the error
can eventually be driven arbitrarily small. However, in practice, under
limited computational budgets the error can be significant. This error
has two components: transient bias, in which the samples produced are
too dependent on the Markov chain’s initialization, and Monte Carlo
standard error, in which the samples collected may be too few or too
highly correlated to produce good estimates.


[Figure 6.1: two panels plotted against wall-clock time (log scale); the top panel shows the total variation distance ‖π₀Tⁿ − π‖_TV of the Markov chain iterates, and the bottom panel shows estimator error (log scale) decomposed into transient bias, Monte Carlo standard error, and their total.]
Figure 6.1: A simulation illustrating the error terms in traditional MCMC


estimators as a function of wall-clock time (log scale). The marginal distri-
butions of the Markov chain iterates converge to the target distribution (top
panel), while the errors in MCMC estimates due to transient bias and Monte
Carlo standard error are driven arbitrarily small.

Figure 6.1 illustrates the error regimes and tradeoffs in traditional


MCMC.1 Asymptotic analysis describes the regime on the right of the
plot, after the sampler has mixed sufficiently well. In this regime, the
marginal distribution of each sample is essentially equal to the target
distribution, and the transient bias from initialization, which affects
only the early samples in the Monte Carlo sum, is washed out rapidly
at least at a O(1/n) rate. The dominant source of error is due to Monte
Carlo standard error, which diminishes only at a O(1/√n) rate.
However, machine learning practitioners using MCMC often find
themselves in another regime: in the middle of the plot, the error is
decreasing but dominated instead by the transient bias. The challenge
1 See also Section 2.2.4.
in practice is often to get through this regime, or even to get into it at


all. When the underlying Markov chain does not mix sufficiently well or
when the transitions cannot be computed sufficiently quickly, getting
to this regime may be infeasible for a realistic computational budget.
Several of the new MCMC techniques we have studied aim to ad-
dress this challenge. In particular, the parallel predictive prefetching
method of Section 4.1.2 accelerates this phase of MCMC without af-
fecting the stationary distribution. Other methods instead introduce
approximate transition operators that can be executed more efficiently.
For example, the adaptive subsampling methods of Section 3.2 and the
stochastic gradient sampler of Section 3.4 can execute updates more
efficiently by operating only on data subsets, while the Weierstrass and
Hogwild Gibbs samplers of Sections 4.2.1 and 4.2.2, respectively, exe-
cute more quickly by leveraging parallelism. These transition operators
are approximate in that they do not admit the exact target distribu-
tion as a stationary distribution: instead, the stationary distribution
(when it exists and is unique) is only intended to be close to the tar-
get. Framed in terms of Monte Carlo estimates, these approximations
effectively accelerate the execution of the chain at the cost of introduc-
ing an asymptotic bias. Figure 6.2 illustrates this new tradeoff.
Allowing some asymptotic bias to reduce transient bias or even
Monte Carlo variance is likely to enable MCMC inference at a new
scale. However, both the amount of asymptotic bias introduced by these
methods and the ways in which it depends on model and algorithm pa-
rameters remain unclear. More theoretical understanding and empirical
study is necessary to guide machine learning practice.

Scaling limits of Bayesian inference Scalability in the context of


Bayesian inference is ultimately about spending computational re-
sources to better interrogate posterior distributions. It is therefore im-
portant to consider whether there are fundamental limits to what can
be achieved by, for example, spending more money on Amazon EC2,
for either faster computers or more of them.
In parallel systems, linear scaling is ideal: twice as much compu-
tational power yields twice as much useful work. Unfortunately, even
[Figure 6.2: the same two panels as Figure 6.1, plotted against wall-clock time (log scale), with the estimator error now decomposed into transient bias, Monte Carlo standard error, asymptotic bias, and their total.]
Figure 6.2: A simulation illustrating the new tradeoffs in some proposed


scalable MCMC methods. Compare to Figure 6.1. As a function of wall-clock
time (log scale), the Markov chain iterations execute more than an order
of magnitude faster; however, because the stationary distribution is not the
target distribution, an asymptotic bias remains (top panel). Correspondingly,
MCMC estimator error, particularly the transient bias, can be driven to a
small value more rapidly, but there is an error floor due to the introduction
of the asymptotic bias (bottom panel).

if this lofty parallel speedup goal is achieved, the asymptotic picture


for MCMC is dim: in the asymptotic regime, doubling the number of
samples collected can only reduce the Monte Carlo standard error by a
factor of √2. This scaling means that there are diminishing returns to
purchasing additional computational resources, even if those resources
provide linear speedup in terms of accelerating the execution of the
MCMC algorithm.
Interestingly, variational methods may not suffer from such intrin-
sic limits. Because expectations are computed against the variational
distribution, rather than against collected samples, the rate at which


such approximate expectations can improve is not limited by Monte
Carlo effects.

Measuring performance With all the ideas surveyed here, one thing
is clear: there are many alternatives for how to scale Bayesian inference.
How should we compare these alternative algorithms? Can we tell when
any of these algorithms work well in an absolute sense?
One standard approach for evaluating MCMC procedures is to de-
fine a set of scalar-valued test functions (or estimands of interest) and
compute effective sample size [Gelman et al., 2014a, Section 11.5][Kong
et al., 1994] as a function of wall-clock time. However, in complex
models designing an appropriately comprehensive set of test functions
may be difficult. Furthermore, many such measures require the Markov
chain to mix and do not account for any asymptotic bias [Gorham and
Mackey, 2015], hence limiting their applicability to measuring the per-
formance of many of the new inference methods studied here.
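As a concrete instance of this standard approach, the sketch below estimates the effective sample size of a scalar chain from its empirical autocorrelations, truncating the sum at the first negative term; this truncation rule is one of several common conventions, and the AR(1) example and names are illustrative.

```python
import numpy as np

def effective_sample_size(x):
    # ESS = n / (1 + 2 * sum_k rho_k), truncated at the first negative autocorrelation.
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        rho_sum += acf[k]
    return n / (1 + 2 * rho_sum)

# AR(1) chain with autocorrelation 0.9: ESS should be roughly n (1 - 0.9) / (1 + 0.9).
rng = np.random.default_rng(0)
x = np.zeros(10_000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(effective_sample_size(x))
```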
To confront these challenges, one recently-proposed approach
[Gorham and Mackey, 2015] draws on Stein’s method, classically used
as an analytical tool, to design an efficiently-computable measure of dis-
crepancy between a target distribution and a set of samples. A natural
measure of discrepancy between a target density p(x) and a (weighted)
sample distribution q(x), where q(x) = ∑_{i=1}^n w_i δ_{x_i}(x) for some set of
samples {x_i}_{i=1}^n and weights {w_i}_{i=1}^n, is to consider their largest absolute difference across a large class of test functions:
\[ d_{\mathcal{H}}(q, p) = \sup_{h \in \mathcal{H}} \left| \mathbb{E}_q h(X) - \mathbb{E}_p h(X) \right| \quad (6.1) \]

where H is the class of test functions. While expectations with respect


to the target density p may be difficult to compute, by designing H such
that Ep h(X) = 0 for every h ∈ H, we need only compute expectations
with respect to the sample distribution q. To meet this requirement,
instead of designing H directly, we can instead choose H to be the image
of another function class G under an operator Tp that may depend on
p, so that H = Tp G and the requirement becomes Ep (Tp g)(x) = 0 and
the discrepancy measure becomes

\[ d_{\mathcal{T}_p \mathcal{G}}(q, p) = \sup_{g \in \mathcal{G}} \left| \mathbb{E}_q (\mathcal{T}_p g)(X) \right|. \quad (6.2) \]

Such operators T_p can be designed using infinitesimal generators from
continuous-time ergodic Markov processes, and Gorham and Mackey
[2015] suggest using the operator
\[ (\mathcal{T}_p g)(x) \triangleq \langle g(x), \nabla \log p(x) \rangle + \langle \nabla, g(x) \rangle \quad (6.3) \]

which requires computing only the gradient of the target log density.
Furthermore, while the optimization in (6.2) is infinite-dimensional in
general and might have infinitely many smoothness constraints from G,
Gorham and Mackey [2015] shows that for the sample distribution q
the test function g need only be evaluated at the finitely-many sample
points {xi }ni=1 and that only a small number of constraints must be
enforced. This new performance metric does not require assumptions
on whether the samples are generated from an unbiased, stationary
Markov chain, and so it may provide clear ways to compare across a
broad spectrum of sampling-based approximate inference algorithms.
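As a small illustration of the operator (6.3), the sketch below evaluates E_q[(T_p g)(X)] for a single fixed test function g and a one-dimensional standard normal target: samples from the target give a value near zero, while samples from a shifted, overdispersed distribution do not. The full discrepancy (6.2) additionally optimizes over g, which Gorham and Mackey [2015] show reduces to a finite-dimensional problem; that optimization is omitted here, and all names are illustrative.

```python
import numpy as np

def grad_log_p(x):
    # Target p(x) = N(0, 1), so grad log p(x) = -x.
    return -x

def stein_op(g, grad_g, x):
    # One-dimensional Langevin Stein operator: (T_p g)(x) = g(x) grad log p(x) + g'(x).
    return g(x) * grad_log_p(x) + grad_g(x)

g = lambda x: x                       # a simple fixed test function g(x) = x
grad_g = lambda x: np.ones_like(x)

rng = np.random.default_rng(0)
exact = rng.normal(0.0, 1.0, size=100_000)     # samples from the target
biased = rng.normal(0.3, 1.2, size=100_000)    # samples from a nearby, wrong distribution

print(stein_op(g, grad_g, exact).mean())       # close to zero
print(stein_op(g, grad_g, biased).mean())      # clearly nonzero
```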
Another recently-proposed approach attempts to estimate or bound
the KL divergence from an algorithm’s approximate posterior represen-
tation to the true posterior, at least when applied to synthetic data.
This approach, called bidirectional Monte Carlo (BDMC) [Grosse et al.,
2015], can be applied to measure the performance of both variational
mean field algorithms as well as annealed importance sampling (AIS)
and sequential Monte Carlo (SMC) algorithms. By rearranging the vari-
ational identity (2.50), we can write the KL divergence KL(q ‖ p) from
an approximating distribution q(z, θ) to a target posterior p(z, θ | ȳ) in
terms of the log marginal likelihood log p(ȳ) and an expectation with
respect to q(z, θ):
\[ \mathrm{KL}(q \,\|\, p) = \log p(\bar y) - \mathbb{E}_{q(z,\theta)}\left[ \log \frac{p(z, \theta, \bar y)}{q(z, \theta)} \right]. \quad (6.4) \]
Because the expectation can be readily computed in a mean field set-
ting or stochastically lower-bounded when using AIS [Grosse et al.,
2015, Section 4.1], with a stochastic upper bound on log p(ȳ) we can
use (6.4) to compute a stochastic upper bound on the KL divergence


KL(q ‖ p). BDMC provides a method to compute such stochastic upper
bounds on log p(ȳ) for synthetic datasets ȳ, and so may enable new per-
formance metrics that apply to both sampling-based algorithms as well
as variational mean field algorithms. However, while MCMC transition
operators are used to construct AIS algorithms, BDMC does not di-
rectly apply to evaluating the performance of such transition operators
in standard MCMC inference.
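On synthetic data from a conjugate model the identity (6.4) can be checked directly, since log p(ȳ) is available in closed form; the sketch below, with illustrative names and a deliberately overconfident Gaussian approximation q, compares the exact Gaussian-to-Gaussian KL divergence with a Monte Carlo evaluation of (6.4). In BDMC the exact log p(ȳ) would instead be replaced by stochastic bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=20)      # synthetic data: theta ~ N(0, 1), y_i ~ N(theta, 1)
n = len(y)

# Exact posterior p(theta | y) = N(m, v) and exact log marginal likelihood log p(y).
v = 1.0 / (1.0 + n)
m = v * y.sum()
log_py = (-0.5 * n * np.log(2 * np.pi) + 0.5 * np.log(v)
          - 0.5 * (y ** 2).sum() + 0.5 * m ** 2 / v)

# An approximating q(theta) = N(mq, vq) that underestimates the posterior variance.
mq, vq = m, 0.5 * v

# Direct Gaussian-to-Gaussian KL(q || p).
kl_direct = 0.5 * (vq / v + (mq - m) ** 2 / v - 1.0 + np.log(v / vq))

# Identity (6.4): KL(q || p) = log p(y) - E_q[log p(theta, y) - log q(theta)].
theta = rng.normal(mq, np.sqrt(vq), size=100_000)
log_joint = (-0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2
             - 0.5 * n * np.log(2 * np.pi)
             - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
log_q = -0.5 * np.log(2 * np.pi * vq) - 0.5 * (theta - mq) ** 2 / vq
kl_via_identity = log_py - (log_joint - log_q).mean()

print(kl_direct, kl_via_identity)      # agree up to Monte Carlo error
```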
Developing performance metrics and evaluation procedures is crit-
ical to making progress. As observed in Grosse et al. [2015],

In many application areas of machine learning, especially


supervised learning, benchmark datasets have spurred rapid
progress in developing new algorithms and clever refine-
ments to existing algorithms. [. . . ] So far, the lack of quan-
titative performance evaluations in marginal likelihood es-
timation, and in sampling-based inference more generally,
has left us fumbling around in the dark.

By developing better ways to measure the performance of these


Bayesian inference algorithms, we will be much better equipped to com-
pare, improve, and extend them.
Acknowledgements

This work was funded in part by NSF IIS-1421780 and the Alfred P.
Sloan Foundation. E.A. is supported by the Miller Institute for Basic
Research in Science, University of California, Berkeley. M.J. is sup-
ported by a fellowship from the Harvard/MIT Joint Grants program.

References

Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior
sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th
International Conference on Machine Learning, 2012.
Talal M. Alkhamis, Mohamed A. Ahmed, and Vu Kim Tuan. Simulated
annealing for discrete optimization with estimation. European Journal of
Operational Research, 116(3):530–544, 1999.
Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry.
American Mathematical Society, 2007.
Christophe Andrieu and Eric Moulines. On the ergodicity properties of some
adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):
1462–1505, 2006.
Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach
for efficient Monte Carlo computations. Annals of Statistics, pages 697–725,
2009.
Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC.
Statistics and Computing, 18(4):343–373, 2008.
Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov
chain Monte Carlo methods. Journal of the Royal Statistical Society Series
B, 72(3):269–342, 2010.
Elaine Angelino. Accelerating Markov chain Monte Carlo via parallel predic-
tive prefetching. PhD thesis, School of Engineering and Applied Sciences,
Harvard University, 2014.


Elaine Angelino, Eddie Kohler, Amos Waterland, Margo Seltzer, and Ryan P.
Adams. Accelerating MCMC via parallel predictive prefetching. In 30th
Conference on Uncertainty in Artificial Intelligence, pages 22–31, 2014.
Kenneth J. Arrow, Leonid Hurwicz, Hirofumi Uzawa, H.B. Chenery, S.M.
Johnson, S. Karlin, T. Marschak, and R.M. Solow. Studies in linear and
non-linear programming. Stanford University Press, John Wiley & Sons,
1959.
Arthur U. Asuncion, Padhraic Smyth, and Max Welling. Asynchronous dis-
tributed learning of topic models. In Advances in Neural Information Pro-
cessing Systems 21, pages 81–88, 2008.
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-
exploitation tradeoff using variance estimates in multi-armed bandits. The-
oretical Computer Science, 410(19):1876–1902, 2009.
Rémi Bardenet and Odalric-Ambrym Maillard. Concentration inequalities for
sampling without replacement. Bernoulli, 20(3):1361–1385, 2015.
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. Towards scaling up
Markov chain Monte Carlo: An adaptive subsampling approach. In Pro-
ceedings of the 31st International Conference on Machine Learning, 2014.
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte
Carlo methods for tall data. arXiv preprint 1505.02827, 2015.
Mark A. Beaumont. Estimation of population growth or decline in genetically
monitored populations. Genetics, 164(3):1139–60, 2003.
Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2016.
Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Com-
putation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ,
USA, 1989.
Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian
Monte Carlo and naive data subsampling. In Proceedings of The 32nd
International Conference on Machine Learning, pages 533–540, 2015.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Infor-
mation Science and Statistics). Springer-Verlag New York, Inc., Secaucus,
NJ, USA, 2006.
J. Frederic Bonnans and Alexander Shapiro. Perturbation Analysis of Opti-
mization Problems. Springer Science & Business Media, 2000.
Léon Bottou. On-line learning and stochastic approximations. In David Saad,
editor, On-line Learning in Neural Networks, pages 9–42. Cambridge Uni-
versity Press, New York, NY, USA, 1998.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direc-
tion method of multipliers. Foundations and Trends in Machine Learning,
3(1):1–122, January 2011.
A. E. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-
fetching. Journal of Computational and Graphical Statistics, 15(1):246–261,
March 2006.
Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and
Michael I. Jordan. Streaming variational Bayes. In Advances in Neural
Information Processing Systems 26, pages 1727–1735, 2013.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of
Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern
Statistical Methods. CRC press, 2011.
Akif Asil Bulgak and Jerry L. Sanders. Integrating a modified simulated an-
nealing algorithm with the simulation of a manufacturing system to opti-
mize buffer sizes in automatic assembly systems. In Proceedings of the 20th
Conference on Winter Simulation, pages 684–690, New York, NY, USA,
1988. ACM.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted
autoencoders. International Conference on Learning Representations, 2016.
Jonathan M. R. Byrd, Stephen A. Jarvis, and Abhir H. Bhalerao. Reducing
the run-time of MCMC programs by multithreading on SMP architectures.
In IEEE International Symposium on Parallel and Distributed Processing,
pages 1–8, 2008.
Jonathan M. R. Byrd, Stephen A. Jarvis, and Abhir H. Bhalerao. On the
parallelisation of MCMC by speculative chain execution. In IEEE Inter-
national Symposium on Parallel and Distributed Processing - Workshop
Proceedings, pages 1–8, 2010.
Trevor Campbell and Jonathan P. How. Approximate decentralized Bayesian
inference. In 30th Conference on Uncertainty in Artificial Intelligence,
pages 102–111, 2014.
Trevor Campbell, Julian Straub, John W. Fisher III, and Jonathan P. How.
Streaming, distributed variational inference for Bayesian nonparametrics.
In Advances in Neural Information Processing Systems 28, pages 280–288,
2015.
Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamil-
tonian Monte Carlo. In Proceedings of the 31st International Conference
on Machine Learning, June 2014.

John M. Danskin. The Theory of Max-Min and its Application to Weapons


Allocation Problems. Springer-Verlag, New York, 1967.
Chris De Sa, Kunle Olukotun, and Christopher Ré. Ensuring rapid mixing
and low bias for asynchronous Gibbs sampling. In Proceedings of the 33rd
International Conference on Machine Learning, June 2016.
J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Uncon-
strained Optimization and Nonlinear Equations. Prentice-Hall Series in
Computational Mathematics, 1983.
Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel,
and Hartmut Neven. Bayesian sampling using stochastic gradient ther-
mostats. In Advances in Neural Information Processing Systems 27, pages
3203–3211, 2014.
Finale Doshi-Velez, David A. Knowles, Shakir Mohamed, and Zoubin Ghahra-
mani. Large scale nonparametric Bayesian inference: Data parallelisation
in the Indian buffet process. In Advances in Neural Information Processing
Systems 22, pages 1294–1302, 2009.
Arnaud Doucet, Michael Pitt, Robert Kohn, and George Deligiannidis. Effi-
cient implementation of Markov chain Monte Carlo when using an unbiased
likelihood estimator. Biometrika, 102(2):295–313, 2015.
David Duvenaud and Ryan P. Adams. Black-box stochastic variational infer-
ence in five lines of python. NIPS Workshop on Black-box Learning and
Inference, 2015.
Paul Fearnhead, Omiros Papaspiliopoulos, Gareth O. Roberts, and Andrew
Stuart. Random-weight particle filtering of continuous time processes. Jour-
nal of the Royal Statistical Society: Series B (Statistical Methodology), 72
(4):497–512, 2010.
Anthony V. Fiacco. Introduction to Sensitivity and Stability Analysis in Non-
linear Programming. Academic Press, Inc., 1984.
Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From
importance sampling to bridge sampling to path sampling. Statistical sci-
ence, pages 163–185, 1998.
Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari,
and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC,
3rd edition, 2014a.
Andrew Gelman, Aki Vehtari, Pasi Jylänki, Christian Robert, Nicolas Chopin,
and John P Cunningham. Expectation propagation as a way of life. arXiv
preprint arXiv:1412.4869, 2014b.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions,


and the Bayesian restoration of images. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, pages 721–741, 1984.
Charles J. Geyer. Practical Markov chain Monte Carlo. Statistical Science,
pages 473–483, 1992.
Ryan J. Giordano, Tamara Broderick, and Michael I. Jordan. Linear re-
sponse methods for accurate covariance estimates from mean field varia-
tional Bayes. In Advances in Neural Information Processing Systems 28,
pages 1441–1449, 2015.
Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamil-
tonian Monte Carlo methods. Journal of the Royal Statistical Society: Se-
ries B (Statistical Methodology) (With Discussion), 73:123 – 214, 03 2011.
Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. Parallel
Gibbs sampling: From colored fields to thin junction trees. In Proceedings of
the 14th International Conference on Artificial Intelligence and Statistics,
pages 324–332, 2011.
Jackson Gorham and Lester Mackey. Measuring sample quality with Stein’s
method. In Advances in Neural Information Processing Systems 28, pages
226–234, 2015.
Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Her-
brich. Web-scale Bayesian click-through rate prediction for sponsored
search advertising in Microsoft’s Bing search engine. In Proceedings of the
27th International Conference on Machine Learning, pages 13–20, 2010.
Roger B. Grosse, Zoubin Ghahramani, and Ryan P. Adams. Sandwiching
the marginal likelihood using bidirectional Monte Carlo. arXiv preprint
arXiv:1511.02543, 2015.
Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropo-
lis algorithm. Bernoulli, 7(2):223–242, 04 2001.
Leonard Hasenclever, Stefan Webb, Thibaut Lienart, Sebastian Vollmer, Bal-
aji Lakshminarayanan, Charles Blundell, and Yee Whye Teh. Distributed
Bayesian learning with stochastic natural-gradient expectation propagation
and the posterior server. arXiv preprint arXiv:1512.09327, 2016.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of
Statistical Learning. Springer Series in Statistics. Springer New York Inc.,
New York, NY, USA, 2001.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57(1):97–109, April 1970.

Tom Heskes and Onno Zoeter. Expectation propagation for approximate


inference in dynamic Bayesian networks. In 18th Conference on Uncertainty
in Artificial Intelligence, pages 216–223, 2002.
Christian Hipp. Sufficient statistics and exponential families. The Annals of
Statistics, 2(6):1283–1292, 1974.
Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim,
Phillip B. Gibbons, Gregory R. Ganger, Garth Gibson, and Eric P. Xing.
More effective distributed ML via a stale synchronous parallel parameter
server. In Advances in Neural Information Processing Systems 26, pages
1223–1231, 2013.
Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochas-
tic variational inference. Journal of Machine Learning Research, 14(1):
1303–1347, May 2013.
Zaiying Huang and Andrew Gelman. Sampling for Bayesian computation
with large datasets. Technical report, Columbia University, 2005.
Michael C. Hughes and Erik B. Sudderth. Memoized online variational infer-
ence for Dirichlet process mixture models. In Advances in Neural Informa-
tion Processing Systems 26, pages 1133–1141, 2013.
Michael C. Hughes, Dae Il Kim, and Erik B. Sudderth. Reliable and scalable
variational inference for the hierarchical Dirichlet process. In Proceedings of
the 18th International Conference on Artificial Intelligence and Statistics,
pages 370–378, 2015.
Alexander Ihler and David Newman. Understanding errors in approximate
distributed latent Dirichlet allocation. IEEE Transactions on Knowledge
and Data Engineering, 24(5):952–960, 2012.
Pierre E. Jacob and Alexandre H. Thiery. On nonnegative unbiased estima-
tors. Annals of Statistics, 43(2):769–784, 04 2015.
Matthew J. Johnson, James Saunderson, and Alan S. Willsky. Analyzing Hog-
wild parallel Gaussian Gibbs sampling. In Advances in Neural Information
Processing Systems 26, pages 2715–2723, 2013.
Matthew James Johnson. Bayesian Time Series Models and Scalable Infer-
ence. PhD thesis, Massachusetts Institute of Technology, 2014.
Robert W. Keener. Theoretical Statistics: Topics for a Core Course. Springer
Texts in Statistics. Springer New York, 2010.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In-
ternational Conference on Learning Representations, 2014.

Jack P. C. Kleijnen and Reuven Y. Rubinstein. Optimization and sensitiv-


ity analysis of computer simulation models by the score function method.
European Journal of Operational Research, 88(3):413–427, 1996.
Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles
and Techniques. MIT Press, 2009.
Augustine Kong, Jun S. Liu, and Wing Hung Wong. Sequential imputations
and Bayesian missing data problems. Journal of the American Statistical
Association, 89(425):278–288, 1994.
Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC
land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31th
International Conference on Machine Learning, volume 32, pages 181–189,
2014.
Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Auto-
matic variational inference in Stan. In C. Cortes, N. D. Lawrence, D. D.
Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information
Processing Systems 28, pages 568–576. Curran Associates, Inc., 2015.
Neil D. Lawrence. Modelling in the context of massively missing data,
2015. URL http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/talks/missingdata_tuebingen15.pdf.
Benedict Leimkuhler and Xiaocheng Shang. Adaptive thermostats for noisy
gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–
A736, 2016.
Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. Stochas-
tic expectation propagation. In Advances in Neural Information Processing
Systems 28, pages 2323–2331, 2015.
L. Lin, K. F. Liu, and J. Sloan. A noisy Monte Carlo algorithm. Physical
Review D, 61:074505, March 2000.
Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. PLDA+:
Parallel latent Dirichlet allocation with data placement and pipeline pro-
cessing. ACM Transactions on Intelligent Systems and Technology, 2(3):
26:1–26:18, May 2011.
Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, and
Daniel Simpson. On Russian roulette estimates for Bayesian inference with
doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 11 2015.
Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic
gradient MCMC. In Advances in Neural Information Processing Systems
28, pages 2917–2925, 2015.

David J. C. MacKay. Information Theory, Inference & Learning Algorithms.


Cambridge University Press, New York, NY, USA, 2003.
Dougal Maclaurin and Ryan P. Adams. Firefly Monte Carlo: Exact MCMC
with subsets of data. In 30th Conference on Uncertainty in Artificial Intel-
ligence, pages 543–552, 2014.
Stephan Mandt, Matthew D. Hoffman, and David M. Blei. A variational
analysis of stochastic gradient algorithms. In Proceedings of the 33rd In-
ternational Conference on Machine Learning, 2016.
James Martens. New insights and perspectives on the natural gradient
method. arXiv preprint arXiv:1412.1193, 2015.
James Martens and Roger Grosse. Optimizing neural networks with
Kronecker-factored approximate curvature. In Proceedings of the 32nd In-
ternational Conference on Machine Learning, 2015.
Peter S. Maybeck. Stochastic Models, Estimation, and Control, volume 3.
Academic Press, 1982.
James McInerney, Rajesh Ranganath, and David M. Blei. The population
posterior and Bayesian inference on streams. In Advances in Neural Infor-
mation Processing Systems 28, pages 1153–1161, 2015.
Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Au-
gusta H. Teller, and Edward Teller. Equation of state calculations by fast
computing machines. The Journal of Chemical Physics, 21(6):1087–1092,
1953.
Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability.
Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
Thomas P. Minka. Expectation propagation for approximate Bayesian infer-
ence. In 17th Conference on Uncertainty in Artificial Intelligence, pages
362–369, 2001.
Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson.
Scalable and robust Bayesian inference via the median posterior. In Pro-
ceedings of the 31st International Conference on Machine Learning, pages
1656–1664, 2014.
Andriy Mnih and Karol Gregor. Neural variational inference and learning
in belief networks. In Proceedings of the 31st International Conference on
Machine Learning, pages 1791–1799, 2014.
Andriy Mnih and Danilo Jimenez Rezende. Variational inference for Monte
Carlo objectives. In Proceedings of the 33rd International Conference on
Machine Learning, 2016.

Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bern-


stein stopping. In Proceedings of the 25th International Conference on Ma-
chine Learning, pages 672–679, 2008.
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT
Press, 2012.
Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propaga-
tion for approximate inference: An empirical study. In 15th Conference on
Uncertainty in Artificial Intelligence, pages 467–475, 1999.
Radford M. Neal. An improved acceptance procedure for the hybrid Monte
Carlo algorithm. J. Comput. Phys., 111(1):194–203, March 1994.
Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of
Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern
Statistical Methods, pages 113–162. CRC Press, 2010.
Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embar-
rassingly parallel MCMC. In 30th Conference on Uncertainty in Artificial
Intelligence, pages 623–632, 2014.
David Newman, Padhraic Smyth, Max Welling, and Arthur U. Asuncion.
Distributed inference for latent Dirichlet allocation. In Advances in Neural
Information Processing Systems 20, pages 1081–1088, 2007.
David Newman, Arthur U. Asuncion, Padhraic Smyth, and Max Welling.
Distributed algorithms for topic models. Journal of Machine Learning Re-
search, 10:1801–1828, 2009.
Robert Nishihara, Iain Murray, and Ryan P. Adams. Parallel MCMC with
generalized elliptical slice sampling. Journal of Machine Learning Research,
15:2087–2112, 2014.
Manfred Opper and Ole Winther. A Bayesian approach to on-line learning. In
David Saad, editor, On-line Learning in Neural Networks, pages 363–378.
Cambridge University Press, New York, NY, USA, 1998.
John Paisley, David M. Blei, and Michael I. Jordan. Variational Bayesian
inference with stochastic search. In Proceedings of the 29th International
Conference on Machine Learning, 2012.
Omiros Papaspiliopoulos. A methodological framework for Monte Carlo prob-
abilistic inference for diffusion processes. Technical report, Centre for Re-
search in Statistical Methodology, University of Warwick, June 2009.
Omiros Papaspiliopoulos, Gareth O. Roberts, and Martin Sköld. A general
framework for the parametrization of hierarchical models. Statistical Sci-
ence, pages 59–73, 2007.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin
dynamics on the probability simplex. In Advances in Neural Information
Processing Systems 26, pages 3102–3110, 2013.
M. F. Pradier, P. G. Moreno, F. J. R. Ruiz, I. Valera, H. Mollina-Bulla,
and F. Perez-Cruz. Map/reduce uncollapsed Gibbs sampling for Bayesian
non parametric models. Workshop in Software Engineering for Machine
Learning at NIPS, 2014.
Maxim Rabinovich, Elaine Angelino, and Michael I. Jordan. Variational Con-
sensus Monte Carlo. In Advances in Neural Information Processing Systems
28, pages 1207–1215, 2015.
Rajesh Ranganath, Chong Wang, David M. Blei, and Eric P. Xing. An adap-
tive learning rate for stochastic variational inference. In Proceedings of
the 30th International Conference on Machine Learning, volume 28, pages
298–306, 2013.
Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational
inference. In 17th International Conference on Artificial Intelligence and
Statistics, pages 814–822, 2014.
Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hog-
wild!: A lock-free approach to parallelizing stochastic gradient descent. In
Advances in Neural Information Processing Systems 24, pages 693–701,
2011.
Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan P. Adams, Matt Hoffman,
Dustin Lang, David Schlegel, and Prabhat. Celeste: Variational inference
for a generative model of astronomical images. In Proceedings of the 32nd
International Conference on Machine Learning, pages 2095–2103, 2015.
Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-
propagation and approximate inference in deep generative models. In Pro-
ceedings of the 31st International Conference on Machine Learning, pages
1278–1286, 2014.
Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, pages 400–407, 1951.
Christian P. Robert and George Casella. Monte Carlo Statistical Methods
(Springer Texts in Statistics). Springer-Verlag New York, Inc., 2004.
Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of
Langevin distributions and their discrete approximations. Bernoulli, 2(4):
341–363, December 1996.
Gareth O. Roberts, Andrew Gelman, and Walter R. Gilks. Weak convergence
and optimal scaling of random walk Metropolis algorithms. Annals of
Applied Probability, 7:110–120, 1997.
Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain Monte
Carlo and variational inference: Bridging the gap. In Proceedings of the 32nd
International Conference on Machine Learning, pages 1218–1226, 2015.
Issei Sato and Hiroshi Nakagawa. Approximation analysis of stochastic gradi-
ent Langevin dynamics by using Fokker-Planck equation and Ito process.
In Proceedings of the 31st International Conference on Machine Learning,
pages 982–990, 2014.
Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chip-
man, Edward I. George, and Robert E. McCulloch. Bayes and big data: The
consensus Monte Carlo algorithm. International Journal of Management
Science and Engineering Management, 11(2):78–88, 2016.
R. J. Serfling. Probability inequalities for the sum in sampling without re-
placement. The Annals of Statistics, 2(1):39–48, 1974.
Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler, and Amos J.
Storkey. Covariance-controlled adaptive Langevin thermostat for large-
scale Bayesian sampling. In Advances in Neural Information Processing
Systems 28, pages 37–45, 2015.
Sameer Singh, Michael L. Wick, and Andrew McCallum. Monte Carlo MCMC:
Efficient inference by approximate sampling. In Proceedings of the 2012
Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, pages 1104–1113, 2012.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian opti-
mization of machine learning algorithms. In Advances in Neural Informa-
tion Processing Systems 25, pages 2951–2959, 2012.
David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: Large scale
online Bayesian recommendations. In Proceedings of the 18th International
Conference on World Wide Web, pages 111–120, 2009.
Ingvar Strid. Efficient parallelisation of Metropolis-Hastings algorithms using
a prefetching approach. Computational Statistics & Data Analysis, 54(11):
2814–2835, November 2010.
Alex Tank, Nicholas Foti, and Emily Fox. Streaming variational inference for
Bayesian nonparametric mixture models. In Proceedings of the 18th Inter-
national Conference on Artificial Intelligence and Statistics, pages 968–976,
2015.
Yee Whye Teh, Alexandre H. Thiery, and Sebastian J. Vollmer. Consistency
and fluctuations for stochastic gradient Langevin dynamics. Journal of
Machine Learning Research, 17(1):193–225, 2016.
Wolfgang Wagner. Unbiased Monte Carlo evaluation of certain functional
integrals. Journal of Computational Physics, 71(1):21–33, 1987.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential
families, and variational inference. Foundations and Trends in Machine
Learning, 1(1-2):1–305, November 2008.
Ling Wang and Liang Zhang. Stochastic optimization using simulated an-
nealing with hypothesis test. Applied Mathematics and Computation, 174
(2):1329–1342, 2006.
Xiangyu Wang and David B. Dunson. Parallel MCMC via Weierstrass sam-
pler. arXiv preprint arXiv:1312.4605, 2013.
Karl Weierstrass. Über die analytische Darstellbarkeit sogenannter willkürlicher
Functionen einer reellen Veränderlichen. Sitzungsberichte der Königlich
Preußischen Akademie der Wissenschaften zu Berlin, 1885. (II). Erste Mit-
teilung (part 1) pp. 633–639, Zweite Mitteilung (part 2) pp. 789–805.
Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient
Langevin dynamics. In Proceedings of the 28th International Conference
on Machine Learning, 2011.
Ronald J. Williams. Simple statistical gradient-following algorithms for con-
nectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
E. E. Witte, R. D. Chamberlain, and M. A. Franklin. Parallel simulated
annealing using speculative computation. IEEE Transactions on Parallel
and Distributed Systems, 2(4):483–494, 1991.
Minjie Xu, Balaji Lakshminarayanan, Yee Whye Teh, Jun Zhu, and Bo Zhang.
Distributed Bayesian posterior sampling via moment sharing. In Advances
in Neural Information Processing Systems 27, pages 3356–3364, 2014.