Patterns of Scalable Bayesian Inference
∗ Authors contributed equally.
Contents
1 Introduction
1.1 Why be Bayesian with big data?
1.2 The accuracy of approximate integration
1.3 Outline
2 Background
2.1 Exponential families
2.2 Markov Chain Monte Carlo inference
2.2.1 Bias and variance of estimators
2.2.2 Monte Carlo estimates from independent samples
2.2.3 Markov chains
2.2.4 Markov chain Monte Carlo (MCMC)
2.2.5 Metropolis-Hastings (MH) sampling
2.2.6 Gibbs sampling
2.3 Mean field variational inference
2.4 Expectation propagation variational inference
2.5 Stochastic gradient optimization
Acknowledgements
References
Abstract
Datasets are growing not just in size but in complexity, creating a de-
mand for rich models and quantification of uncertainty. Bayesian meth-
ods are an excellent fit for this demand, but scaling Bayesian inference
is a challenge. In response to this challenge, there has been consider-
able recent work based on varying assumptions about model structure,
underlying computational resources, and the importance of asymptotic
correctness. As a result, there is a zoo of ideas with a wide range of
assumptions and applicability.
In this paper, we seek to identify unifying principles, patterns, and
intuitions for scaling Bayesian inference. We review existing work on
utilizing modern computing resources with both MCMC and varia-
tional approximation techniques. From this taxonomy of ideas, we char-
acterize the general principles that have proven successful for designing
scalable inference procedures and comment on the path forward.
1 Introduction
1.1 Why be Bayesian with big data?
the data. The Bayesian inference methods we survey in this paper may
provide solutions to these challenges.
1.3 Outline
2 Background
2.1 Exponential families
and take Θ = η_x^{-1}(H) to be the open set of parameters that correspond to normalizable densities. We summarize this notation in the following definition.
where
    log Z_x(η_x) ≜ log ∫ exp{⟨η_x, t_x(x)⟩} ν_X(dx)        (2.7)
and hence tx (x) contains all the information about x that is relevant
for the parameter θ. The Koopman-Pitman-Darmois Theorem shows
that among all families in which the support does not depend on the
parameter, exponential families are the only families which provide
this powerful summarization property, under some mild smoothness
conditions [Hipp, 1974].
and so derivatives of log Z give cumulants of t(x), where the first cu-
mulant is the mean and the second and third cumulants are the second
and third central moments, respectively.
where the expectation is taken over the random variable x with density
p(x | θ), and where we have used the identity E[v(x, θ)] = 0.
Proposition 2.3 (Conjugacy). Let the densities p(x | θ) and p(θ) be defined as in Definitions 2.1 and 2.3, respectively. We have the relations
    p(θ, x) = exp{⟨η_θ + (t_x(x), 1), t_θ(θ)⟩ − log Z_θ(η_θ)}        (2.18)
    p(θ | x) = exp{⟨η_θ + (t_x(x), 1), t_θ(θ)⟩ − log Z_θ(η_θ + (t_x(x), 1))}        (2.19)
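As a concrete instance of this natural-parameter update (our illustration, not from the original text), consider the Beta–Bernoulli pair: the Bernoulli likelihood has statistic t_x(x) = x, and the conjugate Beta(a, b) prior has natural parameter η_θ = (a − 1, a + b − 2) with respect to the statistics (log(θ/(1−θ)), log(1−θ)). The sketch below applies the update η_θ ← η_θ + (t_x(x), 1) one observation at a time and checks it against the familiar Beta posterior.

```python
import numpy as np

# Bernoulli likelihood in exponential family form: t_x(x) = x,
# eta_x(theta) = log(theta / (1 - theta)), log Z_x = -log(1 - theta).
# Its conjugate Beta(a, b) prior has natural parameter (a - 1, a + b - 2).

def beta_to_natural(a, b):
    return np.array([a - 1.0, a + b - 2.0])

def natural_to_beta(eta):
    a = eta[0] + 1.0
    return a, eta[1] + 2.0 - a

rng = np.random.default_rng(0)
x = rng.random(20) < 0.7                      # 20 Bernoulli observations

eta = beta_to_natural(2.0, 2.0)               # prior Beta(2, 2)
for xn in x:
    eta = eta + np.array([float(xn), 1.0])    # conjugate update eta + (t_x(x), 1)

print("posterior from natural-parameter updates:", natural_to_beta(eta))
print("standard Beta update:                    ", (2.0 + x.sum(), 2.0 + len(x) - x.sum()))
```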
the asymptotic scaling of the error but also describes the asymptotic distribution of those errors. If X is real-valued and has finite variance E[(X − µ)²] = σ² < ∞, then the CLT states that the deviation (1/n) ∑_{i=1}^n X_i − µ, rescaled appropriately, converges in distribution to a Gaussian, so that the error decreases at an asymptotic rate proportional to 1/√n. More generally, for any real-valued measurable function f, the Monte Carlo standard error (MCSE) in the estimate (2.27) asymptotically scales as 1/√n.
Monte Carlo estimators effectively reduce the problem of computing
expectations to the problem of generating samples. However, the pre-
ceding statements require the samples used in the Monte Carlo estimate
to be independent, and independent samples can be computationally
difficult to generate. Instead of relying on independent samples, Markov
chain Monte Carlo algorithms compute estimates using mutually de-
pendent samples generated by simulating a Markov chain.
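To make the 1/√n scaling concrete, here is a minimal sketch (ours, with an arbitrary test function and target) that estimates an expectation by simple Monte Carlo and reports the estimate together with its Monte Carlo standard error.

```python
import numpy as np

def monte_carlo_estimate(f, sampler, n, seed=0):
    """Estimate E[f(X)] from n independent samples and report the MCSE.

    The MCSE is the empirical standard deviation of f(X) divided by sqrt(n),
    reflecting the 1/sqrt(n) scaling discussed above.
    """
    rng = np.random.default_rng(seed)
    fx = f(sampler(rng, n))
    estimate = fx.mean()
    mcse = fx.std(ddof=1) / np.sqrt(n)
    return estimate, mcse

if __name__ == "__main__":
    f = lambda x: x ** 2                        # estimand: E[X^2] = 1 under N(0, 1)
    sampler = lambda rng, n: rng.normal(size=n)
    for n in [100, 10_000, 1_000_000]:
        est, mcse = monte_carlo_estimate(f, sampler, n)
        print(f"n={n:>9d}  estimate={est:.4f}  MCSE={mcse:.4f}")
```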
for any initial distribution π₀, where ‖·‖_TV denotes the total variation norm on densities:
    ‖p − q‖_TV = (1/2) ∫_X |p(x) − q(x)| dx.        (2.32)
The total variation distance is also useful as an error metric for the approximate MCMC we discuss in the sequel. For a transition operator T(x → x′) to admit π(x) as a stationary density, its application must leave π(x) invariant:
    π = Tπ.        (2.33)
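For intuition, the following sketch (ours, not from the text) works on a small discrete state space, where the total variation distance reduces to half an ℓ₁ norm; with π represented as a row vector and T[i, j] = P(next state j | current state i), invariance reads π = πT.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions,
    (1/2) * sum_x |p(x) - q(x)|, the discrete analog of (2.32)."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# A small transition matrix T, with T[i, j] = P(next state j | current state i).
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# The stationary density solves pi = pi T (a left eigenvector with eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()
print("stationarity residual:", tv_distance(pi, pi @ T))   # ~0: pi is invariant

# Starting from any pi_0, pi_0 T^n approaches pi in total variation.
pi0 = np.array([1.0, 0.0, 0.0])
for n in [1, 5, 20, 100]:
    print(n, tv_distance(pi0 @ np.linalg.matrix_power(T, n), pi))
```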
Even though this Markov chain Monte Carlo estimate is not con-
structed from independent samples, under some mild conditions it can
asymptotically satisfy analogs of the Law of Large Numbers (LLN) and
Central Limit Theorem (CLT) that were used to justify ordinary Monte
Carlo methods in Section 2.2.2. We sketch these important results here.
The MCMC analog of the LLN states that for a chain satisfying
basic recurrence conditions and admitting an invariant distribution π,
for all functions f that are absolutely integrable with respect to π, i.e. all f : X → ℝ that satisfy ∫_X |f(x)| π(x) dx < ∞, we have
    lim_{n→∞} (1/n) ∑_{i=1}^n f(X_i) = ∫_X f(x) π(x) dx   (a.s.),        (2.36)
for any initial distribution π0 . This result is the basic motivation to col-
lect samples from a Markov chain trajectory and use those samples to
compute estimates of expectations with respect to the invariant distri-
bution π. For a more detailed statement, see Meyn and Tweedie [2009,
Section 17.1].
Given that Markov chain Monte Carlo estimators can satisfy a law
of large numbers, we’d also like to understand the distribution of es-
timator errors and how the error distribution changes as we collect
more samples. To quantify these errors, the analog of the CLT must
take into account both the Markov dependency structure among the
samples used in the Monte Carlo estimate and also the initial state in
which the chain was started. However, under some additional condi-
tions on both the Markov chain’s convergence rate and the function
f , the sample average for any initial distribution π0 is asymptotically
normal in distribution (with appropriate scaling):
    lim_{n→∞} P( (1/√n) ∑_{i=1}^n (f(X_i) − µ) < α ) = P(Z < α),        (2.37)
    Z ∼ N(0, σ²),        (2.38)
    σ² = Var_π[f(X₀)] + 2 ∑_{t=1}^∞ Cov_π[f(X₀), f(X_t)],        (2.39)
where µ = ∫_X f(x) π(x) dx and where Var_π and Cov_π denote the variance and covariance with respect to the stationary distribution π. Thus standard error in the MCMC estimate also scales asymptotically as 1/√n,
with a constant that depends on the autocovariance function of the sta-
tionary version of the chain. See Meyn and Tweedie [2009, Chapter 17]
and Robert and Casella [2004, Section 6.7] for precise statements of
both the LLN and CLT for Markov chain Monte Carlo estimates and
for conditions on the Markov chain which guarantee that these theo-
rems hold.
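As a rough illustration of (2.39) (our sketch, not part of the original), the asymptotic variance σ² can be estimated from a single simulated chain by summing empirical autocovariances of f(X_t); the ratio of the marginal variance to σ² gives the usual effective-sample-size correction. Truncating the autocovariance sum at a fixed lag, as below, is a crude choice; practical estimators use more careful windowing.

```python
import numpy as np

def mcmc_asymptotic_variance(fx, max_lag=200):
    """Crude estimate of sigma^2 = Var[f] + 2 * sum_t Cov[f(X_0), f(X_t)]
    from one chain, truncating the autocovariance sum at max_lag."""
    fx = np.asarray(fx) - np.mean(fx)
    n = len(fx)
    var0 = np.dot(fx, fx) / n
    sigma2 = var0
    for t in range(1, max_lag + 1):
        sigma2 += 2.0 * np.dot(fx[:-t], fx[t:]) / n
    return sigma2, var0

# Example: a Gaussian AR(1) chain x_t = rho * x_{t-1} + noise, whose stationary
# distribution is N(0, 1 / (1 - rho^2)); f is the identity function.
rng = np.random.default_rng(0)
rho, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()
sigma2, var0 = mcmc_asymptotic_variance(x)
print("marginal variance ~", var0, " asymptotic variance ~", sigma2)
print("effective sample size ~", n * var0 / sigma2)
```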
[Figure: estimator error ‖π₀Tⁿ − π‖_TV (log scale) as a function of iteration n and of wall-clock time (log scale).]
chain sample:
E[f (X)] ≈ f (Xn ). (2.40)
However, this choice of MCMC estimator maximizes the Monte Carlo standard error, which asymptotically cannot be decreased below the posterior variance of the estimand. A practical choice is to form MCMC estimates using the last ⌈n/2⌉ Monte Carlo samples, discarding the other samples as warm-up samples, resulting in an estimator
    E[f(X)] ≈ (1/⌈n/2⌉) ∑_{i=⌊n/2⌋}^n f(X_i).        (2.41)
[Figure: estimator error (log scale), including the transient bias ‖π₀Tⁿ − π‖_TV, versus iteration n (log scale).]
With this choice, once the marginal distribution of the Markov chain iterates approaches the stationary distribution the error due to transient bias is reduced at up to exponential rates. See Figure 2.2 for an illustration. With any choice of MCMC estimator, transient bias can be asymptotically decreased at least as fast as O(1/n), and potentially much faster, while MCSE can decrease only as fast as O(1/√n).
Using these ideas, MCMC algorithms provide a general means for
estimating posterior expectations of interest: first construct an algo-
rithm to simulate an ergodic Markov chain that admits the intended
posterior density as its stationary distribution, and then simply run
the simulation, collect samples, and form Monte Carlo estimates from
the samples. The task then is to design an algorithm to simulate from
    = min{ (p(θ′ | x) / p(θ′, x)) q(θ′ | θ) p(θ, x),  p(θ′ | x) q(θ | θ′) }
    = min{ 1, (p(θ, x) q(θ′ | θ)) / (p(θ′, x) q(θ | θ′)) } q(θ | θ′) p(θ′ | x).        (2.48)
To show (2.47), note that if θ ≠ θ′ then both sides are zero, and if θ = θ′ then both sides are trivially equal.
See Robert and Casella [2004, Section 7.3] for a more detailed treat-
ment of the Metropolis-Hastings algorithm.
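To make the construction concrete, here is a minimal random-walk Metropolis-Hastings sketch (our illustration, with an arbitrary one-dimensional target); the Gaussian proposal is symmetric, so the proposal densities q cancel in the acceptance ratio.

```python
import numpy as np

def log_target(theta):
    """Unnormalized log density of an arbitrary 1-D target (a Gaussian mixture)."""
    return np.logaddexp(-0.5 * (theta + 2.0) ** 2, -0.5 * (theta - 2.0) ** 2)

def random_walk_mh(log_p, theta0, n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, logp = theta0, log_p(theta0)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + step * rng.normal()          # symmetric proposal q
        logp_prop = log_p(proposal)
        # Accept with probability min{1, p(proposal) / p(theta)}.
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
        samples[i] = theta
    return samples

samples = random_walk_mh(log_target, theta0=0.0, n_samples=50_000)
warm = samples[len(samples) // 2:]            # discard the first half as warm-up
print("posterior mean estimate (target mean is 0):", warm.mean())
```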
We call L[q] the variational lower bound on the log partition func-
tion log Z. Due to the statistical physics origins of variational infer-
ence methods, the negative log partition function − log Z is also called
the free energy (or proportional to the free energy), and L[q] is also
sometimes called the (negative) variational free energy [MacKay, 2003,
Section 33.1]. For two densities q and p with respect to the same base
measure, KL(q‖p) is the Kullback-Leibler divergence from q to p, used
as a score of dissimilarity between pairs of densities [Amari and Na-
gaoka, 2007].
The variational lower bound of Proposition 2.4 is useful in inference
because if we wish to approximate an intractable p with a tractable q
by minimizing KL(q‖p), we can equivalently choose q to maximize L[q],
which is possible to evaluate since it does not include the partition func-
tion Z. The mean field variational inference problem is then to choose
the approximating distribution q(x) over the family Q to optimize the
objective L[q], as we summarize in the next definition.
Definition 2.4 (Mean field variational inference problem). Given a target probability density p(x) = (1/Z) p̄(x) and an approximating family Q, the mean field variational inference problem is
    max_{q∈Q} L[q]   or equivalently   max_{q∈Q} E_{q(x)}[ log ( p̄(x) / q(x) ) ].        (2.57)
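As a toy illustration of Definition 2.4 (ours, with an assumed unnormalized Gaussian target), L[q] can be estimated by sampling from q; it is maximized, and equals log Z, exactly when q matches the normalized target.

```python
import numpy as np

def elbo(log_pbar, mu, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of L[q] = E_q[log pbar(x) - log q(x)]
    for a Gaussian approximating family q = N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = mu + sigma * rng.normal(size=n)
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.mean(log_pbar(x) - log_q)

# Unnormalized target: pbar(x) = 3 * N(x; 1, 2^2), so Z = 3 and log Z ~ 1.0986.
log_pbar = lambda x: (np.log(3.0) - 0.5 * ((x - 1.0) / 2.0) ** 2
                      - np.log(2.0 * np.sqrt(2 * np.pi)))
for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
    print(f"q = N({mu}, {sigma}^2):  L[q] ~ {elbo(log_pbar, mu, sigma):.4f}")
# The last setting matches the target, so L[q] ~ log 3 and KL(q || p) ~ 0.
```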
∀t C2 I ≺ G(t) ≺ C3 I,
3 MCMC with data subsets
of terms:
    p(θ | x) ∝ p(θ) p(x | θ) = p(θ) ∏_{n=1}^N p(x_n | θ).        (3.2)
where {x*_n}_{n=1}^m is a subset of {x_n}_{n=1}^N sampled uniformly at random and with replacement. This approximation yields an unbiased estimate of the log joint density:
    log p(θ) p(x | θ) ≈ log p(θ) + (N/m) ∑_{n=1}^m log p(x*_n | θ).        (3.5)
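A minimal sketch of this estimator follows (ours, for an assumed Gaussian likelihood): averaging many minibatch estimates recovers the full-data log joint density on average, which is the unbiasedness property the methods below rely on.

```python
import numpy as np

def log_joint(theta, x):
    """Full log joint for a toy model: theta ~ N(0, 10^2), x_n | theta ~ N(theta, 1)."""
    log_prior = -0.5 * (theta / 10.0) ** 2
    return log_prior + np.sum(-0.5 * (x - theta) ** 2)

def log_joint_estimate(theta, x, m, rng):
    """Unbiased subsampled estimate (3.5): prior plus (N/m) times a minibatch sum,
    with the minibatch drawn uniformly at random with replacement."""
    N = len(x)
    idx = rng.integers(0, N, size=m)
    log_prior = -0.5 * (theta / 10.0) ** 2
    return log_prior + (N / m) * np.sum(-0.5 * (x[idx] - theta) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=10_000)
theta = 1.0
estimates = [log_joint_estimate(theta, x, m=100, rng=rng) for _ in range(2_000)]
print("full log joint:   ", log_joint(theta, x))
print("mean of estimates:", np.mean(estimates))   # close to the full value
```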
    ℓ_n ∼ N(µ, σ²).        (3.15)
The mean estimate µ̂_m for µ based on the subset of size m is equal to Λ̂_m(θ, θ′):
    µ̂_m = Λ̂_m(θ, θ′) = (1/m) ∑_{n=1}^m ℓ*_n.        (3.16)
The error estimate σ̂_m for σ may be derived from s_m/√m, where s_m is the empirical standard deviation of the m subsampled ℓ_n terms, i.e.,
    s_m = √( (m/(m−1)) ( Λ̂²_m(θ, θ′) − Λ̂_m(θ, θ′)² ) ),        (3.17)
where
    Λ̂²_m(θ, θ′) = (1/m) ∑_{n=1}^m (ℓ*_n)².        (3.18)
To obtain a confidence interval, we multiply this estimate by the finite population correction, giving:
    σ̂_m = (s_m/√m) √( (N − m)/(N − 1) ).        (3.19)
The test statistic
    t = ( Λ̂_m(θ, θ′) − ψ(u, θ, θ′) ) / σ̂_m        (3.20)
follows a Student's t-distribution with m − 1 degrees of freedom
when Λ(θ, θ′) = ψ(u, θ, θ′). The CDF value φ_{m−1}(|t|) then gives the probability that the approximate and actual outcomes agree, and thus ρ = 1 − φ_{m−1}(|t|), the corresponding tail probability, is the probability that they disagree, where φ_{m−1}(·) is the CDF of the Student's t-distribution with m − 1 degrees of freedom. The t-test thus gives an adaptive stopping rule, i.e., for any user-provided tolerance ε ≥ 0, we can incrementally increase m until ρ ≤ ε. We illustrate this approach in Algorithm 6.
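The following is a simplified sketch of such a stopping rule (ours; Algorithm 6 in the text is the authors' full procedure, which we do not reproduce here). It grows the minibatch until the estimated disagreement probability ρ falls below the tolerance ε, then returns the approximate MH decision. The per-datum log-likelihood-difference terms ℓ_n and the threshold ψ(u, θ, θ′) are assumed to be supplied by the caller.

```python
import numpy as np
from scipy import stats

def approx_mh_decision(ell, psi, eps=0.05, m0=100, grow=2.0, seed=0):
    """Adaptive subsampled MH test.

    ell : array of all N per-datum log-likelihood-difference terms ell_n
    psi : the threshold psi(u, theta, theta') the full average is compared against
    eps : tolerance on the estimated probability rho of a wrong decision
    """
    rng = np.random.default_rng(seed)
    ell = np.asarray(ell)
    N = len(ell)
    m = m0
    while True:
        if m >= N:
            sample = ell                                  # fall back to the full data set
        else:
            sample = rng.choice(ell, size=m, replace=True)
        m_eff = len(sample)
        lam_hat = sample.mean()                           # (3.16)
        s_m = sample.std(ddof=1)                          # empirical std, cf. (3.17)
        sigma_hat = s_m / np.sqrt(m_eff) * np.sqrt((N - m_eff) / (N - 1))   # (3.19)
        t_stat = (lam_hat - psi) / max(sigma_hat, 1e-12)  # (3.20)
        rho = 1.0 - stats.t.cdf(abs(t_stat), df=m_eff - 1)  # estimated error probability
        if rho <= eps or m_eff == N:
            return lam_hat > psi                          # accept the proposal?
        m = int(grow * m)                                 # grow the minibatch
```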
and
    δ_k = (p − 1)/(p k^p).        (3.30)
Also, as mentioned in Section 3.2.2, Bardenet et al. [2014] geometrically
increase the subsample size by a factor γ. In their experiments, they
use the empirical Bernstein-Serfling bound [Bardenet and Maillard,
2015]. For the hyperparameters, they set p = 2, γ = 2, and ε = 0.01, and remark that they empirically found their algorithm to be robust to the choice of ε.
for all probability distributions P and some constant η ∈ [0, 1). For approximate MH with an adaptive stopping rule, the maximum acceptance probability error E_max directly gives an upper bound on the single-step error ‖T̃P − TP‖_TV. Combining the single-step error bound with the contraction condition shows that T̃ is also a strong contraction and yields a bound on ‖π̃ − π‖_TV.
Finally, we note that an adaptive subsampling scheme using a concentration inequality enables an upper bound on the stopping time [Bardenet et al., 2014].
where the zn are independent for different n. When the bound is tight,
i.e., Bn (θ) = Ln (θ), then zn = 0 with probability 1. More generally, a
tighter bound results in a higher probability that zn = 0. Augmenting
the density with z = {z_n}_{n=1}^N gives:
Thus for any fixed configuration of z we can evaluate the joint density
using only the likelihood terms Ln (θ) where zn = 1 and the bound
values Bn (θ) for each n = 1, 2, . . . , N .
While Equation (3.41) still involves a product of N terms, if the product of the bound terms ∏_{n: z_n=0} B_n(θ) can be evaluated without reading each corresponding data point then the joint density can be evaluated reading only the data x_n for which z_n = 1. In particular, if the form of B_n(θ) is an exponential family density, then the product ∏_{n: z_n=0} B_n(θ) can be evaluated using only a finite-dimensional sufficient statistic for the data {x_n : z_n = 0}. Thus by exploiting lower bounds in the exponential family, FlyMC can reduce the amount of data required at each iteration of the algorithm while maintaining the exact posterior as its stationary distribution. Maclaurin and Adams [2014] show an application of this methodology to Bayesian logistic regression.
FlyMC presents three main challenges. The first is constructing a
collapsible lower bound, such as an exponential family, that is suf-
ficiently tight. The second is designing an efficient implementation.
Maclaurin and Adams [2014] discuss these issues and, in particular,
design a cache-like data structure for managing the relationship be-
tween the N indicator values and the data. Finally, it is likely that the
inclusion of these auxiliary variables slows the mixing of the Markov
chain, but Maclaurin and Adams [2014] only provide empirical evidence
that this effect is small relative to the computational savings from using
data subsets.
Further analysis of FlyMC, and in particular an explicit connection
to pseudo-marginal techniques and an analysis of how bound tightness
where η_t ∼ N(0, ε_t). Notice that the injected noise decays with the gradient step size parameter, but at a slower rate. Specifically, if ε_t decays as t^{−γ}, then η_t decays as t^{−γ/2}. As in MALA, the SGLD proposal is a stochastic gradient step, where the noise comes from subsampling as well as the injected noise.
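A single SGLD update might look like the following sketch (ours, for a generic model with user-supplied gradient functions): the proposal combines a stochastic gradient step, with the minibatch gradient rescaled by N/m, and injected Gaussian noise of variance ε_t, and every proposal is accepted.

```python
import numpy as np

def sgld_step(theta, x, grad_log_prior, grad_log_lik, eps_t, m, rng):
    """One SGLD update (in the style of Welling and Teh, 2011), with no MH correction.

    grad_log_prior(theta)      : gradient of log p(theta)
    grad_log_lik(theta, batch) : sum of gradients of log p(x_n | theta) over the batch
    eps_t                      : step size, which also sets the injected noise variance
    """
    N = len(x)
    batch = x[rng.integers(0, N, size=m)]                 # uniform minibatch
    grad = grad_log_prior(theta) + (N / m) * grad_log_lik(theta, batch)
    noise = rng.normal(scale=np.sqrt(eps_t), size=np.shape(theta))
    return theta + 0.5 * eps_t * grad + noise             # proposal, always accepted

# Toy usage: posterior over the mean of a Gaussian with known unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=5_000)
theta = np.array(0.0)
for t in range(1, 5_001):
    eps_t = 5e-4 * t ** (-0.55)                           # polynomially decaying step size
    theta = sgld_step(theta, x,
                      grad_log_prior=lambda th: -th / 100.0,       # N(0, 10^2) prior
                      grad_log_lik=lambda th, b: np.sum(b - th),
                      eps_t=eps_t, m=50, rng=rng)
print("final iterate (near the posterior mean ~2):", float(theta))
```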
An actual Metropolis–Hastings algorithm would accept or reject the proposal in Equation (3.47) by evaluating the full (log) joint density at θ′ and θ_t, but this is precisely the computation we wish to avoid. Welling and Teh [2011] observe that as ε_t → 0, θ′ → θ_t in both Equations (3.46) and (3.47). In this limit, the probability of accepting the proposal converges to 1, but the chain stops completely. The authors suggest that ε_t can be decayed to a value that is large enough for efficient sampling, yet small enough for the acceptance probability to be high. These assumptions lead to a scheme where ε_t > ε_∞ > 0, for all t, and all proposals are accepted, therefore the acceptance probability is
Table 3.1: Summary of recent MCMC methods for Bayesian inference that
operate on data subsets. Error refers to the total variation distance between
the stationary distribution of the Markov chain and the target posterior dis-
tribution.
3.5 Summary
Data access patterns. While all the methods use subsets of data,
their access patterns differ. Adaptive subsampling and SGLD require
randomization to avoid issues of bias due to data order, but this ran-
domization can be achieved by permuting the data before each pass
and hence these algorithms allow data access that is mostly sequential.
In contrast, FlyMC operates on random subsets of data determined by
the Markov chain itself, leading to a random access pattern. However,
subsets from one iteration to the next tend to be correlated, which motivates implementation details such as the proposed cache data structure.
converge to a small positive value so that the injected noise term dom-
inates, while not being too large compared to the scale of the posterior
distribution.
Error. FlyMC is exact in the sense that the target posterior distribu-
tion is a marginal of its augmented state space. The adaptive subsam-
pling approaches and SGLD (for any positive step size) are approximate
methods in that neither has a stationary distribution equal to the tar-
get posterior. The adaptive subsampling approaches bound the error
of the MH test at each iteration, and for MH transition kernels with
uniform ergodicity this one-step error bound leads to an upper bound
on the total variation distance between the approximate stationary dis-
tribution and the target posterior distribution. The theoretical analysis
of SGLD is less clear [Sato and Nakagawa, 2014].
3.6 Discussion
²Pseudo-marginal MCMC is also known as exact-approximate sampling.
4 Parallel and distributed MCMC
Figure 4.1: Graphical models and graph colorings can expose opportunities
for parallelism in Gibbs samplers. (a) In this directed graphical model for a
discrete mixture, each label (red) can be sampled in parallel conditioned on
the parameters (blue) and data (gray) and similarly each parameter can be
resampled in parallel conditioned on the labels. (b) This undirected grid has
a classical “red-black” coloring, emphasizing that the variables corresponding
to red nodes can be resampled in parallel given the values of black nodes and
vice-versa.
that every element of the dataset is read by some processor and many
processors must mutually communicate. The methods we survey in the
remainder of this chapter aim to mitigate these limitations by adjusting
the allocation of parallel resources or by reducing communication.
speedup is then:
    1 + E(k) < 1 + ∑_{k=0}^∞ k (1 − α)^k α = 1 + (1 − α)/α = 1/α.
Note that the first term on the left is due to the core at the root of
the tree, which always performs useful computation in the prefetch-
ing scheme. When α = 0.23, this scheme yields a maximum expected
speedup of about 4.3; it achieves an expected speedup of about 4
with 16 cores. If only a few cores are available, this may be a reasonable
policy, but if many cores are available, their work is essentially wasted.
In contrast, the naïve prefetching policy achieves speedup that grows
as the log of the number of cores. Byrd et al. [2010] later considered the
special case where the evaluation of the likelihood function occurs on
two timescales, slow and fast. They call this method speculative chains;
it modifies the speculative moves approach so that whenever the eval-
uation of the likelihood function is slow, any available cores are used
to speculatively evaluate the subsequent chain, assuming the slow step
resulted in an accept.
Strid [2010] extends the naïve prefetching scheme to allocate cores
according to the optimal “tree shape” with respect to various assump-
tions about the probability of rejecting a proposal, i.e., by greedily
allocating cores to nodes that maximize the depth of speculative com-
putation expected to be correct. The author presents both static and
dynamic prefetching schemes. The static scheme assumes a fixed ac-
ceptance rate; versions of this were proposed earlier in the context of
simulated annealing [Witte et al., 1991]. The dynamic scheme estimates
acceptance probabilities, for example, at each level of the tree by draw-
ing empirical MH samples, or at each branch in the tree by comput-
ing min{β, r} where β is a constant (e.g., β = 1) and r is an estimate
of the MH ratio based on a fast approximation to the target function.
Strid also proposes using the approximate target function to identify
the single most likely path on which to perform speculative compu-
tation, and combines prefetching with other sources of parallelism to
obtain a multiplicative effect.
In closely related work, parallel predictive prefetching makes effi-
cient use of parallel resources by dynamically predicting the outcome
of each MH test [Angelino et al., 2014]. In the case of Bayesian infer-
ence, these predictions can be constructed in the same manner as the
approximate MH algorithms based on subsets of data, as discussed in
Section 3.2.2. Furthermore, these predictions can be made in the con-
text of an error model, e.g., with the concentration inequalities used
by Bardenet et al. [2014]. This yields a straightforward and rational
mechanism for allocating parallel cores to computations most likely to
fall along the true execution path. Algorithms 9 and 10 sketch pseu-
docode for an implementation of parallel predictive prefetching that
follows a master-worker pattern. See the dissertation by Angelino [2014]
for a formal description of the algorithm and implementation details.
4.2 Defining new parallel dynamics
In this section we survey two ideas for performing inference using new
parallel dynamics. These algorithms define new dynamics in the sense
that their iterates do not form ergodic Markov chains which admit the
posterior distribution as an invariant distribution, and thus they do not
qualify as classical MCMC schemes. Instead, while some of the updates
in these algorithms resemble standard MCMC updates, the overall dy-
namics are designed to exploit parallel and distributed computation.
A unifying theme of these new methods is to perform local computa-
where
    π^(j)(θ | x^(j)) = π₀(θ)^{1/J} ∏_{x∈x^(j)} π(x | θ),   j = 1, . . . , J.        (4.3)
θ ∼ N (0, Σ0 ) (4.5)
x(j) | θ ∼ N (θ, Σj ). (4.6)
where
    Σ = (Σ₀^{−1} + ∑_{j=1}^J Σ_j^{−1})^{−1}        (4.8)
    µ = Σ ∑_{j=1}^J Σ_j^{−1} x^(j).        (4.9)
where
    Σ̃_j = (Σ₀^{−1}/J + Σ_j^{−1})^{−1}        (4.12)
    µ̃_j = (Σ₀^{−1}/J + Σ_j^{−1})^{−1} Σ_j^{−1} x^(j).        (4.13)
its mean is
    E[θ̂] = ∑_{j=1}^J W_j E[θ_j] = ∑_{j=1}^J W_j µ̃_j
         = ∑_{j=1}^J W_j (Σ₀^{−1}/J + Σ_j^{−1})^{−1} Σ_j^{−1} x^(j).        (4.15)
¹Our notation differs slightly from that of Scott et al. [2016] in that our weights W_j are normalized.
Thus, if we choose
    W_j = Σ (Σ₀^{−1}/J + Σ_j^{−1}) = (Σ₀^{−1} + ∑_{j=1}^J Σ_j^{−1})^{−1} (Σ₀^{−1}/J + Σ_j^{−1})        (4.16)
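The sketch below (ours) illustrates consensus-style weighted averaging for this Gaussian case: each "worker" draws samples from its subposterior, and draws are combined with the precision-based weights of (4.16), which recovers the full posterior exactly in this conjugate model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, sigma0^2), x_n | theta ~ N(theta, 1), data split over J shards.
sigma0, J, N, S = 10.0, 4, 8_000, 5_000
x = rng.normal(loc=-0.7, scale=1.0, size=N)
shards = np.array_split(x, J)

# Each worker samples its subposterior, which uses the fractionated prior N(0, J*sigma0^2).
sub_samples, sub_prec = [], []
for xj in shards:
    prec_j = 1.0 / (J * sigma0 ** 2) + len(xj)          # subposterior precision
    mean_j = xj.sum() / prec_j                          # subposterior mean
    sub_samples.append(mean_j + rng.normal(size=S) / np.sqrt(prec_j))
    sub_prec.append(prec_j)
sub_samples, sub_prec = np.array(sub_samples), np.array(sub_prec)

# Consensus draws: precision-weighted averages across workers, cf. (4.16).
weights = sub_prec / sub_prec.sum()
consensus = weights @ sub_samples

# Compare against the exact full posterior N(mu, Sigma).
full_prec = 1.0 / sigma0 ** 2 + N
print("exact posterior mean/std:", x.sum() / full_prec, 1.0 / np.sqrt(full_prec))
print("consensus mean/std:      ", consensus.mean(), consensus.std())
```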
Finally, one can sample from this posterior density estimator using
MCMC; ideally, this density is straightforward to obtain and sample.
In general, however, density estimation can yield complex models that
are not amenable to efficient sampling.
Neiswanger et al. [2014] explore three density estimation approaches
of various complexities. Their first approach assumes a parametric
model and is therefore approximate. Specifically, they fit a Gaussian to
each set of subposterior samples, yielding
    π̃(θ | x) = ∏_{j=1}^J N(µ̄_j, Σ̄_j),        (4.19)
where µ̄j and Σ̄j are the empirical mean and covariance, respectively,
of the samples from the jth subposterior. This product of Gaussians
    π̃^(j)(θ | x^(j)) = (1/T) ∑_{t=1}^T (1/h^d) K( ‖θ − θ_{j,t}‖ / h ),        (4.22)
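As a sketch of the parametric approach (ours, illustrating (4.19) rather than the kernel density estimator above), one can fit a Gaussian to each batch of subposterior samples and form the product density in closed form: precisions add, and means are precision-weighted.

```python
import numpy as np

def gaussian_product(means, covs):
    """Combine per-shard Gaussian fits N(mu_j, Sigma_j) by multiplying the densities.
    The product is again Gaussian: precisions add, and means are precision-weighted."""
    precs = [np.linalg.inv(np.atleast_2d(S)) for S in covs]
    prec = sum(precs)
    cov = np.linalg.inv(prec)
    mean = cov @ sum(P @ np.atleast_1d(m) for P, m in zip(precs, means))
    return mean, cov

# Usage with hypothetical per-shard subposterior samples of shape (S, d).
rng = np.random.default_rng(0)
sub_samples = [rng.multivariate_normal(mean=[j, -j], cov=np.eye(2), size=2_000)
               for j in range(4)]
means = [s.mean(axis=0) for s in sub_samples]
covs = [np.cov(s, rowvar=False) for s in sub_samples]
print(gaussian_product(means, covs))
```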
Weierstrass samplers
Weierstrass samplers [Wang and Dunson, 2013] are named for the
Weierstrass transform [Weierstrass, 1885]: for h > 0, we write
    W_h f(θ) = ∫_{−∞}^{∞} (1/(√(2π) h)) exp{ −(θ − ξ)²/(2h²) } f(ξ) dξ.        (4.23)
    lim_{h→0} W_h f(θ) = ∫_{−∞}^{∞} δ(θ − ξ) f(ξ) dξ = f(θ),
where δ(τ ) is the Dirac delta function. For h > 0, Wh f (θ) can be
thought of as a smoothed approximation to f (θ). Equivalently, if f (θ)
is the density of a random variable θ, then Wh f (θ) is the density of a
noisy measurement of θ, where the noise is an additive Gaussian with
zero mean and standard deviation h.
Wang and Dunson [2013] analyze a more general class of Weierstrass transforms by defining a multivariate version and also allowing non-Gaussian kernels:
    W_h^(K) f(θ₁, . . . , θ_d) = ∫_{−∞}^{∞} f(ξ₁, . . . , ξ_d) ∏_{i=1}^d h_i^{−1} K_i( (θ_i − ξ_i)/h_i ) dξ_i.
    f_j(θ) = π^(j)(θ | x^(j)) = π₀(θ)^{1/J} ∏_{x∈x^(j)} π(x | θ),        (4.24)
(a) Factor graph for the original model π(θ | x), with each factor f_j(θ) attached directly to the single parameter node θ.
(b) Factor graph for the Weierstrass augmented model π(θ, ξ | x).
Figure 4.3: Factor graphs defining the augmented model of the Weierstrass
sampler.
This new model can be represented by the factor graph in Figure 4.3b,
with potentials ψ(ξj , θ) = δ(ξj − θ). Finally, rather than taking the ξj
to be exact local copies of θ, we can instead relax them to be noisy
Gaussian measurements of θ:
    π_h(θ, ξ | x) ∝ ∏_{j=1}^J f_j(ξ_j) ψ_h(ξ_j, θ)        (4.30)
    ψ_h(ξ_j, θ) = exp{ −(θ − ξ_j)²/(2h²) }.        (4.31)
Thus the potentials ψh (ξj , θ) enforce some consistency across the noisy
local copies of the parameter but allow them to be decoupled, where
the amount of decoupling depends on h. With smaller values of h the
approximate model is more accurate, but the local copies are more
coupled and hence sampling in the augmented model is less efficient.
We can construct a Gibbs sampler for the joint distribution π(θ, ξ | x) in Equation 4.25 by alternately sampling from p(θ | ξ) and p(ξ_j | θ, x^(j)), for j = 1, . . . , J. It follows from Equation 4.25 that
    p(θ | ξ₁, . . . , ξ_J, x) ∝ ∏_{j=1}^J exp{ −(θ² − 2θξ_j)/(2h²) }.        (4.32)
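A sketch of the resulting Gibbs sweep is below (ours, for a toy setting in which each subposterior f_j is itself Gaussian so that both conditionals are available in closed form; in general each ξ_j update would instead be a local MCMC step on the worker holding x^(j)). The conditional (4.32) is a Gaussian with mean equal to the average of the ξ_j and variance h²/J.

```python
import numpy as np

def weierstrass_gibbs(sub_means, sub_vars, h=0.05, n_iters=5_000, seed=0):
    """Gibbs sampler for the Weierstrass-augmented model (4.30)-(4.31), assuming each
    subposterior f_j(xi_j) = N(sub_means[j], sub_vars[j]) so both conditionals are Gaussian.
    """
    rng = np.random.default_rng(seed)
    J = len(sub_means)
    xi = np.array(sub_means, dtype=float)          # initialize the local copies
    thetas = np.empty(n_iters)
    for it in range(n_iters):
        # theta | xi ~ N(mean(xi), h^2 / J), cf. (4.32)
        theta = xi.mean() + (h / np.sqrt(J)) * rng.normal()
        # xi_j | theta, x^(j): product of f_j(xi_j) and the coupling psi_h(xi_j, theta)
        prec = 1.0 / np.asarray(sub_vars) + 1.0 / h ** 2
        mean = (np.asarray(sub_means) / np.asarray(sub_vars) + theta / h ** 2) / prec
        xi = mean + rng.normal(size=J) / np.sqrt(prec)
        thetas[it] = theta
    return thetas

# Hypothetical Gaussian subposteriors; smaller h couples the local copies more tightly.
thetas = weierstrass_gibbs(sub_means=[1.1, 0.9, 1.0, 1.2], sub_vars=[0.04] * 4)
print("posterior mean estimate:", thetas[len(thetas) // 2:].mean())
```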
these strategies take existing algorithms and let the updates run ‘hog-
wild’ in the spirit of Hogwild! stochastic gradient descent in convex
optimization [Recht et al., 2011], we refer to these methods as Hogwild
Gibbs.
Similar approaches have a long history. Indeed, Gonzalez et al.
[2011] attribute a version of this strategy, Synchronous Gibbs, to the
original Gibbs sampling paper [Geman and Geman, 1984]. However,
these strategies have seen renewed interest, particularly due to exten-
sive empirical work on Approximate Distributed Latent Dirichlet Al-
location (AD-LDA) [Newman et al., 2007, 2009, Asuncion et al., 2008,
Liu et al., 2011, Ihler and Newman, 2012], which showed that running
collapsed Gibbs sampling updates in parallel allowed for near-perfect
parallelism without a loss in predictive likelihood performance. With the growing challenge of scaling MCMC not only to big datasets but also to big models, it is increasingly important to understand when and how these approaches may be useful.
In this section, we first define some variations of Hogwild Gibbs
based on examples in the literature. Next, we survey the empirical
results and summarize the current state of theoretical understanding.
drawn on each processor using old statistics from other processors for
each outer iteration. Finally, note that setting K = 1 and q(t, k) = 1
reduces to standard Gibbs sampling on a single processor.
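To illustrate the basic update pattern (our toy sketch, not drawn from the papers above), consider a Synchronous-Gibbs-style scheme on a bivariate Gaussian in which two "processors" each resample their own coordinate conditioned on a stale copy of the other's value, exchanging values only once per outer iteration. For this model the marginal variances come out correct, but the cross-correlation between the coordinates is lost, a bias that matters more as the true correlation ρ grows.

```python
import numpy as np

def synchronous_gibbs(rho, n_iters=100_000, seed=0):
    """Two-processor synchronous Gibbs for a zero-mean bivariate Gaussian with
    correlation rho and unit marginal variances. Each processor updates its own
    coordinate using the other's value from the *previous* outer iteration."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    samples = np.empty((n_iters, 2))
    for t in range(n_iters):
        stale = x.copy()                     # values communicated at the last sync
        for i in (0, 1):                     # conceptually done in parallel
            other = stale[1 - i]
            x[i] = rho * other + np.sqrt(1.0 - rho ** 2) * rng.normal()
        samples[t] = x
    return samples

for rho in (0.2, 0.95):
    s = synchronous_gibbs(rho)[10_000:]
    # The empirical correlation is near zero in both cases, unlike the target's rho.
    print(f"rho={rho}: empirical correlation {np.corrcoef(s.T)[0, 1]:.3f}, "
          f"marginal variances {s.var(axis=0).round(3)}")
```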
Theoretical analysis
are measured using the first time the distribution of the iterates comes
sufficiently close to the target distribution. De Sa et al. [2016] also
provide quantitative rate bounds for these measures of bias and mixing
time, and show that these estimates closely track empirical simulations.
4.3 Summary
MH in this way yields exact MCMC updates and can be effective at re-
ducing the mixing time required by serial MH, but it requires a simple
likelihood function and its implementation requires frequent synchro-
nization and communication, mitigating parallel speedups unless the
likelihood function is very expensive.
Gibbs sampling also presents an opportunity for direct paralleliza-
tion for particular graphical model structures. In particular, given
a graph coloring of the graphical model, variables corresponding to
nodes assigned a particular color are conditionally mutually indepen-
dent and can be updated in parallel without communication. How-
ever, frequent synchronization and significant communication can be
required to transmit sampled values to neighbors after each update.
Relaxing both the strict conditional independence requirements and
synchronization requirements motivates Hogwild Gibbs.
Hogwild Gibbs. Hogwild Gibbs of Section 4.2.2 also allows for data
parallelism but avoids factorizing the posterior as in consensus Monte
Carlo or instantiating coupled copies of a global parameter as in the
Weierstrass sampler. Instead, processor-local sampling steps (such as
local Gibbs updates) are performed with each processor treating other
processors’ states as fixed at stale values; processors can communicate
updated states less frequently, either via synchronous or asynchronous
communication. Hogwild Gibbs variants span a range of parallel com-
putation paradigms from fully synchronous BSP to fully asynchronous
message passing. While Hogwild Gibbs has proven effective in practice
for several models, its applicability and approximation tradeoffs remain
unclear.
4.4 Discussion
then averaging their samples could produce iterates that look nothing
like the true posterior. Furthermore, averaging may only make sense in
continuous spaces that are closed under convex combinations; these av-
eraging strategies do not apply to samples that take values in discrete
spaces. Hence these subposterior methods may be most appropriate in
the “tall data” regime, where the data are redundant and subposteriors
can be expected to agree.
Performing density estimation on the subposteriors and forming the
product of these densities avoids the problems with direct averaging,
but may itself be computationally expensive and is unlikely to scale
well with dimensionality. Indeed, applying this strategy to a Gaussian
model provides no computational advantage to parallelization.
Unlike the subposterior methods, in Weierstrass samplers the pro-
cessors try to account for the other processors’ data by communicating
through the shared global variable. In this way, the Weierstrass sam-
plers are similar to the expectation propagation methods surveyed in
the next chapter. However, because processors can affect each others’
iterates only through the global variable bottleneck, their communica-
tion is limited. In addition, it is not clear how to extend the Weierstrass
strategies to discrete variables.
Hogwild Gibbs algorithms also include influence between proces-
sors, and without an information bottleneck. Communication between
processors is limited instead by an algorithm’s update frequency. In
addition, Hogwild Gibbs doesn’t inherently rely on data redundancy:
instead, it relies on variables not being too dependent across proces-
sors. For this reason, while Hogwild Gibbs has its own restrictions, it
is a promising method for scaling out to big models with many latent
variables rather than just the “tall data” regime. Hogwild Gibbs also
readily applies to discrete variables.
5 Scaling variational algorithms
5.1 Stochastic optimization and mean field methods
[Figure: graphical model with local latent variables z^(k) and observations y^(k) in a plate over k = 1, 2, . . . , K.]
then conjugacy identifies the statistic of the prior with the natural parameter and log partition function of the likelihood via
so that
    p(φ, z^(k), y^(k)) ∝ exp{ ⟨η_φ + (t_zy(z^(k), y^(k)), 1), t_φ(φ)⟩ }.        (5.6)
Conjugacy implies the optimal variational factor q(φ) has the same
form as the prior; that is, without loss of generality we can write q(φ)
in the same form as (5.3),
Writing the optimal parameters of q(z) as η̃_z*, note that when η̃_z* is partially optimized to a stationary point¹ of L, so that ∂L/∂η̃_z = 0 at η̃_z*, the chain rule implies that the gradient² with respect to the global variational parameters simplifies:
    ∂L/∂η̃_φ (η̃_φ) = ∂L/∂η̃_φ (η̃_φ, η̃_z*) + ∂L/∂η̃_z* (η̃_φ, η̃_z*) ∂η̃_z*/∂η̃_φ        (5.9)
                  = ∂L/∂η̃_φ (η̃_φ, η̃_z*).        (5.10)
¹More generally, when η̃_z is regularly constrained (typically the constraint set is linear [Wainwright and Jordan, 2008]), the same result holds because at η̃_z* the gradient of the objective is orthogonal to the feasible variations in η̃_z.
²For a discussion of differentiability and smoothness issues that can arise when there is more than one optimizer, see Danskin [1967], Fiacco [1984, Section 2.4], and Bonnans and Shapiro [2000, Chapter 4]. Here we simply assume ∂η̃_z*/∂η̃_φ exists.
Because a locally optimal local factor q(z) can be computed with local mean field updates for a fixed value of the global variational parameter η̃_φ, we need only find an expression for the gradient ∇_{η̃_φ} L(η̃_φ) in terms of the optimized local factors.
To find an expression for the gradient ∇_{η̃_φ} L(η̃_φ) that exploits conjugacy structure, using (5.6) we can substitute
    p(φ, z, y) ∝ exp{ ⟨η_φ + ∑_k (t_zy(z^(k), y^(k)), 1), t_φ(φ)⟩ },        (5.11)
where the constant does not depend on η̃_φ. Using the identity for natural exponential families that
Thus we can compute the gradient of L(η̃_φ) with respect to the global variational parameters η̃_φ as
    ∇_{η̃_φ} L(η̃_φ) = ⟨∇² log Z_φ(η̃_φ), η_φ + ∑_k E_{q(z^(k))}[t_zy(z^(k), y^(k))] − η̃_φ⟩ − ∇ log Z_φ(η̃_φ) + ∇ log Z_φ(η̃_φ)        (5.16)
                   = ⟨∇² log Z_φ(η̃_φ), η_φ + ∑_k E_{q(z^(k))}[t_zy(z^(k), y^(k))] − η̃_φ⟩,
where the first two terms come from applying the product rule.
The matrix ∇² log Z_φ(η̃_φ) is the Fisher information of the variational family, since
    −E_{q(φ)}[ ∇²_{η̃_φ} log q(φ) ] = ∇² log Z_φ(η̃_φ).        (5.17)
mean field update on the data minibatch and the global variational factor. That is, if q(z^(k)) is not further factorized in the mean field approximation, it is computed according to
    q(z^(k)) ∝ exp{ E_{q(φ)}[ log p(z^(k), y^(k) | φ) ] }.        (5.20)
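To make the update pattern concrete, here is a minimal SVI-style sketch (ours, not from the text): the model is a toy two-component mixture with known component means, the global factor q(π) is a Beta distribution, each minibatch is a single observation whose local factor follows (5.20), and the global natural parameter takes a stochastic natural-gradient step toward the prior plus the K-rescaled minibatch statistic, which plays the role of ∑_k E[t_zy] above.

```python
import numpy as np
from scipy.special import digamma

# Toy model for SVI: pi ~ Beta(1, 1); for each data point, z ~ Bernoulli(pi),
# y | z ~ N(mu[z], 1) with known component means mu. The global variational factor
# q(pi) = Beta(a, b) has natural parameter (a - 1, b - 1).
mu = np.array([-2.0, 2.0])
prior_nat = np.array([0.0, 0.0])              # Beta(1, 1) in natural coordinates

rng = np.random.default_rng(0)
K = 10_000                                    # here each "minibatch" is one data point
z_true = rng.random(K) < 0.3
y = rng.normal(loc=np.where(z_true, mu[1], mu[0]))

nat = np.array([0.0, 0.0])                    # initialize q(pi) = Beta(1, 1)
for t in range(1, 20_001):
    a, b = nat + 1.0
    k = rng.integers(K)                       # sample a minibatch (a single point)
    # Optimal local factor q(z^(k)) given the current q(pi), cf. (5.20).
    log_r1 = digamma(a) - digamma(a + b) - 0.5 * (y[k] - mu[1]) ** 2
    log_r0 = digamma(b) - digamma(a + b) - 0.5 * (y[k] - mu[0]) ** 2
    r1 = 1.0 / (1.0 + np.exp(log_r0 - log_r1))            # E_q[z^(k)]
    # Stochastic natural gradient step: rescale the minibatch statistic by K.
    target = prior_nat + K * np.array([r1, 1.0 - r1])
    rho = (t + 10.0) ** (-0.7)                            # Robbins-Monro step size
    nat = (1.0 - rho) * nat + rho * target
a, b = nat + 1.0
print("E_q[pi] =", a / (a + b), " (true mixing weight 0.3)")
```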
can be sampled and that the gradient of its log joint density with re-
spect to the variational parameters can be computed efficiently. With
these minimal requirements, BBVI is not only useful in the big-data
setting but also a tool for handling nonconjugate variational inference
more generally. Because BBVI uses Monte Carlo approximation to com-
pute stochastic gradient updates, it fits naturally into a stochastic gra-
dient optimization framework, and hence it has the additional benefit
of yielding a scalable algorithm simply by adding minibatch sampling
to its updates at the cost of increasing their variance. In this subsection
we review the general BBVI algorithm and then compare it to the SVI
algorithm of Section 5.1.1. For a review of Monte Carlo estimation, see
Section 2.2.2.
We consider a general model p(θ, y) = p(θ) ∏_{k=1}^K p(y^(k) | θ) including parameters θ and observations y = {y^(k)}_{k=1}^K divided into K minibatches. The distribution of interest is the posterior p(θ | y) and we write the variational family as q(θ) = q(θ | η̃_θ), where we suppress the particular mean field factorization structure of q(θ) from the notation. The mean field variational lower bound is then
    L = E_{q(θ)}[ log ( p(θ, y) / q(θ) ) ].        (5.21)
Taking the gradient with respect to the variational parameter η̃_θ and expanding the expectation into an integral, we have
    ∇_{η̃_θ} L = ∇_{η̃_θ} ∫ q(θ) log ( p(θ, y) / q(θ) ) dθ        (5.22)
             = ∫ ∇_{η̃_θ}[ log ( p(θ, y) / q(θ) ) ] q(θ) dθ + ∫ log ( p(θ, y) / q(θ) ) ∇_{η̃_θ} q(θ) dθ,        (5.23)
where we have moved the gradient into the integrand and applied the product rule to yield two terms. The first term is identically zero:
    ∫ ∇_{η̃_θ}[ log ( p(θ, y) / q(θ) ) ] q(θ) dθ = − ∫ (1/q(θ)) ∇_{η̃_θ}[ q(θ) ] q(θ) dθ        (5.24)
                                               = − ∫ ∇_{η̃_θ} q(θ) dθ        (5.25)
                                               = − ∇_{η̃_θ} ∫ q(θ) dθ = 0        (5.26)
(5.30) as
    ∇_{η̃_θ} L = E_{k̂} E_{q(θ)}[ ( log ( p(θ) / q(θ) ) + K log p(y^(k̂) | θ) ) ∇_{η̃_θ} log q(θ) ]        (5.31)
             ≈ (1/|S|) ∑_{θ̂∈S} ( log ( p(θ̂) / q(θ̂) ) + K E_{k̂}[ log p(y^(k̂) | θ̂) ] ) ∇_{η̃_θ} log q(θ̂)        (5.32)
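A minimal score-function (REINFORCE) estimator in the spirit of (5.31)-(5.32) might look like the following sketch (ours): q is a diagonal Gaussian over θ, the model's log prior and per-minibatch log likelihood are assumed to be supplied by the caller, and the per-sample estimator uses a single sampled minibatch index rather than the exact expectation over k̂.

```python
import numpy as np

def bbvi_score_gradient(log_prior, log_lik_batch, batches, mu, log_sigma, n_samples, rng):
    """Monte Carlo score-function estimate of the ELBO gradient for a diagonal
    Gaussian q(theta) = N(mu, diag(exp(log_sigma))^2), cf. (5.31)-(5.32)."""
    K = len(batches)
    sigma = np.exp(log_sigma)
    g_mu = np.zeros_like(mu)
    g_ls = np.zeros_like(log_sigma)
    for _ in range(n_samples):
        theta = mu + sigma * rng.normal(size=mu.shape)       # theta_hat ~ q
        k = rng.integers(K)                                  # sampled minibatch index
        weight = (log_prior(theta) + K * log_lik_batch(theta, batches[k])
                  - np.sum(-0.5 * ((theta - mu) / sigma) ** 2 - np.log(sigma)))
        # Score of the diagonal Gaussian with respect to (mu, log_sigma):
        score_mu = (theta - mu) / sigma ** 2
        score_ls = ((theta - mu) / sigma) ** 2 - 1.0
        g_mu += weight * score_mu / n_samples
        g_ls += weight * score_ls / n_samples
    return g_mu, g_ls

# Hypothetical usage for a toy model theta ~ N(0, 10^2), y_n | theta ~ N(theta, 1).
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=(20, 50))                      # 20 minibatches of 50 points
g = bbvi_score_gradient(lambda th: -0.5 * np.sum((th / 10.0) ** 2),
                        lambda th, b: -0.5 * np.sum((b - th) ** 2),
                        y, mu=np.zeros(1), log_sigma=np.zeros(1),
                        n_samples=500, rng=rng)
print("stochastic gradient wrt (mu, log_sigma):", g)
```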
with respect to q(z) of the log density log p(φ̂, z, y) can be computed
without resorting to Monte Carlo estimation, then the resulting update
would likely have a lower variance than the BBVI update that requires
sampling over both q(φ) and q(z).
This comparison also makes clear the advantages of exploiting con-
jugacy in SVI: when the updates of Section 5.1.1 can be used, nei-
ther q(φ) nor q(z) needs to be sampled. Furthermore, while BBVI uses
stochastic gradients in its updates, the SVI algorithm of Section 5.1.1
uses stochastic natural gradients, adapting to the local curvature of the
variational family. Computing stochastic natural gradients in BBVI
would require both computing the Fisher information matrix of the
variational family and solving a linear system with it.
The main weakness of the score function (REINFORCE) gradient
estimates used in BBVI is their high variance, a weakness that scales
poorly with dimensionality. Next we study an alternative gradient es-
timator that applies to some nonconjugate models but can yield lower
variance estimates.
where ε̂ ∼ p(ε) i.i.d. for each ε̂ ∈ S. This Monte Carlo approximation gives an unbiased estimate of the gradient of the variational objective, and it often has lower variance than the more general score function estimator [Kingma and Welling, 2014]. Moreover, it can be straightforward to compute using automatic differentiation tools [Kucukelbir et al., 2015, Duvenaud and Adams, 2015]. There are several proposed improvements to this estimator based on variance reduction and alternative Monte Carlo objectives [Mnih and Gregor, 2014, Burda et al., 2016, Mnih and Rezende, 2016].
As a concrete example, if q(θ) is a multivariate Gaussian density with parameters η̃_θ = (µ, Σ), then we can take ε ∼ N(0, I) and write the reparameterization as
    f(µ, Σ, ε) = µ + Σ^{1/2} ε,        (5.40)
where Σ^{1/2} is a matrix satisfying Σ^{1/2} (Σ^{1/2})ᵀ = Σ.
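Continuing the Gaussian example, a reparameterized gradient estimate can be formed by sampling ε and differentiating through f(µ, Σ, ε). The sketch below (ours) does this for a diagonal Gaussian and a user-supplied differentiable log joint, using a closed-form entropy term and hand-coded gradients rather than automatic differentiation.

```python
import numpy as np

def reparam_gradient(grad_log_joint, mu, log_sigma, n_samples, rng):
    """Reparameterization-gradient estimate of the ELBO gradient for a diagonal
    Gaussian q(theta) = N(mu, diag(exp(log_sigma))^2), using theta = mu + sigma * eps.

    grad_log_joint(theta) must return the gradient of log p(theta, y) wrt theta.
    The entropy of q is handled in closed form (its log_sigma gradient is 1)."""
    sigma = np.exp(log_sigma)
    g_mu = np.zeros_like(mu)
    g_ls = np.zeros_like(log_sigma)
    for _ in range(n_samples):
        eps = rng.normal(size=mu.shape)
        g = grad_log_joint(mu + sigma * eps)
        g_mu += g / n_samples                      # d theta / d mu = 1
        g_ls += g * sigma * eps / n_samples        # d theta / d log_sigma = sigma * eps
    return g_mu, g_ls + 1.0                        # +1 from the entropy term

# Hypothetical usage: theta ~ N(0, 10^2), y_n | theta ~ N(theta, 1).
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=1000)
grad_log_joint = lambda th: -th / 100.0 + np.sum(y - th)
print(reparam_gradient(grad_log_joint, mu=np.zeros(1), log_sigma=np.zeros(1),
                       n_samples=200, rng=rng))
```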
³For example, Leibniz's rule states that given a function f : X × Ω → ℝ where X ⊂ ℝ is open, if f(x, ω) is a Lebesgue-integrable function of ω for each x ∈ X, the partial derivative function ∂f(x, ω)/∂x exists for almost all ω, and there is an integrable function G : Ω → ℝ such that |∂f(x, ω)/∂x| ≤ G(ω), then
    (d/dx) ∫_Ω f(x, ω) dω = ∫_Ω (∂/∂x) f(x, ω) dω.        (5.37)
be computed as
where τ (k) is the index of the global variational parameter used in the
worker’s computation. Upon receiving an update, the master updates
its global variational parameter synchronously according to
While the parallel EP algorithm of the preceding section uses the struc-
ture of EP to construct a parallel algorithm, Stochastic Expectation
Propagation (SEP) [Li et al., 2015] instead builds an EP algorithm with
updates that can be computed using only one minibatch of data at a
time. Thus SEP is closer to an EP analog of the other minibatch-based
variational inference algorithms described in Sections 5.1.1 and 5.2, and
has similar advantages. Specifically, parallel EP still requires the pa-
rameters of the K approximating factors to be stored in (distributed)
memory, and also requires global communication to synchronize the ap-
proximating distribution q(φ). In contrast, SEP needs only an amount
of memory that is constant with respect to the size of the dataset
and only performs local computations. However, it also makes stronger
approximating assumptions which might make it less appropriate for
complex hierarchical models.
For some dataset y = {y_k}_{k=1}^K partitioned into K minibatches, consider the joint model
    p(θ, y) = p(θ) ∏_{k=1}^K p(y_k | θ).        (5.57)
5.4 Summary
5.5 Discussion
6 Challenges and questions
[Figure: estimator error ‖π₀Tⁿ − π‖_TV (log scale) versus iteration n and versus wall-clock time (log scale).]
Measuring performance. With all the ideas surveyed here, one thing
is clear: there are many alternatives for how to scale Bayesian inference.
How should we compare these alternative algorithms? Can we tell when
any of these algorithms work well in an absolute sense?
One standard approach for evaluating MCMC procedures is to de-
fine a set of scalar-valued test functions (or estimands of interest) and
compute effective sample size [Gelman et al., 2014a, Section 11.5][Kong
et al., 1994] as a function of wall-clock time. However, in complex
models designing an appropriately comprehensive set of test functions
may be difficult. Furthermore, many such measures require the Markov
chain to mix and do not account for any asymptotic bias [Gorham and
Mackey, 2015], hence limiting their applicability to measuring the per-
formance of many of the new inference methods studied here.
To confront these challenges, one recently-proposed approach
[Gorham and Mackey, 2015] draws on Stein’s method, classically used
as an analytical tool, to design an efficiently-computable measure of dis-
crepancy between a target distribution and a set of samples. A natural
measure of discrepancy between a target density p(x) and a (weighted) sample distribution q(x), where q(x) = ∑_{i=1}^n w_i δ_{x_i}(x) for some set of samples {x_i}_{i=1}^n and weights {w_i}_{i=1}^n, is to consider their largest absolute difference across a large class of test functions:
which requires computing only the gradient of the target log density.
Furthermore, while the optimization in (6.2) is infinite-dimensional in
general and might have infinitely many smoothness constraints from G,
Gorham and Mackey [2015] shows that for the sample distribution q
the test function g need only be evaluated at the finitely-many sample
points {xi }ni=1 and that only a small number of constraints must be
enforced. This new performance metric does not require assumptions
on whether the samples are generated from an unbiased, stationary
Markov chain, and so it may provide clear ways to compare across a
broad spectrum of sampling-based approximate inference algorithms.
Another recently-proposed approach attempts to estimate or bound
the KL divergence from an algorithm’s approximate posterior represen-
tation to the true posterior, at least when applied to synthetic data.
This approach, called bidirectional Monte Carlo (BDMC) [Grosse et al.,
2015], can be applied to measure the performance of both variational
mean field algorithms as well as annealed importance sampling (AIS)
and sequential Monte Carlo (SMC) algorithms. By rearranging the vari-
ational identity (2.50), we can write the KL divergence KL(qkp) from
an approximating distribution q(z, θ) to a target posterior p(z, θ | ȳ) in
terms of the log marginal likelihood log p(ȳ) and an expectation with
respect to q(z, θ):
p(z, θ | ȳ)
KL(qkp) = log p(ȳ) − Eq(z,θ) log . (6.4)
q(z, θ)
Because the expectation can be readily computed in a mean field set-
ting or stochastically lower-bounded when using AIS [Grosse et al.,
2015, Section 4.1], with a stochastic upper bound on log p(ȳ) we can
Acknowledgements
This work was funded in part by NSF IIS-1421780 and the Alfred P.
Sloan Foundation. E.A. is supported by the Miller Institute for Basic
Research in Science, University of California, Berkeley. M.J. is sup-
ported by a fellowship from the Harvard/MIT Joint Grants program.
References
Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior
sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th
International Conference on Machine Learning, 2012.
Talal M. Alkhamis, Mohamed A. Ahmed, and Vu Kim Tuan. Simulated
annealing for discrete optimization with estimation. European Journal of
Operational Research, 116(3):530–544, 1999.
Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry.
American Mathematical Society, 2007.
Christophe Andrieu and Eric Moulines. On the ergodicity properties of some
adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):
1462–1505, 2006.
Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach
for efficient Monte Carlo computations. Annals of Statistics, pages 697–725,
2009.
Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC.
Statistics and Computing, 18(4):343–373, 2008.
Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov
chain Monte Carlo methods. Journal of the Royal Statistical Society Series
B, 72(3):269–342, 2010.
Elaine Angelino. Accelerating Markov chain Monte Carlo via parallel predic-
tive prefetching. PhD thesis, School of Engineering and Applied Sciences,
Harvard University, 2014.
Elaine Angelino, Eddie Kohler, Amos Waterland, Margo Seltzer, and Ryan P.
Adams. Accelerating MCMC via parallel predictive prefetching. In 30th
Conference on Uncertainty in Artificial Intelligence, pages 22–31, 2014.
Kenneth J. Arrow, Leonid Hurwicz, Hirofumi Uzawa, H.B. Chenery, S.M.
Johnson, S. Karlin, T. Marschak, and R.M. Solow. Studies in linear and
non-linear programming. Stanford University Press, John Wiley & Sons,
1959.
Arthur U. Asuncion, Padhraic Smyth, and Max Welling. Asynchronous dis-
tributed learning of topic models. In Advances in Neural Information Pro-
cessing Systems 21, pages 81–88, 2008.
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-
exploitation tradeoff using variance estimates in multi-armed bandits. The-
oretical Computer Science, 410(19):1876–1902, 2009.
Rémi Bardenet and Odalric-Ambrym Maillard. Concentration inequalities for
sampling without replacement. Bernoulli, 20(3):1361–1385, 2015.
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. Towards scaling up
Markov chain Monte Carlo: An adaptive subsampling approach. In Pro-
ceedings of the 31st International Conference on Machine Learning, 2014.
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte
Carlo methods for tall data. arXiv preprint 1505.02827, 2015.
Mark A. Beaumont. Estimation of population growth or decline in genetically
monitored populations. Genetics, 164(3):1139–60, 2003.
Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2016.
Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Com-
putation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ,
USA, 1989.
Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian
Monte Carlo and naive data subsampling. In Proceedings of The 32nd
International Conference on Machine Learning, pages 533–540, 2015.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Infor-
mation Science and Statistics). Springer-Verlag New York, Inc., Secaucus,
NJ, USA, 2006.
J. Frederic Bonnans and Alexander Shapiro. Perturbation Analysis of Opti-
mization Problems. Springer Science & Business Media, 2000.
Léon Bottou. On-line learning and stochastic approximations. In David Saad,
editor, On-line Learning in Neural Networks, pages 9–42. Cambridge Uni-
versity Press, New York, NY, USA, 1998.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direc-
tion method of multipliers. Foundations and Trends in Machine Learning,
3(1):1–122, January 2011.
A. E. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-
fetching. Journal of Computational and Graphical Statistics, 15(1):246–261,
March 2006.
Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and
Michael I. Jordan. Streaming variational Bayes. In Advances in Neural
Information Processing Systems 26, pages 1727–1735, 2013.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of
Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern
Statistical Methods. CRC press, 2011.
Akif Asil Bulgak and Jerry L. Sanders. Integrating a modified simulated an-
nealing algorithm with the simulation of a manufacturing system to opti-
mize buffer sizes in automatic assembly systems. In Proceedings of the 20th
Conference on Winter Simulation, pages 684–690, New York, NY, USA,
1988. ACM.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted
autoencoders. International Conference on Learning Representations, 2016.
Jonathan M. R. Byrd, Stephen A. Jarvis, and Abhir H. Bhalerao. Reducing
the run-time of MCMC programs by multithreading on SMP architectures.
In IEEE International Symposium on Parallel and Distributed Processing,
pages 1–8, 2008.
Jonathan M. R. Byrd, Stephen A. Jarvis, and Abhir H. Bhalerao. On the
parallelisation of MCMC by speculative chain execution. In IEEE Inter-
national Symposium on Parallel and Distributed Processing - Workshop
Proceedings, pages 1–8, 2010.
Trevor Campbell and Jonathan P. How. Approximate decentralized Bayesian
inference. In 30th Conference on Uncertainty in Artificial Intelligence,
pages 102–111, 2014.
Trevor Campbell, Julian Straub, John W. Fisher III, and Jonathan P. How.
Streaming, distributed variational inference for Bayesian nonparametrics.
In Advances in Neural Information Processing Systems 28, pages 280–288,
2015.
Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamil-
tonian Monte Carlo. In Proceedings of the 31st International Conference
on Machine Learning, June 2014.
Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin
dynamics on the probability simplex. In Advances in Neural Information
Processing Systems 26, pages 3102–3110, 2013.
M. F. Pradier, P. G. Moreno, F. J. R. Ruiz, I. Valera, H. Mollina-Bulla,
and F. Perez-Cruz. Map/reduce uncollapsed Gibbs sampling for Bayesian
non parametric models. Workshop in Software Engineering for Machine
Learning at NIPS, 2014.
Maxim Rabinovich, Elaine Angelino, and Michael I. Jordan. Variational Con-
sensus Monte Carlo. In Advances in Neural Information Processing Systems
28, pages 1207–1215, 2015.
Rajesh Ranganath, Chong Wang, David M. Blei, and Eric P. Xing. An adap-
tive learning rate for stochastic variational inference. In Proceedings of
the 30th International Conference on Machine Learning, volume 28, pages
298–306, 2013.
Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational
inference. In 17th International Conference on Artificial Intelligence and
Statistics, pages 814–822, 2014.
Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hog-
wild!: A lock-free approach to parallelizing stochastic gradient descent. In
Advances in Neural Information Processing Systems 24, pages 693–701,
2011.
Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan P. Adams, Matt Hoffman,
Dustin Lang, David Schlegel, and Prabhat. Celeste: Variational inference
for a generative model of astronomical images. In Proceedings of the 32nd
International Conference on Machine Learning, pages 2095–2103, 2015.
Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-
propagation and approximate inference in deep generative models. In Pro-
ceedings of the 31st International Conference on Machine Learning, pages
1278–1286, 2014.
Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, pages 400–407, 1951.
Christian P. Robert and George Casella. Monte Carlo Statistical Methods
(Springer Texts in Statistics). Springer-Verlag New York, Inc., 2004.
Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of
Langevin distributions and their discrete approximations. Bernoulli, 2(4):
341–363, 12 1996.