Black Box Variational Inference

Rajesh Ranganath Sean Gerrish David M. Blei


Princeton University, 35 Olden St., Princeton, NJ 08540
{rajeshr,sgerrish,blei} AT cs.princeton.edu
arXiv:1401.0118v1 [stat.ML] 31 Dec 2013

Abstract

Variational inference has become a widely used method to approximate posteriors in complex latent variables models. However, deriving a variational inference algorithm generally requires significant model-specific analysis, and these efforts can hinder and deter us from quickly developing and exploring a variety of models for a problem at hand. In this paper, we present a “black box” variational inference algorithm, one that can be quickly applied to many models with little additional derivation. Our method is based on a stochastic optimization of the variational objective where the noisy gradient is computed from Monte Carlo samples from the variational distribution. We develop a number of methods to reduce the variance of the gradient, always maintaining the criterion that we want to avoid difficult model-based derivations. We evaluate our method against the corresponding black box sampling based methods. We find that our method reaches better predictive likelihoods much faster than sampling methods. Finally, we demonstrate that Black Box Variational Inference lets us easily explore a wide space of models by quickly constructing and evaluating several models of longitudinal healthcare data.

1 Introduction

Probabilistic models with latent variables have become a mainstay in modern machine learning applications. With latent variables models, we posit a rich latent structure that governs our observations, infer that structure from large data sets, and use our inferences to summarize observations, draw conclusions about current data, and make predictions about new data. Central to working with latent variable models is the problem of computing the posterior distribution of the latent structure. For many interesting models, computing the posterior exactly is intractable: practitioners must resort to approximate methods.

One of the most widely used methods for approximate posterior estimation is variational inference (Wainwright and Jordan, 2008; Jordan et al., 1999). Variational inference tries to find the member of a family of simple probability distributions that is closest (in KL divergence) to the true posterior distribution.

For a specific class of models, those where the conditional distributions have a convenient form (and where a convenient variational family exists), this optimization can be carried out with a closed-form coordinate ascent algorithm (Ghahramani and Beal, 2001). For generic models and arbitrary variational families, however, there is no closed-form solution: computing the required expectations becomes intractable. In these settings, practitioners have resorted to model-specific algorithms (Jaakkola and Jordan, 1996; Blei and Lafferty, 2007; Braun and McAuliffe, 2007) or generic algorithms that require model specific computations (Knowles and Minka, 2011; Wang and Blei, 2013; Paisley et al., 2012).

Deriving these algorithms on a model-by-model basis is tedious work. This hinders us from rapidly exploring modeling assumptions when solving applied problems, and it makes variational methods on complicated distributions impractical for many practitioners. Our goal in this paper is to develop a “black box” variational inference algorithm, a method that can be quickly applied to almost any model and with little effort. Our method allows practitioners to quickly design, apply, and revise models of their data, without painstaking derivations each time they want to adjust the model.

Variational inference methods frame a posterior estimation problem as an optimization problem, where the parameters to be optimized adjust a variational “proxy” distribution to be similar to the true posterior. Our method rewrites the gradient of that objective as the expectation of an easy-to-implement function $f$ of the latent and observed variables, where the expectation is taken with respect to the variational distribution; and we optimize that objective by sampling from the
variational distribution, evaluating the function $f$, and forming the corresponding Monte Carlo estimate of the gradient. We then use these stochastic gradients in a stochastic optimization algorithm to optimize the variational parameters.

From the practitioner’s perspective, this method requires only that he or she write functions to evaluate the model log-likelihood. The remaining calculations (sampling from the variational distribution and evaluating the Monte Carlo estimate) are easily put into a library to be shared across models, which means our method can be quickly applied to new modeling settings.

We will show that reducing the variance of the gradient estimator is essential to the fast convergence of our algorithm. We develop several strategies for controlling the variance. The first is based on Rao-Blackwellization (Casella and Robert, 1996), which exploits the factorization of the variational distribution. The second is based on control variates (Ross, 2002; Paisley et al., 2012), using the log probability of the variational distribution. We emphasize that these variance reduction methods preserve our goal of black box inference because they do not require computations specific to the model.

Finally, we show how to use recent innovations in variational inference and stochastic optimization to scale up and speed up our algorithm. First, we use adaptive learning rates (Duchi et al., 2011) to set the step size in the stochastic optimization. Second, we develop generic stochastic variational inference (Hoffman et al., 2013), where we additionally subsample from the data to more cheaply compute noisy gradients. This innovates on the algorithm of Hoffman et al. (2013), which requires closed form coordinate updates to compute noisy natural gradients.

We demonstrate our method in two ways. First, we compare our method against Metropolis-Hastings-in-Gibbs (Bishop, 2006), a sampling based technique that requires similar effort on the part of the practitioner. We find our method reaches better predictive likelihoods much faster than sampling methods. Second, we use our method to quickly build and evaluate several models of longitudinal patient data. This demonstrates the ease with which we can now consider models generally outside the realm of variational methods.

Related work. There have been several lines of work that use sampling methods to approximate gradients in variational inference. Wingate and Weber (2013) have independently considered a similar procedure to ours, where the gradient is construed as an expectation and the KL is optimized with stochastic optimization. They too include a term to reduce the variance, but do not describe how to set it. We further innovate on their approach with Rao-Blackwellization, specified control variates, adaptive learning rates, and data subsampling. Salimans and Knowles (2012) provide a framework based on stochastic linear regression. Unlike our approach, their method does not generalize to arbitrary approximating families and requires the inversion of a large matrix that becomes impractical in high dimensional settings. Kingma and Welling (2013) provide an alternative method for variational inference through a reparameterization of the variational distributions. In contrast to our approach, their algorithm is limited to only continuous latent variables. Carbonetto et al. (2009) present a stochastic optimization scheme for moment estimation based on the specific form of the variational objective when both the model and the approximating family are in the same exponential family. This differs from our more general modeling setting where latent variables may be outside of the exponential family. Finally, Paisley et al. (2012) use Monte Carlo gradients for difficult terms in the variational objective and also use control variates to reduce variance. However, theirs is not a black-box method. Both the objective function and control variates they propose require model-specific derivations.

2 Black Box Variational Inference

Variational inference transforms the problem of approximating a conditional distribution into an optimization problem (Jordan et al., 1999; Bishop, 2006; Wainwright and Jordan, 2008). The idea is to posit a simple family of distributions over the latent variables and find the member of the family that is closest in KL divergence to the conditional distribution.

In a probabilistic model, let $x$ be observations, $z$ be latent variables, and $\lambda$ the free parameters of a variational distribution $q(z \mid \lambda)$. Our goal is to approximate $p(z \mid x)$ with a setting of $\lambda$. In variational inference we optimize the Evidence Lower BOund (ELBO),

$$\mathcal{L}(\lambda) \triangleq \mathbb{E}_{q_\lambda(z)}[\log p(x, z) - \log q(z)]. \tag{1}$$

Maximizing the ELBO is equivalent to minimizing the KL divergence (Jordan et al., 1999; Bishop, 2006). Intuitively, the first term rewards variational distributions that place high mass on configurations of the latent variables that also explain the observations; the second term rewards variational distributions that are entropic, i.e., that maximize uncertainty by spreading their mass on many configurations.

Practitioners derive variational algorithms to maximize the ELBO over the variational parameters by expanding the expectation in Eq. 1 and then computing gradients to use in an optimization procedure. Closed form
coordinate-ascent updates are available for conditionally conjugate exponential family models (Ghahramani and Beal, 2001), where the distribution of each latent variable given its Markov blanket falls in the same family as the prior, for a small set of variational families. However, these updates require analytic computation of various expectations for each new model, a problem which is exacerbated when the variational family falls outside this small set. This leads to tedious bookkeeping and overhead for developing new models.

We will instead use stochastic optimization to maximize the ELBO. In stochastic optimization, we maximize a function using noisy estimates of its gradient (Robbins and Monro, 1951; Kushner and Yin, 1997; Bottou and LeCun, 2004). We will form the derivative of the objective as an expectation with respect to the variational approximation and then sample from the variational approximation to get noisy but unbiased gradients, which we use to update our parameters. For each sample, our noisy gradient requires evaluating the joint distribution of the observed and sampled variables, the variational distribution, and the gradient of the log of the variational distribution. This is a black box method in that the gradient of the log of the variational distribution and sampling method can be derived once for each type of variational distribution and reused for many models and applications.

Stochastic optimization. We first review stochastic optimization. Let $f(x)$ be a function to be maximized and $h_t(x)$ be the realization of a random variable $H(x)$ whose expectation is the gradient of $f(x)$. Finally, let $\rho_t$ be the learning rate. Stochastic optimization updates $x$ at the $t$th iteration with

$$x_{t+1} \leftarrow x_t + \rho_t h_t(x_t).$$

This converges to a maximum of $f(x)$ when the learning rate schedule follows the Robbins-Monro conditions,

$$\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty.$$

Because of its simplicity, stochastic optimization is widely used in statistics and machine learning.

Algorithm 1 Black Box Variational Inference
  Input: data $x$, joint distribution $p$, mean field variational family $q$.
  Initialize $\lambda_{1:n}$ randomly, $t = 1$.
  repeat
    // Draw S samples from q
    for s = 1 to S do
      $z[s] \sim q$
    end for
    $\rho$ = $t$th value of a Robbins-Monro sequence (see the Robbins-Monro conditions)
    $\lambda = \lambda + \rho \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(z[s] \mid \lambda)\,(\log p(x, z[s]) - \log q(z[s] \mid \lambda))$
    $t = t + 1$
  until change of $\lambda$ is less than 0.01.

The derivation of Eq. 2 can be found in the appendix. Note that in statistics the gradient $\nabla_\lambda \log q(z \mid \lambda)$ of the log of a probability distribution is called the score function (Cox and Hinkley, 1979).

With Eq. 2 in hand, we compute noisy unbiased gradients of the ELBO with Monte Carlo samples from the variational distribution,

$$\nabla_\lambda \mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(z_s \mid \lambda)\,(\log p(x, z_s) - \log q(z_s \mid \lambda)), \quad \text{where } z_s \sim q(z \mid \lambda). \tag{3}$$

With Eq. 3, we can use stochastic optimization to optimize the ELBO.

The basic algorithm is summarized in Algorithm 1. We emphasize that the score function and sampling algorithms depend only on the variational distribution, not the underlying model. Thus we can easily build up a collection of these functions for various variational approximations and reuse them in a package for a variety of models. Further we did not make any assumptions about the form of the model, only that the practitioner can compute the log of the joint $p(x, z_s)$. This algorithm significantly reduces the effort needed to implement variational inference in a wide variety of models.
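As a rough illustration of Algorithm 1, the following Python sketch runs the score-function gradient of Eq. 3 on a toy problem. Everything specific here is an assumption made for the example rather than part of the paper: the model ($z \sim \text{Normal}(0, 1)$, $x_i \mid z \sim \text{Normal}(z, 1)$), the Gaussian variational family parameterized by (mu, log sigma), the function names, the step-size schedule $\rho_t = 0.1\, t^{-0.7}$ (one sequence that satisfies the Robbins-Monro conditions), and the fixed iteration count used in place of the convergence check.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def log_joint(x, z):
    """log p(x, z) for the assumed toy model, one value per sample in z (shape (S,))."""
    log_prior = -0.5 * z**2 - 0.5 * LOG_2PI              # z ~ Normal(0, 1)
    resid = x[None, :] - z[:, None]                      # shape (S, n)
    log_lik = -0.5 * (resid**2).sum(axis=1) - 0.5 * x.size * LOG_2PI
    return log_prior + log_lik

def log_q(z, lam):
    """log q(z | lam) for the assumed Gaussian family, lam = (mu, log_sigma)."""
    mu, log_sigma = lam
    return -0.5 * ((z - mu) / np.exp(log_sigma))**2 - log_sigma - 0.5 * LOG_2PI

def score_q(z, lam):
    """Score function: gradient of log q(z | lam) with respect to (mu, log_sigma)."""
    mu, log_sigma = lam
    sigma = np.exp(log_sigma)
    d_mu = (z - mu) / sigma**2
    d_log_sigma = ((z - mu) / sigma)**2 - 1.0
    return np.stack([d_mu, d_log_sigma], axis=1)         # shape (S, 2)

def bbvi_gradient(x, lam, S, rng):
    """Monte Carlo estimate of the ELBO gradient (Eq. 3) from S samples of q."""
    mu, log_sigma = lam
    z = mu + np.exp(log_sigma) * rng.standard_normal(S)  # z[s] ~ q(z | lam)
    weight = log_joint(x, z) - log_q(z, lam)             # log p(x, z) - log q(z | lam)
    return (score_q(z, lam) * weight[:, None]).mean(axis=0)

def bbvi(x, S=200, n_iters=2000, seed=0):
    """Stochastic gradient ascent on the ELBO; fixed iteration count (assumption)."""
    rng = np.random.default_rng(seed)
    lam = np.zeros(2)                                    # mu = 0, log_sigma = 0
    for t in range(1, n_iters + 1):
        rho = 0.1 * t ** -0.7                            # assumed Robbins-Monro schedule
        lam = lam + rho * bbvi_gradient(x, lam, S, rng)
    return lam

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 0.5])
    mu_hat, log_sigma_hat = bbvi(x)
    print("variational mean:", mu_hat, "variational std:", np.exp(log_sigma_hat))
```

Only `log_joint` depends on the model; `log_q` and `score_q` depend only on the variational family, which is the sense in which the method is black box. For this conjugate toy model the exact posterior is Normal(sum(x)/(n+1), 1/(n+1)), so the fitted parameters can be checked by hand.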
A noisy gradient of the ELBO. To optimize the ELBO with stochastic optimization, we need to develop an unbiased estimator of its gradient which can be computed from samples from the variational posterior. To do this, we write the gradient of the ELBO (Eq. 1) as an expectation with respect to the variational distribution,

$$\nabla_\lambda \mathcal{L} = \mathbb{E}_q[\nabla_\lambda \log q(z \mid \lambda)\,(\log p(x, z) - \log q(z \mid \lambda))]. \tag{2}$$

3 Controlling the Variance

We can use Algorithm 1 to maximize the ELBO, but the variance of the estimator of the gradient (under the Monte Carlo estimate in Eq. 3) can be too large to be useful. In practice, the high variance gradients would require very small steps which would lead to slow convergence. We now show how to reduce this variance in two ways, via Rao-Blackwellization and
easy-to-implement control variates. We exploit the structure of our problem to use these methods in a way that requires no model-specific derivations, which preserves our goal of black-box variational inference.

3.1 Rao-Blackwellization

Rao-Blackwellization (Casella and Robert, 1996) reduces the variance of a random variable by replacing it with its conditional expectation with respect to a subset of the variables. Note that the conditional expectation of a random variable is a random variable with respect to the conditioning set. This generally requires analytically computing problem-specific integrals. Here we show how to Rao-Blackwellize the estimator for each component of the gradient without needing to compute model-specific integrals.

In the simplest setting, Rao-Blackwellization replaces a function of two variables with its conditional expectation. Consider two random variables, $X$ and $Y$, and a function $J(X, Y)$. Our goal is to compute its expectation $\mathbb{E}[J(X, Y)]$ with respect to the joint distribution of $X$ and $Y$.

Define $\hat{J}(X) = \mathbb{E}[J(X, Y) \mid X]$, and note that $\mathbb{E}[\hat{J}(X)] = \mathbb{E}[J(X, Y)]$. This means that $\hat{J}(X)$ can be used in place of $J(X, Y)$ in a Monte Carlo approximation of $\mathbb{E}[J(X, Y)]$. The variance of $\hat{J}(X)$ is

$$\mathrm{Var}(\hat{J}(X)) = \mathrm{Var}(J(X, Y)) - \mathbb{E}[(J(X, Y) - \hat{J}(X))^2].$$

This means that $\hat{J}(X)$ is a lower variance estimator than $J(X, Y)$.

We return to the problem of estimating the gradient of $\mathcal{L}$. Suppose there are $n$ latent variables $z_{1:n}$ and we are using the mean-field variational family, where each random variable $z_i$ is independent and governed by its own variational distribution,

$$q(z \mid \lambda) = \prod_{i=1}^{n} q(z_i \mid \lambda_i), \tag{4}$$

where $\lambda_{1:n}$ are the $n$ variational parameters characterizing the member of the variational family we seek. Consider the $i$th component of the gradient. Let $q_{(i)}$ be the distribution of variables in the model that depend on the $i$th variable, i.e., the Markov blanket of $z_i$; and let $p_i(x, z_{(i)})$ be the terms in the joint that depend on those variables. We can write the gradient with respect to $\lambda_i$ as an iterated conditional expectation which simplifies to

$$\nabla_{\lambda_i} \mathcal{L} = \mathbb{E}_{q_{(i)}}[\nabla_{\lambda_i} \log q(z_i \mid \lambda_i)\,(\log p_i(x, z_{(i)}) - \log q(z_i \mid \lambda_i))]. \tag{5}$$

The derivation of this expression is in the supplement. This equation says that Rao-Blackwellized estimators can be computed for each component of the gradient without needing to compute model-specific conditional expectations.

Finally, we construct a Monte Carlo estimator for the gradient with respect to $\lambda_i$ using samples from the variational distribution,

$$\frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i)\,(\log p_i(x, z_s) - \log q_i(z_s \mid \lambda_i)), \quad \text{where } z_s \sim q_{(i)}(z \mid \lambda). \tag{6}$$

This Rao-Blackwellized estimator for each component of the gradient has lower variance. In our empirical study, Figure 2, we plot the variance of this estimator along with that of Eq. 3.

3.2 Control Variates

As we saw above, variance reduction methods work by replacing the function whose expectation is being approximated by Monte Carlo with another function that has the same expectation but smaller variance. That is, to estimate $\mathbb{E}_q[f]$ via Monte Carlo we compute the empirical average of $\hat{f}$ where $\hat{f}$ is chosen so $\mathbb{E}_q[f] = \mathbb{E}_q[\hat{f}]$ and $\mathrm{Var}_q[f] > \mathrm{Var}_q[\hat{f}]$.

A control variate (Ross, 2002) is a family of functions with equivalent expectation. Consider a function $h$, which has a finite first moment, and a scalar $a$. Define $\hat{f}$ to be

$$\hat{f}(z) \triangleq f(z) - a\,(h(z) - \mathbb{E}[h(z)]). \tag{7}$$

This is a family of functions, indexed by $a$, and note that $\mathbb{E}_q[\hat{f}(z)] = \mathbb{E}_q[f]$ as required. Given a particular function $h$, we can choose $a$ to minimize the variance of $\hat{f}$.

First we note that the variance of $\hat{f}$ can be written as

$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) + a^2\,\mathrm{Var}(h) - 2a\,\mathrm{Cov}(f, h).$$

This equation implies that good control variates have high covariance with the function whose expectation is being computed.

Taking the derivative of $\mathrm{Var}(\hat{f})$ with respect to $a$ and setting it equal to zero gives us the value of $a$ that minimizes the variance,

$$a^* = \mathrm{Cov}(f, h) / \mathrm{Var}(h).$$

With Monte Carlo estimates from the distribution, which we are collecting anyway to compute $\mathbb{E}[f]$, we can estimate $a^*$ with the ratio of the empirical covariance and variance.
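The scaling $a^*$ can be estimated directly from paired samples of $f$ and $h$. The sketch below is a minimal Python illustration under assumed names (`optimal_scaling`, `control_variate_samples`) and an assumed toy problem that does not come from the paper: $q$ is Normal(mu, 1), $h$ is its score with respect to mu (which has expectation zero), and $f$ is the score-function integrand for the gradient of $\mathbb{E}_q[z^2]$, whose exact value is 2 mu.

```python
import numpy as np

def optimal_scaling(f, h):
    """Estimate a* = Cov(f, h) / Var(h) from paired samples of f and h."""
    f = np.asarray(f, dtype=float)
    h = np.asarray(h, dtype=float)
    cov = np.mean((f - f.mean()) * (h - h.mean()))       # empirical covariance
    return cov / np.var(h)                               # empirical variance of h

def control_variate_samples(f, h, h_mean=0.0):
    """Return samples of f_hat = f - a* (h - E[h]); h_mean is the known E[h]."""
    a = optimal_scaling(f, h)
    return np.asarray(f, dtype=float) - a * (np.asarray(h, dtype=float) - h_mean)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = 3.0                                             # assumed variational mean
    z = mu + rng.standard_normal(50_000)                 # z ~ Normal(mu, 1)
    h = z - mu                                           # score of Normal(mu, 1) w.r.t. mu
    f = h * z**2                                         # integrand for d/dmu E[z^2]
    f_hat = control_variate_samples(f, h)
    print("plain estimate:", f.mean(), "variance:", f.var())
    print("with control variate:", f_hat.mean(), "variance:", f_hat.var())
```

Because $a^*$ is estimated from the same samples it is applied to, the corrected estimate carries a small bias that vanishes as the number of samples grows; estimating the scaling on a separate set of samples is one way to avoid this.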
We now apply this method to Black Box Variational Inference. To maintain the generic nature of the algorithm, we want to choose a control variate that only depends on the variational distribution and for which we can easily compute its expectation. Meeting these criteria, we choose $h$ to be the score function of the variational approximation, $\nabla_\lambda \log q(z)$, which always has expectation zero. (See Eq. 14 in the appendix.) With this control variate, we have a new Monte Carlo method to compute the Rao-Blackwellized noisy gradients of the ELBO. For the $i$th component of the gradient, the function whose expectation is being estimated and its control variate, $f_i$ and $h_i$ respectively, are

$$f_i(z) = \nabla_{\lambda_i} \log q(z \mid \lambda_i)\,(\log p(x, z) - \log q(z \mid \lambda_i)), \tag{8}$$

$$h_i(z) = \nabla_{\lambda_i} \log q(z \mid \lambda_i).$$

The estimate of the optimal scaling is obtained by summing the covariances and variances over the $n_i$ dimensions of $\lambda_i$. Letting the $d$th dimensions of $f_i$ and $h_i$ be $f_i^d$ and $h_i^d$ respectively, the estimated optimal scaling for the gradient of the ELBO is given by

$$\hat{a}_i^* = \frac{\sum_{d=1}^{n_i} \widehat{\mathrm{Cov}}(f_i^d, h_i^d)}{\sum_{d=1}^{n_i} \widehat{\mathrm{Var}}(h_i^d)}. \tag{9}$$

This gives us the following Monte Carlo method to compute noisy gradients using $S$ samples

$$\hat{\nabla}_{\lambda_i} \mathcal{L} \triangleq \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i)\,(\log p_i(x, z_s) - \log q_i(z_s \mid \lambda_i) - \hat{a}_i^*), \quad \text{where } z_s \sim q_{(i)}(z \mid \lambda). \tag{10}$$

Again note that we define the control variates on a per-component basis. This estimator uses both Rao-Blackwellization and control variates. We show in the empirical study that this generic control variate further reduces the variance of the estimator.

3.3 Black Box Variational Inference (II)

Putting together the noisy gradient, Rao-Blackwellization, and control variates, we present Black Box Variational Inference (II). It takes samples from the variational approximation to compute noisy gradients as in Eq. 10. These noisy gradients are then used in a stochastic optimization procedure to maximize the ELBO.

We summarize the procedure in Algorithm 2. Note that for simplicity of presentation, this algorithm stores all the samples. We can remove this memory requirement with a small modification, computing the $a_i^*$ terms on a small set of examples and computing the required averages online.

Algorithm 2 Black Box Variational Inference (II)
  Input: data $x$, joint distribution $p$, mean field variational family $q$.
  Initialize $\lambda_{1:n}$ randomly, $t = 1$.
  repeat
    // Draw S samples from the variational approximation
    for s = 1 to S do
      $z[s] \sim q$
    end for
    for i = 1 to n do
      for s = 1 to S do
        $f_i[s] = \nabla_{\lambda_i} \log q_i(z[s] \mid \lambda_i)\,(\log p_i(x, z[s]) - \log q_i(z[s] \mid \lambda_i))$
        $h_i[s] = \nabla_{\lambda_i} \log q_i(z[s] \mid \lambda_i)$
      end for
      $\hat{a}_i^* = \sum_{d=1}^{n_i} \widehat{\mathrm{Cov}}(f_i^d, h_i^d) \,/\, \sum_{d=1}^{n_i} \widehat{\mathrm{Var}}(h_i^d)$
      $\hat{\nabla}_{\lambda_i} \mathcal{L} \triangleq \frac{1}{S} \sum_{s=1}^{S} (f_i[s] - \hat{a}_i^* h_i[s])$
    end for
    $\rho$ = $t$th value of a Robbins-Monro sequence
    $\lambda = \lambda + \rho \hat{\nabla}_\lambda \mathcal{L}$
    $t = t + 1$
  until change of $\lambda$ is less than 0.01.

Algorithm 2 is easily used on many models. It only requires samples from the variational distribution, computations about the variational distribution, and easy computations about the model.

4 Extensions

We extend the main algorithm in two ways. First, we address the difficulty of setting the step size schedule. Second, we address scalability by subsampling observations.

4.1 AdaGrad

One challenge with stochastic optimization techniques is setting the learning rate. Intuitively, we would like the learning rate to be small when the variance of the gradient is large and vice-versa. Additionally, in problems like ours that have different scales [footnote 1: probability distributions have many parameterizations], the learning rate needs to be set small enough to handle the smallest scale. To address this issue, we use the AdaGrad algorithm (Duchi et al., 2011). Let $G_t$ be a matrix containing the sum across the first $t$ iterations of the outer products of the gradient. AdaGrad defines a per
component learning rate as

$$\rho_t = \eta\,\mathrm{diag}(G_t)^{-1/2}. \tag{11}$$

This is a per-component learning rate since $\mathrm{diag}(G_t)$ has the same dimension as the gradient. Note that since AdaGrad only uses the diagonal of $G_t$, those are the only elements we need to compute. AdaGrad captures noise and varying length scales through the square of the noisy gradient and reduces the number of parameters to our algorithm from the standard two parameter Robbins-Monro learning rate.

4.2 Stochastic Inference in Hierarchical Bayesian Models

Stochastic optimization has also been used to scale variational inference in hierarchical Bayesian models to massive data (Hoffman et al., 2013). The basic idea is to subsample observations to compute noisy gradients. We can use a similar idea to scale our method.

In a hierarchical Bayesian model, we have a hyperparameter $\eta$, global latent variables $\beta$, local latent variables $z_{1 \ldots n}$, and observations $x_{1 \ldots n}$ having the log joint distribution

$$\log p(x_{1 \ldots n}, z_{1 \ldots n}, \beta) = \log p(\beta \mid \eta) + \sum_{i=1}^{n} \big[ \log p(z_i \mid \beta) + \log p(x_i \mid z_i, \beta) \big]. \tag{12}$$

This is the same definition as in Hoffman et al. (2013), but they place further restrictions on the forms of the distributions and the complete conditionals. Under the mean field approximating family, applying Eq. 10 to construct noisy gradients of the ELBO would require iterating over every datapoint. Instead we can compute noisy gradients using a sampled observation and samples from the variational distribution. The derivation along with variance reductions can be found in the supplement.

5 Empirical Study

We use Black Box Variational Inference to quickly construct and evaluate several models on longitudinal medical data. We demonstrate the effectiveness of our variance reduction methods and compare the speed and predictive likelihood of our algorithm to sampling based methods. We evaluate the various models using predictive likelihood and demonstrate the ease with which several models can be explored.

5.1 Longitudinal Medical Data

Our data consist of longitudinal data from 976 patients (803 train + 173 test) from a clinic at New York Presbyterian hospital who have been diagnosed with chronic kidney disease. These patients visited the clinic a total of 33K times. During each visit, a subset of 17 measurements (labs) were measured.

The data are observational and consist of measurements (lab values) taken at the doctor’s discretion when the patient is at a checkup. This means both that the labs at each time step are sparse and that the times between patient visits are highly irregular. The lab values are all positive as the labs measure the amount of a particular quantity such as sodium concentration in the blood.

Our modeling goal is to come up with a low dimensional summarization of patients’ labs at each of their visits. From this, we aim to find latent factors that summarize each visit as positive random variables. As in medical data applications, we want our factors to be latent indicators of patient health.

We evaluate our model using predictive likelihood. To compute predictive likelihoods, we need an approximate posterior on both the global parameters and the per visit parameters. We use the approximate posterior on the global parameters and calculate the approximate posterior on the local parameters on 75% of the data in the test set. We then calculate the predictive likelihood on the other 25% of the data in the validation set using Monte Carlo samples from the approximate posterior.

We initialize randomly and choose the variational families to be fully-factorized with gamma distributions for positive variables and normals for real valued variables. We use both the AdaGrad and doubly stochastic extensions on top of our base algorithm. We use 1,000 samples from the variational distribution and set the batch size at 25 in all our experiments.

5.2 Model

To meet our goals, we construct a Gamma-Normal time series model. We model our data using weights drawn from a Normal distribution and observations drawn from a Normal, allowing each factor to both positively and negatively affect each lab while letting factors represent lab measurements. The generative process for this model with hyperparameters denoted with $\sigma$ is

Draw $W \sim \text{Normal}(0, \sigma_w)$, an $L \times K$ matrix
For each patient $p$: 1 to $P$
  Draw $o_p \sim \text{Normal}(0, \sigma_o)$, a vector of $L$
  Define $x_{p0} = \alpha_0$
  For each visit $v$: 1 to $v_p$
    Draw $x_{pv} \sim \text{GammaE}(x_{p,v-1}, \sigma_x)$
    Draw $l_{pv} \sim \text{Normal}(W x_{pv} + o_p, \sigma_l)$, a vector of $L$.

We set $\sigma_w$, $\sigma_o$, $\sigma_x$ to be 1 and $\sigma_l$ to be .01. In our model, GammaE is the expectation/variance parameterization of the ($L$-dimensional) gamma distribution. (The mapping between this parameterization and the more standard one can be found in the supplement.) Black Box Variational Inference allows us to make use of non-standard parameterizations for distributions that are easier to reason about. This is an important observation, as the standard set of families used in variational inference tend to be fairly limited. In this case, the expectation parameterization of the gamma distribution allows the previous visit factors to define the expectation of the current visit factors. Finally, we emphasize that coordinate ascent variational inference and Gibbs sampling are not available for this model because the required conditional distributions do not have closed form.

5.3 Sampling Methods

We compare Black Box Variational Inference to a standard sampling based technique, Metropolis-Hastings (Bishop, 2006), that also only needs the joint distribution. [Footnote 2: Methods that involve a bit more work such as Hamiltonian Monte Carlo could work in this setting, but as our technique only requires the joint distribution and could benefit from added analysis used in more complex methods, we compare against a similar method.]

Metropolis-Hastings works by sampling from a proposal distribution and accepting or rejecting the samples based on the likelihood. Standard Metropolis-Hastings can work poorly in high dimensional models. We find that it fails for the Gamma-Normal-TS model. Instead, we compare to a Gibbs sampling method that uses Metropolis-Hastings to sample from the complete conditionals. For our proposal distribution we use the same distributions as found in the previous iteration, with the mean equal to the value of the previous parameter. We compute predictive likelihoods using the posterior samples generated by the MCMC methods on held out data in the test set.

On this model, we compared Black Box Variational Inference to Metropolis-Hastings inside Gibbs. We used a fixed computational budget of 20 hours. Figure 1 plots time versus predictive likelihood on the held out set for both methods. We found similar results with different random initializations of both models. Black Box Variational Inference gives better predictive likelihoods and gets them faster than Metropolis-Hastings within Gibbs. [Footnote 3: Black Box Variational Inference also has better predictive mean-squared error on the labs than Gibbs style Metropolis-Hastings.]

[Figure 1 (plot not reproduced): held out log predictive likelihood against time in hours, from 0 to 20, for Gibbs and Black Box VI.]

Figure 1: Comparison between Metropolis-Hastings within Gibbs and Black Box Variational Inference. The x axis is time and the y axis is the predictive likelihood of the test set. Black Box Variational Inference reaches better predictive likelihoods faster than Gibbs sampling. The Gibbs sampler’s progress slows considerably after 5 hours.

5.4 Variance Reductions

We next studied how much variance is reduced with our variance reduction methods. In Figure 2, we plot the variance of various estimators of the gradient of the variational approximation for a factor in the patient time-series versus iteration number. We compare the variance of the Monte Carlo gradient in Eq. 3 to that of the Rao-Blackwellized gradient (Eq. 6) and that of the gradient using both Rao-Blackwellization and control variates (Eq. 10). We found that Rao-Blackwellization reduces the variance by several orders of magnitude. Applying control variates reduces the variance further. This reduction in variance drastically improves the speed at which Black Box Variational Inference converges. In fact, in the time allotted, Algorithm 1 (the algorithm without variance reductions) failed to make noticeable progress.

5.5 Exploring Models

We developed Black Box Variational Inference to make it easier to quickly explore and fit many new models to a data set. We demonstrate this by considering a sequence of three other factor and time-series models of the health data. We name these models Gamma,
Gamma-TS, and Gamma-Normal.

[Figure 2 plot: estimator variance (log scale) versus iteration for the Basic, Rao-Blackwell, and Rao-Blackwell+CV estimators.]

Figure 2: Variance comparison for the first component of a random patient on the following estimators: Eq. 3, the Rao-Blackwellized estimator Eq. 6, and the Rao-Blackwellized control variate estimator Eq. 10. We find that Rao-Blackwellizing the naive estimator reduces the variance by several orders of magnitude. Adding control variates reduces the variance even further.

Model              Predictive Likelihood
Gamma-Normal       -33.9
Gamma-Normal-TS    -32.7
Gamma-Gamma        -175
Gamma-Gamma-TS     -174

Table 1: A comparison between several models for our patient health dataset. We find that taking into account the longitudinal nature of the data in the model leads to a better fit. The Gamma weight models perform relatively poorly. This is likely due to the fact that some labs are negatively correlated. These models cannot capture such relationships.

Gamma. We model the latent factors that summarize each visit in our models as positive random variables; as noted above, we expect these to be indicative of patient health. The Gamma model is a positive-value factor model where all of the factors, weights, and observations have positive values. The generative process for this model is

Draw $W \sim \mathrm{Gamma}(\alpha_w, \beta_w)$, an $L \times K$ matrix
For each patient $p$: 1 to $P$
    Draw $o_p \sim \mathrm{Gamma}(\alpha_o, \beta_o)$, a vector of $L$
    For each visit $v$: 1 to $v_p$
        Draw $x_{pv} \sim \mathrm{Gamma}(\alpha_x, \beta_x)$
        Draw $l_{pv} \sim \mathrm{GammaE}(W x_{pv} + o_p, \sigma_o)$, a vector of $L$.

We set all hyperparameters save $\sigma_o$ to be 1. As in the previous model, $\sigma_o$ is set to .01.

Gamma-TS. We can link the factors through time using the expectation parameterization of the Gamma distribution. (Note this is harder with the usual natural parameterization of the Gamma.) This changes $x_{pv}$ to be distributed as $\mathrm{GammaE}(x_{p,v-1}, \sigma_v)$. We draw $x_{p1}$ as above. In this model, the expected value of the factors at the next visit is the same as the value at the current visit. This allows us to propagate patient states through time.

Gamma-Normal. Similar to the above, we can change the time-series Gamma-Normal-TS (studied in the previous section) to a simpler factor model. This is similar to the Gamma model, but with Normal priors.

These combinations lead to a set of four models that are all nonconjugate and for which standard variational techniques are difficult to apply. Our variational inference method allows us to compute approximate posteriors for these models to determine which provides the best low dimensional latent representations.

We set the AdaGrad scaling parameter to 1 for both the Gamma-Normal models and to .5 for the Gamma models.

Model Comparisons. Table 1 details our models along with their predictive likelihoods. From this we see that time helps in modelling our longitudinal healthcare data. We also see that the Gamma-Gamma models perform poorly. This is likely because they cannot capture the negative correlations that exist between different medical labs. More importantly, by using Black Box Variational Inference we were able to quickly fit and explore a set of complicated non-conjugate models. Without a generic algorithm, approximating the posterior of any of these models is a project in itself.

6 Conclusion

We developed and studied Black Box Variational Inference, a new algorithm for variational inference that drastically reduces the analytic burden. Our main approach is a stochastic optimization of the ELBO by sampling from the variational posterior to compute a noisy gradient. Essential to its success are model-free variance reductions to reduce the variance of the noisy gradient. Black Box Variational Inference works well for new models, while requiring minimal analytic work by the practitioner.

There are several natural directions for future improvements to this work. First, the software libraries that we provide can be augmented with score functions for a wider variety of variational families (each score function is simply the gradient of the log of the variational distribution with respect to the variational parameters). Second, we believe that the number of samples could be set dynamically. Finally, carefully-selected samples from the variational distribution (e.g., with quasi-Monte Carlo methods) are likely to significantly decrease sampling variance.

7 Appendix: The Gradient of the ELBO

The key idea behind our algorithm is that the gradient of the ELBO can be written as an expectation with respect to the variational distribution. We start by differentiating Eq. 1,

$$
\begin{aligned}
\nabla_\lambda \mathcal{L}
&= \nabla_\lambda \int (\log p(x, z) - \log q(z|\lambda))\, q(z|\lambda)\, dz \\
&= \int \nabla_\lambda \big[(\log p(x, z) - \log q(z|\lambda))\, q(z|\lambda)\big]\, dz \\
&= \int \nabla_\lambda \big[\log p(x, z) - \log q(z|\lambda)\big]\, q(z|\lambda)\, dz
 + \int \nabla_\lambda q(z|\lambda)\, (\log p(x, z) - \log q(z|\lambda))\, dz \\
&= -E_q[\nabla_\lambda \log q(z|\lambda)]
 + \int \nabla_\lambda q(z|\lambda)\, (\log p(x, z) - \log q(z|\lambda))\, dz,
\end{aligned}
\tag{13}
$$

where we have exchanged derivatives with integrals via the dominated convergence theorem4 (Cinlar, 2011) and used $\nabla_\lambda [\log p(x, z)] = 0$.

The first term in Eq. 13 is zero. To see this, note

$$
E_q[\nabla_\lambda \log q(z|\lambda)]
= E_q\left[\frac{\nabla_\lambda q(z|\lambda)}{q(z|\lambda)}\right]
= \int \nabla_\lambda q(z|\lambda)\, dz
= \nabla_\lambda \int q(z|\lambda)\, dz
= \nabla_\lambda 1 = 0.
\tag{14}
$$

To simplify the second term, first observe that $\nabla_\lambda [q(z|\lambda)] = \nabla_\lambda [\log q(z|\lambda)]\, q(z|\lambda)$. This fact gives us the gradient as an expectation,

$$
\begin{aligned}
\nabla_\lambda \mathcal{L}
&= \int \nabla_\lambda [q(z|\lambda)]\, (\log p(x, z) - \log q(z|\lambda))\, dz \\
&= \int \nabla_\lambda \log q(z|\lambda)\, (\log p(x, z) - \log q(z|\lambda))\, q(z|\lambda)\, dz \\
&= E_q[\nabla_\lambda \log q(z|\lambda)\, (\log p(x, z) - \log q(z|\lambda))].
\end{aligned}
$$

4 The score function exists. The score and likelihoods are bounded.

References

C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.
L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems, 2004.
M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of American Statistical Association, 105(489), 2007.
P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 216–224. 2009.
G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
E. Cinlar. Probability and Stochastics. Springer, 2011.
D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman and Hall, 1979.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.
Z. Ghahramani and M. Beal. Propagation algorithms for variational Bayesian learning. In NIPS 13, pages 507–513, 2001.
M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1303–1347), 2013.
T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In International Workshop on Artificial Intelligence and Statistics, 1996.
M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
D. Kingma and M. Welling. Auto-encoding variational bayes. ArXiv e-prints, December 2013.
D. Knowles and T. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems, 2011.
H. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer New York, 1997.
J. Paisley, D. Blei, and M. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
S. M. Ross. Simulation. Elsevier, 2002.
T. Salimans and D. Knowles. Fixed-form variational approximation through stochastic linear regression. ArXiv e-prints, August 2012.
M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
C. Wang and D. M. Blei. Variational inference for nonconjugate models. JMLR, 2013.
D. Wingate and T. Weber. Automated variational inference in probabilistic programming. ArXiv e-prints, January 2013.

Supplement

Derivation of the Rao-Blackwellized Gradient

To compute the Rao-Blackwellized estimators, we need to compute conditional expectations. Under the mean-field assumption, the conditional expectation simplifies because of the factorization

$$
E[J(X, Y)|X] = \frac{\int J(x, y)\, p(x)\, p(y)\, dy}{\int p(x)\, p(y)\, dy}
= \int J(x, y)\, p(y)\, dy = E_y[J(x, y)].
\tag{A.15}
$$

Therefore, to construct a lower variance estimator when the joint distribution factorizes, all we need to do is integrate out some variables. In our problem this means that for each component of the gradient, we should compute expectations with respect to the other factors. We present the estimator in the full mean field family of variational distributions, but note it applies to any variational approximation with some factorization, such as structured mean-field.

Thus, under the mean field assumption the Rao-Blackwellized estimator for the gradient becomes

$$
\nabla_\lambda \mathcal{L} = E_{q_1} \cdots E_{q_n}\Big[\sum_{i=1}^{n} \nabla_\lambda \log q_i(z_i|\lambda_i)\Big(\log p(x, z) - \sum_{j=1}^{n} \log q_j(z_j|\lambda_j)\Big)\Big].
\tag{A.16}
$$

Recall the definitions from Section 3, where we defined $\nabla_{\lambda_i} \mathcal{L}$ as the gradient of the ELBO with respect to $\lambda_i$, $p_i$ as the components of the log joint that include terms from the $i$th factor, and $E_{q_{(i)}}$ as the expectation with respect to the set of latent variables that appear in the complete conditional for $z_i$. Let $p_{-i}$ be the components of the joint that do not include terms from the $i$th factor. We can write the gradient with respect to the $i$th factor's variational parameters as

$$
\begin{aligned}
\nabla_{\lambda_i} \mathcal{L}
&= E_{q_1} \cdots E_{q_n}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(\log p(x, z) - \sum_{j=1}^{n} \log q_j(z_j|\lambda_j)\Big)\Big] \\
&= E_{q_1} \cdots E_{q_n}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(\log p_i(x, z) + \log p_{-i}(x, z) - \sum_{j=1}^{n} \log q_j(z_j|\lambda_j)\Big)\Big] \\
&= E_{q_i}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(E_{q_{-i}}[\log p_i(x, z_{(i)})] - \log q_i(z_i|\lambda_i) + E_{q_{-i}}\Big[\log p_{-i}(x, z) - \sum_{j=1, j \neq i}^{n} \log q_j(z_j|\lambda_j)\Big]\Big)\Big] \\
&= E_{q_i}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(E_{q_{-i}}[\log p_i(x, z)] - \log q_i(z_i|\lambda_i) + C_i\Big)\Big] \\
&= E_{q_i}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(E_{q_{-i}}[\log p_i(x, z_{(i)})] - \log q_i(z_i|\lambda_i)\Big)\Big] \\
&= E_{q_{(i)}}\Big[\nabla_{\lambda_i} \log q_i(z_i|\lambda_i)\Big(\log p_i(x, z_{(i)}) - \log q_i(z_i|\lambda_i)\Big)\Big],
\end{aligned}
\tag{A.17}
$$

where $C_i$ is constant with respect to $z_i$, and we have leveraged the mean field assumption and made use of the identity for the expected score, Eq. 14, to drop it. This means we can Rao-Blackwellize the gradient of the variational parameter $\lambda_i$ with respect to the latent variables outside of the Markov blanket of $z_i$ without needing model specific computations.

Derivation of Stochastic Inference in Hierarchical Bayesian Models

Recall the definition of a hierarchical Bayesian model with $n$ observations given in Eq. 12

$$
\log p(x_{1 \ldots n}, z_{1 \ldots n}, \beta) = \log p(\beta|\eta) + \sum_{i=1}^{n} \big[\log p(z_i|\beta) + \log p(x_i|z_i, \beta)\big].
$$

Let the variational approximation for the posterior distribution be from the mean field family. Let $\lambda$ be the global variational parameter and let $\phi_{1 \ldots n}$ be the local variational parameters. The variational family is

$$
q(\beta, z_{1 \ldots n}) = q(\beta|\lambda) \prod_{i=1}^{n} q(z_i|\phi_i).
\tag{A.18}
$$
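As a rough illustration of the hierarchical log joint (Eq. 12) and the mean-field family (Eq. A.18), the following is a minimal sketch on a toy model that is not from the paper. The Gaussian choices ($\beta \sim N(0, \eta^2)$, $z_i \sim N(\beta, 1)$, $x_i \sim N(z_i, 1)$), the Gaussian variational factors, and all function names are assumptions made only for this example.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Normal, evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(x, z, beta, eta=1.0):
    """log p(beta|eta) + sum_i [log p(z_i|beta) + log p(x_i|z_i, beta)].

    Assumed toy model: beta ~ N(0, eta^2), z_i ~ N(beta, 1), x_i ~ N(z_i, 1).
    """
    return (log_normal(beta, 0.0, eta)
            + np.sum(log_normal(z, beta, 1.0) + log_normal(x, z, 1.0)))

def sample_q(rng, lam, phi):
    """Draw (beta, z) from q(beta|lam) * prod_i q(z_i|phi_i), all Gaussian.

    lam = (mean, std) for beta; phi has one (mean, std) row per z_i.
    """
    beta = rng.normal(lam[0], lam[1])
    z = rng.normal(phi[:, 0], phi[:, 1])
    return beta, z

def log_q(beta, z, lam, phi):
    """Log density of the factorized variational distribution."""
    return (log_normal(beta, lam[0], lam[1])
            + np.sum(log_normal(z, phi[:, 0], phi[:, 1])))

def elbo_estimate(rng, x, lam, phi, num_samples=100):
    """Monte Carlo estimate of E_q[log p(x, z, beta) - log q(beta, z)]."""
    total = 0.0
    for _ in range(num_samples):
        beta, z = sample_q(rng, lam, phi)
        total += log_joint(x, z, beta) - log_q(beta, z, lam, phi)
    return total / num_samples

rng = np.random.default_rng(0)
x = np.array([0.5, 1.5, 1.0])                # made-up observations
lam = np.array([1.0, 0.5])                   # global parameter: (mean, std)
phi = np.column_stack([x, np.full(3, 0.7)])  # local parameters, one row per z_i
elbo = elbo_estimate(rng, x, lam, phi)
```

Only `log_joint` and the sampler and log density of `q` are model- or family-specific here; the estimate itself never uses conditional distributions in closed form.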
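Using the Rao-Blackwellized estimator to compute noisy gradients in this family for this model gives

$$
\begin{aligned}
\hat{\nabla}_\lambda \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(\beta_s|\lambda)\Big(\log p(\beta_s|\eta) - \log q(\beta_s|\lambda) + \sum_{i=1}^{n} \big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s)\big)\Big) \\
\hat{\nabla}_{\phi_i} \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_{\phi_i} \log q(z_{is}|\phi_i)\Big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s) - \log q(z_{is}|\phi_i)\Big).
\end{aligned}
$$

Unfortunately, this estimator requires iterating over every data point to compute noisy realizations of the gradient. We can mitigate this by subsampling observations. If we let $i \sim \mathrm{Unif}(1 \ldots n)$, then we can write down a noisy gradient for the ELBO that does not need to iterate over every observation; this noisy gradient is

$$
\begin{aligned}
\hat{\nabla}_\lambda \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(\beta_s|\lambda)\Big(\log p(\beta_s|\eta) - \log q(\beta_s|\lambda) + n\big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s)\big)\Big) \\
\hat{\nabla}_{\phi_i} \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_{\phi_i} \log q(z_{is}|\phi_i)\Big(n\big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s) - \log q(z_{is}|\phi_i)\big)\Big) \\
\hat{\nabla}_{\phi_j} \mathcal{L} &= 0 \quad \text{for all } j \neq i.
\end{aligned}
$$

The expected value of this estimator with respect to the samples from the variational distribution and the sampled data point is the gradient of the ELBO. This means we can use it to define a stochastic optimization procedure to maximize the ELBO. We can lower the variance of the estimator by introducing control variates. Let

$$
\begin{aligned}
f_\lambda(\beta, z_i) &= \nabla_\lambda \log q(\beta|\lambda)\Big(\log p(\beta|\eta) - \log q(\beta|\lambda) + n\big(\log p(z_i|\beta) + \log p(x_i|z_i, \beta)\big)\Big) \\
h_\lambda(\beta) &= \nabla_\lambda \log q(\beta|\lambda) \\
f_{\phi_i}(\beta, z_i) &= \nabla_{\phi_i} \log q(z_i|\phi_i)\Big(n\big(\log p(z_i|\beta) + \log p(x_i|z_i, \beta) - \log q(z_i|\phi_i)\big)\Big) \\
h_{\phi_i}(z_i) &= \nabla_{\phi_i} \log q(z_i|\phi_i).
\end{aligned}
\tag{A.19}
$$

We can compute the optimal scalings for the control variates, $\hat{a}^*_\lambda$ and $\hat{a}^*_{\phi_i}$, using the $S$ Monte Carlo samples by substituting Eq. A.19 into Eq. 9. This gives the following lower variance noisy gradient that does not need to iterate over all of the observations at each update

$$
\begin{aligned}
\hat{\nabla}_\lambda \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(\beta_s|\lambda)\Big(\log p(\beta_s|\eta) - \log q(\beta_s|\lambda) - \hat{a}^*_\lambda + n\big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s)\big)\Big) \\
\hat{\nabla}_{\phi_i} \mathcal{L} &= \frac{1}{S} \sum_{s=1}^{S} \nabla_{\phi_i} \log q(z_{is}|\phi_i)\Big(-\hat{a}^*_{\phi_i} + n\big(\log p(z_{is}|\beta_s) + \log p(x_i|z_{is}, \beta_s) - \log q(z_{is}|\phi_i)\big)\Big) \\
\hat{\nabla}_{\phi_j} \mathcal{L} &= 0 \quad \text{for all } j \neq i.
\end{aligned}
\tag{A.20}
$$

Gamma parameterization equivalence

The shape $\alpha$ and rate $\beta$ parameterization can be written in terms of the mean $\mu$ and variance $\sigma^2$ of the gamma as

$$
\alpha = \frac{\mu^2}{\sigma^2}, \qquad \beta = \frac{\mu}{\sigma^2}.
\tag{A.21}
$$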
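The Gamma parameterization equivalence in Eq. A.21 can be checked numerically. The sketch below is illustrative only: the function names and the example values (mean 2, variance 0.5) are assumptions, not part of the paper. It converts a mean and variance to shape and rate, then compares the empirical moments of samples against the targets.

```python
import numpy as np

def gamma_mean_var_to_shape_rate(mu, sigma2):
    """Map a Gamma mean and variance to (shape, rate), following Eq. A.21."""
    alpha = mu ** 2 / sigma2
    beta = mu / sigma2
    return alpha, beta

def gamma_shape_rate_to_mean_var(alpha, beta):
    """Inverse map: mean = alpha / beta, variance = alpha / beta^2."""
    return alpha / beta, alpha / beta ** 2

# Example values chosen only for illustration.
alpha, beta = gamma_mean_var_to_shape_rate(2.0, 0.5)
mu_back, sigma2_back = gamma_shape_rate_to_mean_var(alpha, beta)

rng = np.random.default_rng(0)
# NumPy's gamma sampler takes shape and scale, where scale = 1 / rate.
samples = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)
empirical_mean = samples.mean()
empirical_var = samples.var()
```

With these values the conversion gives shape 8 and rate 4, and the empirical mean and variance should land close to 2 and 0.5 up to Monte Carlo error.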
