Explaining Variational Approximations
and the Digamma function, denoted by ψ, is given by ψ(x) = (d/dx) log Γ(x).
Column vectors with entries consisting of subscripted variables are denoted by a boldfaced version of the letter for that variable. Round brackets will be used to denote the entries of column vectors. For example, x = (x_1, ..., x_n) denotes an n × 1 vector with entries x_1, ..., x_n. Scalar functions applied to vectors are evaluated element-wise; for example, exp(x) = (exp(x_1), ..., exp(x_n)).
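Both conventions are easy to check directly in R, the computing environment used later in the article; the two-line snippet below is purely illustrative and is not part of the original exposition.

```r
# Digamma function psi(x) = d/dx log Gamma(x): compare digamma() with a
# numerical derivative of lgamma().
x <- 2.5
c(digamma(x), (lgamma(x + 1e-6) - lgamma(x - 1e-6)) / 2e-6)

# Scalar functions applied to vectors act element-wise, e.g. exp(x).
exp(c(0, 1, 2))
```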
$$p(y;q) \equiv \exp\int q(\theta)\,\log\left\{\frac{p(y,\theta)}{q(\theta)}\right\}d\theta. \qquad (4)$$

Note that the lower bound p(y; q) can also be derived more directly using Jensen's inequality, but the above derivation has the advantage of quantifying the gap between p(y) and p(y; q).

The essence of the density transform variational approach is approximation of the posterior density p(θ|y) by a q(θ) for which p(y; q) is more tractable than p(y). Tractability is achieved by restricting q to a more manageable class of densities, and then maximizing p(y; q) over that class. According to (2), maximization of p(y; q) is equivalent to minimization of the Kullback–Leibler divergence between q and p(·|y).

The most common restrictions for the q density are:

(a) q(θ) factorizes into ∏_{i=1}^M q_i(θ_i), for some partition {θ_1, ..., θ_M} of θ.
(b) q is a member of a parametric family of density functions.

Under product restriction (a), the logarithm of the lower bound may be written as

$$\log p(y;q) = \int \prod_{i=1}^{M} q_i(\theta_i)\left\{\log p(y,\theta) - \sum_{i=1}^{M}\log q_i(\theta_i)\right\}d\theta_1\cdots d\theta_M$$
$$= \int q_1(\theta_1)\left\{\int \log p(y,\theta)\,q_2(\theta_2)\cdots q_M(\theta_M)\,d\theta_2\cdots d\theta_M\right\}d\theta_1 - \int q_1(\theta_1)\log q_1(\theta_1)\,d\theta_1 + \text{terms not involving } q_1.$$

Define the new joint density function p̃(y, θ_1) by

$$\tilde{p}(y,\theta_1) \equiv \frac{\exp\left\{\int \log p(y,\theta)\,q_2(\theta_2)\cdots q_M(\theta_M)\,d\theta_2\cdots d\theta_M\right\}}{\displaystyle\int\!\!\int \exp\left\{\int \log p(y,\theta)\,q_2(\theta_2)\cdots q_M(\theta_M)\,d\theta_2\cdots d\theta_M\right\}d\theta_1\,dy}.$$
Algorithm 1 Iterative scheme for obtaining the optimal densities under product density restriction (a). The updates are based on the solutions given at (5).

Initialize: q_2^*(θ_2), ..., q_M^*(θ_M).

Cycle:

$$q_1^*(\theta_1) \leftarrow \frac{\exp\{E_{-\theta_1}\log p(y,\theta)\}}{\int \exp\{E_{-\theta_1}\log p(y,\theta)\}\,d\theta_1},\quad \ldots,\quad q_M^*(\theta_M) \leftarrow \frac{\exp\{E_{-\theta_M}\log p(y,\theta)\}}{\int \exp\{E_{-\theta_M}\log p(y,\theta)\}\,d\theta_M}$$

until the increase in p(y; q) is negligible.

Then

$$\log p(y;q) = \int q_1(\theta_1)\log\left\{\frac{\tilde{p}(y,\theta_1)}{q_1(\theta_1)}\right\}d\theta_1 + \text{terms not involving } q_1.$$

By Result 1, the optimal q_1 is then

$$q_1^*(\theta_1) = \tilde{p}(\theta_1|y) \equiv \frac{\tilde{p}(y,\theta_1)}{\int \tilde{p}(y,\theta_1)\,d\theta_1} \propto \exp\left\{\int \log p(y,\theta)\,q_2(\theta_2)\cdots q_M(\theta_M)\,d\theta_2\cdots d\theta_M\right\}.$$

Repeating the same argument for maximizing log p(y; q) over each of q_2, ..., q_M leads to the optimal densities satisfying:

$$q_i^*(\theta_i) \propto \exp\{E_{-\theta_i}\log p(y,\theta)\},\quad 1 \le i \le M, \qquad (5)$$

where E_{−θ_i} denotes expectation with respect to the density ∏_{j≠i} q_j(θ_j). The iterative scheme, labeled Algorithm 1, can be used to solve for the q_i^*.

Convexity properties can be used to show that convergence to at least local optima is guaranteed (Boyd and Vandenberghe 2004). If conjugate priors are used, then the q_i^* belong to recognizable density families and the q_i^* updates reduce to updating parameters in the q_i^* family (e.g., Winn and Bishop 2005). Also, in practice it is common to monitor convergence using log{p(y; q)} rather than p(y; q). Sections 2.2.2–2.2.4 provide illustrations.

2.2.1 Connection With Gibbs Sampling

It is easily shown that a valid alternative expression for the q_i^*(θ_i) is

$$q_i^*(\theta_i) \propto \exp\{E_{-\theta_i}\log p(\theta_i|\text{rest})\}, \qquad (6)$$

where

$$\text{rest} \equiv \{y, \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_M\}$$

is the set containing the random vectors in the model, apart from θ_i. The distributions θ_i|rest, 1 ≤ i ≤ M, are known, in the MCMC literature, as the full conditionals. This form of the optimal densities reveals a link with Gibbs sampling (e.g., Casella and George 1992), which involves successive draws from these full conditionals. Indeed, it becomes apparent from the upcoming examples that the product density transform approach leads to tractable solutions in situations where Gibbs sampling is also viable.

The DAG viewpoint of Bayesian models also gives rise to a useful result arising from the notion of Markov blankets. The Markov blanket of a node is the set of children, parents, and co-parents of that node. The result

$$p(\theta_i|\text{rest}) = p(\theta_i|\text{Markov blanket of } \theta_i) \qquad (7)$$

(Pearl 1988) means that determination of the required full conditionals involves localized calculations on the DAG. It follows from this fact and expression (6) that the product density approach involves a series of local operations. In Computer Science, this has become known as variational message passing (Winn and Bishop 2005). See the example in Section 2.2.3 for illustration of (7) and localization of variational updates.

2.2.2 Normal Random Sample

Our first and most detailed illustration of variational approximation involves approximate Bayesian inference for the most familiar of statistical settings: a random sample from a Normal distribution. Specifically, consider

$$X_i|\mu,\sigma^2 \overset{\text{ind.}}{\sim} N(\mu,\sigma^2)$$

with priors

$$\mu \sim N(\mu_\mu, \sigma_\mu^2) \quad\text{and}\quad \sigma^2 \sim \text{IG}(A, B).$$

The product density transform approximation to p(μ, σ²|x) is

$$q(\mu, \sigma^2) = q_\mu(\mu)\,q_{\sigma^2}(\sigma^2). \qquad (8)$$

The optimal densities take the form

$$q_\mu^*(\mu) \propto \exp\big[E_{\sigma^2}\{\log p(\mu|\sigma^2, x)\}\big] \quad\text{and}\quad q_{\sigma^2}^*(\sigma^2) \propto \exp\big[E_\mu\{\log p(\sigma^2|\mu, x)\}\big],$$

where x = (X_1, ..., X_n). Standard manipulations lead to the full conditionals being

$$\mu|\sigma^2, x \sim N\!\left(\frac{n\bar{X}/\sigma^2 + \mu_\mu/\sigma_\mu^2}{n/\sigma^2 + 1/\sigma_\mu^2},\; \frac{1}{n/\sigma^2 + 1/\sigma_\mu^2}\right) \quad\text{and}\quad \sigma^2|\mu, x \sim \text{IG}\!\left(A + \tfrac{n}{2},\; B + \tfrac{1}{2}\|x - \mu 1_n\|^2\right),$$

where X̄ = (X_1 + ⋯ + X_n)/n is the sample mean. The second of these, combined with (6), leads to

$$q_{\sigma^2}^*(\sigma^2) \propto \exp\left(E_\mu\left[-\left(A + \tfrac{n}{2} + 1\right)\log(\sigma^2) - \left\{B + \tfrac{1}{2}\|x - \mu 1_n\|^2\right\}\Big/\sigma^2\right]\right)
\propto (\sigma^2)^{-(A+n/2+1)}\exp\left[-\left\{B + \tfrac{1}{2}E_\mu\|x - \mu 1_n\|^2\right\}\Big/\sigma^2\right].$$
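For this model the updates implied by (5) reduce to a short fixed-point iteration on the parameters of q_μ (a Normal density) and q_{σ²} (an Inverse Gamma density with fixed shape A + n/2). The R sketch below is our own minimal illustration of that scheme, written directly from the full conditionals above with E_q(1/σ²) = (A + n/2)/B_{q(σ²)}; the simulated data, hyperparameter values, and variable names (mu.q, B.q, and so on) are assumptions for the sketch, not reproduced from the article.

```r
# Minimal sketch of the product density (mean field) updates for the
# Normal random sample model: X_i | mu, sigma2 ~ N(mu, sigma2),
# mu ~ N(mu.mu, sigma2.mu), sigma2 ~ IG(A, B).
set.seed(1)
x <- rnorm(100, mean = 3, sd = 2)
n <- length(x)

# Hyperparameters (vague choices, for illustration only).
mu.mu <- 0; sigma2.mu <- 1e8; A <- 0.01; B <- 0.01

# q(mu) = N(mu.q, sigma2.q) and q(sigma2) = IG(A + n/2, B.q).
B.q <- B + 1   # any positive starting value
for (iter in 1:100) {   # fixed iteration count for brevity; in practice
                        # monitor log p(y; q) as in the article
  recip.sigma2 <- (A + n/2) / B.q                      # E_q(1/sigma2)
  sigma2.q <- 1 / (n * recip.sigma2 + 1 / sigma2.mu)   # variance of q(mu)
  mu.q <- sigma2.q * (recip.sigma2 * sum(x) + mu.mu / sigma2.mu)
  B.q <- B + 0.5 * (sum((x - mu.q)^2) + n * sigma2.q)  # rate of q(sigma2)
}

# Approximate posteriors: mu | x ~ N(mu.q, sigma2.q),
#                         sigma2 | x ~ IG(A + n/2, B.q).
c(mu.q = mu.q, sd.q = sqrt(sigma2.q), E.sigma2 = B.q / (A + n/2 - 1))
```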
Figure 2. Results from applying the product density variational approximation to a simulated Normal random sample. The exact posterior
density functions are added for comparison. The vertical dotted line in the posterior density plots corresponds to the true value of the parameter.
Application of (5) leads to the optimal densities taking the form

q*_{β,u}(β, u) is a Multivariate Normal density function,
q*_{σ²} is a product of r + 1 Inverse Gamma density functions.

It should be stressed that these forms are not imposed at the outset, but arise as optimal solutions for model (11)–(13) and product restriction (14). Moreover, the factorization of q*_{σ²} into r + 1 separate components is also a consequence of (5) for the current model, rather than an imposition. Bishop (2006, sec. 10.2.5) explained how these induced factorizations follow from the structure of the DAG and d-separation theory (Pearl 1988). This example also benefits from the Markov blanket result (7) described in Section 2.2.1 and Figure 3. For example, the full conditional density of σ²_{u1} is

$$p(\sigma^2_{u1}|\text{rest}) = p(\sigma^2_{u1}|\text{Markov blanket of }\sigma^2_{u1}) = p(\sigma^2_{u1}|u, \sigma^2_{u2}, \ldots, \sigma^2_{ur}).$$

Hence, determination of q*_{σ²_{u1}} requires calculations involving only the subset of the DAG consisting of u and the variance parameters.

Let μ_{q(β,u)} and Σ_{q(β,u)} be the mean and covariance matrix for the q*_{β,u} density and set C ≡ [X Z]. For the q*_{σ²} density the shape parameters for the r + 1 components can be shown to be deterministic: A_{u1} + ½K_1, ..., A_{ur} + ½K_r, A_ε + ½n. Let B_{q(σ²_{u1})}, ..., B_{q(σ²_{ur})}, B_{q(σ²_ε)} be the accompanying rate parameters. The relationships between (μ_{q(β,u)}, Σ_{q(β,u)}) and (B_{q(σ²_{u1})}, ..., B_{q(σ²_{ur})}, B_{q(σ²_ε)}) enforced by (5) lead to the iterative scheme in Algorithm 3.

In this case log p(y; q) takes the form

$$\log p(y;q) = \tfrac{1}{2}\Big(p + \sum_{\ell=1}^{r}K_\ell\Big) - \tfrac{n}{2}\log(2\pi) - \tfrac{p}{2}\log(\sigma_\beta^2) + \tfrac{1}{2}\log|\Sigma_{q(\beta,u)}| - \frac{1}{2\sigma_\beta^2}\big\{\|\mu_{q(\beta)}\|^2 + \text{tr}(\Sigma_{q(\beta)})\big\}$$
$$\quad + A_\varepsilon\log(B_\varepsilon) - \big(A_\varepsilon + \tfrac{n}{2}\big)\log B_{q(\sigma_\varepsilon^2)} + \log\Gamma\big(A_\varepsilon + \tfrac{n}{2}\big) - \log\Gamma(A_\varepsilon)$$
$$\quad + \sum_{\ell=1}^{r}\Big\{A_{u\ell}\log(B_{u\ell}) - \big(A_{u\ell} + \tfrac{K_\ell}{2}\big)\log B_{q(\sigma^2_{u\ell})} + \log\Gamma\big(A_{u\ell} + \tfrac{K_\ell}{2}\big) - \log\Gamma(A_{u\ell})\Big\}.$$

Note that, within each iteration of Algorithm 3, this expression applies only after each of the parameter updates has been made.

Upon convergence to μ*_{q(β,u)}, Σ*_{q(β,u)}, B*_{q(σ²_{u1})}, ..., B*_{q(σ²_{ur})} and B*_{q(σ²_ε)} the approximate posteriors are the N(μ*_{q(β,u)}, Σ*_{q(β,u)}) and IG(A_{uℓ} + ½K_ℓ, B*_{q(σ²_{uℓ})}), 1 ≤ ℓ ≤ r, density functions together with the IG(A_ε + ½n, B*_{q(σ²_ε)}) density function.

We now provide an illustration for Bayesian analysis of a dataset involving longitudinal orthodontic measurements on 27 children (source: Pinheiro and Bates 2000). The data are available in the R computing environment (R Development Core Team 2010) via the package nlme (Pinheiro et al. 2009), in the object Orthodont. We entertained the random intercept model

$$\text{distance}_{ij}|U_i \overset{\text{ind.}}{\sim} N(\beta_0 + U_i + \beta_1\,\text{age}_{ij} + \beta_2\,\text{male}_i,\; \sigma_\varepsilon^2),$$
$$U_i|\sigma_u^2 \overset{\text{ind.}}{\sim} N(0, \sigma_u^2),\quad 1 \le i \le 27,\ 1 \le j \le 4,$$
$$\beta_i \overset{\text{ind.}}{\sim} N(0, \sigma_\beta^2),\quad \sigma_u^2, \sigma_\varepsilon^2 \overset{\text{ind.}}{\sim} \text{IG}(A, B), \qquad (15)$$

where distance_ij is the distance from the pituitary to the pterygomaxillary fissure (mm) for patient i at time point j. Similarly, age_ij corresponds to the longitudinal age values in years and male_i is an indicator of the ith child being male. This fits into framework (11)–(12) with y containing the distance_ij measurements, X = [1, age_ij, male_i], and Z = I_27 ⊗ 1_4 is an indicator matrix for the random intercepts. We used the vague priors σ_β² = 10⁸, A = B = 1/100 and used standardized versions of the distance and age data during the fitting. The results were then converted back to the original units. For comparison, we obtained 1 million samples from the posteriors using MCMC (with a burn-in of length 5000) and, from these, constructed kernel density estimate approximations to the posteriors. For such a high Monte Carlo sample size we would expect these MCMC-based approximations to be very accurate.

Figure 4 shows the progressive values of log p(y; q) and the approximate posterior densities obtained from applying Algorithm 3. Once again, convergence of log{p(y; q)} to a maximum is seen to be quite rapid. The variational approximate posterior densities are quite close to those obtained via MCMC, and indicate statistical significance of all model parameters in the sense that most of the posterior probability mass is away from zero.

2.2.4 Probit Regression and the Use of Auxiliary Variables

As shown by Albert and Chib (1993), Gibbs sampling for the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. The same trick applies to product density variational approximation (Girolami and Rogers 2006; Consonni and Marin 2007), as we now show.

The Bayesian probit regression model that we consider here is

$$Y_i|\beta_0,\ldots,\beta_k \overset{\text{ind.}}{\sim} \text{Bernoulli}\big(\Phi(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki})\big),\quad 1 \le i \le n,$$

where the prior distribution on the coefficient vector β = (β_0, ..., β_k) takes the form β ∼ N(μ_β, Σ_β). Letting X ≡ [1  x_{1i} ⋯ x_{ki}]_{1≤i≤n}, the likelihood can be written compactly as

$$p(y|\beta) = \Phi(X\beta)^{y}\{1_n - \Phi(X\beta)\}^{1_n - y},\quad \beta \sim N(\mu_\beta, \Sigma_\beta).$$

Introduce the vector of auxiliary variables a = (a_1, ..., a_n), where

$$a_i|\beta \overset{\text{ind.}}{\sim} N((X\beta)_i, 1).$$

This allows us to write

$$p(y_i|a_i) = I(a_i \ge 0)^{y_i}\,I(a_i < 0)^{1 - y_i},\quad 1 \le i \le n.$$

In graphical model terms we are introducing a new node to the graph, as conveyed by Figure 5. Expansion of the parameter set from {β} to {a, β} is the key to achieving a tractable solution. Consider the product restriction

$$q(a, \beta) = q_a(a)\,q_\beta(\beta).$$
Figure 4. Approximate posterior densities from applying the product density variational approximation to (11)–(13) for the orthodontic data.
‘Exact’ posterior densities, based on kernel density estimates of 1 million MCMC samples, are shown for comparison.
Then application of (5) leads to

$$q_a^*(a) = \prod_{i=1}^{n}\left[\frac{I(a_i \ge 0)^{y_i}\,I(a_i < 0)^{1-y_i}}{\Phi((X\mu_{q(\beta)})_i)^{y_i}\{1 - \Phi((X\mu_{q(\beta)})_i)\}^{1-y_i}}\right] \times (2\pi)^{-n/2}\exp\left\{-\tfrac{1}{2}\|a - X\mu_{q(\beta)}\|^2\right\}$$

and q*_β(β) is the N(μ_{q(β)}, (X^T X + Σ_β^{-1})^{-1}) density function. These optimal densities are specified up to the parameter vector μ_{q(β)} ≡ E_β(β). We also need to work with the q-density mean of the auxiliary variable vector μ_{q(a)} ≡ E_a(a). The iterative scheme, Algorithm 4, emerges.

The log p(y; q) expression in this case is

$$\log p(y;q) = y^T\log\Phi(X\mu_{q(\beta)}) + (1_n - y)^T\log\{1_n - \Phi(X\mu_{q(\beta)})\} - \tfrac{1}{2}(\mu_{q(\beta)} - \mu_\beta)^T\Sigma_\beta^{-1}(\mu_{q(\beta)} - \mu_\beta) - \tfrac{1}{2}\log|\Sigma_\beta X^T X + I|.$$
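Because q*_a is a product of truncated Normal densities and q*_β is Multivariate Normal, the iterative scheme amounts to alternating two closed-form moment updates: μ_{q(a)} is the vector of truncated Normal means implied by the q*_a expression above, and μ_{q(β)} follows from the Normal form of q*_β. The R sketch below is our own rendering of those updates and is not a transcription of Algorithm 4; the simulated data, prior settings, and names such as mu.q.beta and eta are assumptions made for illustration.

```r
# Sketch of mean field updates for Bayesian probit regression with
# auxiliary variables a_i | beta ~ N((X beta)_i, 1), on simulated data.
set.seed(1)
n <- 200; X <- cbind(1, rnorm(n)); beta.true <- c(-0.3, 0.9)
y <- rbinom(n, 1, pnorm(X %*% beta.true))

mu.beta <- rep(0, 2); Sigma.beta <- diag(1e2, 2)          # prior beta ~ N(mu.beta, Sigma.beta)
Sigma.q.beta <- solve(crossprod(X) + solve(Sigma.beta))    # covariance of q(beta); does not change
mu.q.beta <- rep(0, 2)

for (iter in 1:50) {
  eta <- drop(X %*% mu.q.beta)
  # Means of the truncated Normal components of q(a):
  #   E(a_i) = eta_i + dnorm(eta_i)/pnorm(eta_i)         if y_i = 1,
  #   E(a_i) = eta_i - dnorm(eta_i)/{1 - pnorm(eta_i)}   if y_i = 0.
  mu.q.a <- eta + ifelse(y == 1,
                         dnorm(eta) / pnorm(eta),
                         -dnorm(eta) / (1 - pnorm(eta)))
  # Mean of the Multivariate Normal q(beta).
  mu.q.beta <- drop(Sigma.q.beta %*% (crossprod(X, mu.q.a) + solve(Sigma.beta, mu.beta)))
}

cbind(truth = beta.true, variational.mean = mu.q.beta,
      variational.sd = sqrt(diag(Sigma.q.beta)))
```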
Algorithm 5 Iterative scheme for obtaining the parameters in the optimal densities q*_w, q*_μ, and q*_{σ²} in the finite Normal mixtures example.

Initialize: μ_{q(μ_k)} ∈ R and α_{q(w_k)}, σ²_{q(μ_k)}, A_{q(σ²_k)}, B_{q(σ²_k)}, ω_{•k} > 0, 1 ≤ k ≤ K, such that ∑_{k=1}^K ω_{•k} = 1.

Cycle:

For i = 1, ..., n and k = 1, ..., K:

$$\nu_{ik} \leftarrow \psi(\alpha_{q(w_k)}) + \tfrac{1}{2}\psi(A_{q(\sigma^2_k)}) - \tfrac{1}{2}\log B_{q(\sigma^2_k)} - \tfrac{1}{2}A_{q(\sigma^2_k)}\big\{(X_i - \mu_{q(\mu_k)})^2 + \sigma^2_{q(\mu_k)}\big\}\big/B_{q(\sigma^2_k)}.$$

For i = 1, ..., n and k = 1, ..., K:  ω_{ik} ← exp(ν_{ik}) / ∑_{k'=1}^{K} exp(ν_{ik'}).

For k = 1, ..., K:

$$\omega_{\bullet k} \leftarrow \sum_{i=1}^{n}\omega_{ik};\qquad \sigma^2_{q(\mu_k)} \leftarrow 1\Big/\big\{1/\sigma^2_{\mu_k} + A_{q(\sigma^2_k)}\,\omega_{\bullet k}/B_{q(\sigma^2_k)}\big\},$$
$$\mu_{q(\mu_k)} \leftarrow \sigma^2_{q(\mu_k)}\Big\{\mu_{\mu_k}/\sigma^2_{\mu_k} + A_{q(\sigma^2_k)}\sum_{i=1}^{n}\omega_{ik}X_i\Big/B_{q(\sigma^2_k)}\Big\},$$
$$\alpha_{q(w_k)} \leftarrow \alpha + \omega_{\bullet k};\qquad A_{q(\sigma^2_k)} \leftarrow A_k + \tfrac{1}{2}\omega_{\bullet k},$$
$$B_{q(\sigma^2_k)} \leftarrow B_k + \tfrac{1}{2}\sum_{i=1}^{n}\omega_{ik}\big\{(X_i - \mu_{q(\mu_k)})^2 + \sigma^2_{q(\mu_k)}\big\}$$

until the increase in p(x; q) is negligible.
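The R sketch below is a direct, unoptimized transcription of these updates for simulated data; the data, hyperparameter values, variable names, and the fixed iteration count are our own illustrative choices rather than the article's settings. Each pass updates the responsibilities ω_{ik} and then the parameters of q_{μ_k}, q_{w_k}, and q_{σ²_k}.

```r
# Sketch of Algorithm 5 for a K-component Normal mixture (simulated data).
set.seed(1)
x <- c(rnorm(150, -2, 0.8), rnorm(100, 2, 1.2)); n <- length(x); K <- 2

# Hyperparameters: w ~ Dirichlet(alpha, ..., alpha), mu_k ~ N(mu.mu[k], sigma2.mu[k]),
# sigma2_k ~ IG(A[k], B[k]).  Vague choices, assumed for illustration.
alpha <- 1; mu.mu <- rep(0, K); sigma2.mu <- rep(1e4, K)
A <- rep(0.01, K); B <- rep(0.01, K)

# Initialize the q-density parameters.
mu.q <- quantile(x, probs = (1:K)/(K + 1)); sigma2.q <- rep(1, K)
alpha.q <- rep(alpha + n/K, K); A.q <- A + n/(2*K); B.q <- rep(1, K)

for (iter in 1:200) {
  # Responsibilities omega[i, k] (row-wise maximum subtracted for numerical stability).
  nu <- sapply(1:K, function(k)
    digamma(alpha.q[k]) + 0.5*digamma(A.q[k]) - 0.5*log(B.q[k]) -
      0.5*(A.q[k]/B.q[k])*((x - mu.q[k])^2 + sigma2.q[k]))
  omega <- exp(nu - apply(nu, 1, max))
  omega <- omega / rowSums(omega)
  omega.dot <- colSums(omega)

  # Parameter updates, component by component.
  sigma2.q <- 1/(1/sigma2.mu + (A.q/B.q)*omega.dot)
  mu.q <- sigma2.q*(mu.mu/sigma2.mu + (A.q/B.q)*colSums(omega*x))
  alpha.q <- alpha + omega.dot
  A.q <- A + 0.5*omega.dot
  B.q <- B + 0.5*sapply(1:K, function(k)
    sum(omega[, k]*((x - mu.q[k])^2 + sigma2.q[k])))
}

# Posterior summaries implied by the optimal q-densities.
rbind(weight = alpha.q/sum(alpha.q), mean = mu.q, var = B.q/(A.q - 1))
```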
Figure 7. Variational representation of the logarithmic function. Left axes: Members of family of functions f (x, ξ ) ≡ ξ x − log(ξ ) − 1 versus
ξ > 0, for x ∈ {0.25, 0.5, 1, 2, 4}, shown as gray curves. Right axes: For each x, the minimum of f (x, ξ ) over ξ corresponds to log(x). In the x
direction the f (x, ξ ) are linear and are shown in gray.
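As the caption above indicates, the logarithm admits the variational representation log(x) = min_{ξ>0} {ξx − log(ξ) − 1}, with the minimum attained at ξ = 1/x. A short R check of this identity (ours, purely illustrative):

```r
# Check log(x) = min over xi > 0 of { xi*x - log(xi) - 1 }; the minimizer is xi = 1/x.
f <- function(xi, x) xi * x - log(xi) - 1
x.grid <- c(0.25, 0.5, 1, 2, 4)
rbind(log.x = log(x.grid),
      min.f = sapply(x.grid, function(x) optimize(f, c(1e-8, 1e3), x = x)$objective))
```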
where

$$p(y,\beta) = \exp\Big\{y^T X\beta - 1_n^T\log\{1_n + \exp(X\beta)\} - \tfrac{1}{2}(\beta - \mu_\beta)^T\Sigma_\beta^{-1}(\beta - \mu_\beta) - \tfrac{k+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_\beta|\Big\}. \qquad (21)$$

Once again, we are stuck with a multivariate intractable integral in the normalizing factor. We get around this by noting the following representation of −log(1 + e^x) as the maxima of a family of parabolas:

$$-\log(1 + e^{x}) = \max_{\xi\in\mathbb{R}}\big\{A(\xi)x^2 - \tfrac{1}{2}x + C(\xi)\big\}\quad\text{for all } x\in\mathbb{R}, \qquad (22)$$

where

$$A(\xi) \equiv -\tanh(\xi/2)/(4\xi)\quad\text{and}\quad C(\xi) \equiv \xi/2 - \log(1 + e^{\xi}) + \xi\tanh(\xi/2)/4.$$

While the genesis of (22) may be found in the article by Jaakkola and Jordan (2000), it is easily checked via elementary calculus methods. It follows from (22) that

$$-1_n^T\log\{1_n + \exp(X\beta)\} \ge 1_n^T\big\{A(\xi)(X\beta)^2 - \tfrac{1}{2}X\beta + C(\xi)\big\} = \beta^T X^T\,\text{diag}\{A(\xi)\}X\beta - \tfrac{1}{2}1_n^T X\beta + 1_n^T C(\xi), \qquad (23)$$

where ξ = (ξ_1, ..., ξ_n) is an n × 1 vector of variational parameters. This gives us the following lower bound on p(y, β):

$$p(y,\beta;\xi) = \exp\Big[-\tfrac{1}{2}\beta^T\big\{\Sigma_\beta^{-1} - 2X^T\,\text{diag}\{A(\xi)\}X\big\}\beta + \big\{(y - \tfrac{1}{2}1_n)^T X + \mu_\beta^T\Sigma_\beta^{-1}\big\}\beta - \tfrac{1}{2}\mu_\beta^T\Sigma_\beta^{-1}\mu_\beta + 1_n^T C(\xi) - \tfrac{k+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_\beta|\Big],$$

which is proportional to a Multivariate Normal density in β. Upon normalization we obtain the following family of variational approximations to β|y:

$$\beta|y;\xi \sim N(\mu(\xi), \Sigma(\xi)), \qquad (24)$$

where

$$\Sigma(\xi) \equiv \big\{\Sigma_\beta^{-1} - 2X^T\,\text{diag}\{A(\xi)\}X\big\}^{-1}\quad\text{and}\quad \mu(\xi) \equiv \Sigma(\xi)\big\{X^T(y - \tfrac{1}{2}1_n) + \Sigma_\beta^{-1}\mu_\beta\big\}.$$

We are left with the problem of determining the vector of variational parameters ξ ∈ R^n. A natural way of choosing these is to make

$$p(y;\xi) \equiv \int p(y,\beta;\xi)\,d\beta$$

as close as possible to p(y). Since p(y; ξ) ≤ p(y) for all ξ, this reduces to the problem of maximizing p(y; ξ) over ξ. Note that this lower bound on log p(y) has explicit expression:

$$\log p(y;\xi) = \tfrac{1}{2}\log|\Sigma(\xi)| - \tfrac{1}{2}\log|\Sigma_\beta| + \tfrac{1}{2}\mu(\xi)^T\Sigma(\xi)^{-1}\mu(\xi) - \tfrac{1}{2}\mu_\beta^T\Sigma_\beta^{-1}\mu_\beta + \sum_{i=1}^{n}\big\{\xi_i/2 - \log(1 + e^{\xi_i}) + (\xi_i/4)\tanh(\xi_i/2)\big\}.$$

Even though this can be maximized numerically in a similar fashion to (19), Jaakkola and Jordan (2000) derived a simpler algorithm based on the notion of Expectation Maximization (EM) (e.g., McLachlan and Krishnan 1997) with β playing the role of a set of latent variables. Treating y, β as the set of ‘complete data,’ the E-step of their EM algorithm involves

$$Q(\xi_{\text{new}}|\xi) \equiv E_{\beta|y;\xi}\{\log p(y,\beta;\xi_{\text{new}})\},$$

where p(y, β; ξ) is interpreted as the variational lower bound on the ‘complete data likelihood.’ This results in the explicit expression

$$Q(\xi_{\text{new}}|\xi) = \text{tr}\big[X^T\,\text{diag}\{A(\xi_{\text{new}})\}X\{\Sigma(\xi) + \mu(\xi)\mu(\xi)^T\}\big] + 1_n^T C(\xi_{\text{new}}) + \text{terms not involving } \xi_{\text{new}}.$$

Differentiating with respect to ξ_new and using the fact that A(ξ) is monotonically increasing over ξ > 0, the M-step can be shown to have the exact solution

$$(\xi_{\text{new}})^2 = \text{diagonal}\big[X\{\Sigma(\xi) + \mu(\xi)\mu(\xi)^T\}X^T\big]. \qquad (25)$$

Taking positive square roots on both sides of (25) leads to Algorithm 6. Convergence of Algorithm 6 is monotone and usually quite rapid (Jaakkola and Jordan 2000).

Algorithm 6 Iterative scheme for obtaining the optimal model and variational parameters in the Bayesian logistic regression example.

Initialize: ξ (n × 1; all entries positive).

Cycle:

$$\Sigma(\xi) \leftarrow \big\{\Sigma_\beta^{-1} - 2X^T\,\text{diag}\{A(\xi)\}X\big\}^{-1},\qquad \mu(\xi) \leftarrow \Sigma(\xi)\big\{X^T(y - \tfrac{1}{2}1_n) + \Sigma_\beta^{-1}\mu_\beta\big\},$$
$$\xi \leftarrow \sqrt{\text{diagonal}\big[X\{\Sigma(\xi) + \mu(\xi)\mu(\xi)^T\}X^T\big]}$$

until the increase in p(y; ξ) is negligible.
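Algorithm 6 is straightforward to code because every update is a closed-form matrix expression. The R sketch below is our own rendering of it on simulated data; the simulated design, prior settings, fixed iteration count, and variable names are assumptions made for illustration only.

```r
# Sketch of Algorithm 6: variational Bayesian logistic regression
# via the tangent-type bound (22), on simulated data with a vague Normal prior.
set.seed(1)
n <- 500; X <- cbind(1, rnorm(n), rnorm(n)); beta.true <- c(-0.5, 1, -1.5)
y <- rbinom(n, 1, 1/(1 + exp(-X %*% beta.true)))

mu.beta <- rep(0, 3); Sigma.beta <- diag(1e2, 3)
A.fun <- function(xi) -tanh(xi/2)/(4*xi)     # A(xi) from (22)

xi <- rep(1, n)                              # initialize variational parameters
for (iter in 1:100) {
  Sigma.xi <- solve(solve(Sigma.beta) - 2*crossprod(X, A.fun(xi)*X))
  mu.xi <- drop(Sigma.xi %*% (crossprod(X, y - 0.5) + solve(Sigma.beta, mu.beta)))
  xi <- sqrt(diag(X %*% (Sigma.xi + tcrossprod(mu.xi)) %*% t(X)))
}

# Approximate posterior for beta is N(mu.xi, Sigma.xi).
cbind(truth = beta.true, variational.mean = mu.xi,
      variational.sd = sqrt(diag(Sigma.xi)))
```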
4. FREQUENTIST INFERENCE

Up until now, we have only dealt with approximate inference in Bayesian models via variational methods. In this section we point out that variational approximations can be used in frequentist contexts. However, use of variational approximations for frequentist inferential problems is much rarer. Frequentist
Variational approximations have the potential to become an important player in statistical inference. New variational approximation methods are continually being developed. The recent emergence of formal software for variational inference is certain to accelerate its widespread use. The usefulness of variational approximations increases as the size of the problem increases and Monte Carlo methods such as MCMC start to become untenable.

[Received March 2009. Revised April 2010.]

REFERENCES

Albert, J. H., and Chib, S. (1993), “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669–679. [146]

Archambeau, C., Cornford, D., Opper, M., and Shawe-Taylor, J. (2007), “Gaussian Process Approximations of Stochastic Differential Equations,” Journal of Machine Learning Research: Workshop and Conference Proceedings, 1, 1–16. [149]

Barber, D., and Bishop, C. M. (1998), “Ensemble Learning for Multi-Layer Networks,” in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, K. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press, pp. 395–401. [149]

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, New York: Springer. [140,145,148]

Boyd, S., and Vandenberghe, L. (2004), Convex Optimization, Cambridge: Cambridge University Press. [143]

Casella, G., and George, E. I. (1992), “Explaining the Gibbs Sampler,” The American Statistician, 46, 167–174. [143]

Consonni, G., and Marin, J.-M. (2007), “Mean-Field Variational Approximate Bayesian Inference for Latent Variable Models,” Computational Statistics and Data Analysis, 52, 790–798. [146]

Girolami, M., and Rogers, S. (2006), “Variational Bayesian Multinomial Probit Regression,” Neural Computation, 18, 1790–1817. [146]

Hall, P., Humphreys, K., and Titterington, D. M. (2002), “On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models With Missing Values,” Journal of the Royal Statistical Society, Ser. B, 64, 549–564. [140]

Hall, P., Ormerod, J. T., and Wand, M. P. (2010), “Theory of Gaussian Variational Approximation for a Poisson Linear Mixed Model,” Statistica Sinica, to appear. [152]

Honkela, A., and Valpola, H. (2005), “Unsupervised Variational Bayesian Learning of Nonlinear Models,” in Advances in Neural Information Processing Systems 17, eds. L. K. Saul, Y. Weiss, and L. Bottou, Cambridge, MA: MIT Press, pp. 593–600. [149]

Jaakkola, T. S., and Jordan, M. I. (2000), “Bayesian Parameter Estimation via Variational Methods,” Statistics and Computing, 10, 25–37. [150,151]

Jordan, M. I. (2004), “Graphical Models,” Statistical Science, 19, 140–155. [140,152]

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999), “An Introduction to Variational Methods for Graphical Models,” Machine Learning, 37, 183–233. [140,150]

Kass, R. E., and Raftery, A. E. (1995), “Bayes Factors and Model Uncertainty,” Journal of the American Statistical Association, 90, 773–795. [142]

Kullback, S., and Leibler, R. A. (1951), “On Information and Sufficiency,” The Annals of Mathematical Statistics, 22, 79–86. [142]

McCulloch, C. E., Searle, S. R., and Neuhaus, J. M. (2008), Generalized, Linear, and Mixed Models (2nd ed.), New York: Wiley. [144]

McGrory, C. A., and Titterington, D. M. (2007), “Variational Approximations in Bayesian Model Selection for Finite Mixture Distributions,” Computational Statistics and Data Analysis, 51, 5352–5367. [140,148]

McGrory, C. A., Titterington, D. M., Reeves, R., and Pettitt, A. N. (2009), “Variational Bayes for Estimating the Parameters of a Hidden Potts Model,” Statistics and Computing, 19, 329–340. [140]

McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley-Interscience. [151]

Minka, T., Winn, J., Guiver, G., and Kannan, A. (2009), Infer.Net 2.3, Cambridge, U.K.: Microsoft Research Cambridge. [140]

Ormerod, J. T. (2008), “On Semiparametric Regression and Data Mining,” Ph.D. thesis, School of Mathematics and Statistics, The University of New South Wales, Sydney, Australia. [150]

Parisi, G. (1988), Statistical Field Theory, Redwood City, CA: Addison-Wesley. [142]

Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann. [143,145]

Pinheiro, J. C., and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, New York: Springer. [146]

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and the R Core Team (2009), “nlme: Linear and Nonlinear Mixed Effects Models,” R package version 3.1-93. [146]

R Development Core Team (2010), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. Available at http://www.R-project.org. [146]

Rockafellar, R. (1972), Convex Analysis, Princeton: Princeton University Press. [150]

Seeger, M. (2000), “Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers,” in Advances in Neural Information Processing Systems 12, eds. S. A. Solla, T. K. Leen, and K.-R. Müller, Cambridge, MA: MIT Press, pp. 603–609. [149]

Seeger, M. (2004), “Gaussian Processes for Machine Learning,” International Journal of Neural Systems, 14, 69–106. [149]

Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., and Caldas, C. (2005), “A Variational Bayesian Mixture Modelling Framework for Cluster Analysis of Gene-Expression Data,” Bioinformatics, 21, 3025–3033. [140]

Titterington, D. M. (2004), “Bayesian Methods for Neural Networks and Related Models,” Statistical Science, 19, 128–139. [140,142,152]

Venables, W. N., and Ripley, B. D. (2009), “MASS: Functions and Datasets to Support Venables and Ripley, ‘Modern Applied Statistics With S’ (4th ed.),” R package version 7.2-48. [148]

Wang, B., and Titterington, D. M. (2006), “Convergence Properties of a General Algorithm for Calculating Variational Bayesian Estimates for a Normal Mixture Model,” Bayesian Analysis, 1, 625–650. [140]

Winn, J., and Bishop, C. M. (2005), “Variational Message Passing,” Journal of Machine Learning Research, 6, 661–694. [143]