
Explaining Variational Approximations

J. T. ORMEROD and M. P. WAND


Variational approximations facilitate approximate inference for the parameters in complex statistical models and provide fast, deterministic alternatives to Monte Carlo methods. However, much of the contemporary literature on variational approximations is in Computer Science rather than Statistics, and uses terminology, notation, and examples from the former field. In this article we explain variational approximation in statistical terms. In particular, we illustrate the ideas of variational approximation using examples that are familiar to statisticians.

KEY WORDS: Bayesian inference; Bayesian networks; Directed acyclic graphs; Generalized linear mixed models; Kullback–Leibler divergence; Linear mixed models.

J. T. Ormerod is Lecturer in Statistics, School of Mathematics and Statistics, University of Sydney, Sydney 2006, Australia. M. P. Wand is Research Professor in Statistics, Centre for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong 2522, Australia (E-mail: [email protected]). The authors are grateful to the editor, an associate editor, and two referees for their suggestions for improvement. They also thank Christel Faes, Sarah Neville, Agostino Nobile, Simone Padoan, Doug Simpson, Mike Titterington, and Shen Wang for helpful comments. This research was partially supported by Australian Research Council Discovery project DP0877055.

1. INTRODUCTION

Variational approximations is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. It is now part of mainstream Computer Science methodology, where it enjoys use in elaborate problems such as speech recognition, document retrieval, and genetic linkage analysis (Jordan 2004). Summaries of contemporary variational approximations can be found in the works of Jordan et al. (1999), Jordan (2004), Titterington (2004), and Bishop (2006, chap. 10). In 2008, a variational approximation-based software package named Infer.NET (Minka et al. 2009) emerged with claims of being able to handle a wide variety of statistical problems.

The name 'variational approximations' has its roots in the mathematical topic known as variational calculus. Variational calculus is concerned with the problem of optimizing a functional over a class of functions on which that functional depends. Approximate solutions arise when the class of functions is restricted in some way—usually to enhance tractability.

Despite their statistical overtones, variational approximations are not widely known within the statistical community. In particular, they are overshadowed by Monte Carlo methods, especially Markov chain Monte Carlo (MCMC), for performing approximate inference, as well as Laplace approximation methods. Variational approximations are a much faster alternative to MCMC, especially for large models, and are a richer class of methods than the Laplace variety. They are, however, limited in their approximation accuracy—as opposed to MCMC, which can be made arbitrarily accurate through increases in the Monte Carlo sample sizes. In the interest of brevity, we will not discuss the quality of variational approximations in any detail. Jordan (2004) and Titterington (2004) pointed to some relevant literature on variational approximation accuracy.

In the statistics literature, variational approximations are beginning to have a presence. Examples include the articles by Teschendorff et al. (2005), McGrory and Titterington (2007), and McGrory et al. (2009) on new variational approximation methodology for particular applications, and Hall, Humphreys, and Titterington (2002) and Wang and Titterington (2006) on the statistical properties of estimators obtained via variational approximation.

In this article we explain variational approximation in terms that are familiar to a statistical readership. Most of our exposition involves working through several illustrative examples, starting with what is perhaps the most basic: inference from a Normal random sample. Other contexts that are seen to benefit from variational approximation include Bayesian generalized linear models, Bayesian linear mixed models, and non-Bayesian generalized linear mixed models. It is anticipated that a statistically literate reader who works through all of the examples will have gained a good understanding of variational approximations.

Variational approximations can be useful for both likelihood-based and Bayesian inference. However, their use in the literature is greater for Bayesian inference, where intractable calculus problems abound. Hence, most of our description of variational approximations is for Bayesian inference. It is also worth noting that situations in which variational approximations are useful closely correspond to situations where MCMC is useful.

Section 2 explains the most common variant of variational approximation, which we call the density transform approach. A different type, the tangent transform approach, is explained in Section 3. Sections 2 and 3 focus exclusively on Bayesian inference. In Section 4 we point out that the same ideas transfer to frequentist contexts. Some concluding remarks are made in Section 5.

1.1 Definitions

Integrals without limits or subscripts are assumed to be over the entire space of the integrand argument. If P is a logical condition, then I(P) = 1 if P is true and I(P) = 0 if P is false. We use Φ and φ to denote the standard normal distribution function and density function, respectively. The Gamma function, denoted by Γ, is given by Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du and the Digamma function, denoted by ψ, is given by ψ(x) = (d/dx) log Γ(x).
Column vectors with entries consisting of subscripted variables are denoted by a boldfaced version of the letter for that variable. Round brackets will be used to denote the entries of column vectors. For example, x = (x_1, ..., x_n) denotes an n × 1 vector with entries x_1, ..., x_n. Scalar functions applied to vectors are evaluated element-wise. For example, exp(a_1, a_2, a_3) ≡ (exp(a_1), exp(a_2), exp(a_3)). Similarly, (a_1, a_2, a_3)^{(b_1, b_2, b_3)} ≡ (a_1^{b_1}, a_2^{b_2}, a_3^{b_3}). The element-wise product of two matrices A and B is denoted by A ⊙ B. We use 1_d to denote the d × 1 column vector with all entries equal to 1. The norm of a column vector v, defined to be √(vᵀv), is denoted by ‖v‖. For a d × 1 vector a, we let diag(a) denote the d × d diagonal matrix containing the entries of a along the main diagonal. For a d × d square matrix A, we let diagonal(A) denote the d × 1 vector containing the diagonal entries of A. For square matrices A_1, ..., A_r, we let blockdiag(A_1, ..., A_r) denote the block diagonal matrix with ith block equal to A_i.

The density function of a random vector u is denoted by p(u). The conditional density of u given v is denoted by p(u|v). The covariance matrix of u is denoted by Cov(u). A d × 1 random vector x has a Multivariate Normal distribution with parameters μ and Σ, denoted by x ∼ N(μ, Σ), if its density function is

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp{−½(x − μ)ᵀ Σ^{−1} (x − μ)}.

A random variable x has an Inverse Gamma distribution with parameters A, B > 0, denoted by x ∼ IG(A, B), if its density function is p(x) = B^A Γ(A)^{−1} x^{−A−1} e^{−B/x}, x > 0. A random vector x = (x_1, ..., x_K) has a Dirichlet distribution with parameter vector α = (α_1, ..., α_K), where each α_k > 0, if its density function is

p(x) = Γ(∑_{k=1}^K α_k) {∏_{k=1}^K Γ(α_k)}^{−1} ∏_{k=1}^K x_k^{α_k − 1}  if ∑_{k=1}^K x_k = 1,  and 0 otherwise.

We write x ∼ Dirichlet(α). If y_i has distribution D_i for each 1 ≤ i ≤ n, and the y_i are independent, then we write y_i ∼ind. D_i.

It is helpful, although not necessary, to work with directed acyclic graph (DAG) depictions of Bayesian statistical models. One reason is the localness of the calculations that arise from the Markov blanket result given in Section 2.2.1. The nodes of the DAG correspond to random variables or random vectors in the Bayesian model, and the directed edges convey conditional independence. Because of this connection with Bayesian (hierarchical) models, DAGs with random nodes are known as Bayesian networks in the Computer Science literature. Figure 1 provides DAGs corresponding to the Bayesian Poisson mixed model:

Y_{ij} | U_i ∼ind. Poisson(e^{β_0 + U_i}), i = 1, 2, 3; j = 1, 2,
U_i | σ_U² ∼ind. N(0, σ_U²),  β_0 ∼ N(0, σ_{β_0}²),   (1)
σ_U² ∼ IG(A, B) for constants σ_{β_0}², A, B > 0.

The DAG on the left side of Figure 1 has a separate node for each scalar random variable and each constant. On the right side the constant nodes are suppressed and two of the nodes correspond to the random vectors u ≡ (U_1, U_2, U_3) and y = (Y_{11}, ..., Y_{32}).

Figure 1. DAGs corresponding to the Bayesian Poisson regression model (1). Left: the large nodes correspond to scalar random variables in the model. The smaller nodes correspond to constants and the observed data are shaded. Right: abbreviated DAG for the same model. The constants are suppressed and the nodes u and y correspond to random vectors containing the U_i and y_{ij}, respectively.

2. DENSITY TRANSFORM APPROACH

The density transform approach to variational approximation involves approximation of posterior densities by other densities for which inference is more tractable. The approximations are guided by the notion of Kullback–Leibler divergence, which we now explain.

2.1 Kullback–Leibler Divergence

Consider a generic Bayesian model with parameter vector θ ∈ Θ and observed data vector y. Bayesian inference is based on the posterior density function

p(θ|y) = p(y, θ)/p(y).

The denominator p(y) is known as the marginal likelihood (or model evidence in the Computer Science literature) and forms the basis of model comparison via Bayes factors (e.g., Kass and Raftery 1995).
It should be noted that, in this Bayesian context, p(y) is not a likelihood function in the usual sense. Throughout this section we assume that y and θ are continuous random vectors. The discrete case has a similar treatment, but with summations rather than integrals.

Let q be an arbitrary density function over Θ. Then the logarithm of the marginal likelihood satisfies

log p(y) = log p(y) ∫ q(θ) dθ = ∫ q(θ) log p(y) dθ
         = ∫ q(θ) log[{p(y, θ)/q(θ)} / {p(θ|y)/q(θ)}] dθ
         = ∫ q(θ) log{p(y, θ)/q(θ)} dθ + ∫ q(θ) log{q(θ)/p(θ|y)} dθ
         ≥ ∫ q(θ) log{p(y, θ)/q(θ)} dθ.   (2)

The inequality arises from the fact that

∫ q(θ) log{q(θ)/p(θ|y)} dθ ≥ 0

for all densities q, with equality if and only if

q(θ) = p(θ|y) almost everywhere   (3)

(Kullback and Leibler 1951). The integral in (3) is known as the Kullback–Leibler divergence (also known as Kullback–Leibler distance) between q and p(·|y). From (2), it follows immediately that

p(y) ≥ p(y; q),

where the q-dependent lower bound on the marginal likelihood is given by

p(y; q) ≡ exp[∫ q(θ) log{p(y, θ)/q(θ)} dθ].   (4)

Note that the lower bound p(y; q) can also be derived more directly using Jensen's inequality, but the above derivation has the advantage of quantifying the gap between p(y) and p(y; q).

The essence of the density transform variational approach is approximation of the posterior density p(θ|y) by a q(θ) for which p(y; q) is more tractable than p(y). Tractability is achieved by restricting q to a more manageable class of densities, and then maximizing p(y; q) over that class. According to (2), maximization of p(y; q) is equivalent to minimization of the Kullback–Leibler divergence between q and p(·|y).

The most common restrictions for the q density are:

(a) q(θ) factorizes into ∏_{i=1}^M q_i(θ_i), for some partition {θ_1, ..., θ_M} of θ.
(b) q is a member of a parametric family of density functions.

In the case of (a), note that the product form is the only assumption being made. Hence (a) represents a type of nonparametric restriction. Restriction (a) is also known as mean field approximation and has its roots in Physics (e.g., Parisi 1988). The term variational Bayes has become commonplace for approximate Bayesian inference under product density restrictions.

Depending on the Bayesian model at hand, both restrictions can have minor or major impacts on the resulting inference. For example, if p(θ_1, θ_2|y) is such that θ_1 and θ_2 have a high degree of dependence, then the restriction q(θ_1, θ_2) = q_1(θ_1)q_2(θ_2) will lead to a degradation in the resulting inference. Conversely, if the posterior dependence between θ_1 and θ_2 is weak, then the product density restriction could lead to very accurate approximate inference. Further discussion on this topic, including references, may be found in section 3.2 of the article by Titterington (2004).
tions. × qM (θ M ) dθ 2 · · · dθ M dθ 1 dy.

2.2 Product Density Transforms

Restriction of q to a subclass of product densities gives rise to explicit solutions for each product component in terms of the others. These, in turn, lead to an iterative scheme for obtaining the simultaneous solutions. The solutions rely on the following result, which we call Result 1. Note that Result 1 follows immediately from (2) and (3) above. However, it is useful to present the result for general random vectors.

Result 1. Let u and v be two continuous random vectors with joint density function p(u, v). The maximum value of

∫ q(u) log{p(u, v)/q(u)} du

over all density functions q is attained by q*(u) = p(u|v).

Return now to the Bayesian model setting of Section 2.1 and suppose that q is subject to the product restriction (a). Then

log p(y; q) = ∫ ∏_{i=1}^M q_i(θ_i) {log p(y, θ) − ∑_{i=1}^M log q_i(θ_i)} dθ_1 ··· dθ_M
            = ∫ q_1(θ_1) [∫ log p(y, θ) q_2(θ_2) ··· q_M(θ_M) dθ_2 ··· dθ_M] dθ_1
              − ∫ q_1(θ_1) log q_1(θ_1) dθ_1 + terms not involving q_1.

Define the new joint density function p̃(y, θ_1) by

p̃(y, θ_1) ≡ exp[∫ log p(y, θ) q_2(θ_2) ··· q_M(θ_M) dθ_2 ··· dθ_M] / ∫∫ exp[∫ log p(y, θ) q_2(θ_2) ··· q_M(θ_M) dθ_2 ··· dθ_M] dθ_1 dy.

Then

log p(y; q) = ∫ q_1(θ_1) log{p̃(y, θ_1)/q_1(θ_1)} dθ_1 + terms not involving q_1.

By Result 1, the optimal q_1 is then

q_1*(θ_1) = p̃(θ_1|y) ≡ p̃(y, θ_1) / ∫ p̃(y, θ_1) dθ_1 ∝ exp[∫ log p(y, θ) q_2(θ_2) ··· q_M(θ_M) dθ_2 ··· dθ_M].

Repeating the same argument for maximizing log p(y; q) over each of q_2, ..., q_M leads to the optimal densities satisfying:

q_i*(θ_i) ∝ exp{E_{−θ_i} log p(y, θ)}, 1 ≤ i ≤ M,   (5)

where E_{−θ_i} denotes expectation with respect to the density ∏_{j≠i} q_j(θ_j). The iterative scheme, labeled Algorithm 1, can be used to solve for the q_i*.

Algorithm 1. Iterative scheme for obtaining the optimal densities under product density restriction (a). The updates are based on the solutions given at (5).

  Initialize: q_2*(θ_2), ..., q_M*(θ_M).
  Cycle:
    q_1*(θ_1) ← exp{E_{−θ_1} log p(y, θ)} / ∫ exp{E_{−θ_1} log p(y, θ)} dθ_1,
    ...
    q_M*(θ_M) ← exp{E_{−θ_M} log p(y, θ)} / ∫ exp{E_{−θ_M} log p(y, θ)} dθ_M
  until the increase in p(y; q) is negligible.

Convexity properties can be used to show that convergence to at least local optima is guaranteed (Boyd and Vandenberghe 2004). If conjugate priors are used, then the q_i* belong to recognizable density families and the q_i* updates reduce to updating parameters in the q_i* family (e.g., Winn and Bishop 2005). Also, in practice it is common to monitor convergence using log{p(y; q)} rather than p(y; q). Sections 2.2.2–2.2.4 provide illustrations.

2.2.1 Connection With Gibbs Sampling

It is easily shown that a valid alternative expression for the q_i*(θ_i) is

q_i*(θ_i) ∝ exp{E_{−θ_i} log p(θ_i|rest)},   (6)

where

rest ≡ {y, θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_M}

is the set containing the random vectors in the model, apart from θ_i. The distributions θ_i|rest, 1 ≤ i ≤ M, are known, in the MCMC literature, as the full conditionals. This form of the optimal densities reveals a link with Gibbs sampling (e.g., Casella and George 1992), which involves successive draws from these full conditionals. Indeed, it becomes apparent from the upcoming examples that the product density transform approach leads to tractable solutions in situations where Gibbs sampling is also viable.

The DAG viewpoint of Bayesian models also gives rise to a useful result arising from the notion of Markov blankets. The Markov blanket of a node is the set of children, parents, and co-parents of that node. The result

p(θ_i|rest) = p(θ_i|Markov blanket of θ_i)   (7)

(Pearl 1988) means that determination of the required full conditionals involves localized calculations on the DAG. It follows from this fact and expression (6) that the product density approach involves a series of local operations. In Computer Science, this has become known as variational message passing (Winn and Bishop 2005). See the example in Section 2.2.3 for illustration of (7) and localization of variational updates.

2.2.2 Normal Random Sample

Our first and most detailed illustration of variational approximation involves approximate Bayesian inference for the most familiar of statistical settings: a random sample from a Normal distribution. Specifically, consider

X_i | μ, σ² ∼ind. N(μ, σ²)

with priors

μ ∼ N(μ_μ, σ_μ²) and σ² ∼ IG(A, B).

The product density transform approximation to p(μ, σ²|x) is

q(μ, σ²) = q_μ(μ) q_{σ²}(σ²).   (8)

The optimal densities take the form

q_μ*(μ) ∝ exp[E_{σ²}{log p(μ|σ², x)}] and q_{σ²}*(σ²) ∝ exp[E_μ{log p(σ²|μ, x)}],

where x = (X_1, ..., X_n). Standard manipulations lead to the full conditionals being

μ | σ², x ∼ N( (nX̄/σ² + μ_μ/σ_μ²)/(n/σ² + 1/σ_μ²), 1/(n/σ² + 1/σ_μ²) ) and
σ² | μ, x ∼ IG(A + n/2, B + ½‖x − μ1_n‖²),

where X̄ = (X_1 + ··· + X_n)/n is the sample mean. The second of these, combined with (6), leads to

q_{σ²}*(σ²) ∝ exp[E_μ{−(A + n/2 + 1) log(σ²) − (B + ½‖x − μ1_n‖²)/σ²}]
           ∝ (σ²)^{−(A + n/2 + 1)} exp[−{B + ½E_μ‖x − μ1_n‖²}/σ²].
We recognize this as a member of the Inverse Gamma family:

q_{σ²}*(σ²) is IG(A + n/2, B + ½E_μ‖x − μ1_n‖²).

Note that E_μ‖x − μ1_n‖² = ‖x − E_μ(μ)1_n‖² + n Var_μ(μ), where

E_μ(μ) = ∫_{−∞}^{∞} μ_0 q_μ(μ_0) dμ_0 and Var_μ(μ) = ∫_{−∞}^{∞} {μ_0 − E_μ(μ)}² q_μ(μ_0) dμ_0

are the mean and variance of the q_μ density. Similar arguments lead to

q_μ(μ) is N( (nX̄E_{σ²}(1/σ²) + μ_μ/σ_μ²)/(nE_{σ²}(1/σ²) + 1/σ_μ²), 1/(nE_{σ²}(1/σ²) + 1/σ_μ²) ),   (9)

where E_{σ²}(1/σ²) = ∫_0^∞ (1/σ_0²) q_{σ²}(σ_0²) dσ_0². When q_{σ²} = q_{σ²}* we get

E_{σ²}(1/σ²) = (A + n/2) / [B + ½{‖x − E_μ(μ)1_n‖² + n Var_μ(μ)}].   (10)

It is now apparent that the functional forms of the optimal densities q_μ* and q_{σ²}* are Normal and Inverse Gamma, respectively, but the parameters need to be determined from relationships such as (9) and (10). Let

μ_{q(μ)} ≡ E_μ(μ),  σ²_{q(μ)} ≡ Var_μ(μ),  and  B_{q(σ²)} ≡ (A + n/2)/E_{σ²}(1/σ²).

Using the relationships established at (9) and (10) we arrive at Algorithm 2, which can be used to obtain the optimal values of μ_{q(μ)}, σ²_{q(μ)}, and B_{q(σ²)}.

Algorithm 2. Iterative scheme for obtaining the parameters in the optimal densities q_μ* and q_{σ²}* in the Normal random sample example.

  Initialize: B_{q(σ²)} > 0.
  Cycle:
    σ²_{q(μ)} ← [n(A + n/2)/B_{q(σ²)} + 1/σ_μ²]^{−1},
    μ_{q(μ)} ← [nX̄(A + n/2)/B_{q(σ²)} + μ_μ/σ_μ²] σ²_{q(μ)},
    B_{q(σ²)} ← B + ½[‖x − μ_{q(μ)}1_n‖² + nσ²_{q(μ)}]
  until the increase in p(x; q) is negligible.

Note that log p(x; q) admits the explicit expression:

log p(x; q) = ½ − (n/2) log(2π) + ½ log(σ²_{q(μ)}/σ_μ²) − {(μ_{q(μ)} − μ_μ)² + σ²_{q(μ)}}/(2σ_μ²)
              + A log(B) − (A + n/2) log B_{q(σ²)} + log Γ(A + n/2) − log Γ(A).

However, within each iteration of Algorithm 2, this expression is valid only after each of the parameter updates has been made.

Upon convergence to μ*_{q(μ)}, (σ²_{q(μ)})*, and B*_{q(σ²)}, the approximations to the individual posterior densities are

p(μ|x) ≈ {2π(σ²_{q(μ)})*}^{−1/2} exp[−(μ − μ*_{q(μ)})²/{2(σ²_{q(μ)})*}]

and

p(σ²|x) ≈ {(B*_{q(σ²)})^{A + n/2}/Γ(A + n/2)} (σ²)^{−A − n/2 − 1} exp(−B*_{q(σ²)}/σ²),  σ² > 0.

Figure 2 illustrates these variational approximations for a simulated sample of size n = 20 from the N(100, 225) density. For priors we used μ ∼ N(0, 10^8) and σ² ∼ IG(1/100, 1/100), corresponding to vague beliefs about the mean and variance, and such that the prior mean of the precision, 1/σ², is unity. The initial value for the iterative scheme is B_{q(σ²)} = 1. The exact posterior densities, obtained via highly accurate quadrature, are also displayed. Note that, in this example, convergence is very rapid and the accuracy of the variational approximation is quite good.
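For readers who want to experiment, here is a minimal R sketch of Algorithm 2, assuming a simulated sample like the one described above (n = 20 from N(100, 225)) and the same vague priors; the variable names (mu.q.mu, sigsq.q.mu, B.q.sigsq) are ours, not the article's.

## Minimal R sketch of Algorithm 2 (Normal random sample).
set.seed(1)
n <- 20
x <- rnorm(n, mean = 100, sd = 15)          # sample from N(100, 225)
mu.mu <- 0; sigsq.mu <- 1e8                 # prior: mu ~ N(0, 10^8)
A <- 0.01; B <- 0.01                        # prior: sigma^2 ~ IG(1/100, 1/100)

B.q.sigsq <- 1                              # initialise B_{q(sigma^2)}
logML.old <- -Inf
for (iter in 1:100) {
  sigsq.q.mu <- 1 / (n * (A + n/2) / B.q.sigsq + 1/sigsq.mu)
  mu.q.mu    <- (n * mean(x) * (A + n/2) / B.q.sigsq + mu.mu/sigsq.mu) * sigsq.q.mu
  B.q.sigsq  <- B + 0.5 * (sum((x - mu.q.mu)^2) + n * sigsq.q.mu)
  ## log p(x; q): valid only after all three updates have been made
  logML <- 0.5 - (n/2)*log(2*pi) + 0.5*log(sigsq.q.mu/sigsq.mu) -
           ((mu.q.mu - mu.mu)^2 + sigsq.q.mu)/(2*sigsq.mu) +
           A*log(B) - (A + n/2)*log(B.q.sigsq) + lgamma(A + n/2) - lgamma(A)
  if (logML - logML.old < 1e-8) break
  logML.old <- logML
}
## q*(mu) is N(mu.q.mu, sigsq.q.mu); q*(sigma^2) is IG(A + n/2, B.q.sigsq)
c(mu.q.mu = mu.q.mu, sigsq.q.mu = sigsq.q.mu, B.q.sigsq = B.q.sigsq)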
Figure 2. Results from applying the product density variational approximation to a simulated Normal random sample. The exact posterior density functions are added for comparison. The vertical dotted line in the posterior density plots corresponds to the true value of the parameter.

2.2.3 Linear Mixed Model

The Bayesian version of the Gaussian linear mixed model takes the general form

y | β, u, G, R ∼ N(Xβ + Zu, R),  u | G ∼ N(0, G),   (11)

where y is an n × 1 vector of response variables, β is a p × 1 vector of fixed effects, u is a vector of random effects, X and Z are corresponding design matrices, and G and R are covariance matrices. While several possibilities exist for G and R (e.g., McCulloch, Searle, and Neuhaus 2008), we restrict attention here to variance component models with

G = blockdiag(σ_{u1}² I_{K_1}, ..., σ_{ur}² I_{K_r})  and  R = σ_ε² I.   (12)

We also impose the conjugate priors:

β ∼ N(0, σ_β² I),  σ_{uℓ}² ∼ind. IG(A_{uℓ}, B_{uℓ}), 1 ≤ ℓ ≤ r,  σ_ε² ∼ IG(A_ε, B_ε)   (13)

for some σ_β², A_{uℓ}, B_{uℓ}, A_ε, B_ε > 0. Figure 3 is the DAG corresponding to model (11)–(13).

Figure 3. DAG corresponding to the model (11)–(13).

Somewhat remarkably, a tractable solution arises for the two-component product

q(β, u, σ_{u1}², ..., σ_{ur}², σ_ε²) = q_{β,u}(β, u) q_{σ²}(σ_{u1}², ..., σ_{ur}², σ_ε²).   (14)

Application of (5) leads to the optimal densities taking the form

q_{β,u}*(β, u) is a Multivariate Normal density function,
q_{σ²}* is a product of r + 1 Inverse Gamma density functions.

It should be stressed that these forms are not imposed at the outset, but arise as optimal solutions for model (11)–(13) and product restriction (14). Moreover, the factorization of q_{σ²}* into r + 1 separate components is also a consequence of (5) for the current model, rather than an imposition. Bishop (2006, sec. 10.2.5) explained how these induced factorizations follow from the structure of the DAG and d-separation theory (Pearl 1988). This example also benefits from the Markov blanket result (7) described in Section 2.2.1 and Figure 3. For example, the full conditional density of σ_{u1}² is

p(σ_{u1}²|rest) = p(σ_{u1}²|Markov blanket of σ_{u1}²) = p(σ_{u1}²|u, σ_{u2}², ..., σ_{ur}²).

Hence, determination of q*_{σ_{u1}²} requires calculations involving only the subset of the DAG consisting of u and the variance parameters.

Let μ_{q(β,u)} and Σ_{q(β,u)} be the mean and covariance matrix for the q_{β,u} density and set C ≡ [X Z]. For the q_{σ²}* density the shape parameters for the r + 1 components can be shown to be deterministic: A_{u1} + ½K_1, ..., A_{ur} + ½K_r, A_ε + ½n. Let B_{q(σ_{u1}²)}, ..., B_{q(σ_{ur}²)}, B_{q(σ_ε²)} be the accompanying rate parameters. The relationships between (μ_{q(β,u)}, Σ_{q(β,u)}) and (B_{q(σ_{u1}²)}, ..., B_{q(σ_{ur}²)}, B_{q(σ_ε²)}) enforced by (5) lead to the iterative scheme in Algorithm 3.
Algorithm 3. Iterative scheme for obtaining the parameters in the optimal densities q_{β,u}* and q_{σ²}* in the Bayesian linear mixed model example.

  Initialize: B_{q(σ_ε²)}, B_{q(σ_{u1}²)}, ..., B_{q(σ_{ur}²)} > 0.
  Cycle:
    Σ_{q(β,u)} ← [ {(A_ε + n/2)/B_{q(σ_ε²)}} CᵀC + blockdiag( σ_β^{−2} I_p, {(A_{u1} + ½K_1)/B_{q(σ_{u1}²)}} I_{K_1}, ..., {(A_{ur} + ½K_r)/B_{q(σ_{ur}²)}} I_{K_r} ) ]^{−1},
    μ_{q(β,u)} ← {(A_ε + n/2)/B_{q(σ_ε²)}} Σ_{q(β,u)} Cᵀy,
    B_{q(σ_ε²)} ← B_ε + ½[ ‖y − Cμ_{q(β,u)}‖² + tr(CᵀC Σ_{q(β,u)}) ],
    B_{q(σ_{uℓ}²)} ← B_{uℓ} + ½[ ‖μ_{q(u_ℓ)}‖² + tr{Σ_{q(u_ℓ)}} ] for 1 ≤ ℓ ≤ r
  until the increase in p(y; q) is negligible.

In this case log p(y; q) takes the form

log p(y; q) = ½(p + ∑_{ℓ=1}^r K_ℓ) − (n/2) log(2π) − (p/2) log(σ_β²)
              + ½ log|Σ_{q(β,u)}| − {‖μ_{q(β)}‖² + tr(Σ_{q(β)})}/(2σ_β²)
              + A_ε log(B_ε) − (A_ε + n/2) log B_{q(σ_ε²)} + log Γ(A_ε + n/2) − log Γ(A_ε)
              + ∑_{ℓ=1}^r [ A_{uℓ} log(B_{uℓ}) − (A_{uℓ} + ½K_ℓ) log B_{q(σ_{uℓ}²)} + log Γ(A_{uℓ} + ½K_ℓ) − log Γ(A_{uℓ}) ].

Note that, within each iteration of Algorithm 3, this expression applies only after each of the parameter updates has been made. Upon convergence to μ*_{q(β,u)}, Σ*_{q(β,u)}, B*_{q(σ_{u1}²)}, ..., B*_{q(σ_{ur}²)}, and B*_{q(σ_ε²)}, the approximate posteriors are:

p(β, u|y) ≈ the N(μ*_{q(β,u)}, Σ*_{q(β,u)}) density function

and

p(σ_{u1}², ..., σ_{ur}², σ_ε²|y) ≈ product of the IG(A_{uℓ} + ½K_ℓ, B*_{q(σ_{uℓ}²)}), 1 ≤ ℓ ≤ r, density functions together with the IG(A_ε + ½n, B*_{q(σ_ε²)}) density function.

We now provide an illustration for Bayesian analysis of a dataset involving longitudinal orthodontic measurements on 27 children (source: Pinheiro and Bates 2000). The data are available in the R computing environment (R Development Core Team 2010) via the package nlme (Pinheiro et al. 2009), in the object Orthodont. We entertained the random intercept model

distance_{ij} | U_i ∼ind. N(β_0 + U_i + β_1 age_{ij} + β_2 male_i, σ_ε²),
U_i | σ_u² ∼ind. N(0, σ_u²), 1 ≤ i ≤ 27, 1 ≤ j ≤ 4,   (15)
β_i ∼ind. N(0, σ_β²),  σ_u², σ_ε² ∼ind. IG(A, B),

where distance_{ij} is the distance from the pituitary to the pterygomaxillary fissure (mm) for patient i at time point j. Similarly, age_{ij} corresponds to the longitudinal age values in years and male_i is an indicator of the ith child being male. This fits into framework (11)–(12) with y containing the distance_{ij} measurements, X = [1, age_{ij}, male_i], and Z = I_{27} ⊗ 1_4 is an indicator matrix for the random intercepts. We used the vague priors σ_β² = 10^8, A = B = 1/100, and used standardized versions of the distance and age data during the fitting. The results were then converted back to the original units. For comparison, we obtained 1 million samples from the posteriors using MCMC (with a burn-in of length 5000) and, from these, constructed kernel density estimate approximations to the posteriors. For such a high Monte Carlo sample size we would expect these MCMC-based approximations to be very accurate.

Figure 4 shows the progressive values of log p(y; q) and the approximate posterior densities obtained from applying Algorithm 3. Once again, convergence of log{p(y; q)} to a maximum is seen to be quite rapid. The variational approximate posterior densities are quite close to those obtained via MCMC, and indicate statistical significance of all model parameters in the sense that most of the posterior probability mass is away from zero.
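The following R sketch implements Algorithm 3 for the special case of a single variance component (r = 1), that is, a random intercept model, on simulated data rather than the Orthodont fit itself; the hyperparameter values and variable names are our own choices, and monitoring of log p(y; q) is omitted for brevity.

## Minimal R sketch of Algorithm 3 with r = 1 (random intercept model), simulated data.
set.seed(2)
m <- 27; nper <- 4; n <- m * nper
id <- rep(1:m, each = nper)
x  <- rnorm(n)
u  <- rnorm(m, sd = 1.5)
y  <- 2 + 0.5 * x + u[id] + rnorm(n, sd = 1)

X <- cbind(1, x); Z <- model.matrix(~ factor(id) - 1)
C <- cbind(X, Z); p <- ncol(X); K <- ncol(Z)
sigsq.beta <- 1e8; A.u <- B.u <- A.eps <- B.eps <- 0.01   # hyperparameters (our choice)

B.q.u <- B.q.eps <- 1
CtC <- crossprod(C); Cty <- crossprod(C, y)
for (iter in 1:200) {
  Sigma.q <- solve((A.eps + n/2)/B.q.eps * CtC +
                   diag(c(rep(1/sigsq.beta, p), rep((A.u + K/2)/B.q.u, K))))
  mu.q    <- (A.eps + n/2)/B.q.eps * Sigma.q %*% Cty
  resid   <- y - C %*% mu.q
  B.q.eps <- B.eps + 0.5 * (sum(resid^2) + sum(CtC * Sigma.q))   # tr(C'C Sigma) via elementwise sum
  u.ind   <- p + (1:K)
  B.q.u   <- B.u + 0.5 * (sum(mu.q[u.ind]^2) + sum(diag(Sigma.q)[u.ind]))
}
## Approximate posteriors: (beta, u) | y ~ N(mu.q, Sigma.q);
## sigma_u^2 | y ~ IG(A.u + K/2, B.q.u);  sigma_eps^2 | y ~ IG(A.eps + n/2, B.q.eps)
mu.q[1:p]                                   # approximate posterior means of beta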
Figure 4. Approximate posterior densities from applying the product density variational approximation to (11)–(13) for the orthodontic data. 'Exact' posterior densities, based on kernel density estimates of 1 million MCMC samples, are shown for comparison.

2.2.4 Probit Regression and the Use of Auxiliary Variables

As shown by Albert and Chib (1993), Gibbs sampling for the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. The same trick applies to product density variational approximation (Girolami and Rogers 2006; Consonni and Marin 2007), as we now show.

The Bayesian probit regression model that we consider here is

Y_i | β_0, ..., β_k ∼ind. Bernoulli(Φ(β_0 + β_1 x_{1i} + ··· + β_k x_{ki})), 1 ≤ i ≤ n,

where the prior distribution on the coefficient vector β = (β_0, ..., β_k) takes the form β ∼ N(μ_β, Σ_β). Letting X ≡ [1 x_{1i} ··· x_{ki}]_{1≤i≤n}, the likelihood can be written compactly as

p(y|β) = Φ(Xβ)^y {1_n − Φ(Xβ)}^{1_n − y},  β ∼ N(μ_β, Σ_β).

Introduce the vector of auxiliary variables a = (a_1, ..., a_n), where

a_i | β ∼ind. N((Xβ)_i, 1).

This allows us to write

p(y_i|a_i) = I(a_i ≥ 0)^{y_i} I(a_i < 0)^{1 − y_i}, 1 ≤ i ≤ n.

In graphical model terms we are introducing a new node to the graph, as conveyed by Figure 5. Expansion of the parameter set from {β} to {a, β} is the key to achieving a tractable solution. Consider the product restriction

q(a, β) = q_a(a) q_β(β).

Then application of (5) leads to

q_a*(a) = ∏_{i=1}^n [ I(a_i ≥ 0)^{y_i} I(a_i < 0)^{1 − y_i} / {Φ((Xμ_{q(β)})_i)^{y_i} {1 − Φ((Xμ_{q(β)})_i)}^{1 − y_i}} ] × (2π)^{−n/2} exp(−½‖a − Xμ_{q(β)}‖²)

and q_β*(β) is the N(μ_{q(β)}, (XᵀX + Σ_β^{−1})^{−1}) density function. These optimal densities are specified up to the parameter vector μ_{q(β)} ≡ E_β(β). We also need to work with the q-density mean of the auxiliary variable vector μ_{q(a)} ≡ E_a(a). The iterative scheme, Algorithm 4, emerges.

Algorithm 4. Iterative scheme for obtaining the parameters in the optimal densities q_β* and q_a* in the Bayesian probit regression example.

  Initialize: μ_{q(a)} (n × 1).
  Cycle:
    μ_{q(β)} ← (XᵀX + Σ_β^{−1})^{−1} (Xᵀμ_{q(a)} + Σ_β^{−1}μ_β),
    μ_{q(a)} ← Xμ_{q(β)} + φ(Xμ_{q(β)}) / [ Φ(Xμ_{q(β)})^y {Φ(Xμ_{q(β)}) − 1_n}^{1_n − y} ]
  until the increase in p(y; q) is negligible.

The log p(y; q) expression in this case is

log p(y; q) = yᵀ log Φ(Xμ_{q(β)}) + (1_n − y)ᵀ log{1_n − Φ(Xμ_{q(β)})}
              − ½(μ_{q(β)} − μ_β)ᵀ Σ_β^{−1} (μ_{q(β)} − μ_β) − ½ log|Σ_β XᵀX + I|.

Upon convergence, the approximate posterior distribution of the regression coefficients is

β | y ∼approx. N(μ*_{q(β)}, (XᵀX + Σ_β^{−1})^{−1}).

Figure 5. Graphical representations of the probit regression model. The left graph does not admit a tractable product density variational approximation. The right graph overcomes this with the addition of an auxiliary variable node.
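A minimal R sketch of Algorithm 4 on simulated data follows; the prior settings (mu.beta = 0, Sigma.beta = 100 I) and variable names are ours. Note that (XᵀX + Σ_β^{−1})^{−1} does not change across iterations, so it is computed once.

## Minimal R sketch of Algorithm 4 (variational Bayes for probit regression).
set.seed(3)
n <- 200
x1 <- rnorm(n)
X  <- cbind(1, x1)
y  <- rbinom(n, 1, pnorm(-0.5 + 1.2 * x1))

mu.beta <- rep(0, 2); Sigma.beta <- diag(100, 2)
SigInv  <- solve(Sigma.beta)
Sigma.q.beta <- solve(crossprod(X) + SigInv)     # (X'X + Sigma.beta^{-1})^{-1}, fixed

mu.q.a <- rep(0, n)                              # initialise mu_{q(a)}
for (iter in 1:100) {
  mu.q.beta <- Sigma.q.beta %*% (crossprod(X, mu.q.a) + SigInv %*% mu.beta)
  eta <- as.vector(X %*% mu.q.beta)
  ## mean of the truncated-normal auxiliary variables:
  ## eta + phi(eta)/Phi(eta) if y = 1,  eta - phi(eta)/{1 - Phi(eta)} if y = 0
  mu.q.a <- eta + ifelse(y == 1, dnorm(eta)/pnorm(eta), -dnorm(eta)/pnorm(-eta))
}
## Approximate posterior: beta | y ~ N(mu.q.beta, Sigma.q.beta)
cbind(estimate = as.vector(mu.q.beta), sd = sqrt(diag(Sigma.q.beta)))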


2.2.5 Finite Normal Mixture Model

Our last example of product density variational approximation is of great interest within both Statistics and Computer Science: inference for finite mixture models. Let X_1, ..., X_n be a univariate sample that is modeled as a random sample from a mixture of K Normal density functions with parameters (μ_k, σ_k²), 1 ≤ k ≤ K. Accordingly, the joint density function of the sample is

p(x_1, ..., x_n) = ∏_{i=1}^n ∑_{k=1}^K w_k (2πσ_k²)^{−1/2} exp{−½(x_i − μ_k)²/σ_k²},   (16)

where the weights w_k, 1 ≤ k ≤ K, are nonnegative and sum to unity. Let (w_1, ..., w_K) have prior distribution:

(w_1, ..., w_K) ∼ Dirichlet(α, ..., α), α > 0.

We will take the prior distributions for the mean and variance parameters to be

μ_k ∼ind. N(μ_{μk}, σ_{μk}²),  σ_k² ∼ind. IG(A_k, B_k), 1 ≤ k ≤ K.

As with the probit regression model, a tractable product density transform requires the introduction of the auxiliary variable vectors:

(a_{i1}, ..., a_{iK}) | (w_1, ..., w_K) ∼ind. Multinomial(1; w_1, ..., w_K), 1 ≤ i ≤ n.   (17)

According to this notation, ∑_{k=1}^K a_{ik} = 1 and w_k = P(a_{ik} = 1). If we set

p(x_i | a_{i1}, ..., a_{iK}) = ∏_{k=1}^K [ (2πσ_k²)^{−1/2} exp{−½(x_i − μ_k)²/σ_k²} ]^{a_{ik}}

independently for each 1 ≤ i ≤ n, then, using (17), the joint density function of the X_1, ..., X_n is easily shown to be (16).

Let w, μ, σ², and a be the vectors containing the corresponding subscripted random variables. Then either of the product density restrictions

q(w, μ, σ², a) = q(w, μ) q(σ²) q(a)  or  q(w, μ, σ², a) = q(w, σ²) q(μ) q(a)   (18)

is sufficient for a closed form solution. Note that subscripting on the q densities is being suppressed to reduce clutter. Regardless of which restriction in (18) is chosen, application of Algorithm 1 leads to the optimal density for the model parameters having the product structure

q*(w, μ, σ²) = q*(w) q*(μ) q*(σ²),

where

q*(w) = density function of a Dirichlet distribution,
q*(μ) = product of K Normal density functions, and
q*(σ²) = product of K Inverse Gamma density functions.

For 1 ≤ k ≤ K, let μ_{q(μk)} and σ²_{q(μk)} denote the mean and variance for q*(μ_k) and let A_{q(σ_k²)} and B_{q(σ_k²)} denote the shape and rate parameters for q*(σ_k²). Also, let

α_{q(w)} ≡ (α_{q(w_1)}, ..., α_{q(w_K)})

be the Dirichlet parameter vector for q*(w). The optimal parameters may be obtained using Algorithm 5. Recall, from Section 1.1, that ψ denotes the Digamma function.

Algorithm 5. Iterative scheme for obtaining the parameters in the optimal densities q_w*, q_μ*, and q_{σ²}* in the finite Normal mixtures example.

  Initialize: μ_{q(μk)} ∈ R and α_{q(wk)}, σ²_{q(μk)}, A_{q(σ_k²)}, B_{q(σ_k²)}, ω_{•k} > 0, 1 ≤ k ≤ K, such that ∑_{k=1}^K ω_{•k} = 1.
  Cycle:
    For i = 1, ..., n and k = 1, ..., K:
      ν_{ik} ← ψ(α_{q(wk)}) + ½ψ(A_{q(σ_k²)}) − ½ log(B_{q(σ_k²)}) − ½A_{q(σ_k²)}{(X_i − μ_{q(μk)})² + σ²_{q(μk)}}/B_{q(σ_k²)}.
    For i = 1, ..., n and k = 1, ..., K: ω_{ik} ← exp(ν_{ik}) / ∑_{k=1}^K exp(ν_{ik}).
    For k = 1, ..., K:
      ω_{•k} ← ∑_{i=1}^n ω_{ik};  σ²_{q(μk)} ← 1/{1/σ_{μk}² + A_{q(σ_k²)} ω_{•k}/B_{q(σ_k²)}},
      μ_{q(μk)} ← σ²_{q(μk)}{μ_{μk}/σ_{μk}² + A_{q(σ_k²)} ∑_{i=1}^n ω_{ik}X_i/B_{q(σ_k²)}},
      α_{q(wk)} ← α + ω_{•k};  A_{q(σ_k²)} ← A_k + ½ω_{•k},
      B_{q(σ_k²)} ← B_k + ½∑_{i=1}^n ω_{ik}{(X_i − μ_{q(μk)})² + σ²_{q(μk)}}
  until the increase in p(x; q) is negligible.

The log p(x; q) expression in this case is

log p(x; q) = ½K{1 − n log(2π)} + log Γ(Kα) − K log Γ(α) − log Γ(n + Kα)
              + ∑_{k=1}^K [ A_k log(B_k) − A_{q(σ_k²)} log B_{q(σ_k²)} + log Γ(A_{q(σ_k²)}) − log Γ(A_k)
              + log Γ(α_{q(wk)}) + ½ log(σ²_{q(μk)}/σ_{μk}²) − ½{(μ_{q(μk)} − μ_{μk})² + σ²_{q(μk)}}/σ_{μk}²
              − ∑_{i=1}^n ω_{ik} log(ω_{ik}) ].

Note that, for each iteration of Algorithm 5, this expression is valid only after each of the parameter updates has been made. Algorithm 5 is similar to the EM algorithm for fitting a finite Normal mixture model. Comparison and contrast are given in section 10.2.1 of the book by Bishop (2006).

Figure 6 shows the result of applying Algorithm 5 to data on the duration of geyser eruptions. The data are available in the R computing environment via the package MASS (Venables and Ripley 2009), in the object geyser$duration. The number of mixtures was set at K = 2 and vague priors with α = 0.001, μ_k ∼ N(0, 10^8), and σ_k² ∼ IG(1/100, 1/100) were used. The upper panel of Figure 6 shows that convergence of log p(x; q) was obtained after about 20 iterations from naïve starting values. In the lower panel, the curve corresponds to the approximate posterior mean of the common density function. The shaded region corresponds to approximate pointwise 95% credible sets. These were obtained using 10,000 draws from q*(w, μ, σ²). Finally, we note that variational approximation methodology could also be used to choose the number of mixtures K. See, for example, the work of Bishop (2006, sec. 10.2.4) and McGrory and Titterington (2007).
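Below is a minimal R sketch of Algorithm 5 for K = 2 applied to geyser$duration from the MASS package, with the vague prior settings described above; the starting values, the max-subtraction used when exponentiating the ν_{ik} (for numerical stability), and the variable names are our own choices.

## Minimal R sketch of Algorithm 5 (K = 2 Normal mixture) for the geyser durations.
library(MASS)
x <- geyser$duration
n <- length(x); K <- 2
alpha <- 0.001; mu.mu <- rep(0, K); sigsq.mu <- rep(1e8, K)
A0 <- rep(0.01, K); B0 <- rep(0.01, K)

## naive starting values (our choice)
mu.q <- quantile(x, c(0.25, 0.75)); names(mu.q) <- NULL
sigsq.q <- rep(var(x), K)
A.q <- rep(1, K); B.q <- rep(1, K); alpha.q <- rep(alpha + n/K, K)

for (iter in 1:100) {
  nu <- sapply(1:K, function(k)
    digamma(alpha.q[k]) + 0.5 * digamma(A.q[k]) - 0.5 * log(B.q[k]) -
    0.5 * A.q[k] * ((x - mu.q[k])^2 + sigsq.q[k]) / B.q[k])
  w <- exp(nu - apply(nu, 1, max))            # n x K responsibilities omega_{ik}
  w <- w / rowSums(w)
  for (k in 1:K) {
    w.k <- sum(w[, k])
    sigsq.q[k] <- 1 / (1/sigsq.mu[k] + A.q[k] * w.k / B.q[k])
    mu.q[k]    <- sigsq.q[k] * (mu.mu[k]/sigsq.mu[k] + A.q[k] * sum(w[, k] * x) / B.q[k])
    alpha.q[k] <- alpha + w.k
    A.q[k]     <- A0[k] + 0.5 * w.k
    B.q[k]     <- B0[k] + 0.5 * sum(w[, k] * ((x - mu.q[k])^2 + sigsq.q[k]))
  }
}
round(cbind(mean = mu.q, E.sigsq = B.q/(A.q - 1), E.w = alpha.q/sum(alpha.q)), 3)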
Figure 6. Results from application of Algorithm 5 to data on the duration of geyser eruptions. The upper panel shows successive values of log p(x; q). The lower panel shows approximate mean and pointwise 95% credible sets for the common density function. The data are shown at the base of the plot.

2.3 Parametric Density Transforms

Rather than assuming that q(θ) has product density structure, we may instead assume that it belongs to a particular parametric family and hope that this results in a more tractable approximation to the posterior density p(θ|y). This approach has received less attention in the Computer Science literature. Examples where it has appeared are the works by Barber and Bishop (1998), Seeger (2000, 2004), Honkela and Valpola (2005), and Archambeau et al. (2007).

Next, we illustrate parametric density transforms with a simple example.

2.3.1 Poisson Regression With Gaussian Transform

Consider the Bayesian Poisson regression model

Y_i | β_0, ..., β_k ∼ind. Poisson(exp(β_0 + β_1 x_{1i} + ··· + β_k x_{ki})), 1 ≤ i ≤ n,

where the prior distribution on the coefficient vector β ≡ (β_0, ..., β_k) takes the form β ∼ N(μ_β, Σ_β). As before, we let X = [1 x_{1i} ··· x_{ki}]_{1≤i≤n}. Then the likelihood is

p(y|β) = exp{yᵀXβ − 1_nᵀ exp(Xβ) − 1_nᵀ log(y!)}

and the marginal likelihood is

p(y) = (2π)^{−(k+1)/2} |Σ_β|^{−1/2} ∫_{R^{k+1}} exp{yᵀXβ − 1_nᵀ exp(Xβ) − 1_nᵀ log(y!) − ½(β − μ_β)ᵀΣ_β^{−1}(β − μ_β)} dβ.
Note that p(y), and hence p(β|y), involves an intractable integral over R^{k+1}.

Take q to be the N(μ_{q(β)}, Σ_{q(β)}) density:

q(β; μ_{q(β)}, Σ_{q(β)}) = (2π)^{−p/2} |Σ_{q(β)}|^{−1/2} exp{−½(β − μ_{q(β)})ᵀ Σ_{q(β)}^{−1} (β − μ_{q(β)})}.

Then the lower bound (4) admits the explicit expression

log p(y; μ_{q(β)}, Σ_{q(β)}) = yᵀXμ_{q(β)} − 1_nᵀ exp{Xμ_{q(β)} + ½ diagonal(XΣ_{q(β)}Xᵀ)}
                              − ½(μ_{q(β)} − μ_β)ᵀΣ_β^{−1}(μ_{q(β)} − μ_β) − ½ tr(Σ_β^{−1}Σ_{q(β)})
                              + ½ log|Σ_{q(β)}| − ½ log|Σ_β| + (k + 1)/2 − 1_nᵀ log(y!).   (19)

Note that, from (2),

log p(y) ≥ log p(y; μ_{q(β)}, Σ_{q(β)})

for all choices of the mean vector μ_{q(β)} and covariance matrix Σ_{q(β)}. Choosing these variational parameters to maximize log p(y; μ_{q(β)}, Σ_{q(β)}) makes the approximation as good as possible. The optimal Gaussian density transform q*(β) is the N(μ*_{q(β)}, Σ*_{q(β)}) density function, where μ*_{q(β)} and Σ*_{q(β)} are the maximizers of log p(y; μ_{q(β)}, Σ_{q(β)}). Newton–Raphson iteration can be used to determine μ*_{q(β)} and Σ*_{q(β)}. Further details may be found in the work of Ormerod (2008).
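As a rough numerical illustration of this subsection, the R sketch below maximizes the lower bound (19) directly with optim() (BFGS) rather than the Newton–Raphson scheme mentioned above, parameterizing Σ_{q(β)} through a Cholesky factor to keep it positive definite. The simulated data, the parameterization, and all variable names are our own assumptions, not part of the original article.

## Sketch: Gaussian density transform for Poisson regression by maximising (19).
set.seed(4)
n <- 150
x1 <- rnorm(n)
X  <- cbind(1, x1); d <- ncol(X)
y  <- rpois(n, exp(0.5 + 0.8 * x1))
mu.beta <- rep(0, d); Sigma.beta <- diag(100, d); SigInv <- solve(Sigma.beta)

elbo <- function(par) {
  mu <- par[1:d]
  R  <- matrix(0, d, d)
  R[lower.tri(R, diag = TRUE)] <- par[-(1:d)]
  diag(R) <- exp(diag(R))                  # log-diagonal keeps Sigma = R R' positive definite
  Sigma <- R %*% t(R)
  Xmu  <- as.vector(X %*% mu)
  quad <- rowSums((X %*% Sigma) * X)       # diagonal of X Sigma X'
  sum(y * Xmu) - sum(exp(Xmu + 0.5 * quad)) -
    0.5 * sum((mu - mu.beta) * (SigInv %*% (mu - mu.beta))) -
    0.5 * sum(SigInv * Sigma) +
    0.5 * log(det(Sigma)) - 0.5 * log(det(Sigma.beta)) + d/2 - sum(lgamma(y + 1))
}
start <- c(rep(0, d), rep(0, d * (d + 1) / 2))
fit <- optim(start, elbo, method = "BFGS", control = list(fnscale = -1, maxit = 500))
mu.q <- fit$par[1:d]                        # approximate posterior mean of beta
mu.q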
3. TANGENT TRANSFORM APPROACH

Not all variational approximations fit within the Kullback–Leibler divergence framework. Another variety are what might be called tangent transform variational approximations since they work with 'tangent-type' representations of concave and convex functions. An example of such a representation is

log(x) = min_{ξ>0} {ξx − log(ξ) − 1} for all x > 0.   (20)

Figure 7 provides a graphical description of (20). The representation (20) implies that

log(x) ≤ ξx − log(ξ) − 1 for all ξ > 0.

The fact that ξx − log(ξ) − 1 is linear in x for every value of the variational parameter ξ > 0 allows for simplifications of expressions involving the logarithmic function. The value of ξ can then be chosen to make the approximation as accurate as possible.

Tangent transform variational approximations are underpinned by the theory of convex duality (e.g., Rockafellar 1972). We will not delve into that here, and instead stay on course with statistical examples. The interested reader should consult the article by Jordan et al. (1999).

Figure 7. Variational representation of the logarithmic function. Left axes: Members of the family of functions f(x, ξ) ≡ ξx − log(ξ) − 1 versus ξ > 0, for x ∈ {0.25, 0.5, 1, 2, 4}, shown as gray curves. Right axes: For each x, the minimum of f(x, ξ) over ξ corresponds to log(x). In the x direction the f(x, ξ) are linear and are shown in gray.

3.1 Bayesian Logistic Regression

As described by Jaakkola and Jordan (2000), Bayesian logistic regression lends itself to tangent transform variational approximation. Hence, we consider the Bayesian logistic regression model

Y_i | β_0, ..., β_k ∼ind. Bernoulli([1 + exp{−(β_0 + β_1 x_{1i} + ··· + β_k x_{ki})}]^{−1}), 1 ≤ i ≤ n,

where the prior distribution on the coefficient vector β = (β_0, ..., β_k) takes the form β ∼ N(μ_β, Σ_β). The likelihood is

p(y|β) = exp[yᵀXβ − 1_nᵀ log{1_n + exp(Xβ)}],

where X = [1 x_{1i} ··· x_{ki}]_{1≤i≤n}. The posterior density of β is

p(β|y) = p(y, β) / ∫_{R^{k+1}} p(y, β) dβ,

where

p(y, β) = exp[yᵀXβ − 1_nᵀ log{1_n + exp(Xβ)} − ½(β − μ_β)ᵀΣ_β^{−1}(β − μ_β)
              − {(k + 1)/2} log(2π) − ½ log|Σ_β|].   (21)

Once again, we are stuck with a multivariate intractable integral in the normalizing factor. We get around this by noting the following representation of −log(1 + e^x) as the maxima of a family of parabolas:

−log(1 + e^x) = max_{ξ∈R} {A(ξ)x² − ½x + C(ξ)} for all x ∈ R,   (22)

where

A(ξ) ≡ −tanh(ξ/2)/(4ξ) and C(ξ) ≡ ξ/2 − log(1 + e^ξ) + ξ tanh(ξ/2)/4.

While the genesis of (22) may be found in the article by Jaakkola and Jordan (2000), it is easily checked via elementary calculus methods. It follows from (22) that

−1_nᵀ log{1_n + exp(Xβ)} ≥ 1_nᵀ [A(ξ) ⊙ (Xβ)² − ½Xβ + C(ξ)]
                          = βᵀXᵀ diag{A(ξ)}Xβ − ½1_nᵀXβ + 1_nᵀC(ξ),   (23)

where ξ = (ξ_1, ..., ξ_n) is an n × 1 vector of variational parameters. This gives us the following lower bound on p(y, β):

p(y, β; ξ) = exp[−½βᵀ{Σ_β^{−1} − 2Xᵀ diag{A(ξ)}X}β + {(y − ½1_n)ᵀX + μ_βᵀΣ_β^{−1}}β
                 − ½μ_βᵀΣ_β^{−1}μ_β + 1_nᵀC(ξ) − {(k + 1)/2} log(2π) − ½ log|Σ_β|],

which is proportional to a Multivariate Normal density in β. Upon normalization we obtain the following family of variational approximations to β|y:

β | y; ξ ∼ N(μ(ξ), Λ(ξ)),   (24)

where

Λ(ξ) ≡ {Σ_β^{−1} − 2Xᵀ diag{A(ξ)}X}^{−1} and μ(ξ) ≡ Λ(ξ){Xᵀ(y − ½1_n) + Σ_β^{−1}μ_β}.

We are left with the problem of determining the vector of variational parameters ξ ∈ R^n. A natural way of choosing these is to make

p(y; ξ) ≡ ∫ p(y, β; ξ) dβ

as close as possible to p(y). Since p(y; ξ) ≤ p(y) for all ξ, this reduces to the problem of maximizing p(y; ξ) over ξ. Note that this lower bound on log p(y) has explicit expression:

log p(y; ξ) = ½ log|Λ(ξ)| − ½ log|Σ_β| + ½μ(ξ)ᵀΛ(ξ)^{−1}μ(ξ) − ½μ_βᵀΣ_β^{−1}μ_β
              + ∑_{i=1}^n {ξ_i/2 − log(1 + e^{ξ_i}) + (ξ_i/4) tanh(ξ_i/2)}.

Even though this can be maximized numerically in a similar fashion to (19), Jaakkola and Jordan (2000) derived a simpler algorithm based on the notion of Expectation Maximization (EM) (e.g., McLachlan and Krishnan 1997) with β playing the role of a set of latent variables. Treating (y, β) as the set of 'complete data,' the E-step of their EM algorithm involves

Q(ξ^new | ξ) ≡ E_{β|y;ξ}{log p(y, β; ξ^new)},

where p(y, β; ξ) is interpreted as the variational lower bound on the 'complete data likelihood.' This results in the explicit expression

Q(ξ^new | ξ) = tr[Xᵀ diag{A(ξ^new)}X{Λ(ξ) + μ(ξ)μ(ξ)ᵀ}] + 1_nᵀC(ξ^new) + terms not involving ξ^new.

Differentiating with respect to ξ^new and using the fact that A(ξ) is monotonically increasing over ξ > 0, the M-step can be shown to have the exact solution

(ξ^new)² = diagonal[X{Λ(ξ) + μ(ξ)μ(ξ)ᵀ}Xᵀ].   (25)

Taking positive square roots on both sides of (25) leads to Algorithm 6.

Algorithm 6. Iterative scheme for obtaining the optimal model and variational parameters in the Bayesian logistic regression example.

  Initialize: ξ (n × 1; all entries positive).
  Cycle:
    Λ(ξ) ← {Σ_β^{−1} − 2Xᵀ diag{A(ξ)}X}^{−1},
    μ(ξ) ← Λ(ξ){Xᵀ(y − ½1_n) + Σ_β^{−1}μ_β},
    ξ ← √(diagonal[X{Λ(ξ) + μ(ξ)μ(ξ)ᵀ}Xᵀ])
  until the increase in p(y; ξ) is negligible.

Convergence of Algorithm 6 is monotone and usually quite rapid (Jaakkola and Jordan 2000).
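A minimal R sketch of Algorithm 6 on simulated data is given below; the prior settings and variable names are ours, and convergence monitoring of p(y; ξ) is replaced by a fixed number of iterations for brevity.

## Minimal R sketch of Algorithm 6 (tangent transform, Bayesian logistic regression).
set.seed(5)
n <- 200
x1 <- rnorm(n)
X  <- cbind(1, x1); d <- ncol(X)
y  <- rbinom(n, 1, 1 / (1 + exp(-(-0.3 + 1.1 * x1))))
mu.beta <- rep(0, d); Sigma.beta <- diag(100, d); SigInv <- solve(Sigma.beta)

A.fun <- function(xi) -tanh(xi / 2) / (4 * xi)
xi <- rep(1, n)                                    # initialise variational parameters
for (iter in 1:100) {
  Lambda <- solve(SigInv - 2 * crossprod(X, X * A.fun(xi)))   # {Sig^-1 - 2 X' diag(A(xi)) X}^-1
  mu     <- Lambda %*% (crossprod(X, y - 0.5) + SigInv %*% mu.beta)
  M      <- Lambda + mu %*% t(mu)
  xi     <- sqrt(rowSums((X %*% M) * X))            # diagonal of X (Lambda + mu mu') X'
}
## Approximate posterior: beta | y ~ N(mu, Lambda)
cbind(estimate = as.vector(mu), sd = sqrt(diag(Lambda)))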


4. FREQUENTIST INFERENCE

Up until now, we have only dealt with approximate inference in Bayesian models via variational methods. In this section we point out that variational approximations can be used in frequentist contexts. However, use of variational approximations for frequentist inferential problems is much rarer. Frequentist models that stand to benefit from variational approximations are those for which specification of the likelihood involves conditioning on a vector of latent variables u. In this case, the log-likelihood of the model parameter vector θ takes the form

ℓ(θ) ≡ log p(y; θ) = log ∫ p(y|u; θ) p(u; θ) du.   (26)

The maximum likelihood estimate of θ is exactly

θ̂ = argmax_θ ℓ(θ)

but, because of the integral in (26), ℓ(θ) may not be available in closed form. Depending on the forms of p(y|u; θ) and p(u; θ), either the density transform or tangent transform approaches can result in more tractable approximations to ℓ(θ). For the remainder of this section we restrict discussion to the density transform approach. The tangent transform approach has a similar treatment.

Let q(u) be an arbitrary density function in u. Repeating the steps given at (2), but with the log marginal likelihood log p(y) replaced by the log-likelihood ℓ(θ), we obtain

ℓ(θ) = ∫ q(u) log{p(y, u; θ)/q(u)} du + ∫ q(u) log{q(u)/p(u|y; θ)} du ≥ ℓ(θ; q),

where

ℓ(θ; q) ≡ ∫ q(u) log{p(y, u; θ)/q(u)} du.   (27)

We now have the option of choosing q to make ℓ(θ; q) more tractable while also aiming to minimize the Kullback–Leibler divergence between q and p(u|y; θ). In theory, the product density methodology of Section 2.2 could be used to guide the choice of q. However, we have yet to find a nontrivial frequentist example where an explicit solution arises. Suppose, instead, that we restrict q to a parametric family of densities {q(u; ξ): ξ ∈ Ξ}. Then the log-likelihood lower bound (27) becomes

ℓ(θ, ξ; q) = ∫ q(u; ξ) log{p(y, u; θ)/q(u; ξ)} du.   (28)

We should maximize over the variational parameters ξ to minimize the Kullback–Leibler divergence between q(u; ξ) and p(u|y; θ), and over the model parameters θ to maximize the approximate log-likelihood. This leads to the new maximization problem:

(θ̲, ξ̲) = argmax_{θ,ξ} ℓ(θ, ξ; q).

Then θ̲ is a variational approximation to the maximum likelihood estimator θ̂. Standard error estimates can be obtained by plugging in θ̲ for θ and ξ̲ for ξ in the variational approximate Fisher information matrix, the matrix that arises from replacement of ℓ(θ) by ℓ(θ, ξ; q) in the definition of Fisher information. However, to our knowledge, asymptotic normality theory that justifies such standard error estimation has not yet been done.

4.1 Poisson Mixed Model

Consider the (non-Bayesian) Poisson mixed model

Y_{ij} | U_i ∼ind. Poisson{exp(β_0 + β_1 x_{ij} + U_i)},
U_i ∼ind. N(0, σ²), 1 ≤ j ≤ n_i, 1 ≤ i ≤ m,   (29)

where y_{ij} is the jth response measurement for unit i, and the deterministic predictors x_{ij} are defined similarly. The log-likelihood of (β_0, β_1, σ²) involves intractable integrals, but the lower bound (28) takes the form

ℓ(β_0, β_1, σ²; q) = ∫_{R^m} [ ∑_{i=1}^m ∑_{j=1}^{n_i} {y_{ij}(β_0 + β_1 x_{ij} + u_i) − e^{β_0 + β_1 x_{ij} + u_i} − log(y_{ij}!)}
                       − ∑_{i=1}^m u_i²/(2σ²) − (m/2) log(2πσ²) − log q(u_1, ..., u_m) ] q(u_1, ..., u_m) du_1 ··· du_m.

Setting q to be the product of m univariate Normal densities with mean μ_i and variance λ_i > 0, 1 ≤ i ≤ m, leads to the closed form lower bound:

ℓ(β_0, β_1, σ², μ, λ; q) = ∑_{i=1}^m ∑_{j=1}^{n_i} { y_{ij}(β_0 + β_1 x_{ij} + μ_i) − e^{β_0 + β_1 x_{ij} + μ_i + (1/2)λ_i} − log(y_{ij}!) }
                           + (m/2){1 − log(σ²)} + ∑_{i=1}^m { ½ log(λ_i) − (μ_i² + λ_i)/(2σ²) }

for all values of the variational parameters μ = (μ_1, ..., μ_m) and λ = (λ_1, ..., λ_m). Maximizing over these parameters narrows the gap between ℓ(β_0, β_1, σ², μ, λ; q) and ℓ(β_0, β_1, σ²) and so sensible estimators of the model parameters are

(β̲_0, β̲_1, σ̲²) = (β_0, β_1, σ²) component of argmax_{β_0,β_1,σ²,μ,λ} ℓ(β_0, β_1, σ², μ, λ; q).

Recently, Hall, Ormerod, and Wand (2010) established consistency and rates of convergence results for β̲_0, β̲_1, and σ̲².
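To illustrate, the following R sketch maximizes the closed form lower bound above jointly over (β_0, β_1, σ²) and the variational parameters (μ, λ) using optim() on simulated data; the log parameterization of σ² and λ_i (to keep them positive), the data generation, and the variable names are our own choices, and the article itself does not prescribe a particular optimizer.

## Sketch: Gaussian variational approximation for the Poisson mixed model (29).
set.seed(6)
m <- 30; ni <- 5
id <- rep(1:m, each = ni)
x  <- rnorm(m * ni)
u  <- rnorm(m, sd = 0.8)
y  <- rpois(m * ni, exp(0.3 + 0.6 * x + u[id]))

lower.bound <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigsq <- exp(par[3])
  mu    <- par[3 + (1:m)]; lambda <- exp(par[3 + m + (1:m)])
  eta   <- beta0 + beta1 * x + mu[id]
  sum(y * eta - exp(eta + 0.5 * lambda[id]) - lgamma(y + 1)) +
    (m / 2) * (1 - log(sigsq)) +
    sum(0.5 * log(lambda) - (mu^2 + lambda) / (2 * sigsq))
}
start <- c(0, 0, 0, rep(0, m), rep(0, m))
fit <- optim(start, lower.bound, method = "BFGS",
             control = list(fnscale = -1, maxit = 1000))
c(beta0 = fit$par[1], beta1 = fit$par[2], sigsq = exp(fit$par[3]))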
5. CLOSING DISCUSSION

Our goal in this article is to explain variational approximations in a digestible form for a statistical audience. As mentioned in the Introduction, the important issue of accuracy of variational approximations is not dealt with here. The expositions by Jordan (2004) and Titterington (2004) provide access to some of the literature on variational approximation accuracy.

Variational approximations have the potential to become an important player in statistical inference. New variational approximation methods are continually being developed. The recent emergence of formal software for variational inference is certain to accelerate its widespread use. The usefulness of variational approximations increases as the size of the problem increases and Monte Carlo methods such as MCMC start to become untenable.

[Received March 2009. Revised April 2010.]

REFERENCES

Albert, J. H., and Chib, S. (1993), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669–679.

Archambeau, C., Cornford, D., Opper, M., and Shawe-Taylor, J. (2007), "Gaussian Process Approximations of Stochastic Differential Equations," Journal of Machine Learning Research: Workshop and Conference Proceedings, 1, 1–16.

Barber, D., and Bishop, C. M. (1998), "Ensemble Learning for Multi-Layer Networks," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, K. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press, pp. 395–401.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, New York: Springer.

Boyd, S., and Vandenberghe, L. (2004), Convex Optimization, Cambridge: Cambridge University Press.

Casella, G., and George, E. I. (1992), "Explaining the Gibbs Sampler," The American Statistician, 46, 167–174.

Consonni, G., and Marin, J.-M. (2007), "Mean-Field Variational Approximate Bayesian Inference for Latent Variable Models," Computational Statistics and Data Analysis, 52, 790–798.

Girolami, M., and Rogers, S. (2006), "Variational Bayesian Multinomial Probit Regression," Neural Computation, 18, 1790–1817.

Hall, P., Humphreys, K., and Titterington, D. M. (2002), "On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models With Missing Values," Journal of the Royal Statistical Society, Ser. B, 64, 549–564.

Hall, P., Ormerod, J. T., and Wand, M. P. (2010), "Theory of Gaussian Variational Approximation for a Poisson Linear Mixed Model," Statistica Sinica, to appear.

Honkela, A., and Valpola, H. (2005), "Unsupervised Variational Bayesian Learning of Nonlinear Models," in Advances in Neural Information Processing Systems 17, eds. L. K. Saul, Y. Weiss, and L. Bottou, Cambridge, MA: MIT Press, pp. 593–600.

Jaakkola, T. S., and Jordan, M. I. (2000), "Bayesian Parameter Estimation via Variational Methods," Statistics and Computing, 10, 25–37.

Jordan, M. I. (2004), "Graphical Models," Statistical Science, 19, 140–155.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999), "An Introduction to Variational Methods for Graphical Models," Machine Learning, 37, 183–233.

Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors and Model Uncertainty," Journal of the American Statistical Association, 90, 773–795.

Kullback, S., and Leibler, R. A. (1951), "On Information and Sufficiency," The Annals of Mathematical Statistics, 22, 79–86.

McCulloch, C. E., Searle, S. R., and Neuhaus, J. M. (2008), Generalized, Linear, and Mixed Models (2nd ed.), New York: Wiley.

McGrory, C. A., and Titterington, D. M. (2007), "Variational Approximations in Bayesian Model Selection for Finite Mixture Distributions," Computational Statistics and Data Analysis, 51, 5352–5367.

McGrory, C. A., Titterington, D. M., Reeves, R., and Pettitt, A. N. (2009), "Variational Bayes for Estimating the Parameters of a Hidden Potts Model," Statistics and Computing, 19, 329–340.

McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley-Interscience.

Minka, T., Winn, J., Guiver, G., and Kannan, A. (2009), Infer.NET 2.3, Cambridge, U.K.: Microsoft Research Cambridge.

Ormerod, J. T. (2008), "On Semiparametric Regression and Data Mining," Ph.D. thesis, School of Mathematics and Statistics, The University of New South Wales, Sydney, Australia.

Parisi, G. (1988), Statistical Field Theory, Redwood City, CA: Addison-Wesley.

Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.

Pinheiro, J. C., and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, New York: Springer.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and the R Core Team (2009), "nlme: Linear and Nonlinear Mixed Effects Models," R package version 3.1-93.

R Development Core Team (2010), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. Available at http://www.R-project.org.

Rockafellar, R. (1972), Convex Analysis, Princeton: Princeton University Press.

Seeger, M. (2000), "Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers," in Advances in Neural Information Processing Systems 12, eds. S. A. Solla, T. K. Leen, and K.-R. Müller, Cambridge, MA: MIT Press, pp. 603–609.

Seeger, M. (2004), "Gaussian Processes for Machine Learning," International Journal of Neural Systems, 14, 69–106.

Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., and Caldas, C. (2005), "A Variational Bayesian Mixture Modelling Framework for Cluster Analysis of Gene-Expression Data," Bioinformatics, 21, 3025–3033.

Titterington, D. M. (2004), "Bayesian Methods for Neural Networks and Related Models," Statistical Science, 19, 128–139.

Venables, W. N., and Ripley, B. D. (2009), "MASS: Functions and Datasets to Support Venables and Ripley, 'Modern Applied Statistics With S' (4th ed.)," R package version 7.2-48.

Wang, B., and Titterington, D. M. (2006), "Convergence Properties of a General Algorithm for Calculating Variational Bayesian Estimates for a Normal Mixture Model," Bayesian Analysis, 1, 625–650.

Winn, J., and Bishop, C. M. (2005), "Variational Message Passing," Journal of Machine Learning Research, 6, 661–694.