
Bayesian Analysis (2004) 1, Number 1, pp. 1–44
Variational Bayesian Learning of Directed
Graphical Models with Hidden Variables
Matthew J. Beal
Computer Science & Engineering
SUNY at Buffalo
New York 14260-2000, USA
[email protected]
Zoubin Ghahramani
Department of Engineering
University of Cambridge
Cambridge CB2 1PZ, UK
[email protected]
Abstract. A key problem in statistics and machine learning is inferring suitable
structure of a model given some observed data. A Bayesian approach to model
comparison makes use of the marginal likelihood of each candidate model to form
a posterior distribution over models; unfortunately for most models of interest,
notably those containing hidden or latent variables, the marginal likelihood is
intractable to compute.
We present the variational Bayesian (VB) algorithm for directed graphical mod-
els, which optimises a lower bound approximation to the marginal likelihood in a
procedure similar to the standard EM algorithm. We show that for a large class
of models, which we call conjugate exponential, the VB algorithm is a straightfor-
ward generalisation of the EM algorithm that incorporates uncertainty over model
parameters. In a thorough case study using a small class of bipartite DAGs con-
taining hidden variables, we compare the accuracy of the VB approximation to
existing asymptotic-data approximations such as the Bayesian Information Crite-
rion (BIC) and the Cheeseman-Stutz (CS) criterion, and also to a sampling based
gold standard, Annealed Importance Sampling (AIS). We find that the VB algo-
rithm is empirically superior to CS and BIC, and much faster than AIS. Moreover,
we prove that a VB approximation can always be constructed in such a way that
guarantees it to be more accurate than the CS approximation.
Keywords: Approximate Bayesian Inference, Bayes Factors, Directed Acyclic
Graphs, EM Algorithm, Graphical Models, Markov Chain Monte Carlo, Model
Selection, Variational Bayes
1 Introduction
Graphical models are becoming increasingly popular as tools for expressing probabilistic
models found in various machine learning and applied statistics settings. One of the
key problems is learning suitable structure of such graphical models from a data set,
y. This task corresponds to considering different model complexities — too complex a model will overfit the data and too simple a model underfits, with neither extreme generalising well to new data. Discovering a suitable structure entails determining
which conditional dependency relationships amongst model variables are supported by
the data. A Bayesian approach to model selection, comparison, or averaging relies on an
important and difficult quantity: the marginal likelihood p(y | m) under each candidate model, m. The marginal likelihood is an important quantity because, when combined with a prior distribution over candidate models, it can be used to form the posterior distribution over models given observed data, p(m | y) ∝ p(m) p(y | m). The marginal likelihood is a difficult quantity because it involves integrating out parameters and,
for models containing hidden (or latent or missing) variables (thus encompassing many
models of interest to statisticians and machine learning practitioners alike), it can be
intractable to compute.
The marginal likelihood is tractable to compute for certain simple types of graphs;
one such type is the case of fully observed discrete-variable directed acyclic graphs with
Dirichlet priors on the parameters (Heckerman et al. 1995; Heckerman 1996). Unfor-
tunately, if these graphical models include hidden variables, the marginal likelihood
becomes intractable to compute even for moderately sized observed data sets. Estimat-
ing the marginal likelihood presents a difficult challenge for approximate methods such
as asymptotic-data criteria and sampling techniques.
In this article we investigate a novel application of the variational Bayesian (VB)
framework — first described in Attias (1999b) — to approximating the marginal likelihood
of discrete-variable directed acyclic graph (DAG) structures that contain hidden vari-
ables. Variational Bayesian methods approximate the quantity of interest with a strict
lower bound, and the framework readily provides algorithms to optimise the approx-
imation. We describe the variational Bayesian methodology applied to a large class
of graphical models which we call conjugate-exponential, derive the VB approximation
as applied to discrete DAGs with hidden variables, and show that the resulting algo-
rithm that optimises the approximation closely resembles the standard Expectation-
Maximisation (EM) algorithm of Dempster et al. (1977). It will be seen for conjugate-
exponential models that the VB methodology is an elegant Bayesian generalisation of
the EM framework, replacing point estimates of parameters encountered in EM learning
with distributions over parameters, thus naturally reflecting the uncertainty over the
settings of their values given the data. Previous work has applied the VB methodology
to particular instances of conjugate-exponential models, for example MacKay (1997),
and Ghahramani and Beal (2000, 2001); Beal (2003) describes in more detail the theo-
retical results for VB in conjugate-exponential models.
We also briefly outline and compute the Bayesian Information Criterion (BIC) and
Cheeseman-Stutz (CS) approximations to the marginal likelihood for DAGs (Schwarz
1978; Cheeseman and Stutz 1996), and compare these to VB in a particular model selec-
tion task. The particular task we have chosen is that of nding which of several possible
structures for a simple graphical model (containing hidden and observed variables) has
given rise to a set of observed data. The success of each approximation is measured by
how it ranks the true model that generated the data amongst the alternatives, and also
by the accuracy of the marginal likelihood estimate.
As a gold standard, against which we can compare these approximations, we consider
sampling estimates of the marginal likelihood using the Annealed Importance Sampling
(AIS) method of Neal (2001). We consider AIS to be a gold standard in the sense
that we believe it is one of the best methods to date for obtaining reliable estimates of
the marginal likelihoods of the type of models explored here, given sufficient sampling computation. To the best of our knowledge, the AIS analysis we present constitutes the first serious case study of the tightness of variational Bayesian bounds. An analysis
of the limitations of AIS is also provided. The aim of the comparison is to establish
the reliability of the VB approximation as an estimate of the marginal likelihood in the
general incomplete-data setting, so that it can be used in larger problems, for example
embedded in a (greedy) structure search amongst a much larger class of models.
The remainder of this article is arranged as follows. Section 2 begins by examining
the model selection question for discrete directed acyclic graphs, and shows how exact
marginal likelihood calculation becomes computationally intractable when the graph
contains hidden variables. In Section 3 we briefly cover the EM algorithm for maximum
likelihood (ML) and maximum a posteriori (MAP) parameter estimation in DAGs with
hidden variables, and derive and discuss the BIC and CS asymptotic approximations.
We then introduce the necessary methodology for variational Bayesian learning, and
present the VBEM algorithm for variational Bayesian lower bound optimisation of the
marginal likelihood in the case of discrete DAGs; we show that this is a straightforward
generalisation of the MAP EM algorithm. In Section 3.6 we describe an Annealed
Importance Sampling method for estimating marginal likelihoods of discrete DAGs. In
Section 4 we evaluate the performance of these dierent approximation methods on
the simple (yet non-trivial) model selection task of determining which of all possible
structures within a class generated a data set. Section 5 provides an analysis and
discussion of the limitations of the AIS implementation and suggests possible extensions
to it. In Section 6 we consider the CS approximation, which is one of the state-of-the-
art approximations, and extend a result due to Minka (2001) that shows that the CS
approximation is a lower bound on the marginal likelihood in the case of mixture models,
by showing how the CS approximation can be constructed for any model containing
hidden variables. We complete this section by proving that there exists a VB bound
that is guaranteed to be at least as tight or tighter than the CS bound, independent of
the model structure and type. Finally, we conclude in Section 7 and suggest directions
for future research.
2 Calculating the marginal likelihood of DAGs
We focus on discrete-valued Directed Acyclic Graphs, although all the methodology
described in the following sections is readily extended to models involving real-valued
variables. Consider a data set of size n, consisting of independent and identically distributed (i.i.d.) observed variables y = {y_1, . . . , y_i, . . . , y_n}, where each y_i is a vector of discrete-valued variables. We model this observed data y by assuming that it is generated by a discrete directed acyclic graph consisting of hidden variables, s, and observed variables, y. Combining hidden and observed variables we have z = {z_1, . . . , z_n} = {(s_1, y_1), . . . , (s_n, y_n)}. The elements of each vector z_i for i = 1, . . . , n are indexed from j = 1, . . . , |z_i|, where |z_i| is the number of variables in the data vector z_i. We define two sets of indices H and V such that those j ∈ H are the hidden variables and those j ∈ V are observed variables, i.e. s_i = {z_ij : j ∈ H} and y_i = {z_ij : j ∈ V}.
Note that z_i = {s_i, y_i} contains both hidden and observed variables — we refer to this as the complete-data for data point i. The incomplete-data, y_i, is that which constitutes the observed data. Note that the meaning of |·| will vary depending on the type of its argument, for example: |z| = |s| = |y| is the number of data points, n; |s_i| is the number of hidden variables (for the ith data point); |s_ij| is the cardinality (or the number of possible settings) of the jth hidden variable (for the ith data point).
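For concreteness, the following minimal sketch (our own illustration, not code from the paper) shows one way to hold these quantities in NumPy, with the hidden entries of the complete-data array left unfilled; all names and values here are hypothetical.

# Minimal sketch (not from the paper): one way to encode the quantities above
# for a discrete DAG, assuming integer-coded states and NumPy.
import numpy as np

n = 3                      # number of data points
H = [0, 1]                 # indices j of hidden variables (the set H)
V = [2, 3, 4, 5]           # indices j of observed variables (the set V)
card = [2, 2, 5, 5, 5, 5]  # |z_ij|: cardinality of each variable j
parents = {2: [0], 3: [0, 1], 4: [0, 1], 5: [1]}   # pa(j) for each observed j

# z holds the complete data; hidden entries are unknown, marked with -1.
z = -np.ones((n, len(card)), dtype=int)
y = np.array([[4, 0, 1, 3],
              [2, 2, 0, 1],
              [0, 4, 4, 2]])          # observed part, shape (n, |V|)
z[:, V] = y                           # s_i = {z_ij : j in H} stays unknown
print(z)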
In a DAG, the complete-data likelihood factorises into a product of local probabilities on each variable

\[
p(z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \,, \tag{1}
\]

where pa(j) denotes the vector of indices of the parents of the jth variable. Each variable in the graph is multinomial, and the parameters for the graph are the collection of vectors of probabilities on each variable given each configuration of its parents. For example, the parameter for a binary variable which has two ternary parents is a matrix of size (3² × 2) with each row summing to one. For a variable j without any parents (pa(j) = ∅), the parameter is simply a vector of its prior probabilities. Using θ_jlk to denote the probability that variable j takes on value k when its parents are in configuration l, the complete-data likelihood can be written out as a product of terms of the form

\[
p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) = \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \prod_{k=1}^{|z_{ij}|} \theta_{jlk}^{\,\delta(z_{ij},k)\,\delta(z_{i\,\mathrm{pa}(j)},l)} \tag{2}
\]

with ∑_k θ_jlk = 1 for all j, l. Here we use |z_{i pa(j)}| to denote the number of joint settings of the parents of variable j. We use Kronecker-δ notation: δ(·, ·) is 1 if its arguments are identical and zero otherwise. The parameters are given independent Dirichlet priors, which are conjugate to the complete-data likelihood above (thereby satisfying Condition 1 for conjugate-exponential models (42), which is required later). The prior is factorised over variables and parent configurations; these choices then satisfy the global and local independence assumptions of Heckerman et al. (1995). For each parameter θ_jl = (θ_{jl1}, . . . , θ_{jl|z_ij|}), the Dirichlet prior is

\[
p(\theta_{jl} \mid \lambda_{jl}, m) = \frac{\Gamma(\lambda^0_{jl})}{\prod_k \Gamma(\lambda_{jlk})} \prod_k \theta_{jlk}^{\lambda_{jlk} - 1} \,, \tag{3}
\]

where λ are hyperparameters, λ_jl = (λ_{jl1}, . . . , λ_{jl|z_ij|}), λ_jlk > 0 for all k, λ⁰_jl = ∑_k λ_jlk, Γ(·) is the gamma function, and the domain of θ is confined to the simplex of probabilities that sum to 1. This form of prior is assumed throughout this article. Since we do not focus on inferring the hyperparameters we use the shorthand p(θ | m) to denote the prior from here on. In the discrete-variable case we are considering, the complete-data
marginal likelihood is tractable to compute:

\[
p(z \mid m) = \int \! d\theta \; p(\theta \mid m)\, p(z \mid \theta) = \int \! d\theta \; p(\theta \mid m) \prod_{i=1}^{n} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \tag{4}
\]
\[
= \prod_{j=1}^{|z_i|} \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \frac{\Gamma(\lambda^0_{jl})}{\Gamma(\lambda^0_{jl} + N_{jl})} \prod_{k=1}^{|z_{ij}|} \frac{\Gamma(\lambda_{jlk} + N_{jlk})}{\Gamma(\lambda_{jlk})} \,, \tag{5}
\]

where N_jlk is defined as the count in the data for the number of instances of variable j being in configuration k with parental configuration l:

\[
N_{jlk} = \sum_{i=1}^{n} \delta(z_{ij}, k)\, \delta(z_{i\,\mathrm{pa}(j)}, l) \,, \qquad N_{jl} = \sum_{k=1}^{|z_{ij}|} N_{jlk} \,. \tag{6}
\]

Note that if the data set is complete — that is to say there are no hidden variables — then s = ∅ and so z = y, and the quantities N_jlk can be computed directly from the data.
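Because (5) is just a product of Dirichlet integrals, it is easy to evaluate numerically once the counts are tabulated. The helper below is our own sketch (not the authors' code), assuming the counts N_jlk and hyperparameters λ_jlk for each variable are stored as arrays indexed by parent configuration l and state k.

# Sketch (our own helper): evaluate the log of the complete-data marginal
# likelihood (5) from the counts N_jlk defined in (6).
import numpy as np
from scipy.special import gammaln

def log_complete_marginal_likelihood(N, lam):
    """N, lam: dicts mapping variable j -> array of shape (L_j, K_j),
    where L_j is the number of parent configurations and K_j = |z_ij|."""
    logp = 0.0
    for j in N:
        Njl = N[j].sum(axis=1)                 # N_jl, summed over k
        lam0 = lam[j].sum(axis=1)              # lambda^0_jl
        logp += np.sum(gammaln(lam0) - gammaln(lam0 + Njl))
        logp += np.sum(gammaln(lam[j] + N[j]) - gammaln(lam[j]))
    return logp

# Toy usage: one binary variable with a single (empty) parent configuration.
N = {0: np.array([[3., 7.]])}
lam = {0: np.array([[1., 1.]])}
print(log_complete_marginal_likelihood(N, lam))   # = ln[ B(4,8) / B(1,1) ]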
The incomplete-data likelihood results from summing over all settings of the hidden variables and taking the product over i.i.d. presentations of the data:

\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \sum_{\{z_{ij}\}_{j \in \mathcal{H}}} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \,. \tag{7}
\]

Now the incomplete-data marginal likelihood for n cases follows from marginalising out the parameters of the model with respect to their prior distribution:

\[
p(y \mid m) = \int \! d\theta \; p(\theta \mid m) \prod_{i=1}^{n} \sum_{\{z_{ij}\}_{j \in \mathcal{H}}} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \tag{8}
\]
\[
= \sum_{\{\{z_{ij}\}_{j \in \mathcal{H}}\}_{i=1}^{n}} \int \! d\theta \; p(\theta \mid m) \prod_{i=1}^{n} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \,. \tag{9}
\]

The expression (8) is computationally intractable due to the expectation (integral) over the real-valued conditional probabilities θ, which couples the hidden variables across i.i.d. data instances. Put another way, in (9), pulling the summation to the left of the product over the n instances results in a summation with a number of summands exponential in the number of data n. In the worst case, (8) or (9) can be evaluated as the sum of (∏_{j∈H} |z_ij|)^n Dirichlet integrals. To take an example, a model with just |s_i| = 2 hidden variables and n = 100 data points requires the evaluation of 2^100 Dirichlet integrals. This means that a linear increase in the amount of observed data results in an exponential increase in the cost of inference.
Our goal is to learn the conditional independence structure of the model — that is, which variables are parents of each variable. Ideally, we should compare structures based on their posterior probabilities, and to compute this posterior we need to first compute the marginal likelihood (8).

The next section examines several methods that attempt to approximate the marginal likelihood (8). We focus on a variational Bayesian algorithm, which we compare to asymptotic criteria and also to sampling-based estimates. For the moment we assume that the cardinalities of the variables — in particular the hidden variables — are fixed beforehand; our wish is to discover how many hidden variables there are and what their connectivity is to other variables in the graph. The related problem of determining the cardinality of the variables from data can also be addressed in the variational Bayesian framework, as for example has been recently demonstrated for Hidden Markov Models (Beal 2003).
3 Estimating the marginal likelihood
In this section we look at some approximations to the marginal likelihood for a model m, which we refer to henceforth as the scores for m. In Section 3.1 we first review ML and MAP parameter learning and briefly present the EM algorithm for a general discrete-variable directed graphical model with hidden variables. Using the final parameters obtained from an EM optimisation, we can then construct various asymptotic approximations to the marginal likelihood, and so derive the BIC and Cheeseman-Stutz criteria, described in Sections 3.2 and 3.3, respectively. An alternative approach is provided by the variational Bayesian framework, which we review in some detail in Section 3.4. In the case of discrete directed acyclic graphs with Dirichlet priors, the model is conjugate-exponential (defined below), and the VB framework produces a very simple VBEM algorithm. This algorithm is a generalisation of the EM algorithm, and as such can be cast in a way that resembles a direct extension of the EM algorithm for MAP parameter learning; the algorithm for VB learning for these models is presented in Section 3.5. In Section 3.6 we derive an annealed importance sampling method (AIS) for this class of graphical model, which is considered to be the current state-of-the-art technique for estimating the marginal likelihood of these models using sampling. Armed with these various approximations we pit them against each other in a model selection task, described in Section 4.
3.1 ML and MAP parameter estimation for DAGs
We begin by deriving the EM algorithm for ML/MAP estimation via a lower bound interpretation (see Neal and Hinton 1998). We start with the incomplete-data log likelihood, and lower bound it by a functional F(q_s(s), θ) by appealing to Jensen's inequality as follows:

\[
\ln p(y \mid \theta) = \ln \prod_{i=1}^{n} \sum_{\{z_{ij}\}_{j \in \mathcal{H}}} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \tag{10}
\]
\[
= \sum_{i=1}^{n} \ln \sum_{s_i} q_{s_i}(s_i) \, \frac{\prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta)}{q_{s_i}(s_i)} \tag{11}
\]
\[
\geq \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \ln \frac{\prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta)}{q_{s_i}(s_i)} \tag{12}
\]
\[
= \mathcal{F}(\{q_{s_i}(s_i)\}_{i=1}^{n}, \theta) \,. \tag{13}
\]

The first line is simply the logarithm of equation (7); in the second line we have used the shorthand s_i = {z_ij}_{j∈H} to denote all hidden variables corresponding to the ith data point, and have multiplied and divided the inner summand (over s_i) by a variational distribution q_{s_i}(s_i) — one for each data point y_i. The inequality that follows results from the concavity of the logarithm function and yields an expression that is a strict lower bound on the incomplete-data log likelihood, denoted F({q_{s_i}(s_i)}_{i=1}^n, θ). This expression depends on the parameters of the model, θ, and is a functional of the variational distributions {q_{s_i}}_{i=1}^n.
Since F({q_{s_i}(s_i)}_{i=1}^n, θ) is a lower bound on a quantity we wish to maximise, we maximise the bound by taking functional derivatives with respect to each q_{s_i}(s_i) while keeping the remaining {q_{s_{i'}}(s_{i'})}_{i'≠i} fixed, and set these to zero, yielding

\[
q_{s_i}(s_i) = p(s_i \mid y_i, \theta) \quad \forall\, i \,. \tag{14}
\]

Thus the optimal setting of each variational distribution is in fact the exact posterior distribution for the hidden variable for that data point. This is the E step of the celebrated EM algorithm; with these settings of the distributions {q_{s_i}(s_i)}_{i=1}^n, it can easily be shown that the bound is tight — that is to say, the difference between ln p(y | θ) and F({q_{s_i}(s_i)}_{i=1}^n, θ) is exactly zero.
The M step of the EM algorithm is obtained by taking derivatives of the bound with respect to the parameters θ, while holding fixed the distributions q_{s_i}(s_i) ∀ i. Each θ_jl is constrained to sum to one, and so we enforce this with Lagrange multipliers c_jl,

\[
\frac{\partial}{\partial \theta_{jlk}} \mathcal{F}(q_s(s), \theta) = \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \, \frac{\partial}{\partial \theta_{jlk}} \ln p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta_j) + c_{jl} \tag{15}
\]
\[
= \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \, \delta(z_{ij}, k)\, \delta(z_{i\,\mathrm{pa}(j)}, l) \, \frac{\partial}{\partial \theta_{jlk}} \ln \theta_{jlk} + c_{jl} = 0 \,, \tag{16}
\]

which upon rearrangement gives

\[
\theta_{jlk} \propto \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \, \delta(z_{ij}, k)\, \delta(z_{i\,\mathrm{pa}(j)}, l) \,. \tag{17}
\]
Due to the normalisation constraint on θ_jl the M step can be written

\[
\text{M step (ML):} \qquad \theta_{jlk} = \frac{N_{jlk}}{\sum_{k'=1}^{|z_{ij}|} N_{jlk'}} \,, \tag{18}
\]

where the N_jlk are defined as

\[
N_{jlk} = \sum_{i=1}^{n} \left\langle \delta(z_{ij}, k)\, \delta(z_{i\,\mathrm{pa}(j)}, l) \right\rangle_{q_{s_i}(s_i)} \,, \tag{19}
\]

where angled-brackets ⟨·⟩_{q_{s_i}(s_i)} are used to denote expectation with respect to the hidden variable posterior q_{s_i}(s_i) found in the preceding E step. The N_jlk are interpreted as the expected number of counts for observing settings of children and parent configurations over observed and hidden variables. In the cases where both j and pa(j) are observed variables, N_jlk reduces to the simple empirical count, as in (6). Otherwise, if j or its parents are hidden, then expectations need be taken over the posterior q_{s_i}(s_i) obtained in the E step.

If we require the MAP EM algorithm, we instead lower bound ln p(θ)p(y | θ). The E step remains the same, but the M step uses augmented counts from the prior of the form in (3) to give the following update:

\[
\text{M step (MAP):} \qquad \theta_{jlk} = \frac{\lambda_{jlk} - 1 + N_{jlk}}{\sum_{k'=1}^{|z_{ij}|} \left( \lambda_{jlk'} - 1 + N_{jlk'} \right)} \,. \tag{20}
\]

Repeated applications of the E step (14) and the M step (18, 20) are guaranteed to increase the log likelihood (with equation (18)) or the log posterior (with equation (20)) of the parameters at every iteration, and converge to a local maximum. We note that MAP estimation is inherently basis-dependent: for any particular θ′ having non-zero prior probability, it is possible to find a (one-to-one) reparameterisation φ(θ) such that the MAP estimate for φ is at φ(θ′). This is an obvious drawback of MAP parameter estimation. Moreover, the use of (20) can produce erroneous results in the case of λ_jlk < 1, in the form of negative probabilities. Conventionally, researchers have limited themselves to Dirichlet priors in which every λ_jlk ≥ 1, although in MacKay (1998) it is shown how a reparameterisation of θ into the softmax basis results in MAP updates which do not suffer from this problem (which look identical to (20), but without the −1 in numerator and denominator). Note that our EM algorithms were indeed carried out in the softmax basis, which avoids such effects and problems with parameters lying near their domain boundaries.
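To make the updates concrete, the following sketch (our own simplified code, not the authors' implementation) runs MAP EM — the E step (14) by brute-force enumeration of the joint hidden configurations, and the M step (20) from the expected counts (19) — for bipartite models of the kind used later in the experiments, where hidden variables have no parents. All function and variable names are ours; with λ = 1 the update reduces to the ML M step (18).

# Sketch of MAP EM for a bipartite discrete DAG (hidden parents only), with
# the E step done by enumerating the joint hidden configurations.
import itertools
import numpy as np

def map_em(y, h_card, parents, lam=1.0, n_iter=50, seed=0):
    """y: (n, p) int array of observed data.
    h_card: cardinalities of the hidden variables.
    parents: list (length p) of hidden-parent indices for each observed y_j.
    lam: shared Dirichlet hyperparameter (scalar, for simplicity)."""
    rng = np.random.default_rng(seed)
    n, p = y.shape
    y_card = [int(y[:, j].max()) + 1 for j in range(p)]
    theta_h = [rng.dirichlet(np.ones(c)) for c in h_card]          # hidden priors
    L = [int(np.prod([h_card[h] for h in parents[j]])) for j in range(p)]
    theta_y = [rng.dirichlet(np.ones(y_card[j]), size=L[j]) for j in range(p)]
    configs = list(itertools.product(*[range(c) for c in h_card]))

    def pa_index(j, s):              # flat index l of the parent configuration
        idx = 0
        for h in parents[j]:
            idx = idx * h_card[h] + s[h]
        return idx

    for _ in range(n_iter):
        # E step (14): exact posterior over each s_i by enumeration
        post = np.ones((n, len(configs)))
        for c, s in enumerate(configs):
            for h in range(len(h_card)):
                post[:, c] *= theta_h[h][s[h]]
            for j in range(p):
                post[:, c] *= theta_y[j][pa_index(j, s), y[:, j]]
        post /= post.sum(axis=1, keepdims=True)
        # M step (20): MAP update from the expected counts (19)
        for h, c_h in enumerate(h_card):
            Nh = np.array([post[:, [c for c, s in enumerate(configs)
                                    if s[h] == k]].sum() for k in range(c_h)])
            theta_h[h] = (lam - 1 + Nh) / (lam - 1 + Nh).sum()
        for j in range(p):
            Nj = np.zeros((L[j], y_card[j]))
            for c, s in enumerate(configs):
                l = pa_index(j, s)
                for k in range(y_card[j]):
                    Nj[l, k] += post[y[:, j] == k, c].sum()
            theta_y[j] = (lam - 1 + Nj) / (lam - 1 + Nj).sum(axis=1, keepdims=True)
    return theta_h, theta_y, post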
3.2 The BIC
The Bayesian Information Criterion approximation (BIC; Schwarz 1978) is the asymp-
totic limit to large data sets of the Laplace approximation (Kass and Raftery 1995;
MacKay 1995). The Laplace approximation makes a local quadratic approximation to
the log posterior around a MAP parameter estimate, θ̂,

\[
\ln p(y \mid m) = \ln \int \! d\theta \; p(\theta \mid m)\, p(y \mid \theta) \;\approx\; \ln p(\hat{\theta} \mid m) + \ln p(y \mid \hat{\theta}) + \tfrac{1}{2} \ln \left| 2\pi H^{-1} \right| \,, \tag{21}
\]

where H(θ̂) is the Hessian of the negative log posterior, H(θ̂) = −(∂²/∂θ ∂θ⊤) ln p(θ | m) p(y | θ), evaluated at θ̂. The BIC approximation is the asymptotic limit of the above expression, retaining only terms that grow with the number of data n; since the Hessian grows linearly with n, the BIC is given by

\[
\ln p(y \mid m)_{\mathrm{BIC}} = \ln p(y \mid \hat{\theta}) - \frac{d(m)}{2} \ln n \,, \tag{22}
\]
where d(m) is the number of parameters in model m. The BIC is interesting because
it does not depend on the prior over parameters, and is attractive because it does
not involve the burdensome computation of the Hessian of the log likelihood and its
determinant. However in general the Laplace approximation, and therefore its BIC
limit, have several shortcomings which are outlined below.
The Gaussian assumption is based on the large data limit, and will represent the
posterior poorly for small data sets for which, in principle, the advantages of Bayesian
integration over ML or MAP are largest. The Gaussian approximation is also poorly
suited to bounded, constrained, or positive parameters, since it assigns non-zero prob-
ability mass outside of the parameter domain. Moreover, the posterior may not be
unimodal for likelihoods with hidden variables, due to problems of identifiability; in these cases the regularity conditions required for convergence do not hold. Even if the exact posterior is unimodal, the resulting approximation may well be a poor representation of the nearby probability mass, as the approximation is made about a locally maximum probability density. In large models the approximation may become unwieldy to compute, taking O(nd²) operations to compute the derivatives in the Hessian, and then a further O(d³) operations to calculate its determinant (d is the number of parameters in the model) — further approximations would become necessary, such as those ignoring off-diagonal elements or assuming a block-diagonal structure for the Hessian, which correspond to neglecting dependencies between parameters.
For BIC, we require the number of free parameters in each structure. In these experiments we use a simple counting argument and apply the following counting scheme. If a variable j has no parents in the DAG, then it contributes (|z_ij| − 1) free parameters, corresponding to the degrees of freedom in its vector of prior probabilities (constrained to lie on the simplex ∑_k p_k = 1). Each variable that has parents contributes (|z_ij| − 1) parameters for each configuration of its parents. Thus in model m the total number of parameters d(m) is given by

\[
d(m) = \sum_{j=1}^{|z_i|} \left( |z_{ij}| - 1 \right) \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \left| z_{i\,\mathrm{pa}(j)_l} \right| \,, \tag{23}
\]

where |z_{i pa(j)_l}| denotes the cardinality (number of settings) of the lth parent of the jth variable. We have used the convention that the product over zero factors has a value of one to account for the case in which the jth variable has no parents — that is to say, ∏_{l=1}^{|z_{i pa(j)}|} |z_{i pa(j)_l}| = 1 if variable j has no parents.
The BIC approximation needs to take into account aliasing in the parameter posterior. In discrete-variable DAGs, parameter aliasing occurs from two symmetries: first, a priori identical hidden variables can be permuted, and second, the labellings of the states of each hidden variable can be permuted. As an example, let us imagine the parents of a single observed variable are 3 hidden variables having cardinalities (3, 3, 4). In this case the number of aliases is 1728 (= 2! · 3! · 3! · 4!). If we assume that the aliases of the posterior distribution are well separated then the score is given by

\[
\ln p(y \mid m)_{\mathrm{BIC}} = \ln p(y \mid \hat{\theta}) - \frac{d(m)}{2} \ln n + \ln S \,, \tag{24}
\]

where S is the number of aliases, and θ̂ is the MAP estimate as described in the previous section. This correction is accurate only if the modes of the posterior distribution are well separated, which should be the case in the large data set size limit for which BIC is useful. However, since BIC is correct only up to an indeterminate missing factor, we might think that this correction is not necessary. In the experiments we examine the BIC score with and without this correction, and also with and without the inclusion of the prior term ln p(θ̂ | m).
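As an illustration, the following sketch (ours, with hypothetical inputs) assembles the corrected BIC score (24) for a structure in the bipartite class, counting parameters as in (23) and taking the alias count S from the two permutation symmetries just described; it assumes all hidden variables are a priori identical, as they are in the experiments below.

# Sketch (ours, not the authors' code): BIC score (24) for a bipartite
# structure, given the MAP log likelihood ln p(y | theta_hat) from EM.
import math

def bic_score(loglik_hat, n, h_card, y_card, parents, alias_correction=True):
    """parents[j] lists the hidden parents of observed variable j."""
    # d(m), equation (23); hidden variables have no parents in this class
    d = sum(c - 1 for c in h_card)
    d += sum((y_card[j] - 1) * math.prod(h_card[h] for h in parents[j])
             for j in range(len(y_card)))
    score = loglik_hat - 0.5 * d * math.log(n)
    if alias_correction:
        # S: permute a-priori-identical hidden variables, relabel their states
        S = math.factorial(len(h_card)) * math.prod(math.factorial(c)
                                                    for c in h_card)
        score += math.log(S)
    return score

# e.g. the structure of Figure 1 (d(m) = 50); loglik_hat and n are made up here
print(bic_score(-1234.5, n=480, h_card=[2, 2], y_card=[5, 5, 5, 5],
                parents=[[0], [0, 1], [0, 1], [1]]))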
3.3 The Cheeseman-Stutz approximation
The Cheeseman-Stutz (CS) approximation makes use of the following identity for the incomplete-data marginal likelihood:

\[
p(y \mid m) = p(\hat{z} \mid m)\, \frac{p(y \mid m)}{p(\hat{z} \mid m)} = p(\hat{z} \mid m)\, \frac{\int \! d\theta \; p(\theta \mid m)\, p(y \mid \theta, m)}{\int \! d\theta' \; p(\theta' \mid m)\, p(\hat{z} \mid \theta', m)} \,, \tag{25}
\]

which is true for any completion ẑ = {ŝ, y} of the data. This form is useful because the complete-data marginal likelihood, p(ẑ | m), is tractable to compute for discrete DAGs with independent Dirichlet priors: it is just a product of Dirichlet integrals, as given in (5). By applying Laplace approximations to the integrals in both the numerator and denominator, about points θ̂ and θ̂′ in parameter space respectively, and then assuming the limit of an infinite amount of data in order to recover BIC-type forms for both integrals, we immediately obtain the following estimate of the marginal (incomplete) likelihood

\[
\ln p(y \mid m) \approx \ln p(y \mid m)_{\mathrm{CS}} \equiv \ln p(\hat{s}, y \mid m) + \ln p(\hat{\theta} \mid m) + \ln p(y \mid \hat{\theta}) - \frac{d}{2} \ln n - \ln p(\hat{\theta}' \mid m) - \ln p(\hat{s}, y \mid \hat{\theta}') + \frac{d'}{2} \ln n \tag{26}
\]
\[
= \ln p(\hat{s}, y \mid m) + \ln p(y \mid \hat{\theta}) - \ln p(\hat{s}, y \mid \hat{\theta}) \,. \tag{27}
\]

The last line follows if we choose θ̂′ to be identical to θ̂ and further assume that the number of parameters in the models for complete and incomplete data are the same, i.e.
d = d′ (Cheeseman and Stutz 1996). In the case of the models examined in this article, we can ensure that the mode of the posterior in the complete setting is at locations θ̂ by completing the hidden data {ŝ_i}_{i=1}^n with their expectations under their posterior distributions p(s_i | y, θ̂), or simply: ŝ_ijk = ⟨δ(s_ij, k)⟩_{q_{s_i}(s_i)}. This procedure will generally result in non-integer counts N̂_jlk on application of (19). Upon parameter re-estimation using equation (20), we note that θ̂′ = θ̂ remains invariant. The most important aspect of the CS approximation is that each term of (27) can be tractably evaluated as follows:
from (5):
\[
p(\hat{s}, y \mid m) = \prod_{j=1}^{|z_i|} \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \frac{\Gamma(\lambda^0_{jl})}{\Gamma(\lambda^0_{jl} + \hat{N}_{jl})} \prod_{k=1}^{|z_{ij}|} \frac{\Gamma(\lambda_{jlk} + \hat{N}_{jlk})}{\Gamma(\lambda_{jlk})} \,; \tag{28}
\]

from (7):
\[
p(y \mid \hat{\theta}) = \prod_{i=1}^{n} \sum_{\{z_{ij}\}_{j \in \mathcal{H}}} \prod_{j=1}^{|z_i|} \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \prod_{k=1}^{|z_{ij}|} \hat{\theta}_{jlk}^{\,\delta(z_{ij},k)\,\delta(z_{i\,\mathrm{pa}(j)},l)} \,; \tag{29}
\]

from (1):
\[
p(\hat{s}, y \mid \hat{\theta}) = \prod_{j=1}^{|z_i|} \prod_{l=1}^{|z_{i\,\mathrm{pa}(j)}|} \prod_{k=1}^{|z_{ij}|} \hat{\theta}_{jlk}^{\,\hat{N}_{jlk}} \,, \tag{30}
\]

where the N̂_jlk are identical to the N_jlk of equation (19) if the completion of the data with ŝ is done with the posterior found in the M step of the MAP EM algorithm used to find θ̂. Equation (29) is simply the likelihood output by the EM algorithm, equation (28) is a function of the counts obtained in the EM algorithm, and equation (30) is a simple computation again.
As with BIC, the Cheeseman-Stutz score also needs to be corrected for aliases in
the parameter posterior, and is subject to the same caveat that these corrections are
only accurate if the aliases in the posterior are well separated. Finally, we note that
CS is in fact a lower bound on the marginal likelihood, and is intricately related to our
proposed method that is described next. In Section 6 we revisit the CS approximation
and derive a key result on the tightness of its bound.
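The following sketch (our own code) shows how cheaply the CS score (27) can be assembled once a MAP EM run has produced the MAP parameters θ̂, the expected counts N̂_jlk of (19), and the incomplete-data log likelihood ln p(y | θ̂); the array layouts and names are our assumptions.

# Sketch (ours): the Cheeseman-Stutz score (27), assembled from quantities
# already available after a MAP EM run -- the expected counts N_hat (19),
# the MAP parameters theta_hat, and the incomplete-data log likelihood (29).
import numpy as np
from scipy.special import gammaln

def cs_score(loglik_hat, N_hat, theta_hat, lam):
    """N_hat, theta_hat, lam: dicts j -> (L_j, K_j) arrays of expected counts,
    MAP parameters and Dirichlet hyperparameters for each variable j."""
    log_p_complete = 0.0   # ln p(s_hat, y | m), equation (28)
    log_p_at_theta = 0.0   # ln p(s_hat, y | theta_hat), equation (30)
    for j in N_hat:
        N0 = N_hat[j].sum(axis=1)
        lam0 = lam[j].sum(axis=1)
        log_p_complete += np.sum(gammaln(lam0) - gammaln(lam0 + N0))
        log_p_complete += np.sum(gammaln(lam[j] + N_hat[j]) - gammaln(lam[j]))
        log_p_at_theta += np.sum(N_hat[j] * np.log(theta_hat[j]))
    return log_p_complete + loglik_hat - log_p_at_theta   # equation (27)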
3.4 Estimating marginal likelihood using Variational Bayes
Here we briefly review a method of lower bounding the marginal likelihood, and the cor-
responding deterministic iterative algorithm for optimising this bound that has come to
be known as variational Bayes (VB). Variational methods have been used in the past
to tackle intractable posterior distributions over hidden variables (Neal 1992; Hinton
and Zemel 1994; Saul and Jordan 1996; Jaakkola 1997; Ghahramani and Jordan 1997;
Ghahramani and Hinton 2000), and more recently have tackled Bayesian learning in
specic models (Hinton and van Camp 1993; Waterhouse et al. 1996; MacKay 1997;
Bishop 1999; Ghahramani and Beal 2000). Inspired by MacKay (1997), Attias (2000)
first described the general form of variational Bayes and showed that it is a generalisa-
tion of the celebrated EM algorithm of Dempster et al. (1977). Ghahramani and Beal
(2001) and Beal (2003) built upon this work, applying it to the large class of conjugate-
exponential models (described below). Just as in the standard E step of EM, we obtain
a posterior distribution over the hidden variables, and we now also treat the parame-
ters of the model as uncertain quantities and infer their posterior distribution as well.
Since the hidden variables and parameters are coupled, computing the exact posterior
distribution over both is intractable and we use the variational methodology to instead
work in the space of simpler distributions — those that are factorised between hidden
variables and parameters.
As before, let y denote the observed variables, x denote the hidden variables, and θ denote the parameters. We assume a prior distribution over parameters p(θ | m), conditional on the model m. The marginal likelihood of a model, p(y | m), can be lower bounded by introducing any distribution over both latent variables and parameters which has support where p(x, θ | y, m) does, by appealing to Jensen's inequality:

\[
\ln p(y \mid m) = \ln \int \! d\theta \, dx \; p(x, y, \theta \mid m) = \ln \int \! d\theta \, dx \; q(x, \theta) \, \frac{p(x, y, \theta \mid m)}{q(x, \theta)} \tag{31}
\]
\[
\geq \int \! d\theta \, dx \; q(x, \theta) \ln \frac{p(x, y, \theta \mid m)}{q(x, \theta)} \,. \tag{32}
\]

Maximising this lower bound with respect to the free distribution q(x, θ) results in q(x, θ) = p(x, θ | y, m), which when substituted above turns the inequality into an equality. This does not simplify the problem, since evaluating the exact posterior distribution p(x, θ | y, m) requires knowing its normalising constant, the marginal likelihood. Instead we constrain the posterior to be a simpler, factorised (separable) approximation q(x, θ) = q_x(x) q_θ(θ), which we are at liberty to do as (32) is true for any q(x, θ):

\[
\ln p(y \mid m) \geq \int \! d\theta \, dx \; q_x(x)\, q_\theta(\theta) \ln \frac{p(x, y, \theta \mid m)}{q_x(x)\, q_\theta(\theta)} \tag{33}
\]
\[
= \int \! d\theta \; q_\theta(\theta) \left[ \int \! dx \; q_x(x) \ln \frac{p(x, y \mid \theta, m)}{q_x(x)} + \ln \frac{p(\theta \mid m)}{q_\theta(\theta)} \right] \tag{34}
\]
\[
= \mathcal{F}_m(q_x(x), q_\theta(\theta)) \tag{35}
\]
\[
= \mathcal{F}_m(q_{x_1}(x_1), \dots, q_{x_n}(x_n), q_\theta(\theta)) \,. \tag{36}
\]

The last equality is a consequence of the data y being i.i.d. and is explained below. The quantity F_m is a functional of the free distributions, q_x(x) and q_θ(θ); for brevity we omit the implicit dependence of F_m on the fixed data set y.
The variational Bayesian algorithm iteratively maximises F_m in (35) with respect to the free distributions, q_x(x) and q_θ(θ), which is essentially coordinate ascent in the function space of variational distributions. It is not difficult to show that by taking functional derivatives of (35), we obtain the VBE and VBM update equations shown below. Each application of the VBE and VBM steps is guaranteed to increase or leave unchanged the lower bound on the marginal likelihood, and successive applications are guaranteed to converge to a local maximum of F_m(q_x(x), q_θ(θ)). The VBE step is

\[
\text{VBE step:} \qquad q^{(t+1)}_{x_i}(x_i) = \frac{1}{\mathcal{Z}_{x_i}} \exp\!\left[ \int \! d\theta \; q^{(t)}_\theta(\theta) \ln p(x_i, y_i \mid \theta, m) \right] \quad \forall\, i \,, \tag{37}
\]
giving the hidden variable variational posterior

\[
q^{(t+1)}_x(x) = \prod_{i=1}^{n} q^{(t+1)}_{x_i}(x_i) \,, \tag{38}
\]

and the VBM step by

\[
\text{VBM step:} \qquad q^{(t+1)}_\theta(\theta) = \frac{1}{\mathcal{Z}_\theta} \, p(\theta \mid m) \exp\!\left[ \int \! dx \; q^{(t+1)}_x(x) \ln p(x, y \mid \theta, m) \right] . \tag{39}
\]
Here t indexes the iteration number and Z_θ and Z_{x_i} are (readily computable) normalisation constants. The factorisation of the distribution over hidden variables between different data points in (38) is a consequence of the i.i.d. data assumption, and falls out of the VB optimisation only because we have decoupled the distributions over hidden variables and parameters. At this point it is well worth noting the symmetry between the hidden variables and the parameters. The only distinguishing feature between hidden variables and parameters is that the number of hidden variables increases with data set size, whereas the number of parameters is assumed fixed.
Re-writing (33), it is easy to see that maximising F_m(q_x(x), q_θ(θ)) is equivalent to minimising the Kullback-Leibler (KL) divergence between q_x(x) q_θ(θ) and the joint posterior over hidden states and parameters p(x, θ | y, m):

\[
\ln p(y \mid m) - \mathcal{F}_m(q_x(x), q_\theta(\theta)) = \int \! d\theta \, dx \; q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(x, \theta \mid y, m)} \tag{40}
\]
\[
= \mathrm{KL}\!\left[ q_x(x)\, q_\theta(\theta) \,\middle\|\, p(x, \theta \mid y, m) \right] \geq 0 \,. \tag{41}
\]
The variational Bayesian EM algorithm reduces to the ordinary EM algorithm for ML estimation if we restrict the parameter distribution to a point estimate, i.e. a Dirac delta function, q_θ(θ) = δ(θ − θ*), in which case the M step simply involves re-estimating θ*. Note that the same cannot be said in the case of MAP estimation, which is inherently basis dependent, unlike both VB and ML algorithms. By construction, the VBEM algorithm is guaranteed to monotonically increase an objective function F_m, as a function of distributions over parameters and hidden variables. Since we integrate over model parameters there is a naturally incorporated model complexity penalty. It turns out that, for a large class of models that we will examine next, the VBE step has approximately the same computational complexity as the standard E step in the ML framework, which makes it viable as a Bayesian replacement for the EM algorithm. Moreover, for a large class of models p(x, y, θ) that we call conjugate-exponential (CE) models, the VBE and VBM steps have very simple and intuitively appealing forms. We examine these points next. CE models satisfy two conditions:
Condition (1). The complete-data likelihood is in the exponential family:

\[
p(x_i, y_i \mid \theta) = g(\theta)\, f(x_i, y_i)\, e^{\phi(\theta)^\top u(x_i, y_i)} \,, \tag{42}
\]

where φ(θ) is the vector of natural parameters, u and f are the functions that define the exponential family, and g is a normalisation constant:

\[
g(\theta)^{-1} = \int \! dx_i \, dy_i \; f(x_i, y_i)\, e^{\phi(\theta)^\top u(x_i, y_i)} \,. \tag{43}
\]

Condition (2). The parameter prior is conjugate to the complete-data likelihood:

\[
p(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu} \,, \tag{44}
\]

where η and ν are hyperparameters of the prior, and h is a normalisation constant:

\[
h(\eta, \nu)^{-1} = \int \! d\theta \; g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu} \,. \tag{45}
\]
From the definition of conjugacy, we see that the hyperparameters of a conjugate prior can be interpreted as the number (η) and values (ν) of pseudo-observations under
the corresponding likelihood. The list of latent-variable models of practical interest with
complete-data likelihoods in the exponential family is very long, for example: Gaussian
mixtures, factor analysis, principal components analysis, hidden Markov models and ex-
tensions, switching state-space models, discrete-variable belief networks. Of course there
are also many as yet undreamt-of models combining Gaussian, gamma, Poisson, Dirich-
let, Wishart, multinomial, and other distributions in the exponential family. Models
whose complete-data likelihood is not in the exponential family can often be approxi-
mated by models which are in the exponential family and have been given additional
hidden variables (for example, see Attias 1999a).
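To make the two conditions concrete for the discrete DAGs of Section 2, consider a single multinomial variable with probabilities θ = (θ_1, . . . , θ_K); the identification below is our own working in the notation of (42)–(45), and a DAG simply multiplies one such term per variable and parent configuration.

\[
p(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{\delta(x,k)} = \exp\!\Big( \sum_{k=1}^{K} \delta(x,k) \ln \theta_k \Big),
\qquad
\phi(\theta) = (\ln \theta_1, \dots, \ln \theta_K), \quad u(x) = \big(\delta(x,1), \dots, \delta(x,K)\big),
\]
\[
f(x) = 1, \quad g(\theta) = 1, \qquad
p(\theta \mid \lambda) \;\propto\; \prod_{k=1}^{K} \theta_k^{\lambda_k - 1} = \exp\!\big( \phi(\theta)^\top (\lambda - \mathbf{1}) \big),
\]

so the Dirichlet prior of (3) has the conjugate form (44) with ν = λ − 1 taken component-wise (the factor g(θ)^η is trivial here since g ≡ 1), and h(η, ν) is the usual Dirichlet normalising constant.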
In Bayesian inference we want to determine the posterior over parameters and hidden variables p(x, θ | y, η, ν). In general this posterior is neither conjugate nor in the exponential family, and is intractable to compute. We can use the variational Bayesian VBE (37) and VBM (39) update steps, but we have no guarantee that we will be able to represent the results of the integration and exponentiation steps analytically. Here we see how models with CE properties are especially amenable to the VB approximation, and derive the VBEM algorithm for CE models.

Given an i.i.d. data set y = {y_1, . . . , y_n}, if the model satisfies conditions (1) and (2), then the following results (a), (b) and (c) hold.
(a) The VBE step yields:

\[
q_x(x) = \prod_{i=1}^{n} q_{x_i}(x_i) \,, \tag{46}
\]

and q_{x_i}(x_i) is in the exponential family:

\[
q_{x_i}(x_i) \propto f(x_i, y_i)\, e^{\bar{\phi}^\top u(x_i, y_i)} = p(x_i \mid y_i, \bar{\phi}) \,, \tag{47}
\]

with a natural parameter vector

\[
\bar{\phi} = \int \! d\theta \; q_\theta(\theta)\, \phi(\theta) \equiv \langle \phi(\theta) \rangle_{q_\theta(\theta)} \tag{48}
\]

obtained by taking the expectation of φ(θ) under q_θ(θ) (denoted using angle-brackets ⟨·⟩). For invertible φ, defining θ̃ such that φ(θ̃) = φ̄, we can rewrite the approximate posterior as

\[
q_{x_i}(x_i) = p(x_i \mid y_i, \tilde{\theta}) \,. \tag{49}
\]
(b) The VBM step yields that q_θ(θ) is conjugate and of the form:

\[
q_\theta(\theta) = h(\tilde{\eta}, \tilde{\nu})\, g(\theta)^{\tilde{\eta}}\, e^{\phi(\theta)^\top \tilde{\nu}} \,, \tag{50}
\]

where η̃ = η + n, ν̃ = ν + ∑_{i=1}^{n} ū(y_i), and ū(y_i) = ⟨u(x_i, y_i)⟩_{q_{x_i}(x_i)} is the expectation of the sufficient statistic u. We have used ⟨·⟩_{q_{x_i}(x_i)} to denote expectation under the variational posterior over the latent variable(s) associated with the ith datum.

(c) Results (a) and (b) hold for every iteration of variational Bayesian EM, i.e. the forms in (47) and (50) are closed under VBEM.
Results (a) and (b) follow from direct substitution of the forms in (42) and (44) into the VB update equations (37) and (39). Furthermore, if q_x(x) and q_θ(θ) are initialised according to (47) and (50), respectively, and conditions (42) and (44) are met, then result (c) follows by induction.

As before, since q_θ(θ) and q_{x_i}(x_i) are coupled, (50) and (47) do not provide an analytic solution to the minimisation problem, so the optimisation problem is solved numerically by iterating between these equations. To summarise, for CE models:

VBE Step: Compute the expected sufficient statistics {ū(y_i)}_{i=1}^n under the hidden variable distributions q_{x_i}(x_i), for all i.

VBM Step: Compute the expected natural parameters φ̄ = ⟨φ(θ)⟩ under the parameter distribution given by η̃ and ν̃.

A major implication of these results for CE models is that, if there exists such a θ̃ satisfying φ(θ̃) = φ̄, the posterior over hidden variables calculated in the VBE step is exactly the posterior that would be calculated had we been performing a standard E step using θ̃. That is, the inferences using an ensemble of models q_θ(θ) can be represented by the effect of a point parameter, θ̃. The task of performing many inferences, each of which corresponds to a different parameter setting, can be replaced with a single inference step tractably.
We can draw a tight parallel between the EM algorithm for ML/MAP estimation, and our VBEM algorithm applied specifically to conjugate-exponential models. These are summarised in Table 1.

  EM for MAP estimation                                      Variational Bayesian EM
  Goal: maximise p(θ | y, m) w.r.t. θ                        Goal: lower bound p(y | m)
  E Step: compute                                            VBE Step: compute
    q_x^(t+1)(x) = p(x | y, θ^(t))                             q_x^(t+1)(x) = p(x | y, φ̄^(t))
  M Step:                                                    VBM Step:
    θ^(t+1) = arg max_θ ∫ dx q_x^(t+1)(x) ln p(x, y, θ)        q_θ^(t+1)(θ) ∝ exp[ ∫ dx q_x^(t+1)(x) ln p(x, y, θ) ]

  Table 1: Comparison of EM for ML/MAP estimation with VBEM for CE models.

This general result of VBEM for CE models was reported in Ghahramani and Beal (2001), and generalises the well known EM algorithm for ML estimation (Dempster et al. 1977). It is a special case of the variational Bayesian algorithm (equations (37) and (39)) used in Ghahramani and Beal (2000) and in Attias (2000), yet encompasses many of the models that have been so far subjected to the variational treatment. Its particular usefulness is as a guide for the design of models, to make them amenable to efficient approximate Bayesian inference.

The VBE step has about the same time complexity as the E step, and is in all ways identical except that it is re-written in terms of the expected natural parameters. The VBM step computes a distribution over parameters (in the conjugate family) rather than a point estimate. Both ML/MAP EM and VBEM algorithms monotonically increase an objective function, but the latter also incorporates a model complexity penalty by integrating over parameters and thereby embodying an Occam's razor effect.
3.5 The variational Bayesian lower bound for discrete-valued DAGs
We wish to approximate the incomplete-data log marginal likelihood (8) given by

\[
\ln p(y \mid m) = \ln \int \! d\theta \; p(\theta \mid m) \prod_{i=1}^{n} \sum_{\{z_{ij}\}_{j \in \mathcal{H}}} \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta) \,. \tag{51}
\]

We can form the lower bound using (33), introducing variational distributions q_θ(θ) and {q_{s_i}(s_i)}_{i=1}^n to yield

\[
\ln p(y \mid m) \geq \int \! d\theta \; q_\theta(\theta) \ln \frac{p(\theta \mid m)}{q_\theta(\theta)} + \sum_{i=1}^{n} \int \! d\theta \; q_\theta(\theta) \sum_{s_i} q_{s_i}(s_i) \ln \frac{p(z_i \mid \theta, m)}{q_{s_i}(s_i)} = \mathcal{F}_m(q_\theta(\theta), q(s)) \,. \tag{52}
\]
We now take functional derivatives to write down the variational Bayesian EM algorithm. The VBM step is straightforward:

\[
\ln q_\theta(\theta) = \ln p(\theta \mid m) + \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \ln p(z_i \mid \theta, m) + c \,, \tag{53}
\]

with c a constant required for normalisation. Given that the prior over parameters factorises over variables, as in (3), and the complete-data likelihood factorises over the variables in a DAG, as in (1), equation (53) can be broken down into individual terms:

\[
\ln q_{\theta_{jl}}(\theta_{jl}) = \ln p(\theta_{jl} \mid \lambda_{jl}, m) + \sum_{i=1}^{n} \sum_{s_i} q_{s_i}(s_i) \ln p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \theta, m) + c_{jl} \,, \tag{54}
\]

where z_ij may be either a hidden or observed variable, and each c_jl is a Lagrange multiplier from which a normalisation constant is obtained. Since the prior is Dirichlet, it is easy to show that equation (54) has the form of the Dirichlet distribution — thus conforming to result (b) in (50). We define the expected counts under the hidden variable variational posterior distribution

\[
N_{jlk} = \sum_{i=1}^{n} \left\langle \delta(z_{ij}, k)\, \delta(z_{i\,\mathrm{pa}(j)}, l) \right\rangle_{q_{s_i}(s_i)} \,. \tag{55}
\]

That is, N_jlk is the expected total number of times the jth variable (hidden or observed) is in state k when its parents (hidden or observed) are in state l, where the expectation is taken with respect to the variational distribution q_{s_i}(s_i) over the hidden variables. Then the variational posterior for the parameters is given simply by

\[
q_{\theta_{jl}}(\theta_{jl}) = \mathrm{Dir}\!\left( \lambda_{jlk} + N_{jlk} : k = 1, \dots, |z_{ij}| \right) . \tag{56}
\]
For the VBE step, taking derivatives of (52) with respect to each q_{s_i}(s_i) yields

\[
\ln q_{s_i}(s_i) = \int \! d\theta \; q_\theta(\theta) \ln p(z_i \mid \theta, m) + c'_i = \int \! d\theta \; q_\theta(\theta) \ln p(s_i, y_i \mid \theta, m) + c'_i \,, \tag{57}
\]

where each c'_i is a Lagrange multiplier for normalisation of the posterior. Since the complete-data likelihood p(z_i | θ, m) is in the exponential family and we have placed conjugate Dirichlet priors on the parameters, we can immediately utilise the result in (49) which gives simple forms for the VBE step:

\[
q_{s_i}(s_i) \propto q_{z_i}(z_i) = \prod_{j=1}^{|z_i|} p(z_{ij} \mid z_{i\,\mathrm{pa}(j)}, \tilde{\theta}) \,. \tag{58}
\]

Thus the approximate posterior over the hidden variables s_i resulting from a variational Bayesian approximation is identical to that resulting from exact inference in a model with known point parameters θ̃ — the choice of θ̃ must satisfy φ(θ̃) = φ̄. The natural parameters for this model are the log probabilities ln θ_jlk, where j specifies which
variable, l indexes the possible configurations of its parents, and k the possible settings of the variable. Thus

\[
\ln \tilde{\theta}_{jlk} = \phi(\tilde{\theta})_{jlk} = \bar{\phi}_{jlk} = \int \! d\theta_{jl} \; q_{\theta_{jl}}(\theta_{jl}) \ln \theta_{jlk} \,. \tag{59}
\]

Under a Dirichlet distribution, the expectations are differences of digamma functions

\[
\ln \tilde{\theta}_{jlk} = \psi(\lambda_{jlk} + N_{jlk}) - \psi\!\Big( \sum_{k'=1}^{|z_{ij}|} \lambda_{jlk'} + N_{jlk'} \Big) \quad \forall\, j, l, k \,, \tag{60}
\]

where the N_jlk are defined in (55), and ψ(·) is the digamma function. Since this expectation operation takes the geometric mean of the probabilities, the propagation algorithm in the VBE step is now passed sub-normalised probabilities as parameters: ∑_{k=1}^{|z_ij|} θ̃_jlk ≤ 1 ∀ j, l. This use of sub-normalised probabilities also occurred in MacKay (1997) in the context of variational Bayesian Hidden Markov Models.

The expected natural parameters become normalised only if the distribution over parameters is a delta function, in which case this reduces to the MAP inference scenario of Section 3.1. In fact, using the property of the digamma function for large arguments, lim_{x→∞} ψ(x) = ln x, we find that equation (60) becomes

\[
\lim_{n \to \infty} \ln \tilde{\theta}_{jlk} = \ln(\lambda_{jlk} + N_{jlk}) - \ln\!\Big( \sum_{k'=1}^{|z_{ij}|} \lambda_{jlk'} + N_{jlk'} \Big) \,, \tag{61}
\]

which has recovered the MAP estimator (20), up to the −1 entries in numerator and denominator which become vanishingly small for large data, and vanish completely if MAP is performed in the softmax parameterisation (see MacKay 1998). Thus in the limit of large data VB recovers the MAP parameter estimate.
To summarise, the VBEM implementation for discrete DAGs consists of iterating between the VBE step (58), which infers distributions over the hidden variables given a distribution over the parameters, and a VBM step (56), which finds a variational posterior distribution over parameters based on the hidden variables' sufficient statistics from the VBE step. Each step monotonically increases or leaves unchanged a lower bound on the marginal likelihood of the data, and the algorithm is guaranteed to converge to a local maximum of the lower bound. The VBEM algorithm uses as a subroutine the algorithm used in the E step of the corresponding EM algorithm, and so the VBE step's computational complexity is the same as for EM — there is some overhead in calculating differences of digamma functions instead of ratios of expected counts, but this is presumed to be minimal and fixed. As with BIC and Cheeseman-Stutz, the lower bound does not take into account aliasing in the parameter posterior, and needs to be corrected as described in Section 3.2.
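The arithmetic of the VBM step (56) and of the expected natural parameters (60) is compact enough to show directly; the snippet below is our own sketch for a single variable j, assuming its expected counts have already been accumulated in an (L × K) array.

# Sketch (ours): the VBM step (56) and the expected natural parameters (60)
# for one variable/parent-configuration block, given the expected counts
# N_jlk accumulated in the VBE step.
import numpy as np
from scipy.special import digamma

def vbm_block(N_jl, lam_jl):
    """N_jl, lam_jl: arrays of shape (L, K) for one variable j.
    Returns the Dirichlet posterior parameters and the sub-normalised
    point parameters theta_tilde used by the next VBE step."""
    post = lam_jl + N_jl                                 # Dir(lambda + N), (56)
    log_theta_tilde = digamma(post) - digamma(post.sum(axis=1, keepdims=True))
    return post, np.exp(log_theta_tilde)                 # rows sum to <= 1

post, theta_tilde = vbm_block(N_jl=np.array([[2.3, 0.7], [4.0, 6.0]]),
                              lam_jl=np.ones((2, 2)))
print(theta_tilde.sum(axis=1))   # each entry < 1: sub-normalised, as in (60)

The next VBE step then runs the same propagation algorithm as the ordinary E step, simply with these sub-normalised tables θ̃ in place of point estimates of θ.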
3.6 Annealed Importance Sampling (AIS)
AIS (Neal 2001) is a state-of-the-art technique for estimating marginal likelihoods, which
breaks a difficult integral into a series of easier ones. It combines techniques from
importance sampling, Markov chain Monte Carlo, and simulated annealing (Kirkpatrick
et al. 1983). AIS builds on work in the Physics community for estimating the free energy
of systems at different temperatures, for example: thermodynamic integration (Neal
1993), tempered transitions (Neal 1996), and the similarly inspired umbrella sampling
(Torrie and Valleau 1977). Most of these, as well as other related methods, are reviewed
in Gelman and Meng (1998).
Obtaining samples from the posterior distribution over parameters, with a view to
forming a Monte Carlo estimate of the marginal likelihood of the model, is usually a
very challenging problem. This is because, even with small data sets and models with
just a few parameters, the distribution is likely to be very peaky and have its mass
concentrated in tiny volumes of space. This makes simple approaches such as sampling
parameters directly from the prior or using simple importance sampling infeasible. The
basic idea behind annealed importance sampling is to move in a chain from an easy-
to-sample-from distribution, via a series of intermediate distributions, through to the
complicated posterior distribution. By annealing the distributions in this way the pa-
rameter samples should hopefully come from representative areas of probability mass in
the posterior. The key to the annealed importance sampling procedure is to make use
of the importance weights gathered at all the distributions up to and including the final posterior distribution, in such a way that the final estimate of the marginal likelihood is unbiased. A brief description of the AIS procedure follows. We define a series of inverse-temperatures {τ(k)}_{k=0}^{K} satisfying

\[
0 = \tau(0) < \tau(1) < \dots < \tau(K-1) < \tau(K) = 1 \,. \tag{62}
\]
We refer to temperatures and inverse-temperatures interchangeably throughout this section. We define the function:

\[
f_k(\theta) \equiv p(\theta \mid m)\, p(y \mid \theta, m)^{\tau(k)} \,, \qquad k \in \{0, \dots, K\} \,. \tag{63}
\]

Thus the set of functions {f_k(θ)}_{k=0}^{K} form a series of unnormalised distributions which interpolate between the prior and posterior, parameterised by τ. We also define the normalisation constants Z_k ≡ ∫ dθ f_k(θ), and note that Z_0 = ∫ dθ p(θ | m) = 1 from normalisation of the prior, and that Z_K = ∫ dθ p(θ | m) p(y | θ, m) = p(y | m), which is exactly the marginal likelihood we wish to estimate. We can estimate Z_K, or equivalently Z_K/Z_0, using the identity
\[
p(y \mid m) = \frac{Z_K}{Z_0} \equiv \frac{Z_1}{Z_0}\, \frac{Z_2}{Z_1} \cdots \frac{Z_K}{Z_{K-1}} = \prod_{k=1}^{K} \mathcal{R}_k \,. \tag{64}
\]

Each of the K ratios in this expression can be individually estimated without bias using importance sampling: the kth ratio, denoted R_k, can be estimated from a set of (not necessarily independent) samples of parameters {θ^(k,c)}_{c∈C_k}, which are drawn from the higher temperature τ(k − 1) distribution (the importance distribution), i.e. each θ^(k,c) ∼ f_{k−1}(θ), and the importance weights are computed at the lower temperature τ(k). These samples are used to construct the Monte Carlo estimate for R_k:

\[
\mathcal{R}_k \equiv \frac{Z_k}{Z_{k-1}} = \int \! d\theta \; \frac{f_k(\theta)}{f_{k-1}(\theta)} \, \frac{f_{k-1}(\theta)}{Z_{k-1}} \;\simeq\; \frac{1}{C_k} \sum_{c \in \mathcal{C}_k} \frac{f_k(\theta^{(k,c)})}{f_{k-1}(\theta^{(k,c)})} \,, \quad \text{with } \theta^{(k,c)} \sim f_{k-1}(\theta) \tag{65}
\]
\[
= \frac{1}{C_k} \sum_{c \in \mathcal{C}_k} p(y \mid \theta^{(k,c)}, m)^{\tau(k) - \tau(k-1)} \,. \tag{66}
\]
The variance of the estimate of each R_k depends on the constituent distributions f_k(θ), f_{k−1}(θ) being sufficiently close so as to produce low-variance weights (the summands in (65)). Neal (2001) shows that it is a sufficient condition that the C_k each be chosen to be exactly one for the product of the R_k estimators to be an unbiased estimate of the marginal likelihood, p(y | m), in (64). It is an open research question as to whether values of C_k > 1 can be shown to lead to an unbiased estimate. In our experiments (Section 4) we use C_k = 1 and so remain in the realm of an unbiased estimator.
Metropolis-Hastings for discrete-variable models
In general we expect it to be difficult to sample directly from the forms f_k(θ) in (63), and so Metropolis-Hastings (Metropolis et al. 1953; Hastings 1970) steps are used at each temperature to generate the set of C_k samples required for each importance calculation in (66). In the discrete-variable graphical models covered in this article, the parameters are multinomial probabilities, hence the support of the Metropolis proposal distributions is restricted to the simplex of probabilities summing to 1. One might suggest using a Gaussian proposal distribution in the softmax parameterisation, b, of the current parameters θ, like so: θ_i = e^{b_i} / ∑_{j=1}^{|θ|} e^{b_j}. However the Jacobian of the transformation from this vector b back to the vector θ is zero, and it is hard to construct a reversible Markov chain.
A different and intuitively appealing idea is to use a Dirichlet distribution as the proposal distribution, with its mean positioned at the current parameter. The precision of the Dirichlet proposal distribution at inverse-temperature τ(k) is governed by its strength, α(k), which is a free variable to be set as we wish, provided it is not in any way a function of the sampled parameters. An MH acceptance function is required to maintain detailed balance: if θ′ is the sample under the proposal distribution centered around the current parameter θ^(k,c), then the acceptance function is:

\[
a(\theta', \theta^{(k,c)}) = \min\!\left[ \frac{f_k(\theta')\;\mathrm{Dir}(\theta^{(k,c)} \mid \theta', \alpha(k))}{f_k(\theta^{(k,c)})\;\mathrm{Dir}(\theta' \mid \theta^{(k,c)}, \alpha(k))} \,, \; 1 \right] , \tag{67}
\]

where Dir(θ | θ̄, α) is the probability density of a Dirichlet distribution with mean θ̄ and
strength α, evaluated at θ. The next sample is instantiated as follows:

\[
\theta^{(k,c+1)} =
\begin{cases}
\theta' & \text{if } w < a(\theta', \theta^{(k,c)}) \quad \text{(accept)} \\
\theta^{(k,c)} & \text{otherwise (reject)} \,,
\end{cases}
\tag{68}
\]

where w ∼ U(0, 1) is a random variable sampled from a uniform distribution on [0, 1]. By repeating this procedure of accepting or rejecting C′_k ≥ C_k times at the temperature τ(k), the MCMC sampler generates a set of (dependent) samples {θ^(k,c)}_{c=1}^{C′_k}. A subset of these, {θ^(k,c)}_{c∈C_k} with |C_k| = C_k ≤ C′_k, is then used as the importance samples in the computation above (66). This subset will generally not include the first few samples, as these samples are likely not yet samples from the equilibrium distribution at that temperature, and in the case of C_k = 1 will contain only the most recent sample.
An algorithm to compute all ratios
The entire algorithm for computing all K marginal likelihood ratios is given in Algorithm 3.1. It has several parameters, in particular: the number of annealing steps, K; their inverse-temperatures (the annealing schedule), {τ(k)}_{k=1}^K; and the parameters of the MCMC importance sampler at each temperature, {C′_k, C_k, α(k)}_{k=1}^K, which are the number of proposed samples, the number used for the importance estimate, and the precision of the proposal distribution, respectively. We remind the reader that the estimate has only been proven to be unbiased in the case of C_k = 1.

1. Initialise θ_ini ∼ f_0(θ), i.e. from the prior p(θ | m).
2. For k = 1 to K annealing steps:
   (a) Run MCMC at temperature τ(k − 1) as follows:
       i. Initialise θ^(k,0) ← θ_ini from the previous temperature.
       ii. Generate the set {θ^(k,c)}_{c=1}^{C′_k} ∼ f_{k−1}(θ) as follows:
          A. For c = 1 to C′_k:
             Propose θ′ ∼ Dir(θ′ | θ^(k,c−1), α(k));
             Accept θ^(k,c) ← θ′ according to (67) and (68).
             End For
          B. Store θ_ini ← θ^(k,C′_k).
       iii. Store a subset {θ^(k,c)}_{c∈C_k} of these samples, of size |C_k| = C_k ≤ C′_k.
   (b) Calculate R_k ≡ Z_k / Z_{k−1} ≃ (1/C_k) ∑_{c∈C_k} f_k(θ^(k,c)) / f_{k−1}(θ^(k,c)).
   End For
3. Output {ln R_k}_{k=1}^K and ln Ẑ_K = ∑_{k=1}^K ln R_k as the approximation to ln Z_K.

Algorithm 3.1: AIS. Computes the ratios {R_k}_{k=1}^K for the marginal likelihood estimate.
Algorithm 3.1 produces only a single estimate of the marginal likelihood; a particular attraction of AIS is that one can take averages of estimates from a number G of AIS runs to form another unbiased estimate of the marginal likelihood with lower variance: [Z_K/Z_0]^(G) = G^{-1} ∑_{g=1}^{G} ∏_{k=1}^{K^(g)} R_k^(g). In Section 5 we discuss the performance of AIS for estimating the marginal likelihood of the graphical models used in this article, addressing the specific choices of proposal widths, number of samples, and annealing schedules used in the experiments.
4 Experiments
In this section we experimentally examine the accuracy of each of the scoring methods
described in the previous section. To this end, we rst describe the class dening our
space of hypothesised structures, then choose a particular member of the class as the
true structure; we generate a set of parameters for that structure, and then generate
varying-sized data sets from that structure with those parameters. Each score is then
used to estimate the marginal likelihoods of each structure in the class, for each possible
data set size. From these estimates, a posterior distribution over the structures can be
computed for each data set size. Our goal is to assess how closely these approximate
distributions reect the true posterior distributions. We use three metrics: i) the rank
given to the true structure (i.e. the modal structure has rank 1); ii) the dierence
between the estimated marginal likelihoods of the top-ranked and true structures; iii)
The Kullback-Leibler divergence of the approximate to the true posterior distributions.
A specic class of graphical model: We examine discrete directed bipartite graphical
models, i.e. those graphs in which only hidden variables can be parents of observed
variables, and the hidden variables themselves have no parents. For our in depth study
we restrict ourselves to graphs which have just k = [H[ = 2 hidden variables, and
p = [1[ = 4 observed variables; both hidden variables are binary i.e. [s
ij
[ = 2 for j H,
and each observed variable has cardinality [y
ij
[ = 5 for j 1.
The number of distinct graphs: In the class of graphs described above, with k distinct
hidden variables and p observed variables, there are 2
kp
possible structures, correspond-
ing to the presence or absence of a directed link between each hidden and each condi-
tionally independent observed variable. If the hidden variables are unidentiable, which
is the case in our example model where they have the same cardinality, then the number
of possible graphs is reduced due to permutation symmetries. It is straightforward to
show in this example that the number of distinct graphs is reduced from 2
24
= 256
down to 136.
Beal, M. J. and Ghahramani, Z. 23
y
i1
s
i1
s
i2
y
i2
y
i3
y
i4
i=1...n
Figure 1: The true structure that was used to generate all the data sets used in the
experiments. The hidden variables (top) are each binary, and the observed variables
(bottom) are each ve-valued. This structure has 50 parameters, and is two links away
from the fully-connected structure. In total, there are 136 possible distinct structures
with two (identical) hidden variables and four observed variables.
The specic model and generating data: We chose the particular structure shown in
Figure 1, which we call the true structure. This structure contains enough links to in-
duce non-trivial correlations amongst the observed variables, whilst the class as a whole
has few enough nodes to allow us to examine exhaustively every possible structure of
the class. There are only three other structures in the class which have more param-
eters than our chosen structure: These are: two structures in which either the left- or
right-most visible node has both hidden variables as parents instead of just one, and one
structure which is fully connected. One should note that our chosen true structure is at
the higher end of complexity in this class, and so we might nd that scoring methods
that do not penalise complexity do seemingly better than naively expected.
Evaluation of the marginal likelihood of all possible alternative structures in the
class is done for academic interest only, since for non-trivial numbers of variables the
number of structures is huge. In practice one can embed dierent structure scoring
methods in a greedy model search outer loop (for example, see Friedman 1998) to
nd probable structures. Here, we are not so much concerned with structure search
per se, since a prerequisite for a good structure search algorithm is an ecient and
accurate method for evaluating any particular structure. Our aim in these experiments
is to establish the reliability of the variational bound as a score, compared to annealed
importance sampling, and the currently employed asymptotic scores such as the BIC
and Cheeseman-Stutz criteria.
The parameters of the true model: Conjugate uniform symmetric Dirichlet priors were
placed over all the parameters of the model that is to say
jlk
= 1 jlk in (3).
This particular prior was arbitrarily chosen for the purposes of the experiments; we
do not expect it to inuence the trends in our conclusions. For the network shown in
Figure 1, parameters were sampled from the prior, once and for all, to instantiate a true
underlying model, from which data was then generated. The sampled parameters are
24 Variational Bayesian EM for DAGs
shown below (their sizes are functions of each nodes and its parents cardinalities):

1
=
_
.12 .88


3
=
_
.03 .03. .64 .02 .27
.18 .15 .22 .19 .27
_

6
=
_
.10 .08 .43 .03 .36
.30 .14 .07 .04 .45
_

2
=
_
.08 .92


4
=
_

_
.10 .54 .07 .14 .15
.04 .15 .59 .05 .16
.20 .08 .36 .17 .18
.19 .45 .10 .09 .17
_

_

5
=
_

_
.11 .47 .12 .30 .01
.27 .07 .16 .25 .25
.52 .14 .15 .02 .17
.04 .00 .37 .33 .25
_

_
,
where
j

2
j=1
are the parameters for the hidden variables, and
j

6
j=3
are the param-
eters for the remaining four observed variables, y
i1
, . . . , y
i4
. Each row of each matrix
denotes the probability of each multinomial setting for a particular conguration of the
parents, and sums to one (up to rounding error). Note that there are only two rows
for
3
and
6
, as both these observed variables have just a single binary parent. For
variables yi2 and yi3, the four rows correspond to the parent congurations: [1 1],
[1 2], [2 1], [2 2] (with parameters
5
and
6
respectively). In this particular instan-
tiation of the parameters, both the hidden variable priors are close to deterministic,
causing approximately 80% of the data to originate from the [2 2] setting of the hidden
variables. This means that we may need many data points before the evidence for two
hidden variables outweighs that for one.
Incrementally larger nested data sets were generated from these parameter settings,
with n 10, 20 ,40, 80, 110, 160, 230, 320, 400, 430, 480, 560, 640, 800, 960, 1120,
1280, 2560, 5120, 10240. The items in the n = 10 data set are a subset of the n = 20
and subsequent data sets, etc. The particular values of n were chosen from an initially
exponentially increasing data set size, followed by inclusion of some intermediate data
sizes to concentrate on interesting regions of behaviour.
4.1 Comparison of scores to AIS
All 136 possible distinct structures were scored for each of the 20 data set sizes given
above, using MAP, BIC, CS, VB and AIS scores. We ran EM on each structure to
compute the MAP estimate of the parameters (Section 3.1), and from it computed the
BIC score (Section 3.2). Even though the MAP probability of the data is not strictly an
approximation to the marginal likelihood, it can be shown to be an upper bound and so
we include it for comparison. We also computed the BIC score including the parameter
prior, denoted BICp, which was obtained by including a term lnp(

[ m) in equation
(24). From the same EM optimisation we computed the CS score (Section 3.3). We
then ran the variational Bayesian EM algorithm with the same initial conditions to give
a lower bound on the marginal likelihood (Section 3.5). To avoid local optima, several
optimisations were carried out with dierent parameter initialisations, drawn each time
from the prior over parameters. The same initial parameters were used for both EM and
VBEM; in the case of VBEM the following protocol was used to obtain a parameter
distribution: a conventional E-step was performed at the initial parameter to obtain
Beal, M. J. and Ghahramani, Z. 25
p(s
i
[y
i
, )
n
i=1
, which was then used in place of q
s
(s) for input to the VBM step, which
was thereafter followed by VBE and VBM iterations until convergence. The highest
score over three random initialisations was taken for each algorithm; empirically this
heuristic appeared to avoid local maxima problems. The EM and VBEM algorithms
were terminated after either 1000 iterations had been reached, or the change in log
likelihood (or lower bound on the log marginal likelihood, in the case of VBEM) became
less than 10
6
per datum.
For comparison, the AIS sampler was used to estimate the marginal likelihood (see
Section 3.6), annealing from the prior to the posterior in K = 16384 steps. A nonlinear
annealing schedule was employed, tuned to reduce the variance in the estimate, and the
Metropolis proposal width was tuned to give reasonable acceptance rates. We chose to
have just a single sampling step at each temperature (i.e. C

k
= C
k
= 1), for which AIS
has been proven to give unbiased estimates, and initialised the sampler at each tem-
perature with the parameter sample from the previous temperature. These particular
choices are explained and discussed in detail in Section 5. Initial marginal likelihood
estimates from single runs of AIS were quite variable, and for this reason several more
batches of AIS runs were undertaken, each using a dierent random initialisation (and
random numbers thereafter); the total of G batches of estimates were averaged accord-
ing to the procedure given at the end of Section 3.6 to give the AIS
(G)
score. In total,
G = 26 batches of AIS runs were carried out.
Scoring all possible structures
Figure 2 shows the MAP, BIC, BICp, CS, VB and AIS
(26)
scores obtained for each of
the 136 possible structures against the number of parameters in the structure. Score
is measured on the vertical axis, with each scoring method (columns) sharing the same
vertical axis range for a particular data set size (rows). The horizontal axis of each
plot corresponds to the number of parameters in the structure (as described in Section
3.2). For example, at the extremes there is one structure with 66 parameters (the
fully connected structure) and one structure with 18 parameters (the fully unconnected
structure). The structure that generated the data has exactly 50 parameters. In each
plot we can see that several structures can occupy the same column, having the same
number of parameters. This means that, at least visually, it is not always possible to
unambiguously assign each point in the column to a particular structure.
The scores shown are those corrected for aliases (see equation (24)). Plots for un-
corrected scores are almost identical. In each plot, the true structure is highlighted
by a symbol, and the structure currently ranked highest by that scoring method is
marked with a . We can see the general upward trend for the MAP score, which
prefers more complicated structures, and the pronounced downward trend for the BIC
and BICp scores, which (over-)penalise structure complexity. In addition, one can see
that neither upward or downward trends are apparent for VB or AIS scores. The CS
score tends to show a downward trend similar to BIC and BICp, and while this trend
weakens with increasing data, it is still present at n = 10240 (bottom row). Although
not veriable from these plots, the vast majority of the scored structures and data set
26 Variational Bayesian EM for DAGs
MAP BIC BICp CS VB AIS
(26)
10
160
640
1280
2560
5120
10240
Figure 2: Scores for all 136 of the structures in the model class, by each of six scoring
methods. Each plot has the score (approximation to the log marginal likelihood) on
the vertical axis, with tick marks every 40 nats, and the number of parameters on the
horizontal axis (ranging from 18 to 66). The middle four scores have been corrected for
aliases (see Section 3.2). Each row corresponds to a data set of a dierent size, n: from
top to bottom we have n = 10, 160, 640, 1280, 2560, 5120, 10240. The true structure is
denoted with a symbol, and the highest scoring structure in each plot marked by the
symbol. Every plot in the same row has the same scaling for the vertical score axis,
set to encapsulate every structure for all scores.
Beal, M. J. and Ghahramani, Z. 27
sizes, the AIS
(26)
score is higher than the VB lower bound, as we would expect.
The plots for large n show a distinct horizontal banding of the scores into three
levels; this is an interesting artifact of the particular model used to generate the data.
For example, we nd on closer inspection some strictly followed trends: all those model
structures residing in the upper band have the rst three observable variables (j =
3, 4, 5) governed by at least one of the hidden variables; all those structures in the middle
band have the third observable (j = 5) connected to at least one hidden variable.
In this particular example, AIS nds the correct structure at n = 960 data points,
but unfortunately does not retain this result reliably until n = 2560. At n = 10240
data points, BICp, CS, VB and AIS all report the true structure as being the one
with the highest score amongst the other contending structures. Interestingly, BIC
still does not select the correct structure, and MAP has given a structure with sub-
maximal parameters the highest score, which may well be due to local maxima in the
EM optimisation.
Ranking of the true structure
Table 2 shows the ranking of the true structure, as it sits amongst all the possible 136
structures, as measured by each of the scoring methods MAP, BIC, BICp, CS, VB and
AIS
(26)
; this is also plotted in Figure 3(a), where the MAP ranking is not included
for clarity. Higher positions in the plot correspond to better rankings: a ranking of 1
means that the scoring method has given the highest marginal likelihood to the true
structure. We should keep in mind that, at least for small data set sizes, there is no
reason to assume that the true posterior over structures has the true structure at its
mode. Therefore we should not expect high rankings at small data set sizes.
For small n, for the most part the AIS score produces dierent (higher) rankings
for the true structure than do the other scoring methods. We expect AIS to perform
accurately with small data set sizes, for which the posterior distribution over parameters
is not at all peaky, and so this suggests that the other approximations are performing
poorly in comparison. However, for almost all n, VB outperforms BIC, BICp and
CS, consistently giving a higher ranking to the true structure. Of particular note is the
stability of the VB score ranking with respect to increasing amounts of data as compared
to AIS (and to some extent CS). Columns in Table 2 with asterisks (*) correspond to
scores that are not corrected for aliases, and are omitted from Figure 3(a). These
corrections assume that the posterior aliases are well separated, and are valid only for
large amounts of data and/or strongly-determined parameters. The correction nowhere
degrades the rankings of any score, and in fact improves them very slightly for CS, and
especially so for VB.
KL divergence of the methods posterior distributions from the AIS estimate
In Figure 3(c) we plot the Kullback-Leibler (KL) divergence between the AIS computed
posterior and the posterior distribution computed by each of the approximations BIC,
28 Variational Bayesian EM for DAGs
n MAP BIC* BICp* CS* VB* BIC BICp CS VB AIS
(26)
10 21 127 55 129 122 127 50 129 115 20
20 12 118 64 111 124 118 64 111 124 92
40 28 127 124 107 113 127 124 107 113 17
80 8 114 99 78 116 114 99 78 116 28
110 8 109 103 98 114 109 103 98 113 6
160 13 119 111 114 83 119 111 114 81 49
230 8 105 93 88 54 105 93 88 54 85
320 8 111 101 90 44 111 101 90 33 32
400 6 101 72 77 15 101 72 77 15 22
430 7 104 78 68 15 104 78 68 14 14
480 7 102 92 80 55 102 92 80 44 12
560 9 108 98 96 34 108 98 96 31 5
640 7 104 97 105 19 104 97 105 17 28
800 9 107 102 108 35 107 102 108 26 49
960 13 112 107 76 16 112 107 76 13 1
1120 8 105 96 103 12 105 96 103 12 1
1280 7 90 59 8 3 90 59 6 3 1
2560 6 25 17 11 11 25 15 11 11 1
5120 5 6 5 1 1 6 5 1 1 1
10240 3 2 1 1 1 2 1 1 1 1
Table 2: Ranking of the true structure by each of the scoring methods, as the size
of the data set is increased. Asterisks (*) denote scores uncorrected for parameter
aliasing in the posterior. These results are from data generated from only one instance
of parameters under the true structures prior over parameters.
Beal, M. J. and Ghahramani, Z. 29
10
1
10
2
10
3
10
4
10
0
10
1
10
2
n
r
a
n
k

o
f

t
r
u
e

s
t
r
u
c
t
u
r
e
AIS
VB
CS
BICp
BIC
(a)
10
1
10
2
10
3
10
4
60
50
40
30
20
10
0
n
s
c
o
r
e

d
i
f
f
e
r
e
n
c
e
AIS
VB
CS
BICp
BIC
(b)
10
1
10
2
10
3
10
4
10
20
30
40
50
60
n
K
L

d
i
v
e
r
g
e
n
c
e

f
r
o
m

A
I
S

p
o
s
t
e
r
i
o
r

d
i
s
t
r
i
b
u
t
i
o
n

/

n
a
t
sVB
CS
BICp
BIC
(c)
Figure 3: (a) Ranking given to the true structure by each scoring method for varying
data set sizes (higher in plot is better), by BIC, BICp, CS, VB and AIS
(26)
methods.
(b) Dierences in log marginal likelihood estimates (scores) between the top-ranked
structure and the true structure, as reported by each method. All dierences are exactly
zero or negative (see text). Note that these score dierences are not normalised for the
number of data n. (c) KL divergences of the approximate posterior distributions from
the estimate of the posterior distribution provided by the AIS method; this measure is
zero if the distributions are identical.
30 Variational Bayesian EM for DAGs
BICp, CS, and VB. We see quite clearly that VB has the lowest KL by a long way out
of all the approximations, over a wide range of data set sizes, suggesting it is remaining
most faithful to the true posterior distribution as approximated by AIS. The increase
of the KL for the CS and VB methods at n = 10240 is almost certainly due to the
AIS sampler having diculty at high n (discussed below in Section 5), and should not
be interpreted as a degradation in performance of either the CS or VB methods. It is
interesting to note that the BIC, BICp, and CS approximations require a vast amount
of data before their KL divergences reduce to the level of VB.
Computation Time
Scoring all 136 structures at 480 data points on a 1GHz Pentium III processor, with an
implementation in Matlab, took: 200 seconds for the MAP EM algorithms required
for BIC/BICp/CS, 575 seconds for the VBEM algorithm required for VB, and 55000
seconds (15 hours) for a single set of runs of the AIS algorithm (using 16384 samples as
in the main experiments); note the results for AIS here used averages of 26 runs. The
massive computational burden of the sampling method (approx 75 hours for just 1 of
26 runs) makes CS and VB attractive alternatives for consideration.
4.2 Performance averaged over the parameter prior
The experiments in the previous section used a single instance of sampled parameters
for the true structure, and generated data from this particular model. The reason for
this was that, even for a single experiment, computing an exhaustive set of AIS scores
covering all data set sizes and possible model structures takes in excess of 15 CPU days.
In this section we compare the performance of the scores over many dierent sampled
parameters of the true structure (shown in Figure 1). 106 parameters were sampled
from the prior and incremental data sets generated for each of these instances as the
true model. MAP EM and VBEM algorithms were employed to calculate the scores
as described in Section 4.1. For each instance of the true model, calculating scores
for all data set sizes used and all possible structures, using three random restarts,
for BIC/BICp/CS and VB took approximately 2.4 and 4.2 hours respectively on an
Athlon 1800 Processor machine, which corresponds to about 1.1 and 1.9 seconds for
each individual score.
Figure 4(a) shows the median ranking given to the true structure by each scor-
ing method, computed over the 106 randomly sampled parameter settings. This plot
corresponds to a smoothed version of Figure 3(a), but unfortunately cannot contain
AIS averages for the computational reasons mentioned above. For the most part VB
outperforms the other scores, although there is a region in which VB seems to underper-
form CS, as measured by this median score. For several cases, the VBEM optimisation
reached the maximum number of allowed iterations before it had converged, whereas
EM always converged. Allowing longer runs should result in improved VB performance.
The VB score of the true structure is generally much closer to that of the top-ranked
Beal, M. J. and Ghahramani, Z. 31
10
1
10
2
10
3
10
4
10
0
10
1
10
2
n
m
e
d
i
a
n

r
a
n
k

o
f

t
r
u
e

s
t
r
u
c
t
u
r
e
VB
CS
BICp
BIC
(a)
10
1
10
2
10
3
10
4
10
0
10
1
10
2
n
h
i
g
h
e
s
t

r
a
n
k

o
f

t
r
u
e

s
t
r
u
c
t
u
r
e
VB
CS
BICp
BIC
(b)
Figure 4: (a) Median ranking of the true structure as reported by BIC, BICp, CS and
VB methods, against the size of the data set n, taken over 106 instances of the true
structure. (b) The highest ranking given to the true structure under BIC, BICp, CS
and VB methods, against the size of the data set n, taken over 106 instances of the true
structure.
structure than is the case for any of the other scores. Figure 4(b) shows the best per-
formance of the BIC, BICp, CS and VB methods over the 106 parameter instances in
terms of the rankings. Lastly, we can examine the success rate of each score at picking
the correct structure: Figure 5 shows the fraction of times that the true structure is
ranked top by the dierent scoring methods, and other measures of performance.
5 AIS analysis, limitations, and extensions
The technique of annealed importance sampling is currently regarded as a state-of-the-
art method for estimating the marginal likelihood in discrete-variable directed acyclic
graphical models. In this section the AIS method is critically examined to gauge its
reliability as a tool for judging the performance of the BIC, CS and VB scores.
The implementation of AIS has considerable exibility: the user must specify the
length, granularity and shape of the annealing schedules, the form of the Metropolis-
Hastings (MH) sampling procedure, the number of samples taken at each temperature,
etc. These and other parameters were described in Section 3.6; here we clarify our
choices of settings and discuss some further ways in which the sampler could be im-
proved.
How can we be sure that the AIS sampler is reporting the correct answer for the
marginal likelihood of each structure? To be sure of a correct answer, one should use
as long and gradual an annealing schedule as possible, containing as many samples at
each temperature as is computationally viable. In the AIS experiments in this article,
we always opted for a single sample at each step of the annealing schedule, initialising
32 Variational Bayesian EM for DAGs
10
1
10
2
10
3
10
4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
n
s
u
c
c
e
s
s

r
a
t
e

a
t

s
e
l
e
c
t
i
n
g

t
r
u
e

s
t
r
u
c
t
u
r
e
VB
CS
BICp
BIC
(a)
10
1
10
2
10
3
10
4
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
n
m
e
a
n

p
r
o
b
a
b
i
l
i
t
y

o
f

t
r
u
e

s
t
r
u
c
t
u
r
e
VB
CS
BICp
BIC
(b)
10
1
10
2
10
3
10
4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
n
p
o
s
t
e
r
i
o
r

p
r
o
b
a
b
i
l
i
t
y

o
f

t
r
u
e

s
t
r
u
c
t
u
r
e

>
1
/
1
3
6
VB
CS
BICp
BIC
(c)
Figure 5: (a) The success rate of the scoring methods BIC, BICp, CS and VB, as
measured by the fraction of 106 trials in which the true structure was given ranking 1
amongst the 136 candidate structures, plotted as a function of the data set size. (b)
The posterior probability of the true structure, averaged over 106 trials. Note that for
each trial, a posterior probability of greater that .5 is sucient to guarantee that the
true structure ranks top. (c) The fraction of trials in which the true structure was given
posterior probability >
1
136
, i.e. greater than uniform probability.
Beal, M. J. and Ghahramani, Z. 33
the parameter at the next temperature at the previous sample, and ensured that the
schedule itself was as nely grained as we could aord. This reduces the variables at
our disposal to a single parameter, namely the total number of samples taken in each
run of AIS, which is then directly related to the schedule granularity.
We examine the performance of the AIS sampler as a function of the number of
samples. Figure 6(a) shows several AIS estimates of the marginal likelihood for the data
set of size n = 480 under the model having the true structure. Each trace corresponds to
a dierent point of initialisation of the AIS algorithm, obtained by sampling a parameter
from the prior using 10 dierent random seeds. The top-most trace is initialised at the
true parameters (which we as the experimenter have access to). Each point on a trace
corresponds to a dierent temperature schedule for the AIS sampler with that initialising
seed. Thus, a point at the right of the plot with high K corresponds to a schedule with
many small steps in temperature, whereas a point at the left with low K corresponds to
a coarser temperature schedule. Also plotted for reference are the VB and BIC estimates
of the log marginal likelihood for this data set under the true structure, which are not
functions of the annealing schedule. We know that the VB score is a lower bound on the
log marginal likelihood, and so those estimates from AIS that consistently fall below this
score must be indicative of an inadequate annealing schedule shape, duration and/or
MH design.
For short annealing schedules, which are necessarily coarse to satisfy the boundary
requirements on in equation (62), it is clear that the AIS sampling is badly under-
estimating the log marginal likelihood. The rapid annealing schedule does not give the
sampler time to locate and exploit regions of high posterior probability, forcing it to
neglect representative volumes of the posterior mass. Conversely, the AIS run started
from the true parameters over-estimates the marginal likelihood, because it is prevented
from exploring regions of low probability. Thus, for coarse schedules of less than about
K = 1000 samples, the AIS estimate of the log marginal likelihood seems biased and
has very high variance. Note that the AIS algorithm gives unbiased estimates of the
marginal likelihood, but not necessarily the log marginal likelihood.
We see that all runs converge for suciently long annealing schedules, with AIS
passing the BIC score at about 1000 samples, and the VB lower bound at about 5000
samples. Thus, loosely speaking, where the AIS and VB scores intersect we can consider
their estimates to be roughly equally reliable. At n = 480 the VB scoring method
requires about 1.5s to score the structure, whereas AIS at n = 480 and K = 2
13
requires about 100s. Thus for this scenario, VB is 70 times more ecient at scoring the
structures (at its own reliability).
In this articles main experiments, a value of K = 2
14
= 16384 steps was used, and
it is clear from Figure 6(a) that we can be fairly sure of the AIS method reporting a
reasonably accurate result at this value of K, at least for n = 480. However, how would
we expect these plots to look for larger data sets in which the posterior over parameters
is more peaky and potentially more dicult to navigate during the annealing?
A good indicator of the mobility of the MH sampler is the acceptance rate of proposed
samples. Figure 6(b) shows the fraction of accepted proposals during the annealing run,
34 Variational Bayesian EM for DAGs
10
2
10
3
10
4
10
5
7.8
7.6
7.4
7.2
7
6.8
6.6
6.4
6.2
6
5.8
length of annealing schedule, K
l
o
g

m
a
r
g
i
n
a
l

l
i
k
e
l
i
h
o
o
d

e
s
t
i
m
a
t
e

/

n
(a)
10
1
10
2
10
3
10
4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
n
a
c
c
e
p
t
a
n
c
e

f
r
a
c
t
i
o
n
(b)
10
1
10
2
10
3
10
4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
a
c
c
e
p
t
a
n
c
e

f
r
a
c
t
i
o
n
n
1st
2nd
3rd
4th
(c)
0 0.2 0.4 0.6 0.8 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.01
0.1
0.2
0.4
1
10
k/K

(
k
)
(d)
Figure 6: (a) Logarithm of AIS estimates (vertical) of the marginal likelihood for dier-
ent initial conditions of the sampler (dierent traces) and dierent duration of annealing
schedules (horizontal), for the true structure with n = 480 data points. The top-most
trace is that corresponding to setting the initial parameters to the true values that gen-
erated the data. Shown are also the BIC score (dashed) and the VB lower bound (solid).
(b) Acceptance rates of the MH proposals along the entire annealing schedule, for one
batch of AIS scoring of all structures, against the size of the data set, n. The dotted
lines are the sample standard deviations across all structures for each n. (c) Acceptance
rates of the MH proposals for each of four quarters of the annealing schedule, for one
batch of AIS scoring of all structures, against the size of the data set, n. Standard errors
of the means are omitted for clarity. (d) Non-linear AIS annealing schedules, plotted
for six dierent values of e

. In the experiments performed in this article, e

= 0.2.
Beal, M. J. and Ghahramani, Z. 35
n 10. . . 560 640 800 960 1120 1280 2560 5120 10240
single
# AIS
(1)
< VB* 5.7 12.3 8.5 12.3 10.4 17.0 25.5 53.8 71.7
# AIS
(1)
< VB 7.5 15.1 9.4 14.2 12.3 20.8 31.1 59.4 74.5
% M-H rej. <40.3 41.5 43.7 45.9 47.7 49.6 59.2 69.7 79.2
averaged
# AIS
(5)
< VB* 0 0.0 0.0 0.0 0.0 0.7 3.7 13.2 50.0
# AIS
(5)
< VB 1.9 0.0 0.0 0.0 1.5 2.2 5.1 19.9 52.9
Table 3: AIS violations: for each size data set, n, we show the percentage of times,
over the 136 structures, that a particular single AIS run reports marginal likelihoods
below the VB lower bound. These are given for the VB scores that are uncorrected
(*) and corrected for aliases. Also shown are the average percentage rejection rates of
the MH sampler used to gather samples for the AIS estimates. The bottom half of the
table shows the similar violations by the AIS score made from averaging the estimates
of marginal likelihoods from ve separate runs of AIS (see Section 3.6).
averaged over AIS scoring of all 136 possible structures, plotted against the size of the
data set, n; the error bars are the standard errors of the mean acceptance rate across
scoring all structures. We can see that at n = 480, the acceptance rate is rarely below
60%, and so one would indeed expect to see the sort of convergence shown in Figure
6(a). However, for the larger data sets the acceptance rate drops to 20%, implying that
the sampler is having diculty obtaining representative samples from the posterior
distributions in the annealing schedule. Fortunately this drop is only linear in the
logarithm of the data size.
By examining the reported AIS scores, both for single and pooled runs, over the
136 structures and 20 data set sizes, and comparing them to the VB lower bound,
we can see how often AIS violates the lower bound. Table 3 compares the number of
times the reported AIS scores AIS
(1)
and AIS
(5)
are below the VB lower bound, along
with the rejection rates of the MH sampler that were plotted in Figure 6(b) (not a
function of G). From the table we see that for small data sets, the AIS method reports
valid results and the MH sampler is accepting a reasonable proportion of proposed
parameter samples. However, at and beyond n = 560, the AIS sampler degrades to the
point where it reports invalid results for more than half the 136 structures it scores.
However, since the AIS estimate is noisy and we know that the tightness of the VB
lower bound increases with n, this criticism could be considered too harsh indeed if
the bound were tight, we would expect the AIS score to violate the bound on roughly
50% of the runs anyway. The lower half of the table shows that, by combining AIS
estimates from separate runs, we obtain an estimate that violates the VB lower bound
far less often, and as expected we see the 50% violation rate for large amounts of data.
This is a very useful result, and obviates to some extent the MH samplers deciency
in all ve runs. Diagnostically speaking, this analysis is good example of the use of
readily-computable VB lower bounds for evaluating the reliability of the AIS method
36 Variational Bayesian EM for DAGs
post hoc.
Let us return to examining why the sampler is troubled for large data set sizes.
Figure 6(c) shows the fraction of accepted MH proposals during each of four quarters
of the annealing schedule used in the experiments. The rejection rate tends to increase
moving from the beginning of the schedule (the prior) to the end (the posterior), the
degradation becoming more pronounced for large data sets. This is most probably due
to the proposal width remaining unchanged throughout all the AIS implementations;
ideally, one would use a predetermined sequence of proposal widths which would be a
function of the amount of data, n, and the position along the schedule.
We can use a heuristic argument to roughly predict the optimal proposal width to
use for the AIS method. From mathematical arguments the precision of the posterior
distribution over parameters is approximately proportional to the size of the data set
n. Furthermore, the distribution being sampled from at step k of the AIS schedule is
eectively that resulting from a fraction (k) of the data. Therefore, these two factors
imply that the width of the MH proposal distribution should be inversely proportional
to
_
n(k). In the case of multinomial variables, since the variance of a Dirichlet
distribution is approximately inversely proportional to the strength , then the optimal
strength of the proposal distribution should be
opt
n(k) if its precision is to match
the posterior precision. Note that we are at liberty to set these proposal precisions
arbitrarily beforehand without causing the sampler to become biased.
We have not yet discussed the shape of the annealing schedule: should the inverse-
temperatures (k)
K
k=1
change linearly from 0 to 1, or follow some other function?
The particular annealing schedule in these experiments was chosen to be nonlinear,
lingering at higher temperatures for longer than at lower temperatures, according to
(k) =
e

k/K
1 k/K + e

k 0, . . . , K . (69)
For any setting of e

> 0, the series of temperatures is monotonic and the initial and


nal temperatures satisfy (62): (0) = 0, and (K) = 1. For large e

the schedule
becomes linear, and is plotted for dierent values of e

in Figure 6(d). A setting of


e

=0.2 was found to reduce the degree of hysteresis in the annealing ratios.
6 Comparison to Cheeseman-Stutz (CS) approximation
In this section we present two important theoretical results regarding the approximation
of Cheeseman and Stutz (1996), covered in Section 3.3. We briey review the CS
approximation, as used to approximate the marginal likelihood of nite mixture models,
and then show that it is in fact a lower bound on the marginal likelihood in the case
of mixture models (Minka 2001), and that similar CS constructions can be made for
any model containing hidden variables. This observation brings CS into the family of
bounding approximations, of which VB is a member. We then conclude the section by
presenting a construction that proves that VB can be used to obtain a bound that is
always tighter than CS.
Beal, M. J. and Ghahramani, Z. 37
Let m be a directed acyclic graph with parameters giving rise to an i.i.d. data
set denoted by y = y
1
, . . . , y
n
, with corresponding discrete hidden variables s =
s
1
, . . . , s
n
each of cardinality k. Let

be a result of an EM algorithm which has
converged to a local maximum in the likelihood p(y [ ), and let s = s
i

n
i=1
be a
completion of the hidden variables, chosen according to the posterior distribution over
hidden variables given the data and

, such that s
ij
= p(s
ij
= j [ y,

) i = 1, . . . , n.
Since we are completing the hidden variables with real, as opposed to discrete values,
this complete data set does not in general correspond to a realisable data set under
the generative model. This point raises the question of how its marginal probability
p(s, y [ m) is dened. We will see in the following theorem and proof (Theorem 6) that
both the completion required of the hidden variables and the completed data marginal
probability are well-dened, and follow from equations (77) and (78) below.
The CS approximation is given by
p(y [ m) p(y [ m)
CS
= p(s, y [ m)
p(y [

)
p(s, y [

)
. (70)
The CS approximation exploits the fact that, for many models of interest, the rst
term on the right-hand side the complete-data marginal likelihood is tractable
to compute (this is the case for discrete-variable directed acyclic graphs with Dirichlet
priors, as explained in Section 3.3). The term in the numerator of the second term on
the right-hand side is simply the likelihood, which is an output of the EM algorithm
(as is the parameter estimate

), and the denominator is a straightforward calculation
that involves no summations over hidden variables or integrations over parameters.
Theorem 6.1: (Cheeseman-Stutz is a lower bound) Let

be the result of the
M step of EM, and let p(s
i
[ y
i
,

)
n
i=1
be the set of posterior distributions over the
hidden variables obtained in the next E step of EM. Furthermore, let s = s
i

n
i=1
be a
completion of the hidden variables, such that s
ij
= p(s
ij
= j [ y,

) i = 1, . . . , n. Then
the CS approximation is a lower bound on the marginal likelihood:
p(y [ m)
CS
= p(s, y [ m)
p(y [

)
p(s, y [

)
p(y [ m) . (71)
Minka (2001) previously observed that in the specic case of mixture models, the
Cheeseman-Stutz criterion is a lower bound on the marginal likelihood, and this could
explain the reports of good performance in the literature (Cheeseman and Stutz 1996;
Chickering and Heckerman 1997). Our contribution here is a proof of the result given in
(71) that is generally applicable to any model with hidden variables, by using marginal
likelihood bounds with approximations over the posterior distribution of the hidden
variables only. We follow this with a corollary that allows us to always improve on the
CS bound using VB.
Proof of Theorem 6.1: The marginal likelihood can be lower bounded by introducing
38 Variational Bayesian EM for DAGs
a distribution over the settings of each data points hidden variables q
si
(s
i
):
p(y [ m) =
_
d p()
n

i=1
p(y
i
[ )
_
d p()
n

i=1
exp
_

si
q
si
(s
i
) ln
p(s
i
, y
i
[ )
q
si
(s
i
)
_
.
(72)
We place a similar lower bound over the likelihood
p(y [

) =
n

i=1
p(y
i
[

)
n

i=1
exp
_

si
q
si
(s
i
) ln
p(s
i
, y
i
[

)
q
si
(s
i
)
_
, (73)
which can be made an equality if, for each data point, q(s
i
) is set to the exact posterior
distribution given the parameter setting ,
p(y [

) =
n

i=1
p(y
i
[

) =
n

i=1
exp
_

si
q
si
(s
i
) ln
p(s
i
, y
i
[

)
q
si
(s
i
)
_
, (74)
where
q
si
(s
i
) p(s
i
[ y,

) , (75)
which is the result obtained from an exact E step with the parameters set to

. Now
rewrite the marginal likelihood bound (72), using this same choice of q
si
(s
i
), separate
those terms that depend on from those that do not, and substitute in to equation
(74) to obtain:
p(y [ m)
n

i=1
exp
_

si
q
si
(s
i
) ln
1
q
si
(s
i
)
_

_
d p()
n

i=1
exp
_

si
q
si
(s
i
) lnp(s
i
, y
i
[ )
_
(76)
=
p(y [

n
i=1
exp
_

si
q
si
(s
i
) lnp(s
i
, y
i
[

)
_
_
d p()
n

i=1
exp
_

si
q
si
(s
i
) lnp(s
i
, y
i
[ )
_
(77)
=
p(y [

n
i=1
p(s
i
, y
i
[

)
_
d p()
n

i=1
p(s
i
, y
i
[ ) , (78)
where s
i
are dened such that they satisfy:
lnp(s
i
, y [

) =

si
q
si
(s
i
) lnp(s
i
, y
i
[ ) =

si
p(s
i
[ y,

) lnp(s
i
, y
i
[ ) , (79)
where the second equality follows from the setting used in (75) that achieves a tight
bound. The existence of such a completion follows from the fact that, in discrete-variable
directed acyclic graphs of the sort considered in Chickering and Heckerman (1997), the
hidden variables appear only linearly in logarithm of the joint probability p(s, y [ ).
Beal, M. J. and Ghahramani, Z. 39
Equation (78) is the Cheeseman-Stutz criterion of (27) and (71), and is also a lower
bound on the marginal likelihood.
It is possible to derive CS-like approximations for types of graphical model other
than discrete-variables DAGs. In the above proof, no constraints were placed on the
forms of the joint distributions over hidden and observed variables, other than in the
simplifying step in equation (78).
Finally, the following corollary gives some theoretical justication to the empiri-
cally observed superior performance of VB over CS. We present an original key result:
that variational Bayes can always obtain a tighter bound than the Cheeseman-Stutz
approximation.
Corollary 6.2: (VB is at least as tight as CS) That is to say, it is always possible
to nd distributions q
s
(s) and q

() such that
lnp(y [ m)
CS
T
m
(q
s
(s), q

()) lnp(y [ m) . (80)


Proof of Corollary 6.2: Consider the following forms for q
s
(s) and q

():
q
s
(s) =
n

i=1
q
si
(s
i
) , with q
si
(s
i
) = p(s
i
[ y
i
,

) , (81)
q

() lnp()p(s, y [ ))
qs(s)
. (82)
We write the form for q

() explicitly:
q

() =
p()

n
i=1
exp
_
si
q
si
(s
i
) lnp(s
i
, y
i
[ )
_
_
d

p(

n
i=1
exp
_
si
q
si
(s
i
) lnp(s
i
, y
i
[

)
_ , (83)
and note that this is exactly the result of a VBM step. We substitute (83) directly into
the VB lower bound stated in equation (33):
T
m
(q
s
(s), q

()) =
_
d q

()
n

i=1

si
q
si
(s
i
) ln
p(s
i
, y
i
[ )
q
si
(s
i
)
+
_
d q

() ln
p()
q

()
(84)
=
_
d q

()
n

i=1

si
q
si
(s
i
) ln
1
q
si
(s
i
)
+
_
d q

() ln
_
d

p(

)
n

i=1
exp
_

si
q
si
(s
i
) lnp(s
i
, y
i
[

)
_
(85)
=
n

i=1

si
q
si
(s
i
) ln
1
q
si
(s
i
)
+ ln
_
d p()
n

i=1
exp
_

si
q
si
(s
i
) lnp(s
i
, y
i
[ )
_
,
(86)
40 Variational Bayesian EM for DAGs
which is exactly the logarithm of equation (76). And so with this choice of q

() and
q
s
(s), we achieve equality between the CS and VB approximations in (80). We complete
the proof of Corollary 6.2 by noting that any further VB optimisation is guaranteed to
increase or leave unchanged the lower bound, and hence surpass the CS lower bound.
We would expect the VB lower bound starting from the CS solution to improve upon the
CS lower bound in all cases, except in the very special case when the MAP parameter

is exactly the variational Bayes point, dened as


BP

1
(())
q

()
). Since VB
is a lower bound on the marginal likelihood, the entire statement of (80) is proven.
7 Summary
In this paper we have presented various scoring methods for approximating the marginal
likelihood of discrete directed graphical models with hidden variables. We presented
EM algorithms for ML and MAP parameter estimation, showed how to calculate the
asymptotic criteria of BIC and Cheeseman-Stutz, and derived the VBEM algorithm for
approximate Bayesian learning which maintains distributions over the parameters of
the model and has the same complexity as the EM algorithm. We also presented an
Annealed Importance Sampling method designed for discrete-variable DAGs.
Our experiments show that VB consistently outperforms BIC and CS, and that VB
performs, respectively, as well as and more reliably than AIS for intermediate and large
sizes of data. The AIS method has many parameters to tune and requires knowledge
of the model domain to design ecient and reliable sampling schemes and annealing
schedules. VB, on the other hand, has not a single parameter to set or tune, and can be
applied without any expert knowledge, at least in the class of singly-connected discrete-
variable DAGs with Dirichlet priors which we have considered in this paper. Perhaps the
most compelling evidence for the reliability of the VB approximation is given in Figure
3(c), which shows that the KL divergences of the VB-computed posterior distributions
from the AIS standards are much smaller than for the other competing approximations.
It may be that there exists a better AIS scheme than sampling in parameter space.
To be more specic, for any completion of the data the parameters of the model can
be integrated out tractably (at least for the class of models examined in this chapter);
thus an AIS scheme which anneals in the space of completions of the data may be more
ecient than the current scheme which anneals in the space of parameters.
1
However,
this latter scheme may only be ecient for models with little data compared to the
number of parameters, as the sampling space of all completions increases linearly with
the amount of data. This avenue of research is left to further work.
This paper has presented a novel application of variational Bayesian methods to
discrete DAGs. In the literature there have been other attempts to solve this long-
standing model selection problem in DAGs with hidden variables. For example, the
structural EM algorithm of Friedman (1998) uses a structure search algorithm which
uses a scoring algorithm very similar to the VBEM algorithm presented here, except
1
personal communication with R. Neal
Beal, M. J. and Ghahramani, Z. 41
that for tractability, the distribution over is replaced by the MAP estimate,
MAP
. We
have shown here how the VB framework enables us to use the entire distribution over
for inference of the hidden variables. Very recently, Rusakov and Geiger (2005) have
presented a modied BIC score that is asymptotically correct for the type of models we
have examined in this article; future work will involve comparing VB to this modied
BIC score in the non-asymptotic regime.
We have proved that the Cheeseman-Stutz score is a lower bound on the marginal
likelihood in the case of general graphical models with hidden variables, extending the
mixture model result of Minka (2001); and more importantly we proved that there ex-
ists a construction which is guaranteed to produce a variational Bayesian lower bound
that is at least as tight as the Cheeseman-Stutz score (Corollary 6.2 to Theorem 6.1).
This construction builds a variational Bayesian approximation using the same MAP
parameter estimate used to obtain the CS score. However, we did not use this construc-
tion in our experiments, preferring the EM and VBEM algorithms to evolve separately
(although similar parameter initialisations were employed for fairness). As a result we
cannot guarantee that the VB bound is in all runs tighter than the CS bound, as the
dynamics of the optimisations for MAP learning and VB learning may in general lead
even identically initialised algorithms to dierent optima in parameter space (or pa-
rameter distribution space). Nevertheless, we have still seen improvement in terms of
ranking of the true structure by VB as compared to CS. Empirically, the VB lower
bound was observed to be lower than the CS score in only 173 of the 288320 total scores
calculated (only about 0.06%), whereas had we used the construction, which we note
is available to us for any graphical model with hidden variables, then this would have
occurred exactly zero times.
Traditionally, the statistics community have concentrated on MCMC sampling and
asymptotic criteria for computing marginal likelihoods for model selection and averag-
ing. This article has applied the variational Bayes algorithm to scoring directed graphs
and shown it to be empirically superior to existing criteria and, more importantly, the-
oretically superior to the popular Cheeseman-Stutz criterion. We believe that VB will
prove to be of use in many other models, improving the eciency of inference and model
selection tasks without compromising accuracy.
References
Attias, H. (1999a). Independent Factor Analysis. Neural Computation, 11: 803851.
(1999b). Inferring Parameters and Structure of Latent Variable Models by Varia-
tional Bayes. In Proc. 15th Conf. on Uncertainty in Articial Intelligence.
(2000). A variational Bayesian framework for graphical models. In Solla, S. A.,
Leen, T. K., and M uller, K. (eds.), Advances in Neural Information Processing Sys-
tems 12. Cambridge, MA: MIT Press.
Beal, M. J. (2003). Variational Algorithms for Approximate Bayesian Inference. Ph.D.
thesis, Gatsby Computational Neuroscience Unit, University College London, UK.
42 Variational Bayesian EM for DAGs
Bishop, C. M. (1999). Variational PCA. In Proc. Ninth Int. Conf. on Articial Neural
Networks. ICANN.
Cheeseman, P. and Stutz, J. (1996). Bayesian Classication (Autoclass): Theory and
Results. In Fayyad, U. M., Piatesky-Shapiro, G., Smyth, P., and Uthurusamy, R.
(eds.), Advances in Knowledge Discovery and Data Mining, 153180. Menlo Park,
CA: AAAI Press/MIT Press.
Chickering, D. M. and Heckerman, D. (1997). Ecient Approximations for the
Marginal Likelihood of Bayesian Networks with Hidden Variables. Machine Learn-
ing, 29(23): 181212.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Series B (Methodological), 39: 138.
Friedman, N. (1998). The Bayesian structural EM algorithm. In Proc. Fourteenth
Conference on Uncertainty in Articial Intelligence (UAI 98. San Francisco, CA:
Morgan Kaufmann Publishers.
Gelman, A. and Meng, X. (1998). Simulating normalizing constants: From importance
sampling to bridge sampling to path sampling. Statistical Science, 13: 163185.
Ghahramani, Z. and Beal, M. J. (2000). Variational inference for Bayesian mixtures
of factor analysers. In Advances in Neural Information Processing Systems 12.
Cambridge, MA: MIT Press.
(2001). Propagation algorithms for variational Bayesian learning. In Advances in
Neural Information Processing Systems 13. Cambridge, MA: MIT Press.
Ghahramani, Z. and Hinton, G. E. (2000). Variational Learning for Switching State-
Space Models. Neural Computation, 12(4).
Ghahramani, Z. and Jordan, M. I. (1997). Factorial hidden Markov models. Machine
Learning, 29: 245273.
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and
Their Applications. Biometrika, 57(1): 97109.
Heckerman, D. (1996). A tutorial on learning with Bayesian networks. Technical
Report MSR-TR-95-06 [ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS] , Mi-
crosoft Research.
Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks:
the combination of knowledge and statistical data. Machine Learning, 20(3): 197
243.
Hinton, G. E. and van Camp, D. (1993). Keeping Neural Networks Simple by Mini-
mizing the Description Length of the Weights. In Sixth ACM Conference on Com-
putational Learning Theory, Santa Cruz.
Beal, M. J. and Ghahramani, Z. 43
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, Minimum Description Length,
and Helmholtz Free Energy. In Cowan, J. D., Tesauro, G., and Alspector, J. (eds.),
Advances in Neural Information Processing Systems 6. San Francisco, CA: Morgan
Kaufmann.
Jaakkola, T. S. (1997). Variational Methods for Inference and Estimation in Graph-
ical Models. Technical Report Ph.D. Thesis, Department of Brain and Cognitive
Sciences, MIT, Cambridge, MA.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American
Statistical Association, 90: 773795.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220(4598): 671680.
MacKay, D. J. C. (1995). Probable networks and plausible predictions a review of
practical Bayesian methods for supervised neural networks. Network: Computation
in Neural Systems, 6: 469505.
(1997). Ensemble Learning for Hidden Markov Models. Technical report,
Cavendish Laboratory, University of Cambridge.
(1998). Choice of Basis for Laplace Approximation. Machine Learning, 33(1).
Metropolis, N., Rosenbluth, A. W., Teller, M. N., and Teller, E. (1953). Equation of
state calculations by fast computing machines. Journal of Chemical Physics, 21:
10871092.
Minka, T. P. (2001). Using lower bounds to approximate integrals.
Neal, R. M. (1992). Connectionist learning of belief networks. Articial Intelligence,
56: 71113.
(1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical
Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
(1996). Bayesian Learning in Neural Networks. Springer-Verlag.
(2001). Annealed importance sampling. Statistics and Computing, 11: 125139.
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justies
incremental, sparse, and other variants. In Jordan, M. I. (ed.), Learning in Graphical
Models, 355369. Kluwer Academic Publishers.
Rusakov, D. and Geiger, D. (2005). Asymptotic Model Selection for Naive Bayesian
Networks. Journal of Machine Learning Research, 6: 135.
Saul, L. K. and Jordan, M. I. (1996). Exploiting Tractable Substructures in Intractable
networks. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (eds.), Advances
in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
44 Variational Bayesian EM for DAGs
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics,
6: 461464.
Torrie, G. M. and Valleau, J. P. (1977). Nonphysical sampling distributions in Monte
Carlo free energy estimation: Umbrella sampling. J. Comp. Phys., 23: 187199.
Waterhouse, S., MacKay, D. J. C., and Robinson, T. (1996). Bayesian methods for
Mixtures of Experts. In Advances in Neural Information Processing Systems 8.
Cambridge, MA: MIT Press.
About the Authors
Matthew J. Beal completed his PhD in July 2003 at the Gatsby Computational Neuroscience
Unit, University College London, UK, and was a postdoctoral fellow with Radford Neal in
the Computer Science department at the University of Toronto, Canada. He is currently an
Assistant Professor in the Computer Science and Engineering department at the University at
Bualo, the State University of New York, NY 14260, USA.
Zoubin Ghahramani is a Professor of Information Engineering in the Department of Engi-
neering at the University of Cambridge, UK. He also holds an Associate Researcher Professor
appointment in the Center for Automated Learning and Discovery at the School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Prior to Cambridge, he was
a Reader at the Gatsby Computational Neuroscience Unit, University College London, UK,
and wishes to acknowledge its support for the research reported in this article.

You might also like