
Modelling Gene Expression Data using Dynamic Bayesian Networks

Kevin Murphy and Saira Mian


Computer Science Division, University of California
Life Sciences Division, Lawrence Berkeley National Laboratory
Berkeley, CA 94720
Tel (510) 642 2128, Fax (510) 642 5775
[email protected], [email protected]

Abstract

Recently, there has been much interest in reverse engineering genetic networks from time series data. In this paper, we show that most of the proposed discrete time models — including the boolean network model [Kau93, SS96], the linear model of D'haeseleer et al. [DWFS99], and the nonlinear model of Weaver et al. [WWS99] — are all special cases of a general class of models called Dynamic Bayesian Networks (DBNs). The advantages of DBNs include the ability to model stochasticity, to incorporate prior knowledge, and to handle hidden variables and missing data in a principled way. This paper provides a review of techniques for learning DBNs. Keywords: Genetic networks, boolean networks, Bayesian networks, neural networks, reverse engineering, machine learning.

1 Introduction

Recently, it has become possible to experimentally measure the expression levels of many genes simultaneously, as they change over time and react to external stimuli (see e.g., [WFM+98, DLB97]). In the future, the amount of such experimental data is expected to increase dramatically. This increases the need for automated ways of discovering patterns in such data. Ultimately, we would like to automatically discover the structure of the underlying causal network that is assumed to generate the observed data.

In this paper, we consider learning stochastic, discrete time models with discrete or continuous state, and hidden variables. This generalizes the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and the popular boolean network model [Kau93, SS96], all of which are deterministic and fully observable.

The fact that our models are stochastic is very important, since it is well known that gene expression is an inherently stochastic phenomenon [MA97]. In addition, even if the underlying system were deterministic, it might appear stochastic due to our inability to perfectly measure all the variables. Hence it is crucial that our learning algorithms be capable of handling noisy data. For example, suppose the underlying system really is a boolean network, but that we have noisy observations of some of the variables. Then the data set might contain inconsistencies, i.e., there might not be any boolean network which can model it. Rather than giving up, we should look for the most probable model given the data; this of course requires that our model have a well-defined probabilistic semantics.

The ability of our models to handle hidden variables is also important. Typically, what is measured (usually mRNA levels) is only one of the factors that we care about; other ones include cDNA levels, protein levels, etc. Often we can model the relationship between these factors, even if we cannot measure their values. This prior knowledge can be used to constrain the set of possible models we learn.

The models we use are called Bayesian (belief) Networks (BNs) [Pea88], which have become the method of choice for representing stochastic models in the UAI (Uncertainty in Artificial Intelligence) community. In Section 2, we explain what BNs are, and show how they generalize the boolean network model [Kau93, SS96], Hidden Markov Models [DEKM98], and other models widely used in the computational biology community. In Sections 3 to 7, we review various techniques for learning BNs from data, and show how REVEAL [LFS98] is a special case of such an algorithm. In Section 8, we consider BNs with continuous (as opposed to discrete) state, and discuss their relationship to the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and techniques from the neural network literature [Bis95].

2 Bayesian Networks

BNs are a special case of a more general class called graphical models, in which nodes represent random variables and the lack of arcs represents conditional independence assumptions. Undirected graphical models, also called Markov Random Fields (MRFs; see e.g., [WMS94] for an application in biology), have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given all the other nodes if they are separated in the graph. By contrast, directed graphical models (i.e., BNs) have a more complicated notion of independence, which takes into account the directionality of the arcs (see Figure 1). Graphical models with both directed and undirected arcs are called chain graphs.
Figure 1: The Bayes-Ball algorithm. Two (sets of) nodes X and Y are conditionally independent (d-separated [Pea88]) given all the others if and only if there is no way for a ball to get from X to Y in the graph. Hidden nodes are nodes whose values are not known, and are depicted as unshaded; observed nodes are shaded. The dotted arcs indicate direction of flow of the ball. The ball cannot pass through hidden nodes with convergent arrows (top left), nor through observed nodes with any outgoing arrows. See [Sha98] for details.

Figure 2: (a) A Markov Chain represented as a Dynamic Bayesian Net (DBN). (b) A Hidden Markov Model (HMM) represented as a DBN. Shaded nodes are observed, non-shaded nodes are hidden.

In a BN, one can intuitively regard an arc from X to Y as indicating the fact that X "causes" Y. (For a more formal treatment of causality in the context of BNs, see [HS95].) Since evidence can be assigned to any subset of the nodes (i.e., any subset of nodes can be observed), BNs can be used for both causal reasoning (from known causes to unknown effects) and diagnostic reasoning (from known effects to unknown causes), or any combination of the two. The inference algorithms which are needed to do this are briefly discussed in Section 5.1. Note that, if all the nodes are observed, there is no need to do inference, although we might still want to do learning.

In addition to causal and diagnostic reasoning, BNs support the powerful notion of "explaining away": if a node is observed, then its parents become dependent, since they are rival causes for explaining the child's value (see the bottom left case in Figure 1). In contrast, in an undirected graphical model, the parents would be independent, since the child separates (but does not d-separate) them.

Some other important advantages of directed graphical models over undirected ones include the fact that BNs can encode deterministic relationships, and that it is easier to learn BNs (see Section 3) since they are separable models (in the sense of [Fri98]). Hence we shall focus exclusively on BNs in this paper. For a careful study of the relationship between directed and undirected graphical models, see [Pea88, Whi90, Lau96].1

1 It is interesting to note that much of the theory underlying graphical models involves concepts such as chordal (triangulated) graphs [Gol80], which also arise in other areas of computational biology, such as evolutionary tree construction (perfect phylogenies) and physical mapping (interval graphs).

2.1 Relationship to HMMs

For our first example of a BN, consider Figure 2(a). We call this a Dynamic Bayesian Net (DBN) because it represents how the random variable X evolves over time (three time slices are shown). From the graph, we see that X_{t+1} is independent of X_{t-1} given X_t (since X_t blocks the only path for the Bayes ball between X_{t-1} and X_{t+1}). This, of course, is the (first-order) Markov property, which states that the future is independent of the past given the present.

Now consider Figure 2(b). X_t is as before, but is now hidden. What we observe at each time step is Y_t, which is another random variable whose distribution depends on (and only on) X_t. Hence this graph captures all and only the conditional independence assumptions that are made in a Hidden Markov Model (HMM) [Rab89].

In addition to the graph structure, a BN requires that we specify the Conditional Probability Distribution (CPD) of each node given its parents. In an HMM, we assume that the hidden state variables X_t are discrete, and have a distribution given by Pr(X_{t+1} = j | X_t = i) = A(i, j). (A is the transition matrix for time slice t.) If the observed variables Y_t are discrete, we can specify their distribution by Pr(Y_t = j | X_t = i) = B(i, j). (B is the observation matrix for time slice t.) However, in an HMM, it is also possible for the observed variables to be Gaussian, in which case we must specify the mean and covariance for each value of the hidden state variable, and each value of t: see Section 8.

As a (hopefully!) familiar example of HMMs, let us consider the way that they are used for aligning protein sequences [DEKM98]. In this case, the hidden state variable can take on three possible values, X_t in {D, I, M}, which represent delete, insert and match respectively. In protein alignment, the subscript t does not refer to time, but rather to position along a static sequence. This is an important difference from gene expression, where t really does represent time (see Section 3).

The observable variable Y_t can take on 21 possible values, which represent the 20 possible amino acids, plus the gap alignment character "-". The probability distribution over these 21 values depends on the current position t and on the current state of the system, X_t. Thus the distribution Pr(Y_t = j | X_t = M) is the profile for position t, Pr(Y_t = j | X_t = I) is the ("time"-invariant) "background" distribution, and Pr(Y_t = j | X_t = D) is 1.0 if j = "-" and is 0.0 otherwise, for all t.

In an HMM, "learning" means finding estimates of the parameters A and B (see Section 4.1); the structure of the model is fixed (namely, it is Figure 2(b)).
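To make these CPDs concrete, here is a minimal sketch (Python; the state and symbol counts and all probability values are made up for illustration) of an HMM written as a DBN — a transition matrix A, an observation matrix B, and ancestral sampling of the unrolled network.

```python
import numpy as np

rng = np.random.default_rng(0)

# CPDs of the HMM viewed as a DBN: Pr(X_{t+1}=j | X_t=i) = A[i,j],
# Pr(Y_t=j | X_t=i) = B[i,j].  All numbers below are illustrative only.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # 2 hidden states
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])     # 3 observation symbols
pi = np.array([0.5, 0.5])           # distribution over the first hidden state

def sample(T):
    """Ancestral sampling: sample each node given its parents, in topological order."""
    xs, ys = [], []
    x = rng.choice(len(pi), p=pi)
    for _ in range(T):
        xs.append(x)
        ys.append(rng.choice(B.shape[1], p=B[x]))
        x = rng.choice(A.shape[1], p=A[x])
    return xs, ys

print(sample(5))
```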
Figure 3: (a) A DBN. (b) The same model represented as a boolean network.

Figure 4: The noisy-OR gate modelled as a BN. A' is a noisy version of A, and B' is a noisy version of B. C is a deterministic OR of A' and B' (indicated by the double ring). Shaded nodes are observed, non-shaded nodes are hidden.

2.2 Relationship to boolean networks

Now consider the DBN in Figure 3(a). (We only show two time slices of the DBN, since the structure repeats.) Just as an HMM can be thought of as a single stochastic automaton, so this can be thought of as a network of four interacting stochastic automata: node A_t represents the state of automaton A at time t, for example; its next state is determined by its own previous state and by the previous states of its parents in the preceding slice.

The interconnectivity between the automata can also be represented as in Figure 3(b). The presence of directed cycles means that this is not a BN; rather, an arc from node i to node j means M(i, j) = 1, where M is the inter-slice adjacency matrix.

In addition to connections between time slices, we can also allow connections within a time slice (intra-slice connections). This can be used to model co-occurrence effects, e.g., gene A causes C to turn on if and only if B is also turned on. Since this relationship between A and B is not causal, it is more natural to model this with an undirected arc; the resulting model would therefore be a chain graph, which we don't consider in this paper.

The transition matrix of each automaton (e.g., Pr(A_{t+1} = 1 | A_t, B_t)) is often called a CPT (Conditional Probability Table), since it represents the CPD in tabular form. If each row of the CPT only contains a single non-zero value (which must therefore be 1.0), then the automaton is in fact a deterministic automaton. If all the nodes (automata) are deterministic and have two states, the system is equivalent to a boolean network [Kau93, SS96].

In a boolean network, "learning" means finding the best structure, i.e., inter-slice connectivity matrix (see Section 6). Once we know the correct structure, it is easy to figure out what logical rule each node is using, e.g., by exhaustively enumerating them all and finding the one that fits the data.

2.3 Stochastic boolean networks

We can define a stochastic boolean network model by modifying the CPDs. For example, a popular CPD in the UAI community is the noisy-OR model [Pea88]. This is just like an OR gate, except that there is a certain probability q_i that the i'th input will be flipped from 1 to 0. For example, if C = noisyor(A, B), we can represent its CPD in tabular form as follows:

A B | Pr(C = 0)  | Pr(C = 1)
0 0 | 1.0        | 0
1 0 | q_1        | 1 - q_1
0 1 | q_2        | 1 - q_2
1 1 | q_1 q_2    | 1 - q_1 q_2

Note that this CPT has 2^{n+1} entries, where n is the number of parents (here, n = 2), but that there are constraints on these entries. For a start, the requirement that Pr(C = 0 | a, b) + Pr(C = 1 | a, b) = 1 for all values of a and b eliminates one of the columns, resulting in 2^n entries. However, the entries in the remaining column are themselves functions of just n free parameters, q_1, ..., q_n. Hence this is called an implicit or parametric distribution, and has the advantage that less data is needed to learn the parameter values (see Section 3). By contrast, a full multinomial distribution (i.e., an unconstrained CPT) would have 2^n parameters, but could model arbitrary interactions between the parents.

The interpretation of the noisy-OR gate is that each input is an independent cause, which may be inhibited. This can be modelled by the BN shown in Figure 4. It is common to imagine that one of the parents is permanently on, so that the child can turn on "spontaneously". This "leak" node can be used as a catch-all for all other, unmodelled causes.

It is straightforward to generalize the noisy-OR gate to non-binary variables and other functions such as AND [Hen89, Sri93, HB94]. It is also possible to loosen the assumption that all the causes are independent [MH97].

Another popular compact representation for CPDs in the UAI community is a decision tree [BFGK96]. This is a stochastic generalization of the concept of canalyzing function [Kau93], popular in the boolean networks field. (A function is canalyzing if at least one of its inputs has the property that, when it takes a specific value, the output of the function is independent of all the remaining inputs.)
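To illustrate how n inhibition probabilities determine the whole table, the following sketch (Python; the q values are arbitrary) expands noisy-OR parameters into an explicit CPT with 2^n rows.

```python
from itertools import product

def noisy_or_cpt(q):
    """Expand noisy-OR inhibition probabilities q[i] into an explicit CPT.

    Returns a dict mapping each parent configuration (tuple of 0/1) to
    (Pr(C=0 | parents), Pr(C=1 | parents)).  Pr(C=0) is the product of the
    inhibition probabilities of the parents that are on.
    """
    cpt = {}
    for config in product([0, 1], repeat=len(q)):
        p_off = 1.0
        for on, qi in zip(config, q):
            if on:
                p_off *= qi
        cpt[config] = (p_off, 1.0 - p_off)
    return cpt

# Two parents with inhibition probabilities q1 and q2 (illustrative values).
for parents, probs in noisy_or_cpt([0.1, 0.2]).items():
    print(parents, probs)
```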
2.4 Comparison of DBNs and HMMs

Note that we can convert the model in Figure 3(a) into the one in Figure 2(a), by defining a new variable whose state space is the cross product of the original variables, X_t = (A_t, B_t, C_t, D_t), i.e., by "collapsing" the original model into a chain. But now the conditional independence relationships are "buried" inside the transition matrix. In particular, the entries in the transition matrix are products of a smaller number of parameters, i.e.,

Pr(X_t = x_t | X_{t-1} = x_{t-1}) = \prod_i Pr(X_t^{(i)} = x_t^{(i)} | Pa(X_t^{(i)}) = pa_t^{(i)}),

where X_t^{(i)} is the i'th component of X_t and pa_t^{(i)} denotes the values of its parents in slice t-1. So 0s in the transition matrix do not correspond to absence of arcs in the model. Rather, if A(i, j) = 0, it just means the automaton cannot get from state i to state j — this says nothing about the connections between the underlying state variables. Thus Markov chains and HMMs are not a good representation for sparse, discrete models of the kind we are considering here.

2.5 Higher-order models

We have assumed that our models have the first-order Markov property, i.e., that there are only connections between adjacent time slices. This assumption is without loss of generality, since it is easy to convert a higher-order Markov chain to a first-order Markov chain, simply by adding new variables (called lag variables) which contain the old state, and clamping the transition matrices of these new variables to the identity matrix. (Note that doing this for an HMM (as opposed to a DBN) would blow up the state space exponentially.)

2.6 Relationship to other probabilistic models

Although the emphasis in this section has been on models which are relevant to gene expression, we should remark that Bayesian Nets can also be used in evolutionary tree construction and genetic pedigree analysis. (In fact, the peeling (Elston-Stewart) algorithm was invented over a decade before being rediscovered in the UAI community.) Also, Bayesian networks can be augmented with utility and decision nodes, to create an influence diagram; this can be used to compute an optimal action or sequence of actions (optimal in the decision-theoretic sense of maximizing expected utility). This might be useful for designing gene-knockout experiments, although we don't pursue this issue here.
3 Learning Bayesian Networks

In this section, we discuss how to learn BNs from data. The problem can be summarized as follows.

Structure / Observability | Method
Known, full               | Sample statistics
Known, partial            | EM or gradient ascent
Unknown, full             | Search through model space
Unknown, partial          | Structural EM

Full observability means that the values of all variables are known; partial observability means that we do not know the values of some of the variables. This might be because they cannot, in principle, be measured (in which case they are usually called hidden variables), or because they just happen to be unmeasured in the training data (in which case they are called missing variables). Note that a variable can be observed intermittently.

Unknown structure means we do not know the complete topology of the graph. Typically we will know some parts of it, or at least know some properties the graph is likely to have, e.g., it is common to assume a bound, k, on the maximum number of parents (fan-in) that a node can take on, and we may know that nodes of a certain "type" only connect to other nodes of the same type. Such prior knowledge can either be a "hard" constraint, or a "soft" prior.

To exploit such soft prior knowledge, we must use Bayesian methods, which have the additional advantage that they return a distribution over possible models instead of a single best model. Handling priors on model structures, however, is quite complicated, so we do not discuss this issue here (see [Hec96] for a review). Instead, we assume that the goal is to find a single model which maximizes some scoring function (discussed in Section 6.1). We will, however, consider priors on parameters, which are much simpler. In Section 8.1, when we consider DBNs in which all the variables are continuous, we will see that we can use numerical priors as a proxy for structural priors.

An interesting approximation to the full-blown Bayesian approach is to learn a mixture of models, each of which has different structure, depending on the value of a hidden, discrete, mixture node. This has been done for Gaussian networks [TMCH98] and discrete, undirected trees2 [MJM98]; unfortunately, there are severe computational difficulties in generalizing this technique to arbitrary, discrete networks.

2 In one particularly relevant example, [MJM98] applies her algorithm to the problem of classifying DNA splice sites as intron-exon or exon-intron. When she examined each tree in the mixture to find the nodes which are most commonly connected to the class variable, she found that they were the nodes in the vicinity of the splice junction, and further, that their CPTs encoded the known AG/G pattern.

In this paper, we restrict our attention to learning the structure of the inter-slice connectivity matrix of the DBN. This has the advantage that we do not need to worry about introducing (directed) cycles, and also that we can use the "arrow of time" to disambiguate the direction of causal relations, i.e., we know an arc must go from X_{t-1} to X_t and not vice versa, because the network models the temporal evolution of a physical system. Of course, we still cannot completely overcome the problem that "correlation is not causation": it is possible that the correlation between X_{t-1} and X_t is due to one or more hidden common causes (or observed common effects, because of the explaining away phenomenon). Hence the models learned from data should be subject to experimental verification.3

3 [HMC97] discusses Bayesian techniques for learning causal (static) networks, and [SGS93] discusses constraint based techniques — but see [Fre97] for a critique of this approach. These techniques have mostly been used in the social sciences. One major advantage of molecular biology is that it is usually feasible to verify hypotheses experimentally.

When the structure of the model is known, the learning task becomes one of parameter estimation. Although the focus of this paper is on learning structure, structure learning calls parameter learning as a subroutine, so we start by discussing this case.

Finally, we mention that learning BNs is a huge subject, and in this paper, we only touch on the aspects that we consider most relevant to learning genetic networks from time series data. For further details, see the review papers [Bun94, Bun96, Hec96].

4 Known structure, full observability

We assume that the goal of learning in this case is to find the values of the parameters of each CPD which maximize the likelihood of the training data, which contains M sequences (assumed to be independent), each of which has the observed values of all n nodes per slice for each of T slices. For notational simplicity, we assume each sequence is of the same length. Thus we can imagine "unrolling" a two-slice DBN to produce a (static) BN with T slices.

We assume the parameter values for all nodes are constant (tied) across time (in contrast to the assumption made in HMM protein alignment), so that for a time series of length T, we get one data point (sample) for each CPD in the initial slice, and T - 1 data points for each of the other CPDs. If M = 1, we cannot reliably estimate the parameters of the nodes in the first slice, so we usually assume these are fixed a priori. That leaves us with M(T - 1) samples for each of the remaining CPDs. In cases where M(T - 1) is small compared to the number of parameters that require fitting, we can use a numerical prior to regularize the problem (see Section 4.1). In this case, we call the estimates Maximum A Posteriori (MAP) estimates, as opposed to Maximum Likelihood (ML) estimates.

Using the Bayes Ball algorithm (see Figure 1), it is easy to see that each node is conditionally independent of its non-descendants given its parents (indeed, this is often taken as the definition of a BN), and hence, by the chain rule of probability, we find that the joint probability of all the nodes in the graph is

Pr(X_1, ..., X_N) = \prod_{i=1}^{N} Pr(X_i | Pa(X_i)),

where N = n(T - 1) is the number of nodes in the unrolled network (excluding the first slice4), Pa(X_i) are the parents of node X_i, and nodes are numbered in topological order (i.e., parents before children). The normalized log-likelihood of the training set D = {D_1, ..., D_M} is L = (1/M) log Pr(D), which is a sum of terms, one for each node:

L = (1/M) \sum_{i=1}^{N} \sum_{m=1}^{M} \log Pr(X_i | Pa(X_i), D_m).    (1)

4 Since we are not trying to estimate the parameters of nodes in the initial slice, we have omitted their contribution to the overall probability, for simplicity.

We see that the log-likelihood scoring function decomposes according to the structure of the graph, and hence we can maximize the contribution to the log-likelihood of each node independently (assuming the parameters in each node are independent of the other nodes).

4.1 Parameter estimation

For discrete nodes with CPTs, we can compute the ML estimates by simple counting. As is well known from the HMM literature, however, ML estimates of CPTs are prone to sparse data problems, which can be solved by using (mixtures of) Dirichlet priors (pseudo counts).

Estimating the parameters of noisy-OR distributions (and its relatives) is harder, essentially because of the presence of hidden variables (see Figure 4), so we must use the techniques discussed in Section 5.
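A minimal sketch of this counting procedure follows (Python; the data layout and the pseudo-count value are our own choices for the example): with pseudo_count = 0 it returns the ML estimate, and with pseudo_count > 0 a Dirichlet-smoothed (MAP-style) estimate that avoids zero counts on sparse data.

```python
import numpy as np

def estimate_cpt(child, parents, card_child, card_parents, pseudo_count=1.0):
    """Estimate Pr(child | parents) from fully observed data by counting.

    child:   length-M array of child values (0 .. card_child-1)
    parents: M x p array of parent values
    pseudo_count: Dirichlet pseudo count added to every cell.
    """
    n_rows = int(np.prod(card_parents))
    counts = np.full((n_rows, card_child), pseudo_count)
    for pa, c in zip(parents, child):
        row = int(np.ravel_multi_index(tuple(pa), card_parents))
        counts[row, c] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy example: one binary child with two binary parents.
rng = np.random.default_rng(1)
pa = rng.integers(0, 2, size=(50, 2))
ch = (pa[:, 0] | pa[:, 1]).astype(int)          # a (noise-free) OR relationship
print(estimate_cpt(ch, pa, 2, (2, 2)))
```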
4.2 Choosing the form of the CPDs

In the above discussion, we assumed that we knew the form of the CPD for each node. If we are uncertain, one approach is to use a mixture distribution, although this introduces a hidden variable, which makes things more complicated (see Section 5). Alternatively, we can use a decision tree [BFGK96], or a table of parent values along with their associated non-zero probabilities [FG96], to represent the CPD. This can increase the number of free parameters gradually, from 1 to 2^n, where n is the number of parents.

5 Known structure, partial observability

When some of the variables are not observed, the likelihood surface becomes multimodal, and we must use iterative methods, such as EM [Lau95] or gradient ascent [BKRK97], to find a local maximum of the ML/MAP function. These algorithms need to use an inference algorithm to compute the expected sufficient statistics (or related quantity) for each node.

5.1 Inference in BNs

Inference in Bayesian networks is a huge subject which we will not go into in this paper. See [Jen96] for an introduction to one of the most commonly used algorithms (the junction tree algorithm). [HD94] gives a good cookbook introduction to the junction tree algorithm. [SHJ97] explains how the forwards-backwards algorithm is a special case of the junction tree algorithm, and might be a good place to start if you are familiar with HMMs. [Dec98] would be a good place to start if you are familiar with the peeling algorithm (although the junction tree approach is much more efficient for learning). [Mur99] discusses inference in BNs with discrete and continuous variables, and efficient techniques for handling networks with many observed nodes.

Exact inference in densely connected (discrete) BNs is computationally intractable, so we must use approximate methods. There are many approaches, including sampling methods such as MCMC [Mac98] and variational methods such as mean-field [JGJS98]. DBNs are even more computationally intractable in the sense that, even if the connections between two slices are sparse, correlations can arise over several time steps, thus rendering the unrolled network "effectively" dense. A simple approximate inference algorithm for the specific case of sparse DBNs is described in [BK98b, BK98a].
Figure 5: The subsets of {1, 2, 3, 4} arranged in a lattice. The k'th row represents subsets of size k, for 0 <= k <= 4.

6 Unknown structure, full observability

We start by discussing the scoring function which we use to select models; we then discuss algorithms which attempt to optimize this function over the space of models, and finally examine their computational and sample complexity.

6.1 The objective function used for model selection

It is commonly assumed that the goal is to find the model with maximum likelihood. Often (e.g., as in the REVEAL algorithm [LFS98], and as in the Reconstructability Analysis (RA) or General Systems Theory community [Kli85, Kri86]) this is stated as finding the model in which the sum of the mutual information (MI) [CT91] between each node and its parents is maximal; in Appendix A, we prove that these objective functions are equivalent, in the sense that they rank models in the same order.

The trouble is, the ML model will be a complete graph, since this has the largest number of parameters, and hence can fit the data the best. A well-principled way to avoid this kind of over-fitting is to put a prior on models, specifying that we prefer sparse models. Then, by Bayes' rule, the MAP model is the one that maximizes

Pr(G | D) = Pr(D | G) Pr(G) / Pr(D).    (2)

Taking logs, we find

log Pr(G | D) = log Pr(D | G) + log Pr(G) - log Pr(D),

where log Pr(D) is a constant independent of G. Thus the effect of the prior is equivalent to penalizing overly complex models, as in the Minimum Description Length (MDL) approach.

An exact Bayesian approach to model selection is usually infeasible, since it involves computing the marginal likelihood Pr(D) = \sum_G Pr(D, G), which is a sum over an exponential number of models (see Section 6.2). However, we can use an asymptotic approximation to the posterior called BIC (Bayesian Information Criterion), which is defined as follows:

log Pr(D | G) ≈ log Pr(D | G, Θ̂_G) - (#G / 2) log M,

where M is the number of samples, #G is the dimension of the model5, and Θ̂_G is the ML estimate of the parameters. The first term is computed using Equation 1. The second term is a penalty for model complexity. Since the number of parameters in the model is the sum of the number of parameters in each node6, we see that the BIC score decomposes just like the log-likelihood (Equation 1). Hence all of the techniques invented for maximizing MI also work for the BIC case, and indeed, any decomposable scoring function.

5 In the fully observable case, the dimension of a model is the number of free parameters. In a model with hidden variables, it might be less than this [GHM96].

6 Hence nodes with compact representations of their CPDs will incur a lower penalty, which can allow connections to form which might otherwise have been rejected [FG96].
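The following sketch (Python; the counting-based ML plug-in and the toy data are our own choices) computes the local BIC contribution of a single node for a candidate parent set, which is all a decomposable score needs.

```python
import numpy as np
from collections import Counter

def family_bic(child, parents_cols, data, cards):
    """BIC contribution of one node: maximized log-likelihood of child given a
    candidate parent set, minus (#free parameters / 2) * log M.

    data:  M x n integer matrix; cards[i] = number of values of variable i.
    """
    M = data.shape[0]
    counts = Counter((tuple(row[parents_cols]), row[child]) for row in data)
    pa_counts = Counter(tuple(row[parents_cols]) for row in data)
    loglik = sum(n_xc * np.log(n_xc / pa_counts[pa]) for (pa, _), n_xc in counts.items())
    n_params = (cards[child] - 1) * int(np.prod([cards[p] for p in parents_cols]))
    return loglik - 0.5 * n_params * np.log(M)

# Toy usage: score two candidate parent sets for node 2.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
X[:, 2] = X[:, 0] ^ X[:, 1]                      # node 2 is XOR of nodes 0 and 1
print(family_bic(2, [0], X, [2, 2, 2]), family_bic(2, [0, 1], X, [2, 2, 2]))
```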
6.2 Algorithms for finding the best model

Given that the score is decomposable, we can learn the parent set for each node independently. There are \sum_{k=0}^{n} C(n, k) = 2^n such sets, which can be arranged in a lattice as shown in Figure 5. The problem is to find the highest scoring point in this lattice.

The approach taken by REVEAL [LFS98] is to start at the bottom of the lattice, and evaluate the score at all points in each successive level, until a point is found with a score of 1.0. Since the scoring function they use is I(X, Pa(X)) / H(X), where I is the mutual information between X and its parents Pa(X), and H(X) is the entropy of X (see Appendix A for definitions), 1.0 is the highest possible score, and corresponds to Pa(X) being a perfect predictor of X, i.e., H(X | Pa(X)) = 0.

A normalized MI of 1 is only achievable with a deterministic relationship. For stochastic relationships, we must decide whether the gains in MI produced by a larger parent set are "worth it". The standard approach in the RA community uses the fact [Mil55] that, for M samples, 2 M I(X, Y) ln 2 (with I measured in bits) is asymptotically χ²-distributed. Hence we can use a χ² test to decide whether an increase in MI is statistically significant. (This also gives us some kind of confidence measure on the connections that we learn.) Alternatively, we can use a complexity penalty term, as in the BIC score.

Of course, if we do not know if we have achieved the maximum possible score, we do not know when to stop searching, and hence we must evaluate all points in the lattice (although we can obviously use branch-and-bound). For large n, this is computationally infeasible, so a common approach is to only search up until level k (i.e., assume a bound on the maximum number of parents of each node), which takes O(n^k) time. Unfortunately, in real genetic networks, it is known that some genes can have very high fan-in (and fan-out), so restricting the bound to, say, k = 3 would make it impossible to discover these important "master" genes.
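Here is a sketch of this bottom-up lattice search with a fan-in bound (Python; the empirical-MI helper, the tolerance, and the toy data are our own choices), using the normalized MI score I(X, Pa(X))/H(X) described above.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def entropy(cols, data):
    """Empirical joint entropy (in bits) of the given columns."""
    counts = np.array(list(Counter(map(tuple, data[:, cols])).values()))
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def normalized_mi(child, parents, data):
    """I(child; parents) / H(child): equals 1.0 iff the parents determine the child."""
    h_child = entropy([child], data)
    mi = h_child + entropy(list(parents), data) - entropy([child] + list(parents), data)
    return mi / h_child

def reveal_parents(child, data, max_fanin=3, tol=1e-9):
    """Search the lattice level by level, returning the first perfect predictor."""
    candidates = [i for i in range(data.shape[1]) if i != child]
    for k in range(1, max_fanin + 1):
        for pa in combinations(candidates, k):
            if normalized_mi(child, pa, data) >= 1.0 - tol:
                return pa
    return None

# Toy data: the target equals x0 AND x1 from the previous slice; column 3 is the target.
rng = np.random.default_rng(0)
prev = rng.integers(0, 2, size=(100, 3))
target = (prev[:, 0] & prev[:, 1]).reshape(-1, 1)
print(reveal_parents(3, np.hstack([prev, target])))   # expect (0, 1)
```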
The obvious way to avoid the exponential cost (and the need for a bound, k) is to use heuristics to avoid examining all possible subsets. (In fact, we must use heuristics of some kind, since the problem of learning optimal structure is NP-hard [CHG95].) One approach in the RA framework, called Extended Dependency Analysis (EDA) [Con88], is as follows. Start by evaluating all subsets of size up to two, keep all the ones with significant (in the χ² sense) MI with the target node, and take the union of the resulting set as the set of parents.

The disadvantage of this greedy technique is that it will fail to find a set of parents unless some subset of size two has significant MI with the target variable. However, a Monte Carlo simulation in [Con88] shows that most random relations have this property. In addition, highly interdependent sets of parents (which might fail the pairwise MI test) violate the causal independence assumption, which is necessary to justify the use of noisy-OR and similar CPDs.

An alternative technique, popular in the UAI community, is to start with an initial guess of the model structure (i.e., at a specific point in the lattice), and then perform local search, i.e., evaluate the score of neighboring points in the lattice, and move to the best such point, until we reach a local optimum. We can use multiple restarts to try to find the global optimum, and to learn an ensemble of models. Note that, in the partially observable case, we need to have an initial guess of the model structure in order to estimate the values of the hidden nodes, and hence the (expected) score of each model (see Section 5); starting with the fully disconnected model (i.e., at the bottom of the lattice) would be a bad idea, since it would lead to a poor estimate.
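A sketch of this local search for one node's parent set is given below (Python; the move set is restricted to adding or removing a single parent, and the score function is passed in — e.g., a local BIC or penalized MI such as the family_bic sketch in Section 6.1).

```python
def hill_climb_parents(candidates, score, max_iter=100):
    """Greedy local search over parent sets of a single node.

    score(parent_set) -> float should be a decomposable family score.
    Moves: add one parent or remove one parent; stop at a local optimum.
    """
    current = frozenset()
    best = score(current)
    for _ in range(max_iter):
        neighbours = [current | {c} for c in candidates if c not in current]
        neighbours += [current - {c} for c in current]
        if not neighbours:
            break
        top_score, top_set = max(((score(nb), nb) for nb in neighbours),
                                 key=lambda t: t[0])
        if top_score <= best:
            break                          # no neighbour improves the score
        best, current = top_score, top_set
    return current, best

# Hypothetical usage, with some family score `my_score` defined elsewhere:
# parents, s = hill_climb_parents([0, 1, 3], lambda pa: my_score(2, sorted(pa)))
```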
6.3 Sample complexity

In addition to the computational cost, another important consideration is the amount of data that is required to reliably learn structure. For deterministic boolean networks, this issue has been addressed from a statistical physics perspective [Her98] and a computational learning theory (combinatorial) perspective [AMK99]. In particular, [AMK99] prove that, if the fan-in is bounded by a constant k, the number of samples needed to identify a boolean network of n nodes is lower bounded by Ω(2^k + k log n) and upper bounded by O(2^{2k} (2k) log n). Unfortunately, the constant factor is exponential in k, which can be quite large for certain genes, as we have already remarked. These analyses all assume a "tabula rasa" approach; however, in practice, we will often have strong prior knowledge about the structure (or at least parts of it), which can reduce the data requirements considerably.

Figure 6: (a) A BN with a hidden variable H. (b) The simplest network that can capture the same distribution without using a hidden variable (created using arc reversal and node elimination). If H is binary and the other nodes are trinary, and we assume full CPTs, the first network has 45 independent parameters, and the second has 708.

6.4 Scaling up to large networks

Since there are about 6400 genes for yeast (S. Cerevisiae), all of which can now be simultaneously measured [DLB97], it is clear that we will have to do some pre-processing to exclude the "irrelevant" genes, and just try to learn the connections between the rest. After all, many of them do not change their expression level during a given experiment. In the [DLB97] dataset, for example, which consists of 7 time steps, only 416 genes showed a significant change — the rest are completely uninformative. Of course, even 416 is too big for the techniques we have discussed, so we must pre-process further, perhaps by clustering genes with similar time series [ESBB98]. In Section 8.1, we discuss techniques for learning the structure of DBNs with continuous state, which scale much better than the methods we have discussed above for discrete models.

7 Unknown structure, partial observability

Finally, we come to the hardest case of all, where the structure is unknown and there are hidden variables and/or missing data. One difficulty is that the score does not decompose. However, as observed in [Fri97, Fri98, FMR98, TMCH98], the expected score does decompose, and so we can use an iterative method, which alternates between evaluating the expected score of a model (using an inference engine), and changing the model structure, until a local maximum is reached. This is called the Structural EM (SEM) algorithm. SEM was successfully used to learn the structure of discrete DBNs with missing data in [FMR98].

7.1 Inventing new hidden nodes

So far, structure learning has meant finding the right connectivity between pre-existing nodes. A more interesting problem is inventing hidden nodes on demand. Hidden nodes can make a model much more compact: see Figure 6. The standard approach is to keep adding hidden nodes one at a time, to some part of the network (see below), performing structure learning at each step, until the score drops. One problem is choosing the cardinality (number of possible values) for the hidden node, and its type of CPD. Another problem is choosing where to add the new hidden node. There is no point making it a child, since hidden children can always be marginalized away, so we need to find an existing node which needs a new parent, when the current set of possible parents is not adequate.
[RM98] use the following heuristic for finding nodes which need new parents: they consider a noisy-OR node which is nearly always on, even if its non-leak parents are off, as an indicator that there is a missing parent. Generalizing this technique beyond noisy-ORs is an interesting open problem. One approach might be to examine H(X | Pa(X)): if this is very high, it means the current set of parents is inadequate to "explain" the residual entropy; if Pa(X) is the best (in the BIC or χ² sense) set of parents we have been able to find in the current model, it suggests we need to create a new node and add it to Pa(X).

A simple heuristic for inventing hidden nodes in the case of DBNs is to check if the Markov property is being violated for any particular node. If so, it suggests that we need connections to slices further back in time. Equivalently, we can add new lag variables (see Section 2.5) and connect to them. (Note that reasoning across multiple time slices forms the basis of the ("model free") correlational analysis approach of [AR95, ASR97].)

Another heuristic for DBNs might be to first perform a clustering of the time series of the observed variables (see e.g., [ESBB98]), and then to associate hidden nodes with each cluster. The result would be a Markov model with a tree-structured hidden "backbone", c.f., [JGS96]. This is one possible approach to the problem of learning hierarchically structured DBNs. (Building hierarchical (object oriented) BNs and DBNs by hand is straightforward, and there are algorithms which can exploit this modular structure to a certain extent to speed up inference [KP97, FKP98].)

Of course, interpreting the "meaning" of hidden nodes is always tricky, especially since they are often unidentifiable, e.g., we can often switch the interpretation of the true and false states (assuming for simplicity that the hidden node is binary) provided we also permute the parameters appropriately. (Symmetries such as this are one cause of the multiple maxima in the likelihood surface.) Our opinion is that fully automated structure discovery techniques can be useful as hypothesis generators, which can then be tested by experiment.

8 DBNs with continuous state

Up until now, we have assumed, for simplicity, that all the random variables are discrete-valued. However, in reality, most of the quantities we care about in genetic networks are continuous-valued. If we coarsely quantize the continuous variables (i.e., convert them to a small number of discrete values), we lose a lot of information, but if we finely quantize them, the resulting discrete model will have too many parameters. Hence it is better to work with continuous variables directly.

To create a BN with continuous variables, we need to specify the graph structure and the Conditional Probability Distributions at each node, as before. The most common distribution is a Gaussian, since it is analytically tractable. Consider, for example, the DBN in Figure 2(a), with Pr(X_{t+1} = x | X_t = x_0) = N(x; W x_0, Σ), where

N(x; μ, Σ) = (1/c) exp(-(1/2) (x - μ)' Σ^{-1} (x - μ))    (3)

is the Normal (Gaussian) distribution evaluated at x, c = (2π)^{n/2} |Σ|^{1/2} is the normalizing constant which ensures the density integrates to 1, and W is the weight or regression matrix. (x' denotes the transpose of x.) Another way of writing this is X_{t+1} = W X_t + ε_t, where ε_t ~ N(0, Σ) are independent Gaussian noise terms, corresponding to unmodelled influences. This is called a first-order auto-regressive AR(1) time series model [Ham94].

If X_t is a vector containing the expression levels of all the genes at time t, then this corresponds precisely to the model in [DWFS99]. (They do not explicitly mention Gaussian noise, but it is implicit in their decision to use least squares to fit the W term: see Section 8.1.) Despite the simplicity of this model (in particular, its linear nature), they find that it models their experimental data very well.

We can also define nonlinear models. For example, if we define X_{t+1} = σ(W X_t), where σ(u) = 1/(1 + exp(-u)) is the sigmoid function (applied componentwise), we obtain the nonlinear model in [WWS99].

It is simple to extend both the linear and nonlinear models to allow for external inputs U_t, so that we regress X_{t+1} on X_t and U_t, as the authors point out. For example, in [DWFS99], they inject kainate — "a glutamatergic agonist which causes seizures, localized cell death, and severely disrupts the normal gene expression patterns" — into rat CNS, and measure expression levels before and after. Their model is therefore X_{t+1} = W X_t + u_t k + b, where u_t = 0 if t is before the injection and u_t = exp(-t/2) if t is hours after the injection, and k and b are vectors of length n, representing the response to kainate and an offset (bias) term, respectively.

In Figure 2(a), the X_t variables are always observed. Now consider Figure 2(b), where only the Y_t are observed, and have Gaussian CPDs. If the X_t are hidden, discrete variables, we have Pr(Y_t = y | X_t = i) = N(y; μ_i, Σ_i); this is an HMM with Gaussian outputs. If the X_t are hidden, continuous variables, we have Pr(Y_t = y | X_t = x) = N(y; C x, Σ). This is called a Linear Dynamical System, since both the dynamics, X_{t+1} = W X_t, and the observations, Y_t = C X_t, are linear. (It is also called a State Space model.)

To perform inference in an LDS, we can use the well-known Kalman filter algorithm [BSL93], which is just a special case of the junction tree algorithm. If the dynamics are non-linear, however, inference becomes much harder; the standard technique is to make a local linear approximation — this is called the Extended Kalman Filter (EKF), and is similar to techniques developed in the Recurrent Neural Networks literature [Pea95]. (One reason for the widespread success of HMMs with Gaussian outputs is that the discrete hidden variables can be used to approximate arbitrary non-linear dynamics, given enough training data.)
widespread success of HMMs with Gaussian outputs is that 5 *5
the discrete hidden variables can be used to approximate
arbitrary non-linear dynamics, given enough training data.)   ..
1
    ..
2

In the rest of this paper, we will stick to the case of fully 5
. *5 .
observable (but possibly non-linear) models, for simplicity, 1
so that inference will not be necessary. (We will, however,
consider Bayesian methods of parameter learning, which 
We can think of the ’th row of as the ’th input vector 
to a linear neural network with no hidden layers (i.e., a per-
 
can be thought of as inferring the most probable values of
nodes which represent the parameters themselves.) 
ceptron), and the ’th row of as the corresponding output. 
Using this notation, we can rewrite the above expression as
follows (this is called the normal equation):
8.1 Learning
  5  5 
 1 5  
We have assumed that is a vector of random variables.
If we were to represent each component of this vector as
its own (scalar) node, the inter-slice connections would where

is the pseudo-inverse of (which is not square).

correspond to the non-zero values in the weight matrix
, and undirected connections within a slice would cor- 
Thus we see that the least squares solution is the same as the
Maximum Likelihood (ML) solution assuming Gaussian
respond to non-zero entries in Σ 1 [Lau96]. For example, noise.
 1 1 1  
  
The technique used by D’haeseleer et al. [DWFS99] is
if
0 1 1 
1 0 0 , and Σ is diagonal, the model is

equivalent to the one in Figure 3(a). Hence, in the case


to compute the least squares value of

, and interpret
, for some (unspecified) threshold , as indicating 
the absence of an arc between nodes and . However, as
of continuous state models, we can convert back and forth Bishop writes [Bis95, p.360], “[this technique] has little
between representing the structure graphically and numeri- theoretical motivation, and performs poorly in practice”.
cally. This means that structure learning reduces to param-
eter fitting, and we can avoid expensive search through a More sophisticated techniques have been developed in the
discrete space of models! (c.f., [Bra98].) Consequently, we neural network literature, which involve examining the
change in the error function due to small changes in the

will now discuss techniques for learning the parameters of
values of the weights. This requires computing the Hessian

continuous-state DBNs.


of the error surface. In the technique of “optimal brain
+

Consider the AR(1) model in Figure 2(a). For notational

   
damage” [CDS90], they assume is diagonal; in the more

constant is

simplicity we will assume Σ
2 2
, where
2
, so the normalizing
1 2 is the inverse  sophisticated technique of “optimal brain surgeon” [HS93],
they remove this assumption.
variance (the precision). Using Equation 3, we find that the
 
 
likelihood of the data (ignoring terms which are independent Since we believe (hope!) that the true model is sparse (i.e.,
    
most entries in are 0), we can encode this knowledge
of ) is

 
Gaussian prior 7 for each
  (         
(assumption) by using a 0 1

       8
weight:
 1   2     
    
 
1
1

 1   exp    8 2  8   1 2


2
1
exp 
  2
8    
2 2

! 1   exp 
2

  1  exp ( 8   )

2 

 1   8
(4)
  %  
the value of at time  in sequence  ,

 is the isproduct
exp (5)
"    #   
where

 *8 1 is the number of samples, and  is an error  
of the individual normalizing constants,

2


2 2
where (since is an matrix)
(cost) function on the data. (Having multiple sequences is and is an error (cost) function on the weights. Then,
essentially the same as concatenating them into a single using Bayes’ rule (Equation 2) and Equations 4 and 5, the

  1;    exp  8   8$  *


posterior distribution on the weights is
   (
long sequence (modulo boundary conditions), so we shall

henceforth just use a single index, .) Differentiating this 

with respect to and setting to 0 yields

  "  
1 Since the denominator is independent of , the Maximum
     
Posterior (MAP) value of W is given by minimizing the
1
2 2
7
In [WWS99], they use an ad-hoc iterative pruning method to
Let us now rewrite this in a more familiar form. Define the achieve a similar effect.
   


The values α and β are called hyperparameters, since they control the prior on the weights and the system noise, respectively. Since their values are unknown, the correct Bayesian approach is to integrate them out, but an approximation that actually works better in practice is to approximate their posterior by a Gaussian (or maybe a mixture of Gaussians), find their MAP values, and then plug them in to the above equation: see [Bis95, sec. 10.4] and [Mac95] for details.

We can associate a separate hyperparameter α_{ij} for each weight W(i, j), find their MAP values, and use this as a "soft" means of finding which entries of W to keep: this is called Automatic Relevance Determination (ARD) [Mac95], and has been found to perform better than the weight deletion techniques discussed above, especially on small data sets.

9 Conclusion

In this paper, we have explained what DBNs are, and discussed how to learn them. In the future, we hope to apply these techniques (in particular, the ones discussed in Section 8.1) to real biological data, when enough becomes publicly available.8

8 We carried out some provisional experiments on a subset of the data described in [WFM+98] (available at rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html), which has n = 112, T = 9 and M = 1. Applying the ARD technique to the 17 genes involved in the GAD/GABA subsystem described in [DWFS99], we found that all the W(i, j)'s were significant, indicating a fully connected graph. [DWFS99] got a sparse graph, but used a larger (unpublished) data set with T = 28, and used an unspecified pruning threshold θ.

A The equivalence between Mutual Information and Maximum Likelihood methods

We start by introducing some standard notation from information theory [CT91].

I(X, Y) = \sum_{x,y} Pr(x, y) log ( Pr(x, y) / (Pr(x) Pr(y)) )

is the mutual information (MI) between X and Y,

H(X) = -\sum_{x} Pr(x) log Pr(x)

is the entropy of X, and

H(X | Y) = -\sum_{x,y} Pr(x, y) log Pr(x | y)

is the conditional entropy of X given Y; note that I(X, Y) = H(X) - H(X | Y). (If Pa(X_i) = ∅, we define H(X_i | Pa(X_i)) = H(X_i) and I(X_i, Pa(X_i)) = 0.) Finally, we define the MI score of a model G as

S_I(G; D) = \sum_i I(X_i, Pa_G(X_i)),

where the probabilities are set to their empirical values given D, and Pa_G(X_i) denotes the parents of node i in graph G.

Theorem. Let G_1, G_2 be BNs in which every node has a full CPT, and let D be a fully observable data set. Then9

L(G_1; D) - L(G_2; D) = S_I(G_1; D) - S_I(G_2; D).

9 If L(G_1) - L(G_2) = S(G_1) - S(G_2) for all pairs of models, then L(G_1) ≥ L(G_2) iff S(G_1) ≥ S(G_2), so L and S rank models in the same order. However, there might be many models which receive the same score under L or S, so it is ambiguous to write arg max_G L(G) = arg max_G S(G).

Proof: To see this, note that the normalized log-likelihood (Equation 1) can be rewritten as

L = (1/M) \sum_i \sum_{x_i, pa_i} N(x_i, pa_i) log Pr(X_i = x_i | Pa(X_i) = pa_i),

where we have defined pa_i = Pa(X_i) for brevity, and N(x_i, pa_i) is the number of times the event {X_i = x_i, Pa(X_i) = pa_i} occurs in the training set. (Similarly, we define Pr(x_i, pa_i) = N(x_i, pa_i) / M, and N(pa_i) as the number of times the event {Pa(X_i) = pa_i} occurs.) If we assume each node has a full CPT, and set the probabilities to their empirical estimates, Pr(x_i | pa_i) = N(x_i, pa_i) / N(pa_i), then

L = \sum_i \sum_{x_i, pa_i} Pr(x_i, pa_i) log Pr(x_i | pa_i) = -\sum_i H(X_i | Pa(X_i)).

Since H(X_i) is independent of the structure of the graph G, we find

L(G_1; D) - L(G_2; D) = \sum_i [H(X_i) - H(X_i | Pa_{G_1}(X_i))] - \sum_i [H(X_i) - H(X_i | Pa_{G_2}(X_i))] = S_I(G_1; D) - S_I(G_2; D).
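As a quick numerical check of this equivalence (a sketch with toy data; the probabilities are the empirical plug-in estimates, as in the proof), the code below computes L and S_I for two candidate structures and verifies that the differences agree.

```python
import numpy as np
from collections import Counter

def entropy(counter):
    p = np.array(list(counter.values()), dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()

def cond_entropy(child, parents, data):
    """Empirical H(child | parents); reduces to H(child) if the parent set is empty."""
    if not parents:
        return entropy(Counter(data[:, child]))
    joint = entropy(Counter(map(tuple, data[:, [child] + parents])))
    return joint - entropy(Counter(map(tuple, data[:, parents])))

def scores(structure, data):
    """Return (normalized log-likelihood L, MI score S_I) for a structure,
    given as a dict {node: parent list}, using empirical full-CPT estimates."""
    L = -sum(cond_entropy(i, pa, data) for i, pa in structure.items())
    S = sum(entropy(Counter(data[:, i])) - cond_entropy(i, pa, data)
            for i, pa in structure.items())
    return L, S

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 3))
data[:, 2] = data[:, 0] ^ (rng.random(500) < 0.1)     # node 2 is a noisy copy of node 0

G1 = {0: [], 1: [], 2: [0]}
G2 = {0: [], 1: [], 2: [1]}
(L1, S1), (L2, S2) = scores(G1, data), scores(G2, data)
print(round(L1 - L2, 6), round(S1 - S2, 6))           # the two differences coincide
```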
We note that this theorem is similar to the result in [CL68], who show that the optimal10 tree-structured MRF is a maximal weight spanning tree (MWST) of the graph in which the weight of an arc between two nodes, X and Y, is I(X, Y). [MJM98] extends this result to mixtures of trees. Trees have the significant advantage that we can compute the MWST in O(n²) time, where n is the number of variables. However, it is not clear if they are a useful model for gene expression data.

10 In the sense of minimizing the KL divergence between the true and estimated distribution.

If a node does not have a full CPT, but instead has a parametric CPD of the right form (i.e., at least capable of representing the "true" conditional probability distribution of the node), then as M → ∞ the empirical estimates converge to the true conditional distributions, and the proof goes through as before. But this leaves us with the additional problem of choosing the form of CPD: see Section 4.2.

References

[AMK99] T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the boolean network model. In Proc. of the Pacific Symp. on Biocomputing, 1999.
[AR95] A. Arkin and J. Ross. Statistical construction of chemical reaction mechanisms from measured time-series. J. Phys. Chem., 99:970, 1995.
[ASR97] A. Arkin, P. Shen, and J. Ross. A test case of correlation metric construction of a reaction pathway from measurements. Science, 277:1275–1279, 1997.
[BFGK96] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1996.
[Bis95] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
[BK98a] X. Boyen and D. Koller. Approximate learning of dynamic models. In Neural Info. Proc. Systems, 1998.
[BK98b] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. of the Conf. on Uncertainty in AI, 1998.
[BKRK97] J. Binder, D. Koller, S. J. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213–244, 1997.
[Bra98] M. Brand. An entropic estimator for structure discovery. In Neural Info. Proc. Systems, 1998.
[BSL93] Y. Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.
[Bun94] W. L. Buntine. Operations for learning with graphical models. J. of AI Research, pages 159–225, 1994.
[Bun96] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8(2), 1996.
[CDS90] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In Neural Info. Proc. Systems, 1990.
[CHG95] D. Chickering, D. Heckerman, and D. Geiger. Learning Bayesian networks: search methods and experimental results. In AI + Stats, 1995.
[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14:462–67, 1968.
[Con88] R. C. Conant. Extended dependency analysis of large systems. Intl. J. General Systems, 14:97–141, 1988.
[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.
[Dec98] R. Dechter. Bucket elimination: a unifying framework for probabilistic inference. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, 1998.
[DLB97] J. L. DeRisi, V. R. Lyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 1997.
[DWFS99] P. D'haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. In Proc. of the Pacific Symp. on Biocomputing, 1999.
[ESBB98] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. of the National Academy of Science, USA, 1998.
[FG96] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proc. of the Conf. on Uncertainty in AI, 1996.
[FKP98] N. Friedman, D. Koller, and A. Pfeffer. Structured representation of complex stochastic systems. In Proc. of the Conf. of the Am. Assoc. for AI, 1998.
[FMR98] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In Proc. of the Conf. on Uncertainty in AI, 1998.
[Fre97] D. A. Freeman. From association to causation via regression. In V. R. McKim and S. P. Turner, editors, Causality in Crisis? Statistical methods and the search for causal knowledge in the social sciences. U. Notre Dame Press, 1997.
[Fri97] N. Friedman. Learning Bayesian networks in the presence of missing values and hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1997.
[Fri98] N. Friedman. The Bayesian structural EM algorithm. In Proc. of the Conf. on Uncertainty in AI, 1998.
[GHM96] D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed networks with hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1996.
[Gol80] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.
[Ham94] J. Hamilton. Time Series Analysis. Wiley, 1994.
[HB94] D. Heckerman and J. S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Trans. on Systems, Man and Cybernetics, 26(6):826–831, 1994.
[HD94] C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. Intl. J. Approx. Reasoning, 11, 1994.
[Hec96] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1996.
[Hen89] M. Henrion. Some practical issues in constructing belief networks. In Proc. of the Conf. on Uncertainty in AI, 1989.
[Her98] J. Hertz. Statistical issues in the reverse engineering of genetic networks. In Proc. of the Pacific Symp. on Biocomputing, 1998.
[HMC97] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05, Microsoft Research, February 1997.
[HP89] S. Hanson and L. Pratt. Comparing biases for minimal network construction with back-propagation. In Neural Info. Proc. Systems, 1989.
[HS93] B. Hassibi and D. Stork. Second order derivatives for network pruning: optimal brain surgeon. In Neural Info. Proc. Systems, 1993.
[HS95] D. Heckerman and R. Shachter. Decision-theoretic foundations for causal reasoning. J. of AI Research, 3:405–430, 1995.
[Jen96] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, England, 1996.
[JGJS98] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[JGS96] M. I. Jordan, Z. Ghahramani, and L. K. Saul. Hidden Markov decision trees. In Neural Info. Proc. Systems, 1996.
[Kau93] S. Kauffman. The origins of order. Self-organization and selection in evolution. Oxford Univ. Press, 1993.
[Kli85] G. Klir. The Architecture of Systems Problem Solving. Plenum Press, 1985.
[KP97] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1997.
[Kri86] K. Krippendorff. Information Theory: Structural Models for Qualitative Data (Quantitative Applications in the Social Sciences, No 62). Sage Publications, 1986.
[Lau95] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201, 1995.
[Lau96] S. Lauritzen. Graphical Models. OUP, 1996.
[LFS98] S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing, volume 3, pages 18–29, 1998.
[MA97] H. H. McAdams and A. Arkin. Stochastic mechanisms in gene expression. Proc. of the National Academy of Science, USA, 94(3):814–819, 1997.
[Mac95] D. MacKay. Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network, 1995.
[Mac98] D. MacKay. Introduction to Monte Carlo methods. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[MH97] C. Meek and D. Heckerman. Structure and parameter learning for causal independence and causal interaction models. In Proc. of the Conf. on Uncertainty in AI, pages 366–375, 1997.
[Mil55] G. Miller. Note on the bias of information estimates. In H. Quastler, editor, Information theory in psychology. The Free Press, 1955.
[MJM98] M. Meila, M. Jordan, and Q. Morris. Estimating dependency structure as a hidden variable. Technical Report 1648, MIT AI Lab, 1998.
[Mur99] K. P. Murphy. A variational approximation for Bayesian networks with discrete and continuous latent variables. In Proc. of the Conf. on Uncertainty in AI, 1999. Submitted.
[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[Pea95] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey. IEEE Trans. on Neural Networks, 6, 1995.
[Rab89] L. R. Rabiner. A tutorial in Hidden Markov Models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
[RM98] S. Ramachandran and R. Mooney. Theory refinement of Bayesian networks with hidden variables. In Machine Learning: Proceedings of the International Conference, 1998.
[SGS93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. Out of print.
[Sha98] R. Shachter. Bayes-ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). In Proc. of the Conf. on Uncertainty in AI, 1998.
[SHJ97] P. Smyth, D. Heckerman, and M. I. Jordan. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2):227–269, 1997.
[Sri93] S. Srinivas. A generalization of the noisy-or model. In Proc. of the Conf. on Uncertainty in AI, 1993.
[SS96] R. Somogyi and C. A. Sniegoski. Modeling the complexity of genetic networks: understanding multigenetic and pleiotropic regulation. Complexity, 1(45), 1996.
[TMCH98] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. Learning mixtures of DAG models. In Proc. of the Conf. on Uncertainty in AI, 1998.
[WFM+98] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc. of the National Academy of Science, USA, 95:334–339, 1998.
[Whi90] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.
[WMS94] J. V. White, I. Muchnik, and T. F. Smith. Modelling protein cores with Markov Random Fields. Mathematical Biosciences, 124:149–179, 1994.
[WWS99] D. C. Weaver, C. T. Workman, and G. D. Stormo. Modeling regulatory networks with weight matrices. In Proc. of the Pacific Symp. on Biocomputing, 1999.
