Modelling Gene Expression Data Using Dynamic Bayesian Networks
Gene expression is an inherently stochastic phenomenon [MA97]. In addition, even if the underlying system were deterministic, it might appear stochastic due to our inability to perfectly measure all the variables. Hence it is crucial that our learning algorithms be able to cope with such noise.

In addition to causal and diagnostic reasoning, BNs support the powerful notion of "explaining away": if a node is observed, then its parents become dependent, since they are rival causes for explaining the child's value (see the bottom left case in Figure 1). In an undirected graphical model, by contrast, the parents would be independent, since the child separates (but does not d-separate) them. More generally, in an undirected graphical model two (sets of) nodes are conditionally independent given all the other nodes if they are separated in the graph, whereas directed graphical models (i.e., BNs) have a more complicated notion of independence, which takes the directionality of the arcs into account (d-separation).

The inference algorithms which are needed to do this kind of reasoning are briefly discussed in Section 5.1. Note that, if all the nodes are observed, there is no need to do inference, although we might still want to do learning.

Some other important advantages of directed graphical models over undirected ones include the fact that BNs can encode deterministic relationships, and that it is easier to learn BNs (see Section 3) since they are separable models (in the sense of [Fri98]). Hence we shall focus exclusively on BNs in this paper. For a careful study of the relationship between directed and undirected graphical models, see [Pea88, Whi90, Lau96].1

1 It is interesting to note that much of the theory underlying graphical models involves concepts such as chordal (triangulated) graphs.

Figure 2: (a) A Markov chain over the state variables X1, X2, X3, .... (b) A hidden Markov model: the state variables X1, X2, X3, ... are hidden, and only Y1, Y2, Y3, ... are observed.

The hidden state variables X_t of an HMM evolve according to a transition matrix, Pr(X_t = j | X_{t-1} = i) = A(i, j, t) (A(., ., t) is the transition matrix for time slice t). If the observed variables Y_t are discrete, we can specify their distribution by an observation matrix, Pr(Y_t = j | X_t = i) = B(i, j, t) (B(., ., t) is the observation matrix for time slice t). However, in an HMM, it is also possible for the observed variables to be Gaussian, in which case we must specify a mean and covariance for each value of the hidden state variable, and each value of t: see Section 8.

As a (hopefully!) familiar example of HMMs, let us consider the way that they are used for aligning protein sequences [DEKM98]. In this case, the hidden state variable can take on three possible values, {d, i, m}, which represent delete, insert and match respectively. In protein alignment, the subscript t does not refer to time, but rather to position along a static sequence. This is an important difference from gene expression, where t really does represent time (see Section 3). The observable variable can take on 21 possible values, which represent the 20 possible amino acids plus the gap alignment character "-". The probability distribution over these 21 values depends on the current position t and on the current state of the system, X_t: Pr(Y_t | X_t = m) is the profile for position t, while the insert state typically emits amino acids from a background distribution and the delete state emits the gap character.
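The HMM parameterization just described is easy to write down concretely. The following minimal sketch (in Python, with made-up numbers) builds a time-invariant transition matrix A and observation matrix B, and computes the likelihood of a short observation sequence with the forward recursion, the same computation that the junction tree machinery of Section 5.1 performs in the general case.

    import numpy as np

    # Hypothetical 2-state HMM with 3 discrete observation symbols.
    A = np.array([[0.9, 0.1],          # A[i, j] = Pr(X_t = j | X_{t-1} = i)
                  [0.2, 0.8]])
    B = np.array([[0.7, 0.2, 0.1],     # B[i, k] = Pr(Y_t = k | X_t = i)
                  [0.1, 0.3, 0.6]])
    pi = np.array([0.5, 0.5])          # initial state distribution

    def likelihood(obs):
        """Pr(Y_1, ..., Y_T) computed by the forward recursion."""
        alpha = pi * B[:, obs[0]]
        for y in obs[1:]:
            alpha = (alpha @ A) * B[:, y]
        return alpha.sum()

    print(likelihood([0, 0, 2, 1]))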
In an HMM, "learning" means finding estimates of the parameters A and B (see Section 4.1); the structure of the model is fixed (namely, it is Figure 2(b)).

Figure 3: (a) A DBN. (b) The same model represented as a boolean network.

Figure 4: The noisy-OR gate modelled as a BN. A' is a noisy version of A, and B' is a noisy version of B. C is a deterministic OR of A' and B' (indicated by the double ring). Shaded nodes are observed, non-shaded nodes are hidden.
Now consider the DBN in Figure 3(a). (We only show two time slices of the DBN, since the structure repeats.) Just as an HMM can be thought of as a single stochastic automaton, so this can be thought of as a network of four interacting stochastic automata: node B_t represents the state of the B automaton at time t, for example, and its next state is determined by its own previous state and by the previous states of its parents in the graph. The interconnectivity between the automata can also be represented as in Figure 3(b). The presence of directed cycles means that this is not a BN; rather, an arc from node i to node j in this graph means that there is an arc from the i'th node in one slice of the DBN to the j'th node in the next slice, so Figure 3(b) can be read as the inter-slice adjacency matrix.

In addition to connections between time slices, we can also allow connections within a time slice (intra-slice connections). This can be used to model co-occurrence effects.

We can define a stochastic boolean network model by modifying the CPDs. For example, a popular CPD in the UAI community is the noisy-OR model [Pea88]. This is just like an OR gate, except that there is a certain probability that the i'th input will be flipped from 1 to 0. For example, if Pr(A' = 0 | A = 1) = q1 and Pr(B' = 0 | B = 1) = q2 (cf. Figure 4), then we can write the CPD for C in tabular form as follows:

    A  B  | Pr(C = 0)   Pr(C = 1)
    0  0  | 1.0         0.0
    1  0  | q1          1 - q1
    0  1  | q2          1 - q2
    1  1  | q1 q2       1 - q1 q2

Note that this CPT has 2^n x 2 entries, where n is the number of parents (here, 2), but that there are constraints on these entries. For a start, the requirement that each row sum to one means that only 2^n entries are free. Moreover, the entries in the remaining column are themselves functions of just n free parameters, q_1, ..., q_n. Hence this is called an implicit or parametric distribution, and it has the advantage that less data is needed to learn the parameter values (see Section 3). By contrast, a full multinomial CPT has a number of free parameters that grows exponentially with the number of parents.
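To make the noisy-OR parameterization concrete, here is a small Python sketch that builds the table above for an arbitrary number of parents; the suppression probabilities q are arbitrary illustrative values, not numbers from the paper.

    import numpy as np
    import itertools

    def noisy_or_cpt(q):
        """Return Pr(C = 1 | parent values) for a noisy-OR node.

        q[i] is the probability that the i'th input is flipped from 1 to 0,
        i.e. the probability that an active parent i fails to turn C on."""
        n = len(q)
        table = {}
        for parents in itertools.product([0, 1], repeat=n):
            # C stays off only if every active parent is independently suppressed.
            p_off = np.prod([q[i] for i in range(n) if parents[i] == 1])
            table[parents] = 1.0 - p_off
        return table

    # Two parents: reproduces rows (0,0), (1,0), (0,1), (1,1) of the table above.
    for parents, p_on in noisy_or_cpt([0.2, 0.4]).items():
        print(parents, round(1 - p_on, 3), round(p_on, 3))

Only len(q) numbers are needed, however many parents there are, which is exactly the point made above about implicit (parametric) distributions.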
2.4 Comparison of DBNs and HMMs

Note that we can convert the model in Figure 3(a) into the one in Figure 2(a), by defining a new variable whose state space is the cross product of the original variables, i.e., by "collapsing" the original model into a single Markov chain. The price we pay is that the state space of this compound variable, and hence its transition matrix, is exponential in the number of original variables. Furthermore, 0s in the transition matrix do not correspond to absence of arcs in the model. Rather, if A(i, j) = 0, it just means the compound automaton cannot get from state i to state j — this says nothing about the connections between the underlying state variables. Thus Markov chains and HMMs are not a good representation for sparse, discrete models of the kind we are considering here.
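The "collapsing" construction, and the resulting exponential blow-up, can be illustrated directly. The sketch below uses an invented ring-structured DBN over four binary nodes (each node depending on itself and its left neighbour in the previous slice) and builds the transition matrix of the equivalent single Markov chain; the chain has 2^n states even though each node of the DBN has only two parents.

    import numpy as np
    import itertools

    rng = np.random.default_rng(0)
    n = 4
    # cpd[i][a, b] = Pr(X_i(t+1) = 1 | X_{i-1}(t) = a, X_i(t) = b)   (invented parents)
    cpd = [rng.random((2, 2)) for _ in range(n)]

    states = list(itertools.product([0, 1], repeat=n))   # 2^n compound states
    A = np.zeros((2 ** n, 2 ** n))                       # collapsed transition matrix
    for s, prev in enumerate(states):
        for u, nxt in enumerate(states):
            p = 1.0
            for i in range(n):
                p1 = cpd[i][prev[(i - 1) % n], prev[i]]
                p *= p1 if nxt[i] == 1 else 1.0 - p1
            A[s, u] = p

    print(A.shape)          # (16, 16): exponential in n, although the DBN itself is sparse
    print(A.sum(axis=1))    # each row sums to 1

As the text notes, the sparseness of the DBN is invisible in A: in this example none of its entries are zero.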
2.5 Higher-order models

We have assumed that our models have the first-order Markov property, i.e., that there are only connections between adjacent time slices. This assumption is without loss of generality, since it is easy to convert a higher-order Markov chain to a first-order Markov chain, simply by adding new variables (called lag variables) which contain the old state, and clamping the transition matrices of these new variables to the identity matrix. (Note that doing this for an HMM (as opposed to a DBN) would blow up the state space exponentially.)
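As a concrete (hypothetical) illustration of lag variables, the sketch below converts a second-order Markov chain over k states, with transition tensor A2[i, j, m] = Pr(X_t = m | X_{t-2} = i, X_{t-1} = j), into an equivalent first-order chain over the augmented state Z_t = (X_{t-1}, X_t); the lag component is "clamped" in the sense that state (i, j) can only move to states of the form (j, m).

    import numpy as np

    k = 3
    rng = np.random.default_rng(1)
    A2 = rng.random((k, k, k))
    A2 /= A2.sum(axis=2, keepdims=True)   # A2[i, j, m] = Pr(X_t = m | X_{t-2} = i, X_{t-1} = j)

    # First-order chain over the augmented state Z_t = (X_{t-1}, X_t).
    A1 = np.zeros((k * k, k * k))
    for i in range(k):
        for j in range(k):
            for m in range(k):
                A1[i * k + j, j * k + m] = A2[i, j, m]

    print(A1.sum(axis=1))                 # every row sums to 1: a valid first-order chain

In a DBN we simply add one extra node per slice to hold the lagged value, whereas doing the same thing to an HMM multiplies the size of its state space by k for every lag variable added, which is the exponential blow-up mentioned above.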
different structure, depending on the value of a hidden, dis-
2.6 Relationship to other probabilistic models crete, mixture node. This has been done for Gaussian net-
works [TMCH98] and discrete, undirected trees2 [MJM98];
Although the emphasis in this section has been on models unfortunately, there are severe computational difficulties in
which are relevant to gene expression, we should remark generalizing this technique to arbitrary, discrete networks.
that Bayesian Nets can also be used in evolutionary tree con- In this paper, we restrict our attention to learning the struc-
struction and genetic pedigree analysis. (In fact, the peel- ture of the inter-slice connectivity matrix of the DBN. This
ing (Elston-Stewart) algorithm was invented over a decade has the advantage that we do not need to worry about in-
before being rediscovered in the UAI community.) Also, troducing (directed) cycles, and also that we can use the
Bayesian networks can be augmented with utility and deci-
sion nodes, to create an influence diagram; this can be used lations, i.e., we know an arc must go from
"
1 to
“arrow of time” to disambiguate the direction of causal re-
and
to compute an optimal action or sequence of actions (opti- not vice versa, because the network models the temporal
mal in the decision-theoretic sense of maximizing expected evolution of a physical system. Of course, we still cannot
utility).This might be useful for designing gene-knockout completely overcome the problem that “correlation is not
experiments, although we don’t pursue this issue here.
and
causation”: it is possible that the correlation between
is due to one or more hidden common causes (or
1
3 Learning Bayesian Networks

In this section, we discuss how to learn BNs from data. The problem can be summarized as follows.

    Structure / Observability    Method
    Known, full                  Sample statistics
    Known, partial               EM or gradient ascent
    Unknown, full                Search through model space
    Unknown, partial             Structural EM

Full observability means that the values of all variables are known; partial observability means that we do not know the values of some of the variables. This might be because they cannot, in principle, be measured (in which case they are called hidden or latent variables), or simply because they were not recorded (missing data).

In addition, we often have prior knowledge about the kind of structure we expect the network to have; e.g., it is common to assume a bound, K, on the maximum number of parents (fan-in) that a node can take on, and we may know that nodes of a certain "type" only connect to other nodes of the same type. Such prior knowledge can either be a "hard" constraint, or a "soft" prior.

To exploit such soft prior knowledge, we must use Bayesian methods, which have the additional advantage that they return a distribution over possible models instead of a single best model. Handling priors on model structures, however, is quite complicated, so we do not discuss this issue here (see [Hec96] for a review). Instead, we assume that the goal is to find a single model which maximizes some scoring function (discussed in Section 6.1). We will, however, consider priors on parameters, which are much simpler. In Section 8.1, when we consider DBNs in which all the variables are continuous, we will see that we can use numerical priors as a proxy for structural priors.

An interesting approximation to the full-blown Bayesian approach is to learn a mixture of models, each of which has a different structure, depending on the value of a hidden, discrete, mixture node. This has been done for Gaussian networks [TMCH98] and discrete, undirected trees [MJM98];2 unfortunately, there are severe computational difficulties in generalizing this technique to arbitrary, discrete networks.

In this paper, we restrict our attention to learning the structure of the inter-slice connectivity matrix of the DBN. This has the advantage that we do not need to worry about introducing (directed) cycles, and also that we can use the "arrow of time" to disambiguate the direction of causal relations, i.e., we know an arc must go from X_t to X_{t+1} and not vice versa, because the network models the temporal evolution of a physical system. Of course, we still cannot completely overcome the problem that "correlation is not causation": it is possible that the correlation between X and Y is due to one or more hidden common causes (or observed common effects, because of the explaining away phenomenon). Hence the models learned from data should be subject to experimental verification.3

2 In one particularly relevant example, [MJM98] applies her algorithm to the problem of classifying DNA splice sites as intron-exon or exon-intron. When she examined each tree in the mixture to find the nodes which are most commonly connected to the class variable, she found that they were the nodes in the vicinity of the splice junction, and further, that their CPTs encoded the known AG/G pattern.

3 [HMC97] discusses Bayesian techniques for learning causal (static) networks, and [SGS93] discusses constraint based techniques — but see [Fre97] for a critique of this approach. These techniques have mostly been used in the social sciences. One major advantage of molecular biology is that it is usually feasible to verify hypotheses experimentally.
When the structure of the model is known, the learning task becomes one of parameter estimation. Although the focus of this paper is on learning structure, structure learning calls parameter learning as a subroutine, so we start by discussing this case.

Finally, we mention that learning BNs is a huge subject, and in this paper, we only touch on the aspects that we consider most relevant to learning genetic networks from time series data. For further details, see the review papers [Bun94, Bun96, Hec96].

4 Known structure, full observability

We assume that the goal of learning in this case is to find the values of the parameters of each CPD which maximize the likelihood of the training data, which contains N sequences (assumed to be independent), each of which has the observed values of all nodes per slice for each of T slices. For notational simplicity, we assume each sequence is of the same length. Thus we can imagine "unrolling" a two-slice DBN to produce a (static) BN with T slices.
The joint distribution of the unrolled network is the product of the CPDs, one per node, taken in topological order (i.e., parents before children). The normalized log-likelihood L of the training set D = {D_1, ..., D_N} is therefore a sum of terms, one for each node:

    L = (1 / (N T)) Σ_{i=1}^{n} Σ_{l=1}^{N} Σ_{t=2}^{T} log Pr( X_i(t) = x_i^l(t) | Pa(X_i(t)) = π_i^l(t) )    (1)

where x_i^l(t) is the value of node X_i at time t in sequence l, and π_i^l(t) is the corresponding value of its parents Pa(X_i(t)).4

4 Since we are not trying to estimate the parameters of nodes in the initial slice, we have omitted their contribution to the overall probability, for simplicity; hence t runs from 2 to T.

We see that the log-likelihood scoring function decomposes according to the structure of the graph, and hence we can maximize the contribution to the log-likelihood of each node independently (assuming the parameters in each node are independent of the other nodes).

4.1 Parameter estimation

For discrete nodes with CPTs, we can compute the ML estimates by simple counting. As is well known from the HMM literature, however, ML estimates of CPTs are prone to sparse data problems, which can be solved by using (mixtures of) Dirichlet priors (pseudo counts).

Estimating the parameters of noisy-OR distributions (and their relatives) is harder, essentially because of the presence of hidden variables (see Figure 4), so we must use the techniques discussed in Section 5.
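A minimal sketch of ML-by-counting with Dirichlet pseudo-counts, for a single discrete node with one discrete parent (all names and data below are invented for illustration):

    import numpy as np

    def estimate_cpt(parent_vals, child_vals, n_parent_states, n_child_states, pseudo=1.0):
        """ML/MAP estimate of Pr(child | parent) from fully observed data.

        pseudo is the Dirichlet pseudo-count added to every cell; pseudo = 0
        gives the plain ML estimate, which assigns probability 0 to any
        parent/child combination that never occurs in the data."""
        counts = np.full((n_parent_states, n_child_states), pseudo)
        for p, c in zip(parent_vals, child_vals):
            counts[p, c] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    # Toy data: the child usually copies its parent.
    parents = [0, 0, 1, 1, 1, 0]
    children = [0, 0, 1, 1, 0, 0]
    print(estimate_cpt(parents, children, 2, 2))

Nodes with several parents are handled the same way, by indexing the rows of the count table by the joint parent configuration.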
4.2 Choosing the form of the CPDs

5 Known structure, partial observability

When some of the nodes are hidden, or some of their values are missing, the ML parameters can be found with EM or gradient ascent, both of which require inference as a subroutine.

5.1 Inference

[SHJ97] shows that the forwards-backwards algorithm is a special case of the junction tree algorithm, and might be a good place to start if you are familiar with HMMs. [Dec98] would be a good place to start if you are familiar with the peeling algorithm (although the junction tree approach is much more efficient for learning). [Mur99] discusses inference in BNs with discrete and continuous variables, and efficient techniques for handling networks with many observed nodes.

Exact inference in densely connected (discrete) BNs is computationally intractable, so we must use approximate methods. There are many approaches, including sampling methods such as MCMC [Mac98] and variational methods such as mean-field [JGJS98]. DBNs are even more computationally intractable in the sense that, even if the connections between two slices are sparse, correlations can arise over several time steps, thus rendering the unrolled network "effectively" dense. A simple approximate inference algorithm for the specific case of sparse DBNs is described in [BK98b, BK98a].

    {1,2,3,4}
    {1,2,3} {2,3,4} {1,3,4} {1,2,4}
    {1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
    {1} {2} {3} {4}
    {}

Figure 5: The subsets of {1, 2, 3, 4} arranged in a lattice. The i'th row represents subsets of size i, for 0 <= i <= 4.

6 Unknown structure, full observability
We start by discussing the scoring function which we use to select models; we then discuss algorithms which attempt to optimize this function over the space of models, and finally examine their computational and sample complexity.

6.1 The objective function used for model selection

It is commonly assumed that the goal is to find the model with maximum likelihood. Often (e.g., in the REVEAL algorithm [LFS98], and in the Reconstructability Analysis (RA) or General Systems Theory community [Kli85, Kri86]) this is stated as finding the model in which the sum of the mutual information (MI) [CT91] between each node and its parents is maximal; in Appendix A, we prove that these objective functions are equivalent, in the sense that they rank models in the same order.

The trouble is, the ML model will be a complete graph, since this has the largest number of parameters, and hence can fit the data the best. A well-principled way to avoid this kind of over-fitting is to put a prior on models, specifying that we prefer sparse models. Then, by Bayes' rule, the MAP model is the one that maximizes

    Pr(G | D) = Pr(D | G) Pr(G) / Pr(D)    (2)

Taking logs, we find

    log Pr(G | D) = log Pr(D | G) + log Pr(G) - log Pr(D)

where log Pr(D) is a constant independent of G. Thus the effect of the prior is equivalent to penalizing overly complex models, as in the Minimum Description Length (MDL) approach.

An exact Bayesian approach to model selection is usually infeasible, since it involves computing the marginal likelihood Pr(D) = Σ_G Pr(D | G) Pr(G), which is a sum over an exponential number of models (see Section 6.2). However, we can use an asymptotic approximation to the posterior called BIC (Bayesian Information Criterion), which is defined as follows:

    log Pr(D | G) ≈ log Pr(D | G, Θ̂_G) - (log N / 2) #G

where N is the number of samples, #G is the dimension of the model,5 and Θ̂_G is the ML estimate of the parameters. The first term is computed using Equation 1. The second term is a penalty for model complexity. Since the number of parameters in the model is the sum of the number of parameters in each node,6 we see that the BIC score decomposes just like the log-likelihood (Equation 1). Hence all of the techniques invented for maximizing MI also work for the BIC case, and indeed, for any decomposable scoring function.

5 In the fully observable case, the dimension of a model is the number of free parameters. In a model with hidden variables, it might be less than this [GHM96].

6 Hence nodes with compact representations of their CPDs will incur a lower penalty, which can allow connections to form which might otherwise have been rejected [FG96].
6.2 Algorithms for finding the best model

Given that the score is decomposable, we can learn the parent set for each node independently. There are Σ_{k=0}^{n} C(n, k) = 2^n such sets, which can be arranged in a lattice as shown in Figure 5. The problem is to find the highest scoring point in this lattice.

The approach taken by REVEAL [LFS98] is to start at the bottom of the lattice, and evaluate the score at all points in each successive level, until a point is found with a score of 1.0. The scoring function they use is M(X_i; Pa(X_i)) / H(X_i), where M(X_i; Pa(X_i)) is the mutual information between X_i and its parents and H(X_i) is the entropy of X_i (see Appendix A for definitions); 1.0 is the highest possible score, and corresponds to Pa(X_i) being a perfect predictor of X_i, i.e., H(X_i | Pa(X_i)) = 0.

A normalized MI of 1 is only achievable with a deterministic relationship. For stochastic relationships, we must decide whether the gain in MI produced by a larger parent set is "worth it". The standard approach in the RA community uses the fact [Mil55] that, under the hypothesis of independence, 2 N ln(2) M is approximately chi-squared distributed (with M measured in bits), where N is the number of samples. Hence we can use a chi-squared test to decide whether an increase in MI is statistically significant. (This also gives us some kind of confidence measure on the connections that we learn.) Alternatively, we can use a complexity penalty term, as in the BIC score.

Of course, if we do not know if we have achieved the maximum possible score, we do not know when to stop searching, and hence we must evaluate all points in the lattice, which takes O(2^n) time. A common way to reduce the cost is to only search up until level K (i.e., assume a bound on the maximum number of parents of each node), which takes O(n^K) time. Unfortunately, in real genetic networks, it is known that some genes can have very high fan-in (and fan-out), so restricting the bound to, say, K = 3 would make it impossible to discover these important "master" genes.

The obvious way to avoid the exponential cost (and the need for a bound, K) is to use heuristics to avoid examining all possible subsets. One such heuristic works as follows. Start by evaluating all subsets of size up to two, keep all the ones with significant (in the chi-squared sense) MI with the target node, and take the union of the resulting sets as the set of parents.

The disadvantage of this greedy technique is that it will fail to find a set of parents unless some subset of size two has significant MI with the target variable. However, a Monte Carlo simulation in [Con88] shows that most random relations have this property. In addition, highly interdependent sets of parents (which might fail the pairwise MI test) violate the causal independence assumption, which is necessary to justify the use of noisy-OR and similar CPDs.

An alternative technique, popular in the UAI community, is to start with an initial guess of the model structure (i.e., at a specific point in the lattice), and then perform local search, i.e., evaluate the score of neighboring points in the lattice, and move to the best such point, until we reach a local optimum. We can use multiple restarts to try to find the global optimum, and to learn an ensemble of models. Note that, in the partially observable case, we need to have an initial guess of the model structure in order to estimate the values of the hidden nodes, and hence the (expected) score of each model (see Section 5); starting with the fully disconnected model (i.e., at the bottom of the lattice) would be a bad idea, since it would lead to a poor estimate.
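The bounded exhaustive search over the lattice is short to write down. The sketch below scores, for one node, every candidate parent set of size up to K (so O(n^K) sets) and keeps the best; the scoring callback can be any decomposable score, for example the family_bic sketch given after Section 6.1, or a chi-squared-thresholded MI.

    import itertools

    def best_parents(data, child, candidates, arity, K, score):
        """Exhaustively score every parent set of size <= K for one node.

        score(data, child, parents, arity) should return a decomposable
        family score (higher is better)."""
        best, best_score = (), float("-inf")
        for k in range(K + 1):
            for parents in itertools.combinations(candidates, k):
                s = score(data, child, list(parents), arity)
                if s > best_score:
                    best, best_score = parents, s
        return best, best_score

    # For a DBN, this is run once per node, with `candidates` restricted to the
    # nodes of the previous time slice; the chosen sets form the inter-slice
    # connectivity matrix.

The greedy pairwise-MI heuristic and the local search described above are ways of avoiding this enumeration when neither the fan-in bound nor the O(n^K) cost is acceptable.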
6.3 Sample complexity

In addition to the computational cost, another important consideration is the amount of data that is required to reliably learn structure. For deterministic boolean networks, this issue has been addressed from a statistical physics perspective [Her98] and a computational learning theory (combinatorial) perspective [AMK99]. In particular, [AMK99] prove that, if the fan-in is bounded by a constant K, the number of samples needed to identify a boolean network of n nodes is lower bounded by Ω(2^K + K log n) and upper bounded by O(2^{2K} (2K) log n). Unfortunately, the constant factor is exponential in K, which can be quite large for certain genes, as we have already remarked. These analyses all assume a "tabula rasa" approach; however, in practice, we will often have strong prior knowledge about the structure (or at least parts of it), which can reduce the data requirements considerably.

6.4 Scaling up to large networks

Since there are about 6400 genes for yeast (S. cerevisiae), all of which can now be simultaneously measured [DLB97], it is clear that we will have to do some pre-processing to exclude the "irrelevant" genes, and just try to learn the connections between the rest. After all, many of them do not change their expression level during a given experiment. In the [DLB97] dataset, for example, which consists of 7 time steps, only 416 genes showed a significant change — the rest are completely uninformative. Of course, even 416 is too big for the techniques we have discussed, so we must pre-process further, perhaps by clustering genes with similar time series [ESBB98]. In Section 8.1, we discuss techniques for learning the structure of DBNs with continuous state, which scale much better than the methods we have discussed above for discrete models.

7 Unknown structure, partial observability

Finally, we come to the hardest case of all, where the structure is unknown and there are hidden variables and/or missing data. One difficulty is that the score does not decompose. However, as observed in [Fri97, Fri98, FMR98, TMCH98], the expected score does decompose, and so we can use an iterative method, which alternates between evaluating the expected score of a model (using an inference engine), and changing the model structure, until a local maximum is reached. This is called the Structural EM (SEM) algorithm. SEM was successfully used to learn the structure of discrete DBNs with missing data in [FMR98].

7.1 Inventing new hidden nodes

So far, structure learning has meant finding the right connectivity between pre-existing nodes. A more interesting problem is inventing hidden nodes on demand. Hidden nodes can make a model much more compact: see Figure 6. The standard approach is to keep adding hidden nodes one at a time, to some part of the network (see below), performing structure learning at each step, until the score drops.
One problem is choosing the cardinality (number of possible values) for the hidden node, and its type of CPD. Another problem is choosing where to add the new hidden node. There is no point making it a child, since hidden children can always be marginalized away, so we need to find an existing node which needs a new parent, when the current set of possible parents is not adequate.

Figure 6: (a) A BN with a hidden variable. (b) The simplest network that can capture the same distribution without using the hidden variable (created using arc reversal and node elimination). If the hidden node is binary and the other nodes are trinary, and we assume full CPTs, the first network has 45 independent parameters, and the second has 708.

[RM98] use the following heuristic for finding nodes which need new parents: they consider a noisy-OR node which is nearly always on, even if its non-leak parents are off, as an indicator that there is a missing parent. Generalizing this idea, if a node is poorly predicted by the best set of parents we are able to find in the current model, it suggests we need to create a new node and add it to that node's parent set.

A simple heuristic for inventing hidden nodes in the case of DBNs is to check if the Markov property is being violated for any particular node. If so, it suggests that we need connections to slices further back in time. Equivalently, we can add new lag variables (see Section 2.5) and connect to them. (Note that reasoning across multiple time slices forms the basis of the ("model free") correlational analysis approach of [AR95, ASR97].)

A final problem is interpreting the "meaning" of the hidden nodes that we invent; e.g., we can often switch the interpretation of the true and false states (assuming for simplicity that the hidden node is binary) provided we also permute the parameters appropriately. (Symmetries such as this are one cause of the multiple maxima in the likelihood surface.) Our opinion is that fully automated structure discovery techniques can be useful as hypothesis generators, which can then be tested by experiment.
8 DBNs with continuous state

Up until now, we have assumed, for simplicity, that all the random variables are discrete-valued. However, in reality, most of the quantities we care about in genetic networks are continuous-valued. If we coarsely quantize the continuous variables (i.e., convert them to a small number of discrete values), we lose a lot of information, but if we finely quantize them, the resulting discrete model will have too many parameters. Hence it is better to work with continuous variables directly.

To create a BN with continuous variables, we need to specify the graph structure and the Conditional Probability Distributions at each node, as before. The most common distribution is a Gaussian, since it is analytically tractable. Consider, for example, the DBN in Figure 2(a), with Pr(X_t | X_{t-1}) = N(X_t; W X_{t-1}, Σ), where

    N(x; μ, Σ) = (1/c) exp( -(1/2) (x - μ)' Σ^{-1} (x - μ) )    (3)

is the Normal (Gaussian) distribution evaluated at x, and c = (2π)^{n/2} |Σ|^{1/2} is the normalizing constant. Equivalently, X_t = W X_{t-1} + ε_t, where the ε_t are independent Gaussian noise terms, corresponding to unmodelled influences. This is called a first-order auto-regressive AR(1) time series model [Ham94].

If X_t is a vector containing the expression levels of all the genes at time t, then this corresponds precisely to the model in [DWFS99]. (They do not explicitly mention Gaussian noise, but it is implicit in their decision to use least squares to fit the W term: see Section 8.1.) Despite the simplicity of this model (in particular, its linear nature), they find that it models their experimental data very well. Exogenous inputs can be added to the dynamics; for example, the response to a kainate injection can be modelled with a term k u(t) + b, where u(t) = 0 before the injection and u(t) = exp(-t/2) if t is the number of hours after injection, and k and b are vectors of length n, representing the response to kainate and an offset (bias) term, respectively.

In Figure 2(a), the X variables are always observed. Now consider Figure 2(b), where only the Y are observed, and have Gaussian CPDs. If the X are hidden, discrete variables, we have Pr(Y_t | X_t = i) = N(Y_t; μ_i, Σ_i); this is an HMM with Gaussian outputs. If the X are hidden, continuous variables, we have Pr(Y_t | X_t) = N(Y_t; C X_t, Σ). This is called a Linear Dynamical System, since both the dynamics, X_t = W X_{t-1} + ε_t, and the observations, Y_t = C X_t + δ_t, are linear. (It is also called a State Space model.)

To perform inference in an LDS, we can use the well-known Kalman filter algorithm [BSL93], which is just a special case of the junction tree algorithm. If the dynamics are non-linear, however, inference becomes much harder; the standard technique is to make a local linear approximation — this is called the Extended Kalman Filter (EKF), and is similar to techniques developed in the Recurrent Neural Networks literature [Pea95]. (One reason for the widespread success of HMMs with Gaussian outputs is that the discrete hidden variables can be used to approximate arbitrary non-linear dynamics, given enough training data.)

In the rest of this paper, we will stick to the case of fully observable (but possibly non-linear) models, for simplicity, so that inference will not be necessary. (We will, however, consider Bayesian methods of parameter learning, which can be thought of as inferring the most probable values of nodes which represent the parameters themselves.)

8.1 Learning
We have assumed that X_t is a vector of random variables. If we were to represent each component of this vector as its own (scalar) node, the inter-slice connections would correspond to the non-zero values in the weight matrix W, and undirected connections within a slice would correspond to non-zero entries in Σ^{-1} [Lau96]. For example, if

    W = [ 1 1 1 ; 0 1 1 ; 1 0 0 ]

and Σ is diagonal, the model is a DBN with an inter-slice arc for each non-zero entry of W and no intra-slice arcs.

Consider the AR(1) model in Figure 2(a). For notational simplicity we will assume Σ = σ² I, where λ = 1/σ² is the inverse variance (the precision), so the normalizing constant is c = (2π σ²)^{n/2}. Using Equation 3, we find that the likelihood of the data (ignoring terms which are independent of W) is

    Pr(D | W) = Π_{l=1}^{N} Π_{t=2}^{T} N( x^l(t); W x^l(t-1), σ² I ) = (1/Z) exp( -(λ/2) E_D(W) )    (4)

where x^l(t) is the value of X at time t in sequence l, Z is the product of the individual normalizing constants, N(T - 1) is the number of samples, and

    E_D(W) = Σ_{l=1}^{N} Σ_{t=2}^{T} || x^l(t) - W x^l(t-1) ||²

is an error (cost) function on the data. (Having multiple sequences is essentially the same as concatenating them into a single, longer sequence.)

Maximizing the likelihood is therefore the same as minimizing the squared error E_D, i.e., ML estimation of W is a linear least squares problem. Define the matrices X = ( x(1)', ..., x(T-1)' )' and Y = ( x(2)', ..., x(T)' )' (assuming a single sequence for simplicity). We can think of the t'th row of X as the t'th input vector to a linear neural network with no hidden layers (i.e., a perceptron), and the t'th row of Y as the corresponding output. Using this notation, we can write the least squares solution as follows (this is called the normal equation):

    W' = (X' X)^{-1} X' Y = X† Y

where X† is the pseudo-inverse of X (which is not square). Thus we see that the least squares solution is the same as the Maximum Likelihood (ML) solution assuming Gaussian noise.

The technique used by D'haeseleer et al. [DWFS99] is essentially this least squares fit. To encourage sparsity, small weights can then be pruned, using the second derivatives (Hessian) of the error surface to decide which weights can be removed with the least damage. In the technique of "optimal brain damage" [CDS90], they assume the Hessian is diagonal; in the more sophisticated technique of "optimal brain surgeon" [HS93], they remove this assumption.
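As a concrete illustration of the normal equation, the sketch below fits W by least squares from a single time series; the data matrix and its sizes are arbitrary placeholders, and because the rows hold x(t)', the solver returns W', the transpose of the W used in the text.

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 50, 5                          # time points x genes (placeholder sizes)
    data = rng.standard_normal((T, n))    # stand-in for a real expression matrix

    X = data[:-1]                         # rows x(1)' ... x(T-1)'
    Y = data[1:]                          # rows x(2)' ... x(T)'
    Wt = np.linalg.pinv(X) @ Y            # normal equation: W' = pinv(X) Y
    W = Wt.T

    print(W.shape)                        # (n, n)
    print(np.mean((X @ Wt - Y) ** 2))     # residual squared error of the fit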
Since we believe (hope!) that the true model is sparse (i.e., most entries in W are 0), we can encode this knowledge (assumption) by using a zero-mean Gaussian prior for each weight:

    Pr(W) = (1/Z_W) exp( -(α/2) E_W(W) )    (5)

where (since W is an n × n matrix)

    E_W(W) = Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij}²

is an error (cost) function on the weights. Then, using Bayes' rule (Equation 2) and Equations 4 and 5, the log posterior of the weights takes the form

    log Pr(W | D) = -(λ/2) E_D(W) - (α/2) E_W(W) + const.

The second term is called a regularizer, and encourages learning of a weight matrix with small values. (Unfortunately, this prior favours many small weights, rather than a few large ones, although this can be fixed [HP89].) This regularization technique is also called weight decay. Note that the use of a regularizer overcomes the problem that W will often be underdetermined, i.e., there will be fewer samples than parameters, N(T - 1) < n².

The values α and λ are called hyperparameters, since they control the prior on the weights and the system noise, respectively. Since their values are unknown, the correct Bayesian approach is to integrate them out, but an approximation that actually works better in practice is to estimate them from the data by maximizing the evidence: see [Mac95] for details.

We can associate a separate hyperparameter α_{ij} with each weight w_{ij}, find their MAP values, and use this as a "soft" means of finding which entries of W to keep: this is called Automatic Relevance Determination (ARD) [Mac95], and has been found to perform better than the weight deletion techniques discussed above, especially on small data sets.
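Here is a correspondingly minimal sketch of the MAP (weight decay) estimate under the Gaussian prior above: it is ridge regression, and it stays well defined even when there are fewer samples than parameters. Only the ratio α/λ matters for the point estimate; giving each weight its own hyperparameter, as in ARD, would replace this single scalar by a diagonal matrix.

    import numpy as np

    def ridge_fit(X, Y, alpha_over_lambda):
        """MAP estimate of W' under the weight decay prior: minimizes
        ||X W' - Y||^2 + (alpha/lambda) ||W'||^2  (same row convention as the
        least squares sketch above)."""
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha_over_lambda * np.eye(n), X.T @ Y)

    # alpha_over_lambda -> 0 recovers the plain least squares / ML solution.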
We carried out some provisional experiments on a subset of the data described in [WFM+98] (available at rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html), which has n = 112 genes, T = 9 time points and N = 1 sequence. Applying the ARD technique to this data ...

Appendix A

For a discrete random variable X,

    H(X) = - Σ_x Pr(x) log Pr(x)

is the entropy of X, and

    H(X | Y) = - Σ_{x, y} Pr(x, y) log Pr(x | y)

is the conditional entropy of X given Y. (If Pa(X_i) = ∅, we define M(X_i; Pa(X_i)) = 0 and H(X_i | Pa(X_i)) = H(X_i).) Finally, we define the MI score of a model G as

    M(G) = Σ_i M(X_i; Pa_G(X_i))

where the probabilities are set to their empirical values given D, and Pa_G(X_i) mean the parents of node X_i in graph G.

Theorem. Let G_1 and G_2 be BNs in which every node has a full CPT, and let D be a fully observable data set. Then9

    L(G) = M(G) - Σ_i H(X_i)

for G ∈ {G_1, G_2}, where the entropies are computed from the empirical distribution and do not depend on the structure of G.

Proof (sketch). When every node has a full CPT, the ML parameter estimates are the empirical conditional frequencies, so the normalized log-likelihood (Equation 1) can be rewritten as

    L(G) = Σ_i Σ_{x_i, π_i} P̂(x_i, π_i) log P̂(x_i | π_i) = - Σ_i H(X_i | Pa_G(X_i))

where π_i denotes a value of Pa_G(X_i) and P̂ denotes the empirical probabilities (relative frequencies of events in D). Using the identity M(X; Y) = H(X) - H(X | Y) from information theory [CT91], we find

    L(G) = Σ_i [ M(X_i; Pa_G(X_i)) - H(X_i) ] = M(G) - Σ_i H(X_i).

9 Since Σ_i H(X_i) does not depend on G, L(G_1) ≥ L(G_2) iff M(G_1) ≥ M(G_2), so L and M rank models in the same order.

We note that this theorem is similar to the result in [CL68], who show that the optimal tree-structured MRF is a maximal weight spanning tree (MWST) of the graph in which the weight of an arc between two nodes X_i and X_j is M(X_i; X_j). [MJM98] extends this result to mixtures of trees. Trees have the significant advantage that we can find the optimal one efficiently, by solving an MWST problem, instead of having to search through model space.
The theorem can be generalized as follows. If each node has a parametric CPD of the right form (i.e., one that is at least capable of representing the "true" conditional probability distribution of the node), then as N → ∞ the estimated conditional distributions converge to the true ones, and the proof goes through as before. But this leaves us with the additional problem of choosing the form of the CPD: see Section 4.2.
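The identity proved above is easy to check numerically. The toy sketch below (invented data, one noisy copy relationship) computes the empirical normalized log-likelihood of a small structure and verifies that it equals the MI score minus the sum of the marginal entropies.

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.integers(0, 2, size=(500, 3))            # three binary nodes
    data[:, 2] = data[:, 0] ^ (rng.random(500) < 0.1)   # node 2 noisily copies node 0

    def H(cols):
        """Empirical joint entropy (in nats) of the listed columns."""
        vals, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def MI(x, parents):
        return H([x]) + H(parents) - H([x] + parents) if parents else 0.0

    structure = {0: [], 1: [], 2: [0]}                  # Pa(X_2) = {X_0}
    loglik = -sum(H([i] + pa) - (H(pa) if pa else 0.0) for i, pa in structure.items())
    mi_score = sum(MI(i, pa) for i, pa in structure.items())
    entropies = sum(H([i]) for i in structure)

    print(np.isclose(loglik, mi_score - entropies))     # True: L = M - sum_i H(X_i)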
References

[AMK99] T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the boolean network model. In Proc. of the Pacific Symp. on Biocomputing, 1999.

[AR95] A. Arkin and J. Ross. Statistical construction of chemical reaction mechanisms from measured time-series. J. Phys. Chem., 99:970, 1995.

[ASR97] A. Arkin, P. Shen, and J. Ross. A test case of correlation metric construction of a reaction pathway from measurements. Science, 277:1275–1279, 1997.

[BFGK96] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1996.

[Bis95] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.

[BK98a] X. Boyen and D. Koller. Approximate learning of dynamic models. In Neural Info. Proc. Systems, 1998.

[BK98b] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. of the Conf. on Uncertainty in AI, 1998.

[BKRK97] J. Binder, D. Koller, S. J. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213–244, 1997.

[Bra98] M. Brand. An entropic estimator for structure discovery. In Neural Info. Proc. Systems, 1998.

[BSL93] Y. Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.

[Bun94] W. L. Buntine. Operations for learning with graphical models. J. of AI Research, pages 159–225, 1994.

[Bun96] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8(2), 1996.

[CDS90] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In Neural Info. Proc. Systems, 1990.

[CHG95] D. Chickering, D. Heckerman, and D. Geiger. Learning Bayesian networks: search methods and experimental results. In AI + Stats, 1995.

[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14:462–67, 1968.

[Con88] R. C. Conant. Extended dependency analysis of large systems. Intl. J. General Systems, 14:97–141, 1988.

[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.

[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, 1998.

[DLB97] J. L. DeRisi, V. R. Iyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 1997.

[DWFS99] P. D'haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. In Proc. of the Pacific Symp. on Biocomputing, 1999.

[ESBB98] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. of the National Academy of Science, USA, 1998.

[FG96] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proc. of the Conf. on Uncertainty in AI, 1996.

[FKP98] N. Friedman, D. Koller, and A. Pfeffer. Structured representation of complex stochastic systems. In Proc. of the Conf. of the Am. Assoc. for AI, 1998.

[FMR98] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In Proc. of the Conf. on Uncertainty in AI, 1998.

[Fre97] D. A. Freedman. From association to causation via regression. In V. R. McKim and S. P. Turner, editors, Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences. U. Notre Dame Press, 1997.

[Fri97] N. Friedman. Learning Bayesian networks in the presence of missing values and hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1997.

[Fri98] N. Friedman. The Bayesian structural EM algorithm. In Proc. of the Conf. on Uncertainty in AI, 1998.

[GHM96] D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed networks with hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1996.

[Gol80] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.

[Ham94] J. Hamilton. Time Series Analysis. Princeton University Press, 1994.

[HB94] D. Heckerman and J. S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Trans. on Systems, Man and Cybernetics, 26(6):826–831, 1994.

[HD94] C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. Intl. J. Approx. Reasoning, 11, 1994.

[Hec96] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1996.

[Hen89] M. Henrion. Some practical issues in constructing belief networks. In Proc. of the Conf. on Uncertainty in AI, 1989.

[Her98] J. Hertz. Statistical issues in the reverse engineering of genetic networks. In Proc. of the Pacific Symp. on Biocomputing, 1998.

[HMC97] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05, Microsoft Research, February 1997.

[HP89] S. Hanson and L. Pratt. Comparing biases for minimal network construction with back-propagation. In Neural Info. Proc. Systems, 1989.

[HS93] B. Hassibi and D. Stork. Second order derivatives for network pruning: optimal brain surgeon. In Neural Info. Proc. Systems, 1993.

[HS95] D. Heckerman and R. Shachter. Decision-theoretic foundations for causal reasoning. J. of AI Research, 3:405–430, 1995.

[Jen96] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, England, 1996.

[JGJS98] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.

[JGS96] M. I. Jordan, Z. Ghahramani, and L. K. Saul. Hidden Markov decision trees. In Neural Info. Proc. Systems, 1996.

[Kau93] S. Kauffman. The Origins of Order: Self-Organization and Selection in Evolution. Oxford Univ. Press, 1993.

[Kli85] G. Klir. The Architecture of Systems Problem Solving. Plenum Press, 1985.

[KP97] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1997.

[Kri86] K. Krippendorff. Information Theory: Structural Models for Qualitative Data (Quantitative Applications in the Social Sciences, No 62). Sage Publications, 1986.

[Lau95] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201, 1995.

[Lau96] S. Lauritzen. Graphical Models. OUP, 1996.

[LFS98] S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing, volume 3, pages 18–29, 1998.

[MA97] H. H. McAdams and A. Arkin. Stochastic mechanisms in gene expression. Proc. of the National Academy of Science, USA, 94(3):814–819, 1997.

[Mac95] D. MacKay. Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network, 1995.

[Mac98] D. MacKay. Introduction to Monte Carlo methods. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.

[MH97] C. Meek and D. Heckerman. Structure and parameter learning for causal independence and causal interaction models. In Proc. of the Conf. on Uncertainty in AI, pages 366–375, 1997.

[Mil55] G. Miller. Note on the bias of information estimates. In H. Quastler, editor, Information Theory in Psychology. The Free Press, 1955.

[MJM98] M. Meila, M. Jordan, and Q. Morris. Estimating dependency structure as a hidden variable. Technical Report 1648, MIT AI Lab, 1998.

[Mur99] K. P. Murphy. A variational approximation for Bayesian networks with discrete and continuous latent variables. In Proc. of the Conf. on Uncertainty in AI, 1999. Submitted.

[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[Pea95] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey. IEEE Trans. on Neural Networks, 6, 1995.

[Rab89] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.

[RM98] S. Ramachandran and R. Mooney. Theory refinement of Bayesian networks with hidden variables. In Machine Learning: Proceedings of the International Conference, 1998.

[SGS93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. Out of print.

[Sha98] R. Shachter. Bayes-ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). In Proc. of the Conf. on Uncertainty in AI, 1998.

[SHJ97] P. Smyth, D. Heckerman, and M. I. Jordan. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2):227–269, 1997.

[Sri93] S. Srinivas. A generalization of the noisy-OR model. In Proc. of the Conf. on Uncertainty in AI, 1993.

[SS96] R. Somogyi and C. A. Sniegoski. Modeling the complexity of genetic networks: understanding multigenetic and pleiotropic regulation. Complexity, 1(45), 1996.

[TMCH98] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. Learning mixtures of DAG models. In Proc. of the Conf. on Uncertainty in AI, 1998.

[WFM+98] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc. of the National Academy of Science, USA, 95:334–339, 1998.

[Whi90] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.

[WMS94] J. V. White, I. Muchnik, and T. F. Smith. Modelling protein cores with Markov Random Fields. Mathematical Biosciences, 124:149–179, 1994.

[WWS99] D. C. Weaver, C. T. Workman, and G. D. Stormo. Modeling regulatory networks with weight matrices. In Proc. of the Pacific Symp. on Biocomputing, 1999.