6.867 Machine learning, lecture 22 (Jaakkola)
The parameters we have to learn are therefore the conditional distributions $P(x_i \mid x_{pa_i})$ in the product. For later utility we will write $P(x_i \mid x_{pa_i}) = \theta_{x_i \mid x_{pa_i}}$ to specify the parameters. The log-likelihood of a dataset $D$ of $n$ complete observations then decomposes as
$$l(D; \theta, G) = \sum_{i} \sum_{x_i, x_{pa_i}} n(x_i, x_{pa_i}) \log \theta_{x_i \mid x_{pa_i}}$$
where we have again collapsed the available data into counts $n(x_i, x_{pa_i})$, the number of observed instances with a particular setting of the variable and its parents. These are the sufficient statistics we need from the data in order to estimate the parameters. This will be true in the Bayesian setting as well (discussed below). Note that the statistics we need depend on the model structure or graph $G$. The parameters $\hat{\theta}_{x_i \mid x_{pa_i}}$ that maximize the log-likelihood have simple closed form expressions in terms of empirical fractions:
$$\hat{\theta}_{x_i \mid x_{pa_i}} = \frac{n(x_i, x_{pa_i})}{\sum_{x_i'=1}^{r_i} n(x_i', x_{pa_i})} \qquad (6)$$
This simplicity is due to our assumption that $\theta_{x_i \mid x_{pa_i}}$ can be chosen freely for each setting of the parents $x_{pa_i}$. The parameter estimates are likely not going to be particularly good when the number of parents increases. For example, just to provide one observation per configuration of the parent variables would require $\prod_{j \in pa_i} r_j$ instances (e.g., $2^{10} = 1024$ instances for ten binary parents). Introducing some regularization is clearly important, at least in the fully parameterized case. We will provide a Bayesian treatment of the parameter estimation problem shortly.
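As a concrete illustration, here is a minimal sketch of the ML estimates in equation (6), assuming the data are stored as an integer array with one row per observation and one column per variable (the function name, array layout, and fallback for unseen parent configurations are illustrative choices, not from the lecture):

```python
import numpy as np

def ml_estimate(data, child, parents, r):
    """ML estimate of P(x_child | x_parents) from complete data.

    data    : (n, d) integer array, values of variable i in {0, ..., r[i]-1}
    child   : index of the variable x_i
    parents : list of parent indices pa_i
    r       : list of cardinalities r_1, ..., r_d
    Returns an array of shape (q_i, r_i) whose rows are the estimated
    conditional distributions, one row per parent configuration.
    """
    q = int(np.prod([r[p] for p in parents])) if parents else 1
    counts = np.zeros((q, r[child]))                     # n(x_i, x_pa_i)
    for row in data:
        j = 0
        for p in parents:                                # index the parent configuration
            j = j * r[p] + row[p]
        counts[j, row[child]] += 1
    totals = counts.sum(axis=1, keepdims=True)           # sum over x_i' of n(x_i', x_pa_i)
    # Empirical fractions; fall back to a uniform table for unseen configurations.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / r[child])

# Example: estimate P(x3 | x1, x2) for three binary variables.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 3))
theta_hat = ml_estimate(data, child=2, parents=[0, 1], r=[2, 2, 2])
```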
BIC and structure estimation
Given the ML parameter estimates $\hat{\theta}_{x_i \mid x_{pa_i}}$ we can evaluate the resulting maximum value of the log-likelihood $l(D; \hat{\theta}, G)$ as well as the corresponding BIC score
$$\text{BIC}(D; G) = l(D; \hat{\theta}, G) - \frac{\log n}{2} \sum_{i} \Bigl[\, r_i \prod_{j \in pa_i} r_j \;-\; \prod_{j \in pa_i} r_j \,\Bigr]$$
where each term in the sum corresponds to the size of the probability table $P(x_i \mid x_{pa_i})$ minus the associated normalization constraints $\sum_{x_i'=1}^{r_i} P(x_i' \mid x_{pa_i}) = 1$.
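A sketch of how the BIC score could be computed from the same kind of counts (the graph encoding and array layout follow the same illustrative conventions as the previous sketch, and are not part of the lecture):

```python
import numpy as np

def bic_score(data, graph, r):
    """BIC score of a graph given complete data.

    graph : dict mapping each variable index i to its list of parent indices pa_i
    r     : list of cardinalities r_1, ..., r_d
    """
    n, d = data.shape
    loglik = 0.0
    num_params = 0
    for i in range(d):
        parents = graph[i]
        q = int(np.prod([r[p] for p in parents])) if parents else 1
        counts = np.zeros((q, r[i]))                 # n(x_i, x_pa_i)
        for row in data:
            j = 0
            for p in parents:
                j = j * r[p] + row[p]
            counts[j, row[i]] += 1
        theta = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        loglik += np.sum(counts * np.log(np.maximum(theta, 1e-300)))
        num_params += q * (r[i] - 1)                 # table size minus constraints
    return loglik - 0.5 * np.log(n) * num_params
```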
Several distinct graphs may represent exactly the same independence assumptions about the variables. Consider, for example, the two graphs $x_1 \rightarrow x_2$ and $x_1 \leftarrow x_2$: neither graph makes any independence assumptions and they are therefore equivalent in this sense. The resulting BIC scores for such graphs are also identical. The principle that equivalent graphs should receive the same score is known as likelihood equivalence. How can we determine if two graphs are equivalent? In principle this can be done by deriving all the possible independence statements from the graphs and comparing the resulting lists, but there are easier ways. Two graphs are equivalent if they differ only in the direction of arcs and possess the same v-structures, i.e., they have the same set of converging arcs (two or more arcs pointing to a single node). This criterion captures most equivalences. Figure 1 provides a list of all equivalence classes of graphs over three variables. Only one representative of each class is shown and the number next to the graph indicates how many graphs there are that are equivalent to the representative.
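The skeleton/v-structure test can be sketched as follows; here a v-structure is taken, as is standard, to be a pair of converging arcs $a \rightarrow c \leftarrow b$ whose tails $a$ and $b$ are not themselves adjacent (the edge-set representation and function names are illustrative choices, not from the lecture):

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Set of v-structures a -> c <- b with a and b non-adjacent."""
    parents = {}
    for a, c in edges:
        parents.setdefault(c, set()).add(a)
    skel = skeleton(edges)
    vs = set()
    for c, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:          # tails must be non-adjacent
                vs.add((frozenset((a, b)), c))
    return vs

def equivalent(edges1, edges2):
    """Equivalence test: same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

# x1 -> x2 and x1 <- x2 are equivalent; the v-structure x1 -> x3 <- x2 is not
# equivalent to the chain x1 -> x3 -> x2.
print(equivalent({("x1", "x2")}, {("x2", "x1")}))                               # True
print(equivalent({("x1", "x3"), ("x2", "x3")}, {("x1", "x3"), ("x3", "x2")}))   # False
```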
Equivalence of graphs and the associated scores highlight why we should not interpret the
arcs in Bayesian networks as indicating the direction of causal influence. While models are
often drawn based on one’s causal understanding, when learning them from the available
data we can only distinguish between models that make different probabilistic assumptions
about the variables involved (different independence properties), not based on which way
the arcs are pointing. It is nevertheless possible to estimate causal Bayesian networks,
models where we can interpret the arcs as causal influences. The difficulty is that we
need interventional data to do so, i.e., data that correspond to explicitly setting some of
the variables to specific values (controlled experiments) rather than simply observing the
values they take.
Bayesian estimation
The idea in Bayesian estimation is to avoid reducing our knowledge about the parameters to point estimates (e.g., ML estimates) and instead retain all the information in the form of a distribution over the possible parameter values. This is advantageous when the available data are limited and the number of parameters is large (e.g., only a few data points per
[Figure 1: representatives of the equivalence classes of graphs over the three variables $x_1, x_2, x_3$, each shown with the number of graphs in its class in parentheses.]
parameter to estimate). The Bayesian framework requires us to also articulate our knowledge about the parameters prior to seeing any data in the form of a distribution, the prior distribution. Consider the following simple graph with three variables
[Graph: $x_1 \rightarrow x_3 \leftarrow x_2$]
The parameters we have to estimate are $\{\theta_{x_1}\}$, $\{\theta_{x_2}\}$, and $\{\theta_{x_3 \mid x_1, x_2}\}$. We will assume that the parameters are a priori independent for each variable and across different configurations of parents (parameter independence assumption):
$$P(\theta) = P(\{\theta_{x_1}\}_{x_1 = 1, \ldots, r_1}) \, P(\{\theta_{x_2}\}_{x_2 = 1, \ldots, r_2}) \prod_{x_1, x_2} P(\{\theta_{x_3 \mid x_1, x_2}\}_{x_3 = 1, \ldots, r_3}) \qquad (9)$$
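For instance, if all three variables are binary ($r_1 = r_2 = r_3 = 2$), the product over parent configurations in (9) expands into four factors, one per setting of $(x_1, x_2)$:
$$P(\theta) = P(\{\theta_{x_1}\})\, P(\{\theta_{x_2}\})\, P(\{\theta_{x_3 \mid 1,1}\})\, P(\{\theta_{x_3 \mid 2,1}\})\, P(\{\theta_{x_3 \mid 1,2}\})\, P(\{\theta_{x_3 \mid 2,2}\})$$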
We will also assume that the same prior distribution is used over the same parameters whenever they appear in different graphs (parameter modularity). For example, consider the two graphs $G$ and $G'$ shown below.

[Figure: the graphs $G$ and $G'$ over $x_1, x_2, x_3$]

Since $x_1$ has no parents in either $G$ or $G'$,
we need the same parameter $\{\theta_{x_1}\}$ in both models. The parameter modularity assumption corresponds to using the same prior distribution $P(\{\theta_{x_1}\}_{x_1 = 1, \ldots, r_1})$ for both models (other parameters would have different prior distributions since, e.g., $\theta_{x_3 \mid x_1, x_2}$ does not appear in graph $G'$). Finally, we would like the marginal likelihood score to satisfy likelihood equivalence similarly to BIC. In other words, if $G$ and $G'$ are equivalent, then we would like $P(D \mid G) = P(D \mid G')$ where, e.g.,
$$P(D \mid G) = \int P(D \mid \theta, G)\, P(\theta)\, d\theta \qquad (10)$$
These assumptions essentially constrain the prior over the parameters to be the Dirichlet distribution. To specify and use this prior distribution, it will be helpful to change the notation slightly. We will denote the parameters by $\theta_{ijk}$ where $i$ specifies the variable, $j$ the parent configuration (see below), and $k$ the value of the variable $x_i$. Clearly, $\sum_{k=1}^{r_i} \theta_{ijk} = 1$ for all $i$ and $j$. The parent configurations are simply indexed from $j = 1, \ldots, q_i$ as in

  j     x1    x2
  1     1     1
  2     2     1                (11)
  ...   ...   ...
  q     r1    r2

where $q = r_1 r_2$. When $x_i$ has no parents we say there is only one “parent configuration” so that $P(x_i = k) = \theta_{i1k}$. Note that writing parameters as $\theta_{ijk}$ is graph specific; the parents of each variable, and therefore also the parent configurations, vary from one graph to another. We will define $\theta_{ij} = \{\theta_{ijk}\}_{k=1,\ldots,r_i}$ so we can talk about all the parameters for $x_i$ given a fixed parent configuration.
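A sketch of this indexing scheme, with the first parent varying fastest as in the table above (the 1-based values and function name are illustrative choices):

```python
def parent_config_index(parent_values, parent_cards):
    """Map a parent assignment (1-based values) to the index j in 1, ..., q_i.

    parent_values : values (x1, x2, ...) of the parents, each in 1..r
    parent_cards  : cardinalities (r1, r2, ...) of the parents
    The first parent varies fastest, matching the table in (11).
    """
    j, stride = 1, 1
    for value, card in zip(parent_values, parent_cards):
        j += (value - 1) * stride
        stride *= card
    return j

# With two binary parents: (1,1) -> 1, (2,1) -> 2, (1,2) -> 3, (2,2) -> 4.
print([parent_config_index(v, (2, 2)) for v in [(1, 1), (2, 1), (1, 2), (2, 2)]])
```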
Now, the prior distribution of each $\theta_{ij}$ has to be a Dirichlet:
$$P(\theta_{ij}) = \frac{\Gamma(\sum_k \alpha_{ijk})}{\prod_k \Gamma(\alpha_{ijk})} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1} = \text{Dirichlet}(\theta_{ij}; \alpha_{ij1}, \ldots, \alpha_{ij r_i}) \qquad (12)$$
The second subtlety further constrains the values $p'_{ijk}$, in addition to the normalization. In order for likelihood equivalence to hold, it should be possible to interpret $p'_{ijk}$ as marginals $P'(x_i, x_{pa_i})$ of some common distribution over all the variables $P'(x_1, \ldots, x_d)$ (common to all the graphs we consider). For example, simply normalizing $p'_{ijk}$ across the parent configurations and the values of the variables does not ensure that they can be viewed as marginals from some common joint distribution $P'$. This subtlety does not often arise in practice. It is typical and easy to set them based on a uniform distribution so that
$$p'_{ijk} = \frac{1}{r_i \prod_{l \in pa_i} r_l} = \frac{1}{r_i q_i} \quad \text{or} \quad \alpha_{ijk} = \frac{n'}{r_i q_i} \qquad (14)$$
This leaves us with only one hyper-parameter to set: $n'$, the equivalent sample size.
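As an illustration, the uniform choice in (14) and a draw from the resulting Dirichlet prior might look as follows (the particular variable set-up mirrors the earlier sketches and is not from the lecture):

```python
import numpy as np

# Equivalent sample size n' and a variable x_i with r_i values and q_i parent configurations.
n_prime = 1.0
r_i, q_i = 2, 4                       # e.g., binary x_i with two binary parents

# Uniform prior pseudo-counts: alpha_ijk = n' / (r_i * q_i), identical for every j and k.
alpha = np.full((q_i, r_i), n_prime / (r_i * q_i))

# Each row theta_ij is a priori Dirichlet(alpha_ij1, ..., alpha_ijri); sample one prior draw.
rng = np.random.default_rng(0)
theta_prior = np.vstack([rng.dirichlet(alpha[j]) for j in range(q_i)])
print(alpha)         # pseudo-counts, summing to n' over all j and k
print(theta_prior)   # one sampled conditional probability table
```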
We can now combine the data and the prior to obtain posterior estimates for the parameters.
The prior factors across the variables and across parent configurations. Moreover, we
assume that each observation is complete, containing a value assignment for all the variables
in the model. As a result, we can evaluate the posterior probability over each $\theta_{ij}$ separately from the others. Specifically, for each $\theta_{ij} = \{\theta_{ijk}\}_{k=1,\ldots,r_i}$, where $i$ and $j$ are fixed, we get
$$P(\theta_{ij} \mid D, G) \;\propto\; \Bigl[\prod_{t:\, x^t_{pa_i} \rightarrow j} P(x_i^t \mid x^t_{pa_i}, \theta_{ij})\Bigr] P(\theta_{ij}) \qquad (15)$$
$$= \Bigl[\prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}\Bigr] P(\theta_{ij}) \qquad (16)$$
$$\propto \Bigl[\prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}\Bigr] \Bigl[\prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}\Bigr] \qquad (17)$$
$$= \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk} + \alpha_{ijk} - 1} \qquad (18)$$
where the product in the first line picks out only the observations where the parent configuration maps to $j$ (otherwise the case would fall under the domain of another parameter vector). $n_{ijk}$ specifies the number of observations where $x_i$ had value $k$ and its parents $x_{pa_i}$ were in configuration $j$. Clearly, $\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} n_{ijk} = n$. The posterior has the same form as the prior and is therefore also Dirichlet, just with updated hyper-parameters:
$$P(\theta_{ij} \mid D, G) = \text{Dirichlet}(\theta_{ij};\, \alpha_{ij1} + n_{ij1}, \ldots, \alpha_{ij r_i} + n_{ij r_i}) \qquad (19)$$
The corresponding normalization term (the constant that turns (16) into a proper density) is
$$\int \Bigl[\prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}\Bigr] P(\theta_{ij})\, d\theta_{ij} = \frac{\Gamma(\sum_k \alpha_{ijk})}{\Gamma(\sum_k \alpha_{ijk} + \sum_k n_{ijk})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \qquad (20)$$
This is also the marginal likelihood of the data pertaining to $x_i$ when $x_{pa_i}$ are in configuration $j$.
Since the observations are complete, and the prior is independent for each set of parameters,
the marginal likelihood of all the data is simply a product of these local normalization terms.
The product is taken across variables and across different parent configurations:
$$P(D \mid G) = \prod_{i=1}^{d} \prod_{j=1}^{q_i} \frac{\Gamma(\sum_k \alpha_{ijk})}{\Gamma(\sum_k \alpha_{ijk} + \sum_k n_{ijk})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \qquad (21)$$
We would now find a graph $G$ that maximizes $P(D \mid G)$. Note that Eq. (21) is easy to evaluate for any particular graph by recomputing some of the counts $n_{ijk}$. We can further penalize graphs that involve a large number of parameters (or edges) by assigning a prior probability $P(G)$ over the graphs, and maximizing instead $P(D \mid G)P(G)$. For example, the prior could be some function of the number of parameters in the model, $\sum_{i=1}^{d} (r_i - 1) q_i$, such as
$$P(G) \propto \frac{1}{\sum_{i=1}^{d} (r_i - 1) q_i} \qquad (22)$$
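A sketch of how the (log) score $\log P(D \mid G) + \log P(G)$ in (21)-(22) could be evaluated from counts, using the uniform hyper-parameters of (14); the graph encoding, array layout, and function names are illustrative choices, not from the lecture:

```python
import numpy as np
from scipy.special import gammaln   # log Gamma, for numerical stability

def log_marginal_likelihood(data, graph, r, n_prime=1.0):
    """log P(D|G) from (21) with the uniform prior alpha_ijk = n'/(r_i q_i) of (14).

    data  : (n, d) integer array with values of variable i in {0, ..., r[i]-1}
    graph : dict mapping each variable index i to its list of parent indices
    r     : list of cardinalities r_1, ..., r_d
    """
    d = data.shape[1]
    score = 0.0
    for i in range(d):
        parents = graph[i]
        q_i = int(np.prod([r[p] for p in parents])) if parents else 1
        alpha = n_prime / (r[i] * q_i)
        counts = np.zeros((q_i, r[i]))                       # n_ijk
        for row in data:
            j = 0
            for p in parents:
                j = j * r[p] + row[p]                        # parent configuration index
            counts[j, row[i]] += 1
        # Local terms of (21): one per parent configuration j.
        score += np.sum(gammaln(alpha * r[i]) - gammaln(alpha * r[i] + counts.sum(axis=1)))
        score += np.sum(gammaln(alpha + counts) - gammaln(alpha))
    return score

def log_graph_prior(graph, r):
    """log P(G) for the prior (22), proportional to 1 / (number of parameters)."""
    num_params = sum((r[i] - 1) * (int(np.prod([r[p] for p in graph[i]])) if graph[i] else 1)
                     for i in graph)
    return -np.log(num_params)

# Compare the v-structure x1 -> x3 <- x2 against the empty graph on synthetic data.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 3))
for g in [{0: [], 1: [], 2: [0, 1]}, {0: [], 1: [], 2: []}]:
    print(log_marginal_likelihood(data, g, r=[2, 2, 2]) + log_graph_prior(g, [2, 2, 2]))
```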
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].