Lecture 4 — October 18th, Fall 2017
In this lecture, we will assume that all random variables are discrete, to keep notations as simple as possible. All the theory presented generalizes immediately to continuous random variables that have a density, by replacing sums with integrals. We consider a joint distribution p(x) = p(x_1, . . . , x_n), where x stands for (x_1, . . . , x_n). Given A ⊂ {1, . . . , n}, we denote the marginal distribution of x_A by

p(x_A) = \sum_{x_{A^c}} p(x_A, x_{A^c}),

and, whenever p(x_{A^c}) > 0, the conditional distribution of x_A given x_{A^c} by

p(x_A | x_{A^c}) = \frac{p(x_A, x_{A^c})}{p(x_{A^c})}.
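As a quick numerical illustration of these two formulas (a minimal NumPy sketch; the three binary variables and the choice A = {1, 2}, A^c = {3} are ours):

import numpy as np

rng = np.random.default_rng(0)

# A random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

# Marginal of x_A with A = {1, 2}: sum out the complement A^c = {3}.
p_A = p.sum(axis=2)
assert np.allclose(p_A.sum(), 1.0)

# Conditional p(x_A | x_{A^c}): divide the joint by the marginal of x_3.
p_Ac = p.sum(axis=(0, 1))                 # p(x3)
p_A_given_Ac = p / p_Ac[None, None, :]
assert np.allclose(p_A_given_Ac.sum(axis=(0, 1)), 1.0)  # normalized over x_A for every x3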
and that they are mutually independent conditionally on X_C (or given X_C) if and only if

p(x_{A_1}, . . . , x_{A_k} | x_C) = \prod_{i=1}^{k} p(x_{A_i} | x_C)   ∀ x_{A_1}, . . . , x_{A_k}, x_C s.t. p(x_C) > 0.
Remark 4.1.1 Note that the conditional probability p(xA , xB |xC ) is the probability distri-
bution over (XA , XB ) if XC is known to be equal to xC . In practice, it means that if the
value of XC is observed (e.g. via a measurement) then the distribution over (XA , XB ) is
p(xA , xB |xC ). The conditional independence statement XA ⊥⊥ XB | XC should therefore be
interpreted as "when the value of X_C is observed (or given), X_A and X_B are independent".
Remark 4.1.2 (Pairwise independence vs mutual independence) Consider a collection of r.v. (X_1, . . . , X_n). We say that these variables are pairwise independent if for all 1 ≤ i < j ≤ n, X_i ⊥⊥ X_j. Note that this is different from assuming that X_1, . . . , X_n are mutually (or jointly, or globally) independent. A standard counter-example is as follows: given two variables X, Y that are independent coin flips, define Z via the XOR function ⊕ as Z = X ⊕ Y. Then the three random variables X, Y, Z are pairwise independent, but not mutually independent. (Prove this as an exercise.) The notations presented for pairwise independence can be generalized to collections of variables that are mutually independent.
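A quick way to convince oneself of this counter-example is to enumerate the four equally likely outcomes; the following minimal Python check (our own, with arbitrary variable names) verifies pairwise independence and exhibits the failure of mutual independence:

import itertools

# The four equally likely outcomes (x, y, z) with z = x xor y.
outcomes = [(x, y, x ^ y) for x, y in itertools.product([0, 1], repeat=2)]
p = 1.0 / len(outcomes)

def marginal(indices):
    """Joint distribution of the coordinates listed in `indices`."""
    table = {}
    for o in outcomes:
        key = tuple(o[i] for i in indices)
        table[key] = table.get(key, 0.0) + p
    return table

# Pairwise independence: p(a, b) = p(a) p(b) for every pair of variables and values.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij, pi, pj = marginal([i, j]), marginal([i]), marginal([j])
    assert all(abs(pij[(a, b)] - pi[(a,)] * pj[(b,)]) < 1e-12 for (a, b) in pij)

# Mutual independence fails: p(1, 1, 1) = 0 while p(x=1) p(y=1) p(z=1) = 1/8.
pxyz, px, py, pz = marginal([0, 1, 2]), marginal([0]), marginal([1]), marginal([2])
print(pxyz.get((1, 1, 1), 0.0), px[(1,)] * py[(1,)] * pz[(1,)])  # 0.0 vs 0.125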
Figure 4.1. Nodes representing binary variables indicating the presence or not of a disease or a symptom.
We have n nodes, each a binary variable (X_i ∈ {0, 1}), indicating the presence or absence of a disease or a symptom. The size of the joint probability table grows exponentially with n: for 100 diseases and symptoms, we would need a table with 2^{100} entries to store all the possible states. This is clearly intractable. Instead, we will use graphical models to represent the relationships between nodes.
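To make the gain concrete, here is a back-of-the-envelope comparison (a sketch with numbers we pick for illustration: 100 binary variables, each with at most 5 parents in the graph):

# Number of table entries needed for the naive joint vs. a factorized representation
# in which each node stores one table p(x_i | x_parents) of size 2^(k+1).
n, k = 100, 5
naive = 2 ** n
factorized = n * 2 ** (k + 1)
print(naive)        # 1267650600228229401496703205376
print(factorized)   # 6400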
A directed graphical model, also historically called a "Bayesian network" when the variables are discrete, represents a family of distributions, denoted L(G), where

L(G) := {p : ∃ legal factors f_i s.t. p(x_V) = \prod_{i=1}^{n} f_i(x_i, x_{π_i})},

where the legal factors satisfy f_i ≥ 0 and \sum_{x_i} f_i(x_i, x_{π_i}) = 1 for all i and all x_{π_i}.
Definition 4.1 Let G = (V, E) be a DAG with V = {1, . . . , n}. We say that p(x) factorizes in G, denoted p(x) ∈ L(G), if there exist functions f_i, called factors, such that:

∀x, p(x) = \prod_{i=1}^{n} f_i(x_i, x_{π_i}),  with  f_i ≥ 0  and  ∀i, ∀x_{π_i}, \sum_{x_i} f_i(x_i, x_{π_i}) = 1,     (4.6)

where we recall that π_i stands for the set of parents of the vertex i in G.
We prove the following useful and fundamental property of directed graphical models: if a probability distribution factorizes according to a directed graph G = (V, E), the distribution obtained by marginalizing a leaf¹ i factorizes according to the graph induced on V \ {i}.
Proposition 4.2 (Leaf marginalization) Suppose that p factorizes in G, i.e. p(x_V) = \prod_{j=1}^{n} f_j(x_j, x_{π_j}). Then for any leaf i, we have p(x_{V \ {i}}) = \prod_{j ≠ i} f_j(x_j, x_{π_j}), hence p(x_{V \ {i}}) factorizes in G' = (V \ {i}, E'), the induced graph on V \ {i}.
Proof Without loss of generality, we can assume that the leaf is indexed by n. Since it is a leaf, we clearly have that n ∉ π_i for all i ≤ n − 1. We have the following computation:

p(x_1, . . . , x_{n−1}) = \sum_{x_n} p(x_1, . . . , x_n)
                        = \sum_{x_n} \left( \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}) \right) f_n(x_n, x_{π_n})
                        = \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}) \sum_{x_n} f_n(x_n, x_{π_n})
                        = \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}),

where the last equality uses the normalization \sum_{x_n} f_n(x_n, x_{π_n}) = 1.
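As a numerical sanity check of this proposition (a small NumPy sketch; the chain 1 → 2 → 3 and the random tables are our own choice), marginalizing the leaf X_3 of a three-node chain leaves exactly the product of the two remaining factors:

import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_states, n_parent_states):
    """Random conditional table f(x | x_parent), normalized over x."""
    t = rng.random((n_parent_states, n_states))
    return t / t.sum(axis=1, keepdims=True)

# Chain graph 1 -> 2 -> 3 on binary variables.
f1 = random_cpt(2, 1)[0]      # f1(x1), shape (2,)
f2 = random_cpt(2, 2)         # f2(x2 | x1), shape (2, 2)
f3 = random_cpt(2, 2)         # f3(x3 | x2), shape (2, 2)

# Joint p(x1, x2, x3) = f1(x1) f2(x2 | x1) f3(x3 | x2).
joint = f1[:, None, None] * f2[:, :, None] * f3[None, :, :]

# Marginalizing the leaf x3 leaves f1(x1) f2(x2 | x1): a factorization in the induced chain 1 -> 2.
marg = joint.sum(axis=2)
assert np.allclose(marg, f1[:, None] * f2)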
Remark 4.2.1 Note that the new graph obtained by removing a leaf is still a DAG. Indeed, since we only removed edges and nodes, if there were a cycle in the induced graph, the same cycle would be present in the original graph, which is not possible since it is a DAG.
Remark 4.2.2 Also, by induction, this result shows that in the definition of factorization we do not need to assume that p is a probability distribution. Indeed, if any function p satisfies (4.6), then it is a probability distribution: it is non-negative as a product of non-negative factors, and it sums to 1 by applying the marginalization formula above repeatedly.
¹ We call here a leaf or terminal node of a DAG a node that has no descendants.
Now we try to characterize the factor functions. The following result implies that if p factorizes in G, then the factors are uniquely determined.
Proposition 4.4 If p(x) ∈ L(G) then, for all i ∈ {1, . . . , n}, fi (xi , xπi ) = p(xi |xπi ).
Proof Assume, without loss of generality, that the nodes are sorted in a topological order. Consider a node i. Since the nodes are in topological order, for any 1 ≤ j ≤ n we have π_j ⊂ {1, . . . , j − 1}; as a consequence we can apply Proposition 4.2 n − i times to obtain that p(x_1, . . . , x_i) = \prod_{j ≤ i} f_j(x_j, x_{π_j}). Since we also have p(x_1, . . . , x_{i−1}) = \prod_{j < i} f_j(x_j, x_{π_j}), taking the ratio, we have

p(x_i | x_1, . . . , x_{i−1}) = f_i(x_i, x_{π_i}).

Since π_i ⊂ {1, . . . , i − 1}, and since the right-hand side depends on the conditioning variables only through x_{π_i}, this entails that p(x_i | x_1, . . . , x_{i−1}) = p(x_i | x_{π_i}) = f_i(x_i, x_{π_i}).
Hence we can give a definition of factorization in a DAG equivalent to Definition 4.1:
Definition 4.5 (Equivalent definition) The probability distribution p(x) factorizes in G, denoted p(x) ∈ L(G), iff

∀x, p(x) = \prod_{i=1}^{n} p(x_i | x_{π_i}).     (4.7)
Example 4.2.1
• (Trivial graph) Assume E = ∅, i.e. there are no edges. We then have p(x) = \prod_{i=1}^{n} p(x_i), implying the random variables X_1, . . . , X_n are independent; that is, variables are mutually independent if they factorize in the empty graph.
• (Complete graph) Assume now we have a complete graph (thus with n(n − 1)/2 edges, as we need acyclicity for it to be a DAG); we have p(x) = \prod_{i=1}^{n} p(x_i | x_1, . . . , x_{i−1}), the so-called "chain rule", which is always true. Every probability distribution therefore factorizes in a complete graph. Note that there are n! possible complete graphs (one per ordering of the vertices), which are all equivalent in the sense that they define the same set of distributions.
• (Graphs with several connected components) If G has several connected components C_1, . . . , C_k, then p ∈ L(G) ⇒ p(x) = \prod_{j=1}^{k} p(x_{C_j}) (Exercise). As a consequence, each connected component can be treated separately. In the rest of the lecture, we will therefore focus on connected graphs.
Figure 4.4. The "explaining away" graph: X → Z ← Y.
• (Explaining away) Represented in Fig. 4.4, we can show for this type of graph:

p(x) ∈ L(G) ⇒ X ⊥⊥ Y.     (4.10)

It basically stems from:

p(x, y) = \sum_{z} p(x, y, z) = p(x) p(y) \sum_{z} p(z | x, y) = p(x) p(y).
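The XOR counter-example of Remark 4.1.2 is exactly a distribution of this type, and it makes the contrast with conditioning explicit (a tiny check, our own code): X and Y are independent, but they become dependent once Z is observed.

# X, Y independent fair coins, Z = X xor Y (so the graph is X -> Z <- Y).
outcomes = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]   # each has probability 1/4

# Conditionally on Z = 0, only (0, 0) and (1, 1) remain, each with probability 1/2.
p_xy_given_z0 = {(x, y): 0.5 for (x, y, z) in outcomes if z == 0}

# Independence given Z would require p(x, y | z=0) = p(x | z=0) p(y | z=0) = 1/4 everywhere.
print(p_xy_given_z0.get((0, 1), 0.0))   # 0.0, not 0.25: X and Y are dependent given Z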
Remark 4.2.3 The word "cause" should here be put between quotes and used very carefully: in the same way that correlation is not causation, conditional dependence is not causation either. This is however the historical name for this model. The reason why "cause" is a bad name, and why "latent factor" might be better, is that the factorization properties encoded by graphical models do not in general correspond to the existence of causal mechanisms, but only to conditional independence relations.
Remark 4.2.4 If p factorizes in the "latent cause" graph, then p(x, y, z) = p(z)p(x|z)p(y|z). But using Bayes' rule p(z) p(x|z) = p(x) p(z|x), and so we also have that p(x, y, z) = p(x)p(z|x)p(y|z), which shows that p is a Markov chain (i.e. factorizes in the Markov chain graph). This is an example of the basic edge reversal that we will discuss in the next section. Note that we proceeded by equivalence, which shows that the Markov chain graph, the "latent cause" graph and the reversed Markov chain graph are in fact equivalent, in the sense that distributions that factorize according to one factorize according to the others. This is what we will call Markov equivalence.
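This edge reversal can also be checked numerically (a small NumPy sketch with random tables of our own choosing): building a joint from the "latent cause" factorization and re-expressing it with the Markov chain factorization gives back the same distribution.

import numpy as np

rng = np.random.default_rng(1)

def cpt(shape):
    """Random conditional table, normalized over the first axis."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# "Latent cause": p(x, y, z) = p(z) p(x|z) p(y|z), all variables binary.
pz = cpt((2,))           # p(z)
px_z = cpt((2, 2))       # p(x|z), indexed [x, z]
py_z = cpt((2, 2))       # p(y|z), indexed [y, z]
joint = pz[None, None, :] * px_z[:, None, :] * py_z[None, :, :]   # indexed [x, y, z]

# Markov chain rewriting: p(x) p(z|x) p(y|z) gives back the same joint.
px = joint.sum(axis=(1, 2))                      # p(x)
pz_x = joint.sum(axis=1) / px[:, None]           # p(z|x), indexed [x, z]
chain = px[:, None, None] * pz_x[:, None, :] * py_z[None, :, :]
assert np.allclose(joint, chain)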
Remark 4.2.5 In the "explaining away" graph, in general X ⊥⊥ Y |Z is not true in the sense
that there exist elements in L(G) such that this statement is violated.
Remark 4.2.6 For a graph G, (p ∈ L(G)) implies that p satisfies some list of (positive) conditional independence statements (CIS). The fact that p is in L(G) cannot guarantee that a given CIS does not hold. This should be obvious because the fully independent distribution (a product of marginals) belongs to all graphical models and satisfies all CIS...
Remark 4.2.7 It is important to note that not all lists of CIS correspond to a graph, in the sense that there are lists of CIS for which there is no graph such that L(G) is formed exactly of the distributions which satisfy only the conditional independences that are listed, or that are consequences of the ones listed. In particular, there is no graph G on three variables such that L(G) contains all distributions on (X, Y, Z) that satisfy X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z and does not contain distributions for which any of these statements is violated. (Remember that pairwise independence does not imply mutual independence: see Remark 4.1.2.)
Proof If p ∈ L(G), then p(x) = \prod_{i=1}^{n} p(x_i | x_{π_i(G)}). Since E ⊂ E', it is obvious that π_i(G) ⊂ π_i(G'), and we can define f_i(x_i, x_{π_i(G')}) := p(x_i | x_{π_i(G)}). Since p(x) = \prod_{i=1}^{n} f_i(x_i, x_{π_i(G')}) and the f_i meet the requirements of Definition 4.1, this proves that p ∈ L(G').
The converse of the previous proposition is not true. In particular, different graphs can define the same set of distributions. We first introduce some new definitions:
Definition 4.7 (Markov equivalence) We say that two graphs G and G0 are Markov equiv-
alent if L(G) = L(G0 ).
Proposition 4.8 (Basic edge reversal) If G = (V, E) is a DAG and if for (i, j) ∈ E, i has
no parents and the only parent of j is i, then the graph obtained by reversing the edge (i, j)
is Markov equivalent to G.
Proof First, note that by reversing such an edge no cycle can be created, because such a cycle would necessarily contain (j, i), and j has no parent other than i. Using Bayes' rule, p(x_i) p(x_j | x_i) = p(x_j) p(x_i | x_j), we convert the factorization w.r.t. G into a factorization w.r.t. the graph obtained by edge reversal.
Informally, the previous result can be reformulated as: an edge reversal that does not remove or create any v-structure leads to a graph which is Markov equivalent.
When applied to the three-node graphs considered earlier, this property proves that the Markov chain and the "latent cause" graph are equivalent. On the other hand, the fact that the "explaining away" graph has a v-structure is the reason why it is not equivalent to the others.
Proposition 4.10 (Covered edge reversal) Let G = (V, E) be a DAG and (i, j) ∈ E a covered edge (i.e. such that π_j = π_i ∪ {i}). Let G' = (V, E') with E' = (E \ {(i, j)}) ∪ {(j, i)}; then G' is necessarily also a DAG and L(G) = L(G').
Figure 4.6. Marginalizing the boxed node results in a family of distributions that cannot be exactly represented by a directed graphical model, and one can check that there is no unique smallest graph in which the obtained distributions factorize.
Definition 4.11 The set of non-descendants of i, denoted nd(i), is the set of nodes that are not descendants of i.
Lemma 4.12 For a graph G = (V, E) and a node i, there exists a topological order such that all elements of nd(i) appear before i.
Proof This is easily proved constructively: we build the topological order backwards, from the last position to the first. At each iteration we remove a leaf (of the remaining graph) and assign it to the last free position; specifically, if some of the current leaves are descendants of i, we remove one of those. If at some iteration there is no leaf that is a descendant of i, it means that all descendants of i have been removed from the graph: indeed, if some descendants of i were left in the graph, since all their descendants are descendants of i as well, there would exist a leaf which is a descendant of i. This procedure thus removes all strict descendants of i first, then i, and only then the elements of nd(i).
Proof First, we consider the ⇒ direction. Based on the previous lemma we can find a topological order such that nd(i) = {1, . . . , i − 1}. But we have proven in Proposition 4.4 that p(x_i | x_{π_i}) = p(x_i | x_{1:(i−1)}), which, given the order chosen, is also p(x_i | x_{1:(i−1)}) = p(x_i | x_{π_i}, x_{nd(i) \ π_i}); this proves what we wanted to show: X_i ⊥⊥ X_{nd(i) \ π_i} | X_{π_i}.
We now prove the ⇐ direction. Let 1 : n be a topological order; then {1, . . . , i − 1} ⊆ nd(i). (By contradiction, suppose j ∈ {1, . . . , i − 1} and j ∉ nd(i); then there exists a directed path from i to j. But in a topological order every edge goes from a smaller to a larger index, so every node on such a path has an index larger than i, contradicting j ≤ i − 1.)
By the chain rule, we always have p(x_V) = \prod_{i=1}^{n} p(x_i | x_{1:i−1}), but by the conditional independence assumptions p(x_i | x_{1:i−1}) = p(x_i | x_{π_i}), hence the result by substitution.
4.2.4 d-separation
Given a graph G and three subsets A, B and C, it would be useful to be able to answer the question: is X_A ⊥⊥ X_B | X_C true for all p ∈ L(G)? An answer is provided by the concept of d-separation, or directed separation.
We call a chain a path in the symmetrized graph, i.e. in the undirected graph obtained by ignoring the directionality of the edges.
Definition 4.14 (Chain) Let a, b ∈ V. A chain from a to b is a sequence of nodes (v_1, . . . , v_m) such that v_1 = a, v_m = b and, for all j, (v_j, v_{j+1}) ∈ E or (v_{j+1}, v_j) ∈ E.
Assume C is a set of nodes that is observed. We want to define a notion of a chain being "blocked" by this set C in order to answer the question above.
3. A and B are said to be d-separated by C if and only if all chains that go from a ∈ A
to b ∈ B are blocked.
Example 4.2.2
• Markov chain: Applying d-separation to the Markov chain retrieves the well-known result that the future is independent of the past given the present.
• Hidden Markov Model: We can apply it as well to the hidden Markov chain graph of Figure 4.9.
Figure 4.9. Hidden Markov model: states and observations.
To determine whether X ⊥⊥ Z | Y holds for every p ∈ L(G), one places balls on X, lets them travel along the graph according to the rules described below, and checks whether any reaches Z. X ⊥⊥ Z | Y is guaranteed if none reaches Z, but not otherwise.
The rules are as follows for the three canonical graph structures. Note that the balls are
allowed to travel in either direction along the edges of the graph.
1. Markov chain: Balls pass through when we do not observe Y , but are blocked oth-
erwise.
Figure 4.10. Markov chain rule: When Y is observed, balls are blocked (left). When Y is not observed,
balls pass through (right)
2. Two children: Balls pass through when we do not observe Y , but are blocked other-
wise.
Figure 4.11. Rule when X and Z are Y ’s children: When Y is observed, balls are blocked (left). When Y
is not observed, balls pass through (right)
3. v-structure: Balls pass through when we observe Y , but are blocked otherwise.
Figure 4.12. v-structure rule: When Y is not observed, balls are blocked (left). When Y is observed, balls
pass through (right)
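These three rules can be turned into a direct test of d-separation. The following brute-force Python sketch (our own code, only suitable for small graphs) enumerates all chains between the two sets and checks whether each one is blocked, using the standard convention that a v-structure node lets the chain pass when it or one of its descendants is observed:

def descendants(children, v):
    """All strict descendants of v; `children` maps a node to the set of its children."""
    out, stack = set(), [v]
    while stack:
        for c in children.get(stack.pop(), set()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_blocked(chain, edges, observed, children):
    """A chain is blocked if one of its intermediate nodes blocks it."""
    for a, v, b in zip(chain, chain[1:], chain[2:]):
        if (a, v) in edges and (b, v) in edges:
            # v-structure at v: blocks unless v or one of its descendants is observed.
            if v not in observed and not (descendants(children, v) & observed):
                return True
        elif v in observed:
            # head-to-tail or tail-to-tail node that is observed blocks the chain.
            return True
    return False

def d_separated(edges, A, B, observed):
    """True iff every chain from A to B in the symmetrized graph is blocked."""
    children, neigh = {}, {}
    for (u, v) in edges:
        children.setdefault(u, set()).add(v)
        neigh.setdefault(u, set()).add(v)
        neigh.setdefault(v, set()).add(u)

    def chains(a, b, seen):
        # enumerate all simple chains from a to b in the symmetrized graph
        if a == b:
            yield [a]
            return
        for w in neigh.get(a, set()):
            if w not in seen:
                for rest in chains(w, b, seen | {w}):
                    yield [a] + rest

    return all(is_blocked(ch, edges, observed, children)
               for a in A for b in B for ch in chains(a, b, {a}))

# The "explaining away" graph X -> Z <- Y: X and Y are d-separated, but not given Z.
edges = {("X", "Z"), ("Y", "Z")}
print(d_separated(edges, {"X"}, {"Y"}, set()))    # True
print(d_separated(edges, {"X"}, {"Y"}, {"Z"}))    # False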
Unlike the factors of directed graphical models, the functions ψ_C are not probability distributions. They are called potentials.
Remark 4.3.1 Because of the normalization by Z in this expression, the functions ψ_C are defined only up to a multiplicative constant.
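A tiny numerical illustration of this remark (our own sketch, assuming the usual factorization p(x) ∝ \prod_C ψ_C(x_C) over the cliques of the graph): rescaling one potential changes Z but leaves the normalized distribution unchanged.

import numpy as np

# Pairwise potentials on the chain 1 - 2 - 3 over binary variables (arbitrary positive values).
psi_12 = np.array([[1.0, 2.0], [3.0, 4.0]])
psi_23 = np.array([[5.0, 1.0], [1.0, 5.0]])

def unnormalized(psi_12, psi_23):
    """q(x1, x2, x3) = psi_12(x1, x2) * psi_23(x2, x3) and its normalization constant Z."""
    q = psi_12[:, :, None] * psi_23[None, :, :]
    return q, q.sum()

q1, Z1 = unnormalized(psi_12, psi_23)
q2, Z2 = unnormalized(10.0 * psi_12, psi_23)   # rescale one potential
assert not np.isclose(Z1, Z2)                  # Z changes ...
assert np.allclose(q1 / Z1, q2 / Z2)           # ... but the distribution p = q / Z does not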
Complete graphs. We consider G = (V, E) with (i, j) ∈ E for all i, j ∈ V. For p ∈ L(G), we get

p(x) = \frac{1}{Z} ψ_V(x_V),

given that the set of cliques is reduced to the single set V. This places no constraints on the distribution of (X_1, . . . , X_n).
Note that, as in the directed case, E ⊆ E' ⇒ L(G) ⊆ L(G').
Definition 4.18 We say that p satisfies the Global Markov property w.r.t. G if and only
if for all A, B, S ⊂ V disjoint subsets: (A and B are separated by S) ⇒ (XA ⊥⊥ XB | XS ).
Proposition 4.19 If p ∈ L(G) then, p satisfies the Global Markov property w.r.t. G.
Proof We suppose without loss of generality that A, B, and S are disjoint sets such that A ∪ B ∪ S = V, as we could otherwise replace A and B by

A' = A ∪ {nodes of V \ S connected to A by a path avoiding S},   B' = V \ (S ∪ A').

A' and B' are separated by S and we have the disjoint union A' ∪ B' ∪ S = V. If we can show that X_{A'} ⊥⊥ X_{B'} | X_S, then by the decomposition property we also have that X_A ⊥⊥ X_B | X_S for any subsets A of A' and B of B', giving the required general case.
We consider a clique C of G. It is not possible to have both C ∩ A ≠ ∅ and C ∩ B ≠ ∅, as A and B are separated by S and C is a clique. Thus C ⊂ A ∪ S or C ⊂ B ∪ S (or both if C ⊂ S). Let D be the set of cliques C such that C ⊂ A ∪ S and D' the set of all other cliques (which are thus all subsets of B ∪ S). We have:

p(x) = \frac{1}{Z} \prod_{C ∈ D} ψ_C(x_C) \prod_{C ∈ D'} ψ_C(x_C) = \frac{1}{Z} f(x_{A∪S}) g(x_{B∪S}).

Thus:

p(x_A, x_S) = \frac{1}{Z} f(x_A, x_S) \sum_{x_B} g(x_B, x_S)  ⟹  p(x_A | x_S) = \frac{f(x_A, x_S)}{\sum_{x'_A} f(x'_A, x_S)}.

Similarly p(x_B | x_S) = g(x_B, x_S) / \sum_{x'_B} g(x'_B, x_S), so that

p(x_A | x_S) p(x_B | x_S) = \frac{\frac{1}{Z} f(x_A, x_S) g(x_B, x_S)}{\frac{1}{Z} \sum_{x'_A} f(x'_A, x_S) \sum_{x'_B} g(x'_B, x_S)} = \frac{p(x_A, x_B, x_S)}{p(x_S)} = p(x_A, x_B | x_S),

i.e. X_A ⊥⊥ X_B | X_S.
Theorem 4.20 (Hammersley–Clifford) If ∀x, p(x) > 0, then p ∈ L(G) ⇐⇒ p satisfies the global Markov property w.r.t. G.
4.3.4 Marginalization
As for directed graphical models, we also have a marginalization notion in undirected graphs, but it is slightly different. If p(x) factorizes in G, then p(x_1, . . . , x_{n−1}) factorizes in the graph where the node n is removed and all its neighbours are connected to each other.
Proposition 4.21 Let G = (V, E) be an undirected graph. Let G' = (V', E') be the graph where n is removed and its neighbours are connected, i.e. V' = V \ {n}, and E' is obtained from the set E by first connecting together all the neighbours of n and then removing n and the edges incident to it. If p ∈ L(G), then p(x_1, . . . , x_{n−1}) ∈ L(G'). Hence undirected graphical models are closed under marginalization, as the construction above is valid for any vertex.
We now introduce the notion of Markov blanket.
Definition 4.22 For i ∈ V, the Markov blanket of i in a graph G is the smallest set of nodes S such that X_i is independent of the rest of the variables given X_S.
Remark 4.3.4 The Markov blanket of i ∈ V in an undirected graph is the set of its neighbours. For a directed graph, it is the union of the parents, the children, and the other parents of the children of i.
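A short helper illustrating Remark 4.3.4 (our own sketch; graphs are given as plain edge lists):

def markov_blanket_undirected(edges, i):
    """Neighbours of i; `edges` is a set of unordered pairs (u, v)."""
    return {v for (u, v) in edges if u == i} | {u for (u, v) in edges if v == i}

def markov_blanket_directed(edges, i):
    """Parents, children and co-parents of i; `edges` is a set of directed pairs (parent, child)."""
    parents = {u for (u, v) in edges if v == i}
    children = {v for (u, v) in edges if u == i}
    coparents = {u for (u, v) in edges if v in children} - {i}
    return parents | children | coparents

# In the "explaining away" graph X -> Z <- Y, the blanket of X is {Y, Z}: child Z and co-parent Y.
print(markov_blanket_directed({("X", "Z"), ("Y", "Z")}, "X"))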
Definition 4.23 Let G = (V, E) be a DAG. The symmetrized graph of G is G̃ = (V, Ẽ), with Ẽ = {(u, v), (v, u) : (u, v) ∈ E}, i.e. an edge going in the opposite direction is added for every edge in E.
Definition 4.24 Let G = (V, E) be a DAG. The moralized graph Ḡ of G is the symmetrized graph G̃ to which we add edges so that, for all v ∈ V, π_v is a clique.
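A minimal sketch of these two constructions (our own code on edge lists):

import itertools

def symmetrize(edges):
    """Add the reverse of every directed edge (Definition 4.23)."""
    return {(u, v) for (a, b) in edges for (u, v) in [(a, b), (b, a)]}

def moralize(edges):
    """Symmetrize and, for every node, connect its parents pairwise (Definition 4.24)."""
    moral = symmetrize(edges)
    nodes = {x for e in edges for x in e}
    for v in nodes:
        parents = [u for (u, w) in edges if w == v]
        for a, b in itertools.combinations(parents, 2):
            moral |= {(a, b), (b, a)}
    return moral

# Example: moralizing the v-structure X -> Z <- Y links the two parents X and Y.
print(("X", "Y") in moralize({("X", "Z"), ("Y", "Z")}))   # True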