Lecture 4 — October 18th, Fall 2017
In this lecture, we will assume that all random variables are discrete, to keep notations as simple as possible. All the theory presented generalizes immediately to continuous random variables that have a density, by replacing sums with integrals. We consider a joint distribution p(x) = p(x_1, . . . , x_n), where x stands for (x_1, . . . , x_n). Given A ⊂ {1, . . . , n}, we denote the marginal distribution of x_A by

p(x_A) = \sum_{x_{A^c}} p(x_A, x_{A^c}),

and, whenever p(x_{A^c}) > 0, the conditional distribution of x_A given x_{A^c} by

p(x_A | x_{A^c}) = \frac{p(x_A, x_{A^c})}{p(x_{A^c})}.
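As a quick numerical illustration of these two formulas (a minimal NumPy sketch; the three binary variables and the choice A = {1, 2}, A^c = {3} are ours):

import numpy as np

rng = np.random.default_rng(0)

# A random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

# Marginal of x_A with A = {1, 2}: sum out the complement A^c = {3}.
p_A = p.sum(axis=2)
assert np.allclose(p_A.sum(), 1.0)

# Conditional p(x_A | x_{A^c}): divide the joint by the marginal of x_3.
p_Ac = p.sum(axis=(0, 1))                 # p(x3)
p_A_given_Ac = p / p_Ac[None, None, :]
assert np.allclose(p_A_given_Ac.sum(axis=(0, 1)), 1.0)  # normalized over x_A for every x3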
and that they are mutually independent conditionally on X_C (or given X_C) if and only if

p(x_{A_1}, . . . , x_{A_k} | x_C) = \prod_{i=1}^{k} p(x_{A_i} | x_C)   ∀ x_{A_1}, . . . , x_{A_k}, x_C s.t. p(x_C) > 0.
Remark 4.1.1 Note that the conditional probability p(xA , xB |xC ) is the probability distri-
bution over (XA , XB ) if XC is known to be equal to xC . In practice, it means that if the
value of XC is observed (e.g. via a measurement) then the distribution over (XA , XB ) is
p(xA , xB |xC ). The conditional independence statement XA ⊥⊥ XB | XC should therefore be
interpreted as "when the value of X_C is observed (or given), X_A and X_B are independent".
Remark 4.1.2 (Pairwise independence vs mutual independence) Consider a collection of r.v. (X_1, . . . , X_n). We say that these variables are pairwise independent if for all 1 ≤ i < j ≤ n, X_i ⊥⊥ X_j. Note that this is different from assuming that X_1, . . . , X_n are mutually (or jointly, or globally) independent. A standard counter-example is as follows: given two variables X, Y that are independent coin flips, define Z via the XOR function ⊕ as Z = X ⊕ Y. Then the three random variables X, Y, Z are pairwise independent, but not mutually independent. (Prove this as an exercise.) The notations presented for pairwise independence can be generalized to collections of variables that are mutually independent.
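A quick way to convince oneself of this counter-example is to enumerate the four equally likely outcomes; the following minimal Python check (our own, with arbitrary variable names) verifies pairwise independence and exhibits the failure of mutual independence:

import itertools

# The four equally likely outcomes (x, y, z) with z = x xor y.
outcomes = [(x, y, x ^ y) for x, y in itertools.product([0, 1], repeat=2)]
p = 1.0 / len(outcomes)

def marginal(indices):
    """Joint distribution of the coordinates listed in `indices`."""
    table = {}
    for o in outcomes:
        key = tuple(o[i] for i in indices)
        table[key] = table.get(key, 0.0) + p
    return table

# Pairwise independence: p(a, b) = p(a) p(b) for every pair of variables and values.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij, pi, pj = marginal([i, j]), marginal([i]), marginal([j])
    assert all(abs(pij[(a, b)] - pi[(a,)] * pj[(b,)]) < 1e-12 for (a, b) in pij)

# Mutual independence fails: p(1, 1, 1) = 0 while p(x=1) p(y=1) p(z=1) = 1/8.
pxyz, px, py, pz = marginal([0, 1, 2]), marginal([0]), marginal([1]), marginal([2])
print(pxyz.get((1, 1, 1), 0.0), px[(1,)] * py[(1,)] * pz[(1,)])  # 0.0 vs 0.125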
Figure 4.1. Nodes representing binary variables indicating the presence or not of a disease or a symptom.
We have n nodes, each a binary variable (X_i ∈ {0, 1}), indicating the presence or absence of a disease or a symptom. The size of the joint probability table grows exponentially with n: for 100 diseases and symptoms, we would need a table with 2^{100} entries to store all the possible states. This is clearly intractable. Instead, we will use graphical models to represent the relationships between nodes.
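To make the gain concrete, here is a back-of-the-envelope comparison (a sketch with numbers we pick for illustration: 100 binary variables, each with at most 5 parents in the graph):

# Number of table entries needed for the naive joint vs. a factorized representation
# in which each node stores one table p(x_i | x_parents) of size 2^(k+1).
n, k = 100, 5
naive = 2 ** n
factorized = n * 2 ** (k + 1)
print(naive)        # 1267650600228229401496703205376
print(factorized)   # 6400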
A directed graphical model, also historically called a "Bayesian network" when the variables are discrete, represents a family of distributions, denoted L(G), where

L(G) := {p : ∃ legal factors f_i s.t. p(x_V) = \prod_{i=1}^{n} f_i(x_i, x_{π_i})},

where the legal factors satisfy f_i ≥ 0 and \sum_{x_i} f_i(x_i, x_{π_i}) = 1 for all i and all x_{π_i}.
Definition 4.1 Let G = (V, E) be a DAG with V = {1, . . . , n}. We say that p(x) factorizes in G, denoted p(x) ∈ L(G), if there exist functions f_i, called factors, such that:

∀x, p(x) = \prod_{i=1}^{n} f_i(x_i, x_{π_i}),  with  f_i ≥ 0  and  ∀i, ∀x_{π_i}, \sum_{x_i} f_i(x_i, x_{π_i}) = 1,     (4.6)

where we recall that π_i stands for the set of parents of the vertex i in G.
We prove the following useful and fundamental property of directed graphical models: if a probability distribution factorizes according to a directed graph G = (V, E), the distribution obtained by marginalizing a leaf¹ i factorizes according to the graph induced on V \ {i}.
Proposition 4.2 (Leaf marginalization) Suppose that p factorizes in G, i.e. p(x_V) = \prod_{j=1}^{n} f_j(x_j, x_{π_j}). Then for any leaf i, we have p(x_{V \ {i}}) = \prod_{j ≠ i} f_j(x_j, x_{π_j}), hence p(x_{V \ {i}}) factorizes in G' = (V \ {i}, E'), the induced graph on V \ {i}.
Proof Without loss of generality, we can assume that the leaf is indexed by n. Since it is a leaf, we clearly have that n ∉ π_i for all i ≤ n − 1. We have the following computation:

p(x_1, . . . , x_{n−1}) = \sum_{x_n} p(x_1, . . . , x_n)
                        = \sum_{x_n} \left( \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}) \right) f_n(x_n, x_{π_n})
                        = \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}) \sum_{x_n} f_n(x_n, x_{π_n})
                        = \prod_{i=1}^{n−1} f_i(x_i, x_{π_i}),

where the last equality uses the normalization \sum_{x_n} f_n(x_n, x_{π_n}) = 1.
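As a numerical sanity check of this proposition (a small NumPy sketch; the chain 1 → 2 → 3 and the random tables are our own choice), marginalizing the leaf X_3 of a three-node chain leaves exactly the product of the two remaining factors:

import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_states, n_parent_states):
    """Random conditional table f(x | x_parent), normalized over x."""
    t = rng.random((n_parent_states, n_states))
    return t / t.sum(axis=1, keepdims=True)

# Chain graph 1 -> 2 -> 3 on binary variables.
f1 = random_cpt(2, 1)[0]      # f1(x1), shape (2,)
f2 = random_cpt(2, 2)         # f2(x2 | x1), shape (2, 2)
f3 = random_cpt(2, 2)         # f3(x3 | x2), shape (2, 2)

# Joint p(x1, x2, x3) = f1(x1) f2(x2 | x1) f3(x3 | x2).
joint = f1[:, None, None] * f2[:, :, None] * f3[None, :, :]

# Marginalizing the leaf x3 leaves f1(x1) f2(x2 | x1): a factorization in the induced chain 1 -> 2.
marg = joint.sum(axis=2)
assert np.allclose(marg, f1[:, None] * f2)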
Remark 4.2.1 Note that the new graph obtained by removing a leaf is still a DAG. Indeed, since we only removed edges and nodes, if there were a cycle in the induced graph, the same cycle would be present in the original graph, which is not possible since it is a DAG.
Remark 4.2.2 Also, by induction, this result shows that in the definition of factorization we do not need to assume that p is a probability distribution. Indeed, if any function p satisfies (4.6), then it is a probability distribution: it is non-negative as a product of non-negative factors, and it sums to 1 by applying the marginalization formula above repeatedly.
¹ We call here a leaf or terminal node of a DAG a node that has no descendants.
Now we try to characterize the factor functions. The following result implies that if p factorizes in G, then the factors are uniquely determined.
Proposition 4.4 If p(x) ∈ L(G) then, for all i ∈ {1, . . . , n}, fi (xi , xπi ) = p(xi |xπi ).
Proof Assume, without loss of generality, that the nodes are sorted in a topological order. Consider a node i. Since the nodes are in topological order, for any 1 ≤ j ≤ n we have π_j ⊂ {1, . . . , j − 1}; as a consequence we can apply Proposition 4.2 n − i times to obtain that p(x_1, . . . , x_i) = \prod_{j ≤ i} f_j(x_j, x_{π_j}). Since we also have p(x_1, . . . , x_{i−1}) = \prod_{j < i} f_j(x_j, x_{π_j}), taking the ratio, we have

p(x_i | x_1, . . . , x_{i−1}) = f_i(x_i, x_{π_i}).

Since π_i ⊂ {1, . . . , i − 1}, and since the right-hand side depends on the conditioning variables only through x_{π_i}, this entails that p(x_i | x_1, . . . , x_{i−1}) = p(x_i | x_{π_i}) = f_i(x_i, x_{π_i}).
Hence we can give a definition of factorization in a DAG equivalent to Definition 4.1:
Definition 4.5 (Equivalent definition) The probability distribution p(x) factorizes in G, denoted p(x) ∈ L(G), iff

∀x, p(x) = \prod_{i=1}^{n} p(x_i | x_{π_i}).     (4.7)
Example 4.2.1
• (Trivial graph) Assume E = ∅, i.e. there are no edges. We then have p(x) = \prod_{i=1}^{n} p(x_i), implying the random variables X_1, . . . , X_n are independent; that is, variables are mutually independent if they factorize in the empty graph.
• (Complete graph) Assume now we have a complete graph (thus with n(n − 1)/2 edges, as we need acyclicity for it to be a DAG); we have p(x) = \prod_{i=1}^{n} p(x_i | x_1, . . . , x_{i−1}), the so-called "chain rule", which is always true. Every probability distribution therefore factorizes in a complete graph. Note that there are n! possible complete graphs (one per ordering of the vertices), which are all equivalent in the sense that they define the same set of distributions.
• (Graphs with several connected components) If G has several connected components C_1, . . . , C_k, then p ∈ L(G) ⇒ p(x) = \prod_{j=1}^{k} p(x_{C_j}) (Exercise). As a consequence, each connected component can be treated separately. In the rest of the lecture, we will therefore focus on connected graphs.
Figure 4.4. The "explaining away" graph: X → Z ← Y.
• (Explaining away) Represented in Fig. 4.4, we can show for this type of graph:

p(x) ∈ L(G) ⇒ X ⊥⊥ Y.     (4.10)

It basically stems from:

p(x, y) = \sum_{z} p(x, y, z) = p(x) p(y) \sum_{z} p(z | x, y) = p(x) p(y).
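The XOR counter-example of Remark 4.1.2 is exactly a distribution of this type, and it makes the contrast with conditioning explicit (a tiny check, our own code): X and Y are independent, but they become dependent once Z is observed.

# X, Y independent fair coins, Z = X xor Y (so the graph is X -> Z <- Y).
outcomes = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]   # each has probability 1/4

# Conditionally on Z = 0, only (0, 0) and (1, 1) remain, each with probability 1/2.
p_xy_given_z0 = {(x, y): 0.5 for (x, y, z) in outcomes if z == 0}

# Independence given Z would require p(x, y | z=0) = p(x | z=0) p(y | z=0) = 1/4 everywhere.
print(p_xy_given_z0.get((0, 1), 0.0))   # 0.0, not 0.25: X and Y are dependent given Z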
Remark 4.2.3 The word "cause" should here be put between quotes and used very carefully: in the same way that correlation is not causation, conditional dependence is not causation either. This is however the historical name for this model. The reason why "cause" is a bad name, and why "latent factor" might be better, is that the factorization properties encoded by graphical models do not in general correspond to the existence of causal mechanisms, but only to conditional independence relations.
Remark 4.2.4 If p factorizes in the "latent cause" graph, then p(x, y, z) = p(z)p(x|z)p(y|z). But using Bayes' rule p(z) p(x|z) = p(x) p(z|x), and so we also have that p(x, y, z) = p(x)p(z|x)p(y|z), which shows that p is a Markov chain (i.e. factorizes in the Markov chain graph). This is an example of the basic edge reversal that we will discuss in the next section. Note that we proceeded by equivalence, which shows that the Markov chain graph, the "latent cause" graph and the reversed Markov chain graph are in fact equivalent, in the sense that distributions that factorize according to one factorize according to the others. This is what we will call Markov equivalence.
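This edge reversal can also be checked numerically (a small NumPy sketch with random tables of our own choosing): building a joint from the "latent cause" factorization and re-expressing it with the Markov chain factorization gives back the same distribution.

import numpy as np

rng = np.random.default_rng(1)

def cpt(shape):
    """Random conditional table, normalized over the first axis."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# "Latent cause": p(x, y, z) = p(z) p(x|z) p(y|z), all variables binary.
pz = cpt((2,))           # p(z)
px_z = cpt((2, 2))       # p(x|z), indexed [x, z]
py_z = cpt((2, 2))       # p(y|z), indexed [y, z]
joint = pz[None, None, :] * px_z[:, None, :] * py_z[None, :, :]   # indexed [x, y, z]

# Markov chain rewriting: p(x) p(z|x) p(y|z) gives back the same joint.
px = joint.sum(axis=(1, 2))                      # p(x)
pz_x = joint.sum(axis=1) / px[:, None]           # p(z|x), indexed [x, z]
chain = px[:, None, None] * pz_x[:, None, :] * py_z[None, :, :]
assert np.allclose(joint, chain)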
Remark 4.2.5 In the "explaining away" graph, in general X ⊥⊥ Y |Z is not true in the sense
that there exist elements in L(G) such that this statement is violated.
Remark 4.2.6 For a graph G, (p ∈ L(G)) implies that p satisfies some list of (positive) conditional independence statements (CIS). The fact that p is in L(G) cannot guarantee that a given CIS does not hold. This should be obvious because the fully independent distribution (a product of marginals) belongs to all graphical models and satisfies all CIS...
Remark 4.2.7 It is important to note that not all lists of CIS correspond to a graph, in the sense that there are lists of CIS for which there is no graph such that L(G) is formed exactly of the distributions which satisfy only the conditional independences that are listed, or that are consequences of the ones listed. In particular, there is no graph G on three variables such that L(G) contains all distributions on (X, Y, Z) that satisfy X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z and does not contain distributions for which any of these statements is violated. (Remember that pairwise independence does not imply mutual independence: see Remark 4.1.2.)
Proof If p ∈ L(G), then p(x) = \prod_{i=1}^{n} p(x_i | x_{π_i(G)}). Since E ⊂ E', it is obvious that π_i(G) ⊂ π_i(G'), and we can define f_i(x_i, x_{π_i(G')}) := p(x_i | x_{π_i(G)}). Since p(x) = \prod_{i=1}^{n} f_i(x_i, x_{π_i(G')}) and the f_i meet the requirements of Definition 4.1, this proves that p ∈ L(G').
The converse of the previous proposition is not true. In particular, different graphs can define the same set of distributions. We first introduce some new definitions:
Definition 4.7 (Markov equivalence) We say that two graphs G and G0 are Markov equiv-
alent if L(G) = L(G0 ).
Proposition 4.8 (Basic edge reversal) If G = (V, E) is a DAG and if for (i, j) ∈ E, i has
no parents and the only parent of j is i, then the graph obtained by reversing the edge (i, j)
is Markov equivalent to G.
Proof First, note that by reversing such an edge no cycle can be created, because such a cycle would necessarily contain (j, i), and j has no parent other than i. Using Bayes' rule, p(x_i) p(x_j | x_i) = p(x_j) p(x_i | x_j), we convert the factorization w.r.t. G into a factorization w.r.t. the graph obtained by edge reversal.
Informally, the previous result can be reformulated as: an edge reversal that does not remove or create any v-structure leads to a graph which is Markov equivalent.
When applied to the three-node graphs considered earlier, this property proves that the Markov chain and the "latent cause" graph are equivalent. On the other hand, the fact that the "explaining away" graph has a v-structure is the reason why it is not equivalent to the others.
Proposition 4.10 (Covered edge reversal) Let G = (V, E) be a DAG and (i, j) ∈ E a covered edge (i.e. such that π_j = π_i ∪ {i}). Let G' = (V, E') with E' = (E \ {(i, j)}) ∪ {(j, i)}; then G' is necessarily also a DAG and L(G) = L(G').
Figure 4.6. Marginalizing the boxed node results in a family of distributions that cannot be exactly represented by a directed graphical model, and one can check that there is no unique smallest graph in which the obtained distributions factorize.
Definition 4.11 The set of non-descendants of i, denoted nd(i), is the set of nodes that are not descendants of i.
Lemma 4.12 For a graph G = (V, E) and a node i, there exists a topological order such that all elements of nd(i) appear before i.
Proof This is easily proved constructively: we build the topological order backwards, from the last position to the first. At each iteration we remove a leaf (of the remaining graph) and assign it to the last free position; specifically, if some of the current leaves are descendants of i, we remove one of those. If at some iteration there is no leaf that is a descendant of i, it means that all descendants of i have been removed from the graph: indeed, if some descendants of i were left in the graph, since all their descendants are descendants of i as well, there would exist a leaf which is a descendant of i. This procedure thus removes all strict descendants of i first, then i, and only then the elements of nd(i).
Proof First, we consider the ⇒ direction. Based on the previous lemma we can find a topological order such that nd(i) = {1, . . . , i − 1}. But we have proven in Proposition 4.4 that p(x_i | x_{π_i}) = p(x_i | x_{1:(i−1)}), which, given the order chosen, is also p(x_i | x_{1:(i−1)}) = p(x_i | x_{π_i}, x_{nd(i) \ π_i}); this proves what we wanted to show: X_i ⊥⊥ X_{nd(i) \ π_i} | X_{π_i}.
We now prove the ⇐ direction. Let 1 : n be a topological order; then {1, . . . , i − 1} ⊆ nd(i). (By contradiction, suppose j ∈ {1, . . . , i − 1} and j ∉ nd(i); then there exists a directed path from i to j. But in a topological order every edge goes from a smaller to a larger index, so every node on such a path has an index larger than i, contradicting j ≤ i − 1.)
By the chain rule, we always have p(x_V) = \prod_{i=1}^{n} p(x_i | x_{1:i−1}), but by the conditional independence assumptions p(x_i | x_{1:i−1}) = p(x_i | x_{π_i}), hence the result by substitution.
4.2.4 d-separation
Given a graph G and three subsets A, B and C, it would be useful to be able to answer the question: is X_A ⊥⊥ X_B | X_C true for all p ∈ L(G)? An answer is provided by the concept of d-separation, or directed separation.
We call a chain a path in the symmetrized graph, i.e. in the undirected graph obtained by ignoring the directionality of the edges.
Definition 4.14 (Chain) Let a, b ∈ V. A chain from a to b is a sequence of nodes (v_1, . . . , v_m) such that v_1 = a, v_m = b and, for all j, (v_j, v_{j+1}) ∈ E or (v_{j+1}, v_j) ∈ E.
Assume C is a set of nodes that is observed. We want to define a notion of a chain being "blocked" by this set C in order to answer the question above.
3. A and B are said to be d-separated by C if and only if all chains that go from a ∈ A
to b ∈ B are blocked.
Example 4.2.2
• Markov chain: Applying d-separation to the Markov chain retrieves the well-known result that the future is independent of the past given the present.
• Hidden Markov Model: We can apply it as well to the hidden Markov chain graph of Figure 4.9.
Figure 4.9. Hidden Markov model: states and observations.
To determine whether X ⊥⊥ Z | Y holds for every p ∈ L(G), one places balls on X, lets them travel along the graph according to the rules described below, and checks whether any reaches Z. X ⊥⊥ Z | Y is guaranteed if none reaches Z, but not otherwise.
The rules are as follows for the three canonical graph structures. Note that the balls are
allowed to travel in either direction along the edges of the graph.
1. Markov chain: Balls pass through when we do not observe Y , but are blocked oth-
erwise.
Figure 4.10. Markov chain rule: When Y is observed, balls are blocked (left). When Y is not observed,
balls pass through (right)
2. Two children: Balls pass through when we do not observe Y , but are blocked other-
wise.
Figure 4.11. Rule when X and Z are Y ’s children: When Y is observed, balls are blocked (left). When Y
is not observed, balls pass through (right)
3. v-structure: Balls pass through when we observe Y , but are blocked otherwise.
Figure 4.12. v-structure rule: When Y is not observed, balls are blocked (left). When Y is observed, balls
pass through (right)
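These three rules can be turned into a direct test of d-separation. The following brute-force Python sketch (our own code, only suitable for small graphs) enumerates all chains between the two sets and checks whether each one is blocked, using the standard convention that a v-structure node lets the chain pass when it or one of its descendants is observed:

def descendants(children, v):
    """All strict descendants of v; `children` maps a node to the set of its children."""
    out, stack = set(), [v]
    while stack:
        for c in children.get(stack.pop(), set()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_blocked(chain, edges, observed, children):
    """A chain is blocked if one of its intermediate nodes blocks it."""
    for a, v, b in zip(chain, chain[1:], chain[2:]):
        if (a, v) in edges and (b, v) in edges:
            # v-structure at v: blocks unless v or one of its descendants is observed.
            if v not in observed and not (descendants(children, v) & observed):
                return True
        elif v in observed:
            # head-to-tail or tail-to-tail node that is observed blocks the chain.
            return True
    return False

def d_separated(edges, A, B, observed):
    """True iff every chain from A to B in the symmetrized graph is blocked."""
    children, neigh = {}, {}
    for (u, v) in edges:
        children.setdefault(u, set()).add(v)
        neigh.setdefault(u, set()).add(v)
        neigh.setdefault(v, set()).add(u)

    def chains(a, b, seen):
        # enumerate all simple chains from a to b in the symmetrized graph
        if a == b:
            yield [a]
            return
        for w in neigh.get(a, set()):
            if w not in seen:
                for rest in chains(w, b, seen | {w}):
                    yield [a] + rest

    return all(is_blocked(ch, edges, observed, children)
               for a in A for b in B for ch in chains(a, b, {a}))

# The "explaining away" graph X -> Z <- Y: X and Y are d-separated, but not given Z.
edges = {("X", "Z"), ("Y", "Z")}
print(d_separated(edges, {"X"}, {"Y"}, set()))    # True
print(d_separated(edges, {"X"}, {"Y"}, {"Z"}))    # False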
Unlike the factors of directed graphical models, the functions ψ_C are not probability distributions. They are called potentials.
Remark 4.3.1 Because of the normalization by Z in this expression, the functions ψ_C are defined only up to a multiplicative constant.
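A tiny numerical illustration of this remark (our own sketch, assuming the usual factorization p(x) ∝ \prod_C ψ_C(x_C) over the cliques of the graph): rescaling one potential changes Z but leaves the normalized distribution unchanged.

import numpy as np

# Pairwise potentials on the chain 1 - 2 - 3 over binary variables (arbitrary positive values).
psi_12 = np.array([[1.0, 2.0], [3.0, 4.0]])
psi_23 = np.array([[5.0, 1.0], [1.0, 5.0]])

def unnormalized(psi_12, psi_23):
    """q(x1, x2, x3) = psi_12(x1, x2) * psi_23(x2, x3) and its normalization constant Z."""
    q = psi_12[:, :, None] * psi_23[None, :, :]
    return q, q.sum()

q1, Z1 = unnormalized(psi_12, psi_23)
q2, Z2 = unnormalized(10.0 * psi_12, psi_23)   # rescale one potential
assert not np.isclose(Z1, Z2)                  # Z changes ...
assert np.allclose(q1 / Z1, q2 / Z2)           # ... but the distribution p = q / Z does not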
Complete graphs. We consider G = (V, E) with (i, j) ∈ E for all i, j ∈ V. For p ∈ L(G), we get

p(x) = \frac{1}{Z} ψ_V(x_V),

given that the set of cliques is reduced to the single set V. This places no constraints on the distribution of (X_1, . . . , X_n).
Note that, as in the directed case, E ⊆ E' ⇒ L(G) ⊆ L(G').
Definition 4.18 We say that p satisfies the Global Markov property w.r.t. G if and only
if for all A, B, S ⊂ V disjoint subsets: (A and B are separated by S) ⇒ (XA ⊥⊥ XB | XS ).
Proposition 4.19 If p ∈ L(G) then, p satisfies the Global Markov property w.r.t. G.
Proof We suppose without loss of generality that A, B, and S are disjoint sets such that A ∪ B ∪ S = V, as we could otherwise replace A and B by

A' = A ∪ {nodes of V \ S connected to A by a path avoiding S},   B' = V \ (S ∪ A').

A' and B' are separated by S and we have the disjoint union A' ∪ B' ∪ S = V. If we can show that X_{A'} ⊥⊥ X_{B'} | X_S, then by the decomposition property we also have that X_A ⊥⊥ X_B | X_S for any subsets A of A' and B of B', giving the required general case.
We consider a clique C of G. It is not possible to have both C ∩ A ≠ ∅ and C ∩ B ≠ ∅, as A and B are separated by S and C is a clique. Thus C ⊂ A ∪ S or C ⊂ B ∪ S (or both if C ⊂ S). Let D be the set of cliques C such that C ⊂ A ∪ S and D' the set of all other cliques (which are thus all subsets of B ∪ S). We have:

p(x) = \frac{1}{Z} \prod_{C ∈ D} ψ_C(x_C) \prod_{C ∈ D'} ψ_C(x_C) = \frac{1}{Z} f(x_{A∪S}) g(x_{B∪S}).

Thus:

p(x_A, x_S) = \frac{1}{Z} f(x_A, x_S) \sum_{x_B} g(x_B, x_S)  ⟹  p(x_A | x_S) = \frac{f(x_A, x_S)}{\sum_{x'_A} f(x'_A, x_S)}.

Similarly p(x_B | x_S) = g(x_B, x_S) / \sum_{x'_B} g(x'_B, x_S), so that

p(x_A | x_S) p(x_B | x_S) = \frac{\frac{1}{Z} f(x_A, x_S) g(x_B, x_S)}{\frac{1}{Z} \sum_{x'_A} f(x'_A, x_S) \sum_{x'_B} g(x'_B, x_S)} = \frac{p(x_A, x_B, x_S)}{p(x_S)} = p(x_A, x_B | x_S),

i.e. X_A ⊥⊥ X_B | X_S.
Theorem 4.20 (Hammersley–Clifford) If ∀x, p(x) > 0, then p ∈ L(G) ⇐⇒ p satisfies the global Markov property w.r.t. G.
4.3.4 Marginalization
As for directed graphical models, we also have a marginalization notion in undirected graphs, but it is slightly different. If p(x) factorizes in G, then p(x_1, . . . , x_{n−1}) factorizes in the graph where the node n is removed and all its neighbours are connected to each other.
Proposition 4.21 Let G = (V, E) be an undirected graph. Let G' = (V', E') be the graph where n is removed and its neighbours are connected, i.e. V' = V \ {n}, and E' is obtained from the set E by first connecting together all the neighbours of n and then removing n and the edges incident to it. If p ∈ L(G), then p(x_1, . . . , x_{n−1}) ∈ L(G'). Hence undirected graphical models are closed under marginalization, as the construction above is valid for any vertex.
We now introduce the notion of Markov blanket.
Definition 4.22 For i ∈ V, the Markov blanket of i in a graph G is the smallest set of nodes S such that X_i is independent of the rest of the variables given X_S.
Remark 4.3.4 The Markov blanket of i ∈ V in an undirected graph is the set of its neighbours. For a directed graph, it is the union of the parents, the children, and the other parents of the children of i.
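A short helper illustrating Remark 4.3.4 (our own sketch; graphs are given as plain edge lists):

def markov_blanket_undirected(edges, i):
    """Neighbours of i; `edges` is a set of unordered pairs (u, v)."""
    return {v for (u, v) in edges if u == i} | {u for (u, v) in edges if v == i}

def markov_blanket_directed(edges, i):
    """Parents, children and co-parents of i; `edges` is a set of directed pairs (parent, child)."""
    parents = {u for (u, v) in edges if v == i}
    children = {v for (u, v) in edges if u == i}
    coparents = {u for (u, v) in edges if v in children} - {i}
    return parents | children | coparents

# In the "explaining away" graph X -> Z <- Y, the blanket of X is {Y, Z}: child Z and co-parent Y.
print(markov_blanket_directed({("X", "Z"), ("Y", "Z")}, "X"))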
Definition 4.23 Let G = (V, E) be a DAG. The symmetrized graph of G is G̃ = (V, Ẽ), with Ẽ = {(u, v), (v, u) : (u, v) ∈ E}, i.e. an edge going in the opposite direction is added for every edge in E.
Definition 4.24 Let G = (V, E) be a DAG. The moralized graph Ḡ of G is the symmetrized graph G̃ to which we add edges so that, for all v ∈ V, π_v is a clique.
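A minimal sketch of these two constructions (our own code on edge lists):

import itertools

def symmetrize(edges):
    """Add the reverse of every directed edge (Definition 4.23)."""
    return {(u, v) for (a, b) in edges for (u, v) in [(a, b), (b, a)]}

def moralize(edges):
    """Symmetrize and, for every node, connect its parents pairwise (Definition 4.24)."""
    moral = symmetrize(edges)
    nodes = {x for e in edges for x in e}
    for v in nodes:
        parents = [u for (u, w) in edges if w == v]
        for a, b in itertools.combinations(parents, 2):
            moral |= {(a, b), (b, a)}
    return moral

# Example: moralizing the v-structure X -> Z <- Y links the two parents X and Y.
print(("X", "Y") in moralize({("X", "Z"), ("Y", "Z")}))   # True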