Graphical Models
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Edges: direct interaction
I Directed edges: Bayesian networks
I Undirected edges: Markov Random Fields
(Figure: example network over Age, Location, Degree, Experience, Income)
Probability factorizes as a product of potentials:
Pr(x = x1, . . . , xn) ∝ ∏_S ψ_S(x_S)
Probability distribution:
Pr(x1, . . . , xn) = ∏_{i=1}^{n} Pr(xi | pa(xi))
ψ1(L) = Pr(L):
    NY     CA     London   Other
    0.2    0.3    0.1      0.4

ψ2(E, A) = Pr(E | A):
    A \ E     0–10    10–15    >15
    20–30     0.9     0.1      0
    30–45     0.4     0.5      0.1
    >45       0.1     0.1      0.8

ψ3(A) = Pr(A)
Probability distribution
Pr(x = L, D, I, A, E) = Pr(L) Pr(D) Pr(A) Pr(E | A) Pr(I | D, E)
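To make the factorization concrete, here is a minimal Python sketch of evaluating the joint from the tables (Pr(L) and Pr(E | A) reuse the tables above; the remaining CPTs are hypothetical placeholders):

# Pr(L, D, A, E, I) = Pr(L) Pr(D) Pr(A) Pr(E|A) Pr(I|D,E) in the example network.
P_L = {"NY": 0.2, "CA": 0.3, "London": 0.1, "Other": 0.4}      # from the slide
P_D = {"BS": 0.6, "MS": 0.4}                                   # hypothetical
P_A = {"20-30": 0.5, "30-45": 0.3, ">45": 0.2}                 # hypothetical
P_E_given_A = {                                                # Pr(E | A), from the slide
    "20-30": {"0-10": 0.9, "10-15": 0.1, ">15": 0.0},
    "30-45": {"0-10": 0.4, "10-15": 0.5, ">15": 0.1},
    ">45":   {"0-10": 0.1, "10-15": 0.1, ">15": 0.8},
}
P_I_given_DE = {                                               # Pr(I | D, E), hypothetical
    ("BS", "0-10"): {"low": 0.7, "high": 0.3}, ("BS", "10-15"): {"low": 0.5, "high": 0.5},
    ("BS", ">15"): {"low": 0.3, "high": 0.7},  ("MS", "0-10"): {"low": 0.5, "high": 0.5},
    ("MS", "10-15"): {"low": 0.3, "high": 0.7}, ("MS", ">15"): {"low": 0.2, "high": 0.8},
}

def joint(L, D, A, E, I):
    """Multiply exactly the factors prescribed by the DAG."""
    return P_L[L] * P_D[D] * P_A[A] * P_E_given_A[A][E] * P_I_given_DE[(D, E)][I]

print(joint("NY", "MS", "30-45", "10-15", "high"))   # 0.2 * 0.4 * 0.3 * 0.5 * 0.7 = 0.0084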
Conditional independence X ⊥⊥ Y | Z: Pr(X | Y, Z) = Pr(X | Z)
Local CIs of a BN: xi ⊥⊥ ND(xi) | Pa(xi), i.e., each variable is independent of its non-descendants given its parents.
For the example network:
L ⊥⊥ E, D, A, I
A ⊥⊥ L, D
E ⊥⊥ L, D | A
I ⊥⊥ A | E, D
CIs and Factorization
Theorem
Given a distribution P(x1 , . . . , xn ) and a DAG G , if P satisfies
Local-CI induced by G , then P can be factorized as per the graph.
Local-CI(P, G ) =⇒ Factorize(P, G )
Proof.
x1, x2, . . . , xn topologically ordered (parents before children) in G.
Local CI(P, G ): P(xi |x1 , . . . , xi−1 ) = P(xi |PaG (xi ))
Chain rule: P(x1, . . . , xn) = ∏_i P(xi | x1, . . . , xi−1) = ∏_i P(xi | PaG(xi))
=⇒ Factorize(P, G )
Proof.
By construction: a subset of the non-descendants of each xi is available when the parents of xi are chosen minimally.
Proof.
The construction process ensures that the factorization property holds. Since factorization implies the local CIs, the constructed BN satisfies the local CIs of P.
Theorem
The d-separation test identifies the complete set of conditional
independencies that hold in all distributions that conform to a given
Bayesian network.
Global CIs Examples
All 5 local CIs of the graph, e.g. x1 ⊥⊥ {x3, x4, x5} | x2, etc., hold.
However, the global CI x2 ⊥⊥ x4 | x3 does not hold.
All 2 pairwise CIs of the graph, e.g. x1 ⊥⊥ x3 | x2 and x2 ⊥⊥ x3 | x1, hold.
However, the local CI x1 ⊥⊥ x3 does not hold.
Proof.
Theorem 4.8 of KF book (partially)
Pr(y1, . . . , yn | x, θ) = (∏_c ψ_c(y_c, x, θ)) / Z_θ(x) = (1 / Z_θ(x)) exp( ∑_c F_θ(y_c, c, x) )
Potentials
I Directed: conditional probabilities, more intuitive
I Undirected: arbitrary scores, easy to set.
Dependence structure
I Directed: Complicated d-separation test
I Undirected: Graph separation: A ⊥⊥ B | C iff C separates A and B in G (a simple reachability test; see the sketch after this list).
Often the application makes the choice clear.
I Directed: Causality
I Undirected: Symmetric interactions.
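A minimal sketch of the graph-separation test above (plain BFS from A that never enters C; the adjacency list and node names are only an illustration):

from collections import deque

def separated(adj, A, B, C):
    """Return True iff every path from A to B passes through C (graph separation)."""
    seen = set(A) - set(C)
    frontier = deque(seen)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v in C or v in seen:
                continue
            if v in B:
                return False          # reached B without crossing C
            seen.add(v)
            frontier.append(v)
    return True

# Chain x1 - x2 - x3: {x2} separates x1 from x3, the empty set does not.
adj = {"x1": ["x2"], "x2": ["x1", "x3"], "x3": ["x2"]}
print(separated(adj, {"x1"}, {"x3"}, {"x2"}), separated(adj, {"x1"}, {"x3"}, set()))   # True False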
Equivalent BNs
Two BN DAGs are said to be equivalent if they express the same set
of CIs. (Examples)
Theorem
Two BNs G1 , G2 are equivalent iff they have the same skeleton and
the same set of immoralities. (An immorality is a structure of the
form x → y ← z with no edge between x and z)
Theorem
An MRF can be converted perfectly into a BN iff it is chordal.
Proof.
Theorems 4.11 and 4.13 of KF book
Algorithm for constructing perfect BNs from chordal MRFs to be
discussed later.
BN and Chordality
A BN with a minimal undirected cycle of length ≥ 4 must have an
immorality. A BN without any immorality is always chordal.
I Graph: a chain over y1, . . . , yn
I Potentials: ψi(yi, yi+1)
I Pr(y1, . . . , yn) = Pr(y1) ∏_i ψi(yi, yi+1)
Find Pr(yi) for any i, say Pr(y5 = 1)
I Exact method: Pr(y5 = 1) = ∑_{y1,...,y4} Pr(y1, . . . , y4, 1), which requires an exponential number of summations.
I A more efficient alternative...
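To sketch what the more efficient alternative looks like (variable elimination along the chain; binary states and made-up potential values assumed):

import numpy as np

# Chain y1 - y2 - ... - yn with Pr(y) proportional to Pr(y1) * prod_i psi_i(y_i, y_{i+1}).
n = 5
psi = [np.array([[1.0, 0.5], [0.5, 2.0]]) for _ in range(n - 1)]   # made-up pairwise potentials
p_y1 = np.array([0.5, 0.5])

def marginal(target):
    """Pr(y_target) by summing variables out one at a time (target is 1-indexed)."""
    msg = p_y1.copy()                          # message over y1
    for i in range(target - 1):                # sum out y1, ..., y_{target-1}
        msg = msg @ psi[i]                     # msg[b] = sum_a msg[a] * psi_i[a, b]
    tail = np.ones(2)                          # message from the right end of the chain
    for i in range(n - 2, target - 2, -1):     # sum out yn, ..., y_{target+1}
        tail = psi[i] @ tail
    unnorm = msg * tail
    return unnorm / unnorm.sum()

print(marginal(5))   # Pr(y5) with O(n m^2) work instead of O(m^n) summations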
I Graph: (Figure: HMM chain y1, . . . , y7 with observations x1, . . . , x7)
Find argmax_y Pr(y | x = o), where y = y1 . . . yn
Define ψi (yi−1 , yi ) = Pr(yi |yi−1 ) Pr(xi = oi |yi )
Reduced graph: only a single chain of y nodes, y1 − y2 − · · · − y7.
Given n observations: o1 , . . . , on
Given potentials Pr(yt | yt−1) = P(y | y′) (table with m² values), Pr(xt | yt) = P(x | y) (table with m·k values), and start probabilities Pr(y1) = P(y) (table with m values).
Find max_y Pr(y | x = o)
B_n[y] = 1   for y ∈ [1, . . . , m]
for t = n . . . 2 do
    ψ(y, y′) = P(y | y′) P(xt = ot | y)
    B_{t−1}[y′] = max_{y=1..m} ψ(y, y′) B_t[y]
end for
Return max_y B_1[y] P(y) P(x1 = o1 | y)
Time taken: O(n m²)
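A direct Python transcription of this recursion (a sketch: states are 0, . . . , m−1 and the tables are NumPy arrays indexed as in the docstring):

import numpy as np

def map_value(P_trans, P_emit, P_start, obs):
    """Backward max-product as in the pseudocode above.

    P_trans[y_prev, y] = Pr(y_t = y | y_{t-1} = y_prev)
    P_emit[y, x]       = Pr(x_t = x | y_t = y)
    P_start[y]         = Pr(y_1 = y)
    obs                = observed sequence o_1, ..., o_n
    Returns max_y Pr(y, x = obs), the (unnormalized) score of the MAP labelling.
    """
    n, m = len(obs), len(P_start)
    B = np.ones(m)                                # B_n[y] = 1
    for t in range(n - 1, 0, -1):                 # t = n, ..., 2
        psi = P_trans * P_emit[:, obs[t]]         # psi[y', y] = P(y | y') P(x_t = o_t | y)
        B = (psi * B).max(axis=1)                 # B_{t-1}[y'] = max_y psi[y', y] B_t[y]
    return (B * P_start * P_emit[:, obs[0]]).max()

# With the tables of the numerical example on the next slide:
# map_value(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.7, 0.3], [0.6, 0.4]]),
#           np.array([0.5, 0.5]), [0, 0, 0])  ->  0.138915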
Numerical Example
P(y | y′):
    y′     P(y = 0 | y′)    P(y = 1 | y′)
    0      0.9              0.1
    1      0.2              0.8

P(x | y):
    y      P(x = 0 | y)     P(x = 1 | y)
    0      0.7              0.3
    1      0.6              0.4
P(y = 1) = 0.5
Observation [x1, x2, x3] = [0, 0, 0]
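Working the recursion above through these tables: B3 = [1, 1]; B2[0] = max(0.9·0.7, 0.1·0.6) = 0.63 and B2[1] = max(0.2·0.7, 0.8·0.6) = 0.48; then B1[0] = max(0.9·0.7·0.63, 0.1·0.6·0.48) = 0.3969 and B1[1] = max(0.2·0.7·0.63, 0.8·0.6·0.48) = 0.2304; finally max(0.5·0.7·0.3969, 0.5·0.6·0.2304) = 0.138915, attained by y = (0, 0, 0).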
Proof.
In supplementary. (not in syllabus)
Proof: https://people.eecs.berkeley.edu/~jordan/courses/
281A-fall04/lectures/lec-11-16.pdf
Input: Cliques: C1 , . . . Ck
Form a complete weighted graph H with cliques as nodes and edge
weights = size of the intersection of the two cliques it connects.
T = maximum weight spanning tree of H
Return T as the junction tree.
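A minimal Python sketch of this construction (cliques as frozensets, Kruskal's algorithm run on edges sorted by decreasing weight; the example cliques at the end are only illustrative):

from itertools import combinations

def junction_tree(cliques):
    """Maximum-weight spanning tree over the cliques, edge weight = |intersection|."""
    edges = sorted(((len(a & b), a, b) for a, b in combinations(cliques, 2)),
                   key=lambda e: e[0], reverse=True)
    parent = {c: c for c in cliques}              # union-find forest

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    tree = []
    for w, a, b in edges:                          # greedily add heaviest non-cycle edges
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((set(a), set(b), w))
    return tree

cliques = [frozenset(c) for c in [{"x2", "x3", "x4"}, {"x3", "x4", "x5"}, {"x1", "x2"}]]
print(junction_tree(cliques))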
(Figure: example graphs, their cliques, and the resulting junction trees)
Algorithm
Start with some initial assignment, say x^1 = [x1, . . . , xn] = [0, . . . , 0]
For several iterations
    I For each variable xi
        Get a new sample x^{t+1} by replacing the value of xi with a new value sampled according to the probability Pr(xi | x1^t, . . . , x_{i−1}^t, x_{i+1}^t, . . . , xn^t)
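A minimal Gibbs-sampling sketch for a small binary pairwise MRF (the edge potentials and the Pr(x1 = 1) query are hypothetical; burn-in and thinning are omitted for brevity):

import random

# Binary pairwise MRF: Pr(x) proportional to prod over edges (i,j) of psi[(i,j)][x_i][x_j].
n = 4
edges = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
         (1, 2): [[2.0, 1.0], [1.0, 2.0]],
         (2, 3): [[2.0, 1.0], [1.0, 2.0]]}

def gibbs(num_iters=2000, seed=0):
    rng = random.Random(seed)
    x = [0] * n                                    # initial assignment [0, ..., 0]
    samples = []
    for _ in range(num_iters):
        for i in range(n):                         # resample each variable in turn
            score = [1.0, 1.0]
            for (a, b), psi in edges.items():      # only factors touching x_i matter
                for v in (0, 1):
                    if a == i:
                        score[v] *= psi[v][x[b]]
                    elif b == i:
                        score[v] *= psi[x[a]][v]
            x[i] = 1 if rng.random() < score[1] / (score[0] + score[1]) else 0
        samples.append(tuple(x))
    return samples

samples = gibbs()
print(sum(s[0] for s in samples) / len(samples))   # Monte Carlo estimate of Pr(x1 = 1)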
LL(θ, D) = ∑_{i=1}^{N} log Pr(y^i | x^i, θ)
         = ∑_{i=1}^{N} log [ (1 / Z_θ(x^i)) exp( ∑_c F_θ(y_c^i, c, x^i) ) ]
         = ∑_i [ ∑_c F_θ(y_c^i, c, x^i) − log Z_θ(x^i) ]
The first term is easy to compute, but the second term requires invoking an inference algorithm to compute Z_θ(x^i) for each i.
Computing the gradient of the above objective with respect to θ also
requires inference.
Training via gradient descent
Assume log-linear models as in CRFs, where F_θ(y_c^i, c, x^i) = θ · f(x^i, y_c^i, c). Also, for brevity, write f(x^i, y^i) = ∑_c f(x^i, y_c^i, c).

LL(θ) = ∑_i log Pr(y^i | x^i, θ) = ∑_i ( θ · f(x^i, y^i) − log Z_θ(x^i) )

max_θ ∑_i ( θ · f(x^i, y^i) − log Z_θ(x^i) ) − ||θ||² / C
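The gradient of this objective is ∑_i [ f(x^i, y^i) − E_{Pr(y|x^i,θ)} f(x^i, y) ] − 2θ/C, and the expectation is exactly what inference must supply. A toy sketch that obtains the expectation by brute-force enumeration over y (the feature map, data, and step size are all hypothetical):

import numpy as np
from itertools import product

def features(x, y):
    """Hypothetical f(x, y): [# of (1,1) label pairs, # of positions where y_t == x_t]."""
    f1 = sum(1 for a, b in zip(y, y[1:]) if a == 1 and b == 1)
    f2 = sum(1 for a, b in zip(x, y) if a == b)
    return np.array([f1, f2], dtype=float)

def ll_and_grad(theta, data, C=10.0):
    ll, grad = 0.0, np.zeros_like(theta)
    for x, y in data:
        all_y = list(product([0, 1], repeat=len(x)))     # enumerate every labelling
        scores = np.array([theta @ features(x, yy) for yy in all_y])
        logZ = np.log(np.exp(scores).sum())
        probs = np.exp(scores - logZ)                    # Pr(y | x, theta)
        expected_f = sum(p * features(x, yy) for p, yy in zip(probs, all_y))
        ll += theta @ features(x, y) - logZ
        grad += features(x, y) - expected_f
    return ll - theta @ theta / C, grad - 2 * theta / C

data = [((0, 1, 1), (0, 1, 1)), ((1, 1, 0), (1, 1, 0))]   # made-up (x, y) training pairs
theta = np.zeros(2)
for step in range(100):                                    # plain gradient ascent
    ll, g = ll_and_grad(theta, data)
    theta += 0.1 * g
print(theta, ll)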
E [f2 (x1 , y)] = µ(1, 0, (1, 2))+µ(0, 1, (1, 2))+µ(1, 0, (2, 3))+µ(0, 1, (2, 3))
Nodes y1, . . . , y6. Draw an arc between any two y's that appear together in any of the 8 features.
Cliques: {y1, y2, y3}, {y2, y3, y4}, {y3, y4, y5}, {y5, y6}
Potential on {y1, y2, y3}: 1·f1(y1, y2) + 1·f1(y2, y3) + 1·f2(y1, y3) + 1·f3(y2, y3)
Potential on {y2, y3, y4}: 2·f5(y2, y4) + 1·f1(y3, y4) + 2·f4(y3, y4)
Potential on {y3, y4, y5}: −1·f7(y3, y5) + 1·f1(y4, y5) + 1·f6(y4, y5)
Potential on {y5, y6}: 1·f1(y5, y6) + 1·f8(y5, y6)
E[f1] = ∑_i [ f1(−1,−1) μ_{i,i+1}(−1,−1) + f1(−1,1) μ_{i,i+1}(−1,1) + f1(1,−1) μ_{i,i+1}(1,−1) + f1(1,1) μ_{i,i+1}(1,1) ]
E[f2] = −μ_{1,3}(−1,−1) + μ_{1,3}(−1,1) + μ_{1,3}(1,−1) − μ_{1,3}(1,1)
E[f8] = μ_{5,6}(1,1)
Pr(y1, . . . , yn | x, θ) = ∏_j Pr(yj | y_Pa(j), x, θ)
                          = ∏_j exp(F_θ(y_Pa(j), yj, j, x)) / ∑_{y′=1}^{m} exp(F_θ(y_Pa(j), y′, j, x))
LL(θ, D) = ∑_{i=1}^{N} log Pr(y^i | x^i, θ)
         = ∑_{i=1}^{N} log ∏_j Pr(y_j^i | y_Pa(j)^i, x^i, θ)
         = ∑_i ∑_j log Pr(y_j^i | y_Pa(j)^i, x^i, θ)
         = ∑_i ∑_j [ F_θ(y_Pa(j)^i, y_j^i, j, x^i) − log ∑_{y′=1}^{m} exp(F_θ(y_Pa(j)^i, y′, j, x^i)) ]
max_θ ∑_i ∑_j log P(y_j^i | y_Pa(j)^i)
= max_θ ∑_i ∑_j log θ^j_{y_j^i, y_Pa(j)^i}    s.t.  ∑_v θ^j_{v,u1,...,uk} = 1   ∀ j, u1, . . . , uk
= max_θ ∑_i ∑_j log θ^j_{y_j^i, y_Pa(j)^i} − ∑_j ∑_{u1,...,uk} λ^j_{u1,...,uk} ( ∑_v θ^j_{v,u1,...,uk} − 1 )
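Solving this Lagrangian gives the usual closed form: θ^j_{v,u1,...,uk} is the empirical fraction of examples with yj = v among those whose parents take the value (u1, . . . , uk). A minimal counting sketch (the data rows and variable names are hypothetical):

from collections import Counter, defaultdict

def mle_cpt(data, child, parents):
    """Maximum-likelihood table CPT Pr(child | parents) from fully observed rows."""
    counts = defaultdict(Counter)
    for row in data:                           # each row is a dict: variable -> value
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    return {u: {v: c / sum(cnt.values()) for v, c in cnt.items()}
            for u, cnt in counts.items()}

data = [{"D": "MS", "E": ">15", "I": "high"},
        {"D": "MS", "E": ">15", "I": "high"},
        {"D": "MS", "E": ">15", "I": "low"},
        {"D": "BS", "E": "0-10", "I": "low"}]
print(mle_cpt(data, child="I", parents=("D", "E")))
# {('MS', '>15'): {'high': 0.666..., 'low': 0.333...}, ('BS', '0-10'): {'low': 1.0}}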
EM Algorithm
(Figure: chain model y1, . . . , y7 with observations x1, . . . , x7 and cliques yi yi+1)
Input: Graph G, data D with an observed subset of variables x and hidden variables z.
Initially (t = 0): assign random values to the parameters Pr(xj | pa(xj))^t
for t = 1, . . . , T do
E-step
for i = 1, . . . , N do
Use inference in G to estimate the conditionals Pr_i(zc | x^i)^t for all variable subsets (xj, pa(xj)) involving any hidden variable.
end for
M-step
Pr(xj | pa(xj) = zc)^t = ∑_{i=1}^{N} Pr_i(zc | x^i)^t [[x_j^i == xj]] / ∑_{i=1}^{N} Pr_i(zc | x^i)^t
end for
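A minimal concrete instance of these E/M updates, for a network with one hidden parent z and observed binary children xj (the data, sizes, and initialization are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3))            # hypothetical observed binary data, N x 3
m = 2                                            # hidden z takes m values

pz = np.full(m, 1.0 / m)                         # initial Pr(z)
px = rng.uniform(0.3, 0.7, size=(m, X.shape[1])) # initial Pr(x_j = 1 | z)

for t in range(50):
    # E-step: Pr(z | x^i) for every example, by Bayes rule in this tiny network.
    like = np.ones((X.shape[0], m))
    for j in range(X.shape[1]):
        like *= np.where(X[:, [j]] == 1, px[:, j], 1 - px[:, j])
    post = like * pz
    post /= post.sum(axis=1, keepdims=True)      # Pr(z | x^i)
    # M-step: re-estimate the CPTs from expected counts.
    pz = post.mean(axis=0)
    px = (post.T @ X) / post.sum(axis=0)[:, None]

print(pz, px)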
More on graphical models
Koller and Friedman, Probabilistic Graphical Models: Principles
and Techniques. MIT Press, 2009.
Wainwright’s article in FnT for Machine Learning. 2009.
Kevin Murphy’s brief online introduction
(http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html)
Graphical models. M. I. Jordan. Statistical Science (Special Issue on Bayesian Statistics), 19, 140-155, 2004. (http://www.cs.berkeley.edu/~jordan/papers/statsci.ps.gz)
Other text books:
I R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. "Probabilistic Networks and Expert Systems". Springer-Verlag, 1999.
I J. Pearl. "Probabilistic Reasoning in Intelligent Systems: