
Graphical models

Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita

Probabilistic modeling
Given: several variables x1 , . . . , xn , where n is large.
Task: build a joint distribution function Pr(x1 , . . . , xn ).
Goal: answer several kinds of projection queries on the distribution.
Basic premise
- The explicit joint distribution is dauntingly large.
- Queries are simple marginals (sum or max) over the joint distribution.

Examples of Joint Distributions So far
Naive Bayes: P(x1 , . . . , xd |y ), where d is large; assumes conditional
independence.
Multivariate Gaussian
Recurrent neural networks for sequence labeling and prediction

Example
Variables are attributes of people:
Age (10 ranges), Income (7 scales), Experience (7 scales), Degree (3 scales), Location (30 places).

An explicit joint distribution over all columns is not tractable:
number of combinations: 10 × 7 × 7 × 3 × 30 = 44100.
Queries: estimate the fraction of people with
- Income > 200K and Degree = "Bachelors",
- Income < 200K, Degree = "PhD" and Experience > 10 years,
- many, many more (a brute-force sketch over the explicit joint follows below).
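A minimal sketch in Python (not part of the slides), assuming a hypothetical index encoding for the attribute values: it stores the explicit joint as a 10 × 7 × 7 × 3 × 30 table and answers one query by brute-force summation — exactly the approach that stops scaling as n grows.

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((10, 7, 7, 3, 30))   # axes: Age, Income, Experience, Degree, Location
joint /= joint.sum()                    # normalize the 44100-entry table

# Hypothetical encoding: income level 5 = "High", degree index 2 = "PhD".
INCOME_HIGH, DEGREE_PHD = 5, 2
# Marginalize (sum) over Age, Experience and Location to answer the query.
p = joint[:, INCOME_HIGH, :, DEGREE_PHD, :].sum()
print("Pr(Income=High, Degree=PhD) =", p)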

Alternatives to an explicit joint distribution
Assume all columns are independent of each other: a bad assumption.
Use data to detect pairs of highly correlated columns and estimate
their pairwise frequencies.
- Many highly correlated pairs:
  income ⊥̸⊥ age, income ⊥̸⊥ experience, age ⊥̸⊥ experience
- Ad hoc methods of combining these into a single estimate.
Go beyond pairwise correlations: conditional independencies
- income ⊥̸⊥ age, but income ⊥⊥ age | experience
- experience ⊥⊥ degree, but experience ⊥̸⊥ degree | income

Graphical models build an explicit, efficient joint distribution from
these independencies.
More examples of CIs
The grades of a student in various courses are correlated, but
they become CI given attributes of the student (hard-working,
intelligent, etc.).
Health symptoms of a person may be correlated, but are CI given
the latent disease.
Words in a document are correlated, but may become CI given
the topic.
Pixel colors in an image become CI of distant pixels given nearby
pixels.

Graphical models
Model a joint distribution over several variables as a product of smaller
factors that is
1. Intuitive to represent and visualize
   - Graph: represents the structure of dependencies
   - Potentials over subsets: quantify the dependencies
2. Efficient to query
   - given values of any variable subset, reason about the probability
     distribution of the others
   - many efficient exact and approximate inference algorithms

Graphical models = graph theory + probability theory.

Graphical models in use
Roots in statistical physics for modeling interacting atoms in gases
and solids [~1900]
Early usage in genetics for modeling properties of species [~1920]
AI: expert systems (1970s-80s)
Now many new applications:
- Error-correcting codes: turbo codes, an impressive success story (1990s)
- Robotics and vision: image denoising, robot navigation
- Text mining: information extraction, duplicate elimination,
  hypertext classification, help systems
- Bio-informatics: secondary structure prediction, gene discovery
- Data mining: probabilistic classification and clustering

Representation
Structure of a graphical model: Graph + Potentials
Graph
- Nodes: variables x = x1 , . . . , xn
  - Continuous: sensor temperatures, income
  - Discrete: Degree (one of Bachelors, Masters, PhD), levels of Age, labels of words
- Edges: direct interaction
  - Directed edges: Bayesian networks
  - Undirected edges: Markov random fields
[Figure: the Age / Location / Degree / Experience / Income example drawn
once as a directed graph and once as an undirected graph.]

Representation
Potentials: ψc (xc )
Scores for assignments of values to subsets c of directly
interacting variables.
Which subsets? What do the potentials mean?
- Different for directed and undirected graphs

Probability
Factorizes as a product of potentials:
Pr(x = x1 , . . . , xn ) ∝ ∏_c ψc (xc )

Directed graphical models: Bayesian networks
Graph G : directed acyclic
- Parents of a node: Pa(xi ) = set of nodes in G pointing to xi
Potentials: defined at each node in terms of its parents,
ψi (xi , Pa(xi )) = Pr(xi |Pa(xi ))

Probability distribution
Pr(x1 , . . . , xn ) = ∏_{i=1}^{n} Pr(xi |Pa(xi ))

Example of a directed graph
[Graph: Location, Degree, Age, Experience, Income with edges
Age → Experience, Degree → Income, Experience → Income.]

ψ1 (L) = Pr(L):
  NY 0.2, CA 0.3, London 0.1, Other 0.4

ψ2 (A) = Pr(A):
  20-30: 0.3, 30-45: 0.4, >45: 0.3

ψ3 (E , A) = Pr(E |A), experience in years:
           0-10   10-15   >15
  20-30    0.9    0.1     0
  30-45    0.4    0.5     0.1
  >45      0.1    0.1     0.8

ψ4 (I , E , D) = Pr(I |D, E ): a 3-dimensional table, or a Gaussian
distribution with (µ, σ) = (35, 10), or a histogram approximation.

Probability distribution
Pr(x = L, D, I , A, E ) = Pr(L) Pr(D) Pr(A) Pr(E |A) Pr(I |D, E )
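A minimal sketch in Python of this factorization. The CPT values reuse the numbers shown above where they survive on the slide; Pr(D) and Pr(I |D, E ) are filled in with hypothetical values.

import numpy as np

P_L = np.array([0.2, 0.3, 0.1, 0.4])              # NY, CA, London, Other
P_A = np.array([0.3, 0.4, 0.3])                   # 20-30, 30-45, >45
P_D = np.array([0.5, 0.3, 0.2])                   # hypothetical Pr(Degree)
P_E_given_A = np.array([[0.9, 0.1, 0.0],          # rows: age bins, cols: experience bins
                        [0.4, 0.5, 0.1],
                        [0.1, 0.1, 0.8]])
# Hypothetical Pr(I | D, E): 3 degrees x 3 experience bins x 7 income scales.
P_I_given_D_E = np.random.default_rng(1).dirichlet(np.ones(7), size=(3, 3))

def joint(l, d, a, e, i):
    """Pr(L=l, D=d, A=a, E=e, I=i) as the product of the local CPT entries."""
    return P_L[l] * P_D[d] * P_A[a] * P_E_given_A[a, e] * P_I_given_D_E[d, e, i]

print(joint(l=0, d=2, a=1, e=1, i=3))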

Conditional Independencies
Given three sets of variables X , Y , Z , set X is conditionally
independent of Y given Z (X ⊥⊥ Y |Z ) iff
Pr(X |Y , Z ) = Pr(X |Z )

Local conditional independencies in a BN: for each xi ,
xi ⊥⊥ ND(xi )|Pa(xi )

In the example graph (L, A, D, E, I):
- L ⊥⊥ E , D, A, I
- A ⊥⊥ L, D
- E ⊥⊥ L, D | A
- I ⊥⊥ A | E , D
CIs and Factorization
Theorem
Given a distribution P(x1 , . . . , xn ) and a DAG G , if P satisfies the
Local-CIs induced by G , then P can be factorized as per the graph:
Local-CI(P, G ) =⇒ Factorize(P, G )

Proof.
Let x1 , x2 , . . . , xn be topologically ordered (parents before children) in G .
Local-CI(P, G ): P(xi |x1 , . . . , xi−1 ) = P(xi |PaG (xi ))
Chain rule: P(x1 , . . . , xn ) = ∏_i P(xi |x1 , . . . , xi−1 ) = ∏_i P(xi |PaG (xi ))
=⇒ Factorize(P, G )

CIs and Factorization
Theorem
Given a distribution P(x1 , . . . , xn ) and a DAG G , if P can be
factorized as per G , then P satisfies the Local-CIs induced by G :
Factorize(P, G ) =⇒ Local-CI(P, G )

Proof skipped (refer to the book).

Drawing a BN starting from a distribution
Given a distribution P(x1 , . . . , xn ) of which we can ask any CI query of
the form "Is X ⊥⊥ Y |Z ?" and get a yes/no answer.
Goal: draw a minimal, correct BN G to represent P.

Why minimal
Theorem
G constructed by the above algorithm is minimal, that is, we cannot
remove any edge from the BN while maintaining the correctness of
the BN for P.

Proof.
By construction: when the parents of each xi were chosen, only (a subset of)
its non-descendants were available, and the parent set was chosen minimally.

Why Correct
Theorem
G constructed by the above algorithm is correct, that is, the local-CIs
induced by G hold in P

Proof.
The construction process ensures that the factorization property
holds. Since factorization implies the local-CIs, the constructed BN
satisfies the local-CIs of P.

Order is important

Examples of CIs that hold in BN but not covered
by local-CI

Global CIs in a BN
Three sets of variables X , Y , Z : if Z d-separates X from Y in the BN,
then X ⊥⊥ Y |Z .
In a directed graph H, Z d-separates X from Y if every path P from
any node in X to any node in Y is blocked by Z .
A path P is blocked by Z when it contains
1. a chain x1 → . . . → xi → . . . → xk with xi ∈ Z , or
2. a chain x1 ← . . . ← xi ← . . . ← xk with xi ∈ Z , or
3. a fork x1 . . . ← xi → . . . xk with xi ∈ Z , or
4. a collider x1 . . . → xi ← . . . xk with xi ∉ Z and no descendant of xi in Z .

Theorem
The d-separation test identifies the complete set of conditional
independencies that hold in all distributions that conform to a given
Bayesian network.
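A minimal sketch in Python (not the route taken on the slides) of a d-separation test, using the equivalent criterion that Z d-separates X from Y iff Z separates X from Y in the moralized ancestral graph of X ∪ Y ∪ Z. The dict-of-parents encoding is an assumption; the query checks one CI of the example BN.

def d_separated(parents, X, Y, Z):
    # 1. Ancestral subgraph of X ∪ Y ∪ Z.
    anc, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, []))
    # 2. Moralize: connect co-parents, drop edge directions.
    nbr = {v: set() for v in anc}
    for v in anc:
        ps = parents.get(v, [])
        for p in ps:
            nbr[v].add(p); nbr[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                nbr[ps[i]].add(ps[j]); nbr[ps[j]].add(ps[i])
    # 3. Remove Z and test whether any node of X can reach a node of Y.
    seen, stack = set(), [x for x in X if x not in Z]
    while stack:
        v = stack.pop()
        if v in Y:
            return False              # a path survives, so not d-separated
        if v in seen or v in Z:
            continue
        seen.add(v)
        stack.extend(nbr[v] - Z)
    return True

parents = {"Experience": ["Age"], "Income": ["Degree", "Experience"]}
# Income ⊥⊥ Age | Experience, Degree holds in this BN:
print(d_separated(parents, {"Income"}, {"Age"}, {"Experience", "Degree"}))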
Global CIs Examples

Global CIs and Local-CIs
In a BN, the set of local-CIs combined with the axioms of probability
can be used to derive the Global-CIs.
The proof is long but easy to understand; a sketch is available in
the supplementary material.

Popular Bayesian networks
Hidden Markov Models: speech recognition, information extraction
[Figure: chain y1 → y2 → · · · → y7 with observations x1 , . . . , x7 .]
- State variables: discrete (phoneme, entity tag)
- Observation variables: continuous (speech waveform) or discrete (word)
Kalman Filters: state variables are continuous (discussed later)
Topic models for text data
1. Principled mechanism to categorize multi-labeled text documents
   while incorporating priors in a flexible generative framework
2. Application: news tracking
QMR (Quick Medical Reference) system
PRMs: probabilistic relational networks
Undirected graphical models
Graph G : arbitrary undirected graph
Useful when variables interact symmetrically, with no
natural parent-child relationship.
Example: labeling the pixels of an image.
Potentials ψC (yC ) defined on arbitrary cliques C of G .
ψC (yC ): any arbitrary non-negative value; cannot be interpreted as a
probability.
Probability distribution
Pr(y1 , . . . , yn ) = (1/Z ) ∏_{C ∈G} ψC (yC )
where Z = ∑_{y′} ∏_{C ∈G} ψC (y′C ) (the partition function)

Example

[Figure: a 3 × 3 grid of pixels y1 , . . . , y9 .]
yi = 1 (part of foreground), 0 otherwise.

Node potentials
- ψ1 (0) = 4, ψ1 (1) = 1
- ψ2 (0) = 2, ψ2 (1) = 3
- ...
- ψ9 (0) = 1, ψ9 (1) = 1
Edge potentials: the same for all edges
- ψ(0, 0) = 5, ψ(1, 1) = 5, ψ(1, 0) = 1, ψ(0, 1) = 1
Probability: Pr(y1 , . . . , y9 ) ∝ ∏_{k=1}^{9} ψk (yk ) ∏_{(i,j)∈E (G )} ψ(yi , yj )
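A minimal sketch in Python that checks this model by brute force: it enumerates all 2⁹ labellings to compute the partition function Z and Pr(y9 = 1). The node potentials not listed on the slide are set to 1 as a placeholder.

from itertools import product

node_pot = [[4, 1], [2, 3]] + [[1, 1]] * 7            # psi_1, psi_2; the rest are hypothetical
edge_pot = {(0, 0): 5, (1, 1): 5, (0, 1): 1, (1, 0): 1}
grid_edges = [(i, i + 1) for r in range(3) for i in range(3 * r, 3 * r + 2)] + \
             [(i, i + 3) for i in range(6)]           # 3x3 4-neighbour grid

def score(y):                                         # unnormalized probability of a labelling
    s = 1.0
    for k, yk in enumerate(y):
        s *= node_pot[k][yk]
    for i, j in grid_edges:
        s *= edge_pot[(y[i], y[j])]
    return s

Z = sum(score(y) for y in product([0, 1], repeat=9))
p_y9 = sum(score(y) for y in product([0, 1], repeat=9) if y[8] == 1) / Z
print(Z, p_y9)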

Conditional independencies (CIs) in an undirected
graphical model
Let V = {y1 , . . . , yn } and let the distribution P be represented by an
undirected graphical model G . If Z separates X and Y in G , then
X ⊥⊥ Y |Z in P.
The set of all such CIs is called the Global-CIs of the UGM.
Example (the 3 × 3 grid):
1. y1 ⊥⊥ y3 , y5 , y6 , y7 , y8 , y9 | y2 , y4
2. y1 ⊥⊥ y3 | y2 , y4 , y5 , y6 , y7 , y8 , y9
3. y1 , y2 , y3 ⊥⊥ y7 , y8 , y9 | y4 , y5 , y6

Factorization implies Global-CI
Theorem
Let G be an undirected graph over nodes V = {x1 , . . . , xn } and
P(x1 , . . . , xn ) be a distribution. If P is represented by G , that is, if it
can be factorized as per the cliques of G , then P also satisfies the
global-CIs of G :
Factorize(P, G ) =⇒ Global-CI(P, G )

Factorization implies Global-CI (Proof)
Available as proof of Theorem 4.1 in KF book.

Global-CI does not imply factorization.
(Taken from example 4.4 of KF book)
But global-CI does not imply factorization. Consider a distribution
over 4 binary variables, P(x1 , x2 , x3 , x4 ), and let G be the 4-cycle
x1 — x2 — x3 — x4 — x1 .

Let P(x1 , x2 , x3 , x4 ) = 1/8 when x1 x2 x3 x4 takes a value from the set
{0000, 1000, 1100, 1110, 1111, 0111, 0011, 0001}, and zero in all other
cases. One can (painstakingly) check that all four global CIs of the
graph, e.g. x1 ⊥⊥ x3 | x2 , x4 , hold in P.
Now consider factorization. The factors correspond to the edges, e.g.
ψ(x1 , x2 ). Each of the four possible assignments of each factor gets a
positive value, but then the product cannot represent the zero
probability of cases like x1 x2 x3 x4 = 0101.
Other Conditional independencies (CIs) in an
undirected graphical model
Let V = {y1 , . . . , yn }.
1 Local CI: yi ⊥⊥ V − ne(yi ) − {yi }|ne(yi )
2 Pairwise CI: yi ⊥⊥ yj |V − {yi , yj } if edge (yi , yj ) does not exist.
3 Global CI: X ⊥⊥ Y |Z if Z separates X and Y in the graph.
Equivalent when the distribution P(x) is positive, that is
P(x) > 0, ∀x
1. y1 ⊥⊥ y3 , y5 , y6 , y7 , y8 , y9 | y2 , y4
2. y1 ⊥⊥ y3 | y2 , y4 , y5 , y6 , y7 , y8 , y9
3. y1 , y2 , y3 ⊥⊥ y7 , y8 , y9 | y4 , y5 , y6

Relationship between Local-CI and Global-CI
Let G be an undirected graph over nodes V = {x1 , . . . , xn } and
P(x1 , . . . , xn ) be a distribution. If P satisfies the Global-CIs of G , then
P also satisfies the Local-CIs of G , but the reverse is not always true.
We show this with an example.
Consider a distribution over 5 binary variables, P(x1 , . . . , x5 ), where
x1 = x2 , x4 = x5 and x3 = x2 AND x4 .
Let G be the chain
x1 — x2 — x3 — x4 — x5

All 5 local CIs of the graph, e.g. x1 ⊥⊥ {x3 , x4 , x5 } | x2 , hold in P.
However, the global CI x2 ⊥⊥ x4 | x3 does not hold.

Relationship between Local-CI and Pairwise-CI
Let G be an undirected graph over nodes V = {x1 , . . . , xn } and
P(x1 , . . . , xn ) be a distribution. If P satisfies the Local-CIs of G , then
P also satisfies the Pairwise-CIs of G , but the reverse is not always
true. We show this with an example.
Consider a distribution over 3 binary variables, P(x1 , x2 , x3 ), where
x1 = x2 = x3 . That is, P(x1 , x2 , x3 ) = 1/2 when all three are equal
and 0 otherwise.
Let G be the graph with the single edge
x1 — x2        x3

Both pairwise CIs of the graph, x1 ⊥⊥ x3 | x2 and x2 ⊥⊥ x3 | x1 ,
hold in P.
However, the local CI x1 ⊥⊥ x3 (x3 has no neighbors) does not hold.

Factorization and CIs
Theorem
(Hammersley-Clifford theorem) If a positive distribution
P(x1 , . . . , xn ) conforms to the pairwise CIs of a UGM G , then it can
be factorized as per the cliques C of G as
P(x1 , . . . , xn ) ∝ ∏_{C ∈G} ψC (xC )

Proof.
Theorem 4.8 of the KF book (partially).

Summary
Let P be a distribution and H an undirected graph over the same set
of nodes.
Factorize(P, H) =⇒ Global-CI(P, H) =⇒ Local-CI(P, H) =⇒ Pairwise-CI(P, H)
And, only for positive distributions,
Pairwise-CI(P, H) =⇒ Factorize(P, H)

Constructing an UGM from a positive distribution
Given a positive distribution P(x1 , . . . , xn ) of which we can ask any
CI query of the form "Is X ⊥⊥ Y |Z ?" and get a yes/no answer.
Goal: draw a minimal, correct UGM G to represent P.
Two options: (1) using pairwise CIs, (2) using local CIs.

Constructing an UGM from a positive distribution
using Local-CI
Definition: the Markov Blanket of a variable xi , MB(xi ), is the
smallest subset of V that makes xi CI of the rest given the blanket:
xi ⊥⊥ V − MB(xi ) − {xi } | MB(xi )
The MB of a variable is always unique for a positive distribution.

Popular undirected graphical models
Interacting atoms in gases and solids [~1900]
Markov Random Fields in vision, for image segmentation
Conditional Random Fields for information extraction
Social networks
Bio-informatics: annotating active sites in protein molecules

Conditional Random Fields (CRFs)
Used to represent a conditional distribution P(y|x) where
y = y1 , . . . , yn forms an undirected graphical model.
The potentials are defined over subsets of the y variables and the whole of x:

Pr(y1 , . . . , yn |x, θ) = ∏_c ψc (yc , x, θ) / Zθ (x) = (1/Zθ (x)) exp(∑_c Fθ (yc , c, x))

where Zθ (x) = ∑_{y′} exp(∑_c Fθ (y′c , c, x)) and the
clique potential is ψc (yc , x) = exp(Fθ (yc , c, x)).

Potentials in CRFs
Log-linear model over user-defined features, e.g. CRFs, Maxent
models: let K be the number of features and denote a feature as
fk (yc , c, x). Then
Fθ (yc , c, x) = ∑_{k=1}^{K} θk fk (yc , c, x)

Arbitrary function, e.g. a neural network that takes as input
yc , c, x and transforms them, possibly non-linearly, into a real
value; θ are the parameters of the network.

Example: Named Entity Recognition

Named Entity Recognition: Features

Comparing directed and undirected graphs
Some distributions can only be expressed in one and not the other.
Potentials
- Directed: conditional probabilities, more intuitive
- Undirected: arbitrary scores, easy to set
Dependence structure
- Directed: complicated d-separation test
- Undirected: graph separation: A ⊥⊥ B | C iff C separates A and B in G
Often the application makes the choice clear.
- Directed: causality
- Undirected: symmetric interactions
Equivalent BNs
Two BN DAGs are said to be equivalent if they express the same set
of CIs. (Examples)

Theorem
Two BNs G1 , G2 are equivalent iff they have the same skeleton and
the same set of immoralities. (An immorality is a structure of the
form x → y ← z with no edge between x and z)

Converting BN to MRFs
Efficient: Using the Markov Blanket algorithm.

For which BN can we create perfect MRFs?

Converting MRFs to BNs

Which MRFs have perfect BNs
Chordal or triangulated graphs
A graph is chordal if it has no minimal cycle of length ≥ 4.

Theorem
An MRF can be converted perfectly into a BN iff it is chordal.

Proof.
Theorems 4.11 and 4.13 of KF book
Algorithm for constructing perfect BNs from chordal MRFs to be
discussed later.
BN and Chordality
A BN with a minimal undirected cycle of length ≥ 4 must have an
immorality. A BN without any immorality is always chordal.

Inference queries
1. Marginal probability queries over a small subset of variables:
   - Find Pr(Income='High' & Degree='PhD')
   - Find Pr(pixel y9 = 1)
   Pr(x1 ) = ∑_{x2 ,...,xn } Pr(x1 , . . . , xn ) = ∑_{x2 =1}^{m} · · · ∑_{xn =1}^{m} Pr(x1 , . . . , xn )
   Brute force requires O(m^{n−1} ) time.
2. Most likely labels of the remaining variables (MAP queries):
   - Find the most likely entity labels of all words in a sentence
   - Find the likely temperature at sensors in a room
   x∗ = argmax_{x1 ...xn } Pr(x1 , . . . , xn )
Exact inference on chains
Given,
- Graph: a chain y1 — y2 — · · · — yn
- Potentials: ψi (yi , yi+1 )
- Pr(y1 , . . . , yn ) = ∏_i ψi (yi , yi+1 )
Find Pr(yi ) for any i, say Pr(y5 = 1).
- Direct method: Pr(y5 = 1) = ∑_{y1 ,...,y4 } Pr(y1 , . . . , y4 , 1) requires an
  exponential number of summations.
- A more efficient alternative...

Exact inference on chains
Pr(y5 = 1) = ∑_{y1 ,...,y4 } Pr(y1 , . . . , y4 , 1)
           = ∑_{y1 } ∑_{y2 } ∑_{y3 } ∑_{y4 } ψ1 (y1 , y2 )ψ2 (y2 , y3 )ψ3 (y3 , y4 )ψ4 (y4 , 1)
           = ∑_{y1 } ∑_{y2 } ψ1 (y1 , y2 ) ∑_{y3 } ψ2 (y2 , y3 ) ∑_{y4 } ψ3 (y3 , y4 )ψ4 (y4 , 1)
           = ∑_{y1 } ∑_{y2 } ψ1 (y1 , y2 ) ∑_{y3 } ψ2 (y2 , y3 )B3 (y3 )
           = ∑_{y1 } ∑_{y2 } ψ1 (y1 , y2 )B2 (y2 )
           = ∑_{y1 } B1 (y1 )

An alternative view: a flow of beliefs Bi (·) from node i + 1 to node i.
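A minimal sketch in Python of this belief recursion for a binary 5-node chain, with randomly generated (hypothetical) pairwise potentials; dividing by the sum over both values of y5 normalizes the answer.

import numpy as np

m = 2
rng = np.random.default_rng(0)
psi = [rng.random((m, m)) for _ in range(4)]   # psi[i-1] plays the role of psi_i(y_i, y_{i+1})

def backward(y5_value):
    B = psi[3][:, y5_value]                    # B4(y4) = psi4(y4, y5 = value)
    for i in (2, 1, 0):                        # B_i(y_i) = sum_{y_{i+1}} psi_i(y_i, y_{i+1}) B_{i+1}(y_{i+1})
        B = psi[i] @ B
    return B.sum()                             # = sum_{y1} B1(y1)

print(backward(1) / (backward(0) + backward(1)))   # Pr(y5 = 1), in O(n m^2) time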

Adding evidence
Given fixed values of a subset of variables xe (the evidence), find the
1. Marginal probability queries over a small subset of variables:
   - Find Pr(Income='High' | Degree='PhD')
   Pr(x1 |xe ) = ∑_{x2 ,...,xn } Pr(x1 , . . . , xn |xe )
2. Most likely labels of the remaining variables (MAP queries):
   - Find the likely temperature at sensors in a room given readings
     from a subset of them
   x∗ = argmax_{x1 ...xn } Pr(x1 , . . . , xn |xe )
It is easy to add evidence: just change the potentials.

Case study: HMMs for Information Extraction

Inference in HMMs
Given,
- Graph: chain y1 → y2 → · · · → y7 with observations x1 , . . . , x7
- Potentials: Pr(yi |yi−1 ), Pr(xi |yi )
- Evidence variables: x = x1 . . . xn = o1 . . . on
Find the most likely values of the hidden state variables y = y1 . . . yn :
argmax_y Pr(y|x = o)

Define ψi (yi−1 , yi ) = Pr(yi |yi−1 ) Pr(xi = oi |yi ).
The reduced graph is a single chain of y nodes.
The algorithm is the same as earlier, just replace "sum" with "max".
This is the well-known Viterbi algorithm.
The Viterbi algorithm
Let observations xt take one of k possible values and states yt take
one of m possible values.

Given n observations o1 , . . . , on .
Given potentials Pr(yt |yt−1 ) = P(y |y ′ ) (a table with m² values),
Pr(xt |yt ) = P(x|y ) (a table with mk values), and Pr(y1 ) = P(y ), the
start probabilities (a table with m values).
Find max_y Pr(y|x = o).

Bn [y ] = 1 for y ∈ [1, . . . , m]
for t = n . . . 2 do
  ψ(y , y ′ ) = P(y |y ′ )P(xt = ot |y )
  Bt−1 [y ′ ] = max_{y =1,...,m} ψ(y , y ′ )Bt [y ]
end for
Return max_y B1 [y ]P(y )P(x1 = o1 |y )
Time taken: O(nm² )
Numerical Example
P(y |y ′ ):  y ′ = 0:  P(y = 0|y ′ ) = 0.9,  P(y = 1|y ′ ) = 0.1
            y ′ = 1:  P(y = 0|y ′ ) = 0.2,  P(y = 1|y ′ ) = 0.8
P(x|y ):    y = 0:   P(x = 0|y ) = 0.7,  P(x = 1|y ) = 0.3
            y = 1:   P(x = 0|y ) = 0.6,  P(x = 1|y ) = 0.4
P(y1 = 1) = 0.5
Observation [x1 , x2 , x3 ] = [0, 0, 0]
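A minimal sketch in Python of the Viterbi recursion, run on the numerical example above (observations [0, 0, 0]); back-pointers are added so the MAP state sequence is recovered along with its probability.

import numpy as np

P_trans = np.array([[0.9, 0.1],       # P(y_t | y_{t-1}); row = previous state y'
                    [0.2, 0.8]])
P_emit  = np.array([[0.7, 0.3],       # P(x_t | y_t); row = state y
                    [0.6, 0.4]])
P_start = np.array([0.5, 0.5])
obs = [0, 0, 0]

n, m = len(obs), 2
B = np.ones(m)                                     # B_n[y] = 1
back = [None] * n
for t in range(n - 1, 0, -1):                      # t = n ... 2
    psi = P_trans.T * P_emit[:, obs[t]][:, None]   # psi[y, y'] = P(y|y') P(x_t = o_t | y)
    scores = psi * B[:, None]                      # psi(y, y') B_t[y]
    back[t] = scores.argmax(axis=0)                # best y at step t for each y'
    B = scores.max(axis=0)                         # B_{t-1}[y']

first = B * P_start * P_emit[:, obs[0]]            # B_1[y] P(y) P(x_1 = o_1 | y)
path = [int(first.argmax())]
for t in range(1, n):
    path.append(int(back[t][path[-1]]))
print(first.max(), path)                           # best-path probability and MAP sequence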

Variable elimination on general graphs
Given arbitrary sets of potentials ψC (xC ), C = cliques in a graph G .
Find Z = ∑_{x1 ,...,xn } ∏_C ψC (xC ).

x1 , . . . , xn = a good ordering of the variables
F = {ψC (xC ) : C a clique of G }
for i = 1 . . . n do
  Fi = factors in F that contain xi
  Mi = product of the factors in Fi
  mi = ∑_{xi } Mi
  F = F − Fi ∪ {mi }
end for

Example: Variable elimination
Given ψ12 (x1 , x2 ), ψ24 (x2 , x4 ), ψ23 (x2 , x3 ), ψ45 (x4 , x5 ), ψ35 (x3 , x5 ).
Find Z = ∑_{x1 ,...,x5 } ψ12 (x1 , x2 )ψ24 (x2 , x4 )ψ23 (x2 , x3 )ψ45 (x4 , x5 )ψ35 (x3 , x5 ).

1. x1 : {ψ12 (x1 , x2 )} → M1 (x1 , x2 ) → m1 (x2 ) = ∑_{x1 } M1
2. x2 : {ψ24 (x2 , x4 ), ψ23 (x2 , x3 ), m1 (x2 )} → M2 (x2 , x3 , x4 ) → m2 (x3 , x4 ) = ∑_{x2 } M2
3. x3 : {ψ35 (x3 , x5 ), m2 (x3 , x4 )} → M3 (x3 , x4 , x5 ) → m3 (x4 , x5 ) = ∑_{x3 } M3
4. x4 : {ψ45 (x4 , x5 ), m3 (x4 , x5 )} → M4 (x4 , x5 ) → m4 (x5 ) = ∑_{x4 } M4
5. x5 : {m4 (x5 )} → M5 (x5 ) → Z = ∑_{x5 } M5
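A minimal sketch in Python of the elimination loop, replayed on this example with binary variables and randomly generated (hypothetical) potential tables; each factor is a (scope, table) pair.

from itertools import product
import numpy as np

rng = np.random.default_rng(0)
factors = [((1, 2), rng.random((2, 2))), ((2, 4), rng.random((2, 2))),
           ((2, 3), rng.random((2, 2))), ((4, 5), rng.random((2, 2))),
           ((3, 5), rng.random((2, 2)))]

def eliminate(factors, order):
    for x in order:
        Fi = [f for f in factors if x in f[0]]                 # factors that contain x
        rest = [f for f in factors if x not in f[0]]
        scope = sorted(set(v for vars_, _ in Fi for v in vars_))
        M = np.zeros((2,) * len(scope))
        for assign in product([0, 1], repeat=len(scope)):      # M_i = product of factors in F_i
            a = dict(zip(scope, assign))
            val = 1.0
            for vars_, tab in Fi:
                val *= tab[tuple(a[v] for v in vars_)]
            M[assign] = val
        m_i = M.sum(axis=scope.index(x))                       # m_i = sum_{x_i} M_i
        factors = rest + [(tuple(v for v in scope if v != x), m_i)]
    return factors[0][1]                                       # 0-d array: Z

print(float(eliminate(factors, order=[1, 2, 3, 4, 5])))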

Choosing a variable elimination order
Complexity of VE is O(nm^w ), where w is the maximum number of
variables in any intermediate factor.
A wrong elimination order can give rise to very large intermediate
factors.
Example: eliminating x2 first in the previous example gives a factor over 4 variables.
(Exercise: give an example where the penalty can be really severe.)
Choosing the optimal elimination order is NP-hard for general graphs.
A polynomial-time algorithm exists for chordal graphs.
- A graph is chordal or triangulated if all cycles of length greater
  than three have a shortcut (chord).
Optimal triangulation of graphs is NP-hard (many heuristics exist).

Finding optimal order in a triangulated graph
Theorem
Every triangulated graph is either complete or has at least two
simplicial vertices. A vertex is simplicial if its neighbors form a
complete set.

Proof.
In supplementary. (not in syllabus)

Goal: find the optimal ordering for P(x1 ) inference; x1 has to be last in
the ordering.
Input: graph G , n = number of vertices of G
for i = 2, . . . , n do
  πi = pick any simplicial vertex in G other than x1
  remove πi from G
end for
Return the ordering π2 , . . . , πn (with x1 summed out last)
Reusing computation across multiple inference
queries
Given a chain graph with potentials ψi,i+1 (xi , xi+1 ), suppose we need
to compute all n marginals P(x1 ), . . . , P(xn ).
Invoking the variable elimination algorithm separately for each xi entails
a cost of n × O(nm² ) = O(n² m² ). Can we go faster by reusing work across
computations?

Junction tree algorithm
An optimal general-purpose algorithm for exact marginal/MAP
queries
Simultaneous computation of many queries
Efficient data structures
Complexity: O(m^w N), where w = size of the largest clique in the
(triangulated) graph and m = number of values of each discrete
variable in the clique → linear for trees.
Basis for many approximate algorithms.
Many popular inference algorithms are special cases of junction trees:
- the Viterbi algorithm of HMMs
- the forward-backward algorithm of Kalman filters

Junction tree
Junction tree JT of a triangulated graph G with nodes x1 , . . . , xn is a
tree where
Nodes = maximal cliques of G
Edges ensure that if any two nodes contain a variable xi then xi
is present in every node in the unique path between them
(Running intersection property).
Constructing a junction tree
Efficient polynomial time algorithms exist for creating a JT from a
triangulated graph.
1 Enumerate a covering set of cliques
2 Connect cliques to get a tree that satisfies the running
intersection property.
If graph is non-triangulated, triangulate first using heuristics, optimal
triangulation is NP-hard.
Creating a junction tree from a graphical model
1. Start with the graph.
2. Triangulate the graph.
3. Create clique nodes.
4. Create tree edges so that the running intersection property holds for
   shared variables.
5. Assign each potential to exactly one clique node that subsumes it.
[Figure: the five-variable example graph, its triangulation, the cliques
{x1 , x2 }, {x2 , x3 , x4 }, {x3 , x4 , x5 }, and the resulting junction tree with
Ψ12 assigned to {x1 , x2 }, Ψ23 , Ψ24 to {x2 , x3 , x4 }, and Ψ45 , Ψ35 to
{x3 , x4 , x5 }.]
Finding cliques of a triangulated graph
Theorem
Every triangulated graph has a simplicial vertex, that is, a vertex
whose neighbors form a complete set.

Input: Graph G . n = number of vertices of G


for i = 1, . . . , n do
πi = pick any simplicial vertex in G
Ci = {πi } ∪ Ne(πi )
remove πi from G
end for
Return maximal cliques from C1 , . . . Cn

Connecting cliques to form junction tree
Separator variables = intersection of variables in the two cliques
joined by an edge.
Theorem
A clique tree that satisfies the running intersection property
maximizes the number of separator variables.

Proof: https://people.eecs.berkeley.edu/~jordan/courses/281A-fall04/lectures/lec-11-16.pdf
Input: Cliques: C1 , . . . Ck
Form a complete weighted graph H with cliques as nodes and edge
weights = size of the intersection of the two cliques it connects.
T = maximum weight spanning tree of H
Return T as the junction tree.
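A minimal sketch in Python of this construction: build the complete clique graph weighted by separator size and keep a maximum-weight spanning tree (Kruskal with a tiny union-find). The clique set is the running five-variable example.

cliques = [frozenset(c) for c in ({1, 2}, {2, 3, 4}, {3, 4, 5})]
edges = sorted(((len(a & b), i, j) for i, a in enumerate(cliques)
                for j, b in enumerate(cliques) if i < j), reverse=True)

parent = list(range(len(cliques)))
def find(u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u

tree = []
for w, i, j in edges:                       # heaviest separators first
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((sorted(cliques[i]), sorted(cliques[j]), w))   # w = separator size

print(tree)   # [([2, 3, 4], [3, 4, 5], 2), ([1, 2], [2, 3, 4], 1)]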

Message passing on junction trees
Each node c
- sends a message mc→c′ (·) to each of its neighbors c ′ once it has
  messages from every other neighbor N(c) − {c ′ }.
- mc→c′ (·), the message from c to c ′ , is the result of sum-product
  elimination, on the side of the tree that contains clique c but not c ′ ,
  onto the separator variables s = c ∩ c ′ :
  mc→c′ (xs ) = ∑_{xc−s } ψc (xc ) ∏_{d∈N(c)−{c ′ }} md→c (xd∩c )
Replace "sum" with "max" for MAP queries.

Compute the marginal probability of any variable xi as
1. c = a clique in the JT containing xi
2. Pr(xi ) ∝ ∑_{xc −xi } ψc (xc ) ∏_{d∈N(c)} md→c (xd∩c )

Example
Clique potentials for the junction tree {y1 , y2 } — {y2 , y3 , y4 } — {y3 , y4 , y5 }:
ψ234 (y234 ) = ψ23 (y23 )ψ24 (y24 )
ψ345 (y345 ) = ψ35 (y35 )ψ45 (y45 )
ψ12 (y12 ) = ψ12 (y12 )

1. Clique "12" sends message m12→234 (y2 ) = ∑_{y1 } ψ12 (y12 ) to its
   only neighbor.
2. Clique "345" sends message m345→234 (y34 ) = ∑_{y5 } ψ345 (y345 ) to "234".
3. Clique "234" sends message m234→345 (y34 ) = ∑_{y2 } ψ234 (y234 )m12→234 (y2 ) to "345".
4. Clique "234" sends message m234→12 (y2 ) = ∑_{y3 ,y4 } ψ234 (y234 )m345→234 (y34 ) to "12".

Pr(y1 ) ∝ ∑_{y2 } ψ12 (y12 )m234→12 (y2 )
Why approximate inference
Exact inference is NP-hard. Complexity: O(m^w )
- w = tree width = size of the largest clique in the (triangulated)
  graph, minus 1
- m = number of values of each discrete variable in the clique
Many real-life graphs produce large cliques on triangulation:
- an n × n grid has a tree width of n
- a Kalman filter on K parallel state variables influencing a
  common observation variable has a tree width of K + 1

Generalized belief propagation
Approximate the junction tree with a cluster graph where
1. Nodes = arbitrary clusters, not necessarily cliques of the triangulated
   graph; only ensure that all potentials are subsumed.
2. Separator nodes on edges = a subset of the intersecting variables,
   chosen so as to satisfy the running intersection property.
Special case: factor graphs.
[Figure: the five-variable example graph, its junction tree
{x1 x2 } — {x2 x3 x4 } — {x3 x4 x5 }, and an example cluster graph with
clusters x1 x2 , x2 x3 , x2 x4 , x3 x5 , x4 x5 .]
Belief propagation in cluster graphs
The graph can have loops, so the tree-based two-phase method is not
applicable.
Many variants on the scheduling order of propagating beliefs:
- simple loopy belief propagation [?]
- tree-reweighted message passing [?, ?]
- residual belief propagation [?]
Many have no guarantees of convergence; specific tree-based
orders do [?].
Works well in practice; the default method of choice.

MCMC (Gibbs) sampling
Useful when all else fails; guaranteed to converge to the correct
answer over an infinite number of samples.
Basic premise: it is easy to compute the conditional probability
Pr(xi | fixed values of the remaining variables).

Algorithm
Start with some initial assignment, say
x¹ = [x1 , . . . , xn ] = [0, . . . , 0].
For several iterations
- For each variable xi :
  get a new sample x^{t+1} by replacing the value of xi with a new
  value sampled according to Pr(xi | x1^t , . . . , xi−1^t , xi+1^t , . . . , xn^t ).
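A minimal sketch in Python of this sampler on the 3 × 3 grid MRF used earlier; the unspecified node potentials are hypothetical, and the running average is a crude Monte-Carlo estimate of the marginals Pr(yi = 1).

import numpy as np

rng = np.random.default_rng(0)
node_pot = np.ones((9, 2)); node_pot[0] = [4, 1]; node_pot[1] = [2, 3]
edge_pot = np.array([[5.0, 1.0], [1.0, 5.0]])
edges = [(i, i + 1) for r in range(3) for i in range(3 * r, 3 * r + 2)] + \
        [(i, i + 3) for i in range(6)]
nbrs = {i: [] for i in range(9)}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

x = np.zeros(9, dtype=int)                    # initial assignment [0, ..., 0]
counts = np.zeros(9)
n_sweeps = 5000
for sweep in range(n_sweeps):
    for i in range(9):                        # resample each variable in turn
        p = node_pot[i].copy()                # Pr(x_i = v | rest) ∝ node x edge potentials
        for j in nbrs[i]:
            p = p * edge_pot[:, x[j]]
        x[i] = rng.choice(2, p=p / p.sum())
    counts += x
print(counts / n_sweeps)                      # ≈ Pr(y_i = 1) for each pixel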

Others
Combinatorial algorithms for MAP [?].
Greedy algorithms: relaxation labeling.
Variational methods like mean-field and structured mean-field.
LP and QP based approaches.

Parameters in Potentials
1. Manual: provided by a domain expert
   - Used in infrequently constructed graphs, e.g. QMR systems
   - Also where potentials are an easy function of the attributes of the
     connected nodes, e.g. vision networks
2. Learned from examples
   - More popular, since it is difficult for humans to assign numeric values
   - Many variants of parameterizing potentials:
     1. Table potentials: each entry is a parameter, e.g. HMMs
     2. Potentials as a combination of shared parameters and data
        attributes, e.g. CRFs

Graph Structure
1. Manual: designed by a domain expert
   - Used in applications where the dependency structure is well understood
   - Examples: QMR systems, Kalman filters, vision (grids), HMMs for
     speech recognition and IE
2. Learned from examples
   - NP-hard to find the optimal structure
   - Widely researched, mostly posed as a branch-and-bound search problem
   - Useful in dynamic situations

Learning potentials
Given a sample D = {x1 , . . . , xN } of data generated from a distribution
P(x) represented by a graphical model with known structure G , learn the
potentials ψC (xC ).
Two axes of settings:
1. Are all variables observed?
   1. Fully observed: each training sample xi has all n variables observed.
   2. Partially observed: only a subset of the variables is observed.
2. Are the potentials coupled through a log-partition function?
   1. No: closed-form solutions.
   2. Yes: potentials attached to arbitrary overlapping subsets of
      variables in a UGM (e.g. edge potentials in a grid graph); an
      iterative solution, as in the case of learning with shared
      parameters. Discussed later.

General framework for Parameter learning in
graphical models
Conditional distribution Pr(y|x, θ): the potentials are functions of x
and of the parameters θ to be learned.
y = y1 , . . . , yn forms a graphical model, directed or undirected.
Undirected:

Pr(y1 , . . . , yn |x, θ) = ∏_c ψc (yc , x, θ) / Zθ (x) = (1/Zθ (x)) exp(∑_c Fθ (yc , c, x))

where Zθ (x) = ∑_{y′} exp(∑_c Fθ (y′c , c, x)) and the
clique potential is ψc (yc , x) = exp(Fθ (yc , c, x)).

Forms of Fθ (yc , c, x)
Log-linear model over user-defined features, e.g. CRFs, Maxent
models: let K be the number of features and denote a feature as
fk (yc , c, x). Then
Fθ (yc , c, x) = ∑_{k=1}^{K} θk fk (yc , c, x)

Arbitrary function, e.g. a neural network that takes as input
yc , c, x and transforms them, possibly non-linearly, into a real
value; θ are the parameters of the network.

Example: Named Entity Recognition

Named Entity Recognition: Features

Training
Given
- N input-output pairs D = {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}
- the form of Fθ
Learn the parameters θ by maximum likelihood:
max_θ LL(θ, D) = max_θ ∑_{i=1}^{N} log Pr(yi |xi , θ)

Training undirected graphical model

LL(θ, D) = ∑_{i=1}^{N} log Pr(yi |xi , θ)
         = ∑_{i=1}^{N} log [ (1/Zθ (xi )) exp(∑_c Fθ (yci , c, xi )) ]
         = ∑_i [ ∑_c Fθ (yci , c, xi ) − log Zθ (xi ) ]

The first part is easy to compute, but the second term requires
invoking an inference algorithm to compute Zθ (xi ) for each i.
Computing the gradient of the above objective with respect to θ also
requires inference.
Training via gradient descent
Assume log-linear models, as in CRFs, where
Fθ (yci , c, xi ) = θ · f(xi , yci , c). Also, for brevity, write
f(xi , yi ) = ∑_c f(xi , yci , c).

LL(θ) = ∑_i log Pr(yi |xi , θ) = ∑_i (θ · f(xi , yi ) − log Zθ (xi ))

Add a regularizer to prevent over-fitting:

max_θ ∑_i (θ · f(xi , yi ) − log Zθ (xi )) − ||θ||² /C

Concave in θ =⇒ gradient methods will work.

Gradient of the training objective

∇L(θ) = ∑_i [ f(xi , yi ) − ∑_{y′} f(xi , y′ ) exp(θ · f(xi , y′ )) / Zθ (xi ) ] − 2θ/C
      = ∑_i [ f(xi , yi ) − ∑_{y′} f(xi , y′ ) Pr(y′ |θ, xi ) ] − 2θ/C
      = ∑_i [ f(xi , yi ) − E_{Pr(y′ |θ,xi )} f(xi , y′ ) ] − 2θ/C

where
E_{Pr(y′ |θ,xi )} fk (xi , y′ ) = ∑_{y′} fk (xi , y′ ) Pr(y′ |θ, xi )
                             = ∑_{y′} ∑_c fk (xi , y′c , c) Pr(y′ |θ, xi )
                             = ∑_c ∑_{y′c } fk (xi , y′c , c) Pr(y′c |θ, xi )
Computing E_{Pr(y|θ^t ,xi )} fk (xi , y)
Three steps:
1. Pr(y|θ^t , xi ) is represented as an undirected model whose nodes are
   the components of y, that is y1 , . . . , yn .
   The potential ψc (yc , x, θ) on clique c is exp(θ^t · f(xi , yc , c)).
2. Run a sum-product inference algorithm on the above UGM and
   compute, for each c and yc , the marginal probability µ(yc , c, xi ).
3. Using these µ's, compute
   E_{Pr(y|θ^t ,xi )} fk (xi , y) = ∑_c ∑_{yc } µ(yc , c, xi )fk (xi , c, yc )

Example
Consider a parameter-learning task for an undirected graphical model
on 3 variables y = [y1 y2 y3 ], where each yi = +1 or 0 and they form
a chain. Let the following two features be defined for it:
f1 (i, x, yi ) = xi yi (where xi = intensity of pixel i)
f2 ((i, j), x, (yi , yj )) = [[yi ≠ yj ]]
where [[z]] = 1 if z = true and 0 otherwise.
Initial parameters θ = [θ1 , θ2 ] = [3, −2].
Example: x1 = [0.1, 0.7, 0.3], y1 = [1, 1, 0].
Using these we can calculate:
1. Node potentials for yi as exp(θ1 xi yi ). For example, for y1 they are
   [ψ1 (0), ψ1 (1)] = [1, e^{3×0.1} ].
2. Edge potentials ψ12 (y1 , y2 ) = ψ23 (y2 , y3 ): equal to 1 if yi = yj and
   to e^{−2} if yi ≠ yj .
Example (continued)
1. Use the above potentials to run sum-product inference on a junction
   tree to calculate the marginals µ(yi , i) and µ(yi , yj , (i, j)).
2. Using these, calculate the expected values of the features:
   E [f1 (x1 , y)] = ∑_i xi µ(1, i) = 0.1µ(1, 1) + 0.7µ(1, 2) + 0.3µ(1, 3)
   E [f2 (x1 , y)] = µ(1, 0, (1, 2)) + µ(0, 1, (1, 2)) + µ(1, 0, (2, 3)) + µ(0, 1, (2, 3))
3. The value of f(x1 , y1 ) for each feature is (note y1 = [1, 1, 0]):
   f1 (x1 , y1 ) = 0.1 × 1 + 0.7 × 1 + 0.3 × 0 = 0.8
   f2 (x1 , y1 ) = [[y11 ≠ y21 ]] + [[y21 ≠ y31 ]] = 1
4. The gradient of each parameter is then
   ∇L(θ1 ) = 0.8 − E [f1 (x1 , y)] − 2 × 3/C
   ∇L(θ2 ) = 1 − E [f2 (x1 , y)] + 2 × 2/C
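Because this chain has only 2³ = 8 labellings, the expected features and the gradient above can be cross-checked by brute-force enumeration; a minimal sketch in Python, with a hypothetical value for the regularization constant C:

from itertools import product
import math

theta = [3.0, -2.0]
x = [0.1, 0.7, 0.3]
y_obs = (1, 1, 0)
C = 10.0                                            # hypothetical regularization constant

def feats(y):
    f1 = sum(xi * yi for xi, yi in zip(x, y))       # f1 = sum_i x_i y_i
    f2 = sum(int(y[i] != y[i + 1]) for i in range(2))
    return [f1, f2]

scores = {y: math.exp(theta[0] * feats(y)[0] + theta[1] * feats(y)[1])
          for y in product([0, 1], repeat=3)}
Z = sum(scores.values())
E_f = [sum(scores[y] * feats(y)[k] for y in scores) / Z for k in range(2)]

f_obs = feats(y_obs)                                # [0.8, 1]
grad = [f_obs[k] - E_f[k] - 2 * theta[k] / C for k in range(2)]
print("E[f] =", E_f, "gradient =", grad)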


Another Example
Consider a parameter-learning task for an undirected graphical model
on six variables y = [y1 y2 y3 y4 y5 y6 ], where each yi = +1 or −1.
Let the following eight features be defined for it:
f1 (yi , yi+1 ) = [[yi + yi+1 > 1]], 1 ≤ i ≤ 5     f2 (y1 , y3 ) = −2y1 y3
f3 (y2 , y3 ) = y2 y3                              f4 (y3 , y4 ) = y3 y4
f5 (y2 , y4 ) = [[y2 y4 < 0]]                      f6 (y4 , y5 ) = 2y4 y5
f7 (y3 , y5 ) = −y3 y5                             f8 (y5 , y6 ) = [[y5 + y6 > 0]]
where [[z]] = 1 if z = true and 0 otherwise. That is,
f(y) = [f1 f2 f3 f4 f5 f6 f7 f8 ]^T . Assume the corresponding weight
vector to be θ = [1 1 1 2 2 1 −1 1]^T .

Example
Draw the underlying graphical model corresponding to the 6 variables.

y1 y2 y3 y4 y5 y6

Draw an arc between any two y which appear together in any of the
8 features.

Example
Draw the junction tree corresponding to the graph above and assign
potentials to each node of your junction tree so that you can run
message passing on it to find Z = ∑_y exp(θ^T f(x, y)); that is, define
ψc (yc ) in terms of the above quantities for each clique node c of the JT.
For clique c, ψc (yc ) = exp(θc · fc (x, yc )). The logs of the potentials are:

Clique {y1 , y2 , y3 }: 1·f1 (y1 , y2 ) + 1·f1 (y2 , y3 ) + 1·f2 (y1 , y3 ) + 1·f3 (y2 , y3 )
Clique {y2 , y3 , y4 }: 2·f5 (y2 , y4 ) + 1·f1 (y3 , y4 ) + 2·f4 (y3 , y4 )
Clique {y3 , y4 , y5 }: −1·f7 (y3 , y5 ) + 1·f1 (y4 , y5 ) + 1·f6 (y4 , y5 )
Clique {y5 , y6 }: 1·f1 (y5 , y6 ) + 1·f8 (y5 , y6 )
Example
Suppose you use the junction tree above to compute the marginal
probability of each pair of adjacent variables in the graph of part (a).
Let µij (−1, 1), µij (1, 1), µij (−1, −1), µij (1, −1) denote the marginal
probabilities of the variable pair yi , yj taking values (−1, 1), (1, 1),
(−1, −1) and (1, −1) respectively. Express the expected values of the
following features in terms of the µ values.
1. E [f1 ] = ∑_i [ f1 (−1, −1)µi,i+1 (−1, −1) + f1 (−1, 1)µi,i+1 (−1, 1)
           + f1 (1, −1)µi,i+1 (1, −1) + f1 (1, 1)µi,i+1 (1, 1) ]
2. E [f2 ] = 2 [ −µ1,3 (−1, −1) + µ1,3 (−1, 1) + µ1,3 (1, −1) − µ1,3 (1, 1) ]
3. E [f8 ] = µ5,6 (1, 1)

Training algorithm
1: Initialize θ0 = 0
2: for t = 1 . . . T do
3:   for i = 1 . . . N do
4:     gk,i = fk (xi , yi ) − E_{Pr(y′ |θt ,xi )} fk (xi , y′ )   for k = 1 . . . K
5:   end for
6:   gk = ∑_i gk,i   for k = 1 . . . K
7:   θk^t = θk^{t−1} + γt (gk − 2θk^{t−1} /C )
8:   Exit if ||g|| ≈ 0
9: end for
The running time of the algorithm is O(INn(m² + K )), where I is the
total number of iterations.

Local conditional probability for BN

Pr(y1 , . . . , yn |x, θ) = ∏_j Pr(yj |yPa(j) , x, θ)
                        = ∏_j exp(Fθ (yPa(j) , yj , j, x)) / ∑_{y′ =1}^{m} exp(Fθ (yPa(j) , y′ , j, x))

Training for BN

LL(θ, D) = ∑_{i=1}^{N} log Pr(yi |xi , θ)
         = ∑_{i=1}^{N} log ∏_j Pr(yji |yPa(j)i , xi , θ)
         = ∑_i ∑_j log Pr(yji |yPa(j)i , xi , θ)
         = ∑_i ∑_j [ Fθ (yPa(j)i , yji , j, xi ) − log ∑_{y′ =1}^{m} exp(Fθ (yPa(j)i , y′ , j, xi )) ]

This is like a normal classification task: no extra challenge arises during
training because of the graphical model, and the normalizer is easy to
compute.
Table Potentials in the feature framework.
Assume xi does not exist (as in HMMs).
Fθ (yPa(j)i , yji , j) = log P(yji |yPa(j)i ); the normalizer vanishes.
Pr(yj |yPa(j) ) = a table of real values denoting the probability of each
value of yj for each combination of values of the parents (θj ).
If each variable takes m possible values and has k parents, then each
Pr(yj |yPa(j) ) requires m^k · m parameters in θj :
θ^j_{v ,u1 ,...,uk } = Pr(yj = v |yPa(j) = u1 , . . . , uk )
Maximum Likelihood estimation of parameters

max_θ ∑_i ∑_j log P(yji |yPa(j)i )
= max_θ ∑_i ∑_j log θ^j_{yji ,yPa(j)i }    s.t. ∑_v θ^j_{v ,u1 ,...,uk } = 1 ∀ j, u1 , . . . , uk
= max_θ ∑_i ∑_j log θ^j_{yji ,yPa(j)i } − ∑_j ∑_{u1 ,...,uk } λ^j_{u1 ,...,uk } ( ∑_v θ^j_{v ,u1 ,...,uk } − 1 )

Setting the gradient of the Lagrangian to zero gives

θ^j_{v ,u1 ,...,uk } = ∑_{i=1}^{N} [[yji = v , yPa(j)i = u1 , . . . , uk ]] / ∑_{i=1}^{N} [[yPa(j)i = u1 , . . . , uk ]]    (1)
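A minimal sketch in Python of Eq. (1): each CPT entry is a conditional frequency count. The toy data set and parent structure below are hypothetical.

from collections import Counter

# Each sample assigns values to 'A' (no parent) and 'E' (single parent 'A').
data = [{"A": 0, "E": 1}, {"A": 0, "E": 0}, {"A": 1, "E": 1},
        {"A": 0, "E": 1}, {"A": 1, "E": 1}]
parents = {"A": [], "E": ["A"]}

def mle_cpt(var, values=(0, 1)):
    joint, parent_counts = Counter(), Counter()
    for s in data:
        u = tuple(s[p] for p in parents[var])      # parent configuration u1,...,uk
        parent_counts[u] += 1
        joint[(s[var], u)] += 1
    return {(v, u): joint[(v, u)] / parent_counts[u]
            for u in parent_counts for v in values}

print(mle_cpt("E"))    # e.g. Pr(E=1 | A=0) = 2/3, Pr(E=1 | A=1) = 1.0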

Partially observed, decoupled potentials
[Figure: an HMM-like chain y1 , . . . , y7 with observations x1 , . . . , x7 .]

EM Algorithm
Input: graph G , data D with an observed subset of variables x and
hidden variables z.
Initially (t = 0): assign random values to the parameters Pr(xj |pa(xj )).
for t = 1, . . . , T do
  E-step
  for i = 1, . . . , N do
    Use inference in G to estimate the conditionals Pri (zc |xi )^t for all
    variable subsets (j, pa(j)) involving any hidden variable.
  end for
  M-step
  Pr(xj |pa(xj ) = zc )^t = ∑_{i=1}^{N} Pri (zc |xi )^t [[xji = xj ]] / ∑_{i=1}^{N} Pri (zc |xi )^t
end for
More on graphical models
Koller and Friedman. Probabilistic Graphical Models: Principles and
Techniques. MIT Press, 2009.
Wainwright's article in Foundations and Trends in Machine Learning, 2009.
Kevin Murphy's brief online introduction
(http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html)
M. I. Jordan. Graphical models. Statistical Science (Special Issue on
Bayesian Statistics), 19, 140-155, 2004.
(http://www.cs.berkeley.edu/~jordan/papers/statsci.ps.gz)
Other text books:
- R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter.
  "Probabilistic Networks and Expert Systems". Springer-Verlag, 1999.
- J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of
  Plausible Inference". Morgan Kaufmann, 1988.
- S. L. Lauritzen. "Graphical Models". Oxford Science Publications.
- F. V. Jensen. "Bayesian Networks and Decision Graphs". Springer, 2001.
