Lec 3
Mark Paskin
[email protected]
Review: conditional independence
• X and Y are independent (written X ⊥⊥ Y) iff p_{X,Y} = p_X · p_Y.
  If X ⊥⊥ Y then Y gives us no information about X.
• X and Y are conditionally independent given Z (written X ⊥⊥ Y | Z) iff
  p_{X,Y|Z} = p_{X|Z} · p_{Y|Z}.
  If X ⊥⊥ Y | Z then Y gives us no new information about X once we know Z.
• We can obtain compact, factorized representations of densities by using the
  chain rule in combination with conditional independence assumptions.
• The Variable Elimination algorithm uses the distributivity of × over + to
  perform inference efficiently in factorized densities.
Review: graphical models
[Figure: two graphical models over the variables A, B, C, E, F.]
A notation for sets of random variables
It is helpful when working with large, complex models to have a good notation
for sets of random variables.
• Let X = (X_i : i ∈ V) be a vector random variable with density p.
• For each A ⊆ V, let X_A ≜ (X_i : i ∈ A).
• For A, B ⊆ V, let p_A ≜ p_{X_A} and p_{A|B} ≜ p_{X_A|X_B}.
A notation for assignments
We also need a notation for dealing flexibly with functions of many arguments.
• An assignment to A is a set of index-value pairs u = {(i, x_i) : i ∈ A}, one
  per index i ∈ A, where x_i is in the range of X_i.
• Let 𝒳_A ≜ the set of assignments to X_A (with 𝒳 ≜ 𝒳_V).
• Building new assignments from given assignments:
  – Given assignments u and v to disjoint subsets A and B, respectively,
    their union u ∪ v is an assignment to A ∪ B.
  – If u is an assignment to A, then the restriction of u to B ⊆ V is
    u_B ≜ {(i, x_i) ∈ u : i ∈ B}, an assignment to A ∩ B.
• If u = {(i, x_i) : i ∈ A} is an assignment and f is a function, then
  f(u) ≜ f(x_i : i ∈ A).
Examples of the assignment notation
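As a concrete illustration, here is a minimal Python sketch (the names are
illustrative, not from the lecture) that models an assignment as a dict from
index to value, so that union and restriction become dict operations:

```python
# A minimal sketch of the assignment notation: an assignment to an index set A
# is a dict mapping each index in A to a value.

def union(u, v):
    """Union of assignments u, v to disjoint index sets A and B."""
    assert not set(u) & set(v), "A and B must be disjoint"
    return {**u, **v}

def restrict(u, B):
    """Restriction u_B: keep only the pairs whose index lies in B."""
    return {i: x for i, x in u.items() if i in B}

u = {"a": 0, "b": 1}            # an assignment to A = {a, b}
v = {"c": 2}                    # an assignment to B = {c}
print(union(u, v))              # {'a': 0, 'b': 1, 'c': 2}: assignment to A ∪ B
print(restrict(u, {"b", "c"}))  # {'b': 1}: assignment to A ∩ {b, c}
```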
Review: the inference problem
• Input:
  – a vector random variable X = (X_i : i ∈ V);
  – a joint density for X of the form

        p(u) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(u_C)

  – a set of query variables X_Q, Q ⊆ V.
• Output: the marginal density p_Q.
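As a baseline for what follows, here is a hedged Python sketch of the problem
statement (the potentials and variable names are hypothetical): a density
factorized over two cliques, and a query marginal computed by brute-force
enumeration, which Variable Elimination and the junction tree algorithms
improve on.

```python
import itertools

# p(a, b) ∝ ψ1(a, b) · ψ2(b) over binary variables; tables keyed by value tuples.
psi = {
    ("a", "b"): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0},
    ("b",):     {(0,): 1.0, (1,): 0.5},
}

def marginal(query, variables=("a", "b")):
    """p_Q(v) ∝ Σ_{u ∈ X_Q̄} Π_C ψ_C(v_C ∪ u_C), then normalize."""
    out = {}
    for full in itertools.product([0, 1], repeat=len(variables)):
        assign = dict(zip(variables, full))
        weight = 1.0
        for clique, table in psi.items():
            weight *= table[tuple(assign[i] for i in clique)]
        key = tuple(assign[i] for i in query)
        out[key] = out.get(key, 0.0) + weight
    Z = sum(out.values())
    return {k: w / Z for k, w in out.items()}

print(marginal(("a",)))  # the marginal p_a; normalizing makes Z drop out
```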
Dealing with evidence
The reformulated inference problem
Review: Variable Elimination
• For each i ∈ Q̄ (where Q̄ = V \ Q indexes the non-query variables), push in
  the sum over X_i and compute it:

      p_Q(v) = (1/Z) Σ_{u ∈ 𝒳_Q̄} ∏_{C ∈ 𝒞} ψ_C(v_C ∪ u_C)

             = (1/Z) Σ_{u ∈ 𝒳_{Q̄\{i}}} Σ_{w ∈ 𝒳_{{i}}} ∏_{C ∈ 𝒞} ψ_C(v_C ∪ u_C ∪ w_C)

             = (1/Z) Σ_{u ∈ 𝒳_{Q̄\{i}}} [∏_{C ∈ 𝒞 : i ∉ C} ψ_C(v_C ∪ u_C)] [Σ_{w ∈ 𝒳_{{i}}} ∏_{C ∈ 𝒞 : i ∈ C} ψ_C(v_C ∪ u_C ∪ w)]

             = (1/Z) Σ_{u ∈ 𝒳_{Q̄\{i}}} ∏_{C ∈ 𝒞 : i ∉ C} ψ_C(v_C ∪ u_C) · ψ_{E_i}(v_{E_i} ∪ u_{E_i})

  This creates a new elimination clique E_i = (⋃_{C ∈ 𝒞 : i ∈ C} C) \ {i}.
• At the end we have p_Q = (1/Z) ψ_Q, and we normalize ψ_Q to obtain p_Q
  (and Z).
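To make the algebra concrete, here is an illustrative Python sketch (not the
lecture's code; the potentials are hypothetical, over binary variables) of a
single elimination step: multiply the potentials that mention i, sum i out,
and replace them by one new potential over the elimination clique E_i.

```python
import itertools

def eliminate(potentials, i, domain=(0, 1)):
    """potentials: dict mapping a tuple of variable names to a value table."""
    touching = {c: t for c, t in potentials.items() if i in c}
    rest = {c: t for c, t in potentials.items() if i not in c}
    # E_i = (union of the cliques containing i) minus {i}
    E_i = tuple(sorted(set().union(*touching) - {i}))
    new = {}
    for vals in itertools.product(domain, repeat=len(E_i)):
        assign = dict(zip(E_i, vals))
        total = 0.0
        for w in domain:                      # sum over the values of X_i
            assign[i] = w
            prod = 1.0
            for clique, table in touching.items():
                prod *= table[tuple(assign[v] for v in clique)]
            total += prod
        new[vals] = total
    rest[E_i] = new                           # the new potential ψ_{E_i}
    return rest

psi = {
    ("a", "b"): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0},
    ("b", "c"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0},
}
print(eliminate(psi, "b"))   # one new potential over E_b = ("a", "c")
```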
From Variable Elimination
to the junction tree algorithms
Junction trees
[Figure: an undirected graph G over {a, b, c, d, e, f}, and a junction tree T
for G with clusters {a, b, c}, {b, c, e}, {b, e, f}, {b, d} and separators
{b, c}, {b, e}, {b}.]
A cluster graph T is a junction tree for G if it has these three properties:
1. singly connected: there is exactly one path between each pair of clusters.
2. covering: for each clique A of G there is some cluster C such that A ⊆ C.
3. running intersection: for each pair of clusters B and C that contain i,
each cluster on the unique path between B and C also contains i.
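These properties can be checked mechanically. Below is a small Python sketch
(the edge set is one consistent reading of the figure above) that verifies the
running intersection property by checking that, for every variable, the
clusters containing it span a connected subtree:

```python
# Running intersection: for each variable, the clusters containing it must
# form a connected subtree of the cluster tree.

clusters = [frozenset("abc"), frozenset("bce"), frozenset("bef"), frozenset("bd")]
edges = [(0, 1), (1, 2), (1, 3)]   # one consistent reading of the figure

def running_intersection(clusters, edges):
    for var in set().union(*clusters):
        nodes = {k for k, c in enumerate(clusters) if var in c}
        sub = [(a, b) for a, b in edges if a in nodes and b in nodes]
        # a subgraph of a tree is connected iff #edges = #nodes - 1
        if len(sub) != len(nodes) - 1:
            return False
    return True

print(running_intersection(clusters, edges))  # True for this junction tree
```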
Building junction trees
An example of building junction trees
1. Compute the elimination cliques (the order here is f, d, e, c, b, a).
[Figure: the graph after each elimination step, with the elimination clique
formed at each step; the maximal elimination cliques are {a, b, c}, {b, c, e},
{b, e, f}, and {b, d}.]
2. Form the complete cluster graph over the maximal elimination cliques and
find a maximum-weight spanning tree.
[Figure: the complete cluster graph over {a, b, c}, {b, c, e}, {b, e, f}, and
{b, d}, with each edge weighted by the size of the separator (weights 1 and 2),
and the resulting maximum-weight spanning tree with separators {b, c}, {b, e},
and {b}.]
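A Python sketch of step 2 (illustrative code, using separator size as the edge
weight and Kruskal's algorithm for the maximum-weight spanning tree):

```python
from itertools import combinations

cliques = [frozenset("abc"), frozenset("bce"), frozenset("bef"), frozenset("bd")]

def max_weight_spanning_tree(cliques):
    # complete cluster graph: one edge per pair, weighted by separator size
    edges = sorted(
        ((len(a & b), i, j)
         for (i, a), (j, b) in combinations(enumerate(cliques), 2)),
        reverse=True,
    )
    parent = list(range(len(cliques)))   # union-find over cluster indices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:                # greedily add heaviest edges
        ri, rj = find(i), find(j)
        if ri != rj:                     # keep the graph acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

print(max_weight_spanning_tree(cliques))
```

Ties among equal-weight edges can produce different trees, but each
maximum-weight spanning tree is an equally valid junction tree.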
Decomposable densities
• A factorized density

      p(u) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(u_C)

  is decomposable with respect to a junction tree T if its cliques 𝒞 are
  exactly the clusters of T, i.e., there is one potential per cluster.
• Any factorized density can be made decomposable with respect to a junction
  tree that covers its cliques, by multiplying each potential into the
  potential of a cluster that contains it.
The junction tree inference algorithms
The junction tree algorithms take as input a decomposable density and its
junction tree. They have the same distributed structure:
• Each cluster starts out knowing only its local potential and its neighbors.
• Each cluster sends one message (potential function) to each neighbor.
• By combining its local potential with the messages it receives, each cluster
is able to compute the marginal density of its variables.
The message passing protocol
A cluster may send a message to a neighbor only after it has received messages
from all of its other neighbors. Messages therefore flow inward from the
leaves and then back out again, two per edge of the junction tree.
The Shafer–Shenoy Algorithm
• The message sent from B to C (where u is an assignment to B ∩ C) is defined as

      μ_{BC}(u) ≜ Σ_{v ∈ 𝒳_{B\C}} ψ_B(u ∪ v) ∏_{(A,B) ∈ E, A ≠ C} μ_{AB}(u_A ∪ v_A)

• When C has received messages from all of its neighbors, it computes its belief

      β_C(u) ≜ ψ_C(u) ∏_{(B,C) ∈ E} μ_{BC}(u_B)

  This is the product of the cluster's local potential and the messages
  received from all of its neighbors. We will show that β_C ∝ p_C.
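A minimal sketch of one Shafer–Shenoy message, assuming binary variables and
dict-based potential tables (the clusters and potentials here are hypothetical,
not from the slides):

```python
import itertools

# Cluster B multiplies its local potential by the messages from its other
# neighbors, then sums out the variables of B not shared with the recipient C.

def message(B, psi_B, incoming, C, domain=(0, 1)):
    """incoming: (scope, table) messages from B's neighbors other than C."""
    sep = tuple(sorted(set(B) & set(C)))      # the separator B ∩ C
    hidden = tuple(sorted(set(B) - set(C)))   # summed out: B \ C
    mu = {}
    for u in itertools.product(domain, repeat=len(sep)):
        total = 0.0
        for v in itertools.product(domain, repeat=len(hidden)):
            assign = {**dict(zip(sep, u)), **dict(zip(hidden, v))}
            prod = psi_B[tuple(assign[x] for x in B)]
            for scope, table in incoming:     # factors μ_{AB}, scope ⊆ B
                prod *= table[tuple(assign[x] for x in scope)]
            total += prod
        mu[u] = total
    return sep, mu

# The leaf cluster {b, d} has no neighbors besides the recipient, so its
# message to {b, c, e} is just Σ_d ψ_{bd}.
psi_bd = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
print(message(("b", "d"), psi_bd, incoming=[], C=("b", "c", "e")))
```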
Correctness: Shafer–Shenoy is Variable
Elimination in all directions at once
• The cluster belief β_C is computed by alternately multiplying cluster
  potentials together and summing out variables.
• This computation is of the same basic form as Variable Elimination.
• To prove that β_C ∝ p_C, we must prove that no sum is “pushed in too far”.
• This follows directly from the running intersection property:
[Figure: by running intersection, the clusters containing i form a connected
subtree; messages sent within this subtree involve i, while messages sent
from outside it do not.]
The HUGIN Algorithm
• Give each cluster C and each separator S a potential function over its
  variables. Initialize:

      φ_C(u) = ψ_C(u)        φ_S(u) = 1

• When cluster B sends a message to its neighbor C through the separator S, it
  first marginalizes its potential onto S and then rescales C's potential:

      φ*_S(u) ≜ Σ_{v ∈ 𝒳_{B\S}} φ_B(u ∪ v)

      φ*_C(u) ≜ φ_C(u) · φ*_S(u_S) / φ_S(u_S)
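A sketch of one HUGIN pass under the same assumptions as before (binary
variables, dict-based tables; the potentials are hypothetical):

```python
import itertools

# Marginalize the sender's potential onto the separator, then rescale the
# receiver by the ratio of the new to the old separator potential.

def hugin_pass(B, phi_B, S, phi_S, C, phi_C, domain=(0, 1)):
    # φ*_S(u) = Σ_{v ∈ X_{B\S}} φ_B(u ∪ v)
    hidden = tuple(x for x in B if x not in S)
    phi_S_new = {}
    for u in itertools.product(domain, repeat=len(S)):
        assign = dict(zip(S, u))
        total = 0.0
        for v in itertools.product(domain, repeat=len(hidden)):
            assign.update(zip(hidden, v))
            total += phi_B[tuple(assign[x] for x in B)]
        phi_S_new[u] = total
    # φ*_C(u) = φ_C(u) · φ*_S(u_S) / φ_S(u_S)
    phi_C_new = {}
    for w in itertools.product(domain, repeat=len(C)):
        assign = dict(zip(C, w))
        u_S = tuple(assign[x] for x in S)
        phi_C_new[w] = phi_C[w] * phi_S_new[u_S] / phi_S[u_S]
    return phi_S_new, phi_C_new

B, S, C = ("b", "d"), ("b",), ("b", "c")
phi_B = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
phi_S = {(0,): 1.0, (1,): 1.0}   # separator potentials start at 1
phi_C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}
print(hugin_pass(B, phi_B, S, phi_S, C, phi_C))
```

The ratio φ*_S/φ_S is what lets HUGIN avoid re-multiplying all incoming
messages on every send.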
Correctness: HUGIN is a time-efficient
version of Shafer–Shenoy
Summary: the junction tree algorithms
Compile time:
1. Build the junction tree T :
(a) Obtain a set of maximal elimination cliques with Node Elimination.
(b) Build a weighted, complete cluster graph over these cliques.
(c) Choose T to be a maximum-weight spanning tree.
2. Make the density decomposable with respect to T .
Run time:
1. Instantiate evidence in the potentials of the density.
2. Pass messages according to the message passing protocol.
3. Normalize the cluster beliefs/potentials to obtain conditional densities.
Complexity of junction tree algorithms
Generalized Distributive Law
• The general problem solved by the junction tree algorithms is the
sum-of-products problem: compute
      p_Q(v) ∝ Σ_{u ∈ 𝒳_Q̄} ∏_{C ∈ 𝒞} ψ_C(v_C ∪ u_C)
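Since the algorithm relies only on × distributing over +, the pair (+, ×) can
be swapped for any other commutative semiring; with (max, ×), the same
computation scores the most probable assignment instead of marginalizing. A
tiny sketch of the swap (hypothetical potentials):

```python
import itertools

psi = {
    ("a", "b"): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0},
    ("b",):     {(0,): 1.0, (1,): 0.5},
}

def combine_all(variables=("a", "b"), add=sum):
    """Apply the semiring 'addition' to all products of the potentials."""
    scores = []
    for full in itertools.product([0, 1], repeat=len(variables)):
        assign = dict(zip(variables, full))
        prod = 1.0
        for clique, table in psi.items():
            prod *= table[tuple(assign[i] for i in clique)]
        scores.append(prod)
    return add(scores)

print(combine_all(add=sum))   # sum-product: the normalizer Z
print(combine_all(add=max))   # max-product: score of the best assignment
```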
Summary
• The junction tree algorithms generalize Variable Elimination to the
efficient, simultaneous execution of a large class of queries.
• The algorithms take the form of message passing on a graph called a
junction tree, whose nodes are clusters, or sets, of variables.
• Each cluster starts with one potential of the factorized density. By
  combining this potential with the messages it receives from its neighbors,
  it can compute the marginal over its variables.
• Two junction tree algorithms are the Shafer–Shenoy algorithm and the
  HUGIN algorithm, which avoids repeated multiplications.
• The complexity of the algorithms scales with the width of the junction tree.
• The algorithms can be generalized to solve other problems by using other
commutative semirings.