
10-708: Probabilistic Graphical Models, Spring 2017

Lecture 4: Exact Inference

Lecturer: Eric P. Xing Scribes: Yohan Jo, Baoyu Jing

1 Probabilistic Inference and Learning

In practice, exact inference is not used widely, and most probabilistic inference algorithms are approximate.
Nevertheless, it is important to understand exact inference and its limitations.
There are two typical tasks with graphical models: inference and learning. Given a graphical model M that
describes a unique probability distribution PM , inference means answering queries about PM , e.g., PM (X|Y ).
Learning means obtaining a point estimate of the model M from data D. In statistics, however, both tasks
are commonly referred to as either inference or estimation. From the Bayesian perspective, for example,
learning p(M |D) is itself an inference problem, and when not all variables are observable, computing point
estimates of M requires inference to impute the missing data.

1.1 Likelihood

One of the simplest queries one may ask is likelihood estimation: computing the probability of the given
evidence, where evidence is an assignment of values to a subset of variables. Likelihood estimation involves
marginalizing out the other variables. Formally, letting e denote the evidence and x_1, \ldots, x_k the remaining
variables, the likelihood of e is

P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e).

From a computational standpoint, calculating this likelihood naively is expensive because it requires
enumerating exponentially many configurations of the unobserved variables.
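
To make the cost concrete, here is a minimal brute-force sketch (not from the lecture; the three-variable joint table and the variable names are made up for illustration):

from itertools import product

# Hypothetical joint distribution P(x1, x2, e) over three binary variables,
# stored as a table mapping assignments to probabilities (entries sum to 1).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15,
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.05,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def likelihood(e):
    """Compute P(e) = sum_{x1} sum_{x2} P(x1, x2, e) by brute force."""
    return sum(joint[(x1, x2, e)] for x1, x2 in product([0, 1], repeat=2))

print(likelihood(1))   # 0.15 + 0.20 + 0.05 + 0.30 = 0.70

The enumeration visits 2^k terms for k hidden binary variables, which is exactly the exponential blow-up mentioned above.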

1.2 Conditional Probability

Another type of query is the conditional probability of variables given evidence. The probability of variables
X given evidence e is

P(X \mid e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}.

This is called the a posteriori belief in X given evidence e.
As in likelihood estimation, calculating a conditional probability involves marginalization of variables that
are not of our interest (i.e., the summation part in the equation). We will be covering efficient algorithms
for marginalization later in this class.
Calculating a conditional probability goes by different names depending on the query nodes. When the
query node is a terminal variable in a directed graphical model, the inference process is called prediction.
But if the query node is an ancestor of the evidence, the inference process is called diagnosis, e.g., computing
the probability of a disease or fault given observed symptoms. For instance, in a deep belief network (a
restricted Boltzmann machine with multiple hidden layers), the hidden layers are estimated given the data.

y1   y2   P(y1, y2)
0    0    0.35
0    1    0.05
1    0    0.3
1    1    0.3

Table 1: A joint distribution P(y1, y2) for the MPA example

1.3 Most Probable Assignment

We may query the most probable assignment (MPA) for a subset of variables in the domain. Given evidence
e, query variables Y , and the other variables Z, the MPA of Y is

MPA(Y \mid e) = \arg\max_y P(y \mid e) = \arg\max_y \sum_z P(y, z \mid e).

This is the maximum a posteriori configuration of Y .


The MPA of a variable depends on its context. For example, in Table 1, the MPA of Y1 is 1 because
P (Y1 = 0) = 0.35 + 0.05 = 0.4 and P (Y1 = 1) = 0.3 + 0.3 = 0.6. However, when we include Y2 as context,
the MPA of (Y1 , Y2 ) is (0, 0) because P (Y1 = 0, Y2 = 0) = 0.35 is the highest value in the table.
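
The contrast between the two queries can be checked directly against Table 1; the following short sketch (illustrative only) reproduces the numbers above:

# Joint distribution from Table 1, indexed by (y1, y2).
P = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.3, (1, 1): 0.3}

# MPA of Y1 alone: marginalize out Y2, then take the argmax.
marg_y1 = {y1: sum(P[(y1, y2)] for y2 in (0, 1)) for y1 in (0, 1)}
print(max(marg_y1, key=marg_y1.get))   # 1, since P(Y1=1) = 0.6 > P(Y1=0) = 0.4

# MPA of (Y1, Y2) jointly: argmax over the full table.
print(max(P, key=P.get))               # (0, 0), since 0.35 is the largest entry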
This context dependence is related to multi-task learning in machine learning. Multi-task learning refers
to solving multiple learning tasks jointly instead of learning task-specific models separately and ignoring
context. For example, we may improve the accuracy of a classifier that predicts having lunch and of one
that predicts having coffee by modeling the two activities jointly, compared to training the two classifiers
separately.

2 Approaches to Inference

There are two types of inference techniques: exact inference and approximate inference. Exact inference
algorithms calculate the exact value of the probability P (X|Y ). Algorithms in this class include the elimination
algorithm, the message-passing algorithm (sum-product, belief propagation), and the junction tree algorithms.
We will not cover the junction tree algorithms in this class because they are outdated and confusing.
Exact inference on arbitrary graphical models is NP-hard, but efficiency can be improved for particular
families of graphical models. Approximate inference techniques include stochastic simulation and sampling
methods, Markov chain Monte Carlo methods, and variational algorithms.
Figure 1: Graphical model examples: (a) chain, (b) undirected chain, (c) hidden Markov model, (d) general directed graphical model

3 Elimination

3.1 Chains

Let’s first consider inference on the simple chain in Figure 1a. The probability P (E = e) can be calculated
as
P(e) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, e).

This naive summation enumerates over exponentially many configurations of the variables and thus is inef-
ficient. But if we use the chain structure, the marginal probability can be calculated as



P(e) = \sum_d \sum_c \sum_b \sum_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
     = \sum_d P(e|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a) P(b|a)
     = \sum_d P(e|d) \sum_c P(d|c) \sum_b P(c|b) P(b)
     = \sum_d P(e|d) \sum_c P(d|c) P(c)
     = \sum_d P(e|d) P(d).

The summation in each line involves enumerating over only two variables, and there are four such summations.
Therefore, in general, the time complexity of the elimination algorithm is O(nk^2), where n is the number of
variables and k is the number of possible values of each variable. That is, for simple chains, exact inference
can be done in polynomial time, as opposed to the exponential time required by the naive approach.
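
The forward sweep above can be written in a few lines. A minimal numerical sketch for the chain A -> B -> C -> D -> E (the CPT values are randomly generated, not from the lecture; only the recursion mirrors the derivation):

import numpy as np

k = 3                                     # each variable takes k values
rng = np.random.default_rng(0)

def random_cpt(k):
    """A random conditional probability table P(child | parent) of shape (k, k);
    rows are indexed by the parent value and sum to one."""
    t = rng.random((k, k))
    return t / t.sum(axis=1, keepdims=True)

p_a = np.full(k, 1.0 / k)                 # P(a), uniform for simplicity
cpts = [random_cpt(k) for _ in range(4)]  # P(b|a), P(c|b), P(d|c), P(e|d)

# Eliminate a, b, c, d in turn: new_marginal[v] = sum_u marginal[u] * P(v|u).
marginal = p_a
for cpt in cpts:
    marginal = marginal @ cpt             # one O(k^2) summation per eliminated variable

print(marginal, marginal.sum())           # P(e); sums to 1

Each matrix-vector product performs one innermost sum of the derivation, so the total cost is O(nk^2) rather than O(k^n).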

3.2 Hidden Markov Model

Let’s consider the hidden Markov model in Figure 1c. The conditional probability of variable Yi given X is
obtained by normalizing the marginal P(y_i, x_1, \ldots, x_T), which can be calculated as

P(y_i, x_1, \ldots, x_T) = \sum_{y_1} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(y_1, \ldots, y_T, x_1, \ldots, x_T)
= \sum_{y_1} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(y_1) P(x_1|y_1) P(y_2|y_1) P(x_2|y_2) \cdots P(y_T|y_{T-1}) P(x_T|y_T)
= \sum_{y_2} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_2|y_2) \cdots P(y_T|y_{T-1}) P(x_T|y_T) \sum_{y_1} P(y_1) P(x_1|y_1) P(y_2|y_1)
= \sum_{y_2} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_2|y_2) \cdots P(y_T|y_{T-1}) P(x_T|y_T) \, m(x_1, y_2)
= \sum_{y_3} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_3|y_3) \cdots P(y_T|y_{T-1}) P(x_T|y_T) \, m(x_1, x_2, y_3)
= \cdots
In fact, m(x1 , y2 ) is equal to the marginal probability P (x1 , y2 ).
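
The same elimination pattern gives the forward pass of an HMM. A small sketch with assumed binary states and made-up parameters, which eliminates y_1, y_2, ... in order and carries the intermediate term along:

import numpy as np

# Hypothetical HMM with 2 hidden states and 2 observation symbols.
pi = np.array([0.6, 0.4])                 # P(y1)
A  = np.array([[0.7, 0.3],                # A[i, j] = P(y_{t+1} = j | y_t = i)
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],                # B[i, o] = P(x_t = o | y_t = i)
               [0.3, 0.7]])
x  = [0, 1, 1, 0]                         # an observation sequence

# Eliminate y_1, ..., y_{T-1}; after step t, m[j] = P(x_1, ..., x_t, y_t = j).
m = pi * B[:, x[0]]                       # P(y1, x1)
for obs in x[1:]:
    m = (m @ A) * B[:, obs]               # sum out the previous hidden state

print(m)                                  # P(y_T = j, x_1, ..., x_T)
print(m / m.sum())                        # P(y_T | x_1, ..., x_T) after normalizing

Each loop iteration sums out one hidden variable and plays the role of the intermediate terms m(·) in the derivation above.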

3.3 Undirected Chains

Let’s consider the undirected chain in Figure 1b. With the elimination algorithm, the marginal probability
P (E = e) can be calculated similarly as
P(e) = \sum_d \sum_c \sum_b \sum_a \frac{1}{Z} \phi(b, a) \phi(c, b) \phi(d, c) \phi(e, d)
     = \frac{1}{Z} \sum_d \sum_c \sum_b \phi(c, b) \phi(d, c) \phi(e, d) \sum_a \phi(b, a)
     = \cdots

In general, we can view the task at hand as that of computing the value of an expression of the form
\sum_z \prod_{\phi \in F} \phi,

where F is a set of factors. This task is called sum-product inference.

4 Variable Elimination

4.1 The Algorithm

We can extend the elimination algorithm to arbitrary graphical models. For a directed graphical model,
P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_{i \in V} P(x_i \mid pa_i).

For variable elimination, we repeat:

1. Move all irrelevant terms outside of the innermost sum.
2. Perform the innermost sum and get a new term.
3. Insert the new term into the product.

Note that the elimination algorithm has no benefit if the innermost term includes all variables, that is, xi is
dependent on all the other variables. However, in most problems, the number of variables in the innermost
term is less than the total number of variables.
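
Here is a compact sketch of this loop over table factors (an illustrative implementation, not the lecture's code): in each pass it multiplies the factors that mention the variable being eliminated, sums that variable out, and inserts the new factor back into the pool.

from itertools import product

# A factor is a pair (vars, table): `vars` is a tuple of variable names and `table`
# maps each assignment (a tuple of values aligned with `vars`) to a number.

def multiply(f1, f2, domains):
    """Pointwise product of two table factors."""
    v1, t1 = f1
    v2, t2 = f2
    vs = tuple(dict.fromkeys(v1 + v2))        # union of variables, order preserved
    table = {}
    for assign in product(*(domains[v] for v in vs)):
        a = dict(zip(vs, assign))
        table[assign] = (t1[tuple(a[v] for v in v1)] *
                         t2[tuple(a[v] for v in v2)])
    return vs, table

def sum_out(factor, var):
    """Marginalize `var` out of a factor."""
    vs, t = factor
    keep = tuple(v for v in vs if v != var)
    table = {}
    for assign, p in t.items():
        key = tuple(val for v, val in zip(vs, assign) if v != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table

def variable_elimination(factors, order, domains):
    """Eliminate the variables in `order`; return the product of the remaining factors."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]     # factors under the innermost sum
        rest = [f for f in factors if var not in f[0]]     # irrelevant terms stay outside
        if not touching:
            continue
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f, domains)
        rest.append(sum_out(prod, var))                    # new term, inserted back into the pool
        factors = rest
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result

Applied to the chain of Section 3.1 (factors P(a), P(b|a), ..., P(e|d) encoded as tables) with the order (a, b, c, d), this leaves a single factor over e, namely the marginal P(e).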
For undirected graphical models,
P(X_1 \mid e) = \frac{\phi(X_1, e)}{\sum_{x_1} \phi(x_1, e)}.

Let’s consider the graphical model in Figure 1d. The joint probability distribution factorizes to

P (a, b, c, d, e, f, g, h) = P (a)P (b)P (c|b)P (d|a)P (e|c, d)P (f |a)P (g|e)P (h|e, f ).

To calculate the conditional probability P (A|h), we first choose an elimination order:

H, G, F, E, D, C, B.

We condition on the evidence node H by fixing its value to h. To treat marginalization and conditioning as
formally equivalent, we can define an evidence potential δ(h = h̃) whose value is one if the inner statement
is true and zero otherwise. Then, we obtain
P(H = \tilde{h} \mid e, f) = \sum_h P(h \mid e, f) \, \delta(h = \tilde{h}).

The joint probability P(a, h̃), from which the conditional P(a|h̃) follows by normalization, is calculated as


P(a, \tilde{h}) = \sum_b \sum_c \sum_d \sum_e \sum_f \sum_g P(a) P(b) P(c|b) P(d|a) P(e|c, d) P(f|a) P(g|e) P(\tilde{h}|e, f)
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \sum_e P(e|c, d) \sum_f P(f|a) \sum_g P(g|e) \sum_h P(h|e, f) \delta(h = \tilde{h})
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \sum_e P(e|c, d) \sum_f P(f|a) \sum_g P(g|e) \, m_h(e, f)
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \sum_e P(e|c, d) \sum_f P(f|a) \, m_h(e, f) \sum_g P(g|e)
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \sum_e P(e|c, d) \sum_f P(f|a) \, m_h(e, f)
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \sum_e P(e|c, d) \, m_f(a, e)
= P(a) \sum_b P(b) \sum_c P(c|b) \sum_d P(d|a) \, m_e(a, c, d)
= P(a) \sum_b P(b) \sum_c P(c|b) \, m_d(a, c)
= P(a) \sum_b P(b) \, m_c(a, b)
= P(a) \, m_b(a).

Therefore,
P(a \mid \tilde{h}) = \frac{P(a, \tilde{h})}{P(\tilde{h})} = \frac{P(a) m_b(a)}{\sum_a P(a) m_b(a)}.

4.2 Complexity of Variable Elimination

In one elimination step, we should compute:

m_x(y_1, \ldots, y_k) = \sum_x m'_x(x, y_1, \ldots, y_k), \qquad m'_x(x, y_1, \ldots, y_k) = \prod_{i=1}^{k} m_i(x, y_{c_i})

For the first equation, we perform |Val(X)| \cdot \prod_i |Val(y_{c_i})| additions; for the second, we perform
k \cdot |Val(X)| \cdot \prod_i |Val(y_{c_i})| multiplications. Therefore, the computational complexity is exponential
in the number of variables appearing in the intermediate factor.
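
As a concrete illustration (the numbers here are my own, not from the lecture): if X and each y_{c_i} are binary and k = 3, forming m'_x costs k \cdot |Val(X)| \cdot \prod_i |Val(y_{c_i})| = 3 \cdot 2 \cdot 2^3 = 48 multiplications, and summing out x costs |Val(X)| \cdot \prod_i |Val(y_{c_i})| = 2 \cdot 2^3 = 16 additions. With 10 binary neighbors instead of 3, the intermediate factor alone already has 2 \cdot 2^{10} = 2048 entries, which illustrates the exponential growth in the size of the intermediate factor.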

5 Understanding Variable Elimination

The equations in the previous section describe the elimination process algebraically. We can also describe it
from the perspective of the graph, via the graph elimination algorithm. There are two main steps in the
graph elimination algorithm: moralization and (undirected) graph elimination.

5.1 Moralization

Moralization converts a directed acyclic graph (DAG) into an “equivalent” undirected graph. Its procedure
is as follows (a code sketch appears after the list):

• Start from the input DAG.

• Connect nodes that share a common child.

• Convert directed edges into undirected edges.
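
A minimal sketch of moralization on an adjacency-list representation (the encoding and helper names are my own, not from the lecture):

from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {node: list of parents}.
    Returns an undirected graph as {node: set of neighbors}."""
    undirected = {v: set() for v in parents}
    for child, pas in parents.items():
        # Drop edge directions: connect each node to its parents.
        for p in pas:
            undirected[child].add(p)
            undirected[p].add(child)
        # "Marry" the parents: connect nodes that share a common child.
        for p, q in combinations(pas, 2):
            undirected[p].add(q)
            undirected[q].add(p)
    return undirected

# The DAG of Figure 1d: P(a)P(b)P(c|b)P(d|a)P(e|c,d)P(f|a)P(g|e)P(h|e,f).
dag = {'a': [], 'b': [], 'c': ['b'], 'd': ['a'], 'e': ['c', 'd'],
       'f': ['a'], 'g': ['e'], 'h': ['e', 'f']}
moral = moralize(dag)
print(sorted(moral['e']))   # ['c', 'd', 'f', 'g', 'h']; e and f are married via their child h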

5.2 Graph Elimination

The graph elimination algorithm is:


Input: an undirected GM (or a moralized DAG) and an elimination ordering I
for each node Xi in I
    - connect all of the remaining neighbors of Xi
    - remove Xi from the graph
end
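
A sketch of this loop that also records the elimination clique created at each step (again illustrative; it consumes the {node: set of neighbors} graph produced by the moralization sketch above):

def graph_eliminate(neighbors, order):
    """Run graph elimination following the elimination ordering `order`.
    Returns the elimination cliques (each node together with its remaining neighbors)."""
    g = {v: set(ns) for v, ns in neighbors.items()}   # work on a copy
    cliques = []
    for x in order:
        nbrs = g[x]
        cliques.append({x} | nbrs)
        for u in nbrs:
            g[u] |= nbrs - {u}        # connect the remaining neighbors (fill-in edges)
            g[u].discard(x)           # remove x from the graph
        del g[x]
    return cliques

Running graph_eliminate on the moralized Figure 1d graph with the order h, g, f, e, d, c, b produces the cliques {h, e, f}, {g, e}, {f, a, e}, {e, a, c, d}, {d, a, c}, {c, a, b}, {b, a}, matching the arguments of the intermediate terms m_h(e, f), m_g(e), m_f(a, e), ... in Section 4.1.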

5.3 Graph Elimination and Marginalization

Now we can interpret the Elimination algorithm from the perspective of the graph elimination algorithm.
As shown in Figure 2 below, each summation step in Elimination can be represented by an elimination step
in the graph elimination algorithm. In addition, the intermediate terms in Elimination correspond to the
elimination cliques resulting from the graph elimination algorithm (Figure 3a). Finally, we can also construct
a clique tree to represent the elimination process (Figure 3b).

Figure 2: A graph elimination

(a) Elimination cliques (b) Clique tree

Figure 3: Cliques

(a) Star (b) Tree (c) Ising Model

Figure 4: Tree-Width Examples



5.4 Complexity

The overall complexity is determined by the size of the largest elimination clique. A “good” elimination
ordering keeps the largest clique relatively small. The tree-width k is introduced to study this problem: it is
defined as one less than the smallest achievable cardinality of the largest elimination clique, ranging over all
possible elimination orderings. However, finding this k, as well as the “best” elimination ordering, is NP-hard.
For some graphs, such as stars (Figure 4a, k = 2 − 1 = 1) and trees (Figure 4b, k = 2 − 1 = 1), we can easily
determine the tree-width k, while for graphs like the Ising model (Figure 4c) it is very hard to compute.
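
The effect of the ordering is easy to see on a star: eliminating the leaves first keeps every elimination clique at size 2 (so k = 2 − 1 = 1), while eliminating the center first creates a clique over all the leaves. A small self-contained check (the example graph and names are my own):

def max_clique_size(neighbors, order):
    """Size of the largest elimination clique produced by eliminating nodes in `order`."""
    g = {v: set(ns) for v, ns in neighbors.items()}
    best = 0
    for x in order:
        nbrs = g[x]
        best = max(best, 1 + len(nbrs))
        for u in nbrs:
            g[u] |= nbrs - {u}
            g[u].discard(x)
        del g[x]
    return best

leaves = ['l1', 'l2', 'l3', 'l4', 'l5']
star = {'c': set(leaves), **{l: {'c'} for l in leaves}}   # center 'c' with 5 leaves

print(max_clique_size(star, leaves + ['c']))   # 2, so tree-width k = 2 - 1 = 1
print(max_clique_size(star, ['c'] + leaves))   # 6, a bad ordering blows up the clique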

6 Message Passing

Figure 5: Tree GMs: from left to right undirected tree, directed tree, polytree

6.1 From Elimination to Message Passing

One limitation of elimination is that it answers only a single query (e.g., the marginal of one node). To
answer queries on other nodes as well, the notion of a message is introduced. Each step of elimination is in
fact a message passed on a clique tree. Although different queries correspond to different clique trees, the
messages passed through the tree can be reused across queries.
There are mainly three types of trees (Figure 5): undirected tree, directed tree and polytree. In fact,
directed and undirected trees are equivalent, since:
(1) Undirected trees can be converted to directed by choosing a root and directing all edges away from it.
(2) A directed tree and the corresponding undirected tree make the same conditional independence assertions.
(3) The parameterizations are essentially the same: p(x) = \frac{1}{Z} \prod_{i \in V} \psi(x_i) \prod_{(i,j) \in E} \psi(x_i, x_j).

6.2 Elimination on A Tree

We can show that elimination on trees is equivalent to message passing along tree branches. As shown in
Figure 6a, let m_{ji}(x_i) denote the factor resulting from eliminating variables from below up to i, which is
a function of x_i:

m_{ji}(x_i) = \sum_{x_j} \Big( \psi(x_j) \psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j) \Big)

This is equivalent to the message from j to i.
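
A recursive sketch of this message computation on a small undirected tree with node potentials ψ(x_i) and edge potentials ψ(x_i, x_j) (the example path and potential values are made up):

import numpy as np

# A path 0 - 1 - 2 over binary variables.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
psi_node = {0: np.array([0.5, 0.5]),
            1: np.array([0.5, 0.5]),
            2: np.array([0.7, 0.3])}
psi_edge = {(0, 1): np.array([[0.9, 0.1], [0.1, 0.9]]),
            (1, 2): np.array([[0.8, 0.2], [0.2, 0.8]])}

def edge_pot(i, j):
    """psi(x_i, x_j) arranged with rows indexed by x_i."""
    return psi_edge[(i, j)] if (i, j) in psi_edge else psi_edge[(j, i)].T

def message(j, i):
    """m_ji(x_i) = sum_{x_j} psi(x_j) psi(x_i, x_j) prod_{k in N(j) minus i} m_kj(x_j)."""
    incoming = np.ones_like(psi_node[j])
    for k in neighbors[j]:
        if k != i:
            incoming = incoming * message(k, j)
    return edge_pot(i, j) @ (psi_node[j] * incoming)

# Unnormalized belief at the root node 0, then its normalized marginal.
belief = psi_node[0] * message(1, 0)
print(belief / belief.sum())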



The process of elimination or message passing can be described as follows:

• Choose the query node f as the root of the tree.

• View the tree as a directed tree with edges pointing from f towards the leaves.

• Use an elimination ordering based on a depth-first traversal of this directed tree.

• Elimination of each node can then be considered as message passing (or belief propagation) directly along
tree branches, rather than on some transformed graph.

Therefore, we can use the tree itself as a data structure to do general inference.

6.3 The Message Passing Protocol

To efficiently compute the marginal distributions of all nodes in a graph, a Message Passing Protocol (Figure
6b) is introduced: a node can send a message to a neighbor when (and only when) it has received messages
from all of its other neighbors. Based on this protocol, a naive approach is to treat each node in turn as the
root and execute the message-passing algorithm for it. The complexity of this naive approach is NC, where
N is the number of nodes and C is the complexity of one complete message-passing run.
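
A sketch of the protocol on a small tree (illustrative; the tree and potentials are my own): a message is sent as soon as its sender has heard from all of its other neighbors, and once every directed edge has carried a message, every node's marginal is available.

import numpy as np

# A star tree with node 1 in the middle and leaves 0, 2, 3; binary variables.
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
psi_node = {0: np.array([0.6, 0.4]), 1: np.array([0.5, 0.5]),
            2: np.array([0.7, 0.3]), 3: np.array([0.4, 0.6])}
psi_edge = {e: np.array([[0.9, 0.1], [0.1, 0.9]]) for e in [(0, 1), (1, 2), (1, 3)]}

def edge_pot(i, j):
    """psi(x_i, x_j) arranged with rows indexed by x_i."""
    return psi_edge[(i, j)] if (i, j) in psi_edge else psi_edge[(j, i)].T

messages = {}                                  # (j, i) -> m_ji(x_i)
pending = [(j, i) for j in neighbors for i in neighbors[j]]
while pending:
    for (j, i) in list(pending):
        others = [k for k in neighbors[j] if k != i]
        # Protocol: j may send to i only after hearing from all of its other neighbors.
        if all((k, j) in messages for k in others):
            incoming = np.ones(2)
            for k in others:
                incoming = incoming * messages[(k, j)]
            messages[(j, i)] = edge_pot(i, j) @ (psi_node[j] * incoming)
            pending.remove((j, i))

# Each node's marginal from its node potential and all incoming messages.
for i in neighbors:
    belief = psi_node[i].copy()
    for j in neighbors[i]:
        belief = belief * messages[(j, i)]
    print(i, belief / belief.sum())

Note that every directed edge carries exactly one message, so all N marginals are obtained at roughly twice the cost of a single-root pass rather than N times.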

(a) Message Passing (b) Message Passing Protocol

Figure 6: Message Passing
