Lecture 4: Exact Inference (Scribe Notes)
In practice, exact inference is not used widely, and most probabilistic inference algorithms are approximate.
Nevertheless, it is important to understand exact inference and its limitations.
There are two typical tasks with graphical models: inference and learning. Given a graphical model M that
describes a unique probability distribution P_M, inference means answering queries about P_M, e.g., P_M(X|Y).
Learning means obtaining a point estimate of the model M from data D. In statistics, however, both inference
and learning are commonly referred to as either inference or estimation. From the Bayesian perspective,
for example, learning p(M|D) is itself an inference problem. When not all variables are observable,
computing point estimates of M also requires inference to impute the missing data.
1.1 Likelihood
One of the simplest queries one may ask is likelihood estimation. The likelihood estimation of a probability
distribution is to compute the probability of the given evidence, where evidence is an assignment of values
to a subset of variables. Likelihood estimation involves marginalization of the other variables. Formally, let
e and x = (x_1, \ldots, x_k) denote the evidence and the remaining variables, respectively; the likelihood of e is

P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e).
From the perspective of computational statistics, calculating this is computationally expensive because the
computation involves exploring exponentially many configurations.
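To see why this is expensive, here is a minimal brute-force sketch in Python (the function name, the toy joint, and the variable names are illustrative assumptions, not from the lecture): it enumerates all configurations of the hidden variables, which is exactly the exponential blow-up described above.

import itertools

def brute_force_likelihood(joint, var_domains, evidence):
    """Sum the joint over all configurations of the non-evidence variables."""
    hidden = [v for v in var_domains if v not in evidence]
    total = 0.0
    # Enumerate every configuration of the hidden variables: k^|hidden| terms.
    for values in itertools.product(*(var_domains[v] for v in hidden)):
        assignment = dict(zip(hidden, values))
        assignment.update(evidence)
        total += joint(assignment)
    return total

# Toy joint over three independent binary variables (purely illustrative).
p = {"A": 0.6, "B": 0.3, "C": 0.8}
joint = lambda a: ((p["A"] if a["A"] else 1 - p["A"])
                   * (p["B"] if a["B"] else 1 - p["B"])
                   * (p["C"] if a["C"] else 1 - p["C"]))
domains = {v: [0, 1] for v in "ABC"}
print(brute_force_likelihood(joint, domains, {"C": 1}))  # P(C = 1) = 0.8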
Another type of queries is the conditional probability of variables given evidence. The probability of variables
X given evidence e is
P(X|e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}.

This is called the a posteriori belief in X given evidence e.
As in likelihood estimation, calculating a conditional probability involves marginalizing out the variables that
are not of interest (i.e., the summation in the equation above). We will cover efficient algorithms for
marginalization later in this class.
Calculating a conditional probability goes by different names depending on the query nodes. When the
query node is a terminal variable in a directed graphical model, the inference process is called prediction.
If the query node is an ancestor of the evidence, the inference process is called diagnosis, e.g., computing the
probability of a disease or fault given observed symptoms. For instance, in a deep belief network (a model
built by stacking restricted Boltzmann machines into multiple hidden layers), the hidden layers are inferred
given the data.

y1   y2   P(y1, y2)
0    0    0.35
0    1    0.05
1    0    0.30
1    1    0.30

Table 1: An example joint distribution over two binary variables, used below to illustrate the MPA query.
We may query the most probable assignment (MPA) for a subset of variables in the domain. Given evidence
e, query variables Y , and the other variables Z, the MPA of Y is
MPA(Y | e) = \arg\max_{y} P(y | e) = \arg\max_{y} \sum_{z} P(y, z | e).
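Table 1 shows why the MPA can differ from maximizing each marginal separately. From the table, P(y_1 = 1) = 0.3 + 0.3 = 0.6 > P(y_1 = 0) = 0.4 and P(y_2 = 0) = 0.35 + 0.3 = 0.65 > P(y_2 = 1) = 0.35, so maximizing the two marginals individually suggests (y_1, y_2) = (1, 0), whose joint probability is only 0.3. The joint MPA is instead (y_1, y_2) = (0, 0), with probability 0.35.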
2 Approaches to Inference
There are two types of inference techniques: exact inference and approximate inference. Exact inference
algorithms calculate the exact value of probability P (X|Y ). Algorithms in this class include the elimination
algorithm, the message-passing algorithm (sum-product, belief propagation), and the junction tree algo-
rithms. We will not cover the junction tree algorithms in this class, as they are outdated and confusing.
Exact inference on arbitrary graphical models is NP-hard; however, we can improve efficiency for particular
families of graphical models. Approximate inference techniques include stochastic
simulation and sampling methods, Markov chain Monte Carlo methods, and variational algorithms.
Figure 1: (a) a directed chain A → B → C → D → E; (b) an undirected chain A − B − C − D − E; (c) a hidden Markov model (chain conditional random field) with hidden states y_1, \ldots, y_T; (d) a more complex network (a food web).
3 Elimination
3.1 Chains
Let’s first consider inference on the simple chain in Figure 1a. The probability P (E = e) can be calculated
as
P(e) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, e).
This naive summation enumerates over exponentially many configurations of the variables and thus is inef-
ficient. But if we exploit the chain structure, the marginal probability can be calculated as

P(e) = \sum_d \sum_c \sum_b \sum_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
     = \sum_d P(e|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a) P(b|a)
     = \sum_d P(e|d) \sum_c P(d|c) \sum_b P(c|b) P(b)
     = \cdots

Each summation involves enumerating over only two variables, and there are four such summations.
Therefore, in general, the time complexity of the elimination algorithm is O(nk^2), where n is the number
of variables and k is the number of possible values of each variable. That is, for simple chains, exact inference
can be done in polynomial time, as opposed to the exponential time of the naive approach.
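A minimal sketch of this computation for the chain of Figure 1a, assuming tabulated CPDs generated at random (the names and the random potentials are illustrative, not from the lecture): each elimination step is one vector-matrix product, so the total cost is O(nk^2) rather than O(k^n).

import numpy as np

# Hypothetical tabulated CPDs for the chain A -> B -> C -> D -> E, each with k states.
k = 3
rng = np.random.default_rng(0)
p_a = rng.dirichlet(np.ones(k))                               # P(A)
cpds = [rng.dirichlet(np.ones(k), size=k) for _ in range(4)]  # cpd[i, j] = P(next = j | prev = i)

def chain_marginal(p_a, cpds):
    """Eliminate A, B, C, D in turn; each step is one O(k^2) vector-matrix product."""
    message = p_a                    # m(a) = P(a)
    for cpd in cpds:
        message = message @ cpd      # m(next) = sum_prev m(prev) P(next | prev)
    return message                   # = P(E)

p_e = chain_marginal(p_a, cpds)
print(p_e, p_e.sum())                # a valid distribution over E (sums to 1)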
Let's consider the hidden Markov model in Figure 1c. The conditional probability of variable Y_i given X = (x_1, \ldots, x_T) is proportional to the joint marginal P(y_i, x_1, \ldots, x_T), which can be calculated as

P(y_i, x_1, \ldots, x_T) = \sum_{y_1} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(y_1, \ldots, y_T, x_1, \ldots, x_T)
= \sum_{y_1} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(y_1) P(x_1|y_1) P(y_2|y_1) \cdots P(y_T|y_{T-1}) P(x_T|y_T)
= \sum_{y_2} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_2|y_2) \cdots P(y_T|y_{T-1}) P(x_T|y_T) \sum_{y_1} P(y_1) P(x_1|y_1) P(y_2|y_1)
= \sum_{y_2} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_2|y_2) \cdots P(y_T|y_{T-1}) P(x_T|y_T) m(x_1, y_2)
= \sum_{y_3} \cdots \sum_{y_{i-1}} \sum_{y_{i+1}} \cdots \sum_{y_T} P(x_3|y_3) \cdots P(y_T|y_{T-1}) P(x_T|y_T) m(x_1, x_2, y_3)
= \cdots

Normalizing the result over y_i then gives P(y_i | x_1, \ldots, x_T).
In fact, m(x1 , y2 ) is equal to the marginal probability P (x1 , y2 ).
Let’s consider the undirected chain in Figure 1b. With the elimination algorithm, the marginal probability
P (E = e) can be calculated similarly as
P(e) = \frac{1}{Z} \sum_d \sum_c \sum_b \sum_a \phi(b, a) \phi(c, b) \phi(d, c) \phi(e, d)
     = \frac{1}{Z} \sum_d \sum_c \sum_b \phi(c, b) \phi(d, c) \phi(e, d) \sum_a \phi(b, a)
     = \cdots
In general, we can view the task at hand as that of computing the value of an expression of the form

\sum_z \prod_{\phi \in \mathcal{F}} \phi,

where \mathcal{F} is a set of factors and z is the set of variables to be summed out.
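The general case can be sketched with a single primitive that eliminates one variable from a set of factors. Below is a minimal, illustrative Python implementation; the (scope, table) factor representation and the function name eliminate are assumptions, not from the lecture.

from itertools import product

def eliminate(factors, z, domains):
    """Sum variable z out of the product of the factors that mention z.

    factors: list of (scope, table) pairs, where scope is a tuple of variable
             names and table maps a tuple of values (ordered as scope) to a number.
    Returns the untouched factors plus the newly created factor over z's neighbors.
    """
    touching = [f for f in factors if z in f[0]]
    rest = [f for f in factors if z not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {z}))
    new_table = {}
    for vals in product(*(domains[v] for v in new_scope)):
        assignment = dict(zip(new_scope, vals))
        total = 0.0
        for z_val in domains[z]:                      # sum over the eliminated variable
            assignment[z] = z_val
            prod = 1.0
            for scope, table in touching:             # product of the factors touching z
                prod *= table[tuple(assignment[v] for v in scope)]
            total += prod
        new_table[vals] = total
    return rest + [(new_scope, new_table)]

Running eliminate once per non-query variable, in some chosen order, and multiplying the factors that remain gives the desired marginal up to normalization.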
4 Variable Elimination
We can extend the elimination algorithm to arbitrary graphical models. For a directed graphical model,
P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_{i \in V} P(x_i | pa_i).
Note that the elimination algorithm has no benefit if the innermost term includes all variables, that is, if x_i
depends on all the other variables. However, in most problems, the number of variables in the innermost
term is smaller than the total number of variables.
For undirected graphical models,
P(X_1 | e) = \frac{\phi(X_1, e)}{\sum_{x_1} \phi(x_1, e)},

where \phi(X_1, e) denotes the unnormalized factor obtained after eliminating all other variables.
Let's consider the graphical model in Figure 1d. The joint probability distribution factorizes as

P(a, b, c, d, e, f, g, h) = P(a) P(b) P(c|b) P(d|a) P(e|c, d) P(f|a) P(g|e) P(h|e, f).

Suppose we query P(A | h̃) and eliminate the variables in the order H, G, F, E, D, C, B.
We condition on the evidence node H by fixing its value to h. To treat marginalization and conditioning as
formally equivalent, we can define an evidence potential δ(h = h̃) whose value is one if the inner statement
is true and zero otherwise. Then, we obtain
P(H = \tilde{h} | e, f) = \sum_h P(h | e, f) \delta(h = \tilde{h}).
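In an implementation, conditioning can thus be handled by just appending one more factor. A small sketch, assuming the (scope, table) factor representation used in the elimination sketch earlier (the helper name is hypothetical):

def evidence_potential(var, observed, domains):
    """delta(var = observed): a single-variable factor that selects the evidence value."""
    table = {(value,): 1.0 if value == observed else 0.0 for value in domains[var]}
    return ((var,), table)

Appending this factor to the factor list and then eliminating H with the routine above has the same effect as slicing every table at H = h̃.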
Therefore, eliminating H, G, F, E, D, C, and B in turn produces a sequence of intermediate factors, the last of which is m_b(a), and

P(a | \tilde{h}) = \frac{P(a, \tilde{h})}{P(\tilde{h})} = \frac{P(a) m_b(a)}{\sum_a P(a) m_b(a)}.
In general, each elimination step has the form

m_x(y_1, \ldots, y_k) = \sum_x m'_x(x, y_1, \ldots, y_k), \qquad \text{where } m'_x(x, y_1, \ldots, y_k) = \prod_{i=1}^{k} m_i(x, y_{c_i}).

The first equation requires |Val(X)| \cdot \prod_i |Val(Y_{c_i})| additions, and the second requires k \cdot |Val(X)| \cdot \prod_i |Val(Y_{c_i})| multiplications. Therefore the computational complexity is exponential in the number of variables in the intermediate factor.
5 Graph Elimination

The equations in the previous section describe the elimination process from the perspective of mathematics.
We can also describe it from the perspective of graphs, via the graph elimination algorithm.
There are two main steps in the graph elimination algorithm: moralization and (undirected) graph
elimination.
5.1 Moralization
Moralization is the process of converting a directed acyclic graph (DAG) into an "equivalent" undirected graph.
Its procedure is:
(1) For every node, connect ("marry") all of its parents by adding undirected edges between them.
(2) Drop the directions of all edges.
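For concreteness, here is a minimal sketch of moralization in Python, assuming the DAG is given as a dictionary mapping every node to the list of its parents (the function name and representation are illustrative); it is applied to the network of Figure 1d.

from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph of a DAG given as {node: [parents]}."""
    edges = set()
    for child, pas in parents.items():
        for p in pas:                              # keep every edge, dropping its direction
            edges.add(frozenset((p, child)))
        for p1, p2 in combinations(pas, 2):        # "marry" all co-parents of each node
            edges.add(frozenset((p1, p2)))
    return edges

# The network of Figure 1d.
parents = {"a": [], "b": [], "c": ["b"], "d": ["a"], "e": ["c", "d"],
           "f": ["a"], "g": ["e"], "h": ["e", "f"]}
print(frozenset(("c", "d")) in moralize(parents))  # True: the parents of e are married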
Now we can interpret the elimination algorithm from the perspective of graph elimination. As shown in
Figure 2 below, each summation step in elimination can be represented by an elimination step in the graph
elimination algorithm. Moreover, the intermediate terms in elimination correspond to the elimination cliques
resulting from the graph elimination algorithm (Figure 3a), and we can also construct a clique tree to
represent the elimination process (Figure 3b).
Figure 3: (a) Elimination cliques; (b) the corresponding clique tree.
5.4 Complexity
The overall complexity is determined by the size of the largest elimination clique. A "good" elimination
ordering keeps the largest clique relatively small. The tree-width k is introduced to study this problem: it is
defined as one less than the smallest achievable cardinality of the largest elimination clique, ranging
over all possible elimination orderings. However, finding k, as well as the "best" elimination ordering, is
NP-hard. For some graphs, such as stars (Figure 4a, k = 2 − 1 = 1) and trees (Figure 4b, k = 2 − 1 = 1), we
can easily determine the tree-width, while for graphs like the Ising model (Figure 4c) it is very hard to
compute k.
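To make the role of the ordering concrete, the following sketch (the function name and the edge-list representation are assumptions) runs undirected graph elimination with a given ordering and reports the size of the largest elimination clique; the tree-width is the minimum of this size over all orderings, minus one.

def max_elimination_clique(edges, order):
    """Eliminate nodes in the given order; return the size of the largest elimination clique."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    largest = 0
    for x in order:
        clique = nbrs.get(x, set()) | {x}       # x together with its current neighbors
        largest = max(largest, len(clique))
        remaining = nbrs.pop(x, set())
        for u in remaining:                     # connect the remaining neighbors to each other
            nbrs[u].discard(x)
            nbrs[u] |= remaining - {u}
    return largest

# A star with hub 0 and five leaves: eliminating the leaves first gives cliques of
# size 2 (tree-width 2 - 1 = 1), while eliminating the hub first creates a 6-clique.
star = [(0, i) for i in range(1, 6)]
print(max_elimination_clique(star, [1, 2, 3, 4, 5, 0]))  # 2
print(max_elimination_clique(star, [0, 1, 2, 3, 4, 5]))  # 6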
6 Message Passing
Figure 5: Tree GMs: from left to right undirected tree, directed tree, polytree
One limitation of elimination is that it only answers one query (e.g., about one node). To answer
other queries (nodes) as well, the notion of a message is introduced. Each step of elimination is actually a
message passed on a clique tree. Although different queries have different clique trees, the messages passed
through the tree can be reused.
There are mainly three types of trees (Figure 5): undirected tree, directed tree and polytree. In fact,
directed and undirected trees are equivalent, since:
(1) Undirected trees can be converted to directed by choosing a root and directing all edges away from it.
(2) A directed tree and the corresponding undirected tree make the same conditional independence assertions.
(3) Parameterizations are essentially the same: p(x) = \frac{1}{Z} \prod_{i \in V} \psi(x_i) \prod_{(i,j) \in E} \psi(x_i, x_j).
We can show that elimination on trees is equivalent to message passing along tree branches. As shown in
Figure 6a, let m_{ji}(x_i) denote the factor resulting from eliminating variables from below up to i, which is a
function of x_i:

m_{ji}(x_i) = \sum_{x_j} \Big( \psi(x_j) \psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j) \Big)
To efficiently compute the marginal distributions of all nodes in a graph, a Message Passing Protocol (Figure
6b) is introduced: a node can send a message to a neighbor when (and only when) it has received messages
from all of its other neighbors. Based on this protocol, a naive approach is to treat each node in turn as
the root and execute the message passing algorithm for it. The complexity of this naive approach is O(NC),
where N is the number of nodes and C is the complexity of a complete message passing pass.
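A compact sketch of this protocol on an undirected tree with pairwise potentials, assuming NumPy arrays for ψ(x_i) and ψ(x_i, x_j); psi_edge[(j, i)] is assumed to hold ψ(x_j, x_i) with rows indexed by x_j, and the function names are illustrative, not from the lecture. A single shared cache realizes the reuse of messages across queries.

import numpy as np

def message(j, i, nbrs, psi_node, psi_edge, cache):
    """m_{ji}(x_i) = sum_{x_j} psi(x_j) psi(x_i, x_j) prod_{k in N(j) minus i} m_{kj}(x_j)."""
    if (j, i) not in cache:
        m = psi_node[j].copy()
        for k in nbrs[j]:
            if k != i:                              # protocol: wait for all other neighbors
                m *= message(k, j, nbrs, psi_node, psi_edge, cache)
        cache[(j, i)] = psi_edge[(j, i)].T @ m      # sum over x_j
    return cache[(j, i)]

def marginal(i, nbrs, psi_node, psi_edge, cache):
    """P(x_i) is proportional to psi(x_i) times the product of all incoming messages."""
    belief = psi_node[i].copy()
    for j in nbrs[i]:
        belief *= message(j, i, nbrs, psi_node, psi_edge, cache)
    return belief / belief.sum()

# Toy tree: the chain 0 - 1 - 2 with binary variables (the potentials are assumptions).
nbrs = {0: [1], 1: [0, 2], 2: [1]}
psi_node = {i: np.ones(2) for i in nbrs}
coupling = np.array([[2.0, 1.0], [1.0, 2.0]])
psi_edge = {(i, j): coupling for i in nbrs for j in nbrs[i]}
cache = {}                                          # one shared cache: messages are reused
for i in nbrs:
    print(i, marginal(i, nbrs, psi_node, psi_edge, cache))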