Approximate Inference
Machine Intelligence
Thomas D. Nielsen
September 2008
Motivation
Because exact inference in Bayesian networks is (worst-case) intractable, we try to find more efficient approximate inference techniques:
Absolute/Relative Error
|p − p̂| ≤ ε, i.e. p̂ ∈ [p − ε, p + ε].
This definition is not always fully satisfactory, because it is not symmetric in p and p̂ and not invariant under the substitution p → (1 − p), p̂ → (1 − p̂). Use with care!
When p̂1, p̂2 are approximations for p1, p2 with absolute error ≤ ε, then no error bounds follow for p̂1/p̂2 as an approximation for p1/p2.
When p̂1, p̂2 are approximations for p1, p2 with relative error ≤ ε, then p̂1/p̂2 approximates p1/p2 with relative error ≤ 2ε/(1 + ε).
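A quick numeric sanity check of the ratio bound (a sketch, assuming relative error means |p − p̂| ≤ ε·p): pushing p̂1 to the lower end of its interval and p̂2 to the upper end makes the ratio's relative error exactly 2ε/(1 + ε).

```python
# Relative error eps means p_hat lies in [p*(1-eps), p*(1+eps)].
# With p1_hat at its lower extreme and p2_hat at its upper extreme,
# the ratio's relative error is exactly 2*eps/(1+eps).
# p1, p2 below are arbitrary illustrative values.
eps = 0.1
p1, p2 = 0.3, 0.6
p1_hat, p2_hat = p1 * (1 - eps), p2 * (1 + eps)

rel_err = abs(p1_hat / p2_hat - p1 / p2) / (p1 / p2)
print(rel_err, 2 * eps / (1 + eps))  # both ≈ 0.1818
```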
Randomized Methods
Most methods for approximate inference are randomized algorithms that compute approximations
P̂ from random samples of instantiations.
We shall consider:
Forward sampling
Likelihood weighting
Gibbs sampling
Metropolis-Hastings algorithm
Forward Sampling
Observation: a Bayesian network can be used as a random generator that produces full instantiations V = v according to the distribution P(V).
Example: a single node A with

P(A):  A = t: 0.2,  A = f: 0.8
Sampling Algorithm
Thus, we have a randomized algorithm S that produces possible outputs from sp(V) according to
the distribution P(V).
Define
P̂(A = a | E = e) := |{i ∈ 1, …, N | E = e, A = a in S_i}| / |{i ∈ 1, …, N | E = e in S_i}|
[Figure: the N samples partitioned into three regions: "not E = e", "E = e, A ≠ a", and "E = e, A = a"; the approximation of P(A = a | E = e) is the fraction of the samples with E = e that also have A = a.]
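The estimator above can be sketched in a few lines of Python. The tiny network A → B with P(A = t) = 0.2 and P(B = t | A = t) = 0.7, P(B = t | A = f) = 0.4 is borrowed from the likelihood-weighting example later in these slides; we estimate P(A = t | B = t).

```python
import random

# Forward sampling on the network A -> B:
#   P(A=t) = 0.2,  P(B=t | A=t) = 0.7,  P(B=t | A=f) = 0.4
def sample():
    a = random.random() <= 0.2                  # sample A from P(A)
    b = random.random() <= (0.7 if a else 0.4)  # then B from P(B | A)
    return a, b

def estimate_a_given_b(n=200_000, seed=0):
    """Estimate P(A=t | B=t): fraction of B=t samples that have A=t."""
    random.seed(seed)
    hits = total = 0
    for _ in range(n):
        a, b = sample()
        if b:            # keep only samples consistent with the evidence
            total += 1
            hits += a
    return hits / total

# Exact value: 0.2*0.7 / (0.2*0.7 + 0.8*0.4) = 0.14/0.46 ≈ 0.304
```

Note that all samples with B = f are discarded; when the evidence is unlikely, almost all work is wasted, which is exactly what likelihood weighting addresses.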
Idea: find sampling algorithm Sc that produces outputs from sp(V) according to the distribution
P(V | E = e).
A tempting approach: Fix the variables in E to e and sample from the nonevidence variables
only!
Problem: only evidence from the ancestors is taken into account!
Likelihood weighting
Example: the network A → B.

P(A):          A = t: 0.2,  A = f: 0.8

P(B | A):           B = t   B = f
        A = t:       0.7     0.3
        A = f:       0.4     0.6

- Assume evidence B = t.
- Generate a random number r uniformly from [0, 1].
- Set A = t if r ≤ 0.2 and A = f otherwise.
- If A = t, let the sample count with weight w(t, t) = P(B = t | A = t) = 0.7; otherwise with weight w(f, t) = P(B = t | A = f) = 0.4.
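A minimal sketch of the weighting scheme above: B is clamped to t, A is sampled from its prior, and each sample contributes its weight w to the appropriate counters.

```python
import random

# Likelihood weighting on the slide's network A -> B, evidence B = t.
def lw_estimate_a(n=200_000, seed=0):
    """Estimate P(A=t | B=t) as a ratio of weighted counts."""
    random.seed(seed)
    num = den = 0.0
    for _ in range(n):
        a = random.random() <= 0.2   # sample A from P(A)
        w = 0.7 if a else 0.4        # weight w = P(B=t | A=a)
        den += w
        if a:
            num += w
    return num / den

# Converges to 0.2*0.7 / (0.2*0.7 + 0.8*0.4) ≈ 0.304, without
# discarding any samples.
```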
Gibbs Sampling
For notational convenience assume from now on that, for some l, E = V_{l+1}, V_{l+2}, …, V_n. Write W for V_1, …, V_l.
Principle: obtain new sample from previous sample by randomly changing the value of only one
selected variable.
Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all
instantiations with E = e:
Reachable in one step: instantiations that differ from the current one in the value assigned to at most one variable (assuming a randomized choice of variable V_k).
This requires sampling from a conditional distribution. In this special case (all variables but one are instantiated) it is easy: just compute for each v ∈ sp(V_k) the probability P(V_k = v, V \ {V_k} = current values) (linear in network size), and choose v_{i,k} according to these probabilities (normalized).
This can be further simplified by computing the distribution on sp(Vk ) only in the Markov blanket of
Vk , i.e. the subnetwork consisting of Vk , its parents, its children, and the parents of its children.
Under certain conditions the distribution of samples converges to the posterior distribution P(W | E = e):

lim_{i→∞} P(v_i = v) = P(W = v | E = e)    (v ∈ sp(W)).
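The update rule above can be sketched on a hypothetical three-variable chain A → B → C with evidence C = t (the CPT numbers below are illustrative, not from the slides). Each step resamples one free variable from its conditional given the rest, obtained by normalizing the joint.

```python
import random

# Hypothetical chain A -> B -> C, evidence C = t (CPTs are made up):
#   P(A=t)=0.2,  P(B=t|A): t->0.7, f->0.4,  P(C=t|B): t->0.9, f->0.1
def joint(a, b):  # P(A=a, B=b, C=t)
    pa = 0.2 if a else 0.8
    pb = (0.7 if a else 0.4) if b else (0.3 if a else 0.6)
    pc = 0.9 if b else 0.1
    return pa * pb * pc

def gibbs(n=100_000, burn=1_000, seed=0):
    """Estimate P(A=t | C=t) by Gibbs sampling over A and B."""
    random.seed(seed)
    a, b = True, True                  # arbitrary starting point
    hits = total = 0
    for i in range(n):
        # resample A from P(A | B=b, C=t), proportional to the joint
        pa = joint(True, b) / (joint(True, b) + joint(False, b))
        a = random.random() <= pa
        # resample B from P(B | A=a, C=t)
        pb = joint(a, True) / (joint(a, True) + joint(a, False))
        b = random.random() <= pb
        if i >= burn:
            total += 1
            hits += a
    return hits / total

# Exact value: P(A=t | C=t) = 0.132 / 0.468 ≈ 0.282
```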
Effect of dependence
P(v_N = v) being close to P(W = v | E = e) means: the probability that v_N lies in the red region is close to P(A = a | E = e).
This does not guarantee that the fraction of the samples v_N, v_{N+1}, …, v_{N+M} that lie in the red region yields a good approximation of P(A = a | E = e): consecutive samples are highly dependent, so a walk can stay in one region for a long stretch.
[Figure: a single walk v_0 → … → v_N → … → v_{N+M} that remains inside one region for the whole stretch v_N, …, v_{N+M}.]
In practice, one tries to counteract these difficulties by restarting the Gibbs sampling several times
(often with different starting points):
[Figure: several restarted walks, each starting from a fresh v_0 and run from v_N to v_{N+M}.]
Metropolis-Hastings Algorithm
Let
{q(v, v′) | v, v′ ∈ sp(W)}
be a set of transition probabilities over sp(W), i.e. q(v, ·) is a probability distribution for each v ∈ sp(W). The q(v, v′) are called proposal probabilities.
Define

α(v, v′) := min{ 1, [P(W = v′ | E = e) · q(v′, v)] / [P(W = v | E = e) · q(v, v′)] }
          = min{ 1, [P(W = v′, E = e) · q(v′, v)] / [P(W = v, E = e) · q(v, v′)] }
Under certain conditions: the distribution of samples converges to the posterior distribution
P(W | E = e).
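A sketch of the resulting sampler, again on a hypothetical chain A → B → C with evidence C = t (illustrative CPTs). The proposal flips one uniformly chosen free variable, which is symmetric, so q(v′, v)/q(v, v′) = 1 and α reduces to a ratio of joints; note that only P(W = v, E = e) is ever needed, never the normalized posterior.

```python
import random

# Metropolis-Hastings over W = (A, B), evidence C = t; the proposal
# flips one uniformly chosen variable of W (symmetric, so q cancels).
def joint(a, b):  # P(A=a, B=b, C=t) for a hypothetical chain A->B->C
    pa = 0.2 if a else 0.8
    pb = (0.7 if a else 0.4) if b else (0.3 if a else 0.6)
    pc = 0.9 if b else 0.1
    return pa * pb * pc

def metropolis_hastings(n=100_000, burn=1_000, seed=0):
    """Estimate P(A=t | C=t)."""
    random.seed(seed)
    state = [True, True]                       # starting point for (A, B)
    hits = total = 0
    for i in range(n):
        prop = state.copy()
        prop[random.randrange(2)] ^= True      # flip A or B
        alpha = min(1.0, joint(*prop) / joint(*state))
        if random.random() < alpha:            # accept with probability alpha
            state = prop
        if i >= burn:
            total += 1
            hits += state[0]
    return hits / total

# Converges to P(A=t | C=t) = 0.132 / 0.468 ≈ 0.282
```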
Message passing
A node sends a message to a neighbor by
multiplying the incoming messages from all other neighbors onto the potential it holds,
marginalizing the result down to the separator.
[Figure: node C holding P(C | A, B), connected to neighbors A, B (parents) and D, E (children).]
[Figure: node C holding P(C | A, B), with incoming messages φ_A, φ_B, φ_D, φ_E and outgoing messages π_E(C) and λ_C(A).]

With all incoming messages available, node C sends, e.g.,

π_E(C) = φ_D · Σ_{A,B} P(C | A, B) φ_A φ_B
λ_C(A) = Σ_{B,C} P(C | A, B) φ_B φ_D φ_E
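A numeric sketch of the π_E(C) computation, with made-up binary potentials (indices 0 = f, 1 = t; all numbers are illustrative):

```python
# P_C[c][a][b] = P(C=c | A=a, B=b); columns sum to 1 over c.
P_C = [[[0.9, 0.5], [0.6, 0.1]],   # C = 0
       [[0.1, 0.5], [0.4, 0.9]]]   # C = 1
phi_A = [0.8, 0.2]   # message from A, indexed by A
phi_B = [0.6, 0.4]   # message from B, indexed by B
phi_D = [0.5, 0.5]   # message from child D, indexed by C

# pi_E(C) = phi_D(C) * sum_{A,B} P(C | A, B) phi_A(A) phi_B(B)
pi_E = [phi_D[c] * sum(P_C[c][a][b] * phi_A[a] * phi_B[b]
                       for a in range(2) for b in range(2))
        for c in range(2)]
print(pi_E)  # [0.336, 0.164]
```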
A few observations:
When calculating P(E) we treat C and D as being independent (in the junction tree C and D
would appear in the same separator).
Evidence on a converging connection may cause the error to cycle.
In general
There is no guarantee of convergence, nor, in case of convergence, that it converges to the correct distribution. However, the method converges to the correct distribution surprisingly often!
If the network is singly connected, convergence is guaranteed.
Literature
R. M. Neal: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. http://omega.albany.edu:8008/neal.pdf
P. Dagum, M. Luby: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 1993.