
15-780 – Probabilistic Inference

J. Zico Kolter

March 30, 2014

1
Outline

Probabilistic graphical models

Exact inference

Approximate inference

2
Probability distributions

• Probabilistic graphical models (PGMs) are about representing probability distributions over random variables

p(x)

where for this lecture, x ∈ {0, 1}^n, p : {0, 1}^n → [0, 1]

• Naively, since there are 2^n possible assignments to x, can represent this distribution completely using 2^n − 1 numbers, but this quickly becomes intractable for large n

• PGMs are methods to represent these distributions more compactly, by exploiting conditional independence

4
Bayesian networks

• A Bayesian network is defined by:

1. A directed acyclic graph (DAG) G = (V = {x1 , . . . , xn }, E)

2. A set of conditional probability tables p(xi |Parents(xi ))

• Defines the joint probability distribution

p(x) = ∏_{i=1}^n p(xi | Parents(xi))

• Equivalently, each node is conditionally independent of all non-descendants given its parents

5
Bayes net example

[Figure: DAG with nodes x1 (Burglary?), x2 (Earthquake?), x3 (Alarm?), x4 (JohnCalls?), x5 (MaryCalls?); edges x1 → x3, x2 → x3, x3 → x4, x3 → x5]

p(x1 = 1) = 0.001        p(x2 = 1) = 0.002

x1  x2  p(x3 = 1)
0   0   0.001
0   1   0.29
1   0   0.94
1   1   0.95

x3  p(x4 = 1)        x3  p(x5 = 1)
0   0.05             0   0.01
1   0.9              1   0.7

• Can write distribution as

p(x) = p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3) p(x5|x1, x2, x3, x4)
     = p(x1) p(x2) p(x3|x1, x2) p(x4|x3) p(x5|x3)

6
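As a quick sanity check on this factorization, here is a minimal Python sketch (variable names are our own, not from the slides) that stores the five CPTs above and multiplies them to evaluate p(x) for any full assignment.

# Minimal sketch: evaluate the alarm-network joint p(x) from its CPTs.
# Each table stores p(xi = 1 | parents); p(xi = 0 | parents) = 1 - that value.

p1 = 0.001                                   # p(x1 = 1)
p2 = 0.002                                   # p(x2 = 1)
p3 = {(0, 0): 0.001, (0, 1): 0.29,
      (1, 0): 0.94,  (1, 1): 0.95}           # p(x3 = 1 | x1, x2)
p4 = {0: 0.05, 1: 0.9}                       # p(x4 = 1 | x3)
p5 = {0: 0.01, 1: 0.7}                       # p(x5 = 1 | x3)

def bern(p_one, value):
    """Probability of a binary value under P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """p(x) = p(x1) p(x2) p(x3|x1,x2) p(x4|x3) p(x5|x3)."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3[(x1, x2)], x3)
            * bern(p4[x3], x4) * bern(p5[x3], x5))

# e.g. probability that both neighbors call but nothing else happened
print(joint(0, 0, 0, 1, 1))   # = 0.999 * 0.998 * 0.999 * 0.05 * 0.01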
Markov random fields
• A (pairwise) Markov random field (MRF) is defined by:
1. An undirected graph G = (V = {x1 , . . . , xn }, E)

2. A set of unary potentials f(xi) for each i = 1, . . . , n

3. A set of binary potentials f(xi, xj) for all (i, j) ∈ E

• Defines the joint probability distribution

p(x) = (1/Z) ∏_{i=1}^n f(xi) ∏_{(i,j)∈E} f(xi, xj)

where Z is a normalization constant (also called the partition function)

Z = ∑_x ∏_{i=1}^n f(xi) ∏_{(i,j)∈E} f(xi, xj)

7
• Equivalently, each node in an MRF is conditionally independent of all other nodes given its neighbors

p(xi | x−i) = p(xi | Neighbors(xi))

This is not trivial to show; it is known as the Hammersley-Clifford theorem

8
MRF example

[Figure: MRF with two nodes x1 and x2 joined by a single edge]

x1  x2  f(x1, x2)
0   0   10
0   1   1
1   0   1
1   1   10

x1  f(x1)        x2  f(x2)
0   1            0   5
1   5            1   1

x1  x2  ∏f   p(x)
0   0   50   1/3
0   1   25   1/6
1   0   25   1/6
1   1   50   1/3

• E.g., p(x1 = 1, x2 = 1) = (1/150) · 5 · 10 · 1 = 1/3

9
Factor graphs
• A generalization that captures both Bayesian networks and
Markov random fields

• An undirected graph G = (V = {x1, . . . , xn, f1, . . . , fm}, E) over variables and factors

• There exists an edge fi — xj if and only if factor fi includes variable xj

• Defines the joint probability distribution

p(x) = (1/Z) ∏_{i=1}^m fi(Xi)

where Xi = {xj : (fi, xj) ∈ E} are the variables in factor fi


10
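To make this concrete, here is one possible encoding of a factor graph in Python; the (scope, table) representation and the potential values are our own choices for illustration, not anything prescribed by the slides. p(x) is the product of factor values divided by Z, which for small n can be computed by brute-force enumeration.

from itertools import product

# A factor is (scope, table): scope is a tuple of variable names, and
# table maps each assignment of the scope (a tuple of 0/1s) to a value.
# The numbers below are arbitrary, for illustration only.
factors = [
    (("x1",),      {(0,): 1.0, (1,): 2.0}),                  # unary on x1
    (("x2",),      {(0,): 1.0, (1,): 3.0}),                  # unary on x2
    (("x1", "x2"), {(0, 0): 4.0, (0, 1): 1.0,
                    (1, 0): 1.0, (1, 1): 4.0}),              # pairwise factor
]
variables = ["x1", "x2"]

def unnormalized(assignment, factors):
    """Product of all factor values at a full assignment {var: 0/1}."""
    val = 1.0
    for scope, table in factors:
        val *= table[tuple(assignment[v] for v in scope)]
    return val

# Partition function Z by brute-force enumeration over all 2^n assignments
Z = sum(unnormalized(dict(zip(variables, bits)), factors)
        for bits in product([0, 1], repeat=len(variables)))

def prob(assignment):
    return unnormalized(assignment, factors) / Z

print(Z, prob({"x1": 1, "x2": 1}))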
MRF to factor graph

[Figure: factor graph with variable nodes x1, x2, unary factors f1 (attached to x1) and f2 (attached to x2), and a pairwise factor f3 attached to both x1 and x2]

x1  x2  f3(x1, x2)
0   0   10
0   1   1
1   0   1
1   1   10

11
Bayes net to factor graph

[Figure: factor graph with variable nodes x1, . . . , x5; factor f1 attached to x1, f2 to x2, f3 to {x1, x2, x3}, f4 to {x3, x4}, and f5 to {x3, x5}]

x3  p(x5 = 1)
0   0.01
1   0.7

x3  x5  f5(x3, x5)
0   0   0.99
0   1   0.01
1   0   0.3
1   1   0.7

12
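The conversion of a CPT into a factor table, as in the f5(x3, x5) example above, is mechanical; a small sketch (the helper name cpt_to_factor is our own):

# Turn a conditional probability table p(child = 1 | parent) into a
# factor f(parent, child), as in the f5(x3, x5) table above.
def cpt_to_factor(p_child_one):
    """p_child_one: dict parent_value -> p(child = 1 | parent_value)."""
    factor = {}
    for parent_value, p_one in p_child_one.items():
        factor[(parent_value, 0)] = 1.0 - p_one
        factor[(parent_value, 1)] = p_one
    return factor

f5 = cpt_to_factor({0: 0.01, 1: 0.7})
print(f5)   # matches the f5(x3, x5) table above (up to floating-point rounding)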
Outline

Probabilistic graphical models

Exact inference

Approximate inference

13
Inference in probabilistic graphical models

• Inference generally refers to methods that answer probability queries given a graphical model

• Several types that come up frequently:

– Marginal inference: compute p(xI) for some xI ⊆ {x1, . . . , xn} (non-trivial even for xI = {x1, . . . , xn} in a factor graph)

– Conditional inference: compute p(xI | xE = x′E) for some xI, xE ⊆ {x1, . . . , xn}, xI ∩ xE = ∅

– Maximum a posteriori (MAP) inference: compute max_{xI} p(xI), and possibly the maximizing assignment x⋆I (also the conditional analogue); also called most probable explanation (MPE)

14
Inference via enumeration
• If we’re willing to enumerate all 2^n possible values, inference queries can be answered easily

– Marginal inference:

p(xI) = ∑_{x̄I} p(xI, x̄I) = ∑_{x̄I} ∏_{i=1}^m fi(Xi)

– Conditional inference:

p(xI | xE = x′E) = p(xI, xE = x′E) / p(xE = x′E)

– MAP inference: compute p(xI = x′I) for all possible assignments x′I, choose the largest

15
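A brute-force sketch of all three query types, assuming n is small enough to enumerate every assignment; the (variables, score) interface and the toy numbers are our own illustration, not the slides’.

from itertools import product

def enumerate_joint(variables, score):
    """Normalize an unnormalized score over all 2^n assignments (n small!)."""
    table = {bits: score(dict(zip(variables, bits)))
             for bits in product([0, 1], repeat=len(variables))}
    Z = sum(table.values())
    return {bits: v / Z for bits, v in table.items()}

def marginal(joint, variables, query_vars):
    """p(x_I): sum the (possibly unnormalized) table over non-query variables."""
    idx = [variables.index(v) for v in query_vars]
    out = {}
    for bits, p in joint.items():
        key = tuple(bits[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional(joint, variables, query_vars, evidence):
    """p(x_I | x_E = e) = p(x_I, x_E = e) / p(x_E = e)."""
    keep = {bits: p for bits, p in joint.items()
            if all(bits[variables.index(v)] == val for v, val in evidence.items())}
    pE = sum(keep.values())                        # p(x_E = e), assumed > 0
    return {k: v / pE for k, v in marginal(keep, variables, query_vars).items()}

def map_assignment(joint):
    """MAP over all variables (MPE): the most probable full assignment."""
    return max(joint, key=joint.get)

# Toy usage with an arbitrary made-up score function over two variables
joint = enumerate_joint(["x1", "x2"], lambda a: 1.0 + a["x1"] + 2 * a["x2"])
print(marginal(joint, ["x1", "x2"], ["x2"]))       # {(0,): 0.3, (1,): 0.7}
print(conditional(joint, ["x1", "x2"], ["x1"], {"x2": 1}))
print(map_assignment(joint))                       # (1, 1)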
Exploiting graph structure in inference

• When n gets large, inference by exact enumeration is intractable

• Can (sometimes) use the compact graph representation of the distribution to derive compact forms of inference

16
Example: chain Bayesian network

[Figure: chain Bayesian network x1 → x2 → x3 → x4]

p(x4) = ∑_{x1,x2,x3} p(x1, x2, x3, x4)
      = ∑_{x1,x2,x3} p(x1) p(x2|x1) p(x3|x2) p(x4|x3)
      = ∑_{x2,x3} p(x3|x2) p(x4|x3) ∑_{x1} p(x1) p(x2|x1)
      = ∑_{x2,x3} p(x3|x2) p(x4|x3) p(x2)
      = ∑_{x3} p(x4|x3) ∑_{x2} p(x3|x2) p(x2)
      = ∑_{x3} p(x4|x3) p(x3)
      = p(x4)

17
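The same computation in code: a sketch with made-up CPT numbers that pushes the sums in one variable at a time, so every intermediate object is a distribution over a single variable rather than a table over all 2^3 joint assignments of x1, x2, x3.

# Chain x1 -> x2 -> x3 -> x4, binary variables.
# CPTs stored as cpt[prev_value] = p(next = 1 | prev = prev_value).
# The numbers below are arbitrary, for illustration only.
p_x1 = 0.3                      # p(x1 = 1)
p_x2 = {0: 0.2, 1: 0.8}         # p(x2 = 1 | x1)
p_x3 = {0: 0.4, 1: 0.6}         # p(x3 = 1 | x2)
p_x4 = {0: 0.1, 1: 0.9}         # p(x4 = 1 | x3)

def forward(prior, cpt):
    """Given p(prev) as [p(prev=0), p(prev=1)], return p(next) by
    summing out prev:  p(next = 1) = sum_prev p(next = 1 | prev) p(prev)."""
    p_next_one = prior[0] * cpt[0] + prior[1] * cpt[1]
    return [1.0 - p_next_one, p_next_one]

p1 = [1.0 - p_x1, p_x1]
p2 = forward(p1, p_x2)          # sums out x1
p3 = forward(p2, p_x3)          # sums out x2
p4 = forward(p3, p_x4)          # sums out x3
print(p4)                       # the marginal p(x4)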
General algorithm: variable elimination

function G′ = Sum-Product-Eliminate(G, xi)
    // eliminate variable xi from the factor graph G
    F ← {fj ∈ V : (fj, xi) ∈ E}
    X̃ ← {xk : (fj, xk) ∈ E, fj ∈ F} − {xi}
    f̃(X̃) ← ∑_{xi} ∏_{fj ∈ F} fj(Xj)
    V′ ← V − ({xi} ∪ F) + {f̃}
    E′ ← E − {(fj, xk) ∈ E : fj ∈ F} + {(f̃, xk) : xk ∈ X̃}
    return G′ = (V′, E′)

18
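A minimal runnable sketch of the elimination step, reusing the informal (scope, table) factor encoding from the earlier factor-graph sketch (our own representation, not dictated by the slides): multiply the factors that touch xi, sum xi out, and return the new factor list. Eliminating every variable in turn, as the full algorithm on the following slides does, leaves a single constant factor equal to Z.

from itertools import product

def eliminate(factors, xi):
    """Sum-product elimination of variable xi from a list of (scope, table) factors."""
    F = [f for f in factors if xi in f[0]]                 # factors touching xi
    rest = [f for f in factors if xi not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in F for v in scope} - {xi}))
    new_table = {}
    for bits in product([0, 1], repeat=len(new_scope)):
        assignment = dict(zip(new_scope, bits))
        total = 0.0
        for xi_val in (0, 1):                              # sum over xi
            assignment[xi] = xi_val
            val = 1.0
            for scope, table in F:                         # product over factors in F
                val *= table[tuple(assignment[v] for v in scope)]
            total += val
        new_table[bits] = total
    return rest + [(new_scope, new_table)]

# Usage on a tiny two-variable graph with arbitrary illustrative numbers
factors = [
    (("x1",),      {(0,): 1.0, (1,): 3.0}),
    (("x1", "x2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
]
after_x1 = eliminate(factors, "x1")    # marginalized factor graph over x2 only
after_all = eliminate(after_x1, "x2")  # a single constant factor: the partition function Z
print(after_x1, after_all)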
Variable elimination example

[Figure: factor graph with variables x1, . . . , x5; factor f1 attached to x1, f2 to x2, f3 to {x1, x2, x3}, f4 to {x3, x4}, and f5 to {x3, x5}; x3 is being eliminated]

Eliminating x3:

F = {f3, f4, f5}
X̃ = {x1, x2, x4, x5}
f̃(x1, x2, x4, x5) = ∑_{x3} f3(x1, x2, x3) f4(x3, x4) f5(x3, x5)
V′ = {x1, x2, x4, x5, f1, f2, f̃}
E′ = {(f1, x1), (f2, x2), (f̃, x1), (f̃, x2), (f̃, x4), (f̃, x5)}

19
• The full variable elimination algorithm just repeatedly eliminates variables:

function G′ = Sum-Product-Variable-Elimination(G, X)
    // eliminate an ordered list of variables X
    for xi ∈ X:
        G ← Sum-Product-Eliminate(G, xi)
    return G

• The graph returned at the end is a marginalized factor graph over the non-eliminated variables (eliminating all variables returns a constant equal to the partition function Z)

• The ordering matters a lot; eliminating variables in the wrong order can make the algorithm no better than enumeration

20
Variable elimination example

Goal: compute p(x4)

[Figure sequence: starting from the factor graph over x1, . . . , x5 with factors f1, f2, f3, f4, f5, the variables are eliminated one at a time]

Eliminate x1: {f1, f3} are replaced by f̃1(x2, x3); the graph is now over x2, x3, x4, x5 with factors f2, f̃1, f4, f5
Eliminate x2: {f2, f̃1} are replaced by f̃2(x3); the graph is now over x3, x4, x5 with factors f̃2, f4, f5
Eliminate x3: {f̃2, f4, f5} are replaced by f̃3(x4, x5); the graph is now over x4, x5 with factor f̃3
Eliminate x5: f̃3 is replaced by f̃4(x4), which is the desired marginal p(x4)

21
Pitfalls

• The tree-width of a graphical model is the size of the largest factor formed during variable elimination (assuming the best ordering); inference is exponential in the tree-width

• But...

– Finding the best variable elimination ordering is NP-hard

– Some “simple” graphs have high tree-width (e.g., an M × N “grid” MRF has tree-width min(M, N))

22
Extensions

• The difficulty with variable elimination as stated is that we need to “rerun” the algorithm each time we want to make an inference query

• Solution: a slight extension of variable elimination that caches intermediate factors, making a forward and backward pass over all variables (the Junction Tree or Clique Tree algorithm)

• You’ll probably see these algorithms written in terms of message passing, but these “messages” are just the intermediate factors f̃

23
MAP Inference
• Virtually identical approach can be applied to MAP inference

• The only change is replacing the sum-product operation

f̃(X̃) ← ∑_{xi} ∏_{fj ∈ F} fj(Xj)

with the max-product operation

f̃(X̃) ← max_{xi} ∏_{fj ∈ F} fj(Xj)

• If we want to find the actual maximizing assignment, we also need to keep a separate record of which xi value is maximal for each f̃(X̃)
24
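A sketch of the corresponding max-product elimination step, again in the informal (scope, table) encoding used in the earlier sketches; the extra returned table records, for each assignment of the remaining variables, which value of xi achieved the maximum, which is exactly the bookkeeping needed to recover the maximizing assignment.

from itertools import product

def max_eliminate(factors, xi):
    """Max-product elimination of xi; also returns, for each assignment of
    the new factor's scope, the xi value that achieved the max."""
    F = [f for f in factors if xi in f[0]]
    rest = [f for f in factors if xi not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in F for v in scope} - {xi}))
    new_table, argmax_xi = {}, {}
    for bits in product([0, 1], repeat=len(new_scope)):
        assignment = dict(zip(new_scope, bits))
        best_val, best_xi = -1.0, None
        for xi_val in (0, 1):                       # max over xi instead of sum
            assignment[xi] = xi_val
            val = 1.0
            for scope, table in F:
                val *= table[tuple(assignment[v] for v in scope)]
            if val > best_val:
                best_val, best_xi = val, xi_val
        new_table[bits] = best_val
        argmax_xi[bits] = best_xi
    return rest + [(new_scope, new_table)], argmax_xi

# Usage with arbitrary illustrative numbers; tracing the stored argmax tables
# backwards after eliminating every variable recovers the MAP assignment.
factors = [(("x1",), {(0,): 1.0, (1,): 3.0}),
           (("x1", "x2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0})]
reduced, back = max_eliminate(factors, "x1")
print(reduced, back)   # back[(x2,)] is the best x1 value for each x2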
Outline

Probabilistic graphical models

Exact inference

Approximate inference

25
Sampling methods

• Instead of exactly computing probabilities p(x), we may want to draw random samples from this distribution, x ∼ p(x)

• For example, in Bayesian networks this is straightforward: just sample the individual variables sequentially, parents before children,

xi ∼ p(xi | Parents(xi)), i = 1, . . . , n

• For cases where we can efficiently perform variable elimination, a slightly modified procedure lets us draw random samples (perhaps conditioned on evidence)

26
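For the alarm network from earlier, this sequential (ancestral) sampling looks like the sketch below: each variable is drawn from its CPT after its parents have been drawn, and averaging many samples gives Monte Carlo estimates of marginals. The CPT values are the ones from the example slide; the function names are our own.

import random

def bernoulli(p):
    return 1 if random.random() < p else 0

def sample_alarm_network():
    """Draw one sample x = (x1, ..., x5), sampling parents before children."""
    x1 = bernoulli(0.001)                                   # Burglary
    x2 = bernoulli(0.002)                                   # Earthquake
    x3 = bernoulli({(0, 0): 0.001, (0, 1): 0.29,
                    (1, 0): 0.94,  (1, 1): 0.95}[(x1, x2)]) # Alarm
    x4 = bernoulli({0: 0.05, 1: 0.9}[x3])                   # JohnCalls
    x5 = bernoulli({0: 0.01, 1: 0.7}[x3])                   # MaryCalls
    return (x1, x2, x3, x4, x5)

samples = [sample_alarm_network() for _ in range(100000)]
print(sum(s[3] for s in samples) / len(samples))   # Monte Carlo estimate of p(x4 = 1)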
Gibbs sampling
• But what about cases too big for variable elimination?

• A common solution: Gibbs sampling

function x = Gibbs-Sampling(G, x, K)
    for k = 1, . . . , K:
        Choose a random xi
        Sample xi ∼ p(xi | x−i) ∝ ∏_{fj : (fj, xi) ∈ E} fj(Xj)

• In the limit, x will be drawn exactly according to the desired distribution (but this may take exponentially long to converge)

• One of a broad class of methods called Markov Chain Monte Carlo (MCMC)
27
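A sketch of the Gibbs sampler for the informal (scope, table) factor encoding used in the earlier sketches; as in the pseudocode, only the factors touching the chosen xi are needed to form p(xi | x−i), which is what makes each update cheap. The factor values and starting assignment here are arbitrary.

import random

def gibbs_sampling(factors, x, K):
    """K single-variable Gibbs updates on the assignment dict x (binary variables)."""
    variables = list(x)
    for _ in range(K):
        xi = random.choice(variables)
        # Unnormalized p(xi = v | x_{-i}): product of the factors touching xi
        weights = []
        for v in (0, 1):
            x[xi] = v
            w = 1.0
            for scope, table in factors:
                if xi in scope:
                    w *= table[tuple(x[u] for u in scope)]
            weights.append(w)
        # Sample xi from the normalized two-point conditional distribution
        x[xi] = 1 if random.random() < weights[1] / (weights[0] + weights[1]) else 0
    return x

factors = [(("x1",), {(0,): 1.0, (1,): 3.0}),
           (("x1", "x2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0})]
x = gibbs_sampling(factors, {"x1": 0, "x2": 0}, K=10000)
print(x)   # one (approximate) sample from p(x)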
Inference as optimization

• Inference in graphical models can be cast as an optimization problem; this has been a huge source of ideas for improving exact and approximate inference methods

• We’re going to consider the simpler case of MAP inference, which already looks like an optimization problem

maximize_x p(x)

• To put this in a form that we’re more familiar with, for each factor fi define the optimization variable µi ∈ R^{2^{|Xi|}}; µi should be thought of as an indicator for the assignment to Xi

28
• Abusing notation a bit, we can write the optimization as a binary integer program

maximize_{µ1,...,µm}   log p(µ) = ∑_{i=1}^m µi^T (log fi)
subject to             µ1, . . . , µm is a valid distribution
                       (µi)j ∈ {0, 1}, ∀ i, j

• “Valid distribution” here means the assignments have to be consistent, i.e., if xk ∈ Xi and xk ∈ Xj, then

∑_{Xi − {xk}} µi(Xi) = ∑_{Xj − {xk}} µj(Xj)

and each µi has to have only one non-zero entry, ∑_j (µi)j = 1

29
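To make the µi notation concrete, the sketch below (with arbitrary factor values, our own choice) builds each one-hot indicator µi for a particular full assignment and checks that ∑i µi^T log fi equals the log of the unnormalized probability of that assignment.

import math
from itertools import product

factors = [(("x1",), {(0,): 1.0, (1,): 3.0}),
           (("x1", "x2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0})]

assignment = {"x1": 1, "x2": 0}

objective = 0.0
for scope, table in factors:
    # Enumerate the 2^{|Xi|} joint assignments of this factor's scope
    keys = list(product([0, 1], repeat=len(scope)))
    # mu_i: one-hot indicator of which assignment of X_i is selected
    mu = [1.0 if k == tuple(assignment[v] for v in scope) else 0.0 for k in keys]
    log_f = [math.log(table[k]) for k in keys]
    objective += sum(m * lf for m, lf in zip(mu, log_f))      # mu_i^T log f_i

# Same number computed directly: log of the unnormalized probability
direct = sum(math.log(table[tuple(assignment[v] for v in scope)])
             for scope, table in factors)
print(objective, direct)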
• This is still a hard binary integer programming task, but it turns out that the LP relaxation is sometimes tight (i.e., just removing the integer constraints still gives the optimal solution)

• One case where the relaxation is tight: tree factor graphs (these are the ones we could already solve with max-product)

• Extremely cool: there are other cases where the relaxation is still tight even though naive max-product doesn’t apply, like certain grid MRFs

• Can also apply this to the case of marginal inference (let the µ terms have non-integer values, but also include terms due to the partition function, plus other constraints)

• A big area of open research


30
Take home points

• Probabilistic models can compactly represent high dimensional probability distributions

• Inference algorithms provide a method for making probabilistic queries that also (try to) exploit the structure of the distribution

• There is a wide range of inference methods, from variable elimination for exact inference to sampling and optimization approaches for approximate inference

31
