
Lecture 10: Bayesian Networks and Inference

CS 580 (001) - Spring 2018

Amarda Shehu

Department of Computer Science


George Mason University, Fairfax, VA, USA

May 02, 2018



1 Outline of Today’s Class – Bayesian Networks and Inference

2 Bayesian Networks
Syntax
Semantics
Parameterized Distributions

3 Inference on Bayesian Networks


Exact Inference by Enumeration
Exact Inference by Variable Elimination
Approximate Inference by Stochastic Simulation
Approximate Inference by Markov Chain Monte Carlo (MCMC)
Digging Deeper...

Bayesian Networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions

Syntax:
a set of nodes, one per variable
a directed, acyclic graph (link ≈ “directly influences”)
a conditional distribution for each node given its parents: P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values



Example

Topology of network encodes conditional independence assertions:

Weather is independent of the other variables

Toothache and Catch are conditionally independent given Cavity



Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t
call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects “causal” knowledge:


– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call



Example

[Figure: the burglary network, with CPTs for Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls]


Compactness

A CPT for a Boolean X_i with k Boolean parents has 2^k rows, one for each combination of parent values

Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p)

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)



Global Semantics

Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
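This product is easy to check in code. A minimal Python sketch, assuming the usual textbook CPT values for the burglary network (the lecture quotes only some of them; the three remaining Alarm entries, 0.95, 0.94, and 0.29, are assumptions):

P_b, P_e = 0.001, 0.002                              # P(Burglary), P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,      # P(Alarm=t | B, E); the
       (False, True): 0.29, (False, False): 0.001}   # first three are assumed
P_j = {True: 0.90, False: 0.05}                      # P(JohnCalls=t | Alarm)
P_m = {True: 0.70, False: 0.01}                      # P(MaryCalls=t | Alarm)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)  # ≈ 0.00063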


Local Semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents

Theorem: local semantics ⇔ global semantics



Markov Blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents



Constructing Bayesian Networks

Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X_1, ..., X_n

2. For i = 1 to n:
   add X_i to the network
   select parents from X_1, ..., X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

This choice of parents guarantees the global semantics:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})   (chain rule)
                 = ∏_{i=1}^{n} P(X_i | Parents(X_i))   (by construction)



Example

Suppose we choose the ordering M, J, A, B, E

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? No. P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? No
P(E|B, A, J, M) = P(E|A, B)? Yes


Example

Deciding conditional independence is hard in noncausal directions (causal models and conditional independence seem hardwired for humans!)

Assessing conditional probabilities is hard in noncausal directions

The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed



Example: Car Diagnosis

Initial evidence: the car won't start

Testable variables (green), "broken, so fix it" variables (orange); hidden variables (gray) ensure sparse structure and reduce parameters



Example: Car Insurance

[Figure: the car insurance network]


Compact Conditional Distributions

The CPT grows exponentially with the number of parents, and becomes infinite with a continuous-valued parent or child

Solution: canonical distributions that are defined compactly

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:

∂Level/∂t = inflow + precipitation − outflow − evaporation



Compact Conditional Distributions

Noisy-OR distributions model multiple noninteracting causes:

1) Parents U_1, ..., U_k include all causes (can add a leak node)
2) Independent failure probability q_i for each cause alone
⇒ P(X | U_1 ... U_j, ¬U_{j+1} ... ¬U_k) = 1 − ∏_{i=1}^{j} q_i

Cold  Flu  Malaria  P(Fever)  P(¬Fever)
F     F    F        0.0       1.0
F     F    T        0.9       0.1
F     T    F        0.8       0.2
F     T    T        0.98      0.02 = 0.2 × 0.1
T     F    F        0.4       0.6
T     F    T        0.94      0.06 = 0.6 × 0.1
T     T    F        0.88      0.12 = 0.6 × 0.2
T     T    T        0.988     0.012 = 0.6 × 0.2 × 0.1

The number of parameters is linear in the number of parents
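Each P(Fever) entry above follows from just the three single-cause failure probabilities. A short Python sketch that regenerates the table (the q values are read off the P(¬Fever) column of the single-cause rows):

from itertools import product

q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # failure probability per cause

def p_fever(active_causes):
    # P(Fever | exactly these causes present) = 1 - product of their q's
    p_not_fever = 1.0
    for cause in active_causes:
        p_not_fever *= q[cause]
    return 1.0 - p_not_fever

for cold, flu, malaria in product((False, True), repeat=3):
    active = [name for name, on in
              zip(("Cold", "Flu", "Malaria"), (cold, flu, malaria)) if on]
    print(cold, flu, malaria, round(p_fever(active), 3))   # matches the table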



Hybrid (Discrete+Continuous) Networks

Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost)

Option 1: discretization; possibly large errors and large CPTs

Option 2: finitely parameterized canonical families
1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)



Continuous Child Variables

Need one conditional density function for child variable given continuous parents, for
each possible assignment to discrete parents

The most common choice is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t²)(c)
= (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)

The mean Cost varies linearly with Harvest; the variance is fixed

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
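A minimal sketch of evaluating this density in Python; the parameter values a_t, b_t, σ_t below are hypothetical, since the lecture gives none:

import math

a_t, b_t, sigma_t = -1.0, 10.0, 0.5   # hypothetical linear Gaussian parameters

def p_cost_given(c, h):
    # density N(a_t*h + b_t, sigma_t^2) evaluated at c
    mu = a_t * h + b_t
    z = (c - mu) / sigma_t
    return math.exp(-0.5 * z * z) / (sigma_t * math.sqrt(2 * math.pi))

print(p_cost_given(c=7.0, h=3.0))   # peak density: mean cost is 7.0 at h = 3.0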



Continuous Child Variables

An all-continuous network with LG distributions ⇒ the full joint distribution is a multivariate Gaussian

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values



Discrete Variable w/ Continuous Parents

Probability of Buys? given Cost should be a “soft” threshold:

The probit distribution uses the integral of the Gaussian:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt

P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)



Why the probit?

1. It's sort of the right shape

2. It can be viewed as a hard threshold whose location is subject to noise



Discrete Variable

The sigmoid (or logit) distribution is also used in neural networks:

P(Buys? = true | Cost = c) = 1 / (1 + exp(−2 (−c + µ)/σ))

The sigmoid has a similar shape to the probit but much longer tails.
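A small Python sketch comparing the two, with hypothetical µ and σ, using Φ(x) = (1 + erf(x/√2))/2:

import math

mu, sigma = 6.0, 1.0   # hypothetical threshold location and noise scale

def probit(c):
    x = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(c):
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

for c in (4.0, 6.0, 8.0, 12.0):
    print(c, probit(c), sigmoid(c))
# At c = 12 the probit is ~1e-9 while the sigmoid is still ~6e-6: longer tails.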



Summary on Bayesian Networks

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution

Generally easy for (non)experts to construct

Canonical distributions (e.g., noisy-OR) = compact representation of CPTs

Continuous variables =⇒ parameterized distributions (e.g., linear Gaussian)

Next: Inference on Bayesian Networks





Inference Tasks

Simple queries: compute the posterior marginal P(X_i | E = e), e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)

Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)

Optimal decisions: decision networks include utility information; probabilistic inference is required for P(outcome | action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?



Inference by Enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing
its explicit representation

Simple query on the burglary network:

P(B | j, m)
= P(B, j, m) / P(j, m)
= α P(B, j, m)
= α Σ_e Σ_a P(B, e, a, j, m)

Rewrite the full joint entries using products of CPT entries:

P(B | j, m)
= α Σ_e Σ_a P(B) P(e) P(a|B, e) P(j|a) P(m|a)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time



Enumeration Algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
      extend e with value x_i for X
      Q(x_i) ← Enumerate-All(Vars[bn], e)
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
      then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
      else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
           where e_y is e extended with Y = y
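A compact Python transcription for the burglary network (CPT values beyond those quoted in the lecture are the standard textbook numbers, assumed here):

vars_order = ["B", "E", "A", "J", "M"]   # a topological order

def cpt(var, val, ev):
    # P(var = val | values of var's parents in ev)
    if var == "B":
        p = 0.001
    elif var == "E":
        p = 0.002
    elif var == "A":
        p = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}[(ev["B"], ev["E"])]
    elif var == "J":
        p = 0.90 if ev["A"] else 0.05
    else:  # var == "M"
        p = 0.70 if ev["A"] else 0.01
    return p if val else 1.0 - p

def enumerate_all(remaining, ev):
    if not remaining:
        return 1.0
    Y, rest = remaining[0], remaining[1:]
    if Y in ev:   # evidence: multiply in its CPT entry and recurse
        return cpt(Y, ev[Y], ev) * enumerate_all(rest, ev)
    # hidden variable: sum over both of its values
    return sum(cpt(Y, y, {**ev, Y: y}) * enumerate_all(rest, {**ev, Y: y})
               for y in (True, False))

def enumeration_ask(X, ev):
    Q = {x: enumerate_all(vars_order, {**ev, X: x}) for x in (True, False)}
    z = Q[True] + Q[False]
    return {x: Q[x] / z for x in Q}   # Normalize

print(enumeration_ask("B", {"J": True, "M": True}))   # P(B | j, m) ≈ 0.284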



Evaluation Tree

Enumeration is inefficient because of repeated computation: e.g., it computes P(j|a) P(m|a) for each value of e



Inference by Variable Elimination

Variable elimination reduces the complexity of exact inference

It uses memoization to avoid redundant calculations (intermediate results are stored in factors)

P(B | j, m)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)
  (one factor per variable: B, E, A, J, M)
= α f_1(B) Σ_e f_2(E) Σ_a f_3(A, B, E) f_4(A) f_5(A)
= α f_1(B) Σ_e f_2(E) f_6(B, E)   (pointwise product, then sum out A)
= α f_1(B) f_7(B)   (sum out E)

Basic operations: pointwise product and summation of factors

Direction: carry out the summations right-to-left

Examples of factors:
f_4(A) is ⟨P(j|a), P(j|¬a)⟩ = ⟨0.90, 0.05⟩; f_5(A) is ⟨P(m|a), P(m|¬a)⟩ = ⟨0.70, 0.01⟩
f_3(A, b, E) is a matrix of two rows, ⟨P(a|b, e), P(¬a|b, e)⟩ and ⟨P(a|b, ¬e), P(¬a|b, ¬e)⟩
f_3(A, B, E) is a 2 × 2 × 2 matrix (considering also b and ¬b)



Variable Elimination: Basic Operations - Pointwise Product

Pointwise product: f_4(A) × f_5(A) = ⟨P(j|a) · P(m|a), P(j|¬a) · P(m|¬a)⟩

Corresponding entries in the vectors are multiplied, yielding another vector of the same size. This is equivalent to going bottom-up in the tree, keeping track of both children in a vector, and multiplying child with parent to "roll up" to the higher level.

Generally, the pointwise product of factors f_1 and f_2 is
f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l) = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
(the variable sets are unioned)

Example: f_1(a, b) × f_2(b, c) = f(a, b, c)

Rewrite f_4(A) as f(j, A) and f_5(A) as f(m, A); the rule gives f(j, A) × f(m, A) = f(j, m, A)

Correct: P(j|A) × P(m|A) = P(j, m|A) (because J and M are conditionally independent given their parent A)



Variable Elimination: Basic Operations - Summation

Consider f_3(A, b, E), which is a 2 × 2 matrix:

⟨P(a|b, e), P(¬a|b, e)⟩
⟨P(a|b, ¬e), P(¬a|b, ¬e)⟩   (each row corresponds to a branching point in the search tree)

"Summing out" A means taking the pointwise product on each branch and summing up at the parent

Example: what is Σ_a f_3(A, b, E) f_4(A) f_5(A)?

Let f_4(A) × f_5(A) be f(j, m, A) = ⟨P(j, m|a), P(j, m|¬a)⟩ (from the previous slide)
Take the pointwise product of the first row of f_3(A, b, E) with f(j, m, A)
Take the pointwise product of the second row of f_3(A, b, E) with f(j, m, A)
Sum the two rows to get a new factor f_6(b, E)

Generally, summing out a variable from a product of factors:
move any constant factors outside the summation, then
add up the submatrices in the pointwise product of the remaining factors:

Σ_x f_1 × · · · × f_k = f_1 × · · · × f_i Σ_x f_{i+1} × · · · × f_k = f_1 × · · · × f_i × f_X̄

assuming f_1, ..., f_i do not depend on X; the summation is needed to account for all values of the hidden variables (A, E)
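Both operations are a few lines in Python. A sketch, assuming a factor is stored as a (variable list, table) pair whose table is keyed by tuples of Boolean values:

from itertools import product

def pointwise_product(f1, f2):
    (vars1, t1), (vars2, t2) = f1, f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]   # union of vars
    table = {}
    for vals in product((True, False), repeat=len(out_vars)):
        asg = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(asg[v] for v in vars1)] *
                       t2[tuple(asg[v] for v in vars2)])
    return (out_vars, table)

def sum_out(var, f):
    vars_, t = f
    i = vars_.index(var)
    table = {}
    for vals, p in t.items():                 # add up submatrices over var
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return (vars_[:i] + vars_[i + 1:], table)

# f4(A) and f5(A) from the lecture; their product is <P(j,m|a), P(j,m|¬a)>
f4 = (["A"], {(True,): 0.90, (False,): 0.05})
f5 = (["A"], {(True,): 0.70, (False,): 0.01})
f_jm = pointwise_product(f4, f5)
print(f_jm[1])   # {(True,): 0.63, (False,): 0.0005}

# f3(A, B, E) built from the alarm CPT (the 0.95/0.94/0.29 entries are the
# usual textbook values, assumed), then A is summed out to give f6(B, E)
alarm = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}
f3 = (["A", "B", "E"], {(a, b, e): (p if a else 1.0 - p)
                        for (b, e), p in alarm.items()
                        for a in (True, False)})
f6 = sum_out("A", pointwise_product(f3, f_jm))
print(f6[0], f6[1])   # f6 over (B, E), as in the derivation above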
Variable Elimination Algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  factors ← [ ]; vars ← Reverse(Vars[bn])
  for each var in vars do
      factors ← [Make-Factor(var, e) | factors]
      if var is a hidden variable then factors ← Sum-Out(var, factors)
  return Normalize(Pointwise-Product(factors))

Every choice of variable ordering yields a sound algorithm:
Different orderings give different intermediate factors
Certain variable orderings can introduce irrelevant calculations
Finding the optimal ordering is intractable, but good heuristics exist



Irrelevant Variables

Consider the query P(JohnCalls | Burglary = true)

P(J|b) = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)

The sum over m is identically 1, so M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant

Hence the name, variable elimination algorithm





Irrelevant Variables

Defn: the moral graph of a Bayes net is obtained by marrying all parents and dropping the arrows

Defn: A is m-separated from B by C iff A is separated from B by C in the moral graph

Thm 2: Y is irrelevant if it is m-separated from X by E

For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant



Complexity of Exact Inference

Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– worst-case time and space cost of a query is O(n)
– worst-case time and space cost of n queries is O(n²)

Multiply connected networks:
– worst-case time and space cost are exponential, O(n · d^n) (n queries, d values per r.v.)
– NP-hard and #P-complete
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete



Taming Exact Inference

How to reduce time? Identify structure in the BN, as in the CSP setting: group variables together to "reduce" the network to a polytree

How? Cluster variables together (join tree algorithms)

The parents of a node can be grouped into a meta-parent node (meganode)

As in CSPs, meganodes may share variables, so a special inference algorithm is needed

The algorithm takes care of constraint propagation so that meganodes agree on the posterior probability of shared variables

No free lunch, so what gives? The exponential time cost is hidden in the combined CPTs, which can become exponentially large





Giving up on Exact Inference: Go for Approximate Instead

Or...

Give up on exact inference

Go for approximate inference algorithms...

that use sampling (Monte Carlo-based) to estimate posterior probabilities


Inference by Stochastic Simulation (Sampling-based)

Basic idea:

1) Draw N samples from a sampling distribution S
   (Can you draw N samples for the r.v. Coin from the probability distribution P(Coin) = [0.5, 0.5]?)
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Direct sampling: sampling from an empty network
– Rejection sampling: reject samples disagreeing with the evidence
– Likelihood weighting: use the evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior



Direct Sampling: Sampling from an Empty Network

"Empty" refers to the absence of any evidence; used to estimate joint probabilities

Main idea:

Sample each r.v. in turn, in topological order, from parents to children

Once a parent is sampled, its value is fixed and used to sample its children

Events generated via this direct sampling observe the joint probability distribution

To get the (prior) probability of an event, sample many times, so that the frequency of "observing" it among the samples approaches its probability

Example next





Direct Sampling Example

[Figure sequence: direct sampling of the sprinkler network, assigning Cloudy, Sprinkler, Rain, and WetGrass one at a time in topological order]


PRIOR-SAMPLE Algorithm for Direct Sampling

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for i = 1 to n do
      x_i ← a random sample from P(X_i | parents(X_i)), given the values of Parents(X_i) in x
  return x
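A Python sketch of Prior-Sample for the sprinkler network; CPT entries not quoted in this lecture (e.g., P(Sprinkler = t | ¬Cloudy)) are the standard textbook values, assumed here:

import random

def prior_sample():
    c = random.random() < 0.5                         # P(Cloudy) = 0.5
    s = random.random() < (0.1 if c else 0.5)         # P(Sprinkler=t | C)
    r = random.random() < (0.8 if c else 0.2)         # P(Rain=t | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,   # P(WetGrass=t | S, R)
           (False, True): 0.90, (False, False): 0.0}[(s, r)]
    w = random.random() < p_w
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

# Frequencies approach the joint, e.g., S_PS(t, f, t, t) = 0.324 (next slide)
N = 100_000
count = 0
for _ in range(N):
    e = prior_sample()
    if e["Cloudy"] and not e["Sprinkler"] and e["Rain"] and e["WetGrass"]:
        count += 1
print(count / N)   # ≈ 0.324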



Direct Sampling Continued

The probability that Prior-Sample generates a particular event x_1 ... x_n is

S_PS(x_1 ... x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i)) = P(x_1 ... x_n)

i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x_1 ... x_n) be the number of samples generated for the event x_1, ..., x_n. Then we have

lim_{N→∞} P̂(x_1, ..., x_n) = lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1 ... x_n)

That is, estimates derived from Prior-Sample are consistent (they become exact in the large-sample limit)

Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1 ... x_n)

Problem: N needs to be sufficiently large to sample "rare events"


Rejection Sampling (for Conditional Probabilities P(X |e))

Main idea:

The given distribution is too hard to sample from directly, so use an easy-to-sample distribution for direct sampling and then reject the samples that disagree with the evidence

(1) Use direct sampling to sample (X, E) events from the prior distribution in the BN
(2) Determine whether (X, E) is consistent with the given evidence e
(3) Get P̂(X | E = e) by counting how often (E = e) and (X, E = e) occur, as per the definition of conditional probability:

P̂(X | E = e) = N(X, E = e) / N(E = e)

Example: estimate P(Rain | Sprinkler = true) using 100 samples

Generate 100 samples for Cloudy, Sprinkler, Rain, WetGrass via direct sampling
27 samples have Sprinkler = true (the event of interest)
Of these, 8 have Rain = true and 19 have Rain = false

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨8/27, 19/27⟩ = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure



Rejection Sampling

P̂(X|e) is estimated from the samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
          N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
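On top of the prior_sample() sketch above, rejection sampling is just a filtered count; this mirrors the worked example, estimating P(Rain | Sprinkler = true):

def rejection_sampling(n):
    counts = {True: 0, False: 0}
    for _ in range(n):
        e = prior_sample()
        if e["Sprinkler"]:               # keep only samples consistent with e
            counts[e["Rain"]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

print(rejection_sampling(10_000))   # roughly <0.3, 0.7>, close to <0.296, 0.704>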



Analysis of Rejection Sampling

P̂(X|e) = α N_PS(X, e)   (algorithm defn.)
        = N_PS(X, e) / N_PS(e)   (normalized by N_PS(e))
        ≈ P(X, e) / P(e)   (property of Prior-Sample)
        = P(X|e)   (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

The standard deviation of the error in each probability is proportional to 1/√n, where n is the number of samples

Problem: if e is a very rare event, most samples are rejected; hopelessly expensive if P(e) is small

P(e) drops off exponentially with the number of evidence variables!

Rejection sampling is therefore unusable for complex problems → use likelihood weighting instead



Likelihood Weighting

A form of importance sampling (for BNs)

Main idea:

Generate only events that are consistent with the given values e of the evidence variables E

Fix the evidence variables to their given values; sample only the nonevidence variables

Weight each sample by the likelihood it accords the evidence (how likely e is, given the sample)

Example: query P(Rain | Sprinkler = true, WetGrass = true)

Consider the r.v.s in some topological ordering

Set w = 1.0 (the weight will be a running product)

If r.v. X_i is an evidence variable (Sprinkler or WetGrass in this example), w ← w × P(X_i = x_i | Parents(X_i))

Else, sample X_i from P(X_i | Parents(X_i))

When all r.v.s have been considered, normalize the weighted counts to turn them into probabilities



Likelihood Weighting Example: P(Rain | Sprinkler = t, WetGrass = t)

Cloudy is considered first; it is nonevidence, so sample it; w = 1.0
Say Cloudy = t is sampled

Sprinkler is considered next; it is an evidence variable, so update w:
w = w × P(Sprinkler = t | Cloudy = t) = 1.0 × 0.1

Rain is considered next; it is nonevidence, so sample it from the BN (with Cloudy = t from before); w does not change
Say Rain = t is sampled; w = 1.0 × 0.1

The last r.v. is WetGrass, an evidence variable, so update w:
w = w × P(WetGrass = t | Sprinkler = t, Rain = t) = 1.0 × 0.1 × 0.99 = 0.099
(this is not a probability but the weight of this sample)


Likelihood Weighting Algorithm

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
      x, w ← Weighted-Sample(bn, e)
      W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
      if X_i has a value x_i in e
          then w ← w × P(X_i = x_i | parents(X_i))
          else x_i ← a random sample from P(X_i | parents(X_i))
  return x, w
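A Python sketch of Weighted-Sample specialized to the running query P(Rain | Sprinkler = t, WetGrass = t); CPT values not quoted in the lecture are the standard textbook numbers, assumed here:

import random

def weighted_sample():
    w = 1.0
    c = random.random() < 0.5                    # Cloudy: nonevidence, sample
    w *= 0.1 if c else 0.5                       # Sprinkler = t: evidence
    r = random.random() < (0.8 if c else 0.2)    # Rain: nonevidence, sample
    w *= 0.99 if r else 0.90                     # WetGrass = t: evidence (S = t)
    return r, w

def likelihood_weighting(n):
    W = {True: 0.0, False: 0.0}                  # weighted counts over Rain
    for _ in range(n):
        r, w = weighted_sample()
        W[r] += w
    z = W[True] + W[False]
    return {v: W[v] / z for v in W}

print(likelihood_weighting(100_000))   # ≈ {True: 0.32, False: 0.68}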



Likelihood Weighting Analysis

The sampling probability for Weighted-Sample is

S_WS(z, e) = ∏_{i=1}^{l} P(z_i | parents(Z_i))

Note: it pays attention to evidence in ancestors only ⇒ somewhere "in between" the prior and the posterior distribution

The weight for a given sample z, e is

w(z, e) = ∏_{i=1}^{m} P(e_i | parents(E_i))

The weighted sampling probability is

S_WS(z, e) w(z, e) = ∏_{i=1}^{l} P(z_i | parents(Z_i)) ∏_{i=1}^{m} P(e_i | parents(E_i))
                   = P(z, e)   (by the standard global semantics of the network)



Likelihood Weighting Analysis Continued

Likelihood weighting returns consistent estimates

The variable ordering actually matters

Performance degrades as the number of evidence variables increases: a few samples end up with nearly all the total weight; most samples have very low weights, and the estimate is dominated by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence

This is exacerbated when evidence variables occur late in the ordering: nonevidence variables then have no evidence in their parents to guide the generation of samples, and the simulated samples bear little resemblance to the reality suggested by the evidence

Change of framework: do not sample directly (from scratch), but modify the preceding sample



Approximate Inference using MCMC

Main idea:

Markov chain Monte Carlo (MCMC) algorithms generate each sample by making a random change to the preceding sample

Concept of a current state: it specifies a value for every r.v.

"State" of the network = current assignment to all variables

A random change to the current state yields the next state

A form of MCMC: Gibbs sampling



Gibbs Sampling to Estimate P(X |e)

The initial state has the evidence variables assigned as provided

The next state is generated by randomly sampling values for the nonevidence variables

Each nonevidence variable Z_i is sampled in turn, given its Markov blanket mb(Z_i)

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N[X], a vector of counts over X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
      for each Z_i in Z do
          sample the value of Z_i in x from P(Z_i | mb(Z_i)), given the values of MB(Z_i) in x
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
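A Python sketch of Gibbs sampling for P(Rain | Sprinkler = t, WetGrass = t). Here P(Z_i | mb(Z_i)) is obtained by scoring Z_i = t and Z_i = f under the full joint, since factors that do not mention Z_i cancel in the ratio; CPT values not quoted in the lecture are the standard textbook numbers, assumed:

import random

P_s = {True: 0.1, False: 0.5}                      # P(Sprinkler=t | Cloudy)
P_r = {True: 0.8, False: 0.2}                      # P(Rain=t | Cloudy)
P_w = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(WetGrass=t | S, R)

def joint(st):
    p = 0.5   # P(Cloudy=t) = P(Cloudy=f) = 0.5
    p *= P_s[st["Cloudy"]] if st["Sprinkler"] else 1.0 - P_s[st["Cloudy"]]
    p *= P_r[st["Cloudy"]] if st["Rain"] else 1.0 - P_r[st["Cloudy"]]
    pw = P_w[(st["Sprinkler"], st["Rain"])]
    p *= pw if st["WetGrass"] else 1.0 - pw
    return p

def sample_given_rest(var, state):
    # sample var from P(var | all other variables), via the joint-score ratio
    pt = joint(dict(state, **{var: True}))
    pf = joint(dict(state, **{var: False}))
    return random.random() < pt / (pt + pf)

def gibbs_ask(n):
    state = {"Sprinkler": True, "WetGrass": True,   # evidence, fixed
             "Cloudy": random.random() < 0.5,       # nonevidence, random init
             "Rain": random.random() < 0.5}
    counts = {True: 0, False: 0}
    for _ in range(n):
        for z in ("Cloudy", "Rain"):                # nonevidence variables
            state[z] = sample_given_rest(z, state)
        counts[state["Rain"]] += 1
    return counts[True] / n

print(gibbs_ask(100_000))   # ≈ 0.32, matching the likelihood weighting run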



The Markov Chain

With Sprinkler = true, WetGrass = true, there are four states, one per assignment to (Cloudy, Rain)

Wander about for a while, average what you see



MCMC Example Continued

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket; repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states: 31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: the chain approaches its stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability



Markov Blanket Sampling

The Markov blanket of Cloudy is Sprinkler and Rain

The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

The probability given the Markov blanket is calculated as follows:

P(x′_i | mb(X_i)) ∝ P(x′_i | parents(X_i)) ∏_{Z_j ∈ Children(X_i)} P(z_j | parents(Z_j))

Easily implemented in message-passing parallel systems, brains

Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if the Markov blanket is large: P(X_i | mb(X_i)) won't change much (law of large numbers)



Summary on Inference on Bayesian Networks

Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by LW and MCMC:
– LW does poorly when there is lots of (downstream) evidence
– LW and MCMC are generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– both can handle arbitrary combinations of discrete and continuous variables



For Those that Want to Dig Deeper...

♦ MCMC Analysis

♦ Stationarity

♦ Detailed Balance

♦ General Gibbs Sampling



MCMC Analysis: Outline

Transition probability q(x → x′)

Occupancy probability π_t(x) at time t

Equilibrium condition on π_t defines the stationary distribution π(x)
Note: the stationary distribution depends on the choice of q(x → x′)

Pairwise detailed balance on states guarantees equilibrium

Gibbs sampling transition probability: sample each variable given the current values of all the others
⇒ detailed balance with the true posterior

For Bayesian networks, Gibbs sampling reduces to sampling conditioned on each variable's Markov blanket



Stationary Distribution

π_t(x) = probability of being in state x at time t

π_{t+1}(x′) = probability of being in state x′ at time t + 1

π_{t+1} in terms of π_t and q(x → x′):

π_{t+1}(x′) = Σ_x π_t(x) q(x → x′)

Stationary distribution: π_t = π_{t+1} = π

π(x′) = Σ_x π(x) q(x → x′)   for all x′

If π exists, it is unique (specific to q(x → x′))

In equilibrium, expected "outflow" = expected "inflow"



Detailed Balance

"Outflow" = "inflow" for each pair of states:

π(x) q(x → x′) = π(x′) q(x′ → x)   for all x, x′

Detailed balance ⇒ stationarity:

Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′)

MCMC algorithms are typically constructed by designing a transition probability q that is in detailed balance with the desired π
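These conditions are easy to verify numerically. A tiny Python sketch on a hypothetical two-state chain: π = (0.25, 0.75), with q chosen so that π(x) q(x → x′) = π(x′) q(x′ → x):

pi = [0.25, 0.75]
q = [[0.7, 0.3],    # q(0 -> 0), q(0 -> 1)
     [0.1, 0.9]]    # q(1 -> 0), q(1 -> 1)

# detailed balance on the only off-diagonal pair: 0.25*0.3 == 0.75*0.1
assert abs(pi[0] * q[0][1] - pi[1] * q[1][0]) < 1e-12

# hence pi is stationary: pi(x') = sum_x pi(x) q(x -> x') reproduces pi
for x2 in range(2):
    print(x2, sum(pi[x] * q[x][x2] for x in range(2)))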



Gibbs Sampling

Sample each variable in turn, given all the other variables

When sampling X_i, let X̄_i be all the other nonevidence variables; the current values are x_i and x̄_i; e is fixed

The transition probability is

q(x → x′) = q(x_i, x̄_i → x′_i, x̄_i) = P(x′_i | x̄_i, e)

This gives detailed balance with the true posterior P(x|e):

π(x) q(x → x′) = P(x|e) P(x′_i | x̄_i, e) = P(x_i, x̄_i | e) P(x′_i | x̄_i, e)
               = P(x_i | x̄_i, e) P(x̄_i | e) P(x′_i | x̄_i, e)   (chain rule)
               = P(x_i | x̄_i, e) P(x′_i, x̄_i | e)   (chain rule backwards)
               = q(x′ → x) π(x′) = π(x′) q(x′ → x)



Performance of Approximation Algorithms

Absolute approximation: |P(X|e) − P̂(X|e)| ≤ ε

Relative approximation: |P(X|e) − P̂(X|e)| / P(X|e) ≤ ε

Relative ⇒ absolute since 0 ≤ P ≤ 1 (but not vice versa, since P may be O(2^{−n}))

Randomized algorithms may fail with probability at most δ

Polytime approximation: poly(n, ε^{−1}, log δ^{−1})

Theorem (Dagum and Luby, 1993): both absolute and relative approximation for either deterministic or randomized algorithms are NP-hard for any ε, δ < 0.5

(Absolute approximation is polytime with no evidence, via Chernoff bounds)

