
Lecture 10: Bayesian Networks and Inference

CS 580 (001) - Spring 2018

Amarda Shehu

Department of Computer Science


George Mason University, Fairfax, VA, USA

May 02, 2018



1 Outline of Today’s Class – Bayesian Networks and Inference

2 Bayesian Networks
Syntax
Semantics
Parameterized Distributions

3 Inference on Bayesian Networks


Exact Inference by Enumeration
Exact Inference by Variable Elimination
Approximate Inference by Stochastic Simulation
Approximate Inference by Markov Chain Monte Carlo (MCMC)
Digging Deeper...

Bayesian Networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions

Syntax:
a set of nodes, one per variable
a directed, acyclic graph (link ≈ “directly influences”)
a conditional distribution for each node given its parents: P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values



Example

Topology of network encodes conditional independence assertions:

Weather is independent of the other variables

Toothache and Catch are conditionally independent given Cavity



Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t
call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects “causal” knowledge:


– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call



Example

[Figure: the burglary network, with CPTs for Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls]


Compactness

A CPT for a Boolean X_i with k Boolean parents has 2^k rows, one for each combination of parent values

Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p)

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)



Global Semantics

Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
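This product is easy to check in code. A minimal Python sketch, assuming the usual textbook CPT values for the burglary network (the lecture quotes only some of them; the three remaining Alarm entries, 0.95, 0.94, and 0.29, are assumptions):

P_b, P_e = 0.001, 0.002                              # P(Burglary), P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,      # P(Alarm=t | B, E); the
       (False, True): 0.29, (False, False): 0.001}   # first three are assumed
P_j = {True: 0.90, False: 0.05}                      # P(JohnCalls=t | Alarm)
P_m = {True: 0.70, False: 0.01}                      # P(MaryCalls=t | Alarm)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)  # ≈ 0.00063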


Local Semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents

Theorem: local semantics ⇔ global semantics



Markov Blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents



Constructing Bayesian Networks

Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X_1, ..., X_n

2. For i = 1 to n:
   add X_i to the network
   select parents from X_1, ..., X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

This choice of parents guarantees the global semantics:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})   (chain rule)
                 = ∏_{i=1}^{n} P(X_i | Parents(X_i))   (by construction)



Example

Suppose we choose the ordering M, J, A, B, E

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? No. P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? No
P(E|B, A, J, M) = P(E|A, B)? Yes


Example

Deciding conditional independence is hard in noncausal directions (causal models and conditional independence seem hardwired for humans!)

Assessing conditional probabilities is hard in noncausal directions

The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed



Example: Car Diagnosis

Initial evidence: the car won't start

Testable variables (green), "broken, so fix it" variables (orange); hidden variables (gray) ensure sparse structure and reduce parameters



Example: Car Insurance

[Figure: the car insurance network]


Compact Conditional Distributions

The CPT grows exponentially with the number of parents, and becomes infinite with a continuous-valued parent or child

Solution: canonical distributions that are defined compactly

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:

∂Level/∂t = inflow + precipitation − outflow − evaporation



Compact Conditional Distributions

Noisy-OR distributions model multiple noninteracting causes:

1) Parents U_1, ..., U_k include all causes (can add a leak node)
2) Independent failure probability q_i for each cause alone
⇒ P(X | U_1 ... U_j, ¬U_{j+1} ... ¬U_k) = 1 − ∏_{i=1}^{j} q_i

Cold  Flu  Malaria  P(Fever)  P(¬Fever)
F     F    F        0.0       1.0
F     F    T        0.9       0.1
F     T    F        0.8       0.2
F     T    T        0.98      0.02 = 0.2 × 0.1
T     F    F        0.4       0.6
T     F    T        0.94      0.06 = 0.6 × 0.1
T     T    F        0.88      0.12 = 0.6 × 0.2
T     T    T        0.988     0.012 = 0.6 × 0.2 × 0.1

The number of parameters is linear in the number of parents
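Each P(Fever) entry above follows from just the three single-cause failure probabilities. A short Python sketch that regenerates the table (the q values are read off the P(¬Fever) column of the single-cause rows):

from itertools import product

q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # failure probability per cause

def p_fever(active_causes):
    # P(Fever | exactly these causes present) = 1 - product of their q's
    p_not_fever = 1.0
    for cause in active_causes:
        p_not_fever *= q[cause]
    return 1.0 - p_not_fever

for cold, flu, malaria in product((False, True), repeat=3):
    active = [name for name, on in
              zip(("Cold", "Flu", "Malaria"), (cold, flu, malaria)) if on]
    print(cold, flu, malaria, round(p_fever(active), 3))   # matches the table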



Hybrid (Discrete+Continuous) Networks

Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost)

Option 1: discretization; possibly large errors and large CPTs

Option 2: finitely parameterized canonical families
1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)



Continuous Child Variables

Need one conditional density function for child variable given continuous parents, for
each possible assignment to discrete parents

The most common choice is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t²)(c)
= (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)

The mean Cost varies linearly with Harvest; the variance is fixed

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
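A minimal sketch of evaluating this density in Python; the parameter values a_t, b_t, σ_t below are hypothetical, since the lecture gives none:

import math

a_t, b_t, sigma_t = -1.0, 10.0, 0.5   # hypothetical linear Gaussian parameters

def p_cost_given(c, h):
    # density N(a_t*h + b_t, sigma_t^2) evaluated at c
    mu = a_t * h + b_t
    z = (c - mu) / sigma_t
    return math.exp(-0.5 * z * z) / (sigma_t * math.sqrt(2 * math.pi))

print(p_cost_given(c=7.0, h=3.0))   # peak density: mean cost is 7.0 at h = 3.0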



Continuous Child Variables

An all-continuous network with LG distributions ⇒ the full joint distribution is a multivariate Gaussian

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values



Discrete Variable w/ Continuous Parents

Probability of Buys? given Cost should be a “soft” threshold:

The probit distribution uses the integral of the Gaussian:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt

P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)



Why the probit?

1. It's sort of the right shape

2. It can be viewed as a hard threshold whose location is subject to noise



Discrete Variable

The sigmoid (or logit) distribution is also used in neural networks:

P(Buys? = true | Cost = c) = 1 / (1 + exp(−2 (−c + µ)/σ))

The sigmoid has a similar shape to the probit but much longer tails.
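A small Python sketch comparing the two, with hypothetical µ and σ, using Φ(x) = (1 + erf(x/√2))/2:

import math

mu, sigma = 6.0, 1.0   # hypothetical threshold location and noise scale

def probit(c):
    x = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(c):
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

for c in (4.0, 6.0, 8.0, 12.0):
    print(c, probit(c), sigmoid(c))
# At c = 12 the probit is ~1e-9 while the sigmoid is still ~6e-6: longer tails.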



Summary on Bayesian Networks

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution

Generally easy for (non)experts to construct

Canonical distributions (e.g., noisy-OR) = compact representation of CPTs

Continuous variables =⇒ parameterized distributions (e.g., linear Gaussian)

Next: Inference on Bayesian Networks





Inference Tasks

Simple queries: compute the posterior marginal P(X_i | E = e), e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)

Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)

Optimal decisions: decision networks include utility information; probabilistic inference is required for P(outcome | action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?



Inference by Enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing
its explicit representation

Simple query on the burglary network:

P(B | j, m)
= P(B, j, m) / P(j, m)
= α P(B, j, m)
= α Σ_e Σ_a P(B, e, a, j, m)

Rewrite the full joint entries using products of CPT entries:

P(B | j, m)
= α Σ_e Σ_a P(B) P(e) P(a|B, e) P(j|a) P(m|a)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time



Enumeration Algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
      extend e with value x_i for X
      Q(x_i) ← Enumerate-All(Vars[bn], e)
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
      then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
      else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
           where e_y is e extended with Y = y
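A compact Python transcription for the burglary network (CPT values beyond those quoted in the lecture are the standard textbook numbers, assumed here):

vars_order = ["B", "E", "A", "J", "M"]   # a topological order

def cpt(var, val, ev):
    # P(var = val | values of var's parents in ev)
    if var == "B":
        p = 0.001
    elif var == "E":
        p = 0.002
    elif var == "A":
        p = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}[(ev["B"], ev["E"])]
    elif var == "J":
        p = 0.90 if ev["A"] else 0.05
    else:  # var == "M"
        p = 0.70 if ev["A"] else 0.01
    return p if val else 1.0 - p

def enumerate_all(remaining, ev):
    if not remaining:
        return 1.0
    Y, rest = remaining[0], remaining[1:]
    if Y in ev:   # evidence: multiply in its CPT entry and recurse
        return cpt(Y, ev[Y], ev) * enumerate_all(rest, ev)
    # hidden variable: sum over both of its values
    return sum(cpt(Y, y, {**ev, Y: y}) * enumerate_all(rest, {**ev, Y: y})
               for y in (True, False))

def enumeration_ask(X, ev):
    Q = {x: enumerate_all(vars_order, {**ev, X: x}) for x in (True, False)}
    z = Q[True] + Q[False]
    return {x: Q[x] / z for x in Q}   # Normalize

print(enumeration_ask("B", {"J": True, "M": True}))   # P(B | j, m) ≈ 0.284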



Evaluation Tree

Enumeration is inefficient because of repeated computation: e.g., it computes P(j|a) P(m|a) for each value of e



Inference by Variable Elimination

Variable elimination reduces the complexity of exact inference

It uses memoization to avoid redundant calculations (intermediate results are stored in factors)

P(B | j, m)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)
  (one factor per variable: B, E, A, J, M)
= α f_1(B) Σ_e f_2(E) Σ_a f_3(A, B, E) f_4(A) f_5(A)
= α f_1(B) Σ_e f_2(E) f_6(B, E)   (pointwise product, then sum out A)
= α f_1(B) f_7(B)   (sum out E)

Basic operations: pointwise product and summation of factors

Direction: carry out the summations right-to-left

Examples of factors:
f_4(A) is ⟨P(j|a), P(j|¬a)⟩ = ⟨0.90, 0.05⟩; f_5(A) is ⟨P(m|a), P(m|¬a)⟩ = ⟨0.70, 0.01⟩
f_3(A, b, E) is a matrix of two rows, ⟨P(a|b, e), P(¬a|b, e)⟩ and ⟨P(a|b, ¬e), P(¬a|b, ¬e)⟩
f_3(A, B, E) is a 2 × 2 × 2 matrix (considering also b and ¬b)



Variable Elimination: Basic Operations - Pointwise Product

Pointwise product: f_4(A) × f_5(A) = ⟨P(j|a) · P(m|a), P(j|¬a) · P(m|¬a)⟩

Corresponding entries in the vectors are multiplied, yielding another vector of the same size. This is equivalent to going bottom-up in the tree, keeping track of both children in a vector, and multiplying child with parent to "roll up" to the higher level.

Generally, the pointwise product of factors f_1 and f_2 is
f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l) = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
(the variable sets are unioned)

Example: f_1(a, b) × f_2(b, c) = f(a, b, c)

Rewrite f_4(A) as f(j, A) and f_5(A) as f(m, A); the rule gives f(j, A) × f(m, A) = f(j, m, A)

Correct: P(j|A) × P(m|A) = P(j, m|A) (because J and M are conditionally independent given their parent A)



Variable Elimination: Basic Operations - Summation

Consider f_3(A, b, E), which is a 2 × 2 matrix:

⟨P(a|b, e), P(¬a|b, e)⟩
⟨P(a|b, ¬e), P(¬a|b, ¬e)⟩   (each row corresponds to a branching point in the search tree)

"Summing out" A means taking the pointwise product on each branch and summing up at the parent

Example: what is Σ_a f_3(A, b, E) f_4(A) f_5(A)?

Let f_4(A) × f_5(A) be f(j, m, A) = ⟨P(j, m|a), P(j, m|¬a)⟩ (from the previous slide)
Take the pointwise product of the first row of f_3(A, b, E) with f(j, m, A)
Take the pointwise product of the second row of f_3(A, b, E) with f(j, m, A)
Sum the two rows to get a new factor f_6(b, E)

Generally, summing out a variable from a product of factors:
move any constant factors outside the summation, then
add up the submatrices in the pointwise product of the remaining factors:

Σ_x f_1 × · · · × f_k = f_1 × · · · × f_i Σ_x f_{i+1} × · · · × f_k = f_1 × · · · × f_i × f_X̄

assuming f_1, ..., f_i do not depend on X; the summation is needed to account for all values of the hidden variables (A, E)
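Both operations are a few lines in Python. A sketch, assuming a factor is stored as a (variable list, table) pair whose table is keyed by tuples of Boolean values:

from itertools import product

def pointwise_product(f1, f2):
    (vars1, t1), (vars2, t2) = f1, f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]   # union of vars
    table = {}
    for vals in product((True, False), repeat=len(out_vars)):
        asg = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(asg[v] for v in vars1)] *
                       t2[tuple(asg[v] for v in vars2)])
    return (out_vars, table)

def sum_out(var, f):
    vars_, t = f
    i = vars_.index(var)
    table = {}
    for vals, p in t.items():                 # add up submatrices over var
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return (vars_[:i] + vars_[i + 1:], table)

# f4(A) and f5(A) from the lecture; their product is <P(j,m|a), P(j,m|¬a)>
f4 = (["A"], {(True,): 0.90, (False,): 0.05})
f5 = (["A"], {(True,): 0.70, (False,): 0.01})
f_jm = pointwise_product(f4, f5)
print(f_jm[1])   # {(True,): 0.63, (False,): 0.0005}

# f3(A, B, E) built from the alarm CPT (the 0.95/0.94/0.29 entries are the
# usual textbook values, assumed), then A is summed out to give f6(B, E)
alarm = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}
f3 = (["A", "B", "E"], {(a, b, e): (p if a else 1.0 - p)
                        for (b, e), p in alarm.items()
                        for a in (True, False)})
f6 = sum_out("A", pointwise_product(f3, f_jm))
print(f6[0], f6[1])   # f6 over (B, E), as in the derivation above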
Variable Elimination Algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  factors ← [ ]; vars ← Reverse(Vars[bn])
  for each var in vars do
      factors ← [Make-Factor(var, e) | factors]
      if var is a hidden variable then factors ← Sum-Out(var, factors)
  return Normalize(Pointwise-Product(factors))

Every choice of variable ordering yields a sound algorithm:
Different orderings give different intermediate factors
Certain variable orderings can introduce irrelevant calculations
Finding the optimal ordering is intractable, but good heuristics exist



Irrelevant Variables

Consider the query P(JohnCalls | Burglary = true)

P(J|b) = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)

The sum over m is identically 1, so M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant

Hence the name, variable elimination algorithm





Irrelevant Variables

Defn: the moral graph of a Bayes net is obtained by marrying all parents and dropping the arrows

Defn: A is m-separated from B by C iff A is separated from B by C in the moral graph

Thm 2: Y is irrelevant if it is m-separated from X by E

For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant



Complexity of Exact Inference

Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– worst-case time and space cost of a query is O(n)
– worst-case time and space cost of n queries is O(n²)

Multiply connected networks:
– worst-case time and space cost are exponential, O(n · d^n) (n queries, d values per r.v.)
– NP-hard and #P-complete
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete



Taming Exact Inference

How to reduce time? Identify structure in the BN, as in the CSP setting: group variables together to "reduce" the network to a polytree

How? Cluster variables together (join tree algorithms)

The parents of a node can be grouped into a meta-parent node (meganode)

As in CSPs, meganodes may share variables, so a special inference algorithm is needed

The algorithm takes care of constraint propagation so that meganodes agree on the posterior probability of shared variables

No free lunch, so what gives? The exponential time cost is hidden in the combined CPTs, which can become exponentially large





Giving up on Exact Inference: Go for Approximate Instead

Or...

Give up on exact inference

Go for approximate inference algorithms...

that use sampling (Monte Carlo-based) to estimate posterior probabilities


Inference by Stochastic Simulation (Sampling-based)

Basic idea:

1) Draw N samples from a sampling distribution S
   (Can you draw N samples for the r.v. Coin from the probability distribution P(Coin) = [0.5, 0.5]?)
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Direct sampling: sampling from an empty network
– Rejection sampling: reject samples disagreeing with the evidence
– Likelihood weighting: use the evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior



Direct Sampling: Sampling from an Empty Network

"Empty" refers to the absence of any evidence; used to estimate joint probabilities

Main idea:

Sample each r.v. in turn, in topological order, from parents to children

Once a parent is sampled, its value is fixed and used to sample its children

Events generated via this direct sampling observe the joint probability distribution

To get the (prior) probability of an event, sample many times, so that the frequency of "observing" it among the samples approaches its probability

Example next





Direct Sampling Example

[Figure sequence: direct sampling of the sprinkler network, assigning Cloudy, Sprinkler, Rain, and WetGrass one at a time in topological order]


PRIOR-SAMPLE Algorithm for Direct Sampling

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for i = 1 to n do
      x_i ← a random sample from P(X_i | parents(X_i)), given the values of Parents(X_i) in x
  return x
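A Python sketch of Prior-Sample for the sprinkler network; CPT entries not quoted in this lecture (e.g., P(Sprinkler = t | ¬Cloudy)) are the standard textbook values, assumed here:

import random

def prior_sample():
    c = random.random() < 0.5                         # P(Cloudy) = 0.5
    s = random.random() < (0.1 if c else 0.5)         # P(Sprinkler=t | C)
    r = random.random() < (0.8 if c else 0.2)         # P(Rain=t | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,   # P(WetGrass=t | S, R)
           (False, True): 0.90, (False, False): 0.0}[(s, r)]
    w = random.random() < p_w
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

# Frequencies approach the joint, e.g., S_PS(t, f, t, t) = 0.324 (next slide)
N = 100_000
count = 0
for _ in range(N):
    e = prior_sample()
    if e["Cloudy"] and not e["Sprinkler"] and e["Rain"] and e["WetGrass"]:
        count += 1
print(count / N)   # ≈ 0.324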



Direct Sampling Continued

The probability that Prior-Sample generates a particular event x_1 ... x_n is

S_PS(x_1 ... x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i)) = P(x_1 ... x_n)

i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x_1 ... x_n) be the number of samples generated for the event x_1, ..., x_n. Then we have

lim_{N→∞} P̂(x_1, ..., x_n) = lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1 ... x_n)

That is, estimates derived from Prior-Sample are consistent (they become exact in the large-sample limit)

Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1 ... x_n)

Problem: N needs to be sufficiently large to sample "rare events"


Rejection Sampling (for Conditional Probabilities P(X |e))

Main idea:

The given distribution is too hard to sample from directly, so use an easy-to-sample distribution for direct sampling and then reject the samples that disagree with the evidence

(1) Use direct sampling to sample (X, E) events from the prior distribution in the BN
(2) Determine whether (X, E) is consistent with the given evidence e
(3) Get P̂(X | E = e) by counting how often (E = e) and (X, E = e) occur, as per the definition of conditional probability:

P̂(X | E = e) = N(X, E = e) / N(E = e)

Example: estimate P(Rain | Sprinkler = true) using 100 samples

Generate 100 samples for Cloudy, Sprinkler, Rain, WetGrass via direct sampling
27 samples have Sprinkler = true (the event of interest)
Of these, 8 have Rain = true and 19 have Rain = false

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨8/27, 19/27⟩ = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure



Rejection Sampling

P̂(X|e) is estimated from the samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
          N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
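On top of the prior_sample() sketch above, rejection sampling is just a filtered count; this mirrors the worked example, estimating P(Rain | Sprinkler = true):

def rejection_sampling(n):
    counts = {True: 0, False: 0}
    for _ in range(n):
        e = prior_sample()
        if e["Sprinkler"]:               # keep only samples consistent with e
            counts[e["Rain"]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

print(rejection_sampling(10_000))   # roughly <0.3, 0.7>, close to <0.296, 0.704>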



Analysis of Rejection Sampling

P̂(X|e) = α N_PS(X, e)   (algorithm defn.)
        = N_PS(X, e) / N_PS(e)   (normalized by N_PS(e))
        ≈ P(X, e) / P(e)   (property of Prior-Sample)
        = P(X|e)   (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

The standard deviation of the error in each probability is proportional to 1/√n, where n is the number of samples

Problem: if e is a very rare event, most samples are rejected; hopelessly expensive if P(e) is small

P(e) drops off exponentially with the number of evidence variables!

Rejection sampling is therefore unusable for complex problems → use likelihood weighting instead



Likelihood Weighting

A form of importance sampling (for BNs)

Main idea:

Generate only events that are consistent with the given values e of the evidence variables E

Fix the evidence variables to their given values; sample only the nonevidence variables

Weight each sample by the likelihood it accords the evidence (how likely e is, given the sample)

Example: query P(Rain | Sprinkler = true, WetGrass = true)

Consider the r.v.s in some topological ordering

Set w = 1.0 (the weight will be a running product)

If r.v. X_i is an evidence variable (Sprinkler or WetGrass in this example), w ← w × P(X_i = x_i | Parents(X_i))

Else, sample X_i from P(X_i | Parents(X_i))

When all r.v.s have been considered, normalize the weighted counts to turn them into probabilities



Likelihood Weighting Example: P(Rain | Sprinkler = t, WetGrass = t)

Cloudy is considered first; it is nonevidence, so sample it; w = 1.0
Say Cloudy = t is sampled

Sprinkler is considered next; it is an evidence variable, so update w:
w = w × P(Sprinkler = t | Cloudy = t) = 1.0 × 0.1

Rain is considered next; it is nonevidence, so sample it from the BN (with Cloudy = t from before); w does not change
Say Rain = t is sampled; w = 1.0 × 0.1

The last r.v. is WetGrass, an evidence variable, so update w:
w = w × P(WetGrass = t | Sprinkler = t, Rain = t) = 1.0 × 0.1 × 0.99 = 0.099
(this is not a probability but the weight of this sample)


Likelihood Weighting Algorithm

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
      x, w ← Weighted-Sample(bn, e)
      W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
      if X_i has a value x_i in e
          then w ← w × P(X_i = x_i | parents(X_i))
          else x_i ← a random sample from P(X_i | parents(X_i))
  return x, w
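A Python sketch of Weighted-Sample specialized to the running query P(Rain | Sprinkler = t, WetGrass = t); CPT values not quoted in the lecture are the standard textbook numbers, assumed here:

import random

def weighted_sample():
    w = 1.0
    c = random.random() < 0.5                    # Cloudy: nonevidence, sample
    w *= 0.1 if c else 0.5                       # Sprinkler = t: evidence
    r = random.random() < (0.8 if c else 0.2)    # Rain: nonevidence, sample
    w *= 0.99 if r else 0.90                     # WetGrass = t: evidence (S = t)
    return r, w

def likelihood_weighting(n):
    W = {True: 0.0, False: 0.0}                  # weighted counts over Rain
    for _ in range(n):
        r, w = weighted_sample()
        W[r] += w
    z = W[True] + W[False]
    return {v: W[v] / z for v in W}

print(likelihood_weighting(100_000))   # ≈ {True: 0.32, False: 0.68}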



Likelihood Weighting Analysis

The sampling probability for Weighted-Sample is

S_WS(z, e) = ∏_{i=1}^{l} P(z_i | parents(Z_i))

Note: it pays attention to evidence in ancestors only ⇒ somewhere "in between" the prior and the posterior distribution

The weight for a given sample z, e is

w(z, e) = ∏_{i=1}^{m} P(e_i | parents(E_i))

The weighted sampling probability is

S_WS(z, e) w(z, e) = ∏_{i=1}^{l} P(z_i | parents(Z_i)) ∏_{i=1}^{m} P(e_i | parents(E_i))
                   = P(z, e)   (by the standard global semantics of the network)



Likelihood Weighting Analysis Continued

Likelihood weighting returns consistent estimates

The variable ordering actually matters

Performance degrades as the number of evidence variables increases: a few samples end up with nearly all the total weight; most samples have very low weights, and the estimate is dominated by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence

This is exacerbated when evidence variables occur late in the ordering: nonevidence variables then have no evidence in their parents to guide the generation of samples, and the simulated samples bear little resemblance to the reality suggested by the evidence

Change of framework: do not sample directly (from scratch), but modify the preceding sample



Approximate Inference using MCMC

Main idea:

Markov chain Monte Carlo (MCMC) algorithms generate each sample by making a random change to the preceding sample

Concept of a current state: it specifies a value for every r.v.

"State" of the network = current assignment to all variables

A random change to the current state yields the next state

A form of MCMC: Gibbs sampling



Gibbs Sampling to Estimate P(X |e)

The initial state has the evidence variables assigned as provided

The next state is generated by randomly sampling values for the nonevidence variables

Each nonevidence variable Z_i is sampled in turn, given its Markov blanket mb(Z_i)

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N[X], a vector of counts over X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
      for each Z_i in Z do
          sample the value of Z_i in x from P(Z_i | mb(Z_i)), given the values of MB(Z_i) in x
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
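A Python sketch of Gibbs sampling for P(Rain | Sprinkler = t, WetGrass = t). Here P(Z_i | mb(Z_i)) is obtained by scoring Z_i = t and Z_i = f under the full joint, since factors that do not mention Z_i cancel in the ratio; CPT values not quoted in the lecture are the standard textbook numbers, assumed:

import random

P_s = {True: 0.1, False: 0.5}                      # P(Sprinkler=t | Cloudy)
P_r = {True: 0.8, False: 0.2}                      # P(Rain=t | Cloudy)
P_w = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(WetGrass=t | S, R)

def joint(st):
    p = 0.5   # P(Cloudy=t) = P(Cloudy=f) = 0.5
    p *= P_s[st["Cloudy"]] if st["Sprinkler"] else 1.0 - P_s[st["Cloudy"]]
    p *= P_r[st["Cloudy"]] if st["Rain"] else 1.0 - P_r[st["Cloudy"]]
    pw = P_w[(st["Sprinkler"], st["Rain"])]
    p *= pw if st["WetGrass"] else 1.0 - pw
    return p

def sample_given_rest(var, state):
    # sample var from P(var | all other variables), via the joint-score ratio
    pt = joint(dict(state, **{var: True}))
    pf = joint(dict(state, **{var: False}))
    return random.random() < pt / (pt + pf)

def gibbs_ask(n):
    state = {"Sprinkler": True, "WetGrass": True,   # evidence, fixed
             "Cloudy": random.random() < 0.5,       # nonevidence, random init
             "Rain": random.random() < 0.5}
    counts = {True: 0, False: 0}
    for _ in range(n):
        for z in ("Cloudy", "Rain"):                # nonevidence variables
            state[z] = sample_given_rest(z, state)
        counts[state["Rain"]] += 1
    return counts[True] / n

print(gibbs_ask(100_000))   # ≈ 0.32, matching the likelihood weighting run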



The Markov Chain

With Sprinkler = true, WetGrass = true, there are four states, one per assignment to (Cloudy, Rain)

Wander about for a while, average what you see



MCMC Example Continued

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket; repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states: 31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: the chain approaches its stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability



Markov Blanket Sampling

The Markov blanket of Cloudy is Sprinkler and Rain

The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

The probability given the Markov blanket is calculated as follows:

P(x′_i | mb(X_i)) ∝ P(x′_i | parents(X_i)) ∏_{Z_j ∈ Children(X_i)} P(z_j | parents(Z_j))

Easily implemented in message-passing parallel systems, brains

Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if the Markov blanket is large: P(X_i | mb(X_i)) won't change much (law of large numbers)



Summary on Inference on Bayesian Networks

Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by LW and MCMC:
– LW does poorly when there is lots of (downstream) evidence
– LW and MCMC are generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– both can handle arbitrary combinations of discrete and continuous variables



For Those that Want to Dig Deeper...

♦ MCMC Analysis

♦ Stationarity

♦ Detailed Balance

♦ General Gibbs Sampling



MCMC Analysis: Outline

Transition probability q(x → x′)

Occupancy probability π_t(x) at time t

Equilibrium condition on π_t defines the stationary distribution π(x)
Note: the stationary distribution depends on the choice of q(x → x′)

Pairwise detailed balance on states guarantees equilibrium

Gibbs sampling transition probability: sample each variable given the current values of all the others
⇒ detailed balance with the true posterior

For Bayesian networks, Gibbs sampling reduces to sampling conditioned on each variable's Markov blanket



Stationary Distribution

π_t(x) = probability of being in state x at time t

π_{t+1}(x′) = probability of being in state x′ at time t + 1

π_{t+1} in terms of π_t and q(x → x′):

π_{t+1}(x′) = Σ_x π_t(x) q(x → x′)

Stationary distribution: π_t = π_{t+1} = π

π(x′) = Σ_x π(x) q(x → x′)   for all x′

If π exists, it is unique (specific to q(x → x′))

In equilibrium, expected "outflow" = expected "inflow"



Detailed Balance

"Outflow" = "inflow" for each pair of states:

π(x) q(x → x′) = π(x′) q(x′ → x)   for all x, x′

Detailed balance ⇒ stationarity:

Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′)

MCMC algorithms are typically constructed by designing a transition probability q that is in detailed balance with the desired π
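These conditions are easy to verify numerically. A tiny Python sketch on a hypothetical two-state chain: π = (0.25, 0.75), with q chosen so that π(x) q(x → x′) = π(x′) q(x′ → x):

pi = [0.25, 0.75]
q = [[0.7, 0.3],    # q(0 -> 0), q(0 -> 1)
     [0.1, 0.9]]    # q(1 -> 0), q(1 -> 1)

# detailed balance on the only off-diagonal pair: 0.25*0.3 == 0.75*0.1
assert abs(pi[0] * q[0][1] - pi[1] * q[1][0]) < 1e-12

# hence pi is stationary: pi(x') = sum_x pi(x) q(x -> x') reproduces pi
for x2 in range(2):
    print(x2, sum(pi[x] * q[x][x2] for x in range(2)))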



Gibbs Sampling

Sample each variable in turn, given all the other variables

When sampling X_i, let X̄_i be all the other nonevidence variables; the current values are x_i and x̄_i; e is fixed

The transition probability is

q(x → x′) = q(x_i, x̄_i → x′_i, x̄_i) = P(x′_i | x̄_i, e)

This gives detailed balance with the true posterior P(x|e):

π(x) q(x → x′) = P(x|e) P(x′_i | x̄_i, e) = P(x_i, x̄_i | e) P(x′_i | x̄_i, e)
               = P(x_i | x̄_i, e) P(x̄_i | e) P(x′_i | x̄_i, e)   (chain rule)
               = P(x_i | x̄_i, e) P(x′_i, x̄_i | e)   (chain rule backwards)
               = q(x′ → x) π(x′) = π(x′) q(x′ → x)



Performance of Approximation Algorithms

Absolute approximation: |P(X|e) − P̂(X|e)| ≤ ε

Relative approximation: |P(X|e) − P̂(X|e)| / P(X|e) ≤ ε

Relative ⇒ absolute since 0 ≤ P ≤ 1 (but not vice versa, since P may be O(2^{−n}))

Randomized algorithms may fail with probability at most δ

Polytime approximation: poly(n, ε^{−1}, log δ^{−1})

Theorem (Dagum and Luby, 1993): both absolute and relative approximation for either deterministic or randomized algorithms are NP-hard for any ε, δ < 0.5

(Absolute approximation is polytime with no evidence, via Chernoff bounds)

