
15-780: Graduate Artificial Intelligence

Inference in Bayesian networks


Bayesian networks: Notations
• Random variables: the nodes Lo, Li and S
• Conditional dependencies: the edges Lo → Li and Lo → S
• Conditional probability tables (CPTs):
  P(Lo) = 0.5
  P(Li | Lo) = 0.4    P(Li | ¬Lo) = 0.7
  P(S | Lo) = 0.6     P(S | ¬Lo) = 0.2
An example problem
• An alarm system
  B – Did a burglary occur?
  E – Did an earthquake occur?
  A – Did the alarm sound off?
  J – John calls
  M – Mary calls
• Network structure: B → A, E → A, A → J, A → M
Constructing a Bayesian network: Revisited
• Step 1: Identify the random variables
• Step 2: Determine the conditional dependencies
  - Select an ordering of the variables
  - Add them one at a time
  - For each new variable X added, select the minimal subset of nodes as parents such that X is independent of all other nodes in the current network given its parents
• Step 3: Populate the CPTs
  - We will discuss this when we talk about density estimation
Reconstructing a network
• Suppose we wanted to add a new variable to the network:
  R – Did the radio announce that there was an earthquake?
• How should we insert it into the existing network (B, E → A → J, M)?
Bayesian networks: Restrictions and joint distributions
• Bayesian networks are directed acyclic graphs (DAGs)
  - Otherwise a node will (indirectly) impact its own probability, making inference hard
(Figure: a graph over x1 … x6 containing a directed cycle — this is NOT a valid Bayesian network!)
Bayesian networks: Restrictions and joint distributions
• Bayesian networks are directed acyclic graphs (DAGs)
  - Otherwise a node will (indirectly) impact its own probability, making inference hard
• Given a Bayesian network, the joint probability distribution can be factored as:
  P(X) = ∏_i P(x_i | Pa(x_i))
  where X is a vector of observations and Pa(x_i) is the set of parent nodes of x_i
Using Bayesian networks
• Inference
- Computing joint distributions
- Inferring values of unobserved variables
• Structure learning
Bayesian network: Inference
• Once the network is constructed, we can use algorithms for inferring the values of unobserved variables, for example P(B=1 | J, ¬M).
• In our previous network the only observed variables are the phone calls. However, what we are really interested in is whether there was a burglary or not.
• How can we determine that?
Inference
• Let's start with a simpler question
  - How can we compute a joint distribution from the network?
  - For example, P(B,¬E,A,J,¬M)?
• Answer:
  - That's easy, let's use the network
Computing: P(B,¬E,A,J,¬M)
P(B,¬E,A,J,¬M) = P(B) P(¬E) P(A | B,¬E) P(J | A) P(¬M | A)
             = 0.05 * 0.9 * 0.85 * 0.7 * 0.2
             = 0.005355
CPTs:
  P(B) = .05           P(E) = .1
  P(A|B,E) = .95       P(A|B,¬E) = .85
  P(A|¬B,E) = .5       P(A|¬B,¬E) = .05
  P(J|A) = .7          P(J|¬A) = .05
  P(M|A) = .8          P(M|¬A) = .15
Computing: P(B,¬E,A,J,¬M)
P(B,¬E,A,J,¬M) = P(B) P(¬E) P(A | B,¬E) P(J | A) P(¬M | A)
             = 0.05 * 0.9 * 0.85 * 0.7 * 0.2 = 0.005355
We can easily compute a complete joint distribution.
What about partial distributions? Conditional distributions?
(CPTs as on the previous slide)
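The factored computation above is straightforward to code. Below is a minimal Python sketch, assuming binary variables and the CPT values from the figure; the dictionary encoding and function names are illustrative choices, not from the original slides.

cpts = {
    'B': {(): 0.05},                      # P(B=1)
    'E': {(): 0.10},                      # P(E=1)
    'A': {(1, 1): 0.95, (1, 0): 0.85,     # P(A=1 | B, E)
          (0, 1): 0.50, (0, 0): 0.05},
    'J': {(1,): 0.70, (0,): 0.05},        # P(J=1 | A)
    'M': {(1,): 0.80, (0,): 0.15},        # P(M=1 | A)
}
parents = {'B': (), 'E': (), 'A': ('B', 'E'), 'J': ('A',), 'M': ('A',)}

def joint(assignment):
    # P(full assignment) = product over nodes of P(x_i | Pa(x_i))
    p = 1.0
    for var in ('B', 'E', 'A', 'J', 'M'):
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p1 = cpts[var][pa_vals]                      # P(var = 1 | parent values)
        p *= p1 if assignment[var] == 1 else 1 - p1  # flip for var = 0
    return p

print(joint({'B': 1, 'E': 0, 'A': 1, 'J': 1, 'M': 0}))   # 0.05*0.9*0.85*0.7*0.2 = 0.005355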
Inference
• We are interested in queries of the form: P(B | J,¬M)
• By the chain rule, this can also be written in terms of joints:
  P(B | J,¬M) = P(B,J,¬M) / ( P(J,¬M,B) + P(J,¬M,¬B) )
• How do we compute the new joint?
Computing partial joints
  P(B | J,¬M) = P(B,J,¬M) / ( P(B,J,¬M) + P(¬B,J,¬M) )
Sum all instances with these settings (the sum is over the possible assignments to the other two variables, E and A).
Computing: P(B,J,¬M)
P(B,J,¬M) = P(B,J,¬M,A,E) + P(B,J,¬M,¬A,E) + P(B,J,¬M,A,¬E) + P(B,J,¬M,¬A,¬E)
          = 0.0007 + 0.00001 + 0.005 + 0.0003 = 0.00601
(CPTs as above)
Computing partial joints
  P(B | J,¬M) = P(B,J,¬M) / ( P(B,J,¬M) + P(¬B,J,¬M) )
P(B,J,¬M) is done! Now sum all instances with the settings (¬B,J,¬M).
Computing: P(¬B,J,¬M)
P(¬B,J,¬M) = P(¬B,J,¬M,A,E) + P(¬B,J,¬M,¬A,E) + P(¬B,J,¬M,A,¬E) + P(¬B,J,¬M,¬A,¬E)
           = 0.00665 + 0.002 + 0.006 + 0.0345 = 0.049
(CPTs as above)
Computing partial joints
  P(B | J,¬M) = P(B,J,¬M) / ( P(B,J,¬M) + P(¬B,J,¬M) )
             = 0.006 / (0.006 + 0.049) = 0.11
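A minimal sketch of this inference-by-enumeration step in Python, assuming the cpts/parents dictionaries and the joint() function from the earlier sketch:

from itertools import product

def partial_joint(b_val):
    # Sum the full joint over the unobserved variables E and A
    return sum(joint({'B': b_val, 'E': e, 'A': a, 'J': 1, 'M': 0})
               for e, a in product((0, 1), repeat=2))

p_b    = partial_joint(1)               # P(B, J, ¬M)  ≈ 0.006
p_notb = partial_joint(0)               # P(¬B, J, ¬M) ≈ 0.049
print(p_b / (p_b + p_notb))             # P(B | J, ¬M) ≈ 0.11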
Computing partial joints
  P(B | J,¬M) = P(B,J,¬M) / ( P(B,J,¬M) + P(¬B,J,¬M) )
Sum all instances with these settings (the sum is over the possible assignments to the other two variables, E and A).
But the number of possible assignments is exponential in the number of unobserved variables!
That is, unfortunately, the best we can do in general: querying Bayesian networks is NP-complete.
Inference in Bayesian networks is NP-complete (sketch)
• Reduction from 3SAT
• Recall 3SAT: find a satisfying assignment to a formula such as (a ∨ b ∨ c) ∧ (d ∨ ¬b ∨ ¬c) ∧ …
• Construct a network with a root node for each variable xi, with P(xi = 1) = 0.5
• Add a node for each clause that is a deterministic OR of its literals (e.g., x1 ∨ x2 ∨ x3)
• Add a node Y that is a deterministic AND of the clause nodes
• What is P(Y)? P(Y = 1) > 0 exactly when the formula is satisfiable, so answering this query answers 3SAT
Other inference methods
• Convert the network to a polytree
  - In a polytree no two nodes have more than one path between them
  - For such a graph there is a linear-time inference algorithm
  - However, converting into a polytree requires a large increase in the size of the graph (number of nodes)
Why is inference in polytrees easy?
• In polytrees, given a variable X we can always divide the other variables into two sets:
  E+: variables 'above' X (e.g., its parents y1, y2)
  E−: variables 'below' X (e.g., its children z1, z2)
• These sets are mutually exclusive (why?)
• Using these sets we can efficiently compute conditional and joint distributions
Stochastic inference
• We can easily sample the joint distribution to obtain possible instances:
  1. Sample the free (parentless) variables
  2. For every other variable: once all of its parents have been sampled, sample it from its conditional distribution
• We end up with a new set of assignments for B, E, A, J and M, which is a random sample from the joint
(CPTs as above)
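A minimal Python sketch of this ancestral-sampling procedure, reusing the cpts and parents dictionaries from the earlier sketch (the variable ordering below is one valid topological order; the names are illustrative):

import random

def sample_joint():
    sample = {}
    for var in ('B', 'E', 'A', 'J', 'M'):            # parents are always sampled first
        pa_vals = tuple(sample[pa] for pa in parents[var])
        p1 = cpts[var][pa_vals]                      # P(var = 1 | sampled parent values)
        sample[var] = 1 if random.random() < p1 else 0
    return sample

print(sample_joint())    # e.g. {'B': 0, 'E': 0, 'A': 0, 'J': 0, 'M': 0}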
Stochastic inference
• We can easily sample the joint distribution to obtain possible instances:
  1. Sample the free (parentless) variables
  2. For every other variable: once all of its parents have been sampled, sample it from its conditional distribution
• Is it always possible to carry out this sampling procedure? Why?
(CPTs as above)
Using sampling for inference
• Let's revisit our problem: compute P(B | J,¬M)
• Looking at the samples we can count:
  - N: total number of samples
  - Nc: number of samples in which the condition (J,¬M) holds
  - NB: number of samples in which the joint (B,J,¬M) holds
• For a large enough N:
  - Nc / N ≈ P(J,¬M)
  - NB / N ≈ P(B,J,¬M)
• And so, we can set P(B | J,¬M) = P(B,J,¬M) / P(J,¬M) ≈ NB / Nc
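A minimal sketch of this counting estimate (rejection sampling), assuming the sample_joint() function from the sketch above; the sample size is an arbitrary choice:

N = 100_000
n_c = n_b = 0
for _ in range(N):
    s = sample_joint()
    if s['J'] == 1 and s['M'] == 0:            # condition (J, ¬M) holds
        n_c += 1
        if s['B'] == 1:                        # joint (B, J, ¬M) holds
            n_b += 1
print(n_b / n_c if n_c else float('nan'))      # ≈ P(B | J, ¬M) ≈ 0.11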
Using sampling for inference
• Let's revisit our problem: compute P(B | J,¬M)
• Problem: What if the condition (J,¬M) rarely happens?
  We would need lots and lots of samples, and most would be wasted.
Weighted sampling
• Compute P(B | J,¬M)
• We can manually set the value of J to 1 and M to 0
• This way, all samples will contain the correct values for the conditioning variables
• Problems?
Weighted sampling
• Compute P(B | J,¬M)
• Given an assignment to their parents, we assign a value of 1 to J and 0 to M
• We record the probability of this forced assignment (w = p1 p2, the probability of the clamped values given the sampled parents) and we weight the new joint sample by w
Weighted sampling algorithm for computing P(B | J,¬M)
• Set NB, Nc = 0
• Sample the joint, clamping the values for J and M, and compute the weight, w, of this sample
• Nc = Nc + w
• If B = 1, NB = NB + w
• After many iterations, set P(B | J,¬M) = NB / Nc
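A minimal Python sketch of this weighted (likelihood-weighting) procedure, again reusing the cpts and parents dictionaries from the earlier sketches; the evidence encoding and names are illustrative:

import random

evidence = {'J': 1, 'M': 0}               # the clamped conditioning variables

def weighted_sample():
    sample, w = {}, 1.0
    for var in ('B', 'E', 'A', 'J', 'M'):
        pa_vals = tuple(sample[pa] for pa in parents[var])
        p1 = cpts[var][pa_vals]
        if var in evidence:
            sample[var] = evidence[var]
            w *= p1 if evidence[var] == 1 else 1 - p1   # P(clamped value | sampled parents)
        else:
            sample[var] = 1 if random.random() < p1 else 0
    return sample, w

n_b = n_c = 0.0
for _ in range(100_000):
    s, w = weighted_sample()
    n_c += w                              # Nc = Nc + w
    if s['B'] == 1:
        n_b += w                          # NB = NB + w
print(n_b / n_c)                          # ≈ P(B | J, ¬M) ≈ 0.11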
Bayesian networks for cancer
detection
Constructing networks
• So far we assumed that the network is derived from
domain knowledge.
• That’s not always easy to do
• Examples:
- How are different regions in the brain related?
- How are terrorists related (social networks)?
Inferring structure from data
• It is possible to infer structure if enough data is provided
• The goal would be to find a structure S that leads to maximal likelihood*:
  max_S P(D | S)
• Problems?
*More on this next week
Inferring structure using the maximum likelihood principle
• The more edges we have, the higher the likelihood!
  For example, fitting P(M | A, E) achieves a likelihood at least as high as fitting P(M | A)
• Why?
  - If M and E are independent and we have perfect data, this holds trivially (with equality)
  - We have more parameters to fit; if there is some noise in the measurements, we would likely overfit the data
Inferring structure using the maximum likelihood principle
• The more edges we have, the higher the likelihood!
  For example, fitting P(M | A, E) achieves a likelihood at least as high as fitting P(M | A)
• Solutions:
  - Statistical tests
  - Penalty functions
Likelihood ratio test
• Given two competing models A and B over X, Y and Z (B the more complex one), we can compute their likelihood ratio:
  T(D) = 2 log [ P(D | B) / P(D | A) ]
• T(D) is always ≥ 0, but by how much?
Likelihood ratio test
• Given two competing models A and B over X, Y and Z (B the more complex one), we can compute their likelihood ratio:
  T(D) = 2 log [ P(D | B) / P(D | A) ]  ~  χ²
• T(D) is always ≥ 0, but by how much?
• The result is distributed according to χ², a distribution parameterized by the number of free parameters (the difference in complexity of the two models)
Likelihood ratio test
• Given two competing models A and B over X, Y and Z (B the more complex one), we can compute their likelihood ratio:
  T(D) = 2 log [ P(D | B) / P(D | A) ]  ~  χ²
• Reject the more complicated model unless the ratio is high enough (one can use, for example, the Matlab function CHI2PDF to compute the probability of seeing this ratio as a result of noise alone).
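As a rough Python analogue of that check (the slides mention Matlab's CHI2PDF; here scipy's chi2 is used instead, and the log-likelihood values and degrees of freedom below are placeholder assumptions):

from scipy.stats import chi2

loglik_A = -1042.7        # log P(D | A), simpler model (placeholder value)
loglik_B = -1039.2        # log P(D | B), model with one extra edge (placeholder value)
df = 1                    # difference in the number of free parameters

T = 2 * (loglik_B - loglik_A)             # likelihood ratio statistic T(D)
p_value = chi2.sf(T, df)                  # probability of a ratio this large from noise alone
print(T, p_value)                         # keep the simpler model unless p_value is small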
Penalty functions
• Likelihood ratio tests are appropriate for relatively small problems (few variables)
• For larger problems we usually use a penalty function
• This function penalizes the likelihood based on the complexity of the model:
  L(D | M) = log P(D | M) − f(n)
  where n is related to the number of parameters
• Most commonly used penalty functions:
  - AIC: Akaike information criterion
  - BIC: Bayesian information criterion
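For concreteness, a small sketch of the standard AIC/BIC-style penalized scores (the formulas are the usual textbook definitions, not taken from the slides; log_lik, k and N are placeholder values):

import math

log_lik = -1039.2    # log P(D | M) of the fitted candidate structure (placeholder)
k = 12               # number of free CPT parameters (placeholder)
N = 500              # number of data points (placeholder)

aic_score = log_lik - k                       # AIC-style penalized score (higher is better)
bic_score = log_lik - 0.5 * k * math.log(N)   # BIC penalizes more heavily as N grows
print(aic_score, bic_score)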
Structure learning for biology
Important points
• Bayes' rule
• Joint distribution, independence, conditional independence
• Definition of Bayesian networks
• Inference in Bayesian networks
• Constructing a Bayesian network
