Bayesian Intelligence
Random variables
An example problem
• An alarm system:
  B – Did a burglary occur?
  E – Did an earthquake occur?
  A – Did the alarm sound off?
  M – Mary calls
  J – John calls
[Figure: the alarm network. B and E are parents of A; A is the parent of J and M.]
Constructing a Bayesian network:
Revisited
• Step 1: Identify the random variables
• Step 2: Determine the conditional dependencies
- Select an ordering of the variables
- Add them one at a time
- For each new variable X added, select the minimal subset of existing nodes
  as parents such that X is independent of all other nodes in the
  current network given its parents
• Step 3: Populate the CPTs
- We will discuss this when we talk about density estimation
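To make steps 1–2 concrete, here is a minimal sketch (my own illustration, not part of the slides) of the structure the procedure produces for the alarm network when the variables are added in the order B, E, A, J, M:

```python
# A minimal sketch: encode the network structure as a map from each
# variable to its parent set (the output of steps 1-2). Adding variables
# in the order B, E, A, J, M and keeping only minimal parent sets yields
# the alarm network used throughout this lecture.
alarm_structure = {
    "B": [],          # burglary has no parents
    "E": [],          # earthquake has no parents
    "A": ["B", "E"],  # the alarm depends on both burglary and earthquake
    "J": ["A"],       # John calls only in response to the alarm
    "M": ["A"],       # Mary calls only in response to the alarm
}
```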
Reconstructing a network
Suppose we wanted to add a new variable to the network:
  R – Did the radio announce that there was an earthquake?
How should we insert it?
[Figure: the alarm network (B, E → A → J, M) with the new node R to be placed.]
Bayesian networks: Restrictions and joint distributions
• Bayesian networks are directed acyclic graphs (DAGs)
  - Otherwise a node would (indirectly) impact its own probability, making inference hard
[Figure: an example DAG over x1, ..., x6.]
• The joint distribution factors over the graph:
  P(X) = ∏i P(xi | Pa(xi))
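For example, applying this factorization to the alarm network from the earlier slides gives:

  P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)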
Computing: P(B, ¬E, A, J, ¬M)

CPTs from the network figure:
  P(B) = .05             P(E) = .1
  P(A | B, E) = .95      P(A | B, ¬E) = .85
  P(A | ¬B, E) = .5      P(A | ¬B, ¬E) = .05
  P(J | A) = .7          P(J | ¬A) = .05
  P(M | A) = .8          P(M | ¬A) = .15

P(B, ¬E, A, J, ¬M) = P(B) P(¬E) P(A | B, ¬E) P(J | A) P(¬M | A)
                   = 0.05 * 0.9 * 0.85 * 0.7 * 0.2
                   = 0.005355

We can easily compute a complete joint distribution this way.
What about partial distributions? Conditional distributions?
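A minimal sketch of this computation in Python (my own code, not from the slides; the variable names and dictionary layout are mine, the CPT values are those listed above):

```python
# CPT values from the slide, encoded as simple Python objects.
P_B = 0.05
P_E = 0.10
P_A = {(True, True): 0.95, (True, False): 0.85,
       (False, True): 0.50, (False, False): 0.05}   # P(A=1 | B, E)
P_J = {True: 0.70, False: 0.05}                     # P(J=1 | A)
P_M = {True: 0.80, False: 0.15}                     # P(M=1 | A)

def joint(b, e, a, j, m):
    """One complete joint assignment via P(B)P(E)P(A|B,E)P(J|A)P(M|A)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

print(joint(True, False, True, True, False))  # ≈ 0.005355, as on the slide
```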
Inference
• We are interested in queries of the form:
  P(B | J, ¬M)
• This can also be written in terms of joint probabilities:
  P(B | J, ¬M) = P(B, J, ¬M) / [ P(J, ¬M, B) + P(J, ¬M, ¬B) ]
  (the denominator is P(J, ¬M), obtained by summing over B; each joint term can be expanded using the chain rule, i.e. the network factorization)
[Figure: the alarm network.]
Computing partial joints
P(B | J, ¬M) = P(B, J, ¬M) / [ P(B, J, ¬M) + P(¬B, J, ¬M) ]
             = 0.006 / (0.006 + 0.049)
             = 0.11
Each partial joint (e.g. P(B, J, ¬M)) is obtained by summing the complete joint over the unobserved variables E and A.
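A brute-force enumeration sketch of this query (my own code; it reuses the CPTs and the joint() helper defined in the earlier sketch and reproduces the 0.006, 0.049 and 0.11 figures up to rounding):

```python
from itertools import product

# Brute-force enumeration: sum the complete joint (joint() and the CPTs are
# defined in the earlier sketch) over the unobserved variables E and A.
def partial_joint(b, j, m):
    return sum(joint(b, e, a, j, m) for e, a in product([True, False], repeat=2))

p_b  = partial_joint(True,  True, False)   # P(B, J, ¬M)  ≈ 0.006
p_nb = partial_joint(False, True, False)   # P(¬B, J, ¬M) ≈ 0.049
print(p_b / (p_b + p_nb))                  # P(B | J, ¬M) ≈ 0.11
```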
[Figure: an example network asking "What is P(Y)?", with P(xi = 1) = 0.5 for each variable xi.]
Why is inference in polytrees easy?
• In polytrees, given a variable X we can always divide the other variables into two sets:
  - E+: variables 'above' X
  - E-: variables 'below' X
[Figure: a polytree around X, with nodes y1, y2 above X.]
Weighted sampling
• Compute P(B | J, ¬M)
• Given an assignment to the parents, we assign a value of 1 to J and 0 to M
• We record the probability of this assignment (w = p1 p2) and we weight the new joint sample by w
[Figure: the alarm network.]
Weighted sampling algorithm for computing P(B | J, ¬M)
• Set NB, Nc = 0
• Sample the joint, setting the values for J and M; compute the weight, w, of this sample
• Nc = Nc + w
• If B = 1, NB = NB + w
• After enough samples, estimate P(B | J, ¬M) ≈ NB / Nc
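A minimal sketch of this algorithm in Python (my own code, not from the slides; it reuses the CPT dictionaries P_B, P_E, P_A, P_J, P_M from the earlier sketch):

```python
import random

def weighted_sampling(n_samples=100_000, seed=0):
    """Likelihood weighting for P(B | J=1, M=0), using the CPT dictionaries
    (P_B, P_E, P_A, P_J, P_M) defined in the earlier sketch."""
    rng = random.Random(seed)
    NB = Nc = 0.0
    for _ in range(n_samples):
        # Sample the non-evidence variables from their conditional distributions.
        b = rng.random() < P_B
        e = rng.random() < P_E
        a = rng.random() < P_A[(b, e)]
        # Evidence J=1, M=0 is clamped; its probability given A is the weight.
        w = P_J[a] * (1 - P_M[a])
        Nc += w
        if b:
            NB += w
    return NB / Nc

print(weighted_sampling())  # should be close to the exact value, ~0.11
```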
Inferring structure using maximum likelihood principle
• The more edges we have, the higher the likelihood!
  P(M | A, E) ≥ P(M | A)
  Why?
  - If the two are independent and we have perfect data, this trivially holds
  - We have more parameters to fit; if there is some noise in the measurements, we would likely overfit the data
• Solutions:
  - Statistical tests
  - Penalty functions
[Figure: the alarm network, comparing the model with P(M | A) against one that adds the edge E → M.]
Likelihood ratio test
• Given two competing models we can compute their likelihood ratio:
  T(D) = 2 log [ P(D | B) / P(D | A) ] ~ χ2
• T(D) is always ≥ 0, but by how much?
• The result is distributed according to χ2, which is a distribution defined by the number of free parameters (the difference in complexity of the two models)
• Reject the more complicated model unless the ratio is high enough (can use, for example, the Matlab function CHI2PDF to compute the probability of seeing this ratio as a result of noise)
[Figure: Model A and Model B, two candidate structures over X, Y and Z.]
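A rough Python analogue of this test (my own sketch; the slide mentions Matlab's CHI2PDF, here I use scipy.stats.chi2 instead, and the log-likelihoods and parameter counts below are hypothetical values assumed to come from fitting the two models):

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_simple, loglik_complex, extra_params):
    """T(D) = 2 * (log P(D | complex) - log P(D | simple)), compared against
    a chi-square distribution whose degrees of freedom equal the difference
    in the number of free parameters between the two models."""
    T = 2.0 * (loglik_complex - loglik_simple)
    p_value = chi2.sf(T, df=extra_params)  # probability of a ratio this large from noise alone
    return T, p_value

# Hypothetical numbers: keep the simpler model unless the p-value is small.
T, p = likelihood_ratio_test(loglik_simple=-1042.3, loglik_complex=-1039.8, extra_params=2)
print(T, p)
```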
Penalty functions
• Likelihood ratio tests are appropriate for relatively small problems (few variables)
• For larger problems we usually use a penalty function
• This function penalizes the likelihood based on the complexity of the model:
  L(D | M) = log P(D | M) − f(n)
  where n is related to the number of parameters
• Most commonly used penalty functions:
  - AIC: Akaike's Information Criterion
  - BIC: Bayesian Information Criterion
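As a concrete illustration (my own sketch, using the standard textbook forms of the two criteria rather than anything from the slides, written as penalized log-likelihoods so that larger is better; k is the number of free parameters and N the number of data points):

```python
import math

# Penalized log-likelihood scores: larger is better, and the penalty f(n)
# grows with the number of free parameters k. These are the usual AIC and
# BIC criteria up to a constant factor of -2.
def aic_score(loglik, k):
    return loglik - k                                # f(n) = k

def bic_score(loglik, k, n_samples):
    return loglik - 0.5 * k * math.log(n_samples)    # f(n) = (k/2) log N
```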
Structure learning for biology
Important points
• Bayes rule
• Joint distribution, independence, conditional
independence
• Definition of Bayesian networks
• Inference in Bayesian networks
• Constructing a Bayesian network