Bayesian Networks Slides
Bayes Theorem
MAP, ML hypotheses
MAP learners
Bayesian reasoning provides the basis for learning algorithms that directly
manipulate probabilities, as well as a framework for analyzing the operation
of other algorithms that do not explicitly manipulate probabilities.
P(h | D) = \frac{P(D|h)\,P(h)}{P(D)}

P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in
finding the most probable hypothesis h ∈ H given the observed data D (or at least one of the maximally
probable if there are several). Any such maximally probable hypothesis is called a maximum a posteriori
(MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior
probability of each candidate hypothesis.
h_{MAP} \equiv \arg\max_{h \in H} P(h | D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)
If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis

h_{ML} = \arg\max_{h \in H} P(D | h)
¹ \arg\max_{x \in X} f(x): the value of x that maximises f(x), e.g. \arg\max_x x^2 = -3 where x ∈ {1, 2, -3}
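A minimal sketch of these two rules over a small, hypothetical hypothesis space (the priors and likelihoods below are invented for illustration; they are not from the slides):

```python
# Hypothetical priors P(h) and likelihoods P(D|h) for three candidate hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(h)   (assumed values)
likelihoods = {"h1": 0.2, "h2": 0.7, "h3": 0.9}   # P(D|h) (assumed values)

# MAP hypothesis: argmax_h P(D|h) P(h).  P(D) is the same for every h, so it can be dropped.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML hypothesis: argmax_h P(D|h); coincides with MAP when all priors are equal.
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map, h_ml)   # -> h2 h3 (they differ here because the priors are unequal)
```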
Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
P(cancer | +) = \frac{P(+ | cancer)\,P(cancer)}{P(+)} = \frac{.98 \times .008}{.0376} = .209

P(\neg cancer | +) = \frac{P(+ | \neg cancer)\,P(\neg cancer)}{P(+)} = \frac{.03 \times .992}{.0376} = .791

where P(+) = P(+ | cancer)\,P(cancer) + P(+ | \neg cancer)\,P(\neg cancer) = .0376
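A quick numeric check of the slide's calculation (a minimal sketch; the variable names are mine):

```python
# Quantities given on the slide
p_cancer = 0.008               # prior: P(cancer)
p_pos_given_cancer = 0.98      # P(+ | cancer)
p_pos_given_no_cancer = 0.03   # P(+ | ¬cancer) = 1 - 0.97

# Total probability of a positive result (denominator of Bayes theorem)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)

# Posteriors
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_no_cancer_given_pos = p_pos_given_no_cancer * (1 - p_cancer) / p_pos

print(round(p_pos, 4), round(p_cancer_given_pos, 3), round(p_no_cancer_given_pos, 3))
# -> 0.0376 0.209 0.791
```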
Most Probable Classification of New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_{MAP})
Consider:
Three possible hypotheses:
P(h_1 | D) = .4,  P(h_2 | D) = .3,  P(h_3 | D) = .3
Given new instance x,
h_1(x) = +,  h_2(x) = −,  h_3(x) = −
What's the most probable classification of x?
The most probable classification (negative) in this case is different from the
classification generated by the MAP hypothesis.
If the possible classification of the new example can take on any value v_j
from some set V, then the probability P(v_j | D) that the correct classification
for the new instance is v_j is just

P(v_j | D) = \sum_{h_i \in H} P(v_j | h_i)\,P(h_i | D)
Example:

\sum_{h_i \in H} P(+ | h_i)\,P(h_i | D) = .4

\sum_{h_i \in H} P(- | h_i)\,P(h_i | D) = .6

therefore

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j | h_i)\,P(h_i | D) = -
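A minimal sketch of this Bayes optimal vote, using the posteriors and per-hypothesis predictions from the example above:

```python
# Posterior probabilities P(h_i | D) for the three hypotheses
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# Classification each hypothesis assigns to the new instance x
predictions = {"h1": "+", "h2": "-", "h3": "-"}

# Weight each hypothesis's vote by its posterior and pick the classification
# with the largest total weight.
votes = {}
for h, post in posteriors.items():
    votes[predictions[h]] = votes.get(predictions[h], 0.0) + post

print(votes, max(votes, key=votes.get))   # {'+': 0.4, '-': 0.6} -
```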
Along with decision trees, neural networks, and nearest neighbour, one of the most practical
learning methods.
When to use
Moderate or large training set available
Attributes that describe instances are conditionally independent given
classification
Successful applications:
Diagnosis
Classifying text documents
Assuming the attributes are conditionally independent given the classification gives the

Naive Bayes classifier: v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i | v_j)
Data Sample:
X = (age ≤ 30, Income = medium, Student = yes, Credit_rating = fair)
Class:
C1: Buys_computer = yes
C2: Buys_computer = no
Naive Bayes: Example
Compute P(X | Ci) for each class, where X = (age ≤ 30, Income = medium, Student = yes, Credit_rating = fair)
P(X | Ci):
P(X | Buys_computer = 'yes') = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
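A minimal sketch of the product behind this example. The four conditional probabilities are the ones quoted above (including the Credit_rating factor of 0.667, reconstructed so the product matches the 0.044 shown); the class prior is a made-up placeholder, since it is not given on this slide:

```python
# Estimated P(a_i | Buys_computer = yes) for the four attribute values in X
cond_probs_yes = [0.222,   # age <= 30
                  0.444,   # Income = medium
                  0.667,   # Student = yes
                  0.667]   # Credit_rating = fair (reconstructed factor)

# Hypothetical class prior P(Buys_computer = yes); not given on this slide.
prior_yes = 0.643

p_x_given_yes = 1.0
for p in cond_probs_yes:
    p_x_given_yes *= p

print(round(p_x_given_yes, 3))              # 0.044
print(round(prior_yes * p_x_given_yes, 3))  # naive Bayes score P(C1) * prod_i P(a_i | C1)
```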
1. Conditional independence assumption is often violated ...but it works surprisingly well anyway. Note that we don't need the estimated posteriors \hat{P}(v_j | x) to be correct; we need only that

\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i | v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n | v_j)
2. What if none of the training instances with target value v_j have attribute
value a_i? Then \hat{P}(a_i | v_j) = 0, and

\hat{P}(v_j) \prod_i \hat{P}(a_i | v_j) = 0
i |vj )
Typical solution is Bayesian estimate for P(a
i |vj ) nc + mp
P(a
n+m
where
n is number of training examples for which v = vj ,
nc number of examples for which v = vj and a = ai
i |vj )
p is prior estimate for P(a
m is weight given to prior (i.e. number of virtual examples)
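A minimal sketch of this m-estimate (the counts in the usage line are hypothetical):

```python
def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i | v_j): (n_c + m*p) / (n + m).

    n_c -- number of examples with v = v_j and a = a_i
    n   -- number of examples with v = v_j
    p   -- prior estimate for P(a_i | v_j)
    m   -- weight of the prior (number of virtual examples)
    """
    return (n_c + m * p) / (n + m)

# Hypothetical usage: a_i never occurs with v_j (n_c = 0), uniform prior over
# 3 attribute values, m = 3 virtual examples.  The estimate stays non-zero
# instead of forcing the whole naive Bayes product to 0.
print(m_estimate(n_c=0, n=9, p=1/3, m=3))   # 1/12 ≈ 0.083
```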
Why?
Learn which news articles are of interest
Learn to classify web pages by topic
Interesting because:
Naive Bayes assumption of conditional independence too restrictive
But it's intractable without some such assumptions...
Bayesian Belief networks describe conditional independence among subsets of
variables
allows combining prior knowledge about (in)dependencies among variables
with observed training data
Conditional independence: X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z)
What is P(C, R, S, W)?

P(C, R, S, W) = P(C)\,P(R|C)\,P(S|C)\,P(W|R, S) = (.5)(.8)(.9)(.9) = .324
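A quick check of that factorisation (a minimal sketch; the four numbers are the CPT entries used on the slide for this particular assignment of C, R, S, W):

```python
p_c = 0.5           # P(C)
p_r_given_c = 0.8   # P(R | C)
p_s_given_c = 0.9   # P(S | C)
p_w_given_rs = 0.9  # P(W | R, S)

# Chain-rule factorisation implied by the network structure
p_joint = p_c * p_r_given_c * p_s_given_c * p_w_given_rs
print(round(p_joint, 3))   # 0.324
```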
Suppose you observe it is cloudy and raining. What is the probability that the
grass is wet?
How can one infer the (probabilities of) values of one or more network variables,
given observed values of others?
Bayes net contains all information needed for this inference
If only one variable with unknown value, easy to infer it
In the general case, the problem is NP-hard
In practice, can succeed in many cases
Exact inference methods work well for some network structures
Monte Carlo methods simulate the network randomly to calculate
approximate solutions
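A minimal sketch of the Monte Carlo idea, applied to the cloudy-and-raining question above via rejection sampling. The slides only quote P(C) = .5, P(R|C) = .8, P(S|C) = .9 and P(W|R,S) = .9; every other CPT entry below is an assumed placeholder, so the printed estimate is illustrative only:

```python
import random

# CPTs for the Cloudy -> {Rain, Sprinkler} -> WetGrass network.
P_C = 0.5
P_R = {True: 0.8, False: 0.2}    # P(R | C); the False entry is assumed
P_S = {True: 0.9, False: 0.5}    # P(S | C); the False entry is assumed
P_W = {(True, True): 0.9,        # P(W | R, S); only this entry appears on the slides
       (True, False): 0.9,       # assumed
       (False, True): 0.9,       # assumed
       (False, False): 0.0}      # assumed

def sample_network():
    """Draw one joint sample (C, R, S, W) by simulating the network top-down."""
    c = random.random() < P_C
    r = random.random() < P_R[c]
    s = random.random() < P_S[c]
    w = random.random() < P_W[(r, s)]
    return c, r, s, w

# Rejection sampling: keep only samples consistent with the evidence (C = true, R = true)
kept = wet = 0
for _ in range(100_000):
    c, r, s, w = sample_network()
    if c and r:
        kept += 1
        wet += w
print(wet / kept)   # approximate P(grass wet | cloudy, raining)
```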