Bayesian Networks Slides

The document discusses Bayesian learning and Bayes' theorem. Specifically, it provides definitions and explanations of key Bayesian concepts like prior and posterior probabilities, maximum a posteriori hypotheses, and naive Bayes classifiers. It also provides an example application of Bayes' theorem to calculate the probability that a patient has cancer given a positive test result.



Bayesian Learning

Bayes Theorem

MAP, ML hypotheses

MAP learners

Bayes optimal classifier

Naive Bayes learner

Bayesian belief networks



Bayesian Learning - Advantages

Bayesian reasoning provides a probabilistic approach to inference.

It is based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with observed data.

It is important to machine learning because it provides a quantitative approach to weighing the evidence supporting alternative hypotheses.

Bayesian reasoning provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operation of other algorithms that do not explicitly manipulate probabilities.


Bayesian Learning - Relevance

Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.

They provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.


Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

P(h) = prior probability of hypothesis h: the initial probability that h holds, before we have observed the training data.

P(D) = prior probability of training data D: the probability that D will be observed, given no knowledge about which hypothesis holds.

P(D|h) = probability of D given h: the probability of observing data D in some world in which hypothesis h holds.

P(h|D) = posterior probability of h given D: it reflects our confidence that h holds after we have seen the training data D.


Observation

P(h|D) = P(D|h) P(h) / P(D)

P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.

P(h|D) decreases as P(D) increases: the more probable it is that D will be observed independently of h, the less evidence D provides in support of h.


Choosing Hypotheses

P(h|D) = P(D|h) P(h) / P(D)

In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D (or at least one of the maximally probable, if there are several). Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

Maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i ∈ H} P(D|h_i)

(Notation: argmax_{x ∈ X} f(x) is the value of x ∈ X that maximises f(x); for example, argmax_{x ∈ {-3, 1, 2}} x² = -3.)
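To make the argmax concrete, here is a minimal Python sketch (not from the slides); the hypothesis names and the prior/likelihood values are assumed purely for illustration.

```python
# Minimal sketch (not from the slides): choosing MAP and ML hypotheses
# from a toy hypothesis space with assumed priors P(h) and likelihoods P(D|h).

priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # P(h), assumed values
likelihoods = {"h1": 0.10, "h2": 0.40, "h3": 0.30}  # P(D|h), assumed values

# MAP hypothesis: argmax_h P(D|h) P(h)  (P(D) is constant over h, so it can be dropped)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML hypothesis: argmax_h P(D|h)  (equivalent to MAP under a uniform prior)
h_ml = max(likelihoods, key=likelihoods.get)

print("MAP:", h_map)  # h2: 0.40 * 0.3 = 0.12 beats 0.05 and 0.06
print("ML: ", h_ml)   # h2
```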
Bayes Theorem
Does the patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008          P(¬cancer) = .992
P(+|cancer) = .98         P(−|cancer) = .02
P(+|¬cancer) = .03        P(−|¬cancer) = .97

P(cancer|+) = P(+|cancer) P(cancer) / P(+) = (.98)(.008) / .0376 = .209
P(¬cancer|+) = P(+|¬cancer) P(¬cancer) / P(+) = (.03)(.992) / .0376 = .791

where P(+) = P(+|cancer) P(cancer) + P(+|¬cancer) P(¬cancer) = .0376

So even after a positive test, the MAP hypothesis is that the patient does not have cancer.
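The same calculation can be checked numerically; the following sketch simply plugs the slide's numbers into Bayes theorem (the variable names are my own).

```python
# Minimal sketch (not from the slides): Bayes theorem for the cancer test example.

p_cancer = 0.008                    # prior P(cancer)
p_not_cancer = 1 - p_cancer         # P(not cancer) = 0.992
p_pos_given_cancer = 0.98           # test sensitivity, P(+|cancer)
p_pos_given_not = 0.03              # false positive rate, P(+|not cancer)

# Total probability of a positive test result
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer

# Posteriors via Bayes theorem
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_not_given_pos = p_pos_given_not * p_not_cancer / p_pos

print(f"P(+) = {p_pos:.4f}")                       # ~0.0376
print(f"P(cancer|+) = {p_cancer_given_pos:.3f}")   # ~0.209
print(f"P(not cancer|+) = {p_not_given_pos:.3f}")  # ~0.791
```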
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).

Given a new instance x, what is its most probable classification?

h_MAP(x) is not the most probable classification!

Consider three possible hypotheses:
P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given a new instance x,
h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?


Taking all hypotheses into account, the probability that x is positive is .4 (the probability associated with h1), and the probability that it is negative is therefore .6.

The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.

In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

If the possible classification of the new example can take on any value v_j from some set V, then the probability P(v_j|D) that the correct classification for the new instance is v_j is just


Bayes Optimal Classifier

Bayes optimal classification:

argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

Example:

P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

therefore

Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6

and

argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = −
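A small Python sketch of the weighted vote above (the posteriors and prediction tables come from the example; the function name is my own).

```python
# Minimal sketch (not from the slides): Bayes optimal classification by
# weighting each hypothesis's prediction by its posterior P(h|D).

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i|D)
predictions = {                                   # P(v_j|h_i) for v_j in {+, -}
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(values, posteriors, predictions):
    # For each class value, sum P(v|h) * P(h|D) over all hypotheses,
    # then return the value with the largest weighted vote.
    score = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
             for v in values}
    return max(score, key=score.get), score

label, score = bayes_optimal(["+", "-"], posteriors, predictions)
print(label, score)   # '-' with scores {'+': 0.4, '-': 0.6}
```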


Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbour, this is one of the most practical learning methods.

When to use:
  A moderate or large training set is available.
  The attributes that describe instances are conditionally independent given the classification.
Successful applications:
  Diagnosis
  Classifying text documents


Naive Bayes Classifier

Assume a target function f : X → V, where each instance x is described by attributes ⟨a1, a2, ..., an⟩.

The most probable value of f(x) is:

v_MAP = argmax_{v_j ∈ V} P(v_j|a1, a2, ..., an)
      = argmax_{v_j ∈ V} P(a1, a2, ..., an|v_j) P(v_j) / P(a1, a2, ..., an)
      = argmax_{v_j ∈ V} P(a1, a2, ..., an|v_j) P(v_j)

Naive Bayes assumption:

P(a1, a2, ..., an|v_j) = Π_i P(a_i|v_j)

which gives the Naive Bayes classifier:

v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i|v_j)


Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)
  v_NB = argmax_{v_j ∈ V} P̂(v_j) Π_{a_i ∈ x} P̂(a_i|v_j)
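The algorithm above can be sketched in Python roughly as follows; this is a minimal illustration assuming discrete attributes and simple frequency estimates (all names are my own), not a production implementation.

```python
# Minimal sketch (not from the slides): a tiny Naive Bayes learner for
# discrete attributes, estimating P(v) and P(a_i|v) by relative frequencies.
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    # examples: list of (attribute_tuple, target_value)
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)        # counts keyed by (position, class)
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    def cond_prob(i, a, v):                   # estimate of P(a_i = a | v)
        return attr_counts[(i, v)][a] / class_counts[v]
    return priors, cond_prob

def naive_bayes_classify(x, priors, cond_prob):
    # v_NB = argmax_v P(v) * prod_i P(a_i|v)
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond_prob(i, a, v)
        return p
    return max(priors, key=score)
```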


Naive Bayes: Example
Training Dataset

Age        Income   Student  Credit_rating  Buys_computer?
<=30       high     no       fair           no
<=30       high     no       excellent      no
31...40    high     no       fair           yes
>40        medium   no       fair           yes
>40        low      yes      fair           yes
>40        low      yes      excellent      no
31...40    low      yes      excellent      yes
<=30       medium   no       fair           no
<=30       low      yes      fair           yes
>40        medium   yes      fair           yes
<=30       medium   yes      excellent      yes
31...40    medium   no       excellent      yes
31...40    high     yes      fair           yes
>40        medium   no       excellent      no

Data sample:
X = (Age <= 30, Income = medium, Student = yes, Credit_rating = fair)
Classes:
C1: Buys_computer = yes
C2: Buys_computer = no
Naive Bayes: Example

Compute P(X|Ci) for each class, where X = (Age <= 30, Income = medium, Student = yes, Credit_rating = fair):

P(Age = '<=30' | Buys_computer = 'yes') = 2/9 = 0.222
P(Age = '<=30' | Buys_computer = 'no') = 3/5 = 0.6
P(Income = 'medium' | Buys_computer = 'yes') = 4/9 = 0.444
P(Income = 'medium' | Buys_computer = 'no') = 2/5 = 0.4
P(Student = 'yes' | Buys_computer = 'yes') = 6/9 = 0.667
P(Student = 'yes' | Buys_computer = 'no') = 1/5 = 0.2
P(Credit_rating = 'fair' | Buys_computer = 'yes') = 6/9 = 0.667
P(Credit_rating = 'fair' | Buys_computer = 'no') = 2/5 = 0.4

P(X|Ci):
P(X | Buys_computer = 'yes') = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | Buys_computer = 'no') = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) P(Ci):
P(X | Buys_computer = 'yes') P(Buys_computer = 'yes') = 0.044 × 9/14 = 0.028
P(X | Buys_computer = 'no') P(Buys_computer = 'no') = 0.019 × 5/14 = 0.007

X belongs to class Buys_computer = 'yes'.
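As a quick arithmetic check of the numbers above, a sketch using exact fractions (counts taken from the training table):

```python
# Minimal sketch (not from the slides): verifying the example's products.
from fractions import Fraction as F

p_yes, p_no = F(9, 14), F(5, 14)                  # class priors from the 14-row table
px_yes = F(2, 9) * F(4, 9) * F(6, 9) * F(6, 9)    # Age, Income, Student, Credit | yes
px_no  = F(3, 5) * F(2, 5) * F(1, 5) * F(2, 5)    # same attributes | no

print(float(px_yes), float(px_no))                 # ~0.044, ~0.019
print(float(px_yes * p_yes), float(px_no * p_no))  # ~0.028, ~0.007 -> predict 'yes'
```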


Naive Bayes: Subtleties

1. The conditional independence assumption is often violated:

P(a1, a2, ..., an|v_j) = Π_i P(a_i|v_j)

...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that

argmax_{v_j ∈ V} P̂(v_j) Π_i P̂(a_i|v_j) = argmax_{v_j ∈ V} P(v_j) P(a1, ..., an|v_j)

See [Domingos & Pazzani, 1996] for analysis.

Naive Bayes posteriors are often unrealistically close to 1 or 0.


Naive Bayes: Subtleties

2. What if none of the training instances with target value v_j have attribute value a_i? Then P̂(a_i|v_j) = 0, and...

P̂(v_j) Π_i P̂(a_i|v_j) = 0

The typical solution is a Bayesian estimate for P̂(a_i|v_j):

P̂(a_i|v_j) ← (n_c + m p) / (n + m)

where
  n is the number of training examples for which v = v_j,
  n_c is the number of examples for which v = v_j and a = a_i,
  p is a prior estimate for P̂(a_i|v_j),
  m is the weight given to the prior (i.e., the number of "virtual" examples).
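A short Python version of this estimate (the example numbers at the end are assumed, not from the slides):

```python
# Minimal sketch (not from the slides): the m-estimate for P(a_i|v_j).
def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m).

    n_c: count of examples with class v_j and attribute value a_i
    n:   count of examples with class v_j
    p:   prior estimate for P(a_i|v_j), e.g. 1/k for k attribute values
    m:   equivalent sample size (weight given to the prior)
    """
    return (n_c + m * p) / (n + m)

# Assumed example: no matching examples (n_c = 0) among n = 5 class examples,
# 2 possible attribute values so p = 1/2, and m = 3 virtual examples.
print(m_estimate(0, 5, 0.5, 3))  # 0.1875 instead of a hard zero
```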


Learning to Classify Text

Why?
  Learn which news articles are of interest.
  Learn to classify web pages by topic.

Naive Bayes is among the most effective algorithms.

What attributes shall we use to represent text documents?


Learning to Classify Text

Target concept Interesting?: Document → {+, −}

1. Represent each document by a vector of words: one attribute per word position in the document.
2. Learning: use training examples to estimate
   P(+), P(−), P(doc|+), P(doc|−)

Naive Bayes conditional independence assumption:

P(doc|v_j) = Π_{i=1}^{length(doc)} P(a_i = w_k|v_j)

where P(a_i = w_k|v_j) is the probability that the word in position i is w_k, given v_j.

One more assumption: P(a_i = w_k|v_j) = P(a_m = w_k|v_j) for all i, m.


Learn_naive_Bayes_text(Examples, V)
1. Collect all words and other tokens that occur in Examples:
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(v_j) and P(w_k|v_j) probability terms:
   For each target value v_j in V do
     docs_j ← subset of Examples for which the target value is v_j
     P(v_j) ← |docs_j| / |Examples|
     Text_j ← a single document created by concatenating all members of docs_j
     n ← total number of words in Text_j (counting duplicate words multiple times)
     for each word w_k in Vocabulary
       n_k ← number of times word w_k occurs in Text_j
       P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)


Classify_naive_Bayes_text(Doc)
  positions ← all word positions in Doc that contain tokens found in Vocabulary
  Return v_NB, where
    v_NB = argmax_{v_j ∈ V} P(v_j) Π_{i ∈ positions} P(a_i|v_j)
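A minimal Python sketch of the two procedures above; it follows the pseudocode but sums log-probabilities in the classifier to avoid numerical underflow (a deviation from the slides), and all function names are my own.

```python
# Minimal sketch (not from the slides) of Learn_naive_Bayes_text /
# Classify_naive_Bayes_text with the (n_k + 1) / (n + |Vocabulary|) estimate.
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    # examples: list of (list_of_tokens, target_value)
    vocabulary = {w for doc, _ in examples for w in doc}
    classes = {v for _, v in examples}
    priors, word_probs = {}, {}
    for v in classes:
        docs_v = [doc for doc, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = Counter(w for doc in docs_v for w in doc)   # concatenated class text
        n = sum(text_v.values())                             # total words in Text_j
        word_probs[v] = {w: (text_v[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    # Sum log-probabilities instead of multiplying, for numerical stability.
    def log_score(v):
        return math.log(priors[v]) + sum(
            math.log(word_probs[v][w]) for w in doc if w in vocabulary)
    return max(priors, key=log_score)
```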


Bayesian Belief Networks

Interesting because:
  The Naive Bayes assumption of conditional independence is too restrictive.
  But it is intractable without some such assumptions...
  Bayesian belief networks describe conditional independence among subsets of variables.
  This allows combining prior knowledge about (in)dependencies among variables with observed training data.

(also called Bayes Nets)


Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write

P(X|Y, Z) = P(X|Z)

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder|Rain, Lightning) = P(Thunder|Lightning)

Naive Bayes uses conditional independence to justify

P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
Bayesian Belief Network

Network represents a set of conditional independence assertions:


Each node is asserted to be conditionally independent of its nondescendants,
given its immediate predecessors.
Directed acyclic graph



Bayesian Belief Network

Represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire).

In general,

P(y1, ..., yn) = Π_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.

So the joint distribution is fully defined by the graph, plus the P(y_i | Parents(Y_i)).
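To illustrate the factored joint, here is a minimal sketch; the two-node structure and the CPT values are assumed purely for illustration (only the variable names Storm and Campfire appear in later slides, and their actual network and probabilities are not given here).

```python
# Minimal sketch (not from the slides): evaluating the factored joint
# P(y1,...,yn) = prod_i P(y_i | Parents(Y_i)) for a toy network with
# an assumed structure (Storm -> Campfire) and assumed CPT values.

# Each CPT maps a tuple of parent values to P(variable = True | parents).
network = {
    "Storm":    {"parents": [],        "cpt": {(): 0.2}},                       # assumed
    "Campfire": {"parents": ["Storm"], "cpt": {(True,): 0.1, (False,): 0.4}},   # assumed
}

def joint_probability(assignment, network):
    """P(assignment) as a product of P(y_i | Parents(Y_i)) terms."""
    p = 1.0
    for var, node in network.items():
        parent_vals = tuple(assignment[par] for par in node["parents"])
        p_true = node["cpt"][parent_vals]
        p *= p_true if assignment[var] else (1 - p_true)
    return p

print(joint_probability({"Storm": True, "Campfire": False}, network))  # 0.2 * 0.9 = 0.18
```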


Example

[Network figure: Smoking (S) and Pneumonia (Pn) are parents of Cough (C); the CPT gives P(Pn) = .1, P(S) = .2, and P(C | Pn, S) as used below.]

What is P(cough | smoking ∧ pneumonia)?

From the table, P(C | S ∧ Pn) = .95.

Example

What is P(smoking | cough)?

P(S|C) = P(C|S) P(S) / P(C)

P(C|S) P(S) = [P(C|S ∧ Pn) P(Pn) + P(C|S ∧ ¬Pn) P(¬Pn)] P(S)
            = [(.95)(.1) + (.6)(.9)](.2) = .127

P(C) = P(C|Pn ∧ S) P(Pn) P(S) + P(C|Pn ∧ ¬S) P(Pn) P(¬S)
     + P(C|¬Pn ∧ S) P(¬Pn) P(S) + P(C|¬Pn ∧ ¬S) P(¬Pn) P(¬S)
     = (.95)(.1)(.2) + (.8)(.1)(.8) + (.6)(.9)(.2) + (.05)(.9)(.8) = .227

P(S|C) = .127 / .227 = .56
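The same answer can be reproduced by brute-force enumeration over the unobserved variable; this sketch reuses the CPT values from the example (the code layout and names are my own).

```python
# Minimal sketch (not from the slides): computing P(smoking | cough) by
# enumeration, using the CPT values from the example above.
p_pn, p_s = 0.1, 0.2                      # priors P(Pneumonia), P(Smoking)
p_c_given = {                             # P(Cough | Pneumonia, Smoking)
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.6, (False, False): 0.05,
}

def p(b, prob):                           # P(var = b) for a Boolean with P(True) = prob
    return prob if b else 1 - prob

# P(C) and P(C, S) by summing the joint over the unobserved variable(s)
p_c = sum(p_c_given[(pn, s)] * p(pn, p_pn) * p(s, p_s)
          for pn in (True, False) for s in (True, False))
p_c_and_s = sum(p_c_given[(pn, True)] * p(pn, p_pn) * p_s for pn in (True, False))

print(p_c)              # 0.227
print(p_c_and_s / p_c)  # P(S|C) ~ 0.56
```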
Yet Another Example

[Network figure: Cloudy (C) is a parent of Sprinkler (S) and Rain (R); Sprinkler and Rain are parents of WetGrass (W).]

What is P(C, R, ¬S, W)?

P(C, R, ¬S, W) = P(C) P(R|C) P(¬S|C) P(W|R, ¬S) = (.5)(.8)(.9)(.9) = .324
Suppose you observe that it is cloudy and raining. What is the probability that the grass is wet?

Since wet grass is conditionally independent of cloudy given rain and sprinkler, we have

P(W|C, R) = P(W|R, S) P(S|C) + P(W|R, ¬S) P(¬S|C)
          = (.99)(.1) + (.9)(.9) = .909
Suppose you observe the sprinkler to be on and the grass is wet. What is the probability that it is raining?

Suppose you observe that the grass is wet and it is raining. What is the probability that it is cloudy?


Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  The Bayes net contains all information needed for this inference.
  If only one variable has an unknown value, it is easy to infer it.
  In the general case, the problem is NP-hard.
In practice, inference can succeed in many cases:
  Exact inference methods work well for some network structures.
  Monte Carlo methods simulate the network randomly to calculate approximate solutions (see the sketch below).
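As an illustration of the Monte Carlo idea, here is a minimal rejection-sampling sketch for the smoking/cough network from the earlier example; the sample count and function names are my own assumptions.

```python
# Minimal sketch (not from the slides): approximate inference by rejection
# sampling, estimating P(smoking | cough) for the earlier example network.
import random

def sample_once():
    pn = random.random() < 0.1                 # P(Pneumonia) = .1
    s = random.random() < 0.2                  # P(Smoking) = .2
    p_c = {(True, True): 0.95, (True, False): 0.8,
           (False, True): 0.6, (False, False): 0.05}[(pn, s)]
    c = random.random() < p_c                  # P(Cough | Pn, S)
    return pn, s, c

def estimate_p_smoking_given_cough(n_samples=200_000):
    kept = smoking = 0
    for _ in range(n_samples):
        _, s, c = sample_once()
        if c:                                  # keep only samples consistent with the evidence
            kept += 1
            smoking += s
    return smoking / kept

print(estimate_p_smoking_given_cough())        # ~0.56 (exact answer .127/.227)
```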


Learning of Bayesian Networks

Several variants of this learning task:
  The network structure might be known or unknown.
  Training examples might provide values of all network variables, or just some.
If the structure is known and we observe all variables:
  Then it is as easy as training a Naive Bayes classifier.


Learning Bayes Nets

Suppose the structure is known and the variables are partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
  This is similar to training a neural network with hidden units.
  In fact, we can learn the network conditional probability tables using gradient ascent!
  We converge to a network h that (locally) maximizes P(D|h).


Summary: Bayesian Belief Networks

Combine prior knowledge with observed data.
The impact of prior knowledge (when correct!) is to lower the sample complexity.
Active research area:
  Extend from Boolean to real-valued variables
  Parameterized distributions instead of tables
  Extend to first-order instead of propositional systems
  More effective inference methods
  ...