Bayesian Learning
Bayes Rule:
    P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

    P(cancer) = 0.008          P(¬cancer) = 0.992
    P(+ | cancer) = 0.98       P(- | cancer) = 0.02
    P(+ | ¬cancer) = 0.03      P(- | ¬cancer) = 0.97

    P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
    P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
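The arithmetic can be checked with a minimal Python sketch; the rates are the ones stated in the example:

```python
# Posterior for the cancer test example (values from the slide above).
p_cancer = 0.008
p_no_cancer = 1 - p_cancer              # 0.992
p_pos_given_cancer = 0.98               # correct positive rate
p_pos_given_no_cancer = 1 - 0.97        # false positive rate = 0.03

# Unnormalized posteriors: P(h | +) is proportional to P(+ | h) P(h)
score_cancer = p_pos_given_cancer * p_cancer            # ~0.0078
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # ~0.0298

p_pos = score_cancer + score_no_cancer                  # P(+), the normalizer
print(round(score_cancer / p_pos, 2))     # ~0.21 = P(cancer | +)
print(round(score_no_cancer / p_pos, 2))  # ~0.79, so hMAP = no cancer
```

Even with a positive test result, the MAP hypothesis is "no cancer", because the prior probability of the disease is so small.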
Maximum A Posteriori (MAP) Hypothesis

    P(h | D) = P(D | h) P(h) / P(D)
    hMAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
(a minimal sketch of this rule follows below)

The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis).
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
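A minimal sketch of applying the MAP rule; the hypotheses and their P(D|h), P(h) values below are hypothetical, chosen only to illustrate the argmax:

```python
# hMAP = argmax over h of P(D|h) * P(h); the entries below are made-up illustration values.
hypotheses = {"h1": (0.2, 0.5), "h2": (0.6, 0.3), "h3": (0.9, 0.1)}  # h -> (P(D|h), P(h))

h_map = max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])
print(h_map)  # 'h2': 0.6 * 0.3 = 0.18 beats 0.10 and 0.09
```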
Bayes Optimal Classifier
Question: Given a new instance x, what is its most probable classification?
• hMAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3.
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -.
What is the most probable classification of x?

Bayes optimal classification:
    argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)
where V is the set of all the values a classification can take and vj is one possible such classification.

Example:
P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0

    Σ_{hi ∈ H} P(+ | hi) P(hi | D) = .4
    Σ_{hi ∈ H} P(- | hi) P(hi | D) = .6

so the Bayes optimal classification of x is - (whereas hMAP = h1 predicts +).
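The same computation as a minimal Python sketch, using the posteriors and the 0/1 predictions from the example above:

```python
# Bayes optimal classification: argmax over v of sum_h P(v | h) * P(h | D).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}       # each h votes with P(v | h) in {0, 1}

scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ("+", "-")}
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-', even though hMAP = h1 predicts '+'
```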
Bayes Theorem
    P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
    P(Y | X) ∝ P(X | Y) P(Y) = P(X1, ..., Xn | Y) P(Y)
  Difficulty: learning the joint probability P(X1, ..., Xn | Y)
• Naïve Bayes classification
  Assume all input features are conditionally independent!
    P(X1, X2, ..., Xn | Y) = P(X1 | X2, ..., Xn, Y) P(X2, ..., Xn | Y)
                           = P(X1 | Y) P(X2, ..., Xn | Y)
                           = P(X1 | Y) P(X2 | Y) ... P(Xn | Y)
Example
• Example: Play Tennis
Example
Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High,
Wind=Strong)
– Look up the tables obtained in the learning phase:
P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14
– Decision making with the MAP rule
P(Yes|x') ≈ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ≈ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since P(No|x') > P(Yes|x'), the predicted label is Play=No.
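The two products can be reproduced with a short Python check using the table values above:

```python
# MAP decision for x' = (Sunny, Cool, High, Strong) with the Naive Bayes factorization.
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(Yes) * product of P(xi | Yes)
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(No)  * product of P(xi | No)

print(round(p_yes, 4), round(p_no, 4))            # 0.0053 0.0206
print("Play=No" if p_no > p_yes else "Play=Yes")  # Play=No
```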
                      True class
Hypothesized class    Pos         Neg
Yes                   TP          FP
No                    FN          TN
                      P = TP+FN   N = FP+TN

• Accuracy = (TP+TN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
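As a minimal sketch, the three measures written as functions of the confusion-matrix counts:

```python
# tp, fp, fn, tn are the counts from the confusion matrix above.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # (TP+TN)/(P+N)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)                    # TP / P, with P = TP + FN
```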
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
• Bayes network represents conditional independence
relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
Bayesian Network
• A graphical model that efficiently encodes the joint probability
distribution for a large set of variables
• A Bayesian network for a set of variables (nodes) X = {X1, ..., Xn}
• Arcs represent probabilistic dependence among variables
• Lack of an arc denotes a conditional independence
• The network structure is a directed acyclic graph
• Local probability distributions at each node (Conditional Probability Table)
[Figure: example network with nodes Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting]
Representation in Bayesian Belief Networks
[Figure: the same network (Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting)]
A conditional probability table associated with each node specifies the conditional distribution for the variable given its immediate parents in the graph.
• Diagnosis: P(cause|symptom)=?
• Prediction: P(symptom|cause)=?
• Classification: P(class|data)
• Decision-making (given a cost function)
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations
In general,
    p(X1, X2, ..., XN) = Π_i p(Xi | parents(Xi))
i.e., the full joint distribution is replaced by the graph-structured approximation.
• Requires that the graph is acyclic (no directed cycles)

Example (Markov dependence): A → B → C
    p(A, B, C) = p(C | B) p(B | A) p(A)
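A minimal Python sketch of the chain A → B → C; only the factorization is taken from the slide, the CPT values below are hypothetical:

```python
# p(A, B, C) = p(A) * p(B | A) * p(C | B) for the chain A -> B -> C.
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}   # [a][b]
p_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.4, False: 0.6}}   # [b][c]

def joint(a, b, c):
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Sanity check: the eight joint probabilities sum to 1 (up to floating point).
total = sum(joint(a, b, c) for a in (True, False) for b in (True, False) for c in (True, False))
print(joint(True, True, False))   # 0.12
print(total)                      # ~1.0
```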
Example: Find the probability that P1 has called 'gfg' (P1 = T) and P2 has called 'gfg' (P2 = T) when the alarm A rang, but no burglary B and no fire F occurred.
[The network figure is accompanied by conditional probability tables such as P(P1 | A) and P(A | B, F); the numeric entries are not reproduced here.]
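A hedged sketch of this query: the factorization follows from the network structure (B and F are parents of A; P1 and P2 each depend only on A), but the CPT values below are hypothetical placeholders since the slide's tables are not reproduced in the text:

```python
# P(P1=T, P2=T, A=T, B=F, F=F)
#   = P(P1=T | A=T) * P(P2=T | A=T) * P(A=T | B=F, F=F) * P(B=F) * P(F=F)
p_B = 0.001                    # hypothetical P(Burglary = T)
p_F = 0.002                    # hypothetical P(Fire = T)
p_A_given_noB_noF = 0.01       # hypothetical P(A = T | B = F, F = F)
p_P1_given_A = 0.90            # hypothetical P(P1 calls | A = T)
p_P2_given_A = 0.70            # hypothetical P(P2 calls | A = T)

answer = p_P1_given_A * p_P2_given_A * p_A_given_noB_noF * (1 - p_B) * (1 - p_F)
print(answer)
```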
Hidden Markov Model (HMM)
[Figure: observed variables Y1, Y2, Y3, ..., Yn; below them, hidden states S1, S2, S3, ..., Sn]
Assumptions:
1. The hidden state sequence is Markov.
2. Observation Yt is conditionally independent of all other variables given St.
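Under these two assumptions the joint distribution factorizes as p(S1, ..., Sn, Y1, ..., Yn) = p(S1) p(Y1 | S1) Π_{t≥2} p(St | St-1) p(Yt | St). A minimal Python sketch of that factorization, with hypothetical transition and emission tables:

```python
import itertools

# Hypothetical two-state HMM: p(S1), p(St | St-1), p(Yt | St).
init = {0: 0.6, 1: 0.4}
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.2, "b": 0.8}}

def joint(states, observations):
    # p(S, Y) = p(S1) p(Y1|S1) * product over t >= 2 of p(St|St-1) p(Yt|St)
    p = init[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, y in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][y]
    return p

obs = ("a", "b", "a")
# Likelihood of the observations: marginalize the joint over all hidden paths.
likelihood = sum(joint(s, obs) for s in itertools.product((0, 1), repeat=len(obs)))
print(joint((0, 0, 1), obs), likelihood)
```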
MAP estimates: the only difference from maximum-likelihood estimates is that the prior contributes "imaginary" examples (pseudo-counts) to the observed counts.
Naïve Bayes: Assumptions of Conditional Independence
Often the Xi are not really conditionally independent.
Sometimes we assume the variance σik
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ).
Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)
• Train Naïve Bayes (examples):
  for each value yk
      estimate πk ≡ P(Y = yk)
      for each attribute Xi, estimate the class-conditional mean μik and variance σik
• Classify (Xnew):
      Ynew ← argmax_{yk} P(Y = yk) Π_i P(Xi_new | Y = yk)
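A minimal sketch of the training step for a single continuous attribute, assuming a list of (x, y) examples; the function and variable names are illustrative, not from the slide:

```python
import statistics

def fit_gaussian_nb(examples):
    """examples: list of (x, y) pairs, x a real-valued attribute, y a discrete label."""
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    params = {}
    for y, xs in by_class.items():
        params[y] = (len(xs) / len(examples),   # prior pi_k = P(Y = yk)
                     statistics.mean(xs),       # class-conditional mean mu_ik
                     statistics.pstdev(xs))     # class-conditional std dev sigma_ik
    return params
```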
Estimating Parameters: Y discrete, Xi continuous
    P̂(x | Yes) = 1 / (2.35 √(2π)) · exp( -(x - 21.64)² / (2 · 2.35²) )
    P̂(x | No)  = 1 / (7.09 √(2π)) · exp( -(x - 23.88)² / (2 · 7.09²) )
(class-conditional Gaussians with means 21.64, 23.88 and standard deviations 2.35, 7.09, respectively)
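A minimal Python sketch of these two class-conditional densities (mean and standard deviation values taken from the estimates above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p_x_given_yes = lambda x: gaussian_pdf(x, mu=21.64, sigma=2.35)   # P(x | Yes)
p_x_given_no = lambda x: gaussian_pdf(x, mu=23.88, sigma=7.09)    # P(x | No)

# These likelihoods plug into the Naive Bayes product in place of discrete table entries.
print(p_x_given_yes(22.0), p_x_given_no(22.0))
```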
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• is rarely satisfied in practice, as attributes (variables) are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning
with causal relationships between attributes