L13 Bayesian Methods
• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naïve Bayes learner
• Bayesian belief networks
$$h_{MAP} = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
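A minimal sketch of this argmax in Python; the hypothesis names, priors P(h), and likelihoods P(D|h) below are made-up illustration values:

    # MAP hypothesis: argmax over h in H of P(D|h) * P(h).
    # Hypothesis names and probabilities are hypothetical illustration values.
    prior = {"h1": 0.3, "h2": 0.5, "h3": 0.2}          # P(h)
    likelihood = {"h1": 0.10, "h2": 0.05, "h3": 0.30}  # P(D|h)

    # P(D) is constant across hypotheses, so it drops out of the argmax.
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    print(h_map)  # "h3": 0.30 * 0.2 = 0.06 beats 0.03 and 0.025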
Example: apply Bayes' theorem to compute the posterior P(cancer|+) for a positive test result, from the prior P(cancer) and the test's likelihoods P(+|cancer) and P(+|¬cancer).
Some Formulas for Probabilities
• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A1,…,An are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
  $$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)$$
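A quick numeric check of these three rules (all probabilities below are made-up values):

    # Product rule
    p_b, p_a_given_b = 0.4, 0.5
    p_ab = p_a_given_b * p_b                      # P(A ∧ B) = 0.2

    # Sum rule
    p_a = 0.3
    p_a_or_b = p_a + p_b - p_ab                   # P(A ∨ B) = 0.3 + 0.4 - 0.2 = 0.5

    # Theorem of total probability with a three-event partition A1, A2, A3
    p_ai = [0.2, 0.5, 0.3]                        # mutually exclusive, sums to 1
    p_b_given_ai = [0.9, 0.4, 0.1]
    p_b_total = sum(pb * pa for pb, pa in zip(p_b_given_ai, p_ai))  # 0.41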
Learning a Real Valued Function
Consider any real-valued target function f
Training examples (xi, di), where di is a noisy training value
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean = 0
Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
Learning a Real Valued Function
$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)$$
$$= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$
Taking the logarithm (a monotone transform) and dropping the terms that do not depend on h:
$$= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
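A small sketch of this result using numpy: sample noisy values of a (made-up) linear target function, then recover hML with an ordinary least-squares fit, which minimizes exactly the sum of squared errors above:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: 2.0 * x + 1.0                    # true target function (hypothetical)
    x = np.linspace(0.0, 1.0, 50)
    d = f(x) + rng.normal(0.0, 0.1, size=x.size)   # d_i = f(x_i) + e_i, e_i ~ N(0, σ²)

    # For linear hypotheses, argmin of Σ (d_i − h(x_i))² is ordinary least squares.
    slope, intercept = np.polyfit(x, d, deg=1)
    print(slope, intercept)                        # close to 2.0 and 1.0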
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
$$h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$$
where LC(x) is the description length of x under
encoding C
Example:
• H = decision trees, D = training data labels
• LC1(h) is # bits to describe tree h
• LC2(D|h) is # bits to describe D given h
  – Note LC2(D|h) = 0 if the examples are classified perfectly by h; need only describe the exceptions
• Hence hMDL trades off tree size for training errors
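A toy sketch of this trade-off; the candidate trees and their bit counts below are invented for illustration, not derived from real encodings:

    # Each candidate: (name, L_C1(h) = bits for the tree, L_C2(D|h) = bits for exceptions)
    candidates = [
        ("small tree",  20, 45),   # short hypothesis, many misclassified examples
        ("medium tree", 50, 10),
        ("large tree", 120,  0),   # classifies the training data perfectly
    ]

    h_mdl = min(candidates, key=lambda c: c[1] + c[2])
    print(h_mdl[0])  # "medium tree": 60 total bits beats 65 and 120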
Minimum Description Length Principle
$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h) = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right]$$
Interpretation: −log2 P(h) is the description length of h under the optimal code for H, and −log2 P(D|h) is the description length of D given h, so hMAP coincides with hMDL when C1 and C2 are these optimal codes.
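In code the equivalence is just a monotone transform: maximizing P(D|h)P(h) and minimizing the total description length −log2 P(D|h) − log2 P(h) select the same hypothesis (same made-up numbers as before):

    from math import log2

    prior = {"h1": 0.3, "h2": 0.5, "h3": 0.2}
    likelihood = {"h1": 0.10, "h2": 0.05, "h3": 0.30}

    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    h_mdl = min(prior, key=lambda h: -log2(likelihood[h]) - log2(prior[h]))
    assert h_map == h_mdl   # −log2 is monotone decreasing, so argmax and argmin agree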
Naïve Bayes Classifier
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$
Naïve Bayes assumption:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
which gives the Naïve Bayes classifier:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Naïve Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  $$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$
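A runnable Python sketch of this algorithm, assuming discrete attributes and plain frequency estimates for the probabilities (no smoothing, as on the slide):

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, target_value) pairs."""
        target_counts = Counter(v for _, v in examples)
        p_v = {v: n / len(examples) for v, n in target_counts.items()}   # P̂(vj)
        counts = defaultdict(int)              # (attr index, value, target) -> count
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                counts[(i, a, v)] += 1
        p_a_v = {k: n / target_counts[k[2]] for k, n in counts.items()}  # P̂(ai|vj)
        return p_v, p_a_v

    def classify_new_instance(x, p_v, p_a_v):
        def score(v):                          # P̂(v) · Π_i P̂(ai|v)
            s = p_v[v]
            for i, a in enumerate(x):
                s *= p_a_v.get((i, a, v), 0.0) # unseen (value, target) pairs get 0
            return s
        return max(p_v, key=score)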
P(+)*P(Blue|+)*P(SUV|+)*P(2|+)*P(WhiteW|+) = 5/14 * 1/5 * 2/5 * 4/5 * 3/5 = 0.0137
P(-)*P(Blue|-)*P(SUV|-)*P(2|-)*P(WhiteW|-) = 9/14 * 3/9 * 4/9 * 3/9 * 3/9 = 0.0106
Since 0.0137 > 0.0106, the classifier predicts vNB = +
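The two products can be checked with exact fractions:

    from fractions import Fraction as F

    pos = F(5, 14) * F(1, 5) * F(2, 5) * F(4, 5) * F(3, 5)
    neg = F(9, 14) * F(3, 9) * F(4, 9) * F(3, 9) * F(3, 9)
    print(float(pos), float(neg))   # 0.0137... and 0.0105..., so predict +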
(Figure: Bayesian belief network example, with nodes including Thunder and ForestFire)