Bayes Algorithm
P(h|D) = P(D|h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)
Generally want the most probable hypothesis given the
training data
Maximum a posteriori hypothesis hMAP:
hMAP = arg max_{h∈H} P(h|D)
     = arg max_{h∈H} P(D|h) P(h) / P(D)
     = arg max_{h∈H} P(D|h) P(h)
If we assume P(hi) = P(hj) for all i, j, we can simplify further and
choose the maximum likelihood (ML) hypothesis:
hML = arg max_{hi∈H} P(D|hi)
Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the
disease is not present. Furthermore, 0.8% of the entire
population have this cancer.
P(cancer) = 0.008          P(¬cancer) = 0.992
P(+|cancer) = 0.98         P(−|cancer) = 0.02
P(+|¬cancer) = 0.03        P(−|¬cancer) = 0.97
P(cancer|+) = P(+|cancer) P(cancer) / P(+) = 0.0078 / 0.0376 ≈ 0.21
P(¬cancer|+) = P(+|¬cancer) P(¬cancer) / P(+) = 0.0298 / 0.0376 ≈ 0.79
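These values can be checked with a few lines of code. The sketch below uses only the numbers from the problem statement; the variable names are mine.

# Bayes theorem applied to the cancer test example above.
p_cancer = 0.008                  # P(cancer): 0.8% of the population
p_not_cancer = 1 - p_cancer       # P(¬cancer)
p_pos_given_cancer = 0.98         # P(+|cancer)
p_pos_given_not_cancer = 0.03     # P(+|¬cancer) = 1 - 0.97

# Unnormalized posteriors P(+|h) P(h), then P(+) by total probability
joint_cancer = p_pos_given_cancer * p_cancer              # ≈ 0.0078
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ≈ 0.0298
p_pos = joint_cancer + joint_not_cancer                   # ≈ 0.0376

print("P(cancer|+) =", joint_cancer / p_pos)        # ≈ 0.21
print("P(¬cancer|+) =", joint_not_cancer / p_pos)   # ≈ 0.79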
Some Formulas for Probabilities
• Product rule: probability P(A ∧ B) of a
conjunction of two events A and B:
P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Sum rule: probability of disjunction of two events
A and B:
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
• Theorem of total probability: if events A1, …, An
are mutually exclusive with ∑_{i=1}^{n} P(Ai) = 1, then
P(B) = ∑_{i=1}^{n} P(B|Ai) P(Ai)
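These rules can be checked numerically on any small joint distribution; the joint over two binary events below is made up purely for illustration.

# Verify the product rule, sum rule, and total probability on a
# made-up joint distribution over two binary events A and B.
joint = {  # P(A=a, B=b); arbitrary values that sum to 1
    (True, True): 0.20, (True, False): 0.30,
    (False, True): 0.10, (False, False): 0.40,
}

def p(event):  # probability of a predicate over (a, b)
    return sum(pr for (a, b), pr in joint.items() if event(a, b))

p_a, p_b = p(lambda a, b: a), p(lambda a, b: b)
p_a_and_b = p(lambda a, b: a and b)
p_a_or_b = p(lambda a, b: a or b)
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a
p_b_given_not_a = p(lambda a, b: (not a) and b) / (1 - p_a)

# Product rule: P(A|B)P(B) = P(B|A)P(A) = P(A ∧ B)
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12
# Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
assert abs(p_a_or_b - (p_a + p_b - p_a_and_b)) < 1e-12
# Total probability: P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
assert abs(p_b - (p_b_given_a * p_a + p_b_given_not_a * (1 - p_a))) < 1e-12
print("product, sum, and total-probability rules hold on this joint")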
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior
probability
P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest
posterior probability
hMAP = arg max_{h∈H} P(h|D)
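The brute-force learner translates almost directly into code. In the sketch below the hypothesis space (candidate coin biases), the uniform prior, and the likelihood function are illustrative placeholders, not part of the slide.

# Brute-force MAP learner: score every hypothesis by P(D|h) P(h),
# normalize by P(D), and return the most probable hypothesis.
def brute_force_map(hypotheses, prior, likelihood, data):
    """prior(h) = P(h); likelihood(data, h) = P(D|h)."""
    unnormalized = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(unnormalized.values())          # P(D), by total probability
    posterior = {h: u / p_data for h, u in unnormalized.items()}
    return max(posterior, key=posterior.get), posterior

# Toy use: hypotheses are coin biases, data is a string of flips.
def coin_likelihood(flips, h):
    p = 1.0
    for flip in flips:
        p *= h if flip == "H" else (1.0 - h)
    return p

hyps = [0.1, 0.5, 0.9]
uniform_prior = lambda h: 1.0 / len(hyps)   # equal priors, so h_MAP = h_ML
h_map, post = brute_force_map(hyps, uniform_prior, coin_likelihood, "HHTHH")
print("h_MAP =", h_map, "posteriors =", post)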
Relation to Concept Learning
Consider our usual concept learning task
• instance space X, hypothesis space H, training
examples D
• consider the FindS learning algorithm (outputs
most specific hypothesis from the version space
VSH,D)
Learning a Real Valued Function
Consider any real-valued target function f
Training examples (xi, di), where di is a noisy training value
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi
according to a Gaussian distribution with mean 0
Then the maximum likelihood hypothesis hML is the one that
minimizes the sum of squared errors:
hML = arg min_{h∈H} ∑_{i=1}^{m} (di − h(xi))²
Learning a Real Valued Function
hML = arg max_{h∈H} p(D|h)
    = arg max_{h∈H} ∏_{i=1}^{m} p(di|h)
    = arg max_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))
Maximizing the natural log and dropping terms that do not depend on h:
    = arg max_{h∈H} ∑_{i=1}^{m} −(di − h(xi))² / (2σ²)
    = arg min_{h∈H} ∑_{i=1}^{m} (di − h(xi))²
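A small numerical illustration of this result, assuming a linear hypothesis class and using NumPy for the fit (both choices are mine, not the slide's): minimizing the sum of squared errors recovers hML under Gaussian noise.

# d_i = f(x_i) + e_i with zero-mean Gaussian noise e_i, so h_ML is the
# least-squares fit. Here H is the set of linear functions h(x) = w1*x + w0.
import numpy as np

rng = np.random.default_rng(0)
m = 50
x = np.linspace(0.0, 1.0, m)
f = lambda t: 2.0 * t + 1.0                  # target function (for the demo)
d = f(x) + rng.normal(0.0, 0.3, size=m)      # noisy training values d_i

A = np.column_stack([x, np.ones(m)])         # design matrix for w1*x + w0
(w1, w0), *_ = np.linalg.lstsq(A, d, rcond=None)
sse = np.sum((d - (w1 * x + w0)) ** 2)       # sum of squared errors of h_ML
print(f"h_ML(x) ≈ {w1:.2f} x + {w0:.2f}, SSE = {sse:.2f}")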
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
hMDL = arg min_{h∈H} [LC1(h) + LC2(D|h)]
where LC(x) is the description length of x under
encoding C
Example:
• H = decision trees, D = training data labels
• LC1(h) is # bits to describe tree h
• LC2(D|h) is # bits to describe D given h
– Note LC2(D|h) = 0 if the examples are classified perfectly by h;
need only describe the exceptions
• Hence hMDL trades off tree size for training errors
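To make the tradeoff concrete, here is a minimal sketch that scores a few candidate trees by their own description length plus the cost of listing their exceptions. The encoding (a fixed bit count per tree, log2(m) bits per misclassified example) and the numbers are illustrative assumptions, not the slide's encoding.

# MDL-style selection: minimize L_C1(h) + L_C2(D|h), where D given h
# is described by listing the indices of the misclassified examples.
import math

m = 1000  # number of training examples
candidates = {
    # name: (L_C1(h) in bits, number of training errors)
    "small tree":  (60, 25),
    "medium tree": (150, 8),
    "large tree":  (900, 0),   # fits perfectly but is costly to describe
}

def total_description_length(bits_for_h, errors):
    return bits_for_h + errors * math.log2(m)   # L_C1(h) + L_C2(D|h)

scores = {name: total_description_length(b, e) for name, (b, e) in candidates.items()}
print(scores, "-> h_MDL =", min(scores, key=scores.get))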
Minimum Description Length Principle
hMAP = arg max_{h∈H} P(D|h) P(h)
     = arg max_{h∈H} [log2 P(D|h) + log2 P(h)]
     = arg min_{h∈H} [− log2 P(D|h) − log2 P(h)]
Interpreting − log2 P(h) as the length of the optimal (Shannon) code for h,
and − log2 P(D|h) as the length of the optimal code for D given h, hMAP is
exactly the hypothesis that minimizes LC1(h) + LC2(D|h) when C1 and C2 are
these optimal encodings.
Bayes Optimal Classifier
The most probable classification of a new instance combines the predictions
of all hypotheses, weighted by their posterior probabilities:
arg max_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D)
For example, if the weighted votes come out to ∑_{hi∈H} P(−|hi) P(hi|D) = .6
and ∑_{hi∈H} P(+|hi) P(hi|D) = .4, then
arg max_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D) = −
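A minimal sketch of this weighted vote, assuming three hypotheses with posteriors .4, .3, .3 and deterministic class predictions; these particular numbers are illustrative and happen to reproduce the .6 total weight for − above.

# Bayes optimal classification: weight each hypothesis's prediction
# by its posterior P(h_i|D) and return the class with the largest total.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i|D), illustrative
p_class_given_h = {                              # P(v_j|h_i), illustrative
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes):
    vote = {v: sum(p_class_given_h[h][v] * posterior[h] for h in posterior)
            for v in classes}
    return max(vote, key=vote.get), vote

label, vote = bayes_optimal(["+", "-"])
print(vote, "->", label)    # {'+': 0.4, '-': 0.6} -> '-'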
Naïve Bayes Classifier
Along with decision trees, neural networks, and nearest
neighbor, one of the most practical learning methods.
When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally
independent given classification
Successful applications:
• Diagnosis
• Classifying text documents
Naïve Bayes Classifier
Assume target function f: X→V, where each instance
x is described by attributes (a1, a2, …, an).
Most probable value of f(x) is:
vMAP = arg max_{vj∈V} P(vj | a1, a2, …, an)
     = arg max_{vj∈V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = arg max_{vj∈V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayes assumption:
P(a1, a2, …, an | vj) = ∏i P(ai|vj)
which gives
Naïve Bayes classifier: vNB = arg max_{vj∈V} P(vj) ∏i P(ai|vj)
Naïve Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  vNB = arg max_{vj∈V} P̂(vj) ∏_{ai∈x} P̂(ai|vj)
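A compact, runnable version of the learner and classifier above, using raw frequency counts for the estimates (the zero-count problem and its fix are discussed under "Subtleties" below). The data format, a list of (attribute-dict, label) pairs, is an assumption of this sketch.

# Naive Bayes with frequency-count estimates of P(v_j) and P(a_i|v_j).
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(lambda: defaultdict(Counter))  # [label][attr][value]
    for attrs, label in examples:
        for a, v in attrs.items():
            value_counts[label][a][v] += 1
    n = len(examples)
    p_label = {v: c / n for v, c in label_counts.items()}     # estimate of P(v_j)
    def p_attr(a, val, label):                                # estimate of P(a_i|v_j)
        return value_counts[label][a][val] / label_counts[label]
    return p_label, p_attr

def classify_new_instance(x, p_label, p_attr):
    def score(label):                       # P̂(v_j) ∏ P̂(a_i|v_j)
        s = p_label[label]
        for a, val in x.items():
            s *= p_attr(a, val, label)
        return s
    return max(p_label, key=score)

# Toy use:
train = [({"Color": "Blue", "Doors": "2"}, "+"),
         ({"Color": "Red",  "Doors": "4"}, "-"),
         ({"Color": "Blue", "Doors": "4"}, "+")]
p_label, p_attr = naive_bayes_learn(train)
print(classify_new_instance({"Color": "Blue", "Doors": "2"}, p_label, p_attr))  # +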
Naïve Bayes Example
Consider CoolCar again and new instance
(Color=Blue,Type=SUV,Doors=2,Tires=WhiteW)
Want to compute
vNB = arg max_{vj∈V} P(vj) ∏i P(ai|vj)
P(+)*P(Blue|+)*P(SUV|+)*P(2|+)*P(WhiteW|+)=
5/14 * 1/5 * 2/5 * 4/5 * 3/5 = 0.0137
P(-)*P(Blue|-)*P(SUV|-)*P(2|-)*P(WhiteW|-)=
9/14 * 3/9 * 4/9 * 3/9 * 3/9 = 0.0106
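Since 0.0137 > 0.0106, the classifier outputs vNB = +. The arithmetic is easy to check (the counts are taken from the slide's CoolCar table):

# Unnormalized Naive Bayes scores for the new CoolCar instance.
pos = (5/14) * (1/5) * (2/5) * (4/5) * (3/5)
neg = (9/14) * (3/9) * (4/9) * (3/9) * (3/9)
print(round(pos, 4), round(neg, 4))           # 0.0137 0.0106
print("v_NB =", "+" if pos > neg else "-")    # +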
Naïve Bayes Subtleties
1. Conditional independence assumption is often
violated
P(a1, a2, …, an | vj) = ∏i P(ai|vj)
• … but it works surprisingly well anyway. Note
that the estimated posteriors do not need to be
correct; you need only that
arg max_{vj∈V} P̂(vj) ∏i P̂(ai|vj) = arg max_{vj∈V} P(vj) P(a1, …, an | vj)
• see Domingos & Pazzani (1996) for analysis
• Naïve Bayes posteriors often unrealistically close
to 1 or 0
Naïve Bayes Subtleties
2. What if none of the training instances with target
value vj have attribute value ai? Then
P̂(ai|vj) = 0, and so
P̂(vj) ∏i P̂(ai|vj) = 0
Typical solution is a Bayesian estimate for P̂(ai|vj):
P̂(ai|vj) ← (nc + mp) / (n + m)
• n is number of training examples for which v=vj
• nc is number of examples for which v=vj and a=ai
• p is prior estimate for P̂(ai|vj)
• m is weight given to prior (i.e., number of
“virtual” examples)
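A one-function sketch of the m-estimate; the function name and the example numbers below are illustrative.

# m-estimate of P(a_i|v_j): (n_c + m*p) / (n + m). Smooths zero counts
# toward the prior estimate p, with m acting as a count of "virtual" examples.
def m_estimate(n_c, n, p, m):
    """n_c: examples with v=v_j and a=a_i; n: examples with v=v_j;
    p: prior estimate of P(a_i|v_j); m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

# An attribute value never observed with this class (n_c = 0), a uniform
# prior over 4 possible values (p = 1/4), and m = 4 virtual examples:
print(m_estimate(n_c=0, n=5, p=0.25, m=4))    # 0.111..., no longer 0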
Bayesian Networks
Interesting because
• Naïve Bayes assumption of conditional
independence is too restrictive
• But it is intractable without some such
assumptions…
• Bayesian belief networks describe conditional
independence among subsets of variables
• allows combining prior knowledge about
(in)dependence among variables with observed
training data
• (also called Bayes Nets)
Conditional Independence
Definition: X is conditionally independent of Y
given Z if the probability distribution governing X
is independent of the value of Y given the value of
Z; that is, if
(∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
more compactly we write
P(X|Y,Z) = P(X|Z)
Example: Thunder is conditionally independent of
Rain given Lightning
P(Thunder|Rain,Lightning)=P(Thunder|Lightning)
Naïve Bayes uses conditional independence to justify
P(X,Y|Z)=P(X|Y,Z)P(Y|Z)
=P(X|Z)P(Y|Z)
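The factorization can be checked numerically on a joint distribution that is constructed to satisfy the conditional independence; the probabilities below are made-up numbers for the Thunder/Rain/Lightning example.

# Verify P(Thunder, Rain | Lightning) = P(Thunder|Lightning) P(Rain|Lightning)
# on a joint built from made-up conditional probabilities.
import itertools

p_lightning = {True: 0.1, False: 0.9}
p_thunder_given_l = {True: 0.9, False: 0.05}   # P(Thunder=True | Lightning=l)
p_rain_given_l = {True: 0.8, False: 0.2}       # P(Rain=True | Lightning=l)

def joint(t, r, l):   # P(Thunder=t, Rain=r, Lightning=l)
    pt = p_thunder_given_l[l] if t else 1 - p_thunder_given_l[l]
    pr = p_rain_given_l[l] if r else 1 - p_rain_given_l[l]
    return pt * pr * p_lightning[l]

for l in (True, False):
    p_l = p_lightning[l]
    for t, r in itertools.product((True, False), repeat=2):
        lhs = joint(t, r, l) / p_l                                    # P(T,R|L)
        p_t = sum(joint(t, rr, l) for rr in (True, False)) / p_l      # P(T|L)
        p_r = sum(joint(tt, r, l) for tt in (True, False)) / p_l      # P(R|L)
        assert abs(lhs - p_t * p_r) < 1e-12
print("conditional independence factorization holds")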
Bayesian Network
[Figure: Bayesian network over the variables Storm, BusTourGroup, Thunder, and ForestFire]