Naïve Bayes
• Models are called naïve because they assume that features are statistically
independent, given the class.
• For each new sample, they provide the probability that the sample belongs to
each class.
Bayes Classifier
• A probabilistic framework for solving classification problems
P(A|B) = P(B|A) · P(A) / P(B)
• If a patient has a stiff neck, what's the probability he/she has meningitis?

  P(M|S) = P(S|M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
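As a quick sanity check, the calculation above can be reproduced in a few lines of Python (a minimal sketch; the variable names are ours, the numbers are the slide's):

```python
p_s_given_m = 0.5      # P(S | M): probability of stiff neck given meningitis
p_m = 1 / 50000        # P(M): prior probability of meningitis
p_s = 1 / 20           # P(S): prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' theorem
print(p_m_given_s)                      # 0.0002
```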
Using Bayes Theorem for Classification
P(Ck|X) = P(X|Ck) · P(Ck) / P(X)
Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Naïve Bayes Classifier: An Example
(using the buys_computer training data above)

• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357

• Compute P(X|Ci) for each class (a sketch follows below)
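To make the computation concrete, here is a minimal Python sketch (not part of the original slides) that estimates P(Ci) and P(X|Ci) directly from the table above, for the record X = (age <= 30, income = medium, student = yes, credit_rating = fair):

```python
from collections import Counter

# Each row: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")   # record to classify

class_counts = Counter(row[-1] for row in data)   # {'yes': 9, 'no': 5}
n = len(data)

scores = {}
for c, nc in class_counts.items():
    prior = nc / n                                # P(Ci)
    likelihood = 1.0
    for j, value in enumerate(x):
        # P(Aj = value | Ci): fraction of class-Ci rows with that attribute value
        n_jc = sum(1 for row in data if row[-1] == c and row[j] == value)
        likelihood *= n_jc / nc
    scores[c] = prior * likelihood                # P(X|Ci) * P(Ci)

print(scores)                        # yes ≈ 0.028, no ≈ 0.007
print(max(scores, key=scores.get))   # 'yes'
```

With these unsmoothed estimates, P(X|yes) · P(yes) ≈ 0.028 and P(X|no) · P(no) ≈ 0.007, so X is classified as buys_computer = 'yes'.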
• Example
  P(Marital Status = Single | Yes) = 2/3

Conditional Probabilities for Continuous Attributes
• Given a test record with taxable income equal to $120K, the class-conditional
probability can be computed as follows:

  P(Income = 120 | No) = 1 / (√(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

P(X | Yes) = P(Home Owner = No | Yes) × P(Marital Status = Married | Yes)
             × P(Annual Income = $120K | Yes)
           = 1 × 0 × 1.2 × 10⁻⁹ = 0

The posterior probability for class No is P(No | X) = 0.7 × 0.0024 = 0.0016.

If a zero conditional probability appears in both products, e.g.

  P(X | No)  = 3/7 × 0 × 0.0072 = 0
  P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

then Naïve Bayes will not be able to classify X as Yes or No!
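The Gaussian density used for the income attribute can be checked with a short Python sketch (the mean 110 and variance 2975 are the class-No estimates quoted above, with 54.54 = √2975; the helper name is ours):

```python
import math

def gaussian_likelihood(x, mean, var):
    # Normal density: 1/(sqrt(2*pi*var)) * exp(-(x - mean)^2 / (2*var))
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-(x - mean) ** 2 / (2 * var))

print(gaussian_likelihood(120, 110, 2975))   # ≈ 0.0072

# A single zero factor wipes out the whole product, whatever the other terms are:
p_x_given_yes = 1.0 * 0.0 * 1.2e-9           # = 0.0, so P(Yes|X) = 0 regardless of the prior
```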
• Probability estimation:

  Original:     P(Ai | C) = Nic / Nc

  Laplace:      P(Ai | C) = (Nic + 1) / (Nc + c)

  m-estimate:   P(Ai | C) = (Nic + m·p) / (Nc + m)

  c:   number of classes
  p:   prior probability of the class (p = 1/k, for k possible values of Ai)
  m:   parameter
  Nc:  number of instances in the class
  Nic: number of instances having attribute value Ai in class c
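The three estimators translate directly into code; here is a minimal sketch (function names are ours):

```python
def original_estimate(n_ic, n_c):
    # Maximum-likelihood estimate: P(Ai|C) = Nic / Nc
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # Laplace correction: P(Ai|C) = (Nic + 1) / (Nc + c)
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # m-estimate: P(Ai|C) = (Nic + m*p) / (Nc + m)
    return (n_ic + m * p) / (n_c + m)
```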
Zero conditional probability
• Example: P(Marital Status = Married | Yes) = 0
  – Add m "virtual" examples (m: tunable, but up to 1% of the number of training examples)
  – The "Marital Status" feature can take only 3 values, so p = 1/3.
  – Re-estimate P(Marital Status = Married | Yes) with the m-estimate:

    P(Marital Status = Married | Yes) = (0 + 3 × 1/3) / (3 + 3) = 1/6
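The arithmetic can be verified with the m-estimate helper sketched above, or directly (Nic = 0, Nc = 3, m = 3, p = 1/3, as in the example):

```python
print((0 + 3 * (1 / 3)) / (3 + 3))   # 0.1666... = 1/6
```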
Zero conditional probability
P(X | No)  = P(Home Owner = No | No) × P(Marital Status = Married | No)
             × P(Annual Income = $120K | No)
           = 6/10 × 6/10 × 0.0072 = 0.0026

P(X | Yes) = P(Home Owner = No | Yes) × P(Marital Status = Married | Yes)
             × P(Annual Income = $120K | Yes)
           = 4/6 × 1/6 × 1.2 × 10⁻⁹ = 1.3 × 10⁻¹⁰
Zero conditional probability
The posterior probability for class No is  P(No | X)  = 7/10 × 0.0026 = 0.0018
The posterior probability for class Yes is P(Yes | X) = 3/10 × 1.3 × 10⁻¹⁰ = 4.0 × 10⁻¹¹
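Since both classes now receive a non-zero score, the comparison can be reproduced with a couple of lines (a sketch using the slide's numbers; priors 7/10 and 3/10):

```python
p_no  = 7 / 10 * (6 / 10) * (6 / 10) * 0.0072    # P(No)  * P(X|No)  ≈ 0.0018
p_yes = 3 / 10 * (4 / 6) * (1 / 6) * 1.2e-9      # P(Yes) * P(X|Yes) ≈ 4.0e-11
print("No" if p_no > p_yes else "Yes")           # -> "No"
```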
• Disadvantages
  – The assumption of independent features. In practice, it is almost impossible
    to obtain a set of predictors that are entirely independent.
  – If a particular attribute value never occurs with a class in the training
    data, its conditional probability is zero, which forces the posterior
    probability for that class to zero. In this case, the model is unable to
    make predictions. This problem is known as the Zero Probability/Frequency
    Problem.