
Naïve Bayes

What are Naïve Bayes Models?

• Naïve Bayes models use conditional probabilities to predict the label value.

• The models are "naïve" because they assume that the features are statistically independent, given the class.

• Advantages of Naïve Bayes models:
  • Work well with minimal training data
  • Computationally efficient and highly scalable
Probability
• Probability is used to describe or measure the uncertainty associated with the outcome of a future experiment.
  • It will rain tomorrow
  • The number on the roll of a die
  • The outcome of an election

• It varies between 0 and 1: 0 means the event will not occur, and 1 means the event is certain to occur.
Sample Space

• Sample space: the set of all possible outcomes of an experiment.

• If the experiment is a roll of a six-sided die, the natural sample space is {1, 2, 3, 4, 5, 6}.

• If the experiment consists of tossing a coin three times, the sample space is {hhh, hht, hth, htt, thh, tht, tth, ttt}.
Event

• An event is a subset of the sample space.

• A = {the outcome of the die is even} = {2, 4, 6}
• B = {exactly two tosses come out tails} = {htt, tht, tth}
What is Bayesian Classification?
• Bayesian classifiers are statistical classifiers

• For each new sample they provide a probability that the sample
belongs to a class (for all classes)
Bayes Classifier
• A probabilistic framework for solving classification problems

• Conditional Probability: the probability that a random variable will take on a particular value given that the outcome of another random variable is known:

    P(Y | X) = P(X, Y) / P(X)
    P(X | Y) = P(X, Y) / P(Y)

• Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred:

    P(Y | X) = P(X | Y) P(Y) / P(X)
Interpreting Bayes’ Theorem
What does Bayes' Theorem mean?

    p(A | B) = p(B | A) p(A) / p(B)

    p(event | evidence) = (likelihood × prior probability) / p(evidence)
Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability he/she has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
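As a quick sanity check, this minimal Python sketch reproduces the calculation above (variable names are only illustrative):

```python
# Bayes' theorem: P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(f"P(meningitis | stiff neck) = {p_m_given_s:.4f}")  # 0.0002
```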
Using Bayes Theorem for Classification

• Consider each attribute and the class label as random variables

• Given a record with attributes (X1, X2, …, Xd):
  • The goal is to predict the class Y
  • Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
Using Bayes Theorem for Classification
• Approach:
  • Compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes' theorem:

        P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)

  • Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)
  • This is equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y)?


Bayesian Classification Method

• There are two implementations of the Bayesian classification method:
  • Naïve Bayes Classifier
    • Multinomial Naive Bayes
    • Bernoulli Naive Bayes: binary features (e.g., word presence/absence)
    • Gaussian Naive Bayes: continuous/real-valued features (statistics computed for each class: the mean and standard deviation of each feature)
  • Bayesian Belief Network
The Naïve Bayes Classifier Model
• Starting with Bayes' rule, how do we find the probability of a class Ck given evidence X?

    p(Ck | X) = p(X | Ck) p(Ck) / p(X)

• where the evidence X is composed of the features (x1, x2, x3, x4, …, xn)
Naïve Bayes Classifier

• Assume independence among the attributes Xi when the class is given:

    P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

• Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data

• A new point is classified as Yj if P(Yj) ∏i P(Xi | Yj) is maximal:

    argmax_k [ p(Ck) ∏i p(xi | Ck) ]
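The decision rule above can be written as a short function. A minimal sketch, assuming the class priors and per-attribute conditional probabilities have already been estimated and are passed in as plain dictionaries (all names here are illustrative):

```python
def predict(x, priors, cond_probs):
    """Return the class k that maximizes p(C_k) * prod_i p(x_i | C_k).

    x          : dict mapping attribute name -> observed value
    priors     : dict mapping class label -> P(C_k)
    cond_probs : dict mapping (class, attribute, value) -> P(x_i | C_k)
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            # Naive independence assumption: multiply per-attribute conditionals.
            score *= cond_probs[(c, attr, value)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

In practice one usually sums log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow when there are many attributes.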


Naïve Bayes Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31…40   high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31…40   low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31…40   medium   no       excellent      yes
  31…40   high     yes      fair           yes
  >40     medium   no       excellent      no
Naïve Bayes Classifier: An Example

• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357

• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes")           = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no")            = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes")      = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no")       = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes")        = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no")         = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

• X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

  P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.044 × 0.643 = 0.028
  P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.019 × 0.357 = 0.007

• Therefore, X belongs to the class buys_computer = "yes".
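The whole worked example can be reproduced with a few lines of Python. This is only an illustrative sketch of the calculation above, with the dataset typed in directly (attribute order: age, income, student, credit_rating, class):

```python
from collections import Counter

# Training data: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")          # record to classify
class_counts = Counter(row[-1] for row in data)

scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                    # prior P(C_i)
    for i, value in enumerate(x):              # multiply P(x_i | C_i) for each attribute
        n_ic = sum(1 for row in data if row[i] == value and row[-1] == c)
        score *= n_ic / n_c
    scores[c] = score

print(scores)                                  # {'no': ~0.007, 'yes': ~0.028}
print("predicted class:", max(scores, key=scores.get))   # yes
```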
Naïve Bayes Classifier (Dataset)

Class prior: P(Y) = Nc / N
  e.g., P(No) = 7/10, P(Yes) = 3/10

  Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
  1    Yes         Single          125K           No
  2    No          Married         100K           No
  3    No          Single          70K            No
  4    Yes         Married         120K           No
  5    No          Divorced        95K            Yes
  6    No          Married         60K            No
  7    Yes         Divorced        220K           No
  8    No          Single          85K            Yes
  9    No          Married         75K            No
  10   No          Single          90K            Yes
Conditional Probabilities for Categorical Attributes

• For a categorical attribute Xi, the conditional probability P(Xi = xi | Y = y) is estimated as the fraction of training instances in class y that take on the particular attribute value xi.

• Example:
    P(Marital Status = Single | Yes) = 2/3
    P(Home Owner = Yes | No) = 3/7
Conditional Probabilities for Categorical Attributes
    P(Home Owner = Yes | No)  = 3/7
    P(Home Owner = No | No)   = 4/7
    P(Home Owner = Yes | Yes) = 0
    P(Home Owner = No | Yes)  = 1

    P(Marital Status = Single | No)    = 2/7
    P(Marital Status = Divorced | No)  = 1/7
    P(Marital Status = Married | No)   = 4/7
    P(Marital Status = Single | Yes)   = 2/3
    P(Marital Status = Divorced | Yes) = 1/3
    P(Marital Status = Married | Yes)  = 0
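These fractions can be read straight off the table. A small sketch of how they might be computed (the records list re-types the dataset from the previous slide; the helper name is illustrative):

```python
# (Home Owner, Marital Status, Annual Income, Defaulted Borrower)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married",  100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",    90, "Yes"),
]

def cond_prob(attr_index, value, cls):
    """P(X_attr = value | Y = cls) as a simple fraction of the class's records."""
    in_class = [r for r in records if r[-1] == cls]
    return sum(1 for r in in_class if r[attr_index] == value) / len(in_class)

print(cond_prob(1, "Single", "Yes"))   # 2/3 ≈ 0.667
print(cond_prob(0, "Yes", "No"))       # 3/7 ≈ 0.429
```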


Conditional Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for a continuous attribute:

1. Discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval.

2. Choose a Gaussian distribution to represent the class-conditional probability for the continuous attribute:

    P(Xi = xi | Y = yj) = (1 / (√(2π) · σij)) · exp( −(xi − μij)² / (2σij²) )
Conditional Probabilities for Continuous Attributes
• μij is the sample mean x̄ of attribute Xi over all training records that belong to class yj
• σij² is the sample variance (s²) of those training records

• Example: consider the attribute Annual Income for class No (values 125K, 100K, 70K, 120K, 60K, 220K, 75K):

    x̄ = (125 + 100 + 70 + 120 + 60 + 220 + 75) / 7 = 110

    s² = [(125 − 110)² + (100 − 110)² + (70 − 110)² + (120 − 110)² + (60 − 110)² + (220 − 110)² + (75 − 110)²] / 6 = 2975

    s = 54.54
Conditional Probabilities for Continuous Attributes
• For Annual Income:
  • If Class = No:  sample mean = 110, sample variance = 2975
  • If Class = Yes: sample mean = 90, sample variance = 25

• Given a test record with Annual Income equal to $120K, the class-conditional probability can be computed as follows:

    P(Income = 120 | No) = (1 / (√(2π) · 54.54)) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072

• For the test record X = (Home Owner = No, Marital Status = Married, Income = $120K), compute the posterior probabilities P(No|X) and P(Yes|X).

• Prior probabilities of each class: P(Yes) = 0.3 and P(No) = 0.7
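As a quick check, the Gaussian class-conditional above can be computed with a couple of lines of Python (a minimal sketch; the function name is illustrative):

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Class-conditional density P(X_i = x | Y = y_j) under a Gaussian model."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Annual Income = 120K for class No (sample mean = 110, sample variance = 2975):
print(gaussian_likelihood(120, 110, 2975))   # ≈ 0.0072
```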
Conditional Probabilities for Continuous Attributes

    P(X | No) = P(Home Owner = No | No) × P(Marital Status = Married | No) × P(Annual Income = $120K | No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

    P(X | Yes) = P(Home Owner = No | Yes) × P(Marital Status = Married | Yes) × P(Annual Income = $120K | Yes)
               = 1 × 0 × 1.2 × 10⁻⁹ = 0
Conditional Probabilities for Continuous Attributes
The posterior probability for class No is  P(No | X) = 0.7 × 0.0024 = 0.0016

The posterior probability for class Yes is P(Yes | X) = 0.3 × 0 = 0

Since P(No | X) > P(Yes | X), the record is classified as No.
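Putting the categorical and continuous pieces together, here is a small self-contained sketch that reproduces this classification (the probabilities and statistics are copied from the slides above; the dictionary layout is just one possible choice):

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

priors = {"No": 0.7, "Yes": 0.3}

# P(attribute = value | class), read off the earlier slides.
cond = {
    ("Home Owner", "No"):          {"No": 4 / 7, "Yes": 1.0},
    ("Marital Status", "Married"): {"No": 4 / 7, "Yes": 0.0},
}
income_stats = {"No": (110, 2975), "Yes": (90, 25)}   # (mean, variance) per class

# X = (Home Owner = No, Marital Status = Married, Income = 120K)
for c in priors:
    mean, var = income_stats[c]
    likelihood = (cond[("Home Owner", "No")][c]
                  * cond[("Marital Status", "Married")][c]
                  * gaussian(120, mean, var))
    print(c, priors[c] * likelihood)   # No ≈ 0.0016, Yes = 0.0  ->  classified as No
```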


Issues with Naïve Bayes Classifier
Test record X = (Home Owner = Yes, Marital Status = Divorced, Income = $120K); compute the posterior probabilities P(No|X) and P(Yes|X).

    P(X | No)  = 3/7 × 1/7 × 0.0072 ≈ 0.0006
    P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Because P(Home Owner = Yes | Yes) = 0, Naïve Bayes will never classify X as Yes, no matter how strong the evidence from the other attributes is!

• If one of the conditional probabilities is zero, then the entire expression becomes zero
• We need to use other estimates of the conditional probabilities than simple fractions
Issues with Naïve Bayes Classifier

• Probability estimation:

    Original:    P(Ai | C) = Nic / Nc
    Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
    m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  where
    c:   number of possible values of attribute Ai
    p:   prior estimate of the probability (p = 1/k for k possible values of Ai)
    m:   parameter (number of "virtual" examples)
    Nc:  number of instances in the class
    Nic: number of instances having attribute value Ai in class c
Zero conditional probability
• Example: P(Marital Status = Married | Yes) = 0
  – Add m "virtual" examples (m is tunable, but typically up to 1% of the number of training examples)
  – The Marital Status feature can take only 3 values, so p = 1/3
  – Re-estimate P(Marital Status = Married | Yes) with the m-estimate (m = 3):

    P(Marital Status = Married | Yes) = (0 + 3 × 1/3) / (3 + 3) = 1/6
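A minimal sketch of the m-estimate as a helper function (the usage line reproduces the re-estimated P(Marital Status = Married | Yes) above, with m = 3 and p = 1/3 as chosen on this slide):

```python
def m_estimate(n_ic, n_c, m, p):
    """m-estimate of P(Ai | C): (N_ic + m*p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

# P(Marital Status = Married | Yes): 0 of the 3 "Yes" records are Married.
print(m_estimate(n_ic=0, n_c=3, m=3, p=1/3))   # 1/6 ≈ 0.167
```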
Zero conditional probability

    P(X | No) = P(Home Owner = No | No) × P(Marital Status = Married | No) × P(Annual Income = $120K | No)
              = 6/10 × 6/10 × 0.0072 = 0.0026

    P(X | Yes) = P(Home Owner = No | Yes) × P(Marital Status = Married | Yes) × P(Annual Income = $120K | Yes)
               = 4/6 × 1/6 × 1.2 × 10⁻⁹ = 1.3 × 10⁻¹⁰
Zero conditional probability

The posterior probability for class No is  P(No | X) = 7/10 × 0.0026 = 0.0018

The posterior probability for class Yes is P(Yes | X) = 3/10 × 1.3 × 10⁻¹⁰ = 4.0 × 10⁻¹¹

Since P(No | X) > P(Yes | X), the record is classified as No.


Naïve Bayes Classifiers
• Advantages
  • It is not only a simple approach but also a fast and accurate method for prediction.
  • Naive Bayes has very low computational cost.
  • It can work efficiently on large datasets.
  • It performs well with categorical input variables compared to numerical ones.
  • It can be used for multi-class prediction problems.
  • It also performs well on text analytics problems.
  • When the independence assumption holds, a Naive Bayes classifier performs better than other comparable models.
Naïve Bayes Classifiers

• Disadvantages
  • The assumption of independent features: in practice, it is almost impossible for the model to get a set of predictors that are entirely independent.
  • If a particular attribute value never occurs with a class in the training data, the estimated conditional probability is zero, which makes the whole posterior probability zero. In this case, the model is unable to make a prediction. This problem is known as the Zero Probability/Frequency Problem.
