3 - Classification - Naive Bayes

The document provides an introduction to Naive Bayes classifiers, including background on probability basics, the Naive Bayes principle, and algorithms for both discrete and continuous data. Examples are also given to illustrate key concepts such as Bayes' theorem and calculating class-conditional probabilities under the Naive Bayes assumption.


Naïve Bayes

Tran Thi Oanh


Outline
➢Introduction
➢Background and Probability Basics
➢Naïve Bayes – a generative model
oPrinciple and Algorithms (discrete vs. continuous)
oExamples
➢Zero Conditional Probability and Treatment
➢Summary

2
Naïve Bayes Classifier
➢Where to use Naïve Bayes Classifier?
o SPAM filtering
o Performing sentiment analysis of the audience on social media
o News text classification
o Etc.
3
What is Naïve Bayes?
➢Named after Thomas Bayes, who first formulated the idea in 18th-century Western literature.
➢Works on the principle of conditional probability as given by Bayes' theorem.
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(x)
– Conditional probability: P( x1 | x2 ), P(x2 | x1 )
– Joint probability: x = ( x1 , x2 ), P(x) = P(x1 ,x2 )
– Relationship: P(x1 ,x2 ) = P( x2 | x1 )P( x1 ) = P( x1 | x2 )P( x2 )
– Independence:
P( x2 | x1 ) = P( x2 ), P( x1 | x2 ) = P( x1 ), P(x1 ,x2 ) = P( x1 )P( x2 )
• Bayesian Rule
  P(c|x) = P(x|c) P(c) / P(x)
  i.e.  Posterior = (Likelihood × Prior) / Evidence
  (P(c|x): the discriminative direction; P(x|c): the generative direction)
5
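As a quick sanity check of the rule above, here is a minimal Python sketch that turns a prior and a likelihood into a posterior. The class names and numbers are illustrative only, not taken from the slides.

```python
# Minimal sketch of Bayes' rule: posterior = likelihood * prior / evidence.
# The numbers below are made up for illustration.

prior = {"c1": 0.6, "c2": 0.4}            # P(c)
likelihood = {"c1": 0.2, "c2": 0.7}       # P(x | c) for one observed x

# Evidence P(x) is the same for every class: sum over c of P(x | c) * P(c)
evidence = sum(likelihood[c] * prior[c] for c in prior)

posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # -> roughly {'c1': 0.3, 'c2': 0.7}
```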
Bayes’ Theorem: Example 1
John flies frequently and likes to upgrade his seat to first class. He has determined that if he checks
in for his flight at least two hours early, the probability that he will get an upgrade is 0.75;
otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in
at least two hours before his flight only 40% of the time.
Suppose John did not receive an upgrade on his most recent attempt, what is the probability that
he did not arrive two hours early?
Let
  C = {John arrived at least two hours early}
  A = {John received an upgrade}
such that
  ¬C = {John did not arrive two hours early}
  ¬A = {John did not receive an upgrade}
Given: P(A|C) = 0.75, P(A|¬C) = 0.35, P(C) = 0.40
Find: P(¬C|¬A) = ?
6
Bayes’ Theorem: Example 1

The question above requires that we compute the probability P(¬C|¬A).


By directly applying Bayes’ Theorem, we can mathematically formulate the question as:

P(¬C|¬A) = P(¬A|¬C) · P(¬C) / P(¬A)

The rest of the problem is simply figuring out the probability scores of the terms on the right-hand
side.

7
Bayes’ Theorem: Example 1
Start by figuring out the simplest terms based on the available information:
➢ Since John checks in at least two hours early 40% of the time, we know that
  P(C) = 0.4
  This means that the probability of not checking in at least two hours early is:
  P(¬C) = 1 − P(C) = 1 − 0.4 = 0.6
➢ The story also tells us that the probability of John receiving an upgrade given that he checked in
  early is 0.75, i.e. P(A|C) = 0.75
➢ Next, we were told that the probability of John receiving an upgrade given that he did not check
  in early is 0.35, i.e. P(A|¬C) = 0.35, which allows us to compute the probability that he did not
  receive an upgrade under the same circumstance as
  P(¬A|¬C) = 1 − P(A|¬C) = 1 − 0.35 = 0.65
8
Bayes’ Theorem: Example 1
We were not told of the probability of John receiving an upgrade, P(A). Fortunately, using all the
terms figured out earlier, this probability can be calculated as follows:

P(A) = P(A ∩ C) + P(A ∩ ¬C)
     = P(C) · P(A|C) + P(¬C) · P(A|¬C)
     = 0.4 × 0.75 + 0.6 × 0.35
     = 0.51

Since P(A) = 0.51, then P(¬A) = 1 − P(A) = 0.49
9
Bayes’ Theorem: Example 1
Finally, using Bayes' Theorem, we can compute the probability P(¬C|¬A) as
follows:

P(¬C|¬A) = P(¬A|¬C) · P(¬C) / P(¬A) = (0.65 × 0.6) / 0.49 ≈ 0.796

Answer: the probability that John did not arrive two hours early, given that he did not receive an
upgrade, is approximately 0.796.
10
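The whole calculation can be reproduced in a few lines of Python, using only the probabilities stated in the example:

```python
# Bayes' theorem check for the upgrade example.
p_C = 0.40             # P(C): checks in at least two hours early
p_notC = 1 - p_C       # P(~C) = 0.60
p_A_given_C = 0.75     # P(A | C): upgrade if early
p_A_given_notC = 0.35  # P(A | ~C): upgrade if not early

# Total probability of an upgrade, then of no upgrade
p_A = p_C * p_A_given_C + p_notC * p_A_given_notC   # 0.51
p_notA = 1 - p_A                                    # 0.49

# Bayes' theorem: P(~C | ~A) = P(~A | ~C) * P(~C) / P(~A)
p_notA_given_notC = 1 - p_A_given_notC              # 0.65
p_notC_given_notA = p_notA_given_notC * p_notC / p_notA
print(round(p_notC_given_notA, 3))                  # -> ~0.796
```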
Probabilistic Classification Principle
• Maximum A Posterior (MAP) classification rule
– For an input x, find the largest of the L probabilities P(c1|x), ..., P(cL|x)
  output by a discriminative probabilistic classifier.
– Assign x to label c* if P(c*|x) is the largest.
• Generative classification with the MAP rule
– Apply Bayesian rule to convert them into posterior probabilities
  P(ci|x) = P(x|ci) P(ci) / P(x) ∝ P(x|ci) P(ci)   for i = 1, 2, …, L
  (P(x) is a common factor for all L probabilities)
– Then apply the MAP rule to assign a label
11
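A minimal sketch of the MAP rule, with made-up numbers for three hypothetical classes c1, c2, c3 (not from the slides). The evidence P(x) is dropped because it is the same for every class:

```python
# MAP rule sketch: pick the class with the largest unnormalised posterior
# P(x | c) * P(c); the common evidence term P(x) can be ignored.
prior = {"c1": 0.5, "c2": 0.3, "c3": 0.2}            # P(c), illustrative
likelihood_x = {"c1": 0.10, "c2": 0.40, "c3": 0.05}  # P(x | c) for the input x

scores = {c: likelihood_x[c] * prior[c] for c in prior}
c_star = max(scores, key=scores.get)
print(c_star, scores)   # -> 'c2' wins: 0.12 > 0.05 > 0.01
```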
Naïve Bayes
• Bayes classification
  P(c|x) ∝ P(x|c) P(c) = P(x1, …, xn | c) P(c)   for c = c1, …, cL.
  Difficulty: learning the joint probability P(x1, …, xn | c) is infeasible!
• Naïve Bayes classification
  – Assume all input features are class conditionally independent!
    P(x1, x2, …, xn | c) = P(x1 | x2, …, xn, c) P(x2, …, xn | c)
                         = P(x1 | c) P(x2, …, xn | c)      (applying the independence assumption)
                         = P(x1 | c) P(x2 | c) ⋯ P(xn | c)
  – Apply the MAP classification rule: assign x' = (a1, a2, …, an) to c* if
    [P(a1|c*) ⋯ P(an|c*)] P(c*) > [P(a1|c) ⋯ P(an|c)] P(c),   for c ≠ c*, c = c1, …, cL
    (the left side estimates P(a1, …, an | c*); the right side estimates P(a1, …, an | c))
12
Naïve Bayes

• Algorithm: Discrete-valued Features (given a training set S)
  – Learning Phase:
    For each target value ci (ci = c1, …, cL)
      P̂(ci) ← estimate P(ci) with examples in S;
      For every feature value xjk of each feature xj (j = 1, …, F; k = 1, …, Nj)
        P̂(xj = xjk | ci) ← estimate P(xjk | ci) with examples in S;
  – Test Phase: given an unknown instance x' = (a1, …, an),
    assign the label c* to x' if
    [P̂(a1|c*) ⋯ P̂(an|c*)] P̂(c*) > [P̂(a1|ci) ⋯ P̂(an|ci)] P̂(ci),   for ci ≠ c*, ci = c1, …, cL

13
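The discrete learning and test phases above can be sketched in Python as follows. This is only an illustration: the toy weather-style dataset and the helper names (train_naive_bayes, classify) are invented for this sketch, and no smoothing is applied yet (see the zero-conditional-probability issue listed in the outline).

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Learning phase: estimate P(c) and P(x_j = v | c) from a training set S.
    `examples` is a list of (feature_tuple, label) pairs with discrete features."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(int)   # (feature index, value, label) -> count
    for features, label in examples:
        for j, value in enumerate(features):
            feature_counts[(j, value, label)] += 1

    n = len(examples)
    priors = {c: class_counts[c] / n for c in class_counts}

    def cond_prob(j, value, c):
        # P(x_j = value | c); 0 if the value never co-occurs with c (no smoothing yet)
        return feature_counts[(j, value, c)] / class_counts[c]

    return priors, cond_prob

def classify(priors, cond_prob, x):
    """Test phase: MAP rule -- pick the class maximising P(a1|c) ... P(an|c) P(c)."""
    def score(c):
        s = priors[c]
        for j, value in enumerate(x):
            s *= cond_prob(j, value, c)
        return s
    return max(priors, key=score)

# Toy usage with hypothetical data: features = (Outlook, Windy), label = Play
data = [(("sunny", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "yes"), "yes"), (("rain", "no"), "no")]
priors, cond_prob = train_naive_bayes(data)
print(classify(priors, cond_prob, ("sunny", "yes")))   # -> 'yes'
```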
Naïve Bayes
• Algorithm: Continuous-valued Features
  – A continuous-valued feature can take infinitely many values, so its conditional
    probability cannot be tabulated; it is often modelled with the normal distribution:
      P̂(xj | ci) = 1 / (√(2π) σji) · exp( −(xj − μji)² / (2σji²) )
      μji: mean (average) of feature values xj of examples for which c = ci
      σji: standard deviation of feature values xj of examples for which c = ci
  – Learning Phase: for X = (X1, …, XF), C = c1, …, cL
    Output: F × L normal distributions and P(C = ci), i = 1, …, L
  – Test Phase: given an unknown instance X' = (a1, …, an)
    • Instead of looking up tables, calculate conditional probabilities with the
      normal distributions obtained in the learning phase
    • Apply the MAP rule to assign a label (the same as done for the discrete case)
14
Naïve Bayes
• Example: Continuous-valued Features
  – Temperature is naturally a continuous value.
    Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
    No: 27.3, 30.1, 17.4, 29.5, 15.1
  – Estimate the mean and variance for each class:
      μ = (1/N) Σ xn,   σ² = (1/N) Σ (xn − μ)²
      μYes = 21.64, σYes = 2.35;   μNo = 23.88, σNo = 7.09
  – Learning Phase: output two Gaussian models for P(temp | C)
      P̂(x | Yes) = 1 / (2.35 √(2π)) · exp( −(x − 21.64)² / (2 × 2.35²) ) = 1 / (2.35 √(2π)) · exp( −(x − 21.64)² / 11.09 )
      P̂(x | No)  = 1 / (7.09 √(2π)) · exp( −(x − 23.88)² / (2 × 7.09²) )
15
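A small Python sketch of this example. It assumes the slide's σ values are sample standard deviations (the n−1 estimator used by statistics.stdev reproduces 2.35 and 7.09); the test temperature 22.0 is an arbitrary illustrative value, and class priors would still be needed for a full MAP decision.

```python
import math
from statistics import mean, stdev

# Temperature readings from the slide, grouped by class
temps_yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
temps_no = [27.3, 30.1, 17.4, 29.5, 15.1]

def fit_gaussian(values):
    # Learning phase: one mean and standard deviation per (feature, class) pair;
    # stdev uses the n-1 denominator, which reproduces the slide's 2.35 / 7.09
    return mean(values), stdev(values)

def gaussian_pdf(x, mu, sigma):
    # Normal density used as the class-conditional likelihood P(x | c)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu_yes, sd_yes = fit_gaussian(temps_yes)   # ~21.64, ~2.35
mu_no, sd_no = fit_gaussian(temps_no)      # ~23.88, ~7.09

# Test phase for a new temperature reading, e.g. x = 22.0 (illustrative value)
x = 22.0
print(gaussian_pdf(x, mu_yes, sd_yes))   # P(temp = 22.0 | Yes)
print(gaussian_pdf(x, mu_no, sd_no))     # P(temp = 22.0 | No)
```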
Generalisation of the Bayes’ Theorem
To derive the Naïve Bayes model, Bayes' Theorem must first be generalised from a single input
variable to a set of input variables (the attributes) and an output (class) variable.

The generalisation of Bayes' Theorem:
  P(ci | x1, …, xn) = P(x1, …, xn | ci) · P(ci) / P(x1, …, xn)

Assuming that we have a dataset that looks like the table shown on the right, Bayes' Theorem
assigns an appropriate class label ci to each object (tuple) in the dataset that has multiple
attributes x1, …, xn.
18
Where to Use Naïve
Bayes Classifier?
19
https://www.youtube.com/watch?v=l3dZ6ZNFjo0
Shopping Example –
Problem Statement

To predict whether a person will purchase a product for a specific combination of Day,
Discount, and Free Delivery, using the Naïve Bayes Classifier.
20

https://www.youtube.com/watch?v=l3dZ6ZNFjo0
Shopping Example –
Dataset
➢There are 30 observations (records) in total.
➢We have three predictors (Day, Discount, and Free Delivery) and one target (Purchase).
➢In the big data era, we are not looking at just three predictors anymore; there could be thirty
  or even more columns in the data, with millions of records.
21
Shopping Example –
Frequency Table
➢Based on the dataset containing the three inputs Day, Discount, and Free Delivery, we will
  populate frequency tables for each attribute.
22
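As an illustration of what "populating a frequency table" means, here is a tiny Python sketch. The three rows below are hypothetical placeholders, since the actual 30-row table lives in the slide image:

```python
from collections import Counter

# Hypothetical rows in (Day, Discount, Free Delivery, Purchase) form --
# stand-ins for the real 30-record table shown on the slide.
rows = [("Weekday", "Yes", "Yes", "Buy"),
        ("Holiday", "No", "No", "No Buy"),
        ("Weekend", "Yes", "Yes", "Buy")]

# Frequency table for one attribute: counts of (Day value, Purchase value) pairs
day_table = Counter((day, purchase) for day, _, _, purchase in rows)
print(day_table)   # e.g. Counter({('Weekday', 'Buy'): 1, ('Holiday', 'No Buy'): 1, ...})
```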
Shopping Example –
Frequency Table (2)
➢For the Bayes' Theorem, let the event "Buy" be "A" and the independent variables "Discount,"
  "Free Delivery," and "Day" be "B."
23
Shopping Example –
Likelihood Table
➢ Let's calculate the likelihood table for one of the variables, Day, which includes "Weekday,"
  "Weekend," and "Holiday."
➢ Based on this likelihood table, we will calculate the conditional probabilities as shown below.

P(B) = P(Weekday) = 11/30 ≈ 0.37
P(A) = P(No Buy) = 6/30 = 0.2,   P(C) = P(Buy) = 24/30 = 0.8
P(B|A) = P(Weekday | No Buy) = 2/6 ≈ 0.33
P(B|C) = P(Weekday | Buy) = 9/24 ≈ 0.38

P(A|B) = P(No Buy | Weekday) = P(Weekday | No Buy) × P(No Buy) / P(Weekday)
       = 0.33 × 0.2 / 0.37 ≈ 0.18
P(C|B) = P(Buy | Weekday) = P(Weekday | Buy) × P(Buy) / P(Weekday)
       = 0.38 × 0.8 / 0.37 ≈ 0.82

As P(Buy | Weekday) is greater than P(No Buy | Weekday), we can conclude that a customer will
most likely buy the product on a Weekday.
24
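The two conditional probabilities above can be reproduced directly from the counts in the likelihood table (11 Weekday rows out of 30; 9 of the 24 Buy rows and 2 of the 6 No Buy rows fall on a Weekday):

```python
# Counts read from the Day frequency/likelihood table on the slide
n_total = 30
n_weekday = 11
n_buy, n_no_buy = 24, 6
n_weekday_buy, n_weekday_no_buy = 9, 2

p_weekday = n_weekday / n_total                        # P(Weekday) ~ 0.37
p_buy, p_no_buy = n_buy / n_total, n_no_buy / n_total  # P(Buy) = 0.8, P(No Buy) = 0.2

# Bayes' theorem with the likelihoods P(Weekday | Buy) and P(Weekday | No Buy)
p_buy_given_weekday = (n_weekday_buy / n_buy) * p_buy / p_weekday              # ~0.82
p_no_buy_given_weekday = (n_weekday_no_buy / n_no_buy) * p_no_buy / p_weekday  # ~0.18
print(p_buy_given_weekday, p_no_buy_given_weekday)
```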
Shopping Example –
Likelihood Table (2)
➢Now we know how to calculate the likelihood table, and thus we can do the same for the
  remaining tables.
➢Let's use the three likelihood tables to calculate whether a customer will purchase a product
  for a specific combination of Day, Discount, and Free Delivery.
25
Shopping Example –
Naïve Bayes Classifier
➢ Let's take a combination of these factors:
  o Day = Holiday
  o Discount = Yes
  o Free Delivery = Yes
➢ Let A = No Buy, and let B denote this combination of evidence.
➢ P(A|B) = P(No Buy | Discount = Yes, Free Delivery = Yes, Day = Holiday)
         = [P(Discount = Yes | No Buy) × P(Free Delivery = Yes | No Buy)
            × P(Day = Holiday | No Buy) × P(No Buy)]
           ÷ [P(Discount = Yes) × P(Free Delivery = Yes) × P(Day = Holiday)]
         = (1/6 × 2/6 × 3/6 × 6/30) / (20/30 × 23/30 × 11/30)
         ≈ 0.0296

(using Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B))
26
Shopping Example –
Naïve Bayes Classifier (2)
➢ Let's take a combination of these factors:
  o Day = Holiday
  o Discount = Yes
  o Free Delivery = Yes
➢ Let A = Buy, and let B denote this combination of evidence.
➢ P(A|B) = P(Buy | Discount = Yes, Free Delivery = Yes, Day = Holiday)
         = [P(Discount = Yes | Buy) × P(Free Delivery = Yes | Buy)
            × P(Day = Holiday | Buy) × P(Buy)]
           ÷ [P(Discount = Yes) × P(Free Delivery = Yes) × P(Day = Holiday)]
         = (19/24 × 21/24 × 8/24 × 24/30) / (20/30 × 23/30 × 11/30)
         ≈ 0.9857

(using Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B))
27
Shopping Example – Naïve Bayes Classifier (3)
➢ Based on the calculation
o Probability of purchase = 0.986
o Probability of no purchase = 0.03
➢ Finally, we have the conditional probabilities of purchase on this day!
➢ Let’s now normalise these probabilities to get the likelihood of the events:
0.986
o Likelihood of Purchase = ≈ 97.05%
0.986+0.03
0.03
o Likelihood of No Purchase = ≈ 2.95%
0.986+0.03

As 97.05% is greater than 2.95%, we can conclude that an average customer will buy on a holiday
with discount and free delivery.

28
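The full calculation for this combination can be reproduced in Python using only the counts that appear in the fractions above; the dictionary and function names are ours, not the slides':

```python
# Counts taken from the three likelihood tables on the slides
n_total, n_buy, n_no_buy = 30, 24, 6

# Class-conditional counts for Discount = Yes, Free Delivery = Yes, Day = Holiday
counts_given_buy = {"discount_yes": 19, "free_delivery_yes": 21, "holiday": 8}
counts_given_no_buy = {"discount_yes": 1, "free_delivery_yes": 2, "holiday": 3}

# Overall counts of each piece of evidence (the shared denominator terms)
counts_evidence = {"discount_yes": 20, "free_delivery_yes": 23, "holiday": 11}

def naive_bayes_posterior(class_counts, n_class):
    # Numerator: P(class) times the product of the class-conditional likelihoods
    num = n_class / n_total
    for value, count in class_counts.items():
        num *= count / n_class
    # Denominator: the evidence terms, also multiplied under the naive assumption
    den = 1.0
    for value, count in counts_evidence.items():
        den *= count / n_total
    return num / den

p_buy = naive_bayes_posterior(counts_given_buy, n_buy)          # ~0.986
p_no_buy = naive_bayes_posterior(counts_given_no_buy, n_no_buy) # ~0.030

# Normalise so the two outcomes sum to 1, as on the final slide
total = p_buy + p_no_buy
print(round(p_buy / total, 4), round(p_no_buy / total, 4))
# -> ~0.97 vs ~0.03 (the slide's 97.05% / 2.95% use rounded intermediate values)
```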
Advantages of Naïve Bayes Classifier
➢Very simple and easy to implement
➢Needs less training data
➢Not sensitive to irrelevant features
➢Handles both continuous and discrete data
➢Highly scalable with the number of predictors and data points
➢As it is fast, it can be used for real-time predictions
29
Disadvantages of Naïve Bayes Classifier
➢Its strong assumption about the independence of attributes often gives bad results
  (i.e. poor prediction accuracy).
➢Discretising numerical values may result in loss of useful information (lower resolution).

30
Exercise
➢Attributes are Color, Type, and Origin; the subject, stolen, can be either yes or no.
➢We want to classify a Red Domestic SUV.

Thank you!
Q&A
