3 - Classification - Naive Bayes
➢Where to use Naïve Bayes Classifier?
o SPAM Filtering
o Performing sentiment analysis of the audience on social media.
o News Text Classification
o Etc.
What is Naïve Bayes
➢Named after Thomas Bayes, the 18th-century statistician who first
described this form of conditional reasoning in the Western literature.
➢Works on the principle of conditional probability as given by
Bayes’ theorem.
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(x)
– Conditional probability: P( x1 | x2 ), P(x2 | x1 )
– Joint probability: x = ( x1 , x2 ), P(x) = P(x1 ,x2 )
– Relationship: P(x1 ,x2 ) = P( x2 | x1 )P( x1 ) = P( x1 | x2 )P( x2 )
– Independence:
P( x2 | x1 ) = P( x2 ), P( x1 | x2 ) = P( x1 ), P(x1 ,x2 ) = P( x1 )P( x2 )
• Bayesian Rule
  P(c|x) = P(x|c) P(c) / P(x)
  i.e. Posterior = (Likelihood × Prior) / Evidence,
  where P(c|x) is the discriminative (posterior) quantity and P(x|c) is the generative (likelihood) quantity.
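To make the rule concrete, here is a minimal Python sketch (the numbers are invented purely for illustration) that evaluates a posterior from a likelihood, a prior, and the evidence:

```python
# Bayes' rule: posterior = likelihood * prior / evidence.
# The values below are made up purely for illustration.

def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Example: P(x|c) = 0.8, P(c) = 0.3, P(x) = 0.5  ->  P(c|x) = 0.48
print(posterior(likelihood=0.8, prior=0.3, evidence=0.5))
```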
Bayes’ Theorem: Example 1
John flies frequently and likes to upgrade his seat to first class. He has determined that if he checks
in for his flight at least two hours early, the probability that he will get an upgrade is 0.75;
otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in
at least two hours before his flight only 40% of the time.
Suppose John did not receive an upgrade on his most recent attempt. What is the probability that
he did not arrive two hours early?
Let
  C = {John arrived at least two hours early},  ¬C = {John did not arrive two hours early}
  A = {John received an upgrade},  ¬A = {John did not receive an upgrade}
so that the given information is
  P(A|C) = 0.75,  P(A|¬C) = 0.35,  P(C) = 0.40,
and the quantity sought is P(¬C|¬A) = ?
Bayes’ Theorem: Example 1
By Bayes’ theorem,
  P(¬C|¬A) = P(¬A|¬C) · P(¬C) / P(¬A)
The rest of the problem is simply figuring out the probability scores of the terms on the right-hand
side.
Bayes’ Theorem: Example 1
Start by figuring out the simplest terms based on the available information:
➢ Since John checks in at least two hours early 40% of the time, we know that
  P(C) = 0.4
  This means that the probability of not checking in at least two hours early is
  P(¬C) = 1 − P(C) = 1 − 0.4 = 0.6
➢ The story also tells us that the probability of John receiving an upgrade given that he checked in
  early is 0.75, i.e. P(A|C) = 0.75
➢ Next, we were told that the probability of John receiving an upgrade given that he did not check
  in early is 0.35, i.e. P(A|¬C) = 0.35, which allows us to compute the probability that he did not
  receive an upgrade given the same circumstance as
  P(¬A|¬C) = 1 − P(A|¬C) = 1 − 0.35 = 0.65
Bayes’ Theorem: Example 1
We were not told the probability of John receiving an upgrade, P(A). Fortunately, using the terms
figured out earlier, this probability can be calculated by total probability as follows:
  P(A) = P(A ∩ C) + P(A ∩ ¬C)
       = P(C) P(A|C) + P(¬C) P(A|¬C)
       = 0.4 × 0.75 + 0.6 × 0.35
       = 0.51
Bayes’ Theorem: Example 1
Finally, using Bayes’ theorem, we can compute the probability P(¬C|¬A) as follows:
  P(¬C|¬A) = P(¬A|¬C) · P(¬C) / P(¬A)
           = 0.65 × 0.6 / (1 − 0.51)
           ≈ 0.796
Answer: the probability that John did not arrive two hours early, given that he did not receive an
upgrade, is approximately 0.796.
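The arithmetic can be verified with a short Python sketch using only the values given in the example:

```python
# Example 1, reproduced numerically.
p_C = 0.40             # P(C): John checks in at least two hours early
p_A_given_C = 0.75     # P(A|C): upgrade given early check-in
p_A_given_notC = 0.35  # P(A|~C): upgrade given late check-in

p_notC = 1 - p_C                        # 0.6
p_notA_given_notC = 1 - p_A_given_notC  # 0.65

# Total probability of an upgrade: P(A) = P(C)P(A|C) + P(~C)P(A|~C)
p_A = p_C * p_A_given_C + p_notC * p_A_given_notC  # 0.51

# Bayes' theorem: P(~C|~A) = P(~A|~C) P(~C) / P(~A)
p_notC_given_notA = p_notA_given_notC * p_notC / (1 - p_A)
print(round(p_notC_given_notA, 3))  # ~0.796
```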
Probabilistic Classification Principle
• Maximum A Posterior (MAP) classification rule
– For an input x, find the largest one from L probabilities output by a
discriminative probabilistic classifier P(c1 |x), ..., P(cL |x).
– Assign x to label c* if P(c * |x) is the largest.
• Generative classification with the MAP rule
– Apply Bayesian rule to convert them into posterior probabilities
P(ci|x) = P(x|ci) P(ci) / P(x) ∝ P(x|ci) P(ci)   for i = 1, 2, ..., L
(P(x) is a common factor for all L probabilities and can be dropped.)
– Then apply the MAP rule to assign a label
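As a minimal sketch of this two-step procedure, the Python snippet below scores each class by P(x|ci) P(ci) and picks the maximum; the likelihood and prior values are invented solely for illustration:

```python
# MAP rule for a generative classifier: pick argmax_c P(x|c) P(c).
# P(x) is a common factor across classes and can be ignored.
likelihoods = {"c1": 0.10, "c2": 0.30, "c3": 0.05}  # P(x|ci) for a fixed input x (illustrative)
priors      = {"c1": 0.50, "c2": 0.30, "c3": 0.20}  # P(ci) (illustrative)

scores = {c: likelihoods[c] * priors[c] for c in priors}
c_star = max(scores, key=scores.get)
print(c_star, scores)  # c2 wins with 0.30 * 0.30 = 0.09
```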
Naïve Bayes
• Bayes classification
P(c|x) ∝ P(x|c) P(c) = P(x1, ..., xn | c) P(c)   for c = c1, ..., cL
Difficulty: learning the joint probability P(x1, ..., xn | c) is infeasible!
• Naïve Bayes classification
– Assume all input features are class conditionally independent!
  P(x1, x2, ..., xn | c) = P(x1 | x2, ..., xn, c) P(x2, ..., xn | c)
                         = P(x1 | c) P(x2, ..., xn | c)   (applying the independence assumption)
                         = P(x1 | c) P(x2 | c) ... P(xn | c)
– Apply the MAP classification rule: assign x' = (a1, a2, ..., an) to c* if
  [P(a1 | c*) ... P(an | c*)] P(c*) ≥ [P(a1 | c) ... P(an | c)] P(c),   c ≠ c*, c = c1, ..., cL
  (the left-hand side estimates P(a1, ..., an | c*); the right-hand side estimates P(a1, ..., an | c))
Naïve Bayes
• In practice, each probability is replaced by its estimate P̂, so the MAP rule assigns x' = (a1, ..., an) to c* if
  [P̂(a1 | c*) ... P̂(an | c*)] P̂(c*) ≥ [P̂(a1 | ci) ... P̂(an | ci)] P̂(ci),   ci ≠ c*, ci = c1, ..., cL
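A small discrete-feature Naïve Bayes sketch in Python follows; the relative-frequency estimates and the toy dataset are illustrative assumptions, not taken from the slides:

```python
from collections import Counter, defaultdict

def train(X, y):
    """Estimate priors P(c) and per-feature likelihoods P(x_j = a | c) by relative frequency."""
    n = len(y)
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    likelihoods = defaultdict(Counter)  # (class, feature index) -> counts of feature values
    for xs, c in zip(X, y):
        for j, a in enumerate(xs):
            likelihoods[(c, j)][a] += 1
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """MAP rule: assign x to the class maximising P(a1|c) ... P(an|c) P(c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, a in enumerate(x):
            counts = likelihoods[(c, j)]
            score *= counts[a] / sum(counts.values())
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data (illustrative only): two discrete features, two classes.
X = [("yes", "hot"), ("yes", "cold"), ("no", "hot"), ("no", "cold")]
y = ["buy", "buy", "no-buy", "no-buy"]
priors, likelihoods = train(X, y)
print(predict(("yes", "hot"), priors, likelihoods))  # -> "buy"
```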
Naïve Bayes
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take innumerable values, so its conditional
  probability is often modelled with the normal (Gaussian) distribution:
  P̂(xj | ci) = 1 / (√(2π) σji) · exp( −(xj − μji)² / (2σji²) )
  μji : mean (average) of the feature values xj of examples for which c = ci
  σji : standard deviation of the feature values xj of examples for which c = ci
– Learning Phase: for X = (X1, ..., XF), C = c1, ..., cL
  Output: F × L normal distributions and P(C = ci), i = 1, ..., L
– Test Phase: given an unknown instance X = (a1, ..., an),
  • instead of looking up tables, calculate the conditional probabilities with the
    normal distributions obtained in the learning phase
  • apply the MAP rule to assign a label (the same as in the discrete case)
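A brief Python sketch of the two phases under the normal-distribution assumption; the helper names fit_normal and normal_pdf are hypothetical:

```python
import math

def fit_normal(values):
    """Learning phase: mean and standard deviation of feature x_j within class c_i."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mu, sigma

def normal_pdf(x, mu, sigma):
    """Test phase: conditional density estimate P(x_j | c_i) under the fitted normal."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Fit one normal per (feature, class) pair, then multiply the densities of an
# unseen instance with the class prior and apply the MAP rule as in the discrete case.
mu, sigma = fit_normal([4.1, 5.0, 4.7, 5.3])  # illustrative feature values for one class
print(normal_pdf(4.8, mu, sigma))
```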
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
  μ = (1/N) Σ xn ,   σ² = (1/N) Σ (xn − μ)²
  μYes = 21.64, σYes = 2.35
  μNo = 23.88, σNo = 7.09
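These figures can be reproduced from the listed temperatures; note that the reported σ values (2.35 and 7.09) match the sample standard deviation (dividing by N − 1) rather than the 1/N form of the formula above:

```python
import statistics

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no  = [27.3, 30.1, 17.4, 29.5, 15.1]

# statistics.stdev divides by N - 1 and matches the slide's sigma values;
# statistics.pstdev would implement the 1/N formula instead.
print(round(statistics.mean(yes), 2), round(statistics.stdev(yes), 2))  # 21.64 2.35
print(round(statistics.mean(no), 2),  round(statistics.stdev(no), 2))   # 23.88 7.09
```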
Where to Use Naïve
Bayes Classifier?
https://www.youtube.com/watch?v=l3dZ6ZNFjo0
Shopping Example –
Problem Statement
To predict whether a person will purchase a product on a specific combination of Day,
Discount, and Free Delivery using a Naïve Bayes Classifier.
https://www.youtube.com/watch?v=l3dZ6ZNFjo0
Shopping Example –
Dataset
➢The dataset contains a total of 30 observations (30 records).
➢We have three predictors (Day,
Discount, and Free Delivery) and
one target (Purchase).
Shopping Example – Frequency Table
➢Based on the dataset containing the three input types Day, Discount and Free
Delivery, we will populate frequency tables for each attribute.
Shopping Example –
Frequency Table
➢For the Bayes’ Theorem, let the event
“Buy” be “A” and the independent
variables “Discount,” “Free Delivery,” and
“Day” be “B.”
Shopping
Example –
Likelihood Table
➢ Let’s calculate the likelihood table for one of the variables, Day, which includes
“Weekday,” “Weekend,” and “Holiday.”
  P(B) = P(Weekday) = 11/30 ≈ 0.37
  P(A) = P(No Buy) = 6/30 = 0.2
  P(C) = P(Buy) = 24/30 = 0.8
➢ Recall Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B)
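As a small sketch, the likelihood-table entries fall straight out of the frequency counts; the Weekend count (8) below is inferred from the 30-record total and is an assumption:

```python
total = 30
day_counts = {"Weekday": 11, "Weekend": 8, "Holiday": 11}  # Weekend = 8 is inferred, not stated
purchase_counts = {"Buy": 24, "No Buy": 6}

print(day_counts["Weekday"] / total)      # P(Weekday) = 11/30 ~ 0.37
print(purchase_counts["No Buy"] / total)  # P(No Buy)  = 6/30  = 0.2
print(purchase_counts["Buy"] / total)     # P(Buy)     = 24/30 = 0.8
```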
Shopping Example –
Naïve Bayes Classifier
(2)
➢ Let’s take a combination of these factors (call this combination event B):
o Day = Holiday
o Discount = Yes
o Free Delivery = Yes
➢ Let A = Buy. Then
  P(A|B) = P(Buy | Discount = Yes, Free Delivery = Yes, Day = Holiday)
         = [P(Discount = Yes | Buy) × P(Free Delivery = Yes | Buy)
            × P(Day = Holiday | Buy) × P(Buy)]
           ÷ [P(Discount = Yes) × P(Free Delivery = Yes) × P(Day = Holiday)]
         = (19/24 × 21/24 × 8/24 × 24/30) / (20/30 × 23/30 × 11/30)
         ≈ 0.9857
Shopping Example – Naïve Bayes Classifier (3)
➢ Based on the calculation
o Probability of purchase = 0.986
o Probability of no purchase = 0.03
➢ Finally, we have the conditional probabilities of purchase on this day!
➢ Let’s now normalise these probabilities to get the likelihood of the events:
o Likelihood of Purchase = 0.986 / (0.986 + 0.03) ≈ 97.05%
o Likelihood of No Purchase = 0.03 / (0.986 + 0.03) ≈ 2.95%
As 97.05% is greater than 2.95%, we can conclude that an average customer will buy on a holiday
with discount and free delivery.
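The whole calculation can be reproduced in a few lines of Python. The “No Buy” counts (1, 2, 3) are inferred by subtracting the “Buy” counts (19, 21, 8) from the overall counts (20, 23, 11), so treat them as an assumption:

```python
from math import prod

total, n_buy, n_no_buy = 30, 24, 6

# Counts of Discount=Yes, Free Delivery=Yes, Day=Holiday ...
buy_counts     = [19, 21, 8]   # ... within the "Buy" class
no_buy_counts  = [1, 2, 3]     # ... within the "No Buy" class (inferred)
overall_counts = [20, 23, 11]  # ... over all 30 records

evidence = prod(c / total for c in overall_counts)  # P(Discount) P(Free Delivery) P(Holiday)

p_buy    = prod(c / n_buy for c in buy_counts) * (n_buy / total) / evidence
p_no_buy = prod(c / n_no_buy for c in no_buy_counts) * (n_no_buy / total) / evidence
print(round(p_buy, 3), round(p_no_buy, 3))   # ~0.986, ~0.03

# Normalise to compare the two outcomes directly (the slides round this to 97.05%).
print(round(p_buy / (p_buy + p_no_buy), 4))  # ~0.97 -> purchase is the more likely outcome
```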
Advantages of Naïve Bayes Classifier
➢ Very simple and easy to implement.
➢ Handles both continuous and discrete data.
➢ As it is fast, it can be used in real-time predictions.
➢ Highly scalable with the number of predictors and data points.
Disadvantages of Naïve Bayes Classifier
➢ Its strong assumption about the independence of attributes often gives bad results
(i.e. poor prediction accuracy).
➢ Discretising numerical values may result in a loss of useful information (lower resolution).
Exercise
➢The attributes are Color, Type, and Origin, and the target variable, “stolen,” can be
either yes or no.
➢We want to classify a Red Domestic SUV.
Thank you!
Q&A