
Module 5: Bayes Classification Methods
Reference: Data Mining: Concepts and Techniques (3rd Edn.), Jiawei Han, Micheline Kamber,
Morgan Kaufmann, 2015
Bayes' Rule

• Bayes' rule (also Bayes' law or Bayes' theorem):

        P(A | B) = P(B | A) P(A) / P(B)

• This simple equation underlies prediction models.


Example
• A doctor knows that the disease meningitis causes the patient to have a stiff neck,
say, 50% of the time. The doctor also knows some unconditional facts: the prior
probability of a patient having meningitis is 1/50,000, and the prior probability of
any patient having a stiff neck is 1/20. Let S be the hypothesis that the patient
has a stiff neck and M be the hypothesis that the patient has meningitis.
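• Applying Bayes' rule with these figures (completing the example):
  P(M | S) = P(S | M) P(M) / P(S) = 0.5 × (1/50,000) / (1/20) = 0.0002
  i.e., only about 1 in 5,000 patients presenting with a stiff neck is expected to have meningitis.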
Bayes' Theorem: Basics

• Bayes' Theorem:

        P(H | X) = P(X | H) P(H) / P(X)

• Let X be a data sample (“evidence”): class label is unknown


• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis
holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• the prior probability of hypothesis H, i.e. the initial probability before we observe any data, reflecting background
knowledge
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X’s age is 31..40 with medium income
Prediction Based on Bayes’ Theorem
• Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes' theorem:

        P(H | X) = P(X | H) P(H) / P(X)

• Informally, this can be viewed as

        posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes

Bayes Classifier
• A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities using

        P(A | B) = P(B | A) P(A) / P(B)
• Foundation: Based on Bayes’ Theorem.
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among
the most practical approaches to certain types of learning problems
• Probabilistic prediction: Predict multiple hypotheses, weighted by their
probabilities
• Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has
performance comparable with decision tree and selected neural network classifiers
Classification Is to Derive the Maximum A Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum a posteriori probability, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes' theorem:

        P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only

        P(Ci | X) ∝ P(X | Ci) P(Ci)

  needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):

        P(X | Ci) = ∏ (k = 1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
• This greatly reduces the computation cost: Only counts the class
distribution
• Once the probability P(X|Ci) is known, assign X to the class with
maximum P(X|Ci)*P(Ci)
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:

        g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))

  and P(xk|Ci) = g(xk, μCi, σCi)
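• A minimal Python sketch of this continuous case (assuming the class-conditional mean and
  standard deviation have already been estimated from the training tuples of Ci; the values
  in the call are hypothetical, for illustration only):

import math

def gaussian_likelihood(x, mu, sigma):
    """Estimate P(xk|Ci) for a continuous attribute using a Gaussian with
    class-conditional mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical values: attribute value 66, class mean 73, class std. dev. 6.2
print(gaussian_likelihood(66, 73, 6.2))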
Naïve Bayes Classifier - Example
• Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
• Instance to be classified:
  X = (age = '<=30', income = 'medium', student = 'yes', credit_rating = 'fair')
• Dataset:
  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31…40   high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31…40   low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31…40   medium   no       excellent      yes
  31…40   high     yes      fair           yes
  >40     medium   no       excellent      no
Naïve Bayes Classifier - Example
• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
  P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore, X belongs to the class buys_computer = "yes"
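• A minimal Python sketch of this calculation, plugging in the class counts read off the table
  above (exact fractions; the decimals on the slide are rounded):

from fractions import Fraction as F

# Priors and class-conditional probabilities taken from the table above
priors = {"yes": F(9, 14), "no": F(5, 14)}
likelihoods = {
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],  # age<=30, income=medium, student=yes, credit=fair
    "no":  [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}

scores = {}
for c in priors:
    score = priors[c]
    for p in likelihoods[c]:
        score *= p
    scores[c] = score

for c, s in scores.items():
    print(c, float(s))                              # yes ~ 0.0282, no ~ 0.0069
print("prediction:", max(scores, key=scores.get))   # -> yes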
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
        P(X | Ci) = ∏ (k = 1..n) P(xk | Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
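• A short Python sketch of the correction on the counts above (one pseudo-count added to each
  of the three income values):

# Observed counts for income in a 1000-tuple partition (from the slide)
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())      # 1000
num_values = len(counts)          # 3 distinct income values

# Add-one (Laplacian) correction: every value receives one extra pseudo-count
corrected = {v: (c + 1) / (total + num_values) for v, c in counts.items()}
print(corrected)                  # low: 1/1003, medium: 991/1003, high: 11/1003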
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of
accuracy
• Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), diseases (lung cancer,
diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Example
• Example: Play Tennis - Given a new instance x’, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
EXAMPLE (SPAM/NONSPAM)
• Infer whether the email document with the text content "machine learning for free"
is SPAM or NONSPAM using Bayes' rule; the document set is given below.
• "free money for free gambling fun" -> SPAM
• "money, money, money" -> SPAM
• "gambling for fun" -> SPAM
• "machine learning for fun, fun, fun" -> NONSPAM
• "free machine learning" -> NONSPAM

• Hint: P(Word/Category) = (Number of occurrence of the word


in all the documents from a category+1) divided by (All the
words in every document from a category + Total number of
unique words in all the documents)
EXAMPLE (SPAM/NONSPAM)
• Document set:
  "free money for free gambling fun" -> SPAM
  "money, money, money" -> SPAM
  "gambling for fun" -> SPAM
  "machine learning for fun, fun, fun" -> NONSPAM
  "free machine learning" -> NONSPAM
• New_DOC: "machine learning for free"
• To find:
  P(SPAM | New_DOC) ∝ P(SPAM) × P(New_DOC | SPAM)
  P(NONSPAM | New_DOC) ∝ P(NONSPAM) × P(New_DOC | NONSPAM)
• P(SPAM) = 3/5 = 0.6
• P(NONSPAM) = 2/5 = 0.4
• P(New_DOC | SPAM) = P("machine learning for free" | SPAM)
  = P("machine" | SPAM) × P("learning" | SPAM) × P("for" | SPAM) × P("free" | SPAM)
  Without correction: 0/12 × 0/12 × 2/12 × 2/12 = 0 (the zero-probability problem)
  With the Laplacian correction (12 words in the SPAM documents, 7 unique words overall):
  = (0+1)/(12+7) × (0+1)/(12+7) × (2+1)/(12+7) × (2+1)/(12+7)
  = 1/19 × 1/19 × 3/19 × 3/19 = 9/19⁴
• P(New_DOC | NONSPAM) = P("machine learning for free" | NONSPAM)
  = P("machine" | NONSPAM) × P("learning" | NONSPAM) × P("for" | NONSPAM) × P("free" | NONSPAM)
  Without correction: 2/9 × 2/9 × 1/9 × 1/9
  With the Laplacian correction (9 words in the NONSPAM documents, 7 unique words overall):
  = (2+1)/(9+7) × (2+1)/(9+7) × (1+1)/(9+7) × (1+1)/(9+7)
  = 3/16 × 3/16 × 2/16 × 2/16 = 36/16⁴
• P(Ci | X) ∝ P(X | Ci) × P(Ci)
  P(SPAM | New_DOC) ∝ P(SPAM) × P(New_DOC | SPAM) = 0.6 × 9/19⁴ ≈ 4.143 × 10⁻⁵
  P(NONSPAM | New_DOC) ∝ P(NONSPAM) × P(New_DOC | NONSPAM) = 0.4 × 36/16⁴ ≈ 21.97 × 10⁻⁵
• Therefore, applying the MAP rule, New_DOC belongs to the class NONSPAM
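• A minimal Python sketch of this multinomial naïve Bayes computation with add-one smoothing,
  reproducing the figures above (punctuation is stripped from the training documents before counting):

from collections import Counter

# Training documents from the slide (commas removed) and the document to classify
docs = [
    ("free money for free gambling fun", "SPAM"),
    ("money money money", "SPAM"),
    ("gambling for fun", "SPAM"),
    ("machine learning for fun fun fun", "NONSPAM"),
    ("free machine learning", "NONSPAM"),
]
new_doc = "machine learning for free"

vocab = {w for text, _ in docs for w in text.split()}      # 7 unique words
doc_counts = Counter(label for _, label in docs)           # SPAM: 3, NONSPAM: 2
word_counts = {c: Counter() for c in doc_counts}
for text, label in docs:
    word_counts[label].update(text.split())

scores = {}
for c in doc_counts:
    total_words = sum(word_counts[c].values())             # SPAM: 12, NONSPAM: 9
    score = doc_counts[c] / len(docs)                      # prior P(c)
    for w in new_doc.split():
        # Add-one (Laplacian) smoothed likelihood P(w | c)
        score *= (word_counts[c][w] + 1) / (total_words + len(vocab))
    scores[c] = score

print(scores)                                     # SPAM ~ 4.14e-05, NONSPAM ~ 2.20e-04
print("prediction:", max(scores, key=scores.get)) # -> NONSPAM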
Example
• Infer the class for the sentence "What is the price of this book", using Bayes' rule and the
  dataset given below.
Example
• Infer the Tag for the text "A very close game" using Bayes' rule and the dataset given below.
EXAMPLE
• Infer if one can play golf given the weather conditions "(Sunny, Hot, Normal, False)"
using Bayes' rule and the dataset given below.
ID Outlook Temperature Humidity Windy Play Golf
0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
10 Rainy Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Sunny Mild High True No
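• A compact Python sketch of this exercise, estimating the class-conditional frequencies
  directly from the 14 rows above and applying the naïve Bayes decision rule (no smoothing,
  following the basic formulation earlier in the module):

from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Windy) -> Play Golf, copied from the table above
rows = [
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]

class_counts = Counter(r[-1] for r in rows)
attr_counts = defaultdict(Counter)            # attr_counts[class][(attribute index, value)]
for *features, label in rows:
    for i, v in enumerate(features):
        attr_counts[label][(i, v)] += 1

def predict(x):
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)                          # prior P(c)
        for i, v in enumerate(x):
            score *= attr_counts[c][(i, v)] / n_c        # P(x_i | c)
        scores[c] = score
    return scores

print(predict(("Sunny", "Hot", "Normal", False)))        # Yes ~ 0.021, No ~ 0.0046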
EXAMPLE
• Infer if one can play golf given the weather conditions "<Outlook=sunny, Temperature=66,
Humidity=90, Windy=True>" using Bayes' rule and the dataset given below.
EXAMPLE
• Infer if one can play golf given the weather conditions "<Outlook=overcast, Temperature=66,
Humidity=90, Windy=True>" using Bayes' rule and the dataset given below.
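• Since the referenced table is not reproduced here, the sketch below only illustrates the
  mechanics for these two exercises: categorical attributes (Outlook, Windy) contribute
  frequency-based likelihoods, while the continuous ones (Temperature, Humidity) use the
  Gaussian density g(x, μ, σ) from earlier. Every number below is a placeholder that would
  have to be replaced by statistics estimated from the actual dataset:

import math

def gaussian(x, mu, sigma):
    # Gaussian density g(x, mu, sigma) from the earlier slide
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Placeholder class statistics (NOT taken from the dataset): categorical likelihoods as
# frequencies, continuous attributes as (mean, std. dev.) per class.
stats = {
    "Yes": {"prior": 0.64, "Outlook=sunny": 0.22, "Windy=True": 0.33,
            "Temperature": (73.0, 6.2), "Humidity": (79.0, 10.2)},
    "No":  {"prior": 0.36, "Outlook=sunny": 0.60, "Windy=True": 0.60,
            "Temperature": (74.6, 8.0), "Humidity": (86.2, 9.7)},
}

scores = {}
for c, s in stats.items():
    mu_t, sd_t = s["Temperature"]
    mu_h, sd_h = s["Humidity"]
    scores[c] = (s["prior"] * s["Outlook=sunny"] * s["Windy=True"]
                 * gaussian(66, mu_t, sd_t) * gaussian(90, mu_h, sd_h))

print(scores, "->", max(scores, key=scores.get))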
