
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Naïve Bayes Classifier (NBC)

Vuong Trong Nhan ([email protected])


Outline
Naive Bayes: Introduction
Bayes' Theorem
Example using Naive Bayes
Types of Naïve Bayes classifiers
Evaluation
Advantages and Disadvantages
Some applications
Exercises

Naïve Bayes classifiers

The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks.
It is based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.
Applications of the Naïve Bayes classifier
Spam filtering:
Spam classification is one of the most popular applications of Naïve Bayes cited in the literature (O'Reilly).
Document classification:
Document and text classification go hand in hand. Another popular use case of Naïve Bayes is content classification. Imagine the content categories of a news media website: every article on the site can be classified under a topic taxonomy. Frederick Mosteller and David Wallace are credited with the first application of Bayesian document classification in their 1963 paper.
Sentiment analysis:
While this is another form of text classification, sentiment analysis is commonly leveraged within marketing to better understand and quantify opinions and attitudes around specific products and brands.
Mental state prediction:
Using fMRI data, naïve Bayes has been leveraged to predict different cognitive states among humans. The goal of this research was to assist in better understanding hidden cognitive states, particularly among brain-injury patients.
https://www.ibm.com/topics/naive-bayes
Example
Dataset that describes the weather conditions for playing tennis.

Day   Outlook   Temperature  Humidity  Wind    Play Tennis
D1    Sunny     Hot          High      Weak    No
D2    Sunny     Hot          High      Strong  No
D3    Overcast  Hot          High      Weak    Yes
D4    Rainy     Mild         High      Weak    Yes
D5    Rainy     Cool         Normal    Weak    Yes
D6    Rainy     Cool         Normal    Strong  No
D7    Overcast  Cool         Normal    Strong  Yes
D8    Sunny     Mild         High      Weak    No
D9    Sunny     Cool         Normal    Weak    Yes
D10   Rainy     Mild         Normal    Weak    Yes
D11   Sunny     Mild         Normal    Strong  Yes
D12   Overcast  Mild         High      Strong  Yes
D13   Overcast  Hot          Normal    Weak    Yes
D14   Rainy     Mild         High      Strong  No
D15   Sunny     Hot          Normal    Weak    ???

Features are 'Outlook', 'Temperature', 'Humidity' and 'Wind'.
Class: Play Tennis.
Predict: Today = D15, Play Tennis = ?


Bayes’ Theorem
Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred:

P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}

where Y and X are events and P(X) ≠ 0.

We are trying to find the probability of event Y, given that event X is true. Event X is also termed the evidence.
P(Y) is the prior probability of Y, i.e. the probability of the event before the evidence is seen. (The evidence is an attribute value of an unknown instance; here, it is event X.)
P(X|Y) is the likelihood: the probability of the evidence given that event Y occurs.
P(Y|X) is the posterior probability of Y, i.e. the probability of the event after the evidence is seen.
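As a quick illustration of the theorem, the posterior can be computed directly; the spam-filter numbers below are hypothetical and chosen only for this sketch:

# Hypothetical spam-filter numbers, used only to illustrate Bayes' theorem
p_spam = 0.3              # P(Y): prior probability that a message is spam
p_word_given_spam = 0.6   # P(X|Y): probability the word "free" appears in a spam message
p_word = 0.25             # P(X): overall probability the word "free" appears in any message

# Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.72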
“Naïve” Bayes Assumption
Assumption: each feature contributes independently and equally to the outcome.

Independence: no pair of features is dependent.
o E.g., the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind.
o Hence, the features are assumed to be independent.
Equality: each feature is given the same weight (or importance).
o E.g., knowing only the temperature and humidity alone can't predict the outcome accurately.
o None of the attributes is irrelevant; each is assumed to contribute equally to the outcome.
Note: In fact, the independence assumption is never exactly correct, but it often works well in practice.
Data presentation

Dataset D: (X, y)
X is an independent feature vector (of size n)
y is the class variable
Apply Bayes' theorem:

P(y \mid X) = \frac{P(X \mid y) \, P(y)}{P(X)}      (1)

E.g.: X = (Rainy, Hot, High, Weak), y = No
P(y|X) means the probability of "not playing tennis" given that the weather conditions are "rainy outlook", "hot temperature", "high humidity" and "weak wind".
Naïve Bayes

Since A and B are independent (naive assumption): P(A, B) = P(A) P(B)

With X = (x_1, x_2, ..., x_n), equation (1) becomes:

P(y \mid x_1, ..., x_n) = \frac{P(x_1 \mid y) \, P(x_2 \mid y) \cdots P(x_n \mid y) \, P(y)}{P(x_1) \, P(x_2) \cdots P(x_n)}      (2)

(2) can be expressed as:

P(y \mid x_1, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1) \, P(x_2) \cdots P(x_n)}      (3)

As the denominator remains constant for a given input, we can remove that term (proportionality):

P(y \mid x_1, ..., x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)      (4)

Decision rule (maximum a posteriori): choose the class y that maximizes (4):

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
Note: The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
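A minimal sketch (not part of the lecture code) of decision rule (4), assuming the prior and conditional probability tables have already been estimated from the training data:

# Sketch of decision rule (4): y_hat = argmax_y P(y) * prod_i P(x_i | y).
# priors:     {class: P(y)}
# cond_probs: {class: {feature: {value: P(x_i | y)}}}  -- assumed estimated beforehand
# x:          {feature: value} for the instance to classify
def predict(priors, cond_probs, x):
    best_class, best_score = None, -1.0
    for y, p_y in priors.items():
        score = p_y
        for feature, value in x.items():
            # unseen feature value -> probability 0 (see Laplace smoothing later)
            score *= cond_probs[y][feature].get(value, 0.0)
        if score > best_score:
            best_class, best_score = y, score
    return best_class, best_score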
Naïve Bayes

Step 1: Calculate the class priors P(y)

Play Tennis
  Yes   No   P(Yes)   P(No)
  9     5    9/14     5/14

P(play = Yes) = 9/14
P(play = No) = 5/14
Naïve Bayes

Predict: Today = D15 = (Sunny, Hot, Normal, Weak), Play Tennis = ???

Step 2: Calculate P(xi | y)

Outlook
  xi        Yes  No   P(xi|Yes)  P(xi|No)
  Sunny     2    3    2/9        3/5
  Overcast  4    0    4/9        0/5
  Rainy     3    2    3/9        2/5

Temperature
  xi    Yes  No   P(xi|Yes)  P(xi|No)
  Hot   2    2    2/9        2/5
  Mild  4    2    4/9        2/5
  Cool  3    1    3/9        1/5

Humidity
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  High    3    4    3/9        4/5
  Normal  6    1    6/9        1/5

Wind
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  Weak    6    2    6/9        2/5
  Strong  3    3    3/9        3/5

E.g.:
P(Outlook = Sunny | play = Yes) = 2/9      P(Outlook = Sunny | play = No) = 3/5
P(Temp. = Hot | play = Yes) = 2/9          P(Temp. = Hot | play = No) = 2/5
P(Humidity = Normal | play = Yes) = 6/9    P(Humidity = Normal | play = No) = 1/5
P(Wind = Weak | play = Yes) = 6/9          P(Wind = Weak | play = No) = 2/5
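A short sketch (not from the slides) of how these tables could be tallied with pandas, assuming the 14 training rows D1-D14 are stored in a DataFrame:

import pandas as pd

# Training rows D1-D14 from the dataset slide
df = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                    "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool","Mild","Cool",
                    "Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal","High",
                    "Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong","Weak",
                    "Weak","Weak","Strong","Strong","Weak","Strong"],
    "PlayTennis":  ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

# Step 1: class priors P(y)
print(df["PlayTennis"].value_counts(normalize=True))   # Yes 9/14 ≈ 0.64, No 5/14 ≈ 0.36

# Step 2: one conditional table P(x_i | y) per feature
for feature in ["Outlook", "Temperature", "Humidity", "Wind"]:
    counts = pd.crosstab(df[feature], df["PlayTennis"])   # raw counts per (value, class)
    probs = counts / counts.sum(axis=0)                   # normalize each class column
    print(probs)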
Naïve Bayes

Predict: Today = D15 = (Sunny, Hot, Normal, Weak), Play Tennis = ?
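Using decision rule (4) with the tables from Steps 1 and 2 (values computed here from those tables):

P(Yes) · P(Sunny|Yes) · P(Hot|Yes) · P(Normal|Yes) · P(Weak|Yes) = 9/14 · 2/9 · 2/9 · 6/9 · 6/9 ≈ 0.0141
P(No) · P(Sunny|No) · P(Hot|No) · P(Normal|No) · P(Weak|No) = 5/14 · 3/5 · 2/5 · 1/5 · 2/5 ≈ 0.0069

Since 0.0141 > 0.0069, the classifier predicts Play Tennis (D15) = Yes.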
Evaluate a Naïve Bayes classifier

Accuracy, Precision, Recall, Confusion matrix

https://www.ibm.com/topics/naive-bayes
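A brief sketch (not from the slides) of computing these metrics with scikit-learn; the toy labels below are illustrative, and in practice y_test / y_pred would come from a fitted classifier such as the GaussianNB iris example at the end of this deck:

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Toy labels used only to illustrate the metric calls
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_test, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_test, y_pred))  # 3/4 = 0.75
print("Recall   :", recall_score(y_test, y_pred))     # 3/4 = 0.75
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))               # rows: true class, columns: predicted class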
Types of Naïve Bayes classifiers
Based on the distributions of the feature values:
Gaussian Naïve Bayes (GaussianNB):
o Feature: continuous variables
o e.g. Age ∈ [18, 60]
o Gaussian distribution
Multinomial Naïve Bayes (MultinomialNB):
o Feature: discrete values (e.g. frequency counts)
o e.g. outlook = {sunny, overcast, rainy}
o Multinomial distribution
Bernoulli Naïve Bayes (BernoulliNB):
o Features: Boolean variables
o {True, False} or {1, 0}
o Bernoulli distribution
Hybrid NB
o by combining existing Naive Bayes models
https://www.ibm.com/topics/naive-bayes
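A small sketch showing how each variant is instantiated in scikit-learn (standard estimators; the parameter values here are illustrative):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gaussian_clf = GaussianNB()                 # continuous features (e.g. age, temperature)
multinomial_clf = MultinomialNB(alpha=1.0)  # count features (e.g. word frequencies); alpha = Laplace smoothing
bernoulli_clf = BernoulliNB(alpha=1.0)      # binary features (e.g. word present / absent)

# All three expose the same fit(X, y) / predict(X) interface.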
Advantages and disadvantages

Advantages
Less complex:
o Naïve Bayes is considered a simpler classifier since the
parameters are easier to estimate.
Scales well:
o Compared to logistic regression, Naïve Bayes is
considered a fast and efficient classifier that is fairly
accurate when the conditional independence assumption
holds. It also has low storage requirements.
Can handle high-dimensional data:
o Use cases, such as document classification, can have a high number of dimensions, which can be difficult for other classifiers to manage.
https://www.ibm.com/topics/naive-bayes
Advantages and disadvantages

Disadvantages:
Subject to Zero frequency:
o Zero frequency occurs when a categorical variable does not
exist within the training set.
o For example, imagine that we're trying to find the maximum likelihood estimate for the word "sir" given the class "spam", but the word "sir" doesn't exist in the training data. The probability in this case would be zero, and since this classifier multiplies all the conditional probabilities together, the posterior probability will also be zero. (To avoid this issue, Laplace smoothing can be leveraged.)
Unrealistic core assumption:
o While the conditional independence assumption overall performs well, the assumption does not always hold, leading to incorrect classifications.
https://www.ibm.com/topics/naive-bayes
(Optional)

Dealing with the zero-frequency problem
Laplace smoothing/correction
Dealing with continuous features
Discretization
Probability density function
Zero-frequency problem

Play Tennis
  Yes   No   P(Yes)   P(No)
  9     5    9/14     5/14

Outlook
  xi        Yes  No   P(xi|Yes)  P(xi|No)
  Sunny     2    3    2/9        3/5
  Overcast  4    0    4/9        0/5
  Rainy     3    2    3/9        2/5

Temperature
  xi    Yes  No   P(xi|Yes)  P(xi|No)
  Hot   2    2    2/9        2/5
  Mild  4    2    4/9        2/5
  Cool  3    1    3/9        1/5

Humidity
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  High    3    4    3/9        4/5
  Normal  6    1    6/9        1/5

Wind
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  Weak    6    2    6/9        2/5
  Strong  3    3    3/9        3/5

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?

Note: P(Outlook = Overcast | No) = 0/5 = 0, so the product for class "No" collapses to zero regardless of the other features.
Laplace Smoothing/Correction

In Naive Bayes classification, Laplace smoothing, also known as add-one smoothing, is a technique used to handle the problem of zero probabilities:

P_{\mathrm{Lap},\alpha}(x_i \mid y) = \frac{\mathrm{count}(x_i, y) + \alpha}{\mathrm{count}(y) + \alpha \, |X|}

Where:
• P(xi | y) is the probability of feature value xi given class y.
• α is the smoothing parameter (α > 0, usually α = 1).
• count(xi, y) is the count of occurrences of feature value xi with class y in the training data.
• count(y) is the total count of instances of class y in the training data.
• |X| is the number of unique feature values (or the size of the vocabulary).
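A minimal sketch of the smoothed estimate (not from the slides); feature_counts and class_count are assumed to have been tallied from the training data:

# Laplace-smoothed estimate of P(x_i | y).
# feature_counts: {value: count of (value, y) pairs in the training data} for one feature
# class_count:    total number of training instances with class y
def laplace_smoothed_prob(value, feature_counts, class_count, alpha=1.0):
    n_values = len(feature_counts)   # |X|: number of distinct values of this feature
    return (feature_counts.get(value, 0) + alpha) / (class_count + alpha * n_values)

# Example from the next slide: Outlook given class "No" (Sunny=3, Overcast=0, Rainy=2)
outlook_given_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}
print(laplace_smoothed_prob("Overcast", outlook_given_no, class_count=5))  # (0+1)/(5+3) = 1/8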
Laplace smoothing/correction

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?

P(xi|y) without Laplace smoothing:

Outlook
  xi        Yes  No   P(xi|Yes)  P(xi|No)
  Sunny     2    3    2/9        3/5
  Overcast  4    0    4/9        0/5
  Rainy     3    2    3/9        2/5

P(xi|y) using Laplace smoothing:

Outlook (using Laplace smoothing)
  xi        Yes  No   P(xi|Yes)  P(xi|No)
  Sunny     2    3    3/12       4/8
  Overcast  4    0    5/12       1/8
  Rainy     3    2    4/12       3/8

• Choose α = 1
• Outlook = {Sunny, Overcast, Rainy} => |Outlook| = 3
• count(Overcast, Yes) = 4, count(Overcast, No) = 0
• count(Yes) = 9, count(No) = 5

P(Outlook = Overcast | Yes) = (4 + 1) / (9 + 1·3) = 5/12
P(Outlook = Overcast | No) = (0 + 1) / (5 + 1·3) = 1/8
NBC using Laplace smoothing

Play Tennis
  Yes   No   P(Yes)   P(No)
  9     5    9/14     5/14

Outlook (Laplace smoothing)
  xi        Yes  No   P(xi|Yes)  P(xi|No)
  Sunny     2    3    3/12       4/8
  Overcast  4    0    5/12       1/8
  Rainy     3    2    4/12       3/8

Temperature (Laplace smoothing)
  xi    Yes  No   P(xi|Yes)  P(xi|No)
  Hot   2    2    3/12       3/8
  Mild  4    2    5/12       3/8
  Cool  3    1    4/12       2/8

Humidity (Laplace smoothing)
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  High    3    4    4/11       5/7
  Normal  6    1    7/11       2/7

Wind (Laplace smoothing)
  xi      Yes  No   P(xi|Yes)  P(xi|No)
  Weak    6    2    7/11       3/7
  Strong  3    3    4/11       4/7

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?
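Using decision rule (4) with the smoothed tables above (values computed here from those tables; the class priors are left unsmoothed, as on this slide):

P(Yes) · P(Overcast|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) = 9/14 · 5/12 · 4/12 · 4/11 · 4/11 ≈ 0.0118
P(No) · P(Overcast|No) · P(Cool|No) · P(High|No) · P(Strong|No) = 5/14 · 1/8 · 2/8 · 5/7 · 4/7 ≈ 0.0046

Since 0.0118 > 0.0046, the classifier predicts Play Tennis (D16) = Yes.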
NBC with continuous features

Dealing with continuous values:

Change to discrete values (data binning)
o E.g.
• Temperature = 80 => high
• Temperature = 70 => mild
• Temperature = 60 => cool
Using a probability density function (f):

P(X = (x_1, x_2, \ldots, x_n) \mid Y = y) = \prod_{i} f(X_i = x_i \mid Y = y)

Probability density function for the normal distribution (Gaussian distribution):

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
NBC with continuous features

Using the probability density function (f). Predict: D17 = { Outlook = Overcast, Temperature = 60, Humidity = 62, Wind = Weak }

Day   Outlook   Temperature  Humidity  Wind    Play Tennis
D1    Sunny     85           85        Weak    No
D2    Sunny     80           90        Strong  No
D3    Overcast  83           86        Weak    Yes
D4    Rainy     70           96        Weak    Yes
D5    Rainy     68           80        Weak    Yes
D6    Rainy     65           70        Strong  No
D7    Overcast  64           65        Strong  Yes
D8    Sunny     72           95        Weak    No
D9    Sunny     69           70        Weak    Yes
D10   Rainy     75           80        Weak    Yes
D11   Sunny     75           70        Strong  Yes
D12   Overcast  72           90        Strong  Yes
D13   Overcast  81           75        Weak    Yes
D14   Rainy     71           91        Strong  No

μ(Temp | yes) = (83 + 70 + ... + 81) / 9 = 73
σ(Temp | yes) = \sqrt{ \frac{(83 - 73)^2 + (70 - 73)^2 + \cdots + (81 - 73)^2}{9 - 1} } = 6.2

μ(Temp | no) = (85 + 80 + ... + 71) / 5 = 74.6
σ(Temp | no) = \sqrt{ \frac{(85 - 74.6)^2 + (80 - 74.6)^2 + \cdots + (71 - 74.6)^2}{5 - 1} } = 8

Probability density function for the normal distribution:

f(temp = 60 | yes) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(60 - 73)^2}{2 \cdot 6.2^2}} ≈ 0.0071
f(temp = 60 | no) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(60 - 74.6)^2}{2 \cdot 8^2}} ≈ 0.0094
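A small sketch (not part of the slides) of the Gaussian density computation, which can be used to check the two values above:

import math

# Gaussian probability density function used on this slide
def gaussian_pdf(x, mu, sigma):
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Class-conditional densities for Temperature = 60 (instance D17), using the mu/sigma above
print(gaussian_pdf(60, mu=73.0, sigma=6.2))   # ~0.0071 for Play = Yes
print(gaussian_pdf(60, mu=74.6, sigma=8.0))   # ~0.0094 for Play = No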
Summary

Naïve Bayes Classifier


Naïve assumption
Bayes Theory
Types:
Gaussian NB
Multinomial NB
Bernoulli NB
Gaussian naïve bayes example
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on the training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy:", metrics.accuracy_score(y_test, y_pred))

### Gaussian Naive Bayes model accuracy: 0.95
Exercise

Day   Outlook   Temperature  Humidity  Wind    Play Tennis
D1    Sunny     Hot          High      Weak    No
D2    Sunny     Hot          High      Strong  No
D3    Overcast  Hot          High      Weak    Yes
D4    Rain      Mild         High      Weak    Yes
D5    Rain      Cool         Normal    Weak    Yes
D6    Rain      Cool         Normal    Strong  No
D7    Overcast  Cool         Normal    Strong  Yes
D8    Sunny     Mild         High      Weak    No
D9    Sunny     Cool         Normal    Weak    Yes
D10   Rain      Mild         Normal    Weak    Yes
D11   Sunny     Mild         Normal    Strong  Yes
D12   Overcast  Mild         High      Strong  Yes
D13   Overcast  Hot          Normal    Weak    Yes
D14   Rain      Mild         High      Strong  No

Use the Naïve Bayes algorithm to predict:
• D15 = (Sunny, Hot, High, Weak); Play tennis (D15) = ?
• D16 = (Rain, Mild, Normal, Weak); Play tennis (D16) = ?
