16 - Naïve Bayes Classifier
TIET, PATIALA
Naïve Bayes Classifier- Introduction
▪ Naïve Bayes classifier is a probabilistic classifier that uses Bayes' theorem and the Naïve assumption to classify test examples using the training examples.
▪ According to Bayes' Theorem,
$$P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}$$
where P(A|B) is called the posterior probability of A given B; P(A) is the prior probability of A; P(B|A) is the likelihood of B given A; and P(B) is the evidence of B.
▪ For machine learning tasks, A is the target variable ($y_i$) and B is the input test case ($X = x_1 x_2 x_3 x_4 \ldots x_k$).
▪ Therefore we find, $P(y_i \mid X) = \dfrac{P(y_i)\, P(X \mid y_i)}{P(X)}$ for all $y_i \in Y$.
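As a quick illustration with hypothetical numbers (not taken from the slides), suppose $P(A) = 0.3$, $P(B \mid A) = 0.5$, and $P(B) = 0.25$. Bayes' theorem then gives
$$P(A \mid B) = \frac{0.3 \times 0.5}{0.25} = 0.6,$$
so observing B raises the probability of A from 0.3 to 0.6.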
Naïve Bayes Classifier- Introduction (Contd.)
▪ Since P(X) is constant w.r.t. the different values of $y_i$, it can be ignored.
▪ Therefore, $P(y_i \mid X) \propto P(y_i)\, P(X \mid y_i)$.
▪ According to the Naïve assumption, the features in the input are conditionally independent of each other given the class label, e.g., $X$ = ($x_1$ = age, $x_2$ = salary, $x_3$ = loan) with target $y$ = credit (risky or safe).
▪ Therefore, $P(y_i \mid X) \propto P(y_i) \prod_{j=1}^{k} P(x_j \mid y_i)$.
▪ The final predicted label ($y^*$) for a given input $X$ is thus computed as:
$$y^* = \arg\max_{y_i \in Y} \; P(y_i) \prod_{j=1}^{k} P(x_j \mid y_i)$$
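Concretely, for the credit example above, the naïve factorization means the score for the label "risky" is
$$P(\text{risky} \mid X) \propto P(\text{risky})\, P(\text{age} \mid \text{risky})\, P(\text{salary} \mid \text{risky})\, P(\text{loan} \mid \text{risky}),$$
and similarly for "safe"; the label with the larger score is predicted.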
Training Phase of Naïve Bayes Classifier
▪ In the training phase of the Naïve Bayes Classifier, we compute the prior probability of each class and the likelihood probability of each feature value given each class from the training data, as follows:
$$P(y_i) = \frac{\text{number of training examples labeled as } y_i}{\text{total number of training examples}} = \frac{n_{y_i}}{n}$$
$$P(x_j = c \mid y_i) = \frac{\text{number of training examples for which feature } x_j \text{ has value } c \text{ and labeled as } y_i}{\text{total number of training examples labeled as } y_i} = \frac{n_{x_j = c,\, y_i}}{n_{y_i}}$$
for all $x_j \in X$ (feature set), $c \in$ unique values of $x_j$, and $y_i \in$ unique values of $Y$.
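To make the counting concrete, here is a minimal Python sketch of the training phase under the assumption that each training example is a dict of categorical feature values; the function name and data layout are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate class priors P(y_i) and likelihoods P(x_j = c | y_i) by counting.

    examples: list of dicts mapping feature name -> categorical value
    labels:   list of class labels, one per example
    """
    n = len(labels)
    class_counts = Counter(labels)                          # n_{y_i}
    priors = {y: class_counts[y] / n for y in class_counts}

    # value_counts[y][feature][value] = n_{x_j = c, y_i}
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(examples, labels):
        for feature, value in x.items():
            value_counts[y][feature][value] += 1

    likelihoods = {
        y: {feature: {value: count / class_counts[y] for value, count in counts.items()}
            for feature, counts in features.items()}
        for y, features in value_counts.items()
    }
    return priors, likelihoods
```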
Testing Phase of Naïve Bayes Classifier
▪ In the test phase, for each test example $X_{test} = x_1 x_2 x_3 \ldots x_k$, the probability of each class label given the test example is computed as:
$$P(y_i \mid X_{test}) \propto P(y_i) \prod_{j=1}^{k} P(x_j \mid y_i)$$
▪ The final predicted label ($y^*$) for a given input $X$ is thus computed as:
$$y^* = \arg\max_{y_i \in Y} \; P(y_i) \prod_{j=1}^{k} P(x_j \mid y_i)$$
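Continuing the same illustrative sketch, the test phase scores every class with the learned priors and likelihoods and returns the argmax:

```python
def predict(x_test, priors, likelihoods):
    """Return y* = argmax over y of P(y) * prod_j P(x_j | y)."""
    best_label, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for feature, value in x_test.items():
            # A feature value never seen with this class gets probability 0 here;
            # this is the zero-frequency problem addressed by smoothing later on.
            score *= likelihoods[y].get(feature, {}).get(value, 0.0)
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```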
Numerical Example-I
Consider the following training set, which classifies the output variable Play Golf as Yes or No depending upon weather conditions such as Outlook, Temperature, Humidity, and Wind Status.
Using the Naïve Bayes Classifier, classify whether we can play golf on a Rainy, Cool, High-Humidity, and Windy day.
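Before working the example by hand, the same query could be posed with the helper functions sketched above; the rows below are made-up placeholders, since the slide's training table is not reproduced in this text version.

```python
# Hypothetical mini-table (NOT the slide's data) just so the call is runnable;
# replace it with the full 'play golf' training table from the slide.
golf_examples = [
    {"Outlook": "Sunny", "Temperature": "Hot",  "Humidity": "High",   "Windy": "True"},
    {"Outlook": "Rainy", "Temperature": "Cool", "Humidity": "Normal", "Windy": "False"},
    {"Outlook": "Rainy", "Temperature": "Cool", "Humidity": "High",   "Windy": "True"},
]
golf_labels = ["No", "Yes", "No"]

priors, likelihoods = train_naive_bayes(golf_examples, golf_labels)

query = {"Outlook": "Rainy", "Temperature": "Cool", "Humidity": "High", "Windy": "True"}
print(predict(query, priors, likelihoods))  # predicted Play Golf label
```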
Example 1-Solution
▪ Training Phase:
Naïve Bayes Classifier- Advantages
3. It is scalable, i.e., if a new instance is added, it is easy to adjust the class prior and likelihood probabilities.
5. It is very suitable for multi-class classification (as we do not need to apply techniques like one-vs-rest to fit multiple binary classifiers).
Naïve Bayes Classifier- Problems
▪ Problem I: Zero Frequency Problem
➢ If a particular feature value never occurs together with a particular class label in the training data, then its frequency-based probability estimate will be zero, and the whole product of probabilities becomes zero when the probabilities are multiplied. This is called the zero frequency problem.
➢ For example, in the training set of Example 1, P(Outlook = Overcast | No) = 0 because there is no training example with an Overcast outlook labeled No.
➢ To handle this zero frequency problem, we apply a smoothing technique.
Naïve Bayes Classifier- Problems (Contd..)
▪ Problem I: Zero Frequency Problem (Solution)
➢ Smoothing is a technique that handles the problem of zero probability in Naïve Bayes.
➢ In smoothing, while computing the likelihood of a feature value given a label, we add a parameter $\alpha$ to the count in the numerator and $\alpha$ times the number of distinct values of that feature to the count in the denominator, i.e.,
$$P(x_j = c \mid y_i) = \frac{n_{x_j = c,\, y_i} + \alpha}{n_{y_i} + \alpha \times v_j}$$
where $v_j$ is the number of distinct values that feature $x_j$ can take. Adding $\alpha$ ensures the probability is never 0, and adding $\alpha \times v_j$ in the denominator ensures the smoothed probabilities over all values of $x_j$ still sum to 1 (so no single probability can exceed 1).
➢ When $\alpha = 1$, it is called Laplace Smoothing (correction), and when $\alpha < 1$, it is called Lidstone Smoothing.
➢ $\alpha$ should not be taken greater than 1 because that would give too much probability mass to zero-frequency counts.
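A minimal sketch of how the earlier training function could be extended with additive smoothing; the parameter name `alpha` and the helper name are illustrative, and $v_j$ is taken as the number of distinct values observed for each feature.

```python
from collections import Counter, defaultdict

def train_naive_bayes_smoothed(examples, labels, alpha=1.0):
    """Like train_naive_bayes, but with additive (Laplace/Lidstone) smoothing."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: class_counts[y] / n for y in class_counts}

    # Distinct values observed for each feature (v_j).
    feature_values = defaultdict(set)
    for x in examples:
        for feature, value in x.items():
            feature_values[feature].add(value)

    # Raw counts n_{x_j = c, y_i}.
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(examples, labels):
        for feature, value in x.items():
            value_counts[y][feature][value] += 1

    likelihoods = {}
    for y in class_counts:
        likelihoods[y] = {}
        for feature, values in feature_values.items():
            v_j = len(values)
            counts = value_counts[y][feature]
            likelihoods[y][feature] = {
                # Unseen (value, class) pairs now get alpha / (n_y + alpha * v_j) instead of 0.
                value: (counts[value] + alpha) / (class_counts[y] + alpha * v_j)
                for value in values
            }
    return priors, likelihoods
```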
Naïve Bayes Classifier- Problems (Contd..)
▪ Problem II: Independence Assumption
➢ The Naïve Bayes Classifier is based on the Naïve assumption that the features are conditionally independent of each other given the class.
➢ But in real-world scenarios, input features are not always independent.
➢ For instance, if we have to label a person as adult or child on the basis of the person's height and weight, then the features height and weight are not independent of each other.
➢ In order to handle this problem, we can apply dimensionality reduction if the features are correlated.
➢ Due to the Naïve assumption, this classifier is most suitable for Text Classification, as the words are the features in text and these words can be treated as independent for classification.
Naïve Bayes Classifier- Problems (Contd..)
▪ Problem III: Numerical Underflow
➢ We know the likelihood part of the score is a product over all k features, $P(y_i \mid X) \propto P(y_i) \prod_{j=1}^{k} P(x_j \mid y_i)$. When k is large, multiplying many probabilities smaller than 1 can underflow to zero in floating-point arithmetic.
➢ To avoid this, we work with logarithms and maximize $\log P(y_i) + \sum_{j=1}^{k} \log P(x_j \mid y_i)$ instead of the product itself.
Naïve Bayes Classifier- Continuous Features
➢ When a feature $x_j$ is continuous, its likelihood given a class is estimated with a Gaussian (normal) distribution:
$$P(x_j \mid y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{x_j, y_i}} \exp\!\left(-\frac{(x_j - \mu_{x_j, y_i})^2}{2\sigma_{x_j, y_i}^2}\right)$$
where $\mu_{x_j, y_i}$ denotes the mean of the $x_j$ feature values labeled as $y_i$ and $\sigma_{x_j, y_i}$ is the standard deviation of the $x_j$ feature values labeled as $y_i$.
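A minimal Python sketch of the Gaussian likelihood computation under the assumptions above; the function names and the plain-list data layout are illustrative.

```python
import math

def gaussian_params(values_by_class):
    """Compute (mean, std) of one continuous feature for each class label.

    values_by_class: dict mapping class label -> list of feature values for that class
    """
    params = {}
    for y, values in values_by_class.items():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        params[y] = (mu, math.sqrt(var))
    return params

def gaussian_likelihood(x, mu, sigma):
    """P(x | y) under a normal distribution with the class-conditional mean and std."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```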
Numerical Example-II
Consider the following training set, which classifies the output variable Play Golf as Yes or No depending upon weather conditions such as Temperature and Humidity (same as Example 1, but the features are continuous instead of categorical).
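The continuous training table is likewise not reproduced in this text version, so the snippet below only illustrates the shape of the computation with made-up numbers; with the real table, the per-class means and standard deviations would come from its Temperature and Humidity columns.

```python
# Hypothetical per-class feature values (NOT the slide's data).
temperature_by_class = {"Yes": [68.0, 70.0, 72.0, 75.0], "No": [80.0, 85.0, 71.0]}
humidity_by_class    = {"Yes": [65.0, 70.0, 80.0, 75.0], "No": [90.0, 85.0, 91.0]}

temp_params = gaussian_params(temperature_by_class)
hum_params  = gaussian_params(humidity_by_class)

priors = {"Yes": 4 / 7, "No": 3 / 7}  # from the hypothetical counts above

# Score a test day with Temperature = 66 and Humidity = 90 (illustrative values).
scores = {
    y: priors[y]
       * gaussian_likelihood(66.0, *temp_params[y])
       * gaussian_likelihood(90.0, *hum_params[y])
    for y in priors
}
print(max(scores, key=scores.get))  # predicted Play Golf label
```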