Lecture - 4.2 - Continuous Data and Zero Frequency Problem in Naive Bayes Classifier
Naive Bayes performs well on categorical data compared to numeric data. So how do we perform classification using Naive Bayes when the data we have is continuous in nature?
If an instance in the test data set has a category that was not present during training, the classifier assigns it zero probability and cannot make a sensible prediction. This is known as the Zero Frequency problem, and it skews the performance of the whole classification. Every Machine Learning enthusiast should know how to tackle it when the situation arises.
In this post, we are going to discuss the workings of the Naive Bayes classifier with numeric / continuous data and the Zero Frequency problem, so that it can later be applied to a real-world dataset.
There are two ways to estimate the class-conditional probabilities for continuous attributes in Naive Bayes classifiers:
We can discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval. This approach transforms the continuous attributes into ordinal attributes. The conditional probability P(X|Y=y), where Y is the target variable, is then estimated by computing the fraction of training records belonging to class y that fall within the corresponding interval for X.
The estimation error depends on the discretization strategy as well as the number of discrete intervals. If the number of intervals is too large, there are too few training records in each interval to provide a reliable estimate for P(X|Y). On the other hand, if the number of intervals is too small, then some intervals may aggregate records from different classes and we may miss the correct decision boundary. Hence, there is no firm rule of thumb for the discretization strategy.
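To make this first approach concrete, here is a minimal sketch of how P(X|Y) would be estimated as the fraction of class-y records falling in the relevant interval. The temperature values and interval edges below are purely illustrative, not taken from the post:

import numpy as np

# Illustrative temperature values and class labels (9 "yes", 5 "no")
temps = np.array([83, 70, 68, 64, 69, 75, 75, 72, 81, 85, 80, 65, 72, 71])
play = np.array(["yes"] * 9 + ["no"] * 5)

# Hand-picked interval edges: (-inf, 60), [60, 70), [70, 80), [80, 90), [90, inf)
bins = np.array([60, 70, 80, 90])
intervals = np.digitize(temps, bins)

# P(Temperature falls in the interval containing 66 | Play = yes):
# the fraction of "yes" records whose temperature lies in that interval
target_interval = np.digitize([66], bins)[0]
p = np.mean(intervals[play == "yes"] == target_interval)
print(p)  # 3 of the 9 "yes" records fall in [60, 70), so 0.333...

With wider or narrower bins the same code gives very different estimates, which is exactly the bin-count trade-off described above.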
We can assume a certain form of probability distribution for the continuous variable
and estimate the parameters of the distribution using the training data. A Gaussian
distribution is usually chosen to represent the class-conditional probability for
continuous attributes. The distribution is characterized by two parameters, its mean
and variance.
Image 1: Gaussian distribution formula for the class-conditional probability
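For reference, the standard Gaussian (normal) density, which is presumably what Image 1 shows, is:

P(X = x | Y = y) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

where the mean μ and the variance σ² are estimated from the values of attribute X among the training records of class y.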
Now that we have established how to use a Gaussian distribution for continuous attributes, let's see how it can be used as a classifier in Machine Learning with an example:
For computing this we need the prior probabilities of the target variable Play.
The total number of instances is 14: 9 of them have the value yes and 5 have the value no.
p(yes) = 9/14
p(no) = 5/14
Image 2: Training data statistics (class-conditional means and variances for Temperature and Humidity)
In order to classify the instance x = (Outlook = sunny, Temperature = 66, Humidity = 90, Windy = True), we need to calculate the likelihood for both Play = yes and Play = no and pick the class with the larger value, as follows:
The individual attribute probabilities are multiplied because of the naive independence assumption.
For the attributes Temperature and Humidity, the probabilities can be computed using the Gaussian distribution formula in Image 1 by plugging in the mean and variance values for those attributes from Image 2.
P(sunny | yes) = 2/9
P(Temperature=66 | yes) = 0.034
P(Humidity=90 | yes) = 0.0221
P(True | yes) = 3/9
and
P(sunny | no) = 3/5
P(Temperature=66 | no) = 0.0279
P(Humidity=90 | no) = 0.0381
P(True | no) = 3/5
Multiplying everything out gives 9/14 × 2/9 × 0.034 × 0.0221 × 3/9 ≈ 0.000036 for Play = yes, and 5/14 × 3/5 × 0.0279 × 0.0381 × 3/5 ≈ 0.000137 for Play = no. Since 0.000137 > 0.000036:
Classification — NO
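Here is a minimal Python sketch of the whole calculation. The class-conditional means and standard deviations (for example, mean 73 and standard deviation 6.2 for Temperature given yes) are assumed from the standard play-tennis statistics that Image 2 appears to contain; they reproduce the densities 0.034, 0.0221, 0.0279 and 0.0381 used above:

import math

def gaussian_pdf(x, mean, std):
    # Gaussian class-conditional density (the formula from Image 1)
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

p_yes, p_no = 9 / 14, 5 / 14  # prior probabilities of Play

# Score for Play = yes: P(yes) * P(sunny|yes) * P(66|yes) * P(90|yes) * P(True|yes)
score_yes = p_yes * (2 / 9) * gaussian_pdf(66, 73.0, 6.2) * gaussian_pdf(90, 79.1, 10.2) * (3 / 9)

# Score for Play = no: P(no) * P(sunny|no) * P(66|no) * P(90|no) * P(True|no)
score_no = p_no * (3 / 5) * gaussian_pdf(66, 74.6, 7.9) * gaussian_pdf(90, 86.2, 9.7) * (3 / 5)

print(round(score_yes, 6), round(score_no, 6))  # ~0.000036 vs ~0.000137
print("Prediction:", "yes" if score_yes > score_no else "no")  # no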
Now that we have covered the handling of continuous / numeric data in the Naive Bayes classifier, let's dive into how to handle the Zero Frequency problem.
It occurs when a single condition with zero probability in the multiplication of the likelihood makes the whole probability zero. In such a case, something called the Laplace Estimator is used.
Image 3: The Laplace Estimator formula
where,
Image 4: Definitions of the parameters used in the formula in Image 3
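Image 3 most likely shows the m-estimate form of the Laplace Estimator:

P(a | c) = (nc + m·p) / (n + m)

where n is the number of training instances of class c, nc is the number of those instances that have attribute value a, m is an equivalent sample size (here, the number of distinct values of the attribute), and p is the prior estimate of the probability, taken as 1/m under the uniform distribution assumption. This form is an assumption based on how nc and m are used in the example below.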
The explanation for the formula in Image 3 can be a bit difficult to wrap your head around when seen for the first time. Let's understand it better with the help of an example:
We are going to classify another instance, this time with Outlook = overcast, using the same dataset and statistics from Image 1 and Image 2.
As before, the prior probabilities of the target variable Play are:
p(yes) = 9/14
p(no) = 5/14
P(overcast | yes) = 4/9
and
P(overcast | no) = 0/5 = 0
The rest of the values needed to calculate the likelihoods are taken from the previous example.
The product for Play = yes is 9/14 × 4/9 × 0.034 × 0.0221 × 3/9 ≈ 0.000072, while the product for Play = no is exactly 0 because of P(overcast | no). Since 0.000072 > 0:
Classification — YES
Here, it can be seen that a single conditional probability, P(overcast | no), was the driving factor in the classification. Now, let's see how we can employ the Laplace Estimator formula from Image 3 under the uniform distribution assumption.
For P(overcast | yes):
nc = 4, since 4 instances have Outlook = overcast and Play = yes,
m = 3, since the attribute Outlook has 3 unique values (sunny, overcast, rainy).
Similarly, for P(overcast | no):
nc = 0, since no instance has Outlook = overcast and Play = no,
m = 3, since the attribute Outlook has 3 unique values (sunny, overcast, rainy).
Plugging the smoothed estimates back into the likelihood products gives a positive score for both classes, and the larger one still belongs to Play = yes.
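As a quick check, here is a sketch of the smoothed calculation, assuming the m-estimate form given above with the uniform prior p = 1/m (only the zero-frequency Outlook term is smoothed here for illustration; in practice every categorical estimate would be smoothed the same way):

def laplace_estimate(nc, n, m, p):
    # (nc + m*p) / (n + m): the assumed m-estimate form of the Laplace Estimator
    return (nc + m * p) / (n + m)

p_overcast_yes = laplace_estimate(nc=4, n=9, m=3, p=1 / 3)  # 5/12 ≈ 0.417
p_overcast_no = laplace_estimate(nc=0, n=5, m=3, p=1 / 3)   # 1/8 = 0.125

score_yes = (9 / 14) * p_overcast_yes * 0.034 * 0.0221 * (3 / 9)
score_no = (5 / 14) * p_overcast_no * 0.0279 * 0.0381 * (3 / 5)

# ~0.000067 vs ~0.000028: yes still wins, but no is no longer exactly zero
print(round(score_yes, 6), round(score_no, 6))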
Classification — YES
Even though the classification did not change, we now have a better scientific reasoning behind our conclusion.
Thank you for reading. I hope this post cleared up how to handle Continuous Data and the Zero Frequency Problem in the Naive Bayes classifier. Share it if you feel it can help others. You can read more of my posts here:
tarun-gupta.medium.com