
Naive Bayes Classifier

1. Introduction
Naive Bayes is a probabilistic machine learning algorithm that can be
used in a wide variety of classification tasks. Typical applications include filtering spam, classifying documents, sentiment prediction, etc. It is based on the work of Rev. Thomas Bayes (1702-61), hence the name.

But why is it called ‘Naive’?

The name 'naive' is used because the model assumes that the features that go into it are independent of each other. That is, changing the value of one feature does not directly influence or change the value of any of the other features used in the algorithm.

Naive Bayes is a simple yet powerful algorithm. Because it is a probabilistic model, it can be coded up easily and its predictions made very quickly, quickly enough for real-time use. Because of this, it is easily scalable and is traditionally the algorithm of choice for real-world applications that are required to respond to users' requests instantaneously.

But before you go into Naive Bayes, you need to understand what 'Conditional Probability' is and what the 'Bayes Rule' is.

2. What is Conditional Probability?


Coin Toss and Fair Dice Example

When you flip a fair coin, there is an equal chance of getting either heads
or tails. So you can say the probability of getting heads is 50%.

Similarly, what would be the probability of getting a 1 when you roll a die with 6 faces? Assuming the die is fair, the probability is 1/6 = 0.166.

Playing Cards Example


If you pick a card from the deck, can you guess the probability of getting a
queen given the card is a spade?
Well, I have already set a condition that the card is a spade. So, the
denominator (eligible population) is 13 and not 52. And since there is only
one queen in spades, the probability that it is a queen given the card is a spade is 1/13 = 0.077.

This is a classic example of conditional probability. So, when you say the
conditional probability of A given B, it denotes the probability of A
occurring given that B has already occurred.

Mathematically, Conditional probability of A given B can be computed as:


P(A|B) = P(A AND B) / P(B)
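As a quick sanity check, here is a minimal Python sketch that builds a standard 52-card deck and computes this conditional probability by counting (the card encoding is just for illustration):

# Build a 52-card deck as (rank, suit) pairs and count to get P(Queen | Spade)
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']
deck = [(rank, suit) for rank in ranks for suit in suits]

spades = [card for card in deck if card[1] == 'Spades']        # eligible population: 13 cards
queen_and_spade = [card for card in spades if card[0] == 'Q']  # A AND B: 1 card

# P(Queen | Spade) = P(Queen AND Spade) / P(Spade) = (1/52) / (13/52) = 1/13
print(len(queen_and_spade) / len(spades))                      # 0.0769...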

School Example
Consider a school with a total population of 100 persons. These 100
persons can be seen either as ‘Students’ and ‘Teachers’ or as a population
of ‘Males’ and ‘Females’.
Given a tabulation of the 100 people by role and gender, what is the conditional probability that a certain member of the school is a 'Teacher' given that he is a 'Man'?

To calculate this, you may intuitively filter the sub-population of 60 males and focus on the 12 (male) teachers.
So the required conditional probability P(Teacher | Male) = 12 / 60 = 0.2.

This can be represented as the intersection of Teacher (A) and Male (B) divided by Male (B). Likewise, the conditional probability of B given A can be computed. The Bayes Rule that we use for Naive Bayes can be derived from these two notations.
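In symbols, the two notations are P(A|B) = P(A AND B) / P(B) and, likewise, P(B|A) = P(A AND B) / P(A). Rearranging the second gives P(A AND B) = P(B|A) * P(A), and substituting this into the first yields the Bayes Rule:

P(A|B) = P(B|A) * P(A) / P(B)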
3. The Bayes Rule
The Bayes Rule is a way of going from P(X|Y), known from the training
dataset, to find P(Y|X).
To do this, we replace A and B in the above formula, with the feature X and
response Y.
For observations in test or scoring data, the X would be known while Y is
unknown. And for each row of the test dataset, you want to compute the
probability of Y given the X has already happened.
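With X and Y substituted for A and B, the Bayes Rule derived above reads:

P(Y=k | X) = P(X | Y=k) * P(Y=k) / P(X)

where k is a particular class of Y.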
What happens if Y has more than 2 categories? We compute the probability of each class of Y and let the highest win.

4. The Naive Bayes


The Bayes Rule provides the formula for the probability of Y given X. But,
in real-world problems, you typically have multiple X variables.
When the features are independent, we can extend the Bayes Rule to what
is called Naive Bayes.
It is called ‘Naive’ because of the naive assumption that the X’s are
independent of each other. Regardless of its name, it’s a powerful
formula.

In technical jargon, the left-hand side (LHS) of the equation is understood as the posterior probability, or simply the posterior.
The RHS has 2 terms in the numerator.
The first term is called the 'Likelihood of Evidence'. It is nothing but the conditional probability of each X given that Y is of a particular class 'c'.
Since all the X's are assumed to be independent of each other, you can just multiply the 'likelihoods' of all the X's and call it the 'Probability of likelihood of evidence'. This is known from the training dataset by filtering records where Y=c.
The second term is called the prior, which is the overall probability of Y=c, where c is a class of Y. In simpler terms, Prior = count(Y=c) / n_Records.
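Putting these pieces together for features X1, X2, ..., Xn, the Naive Bayes formula described above can be written as:

P(Y=c | X1, ..., Xn) = [ P(X1|Y=c) * P(X2|Y=c) * ... * P(Xn|Y=c) ] * P(Y=c) / [ P(X1) * P(X2) * ... * P(Xn) ]

The denominator (the probability of evidence) is the same for every class c, so it can be dropped when you only need to know which class scores highest.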
An example is better than an hour of theory. So let’s see one.

5. Naive Bayes Example


Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’.
These are the 3 possible classes of the Y variable.
We have data for the following X variables, all of which are binary (1 or 0).
 Long
 Sweet
 Yellow
The first few rows of the training dataset look like this:
Fruit     Long (x1)   Sweet (x2)   Yellow (x3)
Orange        0            1            0
Banana        1            0            1
Banana        1            1            1
Other         1            1            0
...          ...          ...          ...
For the sake of computing the probabilities, let’s aggregate the training
data to form a counts table like this.
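To summarize the counts used in the calculations below: out of 1000 fruits there are 500 Bananas, 300 Oranges and 200 Other fruits; 500 of all the fruits are Long, 650 are Sweet and 800 are Yellow; and of the 500 Bananas, 400 are Long, 350 are Sweet and 450 are Yellow.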

So the objective of the classifier is to predict if a given fruit is a 'Banana' or 'Orange' or 'Other' when only the 3 features (long, sweet and yellow) are known.
Let's say you are given a fruit that is Long, Sweet and Yellow; can you predict what fruit it is?
This is the same as predicting the Y when only the X variables in the testing data are known. Let's solve it by hand using Naive Bayes.
The idea is to compute the 3 probabilities, that is the probability of the
fruit being a banana, orange or other. Whichever fruit type gets the highest
probability wins.
All the information to calculate these probabilities is present in the above
tabulation.
Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.
That is, the proportion of each fruit class out of all the fruits from the
population. You can provide the ‘Priors’ from prior information about the
population. Otherwise, it can be computed from the training data.
For this case, let’s compute from the training data. Out of 1000 records in
training data, you have 500 Bananas, 300 Oranges and 200 Others. So the
respective priors are 0.5, 0.3 and 0.2.
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Orange) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20
Step 2: Compute the probability of evidence that goes in the
denominator.
This is nothing but the product of P(X) for each of the X variables. This is an optional step because the denominator is the same for all the classes and so does not affect which class gets the highest probability.
P(x1=Long) = 500 / 1000 = 0.50
P(x2=Sweet) = 650 / 1000 = 0.65
P(x3=Yellow) = 800 / 1000 = 0.80
Step 3: Compute the probability of likelihood of evidence that goes in the numerator.
It is the product of the conditional probabilities of the 3 features. If you refer back to the formula, it says P(X1|Y=k). Here X1 is 'Long' and k is 'Banana'. That means the probability the fruit is 'Long' given that it is a Banana. In the above counts, you have 500 Bananas; out of those, 400 are Long. So, P(Long | Banana) = 400/500 = 0.8.
Here, I have done it for Banana alone.
Probability of Likelihood for Banana
P(x1=Long | Y=Banana) = 400 / 500 = 0.80
P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70
P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90
So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7
* 0.9 = 0.504
Step 4: Substitute all the 3 equations into the Naive Bayes formula, to get
the probability that it is a banana.
Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’.
The denominator is the same for all 3 cases, so it’s optional to compute.
Clearly, Banana gets the highest probability, so that will be our predicted
class.
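As a quick check, the same hand calculation for the Banana class can be written out in a few lines of Python (only the Banana counts are quoted above, so only that posterior is computed here):

# Hand calculation of P(Banana | Long, Sweet, Yellow) using the counts quoted above
prior_banana = 500 / 1000                            # P(Y=Banana) = 0.50
likelihood   = (400/500) * (350/500) * (450/500)     # P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) = 0.504
evidence     = (500/1000) * (650/1000) * (800/1000)  # P(Long) * P(Sweet) * P(Yellow) = 0.26

posterior_banana = likelihood * prior_banana / evidence
print(round(posterior_banana, 3))                    # ~0.969, the highest of the three classes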

6. Building Naive Bayes Classifier in Python


# Import packages
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Import data
training = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_test.csv')

# Create the X, Y, Training and Test
xtrain = training.drop('Species', axis=1)
ytrain = training.loc[:, 'Species']
xtest = test.drop('Species', axis=1)
ytest = test.loc[:, 'Species']

# Init the Gaussian Classifier
model = GaussianNB()

# Train the model
model.fit(xtrain, ytrain)

# Predict Output
pred = model.predict(xtest)

# Plot Confusion Matrix
mat = confusion_matrix(pred, ytest)
names = np.unique(pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=names, yticklabels=names)
plt.xlabel('Truth')
plt.ylabel('Predicted')
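If you also want a single summary number alongside the confusion matrix, one more line using scikit-learn's accuracy_score will do it:

# Optional: overall accuracy on the test set
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(ytest, pred))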

Advantages
 It is easy and fast to predict the class of the test data set. It
also performs well in multi-class prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.

 It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Disadvantages
 If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the 'Zero Frequency' problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the short sketch after this list).

 On the other hand, naive Bayes is also known to be a bad estimator, so its probability outputs are not to be taken too seriously.

 Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
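To see how Laplace (add-one) smoothing removes the zero, here is a minimal sketch using weather counts of the same shape as the Play-Golf example later in this document ('Overcast' never occurs together with Play=No in training):

# Weather counts among the 5 training rows where Play = No
counts = {'Sunny': 3, 'Overcast': 0, 'Rainy': 2}
n_class = sum(counts.values())   # 5 rows with Play = No
k = len(counts)                  # 3 possible weather values

# Without smoothing, P(Overcast | No) = 0, which wipes out the whole product of likelihoods
unsmoothed = {v: c / n_class for v, c in counts.items()}

# Laplace (add-one) smoothing: add 1 to every count and k to the denominator
smoothed = {v: (c + 1) / (n_class + k) for v, c in counts.items()}

print(unsmoothed['Overcast'])    # 0.0
print(smoothed['Overcast'])      # 0.125 (= 1/8), no longer zero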

Applications
 Real time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it can be used for making predictions in real time.
 Multi class Prediction: This algorithm is also well known for its multi-class prediction capability: we can predict the probability of each class of the target variable.

 Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to good results in multi-class problems and the independence rule) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).

 Recommendation System: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

When to use
 Text Classification

 When the dataset is huge

 When you have a small training set

Dataset

 Iris dataset

 Wine dataset

 Adult dataset
7. Practice Exercise: Predict Human Activity Recognition (HAR)
The objective of this practice exercise is to predict the current human activity based on physiological activity measurements across 53 different features in the HAR dataset. The training and test datasets are provided. Build a Naive Bayes model, predict on the test dataset and compute the confusion matrix.
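A minimal sketch of one possible solution, assuming the provided files have been saved locally as har_train.csv and har_test.csv and that the activity label sits in a column named 'activity' (both names are placeholders; adjust them to the actual dataset):

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Placeholder file and column names -- substitute the actual HAR files and label column
train = pd.read_csv('har_train.csv')
test = pd.read_csv('har_test.csv')

xtrain, ytrain = train.drop('activity', axis=1), train['activity']
xtest, ytest = test.drop('activity', axis=1), test['activity']

model = GaussianNB().fit(xtrain, ytrain)
pred = model.predict(xtest)

print(confusion_matrix(ytest, pred))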

8. Another Example: Playing Golf
Now suppose you want to calculate the probability of playing golf when the weather is overcast and the temperature is mild.

P(Overcast|Yes)= 4/9=0.44

Frequency Table (Weather vs Play Golf)
              Yes   No
  Sunny        2     3
  Overcast     4     0
  Rainy        3     2

Likelihood Table (Weather vs Play Golf)
              Yes    No     P(X)
  Sunny       2/9    3/9    5/14
  Overcast    4/9    0/9    4/14    P(X=Overcast) = 4/14 = 0.2857
  Rainy       3/9    2/9    5/14
  P(Y)        9/14   5/14

P(Yes)=9/14=0.6428

P(Mild|Yes)= 4/9=0.44

Frequency Table (Temperature vs Play Golf)
              Yes   No
  Hot          2     2
  Mild         4     2
  Cool         3     1

Likelihood Table (Temperature vs Play Golf)
              Yes    No     P(X)
  Hot         2/9    2/9    4/14
  Mild        4/9    2/9    6/14    P(X=Mild) = 6/14 = 0.4285
  Cool        3/9    1/9    4/14
  P(Y)        9/14   5/14

P(Yes)=9/14=0.6428

Probability of playing:
P(Play= Yes | Weather=Overcast, Temp=Mild)

= P(Weather=Overcast, Temp=Mild | Play=Yes) P(Play=Yes) / (P(Overcast) * P(Mild)) ..........(1)

P(Weather=Overcast, Temp=Mild | Play= Yes)= P(Overcast |Yes) P(Mild |Yes) ………..(2)

1. Calculate Prior Probabilities: P(Yes)= 9/14 = 0.64

2. Calculate likelihood Probabilities: P(Overcast |Yes) = 4/9 = 0.44 P(Mild |Yes) = 4/9 = 0.44

3. Put likelihood probabilities in equation (2) P(Weather=Overcast, Temp=Mild | Play= Yes) = 0.44* 0.44=0.1936

4. Calculate evidence probabilities P(Overcast)=4/14=0.2857, P(Mild)=6/14=0.4285

5. P(Overcast)*P(Mild)= 0.2857*0.4285=0.122

6. P(Play=Yes | Weather=Overcast, Temp=Mild) = (0.1936*0.64)/0.122 ≈ 1.02, i.e. effectively 1 (the value exceeds 1 slightly only because the denominator P(Overcast)*P(Mild) is itself an approximation of the joint probability of the evidence)

Similarly, you can calculate the probability of not playing:

Probability of not playing:

P(Play=No | Weather=Overcast, Temp=Mild) = P(Weather=Overcast, Temp=Mild | Play=No) P(Play=No) / (P(Overcast) * P(Mild)) ..........(3)

P(Weather=Overcast, Temp=Mild | Play= No)= P(Weather=Overcast |Play=No) P(Temp=Mild | Play=No) ………..(4)

1. Calculate Prior Probabilities: P(No)= 5/14 = 0.36

2. Calculate likelihood Probabilities: P(Overcast |No) = 0/5 = 0 P(Mild |No) = 2/5 = 0.4

3. Put likelihood probabilities in equation (4): P(Weather=Overcast, Temp=Mild | Play=No) = 0 * 0.4 = 0

4. Calculate evidence probabilities P(Overcast)=4/14=0.2857, P(Mild)=6/14=0.4285

5. P(Overcast)*P(Mild)= 0.2857*0.4285=0.122

6. P(Play=No | Weather=Overcast, Temp=Mild) = (0 * 0.36)/0.122 = 0

The probability of the 'Yes' class is higher. So you can say that if the weather is overcast and the temperature is mild, the players will play the sport.

# Assigning features and label variables
wheather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
wheather_encoded=le.fit_transform(wheather)
print("Wheather:",wheather_encoded)
# Converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
print("Temp:",temp_encoded)
print("Play:",label)
# Combining weather and temp into a single list of tuples
features=list(zip(wheather_encoded,temp_encoded))
print(features)
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features,label)

#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print("Predicted Value:", predicted)

Output
Wheather: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
[(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0), (2, 2), (2, 0), (1, 2), (2, 2), (0, 2), (0,
1), (1, 2)]
Predicted Value: [1]
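As a side note, GaussianNB treats the label-encoded features above as continuous values. For purely categorical inputs like these, scikit-learn's CategoricalNB (available from version 0.22) is arguably the more natural choice; a minimal variant using the same features and label would be:

# Alternative: CategoricalNB works directly on label-encoded categorical features;
# alpha=1.0 applies Laplace smoothing to avoid zero probabilities.
from sklearn.naive_bayes import CategoricalNB

cat_model = CategoricalNB(alpha=1.0)
cat_model.fit(features, label)
print("Predicted Value:", cat_model.predict([[0, 2]]))  # 0: Overcast, 2: Mild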
To summarize the key points about the Naive Bayes classifier:
1. The Naive Bayes classifier assumes that the features are independent of each other.
2. A Naive Bayes classifier can be trained faster than other classification algorithms.
3. A Naive Bayes classifier model can predict faster than other classification algorithms.
4. A Naive Bayes classifier model can be updated with new training data without having to rebuild the model.
5. A Naive Bayes classifier model does not involve optimization of a cost function.
6. Naive Bayes classifier training does not involve epochs.
7. A Naive Bayes classifier model does not involve solving a matrix equation.
8. When the assumption of independence of features holds, a Naive Bayes classifier model performs better than other classifiers.
9. When the assumption of independence of features holds, a Naive Bayes classifier model needs less training data.
10. A Naive Bayes classifier model performs well with categorical input variables compared to numerical input variables.
