Naïve Bayes Classification in Python
Introduction
Naïve Bayes is a classification algorithm based on Bayes’ theorem. Bayes’
theorem states that the probability of an event given some evidence (the posterior)
is equal to the prior probability of the event multiplied by the likelihood of the
evidence given the event, divided by the overall probability of the evidence. In the
context of classification, this means that we are trying to find the class that is most
likely given a set of features or attributes.
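As a quick numeric check of Bayes’ theorem, the sketch below computes a posterior from a prior and two likelihoods. The spam/word scenario and all of the numbers are made up for illustration and do not come from the dataset used in this article.

```python
# Bayes' theorem: P(class | evidence) = P(evidence | class) * P(class) / P(evidence)
# All numbers below are illustrative assumptions.
p_spam = 0.2                # prior: P(spam)
p_word_given_spam = 0.6     # likelihood: P(word | spam)
p_word_given_ham = 0.05     # likelihood: P(word | not spam)

# Probability of the evidence, via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

posterior_spam = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {posterior_spam:.3f}")   # P(spam | word) = 0.750
```

Although only 20% of messages are spam in this toy setup, seeing the word raises the probability of spam to 75%, because the word is far more likely under the spam class.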
https://medium.com/@shuv.sdr/naïve-bayes-classification-in-python-f869c2e0dbf1 1/24
4/3/25, 6:24 PM Naïve Bayes Classification in Python | by Shuvrajyoti Debroy | Medium
Naïve Bayes assumes that the features are independent of each other given the class,
meaning that, within a class, the value of one feature tells us nothing about the
value of another feature. This simplifies the calculation of the likelihood of the
features, as we can calculate the likelihood of each feature separately and then
multiply them together.
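The independence assumption can be sketched as follows: the score of each class is its prior times the product of the per-feature likelihoods, and the scores are normalized at the end. The class names, feature values, and probabilities here are hypothetical.

```python
import math

# Hypothetical priors and per-feature likelihoods P(feature value | class)
priors = {"buys": 0.4, "skips": 0.6}
likelihoods = {
    "buys":  {"age=older": 0.7, "salary=high": 0.8},
    "skips": {"age=older": 0.3, "salary=high": 0.2},
}
observed = ["age=older", "salary=high"]

# Unnormalized score: prior * product of the independent feature likelihoods
scores = {
    c: priors[c] * math.prod(likelihoods[c][f] for f in observed)
    for c in priors
}

# Normalize so the posteriors sum to 1, then pick the most probable class
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```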
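# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_recall_curve, roc_curve)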
# Read dataset
df_net = pd.read_csv('/content/Social_Network_Ads.csv')
df_net.head()
Data visualization is the process of creating graphs, charts, and other visual
representations of the data to help identify patterns, relationships, and
anomalies.
Preprocessing aims to prepare the data in a way that will enable effective analysis
and modeling and remove any biases or errors that may affect the results.
Describe data
Use the pandas describe() function to get a statistical summary of the data. It shows
the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric
column.
# Describe data
df_net.describe()
Distribution of data
Check data distribution.
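# Salary distribution
# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
sns.histplot(df_net['EstimatedSalary'], kde=True)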
Label encoding
Label encoding is a preprocessing technique in machine learning and data analysis
where categorical data is converted into numerical values, to make it compatible
with mathematical operations and models.
The categorical data is assigned an integer value, typically starting from 0, and each
unique category in the data is given a unique integer value so that the categorical
data can be treated as numerical data.
# Label encoding
le = LabelEncoder()
df_net['Gender']= le.fit_transform(df_net['Gender'])
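What the encoder does can be sketched in plain Python: collect the unique categories, sort them, and map each one to its index. scikit-learn's LabelEncoder likewise assigns integers in sorted order of the classes. The sample values below are made up.

```python
genders = ["Male", "Female", "Female", "Male", "Female"]   # made-up sample

classes = sorted(set(genders))                      # ['Female', 'Male']
mapping = {label: i for i, label in enumerate(classes)}
encoded = [mapping[g] for g in genders]

print(mapping)   # {'Female': 0, 'Male': 1}
print(encoded)   # [1, 0, 0, 1, 0]
```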
Correlation matrix
A correlation matrix is a table that summarizes the relationship between multiple
variables in a dataset. It shows the correlation coefficients between each pair of
variables, which indicate the strength and direction of the relationship between the
variables. It is useful for identifying highly correlated variables and selecting a
subset of variables for further analysis.
# Correlation matrix
df_net.corr()
sns.heatmap(df_net.corr())
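Each cell of the matrix is a correlation coefficient for one pair of columns. pandas' corr() computes the Pearson coefficient by default; here is a minimal sketch of that formula on made-up values (the helper name pearson is hypothetical):

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

age = [22, 35, 47, 51, 60]                        # made-up values
salary = [20000, 43000, 65000, 90000, 110000]     # made-up values

print(round(pearson(age, salary), 3))             # strong positive relationship
print(round(pearson(age, [-a for a in age]), 3))  # perfectly negative relationship
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.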
The data is then split into a training set and a test set. The training set is used to
fit the model and estimate its parameters, and the test set is used to evaluate its
performance on data it has not seen. This is important because a model that is
evaluated on the same data it was trained on can overfit that data and report an
overly optimistic accuracy, while performing poorly on new, unseen data.
We have split the data into 75% for training and 25% for testing.
# Split dataset into train/test sets
# X (features) and y (target) are assumed to be defined earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
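The idea behind the split can be sketched without scikit-learn: shuffle the row indices with a fixed seed, then hold out 25% of them as the test set. The helper name simple_split is hypothetical, and train_test_split's own shuffling will not select the same rows.

```python
import random

def simple_split(rows, test_size=0.25, seed=1):
    """Shuffle indices reproducibly and hold out a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(400))            # stand-in for 400 samples
train, test = simple_split(rows)
print(len(train), len(test))       # 300 100
```

Fixing the seed makes the split reproducible, which is the role random_state plays in train_test_split.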
Normalization scales the values of the variables so that they fall between 0 and
1. This is done by subtracting the minimum value of the feature from each value
and dividing the result by the range (max - min).
Standardization transforms the values of the variables so that they have a mean
of 0 and a standard deviation of 1. This is done by subtracting the mean from
each value and dividing the result by the standard deviation.
Feature scaling is usually performed before training a model, as it can improve the
performance of the model and reduce the time required to train it, and helps to
ensure that the algorithm is not biased towards variables with larger values.
# Scale dataset
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
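Both rescalings come down to one line of arithmetic each. The sketch below applies them to made-up salary values in plain Python; StandardScaler performs the standardization variant, using the population standard deviation.

```python
import statistics

salaries = [20000, 40000, 60000, 80000, 100000]   # made-up values

# Normalization (min-max): (x - min) / (max - min), result lies in [0, 1]
lo, hi = min(salaries), max(salaries)
normalized = [(x - lo) / (hi - lo) for x in salaries]

# Standardization (z-score): (x - mean) / std, result has mean 0 and std 1
mean = statistics.fmean(salaries)
std = statistics.pstdev(salaries)                 # population std (ddof=0)
standardized = [(x - mean) / std for x in salaries]

print(normalized)                                 # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])        # [-1.414, -0.707, 0.0, 0.707, 1.414]
```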
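It’s important to note that, unlike algorithms such as SVM, Gaussian Naïve Bayes has
no kernel function or regularization parameter to choose. It estimates a separate
mean and variance for each feature within each class, so it is much less sensitive to
feature scaling; here, scaling mainly keeps the features on comparable ranges.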
Pass X_train and y_train to the classifier’s fit method to train the Gaussian Naïve
Bayes model on our training data.
# Classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)
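To make a prediction, the model computes the probability of each class for a given
sample, using the priors and likelihoods estimated from the training data. The class
with the highest probability is then selected as the predicted class.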
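To make the fit-then-predict steps concrete, here is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. It is not scikit-learn's implementation (GaussianNB additionally applies variance smoothing), and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate the prior and the per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9   # small constant avoids division by zero
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(prior, means, variances):
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy, already-scaled data: [age, salary] -> purchased (0 or 1)
X = [[-1.2, -0.8], [-0.9, -1.1], [-0.5, -0.2], [0.7, 0.9], [1.1, 0.6], [1.3, 1.4]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
print(predict_gnb(model, [1.0, 1.0]), predict_gnb(model, [-1.0, -1.0]))   # 1 0
```

Summing log-probabilities instead of multiplying raw probabilities gives the same ranking of classes while avoiding numerical underflow.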
The accuracy of the model can be evaluated on a test set, which was previously held
out from the training process.
# Prediction
y_pred = classifier.predict(X_test)
# Predicted (left column) vs. actual (right column) labels
print(np.column_stack((y_pred, y_test)))
Accuracy
Accuracy is a commonly used metric for evaluating the performance of a machine
learning model. It measures the proportion of correct predictions made by the
model on a given dataset.
# Accuracy
accuracy_score(y_test, y_pred)
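The computation behind this score is simply the number of matching labels divided by the total. A minimal sketch on made-up labels and predictions:

```python
y_true = [0, 1, 1, 0, 1, 0, 0, 1]   # made-up labels
y_hat  = [0, 1, 0, 0, 1, 1, 0, 1]   # made-up predictions

correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)
print(accuracy)   # 0.75
```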
Classification report
A classification report is a summary of the performance of a classification model. It
provides several metrics for evaluating the performance of the model on a
classification task, including precision, recall, f1-score, and support.
The classification report also provides a weighted average of the individual class
scores, which takes into account the imbalance in the distribution of classes in the
dataset.
# Classification report
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')
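The difference between a plain (macro) average and the support-weighted average can be sketched on a small imbalanced example. The labels are made up, and only precision is shown; the report applies the same weighting to recall and f1-score.

```python
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # made-up, imbalanced: six 0s, four 1s
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]   # made-up predictions

def precision_for(cls):
    """Fraction of samples predicted as cls that actually belong to cls."""
    actual_of_predicted = [t for t, p in zip(y_true, y_hat) if p == cls]
    return sum(t == cls for t in actual_of_predicted) / len(actual_of_predicted)

classes = sorted(set(y_true))
support = {c: y_true.count(c) for c in classes}        # instances per class
precision = {c: precision_for(c) for c in classes}

macro_avg = sum(precision.values()) / len(classes)
weighted_avg = sum(precision[c] * support[c] for c in classes) / len(y_true)
print(support, precision)
print(round(macro_avg, 4), round(weighted_avg, 4))     # 0.6905 0.6952
```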
F1 score
F1-score is the harmonic mean of precision and recall. It provides a single score that
balances precision and recall. Support is the number of instances of each class in
the evaluation dataset.
# F1 score
print(f"F1 Score : {f1_score(y_test, y_pred)}")
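Because it is a harmonic mean, the F1 score is pulled toward the lower of the two values. A small sketch with made-up counts (the helper name f1_from_counts is hypothetical):

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# precision = 0.9, recall = 0.5: the arithmetic mean is 0.7, the harmonic mean is lower
f1 = f1_from_counts(tp=45, fp=5, fn=45)
print(round(f1, 4))   # 0.6429
```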
Confusion matrix
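A confusion matrix tabulates the actual classes against the predicted classes, giving
the counts of true negatives, false positives, false negatives, and true positives.
It provides a clear and detailed understanding of how well the model is performing
and helps to identify areas of improvement.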
# Confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
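The four cells can be counted by hand for a binary problem. The sketch below uses made-up labels and the same layout as scikit-learn's confusion_matrix, with rows for actual classes and columns for predicted classes.

```python
y_true = [0, 1, 1, 0, 1, 0, 0, 1]   # made-up labels
y_hat  = [0, 1, 0, 0, 1, 1, 0, 1]   # made-up predictions

# matrix[i][j] = number of samples with actual class i predicted as class j
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_hat):
    matrix[actual][predicted] += 1

(tn, fp), (fn, tp) = matrix
print(matrix)   # [[3, 1], [1, 3]]
```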
Precision-Recall curve
A precision-recall curve is a plot that summarizes the performance of a binary
classification model as a trade-off between precision and recall across decision
thresholds. It is useful for evaluating the model’s ability to make accurate positive
predictions while finding as many positive instances as possible. Precision and
recall are two common metrics for evaluating the performance of a classification
model.
Precision is the number of true positive predictions divided by the sum of true
positive and false positive predictions. It measures the accuracy of the positive
predictions made by the model.
Recall is the number of true positive predictions divided by the sum of true positive
and false negative predictions. It measures the ability of the model to find all positive
instances.
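# Precision-Recall curve
from sklearn.metrics import precision_recall_curve
# Assumption: the curve is computed from the predicted probabilities of the positive class
y_score = classifier.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_score)
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(recall, precision, label='Naive Bayes Classification', color='firebrick')
ax.set_title('Precision-Recall Curve')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
plt.box(False)
ax.legend();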
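Each point on a precision-recall curve comes from applying one decision threshold to the predicted probabilities. A minimal sketch with made-up labels and scores (the helper name precision_recall_at is hypothetical):

```python
y_true = [0, 0, 1, 0, 1, 1]                  # made-up labels
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]     # made-up P(class = 1)

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for every score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.2, 0.5, 0.8):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold in this example increases precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33, which is the trade-off the curve visualizes.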
AUC/ROC curve
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC)
are commonly used metrics for evaluating the performance of a binary
classification model.
A ROC curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR) for
different thresholds of the model’s prediction probabilities. The TPR is the number
of true positive predictions divided by the number of actual positive instances, while
the FPR is the number of false positive predictions divided by the number of actual
negative instances.
The AUC is the area under the ROC curve and provides a single-number metric that
summarizes the performance of the model over the entire range of possible
thresholds.
A high AUC indicates that the model is able to distinguish positive instances from
negative instances well.
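# ROC curve
from sklearn.metrics import roc_curve
# Assumption: the curve is computed from the predicted probabilities of the positive class
y_score = classifier.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(fpr, tpr, label='Naive Bayes Classification', color='firebrick')
ax.set_title('ROC Curve')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
plt.box(False)
ax.legend();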
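How the ROC points and the AUC arise can be sketched by sweeping the threshold over sorted scores and applying the trapezoidal rule. The labels and scores are made up, and the sketch assumes that all scores are distinct.

```python
y_true = [0, 0, 1, 0, 1, 1]                  # made-up labels
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]     # made-up P(class = 1)

positives = sum(y_true)
negatives = len(y_true) - positives

# Lower the threshold one sample at a time, from the highest score to the lowest
points = [(0.0, 0.0)]                        # (FPR, TPR) pairs
tp = fp = 0
for score, label in sorted(zip(scores, y_true), reverse=True):
    if label == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / negatives, tp / positives))

# Area under the curve via the trapezoidal rule over consecutive points
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points[-1], round(auc, 3))             # (1.0, 1.0) 0.889
```

The result equals 8/9, the fraction of positive-negative pairs in which the positive sample receives the higher score.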
Visualization predictions
Example
Let’s take the example of a user aged 45 with a salary of 97000 and check whether the
user is likely to purchase the insurance.
A predicted value of [1] means the model predicts that the user will purchase the
insurance.
Conclusion
Naive Bayes is a fast and simple algorithm that is widely used for text classification,
spam filtering, and sentiment analysis. It is also easy to implement and can handle