Naïve Bayes Classification in Python

The document provides a comprehensive guide on implementing Naïve Bayes classification in Python, detailing the algorithm's foundation on Bayes' theorem and the assumption of feature independence. It outlines the steps for data preparation, including importing libraries, data analysis, preprocessing, model training, and evaluation using various metrics. The guide emphasizes the importance of understanding model performance through accuracy, confusion matrices, and precision-recall curves.


4/3/25, 6:24 PM Naïve Bayes Classification in Python | by Shuvrajyoti Debroy | Medium

Naïve Bayes Classification in Python


Shuvrajyoti Debroy
11 min read · Feb 9, 2023

Machine Learning Classification Algorithm

Background Image Source: Analytics Insight

Introduction
Naive Bayes is a classification algorithm that is based on Bayes’ theorem. Bayes’
theorem states that the probability of an event given some evidence (the posterior)
is equal to the prior probability of the event multiplied by the likelihood of the
evidence given the event, divided by the overall probability of the evidence. In the
context of classification, this means that we are trying to find the class that is
most likely given a set of features or attributes.
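
As a rough illustration of the theorem, the sketch below computes a posterior probability in plain Python. The prior and the two likelihoods are made-up numbers chosen for the example; they do not come from the dataset used later.

```python
# Bayes' theorem: P(class | evidence) = P(evidence | class) * P(class) / P(evidence)
# All probabilities below are hypothetical values chosen for illustration.

def posterior(prior, likelihood, likelihood_other):
    """Posterior probability of a class in a two-class problem."""
    # P(evidence) from the law of total probability over both classes
    evidence = likelihood * prior + likelihood_other * (1 - prior)
    return likelihood * prior / evidence

p_buy = 0.3               # prior: P(buys)
p_high_given_buy = 0.8    # likelihood: P(high salary | buys)
p_high_given_not = 0.2    # likelihood: P(high salary | does not buy)

p_buy_given_high = posterior(p_buy, p_high_given_buy, p_high_given_not)
print(round(p_buy_given_high, 4))
```

Seeing the evidence raises the probability of the class from the prior of 0.3 to roughly 0.63, because the evidence is four times more likely under that class than under the other.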

https://medium.com/@shuv.sdr/naïve-bayes-classification-in-python-f869c2e0dbf1 1/24

Naive Bayes assumes that the features are independent of each other given the class,
meaning that, within a class, the value of one feature tells us nothing about the
value of another feature. This simplifies the calculation of the likelihood of the
features, as we can calculate the likelihood of each feature separately and then
multiply them together.
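
The independence assumption can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each feature follows a normal distribution within a class (the Gaussian variant used later in this article); the class priors, means, and standard deviations in `stats` are hypothetical values, not estimates from the real data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical per-class statistics: prior, then (mean, std) per feature
stats = {
    "buys":     {"prior": 0.4, "age": (45.0, 8.0), "salary": (90000.0, 30000.0)},
    "not_buys": {"prior": 0.6, "age": (33.0, 8.0), "salary": (60000.0, 25000.0)},
}

def class_scores(age, salary):
    """Prior times the product of the per-feature likelihoods, per class."""
    scores = {}
    for label, s in stats.items():
        scores[label] = (s["prior"]
                         * gaussian_pdf(age, *s["age"])
                         * gaussian_pdf(salary, *s["salary"]))
    return scores

scores = class_scores(age=45, salary=97000)
total = sum(scores.values())
# Dividing by the total turns the scores into posterior probabilities
posteriors = {label: value / total for label, value in scores.items()}
predicted = max(posteriors, key=posteriors.get)
print(predicted, posteriors)
```

Each feature contributes its own likelihood term, and the class with the largest product wins. No joint distribution over age and salary is ever estimated, which is what makes the method cheap.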

Image Source: Techleer

Implement Naïve Bayes Classification in Python


In this example, we will use the social network ads data, which records the Gender,
Age, and Estimated Salary of several users, and based on these data we will classify
whether each user would purchase the insurance or not.

Step 1: Import libraries


We need Pandas for data manipulation, NumPy for mathematical calculations, and
Matplotlib and Seaborn for visualizations. The scikit-learn (sklearn) libraries are
used for machine learning operations.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler


from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

Step 2: Import data


Download the dataset, upload it to your notebook, and read it into a pandas
DataFrame.

# Read dataset
df_net = pd.read_csv('/content/Social_Network_Ads.csv')
df_net.head()

Step 3: Data Analysis / Preprocessing


Exploratory Data Analysis (EDA) is a process of analyzing and summarizing the
main characteristics of a dataset, with the goal of gaining insight into the underlying
structure, relationships, and patterns within the data. EDA helps to identify
important features, anomalies, and trends in the data that can inform further
analysis and modeling.

EDA typically involves several key steps, including:

Data cleaning and preparation involve removing missing or incorrect values,
transforming variables, and handling outliers.

Data visualization is the process of creating graphs, charts, and other visual
representations of the data to help identify patterns, relationships, and
anomalies.

Statistical analysis involves applying mathematical and statistical methods to
the data to identify important features and relationships.

Preprocessing aims to prepare the data in a way that will enable effective analysis
and modeling and remove any biases or errors that may affect the results.

Get required data


We don’t need the User ID column so we can drop it.

# Get required data


df_net.drop(columns = ['User ID'], inplace=True)
df_net.head()

Describe data
Get a statistical description of the data using the pandas describe() function. It
shows us the count, mean, standard deviation, minimum, quartiles, and maximum of
each numeric column.

# Describe data
df_net.describe()


Distribution of data
Check data distribution.

# Salary distribution (histplot with a KDE curve; distplot is deprecated in recent seaborn versions)
sns.histplot(df_net['EstimatedSalary'], kde=True)

Label encoding
Label encoding is a preprocessing technique in machine learning and data analysis
where categorical data is converted into numerical values, to make it compatible
with mathematical operations and models.

The categorical data is assigned an integer value, typically starting from 0, and each
unique category in the data is given a unique integer value so that the categorical
data can be treated as numerical data.

# Label encoding
le = LabelEncoder()
df_net['Gender']= le.fit_transform(df_net['Gender'])

Correlation matrix
A correlation matrix is a table that summarizes the relationship between multiple
variables in a dataset. It shows the correlation coefficients between each pair of
variables, which indicate the strength and direction of the relationship between the
variables. It is useful for identifying highly correlated variables and selecting a
subset of variables for further analysis.

The correlation coefficient can range from -1 to 1, where:

A correlation coefficient of -1 indicates a perfect negative linear relationship
between two variables

A correlation coefficient of 0 indicates no linear relationship between two variables

A correlation coefficient of 1 indicates a perfect positive linear relationship
between two variables

# Correlation matrix
df_net.corr()
sns.heatmap(df_net.corr())
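
For intuition about the values in the matrix, the following sketch computes the Pearson correlation coefficient by hand, which is the default method used by pandas corr(). The three small lists are toy data chosen to produce the three reference cases.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson(x, [2, 4, 6, 8, 10])   # y rises in a straight line with x
r_negative = pearson(x, [10, 8, 6, 4, 2])   # y falls in a straight line with x
r_none = pearson(x, [4, 1, 0, 1, 4])        # symmetric curve, no linear trend

print(r_positive, r_negative, r_none)
```

The last case is worth noting: y depends on x exactly, yet the coefficient is 0, because the coefficient only measures linear relationships.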


Drop insignificant data


From the correlation matrix, we see that Gender is not correlated with the other
attributes, so we can drop it too.

# Drop Gender column


df_net.drop(columns=['Gender'], inplace=True)

Step 4: Split data


Splitting data into independent and dependent variables involves separating the
input features (independent variables) from the target variable (dependent variable).
The independent variables are used to predict the value of the dependent variable.

The data is then split into a training set and a test set, with the training set used to fit
the model and the test set used to evaluate its performance.

Independent / Dependent variables


In our data, Age and EstimatedSalary are the independent variables, assigned to X,
and Purchased is the dependent variable, y.

# Split data into dependent/independent variables


X = df_net.iloc[:, :-1].values
y = df_net.iloc[:, -1].values

Train / Test split


The data is usually divided into two parts, with the majority of the data used for
training the model and a smaller portion used for testing.

The training set is used to train the model and find the optimal parameters. The
model is then tested on the test set to evaluate its performance and determine its
accuracy. This is important because if the model is trained and tested on the same
data, it may over-fit the data and perform poorly on new, unseen data.

We have split the data into 75% for training and 25% for testing.

# Split data into test/train set
# random_state fixes the shuffle so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
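
What the split does can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: the function name `simple_train_test_split` is hypothetical, and the 400 integers stand in for row indices of a dataset.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=1):
    """Shuffle a copy of rows, then cut it into a train part and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(400))                        # hypothetical row indices
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two parts, and rerunning with the same seed gives the same split, which is the role random_state plays in train_test_split.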

 

Step 5: Feature scaling


Feature scaling is a method of transforming the values of numeric variables so that
they have a common scale, because many machine learning algorithms are sensitive to
the scale of the input features.

There are two common methods of feature scaling: normalization and
standardization.

Normalization scales the values of the variables so that they fall between 0 and
1. This is done by subtracting the minimum value of the feature and dividing by
the range (max-min).

Standardization transforms the values of the variables so that they have a mean
of 0 and a standard deviation of 1. This is done by subtracting the mean and
dividing by the standard deviation.
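
Both transformations are short enough to write out by hand. The sketch below uses plain Python and a toy list of salaries; the standardization uses the population standard deviation, which is also what StandardScaler uses.

```python
import math

def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    return [(v - mean) / std for v in values]

salaries = [20000, 40000, 60000, 80000, 100000]   # toy values
normalized = min_max_normalize(salaries)
standardized = standardize(salaries)

print(normalized)
print([round(v, 4) for v in standardized])
```

Normalization pins the smallest and largest values to 0 and 1, while standardization centers the values on 0 and leaves no fixed bounds.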

Feature scaling is usually performed before training a model, as it can improve the
performance of the model and reduce the time required to train it, and helps to
ensure that the algorithm is not biased towards variables with larger values.

# Scale dataset
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 6: Train model


Training a machine learning model involves using a training dataset to estimate the
parameters of the model. Many learning algorithms do this iteratively, updating the
model parameters to minimize a loss function, which measures the difference between
the predicted values and the actual values in the training data. Gaussian Naïve
Bayes is simpler: in a single pass over the training data it estimates the prior
probability of each class and the mean and variance of each feature within each
class.

It’s important to note that, unlike the SVM algorithm, Gaussian Naïve Bayes has no
kernel function or regularization parameter to choose, and it is largely insensitive
to feature scaling, so the scaling in Step 5 matters less here than it would for a
distance-based or margin-based model.

Pass the X_train and y_train data into the Naïve Bayes classifier model by calling
classifier.fit to train the model with our training data.

# Classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)
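
As a sketch of what fitting a Gaussian Naïve Bayes model amounts to, the plain-Python function below estimates the class priors and the per-class mean and variance of each feature. The function name `fit_gaussian_nb` and the five toy rows are hypothetical; scikit-learn's GaussianNB additionally adds a small smoothing term to the variances (its var_smoothing parameter), which this sketch omits.

```python
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)

    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))               # one tuple per feature
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        model[label] = {"prior": n / len(X), "mean": means, "var": variances}
    return model

# Toy rows of (age, salary) with a 0/1 purchase label
X = [[25, 30000], [30, 40000], [35, 50000], [45, 90000], [50, 110000]]
y = [0, 0, 0, 1, 1]

model = fit_gaussian_nb(X, y)
print(model[0])
print(model[1])
```

There is no loop over epochs and no loss function: the parameters are simple summary statistics of the training rows in each class.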

Step 7: Predict result / Score model


Once the likelihood of the features for each class is calculated, the algorithm
multiplies the likelihood by the prior probability of each class, which is estimated
from the training data. The class with the highest probability is then selected as the
predicted class.

The accuracy of the model can be evaluated on a test set, which was previously held
out from the training process.

# Prediction
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

 

Step 8: Evaluate model


Accuracy is a useful metric for assessing the performance of a model, but it can be
misleading in some cases. For example, in a highly imbalanced dataset, a model that
always predicts the majority class will have high accuracy, even though it may not be
performing well. Therefore, it is important to consider other metrics, such as
confusion matrix, precision, recall, F1-score, and ROC-AUC, along with accuracy, to
get a more complete picture of the performance of a model.

Accuracy
Accuracy is a commonly used metric for evaluating the performance of a machine
learning model. It measures the proportion of correct predictions made by the
model on a given dataset.

Accuracy is defined as the number of correct predictions divided by the total number
of predictions. This definition is the same for binary and multi-class
classification problems.

# Accuracy
accuracy_score(y_test, y_pred)
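
The computation behind the score is a single ratio, shown here in plain Python. The toy labels are invented to reproduce the imbalanced case described at the start of this step: a model that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Imbalanced toy labels: nine negatives and one positive
y_true = [0] * 9 + [1]
always_negative = [0] * 10
acc_majority = accuracy(y_true, always_negative)

# The same function works unchanged for more than two classes
acc_multi = accuracy(["cat", "dog", "bird", "cat"], ["cat", "dog", "cat", "cat"])

print(acc_majority, acc_multi)
```

The always-negative model reaches 90% accuracy without ever finding the positive instance, which is why accuracy alone can mislead on imbalanced data.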


Classification report
A classification report is a summary of the performance of a classification model. It
provides several metrics for evaluating the performance of the model on a
classification task, including precision, recall, f1-score, and support.

The classification report also provides a weighted average of the individual class
scores, which takes into account the imbalance in the distribution of classes in the
dataset.

# Classification report
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')

F1 score
F1-score is the harmonic mean of precision and recall. It provides a single score that
balances precision and recall. Support is the number of instances of each class in
the evaluation dataset.

# F1 score
print(f"F1 Score : {f1_score(y_test, y_pred)}")
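
The harmonic mean can be checked by hand from the counts of true positives, false positives, and false negatives. The counts below are hypothetical and are not the results of the model trained above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the positive class
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 4))
```

The F1 score sits between precision and recall but closer to the smaller of the two, so a model cannot score well by excelling at only one of them.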

Confusion matrix
A confusion matrix is used to evaluate the performance of a classification model. It
summarizes the model’s performance by comparing the actual class labels of the
data to the predicted class labels generated by the model.

True Positives (TP): Positive instances correctly predicted as positive.
False Positives (FP): Negative instances incorrectly predicted as positive.
True Negatives (TN): Negative instances correctly predicted as negative.
False Negatives (FN): Positive instances incorrectly predicted as negative.

It provides a clear and detailed understanding of how well the model is performing
and helps to identify areas of improvement.

# Confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
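
To read the heatmap correctly it helps to know where each count sits. The plain-Python sketch below tallies the four counts for toy labels and arranges them the way scikit-learn's confusion_matrix does for 0/1 labels: rows are actual classes, columns are predicted classes.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return [[TN, FP], [FN, TP]] for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1      # predicted positive, actually positive
            else:
                fp += 1      # predicted positive, actually negative
        else:
            if t == positive:
                fn += 1      # predicted negative, actually positive
            else:
                tn += 1      # predicted negative, actually negative
    return [[tn, fp], [fn, tp]]

# Toy labels, not the output of the model above
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)
```

The diagonal holds the correct predictions, so the accuracy is the diagonal sum divided by the total, here 8 out of 10.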

Precision-Recall curve
A precision-recall curve is a plot that summarizes the performance of a binary
classification model as a trade-off between precision and recall and is useful for
evaluating the model’s ability to make accurate positive predictions while finding as
many positive instances as possible. Precision and Recall are two common metrics
for evaluating the performance of a classification model.


Precision is the number of true positive predictions divided by the sum of true
positive and false positive predictions. It measures the accuracy of the positive
predictions made by the model.

Recall is the number of true positive predictions divided by the sum of true positive
and false negative predictions. It measures the ability of the model to find all positive
instances.

# Plot Precision-Recall Curve


y_pred_proba = classifier.predict_proba(X_test)[:,1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(recall, precision, label='Naive Bayes Classification', color = 'firebrick')
ax.set_title('Precision-Recall Curve')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
plt.box(False)
ax.legend();
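
Each point on the curve corresponds to one decision threshold applied to the predicted probabilities. The sketch below shows the trade-off for three thresholds in plain Python; the labels and scores are toy values, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # assumed convention
    recall = tp / (tp + fn)
    return precision, recall

# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

points = {t: precision_recall_at(y_true, scores, t) for t in (0.2, 0.5, 0.8)}
for threshold, (p, r) in points.items():
    print(threshold, round(p, 4), round(r, 4))
```

Raising the threshold makes the model more selective: precision climbs from 0.6 to 1.0 while recall drops from 1.0 to about 0.33.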

 

AUC/ROC curve
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC)
are commonly used metrics for evaluating the performance of a binary
classification model.

A ROC curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR) for
different thresholds of the model’s prediction probabilities. The TPR is the number
of true positive predictions divided by the number of actual positive instances, while
the FPR is the number of false positive predictions divided by the number of actual
negative instances.

The AUC is the area under the ROC curve and provides a single-number metric that
summarizes the performance of the model over the entire range of possible
thresholds.

A high AUC indicates that the model is able to distinguish positive instances from
negative instances well.

# Plot AUC/ROC curve


y_pred_proba = classifier.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(fpr, tpr, label='Naive Bayes Classification', color = 'firebrick')
ax.set_title('ROC Curve')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
plt.box(False)
ax.legend();
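
The AUC also has a direct interpretation: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. The plain-Python sketch below computes it that way on toy values; it is meant as an illustration, not as a replacement for the scikit-learn functions.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked in the right order."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

auc = roc_auc(y_true, scores)
print(round(auc, 4))
```

Eight of the nine positive/negative pairs are ordered correctly, giving an AUC of about 0.89. A model that ranks at random would score around 0.5.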


Visualizing predictions

[Figure: Prediction results on the training set]

[Figure: Prediction results on the test set]


Example
Let’s take an example user with an Age of 45 and a Salary of 97000 and check whether
the user is likely to purchase the insurance or not.

# Predict purchase with Age(45) and Salary(97000)


print(classifier.predict(sc.transform([[45, 97000]])))

A predicted value of [1] means the model expects the user to purchase the insurance.

Full Code at GitHub


You can get the full code in my GitHub repository.

Data-Science/Bayes_Theorem.ipynb at main · shuv50/Data-Science (github.com)

Conclusion
Naive Bayes is a fast and simple algorithm that is widely used for text classification,
spam filtering, and sentiment analysis. It is also easy to implement and can handle
