
Naïve Bayes Classification

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild',
                    'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Windy': [False, True, False, False, False, True, True, False, False, False,
              True, True, False, True],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
                    'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

# Encode categorical variables with one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])

# Split data into features and target
X = df_encoded.drop(columns=['Play Tennis']).values
y = df_encoded['Play Tennis'].values

# Split data into train and test sets
def train_test_split(X, y, test_size=0.2, random_state=None):
    np.random.seed(random_state)
    indices = np.random.permutation(len(X))
    n_test = int(len(X) * test_size)  # number of test samples
    X_train = X[indices[:-n_test]]
    y_train = y[indices[:-n_test]]
    X_test = X[indices[-n_test:]]
    y_test = y[indices[-n_test:]]
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Naive Bayes Classifier (Gaussian likelihoods over the one-hot features)
class NaiveBayesClassifier:
    def fit(self, X, y):
        self.X = X
        self.y = y
        self.classes = np.unique(y)
        self.parameters = []
        # For each class, store the mean and variance of every feature column
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.parameters.append([])
            for col in X_c.T:
                parameters = {"mean": col.mean(), "var": col.var()}
                self.parameters[i].append(parameters)

    def _calculate_likelihood(self, mean, var, x):
        # Gaussian density; eps guards against zero variance
        eps = 1e-4
        coeff = 1.0 / np.sqrt(2.0 * np.pi * var + eps)
        exponent = np.exp(-(np.power(x - mean, 2) / (2 * var + eps)))
        return coeff * exponent

    def _calculate_prior(self, c):
        # Prior P(c) = fraction of training samples with class c
        frequency = np.mean(self.y == c)
        return frequency

    def _classify(self, sample):
        # Unnormalized posterior = prior * product of per-feature likelihoods
        posteriors = []
        for i, c in enumerate(self.classes):
            posterior = self._calculate_prior(c)
            for feature_value, params in zip(sample, self.parameters[i]):
                likelihood = self._calculate_likelihood(params["mean"], params["var"],
                                                        feature_value)
                posterior *= likelihood
            posteriors.append(posterior)
        # Pick the class with the highest posterior
        return self.classes[np.argmax(posteriors)]

    def predict(self, X):
        y_pred = [self._classify(sample) for sample in X]
        return np.array(y_pred)

# Train the Naive Bayes Classifier
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(X_train, y_train)

# Make predictions
y_pred = nb_classifier.predict(X_test)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test)
print("Accuracy:", accuracy)

# Confusion Matrix
def confusion_matrix(y_true, y_pred, labels):
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    label_map = {label: i for i, label in enumerate(labels)}
    for i in range(len(y_true)):
        cm[label_map[y_true[i]]][label_map[y_pred[i]]] += 1
    return cm

labels = np.unique(y)
cm = confusion_matrix(y_test, y_pred, labels)
print("Confusion Matrix:")
print(cm)

# Visualization of Confusion Matrix
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Reds)
plt.title("Confusion Matrix")
plt.colorbar()
tick_marks = np.arange(len(labels))
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.ylabel('True label')
plt.xlabel('Predicted label')
for i in range(len(labels)):
    for j in range(len(labels)):
        plt.text(j, i, cm[i, j], ha="center", va="center",
                 color="white" if cm[i, j] > cm.max() / 2.0 else "black")
plt.tight_layout()
plt.show()
***************
Output:
Accuracy: 0.5
Confusion Matrix:
[[0 0]
[1 1]]
Step by Step Explanation

Data Loading:
• The dataset is created as a Python dictionary whose keys are the features Outlook, Temperature, Humidity, and Windy, plus the target variable "Play Tennis".
• The dictionary is then converted into a pandas DataFrame, df.
Data Preprocessing:
• Categorical variables are encoded with one-hot encoding using pd.get_dummies(), which converts categorical columns into dummy/indicator variables.
• In df_encoded = pd.get_dummies(df, columns=['Outlook', 'Temperature', 'Humidity', 'Windy']), the columns argument lists the categorical columns to encode. For each category in those columns, get_dummies() creates a new binary (0 or 1) column indicating whether that category is present in the original column.
• After this line, df_encoded contains the untouched "Play Tennis" column plus the one-hot encoded columns, as the sketch below illustrates.
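
A quick way to see the result (an illustrative check, not part of the original listing; the exact column order can vary across pandas versions):

print(df_encoded.columns.tolist())
# e.g. ['Play Tennis', 'Outlook_Overcast', 'Outlook_Rain', 'Outlook_Sunny',
#       'Temperature_Cool', 'Temperature_Hot', 'Temperature_Mild',
#       'Humidity_High', 'Humidity_Normal', 'Windy_False', 'Windy_True']
print(df_encoded.iloc[0])
# First row encodes Sunny / Hot / High / Windy=False with target 'No'
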
Train-Test Split:
• A custom train_test_split() function splits the dataset into training and testing sets: it shuffles the row indices with a fixed random seed, then takes the last int(len(X) * test_size) rows as the test set and the rest as the training set (see the sanity check after this list).
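
To make the split concrete (an illustrative check, not in the original listing): with 14 rows and test_size=0.2, int(14 * 0.2) = 2, so 2 rows form the test set and the remaining 12 the training set.

print(X_train.shape, X_test.shape)   # expected: (12, 10) (2, 10)
print(len(y_train), len(y_test))     # expected: 12 2
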

Naive Bayes Classifier:
• The NaiveBayesClassifier class defines methods for fitting the model and making predictions.
• In the fit() method, the mean and variance of each feature within each class are calculated and stored in the parameters attribute.
• The predict() method calls _classify() on each test sample, which computes, for each class, the prior probability times the product of the per-feature Gaussian likelihoods; the class with the highest posterior probability is chosen as the predicted class (the formula is spelled out below).
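
Concretely, this is Gaussian Naive Bayes. The likelihood that _calculate_likelihood() implements for a feature value x, given the per-class statistics mean and var stored by fit(), is

    P(x | c) = 1 / sqrt(2 * pi * var) * exp(-(x - mean)^2 / (2 * var))

(the small eps in the code guards against zero variance, which arises easily with binary one-hot features), and the unnormalized posterior used by _classify() is

    P(c | x) ∝ P(c) * P(x_1 | c) * ... * P(x_n | c)

where P(c) is the class frequency returned by _calculate_prior().
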
Training the Classifier:
• An instance of NaiveBayesClassifier is created (nb_classifier) and trained on the training data (X_train and y_train) using the fit() method.
Making Predictions:
• The trained classifier (nb_classifier) predicts labels for the test data (X_test) using the predict() method.
Accuracy Calculation:
• The accuracy of the model is calculated by comparing the predicted labels (y_pred) with the true labels (y_test) using NumPy's element-wise array comparison; the mean of the resulting boolean array is the fraction of correct predictions (worked out below).
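
With only two test samples, each prediction moves the accuracy by 0.5, so the reported score is very coarse. For the output shown above, one of the two predictions matches:

accuracy = np.mean(y_pred == y_test)
# e.g. one match out of two: mean of [True, False] = 0.5
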
Confusion Matrix:
• A confusion matrix is generated to evaluate the performance of the classifier. The confusion_matrix() function builds it from the true and predicted labels: entry (i, j) counts the samples whose true label is labels[i] and whose predicted label is labels[j] (see the reading below).
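
Reading the printed matrix with labels sorted as ['No', 'Yes'] (rows are true labels, columns are predicted labels):

                 predicted No   predicted Yes
    true No            0              0
    true Yes           1              1

That is, both test samples were truly 'Yes'; one was misclassified as 'No' and one was classified correctly, which matches the 0.5 accuracy.
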
Confusion Matrix Visualization:
• The confusion matrix is visualized using matplotlib. Each cell shows the number of samples with the corresponding true and predicted labels, and the cell's color intensity reflects that count.
• Finally, the accuracy and confusion matrix are printed, and the confusion matrix plot is displayed.

**********************
