Aiml Assignment-2

The document outlines an assignment on spam email detection using a Naïve Bayes classifier, detailing the algorithm's foundation in Bayes' Theorem. It includes steps for data preprocessing, model training, and evaluation, demonstrating high accuracy in predictions. The conclusion emphasizes the effectiveness and efficiency of the Naïve Bayes algorithm for spam detection.

AIML ASSIGNMENT-2

BHAVESH SANTOSHKUMAR TUPE


CSE A (23U104014)
Introduction

Spam emails are unsolicited messages that often contain
advertisements, phishing attempts, or malicious links. To filter spam
efficiently, we can use a Naïve Bayes classifier, a probabilistic
machine learning model based on Bayes' Theorem. It assumes that, given
the class (spam or ham), the presence of one word in an email is
independent of the presence of any other word (hence, "naïve").

Bayes’ Theorem

The classifier is based on Bayes' theorem, which states:

P(A|B) = P(B|A) × P(A) / P(B)

Where:

P(A|B): Probability that an email is spam given the words in the
email (the posterior).

P(B|A): Probability of observing those words given that the email is
spam (the likelihood).

P(A): Prior probability that an email is spam.

P(B): Probability of observing those words in any email, spam or ham.

Using this, we compute the probability of an email being spam or not
spam (ham) based on its words.
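
As a minimal sketch of how this calculation plays out, the snippet below applies Bayes' theorem to a toy message. The prior and the per-word probabilities are made-up illustrative numbers, not values estimated from the dataset, and the function name `posterior_spam` is hypothetical. Under the naïve independence assumption, the likelihood of the whole message is taken as the product of the per-word likelihoods, and P(B) is obtained by summing over both classes.

```python
import math

# Hypothetical, hand-picked probabilities for illustration only.
P_SPAM = 0.4                      # P(A): prior probability of spam
P_HAM = 1 - P_SPAM                # prior probability of ham
P_WORD_GIVEN_SPAM = {"free": 0.30, "win": 0.20, "meeting": 0.01}
P_WORD_GIVEN_HAM = {"free": 0.03, "win": 0.02, "meeting": 0.10}


def posterior_spam(words):
    """Return P(spam | words), assuming words are independent given the class."""
    # P(B|A) x P(A) per class: the prior times the product of word likelihoods.
    joint_spam = P_SPAM * math.prod(P_WORD_GIVEN_SPAM[w] for w in words)
    joint_ham = P_HAM * math.prod(P_WORD_GIVEN_HAM[w] for w in words)
    # P(B) is the sum of both joint probabilities (law of total probability).
    return joint_spam / (joint_spam + joint_ham)


print(f"P(spam | free, win) = {posterior_spam(['free', 'win']):.3f}")
print(f"P(spam | meeting)   = {posterior_spam(['meeting']):.3f}")
```

With these toy numbers, a message containing "free" and "win" gets a posterior well above 0.5 and would be labelled spam, while a message containing only "meeting" would be labelled ham.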

Algorithm

Start

Import necessary libraries (pandas, plus CountVectorizer and
MultinomialNB from sklearn)

Load the dataset (Spam SMS or Email data)

Preprocess the data (convert text to numerical vectors)

Split data into training and test sets

Train the Naïve Bayes classifier (MultinomialNB)

Predict the classification for test data

Evaluate the model using accuracy, precision, recall, and F1-score

Display sample results

End
Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Load dataset
df = pd.read_csv("emails.csv")
print(df.head())
# Rename column for consistency
df.rename(columns={'spam': 'label'}, inplace=True)

# Check class distribution


plt.figure(figsize=(6,4))
sns.countplot(x=df['label'], palette=['blue', 'red'])
plt.xticks([0, 1], ['Ham', 'Spam'])
plt.xlabel("Email Type")
plt.ylabel("Count")
plt.title("Spam vs. Ham Email Distribution")
plt.show()

# Preprocess text data


df['text'] = df['text'].str.lower().str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Generate WordClouds
spam_words = ' '.join(df[df['label'] == 1]['text'])
ham_words = ' '.join(df[df['label'] == 0]['text'])

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.title("Most Common Words in Spam Emails")
spam_wordcloud = WordCloud(width=400, height=300,
background_color='black', colormap='Reds').generate(spam_words)
plt.imshow(spam_wordcloud, interpolation='bilinear')
plt.axis("off")
plt.subplot(1,2,2)
plt.title("Most Common Words in Ham Emails")
ham_wordcloud = WordCloud(width=400, height=300,
background_color='black', colormap='Blues').generate(ham_words)
plt.imshow(ham_wordcloud, interpolation='bilinear')
plt.axis("off")

plt.show()

# Convert text to numerical features


vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train Naïve Bayes model


classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predictions
y_pred = classifier.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

# Confusion Matrix Visualization


conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues",
xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.title("Confusion Matrix")
plt.show()

# Accuracy & F1-score comparison


metrics = ['Accuracy', 'Precision (Spam)', 'Recall (Spam)',
           'F1-score (Spam)']
values = [accuracy, report['1']['precision'], report['1']['recall'],
          report['1']['f1-score']]

plt.figure(figsize=(8,5))
sns.barplot(x=metrics, y=values, palette="coolwarm")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title("Model Performance Metrics")
plt.show()

# Sample Prediction
sample_email = ["Congratulations! You've won a free car. Claim now!"]
sample_vector = vectorizer.transform(sample_email)
prediction = classifier.predict(sample_vector)

print("\nSample Email Prediction:",
      "Spam" if prediction[0] == 1 else "Ham")
print(f"\nFinal Model Accuracy: {accuracy * 100:.2f}%")
OUTPUT:

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1   Subject: the stock trading gunslinger fanny i...     1
2   Subject: unbelievable new homes made easy im ...     1
3   Subject: 4 color printing special request add...     1
4  Subject: do not have money , get software cds ...     1
Sample Email Prediction: Spam

Final Model Accuracy: 99.21%
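
Accuracy is only one of the metrics listed in the algorithm; precision and recall for the spam class come from the same confusion matrix. The sketch below shows how the three are derived from the matrix cells. The labels are made up for ten hypothetical emails and are not the assignment's results.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for ten emails (1 = spam, 0 = ham); not the assignment's results.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the emails flagged as spam, the share that were spam
recall = tp / (tp + fn)      # of the actual spam emails, the share that were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In a spam filter, low precision means legitimate mail is being flagged, while low recall means spam is getting through, so both are worth reading alongside accuracy.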


Conclusion

• Naïve Bayes is an effective algorithm for spam detection due to
its simplicity and efficiency.

• The model reaches 99.21% accuracy on the held-out test set, and
Naïve Bayes is widely used in email spam filters.

• The technique is fast and scalable, making it suitable for large
datasets.
