Email Spam Detection

The document outlines a spam detection model built on a dataset of 5,572 labeled messages, which are cleaned (lowercased, with punctuation and digits removed) and split 80/20 into training and testing sets. A K-Nearest Neighbors classifier (k = 5) is trained on the TF-IDF representation of the text and reaches an accuracy of approximately 92% on the 1,115 test messages. The model is evaluated with a classification report and a confusion matrix: ham recall and spam precision are both 1.00, but spam recall is only 0.41, so most spam messages are classified as ham.

import pandas as pd

import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("C:/Users/dhamini_eashitha/Downloads/mail_data.csv",
encoding='latin-1')
df.columns = ['label', 'message']

# Convert labels to binary (0 = ham, 1 = spam)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

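# Text Cleaning Function
def clean_text(text):
    text = text.lower()
    # Remove punctuation; re.escape keeps characters such as "\" and "]"
    # from being interpreted as regex syntax inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\d+", "", text)  # Remove numbers
    return text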

df['message'] = df['message'].apply(clean_text)
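
As a quick check of what the cleaning step does to a single message, here is a minimal standalone sketch. The sample string is made up for illustration, and the function is restated with `re.escape` around the punctuation set as a precaution; it is not taken from the dataset.

```python
import re
import string


def clean_text(text):
    # Lowercase, then strip punctuation and digits
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\d+", "", text)
    return text


# Hypothetical sample message, not a row from mail_data.csv
cleaned = clean_text("This is the 2nd time!")
print(cleaned)  # this is the nd time
```

Removing digits leaves fragments such as "nd" behind, which is why similar fragments can show up in the cleaned messages.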

print(df)

      label                                            message
0         0  go until jurong point crazy available only in ...
1         0                            ok lar joking wif u oni
2         1  free entry in  a wkly comp to win fa cup final...
3         0        u dun say so early hor u c already then say
4         0  nah i dont think he goes to usf he lives aroun...
...     ...                                                ...
5567      1  this is the nd time we have tried  contact u u...
5568      0               will ã¼ b going to esplanade fr home
5569      0  pity  was in mood for that soany other suggest...
5570      0  the guy did some bitching but i acted like id ...
5571      0                          rofl its true to its name

[5572 rows x 2 columns]

# Splitting the dataset


X_train, X_test, y_train, y_test = train_test_split(df['message'],
df['label'], test_size=0.2, random_state=42)
# Convert text into numerical representation using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the K-Nearest Neighbors (KNN) Model


model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_tfidf, y_train)

KNeighborsClassifier()
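
For context, with scikit-learn's default uniform weighting, `KNeighborsClassifier(n_neighbors=5)` predicts the most common label among the five nearest training points. The plain-Python sketch below shows only that voting step; the neighbor labels are hypothetical and the distance search is left out.

```python
from collections import Counter

# Hypothetical labels of the 5 nearest training messages (0 = ham, 1 = spam)
neighbor_labels = [0, 1, 0, 0, 1]

# Majority vote: the most frequent label among the neighbors wins
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # 0
```

With an odd k and two classes, the vote can never tie.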

# Predictions
y_pred = model.predict(X_test_tfidf)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test,
y_pred))

Accuracy: 0.9210762331838565
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96       966
           1       1.00      0.41      0.58       149

    accuracy                           0.92      1115
   macro avg       0.96      0.70      0.77      1115
weighted avg       0.93      0.92      0.91      1115
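
The report's figures can be cross-checked with a little arithmetic. The sketch below infers the confusion-matrix counts from the printed accuracy and supports. It assumes that the rounded spam precision of 1.00 means no ham message was flagged as spam; the counts are derived from the report, not read from the plotted matrix.

```python
# Figures copied from the printed report
n_test = 1115
support_ham, support_spam = 966, 149
accuracy = 0.9210762331838565

correct = round(accuracy * n_test)  # correctly classified messages
errors = n_test - correct           # misclassified messages

# Assumption: spam precision 1.00 implies zero false positives
fp = 0
fn = errors - fp         # spam predicted as ham
tp = support_spam - fn   # spam predicted as spam
tn = support_ham - fp    # ham predicted as ham

spam_recall = tp / (tp + fn)
ham_precision = tn / (tn + fn)

print(tn, fp, fn, tp)            # 966 0 88 61
print(round(spam_recall, 2))     # 0.41
print(round(ham_precision, 2))   # 0.92
```

Under that assumption, all 88 errors are spam messages classified as ham, which matches the low spam recall in the report.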

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Example Prediction
sample_email = ["Congratulations! You've won a free iPhone. Click here "
                "to claim your prize."]
sample_email_tfidf = vectorizer.transform(sample_email)
prediction = model.predict(sample_email_tfidf)
print("Spam" if prediction[0] == 1 else "Ham")

Ham
