
# Email Spam Classifier

![](https://media.giphy.com/media/KxlbRn0HuTW7gZID83/giphy.gif)

#### The objective is to develop a machine learning model that can categorize emails into two categories: spam and non-spam (often referred to as "ham").

#### This model will help us filter out unwanted and potentially harmful emails from our inbox.

#### We will follow standard data science procedures, including data loading, preprocessing, feature extraction, model training, evaluation, and prediction, to achieve this goal.

#### Let's begin building our email spam detector!

## Importing Necessary Libraries

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Load and Explore the Dataset

# Load the dataset
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='ISO-8859-1')

# Display the first few rows of the dataset
df.head()
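The Kaggle export of this dataset usually ships with a few empty "Unnamed" columns alongside `v1` (the label) and `v2` (the message text). As a quick, optional sanity check (the exact column layout is an assumption about this particular CSV), we can keep only the two columns we need and look at the shape:

# Optional: keep only the label and text columns (names assumed from the Kaggle CSV)
df = df[['v1', 'v2']]
print(df.shape)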

## Data Preprocessing

# Display the column names of the DataFrame
print(df.columns)

# Convert the labels to binary values: 'spam' -> 0, 'ham' -> 1
# (note this encoding; downstream label names must follow the same order)
df['v1'] = df['v1'].map({'spam': 0, 'ham': 1})

# Split the data into features (X) and target (Y)
X = df["v2"]
Y = df["v1"]

# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.35, random_state=3)
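The SMS Spam Collection is heavily skewed toward ham, so it is worth checking the class distribution; optionally, the split can be stratified so that both sets keep the same spam/ham ratio. A minimal sketch (the stratify argument is a standard train_test_split option, not used in the original notebook):

# Inspect class balance (after the mapping above, 0 = spam, 1 = ham)
print(Y.value_counts(normalize=True))

# Optional: stratified split preserves the spam/ham ratio in train and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.35, random_state=3, stratify=Y)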

## Feature Extraction - TF-IDF

# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = tfidf_vectorizer.fit_transform(X_train)
X_test_features = tfidf_vectorizer.transform(X_test)
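TF-IDF turns each message into a sparse vector of term weights, so every row is one message and every column one vocabulary term. A quick inspection of what the model will actually see (purely informational):

# Rows = messages, columns = terms in the learned vocabulary
print(X_train_features.shape)
print("Vocabulary size:", len(tfidf_vectorizer.vocabulary_))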

## Model Training (Random Forest)

# Model training
model = RandomForestClassifier(n_estimators=100, random_state=3)
model.fit(X_train_features, Y_train)

## Model Evaluation (Random Forest)

# Predictions and accuracy on the training data
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

# Predictions and accuracy on the test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

# Print accuracy
print('Accuracy on training data: {:.2f} %'.format(accuracy_on_training_data * 100))
print('Accuracy on test data: {:.2f} %'.format(accuracy_on_test_data * 100))
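Accuracy from a single split can be optimistic on an imbalanced dataset, so a cross-validated estimate is a useful complement. The sketch below (an addition, not part of the original notebook) wraps the vectorizer and classifier in a scikit-learn Pipeline so TF-IDF is refit inside each fold:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
])

# 5-fold cross-validated accuracy over the raw messages
scores = cross_val_score(pipeline, X, Y, cv=5, scoring="accuracy")
print("CV accuracy: {:.2f} % (+/- {:.2f})".format(scores.mean() * 100, scores.std() * 100))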

## Confusion Matrix Visualization (Random Forest Classifier)

# Confusion Matrix Visualization
conf_matrix = confusion_matrix(Y_test, prediction_on_test_data)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=['Spam', 'Ham'], yticklabels=['Spam', 'Ham'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

## Classification Report (Random Forest Classifier)

classification_rep = classification_report(Y_test, prediction_on_test_data,
                                           target_names=['Spam', 'Ham'])

print("Classification Report:")
print(classification_rep)

## Feature Importance Visualization (Random Forest)

feature_importance = model.feature_importances_
feature_names = tfidf_vectorizer.get_feature_names_out()

sorted_idx = np.argsort(feature_importance)[-20:]  # Top 20 important features

plt.figure(figsize=(10, 6))
plt.barh(range(len(sorted_idx)), feature_importance[sorted_idx], align="center")
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 20 Important Features (Random Forest)")
plt.show()
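The same ranking can also be printed as plain text, which is convenient when running outside a notebook (optional):

# Print the top 20 terms and their importances, highest first
for i in reversed(sorted_idx):
    print("{:<15s} {:.4f}".format(feature_names[i], feature_importance[i]))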

## Make Predictions on New Input (Random Forest Classifier)

# Classify a new message with the trained model
input_your_mail = "Keep yourself safe for me because I need you and I miss you already and I envy everyone that see's you in real life"

input_data_features = tfidf_vectorizer.transform([input_your_mail])
prediction = model.predict(input_data_features)

if prediction[0] == 1:
    print("Ham Mail")
else:
    print("Spam Mail")
