
# Email Spam Classifier

![](https://media.giphy.com/media/KxlbRn0HuTW7gZID83/giphy.gif)

#### The objective is to develop a machine learning model that can categorize emails into two categories: spam and non-spam (often referred to as "ham").

#### This model will help us filter out unwanted and potentially harmful emails from our inbox.

#### We will follow standard data science procedures, including data loading, preprocessing, feature extraction, model training, evaluation, and prediction, to achieve this goal.

#### Let's begin building our email spam detector!

## Importing Necessary Libraries

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Load and Explore the Dataset

# Load the dataset
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='ISO-8859-1')

# Display the first few rows of the dataset
df.head()
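The Kaggle export of this dataset usually ships with a few empty "Unnamed" columns alongside `v1` (the label) and `v2` (the message text). As a quick, optional sanity check (the exact column layout is an assumption about this particular CSV), we can keep only the two columns we need and look at the shape:

# Optional: keep only the label and text columns (names assumed from the Kaggle CSV)
df = df[['v1', 'v2']]
print(df.shape)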

## Data Preprocessing

# Display the column names of the DataFrame
print(df.columns)

# Convert the labels to binary values: 'spam' -> 0, 'ham' -> 1
# (note this encoding; downstream label names must follow the same order)
df['v1'] = df['v1'].map({'spam': 0, 'ham': 1})

# Split the data into features (X) and target (Y)
X = df["v2"]
Y = df["v1"]

# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.35, random_state=3)
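The SMS Spam Collection is heavily skewed toward ham, so it is worth checking the class distribution; optionally, the split can be stratified so that both sets keep the same spam/ham ratio. A minimal sketch (the stratify argument is a standard train_test_split option, not used in the original notebook):

# Inspect class balance (after the mapping above, 0 = spam, 1 = ham)
print(Y.value_counts(normalize=True))

# Optional: stratified split preserves the spam/ham ratio in train and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.35, random_state=3, stratify=Y)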

## Feature Extraction - TF-IDF

# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = tfidf_vectorizer.fit_transform(X_train)
X_test_features = tfidf_vectorizer.transform(X_test)
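TF-IDF turns each message into a sparse vector of term weights, so every row is one message and every column one vocabulary term. A quick inspection of what the model will actually see (purely informational):

# Rows = messages, columns = terms in the learned vocabulary
print(X_train_features.shape)
print("Vocabulary size:", len(tfidf_vectorizer.vocabulary_))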

## Model Training (Random Forest)

# Model training
model = RandomForestClassifier(n_estimators=100, random_state=3)
model.fit(X_train_features, Y_train)

## Model Evaluation (Random Forest)

# Predictions and accuracy on the training data
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

# Predictions and accuracy on the test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

# Print accuracy
print('Accuracy on training data: {:.2f} %'.format(accuracy_on_training_data * 100))
print('Accuracy on test data: {:.2f} %'.format(accuracy_on_test_data * 100))
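Accuracy from a single split can be optimistic on an imbalanced dataset, so a cross-validated estimate is a useful complement. The sketch below (an addition, not part of the original notebook) wraps the vectorizer and classifier in a scikit-learn Pipeline so TF-IDF is refit inside each fold:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
])

# 5-fold cross-validated accuracy over the raw messages
scores = cross_val_score(pipeline, X, Y, cv=5, scoring="accuracy")
print("CV accuracy: {:.2f} % (+/- {:.2f})".format(scores.mean() * 100, scores.std() * 100))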

## Confusion Matrix Visualization (Random Forest Classifier)

# Confusion Matrix Visualization
conf_matrix = confusion_matrix(Y_test, prediction_on_test_data)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=['Spam', 'Ham'], yticklabels=['Spam', 'Ham'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

## Classification Report (Random Forest Classifier)

classification_rep = classification_report(Y_test, prediction_on_test_data,
                                           target_names=['Spam', 'Ham'])

print("Classification Report:")
print(classification_rep)

## Feature Importance Visualization (Random Forest)

feature_importance = model.feature_importances_
feature_names = tfidf_vectorizer.get_feature_names_out()

sorted_idx = np.argsort(feature_importance)[-20:]  # Top 20 important features

plt.figure(figsize=(10, 6))
plt.barh(range(len(sorted_idx)), feature_importance[sorted_idx], align="center")
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 20 Important Features (Random Forest)")
plt.show()
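The same ranking can also be printed as plain text, which is convenient when running outside a notebook (optional):

# Print the top 20 terms and their importances, highest first
for i in reversed(sorted_idx):
    print("{:<15s} {:.4f}".format(feature_names[i], feature_importance[i]))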

## Make Predictions on New Input (Random Forest Classifier)

# Classify a new message with the trained model
input_your_mail = "Keep yourself safe for me because I need you and I miss you already and I envy everyone that see's you in real life"

input_data_features = tfidf_vectorizer.transform([input_your_mail])
prediction = model.predict(input_data_features)

if prediction[0] == 1:
    print("Ham Mail")
else:
    print("Spam Mail")
