
Spam Email Classifier

This project aims to classify emails as spam or not spam using machine learning algorithms. The group used XGBoost, Naive Bayes, Random Forest, and SVM classifiers on a dataset of spam and non-spam emails. Data cleaning and preprocessing were performed, including removing HTML, lowercasing, removing stop words and punctuation, and stemming words. XGBoost achieved the best results with 98.62% accuracy, 97.47% recall, and 98.18% precision. The group created a Python function to classify new emails using the Naive Bayes or XGBoost models. This spam filtering has applications for email services and maintaining business communications.



TE MINIPROJECT

PROJECT TITLE - SPAM EMAIL CLASSIFIER
GROUP MEMBERS-
VINEET IYER 118A1029
ABHISHEK JOSHI 118A1030
VISHAK KODETHUR 118A1033
TUSHANT GOKHE 118A1024
ABOUT OUR PROJECT

In this project we classify whether an email is spam or not using Machine Learning. The primary algorithms used in our project are the XGBoost classifier and Naive Bayes; we also evaluated Random Forest, Multinomial Naive Bayes, and Support Vector Machine. The one used in our project is XGBoost, chosen on the basis of its precision, recall, and F1 scores.
What is Machine Learning?

Machine Learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

It provides systems the ability to automatically learn and improve from previous experience without being programmed.

Thus it helps our project predict whether an email is spam or not.
CLASSIFICATION OF MACHINE LEARNING
ALGORITHMS

1] Supervised Machine Learning - a type of Machine Learning in which machines are trained using well-"labelled" training data. On the basis of that data, machines predict the output.

2] Unsupervised Machine Learning - here models are not supervised using a labelled training dataset. Instead, the model itself finds hidden patterns and insights in the given data.

3] Reinforcement Learning - here the output depends on the state of the current input, and the next input depends on the output of the previous input.
Random Forest Classifier(Supervised Learning)

It is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their predictions (by majority vote) to improve accuracy.
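As a minimal illustration (not the group's actual code), such a forest can be trained with scikit-learn; the two feature columns and labels below are invented toy data for the example:

```python
# Minimal sketch: a random forest (an ensemble of decision trees trained
# on random subsets of the data) fitted to invented toy word-count features.
from sklearn.ensemble import RandomForestClassifier

# Toy features: counts of two hypothetical trigger words per email.
X = [[5, 0], [4, 1], [0, 3], [1, 4]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
prediction = forest.predict([[6, 0]])  # an email heavy in trigger words
```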
XGBoost Classifier(Supervised Learning Algorithm)

It is one of the most popular and efficient implementations of the gradient boosted trees algorithm.

Why is XGBoost fast?

It stores the calculated gradient statistics in CPU cache-friendly structures, which keeps the necessary calculations fast.
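To make those "calculated gradients" concrete: for binary classification with logistic loss, gradient-boosted trees compute a first derivative (gradient) and second derivative (Hessian) of the loss per training example each boosting round, and these are the statistics XGBoost caches. A small standard-library sketch of that computation (our own illustration, not XGBoost internals):

```python
import math

def grad_hess(pred_logit, label):
    """Gradient and Hessian of logistic loss w.r.t. the raw prediction."""
    p = 1.0 / (1.0 + math.exp(-pred_logit))  # predicted probability (sigmoid)
    return p - label, p * (1.0 - p)

g, h = grad_hess(0.0, 1)  # at logit 0, p = 0.5: g = -0.5, h = 0.25
```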
Multinomial Naive Bayes(Supervised Learning)
Using sklearn

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts; however, in practice, fractional counts such as tf-idf may also work.
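A minimal sketch of that scikit-learn setup, with invented toy emails (word counts from CountVectorizer fed into MultinomialNB):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 1 = spam, 0 = non-spam.
emails = ["win free money now", "free prize win",
          "meeting agenda attached", "project meeting notes"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()  # turns each email into a vector of word counts
model = MultinomialNB().fit(vec.fit_transform(emails), labels)
pred = model.predict(vec.transform(["free money prize"]))
```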
Manually Coded Naive Bayes(Supervised Learning)

We created a vocabulary of the 10,000 most commonly occurring words after data cleaning was done.

Then we calculated the probabilities of these words in the complete dataset, in spam emails, and in non-spam emails separately.

Then we found the posterior probability of each word for spam and non-spam emails using Bayes' rule.
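The slide's formula image is not reproduced here; the computation described is the standard Bayes rule, sketched below with Laplace smoothing (the add-one terms are our assumption, since unseen words would otherwise get zero probability):

```python
def posterior_spam(count_in_spam, total_spam_words,
                   count_in_ham, total_ham_words,
                   p_spam=0.5, vocab_size=10000):
    """P(spam | word) via Bayes rule with add-one (Laplace) smoothing."""
    p_word_spam = (count_in_spam + 1) / (total_spam_words + vocab_size)
    p_word_ham = (count_in_ham + 1) / (total_ham_words + vocab_size)
    joint_spam = p_word_spam * p_spam
    return joint_spam / (joint_spam + p_word_ham * (1 - p_spam))
```

For example, a word seen 90 times in spam and 10 times in non-spam (out of 1,000 words in each class) gets a posterior of about 0.89.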
DataSet Details

Source: SpamAssassin Public Corpus (https://spamassassin.apache.org/old/publiccorpus/)

Data Format:

Separate folders for spam and non-spam emails.

The emails are documents consisting of the sender's information and the mail history of replies/forwards.

Some emails also contain HTML, which has to be cleaned.
DataSet Cleaning (5 steps)

1) Removal of HTML tags (using BeautifulSoup)
2) Converting words to lowercase and tokenising them into a list of separate words.
3) Removing all stop words, numbers, special characters, and punctuation marks.
4) Stemming each word to its root (using PorterStemmer)
5) Creating the vocabulary (10,000 words)
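A standard-library-only sketch of steps 1-4 (the project itself uses BeautifulSoup and NLTK's PorterStemmer; the tag-stripping regex, stop-word list, and suffix rule below are simplified stand-ins):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "this"}

def clean_email(text):
    text = re.sub(r"<[^>]+>", " ", text)          # 1) strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # 2) lowercase + tokenise (also drops numbers/punctuation)
    tokens = [t for t in tokens if t not in STOP_WORDS]     # 3) drop stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 4) crude stemming stand-in

cleaned = clean_email("<p>Winning FREE prizes</p>")
```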
Training the Model

We split the dataset into training and testing data in the ratio 7:3.

Then we find the probabilities of all the words in our vocabulary in three different forms:

1) Probability of the word throughout the dataset.
2) Probability of the word in spam emails.
3) Probability of the word in non-spam emails.
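The three probability tables can be sketched on invented toy tokens (no smoothing shown here):

```python
from collections import Counter

def word_probs(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: c / total for word, c in counts.items()}

spam_tokens = ["free", "win", "free"]      # toy tokens from spam emails
ham_tokens = ["meeting", "notes", "free"]  # toy tokens from non-spam emails

p_all = word_probs(spam_tokens + ham_tokens)  # 1) whole dataset
p_spam = word_probs(spam_tokens)              # 2) spam emails
p_ham = word_probs(ham_tokens)                # 3) non-spam emails
```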
Testing the Model

We test the model by finding the probability of each email being spam and non-spam using the Naive Bayes algorithm.

Classification of an email as spam/non-spam is determined by comparing the above two probabilities.
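That comparison can be sketched in log space (summing log-probabilities avoids numeric underflow when multiplying many small word probabilities; the floor value for unseen words is our assumption):

```python
import math

def classify(tokens, p_word_spam, p_word_ham, prior_spam=0.5, floor=1e-6):
    """Return True (spam) if the spam log-probability wins the comparison."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for t in tokens:
        log_spam += math.log(p_word_spam.get(t, floor))
        log_ham += math.log(p_word_ham.get(t, floor))
    return log_spam > log_ham

verdict = classify(["free", "win"],
                   {"free": 0.4, "win": 0.3},    # toy per-class word probabilities
                   {"free": 0.05, "win": 0.01})
```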
Scores of various models

Algorithm               Accuracy   Recall Score   Precision Score   F1 Score

XGBoost                 98.62%     97.47%         98.18%            97.83%
Random Forest           97.64%     93.86%         98.67%            96.21%
Naive Bayes (Manual)    98.14%     98.80%         96.8%             97%
Naive Bayes (sklearn)   94.36%     83.03%         99.14%            90.37%
SVM                     88.73%     65.70%         98.38%            78.79%


Python Function
Parameters:

data: a string containing the contents of the email.

mode: Default: mode=2. Used when data contains only the email content; otherwise the data is considered to contain the sender information and mail history as well.

classifier:

(Default) classifier='manual': only the manual Naive Bayes model is used to classify.

classifier='xgb': only the XGBoost model is used to classify.

Returns: Boolean: True if the email is spam, False otherwise.
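A sketch of this interface follows; the body is a toy keyword check standing in for the trained models (the real function would load the manual Naive Bayes or XGBoost model), and the header-stripping rule for mode=1 is purely illustrative:

```python
def is_spam(data, mode=2, classifier="manual"):
    """Toy stand-in for the described function: True if the email is spam."""
    # classifier selection ('manual' vs 'xgb') is omitted in this toy sketch.
    if mode != 2:
        # Illustrative only: drop quoted mail-history lines before scoring.
        data = "\n".join(line for line in data.splitlines()
                         if not line.startswith(">"))
    trigger_words = {"free", "winner", "prize"}  # invented trigger list
    return len(set(data.lower().split()) & trigger_words) >= 2
```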


Future Scope

1] Our project can help filter out spam messages received in email.

2] It can help maintain proper business communications.

3] It can also be used in various education sectors.


THANK YOU
