
Spam Email Classifier

This project aims to classify emails as spam or not spam using machine learning algorithms. The group used XGBoost, Naive Bayes, Random Forest, and SVM classifiers on a dataset of spam and non-spam emails. Data cleaning and preprocessing were performed, including removing HTML, lowercasing, removing stop words and punctuation, and stemming words. XGBoost achieved the best results with 98.62% accuracy, 97.47% recall, and 98.18% precision. The group created a Python function to classify new emails using the Naive Bayes or XGBoost models. This spam filtering has applications for email services and maintaining business communications.



TE MINIPROJECT

PROJECT TITLE - SPAM EMAIL CLASSIFIER
GROUP MEMBERS-
VINEET IYER 118A1029
ABHISHEK JOSHI 118A1030
VISHAK KODETHUR 118A1033
TUSHANT GOKHE 118A1024
ABOUT OUR PROJECT

In this project we classify whether an email is spam or not using Machine Learning. The primary algorithms used in our project are the XGBoost classifier and Naive Bayes; we also evaluated Random Forest, Multinomial Naive Bayes, and Support Vector Machine. The one used in our project is XGBoost, chosen on the basis of its precision, recall, and F1 scores.
What is Machine Learning?

Machine Learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

It provides systems the ability to automatically learn and improve from previous experience without being programmed.

Thus it helps our project predict whether an email is spam or not.
CLASSIFICATION OF MACHINE LEARNING
ALGORITHMS

1] Supervised Machine Learning - a type of Machine Learning in which machines are trained using well-"labelled" training data. On the basis of that data, machines predict the output.

2] Unsupervised Machine Learning - here models are not supervised using a labelled training dataset. Instead, the model itself finds hidden patterns and insights in the given data.

3] Reinforcement Learning - here the output depends on the state of the current input, and the next input depends on the output of the previous input.
Random Forest Classifier(Supervised Learning)

It is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their predictions (by majority vote) to improve accuracy.
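As a minimal illustration (not the group's actual code), such a forest can be trained with scikit-learn; the two feature columns and labels below are invented toy data for the example:

```python
# Minimal sketch: a random forest (an ensemble of decision trees trained
# on random subsets of the data) fitted to invented toy word-count features.
from sklearn.ensemble import RandomForestClassifier

# Toy features: counts of two hypothetical trigger words per email.
X = [[5, 0], [4, 1], [0, 3], [1, 4]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
prediction = forest.predict([[6, 0]])  # an email heavy in trigger words
```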
XGBoost Classifier(Supervised Learning Algorithm)

It is one of the most popular and efficient implementations of the gradient boosted trees algorithm.

Why is XGBoost fast?

It stores the calculated gradient statistics in CPU cache-friendly structures, which keeps the necessary calculations fast.
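To make those "calculated gradients" concrete: for binary classification with logistic loss, gradient-boosted trees compute a first derivative (gradient) and second derivative (Hessian) of the loss per training example each boosting round, and these are the statistics XGBoost caches. A small standard-library sketch of that computation (our own illustration, not XGBoost internals):

```python
import math

def grad_hess(pred_logit, label):
    """Gradient and Hessian of logistic loss w.r.t. the raw prediction."""
    p = 1.0 / (1.0 + math.exp(-pred_logit))  # predicted probability (sigmoid)
    return p - label, p * (1.0 - p)

g, h = grad_hess(0.0, 1)  # at logit 0, p = 0.5: g = -0.5, h = 0.25
```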
Multinomial Naive Bayes(Supervised Learning)
Using sklearn

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts; however, in practice, fractional counts such as tf-idf may also work.
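A minimal sketch of that scikit-learn setup, with invented toy emails (word counts from CountVectorizer fed into MultinomialNB):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 1 = spam, 0 = non-spam.
emails = ["win free money now", "free prize win",
          "meeting agenda attached", "project meeting notes"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()  # turns each email into a vector of word counts
model = MultinomialNB().fit(vec.fit_transform(emails), labels)
pred = model.predict(vec.transform(["free money prize"]))
```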
Manually Coded Naive Bayes(Supervised Learning)

We created a vocabulary of the 10,000 most commonly occurring words after data cleaning was done.

Then we calculated the probabilities of these words in the complete dataset, in spam emails, and in non-spam emails separately.

Then we found the posterior probability of each word for spam and non-spam emails using Bayes' rule.
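The slide's formula image is not reproduced here; the computation described is the standard Bayes rule, sketched below with Laplace smoothing (the add-one terms are our assumption, since unseen words would otherwise get zero probability):

```python
def posterior_spam(count_in_spam, total_spam_words,
                   count_in_ham, total_ham_words,
                   p_spam=0.5, vocab_size=10000):
    """P(spam | word) via Bayes rule with add-one (Laplace) smoothing."""
    p_word_spam = (count_in_spam + 1) / (total_spam_words + vocab_size)
    p_word_ham = (count_in_ham + 1) / (total_ham_words + vocab_size)
    joint_spam = p_word_spam * p_spam
    return joint_spam / (joint_spam + p_word_ham * (1 - p_spam))
```

For example, a word seen 90 times in spam and 10 times in non-spam (out of 1,000 words in each class) gets a posterior of about 0.89.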
DataSet Details

Source: SpamAssassin Public Corpus (https://spamassassin.apache.org/old/publiccorpus/)

Data Format:

Separate folders for spam and non-spam emails.

The emails are documents consisting of the sender's information and the mail history of replies/forwards.

Some emails also contain HTML, which has to be cleaned.
DataSet Cleaning (5 steps)

1) Removal of HTML tags (using BeautifulSoup)
2) Converting words to lowercase and tokenising them into a list of separate words.
3) Removing all stop words, numbers, special characters, and punctuation marks.
4) Stemming each word to its root (using PorterStemmer)
5) Creating the vocabulary (10,000 words)
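A standard-library-only sketch of steps 1-4 (the project itself uses BeautifulSoup and NLTK's PorterStemmer; the tag-stripping regex, stop-word list, and suffix rule below are simplified stand-ins):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "this"}

def clean_email(text):
    text = re.sub(r"<[^>]+>", " ", text)          # 1) strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # 2) lowercase + tokenise (also drops numbers/punctuation)
    tokens = [t for t in tokens if t not in STOP_WORDS]     # 3) drop stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 4) crude stemming stand-in

cleaned = clean_email("<p>Winning FREE prizes</p>")
```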
Training the Model

We split the dataset into training and testing data in the ratio 7:3.

Then we find the probabilities of all the words in our vocabulary in three different forms:

1) Probability of the word throughout the dataset.
2) Probability of the word in spam emails.
3) Probability of the word in non-spam emails.
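The three probability tables can be sketched on invented toy tokens (no smoothing shown here):

```python
from collections import Counter

def word_probs(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: c / total for word, c in counts.items()}

spam_tokens = ["free", "win", "free"]      # toy tokens from spam emails
ham_tokens = ["meeting", "notes", "free"]  # toy tokens from non-spam emails

p_all = word_probs(spam_tokens + ham_tokens)  # 1) whole dataset
p_spam = word_probs(spam_tokens)              # 2) spam emails
p_ham = word_probs(ham_tokens)                # 3) non-spam emails
```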
Testing the Model

We test the model by finding the probability of each email being spam and non-spam using the Naive Bayes algorithm.

Classification of an email as spam/non-spam is determined by comparing the above two probabilities.
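That comparison can be sketched in log space (summing log-probabilities avoids numeric underflow when multiplying many small word probabilities; the floor value for unseen words is our assumption):

```python
import math

def classify(tokens, p_word_spam, p_word_ham, prior_spam=0.5, floor=1e-6):
    """Return True (spam) if the spam log-probability wins the comparison."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for t in tokens:
        log_spam += math.log(p_word_spam.get(t, floor))
        log_ham += math.log(p_word_ham.get(t, floor))
    return log_spam > log_ham

verdict = classify(["free", "win"],
                   {"free": 0.4, "win": 0.3},    # toy per-class word probabilities
                   {"free": 0.05, "win": 0.01})
```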
Scores of various models

Algorithm               Accuracy   Recall Score   Precision Score   F1 Score

XGBoost                 98.62%     97.47%         98.18%            97.83%
Random Forest           97.64%     93.86%         98.67%            96.21%
Naive Bayes (Manual)    98.14%     98.80%         96.8%             97%
Naive Bayes (sklearn)   94.36%     83.03%         99.14%            90.37%
SVM                     88.73%     65.70%         98.38%            78.79%


Python Function
Parameters:

data: a string containing the contents of the email.

mode: Default: mode=2. Used when data contains only the email content; otherwise the data is considered to contain the sender information and mail history as well.

classifier:

(Default) classifier='manual': only the manual Naive Bayes model is used to classify.

classifier='xgb': only the XGBoost model is used to classify.

Returns: Boolean: True if the email is spam, False otherwise.
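A sketch of this interface follows; the body is a toy keyword check standing in for the trained models (the real function would load the manual Naive Bayes or XGBoost model), and the header-stripping rule for mode=1 is purely illustrative:

```python
def is_spam(data, mode=2, classifier="manual"):
    """Toy stand-in for the described function: True if the email is spam."""
    # classifier selection ('manual' vs 'xgb') is omitted in this toy sketch.
    if mode != 2:
        # Illustrative only: drop quoted mail-history lines before scoring.
        data = "\n".join(line for line in data.splitlines()
                         if not line.startswith(">"))
    trigger_words = {"free", "winner", "prize"}  # invented trigger list
    return len(set(data.lower().split()) & trigger_words) >= 2
```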


Future Scope

1] Our project can help filter out spam messages received in email.

2] It can help maintain proper business communications.

3] It can also be used in various education sectors.


THANK YOU
