
Email Spam Detection

CHAPTER 1

Introduction
Today, spam has become a major problem in communication over the internet. It has been estimated that
around 55% of all emails are reported as spam, and the number has been growing steadily. Spam, also
known as unsolicited bulk email, has grown alongside the increasing use of email, because email provides a
perfect way to send unwanted advertisements or junk newsgroup postings at no cost to the sender. This
opportunity has been extensively exploited by irresponsible organizations, cluttering the mailboxes
of millions of people all around the world.

Spam is a major concern: apart from the offensive content of many messages, it wastes time, and the end
user is at risk of deleting legitimate mail by mistake. Moreover, spam has an economic impact, which has
led some countries to adopt legislation against it.

Text classification is used to route an incoming mail or message either into the inbox or straight
to the spam folder. It is the process of assigning categories to text according to its content, and it is used to
organize, structure, and categorize text. It can be done either manually or automatically. Machine
learning classifies text much faster than manual techniques. It uses pre-labelled text to learn the
associations between pieces of text and their expected output. It uses feature
extraction to transform each text into a numerical representation in the form of a vector that represents the
frequency of each word in a predefined dictionary.

Text classification is important for structuring unstructured, messy text such as
documents and spam messages in a cost-effective way. Machine learning can make more accurate
predictions in real time and can replace a slow manual process with much faster analysis of
big data. This is especially important for a company that analyses text data to inform business decisions and
even automate business processes.

Dept of CSE, KIT 1 2024-25



In this project, machine learning techniques are used to detect spam messages in mail. Machine
learning is where computers learn to do something without being explicitly programmed for
the task.

It uses data to produce a program that performs a task such as classification. Compared to knowledge
engineering, machine learning techniques require messages that have already been correctly pre-classified. The
pre-classified messages make up the training dataset, which is used to fit the learning algorithm to the
model in machine learning studio.

A combination of algorithms is used to learn the classification rules from messages. These
algorithms classify objects into different classes. Each algorithm is provided with
pre-labelled data and an unknown text. After learning from the pre-labelled data, each algorithm
predicts which class the unknown text belongs to, and the category predicted by the majority is taken
as final.


CHAPTER 2

Literature Review

2.1 Introduction
This chapter reviews the literature on machine learning classifiers used in
previous research and projects. It is not merely information gathering; it summarizes the prior research
related to this project. It involves searching, reading, analysing, summarising, and
evaluating the reading materials relevant to the project.

A lot of research has been done on spam detection using machine learning. However, because spam keeps
evolving and new technologies keep emerging, the proposed methods are not always dependable.
Natural language processing is one of the lesser-known fields in machine learning, which is reflected here
in the comparatively small amount of work available.

2.2 Summary
From various studies, we can conclude that different models perform better on different types of data.
Naïve Bayes, random forest, SVM, and logistic regression are some of the most widely used algorithms in spam
detection and classification.


CHAPTER 3
Objectives and Scope

3.1 Problem Statement


Spammers are in a continuous war with email service providers. Email service providers implement
various spam filtering methods to retain their users, while spammers continuously change patterns and
use various embedding tricks to get through the filters. These filters can never be too aggressive because
even a slight misclassification may cause the consumer to lose important information. A robust filtering method
with additional reinforcements is needed to tackle this problem.

3.2 Objectives
The objectives of this project are:
i. To create an ensemble algorithm for classifying spam with the highest possible accuracy.
ii. To study how to use machine learning for spam detection.
iii. To study how natural language processing techniques can be implemented in spam detection.
iv. To provide the user with insights into the given text by leveraging the created algorithm and NLP.

3.3 Project Scope


This project needs a coordinated scope of work.
i. Combine existing machine learning algorithms to form a better ensemble algorithm.
ii. Clean, process, and make use of the dataset for training and testing the model created.
iii. Analyse the texts and extract entities for presentation.

3.4 Limitations
This project has certain limitations.
i. It can only predict and classify spam, not block it.
ii. Analysis can be tricky for some alphanumeric messages, and it may struggle with entity
detection.
iii. Since the data is reasonably large, it may take a few seconds to classify and
analyse the message.


CHAPTER 4
Experimentation and Methods

4.1 Introduction
This chapter explains the specific details of the methodology used to develop this
project. The methodology plays an important role as a guide for the project, ensuring that it stays on the right path and
works as planned. Different methodologies can be used for spam detection and
filtering, so it is important to choose a suitable one, which in turn requires
understanding the functionality of the application itself.

4.2 System Architecture


An overview of the application is presented below; it gives the basic structure of the application.

fig no. 4.1 Architecture

The UI, text processing, and ML models are the three important modules of this project.
Each module is explained in the later sections of this chapter.

A more detailed view of the architecture is presented in the workflow section.


4.3 Modules and Explanation


The application consists of three modules.
i. UI
ii. Machine Learning
iii. Data Processing

I. UI Module
a. This module contains all the functions related to the UI (user interface).
b. The user interface of this application is designed using the Streamlit library for Python.
c. The user inputs are acquired using the functions of this library and forwarded to the data processing
module for processing and conversion.
d. Finally, the output from the ML module is sent to this module and from this module to the user in visual
form.

II. Machine Learning Module
a. This module is the main module of the three.
b. This module performs everything related to machine learning and results analysis.
c. Some main functions of this module are:
i. Training the machine learning models.
ii. Testing the models.
iii. Determining the respective parameter values for each model.
iv. Keyword extraction.
v. Final output calculation.
d. The output from this module is forwarded to the UI to provide a visual response to the user.

III. Data Processing Module
a. The raw data undergoes several modifications in this module for further processing.
b. Some of the main functions of this module include:
i. Data cleaning.
ii. Merging of datasets.
iii. Text processing using NLP.
iv. Conversion of text data into numerical data (feature vectors).
v. Splitting of data.
c. All the data processing is done using the Pandas and NumPy libraries.
d. Text processing and text conversion are done using the NLTK and scikit-learn libraries.


4.4 Requirements
Hardware Requirements
PC/Laptop
RAM – 8 GB
Storage – 100-200 MB
Software Requirements
OS – Windows 7 and above
Code Editor – PyCharm, VS Code, built-in IDE
Anaconda environment with packages nltk, numpy, pandas, sklearn, tkinter, nltk data.
Supported browser such as Chrome, Firefox, Opera, etc.

4.5 WorkFlow

fig no. 4.2 Workflow

In the above workflow, the objects depicted in green belong to the Data
Processing module, which includes several functions related to data processing and natural language processing. The
objects depicted in blue belong to the Machine Learning module, where everything related to ML is
embedded. The red objects represent final results and outputs.


4.5.1 Data Collection and Description


● Data plays an important role in prediction and classification; in general, the more data is available,
the higher the accuracy tends to be.
● The data used in this project is completely open-source and has been taken from various
resources such as Kaggle and UCI.

● For accuracy and diversity, multiple datasets are used. 2 datasets
containing over 12000 mails in total and their labels are used for training and testing
the application.
● 6000 spam mails are taken for generalisation of data and to increase the accuracy.

Data Description
Dataset : enronSpamSubset
Source : Kaggle
Description : This dataset is part of a larger dataset called Enron. It contains a set of spam and
non-spam emails, with 0 for non-spam and 1 for spam in the label attribute.
Composition :
Unique values : 9687
Spam values : 5000
Non-spam values : 4687

fig no. 4.3 enron spam

Dataset : lingspam
Source : Kaggle
Description : This dataset is part of a larger dataset called
Enron1 which contains emails classified as spam or
ham (not-spam).
Composition :
Unique values : 2591
Spam values : 419
Non-spam values : 2172


fig no. 4.4 lingspam

4.5.2 Data Processing


4.5.2.1 Overall data processing
It consists of two main tasks:
● Dataset cleaning
It includes tasks such as removal of outliers, removal of null values, and removal of unwanted features
from the data.

● Dataset merging
After data cleaning, the datasets are merged to form a single dataset containing only two
features (text, label).
Both data cleaning and data merging are done entirely using the Pandas library.

4.5.2.2 Textual data processing


● Tag removal
Removing all kinds of tags and unknown characters from the text using regular expressions
through Python's re library.

● Sentencing, tokenization
Breaking down the text (email/SMS) into sentences and then into tokens (words).
This process is done using the NLTK pre-processing library for Python.
● Stop word removal
Stop words such as of, a, be, … are removed using the stopwords corpus of the NLTK library.
● Lemmatization
Words are converted into their base forms using lemmatization and POS tagging. This
process gives keywords through entity extraction.
This process is done using chunking in regex and NLTK lemmatization.
● Sentence formation
The lemmatized tokens are combined to form a sentence.
This sentence is essentially the original sentence converted into its base form with the stop
words removed.
Then all the sentences are combined to form a text.

● While the overall data processing is done only on the datasets, the textual processing is
applied to the training data, the testing data, and the user input data.
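The first steps of the textual processing above (tag removal, tokenization, and stop word removal) can be sketched as follows. This is only a minimal illustration using the Python standard library: the function name clean_text and the tiny stop word set are assumptions made for the example, whereas the project itself relies on NLTK for tokenization, stop words, and lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# the project uses the English stop word list shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "of", "be", "is", "to", "and"}

def clean_text(raw):
    """Strip tags and symbols, tokenize on whitespace, and drop stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)                 # remove HTML-like tags
    no_symbols = re.sub(r"[^A-Za-z0-9\s]", " ", no_tags)   # remove other characters
    tokens = no_symbols.lower().split()                    # simple tokenization
    return [token for token in tokens if token not in STOP_WORDS]

tokens = clean_text("<p>Win a FREE prize!</p> Reply to claim the offer.")
print(tokens)
```

Lemmatization is left out of this sketch because it needs the WordNet data used by NLTK.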

4.5.2.3 Feature Vector Formation


● The texts are converted into feature vectors (numerical data) using the words present in all
the texts combined.

● This process is done using the CountVectorizer of the scikit-learn library.


● The feature vectors can be formed using two language models: Bag of Words and Term
Frequency-Inverse Document Frequency.

4.5.2.3.1 Bag of Words


Bag of words is a language model used mainly in text classification. A bag of words represents the text in
a numerical form.
The two things required for Bag of Words are
• A vocabulary of words known to us.
• A way to measure the presence of words.
Ex: a few lines from the book “A Tale of Two Cities” by Charles Dickens.
“It was the best of times, it was the worst
of times, it was the age of wisdom, it was
the age of foolishness,”
The unique words here (ignoring case and punctuation) are:
[“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”, “foolishness”]
The next step is scoring the words present in every document.

After scoring, the four clauses of the above passage can be represented in vector form as
"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]


This is the main process behind the bag of words. In practice, however, the vocabulary from even a couple of
documents is very large, so only words that repeat frequently and are important in nature are kept, and the remaining
words are removed during the text processing stage.
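The four vectors of the Dickens example can be reproduced with a short sketch. The helper names build_vocabulary and to_binary_vector are assumptions made for this illustration; the project itself performs this step with a library vectorizer.

```python
import re

def build_vocabulary(documents):
    """Collect unique lower-cased words in order of first appearance."""
    vocabulary = []
    for document in documents:
        for word in re.findall(r"[a-z]+", document.lower()):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def to_binary_vector(document, vocabulary):
    """Score 1 if the vocabulary word is present in the document, else 0."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return [1 if term in words else 0 for term in vocabulary]

documents = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(document, vocabulary) for document in documents]

print(vocabulary)
for document, vector in zip(documents, vectors):
    print(document, "=", vector)
```

This sketch scores only presence or absence. Counting occurrences instead would give the count-based variant of the same model.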

4.5.2.3.2 Term Frequency-inverse document frequency


Term frequency-inverse document frequency of a word is a measurement of the importance of a word.

It compares how often a word occurs in a document with how often it occurs across the collection of
documents and calculates a score.

Terminology for the formulae below:
t – term (word)
d – document (set of words)
N – count of documents

The TF-IDF process consists of the activities listed below.

i) Term Frequency
The number of times a particular word appears in a document, divided by the number of words in that
document, is called term frequency.
tf(t, d) = count of t in d / number of words in d

ii) Document Frequency
Document frequency is the count of documents in which the word was detected. Each document is counted
once, no matter how many times the word is present in it.
df(t) = number of documents in which t occurs

iii) Inverse Document Frequency
• IDF is the inverse of document frequency.
• It measures the importance of a term t considering the information it contributes. Term frequency alone
treats every term as equally important, but certain terms such as (are, if, a, be, that, ..) provide little
information about the document. The inverse document frequency factor reduces the
importance of terms that recur often and increases the importance of
terms that are rare.
idf(t) = N / df(t)


Finally, the TF-IDF can be calculated by combining the term frequency and inverse document
frequency. The logarithm dampens the ratio, and the 1 added to the denominator avoids division by zero.

tf_idf(t, d) = tf(t, d) * log(N / (df(t) + 1))

The process can be explained using the following example:

Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.

The bag of words of the above sentences is
[going:3, to:2, today:2, i:2, am:2, it:1, is:1, rain:1]
Then finding the term frequency

Table no. 4.1 Term frequency

Then finding the inverse document frequency

Table no. 4.2 inverse document frequency


Applying the final equation, the TF-IDF values become

Table no. 4.3 TF-IDF
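The three-document example can also be worked through in code. The sketch below applies the formula tf_idf(t, d) = tf(t, d) * log(N / (df + 1)) as written in this section. The natural logarithm and the regular-expression tokenizer are assumptions, since the report does not state the log base or the tokenization used for the tables.

```python
import math
import re

documents = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]
# Lower-case each document and split it into words.
tokenized = [re.findall(r"[a-z]+", document.lower()) for document in documents]
N = len(tokenized)

def tf(term, tokens):
    """Count of the term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)

def df(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for tokens in tokenized if term in tokens)

def tf_idf(term, tokens):
    # Natural logarithm assumed; the +1 follows the formula in this section.
    return tf(term, tokens) * math.log(N / (df(term) + 1))

for term in ("going", "today", "rain"):
    scores = [round(tf_idf(term, tokens), 3) for tokens in tokenized]
    print(term, "df =", df(term), "tf-idf per document =", scores)
```

With this particular smoothing, a term that occurs in every document (such as "going") receives a negative score, and a term that occurs in all but one document receives zero, so only the rarer terms contribute positively.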

Using the above two language models, the complete data has been converted into two kinds of vectors
and stored in a CSV file for easy access and minimal processing.

4.5.3 Data Splitting

Data splitting is done to create two kinds of data: training data and testing data. Training data is
used to train the machine learning models, and testing data is used to test the models and analyse the results.
80% of the total data is selected as training data and the remaining 20% is testing data.
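An 80-20 split of this kind can be sketched with the standard library alone. The function name split_dataset, the fixed seed, and the toy samples are assumptions made for illustration; a library routine such as scikit-learn's train_test_split would normally be used for the same purpose.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy (text, label) pairs standing in for the merged dataset.
samples = [(f"message {i}", i % 2) for i in range(10)]
train, test = split_dataset(samples)
print(len(train), "training samples,", len(test), "testing samples")
```

Shuffling before cutting matters here because the merged dataset is built by concatenating sources, so an unshuffled cut could leave one source entirely in the testing part.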

4.5.4 Machine Learning

4.5.4.1 Introduction
Machine learning is a process in which the computer performs certain tasks without being given explicit instructions. In
this case, the models take the training data and train on it.
Any new unknown data is then processed based on the rules derived
from the training data.


After completing the count vectorization and TF-IDF stages in the workflow, the data is in
vector (numerical) form, which is used for training and testing the models.

For our study, various machine learning models are compared to determine which method is most suitable
for this task. The models used for the study include Logistic Regression, Naïve Bayes, Random Forest
Classifier, K-Nearest Neighbors, Support Vector Machine Classifier, and a proposed model which was
created using an ensemble approach.

4.5.4.2 Algorithms
A combination of 5 algorithms is used for the classification.

4.5.4.2.1 Naïve Bayes Classifier


A naïve Bayes classifier is a supervised probabilistic machine learning model that is used for
classification tasks. The main principle behind this model is the Bayes theorem.

Bayes Theorem:
Naive Bayes is a classification technique that is based on Bayes’ Theorem with an assumption that all the
features that predict the target value are independent of each other. It calculates the probability of each
class and then picks the one with the highest probability.
Although the independence assumption rarely holds in real-world data, the classifier often
works well in practice. This assumption is why it is called “Naive” [14].

P(A|B) = (P(B|A) P(A)) / P(B)
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of data B given that hypothesis A was true.
P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior
probability of A.
P(B) is the probability of the data (regardless of the hypothesis) [15].
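A small numeric illustration of Bayes' theorem for spam may help. All probabilities below are made-up values chosen for the example, not figures from the project's datasets, and the function name posterior is an assumption.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) from Bayes' theorem, expanding P(B) over A and not-A."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 40% of mails are spam, the word "free" appears in
# 50% of spam mails and in 5% of legitimate mails.
p_spam_given_free = posterior(0.4, 0.5, 0.05)
print(round(p_spam_given_free, 3))
```

Here A is the hypothesis "the mail is spam" and B is the observation "the mail contains the word free". A naïve Bayes classifier multiplies such per-word likelihoods together under the independence assumption.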

Naïve Bayes classifiers are mostly used for text classification. The limitation of the Naïve Bayes model is
that it treats every word in a text as independent and equal in importance, but words cannot all be
treated as equally important because articles and nouns do not carry the same weight in language. Nevertheless, due
to its classification efficiency, this model is used in combination with other language processing
techniques.

4.5.4.2.2 Random Forest Classifier

Random Forest classifier is a supervised ensemble algorithm. A random forest consists of multiple
random decision trees. Two types of randomness are built into the trees. First, each tree is built on a
random sample from the original data. Second, at each tree node, a subset of features is randomly selected
to generate the best split [16].

Decision Tree:
The decision tree is a classification algorithm based entirely on features. The tree repeatedly
splits the data on the feature with the best information gain. This process continues until the information
gain no longer improves. Unknown data is then evaluated feature by feature until it is categorized. Tree
pruning techniques are used to improve accuracy and reduce overfitting.
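The idea of choosing the split with the best information gain can be sketched as follows. The sketch assumes the entropy-based definition of information gain; the feature names and the four toy samples are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder

labels = ["spam", "spam", "ham", "ham"]
contains_free = [1, 1, 0, 0]    # separates the classes perfectly
contains_hello = [1, 0, 1, 0]   # tells us nothing about the class

print(information_gain(labels, contains_free))
print(information_gain(labels, contains_hello))
```

A tree built on these samples would split on contains_free first, because that split removes all uncertainty about the class.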

Several decision trees are created on subsets of the data, and the result given by the majority of trees is
taken as the final result. The number of trees to be created is determined based on accuracy and
other metrics through iterative methods. Random forest classifiers are mainly used on condition-based
data, but they also work for text once the text is converted into numerical form.

4.5.4.2.3 Logistic Regression

Logistic Regression is a supervised machine learning algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and the outcome is
binary or dichotomous [17]. The probabilities are calculated using a sigmoid function.

For example, let us take a problem where the data has n features.

We need to fit a line to the given data, and this line can be represented by the equation

z = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + … + b_n x_n

Here z is the logarithm of the odds (log-odds). Generally, odds are calculated as

odds = p(event occurring) / p(event not occurring)

Sigmoid Function:

A sigmoid function is a special form of logistic function, hence the name logistic regression. The
logarithm of the odds is calculated and fed into the sigmoid function to get a continuous probability ranging
from 0 to 1.

The logarithm of the odds can be calculated by

log(odds) = dot(features, coefficients) + intercept

and these log-odds are used in the sigmoid function to get the probability.

h(z) = 1 / (1 + e^(-z))
The output of the sigmoid function is a real number in the range 0 to 1, which is used to determine which class
the sample belongs to. Generally, 0.5 is taken as the threshold: values below it are considered a NO, and
values of 0.5 or higher are considered a YES. The threshold can be adjusted based on the requirement.
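The step from log-odds to a class decision can be sketched in a few lines. The coefficient and intercept values used here are hypothetical; in the project they would be learned from the training vectors.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, coefficients, intercept, threshold=0.5):
    """Return the probability and the YES/NO decision for one sample."""
    log_odds = sum(x * b for x, b in zip(features, coefficients)) + intercept
    probability = sigmoid(log_odds)
    return probability, probability >= threshold

# Hypothetical coefficients for a two-feature sample.
probability, is_spam = predict([1.0, 2.0], [0.5, -0.25], 0.0)
print(probability, is_spam)
```

Raising the threshold above 0.5 makes the filter more conservative, which reduces the risk of sending legitimate mail to the spam folder at the cost of letting more spam through.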

4.5.4.2.4 K-Nearest Neighbors

KNN is a supervised classification algorithm. All the data points are assumed
to lie in an n-dimensional space. The category of a new data point is then determined
by the majority category among its nearest neighbors.
Euclidean distance is used to determine the distance between points.

The distance between 2 points is calculated as

d = √((x2 - x1)^2 + (y2 - y1)^2)


The distances between the unknown point and all the other points are calculated. Depending on the value of K provided, the
K closest neighbors are determined. The category to which the majority of these neighbors belong is
selected as the category of the unknown data.

If the data contains up to 3 features, the points can be plotted and visualized. KNN is fairly slow compared to other
algorithms such as SVM, as it needs to compute the distance to all points to find the
closest neighbors of the given point.
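The KNN procedure can be sketched directly from this description. The function name knn_predict and the two-dimensional toy points are assumptions for illustration; math.dist (Python 3.8 or later) computes the Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-dimensional data: three "ham" points near the origin, two "spam" points far away.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["ham", "ham", "ham", "spam", "spam"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.0)))
```

The sorted call over every training point is what makes prediction slow on large datasets: each query requires a distance computation against the whole training set.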

4.5.4.2.5 Support Vector Machines(SVM)

It is a supervised machine learning algorithm for classification. Decision boundaries are drawn between the
categories, and the category of a point is determined by which side of the boundary it falls on.

Support Vectors:

The training points closest to the boundary are called support vectors. Instead of points, these are called
vectors because they are assumed to start from the origin. The distance between the support vectors of
opposing classes, measured across the boundary, is called the margin. We want our margin to be as
wide as possible because it yields better results.

There are three types of kernels commonly used by SVM to create boundaries.

Linear: used if the data is linearly separable.
Poly: used if the data is not linearly separable. It maps the data into a higher-dimensional polynomial feature space.
Radial: this is the default kernel of scikit-learn's SVC. It implicitly maps the data into an infinite-dimensional space.

If the data is 2-dimensional, the boundaries are lines. If the data is 3-dimensional, the boundaries
are planes. If the data has more than 3 dimensions, the boundaries are called hyperplanes.

An SVM mainly depends on the decision boundaries for predictions. It doesn’t compare the data to all
other data to get the prediction; because of this, SVMs tend to be quick with predictions.
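For the linear case, predicting with a trained boundary reduces to checking on which side of a hyperplane the point lies. The sketch below assumes an already trained linear boundary; the weights and bias are hypothetical values, not parameters from the project's models.

```python
def svm_decision(features, weights, bias):
    """Signed score w·x + b: positive on one side of the boundary, negative on the other."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def svm_predict(features, weights, bias):
    """Assign the class according to the side of the boundary."""
    return "spam" if svm_decision(features, weights, bias) >= 0 else "ham"

# Hypothetical boundary in a two-feature space.
weights, bias = [1.0, -1.0], -0.5

print(svm_predict([2.0, 0.0], weights, bias))
print(svm_predict([0.0, 2.0], weights, bias))
```

Only one dot product is needed per prediction, which is why a linear SVM does not have to look at the other data points at prediction time.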


4.5.5 Experimentation

The process runs as follows: data collection and processing, then natural language processing, then
vectorization, and finally machine learning. The data is collected, cleaned, and then subjected to the natural language
processing techniques specified in section 4.5.2. The cleaned data is then converted into vectors using the Bag
of Words and TF-IDF methods.

The data is split into training data and testing data in an 80-20 ratio. The training and testing data
are converted into Bag-of-Words vectors and TF-IDF vectors.

There are several metrics for evaluating the models, but accuracy is used for comparing the BoW and
TF-IDF models. Accuracy is generally used to determine the efficiency of a model.

Accuracy:
“Accuracy is the number of correctly predicted data points out of all the data points”.

Naïve Bayes Classification algorithm:


Two models, one for BoW and one for TF-IDF, are created and trained using the respective training vectors
and training labels. The respective testing vectors and labels are then used to get the score for each model.

fig no. 4.5 naïve Bayes

The scores for Bag-of-Words and TF-IDF are visualized.

The scores for the BoW and TF-IDF models are 98.04 and 96.05 respectively when using the naïve
Bayes model.


Logistic Regression:

Two models are created following the same procedure used for the naïve Bayes models and then tested. The
results obtained are visualized below.

fig no. 4.6 Logistic Regression (Bow vs TF-IDF)

The scores for BoW and TF-IDF models are 98.53 and 98.80 respectively.

K-Nearest Neighbors:

Similar to the above models, the models are created and trained using the respective vectors and labels. In
addition to the data, however, the number of neighbors to be considered must also be provided.

Using an iterative method, K = 3 (number of neighbors) provided the best results for the BoW model and K = 9
provided the best results for the TF-IDF model.

The scores for BoW and TF-IDF across the K values are visualized below.

fig no. 4.7 Neighbors vs Accuracy


Taking K = 3 and K = 9 for BoW and TF-IDF respectively, the
scores are calculated and presented below.

fig no. 4.8 KNN (Bow vs TF-IDF)

Random Forest:

Similar to the previous algorithms, two models are created and trained using the respective training
vectors and training labels. In addition, the number of trees to be used in the forest has to be provided.

fig no. 4.9 Random Forest (trees vs score)


Using the iterative method, the best value for the number of trees is determined. From the results, it is
clear that 19 estimators provide the best score for both the BoW and TF-IDF models. The number of trees
and the scores for both models are visualized.

The scores for the BoW and TF-IDF models are visualized.

fig no. 4.10 Random Forest (BoW vs TF-IDF)

Support Vector Machines (SVM):
Finally, two SVM models, one for BoW and one for TF-IDF, are created and trained using the
respective training vectors and labels, and then tested using the testing vectors and labels.

fig no. 4.11 SVM (BoW vs TF-IDF)

The scores for BoW and TF-IDF models are 59.41 and 98.82 respectively.

Proposed Model:

In our proposed system, we combine all the models into one. It takes an unknown point
and feeds it into every model to get predictions. It then takes these predictions, finds the category
predicted by the majority of the models, and returns it as the final result.
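The majority-vote combination can be sketched as below. The five stand-in models are hypothetical rules written only for this example; in the project they would be the trained naïve Bayes, logistic regression, KNN, random forest, and SVM classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the category predicted by most of the models."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, sample):
    """Feed the sample to every model and combine the answers by majority."""
    return majority_vote([model(sample) for model in models])

# Hypothetical stand-ins for the five trained classifiers.
models = [
    lambda text: "spam" if "free" in text else "ham",
    lambda text: "spam" if "prize" in text else "ham",
    lambda text: "spam" if "winner" in text else "ham",
    lambda text: "spam" if len(text) > 40 else "ham",
    lambda text: "ham",
]

print(ensemble_predict(models, "claim your free prize now"))
print(ensemble_predict(models, "meeting moved to friday"))
```

With five voters and two categories a tie is impossible, which is one practical reason for combining an odd number of models.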


To determine which model is effective, we used three metrics: Accuracy, Precision, and F1 score. In the
earlier stage, we used only the F1 Score because we were not determining which model is best but
which language model is best suited for classification.
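The three metrics can be computed from the true and predicted labels as sketched below. The label lists are made-up values for illustration, with 1 standing for spam and 0 for non-spam, and the function name classification_metrics is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, f1) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, f1

# Hypothetical labels: 1 = spam, 0 = non-spam.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy, precision, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, f1)
```

Precision is the metric that reflects legitimate mail wrongly flagged as spam, which is the costly error named in the problem statement.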

4.5.6 User Interface (UI)
The user interface (UI) is an important component of this application. The
user interacts only with the interface.
The UI of this project has been constructed with the help of an open-source library called Streamlit. The
complete information and API reference sheet can be obtained from the Streamlit documentation.

4.5.7 Working Procedure


The working procedure includes the internal working and the data flow of the application.
i. After running the application, some procedures are automated.
1. Reading data from file
2. Cleaning the texts
3. Processing
4. Splitting the data
5. Initialising and training the models
ii. The user just needs to provide some data to classify in the area provided.
iii. The provided data undergoes several procedures after submission.
1. Textual processing
2. Feature vector conversion
3. Entity extraction
iv. The created vectors are provided to the trained models to get predictions.
v. After getting the predictions, the category predicted by the majority is selected.
vi. The accuracies of that prediction are calculated.
vii. The accuracies and the entities extracted in step iii are provided to the user.
Every time the user provides new input, the procedure from step ii is repeated.


CHAPTER 5
Results and Discussion
5.1 Language Model Selection
To select the best language model, the data was converted into both types of vectors, and the models were
then tested to determine the best language model for classifying spam.
The results from the individual models are presented in the experimentation section under methodology. The
results from the models are now compared.


fig no. 5.1 Bow vs TF-IDF (Cumulative)

From the figure it is clear that TF-IDF proves to be better than BoW in the models tested, with the exception of naïve
Bayes, where BoW scored higher (98.04 against 96.05). Hence TF-IDF has been
selected as the primary language model for textual data conversion in feature vector formation.

5.2 Proposed Model results


To determine which model is effective, we used three metrics: Accuracy, Precision, and F1 score.
The resulting values for the proposed model are
Accuracy – 99.0
Precision – 98.5
F1 Score – 98.6

5.3 Comparison
The results from the proposed model have been compared with all the models individually in tabular form to
illustrate the differences clearly.

Model                 Accuracy   Precision   F1 Score
Naïve Bayes           96.0       99.2        95.2
Logistic Regression   98.4       97.8        98.6
Random forest         96.8       96.4        96.3
KNN                   96.6       96.9        96.0
SVM                   98.8       97.8        98.6
Proposed model        99.0       98.5        98.6

Table no. 5.1 Models and results

The color RED indicates that the value is lower than the proposed model and GREEN indicates equal or higher.

Here we can observe that our proposed model matches or outperforms almost every other model in every metric. Only one
model (naïve Bayes) has slightly higher precision than our model, but it lags considerably in the other metrics.
The results are visually presented below for easier understanding and comparison.

fig no. 5.2 Comparison of Models

From the above comparison bar chart, we can clearly see that none of the individual models is as efficient as the
proposed method.

5.4 Summary
There are two main tasks in the project implementation: language model selection, which completes the textual
processing phase, and creation of the proposed model from the individual algorithms. Both tasks require comparison
against other models and selection of various parameters for better efficiency.

During the language model selection phase, two models, Bag of Words and TF-IDF, are compared, and from the results
obtained it is evident that TF-IDF performs better.

During the proposed model design, various algorithms are tested with different parameters to find the best settings.
The models are then merged to form an ensemble algorithm, and the results obtained are presented and compared above.
It is clear from the results that the proposed model outperforms the others in almost every metric.
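The way the individual models are merged can be sketched as a simple majority vote, following the rule in the get_prediction method of the source code appendix: a message is labelled spam when at least three of the five classifiers predict label 1. The function name majority_vote and the sample votes are hypothetical, and the sketch assumes that label 1 means spam.

```python
def majority_vote(predictions, spam_label=1, threshold=3):
    """Label a message 'Spam' when enough base models vote for spam."""
    spam_votes = sum(1 for p in predictions if p == spam_label)
    return "Spam" if spam_votes >= threshold else "Non-Spam"


# Hypothetical votes in the order NB, LR, RF, SVM, KNN.
votes_a = [1, 0, 1, 1, 0]   # three spam votes
votes_b = [0, 0, 1, 1, 0]   # two spam votes

label_a = majority_vote(votes_a)
label_b = majority_vote(votes_b)
print(label_a, label_b)
```

With five voters and a threshold of three there can be no tie, so every message receives a definite label.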


Appendices

A. Source code

1. Module – Data Processing


import re
from collections import defaultdict

import spacy
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Map the first letter of a Penn Treebank POS tag to a WordNet POS.
# Any tag that is not an adjective, verb or adverb is treated as a noun.
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')


def process_sentence(sentence):
    """Return the cleaned sentence and the nouns found in the raw sentence."""
    nouns = list()
    base_words = list()
    final_words = list()
    # Tokens of the unmodified sentence, used below for noun extraction.
    words_2 = word_tokenize(sentence)
    # Strip punctuation, then replace underscores with spaces.
    sentence = re.sub(r'[^ \w\s]', '', sentence)
    sentence = re.sub(r'_', ' ', sentence)
    words = word_tokenize(sentence)
    pos_tagged_words = pos_tag(words)
    # Lemmatize each token according to its part of speech.
    for token, tag in pos_tagged_words:
        base_words.append(lemmatizer.lemmatize(token, tag_map[tag[0]]))
    # Drop stop words.
    for word in base_words:
        if word not in stop_words:
            final_words.append(word)
    sym = ' '
    sent = sym.join(final_words)
    # Collect singular common nouns (tag NN) longer than one character.
    pos_tagged_sent = pos_tag(words_2)
    for token, tag in pos_tagged_sent:
        if tag == 'NN' and len(token) > 1:
            nouns.append(token)
    return sent, nouns


def clean(email):
    """Return the cleaned email text and the nouns from all of its sentences."""
    email = email.lower()
    sentences = sent_tokenize(email)
    total_nouns = list()
    string = ""
    for sent in sentences:
        sentence, nouns = process_sentence(sent)
        string += " " + sentence
        total_nouns += nouns
    return string, total_nouns


def ents(text):
    """Group named entities by label; return 'no' when none are found."""
    doc = nlp(text)
    expls = dict()
    if doc.ents:
        for ent in doc.ents:
            expls.setdefault(ent.label_, []).append(ent.text)
        return expls
    else:
        return 'no'

2. Module – Machine Learning


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


class model:
    def __init__(self):
        # Load the pre-cleaned dataset and make sure every email is a string.
        self.df = pd.read_csv('Cleaned_Data.csv')
        self.df['Email'] = self.df.Email.apply(lambda email: np.str_(email))
        self.Data = self.df.Email
        self.Labels = self.df.Label
        (self.training_data, self.testing_data,
         self.training_labels, self.testing_labels) = train_test_split(
            self.Data, self.Labels, random_state=10)
        self.training_data_list = self.training_data.to_list()

        # TF-IDF feature vectors are fitted on the training split only.
        self.vectorizer = TfidfVectorizer()
        self.training_vectors = self.vectorizer.fit_transform(self.training_data_list)

        # The five base classifiers of the ensemble.
        self.model_nb = MultinomialNB()
        self.model_svm = SVC(probability=True)
        self.model_lr = LogisticRegression()
        self.model_knn = KNeighborsClassifier(n_neighbors=9)
        self.model_rf = RandomForestClassifier(n_estimators=19)
        self.model_nb.fit(self.training_vectors, self.training_labels)
        self.model_lr.fit(self.training_vectors, self.training_labels)
        self.model_rf.fit(self.training_vectors, self.training_labels)
        self.model_knn.fit(self.training_vectors, self.training_labels)
        self.model_svm.fit(self.training_vectors, self.training_labels)

    def get_prediction(self, vector):
        # Majority vote: spam when at least 3 of the 5 models predict label 1.
        pred_nb = self.model_nb.predict(vector)[0]
        pred_lr = self.model_lr.predict(vector)[0]
        pred_rf = self.model_rf.predict(vector)[0]
        pred_svm = self.model_svm.predict(vector)[0]
        pred_knn = self.model_knn.predict(vector)[0]
        preds = [pred_nb, pred_lr, pred_rf, pred_svm, pred_knn]
        spam_counts = preds.count(1)
        if spam_counts >= 3:
            return 'Spam'
        return 'Non-Spam'

    def get_probabilities(self, vector):
        # Class probabilities in percent, in the order NB, LR, RF, KNN, SVM.
        prob_nb = self.model_nb.predict_proba(vector)[0] * 100
        prob_lr = self.model_lr.predict_proba(vector)[0] * 100
        prob_rf = self.model_rf.predict_proba(vector)[0] * 100
        prob_knn = self.model_knn.predict_proba(vector)[0] * 100
        prob_svm = self.model_svm.predict_proba(vector)[0] * 100
        return [prob_nb, prob_lr, prob_rf, prob_knn, prob_svm]

    def get_vector(self, text):
        return self.vectorizer.transform([text])

3. Module – User interface

import time

import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import streamlit as st

from DP import *
from ML import model

# inputs[0] flags text-area input, inputs[1] flags an uploaded file.
inputs = [0, 1]


# Cache the trained ensemble so it is built only once.
@st.cache_resource
def create_model():
    mode = model()
    return mode


col1, col2, col3, col4, col5 = st.columns(5)
with col3:
    st.title("Spade")
st.write('welcome to Spade...')
st.write('A Spam Detection algorithm based on Machine Learning and Natural Language Processing')
text = st.text_area('please provide email/text you wish to classify',
                    height=400,
                    placeholder='type/paste more than 50 characters here')
file = st.file_uploader("please upload file with your text.. (only .txt format supported)")

if len(text) > 20:
    inputs[0] = 1
if file is None:
    inputs[1] = 0

given_email = None
if inputs.count(1) > 1:
    st.error('multiple inputs given please select only one option')
else:
    if inputs[0] == 1:
        given_email = text
    if inputs[1] == 1:
        # Uploaded files arrive as bytes; decode them before text processing.
        given_email = file.getvalue().decode('utf-8')

predictions = []
probs = []
col1, col2, col3, col4, col5 = st.columns(5)
with col3:
    clean_button = st.button('Detect')
st.caption("In case of a warning it's probably related to caching of your browser")
st.caption("please hit the detect button again....")

if clean_button:
    if inputs.count(0) > 1:
        st.error('No input given please try after giving the input')
    elif given_email is not None:
        with st.spinner('Please wait while the model is running....'):
            mode = create_model()
            # Keep the raw text for entity extraction; cleaning lowercases it.
            raw_email = given_email
            given_email, n = clean(given_email)
            vector = mode.get_vector(given_email)
            predictions.append(mode.get_prediction(vector))
            probs.append(mode.get_probabilities(vector))
        col1, col2, col3 = st.columns(3)
        with col2:
            st.header(f"{predictions[0]}")
        # Show each model's probability for the class the ensemble chose.
        probs_pos = [i[1] for i in probs[0]]
        probs_neg = [i[0] for i in probs[0]]
        if predictions[0] == 'Spam':
            plot_values = probs_pos
        else:
            plot_values = probs_neg
        plot_values = [int(i) for i in plot_values]
        st.header('These are the results obtained from the models')
        col1, col2 = st.columns([2, 3])
        with col1:
            st.subheader('predicted Accuracies of models')
            with st.expander('Technical Details'):
                # Order matches model.get_probabilities: NB, LR, RF, KNN, SVM.
                st.write('Model-1 : Naive Bayes')
                st.write('Model-2 : Logistic Regression')
                st.write('Model-3 : Random Forest')
                st.write('Model-4 : K-Nearest Neighbors')
                st.write('Model-5 : Support Vector Machines')
        with col2:
            # One animated progress bar per model.
            for idx, value in enumerate(plot_values, start=1):
                st.write(f'Model-{idx}', value)
                bar = st.progress(0)
                for i in range(value):
                    time.sleep(0.01)
                    bar.progress(i)
        st.header('These are some insights from the given text.')
        entities = ents(raw_email)
        col1, col2 = st.columns([2, 3])
        with col1:
            st.subheader('These are the named entities extracted from the text')
            st.write('please expand each category to view the entities')
            st.write('a small description has been included with entities for user understanding')
        with col2:
            if entities == 'no':
                st.subheader('No Named Entities found.')
            else:
                renames = {'CARDINAL': 'Numbers', 'TIME': 'Time',
                           'ORG': 'Companies/Organizations', 'GPE': 'Locations',
                           'PERSON': 'People', 'MONEY': 'Money',
                           'FAC': 'Facilities'}
                for i in renames.keys():
                    with st.expander(renames[i]):
                        st.caption(spacy.explain(i))
                        # A label may be absent from this text, so default to an empty list.
                        values = list(set(entities.get(i, [])))
                        strin = ', '.join(values)
                        st.write(strin)


B. SCREENSHOTS


CHAPTER 6

Conclusion and Future Scope


6.1 Conclusion
From the results obtained we can conclude that an ensemble machine learning model is more
effective in detecting and classifying spam than any individual algorithm. We can also conclude
that the TF-IDF (term frequency-inverse document frequency) language model is more effective than the
Bag of Words model in classifying spam when combined with several algorithms. Finally, we can
say that spam detection improves when machine learning algorithms are combined and tuned to the
task.

6.2 Future work


There are numerous applications of machine learning and natural language processing, and when
combined they can solve some of the most troubling problems concerning text. This application
can be scaled to take in text in bulk so that classification can be done more effectively on some public
sites.

Other categories, such as negative, phishing, or malicious content, can be used to train the model to filter
things such as public comments on various social sites. This application can also be converted to an online
machine learning system that is easily updated with the latest trends in spam and other mail,
so that the system can adapt to new types of spam emails and texts.
