spam detection

Uploaded by

ajeetgkp115


CHAPTER No TITLE PAGE No

Abstract 5
List of Figures 8
List of Tables 9
1 Introduction 10
2 Literature Review 12
2.1 Introduction 12
2.2 Related work 12
2.3 Summary 13
3 Objectives and Scope 14
3.1 Problem statement 14
3.2 Objectives 14
3.3 Project Scope 14
3.4 Limitations 14
4. Experimentation and Methods 15
4.1 Introduction 15
4.2 System architecture 15
4.3 Modules and Explanation 15
4.4 Requirements 17
4.5 Workflow 17
4.5.1 Data collection and Description 18
4.5.2 Data Processing 19
4.5.2.1 Overall Data Processing 19
4.5.2.2 Textual Data Processing 19
4.5.2.3 Feature Vector Processing 20
4.5.2.3.1 bag of words 20
4.5.2.3.2 TF-IDF 20
4.5.3 Data Splitting 23
4.5.4 Machine Learning 23
4.5.4.1 Introduction 23

4.5.4.2 Algorithms 23

4.5.4.2.1 Naïve bayes Classifier 23

4.5.4.2.2 Random Forest Classifier 24
4.5.4.2.3 Logistic Regression 25
4.5.4.2.4 K-Nearest Neighbors 26
4.5.4.2.5 Support Vector machines 26
4.5.5 Experimentation 27
4.5.6 User Interface(UI) 30
4.5.7 Working Procedure 31
5 Results and Discussion 32
5.1 Language Model selection 32
5.2 Proposed Model 32
5.3 Comparison 32
5.4 Summary 34
6 Conclusion and Future Scope 35
6.1 Conclusion 35
6.2 Future Work 35
References 36
Appendices 38
A. Source code 38
B. Screenshots 43

List of Figures

Fig No Title Pg no
4.1 Architecture 15
4.2 Workflow 17
4.3 Enron Data 18
4.4 Ling spam 18
4.5 Naïve Bayes(Bow vs TF-IDF) 27
4.6 Logistic Regression(Bow vs TF-IDF) 28
4.7 Neighbors vs Accuracy(KNN) 28
4.8 KNN(Bow vs TF-IDF) 29
4.9 Random Forest(trees vs scores) 29
4.10 Random Forest(Bow vs TF-IDF) 29
4.11 SVM(Bow vs TF-IDF) 30
5.1 Bow vs TF-IDF(Cumulative) 32
5.2 Comparison of Models 33

Chapter 1
Introduction

Today, spam has become a major problem in communication over the internet. It has been
reported that around 55% of all emails are spam, and the number has been growing steadily.
Spam, also known as unsolicited bulk email, has grown with the increasing use of email,
because email provides an ideal way to send unwanted advertisements or junk newsgroup
postings at no cost to the sender. This opportunity has been extensively exploited by
irresponsible organizations, cluttering the mailboxes of millions of people all around the
world.
Spam is a major concern: its content can be offensive, it wastes time, and it puts the end
user at risk of deleting legitimate mail by mistake. Moreover, spam has had an economic
impact, which has led some countries to adopt legislation against it.
Text classification is used to route an incoming mail or message either into the inbox or
straight to the spam folder. It is the process of assigning categories to text according to its
content, and it is used to organize, structure and categorize text. It can be done either
manually or automatically. Machine learning classifies text automatically and much faster
than manual techniques. It uses pre-labelled text to learn the associations between pieces of
text and their expected output. It uses feature extraction to transform each text into a
numerical representation in the form of a vector that represents the frequency of each word
in a predefined dictionary.
Text classification is important for structuring unstructured and messy text, such as
documents and spam messages, in a cost-effective way. Machine learning can make more
accurate predictions in real time and replaces the slow manual process with a much faster
analysis of big data. This is especially important for a company that needs to analyse text
data, inform business decisions and even automate business processes.
In this project, machine learning techniques are used to detect spam messages in mail.
Machine learning is where computers learn to do something without the need to explicitly
program them for the task. It uses data to produce a program that performs a task such as
classification. Compared to knowledge engineering, machine learning techniques require
messages that have been successfully pre-classified. The pre-classified messages make up the
training dataset, which is used to fit the learning algorithm to the model in machine
learning studio.
A combination of algorithms is used to learn the classification rules from messages. These
algorithms are used to classify objects into different classes. They are provided with
pre-labelled data and an unknown text. After learning from the pre-labelled data, each
algorithm predicts which class the unknown text belongs to, and the category predicted by
the majority is considered final.

Chapter 2
Literature Review

2.1 Introduction

This chapter reviews the machine learning classifiers used in previous research and
projects. It is not merely information gathering; it summarizes the prior research related to
this project. It involves searching, reading, analysing, summarising and evaluating reading
materials relevant to the project.
A lot of research has been done on spam detection using machine learning. But because spam
keeps evolving and new technologies keep developing, the proposed methods are not always
dependable. Natural language processing is one of the lesser known fields in machine learning,
which is reflected here in the comparatively small amount of work available.
2.2 Related work

Spam classification is a problem that is neither new nor simple. A lot of research has been
done and several effective methods have been proposed.

i. M. Raza, N. D. Jayasinghe, and M. M. A. Muslam analyzed various techniques for spam
classification and concluded that naïve Bayes and support vector machines have higher
accuracy than the rest, consistently around 91% [1].
ii. S. Gadde, A. Lakshmanarao, and S. Satyanarayana, in their paper on spam detection,
concluded that the LSTM system resulted in a higher accuracy of 98% [2].
iii. P. Sethi, V. Bhandari, and B. Kohli concluded that machine learning algorithms perform
differently depending on the presence of different attributes [3].
iv. H. Karamollaoglu, İ. A. Dogru, and M. Dorterler performed spam classification on
Turkish messages and emails using both naïve Bayes classification algorithms and support
vector machines and concluded that the accuracies of both models measured around 90% [4].
v. P. Navaney, G. Dubey, and A. Rana compared the efficiency of the SVM, naïve Bayes, and
entropy methods; the SVM had the highest accuracy (97.5%) compared to the other two
models [5].
vi. S. Nandhini and J. Marseline K.S, in their paper on the best model for spam detection,
concluded that the random forest algorithm beats the others in accuracy and KNN beats them
in building time [6].
vii. S. O. Olatunji concluded in her paper that while SVM outperforms ELM in terms of
accuracy, the ELM beats the SVM in terms of speed [7].
viii. M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta studied classical machine
learning classifiers and concluded that a convolutional neural network outperforms the
classical machine learning methods by a small margin but takes more time for
classification [8].
ix. N. Kumar, S. Sonowal, and Nishant, in their paper, reported that the naïve Bayes
algorithm is best but has class conditional limitations [9].
x. T. Toma, S. Hassan, and M. Arifuzzaman studied various types of naïve Bayes algorithms
and showed that the multinomial naïve Bayes classification algorithm has better accuracy
than the rest, with an accuracy of 98% [10].

F. Hossain, M. N. Uddin, and R. K. Halder in their study concluded that machine learning
models outperform deep learning models when it comes to spam classification and ensemble models
outperform individual models in terms of accuracy and precision [11].

2.3 Summary

From these studies, we can conclude that different models perform better on different types
of data. Naïve Bayes, random forest, SVM and logistic regression are some of the most used
algorithms in spam detection and classification.

Chapter 3
Objectives and Scope

3.1 Problem Statement

Spammers are in a continuous war with email service providers. Email service providers
implement various spam filtering methods to retain their users, and spammers continuously
change patterns and use various embedding tricks to get through the filtering. These filters
can never be too aggressive, because even a slight misclassification may cause the consumer to
lose important information. A rigid filtering method with additional reinforcements is needed
to tackle this problem.

3.2 Objectives

The objectives of this project are

i. To create an ensemble algorithm for classification of spam with the highest possible
accuracy.
ii. To study how to use machine learning for spam detection.
iii. To study how natural language processing techniques can be implemented in spam
detection.
iv. To provide the user with insights into the given text, leveraging the created algorithm
and NLP.
3.3 Project Scope

This project needs a coordinated scope of work.

i. Combine existing machine learning algorithms to form a better ensemble algorithm.
ii. Clean, process and make use of the dataset for training and testing the created model.
iii. Analyse the texts and extract entities for presentation.

3.4 Limitations

This Project has certain limitations.

i. This can only predict and classify spam but not block it.
ii. Analysis can be tricky for some alphanumeric messages, and it may struggle with entity
detection.
iii. Since the data is reasonably large, it may take a few seconds to classify and analyse
the message.

Chapter 4
Experimentation and Methods

4.1 Introduction

This chapter explains the specific details of the methodology used to develop this project.
The methodology plays an important role as a guide, making sure the project is on the right
path and working according to plan. Different methodologies can be used for spam detection
and filtering, so it is important to choose a suitable one, which in turn requires
understanding the functionality of the application itself.

4.2 System Architecture

The application overview has been presented below and it gives a basic structure of the application.

fig no. 4.1 Architecture

The UI, Text Processing and ML Models are the three important modules of this project. Each
module is explained in the later sections of this chapter.
A more detailed view of the architecture is presented in the workflow section.

4.3 Modules and Explanation

The Application consists of three modules.

i. UI
ii. Machine Learning
iii. Data Processing

I. UI Module

a. This module contains all the functions related to the UI (user interface).
b. The user interface of this application is designed using the Streamlit library from
Python-based packages.
c. The user inputs are acquired using the functions of this library and forwarded to the data
processing module for processing and conversion.
d. Finally, the output from the ML module is sent to this module and from this module to the
user in visual form.

II. Machine Learning Module

a. This module is the main module of the three.
b. This module performs everything related to machine learning and results analysis.
c. Some main functions of this module are
i. Training the machine learning models.
ii. Testing the models.
iii. Determining the respective parameter values for each model.
iv. Keyword extraction.
v. Final output calculation.
d. The output from this module is forwarded to the UI to provide a visual response to the
user.

III. Data Processing Module

a. The raw data undergoes several modifications in this module for further processing.
b. Some of the main functions of this module include
i. Data cleaning
ii. Merging of datasets
iii. Text processing using NLP
iv. Conversion of text data into numerical data (feature vectors).
v. Splitting of data.
c. All the data processing is done using the Pandas and NumPy libraries.
d. Text processing and text conversion are done using the NLTK and scikit-learn libraries.

4.4 Requirements

Hardware Requirements

PC/Laptop
RAM – 8 GB
Storage – 100-200 MB

Software Requirements

OS – Windows 7 and above

Code Editor – PyCharm, VS Code, Built-in IDE

Anaconda environment with packages nltk, numpy, pandas, sklearn, tkinter, nltk data.
Supported browser such as Chrome, Firefox, Opera, etc.

4.5 Workflow

fig no. 4.2 Workflow

In the above architecture, the objects depicted in green belong to the Data Processing
module. It includes several functions related to data processing and natural language
processing. The objects depicted in blue belong to the Machine Learning module. It is where
everything related to ML is performed.

4.5.1 Data Collection and Description

● Data plays an important role in prediction and classification: in general, the more data
there is, the higher the accuracy will be.
● The data used in this project is completely open-source and has been taken from various
resources such as Kaggle and UCI.
● For accuracy and diversity, multiple datasets are used. 2 datasets containing over 12000
mails and their labels are used for training and testing the application.
● 6000 spam mails are taken to generalise the data and to increase the accuracy.

Data Description

Dataset : enronSpamSubset
Source : Kaggle
Description : This dataset is part of a larger dataset called Enron. It contains a set of
spam and non-spam emails, with 0 for non-spam and 1 for spam in the label attribute.
Composition :
Unique values : 9687
Spam values : 5000
Non-spam values : 4687

fig no. 4.3 enron spam

Dataset : lingspam
Source : Kaggle
Description : This dataset is part of a larger dataset called Enron1, which contains emails
classified as spam or ham (not spam).
Composition :
Unique values : 2591
Spam values : 419
Non-spam values : 2172

4.5.2 Data Processing

4.5.2.1 Overall data processing

It consists of two main tasks.

● Dataset cleaning
It includes tasks such as removal of outliers, removal of null values, and removal of
unwanted features from the data.
● Dataset merging
After cleaning, the datasets are merged to form a single dataset containing only two
features (text, label).

Both data cleaning and data merging are done entirely with the Pandas library.
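
The cleaning and merging steps above can be sketched with Pandas as follows. This is only a minimal illustration: the two small frames and the column names (Body, Label, Unnamed: 0) are assumptions made for the example and are not taken from the actual datasets.

```python
import pandas as pd

# Two tiny stand-in datasets; contents and column names are hypothetical.
enron = pd.DataFrame({
    "Body": ["Win a free prize now", None, "Meeting at 10 am"],
    "Label": [1, 0, 0],
    "Unnamed: 0": [0, 1, 2],  # an unwanted feature to be dropped
})
ling = pd.DataFrame({
    "Body": ["Call for papers", "Cheap loans!!!", "Cheap loans!!!"],
    "Label": [0, 1, 1],
})


def clean(df):
    """Keep only the text and label columns, then drop nulls and duplicates."""
    out = df[["Body", "Label"]].rename(columns={"Body": "text", "Label": "label"})
    return out.dropna(subset=["text"]).drop_duplicates()


# Merge the cleaned datasets into one frame with two features (text, label).
merged = pd.concat([clean(enron), clean(ling)], ignore_index=True)
print(merged)
```

Outlier removal is left out of the sketch because the report does not state how outliers are defined.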

4.5.2.2 Textual data processing

● Tag removal
Removing all kinds of tags and unknown characters from the text using regular expressions
(regex).
● Sentencing, tokenization
Breaking down the text (email/SMS) into sentences and then into tokens (words). This is done
using the NLTK pre-processing library of Python.
● Stop word removal
Stop words such as of, a, be, … are removed using the NLTK stopwords library of Python.
● Lemmatization
Words are converted into their base forms using lemmatization and POS tagging. This process
gives keywords through entity extraction. It is done using chunking in regex and NLTK
lemmatization.
● Sentence formation
The lemmatized tokens are combined to form a sentence. This sentence is essentially the
original sentence converted into its base form with stop words removed. Then all the
sentences are combined to form a text.
● While the overall data processing is applied only to the datasets, the textual processing
is applied to the training data, the testing data and the user input data.
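
A simplified sketch of the first three steps (tag removal, tokenization and stop word removal) is given below. It uses only the Python standard library so that it runs without downloads: the small STOP_WORDS set is a stand-in for the NLTK stop word list, lemmatization is deliberately left out, and all function names are hypothetical.

```python
import re

# Tiny stand-in stop word list; the project itself uses the NLTK stop words.
STOP_WORDS = {"a", "an", "the", "of", "be", "is", "are", "to", "and", "in"}


def remove_tags(text):
    """Strip HTML-like tags, then any character that is not a letter, digit or space."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def tokenize(text):
    """Lower-case the text and split it into word tokens on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def process(text):
    """Run the three steps and join the remaining tokens back into a sentence."""
    return " ".join(remove_stop_words(tokenize(remove_tags(text))))


print(process("<p>The winner of a FREE prize is YOU!</p>"))
# prints: winner free prize you
```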

4.5.2.3 Feature Vector Formation

● The texts are converted into feature vectors (numerical data) using the words present in
all the texts combined.
● This process is done using count vectorization (the CountVectorizer class of the
scikit-learn library).
● The feature vectors can be formed using two language models: Bag of Words and Term
Frequency-Inverse Document Frequency (TF-IDF).
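
As an illustration, both kinds of feature vectors can be produced with scikit-learn roughly as shown below. The three texts are made up for the example. Note that TfidfVectorizer by default applies a smoothed IDF and L2 normalisation, so its numbers will differ from a hand calculation with the plain formula.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already processed texts used only for illustration.
texts = ["free prize waiting", "meeting agenda attached", "claim free prize"]

bow = CountVectorizer()
bow_vectors = bow.fit_transform(texts)      # sparse matrix of word counts

tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(texts)  # sparse matrix of TF-IDF weights

print(sorted(bow.vocabulary_))   # the learned vocabulary
print(bow_vectors.toarray())     # one row per text, one column per word
print(tfidf_vectors.shape)
```

The same fitted vectorizer has to be reused (with transform, not fit_transform) on the testing data and on user input, so that all vectors share one vocabulary.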

4.5.2.3.1 Bag of Words

Bag of words is a language model used mainly in text classification. A bag of words
represents the text in numerical form.
The two things required for Bag of Words are

• A vocabulary of words known to us.

• A way to measure the presence of words.

Ex: a few lines from the book "A Tale of Two Cities" by Charles Dickens.

"It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,"

The unique words here (ignoring case and punctuation) are:

["it", "was", "the", "best", "of", "times", "worst", "age", "wisdom", "foolishness"]

The next step is scoring the words present in every document.

After scoring, the four lines from the above stanza can be represented in vector form as

"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

This is the main process behind the bag of words. In reality, the vocabulary from even a
couple of documents is very large, so only words that repeat frequently and are important in
nature are kept, and the remaining words are removed during the text processing stage.
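
The example above can be reproduced with a few lines of plain Python. This is a sketch for illustration only (the helper names are hypothetical); it scores presence or absence of each word, exactly as in the vectors shown.

```python
import re

lines = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]


def words(text):
    """Lower-case the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+", text.lower())


# Build the vocabulary in order of first appearance.
vocabulary = []
for line in lines:
    for word in words(line):
        if word not in vocabulary:
            vocabulary.append(word)


def presence_vector(text):
    """Score 1 if the vocabulary word is present in the text, else 0."""
    present = set(words(text))
    return [1 if word in present else 0 for word in vocabulary]


vectors = [presence_vector(line) for line in lines]
for line, vector in zip(lines, vectors):
    print(line, vector)
```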

4.5.2.3.2 TF-IDF

Terminology for the formulae below:

t – term (word)

d – document (set of words)

N – count of documents

The TF-IDF process consists of the activities listed below.

i) Term Frequency

The number of times a particular word appears in a document, relative to the length of the
document, is called the term frequency.

$tf(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}$

ii) Document Frequency

Document frequency is the number of documents in which the word was detected. Each document
is counted once, no matter how many times the word appears in it.

$df(t) = \text{number of documents containing } t$

iii) Inverse Document Frequency

• IDF is the inverse of the document frequency.

• It measures the importance of a term t considering the information it contributes. In
term frequency every term is considered equally important, but certain terms such as (are,
if, a, be, that, ..) provide little information about the document. The inverse document
frequency factor reduces the importance of terms that recur often and increases the
importance of terms that are rare.

$idf(t) = \frac{N}{df(t)}$

Finally, the TF-IDF is calculated by combining the term frequency and the inverse document
frequency. The logarithm of the inverse document frequency is taken, with 1 added to the
denominator to avoid division by zero.

$tf\_idf(t, d) = tf(t, d) \cdot \log\left(\frac{N}{df(t) + 1}\right)$

The process can be explained using the following example:

Document 1: It is going to rain today.

Document 2: Today I am not going outside.

Document 3: I am going to watch the season premiere.

The Bag of words of the above sentences is

[going:3, to:2, today:2, i:2, am:2, it:1, is:1, rain:1]

Then finding the term frequency

table no. 4.1 Term frequency

Then finding the inverse document frequency

table no. 4.2 inverse document frequency

Applying the final equation the values of tf-idf becomes

table no. 4.3 TF-IDF
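
The three tables can be recomputed from the formulae of this section with the short sketch below. The function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated.

```python
import math
import re

documents = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
N = len(tokenized)  # count of documents


def tf(term, doc_tokens):
    """Count of the term in the document divided by the number of words in it."""
    return doc_tokens.count(term) / len(doc_tokens)


def df(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for tokens in tokenized if term in tokens)


def tf_idf(term, doc_tokens):
    """tf(t, d) * log(N / (df + 1)); natural log is assumed here."""
    return tf(term, doc_tokens) * math.log(N / (df(term) + 1))


for term in ["going", "to", "rain"]:
    print(term, df(term), round(tf_idf(term, tokenized[0]), 4))
```

With this formula a term that occurs in every document, such as "going", gets a slightly negative weight because N / (df + 1) is below 1, while a rare term such as "rain" gets a positive weight.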

Using the above two language models, the complete data has been converted into two kinds of
vectors and stored in a CSV file for easy access and minimal processing.

4.5.3 Data Splitting

Data splitting is done to create two kinds of data: training data and testing data.

Training data is used to train the machine learning models, and testing data is used to test
the models and analyse the results. 80% of the total data is selected as training data and
the remaining 20% is testing data.
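
A minimal sketch of such an 80-20 split in plain Python is shown below. The function name, the fixed seed and the toy data are assumptions for illustration; the report does not state how the split is implemented.

```python
import random


def split_data(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (text, label) pairs standing in for the merged dataset.
data = [("message %d" % i, i % 2) for i in range(10)]
train, test = split_data(data)
print(len(train), len(test))  # 8 2
```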

4.5.4 Machine Learning

4.5.4.1 Introduction

Machine learning is a process in which the computer performs certain tasks without being
given explicit instructions. In this case, the models take the training data and train on it.
Then any new, unknown data is processed based on the rules derived from the training data.
After the count vectorization and TF-IDF stages of the workflow, the data is in vector
(numerical) form, which is used for training and testing the models.
For our study, various machine learning models are compared to determine which method is most
suitable for this task. The models used include Logistic Regression, Naïve Bayes, Random
Forest Classifier, K-Nearest Neighbors and Support Vector Machine Classifier, along with a
proposed model created using an ensemble approach.

4.5.4.2 Algorithms

A combination of 5 algorithms is used for the classification.

4.5.4.2.1 Naïve Bayes Classifier

A naïve Bayes classifier is a supervised probabilistic machine learning model that is used for
classification tasks. The main principle behind this model is the Bayes theorem.

Bayes Theorem:

Naive Bayes is a classification technique based on Bayes' theorem, with the assumption that
all the features used to predict the target value are independent of each other and do not
affect each other. It calculates the probability of each class given the features. Although
the independence assumption is rarely correct in real-world data, the classifier often works
well in practice, which is why it is called "Naive" [14].

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of data B given that hypothesis A was true.

P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior
probability of A.
P(B) is the probability of the data (regardless of the hypothesis) [15].
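
As a small worked example of the theorem, take A to be "the mail is spam" and B to be "the mail contains the word free". All the probabilities below are invented for illustration and are not measured from the datasets.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical values, chosen only to make the arithmetic easy to follow.
p_spam = 0.4               # prior P(A)
p_free_given_spam = 0.5    # P(B|A)
p_free_given_ham = 0.05    # P(B|not A)

# P(B) by the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_free_given_spam, p_spam, p_free)
print(round(p_spam_given_free, 3))  # about 0.87
```

Under these assumed numbers, seeing the word raises the probability of spam from the prior of 0.4 to roughly 0.87.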

Naïve Bayes classifiers are mostly used for text classification. The limitation of the naïve
Bayes model is that it treats every word in a text as independent and equal in importance,
but words cannot all be treated as equally important, because articles and nouns do not carry
the same weight in language. Due to its classification efficiency, however, this model is
used in combination with other language processing techniques.

4.5.4.2.2 Random Forest Classifier

Random Forest classifier is a supervised ensemble algorithm. A random forest consists of
multiple random decision trees. Two types of randomness are built into the trees. First, each
tree is built on a random sample from the original data. Second, at each tree node, a subset
of features is randomly selected to generate the best split [16].
Decision Tree:

The decision tree is a classification algorithm based completely on features. The tree
repeatedly splits the data on the feature with the best information gain. This process
continues until the information gain stops improving. The unknown data is then evaluated
feature by feature until it is categorized. Tree pruning techniques are used to improve
accuracy and reduce overfitting.
Several decision trees are created on subsets of the data, and the result given by the
majority of the trees is taken as the final result. The number of trees to create is
determined from accuracy and other metrics through iterative methods. Random forest
classifiers are mainly used on condition-based data, but they also work for text once the
text has been converted into feature vectors.
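
To make the idea of "best information gain" concrete, the sketch below computes the entropy-based information gain of two binary features on a four-sample toy set. The data and the feature names are hypothetical, and entropy is assumed as the impurity measure; the report does not say which measure its trees use.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy before the split minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


labels = ["spam", "spam", "ham", "ham"]
contains_free = [1, 1, 0, 0]    # separates the classes perfectly
contains_hello = [1, 0, 1, 0]   # tells nothing about the class

print(information_gain(contains_free, labels))   # gain of 1 bit
print(information_gain(contains_hello, labels))  # gain of 0 bits
```

A tree would therefore split on contains_free first, since it has the higher gain.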

4.5.4.2.3 Logistic Regression

Logistic Regression is a supervised machine learning algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and
the outcome is binary or dichotomous [17]. The probabilities are calculated using a sigmoid
function.
For example, let us take a problem where the data has n features.

We need to fit a line to the given data, and this line can be represented by the equation

$z = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$

here z = log(odds)

Generally, odds are calculated as

$odds = \frac{p(\text{event occurring})}{p(\text{event not occurring})}$

Sigmoid Function:

A sigmoid function is a special form of logistic function, hence the name logistic
regression. The logarithm of the odds is calculated and fed into the sigmoid function to get
a continuous probability ranging from 0 to 1.

The logarithm of the odds can be calculated by

$\log(odds) = dot(features, coefficients) + intercept$

and these log-odds are used in the sigmoid function to get the probability.

$h(z) = \frac{1}{1 + e^{-z}}$

The output of the sigmoid function is a real number in the range 0 to 1, which is used to
determine which class the sample belongs to. Generally, 0.5 is taken as the threshold: values
below it are considered a NO, and values of 0.5 or higher are considered a YES. The threshold
can be adjusted based on the requirement.
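
The prediction step described above can be sketched in a few lines of Python. The coefficients and intercept used in the example calls are made up; in practice they are learned during training.

```python
import math


def sigmoid(z):
    """h(z) = 1 / (1 + e^(-z)), mapping any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, coefficients, intercept, threshold=0.5):
    """Compute the log-odds, turn them into a probability and apply the threshold."""
    log_odds = sum(x * b for x, b in zip(features, coefficients)) + intercept
    probability = sigmoid(log_odds)
    return probability, ("YES" if probability >= threshold else "NO")


# Hypothetical coefficients, for illustration only.
print(predict([1.0, 2.0], [0.5, -0.25], 0.0))  # log-odds 0 gives probability 0.5, so YES
print(predict([0.0, 0.0], [1.0, 1.0], -2.0))   # log-odds -2 gives about 0.12, so NO
```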
4.5.4.2.4 K-Nearest Neighbors

KNN is a classification algorithm that comes under supervised algorithms. All the data
points are assumed to lie in an n-dimensional space. The category of a new data point is then
determined by the majority category among its neighbors.
Euclidean distance is used to determine the distance between points.

The distance between 2 points is calculated as

$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

The distances between the unknown point and all the other points are calculated. Depending on
the value of K provided, the K closest neighbors are determined. The category to which the
majority of these neighbors belong is selected as the category of the unknown data.
If the data contains up to 3 features, the plot can be visualized. KNN is fairly slow
compared to other distance-based algorithms such as SVM, as it needs to determine the distance
to all points to find the closest neighbors of the given point.
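
The procedure can be sketched in plain Python as below. The five 2-dimensional points and their labels are toy data, and the function names are hypothetical; the Euclidean distance is written for any number of dimensions.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(training_points, training_labels, query, k=3):
    """Return the majority label among the k training points closest to the query."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Toy data for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["ham", "ham", "ham", "spam", "spam"]

print(knn_predict(points, labels, (0.5, 0.5)))  # ham
print(knn_predict(points, labels, (5.5, 5.0)))  # spam
```

Every prediction computes the distance to all training points, which is the source of the slowness mentioned above.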

4.5.4.2.5 Support Vector Machines(SVM)

It is a machine learning algorithm for classification. Decision boundaries are drawn between
the categories, and the category of a point is determined by which side of the boundary it
falls on.

Support Vectors:
The data points closest to the boundaries are called support vectors. How many support
vectors there are depends on the data. Instead of points, these are called vectors because
they are assumed to start from the origin. The distance between the support vectors on either
side of a boundary is called the margin. We want the margin to be as wide as possible because
it yields better results.

There are three types of kernels used by SVM to create boundaries.

Linear: used if the data is linearly separable.

Poly: used if the data is not linearly separable. It maps the data into a higher-dimensional
space.

Radial: this is the default kernel used in SVM. It converts any data into
infinite-dimensional data.

If the data is 2-dimensional, the boundaries are lines. If the data is 3-dimensional, the
boundaries are planes. If the data has more than 3 dimensions, the boundaries are called
hyperplanes.

An SVM depends mainly on the decision boundaries for its predictions. It does not compare the
data to all the other data to get a prediction, so SVMs tend to be quick with predictions.
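
The sketch below illustrates why prediction with a linear boundary is quick: once the boundary is known, classifying a point needs only one dot product, not a comparison with every training point. The weights w and the offset b are hypothetical; a real SVM learns them from the training data, and this sketch does not perform that training.

```python
def decision_value(x, w, b):
    """Signed score of point x relative to the linear boundary w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Points on the non-negative side are labelled spam, the others ham."""
    return "spam" if decision_value(x, w, b) >= 0 else "ham"


# Hypothetical boundary x1 + x2 = 3 in a 2-dimensional feature space.
w = [1.0, 1.0]
b = -3.0

print(classify([4.0, 2.0], w, b))  # spam (score 3.0)
print(classify([1.0, 1.0], w, b))  # ham (score -1.0)
```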

4.5.5 Experimentation

The process consists of data collection and processing, then natural language processing,
then vectorization, and finally machine learning. The data is collected, cleaned, and then
subjected to the natural language processing techniques specified in section IV. The cleaned
data is then converted into vectors using the Bag of Words and TF-IDF methods.
The data is split into training data and testing data in an 80-20 ratio. The training and
testing data are converted into Bag-of-Words vectors and TF-IDF vectors.
There are several metrics for evaluating the models, but accuracy is used to compare the BoW
and TF-IDF models. Accuracy is generally used to determine the efficiency of a model.
Accuracy:

“Accuracy is the number of correctly predicted data points out of all the data points”.

Naïve Bayes Classification algorithm:

Two models, one for BoW and one for TF-IDF, are created and trained using the respective
training vectors and training labels. The respective testing vectors and labels are then used
to get the score for each model.

fig no. 4.5 naïve Bayes

The scores for Bag-of-Words and TF-IDF are visualized.
With the naïve Bayes model, the scores for the BoW and TF-IDF models are 98.04 and 96.05
respectively.
Logistic Regression:
Two models are created and tested following the same procedure used for the naïve Bayes
models, and the results obtained are visualized below.

fig no. 4.6 Logistic Regression (Bow vs TF-IDF)

The scores for BoW and TF-IDF models are 98.53 and 98.80 respectively.

K-Nearest Neighbors:

Similar to the above models, the models are created and trained using the respective vectors
and labels. In addition to the data, the number of neighbors to be considered must also be
provided.
Using an iterative method, K = 3 (number of neighbors) provided the best results for the BoW
model and K = 9 provided the best results for the TF-IDF model.

The scores for BoW and TF-IDF over the K values are visualized below.

fig no. 4.7 Neighbors vs Accuracy

Taking K = 3 and K = 9 for BoW and TF-IDF respectively, the scores are calculated and
presented below.

fig no. 4.8 KNN (Bow vs TF-IDF)

Random Forest:

Similar to the previous algorithms, two models are created and trained using the respective
training vectors and training labels. The number of trees to be used for the forest has to be
provided.

fig no. 4.9 Random Forest (trees vs score)

Using the iterative method, the best value for the number of trees is determined. From the
results, it is clear that 19 estimators provide the best score for both the BoW and TF-IDF
models. The number of trees and the scores for both models are shown below.

fig no. 4.10 Random Forest (bow vs tfidf)

Support Vector Machines (SVM):

Finally, two SVM models, one for BoW and one for TF-IDF, are created and trained using the
respective training vectors and labels. They are then tested using the testing vectors and
labels.

fig no. 4.11 SVM(Bow vs TF_IDF)

The scores for BoW and TF-IDF models are 59.41 and 98.82 respectively.

Proposed Model:
In our proposed system, we combine all the models into one. It takes an unknown point and
feeds it into every model to get predictions. It then takes these predictions, finds the
category predicted by the majority of the models, and returns it as the final result.
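
The voting step can be sketched as follows. The dictionary of predictions is hypothetical and stands in for the outputs of the five trained models on a single message.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the category predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions of the five individual models for one message.
model_predictions = {
    "naive_bayes": "spam",
    "logistic_regression": "spam",
    "random_forest": "ham",
    "knn": "ham",
    "svm": "spam",
}

final = majority_vote(model_predictions.values())
print(final)  # spam
```

With five models and two categories a tie cannot occur, which is one practical reason for combining an odd number of models.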

To determine which model is effective, we used three metrics: accuracy, precision and F1
score. In the earlier comparison we used only accuracy, because we were determining not which
model is best but which language model is best suited for classification.
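
For reference, the three metrics can be computed from true and predicted labels as in the sketch below. The label lists are invented for illustration, with 1 standing for spam and 0 for non-spam.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, F1 score) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, f1


# Invented labels: 1 = spam, 0 = non-spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

accuracy, precision, f1 = evaluate(y_true, y_pred)
print(accuracy, precision, f1)  # 0.75 0.75 0.75
```

Precision matters for a spam filter because a false positive means a legitimate mail is marked as spam.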

4.5.6 User Interface(UI)

The user interface (UI) is an important component of this application. The user interacts
only with the interface.
The UI of this project has been built with the help of an open-source library called
Streamlit. The complete information and API reference can be obtained from the Streamlit
documentation.

4.5.7 Working Procedure

The working procedure includes the internal working and the data flow of application.

i. After running the application some procedures are automated.

1. Reading data from file

2. Cleaning the texts

3. Processing

4. Splitting the data

5. Initialising and training the models

ii. The user just needs to provide some data to classify in the area provided.

iii. The provided data undergoes several procedures after submission.

1. Textual Processing

2. Feature Vector conversion

3. Entity extraction

iv. The created vectors are provided to the trained models to get predictions.

v. After getting the predictions, the category predicted by the majority is selected.

vi. The accuracies of that prediction are calculated.

vii. The accuracies and the entities extracted in step iii are provided to the user.

Every time the user provides something new, the procedure from step ii is repeated.

Chapter 5
Results and Discussion

5.1 Language Model Selection

To select the best language model, the data was converted into both types of vectors and the
models were then tested to determine the best model for classifying spam.
The results from the individual models are presented in the experimentation section under
methodology. The results from the models are compared below.

fig no. 5.1 Bow vs TF-IDF (Cumulative)

From the figure it is clear that TF-IDF performs better than BoW in every model tested. Hence
TF-IDF has been selected as the primary language model for converting textual data into
feature vectors.
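
The difference between the two representations can be seen in a small sketch. It follows the definitions from Section 4.5.2.3.2, tf(t, d) = count of t in d / number of words in d and tf_idf(t, d) = tf(t, d) * log(N / (df + 1)). The toy corpus and the function names are hypothetical and are not part of the project's data or code.

```python
import math
from collections import Counter


def bag_of_words(doc_tokens, vocabulary):
    """Raw count of each vocabulary term in one document."""
    counts = Counter(doc_tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(doc_tokens, vocabulary, corpus):
    """TF-IDF weights using tf * log(N / (df + 1))."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(doc_tokens)
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        vector.append(tf * math.log(n_docs / (df + 1)))
    return vector


# Toy corpus, invented for illustration only.
corpus = [
    "free prize claim now".split(),
    "meeting at noon today".split(),
    "claim your free prize today".split(),
    "lunch at noon".split(),
]
vocabulary = sorted({token for doc in corpus for token in doc})

bow = bag_of_words(corpus[0], vocabulary)
weights = tf_idf(corpus[0], vocabulary, corpus)
```

In the first document, "claim" and "now" both have a BoW count of 1, but "now" appears in fewer documents and therefore receives the larger TF-IDF weight. This extra weighting of rarer terms is the information that BoW does not carry.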

5.2 Proposed Model results

To determine which model is effective, we used three metrics: accuracy, precision, and F1 score.
The resulting values for the proposed model (in percent) are

Accuracy – 99.0

Precision – 98.5

F1 Score – 98.6

5.3 Comparison

Model                 Accuracy   Precision   F1 Score
Naïve Bayes           96.0       99.2        95.2
Logistic Regression   98.4       97.8        98.6
Random Forest         96.8       96.4        96.3
KNN                   96.6       96.9        96.0
SVM                   98.8       97.8        98.6
Proposed model        99.0       98.5        98.6

Table no. 5.1
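
The comparison in Table 5.1 can be checked with a short sketch that finds the leading model or models for each metric. The values are copied from the table; the dictionary layout and the helper name `best_models` are hypothetical.

```python
# Values from Table 5.1.
results = {
    "Naïve Bayes":         {"accuracy": 96.0, "precision": 99.2, "f1": 95.2},
    "Logistic Regression": {"accuracy": 98.4, "precision": 97.8, "f1": 98.6},
    "Random Forest":       {"accuracy": 96.8, "precision": 96.4, "f1": 96.3},
    "KNN":                 {"accuracy": 96.6, "precision": 96.9, "f1": 96.0},
    "SVM":                 {"accuracy": 98.8, "precision": 97.8, "f1": 98.6},
    "Proposed model":      {"accuracy": 99.0, "precision": 98.5, "f1": 98.6},
}


def best_models(results, metric):
    """Return every model that reaches the top score for `metric`."""
    top = max(scores[metric] for scores in results.values())
    return sorted(name for name, scores in results.items()
                  if scores[metric] == top)


leaders = {metric: best_models(results, metric)
           for metric in ("accuracy", "precision", "f1")}
```

The proposed model leads on accuracy, shares the top F1 score with Logistic Regression and SVM, and is behind only Naïve Bayes on precision.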

5.4 Summary

There are two main tasks in the project implementation: language model selection for completing the
textual processing phase, and creation of the proposed model from the individual algorithms. Both
tasks require comparison against other models and selection of various parameters for better efficiency.
During the language model selection phase, two models, Bag of Words and TF-IDF, are compared, and
the results obtained show that TF-IDF performs better.
During the proposed model design, various algorithms are tested with different parameters to find the
best parameters. The models are merged to form an ensemble algorithm, and the results obtained are
presented and compared above. The results show that the proposed model outperforms the others in
almost every metric: it has the highest accuracy, shares the highest F1 score with Logistic Regression
and SVM, and is second only to Naïve Bayes in precision.

Chapter 6
Conclusion and Future Scope

6.1 Conclusion

From the results obtained, we can conclude that an ensemble machine learning model is more
effective in the detection and classification of spam than any individual algorithm. We can also
conclude that the TF-IDF (term frequency-inverse document frequency) language model is more
effective than the Bag of Words model in the classification of spam when combined with several
algorithms. Finally, we can say that spam detection can improve further if machine learning
algorithms are combined and tuned to needs.

6.2 Future Work

There are numerous applications of machine learning and natural language processing, and
when combined they can solve some of the most troubling problems concerned with texts. This
application can be scaled to take in text in bulk, so that classification can be done more
effectively on public sites.

Other categories, such as negative, phishing, or malicious content, can be used to train the model
to filter things such as public comments on various social sites. This application can be converted
into an online machine learning system that is easily updated with the latest trends in spam
and other mail, so that the system can adapt to new types of spam emails and texts.

References

[1] S. H. a. M. A. T. Toma, "An Analysis of Supervised Machine Learning Algorithms for Spam
Email Detection," in International Conference on Automation, Control and Mechatronics for
Industry 4.0 (ACMI), 2021.
[2] S. Nandhini and J. Marseline K.S., "Performance Evaluation of Machine Learning Algorithms
for Email Spam Detection," in International Conference on Emerging Trends in Information
Technology and Engineering (ic-ETITE), 2020.
[3] A. L. a. S. S. S. Gadde, "SMS Spam Detection using Machine Learning and Deep Learning
Techniques," in 7th International Conference on Advanced Computing and Communication
Systems (ICACCS), 2021, 2021.
[4] V. B. a. B. K. P. Sethi, "SMS spam detection and comparison of various machine learning
algorithms," in International Conference on Computing and Communication Technologies for
Smart Nation (IC3TSN), 2017.
[5] G. D. a. A. R. P. Navaney, "SMS Spam Filtering Using Supervised Machine Learning
Algorithms," in 8th International Conference on Cloud Computing, Data Science & Engineering
(Confluence), 2018.
[6] S. O. Olatunji, "Extreme Learning Machines and Support Vector Machines models for email
spam detection," in IEEE 30th Canadian Conference on Electrical and Computer Engineering
(CCECE), 2017.
[7] S. S. a. N. N. Kumar, "Email Spam Detection Using Machine Learning Algorithms," in Second
International Conference on Inventive Research in Computing Applications (CIRCA), 2020.
[8] R. Madan, "medium.com," [Online]. Available:
https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanatio n-for-
text-classification-in-nlp-with-code-8ca3912e58c3.
[9] N. D. J. a. M. M. A. M. M. RAZA, "A Comprehensive Review on Email Spam Classification
using Machine Learning Algorithms," in International Conference on Information Networking
(ICOIN), 2021, 2021.
[10] A. B. S. A. a. P. M. M. Gupta, "A Comparative Study of Spam SMS Detection Using Machine
Learning Classifiers," in Eleventh International Conference on Contemporary Computing (IC3),
2018.
[11] M. M. J. Fattahi, "SpaML: a Bimodal Ensemble Learning Spam Detector based on NLP
Techniques," in IEEE 5th International Conference on Cryptography, Security and

32
Privacy (CSP), 2021, 2021.

[12] Harika, "Analytics Vidhya," [Online]. Available:

https://www.analyticsvidhya.com/blog/2021/07/an-introduction-to-logistic-regression/.
[13] İ. A. D. a. M. D. H. Karamollaoglu, "Detection of Spam E-mails with Machine Learning
Methods," in Innovations in Intelligent Systems and Applications Conference (ASYU), 2018.
[14] M. N. U. a. R. K. H. F. Hossain, "Analysis of Optimized Machine Learning and Deep Learning
Techniques for Spam Detection," in IEEE International IoT, Electronics and Mechatronics
Conference (IEMTRONICS), 2021.
[15] H. Deng, "Towards Data Science," [Online]. Available: https://towardsdatascience.com/random-
forest-3a55c3aca46d.

A. Screenshots
