spam detection
Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 Introduction
2.2 Related work
2.3 Summary
3 Objectives and Scope
3.1 Problem statement
3.2 Objectives
3.3 Project Scope
3.4 Limitations
4 Experimentation and Methods
4.1 Introduction
4.2 System architecture
4.3 Modules and Explanation
4.4 Requirements
4.5 Workflow
4.5.1 Data collection and Description
4.5.2 Data Processing
4.5.2.1 Overall Data Processing
4.5.2.2 Textual Data Processing
4.5.2.3 Feature Vector Formation
4.5.2.3.1 Bag of Words
4.5.2.3.2 TF-IDF
4.5.3 Data Splitting
4.5.4 Machine Learning
4.5.4.1 Introduction
4.5.4.2 Algorithms
List of Figures
Fig No Title
4.1 Architecture
4.2 Workflow
4.3 Enron Data
4.4 Ling spam
4.5 Naïve Bayes (BoW vs TF-IDF)
4.6 Logistic Regression (BoW vs TF-IDF)
4.7 Neighbors vs Accuracy (KNN)
4.8 KNN (BoW vs TF-IDF)
4.9 Random Forest (trees vs scores)
4.10 Random Forest (BoW vs TF-IDF)
4.11 SVM (BoW vs TF-IDF)
5.1 BoW vs TF-IDF (Cumulative)
5.2 Comparison of Models
Chapter 1
Introduction
Today, spam has become a major problem in communication over the internet. Around 55% of
all emails are reported as spam, and the number has been growing steadily. Spam, also
known as unsolicited bulk email, has grown alongside the increasing use of email, because
email provides a perfect way to send unwanted advertisements or junk newsgroup postings
at no cost to the sender. This opportunity has been extensively exploited by irresponsible
organizations, cluttering the mailboxes of millions of people all around the world.
Spam is a major concern: its content can be offensive, it wastes time, and it puts the end
user at risk of deleting legitimate mail by mistake. Moreover, spam has an economic
impact, which has led some countries to adopt legislation.
Text classification is used to route an incoming mail or message either into the inbox
or straight to the spam folder. It is the process of assigning categories to text according
to its content. It is used to organize, structure and categorize text. It can be done either
manually or automatically. Machine learning classifies text automatically and much faster
than manual techniques. Machine learning uses pre-labelled text to learn the associations
between pieces of text and their expected output. It uses feature extraction to transform
each text into a numerical representation in the form of a vector, which represents the
frequency of each word in a predefined dictionary.
Text classification is important for structuring unstructured and messy text, such as
documents and spam messages, in a cost-effective way. Machine learning can make more
accurate predictions in real time and replace the slow manual process with a much better
and faster way of analysing big data. This is especially important for a company that needs
to analyse text data, inform business decisions and even automate business processes.
In this project, machine learning techniques are used to detect spam messages in
mail. Machine learning is where computers can learn to do something
without the need to explicitly program them for the task.
It uses data to produce a program that performs a task such as classification. Compared to
knowledge engineering, machine learning techniques require messages that have been
successfully pre-classified. The pre-classified messages make up the training dataset, which
is used to fit the learning algorithm to the model in machine learning studio.
A combination of algorithms is used to learn the classification rules from messages.
These algorithms are used to classify objects into different classes. They are
provided with pre-labelled data and an unknown text. After learning from the pre-labelled
data, each of these algorithms predicts which class the unknown text may belong to, and the
category predicted by the majority is considered final.
Chapter 2
Literature Review
2.1 Introduction
This chapter reviews the literature on machine learning classifiers used in previous
research and projects. It is not merely information gathering; it summarizes the
prior research related to this project. It involves searching, reading, analysing,
summarising and evaluating reading materials relevant to the project.
A lot of research has been done on spam detection using machine learning. But because spam keeps
evolving and new technologies keep developing, the proposed methods may no longer be dependable.
Natural language processing is one of the lesser-known fields in machine learning, and this is
reflected in the comparatively small amount of work available.
2.2 Related work
Spam classification is a problem that is neither new nor simple. A lot of research has been
done and several effective methods have been proposed.
naïve Bayes, and entropy method and the SVM had the highest accuracy (97.5%)
compared to the other two models [5].
vi. S. Nandhini and J. Marseline K.S., in their paper on the best model for spam
detection, concluded that the random forest algorithm beats others in accuracy and
KNN in building time [6].
vii. S. O. Olatunji concluded in her paper that while SVM outperforms ELM in terms
of accuracy, the ELM beats the SVM in terms of speed [7].
viii. M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta studied classical machine
learning classifiers and concluded that a convolutional neural network outperforms the
classical machine learning methods by a small margin but takes more time for
classification [8].
ix. N. Kumar, S. Sonowal, and Nishant, in their paper, reported that the naïve Bayes
algorithm is best but has class-conditional limitations [9].
x. T. Toma, S. Hassan, and M. Arifuzzaman studied various types of naïve Bayes
algorithms and showed that the multinomial naïve Bayes classification algorithm has
better accuracy than the rest, with an accuracy of 98% [10].
xi. F. Hossain, M. N. Uddin, and R. K. Halder concluded in their study that machine learning
models outperform deep learning models when it comes to spam classification, and that ensemble
models outperform individual models in terms of accuracy and precision [11].
2.3 Summary
From these studies, we can conclude that different models perform better on different types
of data. Naïve Bayes, random forest, SVM and logistic regression are some of the most used
algorithms in spam detection and classification.
Chapter 3
Objectives and Scope
Spammers are in a continuous war with email service providers. Email service providers
implement various spam filtering methods to retain their users, and spammers continuously
change patterns and use various embedding tricks to get through filtering. These filters can never be
too aggressive, because even a slight misclassification may cause the consumer to lose important
information. A rigid filtering method with additional reinforcements is needed to tackle this problem.
3.2 Objectives
iii. To study how natural language processing techniques can be implemented in spam
detection.
iv. To provide the user with insights into the given text by leveraging the created algorithm and
NLP.
3.3 Project Scope
3.4 Limitations
i. This can only predict and classify spam but not block it.
ii. Analysis can be tricky for some alphanumeric messages, and it may struggle with entity
detection.
iii. Since the data is reasonably large, it may take a few seconds to classify and analyse
the message.
Chapter 4
Experimentation and Methods
4.1 Introduction
This chapter explains the methodology used to develop this project. The methodology
serves as a guide for the project, making sure it stays on the right path and works
as planned. Different methodologies can be used for spam detection and filtering, so it
is important to choose a suitable one, which in turn requires understanding the
functionality of the application itself.
The application overview is presented below and gives the basic structure of the application.
The UI, Text processing and ML Models are the three important modules of this project. Each
module’s explanation is given in the later sections of this chapter.
A more detailed view of the architecture is presented in the workflow section.
i. UI
iii. Data Processing
I. UI Module
b. The user interface of this application is designed using the Streamlit library from Python
based packages.
c. The user inputs are acquired using the functions of this library and forwarded to the data
processing module for processing and conversion.
d. Finally, the output from the ML module is sent to this module, and from this module to the user in
visual form.
b. This module performs everything related to machine learning and results analysis.
d. The output from this module is forwarded to the UI to provide a visual response to the user.
e.
III. Data Processing Module
a. The raw data undergoes several modifications in this module for further processing.
i. Data cleaning
v. Splitting of data.
c. All the data processing is done using the Pandas and NumPy libraries.
d. Text processing and text conversion are done using the NLTK and scikit-learn libraries.
4.4 Requirements
Hardware Requirements
PC/Laptop
RAM – 8 GB
Storage – 100-200 MB
Software Requirements
Anaconda environment with packages nltk, numpy, pandas, sklearn, tkinter, nltk data.
Supported browser such as Chrome, Firefox or Opera.
4.5 Workflow
In the above architecture, the objects depicted in green belong to a module called Data
Processing. It includes several functions related to data processing and natural language processing.
The objects depicted in blue belong to the Machine Learning module. It is where everything related to ML
● Data plays an important role in prediction and classification; in general, the more
data is available, the better the accuracy.
● The data used in this project is completely open-source and has been taken from various
resources like Kaggle and UCI.
● For the purpose of accuracy and diversity in data, multiple datasets are taken. 2
datasets containing over 12000 mails and their labels are used for
training and testing the application.
● 6000 spam mails are taken for generalisation of data and to increase the
accuracy.
Data Description
Dataset : enronSpamSubset.
Source : Kaggle
Description : This dataset is part of a larger dataset called Enron1,
which contains emails classified as spam or ham (not spam).
Composition :
Dataset : lingspam.
Source : Kaggle
4.5.2 Data Processing
● Dataset cleaning
It includes tasks such as removal of outliers, null value removal, and removal of unwanted
features from data.
● Dataset Merging
After data cleaning, the datasets are merged to form a single dataset containing only two
features (text, label).
Data cleaning and data merging are done entirely using the Pandas library.
● Tag removal
Removing all kinds of tags and unknown characters from text using regular
expressions through Regex library.
● Sentencing, tokenization
Stop words such as of, a, be, … are removed using the stopwords corpus of the NLTK library of
Python.
● Lemmatization
Words are converted into their base forms using lemmatization and POS
tagging.
This process gives keywords through entity extraction.
● Sentence formation
This sentence is essentially a sentence converted into its base form with stop
words removed.
Then all the sentences are combined to form a text.
● While the overall data processing is done only on datasets, the textual processing is
done on training data, testing data, and user input data.
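The textual processing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the function name clean_text and the tiny STOP_WORDS set are hypothetical stand-ins (the project uses the NLTK stopwords corpus), and lemmatization and POS tagging are left out to keep the sketch dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the project uses NLTK's stopwords corpus.
STOP_WORDS = {"of", "a", "be", "the", "is", "to", "and", "in"}

def clean_text(raw):
    """Strip tags and unknown characters, tokenize, and drop stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)             # tag removal
    letters = re.sub(r"[^A-Za-z\s]", " ", no_tags)     # keep letters and whitespace only
    tokens = letters.lower().split()                   # simple whitespace tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return " ".join(kept)                              # sentence formation

print(clean_text("<p>Win a FREE prize, to be claimed in 24 hours!</p>"))
# win free prize claimed hours
```

The same function would be applied to training data, testing data and user input, so that all three pass through identical processing.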
4.5.2.3 Feature Vector Formation
● The texts are converted into feature vectors (numerical data) using the words present
in all the texts combined.
● This process is done using the CountVectorizer class of the scikit-learn library.
● The feature vectors can be formed using two language models: Bag of Words and
Term Frequency-Inverse Document Frequency (TF-IDF).
Bag of words is a language model used mainly in text classification. A bag of words represents the
text in a numerical form.
The two things required for Bag of Words are a vocabulary of known words and a way of scoring
the presence of those words in each document.
Ex: a few lines from the book “A Tale of Two Cities” by Charles Dickens give the vocabulary
[ “it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”, “foolishness” ]
The next step is scoring the words present in every document.
After scoring, the lines can be represented in vector form as
"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
This is the main process behind the bag of words, but in reality the vocabulary even from a couple of
documents is very large. Words that repeat frequently and are important in nature are kept, and the
remaining words are removed during the text processing stage.
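The scoring step can be reproduced with a short sketch. It assumes the fixed ten-word vocabulary from the example and a binary presence score; the function name bag_of_words is illustrative only. A count-based vectorizer such as scikit-learn's CountVectorizer would record word counts instead of 0/1 flags.

```python
VOCABULARY = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def bag_of_words(text, vocabulary=VOCABULARY):
    """Return a binary presence vector: 1 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

print(bag_of_words("It was the best of times"))   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(bag_of_words("it was the worst of times"))  # [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
```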
t – term (word)
d – document (set of words)
N – count of documents
The TF-IDF process consists of various activities listed below.
i) Term Frequency
Document frequency is the count of documents in which the word was detected. We consider one
instance of a word, and it doesn’t matter if the word is present multiple times.
df(t) = occurrence of t in documents
Finally, the TF-IDF can be calculated by combining the term frequency and inverse document
frequency.
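A small sketch may help make the combination concrete. The exact formulas are not given in this section, so the code assumes one common set of definitions: tf(t, d) is the count of t in d divided by the number of words in d, idf(t) = log(N / df(t)), and tf-idf is their product. Other variants (for example with smoothing in the denominator) exist, and library implementations may differ.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's words that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def df(term, corpus):
    """Document frequency: number of documents containing `term` at least once."""
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    """Inverse document frequency (assumes the term occurs in at least one document)."""
    return math.log(len(corpus) / df(term, corpus))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "it was the best of times".split(),
    "it was the worst of times".split(),
]
print(tf_idf("best", corpus[0], corpus))  # positive: "best" occurs in only one document
print(tf_idf("it", corpus[0], corpus))    # 0.0: "it" occurs in every document
```

Words that appear in every document receive a weight of zero under these definitions, which is how TF-IDF down-weights uninformative terms.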
Then finding the term frequency
Applying the final equation the values of tf-idf becomes
Using the above two language models, the complete data has been converted into two kinds of
vectors and stored in CSV files for easy access and minimal processing.
Data splitting is done to create two kinds of data: training data and testing data.
Training data is used to train the machine learning models, and testing data is used to test the
models and analyse results. 80% of the total data is selected as training data and the remaining
20% is testing data.
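The 80-20 split can be illustrated with the standard library alone. The helper name split_data, the seed and the sample data are assumptions made for this sketch; in a scikit-learn workflow the train_test_split function serves the same purpose.

```python
import random

def split_data(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and split it into (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

emails = [(f"message {i}", i % 2) for i in range(10)]  # hypothetical (text, label) pairs
train, test = split_data(emails)
print(len(train), len(test))  # 8 2
```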
4.5.4.1 Introduction
Machine learning is a process in which the computer performs certain tasks without being given
explicit instructions. In this case the models take the training data and train on it.
Then any new, unknown data is processed based on the rules
derived from the training data.
After completing the count vectorization and TF-IDF stages in the workflow, the data is converted into
vector form (numerical form), which is used for training and testing models.
For our study, various machine learning models are compared to determine which method is more
suitable for this task. The models used for the study include Logistic Regression, Naïve Bayes,
Random Forest Classifier, K Nearest Neighbors, and Support Vector Machine Classifier, and a
proposed model which was created using an ensemble approach.
4.5.4.2 Algorithms
A naïve Bayes classifier is a supervised probabilistic machine learning model that is used for
classification tasks. The main principle behind this model is the Bayes theorem.
Bayes Theorem:
Naive Bayes is a classification technique that is based on Bayes’ Theorem with an assumption that all
the features that predict the target value are independent of each other. It calculates the probability of
The Naive Bayes classifier assumes that the features we use to predict the target are independent and
do not affect each other. Although the independence assumption rarely holds in real-world data, the
classifier often works well in practice. This assumption is why it is called “Naive” [14].
P(A|B) = P(B|A) P(A) / P(B)
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of data B given that hypothesis A was true.
P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior
probability of A.
P(B) is the probability of the data (regardless of the hypothesis) [15].
Naïve Bayes classifiers are mostly used for text classification. The limitation of the Naïve Bayes
model is that it treats every word in a text as independent and equal in importance, but words
cannot all be treated as equally important: articles and nouns, for example, do not carry the same
weight in language. Because of its classification efficiency, however, this model is used in
combination with other language processing techniques.
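To show how Bayes' theorem turns into a text classifier, here is a minimal multinomial naïve Bayes sketch. It is illustrative only: the function names and toy messages are invented, add-one (Laplace) smoothing is an assumption not stated in the text, and P(B) is dropped because it is the same for every class.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class with the highest log posterior (up to the constant P(B))."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)  # log prior, log P(A)
        total_words = sum(word_counts[c].values())
        for w in tokens:
            # log P(word | class) with add-one smoothing (an assumption of this sketch)
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["win", "free", "prize"], ["free", "offer", "now"],
        ["meeting", "at", "noon"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(["free", "prize"], model))     # spam
print(predict_nb(["meeting", "notes"], model))  # ham
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long messages.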
The Random Forest classifier is a supervised ensemble algorithm. A random forest consists of multiple
random decision trees. Two types of randomness are built into the trees. First, each tree is built on a
random sample from the original data. Second, at each tree node, a subset of features is randomly
selected to generate the best split [16].
Decision Tree:
The decision tree is a classification algorithm based completely on features. The tree
repeatedly splits the data on a feature with the best information gain. This process continues until the
information gained remains constant. Then the unknown data is evaluated feature by feature until
categorized. Tree pruning techniques are used for improving accuracy and reducing the overfitting of
data.
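The notion of information gain used for choosing a split can be made concrete with a short sketch. It assumes entropy as the impurity measure, which is the usual choice when information gain is mentioned but is not stated explicitly here; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted

labels = ["spam", "spam", "ham", "ham"]
print(information_gain(labels, ["spam", "spam"], ["ham", "ham"]))  # 1.0, a perfect split
print(information_gain(labels, ["spam", "ham"], ["spam", "ham"]))  # 0.0, an uninformative split
```

A tree-building procedure would evaluate this quantity for each candidate feature and split on the one with the highest gain.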
Several decision trees are created on subsets of the data, and the result given by the majority of trees
is considered the final result. The number of trees to be created is determined based on accuracy
and other metrics through iterative methods. Random forest classifiers are mainly used on
condition-based data, but they also work for text if the text is first converted into numerical feature
vectors.
4.5.4.2.3 Logistic Regression
Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and the outcome
is binary or dichotomous [17]. The probabilities are calculated using a sigmoid function.
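For example, let us take a problem where the data has n features.
We need to fit a line for the given data, and this line can be represented by the equation
z = b0 + b1*x1 + b2*x2 + ... + bn*xn
here z = log(odds), b0 is the intercept and b1 ... bn are the coefficients of the features x1 ... xn.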
Sigmoid Function:
The sigmoid function is a special form of the logistic function, hence the name logistic regression.
The logarithm of odds is calculated and fed into the sigmoid function to get a continuous probability
ranging from 0 to 1.
log(odds) = dot(features, coefficients) + intercept
These log odds are used in the sigmoid function to get the probability.
h(z) = 1/(1 + e^(-z))
The output of the sigmoid function is a real number in the range 0 to 1, which is used to determine
which class the sample belongs to. Generally, 0.5 is taken as the threshold: a value below it is
considered a NO, and 0.5 or higher is considered a YES. The threshold can be adjusted based on the
requirement.
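The two formulas above translate directly into code. The sketch below uses made-up coefficients and an intercept purely for illustration; in practice these values are learned from the training vectors.

```python
import math

def sigmoid(z):
    """h(z) = 1 / (1 + e^(-z)), mapping any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """log(odds) = dot(features, coefficients) + intercept, passed through the sigmoid."""
    log_odds = sum(x * w for x, w in zip(features, coefficients)) + intercept
    return sigmoid(log_odds)

def classify(features, coefficients, intercept, threshold=0.5):
    """Return YES at or above the threshold, NO below it."""
    return "YES" if predict_proba(features, coefficients, intercept) >= threshold else "NO"

coefficients, intercept = [1.5, -2.0], 0.25  # hypothetical values, not learned from data
print(sigmoid(0.0))                                   # 0.5
print(classify([2.0, 0.5], coefficients, intercept))  # YES (log odds = 2.25)
print(classify([0.0, 1.0], coefficients, intercept))  # NO (log odds = -1.75)
```

Raising the threshold above 0.5 makes the classifier more conservative about answering YES, which is one way to reduce the risk of flagging legitimate mail.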
4.5.4.2.4 K-Nearest Neighbors
KNN is a supervised classification algorithm. All the data points are
assumed to lie in an n-dimensional space, and the category of the current data point is
determined by the majority category among its neighbors.
Euclidean distance is used to determine the distance between points.
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
The distances between the unknown point and all the others are calculated. Depending on the K
provided, the K closest neighbors are determined. The category to which the majority of the neighbors
belong is selected as the category of the unknown data.
If the data contains up to 3 features then the plot can be visualized. KNN is fairly slow compared to
other distance-based algorithms such as SVM, as it needs to determine the distance to all points to get
the closest neighbors to the given point.
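The procedure described above (compute all distances, keep the K closest, take the majority label) can be sketched as follows. The points and labels are invented two-dimensional examples; real feature vectors from BoW or TF-IDF have many more dimensions, which the euclidean helper handles unchanged.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(point, train_points, train_labels, k=3):
    """Label the point with the majority category among its k nearest neighbors."""
    distances = sorted(
        (euclidean(point, p), label) for p, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
print(knn_predict((1.2, 1.4), points, labels, k=3))  # ham
print(knn_predict((6.4, 6.1), points, labels, k=3))  # spam
```

Every prediction scans the whole training set, which is the cost referred to in the paragraph above.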
A support vector machine (SVM) is a machine learning algorithm for classification. Decision boundaries
are drawn between the categories, and the category of a point is determined by the side of the boundary
on which it falls.
Support Vectors:
The vectors closer to boundaries are called support vectors/planes. If there are n categories then there
will be n+1 support vectors. Instead of points, these are called vectors because they are assumed to be
starting from the origin. The distance between the support vectors is called the margin. We want our
margin to be as wide as possible because it yields better results.
Poly: used if data is not separable. It converts any data into 3-dimensional data.
Radial: this is the default kernel used in SVM. It converts any data into infinite-dimensional data.
If the data is 2-dimensional then the boundaries are lines. If the data is 3-dimensional then the
boundaries are planes. If the data has more than 3 dimensions then the boundaries are called hyperplanes.
An SVM mainly depends on the decision boundaries for predictions. It doesn’t compare the data to all
other data to get the prediction; because of this, SVMs tend to be quick with predictions.
4.5.5 Experimentation
The process consists of data collection and processing, then natural language processing, then
vectorization, and then machine learning. The data is collected, cleaned, and then subjected to the natural
language processing techniques specified in section IV. Then the cleaned data is converted into
vectors using Bag of Words and TF-IDF methods which goes like...
The data is split into training data and testing data in an 80-20 split ratio. The training and testing
data are converted into Bag-of-Words vectors and TF-IDF vectors.
There are several metrics to evaluate the models but accuracy is considered for comparing BoW and
TF-IDF models. Accuracy is generally used to determine the efficiency of a model.
Accuracy:
“Accuracy is the number of correctly predicted data points out of all the data points”.
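The definition quoted above amounts to a one-line computation. The label lists here are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham"]
print(accuracy(y_true, y_pred))  # 0.8 (4 of 5 correct)
```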
Two models, one for BoW and one for TF-IDF, are created and trained using the respective training
vectors and training labels. Then the respective testing vectors and labels are used to get the score for
each model.
The scores for Bag-of-Words and TF-IDF are visualized.
The scores for the BoW and TF-IDF models are 98.04 and 96.05 respectively when using the
naïve Bayes model.
Logistic Regression:
Two models are created following the same procedure used for the naïve Bayes models and then tested;
the results obtained are visualized below.
K-Nearest Neighbors:
Similar to the above models, the models are created and trained using the respective vectors and labels.
But in addition to the data, the number of neighbors to be considered should also be provided.
Using an iterative method, K = 3 (number of neighbors) provided the best results for the BoW model and
K = 9 provided the best results for the TF-IDF model.
Using these K values, the scores for BoW and TF-IDF are visualized below.
respectively the scores are calculated and are presented below.
Random Forest:
Similar to the previous algorithms, two models are created and trained using the respective training
vectors and training labels. But the number of trees to be used for the forest has to be provided.
(Fig. 4.9 Random Forest (trees vs scores))
(Fig. 4.10 Random Forest (BoW vs TF-IDF))
Finally, two SVM models, one for BoW and one for TF-IDF, are created and then trained
using the respective training vectors and labels. They are then tested using the testing vectors and labels.
The scores for BoW and TF-IDF models are 59.41 and 98.82 respectively.
Proposed Model:
In our proposed system we combine all the models and make them into one. It takes an unknown
point and feeds it into every model to get predictions. Then it takes these predictions, finds the
category which was predicted by the majority of the models, and finalizes it.
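The majority-vote idea behind the proposed model can be sketched as below. The three lambda functions are hypothetical stand-ins for trained classifiers, used only so the example runs on its own; the actual system would call the predict step of each trained model instead.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the category predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(sample, models):
    """Feed the sample to every model and finalize the majority category."""
    return majority_vote([model(sample) for model in models])

# Hypothetical stand-ins for trained classifiers (not the project's models).
models = [
    lambda text: "spam" if "free" in text else "ham",
    lambda text: "spam" if "prize" in text else "ham",
    lambda text: "spam" if len(text) > 100 else "ham",
]
print(ensemble_predict("claim your free prize", models))  # spam (2 of 3 votes)
print(ensemble_predict("see you at lunch", models))       # ham
```

With an odd number of models and two categories, as in this sketch, a tie cannot occur.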
To determine which model is effective, we used three metrics: Accuracy, Precision, and F1 score. In the
earlier system, we used only the F1 Score because we were not determining which model is best but
which language model is best suited for classification.
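For reference, precision and F1 score can be computed from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions, treats spam as the positive class (an assumption of this example), and runs on invented labels; recall is included because the F1 score is defined from it.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # true positives
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # false positives
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # each is approximately 0.667 for this toy data
```

Precision matters for spam filtering in particular because a false positive means a legitimate mail was marked as spam.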
The user interface (UI) is an important component in this application. The user only interacts with the
interface.
The UI of this project has been constructed with the help of an open source library called Streamlit.
The complete information and API reference sheet can be obtained from here
4.5.7 Working Procedure
The working procedure includes the internal working and the data flow of application.
3. Processing
ii. The user just needs to provide some data to classify in the area provided.
1. Textual Processing
3. Entity extraction
iv. The created vectors are provided to trained models to get predictions.
vii. The accuracies and the entities extracted in step 3 will be provided to the user.
Every time the user provides new input, the procedure from step 2 will be repeated.
Chapter 5
Results and Discussion
While selecting the best language model, the data was converted into both types of vectors,
and the models were then tested to determine the best model for classifying spam.
The results from individual models are presented in the experimentation section under
methodology. The results from the models are now compared.
From the figure it is clear that TF-IDF proves to be better than BoW in every model tested. Hence
TF-IDF has been selected as the primary language model for textual data conversion in feature vector
formation.
To determine which model is effective, we used three metrics: Accuracy, Precision, and F1 score.
The resulting values for the proposed model are
Accuracy – 99.0
Precision – 98.5
F1 Score – 98.6
5.3 Comparison
Model        Accuracy  Precision  F1 Score
Naïve Bayes  96.0      99.2       95.2
5.4 Summary
There are two main tasks in the project implementation: language model selection for completing the
textual processing phase, and proposed model creation using the individual algorithms. These two
tasks require comparison with other models and selection of various parameters for better efficiency.
During the language model selection phase, two models, Bag of Words and TF-IDF, are compared to
select the best model, and from the results obtained it is evident that TF-IDF performs better.
During the proposed model design, various algorithms are tested with different parameters to find the
best parameters. The models are merged to form an ensemble algorithm, and the results obtained are
presented and compared above. It is clear from the results that the proposed model outperforms the
others in almost every metric derived.
Chapter 6
Conclusion and Future Scope
From the results obtained we can conclude that an ensemble machine learning model is more
effective in detection and classification of spam than any individual algorithm. We can also
conclude that the TF-IDF (term frequency-inverse document frequency) language model is more
effective than the Bag of Words model in classification of spam when combined with several
algorithms. Finally, we can say that spam detection can improve if machine learning
algorithms are combined and tuned to needs.
There are numerous applications of machine learning and natural language processing, and
when combined they can solve some of the most troubling problems concerned with texts. This
application can be scaled to take in text in bulk so that classification can be done more effectively
on some public sites.
Other contexts such as negative, phishing, malicious, etc. can be used to train the model to
filter things such as public comments on various social sites. This application can be converted
to an online type of machine learning system and can be easily updated with the latest trends in spam
and other mails, so that the system can adapt to new types of spam emails and texts.
References
[1] T. Toma, S. Hassan, and M. Arifuzzaman, "An Analysis of Supervised Machine Learning Algorithms
for Spam Email Detection," in International Conference on Automation, Control and Mechatronics for
Industry 4.0 (ACMI), 2021.
[2] S. Nandhini and J. Marseline K.S., "Performance Evaluation of Machine Learning Algorithms
for Email Spam Detection," in International Conference on Emerging Trends in Information
Technology and Engineering (ic-ETITE), 2020.
[3] A. L. a. S. S. S. Gadde, "SMS Spam Detection using Machine Learning and Deep Learning
Techniques," in 7th International Conference on Advanced Computing and Communication
Systems (ICACCS), 2021.
[4] V. B. a. B. K. P. Sethi, "SMS spam detection and comparison of various machine learning
algorithms," in International Conference on Computing and Communication Technologies for
Smart Nation (IC3TSN), 2017.
[5] G. D. a. A. R. P. Navaney, "SMS Spam Filtering Using Supervised Machine Learning
Algorithms," in 8th International Conference on Cloud Computing, Data Science & Engineering
(Confluence), 2018.
[6] S. O. Olatunji, "Extreme Learning Machines and Support Vector Machines models for email
spam detection," in IEEE 30th Canadian Conference on Electrical and Computer Engineering
(CCECE), 2017.
[7] N. Kumar, S. Sonowal, and Nishant, "Email Spam Detection Using Machine Learning Algorithms,"
in Second International Conference on Inventive Research in Computing Applications (CIRCA), 2020.
[8] R. Madan, "medium.com," [Online]. Available:
https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3.
[9] N. D. J. a. M. M. A. M. M. RAZA, "A Comprehensive Review on Email Spam Classification
using Machine Learning Algorithms," in International Conference on Information Networking
(ICOIN), 2021.
[10] M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta, "A Comparative Study of Spam SMS
Detection Using Machine Learning Classifiers," in Eleventh International Conference on
Contemporary Computing (IC3), 2018.
[11] M. M. J. Fattahi, "SpaML: a Bimodal Ensemble Learning Spam Detector based on NLP
Techniques," in IEEE 5th International Conference on Cryptography, Security and
Privacy (CSP), 2021.
A. Screenshots