Spam Detection in Email using Machine Learning

IT19154404 R. A. Shehan Sanjula


Department of Computer Systems Engineering
Sri Lanka Institute of Information Technology
Malabe, Sri Lanka

Abstract: In today's world, email is used in almost every industry, from business to
education. Emails can be categorized into two categories: ham and spam. Junk emails, also
known as spam messages, are emails designed to harm recipients by wasting their time and
computing resources and by stealing their valuable information. It is estimated that spam
emails are increasing at a rapid rate. One of the most important and prominent spam
prevention techniques is email filtering. Naive Bayes, Decision Trees, Neural Networks, and
Random Forests are among the methods researchers have used for this purpose. In this
project, I examine the Logistic Regression machine learning model for spam filtering in
email by categorizing messages into appropriate groups. This study also compares the
techniques based on accuracy, precision, recall, etc. The accuracy level for this project
was around 97%. Towards the end, insights, future research directions, and challenges are
outlined.

Keywords: Machine Learning, Logistic Regression, Spam Email Filtering, TfidfVectorizer,
Random State, Deployment

I. INTRODUCTION
In the modern era of information technology, information sharing is easier than ever. Users
can exchange information on a variety of platforms from anywhere across the globe. Among
all information sharing mediums, email is currently the easiest, cheapest, and most rapid
method of transmitting information in the world. Emails, on the other hand, are vulnerable
to a variety of attacks because of their simplicity, the most popular and destructive of
which is spam [1]. Aside from wasting recipients' time and resources, emails that are not
related to their interests may contain malicious content in the form of attachments or
URLs, which may compromise the host system's security [2]. The term spam refers to any
irrelevant and unwanted messages or emails sent by an attacker to significant numbers of
recipients by email or any other means of communication [2].

Therefore, the security of the email system requires a great deal of attention. Spam emails
may contain viruses, RATs, and Trojans. Users are often lured towards online services by
this technique. Attackers can send spam emails with attachments that have multiple file
extensions or with links to malicious, spamming websites and, worse, cause data and
financial fraud and identity theft [3, 4]. Many email providers make it possible to create
keyword-based rules that serve as filters for email messages. Even so, this method is not
very practical because it is difficult, and users do not want to customize their email
filters, which leaves their accounts open to attack by spammers [4].

A. The importance of Spam Detection in Email using Machine Learning
IoT has become a part of our daily lives over the last few decades and is growing rapidly.
The emergence of IoT is leading to increased spam email problems. In order to detect and
filter spam and spammers, researchers have proposed a variety of spam detection methods.
Currently, spam email detection methods mainly fall into two categories: those based on
behaviour patterns and those based on semantic patterns. Each type of approach has its
drawbacks and limitations. Since the advent of the Internet and increased communication
around the globe, spam emails have grown significantly [5]. Through the Internet, spammers
can send spam from anywhere in the world while hiding their identities. Spam mail still
dominates the internet despite all the antispam tools and techniques available. These
attacks most commonly involve malicious emails containing links to malicious websites that
can compromise the victim's personal information. Spam emails can also occupy the memory or
capacity of servers, slowing down their response times. Organizations carefully evaluate
the tools available to battle spam in their environment so that they can accurately detect
spam emails and contain the growing spam problem. Whitelists and blacklists, mail header
analysis, keyword checking, and spam detection are some of the popular mechanisms for
analyzing incoming emails [6].

B. Solution Proposed
According to researchers, 40% of social networks have accounts that are utilised for spam
[7]. By sending hidden links in the text, spammers target specific segments, review pages,
or fan pages to promote pornographic or other product sites from fraudulent accounts. The
same kinds of noxious emails are sent to the same kinds of individuals or associations on a
regular basis. Investigating these characteristics can improve the detection of such
emails. Artificial intelligence (AI) can be used to differentiate between spam and non-spam
emails [8].

Headers, subjects, and bodies of the messages can be used to extract feature information
for this solution. These data can then be grouped into spam and ham based on their nature.
Detecting spam with learning-based classifiers is commonplace today. In learning-based
classification, spam emails are assumed to have a set of specific features that distinguish
them from legitimate emails. Identifying spam with learning-based models has become more
complex due to many factors. Several factors contribute to the subjectivity of spam,
including concept drift, language problems, processing overhead, and text latency. My
proposed method classifies emails as spam or ham with an accuracy of about 97%, which is an
outstanding result, since existing systems lack such precision.

II. RELATED WORK
Email spam is defined as unsolicited fake bulk emails sent from any account or automated
system. Spam emails are
becoming more widespread, and they have become a major issue over the last decade.
Typically, spambots (computerized applications that crawl email addresses through the
Internet) are used to collect email IDs to send spam emails. In the detection of spam
emails, machine learning has been playing a vital role recently. A supervised approach with
feature selection for email spam detection was presented by Kaur and Verma [9]. For spam
detection systems, they introduced the knowledge discovery process. Their survey also
addresses the selection of features based on N-Gram. N-Gram is a predictive method that,
given the preceding N - 1 terms in a sentence or text corpus, estimates the probability of
the next word [9, 10]. They compare non-machine-learning techniques (signatures, blacklist
and whitelist, and mail header checking) with machine learning techniques (Naive Bayes,
Support Vector Machine, multilayer perceptron Neural Network) for detecting spam emails.
They evaluate all these supervised machine learning algorithms on precision, recall, and
accuracy; false positives are generated at a high rate depending on the dataset.

DeBarr and Wechsler introduced another spam filtering system that uses Random Forest
algorithms to categorize spam emails and active learning to refine the categorization [11].
In their approach, each email is divided into two sections. For that, they used email
messages from RFC 822 (Internet) [11]. For training on the dataset, the researchers used
Support Vector Machine, Random Forest, Naive Bayes, and KNN [11]. However, because the
research depends solely on the term frequency and inverse document frequency of the
features of each email, the accuracy of the model stops at 95.2%.

Takhmiri and Haroonabadi [12] use a fuzzy Decision Tree and the Naive Bayes algorithm to
provide a new method for detecting spam. They extract spam behaviour patterns using the
baked voting algorithm. They did this because, in the real world, apparent features do not
exist. For spam and ham email classification, the decision trees utilise fuzzy Mamdani
rules according to the research. They next apply the Naive Bayes classifier [12, 13] to the
dataset. Eventually, by separating votes into smaller portions, the baking approach is
applied. This method provides them with an optimum weight to improve accuracy. The study
utilised a dataset of 1000 emails, of which 350 (35%) were spam and 650 (65%) were ham
emails [14], which is a rather small dataset.

To identify emails as junk mail or ham, Verma and Sofat utilized the supervised machine
learning method ID3 to construct the decision trees of the study [15]. Further, the hidden
Markov model was used to calculate the odds of many events occurring at the same time [16].
The proposed approach classifies all emails as spam or valid by calculating the total
chance of each email using previously classified email phrases. This study makes use of the
Enron dataset, which contains 5172 emails altogether [15, 16]. 2086 of the 5172 emails were
spam, while the remaining emails were legitimate. Using the feature set gathered from the
Enron dataset, their algorithm can classify emails as spam or ham. Using the fitness
function from the Scikit-Learn package in the suggested model, they got an 11% error. On
the given dataset, their model had an accuracy rate of 89%.

III. DATASET
In this project, I am using a dataset obtained from Kaggle. I have named it
“mail_data.csv”. The dataset contains more than 5500 raw emails in CSV format. I will
discuss the basic properties of the dataset in the methodology section, along with how I
used it in my machine learning project with feature extraction, cleanups, removing
redundant values, filling in missing values, etc.

IV. METHODOLOGY
Our primary objective with this product is to differentiate spam from ham among the emails
that we receive daily. It determines, in a more effective way, which emails should come to
your inbox and which emails should go to the spam folder. Here I will be using the Logistic
Regression model to build this project, because Logistic Regression is a natural fit for
binary classification problems. Further, we already discussed related works. My plan is to
customize the code we built so that it can be used to its maximum potential.

Figure 1: Logical Flow

According to the above flow, we start with raw spam and ham email data. After the data has
been collected, we will train our machine learning model using the data. However, the raw
data cannot be fed to the model directly; it first needs to be preprocessed. Our text data
will be converted into numbers during data preprocessing, since we know that machines can
only understand numbers. Following that, I will split our data into training and test data,
which will be used in training and evaluating our model. Once I have done that, I will feed
the data into our Logistic Regression model. The trained model will eventually predict
whether a mail is spam or ham by analyzing its contents.

A. Integrating Machine Learning into the project
We already have the Kaggle dataset, so let's explore how I'll use it in my Machine Learning
project. I will start by importing the dependencies (the libraries and functions) we will
be using.

Figure 2: Importing dependencies
Model selection, feature extraction, data splitting, and the metrics import are all done
through the scikit-learn (sklearn) library. As our algorithm always expects the input to be
an integer or a float, we must insert a feature extraction layer in the middle to convert
the words to integers or floats.

There are a couple of methods of doing this: TfidfVectorizer, CountVectorizer, and Word
Embedding. Counting words is good, but can we do better? The problem with simple word
counts is that some words, such as “the” and “and”, appear repeatedly without adding any
meaningful information. The word embedding technique attempts to convert a word into a
vector-based format, and this vector describes where the word resides within a higher
dimensional space. When two words have similar meanings, their cosine distances will be
shorter, and they will be closer to one another. However, that is not what we need here.
This is where TfidfVectorizer comes into play. In addition to counting each word, the
vectorizer will try to downscale words that appear across multiple documents or sentences.

Figure 3: Data collection and Pre-Processing

The next step will be loading the dataset into a pandas data frame. The raw data is shown
in the above figure. Since the dataset contains null and missing values, this will pose a
problem. This issue will be resolved by converting them into empty strings in the next
step.

Figure 4: Replace the null values with null string

Let's find out how many columns and rows we have in our dataset.

Figure 5: Checking the number of rows and columns

As you can see, this is a fairly large dataset. We have a significant amount of data here:
5572 emails. We now see that the category column holds two labels. Therefore, in our next
step, we need to encode those labels. In this case, I will label spam as 0 and ham as 1.

Figure 6: Label Encoding

Now, I provide the message data and the labels separately to the machine learning model.
It's like giving X-axis and Y-axis values. So, the features or message data will be the
input, and the output or the target column will be the category. For this purpose, let's
make two variables.

Figure 7: Separating the data as text and label

The next part is the most important, since we use one set of data to train our model and
another set to evaluate it. In other words, part of X will be our training data, and the
other part will be our test data. The same applies to Y. In this instance, we will take
advantage of the train_test_split function we imported above. A total of 80% of the 5572
emails will be used for the training data; the remaining 20% will be used for the test
data. With the random state fixed, I can be sure that train_test_split will return the same
split every time, which will give consistency to our model.
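
The preprocessing steps above (filling missing values, label encoding, separating text from labels, and a reproducible 80/20 split) can be illustrated with a minimal sketch. It deliberately uses only the Python standard library rather than pandas and scikit-learn, so it is not the code shown in the figures. The five rows, the column names `Category` and `Message`, the helper names, and the seed value `random_state=3` are all assumptions made for illustration.

```python
import csv
import io
import random

# Tiny in-memory stand-in for mail_data.csv (hypothetical rows, not the Kaggle data).
RAW_CSV = """Category,Message
ham,See you at lunch tomorrow
spam,WIN a FREE prize now! Click here
ham,
spam,Cheap loans approved instantly
ham,Can you send me the report?
"""

def load_rows(text):
    """Read (message, category) pairs, replacing missing messages with ""."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        message = row.get("Message") or ""  # None and "" both become ""
        rows.append((message, row["Category"]))
    return rows

def encode_label(category):
    """Label encoding described in the text: spam -> 0, ham -> 1."""
    return 0 if category == "spam" else 1

def split_train_test(X, Y, test_size=0.2, random_state=3):
    """Shuffle indices with a fixed seed, then hold out `test_size` of them."""
    indices = list(range(len(X)))
    random.Random(random_state).shuffle(indices)
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return pick(X, train_idx), pick(X, test_idx), pick(Y, train_idx), pick(Y, test_idx)

rows = load_rows(RAW_CSV)
X = [message for message, _ in rows]                  # input text
Y = [encode_label(category) for _, category in rows]  # target labels
X_train, X_test, Y_train, Y_test = split_train_test(X, Y)
print(len(X_train), "training emails,", len(X_test), "test emails")
```

Because the shuffle is driven by a seeded generator, calling `split_train_test` again with the same `random_state` returns the same partition, which is the consistency property the text attributes to the random state.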

Figure 8: Splitting the data into training data and test data

The next step will be feature extraction, in which we convert the text values to feature
vectors (which have meaningful numerical values) that the Logistic Regression model can
understand. max_df is used to remove terms that appear too frequently. The parameter
stop_words = “english” will ignore words in English that add little meaning to a sentence.
In the next step, fit_transform will convert the text to feature vectors. Since we still
have the object data type in the data frame, the label values eventually need to be
converted to integers.

Figure 9: Feature extraction and transform text data into feature vectors

Figure 10: Displaying the transformed data

Now I am going to train my model. It will require importing the Logistic Regression model.
Next, I will feed the model the training data (X-axis and Y-axis values).

Figure 11: Training the model

B. Evaluating the model
A trained model must be evaluated before we proceed to build a predictive system. An array
called prediction_on_training_data stores the values predicted by the trained model. We
then compare the predicted values with the true labels. Here I will utilize the
accuracy_score function. We need to provide two parameters. In one, we have the “true”
value, which is Y_train, and in the other, we have prediction_on_training_data.

Figure 12: Evaluating the training data and prediction

Our model has been tested using training data, so let's try it with test data as well.
Sometimes a model can overfit. Therefore, I am testing my model with test data as well as
training data.

Figure 13: Evaluating the test data and prediction

Now that I am confident in my model, let's build a predictive system. This can be achieved
by submitting a random sample of emails to the model, which can then predict whether each
one is spam or not. Here are some examples based on some emails I selected from my dataset.
The value of the label is predicted using the predict function.

Figure 14: Building a Predictive system
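
The paper performs feature extraction, training, and evaluation with scikit-learn's TfidfVectorizer, LogisticRegression, and accuracy_score, and shows that code only as figures. To make the underlying idea concrete, the sketch below reimplements a simplified version by hand in plain Python. It is not the author's code: the ten emails, the tiny stop-word list, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, the learning rate, and the epoch count are all assumptions chosen for illustration. Labels follow the convention in the text (spam = 0, ham = 1).

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "and", "to", "at", "me", "for", "you", "we", "are", "your"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def fit_tfidf(docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents."""
    doc_freq = Counter(word for doc in docs for word in set(tokenize(doc)))
    n = len(docs)
    vocab = sorted(doc_freq)
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Turn one document into an L2-normalised TF-IDF vector."""
    counts = Counter(tokenize(doc))
    vec = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic (log) loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) - yi
            grad_w = [g + error * x for g, x in zip(grad_w, xi)]
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict(X, weights, bias):
    """Return 1 (ham) when the predicted probability is at least 0.5, else 0 (spam)."""
    return [1 if sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) >= 0.5 else 0
            for xi in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data.
train_text = ["win a free prize now", "free cash offer click now",
              "claim your free prize today", "cheap loans click here now",
              "see you at lunch tomorrow", "please send me the report",
              "are we still meeting today", "thanks for the lunch yesterday"]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_text = ["free prize click now", "send the report tomorrow please"]
test_labels = [0, 1]

vocab, idf = fit_tfidf(train_text)                     # fit on training data only
X_train = [transform(d, vocab, idf) for d in train_text]
X_test = [transform(d, vocab, idf) for d in test_text]
weights, bias = train_logistic_regression(X_train, train_labels)

print("training accuracy:", accuracy(train_labels, predict(X_train, weights, bias)))
print("test accuracy:", accuracy(test_labels, predict(X_test, weights, bias)))
```

Note that the vocabulary and IDF weights are learned from the training emails only and then reused for the test emails, which mirrors the fit_transform step described above being applied to the training split. Comparing training and test accuracy side by side is the overfitting check the section describes.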
C. Local Deployment on Ubuntu LTS 20.04.1 x64

Figure 15: The Python script developed by me

To deploy my model as a web application on Ubuntu, I developed a Python script (app.py).
Upon running the script, Flask (a Python-based web app framework) is imported to render the
model into a web application. In the following figure, we can see that the Flask server has
successfully been started on http://127.0.0.1:5000.
Figure 16: Local Deployment on Ubuntu 20.04.1 x64
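
The author's app.py is shown only as a figure and is built on Flask. As a rough illustration of what serving a classifier over HTTP involves, the sketch below swaps in the standard library's WSGI interface instead of Flask so that it runs without extra packages. It should not be read as the author's script. The `classify` stub, its keyword list, and the form field name `message` are hypothetical placeholders for the trained model and the real form.

```python
from urllib.parse import parse_qs

def classify(message):
    """Hypothetical stand-in for the trained model: returns "spam" or "ham"."""
    spam_words = {"free", "prize", "winner", "click"}
    return "spam" if spam_words & set(message.lower().split()) else "ham"

def app(environ, start_response):
    """WSGI application: POST a form field `message`, receive the predicted label."""
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"POST a 'message' form field"]
    size = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
    label = classify(form.get("message", [""])[0])
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [label.encode("utf-8")]

# To serve it locally on the same address as in the text, one option is:
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 5000, app).serve_forever()
```

In a real deployment, `classify` would load the fitted vectorizer and model once at start-up and call them for each request, instead of matching keywords.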

D. Commercializing the product on the Internet

Figure 17: Hosting the commercial product on the internet


As a commercial product, the Web Application is hosted on the internet. You can find it at
https://spam-email-filtering-system.shehansanjula.dev
The GitHub repository is at
https://github.com/ShehanSanjula/Spam-Email-Filtering-System-Public
V. RESULTS
The system will produce the following results. Let's test the system by sending a spam
email.

Figure 18: Checking the content of the mail to see if it's spam

The application works as expected. Let's see if it works when a ham mail is used.

Figure 19: Determining if the mail is Ham mail

In both cases, the application accurately identified the emails.

VI. CONCLUSION
Researchers have become increasingly interested in spam detection and filtering over the
last two decades. Several studies have been conducted in this area because of its
substantial impact on a variety of areas, such as consumer behavior or fake reviews. In the
study, lessons learned from each machine learning category are compared with previous
approaches. Additionally, spam filters find it challenging to evaluate features from
multiple angles, including temporal, writing style, semantic, and statistical ones. Models
are trained primarily on balanced datasets, while self-learning models are not feasible.
Deepfakes are another challenge facing spam detection systems. According to the findings of
this study, most proposed spam email detection techniques are based on supervised machine
learning techniques. This project provides an in-depth analysis of the Logistic Regression
algorithm and some future directions for searching and detecting spam email.

ACKNOWLEDGEMENT
I thank Dr Lakmal Rupasinghe, the lecturer in charge of Machine Learning for Cyber Security
- IE4092, Ms Chethana Liyanapathirana, the senior lecturer, Ms Laneesha Ruggahakotuwa, the
assistant lecturer, and all associated lecturers and instructors of Sri Lanka Institute of
Information Technology for granting me the opportunity to carry out this Machine Learning
project and report under their guidance. This work was supported in part by the Research
Groups Faculty of Computing, Department of Computer Systems Engineering under Grant Machine
Learning for Cyber Security - IE4092.
REFERENCES
[1] H. Faris, A. M. Al-Zoubi, A. A. Heidari et al., “An intelligent system for spam
detection and identification of the most relevant features based on evolutionary random
weight networks,” Information Fusion, vol. 48, pp. 67–83, 2019.
[2] S. O. Olatunji, “Extreme Learning machines and Support Vector Machines models for email
spam detection,” in Proceedings of the 2017 IEEE 30th Canadian Conference on Electrical and
Computer Engineering (CCECE), IEEE, Windsor, Canada, April 2017.
[3] A. Alghoul, S. Al Ajrami, G. Al Jarousha, G. Harb, and S. S. Abu-Naser, “Email
classification using artificial neural network,” International Journal for Academic
Development, vol. 2, 2018.
[4] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber
security intrusion detection: approaches, datasets, and comparative study,” Journal of
Information Security and Applications, vol. 50, Article ID 102419, 2020.
[5] H. Bhuiyan, A. Ashiquzzaman, T. Islam Juthi, S. Biswas, and J. Ara, “A survey of
existing e-mail spam filtering methods considering machine learning techniques,” Global
Journal of Computer Science and Technology, vol. 18, 2018.
[6] T. Vyas, P. Prajapati, and S. Gadhwal, “A survey and evaluation of supervised machine
learning techniques for spam e-mail filtering,” in Proceedings of the 2015 IEEE
international conference on electrical, computer and communication technologies (ICECCT),
IEEE, Tamil Nadu, India, March 2015.
[7] A. K. Jain and B. B. Gupta, “A novel approach to protect against phishing attacks at
client side using auto-updated white-list,” EURASIP Journal on Information Security, vol.
2016, no. 1, p. 9, 2016.
[8] A. Subasi, S. Alzahrani, A. Aljuhani, and M. Aljedani, “Comparison of decision tree
algorithms for spam E-mail filtering,” in Proceedings of the 2018 1st International
Conference on Computer Applications & Information Security (ICCAIS), IEEE, Riyadh, Saudi
Arabia, April 2018.
[9] H. Faris, I. Aljarah, and B. Al-Shboul, “A hybrid approach based on particle swarm
optimization and random forests for e-mail spam filtering,” in Proceedings of the
International Conference on Computational Collective Intelligence, Springer, Halkidiki,
Greece, September 2016.
[10] N. Sutta, Z. Liu, and X. Zhang, “A study of machine learning algorithms on email spam
classification,” in Proceedings of the 35th International Conference, ISC High Performance
2020, vol. 69, pp. 170–179, Frankfurt, Germany, 2020.
[11] Z. Guo, Y. Shen, A. K. Bashir et al., “Robust spammer detection using collaborative
neural network in Internet of thing applications,” IEEE Internet of Things Journal, vol. 8,
2020.
[12] Y. Dou, G. Ma, P. S. Yu, and S. Xie, “Robust spammer detection by nash reinforcement
learning,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, ACM, Virtual Event CA, USA, July 2020.
[13] M. H. Arif, J. Li, M. Iqbal, and K. Liu, “Sentiment analysis and spam detection in
short informal text using learning classifier systems,” Soft Computing, vol. 22, no. 21,
pp. 7281–7291, 2018.
[14] N. Kumar and S. Sonowal, “Email spam detection using machine learning algorithms,” in
Proceedings of the 2020 Second International Conference on Inventive Research in Computing
Applications (ICIRCA), pp. 108–113, Coimbatore, India, 2020.
[15] A. J. Saleh, A. Karim, B. Shanmugam et al., “An intelligent spam detection model based
on artificial immune system,” Information, vol. 10, no. 6, p. 209, 2019.
[16] W. Peng, L. Huang, J. Jia, and E. Ingram, “Enhancing the naive bayes spam filter
through intelligent text modification detection,” in Proceedings of the 2018 17th IEEE
International Conference on Trust, Security And Privacy In Computing And
Communications/12th IEEE International Conference on Big Data Science And Engineering
(TrustCom/BigDataSE), IEEE, New York, NY, USA, August 2018.
