Spam Detection in Email Using Machine Le
Abstract: In today's world, email is used in almost every industry, from business to education. Emails can be categorized into two categories: ham and spam. Junk emails, also known as spam messages, are emails that have been designed to harm recipients by wasting their time and computing resources and by stealing their valuable information. It is estimated that spam emails are increasing at a rapid rate. One of the most important and prominent spam prevention techniques is filtering email. Naive Bayes, Decision Trees, Neural Networks, and Random Forests are among the methods used for this purpose by researchers. In this project, I examine the Logistic Regression machine learning model for spam filtering in email by categorizing messages into appropriate groups. This study also compares the techniques based on accuracy, precision, recall, etc. The accuracy level for this project was around 97%. Towards the end, these insights, future research directions, and challenges are outlined.

Keywords: Machine Learning, Logistic Regression, Spam Email Filtering, TfidfVectorizer, Random State, Deployment

I. INTRODUCTION
In the modern era of information technology, information sharing is easier than ever. Users can exchange information on a variety of platforms from anywhere across the globe. Among all information-sharing media, email has become the easiest, cheapest, and fastest method of transmitting information in the world. Owing to their simplicity, however, emails are vulnerable to a variety of attacks, the most popular and destructive of which is spam [1]. Aside from wasting recipients' time and resources, emails that are not related to their interests may contain malicious content in the form of attachments or URLs, which may compromise the host system's security [2]. The term spam refers to any irrelevant and unwanted messages or emails sent by an attacker to significant numbers of recipients by email or any other means of communication [2].
Therefore, the security of the email system requires a great deal of attention. Spam emails may contain viruses, rats, and Trojans. Users are often lured towards online services by this technique. Attackers can send spam emails with attachments that have multiple file extensions or that link to malicious, spamming websites and, worse, result in data and financial fraud and identity theft [3, 4]. Many email providers make it possible to create keyword-based rules that serve as filters for email messages. Even so, this method is not very practical because it is difficult, and users do not want to customize their email, which leads to spammers attacking their accounts [4].

A. The importance of Spam Detection in Email using Machine Learning
IoT has become a part of our daily lives over the last few decades and is growing rapidly. The emergence of IoT is leading to increased spam email problems. In order to detect and filter spam and spammers, researchers have proposed a variety of spam detection methods. Currently, spam email detection methods mainly fall into two categories: those based on behaviour patterns and those based on semantic patterns. Each type of approach has its drawbacks and limitations. Since the advent of the Internet and increased communication around the globe, spam emails have grown significantly [5]. Through the Internet, spammers can send spam from anywhere in the world by hiding their identities. Spam mail still dominates the internet despite all the antispam tools and techniques available. Those attacks most commonly involve malicious emails containing links to malicious websites that can cause harm to the victim's personal information. The memory or capacity of servers can also be occupied by spam emails, slowing down their response times. All organizations carefully evaluate the tools available to battle spam in their environment to accurately detect spam emails and avoid the increasing issue of spam in emails. Whitelists and blacklists, mail header analysis, keyword checking, and spam detection are some of the popular mechanisms for analyzing incoming emails [6].

B. Solution Proposed
According to researchers, 40% of social networks have accounts that are utilised for spam [7]. By sending hidden links in the text, spammers target specific segments, review pages, or fan pages to promote pornographic or other product sites from fraudulent accounts. The same kinds of noxious emails are sent to the same kinds of individuals or associations on a regular basis. Better detection of these types of emails can be achieved by investigating these features. In order to differentiate between spam and non-spam emails, artificial intelligence (AI) can be used [8].
Headers, subjects, and bodies of the messages can be used to extract feature information for this solution. These data can then be grouped into spam and ham based on their nature. Today, spam is commonly detected using learning-based classifiers. In learning-based classification, spam emails are assumed to have a set of specific features that distinguish them from legitimate emails. In learning-based models, identifying spam has become more complex due to many factors. Several factors contribute to spam subjectivity, including concept drift, language problems, processing overhead, and text latency. My proposed method classifies emails as spam or ham, based on their nature, with an accuracy of about 97%, which is an outstanding achievement since existing systems lack such precision.

II. RELATED WORK
Email spam is defined as unsolicited fake bulk emails sent from any account or automated system. Spam emails are becoming more widespread by the day and have become a major issue over the last decade. Typically, Spambots (computerized applications that crawl email addresses through the Internet) are used to collect Email IDs to send spam emails. In the detection of spam emails, machine learning has been playing a vital role recently. A supervised approach with feature selection on email spam detection was presented by Kaur and Verma [9]. For spam detection systems, they introduced the knowledge discovery process. This survey also addresses the selection of characteristics based on N-Gram. N-Gram is a prediction-based method that, given the preceding N - 1 terms in a sentence or text corpus, estimates the probability of the following word [9, 10]. They compare non-machine-learning techniques (Signatures, Blacklist and Whitelist, and mail header checking) with machine learning techniques (Naive Bayes, Support Vector Machine, multilayer perceptron Neural Network) for detecting spam emails. They apply all these supervised machine learning algorithms and evaluate the results based on precision, recall, and accuracy; false positives are generated at a high rate depending on the dataset.

III. DATASET
In this project, I am using a dataset obtained from Kaggle. I have named it “mail_data.csv”. This dataset contains 5500+ raw mail records in CSV format. I will discuss the basic properties of the dataset in the methodology section, and how I used it in my machine learning project with feature extraction, cleanups, removing redundant values, filling in missing values, etc.

IV. METHODOLOGY
Our primary objective with this product is to differentiate spam from ham among the emails that we receive daily. In real life, this determines more effectively which emails come to your inbox and which emails should go to the spam folder. Here I will be using the Logistic Regression model to build this project, because Logistic Regression is a well-suited model for binary classification problems. Further, we already discussed related works. My plan is to customize the code we built so that it can be used to its maximum potential.
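As a brief reminder of why Logistic Regression fits a two-class problem such as spam versus ham, the standard form of the model can be written as below. The symbols (feature vector x, weights w, bias b) and the 0.5 decision threshold are notation and a common default assumed here for illustration; they are not taken from the project's code.

```latex
% Standard binary logistic regression (notation assumed for illustration)
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% Predicted class with the common default threshold of 0.5
\hat{y} =
\begin{cases}
1 & \text{if } P(y = 1 \mid \mathbf{x}) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
```

Because the sigmoid maps any real-valued score to a value between 0 and 1, the output can be read as the probability of one of the two classes.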
The next step will be loading the dataset into a pandas data
frame. The raw data is shown in the above figure. Since the
dataset contains null and missing values, this will pose a
problem. This issue will be resolved by converting them into
null (empty) strings in the next step.
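The loading and cleanup step described above might look like the following minimal sketch. The small in-memory CSV is a hypothetical stand-in for mail_data.csv, and the column names "Category" and "Message" are assumptions; `fillna("")` is one common way to replace missing values with empty strings and may differ from the call used in the project.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of mail_data.csv; the column names
# "Category" and "Message" are assumptions, not confirmed by the text.
sample_csv = """Category,Message
ham,See you at the meeting tomorrow
spam,
ham,Thanks for the update
"""

# Load the raw data into a pandas data frame.
raw_mail_data = pd.read_csv(io.StringIO(sample_csv))

# Replace null/missing values with empty ("null") strings.
mail_data = raw_mail_data.fillna("")

# Count the missing values before and after the cleanup.
print(raw_mail_data.isnull().sum().sum(), mail_data.isnull().sum().sum())
```

With the real file, the in-memory buffer would be replaced by the path to mail_data.csv.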
Figure 4: Replace the null values with null string

The next part is the most important since we use one set of data to train our model and another set to evaluate it. In other words, part of the X will be our training data, and the other part will be our test data. The same applies to Y. In this instance, we will take advantage of the train_test_split function we imported above. A total of 80% of the 5572 emails will be used for the training data; the remaining 20% will be used for the test data. With the random state, I can be sure that train_test_split will return the same split every time, which will give consistency to our model.

Figure 8: Splitting the data into training data and test data

The next step will be feature extraction, in which we convert text values to feature vectors (meaningful numerical values) that the Logistic Regression model can understand. max_df is used to remove terms that appear too frequently. The parameter stop_words = “english” will ignore words in English that add little meaning to a sentence. In the next step, fit_transform will convert the text to feature vectors. Since the data frame still holds values of the object data type, they eventually need to be converted to integers.

Figure 9: Feature extraction and transform text data into feature vectors

B. Evaluating the model
A trained model must be evaluated before we proceed to build a predictive system. An array called prediction_on_training_data stores the values predicted by the trained model. We then compare the predicted values with the true values. Here I will utilize the accuracy_score function. We need to provide two parameters: the “true” values, which are Y_train, and the predicted values, prediction_on_training_data.

Our model has been evaluated using training data, so let’s try it with test data as well. Sometimes a model can overfit. Therefore, I am testing my model with test data as well as training data.
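The splitting, feature extraction, training, and evaluation steps described above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the project's actual code: the ten toy messages, the column names, the label encoding (spam = 0, ham = 1), random_state=3, and max_df=0.9 are all hypothetical choices made so the example runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for mail_data.csv (hypothetical messages and columns).
mail_data = pd.DataFrame({
    "Category": ["spam"] * 5 + ["ham"] * 5,
    "Message": [
        "Win a free prize now",
        "Free cash offer click now",
        "Claim your free reward today",
        "Urgent winner claim cash prize",
        "Free entry win cash now",
        "Are we still meeting for lunch",
        "Please send the report tomorrow",
        "See you at the meeting",
        "Can you phone me tonight",
        "Thanks for the lunch yesterday",
    ],
})

# Separate text (X) and labels (Y); encode labels as integers (assumed mapping).
X = mail_data["Message"]
Y = mail_data["Category"].map({"spam": 0, "ham": 1}).astype(int)

# 80% training data, 20% test data; a fixed random_state makes the split
# reproducible (the value 3 is an arbitrary assumption).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3
)

# Convert text to TF-IDF feature vectors. max_df drops terms that appear in
# too many documents; stop_words="english" drops common English words.
feature_extraction = TfidfVectorizer(max_df=0.9, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)  # fit on training text only
X_test_features = feature_extraction.transform(X_test)        # reuse the fitted vocabulary

# Train the Logistic Regression model on the training features.
model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Evaluate on the training data, then on the held-out test data.
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

print("Training accuracy:", accuracy_on_training_data)
print("Test accuracy:", accuracy_on_test_data)
```

A training accuracy that is much higher than the test accuracy would suggest overfitting, which is why both values are reported. On a toy dataset this small the numbers are not meaningful; they only show the mechanics.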