E-Mail Spam Detection Using Machine Learning KNN
Abstract--Email is a primary mode of communication for software developers because of its convenience. A spam filter is required in order to keep that communication efficient. This study concentrates on spam prevention software. It discusses how the machine learning model that Google updated on Colab can recognize and block almost all spam and phishing emails. This indicates that the spam filter is so efficient that only one message out of one thousand is allowed to pass through. Various machine learning approaches can be used to identify spam, and in recent years the "KNN" method has become increasingly prominent. To this end, we examine how spam classification algorithms operate and attempt to determine how these systems arrive at their decisions. The task of deciding whether an email should be classified as spam or not is referred to as "spam detection."

Keywords: E-mail, Spam Classification, Machine Learning, KNN.

I. INTRODUCTION

Natural Language Processing (NLP) underlies both the process by which voice-activated devices respond to questions and the process that determines, during a preliminary review, whether or not a message is considered spam. NLP converts text into insights that can be put to use in later processing. It is one of the most difficult areas of AI research because of the contextual nature of text input. Text needs to be transformed before it can be understood by machines, and the feature extraction process needs to be broken down into steps.

Classification problems can be divided into two primary groups: binary classification problems, which have only two classes, and multi-class classification problems, which have three or more. In a binary classification problem, the labels can take only one of two possible values. For example, a patient's condition can be categorized as either cancerous or noncancerous depending on its severity. Likewise, when determining whether or not a financial transaction is legal, there is no room for ambiguity. When there are more than two categories of labels, multi-class classification should be used. One example is dividing movie reviews into three categories: positive, negative, and neutral.

One of the most common problems in natural language processing is the classification of strings. Examples include the automatic classification of emails as either spam or non-spam, as well as the categorization of movies and news articles into various genres. This paper focuses on the first of these examples, spam classification, and studies it in further detail.

Problem Description:

In this article, the authors look at the process of deciding whether an email is spam or not and try to understand it. This is called spam detection, and it is a binary classification problem.

The motivation is simple: if anonymous and unwanted emails can be identified, spam messages can be kept out of the user's inbox, making the user's experience better.

Figure 1. Emails are sent through a spam detector. If an email is detected as spam, it is sent to the spam folder; otherwise it is sent to the inbox. (Image Source: Ramya Vidyalaya [5]).

The unsolicited commercial email messages known as spam have become a significant problem on the internet. The act of sending unwanted commercial electronic messages is
979-8-3503-9826-7/22/$31.00 ©2022 IEEE
Authorized licensed use limited to: b-on: UNIVERSIDADE DE AVEIRO. Downloaded on July 31,2023 at 15:21:31 UTC from IEEE Xplore. Restrictions apply.
2022 5th International Conference on Contemporary Computing and Informatics (IC3I)
Today, many individuals use email as a method of communication for business, for their personal lives, and for their professional lives. In 2018, an estimated 296 billion emails were sent, which breaks down to an average of 130 emails per person, every day.

Spam is growing more widespread as more people use the internet and send more emails than ever before. Historically, spam has made up more than fifty percent of all email traffic, and fraud is still responsible for the loss of millions of dollars every day.

On the other hand, as shown in the graph that follows, the volume of emails of this kind has fallen dramatically since 2016. This is because anti-spam tools have improved consistently over recent years.

A. Methods for Sorting Data According to the Content of Its Records:

The phrases that an email contains, the number of times those phrases appear, and how they are distributed throughout the message all play a role in determining whether or not the email is considered spam.

B. Techniques for the Filtration of Spam Derived from Previous Experience:

The algorithms that classify incoming emails into spam and non-spam categories are trained on the content of emails that have been classified in the past.

C. Heuristic or Rule-Based Approaches to Spam Filtering:

Rule-based algorithms use a predetermined set of criteria, such as regular expressions, to assign a score to each individual email message. The score that an email obtains determines whether or not the system considers it spam.
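The scoring idea behind the rule-based approach can be sketched as follows. This is a minimal illustration only: the patterns, the per-rule weights, and the threshold are hypothetical values chosen for the example and do not come from the paper or from any particular spam filter.

```python
import re

# Hypothetical rules as (pattern, score) pairs. Patterns and weights are
# assumptions for illustration, not values taken from the paper.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 2.5),
    (re.compile(r"https?://\S+", re.IGNORECASE), 1.0),
    (re.compile(r"!{2,}"), 1.5),
]
THRESHOLD = 4.0  # assumed cut-off above which a message is treated as spam


def spam_score(text):
    """Sum the scores of all rules whose pattern occurs at least once in text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_spam(text, threshold=THRESHOLD):
    """Classify a message as spam when its rule score reaches the threshold."""
    return spam_score(text) >= threshold


suspicious = "You are a WINNER!! Claim your free prize at http://example.com"
ordinary = "Meeting moved to 3 pm, see the agenda attached."
print(spam_score(suspicious), is_spam(suspicious))
print(spam_score(ordinary), is_spam(ordinary))
```

In a real filter the rules and the threshold would need to be tuned on labelled mail, since a cut-off that is too low discards legitimate messages.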
D. Methods for Identifying Spam Based on Similarity to Previous Messages:

The attributes of each newly received email are used to generate a vector in a multidimensional space, so that every email is represented as a point. The KNN algorithm then identifies the labelled spam and non-spam points that lie closest to the new point in this space and classifies the new email accordingly.

b) Data Set:

The datasets below are quite popular because they contain many emails. The first is the 2006 Enron corpus, in which 55% of the emails are spam. The second is the TREC 2007 dataset, which was created in 2007 and contains 67% spam emails. The dataset is divided into a "train" and a "test" subset using the train/test split. Make sure that both sets have the same total number of emails and that there is an equal number of spam and "ham" emails.
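One way to keep spam and ham represented evenly when a dataset is divided into "train" and "test" subsets is a stratified split, sketched below. This is an illustration under assumptions rather than the authors' procedure: the function name `stratified_split`, the toy messages, and the 20% test fraction are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split labelled samples so each label keeps about the same share in both subsets."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend((sample, label) for sample in group[:n_test])
        train.extend((sample, label) for sample in group[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy corpus (hypothetical): 80 ham and 20 spam messages.
emails = [f"ham message {i}" for i in range(80)] + [f"spam message {i}" for i in range(20)]
labels = ["ham"] * 80 + ["spam"] * 20

train, test = stratified_split(emails, labels, test_fraction=0.2)
print(len(train), len(test))
print(sum(1 for _, label in test if label == "spam"), "spam messages in the test subset")
```

Splitting each class separately means the spam share of the test subset matches that of the training subset, so scores measured on the test data are not skewed by an unlucky draw.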
Once the initial processing is complete, researchers focus on building a vocabulary of significant terms [6-11]. This vocabulary keeps track of every occurrence of the terms, with one term per column. These features can also belong to a number of other classes, including the following:

The number of unique words, the number of unique meanings of those words, the presence or absence of a bag of words labelled "adult material," and so on are all significant features.

Additional account features include the sender's age, the total number of recipients, the number of replies they have received, and the URL of the sender's website.

It is essential to keep in mind that the actual URLs are not modified; only the way they are written changes. For example, instead of "HTTP google," one would write "https://www.google.com/" for the HTTPS address. This is frequently referred to as "normalization." Less weight is placed on the recipient's age, the sender's date of birth, the account's age, and the sender's sexual orientation.

Removing stop words, removing noise, and stemming are three effective strategies for cutting down the size of these very large feature sets. The Porter Stemmer is a well-known stemming algorithm. In most cases, the following is what happens when we stem:

• The elimination of prefixes (un-, re-, pre-, etc.)

Figure 3.6. List of stop words.

Considering the below example,

a) KNN (K-Nearest Neighbor) Implementation:

The K-Nearest Neighbor classification algorithm is comparable to Nearest Neighbor. However, it does not rely only on the single closest instance; rather, it takes into account the K instances that are closest to the new one. K-NN assigns each new instance the class that occurs most frequently among these K nearest instances. The value of K is usually treated as a tuning hyperparameter. The tried-and-true hit-and-trial method can be used for this tuning: we try different values of K and assess how each value affects the overall performance of the model [15].

The first 80% of the data is used to train the model, and the remaining 20% is used to validate it. The data set was taken from Kaggle.

The Euclidean distance can be used to find the closest instance. The K-NN algorithm can be applied to this task with the Scikit-learn library.

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

where x and y are the feature vectors of two emails and n is the number of features.

This section discusses the performance of the proposed K-NN algorithm. It only takes one accidentally deleted essential message for a user to begin questioning whether the time and effort spent on spam filtering is truly worthwhile. As a result, we have an obligation to ensure that our algorithm reaches the maximum level of precision that is technically feasible. However, some researchers contend that accuracy is not the only essential parameter that should be considered when evaluating the performance of spam filtering [16].
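As a rough illustration of the K-NN procedure and its evaluation, the sketch below implements Euclidean distance, majority voting among the K nearest training points, and a trial-and-error loop over K that reports accuracy, precision, and recall for the spam class. It is written in plain Python rather than with the Scikit-learn library the paper mentions. The two-feature toy vectors, the candidate values of K, and all function names are assumptions made for this example and are not the authors' data or code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(train, test, k, positive="spam"):
    """Accuracy, precision and recall of K-NN on test, treating `positive` as the target class."""
    tp = fp = fn = correct = 0
    for features, actual in test:
        predicted = knn_predict(train, features, k)
        correct += int(predicted == actual)
        tp += int(predicted == positive and actual == positive)
        fp += int(predicted == positive and actual != positive)
        fn += int(predicted != positive and actual == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": correct / len(test), "precision": precision, "recall": recall}


# Hypothetical features per email: (number of suspicious words, number of links).
train = [((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"), ((1, 1), "ham"),
         ((5, 4), "spam"), ((6, 5), "spam"), ((4, 6), "spam"), ((5, 5), "spam")]
test = [((0, 2), "ham"), ((6, 4), "spam"), ((1, 2), "ham"), ((4, 4), "spam")]

# Trial and error over K: score several odd values and keep the most accurate one.
results = {k: evaluate(train, test, k) for k in (1, 3, 5)}
best_k = max(results, key=lambda k: results[k]["accuracy"])
for k, scores in results.items():
    print(k, scores)
print("best K:", best_k)
```

The toy points are well separated, so every candidate K scores the same here; on real email features the scores would usually differ between values of K, which is what makes the trial-and-error search worthwhile. Odd values of K avoid tied votes when there are only two classes.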
TABLE 3.1. POSITIVE AND NEGATIVE PREDICTIONS OF THE KNN ALGORITHM.

To evaluate the performance of our spam classification model, the authors use the confusion matrix shown below and assess the model against four distinct criteria.

Fig 3.9 Confusion matrix of spam and ordinary mail.

TABLE 3.2. ACCURACY OF THE PROPOSED KNN ALGORITHM.

Data         Support  Precision  Recall  F1-Score
Ham          949      0.99       0.93    0.96
Spam         166      0.71       0.93    0.80
avg / total  1115     0.94       0.93    0.94
Accuracy     0.931835650224216

IV. CONCLUSION

This research examines the application of machine learning strategies to spam filtering. Recent classification methods used to sort messages into the categories of spam or ham are analysed. The paper discussed how various strategies can be used in conjunction with machine learning classifiers to tackle spam, and how spam has developed over time in order to trick detection systems. It also considered public datasets and performance indicators that can be used to evaluate spam filters. The difficulties that machine learning algorithms encounter while attempting to combat spam were highlighted, and a number of different machine learning approaches were compared. The KNN algorithm was offered as a solution to the challenges that still remain in spam filtering; it achieves an accuracy rate of 93.18%.

REFERENCES

[1] Srivastava, A., Singh, A., Joseph, S.G., Borole, Y.D., Singh, H.K., WSN-IoT Clustering for Secure Data Transmission in E-Health Sector using Green Computing Strategy, 2021 9th International Conference on Cyber and IT Service Management, CITSM 2021, 2021.
[2] Shrivastava, A.; Ranga, J.; Narayana, V.N.S.L.; Chiranjivi; Borole, Y.D., Green Energy Powered Charging Infrastructure for Hybrid EVs, 2021 9th International Conference on Cyber and IT Service Management, CITSM 2021, DOI: 10.1109/CITSM52892.2021.9589027.
[3] M. Awad, M. Foqaha, Email spam classification using hybrid approach of RBF neural network and particle swarm optimization, Int. J. Netw. Secur. Appl. 8 (4) (2016).
[4] D.M. Fonseca, O.H. Fazzion, E. Cunha, I. Las-Casas, P.D. Guedes, W. Meira, M. Chaves, Measuring, characterizing, and avoiding spam traffic costs, IEEE Int. Comp. 99 (2016).
[5] Visited on May 15, 2017, Kaspersky Lab Spam Report, 2017, 2012, https://www.securelist.com/en/analysis/204792230/Spam_Report_April_2012.
[6] E.M. Bahgat, S. Rady, W. Gad, An e-mail filtering approach using classification techniques, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), November 28-30, 2015, Springer International Publishing, BeniSuef, Egypt, 2016, pp. 321–331.
[7] C.P. Lueg, From spam filtering to information retrieval and back: seeking conceptual foundations for spam filtering, Proc. Assoc. Inf. Sci. Technol. 42 (1) (2005).
[8] Emmanuel Gbenga Dada, Joseph Stephen Bassi, Machine learning for email spam filtering: review, approaches, and open research problems.
[9] Loredana Fire, Camelia Lemnaru, Spam Detection Filter using KNN Algorithm and Resampling.
[10] Anurag Shrivastava; Chinmaya Kumar Nayak; R. Dilip; Soumya Ranjan Samal; Sandeep Rout; Shaikh Mohd Ashfaque, Automatic robotic system design and development for vertical hydroponic farming using IoT and big data analysis, Materials Today: Proceedings, 2021-07, DOI: 10.1016/j.matpr.2021.07.294.
[11] Anurag Shrivastava; Rajneesh Sharma; Mohit Kumar Saxena; V. Shanmugasundaram; Moti Lal Rinawa; Ankit, Solar energy capacity assessment and performance evaluation of a standalone PV system using PVSYST, Materials Today: Proceedings, 2021.
[12] Chawla, P., Chana, I. & Rana, A. A novel strategy for automatic test data generation using soft computing technique. Front. Comput. Sci. 9, 346–363 (2015). https://doi.org/10.1007/s11704-014-3496-9
[13] G. Dubey, A. Rana and N. K. Shukla, "User reviews data analysis using opinion mining on web," 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), Greater Noida, India, 2015, pp. 603-612, doi: 10.1109/ABLAZE.2015.7154934.
[14] Dash, Y., Dubey, S. K., & Rana, A., Maintainability prediction of object oriented software system by using artificial neural network approach. International Journal of Soft Computing and Engineering (IJSCE), 2012, 2(2), 420-423.
[15] S. Gupta, A. Rana and V. Kansal, "Comparison of Heuristic techniques: A case of TSP," 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 2020, pp. 172-177, doi: 10.1109/Confluence47617.2020.9058211.
[16] Priyanka Chawla, Inderveer Chana, Ajay Rana, Cloud-based automatic test data generation framework, Journal of Computer and System Sciences, Volume 82, Issue 5, 2016, Pages 712-738, ISSN 0022-0000, https://doi.org/10.1016/j.jcss.2015.12.001.