Voting Classification Method for Email Spam Prediction
*Saurabh Gupta
Dept. of Information Technology
Indian Institute of Information Technology Allahabad
Prayagraj, India
[email protected]

Sourav Mishra
Dept. of Information Technology
Indian Institute of Information Technology Allahabad
Prayagraj, India
[email protected]

Dr. Vijay K Chaurasiya
Dept. of Information Technology
Indian Institute of Information Technology Allahabad
Prayagraj, India
[email protected]

Abstract—E-mail users receive several hundred spam messages with fresh content every day, sent from fresh addresses that are generated by automated tools. In real time, it is unrealistic, not to mention tedious, to filter spam through outdated methods such as black and white lists. Applying text mining to e-mail can filter spam more competently. E-mail spam detection proceeds in several phases. Pre-processing cleans the data set, and feature extraction identifies the traits that most strongly affect the assigned set. A combination of multiple classifiers, such as SVM, NB, KNN and Random Forest, is used for classification. The developed architecture is evaluated in terms of recall, accuracy, and precision. This work implements the new methodology in Python. The results of the proposed model show a large improvement in e-mail spam prediction.

Index Terms—Voting Classification, SVM, Naïve Bayes, Random Forest, Decision Tree, KNN, Email Spam.

I. INTRODUCTION

The advent of the World Wide Web has changed the way people communicate and has driven the expansion of new communication services such as electronic mail (email). Email has become an indispensable part of the communication framework of many businesses and merchants. Nevertheless, the technology has a weakness: malicious people misuse this "free" mail structure by delivering unsolicited bulk messages to gain revenue or to steal personal data or IDs, thereby harming users. Such people exploit the security and identity-verification gaps in the existing email communication model, which relies on SMTP (Simple Mail Transfer Protocol) and cannot validate the source of an email at the user or mail server end [1]. The existing structure of SMTP is exposed to misuse, as any sender can forge their identity and send emails with any content of their choice to any addressee. Such abuse of e-messaging infrastructure to indiscriminately deliver unsolicited emails is known as "spamming". Currently, email users find an enormous number of spam emails from unfamiliar senders in their mailboxes every day. Spamming is also used to carry out online fraud via social engineering, much of which takes place through an email from an unreliable origin containing a URL that, when opened, is said to compromise one's personal data. Spamming remains financially feasible because spammers can maintain their mailing lists cheaply.

In recent times, spammers have targeted diverse kinds of electronic communication services, such as email. Most people have email, and they often face the problem of email spam. Spam is a major issue for both clients and ISPs (Internet Service Providers). The main reason is that electronic communication attracts a lot of attention while, on the other side, spam delivery techniques keep advancing. Email is easy to access, so hackers are able to launch attacks on this platform. Spam is the primary danger to email and is faced by most email users [2]. The term spam characterizes the delivery of unwanted messages and junk mail to the inboxes of internet users. Thus, email spam is a kind of unrequested data that a malicious user transmits to electronic mailboxes.

A. Email Spam Filtering Process

A surge in the number of spammers and spam emails has been noticed in recent years, as the investment required for the spamming business is minimal. This has led to a system that treats every email as suspicious, causing substantial investment in defence mechanisms. The most commonly used mail filtering schemes are Knowledge Engineering (KE) and Machine Learning (ML). KE-based approaches generate a set of rules to classify messages as spam or genuine mail. A typical rule might be: "If a message has the text 'Buy Now' in its subject, the message is spam". Such a rule set must be built either by the filter's user or by some other authority. The downside
of this approach is that the set of rules needs to be updated regularly, and many users find it inconvenient to maintain them. In the latter case, machine learning, no rules need to be specified explicitly. Instead, a set of pre-classified documents (training samples) is required. The classification rules are then learned from this data using a specific algorithm. ML comprises various techniques that carry out this task efficiently. Nowadays, most available data is imperfect: it contains aggregated, noisy and missing values. The ML-based model for filtering email spam is executed in 3 phases: the data is preprocessed, feature engineering is performed, and an ML algorithm is applied. Figure 1 shows the general architecture of email spam filtering.

Fig. 1. Email Spam Filtering Architecture

A. Pre-processing: The purpose of the initial phase, which is executed when an email is received, is to pre-process the email by eliminating words such as conjunctions and articles from its text, as these components are ineffective for classification. The elements handled in this phase are:
• Stop Words or Punctuation: Some redundant words are used when posting a review. These words do not help to recognize spam reviews. Eliminating them before tokenization therefore avoids noise and useless tokens. To illustrate, take the text "the weather is cool". After removing the stop words and the punctuation, the review reduces to the words "weather" and "cool".
• PoS (Part of Speech) tagging: Words are tagged with their part of speech according to the context recognized in the review text. Furthermore, a word is tagged in relation to the adjacent and related words in the review text. In its standard form, the approach labels words as nouns, verbs, adjectives, adverbs, etc.
• Stemming: A stemming algorithm transforms the different forms of a word into a single recognized form. For instance, "works", "working", and "worked" are instances of the word "work". Stemming must be applied to the review text before tokenization.

B. Feature engineering: The second phase is a decision-making process in which a set of attributes is used for learning from a given set of training instances. Every attribute takes various values. Thus, each training example (a legitimate or spam email message) is mapped to a vector in a multidimensional space whose dimensions correspond to the attributes. Feature engineering consists of three phases, tokenization, feature selection and feature extraction, which are defined as:
• Tokenization: In text classification and spam filtering, the most commonly used attributes are the smallest character sequences that carry meaning in a text, i.e., words. In the broader sense, decomposing a text into tokens is known as tokenization.
• Dimensionality reduction: A major issue for ML (machine learning) techniques is the high dimensionality of the data, a problem that grows with the size of the datasets. The number of features is kept as small as possible in order to reduce training time, storage requirements and overhead. There are two kinds of dimensionality reduction methods: the first extracts attributes and the second selects attributes.
• Feature selection: The purpose of feature selection is to obtain a subset of words with similar or even better predictive power than the original set of words. To select the best words, a function is used that scores and ranks words by their goodness, i.e., it measures feature quality. The selection aims to minimise a particular cost function (CF). Selecting attributes does not change the data itself. It is carried out during pre-processing, before the classification algorithm is trained. The approach is also known as variable selection, attribute reduction or variable subset selection. The main attributes for detecting email spam are the mail body and subject, the size of the mail, the word count, the recipient age, and whether the recipient responded to the mail. The sender-account features that help detect spam are the sender's country, IP address, email and status.
• Feature Extraction: Feature extraction generates an artificial set of terms that differs from, and is smaller than, the original one. In automated text classification, the approaches to feature extraction are Term Clustering and LSI (Latent Semantic Indexing). Term clustering groups semantically related words; it is inapplicable in the spam filtering context. LSI attempts to reduce the problems posed by polysemy and synonymy when indexing documents.

C. Email Classification: The fundamental objective of supervised learning is to classify an email message. It focuses on building a probabilistic model of a function that maps emails to classes. A learning algorithm is presented with a set of examples whose classification or labelling is already known, using the entire email dataset to classify
the single instance of messages. This set is called the training set. Before the algorithm is developed, a number of classified messages are removed from the training set so that its effectiveness can be tested; this held-out set is called the testing set. The accuracy of the developed algorithm is computed by building several models from different portions of the examples used for training and testing. The classification error is then averaged over all of these models. This cycle is called n-fold cross validation, where n is the number of parts into which the instance set is divided. The cycle is used to evaluate several algorithms and can be repeated multiple times. After the model is developed, future emails are classified. The learning algorithm is a significant part of a document classification system. The final phase applies ML (machine learning) algorithms to filter email spam. A number of ML techniques, such as PNB (probabilistic naïve Bayes), KNN (k-nearest neighbor), DT (decision tree) and LSVM (linear support vector machine), are adopted to classify spam. These algorithms are used with compacted index vectors, which yield a space of lower dimensionality; to this end, the original vectors are combined according to the patterns of words that appear together.

D. Measure for Evaluation of Performance: Parameters such as specificity, accuracy, sensitivity, and execution time are used to analyze the effectiveness of the classification algorithm in detecting email spam.

B. Classic Spam Filtering Methods

Before ML techniques, several other technical measures were used to filter spam. Some of these well-known approaches are:
• Heuristic Content Filtering: Heuristic filters are rule-based. They search for patterns in spam mails that can be used to classify them. They analyze the content of a message and classify it as spam or legitimate with a set of hand-coded rules. Content filtering methods specify lists of words or regular expressions that are disallowed in mail messages. The email header, which contains the list of recipients, the source IP addresses and the subject, is also analyzed in this filtering.
• White Listings: White lists are a well-known approach to filtering spam email. The list contains the addresses that are assumed to be safe. The method is implemented on the server side or on the client side and is often used as a complement to other, more effective approaches. In server-side white lists, an administrator must authenticate addresses before they go on the trusted list. This is feasible for a small company or a server with a small number of email accounts, but it can cause problems on large corporate servers where every user has their own white list.
• Black Listings: These lists are frequently called DNSBL and are used to filter emails sent via acknowledged spam addresses or compromised servers. Aggressive black lists may block entire domains and thereby produce many false positives (FPs). To tackle this issue, a number of distributed black lists should be available, and the sender information should be compared against several of them before an email is blocked. The latest DNSBLs are dynamic: they take in new information and expire old entries, so that they reflect the current situation in the address space.
• Grey Listing: This approach targets junk that is sent using spam bots, special software built to send thousands of emails in a short time. Such software differs from conventional email servers and does not follow the email RFC standards, a property that grey lists exploit. When an email is received from an unknown sender that is not on a white list, a sender–receiver tuple is generated. The mail must then be sent again by a real server for the tuple to be discovered.
• Collaborative Filtering: This is a distributed approach to filtering spam in which every user shares judgments about spam and non-spam with the other users. When a group of users in the same domain have tagged an email from a common sender as spam, the system learns from the information in those emails so that those particular emails can be categorized and the rest of the users in the domain
do not receive those emails.

C. Machine Learning for Email Spam Filtering

ML (Machine Learning) is a sub-field of the widely used field of AI (artificial intelligence). Its fundamental objective is to enable machines to learn like human beings. Learning is a process that helps in understanding, observing and representing information about some statistical phenomenon. Unsupervised learning exposes hidden clusters or detects irregularities in data, such as spam in emails or network attacks. Attributes such as BOW (bag of words) or subject line analysis are considered for detecting email spam. The input to the email classification task is a 2-d (two-dimensional) matrix whose axes represent the messages and the attributes. The email classification task is first split into several sub-tasks. The first is to collect and represent the data. After that, the email attributes are selected and reduced to shrink the data for the later phases. Finally, the email classification phase finds the actual mapping between the training set and the testing set. Some of the ML (machine learning) algorithms are defined as:

i. Support Vector Machine (SVM) classifier: This algorithm is based on the notion of decision planes, which define decision boundaries. A decision plane separates a group of objects with different class memberships. The algorithm searches for the optimal hyperplane, the one that separates two classes with the largest margin. Finding it is an optimization problem, and solving it yields a robust solution.

This equation gives the weight of each training sample and identifies the samples that are support vectors, while a regularization parameter balances accuracy against model complexity. The aim is to obtain good generalization ability. A kernel function, denoted K, measures the similarity between two instances. The RBF (Radial Basis Function) kernel function is expressed mathematically as:

The weights are computed to classify the test example x:

In general, the parameter values are chosen with a cross-validation procedure on the training dataset. The procedure estimates the generalization ability on new instances that are absent from the training dataset. In k-fold cross validation, the training dataset is divided into k subsets of approximately equal size, one of which is held out. A classifier is trained on the remaining instances, and the held-out subset is then used to measure the performance of the algorithm. The cycle is repeated k times, once for every subset. When the training dataset is very large, a small subset of it can be used for cross validation so that the computing cost is reduced.
Fig. 2. An SVM dividing Black and White Points in 3 Dimensions

ii. Naïve Bayes classifier: This algorithm is an effective classifier for email spam. It is called naive because it ignores possible dependencies or correlations among the inputs and reduces a multivariate problem to a set of univariate problems. The algorithm works with word probabilities: an incoming email is recognized as spam if it contains words that often occur in spam but not in legitimate mail. The algorithm is used in mail-filtering software. Bayesian filters must be trained. Every word has a certain probability of occurring in spam or ham email in the filter's database. If the total of the word probabilities exceeds a certain limit, the filter marks the e-mail as belonging to one class or the other. In general, emails are classified into two classes, spam or legitimate. Most statistics-based spam filters use Bayesian probability computation to combine the statistics of individual tokens into a total score, on which the filtering decision is based. The statistic of interest for a token T is its spam rating, which is expressed as:
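A commonly used form of the spam rating of a token, stated here as an assumption (the symbol S[T] is chosen to match the S[M] and H[M] notation used for whole mails), is:

```latex
S[T] = \frac{C_{spam}(T)}{C_{spam}(T) + C_{ham}(T)}
```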
Here, Cspam(T) is the number of spam mails and CHam(T) is the number of ham mails that contain token T. To evaluate the probability for a mail M with tokens T1,...,TN, the spam ratings of the individual tokens are combined. The simplest way to classify the mail is to compute the product of the individual tokens' spam ratings and compare it with the product of the individual tokens' ham ratings.

A mail is considered spam if the overall spam product S[M] is greater than the ham product H[M].

iii. Logistic Regression: This statistical algorithm analyzes data in which one or more independent features determine an outcome. The outcome is measured with a dichotomous variable, so there are only two possible results. In LR (Logistic Regression), the dependent variable is dichotomous or binary: for example, the data is coded as 1 for the positive (spam) case and 0 for the negative (ham) case. The algorithm seeks the best-fitting model to describe the relationship between a set of independent attributes and the dichotomous feature of interest. It produces the coefficients, together with their standard errors and significance levels, of a formula that predicts a logit transformation of the probability that the feature of interest is present, as:

and

Under the LR algorithm, the data falls into two classes, where y is either 0 or 1. The approach gives a parametric form for P(y=1 | X; W), where W is the parameter vector. The equation follows from representing the log odds of class 1 as a linear function of an instance x. The LR (logistic regression) algorithm can classify and filter emails as spam or genuine.

iv. K Nearest Neighbour (K-NN): Because spam emails keep changing, new categories of spam appear, and the clusters must be updated as the data is clustered. The challenging task is therefore to classify the collected spam emails, and KNN (K-Nearest Neighbor) is adopted for this purpose. The algorithm decides the class of an object on the basis of a pre-existing group of classified objects; here, the objects are spam messages. For every new incoming spam mail Sn+1, the k nearest spam mails, each belonging to a certain class, are found. For every cluster Cp, the algorithm then evaluates the relevance score:

Here, Op and Ip are the elements and their indexes among the k NNs of the spam message Sn+1, Sp denotes the set of spam messages in cluster Cp, and

The spam message Sn+1 is assigned to the class Cp that has the highest score(Sn+1, Cp). If score(Sn+1, Cp) < θ, the spam message does not belong to any of the clusters Cp, and a new cluster Cp+1 is created; θ is a predefined threshold.

v. Artificial Neural Networks: This algorithm is a computational model based on biological NNs (neural networks). It consists of an interconnected group of artificial neurons. It is an adaptive system that changes its structure according to the data flowing through the AN (artificial network) during a learning stage. The algorithm is based on learning from examples. Perceptron and MLP (multilayer perceptron) are two kinds of ANNs (Artificial Neural Networks). An MLP is a function that can be visualized as a network of several layers of neurons connected in a FFNN (feedforward neural network). The first layer consists of input neurons, which represent the input variables, and the last layer consists of output neurons, which give the result value of the function. The layers between the first and the last layer are called hidden layers.

II. RELATED WORK

M. K. Chae et al. (2017) suggested a hybrid algorithm for classifying spam email [8]. This approach
employed a context-based classifier as an effective technique. This was demonstrated by computing the IG while maximizing the accuracy of spam classification. The approach comprised 3 stages: pre-processing the email, extracting the attributes and classifying the email. The analysis showed the effectiveness of the LingerIG spam filter in separating unwanted emails from a cluster of similar work emails. The experimental results indicated that the suggested solution achieved a precision of 100 percent for the spam filter built by developers from the University of Sydney. The study indicated that the suggested approach was feasible, and the experiments showed that the solution was capable of improving the spam filtering process.

P. Rajendran et al. (2016) described email as a private means of communication. The inherent privacy constraints were a main obstacle to building effective spam filtering techniques, which need access to a large amount of email data from many users [9]. A privacy-preserving spam filtering system was proposed that was adaptable and helped users compose the privacy settings for their emails. A two-level framework was recommended to filter spam and to determine the best privacy policy. HTML content was used to detect spam with a similarity matching system. The recommended framework provided automatic settings for filtering emails as spam.

Simranjit Kaur Tuteja et al. (2016) stated that a great amount of unsolicited mail flowed into users' mailboxes daily [10]. Bulk spam or phishing mail had been the chief nuisance of the preceding decades. These unwanted spam emails were tedious for numerous email users. They also strained the IT infrastructure of organizations and cost businesses billions of dollars in lost productivity. The demand for effective spam filtering therefore became very significant. For this purpose, a BPNN filtering algorithm based on text classification was implemented to separate relevant emails from unwanted emails.

Yan Zhang et al. (2019) presented a 3-phase technique for filtering spam that reduced the chance of misclassification [11]. Incoming emails were split into three folders: emails containing spam went to the spam folder, authentic emails went to the genuine folder, and emails for which no precise decision could be made went to a suspicious folder. The presented technique had to trade off precision against coverage, and GTRS was employed to obtain the coverage for filtering email spam. An evaluation was conducted on the game formulation and the repetition learning method. The UCI dataset was used in the experiments. The outcomes confirmed that the presented technique could effectively improve the coverage level with regard to the precision level.

Pingchuan Liu et al. (2016) investigated an effective approach to filtering possible spam e-mails [12]. Spammers evaded spam detection with a number of methods, including obfuscation. The investigated approach therefore used the email content to generate a keyword corpus with text processing techniques so that obfuscation was handled. The system was trained using 4327 emails and tested using 4292 emails from the CSDMC2010 dataset in order to evaluate the approach. The experimental results showed that the approach provided an accuracy of 92.8 percent.

Reshma Varghese et al. (2017) discussed that the main aim was to find the best feature set for filtering spam email [13]. Four kinds of features were taken into account. The Naïve Bayes score was used to eliminate rare features, and IG was used as a tool. A feature occurrence matrix was created and weighted with Term Frequency-Inverse Document Frequency values. AB, J48 and Support Vector Machine algorithms were used to build the presented system. Experiments were conducted on the individual systems and on the ensemble algorithms. Both offered an ROC of 1 and a false positive rate of 0.

J. Vijaya Chandra et al. (2016) described spear phishing as part of novel attacks that collect information and target an individual or organization [14]. Data about the recipient was collected by means of social engineering methods. Psychological attacks were combined with technical tricks to send malicious emails, in which web links were included to provoke the recipient to click on them. These links led to websites infected with malware. The procedure was based on spammed and malicious emails. Recipient-side detection techniques, namely spam or junk mail filters, were put forward using a mathematical concept based on BN. A self-destructive method was adopted to protect sensitive data from intruders. The intended approach was evaluated, and its efficacy was presented in the experiments.

Abdulhamit Subasi et al. (2018) analyzed the numerous techniques that had been provided for detecting spam emails [15]. Such methods were adopted in several schemes and at different levels. New applications helped to improve filter precision. Different filtering techniques based on various ML techniques were employed to filter out spam e-mails. However, some of these techniques had poor precision, and some were computationally very expensive. An approach to spam e-mail filtering was recommended on the basis of DT algorithms, which were simple and provided superior accuracy. The experimental outcomes demonstrated that the recommended RF classification algorithm performed better than other DT classifiers on public data sets.

Ersin Enes Eryılmaz et al. (2020) stated that email is used by all people or communities seeking propaganda, advertising and scams because it is easy and inexpensive to use [16]. People or
communities who wanted to achieve their goals sent spam and redundant messages to email accounts without the owners' knowledge. These messages inflicted serious financial and moral damage on Internet users and participants in Internet traffic. Junk email was sent to recipients without any consent and usually with malicious intent. The Keras DL library was used on a Turkish dataset to detect spam. The dataset comprised 800 e-mails, half of which were spam. The DL algorithm LSTM achieved an accuracy of 100 percent on this dataset.
Shafiya Afzal Sheikh et al. (2020) proposed an on-demand spam filtering technique in which the email client is able to notify the user about spam appearing in their inbox and move it to the unwanted folder as a solution to this vulnerability [17]. The test results showed that the proposed technique is effective in removing spam messages from the inbox and improves classification efficiency when subsequent messages from the same spammers are classified as spam.

III. METHODOLOGY

A voting classifier combines several classification models. Voting techniques can be broadly classified into hard and soft voting. Hard voting selects the class predicted by the majority of the classifiers, whereas soft voting averages the class probabilities predicted by the classifiers, optionally assigning a weight to each classifier. Any of the implemented classifiers can be used to train the framework. The trained classifiers serve as input to the voting classifier, which takes the input test data and produces the results.
The task of email spam prediction has several stages. The pre-processing phase cleans the content, and the classification phase classifies the data into certain classes.

Fig. 3. Proposed Flowchart

The SVM is a supervised ML algorithm. This approach can be applied to regression, classification and outlier detection. These classification models are inspired by the notion of a hyperplane for classification. A hyperplane is a subspace of a vector space whose dimension is one less than that of the vector space. The optimal hyperplane is the one that maximizes the margin, that is, the distance between the two parallel planes that bound the two classes. The family of naive Bayes (NB) classifiers relies on Bayes' theorem, which relates conditional probabilities. In ML-based spam detection, these probabilities are linked to the relative frequencies with which words occur in messages. The second idea is the so-called naive hypothesis, which assumes that all features are independent of one another given the output (i.e., their class).
Although this hypothesis rarely holds, NB classifiers can classify data very effectively even when the training data contains few instances. In addition, Naïve Bayes algorithms are considered fast and simple. SVMs are among the most widely used classification algorithms. Given a labeled dataset, an SVM finds a classification (separation) hyperplane by maximizing the distance to the data points that belong to different categories. Two kinds of Support Vector Machine algorithms exist: hard-margin (every point must be classified correctly) and soft-margin (some misclassification is tolerated). Unlike K-Nearest Neighbor classifiers, SVMs work well in higher dimensions. The data points can be separated more easily as the number of features increases. The points nearest to the classification hyperplane are known as support vectors (SVs). A hyperplane is also known as a decision boundary and divides the points that belong to different groups.

IV. RESULTS AND ANALYSIS

Fig. 4. Naïve Bayes Classifier

Figure 4 shows that email spam detection consists of various steps. Classification techniques allow the data to be classified into negative, positive, and neutral classes. The naive Bayes method also classifies the data into similar classes. The naive Bayes classifier provides 62.27 percent accuracy.
Figure 5 shows that email spam detection consists of various steps. Classification techniques allow the data to be classified into negative, positive, and neutral classes. The SVM method also classifies the data into similar classes. The SVM classifier provides 86 percent accuracy.
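As a minimal sketch of the hard and soft voting schemes described in Section III, the following Python code combines hypothetical outputs of three base classifiers for a single email. The function names `hard_vote` and `soft_vote`, the example labels and the probability values are illustrative assumptions and are not taken from the proposed implementation; equal classifier weights are assumed unless weights are supplied.

```python
from collections import Counter


def hard_vote(labels):
    """Return the class label predicted by the largest number of classifiers."""
    # most_common(1) yields the label with the highest count.
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas, weights=None):
    """Return the label with the highest weighted-average probability.

    probas  -- list of dicts, one per classifier, mapping label -> probability
    weights -- optional per-classifier weights (equal weights if omitted)
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    classes = list(probas[0].keys())
    # Weighted average of each class probability across the classifiers.
    avg = {
        c: sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in classes
    }
    return max(avg, key=avg.get), avg


# Hypothetical outputs of three base classifiers for one email (assumed values).
labels = ["spam", "ham", "spam"]
probas = [
    {"spam": 0.55, "ham": 0.45},
    {"spam": 0.10, "ham": 0.90},
    {"spam": 0.60, "ham": 0.40},
]

hard_label = hard_vote(labels)
soft_label, soft_scores = soft_vote(probas)
print("hard voting:", hard_label)
print("soft voting:", soft_label, soft_scores)
```

With these assumed values, two of the three classifiers label the email as spam, so hard voting returns spam, while soft voting returns ham because the second classifier is far more confident in its prediction. This illustrates why the two schemes can disagree on the same input.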
Fig. 5. SVM Classifier

Fig. 6. Decision Tree Classifier

Figure 6 shows that email spam detection consists of various steps. Classification techniques allow the data to be classified into negative, positive, and neutral classes. The decision tree method also classifies the data into similar classes. The decision tree classifier provides 75 percent accuracy.

Fig. 7. KNN Classifier

Figure 7 shows that email spam detection consists of various steps. Classification techniques allow the data to be classified into negative, positive, and neutral classes. The KNN method also classifies the data into similar classes. The KNN classifier provides 78 percent accuracy.

Fig. 8. Voting Classifier

Figure 8 shows that email spam detection consists of various steps. Classification techniques allow the data to be classified into negative, positive, and neutral classes. The voting classification method also classifies the data into similar classes. The voting classifier provides 95 percent accuracy.

Fig. 9. Voting Classifier

As shown in Figure 9, the proposed model is implemented as a hybrid model that combines multiple classifiers. The proposed model improves the accuracy of email spam detection. It is compared with existing models such as SVM, NB, KNN, and Random Forest to test the reliability of the model.

TABLE I
DATA-SET DETAILS
Number of Attributes      2
Number of Instances       4623
Number of Test Samples    727
Data Division set         60 percent

TABLE II
COMPARISON ANALYSIS
Model Name       Accuracy         Precision     Recall
Naïve Bayes      62.27 percent    39 percent    62 percent
KNN              77.93 percent    78 percent    78 percent
Decision Tree    75.19 percent    75 percent    75 percent
SVM              86.38 percent    86 percent    86 percent
Proposed         94.62 percent    95 percent    95 percent

Fig. 10. Accuracy Study

Figure 10 exhibits the accuracy-based comparison between SVM, KNN, Random Forest and the proposed model, which is a hybrid model for email spam prediction. The proposed architecture achieves up to 90 percent accuracy for email spam prediction.

Fig. 11. Accuracy Study

Figure 11 compares the precision and recall values of the proposed model with those of SVM, KNN and Random Forest. The proposed model is a combination of Naïve Bayes, SVM and Random Forest for email spam prediction.

CONCLUSION

This paper sheds light on the difficulties encountered in predicting spam emails using machine learning, which proves to be a tricky task. The various techniques proposed for this purpose in previous years are based on factors such as tokenization, word polarity, etc. The work presented in this paper involves a machine-learning-based approach for detecting spam emails in several phases. Several combinations of multiple classifiers, such as SVM, NB and Random Forest, are used in the classification phase. The formulated architecture yields up to 95 percent accuracy. The precision and recall values are estimated to be 95 percent and 94.62 percent respectively.

REFERENCES
[1] Priti Sharma, Uma Bhardwaj, “Machine Learning based Spam E-Mail Detection”, 2018, International Journal of Intelligent Engineering and Systems, Vol.11, No.3.
[2] M. Deepika, Shilpa Rani, “PERFORMANCE OF MACHINE LEARNING TECHNIQUES FOR EMAIL SPAM FILTERING”, 2017, IJRTER.
[3] Esha Bansal, Pradeep Kumar Bhatia, “A SURVEY OF VARIOUS MACHINE LEARNING ALGORITHMS ON EMAIL SPAMMING”, 2017, International Journal of Advances in Electronics and Computer Science.
[4] Dr. Swapna Borde, Utkarsh M. Agrawal, Viraj S. Bilay, Nilesh M. Dogra, “Supervised Machine Learning techniques for Spam Email Detection”, 2017, IJSART, Volume 3 Issue 3.
[5] Deepika Mallampati, Nagaratna P. Hegde, “A Machine Learning Based Email Spam Classification Framework Model: Related Challenges and Issues”, 2020, International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume-9 Issue-4.
[6] Harjot Kaur, Er. Prince Verma, “International Journal of Engineering Sciences Research Technology”, 2017, IJESRT.
[7] A. Lakshmanarao, K. Chandra Sekhar, Y. Swath, “An Efficient Spam Classification System Using Ensemble Machine Learning Algorithm”, 2018, Journal of Applied Science and Computations, Volume 5, Issue 9.
[8] M. K. Chae, Abeer Alsadoon, P.W.C. Prasad, Sasikumaran Sreedharan, “Spam filtering email classification (SFECM) using gain and graph mining algorithm”, 2017, 2nd International Conference on Anti-Cyber Crimes (ICACC).
[9] P. Rajendran, M. Janaki, S. M. Hemalatha, B. Durkananthini, “Adaptive privacy policy prediction for email spam filtering”, 2016, World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave).
[10] Simranjit Kaur Tuteja, Nagaraju Bogiri, “Email Spam filtering using BPNN classification algorithm”, 2016, International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT).
[11] Yan Zhang, Peng Fei Liu, Jing Tao Yao, “Three-way Email Spam Filtering with Game-theoretic Rough Sets”, 2019, International Conference on Computing, Networking and Communications (ICNC).
[12] Pingchuan Liu, Teng-Sheng Moh, “Content Based Spam E-mail Filtering”, 2016, International Conference on Collaboration Technologies and Systems (CTS).
[13] Reshma Varghese, K.A. Dhanya, “Efficient Feature Set for Spam Email Filtering”, 2017, IEEE 7th International Advance Computing Conference (IACC).
[14] J. Vijaya Chandra, Narasimham Challa, Sai Kiran Pasupuleti, “A practical approach to E-mail spam filters to protect data from advanced persistent threat”, 2016, International Conference on Circuit, Power and Computing Technologies (ICCPCT).
[15] Abdulhamit Subasi, Sara Alzahrani, Afnan Aljuhani, Maha Aljedani, “Comparison of Decision Tree Algorithms for Spam E-mail Filtering”, 2018, 1st International Conference on Computer Applications Information Security (ICCAIS).
[16] Ersin Enes Eryılmaz, Durmuş Özkan Şahin, Erdal Kılıç, “Filtering Turkish Spam Using LSTM From Deep Learning Techniques”, 2020, 8th International Symposium on Digital Forensics and Security (ISDFS).
[17] Shafiya Afzal Sheikh, M. Tariq Banday, “Improving Efficiency of E-mail Classification Through On-Demand Spam Filtering”, 2020, 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO).
