Voting Classification Method For Email Spam Prediction
*Saurabh Gupta
Dept. of Information Technology
Indian Institute of Information Technology Allahabad
Prayagraj, India
[email protected]

Sourav Mishra
Dept. of Information Technology
Indian Institute of Information Technology Allahabad
Prayagraj, India
[email protected]
Abstract—E-mail users receive several hundred spam messages every day, with fresh content and from fresh addresses that are generated by automated tools. Filtering spam in real time through outdated methods such as blacklists and whitelists is unrealistic, not to mention tedious. Applying text mining schemes to e-mails makes it possible to filter spam more competently. E-mail spam detection techniques have several phases. Pre-processing cleans up the dataset, and the feature extraction phase identifies the traits that affect the assigned class to the greatest extent possible. A combination of multiple classifiers, such as SVM, NB, KNN and Random Forest, is then used for classification. The parameters used to evaluate the developed architecture are recall, accuracy, and precision. This work implements the new methodology in Python. The results show that the proposed model improves email spam prediction.
Index Terms—Voting Classification, SVM, Naïve Bayes, Random Forest, Decision Tree, KNN, Email Spam.

I. INTRODUCTION
The advent of the World Wide Web has changed the way people communicate and has driven the expansion of new communication services such as electronic mail (email). Email has become an indispensable part of the communication framework of many businesses and merchants. Nevertheless, the technology has a weakness: malicious people misuse this "free" mail structure by delivering unwanted messages in bulk to gain revenue or to steal personal data or IDs, thereby harming users. Such people exploit the security and identification lapses of the existing email communication model, which relies on SMTP (Simple Mail Transfer Protocol) and lacks the ability to validate the source of an email at the user or mail server end [1]. The existing structure of SMTP is exposed to misuse, as any correspondent can forge their identity and send emails with any content of their choice to any addressee. Such abuse of the messaging infrastructure to indiscriminately deliver unwanted emails is known as "spamming". Currently, email users find an enormous amount of spam from unfamiliar senders in their mailboxes every day. Spamming is also used for online fraud via social engineering, a major part of which takes place through an email from an unreliable origin that contains a URL which, when opened, compromises one's personal data. Spamming remains financially feasible because spammers can maintain their mailing lists at low cost.
In recent times, spammers have targeted diverse kinds of electronic communication, such as email. Most people have an email account and often face the problem of spam. Spam is a major issue for clients and ISPs (Internet Service Providers) alike. The main reason is that electronic communication attracts a lot of attention while, on the other side, spam delivery techniques keep advancing. Email is easy to access, so attackers are able to launch attacks on this platform. The primary danger to email is spam, which most email users face [2]. The term spam characterizes the delivery of unwanted messages and junk mail to the inboxes of internet users. Thus, email spam is a kind of unrequested data that a malicious user sends to electronic mailboxes.

A. Email Spam Filtering Process
A surge in the number of spammers and spam emails has been noticed in recent years, as the investment required for the spamming business is minimal. This has led to systems that treat every email as suspicious, causing substantial investment in defence mechanisms. The most commonly used mail filtering schemes are Knowledge Engineering (KE) and Machine Learning (ML). KE-based approaches generate a set of rules to classify messages as spam or genuine mail. A typical rule might be "If a message has the text 'Buy Now' in its subject, the message is spam". Such a rule set must be built either by the filter's user or by some other authority. The downside
of this approach is that the set of rules needs to be updated regularly, and many users find it inconvenient to maintain them. Machine learning, in contrast, does not require any rules to be specified explicitly. Instead, it needs a set of pre-classified documents (training samples). The classification rules are then learned from this data using a specific algorithm. ML comprises various techniques that carry out this task efficiently. Nowadays, most available data is incomplete and contains aggregated, noisy and missing values. The email spam filtering model is based on ML and is executed in three phases: the data is pre-processed, feature engineering is performed, and an ML algorithm is applied. Figure 1 represents the general architecture of email spam filtering.

Fig. 1. Email Spam Filtering Architecture

A. Pre-processing: The initial phase, executed when an email is received, pre-processes the e-mail. It eliminates words such as conjunctions and articles from the email body, because these components are ineffective for classification. Further steps of this kind are defined as:
• Stop Words or Punctuation: Writers use many redundant words that do not help to recognize spam. Eliminating these words before tokenization prevents noise and useless tokens. To illustrate, take the text "the weather is cool". After removing the stop words and the punctuation, only the words "weather" and "cool" remain.
• PoS (Part of Speech) tagging: Each word is tagged with its part of speech according to the context in which it occurs in the message text, taking the neighbouring and associated words into account. In its standard form, this approach recognizes words as nouns, verbs, adjectives, adverbs, etc.
• Stemming Word: A stemming algorithm transforms the diverse forms of a word into a single canonical form. For instance, "works", "working" and "worked" are all instances of the word "work". Stemming must be applied to the message text before tokenization.

B. Feature engineering: The second phase decides which attributes are used for learning from a given set of training instances. Every attribute can take diverse values. Thus, each training example (a valid or spam email message) is mapped to a vector in a multidimensional space with one dimension per attribute. Feature engineering consists of three phases: tokenization, feature selection and feature extraction, which are defined as:
• Tokenization: In text classification and spam filtering, the most commonly used attributes are sequences of characters that carry minimal meaning in a text, i.e., words. In the broader sense, a text is decomposed into tokens through a process known as tokenization.
• Dimensionality reduction: A major issue for ML (machine learning) techniques is the high dimensionality of the data, which grows with the size of the datasets. The standard idea is to reduce the data size and keep the number of features as low as possible, so that training time and storage requirements are eased and the overhead is lowered. There are two kinds of dimensionality reduction methods: the first is based on extracting attributes and the second on selecting attributes.
• Feature selection: The purpose of feature selection is to obtain a subset of words with similar or even better predictive strength than the original set of words. To select the best words, a function is used that scores and ranks words based on their goodness, i.e., a measure of feature quality. This approach is used to minimize a particular cost function (CF). Feature selection does not alter the data. It is executed during pre-processing, before the classification algorithm is trained. It is also referred to as variable selection, attribute reduction or variable subset selection. The major attributes for detecting spam in email are the mail body and subject, the size of the mail, the word count, recipient age, and recipient responded (whether or not the recipient replied to the mail). The sender account features that assist in detecting spam are the sender's country, IP address, email address, and status.
• Feature Extraction: Feature extraction generates an artificial word set whose words differ from, and are fewer than, those of the original set. In automated text classification, the approaches for feature extraction are Term Clustering and LSI (Latent Semantic Indexing). Term clustering groups semantically related words and is inapplicable in the spam filtering context. LSI attempts to reduce the problems posed by polysemy and synonymy when indexing documents.

C. Email Classification: The fundamental objective of supervised learning is to classify an email message. It focuses on developing a probabilistic model of a function that maps emails to classes. A learning algorithm is presented with a set of patterns whose classification or labelling has been done, using the entire email dataset to classify
the single instance of messages. This set is called the training set. A number of classified messages are removed from the training set before the algorithm is developed so that its efficiency can be tested; this set is called the testing set. The accuracy of the developed algorithm is computed by building several models from different partitions of the examples into training and testing sets. The classification error is then averaged over all of these models. This cycle is called n-fold cross validation, in which n defines the number of times the instance set is divided. It is used to compare several algorithms under evaluation. Once the model is developed, future emails are classified. The learning algorithm is a significant part of a document classification system. The final phase focuses on implementing ML (machine learning) algorithms to filter spam in email. A number of ML techniques such as PNB (probabilistic naïve Bayes), KNN (k-nearest neighbor), DT (decision tree) and LSVM (linear support vector machine) are adopted to classify spam. These algorithms operate on compacted index vectors that span a space of lower dimensionality, obtained by combining the original vectors according to patterns of words that appear together.

D. Measure for Evaluation of Performance: Parameters such as specificity, accuracy, sensitivity, and execution time are considered to analyze the efficacy of the classification algorithm in detecting email spam.

B. Classic Spam Filtering Methods
Before ML techniques, several technical measures were used to filter spam. Some of these well-known approaches are defined as:
• Heuristic Content Filtering: Heuristic filters are rule-based. They search for patterns in spam mails that can be used to classify them. The content of a message is analyzed and classified as spam or legitimate with a set of hand-coded rules. Content filtering methods specify lists of words or regular expressions that are disallowed in mail messages. The email header, which contains the list of recipients, the source IP addresses and the subject, is also analyzed in this filtering.
• White Listings: These lists are well-known approaches for filtering spam email. They contain the addresses that are assumed safe. The method is implemented on the server side or on the client side and is often found as a complement to other, more effective approaches. In server-side white lists, addresses must be authenticated by an administrator before they go onto the trusted list. This technique is feasible for a small company or a server with a small number of email accounts. However, it may cause problems when used on large corporate servers where every user has their own white list.
• Black Listings: These lists are frequently called DNSBL and are used to filter emails that are sent via known spam addresses or compromised servers. Aggressive black listings may block entire domains, which produces many false positives (FPs). One way of tackling this issue is to maintain a number of distributed black listings and to compare the sender information against several of them before blocking an email. The latest DNSBL are dynamic: they can grow with new information and can expire entries, so that they reflect the current situation in the address space.
• Grey Listing: This approach relies on the fact that most junk is sent using spam bots, special software written to send thousands of emails in a short time. Such software differs from conventional email servers and does not follow the email RFC standards, and this is the property that grey listings exploit. When an email is received from an unknown sender that is not on a white list, a sender-receiver tuple is generated. That mail is sent again through a real server for discovering the tuple.
• Collaborative Filtering: This is a distributed approach to spam filtering in which every user shares judgments about spam and non-spam with the other users. If a group of users in the same domain has tagged an email from a common sender as spam, the system learns from the information in those emails so that such emails can be categorized and the rest of the users in the domain
cannot receive those emails.

C. Machine Learning for Email Spam Filtering
ML (Machine Learning) is a sub-field of the extensively used field of AI (artificial intelligence). Its fundamental objective is to enable machines to learn as human beings do. Learning is the process of understanding, observing and representing information about some statistical phenomenon. Unsupervised learning uncovers hidden clusters or detects irregularities in data, such as spam in emails or network attacks. Attributes such as the BOW (bag of words) or the subject line are considered to detect email spam. The input to the email classification task is a two-dimensional matrix whose axes represent the messages and the attributes. The email classification task is first split into several sub-tasks. The first is to collect and represent the data. After that, the attributes of the emails are selected and reduced to shrink the data for the later phases. Finally, the classification phase is executed to find the actual mapping between the training set and the testing set. Some of the ML (machine learning) algorithms are defined as:
i. Support Vector Machine (SVM) classifier: This algorithm is based on the notion of decision planes that define decision boundaries. A decision plane separates groups of objects with different class memberships. The algorithm searches for the optimal hyperplane, i.e., the one that separates the two classes with the largest margin. Finding it requires a robust solution to an optimization problem.
Each training sample $x_i$ is given a weight $\alpha_i$. A sample with $\alpha_i > 0$ is a support vector, while a regularization parameter $C$ balances good accuracy against model complexity in order to achieve superior generalization. A kernel function $K$ quantifies the similarity between two instances. The RBF (Radial Basis Function) kernel function is expressed mathematically as:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0$$

The weights are used to classify the test example $x$:

$$f(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right)$$

where $y_i \in \{-1, +1\}$ is the class label of $x_i$ and $b$ is the bias term. In general, the values of $C$ and $\gamma$ are evaluated using a cross validation procedure on the dataset used to train the system. This procedure estimates the generalization potential on new instances that are absent from the training dataset. A k-fold cross validation divides the training dataset into k subsets of approximately equal size, of which one is held out. A classifier is trained on the remaining instances, and the held-out subset is then used to measure the efficacy of the algorithm. The cycle is repeated k times, once for every subset. For a huge training dataset, the cross validation can be performed on a small subset so that the computing cost is reduced.
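As a minimal sketch of the two ideas in this passage, the RBF kernel and k-fold cross validation can be written in plain Python as follows. The function names, the default `gamma` value and the `fit`/`predict` callables are illustrative assumptions, not part of the described system, and the kernel uses the standard definition K(x, z) = exp(-gamma * ||x - z||^2).

```python
import math
import random


def rbf_kernel(x, z, gamma=0.5):
    """RBF similarity K(x, z) = exp(-gamma * ||x - z||^2) of two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs; requires k <= n_samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def cross_validate(samples, labels, k, fit, predict):
    """Average accuracy over k folds, each fold held out exactly once.

    `fit(train_samples, train_labels)` returns a model and
    `predict(model, sample)` returns a label (both are assumed callables).
    """
    scores = []
    for train, test in k_fold_splits(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / k
```

In practice the same loop would be repeated for several candidate values of the regularization and kernel parameters, keeping the pair with the best average score.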
Fig. 2. An SVM dividing Black and White Points in 3 Dimensions

ii. Naïve Bayes classifier: This classification algorithm is an effective classifier for spam emails. It is called naive because it ignores possible dependencies or associations among the inputs and reduces a multivariate problem to a set of univariate problems. NB (Naïve Bayes) makes use of word probabilities: an incoming email is recognized as spam if it contains words that are found in spam but not in legitimate mail. The algorithm is applied in mail filtering software. Bayesian filters require training. Every word has a certain probability of occurring in spam or ham email in the filter's database. If the total of the word probabilities exceeds a certain limit, the filter marks the e-mail as belonging to one class or the other. In general, the emails are classified into two classes, namely spam or legitimate. Most statistics-based spam filters use Bayesian probability computation to combine the statistics of individual tokens into a total score, and the decision is made on the basis of this total score. The statistic of interest for a token T is its spam rating, which is expressed as:

$$S(T) = \frac{C_{Spam}(T)}{C_{Spam}(T) + C_{Ham}(T)}$$

Here, $C_{Spam}(T)$ is the number of spam mails and $C_{Ham}(T)$ is the number of ham mails in which token T is included. To evaluate the probability for a mail M with tokens T1,...,TN, the spam ratings of the individual tokens are combined. The simplest technique computes the product of the individual tokens' spam ratings and compares it with the product of the individual tokens' ham ratings.

iii. K-Nearest Neighbor (KNN) classifier: KNN (K-Nearest Neighbor) is adopted to classify the gathered spam emails. This algorithm addresses the problem of deciding the exact class of an object. It relies on a pre-existing group of classified objects, which here represent spam messages. Sn+1 denotes a new incoming mail, which is compared with the k nearest mails that belong to a certain class. The algorithm evaluates a relevance score for every cluster Cp.
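The token spam rating of the Naïve Bayes classifier can be illustrated with a short Python sketch. It assumes that the rating is S(T) = C_spam(T) / (C_spam(T) + C_ham(T)), that the ham rating of a token is 1 - S(T), and that ratings are clamped away from 0 and 1 so that a single token cannot zero a product. The function names and the `floor` parameter are hypothetical and serve only this illustration.

```python
from collections import Counter


def train_spam_ratings(spam_mails, ham_mails):
    """Return S(T) = C_spam(T) / (C_spam(T) + C_ham(T)) for every seen token.

    Each mail is a list of tokens; a token is counted once per mail.
    """
    c_spam = Counter(tok for mail in spam_mails for tok in set(mail))
    c_ham = Counter(tok for mail in ham_mails for tok in set(mail))
    return {
        tok: c_spam[tok] / (c_spam[tok] + c_ham[tok])
        for tok in set(c_spam) | set(c_ham)
    }


def classify(mail, ratings, floor=0.01):
    """Compare the product of spam ratings with the product of ham ratings."""
    spam_score, ham_score = 1.0, 1.0
    for tok in set(mail):
        if tok not in ratings:
            continue  # unseen tokens carry no evidence either way
        # Assumption: clamp ratings into [floor, 1 - floor].
        s = min(max(ratings[tok], floor), 1.0 - floor)
        spam_score *= s
        ham_score *= 1.0 - s  # assumption: ham rating is 1 - S(T)
    return "spam" if spam_score > ham_score else "ham"
```

A mail with no known tokens leaves both products at 1 and is labelled ham here, which is a conservative design choice rather than a requirement of the method.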
Fig. 10. Accuracy Study

Figure 10 compares the accuracy of SVM, KNN, Random Forest and the proposed hybrid model for email spam prediction. The proposed architecture achieves up to 90 percent accuracy for email spam prediction.

Fig. 11. Accuracy Study

Figure 11 compares the precision and recall values of the proposed model with those of SVM, KNN and Random Forest. The proposed model is the combination of Naïve Bayes, SVM and Random Forest for email spam prediction.

REFERENCES
[1] Priti Sharma, Uma Bhardwaj, "Machine Learning based Spam E-Mail Detection", 2018, International Journal of Intelligent Engineering and Systems, Vol. 11, No. 3.
[2] M. Deepika, Shilpa Rani, "Performance of Machine Learning Techniques for Email Spam Filtering", 2017, IJRTER.
[3] Esha Bansal, Pradeep Kumar Bhatia, "A Survey of Various Machine Learning Algorithms on Email Spamming", 2017, International Journal of Advances in Electronics and Computer Science.
[4] Dr. Swapna Borde, Utkarsh M. Agrawal, Viraj S. Bilay, Nilesh M. Dogra, "Supervised Machine Learning techniques for Spam Email Detection", 2017, IJSART, Volume 3 Issue 3.
[5] Deepika Mallampati, Nagaratna P. Hegde, "A Machine Learning Based Email Spam Classification Framework Model: Related Challenges and Issues", 2020, International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume-9 Issue-4.
[6] Harjot Kaur, Er. Prince Verma, "International Journal of Engineering Sciences Research Technology", 2017, IJESRT.
[7] A. Lakshmanarao, K. Chandra Sekhar, Y. Swath, "An Efficient Spam Classification System Using Ensemble Machine Learning Algorithm", 2018, Journal of Applied Science and Computations, Volume 5, Issue 9.
[8] M. K. Chae, Abeer Alsadoon, P.W.C. Prasad, Sasikumaran Sreedharan, "Spam filtering email classification (SFECM) using gain and graph mining algorithm", 2017, 2nd International Conference on Anti-Cyber Crimes (ICACC).
[9] P. Rajendran, M. Janaki, S. M. Hemalatha, B. Durkananthini, "Adaptive privacy policy prediction for email spam filtering", 2016, World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave).
[10] Simranjit Kaur Tuteja, Nagaraju Bogiri, "Email Spam filtering using BPNN classification algorithm", 2016, International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT).
[11] Yan Zhang, Peng Fei Liu, Jing Tao Yao, "Three-way Email Spam Filtering with Game-theoretic Rough Sets", 2019, International Conference on Computing, Networking and Communications (ICNC).
[12] Pingchuan Liu, Teng-Sheng Moh, "Content Based Spam E-mail Filtering", 2016, International Conference on Collaboration Technologies and Systems (CTS).
[13] Reshma Varghese, K.A. Dhanya, "Efficient Feature Set for Spam Email Filtering", 2017, IEEE 7th International Advance Computing Conference (IACC).
[14] J. Vijaya Chandra, Narasimham Challa, Sai Kiran Pasupuleti, "A practical approach to E-mail spam filters to protect data from advanced persistent threat", 2016, International Conference on Circuit, Power and Computing Technologies (ICCPCT).
[15] Abdulhamit Subasi, Sara Alzahrani, Afnan Aljuhani, Maha Aljedani, "Comparison of Decision Tree Algorithms for Spam E-mail Filtering", 2018, 1st International Conference on Computer Applications Information Security (ICCAIS).
[16] Ersin Enes Eryılmaz, Durmuş Özkan Şahin, Erdal Kılıç, "Filtering Turkish Spam Using LSTM From Deep Learning Techniques", 2020, 8th International Symposium on Digital Forensics and Security (ISDFS).
[17] Shafiya Afzal Sheikh, M. Tariq Banday, "Improving Efficiency of E-mail Classification Through On-Demand Spam Filtering", 2020, 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO).
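The hybrid model compared in Figures 10 and 11 combines several base classifiers by voting and is evaluated with accuracy, precision and recall. The following Python sketch shows one way hard (majority) voting and these three measures could be computed. It is an illustration under stated assumptions, not the implementation used in this work: the three rule-based functions are hypothetical stand-ins for trained Naïve Bayes, SVM and Random Forest models, and whether the proposed model uses hard or soft voting is not specified here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers for one email."""
    return Counter(predictions).most_common(1)[0][0]


def voting_predict(classifiers, emails):
    """Hard voting: every classifier is a callable mapping an email to a label."""
    return [majority_vote([clf(email) for clf in classifiers]) for email in emails]


def evaluate(y_true, y_pred, positive="spam"):
    """Accuracy, precision and recall with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical stand-ins for trained base models (illustration only).
def by_keyword(email):
    return "spam" if "buy now" in email.lower() else "ham"


def by_link(email):
    return "spam" if "http://" in email else "ham"


def by_shouting(email):
    return "spam" if email.isupper() else "ham"
```

With an odd number of base classifiers and two classes, a majority always exists, so no tie-breaking rule is needed in this setting.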