
Received November 1, 2019, accepted November 16, 2019, date of publication November 20, 2019, date of current version December 4, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2954791

A Comprehensive Survey for Intelligent Spam Email Detection
ASIF KARIM, SAMI AZAM, BHARANIDHARAN SHANMUGAM, KRISHNAN KANNOORPATTI, AND MAMOUN ALAZAB
College of Engineering, IT and Environment, Charles Darwin University, Casuarina, NT 0810, Australia
Corresponding author: Asif Karim ([email protected])

ABSTRACT The rapidly growing problem of phishing e-mail, also known as spam, including spear phishing and spam-borne malware, has created a need for reliable, intelligent anti-spam e-mail filters. This paper presents a focused literature survey of Artificial Intelligence (AI) and Machine Learning (ML) methods for intelligent spam email detection, which we believe can help in developing appropriate countermeasures. We consider four parts of the email's structure that can be used for intelligent analysis: (A) Headers, which provide routing information and contain mail transfer agent (MTA) records giving the email and IP address of each sender and recipient, where the email originated, what stopovers it made, and its final destination. (B) The SMTP Envelope, containing the mail exchangers' identification and the originating source and destination domains/users. (C) The first part of SMTP Data, containing information such as from, to, date, and subject, which appears in most email clients. (D) The second part of SMTP Data, containing the email body, including text content and attachments. Based on the relevance of each emerging intelligent method, papers representing each method were identified, read, and summarized. Insightful findings, challenges and research problems are disclosed in this paper. This comprehensive survey paves the way for future research endeavors addressing theoretical and empirical aspects of intelligent spam email detection.

INDEX TERMS Machine learning, phishing attack, spear phishing, spam detection, spam email, spam
filtering.
I. INTRODUCTION
Email spamming refers to the act of distributing unsolicited messages, optionally sent in bulk, using email, whereas emails of the opposite nature are known as ham, or useful emails [1]. The word “spam” came into existence from “Shoulder Pork HAM”, a canned precooked meat marketed in 1937, and with the passage of time digital junk mail has taken over the word [2].

Spam emails are propagated by spammers for purposes ranging from simple marketing to more malicious activities such as financial disruption and reputational damage, on both personal and institutional fronts. The practice of spamming is now spreading rapidly to other digital communication channels as well.

Financial motivation is one of the primary drivers for spammers, and it has been estimated that spammers earn around USD 3.5 million from spam every year [3].

A. RELEVANT SPAM EMAIL STATISTICS
In the following subsections, we highlight some current worldwide statistical observations. Some country-specific metrics are also discussed.

The statistics relating to the adoption of email as a means of communication are quite staggering. As of 2017, there were nearly 5.5 billion email accounts actively in use [4], and this number is projected to grow to over 5.5 billion in 2019 [5]; nearly one third of the population is estimated to use email by the dawn of 2019 [5]. As of 2018, approximately 236 billion emails are exchanged daily [6], of which around 53.5% are spam [4]. In fact, 2018 saw an average of 14.5 billion spam emails daily [3]. The FBI recently reported a loss of USD 12.5 billion to business email consumers in 2018 incurred through spam emails [7]. The financial loss incurred by businesses due to spamming attacks may skyrocket within a few years, reaching an accumulated figure of around USD 257 billion from 2012 to mid-2020 [3]. The estimated yearly damage will be around USD 20.5 billion [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Yi Zhang.

VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

A. Karim et al.: Comprehensive Survey for Intelligent Spam Email Detection

TABLE 1. Financial loss incurred in Australian markets due to digital scams.

The United States has traditionally been the largest source of spam; however, in recent times this is no longer the case. Though legislation such as CAN-SPAM (Controlling the Assault of Non-Solicited Pornography and Marketing Act) was enacted to protect users, it did not have the expected deterrent effect on spammers [8]. The USA houses 70% of the world's top spam gangs, responsible for coordinated worldwide spamming [3].

Scamwatch reports [9] portray grim figures for the financial losses of Australian consumers due to varieties of scam types, primarily carried out through phones and emails, over the last three years, as shown in Table 1.

As Table 1 shows, the trend for digital theft is heading upwards each year, and email spam will only rise due to the increasing adoption of this medium, as indicated by the statistics above. Investment scams offer fraudulent but promising business opportunities in exchange for significant amounts of money, while dating scams victimize individuals looking for romantic partners in digital spaces. When it comes to delivering malware to propagate such scams, email is still the primary choice of scammers. Recent reports indicate that Australian businesses and consumers had already lost nearly AUD 56,000 to email fraud within the first two and a half months of 2019 [10].

As of April 2019, Brazil and Russia have overtaken the USA and China (another substantial spam-originating country), producing approximately 16% and 14% of the total volume of worldwide spam [11].

B. RESEARCH MOTIVATION
The motivation behind this research initiative is to address a gap that has arisen over time in the field of spam email detection. Current solutions mostly lag behind the innovations that spammers constantly bring in, which strongly justifies the emergence of machine learning based anti-spam propositions. This review critically evaluates a number of such reasonably recent solutions and provides insights into ways in which further improvement can be obtained. The paper also discusses a number of existing non-machine-learning based frameworks to highlight the loopholes and the current state of affairs; this also indicates why automated machine learning procedures should be the approach of newly developed systems.

There are some other good review papers on the topic that discuss anti-spam frameworks in general. However, as the field is expanding fast with many novel and automated ideas for spam email detection, we deemed it necessary to produce a comprehensive review that analyzes state-of-the-art developments, since no other contemporary paper focuses strictly on current and recent trends and solutions geared specifically towards phishing spam email using machine learning algorithms. This paper also provides detailed analytical insights based on the reviews. The insights clearly identify multiple gaps that can be addressed using machine learning principles, showing the general future direction of research in this domain.

C. SCOPE OF SPAM EMAILS ANALYSED
Dissecting and critically analyzing scholarly research work on all types of spam email is itself a mammoth task and often impossible in a single survey. Bearing that in mind, this paper primarily focuses on intelligent and automated solutions devised against malicious spam emails, particularly the following:
1) Containing malicious links
2) Containing malicious attachment
3) Phishing attempts
4) Phishing and Spoofing campaigns
This survey excludes studies that address only marketing email spam.

D. RESEARCH METHODOLOGY
The papers for evaluation have been selected based on the objective of this research. We looked into several papers selected using the listed index terms and thoroughly analyzed, for each presented method, whether it effectively used machine learning principles; how robust and impactful the proposed solution really is; and finally the degree of modification required to address the drawback(s) the solution may exhibit. Only works showing significant impact and intelligent automation have been selected, as those were deemed promising for further research. The other two sections, dealing with a small number of static and bio-inspired techniques, have mainly been added to highlight the current state of email spam frameworks and the diversity of research directions.

E. STRUCTURE OF THE PAPER
This survey paper has been structured so that the necessary background for the studies analysed is addressed first. Section II details the parts of an email and how spammers take advantage of these parts to craft varieties of spam attacks on users, that is, the types of email spam. Though this review intends to evaluate machine learning based solutions aimed primarily at phishing and spoofing attacks, Section III discusses a number of general-purpose non-AI-based spam detection systems and frameworks that do not rely on Machine Learning principles.

It is important to look into such propositions to better understand where we currently stand against spam emails and the necessity of bringing automated intelligence into the existing tools and emerging processes. The section after that (Section IV) covers several bio-inspired and machine learning based approaches for spam classification and detection. Section V presents a detailed discussion of the insights gained from the critical review in Section IV. Section VI highlights the future direction of the research, followed by the conclusion.

II. ANATOMY OF EMAIL AND TYPES OF SPAM ATTACK
The nuisance of spamming will inevitably find its way into almost all sorts of digital communication media in use in the present era. Among these, email has always been one of the most exploited arenas for fraudsters. This section depicts the detailed structure of the email itself and the various attack techniques adopted by scammers. Email data parts are composed of several different blocks, as illustrated in Fig. 1 [24], while Table 2 summarizes the blocks.

FIGURE 1. Email data parts.

TABLE 2. Email data parts explanation.

A. DISCUSSION ON TYPES OF SPAM ATTACK
A number of attacks are constantly launched on users worldwide, such as email spoofing [12], [13], phishing [14], [15], and variants of phishing such as spear phishing, clone phishing, whaling, and covert redirect. Besides spoofing and phishing attacks, variations such as clickjacking have also emerged within spam emails. Hackers have even gone as far as hiding text behind images to defeat anti-spam programs [16]. Chung-Man [17] demonstrated that, in particular, phishing alerts may lead to a considerable negative return on the stocks or market value of global firms. Others have indicated the destructive effects on companies whose email messages had been marked as spam by anti-spam systems but were, in fact, not spam, an expensive instance of False Positive detection [18]. Several common attacks are discussed below.

1) EMAIL PHISHING
Email Phishing is one of the most common ways of carrying out spam attacks on senders, and is achieved through manipulating data part C (the ‘From’ field) or B. The aim is to fashion the message so that it appears to have been sent from someone or somewhere, often known to the user, other than the actual source [13]. Spammers also tamper with the domain that is passed in the HELO statement, so that the mail seems to have originated from some known domain [7]. This indicates that spoofing may occur in data part B (SMTP Envelope) as well. A Malaysian oil distribution company suffered a substantial financial loss of over USD 1 million in 2017 due to email spoofing [19].

2) SPEAR PHISHING
In practical terms, “Spear Phishing” is a member of the general email phishing family, in that it deceives with legitimate-looking messages [20]. A phishing scam may optionally provide a link to a bogus website where the end-user is required to enter sensitive financial and personal information [14], [15]. It operates on the body of the email, that is, data part D. Spear phishing of this type can also contain attachments that carry malware [21]. It is also possible, using social engineering tricks, to craft the message without malicious links or attachments in such a way that the user will be forced to take certain steps based on the content of the email, which will ultimately benefit the scammers.

The spam in the case of Spear Phishing is crafted using personal information about the user, often gathered through social engineering methods [14], and sent from what appears to be a trusted source. These types of attacks are often harder to detect with traditional filters due to the sophisticated personalization of the look as well as the content. Spear phishing can be used to generate a form of serious attack known as “Advanced Persistent Threats” (APT), such as GhostNET
and Stuxnet [22]. In 2018 alone, sensitive financial information of employees of the ABC Bus Company (USA) was compromised due to phishing email scams [23].

3) WHALING
Whaling is a variation of phishing in which the attack is often directed towards high-level officials of a company. In the case of whaling, or ‘CEO Fraud attack’, the impersonating web page or email adopts a more serious, executive-level form. Since it also works in data part D, the content is crafted to target an upper manager or senior executive with high-level clearance inside the organization, and it almost always concerns either an urgent executive issue affecting the whole company or a customer complaint. The emails are sourced from a fake origin, disguised as a legitimate business establishment (the same or another company) or even the CEO of the host company itself [24]. The risks and dangers are similar to those of the other forms of phishing and spoofing.

Europe's principal electrical cable and wire manufacturer, Leoni AG, lost a massive €40 million due to a sophisticated corporate email scam using a combination of spear phishing and whaling attacks [25]. In a recent whaling attack in 2018, the French cinema chain ‘Pathé’ lost over USD 21 million [26].

III. NON-AI BASED CURRENT ANTI-SPAM SYSTEMS
The following are a number of common anti-spam frameworks, most of which are available on different platforms, such as standalone software programs or online solutions. These do not adopt any AI-based approaches.

A. SERVER AUTHORIZATION/AUTHENTICATION SCHEMES
The following are some of the notable Server Authorization/Authentication Schemes.

1) DOMAINKEYS IDENTIFIED MAIL (DKIM)
DKIM is one of the most complicated frameworks in circulation these days [34]. The entire process is implemented through public key encryption. However, due to the reasonably low adoption of this rather formidable framework by the ESPs [35], an email with a nil DKIM field cannot be marked as confirmed spam. DKIM is also susceptible to spoofing [36]. DKIM operates in both parts C and D [Fig. 1]. The ‘Public Key Encryption (PKE)’ method is considered one of the most concrete encryption techniques designed to date. Generally, two keys are involved in the process [37]. The ‘Public Key’, one of the two keys allocated to each party, is published in an open directory where anyone can easily search for it, for example by email address. Then there is the ‘Private Key’, a secret key maintained by each party.

Several steps are involved before a successful encryption and decryption cycle is completed using PKE [37].
• Find P and Q, two large (e.g., 1024-bit) prime numbers.
• Choose E such that E > 1, E < PQ, and E and (P-1)(Q-1) are relatively prime, meaning they have no prime factors in common. E does not have to be prime, but it must be odd. (P-1)(Q-1) cannot be prime because it is an even number.
• Compute D such that DE = 1 (mod (P-1)(Q-1)).
• The encryption function is C = (T^E) mod PQ, where C is the ciphertext (a positive integer) and T is the plaintext (a positive integer). The message being encrypted, T, must be less than the modulus, PQ.
• The decryption function is T = (C^D) mod PQ, where C is the ciphertext (a positive integer) and T is the plaintext (a positive integer).
The public key is the pair (PQ, E), while D is the private key. A major advantage of this cryptography is that one can publish one's public key freely, because there are no known easy methods of calculating D, P, or Q given only the public key (PQ, E). Besides, popular Email Service Providers (ESP) like Gmail now provide an End-to-End encryption facility through S/MIME (Secure/Multipurpose Internet Mail Extensions), which itself is based on Public Key Encryption. However, this type of cryptography is often slower than other available methods.

2) SENDER POLICY FRAMEWORK (SPF)
Sender Policy Framework (SPF) has nowadays become one of the critical email authentication mechanisms, often used along with DKIM. However, the technology is itself a standalone framework: an email validation protocol architected to detect and block email spoofing by allowing receiving mail exchangers to verify that incoming mail from a domain has indeed arrived from an IP address authorized by that domain's administrator [38]. SPF basically prevents scammers from distributing emails on someone else's behalf.

The server receiving the incoming message looks up the SPF record of the sending server along with the message. The SPF record holds a list of allowable IPs from which email messages of that specific sender (or user) are allowed to originate. So, if the list does not contain the IP address of the server that sent the message to the receiving server, the receiving server will not allow the message to pass through [38]. SPF works in both parts A and B.

Even though not all mail servers have implemented SPF as of now, adoption of the technology is rapidly gaining pace.

B. PROPOSITIONS BASED ON ARCHITECTURAL MODIFICATION
The simple and unassuming design of SMTP has long been held responsible for a range of spam attacks. Bandav et al. [39] state that hackers even spoof the ‘Date’ field in the SMTP header to keep their spam emails at the top of the receiver's Inbox, so that immediate attention can be gained. The authors have also suggested that ESPs should employ a dedicated ‘Time Stamping Server’ to authenticate the sending date of every email. A number of other researchers have proposed alterations to the blueprint of SMTP (for instance, modifying some of the SMTP transactional steps) to make it more
secure and robust as the preferred choice of mail transfer protocol. However, such steps are not always practicable, for multiple reasons discussed later. This section highlights a few such commendable research undertakings and discusses the hindering bottlenecks.

A pre-acceptance test of emails has been discussed by Esquivel et al. [40], which works by analysing the features of individual SMTP transactions, such as EHLO/HELO message sequences. These can be further divided into different categories based on the working mechanism, such as ‘Protocol Defects’. Protocol Defect can detect any extra suspicious data blocks in the input buffer before the EHLO/HELO message transaction takes place.

The work of Bajaj et al. [41] suggests that filters in spam-blocking network servers should make use of the detection of suspicious behaviour patterns of VoIP spam callers, which can be built into the signalling protocols used in VoIP, such as SIP (Session Initiation Protocol). SIP is an Application Layer protocol heavily used to create, modify, and terminate multimedia sessions (streaming videos, online games, instant messaging, etc.) over the Internet Protocol [42]. SIP can also use Message Digest (MD5) authentication for security purposes. To detect suspicious behavior, SIP can inherently apply automated frameworks that analyze a message to determine whether it is syntactically wrong, has no apparent meaning, is hard to interpret, or may lead to a deadlock [43].

The studies discussed above are fine examples of impressive steps towards fortifying the SMTP framework at the root level. On the other hand, the very popularity and early adoption of SMTP as the de facto protocol of choice for email communication has established some strict deterrents. For instance, a slight modification to the protocol may introduce a wave of changes to other intertwined enabling services needed for successful mail delivery, in regard to both efficiency and usefulness [44]. Thus such structural modifications, required at the core infrastructure of email communication, will surely introduce operational complexities. This constraint of SMTP has long been an issue, and hackers and spammers have exploited the drawbacks from a very early stage [44].

C. COLLABORATIVE MODELS
Under collaborative spam filtering strategies, each message is delivered to a number of recipients. A specific message will most certainly be received and judged by another user. Collaborative models capture, record, and query these early judgments. Over time these collections become significant enough to stamp a verdict on a certain email. A number of techniques are in circulation to achieve the various steps of a successful collaborative framework [46].

1) CRYPTOGRAPHIC HASHING
This was a highly successful method in the earlier days of email communication, where large email vendors (Hotmail, Yahoo!, AOL, etc.) mathematically calculated an alphanumeric string of 32 to 128 characters, known as the signature of the email (the hash value), and stored it in a database [47]. The vendors work on the idea that spammers send out a burst of spam emails to achieve their target and that some of these spam emails will reach their honeypot accounts, that is, accounts set up specifically to catch spam emails. These vendors also rely on the fact that the generated signature will be largely different for spam than for non-spam emails [48]. Therefore, as soon as a signature matches the spam pattern, it is added to a database of spam signatures. As other emails arrive at any of the other customer accounts, their signatures are calculated using the exact same method from the header and body (thus the method works in both parts C and D [Fig. 1]) and matched against those stored in the database; an email is instantly discarded if a matching record is found. Vendors supply the database to other Email Service Providers (ESP), and thus once an email is identified as spam, it is updated in several of these databases positioned throughout the globe.

Message Digest 5 (MD5) was one of the popular choices for cryptographic hashing. It is a cryptographic algorithm that accepts an input of any length and generates a message digest that is 128 bits long, often known as the “fingerprint” or “hash” of the input. MD5 is quite useful when a potentially long message needs to be processed and/or compared quickly.

The problem with this technique is that spammers have repeatedly succeeded in devising tools that can actually break the hashing algorithms. Further, database updates have been a lingering bottleneck for quite some time: the database is updated automatically but with a delay, and that window is enough for a lot of fraudsters [50]. Furthermore, if the database itself is hacked, then it is curtains for the ESPs. Thus the technique has seen its accuracy erode over the years [51]. SHA-3 is a recent development intended to replace MD5 and its close variations altogether.

2) FUZZY HASHING
As illustrated by Chen et al. [54], researchers have used hashing principles (such as Fuzzy Hashing) to detect spam campaigns by clustering emails on the basis of similar goals.

Fuzzy Hashing can effectively be used to measure the resemblance of two sequences of characters by calculating similarity scores on the spam messages. Fuzzy hashing relies on both a ‘Traditional Hash’ function (for instance, MD5) and a ‘Rolling Hash’ function. The Rolling Hash value of a string M = m_0 ... m_{n-1} can be obtained from (1), where k is the modulus and 0 ≤ a < k is the base. Both k and a are usually prime numbers and have no prime factors in common.

\[ h(M) = \left( \sum_{i=0}^{n-1} a^{\,n-1-i} \, m_i \right) \bmod k \qquad (1) \]
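As an illustration of the rolling hash in (1), the following is a minimal sketch rather than the implementation used in [54]. It assumes the sum runs over all n characters m_0 ... m_{n-1}, treats each character as a byte value, and picks the base a = 257 and the modulus k = 1,000,000,007 (both prime) purely for demonstration. The function names `polynomial_hash` and `rolling_hashes` are hypothetical.

```python
from typing import Iterator


def polynomial_hash(data: bytes, a: int = 257, k: int = 1_000_000_007) -> int:
    """Evaluate h(M) = (sum of a^(n-1-i) * m_i) mod k using Horner's rule."""
    h = 0
    for byte in data:
        h = (h * a + byte) % k
    return h


def rolling_hashes(data: bytes, window: int,
                   a: int = 257, k: int = 1_000_000_007) -> Iterator[int]:
    """Yield the hash of every window-length slice of data.

    After the first window, each hash is derived from the previous one in
    constant time: drop the oldest byte, shift by the base, add the new byte.
    """
    if window <= 0 or len(data) < window:
        return
    top = pow(a, window - 1, k)  # weight of the oldest byte in the window
    h = polynomial_hash(data[:window], a, k)
    yield h
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % k  # remove the oldest byte
        h = (h * a + data[i]) % k             # append the newest byte
        yield h


if __name__ == "__main__":
    message = b"limited spam offer"
    for offset, value in enumerate(rolling_hashes(message, window=5)):
        print(offset, message[offset:offset + 5], value)
```

The constant-time update is what makes such a hash "rolling": sliding the window by one byte does not require rehashing the whole window, which matters when every position of a long message body has to be examined.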

In the work of Chen et al. [54], the Rolling hash simply Greylisting takes on the view that a legitimate sender, will
divides the input into arbitrary sized pieces. These pieces are resend the email if the initial attempt is unsuccessful, while
then hashed with the traditional hash function. The concate- the spammers will just move on to the next sender and will
nation of the hash values obtained after hashing all the not bother to check whether the email has been delivered.
pieces forms the ‘Fuzzy Hash Value’ of the given content. However, this approach can simply be bypassed by
Hash values are often considerably compact than the resending the spam email [41].
original string of characters, as they produce fixed length
output, irrespective of the length of input [55]. In this way, 5) DNS BLACKLISTING AND WHITELISTING
contents that are not exactly identical, but slightly differ in
DNS (Domain Name Server) Blacklisting is carried out in
some way, can still be grouped under the same hood.
two different flavours. First one involves maintaining a list
The research [54] takes on the view that the emails from
of mail-server IPs identified as spam originator or
same campaign will have a higher similarity score among
propagator in a centralized database [44], [48], [58]. The
each other while the scores will be far apart among emails
other way is to mark spam based on Uniform Resource
from different spam campaigns. It has also been shown that
Identifiers (URIs), usually domain names or websites; the
emails from same campaign have similar sort of URL or
blacklists then consist of such malicious URIs [59]. These
email address. A lacking of the work is that it does not
blacklists or databases of known spamming IP or domains
address concerns regarding ‘Asymmet- ric Distance
can then be given access by the administrator either for
Computation’ [56], where the cluster distance score may
free or with a price. The email server using this service will
become non-deterministic if the order of input changes [57].
execute an additional DNS query on the host that is
Due to several drawbacks of MD5, in particular a slight
sending the message to determine the source status; in this
change in input dramatically alters the corresponding hash
case the queried DNS server will be the one provided by
value, which is not always desired, as often the message
the DNS Blacklisting service. However, all such blacklists
content of multiple spam email varies slightly, but those
suffer from the inability of early detection of malicious
are still considered spam and often from same campaign, the
phishing URLs on the wake of the attack because their
application of a locally-sensitive hashing algorithm, known
database update process is not fast enough [60].
as ‘Nilsimsa’ [49] has grown considerably for hashing
The issues with Blacklists are that the spammers can fre-
purposes, it generates a score from 0 (dissimilar objects) to
quently alter source address [58]. Also the source address
128 (very similar objects or identical). Nilsimsa uses a 5-
itself can be spoofed as mentioned earlier. In case of
byte fixed- size sliding window that analyses the input on a
blacklists composed of URIs, spammers continuously set
byte-by- byte before generating trigrams (group of three
up cheap new domains before starting a fresh cycle of mass
consecutive characters) of probable combinations of the
spamming, leaving the blacklists very little time to react
input characters. The trigrams map into a 256-bit array to
instantly. Addi- tionally, the list is often quite slow to be
produce the desired hash [49]. Another supervised k-NN
updated and thus rather ineffectual against phishing email
based close variation of Nilsimsa, known as TLSH (Trend Locality Sensitive Hashing), has garnered much attention these days [205]. It provides a similarity score between 0 and 1000+, where any score <= 100 identifies the two entities as similar to each other, mostly originating from the same source [205]. Projects such as SSDEEP, a Context Triggered Piecewise Hashing program, are also an important addition to this field [205].

3) DISTRIBUTED CHECKSUM CLEARINGHOUSE (DCC)
Distributed Checksum Clearinghouse (DCC), another hash sharing framework against spam emails, works by counting how many times a specific message has been reported as spam. It takes a checksum of the message body and stores it in a clearinghouse or server [52]. Thus, with every additional report to the clearinghouse of a message being spam, the checksum count increases by 1. Bulk mail can be identified confidently in this way because its response number and checksum count are usually much higher. The checksums are fuzzy in nature and oftentimes multiple DCC servers participate in the checksum exchange process [53]. It operates in Parts A, B and D [Fig. 1].

4) GREYLISTING

threats that bank on user visits to short-lived phishing websites.

Whitelisting is the practice of maintaining a list of mail servers that are administered only by confirmed legitimate administrators, or of accepting content only from bona fide users. Different organizations have such whitelists of their own to make things easier for the customers. Blacklisting and Whitelisting operate in both part A and B [Fig. 1].

When it comes to Whitelisting and Blacklisting of spamming sources, the Spamhaus Project, initiated in 1998, has become a workhorse in this arena [61]. A number of ISPs and email servers use the lists to reduce the amount of spam that reaches their users. Spamhaus also provides information on certain domains and main server for intentionally providing a Spam Support Service for Profit. The Project currently has over 600 million subscribers [61].

6) SOCIAL TRUST BASED SOLUTIONS
‘Social Trust’ is a layer that is capable of providing a measure of the system’s belief that a host is distributing spam emails. Other nodes, that do not have spam identification processes installed, can actually receive the

VOLUME7,2019 37197
A.Karimetal.:ComprehensiveSurveyforIntelligentSpamEmailDetection

notion of these ‘Trust’ enabled nodes and can take appropriate action [62].
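Looking back at the report counting behind DCC (subsection 3), the idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DCC implementation: the class name `Clearinghouse`, the `bulk_threshold` parameter and the whitespace and case normalisation (a crude stand-in for DCC's fuzzy checksums) are all hypothetical.

```python
import hashlib
from collections import Counter


class Clearinghouse:
    """Toy clearinghouse: counts spam reports per message checksum."""

    def __init__(self, bulk_threshold=3):
        # bulk_threshold is an assumed cut-off, not a value from DCC itself.
        self.bulk_threshold = bulk_threshold
        self.counts = Counter()

    @staticmethod
    def checksum(body):
        # Collapse whitespace and case so trivially altered copies of the
        # same message map to one checksum (a crude "fuzzy" checksum).
        normalised = " ".join(body.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def report(self, body):
        """Record one spam report and return the updated count."""
        key = self.checksum(body)
        self.counts[key] += 1
        return self.counts[key]

    def is_bulk(self, body):
        """A message is treated as bulk once enough reports accumulate."""
        return self.counts[self.checksum(body)] >= self.bulk_threshold
```

A receiving server would call `report` whenever a message is flagged and consult `is_bulk` before delivery; real deployments exchange such counts between several servers.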


TABLE 3. Few Regex rules for spam filtering.

Sirivianos et al. [62] envisioned a framework based on social trust rooted in ‘Online Social Networks (OSN)’. This system, known as ‘SocialFilter’, aims at accumulating the experience of a number of spam detectors, which in a sense, according to the author, ‘‘democratizes’’ the mitigation of spam. It is a graph-based solution where the reports received are assessed for trustworthiness. This assessment is used to gauge a value which echoes the system’s conviction that the reported host is actually spamming. The performance is 1.5%-2% better than that of a few other comparable products. Lin et al. [63] argued that an authentic sender maintains ties with a reasonably small social circle; that is, they noted that a legitimate sender often communicates with a small number of accounts multiple times, mostly the ones that are in the user’s social circle. Spamming accounts, however, tend to communicate very few times (in most cases just once) with an account, but their reach is far wider, as they tend to spam thousands of accounts. The authors employed ‘Bloom Filters’ to develop the statistics over some time. Bloom Filters, a space-efficient probabilistic data structure, were introduced by Burton Howard Bloom in 1970 [64]. This study assumes botnets are behind rapid spamming, but the framework proposed is unable to identify exact spamming bots.

D. HEURISTIC FILTERING MODELS
These are Rule-based static filtering systems that can range from extremely efficient to downright inefficient (poor accuracy) depending upon how versatile the rules are and how frequently these are being updated [65], [66].

1) REGULAR EXPRESSION (REGEX) BASED FILTERING SYSTEMS
Rules are developed mostly using Regular Expressions as depicted in Table 3. Scores are assigned for each of the matched rules and the total value is calculated to check if it tops a pre-set threshold value, which indicates that the email is indeed spam [65], [66].
A regular expression (or regex) is a pattern that describes a certain portion of text for the purpose of mostly string matching [67]. For instance the pattern ‘\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b’ can be used to look for an email address in a set of texts. Programmers can also use this pattern to check the validity of an entered email address, regardless of programming languages. Regex is incredibly useful in finding out strings of almost any pattern.
Heuristic systems are fast and easy to install, but in case the scammers are able to get a hold of the ruleset, they can very easily craft messages to avoid the filtering system [68]. Regex based methods work on part A, C and D [Fig. 1].

E. CONTENT BASED APPROACHES
These systems primarily rely on the examination of the body or content of the email. Several well-known techniques are used for such spam filtration systems as discussed below.

1) REGULAR CONTENT FILTERING SYSTEMS
A common class of spam detectors for quite some time now has been the ‘Content based Filtering’ method and several of its variations. In these systems, a thorough analysis is done on the host message to find out patterns in message texts; these are then matched with predefined and confirmed spam patterns and a score is recorded. A decision of spam or ham is taken after comparing the cumulative score against a threshold value [69]. The typical example of content based filtering systems is the ‘Rule Based Expert Systems’. Such type of classification can be applied when the classes in consideration are static, and their components can cater for feature-wise distinguishability [70].
Fig 2 shows the approach graphically where it can be seen that some keywords have been designated as markers for spam email content (‘cheap’ and ‘mortgage’ in this case). A certain ‘Weight’ has been assigned to these words depending upon, mostly, general rate or frequency of the word appearing in confirmed spam emails. These values are then summed up to derive the ‘Cumulative Score’, which is then compared against a ‘Threshold Value’ (1.0 in this case). Once this Cumulative Score overtakes the Threshold Value, the email is marked as confirmed spam.
Despite being highly impactful, the system suffers from ‘Context Sensitivity’, meaning the actual intended message and background of the discussion may not be taken into account. The method fails to take into consideration the context of the content, thus emails having discussions or educational message on negative entities, for example ‘Viagra’, may be flagged as spam.

2) CONTEXT SENSITIVE PROPOSALS
To address contextual issues found at content based filtering approach, Laorden et al. [71], in his approach of using of semantics in spam filtering by introducing a pre-processing step of ‘Word Sense Disambiguation (WSD)’


[71], argued that WSD is an important pre-processing step which can increase the accuracy rapidly with majority of the techniques. WSD deals with solving the problem of determining the most appropriate ‘sense’ (meaning) of the word under particular context. Laorden et al. [71] have, in fact, disambiguated the terms using ‘Part of Speech’ (POS) Tagging before constructing the ‘Vector Space Model (VSM)’. However, the work does not address word collocation.

FIGURE 2. Content based filtering.

POS Tagging and VSM are two of the most used frameworks for both automated and non-automated email filtering system. POS Tagging refers to the process of classifying words into their respective parts of speech [72]. The two most common algorithms are ‘Stochastic Tagging’ (uses ‘Probability’ measure) and ‘Rules Based Tagging’ (uses contextual information to assign tags to ambiguous or unknown words). The Vector Space Model (VSM) is a popular algebraic model primarily employed for the representation of text documents and also incorporated in a number of spam detection models. The model is built in several steps, which initiates with a weight being assigned to each term found in the collection of documents [73]. Oftentimes the weight is equal to the frequency of occurrence of the term $t$ throughout the document $d$. This arrangement is called Term Frequency (tf) and denoted using $\mathrm{tf}_{t,d}$. Now as all the terms are not equally significant, in a view to apply some form of scaling down into the weights, the inverse of another term, Document Frequency (df), is introduced. df is basically the total number of documents in the collection where the term $t$ can be located [73], and thus denoted as $\mathrm{df}_t$.
If the total number of documents in the concerned corpus is identified as $N$, the inverse document frequency ($\mathrm{idf}_t$) of the term in question $t$ can be obtained using (2) [73].

$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right) \qquad (2)$$

Finally, both Term Frequency and Inverse Document Frequency are combined to cement the composite document-wise weight for each of the terms using (3) [73].

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_t}\right) \qquad (3)$$

VSM is often employed in building a vocabulary of most impactful words that may have an effective weight in differentiating the content type. One point to note is in case of long documents, the performance of VSM may often show the need for significant improvement.

3) FUZZY LOGIC BASED SYSTEMS
Fuzzy logic was introduced by Lotfi Zadeh as a mechanism to process imprecise data. A Fuzzy Controller itself is glued together by three linked segments [75], and consequently, Che et al. [74] have developed their novel algorithm on the back of Fuzzy Controller principle, having three distinct but interlinked segments [74]. The authors brought multiple angles into consideration as it used elements from social engineering practices, fuzzy control and semantic web to devise a novel algorithm to tackle phishing email.
The first part of the algorithm builds up the semantic web database which establishes the relationships between event and words (similar meaning words are grouped together). The events are specific keywords (from email content and subject, excluding prepositions) that urge the user to take some action (the aim of the phishing email). The second part is building the category database which is used to classify phishing emails. To achieve the target it first goes through an Event-Pair generation process, where, using the semantic database built in the earlier step, words are converted to related events; and two events are fused together to form a pair [74]. These pairs will then be inputted to a Fuzzy control function to determine the closest category. Finally, the last stage adds suggestions to users on categorizations of the new incoming emails based on logic that are derived primarily out of the above steps [74].
The framework puts highest emphasis on the content of the email rather than header or domain information.

F. SOURCE BASED FILTERING FRAMEWORKS
Identifying the validity of the source of email has been proven quite important to detect the class of the email in question. Following are few most common techniques.

1) IP BASED FILTERING
According to Hu et al. [76], ‘Source based Filtering’, especially using IP address, has also been popular and effective to a certain degree, as it is quite difficult for even the spammers to work around the IP address of the spam and thus if certain range of IP addresses can be identified as malicious, these emails can then be blocked from mass distribution. Further, IP addresses reveal geographic locations as well, and it is a well-known fact that countries from certain geographical boundaries are a mass source of spam, thus emails from those areas may be considered as spam with high degree of confidence, even though there might be issues as discussed earlier regarding such country based filtering.
Source based filtering has also been used to tackle Botnets. Spammers have tried to use Botnets to the maximum effect for automating high volume spam dispersion operation with speed. Wanrooij and Pras [77] and Stringhini et al. [78] pointed out the fact that Low


Volume Spammers (LVS) are relatively harder to detect than High Volume Spammers (HVS), due to usage of botnets. Examining the working mechanism of botnets, Wanrooij and Pras [77] have proposed an assumption, termed as ‘Bad Neighbourhood’; they also suggested filtering using core attributes such as IP addresses and any machine-readable hyperlinks in the email itself. The performance has shown superiority over traditional frameworks such as SpamAssassin, mainly because of lesser execution complexity and very low false positive rate [77]. However, the work needs to be tested longer. The authors have faced issues like URI (Uniform Resource Identifier, URIs and URLs are often interchangeably used) Blacklisting bottlenecks; such occurrences also require addressing.

2) DNS LOOKUP SYSTEMS
Even though such a process is not fully guaranteed to get the junk out, oftentimes it can be a strong indication for further checking. It works by looking up if a record for the domain name, from which the email claims to have originated (the part after ‘@’), does exist (the ‘‘A Record’’). If it does not, then there is reason to doubt the validity of the email, as oftentimes such spamming domains are short-lived [79]. However, it suffers from the fact that the FROM field of the header can also be spoofed as discussed before. Thus the scammer can just simply put a closely related valid domain. The method is mostly applicable to part C [Fig. 1].

G. OTHER FRAMEWORKS
There are few other propositions available which are either not so commonly implemented or in an emerging state.

1) COUNTRY BASED FILTERING
Certain email servers often entirely block email streams from certain countries as certain geographical boundaries are often a mass source of spam [76]. This technique may have a high false positive rate of detection: even though spammers in certain countries probably are more active than others, lots of benign and legitimate emails can circulate the World Wide Web as well from those nations. The method is mostly applicable to part A [Fig. 1].

2) PEER TO PEER INFRASTRUCTURE
Bradbury [80] shed light on a different approach known as ‘Bitmessaging’, based on the similar proof-of-work concept [81] that is being used in Bitcoin transactions. The framework relies on the BitMessage peer-to-peer communication protocol, and uses a completely decentralized and encrypted network. On the contrary, one usability drawback is the nature of BitMessage addresses, which are rather complicated and unintuitive long alphanumeric strings that are difficult for a normal user to deal with. The system also has some scalability concerns as it is not yet fully compatible with the existing email infrastructure. The write-up also discusses a fundamentally different email system known as Dark Mail [80].

H. COMBINING EXISTING TECHNIQUES
There are open-source and commercial products available for spam detection that decide whether an email is spam based on the filtration results obtained via multiple filters. SpamAssassin and Zerospam are two of such most used products.

1) SPAMASSASSIN
SpamAssassin is a free and open-source anti-spam product that has garnered several positive reviews over the years for its effectiveness and simplicity of installation. The product uses a number of above-discussed techniques for filtration purposes, such as DNS-based blacklists and DNS-based whitelists, Heuristic based checks, Fuzzy-checksum-based spam detection, SPF, DKIM and Bayesian filtering [82].

2) ZEROSPAM
Zerospam is a widely used commercial software that has also gained some grounds in effective spam detection [83]. It also uses a number of existing techniques such as IP address and Domain Check, Attachment and URL Scanning, Heuristic based filtering and Bayesian filtering [84].
The performance of these software products has demonstrated regular fluctuations. A common problem with a number of commercial and open-source solutions is the lack of detectability in case of some form of phishing and word obfuscation. Besides, difficulty in implementation and usability complications are oftentimes observed.
That being said, the targeted scope of this paper prohibits a detailed discussion on several other commercial and open-source software available that work at different capacities by combining several existing techniques discussed above.
Fig. 3 illustrates an overall interconnection between email data parts and different spam detection techniques discussed till now, while Table 4 tabulates a summarized view of majority of the above-discussed research works and the reported results from these studies. The table also highlights the key points and shortcomings of some of the established spam detection techniques. The overall table links these different techniques and research initiatives to the anatomy of spam [Fig. 1] to better underscore where in the very structure of an email these frameworks may belong. The colored lines in Fig. 3 link the respective method to the corresponding email data part, whereas the dotted lines signify under which category the respective techniques lie.

IV. ARTIFICIAL INTELLIGENCE BASED DETECTION AND CLASSIFICATION TECHNIQUES
The endeavour for successful detection and classification of spam emails has been going on for quite some time now. Over time a number of successful methods had been devised, but with time many of these could not face the witty changes the spammers bring into their crafts on a rather continuous basis.


FIGURE 3. The interconnection between email data parts and different spam detection techniques.

TABLE 4. A summary of some of the above-discussed spam detection techniques.

A. SYSTEMS BASED ON BIO-INSPIRED INTELLIGENCE
These are computational algorithms motivated by inherent behaviours and mechanisms often observed within the various natural living beings [85]. This is an emerging field of study, and consequently a number of different algorithms are coming up with time. The following section illustrates some of the spam detection systems based upon such evolutionary and biology based computational algorithms.

1) GENETIC ALGORITHM BASED SYSTEMS
Ruano-Ordás et al. [86] argued that application of automatically generated regular expressions (regex) can be one significantly strong method in identifying messages that have been obfuscated by the spammers. The work compactly illustrates groups of sentences from compromised emails that follow a suspicious pattern. Thus the idea can be deployed as a local content based filtering system. It can also be shared in a P2P network for a collaborative approach in combating spams. The paper takes on the view that Bio-inspired Evolutionary Algorithms, such as Genetic Programming, should be used to generate the regular expressions. Genetic programming is based upon a subset of Evolutionary Algorithms, known as Genetic Algorithm. It is a search heuristic that is based upon the ‘Theory of Natural Evolution’ [87]. The work also presented a reasonably effective software, developed taking the drawbacks and limitations of some other contemporary similar systems into consideration; the system has been termed as ‘DiscoverRegex’. A key improvement over the research of Conrad [88], claimed to have been achieved by the work, is the ‘Fitness Function’, an essential segment of any Genetic Algorithm based solution. Thus the proposed DiscoverRegex uses (4) for the Fitness Function.

$$\mathrm{fitness}(i) = \mathrm{matches}(i, \mathrm{spam}) \times \left( \frac{[\ldots]}{\mathrm{length}(i) + 1} + 1 \right) \qquad (4)$$

$\mathrm{matches}(i, \mathrm{spam})$ denotes the number of spam messages that match regular expression $i$ and $\mathrm{length}(i)$ represents the size of the generated pattern.
The results show improvement over other software packages. However, the work only described generating regular expressions from the content of the spam subject header but not the body of the spam. Thus this enhancement needs to be incorporated to make the system fully complete.

2) NSA AND PSO BASED SYSTEMS
Idris et al. [89] and Idris et al. [90] discussed propositions where other bio-inspired algorithms, such as ‘Negative Selection Algorithm (NSA)’ [91], improved and reinforced with the addition of ‘Particle Swarm Optimization (PSO)’ and ‘Differential Evolution (DE)’, have been put into action. NSA is inspired by the self-nonself discrimination behaviour commonly observed in the mammalian acquired immune system [92]. PSO has been developed based on the social foraging behaviour observed in some animals such as schooling behaviour of fish and flocking behaviour of birds [93], and Differential Evolution is a metaheuristic that attempts to gradually optimize a given problem by multiple iterative passes over a candidate solution with regard to a given measure of quality. It can work with very high dimensional dataset, without always guaranteeing an optimum solution [93]. The combined approach discussed in this work shows increased performance than a standalone NSA based system. Idris et al. [89] achieved an increase of accuracy around 7%-9% than that of standalone NSA, especially over 1000 detectors.
However, both of these studies [89], [90] do not seem to address the behaviour of the proposed model in regards to the gaps in the understanding of few issues with the Particle Swarm Optimization algorithm, such as getting trapped in local minima and Heterogeneity [94].
This problem regarding local minima may also crop up in a number of traditional nonlinear optimization algorithms. In this case the function (typically a ‘cost function’ in Machine Learning) produces a greater value at every other point in a neighbourhood around that local minimum than the local minimum itself. On the contrary, the global minimum of a function results in the minimization of the function on its entire domain, and not just on a neighbourhood of the minimum [95]. The ideal result of the function should be the global minimum, or at least quite close to it. There is always one global minimum but there can be multiple local minima.

3) OTHER RELATED SYSTEMS
An impressive number of other experiments done with biologically inspired algorithms indeed delivered some more interesting outcomes besides the above discussed ones when it comes to spam detection. Zhu and Tan [96] proposed a


‘Biological Immune System (BIS)’ based model where ‘Local Concentration (LC) based Feature Extraction’ approach has been adopted for the development of the anti-spam model. Such LC approach is thought to be able to effectively determine position-correlated information from a message by transmuting each area of a message to a corresponding LC feature. The proposition tends to divide the message content into the size of a fixed length window that goes through (slides) each chunk of the divided content. However, if the length of message content itself is shorter than the length of this sliding window, then the performance degrades. The study reported an accuracy and precision of over 96% with the size of sliding window set to 150 characters per window.

B. MACHINE LEARNING BASED SYSTEMS
Machine Learning (ML) is the engineering steps formulated in a view to make the computational instruments act without being explicitly programmed. Machine Learning can be a great boon to tackle the spam issue primarily because of its ability to evolve and tune itself with time, and counter a key bottleneck ingrained in other classes of spam detection mechanism: ‘Concept Drift’. Researchers pointed out that the contents and operating mechanism of spam emails change over time, so the techniques that work now may be rendered useless in the near future due to the change in structure and content of these spam emails; this phenomenon is called Concept Drift [97], [98]. Wang et al. [99] conducted a statistical analysis of spam emails over a period of 15 years (1998 – 2013) and demonstrated how spammers adopt changes in not only spam contents, but also in the delivery mechanism.
The following sections will discuss a number of such techniques and the results obtained once tried and tested on different spam corpora. However, before that, some Machine Learning based terminologies must be briefly discussed, such as types of algorithms and associated benefits, as majority of the models utilize these algorithms or their close variations.
Supervised Machine Learning Algorithms: Systems utilizing Supervised Machine Learning algorithms tend to learn from a set of labelled data, where the possible output for the corresponding input is already given [100]. The algorithm tends to go over this set of data (learns) and eventually builds up the ‘Idea’ or probabilistic mapping between the nature of input and most likely output (the result). Supervised Learning can be branched out into two different subtypes, Classification and Regression [100]. Supervised algorithms that, in most cases, produce outputs of categorical nature are said to be classification algorithms, for instance: Spam or Ham, whereas supervised algorithms that predict outputs of continuous numerical value are denoted as regression algorithms, for example: $1000-$5000, 50°F etc.
Unsupervised Machine Learning Algorithms: As the name suggests, unsupervised learning refers to the fact that the model will not have any labelled data to work with, and thus no training will be provided. Based on the dataset, the algorithms will try to figure out common features within a group of items and will rearrange the data points in clusters based on the commonality [101]. Alongside clustering, another type of unsupervised learning is ‘Association Rule Learning (ASL)’; it finds patterns in large datasets based on some measure of interesting properties, for example, to deduce an activity pattern of an individual. Equation (5) can be deployed as following, where P and Q belong to a set of items R [102]. ASL has also been used in recent times as an aid to develop supervised classification models [103].

$$P \Rightarrow Q, \quad \text{where } P, Q \subseteq R$$
$$\{\text{day} \in (\text{weekends}, \text{public\_holidays}),\ \text{weather} \in (\text{sunny})\} \Rightarrow \{\text{fishing}\} \qquad (5)$$

The above example can be stated in general terms as: when it is a weekend or public holiday and the weather is sunny, the individual spends time on fishing.
Semi-supervised Machine Learning Algorithms: It is an amalgamation of both supervised and unsupervised learning. Oftentimes it has been seen that in a collection of large amount of input data, only a limited volume is actually labelled; semi-supervised learning algorithms work well in such scenarios [104].
Reinforcement Learning: In a Reinforcement learning system, the agent is capable of learning the pathways on the fly using a temporal learning scheme, without supervision. Agent is the entity that decides what action ($A_t$) to take. The system works on the basis of trial and error, where depending upon the action of the agent, a positive or negative feedback ($R_{t+1}$) is provided at the next instance [105].

1) PERFORMANCE MEASUREMENT
Most often the performance of Machine Learning models is calculated using various measures such as Accuracy, Precision, Recall, F-Measure, Receiver Operator Characteristic (ROC) Plot and Receiver Operator Characteristic (ROC) Area, to name a few.
These measurements are mostly determined using True Positive (TP): when the model correctly predicts the class, for instance classifying a spam as ‘spam’; True Negative (TN): when the model correctly predicts the opposite class, for instance classifying a ham as ‘not spam’; False Positive (FP): when the model incorrectly predicts the class, for instance classifying a ham as ‘spam’; and False Negative (FN): when the model incorrectly predicts the opposite class, for instance classifying a spam as ‘ham’. Table 5 describes the key terminologies of performance measurement of a model.

2) FEATURE SELECTION AND ENGINEERING
In any Machine Learning based model, ‘Feature Selection [106], [107] and Feature Engineering’ is a really crucial task as it is used to derive new and novel features from the

existing ones to better facilitate the subsequent learning

and generalization steps if a Machine Learning based
algorithm is deployed to build a model [108]. The
performance of the built model can often drastically
improve if an intelligent


and intuitive Feature Selection and Engineering phase can be executed beforehand.
Scores of Machine Learning based systems [109]–[113] carry out from many of email headers parameters and content while designing the model to draw inference based on feature data that are not from expected spectrum.

TABLE 5. Definitions of some performance measuring terms.

3) HIGHLIGHTING SOME KEY SUPERVISED AND UNSUPERVISED ALGORITHMS
There are a number of algorithms in use in each of the categories briefly discussed above. Table 6 highlights few strengths and weaknesses of some of the primary algorithms that will be discussed in this study.

TABLE 6. A summary of some useful machine learning algorithms.

4) SUPERVISED LEARNING BASED PROPOSITIONS
This section will dissect a number of Machine Learning based research attempts which are primarily supervised by nature. These include one or more supervised algorithms in a view to develop an automated spam detection framework.

a: ARTIFICIAL NEURAL NETWORK BASED FRAMEWORKS
Artificial Neural Networks (henceforth ANN) are built using artificial neurons, modelled after neurons of biological brains. Depending upon the system, the total number of artificial neurons could be from a few dozen to many thousands. These are connected in a series of layers, divided into Input, Hidden and Output Layers.
The connection between the neurons, often called units, is represented by a number called ‘Weight’. Weights can be both positive and negative, meaning they either excite or suppress another neuron. Normally information passes from the Input Layer, through the Hidden Layer(s), to the Output Layer; this is called a ‘Feedforward’ arrangement [114]. ANNs ‘learn’, most commonly, through a process called ‘Backpropagation’. In this model, the produced output of the network is compared or matched with the one that should instead have been produced. The difference is then taken to adjust the weights between the connections, in this case starting from the output layer, to hidden layer(s) up until input layer, hence the term ‘Backpropagation’. Over many iterations, eventually the network is able to produce a sufficiently accurate and acceptable result [114].
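The forward pass and the error-driven weight adjustment described above can be sketched for the simplest possible case, a single sigmoid neuron. This is an illustrative sketch rather than any surveyed system: the function names, the learning rate, the epoch count and the toy logical-OR data are all assumptions, and with one neuron the weight update reduces to the delta rule, the one-layer special case of backpropagation.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train(samples, epochs=5000, lr=0.5, seed=0):
    """Fit one neuron by repeatedly nudging weights against the output error."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = forward(weights, bias, inputs)
            # Gradient of the squared error with respect to the weighted sum.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
            bias -= lr * delta
    return weights, bias


# Toy training set (assumed for illustration): the logical OR function.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

Calling `train(OR_SAMPLES)` returns weights and a bias for which `forward` lands on the correct side of 0.5 for each sample; a multi-layer network applies the same correction layer by layer, from the output back towards the input.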
As can be seen from Fig. 4, an artificial neuron sums up all the weights (w1-wn) from inputs (x1-xn) before an ‘Activation Function’ f, together with a bias b, decides whether the neuron should fire (process and pass the information through); and if it does, then what will be its strength. An Activation Function also normalizes the output of a neuron especially after several runs.

FIGURE 4. An artificial neuron with its activation function [30].

Nosseir et al. [115] developed an ANN based classifier to identify unacceptable and acceptable words from the message content of an email. The multi-neural network classifiers deal with words from the email body after the words have been pre-processed to remove the stop words (articles, prepositions) and noises (obfuscated words such as I∗n$u∗rènce or misspellings) along with the application of stemming process to extract the word root using Porter’s Algorithm [116]. One of the major concerns for the system is that it was tested on a small scale database of words and should further be tested on a larger setting before using a blacklist and whitelist content filter. The derived accuracy from the measurements provided has been found to be 99.87% for a five-character
evaluated based on the statistics past occurrences to mark an email spam or ham. The proposed framework achieves a recall measure of nearly 95%. To prepare the training set, words are collected from a collection of junk emails and are separated as good or bad, and the length is also taken into consideration. For example, the three neural networks consecutively work on a set of words of length three, four and five characters. Words are further put into three different groups, namely marketing purposes, commercial, financial, and pornography; a weight is assigned for each of the groups depending upon the importance to the end user. The result shows low true negative and high false positive percentages. An advantage of this technique is that the user has flexibility in deciding the kinds of email more intrusive to him or her than some other categories, and thus set the weight accordingly. This technique is reasonably effective against word obfuscation as well as simple phishing attempts. Having said that, the system is limited to ‘Bag of Words’ approaches [118] and thus easy for the spammers to adopt other evasive workarounds.
There are various ways to carry out stemming [119], of which Porter’s algorithm [116] found most success. A stemming algorithm retrieves the stem of a word. Fig. 5 illustrates the stemming logic rather intuitively.

FIGURE 5. Porter’s Stemmer.

b: DEEP LEARNING BASED FRAMEWORKS
Deep Learning is a subclass of Machine Learning that can learn in supervised and in unsupervised fashion. It employs cascading layers of processing units (nonlinear) for feature extraction and transformation. The output of each of these layers is fed into the next consecutive layer as input and these layers (often processing different levels of abstraction) construct a hierarchy of concepts [28], [120]. Algorithms such as DeepSVM [79], Convolutional Neural Networks [121] (henceforth CNN), Deep Neural Network and Deep Boltzmann Machine are few of the developments that are based upon the principle of Deep Learning.
Seth and Biswas [122] introduced Deep Learning techniques, such as CNN, to tackle spam emails based on images and spam content. To classify e-mails containing both image and text, the authors have proposed two multi-modal architectures. Each of these architectures combines both image and text classifiers, producing an output class. The first architecture works on the basis of ‘Feature Fusion’ while the other mines the rules between the two
ANN. classifiers as well as uses class probabilities. It has been
Malge and Chaware [117] deployed tokenization, stop reported that the later
words removal and stemming before feeding the result in
feature extraction algorithm. The obtained set of words is


projected higher accuracy of 98.11%, but the dataset is small at just over 1500 images, whereas CNNs require much larger and richer datasets to produce results that generalize across different instances. Moreover, the rate for Dropout, a regularization technique discussed at length in [123], has been fixed at 0.5, which may not be the optimal value for every situation, as stated in [124]. The study should have reported the effects of other dropout rates before settling on 0.5.

Shang and Zhang [125] also deployed CNNs for image classification in spam emails. However, CNNs sometimes do not perform well on real world images, partly due to noise of different sorts that distorts the image, and the paper did not discuss such issues.

Barushka and Hajek [126] have demonstrated that reducing the number of features often reduces accuracy, precision and recall, although ANN and Decision Tree showed inspiring performance with a significantly reduced feature set. The research work [126] also argues that ‘Shallow Neural Networks’ are a poor fit for handling high-dimensional data and are computationally expensive, unless other advanced techniques, such as ‘Dropout Regularization’ [127] and the ‘Rectified Linear Unit (ReLU)’, a popular Activation Function, are combined with them. Such methods can address some critical spam filtering limitations, such as optimization converging to a non-optimal local minimum, an example of a problem arising out of overfitting and high-dimensional data. Thus the authors have proposed a spam filter model that integrates high-dimensional N-Gram term frequency-inverse document frequency (tf-idf) feature selection. An N-Gram is a contiguous sequence of n terms or items from a sample of speech or text. The proposition is composed of a modified distribution-based balancing algorithm [128] and a ‘Regularized Deep Multi-Layer Perceptron Neural Network’ model with Rectified Linear Units, with a view to capturing intricate high-dimensional data features. Multilayer Perceptron (henceforth MLP) is a class of feed-forward ANN. The model does not require any dimensionality reduction and was tested on four different datasets. The results show that Deep Neural Networks can be quite promising, and the model outperforms a number of spam filters commonly in use, showing an accuracy of 98.76% on the Enron dataset. The main limitation of [126] is that the framework is considerably more computationally intensive than some of the common techniques available, and thus its performance as a spam filter on more standard computing hardware is questionable.

‘Overfitting’ can be quite a critical aspect of a Machine Learning based model: it occurs when the model fits the training data too well, to an extent that negatively impacts the performance of the model on new data [129], [217].

As seen in the above study, and as will be highlighted in the studies discussed below, ‘High Dimensionality’ of the feature space (too many attributes) is a recurring problem in a number of Machine Learning dependent models, especially those that use Bayesian techniques. With the increase of dimensionality, the complexity rises exponentially; this problematic issue is known as the ‘Curse of Dimensionality’ [130]. To circumvent the Curse of Dimensionality, many filters perform some degree of ‘Dimensionality Reduction’ before applying the anti-spam filter to classify incoming messages. Dimensionality Reduction also limits overfitting. With every added dimension d, data increases exponentially, i.e. as n^d, where n is the number of data points at the start, underscoring the growth in complexity that the Curse of Dimensionality introduces.

c: NAÏVE BAYES BASED PROPOSITIONS
Another popular supervised algorithm is Naïve Bayes (henceforth NB), developed on Bayes’ Rule. Bayes’ Rule, introduced by Thomas Bayes, attempts to derive the probability of an event with the help of some prior knowledge of conditions related to that event [131]. For instance, if someone’s sprinting speed is related to body weight, then by applying Bayes’ Rule, the body weight can be used to determine the individual’s sprinting speed more accurately than would be possible without knowledge of the body weight.

Mahdavinejad et al. [132] stated that NB classifiers require a limited number of data points for training purposes, and that they are reliable, considerably fast and efficient in dealing with high-dimensional data points. Bielza and Larrañaga [133] pointed in a similar direction, arguing that Bayesian network classifiers, using some form of the Naïve Bayes algorithm, are in general far superior to other pattern recognition classifiers in terms of algorithm efficiency and effectiveness in learning a model from a dataset. However, NB considers features to be completely independent [134], which is not the case in the majority of practical applications.

Zhou et al. [135] mentioned that a number of research works have used a ternary approach (Spam, Ham and Unsure) to determine whether a mail is spam or not, using, in most cases, an NB classifier. Their proposed modification enhances the calculation and interpretation of the required thresholds, which in earlier systems were set purely on an intuitive understanding of how to define the ternary email categories. The authors have employed ‘Decision-Theoretic Rough Set (DTRS)’ models with NB classification to regularize this computation of the threshold values. The result did show significant improvement in ‘Cost-sensitivity’ (a ‘loss function’ is regarded as the ‘cost’ of making classification decisions) in grouping emails into spam, ham and ‘suspect’ on three different datasets. Despite demonstrating a weighted accuracy of 90.05% (assuming that misclassifying a legitimate email as spam is 9 times more costly than the opposite), the proposition does not solve the issue of automatically classifying an email as spam, as the user still has to make a decision on the group of emails marked as ‘Suspect’, and this leaves room for errors of judgement on the part of the user.
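The ternary (spam / ham / suspect) decision described above can be sketched as follows. This is only a minimal illustration, not the method of [135]: the function names, the toy corpus and the two thresholds `alpha` and `beta` are hypothetical, and the thresholds are simply fixed by hand here, whereas [135] derives them from a loss function using DTRS. The classifier is a plain multinomial NB with add-one (Laplace) smoothing, which is an assumption of this sketch.

```python
import math
from collections import Counter


def train_nb(docs):
    """Count words per class. docs is a list of (tokens, label) pairs,
    with label in {"spam", "ham"}; both classes must be present."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_posterior(tokens, model):
    """Return P(spam | tokens) under the naive independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        log_p = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:  # tokens never seen in training are skipped
                log_p += math.log((word_counts[label][token] + 1) / denom)
        log_scores[label] = log_p
    # Normalise in log space to avoid underflow on long messages.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


def three_way_decision(p_spam, alpha=0.9, beta=0.3):
    """Hypothetical hand-picked thresholds: accept, reject, or defer."""
    if p_spam >= alpha:
        return "spam"
    if p_spam <= beta:
        return "ham"
    return "suspect"


# Toy corpus, invented purely for illustration.
corpus = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_nb(corpus)
for message in (["win", "money", "now"], ["win", "money"], ["meeting", "notes"]):
    p = spam_posterior(message, model)
    print(message, round(p, 3), three_way_decision(p))
```

Messages whose posterior falls between the two thresholds are deferred to the user, which is exactly where the residual risk of human misjudgement noted above arises.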


Qingsong and Ting [136] worked on the ‘Mutual Information Feature Selection’ algorithm and introduced a Word Frequency Factor and an Average Word Frequency factor to improve the application of the algorithm on both Chinese and English language corpora. Emails were classified using the NB algorithm and the result indeed showed some improvement. Nevertheless, the Chinese classification showed poorer results than the English one; the probable cause is the lack of efficient methods for Chinese word segmentation and pre-processing.

Jatana and Sharma [137] presented an improvement of the NB algorithm by introducing a fragmented and encoded database technique somewhat similar to the Radix Sorting algorithm. Traditional NB based approaches employ simple tokenization of words, that is, only extraction and storage of words as tokens in some database. The authors, instead, proposed to encode the words using ASCII values and store them in a distributed database, in sorted order, for faster processing. The words are encoded by taking the ASCII values of the letters and then finding the absolute difference of consecutive letters. For example, to tokenize the word ‘Speed’, the authors have used the following principle after changing it to lower case:

s − p = abs(115 − 112) = 3
p − e = abs(112 − 101) = 11
e − e = abs(101 − 101) = 0
e − d = abs(101 − 100) = 1

So the code for the token ‘Speed’ is 31101. The distributed database is broken into 26 different sets (0-25). The encoded tokens are stored in these databases based on the difference of the first two characters, which in this case is ‘3’. Thus the token 31101 gets stored in the dataset numbered ‘3’. Tokens are in this way stored in sorted order in the repository. Binary search is used to determine the top K tokens with the highest probabilities.

The researchers [137] claim that this made the algorithm nearly six times faster, as tested on LingSpam and SpamAssassin. Nonetheless, the work has room for improvement, for instance if hash functions are used in sorting [137].

Ranganayakulu and Chellappan [138] considered host based and lexical features, Age of domain and Page Rank to classify a URL within the email body as malicious or not. With a rather minimal feature set, Bayesian classifiers have been put to use for classification purposes. The classifier deals with a training dataset of malicious phishing URLs and legitimate URLs. The probability of each of the features occurring in the dataset is calculated and their respective scores are obtained through cumulative addition. Finally, if the cumulative score crosses the threshold value, the system determines that a malicious phishing URL is present in the email, and thus that it is spam.

The above system [138] demonstrated an FPR (False Positive Rate) of 0.4% and a TPR (True Positive Rate) of 92.8%. Though the framework is quite compact, the ‘Page Rank’ feature has been suspended by Google as of now; moreover, the logic behind calculating the domain age is not so clear. A similar observation goes for the ‘Number of Dots’ parameter.

Hayat et al. [97] introduced a framework, built on top of traditional NB, that showed improved performance on a simulated future direction over implementations that use NB classification in a straightforward way. The work compares a batch of emails to the old ones, and if the distributions seem considerably distinct, the mechanism stipulates that ‘Concept Drift’ has taken place, and updates itself based on a hybridization of the two concerned models. The resulting model displayed an improvement of 8%-9% in accuracy over multinomial NB. However, the research initiative needs to be more adaptive, in the sense that instead of judging a group of emails for the occurrence of Concept Drift, the author feels it should be checked after every single email and the system should update itself accordingly if need be; but that might hamper the usual performance [97].

Lee et al. [139] used Weighted Naïve Bayes (WNB) along with natural language features such as Parts of Speech (POS) tagging [72] to formulate a spam filter that only examines the subject header. The system transforms a subject line into a feature vector x = (x1, x2, . . . , xn), where xi is the value of feature Xi. Alongside POS, to determine the value of xi, both the Bag-of-Words [139] method and statistical features of the subject, such as total length, case and composition, have been utilized. With every new input feature vector x, denoting h and s as the ham and spam class respectively, the algorithm predicts that x is in class cl according to the current training set X, as shown in (6). In (6), p(cl|x) is the posterior probability of the modified WNB [139].

$$\hat{l} = \arg\max_{l \in \{h, s\}} p(c_l \mid x) \tag{6}$$

The framework achieved 95.74% accuracy on the Enron dataset. On the other hand, POS tagging is rather ineffective against subject headers that contain word obfuscation [72] and animated content.

d: DECISION TREE BASED PROPOSITIONS
One of the most impactful algorithms in the field of Machine Learning is the Decision Tree (henceforth DT) algorithm. DT based ‘Learning’ commonly employs an upside-down tree based progression method. DT can be used to resolve both classification and regression problems [140]. The growth of the tree from the root node starts by deciding upon a ‘Best Feature’ or ‘Best Attribute’ from the set of available attributes, and then by applying splitting.

In the majority of instances, the selection of the ‘Best Attribute’ is done through the calculation of two more measurements: ‘Entropy’, as shown in (7), and subsequently the Information Gain, shown in (8). The ‘best attribute’ is the one that imparts the most information. Entropy defines how homogeneous, or not, the dataset is, and Information Gain is the change in Entropy of an attribute, usually a reduction [140].

$$E(D) = -P(\text{positive})\log_2 P(\text{positive}) - P(\text{negative})\log_2 P(\text{negative}) \tag{7}$$
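As a rough illustration of the entropy in (7) and of the Information Gain idea it feeds into, the following sketch computes both for a list of labels. The function names and the toy labels are hypothetical, and the entropy after a split is taken as the size-weighted average over the branches, which is the usual ID3 convention and an assumption here rather than something spelled out in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the data on each distinct value of one attribute."""
    n = len(labels)
    branches = {}
    for value, label in zip(feature_values, labels):
        branches.setdefault(value, []).append(label)
    after_split = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - after_split


# Toy data, invented for illustration: does the email contain a link?
labels = ["spam", "spam", "ham", "ham"]
has_link = [1, 1, 0, 0]      # separates the two classes perfectly
is_long = [1, 0, 1, 0]       # carries no information about the class
print(entropy(labels))
print(information_gain(has_link, labels))
print(information_gain(is_long, labels))
```

A tree builder would evaluate `information_gain` for every candidate attribute and split on the one with the highest value.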


Equation (7) calculates the Entropy E of a dataset D, which holds the positive and negative ‘Decision Attributes’.

$$\mathrm{Gain}(\text{Attribute } X) = \mathrm{Entropy}(\text{Decision Attribute } Y) - \mathrm{Entropy}(X, Y) \tag{8}$$

Gain is calculated for each of the features or attributes, and the one with the highest value is selected as it provides the most information gain. The whole process is repeated for the sub-branches of the tree to eventually complete the DT.

Ouyang et al. [141] developed frameworks based on DT and another algorithm known as ‘Rulefit’ [142], in order to carry out a comprehensive empirical study into the efficacy of using packet and flow features in the detection of spam emails from a single-enterprise perspective. The flow based analysis critically examines an email using different methods, such as DNS Blacklisting, filters that work on SYN packet features, filters based on key traffic characteristics and, finally, content analysis. Each stage added to the processing adds more overhead, and thus computational complexity increases. A message is marked as spam when any of the layers confidently labels it as spam. The researchers claim that the proposed network level filtering for spam detection can greatly reduce the workload of more intensive content level filtering.

Sheikhalishahi et al. [143] envisioned a preliminary approach to a new algorithm known as ‘Categorical Clustering Tree (CCTree)’, which is based on DT. CCTree extracts a set of categorical features, such as size of email, number of embedded links, attachment information and HTML characteristics, to build a tree of clusters. The researchers argued that it takes a simpler approach to the task in hand, and thus the complexity is quite low compared to some other approaches. On the other hand, the research attempt is yet to be tested and implemented on a large-scale dataset, and it only showed some theoretical underpinning, where the low complexity and easy-to-understand representation of the chosen features have been highlighted as the key strengths of the proposed CCTree algorithm.

To address the issue of ‘Concept Drift’, Sheu et al. [144] designed a DT-based framework that works in conjunction with ‘Incremental Learning’ of spam keywords, set up in an online environment for continuous enrichment. The precision attained at the point of publication is 95.5%.

e: RANDOM FOREST BASED FRAMEWORKS
Random Forest (henceforth RF) is one of the most successful supervised classification and regression techniques based on ensemble learning. It operates by constructing an entire forest from a multitude of random and uncorrelated Decision Trees during the training segment [145]. Ensemble learning methods employ multiple learning algorithms to come up with an optimal predictive model, which can perform better than any of the individual models’ predictions [145]. RF may incur additional complexity in its calculation as it uses a lot more features than a standalone DT, but generally it does produce higher accuracy in dealing with unseen datasets. The ensemble method upon which RF is founded is known as ‘Bootstrap Aggregation’, a powerful yet simple algorithm. Bootstrap Aggregation can address overfitting arising out of the high variance of algorithms such as DT.

Tran et al. [146] indicated that limited work has been done on detecting malicious content in spam emails, in the form of malignant URLs or harmful attachments (malware). Malware coders are relentlessly developing novel and clever techniques for transforming binary code so that it cannot be detected by anti-virus scanners, and their level of sophistication is growing with time [29]. The proposed model extracts many different features in a rather time-efficient manner from email content and metadata, without using external tools. Some of the header and content features used are quite unique and, according to the authors, have not been used elsewhere. RF has been applied to measure the effectiveness of the selected features. Nevertheless, the authors have suggested that the system does not always perform well enough at detecting malicious URLs in spam emails, but is rather effective at detecting potentially hazardous attachments.

The proposed model of Shams and Mercer [147] extracted features such as words and the length of words and documents, and carried out the classification task on four different datasets (CSDMC2010, SpamAssassin, LingSpam and Enron) with multiple classification algorithms. The authors claim that classifiers generated using a meta-learning algorithm (‘Bagging’ in this case) perform better than probabilistic and tree based models. The Bagging model demonstrated an average accuracy of 94.75% across all four datasets. The algorithm is a close variation of RF. Regardless of the reasonable accuracy, the work does not address the issue of high dimensionality and the associated increase in the complexity of the proposed methods.

f: LOGISTIC REGRESSION BASED SOLUTIONS
Logistic Regression (henceforth LR) is another simple yet very useful supervised approach, applicable to a wide range of binary classification problems, for instance, predicting a binary-valued label z(i) ∈ {0, 1} for a data point. The corresponding probabilities can be calculated from (9) and (10).

$$P(z = 1 \mid x) = \frac{1}{1 + \exp(-\theta^{T} x)} \tag{9}$$

$$P(z = 0 \mid x) = 1 - P(z = 1 \mid x) \tag{10}$$

The right hand side of (9) will ‘squash’ the value of θ^T x into the range of 0 to 1 so that it can be interpreted as a probability [148]. Over the years, apart from scientific research, LR has been widely adopted in many different fields such as marketing, health care and economics, to name a few [149]. Both equations yield probabilities that lie strictly between 0 and 1.
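The two probabilities above can be evaluated with a few lines of code. The sketch below is illustrative only: the function names and the example weights are hypothetical, and the two-branch form of the sigmoid is a common numerical-stability trick assumed here, not something the text prescribes.

```python
import math


def p_positive(theta, x):
    """P(z = 1 | x) = 1 / (1 + exp(-theta^T x)), computed stably."""
    score = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For very negative scores, exp(-score) would overflow; use the
    # algebraically equivalent form exp(score) / (1 + exp(score)).
    e = math.exp(score)
    return e / (1.0 + e)


def p_negative(theta, x):
    """P(z = 0 | x) is the complement of P(z = 1 | x)."""
    return 1.0 - p_positive(theta, x)


# Hypothetical weights for two features (e.g. link count, capital ratio).
theta = [0.8, 1.5]
x = [2.0, 0.4]
print(p_positive(theta, x), p_negative(theta, x))
```

A classifier would typically label the email as spam when `p_positive` exceeds 0.5, although the cut-off can be moved to trade false positives against false negatives.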

Pawar and Patil [150] demonstrated that for a small dataset of less than a thousand data points, a regular LR model performs best (accuracy of 98%). However, as the research suggests, to keep the performance consistent as the dataset grows, another version of LR, Multiple Instance Logistic Regression (MILR), should be used, as it demonstrated a consistent accuracy within the tight range of 93.3% to 94.6% (up to 2500 data points).

g: SUPPORT VECTOR MACHINE BASED PROPOSITIONS
Support Vector Machine (henceforth SVM), a well-established supervised learning technique used for classification, was originally proposed by Alexey Chervonenkis and Vladimir N. Vapnik in 1963 [151].

A hyperplane separates the classes by analysing various features found in the dataset. The SVM can work in any number of dimensions. In an R2 (2-dimensional) workspace, a hyperplane is a line, in R3 it is a plane, and in Rn it is termed a ‘hyperplane’. The algorithm may identify several hyperplanes, but the optimum one is the one that has the maximum distance from the training data points of each of the classes. Given a specific hyperplane, the computed distance from it to the nearest data points on both sides can be used to draw the margin. There should never be any data points inside the margins. The bigger the margins, the better the model will generalize to unseen data. Support Vectors, required to calculate the margins, are the data points ‘On’ or ‘Closest’ to the margins [152]. Though SVM is a supervised technique, work has also been done to use it in unsupervised clustering [153], [154]. SVM has the unique ability to transform non-linearly separable data into new linearly separable data by a mechanism known as the ‘kernel trick’ [154].

Similar to the study of [126], Diale et al. [155] demonstrated that while using SVM for email classification, optimising the kernel type and kernel parameters is of utmost importance. The authors have indicated that varying feature extraction and feature selection techniques for SVM often bring about the need to employ different kernel functions for optimum performance. They have also concluded that increasing the number of features available for feature selection and extraction resulted in better performance, that is, there is a positive correlation. This research attempt primarily works with the words from the email body for feature engineering, but excludes other forms of features, such as header, URL and domain information. Thus the obtained results are only accurate within a limited range of circumstances.

A combined model has been envisioned by Amayri and Bouguila [156], where both textual and visual (image) information from emails are combined and simultaneously put into action in detecting spam. The framework is based on building probabilistic SVM kernels from a mixture of Langevin distributions [156]. For the textual features, certain header information, such as the FROM, REPLY-TO, Cc, Bcc and TO fields, has been consulted along with the email content and subject. For the visual part, texts in the embedded images have been extracted using OCR, and certain visual features of the images have also been included in the feature vector. For the SVM kernel, the authors have experimented with three different flavours and came to the conclusion that the Bhattacharyya Kernel (BK) works best. The framework attained an accuracy of around 92% when both images and texts from an email are considered; however, the accuracy declines for spam which is based solely on image or text.

Deep Support Vector Machines (DeepSVM) are known to perform better than CNNs and standalone SVMs due to a design improvement where, instead of a single layer, any number of layers can be used with kernel functions. Roy et al. [157] proposed an application of Deep SVMs in spam classification. The lower level SVMs carry out the task of feature extraction, while the highest level SVM performs the actual prediction using the extracted features. The authors have also compared the performance with ANN and regular SVM. The model based on Deep SVM showed the highest accuracy, 92.8%.

h: ADABOOST BASED PROPOSITIONS
Boosting basically means combining a number of simple learners (classifiers that produce an accuracy just above 50%) to formulate a highly accurate prediction. AdaBoost (Adaptive Boosting) sets different weights for both samples and classifiers [158]. This forces the classifiers to concentrate on observations that are rather difficult to classify accurately. The formula for the final classifier is shown in (11).

$$H(p) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k h_k(p)\right) \tag{11}$$

Equation (11) is a linear combination of all of the weak classifiers (simple learners), where K is the total number of weak classifiers, hk(p) is the output of weak classifier k (which can only be -1 or 1), and αk is the weight applied to classifier k. The final decision is derived by looking at the sign (+/-) of the sum in (11).

The research done by Varghese and Dhanya [159] attempted to develop a filter using Parts of Speech (POS) Tags, Bigram POS Tags, Bag-of-Words (BoW) and Bigram Bag-of-Words. It was found that the POS Tag and Bigram POS Tag features produced better output using AdaBoost as the classifier; the experimentation achieved a False Positive Rate of 0. On the other hand, [159] suffers from the same issue with POS Tagging as discussed earlier. In addition, as pointed out in [160], AdaBoost as a classifier might incur issues such as high computational cost and non-scalability. Apart from this, the work does not address any header information, leaving a loophole for a number of different types of spam emails.

i: K-NEAREST NEIGHBOUR BASED SOLUTIONS
K-Nearest Neighbour (henceforth KNN) is a widely used classification technique that boasts a commendable balance among several important criteria, such as predictive ability, intuitive interpretability and time required for calculation (for a moderately rich dataset). Algorithms such as RF do have higher predictive capability, but lag behind on a few other parameters.

Unsurprisingly, industry adoption

37222 VOLUME7,2019
A.Karimetal.:ComprehensiveSurveyforIntelligentSpamEmailDetection

of KNN is quite high. KNN can often be used to formulate Spambase dataset. However, the authors have used a 66%
regression models, but that is not very common. KNN uses split, rather than the more traditional 80%-20%, without
‘Euclidian Distance’ to determine the distance between two really explaining the rationale behind the choice.
data points (Xn and Xm) as shown in (12) [161]. DT based systems often tend to have high sensitivity to
D noise and overfitting. This issue has been highlighted in the
Dist(X , X ) =
n m
X − X )2
n m
(12) work of Wijaya and Bisri [167]. To tackle such issue, the
Li=1 i researchers have added a regular LR to the process. In this
On the hindsight, instead of using Euclidean Distance, hybrid spam detection system, data is fed into an LR
Sharma and Suryawanshi [162] have proposed ‘Spearman module before passing through the DT based segment. The
Correlation’ [163] as the distance measure for KNN based reported accuracy is 91.67%. The work does not use any
classification as shown in (13). X and Y are training and test- feature engineering methods, and simply uses all the
ing tuple respectively while n is the number of observations. available aspects as features. This simplicity gives the
The values of dij usually lies between 1 and −1. framework effortless execution, but makes the accuracy
Pn less realistic.
6 (rank(Xi) − rank(Yj))2 It has been stated by Nizamani et al. [168] that efficient
and advanced feature selection weights more than the types
of classification algorithms used when comes to identifying
i=1
dij = 1 − The authors have Tree (implemented
n(n2 − 1) employed SVM, in Java), CCM
(13) J48 Deci- sion (Cardiac Contrac-
deceitful emails.
The changes have shown some enhancements over regu- reviewed number of classification algorithms and found
lar KNN model with nearly 50% improvement in accuracy Rotation Forest to be performing better than some other
(97.44% in 80%-20% Train and Test ratio). A limitation of common algorithms with an accuracy of 94.2% on the
the study is the size of the dataset, having just over 4000 tility Modulation) and NB classifiers together with various
data points. KNN often needs a rather large dataset to carefully designed features and disseminates the idea that
produce a rather stable model with realistic accuracy. frequency based features generally achieve top accuracy,
Besides, a bit of elaboration was needed for fixing the value 96% in their study. The work only deals with the contents
of K as 3. The authors have also expects the study to be used of the fraudulent emails for feature extraction, ignoring the
in conjunction with other more robust and complete spam header, which is also an important aspect that needs to be
filtering frameworks. considered. Besides, Alsmadi and Alhami [169] argued
j: MULTI-ALGORITHM SUPERVISED SYSTEMS
A number of interesting propositions employ more than one supervised algorithm in different segments of the framework to develop the final model. This section highlights some recent solutions of this kind, most of which are hybrids of the algorithms discussed above.
In his study, Wang [164] proposed a heterogeneous ensemble approach for spam detection composed of DT, NB and Bayesian Net algorithms. Heterogeneous ensembles are composed of methodologically different learning algorithms. The study also discussed multiple procedures for algorithm selection in building the ensembles. The researchers compared the framework with homogeneous ensemble techniques and found their approach to perform better, with an accuracy of 94%.
Similar to [164], Large et al. [165] also suggested that heterogeneous ensemble-based spam filtering frameworks perform better. However, the researchers argued that instead of the simple tree based ensemble techniques used in [164], more advanced ones based on slightly more complex algorithms, such as RF, Rotation Forest, Deep Neural Network and Support Vector Machine, can actually perform better in a variety of scenarios. The claim can also be substantiated from Shuaib et al. [166] where the researchers have [...] that better false positive rate can be acquired through the deployment of N-Gram based clustering and classification than employing any other algorithms, even the one discussed in [168].
Feng et al. [170] offered a hybrid model composed of NB and SVM, attaining an accuracy of around 91.5% with a training set of 8,000 samples. The framework tries to reduce, as much as possible, the dependency among features that is commonly observed in NB based models. The study aims to extend its functionality towards image spam as one of the future improvements. However, we believe the authors [170] should also include header and domain information in their analysis of spam emails.
Islam and Abawajy [171] developed a multitier classifier where an email is checked for an accurate label in the first two tiers, and if any misclassification occurs (the first two tiers giving conflicting labels), it is then sent to the third tier. The algorithm used in each tier (SVM, AdaBoost and NB) was decided after trying the selected algorithms in different tiers. A strength of this technique is that the processing among tiers transpires in parallel, unlike some other ensemble based multitier classifiers. The model returned a high accuracy (around 96.8%) with a low rate of false positive detection.
A behavior-based mechanism has been discussed by
Hamid and Abawajy [172] to detect phishing emails using a hybrid feature selection approach. They deployed 4 different classifiers (Bayes Net, AdaBoost, Decision Table and Random Forest) to mine the sender’s behaviour, with a view to finding out whether the source is a legitimate one. Sender’s behaviour is further broken down into two subgroups: Unique Sender (US) and Unique Domain (UD). The inputs to the Sender Behaviour algorithm are the domain message-id (DMID) and the list of email senders (ES). The system showed an average accuracy of 93% with only 7 features from a rather limited set of 3000 data points. On the other hand, a similar framework [173], which achieved a slender advantage in terms of accuracy, used a much higher number of features (43).
The framework developed by More and Kulkarni [174], tested on the Enron dataset using NB and RF, demonstrated an accuracy of 96.87%. The system employs text analysis of the email body using NB, and categorizes the words into several linguistic features as well as specific spam words. If the message body is found to contain over 5% spam words, it is flagged as spam. Besides, the same set of emails is passed through a classification system built on RF that uses the Polynomial Kernel Function shown in (14), where X and Y are vectors of features derived from test or training samples, and C is a constant.
$$K(X, Y) = (X^{T} Y + C)^{2} \tag{14}$$
The obtained result has also been compared with models built with ANN and LR (following the same Kernel Function). The work could be improved by explaining in detail the issue of high dimensionality and the associated increase in the complexity of the proposed methods.
Islam and Xiang [8] developed a promising email classification technique based on a data filtering method. The work broached an innovative filtering technique using a modified ‘Instance Selection Method (ISM)’ to cut the least valuable data instances from the training model and then classify the test data. The aim of ISM, enhanced by NB, is to identify which instances (examples, patterns) in email corpora should be selected as representatives of the entire dataset, without significant loss of information. Several algorithms have been tried and the model displayed an accuracy of 96.5%. However, according to the authors, the system needs the capability to handle incoming emails in order to address Concept Drift [8].

k: SUPERVISED SYSTEMS DISCUSSING PERFORMANCE OF DIFFERENT ALGORITHMS
The systems discussed above are built around either a single supervised algorithm or multiple ones. Below is a discussion of single-algorithm frameworks where multiple algorithms have been individually tested to design the primary classifier of the system and, based on the performance, the best one has been chosen to finalize the classifier.
The binary classification model named ‘Sentinel’, proposed by Shams and Mercer [147], utilized features of Natural Language Processing before developing the classifier with multiple supervised algorithms, with a view to evaluating the performance of each of those algorithms. Among the five algorithms tested (RF, AdaBoost, Bagged Random Forest, SVM and NB), AdaBoost and Bagged Random Forest performed equally best on four different spam datasets. However, real time training and response latency have not been considered, and the performance against Concept Drift [97] is as yet unknown.
Aski et al. [176] brought forward a rule based framework where 23 meticulously chosen features have been selected from a personally compiled spam database. Each of these criteria has been scored to get a total value, which was subsequently compared to a threshold value to finally label an email as spam or ham. MLP, NB classifier and C4.5 Decision Tree classifier have been used to train the model, and the performances of the individual models have been compared; the MLP based model scored an accuracy of around 99% [176]. The MLP based model propagates information by activating input neurons that contain labelled values. The activation of neurons is calculated either in the middle or output layer using (15), where $a_i$ represents the activation level of neuron i; j denotes the neuron set of the previous layer; $W_{ij}$ is the weight of the link between neuron j and neuron i, and $O_j$ is the output of neuron j.
$$a_i = \sigma\Big(\sum_{j} W_{ij} O_j\Big) \tag{15}$$
However, the small testing dataset (750 spam and ham emails in total) somewhat limits the wide acceptance of the results obtained in this study. Thus an effective performance measure in terms of memory and time footprint for large scale datasets is yet to be determined. Also, the study does not mention how the model will perform against certain critical attacks such as spear phishing.
As illustrated in earlier works, the ‘Bayesian Probability Theorem’ has been the choice for handling uncertainty in datasets. However, the work of Zhang et al. [177] argued that the ‘Dempster-Shafer (D-S) theory of evidence’ [178] is better equipped than Bayesian probability when using statistical classification. Uncertainty can arise in a number of ways in the analysis of spam corpuses, such as assigning missing values to features. In D-S theory, given a domain α, a probability mass is assigned separately to each subset of α, whereas in classical probability theory, this probability mass is assigned to each individual element. Such an assignment is called a Basic Probability Assignment or BPA [179]. The researchers have selected the 5 most representative header features of the spam corpus after appropriate quantification. Their D-S integrated classification model found ANN to be one of the most effective classification algorithms along with NB.
Ergin and Isik [180] highlighted the fact that spam is not only a problem in English language emails; non-English speakers also have to deal with the issue. The work in question demonstrated a Turkish spam filtering system developed with the aid of DT and ANN as classifiers, while the ‘Mutual Information (MI)’ method has been deployed for feature selection. ANN attained an accuracy of 91.08%. Though the study states the superiority of Mutual Information (MI) over the more widely applied technique, ‘Information Gain (IG)’, the extensive study by George [181] found otherwise, concluding that the performance of Mutual Information is not up to scratch, mainly due to its sensitivity to probability estimation error.
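The contrast drawn in the discussion of [177] between a Basic Probability Assignment and a classical probability distribution can be illustrated with a small sketch. The two-label frame, the mass values and the helper names below are illustrative assumptions, not values or code taken from [177]. The functions compute the standard D-S belief and plausibility of a hypothesis from masses assigned to subsets of the domain.

```python
# Illustrative sketch of a Basic Probability Assignment (BPA) in D-S theory.
# The frame, the masses and the function names are assumptions for this example.

FRAME = frozenset({"spam", "ham"})

# Mass is assigned to subsets of the frame, not only to single labels.
# The mass on the whole frame expresses "cannot tell", e.g. a missing feature value.
bpa = {
    frozenset({"spam"}): 0.6,
    frozenset({"ham"}): 0.1,
    FRAME: 0.3,
}


def belief(masses, hypothesis):
    """Sum of the masses of all subsets fully contained in the hypothesis."""
    return sum(m for subset, m in masses.items() if subset <= hypothesis)


def plausibility(masses, hypothesis):
    """Sum of the masses of all subsets that overlap the hypothesis."""
    return sum(m for subset, m in masses.items() if subset & hypothesis)


spam = frozenset({"spam"})
bel_spam = belief(bpa, spam)
pl_spam = plausibility(bpa, spam)
print(f"Bel(spam) = {bel_spam:.2f}, Pl(spam) = {pl_spam:.2f}")
```

With these assumed masses, the belief in spam is 0.6 and its plausibility is 0.9 (up to floating point rounding); the gap between the two is the mass left on the undecided set, which a classical distribution over single labels cannot represent.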


Sharaff et al. [182] conducted experimentation on a processed Enron dataset with standard DT, J48 Decision Trees, SVM and BayesNet. The study reported the effectiveness of J48 and BayesNet over SVM.
Sharma and Kaur [183] tested a spam detection framework built upon an RBF (Radial Basis Function) Network (a subclass of ANN), where neurons were separately trained to address common spamming techniques. The approach seemed to increase the performance of RBF and also outperformed SVM. The research resulted in an average accuracy of 99.83% after five consecutive runs. Nonetheless, the dataset of just 1000 words is not comprehensive at all, and the proposed feature extraction method is rather vague.
Saab et al. [184] also measured the performance of SVM, Local Mixture SVM, DT and ANN on the spambase dataset. Taking into account the full 57 available features, SVM demonstrated the highest precision (93.42%), while ANN showed the highest accuracy (94.02%). However, this high accuracy was achieved in exchange for the longest training time.
The presence of malicious URLs is a key characteristic of phishing and spam emails, and Vanhoenshoven et al. [185] tested the effectiveness of RF in detecting such URLs within spam emails using a publicly available database. The authors came to the conclusion that, with an accuracy of 97.69%, RF actually performed better than a few other classification techniques such as MLP, C4.5 Decision Tree, SVM and NB. Features were ranked with the ‘Pearson Correlation Coefficient’ [186] for selection. Qaroush et al. [187] also justified the superiority of RF (reported accuracy of 99.27%) by comparing its performance against several other classification methods while building the classifier using various important email header features.
A study based on a semantic method has been introduced by Bhagat et al. [188] using the WordNet ontology [189] as well as some ‘Similarity Measures’ to reduce the high number of extracted features. ‘Path Length Measure’ has been chosen as the most suitable algorithm for determining the similarity measures. Path Length Measure derives the semantic similarity of a pair of concepts. The calculation starts by counting the number of nodes along the shortest path between the concepts, which can be found in the ‘is-a’ hierarchies of WordNet. In general, the path similarity score is inversely proportional to the number of nodes along the shortest path between the two words. Equation (16) summarizes the nature of the derived score, where $w_1$ and $w_2$ are the two terms.
$$\mathrm{PATH}(w_1, w_2) = \frac{1}{\mathrm{length}(w_1, w_2)} \tag{16}$$
The resulting reduction of the high number of features also reduces space and time complexity. tf-idf has been used for feature updates, while feature selection is done with Principal Component Analysis (PCA). Multiple supervised algorithms have been used for classification evaluation and the system projected an average accuracy of over 90% with considerable dimensionality reduction of the feature set; LR was found to perform optimally. Nevertheless, the reduced feature set of 70 is still a bit too large and more effective testing against phishing attacks is required.
Bhagat and Moawad [190] carried out similar semantic based implementations. The resulting reduction of the feature set was around 37%, with LR showing optimal performance with an accuracy of 96% while RF performed the worst, demonstrating an accuracy of around 85%. The somewhat similar study of Bhagat et al. [191] attained a feature reduction rate of 43.5% through the stemming of the email body on the Enron dataset. Multiple classifiers have been tested; SVM and LR performed comparably better than other classifiers, with LR showing an accuracy of 97.7%.
Nonetheless, both [190] and [191] suffer from contextual ambiguity issues. Ambiguity refers to the fact that a sentence in context may indicate multiple meanings. For example, ‘‘There was not a single man at the party’’ can be interpreted as (I) absence of bachelors at the party, or (II) absence of men altogether [192]. The right conclusion can be deduced upon analysing the context within which the sentence has been used.
Besides the above studies, Almeida et al. [193] conceived a process of expanding short texts, which are often found in SMS spam but can sometimes be seen in spam emails too. The authors argued that when the original text is too short and mostly filled with abbreviations and idioms, it can be harder to apply any sort of classification algorithm to it, mostly because the feature set is also extremely limited. Feature Engineering is also difficult with this limited initial feature set. Their proposed normalization and expansion method is based on semantic dictionaries, lexicography and highly effective techniques for semantic analysis and disambiguation. The method can also generate novel attributes to feed into any classification algorithm. The statistical evaluation done on the output showed promising directions. However, the researchers concluded that more thorough testing and performance measurements are required.
Méndez et al. [194] devised a semantic-based feature selection approach. The first critical segment of the proposed method is the e-mail topic extractor and guesser, and the other one computes the topic-related significance of each feature. To guess the topic of the email, the researchers have semantically grouped terms into more generic topics; that is, each topic has a group of related terms under it, and the more terms from a certain category are found in the content, the higher the likelihood of the email belonging to that topic. The root level of topics is taken from the WordNet Lexical Database. The logic ensures each email may actually fall under multiple topics. The set of topics comprises both spam and ham groups. Finally, it is determined whether the email contains a higher number of spam topics, in which case it is declared as spam. The model has been evaluated against several common Machine Learning algorithms for benchmarking. The proposed ‘Topic Guessing’ technique showed significant improvement, especially in terms of performance. However, the authors feel that the manual specification of the root topic level needs further attention.
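The topic guessing step described for [194] can be sketched roughly as follows. The topic names, term lists and the simple majority rule are hypothetical stand-ins chosen for illustration; [194] derives its root topics from the WordNet Lexical Database and weighs features by topic-related significance, which this sketch does not reproduce.

```python
# Minimal sketch of topic guessing: count how many terms of each topic occur
# in an email, then compare matched spam topics with matched ham topics.
# TOPICS and the decision rule are illustrative assumptions.

TOPICS = {
    "finance_offer": {"label": "spam", "terms": {"loan", "credit", "winner", "prize"}},
    "pharmacy": {"label": "spam", "terms": {"pills", "pharmacy", "discount"}},
    "meeting": {"label": "ham", "terms": {"agenda", "meeting", "minutes", "schedule"}},
}


def guess_topics(text, min_hits=1):
    """Return {topic: hit_count} for topics with at least min_hits matching terms."""
    words = set(text.lower().split())
    hits = {name: len(words & info["terms"]) for name, info in TOPICS.items()}
    return {name: n for name, n in hits.items() if n >= min_hits}


def classify(text):
    """Label the email spam when it matches more spam topics than ham topics."""
    matched = guess_topics(text)
    spam_topics = sum(1 for t in matched if TOPICS[t]["label"] == "spam")
    ham_topics = sum(1 for t in matched if TOPICS[t]["label"] == "ham")
    return "spam" if spam_topics > ham_topics else "ham"


print(classify("You are a winner claim your prize and discount pills"))
print(classify("Please find the agenda for the meeting"))
```

Note that an email can match several topics at once, as in the first call, which mirrors the point that each email may fall under multiple topics. Ties default to ham here, which is a design choice of this sketch only.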


5) UNSUPERVISED LEARNING BASED PROPOSITIONS
This section analyses a number of Machine Learning based research attempts which are primarily unsupervised by nature. These use one or multiple unsupervised algorithms to develop an automated spam detection framework.

a: K-MEANS CLUSTERING BASED FRAMEWORKS
K-means Clustering is a particularly useful, simple and popular algorithm which groups similar data points together with a view to finding the underlying pattern. The algorithm produces the final output through iterative refinement. The number of groups is denoted by K, and iteratively each data point is assigned to one of these K clusters based on the identified similarities among the features [195].
Determining the optimum value for K, the total number of clusters, which needs to be supplied for the algorithm to work, can sometimes be tricky, and users often run the system multiple times with different values of K to compare the results. Several methods exist for getting a reasonably solid approximation of K [195].
In their work, Basavaraju and Prabhakar [196] proposed a system that employs text clustering based on the ‘Vector Space Model (VSM)’. The method performed reasonably well at identifying spam emails. Representation of data is done using a VSM, and data reduction has been achieved through a custom developed clustering technique using the features of the K-Means algorithm and the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, achieving an accuracy of around 76%. This study uses raw words from the documents to develop the VSM. A point of concern for the system is that when spammers use character variations, such as disguising the word insurance as I∗n$u∗rènce, it will be difficult for the framework to work correctly.
A content based approach has been put forward by Laorden et al. [197]. The proposition applies anomaly detection to spam filtering by comparing features such as ‘Word Frequency’ to those of a dataset of ham, or valid, emails. If the inspected email shows considerable deviation from a normal scale, it is considered spam. The technique utilizes an algorithm known as ‘Quality Threshold (QT)’, which falls into the category of Partitional Clustering algorithms [198] and is a close variation of K-Means Clustering, giving an edge in reducing the number of vectors in the dataset used as normality. This also lessens the processing overhead significantly. On the other hand, the system may be rendered ineffective by the usage of language features such as Synonyms, Hyponyms [199] and Metonymy. The study achieved a weighted accuracy of 92.27% on the LingSpam dataset. Basnet et al. [27] also reported a similar accuracy of 90.6% using k-means.
‘Authorship Attribution’ has in recent times become a valuable tool in resolving authorial disputes, mainly in historic documents and literature. Patterns regarding grammatical and syntactic features emerging out of such documents may lead to successful grouping and identification of original authors. Even though emails are highly unstructured, Alazab et al. [200] tried to implement the idea for spam detection, especially for phishing campaign identification. The researchers deployed an Unsupervised Automated Natural Cluster Ensemble (NUANCE) methodology to approximately cluster spam emails. The final clustering is achieved by hierarchically clustering the approximate sets, giving 27 different clusters. Though the system is impressive and achieves improvement in the general direction of ‘authorship attribution’ in spam campaign detection, the intra-dynamics within the campaign groups may go undetected.
Halder et al. [201] used clustering algorithms such as K-Means and Expectation Maximization (henceforth EM) on schemas such as stylistic features of emails, for example the total number of punctuation marks and contractions, the number of email IDs used in the body, etc. The authors have also looked into semantic features, that is, statistical measures of different words used in a batch of emails. Besides, they have also taken the combination of these two approaches into account. The cluster analysis has been carried out on a dataset of 2600 spam emails. It was found that this method can be successfully deployed to identify the writing styles of spam campaigns. Further, prototypes can be built upon the extracted patterns for future identification of spam emails. K-Means showed an 80% success rate when a combined approach is taken, while EM projected a success rate of 84.6% when dealing with only semantic features. However, its detection rate drops to 57.4% when using a combined approach. The success rates have been reported in terms of the ‘Purity’ of clusters, which basically reflects the quality of the cluster. The accuracy should also be within a similar range. The authors’ area of concentration has generally been rather narrow, as there are a number of important features in spam email detection, such as email subject headers, URL composition, detailed header and domain information, attachments etc., which have not been discussed; thus there is room for a considerable expansion of this work.
Expectation Maximization (EM) is an effective iterative algorithm that calculates the Maximum Likelihood (ML) estimate in the presence of hidden or missing data [202]. Latent variables, or unobserved variables which cannot be directly measured but rather can be inferred from the non-latent variables, are often used in EM models for gauging the best estimation.

b: SELF-ORGANIZING MAP BASED PROPOSITIONS
Self-Organizing Map (SOM) is an unsupervised technique that borrows its baseline idea from ANN. However, instead of ‘Backpropagation’, it uses a process called ‘Competitive Learning’ to produce a two-dimensional map of an input space with higher dimension [203]. In Competitive Learning, output neurons are conceptualized as being in competition to respond to input patterns. At training stage, the output unit that is able to provide the highest activation to a given
input pattern is brought closer to the input pattern, whereas the rest of the neurons are left unchanged. The process is repeated a number of times, eventually forming clusters of closely related data points [204].
Porras et al. [206] declared that several benefits can be obtained if SOM is used instead of KNN for clustering, such as removing the need to input the number of clusters as one of the parameters to the algorithm. Instead, SOM can use a threshold, a radius-based boundary, to manipulate the algorithm’s sensitivity. Further, the topological aspect of the similarity among several clusters can be observed with much ease. Multiple filtering systems work in unison for spam detection in this model. On the other hand, according to the author, SOM calculation complexity may make it less than optimum for datasets that are limited in both size and diversity. The experiment has been carried out on a dataset of 6047 email messages.
Cabrera-León et al. [207] introduced another SOM based system where they used 13 different categories for the emails. The researchers started with a 4-stage preprocessing of emails (both spam and ham). The first stage batch-extracted all the emails’ subject and content and filled whitespaces with alphanumeric characters. The second stage removed all the stop words and calculated the raw Term Frequency measure, along with some other metadata (spam/ham) added to the processing. The following stage built a 13-dimensional integer array to hold the themes and categorize the above-processed texts. The last preprocessing phase added ‘weights’ to the words of each of the 13 categories. The model was then built using SOM (with the ‘Batch’ learning method). Finally, a threshold value was used to label the clusters accordingly. The framework recorded an accuracy of 94.4%. A concern was noted by the authors about the performance of the model against newer and off-topic emails, where the accuracy did get affected, leaving more room for improvement.

c: PRINCIPAL COMPONENT ANALYSIS (PCA) BASED FRAMEWORKS
PCA is a statistical framework that works extremely well in most cases for Dimensionality Reduction, in such a fashion that the maximum variation of the dataset can be retained [208]. PCA is also a valuable tool in building Predictive Models. The system is an ‘Orthogonal Linear Transformation’ that transmutes the normalized input data to a new coordinate system [209].
Dagher and Antoun [210] deployed four different scenarios regarding feature pre-processing using PCA (Principal Component Analysis). Out of the four, the two notable ones represent ham and spam emails using the same set and different sets of features, respectively. It has been reported that PCA performs best when both classes of emails are represented using the same features, with an accuracy of 94.5%. In hindsight, depending upon the best selected features, the other three scenarios may well perform differently. Besides, there is no mention of how the features have been selected to begin with. In addition, the spam dataset is rather limited in the number of emails it holds.
A variation of standard PCA, termed ‘PCAII’, has been broached by Gomez et al. [211], where the features of both classes under analysis are combined together. A few variations of ‘Latent Dirichlet Allocation’ (LDA) have also been proposed in this work. The modified algorithm was applied to the TREC 2007 spam corpus and the output showed a balanced and stable performance regardless of dimensionality reduction.

6) SEMI-SUPERVISED LEARNING BASED SYSTEMS
Semi-supervised spam filtering systems have also demonstrated promise, even though not many attempts have been made to construct such systems yet. This section sheds some light on a few such frameworks.
Las-Casas et al. [103] proposed a technique called ‘SpaDes’ (Spammer Detection at the Source), which works by analysing SMTP metrics such as the number of distinct SMTP servers targeted, the number of observed SMTP transactions, the average geodesic distance to destination, the average transaction size (in bytes) and the average SMTP transaction inter-arrival time (IAT). These SMTP metrics are studied via a Machine Learning algorithm known as ‘Active Lazy Associative Classification’ (ALAC) [212] to build a prediction model. The associative classification method aims to amalgamate supervised classification and unsupervised association rule mining techniques in order to build a model known as an associative classifier. Though the proposition did show reasonably satisfactory performance, it has been reported that the system did not perform consistently over time, due to changes in the way spammers carry out spamming. The role and impact of Machine Learning based algorithms in the detection of spam emails will be further discussed in the following sections.
Smadi et al. [15] presented a framework, ‘Phishing Email Detection System (PEDS)’, based on both supervised and unsupervised techniques in conjunction with reinforcement learning methodology, which gives the system an increased ability to adapt itself based on the detected changes and modifications in the environment. The target of the system is Zero-Day Phishing attacks [15]. The core of the system, the ‘Feature Evaluation and Reduction (FEaR)’ algorithm, can select and rank the important features from emails dynamically based on the environmental parameters. FEaR is based on the Regression Tree (RT) algorithm, a subtype of Decision Tree. Immediately after the execution of FEaR, another novel algorithm, DENNuRL (Dynamic Evolving Neural Network using Reinforcement Learning), takes over to allow the core Three-Layer Neural Network of PEDS to evolve dynamically and build the optimum Neural Network possible. DENNuRL has the element of Reinforcement Learning where the degree of ‘Reward’ has been linked with the Mean Square Error (MSE) of the Neural Network in (17) and (18). In (17), n is the number of emails used in the evaluation process, $o_i$ is the output for an email and $t_i$ is the desired target for the same
email.
$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} (o_i - t_i)^{2}}{n} \tag{17}$$
$$\mathrm{Reward} = \frac{1}{\mathrm{MSE}} \tag{18}$$
Debarr and Wechsler [216] experimented with both supervised and unsupervised frameworks to generate a hybrid model known as Random Boost. The system uses random feature selection to improve upon the performance of the Logit Boost algorithm. Random Boost is more like an extension of RF. Its runtime complexity is around one-fourth of that of RF (with comparable accuracy) and it also reduces the training time of Logit Boost quite significantly.
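As a worked illustration of (17) and (18), the sketch below computes the MSE over a handful of emails and the corresponding reward. The output and target values are made up for the example, and the function names, the 1/0 label encoding and the handling of a zero MSE are assumptions rather than part of PEDS [15].

```python
# Sketch of the reward rule in (17) and (18): reward is the inverse of the MSE.
# Outputs and targets below are invented sample values, not data from the study.


def mean_square_error(outputs, targets):
    """MSE = sum((o_i - t_i)^2) / n over the n evaluated emails."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("outputs and targets must be non-empty and of equal length")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def reward(outputs, targets):
    """Reward = 1 / MSE; a zero MSE is mapped to infinity (assumption of this sketch)."""
    mse = mean_square_error(outputs, targets)
    return float("inf") if mse == 0 else 1.0 / mse


outputs = [0.9, 0.2, 0.7, 0.1]  # network outputs for four emails
targets = [1.0, 0.0, 1.0, 0.0]  # desired targets (assumed: 1 = phishing, 0 = legitimate)

mse = mean_square_error(outputs, targets)
print(f"MSE = {mse:.4f}, Reward = {reward(outputs, targets):.2f}")
```

Because the reward is the reciprocal of the error, a network whose outputs move closer to the targets earns a larger reward, which is the property the reinforcement step relies on.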
Though the achieved accuracy rate is 99.05%, some of the features that have been used are rather unconventional, for example ‘BodyDearWord’, ‘BodyNumChars’, ‘BodyNumWords’, ‘NumLinkNonASCII’ and ‘ContainScript’. These features do not have any real significance for whether a mail is spam. The authors have not justified the inclusion of these, which leaves scope for improvement.
Another study by Hassan [213] investigated the effect of combining text clustering using the K-Means algorithm with various supervised classification algorithms. Some of the features from the clustering space have been shared with the classification module to gauge the degree of improvement in classification. However, the outcome portrayed an insignificant gain which is not really viable against the added computational complexity. Table 7 tabulates a summarized view of a ‘sample’ of 42 studies drawn from the Machine Learning based techniques discussed in this section. Out of these 42 studies, 28 are supervised, 6 are semi-supervised and 8 are unsupervised.
The semi-supervised model put forward by Padhiyar and Rekh [214] has been built upon KNN and NB. The model claims to achieve better classification accuracy than standalone NB or KNN based methods. But further inspection shows that the work will probably not be highly accurate when the availability of initial labelled documents is limited. As explained earlier, labelled documents contain labelled data, that is, data for which the target answer is already known. The study proposes adding Expectation Maximization (EM) to handle the scarcity of labelled data, but this has not been implemented in the study. Further, the implemented framework lacks an effective feature selection and pre-processing module.
Though KNN works well in the majority of cases, Chakrabarty and Roy [215] highlighted a few issues with the logic behind the calculation of the similarity measure, which creates the requirement of high memory usage and complicates the calculation, eventually putting pressure on the system resources. To address such issues, the authors have proposed an amalgamation of KNN and unsupervised Minimum Spanning Tree.
The system attained an accuracy of 75%. They have also reduced the size of the training set and assigned weights to different training samples to indicate the degree of importance of each sample. The system is rather tied to the directory structure of an individual’s email settings and more work is needed to make it flexible and usable for different types of email management systems.
Meng et al. [33] used Multi-view datasets along with disagreement-based semi-supervised learning to build a framework. Multi-view datasets means having more than one dataset, composed of different features, but selected from the same data source. The endeavour takes motivation from the fact that there are a number of problems in supervised classification which often hinder the practicability of these systems, such as data labelling. Disagreement-based Semi-supervised learning, on the other hand, is well equipped to handle both labelled and unlabelled data. In this type of semi-supervised learning, multiple learners actively collaborate to analyse a set of unlabelled data; the disagreement among these ‘‘learners’’ plays a key role in the final outcome [32]. The proposed method gives an accuracy of a bit over 85%. But as suggested by [32], Disagreement-based Semi-supervised learning, in its current state, is not really safe in the sense that oftentimes the exploitation of unlabelled data may have an adverse effect on the model’s performance.

V. AN ANALYTICAL DISSECTION OF THE STUDIES CARRIED OUT AND FUTURE SPAM DETECTION RESEARCH DIRECTIONS
This section sheds light on some key insights that can be derived from the critically analyzed studies above. We will start with the general, non-AI based spam detection techniques discussed in Section III.

A. KEY INSIGHTS FROM GENERAL NON-AUTOMATED SPAM DETECTION FRAMEWORKS
From Fig. 6, we can see how the non-automated anti-spam systems discussed in Section III have been designed around different subsections of the email infrastructure. Clearly, email data parts C and D [Fig. 1] have been exploited a lot more than other parts. Thus these systems have relied upon email content, subject, sender and receiver information, while other header features such as IP source and destination, along with SMTP transactional fields, have been used to a somewhat lesser degree.
From Table 4 we can see that around 50% of the non-automated techniques depend upon multiple data parts of an email.
B. KEY INSIGHTS FROM MACHINE LEARNING BASED SPAM DETECTION FRAMEWORKS
Based on the information presented in Table 7, a number of useful perceptions can be obtained on the nature and trend of research done on Machine Learning based spam detection techniques since circa 2010. Facts deduced from the sample can decidedly aid in understanding the direction of research that has been conducted over the years. Some of the obtained insights will also indicate potential future research directions.

TABLE 7. A summary of machine learning based techniques.

1) FINDING A: HIGH ADOPTION OF SUPERVISED TECHNIQUES
The pie chart in Fig. 7 demonstrates the high adoption of supervised techniques in developing or benchmarking such anti-spam systems, with 67% of the selected sample.

VOLUME7,2019 37235
A.Karimetal.:ComprehensiveSurveyforIntelligentSpamEmailDetection

consistent.

FIGURE 6. Number of non-automated spam detection studies in

relation to various email data parts.

FIGURE 7. Proportion of types of frameworks.

As suggested by Fig. 7, supervised approaches have been

the first choice for the researchers and developers, clearly
signifying the fact that there is high degree of room and
opportunity to expand the research into both semi-
supervised, unsupervised and even reinforcement based
models.
True form of Artificial Intelligence can only be achieved
through non-supervised learning, so there is clearly a need
to investigate and develop this wing of Machine Learning
for anti-spam systems.

2) FINDING B: PROBABLE REASON BEHIND

MARGINAL ADOPTION OF UNSUPERVISED AND
SEMI-SUPERVISED ALGORITHMS
The reason for lower adoption rate for Semi-supervised and
unsupervised method may be explained through some
statistical studies on the outcome (Accuracy in this case)
they provide. Although the sample in Table 7 shows dis-
proportionate number of research works for these two types
of methods in comparison to supervised learning, but the
following Scatterplots and Standard Deviation calculations
may shed some light on the underlying cause.
The scatterplots have been laid out in Fig. 8 (supervised),
9 (semi-supervised) and 10 (unsupervised). It can clearly be
seen from the three plots that accuracy for supervised
learning (Fig. 8) has been within a tightly closed range, with
lesser variation; meaning the outcomes are mostly

37236 VOLUME7,2019
A.Karimetal.:ComprehensiveSurveyforIntelligentSpamEmailDetection
FIGURE 8. Scatterplo for accuracy of supervised methods (sd: 2.97).

FIGURE 9. Scatterplot for accuracy of semi-supervised methods (sd:

5.39).

FIGURE 10. Scatterplo for accuracy of unsupervised methods (sd: ≈

9.20).

The average accuracy is also around min-nineties which is

quite acceptable.
On the contrary, the scatterplots for semi-supervised (Fig. 9) and unsupervised (Fig. 10)
methods demonstrate that the grouping of accuracy is not as tightly maintained, which means
the results are not consistent and can vary widely, and thus inspire less confidence among
researchers and developers alike. Nonetheless, with the innovation of new algorithms and
techniques for unsupervised and semi-supervised methods, such high variance should come down.
The Standard Deviation (SD) for supervised, semi-supervised and unsupervised frameworks
has been calculated as 2.97, 5.39 and ≈9.20 respectively from the reported accuracy. The
values confirm the findings from the scatterplots.
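
As a rough illustration of how such a comparison can be computed, the sketch below derives
the mean and standard deviation of accuracy per framework type using Python's standard
`statistics` module. The accuracy values are made up for demonstration and are not the
figures from Table 7. The use of the sample standard deviation (`stdev`) rather than the
population form (`pstdev`) is also an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev

# Hypothetical accuracy values in percent; NOT the figures reported in Table 7.
reported_accuracy = {
    "supervised": [96.0, 94.5, 97.2, 95.1, 93.8],
    "semi-supervised": [91.0, 84.5, 96.0, 88.2],
    "unsupervised": [75.0, 92.0, 83.5, 96.4],
}


def summarise(values):
    """Return (mean, sample standard deviation) for a list of accuracies."""
    return mean(values), stdev(values)


summary = {name: summarise(vals) for name, vals in reported_accuracy.items()}

for name, (avg, sd) in summary.items():
    # A smaller sd means the reported accuracies cluster more tightly.
    print(f"{name:16s} mean={avg:6.2f}  sd={sd:5.2f}")
```

With data of this shape, a smaller standard deviation for one group corresponds to the
tighter clustering that is visible in its scatterplot.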
It is clear from Findings A and B that there is a high degree of opportunity to work with
non-supervised spam detection frameworks with a view to bringing their performance to a
level comparable to (or even better than) that of supervised methods, as unsupervised
learning does hold a few distinct advantages over supervised learning, such as the easier
availability of unlabelled data than labelled data and lesser computational complexity.

FIGURE 11. Showing the total number of studies in which each of the algorithms has been used (at least 3 studies).

FIGURE 12. Grouping the studies based on the number of algorithms used to build the primary model.

3) FINDING C: ALGORITHMIC PREFERENCES
The bar chart in Fig. 11 primarily illustrates the prevalence of supervised algorithms such
as Naïve Bayes and SVM.
It is understood that with the probable rise in unsupervised and semi-supervised systems,
these numbers should change as well. Another point to note is that, along with a more
complex and resource-consuming algorithm such as SVM, a simpler and easier algorithm such
as Naïve Bayes also has its applications in a wide variety of settings.

4) FINDING D: PROPORTION OF SINGLE ALGORITHM SYSTEMS AGAINST MULTI-ALGORITHM FRAMEWORKS
Machine Learning based models can have a single algorithm at their core, or multiple
algorithms may be used to formulate a working model. A subset of 35 studies from Table 7
has been drawn out to understand the design patterns. Fig. 12 summarizes the finding. In
the case of sole algorithm systems, oftentimes the study behind these frameworks carried
out a comparison analysis using multiple algorithms after the model had been implemented,
but the primary framework for the model had been built upon a single Machine Learning
algorithm.
As can be seen from Fig. 12, the number of frameworks based on a single algorithm
outnumbers multi-algorithm models conveniently. It is clear that research initiatives on
hybrid systems may need more attention.
In fact, the two variables, total number of algorithms used (p) and total number of studies
undertaken (q), have a near-perfect inverse correlation. Using Pearson's correlation
coefficient [186], r, as shown in (19), the value of r comes out as -0.91, which clearly
points towards the sharp negative correlation between p and q.

$$ r = \frac{\sum (p - \bar{p})(q - \bar{q})}{\sqrt{\sum (p - \bar{p})^{2} \, \sum (q - \bar{q})^{2}}} \tag{19} $$

Thus more research on hybrid systems is a possible area of future research, as indicated
by Finding D.

5) FINDING E: APPLICATION OF HEADER AND DOMAIN FEATURES
Out of the 58 studies evaluated throughout Section III.C, only 9 of those (or ≈14%) have
used some form of header or domain features (excluding the subject field) while designing
spam detection systems using Machine Learning algorithms.
Such an observation highlights an opening where more gravity and careful analysis can be
applied regarding header, domain and even URL based features of an email. Besides, the
frameworks that have used these features have worked with only a limited set of them, and
often left out a number of useful ones such as the ‘Received from:’ header fields and
‘Age of domain’, to name a few.
From Finding E we can underscore the fact that future spam detection frameworks may
consider evaluating a number of these useful header, URL and domain features
simultaneously to formulate an efficient and effective set of features through appropriate
feature engineering.

6) FINDING F: HANDLING OF CONCEPT DRIFT
Again, out of 58 recent research initiatives, only 2 [97], [144] have worked on the issue
of Concept Drift with automated principles, which is just ≈1.15%. This highlights a strong
research prospect, as addressing Concept Drift is something that makes machine learning
based filtering systems stand apart from the traditional static ones.

7) FINDING G: UNDER-ANALYSING THE EMAIL CONTENT
Almost all the studies that work with email content to detect spam emails, especially the
phishing ones, rely on word-based clustering or classification models and the degree of
closeness of these clusters or classification models to high-probability spam words. The
approach is reasonably logical; however, in modern times, spammers create these phishing
emails in the light of several psychological aspects of users' mindset.
In general, a well-crafted phishing email is modelled as closely as possible after the
structure in Fig. 13, where the message body may contain words or phrases related to
Finance and Personal issues, with the ‘Subject’ header holding a phrase that


will definitely put the user in a position where he/she will feel the urge to open the
email at the earliest instance possible.

FIGURE 13. Nature of an effective phishing email.

Thus for the content analysis to be effective, along with normal word-based analysis, we
believe an automated mechanism is also required which will be able to detect, from multiple
angles, how closely an email matches the structure discussed above in Fig. 13 before
finally labelling it as an instance of phishing email. Such an approach has not been seen
in our studies of the modern content based analysis techniques, and we believe more
research is required in this area of content analysis (along with the subject header).
Certainly, to be effective against spammers and fraudsters, a Machine Learning based
framework, if fully leveraged, should be able to counter all of these key issues as much as
possible. Therefore, we believe that future research should encompass the directions that
have been identified in the previous section.

VI. FUTURE WORK
The survey work presented in this paper discussed the types and implications of spam emails
on modern society and commerce. A multitude of spam detection frameworks, both Machine
Learning based and regular non-automated ones, have been dissected critically to depict a
complete picture of the current development and future direction of the field. It is
expected that in the near future adequate development will branch out into the lesser
explored arenas of Machine Learning based spam identification. It is reasonably clear from
the reviews that the currently emerging frameworks, even though using automated machine
learning based solutions, are often not equipped to deal with the multiple angles from
which an email spam threat can spread. Thus the future direction of research in this field
should be to develop anti-spam software that can simultaneously battle against multiple
types of email spamming, considering multiple angles of attack as discussed above, with a
single installation of the software.

VII. CONCLUSION
After a thorough analysis, the study results in several different observations, especially
in the realm of Machine Learning based propositions. It is noted that the high adoption of
supervised approaches is quite obvious; the reason behind this turns out to be better
consistency in the performance of the model. It has also been highlighted that certain
algorithms, such as SVM and Naïve Bayes, are in high demand. We have also come to the
conclusion that single-algorithm anti-spam systems are quite common, thus the potential of
research into hybrid and multi-algorithm systems is quite promising. Besides, research
that focusses on email header features excluding the ‘subject’ field, URLs within the email
body and sender domain information needs to substantially increase. Another important area
that needs increasing attention is the addressing of ‘Concept Drift’, which would
definitely help a system perform optimally under gradual modification in spamming
techniques and motives. In addition, the current way of dealing with spam emails of
phishing nature is not the most efficient as described, and thus requires a more innovative
approach that will take into account the different angles of the problem.
A point of concern is that despite several admonishments from multiple bodies, governments
of a number of leading countries in the world have fallen short in forming effective
regulations that can really have a lasting impact on this issue [31]. Nevertheless, the
actions to strengthen cybersecurity have seen greater gravity in recent times, resulting in
increased research and streamlined availability of funding in this field. Thus it can be
expected that a formidable framework, equipped with measures against the drawbacks
highlighted in this study, will soon become available for commercial and personal
deployment.

REFERENCES
[1] O. Saad, A. Darwish, and R. Faraj, ‘‘A survey of machine learning techniques for Spam filtering,’’ Int. J. Comput. Sci. Netw. Secur., vol. 12, no. 2, p. 66, Feb. 2012.
[2] M. K. Paswan, P. S. Bala, and G. Aghila, ‘‘Spam filtering: Comparative analysis of filtering techniques,’’ in Proc. Int. Conf. Adv. Eng., Sci. Manage. (ICAESM), Mar. 2012, pp. 170–176.
[3] E. Bauer. 15 Outrageous Email Spam Statistics that Still Ring True in 2018. Accessed: Jul. 20, 2019. [Online]. Available: https://www.propellercrm.com/blog/email-spam-statistics
[4] Statista. Number of E-mail Users Worldwide From 2017 to 2023. Accessed: Jul. 24, 2019. [Online]. Available: https://www.statista.com/statistics/456519/forecast-number-of-active-email-accounts-worldwide/
[5] Y. Cohen, D. Gordon, and D. Hendler, ‘‘Early detection of spamming accounts in large-Scale service provider networks,’’ Knowl.-Based Syst., vol. 142, pp. 241–255, Feb. 2018.
[6] Campaign Monitor. The Shocking Truth About How Many Emails Are Sent. Accessed: Jul. 25, 2019. [Online]. Available: https://www.campaignmonitor.com/blog/email-marketing/2018/03/shocking-truth-about-how-many-emails-sent/
[7] O. A. Okunade, ‘‘Manipulating E-mail server feedback for spam prevention,’’ Arid Zone J. Eng., Technol. Environ., vol. 13, no. 3, pp. 391–399, Jun. 2017.
[8] R. Islam and Y. Xiang, ‘‘Email classification using data reduction method,’’ in Proc. 5th Int. ICST Conf. Commun. Netw. China, Aug. 2010, pp. 1–5.
[9] Scamwatch. Scam Statistics. Accessed: Jul. 15, 2019. [Online]. Available: https://www.scamwatch.gov.au/about-scamwatch/scam-statistics
[10] Scamwatch. Scam Statistics. Accessed: Jul. 16, 2019. [Online]. Available: https://www.scamwatch.gov.au/about-scamwatch/scam-statistics?scamid=31&date=2019
[11] A. Test. Spam Statistics. Accessed: Jul. 16, 2019. [Online]. Available: https://www.av-test.org/en/statistics/spam/
[12] N. Banu and M. Banu, ‘‘A Comprehensive Study of Phishing Attacks,’’ Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 6, pp. 783–786, 2013.
[13] H. Hu and G. Wang, ‘‘Revisiting Email spoofing attacks,’’ 2018, arXiv:1801.00853. [Online]. Available: https://arxiv.org/abs/1801.00853


[14] R. A. Halaseh and J. Alqatawna, ‘‘Analyzing cybercrimes strategies: The case of phishing attack,’’ in Proc. Cybersecur. Cyberforensics Conf. (CCC), Aug. 2016, pp. 82–88.
[15] S. Smadi, N. Aslam, and L. Zhang, ‘‘Detection of online phishing email using dynamic evolving neural network based on reinforcement learning,’’ Decis. Support Syst., vol. 107, pp. 88–102, Mar. 2018.
[16] A. Attar, R. M. Rad, and R. E. Atani, ‘‘A survey of image spamming and filtering techniques,’’ Artif. Intell. Rev., vol. 40, no. 1, pp. 71–105, Jun. 2011.
[17] A. L. Chung-Man, ‘‘An analysis of the impact of phishing and anti-phishing related announcements on market value of global firms,’’ HKU, Hong Kong, Tech. Rep., 2009.
[18] N. Raad, G. Alam, B. Zaidan, and A. Zaidan, ‘‘Impact of spam advertisement through E-mail: A study to assess the influence of the anti-spam on the E-mail marketing,’’ Afr. J. Bus. Manage., vol. 4, no. 11, pp. 2362–2367, Sep. 2010.
[19] TheStar. Company Cheated of RM 4.5 Mil Due to Email Spoofing. Accessed: Jul. 30, 2019. [Online]. Available: https://www.thestar.com.my/news/nation/2017/06/11/kedah-based-company-cheated-due-to-email-spoofing
[20] T. L. Shan, G. N. Samy, B. Shanmugam, S. Azam, K. C. Yeo, and K. Kannoorpatti, ‘‘Heuristic systematic model based guidelines for phishing victims,’’ in Proc. IEEE Annu. India Conf. (INDICON), Dec. 2016, pp. 1–6.
[21] H. M. Al-Mashhadi and M. H. Alabiech, ‘‘A survey of Email service: Attacks, security methods and protocols,’’ Int. J. Comput. Appl., vol. 162, no. 11, pp. 31–40, 2017.
[22] J. V. Chandra, N. Challa, and S. K. Pasupuleti, ‘‘A practical approach to E-mail spam filters to protect data from advanced persistent threat,’’ in Proc. Int. Conf. Circuit, Power Comput. Technol. (ICCPCT), Mar. 2016, pp. 1–5.
[23] ABC Bus Company. Accessed: Apr. 5, 2019. [Online]. Available: https://www.doj.nh.gov/consumer/security-breaches/documents/abc-bus-20180302.pdf
[24] B. B. Gupta, N. A. G. Arachchilage, and K. E. Psannis, ‘‘Defending against phishing attacks: Taxonomy of methods, current issues and future directions,’’ Telecommun. Syst., vol. 67, no. 2, pp. 247–267, Feb. 2017.
[25] A. Han. Leoni AG Loses €40m in an Email Scam. Accessed: Apr. 17, 2019. [Online]. Available: https://www.bankvaultonline.com/news/security-news/leoni-ag-loses-e40m-in-an-email-scam/
[26] M. J. Schwartz. French Cinema Chain Fires Dutch Executives Over CEO Fraud. Accessed: Apr. 17, 2019. [Online]. Available: https://www.bankinfosecurity.com/blogs/french-cinema-chain-fires-dutch-executives-over-ceo-fraud-p-2681
[27] R. Basnet, S. Mukkamala, and A. H. Sung, ‘‘Detection of phishing attacks: A machine learning approach,’’ in Soft Computing Applications in Industry (Studies in Fuzziness and Soft Computing), vol. 226. Berlin, Germany: Springer-Verlag, 2008, pp. 373–383.
[28] R. Vinayakumar, M. Alazab, K. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman, ‘‘Deep learning approach for intelligent intrusion detection system,’’ IEEE Access, vol. 7, pp. 41525–41550, 2019.
[29] M. Alazab, ‘‘Profiling and classifying the behavior of malicious codes,’’ J. Syst. Softw., vol. 100, pp. 91–102, Feb. 2015.
[30] V. Gupta. Understanding Feedforward Neural Networks. Accessed: Jun. 10, 2019. [Online]. Available: www.learnopencv.com/
[31] M. R. Sánchez, T. T. Loon, and V. Victor, ‘‘An anti-spam framework for Singapore,’’ Media Asia, vol. 30, no. 4, pp. 240–246, 2003.
[32] Z. Zhou, ‘‘Disagreement-based semi-supervised learning,’’ Acta Automatica Sinica, vol. 39, no. 11, pp. 1871–1878, 2013, doi: 10.3724/sp.j.1004.2013.01871.
[33] W. Li, W. Meng, Z. Tan, and Y. Xiang, ‘‘Towards designing an Email classification system using multi-view based semi-supervised learning,’’ in Proc. IEEE 13th Int. Conf. Trust, Secur. Privacy Comput. Commun., Sep. 2014, pp. 174–181.
[34] S. Hameed, T. Kloht, and X. Fu, ‘‘Identity based Email sender authentication for spam mitigation,’’ in Proc. 8th Int. Conf. Digit. Inf. Manage. (ICDIM), Sep. 2013, pp. 14–19.
[35] E. Calò. SPF, DKIM and DMARC Brief Explanation and Best Practices. Accessed: Feb. 21, 2019. [Online]. Available: https://www.endpoint.com/blog/2014/04/15/spf-dkim-and-dmarc-brief-explanation
[36] L. Seltzer. DKIM: Useless or Just Disappointing. Accessed: Mar. 8, 2019. [Online]. Available: www.zdnet.com
[37] A. Karim, ‘‘A cryptographic application for secure information transfer in a linux network environment,’’ Amer. J. Eng. Res., vol. 5, no. 8, pp. 266–275, 2016.
[38] D. Sipahi, G. Dalkılıç, and M. H. Özcanhan, ‘‘Detecting spam through their sender policy framework records,’’ Secur. Commun. Netw., vol. 8, no. 18, pp. 3555–3563, Dec. 2015.
[39] M. T. Banday, F. A. Mir, J. A. Qadri, and N. A. Shah, ‘‘Analyzing Internet E-mail date-spoofing,’’ Digit. Invest., vol. 7, nos. 3–4, pp. 145–153, Apr. 2011.
[40] H. Esquivel, A. Akella, and T. Mori, ‘‘On the effectiveness of IP reputation for spam filtering,’’ in Proc. 2nd Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2010, pp. 1–10.
[41] K. S. Bajaj, F. Egbufor, and J. Pieprzyk, ‘‘Critical analysis of spam prevention techniques,’’ in Proc. 3rd Int. Workshop Secur. Commun. Netw. (IWSCN), May 2011, pp. 83–87.
[42] R. R. Roy, ‘‘Basic session initiation protocol,’’ in Handbook on Session Initiation Protocol: Networked Multimedia Communications for IP Telephony. Boca Raton, FL, USA: CRC Press, 2016, pp. 5–166.
[43] R. Ferdous, ‘‘Analysis and protection of SIP based services,’’ Ph.D. dissertation, Dept. Inf. Eng. Comput. Sci., Univ. Trento, Trento, Italy, 2014.
[44] G. Caruana and M. Li, ‘‘A survey of emerging approaches to spam filtering,’’ ACM Comput. Surv., vol. 44, no. 2, p. 9, Feb. 2012.
[45] P. Sousa, A. Machado, M. Rocha, P. Cortez, and M. Rio, ‘‘A collaborative approach for spam detection,’’ in Proc. 2nd Int. Conf. Evolving Internet, Sep. 2010, pp. 92–97.
[46] M. Mojdeh, ‘‘Personal Email spam filtering with minimal user interaction,’’ Ph.D. dissertation, Dept. Comput. Sci., Univ. Waterloo, Waterloo, ON, Canada, 2012.
[47] M. Prilepok, P. Berek, J. Platos, and V. Snasel, ‘‘Spam detection using data compression and signatures,’’ Cybern. Syst. Int. J., vol. 44, nos. 6–7, pp. 533–549, Mar. 2013.
[48] S. Geerthik and P. Anish, ‘‘Filtering spam: Current trends and techniques,’’ Int. J. Mechtron., Elect. Comput. Technol., vol. 3, no. 8, pp. 208–223, Jul. 2013.
[49] E. Damiani, S. D. Capitani, D. Vimercati, S. Paraboschi, and P. Samarati, ‘‘An open digest-based technique for spam detection,’’ in Proc. ISCA 17th Int. Conf. Parallel Distrib. Comput. Syst., 2004, pp. 559–564.
[50] B. Biggio, G. Fumera, I. Pillai, and F. Roli, ‘‘A survey and experimental evaluation of image spam filtering techniques,’’ Pattern Recognit. Lett., vol. 32, no. 10, pp. 1436–1446, Jul. 2011.
[51] Process Software, LLC. Explanation of Common Spam Filtering Techniques. Accessed: Feb. 11, 2019. [Online]. Available: http://www.process.com/products/pmas/whitepapers/explanation_filter_techniques.html
[52] Vernon Schryver. Distributed Checksum Clearinghouses. Accessed: Oct. 5, 2019. [Online]. Available: https://www.dcc-servers.net/dcc/
[53] H. Wang, R. Zhou, and Y. Wang, ‘‘An anti-spam filtering system based on the naive Bayesian classifier and distributed checksum clearinghouse,’’ in Proc. 3rd Int. Symp. Intell. Inf. Technol. Appl., Nov. 2009, pp. 128–131.
[54] J. Chen, R. Fontugne, A. Kato, and K. Fukuda, ‘‘Clustering spam campaigns with fuzzy hashing,’’ in Proc. AINTEC Asian Internet Eng. Conf. (AINTEC), Nov. 2014, p. 66.
[55] A. Karim, ‘‘Multi-layer masking of character data with a visual image key,’’ Int. J. Comput. Netw. Inf. Secur., vol. 10, no. 10, pp. 41–49, Oct. 2017, doi: 10.5815/ijcnis.2017.10.05.
[56] C.-Y. Chiu, A. Prayoonwong, and Y.-C. Liao, ‘‘Learning to index for nearest neighbor search,’’ 2018, arXiv:1807.02962. [Online]. Available: https://arxiv.org/abs/1807.02962
[57] Y. Li, S. C. Sundaramurthy, A. G. Bardas, X. Ou, D. Caragea, X. Hu, and J. Jang, ‘‘Experimental study of fuzzy hashing in malware clustering analysis,’’ in Proc. 8th Workshop Cyber Secur. Experimentation Test (CSET), Berkeley, CA, USA: USENIX Association, 2015, p. 8.
[58] P. Liu and T.-S. Moh, ‘‘Content based spam E-mail filtering,’’ in Proc. Int. Conf. Collaboration Technol. Syst. (CTS), Nov. 2016, pp. 218–224.
[59] D. Chiba, M. Akiyama, T. Yagi, K. Hato, T. Mori, and S. Goto, ‘‘DomainChroma: Building actionable threat intelligence from malicious domain names,’’ Comput. Secur., vol. 77, pp. 138–161, Aug. 2018.
[60] A. Ramachandran, N. Feamster, and S. Vempala, ‘‘Filtering spam with behavioral blacklisting,’’ in Proc. 14th ACM Conf. Comput. Commun. Secur. (CCS), Oct. 2007, pp. 342–351.
[61] Spamhaus. The Spamhaus Project. Accessed: Oct. 5, 2019. [Online]. Available: www.spamhaus.org
[62] M. Sirivianos, K. Kim, and X. Yang, ‘‘SocialFilter: Introducing social trust to collaborative spam mitigation,’’ in Proc. IEEE INFOCOM, Apr. 2011, pp. 2300–2308.


[63] P.-C. Lin, P.-H. Lin, P.-R. Chiou, and C.-T. Liu, ‘‘Detecting spamming activities by network monitoring with Bloom filters,’’ in Proc. 15th Int. Conf. Adv. Commun. Technol. (ICACT), Jan. 2013, pp. 163–168.
[64] Bloom Filters. Accessed: May 15, 2019. [Online]. Available: https://llimllib.github.io
[65] P. Revar, A. Shah, J. Patel, and P. Khanpara, ‘‘A review on different types of spam filtering techniques,’’ Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 2720–2723, May/Jun. 2017.
[66] S. Khanna, H. Chaudhry, and G. S. Bindra, ‘‘Inbound outbound Email traffic analysis and Its SPAM impact,’’ in Proc. 4th Int. Conf. Comput. Intell., Commun. Syst. Netw., Jul. 2012, pp. 181–186.
[67] L. Ilie, ‘‘Regular expression matching,’’ in Encyclopedia of Algorithms. New York, NY, USA: Springer-Verlag, 2014, pp. 1–6, doi: 10.1007/978-3-642-27848-8_340-2.
[68] Pettingers. What’s Working Now in Spam Assassin. Accessed: Feb. 13, 2019. [Online]. Available: http://www.pettingers.org/annoyances/sa-rules.html
[69] S. Khanna, H. Chaudhry, and G. S. Bindra, ‘‘Inbound outbound Email traffic analysis and its SPAM impact,’’ in Proc. 4th Int. Conf. Comput. Intell., Commun. Syst. Netw., Jul. 2012, pp. 181–186.
[70] R. K. Kumar, G. Poonkuzhali, and P. Sudhakar, ‘‘Comparative study on Email spam classifier using data mining techniques,’’ in Proc. Int. Multi Conf. Eng. Comput. Scientists, vol. 1, Mar. 2012, pp. 14–16.
[71] C. Laorden, I. Santos, B. Sanz, G. Alvarez, and P. G. Bringas, ‘‘Word sense disambiguation for spam filtering,’’ Electron. Commerce Res. Appl., vol. 11, no. 3, pp. 290–298, 2012.
[72] D. Kumawat and V. Jain, ‘‘POS tagging approaches: A comparison,’’ Int. J. Comput. Appl., vol. 118, no. 6, pp. 32–38, Jan. 2015.
[73] R. M. Ravindran and A. S. Thanamani, ‘‘K-means document clustering using vector space model,’’ Bonfring Int. J. Data Mining, vol. 5, no. 2, pp. 10–14, Jul. 2015.
[74] H. Che, Q. Liu, L. Zou, H. Yang, D. Zhou, and F. Yu, ‘‘A content-based phishing Email detection method,’’ IEEE Int. Conf. Softw. Qual., Rel. Secur. Companion (QRS-C), Jul. 2017, pp. 415–422.
[75] L. A. Zadeh, ‘‘Advances in fuzzy systems—Applications and theory,’’ in Fuzzy Sets, Fuzzy Logic, And Fuzzy Systems: Selected Papers By Lotfi A Zadeh. Singapore: World Scientific Publishing Co Pte Ltd, 1996, pp. 394–432, doi: 10.1142/9789814261302_0021.
[76] Y. Hu, C. Guo, E. W. T. Ngai, M. Liu, and S. Chen, ‘‘A scalable intelligent non-content-based spam-filtering framework,’’ Expert Syst. Appl., vol. 37, no. 12, pp. 8557–8565, Dec. 2010.
[77] W. V. Wanrooij and A. Pras, ‘‘Filtering spam from bad neighborhoods,’’ Int. J. Netw. Manage., vol. 20, no. 6, pp. 433–444, Nov./Dec. 2010.
[78] G. Stringhini, M. Egele, A. Zarras, T. Holz, C. Kruegel, and G. Vigna, ‘‘B@bel: Leveraging Email delivery for spam mitigation,’’ in Proc. 21st USENIX Conf. Secur. Symp., 2012, p. 2.
[79] A. Liska and G. Stowe, ‘‘Understanding DNS,’’ in DNS Security, 2016, pp. 1–23.
[80] D. Bradbury, ‘‘Can we make Email secure,’’ Netw. Secur., vol. 3, no. 3, pp. 13–16, Mar. 2014.
[81] Anon. Proof of Work. Accessed: May 27, 2019. [Online]. Available: https://en.bitcoin.it/wiki/Proof_of_work
[82] SpamAssassin. The Apache SpamAssassin. Accessed: Oct. 6, 2019. [Online]. Available: https://spamassassin.apache.org/
[83] Capterra. Com. Anti-Spam Software. Accessed: Oct. 6, 2019. [Online]. Available: www.capterra.com/anti-spam-software/
[84] ZeroSpam. The ZeroSpam Solution. Accessed: Oct. 6, 2019. [Online]. Available: https://www.zerospam.ca/en/home/
[85] A. Darwish, ‘‘Bio-inspired computing: Algorithms review, deep analysis, and the scope of applications,’’ Future Comput. Informat. J., vol. 3, no. 2, pp. 231–246, Dec. 2018.
[86] D. Ruano-Ordás, F. Fdez-Riverola, and J. R. Méndez, ‘‘Using evolutionary computation for discovering spam patterns from E-mail samples,’’ Inf. Process. Manage., vol. 54, no. 2, pp. 303–317, Mar. 2018.
[87] B. Meadows, P. Riddle, C. Skinner, and M. M. Barley, ‘‘Evaluating the seeding genetic algorithm,’’ in Advances in Artificial Intelligence (Lecture Notes in Computer Science). Springer, 2013, pp. 221–227.
[88] E. Conrad. Detecting Spam With Genetic Regular Expressions. London, U.K.: SANS Institute InfoSec Reading Room, 2007.
[89] I. Idris, A. Selamat, N. T. Nguyen, S. Omatu, O. Krejcar, K. Kuca, and M. Penhaker, ‘‘A combined negative selection algorithm-particle swarm optimization for an email spam detection system,’’ Eng. Appl. Artif. Intell., vol. 39, pp. 33–44, Mar. 2015.
[90] I. Idris, A. Selamat, and S. Omatu, ‘‘Hybrid email spam detection model with negative selection algorithm and differential evolution,’’ Eng. Appl. Artif. Intell., vol. 28, no. 2, pp. 97–110, Jan. 2014.
[91] A. J. Saleh, A. Karim, B. Shanmugam, S. Azam, K. Kannoorpatti, M. Jonkman, and F. D. Boer, ‘‘An intelligent spam detection model based on artificial immune system,’’ Information, vol. 10, no. 6, p. 209, Jun. 2019.
[92] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri, ‘‘Self-nonself discrimination in a computer,’’ in Proc. IEEE Comput. Soc. Symp. Res. Secur. Privacy, May 2014, pp. 202–212.
[93] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes. Abu Dhabi, United Arab Emirates: LuLu, 2012.
[94] S. S. Aote, M. M. Raghuwanshi, and L. Malik, ‘‘A brief review on particle swarm optimization: Limitations & future directions,’’ Int. J. Comput. Sci. Eng., vol. 2, no. 5, pp. 196–200, Sep. 2013.
[95] S. Salhi and N. M. Queen, ‘‘A hybrid algorithm for identifying global and local minima when optimizing functions with many minima,’’ Eur. J. Oper. Res., vol. 155, no. 1, pp. 51–67, May 2004.
[96] Y. Zhu and Y. Tan, ‘‘Extracting discriminative information from e-mail for spam detection inspired by immune system,’’ in Proc. IEEE Congr. Evol. Comput., Jul. 2010, pp. 1–7.
[97] M. Z. Hayat, J. Basiri, L. Seyedhossein, and A. Shakery, ‘‘Content-based concept drift detection for Email spam filtering,’’ in Proc. 5th Int. Symp. Telecommun., Dec. 2010, pp. 531–536.
[98] K. Jackowski, B. Krawczyk, and M. Woźniak, ‘‘Application of adaptive splitting and selection classifier to the spam filtering problem,’’ Cybern. Syst. Int. J., vol. 44, nos. 6–7, pp. 569–588, 2013.
[99] D. Wang, D. Irani, and C. Pu, ‘‘Is Email business dying?: A study on evolution of Email spam over fifteen years,’’ EAI Endorsed Trans. Collaborative Comput., vol. 1, no. 1, p. e3, May 2014.
[100] R. Sathya and A. Abraham, ‘‘Comparison of supervised and unsupervised learning algorithms for pattern classification,’’ Int. J. Adv. Res. Artif. Intell., vol. 2, no. 2, pp. 34–38, Feb. 2013.
[101] F. Qian, A. Pathak, Y. C. Hu, Z. M. Mao, and Y. Xie, ‘‘A case for unsupervised-learning-based spam filtering,’’ ACM SIGMETRICS Perform. Eval. Rev., vol. 38, no. 1, p. 367, Dec. 2010.
[102] R. Agrawal, T. Imieliński, and A. Swami, ‘‘Mining association rules between sets of items in large databases,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), May 1993, pp. 207–216.
[103] P. H. B. Las-Casas, J. M. Almeida, M. A. Gonzalves, D. Guedes, A. Ziviani, and H. T. Marques-Neto, ‘‘Adaptive spammer detection at the source network,’’ in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2013, pp. 1434–1439.
[104] X. Zhu, ‘‘Semi-supervised learning,’’ in Encyclopedia of Machine Learning and Data Mining. New York, NY, USA: Springer, 2010, pp. 1142–1147, doi: 10.1007/978-1-4899-7687-1_749.
[105] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, ‘‘A survey on reinforcement learning models and algorithms for traffic signal control,’’ ACM Comput. Surv., vol. 50, no. 3, p. 34, Oct. 2017.
[106] M. A. Karim, J. Currie, and T.-T. Lie, ‘‘Dynamic event detection using a distributed feature selection based machine learning approach in a self-healing microgrid,’’ IEEE Trans. Power Syst., vol. 33, no. 5, pp. 4706–4718, Sep. 2018.
[107] D. Hassan, ‘‘On Determining the most effective subset of features for detecting phishing Websites,’’ Int. J. Comput. Appl., vol. 122, no. 20, pp. 1–7, Jan. 2015.
[108] Feature Extraction Foundations and Applications, Springer-Verlag, New York, NY, USA, 2016.
[109] O. A. Adewumi and A. A. Akinyelu, ‘‘A hybrid firefly and support vector machine classifier for phishing email detection,’’ Kybernetes, vol. 45, no. 6, pp. 977–994, Jun. 2016.
[110] M. Khonji, Y. Iraqi, and A. Jones, ‘‘Phishing detection: A literature survey,’’ IEEE Commun. Surveys Tuts., vol. 15, no. 4, pp. 2091–2121, 4th Quart, 2013.
[111] I. Qabajeh and F. Thabtah, ‘‘An experimental study for assessing email classification attributes using feature selection methods,’’ in Proc. 3rd Int. Conf. Adv. Comput. Sci. Appl. Technol., Dec. 2014, pp. 125–132.
[112] R. Mohammad, T. L. McCluskey, and T. Fadi, ‘‘An assessment of features related to phishing Websites using an automated technique,’’ in Proc. Int. Conf. Internet Technol. Secured Trans., London, U.K., Dec. 2012, pp. 492–497.
[113] A. Yasin and A. Abuhasan, ‘‘An intelligent classification model for Phishing Email detection,’’ Int. J. Netw. Secur. Appl., vol. 8, no. 4, pp. 55–72, 2016.

[114] T. Rashid, Make Your Own Neural Network: A Gentle Journey Through the Mathematics of Neural Networks, and Making Your Own Using the Python Computer Language. Charleston, SC, USA: CreateSpace, 2017.
[115] A. Nosseir, K. Nagati, and I. Taj-Eddin, ‘‘Intelligent word-based spam filter detection using multi-neural networks,’’ Int. J. Comput. Sci. Issues, vol. 10, no. 2, p. 17, Mar. 2013.
[116] M. F. Porter, ‘‘An algorithm for suffix stripping,’’ Program, vol. 14, no. 3, pp. 130–137, 1980.
[117] A. Malge and S. M. Chaware, ‘‘An efficient framework for spam mail detection in attachments using NLP,’’ Int. J. Sci. Res., vol. 5, no. 6, pp. 1121–1125, May 2016.
[118] Y. Zhang, R. Jin, and Z. Zhou, ‘‘Understanding bag-of-words model: A statistical framework,’’ Int. J. Mach. Learn. Cybern., vol. 1, nos. 1–4, pp. 43–52, Dec. 2010.
[119] J. Singh and V. Gupta, ‘‘Text Stemming: Approaches, applications, and challenges,’’ ACM Comput. Surv., vol. 49, no. 3, p. 45, Dec. 2016.
[120] L. Deng and D. Yu, Deep Learning: Methods and Applications. Boston, MA, USA: Now, 2014.
[121] M. F. A. Foysal, M. S. Islam, A. Karim, and N. Neehal, ‘‘Shot-Net: A convolutional neural network for classifying different cricket shots,’’ in Proc. Int. Conf. Recent Trends Image Process. Pattern Recognit., 2019, pp. 111–120.
[122] S. Seth and S. Biswas, ‘‘Multimodal spam classification using deep learning techniques,’’ in Proc. 13th Int. Conf. Signal-Image Technol. Internet-Based Syst. (SITIS), Dec. 2017, pp. 346–349.
[123] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[124] H. Jansma. Don’t Use Dropout in Convolutional Networks—Towards Data Science. Accessed: Jul. 30, 2019. [Online]. Available: https://towardsdatascience.com/dont-use-dropout-in-convolutional-networks-81486c823c16
[125] E.-X. Shang and H.-G. Zhang, ‘‘Image spam classification based on convolutional neural network,’’ in Proc. Int. Conf. Mach. Learn. Cybern. (ICMLC), Jul. 2016, pp. 398–403.
[126] A. Barushka and P. Hajek, ‘‘Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks,’’ Appl. Intell., vol. 48, no. 10, pp. 3538–3556, Oct. 2018.
[127] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, ‘‘Large-scale multi-label text classification—Revisiting neural networks,’’ in Machine Learning and Knowledge Discovery in Databases (Lecture Notes in Computer Science), vol. 10534. Cham, Switzerland: Springer, 2014, pp. 437–452.
[128] P. Bermejo, J. A. Gámez, and J. M. Puerta, ‘‘Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets,’’ Expert Syst. Appl., vol. 38, no. 3, pp. 2072–2080, Mar. 2011.
[129] R. T. Nakatsu, ‘‘Information visualizations used to avoid the problem of overfitting in supervised machine learning,’’ in HCI in Business, Government and Organizations. Supporting Business (Lecture Notes in Computer Science), vol. 10294. Cham, Switzerland: Springer, 2017, pp. 373–385.
[130] T. A. Almeida, J. Almeida, and A. Yamakami, ‘‘Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers,’’ J. Internet Services Appl., vol. 1, no. 3, pp. 183–200, Feb. 2010.
[131] L. Melian and A. Nursikuwagus, ‘‘Prediction student eligibility in vocation school with Naïve-Byes decision algorithm,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 407, no. 1, 2018, Art. no. 012140.
[132] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth, ‘‘Machine learning for Internet of Things data analysis: A survey,’’ Digit. Commun. Netw., vol. 4, no. 3, pp. 161–175, 2018.
[133] C. Bielza and P. Larrañaga, ‘‘Discrete Bayesian network classifiers: A survey,’’ ACM Comput. Surv., vol. 47, no. 1, p. 5, Jul. 2014.
[134] S. Manlangit, S. Azam, B. Shanmugam, K. Kannoorpatti, M. Jonkman, and A. Balasubramaniam, ‘‘An efficient method for detecting fraudulent transactions using classification algorithms on an anonymized credit card data set,’’ in Intelligent Systems Design and Applications (Advances in Intelligent Systems and Computing), vol. 736. Cham, Switzerland: Springer, 2018, pp. 418–429.
[135] B. Zhou, Y. Yao, and J. Luo, ‘‘Cost-sensitive three-way Email spam filtering,’’ J. Intell. Inf. Syst., vol. 42, no. 1, pp. 19–45, Feb. 2013.
[136] L. Ting and Y. Qingsong, ‘‘Spam feature selection based on the improved mutual information algorithm,’’ in Proc. 4th Int. Conf. Multimedia Inf. Netw. Secur., Nov. 2012, pp. 67–70.
[137] N. Jatana and K. Sharma, ‘‘Bayesian spam classification: Time efficient radix encoded fragmented database approach,’’ in Proc. Int. Conf. Comput. Sustain. Global Develop. (INDIACom), Mar. 2014, pp. 939–942.
[138] D. Ranganayakulu and C. C, ‘‘Detecting malicious URLs in E-mail—An implementation,’’ AASRI Procedia, vol. 4, pp. 125–131, Jan. 2013.
[139] C.-N. Lee, Y.-R. Chen, and W.-G. Tzeng, ‘‘An online subject-based spam filter using natural language features,’’ in Proc. IEEE Conf. Dependable Secure Comput., Aug. 2017, pp. 479–487.
[140] S. Hegelich, ‘‘Decision trees and random forests: Machine learning techniques to classify rare events,’’ Eur. Policy Anal., vol. 2, no. 1, pp. 98–120, 2016.
[141] T. Ouyang, S. Ray, M. Allman, and M. Rabinovich, ‘‘A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise,’’ Comput. Netw., vol. 59, pp. 101–121, Feb. 2014.
[142] The R Implementation of Rulefit Learning Algorithm. Accessed: Jun. 17, 2019. [Online]. Available: www.stat.stanford.edu/jhf/R-RuleFit.html
[143] M. Sheikhalishahi, M. Mejri, and N. Tawbi, ‘‘Clustering spam Emails into campaigns,’’ in Proc. 1st Int. Conf. Inf. Syst. Secur. Privacy, Feb. 2015, pp. 90–97.
[144] J.-J. Sheu, K.-T. Chu, N.-F. Li, and C.-C. Lee, ‘‘An efficient incremental learning mechanism for tracking concept drift in spam filtering,’’ Plos ONE, vol. 12, no. 2, Sep. 2017, Art. no. e0171518.
[145] K. Fawagreh, M. M. Gaber, and E. Elyan, ‘‘Random forests: From early developments to recent advancements,’’ Syst. Sci. Control Eng., vol. 2, no. 1, pp. 602–609, 2014.
[146] K. N. Tran, M. Alazab, and R. Broadhurst, ‘‘Towards a feature rich model for predicting spam Emails containing malicious attachments and URLs,’’ in Proc. 11th Australas. Data Mining Conf. (AusDM), 2014, pp. 1–10.
[147] R. Shams and R. E. Mercer, ‘‘Classifying spam Emails using text and readability features,’’ in Proc. IEEE 13th Int. Conf. Data Mining, Dec. 2013, pp. 657–666.
[148] S. Sperandei, ‘‘Understanding logistic regression analysis,’’ Biochemia Medica, vol. 24, no. 1, pp. 12–18, 2014.
[149] C. Constantin, ‘‘Using the logistic regression model in supporting decisions of establishing marketing strategies,’’ Bull. Transilvania Univ. Braşov. vol. 8, no. 2, p. 57, Jul. 2015.
[150] K. Pawar and M. Patil, ‘‘Pattern classification under attack on spam filtering,’’ in Proc. IEEE Int. Conf. Res. Comput. Intell. Commun. Netw. (ICRCICN), Nov. 2015, pp. 197–201.
[151] B. Schoelkopf, Empirical Inference. Berlin, Germany: Springer-Verlag, 2016.
[152] J. Nayak, B. Naik, and H. S. Behera, ‘‘A comprehensive survey on support vector machine in data mining tasks: Applications & challenges,’’ Int. J. Database Theory Appl., vol. 8, no. 1, pp. 169–186, Dec. 2015.
[153] S. Winters-Hilt and S. Merat, ‘‘SVM clustering,’’ BMC Bioinformatics, vol. 8, no. S7, 2007.
[154] S. Winters-Hilt, ‘‘Clustering via support vector machine boosting with simulated annealing,’’ Int. J. Comput. Optim., vol. 4, no. 1, pp. 53–89, 2017.
[155] M. Diale, C. Van Der Walt, T. Celik, and A. Modupe, ‘‘Feature selection and support vector machine hyper-parameter optimisation for spam detection,’’ in Proc. Pattern Recognit. Assoc. South Africa Robot. Mechatronics Int. Conf. (PRASA-RobMech), Nov. 2016, pp. 1–7.
[156] O. Amayri and N. Bouguila, ‘‘Content-based spam filtering using hybrid generative discriminative learning of both textual and visual features,’’ in Proc. IEEE Int. Symp. Circuits Syst., May 2012, pp. 862–865.
[157] S. S. Roy, A. Sinha, R. Roy, C. Barna, and P. Samui, ‘‘Spam Email detection using deep support vector machine, support vector machine and artificial neural network,’’ in Soft Computing Applications (Advances in Intelligent Systems and Computing). Berlin, Germany: Springer-Verlag, Apr. 2017, pp. 162–174.
[158] R. Wang, ‘‘AdaBoost for feature selection, classification and its relation with SVM, A review,’’ Phys. Procedia, vol. 25, pp. 800–807, Jan. 2012.
[159] R. Varghese and K. A. Dhanya, ‘‘Efficient feature set for spam Email filtering,’’ in Proc. IEEE 7th Int. Advance Comput. Conf. (IACC), Jan. 2017, pp. 732–737.
[160] H. Zuhair, A. Selmat, and M. Salleh, ‘‘The Effect of Feature Selection on Phish Website Detection,’’ Int. J. Adv. Comput. Sci. Appl., vol. 6, no. 10, pp. 221–232, 2016.

[161] L.-Y. Hu, M.-W. Huang, S.-W. Ke, and C.-F. Tsai, ‘‘The distance function effect on k-nearest neighbor classification for medical datasets,’’ SpringerPlus, vol. 5, no. 1, p. 1304, Aug. 2016.

[162] A. Sharma and A. Suryawanshi, ‘‘A novel method for detecting spam Email using KNN classification with spearman correlation as distance measure,’’ Int. J. Comput. Appl., vol. 136, no. 6, pp. 28–35, Feb. 2016.
[163] Spearman’s Rank-Order Correlation. Accessed: Jul. 15, 2019. [Online]. Available: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
[164] W. Wang, ‘‘Heterogeneous Bayesian ensembles for classifying spam Emails,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–8.
[165] J. Large, J. Lines, and A. Bagnall, ‘‘The heterogeneous ensembles of standard classification algorithms (HESCA): The whole is greater than the sum of its parts,’’ 2017, arXiv:1710.09220. [Online]. Available: https://arxiv.org/abs/1710.09220
[166] M. Shuaib, O. Osho, I. Ismaila, and J. K. Alhassan, ‘‘Comparative analysis of classification algorithms for Email spam detection,’’ Int. J. Comput. Netw. Inf. Secur., vol. 10, no. 1, pp. 60–67, Aug. 2018.
[167] A. Wijaya and A. Bisri, ‘‘Hybrid decision tree and logistic regression classifier for Email spam detection,’’ in Proc. 8th Int. Conf. Inf. Technol. Elect. Eng. (ICITEE), Oct. 2016, pp. 1–4.
[168] S. Nizamani, N. Memon, M. Glasdam, and D. D. Nguyen, ‘‘Detection of fraudulent emails by employing advanced feature abundance,’’ Egyptian Informat. J., vol. 15, no. 3, pp. 169–174, Nov. 2014.
[169] I. Alsmadi and I. Alhami, ‘‘Clustering and classification of email contents,’’ J. King Saud Univ.-Comput. Inf. Sci., vol. 27, no. 1, pp. 46–57, Jan. 2015.
[170] W. Feng, J. Sun, L. Zhang, C. Cao, and Q. Yang, ‘‘A support vector machine based naive Bayes algorithm for spam filtering,’’ in Proc. IEEE 35th Int. Perform. Comput. Commun. Conf. (IPCCC), Dec. 2016, pp. 1–8.
[171] R. Islam and J. Abawajy, ‘‘A multi-tier phishing detection and filtering approach,’’ J. Netw. Comput. Appl., vol. 36, no. 1, pp. 324–335, Jan. 2013.
[172] I. R. A. Hamid and J. Abawajy, ‘‘Phishing Email feature selection approach,’’ in Proc. IEEE 10th Int. Conf. Trust, Secur. Privacy Comput. Commun., Nov. 2011, pp. 916–921.
[173] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, ‘‘A comparison of machine learning techniques for phishing detection,’’ in Proc. Anti-Phishing Working Groups 2nd Annu. Ecrime Researchers Summit, vol. 7, Oct. 2007, pp. 60–69.
[174] S. More and S. A. Kulkarni, ‘‘Data mining with machine learning applied for email deception,’’ in Proc. Int. Conf. Opt. Imag. Sensor Secur. (ICOSS), Jul. 2013, pp. 1–4, doi: 10.1109/icoiss.2013.6678403.
[175] R. Shams and R. E. Mercer, ‘‘Personalized spam filtering with natural language attributes,’’ in Proc. 12th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA). Miami, FL, USA: IEEE, Dec. 2013, pp. 127–132.
[176] A. S. Aski and N. K. Sourati, ‘‘Proposed efficient algorithm to filter spam using machine learning techniques,’’ Pacific Sci. Rev. A, Natural Sci. Eng., vol. 18, no. 2, pp. 145–149, Jul. 2016.
[177] C. Zhang, X. Su, Y. Hu, Z. Zhang, and Y. Deng, ‘‘An evidential spam-filtering framework,’’ Cybern. Syst. Int. J., vol. 47, no. 6, pp. 427–444, Jun. 2016.
[178] J. Kukulies and R. H. Schmitt, ‘‘Uncertainty-based test planning using dempster-shafer theory of evidence,’’ in Proc. 2nd Int. Conf. Syst. Rel. Saf. (ICSRS), Dec. 2017, pp. 243–249.
[179] R. Sun and Y. Deng, ‘‘A new method to determine generalized basic probability assignment in the open world,’’ IEEE Access, vol. 7, no. 1, pp. 52827–52835, 2019.
[180] S. Ergin and S. Isik, ‘‘The investigation on the effect of feature vector dimension for spam email detection with a new framework,’’ in Proc. 9th Iberian Conf. Inf. Syst. Technol. (CISTI), Jun. 2014, pp. 1–4.
[181] G. Forman, ‘‘An extensive empirical study of feature selection metrics for text classification,’’ J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003.
[182] A. Sharaff, N. K. Nagwani, and A. Dhadse, ‘‘Comparative study of classification algorithms for spam Email detection,’’ in Emerging Research in Computing, Information, Communication and Applications. Singapore: Springer, 2015, pp. 237–244.
[183] R. Sharma and G. Kaur, ‘‘E-mail spam detection using SVM and RBF,’’ Int. J. Modern Edu. Comput. Sci., vol. 8, no. 4, pp. 57–63, Apr. 2016.
[184] S. A. Saab, N. Mitri, and M. Awad, ‘‘Ham or spam? A comparative study for some content-based classification algorithms for email filtering,’’ in Proc. 17th IEEE Medit. Electrotech. Conf. (MELECON), Apr. 2014, pp. 339–343, doi: 10.1109/melcon.2014.6820574.
[185] F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof, and M. Köppen, ‘‘Detecting malicious URLs using machine learning techniques,’’ in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2016, pp. 1–8.
[186] P. Sedgwick, ‘‘Pearson’s correlation coefficient,’’ BMJ, vol. 345, Jul. 2012, Art. no. e4483, doi: 10.1136/bmj.e4483.
[187] A. Qaroush, I. M. Khater, and M. Washaha, ‘‘Identifying spam e-mail based-on statistical header features and sender behavior,’’ in Proc. CUBE Int. Inf. Technol. Conf. (CUBE), vol. 12, Sep. 2012, pp. 771–778.
[188] E. M. Bahgat, S. Rady, W. Gad, and I. F. Moawad, ‘‘Efficient email classification approach based on semantic methods,’’ Ain Shams Eng. J., vol. 9, no. 4, pp. 3259–3269, Dec. 2018.
[189] L. T. Nguyen and K. M. Huynh, ‘‘Using WordNet similarity and translations to create Synsets for ontology-based vietnamese WordNet,’’ in Proc. 5th IIAI Int. Congr. Adv. Appl. Inform. (IIAI-AAI), Jul. 2016, pp. 651–656.
[190] E. M. Bahgat and I. F. Moawad, ‘‘Semantic-based feature reduction approach for E-mail classification,’’ in Proc. Int. Conf. Adv. Intell. Syst. Inform., 2016, pp. 53–63.
[191] E. M. Bahgat, S. Rady, and W. Gad, ‘‘An E-mail filtering approach using classification techniques,’’ in Proc. 1st Int. Conf. Adv. Intell. Syst. Inform. (AISI), Beni Suef, Egypt, Oct. 2015, pp. 321–331.
[192] M. K. Anjali and A. P. Babu, ‘‘Ambiguities in natural language processing,’’ Int. J. Innov. Res. Comput. Commun. Eng., vol. 2, no. 5, pp. 392–394, 2014.
[193] T. A. Almeida, T. P. Silva, I. Santos, and J. M. G. Hidalgo, ‘‘Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering,’’ Knowl.-Based Syst., vol. 108, pp. 25–32, Sep. 2016.
[194] J. R. Méndez, T. R. Cotos-Yañez, and D. Ruano-Ordás, ‘‘A new semantic-based feature selection method for spam filtering,’’ Appl. Soft Comput., vol. 76, pp. 89–104, Mar. 2019.
[195] M. A. Syakur, B. K. Khotimah, E. M. S. Rochman, and B. D. Satoto, ‘‘Integration K-means clustering method and elbow method for identification of the best customer profile cluster,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 336, no. 1, 2018, Art. no. 012017.
[196] M. Basavaraju and D. R. Prabhakar, ‘‘A novel method of spam mail detection using text based clustering approach,’’ Int. J. Comput. Appl., vol. 5, no. 4, pp. 15–25, Aug. 2010.
[197] C. Laorden, X. Ugarte-Pedrero, I. Santos, B. Sanz, J. Nieves, and P. G. Bringas, ‘‘Study on the effectiveness of anomaly detection for spam filtering,’’ Inf. Sci., vol. 277, pp. 421–444, Sep. 2014.
[198] U. Kutbay, ‘‘Partitional clustering,’’ Recent Applications in Data Clustering, 2018, doi: 10.5772/intechopen.75836.
[199] R. D. Kortum, ‘‘Hyperonyms and hyponyms,’’ in Varieties of Tone. New York, NY, USA: Palgrave Macmillan, 2013, pp. 178–180, doi: 10.1057/9781137263544_23.
[200] M. Alazab, R. Layton, R. Broadhurst, and B. Bouhours, ‘‘Malicious spam Emails developments and authorship attribution,’’ in Proc. 4th Cybercrime Trustworthy Comput. Workshop. Bundoora, Australia: La Trobe University, Nov. 2013, pp. 58–68.
[201] S. Halder, R. Tiwari, and A. Sprague, ‘‘Information extraction from spam emails using stylistic and semantic features to identify spammers,’’ in Proc. IEEE Int. Conf. Inf. Reuse Integr., Aug. 2011, pp. 104–107.
[202] K. Xu, ‘‘Expectation-maximization algorithm,’’ in Encyclopedia of Systems Biology. New York, NY, USA: Springer-Verlag, 2013, doi: 10.1007/978-1-4419-9863-7_449.
[203] J. M. Hancock, ‘‘Self-organizing map (Kohonen Map, SOM),’’ in Dictionary of Bioinformatics and Computational Biology. Hoboken, NJ, USA: Wiley, 2004, doi: 10.1002/0471650129.dob0661.
[204] D. I. Kumar and M. R. Kounte, ‘‘Comparative study of self-organizing map and deep self-organizing map using MATLAB,’’ in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Apr. 2016, pp. 1020–1023.
[205] A. Azab, R. Layton, M. Alazab, and J. Oliver, ‘‘Mining malware to detect variants,’’ in Proc. 5th Cybercrime Trustworthy Comput. Conf., Nov. 2014, pp. 44–53.
[206] S. Porras, B. Baruque, B. Vaquerizo, and E. Corchado, ‘‘Clustering ensemble for spam filtering,’’ in Hybrid Artificial Intelligent Systems (Lecture Notes in Computer Science). Springer, 2011, pp. 363–372.
[207] Y. Cabrera-León, P. G. Báez, and C. P. Suárez-Araujo, ‘‘Self-organizing Maps in the design of anti-spam filters a proposal based on thematic categories,’’ in Proc. 8th Int. Joint Conf. Comput. Intell., 2016, pp. 1–12.
[208] X. Kong, C. Hu, and Z. Duan, ‘‘Generalized principal component analysis,’’ in Principal Component Analysis Networks and Algorithms. Singapore: Springer, 2017, pp. 185–233.
[209] I. T. Jolliffe, ‘‘Principal component analysis,’’ in Springer Series in Statistics, vol. 28, 2nd ed. New York, NY, USA: Springer, 2002, p. 487.


[210] I. Dagher and R. Antoun, ‘‘Ham-spam filtering using different PCA scenarios,’’ in Proc. IEEE Intl Conf. Comput. Sci. Eng. (CSE) IEEE Intl Conf. Embedded Ubiquitous Comput. (EUC) 15th Intl Symp. Distrib. Comput. Appl. Bus. Eng. (DCABES), Aug. 2016, pp. 542–545.
[211] J. C. Gomez, E. Boiy, and M.-F. Moens, ‘‘Highly discriminative statistical features for email classification,’’ Knowl. Inf. Syst., vol. 31, no. 1, pp. 23–53, 2012.
[212] R. Silva, M. A. Gonçalves, and A. Veloso, ‘‘Rule-based active sampling for learning to rank,’’ in Proc. ECML PKDD, 2011, pp. 240–255.
[213] D. Hassan, ‘‘Investigating the effect of combining text clustering with classification on improving spam email detection,’’ in Intelligent Systems Design and Applications (Advances in Intelligent Systems and Computing). Cham, Switzerland: Springer, 2017, pp. 99–107, doi: 10.1007/978-3-319-53480-0_10.
[214] H. Padhiyar and P. Rekh, ‘‘An improved expectation maximization based semi-supervised email classification using Naive Bayes and k-nearest neighbor,’’ Int. J. Comput. Appl., vol. 101, no. 6, pp. 7–11, Jan. 2014.
[215] A. Chakrabarty and S. Roy, ‘‘An optimized k-NN classifier based on minimum spanning tree for email filtering,’’ in Proc. 2nd Int. Conf. Bus. Inf. Manage. (ICBIM), Jan. 2014, pp. 47–52.
[216] D. Debarr and H. Wechsler, ‘‘Spam detection using Random Boost,’’ Pattern Recognit. Lett., vol. 33, no. 10, pp. 1237–1244, Jul. 2012.
[217] M. S. Junayed, A. A. Jeny, S. T. Atik, N. Neehal, A. Karim, S. Azam, and B. Shanmugam, ‘‘AcneNet—A deep CNN based classification approach for acne classes,’’ in Proc. 12th Int. Conf. Inf. Commun. Technol. Syst. (ICTS), Jul. 2019, pp. 203–208.

ASIF KARIM is currently a Ph.D. Researcher with Charles Darwin University, Australia. His research interests include machine intelligence and cryptographic communication. He is currently working toward the development of a robust and advanced email filtering system, primarily using machine learning algorithms. He has considerable industry experience in IT, primarily in the field of software engineering.

SAMI AZAM is currently a leading Researcher and a Lecturer with the College of Engineering and IT, Charles Darwin University, Australia. He is actively involved in the research fields relating to computer vision, signal processing, artificial intelligence, and biomedical engineering. He has a number of publications in peer-reviewed journals and international conference proceedings.

BHARANIDHARAN SHANMUGAM is currently a research-intensive Lecturer with the College of Engineering and IT, Charles Darwin University, Australia. He has a large number of publications in several different journals and conference proceedings. His research interest mainly revolves around the field of cybersecurity.

KRISHNAN KANNOORPATTI is currently a Research Active Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. In addition to being a stellar academic and innovative researcher, he also has extensive experience working with government bodies in setting up data privacy policies at the national and state level.

MAMOUN ALAZAB is currently an Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. He is also a cyber-security researcher and practitioner with industry and academic experience. His research is multidisciplinary and focuses on cyber security and digital forensics of computer systems, including current and emerging issues in the cyber environment like cyber-physical systems and the Internet of Things (IoT), by taking into consideration the unique challenges present in these environments, with a focus on cybercrime detection and prevention.

TABLE 2. Email data parts explanation.

FIGURE 1. Email data parts.

TABLE 4. A summary of some of the above-discussed spam detection techniques.

TABLE 5. Definitions of some performance measuring terms.

TABLE 6. (Continued.) A summary of some useful machine learning

FIGURE 5. Porter’s Stemmer.

TABLE 7. A summary of machine learning based techniques.

FIGURE 6. Number of non-automated spam detection studies in

FIGURE 7. Proportion of types of frameworks.

FIGURE 9. Scatterplot for accuracy of semi-supervised methods (sd:

FIGURE 10. Scatterplot for accuracy of unsupervised methods (sd: ≈

FIGURE 11. Showing the total number of studies in which each of

SAMI AZAM is currently a leading Researcher MAMOUN ALAZAB is currently an Associate
