A Comprehensive Survey For Intelligent Spam Email Detection
December 4, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2954791
ABSTRACT The tremendously growing problem of phishing e-mail, also known as spam, including spear phishing and spam-borne malware, has created a need for reliable intelligent anti-spam e-mail filters. This survey paper presents a focused literature survey of Artificial Intelligence (AI) and Machine Learning (ML) methods for intelligent spam email detection, which we believe can help in developing appropriate countermeasures. In this paper, we considered 4 parts in the email's structure that can be used for intelligent analysis: (A) Headers, which provide routing information and contain mail transfer agent (MTA) records giving the email and IP address of each sender and recipient, that is, where the email originated, what stopovers it made, and its final destination. (B) The SMTP Envelope, containing the mail exchangers' identification and the originating source and destination domains/users. (C) First part of SMTP Data, containing information like from, to, date, and subject, appearing in most email clients. (D) Second part of SMTP Data, containing the email body including text content and attachments. Based on the relevance of each emerging intelligent method, papers representing each method were identified, read, and summarized. Insightful findings, challenges and research problems are disclosed in this paper. This comprehensive survey paves the way for future research endeavors addressing theoretical and empirical aspects related to intelligent spam email detection.
INDEX TERMS Machine learning, phishing attack, spear phishing, spam detection, spam email, spam
filtering.
I. INTRODUCTION
Email spamming refers to the act of distributing unsolicited messages, optionally sent in bulk, using email; whereas emails of the opposite nature are known as ham, or useful emails [1]. The word "spam" came into existence from "Shoulder Pork HAM", a canned precooked meat marketed in 1937, and with the passage of time digital junk mail has taken over the word [2].
Spam emails are propagated by spammers for purposes ranging from simple marketing to more malicious activities such as financial disruption and reputational damage, on both the personal and the institutional front. The practice of spamming is now spreading rapidly in other digital communication channels as well.
Financial motivation is one of the primary reasons for the spammers, and it has been estimated that spammers earn around USD 3.5 million from spam every year [3].

A. RELEVANT SPAM EMAIL STATISTICS
In the following subsections, we will highlight some current worldwide statistical observations. Besides, some country-specific metrics will also be discussed.
The statistics relating to the adoption of email as a means of communication are quite staggering. As of 2017, there were nearly 5.5 billion email accounts actively in use [4]; this number is projected to grow over 5.5 billion in 2019 [5], and nearly one third of the population is estimated to use email by the dawn of 2019 [5]. As of 2018, approximately 236 billion emails are exchanged daily [6], of which around 53.5% are spam [4]. In fact, 2018 saw an average of 14.5 billion spam emails daily [3].
The FBI recently reported a loss of USD 12.5 billion to business email consumers in 2018 incurred by spam emails [7]. The financial loss incurred by businesses due to this spamming attack may skyrocket in a few years' time, hitting an accumulated figure of around USD 257 billion from 2012 to mid-2020 [3]. The estimated yearly damage will be around USD 20.5 billion [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Yi Zhang.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
A. Karim et al.: Comprehensive Survey for Intelligent Spam Email Detection
TABLE 1. Financial loss incurred in Australian markets due to digital scams.

The United States has traditionally been the largest source of spam; however, in recent times this is no longer the case. Though there was legislation such as CAN-SPAM (Controlling the Assault of Non-Solicited Pornography and Marketing Act) to protect users, it did not have the expected deterrent effect on spammers [8]. The USA houses 70% of the world's top spam gangs, which are responsible for coordinated worldwide spamming [3].
Scamwatch reports [9] portray grim figures for the financial losses of Australian consumers due to varieties of scam types, primarily carried out through phones and emails, in the last three years, as shown in Table 1.
As shown in Table 1, the trend is heading upwards each year for digital theft, and email spam will only rise due to the increasing adoption of this medium, as indicated by the statistics above. Investment scams offer fraudulent but promising business opportunities in exchange for a significant amount of money, while dating scams victimize individuals looking for romantic partners in digital spaces. When it comes to delivering malware to propagate such scams, email is still the primary choice for scammers. Recent reports indicate that Australian businesses and consumers already lost nearly AUD 56,000 due to email fraud within the first two and a half months of 2019 [10].
As of April 2019, Brazil and Russia have overtaken the USA and China (another substantial spam-originating country), producing approximately 16% and 14% of the total volume of worldwide spam, respectively [11].

B. RESEARCH MOTIVATION
The motivation behind this research initiative is to address a gap that has arisen over time in the field of spam email detection. Current solutions mostly lag behind the innovations that spammers are constantly bringing in, which strongly justifies the emergence of machine learning based anti-spam propositions. This review critically evaluates a number of such reasonably recent solutions and provides insights into ways in which further improvement can be obtained. The paper also discusses a number of existing non-machine learning based frameworks to highlight the loopholes and the current state of affairs; this also shows why machine learning based automated procedures should be the approach of newly developed systems.
There are some other good review papers available on the topic that discuss anti-spam frameworks in general, but as the field is expanding fast with many novel and automated ideas for spam email detection, we deemed it necessary to orchestrate a comprehensive review paper that analyzes state-of-the-art developments, as no other contemporary paper strictly focuses on current and recent trends and solutions geared specifically towards phishing spam email using machine learning algorithms. This paper also provides detailed analytical insights based on the reviews. The insights clearly identify multiple gaps that can be addressed using machine learning principles, showing the general future direction of research in this domain.

C. SCOPE OF SPAM EMAILS ANALYSED
Dissecting and critically analyzing scholarly research work on all types of spam email is itself a mammoth task and often impossible in a single survey attempt. Bearing that in mind, this paper primarily focuses on the intelligent and automated solutions devised against malicious spam emails, particularly the following:
1) Containing malicious links
2) Containing malicious attachment
3) Phishing attempts
4) Phishing and Spoofing campaigns
This survey excludes studies that address only marketing email spam.

D. RESEARCH METHODOLOGY
The papers for evaluation have been selected based on the objective of this research attempt. We looked into several papers selected based on the listed index terms and thoroughly analyzed each presented method: whether it has effectively used machine learning principles; how robust and impactful the proposed solution really is; and finally the degree of modification required to address the drawback(s) the solution may exhibit. Only works showing significant impact and intelligent automation have been selected, as those were deemed promising for further research. The other two sections, dealing with a small number of static and bio-inspired techniques, have mainly been added to highlight the current state of email spam frameworks and the diversity in research directions.

E. STRUCTURE OF THE PAPER
This survey paper has been structured so that the necessary background for the studies analysed is addressed first. Section II details the parts of an email and how spammers take advantage of these various parts to craft varieties of spam attacks on users, that is, the types of email spam. Though this review paper intends to evaluate machine learning based solutions aimed primarily at phishing and spoofing attacks, Section III will discuss a number of general purpose non-AI based spam detection systems and frameworks that do not rely on
and Stuxnet [22]. Just in 2018, sensitive financial information of employees of the ABC Bus Company (USA) had been compromised due to phishing email scams [23].

3) WHALING
A variation of phishing attacks; in this form, the attack is often directed towards high level officials of a company. In the case of whaling or 'CEO Fraud attack', the impersonating web page/email will adopt a more serious executive-level form. Because it also works in data part D, the content is crafted to target an upper manager or a senior executive who has some high level clearance inside the organization, and it almost always concerns either an urgent executive issue affecting the whole company or a customer complaint. The emails will be sourced from a fake origin, disguised as a legitimate business establishment (same or other company) or even the CEO of the host company itself [24]. The risks and dangers are similar to those of the other forms of phishing and spoofing.
Europe's principal electrical cable and wire manufacturer, Leoni AG, lost a massive €40 million due to a sophisticated corporate email scam using a combination of spear phishing and whaling attacks [25]. In a recent whaling attack in 2018, French cinema chain 'Pathé' lost over USD 21 million [26].

• Find P and Q, two large (e.g., 1024-bit) prime numbers.
• Choose E such that E > 1, E < PQ, and E and (P-1)(Q-1) are relatively prime, meaning they have no prime factors in common. E does not have to be prime, but it must be odd. (P-1)(Q-1) cannot be prime because it is an even number.
• Compute D such that DE ≡ 1 (mod (P-1)(Q-1)).
• The encryption function is C = (T^E) mod PQ, where C is the ciphertext (a positive integer) and T is the plaintext (a positive integer). The message being encrypted, T, must be less than the modulus, PQ.
• The decryption function is T = (C^D) mod PQ, where C is the ciphertext (a positive integer) and T is the plaintext (a positive integer).
The public key is the pair (PQ, E) while D is the private key. A major advantage of this cryptography is that one can publish one's public key freely, because there are no known easy methods of calculating D, P, or Q given only the public key (PQ, E). Besides, popular Email Service Providers (ESP) like Gmail now provide an end-to-end encryption facility through S/MIME (Secure/Multipurpose Internet Mail Extensions), which itself is based on public key encryption. However, this type of cryptography is often slower than other available methods.
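The key generation, encryption, and decryption steps listed above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions rather than a usable cryptosystem: the primes 61 and 53, the exponent 17, and the function names make_keys, encrypt, and decrypt are illustrative choices that do not come from the survey, and real deployments use far larger primes together with a padding scheme.

```python
from math import gcd


def make_keys(p: int, q: int, e: int):
    """Build a toy key pair from primes P and Q and a public exponent E."""
    n = p * q                    # modulus PQ
    phi = (p - 1) * (q - 1)      # (P-1)(Q-1)
    # E must satisfy 1 < E < PQ and share no prime factor with (P-1)(Q-1).
    if not (1 < e < n) or gcd(e, phi) != 1:
        raise ValueError("E must satisfy 1 < E < PQ and gcd(E, (P-1)(Q-1)) == 1")
    d = pow(e, -1, phi)          # D with D*E = 1 (mod (P-1)(Q-1)); needs Python 3.8+
    return (n, e), d


def encrypt(t: int, public_key) -> int:
    """C = (T^E) mod PQ; the plaintext T must be smaller than the modulus."""
    n, e = public_key
    if not 0 < t < n:
        raise ValueError("plaintext must be a positive integer below PQ")
    return pow(t, e, n)


def decrypt(c: int, d: int, n: int) -> int:
    """T = (C^D) mod PQ."""
    return pow(c, d, n)


# Illustrative small primes only (assumption): P = 61, Q = 53, E = 17.
public_key, private_d = make_keys(61, 53, 17)
ciphertext = encrypt(65, public_key)
recovered = decrypt(ciphertext, private_d, public_key[0])
print(public_key, private_d, ciphertext, recovered)
```

Only the pair (PQ, E) would be published; the sketch keeps D in a separate variable to mirror that split between public and private key.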
secure and robust as the preferred choice for mail transfer protocol. However, such steps are not always practicable for multiple reasons, as discussed later. This section will highlight a few such commendable research undertakings, followed by a discussion of the hindering bottlenecks.
A pre-acceptance test of emails has been discussed by Esquivel et al. [40], which works by analysing the features of individual SMTP transactions such as EHLO/HELO message sequences. These can be further divided into different categories based on the working mechanism, such as 'Protocol Defects'. Protocol Defect can detect any extra suspicious data blocks in the input buffer before the EHLO/HELO message transaction takes place.
The work done by Bajaj et al. [41] suggests that filters in spam blocking network servers should use the facility of detecting suspicious behaviour patterns of VoIP spam callers, which can be built into the signalling protocols used in VoIP, such as SIP (Session Initiation Protocol). SIP is an Application Layer protocol heavily used to create, modify, and terminate a multimedia session (streaming videos, online games, instant messaging etc.) over the Internet Protocol [42]. SIP can also use Message Digest (MD5) authentication for security purposes. To detect suspicious behaviour, SIP can inherently apply automated frameworks that analyze the message to determine whether it is syntactically wrong, has no apparent meaning, is hard to interpret, or may lead to a deadlock [43].
The studies discussed above are fine examples of impressive steps in the direction of fortifying the SMTP framework at the root level. On the other hand, the very popularity and early adoption of SMTP as the de facto protocol of choice for email communication has established some strict deterrents. For instance, a slight modification to the protocol may introduce a wave of changes to other intertwined enabling services needed for successful mail delivery, both in regard to efficiency and usefulness [44]. Thus such structural modifications, required at the core infrastructure of email communication, will surely introduce operational complexities. This constraint of SMTP has long been an issue, and hackers and spammers have exploited the drawbacks from a very early stage [44].

C. COLLABORATIVE MODELS
Under collaborative spam filtering modeling strategies, each message is delivered to a number of recipients. A specific message will most certainly be received and judged by another user. Collaborative models capture, record, and query these early judgments. Over time these collections become significant enough to stamp a verdict on a certain email. A number of techniques are in circulation to achieve the various steps of a successful collaborative framework [46].

1) CRYPTOGRAPHIC HASHING
A highly successful method in the earlier days of email communication, where large email vendors (Hotmail, Yahoo!, AOL etc.) mathematically calculated an alphanumeric string of 32 to 128 characters, known as the signature of the email (the hash value), and stored it in a database [47]. The vendors work on the idea that spammers will send out a burst of spam emails to achieve their target and that some of these spam emails will reach their honeypot accounts, that is, accounts that had been set up specifically to catch spam emails. These vendors also rely on the fact that the generated signature for spam will be largely different from that of non-spam emails [48]. Therefore, as soon as the signature pattern matches that of the spam pattern, it is added to a database of spam signatures, and as other emails arrive at any of the other customer accounts, those are instantly discarded if found to be spam, by matching their signature (calculated using the exact same method, from header and body, thus the method works in both part C and D [Fig. 1]) to the one stored in the database, provided the record is found. Vendors supply the database to other Email Service Providers (ESP), and thus once an email is identified as spam, it is updated in several of these databases positioned throughout the globe.
Message Digest 5 (MD5) was one of the popular choices for cryptographic hashing. It is a cryptographic algorithm that accepts an input of any length and generates a message digest that is 128 bits long, often known as the "fingerprint" or "hash" of the input. MD5 is quite useful when a potentially long message needs to be processed and/or compared quickly.
The problem with this technique is that spammers have already succeeded, on a rather constant basis, in devising tools that can actually break the hashing algorithms. Further, the database update has been a lingering bottleneck for quite some time now, as it is updated automatically but with a delay, and that window is enough for a lot of fraudsters [50]. Furthermore, if the database itself is hacked, then it is curtains for the ESPs. Thus the technique has seen eroded accuracy over the years [51]. SHA-3 is a more recent development intended to replace MD5 and its close variations altogether.

2) FUZZY HASHING
As illustrated by Chen et al. [54], researchers have used hashing principles (such as Fuzzy Hashing) to detect spam campaigns by clustering emails on the basis of similar goals.
Fuzzy Hashing can effectively be used to measure the resemblance of two sequences of characters by calculating similarity scores on the spam messages. Fuzzy hashing relies on both a 'Traditional Hash' function (for instance, MD5) and a 'Rolling Hash' function. The Rolling Hash value of a string M = m_0 ... m_{n-1} can be obtained from (1), where k is the modulus and 0 ≤ a < k is the base. Both k and the base a are usually prime numbers and do not have any prime factors in common.

$$h(M) = \left( \sum_{i=0}^{n-1} a^{\,n-1-i}\, m_i \right) \bmod k \qquad (1)$$
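Equation (1) can be made concrete with a short Python sketch. The base a = 257 and modulus k = 1 000 000 007 are illustrative prime values assumed here, not parameters reported by the surveyed work, and the function names rolling_hash and slide are hypothetical. The second function shows why such a hash is called "rolling": the value for a shifted window can be derived from the previous value in constant time.

```python
def rolling_hash(message: bytes, a: int = 257, k: int = 1_000_000_007) -> int:
    """h(M) = (sum over i of a^(n-1-i) * m_i) mod k, evaluated by Horner's rule."""
    h = 0
    for m_i in message:
        h = (h * a + m_i) % k
    return h


def slide(h: int, old: int, new: int, n: int,
          a: int = 257, k: int = 1_000_000_007) -> int:
    """Update the hash of an n-byte window that drops `old` and appends `new`."""
    # Remove the contribution of the leading byte (weight a^(n-1)), shift the
    # remaining terms up by one power of a, then add the incoming byte.
    return ((h - old * pow(a, n - 1, k)) * a + new) % k


data = b"cheap meds now"
window = 5
h = rolling_hash(data[:window])
for start in range(1, len(data) - window + 1):
    # Each step reuses the previous hash instead of rehashing the whole window.
    h = slide(h, data[start - 1], data[start + window - 1], window)
print(h == rolling_hash(data[-window:]))
```

In a fuzzy hashing scheme, a cheap hash of this kind is typically what decides where the input is cut into pieces, while the traditional hash is applied to each piece.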
In the work of Chen et al. [54], the Rolling hash simply divides the input into arbitrary sized pieces. These pieces are then hashed with the traditional hash function. The concatenation of the hash values obtained after hashing all the pieces forms the 'Fuzzy Hash Value' of the given content. Hash values are often considerably more compact than the original string of characters, as they produce fixed length output irrespective of the length of the input [55]. In this way, contents that are not exactly identical, but differ slightly in some way, can still be grouped under the same hood.
The research [54] takes the view that emails from the same campaign will have a higher similarity score among each other, while the scores will be far apart among emails from different spam campaigns. It has also been shown that emails from the same campaign have a similar sort of URL or email address. A shortcoming of the work is that it does not address concerns regarding 'Asymmetric Distance Computation' [56], where the cluster distance score may become non-deterministic if the order of input changes [57].
MD5 has several drawbacks; in particular, a slight change in input dramatically alters the corresponding hash value. This is not always desired, as the message content of multiple spam emails often varies slightly, yet those emails are still considered spam and are often from the same campaign. For this reason, the application of a locality-sensitive hashing algorithm known as 'Nilsimsa' [49] has grown considerably for hashing purposes. It generates a score from 0 (dissimilar objects) to 128 (very similar or identical objects). Nilsimsa uses a 5-byte fixed-size sliding window that analyses the input byte by byte before generating trigrams (groups of three consecutive characters) of probable combinations of the input characters. The trigrams map into a 256-bit array to produce the desired hash [49]. Another supervised k-NN based close variation of Nilsimsa, known as TLSH (Trend Micro Locality Sensitive Hash), has garnered much attention these days [205]. It provides a similarity score between 0 and 1000+, where any score <=100 identifies the two entities as similar to each other, mostly originating from the same source [205]. Projects such as SSDEEP, a Context Triggered Piecewise Hashing program, are also an important addition to this field [205].

3) DISTRIBUTED CHECKSUM CLEARINGHOUSE (DCC)
Distributed Checksum Clearinghouse (DCC), another hash sharing framework against spam emails, works by counting how many times a specific message has been reported as spam. It takes a checksum of the message body and stores it in a clearinghouse or server [52]. Thus with every additional report to the clearinghouse of a message being spam, the checksum count increases by 1. Bulk mail can in this way be identified confidently, because the response number and checksum count are usually a lot higher. The checksums are fuzzy in nature, and oftentimes multiple DCC servers participate in the checksum exchange process [53]. It operates in Parts A, B and D [Fig. 1].

4) GREYLISTING
Greylisting takes the view that a legitimate sender will resend the email if the initial attempt is unsuccessful, while spammers will just move on to the next recipient and will not bother to check whether the email has been delivered. However, this approach can simply be bypassed by resending the spam email [41].

5) DNS BLACKLISTING AND WHITELISTING
DNS (Domain Name System) Blacklisting is carried out in two different flavours. The first involves maintaining a list of mail-server IPs identified as spam originators or propagators in a centralized database [44], [48], [58]. The other way is to mark spam based on Uniform Resource Identifiers (URIs), usually domain names or websites; the blacklists then consist of such malicious URIs [59]. Access to these blacklists or databases of known spamming IPs or domains can then be granted by the administrator either for free or for a price. The email server using this service will execute an additional DNS query on the host that is sending the message to determine the source status; in this case the queried DNS server will be the one provided by the DNS Blacklisting service. However, all such blacklists suffer from the inability to detect malicious phishing URLs early, in the wake of the attack, because their database update process is not fast enough [60].
The issue with blacklists is that spammers can frequently alter the source address [58]. The source address itself can also be spoofed, as mentioned earlier. In the case of blacklists composed of URIs, spammers continuously set up cheap new domains before starting a fresh cycle of mass spamming, leaving the blacklists very little time to react. Additionally, the list is often quite slow to be updated and thus rather ineffectual against phishing email threats that bank on user visits to short-lived phishing websites.
Whitelisting is the practice of maintaining a list of mail-servers that are only administered by confirmed legitimate administrators, or of accepting content only from bona fide users. Different organizations have such whitelists of their own to make things easier for the customers. Blacklisting and Whitelisting operate in both part A and B [Fig. 1].
When it comes to Whitelisting and Blacklisting of spamming sources, the Spamhaus Project, initiated in 1998, has become a workhorse in this arena [61]. A number of ISPs and email servers use the lists to reduce the amount of spam that reaches their users. Spamhaus also provides information on certain domains and main servers that intentionally provide a Spam Support Service for Profit. The Project currently has over 600 million subscribers [61].

6) SOCIAL TRUST BASED SOLUTIONS
'Social Trust' is a layer that is capable of providing a measure of the system's belief that a host is distributing spam emails. Other nodes, that do not have spam identification processes installed, can actually receive the
TABLE 3. Few Regex rules for spam filtering.

matching [67]. For instance the pattern '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b' can be used to look for an email address in a set of texts. Programmers can also use this pattern to check the validity of an entered email address, regardless of programming language. Regex is incredibly useful in finding strings of almost any pattern.
Heuristic systems are fast and easy to install, but in case the scammers are able to get hold of the ruleset, they can very easily craft messages to avoid the filtering system [68]. Regex based methods work on part A, C and D [Fig. 1].
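As a small illustration of the pattern quoted above, the following Python sketch applies it with the standard re module, both for searching text and for validating a single candidate address. Two points are assumptions made here rather than statements from the survey: the case-insensitive flag (the quoted pattern lists only upper-case ranges) and the helper names find_addresses and looks_like_address.

```python
import re

# The pattern quoted in the text; IGNORECASE is an assumption so that
# lower-case addresses match the A-Z character ranges.
EMAIL_PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b",
    re.IGNORECASE,
)


def find_addresses(text: str) -> list:
    """Return every substring of `text` that matches the email pattern."""
    return EMAIL_PATTERN.findall(text)


def looks_like_address(candidate: str) -> bool:
    """Check whether the whole candidate string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


sample = "Contact billing@example.com today or reply to no-reply@mail.example.org."
print(find_addresses(sample))
print(looks_like_address("user@example.com"), looks_like_address("user@example"))
```

A rule of this kind is purely syntactic: it says nothing about whether the address exists or who controls it, which is one reason heuristic rules are usually combined with the other checks described in this section.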
(HVS), due to usage of botnets. Examining the working mechanism of botnets, Wanrooij and Pras [77] have proposed an assumption, termed 'Bad Neighbourhood'; they also suggested filtering using core attributes such as IP addresses and any machine-readable hyperlinks in the email itself. The performance has shown superiority over traditional frameworks such as SpamAssassin, mainly because of lower execution complexity and a very low false positive rate [77]. However, the work needs to be tested longer. The authors have faced issues like URI (Uniform Resource Identifier; URIs and URLs are often used interchangeably) Blacklisting bottlenecks; such occurrences also require addressing.

2) DNS LOOKUP SYSTEMS
Although such a process is not fully guaranteed to keep the junk out, it can often be a strong indication that further checking is needed. It works by looking up whether a record (the "A Record") exists for the domain name from which the email claims to have originated (the part after '@'). If it does not, then there is reason to doubt the validity of the email, as such spamming domains are often short-lived [79]. However, it suffers from the fact that the FROM field of the header can also be spoofed, as discussed before. Thus the scammer can simply put in a closely related valid domain. The method is mostly applicable to part C [Fig. 1].

G. OTHER FRAMEWORKS
There are a few other propositions available which are either not so commonly implemented or in an emerging state.

1) COUNTRY BASED FILTERING
Certain email servers entirely block email streams from certain countries, as certain geographical regions are often a mass source of spam [76]. This technique may have a high false positive rate of detection: even though spammers in certain countries are probably more active than in others, lots of benign and legitimate emails also originate from those nations. The method is mostly applicable to part A [Fig. 1].

2) PEER TO PEER INFRASTRUCTURE
Bradbury [80] sheds light on a different approach known as 'Bitmessaging', based on a proof-of-work concept [81] similar to the one used in Bitcoin transactions. The framework relies on the BitMessage peer-to-peer communication protocol and uses a completely decentralized and encrypted network. On the other hand, one usability drawback is the nature of BitMessage addresses, which are rather complicated and unintuitive long alphanumeric strings that are difficult for a normal user to deal with. The system also has some scalability concerns, as it is not yet fully compatible with the existing email infrastructure. The write-up also discusses a fundamentally different email system known as Dark Mail [80].

H. COMBINING EXISTING TECHNIQUES
There are open-source and commercial products available for spam detection that decide whether an email is spam based on the filtration results obtained via multiple filters. SpamAssassin and Zerospam are two of the most used such products.

1) SPAMASSASSIN
SpamAssassin is a free and open-source anti-spam product that has garnered several positive reviews over the years for its effectiveness and simplicity of installation. The product uses a number of the techniques discussed above for filtration purposes, such as DNS-based blacklists and DNS-based whitelists, heuristic based checks, fuzzy-checksum-based spam detection, SPF, DKIM and Bayesian filtering [82].

2) ZEROSPAM
Zerospam is a widely used commercial software that has also gained some ground in effective spam detection [83]. It also uses a number of existing techniques such as IP address and domain checks, attachment and URL scanning, heuristic based filtering and Bayesian filtering [84].
The performance of these software products has demonstrated regular fluctuations. A common problem with a number of commercial and open-source solutions is the lack of detectability in the case of some forms of phishing and word obfuscation. Besides, difficulty in implementation and usability complications are oftentimes observed.
That being said, the targeted scope of this paper prohibits a detailed discussion of several other commercial and open-source software products that work at different capacities by combining several of the existing techniques discussed above.
Fig. 3 illustrates the overall interconnection between email data parts and the different spam detection techniques discussed so far, while Table 4 tabulates a summarized view of the majority of the above-discussed research works and the reported results from these studies. The table also highlights the key points and shortcomings of some of the established spam detection techniques. The overall table links these different techniques and research initiatives to the anatomy of spam [Fig. 1] to better underscore where in the structure of an email these frameworks belong. The colored lines in Fig. 3 link the respective method to the corresponding email data part, whereas the dotted lines signify the category under which the respective techniques lie.

IV. ARTIFICIAL INTELLIGENCE BASED DETECTION AND CLASSIFICATION TECHNIQUES
The endeavour for successful detection and classification of spam emails has been going on for quite some time now. Over time a number of successful methods had been devised, but with time many of these could not face the witty changes the spammers bring into their crafts on a
FIGURE 3. The interconnection between email data parts and different spam detection techniques.
A. SYSTEMS BASED ON BIO-INSPIRED INTELLIGENCE
These are computational algorithms motivated by inherent behaviours and mechanisms often observed within the various natural living beings [85]. This is an emerging field of study, and consequently a number of different algorithms are coming up with time. The following section illustrates some of the
spam detection systems based upon such evolutionary and biology-based computational algorithms.

1) GENETIC ALGORITHM BASED SYSTEMS
Ruano-Ordás et al. [86] argued that automatically generated regular expressions (regex) can be a
significantly strong method for identifying messages that have been obfuscated by spammers. The
work compactly illustrates groups of sentences from compromised emails that follow a suspicious
pattern. Thus the idea can be deployed as a local content-based filtering system. It can also be
shared in a P2P network for a collaborative approach to combating spam. The paper takes the view
that Bio-inspired Evolutionary Algorithms, such as Genetic Programming, should be used to generate
the regular expressions. Genetic Programming is based upon a subset of Evolutionary Algorithms
known as the Genetic Algorithm, a search heuristic based upon the ‘Theory of Natural Evolution’
[87]. The work also presented reasonably effective software, developed by taking the drawbacks and
limitations of some other contemporary similar systems into consideration; the system has been
termed ‘DiscoverRegex’. A key improvement over the research of Conrad [88], claimed to have been
achieved by the work, is the ‘Fitness Function’, an essential segment of any Genetic Algorithm
based solution. The proposed DiscoverRegex uses (4) for the Fitness Function.

fitness(i) = matches(i, spam) \times \left( \frac{1}{length(i) + 1} + 1 \right)    (4)

Here matches(i, spam) denotes the number of spam messages that match regular expression i, and
length(i) represents the size of the generated pattern.
The results show improvement over other software packages. However, the work only described
generating regular expressions from the content of the spam subject header but not the body of the
spam. This enhancement needs to be incorporated to make the system complete.

2) NSA AND PSO BASED SYSTEMS
Idris et al. [89] and Idris et al. [90] discussed propositions where other bio-inspired
algorithms, such as the ‘Negative Selection Algorithm (NSA)’ [91], improved and reinforced with
the addition of ‘Particle Swarm Optimization (PSO)’ and ‘Differential Evolution (DE)’, have been
put into action. NSA is inspired by the self-nonself discrimination behaviour commonly observed in
the mammalian acquired immune system [92]. PSO has been developed based on the social foraging
behaviour observed in some animals, such as the schooling behaviour of fish and the flocking
behaviour of birds [93]. Differential Evolution is a metaheuristic that attempts to gradually
optimize a given problem by multiple iterative passes over a candidate solution with regard to a
given measure of quality. It can work with very high dimensional datasets, without always
guaranteeing an optimum solution [93]. The combined approach discussed in this work shows better
performance than a standalone NSA based system. Idris et al. [89] achieved an accuracy increase of
around 7%-9% over standalone NSA, especially with more than 1000 detectors.
However, both of these studies [89], [90] do not seem to address the behaviour of the proposed
model with regard to the gaps in the understanding of a few issues with the Particle Swarm
Optimization algorithm, such as getting trapped in local minima and Heterogeneity [94].
This problem regarding local minima may also crop up in a number of traditional nonlinear
optimization algorithms. In this case the function (typically a ‘cost function’ in Machine
Learning) produces a greater value at every other point in a neighbourhood around that local
minimum than at the local minimum itself. On the contrary, the global minimum of a function
results in the minimization of the function on its entire domain, and not just on a neighbourhood
of the minimum [95]. The ideal result of the function should be the global minimum, or at least
quite close to it. There is always one global minimum, but there can be multiple local minima.

3) OTHER RELATED SYSTEMS
An impressive number of other experiments done with biologically inspired algorithms indeed
delivered some more interesting outcomes besides the above discussed
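The Fitness Function in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DiscoverRegex implementation: the fraction in (4) is assumed to have numerator 1, length(i) is taken to be the character count of the pattern, and the names `fitness` and `spam_subjects` as well as the toy subject lines are hypothetical.

```python
import re
from typing import Sequence


def fitness(pattern: str, spam_messages: Sequence[str]) -> float:
    """Score a candidate regex following the shape of Equation (4)."""
    compiled = re.compile(pattern)
    # matches(i, spam): number of spam messages the pattern matches
    matches = sum(1 for msg in spam_messages if compiled.search(msg))
    # length(i): size of the pattern, assumed here to be its character count
    length = len(pattern)
    # Assumed form of (4): shorter patterns with equal coverage score higher
    return matches * (1.0 / (length + 1) + 1.0)


# Toy spam subject lines (hypothetical data)
spam_subjects = ["Cheap v1agra now", "V1AGRA discount", "Win a free prize"]

short_score = fitness(r"(?i)v1agra", spam_subjects)
long_score = fitness(r"(?i)v1agra( now| discount)", spam_subjects)
```

Both patterns match the same two subjects, so under this reading of (4) the shorter pattern receives the higher score, which is the pressure towards compact expressions that a Genetic Programming search would exploit.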
‘Biological Immune System (BIS)’ based model where the ‘Local Concentration (LC) based Feature
Extraction’ approach has been adopted for the development of the anti-spam model. Such an LC
approach is thought to be able to effectively determine position-correlated information from a
message by transmuting each area of a message to a corresponding LC feature. The proposition
divides the message content using a fixed-length window that slides through each chunk of the
divided content. However, if the message content itself is shorter than the length of this sliding
window, the performance degrades. The study reported an accuracy and precision of over 96% with
the size of the sliding window set to 150 characters per window.

B. MACHINE LEARNING BASED SYSTEMS
Machine Learning (ML) is the set of engineering steps formulated with a view to making
computational instruments act without being explicitly programmed. Machine Learning can be a great
boon in tackling the spam issue, primarily because of its ability to evolve and tune itself with
time, and to counter a key bottleneck ingrained in other classes of spam detection mechanism:
‘Concept Drift’. Researchers pointed out that the contents and operating mechanism of spam emails
change over time, so the techniques that work now may be rendered useless in the near future due
to the change in structure and content of these spam emails; this phenomenon is called Concept
Drift [97], [98]. Wang et al. [99] conducted a statistical analysis of spam emails over a period
of 15 years (1998 – 2013) and demonstrated how spammers adopt changes not only in spam contents,
but also in the delivery mechanism.
The following sections will discuss a number of such techniques and the results obtained once
tried and tested on different spam corpora. However, before that, some Machine Learning based
terminologies must be briefly discussed, such as types of algorithms and associated benefits, as
the majority of the models utilize these algorithms or their close variations.
Supervised Machine Learning Algorithms: Systems utilizing Supervised Machine Learning algorithms
tend to learn from a set of labelled data, where the possible output for the corresponding input
is already given [100]. The algorithm goes over this set of data (learns) and eventually builds up
the ‘Idea’ or probabilistic mapping between the nature of the input and the most likely output
(the result). Supervised Learning can be branched out into two different subtypes, Classification
and Regression [100]. Supervised algorithms that, in most cases, produce outputs of a categorical
nature are said to be classification algorithms, for instance: Spam or Ham, whereas supervised
algorithms that predict outputs of continuous numerical value are denoted as regression
algorithms, for example: $1000-$5000, 50°F etc.
Unsupervised Machine Learning Algorithms: As the name suggests, unsupervised learning refers to
the fact that the model will not have any labelled data to work with, and thus no training will be
provided. Based on the dataset, the algorithms will try to figure out common features within a
group of items and will rearrange the data points in clusters based on the commonality [101].
Alongside clustering, another type of unsupervised learning is ‘Association Rule Learning (ASL)’;
it finds patterns in large datasets based on some measure of interesting properties, for example,
to deduce an activity pattern of an individual. Equation (5) can be deployed as follows, where P
and Q belong to a set of items R [102]. ASL has also been used in recent times as an aid to
develop supervised classification models [103].

P ⇒ Q, where P, Q ⊆ R
{day ∈ (weekends, public_holidays), weather ∈ (sunny)} ⇒ {fishing}    (5)

The above example can be stated in general terms as: when it is a weekend or public holiday and
the weather is sunny, the individual spends time fishing.
Semi-supervised Machine Learning Algorithms: It is an amalgamation of both supervised and
unsupervised learning. Oftentimes it has been seen that in a collection of a large amount of input
data, only a limited volume is actually labelled; semi-supervised learning algorithms work well in
such scenarios [104].
Reinforcement Learning: In a Reinforcement Learning system, the agent is capable of learning the
pathways on the fly using a temporal learning scheme, without supervision. The agent is the entity
that decides what action (A_t) to take. The system works on the basis of trial and error, where
depending upon the action of the agent, a positive or negative feedback (R_{t+1}) is provided at
the next instance [105].

1) PERFORMANCE MEASUREMENT
Most often the performance of Machine Learning models is calculated using various measures such
as Accuracy, Precision, Recall, F-Measure, Receiver Operating Characteristic (ROC) Plot and
Receiver Operating Characteristic (ROC) Area, to name a few.
These measurements are mostly determined using True Positive (TP): when the model correctly
predicts the class, for instance classifying a spam as ‘spam’; True Negative (TN): when the model
correctly predicts the opposite class, for instance classifying a ham as ‘not spam’; False
Positive (FP): when the model incorrectly predicts the class, for instance classifying a ham as
‘spam’; and False Negative (FN): when the model incorrectly predicts the opposite class, for
instance classifying a spam as ‘ham’. Table 5 describes the key terminologies of performance
measurement of a model.

2) FEATURE SELECTION AND ENGINEERING
In any Machine Learning based model, ‘Feature Selection [106], [107] and Feature Engineering’ is a
really crucial task as it is used to derive new and novel features from the
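The measures named under Performance Measurement follow directly from the four counts TP, TN, FP and FN. The sketch below uses the standard textbook definitions, which are an assumption here since Table 5 is not reproduced in this part of the text; the class name `ConfusionCounts` and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # spam correctly flagged as spam
    tn: int  # ham correctly left as ham
    fp: int  # ham wrongly flagged as spam
    fn: int  # spam wrongly left as ham


def accuracy(c: ConfusionCounts) -> float:
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def precision(c: ConfusionCounts) -> float:
    # Of everything flagged as spam, the share that really was spam
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: ConfusionCounts) -> float:
    # Of all real spam, the share that was caught
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_measure(c: ConfusionCounts) -> float:
    # Harmonic mean of precision and recall (F1)
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical filter evaluated on 1000 emails, 100 of which are spam
counts = ConfusionCounts(tp=90, tn=880, fp=20, fn=10)
```

On these counts the accuracy is 0.97 while the precision is only about 0.82, which illustrates why accuracy alone can be misleading when ham heavily outnumbers spam.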
TABLE 6. A summary of some useful machine learning algorithms.
an ‘Activation Function’ f, together with a bias b, decides whether the neuron should fire
(process and pass the information through); and if it does, then what its strength will be. An
Activation Function also normalizes the output of a neuron, especially after several runs.
Nosseir et al. [115] developed an ANN based classifier to identify unacceptable and acceptable
words from the message content of an email. The multi-neural network classifiers deal with words
from the email body after the words have been pre-processed to remove the stop words (articles,
prepositions) and noise (obfuscated words such as I∗n$u∗rènce or misspellings), along with the
application of a stemming process to extract the word root using Porter’s Algorithm [116]. One of
the major concerns for the system is that it was tested on a small-scale database of words and
should further be tested in a larger setting before using a blacklist and whitelist content
filter. The derived accuracy from the measurements provided has been found to be 99.87% for a
five-character ANN.
Malge and Chaware [117] deployed tokenization, stop words removal and stemming before feeding the
result into a feature extraction algorithm. The obtained set of words is

b: DEEP LEARNING BASED FRAMEWORKS
Deep Learning is a subclass of Machine Learning that can learn in both supervised and unsupervised
fashion. It employs cascading layers of (nonlinear) processing units for feature extraction and
transformation. The output of each of these layers is fed into the next consecutive layer as
input, and these layers (often processing different levels of abstraction) construct a hierarchy
of concepts [28], [120]. Algorithms such as DeepSVM [79], Convolutional Neural Networks [121]
(henceforth CNN), Deep Neural Network and Deep Boltzmann Machine are a few of the developments
that are based upon the principle of Deep Learning.
Seth and Biswas [122] introduced Deep Learning techniques, such as CNN, to tackle spam emails
based on images and spam content. To classify e-mails containing both image and text, the authors
have proposed two multi-modal architectures. Each of these architectures combines both image and
text classifiers, producing an output class. The first architecture works on the basis of ‘Feature
Fusion’ while the other mines the rules between the two classifiers as well as uses class
probabilities. It has been reported that the latter
projected higher accuracy of 98.11%; but the dataset is really a small one of just over 1500
images, whereas CNNs require much richer and larger datasets to produce results that can be
generalized over multiple different instances. Moreover, the Dropout rate, a regularization
technique discussed at length in [123], has been fixed at 0.5, which may not be the optimal value
for every situation, as stated in the work of [124]. The study needed to project the effects of
other dropout rates before settling on 0.5.
Shang and Zhang [125] also deployed CNNs for image classification in spam emails. However, CNNs
sometimes do not perform well on real world images, partly due to noise of different sorts that
distorts the image, but the paper did not discuss such issues.
Barushka and Hajek [126] have demonstrated that reduction of features often reduces accuracy and
precision as well as recall, although ANN and Decision Tree showed inspiring performance with a
significantly reduced feature set. The research work [126] also argues that ‘Shallow Neural
Networks’ are a poor fit for handling high dimensional data, yet computationally expensive, unless
some other advanced techniques, such as ‘Dropout Regularization’ [127] and ‘Rectified Linear Unit
(ReLU)’, a popular Activation Function, are combined. Such methods can address some critical spam
filtering limitations such as optimization convergence to a non-optimal local minimum, an example
of a problem arising out of overfitting and high-dimensional data. Thus the authors have proposed
a model of spam filter that integrates a high-dimensional N-Gram term frequency–inverse document
frequency (tf-idf) feature selection. The proposition is composed of a modified distribution-based
balancing algorithm [128], and a ‘Regularized Deep Multi-Layer Perceptron Neural Network’ model
with Rectified Linear Units, with a view to capturing intricate high-dimensional data features.
Multilayer Perceptron (henceforth MLP) is a class of feed-forward ANN. The model does not require
any dimensionality reduction and was tested on four different datasets. An N-Gram is a contiguous
sequence of n terms or items from a sample of speech or text. The results show Deep Neural
Networks can be quite promising and the model outperforms a number of spam filters commonly in
use, showing an accuracy of 98.76% on the Enron dataset. The main limitation of [126] is that the
framework is far more computationally intensive than some of the common techniques available, and
thus its performance as a spam filter on more standardized computing hardware is questionable.
‘Overfitting’ can be quite a critical aspect of a Machine Learning based model: it occurs when the
model fits the training data too well, to an extent that it negatively impacts the performance of
the model on new data [129], [217].
As seen in the above study and as will be highlighted in the studies discussed below, ‘High
Dimensionality’ of feature space (too many attributes) is a recurring problem in a number of
Machine Learning dependent models, especially those that use Bayesian techniques. With the
increase of dimensionality, the complexity rises exponentially; this problematic issue is known as
the ‘Curse of Dimensionality’ [130]. To circumvent the Curse of Dimensionality, many filters
perform some degree of ‘Dimensionality Reduction’ before applying the anti-spam filter to classify
incoming messages. Dimensionality Reduction also limits overfitting. With every single addition of
dimension d, data increases exponentially, i.e. n^d, where n is the number of data points at the
start, underscoring the augmentation in complexity that the Curse of Dimensionality introduces.

c: NAÏVE BAYES BASED PROPOSITIONS
Another popular supervised algorithm is Naïve Bayes (henceforth NB), developed on Bayes’ Rule.
Bayes’ Rule, introduced by Thomas Bayes, attempts to derive the probability of an event with the
help of some prior knowledge of a condition related to that event [131]. For instance, if
someone’s sprinting speed is related to body weight, then with the application of Bayes’ Rule, the
body weight can be used to determine the individual’s sprinting speed more accurately than
determining the sprinting speed without the knowledge of body weight.
Mahdavinejad et al. [132] stated that NB classifiers require a limited number of data points for
training purposes, and that they are reliable and considerably faster, as well as efficient in
dealing with high-dimensional data points. Bielza and Naga [133] also pointed in a similar
direction, arguing that Bayesian network classifiers, using some form of Naïve Bayes algorithm,
are in general far superior to other pattern recognition classifiers in terms of algorithm
efficiency and effectiveness in learning a model from a dataset. However, NB considers features to
be completely independent [134], which is not the case in the majority of practical applications.
Zhou et al. [135] mentioned that a number of research works have used a ternary approach (Spam,
Ham and Unsure) to determining whether a mail is spam or not, using, in most cases, an NB
classifier. Their proposed modification enhances the calculation and interpretation of the
required thresholds, which in the earlier developed systems have been determined just on an
intuitive understanding to define ternary email categories. The authors have employed
‘Decision-Theoretic Rough Set (DTRS)’ models with NB classification to regularize this
computation of the threshold value. The result did show significant improvement in
‘Cost-sensitivity’ (a ‘loss function’ is regarded as the ‘costs’ of making classification
decisions) in grouping emails into spam, ham and ‘suspect’ from three different datasets. Despite
demonstrating a weighted accuracy of 90.05% (assuming that misclassifying a legitimate email as
spam is 9 times more costly than the opposite), the proposition does not solve the issue of
automatically classifying an email as spam, as the user still has to make a decision on the group
of emails marked as ‘Suspect’, and this leaves a possibility of error in judgement on the part of
the user.
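To make the Naïve Bayes description above concrete, the following is a minimal sketch of a multinomial NB text classifier with add-one (Laplace) smoothing. It is a generic textbook formulation, not the method of any of the cited works; the function names `train_nb` and `predict_nb`, the whitespace tokenization and the four training messages are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, float], Dict[str, Counter], Set[str]]


def train_nb(docs: List[Tuple[str, str]]) -> Model:
    """docs is a list of (text, label) pairs; returns log priors, word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts: Dict[str, Counter] = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    priors = {label: math.log(n / len(docs)) for label, n in class_docs.items()}
    return priors, word_counts, vocab


def predict_nb(model: Model, text: str) -> str:
    priors, word_counts, vocab = model
    scores = {}
    for label, log_prior in priors.items():
        total = sum(word_counts[label].values())
        score = log_prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Naive independence assumption: per-word log-likelihoods simply add up
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training set (hypothetical data)
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train_nb(training)
```

Calling `predict_nb(model, "claim free money")` selects the spam class and `predict_nb(model, "meeting tomorrow at noon")` selects ham. The summation of per-word terms is where the independence assumption noted in [134] enters: correlated words are counted as if each contributed separate evidence.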
Equation (9) calculates the Entropy E of a dataset D, which holds the positive and negative
‘Decision Attributes’.

Gain(Attribute X) = Entropy(Decision Attribute Y) − Entropy(X, Y)    (8)

Gain is calculated for each of the features or attributes, and the one with the highest value is
selected as it provides the most information gain. The whole process is repeated for the
sub-branches of the tree to eventually complete the DT.
Ouyang et al. [141] developed frameworks based on DT and another algorithm known as ‘Rulefit’
[142], in order to carry out a comprehensive empirical study into the efficacy of using packet and
flow features in the detection of spam emails from a single-enterprise perspective. The flow based
analysis critically examines an email using different methods, such as DNS Blacklisting, filters
that work on SYN packet features, filters based on key traffic characteristics and finally content
analysis. The addition of each of the stages in the processing adds more overhead, and thus the
computational complexity increases. A message is marked as spam when any of the layers confidently
labels it as spam. The researchers claim that the proposed work on network level filtering for
spam detection can greatly reduce the workload for more intensive content level filtering.
Sheikhalishahi et al. [143] envisioned a preliminary approach to a new algorithm known as
‘Categorical Clustering Tree (CCTree)’, which is based on DT. CCTree extracts a set of categorical
features such as size of email, number of embedded links, attachment information, HTML
characteristics etc. to build a tree of clusters. The researchers argued that it has a simpler
approach to the task in hand, and thus the complexity is quite low compared to some other
approaches. On the other hand, the research attempt is yet to be tested and implemented on a
large-scale dataset, and it only showed some theoretical underpinning where the low complexity and
easy-to-understand representation of the chosen features have been highlighted as the key strength
of the proposed CCTree algorithm.
To address the issue of ‘Concept Drift’, Sheu et al. [144] designed a DT-based framework that
works in conjunction with ‘Incremental Learning’ of spam keywords, set up in an online environment
for continuous enrichment. The precision attained at the point of publication is 95.5%.

e: RANDOM FOREST BASED FRAMEWORKS
Random Forest (henceforth RF) is one of the most successful supervised classification and
regression techniques based on ensemble learning. It operates by constructing an entire forest
from multitudes of random and uncorrelated Decision Trees during the training segment [145].
Ensemble learning methods employ multiple learning algorithms to come up with an optimal
predictive analytics, which can perform better than any of the individual models’ predictions
[145]. RF may incur additional complexity in its calculation as it uses a lot more features than a
standalone DT, but generally it does produce higher accuracy in dealing with unseen datasets.
The ensemble method upon which RF is founded is known as ‘Bootstrap Aggregation’, a powerful, yet
simple algorithm. Bootstrap Aggregation can address overfitting arising out of the high variance
of algorithms such as DT.
Tran et al. [146] indicated that limited work has been done on detecting malicious contents in
spam emails, in the form of malignant URLs or harmful attachments (malware). Coders of malware are
relentlessly developing novel and clever techniques for transforming binary code so that it cannot
be detected by anti-virus scanners, and their level of sophistication is growing with time [29].
The proposed model extracts many different features in a rather time-efficient manner from email
content and metadata, without using external tools. Some of the header and content features used
are quite unique, and according to the authors have not been used elsewhere. RF has been applied
to measure the effectiveness of the selected features. Nevertheless, the authors have suggested
that the system does not always perform well enough in detecting malicious URLs in spam emails,
but is rather effective in detecting potentially hazardous attachments.
The proposed model of Shams and Mercer [147] extracted features such as words, length of words
and documents etc. and carried out the classification task on four different datasets, namely
CSDMC2010, SpamAssassin, LingSpam and Enron, with multiple classification algorithms. The authors
claim that classifiers generated using a meta-learning algorithm (‘Bagging’ in this case) perform
better than probabilistic and tree based models. The Bagging model demonstrated an average
accuracy of 94.75% across all four datasets. The algorithm is a close variation of RF. Regardless
of the reasonable accuracy, the work does not address the issue of high dimensionality and the
associated increase in the complexity of the proposed methods.

f: LOGISTIC REGRESSION BASED SOLUTIONS
Logistic Regression (henceforth LR) is another simple, yet very useful supervised approach,
applicable to a wide range of binary classification problems, for instance, predicting
binary-valued labels for a data point z such that z(i) ∈ {0, 1}. Such probabilities can be
calculated from (9) and (10).

P(z = 1|x) = \frac{1}{1 + \exp(-\theta^{T} x)}    (9)
P(z = 0|x) = 1 − P(z = 1|x)    (10)

The right hand side of (9) will ‘squash’ the value of θ^T x within the range of 0 to 1 so that it
can be interpreted as a probability [148]. Over the years, apart from scientific research, LR has
been widely adopted in many different fields such as marketing, health care and economics, to name
a few [149]. Both the equations signify how likely the
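Equations (9) and (10) of the Logistic Regression subsection translate almost directly into code. The sketch below assumes that x already includes a constant 1 as its first entry so that the first weight acts as a bias; the function names, the weights in `theta` and the feature values in `x` are hypothetical and not taken from any cited study.

```python
import math
from typing import Sequence, Tuple


def sigmoid(t: float) -> float:
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta: Sequence[float], x: Sequence[float]) -> Tuple[float, float]:
    """Return (P(z=1|x), P(z=0|x)) as in Equations (9) and (10)."""
    # theta^T x: dot product of the weight vector and the feature vector
    t = sum(w * xi for w, xi in zip(theta, x))
    p_one = sigmoid(t)
    return p_one, 1.0 - p_one


# Hypothetical weights: bias, number of links, number of exclamation marks
theta = [-1.0, 2.0, 0.5]
x = [1.0, 1.0, 2.0]  # constant 1 for the bias, one link, two exclamation marks

p_spam, p_ham = predict_proba(theta, x)
```

Here θ^T x evaluates to 2, giving a spam probability of roughly 0.88; a filter would then compare this value against a chosen threshold such as 0.5.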
the performance consistency with the growth of the dataset, another version of LR, the Multiple
Instance Logistic Regression (MILR), should be used, as it demonstrated a consistent accuracy
within the tight range of 93.3% to 94.6% (up to 2500 data points).

g: SUPPORT VECTOR MACHINE BASED PROPOSITIONS
Support Vector Machine (henceforth SVM), a well-established supervised learning technique used for
classification, was originally proposed by Alexey Chervonenkis and Vladimir N. Vapnik in 1963
[151].
A hyperplane creates the differentiating classes by analysing various features found in the
dataset. The SVM can work in any number of dimensions. In an R^2 (2-dimensional) workspace, a
hyperplane is a line, in R^3 it is a plane and in R^n it is termed a ‘hyperplane’. The algorithm
may identify several hyperplanes, but the optimum one would be the one that has the maximum
distance from the training data points of each of the classes. Given a specific hyperplane, the
computed distance from it to the nearest data points on both sides can be used to draw the margin.
There should never be any data points inside the margins. The bigger the margins, the better the
model will generalize to unseen data. Support Vectors, required to calculate the margins, are the
data points ‘On’ or ‘Closest’ to the margins [152]. Though SVM is a supervised technique, work has
also been done to use it in unsupervised clustering [153], [154]. SVM has the unique ability to
transform non-linearly separable data into new linearly separable data by a mechanism known as the
‘kernel trick’ [154].
Similar to the study of [126], Diale et al. [155] demonstrated that while using SVM for email
classification, optimising the kernel type and kernel parameters is of utmost importance. The
authors have indicated that varying feature extraction and feature selection techniques for SVM
often brings about the need for employing different kernel functions for optimum performance. They
have also concluded that increasing the number of features available for feature selection and
extraction resulted in better performance, that is, there is a positive correlation. This research
attempt primarily works with the words from the email body for feature engineering, but excludes
other forms of features, such as header, URL and domain information. Thus the obtained results are
only accurate within a limited boundary of circumstances.
A combined model has been envisioned by Amayri and Bouguila [156] where both textual and visual
(image) information from emails have been combined and simultaneously put into action in detecting
spam. The framework is based on building probabilistic SVM kernels from a mixture of Langevin
distributions [156]. For the textual features, certain header information, such as the FROM,
REPLY-TO, Cc, Bcc and TO fields, has been consulted along with email content and subject. For the
visual part, texts in the embedded images have been extracted using OCR, and certain visual
features of the images have also been included in the feature vector. For the SVM kernel, the
authors have experimented with three different flavours and came to the conclusion that the
Bhattacharyya Kernel (BK) works best. The framework attained an accuracy of around 92% when both
images and texts are considered from an email; however, the accuracy declines for spam which is
based solely on image or text.
Deep Support Vector Machines (DeepSVM) are known to perform better than CNNs and standalone SVMs
due to a design improvement where, instead of a single layer, any number of layers can be used
with kernel functions. Roy et al. [157] proposed an application of Deep SVMs in spam
classification. The lower level SVMs carry out the task of feature extraction while the highest
level SVM performs the actual prediction using the extracted features. The authors have also
compared the performance with ANN and regular SVM. The model based on Deep SVM showed the highest
accuracy of 92.8%.

h: ADABOOST BASED PROPOSITIONS
Boosting basically means combining a number of simple learners (classifiers that produce an
accuracy just above 50%) to formulate a highly accurate prediction. AdaBoost (Adaptive Boosting)
sets different weights to both samples and classifiers [158]. This forces the classifiers to
concentrate on observations that are rather difficult to classify accurately. The formula for the
final classifier is shown in (11).

H(p) = \operatorname{sign}\left( \sum_{k=1}^{K} \alpha_k h_k(p) \right)    (11)

Equation (11) is a linear combination of all of the weak classifiers (simple learners), where K is
the total number of weak classifiers, h_k(p) is the output of weak classifier k (can only be −1 or
1) and α_k is the weight applied to classifier k. The final decision is derived by looking at the
sign (+/−) of (11). The research done by Varghese and Dhanya [159] attempted to develop a filter
using Parts of Speech (POS) Tags, Bigram POS Tags, Bag-of-Words (BoW) and Bigram Bag-of-Words
(BoW). It has been found that POS Tag and Bigram POS Tag features demonstrated better output using
AdaBoost as the classifier; the experimentation achieved a False Positive Rate of 0. On the
contrary, [159] suffers from the same issue due to POS Tagging as discussed earlier. In addition,
as pointed out in [160], AdaBoost as a classifier might incur issues such as high computational
cost and non-scalability. Apart from this, the work does not address any header information,
leaving a loophole for a number of different types of spam emails.

i: K-NEAREST NEIGHBOUR BASED SOLUTIONS
K-Nearest Neighbour (henceforth KNN) is a widely used classification technique that boasts a
commendable balance among several important criteria such as predictive ability, intuitive
interpretability and time required for calculation (for a moderately rich dataset). Though
algorithms such as RF do have higher capability in
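The weighted vote in Equation (11) of the AdaBoost subsection can be illustrated with a short sketch. It covers only the final combination step, not the training procedure that derives the weights; the three rule-based weak classifiers, the weights in `alphas` and the treatment of a zero sum as +1 are hypothetical choices made for the illustration.

```python
from typing import Callable, Dict, Sequence

Email = Dict[str, object]
WeakClassifier = Callable[[Email], int]  # each returns +1 (spam) or -1 (ham)


def adaboost_predict(weak_classifiers: Sequence[WeakClassifier],
                     alphas: Sequence[float],
                     p: Email) -> int:
    """Return the sign of the weighted sum of weak classifier outputs, as in (11)."""
    total = sum(alpha * h(p) for h, alpha in zip(weak_classifiers, alphas))
    # A sum of exactly zero is treated as +1 here (an arbitrary tie-break assumption)
    return 1 if total >= 0 else -1


# Hypothetical weak classifiers, each only slightly informative on its own
def mentions_free(p: Email) -> int:
    return 1 if "free" in str(p["text"]).lower() else -1


def many_links(p: Email) -> int:
    return 1 if int(p["links"]) > 2 else -1


def all_caps(p: Email) -> int:
    return 1 if str(p["text"]).isupper() else -1


classifiers = [mentions_free, many_links, all_caps]
alphas = [0.9, 0.4, 0.3]  # hypothetical weights; training would normally produce these

spam_like = {"text": "FREE offer", "links": 1}
ham_like = {"text": "Lunch tomorrow", "links": 0}
```

For `spam_like` only the first rule votes +1, yet its larger weight outvotes the other two (0.9 − 0.4 − 0.3 > 0), so the combined decision is +1, while `ham_like` receives −1 from every rule.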
reviewed a number of classification algorithms and found Rotation Forest to perform better than some other common algorithms, with an accuracy of 94.2% on the Spambase dataset. However, the authors have used a 66% split, rather than the more traditional 80%-20%, without really explaining the rationale behind the choice.
DT based systems often tend to have high sensitivity to noise and overfitting. This issue has been highlighted in the work of Wijaya and Bisri [167]. To tackle this issue, the researchers have added a regular LR to the process. In this hybrid spam detection system, data is fed into an LR module before passing through the DT based segment. The reported accuracy is 91.67%. The work does not use any feature engineering methods, and simply uses all the available aspects as features. This simplicity gives the framework effortless execution, but makes the accuracy less realistic.
It has been stated by Nizamani et al. [168] that efficient and advanced feature selection weighs more than the type of classification algorithm used when it comes to identifying deceitful emails. The authors have employed SVM, J48 Decision Tree (implemented in Java), CCM (Cluster-based Classification Model) and NB classifiers together with various carefully designed features, and disseminate the idea that frequency based features generally achieve top accuracy, 96% in their study. The work only deals with the contents of the fraudulent emails for feature extraction, ignoring the header, which is also an important aspect that needs to be considered. Besides, Alsmadi and Alhami [169] argued that a better false positive rate can be acquired through the

of KNN is quite high. KNN can also be used to formulate regression models, but that is not very common. KNN uses 'Euclidean Distance' to determine the distance between two data points (X_n and X_m) as shown in (12) [161], where D is the number of features.
\[ \mathrm{Dist}(X_n, X_m) = \sqrt{\sum_{i=1}^{D} \left(X_n^{i} - X_m^{i}\right)^2} \qquad (12) \]
On the other hand, instead of using Euclidean Distance, Sharma and Suryawanshi [162] have proposed 'Spearman Correlation' [163] as the distance measure for KNN based classification as shown in (13). X and Y are the training and testing tuples respectively, while n is the number of observations. The values of d_ij usually lie between 1 and −1.
\[ d_{ij} = 1 - \frac{6 \sum_{i=1}^{n} \left(\mathrm{rank}(X_i) - \mathrm{rank}(Y_j)\right)^2}{n(n^2 - 1)} \qquad (13) \]
The changes have shown some enhancements over the regular KNN model, with nearly 50% improvement in accuracy (97.44% with an 80%-20% train and test ratio). A limitation of the study is the size of the dataset, having just over 4000 data points. KNN often needs a rather large dataset to produce a stable model with realistic accuracy. Besides, some elaboration was needed on fixing the value of K as 3. The authors also expect the study to be used in conjunction with other more robust and complete spam filtering frameworks.
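The two KNN proximity measures in (12) and (13) can be illustrated with a minimal sketch. The feature vectors and labels below are hypothetical toy data, not taken from any of the surveyed studies. The sketch also makes two assumptions that the source does not state: tied values share their average rank, and (13) is treated as a similarity, so the K tuples with the highest value are taken as neighbours. K = 3 mirrors the value reported for [162].

```python
import math
from collections import Counter


def euclidean(x, y):
    """Equation (12): square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def ranks(values):
    """1-based ranks; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Equation (13): 1 - 6 * sum(d^2) / (n * (n^2 - 1)) on the rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def knn_predict(train, query, k=3, measure="euclidean"):
    """Majority vote among the k nearest (or most rank-correlated) tuples."""
    if measure == "euclidean":
        ordered = sorted(train, key=lambda item: euclidean(item[0], query))
    else:
        ordered = sorted(train, key=lambda item: spearman(item[0], query), reverse=True)
    votes = Counter(label for _, label in ordered[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors, e.g. counts of links, exclamation marks,
# capitalised words and recipients per email.
TRAIN = [
    ([8, 5, 9, 1], "spam"), ([7, 6, 8, 2], "spam"), ([9, 4, 7, 1], "spam"),
    ([1, 0, 2, 9], "ham"), ([0, 1, 1, 8], "ham"), ([2, 0, 1, 7], "ham"),
]
QUERY = [6, 5, 7, 2]

print(knn_predict(TRAIN, QUERY, k=3, measure="euclidean"))
print(knn_predict(TRAIN, QUERY, k=3, measure="spearman"))
```

Because (13) depends only on the ordering of feature values, the Spearman variant ignores differences in scale between tuples, whereas (12) is sensitive to them.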
j: MULTI-ALGORITHM SUPERVISED SYSTEMS
A number of interesting propositions have been put forward that employ more than one supervised algorithm in different segments of the framework to develop the final model. This section will highlight some of these recent solutions, which are mostly hybrids of the above discussed algorithms.
In his study, Wang [164] proposed a heterogeneous ensemble approach for spam detection composed of DT, NB and Bayesian Net algorithms. Heterogeneous ensembles are composed of methodologically different learning algorithms. The study has also discussed multiple procedures for algorithm selection in building the ensembles. The researchers have compared the framework with homogeneous ensemble techniques and found their approach to perform better, with an accuracy of 94%.
Similar to [164], Large et al. [165] also suggested that heterogeneous ensemble-based spam filtering frameworks perform better. However, the researchers argued that instead of the simple tree based ensemble techniques used in [164], the more advanced ones, based on slightly complex algorithms such as RF, Rotation Forest, Deep Neural Network and Support Vector Machine, can actually perform better in a variety of scenarios. The claim can also be substantiated from Shuaib et al. [166] where the researchers have

deployment of N-Gram based clustering and classification than employing any other algorithms, even the one discussed in [168].
Feng et al. [170] offered a hybrid model composed of NB and SVM, attaining an accuracy of around 91.5% with a training set of 8,000 samples. The framework tries to reduce, as much as possible, the dependency issue among features that is commonly observed in NB based models. The study aims to extend its functionality towards image spam as one of the future improvements. However, we believe the authors [170] should also include header and domain information in their analysis of spam emails.
Islam and Abawajy [171] developed a multitier classifier where an email is checked for an accurate labelling in the first two tiers, and if any misclassification occurs (the initial two tiers giving out conflicting labels), it is then sent to the third tier. The choice of algorithms (SVM, AdaBoost and NB) picked for each of the tiers has been decided after juggling the selected algorithms and their respective tiers. A strength of this technique is that the processing among tiers transpires in parallel, unlike some other ensemble based multitier classifiers. The model returned a high accuracy rate (around 96.8%) with a low rate of false positive detection.
A behavior-based mechanism has been discussed by
Sender Behaviour algorithm are the domain message-id (DMID) and lists of email sender (ES). The system showed an average accuracy of 93% with only 7 features from a rather limited set of 3000 data points. On the other hand, a similar framework [173], achieving a slender advantage in terms of accuracy, used a high number of features: 43.
The framework developed by More and Kulkarni [174] and tested on the Enron dataset using NB and RF demonstrated an accuracy of 96.87%. The system employs text analysis of the email body using NB, and categorizes the words into several linguistic features as well as specific spam words. If it is found that the message body contains over 5% spam words, it is flagged as spam. Besides, the same set of emails is passed through a classification system built on RF that uses the Polynomial Kernel Function shown in (14). X and Y are vectors of features derived from test or training samples, and C is a constant.
\[ K(X, Y) = \left(X^{T} Y + C\right)^{2} \qquad (14) \]

Aski et al. [176] brought forward a rule based framework where 23 meticulously chosen features have been selected from a personally compiled spam database and each of these criteria has been scored to get a total value, which was subsequently compared to a threshold value to finally label an email as spam or ham. MLP, NB classifier and C4.5 Decision Tree classifier have been used to train the model, and the individual models' performances have been compared, of which the MLP based model scored an accuracy of around 99% [176]. The MLP based model propagates information by activating input neurons that contain labelled values. The activation of neurons is calculated either in the middle or output layer using (15), where a_i represents the activation level of neuron i; j denotes the neuron set of the previous layer; W_ij is the weight of the link between neurons j and i, and O_j is the output of neuron j.
\[ a_i = \sigma\Big(\sum_{j} W_{ij} O_j\Big) \qquad (15) \]
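As a rough numerical illustration of (14) and (15), the sketch below evaluates the degree-2 polynomial kernel and the activation of a small layer of neurons. It assumes that σ is the logistic sigmoid, which the surveyed text does not specify, and all weights and inputs are hypothetical values chosen only for the example.

```python
import math


def polynomial_kernel(x, y, c=1.0, degree=2):
    """Equation (14): (X^T Y + C)^2 for two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** degree


def sigmoid(z):
    """Logistic function, assumed here to play the role of sigma in (15)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_activation(weights, prev_outputs):
    """Equation (15): a_i = sigma(sum_j W_ij * O_j) for a single neuron i."""
    return sigmoid(sum(w * o for w, o in zip(weights, prev_outputs)))


def layer_activations(weight_matrix, prev_outputs):
    """Apply (15) to every neuron i; row i of the matrix holds W_i1, W_i2, ..."""
    return [neuron_activation(row, prev_outputs) for row in weight_matrix]


# Hypothetical values for illustration only.
print(polynomial_kernel([1, 2], [3, 4], c=1))
print(layer_activations([[0.0, 0.0], [1.0, -1.0]], [2.0, 2.0]))
```

In both rows of the example the weighted sum is zero, so the logistic sigmoid returns its midpoint for each neuron.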
However, the small testing dataset (750 spam and ham emails in total) somewhat limits the wide acceptance of the results obtained in this study. An effective performance measure in terms of memory and time footprint for large scale datasets is thus yet to be determined. Also, the study does not mention how the model will perform against certain critical attacks such as spear phishing.
As illustrated in earlier works, the 'Bayesian Probability Theorem' has been the choice for handling uncertainty in datasets. However, the work of Zhang et al. [177] rather argued that the 'Dempster-Shafer (D-S) theory of evidence' [178] is better equipped than Bayesian probability when using statistical classification. Uncertainty can arise in a number of ways in the analysis of spam corpora, such as when assigning missing values to features. In D-S theory, given a domain α, a probability mass is assigned separately to each subset of α, whereas in classical probability theory, this probability mass is assigned to each individual element. Such an assignment is called a Basic Probability Assignment or BPA [179]. The researchers have selected the 5 most representative header features of the spam corpus after appropriate quantification. Their D-S integrated classification model found ANN to be one of the most effective classification algorithms along with NB.
Ergin and Isik [180] highlighted the fact that spam is not only a problem in emails written in English; non-English speakers also have to deal with the issue. The work in question demonstrated a Turkish spam filtering system developed with the aid of DT and ANN as classifiers, while the 'Mutual Information (MI)' method has been deployed for feature selection. ANN attained an accuracy of 91.08%. Though the study states the superiority of Mutual Information (MI) over the more widely applied technique, 'Information Gain (IG)', the extensive study by George [181] found otherwise, where it has been concluded the performance of Mutual Information is not up

The obtained result has also been compared with ANN and LR built models (following the same Kernel Function). The work could be improved if the issue of high dimensionality and the associated increase in the complexity of the proposed methods were explained in detail.
Islam and Xiang [8] developed a promising email classification technique based on a data filtering method. The work broached an innovative filtering technique using a modified 'Instance Selection Method (ISM)' to cut down on the least valuable data instances from the training model and then classify the test data. The aim of ISM, enhanced by NB, is to identify which instances (examples, patterns) in email corpora should be selected as representatives of the entire dataset, without significant loss of information. Several algorithms have been tried and the model displayed an accuracy of 96.5%. However, according to the authors, the system needs to have the capability to handle incoming emails to address Concept Drift [8].

k: SUPERVISED SYSTEMS DISCUSSING PERFORMANCE OF DIFFERENT ALGORITHMS
The core of the above discussed systems is built with either a single supervised algorithm or multiple ones. Below is a discussion of single-algorithm frameworks where multiple algorithms have been individually tested to design the primary classifier of the system, and based on the performance, the best one has been chosen to finalize the classifier.
The proposed binary classification model named 'Sentinel' by Shams and Mercer [147] utilized features of Natural Language Processing before developing the classifier with multiple supervised algorithms with a view to evaluating the performance of each of those algorithms. Among the five algorithms tested, RF, AdaBoost, Bagged Random Forest, SVM and NB, AdaBoost and Bagged Random Forest performed equally best on four different spam datasets. However, real time training and response latency have not been considered, and performance against Concept Drift [97] is yet unknown.
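The Basic Probability Assignment from D-S theory described in this section can be made concrete with a small sketch of Dempster's rule of combination, the standard way of fusing two BPAs. The code is not the model of Zhang et al. [177]; the two evidence sources and all mass values are hypothetical, and the frame of discernment is assumed to be just {spam, ham}. Note that mass placed on the full set {spam, ham} expresses ignorance, which a classical probability distribution cannot represent directly.

```python
from itertools import product

SPAM = frozenset({"spam"})
HAM = frozenset({"ham"})
EITHER = frozenset({"spam", "ham"})  # mass here means "cannot tell"


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalise."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # evidence that contradicts itself
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    return {subset: mass / (1.0 - conflict) for subset, mass in combined.items()}


# Hypothetical BPAs derived from two header features of the same email.
from_sender_field = {SPAM: 0.6, HAM: 0.1, EITHER: 0.3}
from_received_chain = {SPAM: 0.5, HAM: 0.2, EITHER: 0.3}

fused = combine(from_sender_field, from_received_chain)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 4))
```

After fusion, the mass left on the undecided set shrinks, because two sources that both lean towards spam reinforce each other.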
Sharaff et al. [182] conducted experimentation on a processed Enron dataset with standard DT, J48 Decision Trees, SVM and BayesNet. The study reported the effectiveness of J48 and BayesNet over SVM.
Sharma and Kaur [183] tested a spam detection framework built upon an RBF (Radial Basis Function) Network (a subclass of ANN), where neurons were separately trained to address common spamming techniques. The approach seemed to have increased the performance of RBF and also outperformed SVM. The research resulted in an average accuracy of 99.83% after five consecutive runs. Nonetheless, the dataset of just 1000 words is not comprehensive at all, and the proposed feature extraction method is rather vague.
Saab et al. [184] also measured the performance of SVM, Local Mixture SVM, DT and ANN on the Spambase dataset. While taking into account the full 57 available features, SVM demonstrated the highest precision (93.42%), while ANN achieved the highest accuracy (94.02%). However, this high accuracy was achieved in exchange for the longest training time.
The presence of malicious URLs in phishing emails is a key characteristic of spam emails, and Vanhoenshoven et al. [185] tested the effectiveness of RF in detecting such URLs within spam emails using a publicly available database. The authors came to the conclusion that, with an accuracy of 97.69%, RF actually performed better than a few other classification techniques such as MLP, C4.5 Decision Tree, SVM and NB. Features were ranked with the 'Pearson Correlation Coefficient' [186] for selection. Qaroush et al. [187] also justified the superiority of RF (reported accuracy of 99.27%) by comparing its performance against several other classification methods while building the classifier using various important email header features.

algorithms have been used for classification evaluation and the system projected an average accuracy of over 90% with considerable dimensionality reduction of the feature set; LR was found to be performing optimally. Nevertheless, the reduced feature set of 70 is still a bit too large and more effective testing against phishing attacks is required.
Bhagat and Moawad [190] carried out similar semantic based implementations. The resulting reduction of the feature set was around 37%, with LR showing optimal performance with an accuracy of 96% while RF performed the worst, demonstrating accuracy around 85%. The somewhat similar study of Bhagat et al. [191] attained a feature reduction rate of 43.5% through the stemming of the email body on the Enron dataset. Multiple classifiers have been tested, and SVM and LR performed comparably better than other classifiers, with LR showing an accuracy of 97.7%.
Nonetheless, both [190] and [191] suffer from contextual ambiguity issues. Ambiguity refers to the fact that a sentence in context may indicate multiple meanings; for example, "There was not a single man at the party" can be interpreted as (I) absence of bachelors at the party or (II) absence of men altogether [192]. The right conclusion can be deduced upon analysing the context within which the sentence has been used.
Besides the above studies, Almeida et al. [193] conceived a process of expanding short texts, often found in SMS spam, but which can sometimes be seen in spam emails too. The authors argued that when the original text is too short and mostly filled with abbreviations and idioms, it can be harder to apply any sort of classification algorithm on it, mostly because the feature set is also extremely limited. Feature engineering is also difficult with this limited initial feature set. Their proposed
normalization and expansion method is based on semantic dictionaries, lexicography and highly effective techniques for semantic analysis and disambiguation. The study can also generate novel attributes to feed into any classification algorithm. The statistical evaluation done on the output showed promising directions. However, the researchers concluded that more thorough testing and performance measurements are required.
Méndez et al. [194] devised a semantic-based feature selection approach. The first critical segment of the proposed method is an e-mail topic extractor and guesser, and the other one computes the topic-related significance of each feature. To guess the topic of the email, the researchers have semantically grouped terms into more generic topics, that is, each topic has a bunch of related terms under it, and the more terms from a certain category are found in the content, the higher the

A study based on a semantic method has been introduced by Bhagat et al. [188] using the WordNet ontology [189] as well as some 'Similarity Measures' to reduce the high number of extracted features. 'Path Length Measure' has been chosen as the most suitable algorithm for determining the similarity measures. Path Length Measure derives the semantic similarity of a pair of concepts. The calculation starts by counting the number of nodes along the shortest path between the concepts, which can be found in the 'is-a' hierarchies of WordNet. In general, the path similarity score is inversely proportional to the number of nodes along the shortest path between the two words. Equation (16) summarizes the nature of the derived score, where w1 and w2 are the two terms.
\[ \mathrm{PATH}(w_1, w_2) = \frac{1}{\mathrm{length}(w_1, w_2)} \qquad (16) \]
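The path length measure in (16) can be sketched on a toy 'is-a' hierarchy. The hierarchy below is hypothetical and hand-written for illustration; it is not taken from WordNet, and the function names are assumptions of this sketch. Following the description above, length(w1, w2) is taken to be the number of nodes on the shortest path, so a word compared with itself scores 1.

```python
from collections import deque

# Hypothetical toy 'is-a' hierarchy (child -> parent); not real WordNet data.
IS_A = {
    "viagra": "drug",
    "aspirin": "drug",
    "drug": "substance",
    "lottery": "game",
    "game": "activity",
    "substance": "entity",
    "activity": "entity",
}


def path_length(w1, w2, is_a=IS_A):
    """Number of nodes on the shortest path between w1 and w2, or None."""
    adjacency = {}
    for child, parent in is_a.items():
        adjacency.setdefault(child, set()).add(parent)
        adjacency.setdefault(parent, set()).add(child)
    if w1 not in adjacency or w2 not in adjacency:
        return None
    queue = deque([(w1, 1)])  # breadth-first search; the start node counts as 1
    seen = {w1}
    while queue:
        node, nodes_so_far = queue.popleft()
        if node == w2:
            return nodes_so_far
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, nodes_so_far + 1))
    return None


def path_similarity(w1, w2, is_a=IS_A):
    """Equation (16): PATH(w1, w2) = 1 / length(w1, w2)."""
    length = path_length(w1, w2, is_a)
    return None if length is None else 1.0 / length


print(path_similarity("viagra", "aspirin"))
print(path_similarity("viagra", "lottery"))
```

Terms whose similarity exceeds a chosen threshold could then be merged into a single feature, which is one plausible way such a score leads to the feature reduction discussed here.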
The resulting reduction of the high number of features also reduces space and time complexity. tf-idf has been used for feature updates while feature selection is done with Principal Component Analysis (PCA). Multiple supervised

likelihood of the email belonging to that topic. The root level of topics is taken from the WordNet Lexical Database. The logic ensures each email may actually fall under multiple topics. The set of topics comprises both spam and ham groups. Finally, it is then determined whether the
5) UNSUPERVISED LEARNING BASED PROPOSITIONS
This section will analyse a number of Machine Learning based research attempts which are primarily unsupervised by nature. These include one or multiple unsupervised algorithms to develop an automated spam detection framework.

a: K-MEANS CLUSTERING BASED FRAMEWORKS
K-means Clustering is a particularly useful, simple and popular algorithm which intends to group similar data points together with a view to finding the underlying pattern. The algorithm produces the final output through iterative refinement. The number of groups is denoted by K, and iteratively each data point is assigned to one of these groups or clusters based on the identified similarities among the features [195].
Determining the optimum value for K, the total number of clusters, which needs to be inputted for the algorithm to work, can sometimes be tricky, and users often run the system multiple times with different values of K to compare the results. Several methods exist for getting a reasonably solid approximation of K [195].
In their work, Basavaraju and Prabhakar [196] proposed a system that employs text clustering based on the 'Vector Space Model (VSM)'. The method performed reasonably well on identifying spam emails. Representation of data is done using a VSM, and data reduction has been achieved through a custom developed clustering technique using the features of the K-Means algorithm and the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, achieving an accuracy of around 76%. This study uses raw words from the documents to develop the VSM. A point of concern for the system is that in case of spammers using character variations, such as disguising the word insurance as I∗n$u∗rènce, it will be difficult for the framework to work correctly.
A content based approach has been put forward by Laorden et al. [197]. The proposition applies anomaly detection to spam filtering by comparing features such as 'Word Frequency' to that of a dataset of ham, or valid email. The inspected email, if it shows considerable deviation from a normal scale, will be considered as spam. The

documents may lead to successful grouping and identification of original authors. Even though emails are highly unstructured, Alazab et al. [200] tried to implement the idea on spam detection, especially for phishing campaign identification. The researchers have deployed an Unsupervised Automated Natural Cluster Ensemble (NUANCE) methodology to approximately cluster spam emails. The final clustering is achieved by hierarchically clustering the approximate sets, giving 27 different clusters. Though the system is impressive and achieves improvement in the general direction of 'authorship attribution' in spam campaign detection, the intra-dynamics within the campaign groups may go undetected.
Halder et al. [201] used clustering algorithms such as K-Means and Expectation Maximization (henceforth EM) on schemas such as stylistic features of emails, for example the total number of punctuation marks and contractions, the number of email IDs used in the body, etc. The authors have also looked into semantic features, that is, statistical measures of different words used in a batch of emails. Besides, they have also taken the combination of these two approaches into account. The cluster analysis has been carried out on a dataset of 2600 spam emails. It was found that this method can be successfully deployed to identify writing styles of spam campaigns. Further, prototypes can be built based upon the extracted patterns for future identification of spam emails. K-Means showed an 80% success rate when a combined approach is taken, while EM projected a success rate of 84.6% while dealing with only semantic features. However, its detection rate drops to 57.4% while using a combined approach. The success rates have been reported in terms of the 'Purity' of clusters, which basically projects the quality of the clusters. The accuracy should also be within a similar range. The authors' area of concentration has generally been rather narrow, as there are a number of important features in spam email detection, such as email subject headers, URL composition, detailed header and domain information, attachments, etc., which have not been discussed; thus there is room for a considerable expansion of this work.
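The iterative refinement behind K-Means and the 'Purity' score used to report clustering quality can be sketched as follows. This is a minimal illustration, not the pipeline of any surveyed study: the two-dimensional stylistic features and campaign labels are hypothetical, and the centroids are initialised from the first K points (an assumption made to keep the example deterministic).

```python
import math
from collections import Counter


def distance(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialisation
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def purity(assignment, labels):
    """Share of points that carry the majority label of their own cluster."""
    clusters = {}
    for cluster_id, label in zip(assignment, labels):
        clusters.setdefault(cluster_id, []).append(label)
    majority_total = sum(Counter(ls).most_common(1)[0][1] for ls in clusters.values())
    return majority_total / len(labels)


# Hypothetical stylistic features per email: (punctuation count, contraction count).
POINTS = [(1.0, 1.0), (9.0, 8.0), (1.5, 2.0), (8.0, 9.0), (1.0, 0.5), (9.0, 9.5)]
LABELS = ["campaign_a", "campaign_b", "campaign_a", "campaign_b", "campaign_a", "campaign_b"]

ASSIGNMENT, CENTROIDS = kmeans(POINTS, k=2)
print(ASSIGNMENT, purity(ASSIGNMENT, LABELS))
```

Purity only rewards clusters for being homogeneous, so it tends to rise as K grows; this is one reason the value of K has to be chosen with the care discussed in this section.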
technique utilizes an algorithm known as 'Quality Threshold (QT)', which basically falls into the category of Partitional Clustering algorithms [198], a close variation of K-Means Clustering, giving an edge in reducing the number of vectors in the dataset used as normality. This attempt also lessens the processing overhead significantly. On the other hand, the system may be rendered ineffective against the usage of language features such as Synonyms, Hyponyms [199] and Metonymy. The study achieved a weighted accuracy of 92.27% on the LingSpam dataset. Basnet et al. [27] also reported a similar accuracy of 90.6% using k-means.
'Authorship Attribution' in recent times has become a valuable tool in resolving issues around authorial disputes, mainly in historic documents and literature. Patterns regarding grammatical and syntactic features emerging out of such

Expectation Maximization (EM) is an effective iterative algorithm that calculates the Maximum Likelihood (ML) estimate in the presence of hidden or missing data [202]. Latent variables, or unobserved variables which cannot be directly measured but rather can be inferred from the non-latent variables, are often used in EM models for gauging the best estimation.

b: SELF-ORGANIZING MAP BASED PROPOSITIONS
Self-Organizing Map (SOM) is an unsupervised technique that borrows the baseline idea from ANN. However, instead of 'Backpropagation', it uses a process called 'Competitive Learning' to produce a two-dimensional map of an input space with higher dimension [203]. It is conceptualized that in Competitive Learning, output neurons are in competition to respond to input patterns. At
features have been selected to begin with. In addition, the spam dataset is rather limited in the number of emails it holds.
A variation of standard PCA, termed 'PCAII', has been broached by Gomez et al. [211], where the features of both the classes under analysis are combined together. A few variations of 'Latent Dirichlet Allocation' (LDA) have also been proposed in this work. The modified algorithm has been applied on the TREC 2007 spam corpus and the output showed a balanced and stable performance regardless of dimensionality reduction.

6) SEMI-SUPERVISED LEARNING BASED SYSTEMS
Semi-supervised spam filtering systems have also demonstrated promise, even though not many attempts have been made to construct such systems yet. This section will shed some light on a few such frameworks.
Las-Casas et al. [103] proposed a technique called 'SpaDes' (Spammer Detection at the Source), which works by analysing SMTP metrics such as the number of distinct SMTP servers targeted, the number of observed SMTP transactions, the average geodesic distance to destination, the average transaction size (in bytes) and the average SMTP transaction inter-arrival time (IAT). These SMTP metrics are studied via a Machine Learning algorithm known as 'Active Lazy Associative Classification' (ALAC) [212] to build a prediction model. The associative classification method aims to amalgamate supervised classification and unsupervised association rule mining techniques in order to build a model known as an associative classifier. Though the proposition did show reasonably satisfactory performance, it has been reported that over time the system did not produce consistent performance, due to changes in the spammers' behaviour over time. The role and impact of Machine Learning based algorithms in the detection of spam emails will be further discussed in the following sections.

input pattern is brought closer to the input pattern, whereas the rest of the neurons are left unchanged. The process is repeated a number of times, eventually forming clusters of closely related data points [204].
Porras et al. [206] declared that several benefits can be obtained if SOM is used instead of KNN for clustering, such as eliminating the need to input the number of clusters as one of the parameters to the algorithm. Instead, SOM can use a threshold, a radius-based boundary, to manipulate the algorithm's sensitivity. Further, the topological aspect of the similarity among several clusters can be observed with much ease. Multiple filtering systems work in unison for spam detection in this model. On the other hand, according to the author, SOM calculation complexity may make it less than optimum for datasets that are limited both in size and diversity. The experiment has been carried out on a dataset of 6047 email messages.
Cabrera-León et al. [207] introduced another SOM based system where they used 13 different categories for the emails. The researchers started with a 4-stage preprocessing of emails (both spam and ham). The first stage batch-extracted all the emails' subject and content and filled whitespaces with alphanumeric characters. The second stage removed all the stop words and calculated the raw Term Frequency measure along with some other metadata (spam\ham) for the processing. The following stage built a 13-dimensional integer array to hold the themes and categorize the above-processed texts. The last preprocessing phase added 'weights' to the words of each of the 13 categories. The model was then built using SOM (with the 'Batch' learning method); finally, a threshold value was used to label the clusters accordingly. The framework recorded an accuracy of 94.4%. A concern was noted by the authors in the performance of the model against newer and off-topic emails, where the accuracy was affected, leaving more room for improvement.
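The competitive learning step described for SOM in this section, where only the neuron closest to the input is pulled towards it and the rest are left unchanged, can be sketched as below. This is a simplified winner-take-all update under stated assumptions: the learning rate, the initial weights and the inputs are hypothetical, and the neighbourhood update that a full SOM applies around the winning neuron is deliberately omitted.

```python
def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to the input x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def competitive_step(weights, x, learning_rate=0.5):
    """Move only the winning neuron towards x; all other neurons stay unchanged."""
    winner = best_matching_unit(weights, x)
    weights[winner] = [w + learning_rate * (v - w) for w, v in zip(weights[winner], x)]
    return winner


def train(weights, data, epochs=10, learning_rate=0.5):
    """Repeat the competitive step over the data so neurons drift towards clusters."""
    for _ in range(epochs):
        for x in data:
            competitive_step(weights, x, learning_rate)
    return weights


# Hypothetical two-neuron map and two-feature inputs.
neurons = [[0.0, 0.0], [1.0, 1.0]]
winner = competitive_step(neurons, [0.2, 0.0], learning_rate=0.5)
print(winner, neurons)
```

After repeated steps each neuron settles near the group of inputs it keeps winning, which is how the clusters of closely related data points emerge.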
c: PRINCIPAL COMPONENT ANALYSIS (PCA) BASED FRAMEWORKS
PCA is a statistical framework that works extremely well in most cases for Dimensionality Reduction, in such a fashion that the maximum variation of the dataset is retained [208]. PCA is also a valuable tool in building Predictive Models. The system is an 'Orthogonal Linear Transformation' that transforms the normalized input data to a new coordinate system [209].
Dagher and Antoun [210] deployed four different scenarios regarding feature pre-processing using PCA. Out of the four, two notable ones are representing ham and spam emails using the same and different sets of features. It has been reported that PCA performs best when both the classes of emails are represented using the same features, having an accuracy of 94.5%. On the other hand, depending upon the best selected features, the other three scenarios may as well perform differently. Besides, there is no mention of how the

Smadi et al. [15] presented a framework, 'Phishing Email Detection System (PEDS)', based on both supervised and unsupervised techniques in conjunction with reinforcement learning methodology, which gives the system an increased ability to adapt itself based on the detected changes and modifications in the environment. The target of the system is Zero-Day Phishing attacks [15]. The core of the system, the 'Feature Evaluation and Reduction (FEaR)' algorithm, can select and rank the important features from emails dynamically based on the environmental parameters. FEaR is based on the Regression Tree (RT) algorithm, a subtype of Decision Tree. Immediately after the execution of FEaR, another novel algorithm, DENNuRL (Dynamic Evolving Neural Network using Reinforcement Learning), will take over to allow the core Three-Layer Neural Network of PEDS to evolve dynamically and build the optimum Neural Network possible. DENNuRL has the element of Reinforcement Learning where the degree of 'Reward' has been linked with the Mean Square Error (MSE) of the Neural Network
detection techniques since circa 2010. Facts deduced from the sample can decidedly aid in understanding the direction of research that has been conducted over the years. Some of the obtained insights will also indicate potential future research directions.
1) FINDING A: HIGH ADOPTION OF SUPERVISED TECHNIQUES
The pie chart in Fig. 7 demonstrates the high adoption of supervised techniques in developing or benchmarking such anti-spam systems, with 67% of the selected sample.
consistent.
FIGURE 8. Scatterplot for accuracy of supervised methods (sd: 2.97).
FIGURE 13. Nature of an effective phishing email.

will definitely put the user in a position where he\she will feel the urge to open the email at the earliest instance possible.
Thus, for the content analysis to be effective, along with normal word-based analysis, we believe an automated mechanism is also required which will be able to detect, from multiple angles, how closely an email matches the structure discussed above in Fig. 13 before finally labelling it as an instance of phishing email. Such an approach has not been seen in our studies of the modern content based analysis techniques, and we believe more research is required in this area of content analysis (along with the subject header).
Certainly, to be effective against spammers and fraudsters, a Machine Learning based framework, if fully leveraged, should be able to counter all of these key issues as much as possible. Therefore, we believe that future research should encompass the directions that have been identified in the previous section.

VI. FUTURE WORK
The survey work presented in this paper discussed the types and implications of spam emails on modern society and commerce. A multitude of spam detection frameworks, both Machine Learning based and regular non-automated ones, have been dissected critically to depict a complete picture of the current development and future direction of the field. It is expected that in the near future adequate development will branch out into lesser explored arenas of Machine Learning based spam identification propositions. It is reasonably clear from the reviews that the currently emerging frameworks, even though using automated machine learning based solutions, are often not equipped to deal with the multiple angles

algorithms, such as SVM and Naïve Bayes are in high demand. We have also come to the conclusion that single-algorithm anti-spam systems are quite common, thus the potential of research into hybrid and multi-algorithm systems is quite promising. Besides, research that focusses on email header features excluding the 'subject' field, URLs within the email body and sender domain information needs to substantially increase. Another important area that needs increasing attention is the addressing of 'Concept Drift', which would make a system perform optimally under gradual modification in spamming techniques and motives. In addition, the current way of dealing with spam emails of phishing nature is not the most efficient as described, and thus requires a more innovative approach that will take into account the different angles of the problem.
A point of concern is that despite several admonishments from multiple bodies, governments of a number of leading countries in the world have fallen short in forming effective regulations that can really have a lasting impact on this issue [31]. Nevertheless, the actions to strengthen cybersecurity have seen greater gravity in recent times, resulting in increased research and streamlined availability of funding in this field. Thus it can be expected that a formidable framework, equipped with measures against the drawbacks highlighted in this study, will soon become available for commercial and personal deployment.

REFERENCES
[1] O. Saad, A. Darwish, and R. Faraj, "A survey of machine learning techniques for Spam filtering," Int. J. Comput. Sci. Netw. Secur., vol. 12, no. 2, p. 66, Feb. 2012.
[2] M. K. Paswan, P. S. Bala, and G. Aghila, "Spam filtering: Comparative analysis of filtering techniques," in Proc. Int. Conf. Adv. Eng., Sci. Manage. (ICAESM), Mar. 2012, pp. 170–176.
[3] E. Bauer. 15 Outrageous Email Spam Statistics that Still Ring True in 2018. Accessed: Jul. 20, 2019. [Online]. Available: https://www.propellercrm.com/blog/email-spam-statistics
[4] Statista. Number of E-mail Users Worldwide From 2017 to 2023. Accessed: Jul. 24, 2019. [Online]. Available: https://www.statista.com/statistics/456519/forecast-number-of-active-email-accounts-worldwide/
[5] Y. Cohen, D. Gordon, and D. Hendler, "Early detection of spamming accounts in large-Scale service provider networks," Knowl.-Based Syst., vol. 142, pp. 241–255, Feb. 2018.
[6] Campaign Monitor. The Shocking Truth About How Many Emails Are
from which an email spam threat can spread. Thus the future Sent. Accessed: Jul. 25, 2019. [Online]. Available: https://www.
direction of research in this field should be to develop anti- campaignmonitor.com/blog/email-marketing/2018/03/shocking-truth-
spam software that can simultaneously battle against about-how-many-emails-sent/
[7] O. A. Okunade, ‘‘Manipulating E-mail server feedback for spam preven-
multiple types of email spamming, considering multiple tion,’’ Arid Zone J. Eng., Technol. Environ., vol. 13, no. 3, pp. 391–399,
angles of attack as discussed above, with a single Jun. 2017.
installation of the software. [8] R. Islam and Y. Xiang, ‘‘Email classification using data reduction
method,’’ in Proc. 5th Int. ICST Conf. Commun. Netw. China, Aug.
2010, pp. 1–5.
VII. CONCLUSION [9] Scamwatch. Scam Statistics. Accessed: Jul. 15, 2019. [Online].
After a thorough analysis, the study results in several Available: https://www.scamwatch.gov.au/about-scamwatch/scam-
statistics
different observations especially in the realm of Machine
[10] Scamwatch. Scam Statistics. Accessed: Jul. 16, 2019. [Online].
Learning based proposition. It is noted that high adoption of Available: https://www.scamwatch.gov.au/about-scamwatch/scam-
supervised approaches is quite obvious, the reason behind statistics? scamid=31&date=2019
this turns out to be a better consistency in the performance [11] A. Test. Spam Statistics. Accessed: Jul. 16, 2019. [Online]. Available:
https://www.av-test.org/en/statistics/spam/
of the model. It has also been highlighted that certain
[12] N. Banu and M. Banu, ‘‘A Comprehensive Study of Phishing Attacks,’’ Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 6, pp. 783–786, 2013.
[13] H. Hu and G. Wang, ‘‘Revisiting Email spoofing attacks,’’ 2018, arXiv:1801.00853. [Online]. Available: https://arxiv.org/abs/1801.00853
[14] R. A. Halaseh and J. Alqatawna, ‘‘Analyzing cybercrimes strategies: The case of phishing attack,’’ in Proc. Cybersecur. Cyberforensics Conf. (CCC), Aug. 2016, pp. 82–88.
[15] S. Smadi, N. Aslam, and L. Zhang, ‘‘Detection of online phishing email using dynamic evolving neural network based on reinforcement learning,’’ Decis. Support Syst., vol. 107, pp. 88–102, Mar. 2018.
[16] A. Attar, R. M. Rad, and R. E. Atani, ‘‘A survey of image spamming and filtering techniques,’’ Artif. Intell. Rev., vol. 40, no. 1, pp. 71–105, Jun. 2011.
[17] A. L. Chung-Man, ‘‘An analysis of the impact of phishing and anti-phishing related announcements on market value of global firms,’’ HKU, Hong Kong, Tech. Rep., 2009.
[18] N. Raad, G. Alam, B. Zaidan, and A. Zaidan, ‘‘Impact of spam advertisement through E-mail: A study to assess the influence of the anti-spam on the E-mail marketing,’’ Afr. J. Bus. Manage., vol. 4, no. 11, pp. 2362–2367, Sep. 2010.
[19] TheStar. Company Cheated of RM 4.5 Mil Due to Email Spoofing. Accessed: Jul. 30, 2019. [Online]. Available: https://www.thestar.com.my/news/nation/2017/06/11/kedah-based-company-cheated-due-to-email-spoofing
[20] T. L. Shan, G. N. Samy, B. Shanmugam, S. Azam, K. C. Yeo, and K. Kannoorpatti, ‘‘Heuristic systematic model based guidelines for phishing victims,’’ in Proc. IEEE Annu. India Conf. (INDICON), Dec. 2016, pp. 1–6.
[21] H. M. Al-Mashhadi and M. H. Alabiech, ‘‘A survey of Email service: Attacks, security methods and protocols,’’ Int. J. Comput. Appl., vol. 162, no. 11, pp. 31–40, 2017.
[22] J. V. Chandra, N. Challa, and S. K. Pasupuleti, ‘‘A practical approach to E-mail spam filters to protect data from advanced persistent threat,’’ in Proc. Int. Conf. Circuit, Power Comput. Technol. (ICCPCT), Mar. 2016, pp. 1–5.
[23] ABC Bus Company. Accessed: Apr. 5, 2019. [Online]. Available: https://www.doj.nh.gov/consumer/security-breaches/documents/abc-bus-20180302.pdf
[24] B. B. Gupta, N. A. G. Arachchilage, and K. E. Psannis, ‘‘Defending against phishing attacks: Taxonomy of methods, current issues and future directions,’’ Telecommun. Syst., vol. 67, no. 2, pp. 247–267, Feb. 2017.
[25] A. Han. Leoni AG Loses €40m in an Email Scam. Accessed: Apr. 17, 2019. [Online]. Available: https://www.bankvaultonline.com/news/security-news/leoni-ag-loses-e40m-in-an-email-scam/
[26] M. J. Schwartz. French Cinema Chain Fires Dutch Executives Over CEO Fraud. Accessed: Apr. 17, 2019. [Online]. Available: https://www.bankinfosecurity.com/blogs/french-cinema-chain-fires-dutch-executives-over-ceo-fraud-p-2681
[27] R. Basnet, S. Mukkamala, and A. H. Sung, ‘‘Detection of phishing attacks: A machine learning approach,’’ in Soft Computing Applications in Industry (Studies in Fuzziness and Soft Computing), vol. 226. Berlin, Germany: Springer-Verlag, 2008, pp. 373–383.
[28] R. Vinayakumar, M. Alazab, K. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman, ‘‘Deep learning approach for intelligent intrusion detection system,’’ IEEE Access, vol. 7, pp. 41525–41550, 2019.
[29] M. Alazab, ‘‘Profiling and classifying the behavior of malicious codes,’’ J. Syst. Softw., vol. 100, pp. 91–102, Feb. 2015.
[30] V. Gupta. Understanding Feedforward Neural Networks. Accessed: Jun. 10, 2019. [Online]. Available: www.learnopencv.com/
[31] M. R. Sánchez, T. T. Loon, and V. Victor, ‘‘An anti-spam framework for Singapore,’’ Media Asia, vol. 30, no. 4, pp. 240–246, 2003.
[32] Z. Zhou, ‘‘Disagreement-based semi-supervised learning,’’ Acta Automatica Sinica, vol. 39, no. 11, pp. 1871–1878, 2013, doi: 10.3724/sp.j.1004.2013.01871.
[33] W. Li, W. Meng, Z. Tan, and Y. Xiang, ‘‘Towards designing an Email classification system using multi-view based semi-supervised learning,’’ in Proc. IEEE 13th Int. Conf. Trust, Secur. Privacy Comput. Commun., Sep. 2014, pp. 174–181.
[34] S. Hameed, T. Kloht, and X. Fu, ‘‘Identity based Email sender authentication for spam mitigation,’’ in Proc. 8th Int. Conf. Digit. Inf. Manage. (ICDIM), Sep. 2013, pp. 14–19.
[35] E. Calò. SPF, DKIM and DMARC Brief Explanation and Best Practices. Accessed: Feb. 21, 2019. [Online]. Available: https://www.endpoint.com/blog/2014/04/15/spf-dkim-and-dmarc-brief-explanation
[36] L. Seltzer. DKIM: Useless or Just Disappointing. Accessed: Mar. 8, 2019. [Online]. Available: www.zdnet.com
[37] A. Karim, ‘‘A cryptographic application for secure information transfer in a linux network environment,’’ Amer. J. Eng. Res., vol. 5, no. 8, pp. 266–275, 2016.
[38] D. Sipahi, G. Dalkılıç, and M. H. Özcanhan, ‘‘Detecting spam through their sender policy framework records,’’ Secur. Commun. Netw., vol. 8, no. 18, pp. 3555–3563, Dec. 2015.
[39] M. T. Banday, F. A. Mir, J. A. Qadri, and N. A. Shah, ‘‘Analyzing Internet E-mail date-spoofing,’’ Digit. Invest., vol. 7, nos. 3–4, pp. 145–153, Apr. 2011.
[40] H. Esquivel, A. Akella, and T. Mori, ‘‘On the effectiveness of IP reputation for spam filtering,’’ in Proc. 2nd Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2010, pp. 1–10.
[41] K. S. Bajaj, F. Egbufor, and J. Pieprzyk, ‘‘Critical analysis of spam prevention techniques,’’ in Proc. 3rd Int. Workshop Secur. Commun. Netw. (IWSCN), May 2011, pp. 83–87.
[42] R. R. Roy, ‘‘Basic session initiation protocol,’’ in Handbook on Session Initiation Protocol: Networked Multimedia Communications for IP Telephony. Boca Raton, FL, USA: CRC Press, 2016, pp. 5–166.
[43] R. Ferdous, ‘‘Analysis and protection of SIP based services,’’ Ph.D. dissertation, Dept. Inf. Eng. Comput. Sci., Univ. Trento, Trento, Italy, 2014.
[44] G. Caruana and M. Li, ‘‘A survey of emerging approaches to spam filtering,’’ ACM Comput. Surv., vol. 44, no. 2, p. 9, Feb. 2012.
[45] P. Sousa, A. Machado, M. Rocha, P. Cortez, and M. Rio, ‘‘A collaborative approach for spam detection,’’ in Proc. 2nd Int. Conf. Evolving Internet, Sep. 2010, pp. 92–97.
[46] M. Mojdeh, ‘‘Personal Email spam filtering with minimal user interaction,’’ Ph.D. dissertation, Dept. Comput. Sci., Univ. Waterloo, Waterloo, ON, Canada, 2012.
[47] M. Prilepok, P. Berek, J. Platos, and V. Snasel, ‘‘Spam detection using data compression and signatures,’’ Cybern. Syst. Int. J., vol. 44, nos. 6–7, pp. 533–549, Mar. 2013.
[48] S. Geerthik and P. Anish, ‘‘Filtering spam: Current trends and techniques,’’ Int. J. Mechtron., Elect. Comput. Technol., vol. 3, no. 8, pp. 208–223, Jul. 2013.
[49] E. Damiani, S. D. Capitani, D. Vimercati, S. Paraboschi, and P. Samarati, ‘‘An open digest-based technique for spam detection,’’ in Proc. ISCA 17th Int. Conf. Parallel Distrib. Comput. Syst., 2004, pp. 559–564.
[50] B. Biggio, G. Fumera, I. Pillai, and F. Roli, ‘‘A survey and experimental evaluation of image spam filtering techniques,’’ Pattern Recognit. Lett., vol. 32, no. 10, pp. 1436–1446, Jul. 2011.
[51] Process Software, LLC. Explanation of Common Spam Filtering Techniques. Accessed: Feb. 11, 2019. [Online]. Available: http://www.process.com/products/pmas/whitepapers/explanation_filter_techniques.html
[52] Vernon Schryver. Distributed Checksum Clearinghouses. Accessed: Oct. 5, 2019. [Online]. Available: https://www.dcc-servers.net/dcc/
[53] H. Wang, R. Zhou, and Y. Wang, ‘‘An anti-spam filtering system based on the naive Bayesian classifier and distributed checksum clearinghouse,’’ in Proc. 3rd Int. Symp. Intell. Inf. Technol. Appl., Nov. 2009, pp. 128–131.
[54] J. Chen, R. Fontugne, A. Kato, and K. Fukuda, ‘‘Clustering spam campaigns with fuzzy hashing,’’ in Proc. AINTEC Asian Internet Eng. Conf. (AINTEC), Nov. 2014, p. 66.
[55] A. Karim, ‘‘Multi-layer masking of character data with a visual image key,’’ Int. J. Comput. Netw. Inf. Secur., vol. 10, no. 10, pp. 41–49, Oct. 2017, doi: 10.5815/ijcnis.2017.10.05.
[56] C.-Y. Chiu, A. Prayoonwong, and Y.-C. Liao, ‘‘Learning to index for nearest neighbor search,’’ 2018, arXiv:1807.02962. [Online]. Available: https://arxiv.org/abs/1807.02962
[57] Y. Li, S. C. Sundaramurthy, A. G. Bardas, X. Ou, D. Caragea, X. Hu, and J. Jang, ‘‘Experimental study of fuzzy hashing in malware clustering analysis,’’ in Proc. 8th Workshop Cyber Secur. Experimentation Test (CSET), Berkeley, CA, USA: USENIX Association, 2015, p. 8.
[58] P. Liu and T.-S. Moh, ‘‘Content based spam E-mail filtering,’’ in Proc. Int. Conf. Collaboration Technol. Syst. (CTS), Nov. 2016, pp. 218–224.
[59] D. Chiba, M. Akiyama, T. Yagi, K. Hato, T. Mori, and S. Goto, ‘‘DomainChroma: Building actionable threat intelligence from malicious domain names,’’ Comput. Secur., vol. 77, pp. 138–161, Aug. 2018.
[60] A. Ramachandran, N. Feamster, and S. Vempala, ‘‘Filtering spam with behavioral blacklisting,’’ in Proc. 14th ACM Conf. Comput. Commun. Secur. (CCS), Oct. 2007, pp. 342–351.
[61] Spamhaus. The Spamhaus Project. Accessed: Oct. 5, 2019. [Online]. Available: www.spamhaus.org
[62] M. Sirivianos, K. Kim, and X. Yang, ‘‘SocialFilter: Introducing social trust to collaborative spam mitigation,’’ in Proc. IEEE INFOCOM, Apr. 2011, pp. 2300–2308.
[63] P.-C. Lin, P.-H. Lin, P.-R. Chiou, and C.-T. Liu, ‘‘Detecting spamming activities by network monitoring with Bloom filters,’’ in Proc. 15th Int. Conf. Adv. Commun. Technol. (ICACT), Jan. 2013, pp. 163–168.
[64] Bloom Filters. Accessed: May 15, 2019. [Online]. Available: https://llimllib.github.io
[65] P. Revar, A. Shah, J. Patel, and P. Khanpara, ‘‘A review on different types of spam filtering techniques,’’ Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 2720–2723, May/Jun. 2017.
[66] S. Khanna, H. Chaudhry, and G. S. Bindra, ‘‘Inbound outbound Email traffic analysis and Its SPAM impact,’’ in Proc. 4th Int. Conf. Comput. Intell., Commun. Syst. Netw., Jul. 2012, pp. 181–186.
[67] L. Ilie, ‘‘Regular expression matching,’’ in Encyclopedia of Algorithms. New York, NY, USA: Springer-Verlag, 2014, pp. 1–6, doi: 10.1007/978-3-642-27848-8_340-2.
[68] Pettingers. What’s Working Now in Spam Assassin. Accessed: Feb. 13, 2019. [Online]. Available: http://www.pettingers.org/annoyances/sa-rules.html
[69] S. Khanna, H. Chaudhry, and G. S. Bindra, ‘‘Inbound outbound Email traffic analysis and its SPAM impact,’’ in Proc. 4th Int. Conf. Comput. Intell., Commun. Syst. Netw., Jul. 2012, pp. 181–186.
[70] R. K. Kumar, G. Poonkuzhali, and P. Sudhakar, ‘‘Comparative study on Email spam classifier using data mining techniques,’’ in Proc. Int. Multi Conf. Eng. Comput. Scientists, vol. 1, Mar. 2012, pp. 14–16.
[71] C. Laorden, I. Santos, B. Sanz, G. Alvarez, and P. G. Bringas, ‘‘Word sense disambiguation for spam filtering,’’ Electron. Commerce Res. Appl., vol. 11, no. 3, pp. 290–298, 2012.
[72] D. Kumawat and V. Jain, ‘‘POS tagging approaches: A comparison,’’ Int. J. Comput. Appl., vol. 118, no. 6, pp. 32–38, Jan. 2015.
[73] R. M. Ravindran and A. S. Thanamani, ‘‘K-means document clustering using vector space model,’’ Bonfring Int. J. Data Mining, vol. 5, no. 2, pp. 10–14, Jul. 2015.
[74] H. Che, Q. Liu, L. Zou, H. Yang, D. Zhou, and F. Yu, ‘‘A content-based phishing Email detection method,’’ in Proc. IEEE Int. Conf. Softw. Qual., Rel. Secur. Companion (QRS-C), Jul. 2017, pp. 415–422.
[75] L. A. Zadeh, ‘‘Advances in fuzzy systems—Applications and theory,’’ in Fuzzy Sets, Fuzzy Logic, And Fuzzy Systems: Selected Papers By Lotfi A Zadeh. Singapore: World Scientific Publishing Co Pte Ltd, 1996, pp. 394–432, doi: 10.1142/9789814261302_0021.
[76] Y. Hu, C. Guo, E. W. T. Ngai, M. Liu, and S. Chen, ‘‘A scalable intelligent non-content-based spam-filtering framework,’’ Expert Syst. Appl., vol. 37, no. 12, pp. 8557–8565, Dec. 2010.
[77] W. V. Wanrooij and A. Pras, ‘‘Filtering spam from bad neighborhoods,’’ Int. J. Netw. Manage., vol. 20, no. 6, pp. 433–444, Nov./Dec. 2010.
[78] G. Stringhini, M. Egele, A. Zarras, T. Holz, C. Kruegel, and G. Vigna, ‘‘B@bel: Leveraging Email delivery for spam mitigation,’’ in Proc. 21st USENIX Conf. Secur. Symp., 2012, p. 2.
[79] A. Liska and G. Stowe, ‘‘Understanding DNS,’’ in DNS Security, 2016, pp. 1–23.
[80] D. Bradbury, ‘‘Can we make Email secure,’’ Netw. Secur., vol. 3, no. 3, pp. 13–16, Mar. 2014.
[81] Anon. Proof of Work. Accessed: May 27, 2019. [Online]. Available: https://en.bitcoin.it/wiki/Proof_of_work
[82] SpamAssassin. The Apache SpamAssassin. Accessed: Oct. 6, 2019. [Online]. Available: https://spamassassin.apache.org/
[83] Capterra. Com. Anti-Spam Software. Accessed: Oct. 6, 2019. [Online]. Available: www.capterra.com/anti-spam-software/
[84] ZeroSpam. The ZeroSpam Solution. Accessed: Oct. 6, 2019. [Online]. Available: https://www.zerospam.ca/en/home/
[85] A. Darwish, ‘‘Bio-inspired computing: Algorithms review, deep analysis, and the scope of applications,’’ Future Comput. Informat. J., vol. 3, no. 2, pp. 231–246, Dec. 2018.
[86] D. Ruano-Ordás, F. Fdez-Riverola, and J. R. Méndez, ‘‘Using evolutionary computation for discovering spam patterns from E-mail samples,’’ Inf. Process. Manage., vol. 54, no. 2, pp. 303–317, Mar. 2018.
[87] B. Meadows, P. Riddle, C. Skinner, and M. M. Barley, ‘‘Evaluating the seeding genetic algorithm,’’ in Advances in Artificial Intelligence (Lecture Notes in Computer Science). Springer, 2013, pp. 221–227.
[88] E. Conrad. Detecting Spam With Genetic Regular Expressions. London, U.K.: SANS Institute InfoSec Reading Room, 2007.
[89] I. Idris, A. Selamat, N. T. Nguyen, S. Omatu, O. Krejcar, K. Kuca, and M. Penhaker, ‘‘A combined negative selection algorithm-particle swarm optimization for an email spam detection system,’’ Eng. Appl. Artif. Intell., vol. 39, pp. 33–44, Mar. 2015.
[90] I. Idris, A. Selamat, and S. Omatu, ‘‘Hybrid email spam detection model with negative selection algorithm and differential evolution,’’ Eng. Appl. Artif. Intell., vol. 28, no. 2, pp. 97–110, Jan. 2014.
[91] A. J. Saleh, A. Karim, B. Shanmugam, S. Azam, K. Kannoorpatti, M. Jonkman, and F. D. Boer, ‘‘An intelligent spam detection model based on artificial immune system,’’ Information, vol. 10, no. 6, p. 209, Jun. 2019.
[92] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri, ‘‘Self-nonself discrimination in a computer,’’ in Proc. IEEE Comput. Soc. Symp. Res. Secur. Privacy, May 2014, pp. 202–212.
[93] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes. Abu Dhabi, United Arab Emirates: LuLu, 2012.
[94] S. S. Aote, M. M. Raghuwanshi, and L. Malik, ‘‘A brief review on particle swarm optimization: Limitations & future directions,’’ Int. J. Comput. Sci. Eng., vol. 2, no. 5, pp. 196–200, Sep. 2013.
[95] S. Salhi and N. M. Queen, ‘‘A hybrid algorithm for identifying global and local minima when optimizing functions with many minima,’’ Eur. J. Oper. Res., vol. 155, no. 1, pp. 51–67, May 2004.
[96] Y. Zhu and Y. Tan, ‘‘Extracting discriminative information from e-mail for spam detection inspired by immune system,’’ in Proc. IEEE Congr. Evol. Comput., Jul. 2010, pp. 1–7.
[97] M. Z. Hayat, J. Basiri, L. Seyedhossein, and A. Shakery, ‘‘Content-based concept drift detection for Email spam filtering,’’ in Proc. 5th Int. Symp. Telecommun., Dec. 2010, pp. 531–536.
[98] K. Jackowski, B. Krawczyk, and M. Woźniak, ‘‘Application of adaptive splitting and selection classifier to the spam filtering problem,’’ Cybern. Syst. Int. J., vol. 44, nos. 6–7, pp. 569–588, 2013.
[99] D. Wang, D. Irani, and C. Pu, ‘‘Is Email business dying?: A study on evolution of Email spam over fifteen years,’’ EAI Endorsed Trans. Collaborative Comput., vol. 1, no. 1, p. e3, May 2014.
[100] R. Sathya and A. Abraham, ‘‘Comparison of supervised and unsupervised learning algorithms for pattern classification,’’ Int. J. Adv. Res. Artif. Intell., vol. 2, no. 2, pp. 34–38, Feb. 2013.
[101] F. Qian, A. Pathak, Y. C. Hu, Z. M. Mao, and Y. Xie, ‘‘A case for unsupervised-learning-based spam filtering,’’ ACM SIGMETRICS Perform. Eval. Rev., vol. 38, no. 1, p. 367, Dec. 2010.
[102] R. Agrawal, T. Imieliński, and A. Swami, ‘‘Mining association rules between sets of items in large databases,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), May 1993, pp. 207–216.
[103] P. H. B. Las-Casas, J. M. Almeida, M. A. Gonzalves, D. Guedes, A. Ziviani, and H. T. Marques-Neto, ‘‘Adaptive spammer detection at the source network,’’ in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2013, pp. 1434–1439.
[104] X. Zhu, ‘‘Semi-supervised learning,’’ in Encyclopedia of Machine Learning and Data Mining. New York, NY, USA: Springer, 2010, pp. 1142–1147, doi: 10.1007/978-1-4899-7687-1_749.
[105] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, ‘‘A survey on reinforcement learning models and algorithms for traffic signal control,’’ ACM Comput. Surv., vol. 50, no. 3, p. 34, Oct. 2017.
[106] M. A. Karim, J. Currie, and T.-T. Lie, ‘‘Dynamic event detection using a distributed feature selection based machine learning approach in a self-healing microgrid,’’ IEEE Trans. Power Syst., vol. 33, no. 5, pp. 4706–4718, Sep. 2018.
[107] D. Hassan, ‘‘On Determining the most effective subset of features for detecting phishing Websites,’’ Int. J. Comput. Appl., vol. 122, no. 20, pp. 1–7, Jan. 2015.
[108] Feature Extraction Foundations and Applications, Springer-Verlag, New York, NY, USA, 2016.
[109] O. A. Adewumi and A. A. Akinyelu, ‘‘A hybrid firefly and support vector machine classifier for phishing email detection,’’ Kybernetes, vol. 45, no. 6, pp. 977–994, Jun. 2016.
[110] M. Khonji, Y. Iraqi, and A. Jones, ‘‘Phishing detection: A literature survey,’’ IEEE Commun. Surveys Tuts., vol. 15, no. 4, pp. 2091–2121, 4th Quart, 2013.
[111] I. Qabajeh and F. Thabtah, ‘‘An experimental study for assessing email classification attributes using feature selection methods,’’ in Proc. 3rd Int. Conf. Adv. Comput. Sci. Appl. Technol., Dec. 2014, pp. 125–132.
[112] R. Mohammad, T. L. McCluskey, and T. Fadi, ‘‘An assessment of features related to phishing Websites using an automated technique,’’ in Proc. Int. Conf. Internet Technol. Secured Trans., London, U.K., Dec. 2012, pp. 492–497.
[113] A. Yasin and A. Abuhasan, ‘‘An intelligent classification model for Phishing Email detection,’’ Int. J. Netw. Secur. Appl., vol. 8, no. 4, pp. 55–72, 2016.
[114] T. Rashid, Make Your Own Neural Network: A Gentle Journey Through the Mathematics of Neural Networks, and Making Your Own Using the Python Computer Language. Charleston, SC, USA: CreateSpace, 2017.
[115] A. Nosseir, K. Nagati, and I. Taj-Eddin, ‘‘Intelligent word-based spam filter detection using multi-neural networks,’’ Int. J. Comput. Sci. Issues, vol. 10, no. 2, p. 17, Mar. 2013.
[116] M. F. Porter, ‘‘An algorithm for suffix stripping,’’ Program, vol. 14, no. 3, pp. 130–137, 1980.
[117] A. Malge and S. M. Chaware, ‘‘An efficient framework for spam mail detection in attachments using NLP,’’ Int. J. Sci. Res., vol. 5, no. 6, pp. 1121–1125, May 2016.
[118] Y. Zhang, R. Jin, and Z. Zhou, ‘‘Understanding bag-of-words model: A statistical framework,’’ Int. J. Mach. Learn. Cybern., vol. 1, nos. 1–4, pp. 43–52, Dec. 2010.
[119] J. Singh and V. Gupta, ‘‘Text Stemming: Approaches, applications, and challenges,’’ ACM Comput. Surv., vol. 49, no. 3, p. 45, Dec. 2016.
[120] L. Deng and D. Yu, Deep Learning: Methods and Applications. Boston, MA, USA: Now, 2014.
[121] M. F. A. Foysal, M. S. Islam, A. Karim, and N. Neehal, ‘‘Shot-Net: A convolutional neural network for classifying different cricket shots,’’ in Proc. Int. Conf. Recent Trends Image Process. Pattern Recognit., 2019, pp. 111–120.
[122] S. Seth and S. Biswas, ‘‘Multimodal spam classification using deep learning techniques,’’ in Proc. 13th Int. Conf. Signal-Image Technol. Internet-Based Syst. (SITIS), Dec. 2017, pp. 346–349.
[123] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[124] H. Jansma. Don’t Use Dropout in Convolutional Networks—Towards Data Science. Accessed: Jul. 30, 2019. [Online]. Available: https://towardsdatascience.com/dont-use-dropout-in-convolutional-networks-81486c823c16
[125] E.-X. Shang and H.-G. Zhang, ‘‘Image spam classification based on convolutional neural network,’’ in Proc. Int. Conf. Mach. Learn. Cybern. (ICMLC), Jul. 2016, pp. 398–403.
[126] A. Barushka and P. Hajek, ‘‘Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks,’’ Appl. Intell., vol. 48, no. 10, pp. 3538–3556, Oct. 2018.
[127] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, ‘‘Large-scale multi-label text classification—Revisiting neural networks,’’ in Machine Learning and Knowledge Discovery in Databases (Lecture Notes in Computer Science), vol. 10534. Cham, Switzerland: Springer, 2014, pp. 437–452.
[128] P. Bermejo, J. A. Gámez, and J. M. Puerta, ‘‘Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets,’’ Expert Syst. Appl., vol. 38, no. 3, pp. 2072–2080, Mar. 2011.
[129] R. T. Nakatsu, ‘‘Information visualizations used to avoid the problem of overfitting in supervised machine learning,’’ in HCI in Business, Government and Organizations. Supporting Business (Lecture Notes in Computer Science), vol. 10294. Cham, Switzerland: Springer, 2017, pp. 373–385.
[130] T. A. Almeida, J. Almeida, and A. Yamakami, ‘‘Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers,’’ J. Internet Services Appl., vol. 1, no. 3, pp. 183–200, Feb. 2010.
[131] L. Melian and A. Nursikuwagus, ‘‘Prediction student eligibility in vocation school with Naïve-Byes decision algorithm,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 407, no. 1, 2018, Art. no. 012140.
[132] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth, ‘‘Machine learning for Internet of Things data analysis: A survey,’’ Digit. Commun. Netw., vol. 4, no. 3, pp. 161–175, 2018.
[133] C. Bielza and P. Larrañaga, ‘‘Discrete Bayesian network classifiers: A survey,’’ ACM Comput. Surv., vol. 47, no. 1, p. 5, Jul. 2014.
[134] S. Manlangit, S. Azam, B. Shanmugam, K. Kannoorpatti, M. Jonkman, and A. Balasubramaniam, ‘‘An efficient method for detecting fraudulent transactions using classification algorithms on an anonymized credit card data set,’’ in Intelligent Systems Design and Applications (Advances in Intelligent Systems and Computing), vol. 736. Cham, Switzerland: Springer, 2018, pp. 418–429.
[135] B. Zhou, Y. Yao, and J. Luo, ‘‘Cost-sensitive three-way Email spam filtering,’’ J. Intell. Inf. Syst., vol. 42, no. 1, pp. 19–45, Feb. 2013.
[136] L. Ting and Y. Qingsong, ‘‘Spam feature selection based on the improved mutual information algorithm,’’ in Proc. 4th Int. Conf. Multimedia Inf. Netw. Secur., Nov. 2012, pp. 67–70.
[137] N. Jatana and K. Sharma, ‘‘Bayesian spam classification: Time efficient radix encoded fragmented database approach,’’ in Proc. Int. Conf. Comput. Sustain. Global Develop. (INDIACom), Mar. 2014, pp. 939–942.
[138] D. Ranganayakulu and C. C, ‘‘Detecting malicious URLs in E-mail—An implementation,’’ AASRI Procedia, vol. 4, pp. 125–131, Jan. 2013.
[139] C.-N. Lee, Y.-R. Chen, and W.-G. Tzeng, ‘‘An online subject-based spam filter using natural language features,’’ in Proc. IEEE Conf. Dependable Secure Comput., Aug. 2017, pp. 479–487.
[140] S. Hegelich, ‘‘Decision trees and random forests: Machine learning techniques to classify rare events,’’ Eur. Policy Anal., vol. 2, no. 1, pp. 98–120, 2016.
[141] T. Ouyang, S. Ray, M. Allman, and M. Rabinovich, ‘‘A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise,’’ Comput. Netw., vol. 59, pp. 101–121, Feb. 2014.
[142] The R Implementation of Rulefit Learning Algorithm. Accessed: Jun. 17, 2019. [Online]. Available: www.stat.stanford.edu/jhf/R-RuleFit.html
[143] M. Sheikhalishahi, M. Mejri, and N. Tawbi, ‘‘Clustering spam Emails into campaigns,’’ in Proc. 1st Int. Conf. Inf. Syst. Secur. Privacy, Feb. 2015, pp. 90–97.
[144] J.-J. Sheu, K.-T. Chu, N.-F. Li, and C.-C. Lee, ‘‘An efficient incremental learning mechanism for tracking concept drift in spam filtering,’’ Plos ONE, vol. 12, no. 2, Sep. 2017, Art. no. e0171518.
[145] K. Fawagreh, M. M. Gaber, and E. Elyan, ‘‘Random forests: From early developments to recent advancements,’’ Syst. Sci. Control Eng., vol. 2, no. 1, pp. 602–609, 2014.
[146] K. N. Tran, M. Alazab, and R. Broadhurst, ‘‘Towards a feature rich model for predicting spam Emails containing malicious attachments and URLs,’’ in Proc. 11th Australas. Data Mining Conf. (AusDM), 2014, pp. 1–10.
[147] R. Shams and R. E. Mercer, ‘‘Classifying spam Emails using text and readability features,’’ in Proc. IEEE 13th Int. Conf. Data Mining, Dec. 2013, pp. 657–666.
[148] S. Sperandei, ‘‘Understanding logistic regression analysis,’’ Biochemia Medica, vol. 24, no. 1, pp. 12–18, 2014.
[149] C. Constantin, ‘‘Using the logistic regression model in supporting decisions of establishing marketing strategies,’’ Bull. Transilvania Univ. Braşov, vol. 8, no. 2, p. 57, Jul. 2015.
[150] K. Pawar and M. Patil, ‘‘Pattern classification under attack on spam filtering,’’ in Proc. IEEE Int. Conf. Res. Comput. Intell. Commun. Netw. (ICRCICN), Nov. 2015, pp. 197–201.
[151] B. Schoelkopf, Empirical Inference. Berlin, Germany: Springer-Verlag, 2016.
[152] J. Nayak, B. Naik, and H. S. Behera, ‘‘A comprehensive survey on support vector machine in data mining tasks: Applications & challenges,’’ Int. J. Database Theory Appl., vol. 8, no. 1, pp. 169–186, Dec. 2015.
[153] S. Winters-Hilt and S. Merat, ‘‘SVM clustering,’’ BMC Bioinformatics, vol. 8, no. S7, 2007.
[154] S. Winters-Hilt, ‘‘Clustering via support vector machine boosting with simulated annealing,’’ Int. J. Comput. Optim., vol. 4, no. 1, pp. 53–89, 2017.
[155] M. Diale, C. Van Der Walt, T. Celik, and A. Modupe, ‘‘Feature selection and support vector machine hyper-parameter optimisation for spam detection,’’ in Proc. Pattern Recognit. Assoc. South Africa Robot. Mechatronics Int. Conf. (PRASA-RobMech), Nov. 2016, pp. 1–7.
[156] O. Amayri and N. Bouguila, ‘‘Content-based spam filtering using hybrid generative discriminative learning of both textual and visual features,’’ in Proc. IEEE Int. Symp. Circuits Syst., May 2012, pp. 862–865.
[157] S. S. Roy, A. Sinha, R. Roy, C. Barna, and P. Samui, ‘‘Spam Email detection using deep support vector machine, support vector machine and artificial neural network,’’ in Soft Computing Applications (Advances in Intelligent Systems and Computing). Berlin, Germany: Springer-Verlag, Apr. 2017, pp. 162–174.
[158] R. Wang, ‘‘AdaBoost for feature selection, classification and its relation with SVM, A review,’’ Phys. Procedia, vol. 25, pp. 800–807, Jan. 2012.
[159] R. Varghese and K. A. Dhanya, ‘‘Efficient feature set for spam Email filtering,’’ in Proc. IEEE 7th Int. Advance Comput. Conf. (IACC), Jan. 2017, pp. 732–737.
[160] H. Zuhair, A. Selmat, and M. Salleh, ‘‘The Effect of Feature Selection on Phish Website Detection,’’ Int. J. Adv. Comput. Sci. Appl., vol. 6, no. 10, pp. 221–232, 2016.
[161] L.-Y. Hu, M.-W. Huang, S.-W. Ke, and C.-F. Tsai, ‘‘The distance function effect on k-nearest neighbor classification for medical datasets,’’ SpringerPlus, vol. 5, no. 1, p. 1304, Aug. 2016.
[162] A. Sharma and A. Suryawanshi, ‘‘A novel method for detecting spam Email using KNN classification with spearman correlation as distance measure,’’ Int. J. Comput. Appl., vol. 136, no. 6, pp. 28–35, Feb. 2016.
[163] Spearman’s Rank-Order Correlation. Accessed: Jul. 15, 2019. [Online]. Available: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
[164] W. Wang, ‘‘Heterogeneous Bayesian ensembles for classifying spam Emails,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–8.
[165] J. Large, J. Lines, and A. Bagnall, ‘‘The heterogeneous ensembles of standard classification algorithms (HESCA): The whole is greater than the sum of its parts,’’ 2017, arXiv:1710.09220. [Online]. Available: https://arxiv.org/abs/1710.09220
[166] M. Shuaib, O. Osho, I. Ismaila, and J. K. Alhassan, ‘‘Comparative analysis of classification algorithms for Email spam detection,’’ Int. J. Comput. Netw. Inf. Secur., vol. 10, no. 1, pp. 60–67, Aug. 2018.
[167] A. Wijaya and A. Bisri, ‘‘Hybrid decision tree and logistic regression classifier for Email spam detection,’’ in Proc. 8th Int. Conf. Inf. Technol. Elect. Eng. (ICITEE), Oct. 2016, pp. 1–4.
[168] S. Nizamani, N. Memon, M. Glasdam, and D. D. Nguyen, ‘‘Detection of fraudulent emails by employing advanced feature abundance,’’ Egyptian Informat. J., vol. 15, no. 3, pp. 169–174, Nov. 2014.
[169] I. Alsmadi and I. Alhami, ‘‘Clustering and classification of email contents,’’ J. King Saud Univ.-Comput. Inf. Sci., vol. 27, no. 1, pp. 46–57, Jan. 2015.
[170] W. Feng, J. Sun, L. Zhang, C. Cao, and Q. Yang, ‘‘A support vector machine based naive Bayes algorithm for spam filtering,’’ in Proc. IEEE 35th Int. Perform. Comput. Commun. Conf. (IPCCC), Dec. 2016, pp. 1–8.
[171] R. Islam and J. Abawajy, ‘‘A multi-tier phishing detection and filtering approach,’’ J. Netw. Comput. Appl., vol. 36, no. 1, pp. 324–335, Jan. 2013.
[172] I. R. A. Hamid and J. Abawajy, ‘‘Phishing Email feature selection approach,’’ in Proc. IEEE 10th Int. Conf. Trust, Secur. Privacy Comput. Commun., Nov. 2011, pp. 916–921.
[173] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, ‘‘A comparison of machine learning techniques for phishing detection,’’ in Proc. Anti-Phishing Working Groups 2nd Annu. Ecrime Researchers Summit, vol. 7, Oct. 2007, pp. 60–69.
[174] S. More and S. A. Kulkarni, ‘‘Data mining with machine learning applied for email deception,’’ in Proc. Int. Conf. Opt. Imag. Sensor Secur. (ICOSS), Jul. 2013, pp. 1–4, doi: 10.1109/icoiss.2013.6678403.
[175] R. Shams and R. E. Mercer, ‘‘Personalized spam filtering with natural language attributes,’’ in Proc. 12th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA). Miami, FL, USA: IEEE, Dec. 2013, pp. 127–132.
[176] A. S. Aski and N. K. Sourati, ‘‘Proposed efficient algorithm to filter spam using machine learning techniques,’’ Pacific Sci. Rev. A, Natural Sci. Eng., vol. 18, no. 2, pp. 145–149, Jul. 2016.
[177] C. Zhang, X. Su, Y. Hu, Z. Zhang, and Y. Deng, ‘‘An evidential spam-filtering framework,’’ Cybern. Syst. Int. J., vol. 47, no. 6, pp. 427–444, Jun. 2016.
[178] J. Kukulies and R. H. Schmitt, ‘‘Uncertainty-based test planning using dempster-shafer theory of evidence,’’ in Proc. 2nd Int. Conf. Syst. Rel. Saf. (ICSRS), Dec. 2017, pp. 243–249.
[179] R. Sun and Y. Deng, ‘‘A new method to determine generalized basic probability assignment in the open world,’’ IEEE Access, vol. 7, no. 1, pp. 52827–52835, 2019.
[180] S. Ergin and S. Isik, ‘‘The investigation on the effect of feature vector dimension for spam email detection with a new framework,’’ in Proc. 9th Iberian Conf. Inf. Syst. Technol. (CISTI), Jun. 2014, pp. 1–4.
[181] G. Forman, ‘‘An extensive empirical study of feature selection metrics for text classification,’’ J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003.
[182] A. Sharaff, N. K. Nagwani, and A. Dhadse, ‘‘Comparative study of classification algorithms for spam Email detection,’’ in Emerging Research in Computing, Information, Communication and Applications. Singapore: Springer, 2015, pp. 237–244.
[183] R. Sharma and G. Kaur, ‘‘E-mail spam detection using SVM and RBF,’’ Int. J. Modern Edu. Comput. Sci., vol. 8, no. 4, pp. 57–63, Apr. 2016.
[184] S. A. Saab, N. Mitri, and M. Awad, ‘‘Ham or spam? A comparative study for some content-based classification algorithms for email filtering,’’ in Proc. 17th IEEE Medit. Electrotech. Conf. (MELECON), Apr. 2014, pp. 339–343, doi: 10.1109/melcon.2014.6820574.
[185] F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof, and M. Köppen, ‘‘Detecting malicious URLs using machine learning techniques,’’ in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2016, pp. 1–8.
[186] P. Sedgwick, ‘‘Pearson’s correlation coefficient,’’ BMJ, vol. 345, Jul. 2012, Art. no. e4483, doi: 10.1136/bmj.e4483.
[187] A. Qaroush, I. M. Khater, and M. Washaha, ‘‘Identifying spam e-mail based-on statistical header features and sender behavior,’’ in Proc. CUBE Int. Inf. Technol. Conf. (CUBE), vol. 12, Sep. 2012, pp. 771–778.
[188] E. M. Bahgat, S. Rady, W. Gad, and I. F. Moawad, ‘‘Efficient email classification approach based on semantic methods,’’ Ain Shams Eng. J., vol. 9, no. 4, pp. 3259–3269, Dec. 2018.
[189] L. T. Nguyen and K. M. Huynh, ‘‘Using WordNet similarity and translations to create Synsets for ontology-based vietnamese WordNet,’’ in Proc. 5th IIAI Int. Congr. Adv. Appl. Inform. (IIAI-AAI), Jul. 2016, pp. 651–656.
[190] E. M. Bahgat and I. F. Moawad, ‘‘Semantic-based feature reduction approach for E-mail classification,’’ in Proc. Int. Conf. Adv. Intell. Syst. Inform., 2016, pp. 53–63.
[191] E. M. Bahgat, S. Rady, and W. Gad, ‘‘An E-mail filtering approach using classification techniques,’’ in Proc. 1st Int. Conf. Adv. Intell. Syst. Inform. (AISI), Beni Suef, Egypt, Oct. 2015, pp. 321–331.
[192] M. K. Anjali and A. P. Babu, ‘‘Ambiguities in natural language processing,’’ Int. J. Innov. Res. Comput. Commun. Eng., vol. 2, no. 5, pp. 392–394, 2014.
[193] T. A. Almeida, T. P. Silva, I. Santos, and J. M. G. Hidalgo, ‘‘Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering,’’ Knowl.-Based Syst., vol. 108, pp. 25–32, Sep. 2016.
[194] J. R. Méndez, T. R. Cotos-Yañez, and D. Ruano-Ordás, ‘‘A new semantic-based feature selection method for spam filtering,’’ Appl. Soft Comput., vol. 76, pp. 89–104, Mar. 2019.
[195] M. A. Syakur, B. K. Khotimah, E. M. S. Rochman, and B. D. Satoto, ‘‘Integration K-means clustering method and elbow method for identification of the best customer profile cluster,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 336, no. 1, 2018, Art. no. 012017.
[196] M. Basavaraju and D. R. Prabhakar, ‘‘A novel method of spam mail detection using text based clustering approach,’’ Int. J. Comput. Appl., vol. 5, no. 4, pp. 15–25, Aug. 2010.
[197] C. Laorden, X. Ugarte-Pedrero, I. Santos, B. Sanz, J. Nieves, and P. G. Bringas, ‘‘Study on the effectiveness of anomaly detection for spam filtering,’’ Inf. Sci., vol. 277, pp. 421–444, Sep. 2014.
[198] U. Kutbay, ‘‘Partitional clustering,’’ Recent Applications in Data Clustering, 2018, doi: 10.5772/intechopen.75836.
[199] R. D. Kortum, ‘‘Hyperonyms and hyponyms,’’ in Varieties of Tone. New York, NY, USA: Palgrave Macmillan, 2013, pp. 178–180, doi: 10.1057/9781137263544_23.
[200] M. Alazab, R. Layton, R. Broadhurst, and B. Bouhours, ‘‘Malicious spam Emails developments and authorship attribution,’’ in Proc. 4th Cybercrime Trustworthy Comput. Workshop. Bundoora, Australia: La Trobe University, Nov. 2013, pp. 58–68.
[201] S. Halder, R. Tiwari, and A. Sprague, ‘‘Information extraction from spam emails using stylistic and semantic features to identify spammers,’’ in Proc. IEEE Int. Conf. Inf. Reuse Integr., Aug. 2011, pp. 104–107.
[202] K. Xu, ‘‘Expectation-maximization algorithm,’’ in Encyclopedia of Systems Biology. New York, NY, USA: Springer-Verlag, 2013, doi: 10.1007/978-1-4419-9863-7_449.
[203] J. M. Hancock, ‘‘Self-organizing map (Kohonen Map, SOM),’’ in Dictionary of Bioinformatics and Computational Biology. Hoboken, NJ, USA: Wiley, 2004, doi: 10.1002/0471650129.dob0661.
[204] D. I. Kumar and M. R. Kounte, ‘‘Comparative study of self-organizing map and deep self-organizing map using MATLAB,’’ in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Apr. 2016, pp. 1020–1023.
[205] A. Azab, R. Layton, M. Alazab, and J. Oliver, ‘‘Mining malware to detect variants,’’ in Proc. 5th Cybercrime Trustworthy Comput. Conf., Nov. 2014, pp. 44–53.
[206] S. Porras, B. Baruque, B. Vaquerizo, and E. Corchado, ‘‘Clustering ensemble for spam filtering,’’ in Hybrid Artificial Intelligent Systems (Lecture Notes in Computer Science). Springer, 2011, pp. 363–372.
[207] Y. Cabrera-León, P. G. Báez, and C. P. Suárez-Araujo, ‘‘Self-organizing Maps in the design of anti-spam filters a proposal based on thematic categories,’’ in Proc. 8th Int. Joint Conf. Comput. Intell., 2016, pp. 1–12.
[208] X. Kong, C. Hu, and Z. Duan, ‘‘Generalized principal component analysis,’’ in Principal Component Analysis Networks and Algorithms. Singapore: Springer, 2017, pp. 185–233.
[209] I. T. Jolliffe, ‘‘Principal component analysis,’’ in Springer Series in Statistics, vol. 28, 2nd ed. New York, NY, USA: Springer, 2002, p. 487.
[210] I. Dagher and R. Antoun, ‘‘Ham-spam filtering using different PCA scenarios,’’ in Proc. IEEE Intl Conf. Comput. Sci. Eng. (CSE) IEEE Intl Conf. Embedded Ubiquitous Comput. (EUC) 15th Intl Symp. Distrib. Comput. Appl. Bus. Eng. (DCABES), Aug. 2016, pp. 542–545.
[211] J. C. Gomez, E. Boiy, and M.-F. Moens, ‘‘Highly discriminative statistical features for email classification,’’ Knowl. Inf. Syst., vol. 31, no. 1, pp. 23–53, 2012.
[212] R. Silva, M. A. Gonçalves, and A. Veloso, ‘‘Rule-based active sampling for learning to rank,’’ in Proc. ECML PKDD, 2011, pp. 240–255.
[213] D. Hassan, ‘‘Investigating the effect of combining text clustering with classification on improving spam email detection,’’ in Intelligent Systems Design and Applications (Advances in Intelligent Systems and Computing). Cham, Switzerland: Springer, 2017, pp. 99–107, doi: 10.1007/978-3-319-53480-0_10.
[214] H. Padhiyar and P. Rekh, ‘‘An improved expectation maximization based semi-supervised email classification using Naive Bayes and k-nearest neighbor,’’ Int. J. Comput. Appl., vol. 101, no. 6, pp. 7–11, Jan. 2014.
[215] A. Chakrabarty and S. Roy, ‘‘An optimized k-NN classifier based on minimum spanning tree for email filtering,’’ in Proc. 2nd Int. Conf. Bus. Inf. Manage. (ICBIM), Jan. 2014, pp. 47–52.
[216] D. Debarr and H. Wechsler, ‘‘Spam detection using Random Boost,’’ Pattern Recognit. Lett., vol. 33, no. 10, pp. 1237–1244, Jul. 2012.
[217] M. S. Junayed, A. A. Jeny, S. T. Atik, N. Neehal, A. Karim, S. Azam, and B. Shanmugam, ‘‘AcneNet - A deep CNN based classification approach for acne classes,’’ in Proc. 12th Int. Conf. Inf. Commun. Technol. Syst. (ICTS), Jul. 2019, pp. 203–208.

ASIF KARIM is currently a Ph.D. Researcher with Charles Darwin University, Australia. His research interests include machine intelligence and cryptographic communication. He is currently working toward the development of a robust and advanced email filtering system, primarily using machine learning algorithms. He has considerable industry experience in IT, primarily in the field of software engineering.

BHARANIDHARAN SHANMUGAM is currently a research-intensive Lecturer with the College of Engineering and IT, Charles Darwin University, Australia. He has a large number of publications in several different journals and conference proceedings. His research interest mainly revolves around the field of cybersecurity.

KRISHNAN KANNOORPATTI is currently a Research Active Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. In addition to being a stellar academic and innovative researcher, he also has extensive experience of working with government bodies in setting up data privacy policies at national and state level.