Phishing Email Detection Using Cuckoo Search With SVM
Weina Niu, Xiaosong Zhang, Guowu Yang, Zhiyuan Ma, Zhongliu Zhuo
School of Computer Science and Engineering, and Center for Cyber Security
University of Electronic Science and Technology of China, UESTC
Chengdu, China
Email: [email protected], {johnsonzxs,guowu}@uestc.edu.cn, [email protected], [email protected]
Abstract—Phishing attacks are common online and have resulted in financial losses through either malware or social engineering. Thus, phishing email detection with high accuracy has been an issue of great interest. Machine learning-based detection methods, particularly the Support Vector Machine (SVM), have proved to be effective. However, the parameters of the kernel method, whose default value is generally the reciprocal of the number of classes, affect the classification accuracy of SVM. To improve the classification accuracy, this paper proposes a model called Cuckoo Search SVM (CS-SVM). CS-SVM extracts 23 features, which are used to construct the hybrid classifier. In the hybrid classifier, Cuckoo Search (CS) is integrated with SVM to optimize the parameter selection of the Radial Basis Function (RBF). Experiments are performed on a dataset consisting of 1,384 phishing emails and 20,071 non-phishing emails. Experimental results show that the proposed method has higher phishing email detection accuracy than an SVM classifier with the default parameter value. The CS-SVM classifier obtains a highest accuracy of 99.52 percent.

Keywords-APT; phishing email detection; SVM; Cuckoo search; RBF

I. INTRODUCTION

In recent years, more sophisticated attacks have become a common problem in cyber security, one of which is the Advanced Persistent Threat (APT) [1][2]. APT attackers often use social engineering techniques, for instance phishing email, to invade the target network. Attackers send target users emails containing a phishing link to a malicious website or an attachment that contains malicious programs. Then, attackers deceive target users into installing a malicious program and control the target host to steal sensitive information or cause damage. Additionally, the number of phishing emails continues to rise. The Phishing Activity Trends Report [3] published by APWG finds 446,065 unique phishing sites, 18 million new malware samples in the second quarter of 2016 alone, and an average of more than 200,000 per day.

The content-based approach is the most accurate phishing detection method since it can identify new phishing attacks. The Support Vector Machine (SVM) [4] is the most popular technology of all the content-based approaches. Machine learning-based techniques have been shown to be effective at detecting phishing email [5]. However, parameter selection for the kernel method has a significant effect on the SVM classifier accuracy. Features also have an impact on identifying new phishing emails.

In this paper, we extract 23 features, including body-based features, URL-based features, and header-based features, to detect phishing emails. Then, we build a hybrid classifier based on these 23 features together, where Cuckoo Search (CS) [6] is used for parameter selection of the kernel function. The hybrid classifier combining CS with SVM is evaluated on a testing dataset including old and new phishing emails and yields a better result than an SVM classifier with the default parameter of the Radial Basis Function (RBF).

The remainder of the paper is organized as follows: Section II gives an overview of the related work on phishing email detection; Section III describes our proposed CS-SVM classification method and the 23 features used for phishing identification; Section IV introduces the experimental setup and analyzes the experimental results; the conclusion and future work are summarized in Section V.

II. RELATED WORK

Some phishing email detection techniques have been proposed in recent years to reduce the damage caused by phishing attacks. Generally, phishing detection can be classified into the network-based approach, blacklist, whitelist, and content-based approach [7]. The network-based approach identifies phishing email by interfering with TCP and UDP sessions. Since most message content is transmitted in encrypted mode, the network-based approach is difficult to implement. The blacklist approach relies on a characteristic library for phishing email recognition. This would be the same for the whitelist. Although blacklists and whitelists are simple, they fail to detect new phishing attacks. Moreover, collecting blacklists and whitelists is time-consuming. The content-based approach is designed to obtain attack patterns and has the highest detection accuracy among existing detection approaches. Meanwhile, the content-based detection approach often makes use of machine learning techniques to discover new phishing emails with high identification accuracy. The comparisons among these approaches are shown in Table I.

Features in the content-based detection literature are classified into URL-based and text-based. Text-based features include message headers and message body. Toolan and
phishing email.
[Figure: framework diagram. Labels: Pre-processing; Feature extraction and normalization; First stage; CS; SVM; outputs Phishing email and Non-phishing.]

(5) Presence of order, payment, RE- in the title
Title in phishing emails generally contains words like order, payment, RE- in order to deceive recipients. In this work, presence of order, payment, RE- in the title of an email is checked and a binary value is recorded for classifying phishing emails.
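The binary title feature (5) can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function name `title_keyword_feature` is hypothetical, and the case-insensitive, word-boundary matching rule is an assumption, since the text only names the three keywords.

```python
import re

# Keywords come from feature (5); the case-insensitive word-boundary
# matching rule is an assumption made for this sketch.
_TITLE_PATTERN = re.compile(r"\b(?:order|payment)\b|\bre-", re.IGNORECASE)


def title_keyword_feature(subject: str) -> int:
    """Return 1 if the subject contains order, payment, or RE-, else 0."""
    return 1 if _TITLE_PATTERN.search(subject) else 0


if __name__ == "__main__":
    for subject in ("RE- Your payment is overdue", "Lunch on Friday?"):
        print(subject, "->", title_keyword_feature(subject))
```

Word boundaries keep a word such as "border" from triggering the "order" keyword; a plain substring test would be a simpler but noisier alternative.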
segment of an email with You, or Your, this email is most likely to be a phishing email.

3) URL-based Features: This section gives explanations of 13 features related to URL in the email body.

(1) Presence of IP address in URLs
If an IP address is used as an alternative to the domain name in the URL of an email, such as http://15.9.31.123/fake.html, this email may be a phishing email. Therefore, in this work, we make use of the presence of an IP address in a URL to identify phishing emails.

(2) Number of URLs
Phishing emails frequently contain multiple URLs to illegitimate websites. Thus, the number of URLs is a feature for identifying phishing emails in this work.

(3) Number of dots in domain name is greater than 3
If the number of dots in a domain is greater than three, then the URL is classified as Phishing since it will have multiple subdomains. On the contrary, the legitimate domain name in the URL of an email has no subdomains, such as "http://baidu.com".

(4) Presence of "@" symbol in URLs
If a URL contains the "@" symbol, the browser will ignore everything preceding the "@" symbol, and the real address often follows the "@" symbol.

(5) Active duration of domain names in URLs is less than 6 months
This feature can be extracted from the WHOIS database. Most phishing websites live for a short period of time.

(6) Length of URL is greater than 54 characters
Phishers make use of long URLs to hide the doubtful part. Based on the analysis of existing data sets, if the length of the URL is greater than 54 characters, then an email containing this URL is classified as a phishing email.

(7) Presence of "-" character in URL
Legitimate email rarely uses the dash symbol in URLs. However, phishers tend to use the "-" character to deceive the recipient into believing that this is a legitimate web page.

(8) Presence of DNS record
Since the claimed identity of a phishing URL is not recognized by the WHOIS database, the website is classified as Phishing when its DNS record is empty or not found.

(9) ALEXA ranking is greater than 100,000
The active duration of a malicious website is short, so the number of users and the number of pages that they visit are relatively small. On the contrary, the popularity of a legitimate website is high. The ALEXA value is a commonly used indicator of site traffic ranking statistics. Furthermore, in the worst scenario, legitimate websites ranked among the top 100,000. Thus, if the ALEXA ranking of a domain name in a URL is greater than 100,000 or it is not found, then this website may be malicious.

(10) Registration duration is less than 2 months
Legitimate domains are regularly paid for several years in advance; for example, the registration duration of google.com is nearly two decades.

(11) Number of http/https in a URL is greater than 1
The existence of "//" within the URL path means that the user will be redirected to another website. Thus, if there is more than one http or https in a URL, then the URL may redirect to a malicious website.

(12) Domain similarity measure
We find it interesting that many phishing domain names are similar to famous domain names. Thus, phishers not only make these domain names easy to remember but also make these domains look more like normal ones. Therefore, we can make use of a domain similarity measure to classify phishing emails.

(13) Using URL Shortening Services
URL shortening is used for making a URL considerably smaller in length while still leading to the required web page.

(14) Disparities between domain name server in URLs and domain name server of sender
Phishers often send malicious links whose servers reside in different countries or regions in order to hide the true attack source. Thus, the domain name servers of URLs in a phishing email do not reside in the same country as the attacker.

B. Classification

In this section, we introduce our hybrid classifier, which makes use of Cuckoo search to optimize parameter selection in the kernel function. CS is integrated with SVM to construct a hybrid classifier, CS-SVM, for selecting the optimum parameter in the kernel function. In this paper, there are 23 features and several thousand emails in our testing environment. Thus, we select RBF to identify phishing email.

In the proposed CS-SVM algorithm, we select the traditional SVM algorithm as our fitness function. That is, this paper uses the value of γ to generate the hyperplane which minimizes the training errors and also maximizes the margin with the correctly classified data points. Then, this work calculates the classification error on normal emails and phishing emails using the current hyperplane. We modify the value of γ according to the Cuckoo Search algorithm until the classification error remains unchanged or the maximal number of CS iterations is reached. The detailed process of CS-SVM is illustrated in Algorithm 1.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Evaluation Metrics

There are two phases in supervised machine learning, namely the training phase and the testing phase. To evaluate the performance of the proposed classifier, this paper chooses three commonly used evaluation metrics, which are True Positive Rate (TPR), False Positive Rate (FPR), and accuracy [10]. TPR and FPR are explained as follows: the proportion of correctly identified phishing emails and the sum
of phishing emails classified by SVM, and the proportion of correctly identified non-phishing emails and the sum of non-phishing emails classified by SVM. Precision, recall, and accuracy are defined as follows.

\begin{equation}
\begin{aligned}
\mathrm{Precision} &= \frac{N_{TP}}{N_{TP} + N_{FP}} \\
\mathrm{Recall} &= \frac{N_{TP}}{N_{TP} + N_{FN}} \\
\mathrm{Accuracy} &= \frac{N_{TP} + N_{TN}}{N_{P} + N_{F}}
\end{aligned}
\tag{1}
\end{equation}

where N indicates the number, N_P represents the number of actual phishing emails, and N_F represents the number of actual non-phishing emails.

Algorithm 1 CS-SVM algorithm
Require: NR, the number of runs; NC, the number of Cuckoos; D, the dimension of the objective search space; Θ, the tuning parameter value range of γ in RBF; p∂, the probability of finding eggs of exotic Cuckoo; T = {(x_i, y_i)}, i = 1, ..., N, y_i = −1 or y_i = 1, the training email samples with two classes
Ensure: γ parameter value which minimizes the classification error
1) Initialize nests location γ_{g,i} (i = 1, ..., NC, g = 1)
2) Generate the hyperplanes SVM_{g,n}
3) Calculate classification error E_{g,n} using SVM_{g,n}
4) Select the best solution γ_{g,i} with the minimum error
5) while g ≤ NR do
6)     Generate new location γ_{g+1,i} using Levy flight
7)     Generate the hyperplanes SVM_{g+1,n}
8)     Calculate new classification error E_{g+1,i} of γ_{g+1,i}
9)     Select the best solution γ_{g+1,i} with the minimum error
10)    if E_{g+1,i} > E_{g,j} then
11)        Retain the best solution E_{g+1,k}
12)        Discard other solutions according to p∂
13)        Generate new values to replace discarded values using stochastic preference swimming
14)    end if
15)    Run time add 1
16) end while
17) Output γ

B. Experimental Setup

The experimental data used in this paper is collected from three data sets and consists of phishing emails as well as non-phishing emails. The phishing email set consists of two different corpora: one is 1,203 old phishing emails collected by Jose Nazario in 2005 [11], and the other is 181 up-to-date phishing emails reported in the MillerSmiles archive [12]. The 20,071 non-phishing emails are collected from the public Enron email set [13], gathered by the CALO Project and covering more than 150 employees.

Table II
THE DISTRIBUTIONS OF TRAINING SET AND TESTING SET

Data set    P-e                S-P-e    N-P-e
training    50%∼90%            0        50%∼90%
testing     10%-181∼50%-181    all      10%∼50%

Notes: P-e, phishing email; S-P-e, state-of-art phishing email; N-P-e, non-phishing email

In order to evaluate the true positive rates and false positive rates of our proposed optimized SVM classifier, we conduct an evaluating experiment to select the optimal parameter in our training data set, which includes part of the non-phishing emails and part of the phishing emails from the Nazario corpus. The distributions of the training set and testing set are shown in Table II.

In our CS-SVM algorithm, there are several parameters requiring initialization. The parameter settings in the experiment are shown in Table III.

Table III
PARAMETERS SETTING IN EXPERIMENT

Parameter    Description                           Value
n            number of Cuckoos                     25
A            minimum value of γ                    0
B            maximum value of γ                    1
dimension    dimension of input value              15
iteration    number of iteration                   1000
p∂           probability of selecting new nests    0.25

C. Experimental Analysis

With reference to different γ, different email data, and different features, a comparison is made to verify the effectiveness of our proposed hybrid classifier.

1) Identification Accuracy With Different γ: Experiments are performed to evaluate phishing email identification accuracy at different parameters in the kernel function. We first randomly select sixty percent of the phishing emails from the Nazario corpus and the same proportion of non-phishing emails from the Enron set. Identification accuracy changes with the parameter γ, as shown in Fig. 2. SVM yields the highest classification accuracy of 99.5064 percent when the value of γ is 0.77. Also, the number of emails that are classified correctly remains constant as γ increases, whereas the identification accuracy of SVM decreases as γ decreases. If the default parameter is set to be greater than 0.77, we can obtain the optimal SVM classifier on the current email data. However, when γ takes the default value 0.5, the phishing email identification rate is lower than that of our method. Also, there are many local optimal points in the range of 0 to 1, such as 0.03, 0.13, 0.23, 0.29, and 0.48.

We randomly select sixty percent and seventy percent of emails from the Nazario corpus and the Enron set, respectively. The phishing identification precision of SVM increases with γ, based on the graph of false positives and false negatives
change is shown in Fig. 3. The false positive rate remains about the same as γ changes. A decreased parameter degrades the false negative rate. Here, the classifier has the lowest false positive rate and false negative rate when γ is 0.79 and 0.62, respectively. Moreover, the false negative rate can seriously affect the performance of a classifier. In different training data sets, different γ values make SVM yield the highest classification accuracy. Thus, setting the parameter in the kernel function to its default value does not meet the real situation.

Fig. 2. Identification accuracy with different parameter

Table IV
CLASSIFICATION RESULTS OF CS-SVM AND SVM

R      Algorithm    γ       TP      FN     TN       FP
1:1    CS-SVM       0.79    726     56     10035    0
1:1    SVM          0.5     696     86     10035    0
5:6    CS-SVM       0.64    840     62     12042    0
5:6    SVM          0.5     806     96     12042    0
5:7    CS-SVM       0.79    946     77     14049    0
5:7    SVM          0.5     909     114    14049    0
5:8    CS-SVM       0.92    1059    84     16056    0
5:8    SVM          0.5     1025    118    16056    0
5:9    CS-SVM       0.57    1167    96     18063    0
5:9    SVM          0.5     1155    108    18063    0
1:2    CS-SVM       0.74    1277    107    20071    0
1:2    SVM          0.5     1218    166    20071    0

TPR of 92.8% and 89%, respectively. The TPR of CS-SVM is higher than that of SVM. Also, the phishing email identification accuracy of CS-SVM is better under different testing ratios. Moreover, the highest TPR is 93.12%, which is about four percent higher than SVM with the default value. However, the FPR of the two classifiers is the same. These results reveal that the overall performance of CS-SVM is better than that of traditional SVM.
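The rates quoted above can be recomputed from the confusion counts in Table IV. The sketch below is illustrative only: the `Confusion` class and its method names are hypothetical, phishing is assumed to be the positive class, and TPR, FPR, and accuracy follow the standard confusion-matrix definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts, with phishing as the positive class."""
    tp: int
    fn: int
    tn: int
    fp: int

    def tpr(self) -> float:
        # Share of actual phishing emails that were detected.
        return self.tp / (self.tp + self.fn)

    def fpr(self) -> float:
        # Share of actual non-phishing emails flagged as phishing.
        return self.fp / (self.fp + self.tn)

    def accuracy(self) -> float:
        # Share of all emails that were classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)


# Rows copied from Table IV (ratios 1:1 and 5:6).
cs_svm_1_1 = Confusion(tp=726, fn=56, tn=10035, fp=0)
svm_1_1 = Confusion(tp=696, fn=86, tn=10035, fp=0)
cs_svm_5_6 = Confusion(tp=840, fn=62, tn=12042, fp=0)

if __name__ == "__main__":
    rows = [("CS-SVM 1:1", cs_svm_1_1), ("SVM 1:1", svm_1_1),
            ("CS-SVM 5:6", cs_svm_5_6)]
    for name, row in rows:
        print(f"{name}: TPR={row.tpr():.4f} FPR={row.fpr():.4f} "
              f"accuracy={row.accuracy():.4f}")
```

Under these definitions the 1:1 rows give TPRs of about 92.8% and 89.0%, and the 5:6 CS-SVM row gives a TPR of about 93.1% and an accuracy of about 99.52%, consistent with the values reported in the text.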
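The search over γ described in Algorithm 1 can be sketched with a generic one-dimensional Cuckoo Search. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the `error` callable stands in for the SVM classification error at a given γ (here replaced by a toy function with an arbitrary minimum), Levy steps are drawn with Mantegna's algorithm, and `beta` and `step_scale` are assumed values. The defaults for the number of nests, the γ range, the discovery probability, and the iteration count follow Table III.

```python
import math
import random
from typing import Callable, Tuple


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """Draw one heavy-tailed step using Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0) or 1e-12  # guard against a zero denominator
    return u / abs(v) ** (1 / beta)


def cuckoo_search_gamma(
    error: Callable[[float], float],
    n_nests: int = 25,
    lo: float = 0.0,
    hi: float = 1.0,
    p_a: float = 0.25,
    n_iter: int = 1000,
    step_scale: float = 0.01,  # assumed step size, not given in the paper
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (gamma, error) for the best nest found."""
    rng = random.Random(seed)

    def clip(g: float) -> float:
        return min(hi, max(lo, g))

    nests = [rng.uniform(lo, hi) for _ in range(n_nests)]
    errors = [error(g) for g in nests]

    for _ in range(n_iter):
        # Levy flight: propose a move for every nest, keep improvements.
        for i in range(n_nests):
            candidate = clip(nests[i] + step_scale * (hi - lo) * levy_step(rng))
            candidate_error = error(candidate)
            if candidate_error < errors[i]:
                nests[i], errors[i] = candidate, candidate_error

        # Abandon each non-best nest with probability p_a and rebuild it.
        best = min(range(n_nests), key=errors.__getitem__)
        for i in range(n_nests):
            if i != best and rng.random() < p_a:
                nests[i] = rng.uniform(lo, hi)
                errors[i] = error(nests[i])

    best = min(range(n_nests), key=errors.__getitem__)
    return nests[best], errors[best]


def toy_error(gamma: float) -> float:
    """Stand-in for the SVM classification error (illustration only)."""
    return (gamma - 0.77) ** 2


best_gamma, best_error = cuckoo_search_gamma(toy_error, n_iter=200)

if __name__ == "__main__":
    print(f"best gamma={best_gamma:.4f}, error={best_error:.6f}")
```

In practice `toy_error` would be replaced by a function that trains an RBF-kernel SVM with the given γ and returns its classification error on held-out emails. Each evaluation is independent of the others, which is what makes the parallel evaluation mentioned in the future work feasible.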
V. CONCLUSION AND FUTURE WORK

Phishing email, especially spear phishing, has brought serious challenges to cyber security. The best current phishing email detection method is based on content and the support vector machine. Since the number of email features is far smaller than the number of training emails, a radial basis function kernel with the default parameter is commonly used for phishing email identification. However, the default parameter does not yield the best SVM classifier. In this work, we combine the optimization characteristics of the Cuckoo search algorithm with the SVM classifier to select the optimal parameter value in the kernel function. In addition, we select 23 features covering header, URL, and body. We perform experiments on a dataset that contains three archives. Experimental results show that CS-SVM has higher phishing email detection accuracy across different training sets. This indicates that the proposed method is better than an SVM classifier with the default parameter value.

Future work will be devoted to the optimization of CS-SVM, because our proposed method currently runs on a single machine. We hope that our optimization algorithm can be run on a distributed platform. For example, we can use map-reduce to calculate the fitness function of different γ in parallel.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Grant No. 61572115) and the Key Basic Research of Sichuan Province (Grant No. 2016JY0007).

REFERENCES

[1] Granadillo, Gustavo Gonzalez and Garcia-Alfaro, Joaquin and Debar, Hervé and Ponchel, Christophe and Martin, Laura Rodriguez, "Considering technical and financial impact in the selection of security countermeasures against Advanced Persistent Threats (APTs)," New Technologies, Mobility and Security (NTMS), 2015 7th International Conference on, IEEE, 2015, pp. 1–6.

[7] Gupta, BB and Tewari, Aakanksha and Jain, Ankit Kumar and Agrawal, Dharma P, "Fighting against phishing attacks: state of the art and future challenges," Neural Computing and Applications, 2016, pp. 1–26.

[8] Toolan, Fergus and Carthy, Joe, "Feature selection for spam and phishing detection," eCrime Researchers Summit (eCrime), 2010, pp. 1–12.

[9] Huang, Huajun and Qian, Liang and Wang, Yaojun, "A SVM-based technique to detect phishing URLs," Information Technology Journal, 2012, vol. 11, no. 7, p. 921.

[10] Adewumi, Oluyinka Aderemi and Akinyelu, Ayobami Andronicus, "A hybrid firefly and support vector machine classifier for phishing email detection," Kybernetes, 2016, vol. 45, no. 6, pp. 977–994.

[11] Nazario, J, "The online phishing corpus," http://monkey.org/~jose/wiki/doku.php.

[12] "The online phishing corpus," http://www.millersmiles.co.uk/.

[13] "Cohen WW (2016) Enron email dataset," https://www.cs.cmu.edu/*./enron/.