
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2020.2978915, IEEE Transactions on Big Data

IEEE TRANSACTIONS ON BIG DATA, TBD-2019-11-0170

LSTM based Phishing Detection for Big Email Data

Qi Li, Mingyu Cheng, Junfeng Wang, Bowen Sun

Abstract—In recent years, cyber criminals have successfully invaded many important information systems by using phishing mail,
causing huge losses. The detection of phishing mail in big email data has attracted public attention. However, the camouflage
techniques of phishing mail are becoming more and more complex, and existing detection methods cannot cope with the
increasingly complex deception methods and the growing number of emails. In this paper, we propose an LSTM-based phishing
detection method for big email data. The new method includes two important stages: a sample expansion stage and a testing stage
under sufficient samples. In the sample expansion stage, we combine KNN with K-Means to expand the training data set, so
that the number of training samples can meet the needs of deep learning. In the testing stage, we first preprocess these samples,
including generalization, word segmentation, and word vector generation. Then, the preprocessed data is used to train an LSTM
model. Finally, on the basis of the trained model, we classify the phishing emails. We evaluate the performance of the proposed
method by experiment, and the experimental results show that the accuracy of our phishing detection method can reach 95%.

Index Terms—phishing email, LSTM, social engineering

—————————— ◆ ——————————

Qi Li is with Beijing University of Posts and Telecommunications, Beijing, China (e-mail: [email protected]).
Mingyu Cheng is with Beijing University of Posts and Telecommunications, Beijing, China (e-mail: [email protected]).
Junfeng Wang is with Sichuan University, Chengdu, China (e-mail: [email protected]).
Bowen Sun is with China Information Security Certification Center (e-mail: [email protected]).
2332-7790 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Published by the IEEE Computer Society
Authorized licensed use limited to: University of Wollongong. Downloaded on May 31,2020 at 20:41:14 UTC from IEEE Xplore. Restrictions apply.

1 INTRODUCTION

IN recent years, cyber security incidents have occurred frequently. In most of these incidents, attackers used phishing email as a stepping stone to invade government systems (such as the US State Department and the White House [1]), well-known companies (such as Google and RSA [2]), and the websites of politicians and social organizations in many countries (such as John Podesta and the DNC [3]). This series of high-profile incidents highlights the growing popularity and power of phishing attacks. On the one hand, phishing emails often cause economic damage to enterprises. On the other hand, phishing emails lead to the leakage of private information, which damages an industry or even a country.
Unlike attacks that exploit specific technical vulnerabilities in software and protocols, phishing attacks are based on social engineering [4, 5]. By sending fraudulent emails, the attacker induces the recipient to take dangerous actions (such as clicking on links or entering passwords) without realizing it. From the attacker's point of view, a phishing attack has a low technical cost, does not depend on any specific vulnerability, and evades technical defenses more easily than a malware attack. From the defender's point of view, a phishing attack is difficult to identify with simple rules, and all-round protection is difficult to achieve [6, 7]. Meanwhile, with the widespread use of email, the number of emails has increased, and traditional methods struggle to detect phishing emails efficiently on big email data. For these reasons, there is no universal and effective tool to detect or prevent spear phishing, which makes it the main means of attack against valuable targets; it has also been shown to be the prelude to most APT attacks [8].
In recent years, phishing attacks have attracted more and more attention. Researchers have used sandboxes [9], behavior blacklists [10, 11], botnet email filtering [12], sender reputation analysis [13, 14], linguistic attribution [15, 16], and other methods to analyze phishing behavior [17]. Among these methods, behavior blacklists, botnet email filtering, and sender reputation analysis all isolate emails according to the sender's information. However, these methods cannot cope with increasingly complex phishing attacks in an adversarial environment, because phishing research involves many aspects besides cyber-attacks, such as social engineering [18], psychology [19], economics [20], awareness [21], and coping measures [22]. These techniques are mainly used to prevent fraudulent phishing that redirects users to fake websites through links embedded in email, and they do not adapt easily to activity attribution and identification.
Machine learning [23, 24] is an effective way to counter phishing attacks when it is applied to phishing email detection in complex environments. But this idea faces many difficulties in practice. Importantly, phishing email can be classified into many categories [25] according to its camouflage means, such as disguising as a public domain name, copying IP,

using short links, etc.; each kind of phishing email has different characteristics.
Although the methods mentioned above can detect phishing email to a certain extent, methods such as feature extraction and sandboxing are ineffective against identity forgery and cloud attachments. In addition, the open-source datasets used for research differ greatly from the real data encountered in practical applications, which seriously harms the generalization and detection performance of the model.
Therefore, we first propose a sample labeling method. We use a clustering algorithm to accurately label existing email samples in big email data that are not labeled precisely. This also lets us expand the email samples and solve the problems caused by a shortage of accurately labeled data. Second, because we classify according to the message body, which is often very long, we train an LSTM (Long Short-Term Memory) neural network model. The LSTM network processes information effectively through three gate units and mitigates the vanishing gradient problem caused by long contexts. We can therefore train an LSTM neural network model that addresses the problems mentioned above and detects phishing emails effectively.

2 RELATED WORK
The essence of phishing email [26] is that attackers achieve their malicious purpose by inducing people to click on malicious links or open malicious documents and attachments. Nowadays, phishing email attack methods fall mainly into two categories, as shown in Fig. 1 and Fig. 2. Fig. 1 shows a malicious-link-based phishing email and its header; this attack constructs a similar domain name or imitates a domain name, and the recipient is attacked once the link is clicked [27]. Fig. 2 shows a malicious-attachment phishing email and its header; this attack induces the recipient to download and open the malicious attachment, which is then used to attack the victim.

Fig. 1. Imitating linked phishing emails and email header

Fig. 2. Malicious attachment phishing email and email header

An analysis of the phishing emails reported by the sandboxes shows that they have four characteristics, which help us detect phishing emails effectively.
1) Flexibility: When an attacker sends a phishing email, the email's properties can be changed flexibly. For example, attackers can use various IPs, mailbox names, domains, and malicious files. The domain name may be newly registered or controlled through a website vulnerability. Moreover, the malicious documents used by phishing emails may exploit system vulnerabilities that have already been published in a vulnerability database, or unpublished 0-day vulnerabilities, which makes phishing emails much harder to detect.
2) Broadcastability: Attackers do not care who the target is; they only care whether the attack succeeds. Therefore, similar phishing emails may be delivered to multiple mailboxes, affecting many people.
3) Inductivity: The contents and themes of phishing emails are diverse, such as news, politics, economy, and entertainment gossip. But every category of phishing email is inductive: a phishing email must induce recipients to click the malicious content it carries.
4) Severity: Once the recipient clicks the link or opens the attachment, the attackers may remotely control the victim's host, steal confidential files from the compromised host, and even spread a virus from the captured host to the entire intranet, causing significant harm to the entire enterprise and even to national security.

The main methods of detecting phishing emails are as follows: sandbox-based detection [28-30], blacklist and whitelist based detection [31-34], and machine learning based detection [35-38].
Sandbox-based phishing email detection mainly performs static and dynamic analysis of email attachments in a sandbox [29, 30]. In Ref. [30], the author proposed a novel sandbox tool to detect attachments in emails. This tool detected unencrypted malicious documents well, but it cannot detect encrypted attachments.
Phishing email detection based on blacklists and whitelists [32-34] detects phishing emails by establishing a feature database for phishing emails and normal emails [31]. In Ref. [32], a method is proposed for detecting phishing emails based on domain name credibility, where the credibility is set from historical data. However, attackers can change their IP, sender's mailbox, or attachment name at will and thereby evade detection.
Phishing email detection based on machine learning [35] extracts a large number of features and fits them with a machine learning model. In Ref. [39], an anti-phishing method based on feature analysis is proposed; the features are extracted from the labels of historical data and from information entropy. In Ref. [40], the authors introduce a Natural Language Processing method that detects phishing emails by extracting keywords from the message body. The main problem of these two methods is that information is lost during feature extraction. Therefore, the machine learning algorithms cannot detect phishing emails accurately.
Because the complicated forms of phishing email make detection difficult, this paper uses an LSTM neural network to automatically extract the characteristics of phishing emails and thereby detect flexible and versatile phishing emails.

3 FRAMEWORK OVERVIEW
Our proposed method has to solve two key issues. On the one hand, the data in our experiments comes from an enterprise that has its own mail server. Because the type and distribution of open-source datasets differ from those of our data, we cannot use an open-source dataset to train a machine learning or deep learning model that generalizes well. Therefore, we first manually filter out a certain number of phishing emails that are typical of this enterprise, and then use the KNN and K-Means algorithms to perform automatic labelling, so that the number of samples can support the training of deep learning models. On the other hand, existing phishing detection methods usually perform feature extraction on the email, or analyze selected sentences of the text separately, to detect the mail. However, such methods ignore the message body. To this end, after the automatic labelling we use an LSTM model, whose input is the expanded sample set, to detect the email; this takes advantage of all the information in the email. As shown in Fig. 3, the overall architecture of our proposed system consists of three important phases: the mail preprocessing phase, the data amplification phase, and the model training phase.
In the automatic labelling phase, we label the emails using the extracted features, our proposed data labeling method, and the pre-classified result. In our method, we propose a string distance as the distance of our clustering algorithm, and we combine KNN [42] with K-Means [43] to expand the training data set so that the number of training samples can meet the needs of deep learning. In the model training phase, we preprocess these samples, including generalization, word segmentation, and word vector generation. Finally, we train an LSTM model to classify the phishing emails. In our model, Dropout and regularization are used to avoid over-fitting. We also use Adam as the optimizer to adjust the learning rate.

Fig. 3. The framework of the proposed phishing detection method
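
The string distance mentioned above is the normalized edit-distance similarity that Section 4.2.3 defines in Equations (1) and (2). The following Python sketch illustrates it under stated assumptions: the function names `levenshtein` and `similarity` are hypothetical and do not come from the paper, the comparison is character-level, and two empty strings are treated as identical, a case the equations leave open.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance of Eq. (1), computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]  # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # indicator of a_i != b_j
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity of Eq. (2): 1 - lev(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest
```

Applied to string-valued fields such as attachment names, `similarity("invoice_2017.doc", "invoice_2018.doc")` is close to 1 because the names differ in a single character, while unrelated names score close to 0.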



4 THE DETAILS OF PHISHING EMAIL DETECTION
4.1 Email Data Preprocessing Method
Before labeling the emails, we apply an email data preprocessing method to make our methods more efficient. The method has two parts: filtering offensive emails and labeling daily working mailboxes.

4.1.1 Filtering Offensive Emails
When an attacker intrudes on a victim through phishing mail, the means of intrusion, such as malicious links and malicious attachments, necessarily appear in the email. Therefore, before classifying the emails, we can reduce the number of emails that need to be classified by filtering for offensive emails, which greatly improves the efficiency of our algorithm. An email is treated as offensive if it contains a URL or an attachment. We then discard the non-offensive emails and focus only on the offensive ones.

4.1.2 Labeling Daily Working Mailboxes
Some mailboxes send a lot of offensive emails every month. Some of these are normal working mailboxes, and others are phishing mailboxes. Therefore, compiling statistics on mailboxes that send a large number of emails per month can effectively reduce the number of emails that need to be classified.

4.2 Sample Expansion Method

4.2.1 Automatic Labelling
The data in our paper are collected from an enterprise that has its own mail server. We found that mail samples collected from open-source datasets cannot be applied to the real scenario because their distribution is different. That is to say, we cannot use an open-source dataset to train a model and then use it to detect phishing mail in the real world. Therefore, it is very important to label data from the actual environment and obtain a training dataset that meets our needs. Since data labelling is relatively complex work, we obtained a small amount of labeled data manually, especially for malicious samples. Then we used KNN and K-Means to expand the samples from the whole dataset on the basis of this small amount of labeled data. The expanded samples with high similarity to the manually labeled data are used for subsequent analysis.

4.2.2 Phishing Email Feature Extraction Algorithm
Phishing emails are broadcastable, so there will be a large number of similar phishing emails on the email server. They may come from the same IP or the same mailbox. We therefore propose an email feature extraction algorithm based on a seven-tuple and label emails with it. The seven features consist of the following two parts:
A) Header features: the real source IP, the real sender's mailbox, and the consistency between the real mailbox and the displayed sender's mailbox. Because the displayed sender's mailbox can be forged, the real information of the sender is vital and should be taken into consideration. These features can be obtained from the eml file on the email server.
B) Content features: the title of the email, the attachment name, the attachment suffix, and whether the email contains a URL. If a URL is included, we first determine whether the URL is long or short, and then determine whether it was produced by a URL shortener, because phishers often use this technique.
We can vectorize the email samples with the seven-tuple feature extraction algorithm and then cluster the phishing emails to obtain an accurately labeled training dataset, so that our phishing email detection algorithm can identify phishing emails accurately and efficiently.

4.2.3 Improved Levenshtein Distance
The Edit Distance [41] is the minimum number of single-character deletions, insertions, or replacements needed to transform one string into another. For two strings a and b with lengths |a| and |b|, their Levenshtein distance $lev_{a,b}(|a|,|b|)$ is defined as:
\[
lev_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0 \\
\min
\begin{cases}
lev_{a,b}(i-1,j) + 1 \\
lev_{a,b}(i,j-1) + 1 \\
lev_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise}
\end{cases}
\tag{1}
\]
where the indicator $1_{(a_i \neq b_j)}$ is 0 when $a_i = b_j$ and 1 otherwise, and $lev_{a,b}(i,j)$ is the Edit Distance between the first i characters of a and the first j characters of b. The similarity of a and b can be expressed as $Sim_{a,b}$:
\[
Sim_{a,b} = 1 - \frac{lev_{a,b}(|a|,|b|)}{\max(|a|,|b|)}
\tag{2}
\]
With this string distance we can cluster emails accurately and avoid the problem of feature loss.

4.2.4 Sample Labeling Algorithm
The K-Means algorithm [43] is the most classic partition-based clustering algorithm. Its central idea is that all points are clustered around k center points in space, and all clusters update their centers iteratively until the best clustering result is obtained. The concrete steps of the algorithm are as follows:
Algorithm 1 K-Means algorithm
Required: $k$: the number of initial points; $C_i$: the center point of the i-th cluster; $P_i$: the i-th point; $F_{ij}$: the i-th feature of $P_j$; $FC_{ij}$: the i-th feature of $C_j$; $CL_i$: the i-th cluster; $N_i$: the number of points in $CL_i$; $N_a$: the number of all data points; $D_{a,b}$: the distance between point a and point b
Input: $P_i$ ($i = 0 \to N_a - 1$)
Output: the category of $P_i$
1 Randomly pick k points as the initial points
2 While the clusters change and the maximum number of iterations is not reached:
3   For $P_i$ ($i = 0 \to N_a - 1$) in the dataset:
4     For $C_j$ ($j = 0 \to k - 1$):

5       $D_{P_i,C_j} = \sum_{f=0}^{6} Sim_{F_{fi},FC_{fj}}$ (3)
6     Assign the point to the cluster with the smallest D
7   In $C_i$ ($i = 0 \to k - 1$):
8     For $P_m$ ($m = 0 \to N_i$):
9       For $P_n$ ($n = m \to N_i$):
10        $Sim_{F_{im},F_{in}}$
11    $C_i = 2 \sum_{m} \sum_{n} Sim_{F_{im},F_{in}} / (n(n-1))$ ($i = 0 \to 6$) (5)
The central idea of the KNN algorithm [42] is as follows: given training data with known labels, compare the features of each test sample with the corresponding features of the training data and find the K training samples most similar to it. The test sample is assigned the category that occurs most frequently among these K samples. The concrete steps of the algorithm are as follows:
Algorithm 2 KNN algorithm
Required: $P_i$: the i-th point; $F_{ij}$: the i-th feature of $P_j$; $N_a$: the number of all points; $I_i$: the K nearest points of $P_i$; $N_i$: the number of initial points; $D_{a,b}$: the distance between point a and point b; $P_{ki}$: the probability of class i
Input: $P_i$ ($i = 0 \to N_a - 1$)
Output: the category of $P_i$
1 For $P_i$ ($i = 0 \to N_a - 1$):
2   For $P_j$ ($j = 0 \to N_a - 1$):
3     $D_{P_i,I_j} = \sum_{f=0}^{6} Sim_{F_{fi},F_{fj}}$ (6)
4   Sort $D_{P_i,I_j}$ in ascending order ($i = 0 \to N_a - 1$, $j = 0 \to N_a - 1$)
5   Take the first K distances $D_k$ and the corresponding points $PR_{ij}$ from the sorted results, where "j" represents their category
6   $P_{kc} = \sum_{k=1}^{K} D_{P_i,PR_{jc}} / \sum D_K$ (7)
7   $P_i$ belongs to the category with the highest probability value
After all the sample data have been examined by the sandbox, they are divided into phishing emails and normal emails. Because the sandbox has a high false alarm rate and a high false negative rate, the data it classifies as phishing emails contain normal samples, and the data it classifies as normal emails contain phishing emails.
Because of broadcastability, there are a certain number of similar emails on our email server, so they can be grouped effectively by a clustering algorithm. The characteristics of these phishing emails take the form of strings, such as mail headers and attachment names, so the string distance mentioned above is used as the distance of the clustering algorithm. We use the K-Means algorithm and the KNN algorithm to cluster and reclassify the results of the sandbox.
The result is labeled by the following rules. The first character indicates the type of email judged by the sandbox: "p" indicates phishing emails, and "n" indicates normal emails. The second character indicates the type of email judged by our algorithm: "p" indicates that the emails are classified as phishing emails, and "n" indicates that the emails are judged as normal emails. The third character indicates the algorithm used: "1" is the result of the K-Means algorithm, and "2" is the result of the KNN algorithm. For example, "pp_1" denotes emails that the sandbox judged as phishing and that the K-Means algorithm also judged as phishing. To ensure the size and reliability of the data set, we use Equations (8) and (9) to obtain the expanded data set.
Phishing email samples in the dataset:
phishing samples = (pp_1 & pp_2) + (np_1 & np_2) (8)
Normal email samples in the dataset:
normal samples = (pn_1 & pn_2) + (nn_1 & nn_2) (9)
In these formulas, a & b denotes the intersection of a and b, and a + b denotes the union of a and b.

4.3 LSTM Algorithm
An RNN (Recurrent Neural Network), owing to its special network structure, considers the previous sequence information when learning the information of the current time step. Therefore, RNNs have unique advantages in dealing with time series and text sequence problems. However, it is difficult for an RNN to learn long-distance information. LSTM is a special form of RNN that overcomes this problem of the classic RNN model [44]. The neuron of the LSTM neural network is shown in Fig. 4.

Fig. 4. The structure of LSTM neurons

In Fig. 4, the state of a neuron is similar to a conveyor belt. Data can be transmitted along the entire belt with only a small amount of linear interaction, so information flows along it easily. The LSTM neuron is mainly composed of three gate structures that select which information passes through. Each gate is realized by a sigmoid neural layer and a point-by-point multiplication operation.
Forgetting Gate: The left part of Fig. 4 shows the forgetting gate of the LSTM neuron. The inputs of the forgetting gate are $h_{t-1}$ and $x_t$. By processing the input, the forgetting gate outputs a number between 0 and 1, which represents the degree of forgetting. If the output is 1, all the information is "remembered". If it is 0, all the information is "forgotten".
\[
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
\tag{10}
\]
In Equation (10), $W_f$ is the weight matrix of the forgetting gate, $[h_{t-1}, x_t]$ concatenates the two vectors, $b_f$ is the bias of

the forgetting gate, and $\sigma$ represents the sigmoid function.
Input Gate: In Fig. 4, the middle part represents the input gate. The input gate layer determines which values need to be updated. The tanh layer generates a new candidate vector, and the input gate layer and the tanh layer update the state together.
\[
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tag{11}
\]
In Equation (11), $W_i$ is the weight matrix of the input gate, and $b_i$ is the bias of the input gate.
Output Gate: In Fig. 4, the part on the right is the output gate, which determines the output. First, a sigmoid layer determines which part of the state should be output. Then, the state is passed through the tanh function to limit its values to between -1 and 1. Finally, the output is obtained by multiplying this result by the output of the sigmoid gate.
\[
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
\tag{12}
\]
In Equation (12), $h_{t-1}$ represents the output value of the previous neuron. LSTM retains and controls information entirely through these three gate structures. Our paper uses the LSTM algorithm to train the classifier that detects phishing emails.
In our LSTM neural network, the softsign function is used as the activation function instead of the tanh function because it is faster to compute. Adam is used as the optimizer to adjust the learning rate, so that our model converges quickly. Orthogonal initialization is also used to address the vanishing gradient and exploding gradient problems that arise in a deep network from the excessive length of the message body.
After the automatic labelling stage, the number of labeled samples can support the training of a deep learning model. Before the message body is used as the input to the LSTM, the data is preprocessed as follows:
1) Corpus construction. We construct the corpus from the message bodies and ignore mail without a message body.
2) Word segmentation. In this paper, we mainly focus on Chinese mail and English mail. We use the Jieba library to segment Chinese sentences and split English sentences on spaces.
3) Removing the stop words. We filter irrelevant words out of the segmentation results, such as modal particles, auxiliary words, and conjunctions.
4) Transformation from words to vectors. In this step, we use word2vec to represent the semantic information of each word by learning from the text.
5) Length normalization. First, we calculate the average length of the training data. When the length of a vector is greater than the average length, we truncate the vector. Otherwise, we pad it with '0'.

5 EXPERIMENT RESULTS AND DISCUSSION
5.1 Experimental Facilities and Data Sources
In the experiment, we first collect emails from our email server, together with mailbox data from some companies and organizations, as our experimental data. The email data covers January 2017 to June 2018. The total number of emails is 29,942,735. The experiment is conducted in an Ubuntu14.04LTS environment, using python3.5.4 and Keras2.1.2 as the neural network framework to build the network, with Google's open-source tensorflow1.4.1 as the back-end computing framework. The CPU of the server is Intel(R)Xeon(R)CPU [email protected], and the GPU is TITAN(X)(Pascal).

5.2 Evaluation Criteria
In the experiment, the results are evaluated by four parameters, namely Acc, P, R, and F1. These four parameters are defined as follows:
\[
Acc = \left(1 - \frac{errorsum}{sum}\right) \times 100\%
\tag{13}
\]
\[
P = \frac{TP}{TP + FP} \times 100\%
\tag{14}
\]
\[
R = \frac{TP}{TP + FN} \times 100\%
\tag{15}
\]
\[
F1 = \frac{2 \times P \times R}{P + R} \times 100\%
\tag{16}
\]
In Equations (13), (14), (15), and (16), errorsum is the number of incorrectly classified samples, and sum is the total number of samples. TP (true positive) is the number of phishing emails that are correctly classified as phishing; FN and FP are the numbers of false negatives and false positives. P is the precision, and R is the recall rate, which indicates how many of the phishing emails are correctly classified by the model. The F1-score is the harmonic mean of precision and recall and evaluates model performance comprehensively.

5.3 Sample Labeling Algorithm Results
To verify the results of our sample labeling algorithm, four months are randomly selected from June 2017 to December 2017, and 1,000 emails are then randomly selected from each month as the verification set for our labeling method.
Since only accurately labeled samples are needed, we verify the accuracy of the results and compare it with the accuracy of the corporate sandboxes. The four selected months are June 2017, September 2017, October 2017, and December 2017. The result is shown in Fig. 5, where "pslacc" and "nslacc" are the accuracy of positive and negative sample labeling, and "pslnum" and "nslnum" are the numbers of labeled positive and negative samples. As Fig. 5 shows, our method labels slightly fewer samples than the sandboxes, but the accuracy of our labeling algorithm is much higher than that of the sandboxes and almost reaches 100%.
An analysis of the results of our proposed clustering algorithm shows that the emails grouped together always share some similarities, such as similar senders, similar
2332-7790 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
Authorized licensed use limited to: University of Wollongong. Downloaded information.
on May 31,2020 at 20:41:14 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TBDATA.2020.2978915, IEEE Transactions on Big Data

QI LI ET AL.: LSTM BASED PHISHING DETECTION FOR BIG EMAIL DATA 7

attachment names, or similar mail names. These emails Type B = pn _1& pn _ 2 (14)
may come from the same attacker, or have similar themes, Type C = np _1& np _ 2 (15)
so these emails are grouped together, which leads to a high
Type D = nn _1& nn _ 2 (16)
accuracy of our clustering algorithm. The shortage of our
clustering algorithm is settled by our large data volume.
5.4 LSTM Algorithm Results
The result of our clustering algorithm can effectively sup-
port our clustering algorithm and prepare enough accu- In the experiment, the structure of dataset is shown in TA-
rately labeled email samples for our LSTM neuron network. BLE 1, and we first compare the result of other different
Then, the data from June 2017 to December 2017 is used neurons with the result of LSTM neuron network. These
to cluster by our proposed method, and the number of chosen neurons are mainly used to process the sequence
email of type A is 20,3642, the number of email of type B is data, including standard RNN neurons, GRU neurons, Bi-
10,271, the number of email of type C is 56,920, the number LSTM neurons and TextCNN neurons. DS3 as the training
of email of type D is 2,532,984. According to our labeling set of our model, and 10,000 emails are randomly chosen
algorithm: from January 2018 to June 2018 as the validation set results.
Type A = pp _1& pp _ 2 (13) The results is shown in Fig.6.
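The comparison in Fig.6 relies on the metrics described in Section 5.2. As a minimal sketch, assuming the standard definitions of accuracy, precision, recall and F1-score, they can be computed from confusion-matrix counts as follows. The function name and the example counts are illustrative assumptions, not values from our experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives (phishing is the positive class).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the computation.
example = classification_metrics(tp=90, fp=30, fn=10, tn=870)
```

A model that flags most emails as phishing would show a high recall but a low precision in such a computation, which is why the F1-score is reported alongside recall.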

Fig. 5. The results of our labeling method
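Before labeled emails are fed to the LSTM, each message body goes through the preprocessing of Section 4. The following is a minimal sketch of steps 1, 2, 3 and 5 for English text only. It makes several simplifying assumptions that are not part of our implementation: the stop-word list is a tiny illustrative set, words are mapped to integer ids instead of word2vec vectors, and all function names are hypothetical.

```python
# Illustrative stop-word list only; a real list would be much larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is"}


def tokenize(body):
    """Split an English message body on whitespace and drop stop words."""
    return [w for w in body.lower().split() if w not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each token to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def normalize_length(seq, target_len, pad_value=0):
    """Truncate sequences longer than target_len; pad shorter ones."""
    if len(seq) >= target_len:
        return seq[:target_len]
    return seq + [pad_value] * (target_len - len(seq))


def preprocess(bodies):
    """Return (sequences, average_length) for a list of message bodies."""
    # Step 1: ignore mail without a message body.
    token_lists = [tokenize(b) for b in bodies if b.strip()]
    if not token_lists:
        return [], 0
    # Stand-in for step 4: integer ids instead of word2vec vectors.
    vocab = build_vocab(token_lists)
    ids = [[vocab[t] for t in tokens] for tokens in token_lists]
    # Step 5: normalize every sequence to the average training length.
    avg_len = max(1, round(sum(len(s) for s in ids) / len(ids)))
    return [normalize_length(s, avg_len) for s in ids], avg_len
```

Reserving id 0 for padding keeps padded positions distinguishable from real words, which is the usual reason for padding with '0'.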

TABLE 1
THE NUMBER OF SAMPLES IN DATASET

          Positive samples    Negative samples                   Ratio of positive to
Dataset   Type A    Type C    Type B    Type D    Total samples  negative samples
DS1       93080     56920     10271     39729     200000         3:1
DS2       76414     56920     10271     56395     200000         2:1
DS3       43080     56920     10271     89729     200000         1:1
DS4       9746      56920     10271     123063    200000         1:2
DS5       0         56920     10271     132809    200000         1:3
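The composition of the datasets in TABLE 1 can be reproduced with a short calculation. The sketch below assumes that every dataset contains 200,000 samples (the sum of the four per-type counts in each row), that all Type C and Type B emails are always used, and that Type A and Type D emails fill the remainder. The function name and the rounding rule are assumptions made for illustration.

```python
def compose_dataset(pos_parts, neg_parts, total=200000,
                    type_c=56920, type_b=10271):
    """Return (type_a, type_c, type_b, type_d) for a pos:neg target ratio.

    Type C (positive) and Type B (negative) are used in full; Type A tops
    up the positives and Type D fills whatever remains of the total.
    """
    target_pos = round(total * pos_parts / (pos_parts + neg_parts))
    # Type A cannot be negative: if Type C alone exceeds the target,
    # the achieved ratio is only approximately the requested one.
    type_a = max(0, target_pos - type_c)
    type_d = total - type_a - type_c - type_b
    return type_a, type_c, type_b, type_d


ds1 = compose_dataset(3, 1)  # 3:1
ds3 = compose_dataset(1, 1)  # 1:1
ds5 = compose_dataset(1, 3)  # 1:3, Type A clamps to zero
```

For the 2:1 and 1:2 ratios the total is not evenly divisible, so this rounding rule can differ from the table by one sample. For DS5 the Type C emails alone exceed a quarter of the total, so the achieved ratio is roughly 1:2.5 rather than exactly 1:3.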

From Fig.6, the LSTM achieves the best result among the compared networks. The RNN gets the worst result because of its simple model. The other networks perform better than the RNN but slightly worse than our LSTM network. In particular, the result of the CNN is closest to that of our method. In text classification, LSTM and CNN are both commonly used deep learning models. The CNN model extracts features similar to n-grams, which ignore word order, and it cannot achieve satisfying results in tasks such as sentiment analysis. By comparison, LSTM can better capture the inducing statements in the message, so it works better in phishing detection.

Fig. 6. The results of different neurons

Meanwhile, different datasets are used in our experiment. We use the same LSTM neural network model in the training process. The results are shown in Fig.7.
From Fig.7, when the ratio of positive to negative samples is 1:1, the model performs best on the verification set. For imbalanced training data, the prediction accuracy of the model is very unsatisfactory compared with balanced training data. The reason is that when the training dataset is imbalanced, the model tilts toward the class with more samples during training. As we can see in Fig.7, the recall rate of DS1 is very high. The model trained on DS1 tends to classify emails as phishing because of the large number of phishing emails in the training dataset. However, there are a large number of false positives in the results of DS1, and the F1-score of DS1 is very low.

Fig. 7. The results of different datasets

At the same time, we input our verification set into four common enterprise sandboxes and compared the results with those of our method. From Fig.8, the enterprise sandboxes perform worse than our method. The reason is that the enterprise sandboxes cannot detect two kinds of phishing email: malicious link phishing email and malicious attachment phishing email. For malicious link phishing email, enterprise sandboxes detect such emails mainly by matching links against their information library. However, due to the flexibility of malicious links mentioned above, the enterprise sandboxes cannot effectively detect malicious links, which leads to the low accuracy of sandbox detection for this type of email. Malicious attachment phishing email is generally used to transfer key files, but the enterprise sandboxes cannot effectively detect these two kinds of attachments. These two points are the main reasons why the accuracy of the sandboxes is lower than the accuracy of our model.

Fig. 8. The results of different datasets

After that, we adjust the hyperparameters of the model. The parameters are set as in TABLE 2.
From the experimental results shown in Fig.9, the narrow and deep neural network model is more effective when the number of neurons is the same. Using L1 regularization can effectively improve the accuracy of the model. In our experimental results, the accuracy of M4 plummeted. This is because, after we reduced the dropout ratio, the trained model overfitted and therefore performed poorly on the validation set. Regarding the activation function, the accuracy of the model using the relu function was slightly higher than the accuracy of the model using the leakyrelu function.

TABLE 2
SETTINGS OF DIFFERENT NETWORKS

                 Number of single-   Activation
Model   Depth    layer neurons       function     Regularization      Dropout
M1      5        2048                Relu         L1                  0.5
M2      10       1024                Relu         L1                  0.5
M3      10       1024                Relu         No Regularization   0.5
M4      10       1024                Relu         L1                  0.1
M5      10       1024                leakyrelu    L1                  0.5
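The settings in TABLE 2 can be read as one baseline (M2) plus variants that each change a single factor. The sketch below encodes the table as plain data to make this explicit. The class and function names are hypothetical, and treating depth times single-layer neurons as the "number of neurons" is an assumption about how the totals are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    depth: int             # number of stacked layers
    units_per_layer: int   # neurons in a single layer
    activation: str
    regularization: str    # "L1" or "none"
    dropout: float

    @property
    def total_units(self):
        # Assumed definition of the total number of neurons.
        return self.depth * self.units_per_layer


CONFIGS = {
    "M1": ModelConfig("M1", 5, 2048, "relu", "L1", 0.5),
    "M2": ModelConfig("M2", 10, 1024, "relu", "L1", 0.5),
    "M3": ModelConfig("M3", 10, 1024, "relu", "none", 0.5),
    "M4": ModelConfig("M4", 10, 1024, "relu", "L1", 0.1),
    "M5": ModelConfig("M5", 10, 1024, "leakyrelu", "L1", 0.5),
}

FIELDS = ("depth", "units_per_layer", "activation", "regularization", "dropout")


def differing_fields(a, b):
    """Names of the hyperparameters on which two configurations differ."""
    return [f for f in FIELDS if getattr(a, f) != getattr(b, f)]
```

Under this reading, M1 and M2 have the same total number of neurons and differ only in shape (wide and shallow versus narrow and deep), while M3, M4 and M5 each differ from M2 in exactly one hyperparameter.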

Then, our method is compared with machine learning detection models. We extract email header features, including the sender IP, the one-hot encoding of the sender country, and whether the sender address and the reply address are consistent, as well as text features, including the one-hot encoding of the suffix of the attachment name. With these extracted features, phishing emails can be detected by machine learning algorithms. We use Support Vector Machines, Random Forest, XGBoost and LightGBM to fit these features and use the resulting models to classify the validation set.

Fig. 9. The results compared with shallow machine learning

From Fig. 9, the method proposed by us is far superior to the shallow machine learning methods. The reasons may be as follows. First, the features of emails are mainly in the form of strings. When string features are converted into numerical features, the main method is one-hot encoding. However, this method is extremely inefficient on massive data, and a large number of strings cannot be converted into numerical features due to feature dimension problems, which leads to the loss of a large number of features. Second, owing to the flexibility of phishing emails, attackers can change the header features of the emails freely and evade phishing email detection systems based on machine learning. Ultimately these two reasons lead to the low detection accuracy of shallow learning methods.
Finally, we compare our method to the black and white list method of detecting phishing emails. We use the IP, the sender, the URLs embedded in the email, and the attachments from phishing emails and normal emails in the previous months to build the black and white lists. The result of the black and white list detection is as follows:

Fig. 10. The results compared with the method of black and white list

As shown in Fig. 10, detection with black and white lists is far less effective than our method. The phishing email detection method based on black and white lists cannot obtain better results because phishing emails are more flexible. When an attacker replaces the sending mailbox or the attachment name, the black and white lists cannot detect that phishing mail.

6 CONCLUSION
This paper analyzes the existing phishing email detection methods and finds that traditional detection methods struggle to detect phishing emails accurately. Therefore, we designed a phishing email detection method based on an LSTM neural network. When we designed the model, no accurately labeled dataset of phishing emails was available. We therefore used a phishing email feature extraction algorithm to extract the characteristics of each email and then used the extracted features to cluster the emails, so as to achieve accurate labeling of phishing emails. Finally, we trained the model and compared the proposed method with traditional phishing email detection methods by experiment. Our method performed better than the existing phishing email detection methods: it improves accuracy and reduces the false negative rate and the false positive rate.

ACKNOWLEDGMENT
This work was supported in part by the National Key Research and Development Program (2016QY06X1205, 2018YFB0804503) and the National Natural Science Foundation of China (61762086, U1836103).

Qi Li received the Ph.D. degree in computer science and technology from Beijing University of Posts and Telecommunications, China, in 2010. She is currently an associate professor in the Information Security Center, State Key Laboratory of Networking and Switching Technology, School of Computer Science, Beijing University of Posts and Telecommunications, China. Her current research focuses on information systems and software.

Mingyu Cheng received the B.S. degree in information security from Xi’an University of Posts and Telecommunications, China, in 2019. He is currently pursuing the M.S. degree in information security at Beijing University of Posts and Telecommunications. His research interests are in the areas of software security and data analysis.

Junfeng Wang received the M.S. degree in Computer Application Technology from Chongqing University of Posts and Telecommunications, Chongqing, in 2001 and the Ph.D. degree in Computer Science from University of Electronic Science and Technology of China, Chengdu, in 2004. From July 2004 to August 2006, he held a postdoctoral position in the Institute of Software, Chinese Academy of Sciences. Since August 2006, Dr. Wang has been a professor with the College of Computer Science and the School of Aeronautics & Astronautics, Sichuan University. His recent research interests include network and information security, spatial information networks and data mining.

Bowen Sun received the M.S. degree in Information Security from Beijing University of Posts and Telecommunications, Beijing, China, in 2019. He is currently a research assistant in CNITSEC. His recent research interests include malware analysis and machine learning.
