LSTM
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2020.2978915, IEEE Transactions on Big Data
Abstract—In recent years, cyber criminals have successfully invaded many important information systems by using phishing mail,
causing huge losses. Detecting phishing mail in big email data has attracted wide attention. However, the camouflage techniques of phishing mail are becoming more and more complex, and existing detection methods cannot cope with the increasingly complex deception methods and the growing volume of email. In this paper, we propose an LSTM-based phishing detection method for big email data. The new method includes two important stages: a sample expansion stage and a testing stage under sufficient samples. In the sample expansion stage, we combine KNN with K-Means to expand the training data set, so that the number of training samples can meet the needs of deep learning. In the testing stage, we first preprocess these samples, including generalization, word segmentation and word vector generation. Then, the preprocessed data is used to train an LSTM model. Finally, on the basis of the trained model, we classify the phishing emails. We evaluate the performance of the proposed method by experiment, and the experimental results show that the accuracy of our phishing detection method can reach 95%.
—————————— ◆ ——————————
1 INTRODUCTION
using short links, etc., each kind of phishing email has different characteristics.
Although the methods mentioned above can detect phishing email to a certain extent, methods such as feature extraction and sandboxing are ineffective against identity forgery and cloud attachments. In addition, there is a great difference between the various open-source datasets used for research on the Internet and the real data in practical applications, which seriously affects the generalization and detection performance of the model.
Therefore, we first propose a sample labeling method in our paper. We use a clustering algorithm to accurately label the existing email samples in big email data that are not marked precisely. Meanwhile, we can also expand the email samples and solve the problems caused by the lack of accurately labeled data. Secondly, since we need to classify according to the message body, we use the LSTM (Long Short-Term Memory Network) neural network model for training, mainly owing to the excessive length of the message body. The LSTM neural network can effectively process information through three gate units and solve the problem of gradient disappearance caused by the excessive length of context. So, we can train an LSTM neural network model to detect phishing emails, which effectively solves the problems mentioned above and achieves effective detection of phishing emails.
2 RELATED WORKS
The essence of phishing email [26] is that, by inducing people to click on malicious links or open malicious documents and attachments, the attackers can achieve their malicious purpose. Nowadays, phishing email attack methods are mainly divided into two categories, as shown in Fig.1 and Fig.2. Fig.1 shows a malicious-link based phishing email and the email's header, which involves constructing a similar domain name or imitating a domain name to attack the recipient; the recipient is attacked once the link is clicked [27]. Fig.2 shows a malicious-attachment phishing email and the email's header, which mainly involves inducing the recipient to download and open the malicious attachment of the email, and using the malicious attachment to attack the victim.
Fig. 1. Imitating linked phishing emails and email header
Fig. 2. Malicious attachment phishing email and email header
By analyzing phishing emails reported by the sandboxes, we find that these phishing emails have four characteristics, which will help us detect phishing emails effectively.
1) Flexibility: When an attacker sends a phishing email, the email's properties can be changed flexibly. For example, attackers can use various IPs, email names, domains and malicious files. The domain name may be newly registered, or controlled through a website vulnerability. Moreover, malicious documents used by phishing emails may exploit system vulnerabilities that have been disclosed in the vulnerability library or undisclosed 0day vulnerabilities, which adds great difficulty to the detection of phishing emails.
2) Broadcastability: Attackers do not care who the target is; they only care about whether the attack is successful. Therefore, similar phishing emails may be delivered to multiple mailboxes, affecting many people.
3) Inductivity: The contents and themes of phishing emails are diverse, such as news, politics, economy, entertainment gossip, etc. But every category of phishing email is inductive: phishing emails must induce recipients to click the malicious content in the file.
4) Severity: Once the link is clicked or the attachment is opened by the recipient, the malicious attackers may remotely manipulate the victim's host, steal confidential files from the compromised host, and even spread the virus through the captured host to the entire intranet environment, which will cause significant harm to the entire enterprise and even national security.
2332-7790 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Wollongong. Downloaded on May 31,2020 at 20:41:14 UTC from IEEE Xplore. Restrictions apply.
the forgetting gate, σ represents sigmoid function.
Input gate: In Fig.4, the middle part represents the input gate, and the input gate layer determines which values need to be updated. The tanh layer generates a new vector, and the input gate layer and the tanh layer update the state together.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{11}$$
In (11), W_i is the weight matrix of the input gate, and b_i is the bias of the input gate.
Output gate: In Fig.4, the part on the right is the output gate, which determines the output. First, a sigmoid layer determines which part of the state we need to output. Then, the state is passed through the tanh function to limit its values to between -1 and 1. Finally, we get the output by multiplying this result by the output of the sigmoid gate.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{12}$$
In (12), h_{t-1} represents the output value of the previous neuron. LSTM retains and controls information entirely through these three gate structures. Our paper uses the LSTM algorithm to train the classifier to detect phishing emails.
In our LSTM neural network, the softsign function is used as the activation function instead of the tanh function because it is faster to compute. Adam is used as the optimizer to adjust the learning rate, so that our model can converge quickly. Orthogonal initialization is also used to mitigate the gradient disappearance and gradient explosion problems that arise in deep networks from the excessive length of the message body.
After the automatic labelling stage, the number of labeled samples can support the training of a deep learning model. Before the message body is used as the input to the LSTM, the data is preprocessed as follows:
1) Corpus construction. We construct the corpus of the message bodies, ignoring mail without a message body.
2) Word segmentation. In this paper, we mainly focus on Chinese mail and English mail. We use the Jieba library to implement word segmentation in Chinese sentences and use spaces to split English sentences.
3) Removing the stop words. We filter irrelevant words from the segmentation results, such as modal particles, auxiliary words and conjunctions.
4) Transformation from words to vectors. In this step, we use word2vec to represent the semantic information of each word by learning from the text.
5) Length normalization. First, we calculate the average length of the training data. When the length of a vector is greater than the average length, we truncate the vector. Otherwise, we pad it with '0'.
5 EXPERIMENT RESULTS AND DISCUSSION
5.1 Experimental Facilities and Data Sources
In the experiment, we first collect emails from our email server, and mailbox data from some companies and organizations, as our experimental data. Email data is selected from January 2017 to June 2018. The total number of emails is 29,942,735. The experiment is conducted in the Ubuntu 14.04 LTS environment, using Python 3.5.4 and Keras 2.1.2 as the neural network framework to build the network, with Google's open-source TensorFlow 1.4.1 as the back-end computing framework. The CPU of the server is Intel(R) Xeon(R) CPU [email protected], and the GPU is TITAN X (Pascal).
5.2 Evaluation Criteria
In the experiment, the experimental results are evaluated by four parameters, namely Acc, P, R, and F1. These four parameters are defined as follows:
$$Acc = \left(1 - \frac{errorsum}{sum}\right) \times 100\% \tag{13}$$
$$P = \frac{TP}{TP + FP} \times 100\% \tag{14}$$
$$R = \frac{TP}{TP + FN} \times 100\% \tag{15}$$
$$F1 = \frac{2 \times P \times R}{P + R} \times 100\% \tag{16}$$
In Equations (13), (14), (15) and (16), errorsum represents the number of incorrectly classified samples, and sum represents the total number of samples. TP is the number of true positives, that is, the number of phishing emails correctly classified as phishing; FN and FP are the numbers of false negatives and false positives. P represents precision. R represents the recall rate, indicating how many phishing emails are correctly classified by the model. The F1-score is the harmonic mean of precision and recall, and evaluates model performance comprehensively.
5.3 Sample Labeling Algorithm Results
In order to verify the results of our sample labeling algorithm, four months are randomly selected from June 2017 to December 2017, and then 1,000 emails are randomly selected from each month as the verification set to test the proposed labeling method.
Since only accurately labeled samples are needed, we verify the accuracy of the results and compare them to the accuracy of the corporate sandboxes. The selected four months are June 2017, September 2017, October 2017, and December 2017. The result is shown in Fig.5: "pslacc" and "nslacc" denote the accuracy of positive and negative sample labeling, and "pslnum" and "nslnum" denote the number of labeled positive and negative samples. As Fig.5 shows, the number of samples labeled by our method is slightly lower than the number labeled by the sandboxes, but the accuracy of our labeling algorithm is much higher than that of the sandboxes and almost reaches 100%.
After analyzing the results of our proposed clustering algorithm, we find that there are always some similarities in the emails that are grouped together, such as similar senders, similar attachment names, or similar mail names. These emails may come from the same attacker, or have similar themes, so they are grouped together, which leads to the high accuracy of our clustering algorithm. The shortcomings of our clustering algorithm are offset by our large data volume. The results of our clustering algorithm effectively support our approach and provide enough accurately labeled email samples for our LSTM neural network.
Then, the data from June 2017 to December 2017 is clustered by our proposed method. The number of emails of type A is 203,642, the number of type B is 10,271, the number of type C is 56,920, and the number of type D is 2,532,984. According to our labeling algorithm:
Type A = pp_1 & pp_2 (13)
Type B = pn_1 & pn_2 (14)
Type C = np_1 & np_2 (15)
Type D = nn_1 & nn_2 (16)
5.4 LSTM Algorithm Results
In the experiment, the structure of the dataset is shown in TABLE 1. We first compare the results of other neurons with the results of the LSTM neural network. The chosen neurons are mainly used to process sequence data, and include standard RNN neurons, GRU neurons, Bi-LSTM neurons and TextCNN neurons. DS3 is used as the training set of our model, and 10,000 emails are randomly chosen from January 2018 to June 2018 as the validation set. The results are shown in Fig.6.
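The evaluation criteria of Section 5.2, which score every model compared here, can be sketched as follows. This is a minimal illustration rather than the authors' evaluation script: the function name `evaluate`, the label convention (1 = phishing, 0 = normal) and the zero-division guards are assumptions made for the example.

```python
def evaluate(y_true, y_pred):
    """Compute Acc, P, R and F1 as percentages for binary labels.

    Assumed convention (not stated in the paper): 1 = phishing, 0 = normal.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as phishing
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    total = len(pairs)
    error_sum = fp + fn  # every misclassified sample is either an FP or an FN

    acc = 1 - error_sum / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Report percentages, matching the "x 100%" factor in the definitions.
    return {"Acc": acc * 100, "P": precision * 100,
            "R": recall * 100, "F1": f1 * 100}


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(evaluate(truth, preds))
```

Precision and recall are kept as fractions until the final step so that the F1 value is scaled to a percentage only once.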
TABLE 1
THE NUMBER OF SAMPLES IN DATASET
Dataset   Positive samples      Negative samples      Total samples   Ratio of the positive samples
          Type A     Type C     Type B     Type D                     and negative samples
DS1       93080      56920      10271      39729      2000000         3:1
DS2       76414      56920      10271      56395      2000000         2:1
DS3       43080      56920      10271      89729      2000000         1:1
DS4       9746       56920      10271      123063     2000000         1:2
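The ratios in TABLE 1 can be recomputed from the per-type counts. The sketch below assumes, following the table header, that positive samples are Type A plus Type C and negative samples are Type B plus Type D; the names `TABLE_1` and `summarize` are introduced only for this example. Computed this way, the four type columns add up to 200,000 samples for each dataset.

```python
# Per-type sample counts copied from TABLE 1, in the order
# (Type A, Type C, Type B, Type D).
TABLE_1 = {
    "DS1": (93080, 56920, 10271, 39729),
    "DS2": (76414, 56920, 10271, 56395),
    "DS3": (43080, 56920, 10271, 89729),
    "DS4": (9746, 56920, 10271, 123063),
}


def summarize(type_a, type_c, type_b, type_d):
    """Return (positives, negatives, positive-to-negative ratio)."""
    positives = type_a + type_c  # assumed grouping: A and C are positive
    negatives = type_b + type_d  # assumed grouping: B and D are negative
    return positives, negatives, positives / negatives


if __name__ == "__main__":
    for name, counts in TABLE_1.items():
        pos, neg, ratio = summarize(*counts)
        print(f"{name}: positives={pos} negatives={neg} "
              f"sum={pos + neg} ratio={ratio:.2f}")
```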
experimental results, the accuracy of M4 has plummeted. This is because, after we reduced the dropout ratio, the trained model overfit, so it performed poorly on the validation set. As for the activation function, the accuracy of the model using the relu function was slightly higher than that of the model using the leakyrelu function.
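For reference, the activation functions named in this comparison, together with the softsign function mentioned earlier as a cheaper alternative to tanh, can be written in a few lines. This is only a scalar sketch of the standard definitions; the negative slope `alpha=0.01` for leaky ReLU is an assumed value, since the paper does not state the one used.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope `alpha` for negative inputs (alpha is assumed)."""
    return x if x > 0 else alpha * x


def softsign(x):
    """Softsign: x / (1 + |x|), bounded in (-1, 1) like tanh but without exp."""
    return x / (1 + abs(x))


if __name__ == "__main__":
    for value in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(value, relu(value), leaky_relu(value),
              softsign(value), math.tanh(value))
```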
TABLE 2
SETTINGS OF DIFFERENT NETWORKS
Then, our method is compared with machine learning detection models. We extract email header features, including the sender IP, the one-hot encoding of the sender country, and whether the sender address and the reply address are consistent, as well as text features, including the one-hot encoding of the suffix of the attachment name. We then fit Support Vector Machines, Random Forest, XGBoost and LightGBM to these features and use the resulting models to classify the validation set.
list method to detect phishing emails. We use the IP, sender, URLs embedded in the email, and attachments from phishing emails and normal emails in the previous months to build the black and white lists. The result of the black and white list detection is as follows:
Fig. 10. The results compared with the method of black and white list
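A black and white list baseline of the kind described above might look like the following sketch. The `Email` record, the indicator tuples and the decision order (blacklist hit first, then whitelist hit, otherwise "unknown") are assumptions made for illustration; the paper does not specify how conflicts or unseen indicators are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    """Hypothetical record holding the four indicator types named in the text."""
    ip: str
    sender: str
    urls: list = field(default_factory=list)
    attachments: list = field(default_factory=list)

    def indicators(self):
        """Return the set of (kind, value) indicators carried by this email."""
        return {
            ("ip", self.ip),
            ("sender", self.sender),
            *(("url", url) for url in self.urls),
            *(("attachment", name) for name in self.attachments),
        }


def build_lists(phishing_emails, normal_emails):
    """Build black and white lists from earlier, already labeled emails."""
    black = set().union(*(e.indicators() for e in phishing_emails))
    white = set().union(*(e.indicators() for e in normal_emails))
    # Assumption: an indicator seen in both classes is not trusted.
    white -= black
    return black, white


def classify(email, black, white):
    """Label an email using list lookups only (assumed decision order)."""
    found = email.indicators()
    if found & black:
        return "phishing"
    if found & white:
        return "normal"
    return "unknown"
```

Because the lookup only matches indicators that have been seen before, an email with a new IP, sender, URL and attachment falls into the "unknown" branch, which illustrates why such lists struggle with the flexibility characteristic described in Section 2.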
ing email feature extraction algorithm to extract the characteristics of the email, and then use the extracted features to cluster the emails, so as to achieve accurate labeling of phishing emails. Finally, we train the model and compare the proposed method with traditional phishing email detection methods by experiment. Our method performed better than the existing phishing email detection methods: it improves accuracy and reduces the false negative rate and false positive rate.
ACKNOWLEDGMENT
This work was supported in part by the National Key Research and Development Program (2016QY06X1205, 2018YFB0804503) and the National Natural Science Foundation of China (61762086, U1836103).
REFERENCES
[1] "US State Department Hack Has Major Security Implications", Security Intelligence, 2019. [Online].
[2] K. Zetter, L. Matsakis, I. Lapowsky, G. Graff, E. Dreyfuss and L. Newman, "Researchers Uncover RSA Phishing Attack, Hiding in Plain Sight", WIRED, 2018. [Online].
[3] L. Matsakis, I. Lapowsky, G. Graff, E. Dreyfuss and L. Newman, "Why the DNC Thought a Phishing Test Was a Real Attack", WIRED, 2018. [Online].
[4] M. Alsharnouby, F. Alaca and S. Chiasson, "Why phishing still works: User strategies for combating phishing attacks", International Journal of Human-Computer Studies, vol. 82, pp. 69-82, 2015. Available: 10.1016/j.ijhcs.2015.05.005.
[5] T. Jagatic, N. Johnson, M. Jakobsson and F. Menczer, "Social phishing", Communications of the ACM, vol. 50, no. 10, pp. 94-100, 2007. Available: 10.1145/1290958.1290968.
[6] N. Arachchilage, S. Love and K. Beznosov, "Phishing threat avoidance behaviour: An empirical investigation", Computers in Human Behavior, vol. 60, pp. 185-197, 2016.
[7] K. Parsons, A. McCormac, M. Pattinson, M. Butavicius and C. Jerram, "The design of phishing studies: Challenges for researchers", Computers & Security, vol. 52, pp. 194-206, 2015.
[8] "Phishing APTs (Advanced Persistent Threats)", InfoSec Resources, 2018. [Online].
[9] G. Singh, "Phishing & a Live Technical Analysis", SSRN Electronic Journal, 2017. Available: 10.2139/ssrn.2940415.
[10] K. Jansson and R. von Solms, "Phishing for phishing awareness", Behaviour & Information Technology, vol. 32, no. 6, pp. 584-593, 2013. Available: 10.1080/0144929x.2011.632650.
[11] T. Nikolaos, V. Nikos and M. Alexios, "Browser Blacklists: The Utopia of Phishing Protection.", E-Business and Telecommunications, no., 2014.
[12] W. Khan, M. Khan, F. Bin Muhaya, M. Aalsalem and H. Chao, "A Comprehensive Study of Email Spam Botnet Detection", IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 2271-2295, 2015.
[13] S. Jeeva and E. Rajsingh, "Phishing URL detection-based feature selection to classifiers", International Journal of Electronic Security and Digital Forensics, vol. 9, no. 2, p. 116, 2017.
[14] J. Chaudhry and R. Rittenhouse, "Phishing: Classification and Countermeasures", in International Conference on Multimedia, 2016.
[15] H. Che, Q. Liu and L. Zou, "A Content-Based Phishing Email Detection Method", in 2017 IEEE International Conference on Software Quality, Reliability and Security Companion, 2017.
[16] C. Tan, K. Chiew, K. Wong and S. Sze, "PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder", Decision Support Systems, vol. 88, pp. 18-27, 2016.
[17] H. Shahriar and M. Zulkernine, "Trustworthiness testing of phishing websites: A behavior model-based approach", Future Generation Computer Systems, vol. 28, no. 8, pp. 1258-1271, 2012.
[18] A. Ferreira and G. Lenzini, "An Analysis of Social Engineering Principles in Effective Phishing", in Workshop on Socio-technical Aspects in Security & Trust Within the IEEE Computer Security Foundations Symposium, 2015.
[19] T. Spears, "Phishing for Phools: The Economics of Manipulation & Deception", Quantitative Finance, vol. 17, no. 2, pp. 165-167, 2016.
[20] C. Konradt, A. Schilling and B. Werners, "Phishing: An economic analysis of cybercrime perpetrators", Computers & Security, vol. 58, pp. 39-46, 2016. Available: 10.1016/j.cose.2015.12.001
[21] N. Safa, M. Sookhak, R. Von Solms, S. Furnell, N. Ghani and T. Herawan, "Information security conscious care behaviour formation in organizations", Computers & Security, vol. 53, pp. 65-78, 2015.
[22] J. Kang and D. Lee, "Advanced White List Approach for Preventing Access to Phishing Sites", in International Conference on Convergence Information Technology, 2007.
[23] A. Saeed, N. Dario and X. Wang, "A comparison of machine learning techniques for phishing detection", in Anti-phishing Working Groups Ecrime Researchers Summit, 2007.
[24] S. Rawal, B. Rawal, A. Shaheen and S. Malik, "Phishing Detection in E-mails using Machine Learning", International Journal of Applied Information Systems, vol. 12, no. 7, pp. 21-24, 2017.
[25] V. Gandhi and P. Kumar, "A Study on Phishing: Preventions and Anti-Phishing Solutions", International Journal of Scientific Research, vol. 1, no. 2, pp. 68-69, 2012.
[26] J. Hong, "The state of phishing attacks", Communications of the ACM, vol. 55, no. 1, p. 74, 2012.
[27] B. Gupta, A. Tewari, A. Jain and D. Agrawal, "Fighting against phishing attacks: state of the art and future challenges", Neural Computing and Applications, vol. 28, no. 12, pp. 3629-3654, 2016.
[28] D. Komashinskiy, "An approach to detect malicious documents based on Data Mining techniques", SPIIRAS Proceedings, vol. 3, no. 26, p. 126, 2014.
[29] N. Nissim, A. Cohen, C. Glezer and Y. Elovici, "Detection of malicious PDF files and directions for enhancements: A state-of-the art survey", Computers & Security, vol. 48, pp. 246-266, 2015.
[30] X. Han, N. Kheir and D. Balzarotti, "PhishEye: Live Monitoring of Sandboxed Phishing Kits", in Acm Sigsac Conference on Computer & Communications Security, 2017.
[31] Y. Cao, W. Han and Y. Le, "Anti-phishing based on automated individual white-list", in Workshop on Digital Identity Management, 2008.
[32] A. Jain and B. Gupta, "A novel approach to protect against phishing attacks at client side using auto-updated white-list", EURASIP Journal on Information Security, vol. 2016, no. 1, 2016.
[33] G. Ramesh, I. Krishnamurthi and K. Kumar, "An efficacious method for detecting phishing webpages through target domain identification", Decision Support Systems, vol. 61, pp. 12-22, 2014.
[34] S. Marchal, J. François and R. State, "Proactive Discovery of Phishing Related Domain Names", in International Conference on Research in Attacks, 2012.
[35] Pradeep Tiwari and Ravendra Ratan Singh, "Machine Learning based Phishing Website Detection System", International Journal of Engineering and Research, vol. 4, no. 12, 2015. Available: 10.17577/ijertv4is120262.
[36] A. Jain and B. Gupta, "A machine learning based approach for phishing detection using hyperlinks information", Journal of Ambient Intelligence and Humanized Computing, 2018.
[37] A. Jain and B. Gupta, "Comparative analysis of features based machine learning approaches for phishing detection", in International Conference on Computing for Sustainable Global Development, 2016.
[38] N. Abdelhamid, F. Thabtah and H. Abdel-jaber, "Phishing detection: A recent intelligent machine learning comparison based on models content and features", in IEEE International Conference on Intelligence & Security Informatics, 2017.
[39] Y. Du and F. Xue, "Research of the Anti-phishing Technology Based on E-mail Extraction and Analysis", in International Conference on Information Science & Cloud Computing Companion, 2014.
[40] T. Peng, I. Harris and Y. Sawa, "Detecting Phishing Attacks Using Natural Language Processing and Machine Learning", in IEEE International Conference on Semantic Computing, 2018.
[41] A. Bolton and C. Anderson-Cook, "APT malware static trace analysis through bigrams and graph edit distance", Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 10, no. 3, pp. 182-193, 2017.
[42] P. Lakshmi, "Different Similarity Measures for Text Classification Using Knn", IOSR Journal of Computer Engineering, vol. 5, no. 6, pp. 30-36, 2012.
[43] A. Suryavanshi, "A Survey Paper on Modified Approach for Kmeans Algorithm", International journal of Emerging Trends in Science and Technology, 2016.
[44] K. Greff, R. Srivastava, J. Koutnik, B. Steunebrink and J. Schmidhuber, "LSTM: A Search Space Odyssey", IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222-2232, 2017.