Classifying Phishing URLs Using Recurrent Neural Networks
Abstract—As the technical skills and costs associated with the deployment of phishing attacks decrease, we are witnessing an unprecedented level of scams that pushes the need for better methods to proactively detect phishing threats. In this work, we explored the use of URLs as input for machine learning models applied to phishing site prediction. We compared a feature-engineering approach followed by a random forest classifier against a novel method based on recurrent neural networks. We determined that the recurrent neural network approach provides an accuracy rate of 98.7% even without the need for manual feature creation, beating the random forest method by about 5 percentage points. This means it is a scalable and fast-acting proactive detection system that does not require full content analysis.

Keywords—Phishing detection; Cybercrime; Feature engineering; Recurrent neural networks; Long short term memory networks.

978-1-5386-2701-3/17/$31.00 © 2017 IEEE

I. INTRODUCTION

Phishing attacks are a growing problem worldwide. According to the Anti-Phishing Working Group (APWG), phishing websites increased by 250% from the last quarter of 2015 to the first quarter of 2016, targeting more than 400 brands each month [1]. This is the most the APWG has seen since it began tracking and reporting on phishing in 2004. Phishing, by definition, is the act of defrauding an online user by posing as a trustworthy institution or entity in order to obtain personal information [2]. Criminals' use of phishing centers on social engineering schemes to steal personal and financial information. The attacks are designed to lead consumers to reveal financial data such as usernames and/or passwords on fraudulent websites posing as legitimate entities [1], [3].

Nowadays, phishing attacks can be launched from anywhere in the world at insignificant cost by people with little to no technical skills [4]. Organizations trying to protect their users from these attacks are having a hard time dealing with the massive number of emerging sites, which must be identified and labeled as malicious or harmless before users can access them.

There is no shortage of methods for distributing attacks, and phishers enjoy a wide range of techniques for making a site appear legitimate while evading detection [4], [5]. However, all of them ultimately rely on URLs that redirect their victims to the phishing trap. Impersonating legitimate URLs is the most common social engineering method a phisher can use to lure victims to their website. Therefore, a solid first step towards blocking fraud sites is to use the URLs themselves to screen possible phishing websites [6].

Being able to determine the maliciousness of a website by simply evaluating its URL provides a major strategic advantage. The number of victims can be reduced to nearly zero while minimizing operational efforts by avoiding massive use of more complex methods such as content analysis [7].

For this work, we focused on using machine learning techniques for the classification of phishing sites using only their URLs. Specifically, we compared the combination of lexical and statistical analysis of URLs as input for a random forest (RF) classifier against a novel approach that employs recurrent neural networks, more particularly, a long short-term memory (LSTM) network. RFs, with manually-created features, have been widely used for classification problems [8]. Moreover, this method has been successfully applied to identifying phishing URLs [9]. On the other hand, LSTM models are competent at detecting long patterns in sequences and have been applied to solve different text analysis problems [10]. The LSTM method does not require the manual extraction of features, since it directly learns a representation from the URL's sequence of characters [11]. Recently, LSTMs have been used to detect domain generation algorithms [12], showing great promise for the infosec industry. To the best of our knowledge, this is the first time that the LSTM model has been applied to the detection of phishing URLs.

To evaluate both approaches, we took a corpus composed of one million phishing URLs extracted from PhishTank¹ and one million harmless URLs from Common Crawl². The results show that despite using the URLs as sole input, the RF and the LSTM methods achieved accuracy rates of 93.5% and 98.7%, respectively. Additionally, we compared both methods in terms of training and evaluation times, and the amount of data needed to converge.

¹ PhishTank (https://www.phishtank.com/)
² Common Crawl (http://commoncrawl.org/)

The remainder of this paper is organized as follows: In Section II, we will provide a background on the problem, as well as any related work on phishing detection. Sections
III and IV will provide in-depth descriptions of both machine learning methods. Subsequently, in Section V, we will describe the data and methodology used for the experiments. Section VI presents the experimental results. Finally, we will provide the conclusions of the paper in Section VII.

II. BACKGROUND

A. Phishing Detection

Phishing URL detection can be done via proactive or reactive means. On the reactive end, we find services such as the Google Safe Browsing API³. Services of this type expose a blacklist of malicious URLs to be queried. Blacklists are constructed by using different techniques, including manual reporting, honeypots, or by crawling the web in search of known phishing characteristics [13], [14]. For example, browsers make use of blacklists to block access upon reaching the URLs contained in them. One drawback of such a reactive method is that in order for a phishing URL to be blocked, it must already be included in the blacklist. This implies that web users remain at risk until the URL is submitted and the blacklist is updated. What is more, since the majority of phishing sites are active for less than a day [14], [15], their mission is complete by the time they are added to the blacklist.

Proactive methods mitigate this problem by analyzing the characteristics of a web page in real time in order to assess its potential risk. Risk assessment is done through a classification model [16]. Some of the machine learning methods that have been used to detect phishing include: support vector machines [17], streaming analytics [18], gradient boosting [6], [19], random forests [20], latent Dirichlet allocation [21], online incremental learning [22], and neural networks [23]. Several of these methods employ an array of website characteristics, which means that in order to evaluate a site, it first has to be rendered before the algorithm can be used. This adds a significant amount of time to the evaluation process [24], [25]. Using URLs, instead of content analysis, reduces the evaluation time because only a limited portion of text is analyzed.

Lately, the application of machine learning techniques for URL classification has been gaining attention. Several studies proposing the use of classification algorithms to detect phishing URLs have come to light in recent years [6], [20], [26]. These studies are mainly focused on creating features through expert knowledge and lexical analysis of the URL. Then, the phishing site's characteristics are used as quantitative input for the model. The model in turn learns to recognize patterns and associations the inputs must follow in order to label a site as legitimate or malicious.

B. Uniform Resource Locator Structure

The Uniform Resource Locator (URL), as specified in RFC 1738 [27], is a string representation for a resource available on the Internet. A URL is written as follows:

    <scheme>:<scheme-specific-part>

The scheme specifies the resource's access mechanism or network location (e.g. http, ftp, mailto), while the rest of the URL may vary depending on the scheme selected. For the HTTP protocol, a possible syntactic construction for the next part can be:

    //<host>:<port>/<URL-path>

It starts with the domain name or IP address of a network host. Then, there is the port number to connect to and the URL-path that provides details on how the resource can be accessed (e.g. http://host.com:80/page).

III. CLASSIFYING PHISHING USING URL LEXICAL AND STATISTICAL FREQUENCIES

In this section, we describe our approach, which combines the lexical and statistical analysis of a URL with a random forest classifier to classify phishing websites based on URL features. The process is summarized in Fig. 1. First, a series of important variables are extracted with a feature-engineering approach. Then, a classification algorithm is used to build the model.

A. Feature Engineering

The attackers' objective when crafting a phishing URL is to trick users into thinking it is a legitimate website. In this way, the cybercriminal hopes users will reveal their personal and financial information. In order to achieve this, the attackers follow certain tried-and-true patterns, which can be detected by an experienced eye. In collaboration with security analysts at Easy Solutions, Inc.⁴, we identified a set of 14 features that can be used to create lexical and statistical analysis of URLs:

• Domain exists in Alexa rank⁵: Whether the domain exists among the top one million Alexa domains. The Alexa rank is a list of domains arranged by internet popularity. Most phishing sites are hosted in hacked legitimate sites or new domains. If the phishing site is hosted in a hijacked website, it is unlikely the domain is part of the top Alexa domains, since top-ranked domains tend to have better security measures. If the phishing site is hosted in a newly-registered domain, the domain will not appear in the Alexa rank.
• Subdomain length: The length of the URL's subdomain. Phishing sites try to mimic the legitimate site's URL by using its domain as their subdomain. Real websites tend to have a short subdomain.
• URL length: The length of the URL. A long URL increases the odds of confusing the user.
• Path length: The length of the URL's path. Phishing URLs tend to have a longer path than legitimate ones.
• URL entropy: The entropy of the URL. The higher the entropy of a URL, the more complex it is. Since phishing URLs tend to contain random text, we can attempt to find them by their entropy.

³ https://safebrowsing.google.com/
⁴ https://www.easysolutions.net
⁵ http://www.alexa.com/
Fig. 1. Feature-engineering approach for classifying phishing URLs. First, a series of important variables are extracted with a feature-engineering approach. Then, a classification algorithm is used to build the model.

• Length ratio: The ratio between URL length and path length. The authors of [20] concluded that phishing URLs tend to have a higher ratio than legitimate URLs.
• '@' and '-' count: The number of @ and - characters in the URL. This feature is added in accordance with [20]. In a URL, whatever appears to the left of @ in the host part is treated as user information and is ignored when locating the host. In light of this, phishing URLs use it to deceive users. For example, [email protected].
• Punctuation count: The number of occurrences of . ! # $ % & , . ; ' in the URL. The authors of [20] found that phishing URLs usually show a higher punctuation count.
• Other TLDs count: The number of TLDs that appear in the URL's path. Phishing URLs try to impersonate legitimate URLs by using their domain and TLD in the path.
• Is IP: Whether the URL uses an IP address instead of a domain. It is a feature that has been used in the literature.
• Suspicious words count: The number of suspicious words in the URL. Suspicious words include 'confirm', 'account', 'secure', 'webscr', 'login', 'signin', 'submit', 'update', 'logon', 'wp', 'cmd' and 'admin'. They were chosen manually by observing phishing URLs.
• Euclidean distance: The Euclidean distance between the character frequencies of the URL and those of English. This feature, and the following two, attempt to measure how much the URL differs from common English.
• Kolmogorov-Smirnov statistic: The two-sample Kolmogorov-Smirnov statistic on the character frequencies between the URL and English.
• Kullback-Leibler divergence: The Kullback-Leibler divergence on the character frequencies between the URL and English.

B. Classification Algorithm

Once the features are extracted, a binary classifier is trained on them. We use a random forest (RF) [28] for this. An RF is a classification algorithm that combines many weaker models into a stronger one by averaging the weaker models' responses. The weaker models are classification trees; each one recursively splits the data set based on feature values and stops splitting when all input instances in the current split belong to the same class. We chose RF since it is widely used [8] and can be trained in parallel.

IV. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS

In the previous section, we designed a set of features extracted from a URL and fed them into a classification model to predict whether a URL is a case of phishing. We now approach the problem in a different way. Instead of manually extracting the features, we directly learn a representation from the URL's character sequence.

Each character sequence exhibits correlations, that is, nearby characters in a URL are likely to be related to each other. These sequential patterns are important because they can be exploited to improve the performance of the predictors [11].

A neural network is a bio-inspired machine learning model that consists of a set of artificial neurons with connections between them. Recurrent Neural Networks (RNN) are a type of neural network that is able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the notion of time to the model, which in turn allows them to process sequential data one element at a time and learn its sequential dependencies [10].
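To make "one element at a time" concrete, the following is a minimal sketch of a plain (Elman-style) recurrent update over the characters of a URL. It is not the LSTM used in this work: the hidden size, the random untrained weights, and the helper names `make_cell` and `run_rnn` are assumptions chosen only for illustration.

```python
import math
import random


def make_cell(hidden_size, seed=0):
    """Create random weights for a toy recurrent cell (illustrative only)."""
    rng = random.Random(seed)
    # One input weight vector per possible byte value, plus recurrent weights.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(256)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return w_in, w_rec


def run_rnn(url, cell):
    """Process the URL one character at a time, carrying a hidden state."""
    w_in, w_rec = cell
    size = len(w_rec)
    hidden = [0.0] * size
    for byte in url.encode("utf-8"):
        # h_t = tanh(W_in[x_t] + W_rec . h_{t-1})
        hidden = [
            math.tanh(w_in[byte][j]
                      + sum(w_rec[i][j] * hidden[i] for i in range(size)))
            for j in range(size)
        ]
    return hidden


cell = make_cell(hidden_size=4)
# Same characters, different order: the final states differ because the
# hidden state depends on everything seen so far.
state_a = run_rnn("http://host.com/login", cell)
state_b = run_rnn("http://host.com/nigol", cell)
```

Because the hidden state is fed back at every step, the final state summarizes the whole sequence in order, which is the property a classifier on top of it relies on.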
Fig. 2. Recurrent neural network for classifying phishing URLs based on LSTM units. Each input character is translated by a 128-dimension embedding. The translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is performed using an output sigmoid neuron.
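The input side of Fig. 2 can be sketched as follows: each URL becomes a fixed-length sequence of 150 integer character indices, which an embedding layer would then map to 128-dimension vectors. The vocabulary (printable ASCII), the reserved index 0 for padding and unknown characters, and the choice to keep the last 150 characters and pad on the left are assumptions; the figure only fixes the sequence length and the embedding size.

```python
import string

SEQUENCE_LENGTH = 150   # number of steps fed to the LSTM layer (from Fig. 2)
PAD_INDEX = 0           # assumption: index 0 is reserved for padding/unknown

# Assumption: the vocabulary is printable ASCII; indices start at 1.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(string.printable)}


def encode_url(url, length=SEQUENCE_LENGTH):
    """Map a URL to a fixed-length list of integer character indices."""
    indices = [CHAR_TO_INDEX.get(ch, PAD_INDEX) for ch in url]
    indices = indices[-length:]                      # truncate long URLs
    padding = [PAD_INDEX] * (length - len(indices))  # left-pad short URLs
    return padding + indices


encoded = encode_url("http://host.com:80/page")
```

In Keras, a sequence like `encoded` would typically pass through an `Embedding` layer with `output_dim=128`, an `LSTM` layer, and a `Dense(1, activation="sigmoid")` output; the number of LSTM units is not stated in the caption and is left open here.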
URL Phish
http://www.cheatsguru.com/pc/the sims 3 ambitions/requests/ False
http://www.sherdog.com/pictures/gallery/fighter/f 1349/137143/10/ False
http://www.mauipropertysearch.com/maui-meadows.php False
https://www.sanfordhealth.org/HealthInformation/ChildrensHealth/Article/73980 False
http://strathprints.strath.ac.uk/18806/ False
http://www.grahamleader.com/ci 25029538/these-are-5-worst-super-bowl-halftime-shows False
http://www.nwherald.com/2014/04/14/rizzo-homers-for-cubs-in-loss-to-cardinals/apxo9hf/ False
http://th.urbandictionary.com/define.php?term=politics&defid=1634182 False
http://www.carolinaguesthouse.co.uk/onlinebooking/?industrytype=1&startdate=2013-09-05&nights=2&windowsearch=0&location&productid=25d47a24-6b74-46... False
http://www.lander.edu/Business-Administration/Human-Resources/new-employees/policies-procedures False
http://msystemtech.ru/components/com users/Italy/zz/Login.php?run= login-submit&session=68bbd43c854147324d77872062349924&=68bbd43c854147324d778720... True
http://moviesjingle.com/auto/163.com/index.php True
http://any3.co.nz/wp-includes/Text/pp/5885d80a13c0db1f8e%26ee%3D111e61ae3eeb78bcbc5ec9fa804ee562/5885d80a13c0db1f8e%26ee%3D111e61ae3eeb78bcbc5ec9f... True
http://paypal.com.update.account.toughbook.cl/8a30e847925afc5975161aeabe8930f1/?cmd= home&dispatch=d09b78f5812945a73610edf3852f5ebed09b78f5812945a... True
http://www.zeroaccidente.ro/cache/mod login/home/37baa5e40016ab2b877fee2f0c921570/ True
http://mail.kungfuexperience.co.uk/user-verfication/216545649874az6548945648t754867t56/5959730380a7dbe17368373c106f5866 True
http://www.argo.nov.edu54.ru/plugins/system/applse3/54e9ce13d8baee95696633257b33b2b5/ True
http://rarosbun.rel7.com/ True
http://tech2solutions.com/home/wp-admin/includes/trulia/index.html True
http://esxcc.com/js/index.htm?http://us.battle.net/login/en/?ref=http://ruuyqyrus.battle.net/d3/en/index& True
• $F_1\text{-Score} = 2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
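As a small worked sketch of the formula above, the function below derives precision, recall and the F1-Score from confusion-matrix counts. The definitions of precision and recall used here are the standard ones and are assumed, since they are not restated at this point; the function name and the counts are hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to exercise the formula.
precision, recall, f1 = precision_recall_f1(
    true_positives=90, false_positives=10, false_negatives=30
)
```

With these counts, precision is 0.9 and recall is 0.75, and the F1-Score is their harmonic mean, which lies between the two and closer to the smaller one.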
VI. RESULTS
In this section we present the experimental results. First, we
evaluated the performance of the traditional feature engineer-
ing plus the classification-algorithm methodology presented
in Section III. We created 14 features based on the URL’s
lexical and statistical analysis. Then, we trained a random
forest classifier with 100 decision trees. We used the random
forest implementation of the Scikit-Learn library [31].
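A few of the 14 features can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the implementation used in this work: the helper names `shannon_entropy` and `extract_features`, the use of `urllib.parse`, and the reading of "entropy" as Shannon entropy over characters are all assumptions, and only a subset of the features is shown. The example URL is made up.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# Word list taken from the suspicious-words feature in Section III-A.
SUSPICIOUS_WORDS = ("confirm", "account", "secure", "webscr", "login",
                    "signin", "submit", "update", "logon", "wp", "cmd",
                    "admin")


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url):
    """Compute a subset of the lexical and statistical URL features."""
    parts = urlsplit(url)
    lowered = url.lower()
    return {
        "url_length": len(url),
        "path_length": len(parts.path),
        "url_entropy": shannon_entropy(url),
        "at_and_dash_count": url.count("@") + url.count("-"),
        # Assumption: count non-overlapping substring occurrences per word.
        "suspicious_words_count": sum(lowered.count(w)
                                      for w in SUSPICIOUS_WORDS),
    }


url = "http://paypal.com.update.account.example/login-submit"
features = extract_features(url)
```

Feature dictionaries of this kind would be turned into numeric rows and passed to the classifier; with Scikit-Learn, the 100-tree forest mentioned above corresponds to `RandomForestClassifier(n_estimators=100)`.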
Using the 2,000,000 URLs described above, we tested the accuracy of the model using a 3-fold cross-validation strategy. The results are shown in TABLE II. The average accuracy of the model stands at 93.47%, with a recall of 93.28% and precision of 93.63%. We also evaluated the AUC of the model for each fold (ROC curves are shown in Fig. 4). On average, the models show an AUC statistic of 98.44%. Moreover, the models provide consistent results, as the standard deviation across folds is 0.01% for the accuracy and 0.0008% for the AUC.

Fig. 4. ROC curve of the random forest classifier. On average, the models have an AUC statistic of 98.44% and a standard deviation of 0.0008%.

Furthermore, we analyzed which features were more important for performing classification in the random forest classifier. This is done by counting the number of times each feature was selected in the different decision trees inside the random forest. Feature importance is shown in Fig. 5. The most important feature for the algorithm is the number of suspicious words in the URL. This is not surprising since attackers will try to deceive users by employing suspicious words known by the victims. Also, it is observed that phishing URLs tend to have a higher length ratio between the length of the path and the hostname.

Afterwards, we trained the LSTM network as described in Section IV. In particular, we used the Keras [32] implementation with a Theano [33] backend. We set the number of epochs to 20, and for each fold, we used 90% of the data for training and 10% for internal validation. The learning curve of the LSTM network is shown in Fig. 6. It is observed that within 20 epochs the validation accuracy converges, exceeding 98% from epoch 10 onward.
Fig. 7. ROC curve of the LSTM. The AUC of the model is very high, with an average of 99.91%.
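The AUC statistic reported for both models can be read as the probability that a randomly chosen phishing URL receives a higher score than a randomly chosen legitimate one. A minimal sketch of that pairwise definition follows; the function name `auc_score` and the toy labels and scores are hypothetical, and a library routine such as Scikit-Learn's `roc_auc_score` would normally be used instead.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    This O(n_pos * n_neg) form is for illustration, not for large corpora.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical scores: True marks a phishing URL.
labels = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
auc = auc_score(labels, scores)
```

In this toy example eight of the nine phishing/legitimate pairs are ordered correctly, so the AUC is 8/9; a model that separated the two classes perfectly would score 1.0.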