Classifying Phishing URLs Using Recurrent Neural Networks

Alejandro Correa Bahnsen†, Eduardo Contreras Bohorquez∗, Sergio Villegas†, Javier Vargas† and Fabio A. González∗
† Easy Solutions Research
∗ MindLab Research Group, Universidad Nacional de Colombia, Bogotá

Abstract—As the technical skills and costs associated with the deployment of phishing attacks decrease, we are witnessing an unprecedented level of scams that push the need for better methods to proactively detect phishing threats. In this work, we explored the use of URLs as input for machine learning models applied for phishing site prediction. In this way, we compared a feature-engineering approach followed by a random forest classifier against a novel method based on recurrent neural networks. We determined that the recurrent neural network approach provides an accuracy rate of 98.7% even without the need of manual feature creation, beating the random forest method by 5 percentage points. This means it is a scalable and fast-acting proactive detection system that does not require full content analysis.

Keywords—Phishing detection; Cybercrime; Feature engineering; Recurrent neural networks; Long short term memory networks.

978-1-5386-2701-3/17/$31.00 © 2017 IEEE

I. INTRODUCTION

Phishing attacks are a growing problem worldwide. According to the Anti-Phishing Working Group (APWG), phishing websites increased by 250% from the last quarter of 2015 to the first quarter of 2016, targeting more than 400 brands each month [1]. This is the most the APWG has ever seen since they began tracking and reporting on phishing in 2004. Phishing, by definition, is the act of defrauding an online user by posing as a trustworthy institution or entity in order to obtain personal information [2]. The use of phishing by a criminal is centered around using social engineering schemes to steal personal and financial information. The attacks are designed to lead consumers to reveal financial data such as usernames and/or passwords in fraudulent websites posing as legitimate entities [1], [3].

Nowadays, phishing attacks can be launched from anywhere in the world at insignificant costs by people with little to no technical skills [4]. Organizations trying to protect their users from these attacks are having a hard time dealing with the massive amount of emerging sites, which must be identified and labeled as malicious or harmless before users can access them.

There is no shortage of methods for distributing attacks, and phishers enjoy a wide range of techniques for making a site appear legitimate while evading detection [4], [5]. However, at the end of the day, all of them rely on URLs that redirect their victims to the phishing trap. Impersonating legitimate URLs is the most common social engineering method a phisher can use to lure victims to their website. Therefore, a solid first step towards blocking fraud sites is to use the URLs themselves to screen possible phishing websites [6].

Being able to determine the maliciousness of a website by simply evaluating its URL provides a major strategic advantage. The number of victims can be reduced to nearly zero while minimizing operational efforts by avoiding massive use of more complex methods such as content analysis [7].

For this work, we focused on using machine learning techniques for the classification of phishing sites using only their URLs. Specifically, we compared the combination of lexical and statistical analysis of URLs as input for a random forest (RF) classifier against a novel approach that employs recurrent neural networks, more particularly, a long/short term memory network (LSTM). RFs, with manually-created features, have been widely used for classification problems [8]. Moreover, this method has been successfully applied to identifying phishing URLs [9]. On the other hand, LSTM models are competent at detecting long patterns in sequences and have been applied to solve different text analysis problems [10]. The LSTM method does not require the manual extraction of features, since it directly learns a representation from the URL’s sequence of characters [11]. Recently, LSTM models have been used to detect domain generation algorithms [12], showing great promise for the infosec industry. To the best of our knowledge, this is the first time that the LSTM model has been applied to the detection of phishing URLs.

To evaluate both approaches, we took a corpus comprised of one million phishing URLs extracted from PhishTank (https://www.phishtank.com/) and one million harmless URLs from Common Crawl (http://commoncrawl.org/). The results show that despite using the URLs as sole input, the RF and the LSTM methods achieved an accuracy rate of 93.5% and 98.7%, respectively. Additionally, we compared both methods in terms of training and evaluation times, and the amount of data needed to converge.

The remainder of this paper is organized as follows: In Section II, we will provide a background on the problem, as well as any related work on phishing detection. Sections
III and IV will provide in-depth descriptions of both machine learning methods. Subsequently, in Section V, we will describe the data and methodology used for the experiments. Section VI presents the experimental results. Finally, we will provide the conclusions of the paper in Section VII.

II. BACKGROUND

A. Phishing Detection

Phishing URL detection can be done via proactive or reactive means. On the reactive end, we find services such as the Google Safe Browsing API (https://safebrowsing.google.com/). This type of service exposes a blacklist of malicious URLs to be queried. Blacklists are constructed by using different techniques, including manual reporting, honeypots, or by crawling the web in search of known phishing characteristics [13], [14]. For example, browsers make use of blacklists to block access upon reaching the URLs contained in them. One drawback of such a reactive method is that in order for a phishing URL to be blocked, it must be previously included in the blacklist. This implies that web users remain at risk until the URL is submitted and the blacklist is updated. What is more, since the majority of phishing sites are active for less than a day [14], [15], their mission is complete by the time they are added to the blacklist.

Proactive methods mitigate this problem by analyzing the characteristics of a web page in real time in order to assess the potential risk of a web page. Risk assessment is done through a classification model [16]. Some of the machine learning methods that have been used to detect phishing include: support vector machines [17], streaming analytics [18], gradient boosting [6], [19], random forests [20], latent Dirichlet allocation [21], online incremental learning [22], and neural networks [23]. Several of these methods employ an array of website characteristics, which means that in order to evaluate a site, first it has to be rendered before the algorithm can be used. This adds a significant amount of time to the evaluation process [24], [25]. Using URLs, instead of content analysis, reduces the evaluation time because only a limited portion of text is analyzed.

Lately, the application of machine learning techniques for URL classification has been gaining attention. Several studies proposing the use of classification algorithms to detect phishing URLs have come to light in recent years [6], [20], [26]. These studies are mainly focused on creating features through expert knowledge and lexical analysis of the URL. Then, the phishing site’s characteristics are used as quantitative input for the model. The model in turn learns to recognize patterns and associations the inputs must follow in order to label a site as legitimate or malicious.

B. Uniform Resource Locator Structure

The Uniform Resource Locator (URL), as specified in RFC 1738 [27], is a string representation for a resource available on the Internet. A URL is written as follows:

<scheme>:<scheme-specific-part>

The scheme specifies the resource’s access mechanism or network location (e.g. http, ftp, mailto), while the rest of the URL may vary depending on the scheme selected. For the HTTP protocol, a possible syntactic construction for the next part can be:

//<host>:<port>/<URL-path>

It starts with the domain name or IP address of a network host. Then, there is the port number to connect to and the URL-path that provides details on how the resource can be accessed (e.g. http://host.com:80/page).

III. CLASSIFYING PHISHING USING URL LEXICAL AND STATISTICAL FREQUENCIES

In this section, we will describe our approach to combine the lexical and statistical analysis of a URL with a random forest classifier to classify phishing websites based on URL features. The process is summarized in Fig. 1. First, a series of important variables is extracted with a feature-engineering approach. Then, a classification algorithm is used to build the model.

A. Feature Engineering

The attackers’ objective when crafting a phishing URL is to trick users into thinking it is a legitimate website. In this way, the cybercriminal hopes users will reveal their personal and financial information. In order to achieve this, the attackers follow certain tried-and-true patterns, which can be detected by an experienced eye. In collaboration with security analysts at Easy Solutions, Inc. (https://www.easysolutions.net), we identified a set of 14 features that can be used to create lexical and statistical analysis of URLs:

• Domain exists in Alexa rank (http://www.alexa.com/): If the domain exists among the top one million Alexa domains. The Alexa rank is a list of domains arranged by internet popularity. Most phishing sites are hosted in hacked legitimate sites or new domains. If the phishing is hosted in a hijacked website, it is unlikely the domain is part of the top Alexa domains, since top-ranked domains tend to have better security measures. If the phishing is hosted in a newly-registered domain, the domain will not appear in the Alexa rank.
• Subdomain length: This takes the length of the URL’s subdomain. Phishing sites try to mimic the legitimate site’s URL by using its domain as their subdomain. Real websites tend to have a short subdomain.
• URL length: This takes the URL’s length. A long URL increases the odds of confusing the user.
• Path length: This takes the URL’s path length. Phishing URLs tend to have a longer path than the legitimate ones.
• URL Entropy: Calculates URL entropy. The higher the entropy of a URL, the more complex it is. Since phishing URLs tend to have random text, we can attempt to find them by their entropy.
• Length ratio: Calculates the ratio between URL length and path length. In [20] they concluded that phishing URLs tend to have a higher ratio than legitimate URLs.
• ’@’ and ’-’ count: Counts @ and - characters in the URL. In accordance with [20] this feature is added. In URLs, everything to the left of @ gets ignored. In light of this, phishing URLs use it to deceive users. For example [email protected].
• Punctuation count: The count of . ! # $ % & , . ; ’ in the URL. In [20], they found that phishing URLs usually show a higher punctuation count.
• Other TLDs count: The number of TLDs that appear in the URL’s path. Phishing URLs try to impersonate legitimate URLs by using their domain and TLD in the path.
• Is IP: If the URL is an IP instead of a domain. It is a feature that has been used in the literature.
• Suspicious words count: The number of suspicious words in the URL. Suspicious words include ’confirm’, ’account’, ’secure’, ’webscr’, ’login’, ’signin’, ’submit’, ’update’, ’logon’, ’secure’, ’wp’, ’cmd’ and ’admin’. They were chosen manually by observing phishing URLs.
• Euclidean distance: The Euclidean distance between the character frequencies of the URL and those of English. This feature, and the following two, attempt to measure how much the URL differs from common English.
• Kolmogorov-Smirnov statistic: Calculates the two-sample Kolmogorov-Smirnov statistic on character frequencies between the URL and English.
• Kullback-Leibler divergence: Calculates the Kullback-Leibler divergence on the character frequencies between the URL and English.

Fig. 1. Feature-engineering approach for classifying phishing URLs. First, a series of important variables is extracted with a feature-engineering approach. Then, a classification algorithm is used to build the model.

B. Classification Algorithm

Once the features are extracted, a binary classifier is trained using the presented URL features. We use a random forest (RF) [28] method to achieve this. An RF is a classification algorithm that relies on weaker models to build a stronger model with the average of the weaker model responses. The weaker models used are classification trees; each one of them recursively splits the data set based on feature values, and stops the current split when all input instances belong to the same class. We chose RF since it is widely used [8] and can be trained in parallel.

IV. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS

In the previous section, we designed a set of features extracted from a URL and fed them into a classification model to predict whether a URL is a case of phishing. We now approach the problem in a different way. Instead of manually extracting the features, we directly learn a representation from the URL’s character sequence.

Each character sequence exhibits correlations, that is, nearby characters in a URL are likely to be related to each other. These sequential patterns are important because they can be exploited to improve the performance of the predictors [11].

A neural network is a bio-inspired machine learning model that consists of a set of artificial neurons with connections between them. Recurrent Neural Networks (RNN) are a type of neural network that is able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the notion of time to the model, which in turn allows them to process sequential data one element at a time and learn their sequential dependencies [10].
Fig. 2. Recurrent neural network for classifying phishing URLs based on LSTM units. Each input character is translated by a 128-dimension embedding. The translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is performed using an output sigmoid neuron.
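
As a minimal sketch of the input encoding described in the Fig. 2 caption, the snippet below maps each character of a URL to an integer index and pads or truncates the result to 150 steps. The vocabulary (`string.printable`), the use of index 0 for padding and unknown characters, left-padding, keeping the last 150 characters of long URLs, and the helper name `encode_url` are all assumptions made for illustration; the paper does not specify these details.

```python
import string

# Assumed vocabulary: printable ASCII. Index 0 is reserved for padding and
# for characters outside the vocabulary (an illustrative choice).
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 150  # sequence length stated in the Fig. 2 caption


def encode_url(url, max_len=MAX_LEN):
    """Turn a URL into a fixed-length list of integer character indices."""
    ids = [VOCAB.get(ch, 0) for ch in url]
    ids = ids[-max_len:]                      # keep the last max_len characters
    return [0] * (max_len - len(ids)) + ids   # left-pad short URLs with zeros


seq = encode_url("http://host.com:80/page")
print(len(seq), seq[-4:])
```

In a full model, each index would then be looked up in the 128-dimension embedding before the sequence enters the LSTM layer.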

One limitation of general RNNs is that they are unable to learn the correlation between elements more than 5 or 10 time steps apart [29]. A model that overcomes this problem is Long Short Term Memory (LSTM). This model can bridge elements separated by more than 1,000 time steps without loss of short time lag capabilities [30].

LSTM is an adaptation of RNN. Here, each neuron is replaced by a memory cell that, in addition to a conventional neuron representing an internal state, uses multiplicative units as gates to control the flow of information. A typical LSTM cell has an input gate that controls the input of information from the outside, a forget gate that controls whether to keep or forget the information in the internal state, and an output gate that allows or prevents the internal state from being seen from the outside.
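
To make the role of the three gates concrete, here is a minimal sketch of one time step of a single LSTM cell, using the commonly used formulation with scalar input and state. The function name `lstm_step`, the parameter layout, and the weight values are hypothetical and chosen only for illustration; they are not the trained weights or the implementation used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM cell with scalar input and state.

    p maps each gate name to a (w_x, w_h, bias) tuple of hypothetical weights.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))    # input gate: how much new information enters
    f = sigmoid(pre("forget"))   # forget gate: how much old state is kept
    o = sigmoid(pre("output"))   # output gate: how much state is exposed
    g = math.tanh(pre("cand"))   # candidate value for the internal state
    c = f * c_prev + i * g       # updated internal state
    h = o * math.tanh(c)         # output visible to the rest of the network
    return h, c


params = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "cand")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:       # a toy 3-step input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the internal state is updated additively and scaled by the forget gate, a gate value close to 1 lets information persist across many steps, which is the property the text attributes to LSTM.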
In this work, we used LSTM units to build a model that receives as input a URL as a character sequence and predicts whether or not the URL corresponds to a case of phishing. The architecture is illustrated in Fig. 2. Each input character is translated by a 128-dimension embedding. The translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is performed using an output sigmoid neuron. The network is trained by backpropagation using a cross-entropy loss function and dropout in the last layer.

V. EXPERIMENTAL SETUP

A. Data

To train both models, a dataset of real and phishing URLs was constructed. In total, 2 million URLs were used in the training process, half of them legitimate and half of them phishing. The legitimate URLs came from Common Crawl, a corpus of web crawl data. The phishing URLs came from Phishtank, a website used as a repository of phishing URLs. TABLE I shows a sample of ten legitimate URLs and ten phishing URLs. Note how similar the legitimate and malicious URLs can actually be. This is expected as the objective of an attacker is to confuse web users by making the phishing site look as genuine as possible.

Moreover, in Fig. 3, a comparison of the URL length distribution is presented. It can be observed that the phishing pages tend to have slightly longer URLs (measured in number of characters).

Fig. 3. URL length distribution. It is shown that the phishing and legitimate URLs have a very similar length distribution, confirming that they are quite similar and difficult to tell apart.

TABLE I. SAMPLE OF THE RAW URL DATABASE. SHOWN IS THE SIMILARITY OF LEGITIMATE AND PHISHING URLS. THIS IS EXPECTED AS THE OBJECTIVE OF AN ATTACKER IS TO CONFUSE THE VICTIM BY MAKING THE PHISHING SITE LOOK AS HARMLESS AS POSSIBLE.

URL  Phish
http://www.cheatsguru.com/pc/the_sims_3_ambitions/requests/  False
http://www.sherdog.com/pictures/gallery/fighter/f_1349/137143/10/  False
http://www.mauipropertysearch.com/maui-meadows.php  False
https://www.sanfordhealth.org/HealthInformation/ChildrensHealth/Article/73980  False
http://strathprints.strath.ac.uk/18806/  False
http://www.grahamleader.com/ci_25029538/these-are-5-worst-super-bowl-halftime-shows  False
http://www.nwherald.com/2014/04/14/rizzo-homers-for-cubs-in-loss-to-cardinals/apxo9hf/  False
http://th.urbandictionary.com/define.php?term=politics&defid=1634182  False
http://www.carolinaguesthouse.co.uk/onlinebooking/?industrytype=1&startdate=2013-09-05&nights=2&windowsearch=0&location&productid=25d47a24-6b74-46...  False
http://www.lander.edu/Business-Administration/Human-Resources/new-employees/policies-procedures  False
http://msystemtech.ru/components/com_users/Italy/zz/Login.php?run=_login-submit&session=68bbd43c854147324d77872062349924&=68bbd43c854147324d778720...  True
http://moviesjingle.com/auto/163.com/index.php  True
http://any3.co.nz/wp-includes/Text/pp/5885d80a13c0db1f8e%26ee%3D111e61ae3eeb78bcbc5ec9fa804ee562/5885d80a13c0db1f8e%26ee%3D111e61ae3eeb78bcbc5ec9f...  True
http://paypal.com.update.account.toughbook.cl/8a30e847925afc5975161aeabe8930f1/?cmd=_home&dispatch=d09b78f5812945a73610edf3852f5ebed09b78f5812945a...  True
http://www.zeroaccidente.ro/cache/mod_login/home/37baa5e40016ab2b877fee2f0c921570/  True
http://mail.kungfuexperience.co.uk/user-verfication/216545649874az6548945648t754867t56/5959730380a7dbe17368373c106f5866  True
http://www.argo.nov.edu54.ru/plugins/system/applse3/54e9ce13d8baee95696633257b33b2b5/  True
http://rarosbun.rel7.com/  True
http://tech2solutions.com/home/wp-admin/includes/trulia/index.html  True
http://esxcc.com/js/index.htm?http://us.battle.net/login/en/?ref=http://ruuyqyrus.battle.net/d3/en/index&amp  True

B. Experiment Design

In order to evaluate the performance of the models, we used a 3-fold cross-validation strategy. This process consists of splitting the data into 3 folds, then training the model on two folds while the remaining one is used for model validation. This process is repeated 3 times, using each fold for validation only once. In the end, all the performance metrics on validation

folds are averaged. In this way, the variance is reduced and we can obtain a better estimate of the model’s performance.

TABLE II. RESULTS RANDOM FOREST

Fold     AUC       Accuracy  Recall    Precision  F1-score
0        0.984499  0.934818  0.932642  0.936632   0.934633
1        0.984458  0.934782  0.932798  0.93663    0.93471
2        0.984489  0.934588  0.93302   0.935924   0.934469
Average  0.984482  0.934729  0.93282   0.936395   0.934604
Std dev  1.8e-05   0.000101  0.000155  0.000333   0.0001

The performance evaluation is done using standard classification evaluation measures, as described below:

• Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
• Recall = $\frac{TP}{TP + FN}$
• Precision = $\frac{TP}{TP + FP}$
• F1-Score = $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$,

where TP and FP are the numbers of true and false positives, and TN and FN are the numbers of true and false negatives, respectively. We define the phishing URLs as positive and the legitimate/ham ones as negative. Lastly, we also used the ROC curve to evaluate the AUC statistic.
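
The evaluation procedure of this section (a 3-fold split followed by accuracy, recall, precision and F1-Score per validation fold, then averaging) can be sketched in plain Python as follows. The helper names `k_fold_indices` and `classification_metrics`, the fixed random seed, and the toy labels are assumptions for illustration; the model-training step is left as a comment because it depends on the classifier being evaluated.

```python
import random


def k_fold_indices(n, k=3, seed=0):
    """Shuffle range(n) and split it into k nearly equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision and F1 with phishing (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Toy labels and predictions, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

folds = k_fold_indices(len(y_true), k=3)
per_fold = []
for val_idx in folds:
    # In a real run, the model would be trained on the other two folds here
    # and y_pred would hold its predictions for the validation fold.
    per_fold.append(classification_metrics([y_true[i] for i in val_idx],
                                           [y_pred[i] for i in val_idx]))

average = {name: sum(m[name] for m in per_fold) / len(per_fold)
           for name in per_fold[0]}
print(average)
```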

VI. RESULTS

In this section we present the experimental results. First, we evaluated the performance of the traditional feature engineering plus the classification-algorithm methodology presented in Section III. We created 14 features based on the URL’s lexical and statistical analysis. Then, we trained a random forest classifier with 100 decision trees. We used the random forest implementation of the Scikit-Learn library [31].
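
To illustrate the feature-engineering step, the following is a minimal sketch of a handful of the 14 features from Section III-A, computed with the Python standard library only. It is not the feature extractor used in the experiments: the function names, the naive rule that the last two host labels form the registered domain, and the example URL (on the reserved `.test` TLD) are assumptions, and the Alexa rank lookup and the three distance-to-English features are omitted.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Suspicious words listed in Section III-A ('secure' appears twice there;
# it is kept once here).
SUSPICIOUS = ["confirm", "account", "secure", "webscr", "login", "signin",
              "submit", "update", "logon", "wp", "cmd", "admin"]


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def url_features(url):
    """Compute a few lexical and statistical features of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    # Naive assumption: the last two labels form the registered domain.
    subdomain = "" if is_ip else ".".join(host.split(".")[:-2])
    lower = url.lower()
    return {
        "url_length": len(url),
        "path_length": len(parts.path),
        "subdomain_length": len(subdomain),
        "at_dash_count": url.count("@") + url.count("-"),
        "is_ip": is_ip,
        "suspicious_words": sum(lower.count(w) for w in SUSPICIOUS),
        "entropy": shannon_entropy(url),
    }


# Hypothetical URL that imitates a well-known brand in its subdomain.
example = "http://paypal.com.update.account.example.test/secure/login.php"
for name, value in url_features(example).items():
    print(name, value)
```

A feature dictionary of this kind, one per URL, would form the rows of the matrix handed to the random forest classifier.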
Using the 2,000,000 URLs described above, we tested the accuracy of the model using a 3-fold cross validation strategy. The results are shown in TABLE II. The average accuracy of the model stands at 93.47%, with a recall of 93.28% and precision of 93.63%. We also evaluated the AUC of the model for each fold (ROC curves are shown in Fig. 4). On average, the models show an AUC statistic of 98.44%. Moreover, the models provide consistent results, as the standard deviation across folds is 0.01% for the accuracy and 0.0018% for the AUC.

Fig. 4. ROC curve random forest classifier. On average, the models have an AUC statistic of 98.44% and a standard deviation of 0.0018%.

Furthermore, we analyzed which features were more important for performing classification in the random forest classifier. This is done by counting the number of times each feature was selected in the different decision trees inside the random forest. Feature importance is shown in Fig. 5. The most important feature for the algorithm is the number of suspicious words in the URL. This is not surprising since attackers will try to deceive users by employing suspicious words known by the victims. Also, it is observed that phishing URLs tend to have a higher length ratio between the length of the path and the hostname.

Afterwards, we trained the LSTM network as described in Section IV. In particular, we used the Keras implementation [32] with a Theano [33] backend. We defined the number of epochs to be 20, and for each fold, we used 90% of the data for training, and 10% for internal validation. In Fig. 6, the learning curve of the LSTM network is shown. It is observed that in just 20 epochs, the validation accuracy converges, increasing to over 98% from epoch 10 onward.
Fig. 5. Feature importance of the random forest classifier. The most important features are the number of suspicious words and the ratio of the path and hostname lengths.

Fig. 6. Learning curve LSTM network. The learning curve shows that in just 20 epochs the validation accuracy converges, increasing to over 98% from epoch 10 onwards.

As shown in TABLE III, the LSTM model has an average accuracy of 98.76%, with a standard deviation of just 0.04%, suggesting stable performance across the different folds. Moreover, as shown in Fig. 7, the AUC of the model is very high, having an average of 99.91%.

TABLE III. RESULTS LSTM NETWORK

Fold     AUC       Accuracy  Recall    Precision  F1-score
0        0.999044  0.9871    0.991114  0.983203   0.987143
1        0.999106  0.987921  0.989549  0.986359   0.987952
2        0.999141  0.987844  0.98716   0.988506   0.987833
Average  0.999097  0.987622  0.989274  0.986023   0.987642
Std dev  4e-05     0.00037   0.001626  0.002178   0.000357

Fig. 7. ROC curve LSTM. The AUC of the model is very high, having an average of 99.91%.

Comparison of the methods

First, we compared the accuracy and F1-Score of both methods. The comparisons are presented in Fig. 8a and Fig. 8b. For both measures, the difference between models is consistent across folds. On average, the LSTM network has an accuracy 5 percentage points higher than the feature-engineering model with RF. Similar results were found for the F1-Score.

We also compared the accuracy of the models using different numbers of URLs for training. For this experiment, we randomly selected 200,000 URLs and used them to test the different models. Then, we selected samples of sizes 1,000, 5,000, 10,000, 50,000, 100,000, 500,000 and 1,000,000, and trained both methods on each sample. The results are summarized in Fig. 9. The LSTM model consistently outperforms the RF model. Also, the LSTM network improves its own performance faster than the RF, as the number of training URLs increases.

Lastly, we compared the training and evaluation times for both models. The comparison took place on a Lenovo Y50 machine with 16 GB of memory, an Intel Core i7-4710HQ CPU @ 2.50GHz x8 processor, and a GeForce GTX 860M GPU. As can be seen in TABLE IV, the LSTM model requires significantly more time to train. Using the 2 million URLs, the RF is trained in an average time of less than 3 minutes. On the other hand, LSTM requires 238 minutes. It should be pointed out that once the models have been trained, the RF method is able to evaluate 942 URLs per second compared to the 281 URLs per second of the LSTM method. However, the memory requirements of the RF model are almost 500 times higher than those for LSTM. This is directly related to the complexity of storing the model parameters.
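
The trade-offs quoted above follow directly from the figures reported in TABLE IV. As a short worked check, the snippet below recomputes the memory, speed and training-time ratios and estimates how long each model would need to score a batch of URLs. The batch size of one million URLs is a hypothetical value chosen for illustration.

```python
# Values reported in TABLE IV.
rf_mb, lstm_mb = 288.7, 0.581                # memory consumption in MB
rf_rate, lstm_rate = 942.12, 280.90          # URLs evaluated per second
rf_train_min, lstm_train_min = 2.95, 238.7   # training time in minutes

memory_ratio = rf_mb / lstm_mb               # RF memory relative to LSTM
speed_ratio = rf_rate / lstm_rate            # RF evaluation speed relative to LSTM
train_ratio = lstm_train_min / rf_train_min  # LSTM training time relative to RF

n_urls = 1_000_000                           # hypothetical batch size
rf_minutes = n_urls / rf_rate / 60
lstm_minutes = n_urls / lstm_rate / 60

print(f"RF uses {memory_ratio:.0f}x the memory of the LSTM")
print(f"RF evaluates URLs {speed_ratio:.1f}x faster")
print(f"LSTM takes {train_ratio:.0f}x longer to train")
print(f"Scoring {n_urls} URLs: RF {rf_minutes:.1f} min, LSTM {lstm_minutes:.1f} min")
```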
Fig. 8. (a) Accuracy by fold. (b) F1-Score by fold. It is observed that the LSTM method consistently outperforms the random forest in each fold. Moreover, the accuracy and F1-Score are stable between folds, which suggests a highly robust model.

TABLE IV. COMPARISON OF THE METHODS

Method  Training Time (minutes)  Evaluation Time (URLs per second)  Memory Consumption (MB)
RF      2.95 ± 0.11              942.12 ± 95.02                     288.7
LSTM    238.7 ± 0.79             280.90 ± 64.48                     0.581

Fig. 9. Comparison of model accuracy vs number of training URLs. For this experiment, we randomly selected 200,000 URLs and used them to test the models. It is observed that the LSTM model consistently outperforms the RF model, improving its performance faster as the number of training URLs increases and arriving at a near perfect classification with fewer training cases.

VII. CONCLUSIONS AND DISCUSSION

We explored how well we can discern phishing URLs from legitimate URLs using two methodologies: feature-engineering with a lexical and statistical URL analysis with a random forest (RF) classifier, and a novel approach using a long/short term memory neural network (LSTM). The former has been widely used since the 1990s; the latter is a newer method within recurrent neural networks. In order to evaluate the approaches, we used a database comprised of one million legitimate URLs from the Common Crawl database and one million phishing URLs from Phishtank. Both models showed great statistical results. On one hand, the RF had an F1-Score of 0.93 and an accuracy of 93.5%, while the LSTM had an F1-Score of 0.98 and an accuracy of 98.7%.

With these results, we can conclude that discerning URLs by their patterns is a good predictor of phishing websites. The results indicate that creating a URL-based proactive phishing detection system is a much more feasible approach than doing full-content analysis. In comparison, this system would exhibit faster responses, since full-content analysis is not required. RF and LSTM are able to evaluate URLs at a rate of 942 per second and 281 per second respectively. Nevertheless, a significant difference in the memory requirements of the models is noticeable. While RF takes 288.7 MB of memory, LSTM only employs 581 KB. This is crucial as there are memory-restricted applications, such as mobile apps. In this case, the RF model is impractical and LSTM should be chosen.

In our analysis of the methodologies we found pros and cons for both methods. The LSTM model shows an overall higher prediction performance without the need of expert knowledge to create the features. The downside is that its inner workings cannot be interpreted easily. Conversely, the RF model on average achieved a performance 5 percentage points lower than the LSTM model and needed expert knowledge to create the features. However, the RF model can be interpreted more easily due to input features and feature importance. In general terms, neural network models require far more training data, time and expertise to achieve satisfactory results than traditional models such as RF. We showed how an RF can
be trained in less than 3 minutes, while LSTM required 238 minutes. Additionally, we have to take into account parameter tuning in both models. In an RF model, we were required to tweak the number of trees and the depth, two static variables. In LSTM, we tweaked the network architecture, which includes number of epochs, size of inner layers, embedding size, and dropout parameters, among others.

REFERENCES

[1] APWG, “Phishing Activity Trends Report, 3rd Quarter 2016,” Tech. Rep. December, 2016.
[2] S. Roopak and T. Thomas, “A Novel Phishing Page Detection Mechanism Using HTML Source Code Comparison and Cosine Similarity,” in 2014 Fourth International Conference on Advances in Computing and Communications, 2014, pp. 167–170.
[3] R. Dhamija, J. D. Tygar, and M. Hearst, “Why Phishing Works,” in SIGCHI Conference on Human Factors in Computing Systems, 2006, pp. 581–590.
[4] J. Vargas, A. Correa Bahnsen, S. Villegas, and D. Ingevaldson, “Knowing your enemies: Leveraging data analysis to expose phishing patterns against a major US financial institution,” in 2016 APWG Symposium on Electronic Crime Research (eCrime), 2016, pp. 52–61.
[5] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Learning to detect malicious urls,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 30:1–30:24, May 2011.
[6] S. Marchal, K. Saari, N. Singh, and N. Asokan, “Know Your Phish: Novel Techniques for Detecting Phishing Sites and Their Targets,” in International Conference on Distributed Computing Systems, 2016, pp. 323–333.
[7] T. Thakur and R. Verma, Catching Classical and Hijack-Based Phishing Attacks. Cham: Springer International Publishing, 2014, pp. 318–337.
[8] M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014.
[15] C. Whittaker, B. Ryner, and M. Nazif, “Large-Scale Automatic Classification of Phishing Pages,” NDSS ’10, 2010.
[16] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, “A comparison of machine learning techniques for phishing detection,” in Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit on - eCrime ’07, 2007, pp. 60–69.
[17] G. L’Huillier, A. Hevia, R. Weber, and S. Rios, “Latent semantic analysis and keyword extraction for phishing classification,” in International Conference on Intelligence and Security Informatics, 2010, pp. 129–131.
[18] S. Marchal, J. Francois, R. State, and T. Engel, “PhishStorm: Detecting Phishing With Streaming Analytics,” IEEE Transactions on Network and Service Management, vol. 11, no. 4, pp. 458–471, 2014.
[19] S. Marchal, K. Saari, N. Singh, and N. Asokan, “Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets,” Oct. 2015.
[20] R. Verma and K. Dyer, “On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers,” in ACM Conference on Data and Application Security and Privacy, 2015, pp. 111–121.
[21] V. Ramanathan and H. Wechsler, “Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation,” Computers & Security, vol. 34, pp. 123–139, 2013.
[22] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying Suspicious URLs: An Application of Large-Scale Online Learning,” in International Conference on Machine Learning, Montreal, Canada, 2009, pp. 681–688.
[23] R. M. Mohammad, F. Thabatah, and L. McCluskey, “Predicting phishing websites based on self-structuring neural network,” Neural Computing and Applications, vol. 25, no. 2, pp. 443–458, 2014.
[24] C. Ardi and J. Heidemann, “Poster: Lightweight content-based phishing detection,” USC/Information Sciences Institute, Tech. Rep. ISI-TR-2015-698, May 2015.
[25] G. Wang, H. Liu, S. Becerra, K. Wang, S. Belongie, H. Shacham, and S. Savage, “Verilogo: Proactive phishing detection via logo recognition,” UC San Diego, Tech. Rep. CS2011-0969, Aug. 2011.
[26] A. Le, A. Markopoulou, and M. Faloutsos, “PhishDef: URL Names Say It All,” in INFOCOM, 2011 Proceedings IEEE, 2011.
[27] T. Berners-Lee, L. Masinter, and M. McCahill, “Uniform resource
[9] S. Marchal, R. State, and T. Engel, “PhishScore: Hacking Phishers
locators (url),” Tech. Rep., 1994.
Minds,” in CNSM, 2014, pp. 46–54.
[28] L. Breiman, “Random Forests,” Machine Learning, vol. 45, pp. 5–32,
[10] Z. C. Lipton, “A Critical Review of Recurrent Neural Networks for
2001.
Sequence Learning,” CoRR, vol. abs/1506.0, pp. 1–38, 2015.
[29] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:
[11] T. Dietterich, “Machine learning for sequential data: A review,”
continual prediction with LSTM.” Neural computation, vol. 12, no. 10,
Structural, syntactic, and statistical pattern recognition, pp. 1–15,
pp. 2451–2471, 2000.
2002.
[30] S. Hochreiter and J. J. Schmidhuber, “Long short-term memory,”
[12] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, “Predicting
Neural Computation, vol. 9, no. 8, pp. 1–32, 1997.
Domain Generation Algorithms with Long Short-Term Memory
Networks,” http://arxiv.org/abs/1611.00791, nov 2016. [31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
[13] J. Zhang, P. Porras, and J. Ullrich, “Highly Predictive Blacklisting,” in
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
17th USENIX Security Symposium, 2008, pp. 107–122.
E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of
[14] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond Blacklists : Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
Learning to Detect Malicious Web Sites from Suspicious URLs,” World
[32] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
Wide Web Internet And Web Information Systems, pp. 1245–1253,
2009. [33] Theano Development Team, “Theano: A Python framework for
fast computation of mathematical expressions,” arXiv e-prints, vol.
abs/1605.02688, May 2016.
