
Phishing Detection Using Machine Learning Techniques

Vahid Shahrivari
Computer Engineering Department
Sharif University of Technology
Tehran, Iran
[email protected]

Mohammad Mahdi Darabi
School of Electrical and Computer Engineering
University of Tehran
Tehran, Iran
[email protected]

Mohammad Izadi
Computer Engineering Department
Sharif University of Technology
Tehran, Iran
[email protected]

arXiv:2009.11116v1 [cs.CR] 20 Sep 2020

Abstract—The Internet has become an indispensable part of our life. However, it has also provided opportunities to anonymously perform malicious activities like phishing. Phishers try to deceive their victims by social engineering or by creating mock-up websites, stealing information such as account IDs, usernames, and passwords from individuals and organizations. Although many methods have been proposed to detect phishing websites, phishers have evolved their techniques to escape these detection methods. One of the most successful approaches for detecting these malicious activities is machine learning, because most phishing attacks share common characteristics that machine learning methods can identify. In this paper, we compare the results of multiple machine learning methods for predicting phishing websites.

Index Terms—Phishing, Classification, Cybercrime, Machine learning

I. INTRODUCTION

Phishing is a kind of cybercrime that tries to obtain important or confidential information from users and is usually carried out by creating a counterfeit website that mimics a legitimate one. Phishing attacks employ a variety of techniques such as link manipulation, filter evasion, website forgery, covert redirect, and social engineering. The most common approach is to set up a spoofed web page that imitates a legitimate website. These types of attacks were top concerns in the 2018 Internet Crime Report, issued by the U.S. Federal Bureau of Investigation's Internet Crime Complaint Center (IC3). The statistics gathered by the FBI's IC3 for 2018 showed that internet-based theft, fraud, and exploitation remain pervasive and were responsible for a staggering $2.7 billion in financial losses in 2018. In that year, the IC3 received 20,373 complaints about business email compromise (BEC) and email account compromise (EAC), with losses of more than $1.2 billion [1]. The report notes that the number of these sophisticated attacks has grown steadily in recent years. The Anti-Phishing Working Group (APWG) likewise emphasizes that phishing attacks have grown in recent years. Figure 1 illustrates the total number of phishing sites detected by APWG in the last quarter of 2019 and the first quarter of 2020; this number grew gradually from 162,155 cases in the last quarter of 2019 to 165,772 cases in the first quarter of 2020. Phishing has caused severe damage to many organizations and the global economy. In the fourth quarter of 2019, APWG member OpSec Security found that SaaS and webmail sites remained the most frequent targets of phishing attacks; phishers continue to harvest credentials from these targets by operating BEC campaigns and subsequently gaining access to corporate SaaS accounts [2].

Fig. 1. Total number of phishing websites detected by APWG [2]

Many approaches have been used to filter out phishing websites. Each of these methods is applicable at a different stage of the attack flow, for example network-level protection, authentication, client-side tools, user education, server-side filters, and classifiers. Although every type of phishing attack has some unique features, most of these attacks exhibit similarities and patterns. Since machine learning methods have proved to be a powerful tool for detecting patterns in data, they have made it possible to detect some of the common phishing traits and therefore to recognize phishing websites. In this paper, we provide a comparative and analytical evaluation of different machine learning methods for detecting phishing websites. The machine learning methods that we studied are Logistic Regression, Decision Tree, Random Forest, Ada-Boost, Support Vector Machine, KNN, Artificial
Neural Networks, Gradient Boosting, and XGBoost. The rest of this paper is organized as follows: in Section II we list some widely used phishing techniques; in Section III we discuss different types of phishing and phishing attack prevention methods; in Section IV we provide an overview of different machine learning methods for phishing detection; in Section V we describe the features of our dataset; in Sections VI and VII we show the evaluation results of the suggested machine learning methods; and finally we draw conclusions and discuss future work in Section VIII.

II. PHISHING TECHNIQUES

In this section, we discuss some well-known phishing approaches used by criminals to deceive people.

A. Link manipulation
Phishing is mainly about links, and there are some clever ways to manipulate a URL so that it looks legitimate. One method is to present malicious URLs as named hyperlinks on websites. Another method is to use misspelled URLs that look like a legitimate URL, for example ghoogle.com. A variant of typosquatting that is much harder to recognize than the link manipulation methods above is called IDN spoofing, in which the attackers use a non-English character that looks exactly like an English one, for example a Cyrillic "c" or "a" instead of its English counterpart [3].

B. Filter evasion
Phishers render the content of their website as images or use Adobe Flash, making it difficult for some phishing detection methods to analyze the page. Detecting this kind of attack requires optical character recognition [4].

C. Website forgery
In this type of attack, phishing happens on a legitimate website by manipulating the target website's JavaScript code. These attacks, also known as cross-site scripting, are very hard to detect because the victim is using the legitimate website.

D. Covert redirect
This attack targets websites using the OAuth 2.0 and OpenID protocols. While trying to grant token access to a legitimate website, users hand their token to a malicious service. However, this method has not gained much attention due to its low significance [5].

E. Social engineering
This type of phishing is carried out through social interaction. It uses psychological tricks to deceive users into giving away security information, and it happens in multiple steps. First, the phisher investigates the potential weak points of the target required for the attack. Then the phisher tries to gain the target's trust and, at last, creates a situation in which the target reveals important information. Social engineering phishing methods include baiting, scareware, pretexting, and spear phishing [6].

III. PHISHING DETECTION APPROACHES: AN OVERVIEW

Various methods have been proposed to avert phishing attacks at each level of the attack flow. Some of these methods require training users to be prepared for future attacks, and some work automatically and warn the user. These methods can be listed as follows:
• User training
• Software detection

A. User training
Educating users and company employees and warning them about phishing attacks helps prevent phishing attacks, and multiple methods have been proposed for training users. Many studies have concluded that the most impactful way to help users distinguish between phishing and legitimate websites is interactive teaching [7] [8]. Although user training is an effective method, human errors still occur and people are prone to forget their training. Training also requires a significant amount of time, and it is not much appreciated by non-technical users [9].

B. Software detection
Although user training can prevent some phishing attacks, we are bombarded every day by hundreds of websites, so applying our training to each website is a cumbersome and sometimes impractical task. An alternative for detecting phishing websites is to use software. Software can analyze multiple factors such as the content of the website, the email message, the URL, and many other features before making its final decision, which is more reliable than humans. The software methods proposed for phishing detection can be categorized as follows:
1) List-based approach: One of the widely used methods for phishing detection is blacklist-based anti-phishing, which is integrated into web browsers. These methods use two types of lists: the whitelist, which contains the names of valid websites, and the blacklist, which keeps a record of malicious websites. Usually, the blacklist is obtained either through user feedback or through third-party reports created using another phishing detection scheme. Some studies have shown that blacklist-based anti-phishing approaches can detect 90 percent of malicious websites at the time of the initial check [10].
2) Visual similarity-based approach: One of the main reasons people are tricked into believing that they are using a legitimate website, when in reality they are filling in a form on a malicious one, is that the phishing website's appearance is nearly identical to the targeted legitimate website. Some methods exploit visual similarities by analyzing the text content, text format, HTML, CSS, and images of web pages to identify phishing websites [11] [12]. Chen et al. [13] also proposed discriminative keypoint features that treat phishing detection as an image matching problem. Visual similarity-based approaches have their limitations; for example, methods that use the
content of a website will fail to detect websites that use images instead of text, and methods that use image matching are very time-consuming, and it is hard to gather enough data for them [14].
3) Heuristics and machine learning based: Machine learning methods have proved to be a powerful tool for classifying malicious activities or artifacts such as spam emails or phishing websites. Most of these methods require training data; fortunately, there are many phishing website samples available to train a machine learning model. Some machine learning methods use vision techniques by analyzing a snapshot of a website [15], and some use the content and features of the website for phishing detection. Multiple machine learning methods have been used to detect phishing websites, including logistic regression, decision tree, random forest, Ada-Boost, SVM, KNN, neural networks, gradient boosting, and XGBoost, which are described in the following section.

In a recent study [16] on phishing, the authors emphasized that whenever new solutions are proposed to overcome various phishing attacks, attackers evolve their methods to bypass the newly proposed defense. Therefore, the use of hybrid models and machine learning-based methods is highly recommended. In this paper, we use machine learning-based classifiers to detect phishing websites.

Fig. 2. An Overview of phishing detection approaches

IV. MACHINE LEARNING APPROACH

Machine learning provides simplified and efficient methods for data analysis and has recently shown promising results in real-time classification problems. The key advantage of machine learning is the ability to create flexible models for specific tasks like phishing detection. Since phishing detection is a classification problem, machine learning models can be used as a powerful tool; they can adapt quickly to changes in order to identify patterns of fraudulent behavior, which helps develop a learning-based identification system. Most of the machine learning models discussed here are supervised: an algorithm tries to learn a function that maps an input to an output based on example input-output pairs, inferring the function from labeled training data consisting of a set of training examples. Below we present the machine learning methods used in our study.

A. Logistic Regression
Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, Logistic Regression transforms its output using the logistic sigmoid function to return a probability value that can then be mapped to two or more discrete classes. Logistic regression works well when the relationship in the data is almost linear, but it performs poorly when there are complex nonlinear relationships between variables. It also requires more statistical assumptions than other techniques.

B. K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the simplest algorithms used in machine learning for regression and classification problems; it is non-parametric and lazy, so no assumption about the underlying data distribution is needed. The KNN algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. The similarity between records can be measured in many different ways. Once the neighbors are discovered, a prediction is made by returning the most common outcome or by taking the average; as such, KNN can be used for classification or regression problems. There is no model to speak of other than holding the entire training dataset.

C. Support Vector Machine
Support vector machines (SVMs) are among the most popular classifiers. The idea behind SVM is to find the closest points between two classes and separate the classes with the maximum margin. This technique is a supervised learning model used for linear and nonlinear classification; nonlinear classification is performed using a kernel function that maps the input to a higher-dimensional feature space. Although SVMs are very powerful and commonly used in classification, they have some weaknesses: they are computationally expensive to train, and they are sensitive to noisy data and therefore prone to overfitting. The four common SVM kernel functions, linear, RBF (radial basis function), sigmoid, and polynomial, are listed in Table I. Each kernel function has particular parameters that must be optimized to obtain the best result.
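As a concrete companion to the kernel descriptions above, the four kernels can be written directly as plain-Python functions. This is a minimal sketch: the parameter values gamma = 0.5, r = 1.0, and d = 3 are arbitrary illustrations, not the tuned values used in the experiments.

```python
import math

def dot(x, y):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # K(x_n, x_i) = (x_n . x_i)
    return dot(x, y)

def rbf_kernel(x, y, gamma=0.5):
    # K(x_n, x_i) = exp(-gamma * ||x_n - x_i||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=0.5, r=1.0):
    # K(x_n, x_i) = tanh(gamma * (x_n . x_i) + r)
    return math.tanh(gamma * dot(x, y) + r)

def poly_kernel(x, y, gamma=0.5, r=1.0, d=3):
    # K(x_n, x_i) = (gamma * (x_n . x_i) + r)^d
    return (gamma * dot(x, y) + r) ** d

x, y = [1.0, -1.0, 1.0], [1.0, 1.0, -1.0]
print(linear_kernel(x, y))  # -1.0
print(rbf_kernel(x, y))     # exp(-4) ~ 0.0183
```

In a real SVM the kernel is evaluated between every pair of training points to build the Gram matrix; the regularization parameter C enters the optimization problem itself, not the kernel formula.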
TABLE I
FOUR COMMON KERNELS [17]

Kernel Type   Formula                                  Parameters
Linear        K(x_n, x_i) = (x_n · x_i)                C
RBF           K(x_n, x_i) = exp(−γ‖x_n − x_i‖²)        C, γ
Sigmoid       K(x_n, x_i) = tanh(γ(x_n · x_i) + r)     C, γ, r
Polynomial    K(x_n, x_i) = (γ(x_n · x_i) + r)^d       C, γ, r, d

D. Decision Tree
Decision tree classifiers are a well-known classification technique. A decision tree is a flowchart-like tree structure where an internal node represents a feature or attribute, a branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data based on attribute values, splitting in a recursive manner called recursive partitioning. This gives the tree classifier the resolution to deal with a variety of data sets, whether numerical or categorical. Decision trees are also well suited to nonlinear relationships between attributes and classes. Typically, an impurity function is used to assess the quality of the split at each node, with the Gini impurity index a common criterion for overall performance. In practice, the decision tree is flexible in the sense that it can easily model nonlinear or unconventional relationships and can capture interactions between predictors. It is also very interpretable because of its binary structure. However, the decision tree has drawbacks: it tends to overfit the data, and updating a decision tree with new samples is difficult.

E. Random Forest
Random Forest, as its name implies, contains a large number of individual decision trees that act as a group to decide the output. Each tree in a random forest produces a class prediction, and the result is the class predicted most often among the trees. The reason for the surprisingly good results of Random Forest is that the trees protect each other from their individual errors: although some trees may predict the wrong answer, many other trees will rectify the final prediction, so as a group the trees move in the right direction. Random Forests reduce overfitting by combining many weak learners that underfit because they only utilize a subset of all training samples. Random Forests can handle a large number of variables in a data set; during the forest construction process, they also produce an unbiased estimate of the generalization error, and they can estimate missing data well. The main drawback of Random Forests is the lack of reproducibility, because the forest construction process is random. It is also difficult to interpret the final model and its results, because it involves many independent decision trees [18].

F. Ada-Boost
In some respects Ada-Boost is like Random Forest: the Ada-Boost classifier, like Random Forest, groups weak classification models to form a strong classifier. A single model may categorize objects poorly, but if we combine several classifiers, selecting a set of samples in each iteration and assigning enough weight to the final vote, the overall classification can be good. Trees are created sequentially as weak learners, and incorrectly predicted samples are corrected by assigning them a larger weight after each round of prediction; the model learns from previous errors. The final prediction is the weighted majority vote (or the weighted median in regression problems). In short, the Ada-Boost algorithm iterates, selecting the training set based on the accuracy of the previous training, and the weight of each classifier trained in each iteration depends on the accuracy obtained by the previous ones [19].

G. Gradient Boosting
Gradient Boosting trains many models incrementally and sequentially. The main difference between Ada-Boost and the Gradient Boosting algorithm is how they identify the shortcomings of weak learners such as decision trees. While the Ada-Boost model identifies shortcomings by up-weighting hard data points, Gradient Boosting does so using the gradients of a loss function. The loss function measures how good the model's coefficients are at fitting the underlying data, and the appropriate choice of loss function depends on what we are trying to optimize [20].

H. XGBoost
XGBoost is a refined and customized version of Gradient Boosting that provides better performance and speed. The most important factor behind the success of XGBoost is its scalability in all scenarios: XGBoost runs more than ten times faster than popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important algorithmic optimizations. These innovations include a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing make learning faster, which enables quicker model exploration. Moreover, XGBoost exploits out-of-core computation, enabling data scientists to process hundreds of millions of examples on a desktop. Finally, these techniques can be combined into an end-to-end system that scales to even larger data with minimal cluster resources [21].

I. Artificial Neural Networks
Artificial neural networks (ANNs) are a learning model roughly inspired by biological neural networks. These models are multilayered, each layer containing several processing units called neurons. Each neuron receives its input from
its adjacent layers and computes its output with the help of its weights and a non-linear function called the activation function. In feed-forward neural networks, as in Figure 3, data flows from the first layer to the last layer. Different layers may perform different transformations on their input. The weights of the neurons are set randomly at the start of training and are gradually adjusted with the help of the gradient descent method to get close to the optimal solution. The power of neural networks is due to the non-linearity of the hidden nodes; as a result, introducing non-linearity in the network is essential for learning complex functions [22].

Fig. 3. Artificial Neural Network

V. DATA SET DESCRIPTION

One of the main challenges in our research was the scarcity of phishing datasets. Although many scientific papers about phishing detection have been published, they have not provided the datasets used in their research. Moreover, another factor that hinders finding a desirable dataset is the lack of a standard feature set for recording the characteristics of a phishing website. The dataset we used in our research has been well researched and benchmarked by other researchers. Fortunately, the accompanying wiki of the dataset comes with a data description document that discusses the data generation strategies taken by the authors of the dataset [23]. To update our dataset with new phishing websites, we also implemented code that extracts the features of new phishing websites provided by the PhishTank website. The dataset contains about 11,000 sample websites; we used 10% of the samples in the testing phase. Each website is marked as either legitimate or phishing. The features of our dataset are as follows:
1) Having IP Address: If an IP address is used instead of the domain name in the URL, such as http://217.102.24.235/sample.html.
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: Links to a webpage that has a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding it, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL path, which means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel they are dealing with a legitimate webpage, for example http://www.Confirme-paypal.com.
7) Having Sub Domain: Having a subdomain in the URL.
8) SSL State: Whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from a domain other than that shown in the address bar, the webpage is likely a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is much better to open only the ports that are needed. Several firewalls, proxies, and Network Address Translation (NAT) servers will, by default, block all or most of the ports and open only the selected ones.
12) HTTPS Token: Having a deceiving https token in the URL, for example http://https-www-mellat-phish.ir.
13) Request URL: Examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly like Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document, <Script> tags to create client-side scripts, and <Link> tags to retrieve other web resources.
16) Server Form Handler: If the domain name in the SFH is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his email.
18) Abnormal URL: Extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.
19) Website Redirect Count: If the website is redirected more than four times.
20) Status Bar Customization: Using JavaScript to show a fake URL in the status bar to users.
21) Disabling Right Click: Treated exactly like using onMouseOver to hide the link.
22) Using Pop-up Window: Showing pop-up windows on the webpage.
23) IFrame: IFrame is an HTML tag used to display an additional webpage inside the one currently shown.
24) Age of Domain: If the age of the domain is less than a month.
25) DNS Record: Whether the website has a DNS record.
26) Web Traffic: This feature measures the popularity of the website by determining the number of visitors.
27) Page Rank: PageRank is a value ranging from 0 to 1 that aims to measure how important a webpage is on the Internet.
28) Google Index: This feature examines whether a website is in Google's index or not.
29) Links Pointing To Page: The number of links pointing to the web page.
30) Statistical Report: Whether the IP belongs to the top phishing IPs.

TABLE II
DESCRIPTION OF DATASET

features                  mean      std
Having IP Address         0.3137    0.9495
URL Length               -0.6331    0.7660
Shortening Service        0.7387    0.6739
Having @ Symbol           0.7005    0.7135
Double Slash Redirecting  0.7414    0.6710
Prefix Suffix            -0.7349    0.6781
Having Sub Domain         0.0639    0.8175
SSL Final State           0.2509    0.9118
Domain Reg Length        -0.3367    0.9416
Favicon                   0.6285    0.7777
Port                      0.7282    0.6853
HTTPS Token               0.6750    0.7377
Request URL               0.1867    0.9824
URL of Anchor            -0.0765    0.7151
Links in Tags            -0.1181    0.7639
SFH                      -0.5957    0.7591
Submitting To Email       0.6356    0.7720
Abnormal URL              0.7052    0.7089
Website Redirect Count    0.1156    0.3198
On Mouse Over             0.7620    0.6474
Right Click               0.9138    0.4059
PopUp Window              0.6133    0.7898
IFrame                    0.8169    0.5767
Age of Domain             0.0612    0.9981
DNS Record                0.3771    0.9262
Web Traffic               0.2872    0.8277
Page Rank                -0.4836    0.8752
Google Index              0.7215    0.6923
Links Pointing to Page    0.3440    0.5699
Statistical Report        0.7195    0.6944
Result                    0.1138    0.9935

Fig. 4. Correlation of features in the dataset

VI. EVALUATION METRICS

For evaluating phishing classification performance we use accuracy (acc), recall (r), precision (p), F1 score, test time, and train time of the classifiers. Recall measures the percentage of phishing websites that the model manages to detect (the model's effectiveness). Precision measures the degree to which the websites detected as phishing are indeed phishing (the model's safety). The F1 score is the harmonic mean of precision and recall. Let N_{L→L} be the number of legitimate websites classified as legitimate, N_{L→P} the number of legitimate websites misclassified as phishing, N_{P→L} the number of phishing websites misclassified as legitimate, and N_{P→P} the number of phishing websites classified as phishing. Then the following equations hold:

acc = (N_{L→L} + N_{P→P}) / (N_{L→L} + N_{L→P} + N_{P→L} + N_{P→P})    (1)

r = N_{P→P} / (N_{P→L} + N_{P→P})    (2)

p = N_{P→P} / (N_{L→P} + N_{P→P})    (3)

F1 = 2pr / (p + r)    (4)

VII. EXPERIMENTAL RESULTS

In our experiments, we used 10-fold cross-validation for model performance evaluation: we divided the data set into 10 sub-samples; one sub-sample is used as the test data and the rest are used for training the models. Since phishing detection is a classification problem, we use a binary classification model, with "-1" denoting a phishing sample and "1" a legitimate one.
In our study, we used various machine learning models for detecting phishing websites: logistic regression, Ada-Boost, random forest, KNN, neural networks, SVM, gradient boosting, and XGBoost. We evaluated the accuracy, precision, recall, F1 score, training time, and testing time of these models, and we used different methods of feature selection and hyperparameter tuning to obtain the best results. Table III shows the comparison between the accuracy, precision, recall, and F1 score of these models.
To find the best performance from the support vector machine we tested four kinds of kernel:
• Linear kernel
• Polynomial kernel
• Sigmoid kernel
• RBF kernel
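Equations (1)-(4) from Section VI translate directly into code. A minimal sketch from the four outcome counts follows; the counts used in the example are made-up illustration values, not our experimental results.

```python
def metrics(n_ll, n_lp, n_pl, n_pp):
    """Compute the metrics of Eqs. (1)-(4) from the four outcome counts:
    n_ll legit->legit, n_lp legit->phish (false alarm),
    n_pl phish->legit (miss), n_pp phish->phish (detection)."""
    acc = (n_ll + n_pp) / (n_ll + n_lp + n_pl + n_pp)  # Eq. (1)
    r = n_pp / (n_pl + n_pp)                           # Eq. (2), recall
    p = n_pp / (n_lp + n_pp)                           # Eq. (3), precision
    f1 = 2 * p * r / (p + r)                           # Eq. (4)
    return acc, r, p, f1

# Illustrative counts only.
acc, r, p, f1 = metrics(n_ll=450, n_lp=50, n_pl=20, n_pp=480)
print(round(acc, 3), round(r, 3), round(p, 3), round(f1, 3))  # 0.93 0.96 0.906 0.932
```

Because phishing is labeled as the positive class, recall penalizes missed phishing sites while precision penalizes false alarms on legitimate sites, matching the effectiveness/safety reading given above.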
TABLE III
CLASSIFICATION RESULTS FOR DIFFERENT METHODS

classifier           train time (s)  test time (s)  accuracy  recall    precision  F1 score
logistic regression  0.080971        0.006414       0.926550  0.943968  0.925700   0.934704
decision tree        0.021452        0.003737       0.965988  0.971414  0.967681   0.969531
random forest        0.436126        0.021941       0.972682  0.981484  0.969852   0.975622
ada booster          0.336519        0.016766       0.936953  0.954362  0.933943   0.944032
KNN                  0.112972        0.353562       0.952780  0.962968  0.952783   0.957827
neural network       9.088517        0.006925       0.969879  0.978723  0.967605   0.973112
SVM linear           1.647538        0.053979       0.927726  0.945592  0.926268   0.935779
SVM poly             1.048257        0.074207       0.949254  0.968816  0.941779   0.955083
SVM rbf              1.341540        0.103329       0.952149  0.968815  0.946580   0.957543
SVM sigmoid          1.344607        0.109696       0.827498  0.846515  0.844311   0.845305
gradient boosting    0.891888        0.005298       0.948621  0.962481  0.946234   0.954260
XGBoost              0.506072        0.006237       0.983235  0.981047  0.987235   0.976802
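The 10-fold cross-validation protocol behind these results can be sketched as a plain index partition. This is a minimal sketch with no shuffling or stratification shown; the paper does not specify its exact splitting implementation.

```python
def k_fold_indices(n_samples, k=10):
    """Partition sample indices into k roughly equal folds; each fold
    serves once as the test set while the other folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        train_idx = [j for f in folds if f is not test_idx for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(100, k=10))
print(len(splits))        # 10 train/test splits
print(len(splits[0][1]))  # 10 test samples per fold
```

Per-fold metrics are then averaged over the 10 splits, so every sample is used for testing exactly once and for training nine times.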

In our experience, the linear, polynomial, and RBF kernels work almost equally well on this dataset, but we get the best performance from the RBF kernel. The choice of kernel and regularization parameters can be optimized with cross-validation model selection. With more than a few hyperparameters to tune, however, automated model selection is likely to result in severe over-fitting, due to the variance of the model selection criterion. In the absence of expert knowledge, the RBF kernel makes a good default when the problem requires a non-linear classifier. The performance of SVM with the different kernels is presented in Figure 5.

In KNN classification, we found that the best performance is acquired when we set k to 5. There is no single optimal value of k that suits all kinds of datasets. As the KNN results shown in Figure 6 illustrate, noise has a higher impact on the result when the number of neighbors is small; moreover, a large number of neighbors makes it computationally expensive to acquire the result. Our results also show that a small number of neighbors gives the most flexible fit, with low bias but high variance, while a large number of neighbors yields a smoother decision boundary, meaning lower variance but higher bias.
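The cross-validation model selection described above can be kept to a small grid to limit the variance of the selection criterion. A sketch with scikit-learn (the synthetic data and the particular grid values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the phishing feature matrix.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A deliberately small grid: kernel choice plus one regularization parameter.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}
pipe = make_pipeline(StandardScaler(), SVC(gamma="scale"))
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```

Scaling inside the pipeline matters for SVMs: it keeps the cross-validation folds free of leakage from test statistics.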

The main advantage of XGBoost is its speed compared to other algorithms, such as ANN and SVM, and its regularization parameter, which successfully reduces variance. Even aside from regularization, the algorithm leverages a learning rate and subsamples from the features like random forests, which increases its ability to generalize even further. However, XGBoost is more difficult to understand, visualize, and tune than AdaBoost and Random Forests, with a multitude of hyperparameters that can be adjusted to increase performance. XGBoost is a particularly interesting algorithm when speed as well as high accuracy is of the essence. Nevertheless, more resources are required to train the model, because tuning it demands more time and expertise from the user to achieve meaningful outcomes.
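The three levers named above (learning rate, row subsampling, random-forest-style feature subsampling) are exposed by xgboost's `XGBClassifier`; for a self-contained sketch we use scikit-learn's `GradientBoostingClassifier`, which offers the same knobs (this substitution, the synthetic data, and the parameter values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the phishing feature matrix.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# learning_rate shrinks each tree's contribution; subsample < 1.0 draws a random
# row sample per tree; max_features subsamples columns, as random forests do.
clf = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    subsample=0.8,
    max_features="sqrt",
    random_state=0,
)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
print(f1)
```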

Fig. 5. Performance of the SVM classifier with various kernels


As expected, the neural network's training time was considerably higher than that of the other machine learning models. XGBoost's F1 score was slightly better than the neural network's, which we attribute to our training set being small. Unlike XGBoost, the neural network is also unable to explain why it has predicted a website as a phishing one; such explainability would help us identify key features more easily. In our implementation of the neural network we use the Adam optimizer and the ReLU activation function in the hidden layers. Figure 7 shows the performance of the neural network with different numbers of hidden layers; we get the best performance with 30 hidden layers. We trained the model for up to 500 epochs with early stopping.

We found that Random Forest is highly accurate, relatively robust against noise and outliers, fast, simple to implement and understand, and able to do feature selection implicitly. Being largely unaffected by noise is its main advantage over AdaBoost. In line with the Central Limit Theorem, Random Forest reduces variance by increasing the number of trees. However, the main disadvantage of Random Forests that we faced in implementing our model was the high number of hyperparameters to tune for the best performance. Moreover, Random Forest introduces randomness into the training and testing data, which is not suitable for all datasets.
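The Random Forest behavior described here (variance reduction from averaging more trees, implicit feature selection via importances) can be sketched with scikit-learn; the synthetic data stands in for the phishing feature matrix and the tree count is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing feature matrix.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# More trees -> lower variance of the averaged prediction;
# feature_importances_ provides the implicit feature selection.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)

# Rank features by importance and keep the five strongest.
top = sorted(range(X.shape[1]),
             key=lambda i: clf.feature_importances_[i],
             reverse=True)[:5]
print(acc, top)
```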
AdaBoost is also easy to understand and to visualize. However, for noisy data the performance of AdaBoost is debated: some argue that it generalizes well, while others show that noisy data leads to poor performance because the algorithm spends too much time learning extreme cases, skewing the results. Moreover, AdaBoost is not optimized for speed, and is therefore significantly slower than Random Forests and XGBoost.
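AdaBoost's small hyperparameter surface amounts to little more than the number of weak learners and the learning rate. A sketch with scikit-learn (the synthetic data and the two parameter values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing feature matrix.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The two hyperparameters that matter most: how many weak learners to boost,
# and the learning rate that shrinks each learner's vote.
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```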
It is worth mentioning that there is no guarantee that a combination of multiple classifiers will always perform better than the best individual classifier in the ensemble. The results motivate future work to add more features to the dataset, which could improve the performance of these models, and to combine machine learning models with other phishing detection techniques, such as list-based methods, to obtain better performance. Besides, we will explore, propose, and develop a new mechanism for extracting new features from websites, to keep up with new techniques in phishing attacks.

Fig. 6. KNN with different values of K
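The sweep over k behind Figure 6 can be reproduced in outline with cross-validation. A sketch using scikit-learn on a synthetic stand-in for the phishing features (the dataset and the particular k grid are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the phishing feature matrix.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Small k -> flexible fit (low bias, high variance); large k -> smoother
# decision boundary (lower variance, higher bias) and a costlier query.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in [1, 3, 5, 9, 15, 31]
}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```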

IX. DATA AND CODE


To facilitate reproducibility of the research in this paper,
all code and data are shared in this GitHub repository:
https://github.com/fafal-abnir/phishing_detection

ACKNOWLEDGMENT
This research was supported by Smart Land co. We would like to express our special thanks to Abed Farvardin for providing us with the resources for this project, as well as to Saeed Shahrivari, who gave us the golden opportunity to work on this wonderful phishing detection project; it also led us to do a great deal of research and learn many new things, for which we are truly thankful.

Fig. 7. Neural network with different depths

VIII. CONCLUSION AND FUTURE WORK

In this research, we have implemented and evaluated twelve classifiers on the phishing website dataset, which consists of 6157 legitimate websites and 4898 phishing websites. The examined classifiers are Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, Random Forest, Neural Networks, KNN, Gradient Boosting, and XGBoost. According to our results in Table III, the ensemble classifiers, namely Random Forest and XGBoost, give very good performance in both computation time and accuracy. The main idea behind ensemble algorithms is to combine several weak learners into a stronger one; this is perhaps the primary reason why ensemble-based learning is used in practice for most classification problems. There are certain advantages and disadvantages inherent to the AdaBoost algorithm. AdaBoost is relatively robust to overfitting in datasets with low noise [?]. AdaBoost has only a few hyperparameters that need to be tuned to improve model performance.

REFERENCES

[1] FBI, "IC3 annual report released."
[2] APWG, "Phishing activity trends report."
[3] V. B. et al., "Study on phishing attacks," International Journal of Computer Applications, 2018.
[4] I.-F. Lam, W.-C. Xiao, S.-C. Wang, and K.-T. Chen, "Counteracting phishing page polymorphism: An image layout analysis approach," in International Conference on Information Security and Assurance, pp. 270–279, Springer, 2009.
[5] W. Jing, "Covert redirect vulnerability," 2017.
[6] K. Krombholz, H. Hobel, M. Huber, and E. Weippl, "Advanced social engineering attacks," Journal of Information Security and Applications, vol. 22, pp. 113–122, 2015.
[7] P. Kumaraguru, J. Cranshaw, A. Acquisti, L. Cranor, J. Hong, M. A. Blair, and T. Pham, "School of phish: a real-world evaluation of anti-phishing training," in Proceedings of the 5th Symposium on Usable Privacy and Security, pp. 1–12, 2009.
[8] R. C. Dodge Jr, C. Carver, and A. J. Ferguson, "Phishing for user security awareness," Computers & Security, vol. 26, no. 1, pp. 73–80, 2007.
[9] R. Dhamija, J. D. Tygar, and M. Hearst, "Why phishing works," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 581–590, 2006.
[10] C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, "On the effectiveness of techniques to detect phishing sites," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 20–39, Springer, 2007.
[11] A. P. Rosiello, E. Kirda, F. Ferrandi, et al., "A layout-similarity-based approach for detecting phishing pages," in 2007 Third International Conference on Security and Privacy in Communications Networks and the Workshops–SecureComm 2007, pp. 454–463, IEEE, 2007.
[12] S. Afroz and R. Greenstadt, "PhishZoo: Detecting phishing websites by looking at them," in 2011 IEEE Fifth International Conference on Semantic Computing, pp. 368–375, IEEE, 2011.
[13] K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, "Fighting phishing with discriminative keypoint features," IEEE Internet Computing, vol. 13, no. 3, pp. 56–63, 2009.
[14] A. K. Jain and B. B. Gupta, "Phishing detection: Analysis of visual similarity based approaches," Security and Communication Networks, vol. 2017, 2017.
[15] R. S. Rao and S. T. Ali, "A computer vision technique to detect phishing attacks," in 2015 Fifth International Conference on Communication Systems and Network Technologies, pp. 596–601, IEEE, 2015.
[16] B. B. Gupta, N. A. Arachchilage, and K. E. Psannis, "Defending against phishing attacks: taxonomy of methods, current issues and future directions," Telecommunication Systems, vol. 67, no. 2, pp. 247–267, 2018.
[17] A. Karatzoglou, D. Meyer, and K. Hornik, "Support vector machines in R," Journal of Statistical Software, vol. 15, no. 9, pp. 1–28, 2006.
[18] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[19] T. Hastie, S. Rosset, J. Zhu, and H. Zou, "Multi-class AdaBoost," Statistics and its Interface, vol. 2, no. 3, pp. 349–360, 2009.
[20] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[21] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
[22] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[23] R. M. Mohammad, F. Thabtah, and L. McCluskey, "Phishing websites features," School of Computing and Engineering, University of Huddersfield, 2015.
