Phishing Detection Using Machine Learning Techniques
Abstract—The Internet has become an indispensable part of our life. However, it has also provided opportunities to anonymously perform malicious activities such as phishing. Phishers try to deceive their victims through social engineering or by creating mock-up websites to steal information such as account IDs, usernames, and passwords from individuals and organizations. Although many methods have been proposed to detect phishing websites, phishers have evolved their methods to escape these detection methods. One of the most successful methods for detecting these malicious activities is machine learning, because most phishing attacks share common characteristics that machine learning methods can identify. In this paper, we compare the results of multiple machine learning methods for predicting phishing websites.

Index Terms—Phishing, Classification, Cybercrime, Machine learning

I. INTRODUCTION

Phishing is a kind of cybercrime that tries to obtain important or confidential information from users, usually by creating a counterfeit website that mimics a legitimate website. Phishing attacks employ a variety of techniques such as link manipulation, filter evasion, website forgery, covert redirect, and social engineering. The most common approach is to set up a spoofed web page that imitates a legitimate website. These types of attacks were top concerns in the 2018 Internet Crime Report, issued by the U.S. Federal Bureau of Investigation's Internet Crime Complaint Center (IC3). The statistics gathered by the FBI's IC3 for 2018 showed that internet-based theft, fraud, and exploitation remain pervasive and were responsible for a staggering $2.7 billion in financial losses in 2018. In that year, the IC3 received 20,373 complaints about business email compromise (BEC) and email account compromise (EAC), with losses of more than $1.2 billion [1]. The report notes that the number of these sophisticated attacks has grown in recent years. The Anti-Phishing Working Group (APWG) also emphasizes that phishing attacks have grown in recent years. Figure 1 illustrates the total number of phishing sites detected by APWG in the last quarter of 2019 and the first quarter of 2020. This number grew gradually, rising from 162,155 in the last quarter of 2019 to 165,772 in the first quarter of 2020. Phishing has caused severe damage to many organizations and the global economy. In the fourth quarter of 2019, APWG member OpSec Security found that SaaS and webmail sites remained the most frequent targets of phishing attacks. Phishers continue to harvest credentials from these targets by operating BEC and subsequently gain access to corporate SaaS accounts [2].

Fig. 1. Total number of phishing websites detected by APWG [2]

Many approaches have been used to filter out phishing websites. Each of these methods is applicable at a different stage of the attack flow, for example, network-level protection, authentication, client-side tools, user education, server-side filters, and classifiers. Although every type of phishing attack has some unique features, most of these attacks exhibit similarities and patterns. Since machine learning methods have proved to be a powerful tool for detecting patterns in data, they have made it possible to detect some of the common phishing traits and thereby recognize phishing websites. In this paper, we provide a comparative and analytical evaluation of different machine learning methods for detecting phishing websites. The machine learning methods that we studied are Logistic Regression, Decision Tree, Random Forest, AdaBoost, Support Vector Machine, KNN, Artificial
Neural Networks, Gradient Boosting, and XGBoost. The rest of this paper is organized as follows: in Section II we list some widely used phishing techniques; in Section III we discuss different types of phishing and phishing attack prevention methods; in Section IV we provide an overview of different machine learning methods for phishing detection; in Section V we illustrate the features of our dataset; in Sections VI and VII we show evaluation results of the suggested machine learning methods; and finally, in Section VIII, we draw conclusions and discuss future work.

II. PHISHING TECHNIQUES

In this section, we discuss some well-known phishing approaches used by criminals to deceive people.

A. Link manipulation
Phishing is mainly about links. There are some clever ways to manipulate a URL to make it look like a legitimate URL. One method is to present malicious URLs on websites as named hyperlinks, so that the displayed text hides the actual target. Another method is to use misspelled URLs that look like a legitimate URL, for example ghoogle.com. A variant of typosquatting that is much harder to recognize than the link manipulation methods mentioned above is IDN spoofing, in which attackers use a character from a non-English alphabet that looks exactly like an English character, for example a Cyrillic "c" or "a" instead of its English counterpart [3].

B. Filter evasion
Phishers show the content of their website in pictures or use Adobe Flash, which makes it difficult for some phishing detection methods to detect. Countering this kind of attack requires optical character recognition [4].

C. Website forgery
In this type of attack, phishing happens on a legitimate website by manipulating the target website's JavaScript code. These attacks, also known as cross-site scripting, are very hard to detect because the victim is using the legitimate website.

D. Covert redirect
This attack targets websites that use the OAuth 2.0 and OpenID protocols. While trying to grant token access to a legitimate website, users give their token to a malicious service. However, this method did not gain much attention due to its low significance [5].

E. Social engineering
This type of phishing is carried out through social interaction. It uses psychological tricks to deceive users into giving away security information. This type of attack happens in multiple steps. First, the phisher investigates the potential weak points of the targets required for the attack. Then, the phisher tries to gain the target's trust and, finally, creates a situation in which the target reveals important information. Social engineering phishing methods include baiting, scareware, pretexting, and spear phishing [6].

III. PHISHING DETECTION APPROACHES: AN OVERVIEW

Various methods have been proposed to avert phishing attacks at each level of the attack flow. Some of these methods require training users to be prepared for future attacks, and some work automatically and warn the user. These methods can be listed as follows:
• User training
• Software detection

A. User training
Educating users and company employees and warning them about phishing attacks helps prevent phishing attacks. Multiple methods have been proposed for training users. Many studies concluded that the most impactful approach to helping users distinguish between phishing and legitimate websites is interactive teaching [7] [8]. Although user training is an effective method, human errors still exist and people are prone to forget their training. Training also requires a significant amount of time and is not much appreciated by non-technical users [9].

B. Software detection
Although user training can prevent some phishing attacks, we are bombarded every day by hundreds of websites, so applying our training to each website is a cumbersome and sometimes impractical task. An alternative for detecting phishing websites is to use software. Software can analyze multiple factors, such as the content of the website, the email message, the URL, and many other features, before making its final decision, which makes it more reliable than humans. Multiple software methods have been proposed for phishing detection, which are categorized as follows:

1) List-based approach: One of the widely used methods for phishing detection is blacklist-based anti-phishing, which is integrated into web browsers. These methods use two types of lists: the whitelist, which contains the names of valid websites, and the blacklist, which keeps a record of malicious websites. Usually, the blacklist is obtained either through user feedback or through third-party reports created using another phishing detection scheme. Some studies have shown that blacklist-based anti-phishing approaches can detect 90 percent of malicious websites at the time of the initial check [10].

2) Visual similarity-based approach: One of the main reasons people are tricked into believing that they are using a legitimate website, when in reality they are filling in a form on a malicious website, is that the phishing website's appearance is nearly identical to that of the targeted legitimate website. Some methods use visual similarity, analyzing the text content, text format, HTML, CSS, and images of web pages to identify phishing websites [11] [12]. Chen et al. [13] also proposed discriminative keypoint features that treat phishing detection as an image matching problem. Visual similarity-based approaches have their limitations, for example, methods that use the
content of a website will fail to detect websites that use images instead of text. Methods that use image matching are very time-consuming, and it is hard to gather enough data for them [14].

3) Heuristics and machine learning-based approach: Machine learning methods have proved to be a powerful tool for classifying malicious activities or artifacts such as spam emails or phishing websites. Most of these methods require training data; fortunately, there are many phishing website samples available to train a machine learning model. Some machine learning methods use vision techniques, analyzing a snapshot of a website [15], and some use the content and features of the website for phishing detection. Multiple machine learning methods have been used to detect phishing websites, including logistic regression, decision tree, random forest, AdaBoost, SVM, KNN, neural networks, gradient boosting, and XGBoost, which are described in the following section.

In a recent study [16] on phishing, the authors emphasized that when new solutions are proposed to overcome various phishing attacks, attackers evolve their methods to bypass the newly proposed detection methods. Therefore, the use of hybrid models and machine learning-based methods is highly recommended. In this paper, we use machine learning-based classifiers for detecting phishing websites.

[...] specific tasks like phishing detection. Since phishing detection is a classification problem, machine learning models can be used as a powerful tool. Machine learning models can adapt to changes quickly to identify patterns of fraudulent transactions, which helps in developing a learning-based identification system. Most of the machine learning models discussed here are classified as supervised machine learning, in which an algorithm tries to learn a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. Below, we present the machine learning methods that we used in our study.

A. Logistic Regression
Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, Logistic Regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. Logistic regression works well when the relationship in the data is almost linear, but it performs poorly when there are complex nonlinear relationships between variables. Besides, it relies on more statistical assumptions than other techniques.

B. K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the simplest algorithms used in machine learning for regression and classification problems. It is non-parametric and lazy: KNN makes no assumption about the underlying data distribution. The KNN algorithm uses feature similarity to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. The similarity between records can be measured in many different ways. Once the neighbors are found, the prediction can be made by returning the most common outcome or taking the average. As such, KNN can be used for classification or regression problems. There is no model to speak of other than the stored training dataset.
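The prediction step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: Euclidean distance is one of the many possible similarity measures, the function names are hypothetical, and the toy feature vectors are invented for the example (labels follow the paper's convention of -1 for phishing and 1 for legitimate).

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    return [train_y[i] for i in order[:k]]

def knn_classify(train_X, train_y, query, k=5):
    """Classification: the most common label among the k neighbors."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=5):
    """Regression: the average target of the k neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Hypothetical toy data: two features per site; -1 = phishing, 1 = legitimate.
train_X = [[1, 1], [1, 0.9], [0.9, 1], [0, 0], [0.1, 0], [0, 0.1]]
train_y = [-1, -1, -1, 1, 1, 1]

label = knn_classify(train_X, train_y, [0.8, 0.9], k=3)
score = knn_regress(train_X, train_y, [0.8, 0.9], k=3)
```

Note that nothing is fitted in advance: every prediction scans the whole training set, which is consistent with KNN having a short training time and a comparatively long test time.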
Fig. 4. Correlation of features in datasets

VI. EVALUATION METRICS

For evaluating phishing classification performance, we use the accuracy (acc), recall (r), precision (p), F1 score, test time, and train time of the classifiers. Recall measures the percentage of phishing websites that the model manages to detect (the model's effectiveness). Precision measures the degree to which the websites detected as phishing are indeed phishing (the model's safety). The F1 score is the harmonic mean of precision and recall. Let $N_{L \to L}$ be the number of legitimate websites classified as legitimate, $N_{L \to P}$ the number of legitimate websites misclassified as phishing, $N_{P \to L}$ the number of phishing websites misclassified as legitimate, and $N_{P \to P}$ the number of phishing websites classified as phishing. Then the following equations hold:

$$acc = \frac{N_{L \to L} + N_{P \to P}}{N_{L \to L} + N_{L \to P} + N_{P \to L} + N_{P \to P}} \tag{1}$$

$$r = \frac{N_{P \to P}}{N_{P \to L} + N_{P \to P}} \tag{2}$$

$$p = \frac{N_{P \to P}}{N_{L \to P} + N_{P \to P}} \tag{3}$$

$$F1 = \frac{2pr}{p + r} \tag{4}$$

VII. EXPERIMENTAL RESULTS

In our experiments, we used 10-fold cross-validation for model performance evaluation. We divided the dataset into 10 sub-samples; in each round, one sub-sample is used as testing data and the rest are used for training the models. Since phishing detection is a binary classification problem, we use binary classification models, labeling a phishing sample as "-1" and a legitimate one as "1".

In our study, we used various machine learning models for detecting phishing websites: logistic regression, AdaBoost, random forest, KNN, neural networks, SVM, gradient boosting, and XGBoost. We evaluated the accuracy, precision, recall, F1 score, training time, and testing time of these models, and we used different methods of feature selection and hyperparameter tuning to get the best results. Table II shows the comparison between the accuracy, precision, recall, and F1 score of these models.

To find the best performance from the support vector machine, we tested four kinds of kernel:
• Linear kernel
• Polynomial kernel
• Sigmoid kernel
• RBF kernel
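As a rough illustration of the evaluation protocol, the following plain-Python sketch combines k-fold cross-validation with the metrics of equations (1) to (4), treating "-1" (phishing) as the positive class. Several points are assumptions rather than details taken from the experiments: the folds are built by simple striding without shuffling or stratification, the per-fold metrics are averaged, undefined ratios are reported as 0.0, and `threshold_rule` is a hypothetical stand-in for a real classifier.

```python
def metrics(y_true, y_pred, phishing=-1):
    """Equations (1)-(4), with `phishing` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    n_pp = sum(t == phishing and q == phishing for t, q in pairs)
    n_pl = sum(t == phishing and q != phishing for t, q in pairs)
    n_lp = sum(t != phishing and q == phishing for t, q in pairs)
    n_ll = sum(t != phishing and q != phishing for t, q in pairs)
    acc = (n_ll + n_pp) / (n_ll + n_lp + n_pl + n_pp)
    r = n_pp / (n_pl + n_pp) if n_pl + n_pp else 0.0
    p = n_pp / (n_lp + n_pp) if n_lp + n_pp else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"acc": acc, "recall": r, "precision": p, "f1": f1}

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def cross_validate(fit_predict, X, y, k=10):
    """Average the metrics over k folds.

    `fit_predict(train_X, train_y, test_X)` must return predicted labels.
    """
    totals = {"acc": 0.0, "recall": 0.0, "precision": 0.0, "f1": 0.0}
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        fold = metrics([y[i] for i in test], preds)
        for name in totals:
            totals[name] += fold[name]
    return {name: total / k for name, total in totals.items()}

def threshold_rule(train_X, train_y, test_X):
    """Hypothetical classifier: ignores training data, thresholds feature 0."""
    return [-1 if row[0] > 0.5 else 1 for row in test_X]

# Invented data: one binary feature that perfectly separates the classes.
X = [[(i // 10) % 2] for i in range(40)]
y = [-1 if row[0] == 1 else 1 for row in X]

single = metrics([-1, -1, -1, 1, 1, 1, 1, -1], [-1, -1, 1, 1, 1, -1, 1, -1])
averaged = cross_validate(threshold_rule, X, y, k=10)
```

In the `single` example, three phishing sites are caught, one is missed, and one legitimate site is flagged, so all four metrics equal 0.75.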
TABLE III
CLASSIFICATION RESULTS FOR DIFFERENT METHODS

classifier           train time (s)  test time (s)  accuracy  recall    precision  F1 score
logistic regression  0.080971        0.006414       0.926550  0.943968  0.925700   0.934704
decision tree        0.021452        0.003737       0.965988  0.971414  0.967681   0.969531
random forest        0.436126        0.021941       0.972682  0.981484  0.969852   0.975622
AdaBoost             0.336519        0.016766       0.936953  0.954362  0.933943   0.944032
KNN                  0.112972        0.353562       0.952780  0.962968  0.952783   0.957827
neural network       9.088517        0.006925       0.969879  0.978723  0.967605   0.973112
SVM linear           1.647538        0.053979       0.927726  0.945592  0.926268   0.935779
SVM poly             1.048257        0.074207       0.949254  0.968816  0.941779   0.955083
SVM rbf              1.341540        0.103329       0.952149  0.968815  0.946580   0.957543
SVM sigmoid          1.344607        0.109696       0.827498  0.846515  0.844311   0.845305
gradient boosting    0.891888        0.005298       0.948621  0.962481  0.946234   0.954260
XGBoost              0.506072        0.006237       0.983235  0.981047  0.987235   0.976802
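For reference, the four SVM kernels tested in this section are commonly defined as in the sketch below. The kernel hyperparameters are not reported here, so the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for illustration only.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, y) = <x, y>"""
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)

# Two arbitrary example vectors (not taken from the dataset).
x, y = [1.0, 0.0, -1.0], [1.0, 1.0, 0.0]
values = {
    "linear": linear_kernel(x, y),
    "poly": polynomial_kernel(x, y),
    "rbf": rbf_kernel(x, y),
    "sigmoid": sigmoid_kernel(x, y),
}
```

The linear kernel yields a linear decision boundary, while the other three allow non-linear boundaries in the original feature space.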
In our experience, the linear, polynomial, and RBF kernels all work well on this dataset, but we get the best performance from the RBF kernel. The choice of the kernel and regularization parameters can be optimized with cross-validation-based model selection. With more than a few hyperparameters to tune, however, automated model selection is likely to result in severe over-fitting, due to the variance of the model selection criterion. In the absence of expert knowledge, the RBF kernel makes a good default when the problem requires a non-linear classifier. Figure 5 presents the performance of the SVM with the different kernels.

In KNN classification, we found that the best performance is achieved when k is set to 5. There is no optimal value of k that suits all kinds of datasets. According to the KNN results shown in Figure 6, noise has a higher impact on the result when the number of neighbors is small; moreover, a large number of neighbors makes it computationally expensive to obtain the result. Our results also show that a small number of neighbors gives the most flexible fit, with low bias but high variance, whereas a large number of neighbors gives a smoother decision boundary, which means lower variance but higher bias.
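The effect of k described above can be reproduced on a tiny invented example. The sketch below is an assumption-laden illustration, not the experimental setup: the points are made up, one of them is deliberately mislabeled to act as noise, and Euclidean distance with majority voting is assumed.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented data: a legitimate cluster (label 1) near the origin that contains
# one mislabeled point, and a small phishing cluster (label -1) near (2, 2).
train_X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.1, 0.1],
           [0.25, 0.25],              # noisy point: labeled phishing by mistake
           [2.0, 2.0], [2.1, 1.9]]
train_y = [1, 1, 1, 1, 1, -1, -1, -1]

near_noise = [0.26, 0.24]   # sits right next to the mislabeled point
in_phishing = [2.0, 2.1]    # sits inside the phishing cluster

noisy_k1 = knn_predict(train_X, train_y, near_noise, k=1)     # follows the noise
noisy_k5 = knn_predict(train_X, train_y, near_noise, k=5)     # noise is outvoted
cluster_k1 = knn_predict(train_X, train_y, in_phishing, k=1)  # local structure kept
cluster_k8 = knn_predict(train_X, train_y, in_phishing, k=8)  # global majority wins
```

With k = 1 the single mislabeled neighbor decides the prediction (high variance), while with k equal to the size of the training set every query receives the overall majority label regardless of its position (high bias).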
ACKNOWLEDGMENT
This research was supported by Smart Land co. We would like to express our special thanks to Abed Farvardin for providing us with resources for this project, as well as to Saeed Shahrivari, who gave us the opportunity to work on this project on phishing detection, which also helped us do a lot of research and learn many new things. We are really thankful to them.