
Detection of Phishing Websites by Investigating Their Urls Using LSTM Algorithm

The paper discusses the detection of phishing websites using the Long-Short-Term Memory (LSTM) algorithm, achieving an accuracy of 94-96% while maintaining a low false positive rate. It emphasizes the importance of understanding phishing techniques and the need for effective machine learning methods to classify URLs as malicious or legitimate. The research highlights the significance of feature extraction from URLs and proposes a model to improve user authentication against phishing attacks.


IJCSNS International Journal of Computer Science and Network Security, VOL.22 No.5, May 2022 419

Detection of Phishing Websites by Investigating Their URLs using LSTM Algorithm
Barah Mohammed Alanzi¹ and Diaa Mohammed Uliyan¹
[email protected], [email protected]
¹College of Computer Science and Engineering, Department of Information and Computer Science, University of Ha’il, Ha’il, Saudi Arabia

Abstract
Phishing is a criminal mechanism that uses both social engineering and technical tricks to steal consumers' personal identity data and financial account credentials. As the number of web users increases, phishing fraud is gradually increasing. To respond effectively to the various phishing mechanisms, a proper understanding of phishing attacks is necessary, and appropriate response methods should be used. In this paper, the main aim is to detect phishing website attacks with a suggested machine learning algorithm. First, blacklisted URLs and IP addresses from antivirus sources are loaded into the database of our method; this database is known as the "blacklist". Second, because attackers evade blacklists, we need to understand the creative techniques they use to deceive users by modifying a URL so that it looks legitimate: obfuscation and many other simple techniques, including fast flux, where proxies are automatically generated to host a web page; algorithmic generation of new URLs; etc. A blacklist is a list of unsafe websites that are accused of fraud, spreading malware, or launching any other form of malicious activity. Being placed on this list is one of the biggest nightmares for website owners, because websites on the list are no longer crawled by web crawlers and no backlinks are created to them. The first step is to collect benign and phishing URLs. Then, host-based, popularity-based, and lexical feature extractions are applied to create a database of feature values. Finally, knowledge is extracted from the database using various machine learning methods. An experimental study was conducted using a deep learning algorithm, Long Short-Term Memory (LSTM). To analyse the behaviour of these deep learning architectures, extensive experiments were conducted to examine the effect of parameter tuning on the accuracy of deep learning models. The experimental results from this paper also show several issues and suggest future research directions related to deep learning in the field of phishing detection.

Keywords:
URL; Phishing website; Machine learning; LSTM; RNN.

1. Introduction
With the increase in the number of web users, phishing attacks are gradually increasing. In order to respond effectively to various phishing attacks, a proper understanding of phishing attacks is necessary, and appropriate response methods should be used.

Online URL reputation services are used to classify URLs, and the classes returned are used as a supplementary source of information that enables the system to classify URLs. The classifier achieves an accuracy of 94-96% by detecting a large number of phishing hosts while maintaining a modest false positive rate. URL groups, URL categorization, and URL grading mechanisms work in tandem to give URLs a rank [8].

Using Long Short-Term Memory (LSTM) provides an effective solution for detecting phishing sites. The LSTM algorithm can solve more complex problems than shallow learning algorithms (that is, traditional machine learning algorithms). Moreover, LSTM can store past information for a long time, whereas plain Recurrent Neural Networks (RNNs) cannot do so over long periods. LSTMs have an internal state, they are aware of the temporal structure of the inputs, and they can model parallel input sequences separately. As such, we aimed to combine the power of the LSTM algorithm into a single model and to present how to implement this integration effectively [10].

1.1 Research problem
The researcher formulated the research questions according to the purpose of the study, which are as follows:

1) How does machine learning identify phishing sites by examining URLs?
All machine learning models use features, that is, properties or attributes extracted from the input data sets, to create their models. In the context of URL classification, there are three types of features: host-based, lexical, and content-based. Host-based features describe the identity, location, and other network information of the host. Lexical features are text properties obtained from the URL itself. Finally, content-based features come from the web pages that the URLs link to. Content-based features require more in-depth analysis of the content and are computationally more expensive. There is also an inherent risk that our systems could be compromised while exploring the web pages behind the URLs we are trying to classify. The set of content-based features is outside the scope of our research, due to the associated risks and greater time requirements.
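The lexical features described above can be computed from the URL string alone, without visiting the page. The following is a minimal sketch, assuming Python's standard `urllib.parse` module; the function name, the feature names, and the choice of features are illustrative assumptions and are not the feature set used in the paper.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few text-based features from the URL string alone.

    The feature names here are hypothetical and chosen for illustration.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # A dotted-quad IPv4 address used in place of a domain name.
        "host_is_ipv4": re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None,
    }


features = lexical_features("http://192.168.0.1/login")
```

Host-based features (for example WHOIS or DNS data) would need network lookups and are deliberately left out of this sketch.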

Manuscript received May 5, 2022
Manuscript revised May 20, 2022
https://doi.org/10.22937/IJCSNS.2022.22.5.60

2) How to apply ML methods to classify malicious and legitimate websites?

The researchers examined a variety of techniques to prevent phishing attacks. Anti-phishing technologies can be divided into three categories: email-based, content-based, and URL-based detection methods. The researchers used machine learning techniques on feature sets derived from phishing emails. Other studies have investigated classifiers that use features of URLs and search engine results to achieve high detection rates while maintaining low false positive rates. Chandrasekaran et al. attempted to identify phishing emails using structural properties. These anti-phishing techniques are not always applicable because the methods require phishing emails [13].

The authors developed URL-Net, a CNN-based deep neural network for URL detection. They argued that current methods often use Bag of Words (BoW) features but suffer from some fundamental limitations, such as a failure to detect sequential patterns in the URL string, a lack of automatic feature extraction, and a failure to handle unseen features in new URLs. The authors applied character-level CNNs and word-level CNNs to learn the URL representation. In addition, they suggested advanced techniques that were effective in dealing with rare terms, a prevalent issue in malicious URL detection tasks. This approach allows URL-Net to learn embeddings and use sub-word information from unseen words during the testing phase.

The main problem addressed in this study is to enhance user authentication on a website. The research investigates the potential uses of the input models in detecting phishing URLs. In particular, the goal here is to develop a model that predicts whether a website is fraudulent or legitimate and, if so, to what degree, in order to improve the accuracy of phishing URL detection [11].

1.2 Research objectives
The goal of a phishing site is to obtain personal information without permission, either through extortion or through a fake web page that looks like the real one and asks the visitor to enter personal information. This results in information security breaches through the compromise of confidential data, where the victim may suffer financial loss or loss of assets. The attacker may additionally commit identity theft using the personal details of the victims. A phishing attack can also damage the reputation of the spoofed financial institution, as customers lose confidence that their accounts are secure and may take their business to another company. Phishing, if left unaddressed, can negatively affect an organization's assets, revenue, customer relationships, or marketing efforts, as well as the company's image. A phishing attack could cost a company hundreds of thousands of dollars per attack in terms of employee time and fraud-related loss [9].

The main objective of this paper is to analyse the performance of the LSTM algorithm in detecting phishing activities. This analysis will help organizations or individuals choose and adopt the appropriate solution according to their technology needs and specific application requirements to combat phishing attacks [24].

1.3 Importance of this research
Phishing can give attackers access to users' confidential details, which can lead to financial losses for users and even prevent them from accessing their own accounts. Therefore, in this study, we will identify and qualify phishing website features to prevent and mitigate the risks of phishing websites.

In addition, this study will make a comparative evaluation between machine learning algorithms.

Online phishing costs internet users billions of dollars annually. Scammers steal personal information and financial account details such as usernames and passwords, leaving users vulnerable in the online space.

1.4 Limitations and delimitations of the study
Since phishing takes advantage of human ignorance or naivety in interactions with electronic communication channels (such as email, HTTP, etc.), it is not always easy to solve. All suggested solutions try to reduce the impact of phishing attacks.

From a high-level perspective, there are generally two popular proposed solutions to mitigate phishing attacks:
• User education. Users are trained to improve their accuracy in identifying phishing messages correctly, and then to take appropriate action on correctly classified phishing messages, such as reporting attacks to system administrators.
• Software improvement. Software is developed to better classify phishing messages on behalf of the user, or to present information in a clearer manner.

The main disadvantages of both approaches are:
• Non-technical users resist training, so training must be continuous.
• Some software solutions still depend on the user. If people ignore the security warnings, the solution may become useless [15].

Attackers can use technical vulnerabilities to create socially engineered messages (such as using legitimate but deceptive domain names). Effective mitigation requires addressing matters on both a personal and a technical level. Since phishing attacks aim to exploit vulnerabilities of the user (i.e. end users), it is difficult to minimize them. For example, according to the assessment, end users fail to detect 27% of phishing attacks, even when trained with the best awareness program. On the other hand, software phishing detection

techniques are evaluated against phishing attacks, which makes their performance abnormal by phishing tactics [16].

2. Literature review
The researchers examined a variety of techniques for preventing phishing attacks. Anti-phishing technologies can be classified into three categories: email-based, content-based, and URL-based detection methods. The researchers used machine learning techniques on feature sets derived from phishing emails.

Several researchers have analysed the statistics of suspicious URLs in some way. Our approach borrows important ideas from previous studies. We review previous work on phishing site detection using URL features that motivated our own approach.

Bahnsen et al. (2018) [1] suggested a more effective method for real-time phishing URL detection. They noted that many anti-phishing methods are appearing, but scammers use diverse and dynamic methods to deceive victims, so a smart and flexible model is needed to catch phishing URLs. Data mining methods can be used to build an effective model that extracts basic and non-trivial knowledge from huge data sets, using classification algorithms to identify legitimate URLs. Four different classification algorithms were used to classify the data set and were compared on performance, accuracy, and several other criteria. The experiments were carried out using four different rule-based algorithms to discover hidden knowledge in the huge data set in order to predict phishing URLs. The results were compared in terms of accuracy, error rate, time taken, and the total number of component criteria. The results showed that all the selected algorithms achieved a high prediction rate. The rules that were developed demonstrated the interaction and relationship between URL features, which can help build frameworks for detecting phishing URLs. The resulting phishing detection model is useful for preventing users from being deceived into sending private information.

Preethi and Velmayil (2019) [2] suggested a method for analysing phishing URLs using lexical analysis. They proposed the Pre-Phish algorithm, an automated machine learning approach that distinguishes phishing from non-phishing URLs to produce reliable results.

Phishing URLs often show a relationship between the registered domain level part of the URL and its path or query level part. These relationships are characterized and categorized using a feature extractor over the URL attributes. The features are then used in a machine learning method to catch phishing URLs from an actual data set. Phishing and non-phishing URLs were categorized by detecting the domain value and the threshold value for each attribute using decision rating. This technique was further evaluated in MATLAB using three major classifiers, SVM, Random Forest and Naive Bayes, to find out how it performs on the data set. The paper suggested that the Pre-Phish algorithm, for an effective phishing URL detection system, be based on lexical analysis of the URL. The Pre-Phish approach was demonstrated in an empirical case study that collected and evaluated a variety of phishing URL features and patterns, with all relevant attributes. This was an automated machine learning method that relied on the characteristics of a phishing URL to catch and block phishing URLs and to provide a high level of security. The same characteristics have been used to create a tool based on a web browser plug-in that can capture and block phishing URLs in real time and apply data mining methods to detect new patterns of phishing URLs [2].

Ashit Kumar Dutta (2021) [3] emphasized that detecting a phishing URL involves the automatic classification of URLs into a predefined set of class values based on several features and a class variable. Machine learning-based phishing detection techniques rely on URL features to gather information that can help categorize URLs to detect phishing sites. To combat the ongoing complexity of phishing attacks and tactics, anti-phishing techniques are essential. The author used LSTM to identify malicious and legitimate URLs. A crawler was developed that crawled 8000 URLs from the Alexa Rank portal, and the Phish-tank dataset was also used to measure the efficiency of the proposed URL detector. The result of this study shows that the proposed method provides superior results compared with existing deep learning methods. A total of 8000 malicious URLs were detected using the suggested URL detector, which achieved better accuracy and F1 score in limited time.

In the study by Ciza Thomas, Sandhya L. and Joby James (2013), several features are compared using different data mining algorithms. The results indicate the efficiency that can be achieved using lexical features. To protect end users from visiting these sites, we can attempt to identify phishing URLs by analyzing lexical and host-based features. The particular challenge in this area is that criminals are constantly developing new strategies to counter our defensive measures. To succeed in this competition, we need algorithms that constantly adapt to new examples and features of phishing URLs. Online learning algorithms provide better learning methods as compared to batch-based learning mechanisms [5].

Purbay M. and Kumar D. (2021) [3] proposed a process of classifying phishing attacks according to the mechanism the scammer uses to trap the targeted users.

Many of these attack methods involve keylogging tools and DNS poisoning. Social engineering attacks are initiated through online blogs, SMS services, social media platforms that use web services such as Facebook and Instagram, peer-to-peer file-sharing services, and Voice over Internet Protocol (VoIP) systems in which attackers use spoofed caller IDs. Each form of phishing differs slightly in how the process is carried out in order to defraud an unsuspecting consumer.

Phishing attacks occur when an attacker sends a message containing a link to potential users to direct them to phishing URLs [4].

3. Methodology
In this study, we use LSTM (Long Short-Term Memory), an algorithm that forms part of the structure of our scheme. It takes a URL as input in the form of a character sequence and predicts whether the link is a phishing or a legitimate website [22].

Long Short-Term Memory is an adaptive recurrent neural network (RNN) in which each neuron is replaced by a memory cell that holds an internal state in addition to the conventional neuron. Multiplicative units are also used as gates to control the flow of information. LSTM layers consist of a set of recurrently linked blocks called memory blocks, as shown in Fig. 1. Each of these blocks contains one or more recurrently connected memory cells. Hence, a standard LSTM cell has an input gate that controls the input of data from outside the cell, a forget gate that determines whether the cell retains or discards data in the internal state, and an output gate that prevents or allows the internal state to be seen from the outside.

Figure 1. Architecture of the LSTM method – A

The most common evaluation tool is the confusion matrix, consisting of four basic counts: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

Standard metrics such as accuracy, recall, and F1 score are used in this study to measure the performance of the proposed solutions.

Furthermore, LSTM modules are known to have the ability to learn long-range dependencies from the input sequence. The LSTM training algorithm computes error gradients by combining real-time recurrent learning with backpropagation. However, backpropagation is truncated after the first timestep because long-term dependencies are handled by the memory blocks, not by the backpropagated error gradient flow. This step also helped make the LSTM directly comparable to other RNNs in terms of performance because training can be performed using standard backpropagation through time.

LSTM algorithm architecture: The central components of the LSTM architecture are the memory cell, which can maintain its state over time, and the nonlinear gate units, which regulate the information flowing into and out of the network, as shown in Fig. 2. Based on the insights from secure networks, it is observed that since LSTM neurons consist of internal cells and gate units, one should look not only at the output of the neuron but also at the internal structure to design the original features of the LSTM so that it can address classification problems [25].

Time complexity: In order to calculate the time complexity of the proposed models, the time complexity of the DNN and LSTM based model sections must be calculated separately. For the DNN section, the time complexity is equal to the sum of the number of parameters for each layer because the time is dominated by the matrix multiplications of the Multilayer Perceptron (MLP) layers. As such, the time complexity of the DNN section is O(4p1), where 4 is the number of layers and p1 is the average number of parameters per layer, which depends on the input and output size of each layer.

The time complexity of each layer in the LSTM is O(1) per weight because the LSTM is local in space and time [26]. Therefore, the time complexity of the LSTM partition is O(w + p2), where w is the total number of weights in the LSTM layers and p2 is the number of parameters in the last layer of the LSTM partition. For the model where BiLSTM is used instead of LSTM, the time complexity is O(2w + p2) instead of O(w + p2) because the calculations are done in two different directions in BiLSTM. Due to the structure of the hybrid model, when the two separate partitions are combined, two MLP layers are used to obtain the final output value. The time complexity of this final part is O(2p3), where p3 is the average number of parameters in these layers. Finally, the time complexity of the proposed LSTM-based hybrid model is O(w + 4p1 + p2 + 2p3), which is the sum of the time complexity of all parts. Although this combination of models brings an additional cost in terms of time required, the benefit outweighs this additional cost. We did not want to include old datasets; our goal was to conduct experiments on new datasets.
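The memory cell and gate units described in this section can be sketched as a single forward step of a standard LSTM cell. This is a minimal NumPy sketch and not the authors' implementation: the stacked weight layout, the variable names, and the random toy inputs are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    Assumed layout: W has shape (4 * hidden, hidden + inputs) and b has
    shape (4 * hidden,). The four row blocks hold the input-gate,
    forget-gate, output-gate and candidate weights, in that order.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate: how much new data enters
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate: how much old state is kept
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much state is exposed
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the memory cell
    c = f * c_prev + i * g                 # updated internal state of the cell
    h = o * np.tanh(c)                     # hidden state seen from outside the cell
    return h, c


# Toy usage: run five random input vectors through one cell.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x, h, c, W, b)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can persist across many steps as long as the forget gate stays close to one.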
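The confusion-matrix counts named in the methodology (TP, TN, FP, FN) map to accuracy, recall, and F1 score as sketched below. This is a minimal illustration that assumes phishing is the positive class. Precision is computed only because the F1 score needs it, and the counts in the usage line are made-up numbers, not results from the paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive standard metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (hypothetical, not taken from the experiments).
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

On a balanced dataset, accuracy is a reasonable headline metric; recall additionally shows how many phishing URLs slip through as false negatives.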

Figure 2. Architecture of the LSTM method - B

3.1 Principles
The proposed framework employs a Recurrent Neural Network (RNN) variant called Long Short-Term Memory (LSTM) to classify malicious and legitimate website URLs. The term RNN refers to two broad classes of networks with a similar structure, one with finite impulse and the other with infinite impulse. Both classes exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced by a purely feedforward neural network, whereas an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled. The LSTM is a deep learning method that mitigates the vanishing gradient problem of the RNN. The LSTM comprises multiple gates that are employed to improve performance. Each input of the LSTM generates an output that becomes an input for the following layer.

a. Recurrent Neural Networks
The recurrent neural network is a type of artificial neural network that uses sequential or time series data. These algorithms are typically used for temporal or ordinal problems such as speech recognition, image captioning, and natural language processing. Like convolutional and feedforward neural networks, recurrent neural networks use training data to learn and optimize the model parameters [30]. The networks’ ability to memorize distinguishes them from other machine learning techniques. This ability comes from the way information from previous inputs is used to influence the current input and output. Therefore, the output of a recurrent neural network depends on the sequence of previous inputs. Similarly, the future input can be helpful in determining the output of a specific input, but unidirectional neural networks cannot account for these events in their prediction [30]. To better understand the concept of recurrent neural networks, the figure below illustrates a simple RNN and how the network can be unfolded to generate the third output, as shown in Fig. 3.

Figure 3. The structure and unfolding of RNN.

As shown in the first part of Fig. 3, the RNN has a feedback loop, which is unrolled over 3 timesteps to produce the second part of the figure. Note that the RNN can be unrolled over N timesteps as well. While the figure shows a simple illustration of a very small RNN, the topic of RNNs is vast and discussing it is beyond the scope of our work. However, we need to discuss the gradient to be able to understand the downsides of the RNN and how the LSTM is used to address these issues.

A gradient is the partial derivative of a function with respect to its inputs, which measures how much the output of the function changes if the inputs change a little. The gradient can be thought of as the slope of a function: the higher the gradient, the steeper the slope and the faster a model can learn. However, if the slope is zero, the model stops learning. Simply put, the gradient measures the change in all weights with regard to the change in error.

Having described the gradient and what it is used for, we can now discuss the downsides of the RNN. There are two main downsides: exploding gradients and vanishing gradients. Exploding gradients occur when the algorithm, unreasonably, assigns high importance to the weights. Fortunately, this problem can be mitigated by truncating or squashing the gradients. Unfortunately, the vanishing gradients problem is not as easy to handle as exploding gradients. The vanishing gradients problem occurs when the values of the gradients are too small and the model stops learning or takes too long to learn. The problem persisted for a while, but it was solved through the concept of LSTM [31].

b. Long Short-Term Memory
Long Short-Term Memory networks (LSTMs) are an extension of recurrent neural networks that basically extends the memory. Therefore, they are well suited to learning from important experiences that have very long time lags in between [31]. The units of an LSTM are connected together to be used as the building blocks of the RNN, which is often called the LSTM network. However, building the RNN

using LSTM network enables the RNN to remember inputs over a long period of time. The LSTM’s ability to remember is due to its information memory, which is very similar to a computer’s memory. The LSTM can read, write and delete information from its memory. The memory of the LSTM can be described as a gated cell, with the gate controlling whether to store or delete the information. The gate opens to store the information or discards the forwarded information based on its importance, which is expressed through weights. The weights are learned by the algorithm over time, which simply means the LSTM learns over time which information is important and which is not.

The LSTM has three gates: input gate, forget gate, and output gate. These gates control the memory of the LSTM: the input gate determines whether or not to let new input in, the forget gate deletes information that is not important, and the output gate allows the information to impact the output at the current timestep. Fig. 4 illustrates an RNN with its gates.

Figure 4. RNN with its three input gates.

The gates in the LSTM are analog and use sigmoid functions, meaning they generate a value that ranges from zero to one. Because they are analog, backpropagation can be applied through them. The problematic issue of vanishing gradients is solved through LSTM as it keeps the gradients steep enough, which keeps the training relatively short and the accuracy high.

After briefly describing the RNN, the problem of vanishing gradients, and how the problem is solved using the LSTM, we now discuss our proposed design.

3.2 Design
As shown in Fig. 5, we started the process by collecting and preparing the dataset. The dataset was collected from Phish-tank [27]. It contains 194,798 URLs, of which 97,399 are phishing URLs and the rest are legitimate ones. The data were then split into training and testing datasets. The LSTM network was trained using the training dataset and the performance was evaluated based on the accuracy. The parameters of the LSTM were then modified and tuned to improve the performance before deploying it into the production environment. The classifier can then act as an intermediary stage between the end user and the internet. Whenever a request is sent to any URL, the requested URL is verified using the model, and access is granted if the requested URL is not a phishing URL, or blocked if it is.

Figure 5. Design of the proposed methodology.

Before diving into the details of the LSTM implementation, we start by discussing the components of the LSTM network layers. The LSTM is an effective prediction and classification model as it generates an output based on an arbitrary number of steps. The LSTM model contains five essential components [28].

Cell State (CS) – A cell that accommodates the long- and short-term memories.

Hidden State (HS) – The output status information that is used to determine the classification based on the current data, input data, and a hidden condition. The HS is used to recover both short-term and long-term memory in order to make the prediction.

Input Gate (IT) – The amount of information that is fed into the cell state. The input gate identifies an input value for memory alteration. The sigmoid defines values that range from 0 to 1. Then a tanh function is used to weight the passed values and evaluate their significance from -1 to 1. The equations below represent the input gate and the cell state, wherein $W_n$ is the weight, $HT_{t-1}$ is the previous hidden state, $x_i$ is the input, and $b_n$ is the bias vector, which needs to be learnt during the training phase [29].

$$IT = \sigma\left(W_n (HT_{t-1}, x_i) + b_n\right) \tag{1}$$

$$CT = \tanh\left(W_d (HT_{t-1}, x_i) + b_c\right) \tag{2}$$

Forget Gate (FT) – The amount of data that flows from the current input and past cell state into the present cell state. This gate is used to filter out the information that needs

to be discarded from the memory. It is described by a sigmoid function, which examines the previous hidden state ($HT_{t-1}$) and the input values ($x_i$) and outputs a number between 0 and 1 for each value in the cell state $CT_{t-1}$.

$$FT = \sigma\left(W_f(HT_{t-1}, x_i) + b_f\right) \qquad (3)$$

Output Gate (OT): The total amount of information that flows into the hidden state. The sigmoid function of this gate determines which values to let through, with outputs between 0 and 1. The tanh function weights the values that are passed on, expressing their degree of importance on a range from -1 to 1, and the result is multiplied by the output of the sigmoid [29].

$$OT = \sigma\left(W_o(HT_{t-1}, x_i) + b_o\right) \qquad (4)$$

$$HT = OT \cdot \tanh(CT) \qquad (5)$$

3.3 Implementation

The LSTM network is implemented in Python with the help of the Keras and TensorFlow libraries. The LSTM is built using a sequential model and structured as shown in Fig. 6.

As shown in Fig. 6, the first layer in the LSTM is the input layer, which determines the size and the type of the input into the LSTM network. The second layer is the embedding layer, which sits between the input layer and the LSTM layer. As the LSTM operations are basically floating-point additions and multiplications, the embedding layer is used to generate a floating-point vector representation of the input URL. The third layer is the LSTM layer, which utilizes the sigmoid and tanh functions to adjust the weights, drops some data that are considered irrelevant, and forwards the results into the dropout layer. The fourth layer is the dropout layer, which controls what data flows into the output layer and what data are to be removed from the LSTM memory.

Figure 6. The structure of the LSTM network.

The final layer is the Dense layer, or output layer, which takes the output of the LSTM as an input and produces the classification of the LSTM network. The parameters of the designed LSTM network are shown in Table 1.

Table 1. LSTM network parameters.

| Component | Setting |
|---|---|
| Model | Sequential |
| Embedding | Input dimension = 100, output dimension = 32, input length = 75 |
| LSTM | Output = 32, dropout = 0.2, recurrent dropout = 0.2 |
| Dense | Activation = sigmoid, kernel regularizer = regularizers.l2(1e-4) |
| Adam optimizer | Learning rate = 0.0015, loss = binary_crossentropy, metrics = accuracy |

4. Discussion and Results

The dataset that we used in our research has been well studied and measured by other researchers. The wiki accompanying the dataset comes with a data description document that discusses the data generation strategies taken by the dataset authors [7].

To update our dataset with new phishing sites, we also implemented code that extracts the features of new phishing sites provided by the Phish-Tank website. The dataset contains about 11,000 samples from websites, and we used 10% of the samples in the testing phase. Every website is flagged as legitimate or phishing. The features of our dataset are as follows:

I. Abnormal URL: extracted from the WHOIS database. For a legitimate website, the identity is usually part of its URL.
II. Website Redirect Count: whether the website redirects more than four times.
III. Web Traffic: measures the popularity of a site by determining the number of visitors.
IV. Page Rank: a value ranging from 0 to 1 that aims to measure the importance of a web page on the Internet.
V. HTTPS Token: whether the "https" token is spoofed in the URL, for example http://https-www-mellat-phish.ir.
VI. DNS Record: whether a DNS record exists.
VII. Request URL: checks whether external objects in a web page, such as images, videos, and sounds, have been downloaded from another domain.
VIII. Anchor URL: an anchor is an element specified by the <a> tag. This feature is treated exactly as the Request URL.
IX. IP Address: whether an IP address is used instead of a domain name in the URL, such as http://217.102.24.235/sample.html.
X. URL Length: scammers can use a long URL to hide the suspicious part in the address bar.
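Three of the features listed above (V, IX, and X) can be computed from the URL string alone. The following is a minimal sketch of how that might look in Python; it is not the authors' extraction code. The function names and the keys of the returned dictionary are hypothetical, and the remaining features are left out because they require WHOIS, DNS, traffic, or page-content lookups.

```python
import ipaddress
from urllib.parse import urlparse


def uses_ip_address(url: str) -> bool:
    """Feature IX: True if the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def has_https_token(url: str) -> bool:
    """Feature V: True if the string "https" appears inside the host name."""
    host = urlparse(url).hostname or ""  # urlparse lowercases the host name
    return "https" in host


def lexical_features(url: str) -> dict:
    """Collect the URL-only features; the dictionary keys are illustrative."""
    return {
        "ip_address": uses_ip_address(url),
        "https_token": has_https_token(url),
        "url_length": len(url),  # Feature X: raw character count
    }
```

Calling `lexical_features` on the two example URLs from the list flags the first as an HTTPS-token spoof and the second as an IP-address host. How the raw length is turned into a legitimate/phishing flag is not stated in the text, so no threshold is applied here.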
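The network configuration in Table 1 can be sketched in Keras roughly as follows. This is an illustration assembled from the table values, not the authors' code: the constant names and `build_model` are hypothetical, the single output unit of the Dense layer is an assumption (Table 1 gives only its activation), and dropout is applied through the LSTM layer's own arguments as the table lists it.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

MAX_URL_LEN = 75   # "input length" in Table 1
VOCAB_SIZE = 100   # "input dimension" of the embedding in Table 1


def build_model() -> keras.Model:
    """Build and compile a sequential model with the Table 1 settings."""
    model = keras.Sequential([
        # Each URL is assumed to arrive as 75 integer token ids.
        keras.Input(shape=(MAX_URL_LEN,)),
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
        layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),
        # One output unit is an assumption for binary classification.
        layers.Dense(1, activation="sigmoid",
                     kernel_regularizer=regularizers.l2(1e-4)),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.0015),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

With these sizes the model stays small, which is in line with the training times of a few minutes per epoch on a laptop CPU reported in this section.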

The proposed framework was developed in Python 3.0 using Jupyter notebook software with the support of NumPy, scikit-learn, TensorFlow, and Keras. To evaluate our proposed framework, we extracted a dataset from the Phish-tank database. The parameters of the training phase of the proposed framework are shown in Table 1. We used a learning rate of 0.0015, which was obtained by sweeping the design space and selecting the best value.

In order to analyze the results of our proposed framework, we used the following metrics, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, with phishing as the positive class:

1) Accuracy: the percentage of correctly classified URLs.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$

2) Recall: the proportion of phishing URLs that are correctly classified.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

3) Precision: the proportion of URLs predicted as phishing that are actually phishing.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$

4) F-measure: the weighted harmonic mean of the precision and recall of the test. The best value is 1 and the worst is 0.

$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (9)$$

We trained our model using 5 epochs, and we report the results of each epoch in Table 2.

Table 2. The results of the training phase by epoch.

| Epoch | Time | Loss | Accuracy |
|---|---|---|---|
| 1 | 195s | 0.3562 | 0.8442 |
| 2 | 192s | 0.2709 | 0.8889 |
| 3 | 194s | 0.2415 | 0.9012 |
| 4 | 195s | 0.2224 | 0.9101 |
| 5 | 191s | 0.2104 | 0.9151 |

As shown in Table 2, the accuracy of our proposed framework improved by 4.4% from epoch 1 to epoch 2. However, the accuracy improvement shrinks to 0.5% between epochs 4 and 5. Similarly, the loss decreases by 8.5% between epoch 1 and epoch 2, but by only 1.2% between epochs 4 and 5. On the other hand, the time to finish running each epoch is only slightly affected and hovers around 195 seconds. While the time required by each epoch is expected to be the same, the framework was run on a laptop with an Intel® Core™ i7-7700HQ CPU @ 2.8 GHz and 16 GB of RAM. The laptop was running multiple applications in a shared environment; therefore, the time variation can be explained by other processes interrupting the training process, cache misses, process scheduling, etc.

To test the trained model of our proposed framework, we split the collected dataset into 75% for training and 25% for testing. Using the testing part of the dataset, our model reported a loss of 0.1887 and an accuracy of 0.9257. As the testing results show, the performance of our trained model is better than the results reported by the last training epoch. Therefore, we started a sensitivity analysis in which we varied the number of epochs used to train our model, and we report the results in Table 3.

Table 3. The performance of our proposed framework with different epochs.

| # of epochs | Time of last epoch | Loss of last epoch | Accuracy of last epoch |
|---|---|---|---|
| 1 | 180s | 0.3581 | 0.8440 |
| 2 | 176s | 0.2711 | 0.8886 |
| 3 | 178s | 0.2403 | 0.9026 |
| 4 | 181s | 0.2224 | 0.9100 |
| 5 | 181s | 0.2107 | 0.9153 |
| 6 | 183s | 0.2018 | 0.9193 |
| 7 | 184s | 0.1944 | 0.9221 |
| 8 | 174s | 0.1897 | 0.9246 |
| 9 | 176s | 0.187 | 0.9250 |
| 10 | 176s | 0.1829 | 0.9276 |

As the results in Table 3 show, increasing the number of epochs can increase the framework's accuracy and decrease the loss. However, the returns diminish as the number of epochs increases: increasing the number of epochs from 9 to 10 increased the accuracy by only 0.26% and dropped the loss by only 0.39%. Thus, we stopped our sensitivity analysis at 10 epochs. Testing the model trained for 10 epochs reported even better results, with an accuracy of 93.45% and a loss of 16.71%.

To further analyze the results and tune our trained model, we varied the learning rate parameter as shown in Table 4.

Table 4. Accuracy with respect to different learning rates.

| # of epochs | LR = 0.0015 | LR = 0.0001 | LR = 0.002 |
|---|---|---|---|
| 1 | 0.8440 | 0.7712 | 0.8522 |
| 2 | 0.8886 | 0.8189 | 0.8964 |
| 3 | 0.9026 | 0.8289 | 0.901 |
| 4 | 0.9100 | 0.8376 | 0.9160 |
| 5 | 0.9153 | 0.8439 | 0.9203 |
| 6 | 0.9193 | 0.8497 | 0.9234 |
| 7 | 0.9221 | 0.8547 | 0.9258 |
| 8 | 0.9246 | 0.8589 | 0.9279 |
| 9 | 0.9250 | 0.8627 | 0.9293 |
| 10 | 0.9276 | 0.8672 | 0.9307 |

As we can notice from Table 4, a learning rate of 0.002 achieved the highest accuracy, which reached 93.07%.

Table 5 shows a comparison between our proposed framework and other frameworks that used different machine learning techniques to detect phishing websites.

Table 5. Performance comparison between different schemes.

| Method | Accuracy | TP rate | FP rate | F-measure |
|---|---|---|---|---|
| Our framework | 93.45% | 95.07% | 4.9% | 93.31% |
| DNN [32] | 88.77% | 85.83% | 14.17% | - |
| DNN with GA [32] | 91.13% | 90.79% | 9.21% | - |
| Decision Table [33] | 92.24% | 93.2% | 6.8% | - |
| Naïve Bayes [33] | 92.98% | 93% | 7% | - |
| ANN-MLP [34] | 87.61% | - | - | - |

As shown in Table 5, our proposed framework performs better than the other machine learning techniques. The Naïve Bayes classifier proposed in [33] comes closest, with an accuracy 0.5% below that of our proposed framework. To give a better understanding of our results, we show the confusion matrix of our evaluation in Table 6.

Table 6. The confusion matrix of our proposed framework.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 91.60% | 4.74% |
| Actual Negative | 8.39% | 95.25% |

Table 6 shows the confusion matrix of our proposed framework. We can observe that our scheme has a higher accuracy in correctly detecting legitimate websites. On the other hand, our framework has an accuracy of 91.6% in detecting a phishing URL when it is actually a phishing URL.

Host-based features explain "where" phishing sites are hosted, "who" they are managed by, and "how" they are managed. We use these features because phishing websites may be hosted at less reputable hosting centers, on machines that are not usual web hosts, or through non-reputable registrars.

Using lexical features, we were able to achieve a detection accuracy of 91.5% when 60% of the data set was used for the test. When using 90% of the data set, we obtained 92.55% detection accuracy. In MATLAB, using a regression tree, we obtained 90.25% detection accuracy when 60% of the data set was used for the test and 87.26% detection accuracy when 90% of the data was used for the test.

5. Conclusion

The proposed framework is an effective technique that addresses the detection of phishing websites by relying on the website's URL. The framework is built using the Long Short-Term Memory algorithm, which improves on Recurrent Neural Networks by solving the diminishing gradients problem. While the problem of phishing cannot be completely removed, it can be significantly mitigated in two main ways: first, by improving and implementing smart anti-phishing techniques; second, by educating end users on how fraudulent phishing websites can be detected and identified. To counter novel and complex phishing attacks and tactics, ML anti-phishing techniques are of extreme importance. In this work, we employed the LSTM technique to distinguish malicious from legitimate websites. We used the Phish-tank dataset to measure the efficiency of the proposed framework: a dataset of 194,798 URLs, of which 97,399 are phishing URLs and the rest are legitimate. The results of our evaluation show that the proposed method reaches an accuracy of 93.45%, higher than the other schemes compared in Table 5.

6. REFERENCES

[1] Bahnsen et al. (2018), How to Detect Phishing Website Using Three Model Ensemble Classification.
[2] Preethi, V., Velmayil, G. (2019). Automated phishing website detection using URL features and machine learning technique, International Journal of Engineering and Techniques, 2(5), 107–15. Retrieved 1 Dec 2019, from http://www.ijetjournal.org.
[3] Ashit Kumar Dutta (2021), Detecting phishing websites using machine learning technique, PLOS ONE, https://doi.org/10.1371/journal.pone.0258361, October 11, 2021.
[4] Gunter Ollmann, "The Phishing Guide Understanding & Preventing Phishing Attacks", IBM-Internet Security Systems, 2012.
[5] Sandhya L., Ciza Thomas, Joby James (2013). Detection of phishing URLs using machine learning techniques, International Conference on Control Communication and Computing (ICCC). Publication at: https://www.researchgate.net/publication/269032183. December 2013.
[6] Purbay M., Kumar D (2021), "Split Behavior of Supervised Machine Learning Algorithms for Phishing URL Detection", Lecture Notes in Electrical Engineering, vol. 683, https://doi.org/10.1007/978-981-15-6840-4_40.
[7] Mohammad R., Thabtah F. (2016), McCluskey L. (2015) Phishing websites dataset. Available: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites Accessed January.
[8] Adebowale MA, Lwin KT, Sanchez E, and Hossain MA (2019). Intelligent web-phishing detection and protection scheme using integrated features of Images, frames and text. Expert Systems with Applications, 115: 300-313. https://doi.org/10.1016/j.eswa.2018.07.067
[9] Al-diabat, M. (2016). Detection and Prediction of Phishing Websites using Classification Mining Techniques. International Journal of Computer Applications, 147(5).
[10] APWG (2019). 2nd quarter 2019: Phishing activity trends report. Anti-Phishing Working Group. https://doi.org/10.1016/S1361-3723(19)30025-9.
[11] Abdelhamid, N., Thabtah, F. and Abdel-Jaber, H. (2017). Phishing detection: A recent intelligent machine learning comparison based on models content and features. In IEEE International Conference on Intelligence and Security Informatics (ISI), 22–77, China: IEEE.
[12] Aburrous, M., Hossain, M., Dahal, K., Thabtah, F. (2010). Experimental case studies for investigating e-banking phishing techniques and attack strategies. Cognitive Computation, 2(3), 242-253.
[13] Dou Z, Khalil I, Khreishah A, Al-Fuqaha A, and Guizani M (2017). Systematization of knowledge (sok): A systematic review of software-based web phishing detection. IEEE Communications Surveys and Tutorials, 19(4): 2797-2819. https://doi.org/10.1109/COMST.2017.2752087
[14] Dunlop M, Groat S, and Shelly D (2014). Goldphish: Using images for content-based phishing analysis. In the 5th International Conference on Internet Monitoring and Protection, IEEE, Barcelona, Spain: 123-128. https://doi.org/10.1109/ICIMP.2010.24.
[15] Nirmal K, Janet B, and Kumar R (2015). Phishing-the threat that still exists. In the International Conference on Computing and Communications Technologies, IEEE, Chennai, India: 139-143. https://doi.org/10.1109/ICCCT2.2015.7292734.
[16] Opara C, Wei B, and Chen Y (2019). HTMLPhish: Enabling accurate phishing web page detection by applying deep learning techniques on HTML analysis. Available online at: https://bit.ly/2zV0ymk.
[17] Jain A.K., Gupta B.B. (2018) "PHISH-SAFE: URL Features Based Phishing Detection System Using Machine Learning", CyberSecurity. Advances in Intelligent Systems and Computing, vol. 729, https://doi.org/10.1007/978-981-10-8536-9_44.
[18] Luke, I. (2020). The 5 most common types of phishing attack. [online] Retrieved 1 March 2020, from https://www.itgovernance.eu/blog/en/the-5-most-common-types-of-phishing-attack.
[19] Adebowale MA, Lwin KT, Hossain MA (2020) Intelligent phishing detection scheme using deep learning algorithms. J Enterprise Inf Manag.
[20] Akinyelu AA (2019) Machine learning and nature inspired based phishing detection: a literature survey. Int J Artif Intell Tools 28(05):1930002.
[21] Wei, W.; Ke, Q.; Nowak, J.; Korytkowski, M.; Scherer, R.; Woźniak, M. (2020) Accurate and fast URL phishing detector: A convolutional neural network approach. Comput. Netw, 178, 107275.
[22] Chen, D.; Wawrzynski, P.; Lv, Z. (2021) Cyber security in smart cities: A review of deep learning-based applications and case studies. Sustain. Cities Soc. 66, 102655.
[23] Al-Ahmadi, S. PDMLP: Phishing Detection Using Multilayer Perceptron. Int. J. Netw. Secur. Its Appl. 2020, 12. SSRN:3624621. Available online (accessed on 12 May 2021): https://papers.ssrn.com/abstract=3624621.
[24] Ahmad, R.; Alsmadi, I. (2021) Machine learning approaches to IoT security: A systematic literature review. Internet Things, 14, 100365.
[25] Dargan S, Kumar M, Ayyagari MR, Kumar G (2020) A survey of deep learning and its applications: a new paradigm to machine learning. Arch Comput Methods Eng 27(4):1071–1092
[26] Hao S, Ge FX, Li Y, Jiang J (2020) Multisensor bearing fault diagnosis based on one-dimensional convolutional long short-term memory networks. Measurement 159:107802
[27] Phishtank website, "https://phishtank.org/", accessed: 2022-02-06
[28] Dutta, Ashit Kumar. (2021) "Detecting phishing websites using machine learning technique." PloS one 16, no. 10: e0258361.
[29] LSTM structure and gates, "https://www.pluralsight.com/guides/introduction-to-lstm-units-in-rnn", accessed: 2022-02-06.
[30] Recurrent Neural Networks, "https://www.ibm.com/cloud/learn/recurrent-neural-networks", accessed: 2022-02-06.
[31] A guide to RNN understanding, "https://builtin.com/data-science/recurrent-neural-networks-and-lstm", accessed: 2022-02-06.
[32] Ali, Waleed, and Adel A. Ahmed. (2019) "Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based feature selection and weighting." IET Information Security 13, no. 6: 659-669.
[33] Abdulrahman, Musbau Dogo, John K. Alhassan, Olawale Surajudeen Adebayo, Joseph Adebayo Ojeniyi, and Morufu Olalere. (2019) "Phishing attack detection based on random forest with wrapper feature selection method."
[34] Ferreira, Ricardo Pinto, Andréa Martiniano, Domingos Napolitano, Marcio Romero, Dacyr Dante De Oliveira Gatto, Edquel Bueno Prado Farias, and Renato José Sassi. (2018) "Artificial neural network for websites classification with phishing characteristics."
