
ISSN: 2995-3286

Research Article International Journal of Media and Networks


Phishing URL Detection Using CNN-LSTM and Random Forest Classifier
Hemant Gurung, Roshan Nepal and Sopnil Nepal*
National College of Engineering, Kathmandu, Nepal
*Corresponding Author
Sopnil Nepal, National College of Engineering, Kathmandu, Nepal.

Submitted: 2023, Oct 27; Accepted: 2023, Nov 06; Published: 2024, May 27

Citation: Gurung, H., Nepal, R., Nepal, S. (2023). Phishing URL Detection Using CNN-LSTM and Random Forest Classifier.
Int J Med Net, 2(5), 01-06.

Abstract
This paper presents the classification of phishing URLs and legitimate URLs using machine learning and deep learning techniques. Phishing is the act of stealing private information by pretending to be a legitimate entity. A machine learning model, the Random Forest classifier, is trained on features extracted from the address bar, the domain, and the HTML and JavaScript of the URL. In parallel, a CNN-LSTM hybrid model is trained to learn the character sequence features of a given URL and classify it. The dataset is public data downloaded from Kaggle and contains 11,430 URLs: 5,715 legitimate URLs and 5,715 phishing URLs. The trained models are then used to classify the URL in the current address bar as legitimate or phishing. The paper thus focuses on the study and development of models for detecting phishing sites, so that the properties of various URLs can be learnt through feature extraction and the URLs can be classified as accurately as possible.

Keywords: Phishing Website Detection, Convolutional Neural Network, Long Short-Term Memory Network, Random Forest,
Machine Learning

1. Introduction
In recent years, internet technology has evolved rapidly and brought great convenience to human society. The internet has become a primary source for information and data sharing, and its usage continues to grow. Many people share information such as their email address, location, credit card information and bank details for purposes such as online shopping, charity and online banking. They intend to share this information with legitimate companies. With so many users sharing such information, the internet is expected to attract people who intend to steal it. According to the FBI, phishing was the most common type of cybercrime in 2020, and phishing incidents nearly doubled in frequency, from 114,702 incidents in 2019 to 241,324 incidents in 2020 [1]. These statistics show that phishing causes a lot of problems for internet users.

Three techniques are currently in wide use to mitigate the effect of phishing. The first is user awareness. The second, and the most common, is blacklisting phishing websites. The disadvantage of this approach is that a website must be proven to be a phishing website before it can be blacklisted. The third, which has proven to be the most effective, is to use machine learning and deep learning techniques that learn the characteristic features of previous malicious links and use them to make accurate distinctions in the future [2].

Current mainstream machine learning methods of phishing website detection extract statistical features from the URL or relevant features of the webpage, such as the layout, domain information or HTML and JavaScript, and then classify these features. However, machine learning algorithms do not analyze the sequence or the positions of words in a URL. In addition, 63% of phishing websites have a lifespan of only 2 hours, after which they either expire or change their domain name [3]. To use machine learning techniques that focus on the statistical features of a URL and also to exploit the orientation and sequence learning capability of deep learning, we propose a CNN-LSTM model along with a Random Forest classifier. The CNN-LSTM model belongs to the field of deep learning, whereas the Random Forest classifier belongs to the field of machine learning.

2. Related Works
Qiao Zhang et al. proposed a phishing website detection technology based on the CNN-BiLSTM algorithm. Their model attempted to solve the problems of existing phishing webpage detection methods that rely on manual feature extraction. Their method first performed word segmentation on the URL based on sensitive word segmentation, then converted it into a feature vector matrix, automatically extracted its local features through CNN, and acquired its bidirectional long-distance dependent features through BiLSTM. Their model classified phishing and legitimate URLs with an accuracy of 98.84% [3].

T. Sujithra et al. implemented various machine learning algorithms to reduce the false positives in detecting new phishing sites. They attempted to identify the machine learning algorithm that detects phishing sites with higher accuracy than existing techniques. After implementing various classification algorithms, they found that the XGBoost classifier outperformed the rest.
According to their research, the XGBoost algorithm had an accuracy of 94.7% [4].

Peng Yang et al. proposed a multidimensional feature phishing detection approach based on a fast detection method using deep learning. In the first step, character sequence features of the given URL were extracted and used for quick classification by deep learning; this step did not require third-party assistance or any prior knowledge about phishing. In the second step, they combined URL statistical features, webpage code features, webpage text features and the quick classification result of deep learning into multidimensional features. The approach could reduce the detection time by setting a threshold. With this approach they achieved an accuracy of 98.99% [5].

Rakotoasimbahoaka et al. proposed a reliable, generic and flexible system based on a hybrid of machine learning and deep learning methods (CNN-LSTM-RF). They performed manual feature extraction for the RF (Random Forest) algorithm and automatic feature extraction with the CNN-LSTM model. CNN_LSTM_RF 10 produced an interesting result, converging after three epochs with a malicious URL detection accuracy of 96%. They also experimented with an RF_CNN_LSTM hybrid, but this model performed poorly, with a malicious URL detection accuracy of just 50% [6].

Vysakh S Mohan et al. proposed S.P.O.O.F Net (Syntactic Patterns for identification of Ominous Online Factors), a combination of a Convolutional Neural Network and a Long Short-Term Memory neural network. The proposed architecture was found to outperform existing threat detection strategies such as blacklisting, sinkholing and machine learning based classifiers for malicious URL detection. S.P.O.O.F Net overcomes drawbacks of traditional methods, such as the need for a domain-level expert to constantly maintain the database on which the classifier is trained, because the threats are ever changing. With this model they achieved an accuracy of 95.2% [7].

Hitesha Gupta et al. attempted early phishing detection using the XGBoost classifier. They tried to solve the problem of stealing confidential information from victims using legitimate websites or email. They used three datasets for the simulation. The accuracy achieved by this method, 98.45%, was the highest in comparison to the other machine learning classification algorithms. Using XGBoost classification, the total F1-measure obtained by the FRS function choice was 98.45% [8].

3. Methodologies
Our proposed methodology uses two models: a Random Forest classifier and a CNN-LSTM hybrid model. In the hybrid model, the LSTM is used to make future classifications based on previous classifications made, and a 1D convolutional neural network is used to learn the sequence of characters present in the URL.

3.1 Random Forest Algorithm
Random Forest is a machine learning algorithm. To implement it, several features are extracted from the collected dataset. The URL embedding matrix used in deep learning cannot fully represent the phishing website information. The features are:

• IP Address in URL: It is unusual to see an IP address in the address bar of the browser while surfing the internet. Example: “http://120.30.3.3/abc.php”. A URL of this form may be a phishing URL.

• “@” Symbol in URL: If the “@” symbol is present in the URL, the browser ignores everything to the left of “@” and considers only the part to its right, which provides an easy gateway to phishing sites.

• Length of URL: Phishers sometimes exploit browser features by creating a very long URL in order to hide the true identity of the URL. It is unusual to see long URLs while browsing the internet. Most legitimate URLs are at most 200 characters long.

• Redirection “//” in URL: If “//” is present in the URL, it means that there is a redirection to another URL. Phishers can put their phishing links after “//” so that users are redirected to their site.

• “http/https” in Domain name: HTTP has no security mechanism for data encryption and no SSL certificate. A URL without HTTPS is more likely to be a phishing URL.

• Using URL Shortening Services (“Tiny URL”): URL shortening creates another, shorter URL that redirects to the original website. With a shortened URL, the original URL is masked and users have no idea which website the link leads to.

• Prefix or Suffix “-” in Domain: The “-” symbol is mostly used to mimic a legitimate website. Example: “https://www.pay-pal.com”. To naive users it may look like the legitimate one. Legitimate URLs rarely contain the dash symbol.

• Favicon: A favicon is the logo that appears on the tab of a web page and provides the visual identity of the website. Phishers can load the favicon of a legitimate website from another domain in order to mimic it.

• Request URL: Many phishing websites use request URLs to load webpage components such as images and icons from another site. Clicking on these foreign components may redirect the user to another site.

• URL of anchor: The anchor tags in phishing pages are used to redirect to external websites. Legitimate sites have anchor tags that mostly point to the same domain name.
• Links in Script: The script tags in the HTML of a website contain the JavaScript for that site. The JavaScript is expected to be loaded from the same domain name as the website, with no external or suspicious scripts loaded.

This results in a total of 14 manually extracted features. To extract these features, Python libraries such as re, urllib, ipaddress, BeautifulSoup, whois and requests were used. The number of estimators was set to 100. For each feature, if the URL satisfies the condition that marks it as phishing, the value 1 is assigned; otherwise the value 0 is assigned. The feature values of 0 and 1 are then shaped into an array and passed to the model for classification.

3.1.1 Random Forest Pseudocode
1. Randomly select “k” features from the total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into daughter nodes using the best split, chosen on the basis of the highest information gain, which is calculated using:

IG(S, A) = Entropy(S) - ∑ᵥ (|Sᵥ| / |S|) * Entropy(Sᵥ)
Entropy(S) = - ∑ᵢ pᵢ * log₂(pᵢ)

where Sᵥ is the subset of S for which feature A takes the value v, and pᵢ is the proportion of samples in S that belong to class i.

4. Repeat steps 1 to 3 until “l” number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees [9].
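
The split criterion in step 3 can be illustrated with a small sketch. The function names and the toy feature columns below are hypothetical and are not taken from the authors' implementation; the code only evaluates the two formulas above on binary labels (1 = phishing, 0 = legitimate).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Toy data (hypothetical): 1 = phishing, 0 = legitimate
labels = [1, 1, 1, 0, 0, 0]
has_at_symbol = [1, 1, 1, 0, 0, 0]  # separates the two classes perfectly
long_url = [1, 0, 1, 0, 1, 0]       # carries very little information

print(information_gain(has_at_symbol, labels))
print(information_gain(long_url, labels))
```

A tree builder following the pseudocode would evaluate this gain for each of the k randomly selected features and split on the one with the largest value.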

Figure 1: Random Forest Working Mechanism
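
Some of the address-bar features described in Section 3.1 can be computed from the URL string alone. The following is a minimal sketch under stated assumptions: the function names, the feature keys and the list of shortening services are hypothetical, and the paper's own extraction code is not reproduced here. Each feature is 1 when the phishing condition holds and 0 otherwise, matching the encoding described above.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical list of shortening services; the paper does not enumerate one.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def has_ip_address(url):
    """Return 1 if the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def address_bar_features(url, max_length=200):
    """Binary address-bar features: 1 = condition suggests phishing."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    after_scheme = url.split("://", 1)[-1]
    return {
        "ip_in_url": has_ip_address(url),
        "at_symbol": int("@" in url),
        "long_url": int(len(url) > max_length),
        # A second "//" after the scheme separator suggests redirection.
        "redirection": int("//" in after_scheme),
        "no_https": int(parsed.scheme != "https"),
        "shortener": int(host in SHORTENERS),
        "dash_in_domain": int("-" in host),
    }


print(address_bar_features("http://120.30.3.3/abc.php"))
print(address_bar_features("https://www.pay-pal.com"))
```

The HTML and JavaScript based features (favicon, request URL, anchors, scripts) need the page content and are not covered by this sketch.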


3.2 The CNN-LSTM Algorithm
In the first step of the CNN-LSTM model, the URL is segmented at the character level. The segmented URL is then normalized. For normalization, the maximum URL length is fixed as the length L of the longest URL in our dataset. If a sample URL is shorter than L, padding is added to extend it to length L. A character map dictionary is then used to convert the characters into a one-hot code sequence.

Figure 2: Character Map Dictionary
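
The character-level segmentation, padding to length L and one-hot conversion can be sketched as follows. The alphabet, the choice of index 0 for padding and the function names are assumptions made for illustration; the actual character map is the one shown in Figure 2.

```python
import string

# Assumed vocabulary: ASCII letters, digits and punctuation.
# Index 0 is reserved for padding (and, in this sketch, unknown characters).
ALPHABET = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len):
    """Map each character to its index, then right-pad with 0 up to max_len."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in url[:max_len]]
    return indices + [0] * (max_len - len(indices))


def one_hot(indices, vocab_size=len(ALPHABET) + 1):
    """Expand each index into a one-hot row of length vocab_size."""
    return [[1 if j == idx else 0 for j in range(vocab_size)] for idx in indices]


urls = ["http://a.io", "https://example.com/login"]
L = max(len(u) for u in urls)  # longest URL in the (toy) dataset
encoded = [encode_url(u, L) for u in urls]
rows = one_hot(encoded[0])
print(L, len(rows), len(rows[0]))
```

Each one-hot row contains a single 1 among many zeros, which is the sparsity that an embedding layer is meant to remove.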

Figure 3: CNN LSTM Working Mechanism
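
The convolution and pooling stages of the pipeline in Figure 3 can be illustrated without a deep learning framework. This is a sketch only: the embedding values and the single filter are hypothetical hand-picked numbers, whereas a trained model learns many filters, and the LSTM and sigmoid stages are not shown.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) of one filter, followed by ReLU.

    sequence: list of T embedding vectors, each of dimension D
    kernel:   list of K weight vectors, each of dimension D
    returns:  list of T - K + 1 activations
    """
    k = len(kernel)
    out = []
    for t in range(len(sequence) - k + 1):
        window = sequence[t:t + k]
        s = bias + sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        out.append(max(0.0, s))
    return out


def max_pool1d(values, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


# Toy embedded URL: 6 characters, embedding dimension 2 (hypothetical values)
embedded = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
kernel = [[1, 0], [0, 1]]  # responds most strongly to the pair [1, 0], [0, 1]

features = conv1d(embedded, kernel)
pooled = max_pool1d(features, 2)
print(features, pooled)
```

The pooled list is a shorter sequence of features, which is the kind of input the LSTM then reads step by step.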



The one-hot code sequence consists of many zeros and is inefficient for computation and storage. Embedding layers convert this one-hot sequence into a fixed-length vector representation with reduced dimension, which makes it computationally efficient. Convolution is then performed on the embedding matrix. A URL is a 1D character sequence, so Convolution1D is suitable. Convolution extracts the deep correlation features among the characters in a URL: the CNN learns the positions of different characters in a URL and how strongly the characters are related to one another. After convolution, pooling is performed. The result of pooling is the input to the LSTM.

The output of pooling contains a sequence of embedded vector representations. This sequence can be treated as time series data, since a different value is obtained from the pool at each time step. LSTMs are well suited to time series input. The LSTM learns the sequential information among the features obtained from the CNN. It implements memory cells that can remember both long-term and short-term sequences, so it captures the context and dependencies of the URL sequence. The output of the LSTM is then passed through a sigmoid activation that performs binary classification.

4. Experiments and Results
The research is based on two algorithms:
a. Random forest classifier
b. CNN LSTM algorithm

The efficiency of the proposed algorithms is tested using the performance parameters Accuracy, Sensitivity and Precision. These parameters are calculated as:

i. Accuracy: It is the percentage of all instances, legitimate and phishing, that are correctly classified.

Accuracy (A) = (TP+TN) / (TP+TN+FP+FN)

ii. Sensitivity: Sensitivity is the proportion of actual positive cases that are predicted as positive (true positives). Sensitivity is also termed Recall.

Sensitivity = (TP) / (TP+FN)

iii. Precision: Precision refers to the quality of the positive predictions made by the model. It is the number of true positives divided by the total number of positive predictions (i.e. the number of true positives plus the number of false positives).

Precision = (TP) / (TP+FP)

Where
True Positive (TP) = legitimate URLs that are correctly detected as legitimate.
True Negative (TN) = phishing URLs that are correctly detected as phishing.
False Positive (FP) = phishing URLs that are incorrectly detected as legitimate.
False Negative (FN) = legitimate URLs that are incorrectly detected as phishing.

4.1 Dataset for Testing
The dataset was split into 80% for training and 20% for testing. After the model was trained, it was also tested on all URLs of the dataset.

4.2 Results
The random forest model used 100 estimators and was trained on the training data. On the testing data it had an accuracy of 70.034%. The confusion matrix for the random forest model evaluated against the test data (n = 20% of 11,430) is shown in Table 1:

n = 2286             Predicted: legitimate   Predicted: phishing
Actual: legitimate   899                     258
Actual: phishing     427                     702

Table 1: Confusion Matrix for Random Forest
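
As a check on the definitions in Section 4, the three performance parameters can be recomputed from Table 1. The function name below is hypothetical; the counts are those of Table 1 with legitimate taken as the positive class, which is the convention the TP/TN definitions imply.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (recall) and precision from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision


# Table 1 (Random Forest), legitimate = positive class:
# TP = legitimate predicted legitimate, FN = legitimate predicted phishing,
# FP = phishing predicted legitimate,   TN = phishing predicted phishing.
acc, sens, prec = classification_metrics(tp=899, fn=258, fp=427, tn=702)
print(f"accuracy={acc:.2%} sensitivity={sens:.2%} precision={prec:.2%}")
```

The accuracy obtained this way, 1601 / 2286, agrees with the 70.034% reported for the random forest model on the test data.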

The same test data was also used to evaluate the CNN-LSTM model. On the testing data, the model had an accuracy of 94.7%. The confusion matrix for the CNN-LSTM model evaluated against the test data is shown in Table 2:

n = 2286             Predicted: legitimate   Predicted: phishing
Actual: legitimate   1101                    56
Actual: phishing     65                      1064

Table 2: Confusion matrix for CNN LSTM


4.3 Data Sample Result
4.3.1 Random Forest
The results of the performance parameter calculation for Random Forest on the dataset are presented in the following table and graph:

URL count     Actual      Predicted   Actual    Predicted  Accuracy  Sensitivity  Precision
              Legitimate  Legitimate  Phishing  Phishing
1-2000        997         772         1003      693        73.25%    77.4%        71.3%
2001-4000     999         777         1001      645        71.1%     77.77%       68.5%
4001-6000     1011        800         989       553        67.13%    79.12%       64%
6001-8000     980         762         1020      650        69.94%    77.75%       67.3%
8001-10000    1003        769         997       623        69.6%     76.66%       67.27%
10001-11430   725         582         705       433        70.5%     80.27%       68.14%
Average                                                    70.25%    78.16%       67.75%

Table 3: Performance Parameter of Random Forest

Figure 4: Performance Parameter of Random Forest Graph
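
The per-batch values in Table 3 can be related to the confusion counts with a short sketch. It rests on an assumption about the column meanings: "Predicted Legitimate" and "Predicted Phishing" are read as the number of URLs of each class that were classified correctly, with legitimate as the positive class. The function name is hypothetical.

```python
def batch_metrics(actual_legit, correct_legit, actual_phish, correct_phish):
    """Accuracy, sensitivity and precision (in %) for one batch row.

    Assumes the 'Predicted' columns count correctly classified URLs per class.
    """
    tp = correct_legit
    fn = actual_legit - correct_legit  # legitimate URLs flagged as phishing
    tn = correct_phish
    fp = actual_phish - correct_phish  # phishing URLs passed as legitimate
    total = actual_legit + actual_phish
    accuracy = 100 * (tp + tn) / total
    sensitivity = 100 * tp / (tp + fn)
    precision = 100 * tp / (tp + fp)
    return round(accuracy, 2), round(sensitivity, 2), round(precision, 2)


# First row of Table 3 (URLs 1-2000)
first_row = batch_metrics(997, 772, 1003, 693)
print(first_row)
```

Under this reading the first row gives an accuracy of 73.25%, and sensitivity and precision close to the 77.4% and 71.3% listed in the table, which supports the assumed column interpretation for that row.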

Table 3 and Figure 4 above show the calculated values of the performance parameters, i.e., Actual Legitimate, Predicted Legitimate, Actual Phishing, Predicted Phishing, Accuracy, Sensitivity and Precision. The average accuracy of the algorithm is 70.25%, with a highest value of 73.25% and a lowest of 67.13%. Similarly, the average sensitivity is 78.16%, with a highest value of 80.27% and a lowest of 76.66%. Finally, the average precision for this dataset is 67.75%, with a highest value of 71.3% and a lowest of 64%.

4.3.2 Deep Learning Model
The results of the performance parameter calculation for the deep learning model on the dataset are presented in the following table and graph:

URL count     Actual      Predicted   Actual    Predicted  Accuracy  Sensitivity  Precision
              Legitimate  Legitimate  Phishing  Phishing
1-2000        997         927         1003      956        94.15%    95.55%       95.49%
2001-4000     999         945         1001      956        95.05%    95.05%       95.04%
4001-6000     1011        937         989       958        94.75%    94.81%       94.07%
6001-8000     980         916         1020      992        95.4%     95.48%       95.3%
8001-10000    1003        910         997       956        93.3%     93.41%       93.03%
10001-11430   725         679         705       671        94.4%     93.26%       93.13%
Average                                                    94.3%     94.59%       94.51%

Table 4: Performance parameter calculation for Deep Learning Model



Table 4 and Figure 5 above show the calculated values of the performance parameters, i.e., Actual Legitimate, Predicted Legitimate, Actual Phishing, Predicted Phishing, Accuracy, Sensitivity and Precision. The average accuracy of the algorithm is 94.3%, with a highest value of 95.4% and a lowest of 93.3%. Similarly, the average sensitivity is 94.59%, with a highest value of 95.55% and a lowest of 93.26%. Finally, the average precision for this dataset is 94.51%, with a highest value of 95.49% and a lowest of 93.03%.

5. Conclusion and Future Works
From the above analysis we can infer that, for this dataset, the Random Forest model provided an accuracy of around 70% and the CNN-LSTM model provided around 94%. 14 different features were extracted manually for the random forest model. The lower accuracy of the random forest model may be due to the fact that many phishing websites have been shut down, so feature extraction by making requests to such websites and scraping them was not optimal. In this work, we took a total of 11,430 sample URLs and divided them into six batches of 2,000 URLs each, the last batch holding the remaining 1,430. We used 14 different parameters (IP address in URL, '@' symbol in URL, length of URL, '//' redirection in URL, "http/https" in URL, Tiny URL service, Prefix or Suffix containing "-", External favicon source, External Request URL, Link in Script and link tags, Server Handler Form, Submitting to email, Iframe redirection). From the results of the sample test above, we achieved an average accuracy of 70.25% for random forest and about 94.3% for CNN-LSTM.

In future, this work can be extended and enhanced as follows:
1. The distribution of data can be changed, i.e., both large and small sample sizes can be used instead of equal sizes for testing the results.
2. The trained models can be implemented as a browser extension or web application for real-time detection of phishing URLs.

Acknowledgements
We would like to thank the Department of Computer and Electronics Engineering, National College of Engineering, for providing us the working opportunity, the motivation to enhance our knowledge and a new experience of teamwork.

Compliance with Ethical Standards
• The authors have no conflict of interest.
• The authors did not receive support from any organization for the submitted work.
• No funding was received to assist with the preparation of this manuscript.
• No funding was received for conducting this study.
• No funds, grants, or other support was received.
• The study was conducted without involvement of experimentation on humans or animals.

Competing Interests
• The authors have no relevant financial or non-financial interests to disclose.
• The authors have no competing interests to declare that are relevant to the content of this article.
• All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
• The authors have no financial or proprietary interests in any material discussed in this article.

Data Availability Statement
The initial data for this paper was accessed from https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset. The data was modified for our research and feature extraction was performed, resulting in new datasets that are available at the following links:
Dataset1 (URL with labels): https://figshare.com/articles/dataset/URL_csv/21070183
Dataset2 (URL with extracted features): https://figshare.com/articles/dataset/phishingfeatures_csv/21070198
All data generated or analyzed during this study are included in the published article.

References
1. V. Patel, "eBook on How to Protect Yourself and Your Company from Phishing for Free," 9 June 2021. Available: https://druvstar.com/how-to-protect-yourself-and-your-company-from-phishing-for-free/.
2. Sravanth, M., T. Reddy, K. Nagendra, and G. Aashirvad. "Phishing Website Detection Based On Multidimensional Features Driven By Deep Learning." International Journal of Techno-Engineering (2021): 105-112.
3. Zhang, Q., Bu, Y., Chen, B., Zhang, S., & Lu, X. (2021). Research on phishing webpage detection technology based on CNN-BiLSTM algorithm. In Journal of Physics: Conference Series (Vol. 1738, No. 1, p. 012131). IOP Publishing.
4. Sujithra, T., Dwivedi, N., & Utakarsha, A. (2020). Detection of phishing websites using deep learning and machine learning. Journal of Critical Reviews, 7(8).
5. Yang, P., Zhao, G., & Zeng, P. (2019). Phishing website detection based on multidimensional features driven by deep learning. IEEE Access, 7, 15196-15209.
6. Rakotoasimbahoaka, A., Randria, I., & Razafindrakoto, N. R. (2019). Malicious URL detection by combining machine learning and deep learning models. Artificial Intelligence for Internet of Things, 1.
7. Mohan, V. S., Vinayakumar, R., Soman, K. P., & Poornachandran, P. (2018, May). Spoof net: syntactic patterns for identification of ominous online factors. In 2018 IEEE Security and Privacy Workshops (SPW) (pp. 258-263). IEEE.
8. H. Gupta and R. Shrivastava, "Early Phishing Attack Detection using XGBoost Classifier," JASC: Journal of Applied Science and Computations, pp. 138-145, 2019.
9. Polamuri, S. (2017). How the random forest algorithm works in machine learning. Retrieved December, 21.

Copyright: ©2023 Sopnil Nepal, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Int J Med Net, 2023 https://opastpublishers.com Volume 2 | Issue 5 | 6
