Phishing URL Detection Using Machine Learning Methods
ABSTRACT
In a constantly changing digital world, phishing is one of the most worrying security problems. Cybercrime, the theft of personal data and the violation of privacy using computers, is a new form of crime brought about by the growing use of the Internet, and phishing is its principal method. Phishing via URLs (Uniform
Resource Locators) is one of the most prevalent forms, with the main objective being data
theft from the user upon accessing the malicious website. It can be difficult to identify a
rogue URL. The goal of this work is to develop a method for identifying these websites by
using machine learning algorithms that concentrate on the characteristics and behaviors of the
recommended URL. To identify harmful websites, the online security community has
developed blacklisting services. These blacklists are produced using a range of techniques,
including heuristics for site inspection and manual reporting. Many harmful websites
unintentionally avoid blacklisting because of their recentness, lack of evaluation, or
inaccurate evaluation. Algorithms such as Support Vector Machine (SVM), Random Forest, Decision Tree, LightGBM, and Logistic Regression are used to build a
machine learning model that determines whether a URL is malicious or not. The first stage is
to extract features; the second is to apply the model.
1. Introduction
An increasing number of people are using the Internet as a platform for online
transactions, information sharing, and e-commerce as a result of the surge in internet usage
over the past several years. Cybercrime is a new type of crime that emerged as the use of the
Internet developed. Cybercriminals can steal information in a variety of ways, and phishing is
the primary tool they use to do so. Phishing comes in a variety of forms, such as email
phishing, spear phishing, whaling, and vishing. Phishing was first documented in 1990 and
was used to obtain passwords. Phishing assaults have increased in the last few years. Phishing
using URLs is one such assault. A website address, or URL, is a representation of a website's
location on a network and how to access it. Through the URL, we establish a connection to
the server's database, which houses all of the website's information and has a webpage that
shows it [1]. There are two types of URLs: harmful and benign. URL phishing uses malicious URLs, whereas benign URLs are safe and secure [2]. A cybercriminal will design a website that imitates a legitimate one in every way, so that it appears to be the real thing. On other websites, the URL will show up as an advertisement. When the user inputs their
credentials, fraud will occur. Another method involves sending the user a malicious URL via
email. When the user attempts to open the URL, a dangerous virus is downloaded, giving
hackers access to the data they need to carry out their crimes. To identify whether a URL is malicious or benign, certain properties must be extracted from it and compared [2].
2. Literature review
Numerous theories and methods have been offered by different authors and studied in order
to identify phishing URLs. One theory is to use features based on the message content
weighting to determine whether or not the URL is malicious.
Carolin and Rajsingh [3] devised a technique that uses association rule mining, a data mining
procedure, to identify dangerous URLs. The process of organizing and extracting information
from a dataset is known as data mining [3].
They carried out a study using both malicious and valid URLs to ascertain how the properties of the URL differ between the two, and in doing so gave a concise summary of the attributes of URLs. A machine learning model that could identify fraudulent URLs was created using this data.
Mohammed et al. [4] presented a model in which additional URL-based data and results from
Microsoft Reputation Services were used to build a machine learning model. We can
ascertain whether a URL has malicious intent by applying this model. The model produced precise outcomes. Microsoft Reputation Services is a Microsoft product that offers URL classification as part of its virus protection [4].
All of these characteristics were used to create a machine learning model. Various models
have been developed to identify fraudulent or genuine URLs. Using NLP algorithms is a
helpful technique that creates a word dictionary with all the language-based properties of
both benign and malicious URLs. This dictionary is then used to build a machine learning
model that can identify harmful URLs. Parekh et al. [5] suggested utilizing document object model
attributes to identify the rogue website. The document object model serves as an API for
programming languages such as XML and HTML. It is a tree structure that represents the
HTML or XML code and has features like color and gray histograms and spatial relationships
that can be used to identify phishing URLs[5]. Furthermore, Pradeepthi and Kannan [6]
offered a visual approach to spotting rogue websites. In this effort, phishing detection entails
examining text segments and styles in addition to webpage visuals. PhoneyC is a virtual
honey pot that is used to investigate the types of harmful URLs that hackers employ to steal
information, as revealed by a study by Fu [7].
Sahoo's suggested method [8] uses the EMD to determine the signature distances of the webpage images. After converting the webpages to images, they identified the visual indicators using characteristics such as color. Malicious URLs have also been shown to be
detectable in some investigations by examining their links to previously used domains. One study suggested a method to check whether there is any harmful content in the URL using the Beautiful Soup Python package, which parses HTML and XML files; based on that, the malicious URL can be detected. Another aspect of malicious URL detection is based on HTML features [9,10]. Another option is to use string-based algorithms, in which the URLs are preprocessed into word clouds for both malicious and legitimate classes. Each word cloud contains only the most common words in its class, and the comparison of the two word clouds, combined with machine learning methods, tells us whether a URL is dangerous or not [11].
Both reputable and fraudulent websites are used in data acquisition. Extracting valuable features involves two categories: URL-based features refer to IP addresses, URLs with the "@" symbol, dashes, lengthy URLs, unusually high or low numbers, URL subdomains, etc. Domain-based factors include the website's PageRank, its age, and its validity.
3. Methodology
3.1. Dataset
A total of 10,000 URLs, five thousand malicious and five thousand benign, are included in this sample. Phishing URLs were gathered from an open-source platform named
Phish Tank. Through a database of phishing information, Phish Tank offers collaborative data
on phishing on the Internet. The site offers several types of data, including csv, json, and
many more, and the data is updated hourly. After some research, we found a data set including benign, spam, phishing, malware, and defacement URLs. The source is the University of New Brunswick, and this collection contains 35,300 valid URLs, with benign and malicious URLs mixed. After gathering the dataset, the following steps are carried out:
A. Data preprocessing: null values present a significant obstacle when adding a dataset to a machine learning model and require preprocessing, such as merging the data. As a result, all null values are eliminated before the dataset is fed to the model.
B. Feature extraction: in this step, Python modules such as urlparse and whois are used to extract lexical and domain-based features from the final dataset.
C. Model application: lastly, machine learning techniques such as the Random Forest classifier, Decision Tree, and LightGBM are applied to every feature produced by the feature extraction module.
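The steps above can be sketched end to end. This is a minimal illustration on synthetic stand-in data rather than the paper's actual dataset, assuming NumPy and scikit-learn are available; the variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in feature matrix and labels (1 = phishing); a real pipeline would
# use the lexical and domain-based features extracted from URLs instead.
rng = np.random.default_rng(42)
X = rng.random((500, 5))
y = (X[:, 0] > 0.5).astype(int)
X[rng.integers(0, 500, 20), 0] = np.nan  # simulate missing values

# Step A: preprocessing -- eliminate rows containing null values
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# Step B would produce the feature columns; here they are the synthetic matrix.
# Step C: fit and evaluate a model on an 80/20 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

Other classifiers (Random Forest, LightGBM, etc.) slot into the same pipeline in place of the decision tree.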
• Domain name: for now, we simply extract the domain from the URL. This feature is not really helpful during training and might even be dropped entirely.
• Possess an IP: typical URLs contain a domain name rather than an IP address, and cybercriminals use IP addresses in URLs to steal private data. If an IP address is present in the URL, the feature is 1 (phishing); otherwise it is 0 (benign).
• Have @ symbol: the presence of the "@" sign marks a URL as phishing (1); otherwise it is legitimate (0).
• Length and depth of URL: cybercriminals frequently utilize lengthy URLs to conceal the suspicious part, so URLs longer than 54 characters are rated 1 (phishing) and shorter ones 0 (benign). The depth of a URL is simply the number of subpages it includes.
• Location of "//" in the URL: if the URL begins with HTTP, "//" should appear at position six; if it begins with HTTPS, at position seven. If "//" is detected anywhere else, the feature's value is 1 (phishing); otherwise it is 0 (benign).
• HTTP/HTTPS in Domain name: Depending on whether the URL has "http/https" in the
domain portion, this feature is assigned a value of 1 (phishing) or 0 (benign).
• Prefix/suffix "-" in the domain: legitimate domains rarely contain "-", but cybercriminals may add it as a prefix or suffix to imitate a real URL. A value of 1 indicates phishing, while 0 indicates a benign URL.
• Tiny URL: This online technique allows a URL to be significantly shortened while still
pointing to the necessary webpage. To do this, an HTTP redirect is used on a short domain
name to link to the webpage with the long URL. A value of 1 (phishing) or 0 (legal) is
assigned if the URL makes use of a shortening service.
• DNS record: WHOIS is a query service that holds data on domain names, including contact and registration information. If there is no DNS record for the domain, the feature is 1 (phishing); otherwise it is 0 (benign).
• Domain-based features: features such as web traffic, domain age, domain expiration date, and subdomains are examples. Web traffic, the number of people who visited a URL or webpage, is obtained from the Alexa database: if a URL's rank is within the top 100,000, the feature is 0 (benign); otherwise it is 1 (phishing). The age of the domain is crucial because malicious domains are typically short-lived: if the domain is younger than 12 months, this feature is 1 (phishing); otherwise it is 0 (legitimate). For the domain end term, if there is a difference of less than six months between the expiration date and the present time, we assign 1 (phishing); otherwise 0 (benign).
• Sub-domain: A website is considered malicious if the number of "." in the URL is more
than three, and it is given a value of either 1 (phishing) or 0 (benign).
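Several of the purely lexical checks above can be sketched with Python's standard library alone. The function name is illustrative, and the thresholds follow the rules described in the bullets:

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Lexical URL features in the spirit of the checks above (1 = phishing cue)."""
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]  # drop any port
    return {
        # host is a raw IPv4 address instead of a domain name
        "has_ip": 1 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) else 0,
        "has_at": 1 if "@" in url else 0,
        "long_url": 1 if len(url) > 54 else 0,
        # depth = number of non-empty path segments (subpages)
        "url_depth": sum(1 for part in parsed.path.split("/") if part),
        # "//" appearing after the protocol prefix suggests a redirect
        "double_slash_redirect": 1 if url.rfind("//") > 6 else 0,
        "http_in_domain": 1 if "http" in host else 0,
        "prefix_suffix": 1 if "-" in host else 0,
        "many_subdomains": 1 if url.count(".") > 3 else 0,
    }
```

For example, a URL whose host is a raw IP address and which has two path segments triggers the `has_ip` cue with `url_depth` 2.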
The feature importance of each decision tree will be determined, and the average of all the
feature importance calculations will be utilized.
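As a sketch of how per-tree importances are averaged, the following uses scikit-learn's RandomForestClassifier on synthetic stand-in data; the attribute names are scikit-learn's, the data is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importance of each feature in every individual tree, then averaged
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
avg_importance = per_tree.mean(axis=0)
```

The averaged vector matches the forest's aggregated `feature_importances_`, and the decisive first feature dominates it.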
• XGBoost
XGBoost is a machine learning technique belonging to the gradient boosting framework, a subset of ensemble learning. It uses decision trees as base learners and applies regularization techniques to improve model generalization. XGBoost is a popular choice for computationally demanding tasks including regression, classification, and ranking due to its proficiency in feature importance analysis, handling of missing data, and computational efficiency.
Key features of XGBoost Algorithm include its ability to handle complex relationships in data,
regularization techniques to prevent overfitting and incorporation of parallel processing for efficient
computation. XGBoost is widely used in various domains due to its high predictive performance and
versatility across different datasets.
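Because the XGBoost package itself may not be installed, the following sketch illustrates the same gradient boosting idea with scikit-learn's GradientBoostingClassifier as a stand-in; the data and parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data with a simple linear boundary
rng = np.random.default_rng(1)
X = rng.random((300, 4))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)

# Shallow trees fit sequentially, each correcting the previous ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X, y)
train_acc = gb.score(X, y)
```

XGBoost exposes a very similar fit/predict interface, with additional regularization knobs such as L1/L2 penalties on leaf weights.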
• Logistic regression
A linear model used for binary classification problems is called logistic regression. It
forecasts the likelihood that an instance will fall into a specific class. In many fields of study,
logistic regression is the most used statistical model for forecasting binary data. Its
widespread application can be attributed to its great interpretability and ease of use. The logit
function is frequently used in conjunction with generalized linear models.
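The logistic (sigmoid) and logit functions mentioned above can be written directly; this is a minimal, self-contained illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of probability p; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# A URL whose weighted feature score is 0 is assigned probability 0.5
p = sigmoid(0.0)
```

In logistic regression the score z is a weighted sum of the extracted features, and the predicted class is 1 when the resulting probability exceeds 0.5.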
• SVM
Using supervised learning as its foundation, SVM is a machine learning technique that may
be applied to regression as well as classification. The Support Vector Machine (SVM) is a
novel approach that is rapidly gaining traction because of its solid foundation in statistical
learning theory and its success in a number of data mining tasks. SVM is a statistical
learning-based classification technique that has proven useful in a number of large-scale, nonlinear classification applications. Each hyperplane is determined by (a) its direction and (b) its exact location in space, or threshold; x_i denotes the input vector of N components and y_i its category. The training cases are displayed in Eq. (6):

{(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)}, x_i ∈ R^DS, (6)

where DS is the number of input dataset dimensions and p is the number of training samples. The decision function, Eq. (7), assigns a sample to a class according to the side of the hyperplane on which it falls:

f(x) = sign(w · x + b), (7)

where w is the hyperplane direction and b the threshold.
One advantage of using the SVM for system training is its ability to handle multi-dimensional
data. As a classifier, SVM uses labelled training data to produce an ideal hyperplane that is used to classify future samples; through margin maximization, this hyperplane separates the data classes.
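A minimal sketch of a linear SVM separating two toy clusters, using scikit-learn's SVC; the data is illustrative, and `coef_` and `intercept_` expose the hyperplane direction and threshold discussed above:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters in 2-D (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]        # hyperplane direction
b = clf.intercept_[0]   # threshold: classification uses the sign of w . x + b
label = clf.predict([[3.5, 3.5]])[0]
```

Nonlinear kernels (RBF, polynomial) replace the dot product to handle the nonlinear decision boundaries mentioned above.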
• Multilayer perceptrons
A basic kind of artificial neural network that is frequently used in machine learning,
particularly the identification of phishing websites, is the multilayer perceptron (MLP). These
neural networks are made up of several interconnected layers of nodes, each of which uses
nonlinear activation functions and weighted connections to change the input data. MLPs can
be trained on characteristics taken from website content, such as text content, HTML code,
and URL structure, to categorize websites as dangerous or legitimate in the context of
phishing detection. MLPs can successfully discern between legitimate and counterfeit
websites by learning intricate patterns and correlations within the data, hence aiding in the
protection of people from online risks.
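The layered computation an MLP performs can be sketched with the standard library alone. The weights below are hand-picked for illustration, not trained:

```python
import math

def relu(values):
    """Nonlinear activation applied element-wise in the hidden layer."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer: h = relu(W1 x + b1); output = sigmoid(w2 . h + b2)."""
    h = relu([sum(wij * xj for wij, xj in zip(row, x)) + bi
              for row, bi in zip(W1, b1)])
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, h)) + b2)

# Two input features, two hidden units, one phishing-probability output
p = mlp_forward([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                [1.0, 1.0], -1.0)
```

In practice the weights are learned by backpropagation from labelled URL features, and the output is thresholded to classify the site as dangerous or legitimate.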
4. Results
As a result, all of the previously covered techniques may be used to develop a machine learning model. 80% of the dataset was used for training and the remaining 20% for testing. Machine learning techniques such as Random
Forest, Decision Tree, Logistic Regression, XGBoost, and SVM are employed to analyze and
ascertain the legitimacy of a given URL. XGBoost yielded good results after fitting the
dataset to all algorithms; the performance analysis is presented in Table 1.
While Random Forest gets 0.820 in training accuracy and holds 0.821 in test accuracy,
XGBoost has 0.868 in training accuracy and 0.858 in test accuracy. Furthermore, the decision
tree's test accuracy remains at 0.850 while its training accuracy reaches 0.880.
Figure 2 shows the accuracy of each algorithm used for training the model.
Figure 3 presents a graph illustrating the relative significance of the various features
considered. Only a few of the fifteen criteria are crucial for improving accuracy.
The validation curves for each of the employed algorithms are shown in Figs. 4–6. The
model's accuracy, or score, for various algorithmic hyperparameter values is shown on the
validation curve.
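A validation curve of this kind can be produced with scikit-learn's validation_curve utility; the estimator, hyperparameter, and data below are illustrative stand-ins:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the paper uses the extracted URL features)
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

param_range = [1, 2, 4, 8]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=param_range, cv=5)

# One row per hyperparameter value, one column per cross-validation fold
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
```

Plotting `mean_train` and `mean_val` against `param_range` gives the training and cross-validation curves compared in the figures.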
Figure 4 shows that the training and cross-validation scores are nearly identical and steadily rising, indicating that the model is operating effectively. Additionally, Fig. 5 demonstrates that this model is performing well because the training and cross-validation scores are comparable and rising. Overall, the XGBoost model is the most precise and ideal.
Table 1
5. References
[1] Safi, A., & Singh, S. (2023). A systematic literature review on phishing website detection techniques. Journal of King Saud University - Computer and Information Sciences. https://doi.org/10.1016/j.jksuci.2023.01.004
[2] Machine Learning and Artificial Intelligence to Advance Earth System Science. (2022, June
13). National Academies Press eBooks. https://doi.org/10.17226/26566
[3] Carolin Jeeva, S., & Rajsingh, E. B. Intelligent phishing URL detection using association rule mining. Human-centric Computing and Information Sciences, 2016. https://doi.org/10.1186/s13673-016-0064-3
[4] Mohammed Nazim Feroz SM. Phishing URL detection using URL ranking. In: Proceedings
of the IEEE international congress on big data (BigData congress); 2015.
https://doi.org/10.1109/BigDataCongress.2015.97.
[5] Parekh Shraddha, Parikh Dhwanil, Kotak Srushti, Sankhe Smita. A new method for
detection of phishing websites: URL detection. IEEE; 2018. p. 949–52.
[6] K. V. Pradeepthi, A. Kannan, "Performance study of classification techniques for phishing URL detection," https://ieeexplore.ieee.org/document/7229761, 2022.
[7] A.Y. Fu, “Detecting phishing web pages with visual similarity assessment based on earth
mover’s distance (EMD)”, 2022.
[9] Sahingoz, O. K., Buber, E., Demir, O., & Diri, B. “Machine Learning-Based Phishing
Detection from URLs,” Expert Systems with Applications, vol. 117, pp. 345-357, January
2019.
[10] J. James, Sandhya L. and C. Thomas, “Detection of phishing URLs using machine learning
techniques,” International Conference on Control Communication and Computing (ICCC),
December 2013.
[11] Dipayan Sinha, Dr. Minal Moharir, Prof. Anitha Sandeep, “Phishing Website URL
Detection using Machine Learning,” International Journal of Advanced Science and
Technology, vol. 29, no. 3, pp. 2495-2504, 2020.