Phishing Detection using Machine Learning based URL Analysis: A Review

Aditya Deshmukh¹, Akash Yadav², Pratham Maske³, Shreyash Kathane⁴, Dr. D.S Adane⁵
¹⁻⁴ Student, Information & Technology, RCOEM, Nagpur, India
⁵ Professor of Information & Technology, RCOEM, Nagpur, India

Abstract: As technology develops, the opportunity for cybercrime grows with it. Phishing attacks based on URLs are among the most common threats to Internet users. Such attacks do not rely on technical vulnerabilities; instead, they exploit human weaknesses and are often launched against both organizations and individuals. Attackers deceive users into clicking on URLs that appear trustworthy, leading them to reveal sensitive information or install malware. Various machine learning techniques for phishing URL detection classify URLs as phishing or legitimate. These models remain under active development and refinement as researchers work to make them as accurate and efficient as possible. This paper reviews different machine learning techniques for detecting phishing URLs, together with the URL features and datasets used to train the models. It further discusses the methods that researchers have proposed to improve the detection accuracy of these models.

1. INTRODUCTION

In 2024, our reliance on technology only deepened, exposing us to more cyber threats. The ongoing digital transformation, accelerated by the global pandemic, has created fertile ground for cybercriminals. Recent analyses and reports point to a surge in security breaches, which have caused both financial losses and exposure of personal information on a massive scale. Phishing remains prevalent among these cybercrimes, using both social engineering and technical deception to steal personal identity data and financial account credentials. Attackers build fake versions of trusted websites with the aim of tricking people into voluntarily divulging their usernames, passwords, banking details, and other sensitive information. These phishing URLs are typically distributed through e-mail, instant messages, or text messages, so users should stay alert and adopt sound cybersecurity practices.

The FBI's Internet Crime Complaint Center (IC3) 2023 report highlights a significant rise in online fraud, with 880,418 complaints and potential losses exceeding $12.5 billion, marking a 10% increase in complaints and a 22% rise in losses from 2022. California reported the highest number of complaints and losses, with nearly 80,000 complaints and over $2 billion in losses. Investment scams were the most damaging: they alone cost victims $4.57 billion, an increase of 38% from the previous year. Crypto-investment fraud alone accounted for $3.94 billion, a 53% rise. Phishing schemes are among the most reported crimes, with over 298,000 complaints, which make up about 34% of all complaints.

Fig. 1: Complaints and losses over the last five years [1]

The report stresses the importance of public reporting to IC3 to assist the FBI in combating cyber threats. The FBI encourages consumers to look out for and read consumer and industry alerts about cybercrime, notify financial institutions if victimized, and file a report with IC3 or local law enforcement.

2. BACKGROUND

A. Phishing Detection

A URL-based phishing attack is carried out by sending users malicious links that seem legitimate and tricking them into clicking on them. In phishing detection, an incoming URL is analysed on the basis of its features and classified as phishing or not. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

B. Phishing Detection Approaches

List-Based Phishing Detection Systems
These systems rely on two lists to classify a website as either phishing or non-phishing. The whitelist contains safe and legitimate websites, while the blacklist includes those identified as phishing. Researchers have used whitelists to ensure that only URLs on the list are accessible. Another approach is the blacklist method, where URLs are checked against a list of known phishing sites. However, these systems have a significant drawback: even a small change in the URL can prevent it from being matched in the list. Additionally, they struggle to catch new, zero-day attacks.[3]

Rule-Based Phishing Detection Systems
The feature sets for rule-based systems stem from relational rule mining. The rules assign weights to the characteristics most prevalent in phishing URLs. Used within the system, these rules provide better accuracy than can be achieved by classifying on the features alone. For example, researchers in the CANTINA study used TF-IDF and some specific rules to identify phishing attacks. Researchers in similar works have combined features and rules to achieve higher detection accuracy.[3]

Visual Similarity-Based Phishing Detection Systems
These systems compare web pages with phishing sites visually to detect phishing attempts. They perform a server-side comparison of phishing and non-phishing sites and use image processing techniques to identify minor visual differences that users would not notice. Fake sites are designed to look like the original ones; however, these techniques reveal slight differences. Studies have shown that visual similarity-based systems can be effective detection models against phishing attacks by comparing generic visual elements.[3]

Machine Learning-Based Phishing Detection Systems
Machine learning-based systems detect phishing websites by classifying specified features using artificial intelligence techniques. These features can include URL structure, domain name, website content, and more. Due to their dynamic nature, these systems are particularly popular for detecting anomalies on websites. Machine learning models can adapt to new phishing tactics, making them highly effective in protecting users from evolving threats.[3]

3. LITERATURE REVIEW

In this section, a few of the research works that deploy the above-mentioned algorithms are reviewed and their results are summarized.

Phishing attack detection was investigated by Alam et al. (2020) [7], who used decision tree (DT) and random forest algorithms for the classification of attacks. The dataset, which came from Kaggle, had 30 significant features for identifying phishing URLs. A detailed preprocessing step produced clean, noise-free data, followed by feature selection using algorithms such as PCA. The performance of each algorithm was analyzed in terms of confusion matrices and the following performance measures: accuracy, precision, recall, and F1-score. Random forests outperformed DTs, offering 97% accuracy compared to 91.94% for DTs, and dealt effectively with overfitting and variability. The study asserted that ensemble approaches such as random forests help filter out spam and substantially assist phishing detection, particularly on large datasets.

Rashid et al. (2020) [8] presented a machine learning approach for phishing detection that uses Support Vector Machines (SVM) for classification. The dataset, obtained from repositories such as PhishTank and Alexa, consists of valid and phishing URLs, together with internal features, such as the length of the URL, and external features derived from third-party services. Principal Component Analysis (PCA) was performed for dimensionality reduction to make processing more efficient. The model achieved 95.66% accuracy using SVM with only five features, higher than that achieved by other techniques such as Random Forest, which showed an accuracy of 94.27% with 30 features. This reduction in the feature set improved computational efficiency while maintaining good detection rates. The authors indicated that their solution is robust at identifying new and short-lived phishing sites, making it a practical defence against cyber threats.

Kulkarni and Brown (2019) [11] studied the detection of phishing websites with machine learning techniques. The dataset was obtained from the University of California, Irvine Machine Learning Repository and contains 1,353 URLs labeled as phishing, suspicious, or legitimate. Nine features were extracted from the URLs, including URL length, age of domain, presence of an IP address, and others. Four classifiers were run: Decision Tree, Support Vector Machine (SVM), Naïve Bayes, and Neural Network. The Decision Tree classifier achieved an accuracy of 91.5%, with a True Positive Rate (TPR) of 90.97% and a False Positive Rate (FPR) of 7.81%. The SVM was behind at an accuracy of 86.69%, and Naïve Bayes and the Neural Network trailed slightly at 86.14% and 84.87%, respectively. The study stated that Decision Trees work well with discrete feature values but need pruning to deal with overfitting. The authors concluded that more features and larger datasets would help the performance of the classifier and recommended ensemble methods and rule-based approaches for future work.
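
Several of the reviewed studies favour ensemble methods such as random forests, and some recommend combining them with rule-based approaches. As a rough illustration of the voting idea behind such ensembles, the sketch below combines three hand-written rules by majority vote. The rules, the 75-character threshold, and all function names are hypothetical and are not taken from any reviewed paper; a real random forest learns many decision trees from data rather than relying on fixed rules.

```python
from typing import Callable, List

# A rule maps a URL string to True when it "looks like phishing".
Rule = Callable[[str], bool]


def long_url(url: str) -> bool:
    # Hypothetical threshold chosen only for illustration.
    return len(url) > 75


def has_at_symbol(url: str) -> bool:
    return "@" in url


def hyphen_in_host(url: str) -> bool:
    # Take the part between the scheme and the first "/" as the host.
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return "-" in host


def majority_vote(url: str, rules: List[Rule]) -> bool:
    """Flag the URL when more than half of the rules vote 'phishing'."""
    votes = sum(rule(url) for rule in rules)
    return votes * 2 > len(rules)


RULES = [long_url, has_at_symbol, hyphen_in_host]

if __name__ == "__main__":
    for candidate in ("http://www.confirme-paypal.com/login@session",
                      "http://example.com/"):
        print(candidate, majority_vote(candidate, RULES))
```

Each rule on its own is a weak signal; the vote only flags a URL when several signals agree, which is the intuition the reviewed papers give for why ensembles tend to be more robust than a single tree.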
Rishikesh Mahajan and Irfan Siddavatam [12] focused on three classification algorithms: Decision Tree, Random Forest, and Support Vector Machine. The dataset was constructed from 17,058 benign URLs from Alexa and 19,653 phishing URLs from PhishTank, each with 16 features. The data were partitioned into training and testing sets with proportions of 50:50, 70:30, and 90:10. Performance was judged according to accuracy, false negative rate, and false positive rate. Random Forest stood out, achieving 97.14% accuracy with the lowest false negative rate. Their conclusion was that the more data used for training, the better the accuracy.

Jitendra Kumar et al. [10] described the training of Logistic Regression, Naive Bayes, Random Forest, Decision Tree, and K-Nearest Neighbor classifiers using features derived from the lexical structure of URLs. They carefully constructed a dataset to address common problems such as data imbalance, biased training, variance, and overfitting. The preprocessed dataset was evenly split between phishing and trusted URLs and was further divided in a 70:30 ratio for training and testing. All classifiers had similar AUC (Area Under Curve) values, but the Naive Bayes classifier was reported as the best performer with the highest AUC value. It achieved an accuracy of 98% with a precision of 1, recall of 0.95, and F1-score of 0.97. The study thus underlines the importance of a balanced dataset and presents Naive Bayes as a strong candidate for phishing detection.

Vahid Shahrivari and Mohammad Mahdi Darabi [9] applied various machine learning algorithms to the detection of phishing websites. This research uses a dataset of 30 features, such as IP address presence, URL length, whether shortening services are used, and SSL state, among others. Characteristics common to phishing URLs are used to distinguish phishing websites from legitimate ones. The algorithms tried were logistic regression, decision tree, random forest, AdaBoost, KNN, SVM, gradient boosting, XGBoost, and neural networks. Besides accuracy, precision, recall, and F1 score were also used to assess the performance of the different models. XGBoost proved most accurate at 98.32%, Random Forest came second at an accuracy close to 97.27%, and the Neural Network also performed well, achieving 96.98% accuracy. The authors concluded that ensemble methods such as Random Forest and XGBoost are good at detecting phishing websites due to their high accuracy and robustness. They stressed the usefulness of employing multiple features and suggested that detection performance might be enhanced by coupling machine learning models with other phishing detection methods. This work exemplifies the potential of machine learning to help discern phishing websites, and its further promise of improvement with hybrid models and novel features.

Machikuri Santoshi Kumari et al. [4] detect phishing using models that combine blacklisting with machine learning methods. Several machine learning algorithms, namely XGBoost, Random Forest, Decision Tree, and Multilayer Perceptrons, were used for detection. Two datasets were used in addition to the PhishTank dataset: one containing phishing websites and the second containing phonemy features. A total of 30 features were used; among them, the most important were HTTPS, followed by Anchor URL, Website Traffic, and others. XGBoost gave the maximum training accuracy of 100% and the best test accuracy of 96.7% among all the algorithms. They concluded that "using the XGBoost algorithm to detect phishing improves prediction accuracy."

The study by Dr. Nitin N. Sakhare et al. [2] integrated conventional machine learning models such as XGBoost, LightGBM, and a referenced but inactive Random Forest classifier alongside a Graph Neural Network (GNN). The XGBoost classifier gave an accuracy of 92.09%, and LightGBM gave the highest accuracy of 93.29%. They also implemented another tree-based machine learning algorithm, CatBoost, which gave an accuracy of 92.98%. The GNN's performance left considerable scope for improvement. LightGBM emerged as the standout performer, with a precision of 0.93 and a recall of 0.93.

A. Orunsolu et al. [5] proposed a scalable architecture that combines incremental learning with a modular approach and found it effective. Using a dataset from PhishTank (2,541 phishing URLs) and Alexa (2,500 legitimate URLs), the model attained 99.96% accuracy with a low false positive rate of 0.04%. Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms were used for the comparative performance study. The study provides a criterion for assessing feature importance based on how often features occur in the phishing and legitimate datasets. Features are then selected for maximum relevance with minimum redundancy. The URL features include, but are not limited to, length, presence of '@', and hexadecimal codes. The webpage features investigated include validity of SSL certificates and consistency with domain names, while behavioral patterns such as cookie handling, along with the age of the domain, also qualify as important features. The incremental methodology processes these features in stages, starting with URL analysis, followed by webpage properties, and finally webpage behaviors if needed. This modular approach supports scalability and adaptability to new phishing tactics. The study's results demonstrate the effectiveness of the proposed system, though limitations remain, such as dataset diversity, lack of real-time testing, and absence of benchmarking.

Korkmaz et al. [6] addressed the persistent problem of phishing through URL analysis, using machine learning techniques to detect attacks that exploit human vulnerabilities by imitating legitimate sites in order to obtain sensitive data. The work focuses primarily on URL attributes to improve efficiency. The authors employed eight machine learning algorithms, including Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM), which were tested on three datasets with over 126,000 URLs. The datasets combined phishing URLs from PhishTank with legitimate URLs from Alexa and Common Crawl. The system extracted 48 key features from the URLs, including domain structure, special character presence, and length metrics, without recourse to third-party services, for efficiency reasons. The experimental results indicate that the Random Forest algorithm had the highest accuracy across the datasets (up to 94.59%) and performed better than previous studies. The approach runs efficiently enough to be used for real-time detection. However, the limited coverage noted in the paper leaves directions for further work, such as expanding upon the initial dataset.

B. Sucharitha et al. [3] investigated the application of machine learning algorithms to classify phishing websites. The dataset for this research comprises 32 features including IP address, URL length, URL shortening service employed, and state of SSL, among others. The study presents these as salient features of malicious URLs that identify phishing websites. The authors considered different machine learning models, namely Decision Trees, Random Forest, and Gradient Boosting. These models were evaluated using metrics such as accuracy, precision, recall, and F1-score. Among all models, Gradient Boosting achieved the highest score with accuracy 98.9%, precision 99.0%, recall 99.4%, and F-value just slightly lower at 98.6%. The authors concluded that ensemble methods such as Gradient Boosting and Random Forest provide accurate results and strong generalization when detecting phishing websites. They stress the importance of using features from varied sources and suggest that combining machine learning models with other phishing detection techniques can enhance detection capabilities further. This research illustrates the use of machine learning for detecting phishing websites and points to further improvement through hybrid models and additional features.

4. DATASETS
The datasets have been collected from various sites such as PhishTank, Alexa, etc., which maintain and regularly update data about phishing websites.

5. FEATURE EXTRACTION
URLs have certain characteristics and patterns that can be considered as their features. Fig. 3 shows the relevant parts of a typical URL.
In URL-based analysis for designing machine learning models, we need to extract these features in order to form a dataset that can be used for training and testing. There are four categories of features that are most commonly considered for feature extraction, as in [18]. They are as follows:
1) Address Bar based features
2) Abnormal based features
3) HTML and JavaScript based features
4) Domain based features

Address Bar Based Features

1) Having IP Address: Whether an IP address is used in place of the domain name in the URL, such as http://217.102.24.235/sample.html
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: A short link that redirects to a webpage with a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding the @ symbol, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example, http://www.Confirme-paypal.com.
7) Having Sub Domain: Whether the URL contains subdomains.
8) SSL State: Whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from another domain, the webpage is likely to be considered a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is better to open only the ports that are needed. Many firewalls, proxies, and Network Address Translation (NAT) servers block all or most ports by default.
12) HTTPS Token: Having a deceptive https token in the URL. For example, http://https-www-mellat-phish.ir

Abnormal Based Features

13) Request URL: Examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly as Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document, <Script> tags to create a client-side script, and <Link> tags to retrieve other web resources.
16) Server Form Handler: Whether the domain name in the server form handler (SFH) is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his email.
18) Abnormal URL: It is extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.

HTML & JavaScript Based Features

19) Website Redirect Count: Whether the page is redirected more than four times.
20) Status Bar Customization: Use of JavaScript to show users a fake URL in the status bar.
21) Disabling Right Click: It is treated exactly as "Using onMouseOver to hide the Link".
22) Using Pop-up Window: Whether the webpage shows pop-up windows.
23) IFrame: IFrame is an HTML tag used to display an additional webpage inside the one that is currently shown.

Domain Based Features

24) Age of Domain: Whether the age of the domain is less than a month.
25) DNS Record: Whether the domain has a DNS record.
26) Web Traffic: This feature measures the popularity of the website by determining the number of visitors.
27) Page Rank: Page rank is a value ranging from 0 to 1. PageRank aims to measure how important a webpage is on the Internet.
28) Google Index: This feature examines whether a website is in Google's index or not.
29) Links Pointing To Page: The number of links pointing to the web page.
30) Statistical Report: Whether the IP belongs to the top phishing IPs or not.
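
As a minimal sketch of how some of the address bar based features listed above could be computed from a raw URL string, the following uses only the Python standard library. The function name `address_bar_features`, the feature keys, and the tiny `SHORTENERS` set are assumptions made for illustration; they are not taken from any of the reviewed papers, and features that need network access (SSL state, favicon, ports, WHOIS data) are deliberately left out.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical, deliberately tiny list of shortening services (an assumption
# for illustration; a real system would need a maintained list).
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url: str) -> dict:
    """Extract a few lexical address-bar features from a URL string."""
    # urlparse only fills in the host when a scheme is present.
    full = url if "://" in url else "http://" + url
    host = urlparse(full).hostname or ""
    labels = host.split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    having_ip = is_ip(host)
    return {
        "having_ip": having_ip,
        "url_length": len(url),
        "shortening_service": ".".join(labels) in SHORTENERS,
        "having_at": "@" in url,
        # A "//" anywhere after the scheme separator suggests a redirect.
        "double_slash": "//" in full.split("://", 1)[1],
        "prefix_suffix": "-" in host,
        # Crude count of labels beyond "name.tld"; ignores suffixes like co.uk.
        "subdomains": 0 if having_ip else max(len(labels) - 2, 0),
        "https_token": "https" in host,
    }


if __name__ == "__main__":
    for example in ("http://217.102.24.235/sample.html",
                    "http://www.Confirme-paypal.com",
                    "http://https-www-mellat-phish.ir",
                    "bit.ly/1sSEGTB"):
        print(example, address_bar_features(example))
```

Each URL becomes one row of feature values, which is the form a training dataset for the classifiers discussed in the literature review would take. The subdomain count in particular is a rough heuristic and would need a public suffix list to be reliable.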
TABLE I. RESULT ANALYSIS

| Paper | Models Used | Suitable Models | Accuracy Score |
|---|---|---|---|
| [2] | XGBoost, LightGBM, Graph Neural Network (GNN), and CatBoost applied. Performance evaluated using accuracy, precision, recall, and F1-score. | LightGBM gives the highest accuracy, with precision 0.93 and recall 0.93. | XGBoost: 92.09%, LightGBM: 93.29%, GNN: 70%, CatBoost: 92.98% |
| [3] | Decision Tree (DT), Random Forest (RF), Gradient Boost (GB). | Gradient Boost achieved the highest accuracy, with precision 99.0%, recall 99.4%, and F1 score 98.6%. | DT: 96.0%, RF: 96.9%, GB: 98.9% |
| [4] | Combined blacklisting with ML algorithms (XGBoost, RF, DT, and Multilayer Perceptrons) applied to a dataset with features; phishing URLs collected from PhishTank and OpenPhish. | Among these, XGBoost was found to be the most accurate model. | XGBoost: 96.7%, RF: 92.5%, DT: 90.5%, Multilayer Perceptrons: 88% |
| [5] | Support Vector Machine (SVM) and Naïve Bayes (NB) with features based on maximum relevance with minimum redundancy. PhishTank (2,541 phishing URLs) and Alexa (2,500 legitimate URLs) datasets. | Both SVM and NB classifiers have TPR of 99.96, FNR of 0.04, TNR of 99.96, and FPR of 0.04. | SVM: 99.96%, NB: 99.96% |
| [6] | Random Forest (RF), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), Decision Tree (DT), Naive Bayes (NB), XGBoost. | Random Forest (RF) was the best-suited model based on its highest accuracy and overall performance in detecting phishing URLs. | RF: 94.59%, ANN: 94.35%, XGBoost: 92.95%, DT: 92.59%, KNN: 91.49%, LR: 91.31%, NB: 88.35%, SVM: 87.03% |
| [7] | DT and RF applied to a Kaggle dataset with 30 features. PCA used for feature selection. Performance evaluated using accuracy, precision, recall, and F1-score. | RF outperformed DT, addressing overfitting and variability effectively. | RF: 97%, DT: 91.94% |
| [8] | Applied SVM on data from PhishTank and Alexa, with internal and external features, and PCA for dimensionality reduction. | RF and NB classifiers had better accuracies. In terms of AUC, Gaussian Naive Bayes had a slightly higher value of 0.991. | SVM: 95.66%, RF: 94.27% |
| [9] | The examined classifiers are Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, Random Forest, Neural Networks, KNN, Gradient Boosting, and XGBoost. | Very good performance by the ensemble classifiers, namely Random Forest and XGBoost, in both computation duration and accuracy. | Logistic Regression: 92.6%, Decision Tree: 96.5%, Random Forest: 97.2%, AdaBoost: 93.6%, KNN: 95%, SVM: 94.9%, Gradient Boosting: 94.8%, XGBoost: 98.3% |
| [10] | A balanced dataset was used to train classifiers such as Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and k-Nearest Neighbors (k-NN), using features derived from the lexical structure of URLs. | Random Forest and Naive Bayes demonstrated superior accuracy. | Random Forest: 98.03%, Gaussian Naive Bayes: 97.18% |
| [11] | Four classifiers (DT, SVM, Naïve Bayes, Neural Network) applied to a UCI dataset with 1,353 labeled URLs and 9 extracted features. | DT had 91.5% accuracy but required pruning to address overfitting. Ensemble methods were recommended for improvement. | DT: 91.5%, SVM: 86.69%, Naïve Bayes: 86.14%, Neural Network: 84.87% |
| [12] | The dataset was divided into split ratios of 50:50, 70:30, and 90:10. Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM) classifiers were applied. | The Random Forest classifier demonstrated superior accuracy and the lowest false negative rate. | 50:50 split ratio: 96.72%, 70:30 split ratio: 96.84%, 90:10 split ratio: 97.14% |

6. PERFORMANCE EVALUATION METRICS

Selected parameters are used to evaluate the performance of the system. The associated metrics are Accuracy, Precision, Recall, F1 Score, and ROC curve, all derived from the values of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

In the context of URL classification:

True Positive (TP): The number of phishing URLs correctly detected as phishing.

True Negative (TN): The number of legitimate URLs correctly detected as legitimate.

False Positive (FP): The number of legitimate URLs incorrectly classified as phishing.

False Negative (FN): The number of phishing URLs incorrectly classified as legitimate.
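
The paper names these metrics without giving their formulas, so the sketch below shows the standard definitions computed from the four counts. The function names and the label convention (1 for phishing, 0 for legitimate) are assumptions made for this example. Recall is the same quantity as the True Positive Rate reported in some of the reviewed studies, and the False Positive Rate is included for the same reason.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, TN, FP, FN), treating label 1 as phishing and 0 as legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard classification metrics derived from the confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also called True Positive Rate (TPR)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(fp, fp + tn),  # False Positive Rate
    }


if __name__ == "__main__":
    # Made-up labels for illustration only, not results from any paper.
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print(metrics(*confusion_counts(truth, predicted)))
```

For phishing detection, a false negative (a phishing URL let through) is usually costlier than a false positive, which is why several reviewed studies report recall or false negative rate alongside accuracy.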
A confusion matrix arranges these four values to show the performance of the classification model.

7. OBSERVATIONS

Phishing attacks are constantly evolving, and the cyber world is often hit by new types of attacks. Hence no single detection approach or algorithm can be tagged as the best one that always gives exact results. The literature survey shows that Random Forest gives better results in most scenarios. However, the performance of each algorithm varies depending on the dataset used, the train-test split ratio, the feature selection techniques applied, and so on. Researchers prefer to create machine learning models that perform phishing detection with the best values for the evaluation parameters and the least training time. Therefore, future work should focus on these aspects of phishing detection.

8. CONCLUSION
Due to the greater demand for the security of personal, financial, and professional data in this digital era, phishing detection has become a highly critical area of research. URL-based analysis is one of the ways to enhance both detection speed and detection accuracy. By extracting features from a given URL and applying feature selection and dimensionality reduction techniques, models are refined by eliminating unnecessary data and focusing on the most informative features. Numerous machine learning algorithms have shown strong performance on phishing URL classification, including Random Forest, XGBoost, and Support Vector Machines. In this paper, we reviewed phishing detection, focusing on different methodologies and their performance. The review provides a good basis for future researchers taking their next step in improving phishing detection systems.

REFERENCES
[1] 2023 Internet Crime Report, FBI. Retrieved from: https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf
[2] Dr. Nitin N. Sakhare, Jyoti L. Bangare, Dr. Radhika G. Purandare, Disha S. Wankhede, Pooja Dehankar, “Phishing Website Detection Using Advanced Machine Learning Techniques”, International Journal of Intelligent Systems and Applications in Engineering, 2024.
[3] Sucharitha, B., Chandini, B., Kumar, D. S., Surendra, M., & Kumar, G. K. (2024). Detecting phishing websites using machine learning. IJARCCE, 13(4). https://doi.org/10.17148/ijarcce.2024.134145
[4] Machikuri Santoshi Kumari, Chiguru Keerthi Priya, Gondhi Bhavya Haridas Neha, Monisha Awasthi, Surendra Tripathi, “Viable Detection of URL Phishing using Machine Learning Approach”, 15th International Conference on Materials Processing and Characterization (ICMPC 2023).
[5] A. A. Orunsolu, A. S. Sodiya, and A. T. Akinwale, “A predictive model for phishing detection,” Journal of King Saud University – Computer and Information Sciences, vol. 34, no. 2, pp. 232–247, 2022.
[6] Korkmaz, M., Sahingoz, O. K., & Diri, B. (2020). Detection of Phishing Websites by Using Machine Learning-Based URL Analysis. Presented at the 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 1-3, 2020, IIT Kharagpur, India. IEEE.
[7] Mohammad Nazmul Alam, Dhiman Sarma et al., “Phishing attacks detection using machine learning approach,” 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.
[8] Junaid Rashid, “Phishing Detection Using Machine Learning Technique”, First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), 2020.
[9] Vahid Shahrivari, Mohammad Mahdi Darabi, Mohammad Izadi, “Phishing Detection Using Machine Learning Techniques”, arXiv preprint arXiv:2009.11116, 2020. Retrieved from arXiv.
[10] Jitendra Kumar, A. Santhanavijayan, B. Janet, Balaji Rajendran, and Bindhumadhava BS, “Phishing website classification and detection using machine learning,” International Conference on Computer Communication and Informatics (ICCCI), 2020.
[11] Arun Kulkarni, Leonard L. Brown, “Phishing Websites Detection using Machine Learning”, IJACSA International Journal of Advanced Computer Science and Applications, Vol. 10, No. 7, 2019.
[12] Rishikesh Mahajan and Irfan Siddavatam, “Phishing website detection using machine learning algorithms,” International Journal of Computer Applications (0975-8887), vol. 181, no. 23, 2018.
