Phishing Final
Phishing attacks involve cybercriminals tricking users into revealing sensitive information like passwords
and bank details. This paper explores the use of Machine Learning (ML) to detect phishing URLs by
analyzing various features of URLs through lexical analysis. It evaluates the performance of several ML
algorithms, including Decision Tree (DT), Gradient Boosting Classifier (GB), Random Forest (RF), Support
Vector Machines (SVM), and CatBoost Classifier (CB), based on their detection accuracy.
The study finds that ML models can effectively classify URLs as phishing or legitimate, with different
algorithms showing varying levels of performance. The paper highlights how feature extraction from
URL structures plays a critical role in improving phishing detection accuracy.
In conclusion, ML offers a proactive approach to identifying phishing websites. Among the algorithms
compared, the Gradient Boosting classifier achieved the highest accuracy (97.4%).
Introduction
Phishing attacks are a major security threat, with attackers creating fake websites that resemble legitimate
ones to steal sensitive information, such as bank account credentials. Traditional methods like
blacklisting URLs and IP addresses are limited, as attackers can bypass them using techniques like
URL obfuscation and fast-flux. Heuristic-based methods can detect zero-hour phishing attacks but
have a high false positive rate. To improve detection, machine learning techniques are being used, as
they can analyze features of both legitimate and phishing URLs to more accurately identify phishing
websites, including those not yet recognized.
Problem statement
The project on "Phishing URL Detection Using Machine Learning" aims to address the escalating
threat of phishing attacks by developing an intelligent system. The challenge lies in the dynamic
nature of phishing techniques, requiring a machine learning model capable of accurately detecting
malicious URLs in real time. Key objectives include effective feature engineering, selection of
appropriate algorithms for model training, achieving real-time processing capabilities, ensuring
generalization across diverse phishing attacks, and implementing robust data security measures. The
project ultimately seeks to contribute to online security by providing a proactive and adaptive
solution to identify and mitigate phishing threats.
Motivation
The motivation for using machine learning (ML) in phishing URL detection arises from the limitations of
traditional detection methods and the increasing sophistication of phishing attacks. Traditional approaches
like blacklisting URLs or IP addresses are easily bypassed by attackers using techniques such as URL
obfuscation and fast-flux, rendering them ineffective against new, unknown phishing websites
(zero-hour attacks). Heuristic-based methods, while capable of detecting some phishing attempts, often suffer
from high false positive rates, making them unreliable. Machine learning, however, can analyze large datasets of
URLs and identify complex patterns that distinguish legitimate sites from phishing ones, enabling real-time
detection with greater accuracy. Additionally, ML models can continuously learn from new data, adapting to
evolving phishing tactics and automating the detection process, making it a scalable solution to combat phishing
threats.
Technical specifications
Hardware requirements
• RAM: 4 GB
• Storage: 128 GB
• Processor: Intel Core i3 or above
Software requirements
• Data collection
• Data preprocessing
• Feature extraction
• Model training
• Prediction using various algorithms (Gradient Boosting Tree classifier,
Decision Tree, Random Forest)
• Evaluating the model
• Result
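The feature extraction step listed above can be illustrated with a minimal sketch of lexical analysis. The report does not list the exact features it used, so the feature names, the keyword list SUSPICIOUS_WORDS, and the function extract_lexical_features below are assumptions chosen for illustration, not the project's actual feature set.

```python
import re
from urllib.parse import urlparse

# Assumed keyword list for illustration; not taken from the report.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")


def extract_lexical_features(url: str) -> dict:
    """Return simple lexical features computed from the URL string alone."""
    # urlparse only fills in the host when a scheme is present,
    # so bare hosts such as "example.com/login" get a default one.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # A dotted-quad host instead of a domain name (pattern check only).
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_https": int(parsed.scheme == "https"),
        "num_suspicious_words": sum(w in lowered for w in SUSPICIOUS_WORDS),
    }


if __name__ == "__main__":
    # Hypothetical URLs used only to show the output shape.
    for example in ("https://www.example.com/index.html",
                    "http://192.168.0.1/secure-login@example"):
        print(example, extract_lexical_features(example))
```

Each URL becomes a fixed-length numeric vector, which is the form that classifiers such as Decision Tree, Random Forest, or Gradient Boosting expect as input. Because the features come from the URL string alone, no page content has to be downloaded before classification.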
Result
Conclusion and Future scope
• In this project, we implemented seven Machine Learning algorithms, including Decision Tree,
Gradient Boosting, Logistic Regression, Random Forest, Support Vector Machine, and CatBoost.
These algorithms are among the most widely used in phishing URL classification. We adopted a lexical
analysis approach to extract URL features and calculated the accuracy of each algorithm. We
presented the results in a table that compares the algorithms by accuracy score. The Gradient
Boosting classifier achieved the best accuracy score, 97.4%. In future work, we aim to build on
the results presented in this document by developing a model capable of detecting
simple URLs and those based on Machine Learning. We will also consider the following:
• Introducing URL HTML encoding and URL hit approaches to extract URL features
• Using other performance metrics, such as specificity and the confusion matrix.
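The additional metrics named above can be derived from the same predictions used to compute accuracy. The following is a minimal sketch, assuming labels are encoded as 1 for phishing and 0 for legitimate (the report does not state its encoding). The function names and the toy labels are illustrative only and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, TN, FN), treating `positive` as the phishing label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing URL correctly flagged
            else:
                fp += 1  # legitimate URL wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing URL missed
            else:
                tn += 1  # legitimate URL correctly passed
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all URLs classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def specificity(tp, fp, tn, fn):
    """Fraction of legitimate URLs correctly passed: TN / (TN + FP)."""
    negatives = tn + fp
    return tn / negatives if negatives else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only (assumed encoding: 1 = phishing).
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
    counts = confusion_counts(y_true, y_pred)
    print("TP, FP, TN, FN:", counts)
    print("accuracy:", accuracy(*counts))
    print("specificity:", specificity(*counts))
```

Specificity complements accuracy because it isolates how often legitimate URLs are wrongly flagged, which is the false positive problem noted for heuristic-based methods.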