Paper 2
1 Introduction
Services are being rapidly digitized, and dependence on internet-based platforms
has increased significantly, bringing with it a rise in cyber threats. Among
these threats, phishing attacks have become one of the most successful and
damaging forms of cybercrime. Phishing websites are malicious sites that
imitate an authentic website in order to trick users into sharing sensitive
information such as login credentials, card numbers, and personal
identification details. These attacks are typically hard to detect manually
because phishing sites are easy to design so that they look legitimate, and
they evolve quickly to bypass traditional security mechanisms.
⋆ Supported by Chaithanya Bharathi Institute of Technology
2 P Abhitej et al.
The key problem with phishing is the constantly changing, adversarial character
of these attacks. Conventional rule-based systems and blacklists are static and
reactive, so they detect new attacks late and leave users exposed in the
meantime. As phishing techniques keep changing, solutions that identify
phishing attempts automatically and in real time are required.
Machine learning (ML) is well suited to this task. ML algorithms can analyze
several categories of website features, such as URL structure, metadata, and
traffic patterns, and use them to classify a site as legitimate or phishing.
Unlike traditional approaches, ML models learn to detect signals indicative of
phishing from data and keep improving as more data becomes available.
Integrating machine learning into phishing detection therefore offers an
automatic, proactive, and scalable approach to threat identification, which
addresses the late detection that limits traditional methods. Such models can
also be integrated into web applications with frameworks like Flask, so that
phishing detection is available to end users in real time. This research
presents a complete ML-based pipeline for phishing website detection, from data
pre-processing to model training and deployment, with the aim of providing a
robust and practical cybersecurity solution.
2 Literature Review
Deep learning and machine learning techniques have become the main approaches
to phishing detection in recent years as a means of countering evolving cyber
threats. In [1], Sahingoz et al. propose DEPHIDES, a system that uses deep
learning models such as ANN, CNN, RNN, BiRNN, and attention networks to
classify URLs. It reaches an accuracy of 98.74% with CNN, showing that neural
networks are a feasible way to classify malicious links. Like Karim et al.,
we also developed a hybrid model using machine learning algorithms such as
Decision Tree, Logistic Regression, Random Forest, and SVM. In [2], Karim et
al. proposed an ensemble model (LSD: LR + SVC + DT) with soft and hard voting,
canopy feature selection, and hyperparameter tuning, and showed that it
outperforms the individual classifiers in phishing detection. Prabakaran et al.
also highlighted the shortcomings of blacklist-based methods and proposed a
deep learning framework that combines a convolutional neural network with a
Variational Autoencoder (VAE), which encodes its input as a latent vector and
reconstructs it. The model extracted features from raw URLs automatically, and
it achieved an accuracy of 0.9745
with a response time of 1.9 seconds, indicating that VAEs have good potential
for strengthening model generalization [3]. In a similar approach, [4] compared
three deep learning models for phishing URL detection (LSTM, CNN, and a hybrid
LSTM-CNN). The CNN reached an accuracy of 99.2%, and the results show that
these models are effective at mining deep features from text. Finally, Mughaid
et al. tackled phishing emails and developed a deep learning system using
boosted decision trees. Using feature selection and boosting, they tested
across three datasets and attained accuracy of up to 100%, which shows how
important feature selection and boosting are for text-based phishing detection
[5].
Together, these studies demonstrate the use of hybrid and deep learning models
to improve phishing detection accuracy, speed, and flexibility.
3 Methodology
3.1 Approach
Building on this prior work, this project implements a complete phishing
detection system that uses machine learning techniques, URL-based feature
extraction, and adaptive learning strategies to classify phishing websites with
high precision and recall. Ensemble learning is used so that the models can
learn intricate patterns in structured URL data. The approach rests on the
observation that phishing websites tend to leave subtle but detectable traces
in their URLs, which can be systematically collected and processed to train
robust classification models.
The first step in building the phishing detection system is data acquisition
and preprocessing. The dataset is a large collection of URLs, each labeled as
either phishing or legitimate. It was collected from publicly available
repositories such as PhishTank and OpenPhish, or obtained by crawling and
analyzing domains. The URLs are then preprocessed: unnecessary parts are
removed, the format is normalized, and the result is encoded in a form suitable
for feature extraction.
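The preprocessing step can be illustrated with a minimal sketch. The paper does not specify which URL parts are treated as unnecessary, so the choices below (lowercasing the scheme and host, dropping the fragment and default ports, assuming `http` when no scheme is given) are assumptions, and the function names `normalize_url` and `deduplicate` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(raw: str) -> str:
    """Return a canonical form of a raw URL string (illustrative rules only)."""
    url = raw.strip()
    if "://" not in url:
        url = "http://" + url  # assumption: scheme-less entries are treated as http
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port  # raises ValueError if the port is not numeric
    # Keep the port only when it differs from the scheme's default.
    # Note: any user:password@ prefix is dropped here; a real pipeline
    # might prefer to keep it as a phishing signal.
    if port and DEFAULT_PORTS.get(scheme) != port:
        netloc = f"{host}:{port}"
    else:
        netloc = host
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate(labeled_urls):
    """Normalize (url, label) pairs and keep the first label seen per URL."""
    seen = {}
    for raw, label in labeled_urls:
        seen.setdefault(normalize_url(raw), label)
    return list(seen.items())


rows = [
    ("HTTP://Example.com/a#x", 1),
    ("http://example.com/a", 1),
    ("example.org", 0),
]
cleaned = deduplicate(rows)
```

Normalizing before deduplication prevents the same URL from appearing in both the training and test splits under slightly different spellings.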
The system depends heavily on the feature engineering phase. Because the model
is based mainly on the URL, features are extracted from its structural
components. These include the length of the URL, whether an IP address is used
in place of a domain name, the number of special characters such as hyphens
or slashes, the use of deceptive words like ’login’, ’secure’, or ’bank’, and
the presence of hexadecimal characters. In advanced settings we also consider
additional lexical and domain-specific features, such as whether the domain
appears on a whitelist or blacklist, its Alexa rank, and the domain age. These
properties are useful because previous research and empirical studies have
shown a correlation between them and the likelihood that a URL is a phishing
attempt.
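A minimal sketch of the lexical features listed above is given below. The feature names, the `extract_features` function, and the interpretation of "hexadecimal" as percent-encoded bytes (for example `%7E`) are assumptions made for illustration. The domain-level features (whitelist or blacklist membership, Alexa rank, domain age) need external lookups and are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# The three deceptive words named in the text; a real list would be longer.
SUSPICIOUS_WORDS = ("login", "secure", "bank")
HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def host_is_ip(host: str) -> bool:
    """True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Map one URL string to a dictionary of numeric lexical features."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "has_ip_host": int(host_is_ip(host)),
        "num_hyphens": url.count("-"),
        "num_slashes": url.count("/"),
        "num_suspicious_words": sum(word in lowered for word in SUSPICIOUS_WORDS),
        "has_hex_encoding": int(bool(HEX_ESCAPE.search(url))),
    }


example = extract_features("http://192.168.0.1/secure-login/%7Euser")
```

Each URL becomes one fixed-length numeric row, which is the form that tree-based classifiers expect.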
In the machine learning phase, several classifiers are trained to identify
phishing URLs. We evaluated decision trees, random forests, XGBoost, logistic
regression, support vector machines (SVM), and k-nearest neighbors (KNN). Among
these, the ensemble methods, namely Random Forest and XGBoost, performed best.
Ensemble learning combines the predictions of several base estimators to
improve accuracy. XGBoost trains its trees sequentially, with each stage
optimized to reduce the classification error left by the previous ones, while
Random Forest aggregates many decision trees to reduce overfitting and variance.
Besides their accuracy, these models were chosen because they are
interpretable and computationally efficient.
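The aggregation idea behind Random Forest can be illustrated with a deliberately simplified sketch: bagging of one-level decision trees (stumps) with majority voting. This is not the Random Forest or XGBoost implementation used in the study. A real Random Forest grows deeper trees and also samples features at each split. The toy data (URL length and hyphen count) and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Pick the single-feature threshold rule with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for label_above in (0, 1):
                preds = [label_above if row[j] > t else 1 - label_above for row in X]
                errors = sum(p != target for p, target in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, label_above)
    _, j, t, label_above = best
    return j, t, label_above


def predict_stump(stump, row):
    j, t, label_above = stump
    return label_above if row[j] > t else 1 - label_above


def fit_bagged_stumps(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_ensemble(ensemble, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy rows: [url_length, num_hyphens]; label 1 = phishing, 0 = legitimate.
X = [[20, 0], [25, 0], [30, 1], [22, 0], [80, 3], [95, 4], [70, 2], [110, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_bagged_stumps(X, y)
```

Because each stump sees a different bootstrap sample, individual mistakes tend to be outvoted, which is the variance-reduction effect described above. Boosting differs in that each new tree is fitted to the errors of the trees before it.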
Lastly, the models are evaluated on several performance metrics, including
accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC),
to check that the system performs well in real-world cases where the cost of
false positives and false negatives can be high. The trained model is
integrated into a Flask-based web application so that the phishing detection
system is accessible through a simple, user-friendly interface. As a
lightweight Python web framework, Flask supports fast development and
deployment. Users submit a URL in the web app. The Flask server preprocesses
the URL, extracts the required features in real time, and passes them to the
model. The model classifies the URL as phishing or legitimate, and the result
is returned to the user interface.
Where real data was sparse, synthetic phishing URLs were generated based
on common evasion patterns to augment the dataset and improve generalizabil-
ity.
3.4 Analysis
Quantitative Analysis To evaluate performance, standard metrics are used:
Accuracy, Precision, Recall, F1 Score: These metrics evaluate the model’s ability
to correctly classify phishing and legitimate URLs.
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}$$
ROC-AUC Score: Evaluates the trade-off between true positive and false positive
rates across different threshold values.
Confusion Matrix: Used to visualize the number of correct and incorrect predic-
tions across both classes.
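As an illustration of the ROC-AUC score, the sketch below computes it from its rank interpretation: the probability that a randomly chosen phishing URL receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name `roc_auc` and the sample scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Labels: 1 = phishing, 0 = legitimate; scores are predicted phishing probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive-negative pairs are ordered correctly, so the score is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.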
Error Metrics
Mean Absolute Error (MAE): Though mainly used in regression, MAE is used
here for model interpretability in probabilistic phishing scoring.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{3}$$
R-squared Score:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{5}$$
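Both error metrics can be computed directly from their definitions. The sketch below assumes that y_i is the true label (1 for phishing, 0 for legitimate) and ŷ_i is the predicted phishing probability. The sample values are hypothetical and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute gap between the label and the predicted score."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all labels are identical")
    return 1 - ss_res / ss_tot


labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.7, 0.1]
mae = mean_absolute_error(labels, probabilities)
r2 = r2_score(labels, probabilities)
```

For these four samples the MAE is 0.175 and R-squared is 0.85. Lower MAE and higher R-squared both indicate scores that sit closer to the true labels.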
– The model correctly identified 933 out of 976 negative cases (True Nega-
tives).
– It correctly identified 1221 out of 1235 positive cases (True Positives).
– There were 43 false positives and only 14 false negatives, reflecting
strong predictive capability.
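The confusion-matrix counts above are enough to recompute the headline metrics. The sketch below applies the standard definitions, with the F1 score following Eq. (1). The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted phishing that is phishing
    recall = tp / (tp + fn)     # share of actual phishing that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts reported above: 1221 TP, 933 TN, 43 FP, 14 FN.
metrics = classification_metrics(tp=1221, tn=933, fp=43, fn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is about 0.974, precision about 0.966, recall about 0.989, and F1 about 0.977. Recall is higher than precision because there are fewer false negatives (14) than false positives (43).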
5 Conclusion
The outcomes of this research show that machine learning models, specifically
ensemble strategies such as GBC, are effective methods for fighting phishing
threats. Further enhancements could include real-time integration of URL
analysis, the use of deep learning for pattern recognition, and deployment as a
browser extension or a more general API service. This work provides a solid
base for intelligent, data-driven cybersecurity solutions.
References
20. Kumar, "A Novel Approach to Detect Phishing Attacks Using Hybrid Models,"
Int. J. Inf. Manag., vol. 63, pp. 102–115, Apr. 2023.
21. Zaimi, "An Intelligent Mechanism to Detect Phishing URLs," Future Gener. Com-
put. Syst., vol. 134, pp. 789–800, Jan. 2024.