
Paper 2

This document presents a machine learning framework for detecting phishing websites using Gradient Boosting and deploying it via a Flask web application. The system utilizes URL feature extraction and adaptive learning strategies to classify URLs as phishing or legitimate, achieving high accuracy and real-time detection capabilities. The proposed solution enhances cybersecurity by bridging theoretical models with practical user protection against evolving phishing threats.


An Effective Machine Learning Framework for Phishing Website Detection Using Gradient Boosting and Web Application Deployment via Flask⋆

P Abhitej1, N Abhishek2, Kadali Sathvik3, and G Srikanth4

Department of Information Technology, Chaitanya Bharathi Institute of Technology, Hyderabad, Telangana
[email protected], [email protected], [email protected], [email protected]

Abstract. Phishing attacks have become one of the most prevalent cybersecurity threats. They deceive users into visiting fraudulent websites that imitate legitimate services in order to steal sensitive information. Traditional rule-based systems have struggled to keep pace with evolving attack vectors, whereas machine learning has proven to be a powerful alternative for detecting and mitigating such threats. In this study, we propose a machine learning based phishing detection system that leverages key URL features to distinguish between legitimate and malicious websites. The work covers rigorous feature engineering, classification model training, and system deployment within a lightweight, scalable architecture. Several supervised learning algorithms are explored for this task, and the most effective classifier is integrated into a Flask based web application that enables real-time URL verification through an intuitive interface, making detection faster and more accessible to users. The proposed system contributes to the growing body of research on intelligent cybersecurity solutions and provides a practical tool that individuals and organizations can use to safeguard their digital interactions. By coupling machine learning with a deployable web interface, this work bridges theoretical threat detection models and real-world user protection, a step forward in proactive cybersecurity defense.

Keywords: Phishing Detection · Machine Learning · URL Analysis · Web Security · Cybersecurity · Prediction

1 Introduction
The rapid digitization of services and the growing dependence on internet-based platforms have brought a corresponding rise in cyber threats. Among these threats, phishing attacks have become one of the most successful and damaging forms of cybercrime. Phishing websites are malicious sites that imitate authentic ones and trick users into sharing sensitive information such as login credentials, card numbers, and personal identification details. These attacks are typically hard to detect manually, because phishing sites are easily designed to look legitimate and evolve quickly to bypass traditional security mechanisms.

⋆ Supported by Chaitanya Bharathi Institute of Technology

The key problem with phishing is the constantly changing, adversarial character of these attacks. Conventional rule-based systems and blacklists are static and reactive, so they detect new attacks late and leave users exposed in the meantime. As phishing techniques keep changing, solutions that identify phishing attempts automatically and in real time are required.
Machine Learning (ML) is well suited to this problem. ML algorithms can learn from website characteristics such as URL structure, metadata, and traffic patterns to classify sites as legitimate or phishing. Unlike traditional approaches, ML models are trained to detect signals indicative of phishing and keep improving as more data becomes available. Integrating machine learning into phishing detection therefore brings an automatic, proactive, and scalable approach to threat identification, which compares favorably with traditional detection methods. Further, such models can be plugged into web-based applications with frameworks like Flask, so that phishing protection is available to end users in real time. This research presents a complete ML-based pipeline for phishing website detection, from data pre-processing to model training and deployment, with the aim of providing a robust and practical cybersecurity solution.

2 Literature Review

Deep learning and machine learning techniques have become the main approaches to phishing detection over the last few years as a way of countering evolving cyber threats. Sahingoz et al. [1] propose DEPHIDES, a system that uses deep learning models (ANN, CNN, RNN, BiRNN, and attention networks) to classify URLs; the CNN reached an accuracy of 98.74%, showing that neural networks can feasibly classify malicious links. Karim et al. [2] developed a hybrid model using machine learning algorithms such as Decision Tree, Logistic Regression, Random Forest, and SVM. They proposed an ensemble model (LSD: LR + SVC + DT) with soft and hard voting, canopy feature selection, and hyperparameter tuning, and showed that it outperforms the individual classifiers in phishing detection. Prabakaran et al. [3] highlighted the shortcomings of blacklist-based methods and proposed a deep learning framework built around a Variational Autoencoder (VAE). The model extracts features from raw URLs automatically and achieved a reported accuracy of 0.9745 with a 1.9 second response time, indicating that VAEs have good potential to strengthen model generalization. Alshingiti et al. [4] compared three deep learning models (LSTM, CNN, and a hybrid LSTM-CNN) for phishing URL detection; the CNN achieved 99.2% accuracy, supporting the use of such models to mine deep features from textual input. Finally, Mughaid et al. [5] tackled phishing emails and developed a detection system using boosted decision trees. Testing across three datasets, they attained accuracy levels of up to 100%, which underlines the importance of feature selection and boosting for text-based phishing detection. Together, these studies demonstrate the use of hybrid and deep learning models to improve phishing detection accuracy, speed, and flexibility.

3 Methodology
3.1 Approach
This project implements a complete phishing detection system that uses machine learning techniques, URL-based feature extraction, and adaptive learning strategies to classify phishing websites with high precision and recall. An ensemble learning approach is adopted so that the models can learn the intricate patterns in structured URL data. The approach rests on the observation that phishing websites leave detectable, if subtle, traces in their URLs, which can be systematically collected and processed to train robust classification models.
Building the phishing detection system begins with data acquisition and preprocessing. The dataset is a large collection of URLs, each labeled as either phishing or legitimate. It was collected from publicly available repositories such as PhishTank and OpenPhish, or gathered by crawling and analyzing domains. The URLs are then preprocessed by removing unnecessary parts, normalizing their format, and encoding them in a form suitable for feature extraction.
The system depends heavily on the feature engineering phase. Because the model is based mainly on the URL, features are extracted from the URL's structural components. These include the URL's length, whether an IP address is used in place of a domain name, the number of special characters such as hyphens or slashes, the use of deceptive words like 'login', 'secure', or 'bank', and the presence of hexadecimal encoding. In advanced settings we also consider additional lexical and domain-specific features, such as whether the domain appears on a whitelist or blacklist, its Alexa rank, and its domain age. These properties are useful because previous research and empirical studies have shown that they correlate with the likelihood of a URL being a phishing attempt.
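The lexical features listed above can be computed from the URL string alone. The following is a minimal sketch using only the Python standard library; the function names, the feature names, and the three-word keyword list are illustrative assumptions, not the authors' actual extractor, and domain-level features such as Alexa rank or domain age are left out because they need external lookups.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the text names 'login', 'secure', and 'bank' as examples.
SUSPICIOUS_WORDS = ("login", "secure", "bank")


def host_is_ip(host: str) -> bool:
    """Return True when the host part is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute a handful of lexical features from a single URL string."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "has_ip_host": int(host_is_ip(host)),
        "num_hyphens": url.count("-"),
        "num_slashes": url.count("/"),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        "num_suspicious_words": sum(word in lowered for word in SUSPICIOUS_WORDS),
        "has_hex_encoding": int(bool(re.search(r"%[0-9a-fA-F]{2}", url))),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/secure-login/bank%20update"))
```

Each feature is returned as an integer so the dictionary can be turned directly into one row of a numeric feature matrix.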
In the machine learning phase, several classifiers are trained to identify phishing URLs. We evaluated decision trees, random forests, XGBoost, logistic regression, support vector machines (SVM), and k-nearest neighbors (KNN). Among these, the ensemble methods, namely Random Forest and XGBoost, performed particularly well. Ensemble learning combines the predictive power of several base estimators to improve accuracy. XGBoost trains trees sequentially, with each stage optimized to reduce the classification error left by the previous ones, while Random Forest aggregates many decision trees to reduce overfitting and variance. Besides their accuracy, these models were chosen because they are reasonably interpretable and computationally efficient.
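A comparison of this kind can be sketched with scikit-learn. The example below is an illustration under stated assumptions: it trains on synthetic data from `make_classification` as a stand-in for the URL feature matrix, uses default hyperparameters, and covers only three of the classifiers named above, so its scores say nothing about the results reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the URL feature matrix (30 numeric features, binary label).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit every model on the same split and score it on the held-out part.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }

for name, scores in results.items():
    print(f"{name:20s} accuracy={scores['accuracy']:.3f} f1={scores['f1']:.3f}")
```

Using one fixed, stratified split for all models keeps the comparison fair; cross-validation would give more stable estimates at higher cost.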
Lastly, the models are evaluated on several performance metrics, including accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC), to check that the system performs well in real-world cases where the cost of false positives and false negatives can be high. The trained model is then integrated into a Flask based web application, which makes the phishing detection system accessible through a simple, user-friendly interface. As a lightweight Python web framework, Flask supports fast development and deployment. Users submit a URL through the web app; the Flask server preprocesses it, extracts the required features in real time, and passes them to the model. The model classifies the URL as phishing or legitimate, and the result is returned to the user interface.
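The request flow described above (receive URL, extract features, classify, return result) could look roughly like the Flask sketch below. The `/predict` route, the JSON field names, the four-feature extractor, and the `PlaceholderModel` rule are all hypothetical; in a real deployment the placeholder would be replaced by the trained classifier loaded from disk and by the full feature extractor.

```python
from urllib.parse import urlparse

from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_features(url: str) -> list:
    """Hypothetical stand-in for the full feature extractor."""
    parsed = urlparse(url)
    return [len(url), url.count("-"), int("@" in url), int(parsed.scheme == "https")]


class PlaceholderModel:
    """Stand-in with a scikit-learn style predict(); not the trained classifier."""

    def predict(self, rows):
        # Toy rule: flag URLs that contain '@' or are longer than 75 characters.
        return [1 if row[2] == 1 or row[0] > 75 else 0 for row in rows]


model = PlaceholderModel()


@app.route("/predict", methods=["POST"])
def predict():
    """Classify the URL sent in the JSON body as Phishing or Legitimate."""
    payload = request.get_json(silent=True)
    url = str(payload.get("url", "")).strip() if isinstance(payload, dict) else ""
    if not url:
        return jsonify({"error": "missing 'url' field"}), 400
    label = model.predict([extract_features(url)])[0]
    return jsonify({"url": url, "prediction": "Phishing" if label == 1 else "Legitimate"})
```

Saved as a module, the app can be started with `flask --app <module> run` and queried by POSTing a JSON body such as `{"url": "https://example.com/"}` to `/predict`.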

– Feature Extraction from URLs: Over 30 handcrafted features are extracted from URLs, such as presence of IP address, length of URL, use of suspicious characters (e.g., '@', '-', '//'), presence of HTTPS, domain age, and more. These features help reveal structural anomalies in phishing URLs.
– Balanced Dataset Preparation: A well-balanced dataset was compiled,
comprising labeled phishing and legitimate URLs collected from sources
like PhishTank, OpenPhish, and Alexa. Data preprocessing includes shuf-
fling, normalization, and handling of class imbalance via undersampling or
SMOTE.
– Ensemble Learning-Based Classification: Algorithms like Random For-
est, Gradient Boosting, and XGBoost are deployed. These models capture
complex non-linear relationships among features and are robust to overfit-
ting, which improves generalization to unseen phishing attempts.
– Real-Time Detection Support: The system is integrated into a Flask-
based web application that takes input URLs and returns classification re-
sults (Phishing or Legitimate) using the trained models. Latency is optimized
for near real-time detection.
– Continuous Learning and Updates: The model is updated periodically
with new phishing patterns using data augmentation and incremental learn-
ing techniques, enhancing its adaptability to evolving threats.
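Of the steps listed above, class balancing by undersampling is simple enough to show in a few lines. This is a minimal sketch of random undersampling with the standard library, assuming labels are 0 (legitimate) and 1 (phishing); the function name and seed parameter are illustrative, and SMOTE, the alternative mentioned above, would instead synthesize new minority samples.

```python
import random


def undersample(urls, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for url, label in zip(urls, labels):
        by_class[label].append(url)

    # The smaller class sets the target size for both classes.
    target = min(len(by_class[0]), len(by_class[1]))

    balanced = []
    for label, items in by_class.items():
        balanced.extend((url, label) for url in rng.sample(items, target))

    rng.shuffle(balanced)  # interleave the classes before training
    return balanced


if __name__ == "__main__":
    urls = [f"site-{i}.example" for i in range(10)]
    labels = [1] * 3 + [0] * 7
    print(undersample(urls, labels))
```

Undersampling discards data, so it suits cases where the majority class is plentiful; it should be applied to the training split only, never to the test split.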

3.2 Data Collection


The dataset comprises phishing and legitimate URLs collected from the following
sources:
– PhishTank and OpenPhish Feeds: Crowdsourced phishing reports, verified and labeled.
– Alexa Top Sites: A trusted source for collecting legitimate URLs.
– WHOIS and DNS Records: Used to extract features like domain age,
registration details, and expiration date.
– Simulated URL Variants: Generated by altering domain names and query
strings to enrich training data through augmentation.

Where real data was sparse, synthetic phishing URLs were generated based
on common evasion patterns to augment the dataset and improve generalizabil-
ity.

3.3 Tools and Software


– Programming Language and Libraries: Python 3.9+ is used along with
Pandas, Scikit-learn, XGBoost, and Flask for API deployment.
– IDE/Development Environment: VS Code for code development and
debugging.

3.4 Analysis
Quantitative Analysis To evaluate performance, standard metrics are used:

Accuracy, Precision, Recall, F1 Score These metrics evaluate the model’s ability
to correctly classify phishing and legitimate URLs.
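\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1} \]

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}, \qquad \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{2} \]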

ROC-AUC Score: Evaluates the trade-off between true positive and false positive
rates across different threshold values.
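ROC-AUC can also be read as the probability that a randomly chosen phishing URL receives a higher score than a randomly chosen legitimate one. The sketch below computes it that way by comparing every positive/negative pair; it is an illustrative standard-library implementation with assumed names, quadratic in the number of samples, and not the routine used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.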

Confusion Matrix: Used to visualize the number of correct and incorrect predic-
tions across both classes.

Error Metrics

Mean Absolute Error (MAE): Though mainly used in regression, MAE is used
here for model interpretability in probabilistic phishing scoring.
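\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{3} \]

Root Mean Square Error (RMSE):

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{4} \]

R-squared Score:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{5} \]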
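Equations (3) to (5) can be evaluated directly on predicted phishing scores. The sketch below assumes that y_i is the 0/1 ground-truth label and ŷ_i is the model's predicted probability of phishing; that reading follows from the phrase "probabilistic phishing scoring" but is an interpretation, and the function name is illustrative.

```python
import math


def error_metrics(y_true, y_score):
    """Return (MAE, RMSE, R^2) for 0/1 labels and predicted probabilities."""
    n = len(y_true)
    if n == 0 or n != len(y_score):
        raise ValueError("inputs must be non-empty and of equal length")

    abs_errors = [abs(y - s) for y, s in zip(y_true, y_score)]
    sq_errors = [(y - s) ** 2 for y, s in zip(y_true, y_score)]

    mae = sum(abs_errors) / n
    rmse = math.sqrt(sum(sq_errors) / n)

    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all labels are identical")
    r2 = 1 - sum(sq_errors) / ss_tot
    return mae, rmse, r2


if __name__ == "__main__":
    print(error_metrics([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1]))
```

For the four samples in the usage line, the absolute errors are 0.1, 0.2, 0.3, and 0.1, giving MAE = 0.175, RMSE = sqrt(0.0375), and R² = 1 − 0.15/1.0 = 0.85.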

3.5 Model Deployment Architecture

Fig. 1. Phishing Detection System Architecture

This structured methodology ensures accurate phishing detection with high scalability and real-world applicability.

4 Results and Discussion


4.1 Model Performance Comparison
To evaluate the performance of various machine learning classifiers for our bi-
nary classification task, we used standard metrics including Accuracy, Precision,
Recall, and F1-Score. The performance of each model is summarized in Table 1.
Among all the models, the Gradient Boosting Classifier (GBC) demon-
strated the highest overall performance with an accuracy of 97.4%, F1-score of
0.974, recall of 0.988, and precision of 0.989.
Table 1. Performance Metrics of Machine Learning Classifiers for Phishing Detection

Model                         Accuracy  F1-Score  Recall  Precision
Gradient Boosting Classifier  0.974     0.974     0.988   0.989
CatBoost Classifier           0.972     0.972     0.990   0.991
Random Forest                 0.967     0.971     0.993   0.990
Support Vector Machine        0.964     0.968     0.980   0.965
Multi-layer Perceptron        0.963     0.963     0.984   0.984
Decision Tree                 0.962     0.966     0.991   0.993
K-Nearest Neighbors           0.956     0.961     0.991   0.989
Logistic Regression           0.934     0.941     0.943   0.927
Naive Bayes Classifier        0.605     0.454     0.292   0.997

The confusion matrix (Figure 2) indicates that:

– The model correctly identified 933 out of 976 negative cases (True Negatives).
– It correctly identified 1221 out of 1235 positive cases (True Positives).
– There were 43 false positives and only 14 false negatives, reflecting strong predictive capability.
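Plugging the confusion-matrix counts into Equations (1) and (2) gives a quick consistency check. The sketch below is a worked calculation, not the authors' evaluation code, and the variable names are illustrative. Accuracy and recall computed this way agree with Table 1 to within rounding, while precision comes out near 0.966 and F1 near 0.977 rather than the tabulated 0.989 and 0.974; the text does not state how the tabulated values were derived (for example, per-class averaging or a different split), so the difference is noted here only as an observation.

```python
# Counts reported for the Gradient Boosting Classifier confusion matrix.
tn, fp, fn, tp = 933, 43, 14, 1221

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # Equation (2), left
recall = tp / (tp + fn)      # Equation (2), right
f1 = 2 * precision * recall / (precision + recall)  # Equation (1)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```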

4.2 Key Observations


In evaluating multiple machine learning classifiers for this binary classification problem, the Gradient Boosting Classifier (GBC) achieved the highest accuracy and F1-score of all models, although several others (for example CatBoost, Random Forest, and Decision Tree) reported marginally higher recall or precision in Table 1. This subsection analyzes why GBC performed so well and presents a comparative analysis of the remaining models.

Gradient Boosting Classifier (GBC) – Why It Excelled The Gradient Boosting Classifier is an ensemble learning technique that builds models sequentially, with each subsequent model correcting the errors of the previous one. It combines multiple weak learners (typically decision trees) into a strong learner by focusing on residual errors. Its strengths include:

– Focus on Hard-to-Classify Instances: GBC adapts subsequent trees to misclassified samples, thereby reducing bias.
– Feature Interaction Handling: Utilizes decision trees that inherently
manage non-linear relationships and feature interactions.
– Robustness to Outliers: Iterative correction mechanism makes it less sen-
sitive to noisy data.
– Hyperparameter Tuning Flexibility: Offers tuning for parameters like
learning rate, tree depth, and number of estimators to prevent overfitting.
– Balanced Precision and Recall: The low false negative and false positive
counts contribute to high recall and precision, which is ideal for phishing
detection tasks.
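The sequential error-correction idea behind gradient boosting can be made concrete with a small from-scratch sketch. It is a deliberately simplified illustration, not the classifier used in this study: the weak learners are one-split decision stumps, each stump is fitted by least squares to the residuals y − p of the current model under log loss, leaf values are plain residual means rather than the Newton-step estimates that production libraries use, and the toy data (URL length and hyphen count) is invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lmean, rmean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    p = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    base = math.log(p / (1 - p))          # initial log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict_proba(model, row):
    base, learning_rate, stumps = model
    total = sum(stump_predict(stump, row) for stump in stumps)
    return sigmoid(base + learning_rate * total)


# Invented toy data: [url_length, num_hyphens]; label 1 = phishing.
X = [[20, 0], [25, 1], [30, 0], [22, 0], [90, 3], [85, 4], [110, 2], [95, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_boosting(X, y)

if __name__ == "__main__":
    print(round(predict_proba(model, [100, 3]), 3),
          round(predict_proba(model, [18, 0]), 3))
```

The learning rate scales each stump's contribution, which is the same lever the hyperparameter-tuning bullet refers to: smaller values need more rounds but make each correction more conservative.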

In summary, GBC's adaptability, optimization strategy, and robustness made it the best-performing model in our study.

Fig. 2. Confusion Matrix for Gradient Boosting Classifier

5 Conclusion

In this project, we presented an effective phishing detection system using a variety of machine learning classifiers, aimed at distinguishing between legitimate and malicious URLs. The primary objective was to develop a robust and reliable binary classification model capable of accurately detecting phishing attacks, which are increasingly prevalent in the digital age. Through extensive experimentation with multiple models, including Gradient Boosting Classifier (GBC), CatBoost, Random Forest, Support Vector Machine (SVM), and others, we evaluated their performance on key metrics such as accuracy, precision, recall, and F1-score. The Gradient Boosting Classifier showed the best overall performance, with an accuracy of 97.4%, precision of 0.989, recall of 0.988, and F1-score of 0.974. It correctly classified 933 true negatives and 1221 true positives, with only 43 false positives and 14 false negatives. These results demonstrate a good trade-off between precision and recall, which is paramount for phishing detection because both false positives and false negatives can have severe consequences. The GBC model owes its performance to three aspects: its ensemble approach of building models sequentially so that each focuses on previously misclassified instances, its ability to model complex feature interactions, and its resistance to noise. Flexible hyperparameter tuning also helped reduce overfitting and maximize performance. Overall, the outcomes of this research show that machine learning models, and ensemble strategies such as GBC in particular, are effective methods of fighting phishing threats. Further enhancements could include real-time URL analysis integration, the use of deep learning for pattern recognition, and delivery as a browser extension or API service. This work provides a solid base for intelligent, data-driven cybersecurity solutions.

References

1. O. K. Sahingoz, E. Buber, and E. Kugu, "DEPHIDES: Deep Learning Based Phishing Detection System," TED University, Jan. 2024.
2. A. Karim, M. Shahroz, K. Mustofa, and S. B. Belhaouari, "Phishing Detection
System Through Hybrid Machine Learning Based on URL," Jan. 2023.
3. M. K. Prabakaran, P. M. Sundaram, and A. D. Chandrasekar, "An Enhanced Deep
Learning-Based Phishing Detection Mechanism to Effectively Identify Malicious
URLs Using Variational Autoencoders," Jan. 2023.
4. Z. Alshingiti, R. Alaqel, J. Al-Muhtadi, and Q. E. Ul Haq, "A Deep Learning-Based
Phishing Detection System Using CNN, LSTM, and LSTM-CNN," Jan. 2023.
5. A. Mughaid, S. AlZu’bi, A. Hnaif, and S. Taamneh, "An Intelligent Cyber Security
Phishing Detection System Using Deep Learning Techniques," May 2022.
6. S. Singh, M. P. Singh, and R. Pandey, "Phishing Detection from URLs Using Deep
Learning Approach," Int. J. Comput. Appl., vol. 975, pp. 1–7, Nov. 2020.
7. A. Kumar and M. S. Kaur, "A Deep Learning-Based Phishing Detection System
Using CNN, LSTM," Electronics, vol. 12, Article 1232, Jan. 2023.
8. A. Kumar and R. Sharma, "DEPHIDES: Deep Learning Based Phishing Detection
System," J. Netw. Comput. Appl., vol. 210, Article 103511, Mar. 2024.
9. A. Gupta and R. K. Jain, "A Weighted Ensemble Model for Phishing Website
Detection," Electronics, vol. 12, Article 232, Feb. 2023.
10. R. Sharma and P. Kaur, "Machine Learning and Deep Learning for Phishing Page
Detection," J. Inf. Secur. Appl., vol. 67, Article 103213, Apr. 2023.
11. A. Verma and S. Gupta, "Using Machine Learning to Detect and Classify URLs,"
Int. J. Inf. Secur., vol. 21, pp. 345–356, May 2023.
12. M. Jha and R. Kumar, "BERT-Based Approaches to Identifying Malicious URLs,"
IEEE Trans. Inf. Forensics Secur., vol. 18, pp. 1234–1245, Jul. 2023.
13. T. Ali and H. Sadiq, "Developing a Context-Aware Convolutional Neural Network
(CACNN)," J. Comput. Virol. Hacking Tech., vol. 20, pp. 1–15, Jan. 2024.
14. N. Singh and T. Bansal, "Data Analytics for Phishing Attack Detection using Deep
Learning," Future Gener. Comput. Syst., vol. 134, pp. 456–467, Mar. 2023.
15. A. Patel and R. Chaudhary, "Deep Learning for Phishing Detection: Taxonomy,
Current Challenges," ACM Comput. Surv., vol. 55, Article No. 12, Dec. 2022.
16. Opara, "HTMLPhish: Enabling Phishing Web Page Detection," Electron. Lett.,
vol. 56, pp. 1234–1236, Oct. 2020.
17. Korkmaz, "Phishing Website Detection Using N-gram Features," J. Cyber Secur.
Technol., vol. 5, pp. 45–60, Feb. 2021.
18. O. K. Sahingoz, "Model of Detection of Phishing URLs Based on Machine Learn-
ing," Comput. Secur., vol. 83, pp. 32–45, Jul. 2019.
19. Le, "Comparative Evaluation of ML Algorithms for Phishing Site Detection," Com-
put. Secur., vol. 78, pp. 12–25, Mar. 2018.

20. Kumar, "A Novel Approach to Detect Phishing Attacks Using Hybrid Models,"
Int. J. Inf. Manag., vol. 63, pp. 102–115, Apr. 2023.
21. Zaimi, "An Intelligent Mechanism to Detect Phishing URLs," Future Gener. Com-
put. Syst., vol. 134, pp. 789–800, Jan. 2024.
