
Phishing Detection in Email Using Deep Learning

This study explores phishing detection in emails using deep learning techniques, focusing on the classification of phishing URLs through machine learning algorithms like Support Vector Machines, Random Forests, and Decision Trees. The research aims to enhance detection accuracy by evaluating false positive and negative rates, while also addressing the limitations of traditional heuristic and blacklist-based methods. Experimental results demonstrate that machine learning significantly improves phishing detection and provides robust defenses against cyber threats.


Volume-3, Issue3, March 2025 International Journal of Modern Science and Research Technology

ISSN No- 2584-2706

Phishing Detection in Email using Deep Learning
Vanshika Sharma, Aman Singh, Srishti Verma, Dr. Tanu Gupta

Abstract
One of the easiest ways to obtain personal information from careless individuals is through phishing attacks. The phisher's main goal is to acquire important information, such as bank account details, usernames, and passwords. Cyber security experts are currently focusing on creating reliable and powerful methods for detecting phishing websites. By extracting and analyzing several attributes from both legitimate and phishing URLs, this study examines the use of a machine learning approach for phishing URL identification. Phishing websites are classified using Support Vector Machines (SVMs), Random Forests, and Decision Tree algorithms. In addition to successfully identifying phishing URLs, the purpose of this study is to compare the accuracy of the models by evaluating their false positive and false negative rates, aiming to identify the most effective machine learning algorithm. Experimental results show that machine learning-based techniques significantly enhance the detection of phishing websites and provide reliable defenses against online threats.

Keywords: Support Vector Machine (SVM), Random Forest, Decision Tree, Phishing, Machine Learning, URL Detection, Cyber security.

IJMSRT25MAR038 www.ijmsrt.com 242


DOI: https://doi.org/10.5281/zenodo.15110574

Introduction
Phishing has become a major problem for security researchers in recent years, as it is very easy for an attacker to develop fake websites that mimic real ones. Even though experts can recognize a fraudulent website, phishing attempts still affect many people, leading to the loss of personal and financial information. The theft of bank account details is the primary goal of the attacker. Phishing attacks are estimated to cause U.S. companies to lose USD 2 billion annually. According to the third Microsoft Computing Safety Index report, published in February 2014, the global annual impact of phishing is up to USD 5 billion.

Due to a lack of consumer awareness, phishing attempts remain effective. Reducing phishing attacks is challenging because they exploit human weaknesses, but improving defense methods against phishing is still crucial. Traditional phishing detection techniques are based on a blacklist of known phishing URLs and associated Internet Protocol (IP) addresses. To bypass these blacklists, attackers often employ strategies such as domain fluxing (where proxies are dynamically built to host phishing websites) and URL generation algorithms. The inability of blacklist-based detection to identify phishing attacks in real time is a significant disadvantage.

Heuristic detection techniques can identify zero-day phishing attacks by analyzing the distinctive features of phishing websites. However, these techniques are more prone to false positives, as phishing signs are not always present. As a result, many security experts have turned to machine learning technologies to overcome the limitations of heuristic and blacklist-based approaches.

The rest of this paper is structured as follows: Section 2 reviews related research on phishing detection approaches. Section 3 discusses the methodology used in machine learning and deep learning approaches. Section 4 presents experimental results and analysis, and Section 5 outlines future research opportunities.

Fig. 1
Fig. 1 illustrates how deep learning can be applied in various cybersecurity applications, including malware detection (for both PC and Android), phishing detection (including SMS, website, and email phishing), spam detection (including social, email, and SMS spam), and intrusion detection (including anomaly and misuse detection). Deep learning models are used to address specific threats in all these areas.

Related Research
Phishing attacks, which involve social engineering and technical manipulation to steal personal information such as login
credentials, bank account details, and personal data, have become one of the most common and dangerous cybersecurity threats [1], [2]. To exploit individuals and organizations, attackers use a variety of tactics, including spear-phishing, phishing emails, fake websites, and SMS messages. These attacks can cause significant financial losses and damage to brand reputation. Orunsolu et al. [3] argue that phishing detection remains a critical research topic, as hackers constantly refine their methods to evade detection.

Traditional anti-phishing techniques primarily rely on browser security technologies, heuristics, and blacklisting. Major browsers use blocklists of known harmful websites, such as those provided by Google Safe Browsing and PhishTank, to warn users about suspicious websites [4], [5]. However, blacklist-based methods have difficulty identifying new phishing domains and zero-day phishing attacks [6]. Researchers are increasingly turning to deep learning (DL) and machine learning (ML) to overcome these limitations.

Traditional machine learning algorithms such as Naive Bayes (NB), Support Vector Machines (SVM), Decision Trees (DT), and Random Forests (RF) have been used to classify phishing attempts based on features such as email content and email headers [8], [9]. Recent advances in deep learning have significantly improved phishing detection systems. Models such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and hybrid deep learning frameworks are now capable of learning complex patterns from email content and URLs [9], [13]. Al-Dabat [16] compares various classification methods for predicting phishing websites, while Sahingoz et al. [13] discuss machine learning-based phishing detection using URL properties.

Furthermore, feature selection techniques such as information gain, chi-square, and correlation-based analysis are often used to identify the most relevant attributes and improve model performance [18], [19]. To further enhance the performance of deep learning models, optimization techniques like Genetic Algorithms (GA), Particle Swarm Optimization (PSO), and Gray Wolf Optimizer (GWO) have been integrated into phishing detection frameworks [21], [22]. For instance, Ali and Ahmed [20] proposed a hybrid intelligent phishing detection technique that combines feature selection with deep neural networks. Zhou et al. [21] presented an extended deep model to improve phishing awareness in semantic web systems.

Moreover, GWO and its improved versions have been shown to be effective in optimizing deep learning models to increase accuracy and generalization [22], [23], [24]. Despite these advancements, phishing remains a dynamic threat that requires continuous improvements in detection systems.

Problem Statement
Phishing attacks have become a significant cybersecurity threat, where criminals use fake emails to trick victims into revealing personal information. Traditional rule-based and machine learning algorithms have limited capabilities in handling the evolving nature of phishing attempts. To address this, the proposed model leverages neural networks (such as CNN, RNN, LSTM, and transformer-based models) and natural language processing (NLP) to identify malicious patterns and improve the accuracy of phishing detection.

Proposed Methodology
The step-by-step process adopted for implementing the proposed methodology is demonstrated in this section with the help of a flow chart (provided in Figure *).
Each module of the flow chart is further explained with its specific purpose.

(a) Training Dataset Collection
The dataset for this study was obtained from Kaggle.com, a popular website that provides openly accessible datasets. Each email in the dataset is labeled as either safe (ham) or phishing. The data was downloaded and uploaded to Google Colab, a cloud-based development environment that offers sufficient processing power for deep learning operations. The dataset was then split into training and testing subsets to facilitate model training and evaluation.

(b) Email Pre-processing
Raw email texts undergo a comprehensive pre-processing phase to prepare them for training deep learning models. The following steps were performed:

1. Text cleaning: The entire text was processed to maintain consistency.
2. HTML tag removal: HTML tags and special characters were eliminated to remove unnecessary noise.
3. Tokenization: The email content was divided into individual words or tokens.
4. Lemmatization: Words were reduced to their root forms, minimizing vocabulary size.
5. Text normalization: Additional normalization techniques were applied to ensure the data was in the optimal format for learning.

(c) Deep Learning Model Training
A deep learning model was trained to learn discriminative patterns and features that distinguish safe emails from phishing emails using the pre-processed email data. Google Colab was used to run the training process, with GPU support to accelerate computations.

(d) Optimizing the Deep Learning Framework
An optimization approach was applied to tune the hyperparameters and improve model performance. Key variables such as learning rate, batch size, number of epochs, and optimization algorithms were adjusted. The goal of this fine-tuning process was to enhance both the accuracy and capacity of the model.

(e) Feature Extraction from Testing Dataset
The test subset of the email data was fed into the trained deep learning model. The model extracted patterns and features from these emails, which were used to evaluate the model's ability to generalize the knowledge gained during training.

(f) Classification
The deep learning classifier categorized each email in the test dataset as either safe or phishing, based on the extracted features. To assess the model's performance, the classification results were compared with the ground truth labels.

Dataset
The SMS Spam Collection is a dataset of 5,574 SMS messages in English, classified
as either "spam" or "ham" (legitimate). Each row of the dataset contains two columns: V1, which labels the message as either spam or ham, and V2, which contains the raw content of the message. Extracting relevant spam messages from user claims was a difficult and time-consuming task that required searching multiple websites for pertinent spam information.

Experiment and Results
The results of the study's proposed framework are presented in this section. The tests utilized optimization approaches such as GWO (Gray Wolf Optimizer), DE (Differential Evolution), and GWO + DE to assess the performance of the Bi-GRU (Bidirectional Gated Recurrent Unit). Metrics such as accuracy, precision, recall, and the AUC-ROC curve were used to evaluate and compare the results.

Experiment 1: GloVe-Based Spam Email Classification
In this experiment, email text features for spam classification were processed using GloVe embeddings. GloVe was used to convert the data records into numerical vectors before training three models: Random Forest, AdaBoost, and SVM. AdaBoost achieved the second-best performance, with a training accuracy of 0.9589 and a test accuracy of 0.9507. SVM showed slightly lower performance, with a training accuracy of 0.9468 and a test accuracy of 0.9399. The AUC-ROC values for Random Forest in identifying spam emails were 1.00 for training and 0.9578 for testing, demonstrating excellent performance.

Experiment 2: FastText-Based Spam Email Classification
This experiment compares the spam classification performance of three models: Linear SVC, AdaBoost, and Random Forest, using a variety of features. The Random Forest model achieved an AUC-ROC score of 0.9836 and a maximum test accuracy of 0.9776, demonstrating its ability to effectively capture feature associations and distinguish between spam and non-spam emails.

In comparison, AdaBoost achieved an AUC-ROC score of 0.9059 and a test accuracy of 0.9157.


Experiment 3: Word2Vec-Based Spam Email Classification
In this experiment, we represent spam email text using GloVe embeddings and assess how well three machine learning models (SVM, AdaBoost, and Random Forest) perform. According to the findings, Random Forest attains the greatest testing accuracy (0.9641) and, with an AUC-ROC of 0.9646, is the top-performing model for this task.

AdaBoost also performs well, with a testing accuracy of 0.9381, showing strong generalization capability. However, SVM exhibits the lowest accuracy (0.8655), suggesting that it struggles with the feature representation provided by GloVe. Despite this, SVM achieves the highest AUC-ROC score (0.9651), indicating its potential effectiveness in ranking spam and non-spam emails correctly.
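A model can have the lowest accuracy and still the highest AUC-ROC because the two metrics measure different things: accuracy depends on one decision threshold, whereas AUC-ROC depends only on how well the scores rank spam above non-spam. The sketch below illustrates this general effect with hypothetical scores; the helper names `auc_roc` and `accuracy_at`, the 0.5 threshold, and the numbers are assumptions for illustration and do not reproduce the paper's models or results.

```python
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (spam, ham) pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Accuracy when every score at or above the threshold is predicted as spam."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(1 for t, p in zip(y_true, preds) if t == p) / len(y_true)


# Hypothetical scores for six messages (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0]
scores_a = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]  # perfect ranking, but all below 0.5
scores_b = [0.90, 0.80, 0.40, 0.60, 0.20, 0.10]  # better calibrated, one ranking error

auc_a, acc_a = auc_roc(y_true, scores_a), accuracy_at(y_true, scores_a)
auc_b, acc_b = auc_roc(y_true, scores_b), accuracy_at(y_true, scores_b)
print(f"model A: accuracy={acc_a:.3f}, AUC-ROC={auc_a:.3f}")
print(f"model B: accuracy={acc_b:.3f}, AUC-ROC={auc_b:.3f}")
```

Here model A ranks every spam message above every ham message yet misclassifies all spam at the 0.5 threshold, so moving the threshold (for example to 0.33) would recover its accuracy without changing its AUC-ROC.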


References
Arachchilage, N. A. G., & Harrison, M. (2014). A systematic approach to phishing detection. International Journal of Information Management, 34(4), 503-509. https://doi.org/10.1016/j.ijinfomgt.2014.02.001

Amit, S., & Prakash, A. (2022). A hybrid approach for phishing detection using deep learning and machine learning techniques. Journal of Information Security and Applications, 67, 103067. https://doi.org/10.1016/j.jisa.2022.103067

Ghafoor, K. Z., Khan, M. A., & Qadir, J. (2020). Phishing detection using long short-term memory networks. Computers & Security, 97, 101866. https://doi.org/10.1016/j.cose.2020.101866

Jakobsson, M., & Johnson, A. (2006). Phishing and online identity theft. In Advances in Information Security (Vol. 27, pp. 11-31). Springer. https://doi.org/10.1007/0-387-33058-1_2

Li, Y., Wu, Z., & Zhang, Y. (2021). Phishing email detection using BERT-based models. Journal of Computer and System Sciences, 109, 90-98. https://doi.org/10.1016/j.jcss.2020.10.015

Zhang, D., Wang, Y., & Zhao, J. (2018). Phishing detection based on natural language processing and machine learning techniques. Journal of Network and Computer Applications, 108, 1-12. https://doi.org/10.1016/j.jnca.2018.02.005

Chollet, F. (2015). Keras. GitHub repository. Retrieved from https://github.com/fchollet/keras

Abadi, M., Barham, P., Chen, J., & Chen, Z. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. Retrieved from https://www.tensorflow.org/

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). https://doi.org/10.1145/2939672.2939778

He, S., & Wang, D. (2020). Phishing email detection using deep learning. Journal of Information Science, 46(5), 641-654. https://doi.org/10.1177/0165551518796600

Kumar, N., Sonowal, S., & Nishant. (2020). Email spam detection using machine learning algorithms. Proceedings of the 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), 9183098. https://doi.org/10.1109/ICIRCA48905.2020.9183098

Basnet, R., Sung, A. H., & Liu, Q. (2014). Learning to detect phishing URLs. International Journal of Research in Engineering and Technology, 3(6), 11-21.

Fette, I., Sadeh, N., & Tomasic, A. (2007). Learning to detect phishing emails. Proceedings of the 16th International Conference on World Wide Web, 649-656.

Verma, R., & Hossain, N. (2014). Semantic feature selection for text with application to phishing email detection. Proceedings of the 2014 ACM Symposium on Document Engineering, 123-126.

Whittaker, C., Ryner, B., & Nazif, M. (2010). Large-scale automatic classification of phishing pages. Proceedings of the 17th Network and Distributed System Security Symposium (NDSS).

James, L. (2005). Phishing exposed. Syngress Publishing.

Hong, J. (2012). The state of phishing attacks. Communications of the ACM, 55(1), 74-81.

Bakhshi, T., & Ghita, B. (2014). The impact of phishing attacks on social network services. 2014 International Conference on Cyberworlds, 283-287.

Jagatic, T. N., Johnson, N. A., Jakobsson, M., & Menczer, F. (2007). Social phishing. Communications of the ACM, 50(10), 94-100.

Marchal, S., Saari, K., Singh, N., & Asokan, N. (2016). Know your phish: Novel techniques for detecting phishing sites and their targets. 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS), 323-333.

Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: A literature survey. IEEE Communications Surveys & Tutorials, 15(4), 2091-2121.

Bojanova, I., & Hurlburt, G. (2016). Phishing made easy. IT Professional, 18(5), 60-63.

Abdelhamid, N., Ayesh, A., & Thabtah, F. (2014). Phishing detection: a recent intelligent machine learning comparison based on models content and features. 2014 IEEE International Conference on Cybercrime and Computer Forensic, 1-6.

Banu, S. S., & Gomathi, S. (2014). An intelligent phishing website detection and prevention system using SVM classifier. International Conference on Intelligent Computing Applications, 31-37.

Almomani, A., Gupta, B., Atawneh, S., Meulenberg, A., & Almomani, E. (2013). A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4), 2070-2090.
