Phishing Detection in Email Using Deep Learning
Abstract
One of the easiest ways to obtain personal information from careless individuals is through phishing attacks. The phisher's main goal is to acquire sensitive information such as bank account details, usernames, and passwords. Cyber security experts are currently focusing on creating reliable and powerful methods for detecting phishing websites. By extracting and analyzing several attributes from both legitimate and phishing URLs, this study examines the use of a machine learning approach for phishing URL identification. Phishing websites are classified using Support Vector Machines (SVMs), Random Forests, and Decision Tree algorithms.

In addition to identifying phishing URLs, this study compares the accuracy of the models by evaluating their false positive and false negative rates, aiming to identify the most effective machine learning algorithm. Experimental results show that machine learning-based techniques significantly enhance the detection of phishing websites and provide reliable defenses against online threats.

Keywords: Support Vector Machine (SVM), Random Forest, Decision Tree, Phishing, Machine Learning, URL Detection, Cyber security.
Each module of the flow chart is further explained with its specific purpose.
(a) Training Dataset Collection
The data records for this study were obtained from Kaggle.com, a popular website that provides openly accessible datasets. Each email in the dataset is labeled as either safe (HAM) or phishing. The data was downloaded and uploaded to Google Colab, a cloud-based development environment that offers sufficient processing power for deep learning operations. The dataset was then split into training and testing subsets to facilitate model training and evaluation.

(b) Email Pre-processing
Raw email texts underwent a comprehensive pre-processing phase to prepare them for training deep learning models. The following steps were performed:

1. Text cleaning: The entire text was processed to maintain consistency.
2. HTML tag removal: HTML tags and special characters were eliminated to remove unnecessary noise.
3. Tokenization: The email content was divided into individual words or tokens.
4. Lemmatization: Words were reduced to their root forms, minimizing vocabulary size.
5. Text normalization: Additional normalization techniques were applied to ensure the data was in the optimal format for learning.

... emails using the pre-processed email data. Google Colab was used to run the training process, with GPU support to accelerate computations.

(d) Optimizing the Deep Learning Framework
An optimization approach was applied to tune the hyperparameters and improve model performance. Key variables such as learning rate, batch size, number of epochs, and optimization algorithm were adjusted. The goal of this fine-tuning process was to enhance both the accuracy and capacity of the model.

(e) Feature Extraction from Testing Dataset
The test subset of email data records was fed into the trained deep learning model. The model extracted patterns and features from these emails, which were used to evaluate its ability to generalize the knowledge gained during training.

(f) Classification
The deep learning classifier categorized each email in the test dataset as either safe or phishing, based on the extracted features. To assess the model's performance, the classification results were compared with the ground truth labels.
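
As a minimal sketch of the comparison described in (f), the following Python snippet computes accuracy together with the false positive and false negative rates from predicted and ground truth labels. The label encoding (1 = phishing, 0 = safe), the names `evaluate` and `Metrics`, and the sample labels are illustrative assumptions, not details taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed label encoding (not specified in the study).
PHISHING = 1
SAFE = 0


@dataclass
class Metrics:
    accuracy: float
    false_positive_rate: float  # safe emails wrongly flagged as phishing
    false_negative_rate: float  # phishing emails wrongly passed as safe


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Metrics:
    """Compare predictions with ground truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one labeled example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == PHISHING and p == PHISHING)
    tn = sum(1 for t, p in pairs if t == SAFE and p == SAFE)
    fp = sum(1 for t, p in pairs if t == SAFE and p == PHISHING)
    fn = sum(1 for t, p in pairs if t == PHISHING and p == SAFE)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is absent from y_true.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return Metrics(accuracy, fpr, fnr)


if __name__ == "__main__":
    # Hypothetical labels for eight test emails.
    truth = [1, 0, 1, 1, 0, 0, 1, 0]
    preds = [1, 0, 0, 1, 0, 1, 1, 0]
    print(evaluate(truth, preds))
```

Reporting the two error rates separately, rather than accuracy alone, makes it visible whether a model tends to block legitimate mail or to let phishing mail through, which is useful when comparing several models.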