URAI Phishing Email Detection Paper
1 Introduction
2 Related Work
Phishing email detection has been an active research area for decades, evolving from
rule-based systems and lexical analysis over classical machine learning algorithms,
including SVMs and tree-based classifiers, to deep learning methods such as
recurrent and convolutional neural networks as well as transformers [4][5]. The use of
Large Language Models for identifying phishing emails is still an emerging field with
few research publications.
A majority of recent studies base their work on the GPT models of OpenAI [6][7][8].
The model family achieved a high level of popularity with the release of its derivative
ChatGPT. Rosa et al. [7] achieved an overall accuracy of 75.75 % for binary phishing
email classification by feeding emails to GPT-3.5. With their high number of active
parameters, the GPT models have demonstrated strong performance across many application
areas; however, they are proprietary and closed-source [9]. This paper focuses on
the use of open models, which are free to use and meet higher demands regarding
data privacy. While some studies on phishing detection use open LLMs solely as upstream
feature extractors for other machine learning methods [3], Koide et al. [10] employ the
model Llama 2 to classify emails and achieve an overall accuracy of 88.61 % through
prompt engineering. Their study contrasts this with the much larger GPT-4 model,
which reaches 99.70 % accuracy.
Baumann et al. [11] propose a combination of RAG and FSL to generate models for
domain-specific languages (DSLs) used in the field of software engineering.
Their approach uses RAG to retrieve relevant examples from a knowledge base, enabling
FSL to generate synthetic models for underrepresented DSLs that lack sufficient training
data, thereby adapting an LLM's output syntax. Our literature review showed that the
method of fusing RAG and FSL to improve an LLM's capability to solve unseen
machine learning tasks has not been addressed to date.
3 Methodology
3.1 Dataset
The experiments conducted in this study aim to evaluate the performance of the proposed
approaches for the classification of phishing emails. For this purpose, a dataset containing
both phishing and legitimate emails was created by concatenating two publicly available
datasets. The CSDMC Spam Corpus [12] includes 2,949 so-called "ham" emails, legitimate
messages that fall into neither the phishing nor the "spam" category. It has already been
used in similar studies such as [10]. The phishing emails were sampled from the Phishing Pot
dataset [13] and are real emails collected from August 2022 to July 2024. In contrast to
[14], this approach does not include synthetic phishing samples, nor emails collected far in
the past, as in [15]. By choosing an up-to-date source dataset, newer phishing techniques
are also represented in our final dataset. From each source dataset, 2,900 emails were
randomly sampled to build a new set with a total of 5,800 emails, balanced between
the two classes phishing and no phishing. Samples with an email body of fewer than 50
characters or more than 420,000 characters were considered invalid and discarded in the
selection step. In a subsequent data-cleaning step, all non-ASCII characters in the
messages were removed. Each message sample consists of the concatenation of the email's
subject and its body. If the message body was available in both plain-text and HTML
format, this approach prioritized the HTML part and converted it to plain text by
removing all HTML-related fragments. This study does not address the role of email
attachments as an attack vector; all attachments included in the samples were removed.
3.2 Model Selection
The experiments were evaluated on a variety of Large Language Models that represent
the current state of the art and are published under an open license. The approach
deliberately refrained from using commercial models such as GPT-4 (OpenAI). The
selected models are OpenChat 7B [16], Mixtral 8x7B [17], Mistral 7B [18], Gemma2
9B and 27B (Google DeepMind) [19], Llama3.2 1B and 3B [20], Mistral-Small 22B [18],
Command-R 35B [21], as well as Llama3.1 8B and Llama3.1 70B (Meta AI) [20]. All
models were pre-trained by their respective authors on different datasets and differ in
their architecture and number of parameters. While models with a larger number of
parameters generally have a greater ability to capture complex patterns and relation-
ships, they may tend toward over-fitting and generalize less well to new and
unseen data.
[Figure: Construction of the vector store. Phishing email samples are split and encoded
by a transformer model into vector embeddings.]
[Figure: Context retriever. The inference email is embedded by the transformer model and
compared against the vector store via cosine similarity; the top k=5 most similar samples
are retrieved.]
[Figure: Inference prompt. The instruction, the retrieved sample emails, and the inference
email are passed to the Large Language Model, whose output is mapped by a JSON parser to
True/False.]
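The retrieval step depicted above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding model is treated as a black box, and the vector store is represented as a plain list of labeled samples with precomputed embeddings.

```python
# Sketch of the context-retrieval step: rank the stored samples by cosine
# similarity to the inference email's embedding and return the top k=5.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_top_k(query_vec: list[float], vector_store: list[tuple], k: int = 5):
    """vector_store holds (email_text, label, embedding) tuples; the k most
    similar samples become the few-shot context for the prompt."""
    ranked = sorted(vector_store,
                    key=lambda item: cosine_similarity(query_vec, item[2]),
                    reverse=True)
    return ranked[:k]
```

In practice the embeddings would come from a sentence-embedding transformer and the store would be an indexed vector database rather than a linearly scanned list; the ranking logic is the same.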
The result is extracted from the model output in a structured form using a JSON parser
in the same way as the first approach (see section 3.3).
This work evaluates how well LLMs are able to distinguish legitimate emails from phish-
ing emails. The paper presents an approach that improves the effectiveness of detection
by combining the methods of Few-Shot Learning and RAG for contextual reinforcement.
The knowledge of the language model is dynamically enhanced at inference time through
in-context, problem-specific learning, without the need for computationally intensive
adjustments to the actual AI model and its parameters. Experiments on a generated
test dataset have shown that our approach significantly increases the recognition rate of
models with fewer parameters and lower resource requirements, and outperforms previ-
ous approaches using open LLMs. This approach achieves an accuracy of 96.18 % for the
classification of phishing emails.
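The combination of Few-Shot Learning and RAG at inference time can be illustrated by the prompt assembly below. The instruction wording and JSON answer format are assumptions for illustration; the paper's exact prompt is not reproduced here.

```python
# Sketch of assembling the inference prompt: instruction, retrieved labeled
# samples as few-shot examples, then the email under test.
INSTRUCTION = (
    "Classify the following email as phishing or legitimate. "
    'Answer with JSON: {"phishing": true|false}.'
)


def build_prompt(retrieved: list[tuple[str, bool]], email: str) -> str:
    """retrieved holds (email_text, is_phishing) pairs from the RAG step."""
    shots = "\n\n".join(
        f'Email:\n{text}\nAnswer: {{"phishing": {str(label).lower()}}}'
        for text, label in retrieved
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nEmail:\n{email}\nAnswer:"
```

Because the examples are injected per query, the model's weights remain untouched; only the context changes with each email.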
The results of this work raise further questions for future research on the detection
of phishing emails with LLMs. As a next step, it should be investigated how fusing
the RAG information source with additional datasets affects detection accuracy.
A promising approach could be the generation of phishing examples by an LLM itself,
as already practiced by attackers. In addition, the use of other embedding models and
different semantic search methods should be evaluated. It would also be useful to consider
email metadata and file attachments. Furthermore, agent approaches that extend the
capabilities of LLMs with functional tools, e.g. for querying external APIs, could be
investigated.
References
1. Cloudflare: 2023 Phishing Threats Report. Technical report, Munich (2024)
2. OpenAI, Achiam, J., et al.: GPT-4 technical report (2024)
3. Nahmias, D., Engelberg, G., et al.: Prompted contextual vectors for spear-phishing
detection (2024)
4. Crawford, M., Khoshgoftaar, T., Prusa, J., et al.: Survey of review spam detection using
machine learning techniques. Journal of Big Data 2 (2015) 23
5. Thakur, K., Ali, M.L., Obaidat, M.A., et al.: A systematic review on deep-learning-based
phishing email detection. Electronics 12(21) (2023)
6. Roumeliotis, K.I., Tselikas, N.D., Nasiopoulos, D.K.: Next-generation spam filtering: Com-
parative fine-tuning of LLMs, NLPs, and CNN models for email spam classification. Electronics
13(11) (2024)
7. Rosa, S., Gringoli, F., Bellicini, G.: Hey ChatGPT, is this message phishing? (2024) 1–10
8. Heiding, F., Schneier, B., et al.: Devising and detecting phishing: Large language
models vs. smaller human models (2023)
9. Hou, X., Zhao, Y., et al.: Large language models for software engineering: A systematic
literature review (2024)
10. Koide, T., Fukushi, N., et al.: ChatSpamDetector: Leveraging large language models
for effective phishing email detection (2024)
11. Baumann, N., Diaz, J.S., Michael, J., et al.: Combining retrieval-augmented generation and
few-shot learning for model synthesis of uncommon DSLs. Modellierung 2024 Satellite Events
(2024)
12. Zhang, R.: CSDMC2010 spam corpus. International Conference on Neural Information Pro-
cessing. https://github.com/zrz1996/Spam-Email-Classifier-DataSet (2010)
13. Anonymous: Phishing Pot dataset. https://github.com/rf-peixoto/phishing_pot (2024)
14. Jamal, S., Wimmer, H., Sarker, I.H.: An improved transformer-based model for detecting
phishing, spam and ham emails: A large language model approach. Security and Privacy
(2024) e402
15. Patel, H., Rehman, U., Iqbal, F.: Large language models spot phishing emails with surprising
accuracy: A comparative analysis of performance (2024)
16. Wang, G., Cheng, S., Zhan, X., et al.: OpenChat: Advancing open-source language models with
mixed-quality data. arXiv preprint arXiv:2309.11235 (2023)
17. Jiang, A.Q., Sablayrolles, A., et al.: Mixtral of experts (2024)
18. Jiang, A., Sablayrolles, A., et al.: Mistral 7B (2023)
19. Gemma Team, Google DeepMind: Gemma 2: Improving open language models at a practical
size (2024)
20. Touvron, H., Lavril, T., et al.: LLaMA: Open and efficient foundation language models
(2023)
21. Gomez, A.: Command R: Retrieval-augmented generation at production scale (2024)
22. White, J., Fu, Q., Hays, S., et al.: A prompt pattern catalog to enhance prompt engineering
with ChatGPT. arXiv e-prints (2023) arXiv:2302.11382
23. Pezoa, F., Reutter, J.L., Suarez, F., et al.: Foundations of JSON Schema. In: Proceedings of
the 25th International Conference on World Wide Web, International World Wide Web
Conferences Steering Committee (2016) 263–273
24. Brown, T.B., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Proceed-
ings of the 34th International Conference on Neural Information Processing Systems. NIPS
'20 (2020)
25. Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive
NLP tasks. In: Proceedings of the 34th International Conference on Neural Information
Processing Systems. NIPS '20 (2020)
26. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks
(2019)
27. Singhal, A.: Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24
(2001) 35–43
28. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research 12 (2011) 2825–2830