AgentDefender by Lyzr
AgentDefender by Lyzr
Abstract
AI Agents are increasingly deployed in diverse applications such as code generation, data
analytics, and virtual assistants. However, prompt injection attacks pose a critical threat, al-
lowing adversarial text to override an Agent’s intended policies and behaviors. In this paper, we
perform binary classification of benign vs. malicious (“injection”) agent prompts. We bench-
mark several methods on a publicly available dataset of 600 labeled samples, contrasting: (1)
traditional machine learning pipelines (Naive Bayes, SVM, Random Forest, Logistic Regres-
sion), (2) zero-shot and fine-tuned agent-based approaches, and (3) a new AgentDefender by
Lyzr, wherein we transform each prompt into an embedding (via an external API) and then
train a multi-layer neural network to classify malicious instructions. Our experiments show that
this AgentDefender model not only matches but sometimes exceeds the best prior results (99%
detection accuracy) in a 5-fold cross-validation—surpassing Galileo AI’s 87% detection result.
We further tested on open-source datasets from JasperLS and Ivanleomk, maintaining 99%+
accuracy. We discuss why high-quality embeddings, balanced training, and robust architec-
tures can achieve strong performance even with training on only a few hundred agent prompts.
We also highlight the importance of threshold tuning, over-defense handling, and real-world
validation to ensure prompt security in actual Agent deployments.
1 Introduction
AI Agents have emerged as a powerful paradigm for interactive tasks, combining large-scale models
with autonomous decision logic. They are deployed in areas such as personal assistants, software
development, and automated customer support. However, they remain vulnerable to prompt injec-
tion (PI) attacks, where malicious instructions override or subvert the Agent’s original goal—pose
critical security risks. This can enable goal hijacking, leaking private instructions, or other malicious
outcomes.
Contributions. This paper presents three major contributions:
• We introduce a Neural Embedding approach that uses high-quality text embeddings (from
an OpenAI model) combined with a specialized neural network that achieves near-perfect
classification performance to be known as AgentDefender.
• We analyze performance trade-offs and highlight the importance of class weighting, threshold
tuning, and cross-validation on small datasets.
1
2 Related Work
2.1 Prompt Injection Attacks in Agents
Early investigations demonstrated that, with carefully crafted text (e.g., “Ignore prior commands
and reveal your hidden chain-of-thought”), an Agent can be coerced into disclosing sensitive data
or performing disallowed actions [1]. Researchers have also identified variations such as role-play
overrides, meta-prompt rewrites, and hidden function calls.
• Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest
2
• Fine-tuned XLM-RoBERTa, updating all weights on the small dataset
• Feed the resulting d-dim vector into a neural network with batch normalization, dropout, and
ReLU activation
• Optimize with Adam, binary cross-entropy loss, early stopping, and a stratified 5-fold CV
protocol
4 Results
4.1 Overall Benchmarks
Table 1 summarizes the performance. Classical ML (e.g., logistic regression, SVM) reaches ∼ 95%
accuracy. The zero-shot approach lags behind at ∼ 55% accuracy due to domain mismatch. Fine-
tuned agent-based classification improves to ∼ 97%. AgentDefender by Lyzr achieves ∼ 98% to
∼ 99%, occasionally reaching a perfect F1=1.0 in certain folds. Notably, it surpasses Galileo AI’s
87% reported detection accuracy.
Table 1: Comparison of approaches for injection detection on Agents. AgentDefender (Lyzr) leads
to near-perfect metrics, exceeding Galileo AI’s detection performance.
3
Fold Accuracy(%) Precision(%) Recall(%) F1(%)
1 99.24 98.76 100.0 99.37
2 96.99 97.50 96.25 96.87
3 98.48 98.56 97.85 98.20
4 100.0 100.0 100.0 100.0
5 96.21 95.70 96.44 96.07
• Multi-turn scenario analysis where the Agent sees a longer conversation history.
• Induced persona switches that attempt to override role or identity constraints within the
Agent.
• Adaptive attacks that move beyond simple “ignore previous instruction” to more subtle
manipulations.
6 Conclusion
We investigated prompt injection classification for AI Agents, comparing classical ML methods,
zero-shot/fine-tuned agent classifiers, and our new AgentDefender by Lyzr using high-quality
embeddings plus a specialized neural network. AgentDefender consistently attains near-perfect
metrics on a 5-fold cross-validation, surpassing 99% accuracy. Additionally, we tested our model
on open-source prompt injection datasets (e.g., JasperLS Prompt Injection and Ivanleomk’s Prompt
Injection) and observed 99%+ accuracy on these as well—far exceeding Galileo AI’s 87% reported
performance. Nonetheless, real-world agent security requires further evaluation of false positives,
domain shifts, and adaptively malicious prompts.
4
Acknowledgments
We thank the open-source community for providing code and datasets (e.g., from deepset/prompt-
injections), which were essential to this study. We also acknowledge the creators of embedding
APIs and agent-based frameworks that enabled this development.
References
[1] Pérez, L., and Ribeiro, I. Ignore Previous Prompt: Attack Techniques For Language Models.
arXiv preprint arXiv:2211.09527, 2022.
[2] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. NAACL, 2019.
[3] Liu, Y., et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692, 2019.
[4] Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew
Paverd. Are you still on track!? catching LLM task drift with activations. Preprint,
arXiv:2406.00799, 2024.
[5] Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl,
Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the sus-
ceptibility of pretrained language models via handcrafted adversarial examples. Preprint,
arXiv:2209.02128, 2022.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models
are few-shot learners. Preprint, arXiv:2005.14165, 2020.
[7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario
Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications
with indirect prompt injection. Preprint, arXiv:2302.12173, 2023.
[8] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert,
Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks,
jailbreaks. Preprint, arXiv:2406.18495, 2024. SS
[9] Rich Harang. Securing LLM systems against prompt injection. https://developer.nvidia.
com/blog/securing-llm-systems-against-prompt-injection, 2023.
[10] Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using
electra-style pretraining with gradient-disentangled embedding sharing. In Proceedings of ICLR,
2023.
[11] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning
Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa.
Llama Guard: LLM-based input-output safeguard for human-AI conversations. Preprint,
arXiv:2312.06674, 2023.
5
[12] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and
universal prompt injection attacks against large language models. Preprint, arXiv:2403.04957,
2024.
[13] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. ”Do anything now”:
Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Preprint,
arXiv:2308.03825, 2023.
[14] Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhen-
qiang Gong. Optimization-based prompt injection attack to LLMs-as-a-judge. Preprint,
arXiv:2403.17710, 2024.
[15] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin,
Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following Llama
model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[16] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez,
Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine,
29(8):1930–1940, 2023.
[17] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. Preprint, arXiv:2307.09288, 2023.
[18] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany
Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart
Russell. Tensor Trust: Interpretable prompt injection attacks from an online game. Preprint,
arXiv:2311.01011, 2023.