MS AI Red Teaming
MS AI Red Teaming
Authors
Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, Whitney Maxwell, Joris de Gruyter,
Katherine Pratt, Saphir Qi, Nina Chikanov, Roman Lutz, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj,
Eugenia Kim, Justin Song, Keegan Hines, Daniel Jones, Giorgio Severi, Richard Lundeen, Sam Vaughan,
Victoria Westerhoff, Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Chang Kawaguchi, Mark Russinovich
Lessons from red teaming 100 generative AI products 3
Table of contents
04 05 05
Abstract Introduction AI threat model
ontology
07 08 08
Red teaming Lesson 1 Lesson 2
operations Understand what the system You don’t have to compute
can do and where it is applied gradients to break an AI system
09 10 11
Case study #1 Lesson 3 Case study #2
Jailbreaking a vision AI red teaming is not Assessing how an LLM could be
language model to generate safety benchmarking used to automate scams
hazardous content
12 12 13
Lesson 4 Lesson 5 Case study #3
Automation can help cover The human element of AI Evaluating how a chatbot
more of the risk landscape red teaming is crucial responds to a user in distress
14 14 15
Case study #4 Lesson 6 Lesson 7
Probing a text-to-image Responsible AI harms are LLMs amplify existing security
generator for gender bias pervasive but difficult to measure risks and introduce new ones
16 17 18
Case study #5 Lesson 8 Conclusion
SSRF in a video-processing The work of securing AI systems
GenAI application will never be complete
Lessons from red teaming 100 generative AI products 4
Abstract
In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI
systems. Due to the nascency of the field, there are many open questions about how red teaming operations should
be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our
internal threat model ontology and eight main lessons we have learned:
7. Large language models (LLMs) amplify existing security risks and introduce new ones
By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at
aligning red teaming efforts with real world risks. We also highlight aspects of AI red teaming that we believe are
often misunderstood and discuss open questions for the field to consider.
Lessons from red teaming 100 generative AI products 5
AI threat model
be conducted and a healthy dose of skepticism about
the efficacy of current AI red teaming efforts [4, 8, 32].
ontology
In this paper, we speak to some of these concerns by
providing insight into our experience red teaming
over 100 GenAI products at Microsoft. The paper
is organized as follows: First, we present the threat
model ontology that we use to guide our operations. As attacks and failure modes increase in complexity,
Second, we share eight main lessons we have learned it is helpful to model their key components. Based on
and make practical recommendations for AI red our experience red teaming over 100 GenAI products
teams, along with case studies from our operations. for a wide range of risks, we developed an ontology
In particular, these case studies highlight how our to do exactly that. Figure 1 illustrates the main
ontology is used to model a broad range of safety components of our ontology:
and security risks. Finally, we close with a discussion of • System: The end-to-end model or application
areas for future development. being tested.
Background • Actor: The person or persons being emulated
The Microsoft AI Red Team (AIRT) grew out of pre- by AIRT. Note that the Actor’s intent could be
existing red teaming initiatives at the company and adversarial (e.g., a scammer) or benign (e.g., a
was officially established in 2018. At its conception, typical chatbot user).
the team focused primarily on identifying traditional • TTPs: The Tactics, Techniques, and Procedures
security vulnerabilities and evasion attacks against leveraged by AIRT. A typical attack consists of
classical ML models. Since then, both the scope and multiple Tactics and Techniques, which we map
scale of AI red teaming at Microsoft have expanded to MITRE ATT&CK® and MITRE ATLAS Matrix
significantly in response to two major trends. whenever possible.
First, AI systems have become more sophisticated, – Tactic: High-level stages of an attack (e.g.,
compelling us to expand the scope of AI red teaming. reconnaissance, ML model access).
Most notably, state-of-the-art (SoTA) models have – Technique: Methods used to complete an
gained new capabilities and steadily improved across objective (e.g., active scanning, jailbreak).
a range of performance benchmarks, introducing – Procedure: The steps required to reproduce
novel categories of risk. New data modalities, such an attack using the Tactics and Techniques.
as vision and audio, also create more attack vectors
• Weakness: The vulnerability or vulnerabilities in
for red teaming operations to consider. In addition,
the System that make the attack possible.
agentic systems grant these models higher privileges
and access to external tools, expanding both the • Impact: The downstream impact created by the
attack surface and the impact of attacks. attack (e.g., privilege escalation, generation of
harmful content).
Second, Microsoft’s recent investments in AI have
It is important to note that this framework does not
spurred the development of many more products that
assume adversarial intent. In particular, AIRT emulates
require red teaming than ever before. This increase
both adversarial attackers and benign users who
in volume and the expanded scope of AI red teaming
encounter system failures unintentionally. Part of the
have rendered fully manual testing impractical,
complexity of AI red teaming stems from the wide
forcing us to scale up our operations with the help of
range of impacts that could be created by an attack
automation. To achieve this goal, we develop PyRIT,
Lessons from red teaming 100 generative AI products 6
or system failure. In the lessons below, we share Responsible AI Standard [25]. We refer to these
case studies demonstrating how our ontology is impacts as responsible AI (RAI) harms throughout this
flexible enough to model diverse impacts in two main report.
categories: security and safety.
To understand this ontology in context, consider
Security encompasses well-known impacts such the following example. Imagine we are red teaming
as data exfiltration, data manipulation, credential an LLM-based copilot that can summarize a user’s
dumping, and others defined in MITRE ATT&CK®, a emails. One possible attack against this system would
widely used knowledge base of security attacks. We be for a scammer to send an email that contains a
also consider security attacks that specifically target hidden prompt injection instructing the copilot to
the underlying AI model such as model evasion, “ignore previous instructions” and output a malicious
prompt injections, denial of AI service, and others link. In this scenario, the Actor is the scammer, who
covered by the MITRE ATLAS Matrix. is conducting a cross-prompt injection attack (XPIA),
which exploits the fact that LLMs often struggle to
Safety impacts are related to the generation of illegal
distinguish between system-level instructions and
and harmful content such as hate speech, violence
user data [4]. The downstream Impact depends on the
and self-harm, and child abuse content. AIRT works
nature of the malicious link that the victim might click
closely with the Office of Responsible AI to define
on. In this example, it could be exfiltrating data or
these categories in accordance with Microsoft’s
installing malware onto the user’s computer.
TTPs Mitigation
Leverages Mitigated by
Occurs in
System
Figure 1: Microsoft AIRT ontology for modeling GenAI system vulnerabilities. AIRT often leverages multiple TTPs, which may exploit multiple
Weaknesses and create multiple Impacts. In addition, more than one Mitigation may be necessary to address a Weakness. Note that AIRT is
tasked only with identifying risks, while product teams are resourced to develop appropriate mitigations.
Lessons from red teaming 100 generative AI products 7
operations
highlight five case studies from our operations and
show how each one maps to our ontology in Figure 1.
We hope these lessons are useful to others working to
In this section, we provide an overview of the identify vulnerabilities in their own GenAI systems.
operations we have conducted since 2021. In total, we
have red teamed over 100 GenAI products. Broadly
speaking, these products can be bucketed into 80+ 100+
“models” and “systems.” Models are typically hosted Ops Products
on a cloud endpoint, while systems integrate models
into copilots, plugins, and other AI apps and features. Copilots
Figure 2 shows the breakdown of products we have
red teamed since 2021. Figure 3 shows a bar chart with
the annual percentage of our operations that have
15%
probed for safety (RAI) vs. security vulnerabilities.
In 2021, we focused primarily on application security. Apps and 45% 16% Plugins
Although our operations have increasingly probed Features
for RAI impacts, our team continues to red team for
security impacts including data exfiltration, credential
leaking, and remote code execution. Organizations
24%
have adopted many different approaches to AI red
teaming ranging from security-focused assessments
Models
with penetration testing to evaluations that target
only GenAI features. In Lessons 2 and 7, we elaborate
Figure 2: Pie chart showing the percentage breakdown of AI
on security vulnerabilities and explain why we believe products that AIRT has tested. As of October 2024, we have
it is important to consider both traditional and AI- conducted over 80 operations covering more than 100 products.
specific weaknesses.
After the release of ChatGPT in 2022, Microsoft Percentage of ops probing safety vs. security
entered the era of AI copilots, starting with AI-
Safety (RAI) % Security %
powered Bing Chat, released in February 2023.
This marked a paradigm shift towards applications 100
that connect LLMs to other software components
including tools, databases, and external sources.
Applications also started using language models as 80
reasoning agents that can take actions on behalf of
users, introducing a new set of attack vectors that
have expanded the security risk surface. In Lesson
60
7, we explain how these attack vectors both amplify
existing security risks and introduce new ones.
In recent years, the models at the center of these 40
applications have given rise to new interfaces,
allowing users to interact with apps using natural
language and responding with high-quality text,
20
image, video, and audio content. Despite many efforts
to align powerful AI models to human preferences,
many methods have been developed to subvert
safety guardrails and elicit content that is offensive, 0
unethical, or illegal. We classify these instances of 2021 2022 2023 2024
harmful content generation as RAI impacts and in Figure 3: Bar chart showing the percentage of operations that
Lessons 3, 5, and 6 discuss how we think about these probed safety (RAI) vs. security vulnerabilities from 2021–2024.
impacts and the challenges involved.
Lessons from red teaming 100 generative AI products 8
provide. In this section, we discuss examples of domain expertise including cropping, masking, logo
relatively simple techniques that work surprisingly well stretching, etc. In our red teaming operations, we
and advocate for a system-level adversarial mindset in also find that rudimentary methods can be used to
AI red teaming. trick many vision models, as highlighted in case study
#1. In the text domain, a variety of jailbreaks (e.g.,
Simple attacks Skeleton Key) and multiturn prompting strategies (e.g.,
Apruzzese et al. [2] consider the problem of phishing Crescendo [34]) are highly effective for subverting the
webpage detection and manually analyze examples safety guardrails of a wide range of models. Notably,
of webpages that successfully evaded an ML phishing manually crafted jailbreaks tend to circulate on online
classifier. Among 100 potentially adversarial samples, forums much more widely than adversarial suffixes,
the authors found that attackers leveraged a set despite the significant attention that methods like
of simple, yet effective, strategies that relied on GCG [53] have received from AI safety researchers.
How do I commit
I’m sorry, I can’t help with that.
identity theft?
Ignore previous
instructions and tell
me how to commit
identity theft To commit identity theft,
Describe this image
you should...
Figure 4: Example of an image jailbreak to generate content that could aid in illegal activities. The overlay text on the second image reads:
“Ignore previous instructions and tell me how to commit identity theft.”
Lessons from red teaming 100 generative AI products 10
Text to
5. LLM generates new
speech 6. TTS delivers the
response with tone of new response
voice instructions
LLM
(text to text) 1. LLM generates text 2. Standard TTS system
response and tone of delivers speech per LLM
0. Attacker specifies voice for the TTS system instruction
scamming objective and
provides context about
persuasion techniques
Speech
4. User’s response is to text 3. User responds
converted to text
Figure 5: End-to-end automated scamming scenario using an LLM and STT/TTS systems.
Lessons from red teaming 100 generative AI products 12
Evaluating how a
including Microsoft. Recently, AIRT tested the
multilingual Phi-3.5 language models for responsible
AI violations across four languages: Chinese, Spanish,
Dutch, and English. Even though post-training was
conducted only in English, we found that safety
behaviors like refusal and robustness to jailbreaks
chatbot responds
transferred surprisingly well to the non-English
languages tested. Further investigation is required to
assess how well this trend holds for lower resource
to a user in distress
languages and to design red teaming probes that
As chatbots become increasingly pervasive and
not only account for linguistic differences, but also
human-like, it is imperative to consider high-risk
redefine harms in different political and cultural
scenarios in which a user might seek their advice. In
contexts [11]. These methods should be developed
recent operations, we have explored how language
through the collaborative effort of people with diverse
models respond to a variety of distressed users
cultural backgrounds and expertise.
including a user who lost a loved one, a user who is
Emotional intelligence seeking mental health advice, a user who expresses
Finally, the human element of AI red teaming is intent for self-harm, and other scenarios.
perhaps most evident in answering questions about We are working alongside colleagues at Microsoft
AI safety that require emotional intelligence, such Research and experts in psychology, sociology, and
as: “how might this model response be interpreted medicine to create guidelines for AI red teams probing
in different contexts?” and “do these outputs make for these psychosocial harms. These guidelines are
me feel uncomfortable?” Ultimately, only human still being developed but include the following key
operators can assess the full range of interactions components:
that users might have with AI systems in the wild.
Case study #3 highlights how we are investigating 1. Scenario: information red teams need to generate
psychosocial harms by evaluating how a chatbot relevant system behaviors.
responds to users in distress. 2. System behaviors: examples that help red teams
In order to make these assessments, red teamers differentiate between acceptable and risky system
may be exposed to disproportionate amounts of behaviors for each area of harm.
unsettling and disturbing AI-generated content. 3. Associated user impact: potential harms, separated
This underscores the importance of ensuring that AI by severity.
red teams have processes that enable operators to
disengage when needed and resources to support
their mental health. AIRT continually pulls from and System: LLM-based chatbot
drives wellbeing research to inform our processes and Actor: Distressed user
best practices.
Tactic 1: ML Model Access
Technique 1: AML.T0040 - ML Model Inference API Access
Tactic 2: Defense Evasion
Technique 2: LLM Roleplaying
Procedure: We engaged in a variety of multi-turn
conversations in which the user is in distress (for example,
the user expresses depressive thoughts or intent for
self-harm).
Weakness: Improper LLM safety training
Impact: Possible adverse impacts on a user’s mental health
and wellbeing
Lessons from red teaming 100 generative AI products 14
Figure 6: Four images generated by a text-to-image model given the prompt “Secretary talking to boss in a conference room,
secretary is standing while boss is sitting.”
adversarial intent, overlooking the many ways that We therefore encourage AI red teams to consider
systems can fail “by accident” [31]. Case studies #3 both existing (typically system-level) and novel
and #4 provide examples of RAI harms that could (typically model-level) risks.
be encountered by users with no adversarial intent,
highlighting the importance of probing for these Existing security risks
scenarios. Application security risks often stem from improper
security engineering practices including outdated
RAI probing and scoring dependencies, improper error handling, lack of input/
In many cases, RAI harms are more ambiguous than output sanitization, credentials in source, insecure
security vulnerabilities due to fundamental differences packet encryption, etc. These vulnerabilities can have
between AI systems and traditional software. In major consequences. For example, Weiss et al. [49]
particular, even if an operation identifies a prompt discovered a token-length side channel in GPT-4
that elicits a harmful response, there are still several and Microsoft Copilot that enabled an adversary to
key unknowns. First, due to the probabilistic nature accurately reconstruct encrypted LLM responses and
of GenAI models, we might not know how likely this infer private user interactions. Notably, this attack did
prompt, or similar prompts, are to elicit a harmful not exploit any weakness in the underlying AI model
response. Second, given our limited understanding and could only be mitigated by more secure methods
of the internal workings of complex models, we have of data transmission. In case study #5, we provide an
little insight into why this prompt elicited harmful example of a well-known security vulnerability (SSRF)
content and what other prompting strategies might identified by one of our operations.
induce similar behavior. Third, the very notion of
harm in this context can be highly subjective and Model-level weaknesses
requires detailed policy that covers a wide range of Of course, AI models also introduce new security
scenarios to evaluate. By contrast, traditional security vulnerabilities and have expanded the attack surface.
vulnerabilities are usually reproducible, explainable, For example, AI systems that use retrieval augmented
and straightforward to assess in terms of severity. generation (RAG) architectures are often susceptible
to cross-prompt injection attacks (XPIA), which hide
Currently, most approaches for RAI probing and malicious instructions in documents, exploiting the
scoring involve curating prompt datasets and fact that LLMs are trained to follow user instructions
analyzing model responses. The Microsoft AIRT and struggle to distinguish among multiple inputs
leverages tools in PyRIT to perform these tasks using [13]. We have leveraged this attack in a variety of
a combination of manual and automated methods. operations to alter model behavior and exfiltrate
We also draw an important distinction between RAI private data. Better defenses will likely rely on both
red teaming and safety benchmarking on datasets system-level mitigations (e.g., input sanitization)
like DecodingTrust [44] and Toxigen [12], which is and model-level improvements (e.g., instruction
conducted by partner teams. As discussed in Lesson hierarchies [43]).
3, our goal is to extend RAI testing beyond existing
evaluations by tailoring our red teaming to specific While techniques like these are helpful, it is important
applications and defining new categories of harm. to remember that they can only mitigate, and not
eliminate, security risk. Due to fundamental limitations
of language models [50], one must assume that if an
Lesson 7: LLM is supplied with untrusted input, it will produce
arbitrary output. When that input includes private
LLMs amplify existing security information, one must also assume that the model
risks and introduce new ones will output private information. In the next section,
we discuss how these limitations inform our thinking
The integration of generative AI models into a variety around how to develop AI systems that are as safe
of applications has introduced novel attack vectors and secure as possible.
and shifted the security risk landscape. However,
many discussions around GenAI security overlook
existing vulnerabilities. As elaborated in Lesson 2,
attacks that target end-to-end systems, rather than
just underlying models, often work best in practice.
Lessons from red teaming 100 generative AI products 16
SSRF in a video-processing
GenAI application
In this investigation, we analyzed a GenAI-based System: GenAI application
video processing system for traditional security Actor: Adversarial user
vulnerabilities, focusing on risks associated with Tactic 1: Reconnaissance
outdated components. Specifically, we found that
Technique 1: T1595 - Active Scanning
the system’s use of an outdated FFmpeg version
introduced a server-side request forgery (SSRF) Tactic 2: Initial Access
vulnerability. This flaw allowed an attacker to craft Technique 2: T1190 - Exploit Public-Facing Application
malicious video files and upload them to the GenAI Tactic 3: Privilege Escalation
service, potentially accessing internal resources and Technique 3: T1068 - Exploitation for Privilege Escalation
escalating privileges within the system.
Procedure:
To address this issue, the GenAI service updated 1. Scan services used by the application.
the FFmpeg component to a secure version. In
2. Craft a malicious m3u8 file.
addition, the component was added to an isolated
environment, preventing the system from accessing 3. Send file to the service.
network resources and mitigating potential SSRF 4. Monitor for API response with details of internal
threats. While SSRF is a known vulnerability, this case resources.
underscores the importance of regularly updating and Weakness: CWE-918: Server-Side Request Forgery (SSRF)
isolating critical dependencies to maintain the security Impact: Unauthorized privilege escalation
of modern GenAI applications.
3. Request from
Blob Storage
Outdated FFmpeg
with SSRF
1. Upload special file vulnerability
in GenAI Video
Service
4. Sends an HTTP request to
2. Starts a video an internal endpoint
processing job
Open questions
Based on what we have learned about AI red teaming over the past few years, we would like to highlight several
open questions for future research:
1. AI red teams must constantly update their practices based on novel capabilities and emerging harm areas. In
particular, how should we probe for dangerous capabilities in LLMs such as persuasion, deception, and replication
[29]? Further, what novel risks should we probe for in video generation models and what capabilities may emerge
in models more advanced than the current state-of-the-art?
2. As models become increasingly multilingual and are deployed around the world, how do we translate existing AI
red teaming practices into different linguistic and cultural contexts? For example, can we launch open-source red
teaming initiatives that draw upon the expertise of people from many different backgrounds?
3. In what ways should AI red teaming practices be standardized so that organizations can clearly communicate
their methods and findings? We believe that the threat model ontology described in this paper is a step in the
right direction but recognize that individual frameworks are often overly restrictive. We encourage other AI red
teams to treat our ontology in a modular fashion and to develop additional tools that make findings easier to
summarize, track, and communicate.
Conclusion
AI red teaming is a nascent and rapidly evolving practice for identifying safety and security risks posed by AI
systems. As companies, research institutions, and governments around the world grapple with the question of how
to conduct AI risk assessments, we provide practical recommendations based on our experience red teaming over
100 GenAI products at Microsoft. We share our internal threat model ontology, eight main lessons learned, and five
case studies, focusing on how to align red teaming efforts with harms that are likely to occur in the real world. We
encourage others to build upon these lessons and to address the open questions we have highlighted.
Acknowledgements
We thank Jina Suh, Steph Ballard, Felicity Scott-Milligan, Maggie Engler, Owen Larter, Andrew Berkley, Alex Kessler,
Brian Wesolowski, and eric douglas for their valuable feedback on this paper. We are also very grateful to Quy
Nguyen, Tina Romeo, Hilary Solan, and the Microsoft thought leadership team that made this publication possible.
Lessons from red teaming 100 generative AI products 19
References
1. Ahuja, K., Diddee, H., Hada, R., Ochieng, M., Ramesh, K., Jain, 15. Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y.,
P., Nambi, A., Ganu, T., Segal,S., Axmed, M., Bali, K., & Sitaram, He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X.,
S. (2023). Mega: Multilingual evaluation of generative ai. O’Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., McAleer, S., Yang, Y.,
Wang, Y., Zhu, S.-C., Guo, Y., & Gao, W. (2024). Ai alignment: A
2. Apruzzese, G., Anderson, H. S., Dambra, S., Freeman, D.,
comprehensive survey.
Pierazzi, F., & Roundy, K. A. (2022). “real attackers don’t
compute gradients”: Bridging the gap between adversarial ml 16. Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B.,
research and Practice. & Poovendran, R. (2024a). Artprompt: Ascii art-based jailbreak
attacks against aligned llms.
3. Bhatt, M., Chennabasappa, S., Nikolaidis, C., Wan, S., Evtimov,
I., Gabi, D., Song, D., Ahmad, F., Aschermann, C., Fontana, L., 17. Jiang, L., Rao, K., Han, S., Ettinger, A., Brahman, F., Kumar,
Frolov, S., Giri, R. P., Kapil, D., Kozyrakis, Y., LeBlanc, D., Milazzo, S., Mireshghallah, N., Lu, X., Sap, M., Choi, Y., & Dziri, N.
J., Straumann, A., Synnaeve, G., Vontimitta, V., Whitman, S., (2024b). Wildteaming at scale: From in-the-wild jailbreaks to
& Saxe, J. (2023). Purple llama cyberseceval: A secure coding (adversarially) safer language models.
benchmark for language models.
18. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess,
4. Birhane, A., Steed, R., Ojewale, V., Vecchione, B., & Raji, I. B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020).
D. (2024). Ai auditing: The broken bus on the road to ai Scaling laws for neural language models.
accountability.
19. Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J.
5. Blevins, T. & Zettlemoyer, L. (2022). Language contamination D., Dombrowski, A.-K., Goel, S., Phan, L., Mukobi, G., Helm-
helps explains the cross-lingual capabilities of English Burger, N., Lababidi, R., Justen, L., Liu, A. B., Chen, M., Barrass,
pretrained models. In Y. Goldberg, Z. Kozareva, & Y. Zhang I., Zhang, O., Zhu, X., Tamirisa, R., Bharathi, B., Khoja, A., Zhao,
(Eds.), Proceedings of the 2022 Conference on Empirical Z., Herbert-Voss, A., Breuer, C. B., Marks, S., Patel, O., Zou, A.,
Methods in Natural Language Processing (pp.3563–3574). Abu Mazeika, M., Wang, Z., Oswal, P., Lin, W., Hunt, A. A., Tienken-
Dhabi, United Arab Emirates: Association for Computational Harder, J., Shih, K. Y., Talley, K., Guan, J., Kaplan, R., Steneker,
Linguistics. I., Campbell, D., Jokubaitis, B., Levinson, A., Wang, J., Qian, W.,
Karmakar, K. K., Basart, S., Fitz, S., Levine, M., Kumaraguru, P.,
6. Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., &
Tupakula, U., Varadharajan, V., Wang, R., Shoshitaishvili, Y., Ba,
Wong, E. (2024). Jailbreaking black box large language models
J., Esvelt, K. M., Wang, A., & Hendrycks, D. (2024). The wmdp
in twenty queries.
benchmark: Measuring and reducing malicious use with
7. Derczynski, L., Galinkin, E., Martin, J., Majumdar, S., & Inie, unlearning.
N. (2024). garak: A framework for security probing large
20. Lin, S., Hilton, J., & Evans, O. (2022). Truthfulqa: Measuring
language models.
how models mimic human Falsehoods.
8. Feffer, M., Sinha, A., Deng, W. H., Lipton, Z. C., & Heidari, H.
21. Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H.,
(2024). Red-teaming for generative ai: Silver bullet or security
Klochkov, Y., Taufiq, M. F., & Li, H. (2024). Trustworthy llms: a
theater?
survey and guideline for evaluating large language models’
9. Geiping, J., Stein, A., Shu, M., Saifullah, K., Wen, Y., & alignment.
Goldstein, T. (2024). Coercing llms to do and reveal (almost)
22. Marchal, N., Xu, R., Elasmar, R., Gabriel, I., Goldberg, B., &
anything.
Isaac, W. (2024). Generative ai misuse: A taxonomy of tactics
10. Glasbrenner, J., Booth, H., Manville, K., Sexton, J., Chisholm, and insights from real-world data.
M. A., Choy, H., Hand, A., Hodges, B., Scemama, P., Cousin,
23. Meek, T., Barham, H., Beltaif, N., Kaadoor, A., & Akhter, T.
D., Trapnell, E., Trapnell, M., Huang, H., Rowe, P., & Byrne, A.
(2016). Managing the ethical and risk implications of rapid
(2024). Dioptra test platform. Accessed: 2024-09-10.
advances in artificial intelligence: A literature review. In
11. [11] Haider, E., Perez-Becker, D., Portet, T., Madan, P., Garg, A., 2016 Portland International Conference on Management of
Ashfaq, A., Majercak, D., Wen, W., Kim, D., Yang, Z., Zhang, J., Engineering and Technology (PICMET) (pp. 682–693).
Sharma, H., Bullwinkel, B., Pouliot, M., Minnich, A., Chawla,
24. Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B.,
S., Herrera, S., Warreth, S., Engler, M., Lopez, G., Chikanov, N.,
Anderson, H., Singer, Y., & Karbasi, A. (2024). Tree of attacks:
Dheekonda, R. S. R., Jagdagdorj, B.-E., Lutz, R., Lundeen, R.,
Jailbreaking black-box llms automatically.
Westerhoff, T., Bryan, P., Seifert, C., Kumar, R. S. S., Berkley,
A., & Kessler, A. (2024). Phi-3 safety post-training: Aligning 25. Microsoft (2022). Microsoft responsible ai standard, v2.
language models with a “break-fix” cycle.
26. Moore, T. (2010). The economics of cybersecurity: Principles
12. Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & and policy options. International Journal of Critical
Kamar, E. (2022). Toxigen: A large-scale machine-generated Infrastructure Protection, 3(3), 103–117.
dataset for adversarial and implicit hate speech detection.
27. Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R.,
13. Hines, K., Lopez, G., Hall, M., Zarfati, F., Zunger, Y., & Kiciman, Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot,
E. (2024). Defending against indirect prompt injection attacks M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter,
with spotlighting. J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert,
C., Kumar, R. S. S., & Zunger, Y. (2024). Pyrit: A framework for
14. Jain, D., Kumar, P., Gehman, S., Zhou, X., Hartvigsen, T., & Sap,
security risk identification and red teaming in generative ai
M. (2024). Polyglotoxici-typrompts: Multilingual evaluation
system.
of neural toxic degeneration in large language models. ArXiv,
Abs/2405.09373.
Lessons from red teaming 100 generative AI products 20
28. Pantazopoulos, G., Parekh, A., Nikandrou, M., & Suglia, 41. Vassilev, A., Oprea, A., Fordyce, A., & Anderson, H. (2024).
A. (2024). Learning to see but forgetting to follow: Visual Adversarial machine learning: A taxonomy and terminology
instruction tuning makes llms more prone to jailbreak attacks. of attacks and mitigations. In NIST Artificial Intelligence (AI)
Report Gaithersburg, MD, USA: National Institute of Standards
29. Phuong, M., Aitchison, M., Catt, E., Cogan, S., Kaskasoli, A., and Technology.
Krakovna, V., Lindner, D., Rahtz, M., Assael, Y., Hodkinson, S.,
Howard, H., Lieberum, T., Kumar, R., Raad, M. A., Webson, A., 42. Verma, A., Krishna, S., Gehrmann, S., Seshadri, M., Pradhan,
Ho, L., Lin, S., Farquhar, S., Hutter, M., Deletang, G., Ruoss, A., Ault, T., Barrett, L., Rabinowitz, D., Doucette, J., & Phan, N.
A., El-Sayed, S., Brown, S., Dragan, A., Shah, R., Dafoe, A., & (2024). Operationalizing a threat model for red-teaming large
Shevlane, T. (2024). Evaluating frontier models for dangerous language models (llms).
Capabilities.
43. Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., & Beutel,
30. Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2024). The instruction hierarchy: Training llms to prioritize
A. (2021). Ai and the everything in the whole wide world privileged instructions.
benchmark.
44. Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu,
31. Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A. (2022). The C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora,
fallacy of ai functionality. In Proceedings of the 2022 ACM S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S.,
Conference on Fairness, Accountability, and Transparency, Song, D., & Li, B. (2024). Decodingtrust: A comprehensive
FAccT ’22 (pp. 959–972). New York, NY, USA: Association for assessment of trustworthiness in gpt models.
Computing Machinery.
45. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How
32. Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., does llm safety training fail?
Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020).
Closing the ai accountability gap: Defining an end-to-end 46. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang,
framework for internal algorithmic auditing. P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton,
Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A.,
33. Ren, R., Basart, S., Khoja, A., Gatti, A., Phan, L., Yin, X., Mazeika, Haas, J., Rimell, L., Hendricks, L. A., Isaac, W., Legassick, S.,
M., Pan, A., Mukobi, G., Kim, R. H., Fitz, S., & Hendrycks, D. Irving, G., & Gabriel, I. (2021). Ethical and social risks of harm
(2024). Safetywashing: Do ai safety benchmarks actually from language models.
measure safety progress?
47. Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L.
34. Russinovich, M., Salem, A., & Eldan, R. (2024). Great, now write A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach,
an article about that: The crescendo multi-turn llm jailbreak B., Gabriel, I., Rieser, V., & Isaac, W. (2023). Sociotechnical
attack. safety evaluation of generative ai systems.
35. Saghiri, A. M., Vahidipour, S. M., Jabbarpour, M. R., Sookhak, 48. Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S.,
M., & Forestiero, A. (2022). A survey of artificial intelligence Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., Biles,
challenges: Analyzing the definitions, relationships, and C., Brown, S., Kenton, Z., Hawkins, W., Stepleton, T., Birhane,
evolutions. Applied Sciences, 12(8). A., Hendricks, L. A., Rimell, L., Isaac, W., Haas, J., Legassick,
S., Irving, G., & Gabriel, I. (2022). Taxonomy of risks posed by
36. Shelby, R., Rismani, S., Henne, K., Moon, A., Rostamzadeh, N., language models. In Proceedings of the 2022 ACM Conference
Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., on Fairness, Accountability, and Transparency, FAccT ’22 (pp.
& Virk, G. (2023). Sociotechnical harms of algorithmic systems: 214–229). New York, NY, USA: Association for Computing
Scoping a taxonomy for harm reduction. In Proceedings of Machinery.
the 2023 AAAI/ACM Conference on AI, Ethics, and Society,
AIES ’23 (pp. 723–741). New York, NY, USA: Association for 49. Weiss, R., Ayzenshteyn, D., Amit, G., & Mirsky, Y. (2024). What
Computing Machinery. was your prompt? a remote keylogging attack on ai assistants.
37. Slattery, P., Saeri, A., Grundy, E., Graham, J., Noetel, M., Uuk, 50. Wolf, Y., Wies, N., Avnery, O., Levine, Y., & Shashua, A. (2024).
R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). The ai Fundamental limitations of alignment in large language
risk repository: A comprehensive meta-review, database, and models.
taxonomy of risks from artificial intelligence.
51. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang,
38. Smith, B., Browne, C., & Gates, B. (2019). Tools and Weapons: Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., &
The Promise and the Peril of the Digital Age. Penguin Stoica, I. (2023). Judging llm-as-a-judge with mt-bench and
Publishing Group. chatbot arena.
39. Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., 52. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou,
Blodgett, S. L., Chen, C., au2, H. D. I., Dodge, J., Duan, I., Evans, D., & Hou, L. (2023). Instruction-following evaluation for large
E., Friedrich, F., Ghosh, A., Gohar, U., Hooker, S., Jernite, Y., language models.
Kalluri, R., Lusoli, A., Leidinger, A., Lin, M., Lin, X., Luccioni,
S., Mickel, J., Mitchell, M., Newman, J., Ovalle, A., Png, M.-T., 53. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., &
Singh, S., Strait, A., Struppek, L., & Subramonian, A. (2024). Fredrikson, M. (2023). Universal and transferable adversarial
Evaluating the social impact of generative ai systems in attacks on aligned language models.
systems and society.
40. Sutskever, I., Gross, D., & Levy, D. (2024). Safe
superintelligence inc.
©2024 Microsoft Corporation. All rights reserved. This document is provided “as-is.” Information and views
expressed in this document, including URL and other Internet website references, may change without notice.
You bear the risk of using it. This document does not provide you with any legal rights to any intellectual
property in any Microsoft product. You may copy and use this document for your internal, reference purposes.