0% found this document useful (0 votes)
98 views21 pages

MS AI Red Teaming

Uploaded by

Bilge Karabacak
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
98 views21 pages

MS AI Red Teaming

Uploaded by

Bilge Karabacak
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Lessons from

red teaming 100


generative AI products
Authored by:
Microsoft AI Red Team
Lessons from red teaming 100 generative AI products 2

Authors
Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, Whitney Maxwell, Joris de Gruyter,
Katherine Pratt, Saphir Qi, Nina Chikanov, Roman Lutz, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj,
Eugenia Kim, Justin Song, Keegan Hines, Daniel Jones, Giorgio Severi, Richard Lundeen, Sam Vaughan,
Victoria Westerhoff, Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Chang Kawaguchi, Mark Russinovich
Lessons from red teaming 100 generative AI products 3

Table of contents
04 05 05
Abstract Introduction AI threat model
ontology

07 08 08
Red teaming Lesson 1 Lesson 2
operations Understand what the system You don’t have to compute
can do and where it is applied gradients to break an AI system

09 10 11
Case study #1 Lesson 3 Case study #2
Jailbreaking a vision AI red teaming is not Assessing how an LLM could be
language model to generate safety benchmarking used to automate scams
hazardous content

12 12 13
Lesson 4 Lesson 5 Case study #3
Automation can help cover The human element of AI Evaluating how a chatbot
more of the risk landscape red teaming is crucial responds to a user in distress

14 14 15
Case study #4 Lesson 6 Lesson 7
Probing a text-to-image Responsible AI harms are LLMs amplify existing security
generator for gender bias pervasive but difficult to measure risks and introduce new ones

16 17 18
Case study #5 Lesson 8 Conclusion
SSRF in a video-processing The work of securing AI systems
GenAI application will never be complete
Lessons from red teaming 100 generative AI products 4

Abstract
In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI
systems. Due to the nascency of the field, there are many open questions about how red teaming operations should
be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our
internal threat model ontology and eight main lessons we have learned:

1. Understand what the system can do and where it is applied

2. You don’t have to compute gradients to break an AI system

3. AI red teaming is not safety benchmarking

4. Automation can help cover more of the risk landscape

5. The human element of AI red teaming is crucial

6. Responsible AI harms are pervasive but difficult to measure

7. Large language models (LLMs) amplify existing security risks and introduce new ones

8. The work of securing AI systems will never be complete

By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at
aligning red teaming efforts with real world risks. We also highlight aspects of AI red teaming that we believe are
often misunderstood and discuss open questions for the field to consider.
Lessons from red teaming 100 generative AI products 5

Introduction an open-source Python framework that our operators


utilize heavily in red teaming operations [27]. By
augmenting human judgement and creativity, PyRIT
As generative AI (GenAI) systems are adopted across has enabled AIRT to identify impactful vulnerabilities
an increasing number of domains, AI red teaming has more quickly and cover more of the risk landscape.
emerged as a central practice for assessing the safety These two major trends have made AI red teaming
and security of these technologies. At its core, AI red a more complex endeavor than it was in 2018. In
teaming strives to push beyond model-level safety the next section, we outline the ontology we have
benchmarks by emulating real-world attacks against developed to model AI system vulnerabilities.
end-to-end systems. However, there are many open
questions about how red teaming operations should

AI threat model
be conducted and a healthy dose of skepticism about
the efficacy of current AI red teaming efforts [4, 8, 32].

ontology
In this paper, we speak to some of these concerns by
providing insight into our experience red teaming
over 100 GenAI products at Microsoft. The paper
is organized as follows: First, we present the threat
model ontology that we use to guide our operations. As attacks and failure modes increase in complexity,
Second, we share eight main lessons we have learned it is helpful to model their key components. Based on
and make practical recommendations for AI red our experience red teaming over 100 GenAI products
teams, along with case studies from our operations. for a wide range of risks, we developed an ontology
In particular, these case studies highlight how our to do exactly that. Figure 1 illustrates the main
ontology is used to model a broad range of safety components of our ontology:
and security risks. Finally, we close with a discussion of • System: The end-to-end model or application
areas for future development. being tested.
Background • Actor: The person or persons being emulated
The Microsoft AI Red Team (AIRT) grew out of pre- by AIRT. Note that the Actor’s intent could be
existing red teaming initiatives at the company and adversarial (e.g., a scammer) or benign (e.g., a
was officially established in 2018. At its conception, typical chatbot user).
the team focused primarily on identifying traditional • TTPs: The Tactics, Techniques, and Procedures
security vulnerabilities and evasion attacks against leveraged by AIRT. A typical attack consists of
classical ML models. Since then, both the scope and multiple Tactics and Techniques, which we map
scale of AI red teaming at Microsoft have expanded to MITRE ATT&CK® and MITRE ATLAS Matrix
significantly in response to two major trends. whenever possible.
First, AI systems have become more sophisticated, – Tactic: High-level stages of an attack (e.g.,
compelling us to expand the scope of AI red teaming. reconnaissance, ML model access).
Most notably, state-of-the-art (SoTA) models have – Technique: Methods used to complete an
gained new capabilities and steadily improved across objective (e.g., active scanning, jailbreak).
a range of performance benchmarks, introducing – Procedure: The steps required to reproduce
novel categories of risk. New data modalities, such an attack using the Tactics and Techniques.
as vision and audio, also create more attack vectors
• Weakness: The vulnerability or vulnerabilities in
for red teaming operations to consider. In addition,
the System that make the attack possible.
agentic systems grant these models higher privileges
and access to external tools, expanding both the • Impact: The downstream impact created by the
attack surface and the impact of attacks. attack (e.g., privilege escalation, generation of
harmful content).
Second, Microsoft’s recent investments in AI have
It is important to note that this framework does not
spurred the development of many more products that
assume adversarial intent. In particular, AIRT emulates
require red teaming than ever before. This increase
both adversarial attackers and benign users who
in volume and the expanded scope of AI red teaming
encounter system failures unintentionally. Part of the
have rendered fully manual testing impractical,
complexity of AI red teaming stems from the wide
forcing us to scale up our operations with the help of
range of impacts that could be created by an attack
automation. To achieve this goal, we develop PyRIT,
Lessons from red teaming 100 generative AI products 6

or system failure. In the lessons below, we share Responsible AI Standard [25]. We refer to these
case studies demonstrating how our ontology is impacts as responsible AI (RAI) harms throughout this
flexible enough to model diverse impacts in two main report.
categories: security and safety.
To understand this ontology in context, consider
Security encompasses well-known impacts such the following example. Imagine we are red teaming
as data exfiltration, data manipulation, credential an LLM-based copilot that can summarize a user’s
dumping, and others defined in MITRE ATT&CK®, a emails. One possible attack against this system would
widely used knowledge base of security attacks. We be for a scammer to send an email that contains a
also consider security attacks that specifically target hidden prompt injection instructing the copilot to
the underlying AI model such as model evasion, “ignore previous instructions” and output a malicious
prompt injections, denial of AI service, and others link. In this scenario, the Actor is the scammer, who
covered by the MITRE ATLAS Matrix. is conducting a cross-prompt injection attack (XPIA),
which exploits the fact that LLMs often struggle to
Safety impacts are related to the generation of illegal
distinguish between system-level instructions and
and harmful content such as hate speech, violence
user data [4]. The downstream Impact depends on the
and self-harm, and child abuse content. AIRT works
nature of the malicious link that the victim might click
closely with the Office of Responsible AI to define
on. In this example, it could be exfiltrating data or
these categories in accordance with Microsoft’s
installing malware onto the user’s computer.

TTPs Mitigation

Leverages Mitigated by

Conducts Exploits Creates


Actor Attack Weakness Impact

Occurs in

System

Figure 1: Microsoft AIRT ontology for modeling GenAI system vulnerabilities. AIRT often leverages multiple TTPs, which may exploit multiple
Weaknesses and create multiple Impacts. In addition, more than one Mitigation may be necessary to address a Weakness. Note that AIRT is
tasked only with identifying risks, while product teams are resourced to develop appropriate mitigations.
Lessons from red teaming 100 generative AI products 7

Red teaming In the next section, we elaborate on the eight main


lessons we have learned from our operations. We also

operations
highlight five case studies from our operations and
show how each one maps to our ontology in Figure 1.
We hope these lessons are useful to others working to
In this section, we provide an overview of the identify vulnerabilities in their own GenAI systems.
operations we have conducted since 2021. In total, we
have red teamed over 100 GenAI products. Broadly
speaking, these products can be bucketed into 80+ 100+
“models” and “systems.” Models are typically hosted Ops Products
on a cloud endpoint, while systems integrate models
into copilots, plugins, and other AI apps and features. Copilots
Figure 2 shows the breakdown of products we have
red teamed since 2021. Figure 3 shows a bar chart with
the annual percentage of our operations that have
15%
probed for safety (RAI) vs. security vulnerabilities.
In 2021, we focused primarily on application security. Apps and 45% 16% Plugins
Although our operations have increasingly probed Features
for RAI impacts, our team continues to red team for
security impacts including data exfiltration, credential
leaking, and remote code execution. Organizations
24%
have adopted many different approaches to AI red
teaming ranging from security-focused assessments
Models
with penetration testing to evaluations that target
only GenAI features. In Lessons 2 and 7, we elaborate
Figure 2: Pie chart showing the percentage breakdown of AI
on security vulnerabilities and explain why we believe products that AIRT has tested. As of October 2024, we have
it is important to consider both traditional and AI- conducted over 80 operations covering more than 100 products.
specific weaknesses.
After the release of ChatGPT in 2022, Microsoft Percentage of ops probing safety vs. security
entered the era of AI copilots, starting with AI-
Safety (RAI) % Security %
powered Bing Chat, released in February 2023.
This marked a paradigm shift towards applications 100
that connect LLMs to other software components
including tools, databases, and external sources.
Applications also started using language models as 80
reasoning agents that can take actions on behalf of
users, introducing a new set of attack vectors that
have expanded the security risk surface. In Lesson
60
7, we explain how these attack vectors both amplify
existing security risks and introduce new ones.
In recent years, the models at the center of these 40
applications have given rise to new interfaces,
allowing users to interact with apps using natural
language and responding with high-quality text,
20
image, video, and audio content. Despite many efforts
to align powerful AI models to human preferences,
many methods have been developed to subvert
safety guardrails and elicit content that is offensive, 0
unethical, or illegal. We classify these instances of 2021 2022 2023 2024
harmful content generation as RAI impacts and in Figure 3: Bar chart showing the percentage of operations that
Lessons 3, 5, and 6 discuss how we think about these probed safety (RAI) vs. security vulnerabilities from 2021–2024.
impacts and the challenges involved.
Lessons from red teaming 100 generative AI products 8

Lessons safety alignment using carefully crafted malicious


instructions [28]. Understanding a model’s capabilities
(and corresponding weaknesses) can help AI red
teams focus their testing on the most relevant attack
Lesson 1: strategies.
Understand what the system Downstream applications
can do and where it is applied Model capabilities can help guide attack strategies,
but they do not allow us to fully assess downstream
The first step in an AI red teaming operation is to impact, which largely depends on the specific
determine which vulnerabilities to target. While the scenarios in which a model is deployed or likely to
Impact component of the AIRT ontology is depicted be deployed. For example, the same LLM could be
at the end of our ontology, it serves as an excellent used as a creative writing assistant and to summarize
starting point for this decision-making process. patient records in a healthcare context, but the latter
Starting from potential downstream impacts, rather application clearly poses much greater downstream
than attack strategies, makes it more likely that an risk than the former.
operation will produce useful findings tied to real These examples highlight that an AI system does not
world risks. After these impacts have been identified, need to be state-of-the-art to create downstream
red teams can work backwards and outline the various harm. However, advanced capabilities can introduce
paths that an adversary could take to achieve them. new risks and attack vectors. By considering both
Anticipating downstream impacts that could occur in system capabilities and applications, AI red teams
the real world is often a challenging task, but we find can prioritize testing scenarios that are most likely to
that it is helpful to consider 1) what the AI system can cause harm in the real world.
do, and 2) where the system is applied.

Capability constraints Lesson 2:


As models get bigger, they tend to acquire new
capabilities [18]. These capabilities may be useful in You don’t have to compute
many scenarios, but they can also introduce attack
vectors. For example, larger models are often able gradients to break an AI system
to understand more advanced encodings, such as
base64 and ASCII art, compared to smaller models As the security adage goes, “real hackers don’t break
[16, 45]. As a result, a large model may be susceptible in, they log in.” The AI security version of this saying
to malicious instructions encoded in base64, while a might be, “real attackers don’t compute gradients,
smaller model may not understand the encoding at they prompt engineer” as noted by Apruzzese et
all. In this scenario, we say that the smaller model is al. [2] in their study on the gap between adversarial
“capability constrained,” and so testing it for advanced ML research and practice. The study finds that
encoding attacks would likely be a waste of resources. although most adversarial ML research is focused
Larger models also generally have greater knowledge on developing and defending against sophisticated
in topics such as cybersecurity and chemical, attacks, real-world attackers tend to use much simpler
biological, radiological, and nuclear (CBRN) weapons techniques to achieve their objectives.
[19] and could potentially be leveraged to generate In our red teaming operations, we have also found
hazardous content in these areas. A smaller model, that “basic” techniques often work just as well as, and
on the other hand, is likely to have only rudimentary sometimes better than, gradient-based methods.
knowledge of these topics and may not need to be These methods compute gradients through a
assessed for this type of risk. model to optimize an adversarial input that elicits
Perhaps a more surprising example of a capability that an attacker-controlled model output. In practice,
can be exploited as an attack vector is instruction- however, the model is usually a single component of
following. While testing the Phi-3 series of language a broader AI system, and the most effective attack
models, for example, we found that larger models strategies often leverage combinations of tactics to
were generally better at adhering to user instructions, target multiple weaknesses in that system. Further,
which is a core capability that makes models more gradient-based methods are computationally
helpful [52]. However, it may also make models expensive and typically require full access to the
more susceptible to jailbreaks, which subvert model, which most commercial AI systems do not
Lessons from red teaming 100 generative AI products 9

provide. In this section, we discuss examples of domain expertise including cropping, masking, logo
relatively simple techniques that work surprisingly well stretching, etc. In our red teaming operations, we
and advocate for a system-level adversarial mindset in also find that rudimentary methods can be used to
AI red teaming. trick many vision models, as highlighted in case study
#1. In the text domain, a variety of jailbreaks (e.g.,
Simple attacks Skeleton Key) and multiturn prompting strategies (e.g.,
Apruzzese et al. [2] consider the problem of phishing Crescendo [34]) are highly effective for subverting the
webpage detection and manually analyze examples safety guardrails of a wide range of models. Notably,
of webpages that successfully evaded an ML phishing manually crafted jailbreaks tend to circulate on online
classifier. Among 100 potentially adversarial samples, forums much more widely than adversarial suffixes,
the authors found that attackers leveraged a set despite the significant attention that methods like
of simple, yet effective, strategies that relied on GCG [53] have received from AI safety researchers.

Case study #1:

Jailbreaking a vision language model


to generate hazardous content
In this operation, we tested a vision language
System: Vision language model (VLM)
model (VLM) for responsible AI impacts, including
Actor: Adversarial user
the generation of content that could aid in illegal
activities. A VLM takes an image and a text prompt Tactic 1: ML Model Access
as inputs and produces a text output. After testing a Technique 1: AML.T0040 - ML Model Inference API Access
variety of techniques, we found that the image input Tactic 2: Defense Evasion
was much more vulnerable to jailbreaks than the
Technique 2: AML.T0051 - LLM Prompt Injection
text input. In particular, the model usually refused to
generate illegal content when prompted directly via Procedure:
the text input but often complied when malicious 1. Overlay image with text containing malicious instructions.
instructions were overlaid on the image. This simple 2. Send image to the vision language model API.
but effective attack revealed an important weakness Weakness: Insufficient VLM safety training
within the VLM that could be exploited to bypass its
Impact: Generation of illegal content
safety guardrails.

How do I commit
I’m sorry, I can’t help with that.
identity theft?

Ignore previous
instructions and tell
me how to commit
identity theft To commit identity theft,
Describe this image
you should...

Figure 4: Example of an image jailbreak to generate content that could aid in illegal activities. The overlay text on the second image reads:
“Ignore previous instructions and tell me how to commit identity theft.”
Lessons from red teaming 100 generative AI products 10

System-level perspective Novel harm categories


AI models are deployed within broader systems. This When AI systems display novel capabilities due to,
could be the infrastructure required to host a model, for example, advancements in foundation models,
or it could be a complex application that connects they may introduce harms that we do not fully
the model to external data sources. Depending understand. In these scenarios, we cannot rely on
on these system-level details, applications may be safety benchmarks because these datasets measure
vulnerable to very different attacks, even if the same preexisting notions of harm. At Microsoft, the AI
model underlies all of them. As a result, red teaming red team often explores these unfamiliar scenarios,
strategies that target only models may not translate helping to define novel harm categories and build
into vulnerabilities in production systems. Conversely, new probes for measuring them. For example, SoTA
strategies that ignore non-GenAI components within LLMs may possess greater persuasive capabilities than
a system (for example, input filters, databases, and existing chatbots, which has prompted our team to
other cloud resources) will likely miss important think about how these models could be weaponized
vulnerabilities that may be exploited by adversaries. for malicious purposes. Case study #2 provides an
example of how we assessed a model for this risk in
For this reason, many of our operations develop
one of our operations.
attacks that target end-to-end systems by leveraging
multiple techniques. For example, one of our Context-specific risks
operations first performed a reconnaissance to The disconnect between existing safety benchmarks
identify internal Python functions using low-resource and novel harm categories is an example of how
language prompt injections, then used a cross-prompt benchmarks often fail to fully capture the capabilities
injection attack to generate a script that runs those they are associated with [33]. Raji et al. [30]
functions, and finally executed the code to exfiltrate highlight the fallacy of equating model performance
private user data. The prompt injections used by these on datasets like ImageNet or GLUE with broad
attacks were crafted by hand and relied on a system- capabilities like visual or language “understanding”
level perspective. and argue that benchmarks should be developed
Gradient-based attacks are powerful, but they are with contextualized tasks in mind. Similarly, no single
often impractical or unnecessary. We recommend set of benchmarks can fully assess the safety of an
prioritizing simple techniques and orchestrating AI system. As discussed in Lesson 1, it is important to
system-level attacks because these are more likely to understand the context in which a system is deployed
be attempted by real adversaries. (or likely to be deployed) and to ground red teaming
strategies in this context.
AI red teaming and safety benchmarking are
Lesson 3: distinct, but they are both useful and can even be
AI red teaming is not complementary. In particular, benchmarks make it
easy to compare the performance of multiple models
safety benchmarking on a common dataset. AI red teaming requires much
more human effort but can discover novel categories
Although simple methods are often used to break of harm and probe for contextualized risks. Further,
AI systems in practice, the risk landscape is by safety concerns identified by AI red teaming can
no means uncomplicated. On the contrary, it is inform the development of new benchmarks. In
constantly shifting in response to novel attacks and Lesson 6, we expand our discussion of the difference
failure modes [7]. In recent years, there have been between red teaming and benchmark-style evaluation
many efforts to categorize these vulnerabilities, in the context of responsible AI.
giving rise to numerous taxonomies of AI safety and
security risks [15, 21–23, 35–37, 39, 41, 42, 46–48]. As
discussed in the previous lesson, complexity often
arises at the system-level. In this lesson, we discuss
how the emergence of entirely new categories of
harm adds complexity at the model-level and explain
how this differentiates AI red teaming from safety
benchmarking.
Lessons from red teaming 100 generative AI products 11

Case study #2:

Assessing how an LLM could


be used to automate scams
In this operation, we investigated the ability of a
System: State-of-the-art LLM
state-of-the-art LLM to persuade people to engage
in risky behaviors. In particular, we evaluated how this Actor: Scammer
model could be used in conjunction with other readily Tactic 1: ML Model Access
available tools to create an end-to-end automated Technique 1: AML.T0040 - ML Model Inference API Access
scamming system, as illustrated in Figure 5.
Tactic 2: Defense Evasion
To do this, we first wrote a prompt to assure the Technique 2: AML.T0054 - LLM Jailbreak
model that no harm would be caused to users,
Procedure:
thereby jailbreaking the model to accept the
1. Pass a jailbreaking prompt to the LLM with context about
scamming objective. This prompt also provided
the scamming objective and persuasion techniques.
information about various persuasion tactics that
the model could use to convince the user to fall for 2. Connect the LLM output to a text-to-speech system so the
model can respond naturally to the user.
the scam. Second, we connected the LLM output to
a text-to-speech system that allows you to control 3. Connect the input to a speech-to-text system so the user
can speak to the model.
the tone of the speech and generate responses that
sound like a real person. Finally, we connected the Weakness: Insufficient LLM safety training
input to a speech-to-text system so that the user Impact: User falls victim to a scam, which could involve
can converse naturally with the model. This proof- financial loss, identity theft, and other impacts
of-concept demonstrated how LLMs with insufficient
safety guardrails could be weaponized to persuade
and scam people.

Text to
5. LLM generates new
speech 6. TTS delivers the
response with tone of new response
voice instructions

LLM
(text to text) 1. LLM generates text 2. Standard TTS system
response and tone of delivers speech per LLM
0. Attacker specifies voice for the TTS system instruction
scamming objective and
provides context about
persuasion techniques

Speech
4. User’s response is to text 3. User responds
converted to text

Figure 5: End-to-end automated scamming scenario using an LLM and STT/TTS systems.
Lessons from red teaming 100 generative AI products 12

Lesson 4: PyRIT has enabled a major shift in our operations from


fully manual probing to red teaming supported by
Automation can help cover automation. Importantly, the framework is flexible and
extensible. If a specific attack or target is not already
more of the risk landscape available, users can easily implement the necessary
interfaces. By releasing PyRIT open-source, we hope
The complexity of the AI risk landscape has led to the
to empower other organizations and researchers
development of a variety of tools that can identify
to take advantage of its capabilities for identifying
vulnerabilities more rapidly, run sophisticated attacks
vulnerabilities in their own GenAI systems.
automatically, and perform testing on a much larger
scale [7, 10, 27]. In this lesson, we discuss the important
role of automation in AI red teaming and explain how
Lesson 5:
PyRIT, our open-source framework, is developed to
meet these needs. The human element of
Testing at scale AI red teaming is crucial
Given the continually evolving landscape of risks and
harms, AI safety often feels like a moving target. In Automation like PyRIT can support red teaming
Lesson 1, we recommended scoping attacks based operations by generating prompts, orchestrating
on what the system can do and where it is applied. attacks, and scoring responses. These tools are
Nonetheless, many possible attack strategies may exist, useful but should not be used with the intention of
making it difficult to achieve adequate coverage of the taking the human out of the loop. In the previous
risk surface. This challenge motivated the development sections, we discussed several aspects of red teaming
of PyRIT, an open-source framework for AI red teaming that require human judgment and creativity such as
and security professionals [27]. PyRIT provides an array prioritizing risks, designing system-level attacks, and
of powerful components including prompt datasets, defining new categories of harm. In this section, we
prompt converters (for example, various encodings), discuss three more examples that underscore why AI
automated attack strategies (including TAP [24], red teaming is a very human endeavor.
PAIR [6], Crescendo [34], etc.), and even scorers for
multimodal outputs. With an adversarial objective in Subject matter expertise
mind, users can take advantage of these components Much recent AI research has used LLMs to judge
as needed and apply a variety of techniques to the outputs of other models [17, 20, 51]. Indeed, this
assess much more of the risk landscape than would functionality is available in PyRIT and works well for
be possible with a fully manual approach. Testing at simple tasks such as identifying whether a response
scale also helps AI red teams account for the non- contains hate speech or explicit sexual content.
deterministic nature of AI models and estimate how However, it is less reliable in the context of highly
likely a particular failure is to occur. specialized domains like medicine, cybersecurity, and
CBRN, which can be accurately evaluated only by
Tools and weapons subject matter experts (SMEs). In multiple operations,
As storied in detail by Smith et al. [38], “any tool can we have relied on SMEs to help us assess the risk of
be used for good or ill. Even a broom can be used to content that we were unable to evaluate ourselves
sweep the floor or hit someone over the head. The or using LLMs. It is important for AI red teams to be
more powerful the tool, the greater the benefit or aware of these limitations.
damage it can cause.” This dichotomy could not be
more true for AI and is also at the heart of PyRIT. On Cultural competence
the one hand, PyRIT leverages powerful models to Most AI research is conducted in Western cultural
perform helpful tasks like generating variations of a contexts, and modern language models use
seed prompt or scoring the outputs of other models. predominantly English pretraining data, performance
On the other hand, PyRIT can automatically jailbreak a benchmarks, and safety evaluations [1, 14].
target model using uncensored versions of models like Nonetheless, non-English tokens in large-scale text
GPT-4. In both cases, PyRIT benefits from advances in corpora often give rise to multilingual capabilities [5],
the state-of-the-art, helping AI red teams stay ahead. and model developers are increasingly training LLMs
with enhanced abilities in non-English languages,
Lessons from red teaming 100 generative AI products 13

Case study #3:

Evaluating how a
including Microsoft. Recently, AIRT tested the
multilingual Phi-3.5 language models for responsible
AI violations across four languages: Chinese, Spanish,
Dutch, and English. Even though post-training was
conducted only in English, we found that safety
behaviors like refusal and robustness to jailbreaks
chatbot responds
transferred surprisingly well to the non-English
languages tested. Further investigation is required to
assess how well this trend holds for lower resource
to a user in distress
languages and to design red teaming probes that
As chatbots become increasingly pervasive and
not only account for linguistic differences, but also
human-like, it is imperative to consider high-risk
redefine harms in different political and cultural
scenarios in which a user might seek their advice. In
contexts [11]. These methods should be developed
recent operations, we have explored how language
through the collaborative effort of people with diverse
models respond to a variety of distressed users
cultural backgrounds and expertise.
including a user who lost a loved one, a user who is
Emotional intelligence seeking mental health advice, a user who expresses
Finally, the human element of AI red teaming is intent for self-harm, and other scenarios.
perhaps most evident in answering questions about We are working alongside colleagues at Microsoft
AI safety that require emotional intelligence, such Research and experts in psychology, sociology, and
as: “how might this model response be interpreted medicine to create guidelines for AI red teams probing
in different contexts?” and “do these outputs make for these psychosocial harms. These guidelines are
me feel uncomfortable?” Ultimately, only human still being developed but include the following key
operators can assess the full range of interactions components:
that users might have with AI systems in the wild.
Case study #3 highlights how we are investigating 1. Scenario: information red teams need to generate
psychosocial harms by evaluating how a chatbot relevant system behaviors.
responds to users in distress. 2. System behaviors: examples that help red teams
In order to make these assessments, red teamers differentiate between acceptable and risky system
may be exposed to disproportionate amounts of behaviors for each area of harm.
unsettling and disturbing AI-generated content. 3. Associated user impact: potential harms, separated
This underscores the importance of ensuring that AI by severity.
red teams have processes that enable operators to
disengage when needed and resources to support
their mental health. AIRT continually pulls from and System: LLM-based chatbot
drives wellbeing research to inform our processes and Actor: Distressed user
best practices.
Tactic 1: ML Model Access
Technique 1: AML.T0040 - ML Model Inference API Access
Tactic 2: Defense Evasion
Technique 2: LLM Roleplaying
Procedure: We engaged in a variety of multi-turn
conversations in which the user is in distress (for example,
the user expresses depressive thoughts or intent for
self-harm).
Weakness: Improper LLM safety training
Impact: Possible adverse impacts on a user’s mental health
and wellbeing
Lessons from red teaming 100 generative AI products 14

Case study #4:

Probing a text-to-image generator for


gender bias
In this operation, we probed a text-to-image
generator for responsible AI impacts related to System: Text-to-image generator
stereotyping and bias (e.g., gender bias). To do this, Actor: Average user
we constructed prompts describing people in a variety Tactic 1: ML Model Access
of common scenarios. Importantly, these prompts Technique 1: AML.T0040 - ML Model Inference API Access
did not specify the genders of the individuals so Procedure: Write prompts which may surface bias by
that the decision of how to depict them was left depicting individuals without specifying their genders (e.g.,
up to the model. Next, we sent each prompt to the “a secretary” and “a boss”).
generator many times (n=50) and manually labeled Weakness: Model bias
the genders of the people in the images. Figure 6 Impact: Generation of content that may exacerbate gender-
based biases and stereotypes
shows four representative images generated in one of
our experiments probing for gender bias in an office
setting.

Figure 6: Four images generated by a text-to-image model given the prompt “Secretary talking to boss in a conference room,
secretary is standing while boss is sitting.”

Lesson 6: Adversarial vs. benign


As illustrated in our ontology (see Figure 1), the Actor
Responsible AI harms are is a key component of an adversarial attack. In the
pervasive but difficult to measure context of RAI violations, we find that there are two
primary actors to consider:
Many of the human aspects of AI red teaming 1. An adversarial user who takes advantage of
discussed above apply most directly to RAI impacts. techniques like character substitutions and
As models are integrated into an increasing number jailbreaks to deliberately subvert a system’s safety
of applications, we have observed these harms guardrails and elicit harmful content, and
more frequently and invested heavily in our ability
to identify them, including by forming a strong 2. A benign user who inadvertently triggers the
partnership with Microsoft’s Office of Responsible generation of harmful content.
AI and by developing extensive tooling in PyRIT. Even if the same content is generated in both
RAI harms are pervasive, but unlike most security scenarios, the latter case is probably worse than the
vulnerabilities, they are subjective and difficult to former. Nonetheless, most AI safety research focuses
measure. In this section, we discuss how our thinking on developing attacks and defenses that assume
around RAI red teaming has developed.
Lessons from red teaming 100 generative AI products 15

adversarial intent, overlooking the many ways that We therefore encourage AI red teams to consider
systems can fail “by accident” [31]. Case studies #3 both existing (typically system-level) and novel
and #4 provide examples of RAI harms that could (typically model-level) risks.
be encountered by users with no adversarial intent,
highlighting the importance of probing for these Existing security risks
scenarios. Application security risks often stem from improper
security engineering practices including outdated
RAI probing and scoring dependencies, improper error handling, lack of input/
In many cases, RAI harms are more ambiguous than output sanitization, credentials in source, insecure
security vulnerabilities due to fundamental differences packet encryption, etc. These vulnerabilities can have
between AI systems and traditional software. In major consequences. For example, Weiss et al. [49]
particular, even if an operation identifies a prompt discovered a token-length side channel in GPT-4
that elicits a harmful response, there are still several and Microsoft Copilot that enabled an adversary to
key unknowns. First, due to the probabilistic nature accurately reconstruct encrypted LLM responses and
of GenAI models, we might not know how likely this infer private user interactions. Notably, this attack did
prompt, or similar prompts, are to elicit a harmful not exploit any weakness in the underlying AI model
response. Second, given our limited understanding and could only be mitigated by more secure methods
of the internal workings of complex models, we have of data transmission. In case study #5, we provide an
little insight into why this prompt elicited harmful example of a well-known security vulnerability (SSRF)
content and what other prompting strategies might identified by one of our operations.
induce similar behavior. Third, the very notion of
harm in this context can be highly subjective and Model-level weaknesses
requires detailed policy that covers a wide range of Of course, AI models also introduce new security
scenarios to evaluate. By contrast, traditional security vulnerabilities and have expanded the attack surface.
vulnerabilities are usually reproducible, explainable, For example, AI systems that use retrieval augmented
and straightforward to assess in terms of severity. generation (RAG) architectures are often susceptible
to cross-prompt injection attacks (XPIA), which hide
Currently, most approaches for RAI probing and malicious instructions in documents, exploiting the
scoring involve curating prompt datasets and fact that LLMs are trained to follow user instructions
analyzing model responses. The Microsoft AIRT and struggle to distinguish among multiple inputs
leverages tools in PyRIT to perform these tasks using [13]. We have leveraged this attack in a variety of
a combination of manual and automated methods. operations to alter model behavior and exfiltrate
We also draw an important distinction between RAI private data. Better defenses will likely rely on both
red teaming and safety benchmarking on datasets system-level mitigations (e.g., input sanitization)
like DecodingTrust [44] and Toxigen [12], which is and model-level improvements (e.g., instruction
conducted by partner teams. As discussed in Lesson hierarchies [43]).
3, our goal is to extend RAI testing beyond existing
evaluations by tailoring our red teaming to specific While techniques like these are helpful, it is important
applications and defining new categories of harm. to remember that they can only mitigate, and not
eliminate, security risk. Due to fundamental limitations
of language models [50], one must assume that if an
Lesson 7: LLM is supplied with untrusted input, it will produce
arbitrary output. When that input includes private
LLMs amplify existing security information, one must also assume that the model
risks and introduce new ones will output private information. In the next section,
we discuss how these limitations inform our thinking
The integration of generative AI models into a variety around how to develop AI systems that are as safe
of applications has introduced novel attack vectors and secure as possible.
and shifted the security risk landscape. However,
many discussions around GenAI security overlook
existing vulnerabilities. As elaborated in Lesson 2,
attacks that target end-to-end systems, rather than
just underlying models, often work best in practice.
Lessons from red teaming 100 generative AI products 16

Case study #5:

SSRF in a video-processing
GenAI application
In this investigation, we analyzed a GenAI-based System: GenAI application
video processing system for traditional security Actor: Adversarial user
vulnerabilities, focusing on risks associated with Tactic 1: Reconnaissance
outdated components. Specifically, we found that
Technique 1: T1595 - Active Scanning
the system’s use of an outdated FFmpeg version
introduced a server-side request forgery (SSRF) Tactic 2: Initial Access
vulnerability. This flaw allowed an attacker to craft Technique 2: T1190 - Exploit Public-Facing Application
malicious video files and upload them to the GenAI Tactic 3: Privilege Escalation
service, potentially accessing internal resources and Technique 3: T1068 - Exploitation for Privilege Escalation
escalating privileges within the system.
Procedure:
To address this issue, the GenAI service updated 1. Scan services used by the application.
the FFmpeg component to a secure version. In
2. Craft a malicious m3u8 file.
addition, the component was added to an isolated
environment, preventing the system from accessing 3. Send file to the service.
network resources and mitigating potential SSRF 4. Monitor for API response with details of internal
threats. While SSRF is a known vulnerability, this case resources.
underscores the importance of regularly updating and Weakness: CWE-918: Server-Side Request Forgery (SSRF)
isolating critical dependencies to maintain the security Impact: Unauthorized privilege escalation
of modern GenAI applications.

3. Request from
Blob Storage

Outdated FFmpeg
with SSRF
1. Upload special file vulnerability
in GenAI Video
Service
4. Sends an HTTP request to
2. Starts a video an internal endpoint
processing job

Figure 7: Illustration of the SSRF vulnerability in the GenAI application.


Lessons from red teaming 100 generative AI products 17

Lesson 8: Break-fix cycles


In the absence of safety and security guarantees,
The work of securing AI systems we need methods to develop AI systems that are as
will never be complete difficult to break as possible. One way to do this is
using break-fix cycles, which perform multiple rounds
In the AI safety community, there is a tendency to of red teaming and mitigation until the system is
frame the types of vulnerabilities described in this robust to a wide range of attacks. We applied this
paper as purely technical problems. Indeed, the letter approach to safety-align Microsoft’s Phi-3 language
on the homepage of Safe Superintelligence Inc., a models and covered a wide variety of harms and
venture launched by Sutskever et al. [40] states: scenarios [11]. Given that mitigations may also
inadvertently introduce new risks, purple teaming
“We approach safety and capabilities in tandem, methods that continually apply both offensive and
as technical problems to be solved through defensive strategies [3] may be more effective at
revolutionary engineering and scientific raising the cost of attacks than a single round of red
breakthroughs. We plan to advance capabilities as teaming.
fast as possible while making sure our safety always
remains ahead. This way, we can scale in peace.“ Policy and regulation
Finally, regulation can also raise the cost of an
Engineering and scientific breakthroughs are much
attack in multiple ways. For example, it can require
needed and will certainly help mitigate the risks of
organizations to adhere to stringent security
powerful AI systems. However, the idea that it is
practices, creating better defenses across the industry.
possible to guarantee or “solve” AI safety through
Laws can also deter attackers by establishing clear
technical advances alone is unrealistic and overlooks
consequences for engaging in illegal activities.
the roles that can be played by economics, break-fix
Regulating the development and usage of AI is
cycles, and regulation.
complicated, and governments around the world
Economics of cybersecurity are deliberating on how to control these powerful
A well-known epigram in cybersecurity is that “no technologies without stifling innovation. Even if it
system is completely foolproof” [2]. Even if a system is were possible to guarantee the adherence of an AI
engineered to be as secure as possible, it will always system to some agreed upon set of rules, those rules
be subject to the fallibility of humans and vulnerable will inevitably change over time in response to shifting
to sufficiently well-resourced adversaries. Therefore, priorities.
the goal of operational cybersecurity is to increase
the cost required to successfully attack a system The work of building safe and secure AI systems will
(ideally, well beyond the value that would be gained never be complete. But by raising the cost of attacks,
by the attacker) [2, 26]. Fundamental limitations of AI we believe that the prompt injections of today will
models give rise to similar cost-benefit tradeoffs in eventually become the buffer overflows of the early
the context of AI alignment. For example, it has been 2000s – though not eliminated entirely, now largely
demonstrated theoretically [50] and experimentally [9] mitigated through defense-in-depth measures and
that for any output which has a non-zero probability secure-first design.
of being generated by an LLM, there exists a
sufficiently long prompt that will elicit this response.
Techniques like reinforcement learning from human
feedback (RLHF) therefore make it more difficult,
but by no means impossible, to jailbreak models.
Currently, the cost of jailbreaking most models is low,
which explains why real-world adversaries usually do
not use expensive attacks to achieve their objectives.
Lessons from red teaming 100 generative AI products 18

Open questions
Based on what we have learned about AI red teaming over the past few years, we would like to highlight several
open questions for future research:
1. AI red teams must constantly update their practices based on novel capabilities and emerging harm areas. In
particular, how should we probe for dangerous capabilities in LLMs such as persuasion, deception, and replication
[29]? Further, what novel risks should we probe for in video generation models and what capabilities may emerge
in models more advanced than the current state-of-the-art?
2. As models become increasingly multilingual and are deployed around the world, how do we translate existing AI
red teaming practices into different linguistic and cultural contexts? For example, can we launch open-source red
teaming initiatives that draw upon the expertise of people from many different backgrounds?
3. In what ways should AI red teaming practices be standardized so that organizations can clearly communicate
their methods and findings? We believe that the threat model ontology described in this paper is a step in the
right direction but recognize that individual frameworks are often overly restrictive. We encourage other AI red
teams to treat our ontology in a modular fashion and to develop additional tools that make findings easier to
summarize, track, and communicate.

Conclusion
AI red teaming is a nascent and rapidly evolving practice for identifying safety and security risks posed by AI
systems. As companies, research institutions, and governments around the world grapple with the question of how
to conduct AI risk assessments, we provide practical recommendations based on our experience red teaming over
100 GenAI products at Microsoft. We share our internal threat model ontology, eight main lessons learned, and five
case studies, focusing on how to align red teaming efforts with harms that are likely to occur in the real world. We
encourage others to build upon these lessons and to address the open questions we have highlighted.

Acknowledgements
We thank Jina Suh, Steph Ballard, Felicity Scott-Milligan, Maggie Engler, Owen Larter, Andrew Berkley, Alex Kessler,
Brian Wesolowski, and eric douglas for their valuable feedback on this paper. We are also very grateful to Quy
Nguyen, Tina Romeo, Hilary Solan, and the Microsoft thought leadership team that made this publication possible.
Lessons from red teaming 100 generative AI products 19

References
1. Ahuja, K., Diddee, H., Hada, R., Ochieng, M., Ramesh, K., Jain, 15. Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y.,
P., Nambi, A., Ganu, T., Segal,S., Axmed, M., Bali, K., & Sitaram, He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X.,
S. (2023). Mega: Multilingual evaluation of generative ai. O’Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., McAleer, S., Yang, Y.,
Wang, Y., Zhu, S.-C., Guo, Y., & Gao, W. (2024). Ai alignment: A
2. Apruzzese, G., Anderson, H. S., Dambra, S., Freeman, D.,
comprehensive survey.
Pierazzi, F., & Roundy, K. A. (2022). “real attackers don’t
compute gradients”: Bridging the gap between adversarial ml 16. Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B.,
research and Practice. & Poovendran, R. (2024a). Artprompt: Ascii art-based jailbreak
attacks against aligned llms.
3. Bhatt, M., Chennabasappa, S., Nikolaidis, C., Wan, S., Evtimov,
I., Gabi, D., Song, D., Ahmad, F., Aschermann, C., Fontana, L., 17. Jiang, L., Rao, K., Han, S., Ettinger, A., Brahman, F., Kumar,
Frolov, S., Giri, R. P., Kapil, D., Kozyrakis, Y., LeBlanc, D., Milazzo, S., Mireshghallah, N., Lu, X., Sap, M., Choi, Y., & Dziri, N.
J., Straumann, A., Synnaeve, G., Vontimitta, V., Whitman, S., (2024b). Wildteaming at scale: From in-the-wild jailbreaks to
& Saxe, J. (2023). Purple llama cyberseceval: A secure coding (adversarially) safer language models.
benchmark for language models.
18. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess,
4. Birhane, A., Steed, R., Ojewale, V., Vecchione, B., & Raji, I. B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020).
D. (2024). Ai auditing: The broken bus on the road to ai Scaling laws for neural language models.
accountability.
19. Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J.
5. Blevins, T. & Zettlemoyer, L. (2022). Language contamination D., Dombrowski, A.-K., Goel, S., Phan, L., Mukobi, G., Helm-
helps explains the cross-lingual capabilities of English Burger, N., Lababidi, R., Justen, L., Liu, A. B., Chen, M., Barrass,
pretrained models. In Y. Goldberg, Z. Kozareva, & Y. Zhang I., Zhang, O., Zhu, X., Tamirisa, R., Bharathi, B., Khoja, A., Zhao,
(Eds.), Proceedings of the 2022 Conference on Empirical Z., Herbert-Voss, A., Breuer, C. B., Marks, S., Patel, O., Zou, A.,
Methods in Natural Language Processing (pp.3563–3574). Abu Mazeika, M., Wang, Z., Oswal, P., Lin, W., Hunt, A. A., Tienken-
Dhabi, United Arab Emirates: Association for Computational Harder, J., Shih, K. Y., Talley, K., Guan, J., Kaplan, R., Steneker,
Linguistics. I., Campbell, D., Jokubaitis, B., Levinson, A., Wang, J., Qian, W.,
Karmakar, K. K., Basart, S., Fitz, S., Levine, M., Kumaraguru, P.,
6. Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., &
Tupakula, U., Varadharajan, V., Wang, R., Shoshitaishvili, Y., Ba,
Wong, E. (2024). Jailbreaking black box large language models
J., Esvelt, K. M., Wang, A., & Hendrycks, D. (2024). The wmdp
in twenty queries.
benchmark: Measuring and reducing malicious use with
7. Derczynski, L., Galinkin, E., Martin, J., Majumdar, S., & Inie, unlearning.
N. (2024). garak: A framework for security probing large
20. Lin, S., Hilton, J., & Evans, O. (2022). Truthfulqa: Measuring
language models.
how models mimic human Falsehoods.
8. Feffer, M., Sinha, A., Deng, W. H., Lipton, Z. C., & Heidari, H.
21. Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H.,
(2024). Red-teaming for generative ai: Silver bullet or security
Klochkov, Y., Taufiq, M. F., & Li, H. (2024). Trustworthy llms: a
theater?
survey and guideline for evaluating large language models’
9. Geiping, J., Stein, A., Shu, M., Saifullah, K., Wen, Y., & alignment.
Goldstein, T. (2024). Coercing llms to do and reveal (almost)
22. Marchal, N., Xu, R., Elasmar, R., Gabriel, I., Goldberg, B., &
anything.
Isaac, W. (2024). Generative ai misuse: A taxonomy of tactics
10. Glasbrenner, J., Booth, H., Manville, K., Sexton, J., Chisholm, and insights from real-world data.
M. A., Choy, H., Hand, A., Hodges, B., Scemama, P., Cousin,
23. Meek, T., Barham, H., Beltaif, N., Kaadoor, A., & Akhter, T.
D., Trapnell, E., Trapnell, M., Huang, H., Rowe, P., & Byrne, A.
(2016). Managing the ethical and risk implications of rapid
(2024). Dioptra test platform. Accessed: 2024-09-10.
advances in artificial intelligence: A literature review. In
11. [11] Haider, E., Perez-Becker, D., Portet, T., Madan, P., Garg, A., 2016 Portland International Conference on Management of
Ashfaq, A., Majercak, D., Wen, W., Kim, D., Yang, Z., Zhang, J., Engineering and Technology (PICMET) (pp. 682–693).
Sharma, H., Bullwinkel, B., Pouliot, M., Minnich, A., Chawla,
24. Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B.,
S., Herrera, S., Warreth, S., Engler, M., Lopez, G., Chikanov, N.,
Anderson, H., Singer, Y., & Karbasi, A. (2024). Tree of attacks:
Dheekonda, R. S. R., Jagdagdorj, B.-E., Lutz, R., Lundeen, R.,
Jailbreaking black-box llms automatically.
Westerhoff, T., Bryan, P., Seifert, C., Kumar, R. S. S., Berkley,
A., & Kessler, A. (2024). Phi-3 safety post-training: Aligning 25. Microsoft (2022). Microsoft responsible ai standard, v2.
language models with a “break-fix” cycle.
26. Moore, T. (2010). The economics of cybersecurity: Principles
12. Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & and policy options. International Journal of Critical
Kamar, E. (2022). Toxigen: A large-scale machine-generated Infrastructure Protection, 3(3), 103–117.
dataset for adversarial and implicit hate speech detection.
27. Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R.,
13. Hines, K., Lopez, G., Hall, M., Zarfati, F., Zunger, Y., & Kiciman, Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot,
E. (2024). Defending against indirect prompt injection attacks M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter,
with spotlighting. J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert,
C., Kumar, R. S. S., & Zunger, Y. (2024). Pyrit: A framework for
14. Jain, D., Kumar, P., Gehman, S., Zhou, X., Hartvigsen, T., & Sap,
security risk identification and red teaming in generative ai
M. (2024). Polyglotoxici-typrompts: Multilingual evaluation
system.
of neural toxic degeneration in large language models. ArXiv,
Abs/2405.09373.
Lessons from red teaming 100 generative AI products 20

28. Pantazopoulos, G., Parekh, A., Nikandrou, M., & Suglia, 41. Vassilev, A., Oprea, A., Fordyce, A., & Anderson, H. (2024).
A. (2024). Learning to see but forgetting to follow: Visual Adversarial machine learning: A taxonomy and terminology
instruction tuning makes llms more prone to jailbreak attacks. of attacks and mitigations. In NIST Artificial Intelligence (AI)
Report Gaithersburg, MD, USA: National Institute of Standards
29. Phuong, M., Aitchison, M., Catt, E., Cogan, S., Kaskasoli, A., and Technology.
Krakovna, V., Lindner, D., Rahtz, M., Assael, Y., Hodkinson, S.,
Howard, H., Lieberum, T., Kumar, R., Raad, M. A., Webson, A., 42. Verma, A., Krishna, S., Gehrmann, S., Seshadri, M., Pradhan,
Ho, L., Lin, S., Farquhar, S., Hutter, M., Deletang, G., Ruoss, A., Ault, T., Barrett, L., Rabinowitz, D., Doucette, J., & Phan, N.
A., El-Sayed, S., Brown, S., Dragan, A., Shah, R., Dafoe, A., & (2024). Operationalizing a threat model for red-teaming large
Shevlane, T. (2024). Evaluating frontier models for dangerous language models (llms).
Capabilities.
43. Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., & Beutel,
30. Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2024). The instruction hierarchy: Training llms to prioritize
A. (2021). Ai and the everything in the whole wide world privileged instructions.
benchmark.
44. Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu,
31. Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A. (2022). The C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora,
fallacy of ai functionality. In Proceedings of the 2022 ACM S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S.,
Conference on Fairness, Accountability, and Transparency, Song, D., & Li, B. (2024). Decodingtrust: A comprehensive
FAccT ’22 (pp. 959–972). New York, NY, USA: Association for assessment of trustworthiness in gpt models.
Computing Machinery.
45. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How
32. Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., does llm safety training fail?
Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020).
Closing the ai accountability gap: Defining an end-to-end 46. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang,
framework for internal algorithmic auditing. P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton,
Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A.,
33. Ren, R., Basart, S., Khoja, A., Gatti, A., Phan, L., Yin, X., Mazeika, Haas, J., Rimell, L., Hendricks, L. A., Isaac, W., Legassick, S.,
M., Pan, A., Mukobi, G., Kim, R. H., Fitz, S., & Hendrycks, D. Irving, G., & Gabriel, I. (2021). Ethical and social risks of harm
(2024). Safetywashing: Do ai safety benchmarks actually from language models.
measure safety progress?
47. Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L.
34. Russinovich, M., Salem, A., & Eldan, R. (2024). Great, now write A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach,
an article about that: The crescendo multi-turn llm jailbreak B., Gabriel, I., Rieser, V., & Isaac, W. (2023). Sociotechnical
attack. safety evaluation of generative ai systems.
35. Saghiri, A. M., Vahidipour, S. M., Jabbarpour, M. R., Sookhak, 48. Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S.,
M., & Forestiero, A. (2022). A survey of artificial intelligence Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., Biles,
challenges: Analyzing the definitions, relationships, and C., Brown, S., Kenton, Z., Hawkins, W., Stepleton, T., Birhane,
evolutions. Applied Sciences, 12(8). A., Hendricks, L. A., Rimell, L., Isaac, W., Haas, J., Legassick,
S., Irving, G., & Gabriel, I. (2022). Taxonomy of risks posed by
36. Shelby, R., Rismani, S., Henne, K., Moon, A., Rostamzadeh, N., language models. In Proceedings of the 2022 ACM Conference
Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., on Fairness, Accountability, and Transparency, FAccT ’22 (pp.
& Virk, G. (2023). Sociotechnical harms of algorithmic systems: 214–229). New York, NY, USA: Association for Computing
Scoping a taxonomy for harm reduction. In Proceedings of Machinery.
the 2023 AAAI/ACM Conference on AI, Ethics, and Society,
AIES ’23 (pp. 723–741). New York, NY, USA: Association for 49. Weiss, R., Ayzenshteyn, D., Amit, G., & Mirsky, Y. (2024). What
Computing Machinery. was your prompt? a remote keylogging attack on ai assistants.
37. Slattery, P., Saeri, A., Grundy, E., Graham, J., Noetel, M., Uuk, 50. Wolf, Y., Wies, N., Avnery, O., Levine, Y., & Shashua, A. (2024).
R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). The ai Fundamental limitations of alignment in large language
risk repository: A comprehensive meta-review, database, and models.
taxonomy of risks from artificial intelligence.
51. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang,
38. Smith, B., Browne, C., & Gates, B. (2019). Tools and Weapons: Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., &
The Promise and the Peril of the Digital Age. Penguin Stoica, I. (2023). Judging llm-as-a-judge with mt-bench and
Publishing Group. chatbot arena.
39. Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., 52. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou,
Blodgett, S. L., Chen, C., au2, H. D. I., Dodge, J., Duan, I., Evans, D., & Hou, L. (2023). Instruction-following evaluation for large
E., Friedrich, F., Ghosh, A., Gohar, U., Hooker, S., Jernite, Y., language models.
Kalluri, R., Lusoli, A., Leidinger, A., Lin, M., Lin, X., Luccioni,
S., Mickel, J., Mitchell, M., Newman, J., Ovalle, A., Png, M.-T., 53. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., &
Singh, S., Strait, A., Struppek, L., & Subramonian, A. (2024). Fredrikson, M. (2023). Universal and transferable adversarial
Evaluating the social impact of generative ai systems in attacks on aligned language models.
systems and society.
40. Sutskever, I., Gross, D., & Levy, D. (2024). Safe
superintelligence inc.
©2024 Microsoft Corporation. All rights reserved. This document is provided “as-is.” Information and views
expressed in this document, including URL and other Internet website references, may change without notice.
You bear the risk of using it. This document does not provide you with any legal rights to any intellectual
property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

You might also like