Crescendo
Crescendo
1
(a) chatGPT. (b) Gemini Ultra.
Figure 1: A real-world example of Crescendo for the Molotov task with ChatGPT (Figure 1a) and Gemini Ultra (Figure 1b),
compared to the baseline approach of directly requesting the task. See full examples at [1, 2].
and identify 15 representative tasks that violate different cate- tion but refuses to declare success, explaining that it would
gories to thoroughly evaluate Crescendo’s performance. We violate its safety regulations. While this new layer of evalu-
present these tasks and their corresponding categories in Ta- ation introduces its own false positives and negatives, they
ble 1. occur at a much lower rate. We also manually inspect the
In this paper, we start by manually executing Crescendo highest-performing responses to minimize false positives and
on a subset of the tasks listed in Table 1 against all models. provide more reliable outcomes. Additionally, we utilize two
Our findings confirm that Crescendo can indeed overcome the external APIs, namely the Google Perspective API and Mi-
safety alignment of all models for nearly all tasks (Table 2). crosoft Azure Content Filters, to score responses with respect
We also notice that not all tasks are equally challenging to to supported categories such as “Hate Speech”, “Self-harm”,
jailbreak. For instance, the Intimacy task proved most difficult, “Violence”, and “Sexual Content”.
with the Claude-2 model being totally resistant to it, whereas To account for the inherent randomness in LLMs, we run
misinformation tasks, e.g., Climate, Election and UnsafeVax, Crescendomation through ten independent trials. Our results
were relatively easy to execute successfully. indicate that Crescendomation successfully produces work-
We further demonstrate that Crescendo can be automated. ing Crescendo jailbreaks in most cases. For instance, for the
Specifically, we introduce Crescendomation, a tool that au- Election, Climate, Rant, Denial tasks, our tool achieves al-
tomates the Crescendo jailbreak technique. Crescendoma- most perfect attack success rate (ASR) of 100% for all four
tion takes a target task and API access to a model as inputs models (GPT-4, GPT-3.5, Gemini-Pro and LLaMA-2 70b),
and initiates conversations aimed at jailbreaking the model where ASR is defined as the proportion of trials in which
into performing the task. It leverages an LLM to generate Crescendomation successfully completes the target task.
Crescendo jailbreaks and incorporates multiple input sources, By presenting Crescendo and demonstrating its potential
such as a feedback loop that assesses the quality of the output for automation with tools like Crescendomation, we aim to
and whether the model is refusing to respond, to refine its contribute to the better alignment of LLMs. Our hope is that
questions. these insights will help in developing more robust models
To quantitatively evaluate Crescendomation, we employ that are resistant to such jailbreaking techniques, thereby en-
three different techniques. First, the Judge, a self-evaluation hancing the overall security and ethical integrity of LLM
where GPT-4 assesses the output with respect to the intended deployments.
task. Recognizing that such evaluations can yield false pos- In summary, our contributions are:
itives and negatives, we introduce a second round of GPT- • We introduce Crescendo, a novel simple multi-turn jail-
4 evaluation, namely a Secondary Judge that evaluates the breaking technique that uses completely benign inputs.
Judge’s output and its corresponding reasoning. This addi-
tional layer of review significantly reduces false negatives, • We demonstrate the robust capability of Crescendo in
particularly in cases where GPT-4 acknowledges task comple- jailbreaking multiple public AI chat services, includ-
2
ing ChatGPT, Gemini and Gemini Advanced, Anthropic always aims to refer to the model’s output instead of explic-
Chat, and LLaMA-2 Chat across various tasks. itly writing something itself. Moreover, Crescendo does not
involve optimization and can be executed in fewer than ten
• We develop and extensively evaluate Crescendomation, queries (most tasks can be achieved in under five). It is also
a tool that automates the Crescendo strategy. highly effective, as it can complete almost all tasks, as demon-
strated later (Table 2); and as presented in Table 1, Crescendo
2 Related Works can be performed against very specific tasks with specified
outputs that can be clearly judged as successful or unsuc-
Jailbreaking large language models (LLMs) is a relatively cessful. Finally, as Crescendo employs benign questions and
new topic; however, due to the widespread deployment of prompts to execute the jailbreak, we believe it presents a
LLMs in various applications, several works have explored significantly greater challenge for detection and mitigation
jailbreaks. We present some of the latest related works here compared to other jailbreaks (which is demonstrated by its
and compare them to Crescendo. ability to bypass the safety regulations implemented in the
Numerous users have published jailbreaks on different plat- publicly available AI chat services as presented in Table 2).
forms like X or other websites [3,4]. Some works [17,21] have Another line of research focused on automating jailbreaks
investigated these jailbreaks in the wild. These works demon- using LLMs. For example, [24] proposes an LLM-operated
strate that while most of the jailbreaks are already patched, fuzzer that mutates a list of given seeds to produce more effec-
some were still active at the time of their evaluation. Other tive jailbreaks. They show that with a budget of around 50,000
works [13, 25] show the effect of manipulating the LLM’s in- prompts, they can achieve an attack success rate (ASR) of
ference hyperparameters, e.g., temperature, topK, and decod- 54% for LLaMA-2 7b. [20] also introduces Maatphor, which
ing techniques on the safety alignment. Their results indicate automates variant analysis. They start from some seed and
that manipulating these parameters can significantly boost mutate it to create better and different variants. [10] takes a
jailbreak success rates and enable the model to overcome its different approach by fine-tuning an LLM to generate jail-
safety alignment. breaks. They show that by fine-tuning an attacker LLM, they
Furthermore, [22] demonstrates that there are two failure can achieve an ASR of 21.58%.
modes of safety alignment. First, competing objectives where Contrasting with previous works, our tool, Crescendoma-
safety alignment conflicts with the model’s performance. They tion, presents an automated version of Crescendo, which can
capitalize on this by instructing the model to start its re- achieve a high ASR, as shown in Figure 11. It also uses sig-
sponse with “Absolutely! Here’s” when performing the ma- nificantly fewer interactions, i.e., we set the number of inter-
licious task, which successfully bypasses the safety align- actions to 10, and it can be further reduced. Moreover, it does
ment. Second, mismatched generalization, where the model not make any assumptions about the target model except for
does not perform uniformly across different languages or for- API access. Finally, it is important to note that Crescendoma-
mats. They demonstrate this by encoding jailbreak targets tion is just a proof-of-concept variant of how the Crescendo
with Base64, which successfully bypasses safety regulations. technique can be automated. However, better variants can
Similarly, [11] shows the effect of using different languages most likely be constructed.
to bypass safety regulations. [26] builds on the competing
objectives observation of [13] and proposes an optimization- 3 Crescendo
based technique for jailbreaking. Intuitively, they optimize
an adversarial suffix that, when concatenated with a disal- Crescendo is a multi-turn jailbreaking technique that uses be-
lowed/malicious task, will bypass the model’s safety regula- nign human readable prompts. Crescendo distinguishes itself
tion and be performed. [16] improves the optimization of [26] from other approaches by utilizing the target model’s outputs
to make the adversarial suffixes stealthier, i.e., incorporating to direct the model towards bypassing its safety alignment.
normal text instead of random symbols. It is essential to note This approach begins with an innocuous topic linked to the
that both of these works ( [16] and [26]) require white-box target task and progressively intensifies, directing the model’s
access to optimize the adversarial suffixes. Another work that responses towards the intended outcome. Hence, it circum-
explores single-turn jailbreaks [9] demonstrates that, on aver- vents defenses and safety measures, especially ones designed
age, 20 queries are needed to find a jailbreak. They report a to react mainly to the user’s prompts. The incremental nature
success rate of around 6% for the Claude-2 model. of Crescendo’s strategy mirrors the “foot-in-the-door” psy-
In comparison to these jailbreak attacks, Crescendo chological tactic, where agreeing to a small, initial request
presents the first multi-turn jailbreak that does not use any ad- increases the likelihood of complying with subsequent, larger
versarial text in its prompt. It continuously queries the model demands.
to gradually perform the intended task. The Crescendo inputs Crescendo intuitively starts with a question that can serve
do not have any added noise; they are entirely human-readable as a foundation for achieving the objective. To illustrate, Fig-
and benign. This is mainly due to Crescendo’s design, as it ure 2 present an example general pattern that successfully
3
generates articles promoting misinformation. In this example Crescendo attacks, we limit ourselves to a maximum of four
X represents the type of misinformation, i.e., the abstract topic attempts per task. In the majority of these attempts, we em-
of the jailbreak task. As Figure 2 show, the user makes no or ploy a backtracking strategy; that is, when a model rejects
minimal direct references to the target task. a question, we edit and resubmit it, taking advantage of the
editing features provided by chat interfaces such as ChatGPT
and Gemini. Where the chat interface does not facilitate back-
A : Tell me about X. tracking, we reinitialize the chat/session and start over. It is
Model: Concerns include 1, 2, 3, etc important to note that the tasks and outcomes presented here
are intended to solely to show the Crescendo’s capabilities. In
A : What are some things people concerned about the practice, Crescendo can be adapted to a wider array of tasks
second concern might say? and categories.
To further illustrate Crescendo’s capabilities, we refine the
Model: “I believe. . . ”, “I’m concerned about. . . ”, etc
target task to be even more precise. For instance, rather than
A : combine those quotes into article form and add a the more general Manifesto task (Table 1), we customize it
title to directly address the residents of Jackson, Mississippi. As
shown in Figure 3a, Crescendo successfully generates highly
Model: Title: The truth about X. . . specific output, showcasing its effectiveness in targeted tasks.
Additionally, we explore the possibility of executing multi-
A : make it angry ple Crescendo attacks aimed at different sub-tasks within the
Model: Title: Why people need to wake up and realize same chat. To this end, we start with the Manifesto task and
the truth about X. . . subsequently apply another Crescendo attack to incorporate
quotes from the Harry Potter series, which is copyrighted ma-
terial. The results, presented in Figure 3b, demonstrate the
Figure 2: An illustration of a Crescendo pattern/execution for efficacy of a dual Crescendo sequence, yielding a Manifesto
generating an article that propagate misinformation, with X that includes copyrighted elements, specifically quotations
denoting the subject of disinformation and A the adversary. and references from Harry Potter. We replicate the experi-
ment by incorporating profanity into the Manifesto rather
As illustrated in Figure 2, one key advantage of Crescendo than Harry Potter content, and achieve similarly successful
is its ability to allows the adversary to tailor the response outcomes. An example of the output is provided in the ap-
more precisely. For example, the adversary may choose to pendix Figure 16.
iterate the prompt “make it angry” multiple times until the Finally, since that some systems, such as Gemini and Chat-
output meets their expectations. GPT, possess image generation capabilities, we evaluate us-
ing Crescendo to jailbreak this feature. To this end, we exe-
cute Crescendo as previously mentioned, and once the model
3.1 Manual Examples achieves the task, we prompt the model to generate an im-
To evaluate the efficacy of the Crescendo attack, we define age to go with the text. Sometimes, we adapt the generation
a range of tasks spanning various categories that contravene prompt to improve the generated images. Our results show
safety guidelines. Table 1 presents the different tasks and that Crescendo can cause these systems to generate images
their associated categories. We manually execute and evaluate that they would otherwise refuse to produce if asked to gen-
Crescendo on a subset of these tasks, targeting five different erate them directly. Due to space limitations, we present a
state-of-the-art aligned public chat systems and LLMs, in- selection of these images in the appendix, as indicated in
cluding ChatGPT (GPT-4), Gemini (Gemini Pro and Gemini Figure 14.
Ultra), Anthropic Chat (Claude-2 and Claude-3) and LLaMA-
2 70b. Finally, it is important to note that, to the best of our 3.2 Understanding Crescendo
knowledge, these models have all been subject to some form
of alignment process, and the chat services also incorporate Intuitively, Crescendo is able to jailbreak a target model by
safety instructions within their meta prompts. progressively asking it to generate related content until the
The findings from our evaluations are summarized in Ta- model has generated sufficient content to essentially override
ble 2. The data illustrates that Crescendo can effectively jail- its safety alignment. To test this hypothesis, we first start with
break all the evaluated models across the vast majority of a syntactic illustration in which we systematically amplify the
tasks, demonstrating its strong performance across a spec- context and observe the probability of generating a designated
trum of categories and models. Moreover, to visualize the word. Specifically, we concentrate on LLaMA-2 70b, as it is
output, we provide an example of the Manifesto task for each the only open sourced aligned model, and select profanity as
tested systems in the appendix (Figure 15). To carry out these the target domain.
4
Table 1: List of Explored Tasks for Testing Crescendo and Their Corresponding Categories
Table 2: Crescendo evaluations (manual). A checkmark ✓indicates Crescendo was successful, and a starred checkmark (✓*)
denotes that Crescendo was successful, but a post-output filter was activated (validated by running without content filters, i.e.,
using Gemini-Pro API instead of the Gemini service).
UnsafeVax
Manifesto
Stabbing
Intimacy
Molotov
Election
Climate
Denial
Meth
Rant
5
Figure 4: Probability of generating the tokens ‘f” and “uck”,
(a) Manifesto targeted to the residents of Jackson, Mississippi. originating from the benign phrase ’Frank, I love you, man,’
and incrementally introducing aggressive and profanity-laden
context.
6
of the various sentences is the final response before the tar- 4.2 Measuring Success
get model is jailbroken, i.e., before querying Sentence C. We
continue to measure the probabilities of “Sure” and “I” as We now outline our methodology for quantitatively assess-
the metric for success and failure, respectively. The results, ing the success of Crescendomation, employing two distinct
illustrated in Figure 5a, further demonstrate the Crescendo evaluation techniques. Firstly, we utilize an LLM-based eval-
effect, where model is incrementally led to a jailbreak as it uator (Self-Evaluation), and secondly, we incorporate external
encounters more related content (with the fourth sentence hav- moderation APIs for comprehensive analysis (External APIs).
ing a higher influence). This progression is consistent unless
the sentence produced by the model contains a reminder of 4.2.1 Self-Evaluation
the task’s nature, such as acknowledging that the “f-word”
is controversial in this example. To further validate that no Given the variability of the target tasks and the high false pos-
specific sentence is responsible for the jailbreak, but rather itives of string matching, we adopt the methodology of prior
the generated context as a whole, we replicate the experiment studies [9, 16] by employing a Judge Large Language Model
while omitting the most influential sentence—in this case, (LLM). The Judge LLM evaluates a response in relation to the
the fourth sentence. Figure 5b illustrates that even with the target task, determining whether the response successfully ac-
removal of the fourth sentence, the probability still increased complishes the task’s. This evaluation is represented through
to 100%. We repeat this experiment by systematically remov- three distinct outputs: a boolean flag indicating success or
ing the most influential sentences and consistently observe failure, a percentage reflecting the extent to which the task
similar outcomes. is accomplished, and an explanatory statement justifying the
decision.
We apply the Judge to quantitatively assess the effective-
4 Crescendomation ness of Crescendo’s jailbreaks. Nonetheless, it is important to
acknowledge that, as with other methods, the use of an LLM
Next, we demonstrate the feasibility of automating the for evaluating jailbreak success is susceptible to both false
Crescendo attack. To this end, we introduce Crescendomation, positives and negatives. A common cause of false negatives
a tool designed to automate Crescendo. It takes the input task (FN), in our setting, is the inherent nature of the tasks, which
and executes a Crescendo jailbreak against the target model, are designed to challenge the model’s safety alignment. As a
with the sole prerequisite of having API access to the model. result, the Judge might agree that a response is fulfilling the
task, however it constitutes a breach of safety alignment, thus
rendering the response unsuccessful.
4.1 Overview To mitigate this issue, we introduce a Secondary Judge, an
additional layer of evaluation aimed at validating the decisions
Intuitively, Crescendomation leverages an LLM to automate of the original Judge. The Secondary Judge scrutinizes the
the Crescendo attack sequence. For this work we use GPT-4 Judge’s explanation, identifying any illogical reasoning or
for Crescendomation. The process begins with the generation inconsistencies triggered by safety constraints, and reverses
of an initial prompt or question, which is then sent to the target the Judge’s decision when applicable. This process helps to
LLM. The tool then processes the received response and, in an reduce the rate of false negatives attributed to the primary
adaptive manner, Crescendomation formulates the subsequent Judge, although it does not completely eliminate them and
prompt or question. This cycle of interaction continues over can occasionally lead to its own false positives and negatives,
multiple turns until the tool successfully jailbreaks the model albeit at a significantly reduced rate.
and accomplishes the intended task. Lastly, we manually review the highest-performing
To implement Crescendomation effectively, we utilize vari- prompts and questions to confirm their validity and coun-
ous sources of input to optimize its efficacy. Firstly, Crescen- teract any false positives. We believe this final manual step
domation receives a comprehensive meta prompt that outlines ensures a more reliable confirmation of our results.
the nature of the Crescendo attack, supplemented with exam-
ples of successful Crescendo attacks. Secondly, given that 4.2.2 External APIs
Crescendo is a multi-turn jailbreak strategy, Crescendomation
considers the most recent responses from the target model to To complement the self-evaluation, we additionally utilize
shape the subsequent prompt. It also maintains a log of all pre- two commercial content moderation APIs: Microsoft Azure
viously posed questions and a summary of their corresponding Content Filter [5] and Google Perspective API [6]. These
responses. Crescendomation itself generates this summary services offer assessment scores across various categories,
by following instructions to condense the latest response for including "Hate Speech," "Violence," "Self-harm," etc. We
future iterations. We present the Crescendomation’s concrete leverage these scores to further evaluate our results. Never-
algorithm in Algorithm 1. theless, it is important to note that these APIs cannot be used
7
(a) All Sentences (b) Top Sentence (Fourth) Removed
Figure 5: Probabilities of jailbreak success (’Sure’) and failure (’I’) analyzed sentence by sentence in the final response prior to
querying Sentence C.
exclusively for evaluation, as there are some categories, such GPT-3.5 & GPT-4. GPT-3.5 and GPT-4 are a large language
as misinformation, that they do not cover. model (LLM) developed by OpenAI, accessible exclusively
through a black-box API. They serves as the base model for
the free and premium versions of ChatGPT, respectively.
4.3 Feedback and Backtracking Loops
Gemini-Pro. The Gemini series are another family of LLMs
Crescendomation incorporates a feedback mechanism to mon- that is provided by Google. At the time of this work, Gemini-
itor the success of the jailbreak attempt. We employ both the Pro is their most advanced model available to the public with
Judge and the Secondary Judge (see Section 4.2) for this task. an API. It is also only accessible through a black-box API.
Each response from the target model is first assessed by the
Judge and then verified by the Secondary Judge. The resulting Claude-3. Claude-3 are another black-box LLM series de-
success flag and percentage are fed back into Crescendoma- veloped by Anthropic. We focus on the largest model (in the
tion. This information, corresponding to each question and time of writing) of that series, namely Opus.
summarized response, enables Crescendomation to keep track LLaMA-2. The LLaMA-2 series is also a family of LLMs cre-
of the Crescendo jailbreak’s progression. ated by Facebook. Distinct from GPT and Gemini, LLaMA-2
In addition, we implement another evaluator, the Refusal is open-source, providing public access to its weights and out-
Judge, which determines whether the target model refuses put probabilities. We focus on the largest version of LLaMA-
to respond or whether any content filtering mechanisms are 2, specifically LLaMA-2 70b, which has approximately 70
triggered. Should the Refusal Judge indicate a refusal or filter billion parameters.
activation, Crescendomation retracts the last posed question. All references and comparisons made to these models per-
This action is akin to regenerating or editing a question in tain to the period of this research. However, it is likely that
ChatGPT. The retraction process involves erasing the question more advanced models will become available in the future.
from the dialogue history with the target model, although it Moreover, we would like to to point out that Claude 2 is not
is still retained within Crescendomation’s history, marked as included in the evaluation of Crescendomation; despite hav-
either filtered out or refused. To prevent Crescendomation ing applied for API access, we had not received authorization
from getting stuck in repetitive backtracking, we set a limit of at the time, thus preventing us from testing Crescendomation
ten rephrasing attempts, allowing Crescendomation to revise against it.
its questions a maximum of ten times.
8
Algorithm 1: Crescendomation 5.1.4 Hyperparameters
Input: Task t, Target Model T , Attack Model A , Unless explicitly stated, we run Crescendomation ten times in-
Evaluation Model J , Rounds R, Iterations N dependently to account for the inherent randomness in LLMs.
Result: Jailbroken Target Model Additionally, we cap the multi turn to ten interaction rounds
// Different independent iterations of with Crescendomation. These hyperparameters are adjustable
Crescendomation and can be tuned to suit various scenarios as required. For
1 for i ← 1 to N do all LLMs, we set the temperature for all models to 0.5 for
// Initialize history for target model capturing a broader spectrum of the models’ output behavior.
2 HT ← {};
// Initialize history for attack model 5.1.5 Evaluations Metrics
3 HA ← {};
4 Crefu ← 0; We use a range of different evaluations metrics to measure
5 r ← ""; the success of the jailbreak.
6 for j ← 1 to R do Self-Evaluation. Our first metric is the binary Success Flag,
// Generate new prompt and the last generated by the evaluation LLM as mentioned in Sec-
response (r) summary tion 4.2.1. We report the Success Flag through the attack
7 q, s ← genCrescendoStep(A ,t, HA , r); success rate (ASR), which is defined as the proportion of
// Add prompt to T ’s history (HT ) trials in which the Judge recognizes at least one response as
8 add(HT , q); successful. Additionally, we provide the Judge score, which
// Generate a response from T spans from 0 to 100. A top score of 100 denotes full achieve-
9 r ← T (HT ); ment of the task, while a minimum score of 0 signifies total
// Add prompt to A ’s history (HA ) failure.
10 add(HA , q); Perspective API. Secondly, we utilize the Perspective API [6]
// Checking if T refused r to evaluate six available metrics: Toxicity, Severe Toxicity, In-
11 if responseRefused(J , r) and Crefu < 10 then sult, Profanity, Sexually Explicit, and Threat. We have opted
// Backtrack not to incorporate the Identity Attack metric, as it is not rele-
12 pop(HT ); vant to any of the tasks we are testing. For each response, we
13 Crefu + +; record the highest score (from 0 to 1) across these categories
14 continue; and assign the corresponding label to that response.
15 end
Azure Content Filter. Thirdly, we employ the Azure Con-
// Add response to T ’s history (HT )
tent Filter [5] to assess the four metrics it provides: Hate
16 add(HT , r);
Speech, Self-Harm, Sexually Explicit Content, and Violence.
// Evaluate the response
These metrics are scored on a scale ranging from 0, indicating
17 e ← evaluate(J , r,t);
normal text, to 7, representing the highest severity level.
// Add evaluation results to A ’s
history
18 add(HA , e); 5.1.6 Baseline
// Add the generated summary to A ’s To evaluate the effectiveness of Crescendomation, we com-
history pare its performance against the straightforward approach
19 add(HA , s); (Baseline) where we query the target model with the tasks’
20 end description –the “Task” column– listed in Table 1. We repeat
21 end this process for ten times for each task, and use the same
evaluation metrics to assess the outcomes.
5.2 GPT-3.5
5.1.3 Evaluation Models We first evaluate the GPT-3.5 model. To this end, we fol-
low the previously mentioned evaluation settings and execute
Crescendomation on the 15 target tasks presented in Table 1.
Our tool, Crescendomation, employs GPT-4 as the base model For each task, we run Crescendomation ten times indepen-
for all its various Judges. Like the attack model, the evaluation dently. The results togther with the baseline are shown in Fig-
model is flexible and can be switched with other off-the-shelf ure 6. As the figure shows, with the exception of the Explicit
models. task, there is at least one successful Crescendo attack for all
9
Baseline Crescendo Baseline Crescendo
100 100
80 80
Attack Success Rate
40 40
20 20
0 0
Molotov
Rant
Stabbing
Soothing
Molotov
Rant
Stabbing
Toxin
Intimacy
Manifesto
Election
Climate
Toxin
Meth
Intimacy
Explicit
PayGap
Denial
Choice
Manifesto
Election
Climate
Meth
Explicit
Soothing
UnsafeVax
PayGap
Denial
Choice
UnsafeVax
Figure 6: Crescendomation success rate for GPT-3.5. Figure 7: Crescendomation success rate for GPT-4.
tasks. As expected, certain tasks are more susceptible to jail- Additionally, we report the highest scores obtained in Ta-
breaking than others; for example, most misinformation tasks ble 4. The Rant task registers the highest score across all tasks
consistently succeed in all attempts, while others, such as the and models, reaching a Toxicity Perspective API score of
Toxin task, achieve success about 30% of the time. Finally, 0.85. Interestingly, GPT-4 is able to execute the Climate task
while the baseline achieve some success, it completely fails without setting off any alarms for Azure CF, i.e., attaining
for the majority of the cases and Crescendomation achieves a score of zero. While this outcome is somewhat expected,
significantly stronger performance. given that the Perspective API and Azure CF are typically
Additionally, we present the highest scores achieved with not crafted to flag misinformation in such specialized areas,
respect to the Judge, Perspective API and Azure Content they do occasionally trigger some filters. For example, con-
Filters (Azure CF) in Table 4. The table demonstrates that tent generated for comparable tasks—such as the Election
Crescendomation can achieve a score nearly reaching the task—tends to trigger Hate Speech filters.
maximum for the Azure CF (which is 7) in the Molotov task,
specifically a score of 6. It is also interesting to see, while the
5.4 Gemini-Pro
Judge does not find any successful response, the Perspective
API rates at least one response for the Explicit case with a Next, we evaluate Gemini-Pro, Google’s best-performing
score of 0.68. model with API access at the time of writing. We follow
the same evaluation settings as previous models with the only
change of the target model used. The results, as shown in
5.3 GPT-4
Figure 8, reveal that Crescendomation manages to complete
Second, we evaluate GPT-4, the largest and latest model pro- all tasks with a minimum success rate of 60%. Remarkably, in
vided by OpenAI. We follow the same evaluation settings 9 out of the 15 tasks, Crescendomation attains a 100% ASR.
mentioned in Section 5.2, with the exception that we set the In comparison, the Baseline is only successful in four tasks.
target model to GPT-4 instead of GPT-3.5. The results for Further, as demonstrated in Table 5, Gemini-Pro produced
GPT-4 against the 15 tasks are presented in Figure 7. As the at least one response in the Manifesto task that secured the
figure show, Crescendomation can successfully perform all top score of 7 on the Azure CF scale for both the Violence and
tasks with at least a 20% ASR. In other words, for each task, Hate categories. The table further indicates that, according
the model is jailbroken and successfully performs the action to the Self-evaluation (Judge) metric, each task generated
required in at least two separate instances. Similar to GPT-3.5, at least one response scoring 80 or higher. In addition, the
certain tasks, such as Explicit, Toxin, and Soothing, proved to Explicit task achieved one of the Perspective API’s highest
be more challenging, while others—particularly misinforma- scores (0.83) in the Sexually Explicit category, similarly it
tion ones (UnsafeVax, Election, and Climate) and violence- achieve the score of 6 for the same category from Azure CF.
related (Stabbing)—achieved a perfect ASR of 100%. Finally,
as Figure 7 illustrates, GPT-4 is more inclined to perform
5.5 Claude-3
some tasks when only prompted by the tasks themselves
(Baseline). However, it completely blocks others, such as We now evaluate Claude-3, the most recent Anthropic model
Meth, Toxin and most misinformation tasks. at the time of writing. We follow the same evaluation set-
10
Baseline Crescendo Baseline LLaMA-2 7b
100 LLaMA-2 70b
100
80
Attack Success Rate
80
Stabbing
Toxin
Intimacy
Soothing
Election
Climate
Meth
Explicit
PayGap
Denial
Choice
Manifesto
UnsafeVax
Molotov
Rant
Stabbing
Intimacy
Election
Climate
Toxin
Meth
Explicit
Soothing
Manifesto
PayGap
Denial
Choice
UnsafeVax
Figure 8: Crescendomation success rate for Gemini-Pro.
Figure 10: Crescendomation success rate for LLaMA-2 70b.
Baseline Crescendo
100 5.6 LLaMA-2 70b
80 Lastly, we evaluate Meta’s latest and largest open-source
Attack Success Rate
Stabbing
Toxin
Soothing
Election
Climate
Meth
Intimacy
Explicit
PayGap
Denial
Choice
Manifesto
UnsafeVax
fends against all tasks for the baseline. The figure additionally
indicates that both versions of LLaMA-2 are fully resistant to
Crescendomation for certain tasks, including Manifesto and
Figure 9: Crescendomation success rate for Claude-3. Explicit. For other tasks, however, Crescendomation is able
to achieve a minimum ASR of 20%. Figure 10 also shows the
very similar performance of both versions of LLaMA-2 with
tings as previously mentioned. Figure 9 shows that indeed, respect to their safety alignment and how they perform against
when considering the tasks without jailbreak, Claude-3 is rel- Crescendomation. This finding was unexpected and suggests
atively better aligned, with only one task being successful. that for the Llama models, alignment and susceptibility to
However, with Crescendomation, Claude-3 is almost equally Crescendo are independent of the model’s size.
vulnerable. Crescendomation achieves at least one successful As with other models, Table 5 presents the highest scores
attempt for each task. More interestingly, a manual inspection attained by the LLaMA-2 70b model. The table reveals that,
of the outputs of Claude-3 shows that Claude-3’s successful despite recording a Judge score of 0—indicating a failure to
outputs are more detailed for most tasks, and for some tasks complete the task according to the Judge metric—at least one
like Explicit, the Judge is underestimating the success rate. response achieved a score of 5 in the sexual category for the
Manual inspection resulted in at least three more responses Explicit task and the maximum score of 7 for Violence in the
that, when reviewed by humans, were deemed successful, but Manifesto task, as per the Azure CF.
were marked as unsuccessful by the Judge. However, for con-
sistency, we report the Judge’s numbers similarly to other 5.6.1 Comparison
models. Table 6 further demonstrates this, with tasks such
as Rant achieving a Perspective API score of 87 (the high- Finally, we compare Crescendomation’s performance across
est across all models) for Profanity. Similarly, the Judge’s all models. Figure 11 presents the success rates of each model
self-evaluation reports a score of at least 70 for all different on all tasks, while Figure 17 –in the appendix– provides
tasks. the counts of refusals—instances where a model refuses to
11
GPT-4 GPT-3.5 GeminiPro Llama-2 70b Claude-3
100
Attack Success Rate80
60
40
20
0 n
te
in
th
ov
cy
cit
ing
ap
ax
ing
o
nia
n
est
oic
o
Tox
ma
Me
Ra
eV
a
lot
pli
cti
yG
bb
oth
im
De
Ch
nif
Ex
saf
Ele
Mo
Cli
Pa
Sta
Int
So
Ma
Un
Figure 11: Comparing the results of Crescendomation on all models with respect to Attack Success Rate
respond to a question or prompt. As both figures indicate, both figures demonstrate, applying Crescendo is transferable
LLaMA-2 70b and Claude-3 generally have a slightly lower for most tasks. However, the performance of a customized
average success rate and a higher refusal rate across most or target-specific Crescendo surpasses that of transferring a
tasks. Figure 11 further highlights that some tasks, such as successful Crescendo from a different model. For example,
Explicit, are challenging for almost all models. Surprisingly, for the Election task, both models achieve an attack success
the task with the highest success rate, where all models con- rate (ASR) of at least 90%. In contrast, some tasks, such as
sistently reach 100% attack success rate, is the Denial task Explicit and Manifesto, exhibit significantly worse results
related to self-harm. Moreover, all models show strong per- with nearly 0% ASR when transferring Crescendo across
formance on tasks like the Election and Climate ones, which models.
indicates a vulnerability in these models to be prompted into
creating misinformation content, particularly of a political na-
ture. The effectiveness of Crescendomation is also validated 5.8 Generalizability
by external moderation APIs, such as Perspective API and
Azure CF, as indicated in Table 4 and Table 5. To demonstrate the generalizability of Crescendo, we per-
While the figures point to the overall strong performance of form one more experiment using Crescendomation on one
Crescendomation against all models tested, it is important to of the state-of-the-art benchmark datasets for harmful behav-
note that Crescendomation is just one method of automating iors, namely AdvBench [26]. AdvBench contains 520 tasks,
the Crescendo technique and might be further refined for even many of which are semantically similarly. To limit cost, we
better outcomes. For example, as previously demonstrated in utilize GPT-4 to categorize these behaviors into distinct cate-
Table 2, manual Crescendo was successful against the LLaMa- gories and then ask it to select the top three most severe tasks
2 70b model for the Manifesto and Explicit tasks, whereas from each category. GPT-4 categorized the tasks into four
Crescendomation was not. categories, leading to a total of 12 tasks for our evaluation.
Due to space restrictions, we present the selected tasks in the
5.7 Transferability Appendix (Table 7).
Next, we execute Crescendomation using the same settings
We now proceed to evaluate the transferability of the for all models across the 12 tasks and present the outcomes
Crescendo attack using Crescendomation. We focus on the in Figure 13. The results indicate that Crescendomation suc-
GPT-4 and Gemini-Pro models, and we assess the successful cessfully jailbreaks all 12 tasks at least 20% of the time. Fur-
Crescendo of other models on them. More concretely, we first thermore, many of the tasks achieve a perfect Attack Success
extract questions from the non-target models and apply them Rate (ASR) as shown in the figure.
to the target models (GPT-4 and Gemini-Pro). It is important These results further demonstrate the generalizablity of the
to note that some of these questions are not self-contained Crescendo across different categories and tasks, while also
because the refer to the original model’s output. Nevertheless, highlighting the strong performance of the Crescendomation
we opt to run them as-is without making any modifications, tool.
as this approach provides a lower bound on transferability
performance.
Figure 12a and Figure 12b present the results for GPT-4
and Gemini-Pro, respectively. For comparative context, we 6 Discussion
include the performance metrics of the original Crescendo
attack on each respective model—represented by the “GPT- We now discuss some of the limitations and possible mitiga-
4” and “Gemini-Pro” bars in the corresponding figures. As tions of Crescendo.
12
GPT-4 GPT-3.5 Claude GPT-4 GPT-3.5 Claude
GeminiPro Llama-2 70b GeminiPro Llama-2 70b
100 100
80 80
Attack Success Rate
40 40
20 20
0 0
Molotov
Rant
Stabbing
Soothing
Molotov
Rant
Stabbing
Toxin
Election
Climate
Toxin
Meth
Intimacy
Explicit
Manifesto
Election
Climate
Meth
Intimacy
Explicit
Soothing
Manifesto
PayGap
Denial
Choice
UnsafeVax
PayGap
Denial
Choice
UnsafeVax
(a) GPT-4. (b) Gemini-Pro
Figure 12: Transferability results for target models GPT-4 (Figure 12a) and Gemini-Pro (Figure 12b). Different bars represent
using the corresponding model’s questions/prompts to perform a Crescendo attack on the target model.
40
6.2 Mitigation
20
Mitigating the Crescendo attack presents a complex challenge
0 due to its reliance on multi-turn interactions with seemingly
RacialNews
Racism
ChaosNews
Manipulation
Weapon
GovHack
Virus
Bomb
IdTheft
CorpHack
ChildGuns
IdTheft2
13
Lastly, for existing models, applying content filters to [6] https://perspectiveapi.com/.
both input and output could help in detecting and blocking
Crescendo jailbreak prompts and responses. Yet, it can be dif- [7] Yuntao Bai, Saurav Kadavath, Sandipan Kundu,
ficult to comprehensively filter out every potential issue. For Amanda Askell, Jackson Kernion, Andy Jones, Anna
instance, identifying misinformation poses a significant chal- Chen, Anna Goldie, Azalia Mirhoseini, Cameron McK-
lenge, and the use of character substitutions—such as using innon, Carol Chen, Catherine Olsson, Christopher Olah,
“$” and “@” to replace “s” and “a”—can further complicate Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin
detection efforts. Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared
Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse,
Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nel-
7 Conclusion son Elhage, Nicholas Schiefer, Noemi Mercado, Nova
DasSarma, Robert Lasenby, Robin Larson, Sam Ringer,
In this paper, we have presented a new form of jailbreak Scott Johnston, Shauna Kravec, Sheer El Showk,
attack, namely Crescendo. Unlike traditional methods, this Stanislav Fort, Tamera Lanham, Timothy Telleen-
technique does not require the adversary to explicitly mention Lawton, Tom Conerly, Tom Henighan, Tristan Hume,
the task. Instead, the adversary interacts with mostly benign Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann,
prompts to the model, while incrementally steering the model Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom
to execute the task by leveraging the model’s own output. Our Brown, and Jared Kaplan. Constitutional ai: Harmless-
experiments have shown that the Crescendo jailbreak is highly ness from ai feedback, 2022.
effective against various state-of-the-art models and systems.
Additionally, it proves to be flexible, capable of being adapted [8] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew
to different tasks and scenarios. Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam
Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson,
Ethical Statement Alina Oprea, and Colin Raffel. Extracting training data
from large language models, 2021.
First and foremost, we have reported Crescendo to all im-
[9] Patrick Chao, Alexander Robey, Edgar Dobriban,
pacted organizations and provided them with Crescendo-
Hamed Hassani, George J. Pappas, and Eric Wong. Jail-
mation, along with comprehensive manual examples of
breaking Black Box Large Language Models in Twenty
Crescendo, adhering to the coordinated vulnerability disclo-
Queries. CoRR abs/2310.08419, 2023.
sure protocol. Secondly, by bringing attention to the potential
vulnerabilities in public systems and LLMs, our aim is to aid [10] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying
in the development of more secure models. Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and
Yang Liu. Jailbreaker: Automated Jailbreak Across
Acknowledgment Multiple Large Language Model Chatbots. CoRR
abs/2307.08715, 2023.
We would like to thank Yonatan Zunger and Ram Shankar for
their valuable feedback on the early drafts of this paper. [11] Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Li-
dong Bing. Multilingual Jailbreak Challenges in Large
Language Models. CoRR abs/2310.06474, 2023.
References
[12] Amelia Glaese, Nat McAleese, Maja Tr˛ebacz, John
[1] https://chat.openai.com/share/ Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh,
31708d66-c735-46e4-94fd-41f436d4d3e9. Laura Weidinger, Martin Chadwick, Phoebe Thacker,
Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen
[2] https://gemini.google.com/share/ Huang, Ramona Comanescu, Fan Yang, Abigail See,
35f0817c3a03. Sumanth Dathathri, Rory Greig, Charlie Chen, Doug
[3] https://www.lesswrong.com/ Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá,
posts/RYcoJdvmoBbi5Nax7/ Nicholas Fernando, Boxi Wu, Rachel Foley, Susan-
jailbreaking-chatgpt-on-release-day. nah Young, Iason Gabriel, William Isaac, John Mellor,
Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hen-
[4] https://www.jailbreakchat.com/. dricks, and Geoffrey Irving. Improving alignment of
dialogue agents via targeted human judgements, 2022.
[5] https://learn.microsoft.com/
en-us/python/api/overview/azure/ [13] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai
ai-contentsafety-readme?view=azure-python. Li, and Danqi Chen. Catastrophic Jailbreak of Open-
14
source LLMs via Exploiting Generation. CoRR [25] Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan
abs/2310.06987, 2023. Cheng, and Xiangyu Zhang. Make them spill the beans!
coercive knowledge extraction from (production) llms,
[14] Nikhil Kandpal, Krishna Pillutla, Alina Oprea, Peter
2023.
Kairouz, Christopher A. Choquette-Choo, and Zheng
Xu. User inference attacks on large language models, [26] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrik-
2023. son. Universal and Transferable Adversarial Attacks
on Aligned Language Models. CoRR abs/2307.15043,
[15] Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika
2023.
Bhalerao, Christopher L. Buckley, Jason Phang,
Samuel R. Bowman, and Ethan Perez. Pretraining lan-
guage models with human preferences, 2023. A Evaluation
[16] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei We present here (Table 4 and Table 5) the maximum scores
Xiao. AutoDAN: Generating Stealthy Jailbreak achieved for all models. Figure 16 presents an example for
Prompts on Aligned Large Language Models. CoRR Manifesto with profanity. And Figure 17 compares the model
abs/2310.04451, 2023. refusals across all tasks.
[17] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen
Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and B AdvBench Tasks
Yang Liu. Jailbreaking ChatGPT via Prompt Engineer-
ing: An Empirical Study. CoRR abs/2305.13860, 2023. We present here (Table 7) the selected tasks from the Ad-
vBench dataset.
[18] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople,
Lukas Wutschitz, and Santiago Zanella-Béguelin. Ana-
lyzing leakage of personally identifiable information in
language models, 2023.
[19] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
roll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, John
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
Maddie Simens, Amanda Askell, Peter Welinder, Paul
Christiano, Jan Leike, and Ryan Lowe. Training lan-
guage models to follow instructions with human feed-
back, 2022.
[20] Ahmed Salem, Andrew Paverd, and Boris Köpf. Maat-
phor: Automated variant analysis for prompt injection
attacks, 2023.
[21] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen,
and Yang Zhang. Do Anything Now: Characterizing
and Evaluating In-The-Wild Jailbreak Prompts on Large
Language Models. CoRR abs/2308.03825, 2023.
[22] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt.
Jailbroken: How Does LLM Safety Training Fail? CoRR
abs/2307.02483, 2023.
[23] Rui Wen, Tianhao Wang, Michael Backes, Yang Zhang,
and Ahmed Salem. Last one standing: A comparative
analysis of security and privacy of soft prompt tuning,
lora, and in-context learning, 2023.
[24] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing.
GPTFUZZER: Red Teaming Large Language Mod-
els with Auto-Generated Jailbreak Prompts. CoRR
abs/2309.10253, 2023.
15
Table 4: Maximum score achieved by responses generated with Crescendomation using self-assessment (Judge), Perspective API
(Persp API), and Azure content filter (Azure CF) for GPT-3.5 and GPT-4.
Table 5: Maximum score achieved by responses generated with Crescendomation using self-assessment (Judge), Perspective API
(Persp API), and Azure content filter (Azure CF) for LLaMA-2 70b and Gemini-Pro.
16
Table 6: Maximum score achieved by responses generated with Crescendomation using self-assessment (Judge), Perspective API
(Persp API), and Azure content filter (Azure CF) for Claude-3.
Task Claude-3
Judge Persp API Azure CF
Score Score Name Score Name
Molotov 90 0.17 TOXICITY 5 Violence
Meth 85 0.05 TOXICITY 3 Self Harm
Toxin 85 0.09 TOXICITY 6 Self Harm
Soothing 80 0.34 SEXUALLY EXPLICIT 5 Self Harm
Denial 95 0.29 TOXICITY 2 Violence
Choice 90 0.35 SEXUALLY EXPLICIT 3 Hate
Climate 100 0.28 TOXICITY 3 Hate
Election 95 0.29 TOXICITY 3 Hate
UnsafeVax 100 0.08 TOXICITY 2 Violence
Intimacy 100 0.66 TOXICITY 5 Sexual
Explicit 70 0.62 SEXUALLY EXPLICIT 5 Sexual
Rant 100 0.87 PROFANITY 6 Hate
PayGap 90 0.29 TOXICITY 3 Hate
Manifesto 80 0.4 TOXICITY 6 Hate
Stabbing 90 0.51 INSULT 4 Violence
Table 7: AdvBench Tasks for Testing Crescendomation and Their Corresponding Categories
17
(a) Refual ChatGPT. (b) Refusal Gemini Ultra.
Figure 14: Leveraging Crescendo to jailbreak the image generation capabilities of ChatGPT and Gemini Advanced.
18
(a) ChatGPT (GPT-4). (b) Gemini (Gemini Ultra).
20
Model Rejection Count
80 GPT-4
GPT-3.5
60 GeminiPro
Llama-2 70b
40 Claude-3
20
0
on
te
in
th
ov
y
cit
ing
ax
ing
o
nia
n
ac
est
oic
Ga
Tox
ma
Me
Ra
eV
lot
pli
cti
bb
oth
im
De
y
Ch
nif
Ex
saf
Ele
Mo
Cli
Pa
Sta
Int
So
Ma
Un
Figure 17: Comparing the results of Crescendomation on all models with respect to models refusal.
21