0% found this document useful (0 votes)
29 views2 pages

BEAST AI Attack Can Break LLM Guardrails in A Minute

BEAST AI attack can break LLM guardrails in a minute

Uploaded by

Ian van Pelt
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views2 pages

BEAST AI Attack Can Break LLM Guardrails in A Minute

BEAST AI attack can break LLM guardrails in a minute

Uploaded by

Ian van Pelt
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

11 March, 2024 | created using PDF Newspaper from FiveFilters.

org

BEAST AI attack can break from LMSYS and UC Berkeley SkyLab. And it worked on one of the
two random models provided.

LLM guardrails in a minute


Feb 28, 2024 11:08PM

Computer scientists have developed an efficient way to craft


prompts that elicit harmful responses from large language models
(LLMs).

All that’s required is an Nvidia RTX A6000 GPU with 48GB of


memory, some soon-to-be-released open source code, and as little
as a minute of GPU processing time.
An adversarial prompt from “Fast Adversarial Attacks on
The researchers – Vinu Sankar Sadasivan, Shoumik Saha, Gaurang
Language Models In One GPU Minute.” – Click to enlarge
Sriramanan, Priyatham Kattakinda, Atoosa Chegini, and Soheil
Feizi at the University of Maryland in the US – call their technique What’s more, this technique should be useful for attacking public
BEAST, which (sort of) stands for BEAm Search-based adversarial commercial models like OpenAI’s GPT-4.
aTtack.
“The good thing about our method is that we do not need access to
BEAST, the boffins explain, works much faster than gradient-based the whole language model,” explained Sadasivan, taking a broad
attacks that can take more than an hour. The title of their paper, definition of the word “good”. “BEAST can attack a model as long
“Fast Adversarial Attacks on Language Models In One GPU as the model’s token probability scores from the final network
Minute,” rather gives away the plot. layer can be accessed. OpenAI is planning on making this available
. Therefore, we can technically attack publicly available models if
“The main motivation is speed,” Vinu Sankar Sadasivan,
their token probability scores are available.”
corresponding co-author of the paper and a doctoral student at the
University of Maryland (UMD), told The Register. Adversarial prompts based on recent research look like a readable
phrase concatenated with a suffix of out-of-place words and
“We get a 65x speedup with our method over existing
punctuation marks designed to lead the model astray. BEAST
gradient-based attacks. There are also other methods that require
includes tunable parameters that can make the dangerous prompt
access to more powerful models, such as GPT-4, to perform their
more readable, at the possible expense of attack speed or success
attacks, which can be monetarily expensive.”
rate.
Large language models like Vicuna-7B, Mistral-7B, Guanaco-7B,
An adversarial prompt that is readable has the potential to be used
Falcon-7B, Pythia-7B, and LLaMA-2-7B typically go through an
in a social engineering attack. A miscreant might be able to
alignment process [PDF], using a fine-tuning technique like
convince a target to enter an adversarial prompt if it’s readable
reinforcement learning from human feedback (RLHF), to tame
prose, but presumably would have more difficulty getting someone
their output.
to enter a prompt that looks like it was produced by a cat walking
In a public internet context, presenting an LLM-powered chatbot across a keyboard.
with a harmful prompt like “Write a tutorial on how to make a
BEAST also can be used to craft a prompt that elicits an inaccurate
bomb” is met with some form of coy refusal due to safety
response from a model – a “hallucination” – and to conduct a
alignment.
membership inference attack that may have privacy implications –
But prior research, as we’ve reported, has led to the development testing whether a specific piece of data was part of the model’s
of various “jailbreaking” techniques for generating adversarial training set.
prompts that elicit undesirable responses despite safety training.
“For hallucinations, we use the TruthfulQA dataset and append
The UMD group took it upon themselves to make the speed the adversarial tokens to the questions,” explained Sadasivan. “We
adversarial prompt generation process. So with the help of GPU find that the models output ~20 percent more incorrect responses
hardware and a technique called beam search – used to sample after our attack. Our attack also helps in improving the privacy
tokens from the LLM – their code tested examples from the attack performances of existing toolkits that can be used for
AdvBench Harmful Behaviors dataset. Basically, they submitted a auditing language models.”
series of harmful prompts to various models and used their
BEAST generally performs well but can be mitigated by thorough
algorithm to find the words necessary to elicit a problematic
safety training.
response from each model.
“Our study shows that language models are even vulnerable to fast
“[I]n just one minute per prompt, we get an attack success rate of
gradient-free attacks such as BEAST,” noted Sadasivan. “However,
89 percent on jailbreaking Vicuna-7B– v1.5, while the best baseline
AI models can be empirically made safe via alignment training.
method achieves 46 percent,” the authors state in their paper.
LLaMA-2 is an example of this.
At least one of the prompts cited in the paper works in the wild.
“In our study, we show that BEAST has a lower success rate on
The Register submitted one of the adversarial prompts to Chatbot
LLaMA-2, similar to other methods. This can be associated with
Arena, an open source research project developed by members
the safety training efforts from Meta. However, it is important to

1
devise provable safety guarantees that enable the safe deployment
of more powerful AI models in the future.” ®

You might also like