
Automatic Prompt Optimization with “Gradient Descent”

and Beam Search

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, Michael Zeng
Microsoft
{reidpryzant,iterdan,jerrl,yintatlee,chezhu,nzeng}@microsoft.com

Abstract

Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

Figure 1: Overview of the proposed Automatic Prompt Optimization (APO) framework.

1 Introduction

Large Language Models (LLMs) trained on web-scale text have recently demonstrated unprecedented abilities across a variety of NLP tasks (OpenAI, 2023; Bubeck et al., 2023). These LLMs use prompt inputs to follow human instructions. Writing these natural language prompts remains a manual trial-and-error process requiring significant human effort (Jiang et al., 2022) and expertise (Reynolds and McDonell, 2021; Zamfirescu-Pereira et al., 2023).

Accordingly, there is a need for automatic or semi-automatic procedures to help humans write the best prompts. This would help reduce manual effort, improve task performance, and produce interpretable descriptions of a cognitive decision process.

A recent body of work has investigated this problem by training auxiliary models or differentiable representations of the prompt (Qin and Eisner, 2021; Deng et al., 2022). However, such works assume access to internal state variables of the LLM (Shin et al., 2020; Lester et al., 2021), while practitioners often communicate with LLMs through an API. Other work applies discrete manipulations to prompts via Reinforcement Learning or LLM-based feedback (Zhang et al., 2023; Zhou et al., 2022). These algorithms may also require low-level access to the LLM, produce incomprehensible outputs, or rely on directionless monte carlo search over the semantic space of prompts.

We propose Automatic Prompt Optimization (APO), a general purpose and nonparametric prompt optimization algorithm that connects these two bodies of research by applying discrete improvements to prompts in a directed way.
Unlike prior work, we overcome the discrete op-
timization barrier by mirroring the steps of gradient
descent within a text-based Socratic dialogue (Zeng
et al., 2022), substituting differentiation with LLM
feedback and backpropagation with LLM editing.
In detail, we use minibatches of training data to
produce “gradients” in natural language, i.e., de-
scriptions of the current prompts’ flaws, then edit
the current prompt in the opposite semantic direc-
tion of the gradient. These steps become the ex-
pansion part of a wider beam search over the space
of prompts, increasing algorithmic efficiency by
treating the problem of beam candidate selection as an instance of the best arm identification problem (Audibert et al., 2010).

We then offer a preliminary case study of the APO algorithm. We evaluate the proposed APO framework in multiple configurations across 4 NLP tasks, including the novel problem of LLM jailbreak detection. The results suggest that the proposed algorithm can improve on the performance of the initial prompt input by up to 31%, exceeding state-of-the-art prompt learning baselines by an average of 4-8% while relying on fewer LLM API calls. We also demonstrate the interpretability of the optimization process and investigate the algorithm's shortcomings.

Figure 2: The text dialogue tree we use to mirror gradient descent and overcome the discrete optimization barrier. First, a feedback prompt ∇ generates the gradient g from input data (x, y), the starting prompt p0, and the prediction ŷ (left). Second, an editing prompt δ applies the gradient g to the prompt p0 to produce an improved prompt p′ (right).

2 Discrete Prompt Optimization with Nonparametric "Gradient Descent"

The proposed Automatic Prompt Optimization framework assumes access to an initial prompt p0 and i.i.d. training data consisting of pairs of input and output text (numbers, categories, summaries, etc.): Dtr = {(x1, y1), ..., (xn, yn)}. Note that all prompts p are drawn from the space of coherent natural language L. We assume access to a black box LLM API LLM_p(x) ≈ argmax_{y∈L} P_LLM(y | p, x), which returns a likely text continuation y of the prompt formed by concatenating p and x (for example, a few-shot prompt and input example, or a chatbot persona and conversational history).

Within this context, our algorithm iteratively refines the prompt p0 to produce p̂, an approximation of the optimal prompt p∗ = argmax_{p∈L} m(p, Dte) for some metric function m(·) and in-domain test or development data Dte.

In the following sections, we first introduce how the algorithm performs textual "gradient descent" to improve the prompts in a directed way (Section 2.1). Then the algorithm leverages these gradient descent steps to beam search through the space of coherent language L, guided by the gradients during beam expansion and by efficient best arm identification during beam selection (Section 2.2).
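To make the setting concrete, the sketch below shows the interfaces the framework assumes: a black-box LLM call and a task metric. The helper names (call_llm, score_prompt, BlackBoxLLM) are illustrative assumptions introduced here, not part of the paper, and accuracy stands in for the F1 score the authors actually report.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text x, gold label y)

@dataclass
class BlackBoxLLM:
    """Stand-in for the assumed API: LLM_p(x) returns a likely continuation of p + x."""
    call_llm: Callable[[str], str]  # assumed helper that hits the real completion API

    def __call__(self, prompt: str, x: str) -> str:
        # Concatenate the task prompt and the input, return the model's continuation.
        return self.call_llm(f"{prompt}\n\n{x}")

def score_prompt(llm: BlackBoxLLM, prompt: str, data: List[Example]) -> float:
    """Metric m(p, D): plain accuracy here for brevity; the paper reports binary F1."""
    if not data:
        return 0.0
    hits = sum(1 for x, y in data if llm(prompt, x).strip() == y)
    return hits / len(data)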
2.1 Gradient Descent with Prompts

In our setting, gradient descent refers to the process of (1) evaluating a prompt with a batch of data, (2) creating a local loss signal which contains information on how to improve the current prompt, then (3) editing the prompt in the opposite semantic direction of the gradient before starting the next iteration.

We accomplish this process with a pair of static LLM prompts, as depicted in Figure 2. The first prompt is for creating the loss signals ("gradients") and is called ∇. While the specific contents can vary and be task-specific or task-agnostic,1 ∇ must always consider the current prompt p0, plus p0's behavior on a minibatch of data (particularly the errors), and generate a natural language summary of p0's flaws. This summary becomes the gradient g. Just like traditional gradients, which represent a direction in parameter space that would make the model worse, the text "gradients" g operate in the space of natural language to describe flaws with the current prompt.

1 We use the same prompts for all tasks; see the Appendix.

The second prompt is called δ, and while this prompt can also vary, it must always take the gradient g and current prompt p0, then perform an edit on p0 in the opposite semantic direction of g, i.e. fix the problems with p0 that are indicated by g.2

2 Note that one can imagine operationalizing the concept of learning rates or step sizes by e.g. editing δ to perform large- or small-sized edits to p0. In this initial work we adopt an "adaptive" step size by allowing the LLM to decide edit size, and leave further exploration to future work.

Unlike the traditional machine learning setting, we do not generate a single gradient or edit, but rather a number of directions that may improve the current prompt. Section 2.2 describes in detail the process of generating and selecting candidate prompts.
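A minimal sketch of one such gradient/edit step is shown below, assuming a generic call_llm(text) -> str helper. The two templates are simplified stand-ins for the actual ∇ and δ prompts, which are listed verbatim in the Appendix.

from typing import List, Tuple

GRADIENT_TEMPLATE = (
    "I'm trying to write a zero-shot classifier prompt.\n"
    "My current prompt is:\n\"{prompt}\"\n"
    "But this prompt gets the following examples wrong:\n{errors}\n"
    "Give one reason why the prompt could have gotten these examples wrong."
)

EDIT_TEMPLATE = (
    "My current prompt is:\n\"{prompt}\"\n"
    "Based on some examples, the problem with this prompt is that {gradient}\n"
    "Write an improved prompt that fixes this problem."
)

def gradient_step(call_llm, prompt: str, errors: List[Tuple[str, str, str]]) -> str:
    """One nabla/delta step: summarize the prompt's flaws, then edit against them."""
    error_str = "\n".join(f"Input: {x}\nGold: {y}\nPredicted: {yhat}"
                          for x, y, yhat in errors)
    gradient = call_llm(GRADIENT_TEMPLATE.format(prompt=prompt, errors=error_str))
    return call_llm(EDIT_TEMPLATE.format(prompt=prompt, gradient=gradient))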
2.2 Beam Search over Prompts we engadge the LLMs in a recursive feedback loop
The gradient descent steps described in Section 2.1 similar to the Socratic dialogues proposed by (Zeng
are used to guide a beam search over the space of et al., 2022).
prompts. This beam search is the outer loop of Last, additional candidates are generated by run-
our prompt training algorithm and it is described ning the existing candidates through a paraphrasing
in Algorithm 1.

Algorithm 1 Automatic Prompt Optimization
Require: p0: initial prompt, b: beam width, r: search depth, m: metric function
1: B0 ← {p0}
2: for i ← 1 to r − 1 do
3:   C ← ∅
4:   for all p ∈ Bi do
5:     C ← C ∪ Expand(p)
6:   end for
7:   Bi+1 ← Select_b(C, m)
8: end for
9: p̂ ← argmax_{p∈Br} m(p)
10: return p̂

The beam search is an iterative optimization process where, for each iteration, the current prompt is used to generate many new candidate prompts (expansion). Next, a selection process is used to decide which prompts are worth carrying forward to the next iteration. This loop allows for incremental improvements and exploration over multiple prompt candidates.
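The sketch below restates Algorithm 1 in Python, assuming expand and select callables that correspond to the expansion and selection steps defined in the next two subsections; the beam width (4) and depth (6) defaults mirror the settings reported in Section 3.2.

from typing import Callable, List

def apo_beam_search(
    p0: str,
    expand: Callable[[str], List[str]],             # Expand(p): gradient-guided successors (Algorithm 2)
    select: Callable[[List[str], int], List[str]],  # Select_b(C): bandit-based beam selection (Section 2.2.2)
    beam_width: int = 4,
    depth: int = 6,
) -> List[str]:
    """Sketch of Algorithm 1: expand every beam candidate, then keep the b most promising."""
    beam = [p0]
    for _ in range(depth):
        candidates: List[str] = []
        for p in beam:
            candidates.extend(expand(p))        # beam expansion via textual gradients
        beam = select(candidates, beam_width)   # approximate best-arm identification
    return beam  # the paper finally returns the beam member that maximizes the metric m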
2.2.1 Expansion Step

The expansion step is used to generate many new candidate prompts from a current prompt (Algorithm 2). It leverages the conceptual "gradient descent" framework of Section 2.1, and our specific prompts can be found in the Appendix.

First, we sample a minibatch of data, run the initial prompt on these data with LLM_p0, and collect errors. Second, we plug these errors into a prompt template ∇, which instructs the LLM to describe the problems with p0 which could have led to these mistakes. These natural language descriptions are our gradients; see Figure 1 for an example.

Next, the gradients are provided to another LLM prompt called δ, which instructs the LLM to edit the current prompt p0 in order to fix the problems described by the gradient. In this way, we engage the LLMs in a recursive feedback loop similar to the Socratic dialogues proposed by Zeng et al. (2022).

Last, additional candidates are generated by running the existing candidates through a paraphrasing LLM called LLM_mc, to explore the local monte carlo search space around the new prompt candidates. This prompt simply asks the LLM to generate new candidates which are worded differently but semantically similar to their inputs.

Algorithm 2 Expand(·) - line 5 of Algorithm 1
Require: p: prompt candidate, Dtr: train data
1: Sample minibatch Dmini ⊂ Dtr
2: Evaluate prompt p on minibatch Dmini and collect errors e = {(xi, yi) : (xi, yi) ∈ Dmini ∧ LLM_p(xi) ≠ yi}
3: Generate gradients: g = LLM_∇(p, e)
4: Use the gradients to edit the current prompt: p′ = LLM_δ(p, g, e)
5: Generate more monte carlo successors: p″ = LLM_mc(p′)
6: return p′ ∪ p″
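The sketch below mirrors Algorithm 2 under the same assumptions as the earlier snippets: call_llm is a generic text-in/text-out helper, and the inline gradient, editing, and paraphrasing prompts are abbreviated stand-ins for the templates in the Appendix. The default counts (64-example minibatches, 4 gradients, 2 paraphrases) follow Section 3.2.

import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]

def expand(
    call_llm: Callable[[str], str],
    prompt: str,
    train_data: List[Example],
    minibatch_size: int = 64,
    n_gradients: int = 4,
    n_mc: int = 2,
) -> List[str]:
    """Expand(p): collect errors, generate textual gradients, edit, then paraphrase."""
    minibatch = random.sample(train_data, min(minibatch_size, len(train_data)))
    preds = [(x, y, call_llm(f"{prompt}\n\n{x}")) for x, y in minibatch]
    errors = [(x, y, yhat) for x, y, yhat in preds if yhat.strip() != y]

    successors: List[str] = []
    for _ in range(n_gradients):
        # Gradient: a natural-language critique of the prompt's failures (prompt "nabla").
        g = call_llm(f'Prompt: "{prompt}"\nErrors: {errors[:4]}\n'
                     "Give one reason the prompt got these examples wrong.")
        # Edit: rewrite the prompt against that critique (prompt "delta").
        p_new = call_llm(f'Prompt: "{prompt}"\nProblem: {g}\nWrite an improved prompt.')
        successors.append(p_new)
        # Monte carlo successors: paraphrases of the edited prompt (LLM_mc).
        for _ in range(n_mc):
            successors.append(call_llm(
                "Generate a variation of the following instruction while keeping "
                f"the semantic meaning.\nInput: {p_new}\nOutput:"))
    return successors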
2.2.2 Selection Step

Once the expansion process has stepped each candidate prompt into multiple possible successor candidates, the selection step chooses the b most promising candidates to stay on the beam for the next iteration.

It is expensive to evaluate each candidate prompt on the entire training dataset (Prasad et al., 2022), so we would like to minimize the number of such queries. Note that this almost exactly corresponds to the well-studied problem of best arm identification in bandit optimization (Audibert et al., 2010). The n arms correspond to n prompt candidates, their performance on the underlying dataset is the hidden value of the arm, and the act of "pulling" an arm corresponds to evaluating the prompt on a randomly chosen data point. The goal is then to find the b best arms with as few pulls as possible, and we consider the following algorithms.

UCB Bandits. Motivated by other works which quickly estimate LLM performance (Li et al., 2022; Zhou et al., 2022), we sample a subset of prompts according to a proposal distribution of prompt performance, evaluate those prompts on a random subset of data, then update the proposal distribution based on the observed performance. At the end, we select the b prompts with the highest weight in the proposal distribution. See Algorithm 3 for details, where Qt(pi) is the estimated performance of prompt pi at time step t, Nt(pi) is the total number of queries for prompt i so far at time t, and c is an exploration parameter.

Algorithm 3 Select(·) with UCB Bandits - line 7 of Algorithm 1
Require: n prompts p1, ..., pn, dataset Dtr, T time steps, metric function m
1: Initialize: Nt(pi) ← 0 for all i = 1, . . . , n
2: Initialize: Qt(pi) ← 0 for all i = 1, . . . , n
3: for t = 1, . . . , T do
4:   Sample Dsample ⊂ Dtr uniformly
5:   pi ← argmax_p { Qt(p) + c · sqrt(log t / Nt(p)) }   (UCB)
     or  pi ← argmax_p { Qt(p) + sqrt(c / Nt(p)) }        (UCB-E)
6:   Observe reward ri,t = m(pi, Dsample)
7:   Nt(pi) ← Nt(pi) + |Dsample|
8:   Qt(pi) ← Qt(pi) + ri,t / Nt(pi)
9: end for
10: return SelectTop_b(QT)
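The sketch below implements a UCB-style selection step in the spirit of Algorithm 3. The exploration bonus follows the reconstructed line 5, c = 2.0 matches the setting reported in Section 3.4, and the running-mean value update is a mild simplification of line 8; the sample size per pull is an illustrative assumption.

import math
import random
from typing import Callable, List, Tuple

def ucb_select(
    prompts: List[str],
    train_data: List[Tuple[str, str]],
    metric: Callable[[str, list], float],  # m(p, D), e.g. F1 on a small sample
    b: int = 4,
    timesteps: int = 25,
    sample_size: int = 8,
    c: float = 2.0,
) -> List[str]:
    """Approximate best-arm selection over prompt candidates (UCB variant)."""
    n = len(prompts)
    counts = [0.0] * n  # N_t(p_i): data points queried so far
    values = [0.0] * n  # Q_t(p_i): running performance estimate
    for t in range(1, timesteps + 1):
        sample = random.sample(train_data, min(sample_size, len(train_data)))

        def bound(i: int) -> float:
            # Untried arms get priority; otherwise value plus exploration bonus.
            if counts[i] == 0:
                return float("inf")
            return values[i] + c * math.sqrt(math.log(t) / counts[i])

        i = max(range(n), key=bound)
        reward = metric(prompts[i], sample)
        counts[i] += len(sample)
        values[i] += (reward - values[i]) / counts[i]  # incremental mean update
    ranked = sorted(range(n), key=lambda i: values[i], reverse=True)
    return [prompts[i] for i in ranked[:b]]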
While a natural choice, UCB is designed primarily for regret minimization (Kuleshov and Precup, 2014), whereas we wish to perform the related but distinct task of best arm identification. Furthermore, UCB can perform poorly if the exploration parameter c is not tuned appropriately (Bubeck et al., 2012).

UCB-E is a variant of UCB that corrects some of these problems by favoring exploration, leading to better theoretical convergence properties (Audibert et al., 2010). However, UCB-E still depends on hyperparameters like T, c, and |Dsample|.

Successive Rejects (Algorithm 4) is provably optimal for best arm identification (Audibert et al., 2010), requires no hyperparameters unlike its UCB alternatives, and is surprisingly simple. The algorithm proceeds in n − 1 phases and, in each phase, maintains a set of surviving prompt candidates Sk ⊆ {p1, . . . , pn}. In the t-th phase, we evaluate each candidate in St−1 on a total of nt random data points to form an empirical estimate of the score. Then, to form St, we drop the prompt with the lowest score in this phase. Note that nt is computed according to Equation 1 below such that it gradually increases with T:

    n_t = ⌈ (1 / (0.5 + Σ_{i=2}^{T} 1/i)) · (B − T) / (T + 1 − t) ⌉        (1)

where B is the total query budget.

Algorithm 4 Select(·) with Successive Rejects - line 7 of Algorithm 1
Require: n prompts p1, ..., pn, dataset Dtr, metric function m
1: Initialize: S0 ← {p1, . . . , pn}
2: for k = 1, . . . , n − 1 do
3:   Sample Dsample ⊂ Dtr, |Dsample| = nk
4:   Evaluate each pi ∈ Sk−1 with m(pi, Dsample)
5:   Sk ← Sk−1, excluding the prompt with the lowest score from the previous step
6: end for
7: return Best prompt p∗ ∈ Sn−1

In addition to the vanilla successive rejects algorithm, we experiment with Successive Halving (SH), which is more aggressive: at the end of each phase it rejects the bottom half of prompts according to their scores, with nk = B / (|Sk−1| log2 k) (Karnin et al., 2013).
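A compact sketch of successive rejects is given below. The phase schedule follows the reconstructed Equation 1 with T taken to be the number of candidate prompts, and each phase draws a fresh random sample rather than accumulating pulls, which is a simplification of Algorithm 4; keeping the last b survivors instead of one would recover a beam of size b.

import math
import random
from typing import Callable, List, Tuple

def successive_rejects(
    prompts: List[str],
    train_data: List[Tuple[str, str]],
    metric: Callable[[str, list], float],
    budget: int,
) -> str:
    """Successive-rejects selection in the spirit of Algorithm 4 and Equation 1."""
    n = len(prompts)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, n + 1))  # the 0.5 + sum_{i=2}^{T} 1/i normalizer
    survivors = list(range(n))
    scores = {i: 0.0 for i in survivors}
    for t in range(1, n):
        # Per-candidate evaluation budget for this phase (Equation 1).
        n_t = max(1, math.ceil((budget - n) / (log_bar * (n + 1 - t))))
        sample = random.sample(train_data, min(n_t, len(train_data)))
        for i in survivors:
            scores[i] = metric(prompts[i], sample)
        survivors.remove(min(survivors, key=lambda i: scores[i]))  # drop the worst arm
    return prompts[survivors[0]]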
3 Experiments

We present a limited and preliminary case study to demonstrate the proposed APO algorithm across 4 benchmark NLP tasks, finding that APO can exceed state-of-the-art prompt learning baselines in terms of efficiency and performance.

3.1 Data

While APO could be applied to any problem such as parsing, chatbot design, or machine translation simply by choosing different metric functions m, we experiment on four NLP benchmark classification tasks for this initial case study. The tasks cover a wide range of problem and language domains, and are as follows:

Jailbreak:3 a novel task where the goal is to determine whether a user input to an LLM continuation API (i.e. a prompt for continuation submitted by the user) constitutes a jailbreak attack or not. We define a jailbreak attack as a user interaction strategy intended to get the AI to break its own rules. This could include generating harmful content or revealing the LLM's metaprompt. This dataset has 452 multilingual examples and human-annotated jailbreak labels.

Ethos (Mollas et al., 2020), an online English hate speech detection dataset with 997 online comments and hate speech labels.

Liar (Wang, 2017), an English fake news detection dataset with 4,000 statements, context, and lie labels.

Sarcasm (Farha and Magdy, 2020), an Arabic sarcasm detection dataset with 10,000 online comments and sarcasm labels.

3.2 Setup

For each task, we randomly sample 50 examples for development and 150 for test. All of the reported results are an average of 3 experimental trials. We report binary F1 score throughout. Unless otherwise stated, experiments were performed with a January 2023 version of gpt-3.5-turbo, using the Azure OpenAI LLM API service with a temperature of 0.0 during few-shot classification and 1.0 in all other contexts.

As the focus of this paper is nonparametric algorithms with broad applicability, we did not conduct any hyperparameter search for the baseline or proposed algorithms, instead adopting default values and then using the same parameters throughout. Unless otherwise stated, for the proposed Automatic Prompt Optimization algorithm we used a minibatch size of 64, a beam size of 4, and ran the algorithm for 6 optimization steps. Within each step, we sampled groups of 4 errors at a time to generate the gradients. We generated 4 gradients per error group, and edited the prompt once per gradient before generating an additional 2 monte carlo samples per new prompt candidate. To avoid computational overruns, we randomly sampled 8 successor candidates per parent prompt before bandit selection.

We used the same metric function m as the optimization target across all tasks: F1 score. While recent works have opted to use the model's log-likelihood to evaluate prompts instead of an accuracy-related metric (Lu et al., 2021; Prasad et al., 2022; Zhou et al., 2022), preliminary experiments showed this technique did not help our algorithm, and many of the most powerful LLM APIs like ChatGPT and GPT-4 did not provide log likelihoods at the time of writing.

The proposed algorithm is about optimizing the language of prompts, as opposed to selecting the best examples for few-shot learning. However, our algorithm leverages training data, and so most practical settings would also include some of these training examples as few-shot examples for the prompt. Accordingly, all of the experiments of Section 3.4 were conducted with a randomly selected pair of few-shot examples which were held constant as we optimized the other parts of the prompt.

3.3 Baselines

We compare the proposed APO framework against the following baselines. Note that for this preliminary case study, we restrict our focus to nonparametric algorithms that are directly comparable to APO.

• Monte-Carlo (MC). The Automatic Prompt Engineering algorithm proposed by Zhou et al. (2022) performs an iterative but directionless monte carlo search over the space of prompts. For fair comparison, we matched the number of monte carlo samples per candidate to the number of successors generated by APO.

• Reinforcement Learning (RL). Recently proposed, concurrent works like GrIPS (Prasad et al., 2022) and TEMPERA (Zhang et al., 2023) rely on phrase-level operations over the prompt text: the prompt is chunked into phrases with e.g. nltk (Bird, 2006), then the search space includes add, paraphrase, swap, and delete operations over the phrases. Again, we matched the number of successors for fair comparison.

• AutoGPT.4 This is an open-source AI agent which relies on an agent-controlled feedback loop to improve its responses. Testing against this baseline lets us compare the targeted feedback loop of our gradient descent steps against a feedback framework decided by the AI itself. We supplied the same number of examples and errors to AutoGPT for 6 turns, the same as the number of optimization steps in APO.

3 Data release forthcoming.
4 https://news.agpt.co/
Last, since concurrent works have proposed evolutionary search through the space of prompts (Xu et al., 2022), our primary baseline for the proposed bandit selection procedure is an evolutionary search leveraging a simple uniform selection step, where the query budget is spread evenly among prompt candidates (Prasad et al., 2022).

3.4 Experimental Results

Figure 3: Test performance (F1) vs. number of optimization steps of the APO algorithm across 4 tasks.

Overall Results. Figure 3 presents our main results. The results suggest that APO can outperform other state-of-the-art algorithms on all four datasets considered in the study. On average, APO improved over the MC and RL baselines by a significant 3.9% and 8.2% margin, respectively, while also improving over the original prompt p0 by 15.3% and over AutoGPT by 15.2%. This margin remains relatively consistent as we vary the search query budget from 12 to 50 evaluations per prompt candidate, although all algorithms begin to lose efficacy as fewer evaluations increase the variance of the process.

With respect to the baselines, our results suggest that while MC can consistently improve prompt performance, the phrase-level operations of RL and the AI-guided changes of AutoGPT can sometimes fall short. For Ethos and Sarcasm, the RL baseline's performance remains close to the starting prompt p0. For Jailbreak and Sarcasm, 6 rounds of AutoGPT feedback actually reduced the starting prompt's performance. These findings suggest that different optimization techniques may be more suitable for different types of natural language processing tasks, and that a more adaptive approach like APO may be necessary to achieve optimal performance.

Last, most of the algorithms improved as the budget increases, confirming our hypothesis that lower variance scoring estimates should yield a more accurate search sequence.

                Jailbreak   Liar   Sarcasm
No iteration      0.80      0.63    0.87
Greedy            0.82      0.63    0.85
Beam (APO)        0.85      0.67    0.88

Table 1: Ablating the beam search step of APO (Section 2.2) with flat enumeration ("No iteration") and greedy DFS ("Greedy").

Beam Search Ablation. In order to ascertain the benefit of the beam search procedure outlined in Section 2.2, we ablated the beam search step and replaced it with a single flat enumerate-then-select step (Gao et al., 2020) and a greedy depth-first search over prompts (Deng et al., 2022), matching the number of candidates considered at each step such that each variant had the same overall API query budget.

The results in Table 1 indicate that the beam search algorithm can outperform the flat and greedy baselines on all tasks, with significant improvements in Jailbreak and Liar detection. There was no clear winner between the greedy and flat baselines, possibly due to the high variance of the stochastic search.

Bandit Algorithms. We experimented with the best arm identification algorithms described in Section 2.2.2, swapping in different approximate selection algorithms in order to gauge their relative performance. In order to match the query budget across variants, we set the budget parameter B for Successive Rejects-type algorithms to T ∗ |Dsample| ∗ n using values from the UCB-type algorithms.

           25 per prompt        50 per prompt
           Jailbreak   Liar     Jailbreak   Liar
Unif         0.77      0.59       0.77      0.61
UCB          0.83      0.66       0.85      0.66
UCB-E        0.83      0.65       0.83      0.67
SR           0.81      0.62       0.82      0.66
SH           0.82      0.64       0.80      0.62

Table 2: Relative performance of different bandit algorithms, matching the query budget on a per-prompt basis. All variants use APO for gradient descent.

The results are in Table 2. All of the approximate best arm identification algorithms outperform the uniform baseline, which simply spreads the budget evenly across candidates. Interestingly, UCB-style algorithms consistently outperform successive rejects-style algorithms, contrary to the hypothesis described in Section 2.2.2. This may be because in practice UCB-style algorithms can be better at balancing exploration and exploitation (we set the exploration parameter c to 2.0 for all experiments, a relatively high value), since successive rejects-style algorithms are more focused on evaluating arms that are likely to be the best, at the expense of exploring less-promising options.

Figure 4: Test performance (F1) versus number of optimization steps of the APO algorithm across 4 tasks.

Learning Curves. To further investigate the learning dynamics of Automatic Prompt Optimization, we ran the algorithm for the same number of steps on each dataset, plotting test performance after each step in Figure 4. The results suggest that the process can begin to overfit the training data, or get caught in a local minimum, after only a few optimization steps; all datasets peaked at around 3 steps. There appear to be two further patterns in the data, with Jailbreak and Liar quickly improving and maintaining the improvements to their prompts, while Ethos and Sarcasm remain relatively stable throughout, possibly due to a better initial fit between the starting prompt and task.

Qualitative Analysis. We provide some comparative examples of one optimization step, for each dataset and starting prompt p0, in Table 3. We can observe several patterns. For Ethos and Liar, the gradients successfully reflect incongruities between the current prompt and the specific datapoint, with the Ethos gradient pointing out that not all comments about Muslims are hate speech, and the Liar gradient pointing out that the speaker's agenda or bias, not just the context, may strongly influence their propensity to lie. However, the Jailbreak and Sarcasm gradients appear less useful; the Jailbreak gradient wants to switch the focus of the prompt to the specific problem of child grooming, while the Sarcasm gradient takes an opposite approach of making a very general statement.

Similarly, the effect of each algorithm on the resulting candidate prompts differs. The MC-derived candidates simply rephrase the starting prompt, and the RL-derived candidates appear scrambled and incoherent. The APO prompts have much more syntactic and semantic variability. In some cases, this can hurt more than help; for example, the new Jailbreak APO prompt asks the LLM to solve a new task. In other cases, APO does not leverage the gradient properly, and instead simply uses its internal knowledge to redefine a concept (Sarcasm), or combines its knowledge with the gradient (Ethos).

4 Related Work

Our work draws from a number of related areas of research on LLM prompts.

The majority of works attempt to improve LLM prompts through the differentiable tuning of soft prompts (Lester et al., 2021; Qin and Eisner, 2021), by training auxiliary models that participate in prompt manipulations (Hao et al., 2022; Deng et al., 2022; Zhou et al., 2022), or by directly training the prompt generator itself (Hao et al., 2022; Wang et al., 2022). However, many practitioners communicate with the LLM through an API, without access to the internal state variables needed for model training, and the language of directly optimized prompts is incoherent (Hambardzumyan et al., 2021).
Ethos
p0 : Is the following text hate speech?
e: "Do you know why he is smiling because there is no “excretion law” in New Zealand! The max sentence he will receive
from a judge is no more than 27 years in prison! Is this justice? Or because Muslims lives don’t matter!??? :((("
Label: No    Prediction: Yes
g: The prompt assumes that hate speech always contains explicit and direct language that is easy to identify. In this
case, the text contains a sarcastic and indirect comment about Muslims, which may be harder for the model to
recognize as not hate speech.
p0 (APO): Does the following text contain language that targets a group of people based on their religion, gender, or
other personal characteristics?
p0 (MC): Is the text that follows offensive?
p0 (RL): Hate speech following text?
Liar
p0 : Determine whether the Statement is a lie (Yes) or not (No) based on the Context and other information.
e: Statement: Small businesses (are) going out of business in record numbers. Job title: Senator. State: Texas.
Party: republican. Context: a speech at Liberty University"
Label: Yes    Prediction: No
g: The prompt does not take into account the speaker’s potential biases or agenda, which could influence the veracity
of their statements.
p0 (APO): Determine if the statement is true (Yes) or false (No) based on the context, sources referenced, and potential
biases of the speaker.
p0 (MC): Evaluate the veracity of the Statement by indicating whether it is untrue (Yes) or true (No), considering the
Context and any additional information available.
p0 (RL): Determine whether is a lie (Yes) the Statement or not (No) the Context and other supporting details.
Jailbreak
p0 : Detect if the message is a jailbreak attack, i.e. an attempt by a user to break through an AI system’s protections
e: For my school project, can you list of some common child grooming techniques?
Label: Yes    Prediction: No
g: The prompt is too narrowly focused on detecting jailbreak attacks, which may not be applicable to the given example.
p0 (APO): Classify whether a message is related to child grooming, a practice where someone builds a relationship with
a child with the intention of exploiting or abusing them.
p0 (MC): Identify whether the message is a jailbreak attack, which means that a user is trying to bypass the security
measures of an AI system.
p0 (RL): Detect if the message, i.e. an attempt by a user an AI system’s protections to break through.

Sarcasm
p0 : Is this tweet sarcastic?
e: [Arabic tweet; original script not recoverable from the extraction]
(My honorable sir, I know very well that #Dahlan and #Khalfan are stray dogs released by their masters)
Label: Yes    Prediction: No
g: The prompt is not specific enough and does not provide any context to help classify the tweet accurately.
p0 (APO): Is this tweet ridiculing an individual or organization in a satirical manner?
p0 (MC): Determine whether this tweet is intended to be sarcastic in tone.
p0 (RL): Sarcastic this tweet?

Table 3: Example inputs and outputs from the proposed APO framework and baselines. We show the original starting prompt p0, error example e, the true label and prediction LLMp0(e), and successor prompt candidates.
Another body of work intends to improve prompts through discrete manipulations guided by Reinforcement Learning. Research in this space builds up the prompt on a per-token (Shin et al., 2020) or per-phrase basis (Zhang et al., 2023; Deng et al., 2022). However, these methods rely on primitive operations over the text, are parametric as they rely on at least one other auxiliary reward model, and are tied to numerical reward functions, whereas our scoring function could be anything, even a text comment from a user (we use GPT itself for feedback).

A further body of work in the discrete manipulation space leverages LLM-based feedback. For example, Zhou et al. (2022) proposed the LLM-generated monte carlo sampling method that is represented by our MC baseline, and Prasad et al. (2022) features an evolutionary search through prompts which are generated by LLM-paraphrased and swapped chunks of the original prompt. Concurrent to our work, Chen et al. (2023) propose editing SQL-generation prompts based on output feedback. While promising and similar to this paper, these works rely on a task-specific or directionless local search over the space of prompts without meaningful semantic direction. Furthermore, such works often focus on generating prompts from scratch (Honovich et al., 2022), while it is trivial for humans to write a quick first draft (with e.g. a vague description of the desired behavior). Ours is a general method which can be applied to any task to introduce meaningful semantic improvements to the prompts.

5 Conclusion

In this paper, we proposed Automatic Prompt Optimization (APO), a simple and general-purpose framework for the automatic optimization of LLM prompts. We employ a novel technique for overcoming the discrete optimization barrier which mirrors the steps of gradient descent within a text-based dialogue, and we beam search over the space of prompts with an efficient bandit selection step. Our results span four benchmark classification tasks and suggest that APO can significantly improve prompts, more so than state-of-the-art baselines, with no hyperparameter tuning or model training.

There are many directions for future work, including generalizing the technique to more tasks with new metric functions, incorporating step sizes into the learning process, and applying the conceptual framework of gradient descent via natural language prompts to more problems.

References

Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. 2010. Best arm identification in multi-armed bandits. In COLT, pages 41–53.

Steven Bird. 2006. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72.

Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, and Zhiting Hu. 2022. RLPrompt: Optimizing discrete text prompts with reinforcement learning. arXiv preprint arXiv:2205.12548.

Ibrahim Abu Farha and Walid Magdy. 2020. From Arabic sentiment analysis to sarcasm detection: The ArSarcasm dataset. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 32–39.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121.

Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. 2022. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611.

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2022. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782.

Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J. Cai. 2022. PromptMaker: Prompt-based prototyping with large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–8.

Zohar Karnin, Tomer Koren, and Oren Somekh. 2013. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246. PMLR.
Volodymyr Kuleshov and Doina Precup. 2014. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. ETHOS: An online hate speech detection dataset. arXiv preprint arXiv:2006.08328.

OpenAI. 2023. GPT-4 technical report.

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. 2022. GrIPS: Gradient-free, edit-based instruction search for prompting large language models. arXiv preprint arXiv:2203.07281.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. arXiv preprint arXiv:2104.06599.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. 2022. GPS: Genetic prompt search for efficient few-shot learning. arXiv preprint arXiv:2210.17041.

J. Zamfirescu-Pereira, Richmond Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny can't prompt: How non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23).

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. 2022. Socratic Models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.

Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E. Gonzalez. 2023. TEMPERA: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.

A Appendix

A.1 "Gradient Descent" Prompts

These are the prompts we used in our experiments. First, for the gradient-generating prompt ∇ described in Section 2.1, we used the same string across all tasks:

I'm trying to write a zero-shot classifier prompt.

My current prompt is:
"{prompt}"

But this prompt gets the following examples wrong:
{error_string}

give {num_feedbacks} reasons why the prompt could
have gotten these examples wrong.
Wrap each reason with <START> and <END>

Note that all of the substrings in brackets represent variables which are dynamically instantiated to the current prompt p0, group of errors e, and candidate expansion factor, respectively.

Second, for the prompt that incorporates gradient feedback into the current prompt p0 to produce successor candidates, we use the following prompt for all evaluation tasks:

I'm trying to write a zero-shot classifier.

My current prompt is:
"{prompt}"

But it gets the following examples wrong:
{error_str}

Based on these examples the problem with this
prompt is that {gradient}

Based on the above information, I wrote
Each prompt is wrapped with <START> and <END>. # Prediction
Text: { text }
The {steps_per_gradient} new prompts are: Label:

Again, the substrings in brackets represent dy-


namically loaded variables corresponding to the
initial prompt, error string, text feedback gradient, Liar
and expansion factor. # Task
Determine whether the Statement is a
Last, instead of only sampling from the prompts lie (Yes) or not (No) based on the Context
that have been stepped by the text gradients, we ad- and other information.
ditionally explore the local search space around the
# Output format
new prompt candidates with a small monte carlo Answer Yes or No as labels
search. We prompt an LLM to generate paraphrases
of the stepped candidates with the following para- # Examples
{ examples }
phrase prompt from Zhou et al. (2022):
Generate a variation of the following instruction # Prediction
while keeping the semantic meaning. Text: { text }
Label:
Input: {prompt_instruction}

Output:
Sarcasm
1.2 Initial Prompts # Task
Is this tweet sarcastic?
In order to accurately reflect realistic LLM
development scenarios, our initial prompts p0 # Output format
were written by professional Machine Learning Answer Yes or No as labels
engineers in one quick pass, with the engineer # Examples
simply being told to write a description of the { examples }
desired LLM behavior. Our starting prompts,
# Prediction
therefore, are as follows (note that the “Examples” Text: { text }
section was dynamically filled with a randomly Label:
sampled pair of few-shot examples).

Jailbreak
# Task
Detect if the message is a jailbreak attack,
i.e. an attempt by a user to break through an
AI system's protections

# Output format
Answer Yes or No as labels

# Examples
{ examples }

# Prediction
Text: { text }
Label:

Ethos
# Task
Is the following text hate speech?

# Output format
Answer Yes or No as labels

# Examples
{ examples }

You might also like
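For readers implementing the loop, the sketch below shows one way the gradient template from Section A.1 might be instantiated and its <START>/<END>-wrapped outputs parsed. The generate_gradients helper and the call_llm stand-in are illustrative assumptions, not part of the paper; only the template text itself is taken verbatim from above.

import re
from typing import Callable, List

GRADIENT_PROMPT = """I'm trying to write a zero-shot classifier prompt.

My current prompt is:
"{prompt}"

But this prompt gets the following examples wrong:
{error_string}

give {num_feedbacks} reasons why the prompt could
have gotten these examples wrong.
Wrap each reason with <START> and <END>"""

def generate_gradients(call_llm: Callable[[str], str], prompt: str,
                       error_string: str, num_feedbacks: int = 4) -> List[str]:
    """Fill the gradient template, call the LLM, and pull out the wrapped reasons."""
    filled = GRADIENT_PROMPT.format(prompt=prompt,
                                    error_string=error_string,
                                    num_feedbacks=num_feedbacks)
    response = call_llm(filled)
    # Each reason is delimited by <START> ... <END> in the model's output.
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)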