
Toward Human Readable Prompt Tuning:

Kubrick’s The Shining is a good movie, and a good prompt too?


Weijia Shi∗ Xiaochuang Han∗
Hila Gonen Ari Holtzman Yulia Tsvetkov Luke Zettlemoyer
Paul G. Allen School of Computer Science & Engineering,
University of Washington, Seattle, WA
{swj0419, xhan77, hilagnn, ahai, yuliats, lsz}@cs.washington.edu

arXiv:2212.10539v1 [cs.CL] 20 Dec 2022

Abstract

Large language models can perform new tasks in a zero-shot fashion, given natural language prompts that specify the desired behavior. Such prompts are typically hand engineered, but can also be learned with gradient-based methods from labeled data. However, it is underexplored what factors make the prompts effective, especially when the prompts are natural language. In this paper, we investigate common attributes shared by effective prompts. We first propose a human readable prompt tuning method (FluentPrompt) based on Langevin dynamics that incorporates a fluency constraint to find a diverse distribution of effective and fluent prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of label words. Based on these findings, we also propose a method for generating prompts using only unlabeled data, outperforming strong baselines by an average of 7.0% accuracy across three tasks.

Figure 1: Compared with the previous discrete prompt tuning method AutoPrompt (Shin et al., 2020), which generates gibberish prompts, FluentPrompt can identify effective and more readable prompts that are topically relevant to the task domain and calibrate the prior probability of label words.

1 Introduction

Large language models can perform new tasks by simply conditioning on a prompt, a short sequence of text specific to the task. Such natural language prompts are either carefully hand engineered (e.g., manual prompt engineering, Kojima et al. 2022) or automatically learned from labeled data (e.g., gradient-based prompt tuning, Shin et al. 2020). Despite their effectiveness, it remains unclear what makes these prompts work and what attributes effective prompts share in common. In this paper, we aim to identify key characteristics of effective prompting, and use this knowledge to generate effective and human readable prompts without any labeled data.*

* Equal contribution. Order randomly determined.

There are two main challenges for performing this type of analysis. First, manual prompt tuning produces a limited number of effective prompts for each task, making it difficult to infer common features of good prompts where contrast with less effective prompts is needed. Additionally, the prompts found by gradient-based tuning methods are often disfluent and unnatural, making them difficult to interpret (e.g., AutoPrompt in Figure 1).

To overcome these challenges, we first propose a human readable prompt tuning method called FluentPrompt based on a constrained decoding method. FluentPrompt uses Langevin dynamics to generate a set of human readable prompts for any task. Our method adds progressive noise to the tuning procedure to obtain a distribution of effective prompts, while also maintaining the fluency of the prompts through a perplexity constraint. As shown in Figure 1, compared to the baseline gibberish prompts produced by AutoPrompt, FluentPrompt generates prompts that are more fluent (i.e., lower perplexity) and perform competitively. The resulting fluent prompts not only facilitate our further analysis, but can also lead to better trust and engagement from both researchers and end users.

After obtaining a diverse set of effective and human-readable prompts, we analyze the factors that contribute to the effectiveness of prompts. Specifically, we show that effective prompts are both (1) topically related to the task domain and (2) more calibrated to the task verbalizers. Calibration measures how balanced the label word distribution of the prompted model is given an example-independent domain string (Holtzman et al., 2019).

Based on our findings, we propose a novel method, UnsupervisedFluentPrompt, for automatically searching for effective prompts using only unlabeled data. UnsupervisedFluentPrompt optimizes the prompts for both better calibration and better domain relevance. Our experimental results show that UnsupervisedFluentPrompt outperforms a strong zero-shot baseline (Holtzman et al., 2021) by 7.0% in accuracy.

We summarize our contributions as follows:

• We introduce FluentPrompt, a human-readable prompt tuning method that can generate a diverse set of effective and fluent prompts (§3). This method not only serves as the foundation for our analysis, but also helps bridge the gap between manual prompt engineering and gradient-based prompt tuning.

• We analyze the factors that contribute to the effectiveness of prompts and show that topic relatedness and calibration of the prompts are key to their success (§4).

• Inspired by our findings, we introduce a new method for discovering effective prompts without the need for labeled data (§5).

2 Related Work

2.1 Prompt Tuning

Continuous Prompt: Continuous prompts are continuous vectors inserted into the task input for a prompted language model (Qin and Eisner, 2021; Ding et al., 2021; Lester et al., 2021; Liu et al., 2021). Such continuous prompts are typically tuned by gradient-based methods, which are guided by the task's training examples with labels. While these prompts usually improve the model performance, their continuous nature makes them difficult for humans to understand or interpret (Khashabi et al., 2021; Hambardzumyan et al., 2021).

Discrete Prompt: Discrete prompts are composed of discrete tokens from the natural language vocabulary. Such prompts can be either written by humans or searched automatically. Human-written prompts (Kojima et al., 2022; Wang et al., 2022; Sanh et al., 2021) typically consist of meaningful texts such as task descriptions (Schick and Schütze, 2021) or instructions (e.g., "let's think step by step", Kojima et al. 2022), which are not only human readable but also align with human understanding of the task. In-context demonstration examples can also be considered human-written prompts (Brown et al., 2020; Liu et al., 2022) but are not a focus of this work.

Prior work has also focused on searching for discrete prompts automatically. One prominent way to perform this search is gradient-based, similar to the continuous prompt setup but with projections to a discrete vocabulary (Shin et al., 2020). The drawback of this method is that the resulting prompts are usually disfluent and difficult to read. Other work searching for discrete prompts includes edit-based enumeration (Prasad et al., 2022), reinforcement learning (Deng et al., 2022), and large language model continuation and filtering (Zhou et al., 2022). The goal of these prompt tuning methods is mainly to achieve competitive task performance without modifying language model parameters.

The purpose of our work is to analyze what aspects of the tuned natural language prompts make them effective for zero-shot language models. To facilitate such analysis, we need prompt readability as in human-written prompts and also a large search space as in gradient-based discrete prompt tuning. FluentPrompt bridges the gap and provides a distribution of effective, diverse, and human-readable prompts.

2.2 Analyses of Prompts

A growing body of literature tries to understand the mechanisms behind prompts via various perspectives. For example, prompts in the form of in-context examples are analyzed under perturbations w.r.t. order, label, editing, etc. (Lu et al., 2022; Min et al., 2022; Chen et al., 2022). Human-written instructions (Mishra et al., 2021)
have also been studied and show weak sensitivity to semantic-changing perturbations (Webson and Pavlick, 2021). Gonen et al. (2022) use paraphrasing and back-translation on a set of human-written prompts and analyze the correlation between their perplexity and performance.

Our work focuses on natural language prompts derived from gradient-based prompt tuning. Khashabi et al. (2021) tune continuous prompts and show that effective continuous prompts may transfer poorly to their nearest discrete prompts. In contrast, we perform prompt tuning in the discrete space directly with FluentPrompt, demonstrating the feasibility of searching for readable prompts with a gradient-based method. This approach gives us a more faithful understanding of the factors that contribute to the effectiveness of natural language prompts.

3 FluentPrompt

FluentPrompt generates a diverse set of human-readable prompts. Our goal is not only to identify a single best-performing prompt, but also to explore the relationship between the features of the prompts and their performance.

3.1 Background: continuous prompt tuning

Given an input example x with an output label y ∈ Y, we can prompt an autoregressive language model with parameters θ as follows. We reformulate the task as a language modeling problem by inserting a task-specific template t to x and defining a verbalizer v mapping from a label y to a label word (i.e., a token in the LM's vocabulary). The probability of the label is estimated by:

p_θ(v(y) | x, t) = exp(logit_θ(v(y) | x, t)) / Σ_{y'} exp(logit_θ(v(y') | x, t))

Lester et al. (2021) add a sequence of M soft embeddings ẽ_{0:M} (simplified as ẽ; 0:M refers to the positional subscript for the sequence from position 0 to M−1) in front of the input. Therefore, the probability of the label is computed by p_θ(v(y) | ẽ, x, t), where ẽ is embeddings that bypass the word embedding layer of the LM θ and are learned based on a set of training data. These learned embeddings are sometimes referred to as soft prompts, and the learning of such prompts as soft prompt tuning. For example, if stochastic gradient descent (SGD) is used as the optimizer, the soft prompt ẽ is updated as

ẽ^i = ẽ^{i−1} − η ∇_ẽ (− log p_θ(v(y) | ẽ^{i−1}, x, t))

where i is the timestep superscript, referring to the i-th optimization step.

3.2 Discrete prompt tuning with Langevin dynamics

There are two challenges for the soft prompt tuning reviewed above. First, the resulting embeddings cannot be mapped to the natural language vocabulary. Khashabi et al. (2021) show that naively mapping an effective soft prompt to its nearest tokens significantly drops the performance. Second, we only obtain a single embedding instead of a range of embeddings with varying levels of performance. This makes it difficult to analyze the characteristics of the prompts and compare their effectiveness in specific tasks for the language model.

Following Kumar et al. (2022), we use Langevin dynamics to sample discrete prompts that lead to a better performing model in the task. Overall, the method is similar to SGD but adds a progressive Gaussian noise to the embeddings, with the scale decreasing over time. Additionally, at each optimization step, the updated embedding is projected to the nearest embedding in the LM vocabulary:

ẽ^i = Proj_E [ ẽ^{i−1} − η ∇_ẽ E(ẽ^{i−1}) + √(2η β_i) z ]

where the first two terms inside the projection are the same as in the SGD update, the last term is the new noise, and:

• E(·) is an energy function (lower is better), E(ẽ^{i−1}) = − log p_θ(v(y) | ẽ^{i−1}, x, t).

• z is a Gaussian noise, z ∼ N(0, I_{|ẽ|}).

• β_i is the variance of the noise, following a geometric progression with β_start > β_i > β_end → 0.

• E is the embedding table (layer) of the LM θ, one embedding for each token in the vocabulary.

• Proj_E is a projection operation finding a nearest neighbor for each soft embedding in the LM's vocabulary, Proj_E(ẽ) = argmin_{e_v ∈ E} ‖e_v − ẽ‖₂.

Without the progressive noise in Langevin dynamics, our prompt search procedure is gradient-based and shares a similar intuition with AutoPrompt (Shin et al., 2020).
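The projected Langevin update above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the quadratic `grad_E` here stands in for the true energy gradient obtained by backpropagating the LM loss, and `embedding_table` is a random stand-in for the LM's vocabulary embeddings.

```python
import numpy as np

def langevin_step(e, grad_E, embedding_table, eta, beta, rng):
    """One projected Langevin update (sketch of Sec. 3.2).

    e: (M, d) current soft-prompt embeddings
    grad_E: function mapping e -> dE/de; in the real method this would be
            backprop through the LM (here it is an illustrative assumption)
    """
    z = rng.standard_normal(e.shape)  # Gaussian noise z ~ N(0, I)
    e = e - eta * grad_E(e) + np.sqrt(2 * eta * beta) * z
    # Project each prompt position to its nearest row of the embedding table.
    dists = ((e[:, None, :] - embedding_table[None, :, :]) ** 2).sum(-1)
    return embedding_table[dists.argmin(axis=1)]

# Toy demo: a quadratic energy E(e) = ||e||^2 / 2 pulls the prompt toward zero.
rng = np.random.default_rng(0)
table = rng.standard_normal((50, 8))   # toy vocabulary of 50 embeddings
e = rng.standard_normal((5, 8))        # M = 5 prompt positions
for beta in np.geomspace(1.0, 1e-4, 100):  # geometrically decreasing noise
    e = langevin_step(e, lambda x: x, table, eta=0.1, beta=beta, rng=rng)
print(e.shape)  # (5, 8); each row is an exact row of the embedding table
```

Because the projection runs at every step, the returned prompt always consists of actual vocabulary embeddings, which is what makes the final prompt decodable into text.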
Both methods use the gradient of the loss w.r.t. the embeddings, though AutoPrompt applies greedy substitution whereas we use projected gradient descent, aligning with soft prompt tuning and enabling the subsequent prompt sampling. AutoPrompt also incorporates verbalizer word selection, which is not a focus of the analysis in this work. We use our gradient-based, discrete prompt tuning method without Langevin dynamics as a baseline, referred to as AutoPrompt_SGD.

3.3 Fluency constraint

Sampling from projected Langevin dynamics ensures that the tuned prompt contains natural language tokens. However, with no extra constraints, they can form a disfluent sentence.

We explicitly incorporate a fluency objective into the Langevin energy function. This objective resembles the regular perplexity loss, but the labels (next token in the prompt) are not ground-truth. Instead, we measure an embedding-based sequence probability according to Kumar et al. (2022). For simplicity, below we drop the timestep superscript on the prompt embeddings and only keep the positional subscript.

The first step is to obtain the probability of generating the embedding at position m (i.e., ẽ_m) based on the previous m−1 embeddings (i.e., ẽ_{0:m}). We extract the last hidden state from the LM (i.e., output embedding) at position m−1: h_{θ,m−1} = h_θ(ẽ_{0:m}). Then the probability is:

p_θ(ẽ_m | ẽ_{0:m}) = exp(h_{θ,m−1} · ẽ_m) / Σ_{e_v ∈ E} exp(h_{θ,m−1} · e_v)

where we equivalently compute the logits for each embedding's corresponding vocabulary item and take the softmax.¹ Subsequently, the sequence probability is p_θ(ẽ_{0:M}) = Π_{m=1}^{M−1} p_θ(ẽ_m | ẽ_{0:m}).

¹ This is equivalently computing the logits since e_v and the projected ẽ_m from the last optimization step are both in the embedding table.

We define a prompt fluency loss as the negative log-likelihood of the prompt embeddings, − log p_θ(ẽ_{0:M}). Along with the task labeling loss (§3.2), we modify our energy function as:

E(ẽ_{0:M}) = − λ_task log p_θ(v(y) | ẽ_{0:M}, x, t) − λ_fluency log p_θ(ẽ_{0:M})

where λ_task + λ_fluency = 1. Throughout the whole FluentPrompt tuning procedure, the language model parameters θ are fixed while the embeddings ẽ_{0:M} are tuned.

Prompt                                    Acc.   PPL
SST-2
  Empty Prompt                            66.5   -
  AutoPrompt_SGD
    Compl disgustingÃÂÃÂ Rated jer       87.6   > 10^6
  FluentPrompt
    Kubrick, "The Shining                 87.5   13.1
    Paramount, "The Shining               86.8   12.2
    Kubrick\'s "The Man                   86.3   9.3
    disappointing.\n\n"                   84.4   4.1
AMAZON
  Empty Prompt                            75.8   -
  AutoPrompt_SGD
    Reviewed experien audition lashesrible 82.2  > 10^6
  FluentPrompt
    scathing.\n\n"                        83.1   5.1
    upset.\n\n"                           82.6   3.67
    cigars: \n\n                          82.4   20.9
    mascara\n\n\n                         82.2   47.1
AGNEWS
  Empty Prompt                            49.7   -
  AutoPrompt_SGD
    EStreamFramenetflixnetflixobookgenre  69.3   > 10^5
  FluentPrompt
    netflix/genre/netflix                 71.1   281.0
    netflix AnimeMoviegenre\n             70.1   1925.0
    Synopsis\n\nThe story is              69.2   9.6
    pmwiki.php/main/Superhero             65.0   2.4

Table 1: Accuracy (Acc.) and perplexity (PPL) of prompts. Both FluentPrompt and AutoPrompt_SGD use M = 5 tunable tokens. FluentPrompt shows comparable performance to AutoPrompt_SGD but with significantly lower perplexity. Prompts discovered by FluentPrompt show domain relevance and potential calibration of model outputs.

3.4 Experimental Setup

Target tasks: We evaluate performance on two sentiment analysis tasks, Amazon Polarity (McAuley and Leskovec, 2013) and SST-2 (Socher et al., 2013), and one topic classification task, AGNEWS (Zhang et al., 2015). These tasks were selected since vanilla soft prompt tuning (Lester et al., 2021) substantially improves model performance. In contrast, tasks like RTE (Dagan et al., 2005) are more difficult; soft prompt tuning did not yield a significant improvement (57.4% accuracy from prompt tuning compared with 52.1% from random guessing) in our pilot study, and we therefore did not pursue further analysis using FluentPrompt. The verbalizer words and templates used for each task are listed in Table 8.

Model: We optimize prompts for GPT-2 large (774M parameters, Radford et al. 2019) using FluentPrompt. We use a batch size of 16 and train for 5,000 steps with an AdamW optimizer (Loshchilov and Hutter, 2018). We select the best prompt based on the validation performance. For our method FluentPrompt, we use a learning rate η ∈ {0.3, 1.0, 3.0, 10.0}, β_start = 1.0, β_end = 0.0001, and λ_fluency ∈ {0.003, 0.01, 0.03, 0.1, 0.3}. We search for both 5-token (M = 5) and 10-token (M = 10) prompts and use five random seeds for each hyperparameter setup. Additionally, we perform experiments with β_start = β_end = 0 (i.e., no progressive noise) and λ_fluency = 0 (i.e., no fluency constraint) as ablations of FluentPrompt for analysis.

                     SST-2                         Amazon                        AGNews
                     log(perplexity)  Accuracy     log(perplexity)  Accuracy     log(perplexity)  Accuracy
λ_fluency = 0        13.75 ± 1.81     87.55 ± 0.95 14.32 ± 1.31     75.31 ± 1.76 15.04 ± 3.30     74.56 ± 1.65
λ_fluency = 0.003    9.86 ± 2.41      88.86 ± 0.67 10.44 ± 2.09     86.37 ± 0.68 10.13 ± 1.13     76.43 ± 1.05

Table 2: Accuracy and perplexity of the prompts tuned with and without the readability constraint λ_fluency. For λ_fluency > 0, we report the best value (λ_fluency = 0.003) across 4 learning rates and 5 random seeds with M = 10. All t-tests of perplexity and accuracy show p ≤ 0.0001.

3.5 Results

Table 1 shows the accuracy and perplexity of the empty prompt (i.e., no ẽ), AutoPrompt_SGD, and FluentPrompt, along with example prompts for each method. We see that FluentPrompt performs comparably to AutoPrompt_SGD and significantly better than the empty prompt. In terms of readability, FluentPrompt generates more fluent prompts than AutoPrompt_SGD.

To further understand the contributions of FluentPrompt, we now investigate the effects of its two key modifications on top of AutoPrompt_SGD: the noise z in Langevin dynamics and the weight λ_fluency for prompt fluency.

Effect of λ_fluency: In Table 2 we show the performance with and without the fluency constraint (λ_fluency = 0.003 and λ_fluency = 0) and the log-perplexity of the discovered prompts. The fluency constraint effectively leads to significantly lower perplexity and also better accuracy (p ≤ 0.0001 in all t-tests).² Prompts with lower perplexity are desired for their potentially better readability for downstream analyses.

² On human-written prompts, Gonen et al. (2022) report a similar finding.

Effect of z: The progressive noise z helps find a diverse set of prompts while not compromising overall performance. In Table 3 we show the best and average prompt performance with and without the noise z (i.e., β > 0 and β = 0). We measure the diversity of prompts by Dist-1, a unigram distinctiveness metric (Li et al., 2016). We find that the prompts obtained with z (β > 0) are more diverse and overall have on-par performance with the setup without z (β = 0).

4 What makes good prompts?

In this section, we analyze common attributes of the effective tuned prompts. Specifically, we study the 10-token prompts found by FluentPrompt on SST-2, Amazon and AGNEWS.

4.1 Effective prompts calibrate the output distribution over label words

Language models are known to be biased towards label words that are common in their pretraining distribution (Holtzman et al., 2021; Zhao et al., 2021). In this section, we aim to investigate whether effective prompts found by prompt tuning implicitly adjust for this bias (calibration). To measure this bias, we follow Holtzman et al. (2021) and use a task-specific domain string d as the test input and compute the entropy of the labels. As listed in Table 4, the task-specific domain strings do not imply any label information. Therefore, we expect the output of the language model to be more uniform over the label words when only conditioned on the domain string. The entropy of the label words is computed as follows:

H(y) = E_{y∈Y}[− log p(y)] = − Σ_{y∈Y} p_θ(v(y) | ẽ, d, t) log p_θ(v(y) | ẽ, d, t)

The higher the entropy is, the more balanced (calibrated) the label word distribution is.

As listed in Table 1, some effective prompts found by FluentPrompt for sentiment
        SST-2                  Amazon                 AGNews
        Max   Mean  Dist-1     Max   Mean  Dist-1     Max   Mean  Dist-1
β = 0   90.2  86.5  72.6       87.7  85.1  57.9       82.6  71.6  81.7
β > 0   89.6  85.5  77.6       88.7  85.4  61.2       80.7  74.1  82.2

Table 3: Prompt performance and diversity with and without the progressive noise z (β > 0 and β = 0).
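Dist-1, the diversity metric reported in Table 3, is the fraction of unigram tokens that are distinct across a set of prompts (Li et al., 2016), here scaled to a percentage to match the table. A minimal sketch, assuming whitespace tokenization (the paper does not specify the tokenizer used for this metric):

```python
def dist1(prompts):
    """Distinct-1: percentage of unigram tokens that are unique (Li et al., 2016)."""
    tokens = [tok for p in prompts for tok in p.split()]
    return 100.0 * len(set(tokens)) / len(tokens)

print(dist1(["a good movie", "a bad movie"]))  # 4 distinct / 6 total tokens
```

A higher Dist-1 means the sampled prompt set repeats fewer tokens, which is how Table 3 quantifies the extra diversity contributed by the noise z.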

Figure 2 (panels (a), (b), (c)): Frequency of prompts (y-axis) at different entropy levels (x-axis). We compare effective prompts with the empty and human-written prompts.
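The label-word entropy H(y) from §4.1, which Figure 2 plots, is computed from the model's label-word probabilities; a small sketch, where the probability list is a stand-in for p_θ(v(y) | ẽ, d, t) and is renormalized over the label words only, matching the softmax over labels in §3.1:

```python
import math

def label_entropy(label_probs):
    """H(y) = -sum_y p(y) log p(y) over the (renormalized) label words."""
    total = sum(label_probs)
    probs = [p / total for p in label_probs]  # renormalize over label words only
    return -sum(p * math.log(p) for p in probs if p > 0)

print(label_entropy([0.5, 0.5]))  # maximal for 2 labels: log 2 ~ 0.693
print(label_entropy([0.9, 0.1]))  # skewed, hence lower: ~ 0.325
```

A prompt that pushes the model toward a uniform distribution over label words, given only the domain string, scores the maximal entropy log|Y|.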

analysis contain negative sentiment words (e.g., "disappointing" and "complained" in prompts for SST-2), which may implicitly reduce the probability of the positive label and calibrate the label word distribution. To validate this hypothesis, we filter a set of effective prompts by FluentPrompt and compute the entropy of the label predictions conditioned on the concatenation of the prompt and the task-specific domain string. Figure 2 shows the density plot comparing the label word entropy of effective prompts, along with empty and human-written prompts taken from Bach et al. (2022). We observe that the entropy of effective prompts has a higher mode than the entropy of empty and human-written prompts with lower accuracy.

To further explore the relation between task performance and calibration, we compute the correlation between the task accuracy and the label word entropy of all prompts obtained by our algorithm FluentPrompt and report Spearman's rank correlation. From Figure 3, we observe that the label word entropy exhibits significant positive correlations with the task accuracy (all p < 0.0001). The Spearman's coefficients are +0.61, +0.75 and +0.43 for SST-2, Amazon and AGNEWS, respectively.

Task     Domain String
SST-2    This is a movie review
Amazon   This is an Amazon product review
AGNEWS   This is a news

Table 4: Tasks and their task-specific domain strings.

4.2 Effective prompts are topically related to the task domain

Qualitative Analysis: As shown in Table 1, most of the effective prompts obtained by FluentPrompt contain domain-related words. For example, the prompt Kubrick, "The Shining in SST-2 contains movie director names and movie titles, relevant to the domain of movie reviews. Similarly, the prompts mascara\n\n and cigars\n\n found for Amazon contain product names relevant to the domain of product reviews. Additionally, AGNEWS is a news topic classification task. Some of the effective prompts in AGNEWS contain topic classification-related words such as "genre", while others contain URLs that link to websites such as netflix³ and pmwiki.⁴ The target pages of these URLs also contain topic classification-related information; for example, the prompt pmwiki/pmwiki.php/Main/Superhero links to a wiki page containing the following information: "Genre: Action Adventure Comedy Commercials".

Quantitative Analysis: Based on our qualitative analysis, we hypothesize that effective prompts are topically related to the task domain. To validate this hypothesis, we compare domain word frequency

³ www.netflix.com
⁴ www.pmwiki.org
Figure 3 (panels (a), (b), (c)): Correlation between task performance and label word entropy. Spearman rank correlation coefficients for SST-2, Amazon and AGNEWS are +0.61, +0.75 and +0.43. All p-values are less than 0.0001.
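Spearman's rank correlation, reported in Figure 3, is the Pearson correlation of the rank-transformed values. A self-contained sketch (in practice one would typically call `scipy.stats.spearmanr` instead):

```python
def _ranks(xs):
    """Assign 1-based ranks, averaging over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks of xs and ys."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A perfectly monotonic relation gives a coefficient of +1.
print(spearman([1.2, 2.5, 3.1, 4.8], [10, 20, 30, 40]))  # 1.0
```

Because it works on ranks, the statistic captures any monotonic relation between accuracy and entropy, not just a linear one, which suits the analysis here.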

in effective prompts and random sentences. First, Task Domain Words


we select a set of domain words for each task (see SST-2 movie, film, cinima, director, positive, negative
Table 5), which consist of the task label words Amazon book, amazon, product, furniture, positive, neg-
ative
(e.g., “positive” and “negative” for SST-2) and com- AGNEWS topic, category, politics, sports, business, tech-
mon words in the task domain (e.g., “movie” and nology
“film” for the movie domain of SST-2). Since our
prompts are very short (10 tokens), we augment Table 5: Domain words for each task.
each prompt with its continuation generated by
GPT-3 (Brown et al., 2020), based on the assump- SST-2 Amazon AGNEWS
tion that the continuation by the large LM follows Acc. Freq. Acc. Freq. Acc. Freq.
the same domain as the prompt. For each prompt, Effective 89.4 23.4 86.5 5.8 77.6 3.7
we sample 5 distinct continuations from GPT-3 us- Random 67.2 1.3 74.2 2.2 49.3 0.8
ing nucleus sampling p = 0.95 at a length of 100
tokens. We compare the top 10 effective prompts Table 6: Average domain words frequency (Freq.)
and average accuracy (Acc.) for effective and ran-
with 10 random sentences from PILE (Gao et al.,
dom prompts. Effective prompts and their continua-
2020) augmented by the same continuations. We tion contain subsantially more domain words than ran-
then count the domain words in the concatenation dom prompts. The p-values from the paired t-test for
of the prompt and its continuation. SST-2, Amazon, and AGNEWS were 0.004, 0.003, and
Table 6 lists the average accuracy of and num- 0.0002, respectively.
ber of domain words in the effective and random
prompts with the continuations. The accuracy of truth labels to compute. Therefore, in this section,
effective prompts is higher than that of random sen- we extend F LUENT P ROMPT to explicitly tune the
tences on all three datasets. Moreover, the domain prompts towards better calibration and domain rel-
words frequency of effective prompts is signifi- evance, without using the task labels.
cantly higher than that of random sentences with
p-values of 0.004, 0.003, and 0.0002 for SST-2, 5.1 Method
Amazon, and AGNEWS, respectively. Both our Calibration loss In Section 4.1 we find a strong
qualitative and quantitative analysis provide strong positive correlation between the degree of calibra-
evidence that effective prompts obtained by our tion (i.e., entropy) and performance of the prompts.
prompt tuning are topically related to the task’s We therefore explicitly optimize the prompt to-
domain. wards greater calibration, with an (negative) en-
tropy loss defined below.
5 U NSUPERVISED F LUENT P ROMPT
Lentropy (ẽ) = Ey∈Y [log Ex∈X pθ (v(y) | ẽ, x, t)]
Our findings in Section 4 show the effective tuned
prompts do calibration and have a high domain Intuitively the entropy loss encourages the prompt
relevance to the task. These two attributes are to help model generate more balanced predictions
both highly predictive and do not require ground- at a group level.
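A sketch of this group-level calibration loss, with toy per-example label probabilities standing in for p_θ(v(y) | ẽ, x, t). More balanced average predictions yield a higher (less negative) value, which the optimization rewards since the loss enters the energy function with a negative sign:

```python
import math

def entropy_loss(label_probs_per_example):
    """L_entropy = E_y[ log E_x p(v(y) | e, x, t) ]  (sketch of Sec. 5.1)."""
    n = len(label_probs_per_example)
    num_labels = len(label_probs_per_example[0])
    # Average each label's probability over the batch of examples, then log.
    avg = [sum(ex[y] for ex in label_probs_per_example) / n
           for y in range(num_labels)]
    return sum(math.log(p) for p in avg) / num_labels

balanced = entropy_loss([[0.8, 0.2], [0.2, 0.8]])  # batch averages to (0.5, 0.5)
skewed = entropy_loss([[0.9, 0.1], [0.8, 0.2]])    # batch averages to (0.85, 0.15)
print(balanced > skewed)  # True: balanced group-level predictions score higher
```

Note that the expectation over examples sits inside the log, so the loss rewards balance of the batch-averaged prediction rather than forcing every individual example toward a uniform output.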
Domain relevance loss: In Section 4.2 we find that effective prompts overall are more related to the task domain, as defined by augmented data and keyword matches. To explicitly incorporate domain relevance into the prompts, we extend the existing fluency (perplexity) loss in Section 3.3, modeling the perplexity of both the prepended prompt and the input example:

L_domain(ẽ) = − log p_θ(ẽ_{0:M}) − Σ_i log p_θ(x_i | ẽ, x_{<i}) − Σ_j log p_θ(t_j | ẽ, x, t_{<j})

Intuitively, log p_θ(x | ẽ) − log p_θ(x) would measure the pointwise mutual information between the task data x and the tuned prompt ẽ, with the part log p_θ(x) not involved in the prompt optimization. Overall, our unsupervised energy function E is updated to:

E(ẽ_{0:M}) = − λ_calibration L_entropy(ẽ) − λ_domain L_domain(ẽ)

where λ_calibration + λ_domain = 1.

Hyperparameters: Inheriting the notation of FluentPrompt, we consider the following hyperparameters: η ∈ {1.0, 3.0}, β_start = 1.0, β_end = 0.0001, λ_domain ∈ {0, 0.0003, 0.001, 0.003, 0.01, 0.05, 0.2, 0.5}, and M = 10. We use five random seeds for each setup and report the average performance.

5.2 Results

In Table 7, we compare the performance of our proposed method, UnsupervisedFluentPrompt, with two other unsupervised methods, the empty prompt and PMI calibration PMI_DC (Holtzman et al., 2021), on three datasets. Our results show that UnsupervisedFluentPrompt consistently outperforms PMI_DC with an average improvement of 7.0% across the datasets. This demonstrates that the incorporated calibration and domain information are helpful for finding effective prompts.

               SST-2   Amazon   AGNEWS
Unsupervised
  Empty        66.5    75.8     49.7
  PMI_DC       85.6    76.2     64.1
  UNSUP. F.P.  88.2    85.3     68.0

Table 7: Accuracy of different unsupervised prompting methods on the three datasets. UNSUP. F.P. refers to our UnsupervisedFluentPrompt.

6 Conclusion

In this paper, we investigate the factors that contribute to the effectiveness of prompts. To facilitate this study, we develop a human-readable prompt tuning method, FluentPrompt, and apply it to the GPT-2 large model to generate effective and readable prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of label words.

Although the prompts generated by FluentPrompt are effective and readable, they still carry limited semantic meaning. For instance, we did not find any prompts directly indicating the task definition or instructions. One potential reason is that the GPT-2 large model is not instruction-tuned. Future work can apply FluentPrompt to an instruction-tuned model to see whether instruction-like prompts can be discovered.

References

Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. 2022. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. 2022. On the relation between sensitivity and accuracy in in-context learning. arXiv preprint arXiv:2209.07661.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P Xing, and Zhiting Hu. 2022. RLPrompt: Optimizing discrete text prompts with reinforcement learning. arXiv preprint arXiv:2205.12548.

Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Hai-Tao Zheng, and Maosong Sun. 2021. OpenPrompt: An open-source framework for prompt-learning. CoRR, abs/2111.01998.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, and Luke Zettlemoyer. 2022. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn't always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2021. Prompt waywardness: The curious case of discretized interpretation of continuous prompts.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.

Sachin Kumar, Biswajit Paria, and Yulia Tsvetkov. 2022. Gradient-based constrained sampling from language models. arXiv preprint arXiv:2205.12558.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B. Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. arXiv preprint arXiv:2103.10385.

Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. 2022. GrIPS: Gradient-free, edit-based instruction search for prompting large language models. arXiv preprint arXiv:2203.07281.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, and Daniel Khashabi. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks.

Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.
A Verbalizer and templates

Table 8 shows an example input, the template, and the verbalizer used for each task.

Task      Template (with example input)                                                 Verbalizer
SST-2     Illuminating if overly talky documentary. It was                              positive, negative
Amazon    Terrible service. It was                                                      positive, negative
AGNEWS    Economic growth in Japan slows down as the country experiences. It is about   politics, sports, business, technology

Table 8: The template, example (colored black) and verbalizer used for each dataset.
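The template-and-verbalizer scheme in Table 8 can be sketched in code. The scorer below is a hypothetical stand-in for a language model's next-token log-probability (the paper scores verbalizer words with GPT-2 large); only the wrap-template-then-argmax logic is what the table describes.

```python
# Minimal sketch of verbalizer-based zero-shot classification, following
# the templates in Table 8. `score` is a stand-in for an LM's log-probability
# of a verbalizer word continuing the templated input.

def classify(text, template, verbalizer, score):
    """Wrap `text` in `template`, then return the verbalizer label
    whose word the scorer ranks highest as the continuation."""
    prompt = template.format(text)
    return max(verbalizer, key=lambda label: score(prompt, label))

# Toy scorer for illustration only (NOT a real LM): prefers "positive"
# when the prompt contains the word "good".
def toy_score(prompt, label):
    if label == "positive":
        return 1.0 if "good" in prompt else -1.0
    return 0.0

pred = classify("Kubrick's The Shining is a good movie.",
                "{} It was", ["positive", "negative"], toy_score)
```

In a real setup, `toy_score` would be replaced by the model's conditional probability of the label word; the rest of the logic is unchanged.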
