FluentPrompt
Abstract
Large language models can perform new tasks
in a zero-shot fashion, given natural language
prompts that specify the desired behavior.
pθ(ẽm | ẽ0:m) = exp(hθ,m−1 · ẽm) / Σ_{ev∈E} exp(hθ,m−1 · ev)

where we equivalently compute the logits for each embedding's corresponding vocabulary item and take the softmax.¹ Subsequently, the sequence probability is pθ(ẽ0:M) = ∏_{m=1}^{M−1} pθ(ẽm | ẽ0:m).

We define a prompt fluency loss as the negative log-likelihood of the prompt embeddings, −log pθ(ẽ0:M). Along with the task labeling loss (§3.2), we modify our energy function as:

E(ẽ0:M) = −λtask log pθ(v(y) | ẽ0:M, x, t) − λfluency log pθ(ẽ0:M)

where λtask + λfluency = 1. Throughout the FluentPrompt tuning procedure, the language model parameters θ are fixed while the embeddings ẽ0:M are tuned.

¹This is equivalent to computing the logits, since ev and the projected ẽm from the last optimization step are both in the embedding table.
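To make the two terms of the energy concrete, the following is a minimal PyTorch sketch assuming a Hugging Face GPT-2 model; the tensor names (prompt_embeds, input_embeds, label_token_id) and the nearest-neighbor projection of prompt embeddings onto the embedding table are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of E(e_0:M) = -lambda_task * log p(v(y) | e, x, t)
#                              -lambda_fluency * log p(e_0:M),
# assuming a Hugging Face GPT-2 model. Tensor names and the nearest-neighbor
# projection of prompt embeddings onto the embedding table are assumptions.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()
for p in model.parameters():          # theta stays fixed; only the prompt is tuned
    p.requires_grad_(False)
emb_table = model.get_input_embeddings().weight     # E: |V| x d embedding table

def energy(prompt_embeds, input_embeds, label_token_id, lam_fluency=0.003):
    lam_task = 1.0 - lam_fluency      # lambda_task + lambda_fluency = 1

    # Task labeling loss: -log p(v(y) | prompt, input-with-template).
    seq = torch.cat([prompt_embeds, input_embeds], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=seq).logits[0, -1]           # next-token logits
    task_nll = -F.log_softmax(logits, dim=-1)[label_token_id]

    # Fluency loss: -log p(e_0:M), scoring each hidden state h_{m-1} against
    # the vocabulary embeddings e_v in the embedding table via a softmax.
    out = model(inputs_embeds=prompt_embeds.unsqueeze(0), output_hidden_states=True)
    h = out.hidden_states[-1][0]                              # M x d
    vocab_logits = h[:-1] @ emb_table.T                       # (M-1) x |V|
    token_ids = torch.cdist(prompt_embeds, emb_table).argmin(dim=-1)
    fluency_nll = F.cross_entropy(vocab_logits, token_ids[1:], reduction="sum")

    return lam_task * task_nll + lam_fluency * fluency_nll
```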
3.4 Experimental Setup

Target tasks  We evaluate performance on two sentiment analysis tasks, Amazon Polarity (McAuley and Leskovec, 2013) and SST-2 (Socher et al., 2013), and one topic classification task, AGNEWS (Zhang et al., 2015). These tasks were selected because vanilla soft prompt tuning (Lester et al., 2021) substantially improves model performance on them. In contrast, tasks like RTE (Dagan et al., 2005) are more difficult; soft prompt tuning did not yield a significant improvement in our pilot study (57.4% accuracy from prompt tuning compared with 52.1% from random guessing), and we therefore did not pursue further analysis with FluentPrompt. The verbalizer words and templates used for each task are listed in Table 8.
                     SST-2                           Amazon                          AGNews
                     log(perplexity)  Accuracy       log(perplexity)  Accuracy       log(perplexity)  Accuracy
λfluency = 0         13.75 ± 1.81     87.55 ± 0.95   14.32 ± 1.31     75.31 ± 1.76   15.04 ± 3.30     74.56 ± 1.65
λfluency = 0.003      9.86 ± 2.41     88.86 ± 0.67   10.44 ± 2.09     86.37 ± 0.68   10.13 ± 1.13     76.43 ± 1.05

Table 2: Accuracy and perplexity of the prompts tuned with and without the readability constraint λfluency. For λfluency > 0, we report the best value (λfluency = 0.003) across 4 learning rates and 5 random seeds with M = 10. All t-tests of perplexity and accuracy show p ≤ 0.0001.
Model  We optimize prompts for GPT-2 large (774M parameters; Radford et al., 2019) using FluentPrompt. We use a batch size of 16 and train for 5,000 steps with the AdamW optimizer (Loshchilov and Hutter, 2018), selecting the best prompt based on validation performance. For FluentPrompt, we use a learning rate η ∈ {0.3, 1.0, 3.0, 10.0}, βstart = 1.0, βend = 0.0001, and λfluency ∈ {0.003, 0.01, 0.03, 0.1, 0.3}. We search for both 5-token (M = 5) and 10-token (M = 10) prompts and use five random seeds for each hyperparameter setting. Additionally, as ablations of FluentPrompt for analysis, we run experiments with βstart = βend = 0 (i.e., no progressive noise) and with λfluency = 0 (i.e., no fluency constraint).
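The progressive noise mentioned above enters through Langevin-dynamics updates. Below is a hedged sketch of one standard form of such an update with an exponentially annealed noise scale; only the endpoints βstart = 1.0 and βend = 0.0001 and the 5,000-step budget come from the setup above, while the exact update rule and decay schedule are assumptions.

```python
# Hedged sketch of a Langevin-style update with annealed noise; the exact update
# rule and the exponential beta schedule are assumptions, not the paper's code.
import torch

def langevin_step(prompt_embeds, grad, eta, beta):
    """e <- e - eta * grad_E(e) + sqrt(2 * eta * beta) * z,  with z ~ N(0, I)."""
    noise = torch.randn_like(prompt_embeds)
    return prompt_embeds - eta * grad + (2.0 * eta * beta) ** 0.5 * noise

def beta_schedule(step, num_steps=5000, beta_start=1.0, beta_end=0.0001):
    """Anneal the noise scale exponentially from beta_start down to beta_end."""
    return beta_start * (beta_end / beta_start) ** (step / max(num_steps - 1, 1))
```

Setting beta_start = beta_end = 0 in this sketch recovers the no-noise ablation described above.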
3.5 Results
Table 1 shows the accuracy and perplexity of the empty prompt (i.e., no ẽ), AutoPromptSGD, and FluentPrompt, along with example prompts for each method. We see that FluentPrompt performs comparably to AutoPromptSGD and significantly better than the empty prompt. In terms of readability, FluentPrompt generates more fluent prompts than AutoPromptSGD.

To further understand the contributions of FluentPrompt, we now investigate the effects of its two key modifications on top of AutoPromptSGD: the noise z in Langevin dynamics and the weight λfluency for prompt fluency.

Effect of λfluency  In Table 2 we show the performance with and without the fluency constraint (λfluency = 0.003 and λfluency = 0), along with the log-perplexity of the discovered prompts. The fluency constraint leads to significantly lower perplexity and also better accuracy (p ≤ 0.0001 in all t-tests).² Prompts with lower perplexity are desirable for their potentially better readability in downstream analyses.

²On human-written prompts, Gonen et al. (2022) report a similar finding.

Effect of z  The progressive noise z helps find a diverse set of prompts without compromising overall performance. In Table 3 we show the best and average prompt performance with and without the noise z (i.e., β > 0 and β = 0). We measure the diversity of prompts by Dist-1, a unigram distinctiveness metric (Li et al., 2016). We find that the prompts obtained with z (β > 0) are more diverse and perform on par with the setup without z (β = 0).
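For reference, Dist-1 is the ratio of distinct unigrams to total unigrams. The sketch below pools tokens over a set of prompts and uses whitespace tokenization; both choices are assumptions about how the metric is applied here.

```python
# Dist-1 (Li et al., 2016): number of distinct unigrams divided by the total
# number of unigrams. Pooling across prompts and whitespace tokenization are
# assumptions about how the metric is applied in this setting.
def dist1(prompts):
    tokens = [tok for prompt in prompts for tok in prompt.split()]
    return len(set(tokens)) / max(len(tokens), 1)

# A more repetitive prompt set yields a lower score.
print(dist1(["a great movie review", "a terrible movie review"]))  # 0.625
```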
        SST-2                      Amazon                     AGNews
        Max     Mean    Dist-1     Max     Mean    Dist-1     Max     Mean    Dist-1
β = 0   90.2    86.5    72.6       87.7    85.1    57.9       82.6    71.6    81.7
β > 0   89.6    85.5    77.6       88.7    85.4    61.2       80.7    74.1    82.2

Table 3: Prompt performance and diversity with and without the progressive noise z (β > 0 and β = 0).
4 What makes good prompts?

In this section, we analyze common attributes of the effective tuned prompts. Specifically, we study the 10-token prompts found by FluentPrompt on SST-2, Amazon and AGNEWS.

4.1 Effective prompts calibrate the output distribution over label words

Language models are known to be biased towards label words that are common in their pretraining distribution (Holtzman et al., 2021; Zhao et al., 2021). In this section, we investigate whether effective prompts found by prompt tuning implicitly adjust for this bias (calibration). To measure the bias, we follow Holtzman et al. (2021) and use a task-specific domain string d as the test input, computing the entropy of the labels. As listed in Table 4, the task-specific domain strings do not imply any label information. Therefore, we expect the output of the language model to be more uniform over the label words when conditioned only on the domain string. The entropy of the label words is computed as follows:

H(y) = E_{y∈Y}[−log p(y)] = −Σ_{y∈Y} pθ(v(y) | ẽ, d, t) log pθ(v(y) | ẽ, d, t)

The higher the entropy, the more balanced (calibrated) the distribution over label words.
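This entropy is straightforward to compute once the label-word probabilities are in hand; a minimal sketch follows, where label_probs is an assumed, already-renormalized distribution pθ(v(y) | ẽ, d, t) over the verbalizer tokens.

```python
import torch

def label_entropy(label_probs: torch.Tensor) -> torch.Tensor:
    """H(y) = -sum_y p(y) * log p(y), computed over the label words only."""
    p = label_probs.clamp_min(1e-12)   # guard against log(0)
    return -(p * p.log()).sum()

# A perfectly calibrated binary task reaches the maximum entropy log 2 ≈ 0.693.
print(label_entropy(torch.tensor([0.5, 0.5])))
```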
Table 4: Tasks and their task-specific domain strings.

As listed in Table 1, some effective prompts found by FluentPrompt for sentiment analysis contain negative sentiment words (e.g., “disappointing” and “complained” in prompts for SST-2), which may implicitly reduce the probability of the positive label and calibrate the label word distribution. To validate this hypothesis, we filter a set of effective prompts found by FluentPrompt and compute the entropy of the label predictions conditioned on the concatenation of the prompt and the task-specific domain string. Figure 2 shows the density plot comparing the label word entropy of effective prompts with that of the empty and human-written prompts taken from Bach et al. (2022). We observe that the entropy of effective prompts has a higher mode than the entropy of the empty and human-written prompts, which also achieve lower accuracy.

Figure 2: Frequency of prompts (y-axis) at different entropy levels (x-axis). We compare effective prompts with the empty and human-written prompts.

To further explore the relation between task performance and calibration, we compute the correlation between task accuracy and the label word entropy of all prompts obtained by our algorithm FluentPrompt and report Spearman's rank correlation. From Figure 3, we observe that the label word entropy exhibits significant positive correlations with task accuracy (all p < 0.0001). The Spearman's coefficients are +0.61, +0.75 and +0.43 for SST-2, Amazon and AGNEWS, respectively.

Figure 3: Correlation between task performance and label word entropy. Spearman rank correlation coefficients for SST-2, Amazon and AGNEWS are +0.61, +0.75 and +0.43. All p-values are less than 0.0001.
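The rank correlation reported above can be computed with SciPy; in the sketch below, accuracies and entropies are illustrative placeholders standing in for the per-prompt accuracy and label-word entropy values.

```python
# Spearman's rank correlation between per-prompt accuracy and label-word entropy.
# The lists here are illustrative placeholders, not values from the paper.
from scipy.stats import spearmanr

accuracies = [0.88, 0.85, 0.79, 0.91, 0.76]
entropies  = [0.69, 0.62, 0.41, 0.68, 0.35]

rho, p_value = spearmanr(entropies, accuracies)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.4f}")
```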
Qualitative Analysis  As shown in Table 1, most of the effective prompts obtained by FluentPrompt contain domain-related words. For example, the prompt Kubrick, "The Shining in SST-2 contains a movie director name and a movie title, relevant to the domain of movie reviews. Similarly, the prompts mascara\n\n and cigars\n\n found for Amazon contain product names relevant to the domain of product reviews. Additionally, AGNEWS is a news topic classification task. Some of the effective prompts in AGNEWS contain topic classification-related words such as “genre”, while others contain URLs that link to websites such as netflix³ and pmwiki.⁴ The target pages of these URLs also contain topic classification-related information; for example, the prompt pmwiki/pmwiki.php/Main/Superhero links to a wiki page containing the following information: “Genre: Action Adventure Comedy Commercials”.

³www.netflix.com
⁴www.pmwiki.org

Quantitative Analysis  Based on our qualitative analysis, we hypothesize that effective prompts are topically related to the task domain. To validate this hypothesis, we compare domain word frequency

Table 8: The template, example (colored black) and verbalizer used for each dataset.