An Information-Theoretic Approach To Prompt Engineering Without Ground Truth Labels

* Equal contribution.

...prompt engineering seeks to align these models to specific tasks. Unfortunately, existing prompt engineering methods require significant amounts of labeled data, access to model parameters, or both. We introduce a new method for selecting prompt templates without labeled examples and without direct access to the model. Specifically, over a set of candidate templates, we choose the template that maximizes the mutual information between the input and the corresponding model output. Across 8 datasets representing 7 distinct NLP tasks, we show that when a template has high mutual information, it also has high accuracy on the task. On the largest model, selecting prompts with our method gets 90% of the way from the average prompt accuracy to the best prompt accuracy and requires no ground truth labels.

Figure 1: Performance of the template selected by our maximum mutual information method (MI) compared to the worst, mean, median, and best prompt on GPT-3 Davinci (175B), shown per dataset (SQuAD, LAMBADA, ROCStories, CoQA, IMDB, BoolQ, COPA, WiC). Our method performs at almost oracle levels, without labels or access to model weights.
1 Introduction

It is well-known that large pre-trained language models (LMs) learn substantial linguistic (Liu et al., 2019; Amrami and Goldberg, 2018) and factual world knowledge (Petroni et al., 2020; Bosselut et al.; Bouraoui et al.; Zuo et al., 2018), achieving state-of-the-art performance on classic NLP tasks like closed-book question-answering, sentiment analysis, and many other tasks (Radford et al., 2019; Devlin et al., 2019; Raffel et al., 2019). The largest models can do this in a few-shot way: that is, being trained only with generic, semi-supervised objectives and "taught" tasks with just instructions and a few examples of the task provided via a natural language "prompt" in the context window (Brown et al., 2020). This suggests that pre-training equips them to potentially do many tasks that can be formulated as natural language generation, if only they can be primed in the right way.

Such priming is not a trivial task. The few-shot learning breakthrough can give the impression that if the LM is given a sensible prompt, it will "understand" what is meant and perform well on the task if it has the capacity. However, LMs can generate substantially different output distributions, and thus text, given two distinct prompts that appear semantically invariant (e.g., alternative orderings, lexical changes like capitalization, and general rephrasing (Zhao et al., 2021; Lu et al., 2021)). This can lead to surprisingly high variance in performance from prompt to prompt. Clearly, some prompts are better than others for aligning a model to a task.

Prompt engineering is a nascent field that aims to find aligning prompts (Reynolds and McDonell, 2021). While "prompt" refers to any language passed to the model via the context window, a template refers to a natural language scaffolding filled in with raw data, resulting in a prompt. Thus, prompt engineering includes finding high-quality templates (i.e., those with high test accuracy).
Generally, this is done by optimizing for accuracy over a validation set: a template is chosen from a candidate set based on its performance on labeled examples. Such labeled examples can be challenging to procure for some tasks and impossible for others. Some recent methods optimize prompts using backpropagation, which requires access to model weights. In this paper, we propose a new method for selecting prompts by using mutual information, which allows prediction of a prompt's performance without labels or access to model parameters.

Mutual information (MI) is a metric that quantifies the shared information between two random variables (see Section 3.2). We demonstrate that the mutual information between a prompt and a language model's output can serve as a useful surrogate for the test accuracy of a template. Specifically, for eight popular datasets representing seven classic NLP tasks, we generate a diverse set of 20 templates for each and show that template mutual information and template accuracy are highly correlated. These results are strongest on the largest models we study, for which our method chooses prompts that, on average, get 90% of the way from mean accuracy to maximum accuracy and even selects the best prompt on three of eight datasets. This suggests that, across a variety of NLP tasks, mutual information can be used to select one of the best prompts from a set of candidate prompts, even without making use of model weights or ground truth labels. In the following pages, we outline each step of our general method for generating and evaluating templates so that it can easily be ported to any other task. Code is available online.[1]

[1] github.com/BYU-PCCL/information-theoretic-prompts

2 Related Work

The promise of language models and the challenge of aligning them has given rise to the field of prompt engineering, which seeks to construct the best prompt given a task and a language model (Liu et al., 2021a). The best performance on prompt engineering is often achieved using backpropagation in continuous prompt embedding space (Lester et al., 2021; Li and Liang, 2021; Gu et al., 2021; Liu et al., 2021b; Zhang et al., 2021), in contrast to generating a discrete set of prompts by hand and testing them. While optimizing in continuous prompt space via backprop allows for performance similar to model-tuning (at least at higher model sizes) (Lester et al., 2021), not all models are publicly available. Thus, these methods are only feasible for those who have direct access to the model and can perform backprop on it. Prompts optimized in continuous space are also not interpretable in natural language, making it harder to transfer insights from prompts that work well for one task to another task. Additionally, these methods require labeled examples, while ours does not. Other selection protocols not based on gradient descent include cross-validation and minimum description length, as in (Perez et al., 2021). These methods yield prompts that perform only marginally better than average in terms of test accuracy.

Mutual information has been used in n-gram clustering, part-of-speech tagging, probing classifiers, and LM training objective reframing (Brown et al., 1992; Stratos, 2019; Voita and Titov, 2020; Kong et al., 2019). Ours is the first work of which we are aware to apply MI to prompt engineering. Lu et al. (2021) make use of entropy statistics to determine performant orderings for few-shot examples in prompts. Our work is focused on selecting high-quality templates, with no special focus on example ordering or need for multiple examples to order (the few-shot case). Our method uses no artificial "probing set," making our prompt selection much cheaper, and we also explore open-ended tasks. While the GlobalE and LocalE statistics they use are similar (and in the case of LocalE identical) to the two parts of our MI calculation (see Section 3.2), we use the two statistics jointly and choose prompts by minimizing, rather than maximizing, LocalE.

3 Methods

At the most abstract, our method is as follows (see Appendix A for a more thorough description):

1. Generate a set of K prompt templatizing functions (an illustrative sketch follows this list).
2. Try a couple of examples in a playground to ensure that templates give roughly the expected output.
3. Estimate mutual information for each template given a set of inputs x_1, x_2, ..., x_N where x_i ~ X for all i.
4. Choose template(s) based on mutual information and perform inference.
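As a concrete illustration of step 1, the minimal sketch below shows what two templatizing functions might look like for a multiple-choice instance similar to the CommonsenseQA example in Figure 2. The function and field names are our own illustrative choices, not identifiers from the released code.

```python
# Sketch of step 1: two hand-written "templatizing functions" f_theta.
# Each maps the same raw multiple-choice instance to a different natural language prompt.

def template_first_person(instance):
    """Phrase the question as a first-person completion ('... I would say')."""
    choices = ", ".join(f"'{c}'" for c in instance["choices"])
    return (
        f"If asked the question '{instance['question']}', "
        f"and given the choices {choices}, I would say"
    )

def template_answer_key(instance):
    """Phrase the question as a quiz answer key with lettered options."""
    lines = [f"{label}: {c}" for label, c in zip("ABCDE", instance["choices"])]
    return (
        "Quiz Answer Key\n"
        f"Question: {instance['question']}\n" + "\n".join(lines) + "\nCorrect Answer:"
    )

if __name__ == "__main__":
    example = {
        "question": "In a predicament, an animal might choose flight or what?",
        "choices": ["leave home", "hunt for food", "smell prey", "feel pain", "fight for life"],
    }
    for f_theta in (template_first_person, template_answer_key):
        print(f_theta(example))
        print("---")
```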
We find it useful to unify all the tasks we study within a single framework, which we describe in Section 3.1. We also justify our use of mutual information as a surrogate for prompt quality and specify how we estimate it in Section 3.2.

Figure 2: We choose θ ∈ {θ_i}, i = 1, ..., K, and templatize a sampled instance from the dataset X. We pass this prompt through the language model via g_φ, yielding a probability distribution over the model's tokens T_φ. The collapsing function c_θ sums the weight given to each token corresponding to each possible answer y ∈ Y and normalizes, giving a probability distribution P(Y | x_i), which we can use to estimate mutual information or obtain a guess for y_i. [The figure walks a CommonsenseQA question ("In a predicament, an animal might choose flight or what?") through several candidate templates, showing the raw token distribution and the collapsed, normalized answer distribution.]
3.1 Task Definition

In order to demonstrate our method's widespread applicability and general effectiveness, we validate it across many datasets and tasks. This requires us to estimate MI and accuracy, and this is most straightforward in the case where, given a context, a language model produces just one probability distribution P(t_n | context = t_1, t_2, ..., t_{n-1}). This is in contrast to other experimental setups that use multi-token sampling methods (e.g., beam search), although our method is easily tractable in such setups.[2] Any NLP task is tractable in this framework so long as the output space consists of a set of options that each start with a unique token. In this case, the language model can "give" an answer by assigning probability to tokens that begin each of these answers (invariant to lexical variation like capitalization and leading/trailing spaces). While, for open-ended tasks, this method might artificially inflate accuracy if the model starts to give a wrong answer that happens to start with the same token as the correct one, we find that this difference is small and does not affect our results.[3] Irrelevant tokens (with which none of the desired answers begin) are ignored, and the resulting collapsed probabilities are normalized. We term this approach One-token Response (OTR). Although our method isn't limited to OTR tasks, we choose tasks that can be cast as OTR tasks for simplicity and to reduce computational expense. Many NLP tasks fit within this framework, although a few do not (e.g., machine translation and summarization). This basic approach is in common use (Brown et al., 2020), but we formalize it for clarity below.

[2] The only difference: for each considered answer, simply calculate its unnormalized probability by multiplying the probabilities of the decisions taken at each branch in the sequence of tokens, then normalize the resulting probability scores.

[3] Our open-ended datasets are SQuAD, LAMBADA, and ROCStories, and none of these seemed more likely than ROCStories to exhibit this issue. We reran our experiment on ROCStories by sampling with temperature 0 until reaching a space, and only counted responses as accurate if they exactly matched the corresponding ground truth labels. Results were virtually unchanged: accuracy decreased by only 0.03 on average, and the correlation between mutual information and test accuracy increased by 0.04, from 0.68 to 0.72.
Generally, the OTR framework casts a natural language task as a classification problem with raw data input x_i ∈ X and output P(Y | x_i), a probability distribution over targets. In order to use a language model φ for this task, a templatizing function f_θ : X → L is needed to map raw data into natural language prompts. g_φ : L → T_φ maps prompts to a probability distribution over T_φ, the token set represented by the model tokenizer. Finally, a collapsing function c_θ : T_φ → P(Y | x, θ, φ) (see Appendix A) yields an estimate of P(Y | X):

    P(Y | x, θ, φ) = c_θ(g_φ(f_θ(x))),  x ∈ X    (1)

We also refer to P(Y | x, θ, φ) as P(Y | f_θ(x)).

The above pipeline can be specified in many ways using different θ and φ (see Figure 2), which will result in different accuracies. Our ultimate aim is to select the best θ given φ. Whereas past prompt engineering methods rely on scores calculated by comparing model answers and ground truth, our method selects θ by maximizing mutual information, which requires no ground truth labels.
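The collapsing step of Equation 1 can be sketched in a few lines. The code below is a minimal illustration of the OTR idea (sum the probability mass on first tokens that begin each answer option, ignore irrelevant tokens, normalize), assuming the model's next-token distribution is available as a token-to-probability mapping; the helper names and the fallback behavior are our own assumptions, not the released implementation.

```python
import numpy as np

def collapse_token_distribution(token_probs, answer_options):
    """Sketch of the collapsing function c_theta from Equation 1.

    token_probs: dict mapping token strings to next-token probabilities, i.e. g_phi(f_theta(x)).
    answer_options: list of answer strings in Y.
    Returns P(Y|x) as a normalized numpy array, one entry per answer option.
    """
    def first_token_variants(answer):
        # First word of the answer, invariant to case and leading spaces.
        first = answer.strip().split()[0]
        forms = {first, first.lower(), first.capitalize()}
        return forms | {" " + f for f in forms}

    weights = np.zeros(len(answer_options))
    for i, answer in enumerate(answer_options):
        variants = first_token_variants(answer)
        # Sum probability on tokens that begin this answer; all other tokens are ignored.
        weights[i] = sum(p for tok, p in token_probs.items() if tok in variants)
    if weights.sum() == 0:
        # No mass on any option: fall back to uniform so P(Y|x) stays well defined (assumption).
        return np.full(len(answer_options), 1.0 / len(answer_options))
    return weights / weights.sum()

# Toy example with a hypothetical token distribution in the style of Figure 2:
probs = {" fight": 0.2791, " Fight": 0.0648, " hunt": 0.0556,
         " feel": 0.0488, " flee": 0.0088, " leave": 0.0086, " smell": 0.0063}
options = ["leave home", "hunt for food", "smell prey", "feel pain", "fight for life"]
print(collapse_token_distribution(probs, options))
```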
3.2 Mutual Information

Mutual information is a measure of the amount of shared information between two random variables (Cover and Thomas, 2006); in other words, it is the reduction in entropy that is observed in one random variable when the other random variable is known.

We expect MI to serve as a good criterion for comparing prompts. Previous work has shown that large networks trained with cross-entropy loss are calibrated (e.g., a 60% confidence corresponds to a 60% chance of the model being correct) when in the early-stopped (~1 epoch) regime (Ji et al., 2021), but become miscalibrated in the overfit regime (Nakkiran and Bansal, 2020). According to (Brown et al., 2020), GPT-3 was trained for a different number of epochs on each corpus in its training data. We calculate it was trained for an average of 1.57 epochs, so we have reason to believe that GPT-3 is generally well-calibrated. Thus, we postulate that a prompt that elicits a very confident response (high MI) from the language model is more likely than a less confident prompt to score well.

We denote the mutual information between random variables X and Y as I(X; Y) and the entropy of X as H(X) = -∫_{x ∈ X} P(x) log(P(x)) dx. The mutual information between X and Y is defined as D_KL(P_{(X,Y)} || P_X ⊗ P_Y), and can be rewritten as H(Y) - H(Y | X) (the reduction in entropy in Y given knowledge of X).

Using the OTR framework, we fix a model φ and generate a diverse set of K prompt templatizing functions f_{θ_1}, f_{θ_2}, ..., f_{θ_K} along with their corresponding collapsing functions c_{θ_k} (see Appendix A). Treating f_θ(X) := {f_θ(x), x ∈ X} as a random variable, we can calculate I(f_θ(X); Y) and use it as a criterion for selecting prompt templatizing functions with which to do inference.

We hypothesize that a θ_i with higher mutual information will align a language model to a task better than a θ_j with lower mutual information. Formally, we select θ̂ = argmax_θ {I(f_θ(X); Y)}. Mutual information is estimated as

    I(f_θ(X); Y) = H(Y) - H(Y | f_θ(X))    (2)

where each term is estimated in expectation using draws x_i ~ X and Equation 1 as follows:

    H(Y) ≈ H( (1/N) Σ_{i=1..N} P(Y | f_θ(x_i)) )    (3)

    H(Y | f_θ(X)) ≈ (1/N) Σ_{i=1..N} H( P(Y | f_θ(x_i)) )    (4)

The marginal entropy H(Y) is the entropy of the mean of the conditional distributions, and the conditional entropy H(Y | f_θ(X)) is the mean of the entropies of the individual conditional distributions. This definition gives us another reason to expect that mutual information will work well. Since mutual information is the marginal entropy minus the conditional entropy, maximizing mutual information is equivalent to maximizing marginal entropy and minimizing conditional entropy. Thus, MI is high for templates that are, on average, less biased towards any given answer (high marginal entropy) and for templates with outputs the model is confident about (low conditional entropy). These attributes are desirable in constructing prompts, and we postulate that maximizing mutual information will yield a well-aligned template.

Looking at it another way, by the data processing inequality (Cover and Thomas, 2006), I(f_θ(X); Y) ≤ I(X; Y). Thus, I(f_θ(X); Y) gives a lower bound for I(X; Y), and the highest mutual information is the tightest lower bound. The prompt corresponding to this lower bound preserves the most information between X and Y.
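Equations 2-4 translate directly into a few lines of code. The sketch below is our own illustration, not the released implementation; it assumes that, for each candidate template θ, the collapsed distributions P(Y | f_θ(x_i)) for the N sampled inputs have already been computed and stacked into an N x |Y| array, and it returns the MI estimate per template together with the argmax selection.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def estimate_mi(cond_dists):
    """Estimate I(f_theta(X); Y) per Equations 2-4.

    cond_dists: array of shape (N, |Y|); row i is P(Y | f_theta(x_i)).
    """
    marginal = cond_dists.mean(axis=0)                             # mean of the conditionals
    h_y = entropy(marginal)                                        # Eq. 3: marginal entropy
    h_y_given_x = float(np.mean([entropy(row) for row in cond_dists]))  # Eq. 4: mean conditional entropy
    return h_y - h_y_given_x                                       # Eq. 2

def select_template(cond_dists_per_template):
    """Return argmax-MI template name and the MI of every template."""
    mi = {name: estimate_mi(dists) for name, dists in cond_dists_per_template.items()}
    return max(mi, key=mi.get), mi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy templates: one confident and varied (high MI), one near-uniform (low MI).
    confident = rng.dirichlet(alpha=[0.2] * 5, size=500)
    uniform_ish = rng.dirichlet(alpha=[50.0] * 5, size=500)
    print(select_template({"template_A": confident, "template_B": uniform_ish}))
```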
Figure 3: Distributions over template accuracies. Distributions of accuracies over K = 20 templates for each model/dataset pair, compared to the prompts selected with MI (translucent red dots).

4 Experimental Setup

4.1 Datasets
We validate the efficacy of our prompt engineering method with experiments on eight well-known NLP datasets[4]: SQuAD2.0 (Rajpurkar et al., 2018), LAMBADA (Paperno et al., 2016), ROCStories (Mostafazadeh et al., 2016), CommonsenseQA (CoQA) (Talmor et al., 2018), IMDB (Maas et al., 2011), BoolQ (Clark et al., 2019), COPA (Gordon et al., 2012), and WiC (Pilehvar and Camacho-Collados, 2018), which together span seven unique NLP tasks (see Table 1). We used a random sample of N = 500 instances from each dataset for our experiments.[5] For ROCStories, which consists of a set of five-sentence stories, we randomly masked a word from each story in order to use the data for masked word prediction (cloze).

We made minor changes to two of the datasets in order to cast the associated tasks into OTR. For the SQuAD dataset, we dropped all questions that did not have a one-word answer. For the CoQA dataset, we dropped all questions with answer choices that started with a shared first word (e.g., the dog, the cat, the monkey). Both changes were made to decrease ambiguity about which option the model was choosing given its output distribution for a single token.

[4] Datasets are listed in descending order here and throughout the paper, first by |Y|, and then by method performance.

[5] We sampled from the train sets of CoQA and SQuAD; the train and validation sets of WiC, COPA, and BoolQ; the full datasets of ROCStories and IMDB; and the test set for LAMBADA.

Table 1: All datasets used in our experiments. |Y| is the size of the label space and N_all is the size of the dataset we sample from (after any modifications).

    Dataset      Task                                |Y|      Base Acc.   N_all
    SQuAD        Open Book QA                        |T_φ|    ~0          16K
    LAMBADA      Cloze                               |T_φ|    ~0          5K
    ROCStories   Cloze                               |T_φ|    ~0          52K
    CoQA         Closed Book QA                      5        0.2         9K
    IMDB         Sentiment Analysis                  2        0.5         50K
    BoolQ        Reading Comprehension               2        0.5         16K
    COPA         Choice of Plausible Alternatives    2        0.5         1K
    WiC          Word in Context                     2        0.5         5K

4.2 Models

We assess our method on eight models ranging from 124 million to 175 billion parameters. These include GPT-2 124M & 1.5B (Radford et al., 2019), GPT-Neo 2.7B (Black et al., 2021), GPT-J 6B (Wang and Komatsuzaki, 2021), and the Ada, Babbage, Curie, and Davinci variants of GPT-3 (Brown et al., 2020). We assume (per Perez et al., 2021) that these correspond, respectively, to the 2.7B, 6.7B, 13B, and 175B models in (Brown et al., 2020). Each is a causal language model; although we do not include masked language models, they are a promising area for future work.

5 Results

In this section, we analyze our experiments. First, we look at our method's ability to select high-accuracy prompts across models and datasets (Section 5.1). Next, we correlate template mutual information and accuracy (Section 5.2). We then compare our method and template selection using labeled examples in Section 5.3. In Section 5.4, we explore the robustness of MI and use ensembling to improve it. Finally, we compare the transferability of prompt templates selected with MI from model to model in Section 5.5.
Figure 4: Correlation between MI and accuracy, per dataset and model. Correlations are more consistently high across all tasks for the largest models, suggesting that our method is most useful at those model sizes.

Figure 5: Mutual information vs. accuracy with GPT-3 175B. Each dot represents a template and its average mutual information and accuracy over N = 500 task instances. Linear best fit (by mean standard error) lines are included to show overall trends.
5.1 Template Selection Performance

Figure 6: For P = 100 random train/test set partitions for each training set size N = 2, 4, 8, ..., 256, we select a template based on accuracy on those N examples (N-shot Acc) and based on mutual information computed on just those N examples (N-shot MI). Then, we report the accuracy of that template on the test set (size: 500 - N). Error bars (±σ) are reported across the P = 100 partitions. For reference, the highest, average, and full-dataset MI template accuracies are also reported.
5.2 Correlation between Template Mutual Information and Accuracy

In Section 5.1, we see how the mutual-information-selected template does in terms of accuracy compared to all other templates. We have not discussed, however, how generally MI and accuracy are correlated, except that the highest-MI template tends to have anomalously high accuracy. Here, we establish that their correlation is high across all templates for the largest LMs. Each of the K = 20 templates has two corresponding measures: average accuracy and average MI. We can use these pairs to correlate MI and accuracy via Pearson's R.

We see in Figure 4 that the correlations are surprisingly high for the majority of models and datasets. For SQuAD, LAMBADA, ROCStories, and CoQA, this pattern holds across all model sizes; for the remainder, results are good on larger models and are much less reliable on smaller models. Overall, this is evidence that as mutual information increases, so does accuracy. In other words, mutual information can be used to make an educated guess about accuracy without having to use any ground truth labels, especially on larger models.
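The correlation reported in Figure 4 is simply a Pearson correlation over the 20 per-template averages. A minimal version of that computation, with made-up numbers standing in for one model/dataset pair, might look like:

```python
from scipy.stats import pearsonr

# Per-template average MI and average accuracy for one model/dataset pair
# (illustrative values only, not numbers from the paper).
avg_mi = [4.79, 4.55, 4.33, 4.19, 3.34, 3.04, 2.86, 2.63, 2.48, 2.37]
avg_acc = [0.77, 0.61, 0.60, 0.58, 0.47, 0.43, 0.40, 0.38, 0.36, 0.30]

r, p_value = pearsonr(avg_mi, avg_acc)
print(f"Pearson's R = {r:.2f} (p = {p_value:.3f})")
```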
5.3 Compared to Few Labeled Examples

Next, we ask: how does our method compare to selecting a template based on the accuracy of a few labeled examples? Also, how many unlabeled examples does MI need in order to perform well?

Results with the largest model are reported in Figure 6. Note that with as few as N = 2 instances, MI selects a far better than average template, allowing performance gains even in the low-data, unlabeled regime. Additionally, for low N and across all eight datasets, MI even selects a better template on average than selecting based on labeled train set accuracy. This suggests that, even with labeled examples, selecting based on MI may be preferable to selecting based on accuracy when few examples are available. Selecting by labeled train set accuracy often begins to perform better at higher N, but at the cost of requiring labeled data, while our method needs no labels.
Figure 7: Accuracy by ensembling protocol on GPT-3 175B. For each dataset, the KDE plot represents accuracy over each of the (20 choose 5) ensembles of 5 templates drawn from the 20 templates associated with the dataset. Each plot also includes lines representing the average accuracy of all single templates for the dataset, the accuracy of the ensemble of all 20 templates, and the accuracy of the ensemble of the top 5 templates chosen by MI. In only one case does all-20 beat top-5-MI, and it does so at 4x the cost.
Figure 8: For each model/dataset pair, accuracies are normalized linearly so that 0 is the average prompt accuracy and 1 is the highest test accuracy. Using the prompt chosen by either MI or test accuracy on each selection model, average performance across datasets is reported for each inference model.
5.4 Method Robustness and Ensembling

To explore our method's robustness, we consider the question: what if we had included a different subset of templates, especially one not including the top-MI template? Figure 5 shows average MI/accuracy data for all K = 20 prompt templates on GPT-3 175B (similar plots for other models are found in Appendix B.1). For six of eight datasets, the results are robust; the top few prompt templates (by MI) are all high performers. The performance for COPA and WiC is more brittle; excluding the top-MI template would have resulted in a large drop in accuracy. This attests to the utility of generating a diverse slate of templates, as recommended in Appendix A, and also to the risk that outliers could compromise our method's effectiveness.

A comprehensive discussion of remedies for outliers is beyond the scope of this paper, but it is an important concern. Considering the strength of the MI/accuracy correlations, one simple approach is to ensemble the top 5 MI templates.

To compare this principled top-5 ensemble to other possible ensembles of templates, we take all (20 choose 5) subsets of 5 templates from all 20 templates and calculate the accuracy of each ensemble. For each dataset, we plot this distribution's kernel density estimate, which models the p.d.f. of the random variable "accuracy of 5 random templates ensembled together". We then compare the top-5 MI ensemble to the other possible ensembles. The results are shown in Figure 7.

We found that the top-5 MI ensemble does at least as well as the top-20 ensemble in all but one case. Two reasons to use MI are, then, that 1) the MI ensemble gets as good or better a result as ensembling all prompt templates, and 2) it does so at a fourth of the experimental cost. In short, ensembling by MI is a cheap and effective way to guard against anomalous high-MI/low-accuracy templates.
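The extracted text does not spell out the ensembling mechanics. One natural reading, sketched below purely as an assumption, is to average the collapsed answer distributions P(Y | f_θ(x)) of the selected templates and predict the argmax of the average; the function below ranks templates by MI and ensembles the top k in that way.

```python
import numpy as np

def ensemble_predict(cond_dists_per_template, template_mi, top_k=5):
    """Hedged sketch of a top-k MI ensemble: average the P(Y|x) distributions.

    cond_dists_per_template: dict {template_name: array of shape (N, |Y|)}.
    template_mi: dict {template_name: estimated mutual information}.
    Returns predicted label indices for the N instances.
    """
    # Rank templates by MI and keep the top k (the "top-5 MI ensemble" of Section 5.4).
    top = sorted(template_mi, key=template_mi.get, reverse=True)[:top_k]
    stacked = np.stack([cond_dists_per_template[name] for name in top])  # (k, N, |Y|)
    mean_dist = stacked.mean(axis=0)                                     # (N, |Y|)
    return mean_dist.argmax(axis=1)
```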
5.5 Transferability across Models

Finally, we explore how well-chosen templates generalize between models. Concretely, we choose templates by maximizing either test accuracy (oracle) or mutual information (our method) using a selection model φ_s, and then calculate test accuracy using a different inference model φ_i. We calculate absolute test accuracy and then normalize it such that 0 and 100 correspond to the average and maximum scores across templates for a model/dataset pair. We average our results across datasets and present the results in Figure 8. Prompt transfer for each dataset can be found in Appendix B.2.
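For reference, the normalization described here (and in the Figure 8 caption) amounts to a simple affine rescaling per model/dataset pair. A short version, using hypothetical variable names, is:

```python
def normalized_score(selected_acc, template_accs):
    """Rescale so the mean template accuracy maps to 0 and the best template maps to 1 (100%)."""
    mean_acc = sum(template_accs) / len(template_accs)
    best_acc = max(template_accs)
    return (selected_acc - mean_acc) / (best_acc - mean_acc)

# Example: a selected prompt at 0.58 accuracy, with templates averaging 0.50 and a best of 0.60
print(normalized_score(0.58, [0.42, 0.50, 0.58, 0.60, 0.40]))  # 0.8, i.e. 80% of the way to the oracle
```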
MI performance is best when the largest model (GPT-3 175B) is used as both the selection and inference model: on average, MI scores 90% on this normalized scale. Additionally, performance is most consistently high when the largest models are used either for selection or inference. But almost all transfer scores are well above 0 (only one negative average gain out of 64 transfer permutations), suggesting that transfer is often effective.

Overall, we have observed that prompt selection by mutual information is surprisingly effective across a variety of datasets and model sizes. This method works best on larger models and for tasks that the LM is capable of performing. Given the high diversity of tasks that we have explored, we expect this method to transfer well to many other NLP tasks, including regimes with little labeled data.
6 Conclusion

In this paper, we introduce a method for selecting prompts that effectively align language models to NLP tasks. Over a set of candidate prompts, our method selects the template that maximizes the mutual information between the input and the model output. We demonstrate that 1) mutual information is highly correlated with test accuracy, and 2) selecting a prompt based on mutual information leads to significant accuracy gains over random choice, approaching oracle performance on GPT-3 175B, and it does so across model sizes and tasks.

Whereas other methods rely on ground truth labels and/or direct model access, ours requires neither. Many applications characterized by a lack of computational resources, limited model access (e.g., inference only), or a lack of ground truth data prohibiting the testing of candidate prompts become feasible with our method.

7 Ethics

There are many ways to prompt a language model poorly, and there still seem to be NLP tasks which are beyond alignment regardless of model size or prompt quality. This method cannot align a LM to a task if the entire set of prompts is poor or, obviously, if the model cannot be aligned. High mutual information does not necessarily imply high accuracy, despite the strong correlation we found. Thus, our method should only be employed on a task if there is some understanding of how high MI needs to be on a domain or set of templates to imply a sufficiently high accuracy for safe use.

Otherwise, we introduce no model, dataset, or other contribution that might warrant ethical concern.

Acknowledgements

We thank the anonymous reviewers for their helpful feedback. This material is based upon work supported by the National Science Foundation under Grant No. RI 2141680.
References

Asaf Amrami and Yoav Goldberg. 2018. Word Sense Induction with Neural biLM and Symmetric Patterns. Pages 4860-4867.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. Inducing Relational Knowledge from BERT.

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-480.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044.

Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory, 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171-4186.

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394-398, Montréal, Canada. Association for Computational Linguistics.

Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. PPT: Pre-trained Prompt Tuning for Few-shot Learning.

Ziwei Ji, Justin D. Li, and Matus Telgarsky. 2021. Early-stopped neural networks are consistent.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning.

Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Pages 4582-4597.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT 2019, pages 1073-1094.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Pages 1-46.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT Understands, Too.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142-150, Portland, Oregon, USA. Association for Computational Linguistics.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696.

Preetum Nakkiran and Yamini Bansal. 2020. Distributional generalization: A new kind of generalization. CoRR, abs/2009.08092.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, abs/1606.06031.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True Few-Shot Learning with Language Models. Pages 1-21.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-Shot Performance of Language Models.

Yukun Zuo, Quan Fang, Shengsheng Qian, Xiaorui Zhang, and Changsheng Xu. 2018. Representation Learning of Knowledge Graphs with Entity Attributes and Multimedia Descriptions. 2018 IEEE 4th International Conference on Multimedia Big Data (BigMM 2018), pages 2659-2665.
Figure 9: Mutual information plotted against accuracy per prompt for each dataset using GPT-3 175B, with linear best fit (by MSE) lines to show overall trends.
[Appendix B.2 figures: per-dataset prompt transferability heatmaps ("Transferability for SQuAD", and corresponding panels for the remaining datasets), each with one panel for selection by Mutual Information and one for selection by Test Accuracy; rows are selection models and columns are inference models, with normalized transfer scores as cell values.]
QUESTION1: Alice was friends with Bob. Alice went to visit her friend
"As of the census of 2000, there were 197,790 people, 84,549 ____. -> Bob
households, and 43,627 families residing in the city. The popu- George bought some baseball equipment, a ball, a glove, and a
lation density was 3,292.6 people per square mile (1,271.3/km). ____. -> bat
"I would speak to you privately," Bowen said, casting a glance
There were 92,282 housing units at an average density of 1,536.2 around at the others milling about.
per square mile (593.1/km). The racial makeup of the city was
38.3% White, 57.2% African American, 0.2% Native American, The worry in her eyes deepened, but she nodded hesitantly and
1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and awaited Bowen’s directive.
1.5% from two or more races. Hispanic or Latino of any race
were 2.6% of the population." What percentage of the Richmond He led her through the great hall, annoyance biting at him when
population of 2000 was Pacific Islander? he saw no place where people weren’t congregated. He stepped
ANSWER1: outside the back of the keep, where, finally, he spied an area near
the bathhouses, where it was quiet and ____. ->
Collapsing token sets: None, all tokens are
Collapsing token sets: None, all tokens are
considered
considered
He led her through the great hall, annoyance biting at him when Prompt 5 (MI: 4.194, Acc: 0.608):
he saw no place where people weren’t congregated. He stepped P1: I’m going to tell you a story, but leave a word out. Once I’m
outside the back of the keep, where, finally, he spied an area near done telling the story, pick the word that best fits in the blank.
the bathhouses, where it was quiet and ____. -> I like to eat peanut butter and jelly ____.
P2: sandwiches
Collapsing token sets: None, all tokens are P1: I’m going to tell you a story, but leave a word out. Once I’m
done telling the story, pick the word that best fits in the blank.
considered "I would speak to you privately," Bowen said, casting a glance
around at the others milling about.
Prompt 2 (MI: 4.793, Acc: 0.770):
The worry in her eyes deepened, but she nodded hesitantly and
Fill in blank: awaited Bowen’s directive.
She held the torch in front of her. He led her through the great hall, annoyance biting at him when
She caught her breath. he saw no place where people weren’t congregated. He stepped
outside the back of the keep, where, finally, he spied an area near
"Chris? There’s a step." the bathhouses, where it was quiet and ____.
P2:
"What?"
Collapsing token sets: None, all tokens are
"A step. Cut in the rock. About fifty feet ahead." She
moved faster. They both moved faster. "In fact," she said, raising considered
the torch higher, "there’s more than a ____. -> step
He led her through the great hall, annoyance biting at him when The missing word in the story should be: "
he saw no place where people weren’t congregated. He stepped Collapsing token sets: None, all tokens are
outside the back of the keep, where, finally, he spied an area near
the bathhouses, where it was quiet and ____." -> ""I would speak considered
to you privately," Bowen said, casting a glance around at the
others milling about. Prompt 10 (MI: 2.632, Acc: 0.474):
"I would speak to you privately," Bowen said, casting a glance
The worry in her eyes deepened, but she nodded hesitantly and
around at the others milling about.
awaited Bowen’s directive.
The worry in her eyes deepened, but she nodded hesitantly and
He led her through the great hall, annoyance biting at him when
awaited Bowen’s directive.
he saw no place where people weren’t congregated. He stepped
outside the back of the keep, where, finally, he spied an area near He led her through the great hall, annoyance biting at him when
the bathhouses, where it was quiet and he saw no place where people weren’t congregated. He stepped
Collapsing token sets: None, all tokens are outside the back of the keep, where, finally, he spied an area near
the bathhouses, where it was quiet and ____.
considered Fill in the blank with the missing word or phrase.
What is the missing word? The missing word is "
Prompt 7 (MI: 4.328, Acc: 0.596): Collapsing token sets: None, all tokens are
P1: I’m going to tell you a story, but leave a word out. Once I’m
done telling the story, pick the word that best fits in the blank. considered
It was a cold night. The wind was ____ around the courtyard as I
stepped out of the car and into the darkness. Prompt 11 (MI: 4.549, Acc: 0.470):
P2: whistling
P1: I’m going to tell you a story, but leave a word out. Once I’m It was a cold night. The wind was ____ around the courtyard as I
done telling the story, pick the word that best fits in the blank. stepped out of the car and into the darkness.
"I would speak to you privately," Bowen said, casting a glance Word: whistling
around at the others milling about.
"I would speak to you privately," Bowen said, casting a
The worry in her eyes deepened, but she nodded hesitantly and glance around at the others milling about.
awaited Bowen’s directive.
The worry in her eyes deepened, but she nodded hesitantly and
He led her through the great hall, annoyance biting at him when awaited Bowen’s directive.
he saw no place where people weren’t congregated. He stepped
He led her through the great hall, annoyance biting at him when
outside the back of the keep, where, finally, he spied an area near
he saw no place where people weren’t congregated. He stepped
the bathhouses, where it was quiet and ____.
outside the back of the keep, where, finally, he spied an area near
P2:
the bathhouses, where it was quiet and ____.
Collapsing token sets: None, all tokens are
Word:
considered
Collapsing token sets: None, all tokens are
Prompt 8 (MI: 3.338, Acc: 0.586): considered
Fill in the blank with the missing word to complete the sentence.
Prompt 12 (MI: 2.637, Acc: 0.454):
Passage: I like to eat peanut butter and jelly ____.
Missing Word: sandwiches P1: I’m going to tell you a story, but leave a word out. Once I’m
done telling the story, pick the word that best fits in the blank.
Passage: "I would speak to you privately," Bowen said, "I would speak to you privately," Bowen said, casting a glance
casting a glance around at the others milling about. around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and The worry in her eyes deepened, but she nodded hesitantly and
awaited Bowen’s directive. awaited Bowen’s directive.
He led her through the great hall, annoyance biting at him when He led her through the great hall, annoyance biting at him when
he saw no place where people weren’t congregated. He stepped he saw no place where people weren’t congregated. He stepped
outside the back of the keep, where, finally, he spied an area near outside the back of the keep, where, finally, he spied an area near
the bathhouses, where it was quiet and ____. the bathhouses, where it was quiet and ____.
Missing Word: " P2: The word which fits best is "
Collapsing token sets: None, all tokens are Collapsing token sets: None, all tokens are
considered considered
Prompt 13 (MI: 2.476, Acc: 0.434): Prompt 17 (MI: 1.931, Acc: 0.376):
"I would speak to you privately," Bowen said, casting a glance "I would speak to you privately," Bowen said, casting a glance
around at the others milling about. around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and The worry in her eyes deepened, but she nodded hesitantly and
awaited Bowen’s directive. awaited Bowen’s directive.
He led her through the great hall, annoyance biting at him when He led her through the great hall, annoyance biting at him when
he saw no place where people weren’t congregated. He stepped he saw no place where people weren’t congregated. He stepped
outside the back of the keep, where, finally, he spied an area near outside the back of the keep, where, finally, he spied an area near
the bathhouses, where it was quiet and ____. the bathhouses, where it was quiet and ____.
Fill in the blank with the missing word or phrase to complete the Which word should we put in the blank to complete the story?
sentence. Let’s use the word "
What is the missing word? The missing word is "
Collapsing token sets: None, all tokens are
Collapsing token sets: None, all tokens are
considered
considered
Prompt 18 (MI: 2.530, Acc: 0.374):
Prompt 14 (MI: 3.043, Acc: 0.432): P1: What word do you think fits best in the following story?
Read the following sentences, and try to guess which word goes "I would speak to you privately," Bowen said, casting a glance
in the blank.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
around at the others milling about. The worry in her eyes deepened, but she nodded hesitantly and
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen’s directive.
awaited Bowen’s directive. He led her through the great hall, annoyance biting at him when
He led her through the great hall, annoyance biting at him when he saw no place where people weren’t congregated. He stepped
he saw no place where people weren’t congregated. He stepped outside the back of the keep, where, finally, he spied an area near
outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
the bathhouses, where it was quiet and ____. P2: The word which fits best is "
Answer: " Collapsing token sets: None, all tokens are
Collapsing token sets: None, all tokens are considered
considered
Prompt 19 (MI: 2.372, Acc: 0.364):
Prompt 15 (MI: 2.450, Acc: 0.428): "I would speak to you privately," Bowen said, casting a glance
Fill in blank: around at the others milling about.
"I would speak to you privately," Bowen said, casting a The worry in her eyes deepened, but she nodded hesitantly and
glance around at the others milling about. awaited Bowen’s directive.
The worry in her eyes deepened, but she nodded hesitantly and He led her through the great hall, annoyance biting at him when
awaited Bowen’s directive. he saw no place where people weren’t congregated. He stepped
outside the back of the keep, where, finally, he spied an area near
He led her through the great hall, annoyance biting at him when the bathhouses, where it was quiet and ____.
he saw no place where people weren’t congregated. He stepped Which word fills in the blank best?
outside the back of the keep, where, finally, he spied an area near The word that fills in the blank best is "
the bathhouses, where it was quiet and ____. -> Collapsing token sets: None, all tokens are
Collapsing token sets: None, all tokens are considered
considered
Prompt 20 (MI: 2.860, Acc: 0.296):
Prompt 16 (MI: 2.820, Acc: 0.398): Pick the best word to replace the blank.
Fill in the blank with the missing word. Story: "I would speak to you privately," Bowen said, casting a
"I would speak to you privately," Bowen said, casting a glance glance around at the others milling about.
around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen’s directive.
awaited Bowen’s directive.
He led her through the great hall, annoyance biting at him when
He led her through the great hall, annoyance biting at him when he saw no place where people weren’t congregated. He stepped
he saw no place where people weren’t congregated. He stepped outside the back of the keep, where, finally, he spied an area near
outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
the bathhouses, where it was quiet and ____. Answer: "
Answer: " Collapsing token sets: None, all tokens are
Collapsing token sets: None, all tokens are considered
considered
C.3 ROCStories

Prompt 1 (MI: 3.859, Acc: 0.538):
Fill in the blank for the following sentences.
"Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days." -> "Marissa loved
Collapsing token sets: None, all tokens are considered

Prompt 2 (MI: 4.427, Acc: 0.524):
Fill in the blank for the following sentences.
"It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness." -> "It was a cold night. The wind was whistling around the courtyard as I stepped out of the car and into the darkness."
"Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days." -> "Marissa loved
Collapsing token sets: None, all tokens are considered

Prompt 3 (MI: 3.728, Acc: 0.420):
Poke GO!
Marissa loved
Collapsing token sets: None, all tokens are considered

Prompt 4 (MI: 3.670, Acc: 0.356):
Fill in the blank with the missing word or phrase to complete the sentence.
Collapsing token sets: None, all tokens are considered

Prompt 6 (MI: 4.167, Acc: 0.298):
P1: I’m going to tell you a story, but leave a word out. Once I’m done telling the story, pick the word that best fits in the blank.
It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness.
P2: whistling
P1: I’m going to tell you a story, but leave a word out. Once I’m done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2:
Collapsing token sets: None, all tokens are considered

Prompt 7 (MI: 4.066, Acc: 0.290):
P1: I’m going to tell you a story, but leave a word out. Once I’m done telling the story, pick the word that best fits in the blank.
I like to eat _____ and jelly sandwiches.
P2: peanut butter
P1: I’m going to tell you a story, but leave a word out. Once I’m done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2:
Collapsing token sets: None, all tokens are considered

Prompt 8 (MI: 3.707, Acc: 0.258):
Guess the word in the blank to complete the story.
Story: Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:
Collapsing token sets: None, all tokens are considered
Prompt 11 (MI: 3.199, Acc: 0.220):
Fill in the blank with the missing word or phrase.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:
Collapsing token sets: None, all tokens are considered

Prompt 12 (MI: 2.013, Acc: 0.214):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Fill in the blank with the missing word or phrase to complete the sentence.
What is the missing word? The missing word is "
Collapsing token sets: None, all tokens are considered

Prompt 13 (MI: 3.116, Acc: 0.182):
Read the following sentences, and try to guess which word goes in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:
Collapsing token sets: None, all tokens are considered

Prompt 14 (MI: 1.843, Acc: 0.158):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
The missing word in the story should be: "
Collapsing token sets: None, all tokens are considered

Prompt 15 (MI: 2.681, Acc: 0.140):
P1: I’m going to tell you a story, but leave a word out. Once I’m done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2: The word which fits best is "
Collapsing token sets: None, all tokens are considered

Prompt 17 (MI: 2.634, Acc: 0.088):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Which word fills in the blank best?
The word that fills in the blank best is "
Collapsing token sets: None, all tokens are considered

Prompt 18 (MI: 2.637, Acc: 0.086):
P1: What word do you think fits best in the following story?
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2: The word which fits best is "
Collapsing token sets: None, all tokens are considered

Prompt 19 (MI: 3.648, Acc: 0.050):
It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness.
Word: whistling
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Put the best word in the blank to complete the story.
Word:
Collapsing token sets: None, all tokens are considered

Prompt 20 (MI: 1.891, Acc: 0.036):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Choose a word to replace the blank.
Word: "
Collapsing token sets: None, all tokens are considered
Question: In what Spanish speaking North American country can you get a great cup of coffee?
Choices: mildred’s coffee shop, mexico, diner, kitchen, canteen
Answer: "mexico" is the best answer. It’s true that you can get a cup of coffee in a coffee shop or a diner, but the question specifically asks for a Spanish speaking North American country. Mexico is the only country listed, so that must be the correct answer.
Question: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answer: "
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 8 (MI: 0.364, Acc: 0.408):
Q: What might a vegan eat for breakfast?
Choices: oats, bacon, sausage, omelet, ham
A: oats
Q: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
A:
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 9 (MI: 0.410, Acc: 0.408):
What would you use to put out a fire?
A: gasoline
B: poison
C: laundry detergent
D: water
E: pencil
Answer: water
If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Answer:
Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Question: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answer: "
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 11 (MI: 0.059, Acc: 0.380):
Common Sense Quiz Answer Key
Question 1: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Correct Answer:
Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Prompt 12 (MI: 0.233, Acc: 0.360):
Given the question, order the options from best answer to the question to worst answer to the question.
Question: I’m crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answers (in order of best to worst): valley, bridge, waterfall, mountain, pebble
Question: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answers (in order of best to worst):
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}
Prompt 13 (MI: 0.255, Acc: 0.360):
Given the question, order the options from best answer to the question to worst answer to the question.
Question: I’m crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answers (in order of best to worst): valley, bridge, waterfall, mountain, pebble
Question: In what Spanish speaking North American country can you get a great cup of coffee?
Choices: mildred’s coffee shop, mexico, diner, kitchen, canteen
Answers (in order of best to worst): mexico, mildred’s coffee shop, diner, kitchen, canteen
Question: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answers (in order of best to worst):
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 14 (MI: 0.222, Acc: 0.354):
Q: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
A:
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 15 (MI: 0.246, Acc: 0.342):
Teacher: I’m going to ask you a common sense question.
Student: Alright.
Teacher: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Student: What are the possible answers?
Teacher: The answer is either "wrong," "pleasure," "encouragement," "depression," or "relief."
Student: What are the possible answers?
Teacher: The answer is either "george washington," "declaration of independence," "boston tea party," "star spangled banner," or "vampire assassins."
Student: I know the right answer - it’s "vampire assassins."
Teacher: That’s right! Here’s another common sense question for you. If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Student: What are the possible answers?
Teacher: The answer is either "wrong," "pleasure," "encouragement," "depression," or "relief."
Student: I know the right answer - it’s "
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 17 (MI: 0.265, Acc: 0.276):
Me: I watched the most recent episode of the "Is It Really Common Sense" game show yesterday night.
Friend: Oh, how was it?
Me: It was good. I remember one of the questions.
Friend: What was the question?
Me: If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Friend: What were the options?
Me: wrong, pleasure, encouragement, depression, or relief
Friend: Did the contestant get the answer right?
Me: Yep!
Friend: Which of the options was correct?
Me: The correct answer was
Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 19 (MI: 0.013, Acc: 0.234):
If you’re still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Answer:
Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

C.5 IMDB

Prompt 1 (MI: 0.175, Acc: 0.944):
P1: How was the movie?
P2: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
P1: Would you say your review of the movie is negative or positive?
P2: I would say my review review of the movie is
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
P1: So, overall, would you give it a positive or negative review?
P2: I would give it a
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 3 (MI: 0.154, Acc: 0.904):
Considering this movie review, determine its sentiment.
Review: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
In general, was the sentiment positive or negative The sentiment was
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}
Prompt 4 (MI: 0.260, Acc: 0.898):
P1: How was the movie?
P2: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
P1: Would you say your review of the movie is positive or negative?
P2: I would say my review of the movie is
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 5 (MI: 0.237, Acc: 0.888):
After reading the following review, classify it as negative or positive.
Review: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.

Prompt 6 (MI: 0.151, Acc: 0.886):
Read the following movie review to determine the review’s sentiment.
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
In general, was the sentiment positive or negative? The sentiment was
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 7 (MI: 0.086, Acc: 0.886):
Considering this movie review, determine its sentiment.
Review:
"""
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
Prompt 17 (MI: 0.021, Acc: 0.486):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the previous review negative or positive? The previous review was
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 19 (MI: 0.019, Acc: 0.462):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I’ve been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier’s widow is a variation on Shakespeare’s Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the sentiment of previous review positive or negative? The previous review was
Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}
Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways.
Is pyruvic acid and pyruvate the same thing?
Answer: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 14 (MI: 0.050, Acc: 0.668):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
I would answer: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 15 (MI: 0.058, Acc: 0.646):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
I would respond: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 16 (MI: 0.027, Acc: 0.634):
Based on the passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
And answering the question: "Is pyruvic acid and pyruvate the same thing?"
By choosing yes or no
My answer would be: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose "true" or "false", My answer would be: "
Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 18 (MI: 0.020, Acc: 0.518):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose yes or no, My answer would be: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 19 (MI: 0.013, Acc: 0.452):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose "true" or "false", I would answer: "
Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 20 (MI: 0.022, Acc: 0.438):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH3COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paruvet/), the conjugate base, CH3COCOO−, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose yes or no, I would answer: "
Collapsing token sets: {'True': ['yes'], 'False': ['no']}
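Throughout these listings, each prompt ends with a "Collapsing token sets" line: either None (all output tokens are scored) or a mapping from answer classes to the surface tokens that count as that class (e.g. {'True': ['yes'], 'False': ['no']}). As a minimal sketch only, assuming a simple sum-and-renormalize rule and hypothetical function names rather than the released implementation, such a mapping could be applied to a model's next-token probabilities as follows.

```python
# Illustrative sketch: applying a "collapsing token set" mapping to next-token
# probabilities. The helper name and the renormalization rule are assumptions.
from typing import Dict, List


def collapse_distribution(
    token_probs: Dict[str, float],
    collapsing_sets: Dict[str, List[str]],
) -> Dict[str, float]:
    """Collapse per-token probabilities into a distribution over answer classes."""
    # Each listed token contributes its probability mass to its class.
    class_mass = {
        label: sum(token_probs.get(tok, 0.0) for tok in tokens)
        for label, tokens in collapsing_sets.items()
    }
    total = sum(class_mass.values())
    if total == 0.0:
        # No listed token received probability; fall back to a uniform guess.
        return {label: 1.0 / len(class_mass) for label in class_mass}
    return {label: mass / total for label, mass in class_mass.items()}


# Example with the BoolQ-style mapping used in several prompts above.
probs = {"yes": 0.61, "no": 0.22, "maybe": 0.05}
print(collapse_distribution(probs, {"True": ["yes"], "False": ["no"]}))
# -> {'True': 0.73..., 'False': 0.26...}
```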
C.7 COPA

Prompt 1:
Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Answer: Alternative 2. Getting a hole in your sock would not break your toe, unless there is additional information. Dropping a hammer (which is a heavy object), on the other hand, would almost certaintly break your toe. Thus, the best answer is Alternative 2.
Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.
Answer: Alternative 2. Tipping a bottle causes liquid to fall out, not to freeze. Freezing is caused by being placed in a cold place. Pouring out (Alternative 2) is correct because it makes the most sense.
Premise: I knocked on my neighbor’s door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
Answer: Alternative 1. When you knock on a neighbor’s door, it is likely that if they are home they will answer and invite you in. It does not make much sense, however, that a neighbor would leave their house without explanation. Therefore, Alternative 1 is the best result of the premise.
Premise: My foot went numb. What happened as a RESULT?
Alternative 1: I put my shoes on.
Alternative 2: I shook my foot.
Answer: Alternative
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 2 (MI: 0.034, Acc: 0.762):
The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%.
Examples
Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Answer: Alternative 2
Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.
Answer: Alternative 2
Premise: I knocked on my neighbor’s door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
Answer: Alternative 1
Premise: My foot went numb. What happened as a RESULT?
Alternative 1: I put my shoes on.
Alternative 2: I shook my foot.
Answer: Alternative
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 3 (MI: 0.003, Acc: 0.628):
What is the effect of the following premise: "My foot went numb."
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 4 (MI: 0.002, Acc: 0.612):
Solve the following COPA task by choosing the sentence which makes the most sense after the premise.
Premise: My foot went numb.
Choice 1. I put my shoes on.
Choice 2. I shook my foot.
Answer: Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 5 (MI: 0.003, Acc: 0.550):
If asked to pick between choice 1 ("I put my shoes on.") or choice 2 ("I shook my foot.") to see what the effect of this premise ("My foot went numb.") was, I would say: "choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 6 (MI: 0.010, Acc: 0.540):
Solve the following COPA tasks by choosing the sentence which makes the most sense after the premise.
Premise: The man broke his toe.
Choice 1. He got a hole in his sock.
Choice 2. He dropped a hammer on his foot.
Answer: Choice 2.
Premise: My foot went numb.
Choice 1. I put my shoes on.
Choice 2. I shook my foot.
Answer: Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 7 (MI: 0.002, Acc: 0.532):
What is the effect of the following premise: "My foot went numb."
If asked to choose between Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
My answer would be: Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 8 (MI: 0.006, Acc: 0.530):
I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: My foot went numb.
1) I put my shoes on.
2) I shook my foot.
The most plausible alternative is: Sentence
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 9 (MI: 0.018, Acc: 0.524):
Read the following premise and answer by choosing "effect1" or "effect2"
Premise: "My foot went numb."
effect1: "I put my shoes on."
effect2: "I shook my foot."
Answer: "effect
Collapsing token sets: {'1': ['1'], '2': ['2']}
Prompt 10 (MI: 0.008, Acc: 0.520):
Read the following premise and pick "effect2" or "effect1"
Premise: "My foot went numb."
effect1: "I put my shoes on."
effect2: "I shook my foot."
Answer: "effect
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 11 (MI: 0.003, Acc: 0.516):
Based on this premise: "My foot went numb."
If asked to choose between
Choice 1: "I put my shoes on."
or
Choice 2: "I shook my foot."
My answer would be: Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 12 (MI: 0.008, Acc: 0.510):
Which one of these stories makes the most sense?
Story 1: My foot went numb. I put my shoes on.
Story 2: My foot went numb. I shook my foot.
Answer: Story
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 13 (MI: 0.003, Acc: 0.506):
P1: Here’s a premise: "The man broke his toe."
Which sentence provides the better alternative?
1. "He got a hole in his sock", or
2. "He dropped a hammer on his foot."
P2: The better alternative is sentence
P1: Here’s a premise: "My foot went numb.".Which sentence provides the better alternative? 1. "I put my shoes on", or 2. "I shook my foot."P2: The better alternative is sentence
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 14 (MI: 0.003, Acc: 0.504):
Based on this premise: "My foot went numb."
If asked to pick between
Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
of the predeciding sentence, I would say: "Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 17 (MI: 0.006, Acc: 0.500):
I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: The man broke his toe.
1) He got a hole in his sock.
2) He dropped a hammer on his foot.
The most plausible alternative is: Sentence 2).
I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: My foot went numb.
1) I put my shoes on.
2) I shook my foot.
The most plausible alternative is: Sentence
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 18 (MI: 0.003, Acc: 0.500):
P1: Here’s a premise: My foot went numb..Which sentence provides the better alternative? 1. "I put my shoes on", or 2. "I shook my foot."P2: The better alternative is sentence
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 19 (MI: 0.019, Acc: 0.500):
"The man broke his toe."
Which of the following alternatives is most plausible for the previous sentence?
Sentence 1) He got a hole in his sock.
Sentence 2) He dropped a hammer on his foot.
The most plausible alternative is sentence 2).
"My foot went numb."
Which of the following alternatives is most plausible for the previous sentence?
Sentence 1) I put my shoes on.
Sentence 2) I shook my foot.
The most plausible alternative is sentence
Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 20 (MI: 0.001, Acc: 0.496):
I want to figure out which effect of this sentence is more probably:
"My foot went numb."
Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
to get the effect I would say: "Choice
Collapsing token sets: {'1': ['1'], '2': ['2']}
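For context on the MI values reported in each prompt header, the following is a minimal sketch of a plug-in estimator consistent with the selection rule described in the main text (choose the template that maximizes the mutual information between inputs and the model's collapsed outputs). The function names and the particular estimator, I(X; Y) estimated as the entropy of the averaged output distribution minus the average per-input entropy, are illustrative assumptions rather than the authors' released code.

```python
# Hedged sketch of a template-level mutual information estimate:
# I(X; Y) ~= H(mean_x p(y|x)) - mean_x H(p(y|x)), computed over the
# collapsed answer distributions produced for each filled-in input.
import math
from typing import Dict, List


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def template_mutual_information(per_example_dists: List[Dict[str, float]]) -> float:
    """Estimate I(X; Y) for one template from per-example output distributions."""
    n = len(per_example_dists)
    labels = per_example_dists[0].keys()
    # Marginal p(y): average the conditional distributions over the inputs.
    marginal = {y: sum(d.get(y, 0.0) for d in per_example_dists) / n for y in labels}
    conditional_entropy = sum(entropy(d) for d in per_example_dists) / n
    return entropy(marginal) - conditional_entropy


# A template whose outputs vary confidently with the input scores higher than
# one that gives the same hedged answer regardless of the input.
confident = [{"1": 0.9, "2": 0.1}, {"1": 0.1, "2": 0.9}]
hedged = [{"1": 0.55, "2": 0.45}, {"1": 0.5, "2": 0.5}]
print(template_mutual_information(confident) > template_mutual_information(hedged))  # True
```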