
An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels


Taylor Sorensen*, Joshua Robinson*, Christopher Michael Rytting*,
Alexander Shaw, Kyle Rogers, Alexia Delorey, Mahmoud Khalil,
Nancy Fulda, David Wingate
Computer Science Department, Brigham Young University
{tsor13,joshua_robinson,chrisrytting}@byu.edu
{nfulda,wingated}@cs.byu.edu

*Equal Contribution

arXiv:2203.11364v1 [cs.CL] 21 Mar 2022

Abstract

Pre-trained language models derive substantial linguistic and factual knowledge from the massive corpora on which they are trained, and prompt engineering seeks to align these models to specific tasks. Unfortunately, existing prompt engineering methods require significant amounts of labeled data, access to model parameters, or both. We introduce a new method for selecting prompt templates without labeled examples and without direct access to the model. Specifically, over a set of candidate templates, we choose the template that maximizes the mutual information between the input and the corresponding model output. Across 8 datasets representing 7 distinct NLP tasks, we show that when a template has high mutual information, it also has high accuracy on the task. On the largest model, selecting prompts with our method gets 90% of the way from the average prompt accuracy to the best prompt accuracy and requires no ground truth labels.

[Figure 1 ("Mutual Information Prompt vs. Others"; accuracy on SQuAD, LAMBADA, ROCStories, CoQA, IMDB, BoolQ, COPA, and WiC): Performance of template selected by our maximum mutual information method (MI) compared to the worst, mean, median, and best prompt on GPT-3 Davinci (175B). Our method performs at almost oracle levels, without labels or access to model weights.]

1 Introduction

It is well-known that large pre-trained language models (LMs) learn substantial linguistic (Liu et al., 2019; Amrami and Goldberg, 2018) and factual world knowledge (Petroni et al., 2020; Bosselut et al.; Bouraoui et al.; Zuo et al., 2018), achieving state-of-the-art performance on classic NLP tasks like closed-book question-answering, sentiment analysis, and many other tasks (Radford et al., 2019; Devlin et al., 2019; Raffel et al., 2019). The largest models can do this in a few-shot way, that is, being trained only with generic, semi-supervised objectives and "taught" tasks with just instructions and a few examples of the task provided via a natural language "prompt" in the context window (Brown et al., 2020). This suggests that pre-training equips them to potentially do many tasks that can be formulated as natural language generation, if only they can be primed in the right way.

Such priming is not a trivial task. The few-shot learning breakthrough can give the impression that if the LM is given a sensible prompt, it will "understand" what is meant and perform well on the task if it has the capacity. However, LMs can generate substantially different output distributions, and thus text, given two distinct prompts that appear semantically invariant (e.g., alternative orderings, lexical changes like capitalization, and general rephrasing (Zhao et al., 2021; Lu et al., 2021)). This can lead to surprisingly high variance in performance from prompt to prompt. Clearly, some prompts are better than others for aligning a model to a task.

Prompt engineering is a nascent field that aims to find aligning prompts (Reynolds and McDonell, 2021). While "prompt" refers to any language passed to the model via the context window, a template refers to a natural language scaffolding filled in with raw data, resulting in a prompt. Thus, prompt engineering includes finding high-quality
templates (i.e., those with high test accuracy). Generally, this is done by optimizing for accuracy over a validation set: a template is chosen from a candidate set based on its performance on labeled examples. Such labeled examples can be challenging to procure for some tasks and impossible for others. Some recent methods optimize prompts using backpropagation, which requires access to model weights. In this paper, we propose a new method for selecting prompts by using mutual information, which allows prediction of a prompt's performance without labels or access to model parameters.

Mutual information (MI) is a metric that quantifies the shared information between two random variables (see Section 3.2). We demonstrate that the mutual information between a prompt and a language model's output can serve as a useful surrogate for the test accuracy of a template. Specifically, for eight popular datasets representing seven classic NLP tasks, we generate a diverse set of 20 templates for each and show that template mutual information and template accuracy are highly correlated. These results are strongest on the largest models we study, for which our method chooses prompts that, on average, get 90% of the way from mean accuracy to maximum accuracy and even selects the best prompt on three of eight datasets. This suggests that, across a variety of NLP tasks, mutual information can be used to select one of the best prompts from a set of candidate prompts, even without making use of model weights or ground truth labels. In the following pages, we outline each step of our general method for generating and evaluating templates so that it can easily be ported to any other task. Code is available online.¹

¹ github.com/BYU-PCCL/information-theoretic-prompts

2 Related Work

The promise of language models and the challenge of aligning them has given rise to the field of prompt engineering, which seeks to construct the best prompt given a task and a language model (Liu et al., 2021a). The best performance on prompt engineering is often achieved using backpropagation in continuous prompt embedding space (Lester et al., 2021; Li and Liang, 2021; Gu et al., 2021; Liu et al., 2021b; Zhang et al., 2021) in contrast to generating a discrete set of prompts by hand and testing them. While optimizing in continuous prompt space via backprop allows for similar performance to model-tuning (at least at higher model sizes) (Lester et al., 2021), not all models are publicly available. Thus, these methods are only feasible for those who have direct access to the model and can perform backprop on it. Prompts optimized in continuous space are also not interpretable in natural language, making it harder to transfer insights from prompts that work well for one task to another task. Additionally, these methods require labeled examples, while ours does not.

Other selection protocols not based on gradient descent can include cross-validation or minimum description length, as in Perez et al. (2021). These methods yield prompts that perform marginally better than average in terms of test accuracy.

Mutual information has been used in n-gram clustering, part-of-speech tagging, probing classifiers, and LM training objective reframing (Brown et al., 1992; Stratos, 2019; Voita and Titov, 2020; Kong et al., 2019). Ours is the first work of which we are aware to apply MI to prompt engineering. Lu et al. (2021) make use of entropy statistics to determine performant orderings for few-shot examples in prompts. Our work is focused on selecting high-quality templates with no special focus on example ordering or need for multiple examples to order (the few-shot case). Our method uses no artificial "probing set," making our prompt selection much cheaper, and we also explore open-ended tasks. While the GlobalE and LocalE statistics they use are similar (and in the case of LocalE identical) to the two parts of our MI calculation (see 3.2), we use the two statistics jointly and choose prompts by minimizing, rather than maximizing, LocalE.

3 Methods

At the most abstract, our method is as follows (see Appendix A for a more thorough description):

1. Generate a set of K prompt templatizing functions.
2. Playground a couple of examples to ensure that templates give roughly expected output.
3. Estimate mutual information for each template given a set of inputs x_1, x_2, ..., x_N where x_i ∼ X, ∀i.
4. Choose template(s) based on mutual information and perform inference.

[Figure 2: We choose θ ∈ {θ_i} (i = 1, ..., K) and templatize a sampled instance from the dataset X. We pass this prompt through the language model via g_φ, yielding a probability distribution over the model's tokens T_φ. The collapsing function c_θ sums the weight given to each token corresponding to each possible answer y ∈ Y and normalizes, giving a probability distribution P(Y | x_i), which we can use to estimate mutual information or obtain a guess for y_i.]

We find it useful to unify all the tasks we study within a single framework, which we describe in Section 3.1. We also justify our use of mutual information as a surrogate for prompt quality and specify how we estimate it in Section 3.2.

3.1 Task Definition

In order to demonstrate our method's widespread applicability and general effectiveness, we validate it across many datasets and tasks. This requires us to estimate MI and accuracy, and this is most straightforward in the case where, given a context, a language model produces just one probability distribution P(t_n | context = t_1, t_2, ..., t_{n-1}). This is in contrast to other experimental setups that use multi-token sampling methods (e.g., beam search), although our method is easily tractable in such setups.² Any NLP task is tractable in this framework so long as the output space consists of a set of options that each start with a unique token. In this case, the language model can "give" an answer by assigning probability to tokens that begin giving each of these answers (invariant to lexical variation like capitalization and leading/trailing spaces). While, for open-ended tasks, this method might artificially inflate accuracy if the model starts to give a wrong answer that happens to start with the same token as the correct one, we find that this difference is small and does not affect our results.³ Irrelevant tokens (with which none of the desired answers begin) are ignored, and the resulting collapsed probabilities are normalized. We term this approach One-token Response (OTR). Although our method isn't limited to OTR tasks, we choose tasks that can be cast as OTR tasks for simplicity and to reduce computational expense. Many NLP tasks fit within this framework, although a few do not (e.g., machine translation and summarization). This basic approach is in common use (Brown et al., 2020), but we formalize it for clarity below.

² The only difference: for each considered answer, simply calculate its unnormalized probability by multiplying the probabilities of the decisions taken at each branch in the sequence of tokens, then normalize the resulting probability scores.

³ Our open-ended datasets are SQuAD, LAMBADA, and ROCStories, and none of these seemed more likely than ROCStories to exhibit this issue. We reran our experiment on ROCStories by sampling with temperature 0 until reaching a space, and only counted responses as accurate if they exactly matched the corresponding ground truth labels. Results were virtually unchanged: accuracy decreased by only 0.03 on average, and the correlation between mutual information and test accuracy increased by 0.04, from 0.68 to 0.72.

Generally, the OTR framework casts a natural language task as a classification problem with raw data input x_i ∈ X and output P(Y | x_i), a probability distribution over targets. In order to use a language model φ for this task, a templatizing function f_θ : X → L is needed to map raw data into natural language prompts. g_φ : L → T_φ maps prompts to a probability distribution over T_φ, the token set represented by the model tokenizer. Finally, a collapsing function c_θ : T_φ → P(Y | x, θ, φ) (see Appendix A) yields an estimate of P(Y | X):

    P(Y | x, θ, φ) = c_θ(g_φ(f_θ(x))), x ∈ X    (1)

We also refer to P(Y | x, θ, φ) as P(Y | f_θ(x)). The above pipeline can be specified in many ways using different θ and φ (see Figure 2), which will result in different accuracies. Our ultimate aim is to select the best θ given φ. Whereas past prompt engineering methods rely on scores calculated by comparing model answers and ground truth, our method selects θ by maximizing mutual information, which requires no ground truth labels.

3.2 Mutual Information

Mutual information is a measure of the amount of shared information between two random variables (Cover and Thomas, 2006); in other words, it is the reduction in entropy that is observed in one random variable when the other random variable is known.

We expect MI to serve as a good criterion for comparing prompts. Previous work has shown that large networks trained with cross-entropy loss are calibrated (e.g., a 60% confidence corresponds to a 60% chance of the model being correct) when in the early-stopped (∼1 epoch) regime (Ji et al., 2021), but become miscalibrated in the overfit regime (Nakkiran and Bansal, 2020). According to Brown et al. (2020), GPT-3 was trained for a different number of epochs on each corpus in its training data. We calculate it was trained for an average of 1.57 epochs, so we have reason to believe that GPT-3 is generally well-calibrated. Thus, we postulate that a prompt that elicits a very confident response (high MI) from the language model is more likely than a less confident prompt to score well.

We denote the mutual information between random variables X and Y as I(X; Y) and the entropy of X as H(X) = −∫_{x∈X} P(x) log(P(x)) dx. The mutual information between X and Y is defined as D_KL(P_{(X,Y)} || P_X ⊗ P_Y), and can be rewritten as H(Y) − H(Y|X) (the reduction in entropy in Y given knowledge of X).

Using the OTR framework, we fix a model φ and generate a diverse set of K prompt templatizing functions f_{θ_1}, f_{θ_2}, ..., f_{θ_K} along with their corresponding collapsing functions c_{θ_k} (see Appendix A). Treating f_θ(X) := {f_θ(x), x ∈ X} as a random variable, we can calculate I(f_θ(X); Y) and use it as a criterion for selecting prompt templatizing functions with which to do inference.

We hypothesize that a θ_i with higher mutual information will align a language model to a task better than a θ_j with lower mutual information. Formally, we select θ̂ = argmax_θ {I(f_θ(X); Y)}. Mutual information is estimated as:

    I(f_θ(X); Y) = H(Y) − H(Y | f_θ(X))    (2)

where each term is estimated in expectation using draws x_i ∼ X and Equation 1 as follows:

    H(Y) ≈ H( (1/N) Σ_{i=1}^{N} P(Y | f_θ(x_i)) )    (3)

    H(Y | f_θ(X)) ≈ (1/N) Σ_{i=1}^{N} H( P(Y | f_θ(x_i)) )    (4)

The marginal entropy H(Y) is the entropy of the mean of the conditional distributions, and the conditional entropy H(Y | f_θ(X)) is the mean of the entropies of the individual conditional distributions. This definition gives us another reason to expect that mutual information will work well. Since mutual information is the marginal entropy minus the conditional entropy, maximizing mutual information is equivalent to maximizing marginal entropy and minimizing conditional entropy. Thus, MI is high for templates that are, on average, less biased towards any given answer (high marginal entropy) and templates with outputs the model is confident about (low conditional entropy). These attributes are desirable in constructing prompts, and we postulate that maximizing mutual information will yield a well-aligned template.

Looking at it another way, by the data processing inequality (Cover and Thomas, 2006), I(f_θ(X); Y) ≤ I(X; Y). Thus, I(f_θ(X); Y) gives a lower bound for I(X; Y), and the highest mutual information is the tightest lower bound. The prompt corresponding to this lower bound preserves the most information between X and Y.
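To make the estimator concrete, the following is a minimal sketch of Equations 2–4 in Python. It assumes the collapsed distributions P(Y | f_θ(x_i)) from Equation 1 have already been computed for each template; the array layout and function names are illustrative, not taken from the released code.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def estimate_mi(cond_dists):
    """Estimate I(f_theta(X); Y) (Eq. 2) from an (N, |Y|) array whose
    rows are the collapsed distributions P(Y | f_theta(x_i)) of Eq. 1."""
    marginal = entropy(cond_dists.mean(axis=0))               # H(Y), Eq. 3
    conditional = np.mean([entropy(r) for r in cond_dists])   # H(Y | f(X)), Eq. 4
    return marginal - conditional

def select_template(dists_per_template):
    """Pick the index of the template with the highest estimated MI,
    given one (N, |Y|) array of collapsed distributions per template."""
    return max(range(len(dists_per_template)),
               key=lambda k: estimate_mi(dists_per_template[k]))
```

Maximizing the returned score rewards templates whose mean output distribution is spread out (high marginal entropy) while each individual output is confident (low conditional entropy), exactly the two terms discussed above.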
[Figure 3 ("Distributions over Template Accuracies"): Distributions of accuracies over K = 20 templates for each model/dataset pair, compared to the prompts selected with MI (translucent red dots).]

4 Experimental Setup

4.1 Datasets

We validate the efficacy of our prompt engineering method with experiments on eight well-known NLP datasets⁴ – SQuAD2.0 (Rajpurkar et al., 2018), LAMBADA (Paperno et al., 2016), ROCStories (Mostafazadeh et al., 2016), CommonsenseQA (CoQA) (Talmor et al., 2018), IMDB (Maas et al., 2011), BoolQ (Clark et al., 2019), COPA (Gordon et al., 2012), and WiC (Pilehvar and Camacho-Collados, 2018) – that span seven unique NLP tasks (see Table 1). We used a random sample of N = 500 samples from each dataset for our experiments.⁵ For ROCStories, which consists of a set of five sentence stories, we randomly masked a word from each story in order to use the data for masked word prediction (cloze).

⁴ Datasets are listed in descending order here and throughout the paper, first by |Y|, and then by method performance.

⁵ We sampled from the train sets of CoQA and SQuAD; the train and validation sets of WiC, COPA, and BoolQ; the full datasets of ROCStories and IMDB; and the test set for LAMBADA.

We made minor changes to two of the datasets in order to cast the associated tasks into OTR. For the SQuAD dataset, we dropped all questions that did not have a one-word answer. For the CoQA dataset we dropped all questions with answer choices that started with a shared first word (e.g., the dog, the cat, the monkey). Both changes were to decrease ambiguity about which option the model was choosing given its output distribution for a single token.

    Dataset     Task                              |Y|     Base Acc.  N_all
    SQuAD       Open Book QA                      |T_φ|   ~0         16K
    LAMBADA     Cloze                             |T_φ|   ~0         5K
    ROCStories  Cloze                             |T_φ|   ~0         52K
    CoQA        Closed Book QA                    5       0.2        9K
    IMDB        Sentiment Analysis                2       0.5        50K
    BoolQ       Reading Comprehension             2       0.5        16K
    COPA        Choice of Plausible Alternatives  2       0.5        1K
    WiC         Word in Context                   2       0.5        5K

Table 1: All datasets used in our experiments. |Y| is the size of the label space and N_all is the size of the dataset we sample from (after any modifications).

4.2 Models

We assess our method on eight models ranging from 124 million to 175 billion parameters: these include GPT-2 124M & 1.5B (Radford et al., 2019), GPT-Neo 2.7B (Black et al., 2021), GPT-J (6B) (Wang and Komatsuzaki, 2021), and (Ada, Babbage, Curie, & Davinci) GPT-3 (Brown et al., 2020). We assume (per Perez et al. (2021)) these models to correspond, respectively, to the 2.7B, 6.7B, 13B, and 175B models in Brown et al. (2020). Each is a causal language model, and although we do not include masked language models, this is a promising area for future work.

5 Results

In this section, we analyze our experiments. First, we look at our method's ability to select high-accuracy prompts across models and datasets (Section 5.1). Next, we correlate template mutual information and accuracy in Section 5.2. After that, we compare our method and template selection using labeled examples in Section 5.3. In Section 5.4, we explore the robustness of MI and use ensembling to improve it. Finally, we compare the transferability of prompt templates selected with MI from model to model in Section 5.5.
[Figure 4 ("Correlation between MI and Accuracy"; heatmap of Pearson correlations for each dataset/model pair): Correlations are more consistently high across all tasks for the largest models, suggesting that our method is most useful at those model sizes.]

[Figure 5 ("Mutual Information vs. Accuracy with GPT-3 175B"): Each dot represents a template and its average mutual information and accuracy over N = 500 task instances. Linear best fit (by mean standard error) lines are included to show overall trends.]

5.1 Template Selection Performance

We first define baselines against which we compare our approach. Other prompt engineering methods generally require either access to model weights, labeled data (validation set selection), or both (backprop/continuous prompt embedding methods). Our method does not require these, so we instead compare to random and oracle baselines. A random template selection method would give us the average accuracy of our template set (in expectation), while an oracle selection method would give us the best accuracy every time. To understand how our MI method compares to these two baselines for each dataset, refer to Figure 1, where we analyze performance on GPT-3 175B. On each of the eight datasets, mutual information selects a prompt template that outperforms both the mean and median accuracies (random baseline performance). In three of the eight datasets, mutual information selects the best (highest accuracy) template from the 20 proposed (equivalent to oracle performance).

Given our method's promising performance with GPT-3 175B, it is natural to ask how it performs with smaller models. Figure 3 shows the accuracy distributions over prompt templates for each dataset/model pair. With every model, MI gives above-average performance on several datasets. Although MI is more likely to select a high accuracy template for larger models, it is a good criterion even for smaller models on all but two datasets, COPA and WiC. Note that, for these two datasets, none of the templates do significantly better than chance (∼50%) besides the largest model on COPA, which is in line with previous work.⁶ Thus, we observe that mutual information performs best when there is a high-signal prompt to select, and worse when all prompts are low-signal.

When considering all other datasets, MI selects an above average prompt 83% of the time for all models; for the largest two models, MI selects an above average template 100% of the time.

⁶ Our template's best accuracy is 54% for WiC, and 78.2% for COPA, which is similar to previous work (WiC: Brown et al. (2020) - 49.4%, Perez et al. (2021) - 54.1%; COPA: Brown et al. (2020) - 92.0%, Perez et al. (2021) - 84.8%).
[Figure 6 ("Few-Shot Accuracy Selection vs. Mutual Information Selection on GPT-3 175B"): For P = 100 random train/test set partitions for each training set size N = 2, 4, 8, ..., 256, we select a template based on accuracy (N-shot Acc) and based on mutual information estimated from just those N examples (N-shot MI). Then, we report accuracy of that template on the test set (size: 500 − N). Error bars (±σ) are reported across the P = 100 partitions. For reference, the highest, average, and full-dataset MI template accuracy is also reported.]

5.2 Correlation between Template Mutual Information and Accuracy

In Section 5.1, we see how the mutual information selected template does in terms of accuracy compared to all other templates. We have not discussed, however, how generally MI and accuracy are correlated, except that the highest MI template tends to have anomalously high accuracy. Here, we establish that their correlation is high across all templates for the largest LMs. Each of the K = 20 templates has two corresponding measures: average accuracy and average MI. We can use these pairs to correlate MI and accuracy via Pearson's R.

We see in Figure 4 that the correlations are surprisingly high for the majority of models and datasets. For SQuAD, LAMBADA, ROCStories, and CoQA, this pattern holds across all model sizes; for the remainder, results are good on larger models and are much less reliable on smaller models. Overall, this is evidence that as mutual information increases, so does accuracy. In other words, mutual information can be used to make an educated guess about accuracy without having to use any ground truth labels, especially on larger models.

5.3 Compared to Few Labeled Examples

Next, we ask: How does our method compare to selecting a template based on the accuracy of a few labeled examples? Also, how many unlabeled examples does MI need to be able to perform well?

Results with the largest model are reported in Figure 6. Note that with as few as N = 2 instances, MI selects a far better than average template, allowing performance gains even in the low-data, unlabeled regime. Additionally, for low N and across all eight datasets, MI even selects a better template on average than selecting based on labeled train set accuracy. This suggests that, even with labeled examples, selecting based off of MI may be preferable to test accuracy with few examples. Selecting by labeled train set accuracy often begins to perform better at higher N, but at the cost of requiring labeled data, while our method needs no labels.

5.4 Method Robustness and Ensembling

To explore our method's robustness we consider the question: what if we had included a different subset of templates, especially not including the top MI template? Figure 5 shows average MI/accuracy data for all K = 20 prompt templates on GPT-3 175B (similar plots for other models are found in Appendix B.1). For six of eight datasets, the results are robust; the top few prompt templates (by MI) are all high performers. The performance for COPA and WiC is more brittle; excluding the top MI template would have resulted in a large drop in accuracy.
[Figure 7 ("Accuracy by Ensembling Protocol on GPT-3 175B"): For each dataset, the KDE plot represents accuracy over each of the (20 choose 5) ensembles of 5 templates from the 20 templates associated with the dataset. Each plot also includes lines representing the average accuracy of all single templates for the dataset, the accuracy of the ensemble of all 20 templates, and the accuracy of the ensemble of the top 5 templates chosen by MI. In only one case does all-20 beat top-5-MI, and it does so at 4× the cost.]

[Figure 8 ("Transferability (Averaged over Datasets)"; mutual information and test accuracy heatmaps over selection model × inference model pairs): For each model/dataset pair, accuracies are normalized linearly so that 0 is the average prompt accuracy and 1 is the highest test accuracy. Using the prompt chosen by either MI or test accuracy on each selection model, average performance across datasets is reported for each inference model.]

This attests to the utility of generating a diverse slate of templates as recommended in Appendix A and also to the risk that outliers could compromise our method's effectiveness.

A comprehensive discussion of remedies for outliers is beyond the scope of this paper, but it is an important concern. Considering the strength of MI/accuracy correlations, one simple approach is to ensemble the top 5 MI templates.

To compare this principled top-5 ensemble to other possible ensembles of templates, we take all (20 choose 5) subsets of 5 templates from all 20 templates and calculate the accuracy of each ensemble. For each dataset, we plot this distribution's kernel density estimate, which models the p.d.f. of the random variable "accuracy of 5 random templates ensembled together". We then compare the top-5 MI ensemble to other possible ensembles. The results are shown in Figure 7.

We found that the top-5 MI ensemble does at least as well as the top-20 ensemble in all but one case. Two reasons to use MI are, then, that 1) the MI ensemble gets as good or better a result as ensembling all prompt templates and 2) at a fourth of the experimental cost. In short, ensembling by MI is a cheap and effective way to guard against anomalous high MI/low accuracy templates.

5.5 Transferability across Models

Finally, we explore how well-chosen templates generalize between models. Concretely, we choose templates by maximizing either test accuracy (oracle) or mutual information (our method) using a selection model φ_s, and then calculate test accuracy using a different inference model φ_i. We calculate absolute test accuracy and then normalize it such that 0 and 100 correspond to the average and maximum scores across templates for a model/dataset pair. We average our results across datasets and present the results in Figure 8. Prompt transfer for each dataset can be found in Appendix B.2.

MI performance is best when the largest model (GPT-3 175B) is used as both the selection and inference model: on average, MI scores 90% on this normalized scale. Additionally, performance is most consistently high when the largest models are used either for selection or inference. But almost all transfer scores are well above 0 (only one negative average gain out of 64 transfer permutations), suggesting that transfer is often effective.

Overall, we have observed that prompt selection by mutual information is surprisingly effective across a variety of datasets and model sizes. This method works best on larger models and for tasks that the LM is capable of performing. Given the high diversity of tasks that we have explored, we expect this method to transfer well to many other NLP tasks, including regimes with little labeled data.
data.

6 Conclusion

In this paper, we introduce a method for selecting prompts that effectively align language models to NLP tasks. Over a set of candidate prompts, our method selects the template that maximizes the mutual information between the input and the model output. We demonstrate that 1) mutual information is highly correlated with test accuracy and 2) selecting a prompt based on mutual information leads to significant accuracy gains over random choice, approaching oracle performance on GPT-3 175B, and it does so across model sizes and tasks.

Whereas other methods rely on ground truth labels and/or direct model access, ours requires neither. Many applications characterized by lack of computational resources, limited model access (e.g., inference only), and lack of ground truth data prohibiting testing of candidate prompts become feasible with our method.

7 Ethics

There are many ways to prompt a language model poorly, and there still seem to be NLP tasks which are beyond alignment regardless of model size or prompt quality. This method cannot align a LM to a task if the entire set of prompts is poor or, obviously, if the model cannot be aligned. High mutual information does not necessarily imply high accuracy despite the strong correlation we found. Thus, our method should only be employed on a task if there is some understanding of how high MI needs to be on a domain or set of templates to imply a sufficiently high accuracy for safe use.

Otherwise, we introduce no model, dataset, or other contribution that might warrant ethical concern.

Acknowledgements

We thank the anonymous reviewers for their helpful feedback. This material is based upon work supported by the National Science Foundation under Grant No. RI 2141680.
References

Asaf Amrami and Yoav Goldberg. 2018. Word Sense Induction with Neural biLM and Symmetric Patterns. pages 4860–4867.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. Inducing Relational Knowledge from BERT.

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044.

Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory, 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1:4171–4186.

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.

Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. PPT: Pre-trained Prompt Tuning for Few-shot Learning.

Ziwei Ji, Justin D. Li, and Matus Telgarsky. 2021. Early-stopped neural networks are consistent.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning.

Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. pages 4582–4597.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1:1073–1094.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. pages 1–46.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT Understands, Too.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696.

Preetum Nakkiran and Yamini Bansal. 2020. Distributional generalization: A new kind of generalization. CoRR, abs/2009.08092.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, abs/1606.06031.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True Few-Shot Learning with Language Models. pages 1–21.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 2463–2473.

Mohammad Taher Pilehvar and José Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. pages 1–53.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm.

Karl Stratos. 2019. Mutual information maximization for simple and accurate part-of-speech induction.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. CoRR, abs/1811.00937.

Elena Voita and Ivan Titov. 2020. Information-Theoretic Probing with Minimum Description Length. pages 183–196.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. github.com/kingoflolz/mesh-transformer-jax.

Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. 2021. Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. pages 1–18.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-Shot Performance of Language Models.

Yukun Zuo, Quan Fang, Shengsheng Qian, Xiaorui Zhang, and Changsheng Xu. 2018. Representation Learning of Knowledge Graphs with Entity Attributes and Multimedia Descriptions. 2018 IEEE 4th International Conference on Multimedia Big Data, BigMM 2018, pages 2659–2665.
A Prompt Engineering Process

In this section, we step through our method in detail. Again, note that this method uses no ground truth labels and does not require gradient updates or model parameter access. Given a task that can be represented in natural language with the OTR framework, the only requirements for our approach are a) several candidate prompt templates and b) some instances (X) on which to do inference.

1. Generate a set of K prompt templatizing functions with corresponding collapsing functions. Each prompt template function f_θk should take in an input from the dataset and output a prompt ready for processing by the language model. We chose to generate our template functions by hand, i.e., a human writes a sensible, custom natural language scaffolding that can be filled with input data (see examples in Appendix C).

Each template must also have a collapsing function c_θk that takes the language model output logprobs, exponentiates and sums "equivalent" logprobs, and normalizes the resulting probabilities to produce a distribution over targets. Equivalent logprobs are those that indicate the same answer. For example, a template might be designed for a question-answering task with possible answers "Yes" and "No". We consider all logits corresponding to possible lexical variants of each of these answers to be equivalent. For example, what logits should count toward the answer "Yes"? Not just the exact token "Yes", since "ye", " yes", and "YES" are all lexical variants of the same answer or the beginning of it, just with surrounding white space and alternative capitalization. The collapsing function lower-cases and strips white space from all logits, and if the lower-cased answer begins with a token, that token's probability (the exponentiated logprob) is added to the sum of probability for that answer. Finally, the sums of probabilities for all individual answers are normalized. Prompt template functions should be chosen to be as diverse as possible to increase the probability of finding high-quality prompts. For example, we use templates that frame input from datasets as test questions, back and forth dialogue between friends, Python code, test answer banks, etc. A sample of the prompt templates used in this work is provided in Appendix C. A good resource for coming up with prompt template function ideas is the OpenAI API examples collection.⁷ While we aimed for as diverse a set of prompts as possible in this work, additional dimensions of variation in prompt templates could be explored in future work (e.g., ordering of few-shot examples).

⁷ beta.openai.com/examples

2. Playground. For each chosen f_θk, calculate g_φ(f_θk(x)) for a few dataset samples. Do not look at associated ground truth labels for these samples. Simply check to ensure that g_φ puts high probability on the tokens one would expect given f_θk that could be reasonably collapsed by c_θk into P(Y). For example, on the BoolQ reading comprehension task, the language model predicts the answer to a yes/no question with a corresponding passage. Given this task, we would expect the highest probability to be on tokens like "Yes" or "No". A poor prompt template, though, might put the highest probability on unrelated tokens like "I", "think", or "\n". Revise or replace any template that fails to put high probability mass on the tokens expected.

3. Estimate mutual information for each template f_θk. Choose how many data points N to use for estimating mutual information for each template function. A higher N will allow for estimation of mutual information based on a more representative sample of the dataset at the cost of more LM computation. Sample N samples from your dataset. Since we do not require any Y labels, one could even choose the X's on which you desire to do inference (as we do). Then, for each sample x and each template f_θk, calculate P(Y | f_θ(x)) using Equation 1. Use the output to estimate MI for each prompt template with Equation 2.

For all of our experiments, c_θ takes in a distribution of tokens g_φ(f_θk(x)) and a mapping between the set of possible ground truth labels for f_θk(x) and model vocabulary T_φ. For a sentiment analysis task, that mapping would be from the ground truth labels "positive" and "negative" to the expected tokens "positive" and "negative" respectively. If a prompt template for the task was phrased as a yes/no question, the mapping for it would be from "positive" and "negative" to "yes" and "no" respectively. Our c function returns a probability over Y (target label space), and the highest probability label is treated as the prediction. To keep things simple, the values in our map are always single tokens. See examples in Appendix C.

4. Choose prompt template(s) to use for inference based on mutual information. For choosing a single prompt template to use for inference, select the template with highest estimated mutual information. With an increased computational budget, one could also ensemble the top p prompt templates, as we describe in Section 5.4.

5. Use chosen prompt template(s) to perform inference. Use chosen prompt template(s) f_θ̂ to calculate c_θ̂(g_φ(f_θ̂(x))) for each dataset sample. Inference can be done with the language model used for estimating mutual information or a smaller model if cost is prohibitive (for information on performance statistics with this approach, see Figure 8).
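A minimal sketch of such a collapsing function c in Python, assuming the model's next-token distribution arrives as a token-to-logprob mapping (as returned by, e.g., the OpenAI API's logprobs field); the function name, input format, and uniform fallback for degenerate cases are illustrative choices, not the exact released implementation:

```python
import math

def collapse(token_logprobs, answer_map):
    """Collapse a next-token logprob distribution into P(Y), per step 1.

    token_logprobs: {token: logprob} for the model's next-token distribution.
    answer_map: {label: expected_token}, e.g. {"positive": "yes",
        "negative": "no"} for a template phrased as a yes/no question.
    """
    mass = {label: 0.0 for label in answer_map}
    for token, logprob in token_logprobs.items():
        stripped = token.strip().lower()
        if not stripped:
            continue
        for label, expected in answer_map.items():
            # A token counts toward an answer if the lower-cased answer
            # begins with the lower-cased, whitespace-stripped token.
            if expected.strip().lower().startswith(stripped):
                mass[label] += math.exp(logprob)  # exponentiate and sum
    total = sum(mass.values())
    if total == 0.0:  # no relevant tokens received probability mass
        return {label: 1.0 / len(mass) for label in mass}
    return {label: m / total for label, m in mass.items()}  # normalize
```

The highest-probability label in the returned dictionary is then treated as the model's prediction, as described above.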
B Additional Figures
B.1 Mutual Information vs. Accuracy
See Figure 9.

B.2 Per Dataset Transfer Heatmaps


See Figures 10-17.
[Figure 9 ("Mutual Information vs. Accuracy for each Dataset and Model"): Mutual information plotted against accuracy per prompt for each dataset and model, with linear best fit (by MSE) lines to show overall trends.]
Transferability for SQuAD
Mutual Information Test Accuracy
1.00 1.00
GPT-3: 175B 0.62 0.55 -0.01 0.46 0.41 -0.35 0.11 -0.13 1 1 1 1 1 1 0.99 1
0.75 0.75
GPT-3: 13B 0.88 0.78 0.87 0.71 0.89 0.85 1 0.67 1 1 1 1 1 1 0.99 1
0.50 0.50
GPT-3: 6.7B 1 1 1 1 1 1 0.99 1 1 1 1 1 1 1 0.99 1
Selection Model

0.25 0.25
GPT-J: 6B 1 1 1 1 1 1 0.99 1 1 1 1 1 1 1 0.99 1
0.00 0.00
GPT-Neo: 2.7B 0.62 0.55 -0.01 0.46 0.41 -0.35 0.11 -0.13 1 1 1 1 1 1 0.99 1
0.25 0.25
GPT-3: 2.7B 0.62 0.55 -0.01 0.46 0.41 -0.35 0.11 -0.13 1 1 1 1 1 1 0.99 1
0.50 0.50
GPT-2: 1.5B 0.62 0.55 -0.01 0.46 0.41 -0.35 0.11 -0.13 0.88 0.78 0.87 0.71 0.89 0.85 1 0.67
0.75 0.75
GPT-2: 124M 0.62 0.55 -0.01 0.46 0.41 -0.35 0.11 -0.13 1 1 1 1 1 1 0.99 1
1.00 1.00
B B B B B B B M B B B B B B B M
: 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124 : 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124
3
T- PT T- GP T-Ne PT- PT- PT-2 3
T- PT T- GP T-Ne PT- PT- PT-2
GP G GP GP
G G G GP G GP GP
G G G
Inference Model Inference Model

Figure 10: Prompt transfer performance for SQuAD

Transferability for LAMBADA


Mutual Information Test Accuracy
1.00 1.00
GPT-3: 175B 0.95 0.77 0.65 0.62 0.33 0.6 0.17 0.02 1 0.43 0.28 0.33 -0.1 0.13 0.08 0.42
0.75 0.75
GPT-3: 13B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
0.50 0.50
GPT-3: 6.7B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
Selection Model

0.25 0.25
GPT-J: 6B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
0.00 0.00
GPT-Neo: 2.7B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
0.25 0.25
GPT-3: 2.7B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
0.50 0.50
GPT-2: 1.5B 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
0.75 0.75
GPT-2: 124M 0.82 1 1 1 1 1 1 1 0.82 1 1 1 1 1 1 1
1.00 1.00
75B 13B .7B 6B .7B .7B .5B 4M 75B 13B .7B 6B .7B .7B .5B 4M
- 3: 1 T-3: -3: 6 PT-J: eo: 2 -3: 2 -2: 1 2: 12 - 3 : 1 T-3: -3: 6 PT-J: eo: 2 -3: 2 -2: 1 2: 12
T P T G -N PT PT T- T P T G -N PT PT T-
GP G GP GP
T G G GP GP G GP GP
T G G GP
Inference Model Inference Model

Figure 11: Prompt transfer performance for LAMBADA


Transferability for ROCStories
Mutual Information Test Accuracy
1.00 1.00
GPT-3: 175B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 1 1 0.98 0.99 0.67 0.7 0.71 0.24
0.75 0.75
GPT-3: 13B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 1 1 0.98 0.99 0.67 0.7 0.71 0.24
0.50 0.50
GPT-3: 6.7B 0.24 -0.03-0.22-0.27-0.22-0.26-0.23-0.16 0.61 0.88 1 0.99 1 1 1 1
Selection Model

0.25 0.25
GPT-J: 6B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27
0.00 0.00
GPT-Neo: 2.7B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 0.61 0.88 1 0.99 1 1 1 1
0.25 0.25
GPT-3: 2.7B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 0.61 0.88 1 0.99 1 1 1 1
0.50 0.50
GPT-2: 1.5B 0.95 0.97 0.9 1 0.54 0.58 0.63 0.27 0.61 0.88 1 0.99 1 1 1 1
0.75 0.75
GPT-2: 124M 0.61 0.88 1 0.99 1 1 1 1 0.61 0.88 1 0.99 1 1 1 1
1.00 1.00
B B B B B B B M B B B B B B B M
: 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124 : 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124
3
T- PT T- GP T-Ne PT- PT- PT-2 3
T- PT T- GP T-Ne PT- PT- PT-2
GP G GP GP
G G G GP G GP GP
G G G
Inference Model Inference Model

Figure 12: Prompt transfer performance for ROCStories

Transferability for CoQA


Mutual Information Test Accuracy
1.00 1.00
GPT-3: 175B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.75 0.75
GPT-3: 13B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.50 0.50
GPT-3: 6.7B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Selection Model

0.25 0.25
GPT-J: 6B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.00 0.00
GPT-Neo: 2.7B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.25 0.25
GPT-3: 2.7B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.50 0.50
GPT-2: 1.5B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.75 0.75
GPT-2: 124M -0.55-0.15-0.04-0.04-0.04-0.15-0.12-0.51 1 1 1 1 1 1 1 1
1.00 1.00
75B 13B .7B 6B .7B .7B .5B 4M 75B 13B .7B 6B .7B .7B .5B 4M
- 3: 1 T-3: -3: 6 PT-J: eo: 2 -3: 2 -2: 1 2: 12 - 3 : 1 T-3: -3: 6 PT-J: eo: 2 -3: 2 -2: 1 2: 12
T P T G -N PT PT T- T P T G -N PT PT T-
GP G GP GP
T G G GP GP G GP GP
T G G GP
Inference Model Inference Model

Figure 13: Prompt transfer performance for CoQA


Transferability for IMDB
Mutual Information Test Accuracy
1.00 1.00
GPT-3: 175B 0.86 1 0.93 0.77 1 0.08 1 0.05 1 0.91 0.51 0.77 0.56 1 -0.2 -0.03
0.75 0.75
GPT-3: 13B 0.86 1 0.93 0.77 1 0.08 1 0.05 0.86 1 0.93 0.77 1 0.08 1 0.05
0.50 0.50
GPT-3: 6.7B 0.86 1 0.93 0.77 1 0.08 1 0.05 0.15 0.84 1 1 0.94 0.13 0.81 0.39
Selection Model

0.25 0.25
GPT-J: 6B 0.5 0.65 -0.15 0.57 0.45 -0.21-0.14-0.23 0.15 0.84 1 1 0.94 0.13 0.81 0.39
0.00 0.00
GPT-Neo: 2.7B 0.5 0.65 -0.15 0.57 0.45 -0.21-0.14-0.23 0.86 1 0.93 0.77 1 0.08 1 0.05
0.25 0.25
GPT-3: 2.7B 0.29 -0.37-0.32-0.19 0.21 0.2 -0.13 0.06 1 0.91 0.51 0.77 0.56 1 -0.2 -0.03
0.50 0.50
GPT-2: 1.5B 0.5 0.65 -0.15 0.57 0.45 -0.21-0.14-0.23 0.86 1 0.93 0.77 1 0.08 1 0.05
0.75 0.75
GPT-2: 124M 0.29 -0.37-0.32-0.19 0.21 0.2 -0.13 0.06 0.08 -1.5 0.77 -0.31-0.07 0.07 -0.37 1
1.00 1.00
B B B B B B B M B B B B B B B M
: 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124 : 175 -3: 13 3: 6.7 T-J: 6 o: 2.7 3: 2.7 2: 1.5 : 124
3
T- PT T- GP T-Ne PT- PT- PT-2 3
T- PT T- GP T-Ne PT- PT- PT-2
GP G GP GP
G G G GP G GP GP
G G G
Inference Model Inference Model

Figure 14: Prompt transfer performance for IMDB

Transferability for BoolQ
[Heatmap figure: "Mutual Information" (left) and "Test Accuracy" (right), Selection Model on the rows, Inference Model on the columns.]

Figure 15: Prompt transfer performance for BoolQ
Transferability for COPA
[Heatmap figure: "Mutual Information" (left) and "Test Accuracy" (right), Selection Model on the rows, Inference Model on the columns.]

Figure 16: Prompt transfer performance for COPA
Transferability for WiC
[Heatmap figure: "Mutual Information" (left) and "Test Accuracy" (right), Selection Model on the rows, Inference Model on the columns.]

Figure 17: Prompt transfer performance for WiC
C Template Examples

The following are example templates fθ provided for each dataset. We include all used templates, ordered by accuracy. In blue, we highlight the data that is filled in from X; in red, we highlight the area where we ask the model to predict the next token; everything that is not highlighted is static from instance to instance. We also include the token sets used in the collapsing functions.
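To make this concrete, the sketch below shows one plausible reading of these pieces; it is illustrative only, not the code used for the paper. A template fθ maps instance data to a prompt string, and a collapsing function pools the model's next-token probability mass over the token set assigned to each answer option.

import math

def f_theta(passage: str, question: str) -> str:
    """A SQuAD-style template: instance data from X fills the CONTEXT and
    QUESTIONS slots; the model predicts the tokens after the opening quote."""
    return ('TASK: Answer the questions below using the phrasing from the context.\n\n'
            f'CONTEXT:\n{passage}\n\n'
            f'QUESTIONS:\n1) {question}\nAnswer: "')

def collapse(next_token_logprobs: dict, token_sets: dict) -> dict:
    """An assumed collapsing function: sums next-token probability over each
    option's token set, e.g. token_sets = {'A': ['wrong'], 'B': ['pleasure'], ...}.
    Tokens the model did not score contribute zero mass."""
    return {option: sum(math.exp(next_token_logprobs.get(tok, -math.inf))
                        for tok in tokens)
            for option, tokens in token_sets.items()}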
C.1 SQuAD

Prompt 1 (MI: 4.950, Acc: 0.820):
TASK: Answer the questions below using the phrasing from the context.

CONTEXT:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

QUESTIONS:
1) In 2000, how many families lived in Richmond?
Answer: "43,627"
2) What percentage of the Richmond population of 2000 was Pacific Islander?
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 2 (MI: 4.965, Acc: 0.800):
Given the following passages and questions, provide a brief, correct answer from the text.

"BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.", "What high school GPA for BYU freshmen have on average?" -> "3.71"
"BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.", "What high school GPA for BYU freshmen have on average?" -> "3.71"
"In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleed, snow, graupel, and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called "showers".", "What causes precipitation to fall?" -> "gravity"
"As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.", "What percentage of the Richmond population of 2000 was Pacific Islander?" -> "

Collapsing token sets: None, all tokens are considered

Prompt 3 (MI: 4.965, Acc: 0.800):
Given the following passages and questions, provide a brief, correct answer from the text.

"BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.", "What high school GPA for BYU freshmen have on average?" -> "3.71"
"BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.", "What high school GPA for BYU freshmen have on average?" -> "3.71"
"In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleed, snow, graupel, and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called "showers".", "What causes precipitation to fall?" -> "gravity"
"As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.", "What percentage of the Richmond population of 2000 was Pacific Islander?" -> "

Collapsing token sets: None, all tokens are considered
Prompt 4 (MI: 4.901, Acc: 0.790):
TASK: Answer the questions below using the phrasing from the context.

CONTEXT:
BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.
QUESTIONS:
1) What high school GPA for BYU freshmen have on average?
Answer: "3.71"

CONTEXT:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
QUESTIONS:
1) What percentage of the Richmond population of 2000 was Pacific Islander?
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 5 (MI: 4.711, Acc: 0.758):
P1: As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
P2: In 2000, how many families lived in Richmond?
P1: 43,627
P2: What percentage of the Richmond population of 2000 was Pacific Islander?
P1:

Collapsing token sets: None, all tokens are considered

Prompt 6 (MI: 5.224, Acc: 0.754):
CHAPTER QUIZ

PASSAGE:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

QUESTIONS:
1) In 2000, how many families lived in Richmond?
2) What percentage of the Richmond population of 2000 was Pacific Islander?
ANSWER KEY:
1) 43,627
2)

Collapsing token sets: None, all tokens are considered

Prompt 7 (MI: 5.126, Acc: 0.750):
CHAPTER QUIZ
PASSAGE: BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.
QUESTIONS:
1) What high school GPA for BYU freshmen have on average?

ANSWER KEY:
1) 3.71
CHAPTER QUIZ

PASSAGE:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

QUESTIONS:
1) What percentage of the Richmond population of 2000 was Pacific Islander?
ANSWER KEY:
1)

Collapsing token sets: None, all tokens are considered
Prompt 8 (MI: 4.745, Acc: 0.700):
P1: BYU students arrive with superb preparation. The entering class has an average high school GPA of 3.71 (on a 4.0 scale) and an average ACT score that ranks in the 89th percentile nationally. The University consistently places in the top 20 for enrollment of National Merit Scholars.
P2: What high school GPA for BYU freshmen have on average?
P1: 3.71

P1: As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
P2: What percentage of the Richmond population of 2000 was Pacific Islander?
P1:

Collapsing token sets: None, all tokens are considered

Prompt 9 (MI: 3.998, Acc: 0.692):
CHAPTER QUIZ
PASSAGE:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

QUESTIONS:
1) What percentage of the Richmond population of 2000 was Pacific Islander?
ANSWER KEY:
1)

Collapsing token sets: None, all tokens are considered

Prompt 10 (MI: 4.037, Acc: 0.686):
TASK: Using words from the CONTEXT, answer the below QUESTIONS.

CONTEXT:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
QUESTIONS:
1) What percentage of the Richmond population of 2000 was Pacific Islander?
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 11 (MI: 4.231, Acc: 0.684):
P1 tells P2 some information, P2 asks comprehension questions, and P1 answers.

P1: As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
P2: What percentage of the Richmond population of 2000 was Pacific Islander?
P1: The answer is "

Collapsing token sets: None, all tokens are considered

Prompt 12 (MI: 3.568, Acc: 0.620):
P1: As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
P2: What percentage of the Richmond population of 2000 was Pacific Islander?
P1: The answer is "

Collapsing token sets: None, all tokens are considered

Prompt 13 (MI: 3.261, Acc: 0.614):
Given the following passages and questions, provide a brief, correct answer from the text.
"As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.", "What percentage of the Richmond population of 2000 was Pacific Islander?" -> "

Collapsing token sets: None, all tokens are considered
Prompt 14 (MI: 3.760, Acc: 0.608):
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

What percentage of the Richmond population of 2000 was Pacific Islander?
The correct answer is:

Collapsing token sets: None, all tokens are considered

Prompt 15 (MI: 3.006, Acc: 0.606):
I read this in a book today:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
From that context, did you catch What percentage of the Richmond population of 2000 was Pacific Islander?
Yes, the answer is

Collapsing token sets: None, all tokens are considered

Prompt 16 (MI: 3.843, Acc: 0.592):
TASK: Answer the questions below using the phrasing from the context.

CONTEXT:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
QUESTIONS:
1) What percentage of the Richmond population of 2000 was Pacific Islander?
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 17 (MI: 3.508, Acc: 0.544):
I read this in a book today:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
What percentage of the Richmond population of 2000 was Pacific Islander?
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 18 (MI: 3.227, Acc: 0.406):
Context: As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.

Q: What percentage of the Richmond population of 2000 was Pacific Islander?

A:

Collapsing token sets: None, all tokens are considered

Prompt 19 (MI: 2.497, Acc: 0.402):
A friend of mine told me this:
As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population.
My friend then asked: What percentage of the Richmond population of 2000 was Pacific Islander?
I answered:

Collapsing token sets: None, all tokens are considered
Prompt 20 (MI: 2.312, Acc: 0.302):
ANSWER KEY:

QUESTION1:
"As of the census of 2000, there were 197,790 people, 84,549 households, and 43,627 families residing in the city. The population density was 3,292.6 people per square mile (1,271.3/km). There were 92,282 housing units at an average density of 1,536.2 per square mile (593.1/km). The racial makeup of the city was 38.3% White, 57.2% African American, 0.2% Native American, 1.3% Asian, 0.1% Pacific Islander, 1.5% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 2.6% of the population." What percentage of the Richmond population of 2000 was Pacific Islander?
ANSWER1:

Collapsing token sets: None, all tokens are considered

C.2 LAMBADA

Prompt 1 (MI: 4.984, Acc: 0.782):
Fill in blank:
Alice was friends with Bob. Alice went to visit her friend ____. -> Bob
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____. ->

Collapsing token sets: None, all tokens are considered

Prompt 2 (MI: 4.793, Acc: 0.770):
Fill in blank:
She held the torch in front of her.
She caught her breath.

"Chris? There's a step."

"What?"

"A step. Cut in the rock. About fifty feet ahead." She moved faster. They both moved faster. "In fact," she said, raising the torch higher, "there's more than a ____. -> step

"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____. ->

Collapsing token sets: None, all tokens are considered

Prompt 3 (MI: 5.062, Acc: 0.770):
Fill in blank:
Alice was friends with Bob. Alice went to visit her friend ____. -> Bob
George bought some baseball equipment, a ball, a glove, and a ____. -> bat
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____. ->

Collapsing token sets: None, all tokens are considered

Prompt 4 (MI: 5.058, Acc: 0.736):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and

Collapsing token sets: None, all tokens are considered

Prompt 5 (MI: 4.194, Acc: 0.608):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
I like to eat peanut butter and jelly ____.
P2: sandwiches
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
P2:

Collapsing token sets: None, all tokens are considered
Prompt 6 (MI: 4.623, Acc: 0.608):
Fill in the blank for the following sentences.

"It was a cold night. The wind was whistling around the courtyard as I stepped out of the car and into the ____." -> "It was a cold night. The wind was whistling around the courtyard as I stepped out of the car and into the darkness."
""I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____." -> ""I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and

Collapsing token sets: None, all tokens are considered

Prompt 7 (MI: 4.328, Acc: 0.596):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
It was a cold night. The wind was ____ around the courtyard as I stepped out of the car and into the darkness.
P2: whistling
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
P2:

Collapsing token sets: None, all tokens are considered

Prompt 8 (MI: 3.338, Acc: 0.586):
Fill in the blank with the missing word to complete the sentence.

Passage: I like to eat peanut butter and jelly ____.
Missing Word: sandwiches

Passage: "I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Missing Word: "

Collapsing token sets: None, all tokens are considered

Prompt 9 (MI: 2.230, Acc: 0.498):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.

The missing word in the story should be: "

Collapsing token sets: None, all tokens are considered

Prompt 10 (MI: 2.632, Acc: 0.474):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Fill in the blank with the missing word or phrase.
What is the missing word? The missing word is "

Collapsing token sets: None, all tokens are considered

Prompt 11 (MI: 4.549, Acc: 0.470):
It was a cold night. The wind was ____ around the courtyard as I stepped out of the car and into the darkness.
Word: whistling

"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Word:

Collapsing token sets: None, all tokens are considered

Prompt 12 (MI: 2.637, Acc: 0.454):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
P2: The word which fits best is "

Collapsing token sets: None, all tokens are considered
Prompt 13 (MI: 2.476, Acc: 0.434):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Fill in the blank with the missing word or phrase to complete the sentence.
What is the missing word? The missing word is "

Collapsing token sets: None, all tokens are considered

Prompt 14 (MI: 3.043, Acc: 0.432):
Read the following sentences, and try to guess which word goes in the blank.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 15 (MI: 2.450, Acc: 0.428):
Fill in blank:

"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____. ->

Collapsing token sets: None, all tokens are considered

Prompt 16 (MI: 2.820, Acc: 0.398):
Fill in the blank with the missing word.
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Answer: "

Collapsing token sets: None, all tokens are considered

Prompt 17 (MI: 1.931, Acc: 0.376):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Which word should we put in the blank to complete the story?
Let's use the word "

Collapsing token sets: None, all tokens are considered

Prompt 18 (MI: 2.530, Acc: 0.374):
P1: What word do you think fits best in the following story?
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.
The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.
He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
P2: The word which fits best is "

Collapsing token sets: None, all tokens are considered

Prompt 19 (MI: 2.372, Acc: 0.364):
"I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Which word fills in the blank best?
The word that fills in the blank best is "

Collapsing token sets: None, all tokens are considered

Prompt 20 (MI: 2.860, Acc: 0.296):
Pick the best word to replace the blank.
Story: "I would speak to you privately," Bowen said, casting a glance around at the others milling about.

The worry in her eyes deepened, but she nodded hesitantly and awaited Bowen's directive.

He led her through the great hall, annoyance biting at him when he saw no place where people weren't congregated. He stepped outside the back of the keep, where, finally, he spied an area near the bathhouses, where it was quiet and ____.
Answer: "

Collapsing token sets: None, all tokens are considered
C.3 ROCStories

Prompt 1 (MI: 3.859, Acc: 0.538):
Fill in the blank for the following sentences.

"Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days." -> "Marissa loved

Collapsing token sets: None, all tokens are considered

Prompt 2 (MI: 4.427, Acc: 0.524):
Fill in the blank for the following sentences.
"It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness." -> "It was a cold night. The wind was whistling around the courtyard as I stepped out of the car and into the darkness."
"Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days." -> "Marissa loved

Collapsing token sets: None, all tokens are considered

Prompt 3 (MI: 3.728, Acc: 0.420):
Poke GO!
Marissa loved

Collapsing token sets: None, all tokens are considered

Prompt 4 (MI: 3.670, Acc: 0.356):
Fill in the blank with the missing word or phrase to complete the sentence.

I like to eat _____ and jelly sandwiches.
Answer: peanut butter

Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 5 (MI: 3.904, Acc: 0.310):
Fill in the blank with the missing word or phrase.
Sentence: I like to eat _______ and jelly sandwiches.
Missing Word/Phrase: peanut butter

Sentence: Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Missing Word/Phrase:

Collapsing token sets: None, all tokens are considered

Prompt 6 (MI: 4.167, Acc: 0.298):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness.
P2: whistling
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2:

Collapsing token sets: None, all tokens are considered

Prompt 7 (MI: 4.066, Acc: 0.290):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
I like to eat _____ and jelly sandwiches.
P2: peanut butter
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2:

Collapsing token sets: None, all tokens are considered

Prompt 8 (MI: 3.707, Acc: 0.258):
Guess the word in the blank to complete the story.
Story: Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 9 (MI: 3.644, Acc: 0.256):
Pick the best word to replace the blank.
Story: Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 10 (MI: 1.979, Acc: 0.222):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Fill in the blank with the missing word or phrase.
What is the missing word? The missing word is "

Collapsing token sets: None, all tokens are considered

Prompt 11 (MI: 3.199, Acc: 0.220):
Fill in the blank with the missing word or phrase.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 12 (MI: 2.013, Acc: 0.214):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Fill in the blank with the missing word or phrase to complete the sentence.
What is the missing word? The missing word is "

Collapsing token sets: None, all tokens are considered

Prompt 13 (MI: 3.116, Acc: 0.182):
Read the following sentences, and try to guess which word goes in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Answer:

Collapsing token sets: None, all tokens are considered

Prompt 14 (MI: 1.843, Acc: 0.158):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.

The missing word in the story should be: "

Collapsing token sets: None, all tokens are considered

Prompt 15 (MI: 2.681, Acc: 0.140):
P1: I'm going to tell you a story, but leave a word out. Once I'm done telling the story, pick the word that best fits in the blank.
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2: The word which fits best is "

Collapsing token sets: None, all tokens are considered

Prompt 16 (MI: 2.150, Acc: 0.120):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Which word should we put in the blank to complete the story?
Let's use the word "

Collapsing token sets: None, all tokens are considered

Prompt 17 (MI: 2.634, Acc: 0.088):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Which word fills in the blank best?
The word that fills in the blank best is "

Collapsing token sets: None, all tokens are considered

Prompt 18 (MI: 2.637, Acc: 0.086):
P1: What word do you think fits best in the following story?
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
P2: The word which fits best is "

Collapsing token sets: None, all tokens are considered

Prompt 19 (MI: 3.648, Acc: 0.050):
It was a cold night. The wind was _____ around the courtyard as I stepped out of the car and into the darkness.
Word: whistling

Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Put the best word in the blank to complete the story.
Word:

Collapsing token sets: None, all tokens are considered

Prompt 20 (MI: 1.891, Acc: 0.036):
Marissa loved _____ pokemon go game. It is the biggest thing right now. She had done so much more walking since she started playing it. She walked all day and evening sometimes. She walked almost 10 miles in two days.
Choose a word to replace the blank.
Word: "

Collapsing token sets: None, all tokens are considered
C.4 CoQA

Prompt 1 (MI: 0.600, Acc: 0.590):
Instructions: For each question below, choose the answer from the answer bank corresponding to the question that best answers the question.

Question 1 Answer Bank: ladybug, bunny, goldfish, leopard, caterpillar
Question: What animal would be most dangerous for a human to encounter in the wild?
Answer: leopard

Question 2 Answer Bank: wrong, pleasure, encouragement, depression, relief
Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Answer:

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 2 (MI: 0.233, Acc: 0.546):
Common Sense Quiz Answer Key

Question 1: Where would people not typically go for fun?
A: theme park
B: movie theatre
C: carnival
D: waste management facility
E: beach
Correct Answer: D

Question 2: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Correct Answer:

Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Prompt 3 (MI: 0.474, Acc: 0.470):
Given the following questions and choices, pick the choice that corresponds best to the question.
"I'm crossing the river, my feet are wet but my body is dry, where am I?", "bridge, waterfall, valley, pebble, mountain", -> "valley"
"In what Spanish speaking North American country can you get a great cup of coffee?", "mexico, mildred's coffee shop, diner, kitchen, canteen", -> "mexico"
"If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?", "wrong, pleasure, encouragement, depression, relief" -> "

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 4 (MI: 0.083, Acc: 0.466):
What would you use to put out a fire?
A: gasoline
B: poison
C: laundry detergent
D: water
E: pencil
Answer: D. water

If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Answer:

Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Prompt 5 (MI: 0.504, Acc: 0.462):
# multiple choice quiz questions and answers

qa = ['q': 'What is France?', 'choices': ['state', 'city', 'country', 'continent', 'mountain range'], 'answer': 'country', ], '[q': 'If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?', 'choices': [wrong, pleasure, encouragement, depression, relief], 'answer': '

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 6 (MI: 0.431, Acc: 0.448):
Given the following questions and choices, pick the choice that corresponds best to the question.

"I'm crossing the river, my feet are wet but my body is dry, where am I?", "bridge, waterfall, valley, pebble, mountain", -> "valley"
"If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?", "wrong, pleasure, encouragement, depression, relief" -> "

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}
Prompt 7 (MI: 0.417, Acc: 0.428):
Choose the best single answer to the question, and explain your answer.

Question: I'm crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answer: "valley" is the best answer. While "bridge" also seems to make sense at first, your feet would not be wet if you crossed over a river on a bridge. Meanwhile, if you crossed the river at a valley, the river would be shallow, only getting your feet wet.

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Choices: mildred's coffee shop, mexico, diner, kitchen, canteen
Answer: "mexico" is the best answer. It's true that you can get a cup of coffee in a coffee shop or a diner, but the question specifically asks for a Spanish speaking North American country. Mexico is the only country listed, so that must be the correct answer.

Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answer: "

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 8 (MI: 0.364, Acc: 0.408):
Q: What might a vegan eat for breakfast?

Choices: oats, bacon, sausage, omelet, ham
A: oats

Q: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?

Choices: wrong, pleasure, encouragement, depression, relief

A:

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 9 (MI: 0.410, Acc: 0.408):
What would you use to put out a fire?
A: gasoline
B: poison
C: laundry detergent
D: water
E: pencil
Answer: water

If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Answer:

Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Prompt 10 (MI: 0.363, Acc: 0.396):
Choose the best single answer to the question, and explain your answer.

Question: I'm crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answer: "valley" is the best answer. While "bridge" also seems to make sense at first, your feet would not be wet if you crossed over a river on a bridge. Meanwhile, if you crossed the river at a valley, the river would be shallow, only getting your feet wet.

Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?

Choices: wrong, pleasure, encouragement, depression, relief
Answer: "

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 11 (MI: 0.059, Acc: 0.380):
Common Sense Quiz Answer Key
Question 1: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Correct Answer:

Collapsing token sets: ['A', 'B', 'C', 'D', 'E']

Prompt 12 (MI: 0.233, Acc: 0.360):
Given the question, order the options from best answer to the question to worst answer to the question.

Question: I'm crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answers (in order of best to worst): valley, bridge, waterfall, mountain, pebble

Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answers (in order of best to worst):

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}
Prompt 13 (MI: 0.255, Acc: 0.360):
Given the question, order the options from best answer to the question to worst answer to the question.

Question: I'm crossing the river, my feet are wet but my body is dry, where am I?
Choices: bridge, waterfall, valley, pebble, mountain
Answers (in order of best to worst): valley, bridge, waterfall, mountain, pebble

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Choices: mildred's coffee shop, mexico, diner, kitchen, canteen
Answers (in order of best to worst): mexico, mildred's coffee shop, diner, kitchen, canteen

Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answers (in order of best to worst):

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 14 (MI: 0.222, Acc: 0.354):
Q: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?

Choices: wrong, pleasure, encouragement, depression, relief

A:

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 15 (MI: 0.246, Acc: 0.342):
Teacher: I'm going to ask you a common sense question.

Student: Alright.

Teacher: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?

Student: What are the possible answers?

Teacher: The answer is either "wrong," "pleasure," "encouragement," "depression," or "relief."

Student: I know the right answer - it's "

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 16 (MI: 0.376, Acc: 0.336):
questions,choices,answers
"What is France?","[state,city,country,continent,mountain range]",country
"If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?","[wrong,pleasure,encouragement,depression,relief]",

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 17 (MI: 0.265, Acc: 0.276):
Me: I watched the most recent episode of the "Is It Really Common Sense" game show yesterday night.
Friend: Oh, how was it?
Me: It was good. I remember one of the questions.
Friend: What was the question?
Me: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Friend: What were the options?
Me: wrong, pleasure, encouragement, depression, or relief
Friend: Did the contestant get the answer right?
Me: Yep!
Friend: Which of the options was correct?
Me: The correct answer was

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 18 (MI: 0.197, Acc: 0.248):
Given the question, order the options from best answer to the question to worst answer to the question.

Question: If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
Choices: wrong, pleasure, encouragement, depression, relief
Answers (in order of best to worst):

Collapsing token sets: {'A': ['wrong'], 'B': ['pleasure'], 'C': ['encouragement'], 'D': ['depression'], 'E': ['relief']}

Prompt 19 (MI: 0.013, Acc: 0.234):
If you're still in love and end up stopping being married to your partner, what emotion are you likely to experience?
A: wrong
B: pleasure
C: encouragement
D: depression
E: relief
Answer:

Collapsing token sets: ['A', 'B', 'C', 'D', 'E']
Prompt 20 (MI: 0.241, Acc: 0.228): Prompt 2 (MI: 0.306, Acc: 0.920):
Teacher: I’m going to ask you a common sense question. P1: Could you give me a review of the movie you just saw?
P2: Sure, John Cassavetes is on the run from the law. He is at the
Student: Alright. bottom of the heap. He sees Negro Sidney Poitier as his equal and
Teacher: What would you not expect to read about in a they quickly become friends, forming a sort of alliance against a
book on the founding of the United States? bully of a foreman played by Jack Warden.

Student: What are the possible answers? As someone who has worked in a warehouse myself when I was
younger, I can tell you that the warehouse fights, complete with
Teacher: The answer is either "george washington," "dec- tumbling packing cases and flailing grappling hooks are as realis-
laration of independence," "boston tea party," "star spangled
banner," or "vampire assassins." tic as it gets. I’ve been in fights like these myself, although no one
got killed.
Student: I know the right answer - it’s "vampire assas-
sins." The introduction of Sidney Poitier’s widow is a variation on
Shakespeare’s Shylock "Do I not bleed?" This is an anti racist
Teacher: That’s right! Here’s another common sense film, which, at the time, was much needed.
question for you. If you’re still in love and end up stopping
being married to your partner, what emotion are you likely to All the three principle characters - Warden, Cassavetes and Poitier
experience? - are superb, with Warden the most outstanding of the three.
P1: So, overall, would you give it a positive or negative review?
Student: What are the possible answers? P2: I would give it a
Teacher: The answer is either "wrong," "pleasure," "en- Collapsing token sets: {’positive’: [’positive’],
couragement," "depression," or "relief."
’negative’: [’negative’]}
Student: I know the right answer - it’s "
Collapsing token sets: {’A’: [’wrong’], Prompt 3 (MI: 0.154, Acc: 0.904):
’B’: [’pleasure’], ’C’: [’encouragement’], Considering this movie review, determine its sentiment.

’D’: [’depression’], ’E’: [’relief’]} Review: John Cassavetes is on the run from the law. He
is at the bottom of the heap. He sees Negro Sidney Poitier as his
equal and they quickly become friends, forming a sort of alliance
against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I
was younger, I can tell you that the warehouse fights, complete
with tumbling packing cases and flailing grappling hooks are as
C.5 IMDB realistic as it gets. I’ve been in fights like these myself, although
no one got killed.
Prompt 1 (MI: 0.175, Acc: 0.944): The introduction of Sidney Poitier’s widow is a variation on
P1: How was the movie? Shakespeare’s Shylock "Do I not bleed?" This is an anti racist
P2: John Cassavetes is on the run from the law. He is at the bot- film, which, at the time, was much needed.
tom of the heap. He sees Negro Sidney Poitier as his equal and
they quickly become friends, forming a sort of alliance against a All the three principle characters - Warden, Cassavetes and
bully of a foreman played by Jack Warden. Poitier - are superb, with Warden the most outstanding of the
three.
As someone who has worked in a warehouse myself when I was
younger, I can tell you that the warehouse fights, complete with In general, was the sentiment positive or negative The
tumbling packing cases and flailing grappling hooks are as realis- sentiment was
tic as it gets. I’ve been in fights like these myself, although no one Collapsing token sets: {’positive’: [’positive’],
got killed.
’negative’: [’negative’]}
The introduction of Sidney Poitier’s widow is a variation on
Shakespeare’s Shylock "Do I not bleed?" This is an anti racist
film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier
- are superb, with Warden the most outstanding of the three.
P1: Would you say your review of the movie is negative or posi-
tive?
P2: I would say my review review of the movie is
Collapsing token sets: {’positive’: [’positive’],
’negative’: [’negative’]}
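Each prompt in this appendix is listed with its estimated mutual information (MI) and accuracy, followed by the collapsing token set that maps the model's next-token distribution onto the task's answer labels. As a rough illustration of how such a mapping can be applied (a minimal sketch of our own, not the authors' released implementation; the collapse helper, its uniform fallback, and the example probabilities are assumptions), the probability mass on each label's tokens can be summed and renormalized:

    # Illustrative sketch: apply a "collapsing token set" to a next-token
    # distribution. `token_probs` maps vocabulary strings to probabilities;
    # `collapsing_sets` maps each label to its surface tokens.
    def collapse(token_probs, collapsing_sets):
        """Sum probability mass over each label's tokens, then renormalize."""
        label_mass = {
            label: sum(token_probs.get(tok, 0.0) for tok in tokens)
            for label, tokens in collapsing_sets.items()
        }
        total = sum(label_mass.values())
        if total == 0.0:
            # No mass fell on any label token; fall back to uniform (our choice).
            return {label: 1.0 / len(label_mass) for label in label_mass}
        return {label: mass / total for label, mass in label_mass.items()}

    # Example with the IMDB collapsing set used in this section (made-up numbers):
    probs = {"positive": 0.41, "negative": 0.07, "the": 0.02}
    print(collapse(probs, {"positive": ["positive"], "negative": ["negative"]}))
    # -> {'positive': 0.854..., 'negative': 0.145...}

Renormalizing over the label tokens is what makes templates with very different raw next-token behavior comparable on the same task.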
Prompt 4 (MI: 0.260, Acc: 0.898):
P1: How was the movie?
P2: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
P1: Would you say your review of the movie is positive or negative?
P2: I would say my review of the movie is

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 5 (MI: 0.237, Acc: 0.888):
After reading the following review, classify it as negative or positive.
Review: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Classification:

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 6 (MI: 0.151, Acc: 0.886):
Read the following movie review to determine the review's sentiment.
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
In general, was the sentiment positive or negative? The sentiment was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 7 (MI: 0.086, Acc: 0.886):
Considering this movie review, determine its sentiment.
Review:
"""
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
"""
In general, what was the sentiment of the review? The sentiment was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}
Prompt 8 (MI: 0.274, Acc: 0.858):
Yesterday I went to see a movie. John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three. Between positive and negative, I would say the movie was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 9 (MI: 0.026, Acc: 0.852):
Q: Is the sentiment of the following movie review negative or positive?
"""
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
"""
A: The sentiment of the movie review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 10 (MI: 0.119, Acc: 0.842):
Read the following movie review to determine the review's sentiment.
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
In general, was the sentiment negative or positive? The sentiment was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 11 (MI: 0.162, Acc: 0.824):
Q: Is the sentiment of the following movie review positive or negative?
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
A (positive or negative):

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}
Prompt 12 (MI: 0.101, Acc: 0.822):
Q: Is the sentiment of the following movie review negative or positive?
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
A (negative or positive):

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 13 (MI: 0.084, Acc: 0.810):
Considering this movie review, determine its sentiment.
Review:
"""
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
"""
In general, was the sentiment positive or negative? The sentiment was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 14 (MI: 0.201, Acc: 0.798):
P1: Could you give me a review of the movie you just saw?
P2: Sure, John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
P1: So overall was the sentiment of the movie negative or positive?
P2: I would give it a

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 15 (MI: 0.234, Acc: 0.786):
After reading the following review, classify it as positive or negative.
Review: John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Classification:

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}
Prompt 16 (MI: 0.042, Acc: 0.628):
Q: Is the sentiment of the following movie review positive or negative?
"""
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
"""
A: The sentiment of the movie review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 17 (MI: 0.021, Acc: 0.486):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the previous review negative or positive? The previous review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 18 (MI: 0.016, Acc: 0.484):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the previous review positive or negative? The previous review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 19 (MI: 0.019, Acc: 0.462):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the sentiment of previous review positive or negative? The previous review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

Prompt 20 (MI: 0.017, Acc: 0.450):
John Cassavetes is on the run from the law. He is at the bottom of the heap. He sees Negro Sidney Poitier as his equal and they quickly become friends, forming a sort of alliance against a bully of a foreman played by Jack Warden.
As someone who has worked in a warehouse myself when I was younger, I can tell you that the warehouse fights, complete with tumbling packing cases and flailing grappling hooks are as realistic as it gets. I've been in fights like these myself, although no one got killed.
The introduction of Sidney Poitier's widow is a variation on Shakespeare's Shylock "Do I not bleed?" This is an anti racist film, which, at the time, was much needed.
All the three principle characters - Warden, Cassavetes and Poitier - are superb, with Warden the most outstanding of the three.
Was the sentiment of previous review negative or positive? The previous review was

Collapsing token sets: {'positive': ['positive'], 'negative': ['negative']}

C.6 BoolQ

Prompt 1 (MI: 0.077, Acc: 0.778):
Given the passage and question, please answer the question with yes or no.
'''Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic.''', '''Can you turn left on red in canada?''' -> '''Yes'''
'''Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.''', '''Is pyruvic acid and pyruvate the same thing?''' -> '''

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 2 (MI: 0.090, Acc: 0.752):
Passage: "Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic."
Question: "Can you turn left on red in canada?"
Answer: "Yes"

Passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Question: "Is pyruvic acid and pyruvate the same thing?"
Answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 3 (MI: 0.055, Acc: 0.750):
Given the passage and question, please answer the question with yes or no.
'''Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic.''', '''Can you turn left on red in canada?''' -> '''Yes'''
'''Lord Voldemort – Lord Voldemort (known as Tom Marvolo Riddle) is a fictional character and the main antagonist in J.K. Rowling's series of Harry Potter novels. Voldemort first appeared in Harry Potter and the Philosopher's Stone, which was released in 1997. Voldemort appears either in person or in flashbacks in each book and its film adaptation in the series, except the third, Harry Potter and the Prisoner of Azkaban, where he is only mentioned.''', '''Are tom riddle and lord voldemort the same person?''' -> '''Yes'''
'''Clerks – Clerks is a 1994 American independent black-and-white comedy film written, directed and co-produced by Kevin Smith. Starring Brian O'Halloran as Dante Hicks and Jeff Anderson as Randal Graves, it presents a day in the lives of two store clerks and their acquaintances.''', '''Is the movie clerks in colors?''' -> '''No'''
'''Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.''', '''Is pyruvic acid and pyruvate the same thing?''' -> '''

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 4 (MI: 0.076, Acc: 0.740):
Passage: "Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic."
Question: "Can you turn left on red in canada?"
Answer: "Yes"

Passage: "Lord Voldemort – Lord Voldemort (known as Tom Marvolo Riddle) is a fictional character and the main antagonist in J.K. Rowling's series of Harry Potter novels. Voldemort first appeared in Harry Potter and the Philosopher's Stone, which was released in 1997. Voldemort appears either in person or in flashbacks in each book and its film adaptation in the series, except the third, Harry Potter and the Prisoner of Azkaban, where he is only mentioned."
Question: "Are tom riddle and lord voldemort the same person?"
Answer: "Yes"

Passage: "Clerks – Clerks is a 1994 American independent black-and-white comedy film written, directed and co-produced by Kevin Smith. Starring Brian O'Halloran as Dante Hicks and Jeff Anderson as Randal Graves, it presents a day in the lives of two store clerks and their acquaintances."
Question: "Is the movie clerks in colors?"
Answer: "No"

Passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Question: "Is pyruvic acid and pyruvate the same thing?"
Answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}
Prompt 5 (MI: 0.037, Acc: 0.740):
Given the passage and question, please answer the question with yes or no.
'''Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic.''', '''Can you turn left on red in canada?''' -> '''Yes'''
'''Lord Voldemort – Lord Voldemort (known as Tom Marvolo Riddle) is a fictional character and the main antagonist in J.K. Rowling's series of Harry Potter novels. Voldemort first appeared in Harry Potter and the Philosopher's Stone, which was released in 1997. Voldemort appears either in person or in flashbacks in each book and its film adaptation in the series, except the third, Harry Potter and the Prisoner of Azkaban, where he is only mentioned.''', '''Are tom riddle and lord voldemort the same person?''' -> '''Yes'''
'''Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.''', '''Is pyruvic acid and pyruvate the same thing?''' -> '''

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 6 (MI: 0.068, Acc: 0.702):
"Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
For the question: "Is pyruvic acid and pyruvate the same thing?"
I would answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 7 (MI: 0.039, Acc: 0.698):
"Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
When picking between yes or no For the question: "Is pyruvic acid and pyruvate the same thing?"
I would answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 8 (MI: 0.034, Acc: 0.698):
ANSWER KEY
Please read the following passage with the following question in mind: "Is pyruvic acid and pyruvate the same thing?"
Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.
Is pyruvic acid and pyruvate the same thing?
Answer key: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 9 (MI: 0.055, Acc: 0.688):
Passage: "Turn on red – In Canada, left turn on red light from a one-way road into a one-way road is permitted except in some areas of Quebec, New Brunswick, and Prince Edward Island. Left turn on red light from a two-way road into a one-way road is permitted in British Columbia but only if the driver turns onto the closest lane and yields to pedestrians and cross traffic."
Question: "Can you turn left on red in canada?"
Answer: "Yes"

Passage: "Lord Voldemort – Lord Voldemort (known as Tom Marvolo Riddle) is a fictional character and the main antagonist in J.K. Rowling's series of Harry Potter novels. Voldemort first appeared in Harry Potter and the Philosopher's Stone, which was released in 1997. Voldemort appears either in person or in flashbacks in each book and its film adaptation in the series, except the third, Harry Potter and the Prisoner of Azkaban, where he is only mentioned."
Question: "Are tom riddle and lord voldemort the same person?"
Answer: "Yes"

Passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Question: "Is pyruvic acid and pyruvate the same thing?"
Answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 10 (MI: 0.052, Acc: 0.682):
"Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
For the question: "Is pyruvic acid and pyruvate the same thing?"
My answer would be: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 11 (MI: 0.026, Acc: 0.682):
Given the passage and question, please answer the question with yes or no.
'''Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.''', '''Is pyruvic acid and pyruvate the same thing?''' -> '''

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 12 (MI: 0.016, Acc: 0.680):
"Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
When picking between "true" or "false", For the question: "Is pyruvic acid and pyruvate the same thing?"
My answer would be: "

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 13 (MI: 0.074, Acc: 0.674):
Please read the following passage with the following question in mind: "Is pyruvic acid and pyruvate the same thing?"
Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.
Is pyruvic acid and pyruvate the same thing?
Answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 14 (MI: 0.050, Acc: 0.668):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
I would answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 15 (MI: 0.058, Acc: 0.646):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
I would respond: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 16 (MI: 0.027, Acc: 0.634):
Based on the passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
And answering the question: "Is pyruvic acid and pyruvate the same thing?"
By choosing yes or no
My answer would be: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 17 (MI: 0.013, Acc: 0.522):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways.
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose "true" or "false", My answer would be: "

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 18 (MI: 0.020, Acc: 0.518):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose yes or no, My answer would be: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 19 (MI: 0.013, Acc: 0.452):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose "true" or "false", I would answer: "

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 20 (MI: 0.022, Acc: 0.438):
Read the following passage: "Pyruvic acid – Pyruvic acid (CH₃COCOOH) is the simplest of the alpha-keto acids, with a carboxylic acid and a ketone functional group. Pyruvate (/paɪˈruːveɪt/), the conjugate base, CH₃COCOO⁻, is a key intermediate in several metabolic pathways."
Given this question: "Is pyruvic acid and pyruvate the same thing?"
If asked to choose yes or no, I would answer: "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}
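Note that several of the BoolQ templates above differ only in which label words they solicit ("yes"/"no" versus "true"/"false"); the collapsing token set is chosen to match the wording of each template, so that the label distribution is always read off the tokens the prompt actually invites.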
C.7 COPA

Prompt 1 (MI: 0.044, Acc: 0.782):
For the following premises, choose the alternative that is either a cause or result of the premise, and justify your answer.

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Answer: Alternative 2. Getting a hole in your sock would not break your toe, unless there is additional information. Dropping a hammer (which is a heavy object), on the other hand, would almost certaintly break your toe. Thus, the best answer is Alternative 2.

Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.
Answer: Alternative 2. Tipping a bottle causes liquid to fall out, not to freeze. Freezing is caused by being placed in a cold place. Pouring out (Alternative 2) is correct because it makes the most sense.

Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
Answer: Alternative 1. When you knock on a neighbor's door, it is likely that if they are home they will answer and invite you in. It does not make much sense, however, that a neighbor would leave their house without explanation. Therefore, Alternative 1 is the best result of the premise.

Premise: My foot went numb. What happened as a RESULT?
Alternative 1: I put my shoes on.
Alternative 2: I shook my foot.
Answer: Alternative

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 2 (MI: 0.034, Acc: 0.762):
The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%.

Examples

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Answer: Alternative 2

Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.
Answer: Alternative 2

Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
Answer: Alternative 1

Premise: My foot went numb. What happened as a RESULT?
Alternative 1: I put my shoes on.
Alternative 2: I shook my foot.
Answer: Alternative

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 3 (MI: 0.003, Acc: 0.628):
What is the effect of the following premise: "My foot went numb."
Choice 1. I put my shoes on.
Choice 2. I shook my foot.
Answer: Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 4 (MI: 0.002, Acc: 0.612):
Solve the following COPA task by choosing the sentence which makes the most sense after the premise.
Premise: My foot went numb.
Choice 1. I put my shoes on.
Choice 2. I shook my foot.
Answer: Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 5 (MI: 0.003, Acc: 0.550):
If asked to pick between choice 1 ("I put my shoes on.") or choice 2 ("I shook my foot.") to see what the effect of this premise ("My foot went numb.") was, I would say: "choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 6 (MI: 0.010, Acc: 0.540):
Solve the following COPA tasks by choosing the sentence which makes the most sense after the premise.
Premise: The man broke his toe.
Choice 1. He got a hole in his sock.
Choice 2. He dropped a hammer on his foot.
Answer: Choice 2.
Premise: My foot went numb.
Choice 1. I put my shoes on.
Choice 2. I shook my foot.
Answer: Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 7 (MI: 0.002, Acc: 0.532):
What is the effect of the following premise: "My foot went numb."
If asked to choose between Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
My answer would be: Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 8 (MI: 0.006, Acc: 0.530):
I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: My foot went numb.
1) I put my shoes on.
2) I shook my foot.
The most plausible alternative is: Sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 9 (MI: 0.018, Acc: 0.524):
Read the following premise and answer by choosing "effect1" or "effect2"
Premise: "My foot went numb."
effect1: "I put my shoes on."
effect2: "I shook my foot."
Answer: "effect

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 10 (MI: 0.008, Acc: 0.520):
Read the following premise and pick "effect2" or "effect1"
Premise: "My foot went numb."
effect1: "I put my shoes on."
effect2: "I shook my foot."
Answer: "effect

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 11 (MI: 0.003, Acc: 0.516):
Based on this premise: "My foot went numb."
If asked to choose between
Choice 1: "I put my shoes on."
or
Choice 2: "I shook my foot."
My answer would be: Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 12 (MI: 0.008, Acc: 0.510):
Which one of these stories makes the most sense?
Story 1: My foot went numb. I put my shoes on.
Story 2: My foot went numb. I shook my foot.
Answer: Story

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 13 (MI: 0.003, Acc: 0.506):
P1: Here's a premise: "The man broke his toe."
Which sentence provides the better alternative?
1. "He got a hole in his sock", or
2. "He dropped a hammer on his foot."
P2: The better alternative is sentence

P1: Here's a premise: "My foot went numb.".Which sentence provides the better alternative? 1. "I put my shoes on", or 2. "I shook my foot."P2: The better alternative is sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 14 (MI: 0.003, Acc: 0.504):
Based on this premise: "My foot went numb."
If asked to pick between
Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
of the predeciding sentence, I would say: "Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 15 (MI: 0.036, Acc: 0.502):
I am going to tell you two stories, one of them will make sense and the other will not.
Story 1: My foot went numb. I put my shoes on.
Story 2: My foot went numb. I shook my foot.
The story that makes sense is Story

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 16 (MI: 0.009, Acc: 0.502):
My foot went numb.
Which of the following alternatives is most plausible for the previous sentence?
Sentence 1) I put my shoes on.
Sentence 2) I shook my foot.
The most plausible alternative is sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 17 (MI: 0.006, Acc: 0.500):
I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: The man broke his toe.
1) He got a hole in his sock.
2) He dropped a hammer on his foot.
The most plausible alternative is: Sentence 2).

I will give you a premise and you will choose either sentence 1) or 2) which is the better plausible alternative.
Premise: My foot went numb.
1) I put my shoes on.
2) I shook my foot.
The most plausible alternative is: Sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 18 (MI: 0.003, Acc: 0.500):
P1: Here's a premise: My foot went numb..Which sentence provides the better alternative? 1. "I put my shoes on", or 2. "I shook my foot."P2: The better alternative is sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 19 (MI: 0.019, Acc: 0.500):
"The man broke his toe."
Which of the following alternatives is most plausible for the previous sentence?
Sentence 1) He got a hole in his sock.
Sentence 2) He dropped a hammer on his foot.
The most plausible alternative is sentence 2).
"My foot went numb."
Which of the following alternatives is most plausible for the previous sentence?
Sentence 1) I put my shoes on.
Sentence 2) I shook my foot.
The most plausible alternative is sentence

Collapsing token sets: {'1': ['1'], '2': ['2']}

Prompt 20 (MI: 0.001, Acc: 0.496):
I want to figure out which effect of this sentence is more probably: "My foot went numb."
Choice 1: "I put my shoes on." or Choice 2: "I shook my foot."
to get the effect I would say: "Choice

Collapsing token sets: {'1': ['1'], '2': ['2']}

C.8 WiC

Prompt 1 (MI: 0.036, Acc: 0.520):
Classify whether the following two sentences' use of the word has the same meaning or not.
Word: bright
Usage 1: He is a bright child
Usage 2: The sun is very bright today
Meaning: different
Word: didacticism
Usage 1: The didacticism of the 19th century gave birth to many great museums.
Usage 2: The didacticism expected in books for the young.
Meaning:

Collapsing token sets: {'True': ['same'], 'False': ['different']}
Prompt 2 (MI: 0.006, Acc: 0.512):
"The didacticism of the 19th century gave birth to many great museums."
"The didacticism expected in books for the young."
True or false, the word didacticism has the same meaning.
Answer:

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 3 (MI: 0.025, Acc: 0.506):
Depending on its context, an ambiguous word can refer to multiple, potentially unrelated, meanings. Mainstream static word embeddings, such as Word2vec and GloVe, are unable to reflect this dynamic semantic nature. Contextualised word embeddings are an attempt at addressing this limitation by computing dynamic representations for words which can adapt based on context. A system's task on the WiC dataset is to identify the intended meaning of words. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practise.
WiC features multiple interesting characteristics:
It is suitable for evaluating a wide range of applications, including contextualized word and sense representation and Word Sense Disambiguation;
It is framed as a binary classification dataset, in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline;
It is constructed using high quality annotations curated by experts.

Examples from the dataset:

Context-1 // Context-2 // Target // Label
There's a lot of trash on the bed of the river // I keep a glass of water on my bed when I sleep // bed // Different
Air pollution // Open a window and let in some air // air // Same
The didacticism of the 19th century gave birth to many great museums. // The didacticism expected in books for the young. // didacticism //

Collapsing token sets: {'True': ['same'], 'False': ['different']}

Prompt 4 (MI: 0.007, Acc: 0.504):
"The didacticism of the 19th century gave birth to many great museums."
"The didacticism expected in books for the young."
True or False, the word "didacticism" has the same meaning.
Answer:

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 5 (MI: 0.006, Acc: 0.504):
Q: What does 2 + 2 equal?
A: 4

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 6 (MI: 0.007, Acc: 0.496):
Q: What year did America first land on the moon?
A: 1969

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 7 (MI: 0.004, Acc: 0.496):
I am going to answer true or false questions about whether a word that appears in two sentences has the same meaning or not.

True or False, the word "didacticism" has the same meaning in the following sentences.

Sentence 1: The didacticism of the 19th century gave birth to many great museums.
Sentence 2: The didacticism expected in books for the young.
Answer:

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 8 (MI: 0.006, Acc: 0.494):
Classify whether the following two sentences' use of the word has the same meaning or not.
Word: bright
Usage 1: He is a bright child
Usage 2: The sun is very bright today
Meaning: different
Word: air
Usage 1: Utah has too much air pollution.
Usage 2: Open a window and let in some air.
Meaning: same
Word: cool
Usage 1: Her pants are cool.
Usage 2: Let your food cool.
Meaning: different
Word: didacticism
Usage 1: The didacticism of the 19th century gave birth to many great museums.
Usage 2: The didacticism expected in books for the young.
Meaning:

Collapsing token sets: {'True': ['same'], 'False': ['different']}

Prompt 9 (MI: 0.007, Acc: 0.494):
Q: What does 2 + 2 equal?
A: 4

Q: If you are 60 inches tall how tall are you in feet?
A: 5 feet

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}
Prompt 10 (MI: 0.004, Acc: 0.494):
True or False, the word "didacticism" has the same meaning in the following sentences.

Sentence 1: "The didacticism of the 19th century gave birth to many great museums."
Sentence 2: "The didacticism expected in books for the young."
Answer:

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 11 (MI: 0.009, Acc: 0.494):
In the sentences "The didacticism of the 19th century gave birth to many great museums." and "The didacticism expected in books for the young.", true or false, the statement "the word didacticism has the same meaning" is

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 12 (MI: 0.008, Acc: 0.492):
Q: What year did America first land on the moon?
A: 1969

Q: What is the average height in America?
A: 5 feet 9 inches

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 13 (MI: 0.017, Acc: 0.492):
"The didacticism of the 19th century gave birth to many great museums."
"The didacticism expected in books for the young."
"True" or "False", the word didacticism has the same meaning.
Answer: "

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 14 (MI: 0.017, Acc: 0.488):
The didacticism of the 19th century gave birth to many great museums. // The didacticism expected in books for the young.
Choose "yes" or "no". Does the word didacticism have the same meaning in the previous sentences? "

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 15 (MI: 0.008, Acc: 0.488):
In the sentences "The didacticism of the 19th century gave birth to many great museums." and "The didacticism expected in books for the young." and choosing "true" or "false", the statement "the word didacticism has the same meaning" is "

Collapsing token sets: {'True': ['true'], 'False': ['false']}

Prompt 16 (MI: 0.031, Acc: 0.486):
In linguistics, a word sense is one of the meanings of a word. Words are in two sets: a large set with multiple meanings (word senses) and a small set with only one meaning (word sense). For example, a dictionary may have over 50 different senses of the word "play", each of these having a different meaning based on the context of the word's usage in a sentence, as follows:

"We went to see the play Romeo and Juliet at the theater."
"The coach devised a great play that put the visiting team on the defensive."
"The children went out to play in the park."

In each sentence we associate a different meaning of the word "play" based on hints the rest of the sentence gives us.

People and computers, as they read words, must use a process called word-sense disambiguation[1][2] to find the correct meaning of a word. This process uses context to narrow the possible senses down to the probable ones. The context includes such things as the ideas conveyed by adjacent words and nearby phrases, the known or probable purpose and register of the conversation or document, and the orientation (time and place) implied or expressed. The disambiguation is thus context-sensitive.
Advanced semantic analysis has resulted in a sub-distinction. A word sense corresponds either neatly to a seme (the smallest possible unit of meaning) or a sememe (larger unit of meaning), and polysemy of a word of phrase is the property of having multiple semes or sememes and thus multiple senses.

The following are examples of two sentences where the meaning of the word is either the same or different.

Examples:
There's a lot of trash on the bed of the river // I keep a glass of water on my bed when I sleep // bed // Different
Air pollution // Open a window and let in some air // air // Same
The didacticism of the 19th century gave birth to many great museums. // The didacticism expected in books for the young. // didacticism //

Collapsing token sets: {'True': ['same'], 'False': ['different']}

Prompt 17 (MI: 0.007, Acc: 0.466):
Classify whether the following two sentences' use of the word has the same meaning or not.
Word: bright
Usage 1: He is a bright child
Usage 2: The sun is very bright today
Meaning: different
Word: air
Usage 1: Utah has too much air pollution.
Usage 2: Open a window and let in some air.
Meaning: same
Word: cool
Usage 1: Her pants are cool.
Usage 2: Let your food cool.
Meaning: different
Word: fight
Usage 1: My wife and I had a fight.
Usage 2: I fight for my freedom.
Meaning: same
Word: didacticism
Usage 1: The didacticism of the 19th century gave birth to many great museums.
Usage 2: The didacticism expected in books for the young.
Meaning:

Collapsing token sets: {'True': ['same'], 'False': ['different']}
Prompt 18 (MI: 0.010, Acc: 0.460):
Classify whether the following two sentences' use of the word has the same meaning or not.
Word: bright
Usage 1: He is a bright child
Usage 2: The sun is very bright today
Meaning: different
Word: air
Usage 1: Utah has too much air pollution.
Usage 2: Open a window and let in some air.
Meaning: same
Word: didacticism
Usage 1: The didacticism of the 19th century gave birth to many great museums.
Usage 2: The didacticism expected in books for the young.
Meaning:

Collapsing token sets: {'True': ['same'], 'False': ['different']}

Prompt 19 (MI: 0.007, Acc: 0.460):
Q: Is the United States in South America?
A: No

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}

Prompt 20 (MI: 0.004, Acc: 0.440):
Q: Is the United States in South America?
A: No

Q: Is the following sentence missing a comma? Before leaving I ate breakfast.
A: Yes

Q: Does the word "didacticism" have the same meaning in the following sentences? "The didacticism of the 19th century gave birth to many great museums."; "The didacticism expected in books for the young."
A:

Collapsing token sets: {'True': ['yes'], 'False': ['no']}
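The MI values reported in the prompt headers throughout this appendix can be estimated without ground truth labels. As a minimal sketch of the selection rule (our own restatement under stated assumptions, not the authors' released code: label_dists is assumed to hold one collapsed label distribution per unlabeled input, as produced by a mapping like the one sketched earlier in this appendix), the mutual information between inputs and collapsed outputs decomposes as I(X; Y) = H(Y) - H(Y|X), i.e. the entropy of the averaged output distribution minus the average per-input entropy, and the template with the largest estimate is selected:

    import math

    def entropy(dist):
        """Shannon entropy (in nats) of a {label: probability} distribution."""
        return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

    def mutual_information(label_dists):
        """Estimate I(X; Y) = H(Y) - H(Y|X) for one template from the
        collapsed label distributions of a batch of unlabeled inputs."""
        n = len(label_dists)
        marginal = {lab: sum(d[lab] for d in label_dists) / n for lab in label_dists[0]}
        return entropy(marginal) - sum(entropy(d) for d in label_dists) / n

    # Select the template whose outputs carry the most information about the
    # inputs; `dists_by_template` is an assumed map of template id -> list of
    # per-input label distributions for that template.
    def select_template(dists_by_template):
        return max(dists_by_template, key=lambda t: mutual_information(dists_by_template[t]))

Under this estimate, a template scores high only when its outputs are confident on individual inputs yet varied across inputs, which is the behavior the high-MI, high-accuracy templates above tend to exhibit.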
