Abstract

Large language models (LMs) are able to in-context learn—perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required—randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choice tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence. Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone.

[Figure 1 panels: bar charts of Macro-F1 (%) for classification (top) and Accuracy (%) for multi-choice (bottom), comparing No Demos, Demos w/ gold labels, and Demos w/ random labels for MetaICL (774M), GPT-J (6B), and GPT-3 (175B).]
Figure 1: Results in classification (top) and multi-choice tasks (bottom), using three LMs with varying size. Reported on six datasets on which GPT-3 is evaluated; the channel method is used. See Section 4 for the full results. In-context learning performance drops only marginally when labels in the demonstrations are replaced by random labels.

1 Introduction

Large language models (LMs) have shown impressive performance on downstream tasks by simply conditioning on a few input-label pairs (demonstrations); this type of inference has been referred to as in-context learning (Brown et al., 2020). Despite in-context learning consistently outperforming zero-shot inference on a wide range of tasks (Zhao et al., 2021; Liu et al., 2021), there is little understanding of how it works and which aspects of the demonstrations contribute to end task performance.

In this paper, we show that ground truth demonstrations are in fact not required for effective in-context learning (Section 4). Specifically, replacing the labels in demonstrations with random labels barely hurts performance in a range of classification and multi-choice tasks (Figure 1). The result is consistent over 12 different models including the GPT-3 family (Radford et al., 2019; Min et al., 2021b; Wang and Komatsuzaki, 2021; Artetxe et al., 2021; Brown et al., 2020). This strongly suggests, counter-intuitively, that the model does not rely on the input-label mapping in the demonstrations to perform the task.

Further analysis investigates which parts of the demonstrations actually do contribute to the performance. We identify possible aspects of demonstrations (e.g., the label space and the distribution of the input text) and evaluate a series of variants of the demonstrations to quantify the impact of each (Section 5). We find that: (1) the label space and the distribution of the input text specified by the demonstrations are both key to in-context learning (regardless of whether the labels are correct for individual inputs); (2) specifying the overall format is also crucial, e.g., when the label space is unknown, using random English words as labels is significantly better than using no labels; and
(3) meta-training with an in-context learning objective (Min et al., 2021b) magnifies these effects—the models almost exclusively exploit simpler aspects of the demonstrations like the format rather than the input-label mapping.

In summary, our analysis provides a new way of understanding the role of the demonstrations in in-context learning. We empirically show that the model (1) counter-intuitively does not rely on the ground truth input-label mapping provided in the demonstrations as much as we thought (Section 4), and (2) nonetheless still benefits from knowing the label space and the distribution of inputs specified by the demonstrations (Section 5). We also include a discussion of broader implications, e.g., what we can say about the model learning at test time, and avenues for future work (Section 6).

[Figure 2 content: Demonstrations "Circulation revenue has increased by 5% in Finland. \n Positive", "Panostaja did not disclose the purchase price. \n Neutral", "Paying off the national debt will be extremely painful. \n Negative", followed by the test input "The acquisition will have an immediate positive impact. \n ________", which the LM completes with the prediction "Positive".]
Figure 2: An overview of in-context learning. The demonstrations consist of k input-label pairs from the training data (k = 3 in the figure).

Model | # Params | Public | Meta-trained
GPT-2 Large | 774M | ✓ | ✗
MetaICL | 774M | ✓ | ✓
GPT-J | 6B | ✓ | ✗
fairseq 6.7B† | 6.7B | ✓ | ✗
fairseq 13B† | 13B | ✓ | ✗
GPT-3 | 175B‡ | ✗ | ✗
[Figure 3 panels: bar charts of Macro-F1 (%) for classification (top) and Accuracy (%) for multi-choice (bottom), for the Direct and Channel variants of GPT-2, MetaICL, GPT-J, fairseq 6.7B, fairseq 13B, and GPT-3.]
Figure 3: Results when using no-demonstrations, demonstrations with gold labels, and demonstrations with random labels in classification (top) and multi-choice tasks (bottom). The first eight models are evaluated on 16 classification and 10 multi-choice datasets, and the last four models are evaluated on 3 classification and 3 multi-choice datasets. See Figure 11 for numbers comparable across all models. Model performance with random labels is very close to performance with gold labels (more discussion in Section 4.1).
largest dense LM (GPT-3) and the largest publicly released dense LM (fairseq 13B) at the time of conducting experiments. We also include MetaICL, which is initialized from GPT-2 Large and then meta-trained on a collection of supervised datasets with an in-context learning objective, and ensure that our evaluation datasets do not overlap with those used at meta-training time.

Evaluation Data. We evaluate on 26 datasets, including sentiment analysis, paraphrase detection, natural language inference, hate speech detection, question answering, and sentence completion (full list and references provided in Appendix A).¹ All datasets are classification and multi-choice tasks. We use these datasets because they (1) are true low-resource datasets with less than 10K training examples, (2) include well-studied benchmarks from GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019a), and (3) cover diverse domains including science, social media, finance, and more.

Other Details. We use k = 16 examples as demonstrations by default for all experiments in the paper, unless otherwise specified. Examples are sampled at uniform from the training data. We choose a set of k training examples using 5 different random seeds and run experiments 5 times. For fairseq 13B and GPT-3, due to limited resources, we experiment with a subset of 6 datasets² and 3 random seeds. We report Macro-F1³ for classification tasks and Accuracy for multi-choice tasks. We compute the per-dataset average over seeds, and then report the macro-average over datasets. We use the minimal templates in forming an input sequence from an example. We refer to Appendix B for more details. All experiments are reproducible from github.com/Alrope123/rethinking-demonstrations.

¹ For convenience, we use 'labels' to refer to the output for the task, though our datasets include non-classification tasks.
² Three classification and three multi-choice: MRPC, RTE, Tweet_eval-hate, OpenbookQA, CommonsenseQA, COPA.
³ Known to be better for imbalanced classes.

4 Ground Truth Matters Little

4.1 Gold labels vs. random labels

To see the impact of correctly-paired inputs and labels in the demonstrations—which we call the ground truth input-label mapping—we compare the following three methods.⁴

No demonstrations is a typical zero-shot method that does not use any labeled data. A prediction is made via argmax_{y∈C} P(y | x), where x is the test input and C is a small discrete set of possible labels.

Demonstrations w/ gold labels are used in a typical in-context learning method with k labeled examples (x1, y1), ..., (xk, yk). A concatenation of the k input-label pairs is used to make a prediction via argmax_{y∈C} P(y | x1, y1, ..., xk, yk, x).

Demonstrations w/ random labels are formed from the same inputs x1, ..., xk in the labeled data, with each xi (1 ≤ i ≤ k) paired with a label ỹi sampled at uniform from C instead of its gold label.

⁴ Without loss of generality, all methods in Section 4 and 5 are described based on the direct method, but can be trivially converted to the channel method by flipping x and y.
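For concreteness, the direct scoring above can be written in a few lines. The following is a minimal sketch rather than the paper's released implementation (see the repository linked above): it assumes a small HuggingFace causal LM as a stand-in for the much larger models studied here, and the separator strings, helper names, and the k = 3 example from Figure 2 are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in LM; the paper evaluates much larger models (MetaICL, GPT-J, fairseq, GPT-3).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(prefix: str, continuation: str) -> float:
    """Sum of log-probabilities the LM assigns to `continuation` given `prefix`.
    (A careful implementation would also handle tokenization at the boundary,
    e.g., leading spaces; this sketch ignores that detail.)"""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # next-token distributions
    start = prefix_ids.shape[1] - 1  # first position whose prediction is a continuation token
    return sum(log_probs[start + j, tok].item()
               for j, tok in enumerate(cont_ids[0].tolist()))

def predict_direct(demos, test_input, label_set):
    """Direct method: argmax over y in C of P(y | x1, y1, ..., xk, yk, x)."""
    prompt = "".join(f"{x}\n{y}\n\n\n" for x, y in demos) + f"{test_input}\n"
    return max(label_set, key=lambda y: completion_logprob(prompt, y))

# k = 3 demonstrations, as in Figure 2; the random-label condition of Section 4
# would simply replace each gold label with one drawn at uniform from the label set.
demos = [
    ("Circulation revenue has increased by 5% in Finland.", "Positive"),
    ("Panostaja did not disclose the purchase price.", "Neutral"),
    ("Paying off the national debt will be extremely painful.", "Negative"),
]
print(predict_direct(demos,
                     "The acquisition will have an immediate positive impact.",
                     ["Positive", "Neutral", "Negative"]))
```

The newline separators mirror the GPT-J-style format described in Appendix B; other model families use the separators listed there.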
Figure 4: Results with varying number of correct labels in the demonstrations (100%, 75%, 50%, 25%, and 0% correct). Channel and Direct used for classification and multi-choice, respectively. Performance with no demonstrations (blue) is reported as a reference.
Figure 5: Ablations on varying numbers of examples in the demonstrations (k). Models that are the best under 13B in each task category (Channel MetaICL and Direct GPT-J, respectively) are used.

Results are reported in Figure 3. First, using the demonstrations with gold labels significantly improves the performance over no demonstrations,⁵ as has been consistently found in much of prior work (Brown et al., 2020; Zhao et al., 2021; Liu et al., 2021). We then find that replacing gold labels with random labels only marginally hurts performance. The trend is consistent over nearly all models: models see performance drops in the range of 0–5% absolute. There is less impact from replacing labels in multi-choice tasks (1.7% on average) than in classification tasks (2.6% absolute). This result indicates that the ground truth input-label pairs are not necessary to achieve performance gains. This is counter-intuitive, given that correctly paired training data is critical in typical supervised training—it informs the model of the expected input-label correspondence required to perform the downstream task. Nonetheless, the models do achieve non-trivial performance on the downstream tasks. This strongly suggests that the models are capable of recovering the expected input-label correspondence for the task; however, that correspondence does not come directly from the pairings in the demonstrations.

It is also worth noting that there is particularly little performance drop in MetaICL: 0.1–0.9% absolute. This suggests that meta-training with an explicit in-context learning objective actually encourages the model to essentially ignore the input-label mapping and exploit other components of the demonstrations (more discussion in Section 5.4). In Appendix C.2, we provide additional results showing that (1) selecting random labels from a true distribution of labels (instead of a uniform distribution) reduces the gap even further, and (2) the trends may depend on the dataset, although the overall trend is consistent over most datasets.

4.2 Ablations

For additional ablations, we experiment with 5 classification and 4 multi-choice datasets.⁶

Does the number of correct labels matter? To further examine the impact of correctness of labels in the demonstrations, we conduct an ablation study by varying the number of correct labels in the demonstrations. We evaluate "Demonstrations w/ a% correct labels" (0 ≤ a ≤ 100), which consist of k × a/100 correct pairs and k × (1 − a/100) incorrect pairs (see Algorithm 1 in Appendix B). Here, a = 100 is the same as typical in-context learning, i.e., demonstrations w/ gold labels.

Results are reported in Figure 4. Model performance is fairly insensitive to the number of correct labels in the demonstrations. In fact, always using incorrect labels significantly outperforms no-demonstrations, e.g., preserving 92%, 100% and 97% of the improvements from using the demonstrations with MetaICL in classification, MetaICL in multi-choice, and GPT-J in multi-choice, respectively.

⁵ There are some exceptions, e.g., in the classification tasks, the Direct GPT-2, Direct GPT-J and Direct fairseq 6.7B models are not significantly better than random guessing on many datasets; Channel fairseq 13B has significantly better no-demonstrations performance compared to demonstrations with gold labels. We thus discuss the results from these models less extensively in the rest of the analysis.
⁶ Classification includes MRPC, RTE, Tweet_eval-hate, SICK, and poem-sentiment; multi-choice includes OpenbookQA, CommonsenseQA, COPA, and ARC.
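The figures above compare a direct and a channel variant of each model. As footnote 4 notes, any of the methods in this section can be converted to the noisy channel method of Min et al. (2021a) by flipping x and y, so that the LM scores the test input conditioned on a candidate label. The following sketch of that conversion reuses the hypothetical completion_logprob helper and demos list from the earlier sketch, with the same illustrative separators.

```python
def predict_channel(demos, test_input, label_set):
    """Channel method: flip each demonstration to label-then-input and pick the
    candidate label y maximizing P(test_input | y1, x1, ..., yk, xk, y)."""
    flipped = "".join(f"{y}\n{x}\n\n\n" for x, y in demos)
    return max(label_set,
               key=lambda y: completion_logprob(flipped + f"{y}\n", test_input))

print(predict_channel(demos,
                      "The acquisition will have an immediate positive impact.",
                      ["Positive", "Neutral", "Negative"]))
```

One practical consequence of this design is that the channel score compares probabilities of the same input string under different conditioning labels, rather than probabilities of different label strings.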
Figure 6: Results with minimal templates and manual templates. '+T' indicates that manual templates are used. Channel and Direct used for classification and multi-choice, respectively.
[Figure: an example set of demonstrations ("Circulation revenue has increased by 5% in Finland. \n Positive", "Panostaja did not disclose the purchase price. \n Neutral", "Paying off the national debt will be extremely painful. \n Negative") annotated with the aspects considered in Section 5: the distribution of inputs, the label space, and the format (the use of input-label pairs).]

Figure 8: Impact of the distribution of the inputs. Evaluated in classification (top) and multi-choice (bottom). The impact of the distribution of the input text can be measured by comparing ■ and ■. The gap is substantial, with an exception in Direct MetaICL (discussion in Section 5.1).
We experiment with the same five classification and four multi-choice datasets as in Section 4.2. See Appendix B and Table 4 for implementation details and example demonstrations, respectively.

5.1 Impact of the distribution of the input text

We experiment with OOD demonstrations which include out-of-distribution (OOD) text instead of the inputs from unlabeled training data. Specifically, a set of k sentences {xi,rand} (1 ≤ i ≤ k) is randomly sampled from an external corpus and replaces x1, ..., xk in the demonstrations. This variant assesses the impact of the distribution of the input text, while keeping the label space and the format of the demonstrations.

Results. Figure 8 shows that using out-of-distribution inputs instead of the inputs from the training data significantly drops the performance when Channel MetaICL, Direct GPT-J or Channel GPT-J are used, both in classification and multi-choice, by 3–16% absolute. In the case of Direct GPT-J in multi-choice, it is even significantly worse than no demonstrations. Direct MetaICL is an exception, which we think is the effect of meta-training (discussion in Section 5.4).

This suggests that in-distribution inputs in the demonstrations substantially contribute to performance gains. This is likely because conditioning on in-distribution text makes the task closer to language modeling, since the LM always conditioned on in-distribution text during training.

5.2 Impact of the label space

We also experiment with demonstrations w/ random English words that use random English words as labels for all k pairs. Specifically, we sample a random subset of English words Crand, where |Crand| = |C|, and randomly pair ỹi ∈ Crand with xi. This variant assesses the impact of the label space, while keeping the distribution of the input text and the format of the demonstrations.

Results. Based on Figure 9, direct models and channel models exhibit different patterns. With direct models, the performance gap between using random labels within the label space and using random English words is significant, ranging between 5–16% absolute. This indicates that conditioning on the label space significantly contributes to performance gains. This is true even for multi-choice tasks where there is no fixed set of labels—we hypothesize that multi-choice tasks still do have a particular distribution of the choices (e.g., objects like "Bolts" or "Screws" in the OpenBookQA dataset) that the model uses.

On the other hand, removing the output space does not lead to a significant drop in the channel models: there is a 0–2% drop in absolute, or sometimes even an increase. We hypothesize that this is because the channel models only condition on the labels, and thus are not benefiting from knowing the label space. This is in contrast to direct models, which must generate the correct labels.
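To make the two variants above concrete, here is a small self-contained sketch of how such demonstrations can be built from a set of gold demonstrations. The corpus sentences and the word list below are tiny stand-ins: per Appendix B, the paper samples OOD inputs from CC-News and random English words from the english-words package, neither of which is reproduced here.

```python
import random

rng = random.Random(0)

def ood_demonstrations(demos, corpus_sentences):
    """OOD demonstrations (Section 5.1): keep the labels, but replace each input
    with a sentence sampled from an external corpus."""
    sampled = rng.sample(corpus_sentences, len(demos))
    return [(x_rand, y) for x_rand, (_, y) in zip(sampled, demos)]

def random_word_demonstrations(demos, english_words, label_set):
    """Demonstrations w/ random English words (Section 5.2): replace the label set C
    with a random subset C_rand of English words with |C_rand| = |C|, and pair each
    input with a word drawn at uniform from C_rand."""
    c_rand = rng.sample(english_words, len(label_set))
    return [(x, rng.choice(c_rand)) for x, _ in demos]

demos = [
    ("Circulation revenue has increased by 5% in Finland.", "Positive"),
    ("Panostaja did not disclose the purchase price.", "Neutral"),
    ("Paying off the national debt will be extremely painful.", "Negative"),
]
corpus = ["Officials announced the new schedule on Monday.",
          "The committee will meet again next week.",
          "Rainfall was above average across the region."]
words = ["lamp", "orbit", "canvas", "meadow", "sprint"]

print(ood_demonstrations(demos, corpus))
print(random_word_demonstrations(demos, words, ["Positive", "Neutral", "Negative"]))
```

Either variant is then concatenated and scored exactly as in the earlier sketches.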
Variant | F | L | I | M
Gold labels | ✓ | ✓ | ✓ | ✓
Random labels | ✓ | ✓ | ✓ | ✗
Random English words | ✓ | ✗ | ✓ | ✗
No demonstrations | ✗ | ✗ | ✗ | ✗
(F: Format, L: Label space, I: Input distribution, M: Input-Label Mapping)

Figure 9: Impact of the label space. Evaluated in classification (top) and multi-choice (bottom). The impact of the label space can be measured by comparing ■ and ■. The gap is significant in the direct models but not in the channel models (discussion in Section 5.2).
Variant | F | L | I | M
Gold labels | ✓ | ✓ | ✓ | ✓
Random labels | ✓ | ✓ | ✓ | ✗
OOD + Random labels | ✓ | ✓ | ✗ | ✗
Random labels only | ✗ | ✓ | ✗ | ✗
Random English words | ✓ | ✗ | ✓ | ✗
No labels | ✗ | ✗ | ✓ | ✗
No demonstrations | ✗ | ✗ | ✗ | ✗
(F: Format, L: Label space, I: Input distribution, M: Input-Label Mapping)

Figure 10: Impact of the format, i.e., the use of the input-label pairs. Evaluated in classification (top) and multi-choice (bottom). Variants of demonstrations without keeping the format (■ and ■) are overall not better than no demonstrations (■). Keeping the format is especially significant when it is possible to achieve substantial gains with the label space but without the inputs (■ vs. ■ in Direct MetaICL), or with the input distribution but without the labels (■ vs. ■ in Channel MetaICL and Channel GPT-J). More discussion in Section 5.3.
5.3 Impact of input-label pairing

Sections 5.1 and 5.2 focus on variants which keep the format of the demonstrations as much as possible. This section explores variants that change the format. While there are many aspects of the format, we make minimal modifications to remove the pairings of inputs to labels. Specifically, we evaluate demonstrations with no labels, where the LM is conditioned on the concatenation of x1, ..., xk, and demonstrations with labels only, where the LM is conditioned on the concatenation of y1, ..., yk. These ablations provide the no-format counterparts of the 'demonstrations with random English words' and 'demonstrations with OOD inputs', respectively.

Results. Based on Figure 10, removing the format is close to or worse than no demonstrations, indicating the importance of the format. This is likely because conditioning on a sequence of input-label pairs triggers the model to mimic the overall format and complete the new example as expected when the test input is given.

More interestingly, keeping the format plays a significant role in retaining a large portion of the performance gains when only using the inputs or only using the labels. For instance, with Direct MetaICL, it is possible to retain 95% and 82% of the improvements from in-context learning (demonstrations with gold labels) by simply sampling random sentences from a corpus and randomly pairing them with the label set (■ in Figure 10) in classification and multi-choice, respectively. Similarly, with the channel models, it is possible to retain 82%, 87%, 86% and 75% of the improvements from in-context learning by simply pairing each input from the unlabeled training data with a random English word (■ in Figure 10) in MetaICL classification, GPT-J classification, MetaICL multi-choice and GPT-J multi-choice, respectively. For all of these cases, removing the inputs instead of using OOD inputs, or removing the labels instead of using random English words, is significantly worse, indicating that keeping the format of the input-label pairs is key.

5.4 Impact of meta-training

Different from other models, MetaICL is trained with an in-context learning objective, in line with recent work that uses multi-task training on a
large collection of supervised datasets (called meta-training) for generalization to new tasks (Aghajanyan et al., 2021; Khashabi et al., 2020; Wei et al., 2022a; Sanh et al., 2022). We aim to better understand the role of this meta-training in relation to our findings by closely examining the result of MetaICL. In particular, we observe that the patterns we see so far are significantly more evident with MetaICL than with other models. For instance, the ground truth input-label mapping matters even less, and keeping the format of the demonstrations matters even more. There is nearly zero influence of the input-label mapping and the input distribution in Direct MetaICL, and of the input-label mapping and the output space in Channel MetaICL.

Based on this observation, we hypothesize that meta-training encourages the model to exclusively exploit simpler aspects of the demonstrations and to ignore others. This is based on our intuition that (1) the input-label mapping is likely harder to exploit, (2) the format is likely easier to exploit, and (3) the space of the text that the model is trained to generate is likely easier to exploit than the space of the text that the model conditions on.⁸

6 Discussion & Conclusion

In this paper, we study the role of the demonstrations with respect to the success of in-context learning. We find that the ground truth input-label mapping in the demonstrations matters significantly less than one might think—replacing gold labels with random labels in the demonstrations only marginally lowers the performance. We then identify a series of aspects in the demonstrations and examine which aspects actually contribute to performance gains. Results reveal that (1) gains mainly come from independent specification of the input space and the label space, (2) the models can still retain up to 95% of the performance gains by using either the inputs only or the label set only if the right format is used, and (3) meta-training with an in-context learning objective magnifies these trends. Together, our findings lead to a set of broader indications about in-context learning, as well as avenues for future work.

Does the model learn at test time? If we take a strict definition of learning (capturing the input-label correspondence given in the training data), then our findings suggest that LMs do not learn new tasks at test time. Our analysis shows that the model may ignore the task defined by the demonstrations and instead use priors from pretraining. However, learning a new task can be interpreted more broadly: it may include adapting to the specific input and label distributions and the format suggested by the demonstrations, and ultimately getting to make predictions more accurately. With this definition of learning, the model does learn the task from the demonstrations. Our experiments indicate that the model does make use of aspects of the demonstrations and achieves performance gains.

Capacity of LMs. The model performs a downstream task without relying on the input-label correspondence from the demonstrations. This suggests that the model has learned the (implicit notion of) input-label correspondence from the language modeling objective alone, e.g., associating a positive review with the word 'positive'. This is in line with Reynolds and McDonell (2021), who claim that the demonstrations are for task location and that the intrinsic ability to perform the task is obtained at pretraining time.⁹

On one hand, this suggests that the language modeling objective has led to great zero-shot capacity, even if it is not always evident from naive zero-shot accuracy. On the other hand, it suggests that in-context learning may not work on a task whose input-label correspondence is not already captured in the LM. This leads to the research question of how to make progress on NLP problems that in-context learning does not solve: whether we need a better way of extracting the input-label mappings that are already stored in the LM, a better variant of the LM objective that learns a wider range of task semantics, or explicit supervision through fine-tuning on labeled data.

⁸ That is, the direct model exploits the label space better than the input distribution, and the channel model exploits the input distribution better than the label space.
⁹ However, while Reynolds and McDonell (2021) claim that the demonstrations are thus unnecessary, we think using the demonstrations is actually the most unambiguous and the easiest way to prompt the model to perform a task.

Connection to instruction-following models. Prior work has found it promising to train a model that reads a natural language description of the task (called instructions) and performs a new task at inference (Mishra et al., 2021b; Efrat and Levy, 2020; Wei et al., 2022a; Sanh et al., 2022). We think the demonstrations and instructions largely have the same role for LMs, and hypothesize that our findings hold for instruction-following models: the
instructions prompt the model to recover the capacity it already has, but do not supervise the model to learn novel task semantics. This has been partially verified by Webson and Pavlick (2022), who showed that the model performance does not degrade much with irrelevant or misleading instructions. We leave more analysis of instruction-following models for future work.

Significantly improved zero-shot performance. One of our key findings is that it is possible to achieve nearly k-shot performance without using any labeled data, by simply pairing each unlabeled input with a random label and using it as the demonstrations. This means our zero-shot baseline level is significantly higher than previously thought.¹⁰ Future work can further improve the zero-shot performance with relaxed assumptions on access to the unlabeled training data.

Limitation

Effect of types of tasks and datasets. This paper focuses on tasks from established NLP benchmarks that have real natural language inputs. Synthetic tasks with more limited inputs may actually use the ground truth labels more, as observed by Rong (2021).

We report macro-level analysis by examining the average performance over multiple NLP datasets, but different datasets may behave differently. Appendix C.2 discusses this aspect, including findings that there are larger gaps between using the ground truth labels and using the random labels in some dataset-model pairs (e.g., in the most extreme case, nearly 14% absolute on the financial_phrasebank dataset with GPT-J). Since the first version of our paper, Kim et al. (2022) showed that using negated labels substantially lowers the performance in classification.¹¹ We believe it is important to understand to what extent the model needs the ground truth labels to successfully perform in-context learning.

Extensions to generation. Our experiments are limited to classification and multi-choice tasks. We hypothesize that ground truth output may not be necessary for in-context learning in open-set tasks such as generation, but leave this to future work. Extending our experiments to such tasks is not trivial, because it requires a variation of the output which has incorrect input-output correspondence while keeping the correct output distribution (which is important based on our analysis in Section 5).

Since the first version of our paper, Madaan and Yazdanbakhsh (2022) conducted a similar analysis with chain of thought prompting (Wei et al., 2022b), which generates a rationale to perform complex tasks such as math problems. Madaan and Yazdanbakhsh (2022) show that, while simply using a random rationale in the demonstrations (e.g., pairing with a rationale from a different example) significantly degrades the performance, other types of counterfactual rationales (e.g., wrong equations) do not degrade the performance as much as we thought. We refer to Madaan and Yazdanbakhsh (2022) for more discussion of what aspects of the rationale matter or do not matter.

¹⁰ We take the perspective that using the unlabeled training data is permitted (Kodirov et al., 2015; Wang et al., 2019b; Schick and Schütze, 2021).
¹¹ Note that Kim et al. (2022) estimate the random label performance by interpolating with the performance using negated labels, while our paper samples the random labels at uniform.

Acknowledgements

We thank Gabriel Ilharco, Julian Michael, Ofir Press, UW NLP members and anonymous reviewers for their comments on the paper. This research was supported by NSF IIS-2044660, ONR N00014-18-1-2826, a Sloan fellowship and gifts from AI2.

References

Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. 2021. Muppet: Massive multi-task representations with pre-finetuning. arXiv preprint arXiv:2101.11038.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. 2021. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of EMNLP.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.

Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2021. Meta-learning via language model in-context tuning. arXiv preprint arXiv:2110.07814.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop.

Ona de Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2).

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

T. Diggelmann, Jordan L. Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. Climate-FEVER: A dataset for verification of real-world climate claims. ArXiv.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Avia Efrat and Omer Levy. 2020. The Turking Test: Can language models understand instructions? arXiv preprint arXiv:2010.11982.

L Gao, S Biderman, S Black, L Golding, T Hoppe, C Foster, J Phang, H He, A Thite, N Nabeshima, et al. 2021. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In The First Joint Conference on Lexical and Computational Semantics (SemEval).

Ari Holtzman, Peter West, Vered Schwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn't always right. In EMNLP.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, Kang Min Yoo, and Taeuk Kim. 2022. Ground-truth labels matter: A deeper look into input-label demonstrations. arXiv preprint arXiv:2205.12685.

Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid,
Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A community library for natural language processing. In EMNLP: System Demonstrations.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Robert L Logan IV, Ivana Balaževic, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2021. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.

Aman Madaan and Amir Yazdanbakhsh. 2022. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686.

Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC.

Clara H. McCreery, Namit Katariya, Anitha Kannan, Manish Chablani, and Xavier Amatriain. 2020. Effective transfer learning for identifying similar questions: Matching user questions to COVID-19 FAQs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2021a. Noisy channel language model prompting for few-shot text classification. arXiv preprint arXiv:2108.04106.

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2021b. MetaICL: Learning to learn in context. arXiv preprint.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2021a. Reframing instructional prompts to GPTk's language. arXiv preprint arXiv:2109.07830.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021b. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. ETHOS: An online hate speech detection dataset. ArXiv.

Sebastian Nagel. 2016. CC-News. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.

Frieda Rong. 2021. Extrapolating to unnatural language processing with GPT-3's in-context learning: The good, the bad, and the mysterious. https://ai.stanford.edu/blog/in-context-learning.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization. In ICLR.
Timo Schick and Hinrich Schütze. 2021. It's not just size that matters: Small language models are also few-shot learners. In NAACL-HLT.

Emily Sheng and David Uthus. 2020. Investigating societal biases in a poetry composition system. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. TACL.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An explanation of in-context learning as implicit Bayesian inference. In ICLR.

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. 2021. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In EMNLP.

Tony Z Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In ICML.
A Full Datasets

We include 26 datasets as follows: financial_phrasebank (Malo et al., 2014), poem_sentiment (Sheng and Uthus, 2020), medical_questions_pairs (McCreery et al., 2020), glue-mrpc (Dolan and Brockett, 2005), glue-wnli (Levesque et al., 2012), climate_fever (Diggelmann et al., 2020), glue-rte (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), superglue-cb (de Marneffe et al., 2019), sick (Marelli et al., 2014), hate_speech18 (de Gibert et al., 2018), ethos-national_origin (Mollas et al., 2020), ethos-race (Mollas et al., 2020), ethos-religion (Mollas et al., 2020), tweet_eval-hate (Barbieri et al., 2020), tweet_eval-stance_atheism (Barbieri et al., 2020), tweet_eval-stance_feminist (Barbieri et al., 2020), quarel (Tafjord et al., 2019a), openbookqa (Mihaylov et al., 2018), qasc (Khot et al., 2020), commonsense_qa (Talmor et al., 2019), ai2_arc (Clark et al., 2018), codah (Chen et al., 2019), superglue-copa (Gordon et al., 2012), dream (Sun et al., 2019), quartz-with_knowledge (Tafjord et al., 2019b), and quartz-no_knowledge (Tafjord et al., 2019b). The choice of datasets is made following the low-resource datasets in Min et al. (2021b), with the exact same set of k-shot train data using 5 random seeds. We use the HuggingFace version of the data (Lhoest et al., 2021) and use the development data for evaluation, following Ye et al. (2021). See Table 2 for statistics.

Dataset | # Train | # Eval
Task category: Sentiment analysis
financial_phrasebank | 1,811 | 453
poem_sentiment | 892 | 105
Task category: Paraphrase detection
medical_questions_pairs | 2,438 | 610
glue-mrpc | 3,668 | 408
Task category: Natural language inference
glue-wnli | 635 | 71
climate_fever | 1,228 | 307
glue-rte | 2,490 | 277
superglue-cb | 250 | 56
sick | 4,439 | 495
Task category: Hate speech detection
hate_speech18 | 8,562 | 2,141
ethos-national_origin | 346 | 87
ethos-race | 346 | 87
ethos-religion | 346 | 87
tweet_eval-hate | 8,993 | 999
tweet_eval-stance_atheism | 461 | 52
tweet_eval-stance_feminist | 597 | 67
Task category: Question answering
quarel | 1,941 | 278
openbookqa | 4,957 | 500
qasc | 8,134 | 926
commonsense_qa | 9,741 | 1,221
ai2_arc | 1,119 | 299
Task category: Sentence completion
codah | 1,665 | 556
superglue-copa | 400 | 100
dream | 6,116 | 2,040
quartz-with_knowledge | 2,696 | 384
quartz-no_knowledge | 2,696 | 384

Table 2: 26 datasets used for experiments, classified into 6 task categories. # Train and # Eval indicate the number of training and evaluation examples of the dataset. Note that # Train is based on the original training dataset, but we use k random samples for k-shot evaluation.

B Experimental Details

Example template. We follow Ye et al. (2021); Min et al. (2021b); Logan IV et al. (2021) in using the minimal format to transform the input to a sequence (e.g., a concatenation of multiple inputs) and using the label words from each dataset as they are. We also explore manual templates taken from prior work (Holtzman et al., 2021; Zhao et al., 2021) as reported in Section 4.2, although we find that using these templates is not consistently better than using minimal templates. We thus run the main experiments with minimal templates. Example templates are provided in Table 3.

Format of the demonstrations. We follow the standard of each model for formatting the demonstrations, either from exploration in prior work or the example code provided in the official tutorial. For GPT-2, we separate the input and the label, and each demonstration example, with a space. For MetaICL, GPT-J and GPT-3, we separate the input and the label with a newline (\n), and each demonstration example with three newlines. For fairseq models, we use a newline to separate the input and the label as well as each demonstration example.

Details in variants of the demonstrations. For "demonstrations w/ a% accurate labels" (0 ≤ a ≤ 100), we use k × a/100 correct pairs and k × (1 − a/100) incorrect pairs in a random order, as described in Algorithm 1. For "OOD demonstrations", we use CC-News (Nagel, 2016) as an external corpus. We consider the length of the text during sampling, so that sampled sentences have a similar length to the test input. For "demonstrations with random English words", we use pypi.org/project/english-words for the set of English words, which consists of 61,569 words. Table 4 provides a list of example demonstrations for each method used in Section 5.

Algorithm 1: Forming the demonstrations with an accuracy of a%.
1: procedure FormDemons({(xi, yi)} for i = 1..k, a)
2:   D ← []                        // demonstrations to be formed
3:   n ← k × a/100                 // number of correct pairs
4:   G ← Sample(Range(1, k), n)
5:   for i ∈ Range(1, k) do
6:     if i ∈ G then               // add correct pair
7:       D.append((xi, yi))
8:     else                        // add incorrect pair
9:       D.append((xi, Sample(C − yi)))
10:  return D
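For readers who prefer runnable code over pseudocode, the following is our own Python rendering of Algorithm 1; the function and variable names, the random seed handling, and the toy example are illustrative and do not come from the released implementation.

```python
import random

def form_demons(pairs, label_set, a, seed=0):
    """Algorithm 1: form demonstrations with an accuracy of a%.

    pairs: list of k (input, gold_label) tuples; label_set: the label set C;
    a: percentage of demonstrations that keep their gold label (0 <= a <= 100).
    """
    rng = random.Random(seed)
    k = len(pairs)
    n = round(k * a / 100)                      # number of correct pairs (exact for k=16, a in {0,25,50,75,100})
    correct_idx = set(rng.sample(range(k), n))  # G <- Sample(Range(1, k), n)
    demos = []
    for i, (x, y) in enumerate(pairs):
        if i in correct_idx:
            demos.append((x, y))                # add correct pair
        else:
            demos.append((x, rng.choice([c for c in label_set if c != y])))  # add incorrect pair
    return demos

pairs = [("great plot and acting", "positive"), ("a complete waste of time", "negative"),
         ("it was fine, nothing special", "neutral"), ("i loved every minute", "positive")]
print(form_demons(pairs, ["positive", "neutral", "negative"], a=50))
```

Formatting each returned (input, label) pair with the per-model separators described above then yields the prompts evaluated in Figure 4.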
Figure 11: Results of No-demonstration, Gold demonstration and Random demonstration on 3 classification datasets (top) and 3 multi-choice datasets (bottom). Details in Section 4.1. This figure is for providing numbers that are comparable across models—full results with more datasets are reported in Figure 3.
C.1 Gold labels vs. random labels

Figure 11 shares the same interface as Figure 3, but all models are evaluated on 3 classification and 3 multi-choice datasets and are thus comparable to each other.

C.2 Random labels from true distribution of labels & Task breakdown

In Section 4, random labels are sampled from the label space from a uniform distribution. We experiment with another variant of demonstrations in the classification tasks, where labels are randomly sampled from the true distribution of labels on the training data. This may have a large impact if labels are far from uniform on the training data. Results indicate that the performance drop from using gold labels is further reduced compared to using uniformly random labels: with Channel MetaICL, the gap is reduced from 1.9% to 1.3% absolute, and with Channel GPT-J, the gap is reduced from 5.0% to 3.5% absolute.

Figure 12 shows the performance gap between using gold labels and using random labels per dataset. We find that the trend that the gap is smaller than previously thought is consistent across most datasets. Nonetheless, there are a few outlier datasets where the performance gap is non-negligible, such as financial_phrasebank and a few hate speech detection datasets. Future work may investigate on which tasks the model makes more use of the correctly paired training data.

We explored demonstrations with a constant label where all labels in the demonstrations are replaced with a constant text, "answer". Specifically, a prediction is made via argmax_{y∈C} P(y | x1, answer, ..., xk, answer, x). This can be viewed as another way to remove the impact of the label space while keeping the impact of the distribution of the input text. However, results are consistently worse than the results of demonstrations with random English words. We think this is because constant labels actually change the format of the demonstrations, since they can be viewed as part of a separator between different demonstration examples.

We also explored demonstrations with the test input where all inputs in the demonstrations are replaced with the test input, each paired with a
random label. Specifically, a prediction is made via argmax_{y∈C} P(y | x, ỹ1, ..., x, ỹk, x), where ỹi (1 ≤ i ≤ k) is randomly sampled at uniform from C. This variant is seemingly a reasonable choice given that it satisfies the condition that the inputs in the demonstrations come from the same distribution as the test input (since they are identical), and using random labels is as good as using gold labels. Nonetheless, we find that this variant is significantly worse than most other methods with demonstrations. We think this is because using a constant input for all demonstration examples significantly changes the format of the sequence, since the input can be viewed as part of a separator between different demonstration examples.
Figure 12: Performance gap from using the demonstrations with gold labels to using the demonstrations with random labels. Datasets are sorted in descending order. The top two figures use random labels that are sampled at uniform, with Channel MetaICL and Channel GPT-J, respectively. The bottom two figures use random labels that are sampled from a true distribution of labels on the training data, with Channel MetaICL and Channel GPT-J, respectively.
Dataset | Type | Example
MRPC | Minimal | sentence 1: Cisco pared spending to compensate for sluggish sales . [SEP] sentence 2: In response to sluggish sales , Cisco pared spending . \n {equivalent|not_equivalent}
MRPC | Manual | Cisco pared spending to compensate for sluggish sales . \n The question is: In response to sluggish sales , Cisco pared spending . True or False? \n The answer is: {True|False}
RTE | Minimal | sentence 1: The girl was found in Drummondville. [SEP] sentence 2: Drummondville contains the girl. \n {entailment|not_entailment}
RTE | Manual | The girl was found in Drummondville. \n The question is: Drummondville contains the girl. True or False? \n The answer is: {True|False}
Tweet_eval-hate | Minimal | The Truth about #Immigration \n {hate|non-hate}
Tweet_eval-hate | Manual | Tweet: The Truth about #Immigration \n Sentiment: {against|favor}
SICK | Minimal | sentence 1: A man is screaming. [SEP] sentence 2: A man is scared. \n {contradiction|entailment|neutral}
SICK | Manual | A man is screaming. \n The question is: A man is scared. True or False? \n The answer is: {False|True|Not sure}
poem-sentiment | Minimal | willis sneered: \n {negative|no_impact|positive}
poem-sentiment | Manual | willis sneered: \n The sentiment is: {negative|no_impact|positive}
OpenbookQA | Minimal | What creates a valley? \n {feet|rock|water|sand}
OpenbookQA | Manual | The question is: What creates a valley? \n The answer is: {feet|rock|water|sand}
CommonsenseQA | Minimal | What blocks sunshine? \n {summer|park|desktop|sea|moon}
CommonsenseQA | Manual | The question is: What blocks sunshine? \n The answer is: {summer|park|desktop|sea|moon}
COPA | Minimal | Effect: I coughed. \n {Cause: I inhaled smoke.|Cause: I lowered my voice.}
COPA | Manual | I coughed because {I inhaled smoke.|I lowered my voice.}
ARC | Minimal | Which biome has the most vegetation? \n {desert|forest|grassland|tundra}
ARC | Manual | The question is: Which biome has the most vegetation? \n The answer is: {desert|forest|grassland|tundra}

Table 3: A list of minimal templates taken from Ye et al. (2021); Min et al. (2021b) and manual templates taken from Holtzman et al. (2021); Zhao et al. (2021). Details provided in Appendix B. See Figure 6 for discussion of empirical results. The input and the label are in the red text and in the blue text, respectively. Note that | is used to separate different options for the labels.

Table 4: Example demonstrations when using methods in Section 5. The financial_phrasebank dataset with C = {"positive", "neutral", "negative"} is used. Red text indicates the text is sampled from an external corpus; blue text indicates the labels are randomly sampled from the label set; purple text indicates a random English word.