0% found this document useful (0 votes)

20 views21 pages

Transfer Prompt

Uploaded by

Trần Khiêm

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views21 pages

Transfer Prompt

Uploaded by

Trần Khiêm

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

Tu Vu1,2F Brian Lester1

Noah Constant1 Rami Al-Rfou1 Daniel Cer1
1
Google Research
University of Massachusetts Amherst2
{ttvu,brianlester,nconstant,rmyeid,cer}@google.com
[email protected]

Abstract 100

There has been growing interest in parameter- 90

efficient methods to apply pre-trained lan-

SuperGLUE Score
guage models to downstream tasks. Build- 80
ing on the P ROMPT T UNING approach of Lester
et al. (2021), which learns task-specific soft
prompts to condition a frozen pre-trained 70
model to perform different tasks, we propose a P D s (GPT-3)
novel prompt-based transfer learning approach 60 M T
P T
called SP OT: Soft Prompt Transfer. SP OT M - sM T
first learns a prompt on one or more source 50 SP T (Ours)
tasks and then uses it to initialize the prompt 108 109 1010 1011
for a target task. We show that SP OT sig- Model Parameters
nificantly boosts the performance of P ROMPT-
T UNING across many tasks. More remarkably, Figure 1: Our SP OT approach—which transfers a
across all model sizes, SP OT matches or out- prompt learned from a mixture of source tasks (here,
performs standard M ODELT UNING (which fine- GLUE) onto target tasks—outperforms vanilla P ROMT-
tunes all model parameters) on the S UPER - T UNING (Lester et al., 2021) and GPT-3 (Brown et al.,
GLUE benchmark, while using up to 27,000× 2020) on S UPER GLUE by a large margin, matching or
fewer task-specific parameters. To understand outperforming M ODELT UNING across all model sizes.
where SP OT is most effective, we conduct a At the XXL model size, SP OT even outperforms M ULTI -
large-scale study on task transferability with TASK M ODELT UNING , which fine-tunes the entire model

26 NLP tasks in 160 combinations, and demon- on the GLUE mixture before fine-tuning it on individ-
strate that many tasks can benefit each other ual S UPER GLUE tasks. See Appendix A for full results.
via prompt transfer. Finally, we propose an
efficient retrieval approach that interprets task
of the model for each downstream task would be
prompts as task embeddings to identify similar
tasks and predict the most transferable source
prohibitively expensive. To get around the infeasi-
tasks for a novel target task. bility of fine-tuning, Brown et al. (2020) propose
P ROMPT D ESIGN, where every downstream task is
1 Introduction cast as a language modeling task and the frozen pre-
trained model performs different tasks by condition-
The past few years have seen the rapid developing on manual text prompts provided at inference
ment of ever larger pre-trained language models, time. They demonstrate impressive few-shot perfor-
where it has repeatedly been shown that scaling mance with a single frozen GPT-3 model, although
up the model size is a key ingredient for achiev- its performance depends highly on the choice of the
ing the best performance (Devlin et al., 2019; Raf- prompt (Zhao et al., 2021) and still lags far behind
fel et al., 2020; Brown et al., 2020). While this state-of-the-art fine-tuning results.
trend has continued to push the boundaries of pos-
More recent work explores methods for learn-
sibility across various NLP benchmarks, the sheer
ing soft prompts (Liu et al., 2021b; Qin and Eis-
size of these models presents a challenge for their
ner, 2021; Li and Liang, 2021; Lester et al., 2021),
practical application. For 100B+ parameter mod-
which can be seen as additional learnable parame-
els, fine-tuning and deploying a separate instance
ters injected into the language model. Lester et al.
F Work done during an internship at Google Research. (2021) propose P ROMPT T UNING, a simple method
5039
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
Volume 1: Long Papers, pages 5039 - 5059
May 22-27, 2022 c 2022 Association for Computational Linguistics
Source Prompt Tuning Target Prompt Tuning Target Source Task
Task Embeddings
Source Prompt
Initialization Library
❄ ❄ Target Task Query
🔥 🔥 Embedding Keys Values
Source Pre-trained Target Pre-trained
Prompt Model Prompt Model Task A
Source
Value
Prompt Source Task B
Target Prompts
Task A Task Target Task C
Initialization
Unsupervised Task
Task B 🔥 tuned
Task
Task C ❄ frozen Target
Prompt

Figure 2: An illustration of our generic (left) and targeted (right) SP OT approaches. Left: We learn a single
generic source prompt on one or more source tasks, which is then used to initialize the prompt for each target task.
Right: We learn separate prompts for various source tasks, saving early checkpoints as task embeddings and best
checkpoints as source prompts. These form the keys and values of our prompt library. Given a novel target task,
a user: (i) computes a task embedding, (ii) retrieves an optimal source prompt, and (iii) trains a target prompt,
initialized from the source prompt (see §3 for details).

that learns a small task-specific prompt (a sequence other via prompt transfer. To address (b), we inter-
of tunable tokens prepended to each example) for pret the learned task prompts as task embeddings to
each downstream task during adaptation to condi- construct a semantic space of tasks and formalize
tion the frozen language model to perform the task. the similarity between tasks. We design an efficient
Strikingly, as model capacity increases, P ROMPT- retrieval algorithm that measures task embedding
T UNING becomes competitive with M ODELT UNING, similarity, allowing practitioners to identify source
which fine-tunes the entire model on each down- tasks that will likely yield positive transfer.
stream task. Nevertheless, at smaller model sizes To summarize, our main contributions are:
(below 11B parameters), there are still large gaps (1) We propose SP OT, a novel prompt-based trans-
between P ROMPT T UNING and M ODELT UNING. fer learning approach, and show that scale is not
In this paper, we propose SP OT: Soft Prompt necessary for P ROMPT T UNING to match the perfor-
Transfer, a novel transfer learning approach in the mance of M ODELT UNING; on S UPER GLUE, SP OT
context of prompt tuning. SP OT first trains a prompt matches or beats M ODELT UNING across all model
on one or more source tasks, and then uses the re- sizes. (2) We conduct a large-scale and systematic
sulting prompt to initialize the prompt for a target study on task transferability, demonstrating con-
(downstream) task. Our experiments show that ditions under which tasks can benefit each other
SP OT offers significant improvements over P ROMPT-
via prompt transfer. (3) We propose an efficient re-
T UNING across tasks and model sizes. For instance,
trieval method that interprets task prompts as task
on the S UPER GLUE benchmark (Wang et al., 2019b), embeddings to construct a semantic space of tasks,
we obtain +10.1 and +2.4 point average accuracy and measures task embedding similarity to identify
improvements using the T5 BASE (220M parame- which tasks could benefit each other. (4) To fa-
ter) and T5 XXL (11B parameter) models (Raffel cilitate future work on prompt-based learning, we
et al., 2020), respectively. More importantly, SP OT will release our library of task prompts and pre-
is competitive with or outperforms M ODELT UNING trained models, and provide practical recommenda-
across all model sizes (see Figure 1). tions for adapting our library to NLP practitioners
at https://github.com/google-research/
Motivated by these results, we investigate trans- prompt-tuning/tree/main/prompt_tuning/
ferability between tasks, through the lens of soft spot.
task prompts. Our goal is to answer two questions:
(a) For a given target task, when does initializing 2 Improving P ROMPT T UNING with SP OT
the prompt from a source task boost performance?
(b) Can we use task prompts to efficiently predict To improve performance of P ROMPT T UNING on a
which source tasks will transfer well onto a novel target task, SP OT introduces source prompt tuning,
target task? To answer (a), we conduct a system- an intermediate training stage between language
atic study of the T5 model using 26 NLP tasks in model pre-training and target prompt tuning (Fig-
160 combinations of source and target tasks. Our ure 2, left), to learn a prompt on one or more source
results indicate that many tasks can benefit each tasks (while still keeping the base model frozen),
5040
which is then used to initialize the prompt for the S UPER GLUE (Wang et al., 2019b) benchmarks.4 We
target task.1 Our approach retains all the compu- train for a fixed number of steps and report results
tational benefits of P ROMPT T UNING: for each target on the validation set associated with each dataset.5
task, it only requires storing a small task-specific
prompt, enabling the reuse of a single frozen pre- 2.1.3 Data for source prompt tuning
trained model across all tasks. In this section, we As with language model pre-training, the choice of
present a generic SP OT approach where a single training data is crucial for successful prompt trans-
transferred prompt is reused for all target tasks. fer. To investigate the impact of source training
In §3, we explore a targeted approach that retrieves data on downstream performance, we compare a
different source prompts for different target tasks. diverse set of source tasks.

2.1 Experimental setup A single unsupervised learning task: We first

consider training the prompt on a fraction of the
Our frozen models are built on top of the pre- C4 (Colossal Clean Crawled Corpus) dataset (Raf-
trained T5 checkpoints of all sizes: S MALL, BASE, fel et al., 2020) using the “prefix LM” objective
L ARGE, XL, XXL with 60M, 220M, 770M, 3B, and discussed in Raffel et al. (2020). Although this
11B parameters, respectively. In our experiments task was used to pre-train our frozen T5 models al-
with SP OT, we leverage the LM adapted version of ready, it could still be helpful for learning a general-
T52 , which was found to be easier to optimize for purpose prompt.
P ROMPT T UNING (Lester et al., 2021).
A single supervised learning task: Alterna-
2.1.1 Baselines tively, we can train the prompt using a supervised
We compare SP OT to the following baselines: task. We use either MNLI (Williams et al., 2018) or
SQ UAD (Rajpurkar et al., 2016) as a single source
P ROMPT T UNING: The vanilla prompt tuning ap- task. MNLI was shown to be helpful for many
proach of Lester et al. (2021), where an indepen- sentence-level classification tasks (Phang et al.,
dent prompt is directly trained on each target task. 2019), while SQ UAD was found to generalize well
to QA tasks (Talmor and Berant, 2019).
M ODELT UNING & M ULTI - TASK M ODELT UNING: We
compare prompt tuning approaches to M ODELT UN - A multi-task mixture: So far, we have consid-
ING , the standard fine-tuning approach (Devlin ered using a single source task. An alternative
et al., 2019; Raffel et al., 2020), where all model approach is multi-task training. Within T5’s unified
parameters are fine-tuned on each target task sep- text-to-text framework, this simply corresponds to
arately. For an apples-to-apples comparison, we mixing different datasets together. We explore mix-
include M ULTI - TASK M ODELT UNING, a more competi- ing datasets from different NLP benchmarks or fam-
tive baseline that first fine-tunes the entire model ilies of tasks, including GLUE, S UPER GLUE, natural
on the same mixture of source tasks used for SP OT language inference (NLI), paraphrasing/semantic
before fine-tuning it on individual target tasks.3 similarity, sentiment analysis, question answering
(QA) on MRQA (Fisch et al., 2019), commonsense
2.1.2 Evaluation datasets reasoning on RAINBOW (Lourie et al., 2021), ma-
We study downstream performance on a diverse set chine translation, summarization, and natural lan-
of tasks from the GLUE (Wang et al., 2019c) and 4
These datasets include grammatical acceptability judg-
ments (C O LA (Warstadt et al., 2019)), sentiment analysis
1
The target task can be treated as one of the source tasks (SST-2 (Socher et al., 2013)), paraphrasing/semantic similar-
being mixed together. ity (MRPC (Dolan and Brockett, 2005), STS-B (Cer et al.,
2
T5 1.1 checkpoints trained for an additional 100K steps 2017), QQP (Iyer et al., 2017)), natural language inference
using the “prefix LM” objective (Raffel et al., 2020), avail- (MNLI (Williams et al., 2018), QNLI (Wang et al., 2019c),
able at https://github.com/google-research/ RTE (Dagan et al., 2005, et seq.), CB (De Marneffe et al.,
text-to-text-transfer-transformer/blob/ 2019)), coreference resolution (WSC (Levesque et al., 2012)),
main/released_checkpoints.md sentence completion (COPA (Roemmele et al., 2011)), word
3 sense disambiguation (W I C (Pilehvar and Camacho-Collados,
In preliminary experiments, we found that using the orig-
inal version of T5 1.1 (which was pre-trained exclusively on 2019)), and question answering (M ULTI RC (Khashabi et al.,
span corruption) for model tuning approaches results in better 2018), R E C O RD (Zhang et al., 2018), B OOL Q (Clark et al.,
performance than using the LM adapted version. We therefore 2019)). We exclude the problematic WNLI (Levesque et al.,
report results corresponding to the original T5 1.1 for M ODEL - 2012) dataset from GLUE, following Devlin et al. (2019).
5
T UNING and M ULTI - TASK M ODELT UNING. For tasks with multiple metrics, we average the metrics.

5041
guage generation on GEM (Gehrmann et al., 2021).6 Method GLUE S UPER GLUE
We create a mixture of source tasks from each of BASELINE
the NLP benchmarks/families of tasks above, and a P ROMPT T UNING 81.20.4 66.60.2
mixture comprising all datasets (C4 + 55 labeled − longer tuning 78.41.7 63.11.1
datasets), using the examples-proportional mixing SP OTwith different source mixtures
strategy in Raffel et al. (2020) with an artificial GLUE (8 tasks) 82.80.2 73.20.3
dataset size limit K = 219 examples. − longer tuning 82.00.2 70.70.4
C4 82.00.2 67.70.3
2.1.4 Training details MNLI 82.50.0 72.60.8
SQ UAD 82.20.1 72.00.4
We closely follow the training procedure in Lester
S UPER GLUE (8 tasks) 82.00.1 66.60.2
et al. (2021). Specifically, the only new parameters NLI (7 tasks) 82.60.1 71.40.2
introduced during both source and target prompt Paraphrasing/similarity (4 tasks) 82.20.1 69.70.5
tuning are a shared prompt ρ ∈ RL×E prepended Sentiment (5 tasks) 81.10.2 68.60.1
MRQA (6 tasks) 81.80.2 68.40.2
to each (embedded) input sequence, where L, E
RAINBOW (6 tasks) 80.30.6 64.00.4
are the prompt length and the embedding size, re- Translation (3 tasks) 82.40.2 65.30.1
spectively. In all cases, we set L = 100 tokens Summarization (9 tasks) 80.90.3 67.11.0
and tune the prompt for a fixed number of steps GEM (8 tasks) 81.90.2 70.50.5
S.7 While S is set to 30K in Lester et al. (2021), All (C4 + 55 supervised tasks) 81.80.2 67.90.9
we find that additional tuning is helpful on large
datasets. As such, we set S to 218 = 262,144, fol- Table 1: GLUE and S UPER GLUE results achieved by
applying T5 BASE with different prompt tuning ap-
lowing Raffel et al. (2020), with the exception of
proaches. We report the mean and standard deviation
ablation experiments (rows “− longer tuning”) in (in the subscript) across three random seeds. SP OT
Table 1 which use S = 30K. For source prompt significantly improves performance and stability of
tuning, the prompt token embeddings are initial- P ROMPT T UNING across the two benchmarks.
ized from sampled vocabulary (i.e., the 5,000 most
common tokens). During target prompt tuning, we
ablation study indicates that longer tuning is also an
save a checkpoint every 500 steps and report re-
important ingredient for achieving the best perfor-
sults on the checkpoint with the highest validation
mance, and is complementary to prompt transfer.
performance. Appendix C contains training details
Additionally, when longer tuning is omitted, we
for P ROMPT T UNING and model tuning approaches.
observe that SP OT improves stability across runs.
2.2 Effect of SP OT Within SP OT, we can compare the effectiveness
of different source mixtures (see Table 1). Source
We compare the results of SP OT and other ap- prompt tuning on GLUE performs best on both
proaches in Table 1 and Figure 1. Below, we sum- GLUE and S UPER GLUE, obtaining average scores of
marize and analyze each of our findings in detail. 82.8 and 73.2, respectively.8 Interestingly, unsuper-
SP OT significantly improves performance and vised source prompt tuning on C4 (the same task
stability of P ROMPT T UNING: Our results on the used to pre-train our frozen models) still yields con-
GLUE and S UPER GLUE benchmarks with T5 BASE
siderable improvements, even outperforming using
S UPER GLUE for S UPER GLUE tasks. Using MNLI or
(Table 1) suggest that prompt transfer provides
SQ UAD as a single source dataset is also particularly
an effective means of improving performance for
P ROMPT T UNING. For example, the best-performing
helpful across target tasks. Other source mixtures
variant of SP OT outperforms the vanilla P ROMPT T UN - can lead to significant gains, with some families of
ING approach on both GLUE and S UPER GLUE by a
tasks (e.g., NLI and paraphrasing/semantic similar-
substantial margin, obtaining +4.4 and +10.1 point ity) showing more benefit than others. Mixing all
average accuracy improvements, respectively. Our the datasets together does not yield the best results,
possibly due to task interference/negative transfer
6
See Appendix B for details about datasets. issues, where achieving good performance on one
7
We use the Adafactor optimizer (Shazeer and Stern, 2018) or more source tasks can hurt performance on a
with default parameters except with a constant learning rate of
0.3, weight decay of 1e−5, and parameter scaling turned off. target task.
We train with a batch size of 32. The dropout probability is
8
always kept at 0.1. All of our models are implemented using S UPER GLUE tasks benefit less from source prompt tuning
JAX (Bradbury et al., 2018) and F LAX (Heek et al., 2020). on S UPER GLUE likely due to the small size of these datasets.

5042
SP OT helps close the gap with M ODELT UNING Name Task type |Train|
across all model sizes: Figure 1 shows our 16 source tasks
S UPER GLUE results across model sizes (see Ap- C4 language modeling 365M
D OC NLI NLI 942K
pendix A for full results). As shown in Lester Y ELP -2 sentiment analysis 560K
et al. (2021), P ROMPT T UNING becomes more com- MNLI NLI 393K
petitive with scale, and at the XXL size, it nearly QQP paraphrase detection 364K
QNLI NLI 105K
matches the performance of M ODELT UNING. How- R E C O RD QA 101K
ever, at smaller model sizes, there are still large CXC semantic similarity 88K
SQ UAD QA 88K
gaps between the two approaches. We show that DROP QA 77K
SP OT helps close these gaps and even exceeds M OD - SST-2 sentiment analysis 67K
ELT UNING ’s performance by a large margin at sev- W INO G RANDE commonsense reasoning 40K
H ELLA SWAG commonsense reasoning 40K
eral model sizes, while retaining all the computa- M ULTI RC QA 27K
tional benefits conferred by P ROMPT T UNING. Finally, C OSMOS QA commonsense reasoning 25K
at the XXL size, SP OT achieves the best average RACE QA 25K
score of 91.2, +1.1 points better than the strong 10 target tasks
M ULTI - TASK M ODELT UNING baseline, despite having B OOL Q QA 9K
C O LA grammatical acceptability 9K
27,000× fewer task-specific parameters. STS-B semantic similarity 6K
As a final test of SP OT’s effectiveness, we submit- WIC word sense disambiguation 5K
ted our XXL model’s predictions to the S UPER GLUE CR sentiment analysis 4K
MRPC paraphrase detection 4K
leaderboard, achieving a score of 89.2. This far RTE NLI 2K
exceeds all previous submissions using parameter- WSC coreference resolution 554
efficient adaptation, such as GPT-3 (71.8), and al- COPA QA 400
CB NLI 250
most matches fully fine-tuned T5 XXL (89.3),9 de-
spite tuning 27,000× fewer parameters. To the Table 2: Tasks used in our task transferability experi-
best of our knowledge, SP OT is the first parameter- ments, sorted by training dataset size.
efficient adaptation approach that is competitive
with methods that tune billions of parameters. See pose a retrieval algorithm (§3.3) that leverages task
Appendix D for details. embedding similarity to choose which source tasks
3 Predicting task transferability to use for a given novel target task (Figure 2, right).
Our proposed approach can eliminate 69% of the
So far, we have seen that soft prompt transfer can source task search space while keeping 90% of the
significantly boost the performance of prompt tun- best-case quality gain.
ing, but it is critical to pick the right source tasks for
transfer. For instance, through an extensive search, 3.1 Measuring transferability
we found that GLUE and MNLI provide excellent
We study a diverse set of 16 source datasets and
source tasks for transferring to individual GLUE
10 target datasets (see Table 2).10 We consider
and S UPER GLUE tasks. But what about a resource-
all 160 possible source-target pairs, and perform
constrained scenario where a user is not able to
transfer from each source task to each target task.
exhaustively search over a set of source tasks? Can
All source tasks are data-rich or have been shown
we predict which tasks will best transfer onto a
to yield positive transfer in prior work. To simulate
novel target task without testing them one by one?
a realistic scenario, we use low-resource tasks (less
To investigate this, we conduct a large-scale em-
than 10K training examples) as target tasks.11
pirical study with 26 NLP tasks. We first measure
transferability across all task combinations (§3.1). 10
Beyond the datasets from §2, we use D OC NLI (Yin et al.,
Next, we show that by interpreting task prompts 2021), Y ELP -2 (Zhang et al., 2015), C X C (Parekh et al., 2021),
DROP (Dua et al., 2019), W INO G RANDE (Sakaguchi et al., 2020),
as task embeddings, we can construct a seman- H ELLA SWAG (Zellers et al., 2019), C OSMOS QA (Huang et al.,
tic space of tasks, wherein similar tasks cluster 2019), RACE (Lai et al., 2017), and CR (Hu and Liu, 2004).
11
together (§3.2). Based on this observation, we pro- The source tasks comprise one unsupervised task (C4)
and 15 supervised tasks covering natural language inference
9
Note that the T5 submission uses the original version of (NLI), paraphrasing/semantic similarity, sentiment analysis,
T5 (which was pre-trained on a multi-task mixture of unsuper- question answering (QA), and commonsense reasoning. The
vised and supervised tasks) while we use T5 1.1 (which was target tasks additionally include grammatical acceptability,
pre-trained on C4 only without mixing in supervised tasks). word sense disambiguation, and coreference resolution.

5043
C4 +25 1.00
DocNLI 0.75
+20
Yelp-2 0.50
MNLI +15 0.25
QQP +10
QNLI 0.00 C4
WSC
ReCoRD +5 SQuAD
CxC ReCoRD
SQuAD 0 DROP
CoLA
DROP 5 COPA
Yelp-2
SST-2 SST-2
WinoGrande 10 CR
HellaSWAG 15 MNLI
CB
MultiRC DocNLI
CosmosQA 20 RTE
RACE CxC
25 STS-B
CoLA STS-B CR MRPC RTE BoolQ WiC WSC COPA CB MRPC
QQP
QNLI
WiC
MultiRC
Figure 3: A heatmap of our task transferability results. BoolQ
RACE
Each cell shows the relative error reduction on the tar- WinoGrande
HellaSWAG
CosmosQA
get task of the transferred prompt from the associated

Yelp-2
SST-2

RTE

HellaSWAG
COPA

RACE
MultiRC
ReCoRD

STS-B

WiC

WinoGrande
C4
WSC
SQuAD
DROP

CR
MNLI
CB
DocNLI
CxC
MRPC
QQP
QNLI

BoolQ

CosmosQA
CoLA
source task (row) to the associated target task (column).

To limit computational costs, we use T5 BASE in Figure 4: A clustered heatmap of cosine similarities
all of our task transferability experiments. We per- between the task embeddings of the 26 NLP tasks we
form 262,144 prompt tuning steps on each source study. Our prompt-based task embeddings capture task
relationships: similar tasks cluster together.
task. The prompt checkpoint with the highest
source task validation performance is selected to
initialize prompts for target tasks. Since the target test this idea, we interpret task prompts as task em-
datasets are small, we only perform 100K prompt beddings and construct a semantic space of tasks.
tuning steps on each target task. We repeat each More concretely, we define a task’s embedding as
experiment three times with different random seeds. the prompt checkpoint after training for 10K steps
Other training details match §2.1.4. on that task.13 Note that using early checkpoints
allows for quick computation of task embeddings
Tasks benefiting each other via prompt trans- for novel target tasks. We estimate the similarity
fer: Figure 3 shows a heatmap of our results (see between two tasks t1 , t2 by measuring the similar-
Appendix E for full results). In many cases, prompt ity between their corresponding task embeddings
transfer provides a significant gain on the target e1 , e2 , using the following metrics:
task. The transfer MNLI → CB yields the largest
C OSINE S IMILARITY OF AVERAGE T OKENS: We
relative error reduction of 58.9% (from an average
score of 92.7 to 97.0), followed by MNLI → COPA compute the cosine similarity between the average
(29.1%) and R E C O RD → WSC (20.0%). Using the pooled representations of the prompt tokens:
best source prompt (out of 48) for each target task 1 X 1 1 X 2
dramatically improves the average score across our sim(t1 , t2 ) = cos( ei , ej ),
L L
i j
10 target tasks from 74.7 to 80.7. Overall, our re-
sults show effective transfer from large source tasks
where e1i , e2j denote the respective prompt tokens
that involve high-level reasoning about semantic re-
of e1 , e2 , and cos denotes the cosine similarity.
lationships among sentences (e.g., MNLI), or when
the source and target tasks are similar (e.g., C X C → P ER - TOKEN AVERAGE C OSINE S IMILARITY: We
STS-B). Interestingly, positive transfer can occur compute the average cosine similarity between ev-
between relatively dissimilar tasks (e.g., R E C O RD ery prompt token pair (e1i , e2j ):
→ WSC, SQ UAD → MRPC, C X C → W I C).12
1 XX
3.2 Defining task similarity through prompts sim(t1 , t2 ) = cos(e1i , e2j ).
L2 i j
Since only prompt parameters are updated dur- 13
Our preliminary experiments with other checkpoint al-
ing prompt tuning on specific tasks, the learned ternatives (in the range 1K to 100K) yielded worse perfor-
prompts likely encode task-specific knowledge. mance. We also found that measuring task similarity using
task embeddings derived from a fixed prompt checkpoint (10K
This suggests that they could be used to reason steps) gave better results than those derived from the best-
about the nature of tasks and their relationships. To performing prompt checkpoint per task. This suggests that
prompts trained for a differing number of steps may be less
12
Table 7 in Appendix E contains more cases. directly comparable than those trained for the same duration.

5044
STS-B RTE
Task embeddings capture task relationships: (r = 0.708, p = 1.853e-08) (r = 0.290, p = 0.046)
Figure 4 shows a hierarchically-clustered heatmap
of cosine similarities between the task embed-
dings using the C OSINE S IMILARITY OF AVERAGE TO -
KENS metric.
14 We observe that our learned task

relative error reduction

embeddings capture many intuitive task relation-
ships. Specifically, similar tasks group together
into clusters, including QA (SQ UAD, R E C O RD, and
WiC WSC
DROP; M ULTI RC and B OOL Q), sentiment analysis (r = 0.163, p = 0.270) (r = 0.428, p = 0.002)

(Y ELP -2, SST-2, and CR), NLI (MNLI and CB; D OC -

NLI and RTE), semantic similarity (STS-B and C X C),
paraphrasing (MRPC and QQP), and commonsense
reasoning (W INO G RANDE, H ELLA SWAG, and C OS -
MOS QA ). We note that QNLI , which is an NLI task
built from the SQ UAD dataset, is not closely linked
to SQ UAD; this suggests that our task embeddings
cosine similarity
are more sensitive to the type of task than domain
similarity. Interestingly, they also capture the un- Figure 5: Correlation between task similarity and task
intuitive case of R E C O RD’s high transferability to transferability. Each point represents a source prompt.
WSC. Additionally, task embeddings that are de- The x-axis shows the cosine similarity between the as-
rived from different prompts of the same task have sociated source and target task embeddings, averaged
over three runs for the target task (orange title). The y-
high similarity scores (see Appendix F).
axis measures the relative error reduction on the target
3.3 Predicting transferability via similarity task achieved by each source prompt. We include the
Pearson correlation coefficient (r) and p-value.
We leverage our task embeddings to predict and
exploit task transferability. Specifically, we explore source prompts kr=1 αr ρsr so that we only per-
P
methods to predict the most beneficial source tasks form prompt tuning on the target task t once. The
for a given target task and then make use of the weights αr are computed as:
source task prompts to improve performance on the
target task. To enlarge our set of source prompts, sim(esr , et )
αr = Pk ,
we use the prompts from each of the three different sl t
l=1 sim(e , e )
prompt tuning runs on each source task, resulting in
48 source prompts. Given a target task t with task where esr denotes the corresponding task embed-
embedding et , we rank all the source prompts ρs ding of ρsr .
with associated embeddings es in descending order
T OP -k M ULTI - TASK M IXTURE: We first identify
by similarity, sim(es , et ). We denote the ranked
the source tasks whose prompts are in the top-k
list of source prompts as ρsr , where r denotes the
prompts and mix their datasets and the target
rank (r = 1, 2, . . . , 48). We experiment with three
dataset together, using the examples-proportional
methods for using the ranked source prompts:
mixing strategy of Raffel et al. (2020). Then, we
B EST OF T OP -k: We select the top-k source perform source prompt tuning on this multi-task
prompts and use each of them individually to ini- mixture and use the final prompt checkpoint to ini-
tialize the target prompt. This procedure requires tialize the target prompt.
prompt tuning k times on the target task t. The best We report the average score across all target
individual result is used for evaluating the effec- tasks achieved by each method. For comparison,
tiveness of this method. we measure the absolute and relative improvements
over BASELINE—prompt tuning on each target task
T OP -k W EIGHTED AVERAGE: We initialize the tar-
from scratch (i.e., without any prompt transfer).15
get prompt with a weighted average of the top-k
Additionally, we include O RACLE—the oracle re-
14
To obtain the highest resolution of similarity between two sults achieved by a brute-force search to identify
tasks, we use the average of cosine similarities between their
15
task embeddings obtained with all the three different prompt For each target task t, we report the average and standard
tuning runs (9 combinations). deviation of performance across three prompt tuning runs.

5045
the best possible out of 48 source prompts for each Method
Change
Avg. score
target task. Abs. Rel.

BASELINE - - 74.70.7
Correlation between task similarity and task
transferability: Figure 5 shows how the relative B RUTE - FORCE S EARCH (k = 48)
O RACLE 6.00.5 26.51.1 80.70.0
error reduction on a target task changes as a func-
C OSINE S IMILARITY OF AVERAGE T OKENS
tion of the similarity between the source and target B EST OF T OP -k
task embeddings. Overall, we observe a signifi- k=1 1.50.5 11.71.1 76.20.1
k=3 2.70.6 16.61.1 77.40.3
cant positive correlation between task embedding k=6 3.80.1 20.01.1 78.50.5
similarity and task transferability on four (out of k=9 4.50.4 22.21.1 79.2 0.1
k = 12 5.00.9 23.62.2 79.7 0.4
10) target tasks, including STS-B (p < 0.001), CB k = 15 5.40.8 24.91.8 80.10.3
(p < 0.001), WSC (p < 0.01), and RTE (p < 0.05), P ER - TOKEN AVERAGE C OSINE S IMILARITY
while it is less significant on the other tasks.16 In B EST OF T OP -k
k=1 2.00.4 12.11.1 76.70.7
some cases (e.g., on B OOL Q), we observe a large rel- k=3 2.90.6 17.00.6 77.50.4
ative error reduction (19.0%, achieved by a source k=6 4.50.5 22.11.2 79.20.1
k=9 4.60.5 22.60.9 79.50.2
prompt of MNLI) despite a low cosine similarity k = 12 5.00.6 23.51.4 79.60.1
(0.4). This suggests that factors other than task k = 15 5.30.9 24.52.2 80.00.4
similarity (data size, task difficulty, domain sim- T OP -k W EIGHTED AVERAGE
best k = 3 1.90.5 11.52.7 76.60.1
ilarity, etc.) may also play a role in determining
T OP -k M ULTI - TASK M IXTURE
transferability. best k = 12 3.10.5 15.32.8 77.80.1

Retrieving targeted source tasks via task em- Table 3: Task embeddings provide an effective means
beddings is helpful: Table 3 compares differ- of predicting and exploiting task transferability. Us-
ent methods for identifying which source prompts ing B EST OF T OP -k with k = 3 improves over BASE -
could be beneficial for a given target task. Over- LINE ( P ROMPT T UNING on each task from scratch) by
all, our results show the effectiveness of B EST OF +2.8 points. With larger values of k (≤ 15), we can
T OP -k. Simply choosing the source prompt with retain most of the benefits conferred by oracle selec-
the highest task embedding similarity to the target tion. For T OP -k W EIGHTED AVERAGE and T OP -k M ULTI -
TASK M IXTURE, we experiment with different values of
task using P ER - TOKEN AVERAGE C OSINE S IMILARITY
k ∈ {3, 6, 9, 12} and report the best results.
improves over the baseline by a large margin (from
an average score of 74.7 to 76.7, a 12.1% average
relative error reduction). Trying all the top-3 (out to exhibit remarkable performance on many NLP
of 48) source prompts for each target task yields an tasks (Devlin et al., 2019; Liu et al., 2019b; Yang
average score of 77.5. With larger values of k, we et al., 2019; Lan et al., 2020; Raffel et al., 2020;
can retain most of the benefits of oracle selection Brown et al., 2020; He et al., 2021). To improve
(80% of the gain in terms of average score with practical applicability of these models, early work
k = 9 and 90% with k = 15), while still elimi- introduces compression techniques (Sanh et al.,
nating over 2/3 of the candidate source prompts. 2019; Jiao et al., 2020; Fan et al., 2020; Sanh et al.,
T OP -k W EIGHTED AVERAGE has similar average per-
2020) to obtain lightweight models. Other work ex-
formance to B EST OF TOP -k with k = 1, but achieves plores updating only small parts of the model (Za-
lower variance. Thus, this may be an appealing al- ken et al., 2021) or task-specific modules, such as
ternative to B EST OF TOP -k in scenarios where trying adapters (Houlsby et al., 2019; Karimi Mahabadi
multiple prompt tuning runs on the target task is et al., 2021) or low-rank structures (Mahabadi et al.,
computationally prohibitive. Finally, TOP -k M ULTI - 2021; Hu et al., 2021), while keeping the rest of
TASK M IXTURE also provides a means of obtaining
the model fixed.
strong performance with an average score of 77.8, Recently, Brown et al. (2020) demonstrate im-
even outperforming B EST OF TOP -k with k ≤ 3. pressive few-shot performance with P ROMPT D ESIGN,
where their model is conditioned on a manual
4 Related Work text prompt at inference time to perform differ-
ent tasks. Several efforts have since focused on
Parameter-efficient transfer learning: Large-
developing prompt-based learning approaches with
scale pre-trained language models have been shown
carefully handcrafted prompts (Schick and Schütze,
16
See Appendix G for full results. 2021), prompt mining and paraphrasing (Jiang
5046
et al., 2020b), gradient-based search for improved parameters, P ROMPT T UNING is the most parameter
prompts (Shin et al., 2020), and automatic prompt efficient, requiring less than 0.01% task-specific pa-
generation (Gao et al., 2021). The use of hard rameters for most model sizes. (2) P ROMPT T UNING
prompts, however, was found to be sub-optimal and is simpler than other methods, as it does not mod-
sensitive to the choice of the prompt (Zhao et al., ify the internal model architecture (cf. the P REFIX -
2021; Liu et al., 2021b). As such, more recent work T UNING method of Li and Liang (2021), which
has shifted toward learning soft prompts (Liu et al., adds a prefix to each layer of both the Transformer
2021b; Qin and Eisner, 2021; Li and Liang, 2021; encoder and decoder); as such, P ROMPT T UNING al-
Lester et al., 2021), which can be seen as learn- lows mixed-task inference and facilitates transfer
able parameters injected into the model. We refer learning between tasks. (3) As model capacity in-
readers to Liu et al. (2021a) for a recent survey on creases, P ROMPT T UNING becomes more competitive
prompt-based learning research. with M ODELT UNING; to the best of our knowledge,
In concurrent work, Gu et al. (2021) also explore this has not been shown for other methods. (4) Soft
the effectiveness of prompt transfer. Their method prompts could possibly be interpreted as natural
uses hand-crafted pre-training tasks tailored to spe- language instructions.
cific types of downstream tasks, being less extensi- Additionally, since our prompt-based task em-
ble to novel downstream tasks. In contrast, we use bedding approach does not capture all of the factors
existing tasks as source tasks and show that prompt that influence task transferability, we leave further
transfer can confer benefits even when there are exploration of other task embedding methods to
mismatches (e.g., in task type or input/output for- future work.
mat) between the source and target.
6 Conclusion
Task transferability We also build on existing
work on task transferability (Wang et al., 2019a; In this paper, we study transfer learning in the con-
Liu et al., 2019a; Talmor and Berant, 2019; Pruk- text of prompt tuning. We show that scale is not
sachatkun et al., 2020; Vu et al., 2020, 2021). Prior necessary for P ROMPT T UNING to match the perfor-
work shows effective transfer from data-rich source mance of M ODELT UNING. On S UPER GLUE, our SP OT
tasks (Phang et al., 2019), those that require com- approach matches or even exceeds the performance
plex reasoning and inference (Pruksachatkun et al., of M ODELT UNING by a large margin across model
2020), or those that are similar to the target task (Vu sizes while being more parameter-efficient. Our
et al., 2020). There have also been efforts to predict large-scale study on task transferability indicates
task transferability (Bingel and Søgaard, 2017; Vu that tasks can benefit each other via prompt transfer
et al., 2020; Poth et al., 2021). Vu et al. (2020) in various scenarios. Finally, we demonstrate that
use task embeddings derived from either the input task prompts can be interpreted as task embeddings
text or the diagonal Fisher information matrix of to formalize the similarity between tasks. We pro-
the model, while Poth et al. (2021) explore adapter- pose a simple yet efficient retrieval approach that
based alternatives. Here, our use of the same model measures task similarity to identify which source
(without task-specific components) with a unifying tasks could confer benefits to a novel target task.
text-to-text format allows us to more easily model Taken as a whole, we hope that our work will spur
the space of tasks. Additionally, prompt-based task more research into prompt-based transfer learning.
embeddings are comparatively cheaper to obtain.
Acknowledgements
5 Limitations & Future work
We thank Mohit Iyyer, Sebastian Ruder, Kalpesh
As other parameter-efficient adaptation methods Krishna, Thang Luong, Quoc Le, and the members
(see §4) may outperform P ROMPT T UNING in specific of the Descartes team and the UMass NLP group
situations, it would be interesting to test whether an for helpful discussion and feedback. We would
approach similar to SP OT could extend successfully also like to thank Grady Simon, Lucas Dixon, Slav
to these methods. At the same time, we believe that Petrov, Nader Akoury, Haw-Shiuan Chang, Kather-
P ROMPT T UNING has its own merits. As pre-trained ine Thai, Marzena Karpinska, and Shufan Wang for
language models become larger and larger, some their comments on this manuscript. Finally, we are
advantages of P ROMPT T UNING over other methods grateful to Vamsi Aribandi for his work on prepro-
are: (1) Among current methods with learnable cessing several datasets used in our experiments.
5047
References Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Malaviya, Keisuke Sakaguchi, Ari Holtzman, Han- Amanda Askell, Sandhini Agarwal, Ariel Herbert-
nah Rashkin, Doug Downey, Scott Wen-tau Yih, and Voss, Gretchen Krueger, Tom Henighan, Rewon
Yejin Choi. 2020. Abductive commonsense reason- Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,
ing. In Proceedings of the 8th International Confer- Clemens Winter, Chris Hesse, Mark Chen, Eric
ence on Learning Representations (ICLR 2020). Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Joachim Bingel and Anders Søgaard. 2017. Identify- Alec Radford, Ilya Sutskever, and Dario Amodei.
ing beneficial task relations for multi-task learning 2020. Language models are few-shot learners. In
in deep neural networks. In Proceedings of the Con- Proceedings of the 34th Conference on Neural In-
ference of the European Chapter of the Association formation Processing Systems (NeurIPS 2020), vol-
for Computational Linguistics (EACL 2017), pages ume 33, pages 1877–1901.
164–169.

Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-
Gao, and Yejin Choi. 2020. Piqa: Reasoning about Gazpio, and Lucia Specia. 2017. SemEval-2017
physical commonsense in natural language. Pro- task 1: Semantic textual similarity multilingual and
ceedings of the AAAI Conference on Artificial Intel- crosslingual focused evaluation. In Proceedings of
ligence (AAAI 2020), 34(05):7432–7439. the 11th International Workshop on Semantic Evalu-
ation (SemEval 2017), pages 1–14.
Ondřej Bojar, Christian Buck, Christian Federmann,
Barry Haddow, Philipp Koehn, Johannes Leveling, Christopher Clark, Kenton Lee, Ming-Wei Chang,
Christof Monz, Pavel Pecina, Matt Post, Herve Tom Kwiatkowski, Michael Collins, and Kristina
Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Toutanova. 2019. BoolQ: Exploring the surprising
Tamchyna. 2014. Findings of the 2014 workshop on difficulty of natural yes/no questions. In Proceed-
statistical machine translation. In Proceedings of the ings of the 2019 Conference of the North Ameri-
Ninth Workshop on Statistical Machine Translation can Chapter of the Association for Computational
(WMT 2014), pages 12–58. Linguistics: Human Language Technologies (ACL
2019), pages 2924–2936.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann,
Yvette Graham, Barry Haddow, Matthias Huck, An- Ido Dagan, Oren Glickman, and Bernardo Magnini.
tonio Jimeno Yepes, Philipp Koehn, Varvara Lo- 2005. The pascal recognising textual entailment
gacheva, Christof Monz, Matteo Negri, Aurélie challenge. In Proceedings of the 1st International
Névéol, Mariana Neves, Martin Popel, Matt Post, Conference on Machine Learning Challenges: Eval-
Raphael Rubino, Carolina Scarton, Lucia Spe- uating Predictive Uncertainty Visual Object Classifi-
cia, Marco Turchi, Karin Verspoor, and Marcos cation, and Recognizing Textual Entailment (MLCW
Zampieri. 2016. Findings of the 2016 conference 2005), page 177–190.
on machine translation. In Proceedings of the First
Conference on Machine Translation (WMT 2016), Marie-Catherine De Marneffe, Mandy Simons, and
pages 131–198. Judith Tonhauser. 2019. The CommitmentBank:
Investigating projection in naturally occurring dis-
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, course. In Proceedings of Sinn und Bedeutung 23
Barry Haddow, Matthias Huck, Chris Hokamp, (SuB 2018), volume 23, pages 107–124.
Philipp Koehn, Varvara Logacheva, Christof Monz,
Matteo Negri, Matt Post, Carolina Scarton, Lucia Dorottya Demszky, Dana Movshovitz-Attias, Jeong-
Specia, and Marco Turchi. 2015. Findings of the woo Ko, Alan Cowen, Gaurav Nemade, and Sujith
2015 workshop on statistical machine translation. In Ravi. 2020. GoEmotions: A dataset of fine-grained
Proceedings of the Tenth Workshop on Statistical emotions. In Proceedings of the 58th Annual Meet-
Machine Translation (WMT 2015), pages 1–46. ing of the Association for Computational Linguistics
(ACL 2020), pages 4040–4054.
Samuel R. Bowman, Gabor Angeli, Christopher Potts,
and Christopher D. Manning. 2015. A large anno- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
tated corpus for learning natural language inference. Kristina Toutanova. 2019. BERT: Pre-training of
In Proceedings of the 2015 Conference on Empirical deep bidirectional transformers for language under-
Methods in Natural Language Processing (EMNLP standing. In Proceedings of the 2019 Conference of
2015), pages 632–642. the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
James Bradbury, Roy Frostig, Peter Hawkins, nologies (NAACL 2019), pages 4171–4186.
Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake William B. Dolan and Chris Brockett. 2005. Automati-
VanderPlas, Skye Wanderman-Milne, and Qiao cally constructing a corpus of sentential paraphrases.
Zhang. 2018. JAX: composable transformations of In Proceedings of the Third International Workshop
Python+NumPy programs. on Paraphrasing (IWP 2005).

5048
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Madaan, Mounica Maddela, Khyati Mahajan,
Stanovsky, Sameer Singh, and Matt Gardner. 2019. Saad Mahamood, Bodhisattwa Prasad Majumder,
DROP: A reading comprehension benchmark requir- Pedro Henrique Martins, Angelina McMillan-
ing discrete reasoning over paragraphs. In Proceed- Major, Simon Mille, Emiel van Miltenburg, Moin
ings of the Conference of the North American Chap- Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre
ter of the Association for Computational Linguis- Niyongabo Rubungo, Salomey Osei, Ankur Parikh,
tics: Human Language Technologies (NAACL 2019), Laura Perez-Beltrachini, Niranjan Ramesh Rao,
pages 2368–2378. Vikas Raunak, Juan Diego Rodriguez, Sashank
Santhanam, João Sedoc, Thibault Sellam, Samira
Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Shaikh, Anastasia Shimorina, Marco Antonio
Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Sobrevilla Cabezudo, Hendrik Strobelt, Nishant
Searchqa: A new q&a dataset augmented with Subramani, Wei Xu, Diyi Yang, Akhila Yerukola,
context from a search engine. arXiv preprint and Jiawei Zhou. 2021. The GEM benchmark:
arXiv:1704.05179. Natural language generation, its evaluation and
metrics. In Proceedings of the 1st Workshop on
Ondřej Dušek, David M. Howcroft, and Verena Rieser.
Natural Language Generation, Evaluation, and
2019. Semantic noise matters for neural natural lan-
Metrics (GEM 2021), pages 96–120.
guage generation. In Proceedings of the 12th Inter-
national Conference on Natural Language Genera-
tion (INLG 2019), pages 421–426. Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and
Aleksander Wawer. 2019. SAMSum corpus: A
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and human-annotated dialogue dataset for abstractive
Dragomir Radev. 2019. Multi-news: A large-scale summarization. In Proceedings of the 2nd Workshop
multi-document summarization dataset and abstrac- on New Frontiers in Summarization (NewSum 2019),
tive hierarchical model. In Proceedings of the 57th pages 70–79.
Annual Meeting of the Association for Computa-
tional Linguistics (ACL 2019), pages 1074–1084. Alec Go, Richa Bhayani, and Lei Huang. 2009. Twit-
ter sentiment classification using distant supervision.
Angela Fan, Edouard Grave, and Armand Joulin. 2020. CS224N Project Report, Stanford.
Reducing transformer depth on demand with struc-
tured dropout. In Proceedings of the 8th Inter- David Graff, Junbo Kong, Ke Chen, and Kazuaki
national Conference on Learning Representations Maeda. 2003. English gigaword. Linguistic Data
(ICLR 2020). Consortium, Philadelphia, 4(1):34.
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu-
nsol Choi, and Danqi Chen. 2019. MRQA 2019 Max Grusky, Mor Naaman, and Yoav Artzi. 2018.
shared task: Evaluating generalization in reading Newsroom: A dataset of 1.3 million summaries with
comprehension. In Proceedings of the 2nd Work- diverse extractive strategies. In Proceedings of the
shop on Machine Reading for Question Answering 2018 Conference of the North American Chapter of
(MRQA 2019), pages 1–13. the Association for Computational Linguistics: Hu-
man Language Technologies (NAACL 2018), pages
Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. 708–719.
Making pre-trained language models better few-shot
learners. In Proceedings of the 59th Annual Meet- Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang.
ing of the Association for Computational Linguistics 2021. PPT: Pre-trained prompt tuning for few-shot
and the 11th International Joint Conference on Nat- learning. arXiv preprint arXiv:2109.04332.
ural Language Processing (ACL 2021), pages 3816–
3830. Karen Hambardzumyan, Hrant Khachatrian, and
Jonathan May. 2021. WARP: Word-level Adver-
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
sarial ReProgramming. In Proceedings of the 59th
and Laura Perez-Beltrachini. 2017. Creating train-
Annual Meeting of the Association for Computa-
ing corpora for NLG micro-planners. In Proceed-
tional Linguistics and the 11th International Joint
ings of the 55th Annual Meeting of the Association
Conference on Natural Language Processing (ACL-
for Computational Linguistics (ACL 2017), pages
IJCNLP 2021), pages 4921–4933.
179–188.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Pengcheng He, Xiaodong Liu, Jianfeng Gao, and
Aggarwal, Pawan Sasanka Ammanamanchi, Weizhu Chen. 2021. Deberta: Decoding-enhanced
Anuoluwapo Aremu, Antoine Bosselut, Khy- bert with disentangled attention. In Proceedings of
athi Raghavi Chandu, Miruna-Adriana Clinciu, the 9th International Conference on Learning Repre-
Dipanjan Das, Kaustubh Dhole, Wanyu Du, sentations (ICLR 2021).
Esin Durmus, Ondřej Dušek, Chris Chinenye
Emezue, Varun Gangal, Cristina Garbacea, Tat- Jonathan Heek, Anselm Levskaya, Avital Oliver, Mar-
sunori Hashimoto, Yufang Hou, Yacine Jernite, vin Ritter, Bertrand Rondepierre, Andreas Steiner,
Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mi- and Marc van Zee. 2020. Flax: A neural network
hir Kale, Dhruv Kumar, Faisal Ladhak, Aman library and ecosystem for JAX.

5049
Karl Moritz Hermann, Tomas Kocisky, Edward Grefen- Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa
stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Dehghani, and James Henderson. 2021. Parameter-
and Phil Blunsom. 2015. Teaching machines to read efficient multi-task fine-tuning for transformers via
and comprehend. In Proceedings of the 29th Con- shared hypernetworks. In Proceedings of the 59th
ference on Neural Information Processing Systems Annual Meeting of the Association for Computa-
(NeurIPS 2020), volume 28. tional Linguistics and the 11th International Joint
Conference on Natural Language Processing (ACL-
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, IJCNLP 2021), pages 565–576.
Bruna Morrone, Quentin De Laroussilhe, Andrea
Gesmundo, Mona Attariyan, and Sylvain Gelly. Daniel Khashabi, Snigdha Chaturvedi, Michael Roth,
2019. Parameter-efficient transfer learning for NLP. Shyam Upadhyay, and Dan Roth. 2018. Looking be-
In Proceedings of the 36th International Conference yond the surface: A challenge set for reading com-
on Machine Learning (PMLR 2019), volume 97, prehension over multiple sentences. In Proceedings
pages 2790–2799. of the 2018 Conference of the North American Chap-
ter of the Association for Computational Linguis-
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan tics: Human Language Technologies (NAACL 2018),
Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu pages 252–262.
Chen. 2021. Lora: Low-rank adaptation of large lan-
guage models. arXiv preprint arXiv:2106.09685. Anastassia Kornilova and Vladimir Eidelman. 2019.
BillSum: A corpus for automatic summarization of
Minqing Hu and Bing Liu. 2004. Mining and summa- US legislation. In Proceedings of the 2nd Workshop
rizing customer reviews. In Proceedings of the 10th on New Frontiers in Summarization (NewSum 2019),
ACM SIGKDD International Conference on Knowl- pages 48–56.
edge Discovery and Data Mining (KDD 2004), page
168–177. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-
field, Michael Collins, Ankur Parikh, Chris Al-
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and berti, Danielle Epstein, Illia Polosukhin, Jacob De-
Yejin Choi. 2019. Cosmos QA: Machine reading vlin, Kenton Lee, Kristina Toutanova, Llion Jones,
comprehension with contextual commonsense rea- Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai,
soning. In Proceedings of the 2019 Conference on Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019.
Empirical Methods in Natural Language Processing Natural questions: A benchmark for question an-
and the 9th International Joint Conference on Nat- swering research. Transactions of the Association
ural Language Processing (EMNLP-IJCNLP 2019), for Computational Linguistics (TACL 2019), 7:452–
pages 2391–2401. 466.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. Faisal Ladhak, Esin Durmus, Claire Cardie, and Kath-
2017. First Quora Dataset Release: Question pairs. leen McKeown. 2020. WikiLingua: A new bench-
mark dataset for cross-lingual abstractive summa-
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang rization. In Findings of the Association for Com-
Zhong, and Wei Xu. 2020a. Neural CRF model for putational Linguistics (Findings of EMNLP 2020),
sentence alignment in text simplification. In Pro- pages 4034–4048.
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL 2020), Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,
pages 7943–7960. and Eduard Hovy. 2017. RACE: Large-scale ReAd-
ing comprehension dataset from examinations. In
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Proceedings of the 2017 Conference on Empirical
Neubig. 2020b. How can we know what language Methods in Natural Language Processing (EMNLP
models know? Transactions of the Association 2017), pages 785–794.
for Computational Linguistics (TACL 2020), 8:423–
438. Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and Radu Soricut.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, 2020. ALBERT: A lite BERT for self-supervised
Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. learning of language representations. In Proceed-
2020. TinyBERT: Distilling BERT for natural lan- ings of the 8th International Conference on Learn-
guage understanding. In Findings of the Association ing Representations (ICLR 2020).
for Computational Linguistics (Findings of EMNLP
2020), pages 4163–4174. Brian Lester, Rami Al-Rfou, and Noah Constant. 2021.
The power of scale for parameter-efficient prompt
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke tuning. In Proceedings of the 2021 Conference on
Zettlemoyer. 2017. TriviaQA: A large scale dis- Empirical Methods in Natural Language Processing
tantly supervised challenge dataset for reading com- (EMNLP 2021), pages 3045–3059.
prehension. In Proceedings of the 55th Annual Meet-
ing of the Association for Computational Linguistics Hector J. Levesque, Ernest Davis, and Leora Morgen-
(ACL 2017), pages 1601–1611. stern. 2012. The winograd schema challenge. In

5050
Proceedings of the Thirteenth International Confer- North American Chapter of the Association for Com-
ence on Principles of Knowledge Representation putational Linguistics: Human Language Technolo-
and Reasoning (KR 2012), page 552–561. gies (NAACL 2021), pages 432–447.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Shashi Narayan, Shay B. Cohen, and Mirella Lapata.
Optimizing continuous prompts for generation. In 2018. Don’t give me the details, just the sum-
Proceedings of the 59th Annual Meeting of the mary! topic-aware convolutional neural networks
Association for Computational Linguistics and the for extreme summarization. In Proceedings of the
11th International Joint Conference on Natural Lan- 2018 Conference on Empirical Methods in Natural
guage Processing (ACL 2021), pages 4582–4597. Language Processing (EMNLP 2018), pages 1797–
1807.
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei
Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Yixin Nie, Adina Williams, Emily Dinan, Mohit
Ren. 2020. CommonGen: A constrained text gen- Bansal, Jason Weston, and Douwe Kiela. 2020. Ad-
eration challenge for generative commonsense rea- versarial NLI: A new benchmark for natural lan-
soning. In Findings of the Association for Computa- guage understanding. In Proceedings of the 58th An-
tional Linguistics (Findings of EMNLP 2020), pages nual Meeting of the Association for Computational
1823–1840. Linguistics (ACL 2020), pages 4885–4901.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Zarana Parekh, Jason Baldridge, Daniel Cer, Austin
Matthew E. Peters, and Noah A. Smith. 2019a. Lin- Waters, and Yinfei Yang. 2021. Crisscrossed cap-
guistic knowledge and transferability of contextual tions: Extended intramodal and intermodal seman-
representations. In Proceedings of the Conference of tic similarity judgments for MS-COCO. In Proceed-
the North American Chapter of the Association for ings of the 16th Conference of the European Chap-
Computational Linguistics: Human Language Tech- ter of the Association for Computational Linguistics
nologies (NAACL 2019), pages 1073–1094. (EACL 2021), pages 2855–2870.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Jason Phang, Thibault Févry, and Samuel R Bowman.
Hiroaki Hayashi, and Graham Neubig. 2021a. Pre- 2019. Sentence encoders on stilts: Supplementary
train, prompt, and predict: A systematic survey of training on intermediate labeled-data tasks. arXiv
prompting methods in natural language processing. preprint arXiv:1811.01088.
arXiv preprint arXiv:2107.13586.
Mohammad Taher Pilehvar and Jose Camacho-
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Collados. 2019. WiC: the word-in-context dataset
Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. Gpt for evaluating context-sensitive meaning representa-
understands, too. arXiv preprint arXiv:2103.10385. tions. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- putational Linguistics: Human Language Technolo-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, gies (NAACL 2019), pages 1267–1273.
Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
Roberta: A robustly optimized bert pretraining ap- Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, and Iryna
proach. arXiv preprint arXiv:1907.11692. Gurevych. 2021. What to pre-train on? Efficient
intermediate task selection. In Proceedings of the
Nicholas Lourie, Ronan Le Bras, Chandra Bhagavat- 2021 Conference on Empirical Methods in Natural
ula, and Yejin Choi. 2021. Unicorn on rainbow: Language Processing (EMNLP 2021), pages 10585–
A universal commonsense reasoning model on a 10605.
new multitask benchmark. Proceedings of the AAAI
Conference on Artificial Intelligence (AAAI 2021), Yada Pruksachatkun, Jason Phang, Haokun Liu,
35(15):13480–13488. Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe
Pang, Clara Vania, Katharina Kann, and Samuel R.
Rabeeh Karimi Mahabadi, James Henderson, and Se- Bowman. 2020. Intermediate-task transfer learning
bastian Ruder. 2021. Compacter: Efficient low- with pretrained language models: When and why
rank hypercomplex adapter layers. arXiv preprint does it work? In Proceedings of the 58th Annual
arXiv:2106.04647. Meeting of the Association for Computational Lin-
guistics (ACL 2020), pages 5231–5247.
Linyong Nan, Dragomir Radev, Rui Zhang, Amrit
Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xian- Guanghui Qin and Jason Eisner. 2021. Learning how
gru Tang, Aadit Vyas, Neha Verma, Pranav Kr- to ask: Querying LMs with mixtures of soft prompts.
ishna, Yangxiaokang Liu, Nadia Irwanto, Jessica In Proceedings of the 2021 Conference of the North
Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mu- American Chapter of the Association for Computa-
tuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern tional Linguistics: Human Language Technologies
Tan, Xi Victoria Lin, Caiming Xiong, Richard (NAACL 2021), pages 5203–5212.
Socher, and Nazneen Fatema Rajani. 2021. DART:
Open-domain structured data record to text genera- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
tion. In Proceedings of the 2021 Conference of the Lee, Sharan Narang, Michael Matena, Yanqi Zhou,

5051
Wei Li, and Peter J. Liu. 2020. Exploring the lim- Language Technologies (NAACL 2021), pages 2339–
its of transfer learning with a unified text-to-text 2352.
transformer. Journal of Machine Learning Research
(JMLR 2020), 21(140):1–67. Abigail See, Peter J. Liu, and Christopher D. Manning.
2017. Get to the point: Summarization with pointer-
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and generator networks. In Proceedings of the 55th An-
Percy Liang. 2016. SQuAD: 100,000+ questions nual Meeting of the Association for Computational
for machine comprehension of text. In Proceedings Linguistics (ACL 2017), pages 1073–1083.
of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2016), pages 2383– Noam Shazeer and Mitchell Stern. 2018. Adafactor:
2392. Adaptive learning rates with sublinear memory cost.
arXiv preprint arXiv:1804.04235.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara,
Raghav Gupta, and Pranav Khaitan. 2020. Towards Taylor Shin, Yasaman Razeghi, Robert L. Logan IV,
scalable multi-domain conversational agents: The Eric Wallace, and Sameer Singh. 2020. AutoPrompt:
schema-guided dialogue dataset. Proceedings of the Eliciting Knowledge from Language Models with
AAAI Conference on Artificial Intelligence (AAAI Automatically Generated Prompts. In Proceed-
2020), 34(05):8689–8696. ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP 2020),
Melissa Roemmele, Cosmin Adrian Bejan, and An- pages 4222–4235.
drew S Gordon. 2011. Choice of plausible alterna-
tives: An evaluation of commonsense causal reason- Richard Socher, Alex Perelygin, Jean Wu, Jason
ing. In Proceedings of the 25th AAAI Spring Sympo- Chuang, Christopher D. Manning, Andrew Ng, and
sium: Logical Formalizations of Commonsense Rea- Christopher Potts. 2013. Recursive deep models
soning (AAAI Spring Symposium 2011). for semantic compositionality over a sentiment tree-
bank. In Proceedings of the 2013 Conference on
Alexander M. Rush, Sumit Chopra, and Jason Weston. Empirical Methods in Natural Language Processing
2015. A neural attention model for abstractive sen- (EMNLP 2013), pages 1631–1642.
tence summarization. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan- Alon Talmor and Jonathan Berant. 2019. MultiQA: An
guage Processing (EMNLP 2015), pages 379–389. empirical investigation of generalization and transfer
in reading comprehension. In Proceedings of the An-
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhaga- nual Meeting of the Association for Computational
vatula, and Yejin Choi. 2020. Winogrande: An ad- Linguistics (ACL 2019), pages 4911–4921.
versarial winograd schema challenge at scale. Pro-
ceedings of the AAAI Conference on Artificial Intel- Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har-
ligence (AAAI 2020), 34(05):8732–8740. ris, Alessandro Sordoni, Philip Bachman, and Ka-
heer Suleman. 2017. NewsQA: A machine compre-
Victor Sanh, Lysandre Debut, Julien Chaumond, and hension dataset. In Proceedings of the Workshop
Thomas Wolf. 2019. Distilbert, a distilled version of on Representation Learning for NLP (RepL4NLP
bert: smaller, faster, cheaper and lighter. In Proceed- 2017), pages 191–200.
ings of the 5th Workshop on Energy Efficient Ma-
chine Learning and Cognitive Computing (EMC2 Tu Vu, Minh-Thang Luong, Quoc Le, Grady Simon,
2019). and Mohit Iyyer. 2021. STraTA: Self-training with
task augmentation for better few-shot learning. In
Victor Sanh, Thomas Wolf, and Alexander Rush. Proceedings of the 2021 Conference on Empirical
2020. Movement pruning: Adaptive sparsity by fine- Methods in Natural Language Processing (EMNLP
tuning. In Proceedings of the 34th Conference on 2021), pages 5715–5731.
Neural Information Processing Systems (NeurIPS
2020), volume 33, pages 20378–20389. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessan-
dro Sordoni, Adam Trischler, Andrew Mattarella-
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Micke, Subhransu Maji, and Mohit Iyyer. 2020. Ex-
Le Bras, and Yejin Choi. 2019. Social IQa: Com- ploring and predicting transferability across NLP
monsense reasoning about social interactions. In tasks. In Proceedings of the 2020 Conference on
Proceedings of the 2019 Conference on Empirical Empirical Methods in Natural Language Processing
Methods in Natural Language Processing and the (EMNLP 2020), pages 7882–7926.
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP 2019), pages Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pap-
4463–4473. pagari, R. Thomas McCoy, Roma Patel, Najoung
Kim, Ian Tenney, Yinghui Huang, Katherin Yu,
Timo Schick and Hinrich Schütze. 2021. It’s not just Shuning Jin, Berlin Chen, Benjamin Van Durme,
size that matters: Small language models are also Edouard Grave, Ellie Pavlick, and Samuel R. Bow-
few-shot learners. In Proceedings of the 2021 Con- man. 2019a. Can you tell me how to get past sesame
ference of the North American Chapter of the As- street? sentence-level pretraining beyond language
sociation for Computational Linguistics: Human modeling. In Proceedings of the Annual Meeting of

5052
the Association for Computational Linguistics (ACL Rui Zhang and Joel Tetreault. 2019. This email could
2019), pages 4465–4476. save your life: Introducing the task of email subject
line generation. In Proceedings of the 57th Annual
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Meeting of the Association for Computational Lin-
Amanpreet Singh, Julian Michael, Felix Hill, Omer guistics (ACL 2019), pages 446–456.
Levy, and Samuel Bowman. 2019b. Superglue: A
stickier benchmark for general-purpose language un- Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng
derstanding systems. In Proceedings of the 33rd In- Gao, Kevin Duh, and Benjamin Van Durme. 2018.
ternational Conference on Neural Information Pro- Record: Bridging the gap between human and ma-
cessing Systems (NeurIPS 2019), volume 32, pages chine commonsense reading comprehension. arXiv
3266–3280. preprint arXiv:1810.12885.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
Hill, Omer Levy, and Samuel R Bowman. 2019c. Character-level convolutional networks for text clas-
Glue: A multi-task benchmark and analysis platform sification. In Proceedings of the 29th Conference on
for natural language understanding. Proceedings of Neural Information Processing Systems (NeurIPS
the 7th International Conference on Learning Repre- 2015), volume 28, pages 649–657.
sentations (ICLR 2019).
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and
Sameer Singh. 2021. Calibrate before use: Improv-
Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-
ing few-shot performance of language models. In
man. 2019. Neural network acceptability judgments.
Proceedings of the 38th International Conference
Transactions of the Association for Computational
on Machine Learning (ICML 2021), volume 139 of
Linguistics (TACL 2019), 7:625–641.
PMLR, pages 12697–12706.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of the Conference of the North American Chap-
ter of the Association for Computational Linguis-
tics: Human Language Technologies (NAACL 2018),
pages 1112–1122.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-

bonell, Russ R Salakhutdinov, and Quoc V Le. 2019.
Xlnet: Generalized autoregressive pretraining for
language understanding. In Proceedings of the 33th
Conference on Neural Information Processing Sys-
tems (NeurIPS 2019), volume 32, pages 5753–5763.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,

William Cohen, Ruslan Salakhutdinov, and Christo-
pher D. Manning. 2018. HotpotQA: A dataset
for diverse, explainable multi-hop question answer-
ing. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP
2018), pages 2369–2380.

Wenpeng Yin, Dragomir Radev, and Caiming Xiong.

2021. DocNLI: A large-scale dataset for document-
level natural language inference. In Findings of the
Association for Computational Linguistics (Findings
of ACL-IJCNLP 2021), pages 4913–4922.

Elad Ben Zaken, Shauli Ravfogel, and Yoav Gold-

berg. 2021. Bitfit: Simple parameter-efficient
fine-tuning for transformer-based masked language-
models. arXiv preprint arXiv:2106.10199.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali

Farhadi, and Yejin Choi. 2019. HellaSwag: Can a
machine really finish your sentence? In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics (ACL 2019), pages
4791–4800.

5053
Appendices Method
Model size
S MALL BASE L ARGE XL XXL

A Full results for Figure 1 P ROMPT D ESIGN (GPT-3)

M ODELT UNING
40.6
62.80.8
43.4
73.70.6
45.1
81.30.6
47.8
83.10.2
52.8
89.90.2
P ROMPT T UNING 59.80.8 63.11.1 74.52.2 79.20.9 88.80.2
Table 4 shows the performance of different model M ULTI - TASK M ODELT UNING 64.60.2 79.20.3 84.50.1 88.00.5 90.10.2
SP OT (O URS ) 64.50.3 73.20.3 82.70.2 88.70.3 91.20.1
tuning and prompt tuning methods (described
in §2.1.1) on the S UPER GLUE benchmark. Table 4: S UPER GLUE performance of different model
tuning and prompt tuning methods across model sizes.
B Source datasets used in our SP OT We report the mean and standard deviation (in the sub-
experiments in §2 script) across three random seeds. SP OT outperforms
vanilla P ROMT T UNING and GPT-3 by a large margin,
Figure 6 displays the datasets used in our SP OT matching or outperforming M ODELT UNING across all
experiments in §2. In addition to the C4 unlabeled model sizes. At the XXL model size, SP OT even outper-
dataset (Raffel et al., 2020), we use 55 labeled forms M ULTI - TASK M ODELT UNING, which fine-tunes the
datasets. These datasets come from common NLP entire model on the GLUE mixture before fine-tuning
benchmarks/families of tasks, namely: it on individual S UPER GLUE tasks.

• GLUE (Wang et al., 2019c), including

C O LA (Warstadt et al., 2019), SST-2 (Socher
• Commonsense reasoning on RAIN-
BOW (Lourie et al., 2021) includ-
et al., 2013), MRPC (Dolan and Brockett,
2005), QQP (Iyer et al., 2017), STS-B (Cer ing αNLI (Bhagavatula et al., 2020),
C OSMOS QA (Huang et al., 2019), H EL -
et al., 2017), MNLI (Williams et al., 2018),
LA SWAG (Zellers et al., 2019), PIQA (Bisk
QNLI (Wang et al., 2019c), and RTE (Dagan
et al., 2005, et seq.). et al., 2020), S OCIAL IQ A (Sap et al., 2019),
and W INO G RANDE (Sakaguchi et al., 2020).
• S UPER GLUE (Wang et al., 2019b), including
B OOL Q (Clark et al., 2019), CB (De Marn- • Machine translation, including WMT
effe et al., 2019), COPA (Roemmele et al., E N D E (Bojar et al., 2014), WMT E N F R (Bojar
2011), M ULTI RC (Khashabi et al., 2018), et al., 2015), and WMT E N RO (Bojar et al.,
R E C O RD (Zhang et al., 2018), RTE, W I C (Pile- 2016).
hvar and Camacho-Collados, 2019), and
• Summarization, including A ESLC (Zhang and
WSC (Levesque et al., 2012).
Tetreault, 2019), B ILL S UM (Kornilova and Ei-
• Natural language inference (NLI), including delman, 2019), CNN/DAILYMAIL (Hermann
ANLI (Nie et al., 2020), CB, D OC NLI (Yin et al., et al., 2015; See et al., 2017), W IKILIN -
2021), MNLI, QNLI, RTE, and SNLI (Bowman GUA (Ladhak et al., 2020), G IGAWORD (Graff

et al., 2015). et al., 2003; Rush et al., 2015), M ULTI -

N EWS (Fabbri et al., 2019), N EWSROOM (Grusky
• Paraphrasing/semantic similarity, including et al., 2018), SAMS UM (Gliwa et al., 2019),
C X C (Parekh et al., 2021), MRPC, QQP, and and XS UM (Narayan et al., 2018).
STS-B.
• Natural language generation on
• Sentiment analysis, including CR (Hu and Liu, GEM (Gehrmann et al., 2021), including
2004), G OEMOTIONS (Demszky et al., 2020), C OMMON G EN (Lin et al., 2020), DART (Nan
S ENTIMENT 140 (Go et al., 2009), SST-2, and et al., 2021), E2E (Dušek et al., 2019),
Y ELP -2 (Zhang et al., 2015). SGD (Rastogi et al., 2020), W EB NLG (Gardent
et al., 2017), W IKI AUTO (Jiang et al., 2020a),
• Question answering (QA) on MRQA (Fisch XS UM, and W IKILINGUA.
et al., 2019), including SQ UAD (Ra-
jpurkar et al., 2016), N EWS QA (Trischler
et al., 2017), T RIVIAQA (Joshi et al., C Additional training details
2017), S EARCH QA (Dunn et al., 2017),
H OTPOT QA (Yang et al., 2018), and NAT- For P ROMPT T UNING, following Lester et al. (2021),
URAL Q UESTIONS ( NQ (Kwiatkowski et al., we initialize the prompt tokens with embeddings
2019)). that represent an enumeration of the output classes
5054
Translation
GLUE NLI Paraphrasing/ WMT EnDe WMT EnFr
CoLA SST-2 ANLI CB Similarity RAINBOW
WMT EnRo
MRPC QQP DocNLI MNLI CxC MRPC αNLI CosmosQA

STS-B MNLI QNLI RTE QQP STS-B HellaSWAG PIQA

SocialIQa WinoGrande
Summarization
QNLI RTE SNLI
Aeslc BillSum
C4
CNN/Dailymail Wikilingua
SuperGLUE Sentiment GEM
Gigaword
BoolQ
BoolQ CB
MRQA DART
MultiNews
CR Goemotions CommonGen

SQuAD NewsQA Newsroom SAMSum

COPA
COPA MultiRC Sentiment140 SST-2 E2E SGD
TriviaQA SearchQA XSum
ReCoRD
ReCoRD RTE Yelp-2 WebNLG WikiAuto
HotpotQA NQ
WiC
WiC WSC XSum Wikilingua

Figure 6: Datasets used in our SP OT experiments in §2. C4, MNLI, and SQ UAD were all used by themselves as
single source tasks in addition to being mixed in with other tasks.
Total Tuned
Model S CORE B OOL Q CB COPA M ULTI RC R E C O RD RTE WIC WSC
parameters parameters
ST-M O E-32B 269B 269B 91.2 92.4 96.9/98.0 99.2 89.6/65.8 95.1/94.4 93.5 77.7 96.6
T URING NLR V 5 5.4B 5.4B 90.9 92.0 95.9/97.6 98.2 88.4/63.0 96.4/95.9 94.1 77.1 97.3
Top-7
submissions ERNIE 3.0 12B 12B 90.6 91.0 98.6/99.2 97.4 88.6/63.2 94.7/94.2 92.6 77.4 97.3
T5 + UDG 11B 11B 90.4 91.4 95.8/97.6 98.0 88.3/63.0 94.2/93.5 93.0 77.9 96.6
D E BERTA / T URING NLRV 4 3.1B 3.1B 90.3 90.4 95.7/97.6 98.4 88.2/63.7 94.5/94.1 93.2 77.5 95.9
H UMAN BASELINES - - 89.8 89.0 95.8/98.9 100.0 81.8/51.9 91.7/91.3 93.6 80.0 100.0
T5 11B 11B 89.3 91.2 93.9/96.8 94.8 88.1/63.3 94.1/93.4 92.5 76.9 93.8
F ROZEN T5 1.1 + SP OT 11B 410K 89.2 91.1 95.8/97.6 95.6 87.9/61.9 93.3/92.4 92.9 75.8 93.8
Parameter- GPT-3 FEW- SHOT 175B 0 71.8 76.4 52.0/75.6 92.0 75.4/30.5 91.1/90.2 69.0 49.4 80.1
efficient
adaptation WARP FEW- SHOT 223M 25K 48.7 62.2 70.2/82.4 51.6 0.0/0.5 14.0/13.6 69.1 53.1 63.7
CB OW 15M 33K 44.5 62.2 49.0/71.2 51.6 0.0/0.5 14.0/13.6 49.7 53.1 65.1

Table 5: S UPER GLUE results of our SP OT XXL submission (in green) and competitors from the leaderboard as of
2022/02/09.

with a back off to sampled vocabulary to fill any Our SP OT submission achieves a score of 89.2,
remaining prompt positions. which far exceeds all other parameter-efficient
For model tuning approaches, we use the de- adaptation methods, including GPT-3, which ben-
fault hyperparameters for T5 (Raffel et al., 2020), efits from over 10× more frozen parameters (al-
i.e., learning rate 0.001, Adafactor optimizer with though it uses no tuned parameters). Compared to
pre-training parameter states restored, and dropout WARP (Hambardzumyan et al., 2021), our SP OT ap-
probability 0.1. To improve the model tuning base- proach tunes 16× more parameters (410K vs. 25K),
lines, we perform a sweep over the batch size hy- and benefits from 50× more frozen parameters.
perparameter and select 216 tokens per batch, fol- To the best of our knowledge, SP OT is the first
lowing Lester et al. (2021). parameter-efficient adaptation approach that is com-
petitive with methods that tune billions of param-
D Details of our S UPER GLUE submission eters. Most notably, SP OT’s performance almost
Table 5 shows the performance of our SP OT XXL matches that of fully fine-tuned T5 XXL (89.3), de-
S UPER GLUE submission, along with several strong
spite building on the same underlying model, and
competitors from the public S UPER GLUE leader- tuning 27,000× fewer parameters. We note that
SP OT outperforms T5 on three of eight S UPER GLUE
board. Apart from the human baseline, the top-7
submissions all tune >3B parameters directly on the tasks (namely, CB, COPA, RTE).
final tasks. Only three previous S UPER GLUE sub- E Task transferability results
missions use parameter efficient adaptation, in the
sense of tuning <1M parameters on the final tasks; The full results of our task transferability exper-
all other submissions tune >50M parameters.17 iments can be found in Table 6. We show that
17
in many cases, initializing the prompt to that of a
The “AILabs Team, Transformers” submission is listed
as tuning 3M parameters, but we suspect this is in error, as the submission mentions using the T5-3B and T5-L ARGE models.

5055
source task can provide significant gain on a target
task. Table 7 displays positive transfers with more
than 10% relative error reduction on the target task.

F Task embedding similarity

In Figure 7, we show a clustered heatmap of cosine
similarities between the task embeddings of the
26 NLP tasks we study in our task transferability
experiments. For each task, we include the result-
ing task embeddings from all the three different
prompt tuning runs on the task. As can be seen, our
task embeddings capture task relationships: similar
tasks cluster together. Additionally, task embed-
dings that are derived from different prompts of the
same task are linked together.

G Correlation between task similarity

and task transferability
Figure 8 shows how the relative error reduction on
a target task changes as a function of the similarity
between the source and target task embeddings.

5056
B OOL Q C O LA STS-B WIC CR MRPC RTE WSC COPA CB
BASELINE 73.01.2 52.91.2 88.10.6 63.61.6 93.50.2 86.10.7 68.71.2 71.51.7 56.71.7 92.71.9
C4 75.80.5 54.81.1 87.80.6 66.30.8 93.90.1 88.00.6 69.11.9 68.00.5 54.30.9 83.15.7
D OC NLI 72.71.4 52.70.9 87.30.9 64.70.3 93.60.4 86.20.8 67.42.6 71.13.6 56.05.9 87.21.7
Y ELP -2 74.80.7 53.90.2 88.10.3 64.70.5 93.80.3 86.60.8 69.21.1 70.81.2 55.00.0 87.81.6
MNLI 77.60.4 54.20.7 89.50.3 69.50.5 93.90.4 88.40.6 74.71.3 71.83.3 69.32.1 97.01.1
QQP 75.90.5 55.61.3 89.40.2 67.90.2 93.70.5 88.10.7 72.00.5 71.50.9 62.02.2 88.74.2
QNLI 75.60.5 55.52.0 89.20.2 69.61.3 93.80.2 87.80.1 71.10.8 71.52.5 59.73.9 92.51.1
R E C O RD 73.10.9 54.71.3 87.70.7 65.50.9 93.70.1 88.70.3 67.51.3 77.22.3 59.31.2 74.15.2
CXC 75.90.4 55.00.2 90.00.0 70.20.1 93.90.2 88.00.4 70.30.5 68.62.5 60.33.9 89.32.4
SQ UAD 76.00.7 54.91.2 87.60.1 66.80.3 93.90.5 88.70.7 71.20.4 72.40.5 63.01.6 91.31.3
DROP 73.61.3 53.01.0 86.90.9 67.51.2 93.70.2 88.20.3 65.73.1 73.42.0 60.03.6 78.58.6
SST-2 73.30.5 52.30.3 87.90.3 63.81.7 93.80.5 85.60.9 66.91.1 68.60.4 57.02.2 92.91.3
W INO G RANDE 74.10.8 52.81.6 87.80.3 62.42.5 93.70.1 86.10.5 67.91.3 71.52.5 56.71.2 83.90.8
H ELLA SWAG 70.02.6 32.723.6 87.50.2 60.13.9 93.60.0 86.61.4 63.95.4 70.22.1 58.02.2 85.52.6
M ULTI RC 74.00.5 50.04.6 88.20.2 66.40.5 93.40.1 86.41.3 67.61.0 69.24.1 56.04.1 80.08.6
C OSMOS QA 73.41.3 52.12.3 87.70.5 65.91.0 93.60.3 87.90.8 68.71.6 69.63.2 62.35.0 83.98.8
RACE 73.60.5 52.52.8 87.50.5 63.15.3 93.40.2 86.50.8 66.52.0 68.91.2 57.31.2 84.83.4

Table 6: Many tasks can benefit each other via prompt transfer. The orange-colored row shows the results of
prompt tuning T5 BASE on the target tasks from scratch (i.e., without any prompt transfer). Each cell in the other
rows represents the target task performance when transferring the prompt from the associated source task (row) to
the associated target task (column). Positive transfers are shown in green and the best results are highlighted in
bold (green). Numbers in the subscript indicate the standard deviation across 3 random seeds.

Transfer Increase (relative)

MNLI → CB 58.9
MNLI → COPA 29.1
R E C O RD → WSC 20.0
MNLI → RTE 19.2
R E C O RD → MRPC 18.7
SQ UAD → MRPC 18.7
CXC → WIC 18.1
MNLI → B OOL Q 17.0
MNLI → MRPC 16.5
QNLI → W I C 16.5
MNLI → W I C 16.2
C X C → STS-B 16.0
DROP → MRPC 15.1
SQ UAD → COPA 14.5
QQP → MRPC 14.4
C X C → MRPC 13.7
C4 → MRPC 13.7
C OSMOS QA → MRPC 12.9
C OSMOS QA → COPA 12.9
QQP → COPA 12.2
QNLI → MRPC 12.2
QQP → W I C 11.8
MNLI → STS-B 11.8
SQ UAD → B OOL Q 11.1
QQP → STS-B 10.9
QQP → B OOL Q 10.7
C X C → B OOL Q 10.7
DROP → W I C 10.7
QQP → RTE 10.5
C4 → B OOL Q 10.4

Table 7: Positive transfers with more than 10% relative error reduction on the target task. s → t denotes the
transfer from source task s to target task t.

5057
1.0

0.8

0.6

0.4

0.2

0.0 C4_1
C4_2
C4_3
WSC_1
WSC_2
WSC_3
SQuAD_3
SQuAD_1
SQuAD_2
DROP_1
DROP_2
DROP_3
ReCoRD_2
ReCoRD_1
ReCoRD_3
CoLA_2
CoLA_1
CoLA_3
RTE_3
DocNLI_3
DocNLI_1
DocNLI_2
CB_2
CB_1
CB_3
MNLI_2
MNLI_1
MNLI_3
CxC_2
CxC_1
CxC_3
STS-B_1
STS-B_2
STS-B_3
RTE_1
RTE_2
MRPC_2
MRPC_1
MRPC_3
QQP_3
QQP_1
QQP_2
QNLI_1
QNLI_2
QNLI_3
COPA_2
COPA_1
COPA_3
Yelp-2_2
Yelp-2_1
Yelp-2_3
SST-2_2
SST-2_1
SST-2_3
CR_3
CR_1
CR_2
WiC_1
WiC_2
WiC_3
MultiRC_1
MultiRC_2
MultiRC_3
BoolQ_1
BoolQ_2
BoolQ_3
RACE_3
RACE_1
RACE_2
CosmosQA_1
HellaSWAG_2
HellaSWAG_3
CosmosQA_2
HellaSWAG_1
CosmosQA_3
WinoGrande_3
WinoGrande_1
WinoGrande_2
Yelp-2_2
Yelp-2_1
Yelp-2_3
SST-2_2
SST-2_1
SST-2_3
RTE_3

RTE_1
RTE_2

COPA_2
COPA_1
COPA_3

HellaSWAG_2
HellaSWAG_3
HellaSWAG_1
ReCoRD_2
ReCoRD_1
ReCoRD_3

MultiRC_1
MultiRC_2
MultiRC_3

RACE_3
RACE_1
RACE_2
STS-B_1
STS-B_2
STS-B_3

WiC_1
WiC_2
WiC_3

WinoGrande_3
WinoGrande_1
WinoGrande_2
C4_1
C4_2
C4_3
WSC_1
WSC_2
WSC_3
SQuAD_3
SQuAD_1
SQuAD_2
DROP_1
DROP_2
DROP_3

DocNLI_3
DocNLI_1
DocNLI_2
CB_2
CB_1
CB_3
MNLI_2
MNLI_1
MNLI_3
CxC_2
CxC_1
CxC_3

MRPC_2
MRPC_1
MRPC_3
QQP_3
QQP_1
QQP_2
QNLI_1
QNLI_2
QNLI_3

CR_3
CR_1
CR_2

BoolQ_1
BoolQ_2
BoolQ_3

CosmosQA_1
CosmosQA_2
CosmosQA_3
CoLA_2
CoLA_1
CoLA_3

Figure 7: Our prompt-based task embeddings capture task relationships: similar tasks group together into clusters.
Additionally, task embeddings that are derived from different prompts of the same task are linked together. t_1, t_2,
t_3 correspond to three different prompt tuning runs on task t.

5058
BoolQ CoLA STS-B WiC CR
(r = -0.070, p = 0.635) (r = 0.028, p = 0.852) (r = 0.708, p = 1.853e-08) (r = 0.163, p = 0.270) (r = 0.234, p =0.110)
relative error reduction

MRPC RTE WSC COPA CB

(r = 0.243, p = 0.096) (r = 0.290, p = 0.046) (r = 0.428, p = 0.002) (r = 0.140, p = 0.343) (r = 0.490, p = 0.000)

cosine similarity

Figure 8: Correlation between task similarity and task transferability. Each point represents a source prompt. The
x-axis shows the cosine similarity between the associated source and target task embeddings, averaged over three
runs for the target task (orange title). The y-axis measures the relative error reduction on the target task achieved
by each source prompt. We include the Pearson correlation coefficient (r) and p-value.

5059