SetFit
Lewis Tunstall¹, Nils Reimers², Unso Eun Seo Jo¹, Luke Bates³, Daniel Korat⁴, Moshe Wasserblat⁴, Oren Pereg⁴

¹Hugging Face  ²cohere.ai  ³Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt  ⁴Emergent AI Lab, Intel Labs

¹[email protected]  ²[email protected]  ³[email protected]  ⁴[email protected]
Abstract

Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings.
3.1 The SetFit approach for few-shot text classification

SetFit uses a two-step training approach in which we first fine-tune an ST and then train a classifier head. In the first step, an ST is fine-tuned on the input data in a contrastive, Siamese manner on sentence pairs. In the second step, a text classification head is trained using the encoded training data generated by the fine-tuned ST from the first step. Figure 2 illustrates this process, and we discuss these two steps in the following sections.
ST fine-tuning  To better handle the limited amount of labeled training data in few-shot scenarios, we adopt a contrastive training approach that is often used for image similarity (Koch et al., 2015). Formally, we are given a small set of K labeled examples D = {(x_i, y_i)}, where x_i and y_i are sentences and their class labels, respectively. For each class label c ∈ C, we generate a set of R positive triplets T_p^c = {(x_i, x_j, 1)}, where x_i and x_j are pairs of randomly chosen sentences from the same class c, such that y_i = y_j = c. Similarly, we also generate a set of R negative triplets T_n^c = {(x_i, x_j, 0)}, where x_i are sentences from class c and x_j are randomly chosen sentences from different classes, such that y_i = c and y_j ≠ c. Finally, the contrastive fine-tuning dataset T is produced by concatenating the positive and negative triplets across all class labels:

T = {(T_p^0, T_n^0), (T_p^1, T_n^1), ..., (T_p^{|C|}, T_n^{|C|})},

where |C| is the number of class labels, |T| = 2R|C| is the number of pairs in T, and R is a hyperparameter. Unless stated otherwise, we used R = 20 in all the evaluations.

This contrastive fine-tuning approach enlarges the size of the training data in few-shot scenarios. Assuming that a small number K of labeled examples is given for a binary classification task, the potential size of the ST fine-tuning set T is derived from the number of unique sentence pairs that can be generated, namely K(K − 1)/2, which is significantly larger than just K.
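As an illustration, a minimal sketch of this pair-generation step is given below. The function name and the exact sampling scheme are our own; the paper only specifies that R positive and R negative pairs are drawn per class.

```python
import random

def generate_contrastive_pairs(sentences, labels, R=20, seed=0):
    """Build the contrastive fine-tuning set T from a small labeled set D.

    Returns triplets (x_i, x_j, 1.0) for same-class pairs and
    (x_i, x_j, 0.0) for different-class pairs, R of each per class,
    so that |T| = 2 * R * |C|.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    by_class = {c: [s for s, y in zip(sentences, labels) if y == c] for c in classes}
    pairs = []
    for c in classes:
        in_class = by_class[c]
        out_of_class = [s for other in classes if other != c for s in by_class[other]]
        for _ in range(R):
            # positive pair: two randomly chosen sentences from the same class c
            x_i, x_j = rng.sample(in_class, 2)
            pairs.append((x_i, x_j, 1.0))
            # negative pair: one sentence from class c, one from a different class
            pairs.append((rng.choice(in_class), rng.choice(out_of_class), 0.0))
    return pairs
```

With, say, 8 labeled examples per class and R = 20, this already yields 40 pairs per class, which is the enlargement effect described above.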
Classification head training  In this second step, the fine-tuned ST encodes the original labeled training data {x_i}, yielding a single sentence embedding per training sample: Emb_{x_i} = ST(x_i), where ST() is the function representing the fine-tuned ST. The embeddings, along with their class labels, constitute the training set for the classification head, T_CH = {(Emb_{x_i}, y_i)}, where |T_CH| = |D|. A logistic regression model is used as the text classification head throughout this work.

Inference  At inference time, the fine-tuned ST encodes an unseen input sentence x_i and produces a sentence embedding. Next, the classification head that was trained in the training step produces the class prediction of the input sentence based on its sentence embedding. Formally, this is x_i^pred = CH(ST(x_i)), where CH represents the classification head prediction function.
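The second step and the inference path can be sketched as follows, assuming `st` is the fine-tuned sentence-transformers model (anything exposing an `encode()` method works) and using scikit-learn's logistic regression as the head, which matches the head used in this work; the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classification_head(st, train_sentences, train_labels):
    # Emb_{x_i} = ST(x_i): one embedding per original labeled sentence
    embeddings = np.asarray(st.encode(train_sentences))
    # T_CH = {(Emb_{x_i}, y_i)}: fit the logistic regression head on the embeddings
    head = LogisticRegression(max_iter=1000)
    head.fit(embeddings, train_labels)
    return head

def predict(st, head, sentences):
    # x_pred = CH(ST(x)): encode unseen sentences, then apply the trained head
    return head.predict(np.asarray(st.encode(sentences)))
```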
4 Experiments

4.1 Data

We conduct experiments on available text classification datasets. We split the datasets into development and test datasets (see Table 6 in Appendix A). The development datasets are utilized for setting SetFit's hyperparameters, such as the number of training pairs |T|, the loss function, and the optimal number of training epochs. In order to test the robustness of SetFit to various types of text, we choose test datasets that represent different text classification tasks with a varying number of classes. All datasets used are available on the Hugging Face Hub under the SetFit organisation.[3] In addition, we evaluate SetFit on the RAFT benchmark (Alex et al., 2021), a real-world few-shot text-classification benchmark composed of 11 practical tasks, where each task has only 50 training examples.

[3] huggingface.co/SetFit
4.2 SetFit models

We evaluate three variations of SetFit, each of which uses a different underlying ST model of a different size (shown in Table 1).

Variation | Underlying ST Model | Size*
SetFit-RoBERTa | all-roberta-large-v1 [4] | 355M
SetFit-MPNet | paraphrase-mpnet-base-v2 [5] | 110M
SetFit-MiniLM | paraphrase-MiniLM-L3-v2 [6] | 15M

Table 1: SetFit model variations using three different underlying ST models. * Number of parameters.

[4] https://huggingface.co/sentence-transformers/all-roberta-large-v1
[5] https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2
[6] https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L3-v2

4.3 Baselines

We compare SetFit's performance against standard transformer fine-tuning and recent best-performing few-shot approaches: ADAPET (Tam et al., 2021), PERFECT (Karimi Mahabadi et al., 2022b), and T-Few (Liu et al., 2022).

Standard fine-tuning  Our first baseline is RoBERTa-Large (Liu et al., 2019), a standard, encoder-only transformer that is fine-tuned for sequence classification. Since we assume no validation sets, we constructed validation splits by randomly selecting equally sized portions from the train split. We perform a hyperparameter search on the number of epochs in the range [25, 75] and pick the best-performing model on a validation split. We use a learning rate of 2e-5 and a batch size of 4 in all our experiments.
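For concreteness, a rough sketch of this baseline with the Hugging Face Trainer is shown below. It is not the authors' exact script: the dataset objects and the epoch count selected by the search are passed in as parameters.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_roberta_baseline(train_dataset, val_dataset, num_labels, num_epochs):
    """Standard sequence-classification fine-tuning of RoBERTa-Large.

    `train_dataset`/`val_dataset` are pre-tokenized datasets and `num_epochs`
    is the value selected by the search over [25, 75] on the validation split.
    """
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=num_labels)
    args = TrainingArguments(
        output_dir="finetune-baseline",
        learning_rate=2e-5,              # fixed in all experiments
        per_device_train_batch_size=4,   # fixed in all experiments
        num_train_epochs=num_epochs,
    )
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=train_dataset, eval_dataset=val_dataset)
    trainer.train()
    return trainer
```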
ADAPET  Pattern exploiting training (PET) (Schick and Schütze, 2021b,a) is a method for improving PLM performance in few-shot setups on downstream tasks by converting textual input into a cloze-style question intended to be reminiscent of the masked language modelling (MLM) objective under which large PLMs such as BERT (Devlin et al., 2019) are trained. To determine SetFit's performance relative to PET-based approaches, we compare our method to ADAPET (Tam et al., 2021), an extension of PET. In recent work (Schick and Schütze, 2021), the authors show that PET-based classification methods excel on the RAFT benchmark, placing second only to much larger models such as T-Few. In our experiments, we used ADAPET with default hyperparameters and examined its performance with different PLM backbones, reporting the PLM which resulted in the best performance, albert-xxlarge-v2 [7] (see Appendix A.2 for further details).

[7] https://huggingface.co/albert-xxlarge-v2

PERFECT  PERFECT (Karimi Mahabadi et al., 2022b) is another cloze-based fine-tuning method, but unlike PET or ADAPET, it does not require handcrafted task prompts and verbalizers. Instead, PERFECT uses task-specific adapters (Houlsby et al., 2019; Pfeiffer et al., 2021) and multi-token label embeddings which are independent from the language model vocabulary during fine-tuning. To run PERFECT on our test datasets, we adapted the configurations provided in the PERFECT codebase.

T-Few  T-Few (Liu et al., 2022) is a PEFT-based few-shot learning method based on T0 (Sanh et al., 2021). The authors provide two versions of T-Few: 11 and 3 billion parameters. Due to compute constraints, we were unable to run the 11-billion-parameter version, which requires an 80GB A100 GPU. Running tests on T-Few as opposed to SetFit posed several hurdles. First, because T-Few's performance varies significantly depending on the input prompts, we run each experiment using 5 random seeds and report the median result, as in the original paper. Second, T-Few relies on dataset-specific prompts, made available on P3 (Public Pool of Prompts) (Bach et al., 2022). Only one of our test datasets had prompts in P3. For the rest of the datasets, we adapt standardized P3 prompts of similar tasks or implement prompts ourselves (see Appendix A.3).

4.4 Experimental Setup

Systematically evaluating few-shot performance can be challenging, because fine-tuning on small datasets may incur instability (Dodge et al., 2020; Zhang et al., 2021). To address this issue, in our experiments we use 10 random training splits for each dataset and sample size. These splits are used as training data across all tested methods. For each method, we report the average measure (depending on the dataset) and the standard deviation across these splits.
We fine-tune SetFit's ST model using cosine-similarity loss with a learning rate of 1e-3, a batch size of 16, and a maximum sequence length of 256 tokens, for 1 epoch.
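A minimal sketch of this fine-tuning step with the sentence-transformers library is shown below, using its pre-3.0 `InputExample`/`model.fit` training API; the pair list is assumed to come from the pair-generation step of Section 3.1, and the helper name is ours.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def finetune_st(pairs, model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
    """Fine-tune an ST on (sentence_a, sentence_b, label) pairs with cosine-similarity loss."""
    model = SentenceTransformer(model_name)
    model.max_seq_length = 256                       # maximum sequence length of 256 tokens
    examples = [InputExample(texts=[a, b], label=float(lbl)) for a, b, lbl in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)],
              epochs=1,
              optimizer_params={"lr": 1e-3})
    return model
```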
5 Results

Table 2 shows a comparison between SetFit-MPNet and the baselines for N = 8 and N = 64 labeled training samples per class. For reference purposes, standard fine-tuning results using the full training data are also shown (in all cases, a higher score indicates stronger performance; see Table 6 in Appendix A for dataset metric details). We find that SetFit-MPNet significantly outperforms the FineTune baseline for N = 8 by an average of 19.3 points. However, as the number of training samples increases to N = 64, the gap decreases to 5.6 points.

Similarly, we find that SetFit-MPNet outperforms PERFECT by 13.6 and 2.6 points. SetFit-MPNet also outperforms ADAPET by 4.0 and 1.5 points for N = 8 and N = 64, respectively. For N = 8, SetFit-MPNet is on par with T-Few 3B, whereas for N = 64 SetFit-MPNet outperforms T-Few 3B by 5 points on average, despite being prompt-free and more than 27 times smaller.

RAFT results  The test datasets listed in Table 2 were not specifically designed for few-shot benchmarking. In order to better benchmark SetFit, we used the RAFT benchmark (Alex et al., 2021), which is specifically designed for benchmarking few-shot methods. Table 3 shows the average accuracy of SetFit-MPNet, SetFit-RoBERTa and four prominent methods. SetFit-RoBERTa outperforms GPT-3 and PET by 8.6 and 1.7 points, respectively, while alleviating the need for hand-crafting prompts. It also surpasses the human baseline in 7 out of 11 tasks. SetFit-RoBERTa falls short of T-Few 11B by 4.5 points. However, SetFit-RoBERTa is more than 30 times smaller than T-Few 11B, does not require manual prompt crafting, and is much more efficient in training and inference (see Table 5).

6 Multilingual Experiments

To determine SetFit's performance in a multilingual, few-shot text classification scenario, we conducted development and test experiments on multilingual datasets and compared SetFit to standard transformer fine-tuning and ADAPET. To the best of our knowledge, this is the first work to examine ADAPET on non-English data (see Appendix A for details).

Experimental Setup  For the multilingual experiments, we use the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020). This dataset consists of Amazon reviews in six languages (English, Japanese, German, French, Spanish, and Chinese), where each review is labeled according to a 5-star rating scale. We chose this corpus for its typological diversity, in order to examine the generalizability of SetFit and other methods across a variety of languages.

For the SetFit underlying model, we use paraphrase-multilingual-mpnet-base-v2,[8] which is a multilingual version of paraphrase-mpnet-base-v2 that is trained on parallel data in over 50 languages.

For the FineTune and ADAPET baselines, we use XLM-RoBERTa-base (Conneau et al., 2019),[9] which has a similar size to the SetFit model. We compare the performance of each method using the same settings as Conneau et al. (2019):

• each: Train and evaluate on monolingual data to measure per-language performance.

• en: Train on the English training data and then evaluate on each language's test set.

• all: Train on all the training data and evaluate on each language's test set.

Method  For SetFit, standard fine-tuning, and ADAPET, we adopt the same methodology and hyperparameters used for the monolingual English experiments in Section 4. We evaluate each method in the few-shot regime (N = 8 samples per class) and compare against the performance of fine-tuning on the full training set of 20,000 examples.

Results  Table 4 shows the results of SetFit, standard fine-tuning, and ADAPET on each language in MARC, where a higher MAE indicates weaker performance. In the few-shot regime of N = 8 samples per class, we find that SetFit significantly outperforms FineTune and ADAPET in all settings (each, en, all), with the best average performance obtained when training on English data only.

[8] huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
[9] huggingface.co/xlm-roberta-base
Method | SST-5 | AmazonCF | CR | Emotion | EnronSpam | AGNews | Average†

|N| = 8*
FineTune | 33.5 (2.1) | 9.2 (4.9) | 58.8 (6.3) | 28.7 (6.8) | 85.0 (6.0) | 81.7 (3.8) | 43.0 (5.2)
PERFECT | 34.9 (3.1) | 18.1 (5.3) | 81.5 (8.6) | 29.8 (5.7) | 79.3 (7.4) | 80.8 (5.0) | 48.7 (6.0)
ADAPET | 50.0 (1.9) | 19.4 (7.3) | 91.0 (1.3) | 46.2 (3.7) | 85.1 (3.7) | 85.1 (2.7) | 58.3 (3.6)
T-Few 3B | 55.0⋆ (1.4) | 19.0 (3.9) | 92.1 (1.0) | 57.4 (1.8) | 93.1 (1.6) | – | 63.4 (1.9)
SetFit-MPNet | 43.6 (3.0) | 40.3 (11.8) | 88.5 (1.9) | 48.8 (4.5) | 90.1 (3.4) | 82.9 (2.8) | 62.3 (4.9)

|N| = 64*
FineTune | 45.9 (6.9) | 52.8 (12.1) | 88.9 (1.9) | 65.0 (17.2) | 95.9 (0.8) | 88.4 (0.9) | 69.7 (7.8)
PERFECT | 49.1 (0.7) | 65.1 (5.2) | 92.2 (0.5) | 61.7 (2.7) | 95.4 (1.1) | 89.0 (0.3) | 72.7 (1.9)
ADAPET | 54.1 (0.8) | 54.1 (6.4) | 92.6 (0.7) | 72.0 (2.2) | 96.0 (0.9) | 88.0 (0.6) | 73.8 (2.2)
T-Few 3B | 56.0 (0.6) | 34.7 (4.5) | 93.1 (1.0) | 70.9 (1.1) | 97.0 (0.3) | – | 70.3 (1.5)
SetFit-MPNet | 51.9 (0.6) | 61.9 (2.9) | 90.4 (0.6) | 76.2 (1.3) | 96.1 (0.8) | 88.0 (0.7) | 75.3 (1.3)

|N| = Full**
FineTune | 59.8 | 80.1 | 92.4 | 92.6 | 99.0 | 93.8 | 84.8

Table 2: SetFit performance scores and standard deviations (in parentheses) compared to the baselines across 6 test datasets for three training set sizes |N|. * Number of training samples per class. ** Entire available training data used. † The AGNews dataset is excluded from the average score to enable fair comparison with T-Few (which has AGNews in its training set). ⋆ The inputs of SST-5 (but not its labels) appeared in T-Few's training set, as part of the Rotten Tomatoes dataset.
Table 4: Average performance (MAE × 100) on the Multilingual Amazon Reviews Corpus for two training set
sizes |N |. ∗ No. of training samples per class. ∗∗ Entire available training data used (20,000 samples).
Figure 3: Average accuracy as a function of the unlabeled training set size N of the SetFit student and the baseline student on the AG News, Emotion, and SST5 datasets.
sentence embeddings for each pair and to calculate the cosine-similarity between them. The underlying ST of the SetFit student is trained to mimic the teacher's ST output by minimizing the error between the SetFit teacher-produced cosine-similarity and its own output. The classification head of the student is then trained using the embeddings produced by the student's ST and the logits produced by the SetFit teacher's classification head. The baseline student is trained to mimic the teacher output by minimizing the error between the logits produced by the SetFit teacher's classification head and its own output.
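A rough sketch of both distillation steps is given below, assuming sentence-transformers models for the STs and scikit-learn heads. How the unlabeled sentences are paired is an assumption here, and the student head is fit on the teacher head's hard predictions rather than its raw logits, which simplifies the setup described above.

```python
import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses, util
from sklearn.linear_model import LogisticRegression

def distill_student_st(teacher_st, student_st, unlabeled_pairs, epochs=1):
    """Train the student ST to reproduce the teacher ST's cosine-similarities on unlabeled pairs."""
    emb_a = teacher_st.encode([a for a, _ in unlabeled_pairs], convert_to_tensor=True)
    emb_b = teacher_st.encode([b for _, b in unlabeled_pairs], convert_to_tensor=True)
    targets = util.cos_sim(emb_a, emb_b).diagonal()      # teacher cosine-similarity per pair
    examples = [InputExample(texts=[a, b], label=float(t))
                for (a, b), t in zip(unlabeled_pairs, targets)]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    student_st.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(student_st))],
                   epochs=epochs)
    return student_st

def distill_student_head(teacher_st, teacher_head, student_st, unlabeled_sentences):
    """Fit the student head on student embeddings against the teacher head's predictions."""
    teacher_preds = teacher_head.predict(np.asarray(teacher_st.encode(unlabeled_sentences)))
    student_head = LogisticRegression(max_iter=1000)
    student_head.fit(np.asarray(student_st.encode(unlabeled_sentences)), teacher_preds)
    return student_head
```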
Results  Figure 3 shows a comparison between the SetFit student model and the baseline student model for various amounts of unlabeled training data (N). The SetFit student significantly outperforms the baseline student when only small amounts of unlabeled data are available. For example, for N = 8, the SetFit student outperforms the baseline student by 24.8, 25.1, and 8.9 average accuracy points on the AGNews, Emotion, and SST5 datasets, respectively. As N increases, the performance gains decrease and are on par for N = 1K.

7.2 Computational costs

Comparing the relative computational costs of SetFit versus PET and PEFT methods isn't straightforward, since each method typically has different hardware and memory requirements. To simplify the comparison, we follow the approach adopted by Liu et al. (2022) and use FLOPs-per-token estimates to compare SetFit to T-Few.
These estimates can be obtained from Kaplan et al. (2020), who show that encoder-only models with N parameters have approximately 2N FLOPs-per-token for inference and 6N FLOPs-per-token for training. The resulting cost for inference and training is then given by:

C_inf = 2N · ℓ_seq,
C_train = 6N · ℓ_seq · n_steps · n_batch,

where ℓ_seq is the input sequence length, n_steps is the number of training steps, and n_batch is the batch size. For encoder-decoder models like T-Few, these estimates are halved, since the model only processes each token with either the encoder or the decoder.

Method | Inf. FLOPs | Train FLOPs | Speed-up | Score
T-Few 3B | 1.6e11 | 3.9e15 | 1x | 63.4 (1.9)
SetFit-MPNet | 8.3e9 | 2.0e14 | 19x | 62.3 (4.9)
SetFit-MiniLM† | 1.3e9 | 3.2e13 | 123x | 60.3 (1.6)

Table 5: Relative computational cost and average scores of SetFit and T-Few using |N| = 8 on the test datasets listed in Table 2. † Trained in the distillation setup described in Section 7.1, using |N| = 8 for teacher training and the rest of the available training data as unlabeled student training data. For fixed n_steps and n_batch, the relative speed-up (N′ · ℓ′_seq)/(2N · ℓ_seq) is the same for inference and training.
For the inference and training estimates shown in Table 5, we use ℓ_seq = 38 and ℓ_seq = 54 as the input sequence lengths for SetFit-MPNet and T-Few, respectively; these are the median numbers of tokens across all the test datasets in Table 2. We also use n_steps = 1000 and n_batch = 8 for all training estimates. As shown in the table, the SetFit-MPNet model is approximately an order of magnitude faster at inference and training than T-Few, despite having comparable performance on the test datasets of Table 2. SetFit-MiniLM is two orders of magnitude faster than T-Few, with an average score reduction of 3.1 accuracy points. Moreover, the storage cost of the SetFit models (70MB and 420MB, respectively) is 163 to 26 times smaller than the T0-3B checkpoint used by T-Few 3B (11.4GB), making these models much better suited for real-world deployment.
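As a quick arithmetic check (ours, not from the paper), the T-Few 3B and SetFit-MPNet rows of Table 5 follow directly from the formulas above with these settings; the SetFit-MiniLM row is obtained the same way with its parameter count.

```python
n_steps, n_batch = 1000, 8

def flops(n_params, seq_len, encoder_decoder=False):
    # 2N FLOPs per token for inference, 6N for training; halved for encoder-decoder models
    factor = 0.5 if encoder_decoder else 1.0
    inference = factor * 2 * n_params * seq_len
    training = factor * 6 * n_params * seq_len * n_steps * n_batch
    return inference, training

tfew_inf, tfew_train = flops(3e9, 54, encoder_decoder=True)   # ~1.6e11, ~3.9e15
setfit_inf, setfit_train = flops(110e6, 38)                   # ~8.4e9,  ~2.0e14
print(f"speed-up of SetFit-MPNet over T-Few 3B: {tfew_inf / setfit_inf:.0f}x")  # ~19x
```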
These estimates are borne out by comparing the time needed to train each method to convergence on N = 8 examples. For our datasets, SetFit-MPNet takes approximately 30 seconds to train on a p3.2xlarge AWS instance (16GB GPU memory), at a cost of $0.025 per split. On the other hand, T-Few 3B requires at least 40GB of GPU memory, and training on a p4d.24xlarge AWS instance takes approximately 700 seconds, at a cost of $0.7 per split.

8 Conclusion

This paper introduces SetFit, a new few-shot text classification approach. We show that SetFit has several advantages over comparable approaches such as T-Few, ADAPET and PERFECT. In particular, SetFit is much faster at inference and training; SetFit requires much smaller base models to be performant, not requiring external compute; and SetFit is additionally not subject to the instability and inconvenience of prompting. We have also demonstrated that SetFit is a robust few-shot text classifier in languages other than English across varying typologies. Finally, SetFit has proven useful in few-shot distillation setups.

Acknowledgements

The authors thank Hugging Face Inc. and Intel Inc. for providing computing resources, and the German Federal Ministry of Education and Research and the Hessian Ministry of Science and the Arts (HMWK) within the projects "The Third Wave of Artificial Intelligence - 3AI", hessian.AI, and within their joint support of the National Research Center for Applied Cybersecurity ATHENE.

References

Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, and Andreas Stuhlmüller. 2021. RAFT: A real-world few-shot text classification benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-Jian Jiang, and Alexander M. Rush. 2022. PromptSource: An integrated development environment and repository for natural language prompts.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020b. Language models are few-shot learners. CoRR, abs/2005.14165.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine Learning (ICML'06), pages 377–384. ACM Press.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. NIPS 2014 Deep Learning Workshop.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. CoRR, abs/1902.00751.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. Association for Computing Machinery.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR, abs/2001.08361.

Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Lambert Mathias, Marzieh Saeidi, Veselin Stoyanov, and Majid Yazdani. 2022a. Prompt-free and efficient few-shot learning with language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3638–3652, Dublin, Ireland. Association for Computational Linguistics.

Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Lambert Mathias, Marzieh Saeidi, Veselin Stoyanov, and Majid Yazdani. 2022b. Prompt-free and efficient few-shot learning with language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3638–3652, Dublin, Ireland. Association for Computational Linguistics.

Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. 2020. The multilingual Amazon reviews corpus.

Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. I wish I would have loved this one, but I didn't - A multilingual dataset for counterfactual detection in product reviews. CoRR, abs/2104.06893.

Christian S. Perone, Roberto Pereira Silveira, and Thomas S. Paula. 2018. Evaluation of sentence embeddings in downstream and linguistic probing tasks. CoRR, abs/1806.06259.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, Online. Association for Computational Linguistics.

Guangyuan Piao. 2021. Scholarly text classification with sentence BERT and entity embeddings. In Trends and Applications in Knowledge Discovery and Data Mining, pages 79–87, Cham. Springer International Publishing.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. True few-shot learning with prompts - A real-world perspective. CoRR, abs/2111.13440.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Derek Tam, Rakesh R. Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4980–4991, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

V. Metsis, I. Androutsopoulos, and G. Paliouras. 2006. Spam filtering with naive Bayes - which naive Bayes? In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006).

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015a. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015b. Character-level convolutional networks for text classification. In NIPS.

A Appendix

A.1 Datasets

Table 6 shows the development and test datasets that are used for setting SetFit's hyperparameters. Following is a description of the datasets used:

SST2  The Stanford Sentiment Treebank 2 is a collection of single-sentence movie reviews with positive-negative sentiment class labels (Socher et al., 2013).

IMDB  The Internet Movie Database dataset is a collection of single-sentence movie reviews with positive-negative sentiment class labels (Maas et al., 2011).

BBC News  The BBC News dataset is a collection of articles from the news outlet BBC with one of 5 topic classifications: Politics, Sports, Entertainment, Tech, and Business (Greene and Cunningham, 2006).

Enron Spam  The Enron spam email dataset consists of emails from the internal Enron correspondence channel, where emails are classified as spam or not spam (V. Metsis and Paliouras, 2006).

Student Question Categories [11]  This is a set of questions from university entrance exams in India that are classified into 4 subjects: Math, Biology, Chemistry, and Physics.

TREC-QC  The Text Retrieval Conference Question Answering dataset.

Toxic Conversations [12]  The Toxic Conversations dataset is a set of comments from Civil Comments, a platform for reader comments for independent news outlets. Human raters have given them toxicity attributes.

Amazon Polarity [13]  The Amazon Polarity dataset consists of customer reviews from Amazon taken over 18 years with binary sentiment labels. Examples are labelled either positive ("Great Read") or negative ("The Worst!") (Zhang et al., 2015a).

Following is a description of the test datasets:

Stanford Sentiment Treebank-5 (SST5)  The SST-5 dataset is the fine-grained version of the Stanford Sentiment Treebank, where each example is given one of five labels: very positive, positive, neutral, negative, very negative.

Amazon Counterfactual  The Amazon Counterfactual dataset is a set of Amazon customer reviews with professionally labeled binary labels for counterfactual detection. Counterfactual statements are statements that denote something that did not happen or cannot happen (e.g. "They are much bigger than I thought they would be."). We used the English subset for our experiments (O'Neill et al., 2021).

Customer Reviews  The Customer Reviews (Hu and Liu, 2004) dataset is part of the SentEval (Conneau and Kiela, 2018) benchmark. It is composed of positive and negative opinions mined from the web and written by customers about a variety of products.

Emotion [14]  The Emotion dataset consists of tweets from Twitter that display clear emotions (e.g. "i am now nearly finished [with] the week detox and i feel amazing"). Labels are one of six categories: anger, fear, joy, love, sadness, and surprise (Saravia et al., 2018).

AG News  AG News is a dataset of news titles from AG News with one of 4 classifications: World, Entertainment, Sports, and Business (Zhang et al., 2015b).

[11] www.kaggle.com/datasets/mrutyunjaybiswal/iitjee-neet-aims-students-questions-data
[12] https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data
[13] hf.co/datasets/amazon_polarity
[14] hf.co/datasets/emotion
Dataset Name | Type of Task | Cls.* | Label Dist.** | Metric | Split
SST5 | Sentiment | 5 | Approx. equal | Accuracy | Test
Amazon Counterfactual | Counterfactual | 2 | 10% counterfactual | MCC | Test
CR | Sentiment | 2 | Equal | Accuracy | Test
Emotion | Emotion | 6 | Equal | Accuracy | Test
Enron Spam | Unwanted Language | 2 | Equal | Accuracy | Test
AG News | Topic | 4 | Equal | Accuracy | Test
SST2 | Sentiment | 2 | Equal | Accuracy | Dev
IMDB | Sentiment | 2 | Equal | Accuracy | Dev
BBC News | Topic | 5 | Equal | Accuracy | Dev
Student Question Categories | Topic | 4 | Approx. equal | Accuracy | Dev
TREC-QC | Topic | 50 | N/A | Accuracy | Dev
Toxic Conversations | Unwanted Language | 2 | 8% Toxic | Avg. Precision | Dev
Amazon Polarity | Sentiment | 2 | Equal | Accuracy | Dev

Table 6: English datasets used for development and test experiments. * No. of classes per dataset. ** Distribution of the examples across classes.
A.2 ADAPET Training Procedure

By default, ADAPET assumes access to a training, development, and test dataset. It trains for 1,000 batches, runs predictions on the development data every 250 batches and checkpoints, keeping the model state which performed best on the development dataset. In our case, where we assume few-shot training and no development data, we ran ADAPET for 1,000 batches and disabled the checkpointing, using the model state that resulted after training for 1,000 batches. For the English data in Table 2, we used the pattern "[TEXT1] this is [LBL]", where "[TEXT1]" and "[LBL]" are placeholders for a given piece of text and the corresponding label, respectively. We constructed the verbalizer from the "label" and "label text" columns that are available in all of our datasets. For the multilingual datasets in Table 4, we used the same pattern, but asked native speakers of each language to translate this pattern into their language. We additionally constructed the verbalizer by mapping labels to a star rating, for example, 0 = 1 star and 4 = 5 stars, again asking native speakers of each language to translate the verbalizer into their language.

A.3 Prompts used in T-Few

The Emotion dataset is the only one that had existing prompts in P3 (Public Pool of Prompts) (Bach et al., 2022). For three other datasets, we had to adapt existing prompts designed for similar datasets on P3, by making minimal required changes to address the differences in data domains or label names:

• Prompts for Enron Spam, a spam e-mail detection dataset, were adapted from sms_spam dataset prompts.

We also added two new prompts for SST5, to make it compatible with the label names of SST5. Following is a list of prompts we created for each dataset:

Amazon Counterfactual Prompts

Input template:
{{ text }} Is the statement factual?
Target template:
{{ answer_choices[label] }}
Answer choices template:
Yes ||| No

Input template:
{{ text }} Does the statement describe a fact?
Target template:
{{ answer_choices[label] }}
Answer choices template:
Yes ||| No

Input template:
{{ text }} Does the sentence express an event that did not happen?
Target template:
{{ answer_choices[label] }}

Input template:
{{ text }} Does this describe an actual event?
Target template:
{{ answer_choices[label] }}

Input template:
{{ text }} Does the sentence contain events that did not or cannot take place?
Target template:
{{ answer_choices[label] }}

Input template:
Is the label for the following sentence non-counterfactual or counterfactual? {{ text }}
Target template:
{{ answer_choices[label] }}