SetFit
Lewis Tunstall¹, Nils Reimers², Unso Eun Seo Jo¹, Luke Bates³, Daniel Korat⁴, Moshe Wasserblat⁴, Oren Pereg⁴

¹Hugging Face  ²cohere.ai  ³Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt  ⁴Emergent AI Lab, Intel Labs

¹[email protected]  ²[email protected]  ³[email protected]  ⁴[email protected]
Abstract

Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings.
3.1 The SetFit approach for few-shot text classification

SetFit uses a two-step training approach in which we first fine-tune an ST and then train a classifier head. In the first step, an ST is fine-tuned on the input data in a contrastive, Siamese manner on sentence pairs. In the second step, a text classification head is trained using the encoded training data generated by the fine-tuned ST from the first step. Figure 2 illustrates this process, and we discuss these two steps in the following sections.
ST fine-tuning  To better handle the limited amount of labeled training data in few-shot scenarios, we adopt a contrastive training approach that is often used for image similarity (Koch et al., 2015). Formally, we are given a small set of K labeled examples D = {(x_i, y_i)}, where x_i and y_i are sentences and their class labels, respectively. For each class label c ∈ C, we generate a set of R positive triplets T_p^c = {(x_i, x_j, 1)}, where x_i and x_j are pairs of randomly chosen sentences from the same class c, such that y_i = y_j = c. Similarly, we also generate a set of R negative triplets T_n^c = {(x_i, x_j, 0)}, where x_i are sentences from class c and x_j are randomly chosen sentences from different classes, such that y_i = c and y_j ≠ c. Finally, the contrastive fine-tuning dataset T is produced by concatenating the positive and negative triplets across all class labels:

T = {(T_p^0, T_n^0), (T_p^1, T_n^1), ..., (T_p^{|C|}, T_n^{|C|})},

where |C| is the number of class labels, |T| = 2R|C| is the number of pairs in T, and R is a hyperparameter. Unless stated otherwise, we used R = 20 in all the evaluations.

This contrastive fine-tuning approach enlarges the size of the training data in few-shot scenarios. Assuming that a small number K of labeled examples is given for a binary classification task, the potential size of the ST fine-tuning set T is derived from the number of unique sentence pairs that can be generated, namely K(K − 1)/2, which is significantly larger than just K.
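As an illustration, a minimal sketch of this pair-generation step is given below. The function name and the exact sampling scheme are our own; the paper only specifies that R positive and R negative pairs are drawn per class.

```python
import random

def generate_contrastive_pairs(sentences, labels, R=20, seed=0):
    """Build the contrastive fine-tuning set T from a small labeled set D.

    Returns triplets (x_i, x_j, 1.0) for same-class pairs and
    (x_i, x_j, 0.0) for different-class pairs, R of each per class,
    so that |T| = 2 * R * |C|.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    by_class = {c: [s for s, y in zip(sentences, labels) if y == c] for c in classes}
    pairs = []
    for c in classes:
        in_class = by_class[c]
        out_of_class = [s for other in classes if other != c for s in by_class[other]]
        for _ in range(R):
            # positive pair: two randomly chosen sentences from the same class c
            x_i, x_j = rng.sample(in_class, 2)
            pairs.append((x_i, x_j, 1.0))
            # negative pair: one sentence from class c, one from a different class
            pairs.append((rng.choice(in_class), rng.choice(out_of_class), 0.0))
    return pairs
```

With, say, 8 labeled examples per class and R = 20, this already yields 40 pairs per class, which is the enlargement effect described above.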
Classification head training  In this second step, the fine-tuned ST encodes the original labeled training data {x_i}, yielding a single sentence embedding per training sample: Emb_{x_i} = ST(x_i), where ST() is the function representing the fine-tuned ST. The embeddings, along with their class labels, constitute the training set for the classification head, T_CH = {(Emb_{x_i}, y_i)}, where |T_CH| = |D|. A logistic regression model is used as the text classification head throughout this work.

Inference  At inference time, the fine-tuned ST encodes an unseen input sentence x_i and produces a sentence embedding. Next, the classification head that was trained in the training step produces the class prediction of the input sentence based on its sentence embedding. Formally, this is x_i^pred = CH(ST(x_i)), where CH represents the classification head prediction function.
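The second step and the inference path can be sketched as follows, assuming `st` is the fine-tuned sentence-transformers model (anything exposing an `encode()` method works) and using scikit-learn's logistic regression as the head, which matches the head used in this work; the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classification_head(st, train_sentences, train_labels):
    # Emb_{x_i} = ST(x_i): one embedding per original labeled sentence
    embeddings = np.asarray(st.encode(train_sentences))
    # T_CH = {(Emb_{x_i}, y_i)}: fit the logistic regression head on the embeddings
    head = LogisticRegression(max_iter=1000)
    head.fit(embeddings, train_labels)
    return head

def predict(st, head, sentences):
    # x_pred = CH(ST(x)): encode unseen sentences, then apply the trained head
    return head.predict(np.asarray(st.encode(sentences)))
```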
4 Experiments

4.1 Data

We conduct experiments on available text classification datasets. We split the datasets into development and test datasets (see Table 6 in Appendix A). The development datasets are utilized for setting SetFit's hyperparameters, such as the number of training pairs |T|, the loss function, and the optimal number of training epochs. In order to test the robustness of SetFit to various types of text, we choose test datasets that represent different text classification tasks with a varying number of classes. All datasets used are available on the Hugging Face Hub under the SetFit organisation.[3] In addition, we evaluate SetFit on the RAFT benchmark (Alex et al., 2021), a real-world few-shot text-classification benchmark composed of 11 practical tasks, where each task has only 50 training examples.

[3] huggingface.co/SetFit
4.2 SetFit models

We evaluate three variations of SetFit, each of which uses a different underlying ST model of a different size (shown in Table 1).

Variation | Underlying ST Model | Size*
SetFit-RoBERTa | all-roberta-large-v1 [4] | 355M
SetFit-MPNet | paraphrase-mpnet-base-v2 [5] | 110M
SetFit-MiniLM | paraphrase-MiniLM-L3-v2 [6] | 15M

Table 1: SetFit model variations using three different underlying ST models. * Number of parameters.

[4] https://huggingface.co/sentence-transformers/all-roberta-large-v1
[5] https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2
[6] https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L3-v2

4.3 Baselines

We compare SetFit's performance against standard transformer fine-tuning and recent best-performing few-shot approaches: ADAPET (Tam et al., 2021), PERFECT (Karimi Mahabadi et al., 2022b), and T-Few (Liu et al., 2022).

Standard fine-tuning  Our first baseline is RoBERTa-Large (Liu et al., 2019), a standard, encoder-only transformer that is fine-tuned for sequence classification. Since we assume no validation sets, we constructed validation splits by randomly selecting equally sized portions from the train split. We perform a hyperparameter search on the number of epochs in the range [25, 75] and pick the best-performing model on a validation split. We use a learning rate of 2e-5 and a batch size of 4 in all our experiments.
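For concreteness, a rough sketch of this baseline with the Hugging Face Trainer is shown below. It is not the authors' exact script: the dataset objects and the epoch count selected by the search are passed in as parameters.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_roberta_baseline(train_dataset, val_dataset, num_labels, num_epochs):
    """Standard sequence-classification fine-tuning of RoBERTa-Large.

    `train_dataset`/`val_dataset` are pre-tokenized datasets and `num_epochs`
    is the value selected by the search over [25, 75] on the validation split.
    """
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=num_labels)
    args = TrainingArguments(
        output_dir="finetune-baseline",
        learning_rate=2e-5,              # fixed in all experiments
        per_device_train_batch_size=4,   # fixed in all experiments
        num_train_epochs=num_epochs,
    )
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=train_dataset, eval_dataset=val_dataset)
    trainer.train()
    return trainer
```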
ADAPET  Pattern exploiting training (PET) (Schick and Schütze, 2021b,a) is a method for improving PLM performance in few-shot setups on downstream tasks by converting textual input into a cloze-style question intended to be reminiscent of the masked language modelling (MLM) objective under which large PLMs such as BERT (Devlin et al., 2019) are trained. To determine SetFit's performance relative to PET-based approaches, we compare our method to ADAPET (Tam et al., 2021), an extension of PET. In recent work (Schick and Schütze, 2021), the authors show that PET-based classification methods excel on the RAFT benchmark, placing second only to much larger models such as T-Few. In our experiments, we used ADAPET with default hyperparameters and examined its performance with different PLM backbones, reporting the PLM which resulted in the best performance, albert-xxlarge-v2 [7] (see Appendix A.2 for further details).

[7] https://huggingface.co/albert-xxlarge-v2

PERFECT  PERFECT (Karimi Mahabadi et al., 2022b) is another cloze-based fine-tuning method, but unlike PET or ADAPET, it does not require handcrafted task prompts and verbalizers. Instead, PERFECT uses task-specific adapters (Houlsby et al., 2019; Pfeiffer et al., 2021) and multi-token label embeddings which are independent from the language model vocabulary during fine-tuning. To run PERFECT on our test datasets, we adapted the configurations provided in the PERFECT codebase.

T-Few  T-Few (Liu et al., 2022) is a PEFT-based few-shot learning method based on T0 (Sanh et al., 2021). The authors provide two versions of T-Few: 11 and 3 billion parameters. Due to compute constraints, we were unable to run the 11-billion-parameter version, which requires an 80GB A100 GPU. Running tests on T-Few as opposed to SetFit posed several hurdles. First, because T-Few's performance varies significantly depending on the input prompts, we run each experiment using 5 random seeds and report the median result, as in the original paper. Second, T-Few relies on dataset-specific prompts, made available on P3 (Public Pool of Prompts) (Bach et al., 2022). Only one of our test datasets had prompts in P3. For the rest of the datasets, we adapt standardized P3 prompts of similar tasks or implement prompts ourselves (see Appendix A.3).

4.4 Experimental Setup

Systematically evaluating few-shot performance can be challenging, because fine-tuning on small datasets may incur instability (Dodge et al., 2020; Zhang et al., 2021). To address this issue, in our experiments we use 10 random training splits for each dataset and sample size. These splits are used as training data across all tested methods. For each method, we report the average measure (depending on the dataset) and the standard deviation across these splits.
We fine-tune SetFit's ST model using cosine-similarity loss with a learning rate of 1e-3, a batch size of 16, and a maximum sequence length of 256 tokens, for 1 epoch.
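A minimal sketch of this fine-tuning step with the sentence-transformers library is shown below, using its pre-3.0 `InputExample`/`model.fit` training API; the pair list is assumed to come from the pair-generation step of Section 3.1, and the helper name is ours.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def finetune_st(pairs, model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
    """Fine-tune an ST on (sentence_a, sentence_b, label) pairs with cosine-similarity loss."""
    model = SentenceTransformer(model_name)
    model.max_seq_length = 256                       # maximum sequence length of 256 tokens
    examples = [InputExample(texts=[a, b], label=float(lbl)) for a, b, lbl in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)],
              epochs=1,
              optimizer_params={"lr": 1e-3})
    return model
```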
5 Results

Table 2 shows a comparison between SetFit-MPNet and the baselines for N = 8 and N = 64 labeled training samples per class. For reference purposes, standard fine-tuning results using the full training data are also shown (in all cases, a higher score indicates stronger performance; see Table 6 in Appendix A for dataset metric details). We find that SetFit-MPNet significantly outperforms the FineTune baseline for N = 8 by an average of 19.3 points. However, as the number of training samples increases to N = 64, the gap decreases to 5.6 points.

Similarly, we find that SetFit-MPNet outperforms PERFECT by 13.6 and 2.6 points. SetFit-MPNet also outperforms ADAPET by 4.0 and 1.5 points for N = 8 and N = 64, respectively. For N = 8, SetFit-MPNet is on par with T-Few 3B, whereas for N = 64 SetFit-MPNet outperforms T-Few 3B by 5 points on average, despite being prompt-free and more than 27 times smaller.

RAFT results  The test datasets listed in Table 2 were not specifically designed for few-shot benchmarking. In order to better benchmark SetFit, we used the RAFT benchmark (Alex et al., 2021), which is specifically designed for benchmarking few-shot methods. Table 3 shows the average accuracy of SetFit-MPNet, SetFit-RoBERTa and four prominent methods. SetFit-RoBERTa outperforms GPT-3 and PET by 8.6 and 1.7 points, respectively, while alleviating the need for hand-crafting prompts. It also surpasses the human baseline in 7 out of 11 tasks. SetFit-RoBERTa falls short of T-Few 11B by 4.5 points. However, SetFit-RoBERTa is more than 30 times smaller than T-Few 11B, does not require manual prompt crafting, and is much more efficient in training and inference (see Table 5).

6 Multilingual Experiments

To determine SetFit's performance in a multilingual, few-shot text classification scenario, we conducted development and test experiments on multilingual datasets and compared SetFit to standard transformer fine-tuning and ADAPET. To the best of our knowledge, this is the first work to examine ADAPET on non-English data (see Appendix A for details).

Experimental Setup  For the multilingual experiments, we use the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020). This dataset consists of Amazon reviews in six languages (English, Japanese, German, French, Spanish, and Chinese), where each review is labeled according to a 5-star rating scale. We chose this corpus for its typological diversity, in order to examine the generalizability of SetFit and other methods across a variety of languages.

For the SetFit underlying model, we use paraphrase-multilingual-mpnet-base-v2,[8] which is a multilingual version of paraphrase-mpnet-base-v2 that is trained on parallel data in over 50 languages.

For the FineTune and ADAPET baselines, we use XLM-RoBERTa-base (Conneau et al., 2019),[9] which has a similar size to the SetFit model. We compare the performance of each method using the same settings as Conneau et al. (2019):

• each: Train and evaluate on monolingual data to measure per-language performance.

• en: Train on the English training data and then evaluate on each language's test set.

• all: Train on all the training data and evaluate on each language's test set.

Method  For SetFit, standard fine-tuning, and ADAPET, we adopt the same methodology and hyperparameters used for the monolingual English experiments in Section 4. We evaluate each method in the few-shot regime (N = 8 samples per class) and compare against the performance of fine-tuning on the full training set of 20,000 examples.

Results  Table 4 shows the results of SetFit, standard fine-tuning, and ADAPET on each language in MARC, where a higher MAE indicates weaker performance. In the few-shot regime of N = 8 samples per class, we find that SetFit significantly outperforms FineTune and ADAPET in all settings (each, en, all), with the best average performance obtained when training on English data only.

[8] huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
[9] huggingface.co/xlm-roberta-base
Method | SST-5 | AmazonCF | CR | Emotion | EnronSpam | AGNews | Average†

|N| = 8*
FineTune | 33.5 (2.1) | 9.2 (4.9) | 58.8 (6.3) | 28.7 (6.8) | 85.0 (6.0) | 81.7 (3.8) | 43.0 (5.2)
PERFECT | 34.9 (3.1) | 18.1 (5.3) | 81.5 (8.6) | 29.8 (5.7) | 79.3 (7.4) | 80.8 (5.0) | 48.7 (6.0)
ADAPET | 50.0 (1.9) | 19.4 (7.3) | 91.0 (1.3) | 46.2 (3.7) | 85.1 (3.7) | 85.1 (2.7) | 58.3 (3.6)
T-Few 3B | 55.0⋆ (1.4) | 19.0 (3.9) | 92.1 (1.0) | 57.4 (1.8) | 93.1 (1.6) | – | 63.4 (1.9)
SetFit-MPNet | 43.6 (3.0) | 40.3 (11.8) | 88.5 (1.9) | 48.8 (4.5) | 90.1 (3.4) | 82.9 (2.8) | 62.3 (4.9)

|N| = 64*
FineTune | 45.9 (6.9) | 52.8 (12.1) | 88.9 (1.9) | 65.0 (17.2) | 95.9 (0.8) | 88.4 (0.9) | 69.7 (7.8)
PERFECT | 49.1 (0.7) | 65.1 (5.2) | 92.2 (0.5) | 61.7 (2.7) | 95.4 (1.1) | 89.0 (0.3) | 72.7 (1.9)
ADAPET | 54.1 (0.8) | 54.1 (6.4) | 92.6 (0.7) | 72.0 (2.2) | 96.0 (0.9) | 88.0 (0.6) | 73.8 (2.2)
T-Few 3B | 56.0 (0.6) | 34.7 (4.5) | 93.1 (1.0) | 70.9 (1.1) | 97.0 (0.3) | – | 70.3 (1.5)
SetFit-MPNet | 51.9 (0.6) | 61.9 (2.9) | 90.4 (0.6) | 76.2 (1.3) | 96.1 (0.8) | 88.0 (0.7) | 75.3 (1.3)

|N| = Full**
FineTune | 59.8 | 80.1 | 92.4 | 92.6 | 99.0 | 93.8 | 84.8

Table 2: SetFit performance scores and standard deviations (in parentheses) compared to the baselines across 6 test datasets for three training set sizes |N|. * Number of training samples per class. ** Entire available training data used. † The AGNews dataset is excluded from the average score to enable fair comparison with T-Few (which has AGNews in its training set). ⋆ The inputs of SST-5 (but not its labels) appeared in T-Few's training set, as part of the Rotten Tomatoes dataset.
Table 4: Average performance (MAE × 100) on the Multilingual Amazon Reviews Corpus for two training set
sizes |N |. ∗ No. of training samples per class. ∗∗ Entire available training data used (20,000 samples).
Figure 3: Average accuracy as a function of the unlabeled training set size N of the SetFit student and the baseline student on the AG News, Emotion, and SST5 datasets.
sentence embeddings for each pair and to calculate the cosine-similarity between them. The underlying ST of the SetFit student is trained to mimic the teacher's ST output by minimizing the error between the SetFit teacher-produced cosine-similarity and its own output. The classification head of the student is then trained using the embeddings produced by the student's ST and the logits produced by the SetFit teacher's classification head. The baseline student is trained to mimic the teacher output by minimizing the error between the logits produced by the SetFit teacher's classification head and its own output.
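A rough sketch of both distillation steps is given below, assuming sentence-transformers models for the STs and scikit-learn heads. How the unlabeled sentences are paired is an assumption here, and the student head is fit on the teacher head's hard predictions rather than its raw logits, which simplifies the setup described above.

```python
import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses, util
from sklearn.linear_model import LogisticRegression

def distill_student_st(teacher_st, student_st, unlabeled_pairs, epochs=1):
    """Train the student ST to reproduce the teacher ST's cosine-similarities on unlabeled pairs."""
    emb_a = teacher_st.encode([a for a, _ in unlabeled_pairs], convert_to_tensor=True)
    emb_b = teacher_st.encode([b for _, b in unlabeled_pairs], convert_to_tensor=True)
    targets = util.cos_sim(emb_a, emb_b).diagonal()      # teacher cosine-similarity per pair
    examples = [InputExample(texts=[a, b], label=float(t))
                for (a, b), t in zip(unlabeled_pairs, targets)]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    student_st.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(student_st))],
                   epochs=epochs)
    return student_st

def distill_student_head(teacher_st, teacher_head, student_st, unlabeled_sentences):
    """Fit the student head on student embeddings against the teacher head's predictions."""
    teacher_preds = teacher_head.predict(np.asarray(teacher_st.encode(unlabeled_sentences)))
    student_head = LogisticRegression(max_iter=1000)
    student_head.fit(np.asarray(student_st.encode(unlabeled_sentences)), teacher_preds)
    return student_head
```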
Results  Figure 3 shows a comparison between the SetFit student model and the baseline student model for various amounts of unlabeled training data (N). The SetFit student significantly outperforms the baseline student when only small amounts of unlabeled data are available. For example, for N = 8, the SetFit student outperforms the baseline student by 24.8, 25.1, and 8.9 average accuracy points on the AGNews, Emotion, and SST5 datasets, respectively. As N increases, the performance gains decrease and are on par for N = 1K.

7.2 Computational costs

Comparing the relative computational costs of SetFit versus PET and PEFT methods isn't straightforward, since each method typically has different hardware and memory requirements. To simplify the comparison, we follow the approach adopted by Liu et al. (2022) and use FLOPs-per-token estimates to compare SetFit to T-Few.
These estimates can be obtained from Kaplan et al. (2020), who show that encoder-only models with N parameters have approximately 2N FLOPs-per-token for inference and 6N FLOPs-per-token for training. The resulting cost for inference and training is then given by:

C_inf = 2N · ℓ_seq,
C_train = 6N · ℓ_seq · n_steps · n_batch,

where ℓ_seq is the input sequence length, n_steps is the number of training steps, and n_batch is the batch size. For encoder-decoder models like T-Few, these estimates are halved, since the model only processes each token with either the encoder or the decoder.

Method | Inf. FLOPs | Train FLOPs | Speed-up | Score
T-Few 3B | 1.6e11 | 3.9e15 | 1x | 63.4 (1.9)
SetFit-MPNet | 8.3e9 | 2.0e14 | 19x | 62.3 (4.9)
SetFit-MiniLM† | 1.3e9 | 3.2e13 | 123x | 60.3 (1.6)

Table 5: Relative computational cost and average scores of SetFit and T-Few using |N| = 8 on the test datasets listed in Table 2. † Trained in the distillation setup described in Section 7.1, using |N| = 8 for teacher training and the rest of the available training data as unlabeled student training data. For fixed n_steps and n_batch, the relative speed-up (N′ · ℓ′_seq)/(2N · ℓ_seq) is the same for inference and training.
For the inference and training estimates shown in Table 5, we use ℓ_seq = 38 and ℓ_seq = 54 as the input sequence lengths for SetFit-MPNet and T-Few, respectively; these are the median numbers of tokens across all the test datasets in Table 2. We also use n_steps = 1000 and n_batch = 8 for all training estimates. As shown in the table, the SetFit-MPNet model is approximately an order of magnitude faster at inference and training than T-Few, despite having comparable performance on the test datasets of Table 2. SetFit-MiniLM is two orders of magnitude faster than T-Few, with an average score reduction of 3.1 accuracy points. Moreover, the storage cost of the SetFit models (70MB and 420MB, respectively) is 163 to 26 times smaller than the T0-3B checkpoint used by T-Few 3B (11.4GB), making these models much better suited for real-world deployment.
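As a quick arithmetic check (ours, not from the paper), the T-Few 3B and SetFit-MPNet rows of Table 5 follow directly from the formulas above with these settings; the SetFit-MiniLM row is obtained the same way with its parameter count.

```python
n_steps, n_batch = 1000, 8

def flops(n_params, seq_len, encoder_decoder=False):
    # 2N FLOPs per token for inference, 6N for training; halved for encoder-decoder models
    factor = 0.5 if encoder_decoder else 1.0
    inference = factor * 2 * n_params * seq_len
    training = factor * 6 * n_params * seq_len * n_steps * n_batch
    return inference, training

tfew_inf, tfew_train = flops(3e9, 54, encoder_decoder=True)   # ~1.6e11, ~3.9e15
setfit_inf, setfit_train = flops(110e6, 38)                   # ~8.4e9,  ~2.0e14
print(f"speed-up of SetFit-MPNet over T-Few 3B: {tfew_inf / setfit_inf:.0f}x")  # ~19x
```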
These estimates are borne out by comparing the time needed to train each method to convergence on N = 8 examples. For our datasets, SetFit-MPNet takes approximately 30 seconds to train on a p3.2xlarge AWS instance (16GB GPU memory), at a cost of $0.025 per split. On the other hand, T-Few 3B requires at least 40GB of GPU memory, and training on a p4d.24xlarge AWS instance takes approximately 700 seconds, at a cost of $0.7 per split.

8 Conclusion

This paper introduces SetFit, a new few-shot text classification approach. We show that SetFit has several advantages over comparable approaches such as T-Few, ADAPET and PERFECT. In particular, SetFit is much faster at inference and training; SetFit requires much smaller base models to be performant, not requiring external compute; and SetFit is additionally not subject to the instability and inconvenience of prompting. We have also demonstrated that SetFit is a robust few-shot text classifier in languages other than English across varying typologies. Finally, SetFit has proven useful in few-shot distillation setups.

Acknowledgements

The authors thank Hugging Face Inc. and Intel Inc. for providing computing resources, and the German Federal Ministry of Education and Research and the Hessian Ministry of Science and the Arts (HMWK) within the projects "The Third Wave of Artificial Intelligence - 3AI", hessian.AI, and within their joint support of the National Research Center for Applied Cybersecurity ATHENE.

References

Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, and Andreas Stuhlmüller. 2021. RAFT: A real-world few-shot text classification benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-Jian Jiang, and Alexander M. Rush. 2022. PromptSource: An integrated development environment and repository for natural language prompts.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020b. Language models are few-shot learners. CoRR, abs/2005.14165.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine Learning (ICML'06), pages 377–384. ACM Press.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. NIPS 2014 Deep Learning Workshop.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. CoRR, abs/1902.00751.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. Association for Computing Machinery.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR, abs/2001.08361.

Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Lambert Mathias, Marzieh Saeidi, Veselin Stoyanov, and Majid Yazdani. 2022a. Prompt-free and efficient few-shot learning with language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3638–3652, Dublin, Ireland. Association for Computational Linguistics.

Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Lambert Mathias, Marzieh Saeidi, Veselin Stoyanov, and Majid Yazdani. 2022b. Prompt-free and efficient few-shot learning with language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3638–3652, Dublin, Ireland. Association for Computational Linguistics.

Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. 2020. The multilingual Amazon reviews corpus.

Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. I wish I would have loved this one, but I didn't - A multilingual dataset for counterfactual detection in product reviews. CoRR, abs/2104.06893.

Christian S. Perone, Roberto Pereira Silveira, and Thomas S. Paula. 2018. Evaluation of sentence embeddings in downstream and linguistic probing tasks. CoRR, abs/1806.06259.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, Online. Association for Computational Linguistics.

Guangyuan Piao. 2021. Scholarly text classification with sentence BERT and entity embeddings. In Trends and Applications in Knowledge Discovery and Data Mining, pages 79–87, Cham. Springer International Publishing.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. True few-shot learning with prompts - A real-world perspective. CoRR, abs/2111.13440.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Derek Tam, Rakesh R. Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4980–4991, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

V. Metsis, I. Androutsopoulos, and G. Paliouras. 2006. Spam filtering with naive Bayes - which naive Bayes? In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006).

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015a. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015b. Character-level convolutional networks for text classification. In NIPS.

A Appendix

A.1 Datasets

Table 6 shows the development and test datasets that are used for setting SetFit's hyperparameters. Following is a description of the datasets used:

SST2  The Stanford Sentiment Treebank 2 is a collection of single-sentence movie reviews with positive-negative sentiment class labels (Socher et al., 2013).

IMDB  The Internet Movie Database dataset is a collection of single-sentence movie reviews with positive-negative sentiment class labels (Maas et al., 2011).

BBC News  The BBC News dataset is a collection of articles from the news outlet BBC with one of 5 topic classifications: Politics, Sports, Entertainment, Tech, and Business (Greene and Cunningham, 2006).

Enron Spam  The Enron spam email dataset consists of emails from the internal Enron correspondence channel, where emails are classified as spam or not spam (V. Metsis and Paliouras, 2006).

Student Question Categories [11]  This is a set of questions from university entrance exams in India that are classified into 4 subjects: Math, Biology, Chemistry, and Physics.

TREC-QC  The Text Retrieval Conference Question Answering dataset.

Toxic Conversations [12]  The Toxic Conversations dataset is a set of comments from Civil Comments, a platform for reader comments for independent news outlets. Human raters have given them toxicity attributes.

Amazon Polarity [13]  The Amazon Polarity dataset consists of customer reviews from Amazon taken over 18 years with binary sentiment labels. Examples are labelled either positive ("Great Read") or negative ("The Worst!") (Zhang et al., 2015a).

Following is a description of the test datasets:

Stanford Sentiment Treebank-5 (SST5)  The SST-5 dataset is the fine-grained version of the Stanford Sentiment Treebank, where each example is given one of five labels: very positive, positive, neutral, negative, very negative.

Amazon Counterfactual  The Amazon Counterfactual dataset is a set of Amazon customer reviews with professionally labeled binary labels for counterfactual detection. Counterfactual statements are statements that denote something that did not happen or cannot happen (e.g. "They are much bigger than I thought they would be."). We used the English subset for our experiments (O'Neill et al., 2021).

Customer Reviews  The Customer Reviews (Hu and Liu, 2004) dataset is part of the SentEval (Conneau and Kiela, 2018) benchmark. It is composed of positive and negative opinions mined from the web and written by customers about a variety of products.

Emotion [14]  The Emotion dataset consists of tweets from Twitter that display clear emotions (e.g. "i am now nearly finished [with] the week detox and i feel amazing"). Labels are one of six categories: anger, fear, joy, love, sadness, and surprise (Saravia et al., 2018).

AG News  AG News is a dataset of news titles from AG News with one of 4 classifications: World, Entertainment, Sports, and Business (Zhang et al., 2015b).

[11] www.kaggle.com/datasets/mrutyunjaybiswal/iitjee-neet-aims-students-questions-data
[12] https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data
[13] hf.co/datasets/amazon_polarity
[14] hf.co/datasets/emotion
Dataset Name | Type of Task | Cls.* | Label Dist.** | Metric | Split
SST5 | Sentiment | 5 | Approx. equal | Accuracy | Test
Amazon Counterfactual | Counterfactual | 2 | 10% counterfactual | MCC | Test
CR | Sentiment | 2 | Equal | Accuracy | Test
Emotion | Emotion | 6 | Equal | Accuracy | Test
Enron Spam | Unwanted Language | 2 | Equal | Accuracy | Test
AG News | Topic | 4 | Equal | Accuracy | Test
SST2 | Sentiment | 2 | Equal | Accuracy | Dev
IMDB | Sentiment | 2 | Equal | Accuracy | Dev
BBC News | Topic | 5 | Equal | Accuracy | Dev
Student Question Categories | Topic | 4 | Approx. equal | Accuracy | Dev
TREC-QC | Topic | 50 | N/A | Accuracy | Dev
Toxic Conversations | Unwanted Language | 2 | 8% Toxic | Avg. Precision | Dev
Amazon Polarity | Sentiment | 2 | Equal | Accuracy | Dev

Table 6: English datasets used for development and test experiments. * No. of classes per dataset. ** Distribution of the examples across classes.
A.2 ADAPET Training Procedure

By default, ADAPET assumes access to a training, development, and test dataset. It trains for 1,000 batches, runs predictions on the development data every 250 batches and checkpoints, keeping the model state which performed best on the development dataset. In our case, where we assume few-shot training and no development data, we ran ADAPET for 1,000 batches and disabled the checkpointing, using the model state that resulted after training for 1,000 batches. For the English data in Table 2, we used the pattern "[TEXT1] this is [LBL]", where "[TEXT1]" and "[LBL]" are placeholders for a given piece of text and the corresponding label, respectively. We constructed the verbalizer from the "label" and "label text" columns that are available in all of our datasets. For the multilingual datasets in Table 4, we used the same pattern, but asked native speakers of each language to translate this pattern into their language. We additionally constructed the verbalizer by mapping labels to a star rating, for example, 0 = 1 star and 4 = 5 stars, again asking native speakers of each language to translate the verbalizer into their language.

A.3 Prompts used in T-Few

The Emotion dataset is the only one that had existing prompts in P3 (Public Pool of Prompts) (Bach et al., 2022). For three other datasets, we had to adapt existing prompts designed for similar datasets on P3, by making minimal required changes to address the differences in data domains or label names:

• Prompts for Enron Spam, a spam e-mail detection dataset, were adapted from sms_spam dataset prompts.

We also added two new prompts for SST5, to make it compatible with the label names of SST5. Following is a list of prompts we created for each dataset:

Amazon Counterfactual Prompts

Input template:
{{ text }} Is the statement factual?
Target template:
{{ answer_choices[label] }}
Answer choices template:
Yes ||| No

Input template:
{{ text }} Does the statement describe a fact?
Target template:
{{ answer_choices[label] }}
Answer choices template:
Yes ||| No

Input template:
{{ text }} Does the sentence express an event that did not happen?
Target template:
{{ answer_choices[label] }}

Input template:
{{ text }} Does this describe an actual event?
Target template:
{{ answer_choices[label] }}

Input template:
{{ text }} Does the sentence contain events that did not or cannot take place?
Target template:
{{ answer_choices[label] }}

Input template:
Is the label for the following sentence non-counterfactual or counterfactual? {{ text }}
Target template:
{{ answer_choices[label] }}