AdapterFusion: Non-Destructive Task Composition for Transfer Learning

Abstract

... knowledge from multiple tasks. First, in the knowledge extraction stage we learn task-specific parameters called adapters, that encapsulate the task-specific information. ...

[Figure: AdapterFusion inside a transformer layer, stacked ×N: Multi-Head Attention, Add & Norm, the adapter feed-forward projections (FF Up), and an AdapterFusion block with Query, Key, Value and SoftMax.]

The multi-task learning (MTL) objective can be written as

    Θ_{0→{1,...,N}} ← argmin_Θ  Σ_{n=1}^{N} L_n(D_n; Θ_0),

where Θ_{0→{1,...,N}} indicates that we start with Θ_0 and fine-tune on a set of tasks {1, ..., N}.

However, MTL requires simultaneous access to all tasks, making it difficult to add more tasks on the fly. As the different tasks have varying sizes as well as loss functions, effectively combining them during training is very challenging and requires heuristic approaches, as proposed in Stickland and Murray (2019).

2.2 Adapters

While the predominant methodology for transfer learning is to fine-tune all weights of the pretrained model, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach with applications in domain transfer (Rücklé et al., 2020b), machine translation (Bapna and Firat, 2019; Philip et al., 2020), transfer learning (Stickland and Murray, 2019; Wang et al., 2020; Lauscher et al., 2020), and cross-lingual transfer (Pfeiffer et al., 2020c,d; Üstün et al., 2020; Vidoni et al., 2020). Adapters share a large set of parameters Θ across all tasks while introducing a small number of task-specific parameters Φ. For common adapter architectures, Φ contains considerably fewer parameters than Θ, e.g., only 3.6% of the parameters of the pretrained model in Houlsby et al. (2019).

2.2.2 Multi-Task Adapters (MT-A)

Stickland and Murray (2019) propose to train adapters for N tasks in parallel with a multi-task objective. The underlying parameters Θ_0 are fine-tuned along with the task-specific parameters in Φ_n. The training objective can be defined as

    Θ ← argmin_{Θ,Φ}  Σ_{n=1}^{N} L_n(D_n; Θ_0, Φ_n),

where Θ = Θ_{0→{1,...,N}}, Φ_1, ..., Φ_N.
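To make the MT-A objective above concrete, the following is a minimal, self-contained PyTorch sketch of one joint update over three toy tasks. The modules, data, and shapes are stand-ins for illustration only, not the setup of Stickland and Murray (2019).

    import torch
    import torch.nn as nn

    # Toy setup: a shared module (standing in for Θ_0) plus one small task-specific
    # head (standing in for Φ_n) per task. This is only a schematic illustration of
    # the joint MT-A objective, not the paper's actual training code.
    shared = nn.Linear(32, 32)
    adapters = {f"task{n}": nn.Linear(32, 2) for n in range(3)}
    params = list(shared.parameters()) + [p for a in adapters.values() for p in a.parameters()]
    optimizer = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One joint update: the losses L_n(D_n; Θ_0, Φ_n) of all tasks are summed, so every
    # task must be available at the same time (the MTL drawback noted above).
    optimizer.zero_grad()
    total_loss = 0.0
    for name, adapter in adapters.items():
        x = torch.randn(8, 32)                 # a batch from D_n (random stand-in)
        y = torch.randint(0, 2, (8,))
        total_loss = total_loss + loss_fn(adapter(shared(x)), y)
    total_loss.backward()
    optimizer.step()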
2.2.3 Adapters in Practice

Introducing new adapter parameters in different layers of an otherwise fixed pretrained model has been shown to perform on par with, or only slightly below, full model fine-tuning (Houlsby et al., 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020a).

For NLP tasks, adapters have been introduced for the transformer architecture (Vaswani et al., 2017). At each transformer layer l, a set of adapter parameters Φ_l is introduced. The placement and architecture of adapter parameters Φ within a pretrained model is non-trivial. Houlsby et al. (2019) experiment with different architectures, finding that a two-layer feed-forward neural network with a bottleneck works well. They place two of these components within each transformer layer, one after the multi-head attention block and one after the feed-forward block.
Table 1: Mean and standard deviation results (development sets) for each of the 16 datasets and the different architectural setups. The datasets are ordered by their respective training dataset size. Dashed horizontal lines separate dataset sizes {> 40k, > 10k, > 5k}, respectively. Each model is initialized with BERT-base (Devlin et al., 2019) weights. Head indicates training only a classification head on top of fixed BERT weights. For Full training we fine-tune all weights of BERT. Single-Task Adapters (ST-A) are adapters trained independently for each task, using the architecture illustrated in Figure 5. Multi-Task Adapters (MT-A) shows results of jointly trained adapters using the default settings of Stickland and Murray (2019). Fusion w/ ST-A and Fusion w/ MT-A show the results of AdapterFusion using the respective pretrained adapters. ST-A^Houlsby shows the results of ST-A with the architecture proposed by Houlsby et al. (2019). Reported results are accuracy scores.
Figure 3: Relative performance difference of the two adapter architectures and the AdapterFusion models over
fully fine-tuned BERT. Fusion improves over its corresponding adapters (ST-A and MT-A) for most tasks.
                 Fusion w/ ST-A      Fusion w/ MT-A
compared to      ST-A     MT-A       ST-A     MT-A
MNLI             →        ↑          ↓        ↑
QQP              →        ↑          →        ↑
SST              ↑        →          ↑        ↑
Winogrande       ↓        ↑          ↓        ↑
IMDB             ↑        ↑          ↓        →
HellaSwag        →        ↑          ↓        ↑
SocialIQA        ↑        ↑          →        ↑
CosmosQA         ↑        ↓          ↑        ↑
SciTail          →        ↑          ↑        →
Argument         →        ↑          ↓        ↑
CSQA             ↑        ↑          ↓        ↑
BoolQ            ↑        ↓          ↑        ↑
MRPC             ↑        ↑          ↓        ↑
SICK             ↑        ↓          ↑        ↑
RTE              ↑        ↓          ↑        ↑
CB               ↑        ↑          ↑        ↑
Improved         10/16    11/16      7/16     14/16

Table 2: Performance changes of AdapterFusion compared to ST-A and MT-A. Arrows indicate whether there has been an improvement ↑ (> 0.3), a decrease ↓ (< −0.3), or whether the results have stayed the same → [−0.3, 0.3].

We observe particularly large performance gains for datasets with less than 5k training instances. For example, Fusion with ST-A achieves substantial improvements of 6.5% for RTE and 5.64% for MRPC. In addition, we also see performance gains for moderately sized datasets such as the commonsense tasks CosmosQA and CSQA. Fusion with MT-A achieves smaller improvements, as the model already includes a shared set of parameters. However, we do see performance gains for SICK, SocialIQA, Winogrande and MRPC. On average, we observe improvements of 1.27% and 1.25% when using Fusion with ST-A and MT-A, respectively.

Mitigating catastrophic interference. In order to identify whether our approach is able to mitigate problems faced by multi-task learning, we present the performance differences of adapters and AdapterFusion compared to the fully fine-tuned model in Figure 3. In Table 2, we compare AdapterFusion to ST-A and MT-A. The arrows indicate whether there is an improvement ↑, a decrease ↓, or whether the results remain the same →. We compare the performance of both Fusion with ST-A and Fusion with MT-A to ST-A and MT-A. We summarize our four most important findings below.

(1) In the case of Fusion with ST-A, for 15/16 tasks the performance remains the same or improves compared to the task's pretrained adapter. For 10/16 tasks we see performance gains. This shows that having access to adapters from other tasks is beneficial and in the majority of cases leads to better results on the target task. (2) We find that for 11/16 tasks, Fusion with ST-A improves the performance compared to MT-A. This demonstrates the ability of Fusion with ST-A to share information between tasks while avoiding the interference that multi-task training suffers from. (3) For only 7/16 tasks do we see an improvement of Fusion with MT-A over ST-A. Training of MT-A in the first stage of our algorithm suffers from all the problems of multi-task learning and results in less effective adapters than our ST-A on average. Fusion helps bridge some of this gap but is not able to mitigate the entire performance drop. (4) In the case of AdapterFusion with MT-A, we see that the performance on all 16 tasks improves or stays the same. This demonstrates that AdapterFusion can successfully combine the specific adapter weights, even if the adapters were trained in a multi-task setting, confirming that our method is versatile.

Summary. Our findings demonstrate that Fusion with ST-A is the most promising approach to sharing information across tasks. Our approach allows us to train adapters in parallel and requires no heuristic sampling strategies to deal with imbalanced datasets. It also allows researchers to easily add more tasks as they become available, without requiring complete model retraining.
[Figure 4 heatmap panels for Layer 1, Layer 7, Layer 9, and Layer 12; rows and columns are the 16 task adapters, color scale 0.15–0.60.]
Figure 4: AdapterFusion activations of pretrained ST-Adapters. Rows indicate the target task m, columns indicate adapters n. We assume that the softmax activation for Φ_{n,l} is high if the information of adapter n is useful for task m. For our analysis, we calculate the softmax activation for each adapter Φ_{n,l}, where n ∈ {1, ..., N}, and average over all activations within the same layer l, calculated over all instances in the development set.
While Fusion with MT-A does provide gains over simply using MT-A, the effort required to train these in a multi-task setting followed by the Fusion step is not warranted by the limited gains in performance. On the other hand, we find that Fusion with ST-A is an efficient and versatile approach to transfer learning.

6 Analysis of Fusion Activation

We analyze the weighting patterns that are learned by AdapterFusion to better understand which tasks impact the model predictions, and whether there exist differences across BERT layers.

We plot the results for layers 1, 7, 9, and 12 and ST-A in Figure 4 (see Appendix Figure 6 for the remaining layers). We find that tasks which do not benefit from AdapterFusion tend to more strongly activate their own adapter at every layer (e.g., Argument, HellaSwag, MNLI, QQP, SciTail). This confirms that AdapterFusion only extracts information from adapters if they are beneficial for the target task m. We further find that MNLI is a useful intermediate task that benefits a large number of target tasks, e.g., BoolQ, SICK, CSQA, SST-2, CB, MRPC, RTE, which is in line with previous work (Phang et al., 2018; Conneau and Kiela, 2018; Reimers and Gurevych, 2019). Similarly, QQP is utilized by a large number of tasks, e.g., SICK, IMDB, RTE, CB, MRPC, SST-2. Most importantly, tasks with small datasets such as CB, RTE, and MRPC often strongly rely on adapters trained on large datasets such as MNLI and QQP.

Interestingly, we find that the activations in layer 12 are considerably more distributed across multiple tasks than those of adapters in earlier layers. The potential reason for this is that the last adapters are not encapsulated between frozen pretrained layers and can thus be considered an extension of the prediction head. The representations of the adapters in the 12th layer might thus not be as comparable, resulting in more distributed activations. This is in line with Pfeiffer et al. (2020d), who are able to improve zero-shot cross-lingual performance considerably by dropping the adapters in the last layer.
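To make the quantity visualized in Figure 4 concrete, the sketch below implements an AdapterFusion-style composition layer (the transformer hidden state acts as the query; the N adapter outputs act as keys and values) and averages its softmax weights per layer. It is an illustration based on the Query/Key/Value and SoftMax components of the architecture figure, not the exact implementation used in our experiments (details such as value initialization and residual connections are omitted).

    import torch
    import torch.nn as nn

    class AdapterFusionLayer(nn.Module):
        """Minimal sketch of an AdapterFusion composition layer: the layer's hidden state
        is the query, the N adapter outputs are keys and values; the softmax over the
        key scores gives the per-adapter activations analysed in Figure 4."""

        def __init__(self, hidden_size: int = 768):
            super().__init__()
            self.query = nn.Linear(hidden_size, hidden_size)
            self.key = nn.Linear(hidden_size, hidden_size)
            self.value = nn.Linear(hidden_size, hidden_size)

        def forward(self, hidden_states, adapter_outputs):
            # hidden_states: (batch, seq, hidden); adapter_outputs: (batch, seq, N, hidden)
            q = self.query(hidden_states).unsqueeze(2)      # (batch, seq, 1, hidden)
            k = self.key(adapter_outputs)                   # (batch, seq, N, hidden)
            v = self.value(adapter_outputs)                 # (batch, seq, N, hidden)
            scores = (q * k).sum(-1)                        # (batch, seq, N)
            weights = torch.softmax(scores, dim=-1)         # per-adapter activations
            fused = (weights.unsqueeze(-1) * v).sum(2)      # (batch, seq, hidden)
            return fused, weights

    # Average activations per adapter for one layer, as in the Figure 4 analysis.
    x = torch.randn(4, 16, 768)
    adapter_outputs = torch.randn(4, 16, 8, 768)   # outputs of N = 8 pretrained adapters
    fusion = AdapterFusionLayer()
    fused, weights = fusion(x, adapter_outputs)
    print(weights.mean(dim=(0, 1)))                 # mean activation per adapter in this layer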
inference efficiency of adapters and AdapterFu-
and Gurevych, 2019). Similarly, QQP is utilized
sion. For AdapterFusion, they find that adding
by a large number of tasks, e.g. SICK, IMDB, RTE,
more tasks to the set of adapters results in a linear
CB, MRPC, SST-2. Most importantly, tasks with
increase of computational cost, both for training
small datasets such as CB, RTE, and MRPC often
and inference. They further propose approaches to
strongly rely on adapters trained on large datasets
mitigate this overhead.
such as MNLI and QQP.
Interestingly, we find that the activations in layer 8 Conclusion and Outlook
12 are considerably more distributed across multi-
ple tasks than adapters in earlier layers. The poten- 8.1 Conclusion
tial reason for this is that the last adapters are not We propose a novel approach to transfer learning
encapsulated between frozen pretrained layers, and called AdapterFusion which provides a simple and
can thus be considered as an extension of the pre- effective way to combine information from several
By separating the extraction of knowledge from its composition, we are able to effectively avoid the common pitfalls of multi-task learning, such as catastrophic forgetting and interference between tasks. Further, AdapterFusion mitigates the problem of traditional multi-task learning in which complete re-training is required when new tasks are added to the pool of datasets.

We have shown that AdapterFusion is compatible with adapters trained in both single-task as well as multi-task setups. AdapterFusion consistently outperforms fully fine-tuned models on the target task, demonstrating the value of having access to information from other tasks. While we observe gains using both ST-A as well as MT-A, we find that composing ST-A using AdapterFusion is the more efficient strategy, as adapters can be trained in parallel and re-used.

Finally, we analyze the weighting patterns of individual adapters in AdapterFusion, which reveal that tasks with small datasets more often rely on information from tasks with large datasets, thereby achieving the largest performance gains in our experiments. We show that AdapterFusion is able to identify and select adapters that contain knowledge relevant to the task of interest, while ignoring the remaining ones. This provides an implicit no-op option and makes AdapterFusion a suitable and versatile transfer learning approach for any NLU setting.

8.2 Outlook

Rücklé et al. (2020a) have studied pruning a large portion of adapters after Fusion training. Their results show that removing the less activated adapters results in almost no performance drop at inference time while considerably improving the inference speed. They also provide some initial evidence that it is possible to train Fusion with a subset of the available adapters in each minibatch, potentially enabling us to scale our approach to large adapter sets, which would otherwise be computationally infeasible. We believe that such extensions are a promising direction for future work.

Pfeiffer et al. (2020d) have achieved considerable improvements in zero-shot cross-lingual transfer performance by dropping the adapters in the last layer. In preliminary results, we have observed similar trends with AdapterFusion when the adapters in the last layer are not used. We will investigate this further in future work.

Acknowledgments

Jonas is supported by the LOEWE initiative (Hesse, Germany) within the emergenCITY center. Aishwarya was supported in part by a DeepMind PhD Fellowship during the time in which this project was carried out. Andreas is supported by the German Research Foundation within the project "Open Argument Mining" (GU 798/25-1), associated with the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999). This work was partly supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Research (Improving Deep Learning using Latent Structure). Kyunghyun was a part-time research scientist at Facebook AI Research during the time in which this project was carried out.

We thank Sebastian Ruder, Max Glockner, Jason Phang, Alex Wang, Katrina Evtimova and Sam Bowman for insightful feedback and suggestions on drafts of this paper.

References

Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In EMNLP-IJCNLP 2019, pages 1538–1548.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT 2019, pages 2924–2936.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML 2008, pages 160–167.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In LREC 2018.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP 2005.

Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.

Demi Guo, Alexander M. Rush, and Yoon Kim. 2020. Parameter-efficient transfer learning with diff pruning. arXiv preprint.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In ICML 2019, pages 2790–2799.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL 2018, pages 328–339.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In EMNLP-IJCNLP 2019, pages 2391–2401.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI 2018, pages 5189–5197.

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. arXiv preprint.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Hector J. Levesque. 2011. The Winograd schema challenge. In AAAI Spring Symposium 2011.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In IJCAI 2016, pages 2873–2879.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In ACL 2017, pages 1–10.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In ACL 2019, pages 4487–4496.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL 2011, pages 142–150.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC 2014, pages 216–223.

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.

Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification - revisiting neural networks. In ECML PKDD 2014, pages 437–452.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In RepL4NLP@ACL 2019, pages 7–14.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In EMNLP 2020: System Demonstrations, pages 46–54.

Jonas Pfeiffer, Edwin Simpson, and Iryna Gurevych. 2020b. Low resource multi-task sequence tagging - revisiting dynamic conditional random fields. arXiv preprint.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020c. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In EMNLP 2020, pages 7654–7673.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020d. UNKs everywhere: Adapting multilingual language models to new scripts. arXiv preprint.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint.

Jerin Philip, Alexandre Berard, Matthias Gallé, and Laurent Besacier. 2020. Monolingual adapters for zero-shot neural machine translation. In EMNLP 2020, pages 4465–4470.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In ACL 2020, pages 5231–5247.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In NeurIPS 2017, pages 506–516.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP 2019, pages 3980–3990.

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2020a. AdapterDrop: On the efficiency of adapters in transformers. arXiv preprint.

Andreas Rücklé, Jonas Pfeiffer, and Iryna Gurevych. 2020b. MultiCQA: Zero-shot transfer of self-supervised text matching models on a massive scale. In EMNLP 2020, pages 2471–2486.
Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint.

Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. In AAAI 2019, pages 4822–4829.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI 2020, pages 8732–8740.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI 2019, pages 6949–6956.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP 2019, pages 4462–4472.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP 2013, pages 1631–1642.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI 2017, pages 4444–4451.

Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018. Cross-topic argument mining from heterogeneous sources. In EMNLP 2018, pages 3664–3674.

Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In ICML 2019, pages 5986–5995.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL-HLT 2019, pages 4149–4158.

Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264. IGI Global.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language adaptation for truly Universal Dependency parsing. In EMNLP 2020, pages 2302–2315.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 2017, pages 5998–6008.

M. Vidoni, Ivan Vulić, and Goran Glavaš. 2020. Orthogonal language and task adapters in zero-shot cross-lingual transfer. arXiv preprint.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP 2018, pages 353–355.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT 2018, pages 1112–1122.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP 2018, pages 93–104.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL 2019, pages 4791–4800.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint.

... SST (Socher et al., 2013) consists of short movie reviews from Rotten Tomatoes.

Natural Language Inference (NLI) The goal is to classify whether two sentences entail, contradict, or are neutral to each other. For this we conduct experiments on MultiNLI (Williams et al., 2018), a multi-genre dataset; SciTail (Khot et al., 2018), an NLI dataset on scientific text; SICK (Marelli et al., 2014), an NLI dataset with relatedness scores; the composition of Recognizing Textual Entailment (RTE) datasets provided by Wang et al. (2018); as well as the Commitment Bank (CB) (De Marneffe et al., 2019), a three-class textual entailment dataset.

Sentence Relatedness We include two semantic relatedness datasets which capture whether or not two text samples include similar content. The Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) consists of sentence pairs which capture a paraphrase/semantic equivalence relationship. Quora Question Pairs (QQP) targets duplicate question detection.
Figure 6: AdapterFusion activations in the 12 BERT-base layers. Target tasks are presented in rows, whereas the set of adapters are displayed in columns. Black squares indicate that an adapter has not been activated.
[Figure 7 panels (x-axis: reduction factor; y-axis: accuracy on SST-2): (a) adapter position within the layer (BERT fully trained, bottom adapter only, top adapter only, both adapters); (b) position of the pretrained LayerNorm (before, after, before & after, none); (c) position of the newly trained LayerNorm (before, after, before & after, none).]

Figure 7: Results of the grid search on the SST-2 dataset over the architecture settings illustrated on the left of Figure 5. As we go from (a) to (c), the best-performing setting is used for the further search over other hyperparameters. We find that the best-performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.
Figure 8: Results of the grid search on the Argument dataset over the architecture settings illustrated on the left of Figure 5 (panels as in Figure 7). As we go from (a) to (c), the best-performing setting is used for the further search over other hyperparameters. We find that the best-performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.
Figure 9: Results of the grid search on the CSQA dataset over the architecture settings illustrated on the left of Figure 5 (panels as in Figure 7). As we go from (a) to (c), the best-performing setting is used for the further search over other hyperparameters. We find that the best-performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.
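The outcome of these grid searches can be summarized in a small configuration sketch; the dictionary and its keys below are purely illustrative and do not correspond to a real library interface.

    # Illustrative summary of the best ST-A setting found in Figures 7-9 (hypothetical
    # config keys, not an actual API): a single adapter after the feed-forward sub-layer,
    # the pretrained LayerNorms kept before and after it, and no newly trained LayerNorm.
    best_st_a_architecture = {
        "adapter_position": "top_only",            # only after the feed-forward block
        "pretrained_layernorm": "before_and_after",
        "new_layernorm": "none",
    }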
Dataset        ST-A_2          ST-A_16         ST-A_64
MultiNLI 84.60 84.32 84.08
QQP 90.57 90.59 89.73
SST 92.66 ±0.32 91.85 ±0.41 92.01 ±0.33
Winogrande 62.11 ±0.09 61.09 ±0.11 59.70 ±0.06
IMDB 94.20 ±0.28 93.85 ±0.07 93.90 ±0.14
HellaSwag 39.45 ±0.20 38.11 ±0.14 38.28 ±0.37
SocialIQA 60.95 ±0.15 62.41 ±0.11 62.23 ±0.73
CosmosQA 59.32 ±0.24 60.01 ±0.02 60.65 ±0.34
SciTail 94.44 ±0.81 93.90 ±0.16 93.82 ±0.49
Argument 76.83 ±0.21 77.65 ±0.34 77.64 ±0.56
CSQA 57.83 ±0.23 58.91 ±0.57 58.88 ±0.40
BoolQ 77.14 ±1.10 75.66 ±1.25 76.07 ±0.54
MRPC 86.13 ±1.59 85.16 ±0.52 85.58 ±0.32
SICK 87.50 ±0.14 86.20 ±0.00 85.70 ±0.42
RTE 70.68 ±4.57 71.04 ±1.62 69.16 ±1.59
CB 87.85 ±2.94 86.07 ±3.87 84.28 ±4.79
Table 3: Mean and standard deviation results (development sets) for each of the 16 datasets and reduction factors {2, 16, 64} for ST-A. Each model is initialized with BERT-base (Devlin et al., 2019) weights. The datasets are ordered by their respective training dataset size. Dashed horizontal lines separate dataset sizes {> 40k, > 10k, > 5k}, respectively.
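As a quick sanity check on what the reduction factors in Table 3 imply, assuming the common convention that the reduction factor is the hidden size divided by the adapter bottleneck dimension (768 for BERT-base):

    # Bottleneck widths implied by the reduction factors in Table 3, assuming
    # reduction_factor = hidden_size / bottleneck_size (BERT-base hidden size: 768).
    hidden_size = 768
    for reduction_factor in (2, 16, 64):
        print(f"reduction factor {reduction_factor:2d} -> bottleneck {hidden_size // reduction_factor}")
    # reduction factor  2 -> bottleneck 384
    # reduction factor 16 -> bottleneck 48
    # reduction factor 64 -> bottleneck 12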
Table 4: Mean and standard deviation results of models initialized with RoBERTa-base (Liu et al., 2019b) weights. Performances are measured on the development sets of the 16 datasets for the different architectural setups. The datasets are ordered by their respective training dataset size. Dashed horizontal lines separate dataset sizes {> 40k, > 10k, > 5k}, respectively. Head indicates training only a classification head on top of fixed RoBERTa weights. For Full training we fine-tune all weights of RoBERTa. Single-Task Adapters (ST-A) are adapters trained independently for each task, using the architecture illustrated in Figure 5; the indices {2, 16, 64} indicate the reduction factor. Fusion w/ ST-A shows the results of AdapterFusion using the respective pretrained adapters. ST-A^Houlsby_16 shows the results of ST-A with the architecture proposed by Houlsby et al. (2019).