
AdapterFusion:
Non-Destructive Task Composition for Transfer Learning

Jonas Pfeiffer¹, Aishwarya Kamath², Andreas Rücklé¹, Kyunghyun Cho²,³, Iryna Gurevych¹
¹Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt
²New York University   ³CIFAR Associate Fellow
[email protected]

Abstract

Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks. First, in the knowledge extraction stage we learn task-specific parameters called adapters, which encapsulate the task-specific information. We then combine the adapters in a separate knowledge composition step. We show that by separating the two stages, i.e., knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. We empirically evaluate AdapterFusion on 16 diverse NLU tasks, and find that it effectively combines various types of knowledge at different layers of the model. We show that our approach outperforms traditional strategies such as full fine-tuning as well as multi-task learning. Our code and adapters are available at AdapterHub.ml.

Figure 1: AdapterFusion architecture inside a transformer (Vaswani et al., 2017). The AdapterFusion component takes as input the representations of multiple adapters trained on different tasks and learns a parameterized mixer of the encoded information.

1 Introduction

The most commonly used method for solving NLU tasks is to leverage pretrained models, with the dominant architecture being a transformer (Vaswani et al., 2017), typically trained with a language modelling objective (Devlin et al., 2019; Radford et al., 2018; Liu et al., 2019b). Transfer to a task of interest is achieved by fine-tuning all the weights of the pretrained model on that single task, often yielding state-of-the-art results (Zhang and Yang, 2017; Ruder, 2017; Howard and Ruder, 2018; Peters et al., 2019). However, each task of interest requires all the parameters of the network to be fine-tuned, which results in a specialized model for each task.

There are two approaches for sharing information across multiple tasks. The first consists of starting from the pretrained language model and sequentially fine-tuning on each of the tasks one by one (Phang et al., 2018). However, as we subsequently fine-tune the model weights on new tasks, the problem of catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999) can arise, which results in loss of knowledge already learned from all previous tasks. This, together with the non-trivial decision of the order of tasks in which to fine-tune the model, hinders the effective transfer of knowledge. Multi-task learning (Caruana, 1997; Zhang and Yang, 2017; Liu et al., 2019a) is another approach for sharing information across multiple tasks. This involves fine-tuning the weights of a pretrained language model using a weighted sum of the objective functions of each target task simultaneously. Using this approach, the network captures the common structure underlying all the target tasks.
However, multi-task learning requires simultaneous access to all tasks during training. Adding new tasks thus requires complete joint retraining. Further, it is difficult to balance multiple tasks and train a model that solves each task equally well. As has been shown in Lee et al. (2017), these models often overfit on low resource tasks and underfit on high resource tasks. This makes it difficult to effectively transfer knowledge across tasks with all the tasks being solved equally well (Pfeiffer et al., 2020b), thus considerably limiting the applicability of multi-task learning in many scenarios.

Recently, adapters (Rebuffi et al., 2017; Houlsby et al., 2019) have emerged as an alternative training strategy. Adapters do not require fine-tuning of all parameters of the pretrained model, and instead introduce a small number of task-specific parameters — while keeping the underlying pretrained language model fixed. Thus, we can separately and simultaneously train adapters for multiple tasks, which all share the same underlying pretrained parameters. However, to date, there exists no method for using multiple adapters to maximize the transfer of knowledge across tasks without suffering from the same problems as sequential fine-tuning and multi-task learning. For instance, Stickland and Murray (2019) propose a multi-task approach for training adapters, which still suffers from the difficulty of balancing the various target tasks and requires simultaneous access to all target tasks.

In this paper we address these limitations and propose a new variant of adapters called AdapterFusion. We further propose a novel two-stage learning algorithm that allows us to effectively share knowledge across multiple tasks while avoiding the issues of catastrophic forgetting and balancing of different tasks. Our AdapterFusion architecture, illustrated in Figure 1, has two components. The first component is an adapter trained on a task without changing the weights of the underlying language model. The second component — our novel Fusion layer — combines the representations from several such task adapters in order to improve the performance on the target task.

Contributions. Our main contributions are: (1) We introduce a novel two-stage transfer learning strategy, termed AdapterFusion, which combines the knowledge from multiple source tasks to perform better on a target task. (2) We empirically evaluate our proposed approach on a set of 16 diverse NLU tasks such as sentiment analysis, commonsense reasoning, paraphrase detection, and recognizing textual entailment. (3) We compare our approach with Stickland and Murray (2019), where adapters are trained for all tasks in a multi-task manner, finding that AdapterFusion is able to improve this method, even though the model has simultaneous access to all tasks during pretraining. (4) We show that our proposed approach outperforms fully fine-tuning the transformer model on a single target task. Our approach additionally outperforms adapter-based models trained both in a Single-Task as well as a Multi-Task setup.

The code of this work is integrated into AdapterHub.ml (Pfeiffer et al., 2020a).

2 Background

In this section, we formalize our goal of transfer learning (Pan and Yang, 2010; Torrey and Shavlik, 2010; Ruder, 2019), highlight its key challenges, and provide a brief overview of common methods that can be used to address them. This is followed by an introduction to adapters (Rebuffi et al., 2017) and a brief formalism of the two approaches to training adapters.

Task Definition. We are given a model that is pretrained on a task with training data D_0 and a loss function L_0. The weights Θ_0 of this model are learned as follows:

    D_0 := Large corpus of unlabelled text
    L_0 := Masked language modelling loss
    Θ_0 ← argmin_Θ L_0(D_0; Θ)

In the remainder of this paper, we refer to this pretrained model by the tuple (D_0, L_0).

We define C as the set of N classification tasks having labelled data of varying sizes and different loss functions:

    C = {(D_1, L_1), ..., (D_N, L_N)}

The aim is to be able to leverage a set of N tasks to improve on a target task m with C_m = (D_m, L_m). In this work we focus on the setting where m ∈ {1, ..., N}.

Desiderata. We wish to learn a parameterization Θ_m that is defined as follows:

    Θ_m ← argmin_Θ' L_m(D_m; Θ')

where Θ' is expected to have encapsulated relevant information from all the N tasks.
The target model for task m is initialized with Θ', for which we learn the optimal parameters Θ_m through minimizing the task's loss on its training data.

2.1 Current Approaches to Transfer Learning

There are two predominant approaches to achieve sharing of information from one task to another.

2.1.1 Sequential Fine-Tuning

This involves sequentially updating all the weights of the model on each task. For a set of N tasks, the order of fine-tuning is defined and at each step the model is initialized with the parameters learned through the previous step. However, this approach does not perform well beyond two sequential tasks (Phang et al., 2018; Pruksachatkun et al., 2020) due to catastrophic forgetting.

2.1.2 Multi-Task Learning (MTL)

All tasks are trained simultaneously with the aim of learning a shared representation that will enable the model to generalize better on each task (Caruana, 1997; Collobert and Weston, 2008; Nam et al., 2014; Liu et al., 2016, 2017; Zhang and Yang, 2017; Ruder, 2017; Ruder et al., 2019; Sanh et al., 2019; Pfeiffer et al., 2020b, inter alia).

    Θ_{0→{1,...,N}} ← argmin_Θ ( Σ_{n=1}^{N} L_n(D_n; Θ_0) )

where Θ_{0→{1,...,N}} indicates that we start with Θ_0 and fine-tune on the set of tasks {1, ..., N}.

However, MTL requires simultaneous access to all tasks, making it difficult to add more tasks on the fly. As the different tasks have varying sizes as well as loss functions, effectively combining them during training is very challenging and requires heuristic approaches, as proposed in Stickland and Murray (2019).

2.2 Adapters

While the predominant methodology for transfer learning is to fine-tune all weights of the pretrained model, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach with applications in domain transfer (Rücklé et al., 2020b), machine translation (Bapna and Firat, 2019; Philip et al., 2020), transfer learning (Stickland and Murray, 2019; Wang et al., 2020; Lauscher et al., 2020), and cross-lingual transfer (Pfeiffer et al., 2020c,d; Üstün et al., 2020; Vidoni et al., 2020). Adapters share a large set of parameters Θ across all tasks and introduce a small number of task-specific parameters Φ_n. While Θ represents the weights of a pretrained model (e.g., a transformer), the parameters Φ_n, where n ∈ {1, ..., N}, are used to encode task-specific representations in intermediate layers of the shared model. Current work on adapters focuses either on training adapters for each task separately (Houlsby et al., 2019; Bapna and Firat, 2019; Pfeiffer et al., 2020a) or training them in a multi-task setting to leverage shared representations (Stickland and Murray, 2019). We discuss both variants below.

2.2.1 Single-Task Adapters (ST-A)

For each of the N tasks, the model is initialized with parameters Θ_0. In addition, a set of new and randomly initialized adapter parameters Φ_n is introduced.

The parameters Θ_0 are fixed and only the parameters Φ_n are trained. This makes it possible to efficiently parallelize the training of adapters for all N tasks, and store the corresponding knowledge in designated parts of the model. The objective for each task n ∈ {1, ..., N} is of the form:

    Φ_n ← argmin_Φ L_n(D_n; Θ_0, Φ)

For common adapter architectures, Φ contains considerably fewer parameters than Θ, e.g., only 3.6% of the parameters of the pretrained model in Houlsby et al. (2019).

2.2.2 Multi-Task Adapters (MT-A)

Stickland and Murray (2019) propose to train adapters for N tasks in parallel with a multi-task objective. The underlying parameters Θ_0 are fine-tuned along with the task-specific parameters in Φ_n. The training objective can be defined as:

    Θ ← argmin_{Θ,Φ} ( Σ_{n=1}^{N} L_n(D_n; Θ_0, Φ_n) )

where Θ = Θ_{0→{1,...,N}}, Φ_1, ..., Φ_N.
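As a minimal sketch of the ST-A setup above (illustrative PyTorch, not the AdapterHub implementation; the adapter architecture actually used in the experiments is described in Appendix A.2):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Two-layer feed-forward bottleneck with a residual connection (cf. §2.2.3):
    down-project, non-linearity, up-project, then add the input back."""
    def __init__(self, hidden_size: int = 768, reduction_factor: int = 16):
        super().__init__()
        bottleneck = hidden_size // reduction_factor   # e.g. 768 // 16 = 48
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# ST-A training for one task n: the pretrained weights Θ0 are frozen,
# only the adapter parameters Φn (and a task head) receive gradients.
def st_a_optimizer(base_model: nn.Module, adapter: nn.Module, lr: float = 1e-4):
    for p in base_model.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(adapter.parameters(), lr=lr)
```

Because Θ_0 never receives gradients, adapters for the N tasks can be trained independently and in parallel against the same frozen model.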
2.2.3 Adapters in Practice

Introducing new adapter parameters in different layers of an otherwise fixed pretrained model has been shown to perform on-par with, or only slightly below, full model fine-tuning (Houlsby et al., 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020a).

For NLP tasks, adapters have been introduced for the transformer architecture (Vaswani et al., 2017). At each transformer layer l, a set of adapter parameters Φ_l is introduced. The placement and architecture of adapter parameters Φ within a pretrained model is non-trivial. Houlsby et al. (2019) experiment with different architectures, finding that a two-layer feed-forward neural network with a bottleneck works well. They place two of these components within one layer, one after the multi-head attention (further referred to as bottom) and one after the feed-forward layers of the transformer (further referred to as top).¹ Bapna and Firat (2019) and Stickland and Murray (2019) only introduce one of these components at the top position; however, Bapna and Firat (2019) include an additional layer norm (Ba et al., 2016).

Adapters trained in either single-task (ST-A) or multi-task (MT-A) setups have learned the idiosyncratic knowledge of the respective tasks' training data, encapsulated in their designated parameters. This results in a compression of information, which requires less space to store task-specific knowledge. However, the distinct weights of adapters prevent a downstream task from being able to use multiple sources of extracted information. In the next section we describe our two-stage algorithm which tackles the sharing of information stored in adapters trained on different tasks.

¹ We illustrate these placements in Appendix Figure 5 (left).

3 AdapterFusion

Adapters avoid catastrophic forgetting by introducing task-specific parameters; however, current adapter approaches do not allow sharing of information between tasks. To mitigate this we propose AdapterFusion.

3.1 Learning algorithm

In the first stage of our learning algorithm, we train either ST-A or MT-A for each of the N tasks.

In the second stage, we then combine the set of N adapters by using AdapterFusion. While fixing both the parameters Θ as well as all adapters Φ, we introduce parameters Ψ that learn to combine the N task adapters to solve the target task:

    Ψ_m ← argmin_Ψ L_m(D_m; Θ, Φ_1, ..., Φ_N, Ψ)

Ψ_m are the newly learned AdapterFusion parameters for task m. Θ refers to Θ_0 in the ST-A setting or Θ_{0→{1,...,N,m}} in the MT-A setup. In our experiments we focus on the setting where m ∈ {1, ..., N}, which means that the training dataset of m is used twice: once for training the adapters Φ_m and again for training the Fusion parameters Ψ_m, which learn to compose the information stored in the N task adapters.

By separating the two stages — knowledge extraction in the adapters, and knowledge composition with AdapterFusion — we address the issues of catastrophic forgetting, interference between tasks, and training instabilities.

Figure 2: Our AdapterFusion architecture. This includes learnable weights Query, Key, and Value. Query takes as input the output of the pretrained transformer weights. Both Key and Value take as input the output of the respective adapters. The dot product of the query with all the keys is passed into a softmax function, which learns to weight the adapters with respect to the context.
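A corresponding sketch of the second stage of the algorithm in §3.1: with Θ and all task adapters Φ_1, ..., Φ_N frozen, only the Fusion parameters Ψ_m are optimized. The module containers and the choice of AdamW are illustrative assumptions; §4.1 only specifies the Fusion learning rate of 5e−5.

```python
import torch
import torch.nn as nn

def setup_fusion_training(base_model: nn.Module,
                          task_adapters: nn.ModuleList,
                          fusion_layers: nn.ModuleList,
                          lr: float = 5e-5) -> torch.optim.Optimizer:
    """Stage 2 (knowledge composition): freeze Θ and Φ_1..Φ_N, train only Ψ_m."""
    for p in base_model.parameters():
        p.requires_grad = False        # pretrained weights Θ stay fixed
    for p in task_adapters.parameters():
        p.requires_grad = False        # all task adapters Φ_n stay fixed
    for p in fusion_layers.parameters():
        p.requires_grad = True         # only the Fusion parameters Ψ_m are learned
    return torch.optim.AdamW(fusion_layers.parameters(), lr=lr)
```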
3.2 Components

AdapterFusion learns to compose the N task adapters Φ_n and the shared pretrained model Θ by introducing a new set of weights Ψ. These parameters learn to combine the adapters as a dynamic function of the target task data.

As illustrated in Figure 2, we define the AdapterFusion parameters Ψ to consist of Key, Value and Query matrices at each layer l, denoted by K_l, V_l and Q_l respectively. At each layer l of the transformer and each time-step t, the output of the feed-forward sub-layer of layer l is taken as the query vector. The output of each adapter, z_{l,t}, is used as input to both the value and key transformations. Similar to attention (Bahdanau et al., 2015; Vaswani et al., 2017), we learn a contextual activation of each adapter n using

    s_{l,t} = softmax(h_{l,t}^⊤ Q_l ⊗ z_{l,t,n}^⊤ K_l),  n ∈ {1, ..., N}
    z'_{l,t,n} = z_{l,t,n}^⊤ V_l,  n ∈ {1, ..., N}
    Z'_{l,t} = [z'_{l,t,0}, ..., z'_{l,t,N}]
    o_{l,t} = s_{l,t}^⊤ Z'_{l,t}

where ⊗ represents the dot product and [·, ·] indicates the concatenation of vectors.

Given the context, AdapterFusion learns a parameterized mixer of the available trained adapters. It learns to identify and activate the most useful adapter for a given input.
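The equations above translate directly into a per-layer module; the following is a minimal PyTorch sketch (tensor shapes and module names are assumptions, and the residual/Add & Norm wiring around the component is omitted). The identity-plus-noise initialization of the value matrix follows the description in §4.1.

```python
import torch
import torch.nn as nn

class AdapterFusionLayer(nn.Module):
    """Sketch of the Fusion component at one transformer layer.
    h: output of the feed-forward sub-layer, shape [batch, seq, hidden]
    z: stacked adapter outputs, shape [batch, seq, n_adapters, hidden]"""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size, bias=False)  # Q_l
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)    # K_l
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)  # V_l
        # §4.1: V is initialized as an identity plus small random noise,
        # so the fused output initially resembles the plain adapter outputs.
        with torch.no_grad():
            self.value.weight.copy_(
                torch.eye(hidden_size) + 1e-6 * torch.randn(hidden_size, hidden_size)
            )

    def forward(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        q = self.query(h)                                   # [b, s, d]
        k = self.key(z)                                     # [b, s, n, d]
        v = self.value(z)                                   # [b, s, n, d]
        scores = torch.einsum("bsd,bsnd->bsn", q, k)        # dot product per adapter
        attn = torch.softmax(scores, dim=-1)                # s_{l,t}
        return torch.einsum("bsn,bsnd->bsd", attn, v)       # o_{l,t}
```

At initialization the value transformation is close to the identity, so the fused output starts out as a softly weighted mixture of the (nearly unchanged) adapter outputs.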
4 Experiments

In this section we evaluate how effective AdapterFusion is in overcoming the issues faced by other transfer learning methods. We provide a brief description of the 16 diverse datasets that we use for our study, each of which uses accuracy as the scoring metric.

4.1 Experimental Setup

In order to investigate our model's ability to overcome catastrophic forgetting, we compare Fusion using ST-A to only the ST-A for the task. We also compare Fusion using ST-A to MT-A for the task to test whether our two-stage procedure alleviates the problems of interference between tasks. Finally, our experiments comparing MT-A with and without Fusion let us investigate the versatility of our approach. Gains in this setting would show that AdapterFusion is useful even when the base adapters have already been trained jointly.

In all experiments, we use BERT-base-uncased (Devlin et al., 2019) as the pretrained language model. We train ST-A, described in Appendix A.2 and illustrated in Figure 5, for all datasets described in §4.2. We train them with reduction factors² {2, 16, 64} and learning rate 0.0001 with AdamW and a linear learning rate decay. We train for a maximum of 30 epochs with early stopping. We follow the setup used in Stickland and Murray (2019) for training the MT-A. We use the default hyperparameters³ and train a MT-A model on all datasets simultaneously.

For AdapterFusion, we empirically find that a learning rate of 5e−5 works well, and use this in all experiments.⁴ We train for a maximum of 10 epochs with early stopping. While we initialize Q and K randomly, we initialize V with a diagonal of ones and the rest of the matrix with random weights having a small norm (1e−6). Multiplying the adapter output with this value matrix V initially adds small amounts of noise, but retains the overall representation. We continue to regularize the Value matrix using the l2-norm to avoid introducing additional capacity.

² A reduction factor indicates the factor by which the hidden size is reduced, such that the bottleneck size for BERT-base with factor 64 is reduced to 12 (768/64 = 12).
³ We additionally test out batch sizes 16 and 32.
⁴ We have experimented with learning rates {6e−6, 5e−5, 1e−4, 2e−4}.
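The setup of §4.1 can be summarized as a configuration sketch; the dictionary layout is illustrative, the values are the ones reported above.

```python
# Hyperparameters reported in §4.1 (the structure of this dict is illustrative).
BERT_HIDDEN_SIZE = 768

ST_ADAPTER_CONFIG = {
    "reduction_factors": [2, 16, 64],
    # bottleneck dimension per reduction factor, e.g. 768 // 64 = 12
    "bottleneck_sizes": {r: BERT_HIDDEN_SIZE // r for r in [2, 16, 64]},  # {2: 384, 16: 48, 64: 12}
    "learning_rate": 1e-4,
    "optimizer": "AdamW",
    "lr_schedule": "linear decay",
    "max_epochs": 30,
    "early_stopping": True,
}

FUSION_CONFIG = {
    "learning_rate": 5e-5,
    "max_epochs": 10,
    "early_stopping": True,
    "value_init": "identity plus random noise with norm 1e-6",
    "value_regularization": "l2",
}
```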
4.2 Tasks and Datasets

We briefly summarize the different types of tasks that we include in our experiments, and reference the related datasets accordingly. Detailed descriptions can be found in Appendix A.1.

Commonsense reasoning is used to gauge whether the model can perform basic reasoning skills: HellaSwag (Zellers et al., 2018, 2019), Winogrande (Sakaguchi et al., 2020), CosmosQA (Huang et al., 2019), CSQA (Talmor et al., 2019), SocialIQA (Sap et al., 2019). Sentiment analysis predicts whether a given text has a positive or negative sentiment: IMDb (Maas et al., 2011), SST (Socher et al., 2013). Natural language inference predicts whether one sentence entails, contradicts, or is neutral to another: MNLI (Williams et al., 2018), SciTail (Khot et al., 2018), SICK (Marelli et al., 2014), RTE (as combined by Wang et al. (2018)), CB (De Marneffe et al., 2019). Sentence relatedness captures whether two sentences include similar content: MRPC (Dolan and Brockett, 2005), QQP.⁵ We also use the argument mining dataset Argument (Stab et al., 2018) and the reading comprehension dataset BoolQ (Clark et al., 2019).

⁵ data.quora.com/First-Quora-Dataset-Release-Question-Pairs

5 Results

We present results for all 16 datasets in Table 1. For reference, we also include the adapter architecture of Houlsby et al. (2019), ST-A (Houlsby), which has twice as many parameters compared to ST-A. To provide a fair comparison to Stickland and Murray (2019), we primarily experiment with BERT-base-uncased. We additionally validate our best model configurations — ST-A and Fusion with ST-A — with RoBERTa-base, for which we present our results in Appendix Table 4.
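As a rough back-of-the-envelope for the parameter counts mentioned above (a sketch that counts only the bottleneck projections and biases of §2.2.3 and ignores layer norms, so the numbers are approximate): placing two adapter components per layer, as in Houlsby et al. (2019), roughly doubles the adapter parameters of a single-component ST-A.

```python
def adapter_params(hidden: int = 768, reduction: int = 16, modules_per_layer: int = 1) -> int:
    """Approximate bottleneck-adapter parameters per transformer layer
    (down- and up-projection weights and biases only)."""
    bottleneck = hidden // reduction
    per_module = hidden * bottleneck + bottleneck + bottleneck * hidden + hidden
    return modules_per_layer * per_module

print(adapter_params(modules_per_layer=1))  # 74544  — one adapter per layer (ST-A)
print(adapter_params(modules_per_layer=2))  # 149088 — two per layer, as in Houlsby et al. (2019)
```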
Dataset Head Full ST-A MT-A F. w/ ST-A F. w/ MT-A ST-A (Houlsby)
MNLI 54.59 84.10 84.32 82.49 ±0.49 84.28 83.05 84.13
QQP 76.79 90.87 90.59 89.47 ±0.60 90.71 90.58 90.63
SST 85.17 ±0.45 92.39 ±0.22 91.85 ±0.41 92.27 ±0.71 92.20 ±0.18 93.00 ±0.20 92.75 ±0.37
WGrande 51.92 ±0.35 60.01 ±0.08 61.09 ±0.11 57.70 ±1.40 60.23 ±0.31 59.32 ±0.30 59.32 ±1.33
IMDB 85.05 ±0.22 94.05 ±0.21 93.85 ±0.07 92.56 ±0.54 93.82 ±0.39 92.66 ±0.32 93.96 ±0.22
HSwag 34.17 ±0.27 39.25 ±0.76 38.11 ±0.14 36.47 ±0.98 37.98 ±0.01 37.36 ±0.10 38.65 ±0.25
SocIQA 50.33 ±2.50 62.05 ±0.04 62.41 ±0.11 61.21 ±0.89 63.16 ±0.24 62.56 ±0.10 62.73 ±0.53
CosQA 50.06 ±0.51 60.28 ±0.40 60.01 ±0.02 61.25 ±0.90 60.65 ±0.55 62.78 ±0.07 61.37 ±0.35
SciTail 85.30 ±2.44 94.32 ±0.11 93.90 ±0.16 94.53 ±0.43 94.04 ±0.23 94.79 ±0.17 94.07 ±0.39
Argument 70.61 ±0.59 76.87 ±0.32 77.65 ±0.34 75.70 ±0.60 77.65 ±0.21 76.08 ±0.27 77.44 ±0.62
CSQA 41.09 ±0.27 58.88 ±0.40 58.91 ±0.57 53.30 ±2.19 59.73 ±0.54 56.73 ±0.14 60.05 ±0.36
BoolQ 63.07 ±1.27 74.84 ±0.24 75.66 ±1.25 78.76 ±0.76 76.25 ±0.19 79.18 ±0.45 76.02 ±1.13
MRPC 71.91 ±0.13 85.14 ±0.45 85.16 ±0.52 81.86 ±0.99 90.29 ±0.84 84.68 ±0.32 86.66 ±0.81
SICK 76.30 ±0.71 87.30 ±0.42 86.20 ±0.00 88.61 ±1.06 87.28 ±0.99 90.43 ±0.30 86.12 ±0.54
RTE 61.37 ±1.17 65.41 ±0.90 71.04 ±1.62 77.61 ±3.21 76.82 ±1.68 79.96 ±0.76 69.67 ±1.96
CB 68.93 ±4.82 82.49 ±2.33 86.07 ±3.87 89.09 ±1.15 92.14 ±0.97 89.81 ±0.99 87.50 ±4.72

Mean 64.17 75.51 76.05 75.80 77.33 77.06 76.32

Table 1: Mean and standard deviation results (development sets) for each of the 16 datasets and the different
architectural setups. The datasets are ordered by their respective training dataset size. Dashed horizontal lines
separate datasizes {> 40k, > 10k, > 5k}, respectively. Each model is initialized with BERT-base (Devlin et al.,
2019) weights. Head indicates training only a classification head on top of fixed BERT weights. For Full training
we fine-tune all weights of BERT. Single-Task Adapters (ST-A) is the training of independently trained adapters
for each task, using the architecture illustrated in Figure 5. Multi-Task Adapters (MT-A) shows results of jointly
trained adapters using the default settings of Stickland and Murray (2019). Fusion w/ ST-A and Fusion w/ MT-A
show the results of AdapterFusion using the respective pre-trained Adapters. ST-A (Houlsby) shows the results of
ST-Adapters with the architecture proposed by Houlsby et al. (2019). Reported results are accuracy scores.

5.1 Adapters

Training only a prediction head on the output of a pretrained model can also be considered an adapter. This procedure, commonly referred to as training only the Head, performs considerably worse than fine-tuning all weights (Howard and Ruder, 2018; Peters et al., 2019). We show that only fine-tuning the Head, compared to Full fine-tuning, causes an average drop of 10 points in accuracy. This demonstrates the need for more complex adaptation approaches.

In Table 1 we show the results for MT-A and ST-A with a reduction factor of 16 (see Appendix Table 3 for more results), which we find offers a good trade-off between the number of newly introduced parameters and the task performance. Interestingly, the ST-A have a regularization effect on some datasets, resulting in better performance on average for certain tasks, even though a much smaller proportion of weights is trained. On average, we improve 0.66% by training ST-A instead of the Full model.

For MT-A we find that there are considerable performance drops of more than 2% for CSQA and MRPC, despite the heuristic strategies for sampling from the different datasets (Stickland and Murray, 2019). This indicates that these heuristics only partially address common problems of multi-task learning such as catastrophic interference. It also shows that learning a shared representation jointly does not guarantee the best results for all tasks. On average, however, we do see a performance increase of 0.4% using MT-A over Full fine-tuning on each task separately, which demonstrates that there are advantages in leveraging information from other tasks with multi-task learning.

5.2 AdapterFusion

AdapterFusion aims to improve performance on a given target task m by transferring task-specific knowledge from the set of all N task adapters, where m ∈ {1, . . . , N}. We hypothesize that if there exists at least one task that supports the target task, AdapterFusion should lead to performance gains. If no such task exists, then the performance should remain the same.

Dependence on the size of training data. In Table 1 we notice that having access to relevant tasks considerably improves the performance for the target task when using AdapterFusion. While datasets with more than 40k training instances perform well without Fusion, smaller datasets with fewer training instances benefit more from our approach.
Figure 3: Relative performance difference of the two adapter architectures and the AdapterFusion models over fully fine-tuned BERT. Fusion improves over its corresponding adapters (ST-A and MT-A) for most tasks.

We observe particularly large performance gains for datasets with less than 5k training instances. For example, Fusion with ST-A achieves substantial improvements of 6.5% for RTE and 5.64% for MRPC. In addition, we also see performance gains for moderately sized datasets such as the commonsense tasks CosmosQA and CSQA. Fusion with MT-A achieves smaller improvements, as the model already includes a shared set of parameters. However, we do see performance gains for SICK, SocialIQA, Winogrande and MRPC. On average, we observe improvements of 1.27% and 1.25% when using Fusion with ST-A and MT-A, respectively.

                Fus. w/ ST-A       Fus. w/ MT-A
compared to     ST-A    MT-A       ST-A    MT-A
MNLI            →       ↗          ↘       ↗
QQP             →       ↗          →       ↗
SST             ↗       →          ↗       ↗
Winogrande      ↘       ↗          ↘       ↗
IMDB            ↗       ↗          ↘       →
HellaSwag       →       ↗          ↘       ↗
SocialIQA       ↗       ↗          →       ↗
CosmosQA        ↗       ↘          ↗       ↗
SciTail         →       ↗          ↗       →
Argument        →       ↗          ↘       ↗
CSQA            ↗       ↗          ↘       ↗
BoolQ           ↗       ↘          ↗       ↗
MRPC            ↗       ↗          ↘       ↗
SICK            ↗       ↘          ↗       ↗
RTE             ↗       ↘          ↗       ↗
CB              ↗       ↗          ↗       ↗
Improved        10/16   11/16      7/16    14/16
Table 2: Performance changes of AdapterFusion compared to ST-A and MT-A. Arrows indicate whether there has been an improvement ↗ (> 0.3), decrease ↘ (< −0.3), or whether the results have stayed the same → [−0.3, 0.3].

Mitigating catastrophic interference. In order to identify whether our approach is able to mitigate problems faced by multi-task learning, we present the performance differences of adapters and AdapterFusion compared to the fully fine-tuned model in Figure 3. In Table 2, we compare AdapterFusion to ST-A and MT-A. The arrows indicate whether there is an improvement ↗, a decrease ↘, or whether the results remain the same →. We compare the performance of both Fusion with ST-A and Fusion with MT-A to ST-A and MT-A. We summarize our four most important findings below.

(1) In the case of Fusion with ST-A, for 15/16 tasks the performance remains the same or improves compared to the task's pretrained adapter. For 10/16 tasks we see performance gains. This shows that having access to adapters from other tasks is beneficial and in the majority of cases leads to better results on the target task. (2) We find that for 11/16 tasks, Fusion with ST-A improves the performance compared to MT-A. This demonstrates the ability of Fusion with ST-A to share information between tasks while avoiding the interference that multi-task training suffers from. (3) For only 7/16 tasks do we see an improvement of Fusion with MT-A over the ST-A. Training of MT-A in the first stage of our algorithm suffers from all the problems of multi-task learning and results in less effective adapters than our ST-A on average. Fusion helps bridge some of this gap but is not able to mitigate the entire performance drop. (4) In the case of AdapterFusion with MT-A, we see that the performance on all 16 tasks improves or stays the same. This demonstrates that AdapterFusion can successfully combine the specific adapter weights, even if the adapters were trained in a multi-task setting, confirming that our method is versatile.

Summary. Our findings demonstrate that Fusion with ST-A is the most promising approach to sharing information across tasks. Our approach allows us to train adapters in parallel and it requires no heuristic sampling strategies to deal with imbalanced datasets. It also allows researchers to easily add more tasks as they become available, without requiring complete model retraining. While Fusion with MT-A does provide gains over simply using MT-A, the effort required to train these adapters in a multi-task setting followed by the Fusion step is not warranted by the limited gains in performance. On the other hand, we find that Fusion with ST-A is an efficient and versatile approach to transfer learning.
Figure 4: AdapterFusion activations of pretrained ST-Adapters. Rows indicate the target task m, columns indicate adapters n. We assume that the softmax activation for Φ_{n,l} is high if the information of adapter n is useful for task m. For our analysis, we calculate the softmax activation for each adapter Φ_{n,l}, where n ∈ {1, . . . , N}, and average over all activations within the same layer l, calculated over all instances in the development set.

6 Analysis of Fusion Activation

We analyze the weighting patterns that are learned by AdapterFusion to better understand which tasks impact the model predictions, and whether there exist differences across BERT layers.

We plot the results for layers 1, 7, 9, and 12 and ST-A in Figure 4 (see Appendix Figure 6 for the remaining layers). We find that tasks which do not benefit from AdapterFusion tend to more strongly activate their own adapter at every layer (e.g. Argument, HellaSwag, MNLI, QQP, SciTail). This confirms that AdapterFusion only extracts information from adapters if they are beneficial for the target task m. We further find that MNLI is a useful intermediate task that benefits a large number of target tasks, e.g. BoolQ, SICK, CSQA, SST-2, CB, MRPC, RTE, which is in line with previous work (Phang et al., 2018; Conneau and Kiela, 2018; Reimers and Gurevych, 2019). Similarly, QQP is utilized by a large number of tasks, e.g. SICK, IMDB, RTE, CB, MRPC, SST-2. Most importantly, tasks with small datasets such as CB, RTE, and MRPC often strongly rely on adapters trained on large datasets such as MNLI and QQP.

Interestingly, we find that the activations in layer 12 are considerably more distributed across multiple tasks than adapters in earlier layers. The potential reason for this is that the last adapters are not encapsulated between frozen pretrained layers, and can thus be considered as an extension of the prediction head. The representations of the adapters in the 12th layer might thus not be as comparable, resulting in more distributed activations. This is in line with Pfeiffer et al. (2020d), who are able to improve zero-shot cross-lingual performance considerably by dropping the adapters in the last layer.
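A sketch of how the per-layer averages plotted in Figure 4 can be computed from recorded Fusion softmax activations; the tensor layout is an assumption, not the analysis code used for the paper.

```python
import torch

def average_fusion_activations(softmax_weights: torch.Tensor) -> torch.Tensor:
    """softmax_weights: Fusion attention recorded over the development set,
    shape [num_instances, num_layers, seq_len, num_adapters] (assumed layout).
    Returns one row of the Figure 4 heatmap per layer: [num_layers, num_adapters]."""
    # Average the softmax activation of every adapter over all tokens and all
    # dev-set instances, separately for each layer l.
    return softmax_weights.mean(dim=(0, 2))

# Example: 1000 dev instances, 12 BERT layers, 128 tokens, 16 task adapters.
weights = torch.rand(1000, 12, 128, 16).softmax(dim=-1)
per_layer = average_fusion_activations(weights)   # shape [12, 16]
```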
7 Contemporary Work

In contemporaneous work, other approaches for parameter-efficient fine-tuning have been proposed. Guo et al. (2020) train sparse "diff" vectors which are applied on top of pretrained frozen parameter vectors. Ravfogel and Goldberg (2021) only fine-tune bias terms of the pretrained language models, achieving similar results as full model fine-tuning. Li and Liang (2021) propose prefix-tuning for natural language generation tasks. Here, continuous task-specific vectors are trained while the remaining model is kept frozen. These alternative, parameter-efficient fine-tuning strategies all encapsulate the idiosyncratic task-specific information in designated parameters, creating the potential for new composition approaches of multiple tasks.

Rücklé et al. (2020a) analyse the training and inference efficiency of adapters and AdapterFusion. For AdapterFusion, they find that adding more tasks to the set of adapters results in a linear increase of computational cost, both for training and inference. They further propose approaches to mitigate this overhead.

8 Conclusion and Outlook

8.1 Conclusion

We propose a novel approach to transfer learning called AdapterFusion which provides a simple and effective way to combine information from several tasks. By separating the extraction of knowledge from its composition, we are able to effectively avoid the common pitfalls of multi-task learning, such as catastrophic forgetting and interference between tasks. Further, AdapterFusion mitigates the problem of traditional multi-task learning in which complete re-training is required when new tasks are added to the pool of datasets.
We have shown that AdapterFusion is compatible with adapters trained in both single-task as well as multi-task setups. AdapterFusion consistently outperforms fully fine-tuned models on the target task, demonstrating the value in having access to information from other tasks. While we observe gains using both ST-A as well as MT-A, we find that composing ST-A using AdapterFusion is the more efficient strategy, as adapters can be trained in parallel and re-used.

Finally, we analyze the weighting patterns of individual adapters in AdapterFusion, which reveal that tasks with small datasets more often rely on information from tasks with large datasets, thereby achieving the largest performance gains in our experiments. We show that AdapterFusion is able to identify and select adapters that contain knowledge relevant to the task of interest, while ignoring the remaining ones. This provides an implicit no-op option and makes AdapterFusion a suitable and versatile transfer learning approach for any NLU setting.

8.2 Outlook

Rücklé et al. (2020a) have studied pruning a large portion of adapters after Fusion training. Their results show that removing the less activated adapters results in almost no performance drop at inference time while considerably improving the inference speed. They also provide some initial evidence that it is possible to train Fusion with a subset of the available adapters in each minibatch, potentially enabling us to scale our approach to large adapter sets — which would otherwise be computationally infeasible. We believe that such extensions are a promising direction for future work.

Pfeiffer et al. (2020d) have achieved considerable improvements in zero-shot cross-lingual transfer performance by dropping the adapters in the last layer. In preliminary results, we have observed similar trends with AdapterFusion when the adapters in the last layer are not used. We will investigate this further in future work.

Acknowledgments

Jonas is supported by the LOEWE initiative (Hesse, Germany) within the emergenCITY center. Aishwarya was supported in part by a DeepMind PhD Fellowship during the time in which this project was carried out. Andreas is supported by the German Research Foundation within the project "Open Argument Mining" (GU 798/25-1), associated with the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999). This work was partly supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Research (Improving Deep Learning using Latent Structure). Kyunghyun was a research scientist at Facebook AI Research part-time, during which this project was carried out.

We thank Sebastian Ruder, Max Glockner, Jason Phang, Alex Wang, Katrina Evtimova and Sam Bowman for insightful feedback and suggestions on drafts of this paper.
References

Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In EMNLP-IJCNLP 2019.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT 2019.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML 2008.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In LREC 2018.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP 2005.

Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.

Demi Guo, Alexander M. Rush, and Yoon Kim. 2020. Parameter-efficient transfer learning with diff pruning. arXiv preprint.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In ICML 2019.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL 2018.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In EMNLP-IJCNLP 2019.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI 2018.

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. arXiv preprint.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Hector J. Levesque. 2011. The Winograd schema challenge. In AAAI Spring Symposium 2011.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In IJCAI 2016.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In ACL 2017.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In ACL 2019.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL 2011.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC 2014.

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.

Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification — revisiting neural networks. In ECML PKDD 2014.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In RepL4NLP@ACL 2019.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In EMNLP 2020: System Demonstrations.

Jonas Pfeiffer, Edwin Simpson, and Iryna Gurevych. 2020b. Low resource multi-task sequence tagging — revisiting dynamic conditional random fields. arXiv preprint.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020c. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In EMNLP 2020.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020d. UNKs everywhere: Adapting multilingual language models to new scripts. arXiv preprint.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint.

Jerin Philip, Alexandre Berard, Matthias Gallé, and Laurent Besacier. 2020. Monolingual adapters for zero-shot neural machine translation. In EMNLP 2020.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In ACL 2020.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In NeurIPS 2017.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP 2019.

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2020a. AdapterDrop: On the efficiency of adapters in transformers. arXiv preprint.

Andreas Rücklé, Jonas Pfeiffer, and Iryna Gurevych. 2020b. MultiCQA: Zero-shot transfer of self-supervised text matching models on a massive scale. In EMNLP 2020.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint.

Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. In AAAI 2019.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI 2020.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI 2019.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP 2019.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP 2013.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI 2017.

Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018. Cross-topic argument mining from heterogeneous sources. In EMNLP 2018.

Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In ICML 2019.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL-HLT 2019.

Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264. IGI Global.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language adaptation for truly Universal Dependency parsing. In EMNLP 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 2017.

M. Vidoni, Ivan Vulić, and Goran Glavaš. 2020. Orthogonal language and task adapters in zero-shot cross-lingual transfer. arXiv preprint.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP 2018.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT 2018.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP 2018.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL 2019.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint.

A Appendices

A.1 Datasets

Commonsense Reasoning We work with a large number of datasets, all of which have emerged recently in this domain, ranging from sentence-level and document-level classification to multiple-choice questions. The next sentence prediction task HellaSwag (Zellers et al., 2019) is a more difficult version of the previously released SWAG dataset (Zellers et al., 2018). Winogrande (Sakaguchi et al., 2020) is a large-scale and adversarially filtered (Zellers et al., 2018) adaptation of the Winograd Schema Challenge (Levesque, 2011). Cosmos QA (Huang et al., 2019) is a commonsense reading comprehension dataset which requires reasoning over larger text passages. Social IQA (Sap et al., 2019) is a multiple-choice dataset which requires reasoning over social interactions between humans. Commonsense QA (Talmor et al., 2019) is a multiple-choice dataset based on ConceptNet (Speer et al., 2017), which requires reasoning over general knowledge.

Sentiment Analysis We conduct experiments on two binary sentiment classification tasks on long and short text passages. IMDb (Maas et al., 2011) consists of long movie reviews and SST-2 (Socher et al., 2013) consists of short movie reviews from Rotten Tomatoes.⁶

Natural Language Inference (NLI) The goal is to classify whether two sentences entail, contradict, or are neutral to each other. For this we conduct experiments on MultiNLI (Williams et al., 2018), a multi-genre dataset; SciTail (Khot et al., 2018), an NLI dataset on scientific text; SICK (Marelli et al., 2014), an NLI dataset with relatedness scores; the composition of Recognizing Textual Entailment (RTE) datasets provided by Wang, Singh, Michael, Hill, Levy, and Bowman (2018); as well as the Commitment Bank (CB) (De Marneffe et al., 2019), a three-class textual entailment dataset.

Sentence Relatedness We include two semantic relatedness datasets which capture whether or not two text samples include similar content. The Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) consists of sentence pairs which capture a paraphrase/semantic equivalence relationship. Quora Question Pairs (QQP) targets duplicate question detection.⁷

Misc The Argument Aspect corpus (Stab et al., 2018) is a three-way classification task to predict whether a document provides arguments for, against, or none for a given topic (Nuclear Energy, Abortion, Gun Control, etc.). BoolQ (Clark et al., 2019) is a binary reading comprehension classification task for simple yes/no questions.

⁶ www.rottentomatoes.com
⁷ data.quora.com/First-Quora-Dataset-Release-Question-Pairs

A.2 What Is The Best Adapter Setup?

As described in §2.2.3, the placement of adapter parameters Φ within a pretrained model is non-trivial and thus requires extensive experiments. In order to identify the best ST-A setting, we run an exhaustive architecture search on the hyperparameters — including the position and number of adapters in each transformer layer, the position and number of pretrained or task-dependent layer norms, the position of residual connections, the bottleneck reduction factors {2, 8, 16, 64}, and the non-linearity {ReLU, LeakyReLU, Swish} used within the adapter. We illustrate this in Figure 5. This grid search includes the settings introduced by Houlsby et al. (2019) and Bapna and Firat (2019). We perform this search on three diverse tasks⁸ and find

⁸ SST-2, Commonsense QA, and Argument.
We perform this search on three diverse tasks (SST-2, Commonsense QA, and Argument) and find that across all three tasks, the same setup obtains the best results. We present our results on the SST-2, Argument, and CSQA datasets in Figures 7, 8, and 9, respectively, at different granularity levels. We find that, in contrast to Houlsby et al. (2019) but in line with Bapna and Firat (2019), a single adapter after the feed-forward layer outperforms the other settings. While this setting performs on par with that of Houlsby et al. (2019), it requires only half the number of newly introduced adapters, resulting in a more efficient setting in terms of the number of operations. For the single-task adapter setting, we thus perform all subsequent experiments with the best architecture illustrated on the right of Figure 5 and a learning rate of 1e-4. In order to reproduce the multi-task results of Stickland and Murray (2019) and build upon them, for experiments involving multi-task training we adopt their architecture as described in §2.2.3.

Figure 5: Different architectural components of the adapter. On the left, we show all components for which we conduct an exhaustive search (dashed lines). On the right, we show the adapter architecture that performs the best across all our tasks.
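To make the selected setting concrete, the following is a minimal PyTorch sketch of such a bottleneck adapter. It is our own illustration rather than the released implementation: the class and argument names, the choice of ReLU, and the residual wiring are assumptions, and the reuse of the pretrained LayerNorms around the module (Figure 5, right) is omitted.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative single bottleneck adapter, inserted after the feed-forward sub-layer."""

    def __init__(self, hidden_size: int = 768, reduction_factor: int = 16):
        super().__init__()
        bottleneck = hidden_size // reduction_factor  # e.g. 768 // 16 = 48, cf. Appendix A.4
        self.down_proj = nn.Linear(hidden_size, bottleneck)
        self.non_linearity = nn.ReLU()  # ReLU, LeakyReLU and Swish were all part of the search
        self.up_proj = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection around the bottleneck keeps the module close to the
        # identity; only these projections are newly trained for a given task.
        return hidden_states + self.up_proj(self.non_linearity(self.down_proj(hidden_states)))

# Shape check: the adapter leaves the hidden dimension unchanged.
adapter = BottleneckAdapter(hidden_size=768, reduction_factor=16)
x = torch.randn(4, 128, 768)  # (batch, sequence length, hidden size)
assert adapter(x).shape == x.shape

Inserted once per transformer layer, a module of this form adds only the two projection matrices (plus biases) as task-specific parameters.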

A.3 AdapterFusion Activations of all Layers

We present the cross-product of activations of AdapterFusion across all layers for BERT-base and ST-A16 in Figure 6, as an extension to Figure 4.

A.4 BERT-base ST-A with Reduction Factors {2, 16, 64}

We present the ST-A results with different capacities, leveraging BERT-base weights, in Table 3. Reduction factors 2, 16, and 64 amount to dense adapter dimensions 384, 48, and 12, respectively.

A.5 ST-A and Fusion with ST-A Results with RoBERTa-base

In order to validate the findings for our best setup, ST-A, we re-evaluate our results leveraging RoBERTa-base weights. We present these results in Table 4. Similar to our findings with BERT-base, datasets with less training data in particular profit from AdapterFusion. We find that, in contrast to BERT-base, RoBERTa-base does not perform well with high-capacity adapters with reduction factor 2.
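The reduction factors above translate directly into adapter capacity: the bottleneck width is the hidden size divided by the reduction factor. The helper below is our own illustration for a hidden size of 768 and counts only the two projections of the bottleneck sketched in Appendix A.2, ignoring LayerNorm and prediction-head parameters.

HIDDEN_SIZE = 768  # hidden size of BERT-base and RoBERTa-base

def adapter_size(reduction_factor: int, hidden_size: int = HIDDEN_SIZE):
    bottleneck = hidden_size // reduction_factor
    # weights and biases of the down-projection plus the up-projection
    n_params = (hidden_size * bottleneck + bottleneck) + (bottleneck * hidden_size + hidden_size)
    return bottleneck, n_params

for r in (2, 16, 64):
    dim, n = adapter_size(r)
    print(f"reduction {r:>2}: bottleneck {dim:>3}, ~{n:,} parameters per adapted layer")
# reduction  2: bottleneck 384, ~590,976 parameters per adapted layer
# reduction 16: bottleneck  48, ~74,544 parameters per adapted layer
# reduction 64: bottleneck  12, ~19,212 parameters per adapted layer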
Figure 6: AdapterFusion activations in the 12 BERT-base layers (one heatmap per layer, Layer 1 through Layer 12). Target tasks are presented in rows, whereas the set of adapters is displayed in columns. Black squares indicate that an adapter has not been activated, whereas white cells indicate full activation.
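Figure 6 aggregates, for every layer, the attention that the fusion module assigns to each task adapter. The sketch below shows one way such per-layer activations could be collected; it assumes access to the post-softmax fusion weights in a (layers, batch, sequence, adapters) layout, which is our own convention and not that of the released code.

import torch

def average_fusion_activations(batches_of_weights):
    # batches_of_weights: iterable of tensors of shape
    # (num_layers, batch_size, seq_len, num_adapters) holding post-softmax fusion weights.
    total, n_batches = None, 0
    for weights in batches_of_weights:
        per_layer = weights.mean(dim=(1, 2))  # -> (num_layers, num_adapters)
        total = per_layer if total is None else total + per_layer
        n_batches += 1
    return total / n_batches  # rows: layers, columns: adapters, as in the heatmaps above

# Example with random stand-in weights for a 12-layer model fusing 16 adapters.
dummy = [torch.softmax(torch.randn(12, 8, 32, 16), dim=-1) for _ in range(3)]
print(average_fusion_activations(dummy).shape)  # torch.Size([12, 16])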
[Figures 7-9 each show three panels, (a) Adapter Positions in Layer, (b) Position of Pretrained LayerNorm, and (c) Position of newly trained LayerNorm, plotting accuracy against the reduction factor and comparing against fully fine-tuned BERT.]

Figure 7: Results of the grid search on the SST-2 dataset over the architecture settings illustrated on the left of Figure 5. As we go from (a) to (c), the best performing setting is used for further search over the other hyperparameters. We find that the best performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.

Figure 8: Results of the grid search on the Argument dataset over the architecture settings illustrated on the left of Figure 5. As we go from (a) to (c), the best performing setting is used for further search over the other hyperparameters. We find that the best performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.

Figure 9: Results of the grid search on the CSQA dataset over the architecture settings illustrated on the left of Figure 5. As we go from (a) to (c), the best performing setting is used for further search over the other hyperparameters. We find that the best performing architecture is Top Adapter Only with Pretrained LayerNorm Before & After and No New LayerNorm. This architecture is illustrated on the right of Figure 5.
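The settings compared in Figures 7, 8, and 9 span the adapter position within a layer, the treatment of pretrained and newly introduced LayerNorms, the reduction factor, and the non-linearity. The snippet below enumerates this space with option names taken from the figure legends; the variable names are ours, and the actual search was staged, carrying the best setting of panel (a) into panels (b) and (c) rather than training the full cross-product.

from itertools import product

adapter_position = ["top_adapter_only", "bottom_adapter_only", "both_adapters"]
pretrained_layernorm = ["before", "after", "before_and_after", "none"]
new_layernorm = ["before", "after", "before_and_after", "none"]
reduction_factor = [2, 8, 16, 64]
non_linearity = ["relu", "leakyrelu", "swish"]

search_space = list(product(adapter_position, pretrained_layernorm, new_layernorm,
                            reduction_factor, non_linearity))
print(len(search_space))  # 576 combinations in the full cross-product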
Dataset ST-A2 ST-A16 ST-A64
MultiNLI 84.60 84.32 84.08
QQP 90.57 90.59 89.73
SST 92.66 ±0.32 91.85 ±0.41 92.01 ±0.33
Winogrande 62.11 ±0.09 61.09 ±0.11 59.70 ±0.06
IMDB 94.20 ±0.28 93.85 ±0.07 93.90 ±0.14
HellaSwag 39.45 ±0.20 38.11 ±0.14 38.28 ±0.37
SocialIQA 60.95 ±0.15 62.41 ±0.11 62.23 ±0.73
CosmosQA 59.32 ±0.24 60.01 ±0.02 60.65 ±0.34
SciTail 94.44 ±0.81 93.90 ±0.16 93.82 ±0.49
Argument 76.83 ±0.21 77.65 ±0.34 77.64 ±0.56
CSQA 57.83 ±0.23 58.91 ±0.57 58.88 ±0.40
BoolQ 77.14 ±1.10 75.66 ±1.25 76.07 ±0.54
MRPC 86.13 ±1.59 85.16 ±0.52 85.58 ±0.32
SICK 87.50 ±0.14 86.20 ±0.00 85.70 ±0.42
RTE 70.68 ±4.57 71.04 ±1.62 69.16 ±1.59
CB 87.85 ±2.94 86.07 ±3.87 84.28 ±4.79

Mean 76.39 76.05 75.73

Table 3: Mean and standard deviation results (development sets) for each of the 16 datasets and reduction factors {2, 16, 64} for ST-A. Each model is initialized with BERT-base (Devlin et al., 2019) weights. The datasets are ordered by their respective training dataset size. Dashed horizontal lines separate dataset sizes {> 40k, > 10k, > 5k}, respectively.
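The Mean row is the unweighted average of the 16 per-dataset scores. As a quick check, recomputing it for the ST-A16 column from the values in Table 3:

st_a16 = [84.32, 90.59, 91.85, 61.09, 93.85, 38.11, 62.41, 60.01,
          93.90, 77.65, 58.91, 75.66, 85.16, 86.20, 71.04, 86.07]
print(round(sum(st_a16) / len(st_a16), 2))  # 76.05, matching the reported mean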

Dataset Head Full ST-A2 ST-A16 ST-A64 F. w/ ST-A16 ST-A16 (Houlsby)
MultiNLI 56.84 86.42 85.56 86.06 85.86 86.20 86.57


QQP 71.40 91.07 90.88 ±0.07 90.27 89.39 ±0.63 90.28 90.66
SST 81.86 ±0.21 94.29 ±0.22 93.71 ±0.29 93.80 ±0.23 93.35 ±0.43 93.67 ±0.13 94.17 ±0.15
Winogrande 51.93 66.77 51.27 ±0.78 65.58 ±0.53 62.43 66.01 ±0.47 63.46 ±6.38
IMDB 85.40 96.00 95.70 95.78 ±0.13 95.80 95.78 ±0.19 95.68 ±0.26
HellaSwag 41.16 63.53 61.09 ±0.08 61.57 ±0.14 61.18 ±0.21 61.52 ±0.07 61.21 ±0.37
SocialIQA 46.87 69.44 69.24 70.14 ±0.40 70.21 70.13 ±0.11 70.78 ±0.17
CosmosQA 41.88 ±0.29 68.52 ±0.49 68.01 ±0.94 68.76 ±0.53 68.62 ±0.55 68.64 ±0.04 69.18 ±0.34
SciTail 49.57 94.47 94.24 94.59 ±0.64 94.32 94.44 ±0.09 94.09 ±0.39
Argument 66.22 ±0.62 78.04 ±0.42 78.60 ±0.34 78.50 ±0.45 78.53 ±0.59 77.98 ±0.24 78.42 ±0.44
CSQA 41.37 ±0.34 65.81 ±0.59 66.11 ±0.60 66.30 ±0.38 64.03 ±0.27 66.52 ±0.18 67.53 ±0.70
BoolQ 62.17 81.89 80.86 ±0.86 80.83 ±0.27 80.17 ±0.25 80.86 ±0.15 81.11 ±0.54
MRPC 68.38 ±0.00 89.11 ±0.93 89.11 ±0.51 88.72 ±0.71 87.10 ±1.67 89.65 ±0.50 89.17 ±1.06
SICK 56.40 86.60 84.80 85.40 ±0.32 85.40 85.76 ±0.26 85.88 ±0.46
RTE 55.81 ±2.92 72.34 ±11.02 61.80 ±12.47 75.30 ±0.61 73.86 ±1.55 78.79 ±1.12 78.56 ±1.54
CB 59.64 ±11.05 90.00 ±1.60 87.14 ±6.85 89.28 ±2.82 81.07 ±4.82 92.86 ±3.79 89.64 ±3.87

Mean 58.05 81.08 78.63 80.83 79.52 81.41 81.18

Table 4: Mean and standard deviation results of models initialized with RoBERTa-base (Liu et al., 2019b) weights. Performances are measured on the development sets of the 16 datasets for the different architectural setups. The datasets are ordered by their respective training dataset size. Dashed horizontal lines separate dataset sizes {> 40k, > 10k, > 5k}, respectively. Head indicates training only a classification head on top of fixed RoBERTa weights. For Full training we fine-tune all weights of RoBERTa. Single-Task adapters (ST-A) are adapters trained independently for each task, using the architecture illustrated in Figure 5; the indices {2, 16, 64} indicate the reduction factor. F. w/ ST-A16 shows the results of AdapterFusion using the respective pretrained adapters. ST-A16 (Houlsby) shows the results of ST-A with the architecture proposed by Houlsby et al. (2019).
