MerA: Merging Pretrained Adapters For Few-Shot Learning
Shwai He1 Run-Ze Fan3 Liang Ding2∗ Li Shen4 Tianyi Zhou1 Dacheng Tao2
1 University of Maryland, College Park   2 The University of Sydney   3 University of Chinese Academy of Sciences   4 JD Explore Academy
[email protected], [email protected]
prompts us to directly leverage pretrained adapters to extend the potential of single adapters.

To achieve this, we propose an approach that merges pretrained adapters into a single one (MerA). On the one hand, the merged adapter does not introduce additional trainable parameters. On the other hand, the knowledge from the pretrained adapters enhances downstream performance, as illustrated in Figure 1. We first implement two straightforward methods for merging parameters, summation (“Sum.”) and averaging (“Avg.”), but find that the lack of one-to-one correspondence between the parameters of different models leads to suboptimal performance. Inspired by Solomon et al. (2015); Singh and Jaggi (2020), we further propose to align adapters’ parameters through optimal transport based on weights (“Wts.”) and activations (“Acts.”).
Extensive few-shot experiments demonstrate that MerA achieves significant improvements compared with Adapters, e.g., 2.7% in average accuracy. In addition, we find that merging adapters from the same track of tasks further enhances the capacity of MerA. We therefore introduce a simple yet effective “same-track” setting, in which only adapters pretrained on the same track of tasks are merged. With the “same-track” setting, we observe even more impressive gains, surpassing the performance of full fine-tuning and Adapter tuning by a substantial margin, e.g., 3.5% in MRPC and 5.0% in MNLI.

2 Methodology
AdapterFusion. Adapters (Houlsby et al., 2019) are bottleneck modules plugged into PLMs, with model dimension $d$ and reduction factor $r$. In standard Adapter Tuning, only the adapter layers are trainable, while the other layers are frozen. After tuning, AdapterFusion (Pfeiffer et al., 2021) combines the pretrained adapters through an additional composition layer.

Motivation. However, the resource-constrained scenario challenges the parameter efficiency of AdapterFusion. For one thing, AdapterFusion requires composition layers that include additional trainable parameters, e.g., Query, Key, and Value. The trainable parameters of a composition layer amount to $3d^2$ (ignoring bias terms for the time being), while a single adapter only contains $2d^2/r$. For another, AdapterFusion has to assign parallel pretrained adapters to each adapter layer, which multiplies the additional deployment cost.
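To make the parameter comparison concrete, the following is a minimal back-of-the-envelope sketch. The values d = 768 and r = 16 are illustrative assumptions (a BERT-base-sized hidden dimension with a common reduction factor), not numbers reported in the paper.

```python
# Per-layer parameter counts, bias terms ignored.
# Assumed illustrative values: d = 768 (BERT-base hidden size), r = 16.
d, r = 768, 16

composition_layer = 3 * d ** 2       # Query, Key, Value projections of an AdapterFusion composition layer
single_adapter = 2 * d ** 2 // r     # down-projection (d x d/r) plus up-projection (d/r x d)

print(f"Composition layer: {composition_layer:,} params")   # 1,769,472
print(f"Single adapter:    {single_adapter:,} params")      # 73,728
print(f"Ratio:             {composition_layer / single_adapter:.0f}x")  # 24x
```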
[Figure 2: Comparison between single adapters and AdapterFusion under 10-shot scenarios. Bar chart of Performance (%) on QNLI, QQP, RTE, and STSB.]

Beyond the parameter budget, to check whether AdapterFusion is an optimal choice for improving task-specific performance under data-constrained scenarios, we conduct few-shot experiments comparing single adapters and AdapterFusion. As shown in Figure 2, single adapters consistently outperform AdapterFusion with fewer parameters, which indicates the superiority of tuning single adapters. Inspired by this preliminary study, we intend to merge multiple adapters into a single module to explore the potential of single adapters.

Adapter Merging. Due to the limitations of AdapterFusion, we intend to make more efficient use of the knowledge in the pretrained adapters. Integrating the parallel adapters into a single module can reduce the excess trainable parameters and improve downstream performance. We first consider two simple methods for merging the weights of adapters trained on different tasks. The first one is summation (“Sum.”):

$\widetilde{W} = \sum_{j=1}^{n} W_{\tau_j}$,   (1)

and the second is averaging (“Avg.”):

$\widetilde{W} = \frac{1}{n} \sum_{j=1}^{n} W_{\tau_j}$,   (2)

where $n$ is the number of pretrained adapters and $W_{\tau_j}$ denotes the weights of the adapter trained on task $\tau_j$.
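As an illustration of Eqs. (1)-(2), below is a minimal sketch of merging adapter weights stored as PyTorch state dicts. The state-dict layout and the checkpoint file names in the usage comment are assumptions for illustration, not tied to any particular adapter implementation.

```python
from typing import Dict, List
import torch

def merge_adapters(state_dicts: List[Dict[str, torch.Tensor]],
                   mode: str = "avg") -> Dict[str, torch.Tensor]:
    """Merge n adapter checkpoints parameter-wise by summation ("sum", Eq. 1)
    or averaging ("avg", Eq. 2). All checkpoints must share the same keys and shapes."""
    merged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key] for sd in state_dicts], dim=0)
        merged[key] = stacked.sum(dim=0) if mode == "sum" else stacked.mean(dim=0)
    return merged

# Hypothetical usage (paths are placeholders for pretrained adapter checkpoints):
# adapters = [torch.load(p) for p in ["imdb.pt", "boolq.pt", "scitail.pt", "winogrande.pt"]]
# merged = merge_adapters(adapters, mode="avg")
```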
However, the problem with vanilla summation and averaging is the lack of one-to-one correspondence between parameters from different adapters. In particular, the p-th neuron of the i-th adapter might behave very differently from the p-th neuron of the j-th adapter and instead behave similarly to another neuron. Therefore, it makes sense to first align the neurons and then assemble the adapters. Inspired by Singh and Jaggi (2020); Zan et al. (2022), we align adapters’ parameters via optimal transport based on weights (“Wts.”) and activations (“Acts.”).

Given adapters trained on tasks $\tau_i$ and $\tau_j$, we plug them into language models and denote the incoming edges of the $\ell$-th adapter layer as $W_{\tau_i}^{(\ell,\,\ell-1)}$ and $W_{\tau_j}^{(\ell,\,\ell-1)}$, respectively. We align $W_{\tau_j}^{(\ell,\,\ell-1)}$ to $W_{\tau_i}^{(\ell,\,\ell-1)}$ by constructing convex combinations with the previous layer's transport matrix $T^{(\ell-1)}$, computed from weights (“Wts.”) or activations (“Acts.”) and normalized appropriately via the inverse of the corresponding column marginals $\beta$:

$\widehat{W}_{\tau_j}^{(\ell,\,\ell-1)} \leftarrow W_{\tau_j}^{(\ell,\,\ell-1)}\, T^{(\ell-1)}\, \mathrm{diag}(1/\beta^{(\ell-1)})$,   (3)

$\widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)} \leftarrow \mathrm{diag}(1/\beta^{(\ell)})\, T^{(\ell)\top}\, \widehat{W}_{\tau_j}^{(\ell,\,\ell-1)}$.   (4)

We refer to $\widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)}$ as the aligned weights of the adapter from task $\tau_j$, which can be directly added to $W_{\tau_i}^{(\ell,\,\ell-1)}$. We carry out this procedure on all pretrained adapters:

$\widetilde{W}^{(\ell,\,\ell-1)} \leftarrow \frac{1}{n} \sum_{j=1}^{n} \widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)}$.   (5)
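The following is a minimal sketch of applying Eqs. (3)-(5) to a single adapter layer. It assumes the transport matrices T and column marginals beta have already been obtained from an optimal-transport solver over weights or activations (computing them is outside the scope of this sketch); tensor shapes and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch

def align_and_merge(weights, transports, marginals):
    """Apply Eqs. (3)-(5) to one adapter layer.

    weights:    list of n weight matrices W_{tau_j}^{(l, l-1)}, each of shape [d_l, d_{l-1}];
                weights[0] can be the anchor W_{tau_i}, whose transports are identities
                and whose marginals are all-ones vectors.
    transports: list of n pairs (T_prev, T_curr) of transport matrices.
    marginals:  list of n pairs (beta_prev, beta_curr) of column-marginal vectors.
    """
    aligned = []
    for W, (T_prev, T_curr), (b_prev, b_curr) in zip(weights, transports, marginals):
        W_hat = W @ T_prev @ torch.diag(1.0 / b_prev)           # Eq. (3)
        W_tilde = torch.diag(1.0 / b_curr) @ T_curr.T @ W_hat   # Eq. (4)
        aligned.append(W_tilde)
    return torch.stack(aligned).mean(dim=0)                     # Eq. (5)
```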
3 Empirical Evaluation

3.1 Setup

We collect pretrained adapters from AdapterHub (Pfeiffer et al., 2020), which are trained on imdb (Maas et al., 2011), boolq (Clark et al., 2019), scitail (Khot et al., 2018), and winogrande (Sakaguchi et al., 2021). The pretraining tasks cover different NLP tracks, including sentiment analysis, question answering, natural language inference, and commonsense reasoning. Our experiments were conducted on the widely used GLUE benchmark (Wang et al., 2019).

We use Adam (Kingma and Ba, 2015) as the optimizer with $\beta_1, \beta_2 = 0.9, 0.98$. We set the weight decay to 0.1 and grid-search the learning rate and the number of training epochs over {1e-5, 5e-5, 1e-4} and {5, 10}, respectively. The maximum sequence length is 128. Following previous works (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020; He et al., 2023; Zhong et al., 2023), we fine-tune the pretrained language models, e.g., BERT (Devlin et al., 2019), on the downstream training set and report results on the dev set using the last checkpoint.
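For concreteness, the search space described above can be written down as a small sweep. This is a hedged sketch of the grid only; how each configuration is trained and selected is not spelled out here.

```python
from itertools import product

# Fixed settings and grid-searched settings from the setup above.
fixed = {"betas": (0.9, 0.98), "weight_decay": 0.1, "max_seq_length": 128}
learning_rates = [1e-5, 5e-5, 1e-4]
epoch_choices = [5, 10]

configs = [
    {**fixed, "learning_rate": lr, "epochs": ep}
    for lr, ep in product(learning_rates, epoch_choices)
]
# 6 candidate runs per task and shot setting; dev-set results of the
# last checkpoint are reported.
print(len(configs))  # 6
```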
3.2 Results

Table 1: Experimental results of different merging methods on the GLUE benchmark, where we perform merging with the same pretrained adapters for a fair comparison. All tasks are evaluated using accuracy. Average scores over all shots are underlined. The best results are in bold.

Main Results. In Table 1, we carefully compare MerA (with the merging methods above: “Sum.”, “Avg.”, “Wts.”, “Acts.”) to the standard adapter (Houlsby et al., 2019) (“Adapter”) and AdapterFusion (Pfeiffer et al., 2021) (“AF”) on the GLUE benchmark for the pretrained language model BERT (Devlin et al., 2019), where we set the number of training shots to 10, 20, and 30, respectively.

MerA achieves significantly better performance than the vanilla adapter (Houlsby et al., 2019; Pfeiffer et al., 2021) with the same parameter budget. Compared to AdapterFusion (Pfeiffer et al., 2021), our methods significantly reduce the required trainable parameters and improve performance.

MerA with Different Merging Methods. MerA with all merging methods outperforms standard adapter tuning, and the optimal transport methods outperform the two naive methods in all tasks, reflecting the necessity of weight alignment. Notably, “Acts.”-based adapters achieve up to 2.7% average improvement compared to the standard adapter and ...

Table 2: Comparison between MerA and adapter architectures under various fine-tuning strategies. We display the performance of 30-shot tuning and denote head-based tuning, prompt-based tuning, and prompt-based tuning with demonstrations as “head”, “prompt”, and “prompt-demo”, respectively.

...ous fine-tuning strategies. This highlights the effectiveness and versatility of MerA in enhancing performance in various fine-tuning scenarios.

Table 3: Effect of different tracks. We denote a single pretrained adapter as “+Single” and denote MerA with pretrained tracks (e.g., “+QA”). For each track, the same number of pretrained adapters is chosen for MerA.

Method       MRPC (10)     MRPC (20)     MNLI (10)     MNLI (20)
Adapter      65.4          67.0          39.4          40.8
+Single      65.0          67.4          40.6          42.3
+QA          66.9          68.9          39.2          40.8
+NLI         66.4          68.6          44.2 (⇑+4.8)  45.8 (⇑+5.0)
+Sentiment   64.2          66.9          39.8          41.9
+Comsense    65.4          68.2          39.6          41.8
+STS         68.9 (⇑+3.5)  70.1 (⇑+3.1)  40.2          42.2

Effect on Initialization. To analyze the benefits of merging adapters, we conducted a zero-shot experiment to examine the effect of MerA on task initialization. In Figure 3, we plugged MerA into BERT and compared it with BERT models equipped with randomly initialized adapters (“Adapter”) and without any adapters (“Baseline”). Compared to “Adapter” and “Baseline”, MerA ensures a superior initial state for the target tasks and achieves significant accuracy improvements, e.g., 2.4% in MNLI and 0.98% in SST-2. These findings reveal the efficacy of MerA in enhancing the initialization of target tasks.
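A minimal sketch of this kind of zero-shot initialization check is given below. The evaluation loop assumes HF-style batches (dicts with a "labels" key) and a classification model exposing `.logits`; the model builders in the commented usage (`build_bert`, `add_random_adapters`, `add_merged_adapters`) are hypothetical placeholders, since the paper does not specify how adapters are inserted.

```python
import torch

@torch.no_grad()
def zero_shot_accuracy(model, dataloader, device="cpu"):
    """Accuracy of an untrained (zero-shot) configuration on a dev set."""
    model.eval().to(device)
    correct = total = 0
    for batch in dataloader:
        labels = batch.pop("labels").to(device)
        logits = model(**{k: v.to(device) for k, v in batch.items()}).logits
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage, comparing the three initializations discussed above:
# for name, model in {
#     "Baseline": build_bert(),
#     "Adapter":  add_random_adapters(build_bert()),
#     "MerA":     add_merged_adapters(build_bert(), merged_state_dict),
# }.items():
#     print(name, zero_shot_accuracy(model, dev_loader))
```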
Ethics Statement

We take ethical considerations very seriously. This paper focuses on higher model and data efficiency for Adapters. Both the datasets and models used in this paper are publicly available and have been widely adopted by researchers. We ensure that the findings and conclusions of this paper are reported accurately and objectively.

Shwai He, Liang Ding, Daize Dong, Boan Liu, Fuqiang Yu, and Dacheng Tao. 2023. PAD-Net: An efficient framework for dynamic networks. In ACL.

Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. 2022b. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In EMNLP.