MerA: Merging Pretrained Adapters For Few-Shot Learning
Shwai He1 Run-Ze Fan3 Liang Ding2∗ Li Shen4 Tianyi Zhou1 Dacheng Tao2
1 University of Maryland, College Park   2 The University of Sydney   3 University of Chinese Academy of Sciences   4 JD Explore Academy
[email protected], [email protected]
prompts us to directly leverage pretrained adapters to extend the potential of single adapters.

To achieve this, we propose an approach that merges pretrained adapters into a single one (MerA). On the one hand, the merged adapter does not introduce additional trainable parameters. On the other hand, the knowledge from the pretrained adapters enhances downstream performance, as illustrated in Figure 1. We first implement two straightforward methods for merging parameters, summation (“Sum.”) and averaging (“Avg.”), but find that the lack of one-to-one correspondence between the parameters of different models leads to suboptimal performance. Inspired by Solomon et al. (2015); Singh and Jaggi (2020), we further propose to align adapters’ parameters through optimal transport based on weights (“Wts.”) and activations (“Acts.”).
Extensive few-shot experiments demonstrate that MerA achieves significant improvements compared with Adapters, e.g., 2.7% in average accuracy. In addition, we find that merging adapters from the same track of tasks further enhances the capacity of MerA. We therefore introduce a simple yet effective “same-track” setting, in which only adapters pretrained on the same track of tasks are merged. With the “same-track” setting, we observe even more impressive gains, surpassing the performance of full fine-tuning and Adapter tuning by a substantial margin, e.g., 3.5% in MRPC and 5.0% in MNLI.

2 Methodology
AdapterFusion. Adapters (Houlsby et al., 2019) are bottleneck modules plugged into PLMs, with model dimension $d$ and reduction factor $r$. In standard Adapter Tuning, only the adapter layers are trainable, while the other layers are frozen. After tuning, AdapterFusion (Pfeiffer et al., 2021) combines the pretrained adapters through an additional composition layer.

Motivation. However, the resource-constrained scenario challenges the parameter efficiency of AdapterFusion. For one thing, AdapterFusion requires composition layers that include additional trainable parameters, e.g., Query, Key, and Value. The trainable parameters of a composition layer amount to $3d^2$ (ignoring bias terms for the time being), while a single adapter only contains $2d^2/r$. For another, AdapterFusion has to assign parallel pretrained adapters to each adapter layer, which multiplies the additional deployment cost.
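To make the parameter comparison concrete, the following is a minimal back-of-the-envelope sketch. The values d = 768 and r = 16 are illustrative assumptions (a BERT-base-sized hidden dimension with a common reduction factor), not numbers reported in the paper.

```python
# Per-layer parameter counts, bias terms ignored.
# Assumed illustrative values: d = 768 (BERT-base hidden size), r = 16.
d, r = 768, 16

composition_layer = 3 * d ** 2       # Query, Key, Value projections of an AdapterFusion composition layer
single_adapter = 2 * d ** 2 // r     # down-projection (d x d/r) plus up-projection (d/r x d)

print(f"Composition layer: {composition_layer:,} params")   # 1,769,472
print(f"Single adapter:    {single_adapter:,} params")      # 73,728
print(f"Ratio:             {composition_layer / single_adapter:.0f}x")  # 24x
```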
[Figure 2: Comparison between single adapters and AdapterFusion under 10-shot scenarios. Bar chart of Performance (%) on QNLI, QQP, RTE, and STSB.]

Beyond the parameter budget, to check whether AdapterFusion is an optimal choice for improving task-specific performance under data-constrained scenarios, we conduct few-shot experiments comparing single adapters and AdapterFusion. As shown in Figure 2, single adapters consistently outperform AdapterFusion with fewer parameters, which indicates the superiority of tuning single adapters. Inspired by this preliminary study, we intend to merge multiple adapters into a single module to explore the potential of single adapters.

Adapter Merging. Due to the limitations of AdapterFusion, we intend to make more efficient use of the knowledge in the pretrained adapters. Integrating the parallel adapters into a single module can reduce the excess trainable parameters and improve downstream performance. We first consider two simple methods for merging the weights of adapters trained on different tasks. The first one is summation (“Sum.”):

$\widetilde{W} = \sum_{j=1}^{n} W_{\tau_j}$,   (1)

and the second is averaging (“Avg.”):

$\widetilde{W} = \frac{1}{n} \sum_{j=1}^{n} W_{\tau_j}$,   (2)

where $n$ is the number of pretrained adapters and $W_{\tau_j}$ denotes the weights of the adapter trained on task $\tau_j$.
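As an illustration of Eqs. (1)-(2), below is a minimal sketch of merging adapter weights stored as PyTorch state dicts. The state-dict layout and the checkpoint file names in the usage comment are assumptions for illustration, not tied to any particular adapter implementation.

```python
from typing import Dict, List
import torch

def merge_adapters(state_dicts: List[Dict[str, torch.Tensor]],
                   mode: str = "avg") -> Dict[str, torch.Tensor]:
    """Merge n adapter checkpoints parameter-wise by summation ("sum", Eq. 1)
    or averaging ("avg", Eq. 2). All checkpoints must share the same keys and shapes."""
    merged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key] for sd in state_dicts], dim=0)
        merged[key] = stacked.sum(dim=0) if mode == "sum" else stacked.mean(dim=0)
    return merged

# Hypothetical usage (paths are placeholders for pretrained adapter checkpoints):
# adapters = [torch.load(p) for p in ["imdb.pt", "boolq.pt", "scitail.pt", "winogrande.pt"]]
# merged = merge_adapters(adapters, mode="avg")
```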
However, the problem with vanilla summation and averaging is the lack of one-to-one correspondence between parameters from different adapters. In particular, the p-th neuron of the i-th adapter might behave very differently from the p-th neuron of the j-th adapter and instead behave similarly to another neuron. Therefore, it makes sense to first align the neurons and then assemble the adapters. Inspired by Singh and Jaggi (2020); Zan et al. (2022), we align adapters’ parameters via optimal transport based on weights (“Wts.”) and activations (“Acts.”).

Given adapters trained on tasks $\tau_i$ and $\tau_j$, we plug them into language models and denote the incoming edges of the $\ell$-th adapter layer as $W_{\tau_i}^{(\ell,\,\ell-1)}$ and $W_{\tau_j}^{(\ell,\,\ell-1)}$, respectively. We align $W_{\tau_j}^{(\ell,\,\ell-1)}$ to $W_{\tau_i}^{(\ell,\,\ell-1)}$ by constructing convex combinations with the previous layer's transport matrix $T^{(\ell-1)}$, computed from weights (“Wts.”) or activations (“Acts.”) and normalized appropriately via the inverse of the corresponding column marginals $\beta$:

$\widehat{W}_{\tau_j}^{(\ell,\,\ell-1)} \leftarrow W_{\tau_j}^{(\ell,\,\ell-1)}\, T^{(\ell-1)}\, \mathrm{diag}(1/\beta^{(\ell-1)})$,   (3)

$\widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)} \leftarrow \mathrm{diag}(1/\beta^{(\ell)})\, T^{(\ell)\top}\, \widehat{W}_{\tau_j}^{(\ell,\,\ell-1)}$.   (4)

We refer to $\widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)}$ as the aligned weights of the adapter from task $\tau_j$, which can be directly added to $W_{\tau_i}^{(\ell,\,\ell-1)}$. We carry out this procedure on all pretrained adapters:

$\widetilde{W}^{(\ell,\,\ell-1)} \leftarrow \frac{1}{n} \sum_{j=1}^{n} \widetilde{W}_{\tau_j}^{(\ell,\,\ell-1)}$.   (5)
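The following is a minimal sketch of applying Eqs. (3)-(5) to a single adapter layer. It assumes the transport matrices T and column marginals beta have already been obtained from an optimal-transport solver over weights or activations (computing them is outside the scope of this sketch); tensor shapes and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch

def align_and_merge(weights, transports, marginals):
    """Apply Eqs. (3)-(5) to one adapter layer.

    weights:    list of n weight matrices W_{tau_j}^{(l, l-1)}, each of shape [d_l, d_{l-1}];
                weights[0] can be the anchor W_{tau_i}, whose transports are identities
                and whose marginals are all-ones vectors.
    transports: list of n pairs (T_prev, T_curr) of transport matrices.
    marginals:  list of n pairs (beta_prev, beta_curr) of column-marginal vectors.
    """
    aligned = []
    for W, (T_prev, T_curr), (b_prev, b_curr) in zip(weights, transports, marginals):
        W_hat = W @ T_prev @ torch.diag(1.0 / b_prev)           # Eq. (3)
        W_tilde = torch.diag(1.0 / b_curr) @ T_curr.T @ W_hat   # Eq. (4)
        aligned.append(W_tilde)
    return torch.stack(aligned).mean(dim=0)                     # Eq. (5)
```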
3 Empirical Evaluation

3.1 Setup

We collect pretrained adapters from AdapterHub (Pfeiffer et al., 2020), which are trained on imdb (Maas et al., 2011), boolq (Clark et al., 2019), scitail (Khot et al., 2018), and winogrande (Sakaguchi et al., 2021). The pretraining tasks cover different NLP tracks, including sentiment analysis, question answering, natural language inference, and commonsense reasoning. Our experiments were conducted on the widely used GLUE benchmark (Wang et al., 2019).

We use Adam (Kingma and Ba, 2015) as the optimizer with $\beta_1, \beta_2 = 0.9, 0.98$. We set the weight decay to 0.1 and grid-search the learning rate and the number of training epochs over {1e-5, 5e-5, 1e-4} and {5, 10}, respectively. The maximum sequence length is 128. Following previous works (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020; He et al., 2023; Zhong et al., 2023), we fine-tune the pretrained language models, e.g., BERT (Devlin et al., 2019), on the downstream training set and report results on the dev set using the last checkpoint.
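For concreteness, the search space described above can be written down as a small sweep. This is a hedged sketch of the grid only; how each configuration is trained and selected is not spelled out here.

```python
from itertools import product

# Fixed settings and grid-searched settings from the setup above.
fixed = {"betas": (0.9, 0.98), "weight_decay": 0.1, "max_seq_length": 128}
learning_rates = [1e-5, 5e-5, 1e-4]
epoch_choices = [5, 10]

configs = [
    {**fixed, "learning_rate": lr, "epochs": ep}
    for lr, ep in product(learning_rates, epoch_choices)
]
# 6 candidate runs per task and shot setting; dev-set results of the
# last checkpoint are reported.
print(len(configs))  # 6
```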
3.2 Results

Table 1: Experimental results of different merging methods on the GLUE benchmark, where we perform merging with the same pretrained adapters for a fair comparison. All tasks are evaluated using accuracy. Average scores over all shots are underlined. The best results are in bold.

Main Results. In Table 1, we carefully compare MerA (with the merging methods above: “Sum.”, “Avg.”, “Wts.”, “Acts.”) to the standard adapter (Houlsby et al., 2019) (“Adapter”) and AdapterFusion (Pfeiffer et al., 2021) (“AF”) on the GLUE benchmark for the pretrained language model BERT (Devlin et al., 2019), where we set the number of training shots to 10, 20, and 30, respectively.

MerA achieves significantly better performance than the vanilla adapter (Houlsby et al., 2019; Pfeiffer et al., 2021) with the same parameter budget. Compared to AdapterFusion (Pfeiffer et al., 2021), our methods significantly reduce the required trainable parameters and improve performance.

MerA with Different Merging Methods. MerA with all merging methods outperforms standard adapter tuning, and the optimal transport methods outperform the two naive methods in all tasks, reflecting the necessity of weight alignment. Notably, “Acts.”-based adapters achieve up to 2.7% average improvement compared to the standard adapter and ...

Table 2: Comparison between MerA and adapter architectures under various fine-tuning strategies. We display the performance of 30-shot tuning and denote head-based tuning, prompt-based tuning, and prompt-based tuning with demonstrations as “head”, “prompt”, and “prompt-demo”, respectively.

...ous fine-tuning strategies. This highlights the effectiveness and versatility of MerA in enhancing performance in various fine-tuning scenarios.

Table 3: Effect of different tracks. We denote a single pretrained adapter as “+Single” and denote MerA with pretrained tracks (e.g., “+QA”). For each track, the same number of pretrained adapters is chosen for MerA.

Method       MRPC (10)     MRPC (20)     MNLI (10)     MNLI (20)
Adapter      65.4          67.0          39.4          40.8
+Single      65.0          67.4          40.6          42.3
+QA          66.9          68.9          39.2          40.8
+NLI         66.4          68.6          44.2 (⇑+4.8)  45.8 (⇑+5.0)
+Sentiment   64.2          66.9          39.8          41.9
+Comsense    65.4          68.2          39.6          41.8
+STS         68.9 (⇑+3.5)  70.1 (⇑+3.1)  40.2          42.2

Effect on Initialization. To analyze the benefits of merging adapters, we conducted a zero-shot experiment to examine the effect of MerA on task initialization. In Figure 3, we plugged MerA into BERT and compared it with BERT models equipped with randomly initialized adapters (“Adapter”) and without any adapters (“Baseline”). Compared to “Adapter” and “Baseline”, MerA ensures a superior initial state for the target tasks and achieves significant accuracy improvements, e.g., 2.4% in MNLI and 0.98% in SST-2. These findings reveal the efficacy of MerA in enhancing the initialization of target tasks.
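A minimal sketch of this kind of zero-shot initialization check is given below. The evaluation loop assumes HF-style batches (dicts with a "labels" key) and a classification model exposing `.logits`; the model builders in the commented usage (`build_bert`, `add_random_adapters`, `add_merged_adapters`) are hypothetical placeholders, since the paper does not specify how adapters are inserted.

```python
import torch

@torch.no_grad()
def zero_shot_accuracy(model, dataloader, device="cpu"):
    """Accuracy of an untrained (zero-shot) configuration on a dev set."""
    model.eval().to(device)
    correct = total = 0
    for batch in dataloader:
        labels = batch.pop("labels").to(device)
        logits = model(**{k: v.to(device) for k, v in batch.items()}).logits
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage, comparing the three initializations discussed above:
# for name, model in {
#     "Baseline": build_bert(),
#     "Adapter":  add_random_adapters(build_bert()),
#     "MerA":     add_merged_adapters(build_bert(), merged_state_dict),
# }.items():
#     print(name, zero_shot_accuracy(model, dev_loader))
```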
Ethics Statement

We take ethical considerations very seriously. This paper focuses on higher model and data efficiency for Adapters. Both the datasets and models used in this paper are publicly available and have been widely adopted by researchers. We ensure that the findings and conclusions of this paper are reported accurately and objectively.

Shwai He, Liang Ding, Daize Dong, Boan Liu, Fuqiang Yu, and Dacheng Tao. 2023. PAD-Net: An efficient framework for dynamic networks. In ACL.

Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. 2022b. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In EMNLP.