
MerA: Merging Pretrained Adapters For Few-Shot Learning

Shwai He¹  Run-Ze Fan³  Liang Ding²∗  Li Shen⁴  Tianyi Zhou¹  Dacheng Tao²
¹University of Maryland, College Park   ²The University of Sydney
³University of Chinese Academy of Sciences   ⁴JD Explore Academy
[email protected], [email protected]
∗ Corresponding author.

arXiv:2308.15982v1 [cs.CL] 30 Aug 2023

Abstract

Adapter tuning, which updates only a few parameters, has become a mainstream method for fine-tuning pretrained language models on downstream tasks. However, it often yields subpar results in few-shot learning. AdapterFusion, which assembles pretrained adapters using composition layers tailored to specific tasks, is a possible solution but significantly increases trainable parameters and deployment costs. Despite this, our preliminary study reveals that even single adapters can outperform AdapterFusion in few-shot learning, urging us to propose Merging Pretrained Adapters (MerA), which efficiently incorporates pretrained adapters into a single model through model fusion. Extensive experiments on two PLMs demonstrate that MerA achieves substantial improvements compared to both single adapters and AdapterFusion. To further enhance the capacity of MerA, we also introduce a simple yet effective technique, referred to as the "same-track" setting, that merges adapters from the same track of pretraining tasks. With the "same-track" setting, we observe even more impressive gains, surpassing the performance of both full fine-tuning and adapter tuning by a substantial margin, e.g., 3.5% in MRPC and 5.0% in MNLI.

Figure 1: Comparison between a single adapter and MerA in sentiment analysis tasks. Left: the dilemma of a single adapter in few-shot learning. Right: the improvement from merging pretrained adapters.

1 Introduction

Pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019) have revolutionized the field of natural language processing, with fine-tuning being a mainstream approach to leveraging their power. However, with the ever-increasing number of parameters in PLMs (Brown et al., 2020), there is a need for parameter-efficient fine-tuning techniques (He et al., 2022a) to reduce training costs. One representative technique is adapter tuning (Houlsby et al., 2019), which updates only a subset of parameters. Despite its advantages, adapter tuning often starts from randomly initialized blocks and may not perform well in scenarios with limited training data, such as few-shot learning (Moosavi et al., 2022; Bansal et al., 2022).

One possible solution is to transfer knowledge from pretrained adapters to target tasks (Chawla et al., 2021; Zhong et al., 2022; Wang et al., 2022). AdapterFusion (Pfeiffer et al., 2021) was proposed to assemble pretrained adapters with composition layers to integrate their knowledge. However, AdapterFusion compromises parameter efficiency (He et al., 2022b), primarily due to the excessive number of trainable parameters in the composition layers. Moreover, deploying parallel pretrained adapters also increases computational costs.

In this work, we explore an efficient approach to leveraging pretrained adapters, raising the following question: Can the current AdapterFusion framework fully exploit the pretrained adapters under few-shot scenarios? If not, how can the pretrained adapters be incorporated more efficiently? To this end, we first conduct a series of experiments to compare the performance of single adapters with that of AdapterFusion under few-shot scenarios,
with results shown in Figure 2. Surprisingly, a single adapter outperforms AdapterFusion with far fewer trainable parameters. This preliminary study prompts us to directly leverage pretrained adapters to extend the potential of single adapters.

Figure 2: Comparison between single adapters and AdapterFusion under 10-shot scenarios.

To achieve this, we propose an approach that merges pretrained adapters into a single one (MerA). On the one hand, the merged adapter does not introduce additional trainable parameters. On the other hand, the knowledge from pretrained adapters enhances downstream performance, as illustrated in Figure 1. We first implement two straightforward methods for merging parameters, summation ("Sum.") and averaging ("Avg."); however, the lack of one-to-one correspondence between the parameters of different models leads to suboptimal performance. Inspired by Solomon et al. (2015) and Singh and Jaggi (2020), we further propose to align adapters' parameters through optimal transport based on weights ("Wts.") and activations ("Acts.").

Extensive few-shot experiments demonstrate that MerA achieves significant improvements compared with adapters, e.g., 2.7% in averaged accuracy. In addition, we find that merging adapters from the same track of tasks further enhances the capacity of MerA. We therefore introduce a simple yet effective technique called the "same-track" setting. With the "same-track" setting, we observe even more impressive gains, surpassing the performance of full fine-tuning and adapter tuning by a substantial margin, e.g., 3.5% in MRPC and 5.0% in MNLI.

2 Methodology

AdapterFusion. Adapters (Houlsby et al., 2019) are bottleneck modules plugged into PLMs, with model dimension d and reduction factor r. In standard adapter tuning, only the adapter layers are trainable, while the other layers are frozen. After tuning, each adapter therefore contains knowledge specific to a single task. AdapterFusion (Pfeiffer et al., 2021) was proposed to leverage knowledge from pretrained adapters, improving performance on downstream tasks and preventing catastrophic forgetting for each adapter.

Motivation. However, resource-constrained scenarios challenge the parameter efficiency of AdapterFusion. For one thing, AdapterFusion requires composition layers with additional trainable parameters, e.g., Query, Key, and Value. The trainable parameters of a composition layer amount to 3d² (ignoring bias terms for the time being), while a single adapter only contains 2d²/r. For another, AdapterFusion has to deploy parallel pretrained adapters for each adapter layer, which multiplies the deployment cost.

Beyond the parameter budget, to check whether AdapterFusion is an optimal choice for improving task-specific performance under data-constrained scenarios, we conduct few-shot experiments comparing single adapters and AdapterFusion. As shown in Figure 2, single adapters consistently outperform AdapterFusion with fewer parameters, which indicates the superiority of tuning a single adapter. Motivated by this preliminary study, we intend to merge multiple adapters into a single model to explore the potential of single adapters.
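To make the parameter accounting above concrete, below is a minimal sketch of a Houlsby-style bottleneck adapter in PyTorch. The class name, the `reduction_factor` argument, and the printed comparison are illustrative assumptions rather than the paper's implementation; with hidden size d and reduction factor r, the two projections contribute roughly 2d²/r weights, versus about 3d² for the Query/Key/Value projections of one AdapterFusion composition layer.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal Houlsby-style adapter sketch: down-project, non-linearity,
    up-project, plus a residual connection (not the official implementation)."""

    def __init__(self, d_model: int, reduction_factor: int = 16):
        super().__init__()
        bottleneck = d_model // reduction_factor
        self.down = nn.Linear(d_model, bottleneck)  # ~d * (d / r) weights
        self.up = nn.Linear(bottleneck, d_model)    # ~(d / r) * d weights
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

d, r = 768, 16
adapter_params = sum(p.numel() for p in BottleneckAdapter(d, r).parameters())
fusion_layer_params = 3 * d * d  # Query/Key/Value of one composition layer (bias ignored)
print(adapter_params, fusion_layer_params)  # roughly 2*d*d/r  vs  3*d*d
```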
Adapter Merging. Due to the limitations of AdapterFusion, we intend to make more efficient use of the knowledge in the pretrained adapters. Integrating the parallel adapters into a single module reduces the excess trainable parameters and improves downstream performance. We first consider two simple methods for merging the weights of adapters trained on different tasks. The first is summation ("Sum."):

    \tilde{W} = \sum_{j=1}^{n} W_{\tau_j},    (1)

where \tau denotes the indices of the tasks and W_{\tau_j} the weights of the adapter trained on task \tau_j. The second is averaging ("Avg."):

    \tilde{W} = \frac{1}{n} \sum_{j=1}^{n} W_{\tau_j}.    (2)
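As a rough illustration of Eq. (1) and Eq. (2), the snippet below merges the state dicts of several pretrained adapters by parameter-wise summation or averaging. It is a sketch that assumes all adapters share an identical architecture and parameter naming; the checkpoint paths in the usage comment are hypothetical.

```python
from typing import Dict, List
import torch

def merge_adapters(state_dicts: List[Dict[str, torch.Tensor]],
                   mode: str = "avg") -> Dict[str, torch.Tensor]:
    """Merge adapters with identical shapes by parameter-wise 'sum' or 'avg'."""
    merged = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts], dim=0)
        merged[name] = stacked.sum(dim=0) if mode == "sum" else stacked.mean(dim=0)
    return merged

# Usage sketch: load adapters trained on different tasks, then merge.
# adapters = [torch.load(p) for p in ["imdb.pt", "boolq.pt", "scitail.pt"]]  # hypothetical paths
# merged = merge_adapters(adapters, mode="avg")
```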
However, the problem with vanilla summation and averaging is the lack of one-to-one correspondence between the parameters of different adapters. In particular, the p-th neuron of the i-th adapter might behave very differently from the p-th neuron of the j-th adapter and instead behave similarly to another neuron. It therefore makes sense to align the neurons first and then merge the adapters. Inspired by Singh and Jaggi (2020) and Zan et al. (2022), we align adapters' parameters via optimal transport based on weights ("Wts.") and activations ("Acts.").

Given adapters trained on tasks \tau_i and \tau_j, we plug them into the language model and denote the \ell-th adapter layer's incoming edges as W_{\tau_i}^{(\ell, \ell-1)} and W_{\tau_j}^{(\ell, \ell-1)}, respectively. We align W_{\tau_j}^{(\ell, \ell-1)} to W_{\tau_i}^{(\ell, \ell-1)} by constructing convex combinations with the previous layer's transport matrix T^{(\ell-1)}, based on weights ("Wts") or activations ("Acts"), normalized appropriately via the inverse of the corresponding column marginals \beta:

    \hat{W}_{\tau_j}^{(\ell, \ell-1)} \leftarrow W_{\tau_j}^{(\ell, \ell-1)} T^{(\ell-1)} \mathrm{diag}(1/\beta^{(\ell-1)}),    (3)

    \tilde{W}_{\tau_j}^{(\ell, \ell-1)} \leftarrow \mathrm{diag}(1/\beta^{(\ell)}) T^{(\ell)\top} \hat{W}_{\tau_j}^{(\ell, \ell-1)}.    (4)

We refer to \tilde{W}_{\tau_j}^{(\ell, \ell-1)} as the aligned weights of the adapter from task \tau_j, which can be directly added to W_{\tau_i}^{(\ell, \ell-1)}. We carry out this procedure on all pretrained adapters:

    \tilde{W}^{(\ell, \ell-1)} \leftarrow \frac{1}{n} \sum_{j=1}^{n} \tilde{W}_{\tau_j}^{(\ell, \ell-1)}.    (5)
3 Empirical Evaluation

3.1 Setup

We collect pretrained adapters from AdapterHub (Pfeiffer et al., 2020) that were trained on imdb (Maas et al., 2011), boolq (Clark et al., 2019), scitail (Khot et al., 2018), and winogrande (Sakaguchi et al., 2021). These pretraining tasks cover different NLP tracks, including sentiment analysis, question answering, natural language inference, and commonsense reasoning. Our experiments were conducted on the widely used GLUE benchmark (Wang et al., 2019).

We use Adam (Kingma and Ba, 2015) as the optimizer with β1, β2 = 0.9, 0.98. We set the weight decay to 0.1 and grid-search the learning rate and the number of training epochs over {1e-5, 5e-5, 1e-4} and {5, 10}, respectively. The maximum sequence length is 128. Following previous work (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020; He et al., 2023; Zhong et al., 2023), we fine-tune the pretrained language models, e.g., BERT (Devlin et al., 2019), on the downstream training set and report results on the dev set using the last checkpoint.
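For reference, the setup above can be written down as a small grid-search sketch; the dictionary layout and the `train_and_evaluate` entry point are hypothetical, and only the hyperparameter values are taken from the text.

```python
from itertools import product

search_space = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "epochs": [5, 10],
}
fixed = {"optimizer": "Adam", "betas": (0.9, 0.98), "weight_decay": 0.1, "max_length": 128}

# Grid search over the 3 x 2 = 6 configurations described in the setup.
for lr, epochs in product(search_space["learning_rate"], search_space["epochs"]):
    config = {**fixed, "learning_rate": lr, "epochs": epochs}
    # train_and_evaluate(config)  # hypothetical training entry point
```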
3.2 Results

Table 1: Experimental results of different merging methods on the GLUE benchmark, where we perform merging with the same pretrained adapters for a fair comparison. All tasks are evaluated using accuracy; each cell block reports 10 / 20 / 30 shots and their average.

Method          #Param. (Trained)   MRPC (10/20/30/Avg.)    SST-2 (10/20/30/Avg.)   MNLI (10/20/30/Avg.)
Fine-Tune       100%                66.2/67.6/69.1/67.6     60.7/62.3/63.9/62.3     39.4/41.1/42.2/40.9
AdapterFusion   18%                 65.2/66.4/67.5/66.4     60.1/61.5/63.6/61.7     39.3/40.8/41.7/40.6
Adapter         0.8%                65.4/66.8/67.7/66.6     60.2/61.1/64.1/61.8     39.1/41.0/42.0/40.7
  w/ Sum        0.8%                67.4/67.9/68.6/68.0     61.2/62.1/64.2/62.5     39.2/41.4/42.0/40.9
  w/ Avg        0.8%                67.2/67.7/68.4/67.8     60.6/62.6/63.7/62.3     39.4/41.7/42.1/41.1
  w/ Wts        0.8%                67.4/68.4/69.6/68.5     60.7/63.3/65.4/63.1     39.9/41.8/42.4/41.4
  w/ Acts       0.8%                67.6/68.8/70.3/68.9     61.6/63.2/65.6/63.5     39.7/42.1/42.6/41.5

Main Results. In Table 1, we compare MerA (with the merging methods above: "Sum.", "Avg.", "Wts", "Acts") to the standard adapter (Houlsby et al., 2019) ("Adapter") and AdapterFusion (Pfeiffer et al., 2021) ("AF") on the GLUE benchmark with the pretrained language model BERT (Devlin et al., 2019), setting the number of training shots to 10, 20, and 30, respectively.

MerA achieves significantly better performance than the vanilla adapter (Houlsby et al., 2019; Pfeiffer et al., 2021) with the same parameter budget. Compared to AdapterFusion (Pfeiffer et al., 2021), our methods significantly reduce the required trainable parameters and improve performance.
MerA with Different Merging Methods. MerA with all merging methods outperforms standard adapter tuning, and the optimal-transport methods outperform the two naive methods on all tasks, reflecting the necessity of weight alignment. Notably, "Acts"-based MerA achieves up to a 2.7% average improvement over the standard adapter and even beats full fine-tuning, so we set "Acts" as the default setting in the following experiments.

Table 2: Comparison between MerA and adapter architectures under various fine-tuning strategies. We display the performance of 30-shot tuning and denote head-based tuning, prompt-based tuning, and prompt-based tuning with demonstrations as "head", "prompt", and "+demos", respectively.

Method          #Param. (Trained)   BERT-Base (head/prompt/+demos/Avg.)   RoBERTa-Base (head/prompt/+demos/Avg.)
Fine-Tune       100%                63.9/71.8/72.6/69.4                   71.1/79.4/79.9/76.8
AdapterFusion   18%                 63.6/71.2/71.7/68.8                   70.9/78.9/79.4/76.1
Adapter         0.8%                64.1/71.5/72.1/69.2                   70.5/78.5/79.3/76.4
MerA            0.8%                65.7/73.6/73.9/71.1                   73.3/81.3/82.1/78.9

Prompt-Based Finetuning. In addition to head-based fine-tuning, we also investigate the applicability of MerA in prompt-based fine-tuning, which leverages informative and indicative prompts to enhance performance (Gao et al., 2021). In Table 2, we consider two strategies: prompt-based fine-tuning and prompt-based fine-tuning with demonstrations. The experimental results demonstrate consistent improvements from MerA across the different fine-tuning strategies, highlighting the effectiveness and versatility of MerA in various fine-tuning scenarios.
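To make the two prompt-based strategies concrete, here is a hedged sketch of a cloze-style template with an optional demonstration, in the spirit of Gao et al. (2021); the exact template and label words are assumptions, not the templates used in the paper.

```python
from typing import Optional, Tuple

def build_prompt(sentence: str, demo: Optional[Tuple[str, str]] = None) -> str:
    """Cloze-style prompt for a sentiment task; optionally prepend one labeled demonstration."""
    prompt = ""
    if demo is not None:
        demo_sentence, demo_label_word = demo
        prompt += f"{demo_sentence} It was {demo_label_word}. "
    return prompt + f"{sentence} It was [MASK]."

# "prompt" setting: no demonstration; "+demos" setting: one labeled example prepended.
print(build_prompt("A gripping, beautifully shot film."))
print(build_prompt("A gripping, beautifully shot film.",
                   demo=("The plot went nowhere.", "terrible")))
```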

Effect on Initialization. To analyze the benefits of merging adapters, we conducted a zero-shot experiment to examine the effect of MerA on task initialization. In Figure 3, we plugged MerA into BERT and compared it with BERT models equipped with randomly initialized adapters ("Adapters") and without any adapters ("Baseline"). Compared to "Adapters" and "Baseline", MerA ensures a superior initial state for the target tasks and achieves significant accuracy improvements, e.g., 2.4% in MNLI and 0.98% in SST-2. These findings reveal the efficacy of MerA in enhancing the initialization of target tasks.

Figure 3: Effect of MerA on initialization. "Baseline" represents vanilla BERT; "Adapters" denotes BERT equipped with randomly initialized adapters; "MerA" denotes BERT equipped with MerA. (a) MNLI; (b) SST-2.

Augmenting MerA with the Same-Track Setting. One potential approach to further increase the gain of MerA at initialization is to merge adapters trained on the same NLP track, because knowledge is shared within a specific track. We therefore investigate the roles of different pretraining tracks. We consider pretrained adapters from five different tracks: question answering (QA), commonsense reasoning (Comsense), natural language inference (NLI), semantic textual similarity (STS), and sentiment analysis (Sentiment). MRPC and MNLI belong to semantic textual similarity and natural language inference, respectively. We merge adapters trained on the same track and validate MerA on MNLI and MRPC. We also consider directly fine-tuning a single pretrained adapter ("+Single"), experimenting with multiple adapters within the same tracks and reporting the best results.
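Before turning to the results in Table 3, a minimal sketch of how the "same-track" setting can be realized: only pretrained adapters whose track matches the downstream task are selected for merging. The track registry and checkpoint names below are hypothetical, and `merge_adapters` refers to the earlier merging sketch.

```python
# Hypothetical registry mapping pretrained adapters to their NLP track.
ADAPTER_TRACKS = {
    "scitail": "NLI",
    "boolq": "QA",
    "imdb": "Sentiment",
    "winogrande": "Comsense",
}

def same_track_adapters(downstream_track: str):
    """Return the names of pretrained adapters on the same track as the target task."""
    return [name for name, track in ADAPTER_TRACKS.items() if track == downstream_track]

# e.g., for MNLI (an NLI task) only NLI adapters would be loaded and merged:
selected = same_track_adapters("NLI")
# state_dicts = [torch.load(f"{name}.pt") for name in selected]  # hypothetical checkpoints
# merged = merge_adapters(state_dicts, mode="avg")
```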
Table 3: Effect of different tracks. We denote a single pretrained adapter as "+Single" and MerA merged from a given pretraining track as, e.g., "+QA". For each track, the same number of pretrained adapters is chosen for MerA. Cells report 10 / 20 shots; gains over the standard adapter are given in parentheses where notable.

Method        MRPC (10/20)              MNLI (10/20)
Adapter       65.4/67.0                 39.4/40.8
+Single       65.0/67.4                 40.6/42.3
+QA           66.9/68.9                 39.2/40.8
+NLI          66.4/68.6                 44.2 (+4.8)/45.8 (+5.0)
+Sentiment    64.2/66.9                 39.8/41.9
+Comsense     65.4/68.2                 39.6/41.8
+STS          68.9 (+3.5)/70.1 (+3.1)   40.2/42.2
Table 3 compares MerA merged from different tracks, where we can see significant improvements when the pretraining tasks are on the same track as the downstream task. In this case, MerA outperforms the standard adapter by 3.5% in MRPC and 5.0% in MNLI. However, when the pretrained tasks are on a different track, MerA may suffer a performance drop. These findings underline the importance of the knowledge shared within one track.

4 Conclusion

In this work, we systematically investigate adapters and AdapterFusion in few-shot scenarios. Based on our findings, we propose a plug-in strategy, i.e., MerA, for existing adapters. Our empirical results indicate the potential for MerA to become a gold-standard efficient few-shot learning strategy for the NLP community.

5 Limitations

Despite the progress we made, there still exist limitations in our work. On the one hand, we only investigated some classic merging methods and found that "Acts" performs the best under the selected criteria. However, other advanced merging methods may exist that could further improve performance, which deserves exploration in future work. On the other hand, since we only consider BERT and RoBERTa on a limited set of tasks, it would be valuable to consider other architecture families (e.g., XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2020)) and tasks (e.g., machine translation).

Ethics Statement

We take ethical considerations very seriously. This paper focuses on higher model and data efficiency for adapters. Both the datasets and models used in this paper are publicly available and have been widely adopted by researchers. We ensure that the findings and conclusions of this paper are reported accurately and objectively.

References

Trapit Bansal, Salaheddin Alzubi, Tong Wang, Jay-Yoon Lee, and Andrew McCallum. 2022. Meta-adapters: Parameter efficient few-shot fine-tuning through meta-learning. In AutoML.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.

Akshay Chawla, Hongxu Yin, Pavlo Molchanov, and Jose Alvarez. 2021. Data-free knowledge distillation for object detection. In WACV.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In ACL.

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022a. Towards a unified view of parameter-efficient transfer learning. In ICLR.

Shwai He, Liang Ding, Daize Dong, Boan Liu, Fuqiang Yu, and Dacheng Tao. 2023. PAD-Net: An efficient framework for dynamic networks. In ACL.

Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. 2022b. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In EMNLP.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In ICML.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In ICLR.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and others. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL.

Nafise Sadat Moosavi, Quentin Delfosse, Kristian Kersting, and Iryna Gurevych. 2022. Adaptable adapters. In NAACL.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In EACL.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In EMNLP Demonstrations.

Jason Phang, Thibault Févry, and Samuel R Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Commun. ACM.

Sidak Pal Singh and Martin Jaggi. 2020. Model fusion via optimal transport. In NeurIPS.

Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas J. Guibas. 2015. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Zhenyi Wang, Xiaoyang Wang, Li Shen, Qiuling Suo, Kaiqiang Song, Dong Yu, Yan Shen, and Mingchen Gao. 2022. Meta-learning without data via Wasserstein distributionally-robust model fusion. In UAI.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

Changtong Zan, Liang Ding, Li Shen, Yu Cao, Weifeng Liu, and Dacheng Tao. 2022. On the complementarity between pre-training and random-initialization for resource-rich machine translation. In COLING.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2022. PANDA: Prompt transfer meets knowledge distillation for efficient model adaptation. arXiv preprint.

Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, and Dacheng Tao. 2023. Revisiting token dropping strategy in efficient BERT pretraining. In ACL.
