MAD-X: An Adapter-Based Framework For Multi-Task Cross-Lingual Transfer
Table 2: NER F1 scores averaged over all 16 target languages when transferring from each source language (i.e.
the columns are source languages). The vertical dashed line distinguishes between languages seen in multilingual
pretraining and the unseen ones (see also Table 1).
for all methods across each single language pair, as well as a comparison of methods on the most common setting with English as source language.

In general, we observe that XLM-R performance is indeed lowest for unseen languages (the right half of the table after the vertical dashed line). XLM-RBase MLM-SRC performs worse than XLM-R, which indicates that source-language fine-tuning is not useful for cross-lingual transfer in general.5 On the other hand, XLM-RBase MLM-TRG is a stronger transfer method than XLM-R on average, yielding gains in 9/16 target languages. However, its gains seem to vanish for low-resource languages. Further, there is another disadvantage, outlined in §3: XLM-RBase MLM-TRG requires fine-tuning the full large pretrained model separately for each target language in consideration, which can be prohibitively expensive.

5 However, there are some examples (e.g., JA, TK) where it does yield slight gains over the standard XLM-R transfer.

MAD-X without language and invertible adapters performs on par with XLM-R for almost all languages present in the pretraining data (left half of the table). This mirrors findings in the monolingual setting where task adapters have been observed to achieve performance similar to regular fine-tuning while being more parameter-efficient (Houlsby et al., 2019). However, looking at unseen languages, the performance of MAD-X that only uses task adapters deteriorates significantly compared to XLM-R. This shows that task adapters alone are not expressive enough to bridge the discrepancy when adapting to an unseen language.

Adding language adapters to MAD-X improves its performance across the board, and their usefulness is especially pronounced for low-resource languages. Language adapters help capture the characteristics of the target language and consequently provide boosts for unseen languages. Even for high-resource languages, the addition of language-specific parameters yields substantial improvements. Finally, invertible adapters provide further gains and generally outperform only using task and language adapters: for instance, we observe gains with MAD-X over MAD-X – INV on 13/16 target languages. Overall, the full MAD-X framework improves upon XLM-R by more than 5 F1 points on average.

To demonstrate that our framework is model-agnostic, we also employ two other strong multilingual models, XLM-RLarge and mBERT, as foundation for MAD-X and show the results in Table 2. MAD-X shows consistent improvements even over stronger base pretrained models.

For a more fine-grained impression of the performance of MAD-X in different languages, we show its relative performance against XLM-R in the standard setting in Figure 3. We observe the largest differences in performance when transferring from high-resource to low-resource and unseen languages (top-right quadrant of Figure 3), which is arguably the most natural setup for cross-lingual transfer. In particular, we observe strong gains when transferring from Arabic, whose script might not be well represented in XLM-R’s vocabulary. We also detect strong performance in the in-language monolingual setting (diagonal) for the subset of low-resource languages. This indicates that MAD-X may help bridge the perceived weakness of multilingual versus monolingual models. Finally, MAD-X performs competitively even when the target language is high-resource.6

6 In the appendix, we also plot relative performance of the full MAD-X method (with all three adapter types) versus XLM-RBase MLM-TRG across all language pairs. The scores lead to similar conclusions as before: the largest benefits of MAD-X are observed for the set of low-resource target languages (i.e., the right half of the heatmap). The scores also again confirm that the proposed XLM-RBase MLM-TRG transfer baseline is more competitive than the standard XLM-R transfer across a substantial number of language pairs.

Source Language
sw -0.5 0.7
is -1.2 2.8 6.3 -3.4 4.8 2.3 1.8 -2.3 10.0 16.4 6.7 14.9 19.4 18.5 16.0 4.9
my -7.5 -3.2 -5.3 -9.2 3.9 -5.4 -3.2 -0.6 -3.8 11.5 -12.2 4.8 3.2 3.9 3.4 -2.5
qu -2.9 3.7 7.5 -1.4 -0.9 1.6 4.5 10.9 5.0 8.8 -14.1 20.3 15.9 8.2 8.8 7.6
cdo 6.9 2.4 3.6 4.8 9.6 0.9 13.3 19.5 3.1 12.1 -5.8 25.9 -11.8 6.5 6.3 0.2
ilo 1.6 -2.3 -5.3 12.5 9.7 3.3 10.8 7.6 0.8 6.3 6.5 10.5 7.7 5.8 -0.1 5.1
xmf -4.5 -1.7 -4.0 -12.3 -0.4 -7.7 1.8 1.9 3.2 18.9 -11.3 4.8 -3.4 3.0 2.4 -1.5
mi -8.3 0.5 0.2 -0.3 3.5 -4.1 -4.7 16.1 -6.1 4.7 -3.9 15.5 3.3 1.6 -5.8 -10.1
mhr -11.3 -3.9 -4.2 -6.1 2.5 -8.9 0.4 4.5 -0.8 13.0 -20.2 13.6 8.9 14.5 5.2 -7.4
tk 5.2 1.6 1.1 12.8 14.2 4.8 17.2 17.5 7.6 19.1 -1.7 24.5 14.4 21.6 13.7 7.8
gn -0.1 -1.3 -3.9 -5.0 -0.3 -9.5 6.1 -8.0 -11.2 14.4 -15.1 5.6 -3.0 5.8 2.6 9.6
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 3: Relative F1 improvement of MAD-XBase over XLM-RBase in cross-lingual NER transfer.

Causal Commonsense Reasoning. We show results on transferring from English to each target language on XCOPA in Table 3. For brevity, we only show the results of the best fine-tuning setting from Ponti et al. (2020a): fine-tuning first on SIQA (Sap et al., 2019) and then on the English COPA training set; the other settings are reported in the appendix. Target language adaptation outperforms XLM-RBase while MAD-XBase achieves the best scores. It shows gains in particular for the two unseen languages, Haitian Creole (ht) and Quechua (qu). Performance on the other languages is also generally competitive or better.

Model en et ht id it qu sw ta th tr vi zh avg
XLM-RBase 66.8 58.0 51.4 65.0 60.2 51.2 52.0 58.4 62.0 56.6 65.6 68.8 59.7
XLM-RBase MLM-TRG 66.8 59.4 50.0 71.0 61.6 46.0 58.8 60.0 63.2 62.2 67.6 67.4 61.2
MAD-XBase 68.3 61.3 53.7 65.8 63.0 52.5 56.3 61.9 61.8 60.3 66.1 67.6 61.5
Table 3: Accuracy scores of all models on the XCOPA test sets when transferring from English. Models are first fine-tuned on SIQA and then on the COPA training set.

Question Answering. The results on XQuAD when transferring from English to each target language are provided in Table 4. The main finding is that MAD-X achieves similar performance to the XLM-R baseline. As before, invertible adapters generally improve performance and target language adaptation improves upon the baseline setting. We note that all languages included in XQuAD can be considered high-resource, with more than 100k Wikipedia articles each (cf. Wikipedia sizes of NER languages in Table 1). The corresponding setting can be found in the top-left quadrant in Figure 3 where relative differences are comparable.

Model en ar de el es hi ru th tr vi zh avg
XLM-RBase 83.6 / 72.1 66.8 / 49.1 74.4 / 60.1 73.0 / 55.7 76.4 / 58.3 68.2 / 51.7 74.3 / 58.1 66.5 / 56.7 68.3 / 52.8 73.7 / 53.8 51.3 / 42.0 70.6 / 55.5
XLM-RBase MLM-TRG 84.7 / 72.6 67.0 / 49.2 73.7 / 58.8 73.2 / 55.7 76.6 / 58.3 69.8 / 53.6 74.3 / 57.9 67.0 / 55.8 68.6 / 53.0 75.5 / 54.9 52.2 / 43.1 71.1 / 55.7
MAD-XBase – INV 83.3 / 72.1 64.0 / 47.1 72.0 / 55.8 71.0 / 52.9 74.6 / 55.5 67.3 / 51.0 72.1 / 55.1 64.1 / 51.8 66.2 / 49.6 73.0 / 53.6 50.9 / 40.6 67.0 / 53.2
MAD-XBase 83.5 / 72.6 65.5 / 48.2 72.9 / 56.0 72.9 / 54.6 75.9 / 56.9 68.2 / 51.3 73.1 / 56.7 67.8 / 55.9 67.0 / 49.8 73.7 / 53.3 52.7 / 42.8 70.3 / 54.4
Table 4: F1 / EM scores on XQuAD with English as the source language for each target language.

These and XCOPA results demonstrate that, while MAD-X excels at transfer to unseen and low-resource languages, it achieves competitive performance even for high-resource languages and on more challenging tasks. These evaluations also hint at the modularity of the adapter-based MAD-X approach, which holds promise of quick adaptation to more tasks: we use exactly the same language-specific adapters in NER, CCR, and QA for languages such as English and Mandarin Chinese that appear in all three evaluation language samples.

7 Further Analysis

Impact of Invertible Adapters. We also analyse the relative performance difference of MAD-X with and without invertible adapters for each source language–target language pair on the NER data set (see Section D in the appendix). Invertible adapters improve performance for many transfer pairs, and particularly when transferring to low-resource languages. Performance is only consistently lower with a single low-resource language as source (Maori), likely due to variation in the data.

[Figure 4 plot: F1 (y-axis) against number of iterations (x-axis, 0 to 100k); legend entries for Language (qu, cdo, ilo, xmf, mi, mhr, tk, gn) and Epochs (25, 50, 100).]
Figure 4: Cross-lingual NER performance of MAD-X transferring from English to the target languages with invertible and language adapters trained on target language data for different numbers of iterations. Shaded regions denote variance in F1 scores across 5 runs.

Model                    + Params   % Model
MAD-XBase                8.25M      3.05
MAD-XBase – INV          7.96M      2.94
MAD-XBase – LAD – INV    0.88M      0.32
Table 5: Number of parameters added to XLM-R Base, and as a fraction of its parameter budget (270M).

Sample Efficiency. The main adaptation bottleneck of MAD-X is training language adapters and invertible adapters. However, due to the modularity of MAD-X, once trained, these adapters have an advantage of being directly reusable (i.e., “plug-and-playable”) across different tasks (see the discussion in §6). To estimate the sample efficiency of adapter training, we measure NER performance on
several low-resource target languages (when transferring from English as the source) conditioned on the number of training iterations. The results are given in Figure 4. They reveal that we can achieve strong performance for the low-resource languages already at 20k training iterations, and longer training offers a modest increase in performance.

Moreover, in Table 5 we present the number of parameters added to the original XLM-R Base model per language for each MAD-X variant. The full MAD-X model for NER receives an additional set of 8.25M adapter parameters for every language, which makes up only 3.05% of the original model.

Linguistics, ACL 2020, Virtual Conference, July 6-8, 2020, pages 4623–4637.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1538–1548.

Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual Alignment of Contextual Word Representations. In 8th International Conference on Learning Representations, ICLR 2020, Virtual Conference, April 26 - May 1, 2020.
Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020b. Adapterhub: A framework for adapting transformers. In Proceedings of the 2020 Conference on

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011.

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2020. AdapterDrop: On the Efficiency of Adapters in Transformers. arXiv preprint.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint.

Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 5986–5995.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language Adaptation for Truly Universal Dependency Parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Virtual Conference.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint.

Zirui Wang, Jiateng Xie, Ruochen Xu, Yiming Yang, Graham Neubig, and Jaime Carbonell. 2020. Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework. In 8th International Conference on Learning Representations, ICLR 2020, Virtual Conference, April 26 - May 1, 2020.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. Ccnet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4003–4012.

Georg Wiese, Dirk Weissenborn, and Mariana L. Neves. 2017. Neural Domain Adaptation for Biomedical Question Answering. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 281–289.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), EMNLP 2020, Virtual Conference, 2020.

Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Conference of the Association for Computational Linguistics, ACL 2020, Virtual Conference, July 6-8, 2020, pages 6022–6034.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 833–844.

A Evaluation data

• Named Entity Recognition (NER). Data: WikiANN (Rahimi et al., 2019). Available online at: www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN.

• Causal Commonsense Reasoning (CCR). Data: XCOPA (Ponti et al., 2020a). Available online at: github.com/cambridgeltl/xcopa

• Question Answering (QA). Data: XQuAD (Artetxe et al., 2020). Available online at: github.com/deepmind/xquad

B NER zero-shot results from English

We show the F1 scores when transferring from English to the other languages averaged over five runs in Table 6.

C NER results per language pair

We show the F1 scores on the NER dataset across all combinations of source and target language for all of our comparison methods in Figures 5 (XLM-RBase), 6 (XLM-RBase MLM-SRC), 7 (XLM-RBase MLM-TRG), 8 (MAD-XBase –
Model en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn avg
mBERT 84.8 26.7 38.5 38.7 57.8 66.0 65.7 42.9 54.9 14.20 63.5 31.1 21.8 46.0 47.2 45.4 44.0
XLM-R 83.0 15.2 19.6 41.3 56.1 63.5 67.2 46.9 58.3 20.47 61.3 32.2 15.9 41.8 43.4 41.0 41.6
XLM-RBase MLM-SRC 84.2 8.45 11.0 27.3 44.8 57.9 59.0 35.6 52.5 21.4 60.3 22.7 22.7 38.1 44.0 41.7 36.5
XLM-RBase MLM-TRG 84.2 9.30 15.5 44.5 50.2 77.7 71.7 55.5 68.7 47.6 84.7 60.3 43.6 56.3 56.4 50.6 52.8
MAD-X – LAD – INV 82.0 15.6 20.3 41.0 54.4 66.4 67.8 48.8 57.8 16.9 59.9 36.9 14.3 44.3 41.9 42.9 41.9
MAD-X – INV 82.2 16.8 20.7 36.9 54.1 68.7 71.5 50.0 59.6 39.2 69.9 54.9 48.3 58.1 53.1 52.8 50.3
MAD-X 82.3 19.0 20.5 41.8 55.7 73.8 74.5 51.9 66.1 36.5 73.1 57.6 51.0 62.1 59.7 55.1 53.2
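The avg column in the table above appears to be the mean over the 15 non-English target languages, i.e. the zero-shot targets, rather than over all 16 columns. The following minimal sketch checks this reading against the XLM-R and MAD-X rows. The function and variable names are illustrative assumptions and are not taken from the authors' evaluation code.

```python
# Illustrative check, not the authors' evaluation code.
LANGS = ["en", "ja", "zh", "ar", "jv", "sw", "is", "my",
         "qu", "cdo", "ilo", "xmf", "mi", "mhr", "tk", "gn"]

# F1 scores copied from the XLM-R and MAD-X rows of the table above.
ROWS = {
    "XLM-R": [83.0, 15.2, 19.6, 41.3, 56.1, 63.5, 67.2, 46.9,
              58.3, 20.47, 61.3, 32.2, 15.9, 41.8, 43.4, 41.0],
    "MAD-X": [82.3, 19.0, 20.5, 41.8, 55.7, 73.8, 74.5, 51.9,
              66.1, 36.5, 73.1, 57.6, 51.0, 62.1, 59.7, 55.1],
}


def zero_shot_average(scores, source="en"):
    """Mean F1 over all target languages except the source language."""
    targets = [s for lang, s in zip(LANGS, scores) if lang != source]
    return sum(targets) / len(targets)


averages = {name: round(zero_shot_average(scores), 1)
            for name, scores in ROWS.items()}
print(averages)  # {'XLM-R': 41.6, 'MAD-X': 53.2}
```

Both values match the reported avg entries, which would not be the case if the English column were included in the mean.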
sw
is 51.7 9.5 14.6 26.0 47.5 53.9 81.8 40.6 50.1 24.1 40.8 34.4 37.8 32.6 45.2 46.5
my 13.3 4.2 7.5 10.3 12.1 12.6 23.9 60.8 10.6 5.6 15.0 15.2 14.6 18.5 17.9 8.1
qu 24.2 0.3 0.9 20.5 26.3 24.3 21.6 16.3 53.6 12.5 35.9 11.3 17.8 19.4 23.2 18.8
cdo 9.7 0.5 1.4 4.3 13.7 15.0 17.9 4.4 9.5 36.2 5.4 4.2 25.0 13.6 15.5 17.3
ilo 17.2 4.5 5.6 4.2 14.6 21.4 12.0 10.3 16.2 10.5 62.9 9.6 22.1 14.8 20.8 8.5
xmf 16.1 1.2 2.8 11.8 19.8 13.7 25.5 18.3 17.2 12.2 7.3 50.8 25.4 19.0 16.0 16.8
mi 10.3 0.9 1.9 4.2 8.1 13.9 11.6 1.8 14.2 15.6 6.5 2.3 83.7 10.8 17.2 12.2
mhr 16.0 5.8 8.7 13.7 15.9 16.5 31.4 23.1 14.9 18.2 11.6 24.6 8.7 57.1 23.5 25.1
tk 26.5 1.3 3.0 12.0 26.6 29.6 30.4 15.3 26.1 14.6 24.2 14.3 20.3 18.2 56.5 29.6
gn 27.2 0.9 2.5 13.7 26.6 26.2 33.2 18.6 29.2 18.5 19.6 15.1 25.3 20.8 35.2 50.6
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 5: Mean F1 scores of XLM-RBase in the standard setting (XLM-RBase) for cross-lingual transfer on NER.
en 84.0 9.4 11.2 24.6 42.8 51.9 49.3 21.5 54.6 14.8 62.5 19.3 15.8 27.6 39.9 37.6
ja 44.6 72.3 51.5 16.8 32.0 30.8 31.2 43.8 40.2 9.8 34.0 23.9 13.9 28.6 39.3 26.7
zh 41.6 46.8 81.9 12.8 22.8 32.2 32.8 31.9 41.0 21.3 40.3 21.9 9.0 26.2 41.4 31.4
ar 29.2 3.6 6.1 90.4 11.3 17.1 17.7 6.4 21.2 1.3 12.7 16.5 3.3 8.4 20.7 8.6
jv 48.3 0.2 0.5 33.0 71.5 46.6 52.2 22.7 33.9 20.1 42.2 18.1 34.8 29.7 39.2 41.9
sw 55.2 5.7 5.1 30.5 41.0 88.4 51.9 19.6 44.0 16.7 42.8 23.3 30.2 25.5 37.5 47.0
is 55.4 9.6 12.2 21.4 50.1 53.6 86.7 21.9 56.2 23.5 43.9 25.3 30.4 30.3 49.1 50.3
my 20.4 0.7 1.8 16.4 21.8 18.4 32.2 71.3 16.9 6.8 11.6 10.0 27.4 25.4 15.6 16.3
qu 35.5 0.4 1.3 27.4 26.7 34.4 34.9 19.3 70.7 16.1 28.7 20.7 15.2 22.3 33.8 37.8
cdo 22.0 0.7 2.5 6.2 14.1 15.6 27.7 3.7 10.9 66.9 3.9 8.2 25.4 11.9 20.9 26.8
ilo 36.2 1.7 1.9 17.0 23.1 40.5 27.5 16.7 31.4 14.4 78.2 11.0 15.9 22.0 29.8 31.5
xmf 23.9 0.1 0.5 15.9 25.5 21.8 37.1 19.6 24.2 10.1 8.5 74.9 24.8 24.1 18.1 28.6
mi 17.6 0.4 1.1 9.0 8.9 18.8 18.3 2.4 15.8 9.4 13.2 5.4 85.6 7.9 19.6 22.1
mhr 25.0 1.6 1.7 12.1 13.5 18.5 29.2 13.6 23.9 13.8 15.5 17.5 4.3 71.4 24.0 25.2
tk 39.5 2.3 3.0 23.6 28.2 36.0 34.8 21.0 29.7 16.9 31.1 19.0 19.9 23.2 70.8 43.0
gn 33.7 0.1 0.2 14.5 27.5 27.6 31.0 6.7 36.0 17.7 21.8 10.5 14.8 15.8 40.4 62.6
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 6: Mean F1 scores of XLM-RBase with MLM fine-tuning on source language data (XLM-RBase MLM-SRC) for cross-lingual transfer on NER.
en 84.2 9.3 15.5 44.5 50.2 77.8 71.8 55.6 68.7 47.6 84.8 60.3 43.7 56.3 56.5 50.7
ja 47.5 67.6 61.5 26.0 46.6 44.3 62.7 54.9 47.9 38.5 44.7 47.9 15.4 42.4 60.9 49.8
zh 48.5 55.8 81.9 23.1 40.9 53.8 61.2 58.5 47.2 48.9 54.5 49.1 77.5 46.4 70.8 56.6
ar 46.6 10.6 14.9 90.4 60.5 67.1 68.6 57.9 56.0 35.6 62.3 55.8 40.0 40.3 54.0 56.5
jv 47.7 0.1 0.1 47.5 70.6 58.6 46.5 34.6 56.9 27.4 47.1 49.8 22.6 35.2 31.1 43.3
sw 54.9 9.7 18.2 48.8 46.9 88.4 66.6 50.0 61.3 36.7 75.2 52.4 25.6 38.8 38.7 57.7
is 59.7 14.4 17.9 53.5 53.8 56.6 87.4 54.7 72.0 47.5 50.0 59.0 58.5 49.0 56.9 56.7
my 25.6 5.2 8.3 8.1 19.5 22.5 41.3 70.3 7.3 21.1 5.9 25.8 0.0 19.5 30.5 9.3
qu 39.7 0.3 0.2 35.2 38.0 35.2 45.2 26.8 70.7 18.8 23.3 25.9 14.9 16.9 32.9 44.6
cdo 15.4 0.0 0.1 4.2 12.1 25.3 34.8 21.1 10.3 67.0 2.8 5.2 13.7 11.1 24.8 17.9
ilo 36.9 1.4 7.0 20.8 26.4 46.6 32.3 29.3 24.5 12.8 85.3 36.0 6.6 14.8 25.7 31.3
xmf 30.0 0.8 4.4 28.1 19.2 45.4 45.8 30.4 7.7 24.0 8.4 74.9 31.4 11.1 15.1 18.7
mi 17.9 0.2 0.1 6.9 11.6 12.1 23.7 10.2 8.1 24.8 2.8 5.0 88.1 5.8 23.9 13.5
mhr 22.3 0.6 2.0 20.6 25.3 21.9 52.8 21.1 28.7 23.6 14.2 30.0 28.6 70.7 40.6 19.6
tk 30.2 2.0 4.6 17.2 33.0 31.8 33.3 7.1 31.6 32.0 19.0 34.6 15.2 23.7 70.8 37.2
gn 35.9 0.1 0.6 27.5 38.6 29.7 52.0 17.8 31.1 36.6 10.8 25.4 24.7 23.6 36.1 66.2
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 7: Mean F1 scores of XLM-RBase with MLM fine-tuning on target language data (XLM-RBase MLM-TRG) for cross-lingual transfer on NER.
en 82.0 15.6 20.4 41.0 54.5 66.4 67.8 48.8 57.8 16.9 59.9 36.9 14.3 44.3 41.9 42.9
ja 41.7 64.8 55.3 25.7 34.4 33.9 50.6 52.8 41.6 15.5 39.8 32.1 15.7 27.1 43.7 42.7
zh 43.5 47.2 75.1 24.7 37.8 37.1 53.0 48.5 41.5 19.5 44.5 32.5 18.3 29.4 51.5 46.2
ar 46.3 10.8 20.6 87.9 52.4 44.0 65.3 55.3 54.8 12.8 43.7 52.9 16.0 29.5 46.3 46.3
jv 40.7 0.7 1.8 31.7 59.0 39.9 51.6 29.7 37.7 19.3 31.0 32.0 32.2 31.7 35.3 43.8
sw 56.5 11.6 18.6 38.2 49.3 87.6 62.8 37.9 45.1 21.2 55.5 38.7 32.6 39.4 47.6 46.3
is 56.9 18.1 25.3 48.4 56.6 60.6 83.6 52.0 59.5 27.7 47.3 57.8 41.3 40.9 51.0 50.8
my 20.4 3.2 9.4 21.3 20.7 20.8 37.6 62.4 21.0 16.4 24.2 31.1 13.3 25.1 31.6 23.2
qu 31.1 0.3 1.3 23.0 28.6 19.9 26.2 19.2 56.6 17.3 26.6 15.2 10.9 20.4 25.6 29.7
cdo 10.8 0.6 0.8 1.7 6.8 12.5 12.3 4.0 10.2 26.9 9.5 3.4 21.2 14.1 17.2 18.0
ilo 27.5 6.0 8.5 14.7 19.1 34.2 22.0 16.4 32.4 13.5 67.5 19.6 21.9 31.2 26.9 21.5
xmf 30.6 2.9 7.9 22.8 26.7 27.4 38.4 31.6 34.4 14.3 21.7 58.3 27.1 31.5 31.4 38.6
mi 10.1 0.2 2.0 3.4 8.2 9.7 12.0 10.0 14.0 9.8 5.4 6.6 76.9 10.3 17.0 15.2
mhr 22.2 5.8 8.8 15.4 24.2 24.8 31.6 28.2 28.1 17.7 24.9 28.4 13.6 56.0 29.7 34.3
tk 23.1 0.3 1.4 10.9 23.4 21.1 23.5 13.9 25.3 14.3 21.6 13.7 11.8 18.3 45.5 32.5
gn 27.3 0.7 3.2 12.9 23.8 21.5 35.0 17.8 32.8 17.5 22.6 21.5 10.8 23.3 31.3 48.1
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 8: Mean F1 scores of our framework without language adapters and invertible adapters (MAD-XBase – LAD – INV) for cross-lingual transfer on NER.
en 82.2 16.9 20.7 36.9 54.1 68.7 71.5 50.0 59.6 39.2 69.9 54.9 48.3 58.1 53.1 52.9
ja 41.1 65.4 57.2 24.9 39.8 46.1 54.3 56.1 45.0 36.7 39.8 48.0 24.1 49.4 59.9 48.9
zh 47.8 49.0 77.4 20.4 41.4 48.5 55.2 53.6 38.7 43.2 45.8 47.0 16.9 47.6 55.5 50.8
ar 56.3 16.9 23.3 89.1 65.3 62.2 75.5 55.6 65.9 40.7 63.3 66.9 57.3 49.4 59.0 53.9
jv 40.3 4.2 13.0 37.8 71.6 54.2 57.6 39.2 46.7 35.3 48.7 46.2 33.0 45.5 49.4 43.1
sw 55.1 7.7 13.2 38.7 54.7 89.6 66.4 46.1 54.1 31.5 74.2 51.4 45.7 49.4 53.0 47.0
is 56.2 14.0 21.7 42.6 59.4 58.8 85.9 48.1 61.4 43.3 56.3 67.3 51.3 52.8 61.5 58.1
my 14.8 2.3 7.2 11.5 19.4 19.0 37.0 66.5 10.9 19.4 8.4 32.3 37.4 33.8 30.1 21.6
qu 33.8 3.5 4.6 29.2 32.9 32.5 37.9 31.4 73.0 28.8 34.4 39.5 31.6 31.0 33.4 40.5
cdo 25.3 0.6 2.3 12.3 23.5 24.6 39.4 33.8 27.3 57.4 14.4 41.1 33.0 27.7 34.2 39.3
ilo 33.9 5.8 10.0 19.5 26.4 44.7 38.0 24.5 36.3 21.8 81.8 24.0 25.1 34.1 32.2 35.0
xmf 32.7 4.2 10.2 23.7 32.3 28.0 45.8 37.1 38.1 37.6 24.7 71.2 31.9 35.6 38.0 37.9
mi 18.0 3.0 3.7 9.5 16.9 18.7 25.6 24.1 20.0 27.8 11.7 29.7 87.3 20.6 29.2 30.6
mhr 24.1 2.4 4.7 17.2 28.5 19.9 42.5 29.2 35.5 28.6 25.2 40.4 29.4 71.0 38.8 31.4
tk 35.1 0.4 3.0 17.8 36.8 26.5 48.7 22.4 29.5 32.0 24.2 31.0 33.8 33.4 72.2 39.4
gn 34.0 0.4 3.8 13.1 32.8 24.9 45.2 25.3 35.9 28.4 14.1 26.5 24.8 35.5 43.8 66.2
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 9: Mean F1 scores of our framework without invertible adapters (MAD-XBase – INV) for cross-lingual
transfer on NER.
en 82.2 19.0 20.5 41.8 55.7 73.8 74.5 51.9 66.1 36.5 73.1 57.6 51.0 62.1 59.6 55.1
ja 43.8 65.9 58.3 29.1 34.0 53.8 56.5 54.6 45.3 43.5 38.5 53.5 17.2 47.3 57.9 47.2
zh 45.4 47.6 75.4 26.9 39.1 49.2 55.6 49.5 46.6 50.1 44.1 53.9 27.5 40.0 57.8 47.5
ar 56.5 17.5 24.0 89.4 66.2 62.5 75.8 58.9 74.9 40.4 64.4 62.8 73.0 47.4 60.6 56.4
jv 36.3 9.5 13.6 34.7 70.0 51.1 46.9 30.4 53.4 31.0 45.3 46.1 42.9 34.3 43.3 38.6
sw 56.2 11.6 15.3 43.4 59.7 88.6 65.8 47.4 56.2 35.9 75.5 53.2 52.3 47.4 53.7 45.0
is 56.8 15.9 24.7 42.4 62.0 61.4 86.3 48.8 63.9 46.4 52.5 68.8 63.5 54.8 63.3 60.4
my 16.0 1.8 5.3 15.5 21.5 18.8 39.1 66.2 14.1 24.0 13.7 35.5 32.8 38.1 34.3 21.6
qu 33.2 5.0 10.0 31.0 33.6 38.0 34.5 30.7 72.4 23.0 32.8 41.0 27.5 35.0 35.0 39.5
cdo 22.3 2.5 4.0 11.0 24.5 21.9 36.7 27.6 17.6 58.0 10.5 33.6 26.3 24.9 31.8 33.9
ilo 35.4 6.5 7.4 26.9 34.2 45.9 42.6 28.6 38.9 22.0 85.7 30.5 32.5 34.1 34.2 35.3
xmf 32.0 6.9 11.2 21.9 36.8 28.6 48.7 37.6 41.2 41.1 20.6 72.0 36.2 36.9 37.9 39.6
mi 8.6 0.7 1.7 5.3 11.0 11.2 16.1 18.3 9.6 20.0 5.6 21.1 89.5 15.3 18.3 16.6
mhr 22.5 4.1 8.7 17.8 31.6 28.1 44.9 35.8 37.8 34.4 17.7 48.2 25.7 74.3 42.2 33.5
tk 31.7 2.0 1.9 17.8 35.7 30.1 47.1 26.3 32.6 34.6 32.4 33.2 31.7 38.8 71.0 44.0
gn 33.7 0.1 0.7 15.9 34.6 26.1 49.0 25.7 31.7 33.1 16.3 34.5 37.2 36.0 43.1 68.2
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 10: Mean F1 scores of our complete adapter-based framework (MAD-XBase) for cross-lingual transfer on NER.
en 84.7 26.9 40.5 41.1 63.0 68.5 69.6 44.9 60.7 13.6 65.0 32.4 22.5 46.1 52.5 46.3
ja 58.6 73.2 68.8 38.3 54.8 51.6 68.7 49.0 51.3 19.3 41.4 44.6 47.5 34.9 54.9 51.4
zh 59.0 48.3 82.2 40.4 53.6 51.3 68.8 50.6 48.5 22.0 41.7 43.4 33.5 42.8 65.2 59.3
ar 62.9 29.2 48.0 89.8 74.1 59.2 74.2 49.2 59.9 18.2 43.4 44.6 25.0 26.3 51.3 60.4
jv 55.9 26.7 41.7 40.0 74.4 62.8 65.7 40.9 47.0 14.1 47.8 34.0 44.2 32.2 45.4 53.5
sw 62.4 23.0 38.9 36.0 61.8 89.6 65.1 39.2 50.3 18.1 65.3 41.7 45.4 40.7 50.2 52.3
is 62.4 26.1 43.1 46.1 64.8 63.6 85.5 48.2 63.8 17.7 50.8 45.6 46.5 39.1 58.9 58.6
my 13.7 1.8 4.2 17.7 15.1 10.7 35.1 69.7 5.9 5.5 6.4 20.4 9.6 31.2 23.1 13.0
qu 42.6 15.8 26.2 25.9 32.4 45.9 41.2 23.6 71.8 9.2 41.9 21.8 19.9 27.0 30.1 34.5
cdo 18.4 5.5 10.9 12.2 18.4 18.6 27.7 27.1 19.4 48.3 14.2 13.8 19.1 19.4 31.4 28.3
ilo 39.2 12.8 22.2 19.6 30.5 53.7 44.4 34.9 44.4 10.0 80.2 22.1 18.7 35.2 34.9 30.8
xmf 22.4 2.2 4.9 22.4 21.9 18.4 43.3 35.1 23.6 12.5 11.8 63.2 37.5 31.8 35.6 32.4
mi 18.8 3.1 7.8 12.6 13.6 17.1 27.7 18.4 15.8 11.6 10.2 18.2 87.1 15.3 26.3 31.1
mhr 31.1 8.2 15.0 25.0 29.1 28.3 48.8 35.1 35.6 15.5 22.2 33.0 33.4 61.7 42.2 36.6
tk 35.7 7.7 14.5 23.0 36.3 36.5 51.4 32.0 35.6 20.5 27.6 35.7 45.9 37.4 69.2 48.1
gn 39.9 8.0 15.9 23.0 30.0 31.3 49.5 31.7 42.5 13.4 23.4 30.3 22.5 26.2 44.2 62.9
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 11: Mean F1 scores of mBERT for cross-lingual transfer on NER.
en 83.7 21.8 37.6 35.6 64.0 67.8 72.9 44.0 73.5 20.2 67.0 46.6 41.8 62.6 54.6 51.8
ja 55.9 69.3 64.0 37.6 53.2 49.8 67.4 46.5 56.9 33.9 41.1 53.5 55.4 47.8 64.7 53.2
zh 57.6 46.0 78.9 34.1 52.5 54.8 69.1 49.2 58.5 36.1 59.1 55.4 30.4 48.2 64.2 57.9
ar 63.8 22.9 43.0 89.0 61.2 62.6 73.3 48.0 63.6 28.1 63.3 55.0 51.6 48.0 63.3 50.6
jv 50.5 13.7 27.1 36.4 73.4 54.9 64.0 41.9 57.2 25.1 58.5 42.3 56.0 43.4 47.1 49.7
sw 57.2 19.3 31.6 31.8 59.8 90.0 67.8 42.6 61.3 31.4 75.0 48.1 46.9 50.5 52.1 49.5
is 59.6 19.1 31.2 34.2 62.5 50.8 85.3 44.3 69.0 30.2 49.8 51.7 60.2 50.6 62.1 61.8
my 12.5 2.7 6.2 14.2 20.6 12.7 32.5 61.8 12.1 17.0 13.9 32.5 14.0 32.3 30.9 20.0
qu 42.3 12.9 23.8 23.2 39.5 43.8 47.1 34.2 72.9 17.2 50.4 37.3 35.8 39.5 39.2 43.6
cdo 23.7 0.9 4.7 11.0 16.0 16.7 36.9 34.0 17.5 51.8 8.9 27.1 20.0 23.8 33.5 26.3
ilo 40.2 10.1 17.9 19.7 37.9 53.2 45.2 27.6 38.0 18.1 79.1 30.2 35.0 32.1 41.8 34.3
xmf 27.9 2.9 4.4 23.2 28.2 23.4 47.2 35.6 31.6 31.1 15.4 67.5 33.1 35.8 38.7 33.5
mi 12.3 0.2 0.9 4.5 13.0 11.4 23.2 18.1 14.7 20.8 6.7 15.0 88.0 15.1 25.2 28.0
mhr 28.6 4.3 10.1 18.4 34.0 25.9 48.3 38.4 34.9 26.7 13.3 40.4 31.4 70.4 44.0 40.4
tk 38.2 7.3 10.7 17.3 40.5 29.7 53.2 32.0 36.8 31.3 16.5 35.3 27.5 34.6 70.3 47.1
gn 28.2 2.1 4.4 11.4 30.8 21.7 42.2 14.8 32.6 20.0 11.4 26.4 29.3 31.7 37.3 56.9
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 12: Mean F1 scores of our complete adapter-based framework (MAD-XmBERT) for cross-lingual transfer on NER.
en 84.1 16.9 25.3 50.4 58.8 69.0 74.2 49.6 54.1 15.6 63.9 39.0 31.7 47.8 47.5 45.1
ja 52.4 72.9 61.8 34.8 49.5 52.8 62.3 49.3 42.5 14.6 57.0 38.9 20.3 36.4 53.8 45.3
zh 52.0 52.7 80.4 26.6 48.4 50.1 61.5 54.8 40.0 17.4 58.2 43.3 17.0 32.3 60.1 48.1
ar 55.7 19.1 30.0 90.7 62.6 53.5 69.1 57.0 53.5 6.9 40.0 48.8 31.7 26.7 44.5 46.1
jv 51.5 3.8 5.8 38.2 70.2 54.5 62.8 31.4 44.5 19.1 43.0 41.2 36.3 33.1 43.9 48.0
sw 58.7 11.7 18.7 37.0 54.6 88.4 65.2 38.1 46.6 19.6 57.2 37.3 38.7 34.2 45.0 52.4
is 60.6 13.3 22.8 53.6 58.5 57.8 86.0 45.6 58.5 23.5 48.5 57.0 43.0 41.7 54.7 53.8
my 26.0 3.2 7.7 23.3 22.7 25.5 43.6 69.4 23.6 11.2 16.9 27.2 28.2 28.3 34.4 27.6
qu 36.4 1.2 3.0 26.5 30.9 33.8 33.6 19.2 65.5 13.2 40.0 20.0 17.7 26.2 25.3 29.8
cdo 14.7 0.3 1.7 5.5 9.9 18.9 21.2 6.1 16.6 47.3 14.8 6.9 17.1 17.5 18.5 24.9
ilo 28.6 4.5 6.8 13.5 22.4 34.0 25.6 16.9 33.7 11.2 69.2 16.2 9.6 18.3 30.9 23.4
xmf 34.2 4.4 8.8 28.0 37.7 33.3 49.5 33.4 35.5 16.6 29.8 67.5 43.7 36.0 37.2 43.0
mi 18.7 0.1 0.4 9.9 14.3 18.4 23.9 16.5 18.6 18.0 14.5 10.0 86.8 15.9 26.1 25.5
mhr 26.3 4.2 8.1 18.2 28.1 25.9 41.0 26.3 32.8 18.2 32.2 34.7 18.1 59.6 33.1 36.3
tk 34.5 1.8 3.9 21.2 34.5 35.2 45.0 22.4 31.5 25.2 30.1 28.4 25.1 28.4 63.4 42.3
gn 39.7 1.6 3.6 23.4 43.7 33.9 51.8 25.4 42.5 19.6 28.0 38.6 41.0 35.0 47.9 64.9
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 13: Mean F1 scores of XLM-RLarge for cross-lingual transfer on NER.
en 84.2 16.2 25.6 38.6 64.5 76.8 78.9 55.5 73.8 41.7 72.0 54.9 42.0 60.6 63.0 52.4
ja 52.1 73.7 65.9 30.4 51.5 56.9 62.8 55.6 57.0 52.0 53.4 55.2 44.1 39.3 56.0 48.1
zh 56.6 56.1 80.3 26.5 51.0 54.8 68.2 56.1 61.1 54.0 61.4 56.3 43.7 48.1 59.4 56.8
ar 63.1 17.4 28.3 91.4 72.4 65.5 78.9 52.7 78.1 51.7 66.9 68.1 52.5 48.4 64.6 50.5
jv 43.1 0.3 0.8 32.5 73.1 52.3 63.3 30.4 48.2 37.4 47.1 38.1 44.5 39.6 51.4 44.2
sw 57.3 7.2 10.4 34.1 59.7 90.6 69.0 39.5 63.7 48.2 73.9 48.5 48.0 48.0 59.4 51.5
is 57.4 8.9 15.2 42.8 68.2 50.4 87.7 46.2 65.8 51.3 51.8 64.7 55.6 56.0 65.1 64.3
my 18.4 1.6 4.6 13.9 28.1 21.8 40.8 58.8 20.3 27.1 13.0 26.8 23.7 30.6 37.0 26.3
qu 31.9 1.0 3.2 20.7 35.6 33.7 45.6 19.6 74.5 38.5 39.8 38.5 39.8 35.3 41.7 42.9
cdo 28.4 0.3 0.4 12.8 25.3 24.8 43.5 21.7 22.5 63.7 14.4 29.6 35.2 26.9 35.1 40.1
ilo 34.6 1.6 1.8 18.8 37.2 46.5 40.2 16.6 45.3 27.8 81.3 28.5 17.5 33.0 36.9 28.1
xmf 29.7 4.2 10.5 18.9 37.4 27.4 46.5 27.8 38.5 40.9 24.0 67.5 37.8 41.9 42.1 38.8
mi 17.6 0.0 0.0 8.6 18.3 16.7 32.4 18.8 23.6 29.0 11.0 29.8 92.0 24.9 34.6 31.6
mhr 20.7 1.7 3.3 11.2 26.1 26.6 43.6 26.0 36.5 28.2 21.9 34.3 27.1 67.2 38.2 35.9
tk 31.8 0.1 0.1 15.7 40.8 25.7 47.9 18.9 34.3 39.7 18.9 35.1 28.2 39.7 75.9 40.4
gn 23.5 0.0 0.0 8.9 29.9 22.3 42.2 19.1 33.1 28.3 13.4 28.7 35.6 30.9 39.4 66.7
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 14: Mean F1 scores of our complete adapter-based framework (MAD-XLarge) for cross-lingual transfer on NER.
Model en et ht id it qu sw ta th tr vi zh avg
XLM-RBase →COPA 57.6 59.8 49.4 58.0 56.0 50.7 57.2 56.6 52.8 56.2 58.5 56.6 55.8
XLM-RBase MLM-TRG →COPA 57.6 57.8 48.6 60.8 54.4 49.5 55.4 55.8 54.2 54.8 57.6 57.2 55.3
XLM-RBase →SIQA 68.0 59.4 49.2 67.2 63.6 51.0 57.6 58.8 61.6 60.4 65.8 66.0 60.7
XLM-RBase →SIQA→COPA 66.8 58.0 51.4 65.0 60.2 51.2 52.0 58.4 62.0 56.6 65.6 68.8 59.7
XLM-RBase MLM-TRG →SIQA→COPA 66.8 59.4 50.0 71.0 61.6 46.0 58.8 60.0 63.2 62.2 67.6 67.4 61.2
MAD-XBase →COPA 48.1 49.0 51.5 50.7 50.7 49.1 52.7 52.5 48.7 53.3 52.1 50.4 50.7
MAD-XBase →SIQA 67.6 59.7 51.7 66.2 64.4 54.0 53.9 61.3 61.1 60.1 65.4 66.7 61.0
MAD-XBase →SIQA→COPA 68.3 61.3 53.7 65.8 63.0 52.5 56.3 61.9 61.8 60.3 66.1 67.6 61.5
Table 7: Accuracy scores of all models on the XCOPA test sets when transferring from English. Models are
either fine-tuned only on the COPA training set (→COPA), fine-tuned only on the SIQA training set (→SIQA), or
fine-tuned first on SIQA and then on COPA (→SIQA→COPA).
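The avg column of Table 7 can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes avg is the unweighted mean over the 12 listed languages, rounded to one decimal; that assumption is consistent with the two rows checked here but is not stated in the caption. The function name `macro_average` is hypothetical.

```python
# Language columns of Table 7, in order.
LANGS = ["en", "et", "ht", "id", "it", "qu", "sw", "ta", "th", "tr", "vi", "zh"]

# Per-language accuracies copied from two rows of Table 7.
ROWS = {
    "XLM-RBase ->COPA": [57.6, 59.8, 49.4, 58.0, 56.0, 50.7,
                         57.2, 56.6, 52.8, 56.2, 58.5, 56.6],
    "MAD-XBase ->SIQA": [67.6, 59.7, 51.7, 66.2, 64.4, 54.0,
                         53.9, 61.3, 61.1, 60.1, 65.4, 66.7],
}


def macro_average(accuracies):
    """Unweighted mean over languages, rounded to one decimal (assumed definition of avg)."""
    return round(sum(accuracies) / len(accuracies), 1)


for model, accuracies in ROWS.items():
    # Every row must have one accuracy per language column.
    assert len(accuracies) == len(LANGS)
    print(model, macro_average(accuracies))
```

Because the published per-language values are themselves rounded, a recomputed average can differ from the printed one in the last digit for other rows.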
sw -0.8
-2.6 -5.2 -0.8 0.3 0.8 26.7 8.6 15.0 -12.7
is -2.9 1.5 6.8 -11.1 8.2 4.7 -1.1 -5.9 -8.0 -1.1 2.5 9.8 4.9 5.8 6.3 3.7
my -9.6 -3.4 -3.1 7.4 2.0 -3.7 -2.2 -4.1 6.7 3.0 7.7 9.7 32.8 18.6 3.8 12.3
qu -6.5 4.7 9.8 -4.1 -4.4 2.8 -10.7 3.9 1.7 4.3 9.5 15.1 12.6 18.1 2.1 -5.1
cdo 6.9 2.5 3.9 6.8 12.5 -3.4 1.8 6.5 7.3 -9.0 7.8 28.4 12.6 13.8 7.0 16.0
ilo -1.5 5.1 0.4 6.1 7.8 -0.7 10.3 -0.7 14.4 9.2 0.4 -5.5 25.9 19.3 8.5 4.0
xmf 2.1 6.1 6.8 -6.1 17.6 -16.8 2.9 7.2 33.5 17.1 12.2 -2.9 4.8 25.8 22.8 20.9
mi -9.3 0.5 1.6 -1.6 -0.6 -0.9 -7.6 8.1 1.6 -4.8 2.8 16.1 1.3 9.5 -5.6 3.2
mhr 0.3 3.5 6.6 -2.8 6.2 6.1 -7.8 14.7 9.1 10.8 3.5 18.3 -2.9 3.6 1.6 13.9
tk 1.5 0.0 -2.7 0.7 2.7 -1.6 13.8 19.2 1.0 2.6 13.4 -1.4 16.6 15.1 0.1 6.8
gn -2.2 0.1 0.1 -11.6 -3.9 -3.6 -3.0 7.9 0.6 -3.5 5.5 9.1 12.5 12.4 7.0 2.0
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 15: Relative F1 improvement of MAD-XBase over XLM-RBase MLM-TRG in cross-lingual NER transfer.
en 0.0 2.1 -0.2 4.9 1.6 5.0 3.0 2.0 6.5 -2.7 3.2 2.7 2.7 4.0 6.5 2.2
ja 2.7 0.6 1.1 4.2 -5.8 7.6 2.1 -1.4 0.3 6.8 -1.3 5.5 -6.8 -2.1 -2.0 -1.7
zh -2.4 -1.4 -1.9 6.5 -2.2 0.7 0.4 -4.1 7.9 7.0 -1.7 7.0 10.6 -7.5 2.3 -3.2
ar 0.2 0.6 0.7 0.3 0.9 0.3 0.3 3.3 9.0 -0.3 1.1 -4.1 15.7 -2.0 1.6 2.5
jv -4.0 5.3 0.6 -3.1 -1.6 -3.1 -10.7 -8.8 6.7 -4.3 -3.3 -0.1 9.8 -11.2 -6.0 -4.5
5.0 -1.0 -0.6 1.3 2.0 4.3
Source Language
sw 1.0 3.9 2.1 4.6 1.3 1.9 6.6 -1.9 0.7 -1.9
is 0.6 1.9 3.0 -0.2 2.6 2.6 0.4 0.7 2.5 3.1 -3.8 1.6 12.2 2.0 1.7 2.3
my 1.1 -0.5 -2.0 4.0 2.1 -0.2 2.1 -0.3 3.1 4.6 5.2 3.2 -4.7 4.3 4.3 0.0
qu -0.6 1.5 5.3 1.8 0.7 5.5 -3.3 -0.7 -0.6 -5.8 -1.6 1.5 -4.2 4.0 1.6 -1.0
cdo -3.1 2.0 1.7 -1.3 1.0 -2.6 -2.7 -6.1 -9.7 0.7 -3.9 -7.5 -6.7 -2.8 -2.4 -5.4
ilo 1.5 0.7 -2.6 7.4 7.8 1.2 4.5 4.1 2.6 0.2 3.8 6.5 7.4 0.0 2.0 0.3
xmf -0.7 2.7 1.0 -1.7 4.5 0.6 3.0 0.6 3.2 3.5 -4.1 0.7 4.4 1.3 -0.0 1.7
mi -9.3 -2.3 -1.9 -4.3 -6.0 -7.5 -9.5 -5.8 -10.4 -7.8 -6.1 -8.6 2.2 -5.3 -11.0 -13.9
mhr -1.5 1.8 4.0 0.7 3.0 8.1 2.4 6.6 2.3 5.9 -7.5 7.9 -3.7 3.3 3.4 2.1
tk -3.4 1.5 -1.1 -0.0 -1.1 3.6 -1.6 3.9 3.1 2.6 8.2 2.2 -2.1 5.4 -1.3 4.6
gn -0.3 -0.2 -3.1 2.8 1.9 1.1 3.7 0.4 -4.2 4.7 2.2 8.1 12.4 0.6 -0.7 2.0
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 16: Relative F1 improvement of MAD-XBase over MAD-XBase –INV in cross-lingual NER transfer.
sw 2.7
3.5 11.1 13.3 9.8 6.5 1.5 9.8 1.9 -2.9
is -2.8 -7.0 -11.9 -11.9 -2.3 -12.8 -0.2 -3.9 5.2 12.5 -1.0 6.1 13.7 11.5 3.3 3.2
my -1.2 0.9 2.0 -3.5 5.5 2.0 -2.6 -7.8 6.2 11.5 7.4 12.0 4.3 1.1 7.8 7.0
qu -0.3 -2.9 -2.5 -2.7 7.1 -2.1 5.9 10.5 1.1 8.0 8.5 15.5 15.8 12.5 9.1 9.1
cdo 5.3 -4.5 -6.2 -1.1 -2.4 -1.9 9.1 7.0 -1.9 3.6 -5.3 13.3 0.9 4.3 2.2 -2.1
ilo 1.0 -2.7 -4.3 0.1 7.4 -0.5 0.8 -7.2 -6.4 8.1 -1.1 8.1 16.3 -3.1 6.9 3.5
xmf 5.5 0.7 -0.5 0.9 6.3 5.1 4.0 0.5 8.0 18.6 3.6 4.4 -4.4 4.0 3.1 1.0
mi -6.5 -2.9 -6.8 -8.2 -0.7 -5.7 -4.5 -0.3 -1.1 9.2 -3.5 -3.2 0.9 -0.2 -1.1 -3.1
mhr -2.5 -3.9 -5.0 -6.6 4.9 -2.5 -0.5 3.3 -0.7 11.2 -9.0 7.5 -2.0 8.7 1.8 3.8
tk 2.6 -0.3 -3.7 -5.7 4.2 -6.8 1.8 -0.0 1.2 10.8 -11.0 -0.4 -18.4 -2.9 1.1 -1.0
gn -11.6 -5.9 -11.5 -11.7 0.8 -9.6 -7.3 -16.9 -9.9 6.6 -12.0 -3.9 6.8 5.5 -7.0 -6.0
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 17: Relative F1 improvement of MAD-XmBERT over mBERT in cross-lingual NER transfer.
en 0.2 -0.7 0.3 -11.8 5.8 7.7 4.7 5.9 19.7 26.0 8.1 15.9 10.2 12.8 15.5 7.3
ja -0.3 0.8 4.0 -4.4 2.0 4.2 0.5 6.3 14.6 37.4 -3.6 16.3 23.7 2.9 2.2 2.8
zh 4.7 3.4 -0.1 -0.1 2.6 4.7 6.7 1.3 21.2 36.6 3.2 13.0 26.8 15.8 -0.7 8.7
ar 7.4 -1.7 -1.6 0.7 9.8 12.1 9.8 -4.3 24.5 44.8 26.9 19.2 20.7 21.7 20.1 4.4
jv -8.4 -3.5 -5.0 -5.7 2.9 -2.2 0.5 -1.0 3.7 18.3 4.1 -3.1 8.2 6.5 7.5 -3.9
-8.3 -2.9 5.1 2.2 3.8 1.4 17.1 28.6 16.7 11.2 9.3 13.8 14.4 -1.0
Source Language
sw -1.4 -4.4
is -3.2 -4.4 -7.7 -10.8 9.8 -7.4 1.6 0.6 7.3 27.8 3.4 7.7 12.6 14.2 10.4 10.5
my -7.5 -1.7 -3.1 -9.4 5.3 -3.7 -2.8 -10.6 -3.3 15.9 -3.9 -0.5 -4.5 2.2 2.6 -1.3
qu -4.5 -0.2 0.2 -5.8 4.7 -0.1 12.1 0.4 9.0 25.3 -0.3 18.4 22.1 9.1 16.4 13.1
cdo 13.7 -0.0 -1.3 7.2 15.4 5.9 22.3 15.7 5.9 16.3 -0.5 22.7 18.0 9.4 16.6 15.3
ilo 6.0 -2.9 -5.0 5.2 14.9 12.5 14.6 -0.3 11.6 16.6 12.0 12.3 7.9 14.7 6.1 4.6
xmf -4.5 -0.2 1.7 -9.2 -0.3 -5.9 -3.0 -5.7 3.0 24.3 -5.8 0.1 -6.0 5.8 4.9 -4.2
mi -1.0 -0.1 -0.4 -1.4 4.0 -1.8 8.5 2.3 5.0 10.9 -3.5 19.8 5.2 9.0 8.5 6.1
mhr -5.6 -2.5 -4.8 -7.0 -2.0 0.7 2.6 -0.3 3.8 10.0 -10.3 -0.4 9.0 7.6 5.1 -0.3
tk -2.7 -1.7 -3.8 -5.5 6.3 -9.5 3.0 -3.5 2.7 14.4 -11.2 6.7 3.2 11.2 12.5 -1.9
gn -16.1 -1.6 -3.6 -14.5 -13.8 -11.6 -9.6 -6.2 -9.4 8.7 -14.6 -9.9 -5.4 -4.2 -8.5 1.9
en ja zh ar jv sw is my qu cdo ilo xmf mi mhr tk gn
Target Language
Figure 18: Relative F1 improvement of MAD-XLarge over XLM-RLarge in cross-lingual NER transfer.