
Adapting Pre-trained Language Models to African Languages via

Multilingual Adaptive Fine-Tuning

Jesujoba O. Alabi∗, David Ifeoluwa Adelani∗, Marius Mosbach, and Dietrich Klakow
Spoken Language Systems (LSV), Saarland University, Saarland Informatics Campus, Germany
{jalabi,didelani,mmosbach,dklakow}@lsv.uni-saarland.de

* Equal contribution.

Abstract

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) — fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.

1 Introduction

Recent advances in the development of multilingual pre-trained language models (PLMs) like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and RemBERT (Chung et al., 2021) have led to significant performance gains on a wide range of cross-lingual transfer tasks. Due to the curse of multilinguality (Conneau et al., 2020) — a trade-off between language coverage and model capacity — and the non-availability of pre-training corpora for many low-resource languages, multilingual PLMs are often trained on about 100 languages. Despite the limitations of language coverage, multilingual PLMs have been shown to transfer to several low-resource languages unseen during pre-training, although there is still a large performance gap compared to languages seen during pre-training.

One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) — fine-tuning a multilingual PLM on monolingual texts in the target language using the same pre-training objective. This has been shown to lead to big gains on many cross-lingual transfer tasks (Pfeiffer et al., 2020) and low-resource languages (Muller et al., 2021; Chau & Smith, 2021), including African languages (Alabi et al., 2020; Adelani et al., 2021). Nevertheless, adapting a model to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized to individual languages (Beukman, 2021).

An orthogonal approach to improve the coverage of low-resource languages is to include them in the pre-training data. An example of this approach is AfriBERTa (Ogueji et al., 2021), which was trained from scratch on 11 African languages. A downside of this approach is that it is resource intensive in terms of data and compute.

Another alternative is parameter-efficient fine-tuning such as Adapters (Pfeiffer et al., 2020) and sparse fine-tuning (Ansell et al., 2021), where the model is adapted to new languages using a sparse network trained on a small monolingual corpus. Similar to LAFT, this requires adaptation for every new target language; although it takes little disk space, all target language-specific parameters need to be stored.

In this paper, we propose multilingual adaptive fine-tuning (MAFT), a language adaptation to multiple languages at once. We perform language adaptation on the 17 most-resourced African languages (Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Naija, Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa, Yorùbá, isiZulu) and three other high-resource languages widely spoken on the continent (English, French, and Arabic) simultaneously to provide a single model for cross-lingual transfer learning for African languages. To further specialize the multilingual PLM, we follow the approach of Abdaoui et al. (2020) and remove vocabulary tokens from the embedding layer that correspond to non-Latin and non-Ge'ez (used by Amharic) scripts before MAFT, thus effectively reducing the model size by 50%.
Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification and sentiment classification) shows that our approach is competitive to performing LAFT on the individual languages, with the benefit of having a single model instead of a separate model for each of the target languages. Also, we show that our adapted PLM improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods like Adapters (Pfeiffer et al., 2020) and sparse fine-tuning (Ansell et al., 2021).

As an additional contribution, and in order to cover more diverse African languages in our evaluation, we create a new evaluation corpus, ANTC – African News Topic Classification – for Lingala, Somali, Naija, Malagasy, and isiZulu from pre-defined news categories of VOA, BBC, Global Voices, and Isolezwe newspapers. To further the research on NLP for African languages, we make our code and data publicly available.1 Additionally, our models are available via HuggingFace.2

2 Related Work

Multilingual PLMs for African languages The success of multilingual PLMs such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) for cross-lingual transfer in many natural language understanding tasks has encouraged the continuous development of multilingual models (Luo et al., 2021; Chi et al., 2021; Ouyang et al., 2021; Chung et al., 2021; He et al., 2021). Most of these models cover 50 to 110 languages, and only a few African languages are represented due to the lack of large monolingual corpora on the web. To address this under-representation, regional multilingual PLMs have either been trained from scratch, such as AfriBERTa (Ogueji et al., 2021), or adapted from existing multilingual PLMs through LAFT (Alabi et al., 2020; Pfeiffer et al., 2020; Muller et al., 2021; Adelani et al., 2021). AfriBERTa is a relatively small multilingual PLM (126M parameters) trained with the RoBERTa architecture and pre-training objective on 11 African languages. However, it lacks coverage of languages from the southern region of the African continent, specifically the southern-Bantu languages. In our work, we extend to those languages, since only a few of them have large (>100MB) monolingual corpora. We also do not specialize to a single language but apply MAFT, which allows multilingual adaptation and preserves downstream performance on both high-resource and low-resource languages.

Adaptation of multilingual PLMs It is not unusual for a new multilingual PLM to be initialized from an existing model. For example, Chi et al. (2021) trained InfoXLM by initializing the weights from XLM-R before training the model on a joint monolingual and translation corpus, although they make use of a new training objective during adaptation. Similarly, Tang et al. (2020) extended the languages covered by mBART (Liu et al., 2020b) from 25 to 50 by first modifying the vocabulary and initializing the model weights from the original mBART before fine-tuning it on a combination of monolingual texts from the original 25 languages in addition to 25 new languages. Despite increasing the number of languages covered by their model, they did not observe a significant performance drop on downstream tasks. We take inspiration from these works for applying MAFT to African languages, but we do not modify the training objective during adaptation nor increase the vocabulary.

Compressing PLMs One of the most effective methods for creating smaller PLMs is distillation, where a small student model is trained to reproduce the behaviour of a larger teacher model. This has been applied to many English PLMs (Sanh et al., 2019; Jiao et al., 2020; Sun et al., 2020; Liu et al., 2020a) and a few multilingual PLMs (Wang et al., 2020, 2021). However, it often leads to a drop in performance compared to the teacher PLM. An alternative approach that does not lead to a drop in performance has been proposed by Abdaoui et al. (2020) for multilingual PLMs: they removed unused vocabulary tokens from the embedding layer. This simple method significantly reduces the number of embedding parameters and thus the overall model size, since the embedding layer contributes the most to the total number of model parameters. In our paper, we combine MAFT with the method proposed by Abdaoui et al. (2020) to reduce the overall size of the resulting multilingual PLM for African languages. This is crucial especially because people from under-represented communities in Africa may not have access to powerful GPUs in order to fine-tune large PLMs. Also, the free version of Google Colab3, which is widely used by individuals from under-represented communities without access to other compute resources, cannot run large models like XLM-R. Hence, it is important to provide smaller models that still achieve competitive downstream performance to these communities.
1 https://github.com/uds-lsv/afro-maft
2 https://huggingface.co/Davlan
3 https://colab.research.google.com/
Language | Number of sentences (Train / Dev / Test) | Classes | Number of classes
Newly created datasets
Lingala (lin) | 1,536 / 220 / 440 | Rdc, Politiki/Politique, Bokengi/Securite, Justice, Bokolongono/Santé/Medecine | 5
Naija (pcm) | 1,165 / 167 / 333 | Entertainment, Africa, Sport, Nigeria, World | 5
Malagasy (mlg) | 4,544 / 650 / 1,299 | Politika (Politics), Kolontsaina (Culture), Zon'olombelona (Human Rights), Siansa sy Teknolojia (Science and Technology), Tontolo iainana (Environment) | 5
Somali (som) | 10,072 / 1,440 / 2,879 | Soomaaliya (Somalia), Wararka (News), Caalamka (World), Maraykanka (United States), Afrika (Africa) | 6
isiZulu (zul) | 2,961 / 424 / 847 | Ezemidlalo (Sports), Ezokungcebeleka (Recreation), Imibono (Ideas), Ezezimoto (Automotive), Intandokazi (Favorites) | 5
Existing datasets
Amharic (amh) | 36,029 / 5,147 / 10,294 | Local News, Sport, Politics, International News, Business, Entertainment | 6
English (eng) | 114,000 / 6,000 / 7,600 | World, Sports, Business, Sci/Tech | 4
Hausa (hau) | 2,045 / 290 / 582 | Africa, World, Health, Nigeria, Politics | 5
Kinyarwanda (kin) | 16,163 / 851 / 4,254 | Politics, Sport, Economy, Health, Entertainment, History, Technology, Tourism, Culture, Fashion, Religion, Environment, Education, Relationship | 14
Kiswahili (swa) | 21,096 / 1,111 / 7,338 | Uchumi (Economic), Kitaifa (National), Michezo (Sports), Kimataifa (International), Burudani (Recreation), Afya (Health) | 6
Yorùbá (yor) | 1,340 / 189 / 379 | Nigeria, Africa, World, Entertainment, Health, Sport, Politics | 7

Table 1: Number of sentences in the training, development and test splits. We provide automatic translations of some of the African-language words to English (in parentheses) using Google Translate.

Evaluation datasets for African languages One of the challenges of developing (multilingual) PLMs for African languages is the lack of evaluation corpora. There have been many efforts by communities like Masakhane to address this issue (∀ et al., 2020; Adelani et al., 2021). We only find two major evaluation benchmark datasets that cover a wide range of African languages: one for named entity recognition (NER) (Adelani et al., 2021) and one for sentiment classification (Muhammad et al., 2022). In addition, there are also several news topic classification datasets (Hedderich et al., 2020; Niyongabo et al., 2020; Azime & Mohammed, 2021), but they are only available for a few African languages. Our work contributes novel news topic classification datasets (i.e. ANTC) for five additional African languages: Lingala, Naija, Somali, isiZulu, and Malagasy.

3 Data

3.1 Adaptation corpora

We perform MAFT on 17 African languages (Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Naija, Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa, Yorùbá, isiZulu) covering the major African language families and 3 high-resource languages (Arabic, French, and English) widely spoken in Africa. We selected the African languages based on the availability of a (relatively) large amount of monolingual texts. We obtain the monolingual texts from three major sources: the mT5 pre-training corpus, which is based on the Common Crawl Corpus4 (Xue et al., 2021), the British Broadcasting Corporation (BBC) News, Voice of America News5 (Palen-Michel et al., 2022), and some other news websites based in Africa. Table 9 in the Appendix provides a summary of the monolingual data, including their sizes and sources. We pre-processed the data by removing lines that consist of numbers or punctuation only, and lines with fewer than six tokens.
3.2 Evaluation tasks

We run our experiments on two sentence-level classification tasks, news topic classification and sentiment classification, and one token-level classification task, NER. We evaluate our models on English as well as diverse African languages with different linguistic characteristics.

3.2.1 Existing datasets

NER For the NER task we evaluate on the MasakhaNER dataset (Adelani et al., 2021), a manually annotated dataset covering 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Naija, Kiswahili, Wolof, and Yorùbá) with texts from the news domain. For English, we use data from the CoNLL 2003 NER task (Tjong Kim Sang & De Meulder, 2003), also containing texts from the news domain. For isiXhosa, we use the data from Eiselen (2016). Lastly, to evaluate on Arabic we make use of the ANERCorp dataset (Benajiba et al., 2007; Obeid et al., 2020).
4 https://commoncrawl.org/
5 https://www.voanews.com
News topic classification We use existing news topic datasets for Amharic (Azime & Mohammed, 2021), English – the AG News corpus (Zhang et al., 2015), Kinyarwanda – KINNEWS (Niyongabo et al., 2020), Kiswahili – a news classification dataset (David, 2020), and both Yorùbá and Hausa (Hedderich et al., 2020). For datasets without a development set, we randomly sample 5% of their training instances and use them as a development set.

Sentiment classification We use the NaijaSenti multilingual Twitter sentiment analysis corpus (Muhammad et al., 2022). This is a large code-mixed and monolingual sentiment analysis dataset, manually annotated for 4 Nigerian languages: Hausa, Igbo, Yorùbá and Pidgin. Additionally, we evaluate on the Amharic and English Twitter sentiment datasets by Yimam et al. (2020) and Rosenthal et al. (2017), respectively. For all datasets above, we only make use of tweets with positive, negative and neutral sentiments.

3.2.2 Newly created dataset: ANTC corpus

We created a novel dataset, ANTC — African News Topic Classification — for five African languages. We obtained data from four different news sources: VOA, BBC6, Global Voices7, and Isolezwe8. From the VOA data we created datasets for Lingala and Somali. We obtained the topics from data released by Palen-Michel et al. (2022) and used the provided URLs to get the news category from the websites. For Naija, Malagasy and isiZulu, we scraped news topics directly from the respective news websites (BBC Pidgin, Global Voices, and Isolezwe respectively) based on their category. We noticed that some news topics are not mutually exclusive to their categories; therefore, we filtered out topics with multiple labels. Also, we ensured that each category has at least 200 samples. The categories include but are not limited to: Africa, Entertainment, Health, and Politics. The pre-processed datasets were divided into training, development, and test sets using stratified sampling with a ratio of 70:10:20. Table 1 provides details about the dataset sizes and news topics.

4 Pre-trained Language Models

For our experiments, we make use of different multilingual PLMs that have been trained using a masked language model objective on large collections of monolingual texts from several languages. Table 2 shows the number of parameters as well as the African languages covered by each of the models we consider.

PLM | # Lang. | African languages covered
XLM-R-base (270M) | 100 | afr, amh, hau, mlg, orm, som, swa, xho
AfriBERTa-large (126M) | 11 | amh, hau, ibo, kin, run, orm, pcm, som, swa, tir, yor
XLM-R-miniLM (117M) | 100 | afr, amh, hau, mlg, orm, som, swa, xho
XLM-R-large (550M) | 100 | afr, amh, hau, mlg, orm, som, swa, xho
AfroXLMR* (117M-270M) | 20 | afr, amh, hau, ibo, kin, run, mlg, nya, orm, pcm, sna, som, sot, swa, xho, yor, zul

Table 2: Language coverage and size for pre-trained language models. Languages in bold have evaluation datasets for either NER, news topic classification or sentiment analysis.

1. XLM-R (Conneau et al., 2020) has been pre-trained on 100 languages including eight African languages. We make use of both XLM-R-base and XLM-R-large for MAFT, with 270M and 550M parameters respectively, although for our main experiments we make use of XLM-R-base.

2. AfriBERTa (Ogueji et al., 2021) has been pre-trained only on African languages. Despite its smaller parameter size (126M), it has been shown to reach competitive performance to XLM-R-base on African language datasets (Adelani et al., 2021; Hedderich et al., 2020).

3. XLM-R-miniLM (Wang et al., 2020) is a distilled version of XLM-R-large with only 117M parameters.

Hyper-parameters for baseline models We fine-tune the baseline models for NER, news topic classification and sentiment classification for 50, 25, and 20 epochs respectively. We use a learning rate of 5e-5 for all tasks, except for sentiment classification, where we use 2e-5 for XLM-R-base and XLM-R-large. The maximum sequence length is 164 for NER, 500 for news topic classification, and 128 for sentiment classification. The adapted models also make use of similar hyper-parameters.
include but are not limited to: Africa, Entertainment, news topic classification, and 128 for sentiment classi-
Health, and Politics. The pre-processed datasets were fication. The adapted models also make use of similar
divided into training, development, and test sets using hyper-parameters.
stratified sampling with a ratio of 70:10:20. Table 1
provides details about the dataset size and news topic 5 Multilingual Adaptive Fine-tuning
information.
We introduce MAFT as an approach to adapt a multi-
lingual PLM to a new set of languages. Adapting PLMs
4 Pre-trained Language Models
has been shown to be effective when adapting to a new
For our experiments, we make use of different multi- domain (Gururangan et al., 2020) or language (Pfeif-
lingual PLMs that have been trained using a masked fer et al., 2020; Alabi et al., 2020; Muller et al., 2021;
language model objective on large collections of mono- Adelani et al., 2021). While previous work on multilin-
lingual texts from several languages. Table 2 shows the gual adaptation has mostly focused on autoregressive
number of parameters as well as the African languages sequence-to-sequence models such as mBART (Tang
covered by each of the models we consider. et al., 2020), in this work, we adapt non-autoregressive
masked PLMs on monolingual corpora covering 20
1. XLM-R (Conneau et al., 2020) has been pre- languages. Crucially, during adaptation we use the
trained on 100 languages including eight African same objective that was also used during pre-training.
languages. We make use of both XLM-R-base and The models resulting from MAFT can then be fine-
XLM-R-large for MAFT with 270M and 550M tuned on supervised NLP downstream tasks. We name
6 https://www.bbc.com/pidgin
7 https://mg.globalvoices.org/
8 https://www.isolezwe.co.za
Model Size amh ara eng hau ibo kin lug luo pcm swa wol xho yor avg
Finetune
XLM-R-miniLM 117M 69.5 76.1 91.5 74.5 81.9 68.6 64.7 11.7 83.2 86.3 51.7 69.3 72.0 69.3
AfriBERTa 126M 73.8 51.3 89.0 90.2 87.4 73.8 78.9 70.2 85.7 88.0 61.8 67.2 81.3 76.8
XLM-R-base 270M 70.6 77.9 92.3 89.5 84.8 73.3 79.7 74.9 87.3 87.4 63.9 69.9 78.3 79.2
XLM-R-large 550M 76.2 79.7 93.1 90.5 84.1 73.8 81.6 73.6 89.0 89.4 67.9 72.4 78.9 80.8
MAFT + Finetune
XLM-R-miniLM 117M 69.7 76.5 91.7 87.7 83.5 74.1 77.4 17.5 85.5 86.0 59.0 72.3 75.1 73.5
AfriBERTa 126M 72.5 40.9 90.1 89.7 87.6 75.2 80.1 69.6 86.5 87.6 62.3 71.8 77.0 76.2
XLM-R-base 270M 76.1 79.7 92.8 91.2 87.4 78.0 82.9 75.1 89.6 88.6 67.4 71.9 82.1 81.8
XLM-R-base-v70k 140M 70.1 76.4 91.0 91.4 86.6 77.5 83.2 75.4 89.0 88.7 65.9 72.4 81.3 80.7
XLM-R-base+LAFT 270M x 13 78.0 79.1 91.3 91.5 87.7 77.8 84.7 75.3 90.0 89.5 68.3 73.2 83.7 82.3

Table 3: NER model comparison, showing F1-score on the test sets after 50 epochs averaged over 5 runs. Results
are for all 4 tags in the dataset: PER, ORG, LOC, DATE/MISC. For LAFT, we multiplied the size of XLM-R-base
by the number of languages as LAFT results in a single model per language.

For adaptation, we train on a combination of the monolingual corpora used for AfriMT5 adaptation by Adelani et al. (2022). Details for each of the monolingual corpora and languages are provided in Appendix A.1.

Hyper-parameters for MAFT The PLMs were trained for 3 epochs with a learning rate of 5e-5 using HuggingFace transformers (Wolf et al., 2020). We use a batch size of 32 for AfriBERTa and a batch size of 10 for the other PLMs.
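Since MAFT is continued masked-language-model training on the combined monolingual corpora, it can be reproduced with the standard Hugging Face masked-LM training loop. The sketch below is a hedged illustration of that setup, not the authors' released script; the corpus path is a placeholder, and the per-device batch size would need to be adjusted to the available hardware.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"          # or "xlm-roberta-large", "castorini/afriberta_large"
CORPUS_FILE = "maft_corpus_20langs.txt"  # placeholder: concatenated monolingual texts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# One text line per example; tokenize and truncate to a fixed maximum length.
raw = load_dataset("text", data_files={"train": CORPUS_FILE})["train"]
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

# Same masked-LM objective as the original pre-training (standard 15% masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="afro-xlmr-base",
    num_train_epochs=3,              # as reported above
    learning_rate=5e-5,              # as reported above
    per_device_train_batch_size=10,  # 32 for AfriBERTa, 10 for the XLM-R models
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```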
5.1 Vocabulary reduction

Multilingual PLMs come with various parameter sizes, the larger ones having more than a hundred million parameters, which makes fine-tuning and deploying such models a challenge due to resource constraints. One of the major factors contributing to the parameter size of these models is the embedding matrix, whose size is a function of the vocabulary size of the model. While a large vocabulary is essential for a multilingual PLM trained on hundreds of languages, some of the tokens in the vocabulary can be removed when they are irrelevant to the domain or language considered in the downstream task, thus reducing the vocabulary size of the model. Inspired by Abdaoui et al. (2020), we experiment with reducing the vocabulary size of the XLM-R-base model before adapting via MAFT. There are two possible vocabulary reductions in our setting: (1) removal of tokens before MAFT, or (2) removal of tokens after MAFT. From our preliminary experiments, we find approach (1) to work better. We call the resulting model AfroXLMR-small.

To remove non-African vocabulary sub-tokens from the pre-trained XLM-R-base model, we concatenated the monolingual texts from 19 of the 20 languages (all except Amharic). Then, we apply the original XLM-R-base SentencePiece tokenizer to the Amharic monolingual texts and to the concatenated texts separately. The frequency of all sub-tokens in the two separate monolingual corpora is computed, and we select the top-k most frequent tokens from each corpus. We used this separate sampling to ensure that a considerable number of Amharic sub-tokens are captured in the new vocabulary; we justify this choice in Section 5.3. We assume that the top-k most frequent tokens should be representative of the vocabulary of the whole 20 languages. We chose k = 52,000 for the Amharic sub-tokens, which covers 99.8% of the Amharic monolingual texts, and k = 60,000, which covers 99.6% of the other 19 languages, and merged them. In addition, we include the top 1,000 tokens from the original XLM-R-base tokenizer in the new vocabulary to capture frequent tokens that were not present in the new top-k tokens.9 We note that our assumption above may not hold in the case of some very distant and low-resourced languages, as well as when there are domain differences between the corpora used during adaptation and fine-tuning. We leave the investigation of alternative approaches for vocabulary compression for future work.
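The token-frequency computation and the subsequent shrinking of the embedding matrix can be sketched as follows. This is an illustrative reconstruction under the assumptions stated in the comments (frequency counting with the original XLM-R tokenizer, then copying only the selected rows of the embedding matrix); it is not the exact released implementation, and the corpus file names are placeholders.

```python
from collections import Counter
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def top_k_token_ids(corpus_path, k):
    """Count sub-token frequencies over a corpus and return the k most frequent token ids."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(tokenizer(line.strip(), add_special_tokens=False)["input_ids"])
    return [tok_id for tok_id, _ in counts.most_common(k)]

# Separate sampling as described above: 52,000 ids from the Amharic corpus, 60,000 from the
# other 19 languages, plus 1,000 frequent tokens of the original tokenizer (approximated here
# by the lowest SentencePiece ids, an assumption for illustration). Special tokens are kept.
keep = set(tokenizer.all_special_ids)
keep.update(top_k_token_ids("amharic.txt", 52_000))
keep.update(top_k_token_ids("other_19_languages.txt", 60_000))
keep.update(range(1_000))
keep = sorted(keep)

# Copy only the selected rows of the input embedding matrix into a smaller matrix.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep), old_emb.size(1))
new_emb.weight.data = old_emb[keep].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep)
# A matching SentencePiece vocabulary (and, for MLM training, a sliced LM-head bias)
# must also be rebuilt so token ids line up with the new rows; that step is omitted here.
```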
5.2 Results and discussion

5.2.1 Baseline results

For the baseline models (top rows in Tables 3, 4, and 5), we directly fine-tune on each of the downstream tasks in the target language: NER, news topic classification and sentiment analysis.

Performance on languages seen during pre-training For NER and sentiment analysis we find XLM-R-large to give the best overall performance. We attribute this to the fact that it has a larger model capacity compared to the other PLMs. Similarly, we find AfriBERTa and XLM-R-base to give better results on languages they have been pre-trained on (see Table 2), and in most cases AfriBERTa tends to perform better than XLM-R-base on languages they are both pre-trained on, for example amh, hau, and swa. However, when the languages are unseen by AfriBERTa (e.g. ara, eng, wol, lin, lug, luo, xho, zul), it performs much worse than XLM-R-base and in some cases even worse than XLM-R-miniLM.

9 This introduced just a few new tokens, mostly English tokens, to the new vocabulary. We end up with 70,609 distinct sub-tokens after combining all of them.
Model Size amh eng hau kin lin mlg pcm som swa yor zul avg
Finetune
XLM-R-miniLM 117M 70.4 94.1 77.6 64.2 41.2 42.9 67.6 74.2 86.7 68.8 56.9 67.7
AfriBERTa 126M 70.7 93.6 90.1 75.8 55.4 56.4 81.5 79.9 87.7 82.6 71.4 76.8
XLM-R-base 270M 71.1 94.1 85.9 73.3 56.8 54.2 77.3 78.8 87.1 71.1 70.0 74.6
XLM-R-large 550M 72.7 94.5 86.2 75.1 52.2 63.6 79.4 79.2 87.5 74.8 78.7 76.7
MAFT + Finetune
XLM-R-miniLM 117M 69.5 94.1 86.7 72.0 51.7 55.3 78.1 77.7 87.2 74.0 60.3 73.3
AfriBERTa 126M 68.8 93.7 89.5 76.5 54.9 59.7 82.2 79.9 87.7 80.8 76.4 77.3
XLM-R-base 270M 71.9 94.6 88.3 76.8 58.6 64.7 78.9 79.1 87.8 80.2 79.6 78.2
XLM-R-base-v70k 140M 70.4 94.2 87.7 76.1 56.8 64.4 76.1 79.4 87.4 76.9 77.4 76.9
XLM-R-base+LAFT 270M x 11 73.0 94.3 91.2 76.0 56.9 67.3 77.4 79.4 88.0 79.2 79.5 78.4

Table 4: News topic classification model comparison, showing F1-score on the test sets after 25 epochs averaged
over 5 runs. For LAFT, we multiplied the size of XLM-R-base by the number of languages.

Model Size amh eng hau ibo pcm yor avg


Finetune
XLM-R-miniLM 117M 51.0 62.8 75.0 78.0 72.9 73.4 68.9
AfriBERTa-large 126M 51.7 61.8 81.0 81.2 75.0 80.2 71.8
XLM-R-base 270M 51.4 66.2 78.4 79.9 76.3 76.9 71.5
XLM-R-large 550M 52.4 67.5 79.3 80.8 77.6 78.1 72.6
MAFT+Finetune
XLM-R-miniLM 117M 51.3 63.3 77.7 78.0 73.6 74.3 69.7
AfriBERTa 126M 53.6 63.2 81.0 80.6 74.7 80.4 72.3
XLM-R-base 270M 53.0 65.6 80.7 80.5 77.5 79.4 72.8
XLM-R-base-v70k 140M 52.2 65.3 80.6 81.0 77.4 78.6 72.5
XLM-R-base+LAFT 270M x 6 55.0 65.6 81.5 80.8 74.7 80.9 73.1

Table 5: Sentiment classification model comparison, showing F1 evaluation on test sets after 20 epochs, averaged
over 5 runs. We obtained the baseline model results for “hau”, “ibo”, “pcm”, and “yor” from
Muhammad et al. (2022). For LAFT, we multiplied the size of XLM-R-base by the number of languages as LAFT
results in a single model per language.

This shows that it may be better to adapt to a new African language from a PLM that has seen numerous languages than from one trained from scratch on a subset of African languages.

LAFT is a strong baseline The results of applying LAFT to the XLM-R-base model are shown in the last row of Tables 3, 4, and 5. We find that applying LAFT on each language individually provides a significant improvement in performance across all languages and tasks we evaluated on. Sometimes, the improvement is very large, for example, +7.4 F1 on Amharic NER and +9.5 F1 for Zulu news topic classification. The only exception is English, since XLM-R has already seen large amounts of English text during pre-training. Additionally, LAFT models tend to give slightly worse results when adaptation is performed on a smaller corpus.10

5.2.2 Multilingual adaptive fine-tuning results

While LAFT provides an upper bound on downstream performance for most languages, our new approach is often competitive to LAFT. On average, the difference on NER, news topic and sentiment classification is −0.5, −0.2, and −0.3 F1, respectively. Crucially, compared to LAFT, MAFT results in a single adapted model which can be applied to many languages, while LAFT results in a new model for each language. Below, we discuss our results in more detail.

PLMs pre-trained on many languages benefit the most from MAFT We found all the PLMs to improve after we applied MAFT. The improvement is largest for XLM-R-miniLM, where performance improved by +4.2 F1 for NER and +5.6 F1 for news topic classification, although the improvement was lower for sentiment classification (+0.8). Applying MAFT on XLM-R-base gave the overall best result. On average, there is an improvement of +2.6, +3.6, and +1.5 F1 on NER, news topic and sentiment classification, respectively. The main advantage of MAFT is that it allows us to use the same model for many African languages instead of many models specialized to individual languages. This significantly reduces the required disk space to store the models, without sacrificing performance. Interestingly, there is no strong benefit of applying MAFT to AfriBERTa. In most cases the improvement is < 0.6 F1. We speculate that this is probably due to AfriBERTa's tokenizer having limited coverage. We leave a more detailed investigation of this for future work.

10 We performed LAFT on eng using the VOA news corpus (about 906.6MB), which is much smaller than the CC-100 eng corpus (300GB).
Model amh ara eng yor
#UNK F1 #UNK F1 #UNK F1 #UNK F1
AfroXLMR-base 0 76.1 0 79.7 0 92.8 24 82.1
Afro-XLM-R70k (i) 3704 67.8 1403 76.3 44 90.6 5547 81.2
Afro-XLM-R70k (ii) 3395 70.1 669 76.4 54 91.0 6438 81.3

Table 6: Numbers of UNKs when the model tokenizers are applied on the NER test sets.

More efficient models using vocabulary reduction Applying vocabulary reduction helps to reduce the model size by more than 50% before applying MAFT. We find a slight reduction in performance as we remove more vocabulary tokens. The average performance of XLM-R-base-v70k is reduced by −1.6, −1.5 and −0.6 F1 for NER, news topic, and sentiment classification compared to the XLM-R-base+LAFT baseline. Despite the reduction in performance compared to XLM-R-base+LAFT, these models are still better than XLM-R-miniLM, which has a similar model size, with or without MAFT. We also find that their performance is better than that of the PLMs that have not undergone any adaptation. We find the largest reduction in performance on languages that make use of non-Latin scripts, i.e. amh and ara, which use the Ge'ez and Arabic scripts respectively. We attribute this to the vocabulary reduction impacting the number of amh and ara subwords covered by our tokenizer.

In summary, we recommend XLM-R-base+MAFT (i.e. AfroXLMR-base) for all languages on which we evaluated, including high-resource languages like English, French and Arabic. If there are GPU resource constraints, we recommend using XLM-R-base-v70k+MAFT (i.e. AfroXLMR-small).

5.3 Ablation experiments on vocabulary reduction

Our results showed that applying vocabulary reduction reduced the model size, but we also observed a drop in performance for different languages across the downstream tasks, especially for Amharic, because it uses a non-Latin script. Hence, we compared different sampling strategies for selecting the top-k vocabulary sub-tokens. These include: (i) concatenating the monolingual texts and selecting the top-70k sub-tokens, and (ii) the exact approach described in Section 5.1. The resulting tokenizers from the two approaches are used to tokenize the sentences in the NER test sets for Amharic, Arabic, English, and Yorùbá. Table 6 shows the number of UNKs in each test set after tokenization and the F1 scores obtained on the NER task for these languages. The table shows that the original AfroXLMR tokenizer obtains the lowest number of UNKs for all languages, with the highest F1 scores. Note that Yorùbá has 24 UNKs, which is explained by the fact that Yorùbá was not seen during pre-training. Furthermore, using approach (i) gave 3704 UNKs for Amharic, but with approach (ii) there was a significant drop in the number of UNKs and an improvement in F1 score. We noticed a drop in the vocabulary coverage for the other languages as we increased the number of Amharic sub-tokens. Therefore, we concluded that there is no sweet spot in terms of how to pick a vocabulary that covers all languages, and we believe that this is an exciting area for future work.
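The UNK counts in Table 6 can be reproduced by running each candidate tokenizer over the test sentences and counting how often the unknown-token id appears. The sketch below is an illustrative version of that check, not the authors' evaluation script; the test file is a placeholder, and the reduced tokenizer path is hypothetical.

```python
from transformers import AutoTokenizer

def count_unks(tokenizer_name, sentences):
    """Count how many sub-tokens map to <unk> when tokenizing the given sentences."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    total = 0
    for sent in sentences:
        ids = tok(sent, add_special_tokens=False)["input_ids"]
        total += sum(1 for i in ids if i == tok.unk_token_id)
    return total

# Placeholder: one test sentence per line of a MasakhaNER test file.
with open("amh_test_sentences.txt", encoding="utf-8") as f:
    amh_sentences = [line.strip() for line in f if line.strip()]

# "Davlan/afro-xlmr-base" follows the Hugging Face Hub naming referenced in the paper;
# the second path stands in for a locally built 70k-vocabulary tokenizer.
for name in ["Davlan/afro-xlmr-base", "path/to/reduced-70k-tokenizer"]:
    print(name, count_unks(name, amh_sentences))
```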
5.4 Scaling MAFT to larger models

To demonstrate the applicability of MAFT to larger models, we applied MAFT to XLM-R-large using the same training setup as for XLM-R-base. We refer to the new PLM as AfroXLMR-large. For comparison, we also trained individual LAFT models using the monolingual data11 from Adelani et al. (2021). Table 7 shows the evaluation results on NER. Averaging over all 13 languages, AfroXLMR-large improved over XLM-R-large by +2.8 F1, which is very comparable to the improvement we obtained between AfroXLMR-base (81.8 F1) and XLM-R-base (79.2 F1). Surprisingly, the improvement is quite large (+3.5 to +6.3 F1) for six out of ten African languages: yor, luo, lug, kin, ibo, and amh. The most interesting observation is that AfroXLMR-large, on average, is either competitive with or better than the individual language LAFT models, including on languages not seen during the MAFT training stage like lug, luo and wol. This implies that AfroXLMR-large (a single model) provides a better alternative to XLM-R-large+LAFT (one model per language) in terms of performance on downstream tasks and disk space. AfroXLMR-large is currently the largest masked language model for African languages, and achieves the state of the art compared to all other multilingual PLMs on the NER task. This shows that our MAFT approach is very effective and scales to larger PLMs.

6 Cross-lingual Transfer Learning

The previous section demonstrates the applicability of MAFT in the fully-supervised transfer learning setting. Here, we demonstrate that our MAFT approach is also very effective in the zero-shot cross-lingual transfer setting using parameter-efficient fine-tuning methods.

Parameter-efficient fine-tuning methods like adapters (Houlsby et al., 2019) are appealing because of their modularity, portability, and composability across languages and tasks.

11 For languages not in MasakhaNER, we use the same monolingual data as in Table 9.
Model Size amh ara eng hau ibo kin lug luo pcm swa wol xho yor avg
XLM-R-large 550M 76.2 79.7 93.1 90.5 84.1 73.8 81.6 73.6 89.0 89.4 67.9 72.4 78.9 80.8
XLM-R-large+LAFT 550M x 13 79.9 81.3 92.2 91.7 87.7 78.4 86.2 78.2 91.1 90.3 68.8 72.7 82.9 83.2
AfroXLMR-large 550M 79.7 80.9 92.2 91.2 87.7 79.1 86.7 78.1 91.0 90.4 69.6 72.9 85.2 83.4

Table 7: NER model comparison of XLM-R-large, XLM-R-large+LAFT and XLM-R-large+MAFT (i.e. AfroXLMR-large), showing F1-score on the test sets after 50 epochs averaged over 5 runs. Results are for all 4 tags in the dataset: PER, ORG, LOC, DATE/MISC.

Model amh hau ibo kin lug luo pcm swa wol yor avg
XLM-R-base (fully-supervised) 69.7 91.0 86.2 73.8 80.5 75.8 86.9 88.7 69.6 78.1 81.2
mBERT (MAD-X) (Ansell et al., 2021) - 83.4 71.7 65.3 67.0 52.2 72.1 77.6 65.6 74.0 69.9
mBERT (MAD-X on news domain) - 86.0 77.6 69.9 73.3 56.9 78.5 80.2 68.8 75.6 74.1
XLM-R-base (MAD-X on news domain) 47.5 85.5 83.2 72.0 75.7 57.8 76.8 84.0 68.2 72.2 75.0
AfroXLMR-base (MAD-X on news domain) 47.7 88.1 80.9 73.0 80.1 59.2 79.9 86.9 69.1 75.6 77.0
mBERT (LT-SFT) (Ansell et al., 2021) - 83.5 76.7 67.4 67.9 54.7 74.6 79.4 66.3 74.8 71.7
mBERT (LT-SFT on news domain) - 86.4 80.6 69.2 76.8 55.1 80.4 82.3 71.6 76.7 75.4
XLM-R-base (LT-SFT on news domain) 54.1 87.6 81.4 72.7 79.5 60.7 81.2 85.5 73.6 73.7 77.3
AfroXLMR-base (LT-SFT on news domain) 54.0 88.6 83.5 73.8 81.0 60.7 81.7 86.4 74.5 78.7 78.8

Table 8: Cross-lingual transfer using LT-SFT (Ansell et al., 2021) and evaluation on MasakhaNER. The fully-supervised baselines are obtained from Adelani et al. (2021) to measure the performance gap when annotated datasets
are available. Experiments are performed on 3 tags: PER, ORG, LOC. Average (avg) excludes amh. The best
zero-shot transfer F1-scores are underlined.

Oftentimes, language adapters are trained on a general-domain corpus like Wikipedia. However, when there is a mismatch between the target domain of the task and the domain of the language adapter, it could also impact the cross-lingual performance.

Here, we investigate how we can improve the cross-lingual transfer abilities of our adapted PLM – AfroXLMR-base – by training language adapters on the same domain as the target task. For our experiments, we use the MasakhaNER dataset, which is based on the news domain. We compare the performance of language adapters trained on the Wikipedia and news domains. In addition to adapters, we experiment with another parameter-efficient method based on the Lottery Ticket Hypothesis (Frankle & Carbin, 2019), i.e. LT-SFT (Ansell et al., 2021).

For the adapter approach, we make use of the MAD-X approach (Pfeiffer et al., 2020) – an adapter-based framework that enables cross-lingual transfer to arbitrary languages by learning modular language and task representations. However, the evaluation data in the target languages should have the same task and label configuration as the source language. Specifically, we make use of MAD-X 2.0 (Pfeiffer et al., 2021), where the last adapter layers are dropped, which has been shown to improve performance. The setup is as follows: (1) we train language adapters via masked language modelling (MLM) individually on the source and target languages; the corpora used are described in Appendix A.2. (2) We train a task adapter by fine-tuning on the target task using labelled data in a source language. (3) During inference, task and language adapters are stacked together by substituting the source language adapter with a target language adapter.
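For illustration, step (3) of this setup might look like the following sketch. It assumes the adapter-transformers (AdapterHub) fork of Hugging Face transformers; the adapter paths, the Hub model id, and the example sentence are placeholders rather than the authors' released artifacts.

```python
# Zero-shot MAD-X-style inference: stack a target-language adapter under a NER task
# adapter trained on the source language (English). Sketch assuming the
# adapter-transformers (AdapterHub) API; all paths below are placeholders.
from transformers import AutoAdapterModel, AutoTokenizer
from transformers.adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("Davlan/afro-xlmr-base")
tokenizer = AutoTokenizer.from_pretrained("Davlan/afro-xlmr-base")

# (1) Language adapters trained with MLM on monolingual news text.
src_lang = model.load_adapter("adapters/eng_news_mlm")  # active while training the task adapter
tgt_lang = model.load_adapter("adapters/hau_news_mlm")

# (2) Task adapter (with its NER head) fine-tuned on English CoNLL03; training loop omitted.
ner_task = model.load_adapter("adapters/ner_conll03")

# (3) Zero-shot transfer: swap the source-language adapter for the target-language one
#     and keep the task adapter stacked on top.
model.set_active_adapters(Stack(tgt_lang, ner_task))

tokens = tokenizer("An illustrative Hausa news sentence.", return_tensors="pt")
outputs = model(**tokens)  # token-level logits from the head saved with the task adapter
```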
We also make use of the Lottery Ticket Sparse Fine-tuning (LT-SFT) approach (Ansell et al., 2021), a parameter-efficient fine-tuning approach that has been shown to give competitive or better performance than the MAD-X 2.0 approach. LT-SFT is based on the Lottery Ticket Hypothesis (LTH), which states that each neural model contains a sub-network (a "winning ticket") that, if trained again in isolation, can reach or even surpass the performance of the original model. The LTH is originally a compression approach; the authors of LT-SFT re-purposed it for cross-lingual adaptation by finding sparse sub-networks for tasks and languages that are later composed together for zero-shot adaptation, similar to Adapters. For additional details we refer to Ansell et al. (2021).

6.1 Experimental setup

For our experiments, we followed the same setting as Ansell et al. (2021), who adapted mBERT from English CoNLL03 (Tjong Kim Sang & De Meulder, 2003) to African languages (using the MasakhaNER dataset) for the NER task.12 Furthermore, we extend the experiments to XLM-R-base and AfroXLMR-base. For the training of MAD-X 2.0 and sparse fine-tunings (SFTs) for African languages, we make use of monolingual texts from the news domain, since it matches the domain of the evaluation data. Unlike Ansell et al. (2021), who trained adapters and SFTs on monolingual data from the Wikipedia domain except for luo and pcm, where such data is absent, we show that the domain used for training language SFTs is also very important.

12 We excluded the MISC and DATE tags from CoNLL03 and MasakhaNER respectively to ensure the same label configuration.
For a fair comparison, we reproduced the results of Ansell et al. (2021) by training MAD-X 2.0 and LT-SFT on mBERT, XLM-R-base and AfroXLMR-base for the target languages with the news-domain corpus. However, we still make use of the pre-trained English language adapter13 and SFT14 for mBERT and XLM-R-base trained on the Wikipedia domain. For AfroXLMR-base, we make use of the same English adapter and SFT as for XLM-R-base because the PLM is already good for English. We make use of the same hyper-parameters reported in the LT-SFT paper.

Hyper-parameters for adapters We train the task adapter using the following hyper-parameters: batch size of 8, 10 epochs, "pfeiffer" adapter config, adapter reduction factor of 8, and learning rate of 5e-5. For the language adapters, we make use of 100 epochs or a maximum of 100K steps (minimum number of steps 30K), batch size of 8, "pfeiffer+inv" adapter config, adapter reduction factor of 2, learning rate of 5e-5, and maximum sequence length of 256. For a fair comparison with adapter models trained on the Wikipedia domain, we used the same hyper-parameter settings (Ansell et al., 2021) for the news domain.

6.2 Results and discussion

Table 8 shows the results of MAD-X 2.0 and LT-SFT. We compare their performance to the fully-supervised setting, where we fine-tune XLM-R-base on the training dataset of each of the languages and evaluate on the test set. We find that both MAD-X 2.0 and LT-SFT using the news domain for African languages produce better performance (+4.2 on MAD-X and +3.7 on LT-SFT) than the ones trained largely on the Wikipedia domain. This shows that the domain of the data matters. Also, we find that training LT-SFT on XLM-R-base gives better performance than mBERT on all languages; for MAD-X, there are a few exceptions like hau, pcm, and yor. Overall, the best performance is obtained by training LT-SFT on AfroXLMR-base, and sometimes it gives better performance than the fully-supervised setting (e.g. as observed for kin, lug, wol and yor). On both MAD-X and LT-SFT, AfroXLMR-base gives the best result since it has been adapted firstly to several African languages and secondly to the target domain of the target task. This shows that the MAFT approach is effective, since the technique provides a better PLM that parameter-efficient methods can benefit from.

7 Conclusion

In this work, we proposed and studied MAFT as an approach to adapt multilingual PLMs to many African languages with a single model. We evaluated our approach on 3 different NLP downstream tasks and additionally contribute novel news topic classification datasets for five African languages. Our results show that MAFT is competitive to LAFT while providing a single model compared to many models specialized for individual languages. We went further to show that combining vocabulary reduction and MAFT leads to a 50% reduction in the parameter size of XLM-R while still being competitive to applying LAFT on individual languages. We hope that future work improves vocabulary reduction to provide even smaller models with strong performance on distant and low-resource languages. To further research on NLP for African languages and reproducibility, we have uploaded our language adapters, language SFTs, AfroXLMR-base, AfroXLMR-small, and AfroXLMR-mini models to the HuggingFace Model Hub15.

Acknowledgments

Jesujoba Alabi was partially funded by the BMBF project SLIK under the Federal Ministry of Education and Research grant 01IS22015C. David Adelani acknowledges the EU-funded Horizon 2020 projects: ROXANNE under grant number 833635 and COMPRISE (http://www.compriseh2020.eu/) under grant agreement No. 3081705. Marius Mosbach acknowledges funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102. We also thank DFKI GmbH for providing the infrastructure to run some of the experiments. We are also grateful to Stella Biderman for providing the resources to train AfroXLMR-large. We thank Alan Ansell for providing his MAD-X 2.0 code. Lastly, we thank Benjamin Muller, the anonymous reviewers of the AfricaNLP 2022 workshop, and COLING 2022 for their helpful feedback.
13 https://adapterhub.ml/
14 https://huggingface.co/cambridgeltl
15 https://huggingface.co/models?sort=downloads&search=Davlan%2Fafro-xlmr
References

Amine Abdaoui, Camille Pradel, and Grégoire Sigel. Load what you need: Smaller versions of multilingual BERT. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pp. 119–123, Online, November 2020. Association for Computational Linguistics.

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3053–3070, Seattle, United States, July 2022. Association for Computational Linguistics.

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131, 2021.

Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2754–2762, Marseille, France, May 2020. European Language Resources Association.

Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, and Ivan Vulić. Composable sparse fine-tuning for cross-lingual transfer, 2021.

Israel Abebe Azime and Nebil Mohammed. An Amharic news text classification dataset. ArXiv, abs/2103.05639, 2021.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, pp. 143–153, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

Michael Beukman. Analysing the effects of transfer learning on low-resourced named entity recognition performance. In 3rd Workshop on African Natural Language Processing, 2021.

Ethan C. Chau and Noah A. Smith. Specializing multilingual language models: An empirical study. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 51–61, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3576–3588, Online, June 2021. Association for Computational Linguistics.

Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations, 2021.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020. Association for Computational Linguistics.

Davis David. Swahili: News classification dataset. Zenodo, December 2020. The news version contains both train and test sets.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

Roald Eiselen. Government domain named entity recognition for South African languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 3344–3348, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).

∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360, Online, July 2020. Association for Computational Linguistics.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ArXiv, abs/2111.09543, 2021.

Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. Transfer learning and distant supervision for multilingual transformer models: A study on African languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2580–2591, Online, November 2020. Association for Computational Linguistics.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 09–15 Jun 2019.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174, Online, November 2020. Association for Computational Linguistics.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6035–6044, Online, July 2020a. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020b.

Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. VECO: Variable and flexible cross-lingual pre-training for language understanding and generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3980–3994, Online, August 2021. Association for Computational Linguistics.
Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Abdullahi Salahudeen, Aremu Anuoluwapo, Alípio Jorge, and Pavel Brazdil. NaijaSenti: A Nigerian Twitter sentiment corpus for multilingual sentiment analysis. ArXiv, abs/2201.08277, 2022.

Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 448–462, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.38. URL https://aclanthology.org/2021.naacl-main.38.

Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 5507–5521, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.480. URL https://aclanthology.org/2020.coling-main.480.

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. CAMeL tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7022–7032, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.868.

Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 116–126, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.mrl-1.11. URL https://aclanthology.org/2021.mrl-1.11.

Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 27–38, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.3. URL https://aclanthology.org/2021.emnlp-main.3.

Chester Palen-Michel, June Kim, and Constantine Lignos. Multilingual open text 1.0: Public domain news in 44 languages. CoRR, abs/2201.05609, 2022. URL https://arxiv.org/abs/2201.05609.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10186–10203, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.800. URL https://aclanthology.org/2021.emnlp-main.800.

Sara Rosenthal, Noura Farra, and Preslav Nakov. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 502–518, 2017.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Z. Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David Ifeoluwa Adelani, Amelia Taylor, Jamiil Toure Ali, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima Diop, Davis David, Chayma Fourati, Hatem Haddad, and Malek Naski. AI4D - African language program. ArXiv, abs/2104.02516, 2021.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.195. URL https://aclanthology.org/2020.acl-main.195.
Y. Tang, C. Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation with extensible multilingual pretraining and finetuning. ArXiv, abs/2008.00401, 2020.

Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003. URL https://aclanthology.org/W03-0419.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 5776–5788. Curran Associates, Inc., 2020.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2140–2151, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.188. URL https://aclanthology.org/2021.findings-acl.188.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.

Seid Muhie Yimam, Hizkiel Mitiku Alemayehu, Abinew Ayele, and Chris Biemann. Exploring Amharic sentiment analysis from social media texts: Building annotation tools and classification models. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1048–1060, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.91. URL https://aclanthology.org/2020.coling-main.91.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.

A Appendix

A.1 Monolingual corpora for LAFT and MAFT

For training the MAFT models, we use the aggregation of the monolingual data listed in Table 9.

For the LAFT models, we use the existing XLM-R-base+LAFT models from the MasakhaNER paper (Adelani et al., 2021). However, for the languages not present in MasakhaNER (ara, mlg, orm, sna, som, xho), we use the mC4 corpus, except for eng, for which we use the VOA corpus. For a fair comparison across models, we train the XLM-R-large+LAFT models on the same monolingual corpora used to train the XLM-R-base+LAFT models.
Empirical Methods in Natural Language Process-
ing: System Demonstrations, pp. 38–45, Online,
October 2020. Association for Computational Lin-
guistics. doi: 10.18653/v1/2020.emnlp-demos.
6. URL https://aclanthology.org/
2020.emnlp-demos.6.

Linting Xue, Noah Constant, Adam Roberts, Mi-


hir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya
Barua, and Colin Raffel. mt5: A mas-
sively multilingual pre-trained text-to-text trans-
former. In Proceedings of the 2021 Confer-
ence of the North American Chapter of the As-
sociation for Computational Linguistics: Hu-
man Language Technologies, pp. 483–498, On-
line, June 2021. Association for Computational
Linguistics. doi: 10.18653/v1/2021.naacl-main.
41. URL https://aclanthology.org/
2021.naacl-main.41.

Seid Muhie Yimam, Hizkiel Mitiku Alemayehu,


Abinew Ayele, and Chris Biemann. Explor-
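As a rough illustration of how such a corpus can be used, the sketch below trains a language adapter with the masked language modelling objective on one of the news corpora from Table 10. It assumes the adapter-transformers library (the AdapterHub fork of transformers); the adapter name, adapter configuration, file path, and hyperparameters are illustrative placeholders and not the exact settings used for the adapters or SFTs reported here.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,   # carries adapter methods in adapter-transformers
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "xlm-roberta-base"   # base model the adapter is added to
NEWS_FILE = "hau_news.txt"        # placeholder path to a Table 10 corpus
ADAPTER_NAME = "hau"              # illustrative adapter name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Add a MAD-X style language adapter (with invertible adapters) and freeze
# everything except the adapter weights.
model.add_adapter(ADAPTER_NAME, config="pfeiffer+inv")
model.train_adapter(ADAPTER_NAME)

# One sentence or document per line, as in the news corpora of Table 10.
dataset = load_dataset("text", data_files={"train": NEWS_FILE})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="adapter_hau",
        per_device_train_batch_size=16,  # placeholder hyperparameters
        num_train_epochs=3,
        learning_rate=5e-5,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# Store only the small adapter weights, not a full copy of the model.
model.save_adapter("adapter_hau", ADAPTER_NAME)
```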
Language Source Size (MB) No. of sentences
Afrikaans (afr) mC4 (subset) (Xue et al., 2021) 752.2MB 3,697,430
Amharic (amh) mC4 (subset), and VOA 1,300MB 2,913,801
Arabic (ara) mC4 (subset) 1,300MB 3,939,375
English (eng) mC4 (subset), and VOA 2,200MB 8,626,571
French (fra) mC4 (subset), and VOA 960MB 4,731,196
Hausa (hau) mC4 (all), and VOA 594.1MB 3,290,382
Igbo (ibo) mC4 (all), and AfriBERTa Corpus (Ogueji et al., 2021) 287.5MB 1,534,825
Malagasy (mlg) mC4 (all) 639.6MB 3,304,459
Chichewa (nya) mC4 (all), Chichewa News Corpus (Siminyu et al., 2021) 373.8MB 2,203,040
Oromo (orm) AfriBERTa Corpus, and VOA 67.3MB 490,399
Naija (pcm) AfriBERTa Corpus 54.8MB 166,842
Rwanda-Rundi (kin/run) AfriBERTa Corpus, KINNEWS & KIRNEWS (Niyongabo et al., 2020), and VOA 84MB 303,838
chiShona (sna) mC4 (all), and VOA 545.2MB 2,693,028
Somali (som) mC4 (all), and VOA 1,000MB 3,480,960
Sesotho (sot) mC4 (all) 234MB 1,107,565
Kiswahili (swa) mC4 (all) 823.5MB 4,220,346
isiXhosa (xho) mC4 (all), and Isolezwe Newspaper 178.4MB 832,954
Yorùbá (yor) mC4 (all), Alaroye News, Asejere News, Awikonko News, BBC, and VON 179.3MB 897,299
isiZulu (zul) mC4 (all), and Isolezwe Newspaper 700.7MB 3,252,035

Table 9: Monolingual corpora (after pre-processing, for which we followed the AfriBERTa (Ogueji et al., 2021) approach), their sources, size (MB), and number of sentences.

Language Source Size (MB) No. of sentences


Amharic (amh) VOA (Palen-Michel et al., 2022) 19.9MB 72,125
Hausa (hau) VOA (Palen-Michel et al., 2022) 46.1MB 235,614
Igbo (ibo) BBC Igbo (Ogueji et al., 2021) 16.6MB 62,654
Kinyarwanda (kin) KINNEWS (Niyongabo et al., 2020) 35.8MB 61,910
Luganda (lug) Bukedde 7.9MB 67,716
Luo (luo) Ramogi FM news and MAFAND-MT (Adelani et al., 2022) 1.4MB 8,684
Naija (pcm) BBC 50.2MB 161,843
Kiswahili (swa) VOA (Palen-Michel et al., 2022) 17.1MB 88,314
Wolof (wol) Lu Defu Waxu, Saabal, Wolof Online, and MAFAND-MT (Adelani et al., 2022) 2.3MB 13,868
Yorùbá (yor) BBC Yorùbá 15.0MB 117,124

Table 10: Monolingual News Corpora used for language adapter and SFT training, their sources and size (MB),
and number of sentences.

