Adapting Pre-Trained Language Models To African Languages Via Multilingual Adaptive Fine-Tuning
Jesujoba O. Alabi∗, David Ifeoluwa Adelani∗, Marius Mosbach, and Dietrich Klakow
Spoken Language Systems (LSV), Saarland University, Saarland Informatics Campus, Germany
{jalabi,didelani,mmosbach,dklakow}@lsv.uni-saarland.de
Table 1: Number of sentences in the training, development and test splits. We provide automatic translations of some of the African language words into English (in parentheses) using Google Translate.
Masakhane to address this issue (∀ et al., 2020; Adelani et al., 2021). We only find two major evaluation benchmark datasets that cover a wide range of African languages: one for named entity recognition (NER) (Adelani et al., 2021) and one for sentiment classification (Muhammad et al., 2022). In addition, there are also several news topic classification datasets (Hedderich et al., 2020; Niyongabo et al., 2020; Azime & Mohammed, 2021), but they are only available for a few African languages. Our work contributes novel news topic classification datasets (i.e. ANTC) for five additional African languages: Lingala, Naija, Somali, isiZulu, and Malagasy.

3 Data

3.1 Adaptation corpora

We perform MAFT on 17 African languages (Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Naija, Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa, Yorùbá, and isiZulu) covering the major African language families and 3 high-resource languages (Arabic, French, and English) widely spoken in Africa. We selected the African languages based on the availability of a (relatively) large amount of monolingual texts. We obtain the monolingual texts from three major sources: the mT5 pre-training corpus4, which is based on the Common Crawl Corpus (Xue et al., 2021), the British Broadcasting Corporation (BBC) News, Voice of America News5 (Palen-Michel et al., 2022), and some other news websites based in Africa. Table 9 in the Appendix provides a summary of the monolingual data, including their sizes and sources. We pre-processed the data by removing lines that consist of numbers or punctuation only, and lines with fewer than six tokens.

3.2 Evaluation tasks

We run our experiments on two sentence-level classification tasks, news topic classification and sentiment classification, and one token-level classification task, NER. We evaluate our models on English as well as diverse African languages with different linguistic characteristics.

3.2.1 Existing datasets

NER For the NER task we evaluate on the MasakhaNER dataset (Adelani et al., 2021), a manually annotated dataset covering 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Naija, Kiswahili, Wolof, and Yorùbá) with texts from the news domain. For English, we use data from the CoNLL 2003 NER task (Tjong Kim Sang & De Meulder, 2003), also containing texts from the news domain. For isiXhosa, we use the data from Eiselen (2016). Lastly, to evaluate on Arabic we make use of the ANERCorp dataset (Benajiba et al., 2007; Obeid et al., 2020).

4 https://commoncrawl.org/
5 https://www.voanews.com
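The line-level filter described in Section 3.1 (dropping lines that are only numbers or punctuation, and lines with fewer than six tokens) is straightforward to implement. A minimal sketch in Python; the file names and whitespace tokenization are illustrative assumptions, not the authors' exact script:

```python
import string

def keep_line(line: str, min_tokens: int = 6) -> bool:
    """Return True if a corpus line survives the pre-processing filter:
    at least `min_tokens` whitespace tokens, and not made up of
    digits and punctuation only."""
    tokens = line.split()
    if len(tokens) < min_tokens:
        return False
    stripped = "".join(tokens)
    # Drop lines consisting entirely of numbers and/or punctuation.
    if all(ch.isdigit() or ch in string.punctuation for ch in stripped):
        return False
    return True

with open("raw_corpus.txt", encoding="utf-8") as src, \
        open("clean_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_line(line.strip()):
            dst.write(line)
```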
PLM (size)               # Lang.   African languages covered
XLM-R-base (270M)        100       afr, amh, hau, mlg, orm, som, swa, xho
AfriBERTa-large (126M)   11        amh, hau, ibo, kin, run, orm, pcm, som, swa, tir, yor
XLM-R-miniLM (117M)      100       afr, amh, hau, mlg, orm, som, swa, xho
XLM-R-large (550M)       100       afr, amh, hau, mlg, orm, som, swa, xho
AfroXLMR* (117M-270M)    20        afr, amh, hau, ibo, kin, run, mlg, nya, orm, pcm, sna, som, sot, swa, xho, yor, zul

Table 2: PLMs, their sizes, number of pre-training languages, and the African languages they cover.

News topic classification We use existing news topic datasets for Amharic (Azime & Mohammed, 2021), English – the AG News corpus (Zhang et al., 2015), Kinyarwanda – KINNEWS (Niyongabo et al., 2020), Kiswahili – a news classification dataset (David, 2020), and both Yorùbá and Hausa (Hedderich et al., 2020). For datasets without a development set, we randomly sample 5% of their training instances and use them as a development set.
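A minimal sketch of this 5% development split using the Hugging Face datasets library; the file names and seed are hypothetical:

```python
from datasets import load_dataset

# Hypothetical news-topic dataset shipped only with train and test splits.
raw = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

# Hold out 5% of the training instances as a development set.
split = raw["train"].train_test_split(test_size=0.05, seed=42)
train_set, dev_set, test_set = split["train"], split["test"], raw["test"]
print(len(train_set), len(dev_set), len(test_set))
```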
Table 3: NER model comparison, showing F1-score on the test sets after 50 epochs averaged over 5 runs. Results
are for all 4 tags in the dataset: PER, ORG, LOC, DATE/MISC. For LAFT, we multiplied the size of XLM-R-base
by the number of languages as LAFT results in a single model per language.
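The scores reported in Tables 3 to 5 are F1 values averaged over several independent runs. A minimal sketch of that evaluation protocol for NER with seqeval; the gold labels and per-run predictions below are placeholders:

```python
from statistics import mean
from seqeval.metrics import f1_score

# Gold labels and per-run predictions in IOB2 format (placeholders).
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
runs = [
    [["B-PER", "I-PER", "O", "B-LOC"]],  # predictions from run/seed 1
    [["B-PER", "O", "O", "B-LOC"]],      # predictions from run/seed 2
]

# Entity-level F1 per run, then the average reported in the tables.
scores = [f1_score(gold, pred) for pred in runs]
print(f"avg F1 over {len(runs)} runs: {mean(scores):.3f}")
```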
on a combination of the monolingual corpora used for AfriMT5 adaptation by Adelani et al. (2022). Details for each of the monolingual corpora and languages are provided in Appendix A.1.

Hyper-parameters for MAFT The PLMs were trained for 3 epochs with a learning rate of 5e-5 using huggingface transformers (Wolf et al., 2020). We use a batch size of 32 for AfriBERTa and a batch size of 10 for the other PLMs.

5.1 Vocabulary reduction

Multilingual PLMs come with various parameter sizes, the larger ones having more than a hundred million parameters, which makes fine-tuning and deploying such models a challenge due to resource constraints. One of the major factors that contributes to the parameter size of these models is the embedding matrix, whose size is a function of the vocabulary size of the model. While a large vocabulary is essential for a multilingual PLM trained on hundreds of languages, some of the tokens in the vocabulary can be removed when they are irrelevant to the domain or language considered in the downstream task, thus reducing the vocabulary size of the model. Inspired by Abdaoui et al. (2020), we experiment with reducing the vocabulary size of the XLM-R-base model before adapting via MAFT. There are two possible vocabulary reductions in our setting: (1) removal of tokens before MAFT or (2) removal of tokens after MAFT. From our preliminary experiments, we find approach (1) to work better. We call the resulting model AfroXLMR-small.

To remove non-African vocabulary sub-tokens from the pre-trained XLM-R-base model, we concatenated the monolingual texts from 19 out of the 20 African languages together. Then, we apply sentencepiece to the Amharic monolingual texts and the concatenated texts separately using the original XLM-R-base tokenizer. The frequency of all the sub-tokens in the two separate monolingual corpora is computed, and we select the top-k most frequent tokens from the separate corpora. We used this separate sampling to ensure that a considerable number of Amharic sub-tokens are captured in the new vocabulary; we justify the choice of this approach in Section 5.3. We assume that the top-k most frequent tokens should be representative of the vocabulary of the whole 20 languages. We chose k = 52,000 from the Amharic sub-tokens, which covers 99.8% of the Amharic monolingual texts, and k = 60,000, which covers 99.6% of the other 19 languages, and merged them. In addition, we include the top 1000 tokens from the original XLM-R-base tokenizer in the new vocabulary to include frequent tokens that were not present in the new top-k tokens.9 We note that our assumption above may not hold in the case of some very distant and low-resourced languages, as well as when there are domain differences between the corpora used during adaptation and fine-tuning. We leave the investigation of alternative approaches for vocabulary compression for future work.

5.2 Results and discussion

5.2.1 Baseline results

For the baseline models (top rows in Tables 3, 4, and 5), we directly fine-tune on each of the downstream tasks in the target language: NER, news topic classification and sentiment analysis.

Performance on languages seen during pre-training For NER and sentiment analysis we find XLM-R-large to give the best overall performance. We attribute this to the fact that it has a larger model capacity compared to the other PLMs. Similarly, we find AfriBERTa and XLM-R-base to give better results on languages they have been pre-trained on (see Table 2), and in most cases AfriBERTa tends to perform better than XLM-R-base on languages they are both pre-trained on, for example amh, hau, and swa. However, when the languages are unseen by AfriBERTa (e.g. ara, eng, wol, lin, lug, luo, xho, zul), it performs much worse than XLM-R-base and in some cases even worse than the XLM-R-miniLM. This shows that it may be better

9 This introduced just a few new tokens, which are mostly English tokens, to the new vocabulary. We end up with 70,609 distinct sub-tokens after combining all of them.
Model Size amh eng hau kin lin mlg pcm som swa yor zul avg
Finetune
XLM-R-miniLM 117M 70.4 94.1 77.6 64.2 41.2 42.9 67.6 74.2 86.7 68.8 56.9 67.7
AfriBERTa 126M 70.7 93.6 90.1 75.8 55.4 56.4 81.5 79.9 87.7 82.6 71.4 76.8
XLM-R-base 270M 71.1 94.1 85.9 73.3 56.8 54.2 77.3 78.8 87.1 71.1 70.0 74.6
XLM-R-large 550M 72.7 94.5 86.2 75.1 52.2 63.6 79.4 79.2 87.5 74.8 78.7 76.7
MAFT + Finetune
XLM-R-miniLM 117M 69.5 94.1 86.7 72.0 51.7 55.3 78.1 77.7 87.2 74.0 60.3 73.3
AfriBERTa 126M 68.8 93.7 89.5 76.5 54.9 59.7 82.2 79.9 87.7 80.8 76.4 77.3
XLM-R-base 270M 71.9 94.6 88.3 76.8 58.6 64.7 78.9 79.1 87.8 80.2 79.6 78.2
XLM-R-base-v70k 140M 70.4 94.2 87.7 76.1 56.8 64.4 76.1 79.4 87.4 76.9 77.4 76.9
XLM-R-base+LAFT 270M x 11 73.0 94.3 91.2 76.0 56.9 67.3 77.4 79.4 88.0 79.2 79.5 78.4
Table 4: News topic classification model comparison, showing F1-score on the test sets after 25 epochs averaged
over 5 runs. For LAFT, we multiplied the size of XLM-R-base by the number of languages.
Table 5: Sentiment classification model comparison, showing F1 on the test sets after 20 epochs, averaged over 5 runs. The baseline results for "hau", "ibo", "pcm", and "yor" are obtained from Muhammad et al. (2022). For LAFT, we multiplied the size of XLM-R-base by the number of languages as LAFT results in a single model per language.
to adapt to a new African language from a PLM that has seen numerous languages than from one trained from scratch on a subset of African languages.

LAFT is a strong baseline The results of applying LAFT to the XLM-R-base model are shown in the last row of Tables 3, 4, and 5. We find that applying LAFT on each language individually provides a significant improvement in performance across all languages and tasks we evaluated on. Sometimes, the improvement is very large, for example, +7.4 F1 on Amharic NER and +9.5 F1 for Zulu news-topic classification. The only exception is English, since XLM-R has already seen large amounts of English text during pre-training. Additionally, LAFT models tend to give slightly worse results when adaptation is performed on a smaller corpus.10

5.2.2 Multilingual adaptive fine-tuning results

While LAFT provides an upper bound on downstream performance for most languages, our new approach is often competitive with LAFT. On average, the difference on NER, news topic and sentiment classification is −0.5, −0.2, and −0.3 F1, respectively. Crucially, compared to LAFT, MAFT results in a single adapted model which can be applied to many languages, while LAFT results in a new model for each language. Below, we discuss our results in more detail.

PLMs pre-trained on many languages benefit the most from MAFT We found all the PLMs to improve after we applied MAFT. The improvement is the largest for XLM-R-miniLM, where the performance improved by +4.2 F1 for NER and +5.6 F1 for news topic classification, although the improvement was lower for sentiment classification (+0.8). Applying MAFT on XLM-R-base gave the overall best result. On average, there is an improvement of +2.6, +3.6, and +1.5 F1 on NER, news topic and sentiment classification, respectively. The main advantage of MAFT is that it allows us to use the same model for many African languages instead of many models specialized to individual languages. This significantly reduces the required disk space to store the models, without sacrificing performance. Interestingly, there is no strong benefit of applying MAFT to AfriBERTa. In most cases the improvement is < 0.6 F1. We speculate that this is probably due to AfriBERTa's tokenizer having a limited coverage. We leave a more detailed investigation of this for future work.

10 We performed LAFT on eng using the VOA news corpus (about 906.6MB), which is much smaller than the CC-100 eng corpus (300GB).
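For concreteness, the MAFT step itself is continued masked-language-model training of an existing PLM on the combined multilingual corpus. A minimal sketch with huggingface transformers using the hyper-parameters reported above (3 epochs, learning rate 5e-5, batch size 10 for XLM-R-base); the corpus path, output directory, and sequence length are illustrative assumptions, not the authors' exact training script:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Combined monolingual corpus of the 17 African + 3 high-resource languages,
# one sentence per line (path is hypothetical).
corpus = load_dataset("text", data_files={"train": "maft_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="afro-xlmr-base-maft",
    num_train_epochs=3,             # 3 epochs, as reported
    learning_rate=5e-5,             # learning rate 5e-5, as reported
    per_device_train_batch_size=10, # batch size 10 for XLM-R-base
    save_strategy="epoch",
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```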
Model amh ara eng yor
#UNK F1 #UNK F1 #UNK F1 #UNK F1
AfroXLMR-base 0 76.1 0 79.7 0 92.8 24 82.1
Afro-XLM-R70k (i) 3704 67.8 1403 76.3 44 90.6 5547 81.2
Afro-XLM-R70k (ii) 3395 70.1 669 76.4 54 91.0 6438 81.3
Table 6: Numbers of UNKs when the model tokenizers are applied on the NER test sets.
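A minimal sketch of how such UNK counts can be obtained by running a tokenizer over a test set; the checkpoint name and file path are illustrative:

```python
from transformers import AutoTokenizer

# Any of the compared tokenizers, e.g. the AfroXLMR-base one or a
# reduced-vocabulary variant (checkpoint name illustrative).
tokenizer = AutoTokenizer.from_pretrained("Davlan/afro-xlmr-base")

unk_count = 0
with open("ner_test_sentences.txt", encoding="utf-8") as f:
    for sentence in f:
        ids = tokenizer(sentence.strip())["input_ids"]
        unk_count += sum(1 for i in ids if i == tokenizer.unk_token_id)

print("UNK sub-tokens in the test set:", unk_count)
```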
More efficient models using vocabulary reduction Applying vocabulary reduction helps to reduce the model size by more than 50% before applying MAFT. We find a slight reduction in performance as we remove more vocabulary tokens. The average performance of XLM-R-base-v70k reduces by −1.6, −1.5 and −0.6 F1 for NER, news topic, and sentiment classification compared to the XLM-R-base+LAFT baseline. Despite the reduction in performance compared to XLM-R-base+LAFT, the reduced models are still better than XLM-R-miniLM, which has a similar model size, with or without MAFT. We also find that their performance is better than that of the PLMs that have not undergone any adaptation. We find the largest reduction in performance on languages that make use of non-Latin scripts, i.e. amh and ara — they make use of the Ge'ez script and the Arabic script, respectively. We attribute this to the vocabulary reduction impacting the number of amh and ara subwords covered by our tokenizer.

In summary, we recommend XLM-R-base+MAFT (i.e. AfroXLMR-base) for all languages on which we evaluated, including high-resource languages like English, French and Arabic. If there are GPU resource constraints, we recommend using XLM-R-base-v70k+MAFT (i.e. AfroXLMR-small).

5.3 Ablation experiments on vocabulary reduction

Our results showed that applying vocabulary reduction reduced the model size, but we also observed a drop in performance for different languages across the downstream tasks, especially for Amharic, because it uses a non-Latin script. Hence, we compared different sampling strategies for selecting the top-k vocabulary sub-tokens. These include: (i) concatenating the monolingual texts and selecting the top-70k sub-tokens; (ii) the exact approach described in Section 5.1. The resulting tokenizers from the two approaches are used to tokenize the sentences in the NER test sets for Amharic, Arabic, English, and Yorùbá. Table 6 shows the number of UNKs in each test set after tokenization and the F1 scores obtained on the NER task for these languages. The table shows that the original AfroXLMR tokenizer produced the lowest number of UNKs for all languages, with the highest F1 scores. Note that Yorùbá has 24 UNKs, which is explained by the fact that Yorùbá was not seen during pre-training. Furthermore, using approach (i) gave 3704 UNKs for Amharic, but with approach (ii) there was a significant drop in the number of UNKs and an improvement in F1 score. We noticed a drop in the vocabulary coverage for the other languages as we increased the Amharic sub-tokens. Therefore, we concluded that there is no sweet spot in terms of the way to pick a vocabulary that covers all languages, and we believe that this is an exciting area for future work.

5.4 Scaling MAFT to larger models

To demonstrate the applicability of MAFT to larger models, we applied MAFT to XLM-R-large using the same training setup as for XLM-R-base. We refer to the new PLM as AfroXLMR-large. For comparison, we also trained individual LAFT models using the monolingual data11 from Adelani et al. (2021). Table 7 shows the evaluation result on NER. Averaging over all 13 languages, AfroXLMR-large improved over XLM-R-large by +2.8 F1, which is very comparable to the improvement we obtained between AfroXLMR-base (81.8 F1) and XLM-R-base (79.2 F1). Surprisingly, the improvement is quite large (+3.5 to +6.3 F1) for six out of ten African languages: yor, luo, lug, kin, ibo, and amh. The most interesting observation is that AfroXLMR-large, on average, is either competitive with or better than the individual language LAFT models, including on languages not seen during the MAFT training stage like lug, luo and wol. This implies that AfroXLMR-large (a single model) provides a better alternative to XLM-R-large+LAFT (for each language) in terms of performance on downstream tasks and disk space. AfroXLMR-large is currently the largest masked language model for African languages, and achieves the state of the art compared to all other multilingual PLMs on the NER task. This shows that our MAFT approach is very effective and scales to larger PLMs.

6 Cross-lingual Transfer Learning

The previous section demonstrates the applicability of MAFT in the fully-supervised transfer learning setting. Here, we demonstrate that our MAFT approach is also very effective in the zero-shot cross-lingual transfer setting using parameter-efficient fine-tuning methods.

Parameter-efficient fine-tuning methods like adapters (Houlsby et al., 2019) are appealing because of their modularity, portability, and composability across languages and tasks. Oftentimes, language adapters are trained on a general-domain corpus

11 For languages not in MasakhaNER, we use the same monolingual data as in Table 9.
Model Size amh ara eng hau ibo kin lug luo pcm swa wol xho yor avg
XLM-R-large 550M 76.2 79.7 93.1 90.5 84.1 73.8 81.6 73.6 89.0 89.4 67.9 72.4 78.9 80.8
XLM-R-large+LAFT 550M x 13 79.9 81.3 92.2 91.7 87.7 78.4 86.2 78.2 91.1 90.3 68.8 72.7 82.9 83.2
AfroXLMR-large 550M 79.7 80.9 92.2 91.2 87.7 79.1 86.7 78.1 91.0 90.4 69.6 72.9 85.2 83.4

Table 7: NER model comparison of XLM-R-large, XLM-R-large+LAFT, and AfroXLMR-large, showing F1-score on the test sets.
Model amh hau ibo kin lug luo pcm swa wol yor avg
XLM-R-base (fully-supervised) 69.7 91.0 86.2 73.8 80.5 75.8 86.9 88.7 69.6 78.1 81.2
mBERT (MAD-X) (Ansell et al., 2021) - 83.4 71.7 65.3 67.0 52.2 72.1 77.6 65.6 74.0 69.9
mBERT (MAD-X on news domain) - 86.0 77.6 69.9 73.3 56.9 78.5 80.2 68.8 75.6 74.1
XLM-R-base (MAD-X on news domain) 47.5 85.5 83.2 72.0 75.7 57.8 76.8 84.0 68.2 72.2 75.0
AfroXLMR-base (MAD-X on news domain) 47.7 88.1 80.9 73.0 80.1 59.2 79.9 86.9 69.1 75.6 77.0
mBERT (LT-SFT) (Ansell et al., 2021) - 83.5 76.7 67.4 67.9 54.7 74.6 79.4 66.3 74.8 71.7
mBERT (LT-SFT on news domain) - 86.4 80.6 69.2 76.8 55.1 80.4 82.3 71.6 76.7 75.4
XLM-R-base (LT-SFT on news domain) 54.1 87.6 81.4 72.7 79.5 60.7 81.2 85.5 73.6 73.7 77.3
AfroXLMR-base (LT-SFT on news domain) 54.0 88.6 83.5 73.8 81.0 60.7 81.7 86.4 74.5 78.7 78.8
Table 8: Cross-lingual transfer using LT-SFT (Ansell et al., 2021) and evaluation on MasakhaNER. The fully-supervised baselines are obtained from Adelani et al. (2021) to measure the performance gap when annotated datasets are available. Experiments are performed on 3 tags: PER, ORG, LOC. The average (avg) excludes amh. The best zero-shot transfer F1-scores are underlined.
like Wikipedia. However, when there is a mismatch between the target domain of the task and the domain of the language adapter, it can also impact the cross-lingual performance.

Here, we investigate how we can improve the cross-lingual transfer abilities of our adapted PLM – AfroXLMR-base – by training language adapters on the same domain as the target task. For our experiments, we use the MasakhaNER dataset, which is based on the news domain. We compare the performance of language adapters trained on the Wikipedia and news domains. In addition to adapters, we experiment with another parameter-efficient method based on the Lottery Ticket Hypothesis (Frankle & Carbin, 2019), i.e. LT-SFT (Ansell et al., 2021).

For the adapter approach, we make use of the MAD-X approach (Pfeiffer et al., 2020) – an adapter-based framework that enables cross-lingual transfer to arbitrary languages by learning modular language and task representations. However, the evaluation data in the target languages should have the same task and label configuration as the source language. Specifically, we make use of MAD-X 2.0 (Pfeiffer et al., 2021), where the last adapter layers are dropped, which has been shown to improve performance. The setup is as follows: (1) We train language adapters via masked language modelling (MLM) individually on the source and target languages; the corpora used are described in Appendix A.2. (2) We train a task adapter by fine-tuning on the target task using labelled data in a source language. (3) During inference, task and language adapters are stacked together by substituting the source language adapter with a target language adapter.

We also make use of the Lottery Ticket Sparse Fine-Tuning (LT-SFT) approach (Ansell et al., 2021), a parameter-efficient fine-tuning approach that has been shown to give competitive or better performance than the MAD-X 2.0 approach. LT-SFT is based on the Lottery Ticket Hypothesis (LTH), which states that each neural model contains a sub-network (a "winning ticket") that, if trained again in isolation, can reach or even surpass the performance of the original model. While the LTH is originally a compression approach, the authors of LT-SFT re-purposed it for cross-lingual adaptation by finding sparse sub-networks for tasks and languages that are later composed together for zero-shot adaptation, similar to adapters. For additional details we refer to Ansell et al. (2021).

6.1 Experimental setup

For our experiments, we followed the same setting as Ansell et al. (2021), who adapted mBERT from English CoNLL03 (Tjong Kim Sang & De Meulder, 2003) to African languages (using the MasakhaNER dataset) for the NER task.12 Furthermore, we extend the experiments to XLM-R-base and AfroXLMR-base. For the training of MAD-X 2.0 and sparse fine-tunings (SFTs) for African languages, we make use of monolingual texts from the news domain since it matches the domain of the evaluation data. Unlike Ansell et al. (2021), who trained adapters and SFTs on monolingual data from the Wikipedia domain except for luo and pcm, where such data is absent, we show that the domain used for training language SFTs is also very important.

12 We excluded MISC and DATE from CoNLL03 and MasakhaNER, respectively, to ensure the same label configuration.
For a fair comparison, we reproduced the results of Ansell et al. (2021) by training MAD-X 2.0 and LT-SFT on mBERT, XLM-R-base and AfroXLMR-base on target languages with the news-domain corpus. However, we still make use of the pre-trained English language adapter13 and SFT14 for mBERT and XLM-R-base trained on the Wikipedia domain. For AfroXLMR-base, we make use of the same English adapter and SFT as XLM-R-base because the PLM is already good for the English language. We make use of the same hyper-parameters reported in the LT-SFT paper.

Hyper-parameters for adapters We train the task adapter using the following hyper-parameters: batch size of 8, 10 epochs, "pfeiffer" adapter config, adapter reduction factor of 8, and learning rate of 5e-5. For the language adapters, we make use of 100 epochs or a maximum of 100K steps, a minimum of 30K steps, batch size of 8, "pfeiffer+inv" adapter config, adapter reduction factor of 2, learning rate of 5e-5, and maximum sequence length of 256. For a fair comparison with adapter models trained on the Wikipedia domain, we used the same hyper-parameter settings (Ansell et al., 2021) for the news domain.

6.2 Results and discussion

Table 8 shows the results of MAD-X 2.0 and LT-SFT. We compare their performance to the fully-supervised setting, where we fine-tune XLM-R-base on the training dataset of each of the languages and evaluate on the test set. We find that both MAD-X 2.0 and LT-SFT using the news domain for African languages produce better performance (+4.2 on MAD-X and +3.7 on LT-SFT) than the ones trained largely on the Wikipedia domain. This shows that the domain of the data matters. Also, we find that training LT-SFT on XLM-R-base gives better performance than mBERT on all languages. For MAD-X, there are a few exceptions like hau, pcm, and yor. Overall, the best performance is obtained by training LT-SFT on AfroXLMR-base, and sometimes it gives better performance than the fully-supervised setting (e.g. as observed for kin, lug, wol, and yor). On both MAD-X and LT-SFT, AfroXLMR-base gives the best result since it has been adapted first on several African languages and second on the target domain of the target task. This shows that the MAFT approach is effective since the technique provides a better PLM that parameter-efficient methods can benefit from.

7 Conclusion

In this work, we proposed and studied MAFT as an approach to adapt multilingual PLMs to many African languages with a single model. We evaluated our approach on 3 different NLP downstream tasks and additionally contribute a novel news topic classification dataset for 4 African languages. Our results show that MAFT is competitive to LAFT while providing a single model compared to many models specialized for individual languages. We went further to show that combining vocabulary reduction and MAFT leads to a 50% reduction in the parameter size of XLM-R while still being competitive to applying LAFT on individual languages. We hope that future work improves vocabulary reduction to provide even smaller models with strong performance on distant and low-resource languages. To further research on NLP for African languages and reproducibility, we have uploaded our language adapters, language SFTs, AfroXLMR-base, AfroXLMR-small, and AfroXLMR-mini models to the HuggingFace Model Hub15.

Acknowledgments

Jesujoba Alabi was partially funded by the BMBF project SLIK under the Federal Ministry of Education and Research grant 01IS22015C. David Adelani acknowledges the EU-funded Horizon 2020 projects: ROXANNE under grant number 833635 and COMPRISE (http://www.compriseh2020.eu/) under grant agreement No. 3081705. Marius Mosbach acknowledges funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102. We also thank DFKI GmbH for providing the infrastructure to run some of the experiments. We are also grateful to Stella Biderman for providing the resources to train AfroXLMR-large. We thank Alan Ansell for providing his MAD-X 2.0 code. Lastly, we thank Benjamin Muller, the anonymous reviewers of the AfricaNLP 2022 workshop, and COLING 2022 for their helpful feedback.

13 https://adapterhub.ml/
14 https://huggingface.co/cambridgeltl
15 https://huggingface.co/models?sort=downloads&search=Davlan%2Fafro-xlmr

References

Amine Abdaoui, Camille Pradel, and Grégoire Sigel. Load what you need: Smaller versions of multilingual BERT. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pp. 119–123, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.sustainlp-1.16. URL https://aclanthology.org/2020.sustainlp-1.16.

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3053–3070, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.223. URL https://aclanthology.org/2022.naacl-main.223.

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131, 2021. doi: 10.1162/tacl_a_00416. URL https://aclanthology.org/2021.tacl-1.66.

Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2754–2762, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.335.

Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, and Ivan Vulić. Composable sparse fine-tuning for cross-lingual transfer, 2021.

Israel Abebe Azime and Nebil Mohammed. An Amharic news text classification dataset. ArXiv, abs/2103.05639, 2021.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, pp. 143–153, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-70939-8.

Michael Beukman. Analysing the effects of transfer learning on low-resourced named entity recognition performance. In 3rd Workshop on African Natural Language Processing, 2021. URL https://openreview.net/forum?id=HKWMFqfN8b5.

Ethan C. Chau and Noah A. Smith. Specializing multilingual language models: An empirical study. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 51–61, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.mrl-1.5. URL https://aclanthology.org/2021.mrl-1.5.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3576–3588, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.280. URL https://aclanthology.org/2021.naacl-main.280.

Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=xpFFI_NtgpW.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.

Davis David. Swahili: News classification dataset. Zenodo, December 2020. doi: 10.5281/zenodo.5514203. URL https://doi.org/10.5281/zenodo.5514203. The news version contains both train and test sets.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Roald Eiselen. Government domain named entity recognition for South African languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 3344–3348, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://aclanthology.org/L16-1533.

∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020. URL https://www.aclweb.org/anthology/2020.findings-emnlp.195.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.740. URL https://aclanthology.org/2020.acl-main.740.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ArXiv, abs/2111.09543, 2021.

Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. Transfer learning and distant supervision for multilingual transformer models: A study on African languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2580–2591, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.204. URL https://aclanthology.org/2020.emnlp-main.204.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.372. URL https://aclanthology.org/2020.findings-emnlp.372.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6035–6044, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.537. URL https://aclanthology.org/2020.acl-main.537.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020b. doi: 10.1162/tacl_a_00343. URL https://aclanthology.org/2020.tacl-1.47.

Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. VECO: Variable and flexible cross-lingual pre-training for language understanding and generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3980–3994, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.308. URL https://aclanthology.org/2021.acl-long.308.

Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Abdullahi Salahudeen, Aremu Anuoluwapo, Alípio Jorge, and Pavel Brazdil. NaijaSenti: A Nigerian Twitter sentiment corpus for multilingual sentiment analysis. ArXiv, abs/2201.08277, 2022.

Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 448–462, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.38. URL https://aclanthology.org/2021.naacl-main.38.

Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 5507–5521, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.480. URL https://aclanthology.org/2020.coling-main.480.

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. CAMeL tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7022–7032, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.868.

Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 116–126, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.mrl-1.11. URL https://aclanthology.org/2021.mrl-1.11.

Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 27–38, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.3. URL https://aclanthology.org/2021.emnlp-main.3.

Chester Palen-Michel, June Kim, and Constantine Lignos. Multilingual open text 1.0: Public domain news in 44 languages. CoRR, abs/2201.05609, 2022. URL https://arxiv.org/abs/2201.05609.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10186–10203, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.800. URL https://aclanthology.org/2021.emnlp-main.800.

Sara Rosenthal, Noura Farra, and Preslav Nakov. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 502–518, 2017.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Z. Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David Ifeoluwa Adelani, Amelia Taylor, Jamiil Toure Ali, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima Diop, Davis David, Chayma Fourati, Hatem Haddad, and Malek Naski. AI4D - African language program. ArXiv, abs/2104.02516, 2021.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.195. URL https://aclanthology.org/2020.acl-main.195.

Y. Tang, C. Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation with extensible multilingual pretraining and finetuning. ArXiv, abs/2008.00401, 2020.

Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003. URL https://aclanthology.org/W03-0419.

Seid Muhie Yimam, Hizkiel Mitiku Alemayehu, Abinew Ayele, and Chris Biemann. Exploring Amharic sentiment analysis from social media texts: Building annotation tools and classification models. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1048–1060, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.91. URL https://aclanthology.org/2020.coling-main.91.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
Table 9: Monolingual corpora (after pre-processing, following the AfriBERTa (Ogueji et al., 2021) approach), their sources, sizes (MB), and number of sentences.
Table 10: Monolingual news corpora used for language adapter and SFT training, their sources, sizes (MB), and number of sentences.