2022.naacl-main.23
SwahBERT: Language Model of Swahili

Gati L Martin1, Medard E Mswahili2, Young-Seob Jeong2*, Jiyoung Woo1

1 Soonchunhyang University, Asan-si, Korea
2 Chungbuk National University, Cheongju-si, Korea
{gatimartin, jywoo}@sch.ac.kr
{medardedmund25, ysjay}@chungbuk.ac.kr

Abstract

The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens' opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) has become a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of online forums and news platforms of Swahili, we introduce two datasets of Swahili in this paper: a pre-training dataset of approximately 105MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it with four downstream tasks including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, in almost all downstream tasks.

1 Introduction

Nowadays, online social networking has revolutionized interpersonal communication. The influence of social media on our everyday lives, at both a personal and a professional level, has led recent studies to language analysis in social media (Zeng et al., 2010). In particular, natural language processing (NLP) tools are often used to analyze textual data for various real-world applications: mining social media for information about health (De Gennaro et al., 2020), disease analysis (e.g., COVID-19 (Gao et al., 2020), Ebola (Tran and Lee, 2016)), identifying sentiment and emotion toward products and services, and developing dialog systems (Zhou et al., 2020). Language models have recently drawn much attention as they are known to be effective in many NLP tasks (e.g., text classification, entailment, sequence labeling), but they commonly require a huge amount of data for pre-training and fine-tuning; some models are designed for few-shot learning, which does not require much labeled data for fine-tuning, though they still require plenty of pre-training data. As it is expensive and difficult to get labeled and unlabeled data, the majority of the data are in high-resource languages (HRLs) (e.g., English, Spanish). Unfortunately, other than about 20 HRLs, approximately 7,000 low-resource languages (LRLs) in the world are left behind, where most LRLs are spoken and little written (Magueresse et al., 2020). Africa and India are the main hosts of LRLs, where some languages are spoken by more than 20 million people (e.g., Hausa, Oromo, Zulu, and Swahili). More social media data in LRLs, qualified datasets, and publicly available language models would bring many advantages in various fields, such as education (Obiria, 2019), healthcare (de Las Heras-Pedrosa et al., 2020), entertainment (Ahn et al., 2013), and business.

Swahili, a Bantu language, is one of the two official languages (the other being English) of East African countries such as Tanzania (Petzell, 2012), Kenya, and Uganda. It has spread widely not only as a lingua franca but also as a second or third language across the African continent, and it is used broadly in education, administration, and media. With the rapid development of social networks, electronic commerce, mobile Internet, and other technologies, Swahili is also spreading online, which results in the growth of Web data. For example, JamiiForums is a popular online platform in Tanzania, and it provides a place to discuss different issues, including politics, business, education, and lifestyle; this means more collected textual data of Swahili is available.

By making use of the online textual data of
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 314–324, July 10–15, 2022. ©2022 Association for Computational Linguistics
Swahili, there are several studies for different tasks (e.g., sentiment classification (Obiria, 2019; Noor and Turan, 2019; Seif, 2016), news classification (David, 2020)). Recently, language models have drawn much attention from industry and academia, as language models have brought much better performance (e.g., accuracy) than other existing models. A few studies have employed language models for Swahili: named entity recognition (NER) (Adelani et al., 2021) and sentiment classification (Martin et al., 2021). Although these studies have shown successful results, they are limited in that they just borrow language models (e.g., multilingual Bidirectional Encoder Representations from Transformers (mBERT) (Devlin et al., 2019), Cross-lingual Model-RoBERTa (XLM-R) (Conneau et al., 2020)) pre-trained with other resources (i.e., other languages); in other words, their language models are pre-trained for multiple languages but not dedicated to Swahili. Although such multilingual language models have shown great generalization power across multiple languages, several studies (Bhattacharjee et al., 2021; Tanvir et al., 2021; Vilares et al., 2021) reported that monolingual models often outperform multilingual models. There has been no study that proposed a monolingual language model for Swahili (i.e., a Swahili-specific language model), and the main reason is that Swahili is one of the LRLs, so existing studies have commonly suffered from a lack of available data.

In this paper, we focus on the Swahili language. To the best of our knowledge, this is the first study that collects a pre-training dataset and uses it for pre-training a Swahili-specific language model. We also provide a manually annotated dataset for the emotion classification task. The contributions are summarized as follows.

• Pre-training dataset: we collected Web data from different sources (news sites and social discussion forums) for pre-training the Swahili language model.

• Emotion dataset: we introduce a new Swahili dataset for multi-label emotion classification with six of Ekman's emotions: happy, surprise, sadness, fear, anger, and disgust.

• Swahili language model: we pre-trained the Swahili language model and compared its performance with other language models on several downstream tasks (e.g., emotion classification, news classification, and named entity recognition (NER)).

2 Background

Most African countries have minority languages that are used by specific ethnic groups (approx. 158 in Tanzania 1). However, their people speak the national and official languages of their countries, including native and colonial languages, which are used in public services such as education, politics, and the media. Swahili is a Bantu language widely spoken in sub-Saharan Africa and acts as the common tongue for most of East Africa (Lodhi, 1993; Amidu, 1995). Much of the Swahili vocabulary is derived from loanwords, the vast majority from Arabic, but also from English, Hindi, Portuguese, and other Bantu languages 2. As the language grows, new formal and informal vocabularies emerge. The formal vocabulary is used in official documents, whereas the informal vocabulary is mostly used by young adults and on social media platforms (Momanyi, 2009).

1 https://www.tanzania.go.tz/home/pages/228
2 https://en.wikipedia.org/wiki/Swahili_language

Structurally, Swahili is considered an agglutinative language with polysemous features. Its morphology depends on prefixes and suffixes, which are syllables (Shikali et al., 2019). A single word is generated from morphemes (i.e., stem, prefixes, and affixes) that have corresponding inflectional forms. Nouns are divided into classes on the basis of their singular and plural prefixes. Despite its popularity, a limited amount of textual data is available, and Swahili is one of the low-resource languages (LRLs). Although there have been a few studies that illustrate the value of NLP (Martin et al., 2021; Obiria, 2019; Gelas et al., 2012), they commonly suffered from the lack of available data.

2.1 Existing datasets of Swahili

To overcome the problem of limited language resources, there have been a few datasets for different tasks: a news classification dataset, an NER dataset, a sentiment classification dataset, and an emotion classification dataset.

2.1.1 News classification dataset

This dataset 3 was created and shared by the data science competition platform Zindi (David, 2020). It contains a total of 23,266 instances collected from different news websites in Tanzania.

3 https://zenodo.org/record/4300294?ref=hackernoon.com

There are six categories of news: kitaifa (national), kimataifa (international), biashara (finance), michezo (sports), afya (health), and burudani (entertainment). The amount of each category is 10,242 (national), 1,905 (international), 2,028 (finance), 859 (health), 6,003 (sports), and 2,229 (entertainment), so this dataset is highly imbalanced. Kastanos and Martin (2021) applied a deep learning model called Text Graph Convolutional Network (Text GCN) to this dataset and achieved an F1 score of 75.67% for the news classification task.

2.1.2 Sentiment classification dataset

Obiria (2019) collected 886 posts from Twitter and Facebook to analyze student opinion in Kenyan universities and achieved an accuracy of 83% on a binary classification task using a support vector machine (SVM) (Hearst et al., 1998). Noor and Turan (2019) extracted 1,087 Twitter texts about demonetization in Kenya and performed ternary sentiment classification with Naive Bayes. They applied various feature extraction methods and obtained an accuracy of 70.8%. Recently, Martin et al. (2021) used a cross-lingual model, mBERT, to perform binary sentiment classification on a social media dataset that they manually annotated, and achieved an accuracy of 87.59%. None of the above datasets are publicly available.

2.1.3 Named entity recognition dataset

This dataset, namely MasakhaNER 4 (Adelani et al., 2021), was created for ten African languages, including Swahili. The news texts were collected from local news sources and annotated using the ELISA tool (Lin et al., 2018) by native speakers of each language. The dataset contains a total of 3,006 instances and covers four entity types: personal name (PER), location (LOC), organization (ORG), and date & time (DATE), as inspired by the English CoNLL-2003 corpus (Tjong Kim Sang and De Meulder, 2003). The number of entities of each type is 1,702 (PER), 2,842 (LOC), 960 (ORG), and 940 (DATE). They compared three models (CNN-BiLSTM-CRF, mBERT, and XLM-R) on the NER task, and mBERT and XLM-R achieved F1 scores of 89.36% and 89.46%, respectively.

4 https://github.com/masakhane-io/masakhane-ner/tree/main/data

2.2 Language models

Over the years, models for word representation have been developed and have shown that they are capable of capturing semantic and syntactic dependencies between words: Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). As these models do not incorporate the context of words, many context-aware language models based on the Transformer (Vaswani et al., 2017) were introduced (e.g., Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) and XLM-R (Conneau et al., 2020)). These language models are trainable with a monolingual or multilingual dataset. For example, multilingual BERT (mBERT) is trained with a dataset of 104 languages and a shared vocabulary.

Although mBERT has shown its potential in some previous work, several studies reported limitations of mBERT, especially for LRLs: (1) the limited scale of pre-training data (only Wikipedia was used) (Conneau et al., 2020); (2) the small vocabulary size for a specific language (Wang et al., 2019). To overcome these, XLM-RoBERTa modifies mBERT by increasing the amount of pre-training data, which increases the shared vocabulary between different languages. It provides a strong improvement over mBERT; however, it is outperformed by monolingual models (Tanvir et al., 2021; Bhattacharjee et al., 2021) due to their better representation of morphologically rich languages such as Swahili. This is a good improvement since the model can learn morphological information. Another limitation is the nature of pre-training corpora. Most available corpora are extracted from Wikipedia, BookCorpus, or news blogs, which may not be compatible with tasks that cover multiple domains, such as social media data. In this work, we collect our data from different sources across several domains for our new language model.

3 Dataset

In this section we describe our collected datasets for pre-training and for the downstream task of emotion classification. We open our datasets for future use in various studies 5.

5 https://sites.google.com/view/swahbert/home

3.1 Pre-training dataset

The existing available corpora for Swahili are very small; for example, the Open Super-large Crawled Aggregated coRpus (OSCAR) database contains about 25 megabytes of Swahili text. Using crawler tools, we scraped our own data from different sources such as news Web sites, forums, and Wikipedia. The news Web sites include UN news 6, Voice of America (VoA) 7, Deutsche Welle (DW) 8, and taifaleo 9. We also collected data from JamiiForums, one of the most popular social media websites in Tanzania, founded in 2006. The forum provides a discussion platform for the public on different issues, including politics, business, education, and lifestyle. Since JamiiForums is a discussion platform, most of its contents are either passages of information or short comments. We collected the passages with more than four logically connected sentences. We removed URL links, usernames, and non-textual content (e.g., HTML tags) and filtered out non-Swahili characters (e.g., Latin, Chinese). The size of the dataset is about 105MB with 16M words, where a sentence has an average of 27 subword tokens. Most of these platforms contain data ranging over 5 to 10 years. The contribution (in percentage) of each source was taifaleo (39.4), UN news (28.6), JamiiForums (10.2), Wikipedia (9.5), VoA (7.2), and DW (5.1).

6 https://news.un.org/
7 https://www.voaswahili.com/
8 https://www.dw.com/sw/idhaa-ya-kiswahili/s-11588
9 https://taifaleo.nation.co.ke/

3.2 Emotion classification dataset

Existing non-Swahili datasets typically use annotation schemes based on Ekman (Ekman, 1992), Plutchik (Plutchik, 1980), or multiple categories (Demszky et al., 2020). For example, there are English datasets with multiple emotion categories: Affective Text (Strapparava and Mihalcea, 2007) with 11 categories, CrowdFlower with 14 categories, GoEmotions (Demszky et al., 2020) with 27 categories, and others (Oberländer and Klinger, 2018). In this paper, to construct a new Swahili dataset for emotion classification, we chose to use the 6 emotion categories of Ekman's (Ekman, 1992) scheme: anger (hasira), surprise (mshangao), disgust (machukizo), joy (furaha), fear (woga), and sadness (huzuni). Our dataset is collected from two source types: social media platforms of Swahili and existing emotion datasets of English. The social media platforms include YouTube, JamiiForums 10, and Twitter. The conversations and comments on these platforms cover different topics, such as politics, disease outbreaks, and aspects of daily life. We reviewed the dataset and removed profanity directed at a specific person or ethnic group. For example, in the sentence '[name] is very stupid, she doesn't act like a leader at all,' we replaced the target name with a pronoun. We also selected three existing English datasets with relevant topic coverage and converted them into Swahili using Google Translate. The three datasets are: (1) Dailydialog (Li et al., 2017), which reflects daily communication and covers various topics about daily life, (2) Emotion Cause (Ghazi et al., 2015), and (3) ISEAR (Scherer and Wallbott, 1994), collected from participants with varying cultural backgrounds who completed questionnaires about their experiences and reactions. The Swahili emotion texts obtained from Google Translate were checked and corrected thoroughly by a native Swahili speaker.

10 https://www.jamiiforums.com/

For the dataset collected from Swahili forums, two native Swahili speakers were assigned to annotate the emotion labels. These speakers consented to serve as annotators and were given annotation instructions. They were asked to select one or multiple suitable emotion labels expressed in the text, and to label unsure texts as 'neutral'. The statistics are summarized in Table 1.

Item | Value
# of examples | 12,976
# of labels | 7 (including 'neutral')
# of texts per label | joy: 2,439; disgust: 3,227; anger: 1,772; sadness: 2,339; fear: 1,116; surprise: 2,305; neutral: 863
# of labels per example | 1: 92.09%; 2: 7.45%; 3: 0.46%
Ratio of source | Dailydialog: 13.26%; Emotion Cause: 13.08%; ISEAR: 2.26%; social media: 71.40%
Appx. ratio of taxonomy | politics: 15%; social issues: 80%; pandemic: 5%

Table 1: Statistics of the emotion classification dataset.
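Agreement between the two annotators on a binary include/exclude decision for one emotion label can be measured with Cohen's kappa. A minimal sketch in plain Python, using hypothetical annotation vectors (1 = label assigned to the text, 0 = not assigned); these example vectors are not from the actual dataset:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters making binary decisions on the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    p_a, p_b = sum(a) / n, sum(b) / n                 # marginal rates of label=1
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)           # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-text decisions of the two annotators for a single label:
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 1, 0, 0, 0, 1, 0]
print(cohen_kappa(annotator_1, annotator_2))  # → 0.75
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is preferred over simple percent agreement for annotation studies like this one.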
We also calculated annotator agreement using Cohen's kappa metric, which computes the level of agreement between two annotators who each classify N items into C mutually exclusive categories. The scores for each label are joy (0.835), disgust (0.845), anger (0.763), sadness (0.733), fear (0.694), and surprise (0.806). Figure 1 is a heatmap that shows the degree of relationship between emotions. Emotion pairs with high intensity (e.g., hasira (anger) and machukizo (disgust)) have a positive correlation in multi-label emotion.

Figure 1: Pearson correlation matrix for the multi-emotion labels.

4 SwahBERT

With the collected dataset, we pre-trained a monolingual BERT for Swahili, namely SwahBERT 11. SwahBERT has basically the same architecture as the original BERT. This section describes the process of pre-training and fine-tuning SwahBERT.

11 https://sites.google.com/view/swahbert/home

4.1 Tokenizer

In mBERT, not all languages have equal content size (Wu and Dredze, 2020), and some languages are dominated; for example, Swahili accounts for less than 1% of the approximately 120K-token vocabulary of mBERT. Although Swahili might benefit from high-resource languages, as it shares the same typology (word order) and many loanwords, it would definitely be better to generate a Swahili-specific tokenizer. That is, the multilingual tokenizer often splits words without considering morphological boundaries (e.g., stem, prefixes, and suffixes), as in the sentence in Table 2, so the individual subword units do not have a clear semantic meaning. Swahili is a morphologically rich and polysynthetic language; for example, the word alimpikia (cooked for) has a lexical morph {-pika}, four grammatical morphs {a-, -li-, -m-, -i-}, and two in the verb skeletal morphological frame, which has the root {-pik-} and the Bantu end vowel {-a} (Choge, 2018). In this paper, to incorporate such linguistic complexity, we try monolingual tokenizers for Swahili with different vocabulary sizes (e.g., 32K, 50K, and 70K) using the WordPiece algorithm 12.

12 https://github.com/kwonmha/bert-vocab-builder

4.2 Training

SwahBERT has 12 encoder blocks and 768 hidden units. We employ two unsupervised pre-training tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), as described in (Devlin et al., 2019). We conduct experiments by varying the vocabulary size and the number of training and warmup steps. Following the pre-training process of (Devlin et al., 2019), we pre-trained SwahBERT in two phases: the uncased model was first trained for 600K steps with an input length of 128 and then further trained for an additional 200K steps with an input length of 512. The cased models were trained for 600K and 900K steps initially, with an additional 200K and 100K steps in the second phase. The batch sizes are 32 and 6 for the two phases, respectively, and the parameters were optimized using the Adam optimizer (Kingma and Ba, 2014) with a warmup over the first 1% of the steps to a peak learning rate of 1e-4.

Table 3 gives the results of pre-training; it took around 105 hours to complete all phases using two GeForce GTX 1080 Ti GPUs. The best result was obtained with a vocabulary size of 50K for the uncased models, while a vocabulary size of 32K was best for the cased models. Compared to mBERT, which has a vocabulary size of 119K, the best vocabulary size of SwahBERT seems small. This is consistent with the vocabulary sizes of other monolingual BERT models; for example, 32K for English, 50K for Estonian, and 30K for Dutch. With the pre-trained models, we put an additional layer on top of the models and fine-tuned them in a supervised way with the labeled datasets for the downstream tasks.
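The warmup schedule described above (linear warmup over the first 1% of steps to a peak learning rate of 1e-4) can be sketched in plain Python. The linear decay back to zero after warmup is an assumption carried over from the original BERT recipe; the paper only states the warmup and the peak:

```python
def lr_at_step(step, total_steps=800_000, peak_lr=1e-4, warmup_frac=0.01):
    """Learning rate with linear warmup to peak_lr over the first warmup_frac
    of training, then linear decay to zero (assumed, as in the BERT recipe)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With 800K total steps this warms up over the first 8K steps, reaching 1e-4 at the end of warmup and decaying afterwards.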

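The morphological-boundary problem discussed in Section 4.1 stems from how WordPiece segments a word: greedy longest-match-first against a learned vocabulary, with non-initial pieces marked by '##'. A toy sketch with a small hypothetical vocabulary (not the actual SwahBERT vocabulary) reproduces the segmentation behavior:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword segmentation (WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word is unknown
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary fragment for illustration:
vocab = {"wananchi", "wanatarajia", "fursa", "kede", "##ke", "##de"}
print(wordpiece_tokenize("kedekede", vocab))   # → ['kede', '##ke', '##de']
print(wordpiece_tokenize("wananchi", vocab))   # → ['wananchi']
```

A vocabulary that contains whole Swahili words (or morphologically meaningful pieces) keeps frequent words intact, which is exactly the difference between the mBERT and SwahBERT rows of Table 2.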
Vocabulary | Tokenization
mBERT | wa ##nan ##chi wa ##nata ##raj ##ia fur ##sa ke ##dek ##ede
SwahBERT(32K) | wananchi wanatarajia fursa ke ##de ##ke ##de
SwahBERT(50K) | wananchi wanatarajia fursa kede ##ke ##de
SwahBERT(70K) | wananchi wanatarajia fursa kedekede

Table 2: Tokenization of the sentence 'Wananchi wanatarajia fursa kedekede' (Citizens expect many opportunities) by the mBERT and SwahBERT tokenizers.

Steps | Vocab size | MLM acc | NSP acc | Loss
800K | 32K (uncased) | 73.37 | 99.50 | 1.1822
800K | 50K (uncased) | 76.54 | 99.67 | 1.0667
800K | 70K (uncased) | 73.38 | 100.0 | 1.2131
800K | 32K (cased) | 76.94 | 99.33 | 1.0562
1M | 32K (cased) | 73.81 | 98.17 | 1.2732

Table 3: Accuracy and loss of pre-training.

5 Experiments

We tested our model on downstream tasks and compared it with other models. We put an additional linear layer and an output layer on top of the pre-trained language models, implemented with the HuggingFace PyTorch library. During fine-tuning, the parameters are optimized using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5e-5 and an ϵ parameter of 1e-8. The batch size was set to 32. Table 5 summarizes the averaged F1 scores of the language models on the different downstream tasks, where all language models use uncased vocabularies except for SwahBERTcased. Except for the NER task, SwahBERT outperformed mBERT on all tasks. The statistics of the datasets are summarized in Table 4, where the emotion classification dataset is the one introduced in this paper. Among the tasks, all models achieved much better performance on the news classification task, which might be explained by the fact that the data source of this task is online news documents, which may have characteristics similar to the pre-training dataset collected from online platforms. In the following subsections, detailed results of each task are described, where the best scores were obtained from three independent experiments.

Tasks | Total | Train | Development | Test
Emotion | 12,976 | 9,732 | 1,297 | 1,947
News | 23,266 | 18,612 | 2,327 | 2,327
Sentiment | 7,107 | 5,330 | 710 | 1,067
NER | 3,006 | 2,104 | 300 | 602

Table 4: The number of instances in the datasets of the downstream tasks.

Tasks | SwahBERT | SwahBERTcased | mBERT
Emotion | 64.46 | 64.77 | 60.52
News | 90.90 | 89.90 | 89.73
Sentiment | 70.94 | 71.12 | 67.20
NER | 88.50 | 88.60 | 89.36

Table 5: F1 scores (%) of language models on downstream tasks, where NER indicates named entity recognition.

5.1 Emotion Classification

We use our new dataset for this task and split it into training (75%), development (10%), and test (15%) sets. As shown in Table 5, there is an improvement of 3.94% in F1 score for SwahBERT (64.46) compared to mBERT (60.52). The model exhibits the best performance on emotions like joy (0.80), sadness (0.71), and surprise (0.68), as exhibited in Table 6; this is consistent with the fact that these emotions have a lower correlation with other emotions, allowing the models to classify them more easily. For the neutral (0.25) case, we found that there were many instances of incomplete or uncertain expressions, and this caused confusion with other emotions. This is reasonable, as 'neutral' might not even exist because people are always feeling something (Gasper et al., 2019). For example, 'Hivi aliyekudanganya hivyo nani?' (who lied to you that anyway?) was predicted as disgust, while 'Nani aliyekudanganya?' (who lied to you?) was classified as neutral. As mentioned in (Öhman et al., 2020), such uncertain texts are usually not self-contained, since they are reactions to other posts; that is, the emotion will differ depending on whether we consider the context information or not.

Post: Vifo vya Corona kila kukicha (Corona cases increase every day) [sadness], <sadness>
Com1: Acha tuu (Yeah...) [sadness], <neutral>
Com2: Kila mtu anaongea lake.. (Everyone is quick to say what they wish) [disgust], <neutral>
Com3: Popote, mimi na barakoa yangu (anywhere, with my mask) [fear], <neutral>
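The 75%:10%:15% split used for the emotion dataset (cf. Table 4) can be sketched as a simple shuffle-and-slice. The shuffling and the fixed seed are assumptions for reproducibility; the paper does not describe its exact split procedure:

```python
import random

def split_dataset(examples, ratios=(0.75, 0.10, 0.15), seed=42):
    """Shuffle examples and slice them into train/dev/test by the given ratios."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_dev = int(len(items) * ratios[1])
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]

train, dev, test = split_dataset(range(12_976))
print(len(train), len(dev), len(test))  # → 9732 1297 1947
```

Applied to the 12,976 emotion examples, this yields exactly the 9,732/1,297/1,947 counts reported in Table 4.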
labels | SwahBERT (P / R / F1) | mBERT (P / R / F1)
joy | 0.88 / 0.73 / 0.80 | 0.74 / 0.71 / 0.72
anger | 0.61 / 0.43 / 0.51 | 0.70 / 0.32 / 0.44
sadness | 0.68 / 0.74 / 0.71 | 0.69 / 0.65 / 0.67
disgust | 0.61 / 0.61 / 0.61 | 0.54 / 0.56 / 0.55
surprise | 0.73 / 0.64 / 0.68 | 0.74 / 0.66 / 0.70
fear | 0.65 / 0.61 / 0.63 | 0.65 / 0.56 / 0.60
neutral | 0.30 / 0.22 / 0.25 | 0.33 / 0.11 / 0.16

Table 6: Results of emotion classification, where P, R, and F1 indicate precision, recall, and F1 score, respectively.

The annotators assigned labels based on context, whereas the language models predicted labels without context, and this caused the performance degradation. The example in the box demonstrates how emotions can be affected by contextual information, where the emotion with context is represented as [emotion] and the emotion without context as <emotion>.

5.2 News Classification

We used the existing news classification dataset, split into three sets with a ratio of 80%:10%:10%, equivalent to 18,612:2,327:2,327 instances. As this dataset has six news categories, this task is a classification over six classes: kitaifa (national), kimataifa (international), biashara (finance), michezo (sports), afya (health), and burudani (entertainment). Table 7 shows the results of the SwahBERT and mBERT models. Compared to the existing study of Kastanos and Martin (2021), which achieved an F1 score of 75.56% using graph convolutional networks (GCN), we observed improvements of 14.06% and 15.23% in F1 score with mBERT and SwahBERT, respectively. The performance for the 'health' class is relatively lower than for the others, and the reason might be data imbalance; the 'health' class has a much smaller number of instances than the other classes, as described in subsection 2.1.1.

labels | SwahBERT (P / R / F1) | mBERT (P / R / F1)
national | 0.91 / 0.92 / 0.92 | 0.91 / 0.91 / 0.91
sports | 0.96 / 0.97 / 0.97 | 0.94 / 0.98 / 0.96
entert. | 0.89 / 0.94 / 0.91 | 0.85 / 0.93 / 0.89
business | 0.94 / 0.85 / 0.89 | 0.91 / 0.82 / 0.86
Internat. | 0.90 / 0.89 / 0.90 | 0.91 / 0.84 / 0.88
health | 0.50 / 0.41 / 0.45 | 0.47 / 0.44 / 0.45

Table 7: News classification results, where P, R, and F1 indicate precision, recall, and F1 score.

5.3 Sentiment Classification

As there is no publicly available dataset for this task, we used our emotion dataset, converting some emotion categories into three sentiment classes: positive, negative, and neutral, where we mapped 'joy' to positive, 'disgust' to negative, and left 'neutral' unchanged. For the neutral class, we extracted additional instances from the 'surprise' emotion, because 'surprise' can be mapped midway between negative and positive (Marmolejo-Ramos et al., 2017). We split the dataset into three sets with a ratio of 75%:10%:15%, equivalent to 5,330:710:1,067. Results are presented in Table 8. We found that SwahBERT outperformed mBERT with a gap of 3-6% in F1 score. As we observed in the results of the emotion classification task, the overall performance of the sentiment classification task on the 'neutral' class was much lower than on the other classes.

labels | SwahBERT (P / R / F1) | mBERT (P / R / F1)
negative | 0.70 / 0.70 / 0.70 | 0.65 / 0.69 / 0.67
positive | 0.82 / 0.83 / 0.82 | 0.75 / 0.82 / 0.79
neutral | 0.59 / 0.60 / 0.59 | 0.58 / 0.48 / 0.53

Table 8: Sentiment classification results, where P, R, and F1 indicate precision, recall, and F1 score.

5.4 Named Entity Recognition

We used the MasakhaNER (Adelani et al., 2021) dataset for this task, which has a 70%:10%:20% ratio for the training, development, and test sets. As shown in Table 5, we did not observe a performance improvement of SwahBERT over the mBERT of (Adelani et al., 2021). The biggest reason for this is likely the small size of the dataset compared to the other downstream tasks, as shown in Table 4. That is, the small NER dataset was not enough for SwahBERT to learn the underlying patterns for the NER task, so there was no performance improvement compared to the multilingual language model. We believe that the NER performance of SwahBERT will increase as we keep gathering more NER data.
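The per-label scores in Tables 6-8 follow the standard definitions of precision, recall, and F1. A minimal sketch from raw true-positive/false-positive/false-negative counts (the counts below are hypothetical, for illustration only):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-label precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=60, fp=40, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.6 0.75 0.67
```

F1 is the harmonic mean of precision and recall, which is why a class like 'neutral', weak on both, scores so low in Table 6 for both models.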
Figure 2: Histogram of downstream task datasets, where the x-axis represents vocabulary words of SwahBERT.

Swahili-specific BERT model. For pre-training purposes, we managed to collect a corpus of about 105 MB. Although our corpus is much smaller than those of rich-resource languages (e.g., English), the pre-trained SwahBERT has shown great improvement on the downstream tasks. This result is consistent with previous studies. For example, Micheli et al. (2020) found that well-performing language models can be obtained with corpora as small as 100 MB. Similarly, in Martin et al. (2020), language models pre-trained on a 4 GB dataset performed comparably to those pre-trained on a 138 GB dataset. Nevertheless, we believe that more high-quality data will help to increase the power of language models.

As demonstrated in the experimental results, SwahBERT is generally superior to mBERT in almost all downstream tasks. We believe that our tokenizer with its Swahili vocabulary makes the biggest contribution to these results. The tokenizer of mBERT shares a vocabulary over multiple languages and tends to split words without taking morphological boundaries (e.g., stem, prefix, and suffix) into account, as shown in Table 2, even though Swahili is a morphologically rich language. The tokenizer of SwahBERT accommodates most single words (e.g., wanatarajia (expects), fursa (opportunity)) as one token and thus helps the model obtain better representations.

We examined the characteristics of the datasets through frequency histograms of vocabulary words, as depicted in Figure 2. The emotion classification dataset and the sentiment classification dataset have similar histograms, and the NER dataset and the news classification dataset seem similar to each other. This explains why the language models (e.g., SwahBERT, mBERT) achieved similar performance (e.g., 88.5% to 90.9% F1 scores) on the news classification and NER tasks, and similar performance (e.g., 60.52% to 71.12% F1 scores) on the emotion classification and sentiment classification tasks. Another interesting point is that SwahBERT showed no improvement over mBERT on the NER task. The main reason for this is, of course, the small size of the NER dataset, but we examined further details through similarity scores between the pre-training dataset and the downstream task datasets, as shown in Table 9. The similarity score is computed with a cosine similarity function over word frequencies in the datasets. Note that the NER dataset has a much lower score than the others, which implies that the language models had a smaller chance to learn linguistic patterns relevant to the NER task. This can be resolved if we collect more data for the NER task.

Tasks       Cosine similarity
News        98.616
NER         52.465
Emotion     84.445
Sentiment   81.543

Table 9: Similarity scores between the pre-training dataset and the datasets of the downstream tasks.
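The computation behind Table 9 can be sketched as follows: build a word-frequency vector for each dataset and take the cosine of the angle between the two vectors. This is an illustrative sketch only; the exact preprocessing (tokenization, casing, vocabulary filtering) is an assumption, the 0-100 scaling of the reported scores is inferred from the table, and the toy Swahili strings are hypothetical.

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-frequency vectors of two corpora."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Dot product over the words the two vocabularies share.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy corpora standing in for the pre-training and NER datasets.
pretraining = "rais alisema kuwa uchumi wa nchi unakua kwa kasi"
ner = "rais alisema leo mjini dar es salaam"
score = 100.0 * cosine_similarity(pretraining, ner)  # scaled to 0-100 as in Table 9
```

A dataset that shares little vocabulary with the pre-training corpus (as with the NER dataset) yields a low score, matching the pattern in Table 9.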
We also believe that collecting more qualified data for different tasks in low-resource languages (LRLs) will significantly contribute to various future NLP applications (e.g., social network services and news recommendations).

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2020R1I1A3053015). This work was also supported by the Chungbuk National University BK21 program (2022).

7 Conclusion
In this study, we introduced our pre-training corpus and an annotated dataset for the emotion classification task. The emotion classification dataset contains 7 emotion classes, including neutral, and has approximately 13K instances. We also performed pre-training of a monolingual BERT for Swahili, namely SwahBERT, and experimentally compared it with the multilingual BERT (mBERT). SwahBERT outperformed mBERT in almost all downstream tasks, which include emotion classification, news classification, sentiment classification, and NER. Although SwahBERT exhibited superior performance with a relatively small pre-training corpus, a more qualified pre-training corpus will definitely contribute to the development of better language models. Therefore, with the growth of digital platforms for Swahili, we will continue to use the available sources, including native Swahili speakers as annotators, and collect more data from different domains. We hope that this study will facilitate the development of other methodologies and pre-trained language models (e.g., XLM-R) and also aid social services (e.g., user emotion analysis on forum texts).

Ethical Considerations

The two native Swahili annotators are authors of this paper, and they received fair compensation from the project fund. We collected data from a few sources, and all annotated datasets are in the Swahili language. We anonymized all texts and confirm that our datasets may be disclosed for academic purposes according to the applicable policies and laws (e.g., UN News [13], Voice of America [14], Deutsche Welle [15], and Taifa Leo [16]). We gratefully acknowledge every institute and company that approved a generous data policy for academic purposes.

[13] https://www.un.org/en/about-us/terms-of-use
[14] https://www.voanews.com/p/5338.html
[15] https://www.gesetze-im-internet.de/englisch_urhg/englisch_urhg.html
[16] https://www.nationmedia.com/terms-of-use/

References

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.

JoongHo Ahn, Sehwan Oh, and Hyunjung Kim. 2013. Korean pop takes off! Social media strategy of Korean entertainment industry. In 2013 10th International Conference on Service Systems and Service Management, pages 774–777.

Assibi Apatewon Amidu. 1995. Kiswahili: People, language, literature and lingua franca. Nordic Journal of African Studies, 4(1):104–123.

Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Md Saiful Islam, M. Sohel Rahman, Anindya Iqbal, and Rifat Shahriyar. 2021. BanglaBERT: Combating embedding barrier in multilingual models for low-resource language understanding. CoRR, abs/2101.00204.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Susan C Choge. 2018. A morphological classification of Kiswahili. Kiswahili, 80(1).
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

Davis David. 2020. Swahili: News classification dataset.

Mauro De Gennaro, Eva G Krumhuber, and Gale Lucas. 2020. Effectiveness of an empathic chatbot in combating adverse effects of social exclusion on mood. Frontiers in Psychology, 10:3061.

Carlos de Las Heras-Pedrosa, Pablo Sánchez-Núñez, and José Ignacio Peláez. 2020. Sentiment analysis and emotion understanding during the COVID-19 pandemic in Spain and its impact on digital ecosystems. International Journal of Environmental Research and Public Health, 17(15):5542.

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Junling Gao, Pinpin Zheng, Yingnan Jia, Hao Chen, Yimeng Mao, Suhong Chen, Yi Wang, Hua Fu, and Junming Dai. 2020. Mental health problems and social media exposure during COVID-19 outbreak. PLoS ONE, 15(4):e0231924.

Karen Gasper, Lauren A Spencer, and Danfei Hu. 2019. Does neutral affect exist? How challenging three beliefs about neutral affect can advance affective research. Frontiers in Psychology, 10:2476.

Hadrien Gelas, Laurent Besacier, and François Pellegrino. 2012. Developments of Swahili resources for an automatic speech recognition system. In Spoken Language Technologies for Under-Resourced Languages.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In CICLing (2), pages 152–165.

Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28.

Alexandros Kastanos and Tyler Martin. 2021. Graph convolutional network for Swahili news classification. arXiv preprint arXiv:2103.09325.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Ying Lin, Cash Costello, Boliang Zhang, Di Lu, Heng Ji, James Mayfield, and Paul McNamee. 2018. Platforms for non-speakers annotating names in any language. In Proceedings of ACL 2018, System Demonstrations, pages 1–6. Association for Computational Linguistics.

Abdulaziz Y Lodhi. 1993. The language situation in Africa today. Nordic Journal of African Studies, 2(1):11–11.

Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. ArXiv, abs/2006.07264.

Fernando Marmolejo-Ramos, Juan C Correa, Gopal Sakarkar, Giang Ngo, Susana Ruiz-Fernández, Natalie Butcher, and Yuki Yamada. 2017. Placing joy, surprise and sadness in space: a cross-linguistic study. Psychological Research, 81(4):750–763.

Gati L Martin, Medard E Mswahili, and Young-Seob Jeong. 2021. Sentiment classification in Swahili language using multilingual BERT. arXiv preprint arXiv:2104.09006.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219. Association for Computational Linguistics.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853–7858, Online. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Clara Momanyi. 2009. The effects of 'Sheng' in the teaching of Kiswahili in Kenyan schools. Journal of Pan African Studies.

Ibrahim Moge Noor and Metin Turan. 2019. Sentiment analysis using Twitter dataset. IJID (International Journal on Informatics for Development), pages 84–94.

Laura Ana Maria Oberländer and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Peter Obiria. 2019. Swahili text classification using support vector machine and feature selection to enhance opinion analysis in Kenyan universities. PAC University Journal of Arts and Social Sciences, 2(2).

Emily Öhman, Marc Pàmies, Kaisla Kajava, and Jörg Tiedemann. 2020. XED: A multilingual dataset for sentiment analysis and emotion detection. In The 28th International Conference on Computational Linguistics (COLING 2020).

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Malin Petzell. 2012. The linguistic situation in Tanzania. Moderna språk, 106(1):136–144.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

Klaus R Scherer and Harald G Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hassan Seif. 2016. Naïve Bayes and J48 classification algorithms on Swahili tweets: Performance evaluation. International Journal of Computer Science and Information Security, 14(1):1.

Casper S Shikali, Zhou Sijie, Liu Qihe, and Refuoe Mokhosi. 2019. Better word representation vectors using syllabic alphabet: a case study of Swahili. Applied Sciences, 9(18):3648.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 70–74. Association for Computational Linguistics.

Hasan Tanvir, Claudia Kittask, Sandra Eiche, and Kairit Sirts. 2021. EstBERT: A pretrained language-specific BERT for Estonian. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 11–19. Linköping University Electronic Press, Sweden.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Thanh Tran and Kyumin Lee. 2016. Understanding citizen reactions and Ebola-related information propagation on social media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 106–111. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

David Vilares, Marcos Garcia, and Carlos Gómez-Rodríguez. 2021. Bertinho: Galician BERT representations. CoRR, abs/2103.13799.

Hai Wang, Dian Yu, Kai Sun, Janshu Chen, and Dong Yu. 2019. Improving pre-trained multilingual model with vocabulary expansion. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 316–327, Hong Kong, China. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130. Association for Computational Linguistics.

Daniel Zeng, Hsinchun Chen, Robert Lusch, and Shu-Hsing Li. 2010. Social media analytics and intelligence. IEEE Intelligent Systems, 25(6):13–16.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.