Doctoral Thesis
By:
Yemane Keleta Tedla
Supervisor:
Assoc. Prof. Kazuhide Yamamoto
Natural Language Processing Lab
Thesis committee:
Assoc. Prof. Kazuhide Yamamoto (Principal supervisor)
Prof. Takashi Yukawa
Prof. Koichi Yamada
Prof. Hiromi Ban
Prof. Hideko Shibasaki
Abstract
Natural language has a central role in the communication process of human beings.
Natural Language Processing (NLP) is the branch of artificial intelligence that enables
machines to understand and process human language. NLP products have extensive
applications in our day-to-day activities including grammar correction, spam filter-
ing, personal digital assistance, language translation, recommendation systems, and
so on. Significant NLP solutions have been reported for resourceful languages such as
English. However, the same is not true for most of the world’s languages. For instance,
Google Translate currently supports about 100 languages out of over 6000 languages
in the world. While there are several reasons that contribute to this digital divide,
the absence or scarcity of resources is a major bottleneck impeding NLP advances in
low-resource languages. Tigrinya is one of the languages with very limited language
resources. It is a morphologically rich Semitic language spoken by over seven mil-
lion people in Eritrea and northern Ethiopia. We aim to initiate Tigrinya language
processing from the foundation by constructing essential annotated and unannotated
text corpora and building fundamental NLP components. In the process of resource
building, we constructed a news text corpus of over 15 million tokens, containing a
lexicon of over 593,000 tokens. We processed this corpus to generate important word
lists including Tigrinya stop words and affix lists. The corpus is preprocessed using
several tools developed for Ge’ez-to-Latin script transliteration, clitic normalization,
and text cleaning. In earlier research, a part of the corpus containing 72,080 tokens
was manually annotated for parts-of-speech (POS). In this research, we constructed
another new resource that consists of over 45,000 morphologically segmented tokens
extracted from the POS tagged corpus. Moreover, we compiled and properly aligned
an English-to-Tigrinya parallel corpus for machine translation research.
These resources are employed in the following research. First, we approached
POS tagging as a classification as well as a sequence labeling problem, employing
support vector machines (SVM) and conditional random fields (CRF) respectively.
We utilized the unique morphological patterns of Tigrinya to boost performance, particularly on unknown words. Our method doubled the accuracy on unknown words from around 39% to 80%. As a result, the overall accuracy obtained was 90.89%.
Furthermore, we obtained a state-of-the-art accuracy of 91.6% by approaching POS tagging as sequence-to-sequence labeling with bidirectional long short-term memory (BiLSTM) networks and word embeddings, forgoing manual feature engineering. Second,
we presented the first research on morphological segmentation for Tigrinya. We ex-
plored language-independent character and substring features based on CRF. In addi-
tion, we obtained a state-of-the-art F1 score of 95.07% with BiLSTM networks using
concatenated character and word embedding. This approach does not require feature
engineering to extract linguistic information, which is useful for languages lacking
sufficient resources. In this research, we explored several Begin-Inside-Outside (B-I-O) tagging schemes to identify the most effective strategy for Tigrinya morphological segmentation. Finally, we explored English-to-Tigrinya statistical machine translation. Translation from English into the morphologically rich Tigrinya entails several challenges, including the out-of-vocabulary (OOV) problem, high language model perplexity, and poor word alignment. We introduced shallow and fine-grained morphological segmentation to mitigate these problems and improve the convergence of the two
languages. Generally, we observed that translation quality can improve by using the
morphologically segmented models.
Acknowledgements
First, I thank God for his wonderful gifts in my life that no words of gratitude can
adequately express. I would not be where I am today without the mercy and grace of
the almighty God.
I am sincerely thankful to my advisor, Assoc. Prof. Kazuhide Yamamoto, for
accepting me into his lab and guiding me throughout the doctoral study. I will always
be grateful for his outstanding mentoring, motivation, and kindness. I also extend my
heartfelt gratitude to Assoc. Prof. Ashu Marasinghe, my former advisor during my
master’s and the first semester of my doctoral study.
Special thanks go to my thesis committee members, Prof. Takashi Yukawa, Prof. Koichi Yamada, Prof. Hiromi Ban, and Prof. Hideko Shibasaki, for their precious
time, insightful comments, and guidance.
I would also like to thank Dr. Chenchen Ding from NICT Japan for his continued
interest and useful comments.
It was a great joy to work with my labmates at the Natural Language Processing
Lab. I have learned a lot from everyone during our seminars, literature introduction,
and intense discussions. My labmates were there for me every time I needed any
kind of help. I will always treasure our memories together. I also thank my friends
Bereket Samuel and his wife Hanan Hussein for their company and love.
I feel honored to express my gratitude to the people and government of Japan
for the financial support provided through the Super Global Monbukagakusho Scholarship. Furthermore, I was supported by the Rotary Yoneyama Memorial Foundation Scholarship. I extend my heartfelt thanks to the peace-loving Rotarians at Nagaoka Nishi Club for their gracious generosity and hospitality. My Rotary counselor, Mr.
Masanori Sanjo, deserves special appreciation for his fatherly care. I am extremely
honored to be a part of the Rotary family.
Moreover, I am grateful to Mr. Tsurusaki Tsuneo from the JICA Eritrea Liaison
office for his continued support and kindness throughout my study. I would also like
to thank the Koide family, the Osaki family, and Ishiyama san for their warmth and
love that made me feel right at home.
I am very grateful to all the Tigrinya scholars who have taught me through their
writings, books, and discussions. I cannot thank Memhir Amare Weldeslassie enough
for offering me two great books of Tigrinya Grammar without hesitation. I am also
grateful to the author Engineer Tekie Tesfay for his encouragement, clarifications,
and for sharing his Tigrinya grammar book and other useful documents. I thank the
Ministry of Information, Eritrea, for providing documents and allowing the use of
Haddas Ertra articles for the research.
I am thankful to the administration and my colleagues at the Eritrea Institute of
Technology, who supported me in the pursuit of my studies.
Last but not least, I thank all of my family and friends for investing in my
life in every way. Their prayers, love, inspiration, and support have helped me stay
committed to my studies over the years. I am truly blessed to have them all in my
life.
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 The Task of Morphological Segmentation . . . . . . . . . . . . . . 4
1.5 Morphological Segmentation for Machine Translation . . . . . . . . 5
1.6 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Tigrinya Language 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Words and Morphemes . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Writing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Tigrinya Morphology . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Morphological Ambiguity . . . . . . . . . . . . . . . . . . . . . . 15
2.6 POS Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Resource Construction 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Creating Large-scale Text Corpus . . . . . . . . . . . . . . . . . . 21
3.3 Nagaoka POS Tagged Tigrinya Corpus . . . . . . . . . . . . . . . 22
3.4 Morphologically Segmented Corpus . . . . . . . . . . . . . . . . . 23
3.5 English-Tigrinya Parallel Corpus . . . . . . . . . . . . . . . . . . . 23
5.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.6 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.7.1 The Effect of the Tagset Design . . . . . . . . . . 53
5.3.7.2 The Effect of the Prefix and Suffix Features . . . 54
5.3.7.3 The Effect of the Pattern Features . . . . . . . . . 55
5.3.7.4 The Effect of the Data Size . . . . . . . . . . . . 56
5.3.7.5 Error Analysis . . . . . . . . . . . . . . . . . . . 58
5.4 Experiment: POS Tagging Based on seq2seq LSTM and BiLSTM . 60
5.4.1 Datasets and Tagsets . . . . . . . . . . . . . . . . . . . . . 60
5.4.2 Training Parameters . . . . . . . . . . . . . . . . . . . . . 60
5.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.4 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . 61
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8 Final Remarks 97
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . 99
8.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.3.1 Improving the Quality of the NTC and the Performance of
the POS Tagger . . . . . . . . . . . . . . . . . . . . . . . . 100
8.3.2 Improving Segmented Corpus and Integrating Minimally Su-
pervised Approaches . . . . . . . . . . . . . . . . . . . . . 101
8.3.3 Creating a Large Parallel Corpus and Exploring Neural Ma-
chine Translation . . . . . . . . . . . . . . . . . . . . . . . 101
Bibliography 103
List of Figures
1.1 The difficulty of aligning the Tigrinya word (unsegmented) with the
English words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Segmentation improves proper word alignment. . . . . . . . . . . . 7
5.1 The performance of the tagger improves with the availability of more
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 The performance of the tagger improves with the availability of more
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
List of Tables
5.5 Morphological features and the error rate (%) of unknown words for
selected POS tags in CRF-based experiment. . . . . . . . . . . . . . 54
5.6 The accuracy of the CRF-based tagger (Tagset 1) as data size increases 56
5.7 Tag-wise precision, recall, and F1-score for the SVM-based tagger
(Tagset 2). “Support” is the number of words for the respective tag. . . 58
5.8 Results of LSTM and BiLSTM models for POS tagging in sequence-
to-sequence setting. The vocabulary sizes marked as “top10k” and
“all” denote the most frequent 10,000 tokens and all the vocabularies
(18,740 tokens) respectively. . . . . . . . . . . . . . . . . . . . . . 62
List of Abbreviations
Chapter 1
Introduction
1.1 Introduction
Tigrinya belongs to the Semitic language branch of the Afroasiatic family, along with
Hebrew, Amharic, Maltese, Tigre, and Arabic. Tigrinya has over 7 million speakers
in Eritrea and northern Ethiopia. As noted in [1], although resource-rich languages
such as English have well-developed language tools, low-resource languages suffer from either poor-quality or altogether absent electronic data support for pursuing NLP research. The Tigrinya language is one such low-resource language. Unlike major Semitic languages that enjoy relatively widespread NLP research and resources,
the Tigrinya language has been largely ignored in NLP-related research, mainly due
to the absence of a Tigrinya text corpus. A text corpus is a large collection of text
intended for linguistic processing. Although a few electronic dictionaries and a lexicon of automatically extracted Tigrinya words exist, no linguistically annotated text corpus has been made publicly available. Given this circumstance, new research is needed to initiate NLP research from the foundation by constructing a corpus, and thereby language tools, for the advancement of information access in Tigrinya. Therefore, in earlier research, we constructed and released the first POS
tagged Tigrinya corpus containing 72,080 words [2] and developed a POS tagger
based on this corpus [3]. In this study, we continue this research to improve the per-
formance of the tagger. However, our primary objective in this research is to create
a new morphologically segmented corpus and investigate the first Tigrinya morpho-
logical segmentation with language-independent and automatic feature engineering
approaches that leverage the rich morphology of Tigrinya. We are also interested in
exploring English-to-Tigrinya machine translation using the Bible, the only publicly
available English-Tigrinya parallel corpus. We introduce morphological segmenta-
tion on the target language, Tigrinya, to enhance translation quality by improving the
convergence of these two languages. The detailed research objectives are presented
in section 1.2. Furthermore, in this chapter, we describe the tasks of POS tagging,
morphological segmentation, and machine translation.
2. In our earlier research, we constructed the first POS-tagged corpus for Tigrinya
(the Nagaoka Tigrinya Corpus, NTC). We also developed a POS tagger based
on transformation-based learning and hidden Markov models. However, given the small size of the corpus and the morphological complexity of Tigrinya, the performance of the tagger, particularly with regard to unknown words, was quite low. In this research, we are interested in improving the performance of this
POS tagger. Therefore, we intend to introduce methods that exploit the unique
morphological pattern of Tigrinya and employ automatic feature engineering
such as word embedding to boost performance.
In the sections that follow, we present a brief introduction to POS tagging, mor-
phological segmentation, and the use of morphological segmentation in machine
translation.
The interpretation of Example 1.1 may differ significantly according to which of the several possible POS roles each word takes. For instance, assuming
the word class of “Flies” is noun, the sentence conveys that “flies (the insects) love a
flower.” However, if the word “Flies” takes on the role of a verb, then the meaning
changes to “something performs the action of flying just like a flower does.” Sim-
ilarly, the word “like” can function as a preposition, adverb, conjunction, noun, or
verb, and the word flower can appear as a noun or a verb. Therefore, the correct
translation of Example 1.1 requires resolving the most likely POS tag of each word
within the given context.
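As an illustration (and not the system developed in this thesis), the kind of contextual disambiguation described above can be sketched with a bigram hidden Markov model decoded by the Viterbi algorithm. The tags and all probabilities below are toy values invented for the “flies/like/flower” example.

```python
# Toy bigram HMM decoded with Viterbi; all probabilities are invented.

def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (score, path) for the best sequence ending in tag t
    best = {t: (start.get(t, 1e-9) * emit.get((t, words[0]), 1e-9), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                ((best[p][0] * trans.get((p, t), 1e-9)
                  * emit.get((t, w), 1e-9), best[p][1])
                 for p in tags),
                key=lambda x: x[0])
            new[t] = (score, path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["N", "V", "P"]
start = {"N": 0.6, "V": 0.3, "P": 0.1}
trans = {("N", "V"): 0.5, ("N", "P"): 0.2, ("V", "P"): 0.4,
         ("V", "N"): 0.3, ("P", "N"): 0.7}
emit = {("N", "flies"): 0.4, ("V", "flies"): 0.3,
        ("V", "like"): 0.5, ("P", "like"): 0.3,
        ("N", "flower"): 0.6, ("V", "flower"): 0.1}

print(viterbi(["flies", "like", "flower"], tags, trans, emit, start))
# → ['N', 'V', 'N']
```

With these toy numbers, the decoder selects the “flies (the insects) love a flower” reading; changing the emission probabilities of “like” would flip the analysis, which is exactly the context sensitivity a tagger must model.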
The number of POS categories varies from language to language, depending on
the inherent properties of languages and the target application. For example, the
tagset for the morphologically rich Arabic language defined in the Penn Arabic Treebank comprises up to 2,000 fine-grained POS tags [4], whereas the Penn Treebank POS tagset for English contains only 45 tags [5]. Moreover, a universal tagset that consists
of twelve tags was proposed in an effort to standardize the most frequent and useful
POS tags across multiple languages [6]. For the current Tigrinya research, a set of
73 tags was defined to capture three levels of grammatical information. The complete tagset and the corresponding experiments are detailed in Chapter 5. There are,
broadly, two computational approaches to automatic POS tagging [7]. The rule-based
approach is the classical tagging method that relies on linguistically motivated dis-
ambiguation rules. The main downside of this method is the expensive development
and maintenance cost incurred. The second and the most widely adopted approach
follows statistical models that compute the most probable tag sequence given the
word sequence from a POS tagged corpus. We adopted the latter approach based on
support vector machine (SVM) classification [8], conditional random fields (CRF) [9], and long short-term memory (LSTM) neural network sequence labeling architectures [10]. POS tagging is a necessary prior step in many NLP applications such as
syntactic parsing and word sense disambiguation (WSD).
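As a sketch of the statistical approach, the following shows the kind of prefix/suffix feature extraction commonly fed to CRF or SVM taggers for handling unknown words; the exact feature set of our system differs, and the sample transliterated tokens are used only for illustration.

```python
# A minimal sketch (not our exact feature set) of affix features for a
# statistical tagger: character n-grams from both ends of the token plus
# its immediate neighbors.

def token_features(sent, i, max_affix=3):
    w = sent[i]
    feats = {"word": w.lower(), "len": len(w)}
    for n in range(1, max_affix + 1):
        if len(w) > n:
            feats[f"prefix{n}"] = w[:n]   # leading characters, e.g. "ke"
            feats[f"suffix{n}"] = w[-n:]  # trailing characters, e.g. "du"
    feats["prev"] = sent[i - 1].lower() if i > 0 else "<s>"
    feats["next"] = sent[i + 1].lower() if i < len(sent) - 1 else "</s>"
    return feats

sent = ["nIsu", "keyIdu"]
print(token_features(sent, 1))
```

Because such features generalize across word forms, a tagger can still guess the POS of an unseen token from its affixes, which is why they are especially useful for morphologically rich languages.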
language [11]. In this thesis, we focus on the task of detecting morphological bound-
aries, also referred to as morphological segmentation. This task involves the break-
ing down of words into their component morphemes. For example, the English word
“reads” can be segmented into “read” and “s”, where “read” is the stem and “s” is an
inflectional morpheme marking the third-person singular of the verb.
Morphological segmentation is useful for several downstream NLP tasks, such as
morphological analysis, POS tagging, stemming, and lemmatization. Segmentation
is also applied as an important preprocessing phase in a number of systems, including
machine translation, information retrieval, and speech recognition.
Segmentation is mainly performed using rule-based or machine learning approaches. Rule-based approaches can be quite expensive and language-dependent because the morphemes and all the affixation rules need to be identified to disambiguate segmentation boundaries. Machine learning approaches, on the other hand, are data-driven: the underlying structure is automatically extracted from the data. We present
supervised morphological segmentation based on CRFs [9] and LSTM neural net-
works [10]. Since morphemes are sequences of characters, we address the problem
as a sequence tagging task and propose a fixed-size window approach for modeling
contextual information of characters. CRFs are well suited to this kind of sequence-aware classification task. We also exploit the long-distance memory capabilities of
LSTMs for modeling boundaries of morphemes.
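The character-level formulation described above can be sketched as follows; the simple B/I labeling and window size shown here are one variant of the schemes compared later in the thesis, not the definitive configuration.

```python
# A sketch of per-character boundary labels and fixed-window features.

def to_bi_labels(segmented):
    """'read-s' -> chars ['r','e','a','d','s'], labels ['B','I','I','I','B']"""
    chars, labels = [], []
    for morpheme in segmented.split("-"):
        for j, c in enumerate(morpheme):
            chars.append(c)
            labels.append("B" if j == 0 else "I")
    return chars, labels

def window_features(chars, i, size=2):
    """Characters in a fixed window around position i, padded at the edges."""
    padded = ["<pad>"] * size + list(chars) + ["<pad>"] * size
    return {f"c{k - size}": padded[i + k] for k in range(2 * size + 1)}

chars, labels = to_bi_labels("read-s")
print(labels)             # → ['B', 'I', 'I', 'I', 'B']
print(window_features(chars, 0))
```

The CRF then learns to predict the B/I label of each character from such windows; reconstructing the segmentation afterwards is simply a matter of cutting the word before every predicted B.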
As mentioned earlier, Tigrinya is a highly inflected language, while English is weakly
inflected [12]. A single word in Tigrinya may be expressed by one or more words
when translated into English. Consider the translation examples in Table 1.1, which show a considerable unevenness in token count as the Tigrinya verb is inflected further.
Table 1.1: An inflected Tigrinya word is expressed with multiple
words in English.
In the compiled Bible corpus, the English side has 1,002,214 tokens while the
Tigrinya side contains 623,984 tokens. Accordingly, the Tigrinya side has around 38% fewer tokens than the English side. This mismatch causes data sparseness on the Tigrinya
side, aggravating the OOV problem caused by insufficient data. Moreover, source-
target word alignment cannot be performed properly as illustrated in Fig. 1.1.
Decomposing the inflected Tigrinya words into their constituent morphemes as
shown in Fig. 1.2 improves the alignment of words. Moreover, the granularity of
segmentation has been shown to affect the performance of translation systems [13].
Therefore, we explore the effects of coarse-grained (affix-based) and fine-grained (morpheme-based) segmentation. In affix-based segmentation, we apply shallow pruning to find the longest prefix/suffix, resulting in prefix-stem-suffix segments (Example 1.2), while in morphological segmentation, we attempt detailed morpheme-level segmentation (Example 1.3).
Chapter 2
Tigrinya Language
2.1 Introduction
This chapter is an introduction to the relevant aspects of the Tigrinya language. In this
chapter, we first describe words and morphemes in Tigrinya and present the Ge’ez
script, which is a special writing system of Eritrea and Ethiopia. Next, we discuss
the unique non-concatenative morphology of Tigrinya, followed by the challenges in
Tigrinya morphological analysis and part-of-speech disambiguation.
Table 2.2: The base alphabets (first order) of the Ge’ez script.
ሀ ለ ሐ መ ሠ ረ ሰ ሸ ቀ ቐ በ ተ ቸ ኀ ነ ኘ አ ከ
he le He me se re se she qe Qe be te che he ne Ne ’e ke
ኸ ወ ዐ ዘ ዠ የ ደ ጀ ገ ጠ ጨ ጰ ጸ ፀ ፈ ፐ ቨ
Ke we Oe ze Ze ye de je ge Te Ce Pe Se Se fe pe ve
Table 2.3: Ge’ez script: examples of seven-ordered and five-ordered alphabets.
ሀ ሁ ሂ ሃ ሄ ህ ሆ
he hu hi ha hE hI ho
ኰ ኲ ኳ ኴ ኵ
kWe kWi kWa kWE kWI
The Ge’ez script is an abugida2 system where each letter (alphabet) represents a joint consonant-vowel (CV) syllable. Accordingly, Tigrinya distinguishes seven vowels, usually called “orders” [14]. There are also a few alphabets that are variants of some of the 35 base alphabets and have only five orders (Table 2.3). In sum, about 275 symbols make up the Tigrinya alphabet chart known as ፊደል Fidel. The base alphabets from which the orders stem are shown in Table 2.2.
The Ge’ez script does not mark the lengthening of consonants or gemination in
pronunciation; however, this limitation does not seem to create difficulties for native
1 http://mgafrica.com/article/2014-06-11-in-the-race-between-african-scripts-and-the-latin-alphabet-only-ethiopia-and-eritrea-are-in-the-game
2 The term “abugida” (አቡጊዳ) itself is derived from letters of the Ge’ez alphabet chart. አ (a), ቡ (bu), ጊ (gi), and ዳ (da) are the first, second, third, and fourth alphabets of the a, b, g, and d orders respectively.
speakers. In this thesis, Tigrinya words are transliterated to Latin characters accord-
ing to the SERA transliteration scheme [15], with the addition of “I” for explicit
marking of the epenthetic vowel traditionally known as ሳድስ SadIsI.
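A minimal sketch of such a transliteration step, using only the seven correspondences of the letter ሀ listed in Table 2.3; a full SERA transliterator covers the entire Fidel chart, which this toy mapping does not attempt.

```python
# Toy Ge'ez-to-Latin transliteration table restricted to the ሀ row
# (he hu hi ha hE hI ho) from Table 2.3.
SERA = {"ሀ": "he", "ሁ": "hu", "ሂ": "hi", "ሃ": "ha",
        "ሄ": "hE", "ህ": "hI", "ሆ": "ho"}

def transliterate(text):
    # Characters without a mapping are passed through unchanged.
    return "".join(SERA.get(ch, ch) for ch in text)

print(transliterate("ሀሁ ሆ"))  # → "hehu ho"
```

Because every Fidel symbol encodes a CV syllable, the mapping is a straightforward character-to-string substitution; the case distinctions (hE vs. he) preserve the vowel order in ASCII.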
Fig. 2.3 shows the order and position of Tigrinya affixes. The circumfix is formed
from two constituents that are found at the prefix (CIRC-1) and the suffix side (CIRC-
2) illustrated by the arc in the figure.
The affixes represent various morphological features including gender, person,
number, tense, aspect, mood, voice, and so on. Furthermore, there are clitics of
mostly prepositions and conjunctions that can be affixed to other words. Some of
2.4. Tigrinya Morphology 13
the typical linguistic features of a Tigrinya verb are illustrated in Fig. 2.1. This single token in Tigrinya is expressed by multiple words in English. If the token is segmented into morphological units, a great deal of linguistic information can be extracted, which may be leveraged for disambiguation tasks.
Specifically, the order of morpheme slots is defined by [17] as follows.
(prep|conj)(rel)(neg)sbjSTEMsbj(obj)(neg)(conj)
The “prep” slot refers to preposition prefixes while the “conj” slot represents con-
junctions that can appear as both prefixes and suffixes. The “rel” indicates a rela-
tivizer (the prefix ዝ/ዚ zI/zi) corresponding in function to the English demonstratives
such as that, which, and who. The “sbj” on either side of the STEM are a prefix
and/or suffix of the four verb types, namely perfective, imperfective, imperative, and
gerundive. As shown in examples 2.1 and 2.4, the perfective and gerundive verbs con-
jugate only on suffixes while imperfective verbs undergo both the prefix and suffix
inflections (example 2.2). The imperatives either show the suffix-only conjugations
or change prefix as well (example 2.3). In addition to the verb type, these fusional
morphemes convey gender, person, and number information.
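The slot template above can be approximated, for illustration only, by a regular expression over hyphen-separated morphemes. The affix inventories below are tiny toy sets drawn from examples in this chapter (bI, InIte, the relativizer zI/zi, the ayI-STEM-nI circumfix, and the enclitics do/ke/Ke); the subject and object agreement affixes are not modeled here, and the stems (keyIde, keyIdu) are illustrative forms.

```python
import re

# Toy affix inventories; a real grammar lists many more morphemes,
# including the sbj/obj agreement affixes, which are folded into the
# stem slot below.
PREP_CONJ = r"(?:bI|InIte)"
REL = r"(?:zI|zi)"
NEG_PRE = r"ayI"
NEG_SUF = r"nI"
CONJ_ENC = r"(?:do|ke|Ke)"

TEMPLATE = re.compile(
    rf"^(?:{PREP_CONJ}-)?(?:{REL}-)?(?:{NEG_PRE}-)?"
    rf"[A-Za-z']+"                      # sbj+STEM+sbj, not split further here
    rf"(?:-{NEG_SUF})?(?:-{CONJ_ENC})?$"
)

for word in ["ayI-keyIde-nI", "zI-keyIde", "keyIdu-do", "do-keyIdu"]:
    print(word, bool(TEMPLATE.match(word)))
# The first three respect the slot order; "do-keyIdu" places the
# enclitic before the stem and is rejected.
```

Such a template can serve as a sanity check on segmenter output: any hypothesis whose morphemes violate the slot order can be discarded before scoring.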
The word order typology is normally subject-object-verb (SOV), though there are
cases in which this sequence may not apply strictly [18], [19]. Changes in “sbj” verb
affixes, along with pronoun inflections, enforce subject-verb agreements. One as-
pect of the non-concatenative morphology in Tigrinya is the circumfixing of negation
morpheme in the structure “ayI-STEM-nI” [17]. Some conjunction enclitics such as
“do; ke; Ke” can also be found in Tigrinya orthography as mostly bound suffix mor-
phemes. For example,
KeyIdudo? keyId + u + do (did he go?)
nIsuKe? nIs + u + Ke (what about him?)
The pronominal object marker “obj” is always suffixed to the verb as shown in ex-
amples 2.7 to 2.10. According to [18], Tigrinya suffixes of object pronoun can be
categorized into two constructs. The first is described in relation to verbs (examples
2.7, 2.8 and 2.9) and the other indicates the semantic role of applicative cases by
inflecting for prepositional enclitic “lI” + a pronominal suffix as in example 2.10.
Example 2.10. beliOI + lu/obj (he ate for him/he ate using [it])
Adverbs can also be derived from nouns by prefixing “bI-”, which has a similar function to the English “-ly” suffix (example 2.13).
In this work, we deal with boundaries of prefix, stem, and suffix morphemes.
Inflections related to infixes (internal stem changes) are not feasible for this type of
segmentation.
InItezeyIHatetIkayomI
InIte-zeyI-HatetI-ka-yomI
InIte-zeyIHatetIkayomI
InIte-zeyI-HatetIkayomI
InIte-zeyI-HatetI-kayomI
InItezeyIHatetIka-yomI
InItezeyIHatetI-kayomI
InItezeyI-HatetIkayomI
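The ambiguity illustrated by the list above can be made concrete with a small sketch: given the four true morpheme boundary positions of InItezeyIHatetIkayomI, every subset of them yields a distinct candidate segmentation (the list above shows several of the 16), and the segmenter must select the correct one.

```python
from itertools import combinations

# Enumerate all candidate segmentations obtainable by activating any
# subset of the given boundary positions.
def segmentations(word, boundaries):
    out = []
    for r in range(len(boundaries) + 1):
        for combo in combinations(boundaries, r):
            cuts = [0, *combo, len(word)]
            out.append("-".join(word[a:b] for a, b in zip(cuts, cuts[1:])))
    return out

word = "InItezeyIHatetIkayomI"
# 5, 9, 15, 17 are the character offsets of InIte|zeyI|HatetI|ka|yomI.
cands = segmentations(word, [5, 9, 15, 17])
print(len(cands))                             # → 16
print("InIte-zeyI-HatetI-ka-yomI" in cands)   # → True
```

With n candidate boundaries the hypothesis space grows as 2^n, which is why boundary detection is cast as a learned per-position decision rather than exhaustive search over whole segmentations.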
Example 2.18. እዚ ቤት እዚ ናተይ እዩ Izi bEtI Izi nateyI eyu, (This house is mine)
In example 2.17, Izi (this) functions as a pronoun, whereas in example 2.18, “this” modifies the noun (house) and is hence an adjective. In Tigrinya, demonstrative adjectives tend to repeat themselves in the pattern Izi NOUN Izi. Furthermore, an
adjective may take the role of a noun, a relative verb, a pronoun, a proper noun, or an
interjection in a sentence [19]. As described in relation to morphological ambiguity
(section 2.5), inflection and affixation of Tigrinya words may also render lexically
ambiguous words. Looking again at the prefix ብ bI (by), in some contexts, it in-
troduces ambiguity as the role of a noun is turned into an adverb [21] when “bI” is
prefixed. For instance, in the word ብመምህር bImemIhIrI (by a teacher), the prefix
(bI) functions as a preposition. However, for the word ብትብዓት bItIbIOatI (bravely),
the word assumes the role of an adverb, specifically, in the event it modifies a verb.
However, “bI” is not always a prefix; sometimes it is simply part of the word. For example, the word ብስራት bIsIratI (good news), which can appear as a noun or a proper noun, starts with “bI”. This ambiguity can be partly resolved by a combination of stemming and lexicon look-up to check if the stem exists in the lexicon
or, alternatively, looking at the context for disambiguating information such as the part-of-speech of the word being modified. In this research, we applied the latter alternative to disambiguate the POS based on contextual information.
One of the features of Tigrinya as a Semitic language is that it has two tenses (per-
fective and imperfective) but several aspects (causative, reflexive, reciprocal, and so
on) and the imperative/jussive mood [14]. Other tenses, such as the future tense, are
expressed by combining one of these tenses with the auxiliary verbs. This distinction
becomes useful when using syntactic information to understand the arrangement of
words for POS disambiguation. For example, in the phrase ዘሊሉ ሃደመ zelilu hademe
(he escaped by jumping), the gerundive zelilu describes the way the action of escaping
was performed and thus functions as an adverb instead of a verb [21].
Nouns in Tigrinya decline for gender, number, case, and definiteness. However, noun declension does not follow a regular pattern almost 75% of the
time [14]. Similarly, declension of adjectives takes place for gender and number. In
general, the complexity arising from Tigrinya grammar leads to several ambiguities
related to all parts of speech.
There is also POS ambiguity arising from the writing system of Tigrinya. The phenomenon of gemination creates ambiguity in the POS of words. This ambiguity results from the absence in the Ge’ez alphabet of notation symbols, such as those found in the orthographies of European languages, for representing the gemination of consonants. For example, the
word ሰበረ sebere can mean (he broke), or it can be a type of legume if “be” in sebere is geminated. Furthermore, the widespread use of cliticized words may pose serious
problems in POS tagging because of the orthographic variation it creates. During
Tigrinya cliticization, certain unpronounced characters of a word are omitted and
replaced by apostrophes. For example, the compound word መምህር’ዩ memIhIrI’yu,
(he is a teacher), is a combination of the noun memIhIrI and the auxiliary verb Iyu.
However, during the fusion of these words, the first character of the auxiliary verb
is omitted by cliticization. The proposed system includes some character recovery
rules and has normalized the corpus in order to reduce these types of orthographic
variations. Orthographic variations may aggravate the out-of-vocabulary problem
that often occurs with low-resource languages. However, the character recovery pro-
cess is not always straightforward because there are some combinations that require
more contextual or semantic knowledge to determine the proper missing character.
For example, the word ከመይ’ላ kemeyI’la could be resolved to either kemeyI ala, (how
is she) or kemeyI ila (how did she). Issues with the recovery of such ambiguous cases
Chapter 3
Resource Construction
3.1 Introduction
One of the main purposes of this research is constructing various corpora to support
the development of Tigrinya NLP. In earlier research, we constructed a POS tagged
corpus that was used to explore the first POS tagging research using methods such as
hidden Markov models (HMM). The following sections provide a brief description of additional resources. These resources include a Tigrinya text corpus,
a morphologically segmented corpus, and a parallel corpus. These resources are em-
ployed to investigate Tigrinya word embeddings, morphological segmentation, and
machine translation.
formatted using tools developed for preprocessing the corpus. First, the raw text was
formatted into a sentence-based format. Then, we normalized the corpus to unify
over 60 different styles of Tigrinya word cliticizations into a common format. At
this stage, the total raw text corpus consisted of 15.1 million tokens. This corpus can
be used to generate various vocabularies. For example, we generated a unique lexi-
con vocabulary that consists of over 593,000 tokens. We also analyzed the top 4000
frequent words in the corpus and created a list of stop words by filtering irrelevant
entries manually. This list was further enriched with the preposition, conjunction,
adverb, auxiliary verb, and pronoun lists from the NTC POS tagged corpus. More-
over, the corpus was used for constructing word embeddings as explained in section
3.6. For this reason, other versions of the corpus were prepared with further cleaning.
For the first version, only the stop words were removed; for the second version, all hapaxes, foreign scripts, digits, and punctuation were additionally removed. This cleaning process reduced the corpus to about 41.9% of its original size, around 6.3 million tokens. In
order to analyze the effect of removing stop words, we experimented with both the
original and the cleaned version of the corpus. Our findings are more meaningful and
consistent with the cleaned version of the corpus, and hence, we report those results
in section 3.6.
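The frequency-based step of this stop-word construction can be sketched as follows; the toy corpus is invented, and in practice the top candidates are filtered manually and enriched with closed-class word lists from the POS-tagged corpus.

```python
from collections import Counter

# Take the most frequent tokens as stop-word candidates; manual
# filtering of irrelevant entries follows in practice.
def stopword_candidates(tokens, top_k):
    return [w for w, _ in Counter(tokens).most_common(top_k)]

toy_corpus = "IyU nIsu keyIdu IyU nIsu IyU".split()
print(stopword_candidates(toy_corpus, 2))  # → ['IyU', 'nIsu']
```

Frequency alone over-generates (frequent content words also surface), which is why the manual filtering pass and the closed-class POS lists are essential complements.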
Figure 3.3: English verses are merged to match the merged verses in the Tigrinya text when constructing the verse-aligned parallel corpus (verses 17 and 18 are merged).

Original English verses:
15 And let them be for lights in the arch of heaven to give light on the earth: and it was so.
16 And God made the two great lights: the greater light to be the ruler of the day, and the smaller light to be the ruler of the night: and he made the stars.
17 And God put them in the arch of heaven, to give light on the earth;
18 To have rule over the day and the night, and for a division between the light and the dark: and God saw that it was good.

After merging:
15 And let them be for lights in the arch of heaven to give light on the earth: and it was so.
16 And God made the two great lights: the greater light to be the ruler of the day, and the smaller light to be the ruler of the night: and he made the stars.
17-18 And God put them in the arch of heaven, to give light on the earth; To have rule over the day and the night, and for a division between the light and the dark: and God saw that it was good.
3.6 Creating and Analyzing Word Embeddings
Word embedding refers to the encoding of raw text into vectors of numbers that are con-
venient for use by machine learning algorithms. These numerical representations
encode the semantic and syntactic relations of words in a language.
26 Chapter 3. Resource Construction
Several architectures have been proposed for building word embeddings and using
them as features to improve NLP tasks; we mention a few relevant ones here. [25]
proposed a new distributed representation of words that processes very large datasets
at a significantly lower computational cost. This work introduced two model archi-
tectures known as the continuous bag of words (CBOW) and the Skip-gram model.
They also demonstrated simple vector algebraic operations that capture semantic re-
lations of words in the embedding space. In a follow-up work [26], the same authors
introduced the method of negative sampling to further improve computational efficiency
as well as the quality of the vectors. Later, [27] presented GloVe, an approach that forms
a word co-occurrence matrix using global corpus statistics. Several other studies
applied word embeddings as features to improve the POS tagging of many languages,
including Arabic, a Semitic language [28], [29]. [28] used a Gaussian hidden Markov
model (HMM) and CRF to show that careful initialization of models and regener-
ating word embeddings improve unsupervised POS induction. Interestingly, [29]
achieved near state-of-the-art results, using word embeddings as the sole features
with a neural network POS tagger. Evaluating the quality of word representations
is very important. [30] designed an evaluation framework that combines intrinsic
and extrinsic measurements with detailed analysis and crowdsourcing. Generally,
the CBOW method performed well in many of the designed tasks. In this research,
we report experiments using CBOW and Skip-gram embeddings based on Mikolov
et al.’s work [25], [26].
3.6.3 Method
The two algorithms we used for our analysis are CBOW and Skip-gram distributed
word representations [25]. Both models are simple neural network architectures with
input, projection (hidden) and output layers. Backpropagation with stochastic gra-
dient descent is used to learn weights. Their difference is that CBOW predicts the
target word from the contextual information whereas the Skip-gram model predicts
the surrounding words, given the target word. These methods are briefly explained
in the following subsections.
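The opposite prediction directions of the two models can be illustrated by the (input, target) training pairs each one derives from a sentence. This small sketch uses English placeholder tokens and is purely illustrative:

```python
def training_pairs(tokens, window, mode):
    """Generate (input, target) pairs: CBOW maps context -> target word,
    Skip-gram maps target word -> each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window),
                                            min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((tuple(context), target))
        else:  # skip-gram
            pairs.extend((target, c) for c in context)
    return pairs

sent = ["the", "cat", "sat"]
print(training_pairs(sent, 1, "cbow"))      # context tuple -> target
print(training_pairs(sent, 1, "skipgram"))  # target -> each context word
```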
In the CBOW model, the training objective is to discover word embeddings that predict
the target word given the context words. Let $V$ and $N$ denote the vocabulary
size and the hidden layer size, respectively. Furthermore, $x = (x_1, x_2, \ldots, x_V)$ is a one-
hot encoded input vector where $x_k = 1$ and $x_{k'} = 0$ for all $k' \neq k$. The
weight matrix $W$ between the input and the hidden layers is a $V \times N$ matrix.
Specifically, each row of $W$ is the N-dimensional vector $v_w$ of the corresponding word $w$.
That is:
$$W_{V \times N} = \begin{pmatrix} w_{11} & w_{12} & w_{13} & \cdots & w_{1N} \\ w_{21} & w_{22} & w_{23} & \cdots & w_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{V1} & w_{V2} & w_{V3} & \cdots & w_{VN} \end{pmatrix} \tag{3.1}$$

and

$$x = (x_1, x_2, \ldots, x_V)^T \tag{3.2}$$

$$h = x^T W \tag{3.3}$$
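Because $x$ is one-hot, equation 3.3 amounts to selecting the k-th row of $W$. A small numpy check with arbitrary random weights illustrates this:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))   # input-to-hidden weight matrix (eq. 3.1)

k = 2
x = np.zeros(V)
x[k] = 1.0                    # one-hot input vector (eq. 3.2)

h = x @ W                     # eq. 3.3: h = x^T W
assert np.allclose(h, W[k])   # the product simply selects row k of W
print(h)
```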
Moreover, the weight matrix between the hidden and the output layer is an
$N \times V$ matrix $W' = \{w'_{ij}\}$. Consequently, the score $u_j$ (the degree of match between
the context and the next word) for each word in the vocabulary is obtained from the
dot product between the hidden (context) representation $h$ and the output representation
of the candidate target word $v'_{w_j}$:

$$u_j = {v'_{w_j}}^T h \tag{3.4}$$
where $v'_{w_j}$ is the j-th column of the matrix $W'$. The posterior probability of words
is computed using the softmax classification model given by:

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \tag{3.5}$$
where $y_j$ is the output of the j-th unit of the output layer. Substituting Equations
3.3 and 3.4 in Equation 3.5, we have:

$$p(w_j \mid w_I) = \frac{\exp\!\left({v'_{w_j}}^T v_{w_I}\right)}{\sum_{j'=1}^{V} \exp\!\left({v'_{w_{j'}}}^T v_{w_I}\right)} \tag{3.6}$$

where the input word $w_I$ is represented by the vectors $v_w$ and $v'_w$, obtained
from the rows of $W$ and the columns of $W'$, respectively.
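A small numpy sketch of equations 3.4 and 3.5, using arbitrary random weights, shows how the scores are turned into a probability distribution over the vocabulary:

```python
import numpy as np

def softmax(u):
    u = u - u.max()               # shift for numerical stability
    e = np.exp(u)
    return e / e.sum()

rng = np.random.default_rng(1)
V, N = 5, 3
W_out = rng.normal(size=(N, V))   # hidden-to-output matrix W'
h = rng.normal(size=N)            # hidden (context) representation

u = W_out.T @ h                   # scores u_j = v'_{w_j}^T h  (eq. 3.4)
p = softmax(u)                    # posterior p(w_j | w_I)     (eq. 3.5)
assert np.isclose(p.sum(), 1.0) and (p > 0).all()
print(p)
```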
In the Skip-gram model, given a target word, the training objective is to discover word
representations that would help in predicting the surrounding words in the associated
context.
The Skip-gram model is given by:
$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \tag{3.7}$$
where wI is the input word, wO,c is the c-th word in the output context and wc,j is
the j-th word on the c-th panel of the output layer. Moreover, yc,j is the output of the
j-th word on the c-th panel of the output.
3.6.4 Experiments
As mentioned earlier, the data collected for this experiment consists of news articles
and text from the Tigrinya Bible. Our reports are based on the cleaned version of the
data, which contains over 6.3 million tokens. We employed the word2vec tool from
gensim (https://radimrehurek.com/gensim/models/word2vec.html) for the experiments.
Furthermore, we varied the minimum word count between two, four, and six, meaning
that words occurring fewer times than the minimum count in the corpus were not
considered. We ran an extensive set of tuning experiments combining the following
parameters and report the optimal setting, which yields the best similarity test
results. The settings include the following:
Algorithm = [CBOW, Skip-gram]
Window size = [2, 4, 6]
Dimension = [100, 200, 300]
Negative sampling = [0, 5, 10]
Hierarchical softmax = [0, 1]
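Enumerating the full cross-product of these settings (plus the three minimum-count values) can be sketched as follows. The dictionary keys are just labels here, loosely mirroring gensim's Word2Vec parameter names:

```python
from itertools import product

# The parameter grid explored in the experiments
grid = {
    "algorithm": ["cbow", "skipgram"],
    "window": [2, 4, 6],
    "size": [100, 200, 300],
    "negative": [0, 5, 10],
    "hs": [0, 1],
    "min_count": [2, 4, 6],
}

settings = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(settings))  # 2 * 3 * 3 * 3 * 2 * 3 = 324 configurations
```

Each configuration is then used to train a model, and the similarity tests decide which one is reported.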
3.6.4.2 Evaluation
Evaluations on word embeddings are usually conducted for intrinsic qualities and
extrinsic improvements [30]. Intrinsic evaluation is a direct measure of the word
vector’s quality, using similarity, analogy and matching measurements. On the other
hand, extrinsic evaluations measure the contribution of word embeddings in improv-
ing external NLP tasks, such as POS tagging and NER. We analyzed both intrinsic
and extrinsic evaluations and reported the models and parameters that gave the best
results in our experiments. The unknown words are first tagged using the CRF-based
tagger from Chapter 5 (referred to as CRF-morph here). We then report the reduction
in error rate by isolating the wrongly assigned unknown words and retagging through
majority voting based on word embeddings.
3.6.4.3 Baseline
The performance of CRF-morph was 90.89% overall and 80% on unknown words
using tagset-2 (the reduced tagset of 20 tags). This score is used as the baseline. Our
results are generalized for the 12 major tags which are verbs, nouns, pronouns, ad-
jectives, adverbs, prepositions, conjunctions, interjections, numerals, foreign words,
punctuation, and unclassified tags.
3.6.5 Results
Queries for syntactic relatedness returned matching responses when the window size
was two. In contrast, the accuracy of the "city-country" type of semantic relatedness
improved with wider context. Likewise, a majority of the responses to similarity
queries from models with narrow context contained more morphological inflections,
whereas wider-context models produced better semantic relationships. These findings of syntactic relatedness are
in line with the reports of [28] that applied word embeddings for POS induction and
NER.
The POS tagging research that utilizes morphological pattern features is described in
Chapter 5. This section explains how word embeddings can be utilized to augment
the POS generalization skill of the models. Specifically, the analysis will focus on
tailoring embeddings to improve the accuracy of tagging unknown words.

Semantic relatedness
Relatedness        | Word pair 1                       | Word pair 2
capital-country    | asImera ErItIra (Asmara Eritrea)  | karItumI sudanI (Khartoum Sudan)
country-continent  | afIriqa ErItIra (Africa Eritrea)  | EwIroPa sIPaNa (Europe Spain)
man-woman          | gWalI wedi (girl boy)             | sebeyIti sebIayI (woman man)
food-country       | pasIta iTalIya (pasta Italy)      | mereQI amErika (soup America)
position-player    | akefafali mesi (midfielder Messi) | aTIqaOi ronalIdo (striker Ronaldo)
opposite           | newiHI HaxirI (tall short)        | xelimI xaOIda (black white)

Syntactic relatedness
plural suffix tatI | zEga zEgatatI (citizen citizens)  | InIsIsa InIsIsatatI (animal animals)
plural suffix atI  | hagerI hageratI (nation nations)  | OametI OametatI (year years)
relativization     | zIHulI zIzeHale (cold coldest)    | wIOuyI zIweOaye (hot hottest)
pronoun suffix     | afIliTu afIliTomI (he made it known, they made it known) | hibu hibomI (he gave, they gave)
negation circumfix | feleTetI ayIfeleTetInI (she knew, she didn't know) | habetI ayIhabetInI (she gave, she didn't give)
passive voice      | feleTe tefelITe (he knew, he was known) | habu tewahIbe (*they gave, he was given)
prefix Ina 'while' | belIOe InabelIOe (he ate, while eating) | Inanekeye InaweseKe (*while decreasing, while increasing)
Table 3.2 lists some typical examples of similarity queries ranked in descending
order of similarity score. Most of the results show semantic and syntactic related-
ness with the query word. For example, the responses to the word ትምህርቲ tImIhIrIti
(education) are all related to education, school, and learning. In addition to semantic
relatedness, these words also share the common morphological root "tmhrt". Furthermore,
we observed a noticeable similarity between the query and the related responses
in the POS category. In the previous example, except for the verb ትምህር
tImIhIrI (she teaches), the responses are all nouns. Similarly, the responses for the proper
noun ፈይስቡክ feyIsIbukI (Facebook) are mostly related proper nouns such as ትዊተር
tIwiterI (Twitter). Interestingly, the responses for ክልተ kIlIte (two) are numbers ranked
in ascending order, with the most strongly related word being ሰለስተ selesIte (three). This
strongly encoded relatedness can be leveraged to complement supervised and
unsupervised POS tagging. Moreover, we tested the word embeddings for categorization
(sorting out the word that does not belong to a group of words). In Table 3.3, we
consider the words in column "word4" as unrelated to the other three words in the same
row. The boldfaced entries represent the responses returned as unrelated according
to the test. The unrelated word is successfully isolated in all of the instances
except the first row, which represents a difficult case of subtle semantic distinction.
In this row, since radio, television, and newspaper are all means of broadcasting news,
the fourth word "news" should have been selected as the unrelated word instead of
the given response "radio". In order to measure the impact of this relatedness, we
conducted an extrinsic evaluation using the NTC corpus and
CRF-morph. In the NTC corpus that was used for CRF-morph, 10% of the sentences
(7460 words) were used for testing in cross-validated settings. Unknown words
comprised 18.7% (1390 words) of the test data. For extrinsic evaluation, we focused on
the effect of word embeddings on improving the tagger’s performance for unknown
words. As mentioned earlier, since the related words often share a common POS tag
with the query word, we tagged all related words of the unknown word and applied
majority voting to propose a candidate tag from the related words. The threshold
we used for the related candidates was ten. The majority vote selects the most com-
mon tag or the second most common tag among the list of candidates. This approach
reduced the error rate of tagging unknown words by 50%. As a result, the overall
performance of the tagger improved by 1.25 percentage points. The size of our corpus
for word embeddings was around 6.3 million words (after cleaning); inducing the tags
of unknown words using this relatively small text corpus already yielded significant
improvements. The quality of word embeddings has been shown to improve with
the size of the corpus [25]. Therefore, given much larger data, we expect the tagging
performance on unknown words to improve further.
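The majority-voting step can be sketched as follows. The tag lexicon and the similarity-ranked neighbor list below are toy stand-ins for the NTC tag dictionary and the embedding model's most-similar output:

```python
from collections import Counter

def vote_tag(neighbors, lexicon, fallback="N", top_n=10):
    """Retag an unknown word by majority vote over the POS tags of its
    most similar words. `neighbors` is a similarity-ranked word list;
    `lexicon` maps known words to tags; threshold of ten candidates."""
    tags = [lexicon[w] for w in neighbors[:top_n] if w in lexicon]
    if not tags:
        return fallback
    return Counter(tags).most_common(1)[0][0]

# Toy lexicon using words from the similarity examples above
lexicon = {"selesIte": "NUM", "kIlIte": "NUM", "hagerI": "N"}
print(vote_tag(["selesIte", "kIlIte", "hagerI"], lexicon))
```

With two NUM neighbors against one N, the vote proposes NUM for the unknown word.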
3.6.7 Summary
In this section, we briefly explained the building of word embeddings for Tigrinya.
Several experiments were performed to obtain the optimal settings and parameters for
generating word vectors of useful quality. Generally, syntactic relatedness improved
with a shorter context, and better semantic relatedness was achieved with a wider con-
text. We obtained optimal syntactic and semantic relatedness with a Skip-gram model
having a context size of two, a minimum word count of six, and 300 dimensions. We also
used the word embeddings to improve the POS tagging of unknown words. Although
our text corpus was relatively small, the error rate of tagging unknown
words was reduced by half, yielding an overall improvement in tagging accuracy.
In the future, this result may be improved further by enlarging the text
corpus and tuning the parameters.
3.7. Chapter Summary 35
Chapter 4
4.1 Introduction
In this chapter, we describe the proposed methods of POS tagging and segmentation.
First, we model tagging as a classification problem, applying point-wise SVM
prediction to determine the category of words among n classes. However,
the disambiguation of POS tags and morphological segments also depends on contex-
tual information. Therefore, we further employed CRF-based sequence labeling to
utilize the contextual and lexical features. Finally, we investigated the recent devel-
opments in deep learning and POS tagging with LSTM-based sequence-to-sequence
labeling. A brief description of SVM, CRF, and LSTM approaches is provided in the
following section.
are leveraged to solve multiclass tasks [32]. Examples of the first approach in-
clude naive Bayes classifiers [33], k-nearest neighbors [34], decision trees [35], and
neural networks [36]. SVMs can be extended to employ direct methods of multiclass
estimation as in [37] or decomposed into independent binary classifiers like in [38].
Furthermore, it was argued by [38] that provided sufficient tuning of the binary classi-
fiers, one-versus-the-rest decomposition provides a simple solution which performs
at least on par with other more complicated methods. Therefore, in this research,
we follow the one-versus-the-rest approach to address the POS task in a multiclass
setting.
The task of multiclass classification is to learn a function $f$ from a training dataset
$\{(x_i, y_i)\}$ that estimates the correct class $C$ of a new data point $x$ out of $N$
distinct classes, where $N > 2$. In the one-versus-the-rest scheme, one classifier is
trained for each class $C$, where the data points of that class are treated as positive
instances (+1) and those of the remaining $N - 1$ classes as negative instances
(−1) (equation 4.1):

$$y = \begin{cases} +1 & \text{for } y_i = C \\ -1 & \text{for } y_i \neq C \end{cases} \tag{4.1}$$
During prediction, a data point $x$ is assigned to the class whose classifier outputs the
largest value after running all $N$ classifiers (equation 4.2):

$$C^{*} = \operatorname*{arg\,max}_{C \in \{1, \ldots, N\}} f_C(x) \tag{4.2}$$
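A minimal sketch of the one-versus-the-rest decision rule, with toy linear decision functions standing in for trained SVM classifiers:

```python
def ovr_predict(x, classifiers):
    """One-versus-the-rest decision: run all N binary classifiers
    and return the class with the largest decision value."""
    return max(classifiers, key=lambda c: classifiers[c](x))

# Toy linear decision functions f_C(x) = w_C . x for three classes
classifiers = {
    "N":   lambda x: 1.0 * x[0] - 0.5 * x[1],
    "V":   lambda x: -0.2 * x[0] + 0.9 * x[1],
    "ADJ": lambda x: 0.1 * x[0] + 0.1 * x[1],
}
print(ovr_predict((1.0, 2.0), classifiers))
```

For the point (1.0, 2.0) the "V" classifier returns the largest value, so that class is assigned.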
CRFs are preferred over other models because they offer improved sequence-labeling
capabilities. Firstly, the strict Markovian assumption of hidden Markov models (HMMs)
is relaxed in CRFs. CRFs also avoid the label bias problem of maximum
entropy Markov models (MEMMs) by training to predict the whole sequence correctly
instead of predicting each label independently [9].
The model is trained to predict a sequence of output labels (e.g., IOB tags in
morphological segmentation or POS tags in POS tagging)

$$y = y_0, y_1, \ldots, y_T \tag{4.3}$$

given an input sequence of words

$$x = x_0, x_1, \ldots, x_T \tag{4.4}$$
The training task is to maximize the log probability $\log p(y \mid x)$ of the valid label
sequence. The conditional probability is computed as:

$$p(y \mid x) = \frac{e^{score(x,y)}}{\sum_{y'} e^{score(x,y')}} \tag{4.5}$$

$$score(x, y) = \sum_{i=0}^{T} A_{y_i, y_{i+1}} + \sum_{i=1}^{T} P_{i, y_i} \tag{4.6}$$
$A_{y_i, y_{i+1}}$ denotes the transition score from label $y_i$ to label $y_{i+1}$, while
$P_{i,j}$ is the emission score of the $j$-th label for the $i$-th word.
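A small numpy sketch of equation 4.6, using toy transition and emission matrices; the transition sum is truncated to the label pairs that actually exist in the sequence (the thesis's notation implicitly assumes start/stop handling):

```python
import numpy as np

def crf_score(A, P, y):
    """Compute score(x, y) of eq. 4.6 for a label sequence y, given a
    transition matrix A (L x L) and per-word emission scores P ((T+1) x L)."""
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    emit = sum(P[i, y[i]] for i in range(1, len(y)))
    return trans + emit

L = 3                                 # number of labels
A = np.full((L, L), 0.1)              # toy transition scores
P = np.log(np.full((4, L), 1.0 / L))  # toy (uniform) emission scores
y = [0, 1, 2, 1]                      # a candidate label sequence
print(crf_score(A, P, y))
```

During decoding, the label sequence maximizing this score is found with the Viterbi algorithm.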
RNNs carry information across time steps through hidden states that summarize previously
encountered states. Given the input vector $x_t$ (e.g., in the case of our segmentation,
five characters left and right of the target character), the hidden state $h_t$ at each time
step $t$ can be expressed by equation 4.8:

$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{4.8}$$
where the W terms represent the weight matrices, b is a bias vector and σ is the acti-
vation function. However, due to the gradient vanishing (very small weight changes)
or exploding problems (very large weight changes), long-distance dependencies are
not properly propagated in RNNs. [10] introduced LSTMs to overcome this gradient
updating problem. The neurons of LSTMs, called memory blocks, are composed of
three gates (forget, input, and output) that control the flow of information and a mem-
ory cell with a self-recurrent connection. The formulae of these four components are
given in equations 4.9 to 4.13. In these equations, the input, forget, output, and cell
activation vectors are denoted by i, f, o, and c respectively; σ and g represent sig-
moid and tanh functions respectively; ⊙ operation is the Hadamard (element-wise)
product; W stands for the weight matrices and b is the bias.
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \tag{4.9}$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \tag{4.10}$$
$$\tilde{c}_t = g(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \tag{4.11}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{4.12}$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \tag{4.13}$$
$$h_t = o_t \odot g(c_t) \tag{4.14}$$
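The LSTM gate computations described above can be sketched as a single numpy time step. The stacked weight-matrix layout (one matrix mapping the concatenated input and previous state to all four gate pre-activations) is an implementation choice for brevity, not the thesis's exact parameterization:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: sigmoid gates, tanh activations, and
    Hadamard (element-wise) products combining cell memory."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ np.concatenate([x, h_prev]) + b       # all pre-activations at once
    n = h_prev.size
    i, f, o = sigma(z[:n]), sigma(z[n:2*n]), sigma(z[2*n:3*n])  # gates
    c_tilde = np.tanh(z[3*n:])                    # candidate cell state
    c = f * c_prev + i * c_tilde                  # new memory cell
    h = o * np.tanh(c)                            # new hidden state
    return h, c

rng = np.random.default_rng(0)
dx, dh = 4, 3
W = rng.normal(size=(4 * dh, dx + dh)) * 0.1
b = np.zeros(4 * dh)
h, c = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), W, b)
print(h.shape, c.shape)
```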
A dropout layer is applied after the LSTM layer to regularize the model and prevent
overfitting. The output of the network is then processed by a fully connected (Dense)
layer, and finally tag probability distributions over all candidates are computed via
the softmax classifier. Owing to their capacity to handle long-distance dependencies,
LSTMs have achieved state-of-the-art performance in window-based approaches as well
as sequence labeling tasks, including morphological segmentation [39], part-of-speech
tagging [40], and named entity recognition [41].
$$y_t = \sigma_y(W_{hy} h_t + b_y), \tag{4.15}$$

where $W$ is the weight matrix, $b$ is the bias vector of the output layer, and $\sigma_y$ is
the activation function of the output layer. Both the forward and backward networks
operate on the input state $x_t$ and generate a forward output $\overrightarrow{h}_t$ and a backward output
$\overleftarrow{h}_t$. The final output $y_t$ from both networks is then combined using operations such
as multiplication, summation, or simple concatenation, as given in equation 4.16. In
our case, $\sigma$ is a concatenating function.

$$y_t = \sigma(\overrightarrow{h}_t, \overleftarrow{h}_t) \tag{4.16}$$
A sequence-to-sequence (seq2seq) model encodes the source sequence into a vector of a
fixed size, and then decodes the target sequence from the vector [42]. In sequence-to-
sequence POS tagging, the input is the sequence of words in a sentence, and the POS
tag for each word forms the target sequence.
In the POS tagging based on SVM and CRF, features were designed to capture the
morphological patterns of Tigrinya words and the syntactic structure of surrounding
context. In contrast, in the seq2seq approach, features were learned automatically,
capturing similar information through vector representations of words (embeddings),
and we achieved a slightly better performance. The architecture of the BiLSTM seq2seq
model is depicted in Fig. 4.3.
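Preparing the input and target sequences for such a seq2seq tagger reduces to mapping words and tags to integer ids. The Tigrinya-like tokens below are hypothetical placeholders, not corpus data:

```python
def build_vocab(sequences):
    """Map each symbol to an integer id, reserving 0 for padding/unknown."""
    vocab = {"<unk>": 0}
    for seq in sequences:
        for sym in seq:
            vocab.setdefault(sym, len(vocab))
    return vocab

sents = [["nIsa", "temaharitI", "Iya"]]          # hypothetical word tokens
tags = [["PRON", "N", "AUX"]]                    # their POS tag sequence
wv, tv = build_vocab(sents), build_vocab(tags)
X = [[wv.get(w, 0) for w in s] for s in sents]   # input id sequences
Y = [[tv[t] for t in s] for s in tags]           # target id sequences
print(X, Y)  # [[1, 2, 3]] [[1, 2, 3]]
```

These id sequences would then feed an embedding layer followed by the BiLSTM and softmax output.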
44 Chapter 4. Methods of Tagging and Segmentation
Chapter 5
5.1 Introduction
As briefly explained in the introductory chapter, the POS tagging process refers to
determining the POS tag of a word according to the context of the sentence. POS
tagging is one of the early phases in natural language processing. The later stages of
NLP, including phrase identification, named entity recognition, and syntactic pars-
ing, all require POS-tagged data as input. In addition, POS tagging has proven useful
in tasks such as statistical machine translation, information extraction, and summa-
rization. This chapter details the related works, experiments, and results achieved
in POS tagging of Tigrinya based on SVM, CRF, and LSTM approaches. The extraction
of morphological patterns and their effect on boosting performance are discussed.
We also experimented with LSTMs and word embeddings, forgoing manual feature
engineering, and achieved state-of-the-art results for Tigrinya.
Different tokenization choices lead to either words with complex POS tags or sequences
of segments with simple POS tags. The latter requires choosing among multiple ambiguous
segmentations [43]. The first POS-tagging research for the Arabic language was performed with
a corpus containing 50,000 tagged words collected from a newspaper [44]. Follow-
ing this, several research studies for the Arabic language emerged over the decades,
and currently, Arabic corpora with millions of words are available. For example,
the Arabic Newswire part-1 comprising 76 million tokens and over 666,094 unique
words [45] and the Arabic Treebank containing 1 million words [46] were compiled
at the University of Pennsylvania. Methodologically, the majority of recent stud-
ies have applied segmentation-based techniques for Arabic POS tagging [47]–[49].
An SVM-based POS tagger that performs morphological segmentation of words fol-
lowed by POS tagging was reported by [47]. Interestingly, a comparative study
of segmentation-based and segmentation-free methods by [50] reports better results
without segmentation (94.74% vs. 93.47%). In contrast, [43] examined the prob-
lem of word tokenization for Hebrew POS tagging and argued that segment-level
tagging is better suited for the POS tagging of Semitic languages in general, as it
suffers less from data sparseness. A recent work investigated modeling POS tagging
as an optimization problem using a genetic algorithm [51]. POS-tagging research for
the Amharic language was mostly driven by a 210K-word news corpus that was tagged
with 30 POS tags [52]. Using this corpus, [53] reported various experiments applying
Trigram‘n’Tags (TnT), SVM, and maximum entropy algorithms. Furthermore,
[54] identified inflection and derivation patterns of Amharic words as features and
improved tagging accuracy up to 90.95% using CRFs.
In section 5.4, we further performed separate experiments to explore recent ad-
vances in deep learning on sequence labeling. Specifically, we were interested in
BiLSTM with word embeddings for automatic encoding of the latent linguistic fea-
tures following the work by [40]. The research investigated BiLSTM-based POS
tagging with word, character, and Unicode byte embeddings on 22 languages includ-
ing Arabic and Hebrew. They also analyzed data size and label noise variations and
reported state-of-the-art accuracy across several languages. Interestingly, in contrast
to other language families, the BiLSTM approach in the case of Semitic languages
outperformed the TnT and CRF-based tagging on the first 1000 sentences, which
5.2. Related Works 47
may indicate the preference of BiLSTMs for a low-resource scenario in these lan-
guages. Earlier, [55] proposed a task-independent unified sequence tagging model
based on BiLSTMs and achieved results that were comparable to the state-of-the-art
POS tagging, Chunking, and NER in English. The method of combining character
representations to form word embeddings improved sequence tagging particularly
with morphologically complex languages such as Turkish [56]. Their method is dif-
ferent from traditional word embeddings, which consider embeddings at word level.
In a similar work, [57] derived sparse embedding features from continuous (Dense)
word embeddings and proposed a general sequence labeling framework that achieved
reasonable results for more than 40 treebanks in POS tagging.
With regard to Tigrinya, research on NLP has so far made little progress.
The absence of publicly available tagged or untagged corpora of Tigrinya has been
the main challenge. However, a notable achievement in the morphological analysis
and generation of Tigrinya, Amharic, and Oromo has been reported by [20]. This
work is discussed in more detail in the related works for Tigrinya morphological
segmentation (section 6.2). Other related works include a hybrid stemmer [58], some
electronic dictionaries (hdrimedia.com/, memhr.org/dic/, geezexperience.com/dictionary/),
input method editors (keyman.com/tigrigna/, geezlab.co/), a large Tigrinya lexicon from the
Crúbadán project (crubadan.org/languages/ti), and a web-crawling project that provides
concordance access to Tigrinya corpora among other languages of the Horn of Africa
(habit-project.eu/wiki/CorporaAndCorpusBuilding).
In the following sections, we present the experiments and results. The first ex-
periment discusses SVM and CRF-based results, and the second experiment focuses
on BiLSTM-based sequence-to-sequence labeling.
48 Chapter 5. Tigrinya POS Tagging
In the English language, 98% of the cases for the suffix “able” were found to be ad-
jectives in the Wall Street Journal part of Penn Treebank [59]. This illustrates the
possibility of exploiting suffix rules for predicting POS. Morphological features may
be extracted statistically or hand-picked by experts. It was shown that a few lin-
guistically motivated suffixes can result in improved generalization in English [60].
On the other hand, [61] implemented a maximum entropy Markov model to automati-
cally extract morphological features. This improved the accuracy of tagging unknown
words from 61% to 80%, tested on Chinese Treebank 5.0. Unknown words are test
words that are not seen during training. As mentioned previously, Tigrinya words
possess rich morphological information embedded in the form of prefixes, infixes,
and suffixes. Tigrinya verbs use suffixes to mark the conjugation of personal pro-
nouns. For example, the perfective verb pattern CeCeCe is conjugated by inflecting
the stem CeCeC with one of the following personal pronoun suffixes (C represents
any Tigrinya consonant).
CeCeC + (e | etI | a | u | ka | ki | kumI | kInI | na | ku)

For instance, the verb sebere can be decomposed into seber + e, which translates to "he broke". Likewise, the pattern
of the stems for the remaining verbs in the sebere family are all distinct. Specifically,
the pattern of the imperfective stem is -CeCC- (-sebr-), the imperative stem pattern
is -CCeC (-sber), and the gerundive stem has a CeCiC- (sebir-) pattern. For further
illustration, some of the most informative patterns based on a frequency distribution
of gerundive verbs (V_GER) are listed in Table 5.1. The V_GER proportion in the
second column gives the percentage of a particular V_GER pattern compared to a
total of 533 extracted V_GER patterns (including inflected patterns). The pattern
CeCiCu is the most dominant gerundive stem in the corpus (31.1%), with the suffix
“+u” indicating masculine, singular, and third-person attributes. In addition to the
pronoun suffix that is present in all verbs, stems with prefixes such as “te*” and “a*”
5.3 Experiment: POS Tagging Based on SVM and CRF with Morphological Patterns
are also frequently found in the corpus. The statistics reveal that 17.8% of the gerundive
verb inflections are prefixed with "te*," which forms passive-voice stems from
Tigrinya perfective and imperfective verbs. Similarly, the pattern of the causative
stem aCICeCe, which is prefixed by the morpheme “a*,” is retrieved as part of the
most informative patterns of gerundive verbs. These types of morphological clues
are very important in correctly predicting the POS of Tigrinya words.
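Extracting the consonant-vowel template of a romanized word can be sketched as below. The vowel set assumed for the transliteration scheme is an illustrative choice:

```python
def cv_pattern(word, vowels="aeiouEIO"):
    """Map a romanized word to its consonant-vowel template by replacing
    every consonant with 'C' while keeping the vowels (assumed vowel set)."""
    return "".join(ch if ch in vowels else "C" for ch in word)

assert cv_pattern("sebere") == "CeCeCe"   # perfective pattern
assert cv_pattern("sebir") == "CeCiC"     # gerundive stem
print(cv_pattern("tefelITe"))             # passive-voice "te*" stem
```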
Experiments, optimizations, and evaluations were carried out using scikit-learn
machine learning tools [62]. For CRF, the sklearn-crfsuite wrapper was used, which
is available at http://sklearn-crfsuite.readthedocs.io/en/latest/. The data and settings
are explained in the following sections.
The NTC corpus with 72,080 tokens was used for training and testing. The experi-
ments were performed in stratified 10-fold cross-validations where 90% of the data
was used for training and the remaining 10% for testing. A stratified 10-fold cross-
validation (CV) scheme evaluated the performance of the estimators. CV setup is
particularly useful in a low-resource scenario when the data for training and testing
is not sufficient. The stratified CV creates a balanced dataset by distributing approx-
imately the right proportion of tags into each of the 10 folds. Therefore, each fold
may be regarded as representative of the whole data. Furthermore, we experimented
with two versions of the corpus based on tagset-1 (the full tagset containing 73 tags)
and tagset-2 (the reduced tagset containing 20 tags). The full list of the tagset is
presented in Table 5.2. On average, almost 81% of the test data were known words
and 18% constituted the unknown words.
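A stratified fold assignment can be approximated with a simple round-robin distribution per tag. This is only a sketch of the idea, not the scikit-learn implementation actually used:

```python
from collections import defaultdict

def stratified_folds(tags, k=10):
    """Assign item indices to k folds so that each tag is spread evenly
    across folds (a round-robin approximation of stratified CV)."""
    by_tag = defaultdict(list)
    for idx, tag in enumerate(tags):
        by_tag[tag].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_tag.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

tags = ["N"] * 20 + ["V"] * 10
folds = stratified_folds(tags, k=10)
print([len(f) for f in folds])  # each fold gets 2 N's and 1 V
```

Each fold then serves once as the 10% test split while the rest is used for training.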
5.3.3 Features
The full list of the proposed features includes a rich set of contextual and lexical
features. The contextual features span a window of two words and two POS tags pre-
ceding the target word, the target word, and two words succeeding the target word.
Table 5.2: The full tagset: labels ending with _C, _P, _PC indicate
clitics of Conjunction, Preposition, or both respectively. For example,
N_C refers to a NOUN (N) with attached CONJUNCTION (C).
Additionally, lexical features are extracted from the target word’s affixes, which com-
prise prefixes of one to six characters in length, consonant-vowel patterns of the word
(infixes), and suffixes of one to five characters in length. The length of characters is
decided from the best results of a grid search of character combinations.
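The lexical feature extraction described above can be sketched as follows; the feature names and the vowel set are illustrative choices:

```python
def affix_features(word, max_pre=6, max_suf=5, vowels="aeiouEIO"):
    """Lexical features per the text: prefixes of 1-6 characters,
    suffixes of 1-5 characters, and the consonant-vowel pattern."""
    feats = {}
    for n in range(1, min(max_pre, len(word)) + 1):
        feats[f"pre{n}"] = word[:n]            # prefix of length n
    for n in range(1, min(max_suf, len(word)) + 1):
        feats[f"suf{n}"] = word[-n:]           # suffix of length n
    feats["ptn"] = "".join(ch if ch in vowels else "C" for ch in word)
    return feats

f = affix_features("sebere")
print(f["pre2"], f["suf1"], f["ptn"])
```

In the full system, these are combined with the contextual window of surrounding words and tags.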
A grid search was applied to find the optimal combination of hyperparameter values
for the SVM model: the parameter C, the regularization, and the kernel function. Ac-
cordingly, the best estimate was obtained with C=1, L2 regularization, and the linear
kernel. Similarly, a randomized grid search was applied to optimize the CRF-based
experiments using the LBFGS training algorithm. The best estimate was found with
C1=0.5978 and C2=0.1598.
5.3.5 Evaluation
The overall accuracy is calculated on the test data by computing the percentage of
correctly predicted tags with respect to the true annotations, as given by equation 5.1:

$$Accuracy = \frac{\text{number of correctly predicted tags}}{\text{total number of tags}} \times 100 \tag{5.1}$$
In addition, standard metrics of precision (P), recall (R), and F1-score (F1) were
utilized to assess the system’s performance on individual POS tags (equations 5.2,
5.3, and 5.4). Finally, in section 5.3.7.5, we analyze errors to discuss the strengths and
weaknesses of the taggers particularly on unknown words and some representative
tags.
$$P = \frac{TP}{TP + FP} \tag{5.2}$$

$$R = \frac{TP}{TP + FN} \tag{5.3}$$

$$F1 = \frac{2(P \cdot R)}{P + R} \tag{5.4}$$
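The accuracy and per-tag metrics can be computed directly from paired gold and predicted tag lists; a minimal sketch:

```python
def prf1(gold, pred, tag):
    """Precision, recall, and F1 for one POS tag."""
    tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
    fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
    fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

gold = ["N", "V", "N", "ADJ"]
pred = ["N", "N", "N", "ADJ"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy, prf1(gold, pred, "N"))
```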
5.3.6 Baseline
With regard to the baseline, the current word was considered as the feature without
exposing left and right context, and the unknown words were tagged as nouns. In this
setting, the SVM tagger on tagset-1 yielded a baseline accuracy of 74.05%, and the
CRF tagger performed slightly better at 76.15% as presented in Table 5.3. All the
other results obtained using more contextual information outperform these baselines.
5.3.7 Results
In general, the CRF tagger slightly outperforms the SVM-based tagger. Table 5.3
shows the overall performance of the taggers for tagset-1 and tagset-2, varying six
configurations of affix features. Accordingly, the best score achieved was 90.89%
and 89.92% for the CRF and the SVM taggers respectively on tagset-2. The dif-
ference of 0.01 percentage point is significant(p = 0.002 < 0.05). POS tagging is a
sequence-labeling task, therefore, the context-aware CRF-based tagger has the ad-
vantage of utilizing relations in the surrounding context. This may be the reason why
the CRF-based tagger tends to outperform the point-wise SVM-based tagger in most
of the experiments. The consonant-vowel pattern features ptn, which are characteristic
of Tigrinya and other Semitic languages, proved useful in the disambiguation of the
four types of verbs (perfective, imperfective, imperative, and gerundive), nouns, and
adjectives. According to Table 5.3, in both the CRF and SVM experiments, the pat-
tern features boost performance by more than 5.5 percentage points compared to the
baseline. The effects of the tagset design, the feature design, and the data size are
detailed in the following sections.
5.3. Experiment: POS Tagging Based on SVM and CRF with Morphological Patterns
Table 5.3: Overall performance for the data with tagset-1 and tagset-2.
Context = word-2, pos-2, word-1, pos-1, word, word+1, word+2,
ptn = consonant-vowel pattern, pref = prefix, suf = suffix,
all = context + all affixes
The rich inflectional and derivational morphology of Tigrinya generates a large num-
ber of words with various grammatical information including gender, number, case,
and so on. In addition, there are also clitics of prepositions and conjunctions affixed
to words. As mentioned earlier, the complete list of the tags (tagset-1) used in NTC
corpus has 73 tags denoting three types of information, which are the major POS cate-
gory (level-1), some sub-categories (level-2), and clitics (level-3). The distribution of
these tags in the corpus is not balanced. For example, nouns alone constituted about
33% of the words in the corpus while many other tags were rarely found. Specifi-
cally, seven tags were not assigned to words in the corpus, and the other 19 tags were
rarely present, each appearing less than 10 times in the corpus. Most of these rare tags
include the level-3 tags, which represent words that are cliticized with a preposition,
a conjunction, or both. Therefore, in order to reduce data sparseness, another set of
tags was designed by omitting the level-3 annotations. Consequently, the remaining
tags with level-1 and level-2 information were reduced to tagset-2, which contains
20 tags. In comparison to tagset-1, the tagset-2 data showed a better distribution
of tags as the frequency of all the tags was more than 100. We investigated the impact
of the tagset design as reported in Table 5.3. In both the CRF and SVM methods, the
tagset-2 results slightly outperform that of tagset-1. The cross-validations of both
tagsets for the SVM-based tagger with “all” features were examined for significance.
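The reduction from tagset-1 to tagset-2 can be illustrated with a small helper. The underscore-joined tag format and the level-3 clitic markers used here are hypothetical; the actual NTC tag inventory may differ:

```python
# Hypothetical level-3 clitic markers (preposition, conjunction, or both);
# the actual marker names in the NTC tagset may differ.
CLITIC_MARKERS = {"PRE", "CON", "PRECON"}

def to_tagset2(tag):
    """Drop trailing level-3 (clitic) components from a tagset-1 tag,
    keeping level-1 and level-2 information,
    e.g. 'V_REL_PRE' -> 'V_REL'."""
    parts = tag.split("_")
    while len(parts) > 1 and parts[-1] in CLITIC_MARKERS:
        parts.pop()
    return "_".join(parts)
```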
Table 5.5: Morphological features and the error rate (%) of unknown
words for selected POS tags in CRF-based experiment.
Considering the CRF tagset-1 classifier and the features context+suf and context+pref,
the overall improvements were 5.78 and 9.34 percentage points, respectively (Ta-
ble 5.4). The impact of these features is even more visible upon detailed analysis of
their performance with regard to unknown words. The result is based on a held-out
evaluation in which about 81.4% of the test data is comprised of known words and
18.6% of unknown words. When only using the context features, the performance
was as low as 38.38% for the SVM-based tagger and 39.21% for the CRF-based tag-
ger. However, augmenting context features with affixes almost doubled this result to
76.68% and 80.22% for SVM and CRF, respectively. Therefore, affix information
was very helpful in inferring the POS of unknown words. Analysis of the impact of
each feature shows that the addition of prefix (pref ) features contributes to the highest
gain compared to suffix (suf ) and pattern (ptn) contributions. This is also true for the
combined affix features. The setup integrating both prefix and suffix features (con-
text+pref+suf ) improved the accuracy by a significant 8.49 percentage points com-
pared to context+suf+ptn, and 19.74 percentage points compared to context+pref+ptn.
This effect may be due to the distribution of some of the very frequent prefixes such
as “mI” for verbal nouns (N_V) and “zI, Ite” for the relative verbs (V_REL). The er-
ror analysis in Table 5.5 clearly shows that this hypothesis holds. The error rate with
regard to N_V when training on prefix features was only 7.5%, whereas training on
suffix features resulted in the high error rate of 77.5%. In addition, with reference to
the affix slot positions of Tigrinya words, while conjunctions are either prefixed or
suffixed, prepositions are always prefixed making prefix features more informative.
The patterns of verbs, nouns, and adjectives are very informative features because in-
flectional and derivational rules indicate grammatical features such as verbal nouns,
relative verbs, and tense-aspect-mood features, as well as gender, person, and number
attributes. Incorporating patterns into the proposed contextual feature set yields con-
siderable performance gain on unknown words. The accuracy increased substantially
from 39.21% to 53.81% for CRF (Table 5.4) when using bare context and ptn pattern
features. Although less noticeable, the impact of patterns can also be seen in the com-
bined feature sets. The performance gain from context+pref to context+pref+ptn is
about 4.46 percentage points. However, the gain moving from context+pref+suf to all
features is rather small. This is probably due to the long range of character n-grams
for prefixes (1 to 6) and suffixes (1 to 5); unless a token is quite long, it is likely
that the pattern features are already encoded along with the prefix and suffix features.
In such cases, pattern features add little new information. Generally, this analysis
Figure 5.1: The performance of the tagger improves with the avail-
ability of more data.
reveals the extent to which infix patterns can be tailored to enrich the feature encod-
ing of Tigrinya lexical features. These types of infix features are characteristic of
Tigrinya and other Semitic languages. Therefore, it is expected that reinforcing the feature set
with more representative patterns would positively enhance the generalization of the
classifiers, and thereby the prediction accuracy of POS tags.
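The affix and pattern features discussed above can be sketched as follows. The prefix lengths (1 to 6), suffix lengths (1 to 5), and consonant-vowel pattern follow the description in the text; the vowel inventory assumed for the Latin transliteration is an approximation:

```python
# Vowel symbols assumed for the Latin transliteration of Ge'ez script;
# the exact inventory depends on the transliteration scheme used.
VOWELS = set("aeiouAEIOU")

def affix_pattern_features(token):
    """Prefix n-grams (1..6), suffix n-grams (1..5), and the
    consonant-vowel (ptn) pattern of a transliterated token."""
    feats = {}
    for n in range(1, 7):                      # prefixes of length 1..6
        if n <= len(token):
            feats[f"pref{n}"] = token[:n]
    for n in range(1, 6):                      # suffixes of length 1..5
        if n <= len(token):
            feats[f"suf{n}"] = token[-n:]
    # consonant-vowel pattern, e.g. "sebere" -> "CVCVCV"
    feats["ptn"] = "".join("V" if ch in VOWELS else "C" for ch in token)
    return feats
```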
The available corpus contains 72,000 tokens of which 19,000 (26%) are word types.
For a small corpus of 72,000 words, this is a relatively high number of word types
and can possibly increase the proportion of unknown words during testing. Further-
more, according to the lexical diversity (token-type ratio), a word is repeated around
4 times in the corpus.
Figure 5.2: The performance of the tagger improves with the availability of more data (log scale).
However, about 6% of the words in the corpus are hapaxes, which become part of
the unknown words if found in the test data. A robust tagger is required to correctly
predict the POS of such words, which otherwise degrade the accuracy of the tagger
through incorrect assignments. The best accuracy achieved in this
research is 90.89%, employing the CRF-based tagger and tagset-2. However, this
result is rather low compared to the state-of-the-art results in resourceful languages
such as English. For instance, the TnT tagger trained on the English Penn Treebank
of 1,200,000 tokens reported 96.7% tagging accuracy [59]. This dataset is significantly
larger than our small corpus. The experiments in the original implementation
of TnT showed that accuracy improved with an increase in data size. In order to inves-
tigate the performance of the Tigrinya taggers with respect to data size, we show the
learning curve in Fig. 5.1 by gradually increasing the training data size. The same
trend in log scale is also shown in Fig. 5.2 for better visualization. The graphs show
that performance improves with greater availability of data. One of the reasons for
this improvement is the reduction of unknown or out-of-vocabulary (OOV) words.
As mentioned, the OOV problem is pertinent to low-resource languages. In our case,
this limitation was partly addressed by introducing rich sets of morphological fea-
tures and rigorous tuning aimed at estimating the best parameters for the employed
algorithms.
Table 5.7: Tag-wise precision, recall, and F1-score for the SVM-based tagger (tagset-2).
Support is the number of words for the respective tag.
In Tigrinya, POS ambiguity may occur for all parts of speech. However, most of the
tagging errors stem from the role of (1) relative verbs and adjectives, (2) demonstra-
tive adjectives and pronouns, (3) adjectives and nouns, (4) imperfective verbs and
auxiliary verbs, and (5) adjectives and adverbs. Table 5.4 shows the error analysis
of unknown words for some representative POS tags that include nouns, verbs, ad-
jectives, and adverbs. Moreover, the performance of the SVM tagger (tagset-2) for
each POS tag is illustrated in Table 5.7 using precision, recall, and F-score measures
given by equations 5.2, 5.3, and 5.4 respectively.
Similar to the CRF and SVM based experiments, we used the two versions of the NTC
tagged with tagset-1 and tagset-2, in the usual 90/10 train/test setting. As previously
mentioned, NTC consists of 71,080 tokens in 4,656 sentences. The vocabulary size
of the corpus is 18,740. We also experimented by extracting the most frequently
occurring 10,000 tokens and masking the remaining “rare words” with the special
“UNK” (Unknown) token.
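The rare-word masking step can be sketched in a few lines of Python (an illustrative version, with the top-10,000 cut-off as a parameter):

```python
from collections import Counter

def build_vocab(tokens, size=10000):
    """Keep the `size` most frequent token types;
    everything else will map to UNK."""
    return {w for w, _ in Counter(tokens).most_common(size)}

def mask_rare(tokens, vocab, unk="UNK"):
    """Replace out-of-vocabulary tokens with the special UNK token."""
    return [w if w in vocab else unk for w in tokens]
```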
5.4.3 Evaluation
The evaluation score is computed using the accuracy metric as given by equation 5.1.
5.4.4 Baselines
We consider the best performance in the previous section as the baseline for this
experiment. Accordingly, the best CRF accuracy scores were 90.37% and 90.89%
for the tagset-1 and tagset-2 data respectively.
5.4. Experiment: POS Tagging Based on seq2seq LSTM and BiLSTM 61
Unlike the rigorous tuning and feature engineering done with the SVM and CRF-
based approaches in the previous sections, we applied minimal settings for the LSTM-based
approaches and relied on word embeddings for encoding morpho-syntactic informa-
tion and the LSTMs’ capability to capture contextual long-distance dependencies. The
informativeness of word embeddings was demonstrated in section 3.6, where the er-
rors affecting unknown words were reduced by 50% using the information from word
embeddings. The accuracy results of the LSTM-based experiments are reported in
Table 5.8. Overall, the results achieved by all models are competitive. Looking at
the scores, the BiLSTM model trained on the dataset “all” performed better than the
other models. The BiLSTM model using tagset-2 outperformed the CRF baseline by
0.71 percentage points. This small improvement is remarkable since it was achieved
without any handcrafting of morphological features and with minimal tuning of model
parameters.
The accuracy of the LSTM-based network is quite comparable to the BiLSTM-
based network with the latter performing slightly better. The reason is that the bidirec-
tional LSTM processes the input sentence in both forward and reverse passes enabling
it to access both left (past) and right (future) contexts, whereas the regular LSTM
performs only a forward pass of the sentence preserving only past information. This
bidirectional access is quite helpful in Tigrinya POS tagging since some words re-
quire both kinds of information for disambiguation. For example, when nouns (N) function as
adjectives (ADJ), they appear after the modified noun (Example 5.1). Furthermore,
demonstrative adjectives which are the same as pronouns (PRON) tend to get repeated
before and after a noun adding some emphasis to that specific noun (Examples 5.2
and 5.3).
Example 5.3. Iti wedi Iti: Iti/ADJ wedi/N Iti/ADJ (the boy, that boy)
The performance using a vocabulary size of 10,000 most common tokens and
the total 18,740 tokens was similar. Although further investigation on vocabulary
Table 5.8: Results of LSTM and BiLSTM models for POS tagging
in sequence-to-sequence setting. The vocabulary sizes marked as
“top10k” and “all” denote the most frequent 10,000 tokens and all the
vocabularies (18,740 tokens) respectively.
sizes will be required, the current result shows that masking the rare words with the
common UNK token proved useful and that the extracted contextual or syntactic in-
formation is important in decoding the ambiguity of the rare words.
In summary, in this first LSTM-based POS research for Tigrinya, we achieved
state-of-the-art results using bidirectional LSTMs with word embedding for feature
extraction. However, the state-of-the-art results for similar Semitic languages and
well-resourced languages are in the high nineties. Therefore, improving both the quan-
tity and quality of the available small annotated Tigrinya corpus is necessary to achieve
better performance of POS tagging. In addition, significant and relatively inexpen-
sive improvements can also be achieved using semi-supervised approaches that ex-
ploit unlabeled text ([57], [63], [64]). We plan to advance in this line of research in
future works.
5.5. Chapter Summary 63
Chapter 6
Morphological Segmentation for Tigrinya
6.1 Introduction
As introduced in Chapter 1, in the task of morphological segmentation, we identi-
fied the boundaries of morphemes in words. Tigrinya has a rich non-concatenative
morphology, which encodes various grammatical information as well as conjunction and
preposition clitics, object markers, and so on. In this research, we constructed a new mor-
phologically segmented corpus for Tigrinya and investigated morphological segmen-
tation using supervised approaches based on CRFs and LSTMs. For CRFs, we em-
ployed character and sub-string features without explicit language-dependent feature
engineering. For LSTMs, we modeled segmentation with a fixed-size window ap-
proach relying on character and word embeddings for word representations. Details
of the related works, experimental settings, results, and analyses are provided in the
following sections.
teacher. Their system for affix segmentation achieved a performance of 0.94 preci-
sion and 0.97 recall measures.
Tigrinya remains an under-studied and under-resourced language from the NLP
perspective. However, as regards to morphological processing, [20] employed fi-
nite state transducers (FSTs) to develop, “HornMorpho”, a morphological analysis
and generation system for Tigrinya, Amharic, and Oromo languages. The FST em-
powered by feature structures was effectively adapted to process the unusual non-
concatenative root-and-pattern morphology. The Tigrinya module of HornMorpho
2.5 performs the full analysis of Tigrinya verb classes [17]. The system was tested
on 400 randomly selected word forms, with 329 of them being non-ambiguous. The
analysis revealed remarkably accurate results with a few errors due to unhandled
FSTs, which can be integrated. Our approach is different from HornMorpho in at
least two aspects. First, our task is limited to identifying morphological boundaries.
The results amount to partially analyzed segments although these segments are not
explicitly annotated for grammatical features. The annotation could be pursued with
further processing of the output or training on morphologically annotated data, which
is currently missing for Tigrinya. Second, the use of the FSTs relies heavily on the
linguistic knowledge of the language in question. This would require time-consuming
manual construction of language-specific rules, which is more challenging for root-
and-pattern morphology. In contrast, we follow a data-driven (machine learning)
approach to automatically extract features of the language from a relatively small
boundary annotated data. Moreover, on the limitations of HornMorpho, [17] noted
that the analysis incurs a “considerable time” to exhaust all options before the system
responds.
no general consensus as to which variant performs best. [67] used the IOBES format
as it encoded more information, whereas extensive evaluations by [73] showed that
the BIO has superior results compared to the IOB scheme. Furthermore, the BIES
scheme gave better results for Japanese word segmentation [74].
We experimented with four schemes, namely BIE, BIES, BIO, and BIOES, to
explore the modeling of morpheme segmentation in a low-resource setting with the
morphologically rich language Tigrinya. In our case, the B, I, and E tags mark the Beginning,
Inside, and End of multi-character morphemes. The S tag annotates a single-character
morpheme and the O tag marks character sequences outside of morpheme chunks.
An example of using these annotations is presented in Table 6.1.
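As an illustration of these schemes, the following sketch derives character-level labels from a morpheme-segmented word for the BIE and BIES variants; the O tag of the BIO-style schemes, which marks characters outside morpheme chunks, is not modeled here. The segmentation of sebere into "seber" + "e" is taken from the text:

```python
def label_chars(morphemes, scheme="BIES"):
    """Character-level boundary labels for a segmented word.
    Handles the BIE and BIES views, where every character belongs
    to some morpheme; schemes with an O tag are not modeled here."""
    labels = []
    for m in morphemes:
        if len(m) == 1:
            # single-character morpheme: S if the scheme has it, else B
            labels.append("S" if "S" in scheme else "B")
        else:
            labels.extend(["B"] + ["I"] * (len(m) - 2) + ["E"])
    return labels
```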
We directly apply labels to the Latin transliterations of Tigrinya words (not the
Ge’ez script). The Ge’ez script is syllabic, and in many cases, the boundary has
fusional properties resulting in alterations of characters at the boundary. For exam-
ple, the word sebere (He broke) would be segmented as “seber” + “e” because the
morpheme “e” represents grammatical features. However, this morpheme cannot be
isolated in the Ge’ez script because the last two characters “re” form a single symbol in
the Ge’ez script, and segmenting sebere into “sebe” + “re” is an incorrect analysis.
Window Label
_ _ _ _ _ s e l a m I B
_ _ _ _ s e l a m I _ I
_ _ _ s e l a m I _ _ I
_ _ s e l a m I _ _ _ I
_ s e l a m I _ _ _ _ I
s e l a m I _ _ _ _ _ E
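The windows in the table above can be generated with a short helper (an illustrative sketch; the padding symbol and the radius of five context characters per side follow the table):

```python
def char_windows(word, radius=5, pad="_"):
    """One fixed-width window per character of `word`: `radius` symbols
    of context on each side, padded with `pad`, with the target
    character in the centre (cf. the table above)."""
    padded = pad * radius + word + pad * radius
    return [padded[i:i + 2 * radius + 1] for i in range(len(word))]
```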
Table 6.3: Training data split as per the count of the fixed-width character windows (vectors) used as the actual input sequences.

             All        Training   Validation   Test
Data size    100%       90%        5%           5%
             289,158    260,242    14,458       14,458
6.5 Experiment
6.5.1 Settings
In this section, the data used and the experimental settings are reported. Initially, we
set the training epochs to one hundred and configured early stopping if the training continued
without loss improvement for 15 consecutive epochs. We used Keras 1 to develop
and Hyperas 2 to tune our deep neural networks. Hyperas, in turn, applies hyperopt
for optimization. The algorithm used is the Tree-structured Parzen Estimator (TPE)
approach over five evaluation trials 3.
6.5.1.1 Datasets
The text used in this research is extracted from the NTC corpus. The NTC consists
of around 72,000 POS-tagged tokens collected from newspaper articles. Our corpus
contains 45,127 tokens, of which 13,336 tokens are unique words. Training is per-
formed with ten-fold cross-validation, where about 10% of the data in every training
iteration is used for development. We further split the development data into two
equal halves, allotting one set for validation during training and the other half for
testing the model’s skill.
The evaluation reports are based on the final test set results. For every target
character in a word, a fixed-width character window is generated as shown in Table
6.2. The data statistics according to the generated character windows is presented in
Table 6.3.
1 https://www.keras.io/
2 https://github.com/maxpumperla/hyperas
3 https://papers.nips.cc/paper/4443-algorithms-for-hyperparameter-optimization.pdf
6.5.1.2 Evaluation
We report boundary precision (P), boundary recall (R), and boundary F1 scores (F1)
as given by equations 6.1 to 6.3. The precision evaluates the percentage of correctly
predicted boundaries with respect to the predicted boundaries, and the recall measures
the percentage of correctly predicted boundaries with respect to the true boundaries.
The F1 score is the harmonic mean of precision and recall and can be interpreted as
their weighted average.
F1 = 2 ∗ (precision ∗ recall) / (precision + recall)    (6.3)
The correct predictions are the count of true positives, while the actual (true)
boundaries are the true positives plus the false negatives. The system-proposed
predictions are the sum of the true positives and the false positives. The I and O
classes each occur more than twice as often as the B class.
Therefore, to account for this class imbalance, we computed weighted macro-averages,
in which each class contribution is weighted by the number of samples of that class.
The macro-averaged precision and recall are calculated by taking the average of pre-
cision and recall of the individual classes as shown in equations 6.4 and 6.5, where n
denotes the total number of classes.
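The weighted macro-averaging described above can be sketched as follows, with plain (unweighted) macro-averaging as the default (an illustrative helper, not the evaluation code used in the experiments):

```python
def macro_average(per_class, support=None):
    """Average per-class scores (cf. eqs. 6.4-6.5). Without `support`
    this is the plain macro-average; with `support` (samples per class)
    each class is weighted by its number of samples."""
    classes = list(per_class)
    weights = support if support is not None else {c: 1 for c in classes}
    total = sum(weights[c] for c in classes)
    return sum(per_class[c] * weights[c] for c in classes) / total
```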
L2 regularization and C value of 1.5 were selected as these values gave better results.
The character and substring features we considered spanned a window size of five.
The full list of the character features is given as follows.
a. Chars: char -5, char -4, ... , char 4, char 5
b. Left context: -5 to 0, -4 to 0 , -3 to 0, -2 to 0, -1 to 0
c. Right context: 0 to 5, 0 to 4 , 0 to 3, 0 to 2, 0 to 1
d. Left+Right context: a + b
e. N-grams: bi-grams
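The character and substring features (a) to (e) can be sketched for a single window as a feature dictionary; the feature names are illustrative, and (d) is simply the union of (b) and (c):

```python
def crf_char_features(window, radius=5):
    """Character and substring features for the centre character of a
    fixed-width window: (a) single characters, (b) left substrings,
    (c) right substrings, (e) character bigrams; (d) is the union of
    (b) and (c), so no separate entries are needed."""
    assert len(window) == 2 * radius + 1
    c = radius  # index of the target character
    feats = {}
    for off in range(-radius, radius + 1):        # (a) single characters
        feats[f"char{off:+d}"] = window[c + off]
    for n in range(1, radius + 1):                # (b) left substrings -n..0
        feats[f"left{n}"] = window[c - n:c + 1]
    for n in range(1, radius + 1):                # (c) right substrings 0..n
        feats[f"right{n}"] = window[c:c + n + 1]
    for i in range(len(window) - 1):              # (e) character bigrams
        feats[f"bigram{i}"] = window[i:i + 2]
    return feats
```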
LSTM: In order to search for the parameters that yield optimal performance, we
explored hyperparameters that include embedding size, batch size, dropouts, hidden
layer neurons, and optimizers. The complete list of the selected parameters for LSTM
is summarized in Table 6.4. We achieved similar results for the BIE and BIES tun-
ings, as shown in the table. However, the BIO and BIOES schemes showed differ-
ences in the tuning results and, therefore, these were used separately.
Embeddings: We tested several embedding dimensions from the set {50, 60, 100,
150, 200, 250}. Separate runs were made for all types of tag sets.
Dropouts: Randomly selecting and dropping out nodes has proven effective
at mitigating overfitting and regularizing the model [75]. We applied dropout on
the character embedding as well as on the inputs and outputs of the LSTM/BiLSTM
layer. For example, the selected dropouts for the BIO-based tunings are 0.07 and 0.5
for the embedding and LSTM output layers, respectively. The dropout probabilities
are selected from a uniform distribution over the interval [0, 0.6].
Batch size: We ran tuning for the batch size of the set {16, 32, 64, 128, 256}.
Hidden layer size: The hidden layer’s size is searched from the set {64, 128, 256}.
Optimizers and learning rate: We investigated the stochastic gradient descent
(SGD) with stepwise learning rate and other more sophisticated algorithms such as
AdaDelta [76], Adam [77], and RMSProp [78]. The SGD learning rate was initialized
to 0.1, a momentum of 0.9 with rate updates for every 10 epochs at a drop rate of
0.5. However, the SGD setting did not result in significant gains compared to the
automatic gradient update methods.
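The stepwise SGD schedule described above (initial rate 0.1, halved every 10 epochs) corresponds to a simple step-decay function; momentum is handled by the optimizer and is not part of the schedule itself:

```python
def step_decay(epoch, base_lr=0.1, drop=0.5, every=10):
    """Stepwise learning-rate schedule: start at `base_lr` and
    multiply by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)
```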
6.5.2 Baseline
The baseline considered in this study is the CRF model trained on only character
features with a window size of five. The window size is kept the same as in the
neural LSTMs so that the models are compared subject to the same information. The baseline
experiments for the CRF achieved an F1 score of around 60% and 63% using the
BIOES and BIO tags, respectively. As for the BIE tags, the performance reached
about 75.7%. All the subsequent CRF-based experiments employed two auxiliary
features, which were the contextual and n-gram characters listed in section 6.5.1. As
depicted in Table 6.5, significant performance enhancements were achieved with the
auxiliary features. However, compared to the LSTM-based experiments, the window-five
CRF results were still suboptimal.
6.6 Results
We experimented and compared the performance of three models trained with four
different tagging strategies as explained earlier. The results of ten-fold cross-validation
are summarized in Table 6.5 with P, R, F1 representing precision, recall, and F1 score,
respectively.
Generally, we observed that the choice of segmentation scheme affects the models’
performance. Overall, using the BIE scheme resulted in the best performance in all
tests. Although the BIOES tagset is the most expressive, since the corpus is rather
small, the simpler tagsets showed better generalization over the dataset.
Table 6.5: Results of CRF, LSTM and BiLSTM experiments with four BIO schemes.
The BIE-
CRF model achieved 92.62% in the F1 score, while the performance of the BIO and
BIOES-based CRFs fell to about 84.88% and 83.6%, respectively. The CRF model
using the BIO scheme outperformed the CRF model using the BIOES scheme by 1.28
percentage points. A majority of the errors are associated with under-segmentation and
confusion between I (inside) and O (outside) tags. Masking these differences by drop-
ping the O tag has proved useful in our experiments. As a result, the CRF model using
the BIE scheme outperformed the BIO-based CRF by around 7.7 percentage points.
We compared the regular LSTM with its bidirectional extension. The overall results
showed that the BiLSTMs performed slightly better than the regular LSTMs sharing
the same scheme. In the window-based approach, the past and future contexts are
partly encoded in the feature vector. This window was fed to the regular LSTM at
training time, making the information available for both networks quite similar. This
may be the reason for not seeing much variation between the two models. Neverthe-
less, the additional design in BiLSTMs to reverse-process the input sequence allows
the network to learn more detailed structures. Therefore, we saw slightly improved
results with the BiLSTM network over the regular LSTM. As with CRFs, a signif-
icant increase in the F1 score was achieved when changing from the BIOES to the
BIE scheme. In both the LSTM models, we observed a gain of over 6 percentage
points in performance. Overall, in these low-resource experiments, the model gener-
alized better when the data was tagged with the BIE scheme as this reduced the model
complexity introduced by the O tags. The BiLSTM model using the BIE scheme per-
formed better than all others, scoring 94.67% in the F1. This result was achieved by
Table 6.6: BiLSTM confusion matrix comparison for the BIE and BIO
schemes (%). The boldface entries represent the BIE scheme. Rows and
columns represent true and predicted values, respectively.
        B/B            I/I            E/O
B/B     91.20/82.60    4.54/9.94      4.26/7.46
I/I     2.83/3.21      93.84/90.74    3.34/6.05
E/O     3.09/4.19      9.65/10.83     87.26/84.98
The models using the BIE and the BIES tagsets achieved comparable results. We
also observed a similar trend for the models using the BIO and the BIOES tagsets.
However, the BIE-based results and the BIO-based results differ significantly. There-
fore, we analyzed the effect of the tagset choice, contrasting the BiLSTM models of
the BIE with the BIO schemes. For easier comparison, the confusion matrices (in
percentage) of both experiments are presented jointly in Table 6.6. The values are
paired in X/Y format, where X is a value from the set {B, I, E} in the BIE scheme
and Y is one of {B, I, O} in the BIO scheme. The row-wise entries represent the
true (reference) tags, while the column values denote the model predictions. The diagonal
entries where the tags match (e.g., row B/B vs. column B/B) represent the percentage
of correctly assigned tags. The rest of the entries show the true (row-wise) vs.
the predicted (column-wise) confusions. For example, the entry for the row B/B against
the column I/I (4.54/9.94) denotes that 4.54% (in BIE) or 9.94% (in BIO) of the B
tags were assigned the I tag by mistake. Overall, looking at the paired entries, ex-
cept for the diagonals, the first entry (a BIE entry) is always lower than the second
entry (a BIO entry) which shows that the BIE models have lower error rate than the
corresponding BIO scheme. As mentioned earlier, the number of I or O classes is
about two-fold greater than the B class. Therefore, there is more confusion stemming
from these major tags. The B tag represents the morpheme segmentation boundaries;
therefore, the under-segmentation and over-segmentation errors can be explained in
relation to the B tags. Looking at the BIE figures, around 91% of the predictions for
the B tags are correct. In other words, 9% of the true B tags were confused for the
I and O tags, which together amounts to the under-segmentation errors. These types
of errors are almost doubled with the use of the BIO scheme. On the other hand, in
total, the over-segmentation errors are about 5.9% (I, E vs. B) and 7.4% (I, O vs.
B) in the BIE and BIO schemes, respectively. Over-segmentation occurs when the
true I, O, or E tags are confused with the B tag. In this case, morpheme boundaries
are predicted while the characters actually belong to the inside, outside, or end of the
morpheme. These results show that the under-segmentation errors contribute more
compared to the over-segmentation errors.
In the BIE scheme results, we observe that the I and B tags are correctly predicted
with higher accuracies than in the BIO scheme. Specifically, the prediction accuracy
of the B tag (boundaries) increased from 82.6% to 91.2% (8.6 percentage points absolute).
The main reason is the considerable reduction of errors introduced due to the confu-
sions of tag I for O or vice-versa in the BIO scheme. This comparison suggests that
improved results can be achieved by denoising the data from the O tag. Therefore,
for a low-resource scenario of boundary detection, starting with the BIE tags may
be a better choice for improving model generalization compared to more complex
schemes that are feasible for larger datasets.
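A row-normalised confusion matrix of the kind compared above can be computed as follows (an illustrative sketch; rows are reference tags and columns are predictions in this version):

```python
from collections import Counter, defaultdict

def confusion_percent(true_tags, pred_tags, labels):
    """Row-normalised confusion matrix in percent: rows are the
    reference (true) tags, columns are the predicted tags."""
    counts = defaultdict(Counter)
    for t, p in zip(true_tags, pred_tags):
        counts[t][p] += 1
    matrix = {}
    for t in labels:
        row_total = sum(counts[t].values()) or 1  # avoid division by zero
        matrix[t] = {p: 100.0 * counts[t][p] / row_total for p in labels}
    return matrix
```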
The following sample sentence extracted from the Bible is segmented with all
four models. Note that the training corpus does not include text from the Bible.
Tigrinya:
amIlaKI kea mIrIayu zEbIhIgI mIbIlaOu zITIOumI kWlu omI abI mIdIri
abIqWele
English:
And out of the earth the Lord made every tree to come, delighting the eye and
good for food.
The segmentation result from the four models is summarized in Table 6.7. The
expected segmentation is given under the column labeled “True”. We observe that
the BIE model segmentation is the nearest to the True segmentation. The other mod-
els show under-segmentation errors for the words zEbIhIgI and kWlu. The verbal
noun prefix “mI*” is correctly segmented by all models, whereas the models failed
to recognize the boundary of the causative prefix “a*” in the last word abIqWele.
Table 6.8: The effect of additional word and POS features on segmentation
performance.
Table 6.9 shows the improvement in F1 scores from about 92.3%, with only 10% of
the data, to about 94.67% with all the data, an absolute gain of 2.37 percentage points. Each F1 score is the
weighted macro-average of a ten-fold cross-validation of the specific dataset.
The comparison of both the ten-fold result sets is visualized in the box and whisker
plot (Fig. 6.1). The f1_10 and f1_100 denote the F1 scores of ten-folds using the
10% and all (100%) of the data respectively. The box represents the middle 50% of
the F1 scores; outliers are shown as dots, and the line dividing the box shows the
midpoint (median) of the scores.
The compact box of f1_100 indicates a comparatively lower variance in the 10-
fold averages whereas the longer box of f1_10 shows considerable variation. We also
observe higher overall F1 scores indicating the usefulness of enlarging the dataset.
The F1 distributions for 10% and 100% of the data do not fit a normal distribution
(non-Gaussian), and hence we performed a Kolmogorov-Smirnov (KS) test and observed
that the absolute gain is statistically significant (p = 0.0012 < 0.05).
Chapter 7
English-Tigrinya Machine
Translation with Morphological
Segmentation
7.1 Introduction
Machine translation (MT) is the task of translating one natural language into
another. There is an abundant wealth of digital information on the internet and other
sources. However, most of this information is available in English and other widely
used languages such as Chinese and Japanese. Although the internet penetration of
less-known native languages such as Tigrinya seems to be on the rise, there is still,
comparatively, a huge gap in information provision for low-resource languages. In
addition, the language barrier in societies contributes to difficulties in communication
and daily life. One way to overcome the digital divide and improve communication is by
providing machine translation services. However, the complexity and linguistic
divergence of some language pairs, aggravated by insufficient parallel corpora,
render machine translation a challenging task. In Chapter 1, we presented
a brief introduction to the difficulty of English-Tigrinya MT. In this research, we
attempt statistical machine translation, improving the correspondence between these
languages through morphological segmentation. The following sections describe some
previous works relating morphology and MT, the methods we adopted, and finally,
the experimental results and discussions.
empty) on Wikipedia. Apart from the Bible translation, we could not find other
parallel corpora open to the public. However, some online dictionaries and mobile
applications are being developed. For example, memhir.org maintains a dictionary of
over 15,000 entries. Recently, geezexperience.com has been compiling a multilingual
dictionary between Tigrinya and a number of other languages including English,
German, Dutch, Italian, and Swedish. Hidri publishers also provided a mobile version
of their printed dictionary “Advanced English-Tigrinya Dictionary” with over
62,000 entries.
In this research, we used the Bible as our parallel corpus. [87] discuss the usefulness
of the Bible for language processing. They mention that the Bible has been
translated into over 2000 languages, making it the most translated book in the world.
The Bible text is carefully translated and organized at the verse level. According to [87],
the Bible has about 85% coverage of modern-day vocabulary and variations of writing
styles. Furthermore, [88] used the Bible as a bootstrapping text to set the parameters
of a stochastic translation system and noted the prospect of enabling translation of
thousands of languages using the Bible as a basis. SMT requires large parallel data
for high-quality translations; consequently, the Bible alone will not be sufficient for
building high-quality translation models. However, it is easily accessible and may be
tailored to build experimental models and investigate certain behaviors.
7.3 Methods
and a Tigrinya lexicon crawled from the internet. The shallow segmentation produces
three segments (sub-word units) as “longest-prefix Stem longest-suffix”. Note
that we use “prefix, stem, suffix” for simplicity; however, the sub-words here are not
necessarily valid linguistic units. To reduce overstemming, the minimum stem
threshold was set to five characters. This threshold was selected because most Tigrinya
words, especially verbs, contain tri-literal roots, or about six characters when
transliterated.
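The longest-affix matching with a minimum stem threshold can be sketched as follows. The affix lists below are tiny romanized samples chosen for illustration; the actual prefix and suffix lists were derived from the corpus.

```python
# Shallow "longest-prefix Stem longest-suffix" segmentation sketch.
# The affix lists are small illustrative samples, not the thesis lists.
PREFIXES = sorted(["ZEbI", "kemI", "mI", "a", "nI", "ItI"], key=len, reverse=True)
SUFFIXES = sorted(["omI", "nI", "ka", "atI", "u"], key=len, reverse=True)

MIN_STEM = 5  # minimum stem length in characters, as set in the experiments

def shallow_segment(word):
    """Return (prefix, stem, suffix); empty strings where no affix is cut."""
    prefix = suffix = ""
    stem = word
    # Longest matching prefix first, kept only if enough stem remains.
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix

# The causative prefix "a" is cut because the remaining stem is long enough;
# "ZEbI" is not cut from ZEbIhIgI because the stem would fall below 5 chars.
print(shallow_segment("abIqWele"))
print(shallow_segment("ZEbIhIgI"))
```

The threshold check is what prevents overstemming: an affix match is accepted only when the residual stem still meets the five-character minimum.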
First, the source sentence is segmented into shorter phrases, which may not necessarily
be linguistic phrases.
Following segmentation, each phrase is translated into a target phrase. Finally,
the translated phrases are combined (reordered) to determine the target sentence. For
example, suppose we have the following sentence pairs as training sentences:
English source sentence:
“How do you translate this?”
This sentence can be translated into Tigrinya as:
“Izi kemeyI gErIka tItIrIgWImo?”
Table 7.2 shows the sample phrases extracted from the given sentence pair.
Based on this type of large bilingual text, phrase-based systems learn models that
can predict the most probable translation outputs. The phrase translation based on
noisy channel model is defined using the Bayes rule. Accordingly, given a source
sentence s, the most probable target sentence t̂ is

t̂ = argmax_t p(t) p(s|t),

where t is the target sentence and s is the source sentence. The probability p(t)
represents the language model while p(s|t) models the phrase translation. During
decoding, the source sentence s is segmented into a sequence of n phrases s_1^n, and
each source phrase s_i in s_1^n is translated into a target phrase t_i. Then, a possible
reordering of the target phrases into a more fluent sentence is performed using
the distortion model, expressed as a relative distortion probability
d(start_i − end_{i−1}). Here, start_i is the position of the first word of the source
phrase translated into the i-th target phrase, and end_{i−1} is the position of the last
word of the source phrase translated into the (i−1)-th target phrase.
Therefore, to account for distortion, the translation model p(s|t) is given by equa-
tion 7.3:
p(s_1^n | t_1^n) = ∏_{i=1}^{n} ϕ(s_i | t_i) d(start_i − end_{i−1}),    (7.3)
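As a small illustration of Eq. (7.3), the following sketch scores a phrase segmentation by multiplying phrase translation probabilities with an exponential distortion penalty. The phrase table entries, span indices, and the distortion parameter are illustrative assumptions, not values from our models.

```python
# Sketch of Eq. (7.3): p(s|t) = prod_i phi(s_i|t_i) * d(start_i - end_{i-1}).
ALPHA = 0.6  # illustrative exponential distortion: d(x) = ALPHA ** |x - 1|

def distortion(start_i, end_prev):
    # Monotone translation (start_i = end_prev + 1) incurs no penalty.
    return ALPHA ** abs(start_i - end_prev - 1)

def translation_score(phrase_pairs, phi):
    """phrase_pairs: (source_phrase, target_phrase, start, end) in target
    order; phi: dict mapping (source, target) -> translation probability."""
    score = 1.0
    end_prev = 0  # end index of the previously translated source phrase
    for src, tgt, start, end in phrase_pairs:
        score *= phi[(src, tgt)] * distortion(start, end_prev)
        end_prev = end
    return score

# Toy monotone derivation over two hypothetical phrase pairs.
phi = {("how do you", "kemeyI gErIka"): 0.4,
       ("translate this", "Izi tItIrIgWImo"): 0.5}
pairs = [("how do you", "kemeyI gErIka", 1, 3),
         ("translate this", "Izi tItIrIgWImo", 4, 5)]
print(translation_score(pairs, phi))
```

In this monotone example both distortion factors are 1, so the score is simply the product of the two phrase probabilities, 0.4 × 0.5 = 0.2; any reordering would shrink it by a power of ALPHA.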
7.4 Experiments
7.4.1 Settings
The tools we used for building the phrase-based translation model, language model
(LM), and evaluations are from Moses SMT toolkit [90]. Word alignment was per-
formed with MGIZA++ [91], an extended and optimized multi-threaded version of
GIZA++ [92]. While verse length varies widely in the corpus, the average verse
length of the unsegmented Tigrinya corpus is 19.9 tokens and grows to 31.4 after
morphological segmentation (Table 7.3). Therefore, for the cleaning step, the maximum
sentence length is set to 60. We train language models using KenLM, an
efficient library optimized for speed and low memory consumption [93]. The order
of n-grams is set to five to account for words split by segmentation. Six language
models were built based on the segmentation schemes and the two datasets. The
reordering limit for the distortion model was set to the default of six. The settings for
the datasets, evaluation tests, and the baseline system are described as follows.
7.4.1.1 Datasets
The training, tuning, and test data are all extracted randomly from the Bible paral-
lel corpus. Table 7.3 lists the size of the verse-aligned parallel corpus (dataset-1),
whereas Table 7.4 lists the size of the sentence-aligned parallel corpus (dataset-2)
extracted from dataset-1. Note that in dataset-1, there are verses combined for strict
alignment as explained in the preprocessing section. In this way, the verse-aligned
corpus consisted of 31,277 verses, with 29,307 verses used for training, 970 verses for
tuning, and 1000 verses held out for testing. However, the verse alignment process
also introduces lengthy sentences, possibly making word alignment more difficult
due to the large length differences between the aligned verses. Given the small size
of the corpus, this may affect the overall quality of alignments. In order to investi-
gate its effect on translation quality, we constructed dataset-2, the sentence-aligned
parallel corpus, by extracting only the single-sentence verses based on the Tigrinya
corpus. Sentence identification was performed using the sentence-end marker of
Tigrinya. Therefore, all verses with a single sentence-end marker were extracted.
Consequently, the extracted corpus comprises a total of 20,578 parallel sentences,
which is about 65.8% of the original verse-aligned corpus. We notice that the mor-
phologically segmented Tigrinya corpus (dataset-1) is the closest match to the num-
ber of tokens in the English corpus. The average verse length of the English tokens is
32.0 and that of the morphologically segmented Tigrinya corpus is 31.4 (Table 7.3).
It is interesting to see whether this match would be more useful in creating effective
word alignments for the machine translation (MT) models. Our experimental results
using both the corpora are summarized in Tables 7.5, 7.6, 7.7, and 7.8.
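The extraction of dataset-2 can be sketched as a simple filter over the verse pairs. Here we assume the Ethiopic full stop “።” as the Tigrinya sentence-end marker; the toy verse pairs are invented examples, not corpus content.

```python
# Sketch of extracting single-sentence verses (dataset-2) from the
# verse-aligned corpus (dataset-1).

def extract_single_sentence_verses(pairs, marker="።"):
    """Keep only verse pairs whose Tigrinya side contains exactly one
    sentence-end marker, i.e., a single sentence."""
    return [(en, ti) for en, ti in pairs if ti.count(marker) == 1]

# Toy verse pairs (invented, for illustration only).
pairs = [
    ("verse with one sentence .", "ሓደ ምሉእ ሓሳብ ።"),
    ("verse with two sentences . another .", "ቀዳማይ ። ካልኣይ ።"),
]
kept = extract_single_sentence_verses(pairs)
print(len(kept))
```

Only the first pair survives the filter; applying the same rule to the full corpus is what yields the 20,578-sentence subset reported above.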
7.4.1.2 Evaluation
We evaluate the translation performance using the BLEU, METEOR, and TER
metrics. We also analyze the perplexity and OOV statistics to investigate the LM
improvement achieved by segmentation. OOVs are tokens in the test data that are not
present in the training data, and perplexity measures how well an LM predicts the
test data; a lower perplexity score indicates a better fit of the model to the test set. We designed
four sets of system evaluations based on the MT models and the test sets. The settings
are given as follows:
1. Verse based models (MT-verse): Tables 7.5 and 7.6.
Dataset-1: the verse-aligned corpus
- contains 31,277 verses (1 verse ≥ 1 sentence)
Test-1: 1000 verses
2. Sentence based models (MT-sent): Tables 7.5 and 7.6.
Dataset-2: extracted only the single sentence verses from Dataset-1
- contains 20,578 sentences
Test-2: extracted only the single sentence verses from Test-1
- contains 651 sentences
3. Dataset-1 + Test-2: Tables 7.7 and 7.8.
The results of the MT-verse and MT-sent models may not be directly comparable since
the test data differ in the two cases. Therefore, for a better comparison of the
two models, we evaluated both using the sentence-based test data (Test-2).
4. Raw text models against segmented text models: Tables 7.5 and 7.6.
For models from the segmented corpus, evaluation is straightforward: the translation
output of the segmented MT model is evaluated against the segmented reference.
METEOR: Metric for Evaluation of Translation with Explicit ORdering
(https://en.wikipedia.org/wiki/METEOR)
TER: Translation Error Rate (http://www.cs.umd.edu/~snover/tercom/)
However, the baseline model is built from the unsegmented parallel corpus. There-
fore, the evaluation output of the unsegmented and segmented MT models are not
directly comparable. Hence, for fair and easier comparison, we propose to segment
the translation output of the baseline model and evaluate it against the segmented
reference. Models system-1b and system-1c are examples of such models (Table
7.5). The evaluation system is depicted in Fig. 7.2, using the segmented baseline
and the segmented cases as an example. Another comparison method is performed
by restoring the translated segments in the segmented models to their root words and
evaluating against a raw text reference. This method conducts evaluation between
words instead of morphemes and can be performed with a separate detokenization
algorithm or input data representation, which is left for future research.
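A simple de-tokenization of the segmented output can be sketched from the markers used in our notation (cf. Table 7.9), where a trailing “*” marks a prefix and a leading “+” marks a suffix:

```python
# Sketch of the de-tokenization (morpheme restoration) step: prefixes
# carry a trailing "*" and suffixes a leading "+", so words are restored
# by joining marked morphemes back onto their stems.

def desegment(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]              # attach suffix to previous stem
        elif words and words[-1].endswith("*"):
            words[-1] = words[-1][:-1] + tok  # attach stem to pending prefix
        else:
            words.append(tok)
    return words

# Morpheme sequences taken from the sample outputs in Table 7.9.
print(desegment(["kI*", "f", "+uInI"]))
print(desegment(["ayI*", "tI*", "bIlaOI"]))
```

Both examples restore the corresponding reference words (kIfuInI and ayItIbIlaOI), allowing an evaluation at the word level rather than the morpheme level.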
Figure 7.2: The proposed system evaluation for segmented and unseg-
mented models.
7.4.2 Baseline
The baseline system is built from the clean and tokenized (unsegmented) version of
the Bible corpus. The performance in terms of BLEU score is 15.6 for the verse-
aligned models (MT-verse) and 13.0 for the sentence-aligned models (MT-sent). All
the models from the segmented corpus outperform these baseline scores.
7.5 Results
The experimental results of the baseline and segmented models are described in Ta-
bles 7.5 through 7.8. Below, we discuss the effect of segmentation according to the
evaluation settings mentioned earlier.
In general, the evaluation metrics in Table 7.5 show that the verse-aligned models
score better results than the sentence-aligned models. The performance drop
in the MT-sent models is most likely due to the larger proportion of OOVs resulting
from restricting the corpus to single-sentence verses only. Table 7.6 shows the OOV
ratio and the model perplexity of the two systems. For example, the OOV ratio of
MT-verse baseline is 6.7%, while that of MT-sent is 8.5%. Similarly, the perplexity
of MT-sent is higher than that of MT-verse. Therefore, the verse-aligned models scored better
results. Nonetheless, the difference is rather small, suggesting that with sufficient
data, sentence-based models might have performed better. For example, the BLEU
score of sys-segm and sys-segm-sent is 20.7 and 19.8, respectively. The difference
is only 0.9 BLEU points, although the MT-sent corpus is much smaller than the MT-
verse corpus. Notice that the MT-verse models are tested on the 1000 test verses
(Test-1). These verses include single-sentence verses as well as multiple-sentence
verses. However, MT-sent models are tested on 651 single-sentence verses (Test-2)
extracted from Test-1.
Therefore, in order to compare MT-verse with MT-sent, we evaluated both sys-
tems under Test-2 dataset. The evaluation and perplexity scores are presented in
Tables 7.7 and 7.8. We observe that this version of MT-verse outperforms MT-sent
under all metrics. This may be attributed to the fact that OOVs are greatly reduced
when the smaller test set, Test-2, is evaluated against MT-verse, which uses the larger
training set.
Overall, we observe that segmentation has improved the machine translation quality
compared to the unsegmented baseline. In the MT-verse system, the baseline for the
sys-stm model is sys-base-stm, while that of sys-segm is sys-base-segm. We see BLEU
score improvements of 1.1 and 1.4 over the respective baselines for the stemmed and
segmented models. Although the BLEU score of the stemmed model is marginally
better than that of the segmented model (20.9 vs. 20.7), the METEOR and TER metrics show
that the morphologically segmented model outperforms the others. Moreover, the
metrics for the segmented models of the MT-sent system consistently show the best
results. The analysis of OOVs and perplexity in Table 7.6 further clarifies the reason
for the performance gain. The OOV count decreases from 1408 for the baseline to
664 for the segmented model, bringing the OOV ratio down to 2.1% of the test tokens. A
similar reduction pattern is also reflected in the MT-sent models. Since the test data
size is different for the models, the OOV ratio may better explain the OOV size in
relation to the test data. Accordingly, we see higher rates of OOV in the unsegmented
models while the ratio of OOVs to the test data decreases with finer segmentation.
We note that this ratio has a negative effect on perplexity; the higher the OOV ratio,
the worse the perplexity. Model system3-segm in Table 7.6 has the lowest OOV ratio
and perplexity among all the models while system6-segm-sent has the lowest score
among the MT-sent models. Therefore, based on these findings, we conclude that
the fine-grained segmentation was more effective in improving the quality of English-Tigrinya
machine translation. In general, although our training data is relatively small, the
performance gain observed from the segmented systems demonstrates the usefulness
of word segmentation strategies for English-Tigrinya machine translation.
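The OOV statistics discussed above can be computed directly; the token lists in this sketch are illustrative placeholders, not the actual corpus.

```python
# Sketch of the OOV statistics used in Table 7.6: OOVs are test tokens
# absent from the training vocabulary; the ratio is relative to the
# test-token count, so it stays comparable across test sets of
# different sizes.

def oov_stats(train_tokens, test_tokens):
    vocab = set(train_tokens)
    oov = [t for t in test_tokens if t not in vocab]
    return len(oov), len(oov) / len(test_tokens)

# Illustrative romanized token lists (not corpus data).
train = "kabI fIre omI kabIa mIsI motI".split()
test = "kabI fIre omI gIna azezo".split()
count, ratio = oov_stats(train, test)
print(count, ratio)
```

Segmentation lowers both numbers at once: splitting affixes off rare wordforms makes more test tokens match the training vocabulary, which is exactly the reduction reported above.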
The examples in Table 7.9 show the translation outputs of two reference sentences
from all models of the MT system. We observe interesting insights in terms of both
morphological and syntactic transfer from the source to the target sentence. The
relatively shorter reference sentence (b) has been correctly translated by all models.
This may suggest that the models perform well on short sentences. Therefore, we
discuss the syntactic structure and meaning preservation aspects of the translation,
taking the longer sentence (a) as an example. For easier high-level discussion, we
simplify the source sentence (a) by abstracting it into representative sense sub-phrases
(enclosed in square brackets [...]). We also convert the reference and Tigrinya outputs
of MT-verse system into similar sub-phrases as follows:
Source sentence (a):
[but of the fruit of the tree of the knowledge of good and evil][ you
may not take ;] [for on the day when you take of it ,][death will
certainly come to you .]
• Source: [of-tree][don’t-take][if-you-take][die]
• Reference: [from-tree][if-you-eat][die][don’t-eat]
• Translated-stm: [from-tree][don’t-eat][if-you-eat][die]
• Translated-seg: [tree-you][don’t-eat][if-you-eat][death]
Table 7.9: Sample Translations from the MT-verse and MT-sent mod-
els.
System Sentences
(a) but of the fruit of the tree of the knowledge of good and evil
Source you may not take ; for on the day when you take of it , death will
certainly come to you .
(b) and the name of the second river is gihon : this river goes
round all the land of cush .
(a) kabIta xIbuQInI kIfuInI ItefIlITI omI gIna : kabIa mIsI
Reference ItIbelIOI meOalItIsI motI kItImewItI iKa Imo : kabIa
ayItIbIlaOI : ilu azezo .
(b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
Baseline (a) gInaKe : kabI fIre omI ayItIbIlaOI : kabIa mIsI ItIbelIOI
(unsegm.) meOalItIsI : motI kItImewItI iKa .
(b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
System 2 (a) kabI fIre Ita gIna kI* f +uInI Ite* fIlITI omI dIma Imo : kabIa
(stemmed) ayItIbI* laOI : kabIa mIsI ItI* belIOI meOalItIsI motI kItI* mew
+ItI iKa : nI* sI +Ka .
(b) sImI Iti KalI +ayI rIba dIma giho +nI Iyu : nIsu nI* KWla
mIdIri kushI yI* zora .
System 3 (a) gIna +Ke It +i fIre It +a xIbuQI +nI kIfuI +nI i +Ka Imo :
(morph. kabI +a ayI* tI* bIlaOI : beta meOalIti Iti +a : kabI +a mIsI ItI*
segm.) belIOI meOalItI sI motI Iy +u : il +u azez +o .
(b) sImI It +i KalIayI rIba dIma giho +nI Iy +u : nIsu nI* KWla
mIdIri kushI yI* zor +a .
System 4 (a) gInaKe Iti fIre Ita xIbuQInI kIfuInI ItefIlITI omI dIma iKa
(MT-sent- Imo , beta meOalIti mIsI : motI kItImewItI iKa .
unseg) (b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
System 5 (a) gInaKe : kabI fIre omI dIma iKa Imo : kabIa ayItIbI* laOI :
(MT-sent- beta meOalIti Itia : kabIa mIsI ItI* belIOI meOalItIsI motI kItI*
stm) mew +ItI iKa .
(b) sImI Iti KalI +ayI rIba dIma giho +nI Iyu : nIsu nI* KWla
mIdIri kushI yI* zora +nI .
System 6 (a) gIna +Ke It +i fIre It +a xIbuQI +nI kIfuI +nI i +Ka Imo :
(MT-sent- kabI +a ayI* tI* bIlaOI : beta meOalIti Iti +a : kabI +a mIsI ItI*
morph.seg) belIOI meOalItI +sI motI Iy +u : il +u azez +o .
(b) sImI It +i KalIayI rIba dIma giho +nI Iy +u : nIsu nI* KWla
mIdIri kushI yI* zor +a .
meaning is not entirely preserved. Some studies have shown that aggressive segmen-
tation into very fine units might actually hurt the translation quality by unnecessar-
ily enlarging the phrase table and worsening the uncertainty of choosing the correct
phrase candidate [13]. There are two problems with translated-seg (system 3 in Table
7.9). First, the beginning phrase is translated to “you are the tree”, which is different
from the original phrase that has the sense of “from-tree”; and second the last phrase
“you will die” is wrongly translated as “it is death”. In comparison, the translated-
stm output preserves the meaning in the references better than the segmented model.
However, translated-seg seems to have better token coverage compared to the other
models. For example, the phrase “ilu azezo” in the reference was found only in the
translated-seg output. This could explain why the translated-seg scores are better,
as an increase in matched token counts contributes to improving the BLEU score. Therefore,
in post-processing, a de-tokenization step is required to attach morphemes to
their root words and then perform the evaluation at the word level. We plan this type of
analysis for future research.
Chapter 8
Final Remarks
8.1 Conclusion
This thesis presents foundational NLP research for Tigrinya, constructing new
language resources and fundamental NLP components for the first time. Specifically,
we researched Tigrinya POS tagging, morphological segmentation, and the effect of
segmentation on English-to-Tigrinya machine translation. We further analyzed optimal
settings for word embeddings in Tigrinya by designing intrinsic and extrinsic evaluations.
Tigrinya is an under-resourced language, and, consequently, NLP systems for
Tigrinya achieve low scores due to data sparsity or the OOV problem. The complex
inflectional and derivational patterns in Tigrinya generate a vast number of wordforms that
aggravate data sparsity. To address these issues, we explored several methods based
on classification, sequence labeling, and sequence-to-sequence labeling, leveraging
morphological patterns and language-independent substring features including char-
acter and word embeddings.
First, we constructed a news text corpus containing over 15 million tokens. Using
this corpus, we analyzed word embeddings for Tigrinya. In the intrinsic evaluations
of analogy, similarity, and categorization, the syntactic relatedness improved with
shorter word context while the semantic relatedness was better with wider context.
In the extrinsic evaluations, we showed the usefulness of word embeddings applied
to error reduction in POS-tagging task.
Second, we employed the NTC corpus for Tigrinya POS tagging. Employing CRF
sequence labeling, we utilized morphological patterns that boosted the tagging accuracy
of unknown words by around 40%. The overall accuracy achieved was 90.89%.
Although the results show that morphological patterns are very informative for POS
disambiguation, we achieved an improved, state-of-the-art accuracy of 91.6% using
BiLSTM sequence-to-sequence labeling that relies on word embeddings (forgoing
feature engineering).
Third, we presented a new morphologically segmented corpus and the first mor-
phological segmentation research for Tigrinya. Extensive experiments were per-
formed using different variants of BIO tagging scheme (BIE, BIES, BIO, and BIOES)
and supervised approaches exploiting primarily window-based character embeddings
as features for LSTMs and BiLSTMs. The results and the detailed error analyses show
that the BIE (begin-inside-end) scheme provides better representation. Moreover, we
observed that annotating words that do not need segmentation (outside-of-morpheme
chunks) with the “O” tag hurts performance. We also compared the LSTMs with the
CRF-based segmentation, employing language-agnostic character and substring fea-
tures. The results show that BiLSTM-based segmentation outperforms CRF-based
segmentation when both methods are exposed to a similar context window. However,
the CRF-based results can also be enhanced further with careful design of features,
including language-dependent features.
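For illustration, the BIE character tags can be generated from a morpheme-segmented word as follows. Tagging a single-character morpheme with “B” is an assumption of this sketch (the thesis convention may differ), and the example segmentation follows the causative-prefix analysis discussed in Chapter 6.

```python
# Sketch of the BIE (begin-inside-end) character tagging scheme: every
# character of every morpheme is tagged, with no "O" tag for
# outside-of-morpheme chunks.

def bie_tags(morphemes):
    tags = []
    for m in morphemes:
        if len(m) == 1:
            tags.append("B")  # single-character morpheme (assumed "B" here)
        else:
            tags.extend(["B"] + ["I"] * (len(m) - 2) + ["E"])
    return tags

# "abIqWele" with the causative prefix "a" split off.
print(bie_tags(["a", "bIqWele"]))
```

The tagger then becomes a per-character sequence labeling problem: predict one of {B, I, E} for each character, and recover morpheme boundaries from the B/E positions.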
Finally, we explored the effect of stem-based (shallow) and morpheme-based
(fine-grained) morphological segmentation in English-to-Tigrinya phrase-based sta-
tistical machine translation. For this research, we properly aligned the English-Tigrinya
Bible corpus on a verse-level and sentence-level. In the verse-level corpus, the to-
ken count of the English side was greater than the Tigrinya side by around 38%. As
a result of the fine-grained segmentation of the Tigrinya side, the token count was
almost equalized. Furthermore, the OOV ratio decreased by 4.7 percentage points
and 4.2 percentage points for the stem-based and the morpheme-based models, respectively.
Similarly, the perplexity of the language models was reduced (improved) in the
segmented models. Overall, despite the relatively small size of our parallel corpus, we
achieved 1.1 and 1.4 BLEU score improvement in the verse-based corpus as a result
of the shallow and the fine-grained segmentation, respectively. This result indicates
that segmentation can improve the quality of the English-to-Tigrinya phrase-based
statistical machine translation.
8.2 Contributions of the Research
1. A medium-sized news text corpus containing over 15.1 million tokens was con-
structed automatically. Subsequently, we created tools for crawling Tigrinya
websites, cleaning, formatting, transliterating, and normalizing the corpus. More-
over, the corpus was used to generate vocabulary resources such as a lexicon of
over 593,000 unique words and a list of stopwords.
2. An English-Tigrinya parallel text corpus was compiled from the Bible for ma-
chine translation research. The resource has been properly re-aligned for strict
alignment on verse level and sentence level.
3. Tigrinya word embeddings were created and analyzed for the first time from
the 15 million-token corpus. The quality of the word embeddings was evalu-
ated by designing intrinsic and extrinsic evaluations. Accordingly, the optimal
performance achieved was from a Skip-gram model with hierarchical softmax,
a dimension of 300, a window size of two, and a minimum word count of six.
Future work will concentrate on the following primary objectives. First, there is
scope for improving the quality of the corpus and enlarging it by incorporating
genres other than news. This will also
include revising the tagset design as well as rectifying tagging errors and inconsis-
tencies. It would be interesting to use recent advances in labeling strategies, such as
active learning, to enlarge the corpus at a lower cost.
The second objective will be to replace the current word-level tagging with a
morpheme-level approach because the latter reduces complex POS tags to simple POS
categories by separating out the layers of morphological inflection. The morphologi-
cal segmentation introduced in this research will be used for the morphological anal-
ysis prior to tagging.
Finally, there is room to investigate the use of semi-supervised approaches of POS
tagging, employing the unannotated version of the corpus to improve accuracy. The
tagger can be an essential prerequisite for further NLP studies, such as base phrase
chunking, syntactic parsing, machine translation, and text summarization.
8.3 Future Work
Statistical machine translation approaches require large bilingual texts to achieve
reasonable translation quality. However, finding language resources is a big challenge
for low-resource languages such as Tigrinya. In the future, we would like to initiate
the creation of a large English-Tigrinya parallel corpus for effective machine trans-
lation. It has been shown that enlarging language models can improve the translation
quality [94]. Language models are obtained from a monolingual corpus, which is
easier to build than a parallel corpus. Therefore, we are interested in building larger
language models and studying the effect thereof on the translation quality. Finally,
recent advances in machine translation have reported improved results with neural
machine translation [95]. Therefore, with the availability of a larger parallel corpus,
we would also like to explore English-Tigrinya neural machine translation.
Bibliography
[2] Y. Tedla and K. Yamamoto, “Nagaoka Tigrinya Corpus: Design and develop-
ment of part-of-speech tagged corpus”, in Language Processing Society 22nd
Annual Meeting Papers Collection, Tohoku, Japan: The Association for Nat-
ural Language Processing, 2016.
[12] S. Green and J. DeNero, “A class-based agreement model for generating ac-
curately inflected translations”, in Proceedings of the 50th Annual Meeting of
the Association for Computational Linguistics: Long Papers-Volume 1, Asso-
ciation for Computational Linguistics, 2012, pp. 146–155.
[14] J. Mason, Tigrinya grammar. New Jersey: The Red Sea Press Inc., 1996.
[15] Y. Firdyiwek and D. Yaqob, The system for Ethiopic representation in ASCII,
Jan. 1997. [Online]. Available: https://www.researchgate.net/publication/2682324_The_System_for_Ethiopic_Representation_in_ASCII.
[17] M. Gasser, HornMorpho 2.5 user’s guide. Indiana University, Indiana, 2012.
[18] N. A. Kifle, “Differential object marking and topicality in Tigrinya”, pp. 4–25,
2007.
[20] M. Gasser, “Semitic morphological analysis and generation using finite state
transducers with feature structures”, in Proceedings of the 12th Conference
of the European Chapter of the Association for Computational Linguistics,
Association for Computational Linguistics, 2009, pp. 309–317.
[21] A. Ghebre, Tigrinya Grammar, 2nd ed. Stockholm: Admas Forlag, 2000.
[23] Y. Tedla and K. Yamamoto, “Analyzing word embeddings and improving pos
tagger of Tigrinya”, in Proceedings of the International Conference on Asian
Language Processing (IALP), Singapore: IEEE, 2017.
[27] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word
representation”, in Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2014, pp. 1532–1543.
[33] I. Rish, “An empirical study of the naive Bayes classifier”, in International
Joint Conferences on Artificial Intelligence(IJCAI), Workshop on Empirical
Methods in Artificial Intelligence, 2001.
[35] W.-Y. Loh, “Classification and regression trees”, Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14–23, 2011.
[36] C. M. Bishop, Neural networks for pattern recognition. Oxford university press,
1995.
[45] D. Graff and K. Walker, Arabic Newswire Part 1. Philadelphia: Linguistic Data
Consortium, 2001. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2001T55.
[46] M. Maamouri et al., Arabic Treebank: Part 1. Philadelphia: Linguistic Data
Consortium, 2003. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2003T06.
[50] E. Mohammed and S. Kübler, “Is Arabic part of speech tagging feasible with-
out word segmentation?”, in Human Language Technologies; 5th Meeting of
the North American Chapter of the Association of Computational Linguistics,
Los Angeles: Association for Computational Linguistics, 2010, pp. 705–708.
[51] B. Ben Ali and F. Jarray, “Genetic approach for Arabic part of speech
tagging”, International Journal on Natural Language Computing (IJNLC),
vol. 2, no. 3, 2013.
[54] B. G. Gebre, “Part of speech tagging for Amharic”, Master’s thesis, University
of Wolverhampton, 2010.
[55] P. Wang, Y. Qian, F. K. Soong, L. He, and H. Zhao, “A unified tagging so-
lution: Bidirectional LSTM recurrent neural network with word embedding”,
ArXiv preprint arXiv:1511.00215, 2015.
[57] G. Berend, “Sparse coding of neural word embeddings for multilingual se-
quence labeling”, Computing Research Repository (CoRR), vol. abs/1612.07130,
2016. [Online]. Available: http://arxiv.org/abs/1612.07130.
[63] K. Duh and K. Kirchhoff, “POS tagging of dialectal Arabic: a minimally su-
pervised approach”, in Proceedings of the ACL workshop on computational
approaches to Semitic languages, Association for Computational Linguistics,
2005, pp. 55–62.
[69] N. Zalmout and N. Habash, “Don’t throw those morphological analyzers away
just yet: Neural morphological disambiguation for Arabic”, in Proceedings of
the 2017 Conference on Empirical Methods in Natural Language Processing,
2017, pp. 704–713.
[73] N. Reimers and I. Gurevych, “Optimal hyperparameters for deep LSTM net-
works for sequence labeling tasks”, ArXiv preprint arXiv:1707.06799, 2017.
[74] Y. Kitagawa and M. Komachi, “Long short-term memory for Japanese word
segmentation”, ArXiv preprint arXiv:1709.08011, 2017.
[77] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, Com-
puting Research Repository (CoRR), vol. abs/1412.6980, 2014.
[81] M. Popović and H. Ney, “Towards the use of word stems and suffixes for statistical
machine translation”, in Proceedings of the International Conference
on Language Resources and Evaluation (LREC), 2004.
[82] N. Habash and F. Sadat, “Arabic preprocessing schemes for statistical machine
translation”, in Proceedings of the Human Language Technology Conference
of the North American Chapter of the ACL, Association for Computational
Linguistics, 2006, pp. 49–52.
[87] P. Resnik, M. B. Olsen, and M. Diab, “The Bible as a parallel corpus: anno-
tating the book of 2000 tongues”, in Computers and the Humanities: Selected
Papers from TEI 10: Celebrating the Tenth Anniversary of the Text Encoding
Initiative, vol. 33, Denver, Colorado, 1999, pp. 129–153.
[90] P. Koehn et al., “Moses: open source toolkit for statistical machine transla-
tion”, in Proceedings of the 45th Annual Meeting of the ACL on Interactive
Poster and Demonstration Sessions, ser. ACL ’07, Prague, Czech Republic:
Association for Computational Linguistics, 2007, pp. 177–180.
[93] K. Heafield, “KenLM: Faster and smaller language model queries”, in Pro-
ceedings of the Sixth Workshop on Statistical Machine Translation, Associa-
tion for Computational Linguistics, 2011, pp. 187–197.
Appendix
Publication List
Journal Papers
Conference Papers