Doctoral Thesis

Tigrinya Morphological Segmentation with Bidirectional Long
Short-Term Memory Neural Networks and its Effect on
English-Tigrinya Machine Translation

By:
Yemane Keleta Tedla

Supervisor:
Assoc. Prof. Kazuhide Yamamoto
Natural Language Processing Lab

A dissertation submitted in partial fulfillment of the requirements
for the degree of
Doctor of Engineering
in the Department of
Information Science and Control Engineering
Graduate School of Engineering
Nagaoka University of Technology

July 25, 2018



Thesis committee:
Assoc. Prof. Kazuhide Yamamoto (Principal supervisor)
Prof. Takashi Yukawa
Prof. Koichi Yamada
Prof. Hiromi Ban
Prof. Hideko Shibasaki

Abstract

Natural language has a central role in the communication process of human beings. Natural Language Processing (NLP) is the branch of artificial intelligence that enables machines to understand and process human language. NLP products have extensive applications in our day-to-day activities, including grammar correction, spam filtering, personal digital assistants, language translation, and recommendation systems. Significant NLP solutions have been reported for resourceful languages such as English. However, the same is not true for most of the world's languages. For instance, Google Translate currently supports about 100 of the more than 6,000 languages in the world. While several factors contribute to this digital divide, the absence or scarcity of resources is a major bottleneck impeding NLP advances in low-resource languages. Tigrinya is one such language with very limited resources. It is a morphologically rich Semitic language spoken by over seven million people in Eritrea and northern Ethiopia. We aim to initiate Tigrinya language processing from the foundation by constructing essential annotated and unannotated text corpora and building fundamental NLP components. In the process of resource building, we constructed a news text corpus of over 15 million tokens, containing a lexicon of over 593,000 tokens. We processed this corpus to generate important word lists, including Tigrinya stop-word and affix lists. The corpus is preprocessed using several tools developed for Ge'ez-to-Latin script transliteration, clitic normalization, and text cleaning. In earlier research, a part of the corpus containing 72,080 tokens was manually annotated for parts of speech (POS). In this research, we constructed another new resource consisting of over 45,000 morphologically segmented tokens extracted from the POS-tagged corpus. Moreover, we compiled and aligned an English-Tigrinya parallel corpus for machine translation research.

These resources are employed in the following research. First, we approached POS tagging as a classification problem as well as a sequence labeling problem, employing support vector machines (SVM) and conditional random fields (CRF), respectively. We utilized the unique morphological patterns of Tigrinya to boost performance, particularly on unknown words. Our method doubled the accuracy on unknown words from around 39% to 80%; as a result, the overall accuracy obtained was 90.89%. Furthermore, we obtained a state-of-the-art accuracy of 91.6% by approaching POS tagging as sequence-to-sequence labeling using bidirectional long short-term memory (BiLSTM) networks with word embeddings, forgoing feature engineering. Second, we presented the first research on morphological segmentation for Tigrinya. We explored language-independent character and substring features based on CRF. In addition, we obtained a state-of-the-art F1 score of 95.07% with BiLSTM networks using concatenated character and word embeddings. This approach does not require feature engineering to extract linguistic information, which is useful for languages lacking sufficient resources. We also explored several Begin-Inside-Outside (BIO) tagging schemes to discover the recommended strategy for Tigrinya morphological segmentation. Finally, we explored English-to-Tigrinya statistical machine translation. Translation from English to the morphologically rich Tigrinya entails several challenges, including the out-of-vocabulary (OOV) problem, language model perplexity, and poor word alignment. We introduced shallow and fine-grained morphological segmentation to mitigate these problems and improve the convergence of the two languages. Generally, we observed that translation quality can improve by using the morphologically segmented models.

Acknowledgements

First, I thank God for his wonderful gifts in my life that no words of gratitude can adequately express. I would not be where I am today without the mercy and grace of the almighty God.

I am sincerely thankful to my advisor, Assoc. Prof. Kazuhide Yamamoto, for accepting me into his lab and guiding me throughout my doctoral study. I will always be grateful for his outstanding mentoring, motivation, and kindness. I also extend my heartfelt gratitude to Assoc. Prof. Ashu Marasinghe, my former advisor during my master's and the first semester of my doctoral study.

Special thanks go to my thesis committee members, Prof. Takashi Yukawa, Prof. Koichi Yamada, Prof. Hiromi Ban, and Prof. Hideko Shibasaki, for their precious time, insightful comments, and guidance.

I would also like to thank Dr. Chenchen Ding from NICT Japan for his continued interest and useful comments.

It was a great joy to work with my labmates at the Natural Language Processing Lab. I have learned a lot from everyone during our seminars, literature introductions, and intense discussions. My labmates were there for me every time I needed any kind of help. I will always treasure our memories together. I also thank my friends Bereket Samuel and his wife Hanan Hussein for their company and love.

I feel honored to express my gratitude to the people and government of Japan for the financial support provided through the Super Global Monbukagakusho Scholarship. Furthermore, I was supported by the Rotary Yoneyama Memorial Foundation Scholarship. I extend my heartfelt thanks to the peace-loving Rotarians at the Nagaoka Nishi Club for their gracious generosity and hospitality. My Rotary counselor, Mr. Masanori Sanjo, deserves special appreciation for his fatherly care. I am extremely honored to be a part of the Rotary family.

Moreover, I am grateful to Mr. Tsurusaki Tsuneo from the JICA Eritrea Liaison Office for his continued support and kindness throughout my study. I would also like to thank the Koide family, the Osaki family, and Ishiyama-san for their warmth and love that made me feel right at home.

I am very grateful to all the Tigrinya scholars who have taught me through their writings, books, and discussions. I cannot thank Memhir Amare Weldeslassie enough for offering me two great books of Tigrinya grammar without hesitation. I am also grateful to the author Engineer Tekie Tesfay for his encouragement, clarifications, and for sharing his Tigrinya grammar book and other useful documents. I thank the Ministry of Information, Eritrea, for providing documents and allowing the use of Haddas Ertra articles for the research.

I am thankful to the administration and my colleagues at the Eritrea Institute of Technology, who supported me in the pursuit of my studies.

Last but not least, I thank all of my family and friends for investing in my life in every way. Their prayers, love, inspiration, and support have helped me stay committed to my studies over the years. I am truly blessed to have them all in my life.

Contents

Abstract

Acknowledgements

1 Introduction
   1.1 Introduction
   1.2 Research Objectives
   1.3 Part-of-Speech Tagging
   1.4 The Task of Morphological Segmentation
   1.5 Morphological Segmentation for Machine Translation
   1.6 Structure of the Thesis

2 Tigrinya Language
   2.1 Introduction
   2.2 Words and Morphemes
   2.3 Writing System
   2.4 Tigrinya Morphology
   2.5 Morphological Ambiguity
   2.6 POS Ambiguity
   2.7 Chapter Summary

3 Resource Construction
   3.1 Introduction
   3.2 Creating a Large-scale Text Corpus
   3.3 Nagaoka POS Tagged Tigrinya Corpus
   3.4 Morphologically Segmented Corpus
   3.5 English-Tigrinya Parallel Corpus
   3.6 Creating and Analyzing Word Embeddings
       3.6.1 Word Embeddings
       3.6.2 Related Works
       3.6.3 Method
           3.6.3.1 Continuous Bag of Words (CBOW)
           3.6.3.2 Skip-gram Model
       3.6.4 Experiments
           3.6.4.1 Dataset and Settings
           3.6.4.2 Evaluation
           3.6.4.3 Baseline
       3.6.5 Results
       3.6.6 Improving POS Tagging Using Word Embedding
       3.6.7 Summary
   3.7 Chapter Summary

4 Methods of Tagging and Segmentation
   4.1 Introduction
   4.2 Support Vector Machines (SVM)
   4.3 Conditional Random Fields (CRF)
   4.4 Long Short-Term Memory (LSTM)
   4.5 Bidirectional Long Short-Term Memory (BiLSTM)
   4.6 Chapter Summary

5 Tigrinya POS Tagging
   5.1 Introduction
   5.2 Related Works
   5.3 Experiment: POS Tagging Based on SVM and CRF with Morphological Patterns
       5.3.1 Extracting Morphological Patterns
       5.3.2 Datasets and Tagsets
       5.3.3 Features
       5.3.4 Hyperparameter Optimization
       5.3.5 Evaluation
       5.3.6 Baseline
       5.3.7 Results
           5.3.7.1 The Effect of the Tagset Design
           5.3.7.2 The Effect of the Prefix and Suffix Features
           5.3.7.3 The Effect of the Pattern Features
           5.3.7.4 The Effect of the Data Size
           5.3.7.5 Error Analysis
   5.4 Experiment: POS Tagging Based on seq2seq LSTM and BiLSTM
       5.4.1 Datasets and Tagsets
       5.4.2 Training Parameters
       5.4.3 Evaluation
       5.4.4 Baselines
       5.4.5 Results and Analysis
   5.5 Chapter Summary

6 Morphological Segmentation for Tigrinya
   6.1 Introduction
   6.2 Related Works
   6.3 Tagging Schemes
   6.4 Character Embeddings
   6.5 Experiment
       6.5.1 Settings
           6.5.1.1 Datasets
           6.5.1.2 Evaluation
           6.5.1.3 Hyperparameter Tuning
       6.5.2 Baseline
   6.6 Results
       6.6.1 The Effect of BIO Scheme Design with Character Embeddings
       6.6.2 Error Analysis
       6.6.3 The Effects of Character and Word Embeddings
       6.6.4 The Effects of Data Size
   6.7 Chapter Summary

7 English-Tigrinya Machine Translation with Morphological Segmentation
   7.1 Introduction
   7.2 Related Works
   7.3 Methods
       7.3.1 Affix-based Segmentation
       7.3.2 Morpheme-based Segmentation
       7.3.3 Phrase-based Machine Translation
   7.4 Experiments
       7.4.1 Settings
           7.4.1.1 Datasets
           7.4.1.2 Evaluation
       7.4.2 Baseline
   7.5 Results
       7.5.1 Verse-based vs. Sentence-based Systems
       7.5.2 Stemmed vs. Morphologically Segmented Models
       7.5.3 Translation Output
   7.6 Chapter Summary

8 Final Remarks
   8.1 Conclusion
   8.2 Contributions of the Research
   8.3 Future Work
       8.3.1 Improving the Quality of the NTC and the Performance of the POS Tagger
       8.3.2 Improving the Segmented Corpus and Integrating Minimally Supervised Approaches
       8.3.3 Creating a Large Parallel Corpus and Exploring Neural Machine Translation

Bibliography

Publication List



List of Figures

1.1 The difficulty of aligning the unsegmented Tigrinya word with the English words.
1.2 Segmentation improves proper word alignment.
2.1 Typical linguistic features of Tigrinya verbs.
2.2 Example of internal inflection in Tigrinya verbs.
2.3 Affixes in Tigrinya verbs.
3.1 Excerpt from the Tigrinya Bible illustrating the merging of verses (verses 17 and 18 are merged).
3.2 English excerpt from the Bible text corresponding to the Tigrinya verse in Fig. 3.1. The English version has independent verses (verses 17 and 18 are listed separately).
3.3 The English verses are merged to match the merged verses in the Tigrinya text for constructing a verse-aligned parallel corpus (verses 17 and 18 are merged).
4.1 Fixed-size window-based LSTM network for morphological segmentation.
4.2 Fixed-size window-based BiLSTM network for morphological segmentation.
4.3 Sequence-to-sequence BiLSTM neural network for POS tagging.
5.1 The performance of the tagger improves with the availability of more data.
5.2 The performance of the tagger improves with the availability of more data.
6.1 Plotting the distribution of K-fold scores.
7.1 Translation system architecture for the English-to-Tigrinya phrase-based translation.
7.2 The proposed system evaluation for segmented and unsegmented models.

List of Tables

1.1 An inflected Tigrinya word is expressed with multiple words in English.
2.1 Examples of Tigrinya tokens and their segmented morphemes.
2.2 Ge'ez script base alphabets.
2.3 Ge'ez script: examples of seven-ordered and five-ordered alphabets.
2.4 Examples of intermediate splits caused by under-segmentation. The italicized second row is the expected segmentation.
3.1 Example of responses to a relatedness query (model=skip-gram, features=300, window=2, min. word=6). Boldfaced morphemes represent the morphological transfer in the analogy. The underlined words have some unrelatedness to the criteria.
3.2 Examples of similarities by POS tagging (model=skip-gram, features=300, window=2, min. word=6).
3.3 Categorization results: words in bold are the responses (model=skip-gram, features=300, window=2, min. word=6).
5.1 Some patterns of gerundive verbs (V_GER).
5.2 The full tagset: labels ending with _C, _P, or _PC indicate clitics of conjunction, preposition, or both, respectively. For example, N_C refers to a NOUN (N) with an attached CONJUNCTION (C).
5.3 Overall performance for the data with tagset-1 and tagset-2. Context = word-2, pos-2, word-1, pos-1, word, word+1, word+2; ptn = consonant-vowel pattern; pref = prefix; suf = suffix; all = context + all affixes.
5.4 The effect of morphological features on known (kno.) and unknown (unk.) word tagging based on accuracy results (%) for tagset-1.
5.5 Morphological features and the error rate (%) of unknown words for selected POS tags in the CRF-based experiment.
5.6 The accuracy of the CRF-based tagger (tagset-1) as data size increases.
5.7 Tag-wise precision, recall, and F1-score for the SVM-based tagger (tagset-2). Support is the number of words for the respective tag.
5.8 Results of LSTM and BiLSTM models for POS tagging in the sequence-to-sequence setting. The vocabulary sizes marked "top10k" and "all" denote the most frequent 10,000 tokens and all the vocabulary (18,740 tokens), respectively.
6.1 Annotating with different tagging schemes.
6.2 An example of generated fixed-window character sequences with assigned labels.
6.3 Training data split as per the count of the fixed-width character windows (vectors) used as the actual input sequences.
6.4 LSTM hyperparameters selected by tuning.
6.5 Results of CRF, LSTM, and BiLSTM experiments with four BIO schemes.
6.6 BiLSTM confusion matrix comparison for the BIE and BIO schemes (%). Boldface entries represent the BIE scheme. Rows and columns represent predicted and true values, respectively.
6.7 Sample segmentation result; the underscore character marks segmentation boundaries, and the boldfaced characters represent segmentation errors.
6.8 The effect of additional word and POS features on segmentation performance.
6.9 The effect of increasing data size on segmentation performance. Generally, the performance improves with the availability of more data.
7.1 Example of segmented words and the grammatical functions of the segments.
7.2 Example of phrase-based translation pairs.
7.3 Dataset-1: Verse-aligned parallel corpus.
7.4 Dataset-2: Sentence-aligned parallel corpus.
7.5 MT-verse and MT-sent: BLEU, METEOR, and TER scores.
7.6 Perplexity: test tokens with OOVs included.
7.7 MT-verse: BLEU, METEOR, and TER scores tested on Test-2.
7.8 MT-verse: Perplexity and OOV evaluated on Test-2.
7.9 Sample translations from the MT-verse and MT-sent models.

List of Abbreviations

NLP Natural Language Processing
CRF Conditional Random Fields
SVM Support Vector Machines
LSTM Long Short-Term Memory
BiLSTM Bidirectional Long Short-Term Memory
POS Part(s) of Speech
NER Named Entity Recognition
NTC Nagaoka Tigrinya Corpus
CV Cross-Validation
BIO Begin Inside Outside scheme
BIE Begin Inside End scheme
BIES Begin Inside End Single scheme
BIOES Begin Inside Outside End Single scheme
OOV Out of Vocabulary
BLEU BiLingual Evaluation Understudy
METEOR Metric for Evaluation of Translation with Explicit ORdering
TER Translation Error Rate
SMT Statistical Machine Translation
FST Finite State Transducers

Dedicated to all who have invested in my life



Chapter 1

Introduction

1.1 Introduction
Tigrinya belongs to the Semitic branch of the Afroasiatic language family, along with Hebrew, Amharic, Maltese, Tigre, and Arabic. Tigrinya has over 7 million speakers in Eritrea and northern Ethiopia. As noted in [1], although resource-rich languages such as English have well-developed language tools, low-resource languages suffer from low-grade electronic data support, or its absence altogether, for pursuing NLP research. Tigrinya is one such low-resource language. Unlike major Semitic languages that enjoy relatively widespread NLP research and resources, Tigrinya has been largely ignored in NLP research, mainly due to the absence of a Tigrinya text corpus. A text corpus is a large collection of text intended for linguistic processing. Although a few electronic dictionaries and a lexicon of automatically extracted Tigrinya words exist, no linguistically annotated text corpus has been made available to the public. Given this circumstance, new research is needed to initiate NLP research from the foundation by constructing a corpus, and thereby language tools, for the advancement of information access in Tigrinya. Therefore, in earlier research, we constructed and released the first POS-tagged Tigrinya corpus, containing 72,080 words [2], and developed a POS tagger based on this corpus [3]. In this study, we continue this research to improve the performance of the tagger. However, our primary objective in this research is to create a new morphologically segmented corpus and investigate the first Tigrinya morphological segmentation with language-independent and automatic feature engineering approaches that leverage the rich morphology of Tigrinya. We are also interested in exploring English-to-Tigrinya machine translation using the Bible, the only publicly available English-Tigrinya parallel corpus. We introduce morphological segmentation on the target language, Tigrinya, to enhance translation quality by improving the convergence of these two languages. The detailed research objectives are presented in Section 1.2. Furthermore, in Chapter 1, we describe the tasks of POS tagging, morphological segmentation, and machine translation.

1.2 Research Objectives


In this section, we list the objectives of the research as follows.

1. Tigrinya is a low-resource language that lacks both annotated and unannotated resources for language processing. First, we would like to mitigate this problem by building a new corpus that will motivate fundamental NLP research for Tigrinya. Specifically, we focus on building a morphologically segmented corpus, that is, words manually annotated for the morpheme boundaries of prefixes, circumfixes, suffixes, and stems. We are also interested in constructing a large text corpus for word embedding and lexicon generation. This corpus will be a useful resource for many tasks, including semi-supervised approaches and the language models of translation systems. Furthermore, we aim to compile an English-Tigrinya parallel corpus for use in our machine translation research.

2. In our earlier research, we constructed the first POS-tagged corpus for Tigrinya (the Nagaoka Tigrinya Corpus, NTC). We also developed a POS tagger based on transformation-based learning and hidden Markov models. However, given the small size of the corpus and the morphological complexity of Tigrinya, the performance of the tagger, particularly with regard to unknown words, was quite low. In this research, we are interested in improving the performance of this POS tagger. Therefore, we intend to introduce methods that exploit the unique morphological patterns of Tigrinya and employ automatic feature engineering, such as word embedding, to boost performance.

3. One of the fundamental steps in NLP is morphological analysis. Having built a manually segmented corpus, our third objective is investigating morphological segmentation for Tigrinya words. We would like to discover the recommended morpheme tagging scheme suitable for Tigrinya morphological segmentation. We propose the use of state-of-the-art sequence labeling approaches and feature engineering methods that can sufficiently handle the complex morphology of Tigrinya.

4. Most of the information on the internet is available in well-known languages such as English. Although the number of Tigrinya-based websites seems to be increasing over time, the content of these sites is mostly limited to religion or politics. One way to improve information access in Tigrinya is by translating existing English content. However, Tigrinya is not supported by Google Translate, and the only publicly available English-Tigrinya corpus is the Bible. In this research, we aim to investigate the properties of English-Tigrinya statistical machine translation using the Bible as a parallel corpus. These languages are linguistically divergent, and hence we propose stem-based and morpheme-based segmentation of Tigrinya words in the preprocessing step to improve machine translation quality.

In the sections that follow, we present a brief introduction to POS tagging, morphological segmentation, and the use of morphological segmentation in machine translation.

1.3 Part-of-Speech Tagging


Parts-of-speech (POS) are lexical categories of words such as verbs, nouns, and ad-
jectives. The process of labeling a word with its appropriate lexical category is known
as POS tagging. A word’s POS role may change depending on its lexical information
or the contextual syntactic arrangement. Consider the following simple sentence:

Example 1.1. “Flies like a flower.”

The interpretation of Example 1.1 may differ significantly according to the acti-
vated POS role of each word from the several possibilities. For instance, assuming
4 Chapter 1. Introduction

the word class of “Flies” is noun, the sentence conveys that “flies (the insects) love a
flower.” However, if the word “Flies” takes on the role of a verb, then the meaning
changes to “something performs the action of flying just like a flower does.” Sim-
ilarly, the word “like” can function as a preposition, adverb, conjunction, noun, or
verb, and the word flower can appear as a noun or a verb. Therefore, the correct
translation of Example 1.1 requires resolving the most likely POS tag of each word
within the given context.
The number of POS categories varies from language to language, depending on
the inherent properties of languages and the target application. For example, the
tagsets for the morphologically rich Arabic language defined by Penn Treebank con-
sisted up to 2000 fine-grained POS tags [4], whereas the Penn Treebank POS tagsets
for English contained only 45 tags [5]. Moreover, a universal tagset that consists
of twelve tags was proposed in an effort to standardize the most frequent and useful
POS tags across multiple languages [6]. For the current Tigrinya research, a set of 73 tags was defined to capture three levels of grammatical information. The complete tagset and the corresponding experiments are detailed in Chapter 5.
There are, broadly, two computational approaches to automatic POS tagging [7]. The rule-based approach is the classical tagging method that relies on linguistically motivated disambiguation rules. Its main downside is the high cost of developing and maintaining the rules. The second, and most widely adopted, approach uses statistical models that learn from a POS-tagged corpus to compute the most probable tag sequence for a given word sequence. We adopted the latter approach, based on support vector machine (SVM) classification [8], conditional random fields (CRFs) [9], and long short-term memory (LSTM) neural network sequence labeling architectures [10]. POS tagging is a necessary prior step in many NLP applications such as syntactic parsing and word sense disambiguation (WSD).
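As a concrete illustration of the statistical approach, the sketch below extracts simple context-window features for a token, of the kind typically fed to an SVM or CRF tagger. The feature names and window size here are illustrative assumptions, not the exact feature set used in this thesis.

```python
def word_features(words, i, window=2):
    """Context features for the token at position i (an illustrative
    sketch; the thesis feature set may differ)."""
    w = words[i]
    feats = {
        "word": w.lower(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "is_title": w.istitle(),
    }
    # Surrounding words within a fixed-size window.
    for off in range(-window, window + 1):
        if off != 0 and 0 <= i + off < len(words):
            feats[f"word[{off:+d}]"] = words[i + off].lower()
    return feats

sent = ["Flies", "like", "a", "flower"]
feats = word_features(sent, 0)  # features for the ambiguous word "Flies"
```

A classifier trained on such feature dictionaries can then learn, for example, that a sentence-initial titlecase word followed by "like" is more likely a noun than a verb.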

1.4 The Task of Morphological Segmentation


Morphological processing at the word level is usually the initial step in the differ-
ent stages of NLP. Morphemes constitute the minimal meaning-bearing units in a
1.5. Morphological Segmentation for Machine Translation 5

language [11]. In this thesis, we focus on the task of detecting morphological bound-
aries, also referred to as morphological segmentation. This task involves the break-
ing down of words into their component morphemes. For example, the English word
“reads” can be segmented into “read” and “s”, where “read” is the stem and “s” is an
inflectional morpheme marking the third-person singular.
Morphological segmentation is useful for several downstream NLP tasks, such as
morphological analysis, POS tagging, stemming, and lemmatization. Segmentation
is also applied as an important preprocessing phase in a number of systems, including
machine translation, information retrieval, and speech recognition.
Segmentation is mainly performed using rule-based or machine learning approaches.
Rule-based approaches can be quite expensive and language dependent because the
morphemes and all the affixation rules need to be identified to disambiguate seg-
mentation boundaries. Machine learning approaches, on the other hand, are data-driven: the underlying structure is extracted automatically from the data. We present
supervised morphological segmentation based on CRFs [9] and LSTM neural net-
works [10]. Since morphemes are sequences of characters, we address the problem
as a sequence tagging task and propose a fixed-size window approach for modeling
contextual information of characters. CRFs are well suited to this kind of sequence-aware classification task. We also exploit the long-distance memory capabilities of
LSTMs for modeling boundaries of morphemes.
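The fixed-size character window can be sketched as follows; the padding symbol and window size are illustrative assumptions rather than the exact configuration used in the experiments.

```python
def char_window(word, i, window=3):
    """Return the characters in a fixed-size window around position i,
    padding with '#' at the word boundaries (an illustrative sketch)."""
    padded = "#" * window + word + "#" * window
    center = i + window
    return {f"c[{k:+d}]": padded[center + k] for k in range(-window, window + 1)}

feats = char_window("reads", 4)  # window around the final "s"
```

Each character position thus receives a feature vector describing its neighbors, which a CRF or LSTM can use to decide whether a morpheme boundary precedes it.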

1.5 Morphological Segmentation for Machine Translation

Machine translation systems automatically translate text from one natural language (the source language) into another (the target language). Statistical machine translation (SMT) systems may not be consistently accurate, but they often convey enough of the source-language content to be understood. The research presented here investigates an English-to-Tigrinya translation system, using the Christian Holy Bible (“the Bible” hereafter) as a parallel corpus. Due to morpho-syntactic divergence,
machine translation between these two languages is a difficult task. As mentioned

earlier, Tigrinya is a highly inflected language, while the English language is weakly
inflected [12]. A single word in Tigrinya may be expressed by one or more words
when translated into English. Consider the translation examples in Table 1.1, which show a growing imbalance in token count as the Tigrinya verb is inflected further.
Table 1.1: An inflected Tigrinya word is expressed with multiple words in English.

Tigrinya word               English translation
HItto                       question
Hatete                      he asked
InIteHatete                 if he asked
InItezeyIHatete             if he did not ask
InItezeyIHatetIkayo         if you did not ask him
InItezeyIHatetIkalomI       if you did not ask about/for them
InItezeyIHatetIkalomInI     if you did not ask about/for them and

In the compiled Bible corpus, the English side has 1,002,214 tokens while the
Tigrinya side contains 623,984 tokens. Accordingly, the English side is larger by
around 38% in the token count. This mismatch causes data sparseness on the Tigrinya
side, aggravating the OOV problem caused by insufficient data. Moreover, source-
target word alignment cannot be performed properly as illustrated in Fig. 1.1.
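The quoted percentage can be checked with quick arithmetic on the reported token counts:

```python
eng_tokens = 1_002_214  # English side of the Bible corpus
tig_tokens = 623_984    # Tigrinya side

# Mismatch relative to the English side.
mismatch = (eng_tokens - tig_tokens) / eng_tokens
print(f"{mismatch:.1%}")  # 37.7%, i.e. around 38%
```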
Decomposing the inflected Tigrinya words into their constituent morphemes as
shown in Fig. 1.2 improves the alignment of words. Moreover, the granularity of
segmentation has been shown to affect the performance of translation systems [13].
Therefore, we explore the effect of coarse-grained (affix-based) segmentation and
fine-grained (morpheme-based) segmentation. In the affix-based segmentation, we
apply shallow pruning to find the longest prefix/suffix resulting in prefix-stem-suffix
segments (Example 1.2);1 while in morphological segmentation, we attempt detailed
morpheme-level segmentation (Example 1.3).

Example 1.2. [affix-based]


InIteHatetelomInI → InIte* Hatet +elomInI (if he asked about/for them)
1 The “*” and “+” symbols denote prefixes and suffixes, respectively.

Example 1.3. [morpheme-based]


InIteHatetelomInI → InIte* Hatet +e +lomI +nI (if he asked about/for them)
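A greedy longest-prefix/longest-suffix split of the kind used in the affix-based scheme can be sketched as below. The affix inventory here is a tiny hypothetical list for illustration; the actual system uses a much larger, linguistically grounded one.

```python
# Hypothetical, tiny affix inventory for illustration only.
PREFIXES = ["InItezeyI", "InIte", "zeyI", "bI", "te"]
SUFFIXES = ["elomInI", "elomI", "InI", "ka", "e"]

def affix_segment(word):
    """Greedy longest-prefix / longest-suffix split into
    prefix* STEM +suffix, mirroring the format of Example 1.2."""
    prefix = max((p for p in PREFIXES if word.startswith(p)), key=len, default="")
    rest = word[len(prefix):]
    suffix = max((s for s in SUFFIXES if rest.endswith(s)), key=len, default="")
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    parts = []
    if prefix:
        parts.append(prefix + "*")
    parts.append(stem)
    if suffix:
        parts.append("+" + suffix)
    return " ".join(parts)
```

Under this toy inventory, the word from Example 1.2 splits into the same three coarse segments shown above.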

Figure 1.1: The difficulty of aligning the unsegmented Tigrinya word with the English words.

Figure 1.2: Segmentation improves proper word alignment.

As a result, the token count difference was substantially reduced, to around 0.1% with coarse-grained segmentation and 0.02% with fine-grained segmentation. This research explores whether this word-granularity transformation improves English-Tigrinya machine translation.

1.6 Structure of the Thesis


The rest of the thesis is organized as follows. Chapter 2 introduces the Tigrinya
language and the relevant challenges of tagging and segmentation. In Chapter 3,
we present the various Tigrinya language resources that have contributed to this re-
search. Chapter 4 then explains the methods employed for tagging and segmentation. The related work, experiments, results, and analyses re-
garding POS tagging, morphological segmentation, and the effect of segmentation on
machine translation are discussed in Chapters 5, 6, and 7, respectively. Finally, we provide concluding remarks and future work in Chapter 8.

Chapter 2

Tigrinya Language

2.1 Introduction
This chapter is an introduction to the relevant aspects of the Tigrinya language. In this
chapter, we first describe words and morphemes in Tigrinya and present the Ge’ez
script, which is a special writing system of Eritrea and Ethiopia. Next, we discuss
the unique non-concatenative morphology of Tigrinya, followed by the challenges in
Tigrinya morphological analysis and part-of-speech disambiguation.

2.2 Words and Morphemes


Tigrinya tokens in a sentence are delimited by spaces. These tokens are traditionally considered words. However, a word in Tigrinya may be composed of multiple morphemes: a stem together with grammatical affixes as well as conjunction and preposition clitics. For ex-
ample, the token እንተዘይሓተትካያ InItezeyIHatetIkaya (if you did not ask her) can be
segmented into morphemes as follows (the pipe symbol “|” marks the morpheme
boundary).

InIte | zeyI | HatetI | ka | ya

The function of these morpheme units is listed below:


InIte - If (conjunction)
zeyI - did not (negation)
HatetI - ask (perfective verb)
ka - you (subject pronoun; masculine, second-person, singular)

ya - her (object pronoun; feminine, third-person, singular)
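Such a segmentation is lossless at the surface level: concatenating the morphemes restores the original token. A minimal check of this property, using the example above:

```python
morphemes = ["InIte", "zeyI", "HatetI", "ka", "ya"]
glosses = ["if (conjunction)", "did not (negation)", "ask (perfective verb)",
           "you (subject pronoun)", "her (object pronoun)"]

# Concatenating the morphemes reproduces the unsegmented token.
token = "".join(morphemes)
for m, g in zip(morphemes, glosses):
    print(f"{m:8} -> {g}")
```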


Such types of words, particularly verbs, effectively represent a phrase rather than
a single word. We show a few examples of Tigrinya tokens and their segmented
morphemes in Table 2.1. It can be understood from the English translation that most
of the Tigrinya morphemes correspond to individual words in the English translation.
Furthermore, a detailed analysis of the grammatical features and roles of a typical
Tigrinya verb is depicted in Fig. 2.1.

Table 2.1: Examples of Tigrinya tokens and their segmented morphemes.

Token                 Morpheme(s)                       Main POS      English
HItto                 HItto                             Noun          question
mIHItatI              mI | HItatI                       Verbal noun   to ask
Hatete                Hatet | e                         Verb          he asked
keyItIHatItI          keyI | tI | HatIt | I             Verb          don't ask
IntezeyIHatetIkaya    Inte | zeyI | HatetI | ka | ya    Verb          If you did not ask her

Figure 2.1: Typical linguistic features of Tigrinya verbs



2.3 Writing System


Tigrinya is one of the few African languages that still use an indigenous writing system, called the Ge’ez script (also known as Ethiopic), for education and daily communication.1 Other Semitic languages that use the Ge’ez script are Amharic, the official language of Ethiopia, and the Tigre language of Eritrea. The writing system is adopted from the ancient Ge’ez language, which today serves as a liturgical language. Unlike the Arabic and Hebrew scripts, the Ge’ez script is written from left to right.

Table 2.2: Ge’ez script base alphabets

ሀ ለ ሐ መ ሠ ረ ሰ ሸ ቀ ቐ በ ተ ቸ ኀ ነ ኘ አ ከ
he le He me se re se she qe Qe be te che he ne Ne ’e ke
ኸ ወ ዐ ዘ ዠ የ ደ ጀ ገ ጠ ጨ ጰ ጸ ፀ ፈ ፐ ቨ
Ke we Oe ze Ze ye de je ge Te Ce Pe Se Se fe pe ve

Table 2.3: Ge’ez script: examples of seven-order and five-order alphabets.

ሀ ሁ ሂ ሃ ሄ ህ ሆ
he hu hi ha hE hI ho
ኰ ኲ ኳ ኴ ኵ
kWe kWi kWa kWE kWI

The Ge’ez script is an abugida 2 system where each letter (alphabet) represents a
joint consonant-vowel (CV) syllable. Accordingly, Tigrinya identifies seven vowels
usually called “orders” [14]. There are also a few alphabets that are variants of some
of the 35 base alphabets with only five orders (Table 2.3). In sum, about 275 symbols
make up the Tigrinya alphabet chart known as ፊደል Fidel. The base alphabets from
which the orders stem are shown in Table 2.2.
The Ge’ez script does not mark the lengthening of consonants or gemination in
pronunciation; however, this limitation does not seem to create difficulties for native
1 http://mgafrica.com/article/2014-06-11-in-the-race-between-african-scripts-and-the-latin-alphabet-only-ethiopia-and-eritrea-are-in-the-game
2 The term “abugida” (አቡጊዳ) itself is derived from letters of the Ge’ez alphabet chart: አ (a), ቡ (bu), ጊ (gi), and ዳ (da) are the first, second, third, and fourth alphabets of the a, b, g, and d orders, respectively.

speakers. In this thesis, Tigrinya words are transliterated to Latin characters accord-
ing to the SERA transliteration scheme [15], with the addition of “I” for explicit
marking of the epenthetic vowel traditionally known as ሳድስ SadIsI.
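A tiny fragment of such a transliteration table, restricted to the first row of Table 2.3 (the full SERA mapping covers the whole Fidel chart of roughly 275 symbols):

```python
# First-row syllables from Table 2.3; a toy subset of the full SERA scheme.
SERA = {"ሀ": "he", "ሁ": "hu", "ሂ": "hi", "ሃ": "ha", "ሄ": "hE", "ህ": "hI", "ሆ": "ho"}

def transliterate(text):
    """Map Ge'ez characters to Latin; unknown characters pass through."""
    return "".join(SERA.get(ch, ch) for ch in text)

print(transliterate("ሀሁሂ"))
```

Note that case carries information here (e.g. "hE" vs. "he"), which is why the Tigrinya side of later corpora is never lowercased.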

2.4 Tigrinya Morphology


The morphology of Semitic languages, known as “root-and-pattern” morphology,
has distinct non-concatenative properties that intercalate consonantal roots and vowel
patterns [16]. This morphology defines the tense-aspect-mood (TAM) verb forms of
Semitic languages traditionally known as perfective, imperfective, gerundive, and
imperative/jussive verbs. In addition to the usual prefix/suffix inflection, Semitic
verbs inflect internally to indicate one of the four TAM forms. For example, in Tigrinya, the words ሓተትካ HatetIka (you asked) and ሓቲትካ HatitIka (you asked) share a common triconsonantal root, or radicals, “H-t-t”, while the internal sequences
of vowel patterns (“a-e-I” and “a-i-I”) inserted in-between the radicals are inflected
to signal perfective and gerundive forms respectively. Fig. 2.2 depicts this internal
TAM inflection.
Figure 2.2: Example of internal inflection in Tigrinya verbs.
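The interleaving of radicals and vowel patterns can be sketched as below, assuming equal numbers of radicals and pattern slots; real Tigrinya templates are richer than this toy function.

```python
def intercalate(radicals, vowels):
    """Interleave consonantal radicals with a vowel pattern
    (a root-and-pattern sketch)."""
    return "".join(c + v for c, v in zip(radicals, vowels))

perfective = intercalate("Htt", ["a", "e", "I"])  # HatetI
gerundive = intercalate("Htt", ["a", "i", "I"])   # HatitI
# Adding the subject suffix "ka" yields the forms cited in the text.
```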

Fig. 2.3 shows the order and position of Tigrinya affixes. The circumfix is formed
from two constituents that are found at the prefix (CIRC-1) and the suffix side (CIRC-
2) illustrated by the arc in the figure.
The affixes represent various morphological features including gender, person,
number, tense, aspect, mood, voice, and so on. Furthermore, there are clitics of
mostly prepositions and conjunctions that can be affixed to other words. Some of

Figure 2.3: Affixes in Tigrinya verbs

the typical linguistic features of a Tigrinya verb are illustrated in Fig. 2.1. The single
token in Tigrinya is expressed by multiple words in English. If the token is segmented
into morphological units, we see that a lot of linguistic information can be extracted,
which may be leveraged for disambiguation tasks.
Specifically, the order of morpheme slots is defined by [17] as follows.

(prep|conj)(rel)(neg)sbjSTEMsbj(obj)(neg)(conj)

The “prep” slot refers to preposition prefixes while the “conj” slot represents con-
junctions that can appear as both prefixes and suffixes. The “rel” indicates a rela-
tivizer (the prefix ዝ/ዚ zI/zi) corresponding in function to the English demonstratives
such as that, which, and who. The “sbj” on either side of the STEM are a prefix
and/or suffix of the four verb types, namely perfective, imperfective, imperative, and
gerundive. As shown in examples 2.1 and 2.4, the perfective and gerundive verbs con-
jugate only on suffixes while imperfective verbs undergo both the prefix and suffix
inflections (example 2.2). The imperatives either show the suffix-only conjugations
or change prefix as well (example 2.3). In addition to the verb type, these fusional
morphemes convey gender, person, and number information.

Example 2.1. seber + u (they broke)

Example 2.2. yI + sebIr + u (they break)



Example 2.3. yI + sIber + u (break/let them break)

Example 2.4. sebir + omI (they broke)

Moreover, Tigrinya independent pronouns have a pronominal suffix (SUF) of


gender, person, and number as shown in examples 2.5 and 2.6.

Example 2.5. nIsu (he) → nIs + u/SUF

Example 2.6. nIsa (she) → nIs + a/SUF

The word order typology is normally subject-object-verb (SOV), though there are
cases in which this sequence may not apply strictly [18], [19]. Changes in “sbj” verb
affixes, along with pronoun inflections, enforce subject-verb agreements. One as-
pect of the non-concatenative morphology in Tigrinya is the circumfixing of the negation morpheme in the structure “ayI-STEM-nI” [17]. Some conjunction enclitics such as
“do; ke; Ke” can also be found in Tigrinya orthography as mostly bound suffix mor-
phemes. For example,
KeyIdudo? → keyId + u + do (did he go?)
nIsuKe? → nIs + u + Ke (what about him?)
The pronominal object marker “obj” is always suffixed to the verb as shown in ex-
amples 2.7 to 2.10. According to [18], Tigrinya suffixes of object pronoun can be
categorized into two constructs. The first is described in relation to verbs (examples
2.7, 2.8 and 2.9) and the other indicates the semantic role of applicative cases by
inflecting for prepositional enclitic “lI” + a pronominal suffix as in example 2.10.

Example 2.7. beliOI + wo/obj (he ate [something])

Example 2.8. hibu + wo/obj (he gave [something] to him)

Example 2.9. hibu + ni/obj (he gave me [something])

Example 2.10. beliOI + lu/obj (he ate for him/he ate using [it])

Tigrinya words are also generated by derivational morphology. There are up to


eight derivational categories that can originate from a single verb [20]. For example,
the passive form of perfective (example 2.11) and gerundive verbs (example 2.12) is
constructed by prefixing “te” to the main verb. Furthermore, adverbs can be derived

from nouns by prefixing “bI-” which has similar functionality as the English “-ly”
suffix (example 2.13).

Example 2.11. zekere (he remembered) → te + zekere (he was remembered)

Example 2.12. zekiru (he remembered) → te + zekiru (he was remembered)

Example 2.13. bIHaqi (truly) → bI + Haqi

In this work, we deal with boundaries of prefix, stem, and suffix morphemes.
Inflections realized through internal stem changes are not amenable to this type of segmentation.

In conclusion, all the morphological and derivational processes generate a large


number of complex word forms. According to [20], Tigrinya verb inflections, deriva-
tions, and combinations of the slot order can produce more than 100,000 word forms
from a single verb root. The ambiguity and difficulty in relation to segmentation are
briefly discussed in the following section.

2.5 Morphological Ambiguity


In segmentation, ambiguity may occur at the word-level or due to sentential con-
text. In Tigrinya, a major source of ambiguity is when certain character sequences
of morphemes are natively present as part of words. In this case, the characters do
not represent grammatical features and hence segmentation should not be applied.
Consider the words ብርሃን bIrIhanI (light) and ብሓይሊ bIHayli (by force). The same
prefix “bI-” which is an inseparable part of the noun bIrIhanI, represents an adverb
of manner (by) in the second word. Moreover, morphemes may also appear as con-
stituents of other morphemes. For instance, the noun suffix “-netI” (example 2.16)
contains the sub-morph “-etI” that can have the role of a suffix for conjugation of
third person, feminine, singular attributes as in example 2.15. Note that “-netI” in
the example 2.14 is not a morpheme.

Example 2.14. genetI → genetI/NOUN_paradise

Example 2.15. wesenetI → wesen/STEM_decided + etI/SUF_she



Table 2.4: Examples of intermediate splits caused by under-segmentation. The italicized second row is the expected segmentation.

InItezeyIHatetIkayomI
InIte-zeyI-HatetI-ka-yomI
InIte-zeyIHatetIkayomI
InIte-zeyI-HatetIkayomI
InIte-zeyI-HatetI-kayomI
InItezeyIHatetIka-yomI
InItezeyIHatetI-kayomI
InItezeyI-HatetIkayomI

Example 2.16. naxInetI → naxI/STEM_independence + netI/NOUN-SUF

Moreover, the lack of gemination marking may introduce segmentation ambiguity. For example, the word መደረ medere can be interpreted as the noun “speech” or
the phrase “he gave a speech” if the “de” in medere is geminated. A computational
system must discern both cases so that the phrase is segmented while the noun is left
intact. Resolving such cases may require more context or additional linguistic infor-
mation such as part-of-speech. In this work, we would like to avoid resorting to any
language-specific knowledge. Therefore, in the LSTM approach, we use character
embeddings to capture contextual dependencies of characters.
As explained earlier, Tigrinya words have multiple consecutive morpheme slots.
This pattern causes under-segmentation confusion due to the numerous intermediate
splits comprising atomic and composite morphemes. Table 2.4 lists some of the pos-
sible segmentations for the word token እንተዘይሓተትካዮም InItezeyIHatetIkayomI (if
you did not ask them).
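The candidate splits in Table 2.4 arise because each internal morpheme boundary can independently be kept or dropped; with five morphemes there are 2^4 = 16 variants. This enumeration can be sketched as:

```python
from itertools import combinations

def all_splits(morphemes):
    """Enumerate every way of keeping or dropping each internal
    boundary between adjacent morphemes (2^(n-1) variants)."""
    n = len(morphemes)
    results = []
    for r in range(n):
        for kept in combinations(range(1, n), r):
            parts, start = [], 0
            for b in list(kept) + [n]:
                parts.append("".join(morphemes[start:b]))
                start = b
            results.append("-".join(parts))
    return results

splits = all_splits(["InIte", "zeyI", "HatetI", "ka", "yomI"])
```

A segmenter must pick the single fully segmented variant out of these 16 candidates, which is why partial, under-segmented outputs are a common error mode.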
Furthermore, in Tigrinya, compound words are often written attached. For example, ቤት መግቢ betI (house) megIbi (food) translates to “restaurant” in English. In the orthography, such words can be found either separated or attached. We queried words starting with “betI-” in a text corpus containing about 5 million words extracted from the Haddas Ertra newspaper, which is published in Eritrea. The word above, for instance, occurs delimited by a space in about 95% of the matches. Another compound word, ቤት ጽሕፈት betI SIHIfetI (office), was found attached in about 24% of the matches, which is not a negligible portion. Although the words are not grammatical

morphemes, we believe segmenting (normalizing) these words would be useful for practical reasons such as mitigating data sparseness.

2.6 POS Ambiguity


POS ambiguity may arise from the behavior of some words to assume various POS
roles according to their lexical information and the surrounding context. Some cases
of ambiguity in Tigrinya are discussed in this section. Demonstrative pronouns and
demonstrative adjectives in Tigrinya may be ambiguous, depending on the word being modified [14]. Consider the following statements:

Example 2.17. እዚ ናተይ እዩ Izi nateyI Iyu (This is mine)

Example 2.18. እዚ ቤት እዚ ናተይ እዩ Izi bEtI Izi nateyI Iyu (This house is mine)

In example 2.17, Izi (this) functions as a pronoun, whereas in example 2.18, “this”
is modifying the noun (house) and is hence an adjective. In Tigrinya, demonstrative
adjectives tend to repeat themselves in the pattern Izi NOUN Izi. Furthermore, an
adjective may take the role of a noun, a relative verb, a pronoun, a proper noun, or an
interjection in a sentence [19]. As described in relation to morphological ambiguity
(section 2.5), inflection and affixation of Tigrinya words may also render lexically
ambiguous words. Consider again the prefix ብ bI (by): in some contexts it introduces ambiguity, since prefixing “bI” turns the role of a noun into an adverb [21]. For instance, in the word ብመምህር bImemIhIrI (by a teacher), the prefix (bI) functions as a preposition. However, the word ብትብዓት bItIbIOatI (bravely) assumes the role of an adverb, specifically when it modifies a verb. Moreover, “bI” is not always a prefix; sometimes it is part of the word itself. For example, the word ብስራት bIsIratI (good news), which can appear as a noun and a proper noun, starts with “bI”. This ambiguity can be partly resolved by a com-
bination of stemming and lexicon look-up to check if the stem exists in the lexicon
or alternatively looking at the context for disambiguation information such as the
part-of-speech of the word being modified. In this research, we applied the latter
alternative to disambiguate the POS based on the contextual information.

One of the features of Tigrinya as a Semitic language is that it has two tenses (per-
fective and imperfective) but several aspects (causative, reflexive, reciprocal, and so
on) and the imperative/jussive mood [14]. Other tenses, such as the future tense, are
expressed by combining one of these tenses with the auxiliary verbs. This distinction
becomes useful when using syntactic information to understand the arrangement of
words for POS disambiguation. For example, in the phrase ዘሊሉ ሃደመ zelilu hademe
(he escaped by jumping), the gerundive zelilu describes the way the action of escaping
was performed and thus functions as an adverb instead of a verb [21].
The noun declension in Tigrinya happens for gender, number, case, and definite-
ness. However, noun declension does not follow a regular pattern almost 75% of the
time [14]. Similarly, declension of adjectives takes place for gender and number. In
general, the complexity arising from Tigrinya grammar leads to several ambiguities
related to all parts of speech.
There is also POS ambiguity arising from the writing system of Tigrinya. The
phenomenon of gemination creates ambiguity in the POS of words. This ambiguity
results from the absence in the Ge’ez alphabet of notation for consonant gemination, unlike the orthographies of European languages. For example, the
word ሰበረ sebere can mean (he broke) or it can be a type of a legume if “be” in sebere
is geminated. Furthermore, the widespread use of cliticized words may pose serious
problems in POS tagging because of the orthographic variation it creates. During
Tigrinya cliticization, certain unpronounced characters of a word are omitted and
replaced by apostrophes. For example, the compound word መምህር’ዩ memIhIrI’yu,
(he is a teacher), is a combination of the noun memIhIrI and the auxiliary verb Iyu.
However, during the fusion of these words, the first character of the auxiliary verb
is omitted by cliticization. The proposed system includes character recovery rules, and the corpus has been normalized to reduce these types of orthographic variation. Orthographic variations may aggravate the out-of-vocabulary problem
that often occurs with low-resource languages. However, the character recovery pro-
cess is not always straightforward because there are some combinations that require
more contextual or semantic knowledge to determine the proper missing character.
For example, the word ከመይ’ላ kemeyI’la could be resolved to either kemeyI ala (how is she) or kemeyI ila (how did she). Issues with the recovery of such ambiguous cases

will be resolved in the forthcoming enhancements of the corpus quality.

2.7 Chapter Summary


In Chapter 2, we introduced the Tigrinya language in the context of morphologi-
cal analysis and part-of-speech tagging. The language’s non-concatenative root-and-pattern morphology generates a large number of word forms, causing serious data sparsity challenges. On the other hand, a word in Tigrinya
embeds rich morphological information that may be exploited to enrich features of
NLP systems such as part-of-speech disambiguation.

Chapter 3

Resource Construction

3.1 Introduction
One of the main purposes of this research is constructing various corpora to support
the development of Tigrinya NLP. In earlier research, we constructed a POS-tagged corpus that was used for the first Tigrinya POS tagging experiments, with methods such as hidden Markov models (HMMs). The following sections briefly describe the additional resources. These resources include a Tigrinya text corpus,
a morphologically segmented corpus, and a parallel corpus. These resources are em-
ployed to investigate Tigrinya word embeddings, morphological segmentation, and
machine translation.

3.2 Creating Large-scale Text Corpus


A text corpus is a large collection of natural language text or speech transcriptions
constructed specifically for language processing tasks [22]. We collected news text
and religious text to be used for constructing word embeddings and lexicons and
performing basic corpus analyses such as frequency distribution, collocation, and n-
gram statistics. The raw text consists of data collected from the Tigrinya newspaper,
Haddas Ertra (shabait.com) and the Tigrinya Bible (geezexperience.com). The news-
paper, Haddas Ertra, is published six days a week, covering a wide range of topics,
including sports, law, education, health, literature, politics, economy, and other social
issues. We collected articles dated between March 2013 and August 2017. Following
the automatic extraction of the text data, both raw corpora were properly cleaned and

formatted using tools developed for preprocessing the corpus. First, the raw text was
formatted into a sentence-based format. Then, we normalized the corpus to unify
over 60 different styles of Tigrinya word cliticizations into a common format. At
this stage, the total raw text corpus consisted of 15.1 million tokens. This corpus can
be used to generate various vocabularies. For example, we generated a unique lexi-
con vocabulary that consists of over 593,000 tokens. We also analyzed the top 4000
frequent words in the corpus and created a list of stop words by filtering irrelevant
entries manually. This list was further enriched with the preposition, conjunction,
adverb, auxiliary verb, and pronoun lists from the NTC POS tagged corpus. More-
over, the corpus was used for constructing word embeddings as explained in section
3.6. For this reason, other versions of the corpus were prepared with further cleaning.
In the first version, only the stop words were removed; in the second version, all hapaxes, foreign scripts, digits, and punctuation were additionally removed. This cleaning process compressed the corpus to about 41.9% of its original size, roughly 6.3 million tokens. In
order to analyze the effect of removing stop words, we experimented with both the
original and the cleaned version of the corpus. Our findings are more meaningful and
consistent with the cleaned version of the corpus, and hence, we report those results
in section 3.6.

3.3 Nagaoka POS Tagged Tigrinya Corpus


The Nagaoka Tigrinya Corpus (NTC) is a POS tagged corpus created in our earlier
research as described in [2]. The raw text was collected from a newspaper called
Haddas Ertra which is published in Eritrea. The articles were selected at random
and cover a range of topics including health, education, agriculture, business, sports, social issues, history, culture, and literature. After cleaning and normaliza-
tion process, a 72,080-token corpus was manually annotated for POS tags. The POS
tagging research explained in Chapter 5 employs the NTC for the experiments.
A statistical summary of the NTC is given as follows:

Data source: Haddas Ertra newspaper
Selected articles: 100, from around 10 topics
Corpus size: 72,080 tokens
Sentences: 4,656 (avg. 15 tokens/sentence)
Unique words: 18,740 (26%)
All tags: 73
Token-type ratio: 3.85
Hapaxes: 12,510 (17%)
The NTC was originally tagged with 73 tags (referred to as tagset-1) and later simplified to 20 tags (referred to as tagset-2) due to tag distribution sparsity. Chapter 5 discusses the impact of both tagsets on the performance of the presented taggers. NTC has been
the impact of both tagsets on the performance of the presented taggers. NTC has been
released to the public in the hope of advancing NLP research for Tigrinya1 .
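Statistics of this kind are straightforward to compute. The sketch below derives token count, type count, token-type ratio, and hapax count from a toy token list; for the NTC, 72,080 / 18,740 ≈ 3.85, matching the reported ratio.

```python
from collections import Counter

def corpus_stats(tokens):
    """Type/token statistics of the kind summarized for the NTC."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "token_type_ratio": len(tokens) / len(counts),
        "hapaxes": hapaxes,
    }

stats = corpus_stats("the cat sat on the mat".split())
```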

3.4 Morphologically Segmented Corpus


There is no publicly available morphologically segmented resource for Tigrinya.
Therefore, we constructed the first morphologically segmented corpus for Tigrinya.
The current version of this corpus comprises over 45,000 tokens derived from 2,774 randomly selected sentences of the NTC.
For the purpose of boundary detection, we employed various character-based BIO
chunking schemes, which allow us to exploit character dependencies and alleviate
out-of-vocabulary (OOV) problems by reducing the morpheme vocabulary to about
60 Latin characters that cover the transliteration mapping we adopted.
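One simple character-based chunking scheme labels the first character of each morpheme B and every other character I. The sketch below shows this basic variant; the thesis experiments with several richer schemes.

```python
def to_bio(morphemes):
    """Character-level B/I labels for a token given its morphemes."""
    labels = []
    for m in morphemes:
        labels.append("B")          # first character of the morpheme
        labels.extend("I" * (len(m) - 1))  # remaining characters
    return labels

labels = to_bio(["HatetI", "ka"])  # labels for the token "HatetIka"
```

Because the model predicts one label per character, its output vocabulary is tiny and the OOV problem at the morpheme level disappears.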

3.5 English-Tigrinya Parallel Corpus


SMT systems are built from a large volume of source-to-target aligned sentences. The
Tigrinya Bible is available in different formats on a number of websites and mobile
applications2 . There are also plenty of sources for several editions of the English
Bible over the Internet.
We employed a version of the Tigrinya-English Bible available on geezexperi-
ence.com. The Bible text is arranged into sequentially numbered verses. A verse
in the Bible may contain one or more sentences. However, the parallel translations
1 NTC is available at http://eng.jnlp.org/yemane/ntigcorpus
2 http://bible.geezexperience.com/tigrigna/, https://www.betezion.com/bible.php

on geezexperience.com are not strictly aligned verse-to-verse. The excerpts in Figs. 3.1 and 3.2 show this inconsistent alignment. Verses 17 and 18 are merged in the
Tigrinya Bible but found separately in the English parallel text. The English version
lists all verses independently by assigning a single verse number. However, on the
Tigrinya side of the translation, there is a frequent combination of one or more con-
secutive verses into a single verse. It is difficult to identify the boundaries of these

Figure 3.1: Excerpt from the Tigrinya Bible illustrating the merging of verses (verses 17 and 18 are merged).

ኣምላኽ ከኣ ኽልተ ዓበይቲ ብርሃናት ገበረ፥ እቲ ዓብዪ ብርሃን ብመዓልቲ ኺስልጥን፡


እቲ ንእሽቶ ብርሃን ድማ ብለይቲ ኺስልጥን፡ ከዋኽብቲውን ገበረ።
ኣብ ልዕሊ ምድሪ ምእንቲ ኼብርሁ፡ ኣብ መዓልትን ኣብ ለይትን ድማ ኪስልጥኑ፡
ንብርሃንውን ካብ ጸልማት ኪፈልዩ፡ ኣምላኽ ኣብ ጠፈር ሰማይ ገበሮም። ኣምላኽ ድማ ጽቡቕ
ከም ዝዀነ ረኣየ።
ምሸት ኰነ ብጊሓትውን ኰነ፡ ራብዐይቲ መዓልቲ።

Figure 3.2: English excerpt from the Bible text corresponding to the Tigrinya verses in Fig. 3.1. The English version has independent verses (verses 17 and 18 are listed separately).

15 And let them be for lights in the arch of heaven to give light on the earth:
and it was so.
16 And God made the two great lights: the greater light to be the ruler of the
day, and the smaller light to be the ruler of the night: and he made the stars.
17 And God put them in the arch of heaven, to give light on the earth;
18 To have rule over the day and the night, and for a division between the light
and the dark: and God saw that it was good.

Figure 3.3: The English verses merged to match the combined verses in the
Tigrinya text when constructing the verse-aligned parallel corpus (verses 17 and
18 are merged).

15 And let them be for lights in the arch of heaven to give light on the earth:
and it was so.
16 And God made the two great lights: the greater light to be the ruler of the
day, and the smaller light to be the ruler of the night: and he made the stars.
17-18 And God put them in the arch of heaven, to give light on the earth; To
have rule over the day and the night, and for a division between the light and
the dark: and God saw that it was good.

combined verses automatically. Therefore, we corrected the verse alignment by joining
the English counterparts as well. For example, the corrected alignment in Fig. 3.3
merges the English verses 17 and 18 to align with the corresponding merged Tigrinya
verse.
Following this, the corpus was cleaned and tokenized. During the cleaning phase,
we retained a few types of punctuation in the Tigrinya corpus and transliterated Ge’ez
script to Latin script for better manipulation during segmentation. The English text
was also tokenized and lowercased to minimize data sparseness. However, the Tigrinya
corpus was not lowercased, since lowercase and uppercase letters represent distinct
syllables in the Tigrinya transliteration. After the preprocessing phase, the parallel
corpus contained 31,277 aligned verses. We divided this corpus into training, tuning,
and test sets by extracting verses at random for use with the Moses translation system.
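The random split described above can be sketched as follows. This is a minimal sketch: the split sizes and seed are assumptions for illustration, as the section does not state the exact proportions used with Moses.

```python
import random

def split_corpus(pairs, n_tune=1000, n_test=1000, seed=42):
    """Randomly partition aligned verse pairs into train/tune/test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)        # reproducible random extraction
    test = pairs[:n_test]
    tune = pairs[n_test:n_test + n_tune]
    train = pairs[n_test + n_tune:]
    return train, tune, test

# 31,277 aligned verses, here represented by placeholder pairs.
pairs = [(f"tig-{i}", f"eng-{i}") for i in range(31277)]
train, tune, test = split_corpus(pairs)
print(len(train), len(tune), len(test))  # 29277 1000 1000
```

Extracting whole verses (rather than individual sentences) keeps each Tigrinya line aligned with its English counterpart in every split.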

3.6 Creating and Analyzing Word Embeddings


Word embeddings can be used to derive multiple degrees of similarity between words;
hence, tasks such as word sense disambiguation (WSD), sentiment analysis, and
synonym acquisition benefit greatly from them. Furthermore, word vectors can serve
as features in syntactic NLP tasks such as POS tagging, parsing, and named entity
recognition (NER). We applied state-of-the-art predictive methods to the low-resource
Tigrinya language. Low-resource languages are constrained by the unavailability of
digital resources and language processing tools. When a resource is available, it of-
ten lacks linguistic information, which is quite expensive to annotate manually [1]. In
this regard, word representations can be tailored to support many NLP tasks for low-
resource languages with possible augmentation of semantic and syntactic information
merely from raw text. Therefore, we investigate the properties of word embeddings
for the morphologically rich and low-resource Tigrinya language [23].

3.6.1 Word Embeddings

Word embedding refers to the encoding of raw text into vectors of numbers that are
convenient for use by machine learning algorithms. These numerical representations
encode semantic and syntactic relations of words in a language. Word representation

methods may be broadly classified into frequency- or count-based methods, which
exploit the co-occurrence of words, and prediction-based approaches, which assign
probabilities to measure the degree of relatedness. An elaborate comparison of both
methods is given by [24]; their results demonstrate the effectiveness of predictive
models at many lexical semantic tasks under different settings. This research uses the
Tigrinya news text corpus of about 15 million tokens. However, due to very high mor-
phological productivity, hapaxes and rare words constitute a large portion of the
corpus. Semitic languages such as Tigrinya are characterized by a unique root and
template pattern morphology, which generates numerous inflections through prefix,
infix, and suffix affixation. We aim to investigate if some useful syntactic and se-
mantic intuitions can be derived from this rich Tigrinya morphology using word em-
beddings. The result may be useful for a wide range of NLP research, particularly
in low resource settings. We utilized word embeddings to improve the performance
of a Tigrinya part-of-speech tagger based on the NTC. The POS tagger had low
performance due to insufficient corpus size and a high inflection rate that caused a
data sparseness problem.

3.6.2 Related Works

Several architectures have been proposed for building word embeddings and using
them as features to improve NLP tasks. We mention a few relevant ones here. [25]
proposed a new distributed representation of words that processes very large datasets
with significantly lower computational cost. This work introduced two model archi-
tectures known as the continuous bag of words (CBOW) and the Skip-gram model.
They also demonstrated simple vector algebraic operations that capture semantic re-
lation of words in the embedding space. In their follow-up work, [25] introduced
the method of negative sampling to further augment computational efficiency as well
as the quality of vectors [26]. Later, [27] presented GloVe, an approach that forms
a word co-occurrence matrix using global corpus statistics. Several other studies
applied word embeddings as features to improve the POS tagging of many languages,
including a Semitic language, Arabic [28], [29]. [28] used a Gaussian hidden Markov

model (HMM) and CRF to show that careful initialization of models and regener-
ating word embeddings improve unsupervised POS induction. Interestingly, [29]
achieved near state-of-the-art results, using word embeddings as the sole features
with a neural network POS tagger. Evaluating the quality of word representations
is very important. [30] designed an evaluation framework that combines intrinsic
and extrinsic measurements with detailed analysis and crowdsourcing. Generally,
the CBOW method performed well in many of the designed tasks. In this research,
we report experiments using CBOW and Skip-gram embeddings based on Mikolov
et al.’s work [25], [26].

3.6.3 Method

The two algorithms we used for our analysis are CBOW and Skip-gram distributed
word representations [25]. Both models are simple neural network architectures with
input, projection (hidden) and output layers. Backpropagation with stochastic gra-
dient descent is used to learn weights. Their difference is that CBOW predicts the
target word from the contextual information whereas the Skip-gram model predicts
the surrounding words, given the target word. These methods are briefly explained
in the following subsections.

3.6.3.1 Continuous Bag of Words (CBOW)

In the CBOW model, the training objective is discovering word embeddings to predict
the target word given the context words. Consider V and N represent the vocabulary
size and the hidden layer size, respectively. Furthermore, x = (x_1, x_2, ..., x_V) is a
one-hot encoded input vector where x_k = 1 and x_{k'} = 0 for all k' ≠ k. The
weight matrix W between the input and the hidden layers is given by V × N matrix.
Specifically, each row of W is the N-dimensional vector vw of the relevant word w.
That is:

 
W_{V \times N} =
\begin{pmatrix}
  w_{11} & w_{12} & w_{13} & \cdots & w_{1N} \\
  w_{21} & w_{22} & w_{23} & \cdots & w_{2N} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  w_{V1} & w_{V2} & w_{V3} & \cdots & w_{VN}
\end{pmatrix} \quad (3.1)

and

x = (x_1, x_2, \ldots, x_V)^T \quad (3.2)

Given a word context, then

h = x^T W \quad (3.3)

Moreover, the weight matrix between the hidden and the output layer is given as the
N × V matrix W' = {w'_{ij}}. Consequently, the score u_j (the degree of match
between the context and the next word) for each word in the vocabulary is obtained
from the dot product between the output representation of the candidate word
(v'_{w_j}) and the hidden-layer representation of the context (h):

u_j = v'^T_{w_j} h \quad (3.4)

where v′wj is the j-th column of the matrix W′ . The posterior probability of words
is computed using the softmax classification model given by:

p(w_j | w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \quad (3.5)

where yj is the output of the j-th unit of the output layer. Substituting Equations
3.3 and 3.4 in Equation 3.5 we have:

p(w_j | w_I) = \frac{\exp(v'^T_{w_O} v_{w_I})}{\sum_{j'=1}^{V} \exp(v'^T_{w_{j'}} v_{w_I})} \quad (3.6)

where each word w is represented by two vectors: v_w, taken from the rows of W,
and v'_w, taken from the columns of W'.

3.6.3.2 Skip-gram Model

In the Skip-gram model, given a target word, the training objective is to discover word
representations that would help in predicting the surrounding words in the associated
context.
The Skip-gram model is given by:

p(w_{c,j} = w_{O,c} | w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \quad (3.7)

where wI is the input word, wO,c is the c-th word in the output context and wc,j is
the j-th word on the c-th panel of the output layer. Moreover, yc,j is the output of the
j-th word on the c-th panel of the output.

3.6.4 Experiments

3.6.4.1 Dataset and Settings

As mentioned earlier, the collected data in this experiment consists of news articles
and text from the Tigrinya Bible. Our reports are based on the cleaned version of the
data, which contains over 6.3 million tokens. We employed the word2vec tool³ for
the experiments. Furthermore, we varied the minimum word count among two,
four, and six, meaning that words occurring fewer times than the minimum count in
³ https://radimrehurek.com/gensim/models/word2vec.html

the corpus were not considered. We ran an extensive set of tuning experiments,
combining the following parameters, and report the optimal setting, i.e., the one
yielding the best similarity results. The settings include the following:
Algorithm = [CBOW, Skip-gram]
Window size = [2, 4, 6]
Dimension = [100, 200, 300]
Negative sampling = [0, 5, 10]
Hierarchical softmax = [0, 1]
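The grid above can be enumerated programmatically. The sketch below builds one configuration dictionary per combination; the gensim-style keyword names (sg, window, vector_size, negative, hs) are an assumption about how each setting would map onto the word2vec tool's parameters.

```python
from itertools import product

# Hyperparameter grid from the tuning experiments above.
grid = {
    "sg":          [0, 1],        # 0 = CBOW, 1 = Skip-gram
    "window":      [2, 4, 6],
    "vector_size": [100, 200, 300],
    "negative":    [0, 5, 10],
    "hs":          [0, 1],        # hierarchical softmax off/on
}

# One dict per combination, keyed by the parameter names above.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 108 candidate settings
```

Each entry could then be trained via, e.g., `gensim.models.Word2Vec(sentences, min_count=6, **cfg)` and scored on the similarity tests.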

3.6.4.2 Evaluation

Evaluations on word embeddings are usually conducted for intrinsic qualities and
extrinsic improvements [30]. Intrinsic evaluation is a direct measure of the word
vector’s quality, using similarity, analogy and matching measurements. On the other
hand, extrinsic evaluations measure the contribution of word embeddings in improv-
ing external NLP tasks, such as POS tagging and NER. We analyzed both intrinsic
and extrinsic evaluations and reported the models and parameters that gave the best
results in our experiments. The unknown words are first tagged using the CRF-based
tagger from Chapter 5 (referred to as CRF-morph here). We then report the reduction
in error rate by isolating the wrongly assigned unknown words and retagging through
majority voting based on word embeddings.

error rate = errors during retagging / total number of errors

3.6.4.3 Baseline

The performance of the CRF-morph was 90.89% overall and 80% on unknown words
using tagset-2 (the reduced tagset of 20 tags). This score will be used as the baseline. Our
results are generalized for the 12 major tags which are verbs, nouns, pronouns, ad-
jectives, adverbs, prepositions, conjunctions, interjections, numerals, foreign words,
punctuation, and unclassified tags.

3.6.5 Results

We conducted several experiments by tuning word2vec parameters and evaluating the


quality of word vectors by selecting test words for analogy, similarity, and categoriza-
tion evaluations. Due to the high morphological productivity of Tigrinya, we
expected the possibility of similarity responses appearing as inflected forms. There-
fore, we combined automatic and manual evaluations to adequately recognize their
subtle semantic and syntactic similarities. Table 3.1 shows some examples of the test
cases with their relatedness criteria, which are adopted in part from [25]. The first
part of the table shows the semantic relations. The boldfaced characters in the sec-
ond part represent the morphology captured by the relation. We observed that most
queries with frequently occurring words such as ኤርትራ ErItIra (Eritrea) returned ex-
act analogy, while queries with less frequent words often returned distant or incorrect
matches.
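The analogy queries described above follow the standard vector-offset method. Below is a minimal sketch using tiny hand-made 2-dimensional vectors as illustrative stand-ins for the trained Tigrinya embeddings.

```python
import numpy as np

def analogy(emb, a, b, c, topn=1):
    """Return the words d maximizing cosine(v_d, v_b - v_a + v_c),
    i.e., answering 'a is to b as c is to ?'."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    scores = {w: float(v @ q / np.linalg.norm(v))
              for w, v in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Hand-made toy vectors encoding a country-to-capital offset.
emb = {
    "ErItIra":  np.array([1.0, 0.0]),   # Eritrea
    "asImera":  np.array([1.0, 1.0]),   # Asmara
    "sudanI":   np.array([3.0, 0.0]),   # Sudan
    "karItumI": np.array([3.0, 1.0]),   # Khartoum
    "afIriqa":  np.array([0.0, 5.0]),   # Africa (distractor)
}
print(analogy(emb, "ErItIra", "asImera", "sudanI"))  # ['karItumI']
```

With real embeddings, the same arithmetic is what returns ኤርትራ-type exact analogies for frequent words and noisier matches for rare ones.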
The best performance was achieved by a Skip-gram model with hierarchical soft-
max, a dimension of 300, a window size of two, and minimum word count of six. We
analyzed with different settings of the contextual window size. We noticed that syn-
tactic and morphological relatedness improved with shorter context. For example, in
most of the experiments, analogy queries involving morphotactics such as:

ዝሑል zIHulI (cold) → ዝዘሓለ zIzeHale (coldest)
ውዑይ wIOuyI (hot) → ?,
returned matching responses when the window size was two. In contrast, the ac-
curacy of the “city-country” type of semantic relatedness improves with wider con-
text. Likewise, a majority of the responses for similarity queries of models with nar-
row context generated more morphological inflections, whereas wider context models
resulted in better semantic relationships. These findings of syntactic relatedness are
in line with the reports of [28] that applied word embeddings for POS induction and
NER.

3.6.6 Improving POS Tagging Using Word Embedding

The POS tagging research that utilizes morphological pattern features is described in
Chapter 5. This section explains how word embeddings can be utilized to augment

Table 3.1: Examples of responses to relatedness queries (model=skip-gram,
features=300, window=2, min. word=6). Bold-faced morphemes represent the
morphological transfer in the analogy. The underlined words are partially unrelated
to the criteria.

Semantic relatedness
Relatedness Word pair 1 Word pair 2
capital-country asImera ErItIra karItumI sudanI
Asmara Eritrea Khartoum Sudan
country-continent afIriqa ErItIra EwIroPa sIPaNa
Africa Eritrea Europe Spain
man-woman gWalI wedi sebeyIti sebIayI
girl boy woman man
food-country pasIta iTalIya mereQI amErika
pasta Italy soup America
position-player akefafali mesi aTIqaOi ronalIdo
midfielder Messi striker Ronaldo
opposite newiHI HaxirI xelimI xaOIda
tall short black white
Syntactic relatedness
plural suffix tatI zEga zEgatatI InIsIsa InIsIsatatI
citizen citizens animal animals
plural suffix atI hagerI hageratI OametI OametatI
nation nations year years
relativization zIHulI zIzeHale wIOuyI zIweOaye
cold coldest hot hottest
pronoun suffix afIliTu afIliTomI hibu hibomI
he made it known they made it known he gave they gave
negation circumfix feleTetI ayIfeleTetInI habetI ayIhabetInI
she knew she didn’t know she gave she didn’t give
passive voice feleTe tefelITe habu tewahIbe
he knew he was known *they gave he was given
prefix Ina ‘while’ belIOe InabelIOe Inanekeye InaweseKe
he ate while eating *while decreasing while increasing

Table 3.2: Examples of similarities by POS tagging (model=skip-gram,
features=300, window=2, min word=6)

tImIhIrIti = education feyIsIbukI = facebook kIlIte = two


tImIhIrItInI and education wikilikIsI wikileaks selesIte three
tImIhIrI she teaches tIwiterInI and twiter arIbaOIte four
bEtItImIhIrIti school fEsI face HamushIte five
mebaIta elementary tIwiterI twiter shewIOate seven
abIyatetImIhIrIti schools zerIgiHIwo He broadcasted shIdushIte six
tImIrIti education yahu yahoo shemonIte eight

the POS generalization skill of the models. Specifically, the analysis will focus on
tailoring embeddings to improve the accuracy of tagging unknown words.
Table 3.2 lists some typical examples of similarity queries ranked in descending
order of similarity score. Most of the results show semantic and syntactic related-
ness with the query word. For example, the responses to the word ትምህርቲ tImIhIrIti

Table 3.3: Categorization results; words in bold are the responses
(model=skip-gram, features=300, window=2, min word=6)

word1 word2 word3 word4


gazETa radIyo tElEviZInI zEna
newspaper radio television news
nIgusI pIreZidenItI nIgIsIti memIhIrI
king president queen teacher
tImali lomi xIbaHI semunI
yesterday today tomorrow week
TElI bIOIrayI begiOI anIbesa
goat ox sheep lion
zanIta filImI tewasIo temaharo
story film stage-drama students

(education) are all related to education, school, and learning. In addition to semantic
relatedness, these words also share the common morphological root “tmhrt”. Fur-
thermore, we observed a noticeable similarity between the query and the related re-
sponses in the POS category. In the previous example, except for the verb ትምህር
tImIhIrI (she teaches), the rest are all nouns. Similarly, the responses for the proper
noun ፈይስቡክ feyIsIbukI (Facebook) are mostly related proper nouns like ትዊተር
tIwiterI (Twitter). Interestingly, the responses for ክልተ kIlIte (two) are numbers ranked
in ascending order with the strongly related word being ሰለስተ selesIte (three). This
powerfully embedded relatedness can be leveraged to complement supervised and
unsupervised POS tagging. Moreover, we tested word embeddings for categoriza-
tion (sorting out the word that does not belong to a group of words). In Table 3.3, we
consider words at column “word4” as unrelated to the other three words in the same
row. The boldfaced entries represent the responses returned as unrelated, according
to the test. Looking at the table, the unrelated word is successfully isolated in all of
the instances except for the beginning row, which is quite a difficult case representing
a subtle semantic distinction. In this row, since radio, television, and newspaper are
all means of broadcasting news, the fourth word “news” should have been selected as
the unrelated word instead of the given response “radio”. In order to measure the im-
pact of this relatedness, we conducted an extrinsic evaluation using NTC corpus and

CRF-morph. In the NTC corpus that was used for CRF-morph, 10% of the sentences
(7460 words) were used for testing in cross-validated settings. 18.7% (1390 words)
of the test data comprises unknown words. For extrinsic evaluation, we focused on
the effect of word embeddings on improving the tagger’s performance for unknown
words. As mentioned earlier, since the related words often share a common POS tag
with the query word, we tagged all related words of the unknown word and applied
majority voting to propose a candidate tag from the related words. The threshold
we used for the related candidates was ten. The majority vote selects the most com-
mon tag or the second most common tag among the list of candidates. This approach
reduced the error rate of tagging unknown words by 50%. As a result, the overall
performance of the tagger improved by 1.25 percentage points. The size of our corpus
for word embedding was around 6.3 million words (after cleaning); inducing the tags
of unknown words using this relatively small text corpus revealed significant
improvements. The quality of word embeddings has been shown to improve with
the size of the corpus [25]. Therefore, given much larger data, we expect the tagging
performance on unknown words to improve further.
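The majority-voting retagging described above can be sketched as follows. ToyEmbeddings and the tag lookup are hypothetical stand-ins for the trained word2vec model (a gensim-style `most_similar` interface is assumed) and for the CRF-morph tagger's output on known words.

```python
from collections import Counter

def vote_tag(word, embeddings, tag_of, topn=10):
    """Propose a POS tag for an unknown word: collect the tags of its topn
    nearest neighbours in the embedding space and return the majority tag."""
    neighbours = embeddings.most_similar(word, topn=topn)  # gensim-style API
    tags = [tag_of(w) for w, _ in neighbours]
    return Counter(tags).most_common(1)[0][0]

# Toy stand-ins: a fake similarity model and a tag lookup replacing CRF-morph.
class ToyEmbeddings:
    def most_similar(self, word, topn=10):
        return [("tImIhIrItInI", 0.9), ("bEtItImIhIrIti", 0.8),
                ("tImIhIrI", 0.7)][:topn]

tags = {"tImIhIrItInI": "N", "bEtItImIhIrIti": "N", "tImIhIrI": "V"}
print(vote_tag("tImIhIrIti", ToyEmbeddings(), tags.get))  # N
```

Because related words usually share the query word's POS, the vote over the ten nearest neighbours recovers the correct tag for most wrongly tagged unknown words.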

3.6.7 Summary

In this section, we briefly explained the building of word embeddings for Tigrinya.
Several experiments were performed to obtain the optimal settings and parameters for
generating word vectors of useful quality. Generally, syntactic relatedness improved
with a shorter context, while better semantic relatedness was achieved with a wider
context. We obtained optimal syntactic and semantic relatedness with a Skip-gram
model having a context size of two, a minimum word count of six, and 300
dimensions. We also used the word embeddings to improve the POS tagging of
unknown words. Although the size of our text corpus was relatively small, the error
rate of tagging unknown words was reduced by half, yielding an overall improvement
in tagging performance. In the future, this result may be improved further by
enlarging the text corpus and tuning parameters.

3.7 Chapter Summary


In Chapter 3, we introduced the resources employed in and contributed by this research.
These include the large text corpora, the POS tagged Nagaoka Tigrinya Corpus, the
newly constructed morphologically segmented corpus, and the English-Tigrinya par-
allel corpus. We also discussed creating and analyzing word embeddings for Tigrinya
using intrinsic and extrinsic evaluations. The resources will be essential for conduct-
ing further research, including later stages of NLP such as chunking and syntactic
analysis. We have released the POS-tagged corpus and the CRF-based POS tagging
tool to the public.

Chapter 4

Methods of Tagging and Segmentation

4.1 Introduction
In this chapter, we describe the proposed methods of POS tagging and segmentation.
First, we attempt modeling tagging as a classification problem which follows an SVM
point-wise prediction to determine the category of words among n-classes. However,
the disambiguation of POS tags and morphological segments also depends on contex-
tual information. Therefore, we further employed CRF-based sequence labeling to
utilize the contextual and lexical features. Finally, we investigated the recent devel-
opments in deep learning and POS tagging with LSTM-based sequence-to-sequence
labeling. A brief description of SVM, CRF, and LSTM approaches is provided in the
following section.

4.2 Support Vector Machines (SVM)


We modeled POS tagging as multiclass classification using an SVM classifier in a
one-versus-the-rest (also known as one-vs-all) scheme. Originally, SVMs were
formulated for binary classification [8], [31]. However, multiclass classification in-
volves the assignment of data points into three or more classes. Several algorithms
are formulated to solve multiclass problems either by extending binary classification
or by transforming multiclass classification into a set of binary classifications that

are leveraged to solve multiclass tasks [32]. Some examples of the first approach
include naive Bayes classifiers [33], k-nearest neighbors [34], decision trees [35], and
neural networks [36]. SVMs can be extended to employ direct methods of multiclass
estimation as in [37] or decomposed into independent binary classifiers like in [38].
Furthermore, it was argued by [38] that provided sufficient tuning of the binary classi-
fiers, one-versus-the-rest decomposition provides a simple solution which performs
at least on par with other more complicated methods. Therefore, in this research,
we follow the one-versus-the-rest approach to address the POS task in a multiclass
setting.
The task of multiclass classification is to learn a function f from a training dataset
{(x_i, y_i)} for estimating the correct class C of a new data point x out of N
distinct classes, where N > 2. In the one-versus-the-rest scheme, one classifier is
trained for each class C, where the data points of that class are considered positive
instances (+1) and those of the remaining N − 1 classes are considered negative
instances (−1) (equation 4.1).



y = \begin{cases} +1 & \text{for } y_i = C \\ -1 & \text{for } y_i \neq C \end{cases} \quad (4.1)

During prediction, a data point x is assigned to the class that outputs the largest
value after running all N classifiers (equation 4.2).

y = \operatorname{argmax}_{n \in \{1, \ldots, N\}} f_n(x) \quad (4.2)
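A minimal numeric sketch of the one-versus-the-rest decision rule of equation 4.2, with toy linear scorers standing in for trained binary SVMs (the weights below are illustrative values, not learned parameters):

```python
import numpy as np

def ovr_predict(scores):
    """One-vs-rest decision (Eq. 4.2): pick the class whose binary
    classifier outputs the largest score."""
    return int(np.argmax(scores))

# Toy setup: N = 3 linear classifiers f_n(x) = w_n . x + b_n.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([0.2, 0.9])
scores = W @ x + b          # one score per class: [0.2, 0.9, -1.1]
print(ovr_predict(scores))  # 1 (class 1 has the largest score)
```

In practice the same rule is available out of the box, e.g., via scikit-learn's OneVsRestClassifier wrapped around a linear SVM.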

4.3 Conditional Random Fields (CRF)


Conditional random fields are probabilistic approaches capable of modeling context
dependent sequence labeling [9]. SVM and CRF approaches are among the methods
that have been successfully applied to develop state-of-the-art POS taggers¹. CRFs

¹ http://www.aclweb.org/aclwiki/

are preferred over other models because they offer improved sequence labeling capa-
bilities. Firstly, the strict Markovian assumption in hidden Markov models (HMMs)
is relaxed in CRFs. CRFs also solve the label bias problem that exists in maximum
entropy Markov models (MEMMs) by training to predict the whole sequence correctly
instead of predicting each label independently [9].
The model is trained to predict a sequence of output labels (e.g., IOB tags in
morphological segmentation and POS tags in POS tagging) y:

y = y0 , y1 , ..., yT (4.3)

from a sequence of features x,

x = x0 , x1 , ..., xT (4.4)

The training task is to maximize the log probability log(p(y|x)) of the valid label
sequence. The conditional probability is computed as:

p(y|x) = \frac{e^{score(x,y)}}{\sum_{y'} e^{score(x,y')}} \quad (4.5)

The equation for the scoring function score is given by:


score(x, y) = \sum_{i=0}^{T} A_{y_i, y_{i+1}} + \sum_{i=1}^{T} P_{i, y_i} \quad (4.6)

A_{y_i, y_{i+1}} denotes the transition score of moving from label y_i to label y_{i+1},
while P_{i, y_i} is the emission score of the y_i-th label for the i-th word.
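The scoring function of equation 4.6 can be computed directly. The sketch below uses toy transition and per-position score matrices (illustrative values, not learned parameters); during training, these scores feed the softmax of equation 4.5.

```python
import numpy as np

def crf_score(A, P, y):
    """Score of a label sequence y (Eq. 4.6): the sum of transition scores A
    between consecutive labels plus the per-position label scores P."""
    s = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))   # transitions
    s += sum(P[i, y[i]] for i in range(len(y)))             # emissions
    return s

A = np.array([[1.0, 0.0],          # toy transition scores between 2 labels
              [0.5, 1.0]])
P = np.array([[2.0, 0.1],          # toy per-word label scores (3 words)
              [0.2, 1.5],
              [1.0, 0.3]])
print(crf_score(A, P, [0, 1, 0]))  # 0.0 + 0.5 + 2.0 + 1.5 + 1.0 = 5.0
```

Decoding then amounts to finding the label sequence y maximizing this score, typically with the Viterbi algorithm.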

4.4 Long Short-Term Memory (LSTM)


Recurrent neural networks (RNNs) are neural networks with feedback connections,
trained with backpropagation through time to capture temporal dynamics. This
recurrent connection allows the network to employ the current inputs as well as previously

encountered states. Given the input vector Xt , (e.g., in the case of our segmentation,
five characters left and right of the target character), the hidden state ht at each time
step t, can be expressed by equation 4.8.

Xt = [xt−n , xt−(n−1) , ..., xt , ..., xt+(n−1) , xt+n ] (4.7)

ht = σh (Wxh xt + Whh ht−1 + bh ), (4.8)

where the W terms represent the weight matrices, b is a bias vector and σ is the acti-
vation function. However, due to the gradient vanishing (very small weight changes)
or exploding problems (very large weight changes), long-distance dependencies are
not properly propagated in RNNs. [10] introduced LSTMs to overcome this gradient
updating problem. The neurons of LSTMs, called memory blocks, are composed of
three gates (forget, input, and output) that control the flow of information and a mem-
ory cell with a self-recurrent connection. The formulae of these components are
given in equations 4.9 to 4.14. In these equations, the input, forget, output, and cell
activation vectors are denoted by i, f, o, and c respectively; σ and g represent sig-
moid and tanh functions respectively; ⊙ operation is the Hadamard (element-wise)
product; W stands for the weight matrices and b is the bias.

it = σ(Wix xt + Wih ht−1 + bi ), (4.9)

ft = σ(Wf x xt + Wf h ht−1 + bf ), (4.10)

cn = g(Wcx xt + Wch ht−1 + bc ), (4.11)

ct = ft ⊙ ct−1 + it ⊙ cn , (4.12)

ot = σ(Wox xt + Woh ht−1 + bo ), (4.13)

ht = ot ⊙ g(ct ) (4.14)

The LSTM architecture we proposed to use for segmentation is illustrated in Fig.


4.1. Character-level features (embeddings) generated by the embedding layer are
fed to the forward LSTM network. Dropout layers are adjusted before and after the

LSTM layer to regularize the model and prevent overfitting. The output of the net-
work is then processed by a fully connected (Dense) layer and finally tag probabil-
ity distributions over all candidates are computed via the softmax classifier. Due to
the capacity to work well with long-distance dependencies, LSTMs have achieved
state-of-the-art performance in window based approaches as well as sequence label-
ing tasks including morphological segmentation [39], part-of-speech [40] and named
entity recognition [41].

Figure 4.1: Fixed-size window based LSTM network for morphological segmentation
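A single LSTM time step following equations 4.9 to 4.14 can be sketched in NumPy. The randomly initialized toy weights below are placeholders; in the actual model the parameters are learned end-to-end inside the network of Fig. 4.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step implementing Eqs. 4.9-4.14."""
    i = sigmoid(p["Wix"] @ x + p["Wih"] @ h_prev + p["bi"])   # input gate (4.9)
    f = sigmoid(p["Wfx"] @ x + p["Wfh"] @ h_prev + p["bf"])   # forget gate (4.10)
    cn = np.tanh(p["Wcx"] @ x + p["Wch"] @ h_prev + p["bc"])  # candidate cell (4.11)
    c = f * c_prev + i * cn                                   # cell update (4.12)
    o = sigmoid(p["Wox"] @ x + p["Woh"] @ h_prev + p["bo"])   # output gate (4.13)
    h = o * np.tanh(c)                                        # hidden state (4.14)
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 3, 4                                              # toy dimensions
p = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.endswith("x") else d_h))
     for k in ["Wix", "Wih", "Wfx", "Wfh", "Wcx", "Wch", "Wox", "Woh"]}
p.update({k: np.zeros(d_h) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
assert h.shape == (d_h,) and c.shape == (d_h,)
```

The Hadamard products in the cell update are what let gradients flow through c across many time steps, avoiding the vanishing-gradient problem of plain RNNs.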

4.5 Bidirectional Long Short-Term Memory (BiLSTM)


LSTMs encode the input representation in the forward pass. In Bidirectional LSTM
network (BiLSTM), the input vector is processed in both forward and backward
passes and presented separately to hidden states. The mechanism is useful in en-
coding past (forward) and future (backward) information of the input vector. We
illustrate the architecture of the BiLSTM we propose to use in Fig. 4.2. Both the for-
ward and backward layer outputs are calculated using the standard LSTM updating
equations 4.9 to 4.14. Accordingly, at every time step t, the forward and backward
networks take the layer input xt and output ht, with each cell computing its candidate
state cn, cell state ct, and output gate ot while training and updating parameters. As
in LSTMs, the output of each network in a BiLSTM can be expressed as:

Figure 4.2: Fixed-size window based BiLSTM network for morphological segmentation

yt = σy (Why ht + by ), (4.15)

where W is the weight matrix, b is the bias vector of the output layer, and σ is the
activation function of the output layer. Both the forward and backward networks
operate on the input state x_t and generate the forward output \overrightarrow{h}_t
and the backward output \overleftarrow{h}_t. The final output y_t from both
networks is then combined using operations such as multiplication, summation, or
simple concatenation, as given in equation 4.16. In our case, σ is a concatenating
function.

y_t = σ(\overrightarrow{h}_t, \overleftarrow{h}_t) \quad (4.16)

This context-dependent representation from both networks is passed along to the


fully-connected hidden layer (Dense) which propagates it to the output layer.
We employed Sequence-to-sequence (seq2seq) LSTM and BiLSTM networks for
POS tagging. Seq2seq networks are a powerful class of models that take in a sequence
of inputs and output the corresponding sequence of outputs. Generally, the method

Figure 4.3: Sequence-to-sequence BiLSTM neural network for POS tagging

encodes the source sequence to a vector of a fixed size, and then decodes the target
sequence from the vector [42]. In sequence-to-sequence POS tagging, the input is
the sequence of words in a sentence, and the POS tag for each word forms the target
sequence.
In the POS tagging based on SVM and CRF, features were designed to capture the
morphological patterns of Tigrinya words and the syntactic structure of surrounding
context. In contrast, in the seq2seq approach, we used automated feature engineering
to capture similar features with a vector representation of words (embeddings) and
achieved a slightly better performance. The architecture of the BiLSTM seq2seq
model is depicted in Fig. 4.3.
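The bidirectional combination of equation 4.16 can be sketched as follows. A simple tanh update stands in for the full LSTM step, and concatenation merges the two directions, as in our architecture; the step function and dimensions are toy assumptions.

```python
import numpy as np

def bilstm_outputs(xs, step_fwd, step_bwd, d_h):
    """Run forward and backward recurrent passes over a sequence and
    concatenate the two hidden states at each time step (Eq. 4.16)."""
    h = np.zeros(d_h); hs_f = []
    for x in xs:                       # forward pass (past context)
        h = step_fwd(x, h); hs_f.append(h)
    h = np.zeros(d_h); hs_b = []
    for x in reversed(xs):             # backward pass (future context)
        h = step_bwd(x, h); hs_b.append(h)
    hs_b.reverse()                     # realign with the input order
    return [np.concatenate([f, b]) for f, b in zip(hs_f, hs_b)]

# A toy recurrent step standing in for the LSTM updates.
step = lambda x, h: np.tanh(0.5 * x + 0.5 * h)
xs = [np.ones(2), np.zeros(2), -np.ones(2)]    # a 3-step toy input sequence
ys = bilstm_outputs(xs, step, step, d_h=2)
assert len(ys) == 3 and ys[0].shape == (4,)    # one concatenated vector per step
```

In the seq2seq tagger, each concatenated vector y_t is passed through the Dense layer and a softmax to predict the POS tag of the t-th word.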

4.6 Chapter Summary


In Chapter 4, we briefly explained the three supervised approaches employed in the
research. The first approach is SVM-based classification, which does not consider
contextual information as features for classification. In the second approach, we
adopted CRF-based sequence labeling which naturally takes into account label de-
pendencies as well as arbitrary features. Finally, we looked into LSTMs, which pro-
vide a powerful method of handling long-term dependencies and automated feature
engineering using distributed word representations.

Chapter 5

Tigrinya POS Tagging

5.1 Introduction
As briefly explained in the introductory chapter, the POS tagging process refers to
determining the POS tag of a word according to the context of the sentence. POS
tagging is one of the early phases in natural language processing. The later stages of
NLP, including phrase identification, named entity recognition, and syntactic pars-
ing, all require POS-tagged data as input. In addition, POS tagging has proven useful
in tasks such as statistical machine translation, information extraction, and summa-
rization. This chapter details the related works, experiments, and results achieved
in POS tagging of Tigrinya based on SVM, CRF, and LSTM approaches. The
extraction of morphological patterns and their effect in boosting performance are
discussed. We also experimented with LSTMs and word embeddings, forgoing manual
feature engineering, and achieved state-of-the-art results for Tigrinya.

5.2 Related Works


This section reviews previous POS-tagging work on Semitic languages, the language
branch to which Tigrinya belongs. There are two approaches to POS tagging for
Semitic languages. The first focuses on the discrete nature of words without any
morpheme-level segmentation, while the second follows a decomposition strategy
and works on segments or morphemes of words. Therefore, two trajectories prevail
in the tagging of Semitic languages that
lead to either a word with complex POS tags or a sequence of segments with sim-
ple POS tags. The latter requires choosing among multiple ambiguous segments
[43]. The first POS-tagging research for the Arabic language was performed with
a corpus containing 50,000 tagged words collected from a newspaper [44]. Follow-
ing this, several research studies for the Arabic language emerged over the decades,
and currently, Arabic corpora with millions of words are available. For example,
the Arabic Newswire part-1 comprising 76 million tokens and over 666,094 unique
words [45] and the Arabic Treebank containing 1 million words [46] were compiled
at the University of Pennsylvania. Methodologically, the majority of recent stud-
ies have applied segmentation-based techniques for Arabic POS tagging [47]–[49].
An SVM-based POS tagger that performs morphological segmentation of words fol-
lowed by POS tagging was reported by [47]. Interestingly, a comparative study
of segmentation-based and segmentation-free methods by [50] reports better results
without segmentation (94.74% vs. 93.47%). In contrast, [43] examined the prob-
lem of word tokenization for Hebrew POS tagging and argued that segment-level
tagging is better suited for the POS tagging of Semitic languages in general, as it
suffers less from data sparseness. A recent work investigated modeling POS tagging
as an optimization problem using genetic algorithm [51]. POS-tagging research for
Amharic languages was mostly driven by a 210K-word news corpus that was tagged
with 30 POS tags [52]. Using this corpus, [53] reported various experiments apply-
ing Trigram‘n’Tags (TnT), SVM, and maximum entropy algorithms. Furthermore,
[54] identified inflection and derivation patterns of Amharic words as features and
improved tagging accuracy up to 90.95% using CRFs.
In section 5.4, we further performed separate experiments to explore recent ad-
vances in deep learning on sequence labeling. Specifically, we were interested in
BiLSTM with word embeddings for automatic encoding of the latent linguistic fea-
tures following the work by [40]. That work investigated BiLSTM-based POS
tagging with word, character, and Unicode byte embeddings on 22 languages, includ-
ing Arabic and Hebrew. The authors also analyzed data size and label noise variations
and reported state-of-the-art accuracy across several languages. Interestingly, in contrast
to other language families, the BiLSTM approach in the case of Semitic languages
outperformed the TnT and CRF-based tagging on the first 1000 sentences, which
may indicate the preference of BiLSTMs for a low-resource scenario in these lan-
guages. Earlier, [55] proposed a task-independent unified sequence tagging model
based on BiLSTMs and achieved results that were comparable to the state-of-the-art
POS tagging, Chunking, and NER in English. The method of combining character
representations to form word embeddings improved sequence tagging particularly
with morphologically complex languages such as Turkish [56]. Their method is dif-
ferent from traditional word embeddings, which consider embeddings at word level.
In a similar work, [57] derived sparse embedding features from continuous (dense)
word embeddings and proposed a general sequence labeling framework that achieved
reasonable results for more than 40 treebanks in POS tagging.
With regard to Tigrinya, NLP research has so far made little progress; the main
challenge has been the absence of publicly available tagged or untagged corpora.
However, a notable achievement in the morphological analysis
and generation of Tigrinya, Amharic, and Oromo has been reported by [20]. This
work is discussed in more detail in the related works for Tigrinya morphological
segmentation (section 6.2). Other related works include a hybrid stemmer [58], some
electronic dictionaries,¹ input method editors,² a large Tigrinya lexicon from the
Crúbadán project,³ and a web-crawling project that provides concordance access to
Tigrinya corpora among other languages of the Horn of Africa.⁴
In the following sections, we present the experiments and results. The first ex-
periment discusses the SVM- and CRF-based results, and the second experiment focuses
on BiLSTM-based sequence-to-sequence labeling.
¹ hdrimedia.com/, memhr.org/dic/, geezexperience.com/dictionary/
² keyman.com/tigrigna/, geezlab.co/
³ crubadan.org/languages/ti
⁴ habit-project.eu/wiki/CorporaAndCorpusBuilding

5.3 Experiment: POS Tagging Based on SVM and CRF with Morphological Patterns

5.3.1 Extracting Morphological Patterns

In English, 98% of words with the suffix “able” were found to be adjectives in the
Wall Street Journal portion of the Penn Treebank [59]. This illustrates the
possibility of exploiting suffix rules for predicting POS. Morphological features may
be extracted statistically or hand-picked by experts. It has been shown that a few lin-
guistically motivated suffixes can improve generalization in English [60].
On the other hand, [61] implemented a maximum entropy Markov model to automati-
cally extract morphological features, improving the accuracy of tagging unknown
words from 61% to 80% on the Chinese Treebank 5.0. Unknown words are test
words that are not seen during training. As mentioned previously, Tigrinya words
possess rich morphological information embedded in the form of prefixes, infixes,
and suffixes. Tigrinya verbs use suffixes to mark the conjugation of personal pro-
nouns. For example, the perfective verb pattern CeCeCe is conjugated by inflecting
the stem CeCeC with one of the following personal pronoun suffixes (C represents
any Tigrinya consonant):

CeCeC + (e | etI | a | u | ka | ki | kumI | kInI | na | ku)

For instance, the verb sebere can be decomposed into seber + e, which translates
to “he broke”. Likewise, the pattern
of the stems for the remaining verbs in the sebere family are all distinct. Specifically,
the pattern of the imperfective stem is -CeCC- (-sebr-), the imperative stem pattern
is -CCeC (-sber), and the gerundive stem has a CeCiC- (sebir-) pattern. For further
illustration, some of the most informative patterns based on a frequency distribution
of gerundive verbs (V_GER) are listed in Table 5.1. The V_GER proportion in the
second column gives the percentage of a particular V_GER pattern compared to a
total of 533 extracted V_GER patterns (including inflected patterns). The pattern
CeCiCu is the most dominant gerundive stem in the corpus (31.1%), with the suffix
“+u” indicating masculine, singular, and third-person attributes. In addition to the
pronoun suffix that is present in all verbs, stems with prefixes such as “te*” and “a*”

Table 5.1: Some patterns of gerundive verbs (V_GER)

V_GER patterns V_GER(%) Examples


CeCiCu 31.1 feliTu (he knew)
CeCeCiCu 17.8 tefeliTu (it was known)
CeCaCiCu 15.9 tefaliTu (to know each other)
CeCiCICa 14.6 feliTIka (you knew)
aCICiCu 11.8 afIliTu (made something known)
CeCiCoCI 11.3 feliTomI (they knew)
CeCiCa 10.5 feliTa (she knew)

are also frequently found in the corpus. The statistics reveal that 17.8% of the gerundive
verb inflections are prefixed with “te*,” which forms passive voice stems from
Tigrinya perfective and imperfective verbs. Similarly, the pattern of the causative
stem aCICeCe, which is prefixed by the morpheme “a*,” is retrieved as part of the
most informative patterns of gerundive verbs. These types of morphological clues
are very important in correctly predicting the POS of Tigrinya words.
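Such CV patterns can be derived mechanically from the romanization. The sketch below is an assumption about the transliteration scheme implied by Table 5.1, where a, e, i, o, u, and I (the vowel ɨ) are vowels and every other symbol counts as a consonant:

```python
# Sketch of consonant-vowel (CV) pattern extraction from romanized Tigrinya.
# Assumption: vowels are a, e, i, o, u, and I (for the vowel ɨ) in this
# transliteration; every other character is a consonant and maps to "C".
VOWELS = set("aeiouI")

def cv_pattern(word):
    """Replace each consonant with 'C'; vowels are kept verbatim."""
    return "".join(ch if ch in VOWELS else "C" for ch in word)

# Gerundive examples from Table 5.1:
print(cv_pattern("feliTu"))    # CeCiCu
print(cv_pattern("tefaliTu"))  # CeCaCiCu
print(cv_pattern("afIliTu"))   # aCICiCu
```

The same function reproduces the inflected patterns as well, e.g. feliTIka yields CeCiCICa.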
Experiments, optimizations, and evaluations were carried out using the scikit-learn
machine learning tools [62]. For CRF, the sklearn-crfsuite wrapper was used, which
is available at http://sklearn-crfsuite.readthedocs.io/en/latest/. The data and settings
are explained in the following sections.

5.3.2 Datasets and Tagsets

The NTC corpus with 72,080 tokens was used for training and testing. The experi-
ments were performed with stratified 10-fold cross-validation (CV), in which 90% of
the data was used for training and the remaining 10% for testing in each fold. A CV
setup is particularly useful in a low-resource scenario where the data for training and
testing is limited. The stratified CV creates balanced folds by distributing approx-
imately the right proportion of tags into each of the 10 folds, so each fold
may be regarded as representative of the whole data. Furthermore, we experimented
with two versions of the corpus based on tagset-1 (the full tagset containing 73 tags)
and tagset-2 (the reduced tagset, containing 20 tags). The full list of the tagset is
presented in Table 5.2. On average, about 81% of the test data were known words
and the remaining 19% were unknown words.
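A minimal sketch of the 10-fold rotation is given below. Note that the thesis uses a stratified scheme that balances tag proportions per fold, whereas this round-robin split over sentences only approximates that behavior:

```python
# Sketch of a 10-fold cross-validation loop over sentences.  The thesis uses a
# *stratified* scheme balancing tag proportions per fold; this round-robin
# assignment is only an approximation of that setup.
def ten_fold(sentences, k=10):
    """Yield (train, test) sentence lists for each of the k folds."""
    folds = [sentences[i::k] for i in range(k)]  # round-robin fold assignment
    for t in range(k):
        train = [s for j in range(k) if j != t for s in folds[j]]
        yield train, folds[t]

sents = [f"sent{i}" for i in range(25)]
splits = list(ten_fold(sents))
print(len(splits))   # 10
```

Each of the 10 iterations trains on roughly 90% of the sentences and tests on the remaining 10%, and every sentence appears in exactly one test fold.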

5.3.3 Features

The full list of the proposed features includes a rich set of contextual and lexical
features. The contextual features span a window of two words and two POS tags pre-
ceding the target word, the target word, and two words succeeding the target word.

Table 5.2: The full tagset: labels ending with _C, _P, _PC indicate
clitics of Conjunction, Preposition, or both respectively. For example,
N_C refers to a NOUN (N) with attached CONJUNCTION (C).

POS category Label


Noun N, N_C, N_P, N_PC
Verbal N_V, N_V_C, N_V_P, N_V_PC
Proper N_PRP, N_PRP_C, N_PRP_P, N_PRP_PC
Pronoun PRO, PRO_C, PRO_P, PRO_PC
Verb V, V_C, V_P, V_PC
Perfective V_PRF, V_PRF_C, V_PRF_P, V_PRF_PC
Imperfective V_IMF, V_IMF_C, V_IMF_P, V_IMF_PC
Imperative V_IMV, V_IMV_C, V_IMV_P, V_IMV_PC
Gerundive V_GER, V_GER_C, V_GER_P, V_GER_PC
Relative V_REL, V_REL_C, V_REL_P, V_REL_PC
Auxiliary V_AUX
Adjective ADJ, ADJ_C, ADJ_P, ADJ_PC
Adverb ADV, ADV_C, ADV_P, ADV_PC
Preposition PRE, PRE_C, PRE_P, PRE_PC
Conjunction CON, CON_C, CON_P, CON_PC
Interjection INT, INT_C, INT_P, INT_PC
Numeral NUM
Cardinal NUM_CD, NUM_CD_C, NUM_CD_P, NUM_CD_PC
Ordinal NUM_OR, NUM_OR_C, NUM_OR_P, NUM_OR_PC
Punctuation PUN
Foreign Word FW
Unclassified UNC

Additionally, lexical features are extracted from the target word’s affixes, compris-
ing prefixes of one to six characters, consonant-vowel patterns of the word (infixes),
and suffixes of one to five characters. These lengths were chosen based on the best
results of a grid search over character combinations.
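These features can be sketched as one dictionary per token, which is the input format used by sklearn-crfsuite (for the SVM, such dictionaries would be vectorized, e.g. one-hot encoded). The feature names, the padding token, and the cv_pattern helper are illustrative assumptions; the preceding-POS-tag features are omitted here, since they come from previous predictions (SVM) or from label dependencies (CRF):

```python
# Illustrative feature extractor for one token position; feature names and the
# "<PAD>" token are assumptions, not the thesis's exact implementation.
VOWELS = set("aeiouI")

def cv_pattern(word):
    return "".join(ch if ch in VOWELS else "C" for ch in word)

def token_features(sent, i):
    """Contextual + lexical features for the i-th word of a sentence."""
    w = sent[i]
    feats = {"word": w, "ptn": cv_pattern(w)}
    for n in range(1, 7):          # prefixes of 1-6 characters
        feats[f"pref{n}"] = w[:n]
    for n in range(1, 6):          # suffixes of 1-5 characters
        feats[f"suf{n}"] = w[-n:]
    for off in (-2, -1, 1, 2):     # window of two words on each side
        j = i + off
        feats[f"word{off:+d}"] = sent[j] if 0 <= j < len(sent) else "<PAD>"
    return feats

f = token_features(["Iti", "meSIHafI", "tereKibu"], 1)
print(f["ptn"], f["pref2"], f["word-1"])
```

For short words the prefix and suffix slices simply return the whole word, which mirrors how character n-gram features degenerate on short tokens.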

5.3.4 Hyperparameter Optimization

A grid search was applied to find the optimal combination of hyperparameter values
for the SVM model: the penalty parameter C, the regularization type, and the kernel
function. Accordingly, the best estimate was obtained with C=1, L2 regularization,
and the linear kernel. Similarly, a randomized search was applied to optimize the
CRF-based experiments trained with the L-BFGS algorithm. The best estimate was
found at C1=0.5978 and C2=0.1598.

5.3.5 Evaluation

The overall accuracy is calculated from the test data by computing the percentage of
correctly predicted tags relative to the true annotations, as given by equation 5.1.

    Accuracy = (Number of correctly tagged tokens) / (Total number of tokens)    (5.1)

In addition, standard metrics of precision (P), recall (R), and F1-score (F1) were
utilized to assess the system’s performance on individual POS tags (equations 5.2,
5.3, and 5.4). Finally, in section 5.3.7.5, we analyze errors to discuss the strengths and
weaknesses of the taggers particularly on unknown words and some representative
tags.

    P = (Correctly predicted POS tags) / (System-predicted POS tags)    (5.2)

    R = (Correctly predicted POS tags) / (True POS tags)    (5.3)

    F1 = 2 · (P · R) / (P + R)    (5.4)
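As a concrete check of equations 5.1 to 5.4, the following sketch computes accuracy and tag-wise precision, recall, and F1 from flat lists of gold and predicted tags (the toy data is hypothetical, not NTC results):

```python
# Sketch of the evaluation metrics in equations 5.1-5.4, computed from flat
# lists of gold and predicted tags (toy data, not the NTC corpus).
def evaluate(gold, pred):
    """Return overall accuracy and a {tag: (P, R, F1)} dictionary."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    scores = {}
    for tag in set(gold) | set(pred):
        tp = sum(g == p == tag for g, p in zip(gold, pred))
        predicted = sum(p == tag for p in pred)   # system-predicted tag count
        actual = sum(g == tag for g in gold)      # true tag count (support)
        P = tp / predicted if predicted else 0.0
        R = tp / actual if actual else 0.0
        F1 = 2 * P * R / (P + R) if P + R else 0.0
        scores[tag] = (P, R, F1)
    return accuracy, scores

gold = ["N", "V", "N", "ADJ", "N"]
pred = ["N", "V", "ADJ", "ADJ", "N"]
acc, per_tag = evaluate(gold, pred)
print(acc)   # 0.8
```

For the toy data, “N” gets perfect precision but imperfect recall (one noun was mistagged as ADJ), illustrating why both measures are reported per tag in Table 5.7.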

5.3.6 Baseline

With regard to the baseline, only the current word was considered as a feature, with-
out exposing left and right context, and unknown words were tagged as nouns. In this
setting, the SVM tagger on tagset-1 yielded a baseline accuracy of 74.05%, and the
CRF tagger performed slightly better at 76.15%, as presented in Table 5.3. All the
other results obtained using more contextual information outperform these baselines.
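This word-only setting is closely approximated by a most-frequent-tag lexicon lookup. The sketch below (with hypothetical toy data) illustrates that behavior, although the thesis's actual baselines are SVM/CRF models with the word as the sole feature:

```python
# Sketch of a word-only baseline: tag each known word with its most frequent
# training tag, and every unknown word as a noun ("N").  This lookup only
# approximates the SVM/CRF baselines described in the text.
from collections import Counter, defaultdict

def train_baseline(tagged_tokens):
    """Build a word -> most-frequent-training-tag lexicon."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_words(words, lexicon):
    """Known words get their lexicon tag; unknown words default to 'N'."""
    return [lexicon.get(w, "N") for w in words]

# Hypothetical toy training data, not taken from the NTC.
lex = train_baseline([("sebere", "V_PRF"), ("Iti", "PRO"), ("sebere", "V_PRF")])
print(tag_words(["sebere", "qalat"], lex))   # ['V_PRF', 'N']
```

Since nouns are the most frequent class in the NTC (about 33% of tokens), defaulting unknown words to “N” is the strongest single-tag fallback.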

5.3.7 Results

In general, the CRF tagger slightly outperforms the SVM-based tagger. Table 5.3
shows the overall performance of the taggers for tagset-1 and tagset-2, varying six
configurations of affix features. Accordingly, the best scores achieved were 90.89%
and 89.92% for the CRF and SVM taggers, respectively, on tagset-2. This dif-
ference of about 1 percentage point is statistically significant (p = 0.002 < 0.05). POS
tagging is a sequence-labeling task; therefore, the context-aware CRF-based tagger has
the advantage of utilizing relations in the surrounding context. This may be the reason
why the CRF-based tagger tends to outperform the point-wise SVM-based tagger in
most of the experiments. The consonant-vowel pattern features (ptn), which are unique
to Tigrinya (and other Semitic languages), proved useful in the disambiguation of the
four types of verbs (perfective, imperfective, imperative, and gerundive), nouns, and
adjectives. According to Table 5.3, in both the CRF and SVM experiments, the pat-
tern features boost performance by more than 5.5 percentage points compared to the
baseline. The effects of the tagset design, the feature design, and the data size are
detailed in the following sections.

Table 5.3: Overall performance for the data with tagset-1 and tagset-2.
Context = word-2, pos-2, word-1, pos-1, word, word+1, word+2;
ptn = consonant-vowel pattern; pref = prefix; suf = suffix;
all = context + all affixes

Features           CRF accuracy (%)        SVM accuracy (%)
                   Tagset-1   Tagset-2     Tagset-1   Tagset-2
all                90.37      90.89        89.12      89.92
context + pref     86.50      89.04        85.56      88.38
context + suf      82.94      84.67        82.56      84.41
context + ptn      81.92      83.45        82.56      84.05
context            77.16      79.25        78.18      80.03
word (baseline)    76.15      75.15        74.05      76.43

5.3.7.1 The Effect of the Tagset Design

The rich inflectional and derivational morphology of Tigrinya generates a large num-
ber of words with various grammatical information including gender, number, case,
and so on. In addition, there are also clitics of prepositions and conjunctions affixed
to words. As mentioned earlier, the complete list of the tags (tagset-1) used in NTC
corpus has 73 tags denoting three types of information, which are the major POS cate-
gory (level-1), some sub-categories (level-2), and clitics (level-3). The distribution of
these tags in the corpus is not balanced. For example, nouns alone constituted about
33% of the words in the corpus while many other tags were rarely found. Specifi-
cally, seven tags were not assigned to words in the corpus, and the other 19 tags were
rarely present, each appearing less than 10 times in the corpus. Most of these rare tags
include the level-3 tags, which represent words that are cliticized with a preposition,
a conjunction, or both. Therefore, in order to reduce data sparseness, another set of
tags was designed by omitting the level-3 annotations. Consequently, the set of re-
maining tags with level-1 and level-2 information was reduced to 20, forming tagset-2.
In comparison to tagset-1, the tagset-2 data showed a better distribution
of tags, as every tag occurred more than 100 times. We investigated the impact
of the tagset design as reported in Table 5.3. In both the CRF and SVM methods, the
tagset-2 results slightly outperform that of tagset-1. The cross-validations of both
tagsets for the SVM-based tagger with “all” features were subjected to a significance

Table 5.4: The effect of morphological features on known (kno.) and unknown (unk.)
word tagging based on accuracy results (%) for tagset-1.

Features            CRF                SVM
                    Kno.     Unk.      Kno.     Unk.
all 95.42 80.22 94.18 76.68
Context+pref+suf 95.32 79.93 94.2 75.35
Context+pref+ptn 94.45 71.44 93.91 68.49
Context+suf+ptn 92.85 60.79 93.93 57.57
Context+pref 93.21 66.98 93.35 64.43
Context+suf 91.88 56.04 92.83 53.12
Context+ptn 91.73 53.81 93.06 51.56
Context 89.04 39.21 91.75 38.38

Table 5.5: Morphological features and the error rate (%) of unknown
words for selected POS tags in CRF-based experiment.

Features N N_V V_REL V_IMF V_AUX ADJ ADV


all 11.72 2.50 7.34 10.68 25.00 16.44 42.86
Context+pref+suf 10.88 5.00 5.50 11.17 37.50 17.81 42.86
Context+pref+ptn 12.55 10.00 5.50 13.59 25.00 19.18 64.29
Context+suf+ptn 16.74 47.50 42.20 36.41 12.50 21.92 71.43
Context+pref 16.32 7.50 6.42 13.59 25.00 26.03 57.14
Context+suf 17.15 77.50 47.71 34.47 12.50 23.97 85.71
Context+ptn 21.34 37.50 44.04 34.95 25.00 21.92 78.57
Context 30.96 85.00 57.80 49.03 12.50 28.77 100.00

test, revealing a statistically significant difference (p-value = 0.001 < 0.05).

5.3.7.2 The Effect of the Prefix and Suffix Features

Considering the CRF tagset-1 classifier and the features context+suf and context+pref,
the overall improvement over the bare context features was 5.78 and 9.34 percentage
points, respectively (Table 5.4). The impact of these features is even more visible
upon detailed analysis of
their performance with regard to unknown words. The result is based on a held-out
evaluation in which about 81.4% of the test data is comprised of known words and
18.6% of unknown words. When only using the context features, the performance
was as low as 38.38% for the SVM-based tagger and 39.21% for the CRF-based tag-
ger. However, augmenting context features with affixes almost doubled this result to
76.68% and 80.22% for SVM and CRF, respectively. Therefore, affix information
was very helpful in inferring the POS of unknown words. Analysis of the impact of
each feature shows that the addition of prefix (pref ) features contributes to the highest
gain compared to suffix (suf ) and pattern (ptn) contributions. This is also true for the
combined affix features. The setup integrating both prefix and suffix features (con-
text+pref+suf) improved the unknown-word accuracy by a significant 8.49 percentage
points compared to context+pref+ptn, and by 19.14 percentage points compared to
context+suf+ptn.
This effect may be due to the distribution of some of the very frequent prefixes such
as “mI” for verbal nouns (N_V) and “zI, Ite” for the relative verbs (V_REL). The er-
ror analysis in Table 5.5 clearly shows that this hypothesis holds. The error rate with
regards to N_V when training on prefix features was only 7.5%, whereas training on
suffix features resulted in the high error rate of 77.5%. In addition, with reference to
the affix slot positions of Tigrinya words, while conjunctions are either prefixed or
suffixed, prepositions are always prefixed making prefix features more informative.

5.3.7.3 The Effect of the Pattern Features

The patterns of verbs, nouns, and adjectives are very informative features because in-
flectional and derivational rules indicate grammatical features such as verbal nouns,
relative verbs, and tense-aspect-mood features, as well as gender, person, and number
attributes. Incorporating patterns into the proposed contextual feature set yields con-
siderable performance gain on unknown words. The accuracy increased substantially
from 39.21% to 53.81% for CRF (Table 5.4) when using bare context and ptn pattern
features. Although less noticeable, the impact of patterns can also be seen in the com-
bined feature sets. The performance gain from context+pref to context+pref+ptn is
about 4.46 percentage points. However, the gain moving from context+pref+suf to all
features is rather small. This is probably due to the long range of character n-grams
for prefixes (1 to 6) and suffixes (1 to 5); unless a token is quite long, it is likely
that the pattern features are already encoded along with the prefix and suffix features.
In such cases, pattern features add little new information. Generally, this analysis
Table 5.6: The accuracy of the CRF-based tagger (tagset-1) as data size increases

Sent. (%) Sentences Tokens Accuracy (%)


10 465 7013 84.94
25 1164 17730 86.53
50 2328 36726 88.65
75 3492 53261 89.52
100 4656 72080 90.37

Figure 5.1: The performance of the tagger improves with the avail-
ability of more data.

reveals the extent to which infix patterns can be tailored to enrich the feature encod-
ing of Tigrinya lexical features. These types of infix features are unique to Tigrinya
and other Semitic languages. Therefore, it is expected that reinforcing the feature set
with more representative patterns would positively enhance the generalization of the
classifiers, and thereby the prediction accuracy of POS tags.

5.3.7.4 The Effect of the Data Size

The available corpus contains 72,000 tokens of which 19,000 (26%) are word types.
For a small corpus of 72,000 words, this is a relatively high number of word types
and can possibly increase the proportion of unknown words during testing. Further-
more, according to the lexical diversity (token-type ratio), a word is repeated around
4 times in the corpus.

Figure 5.2: The performance of the tagger improves with the avail-
ability of more data.

However, about 6% of the words in the corpus are hapaxes, which become part of
the unknown words if found in the test data. A robust tagger is required to correctly
predict the POS of such words, which otherwise degrade the tagger’s accuracy upon
incorrect assignments. The best accuracy achieved in this
research is 90.89%, employing the CRF-based tagger and tagset-2. However, this
result is rather low compared to the state-of-the-art results in resourceful languages
such as English. For instance, the TnT tagger trained on the English Penn Treebank
of 1,200,000 tokens reported 96.7% tagging accuracy [59]. That dataset is signifi-
cantly larger than our small corpus. The experiments in the original implementation
of TnT showed that accuracy improved with increasing data size. In order to inves-
tigate the performance of the Tigrinya taggers with respect to data size, we show the
learning curve in Fig. 5.1 by gradually increasing the training data size. The same
trend in log scale is also shown in Fig. 5.2 for better visualization. The graphs show
that performance improves with greater availability of data. One of the reasons for
this improvement is the reduction of unknown or out-of-vocabulary (OOV) words.
As mentioned, the OOV problem is pertinent to low-resource languages. In our case,
this limitation was partly addressed by introducing rich sets of morphological fea-
tures and rigorous tunings aimed at estimating the best parameters for the employed
Table 5.7: Tag-wise precision, recall, and F1-score for the SVM-based
tagger (tagset-2). Support is the number of words for the respective tag.

POS Prec. Rec. F1 Support


ADJ 0.85 0.83 0.84 870
ADV 0.79 0.79 0.79 233
CON 0.98 0.96 0.97 506
FW 0.60 0.71 0.65 17
INT 0.90 0.75 0.82 12
N 0.93 0.97 0.95 1860
NUM 0.98 0.95 0.97 120
N_PRP 0.94 0.86 0.90 244
N_V 0.97 0.99 0.98 203
PRE 0.95 0.95 0.95 447
PRO 0.83 0.85 0.84 223
PUN 1.00 1.00 1.00 778
UNC 1.00 0.67 0.80 9
V 0.69 0.65 0.67 37
V_AUX 0.95 0.96 0.96 348
V_GER 0.87 0.94 0.90 257
V_IMF 0.94 0.93 0.93 461
V_IMV 0.87 0.49 0.62 41
V_PRF 0.82 0.74 0.78 160
V_REL 0.84 0.85 0.85 382
Average 0.92 0.92 0.92 7208

algorithms.

5.3.7.5 Error Analysis

In Tigrinya, POS ambiguity may occur for all parts of speech. However, most of the
tagging errors stem from the role of (1) relative verbs and adjectives, (2) demonstra-
tive adjectives and pronouns, (3) adjectives and nouns, (4) imperfective verbs and
auxiliary verbs, and (5) adjectives and adverbs. Table 5.5 shows the error analysis
of unknown words for some representative POS tags, including nouns, verbs, ad-
jectives, and adverbs. Moreover, the performance of the SVM tagger (tagset-2) for
each POS tag is illustrated in Table 5.7 using precision, recall, and F-score measures
given by equations 5.2, 5.3, and 5.4 respectively.

In general, adding morphological affix features is shown to reduce errors when
compared to the context feature set.
dividual POS tags may differ. For instance, the error rate of verbal nouns (N_V) is
7.5% using the context+pref features, whereas 77.5% when using the context+suf
features. This is justified by the fact that verbal nouns are always prefixed by the
morpheme “mI”. The same is also true with the reduction of errors for relative verbs
that are prefixed mostly by the “zI” relativizer. However, from the relatively low score
in the precision-recall table (Table 5.7), it is clear that relative verbs (V_REL) were
difficult to discern due to other ambiguities. One case of relative verbs, such as ዝበለጸ zIbeleSe,
(the best), is that they can assume the role of an adjective when modifying a noun,
as in ዝበለጸ ቦታ zIbeleSe bota (the best place).
form relatively few word types, are classified with high confidence. Comparatively,
the estimation for noun classes shows lower precision. In Tigrinya, all nouns (N)
may also function as adjectives (ADJ), such as when a noun modifies another noun.
The error rate figures (Table 5.5) show that joint feature of context+pref+suf proved
effective in reducing the total errors of most tags. However, this feature did not do a
better job with auxiliary verbs. The errors in auxiliary verbs may be related to verbs,
such as conjugations of ነበረ nebere (exist, live), ሃለወ halewe (present), ጸንሐ SenIHe
(stay), and ተገብአ tegebI’e (ought to), that primarily take the roles of the imperfective,
perfective, or gerundive verbs. This type of ambiguity could partly explain the per-
formance downgrade with the categories of gerundive (V_GER) and perfective verbs
(V_PRF) shown in Table 5.7.
In addition, auxiliary verbs form compound verbs of the pattern “word+V_AUX,”
which conjugate a bit differently from the four subcategories of verbs. Some exam-
ples are ሓፍ በለ hafI bele (he stood up) and ትም በል tImI belI (keep quiet). These
constructs, annotated as “V” in the corpus, were predicted with considerably low
precision and recall. This is due to the tagging confusion regarding auxiliary verbs,
adverbs, or the other types of verbs. In general, affix and pattern features yield sig-
nificant performance improvements on POS tags that undergo inflection, but have
limited impact on POS tags that carry no inflectional information.

5.4 Experiment: POS Tagging Based on seq2seq LSTM and BiLSTM

5.4.1 Datasets and Tagsets

Similar to the CRF- and SVM-based experiments, we used the two versions of the NTC
tagged with tagset-1 and tagset-2, in the usual 90/10 train/test setting. As previously
mentioned, the NTC consists of 72,080 tokens in 4,656 sentences. The vocabulary size
of the corpus is 18,740. We also experimented with extracting the 10,000 most fre-
quently occurring tokens and masking the remaining “rare words” with the special
“UNK” (unknown) token.
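The vocabulary truncation can be sketched as follows (toy tokens and a small k stand in for the NTC tokens and k = 10,000):

```python
# Sketch of vocabulary truncation: keep the k most frequent tokens and map
# the rest to "UNK".  Toy data; the thesis uses k = 10,000 over the NTC.
from collections import Counter

def build_vocab(tokens, k):
    """Set of the k most frequent tokens."""
    return {w for w, _ in Counter(tokens).most_common(k)}

def mask_rare(tokens, vocab):
    """Replace every out-of-vocabulary token with the 'UNK' symbol."""
    return [w if w in vocab else "UNK" for w in tokens]

toks = ["a", "b", "a", "c", "a", "b", "d"]
vocab = build_vocab(toks, 2)            # {"a", "b"}
print(mask_rare(toks, vocab))           # ['a', 'b', 'a', 'UNK', 'a', 'b', 'UNK']
```

Masking at training time also gives the model an embedding for UNK itself, which is what allows it to tag genuinely unseen test words from context.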

5.4.2 Training Parameters

The training parameters are summarized as follows:

Epochs: 4
Embedding dimension: 128
Hidden layer size (neurons): 64
Batch size: 32
Output layer size (neurons): size of tagset-1 or tagset-2 (depending on the dataset)

5.4.3 Evaluation

The evaluation score is computed using the accuracy metric as given by equation 5.1.

5.4.4 Baselines

We consider the best performance in the previous section as the baseline for this
experiment. Accordingly, the best CRF accuracy scores were 90.37% and 90.89%
for the tagset-1 and tagset-2 data respectively.

5.4.5 Results and Analysis

Unlike the rigorous tuning and feature engineering done with the SVM- and CRF-
based approaches in section 5.3, we applied minimal settings for the LSTM-based
approaches and relied on word embeddings to encode morpho-syntactic informa-
tion and on the LSTMs’ capability to capture contextual long-distance dependencies. The
informativeness of word embeddings was demonstrated in section 3.6, where the er-
rors affecting unknown words were reduced by 50% using the information from word
embeddings. The accuracy results of the LSTM-based experiments are reported in
Table 5.8. Overall, the results achieved by all models are competitive. Looking at
the scores, the BiLSTM model trained on the dataset “all” performed better than the
other models. The BiLSTM model using tagset-2 outperformed the CRF baseline by
0.71 percentage points. This small improvement is remarkable since it was achieved
without any handcrafting of morphological features and with minimal tuning of model
parameters.
The accuracy of the LSTM-based network is quite comparable to that of the BiLSTM-
based network, with the latter performing slightly better. The reason is that the bidirec-
tional LSTM processes the input sentence in both forward and reverse passes, enabling
it to access both left (past) and right (future) contexts, whereas the regular LSTM
performs only a forward pass, preserving only past information. This
bidirectional access is quite helpful in Tigrinya POS tagging, since some words re-
quire both types of context for disambiguation. For example, when nouns (N) function as
demonstrative adjectives which are the same as pronouns (PRON) tend to get repeated
before and after a noun adding some emphasis to that specific noun (Examples 5.2
and 5.3).

Example 5.1. meSIHafI tIgIrINa → meSIHafI/N tIgIrINa/ADJ (Tigrinya book)

Example 5.2. Iti wedi → Iti/PRON wedi/N (the boy)

Example 5.3. Iti wedi Iti → Iti/ADJ wedi/N Iti/ADJ (the boy, that boy)

The performance using the vocabulary of the 10,000 most common tokens and the
full vocabulary of 18,740 tokens was similar. Although further investigation on vocabulary
Table 5.8: Results of LSTM and BiLSTM models for POS tagging
in sequence-to-sequence setting. The vocabulary sizes marked as
“top10k” and “all” denote the most frequent 10,000 tokens and all the
vocabularies (18,740 tokens) respectively.

Model Tagset Vocab. size Accuracy


LSTM tagset1 top10k 89.9
LSTM tagset2 top10k 90.4
BiLSTM tagset1 top10k 91.0
BiLSTM tagset2 top10k 91.6
LSTM tagset1 all 89.9
LSTM tagset2 all 90.4
BiLSTM tagset1 all 91.0
BiLSTM tagset2 all 91.6
CRF tagset1 all 90.37
CRF tagset2 all 90.89

sizes will be required, the current result shows that masking rare words with the
common UNK token proved useful, and that the extracted contextual and syntactic
information is important for resolving the ambiguity of rare words.
In summary, in this first LSTM-based POS tagging research for Tigrinya, we achieved
state-of-the-art results using bidirectional LSTMs with word embeddings for feature
extraction. However, the state-of-the-art results for similar Semitic languages and
well-resourced languages are in the high nineties.
tity and quality of the available small annotated Tigrinya corpus is necessary to achieve
better performance of POS tagging. In addition, significant and relatively inexpen-
sive improvements can also be achieved using semi-supervised approaches that ex-
ploit unlabeled text ([57], [63], [64]). We plan to advance in this line of research in
future works.

5.5 Chapter Summary


In Chapter 5, we investigated Tigrinya POS tagging using statistical models and
sequence-to-sequence neural networks and achieved state-of-the-art results. The re-
search is based on the previously created Nagaoka Tigrinya Corpus (NTC) that con-
sists of 72,080 POS-annotated tokens extracted from news text. The first class of
models used language-independent character and substring features and exploited
morphological CV patterns to improve the performance, particularly when handling
OOVs. The accuracy on unknown words improved by about 40 percentage points (absolute), proving that morphological patterns are very informative features. The
tagging models are built using CRF and SVM classification approaches and tuned
rigorously for parameter optimization. Accordingly, the best accuracy of the CRF-
based tagger with 20 tags was 90.89%, outperforming the SVM counterpart by a
statistically significant 1 percentage point (p = 0.002 < 0.05). The second approach
achieved the state-of-the-art accuracy of 91.6% using sequence-to-sequence BiLSTM
networks and word embeddings to automatically capture the lexical information and
contextual dependency of words and POS tags.

Chapter 6

Morphological Segmentation for Tigrinya

6.1 Introduction
As introduced in Chapter 1, the task of morphological segmentation is to identify the boundaries of morphemes within words. Tigrinya has a rich non-concatenative morphology that encodes grammatical information such as conjunction and preposition clitics, object markers, and so on. In this research, we constructed a new mor-
phologically segmented corpus for Tigrinya and investigated morphological segmen-
tation using supervised approaches based on CRFs and LSTMs. For CRFs, we em-
ployed character and sub-string features without explicit language-dependent feature
engineering. For LSTMs, we modeled segmentation with a fixed-size window ap-
proach relying on character and word embeddings for word representations. Details
of the related works, experimental settings, results, and analyses are provided in the
following sections.

6.2 Related Works


The work of [11] introduced the unsupervised discovery of morphemes based on
minimum description length (MDL) and likelihood optimization. This method lacks
contextual dependency representation, such as stem and affix orders. Although sev-
eral other unsupervised segmentation approaches have been proposed, it was shown
by [65] that minimally supervised approaches provide better performance compared

to purely unsupervised methods applied to large unlabeled datasets. For example, an unsupervised experiment for Estonian, which achieved a 73.3% F1 score with 3.9 million words, was outperformed by a supervised CRF that attained 82.1% with just 1,000 word forms. This work also showed that semi-supervised approaches, which use both annotated and unannotated data, improve upon purely unsupervised methods.
In related work on the Semitic languages Hebrew and Arabic, [66] used a probabilistic model in which segmentation and morpheme alignment are inferred from the structure shared between the two languages, using a parallel corpus with and without annotation. Recent works on neural models rely on some form of em-
bedding for extracting relevant features. [67] demonstrated a generic window-based
neural architecture that is applied to several NLP tasks while avoiding explicit use
of feature engineering. Their system trains on large unlabeled data to learn internal
representations. We also adopt a similar window-based approach, although with the
input of character sequence window instead of words as in [67].
In [68], state-of-the-art results were achieved with a neural model that learns
POS-related features only from character-level and sub-word embeddings. Several
research works, however, argue that enriching embeddings with additional morpho-
logical information can boost performance. For example, [69] demonstrates the ap-
plicability of morphological analyzer results to further improve the candidate ranking
of Arabic morphological disambiguation. In Burmese word segmentation, [70] ad-
dress the problem by employing binary classification with classifiers such as CRFs.
The tagset restriction to two classes was mainly due to the limitations of the avail-
able data size. Our corpus is similarly small-sized; however, we report experiments
of several schemes to investigate the effect of simple and more expressive tags in
Tigrinya morpheme segmentation.
Relatively speaking, Tigrinya is more closely related to Amharic than to Hebrew
and Arabic. Amharic and Tigrinya languages share a writing system and a number
of grammatical features and vocabularies. [71] presented morphological rule learn-
ing and segmentation based on inductive logic programming where rules or affixes
were learned by exposing easy-to-complex examples incrementally to an intelligent

teacher. Their system for affix segmentation achieved a performance of 0.94 preci-
sion and 0.97 recall measures.
Tigrinya remains an under-studied and under-resourced language from the NLP
perspective. However, regarding morphological processing, [20] employed fi-
nite state transducers (FSTs) to develop, “HornMorpho”, a morphological analysis
and generation system for Tigrinya, Amharic, and Oromo languages. The FST em-
powered by feature structures was effectively adapted to process the unusual non-
concatenative root-and-pattern morphology. The Tigrinya module of HornMorpho
2.5 performs the full analysis of Tigrinya verb classes [17]. The system was tested
on 400 randomly selected word forms, with 329 of them being non-ambiguous. The
analysis revealed remarkably accurate results with a few errors due to unhandled
FSTs, which can be integrated. Our approach is different from HornMorpho in at
least two aspects. First, our task is limited to identifying morphological boundaries.
The results amount to partially analyzed segments although these segments are not
explicitly annotated for grammatical features. The annotation could be pursued with
further processing of the output or training on morphologically annotated data, which
is currently missing for Tigrinya. Second, the use of the FSTs relies heavily on the
linguistic knowledge of the language in question. This would require time-consuming
manual construction of language-specific rules, which is more challenging for root-
and-pattern morphology. In contrast, we follow a data-driven (machine learning)
approach to automatically extract features of the language from a relatively small
boundary-annotated data. Moreover, regarding the limitations of HornMorpho, [17] noted that the analysis takes "considerable time" to exhaust all options before the system responds.

6.3 Tagging Schemes


The popular IOB or BIO tagging scheme was introduced to annotate the Beginning (B), Inside (I), and Outside (O) of chunks in tasks such as noun phrase chunking and NER [72]. Similarly, we address morpheme boundary detection by annotating every
character in morpheme chunks with the appropriate IOB label. There are different
variants of the IOB scheme including BIO, IOB, BIE, IOBES, and so on. There is

Table 6.1: Annotating with different tagging schemes

Scheme tImali InItezeyIHatete


(if he did not ask yesterday)
Word t I m a l i I n I t e z e y I H a t e t e
BIE B I I I I E B I I I E B I I E B I I I I E
BIES B I I I I E B I I I E B I I E B I I I I S
BIO O O O O O O B I I I I B I I I B I I I I I
BIOES O O O O O O B I I I E B I I E B I I I E S

no general consensus as to which variant performs best. [67] used the IOBES format as it encodes more information, whereas extensive evaluations by [73] showed that the BIO scheme yields superior results compared to the IOB scheme. Furthermore, the BIES
scheme gave better results for Japanese word segmentation [74].
We experimented with four schemes, namely BIE, BIES, BIO, and BIOES, to
explore the modeling of morpheme segmentation in a low-resource setting with the
morphologically rich language Tigrinya. In our case, the B, I, and E tags mark the Beginning, Inside, and End of multi-character morphemes. The S tag annotates single-character morphemes, and the O tag marks character sequences outside of morpheme chunks.
An example of using these annotations is presented in Table 6.1.
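The mapping from a morpheme-segmented word to per-character labels can be sketched as follows (a simplified illustration in our own notation; the assignment of the O tag to material outside morpheme chunks, as in Table 6.1, is omitted here):

```python
def tag_morphemes(morphemes, scheme="BIE"):
    """Label every character of a morpheme-segmented word.

    morphemes: list of morpheme strings, e.g. ["zI", "sebere"]
    scheme: "BIE", "BIES", "BIO", or "BIOES"
    """
    has_e = "E" in scheme
    has_s = "S" in scheme
    tags = []
    for m in morphemes:
        if len(m) == 1:
            tags.append("S" if has_s else "B")  # single-character morpheme
        else:
            tags.append("B")                    # morpheme begins
            tags.extend("I" * (len(m) - 2))     # inside characters
            tags.append("E" if has_e else "I")  # morpheme ends
    return tags
```

For example, `tag_morphemes(["zI", "sebere"], "BIE")` labels the relativized verb zIsebere as B E B I I I I E.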
We directly apply labels to the Latin transliterations of Tigrinya words (not the
Ge’ez script). The Ge’ez script is syllabic, and in many cases, the boundary has
fusional properties resulting in alterations of characters at the boundary. For exam-
ple, the word sebere (he broke) would be segmented as "seber" + "e" because the morpheme "e" represents grammatical features. However, this morpheme cannot be isolated in the Ge'ez script because the final characters "re" form a single symbol, and segmenting sebere into "sebe" + "re" is an incorrect analysis.

6.4 Character Embeddings


Morphological segmentation is primarily a character-level analysis. For example, in Tigrinya, the characters "zI" have a grammatical role when used as a relativizer, which is prefixed only to perfective (Example 6.1) or imperfective (Example 6.2) verbs. Therefore, all other occurrences of "zI", such as in the noun zInabI (rain), should not

Table 6.2: An example of generated fixed window character sequences


with assigned label

Window Label
_ _ _ _ _ s e l a m I B
_ _ _ _ s e l a m I _ I
_ _ _ s e l a m I _ _ I
_ _ s e l a m I _ _ _ I
_ s e l a m I _ _ _ _ I
s e l a m I _ _ _ _ _ E

be segmented. Consequently, recognizing the shape of relativized perfectives and


imperfectives becomes crucial.

Example 6.1. zIsebere (that broke)  zI + sebere/PERFECTIVE

Example 6.2. zIsebIrI (that breaks)  zI + sebIrI/IMPERFECTIVE

Example 6.3. *zI + yIsIberI/IMPERATIVE - invalid

Example 6.4. zInabI (rain)  *zI + nabI - incorrect segmentation

Character-level information must be extracted to learn such informative features


useful for identifying boundaries. We generated past and future fixed-width character contexts for each character. The concatenated contextual characters form a single
feature vector for the central target character. For example, the features of the word
selamI (peace, hello) for each character are generated as depicted in Table 6.2. We
also showed the corresponding label of the central character in the BIE scheme. The
boldface character represents the central character with its left (past) and right (fu-
ture) context of width five characters. Characters are padded with underscores (_) to
complete the remaining slots depending on the window size.
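The window generation of Table 6.2 can be sketched as follows (a minimal illustration; the function name is ours):

```python
def char_windows(word, width=5, pad="_"):
    """Fixed-width character context for every character of `word`:
    `width` past characters, the target character, and `width` future
    characters, padded with underscores at the word edges (cf. Table 6.2)."""
    padded = pad * width + word + pad * width
    return [padded[i:i + 2 * width + 1] for i in range(len(word))]
```

For selamI, the first and last windows are "_____selamI" and "selamI_____", matching the first and last rows of Table 6.2.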
We settled for a window size of five as increasing it beyond five did not result in
significant improvements. Moreover, the dimension of optimal character embedding
was decided by hyperparameter tuning experiments. We initialized the embedding
layer from a lookup table for integer (index) representations of the fixed-width charac-
ter vectors. The embedding was then fed to an LSTM after passing through a dropout
layer.

Table 6.3: Training data split as per the count of the fixed-width char-
acter windows (vectors) used as the actual input sequences

Data size
All Training Validation Test
100% 90% 5% 5%
289,158 260,242 14,458 14,458

6.5 Experiment

6.5.1 Settings

In this section, the data used and the experimental settings are reported. Initially, we set the number of training epochs to 100 and configured early stopping if training continued without loss improvement for 15 consecutive epochs. We used Keras¹ to develop and Hyperas² to tune our deep neural networks. Hyperas, in turn, applies hyperopt for optimization. The algorithm used is the Tree-structured Parzen Estimator (TPE) over five evaluation trials³.

6.5.1.1 Datasets

The text used in this research is extracted from the NTC corpus. The NTC consists
of around 72,000 POS-tagged tokens collected from newspaper articles. Our corpus
contains 45,127 tokens, of which 13,336 tokens are unique words. Training is per-
formed with ten-fold cross-validation, where about 10% of the data in every training
iteration is used for development. We further split the development data into two
equal halves, allotting one set for validation during training and the other half for
testing the model’s skill.
The evaluation reports are based on the final test set results. For every target
character in a word, a fixed-width character window is generated as shown in Table
6.2. The data statistics according to the generated character windows is presented in
Table 6.3.
¹ https://www.keras.io/
² https://github.com/maxpumperla/hyperas
³ https://papers.nips.cc/paper/4443-algorithms-for-hyperparameter-optimization.pdf

6.5.1.2 Evaluation

We report boundary precision (P), boundary recall (R), and boundary F1 score (F1), as given by equations 6.1 to 6.3. The precision evaluates the percentage of correctly
predicted boundaries with respect to the predicted boundaries, and the recall measures
the percentage of correctly predicted boundaries with respect to the true boundaries.
The F1 score is the harmonic mean of precision and recall and can be interpreted as
their weighted average.

P = correctly predicted boundaries / system-predicted boundaries    (6.1)

R = correctly predicted boundaries / true boundaries    (6.2)

F1 = 2 × (precision × recall) / (precision + recall)    (6.3)

The correct predictions are the count of true positives, while the actual (true) boundaries are the true positives plus the false negatives. The system-proposed predictions are the sum of the true positives and the false positives. The I and O classes occur more than twice as often as the B classes. Therefore, to account for this class imbalance, we computed weighted macro-averages, in which each class's contribution is weighted by the number of samples of that class. The macro-averaged precision and recall are calculated by averaging the precision and recall of the individual classes, as shown in equations 6.4 and 6.5, where n denotes the total number of classes.

Macro-avg. P = (P(class 1) + P(class 2) + ... + P(class n)) / n    (6.4)

Macro-avg. R = (R(class 1) + R(class 2) + ... + R(class n)) / n    (6.5)
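The weighted macro-averaging described above can be sketched as follows (our own illustration of equations 6.1 to 6.5, with each class weighted by its support; names are illustrative):

```python
from collections import Counter

def weighted_macro_prf(true_tags, pred_tags):
    """Weighted macro-averaged boundary precision, recall, and F1.

    Each class's precision and recall are weighted by its share of the
    true samples (its support); F1 is the harmonic mean of the two
    weighted averages."""
    support = Counter(true_tags)
    total = len(true_tags)
    p_avg = r_avg = 0.0
    for cls in support:  # every class with support > 0
        tp = sum(t == p == cls for t, p in zip(true_tags, pred_tags))
        fp = sum(t != cls and p == cls for t, p in zip(true_tags, pred_tags))
        fn = support[cls] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        weight = support[cls] / total
        p_avg += weight * precision
        r_avg += weight * recall
    f1 = 2 * p_avg * r_avg / (p_avg + r_avg) if p_avg + r_avg else 0.0
    return p_avg, r_avg, f1
```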

6.5.1.3 Hyperparameter tuning

CRF: We performed preliminary experiments to decide the regularization algorithm (L1 or L2) and the C-parameter value from the set {0.5, 1, 1.5}. Consequently,

L2 regularization and C value of 1.5 were selected as these values gave better results.
The character and substring features we considered spanned a window size of five.
The full list of the character features is given as follows.
a. Chars: char -5, char -4, ... , char 4, char 5
b. Left context: -5 to 0, -4 to 0 , -3 to 0, -2 to 0, -1 to 0
c. Right context: 0 to 5, 0 to 4 , 0 to 3, 0 to 2, 0 to 1
d. Left+Right context: a + b
e. N-grams: bi-grams
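These feature templates can be sketched as a dictionary per character position, in the style accepted by common CRF toolkits (the feature names and exact template shapes are illustrative, not the thesis's exact implementation; template d, left plus right context, arises naturally from combining b and c in one dictionary, and only the target-anchored bigram is shown for template e):

```python
def crf_features(word, i, width=5):
    """Character and substring features for the character at position i
    of `word`, spanning a window of `width` on each side (cf. list a-e).
    Word edges are padded with "_"."""
    padded = ["_"] * width + list(word) + ["_"] * width
    j = i + width  # index of the target character in the padded word
    feats = {}
    # a. individual characters in the window
    for k in range(-width, width + 1):
        feats[f"char[{k}]"] = padded[j + k]
    # b. left-context substrings ending at the target (-5..0, ..., -1..0)
    for k in range(1, width + 1):
        feats[f"left[-{k}:0]"] = "".join(padded[j - k:j + 1])
    # c. right-context substrings starting at the target (0..1, ..., 0..5)
    for k in range(1, width + 1):
        feats[f"right[0:{k}]"] = "".join(padded[j:j + k + 1])
    # e. character bigram anchored at the target
    feats["bigram"] = "".join(padded[j:j + 2])
    return feats
```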

LSTM: In order to search for the parameters that yield optimal performance, we
explored hyperparameters that include embedding size, batch size, dropouts, hidden
layer neurons, and optimizers. The complete list of the selected parameters for LSTM
is summarized in Table 6.4. We achieved similar results for the BIE and BIES tun-
ings, as shown in the table. However, the BIO and BIOES schemes showed differ-
ences in the tuning results and, therefore, these were used separately.
Embeddings: We tested several embedding dimensions from the set {50, 60, 100,
150, 200, 250}. Separate runs were made for all types of tag sets.
Dropouts: Randomly dropping out nodes has proven effective at mitigating overfitting and regularizing the model [75]. We applied dropout on
the character embedding as well as on the inputs and outputs of the LSTM/BiLSTM
layer. For example, the selected dropouts for the BIO-based tunings are 0.07 and 0.5
for the embedding and LSTM output layers, respectively. The dropout probabilities
are selected from a uniform distribution over the interval [0, 0.6].
Batch size: We ran tuning for the batch size of the set {16, 32, 64, 128, 256}.
Hidden layer size: The hidden layer’s size is searched from the set {64, 128, 256}.
Optimizers and learning rate: We investigated the stochastic gradient descent
(SGD) with stepwise learning rate and other more sophisticated algorithms such as
AdaDelta [76], Adam [77], and RMSProp [78]. The SGD learning rate was initialized
to 0.1, a momentum of 0.9 with rate updates for every 10 epochs at a drop rate of
0.5. However, the SGD setting did not result in significant gains compared to the
automatic gradient update methods.
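The stepwise SGD schedule described above (initial rate 0.1, halved every 10 epochs) amounts to the following decay function (a small sketch in our own naming; the momentum term is handled by the optimizer itself):

```python
def step_decay(epoch, lr0=0.1, drop=0.5, every=10):
    """Stepwise learning-rate schedule used for the SGD runs: start at
    `lr0` and multiply by `drop` once every `every` epochs."""
    return lr0 * (drop ** (epoch // every))
```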

Table 6.4: LSTM Hyperparameters selected by tuning

Parameter BIE/BIES BIO BIOES


Window 5 5 5
Character embedding 150 150 250
Hidden layer size 32 128 128
Batch size 32 256 256
Optimizer Adam Adam RMSProp

6.5.2 Baseline

The baseline considered in this study is the CRF model trained on only character
features with a window size of five. The window size is kept the same as in the neural LSTMs so that the models are compared subject to the same information. The baseline
experiments for the CRF achieved an F1 score of around 60% and 63% using the
BIOES and BIO tags, respectively. As for the BIE tags, the performance reached
about 75.7%. All the subsequent CRF-based experiments employed two auxiliary
features, which were the contextual and n-gram characters listed in section 6.5.1. As
depicted in Table 6.5, significant performance enhancements were achieved with the auxiliary features. However, compared to the LSTM-based experiments, the window-five CRF results were still suboptimal.

6.6 Results

6.6.1 The Effect of BIO Scheme Design with Character Embeddings

We experimented and compared the performance of three models trained with four
different tagging strategies as explained earlier. The results of ten-fold cross-validation
are summarized in Table 6.5 with P, R, F1 representing precision, recall, and F1 score,
respectively.
Generally, we observed that the choice of segmentation scheme affects the model's performance. Overall, using the BIE scheme resulted in the best performance in all tests. Although the BIOES tagset is the most expressive, since the corpus is rather

Table 6.5: Results of CRF, LSTM and BiLSTM experiments with four
BIO schemes.

Tagset CRF LSTM BiLSTM


P R F1 P R F1 P R F1
BIE 92.62 92.62 92.62 94.38 94.37 94.37 94.68 94.68 94.67
BIES 92.44 92.44 92.44 94.22 94.20 94.20 94.60 94.59 94.59
BIO 84.88 84.88 84.88 90.01 89.96 89.96 90.16 90.11 90.11
BIOES 83.60 83.60 83.60 88.26 88.21 88.21 88.45 88.39 88.39

small, the simpler tagsets showed better generalization over the dataset. The BIE-
CRF model achieved 92.62% in the F1 score, while the performance of the BIO and
BIOES-based CRFs fell to about 84.88% and 83.6%, respectively. The CRF model
using the BIO scheme outperformed the BIOES-based CRF by 1.28 percentage points. A majority of the errors are associated with under-segmentation and with confusion between the I (inside) and O (outside) tags. Masking these differences by dropping the O tag proved useful in our experiments. As a result, the CRF model using the BIE scheme outperformed the BIO-based CRF by around 7.7 percentage points.
We compared the regular LSTM with its bidirectional extension. The overall results
showed that the BiLSTMs performed slightly better than the regular LSTMs sharing
the same scheme. In the window-based approach, the past and future contexts are
partly encoded in the feature vector. This window was fed to the regular LSTM at
training time, making the information available for both networks quite similar. This
may be the reason for not seeing much variation between the two models. Neverthe-
less, the additional design in BiLSTMs to reverse-process the input sequence allows
the network to learn more detailed structures. Therefore, we saw slightly improved
results with the BiLSTM network over the regular LSTM. As with CRFs, a signif-
icant increase in the F1 score was achieved when changing from the BIOES to the
BIE scheme. In both LSTM models, we observed a gain of over 6 percentage points in performance. Overall, in these low-resource experiments, the models generalized better when the data was tagged with the BIE scheme, as this reduced the model
complexity introduced by the O tags. The BiLSTM model using the BIE scheme per-
formed better than all others, scoring 94.67% in the F1. This result was achieved by

employing only character embeddings to extract morphological features. On the other


hand, the CRF model which used a rich set of character + bi-gram + substring features
scored around 92.62%. For a fair comparison, we used the features of the CRF model
that spanned the same window of characters as the LSTM models. In other words, we
avoided heavy linguistic engineering on the CRF side. Although the features were not
linguistically motivated, achieving an optimal result this way still requires many trials
and better design of features. It is to be noted that additional fine-tuned hand-crafted
features and wider windows for the CRF model produce better results than what has
been reported. However, from the LSTM results, we saw that forgoing feature engi-
neering and extracting features using character embeddings could sufficiently encode
the morphological information desired for high-performance boundary detection.
While the results may not be directly comparable due to the difference in settings
and evaluation, two related previous works of [79] and [39] on the Semitic languages
of Arabic and Hebrew are referred to for comparison. We adopted an LSTM-based fixed-window approach similar to the one presented in [39], which investigated Arabic and
Hebrew morphological segmentation with the S & B dataset containing 6,192 word
types [80]. The BiLSTM-based best results reported are an F1 score of 95.2% on
Hebrew and 95.5% on Arabic. [39] employed more advanced multi-window features
and an ensemble of 10 models because the performance of character window features
was not satisfactory for the small data size used. In comparison, we achieved similar
results (see Table 6.5) with no further feature engineering other than simple character
window embeddings; however, the dataset available to us is larger, with 13,336 word types.
Prior to the work by [39], the state-of-the-art has been a CRF-based morpheme
segmentation relying on hand-crafted feature sets that include all left and right sub-
strings of a target character and an additional bias component [79]. Training on the
same S & B dataset, [79] achieved an F1 score of 94.5% and 97.8% on Hebrew and
Arabic, respectively. We notice that the LSTM-based work by [39] on the Arabic
dataset does not outperform the CRF-based method of [79]. We achieved around
92.62% F1 score with CRF-based Tigrinya experiment that was trained on the feature
set listed in section 6.5.1, with constraints to span only five left and right characters
matching the information exposed to our LSTM settings.

6.6.2 Error Analysis

Table 6.6: BiLSTM Confusion matrix comparison for BIE and BIO
schemes %. The Boldface entries represent BIE scheme. Rows and
columns represent predicted and true values respectively.

B/B I/I E/O
B/B 91.20/82.60 4.54/9.94 4.26/7.46
I/I 2.83/3.21 93.84/90.74 3.34/6.05
E/O 3.09/4.19 9.65/10.83 87.26/84.98

The models using the BIE and the BIES tagsets achieved comparable results. We
also observed a similar trend for the models using the BIO and the BIOES tagsets.
However, the BIE-based results and the BIO-based results differ significantly. There-
fore, we analyzed the effect of the tagset choice, contrasting the BiLSTM models of
the BIE with the BIO schemes. For easier comparison, the confusion matrices (in
percentage) of both experiments are presented jointly in Table 6.6. The values are
paired in X/Y format, where X is a value from the set {B, I, E} in the BIE scheme
and Y is one of {B, I, O} in the BIO scheme. The row-wise entries represent model
predictions, while the column values denote the true (reference) tags. The diagonal
entries where the tags match (e.g., row B/B vs. column B/B) represent the percentage
of correctly assigned tags. The rest of the entries show the actual (column-wise) vs.
the predicted (row-wise) confusions. For example, the entry for the row B/B against
the column I/I (4.54/9.94) denotes that 4.54% (in BIE) or 9.94% (in BIO) of the B
tags were assigned the I tag by mistake. Overall, looking at the paired entries, ex-
cept for the diagonals, the first entry (a BIE entry) is always lower than the second
entry (a BIO entry), which shows that the BIE models have a lower error rate than the corresponding BIO models. As mentioned earlier, the I and O classes occur about twice as often as the B class. Therefore, there is more confusion stemming
from these major tags. The B tag represents the morpheme segmentation boundaries;
therefore, the under-segmentation and over-segmentation errors can be explained in
relation to the B tags. Looking at the BIE figures, around 91% of the predictions for
the B tags are correct. In other words, 9% of the true B tags were confused with the I and E tags, which together amount to the under-segmentation errors. These types

Table 6.7: Sample segmentation result; the underscore character


marks segmentation boundaries and the boldfaced characters represent
segmentation errors.

Sentence True BIE BIES BIO BIOES


amIlaKI amIlaKI amIlaKI amIlaKI amIlaKI amIlaKI
kea kea kea kea kea kea
mIrIayu mI_rIay_u mI_rIay_u mI_rIay_u mI_rIay_u mI_rIay_u
zEbIhIgI zE_bIhIgI zE_bIhIgI zEbIhIgI zEbIhIgI zEbIhIgI
mIbIlaOu mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u mI_bIlaO_u
zITIOumI zI_TIOumI zI_TIOumI zI_TIOumI zI_TIOumI zI_TIOumI
kWlu kWl_u kWl_u kWlu kWlu kWlu
omI omI omI omI omI omI
abI abI abI abI abI abI
mIdIri mIdIri mIdIri mIdIri mIdIri mIdIri
abIqWele a_bIqWel_e abIqWel_e abIqWel_e abIqWel_e abIqWel_e

of errors are almost doubled with the use of the BIO scheme. On the other hand, in
total, the over-segmentation errors are about 5.9% (I, E vs. B) and 7.4% (I, O vs.
B) in the BIE and BIO schemes, respectively. Over-segmentation occurs when the
true I, O, or E tags are confused with the B tag. In this case, morpheme boundaries
are predicted while the characters actually belong to the inside, outside, or end of the
morpheme. These results show that the under-segmentation errors contribute more
compared to the over-segmentation errors.
In the BIE scheme results, we observe that the I and B tags are correctly predicted
with higher accuracies than in the BIO scheme. Specifically, the prediction accuracy
of the B tag (boundaries) increased from 82.6% to 91.2% (8.6% absolute change).
The main reason is the considerable reduction of errors introduced due to the confu-
sions of tag I for O or vice-versa in the BIO scheme. This comparison suggests that
improved results can be achieved by denoising the data from the O tag. Therefore,
for a low-resource scenario of boundary detection, starting with the BIE tags may
be a better choice for improving model generalization compared to more complex
schemes that are feasible for larger datasets.

The following sample sentence extracted from the Bible is segmented with all
four models. Note that the training corpus does not include text from the Bible.

Tigrinya:
amIlaKI kea mIrIayu zEbIhIgI mIbIlaOu zITIOumI kWlu omI abI mIdIri
abIqWele

English:
And out of the earth the Lord made every tree to come, delighting the eye and
good for food.

The segmentation result from the four models is summarized in Table 6.7. The
expected segmentation is given under the column labeled “True”. We observe that
the BIE model segmentation is the nearest to the True segmentation. The other mod-
els show under-segmentation errors for the words zEbIhIgI and kWlu. The verbal
noun prefix “mI*” is correctly segmented by all models, whereas the models failed
to recognize the boundary of the causative prefix “a*” in the last word abIqWele.
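Rendering predicted per-character tags as the underscore-delimited segmentations of Table 6.7 can be sketched as follows (our own illustration, assuming a B tag marks the start of each new morpheme):

```python
def tags_to_segmentation(word, tags):
    """Render per-character tags as an underscore-delimited segmentation
    in the style of Table 6.7, e.g. mIrIayu -> "mI_rIay_u"."""
    out = []
    for ch, tag in zip(word, tags):
        if tag == "B" and out:   # a new morpheme starts mid-word
            out.append("_")
        out.append(ch)
    return "".join(out)
```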

6.6.3 The Effects of Character and Word Embeddings

To analyze the contributions of auxiliary features, we performed BiLSTM experi-


ments concatenating character and word embeddings (char+word) for the BIE tagset.
Weights in the embedding layer are randomly initialized and we set the word embed-
ding space to 150 dimensions. Furthermore, the word embeddings are learned for the full vocabulary.
According to the results in Table 6.8, this setup provided improved performance
with an absolute gain of 0.4% in F1 score. This result suggests that boosting the
character (morphological) features with lexical features may be beneficial for seg-
mentation.

Table 6.8: The effect of additional word and POS features on segmentation performance

Feature Prec Rec F1


char 94.68 94.69 94.67
char+word 95.07 95.09 95.07

Table 6.9: The effect of increasing data size on segmentation performance. Generally, performance improves with the availability of more data.

Data(%) Prec Rec F1


10 92.30 92.32 92.30
20 93.39 93.40 93.39
40 94.42 94.42 94.41
60 94.76 94.76 94.76
80 95.05 95.05 95.04
100 94.68 94.68 94.67

Figure 6.1: Plotting the distribution of K-fold scores.

6.6.4 The Effects of Data Size

Table 6.9 shows the improvement in F1 scores from about 92.3%, with only 10% of
the data, to about 94.67%, showing an absolute gain of 2.37%. Each F1 score is the
weighted macro-average of a ten-fold cross-validation of the specific dataset.
The comparison of both the ten-fold result sets is visualized in the box and whisker
plot (Fig. 6.1). The f1_10 and f1_100 series denote the ten-fold F1 scores using 10% and 100% of the data, respectively. The box represents the middle 50% of

the F1 scores; outliers are shown as dots, and the line dividing the box marks the midpoint (median) of the scores.
The compact box of f1_100 indicates a comparatively lower variance in the 10-
fold averages whereas the longer box of f1_10 shows considerable variation. We also
observe higher overall F1 scores indicating the usefulness of enlarging the dataset.
The F1 distributions for 10% and 100% of the data do not fit a normal distribution (non-Gaussian); hence, we performed a Kolmogorov-Smirnov (KS) test and found that the absolute gain is statistically significant (p = 0.0012 < 0.05).
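The two-sample KS statistic used here is the maximum gap between the empirical CDFs of the two sets of fold scores. A minimal pure-Python sketch of the statistic (our own illustration; in practice a library routine such as scipy.stats.ks_2samp would supply the p-value as well):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDFs of the two samples."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)
```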

6.7 Chapter Summary


In chapter 6, we presented the first work on morphological segmentation for Tigrinya.
The research was performed based on a new manually segmented corpus comprising
over 45,000 words. Four variants of the BIO chunk annotation scheme were em-
ployed to train three different morphological segmentation models. The first model
was based on CRFs with language-independent features of characters, n-grams, and
substrings. The other two were based on LSTM and BiLSTM neural architectures
leveraging character embeddings with fixed-size window approach to extract the mor-
pheme boundary features. We evaluated the BIE, BIES, BIO, and BIOES tagging
schemes. Although the size of the corpus is relatively small, we achieved state-of-
the-art 95.07% F1 score in boundary detection using the BIE chunking scheme with
concatenated character and word embeddings.

Chapter 7

English-Tigrinya Machine
Translation with Morphological
Segmentation

7.1 Introduction
Machine translation (MT) is the task of translating text from one natural language to another. There is an abundant wealth of digital information on the internet and other
sources. However, most of the information is available in English and other well-
known languages such as Chinese and Japanese. Although the internet penetration of
less-known native languages such as Tigrinya seems to be on the rise, comparatively,
there is still a huge gap of information provision for low-resource languages. In addi-
tion, the language barrier in societies contributes to difficulties in communication and
daily life. One way to overcome the digital divide and improve communication is by
providing machine translation services. However, problems with the complexity and
linguistic divergence of some language pairs, aggravated by an insufficient parallel
corpus, renders machine translation a challenging task. In Chapter 1, we presented
a brief introduction of the difficulty in English-Tigrinya MT. In this research, we at-
tempt statistical machine translation by improving convergence of these languages
through morphological segmentation. The following sections describe some previ-
ous works in relation to morphology and MT, the methods we adopted, and finally,
the experimental results and discussions.

7.2 Related Works


The nature of SMT may differ depending on the inflectional complexity of the
languages involved. Nevertheless, most studies show that segmentation schemes help
SMT translation quality to some extent. [81] applied stemming, which reduced
translation errors in translation from Spanish, Catalan, and Serbian into English.
Our first approach is similar to this work, in that we also partially segment, or
stem, words rather than performing a full analysis of morphological affixes. Some
researchers suggest that simple segmentation can perform as well as complex
approaches [13]. Earlier, [82] investigated the impact of different segmentation
schemes on Arabic SMT. They reported that segmentation schemes can range from mere
proclitic pruning to sophisticated morphological analysis, depending on the
availability of data. [83] proposed a joint morphological-lexical language model for
translation involving a morphologically complex language. In their experiments on
English-to-Arabic translation, they reported improved translation quality over a
trigram baseline. [84] advanced this line of work on Arabic translation by using
context information along with segmentation. Similar studies on Hebrew-English SMT
show an improvement in BLEU[1] score [85]. That work showed that a linguistically
motivated morphological analyzer performed better than an unsupervised analyzer.
Amharic is another Semitic language closely related to Tigrinya in morphology and
syntax, with relatively better resource support and NLP research than Tigrinya.
[86] experimented on phrase-based English-Amharic SMT with a parallel corpus
comprising 18,434 sentences, achieving a baseline score of 35.32%. They applied
morphological segmentation to the Amharic data and were able to improve the BLEU
score by 0.92%. Furthermore, a cloud platform from ethiocloud.com
(translator.abyssinica.com) also provides Amharic-English translation. Additionally,
Amharic is among the languages supported by Google Translate.
Regarding Tigrinya translation, we find a promising recent project that released
a web application for English-to-Tigrinya translation (tigrinyatranslate.com).
Tigrinya is not yet supported by Google Translate and has very few entries (often
[1] BLEU: BiLingual Evaluation Understudy (https://en.wikipedia.org/wiki/BLEU)

empty) on Wikipedia. Apart from the Bible translation, we could not find other
parallel corpora open to the public. However, some online dictionaries and mobile
applications are being developed. For example, memhir.org[2] maintains a dictionary
of over 15,000 entries. Recently, geezexperience.com[3] has been compiling a
multilingual dictionary between Tigrinya and a number of other languages, including
English, German, Dutch, Italian, and Swedish. Hidri publishers[4] also provide a
mobile version of their printed dictionary "Advanced English-Tigrinya Dictionary"
with over 62,000 entries.
In this research, we used the Bible as our parallel corpus. [87] discuss the
usefulness of the Bible for language processing. They mention that the Bible has
been translated into over 2,000 languages, making it the most translated book in
the world. The Bible text is carefully translated and organized at the verse level.
According to [87], the Bible covers about 85% of modern-day vocabulary and a variety
of writing styles. Furthermore, [88] used the Bible as a bootstrapping text to set
the parameters of a stochastic translation system and noted its prospects for
enabling translation of thousands of languages. SMT requires large parallel data for
high-quality translation; consequently, the Bible alone will not be sufficient for
building a high-quality system. However, it is easily accessible and can be used to
build experimental models and investigate certain behaviors.

7.3 Methods

7.3.1 Affix-based Segmentation

We performed affix-based, or shallow, segmentation of Tigrinya words based on
longest-affix pruning. A list of Tigrinya prefixes and suffixes was compiled from
several Tigrinya corpora based on character n-gram frequency. The current corpora
include 15.1 million tokens of news text from the Haddas Ertra newspaper, the Bible,
[2] http://www.memhr.org/dic/
[3] http://www.geezexperience.com/dictionary/?dr=0
[4] https://play.google.com/store/apps/details?id=org.fgaim.android.aetd2&hl=en

and a Tigrinya lexicon crawled from the internet[5]. The shallow segmentation
produces three sub-word units in the form "longest-prefix stem longest-suffix". Note
that we use "prefix, stem, suffix" for simplicity; however, the sub-words here are
not necessarily valid linguistic units. To reduce overstemming, the minimum stem
threshold was set to five characters. This threshold was selected because most
Tigrinya words, especially verbs, contain tri-literal roots, or about six characters
when transliterated.
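This longest-affix pruning can be sketched as follows. The affix lists here are short illustrative samples, not the actual n-gram-derived lists compiled from the corpora:

```python
# Hypothetical affix lists; the thesis derives these from corpus n-gram statistics.
PREFIXES = ["zeyIte", "InIte", "zeyI", "zI", "te", "nI", "kI"]
SUFFIXES = ["unI", "etI", "nI", "u", "e"]

MIN_STEM = 5  # minimum stem length, to reduce overstemming

def shallow_segment(word):
    """Strip the longest matching prefix and suffix, keeping the stem
    at least MIN_STEM characters long."""
    prefix, suffix, stem = "", "", word
    for p in sorted(PREFIXES, key=len, reverse=True):  # longest prefix first
        if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
            prefix, stem = p, stem[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):  # longest suffix first
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix

print(shallow_segment("zeyItesebIrunI"))  # ('zeyIte', 'sebIr', 'unI')
```

On the example word from Table 7.1, this reproduces the stemmed form "zeyIte* sebIr +unI"; words whose stem would fall below the threshold are left unsegmented.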

7.3.2 Morpheme-based Segmentation

For deeper segmentation, we used an in-house morphological segmentation model based
on CRFs. The model detects morpheme boundaries using a character-based
Begin-Inside-Outside (BIO) tagging scheme, with the "B" and "I" tags marking the
beginning and inside of morphemes, respectively, and the "O" tag marking word
boundaries (whitespace), motivated by [12]. This model was specially designed and
enriched with several morphological features, including infixes, to attain high
segmentation accuracy. The model detects morpheme boundaries with a state-of-the-art
accuracy of 97% and performs an almost full morphological analysis of the input word.
For example, "zeyItesebIrunI" (word 3 in Table 7.1) can be stemmed to
"zeyIte* sebIr +unI", and our model would further segment the composite prefix
"zeyIte*" into "zeyI* te*" and the suffix "+unI" into "+u +nI". We note that the
prefix "zeyI" is a fused form of the "zI" relativizer and the "ayI" negative
circumfix, which have undergone vowel alternation, making their boundary obscure. We
did not segment such cases in this study. Table 7.1 shows examples of segmented words
with the raw text, the stemmed version (prefix-stem-suffix), and the morphologically
segmented version of the same word.
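For illustration, the character-level tagging targets for a gold-segmented word can be derived as below. This is a simplified sketch of the annotation step only, not of the CRF model or its features:

```python
def to_bio(morphemes):
    """Character-level BIO tags for one word given its gold morphemes:
    'B' opens each morpheme and 'I' continues it."""
    tags = []
    for m in morphemes:
        tags.append("B")
        tags.extend("I" * (len(m) - 1))
    return tags

def tag_sentence(words):
    """words: list of words, each a list of morphemes; the 'O' tag marks
    the whitespace between words."""
    out = []
    for i, w in enumerate(words):
        if i:
            out.append("O")
        out.extend(to_bio(w))
    return out

print(to_bio(["zI", "seber", "e"]))  # ['B', 'I', 'B', 'I', 'I', 'I', 'I', 'B']
```

Decoding reverses the mapping: a predicted "B" opens a new morpheme, so boundary detection reduces to tagging each character.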

7.3.3 Phrase-based Machine Translation

In this research, we employ phrase-based statistical machine translation [89] to
investigate the effect of segmentation on machine translation. In phrase-based
translation,
[5] http://www.cs.ru.nl/biniam/geez/crawl.php

Table 7.1: Examples of segmented words and the grammatical functions of the segments.

            Word 1               Word 2                Word 3
Word        zIsebere             InIteseberetI         zeyItesebIrunI
Stemmed     zI* seber +e         InIte* seber +etI     zeyIte* sebIr +unI
Segmented   zI* seber +e         InIte* seber +etI     zeyI* te* sebIr +u +nI
Gloss       he who broke         if she broke          that were not broken and
Grammar     REL STEM             CONJ STEM             REL-NEG PASSIVE STEM
            singular-3rd-masc.   singular-3rd-femin.   plural-3rd-masc. CONJ

first, the source sentence is segmented into shorter phrases, which are not
necessarily linguistic phrases. Each phrase is then translated into a target phrase.
Finally, the translated phrases are combined (reordered) to form the target sentence.
For example, suppose we have the following sentence pair as training data:
English source sentence:
"How do you translate this?"
This sentence can be translated into Tigrinya as:
"Izi kemeyI gErIka tItIrIgWImo?"
Table 7.2 shows the sample phrases extracted from this sentence pair.

Table 7.2: Example of phrase-based translation pairs.

Source phrase   Target phrase
How do you      kemeyI gErIka
translate       tItIrIgWImo
this            Izi
?               ?

Based on this type of large bilingual text, phrase-based systems learn models that
can predict the most probable translation outputs. Phrase translation based on the
noisy-channel model is defined using Bayes' rule. Accordingly, given a source

sentence, the best translation t_best is given as follows:

    t_best = argmax_t p(t|s),                                        (7.1)

    t_best = argmax_t p(s|t) p_LM(t),                                (7.2)

where t is the target sentence and s is the source sentence. The probability p_LM(t)
represents the language model, while p(s|t) models the phrase translation. During
decoding, the source sentence s is segmented into a sequence of n phrases s_1^n, and
each source phrase s_i in s_1^n is translated into a target phrase t_i. Then, a
possible reordering of the target phrases into a more fluent sentence is performed
using the distortion model, which is modeled by the relative distortion probability
d(start_i - end_{i-1}), where start_i is the beginning index of the source phrase
translated into the i-th target phrase, and end_{i-1} is the ending index of the
source phrase translated into the (i-1)-th target phrase.
Therefore, to account for distortion, the translation model p(s|t) is given by
Equation 7.3:

    p(s_1^n | t_1^n) = ∏_{i=1}^{n} ϕ(s_i|t_i) d(start_i - end_{i-1}),   (7.3)

where ϕ is the phrase translation model and d is the distortion/reordering
probability.
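A toy illustration of scoring a phrase segmentation under Equation 7.3 follows. The phrase-table probabilities and the exponential distortion penalty are invented for illustration, and the language-model term of Equation 7.2 is omitted:

```python
import math

# Toy phrase table phi(s_i|t_i); values are illustrative only.
PHI = {("how do you", "kemeyI gErIka"): 0.8,
       ("translate", "tItIrIgWImo"): 0.9,
       ("this", "Izi"): 0.9}

def distortion(start_i, end_prev, alpha=0.5):
    """Relative distortion probability: 1.0 for monotone steps
    (start_i = end_prev + 1), decaying exponentially with jump size."""
    return alpha ** abs(start_i - end_prev - 1)

def score(segmentation):
    """segmentation: list of (src_phrase, tgt_phrase, start, end) in
    target order. Returns log of prod phi(s_i|t_i) * d(start_i - end_{i-1})."""
    logp, end_prev = 0.0, 0
    for src, tgt, start, end in segmentation:
        logp += math.log(PHI[(src, tgt)]) + math.log(distortion(start, end_prev))
        end_prev = end
    return logp

monotone = [("how do you", "kemeyI gErIka", 1, 3),
            ("translate", "tItIrIgWImo", 4, 4),
            ("this", "Izi", 5, 5)]
print(math.exp(score(monotone)))  # 0.8 * 0.9 * 0.9 = 0.648
```

A reordered hypothesis that jumps around the source sentence pays a distortion penalty on every non-monotone step, so the decoder prefers orderings that balance fluency (via the LM) against distortion cost.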


The reordering difficulty varies according to the syntactic divergence of the
language pair. For example, English and Spanish have similar word orders (S-V-O),
although Spanish word order allows more flexibility[6]. In contrast, Tigrinya and
English differ in both morphology and structure. Tigrinya follows S-O-V word order,
and its verbs are highly inflected. Therefore, the distortion model is required to
improve the grammaticality of the translation by rearranging translated words or
phrases into their natural order. We used the Moses translation system [90] for the
phrase-based experiments. The system architecture is presented in Fig. 7.1. The
experimental settings are described in the next section.
[6] https://en.wikibooks.org/wiki/Spanish/Word_Order

Figure 7.1: Translation system architecture for the English-to-Tigrinya phrase-based
translation.

7.4 Experiments

7.4.1 Settings

The tools we used for building the phrase-based translation model, the language model
(LM), and the evaluations are from the Moses SMT toolkit [90]. Word alignment was
performed with MGIZA++ [91], an extended, optimized, multi-threaded version of
GIZA++ [92]. While verse length varies widely in the corpus, the average verse length
of the unsegmented Tigrinya corpus is 19.9 tokens and grows to 31.4 after
morphological segmentation (Table 7.3). Therefore, in the cleaning step, the maximum
sentence length was set to 60. We trained language models using KenLM, an efficient
library optimized for speed and low memory consumption [93]. The n-gram order was set
to five to account for words split by segmentation. Six language models were built,
based on the segmentation schemes and the two datasets. The reordering limit for the
distortion model was set to the default of six. The datasets, evaluation tests, and
baseline system are described as follows.
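The length-based cleaning step can be sketched as below; this mirrors the behavior of Moses' corpus-cleaning stage under our assumptions, not the exact script:

```python
def clean_parallel(src_lines, tgt_lines, max_len=60, min_len=1):
    """Drop sentence pairs where either side is empty or longer than
    max_len tokens, keeping the two sides in step."""
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        ls, lt = len(s.split()), len(t.split())
        if min_len <= ls <= max_len and min_len <= lt <= max_len:
            kept.append((s, t))
    return kept
```

With max_len=60, pairs whose segmented Tigrinya side has grown past the limit are discarded together with their English counterparts, so the two corpus sides stay aligned.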

7.4.1.1 Datasets

The training, tuning, and test data are all extracted randomly from the Bible paral-
lel corpus. Table 7.3 lists the size of the verse-aligned parallel corpus (dataset-1),

Table 7.3: Dataset-1: Verse-aligned parallel corpus.

Data                   Verses   English Tokens   Tigrinya Tokens
                                                 unsegm.   stemmed   morph. segm.
Training-1             29,307     938,837        584,318   837,675   918,719
Test-1                  1,000      31,994         20,042    28,808    31,500
Tuning-1                  970      31,383         19,624    28,254    30,889
Dataset-1              31,277   1,002,214        623,984   894,737   981,108
Average verse length            32.0             19.9      28.6      31.4

Table 7.4: Dataset-2: Sentence-aligned parallel corpus.

Data                   Sentences   English Tokens   Tigrinya Tokens
                                                    unsegm.   stemmed   morph. segm.
Training-2             19,299        581,799        356,002   513,876   563,696
Test-2                    651         19,980         12,292    17,884    19,545
Tuning-2                  628         18,799         11,596    16,710    18,380
Dataset-2              20,578        620,578        379,890   548,470   601,621
Average sent. length                30.2            18.5      26.7      29.2

whereas Table 7.4 lists the size of the sentence-aligned parallel corpus (dataset-2)
extracted from dataset-1. Note that in dataset-1, some verses were combined for
strict alignment, as explained in the preprocessing section. In this way, the
verse-aligned corpus consists of 31,277 verses, with 29,307 used for training, 970
for tuning, and 1,000 held out for testing. However, the verse alignment process also
introduces lengthy sequences, possibly making word alignment more difficult due to
large differences in sentence alignment within verses. Given the small size of the
corpus, this may affect the overall quality of the alignments. To investigate its
effect on translation quality, we constructed dataset-2, a sentence-aligned parallel
corpus, by extracting only the single-sentence verses based on the Tigrinya corpus.
Sentence identification was performed using the Tigrinya sentence-end marker;
all verses with a single sentence-end marker were extracted. The resulting corpus
comprises 20,578 parallel sentences, about 65.8% of the original verse-aligned
corpus. We notice that the morphologically segmented Tigrinya corpus (dataset-1) most
closely matches the token count of the English corpus. The average verse length of
the English side is

32.0 tokens and that of the morphologically segmented Tigrinya corpus is 31.4
(Table 7.3). It is interesting to see whether this match is useful in creating
effective word alignments for the machine translation (MT) models. Our experimental
results using both corpora are summarized in Tables 7.5, 7.6, 7.7, and 7.8.
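The single-sentence extraction step can be sketched as follows. The Ethiopic full stop "።" is assumed here as the sentence-end marker; the actual corpus may use a transliterated equivalent:

```python
def single_sentence_verses(verses, end_marker="።"):
    """Keep only verses containing exactly one sentence-end marker,
    positioned at the end of the verse."""
    kept = []
    for v in verses:
        v = v.strip()
        if v.count(end_marker) == 1 and v.endswith(end_marker):
            kept.append(v)
    return kept
```

Applying this filter to the Tigrinya side, and keeping the corresponding English verses, yields a sentence-aligned subset like dataset-2.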

7.4.1.2 Evaluation

We evaluate translation performance using the BLEU, METEOR[7], and TER[8] metrics. We
also analyze perplexity and OOV statistics to investigate the LM improvement achieved
by segmentation. OOVs are tokens in the test data that are not present in the
training data, and perplexity measures the complexity of LMs; a lower perplexity
score indicates that the LM fits the test set better. We designed four sets of system
evaluations based on the MT models and the test sets. The settings are as follows:
1. Verse-based models (MT-verse): Tables 7.5 and 7.6.
   Dataset-1: the verse-aligned corpus, containing 31,277 verses (1 verse >= 1 sentence).
   Test-1: 1,000 verses.
2. Sentence-based models (MT-sent): Tables 7.5 and 7.6.
   Dataset-2: only the single-sentence verses extracted from Dataset-1, containing 20,578 sentences.
   Test-2: only the single-sentence verses extracted from Test-1, containing 651 sentences.
3. Dataset-1 + Test-2: Tables 7.7 and 7.8.
   The results of the MT-verse and MT-sent models may not be directly comparable,
   since the test data differs between the two cases. Therefore, for a better
   comparison of the two models, we evaluated both using the sentence-based test
   data (Test-2).
4. Raw text models against segmented text models: Tables 7.5 and 7.6.
   For models from the segmented corpus, evaluation is straightforward: the
   translation output of the segmented MT model is evaluated against the segmented
   reference.
[7] METEOR: Metric for Evaluation of Translation with Explicit ORdering
    (https://en.wikipedia.org/wiki/METEOR)
[8] TER: Translation Error Rate (http://www.cs.umd.edu/~snover/tercom/)

However, the baseline model is built from the unsegmented parallel corpus; therefore,
the evaluation outputs of the unsegmented and segmented MT models are not directly
comparable. Hence, for a fair and easier comparison, we segment the translation
output of the baseline model and evaluate it against the segmented reference. Models
system-1b and system-1c are examples of such models (Table 7.5). The evaluation setup
is depicted in Fig. 7.2, using the segmented baseline and segmented cases as an
example. Another comparison method is to restore the translated segments in the
segmented models to their original words and evaluate against a raw text reference.
This method conducts evaluation over words instead of morphemes and requires a
separate detokenization algorithm or input data representation, which is left for
future research.

Figure 7.2: The proposed system evaluation for segmented and unsegmented models.

7.4.2 Baseline

The baseline system is built from the clean, tokenized (unsegmented) version of the
Bible corpus. The performance in terms of BLEU score is 15.6 for the verse-aligned
model (MT-verse) and 13.0 for the sentence-aligned model (MT-sent). All models built
from the segmented corpus outperform these baseline scores.

Table 7.5: MT-verse and MT-sent: BLEU, METEOR, and TER scores.

MT system   System      MT-model        BLEU   METEOR   TER
MT-verse    System-1    sys-base        15.6   19.7     74.2
            System-1b   sys-base-stem   19.8   21.1     71.0
            System-1c   sys-base-segm   19.3   20.9     71.0
            System-2    sys-stem        20.9   22.7     72.7
            System-3    sys-segm        20.7   23.3     71.7
MT-sent     System-4    sys-base-sent   13.0   17.8     76.7
            System-5    sys-stem-sent   18.8   21.1     74.4
            System-6    sys-segm-sent   19.8   22.9     72.5

Table 7.6: Perplexity: Test tokens with OOVs included.

System              Test tokens   OOV count   OOV ratio (%)   Perplexity
System1-base        21042         1408        6.7             270
System2-stem        29808          757        2.5              69
System3-segm        32500          664        2.0              52
System4-base-sent   12943         1106        8.5             317
System5-stem-sent   18535          604        3.3              79
System6-segm-sent   20196          532        2.6              59
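The OOV figures in Table 7.6 follow a simple computation, sketched below with toy token lists (the thesis counts come from the actual training and test corpora):

```python
def oov_stats(train_tokens, test_tokens):
    """OOV count and ratio as in Table 7.6: an OOV is a test token whose
    type does not occur anywhere in the training data."""
    vocab = set(train_tokens)
    oov = sum(1 for t in test_tokens if t not in vocab)
    return oov, 100.0 * oov / len(test_tokens)
```

Because segmentation splits rare inflected wordforms into frequent morphemes, the same routine applied to segmented data yields both a larger test-token count and a smaller OOV ratio, as the table shows.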

7.5 Results
The experimental results of the baseline and segmented models are described in Ta-
bles 7.5 through 7.8. Below, we discuss the effect of segmentation according to the
evaluation settings mentioned earlier.

7.5.1 Verse-based vs. Sentence-based Systems

In general, the evaluation metrics in Table 7.5 show that the verse-aligned models
score better than the sentence-aligned models. The performance drop in the MT-sent
models is most likely due to the larger proportion of OOVs resulting from restricting
the corpus to single-sentence verses. Table 7.6 shows the OOV ratio and model
perplexity of the two systems. For example, the OOV ratio of the MT-verse baseline is
6.7%, while that of MT-sent is 8.5%. Similarly, the perplexity

Table 7.7: MT-verse: BLEU, METEOR, and TER scores tested on Test-2.

Test data/model        BLEU   METEOR   TER
System1-Test2/unseg    14.5   18.9     75.8
System2-Test2/stem     20.0   22.2     73.5
System3-Test2/segm     19.8   22.9     72.5

Table 7.8: MT-verse: Perplexity and OOV evaluated on Test-2.

System               Test-2 tokens   OOV count   OOV ratio (%)   Perplexity
System1-Test2/base   12943           913         7.1             291
System2-Test2/stem   18535           477         2.6              70
System3-Test2/segm   20196           422         2.1              53

of MT-sent is higher than that of MT-verse; the verse-aligned models therefore score
better. Nonetheless, the difference is rather small, suggesting that with adequate
data, sentence-based models might have performed better. For example, the BLEU scores
of sys-segm and sys-segm-sent are 20.7 and 19.8, respectively, a difference of only
0.9 BLEU points, although the MT-sent corpus is much smaller than the MT-verse
corpus. Notice that the MT-verse models are tested on the 1,000 test verses (Test-1),
which include both single-sentence and multiple-sentence verses, whereas the MT-sent
models are tested on the 651 single-sentence verses (Test-2) extracted from Test-1.
Therefore, to compare MT-verse with MT-sent directly, we evaluated both systems on
the Test-2 dataset. The evaluation and perplexity scores are presented in Tables 7.7
and 7.8. We observe that this version of MT-verse outperforms MT-sent under all
metrics. This may be attributed to the fact that OOVs are greatly reduced when the
smaller test set, Test-2, is used against MT-verse, which has the larger training
set.

7.5.2 Stemmed vs. Morphologically Segmented Models

Overall, we observe that segmentation improved machine translation quality compared
to the unsegmented baseline. In the MT-verse system, the baseline for the sys-stem
model is sys-base-stem, while that of sys-segm is sys-base-segm. We see BLEU score
improvements of 1.1 and 1.4 points over the respective baselines for the stemmed and
segmented models. Although the BLEU score of the stemmed model is marginally better
than that of the segmented model (20.9 vs. 20.7), the METEOR and TER metrics show
that the morphologically segmented model outperforms the others. Moreover, the
metrics for the segmented models of the MT-sent system consistently show the best
results. The analysis of OOVs and perplexity in Table 7.6 further clarifies the
reason for the performance gain. The OOV count decreases from 1,408 for the baseline
to 664 for the segmented model, corresponding to a drop in the OOV ratio from 6.7%
to 2.0%. A similar reduction pattern is also reflected in the MT-sent models. Since
the test data size differs across models, the OOV ratio better explains the OOV size
relative to the test data. Accordingly, we see higher rates of OOVs in the
unsegmented models, while the ratio of OOVs to test data decreases with finer
segmentation. We note that this ratio negatively affects perplexity: the higher the
OOV ratio, the worse the perplexity. Model system3-segm in Table 7.6 has the lowest
OOV ratio and perplexity among all the models, while system6-segm-sent has the lowest
among the MT-sent models. Based on these findings, we conclude that the fine-grained
segmentation was better at improving the quality of English-Tigrinya machine
translation. In general, although our training data is relatively small, the
performance gain of the segmented systems demonstrates the usefulness of word
segmentation strategies for English-Tigrinya machine translation.

7.5.3 Translation Output

The examples in Table 7.9 show the translation output of two reference sentences from
all models of the MT system. We note interesting insights in terms of both
morphological and syntactic transfer from the source to the target sentence. The
relatively short reference sentence (b) is correctly translated by all models,
suggesting that the models perform well on short sentences. We therefore discuss the
syntactic structure and meaning preservation aspects of the translation taking the
longer sentence (a) as an example. For easier high-level discussion, we simplify the
source sentence (a) by abstracting it into representative sense sub-phrases (enclosed
in square brackets [...]). We also convert the reference and the Tigrinya outputs of
the MT-verse system into similar sub-phrases as follows:
Source sentence (a):

[but of the fruit of the tree of the knowledge of good and evil] [you may not
take ;] [for on the day when you take of it ,] [death will certainly come to you .]

Reference sentence (a):

[kabIta xIbuQInI kIfuInI ItefIlITI omI gIna :] [kabIa mIsI ItIbelIOI meOalItIsI]
[motI kItImewItI iKa Imo :] [kabIa ayItIbIlaOI : ilu azezo .]

Conversion to English sub-phrase units:

• Source: [of-tree] [don't-take] [if-you-take] [die]

• Reference: [from-tree] [if-you-eat] [die] [don't-eat]

• Translated-unseg: [from-tree] [don't-eat] [if-you-eat] [die]

• Translated-stm: [from-tree] [don't-eat] [if-you-eat] [die]

• Translated-seg: [tree-you] [don't-eat] [if-you-eat] [death]

Generally, Tigrinya has subject-object-verb (S-O-V) ordering, while English follows
subject-verb-object (S-V-O). In all the translations, the phrase order of the
Tigrinya output aligns better with the English source than with the Tigrinya
reference; the boldface sub-phrases demonstrate this observation. In this specific
case, the order alteration does not make sentence comprehension very difficult.
However, invalid ordering may create ungrammatical translations, which can also make
the meaning difficult to understand. The translated-seg output is harder to
understand because the original

Table 7.9: Sample translations from the MT-verse and MT-sent models.

System Sentences
(a) but of the fruit of the tree of the knowledge of good and evil
Source you may not take ; for on the day when you take of it , death will
certainly come to you .
(b) and the name of the second river is gihon : this river goes
round all the land of cush .
(a) kabIta xIbuQInI kIfuInI ItefIlITI omI gIna : kabIa mIsI
Reference ItIbelIOI meOalItIsI motI kItImewItI iKa Imo : kabIa
ayItIbIlaOI : ilu azezo .
(b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
Baseline (a) gInaKe : kabI fIre omI ayItIbIlaOI : kabIa mIsI ItIbelIOI
(unsegm.) meOalItIsI : motI kItImewItI iKa .
(b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
System 2 (a) kabI fIre Ita gIna kI* f +uInI Ite* fIlITI omI dIma Imo : kabIa
(stemmed) ayItIbI* laOI : kabIa mIsI ItI* belIOI meOalItIsI motI kItI* mew
+ItI iKa : nI* sI +Ka .
(b) sImI Iti KalI +ayI rIba dIma giho +nI Iyu : nIsu nI* KWla
mIdIri kushI yI* zora .
System 3 (a) gIna +Ke It +i fIre It +a xIbuQI +nI kIfuI +nI i +Ka Imo :
(morph. kabI +a ayI* tI* bIlaOI : beta meOalIti Iti +a : kabI +a mIsI ItI*
segm.) belIOI meOalItI sI motI Iy +u : il +u azez +o .
(b) sImI It +i KalIayI rIba dIma giho +nI Iy +u : nIsu nI* KWla
mIdIri kushI yI* zor +a .
System 4 (a) gInaKe Iti fIre Ita xIbuQInI kIfuInI ItefIlITI omI dIma iKa
(MT-sent- Imo , beta meOalIti mIsI : motI kItImewItI iKa .
unseg) (b) sImI Iti KalIayI rIba dIma gihonI Iyu : nIsu nIKWla mIdIri
kushI yIzora .
System 5 (a) gInaKe : kabI fIre omI dIma iKa Imo : kabIa ayItIbI* laOI :
(MT-sent- beta meOalIti Itia : kabIa mIsI ItI* belIOI meOalItIsI motI kItI*
stm) mew +ItI iKa .
(b) sImI Iti KalI +ayI rIba dIma giho +nI Iyu : nIsu nI* KWla
mIdIri kushI yI* zora +nI .
System 6 (a) gIna +Ke It +i fIre It +a xIbuQI +nI kIfuI +nI i +Ka Imo :
(MT-sent- kabI +a ayI* tI* bIlaOI : beta meOalIti Iti +a : kabI +a mIsI ItI*
morph.seg) belIOI meOalItI +sI motI Iy +u : il +u azez +o .
(b) sImI It +i KalIayI rIba dIma giho +nI Iy +u : nIsu nI* KWla
mIdIri kushI yI* zor +a .

meaning is not entirely preserved. Some studies have shown that aggressive
segmentation into very fine units may actually hurt translation quality by
unnecessarily enlarging the phrase table and increasing the uncertainty in choosing
the correct phrase candidate [13]. There are two problems with translated-seg
(System 3 in Table 7.9). First, the beginning phrase is translated to "you are the
tree", which differs from the original phrase, which has the sense of "from-tree";
second, the last phrase, "you will die", is wrongly translated as "it is death". In
comparison, the translated-stm output preserves the meaning of the reference better
than the segmented model. However, translated-seg seems to have better token coverage
than the other models. For example, the words "ilu azezo" in the reference were found
only in the translated-seg models. This could be why translated-seg scores better,
since an increase in matched token count improves the BLEU score. Therefore, in
post-processing, a de-tokenization step is required to reattach morphemes to their
root words and then evaluate at the word level. We plan this type of analysis for
future research.
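Such a de-tokenization step can be sketched as follows, assuming the "*"/"+" segment markup shown in Tables 7.1 and 7.9; this is an illustration of the idea, not a component implemented in the thesis:

```python
def desegment(tokens):
    """Reattach morphemes to their stems: a trailing '*' marks a prefix
    that joins to the token on its right, and a leading '+' marks a
    suffix that joins to the token on its left."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]               # suffix: glue to the left
        elif words and words[-1].endswith("*"):
            words[-1] = words[-1][:-1] + tok   # after a prefix: glue right
        else:
            words.append(tok)                  # start a new word
    return words

print(desegment("zeyI* te* sebIr +u +nI".split()))  # ['zeyItesebIrunI']
```

Running this over a segmented translation output would restore full wordforms, allowing BLEU and the other metrics to be computed over words rather than morphemes.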

7.6 Chapter Summary


In Chapter 7, we showed the improvement that morphological segmentation brings to the
performance of English-to-Tigrinya statistical machine translation. Segmentation was
performed to improve word alignment and the language model, and to reduce OOVs. We
explored two segmentation schemes: one based on longest-affix (shallow) segmentation
and the other based on fine-grained morphological segmentation. We used a relatively
small parallel corpus derived from the Bible translations of both languages. The
Bible text was extracted automatically and properly aligned at the verse level and
sentence level. The translation system was built using Moses phrase-based
translation. The experimental results show a promising improvement in translation
quality using both schemes. Segmentation reduced the OOV ratio and model perplexity,
and as a result, the BLEU, METEOR, and TER scores improved. In general, the
morphologically segmented models scored better than the unsegmented baseline and the
affix-segmented models.

Chapter 8

Final Remarks

8.1 Conclusion
This thesis presents foundational NLP research for Tigrinya, constructing new
language resources and fundamental NLP components for the first time. Specifically,
we researched Tigrinya POS tagging, morphological segmentation, and the effect of
segmentation on English-to-Tigrinya machine translation. We further analyzed optimal
settings for Tigrinya word embeddings by designing intrinsic and extrinsic
evaluations.
Tigrinya is an under-resourced language; consequently, NLP systems for Tigrinya
achieve low scores due to data sparsity, or the OOV problem. The complex inflectional
and derivational patterns of Tigrinya generate a vast number of wordforms that
aggravate data sparsity. To address these issues, we explored several methods based
on classification, sequence labeling, and sequence-to-sequence labeling, leveraging
morphological patterns and language-independent substring features, including
character and word embeddings.
First, we constructed a news text corpus containing over 15 million tokens. Using
this corpus, we analyzed word embeddings for Tigrinya. In the intrinsic evaluations
of analogy, similarity, and categorization, syntactic relatedness improved with a
shorter word context, while semantic relatedness was better with a wider context. In
the extrinsic evaluations, we showed the usefulness of word embeddings in reducing
errors in the POS-tagging task.
Second, we employed the NTC corpus for Tigrinya POS tagging. Utilizing morphological
patterns with CRF sequence labeling boosted the tagging accuracy of unknown words by
around 40%; the overall accuracy achieved was 90.89%.

Although the results show that morphological patterns are very informative for POS
disambiguation, we achieved improved accuracy of 91.6% (state-of-the-art) using
BiLSTM sequence-to-sequence labeling, relying on word embeddings (forgoing fea-
ture engineering).
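The kind of language-independent substring features fed to such a CRF tagger can be sketched as below. The exact feature templates used in the thesis may differ, and the tokens are hypothetical transliterations.

```python
def token_features(tokens, i):
    """Language-agnostic features for token i: the word itself, candidate
    prefix/suffix substrings, and the neighboring words. A CRF library
    (e.g., CRFsuite) consumes one such dict per token."""
    w = tokens[i]
    return {
        "word": w,
        "prefix2": w[:2],   # candidate 2-character prefix
        "prefix3": w[:3],   # candidate 3-character prefix
        "suffix2": w[-2:],  # candidate 2-character suffix
        "suffix3": w[-3:],  # candidate 3-character suffix
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

# Hypothetical transliterated sentence.
sent = ["nab", "beteseb", "keydu"]
print(token_features(sent, 1))
```

Such affix substrings are what allow the tagger to generalize to unknown words, since unseen wordforms often share prefixes and suffixes with seen ones.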
Third, we presented a new morphologically segmented corpus and the first mor-
phological segmentation research for Tigrinya. Extensive experiments were per-
formed using different variants of the BIO tagging scheme (BIE, BIES, BIO, and BIOES)
and supervised approaches exploiting primarily window-based character embeddings
as features for LSTMs and BiLSTMs. The results and the detailed error analyses show
that the BIE (begin-inside-end) scheme provides a better representation. Moreover, we
observed that annotating words that do not need segmentation (outside-of-morpheme
chunks) with the “O” tag hurts performance. We also compared the LSTMs with
CRF-based segmentation employing language-agnostic character and substring fea-
tures. The results show that BiLSTM-based segmentation outperforms CRF-based
segmentation when both methods are exposed to a similar context window. However,
the CRF-based results can be enhanced further with careful design of features,
including language-dependent ones.
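The BIE scheme can be illustrated by labeling each character of a word with the position it occupies within its morpheme. The morphemes below are hypothetical transliterated placeholders, and labeling single-character morphemes as B is one possible convention; the thesis's exact handling may differ.

```python
def bie_tags(morphemes):
    """Character-level BIE labels for a segmented word: each morpheme
    contributes B (begin), zero or more I (inside), and E (end)."""
    tags = []
    for m in morphemes:
        if len(m) == 1:
            tags.append("B")  # single-character morpheme (one convention)
        else:
            tags.append("B")
            tags.extend("I" * (len(m) - 2))
            tags.append("E")
    return tags

# Hypothetical segmentation: prefix "te", stem "sebr", suffix "u".
word = ["te", "sebr", "u"]
print(list(zip("".join(word), bie_tags(word))))
```

Given these per-character labels, the segmenter's task reduces to sequence labeling, and morpheme boundaries are recovered wherever a B follows an E (or another B).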
Finally, we explored the effect of stem-based (shallow) and morpheme-based
(fine-grained) morphological segmentation on English-to-Tigrinya phrase-based sta-
tistical machine translation. For this research, we properly aligned the English-Tigrinya
Bible corpus at the verse and sentence levels. In the verse-level corpus, the token
count of the English side exceeded that of the Tigrinya side by around 38%; fine-grained
segmentation of the Tigrinya side almost equalized the token counts. Furthermore,
the OOV ratio decreased by 4.7 and 4.2 percentage points for the stem-based and the
morpheme-based models, respectively. Similarly, the perplexity of the language models
was reduced (improved) in the segmented models. Overall, despite the relatively small
size of our parallel corpus, we achieved improvements of 1.1 and 1.4 BLEU points on
the verse-based corpus as a result of the shallow and the fine-grained segmentation,
respectively. This result indicates that segmentation can improve the quality of
English-to-Tigrinya phrase-based statistical machine translation.
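How segmentation curbs the OOV rate can be sketched with a toy vocabulary check. The wordforms below are hypothetical transliterations, not drawn from the Bible corpus.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens absent from the training vocabulary."""
    vocab = set(train_tokens)
    oov = sum(1 for t in test_tokens if t not in vocab)
    return oov / len(test_tokens)

# Unsegmented: an unseen inflected wordform is an OOV token.
train_word = ["sebere", "seberet", "yisebru"]
test_word = ["seberu"]
# Segmented into morphemes: both pieces of the same form were seen.
train_seg = ["seber", "e", "seber", "et", "yi", "seber", "u"]
test_seg = ["seber", "u"]

print(oov_rate(train_word, test_word))  # 1.0
print(oov_rate(train_seg, test_seg))    # 0.0
```

Because segmentation shares morphemes across many inflected forms, the effective vocabulary shrinks, which simultaneously lowers the OOV rate and the language-model perplexity.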

8.2 Contributions of the Research


The main contributions of the research are summarized as follows:

1. A medium-sized news text corpus containing over 15.1 million tokens was con-
structed automatically. In the process, we created tools for crawling Tigrinya
websites, as well as for cleaning, formatting, transliterating, and normalizing the
corpus. Moreover, the corpus was used to generate vocabularies, such as a lexicon
of over 593,000 unique words and a list of stopwords.

2. An English-Tigrinya parallel text corpus was compiled from the Bible for ma-
chine translation research. The resource has been properly re-aligned for strict
alignment at the verse and sentence levels.

3. Tigrinya word embeddings were created and analyzed for the first time from
the 15 million-token corpus. The quality of the word embeddings was evaluated
with intrinsic and extrinsic evaluations designed for Tigrinya. The optimal
performance was achieved by a Skip-gram model with hierarchical softmax,
a dimension of 300, a window size of two, and a minimum word count of six.

4. We constructed the first morphologically segmented corpus, comprising over
45,000 words. The corpus is annotated with the boundaries of prefix, stem, and
suffix morphemes.

5. We investigated POS tagging based on classification, sequence labeling, and
sequence-to-sequence labeling approaches and achieved a state-of-the-art ac-
curacy of 91.6% using BiLSTM neural networks. We employed the relatively
small Nagaoka Tigrinya Corpus 1.0 (NTC 1.0) created in our previous research.
The resource (eng.jnlp.org/yemane/ntigcorpus) and a CRF-based POS tagging
tool (tigtag.herokuapp.com) have been released to the public.

6. We presented the first supervised morphological segmentation research for
Tigrinya. The state-of-the-art F1 score achieved was 95.07%, obtained with
concatenated character and word embeddings as input to BiLSTMs.

7. We achieved a slight improvement in English-Tigrinya SMT by alleviating
OOV growth and reducing the perplexity of the language models through
morphological segmentation.

8. Tigrinya is an under-studied language. We therefore hope that this fundamental
research will also contribute to a better understanding of its properties and
language processing challenges. We encourage further research by releasing
the tools and resources of this work to the public.

8.3 Future Work

8.3.1 Improving the Quality of the NTC and the Performance of the POS Tagger

Future work will concentrate on the following primary objectives. First, there is
scope for improving the quality of the corpus and enlarging it by incorporating
genres other than news. This will also include revising the tagset design as well as
rectifying tagging errors and inconsistencies. It would be interesting to use recent
advances in labeling strategies, such as active learning, to enlarge the corpus at a
lower cost.
The second objective is to replace the current word-level tagging with a
morpheme-level approach, since the latter reduces the complex POS tags to simple
POS categories by factoring out the several layers of morphological inflection. The
morphological segmentation introduced in this research will be used for morphological
analysis prior to tagging.
Finally, there is room to investigate the use of semi-supervised approaches of POS
tagging, employing the unannotated version of the corpus to improve accuracy. The
tagger can be an essential prerequisite for further NLP studies, such as base phrase
chunking, syntactic parsing, machine translation, and text summarization.
8.3. Future Work 101

8.3.2 Improving the Segmented Corpus and Integrating Minimally Supervised Approaches

In morphological segmentation, we plan to improve and enlarge the available corpus
by including vocabulary from different domains. This will allow the inclusion of
potentially unseen patterns, as the current orthographic style is limited to the
news domain. Furthermore, we are interested in improving performance by integrating
minimally supervised approaches to make use of the large unlabeled Tigrinya corpus.
This research will be helpful in supporting downstream NLP tasks such as full-fledged
morphological analysis, stemming, and part-of-speech tagging.

8.3.3 Creating a Large Parallel Corpus and Exploring Neural Machine Translation

Statistical machine translation approaches require large bilingual texts to achieve
reasonable translation quality. However, finding language resources is a big challenge
for low-resource languages such as Tigrinya. In the future, we would like to initiate
the creation of a large English-Tigrinya parallel corpus for effective machine trans-
lation. It has been shown that enlarging language models can improve translation
quality [94]. Language models are obtained from a monolingual corpus, which is
easier to build than a parallel corpus. Therefore, we are interested in building larger
language models and studying their effect on translation quality. Finally, recent
advances in machine translation have reported improved results with neural machine
translation [95]. Therefore, with the availability of a larger parallel corpus, we would
also like to explore English-Tigrinya neural machine translation.

Bibliography

[1] O. Streiter, K. P. Scannell, and M. Stuflesser, “Implementing NLP projects for noncentral languages: Instructions for funding bodies, strategies for developers”, Machine Translation, vol. 20, no. 4, pp. 267–289, 2006.

[2] Y. Tedla and K. Yamamoto, “Nagaoka Tigrinya Corpus: Design and development of part-of-speech tagged corpus”, in Language Processing Society 22nd Annual Meeting Papers Collection, Tohoku, Japan: The Association for Natural Language Processing, 2016.

[3] Y. Tedla, “Development of part-of-speech tagged corpus and part-of-speech tagger for Tigrinya”, Master’s thesis, Nagaoka University of Technology, 2015.

[4] I. Zeroual, A. Lakhouaja, and R. Belahbib, “Towards a standard part of speech tagset for the Arabic language”, Journal of King Saud University – Computer and Information Sciences, vol. 29, pp. 171–178, 2017.

[5] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of English: The Penn Treebank”, Computational Linguistics, vol. 19, no. 2, pp. 314–330, 1993.

[6] S. Petrov, D. Das, and R. McDonald, “A universal part-of-speech tagset”, arXiv preprint arXiv:1104.2086, 2011. [Online]. Available: https://arxiv.org/pdf/1104.2086.pdf.

[7] H. Loftsson, S. Helgadóttir, and E. Rögnvaldsson, “Using a morphological database to increase the accuracy in POS tagging”, in Proceedings of Recent Advances in Natural Language Processing, Bulgaria, 2011, pp. 49–55.

[8] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”, in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML ’01, San Francisco, USA, 2001, pp. 282–289.

[10] S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural Computation, vol. 9, no. 8, pp. 1735–1780, Dec. 1997.

[11] M. Creutz and K. Lagus, “Unsupervised discovery of morphemes”, in Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, pp. 21–30.

[12] S. Green and J. DeNero, “A class-based agreement model for generating accurately inflected translations”, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, 2012, pp. 146–155.

[13] H. Al-Haj and A. Lavie, “The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation”, Machine Translation, vol. 26, no. 1, pp. 3–24, 2012.

[14] J. Mason, Tigrinya grammar. New Jersey: The Red Sea Press Inc., 1996.

[15] Y. Firdyiwek and D. Yaqob, The system for Ethiopic representation in ASCII, Jan. 1997. [Online]. Available: https://www.researchgate.net/publication/2682324_The_System_for_Ethiopic_Representation_in_ASCII.

[16] S. Wintner, Morphological Processing of Semitic Languages, I. Zitouni, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 43–66.

[17] M. Gasser, HornMorpho 2.5 user’s guide. Indiana University, Indiana, 2012.

[18] N. A. Kifle, “Differential object marking and topicality in Tigrinya”, pp. 4–25, 2007.

[19] A. Sahle, A comprehensive Tigrinya grammar. Lawrenceville, NJ: Red Sea Press, Inc., 1998.

[20] M. Gasser, “Semitic morphological analysis and generation using finite state transducers with feature structures”, in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009, pp. 309–317.

[21] A. Ghebre, Tigrinya Grammar, 2nd ed. Stockholm: Admas Forlag, 2000.

[22] K. N. Björkenstam, Computational linguistics, Stockholm University, 2018, [accessed 05-May-2018]. [Online]. Available: https://nordiskateckensprak.files.wordpress.com/2014/01/knb_whatisacorpus_cph-2013_outline.pdf.

[23] Y. Tedla and K. Yamamoto, “Analyzing word embeddings and improving POS tagger of Tigrinya”, in Proceedings of the International Conference on Asian Language Processing (IALP), Singapore: IEEE, 2017.

[24] M. Baroni, G. Dinu, and G. Kruszewski, “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”, in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 238–247.

[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781, 2013.

[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality”, in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 3111–3119.

[27] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation”, in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[28] C. C. Lin, W. Ammar, C. Dyer, and L. Levin, “Unsupervised POS induction with word embeddings”, in Annual Conference of the North American Chapter of the ACL, Denver, Colorado, 2015, pp. 1311–1316.

[29] R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot: Distributed word representations for multilingual NLP”, Computing Research Repository (CoRR), vol. abs/1307.1662, 2013. [Online]. Available: http://arxiv.org/abs/1307.1662.

[30] T. Schnabel, I. Labutov, D. M. Mimno, and T. Joachims, “Evaluation methods for unsupervised word embeddings”, in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 298–307.

[31] C. J. Burges, “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[32] M. Aly, Survey on multiclass classification methods, 2005. [Online]. Available: https://www.cs.utah.edu/~piyush/teaching/aly05multiclass.pdf.

[33] I. Rish, “An empirical study of the naive Bayes classifier”, in International Joint Conference on Artificial Intelligence (IJCAI), Workshop on Empirical Methods in Artificial Intelligence, 2001.

[34] S. D. Bay, “Combining nearest neighbor classifiers through multiple feature subsets”, in International Conference on Machine Learning (ICML), vol. 98, 1998.

[35] W.-Y. Loh, “Classification and regression trees”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14–23, 2011.

[36] C. M. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.

[37] K. Crammer and Y. Singer, “On the algorithmic implementation of multiclass kernel-based vector machines”, Journal of Machine Learning Research, vol. 2, no. Dec, pp. 265–292, 2001.

[38] R. Rifkin and A. Klautau, “In defense of one-vs-all classification”, Journal of Machine Learning Research, vol. 5, no. Jan, pp. 101–141, 2004.

[39] L. Wang, Z. Cao, Y. Xia, and G. de Melo, “Morphological segmentation with window LSTM neural networks”, in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Association for the Advancement of Artificial Intelligence, 2016.

[40] B. Plank, A. Søgaard, and Y. Goldberg, “Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss”, arXiv preprint arXiv:1604.05529, 2016.

[41] X. Ma and E. H. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF”, Computing Research Repository (CoRR), vol. abs/1603.01354, 2016.

[42] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks”, in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112. arXiv: 1409.3215.

[43] R. Bar-haim, K. Sima’an, and Y. Winter, “Part-of-speech tagging of modern Hebrew text”, Natural Language Engineering, vol. 14, no. 2, pp. 223–251, Apr. 2008.

[44] S. Khoja, “APT: Arabic part-of-speech tagger”, in Proceedings of the Student Workshop at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001, pp. 20–25.

[45] D. Graff and K. Walker, Arabic Newswire Part 1 - Linguistic Data Consortium. Philadelphia, 2001. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2001T55.

[46] M. Maamouri et al., Arabic Treebank: Part 1 - Linguistic Data Consortium. Philadelphia, 2003. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2003T06.

[47] M. Diab, K. Hacioglu, and D. Jurafsky, “Automatic tagging of Arabic text: From raw text to base phrase chunks”, in Human Language Technologies; 5th Meeting of the North American Chapter of the Association of Computational Linguistics (NAACL), Boston, MA: Association for Computational Linguistics, 2004, pp. 149–152.

[48] E. Marsi, A. van den Bosch, and A. Soudi, “Memory-based morphological analysis and part-of-speech tagging of Arabic”, in Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor: Association for Computational Linguistics, 2005, pp. 1–8.

[49] N. Habash and O. Rambow, “Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop”, in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL ’05, Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 573–580.

[50] E. Mohammed and S. Kübler, “Is Arabic part of speech tagging feasible without word segmentation?”, in Human Language Technologies; 5th Meeting of the North American Chapter of the Association of Computational Linguistics, Los Angeles: Association for Computational Linguistics, 2010, pp. 705–708.

[51] B. Ben Ali and F. Jarray, “Genetic approach for Arabic part of speech tagging”, International Journal on Natural Language Computing (IJNLC), vol. 2, no. 3, 2013.

[52] G. A. Demeke and M. Getachew, “Manual annotation of Amharic news items with part-of-speech tags and its challenges”, ELRC Working Papers, vol. 2, pp. 1–17, Addis Ababa, 2006.

[53] B. Gambäck, F. Olsson, A. Alemu, and A. L. Asker, “Methods for Amharic part-of-speech tagging”, in Proceedings of the First Workshop on Language Technologies for African Languages, 2009, pp. 104–111.

[54] B. G. Gebre, “Part of speech tagging for Amharic”, Master’s thesis, University of Wolverhampton, 2010.

[55] P. Wang, Y. Qian, F. K. Soong, L. He, and H. Zhao, “A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding”, arXiv preprint arXiv:1511.00215, 2015.

[56] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso, “Finding function in form: Compositional character models for open vocabulary word representation”, arXiv preprint arXiv:1508.02096, 2015.

[57] G. Berend, “Sparse coding of neural word embeddings for multilingual sequence labeling”, Computing Research Repository (CoRR), vol. abs/1612.07130, 2016. [Online]. Available: http://arxiv.org/abs/1612.07130.

[58] O. O. Ibrahim and Y. Mikami, “Stemming Tigrinya words for information retrieval”, in Proceedings of International Conference on Computational Linguistics (COLING): Demonstration Papers, Mumbai, 2012, pp. 345–352.

[59] T. Brants, “TnT: A statistical part-of-speech tagger”, in Proceedings of the Sixth Conference on Applied Natural Language Processing, 2000, pp. 224–231.

[60] V. Savova and L. Peshkin, “Part-of-speech tagging with minimal lexicalization”, in Proceedings of Computing Research Repository (CoRR), 2003.

[61] H. Tseng, D. Jurafsky, and C. Manning, “Morphological features help POS tagging of unknown words across language varieties”, in Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea: Asian Federation of Natural Language Processing, 2005.

[62] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python”, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[63] K. Duh and K. Kirchhoff, “POS tagging of dialectal Arabic: A minimally supervised approach”, in Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, 2005, pp. 55–62.

[64] K. Stratos and M. Collins, “Simple semi-supervised POS tagging”, in Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015, pp. 79–87.

[65] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A. Grönroos, M. Kurimo, and S. Virpioja, “A comparative study of minimally supervised morphological segmentation”, Computational Linguistics, vol. 42, no. 1, pp. 91–120, 2016.

[66] B. Snyder and R. Barzilay, “Unsupervised multilingual learning for morphological segmentation”, in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), 2008, pp. 737–745.

[67] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch”, Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.

[68] C. D. Santos and B. Zadrozny, “Learning character-level representations for part-of-speech tagging”, in Proceedings of the 31st International Conference on Machine Learning, vol. 32, Beijing, China, 2014, pp. 1818–1826.

[69] N. Zalmout and N. Habash, “Don’t throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic”, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 704–713.

[70] C. Ding, Y. K. Thu, M. Utiyama, and E. Sumita, “Word segmentation for Burmese (Myanmar)”, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 15, no. 4, p. 22, 2016.

[71] W. Mulugeta, M. Gasser, and B. Yimam, “Incremental learning of affix segmentation”, in Proceedings of International Conference on Computational Linguistics (COLING), 2012, pp. 1901–1914.

[72] L. A. Ramshaw and M. P. Marcus, “Text chunking using transformation-based learning”, in Natural Language Processing Using Very Large Corpora, Springer, 1999, pp. 157–176.

[73] N. Reimers and I. Gurevych, “Optimal hyperparameters for deep LSTM networks for sequence labeling tasks”, arXiv preprint arXiv:1707.06799, 2017.

[74] Y. Kitagawa and M. Komachi, “Long short-term memory for Japanese word segmentation”, arXiv preprint arXiv:1709.08011, 2017.

[75] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting”, The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[76] M. D. Zeiler, “Adadelta: An adaptive learning rate method”, Computing Research Repository (CoRR), vol. abs/1212.5701, 2012.

[77] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, Computing Research Repository (CoRR), vol. abs/1412.6980, 2014.

[78] Y. Dauphin, H. de Vries, J. Chung, and Y. Bengio, “RMSProp and equilibrated adaptive learning rates for non-convex optimization”, Computing Research Repository (CoRR), vol. abs/1502.04390, 2015.

[79] T. Ruokolainen, O. Kohonen, S. Virpioja, and M. Kurimo, “Supervised morphological segmentation in a low-resource learning setting using conditional random fields”, in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 29–37.

[80] B. Snyder and R. Barzilay, “Cross-lingual propagation for morphological analysis”, in Association for the Advancement of Artificial Intelligence (AAAI), 2008, pp. 848–854.

[81] M. Popović and H. Ney, “Towards the use of word stems and suffixes for statistical machine translation”, in Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2004.

[82] N. Habash and F. Sadat, “Arabic preprocessing schemes for statistical machine translation”, in Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, Association for Computational Linguistics, 2006, pp. 49–52.

[83] R. Sarikaya and Y. Deng, “Joint morphological-lexical language modeling for machine translation”, in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2007, pp. 145–148.

[84] I. Badr, R. Zbib, and J. Glass, “Segmentation for English-to-Arabic statistical machine translation”, in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), Short Papers (Companion Volume), 2008, pp. 153–156.

[85] N. Singh and N. Habash, “Hebrew morphological preprocessing for statistical machine translation”, in Proceedings of the 16th European Association for Machine Translation (EAMT) Conference, EAMT, 2012.

[86] G. T. Mulu and L. Besacier, “Preliminary experiments on English-Amharic statistical machine translation”, in Proceedings of the 3rd International Workshop on Spoken Languages Technologies for Under-resourced Languages (SLTU), 2012.

[87] P. Resnik, M. B. Olsen, and M. Diab, “The Bible as a parallel corpus: Annotating the book of 2000 tongues”, in Computers and the Humanities: Selected Papers from TEI 10: Celebrating the Tenth Anniversary of the Text Encoding Initiative, vol. 33, Denver, Colorado, 1999, pp. 129–153.

[88] J. D. Phillips, “The Bible as a basis for machine translation”, in Proceedings of the Pacific Association for Computational Linguistics, PACLING-2001, 2001.

[89] P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-based translation”, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Association for Computational Linguistics, 2003, pp. 48–54.

[90] P. Koehn et al., “Moses: Open source toolkit for statistical machine translation”, in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL ’07, Prague, Czech Republic: Association for Computational Linguistics, 2007, pp. 177–180.

[91] Q. Gao and S. Vogel, “Parallel implementations of word alignment tool”, in Software Engineering, Testing, and Quality Assurance for Natural Language Processing, ser. SETQA-NLP ’08, Columbus, Ohio: Association for Computational Linguistics, 2008, pp. 49–57.

[92] F. J. Och and H. Ney, “A systematic comparison of various statistical alignment models”, Computational Linguistics, vol. 29, no. 1, pp. 19–51, Mar. 2003.

[93] K. Heafield, “KenLM: Faster and smaller language model queries”, in Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2011, pp. 187–197.

[94] G. Lembersky, N. Ordan, and S. Wintner, “Language models for machine translation: Original vs. translated texts”, Computational Linguistics, vol. 38, no. 4, pp. 799–825, 2012.

[95] L. Bentivogli, A. Bisazza, M. Cettolo, and M. Federico, “Neural versus phrase-based machine translation quality: A case study”, in Empirical Methods in Natural Language Processing (EMNLP), 2016, pp. 257–267.

Appendix

Publication List

Journal Papers

• Yemane Tedla and Kazuhide Yamamoto, “Morphological Segmentation with LSTM Neural Networks for Tigrinya”, International Journal on Natural Language Computing (IJNLC), Vol. 7, No. 2, pp. 29-44, 2018.

• Yemane Tedla and Kazuhide Yamamoto, “Morphological Segmentation for English-to-Tigrinya Statistical Machine Translation”, International Journal of Asian Language Processing, Vol. 27, No. 2, pp. 95-110, 2017.

• Yemane Tedla, Kazuhide Yamamoto and Ashuboda Marasinghe, “Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus”, International Journal of Computer Applications, Vol. 146, No. 14, pp. 33-41, 2016.

Conference Papers

• Yemane Tedla and Kazuhide Yamamoto, “Analyzing word embeddings and improving POS tagger of Tigrinya”, in Proceedings of the International Conference on Asian Language Processing (IALP), IEEE, pp. 115-118, Singapore, 2017.

• Yemane Tedla and Kazuhide Yamamoto, “The Effect of Shallow Segmentation on English-Tigrinya Statistical Machine Translation”, in Proceedings of the International Conference on Asian Language Processing (IALP), IEEE, pp. 79-82, Taiwan, 2016.

Domestic Conference Paper

• Yemane Tedla, Kazuhide Yamamoto and Ashuboda Marasinghe, “Nagaoka Tigrinya Corpus: Design and Development of Part-of-speech Tagged Corpus”, in Language Processing Society 22nd Annual Meeting Papers Collection, The Association for Natural Language Processing, pp. 413-416, Japan, 2016.
