
DSpace Institution

DSpace Repository http://dspace.org


Computer Science thesis

2020-03-20

AUTOMATIC IDENTIFICATION OF
MAJOR ETHIOPIAN LANGUAGES

Tadesse, Biruk

http://hdl.handle.net/123456789/10755
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES
FACULTY OF COMPUTING

AUTOMATIC IDENTIFICATION OF MAJOR ETHIOPIAN


LANGUAGES

BIRUK TADESSE TEFERA

BAHIR DAR, ETHIOPIA

February 14, 2018


AUTOMATIC IDENTIFICATION OF MAJOR ETHIOPIAN LANGUAGES

Biruk Tadesse Tefera

A thesis submitted to the School of Research and Graduate Studies of Bahir Dar
Institute of Technology, BDU, in partial fulfillment of the requirements for the degree
of
Master of Science in Computer Science

Advisor Name: Professor Bandaru Rama Krishna Rao

Bahir Dar, Ethiopia


February 14, 2018
ACKNOWLEDGEMENTS

First and foremost, I would like to thank the almighty God who made all things possible. I
would like to express my sincere thanks to my advisor, Professor Bandaru Rama Krishna Rao,
for his important comments and encouragement. Most importantly, I thank my parents and
my family for giving me the possibility to become who I am today. I am also grateful to
my instructors, my colleagues, my friends and my classmates for the support and
encouragement they provided me at all times.

Last but not least, I would like to thank Bahir Dar University for being a healthy
environment where I can learn and be challenged, and Wachemo University for granting me
study leave to attend the graduate program of School of Computing in Bahir Dar
University.

ABSTRACT

Text-based language identification is the task of automatically recognizing a language


from a given text document. It is an important research area, as large quantities of text
are processed automatically for tasks such as spelling and grammar checking, information
retrieval, search engines, language translation, and text mining. In this research, an
adequate mechanism for efficient text-based language identification is presented with an
emphasis on 7 major languages used in Ethiopia, namely Afar, Amharic, Nuer, Oromo,
Sidamo, Somali and Tigrigna. These languages were chosen because they are spoken by
more than 79.3% of the total population of Ethiopia. Factors affecting accuracy such as
the size and variety of training data and the size of the string to be identified are
investigated. A Naïve Bayes classifier, an SVM classifier and a dictionary method are used.
The Naïve Bayes and SVM classifiers are trained using character n-grams of size 3 as the
feature set, while the dictionary method uses stopwords. The experiments are conducted on
three different character windows that provide an equivalent representation of short,
medium and long document size. Overall, the 3-gram Naïve Bayes classifier, the 3-gram
SVM classifier and the dictionary method showed an average classification accuracy of
98.37%, 99.53%, and 90.53% respectively. When trained with homogeneously distributed
training data per language, the 3-gram Naïve Bayes and SVM classifiers showed an
average classification accuracy of 95.16% and 96.2% respectively. To evaluate
multilingual identification, an artificial corpus that contains 1050 documents is
constructed. 45 of the 1050 documents are wrongly classified, which corresponds to
95.71% accuracy. The challenging tasks in the study are: identification of closely related
languages that share similar character sequences, identifying the language of short
excerpts from texts, and the unavailability of a standard corpus. The use of a classification
approach, combined with linguistically motivated features such as POS tags and
morphological information is recommended as a way forward for providing empirical
evidence on the convergences and divergences of language varieties in terms of lexicon,
orthography, morphology and syntax.

Keywords: Language Identification, Multilingual Identification, N-gram, Feature Set,


Naïve Bayes, Support Vector Machine, Dictionary Method, Character Window

TABLE OF CONTENTS

DECLARATION ..............................................................................................................

ACKNOWLEDGEMENTS ............................................................................................. III

ABSTRACT ...................................................................................................................... IV

TABLE OF CONTENTS .................................................................................................. V

LIST OF ABBREVIATIONS ....................................................................... VII

LIST OF FIGURES....................................................................................................... VIII

LIST OF TABLES............................................................................................................ IX

1. INTRODUCTION ....................................................................................................... 1

1.1 Background ........................................................................................................................................1

1.2 Problem statement .............................................................................................................................2

1.3 Objective of the study ........................................................................................................................5

1.4 Scope and limitation of the study ......................................................................................................5

1.5 Significance of the study ....................................................................................................................6

1.6 Methodology of the study ..................................................................................................................7

1.7 Research design ..................................................................................................................................7

1.8 Structure of the thesis ........................................................................................................................9

2 LITERATURE REVIEW......................................................................................... 10

2.1 Language Identification ................................................................................................................... 10

2.1.1 Overview ........................................................................................................................................... 10

2.1.2 Approaches to Language Identification ......................................................................................... 12

2.1.2.1 Detection based on N-gram ............................................................................................................. 12

2.1.2.2 Detection based on machine learning techniques .......................................................................... 14

2.1.2.3 Detection based on dictionaries containing stopwords ................................................................. 22

2.1.2.4 Multilingual Identification .............................................................................................................. 23

2.1.3 Evaluation of Language Identification Methods ........................................................................... 24

2.2 Overview of Major Ethiopian Languages ...................................................................................... 24

2.3 Related Works .................................................................................................................................. 30

3 METHODS AND TECHNIQUES ............................................................ 35

3.1 Overview ........................................................................................................................................... 35

3.2 Corpora ............................................................................................................................................. 35

3.3 Evaluation and Computational Techniques .................................................................................. 41

3.3.1 Evaluation Metrics ........................................................................................................................... 41

3.3.2 Smoothing Techniques ..................................................................................................................... 42

3.3.3 Classification Methods ..................................................................................................................... 43

4 RESULTS AND DISCUSSION ............................................................................... 52

4.1 EXPERIMENT AND RESULT ...................................................................................................... 52

4.2 Confusion Matrix ............................................................................................................................. 59

4.3 Discussion ......................................................................................................................................... 62

4.4 Multilingual Identification .............................................................................................................. 65

5 CONCLUSION AND RECOMMENDATION ...................................................... 68

5.1 Conclusion ........................................................................................................................................ 68

5.2 Recommendation .............................................................................................................................. 70

REFERENCES ................................................................................................................. 71

APPENDIX ....................................................................................................................... 76

Appendix 1 Sample Codes ............................................................................................................................ 76

Appendix 2 Sample Outputs ........................................................................................................................ 79

LIST OF ABBREVIATIONS

BDU Bahir Dar University


BiT Bahir Dar Institute of Technology
Chw Character window
LIBSVM The library for support vector machines
NB Naïve Bayes
NLP Natural Language Processing
NLTK Natural Language Toolkit
POS Part of Speech
SVM Support Vector Machine

LIST OF FIGURES

Figure 2.1 An illustration of the rank order method (Adapted from Cavnar & Trenkle,
1994) ................................................................................................................................... 14
Figure 2.2: Support vector machine concept ...................................................................... 16
Figure 2.3 Decision tree: play tennis example (Mitchell, 1997) ........................................ 18
Figure 2.4 Neural network architecture .............................................................................. 19
Figure 3.1 Steps to identify a particular language. ............................................................. 44
Figure 3.2 Naïve Bayes and SVM models applied to text languages in this study ............ 48
Figure 3.3 Dictionary Method applied to text languages in this study............................... 51
Figure 4.1 Evaluation (Accuracy) on test data for Naïve Bayes Classifier (n=3) .............. 53
Figure 4.2 Evaluation (Accuracy) on test data for SVM Classifier (n=3) ......................... 55
Figure 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language ...................................................... 56
Figure 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language ................................................................................ 57
Figure 4.5 Evaluation (Accuracy) on test data for Dictionary Method .............................. 59
Figure 4.6 Classification Results for Afar Language ......................................................... 63
Figure 4.7 Classification Results for Amharic Language .................................................. 63
Figure 4.8 Classification Results for Nuer Language ........................................................ 64
Figure 4.9 Classification Results for Oromo Language ..................................................... 64
Figure 4.10 Classification Results for Sidamo Language .................................................. 64
Figure 4.11 Classification Results for Somali Language ................................................... 64
Figure 4.12 Classification Results for Tigrigna Language ................................................. 65
Figure 4.13 Evaluation (Accuracy) on test data for Multilingual Identification ................ 67

LIST OF TABLES

Table 3.1 Corpora: Facts and Figures - Number of Words/Tokens and Types.................. 36
Table 3.2 Corpus size for training (90%) and testing (10%) the models ........................... 37
Table 3.3 Data statistics for training and testing the models with homogeneously
distributed training data per language ................................................................................ 38
Table 3.4 Word statistics of each language to generate test characters window. Average
word lengths and number of words per character window are indicated. .......................... 39
Table 3.5 Number of test documents for the three characters windows ............................ 39
Table 3.6 Number of test documents for Multilingual Identification ................................ 40
Table 3.7 Dictionary size (number of stopwords) for the Dictionary Method ................... 50
Table 4.1 Evaluation on test data for Naïve Bayes Classifier (n=3) .................................. 53
Table 4.2 Evaluation on test data for SVM Classifier (n=3) .............................................. 54
Table 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language ...................................................... 56
Table 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language ................................................................................ 57
Table 4.5 Evaluation on test data for Dictionary Method .................................................. 58
Table 4.6 Confusion matrix of Naïve Bayes Classifier (n=3) for 15-chw ......................... 60
Table 4.7 Error rates for languages calculated from confusion matrices in Table 4.6 ....... 60
Table 4.8 Confusion matrix of SVM Classifier (n=3) for 15-chw ..................................... 60
Table 4.9 Error rates for languages calculated from confusion matrices in Table 4.8 ....... 61
Table 4.10 Confusion matrix of Dictionary Method for 15-chw ....................................... 61
Table 4.11 Error rates for languages calculated from confusion matrices in Table 4.10 ... 61
Table 4.12 Evaluation on test data for Multilingual Identification .................................... 66

1. INTRODUCTION

1.1 Background

Ethiopia is a diverse country with various cultural and traditional differences. According
to Ethnologue (2017), there are ninety individual languages spoken in Ethiopia. Most
people in the country speak Afro-Asiatic languages of the Cushitic or Semitic branches.
The Cushitic Languages are mostly spoken in central, southern and eastern Ethiopia
(mainly in Afar, Oromia and Somali regions). The Semitic Languages are spoken in
northern, central and eastern Ethiopia (mainly in Tigray, Amhara, Harar and northern part
of the Southern Peoples' State regions). The Semitic languages use the Ge'ez script, which is unique to the
country. The Omotic Languages are predominantly spoken between the Lakes of
southern Rift Valley and the Omo River. The Nilo-Saharan Languages are largely spoken
in the western part of the country along the border with Sudan (mainly in Gambella and
Benshangul regions) (Ethiopian Languages, 2008).

These languages use either Geez/Amharic or Latin transcription to represent the
languages in textual form, and textual data in some of them are becoming increasingly
available on the global network. Amharic and Tigrigna use the Geez/Amharic script, while
Somali, Afaan Oromo, Sidama, Afar and Nuer use Latin script (Desta, 2014).

Language identification is the task of automatically detecting the language(s) present in a


document based on the content of the document (Lui & Baldwin, 2011). Automatic
language identification is an integral part in many monolingual and multilingual language
processing systems. Language identification can be divided into two classes: spoken and
written language identification. Spoken language identification methods make use of
signal processing techniques, whereas language identification from text is a symbolic
processing task.

The problem has long been researched in both the text domain and the speech domain
(House & Neuburg, 1977). Computational methods can be applied to determine a
document’s language before undertaking further processing. State-of-the-art methods of
language identification for most European languages present satisfactory results above
95% accuracy (Martins & Silva, 2005).

Research on language identification has seen a variety of approaches. The major


approaches include: detection based on stopword usage, detection based on character
n-gram frequency, detection based on machine learning (ML), and hybrid methods. Many
standard machine learning techniques have been applied to automated text categorization
problems, such as Naïve Bayes classifiers, support vector machines, n-gram frequency
rank order, and neural network classifiers (Peng, Schuurmans, & Wang, 2003).
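As a concrete illustration of one of these techniques, a character-trigram Naïve Bayes classifier can be sketched as follows. This is a minimal toy with add-one smoothing and an implicit uniform class prior, not the implementation used in this thesis, and the two training samples are invented stand-ins:

```python
import math
from collections import Counter, defaultdict

def trigrams(text):
    """Overlapping character 3-grams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

class TrigramNaiveBayes:
    """Minimal multinomial Naive Bayes over character trigrams
    (add-one smoothing, uniform class prior)."""

    def train(self, labeled_texts):
        self.counts = defaultdict(Counter)  # per-language trigram counts
        self.vocab = set()
        for text, lang in labeled_texts:
            grams = trigrams(text)
            self.counts[lang].update(grams)
            self.vocab.update(grams)

    def classify(self, text):
        def log_prob(lang):
            total = sum(self.counts[lang].values())
            v = len(self.vocab)
            # Sum of smoothed log-likelihoods of the text's trigrams
            return sum(math.log((self.counts[lang][g] + 1) / (total + v))
                       for g in trigrams(text))
        return max(self.counts, key=log_prob)

clf = TrigramNaiveBayes()
clf.train([("akkam jirta nagaa", "oromo"),        # illustrative Oromo-like sample
           ("selam dehna neh", "amharic-latin")])  # illustrative transliterated sample
print(clf.classify("akkam"))  # → oromo
```

Even with these tiny samples, the trigrams of the test string overlap almost entirely with one language profile, which is the effect the classifiers in this study exploit at scale.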

The main challenges in language identification include:

• Improving the coverage of language identification systems by increasing the
number of languages that systems are able to recognize;
• Improving the robustness of language identification systems by training them on
multiple domains and various text types;
• Handling non-standard texts (e.g. multilingual texts, computer-mediated
communication content, code-switching); and
• Discriminating between very similar languages, varieties and dialects.

1.2 Problem statement

Digital documents grow larger every day, and dealing with large amounts
of textual data from an external source or document becomes a problem when many
languages are involved. Because users may wish to work with information in potentially any
language, systems for language identification that provide identifiers for all languages
are necessary. Information is usually provided in a context that eliminates
ambiguities; however, if the language is not recognized, further processing stops. In this
case, knowing the name of the language would help.

Large quantities of text are processed automatically for tasks such as spelling and
grammar checking, information retrieval, search engines, language translation and text
mining. In a multilingual environment, language processing is often initiated with some
form of language identification. A document processing system needs to determine the
language of textual contents before topic identification, translation, stemming or search

and a text-to-speech system needs to determine the source language of short pieces of text
to determine pronunciation rules, prosodic models and phrasing strategies.

Although research on language identification has seen a variety of approaches, there is


still not a general understanding of the factors that determine identification accuracy. The
size of the textual fragment to identify and the amount and variety of training data
available can affect identification accuracy. In the major world
languages, corpora measured in millions of words are not uncommon, but in most of the
languages of the world, algorithm development has to proceed from much less text. The
domain from which the training text is extracted is potentially important in that context
since limited training data will lack the full variety of a language and is therefore
expected to generalize less successfully to new domains.

Language identification exploits the fact that human languages have structure.
For example, in English it is very common for the letter 'u' to follow the letter 'q', while
this is not the case in Amharic or other Ethiopian languages. Certain combinations of
letters are more likely in some languages than in others, and n-grams work by capturing this
structure.
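The structure described above is exactly what a character n-gram profile captures. A minimal sketch (the sample sentence is invented for illustration):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, padding the text with one space on each side."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# In this English sample, trigrams containing 'qu' are frequent;
# they would be rare or absent in an Amharic or Oromo profile.
profile = Counter(char_ngrams("the quick brown fox questions the quiet queue"))
print(profile[" qu"])  # → 4
```

Comparing such frequency profiles against per-language profiles built from training data is the basis of the n-gram approaches reviewed in Chapter 2.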

There are a large number of languages used in Ethiopia, with speakers ranging from
thousands to millions, that are considered major languages. In most cases, frequent
code switching and code mixing are also observed. If we could segment multilingual
documents language-wise, it would be very useful both for exploration of linguistic
phenomena, such as code-switching and code mixing, and for computational processing
of each segment appropriately.

Distinguishing languages belonging to the same phylogenetic families is much harder


than identifying languages that fall outside such families. The 7 languages of Ethiopia
taken in this research come from different linguistic classifications, namely Cushitic
(consisting of Afar, Oromo, Sidamo and Somali) and Semitic (consisting of Amharic and
Tigrigna) within the Afro-Asiatic family, and Nilotic (consisting of Nuer) within

the Nilo-Saharan linguistic classification. These languages were chosen because they are
spoken by more than 79.3% of the total population of Ethiopia.

Advances in language identification have been made but are still very limited and
available for only a small number of languages. Although it is unpublished, the only other
research (to my knowledge) that includes the Ethiopian languages in a text-based
language identification task was done by (Desta, 2014). Although he reported a
substantially higher performance rate for four Cushitic languages of Ethiopia, his
research did not consider factors that can determine identification accuracy such as the
amount and variety of training data. He didn’t investigate the effect that homogeneously
distributed training data (equal amounts of training data for each language class) has on a
classifier when compared to a training set with heterogeneously distributed training data.
Moreover, his research is restricted to four Cushitic languages and didn’t include
languages from other linguistic classifications. Furthermore, his research didn't consider
multilingual identification for mixed-language input. Thus, text-based language
identification for the Ethiopian languages is still open for research.

The aim of this research is to investigate text-based language identification and to explore
factors that determine how accurately Ethiopian languages can be identified based on
written text in their respective writing system.

The research is intended to answer the following research questions.

• Which techniques are most successful for language identification of major


Ethiopian languages based on the languages’ characteristics?
• What are the factors that determine language identification accuracy?
• To what extent do the proposed language identifiers perform for local
languages?

1.3 Objective of the study

General objective: The general objective of this study is to propose text-based language
identifier techniques for the seven major Ethiopian languages.

Specific objectives:

• Review existing language identification techniques.


• Collect corpus of major Ethiopian languages.
• Preprocess language data and conduct experiments.
• Design and develop a prototype.
• Implement the techniques and evaluate identification performance.
• Identify and recommend future research directions for further investigation on
language identification.

1.4 Scope and limitation of the study

The main intent of the study is to explore existing text language identification techniques
and compare their performances by considering factors that determine identification
accuracy on seven major Ethiopian languages. To achieve the objectives of the study,
machine learning methods (Naïve Bayes and support vector machine) that use character
n-gram of size 3 as a feature set and dictionaries that use stopwords are used. Documents
written in 7 languages are compiled and processed from various sources resulting in
slightly over 11.7 million tokens. The research includes both monolingual and
multilingual language identification.

Although a large number of classifiers have been applied to text-based language
identification, experimenting with all possible combinations was not feasible. A purely
linguistic approach is undeniably the better choice for achieving high classification
accuracy, but it requires a large amount of linguistic expertise. Therefore, using statistical
approaches is a feasible alternative.
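Of the methods in scope, the dictionary method is the simplest to sketch. The stopword lists below are short illustrative stand-ins, not the dictionaries built in this study:

```python
# Illustrative stopword dictionaries; the real lists are far larger.
STOPWORDS = {
    "amharic-latin": {"ena", "new", "gin", "sile", "wede"},  # hypothetical transliterations
    "oromo": {"fi", "kan", "akka", "irraa", "keessa"},
    "somali": {"iyo", "oo", "ka", "waa", "uu"},
}

def identify_by_dictionary(text, stopwords=STOPWORDS):
    """Score each language by how many of its stopwords appear in the text;
    return the language with the highest overlap."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in stopwords.items()}
    return max(scores, key=scores.get)

# Invented Oromo-like input containing the stopwords "akka", "kan" and "fi":
print(identify_by_dictionary("kitaabni kun akka gaariitti barreeffame kan hundaa fi"))
```

Because stopwords are frequent in any running text, even a small dictionary separates unrelated languages well, though it degrades on very short inputs, a difficulty noted in the abstract.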

Although this research aimed to include all the languages of Ethiopia, the data used
in this study were collected from only seven languages. These seven languages were
chosen because they are spoken by more than 79.3% of the total population and are
considered as major languages in Ethiopia.

1.5 Significance of the study

Language identification of text is important as large quantities of text are processed or


filtered automatically for tasks such as information retrieval or machine translation. The
research focuses on the use of different statistical methods for the identification of the
major Ethiopian languages and it is very important since it presents a method for
accurately identifying the language of a text.

Language identification is of special significance for a multilingual country like


Ethiopia since there are a large number of languages from different linguistic
classifications. Segmenting multilingual documents language-wise would be very useful
both for exploration of linguistic phenomena, such as code-switching and code mixing,
and for computational processing of each segment appropriately.

Language identification methods are vital to many NLP and information retrieval
applications. They are necessary, for example, to aid document collection creation in
scenarios where the languages of documents are not known. A clear example is
documents crawled from the Internet, which are often unlabeled regarding their language,
so language identification methods are applied to identify the language of each
document (Ljubesic, Fiser, & Erjavec, 2014). Machine translation (MT) also benefits
from language identification methods, because to translate a document to a target
language, it is first necessary to determine its source language. These features are present
in different translation tools and web-browsers such as Google Chrome. Finally, corpus
creation for low-resource languages is another area that uses language identification
methods (Emerson, Tan, Fertmann, Palmer, & Regneri, 2014).

Identification of language from a given text is therefore an important problem in the
Ethiopian context and the study could also be used as a baseline for further studies on
language identification to be considered in Ethiopia.

1.6 Methodology of the study

In order to achieve the specific and general objectives of the study and answer the
research questions, the following research methods are used.

1.7 Research design

A research design includes the structure of a study and the strategies for conducting that
study (Kerlinger, 1973). For the purpose of conducting this research, experimental
research design is used. Experimental research design is a way to carefully plan
experiments in advance so that results are both objective and valid (Stephanie, 2018).

The experimentation conducted in this study is separated into two experiments, with each
experiment differing with regard to the amount of data used for training.

The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. The second experiment consists of varying the size of the training data
used to train the classifiers to simulate scarce data availability. Furthermore, this
experiment investigates the effect that homogeneously distributed training data (equal
amounts of training data for each language class) has on a classifier when compared to a
training set with heterogeneously distributed training data. The results of the experiments
are analyzed individually at the end of each experiment and an overall analysis is done to
summarize the observations.

Literature review

To understand proper approaches to the problem-solving process, literature related to
language identification considered relevant for this work is investigated. Related
works from different sources (books, journals, the Internet, etc.)

are reviewed to understand language identification, its development tools, techniques,
procedures and methodologies.

Preparation of the data

A text corpus of major Ethiopian languages is collected from sources such as newspapers,
TV/radio news, religious books (Bible, Quran), academic books, and language
dictionaries to ensure the corpus spans various domains. Documents written in 7
languages are compiled and processed from various sources, resulting in slightly over
11.7 million tokens. The WebCorp tool, a suite of tools that allows access to the World
Wide Web as a corpus, is used to collect text from different websites. WebCorp
was created and is operated and maintained by the Research and Development Unit for
English Studies (RDUES) in the School of English at Birmingham City University
(WebCorp, 2018). Different data preprocessing techniques are applied for processing the
data used by the algorithms chosen for the research. Python scripts are used for data
cleaning and preprocessing (Python, 2018). After data preprocessing, the classifiers are
trained using 90% of the collected corpus and tested using 10% of the collected corpus
that are generated randomly from the collected corpus. This data partitioning is chosen
because it has been suggested in several literatures in the field of Machine learning/
Pattern recognition. Moreover, a series of runs with different amounts of training data
was performed and no performance enhancement was achieved by increasing further the
size of the training corpus. Quantitative detail about the corpora can be found in section
3.2.

Model evaluation

The classifiers and the ranking methods are implemented in Python and use functions
available at the Natural Language Toolkit (NLTK). Classification techniques that are
used in this research are evaluated using a test dataset based on their identification
accuracy. Accuracy and error rate are calculated to evaluate the performance of the
models. Details about evaluation and computational techniques can be found in section
3.3.

1.8 Structure of the thesis

The thesis is organized into 5 chapters. The first chapter covers introduction, background,
problem statement, objective of the study, scope and limitation of the study, significance
of the study, and research methods used. The second chapter covers literature review,
overview of major Ethiopian languages, approaches to language identification, evaluation
of language identification methods, and related works. The third chapter covers the
methodology and the techniques followed in this research. It discusses data preparation
for the experiment and models that are implemented. Chapter four discusses the
experiment and the results. The last chapter presents conclusions and recommendations.

2 LITERATURE REVIEW

This chapter discusses fundamental concepts of language identification and ideas associated with the major Ethiopian languages. Language identification methods are presented to give a clear overview of the topic. The main target of this study is to design and develop a language identification model by exploring existing language identification techniques for the major Ethiopian languages.

2.1 Language Identification

2.1.1 Overview

Automatic language identification can be broadly defined as the task of automatically identifying the language(s) contained in a given document. This task is a well-known research topic in computational linguistics and its origins can be traced back to the work of Ingle, (1980).

There are a number of situations in which the source language of a document is unknown
and computational methods can be applied to determine it. This makes language
identification a relevant task that can be integrated with most NLP applications such as
machine translation (for cases in which knowing the source language or language variety
of a document is vital for further processing) or information retrieval.

As discussed by Lui, (2014) language identification systems are typically divided into four main steps. Given a set of documents written in different languages, the system will implement the following:

Data representation: select a text representation (e.g. characters, words, or a combination of the two);

Language modelling: calculate or derive a model from documents known to be written in each language;

Classification function: define a function that best represents the similarity between a document and each language model;

Prediction or output: compute the highest-scoring model to determine the language of the document.
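The four steps above can be sketched as a minimal character-trigram pipeline. This is an illustrative toy, not any of the exact systems described in this chapter; the frequency-sum scoring function and the idea of training on raw sample strings are assumptions made for demonstration only.

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Step 1 - data representation: adjacent character n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    # Step 2 - language modelling: relative n-gram frequencies per language
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items()}
    return models

def identify(models, text):
    # Step 3 - classification function: sum of each model's frequencies
    # over the text's n-grams; Step 4 - prediction: highest-scoring language
    grams = char_ngrams(text)
    scores = {lang: sum(model.get(g, 0.0) for g in grams)
              for lang, model in models.items()}
    return max(scores, key=scores.get)
```

In practice the language models are built from large corpora rather than single sentences, but the control flow of the four steps is the same.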

State-of-the-art methods for language identification apply n-gram-based language models at the character or word level to distinguish a set of languages automatically. The most successful approaches model the task as a supervised single-label classification problem (one label for each document). The average multi-class accuracy obtained by these methods is usually over 95% (Lui & Baldwin, 2011).

As stated in Palmer, (2010) it is very common for language identification methods to perform perfectly when distinguishing between languages which are typologically not closely related, as well as when recognizing languages with unique character sets. The difficulty therefore lies in discriminating between similar languages and languages that use similar character sets.

For languages with a unique alphabet not used by any other languages, language
identification is determined by character set identification. Similarly, character set
identification can be used to narrow the task of language identification to a smaller
number of languages that all share many characters such as Arabic vs. Persian, Russian
vs. Ukrainian, or Norwegian vs. Swedish (Palmer, 2010).

At this point it is not difficult to recognize a number of challenges faced by the language
identification systems. One of them is the identification of closely related languages that
share similar character sequences and lexical units (e.g. Amharic and Tigrigna). Another
challenge faced by language identification systems is identifying the language of short
excerpts from texts particularly those containing non-standard language. Systems have
difficulty when confronted with small excerpts from texts (a few words) that do not
provide enough data for algorithms to classify them correctly. This difficulty is even
more evident for texts available on the Internet, such as tweets, because these are often noisy and contain non-standard spelling and/or code alternation (Nguyen & Dogruoz,
2013).

2.1.2 Approaches to Language Identification

Language identification is an important problem in the field of Natural Language Processing. With the current spread of texts in digital form, texts are available in a number of languages other than English. The automatic treatment of these texts, for any purpose requiring natural language processing such as indexing or querying, necessitates first identifying their language. This may seem an elementary and simple issue for humans in the real world, but it is difficult for a machine, primarily because different scripts are made up of different shaped patterns producing different character sets.

The problem of automatic language identification is not new, and research on language identification has seen a variety of approaches. The major approaches include detection based on stopword usage, detection based on character n-gram frequency, detection based on machine learning (ML), and hybrid methods. Many standard machine learning techniques have been applied to automated text categorization problems, such as Naïve Bayes classifiers, support vector machines, n-gram frequency rank order, and neural network classifiers (Peng, Schuurmans, & Wang, 2003).

2.1.2.1 Detection based on N-gram

An n-gram is a sequence of n consecutive letters. The n-grams of a string are gathered by extracting adjacent groups of n letters (Cavnar & Trenkle, 1994). The n-gram combinations for the string "example" are shown below.

bi−grams: ex xa am mp pl le

tri−grams: exa xam amp mpl ple

quad−grams: exam xamp ampl mple
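As a sketch, a one-line Python helper reproduces the combinations above:

```python
def ngrams(text, n):
    """Extract the adjacent n-letter groups of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("example", 2))  # ['ex', 'xa', 'am', 'mp', 'pl', 'le']
print(ngrams("example", 3))  # ['exa', 'xam', 'amp', 'mpl', 'ple']
print(ngrams("example", 4))  # ['exam', 'xamp', 'ampl', 'mple']
```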

N-grams are simple to compute for any given text, and in many language identification and related tasks involving several languages they have been shown to perform well. They automatically capture the roots of the most frequent words, operate independently of the language, and are tolerant of spelling errors and distortions caused by optical scanners. They also do not require the stopword removal or stemming that improves the performance of word-based systems.

N-gram Rank Ordering

The most widely used ranking method for language identification and text categorization
is the one by Cavnar & Trenkle, (1994). This method generates language specific profiles
which contain most frequent character n-grams of the training corpus sorted by their
frequency. Figure 2.1 illustrates the ranking method.

In this approach, n-gram combinations in the training and testing sets are ordered from
most to least frequent. Cavnar and Trenkle build n-gram frequency “profiles” for several
languages and classify text by matching it to the profiles.

We can create a similar profile from the text to be classified and then calculate a cumulative n-gram rank difference between the text profile and each language profile, which measures how far out of place each n-gram in one profile is from its place in the other profile. If an n-gram in the text profile is absent from the language profile, the maximum distance is assigned, equal to the number of n-grams in the language profile. To identify the language of the test text, the sum of all the rank distances is calculated, and the most probable language is the one with the smallest distance.
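A sketch of this out-of-place measure follows; building the frequency-ordered profiles from raw counts is omitted, and the two example profiles are the ones shown in Figure 2.1.

```python
def out_of_place(lang_profile, text_profile):
    """Cumulative rank distance between two frequency-ordered
    n-gram profiles (most frequent n-gram first)."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_dist = len(lang_profile)  # penalty when an n-gram is absent
    total = 0
    for i, gram in enumerate(text_profile):
        total += abs(rank[gram] - i) if gram in rank else max_dist
    return total

# Profiles from Figure 2.1: distances are 0 + 3 + 0 + 2 + 1 + max(6) = 12
lang = ["TH", "ER", "ON", "LE", "ING", "AND"]
text = ["TH", "ING", "ON", "ER", "AND", "ED"]
print(out_of_place(lang, text))  # 12
```

To classify, this distance would be computed against every language profile and the language with the smallest total chosen.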

language profile    text profile    out-of-place measure
(most frequent)
TH                  TH              0
ER                  ING             3
ON                  ON              0
LE                  ER              2
ING                 AND             1
AND                 ED              no match = max
...                 ...
(least frequent)                    sum = total distance

Figure 2.1 An illustration of the rank order method (Adopted from Cavnar & Trenkle, 1994)

2.1.2.2 Detection based on machine learning techniques

Naive Bayesian Classifier

The naïve Bayesian classifier is used to classify a document D into one of a set of predefined categories (languages) C = {c1, c2, ..., cn} (Peng & Schuurmans, 2003). The classifier uses Bayes' theorem, shown in Equation 2.1.

P(cj|D) = P(D|cj) · P(cj) / P(D)        Equation 2.1

P(cj|D) = probability that document D belongs to language cj
P(D|cj) = probability of generating document D given language cj
P(cj) = probability of occurrence of language cj
P(D) = probability of document D occurring

A document is represented by a vector D = (f1, f2, ..., fm) of m features, which are words or n-grams with their internal frequencies. The computation of P(D|cj) can be simplified with the additional assumption that each feature is conditionally independent of the other features given the language. This assumption is embodied in the naïve Bayesian classifier by Good, (1965), and Equation 2.1 reduces to Equation 2.2.

P(cj|D) = P(cj) · ∏_{i=1}^{m} P(fi|cj) / P(D)        Equation 2.2

To find the most probable language of the document, a maximum a posteriori (MAP) classifier cMAP is constructed in Equation 2.3; it maximizes the posterior P(cj|D). In Equation 2.5, P(D) is eliminated as it is constant for all languages. The probability of occurrence of each language, P(c), is assumed equal and is also excluded. Therefore, cMAP becomes equal to the maximum likelihood classifier shown in Equation 2.6.

cMAP = argmax_{c∈C} {P(c|D)}        Equation 2.3

     = argmax_{c∈C} {P(c) · ∏_{i=1}^{m} P(fi|c) / P(D)}        Equation 2.4

     = argmax_{c∈C} {P(c) · ∏_{i=1}^{m} P(fi|c)}        Equation 2.5

     = argmax_{c∈C} {∏_{i=1}^{m} P(fi|c)}        Equation 2.6

As a result, for this classifier, a list of N-grams with possible duplicates is generated and
their internal frequencies, obtained from the training data, are multiplied. The language
with the maximal result is chosen as the language of the document.
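A minimal naïve Bayes identifier over character n-grams might look like the sketch below. Two details are assumptions not present in Equations 2.1-2.6: add-one smoothing (to avoid zero probabilities for unseen n-grams) and working in log space (summing logs instead of multiplying frequencies, to prevent numeric underflow); the sample texts are invented for demonstration.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_nb(samples, n=3):
    """samples maps each language to its training text;
    returns per-language n-gram counts."""
    return {lang: Counter(ngrams(text, n)) for lang, text in samples.items()}

def classify_nb(models, text, n=3):
    # maximum-likelihood decision of Equation 2.6, computed as a sum of logs
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total, vocab = sum(counts.values()), len(counts)
        score = sum(math.log((counts[g] + 1) / (total + vocab))  # add-one smoothing
                    for g in ngrams(text, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```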

Support Vector Machine

Support vector machines are able to handle large feature spaces. This is particularly helpful when classifying texts, because algorithms often have to deal with a very large number of features. The learning algorithm finds a linear classification boundary (hyperplane) that separates the training examples and maximizes the margin to the closest training examples with respect to their class labels (Kruengkrai, Srichaivattana, Sorlertlamvanich, & Isahara, 2005).

A support vector machine (SVM) (Cristianini & Shawe-Taylor, 2000) is a two-class classifier constructed from sums of a kernel function K(·,·):

f(x) = ∑_{i=1}^{N} αi ti K(x, xi) + b        Equation 2.7

where the ti are the target values, ∑_{i=1}^{N} αi ti = 0, and αi > 0. The vectors xi are support vectors obtained from the training set by an optimization process (Collobert & Bengio, 2001). The target values are either 1 or -1, depending upon whether the corresponding support vector is in class 0 or class 1. For classification, a class decision is based upon whether the value f(x) is above or below a threshold.

The kernel K(·,·) is constrained to have certain properties (the Mercer condition), so that it can be expressed as

K(x, y) = b(x)ᵗ b(y)        Equation 2.8

where b(x) is a mapping from the input space (where x lives) to a possibly infinite-dimensional space.

The optimization condition relies upon a maximum margin concept; see Figure 2.2. For a separable data set, the system places a hyperplane in a high-dimensional space so that the hyperplane has maximum margin. The data points from the training set lying on the margin boundaries are the support vectors in Equation 2.7. The focus of the SVM training process is therefore to model the boundary, as opposed to a traditional Gaussian mixture model, which would model the probability distribution of the documents (a generative approach).

Figure 2.2: Support vector machine concept. The separating hyperplane divides class 0 (f(x) > 0) from class 1 (f(x) < 0) with maximum margin.
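The decision rule of Equation 2.7 can be sketched directly. The support vectors, multipliers αi and bias below are made-up illustrative values; in practice they come from the margin-maximizing optimization process, which is omitted here.

```python
def linear_kernel(x, y):
    # the simplest Mercer kernel: an ordinary dot product
    return sum(a * b for a, b in zip(x, y))

def svm_classify(x, support_vectors, targets, alphas, bias,
                 kernel=linear_kernel):
    """Evaluate f(x) of Equation 2.7 and threshold it at zero."""
    fx = sum(a * t * kernel(x, sv)
             for a, t, sv in zip(alphas, targets, support_vectors)) + bias
    return 0 if fx > 0 else 1  # class 0 above the hyperplane, class 1 below

# two hypothetical support vectors on either side of the boundary
svs, ts, als = [(1.0, 1.0), (-1.0, -1.0)], [1, -1], [0.5, 0.5]
print(svm_classify((2.0, 2.0), svs, ts, als, 0.0))    # 0
print(svm_classify((-2.0, -2.0), svs, ts, als, 0.0))  # 1
```

Swapping `linear_kernel` for a non-linear kernel changes the shape of the boundary without changing this decision rule.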

Decision Trees

A decision tree is a machine learning technique that is used to solve sequential decision-
making problems. It has been shown to be effective in classification problems across
domains. One of the attractive features of decision trees is the ease at which one can
visualise the decision-making process (Ishwaran & Rao, 2009). Another attractive quality
of decision trees is their effectiveness in utilizing contextual information (Hakkinen &
Tian, 2001).

Decision trees are made up of nodes. The initial node of the decision tree is known as the root node. Internal nodes are nodes where decisions are made: the data at such a node is partitioned into children nodes, which can in turn be either internal nodes or leaf nodes. Leaf nodes have no children, and the final classification for an example is given at these nodes (Kingsford & Salzberg, 2008).

A decision tree classifier follows a two-phase approach. The first phase is the training phase, where a set of labelled examples is used to train a decision tree model. The second phase is the classification phase, where the model created can be used to predict the classification of an unlabelled example. The class given is determined by the features the example possesses. The decision tree is traversed by answering a series of questions, starting at the root node and filtering down to a leaf node where the final classification is given (Kingsford & Salzberg, 2008; Cavnar & Trenkle, 1994).

The famous ‘play tennis given the weather forecast’ decision tree is a very good example of how conditions are tested in a decision tree. There are two labels to be attributed to each instance, yes or no, given a set of conditions determined by three attributes: humidity, outlook, and wind. In this example (see Figure 2.3) each attribute has two or three values. Conditions are tested to determine whether, given the weather forecast, it is best to play tennis or not (e.g. if the outlook is sunny and the humidity normal, play tennis; if the outlook is rain and the wind is strong, don't play tennis).

Figure 2.3 Decision tree: play tennis example (Mitchell, 1997). The root node tests Outlook (Sunny / Overcast / Rain): the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch leads directly to Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).
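Traversing this tree is just a chain of conditionals; a direct transcription of Figure 2.3 as a sketch, with lower-case string attribute values assumed:

```python
def play_tennis(outlook, humidity, wind):
    """Walk the play-tennis tree of Figure 2.3 from root to a leaf."""
    if outlook == "sunny":
        return humidity == "normal"   # high humidity -> No
    if outlook == "overcast":
        return True                   # always play
    if outlook == "rain":
        return wind == "weak"         # strong wind -> No
    raise ValueError("unknown outlook value")

print(play_tennis("sunny", "normal", "strong"))  # True
print(play_tennis("rain", "high", "strong"))     # False
```

A learned tree has the same structure; training algorithms merely choose which attribute to test at each node.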

Neural Networks

Neural networks have been in use for some time, and their design and implementation for standard situations is well known (Haykin, 1994). Most of the work using neural networks aims at the classification of objects. In this case, the network works as a hypothesis function hΘ(X) that, based on a set of matrices Θ previously computed on a training set, is able to classify an object based on a set of features extracted from that object. In the case of language identification, the training set is a set of texts manually classified into one language each, together with an algorithm to extract features from them. These features are then fed into the neural network training algorithm, which computes the set of matrices Θ.

These matrices are then used to identify the language of new texts. For that, the features
X are extracted from the text to be classified, and the hypothesis function is called. The
resulting vector will include the probabilities of that text being identified as each one of
the trained languages.

Following the common definition (Haykin, 1994), a neural network works as follows:

A neural network is composed of a set of L layers, each one composed of a set of processing units. A processing unit is denoted by ai(l), where l is the layer to which it belongs and i its order. All units from a specific layer are connected to all units from the next layer. This connection is controlled by a matrix Θ(l), for each layer l.

The first layer is known as the input layer. It has the same number of units as there are features to be analyzed. Whenever the network hypothesis function is evaluated, each cell ai(1) is filled in with the values obtained from the feature observations. The next layer, ai(2), is computed using the previous layer and the matrix Θ(1). This process is done for every layer l ≤ L. The layer L is known as the output layer. There are as many units in this layer as the number of classes K into which the network will classify objects. Therefore, if the network is trained to detect 25 languages, then there are 25 units in the output layer. Each unit in the output layer will, optimally, get a value that is either 1 or 0, meaning that the object is, or is not, in the respective class. In practice, the result is a value in this range that represents the probability of the object being of that specific class.

The other layers, 1 < l < L, are known as the hidden layers. There can be as many hidden layers as one might want, but there is at least one. Adding new layers may make the network return better results, but it will take more time to train the network and more time to run the network hypothesis function.

Figure 2.4 Neural network architecture

The implementation of the neural network was based on the logistic function defined by g(z). This function's range is [0, 1], and its result value can be considered a probability measure. The logistic function is defined as:

g(z) = 1 / (1 + exp(-z))        Equation 2.9

The neural network hypothesis function hΘ(X) is defined by two matrices, Θ(1) and Θ(2). These matrices of weights are used to compute the network. The input values, obtained from the computed features, are stored in the vector X. This vector is multiplied by the first weight matrix, and the logistic function is applied to each value of the resulting vector. The resulting vector is denoted a(2) and corresponds to the values of the second layer of the network (the hidden layer). It is then possible to multiply the a(2) vector by the weights of Θ(2) and, after applying the sigmoid function to each element of the resulting multiplication, we obtain a(3). This is the output layer, and each value of this vector corresponds to the probability that the document being analyzed is written in a specific language. This algorithm is known as forward propagation and is defined by:

a(1) = x

for i = 2 to L:

    a(i) = g(Θ(i-1) a(i-1))        Equation 2.10
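As a sketch, Equations 2.9 and 2.10 translate to the following, with plain nested lists standing in for the weight matrices. Bias units, which most real implementations add to every layer, are left out of this toy for brevity, and the example weights are arbitrary.

```python
import math

def g(z):
    """Logistic function of Equation 2.9; its range is (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Forward propagation (Equation 2.10): push feature vector x
    through the weight matrices Theta(1) .. Theta(L-1) and return
    the output layer's activations."""
    a = x
    for theta in thetas:
        a = [g(sum(w * v for w, v in zip(row, a))) for row in theta]
    return a

# one hidden layer of 2 units, output layer of 1 unit (arbitrary weights)
thetas = [[[0.5, -0.5], [1.0, 1.0]],   # Theta(1): 2x2
          [[1.0, -1.0]]]               # Theta(2): 1x2
print(forward(thetas, [1.0, 1.0]))
```

Each entry of the returned vector lies in (0, 1) and can be read as the probability assigned to one output class.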

The main problem behind this implementation is how to obtain the weight values. The usual methodology is to define a cost function and try to minimize it, that is, to find the Θ values for which the hypothesis function has the smallest error on the training set.

The cost function with regularization is defined as:

J(Θ) = -(1/m) ∑_{i=1}^{m} ∑_{k=1}^{K} [ y_k(i) log(hΘ(x(i)))_k + (1 - y_k(i)) log(1 - (hΘ(x(i)))_k) ] + (λ/2m) ∑_{l=1}^{L-1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ_{j,i}(l))²        Equation 2.11

The regularization is controlled by the coefficient λ, which can be used to tweak how much the absolute values of the Θ weights may grow. The minimization of the cost function J(Θ) is computed by an algorithm known as gradient descent. This algorithm uses the partial derivatives of Equation 2.12 to compute the direction toward the function minimum. The algorithm continues iterating until the difference between successive costs is very small, or until a limit on the number of iterations is met.

∂J(Θ) / ∂Θ_{i,j}(l)        Equation 2.12

Gradient descent can be implemented using an algorithm known as backwards propagation, which computes the partial derivatives efficiently (Simões, Almeida, & Byers, 1998).

Dot Product Calculation Based

Let the vector c represent all the n-gram combinations. The dot-product method builds a language vector lj from the n-gram statistics of a document, where lij is the frequency of the n-gram ci. To classify a test string, a vector x is built from its n-gram statistics. The normalized dot-product between a language vector lj and the test string vector x is then computed as (Lipschutz & Lipson, 2009):

x · lj = ∑_{i=1}^{N} xi lij / (|x| |lj|)        Equation 2.13

where N is the total number of n-gram combinations.

The computed measure indicates how close the two vectors are and therefore how close the test string is to the language model. The closer the measurement is to 1, the more similar the vectors. Thus, the language model with the highest measurement is identified as the language of the text segment.
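Equation 2.13 is the cosine similarity between the two frequency vectors. A small sketch, assuming the vectors are already aligned lists over the same n-gram inventory:

```python
import math

def normalized_dot(x, l):
    """Normalized dot product of Equation 2.13; 1.0 means the vectors
    point in the same direction."""
    dot = sum(a * b for a, b in zip(x, l))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in l))
    return dot / norms if norms else 0.0

def best_language(x, language_vectors):
    """Pick the language model closest to the test-string vector x."""
    return max(language_vectors,
               key=lambda lang: normalized_dot(x, language_vectors[lang]))

print(normalized_dot([1.0, 0.0], [2.0, 0.0]))  # 1.0
```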

Using the Relative Entropy

Relative entropy (also known as the Kullback-Leibler distance) is used in information theory to measure the difference between two probability distributions. For a discrete random variable x, and probability distributions p and q, the relative entropy is given by (Kullback & Leibler, 1951):

D_KL(p||q) = ∑_{x∈X} p(x) log( p(x) / q(x) )        Equation 2.14

Reynar, (1996) associated the p distribution with the test set and the q distribution with the training set, where X is the set of samples in distribution p. If an n-gram is not found in the trained model, the equation above is undefined; therefore, a penalty of 0.5 is added for such events.
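A sketch of this measure with the 0.5 penalty for unseen n-grams; the distributions are assumed to be dicts mapping n-grams to probabilities:

```python
import math

def relative_entropy(p, q, penalty=0.5):
    """Kullback-Leibler distance of Equation 2.14 between test
    distribution p and trained distribution q."""
    d = 0.0
    for gram, prob in p.items():
        if gram in q and q[gram] > 0:
            d += prob * math.log(prob / q[gram])
        else:
            d += penalty  # n-gram unseen in the trained model
    return d

# identical distributions have zero distance
print(relative_entropy({"th": 0.5, "he": 0.5}, {"th": 0.5, "he": 0.5}))
```

The candidate language whose training distribution q yields the smallest distance is selected.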

The language profile with the smallest relative entropy is used to assign the language of
the test string. Data was obtained from the European Corpus Initiatives CD-ROM and a
subset of documents in 18 Roman alphabet languages were used for training and testing
purposes. Classification was compared by training the classifier with uni-gram and bi-gram statistics. For a 200-line training set and 1 test line, the average accuracy is 78.6%
for uni-grams and 90.2% for bi-grams. If the number of test lines is increased to 10, the
uni-gram accuracy increases to 98.5% and the bi-gram accuracy to 99.9%. The accuracy
for a 2000-line training set and 1 test line is 81.5% for uni-grams and 94.1% for bi-grams.
For 10 lines the uni-gram accuracy increases to 99.0% and the bi-gram accuracy to
99.9%.

2.1.2.3 Detection based on dictionaries containing stopwords

This method uses dictionaries containing stopwords. Stopwords are very specific to a language; although some languages share some similar words and special characters, these are not all common. Stopwords are therefore usually a good signal for automatic language detection.

The stopword dictionaries are created before applying the algorithm. The algorithm takes the text, finds its most common words, and compares them with the stopwords. The language with the most matching stopwords is selected as the identified language. Concretely, the algorithm counts how many unique stopwords of each language are seen in the analyzed text, stores these counts as ratios in a language-ratio dictionary, and detects the language with the highest ratio.
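The procedure can be sketched as follows; the stopword lists below are tiny illustrative samples, not real dictionaries, and whitespace tokenization is an assumption that would not suit all scripts:

```python
def detect_by_stopwords(text, stopword_dicts):
    """Return the language whose stopword dictionary has the highest
    ratio of entries appearing among the text's unique words."""
    words = set(text.lower().split())
    ratios = {lang: len(words & stops) / len(stops)
              for lang, stops in stopword_dicts.items()}
    return max(ratios, key=ratios.get)

dictionaries = {
    "english": {"the", "and", "of", "is"},
    "german": {"der", "und", "von", "ist"},
}
print(detect_by_stopwords("the cat and the dog", dictionaries))  # english
```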

2.1.2.4 Multilingual Identification

Multilingual documents are documents that contain text in more than one language.
Recent research has investigated how to make use of multilingual documents from
sources such as web crawls (King & Abney, 2013).

One approach to handling multilingual documents is to attempt to segment them into contiguous monolingual segments. In addition to identifying the languages present, this
requires identifying the locations of boundaries in the text which mark the transition from
one language to another. Several methods for supervised language segmentation have
been proposed. Teahan, (2000) proposed a system based on text compression that
identifies multilingual documents by first segmenting the text into monolingual blocks.

Kralisch & Mandl, (2006) detect “language shift” using an eight-word sliding window. Rehurek & Kolkus, (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms, and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages.

Yamaguchi & Tanaka-Ishii, (2012) use a minimum description length approach,
embedding a compressive model to compute the description length of text segments in
each language. They present a linear-time dynamic programming solution to optimize the
location of segment boundaries and language labels.

Lui & Baldwin, (2011) presented a system for language identification in multilingual
documents using a generative mixture model inspired by supervised topic modeling
algorithms, combined with a document representation based on previous research in
language identification for monolingual documents. They showed that their system is

able to accurately estimate the proportion of the document written in each of the
languages identified.

2.1.3 Evaluation of Language Identification Methods

Given a set of evaluation documents, each having a known correct label from a closed set
of labels, and a predicted label for each document from the same set, the document-level
accuracy is the proportion of documents that are correctly labeled over the entire
evaluation collection. This is the most often-reported metric, and conveys the same
information as the error rate, which is simply the proportion of documents that are
incorrectly labeled (i.e. 1 − accuracy).

Authors sometimes provide a per-language breakdown of results. There are two distinct
ways in which results are generally summarized per-language (Powers, 2011):
(1) precision, in which documents are grouped according to their predicted language; and
(2) recall, in which documents are grouped according to what language they are actually
written in.
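As a sketch, these document-level metrics can be computed as follows; the gold and predicted label lists are hypothetical examples:

```python
def accuracy(gold, predicted):
    """Proportion of documents labeled correctly; error rate = 1 - accuracy."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def precision(gold, predicted, lang):
    """Among documents predicted as `lang`, the fraction truly in `lang`."""
    pred = [g for g, p in zip(gold, predicted) if p == lang]
    return sum(g == lang for g in pred) / len(pred) if pred else 0.0

def recall(gold, predicted, lang):
    """Among documents truly in `lang`, the fraction predicted as `lang`."""
    actual = [p for g, p in zip(gold, predicted) if g == lang]
    return sum(p == lang for p in actual) / len(actual) if actual else 0.0

gold = ["am", "am", "ti", "or"]
pred = ["am", "ti", "ti", "or"]
print(accuracy(gold, pred))         # 0.75
print(precision(gold, pred, "ti"))  # 0.5
print(recall(gold, pred, "am"))     # 0.5
```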

In addition to evaluating performance for each individual language, authors have also
sought to convey the relationship between classification errors and specific sets of
languages. Errors in language identification systems are generally not random; rather,
certain sets of languages are much more likely to be confused. For example, Grefenstette,
(1995) found that Norwegian documents had an elevated chance of being misclassified as
Swedish, compared to a range of other European languages.

2.2 Overview of Major Ethiopian Languages

According to Ethnologue, (2017), there are ninety individual languages spoken in Ethiopia. Most people in the country speak Afro-asiatic languages of the Cushitic or
Semitic branches. Of the languages spoken in Ethiopia, 86 are living and 2 are extinct. 41
of the living languages are institutional, 14 are developing, 18 are vigorous, 8 are in
danger of extinction, and 5 are near extinction.

In terms of writing systems, Ethiopia's principal orthography is the Ge'ez script.
Employed as an abugida for several of the country's languages, it first came into usage in
the sixth and fifth centuries BC as an abjad to transcribe the Semitic Ge'ez language.
(Rodolfo, 2003). Ge’ez now serves as the liturgical language of the Ethiopian and the
Eritrean Orthodox Tewahedo churches. Other writing systems have also been used over
the years by different Ethiopian communities. These include Arabic script for writing
some Ethiopian languages spoken by Muslim populations (Pankurst, 1991) and
Sheikh Bakri Sapalo's script for Oromo. (Hayward & Hassan, 1981). Today, many
Cushitic, Omotic, and Nilo-Saharan languages are written in Roman/Latin script.

According to the 2007 Ethiopian census, the largest first languages are: Oromo,
24,930,424 speakers or 33.80% of the total population; Amharic, 21,634,396 speakers or
29.30% of the total population; Somali 4,609,274 speakers or 6.25% of the total
population; Tigrinya, 4,324,476 speakers or 5.86% of the total population; Sidamo,
2,981,471 speakers or 4.84% of the total population; Wolaytta, 1,627,784 speakers or
2.21% of the total population; Gurage, 1,481,783 speakers or 2.01% of the total
population and Afar, 1,281,278 speakers or 1.74% of the total population. (Central
Statistical Agency, 2010).

Widely spoken foreign languages include English (the major foreign language taught in schools), Arabic and Italian (Central Intelligence Agency, n.d.). Amharic is the official
language in which all federal laws are published, and it is spoken by millions of
Ethiopians as a second language. In most regions, it is the primary second language in the
school curriculum.

Amharic is an Afro-Asiatic language of the Semitic branch. It is spoken as a mother tongue by the Amhara in Ethiopia. The language serves as the official working language
of Ethiopia, and is also the official or working language of several of the states within the
federal system (Gebremichael, 2011). It has been the working language of courts,
language of trade and everyday communications, the military, and the Ethiopian
Orthodox Tewahedo Church since the late 12th century and remains the official
language of Ethiopia today (Meyer, 2006; Judith, 2013). Amharic is spoken by 22 million

native speakers in Ethiopia and 15 million secondary speakers in Ethiopia (Central
Statistical Agency, 2010; Meyer, 2006). Additionally, 3 million emigrants outside of
Ethiopia speak the language. Most of the Ethiopian Jewish communities in Ethiopia and
Israel speak Amharic. In Washington DC, Amharic became one of the six non-English
languages in the Language Access Act of 2004, which allows government services and
education in Amharic (Language Access Act Fact Sheet, 2011).

The Amharic script is Ge’ez, and the graphemes of the Amharic writing system are
called fidel (Hudson, 2009). Each character represents a consonant + vowel sequence, but
the basic shape of each character is determined by the consonant, which is modified for
the vowel. Some consonant phonemes are written by more than one series of characters.
This is because these fidel originally represented distinct sounds, but phonological
changes merged them (Hudson, 2009). The citation form for each series is the consonant
+ ä form, i.e. the first column of the fidel. The Amharic script is included in Unicode, and
glyphs are included in fonts available with major operating systems.

Oromo is an Afro-asiatic language. It is the most widely spoken tongue in the family's Cushitic branch. Forms of Oromo are spoken as a first language by more than
24.6 million Oromo people and neighboring peoples in Ethiopia, and by an additional
half million in parts of northern and eastern Kenya (Ethnologue, 2017). It is also spoken
by smaller numbers of emigrants in other African countries such as South
Africa, Libya, Egypt and Sudan. About 85 percent of Oromo speakers live in Ethiopia,
mainly in Oromia Region. In addition, there are also some speakers of the language in Somalia (Ethnologue, 2017). Within Ethiopia, Oromo is the language with the largest
number of native speakers. Within Africa, Oromo is the language with the fourth most
speakers, after Arabic (if one counts the mutually unintelligible spoken forms of Arabic
as a single language and assumes the same for the varieties of Oromo), Swahili,
and Hausa. Besides first language speakers, a number of members of other ethnicities
who are in contact with the Oromo speak it as a second language. See, for example,
the Omotic-speaking Bambassi and the Nilo-Saharan-speaking Kwama in northwestern
Oromiyaa (Ethnologue, 2017). Ethnologue, (2017) divides the Oromo macrolanguage into four languages: Boranaa–Arsii–Gujii Oromo (Southern Oromo, incl. Gabra and

26
Sakuye dialects); Eastern Oromo (Harar); Orma (Munyo, Orma, Waata/Sanye); West–
Central Oromo (Western Oromo and Central Oromo, incl. Mecha/Wollega, Raya,
Wello(Kemise), Tulema/Shewa).

Oromo is written with a Latin alphabet called Qubee which was formally adopted in 1991
(University of Pennsylvania, 1977). Various versions of the Latin-based orthography had
been used previously, mostly by Oromos outside of Ethiopia and by the OLF by the late
1970s (Heine, 1981). With the adoption of Qubee, it is believed more texts were written
in the Oromo language between 1991 and 1997 than in the previous 100 years. In Kenya,
the Borana and Waata also use Roman letters but with different systems. The Sapalo
script was an indigenous Oromo script invented by Sheikh Bakri Sapalo (also known by
his birth name, Abubaker Usman Odaa) in the years following Italian invasion of
Ethiopia, and used underground afterwards (Hassan, 1981). The Arabic script has also
been used intermittently in areas with Muslim populations. The first comprehensive
online Afaan Oromo dictionary was developed by the Jimma Times Oromiffa Group
(JTOG) in cooperation with SelamSoft (Jimma_Times, n.d.). Oromo and Qubee are
currently utilized by the Ethiopian government's state radios, TV stations and regional
government newspaper.

Somali is an Afro-Asiatic language belonging to the Cushitic branch; it is the second
most widely spoken Cushitic language after Oromo. As of 2006, there were approximately
16.6 million speakers of Somali, of which around 8.3 million resided in Somalia, where
the language is spoken by an estimated 95% of the country's inhabitants (Ethnologue,
2017). It is also spoken by a majority of the population in Djibouti. Somali is an
official language of Somalia, a national language in Djibouti, and an official working
language in the Somali Region of Ethiopia (Kizitus Mpoche, 2006). The Somali language
is regulated by the Regional Somali Language
Academy, an intergovernmental institution established in June 2013 in Djibouti City by
the governments of Djibouti, Somalia and Ethiopia. It is officially mandated with

preserving the Somali language (COMESA, n.d.). As of 2013, Somali is also one of the
featured languages available on Google Translate (Google, 2013).

The Somali language is written officially with the Latin alphabet. In 1961 both the Latin
and Osmanya scripts were adopted for use in Somalia, but in 1969 there was a coup, with
one of its stated aims the resolution of the debate over the country's writing system. The
Latin alphabet was finally adopted in 1972 and at the same time Somali was made the
sole official language of Somalia. Shire Jama Ahmed is credited with the invention of
this spelling system, and his system was chosen from among eighteen competing new
orthographies (Omniglot, 2017).

Tigrinya is an Afro-Asiatic language of the Semitic branch. It is mainly spoken in Eritrea
and northern Ethiopia in the Horn of Africa. According to Ethnologue (2017), there are
6,915,000 total Tigrinya speakers. Of these, approximately 4,320,000 inhabit Ethiopia,
with most concentrated in the Tigray region (Ethnologue, 2017).

Tigrinya is written in the Ge'ez script, originally developed for Ge'ez, also called
Ethiopic. The Ge'ez script is an abugida: each symbol represents a consonant + vowel
syllable, and the symbols are organized in groups of similar symbols on the basis of both
the consonant and the vowel (Ethnologue, 2017).

Sidamo is an Afro-Asiatic language, belonging to the Highland East Cushitic branch of
the Cushitic family. It is spoken in parts of southern Ethiopia by the Sidama people,
particularly in the densely populated Sidama Zone. It shares over 64% lexical similarity
with Alaba-K'abeena, 62% with Kambaata, and 53% with Hadiyya, all of which are other
languages spoken in southwestern Ethiopia. The results from a research study conducted
in 1968-1969 concerning mutual intelligibility between different Sidamo languages
suggest that Sidaama is more closely related to the Gedeo language, with which it shares
a border to the south, than to other Sidamo languages (Kawachi, 2007). The two
languages share a lexical similarity of 60% (Raymond, 2005).

In terms of its writing, Sidaama used an Ethiopic script up until 1993, from which point
forward it has used a Latin script (Raymond, 2005).

The Afar language is an Afro-asiatic language, belonging to the family's Cushitic branch.
It is spoken by the Afar people in Djibouti, Eritrea and Ethiopia. It is categorized in
the Lowland East Cushitic sub-group, along with Saho and Somali (Lewis, 1998). The
Afar language is spoken as a mother tongue by the Afar people in Djibouti, Eritrea, and
the Afar Region of Ethiopia. According to Ethnologue, there are 1,379,200 total Afar
speakers. Of these, 1,280,000 were recorded in the 2007 Ethiopian census, with 906,000
monolinguals registered in the 1994 census (Ethnologue, 2017).

In Ethiopia, Afar is written with the Ethiopic or Ge'ez script. Since around 1849,
the Latin script has been used in other areas to transcribe the language (Ethnologue,
2017). Additionally, Afar is also transcribed using the Arabic script. In the early 1970s,
two Afar intellectuals and nationalists, Dimis and Redo, formalized the Afar alphabet.
Known as Qafar Feera, the orthography is based on the Latin script (Omniglot, 2017).
Officials from the Institut des Langues de Djibouti, the Eritrean Ministry of Education,
and the Ethiopian Afar Language Studies and Enrichment Center have since worked with
Afar linguists, authors and community representatives to select a standard orthography
for Afar from among the various existing writing systems used to transcribe the language.

Nuer is a member of the Western Nilotic group of the Nilo-Saharan languages spoken in
southern Sudan and western Ethiopia by about 800,000 people. Nuer is one of eastern and
central Africa's most widely spoken languages, along with the Dinka language. The
language is very similar to the languages of Jieng and Chollo (Gurtong, n.d.;
Ethnologue, 2017).

Nuer is written with a version of the Latin alphabet using an orthography adopted at the
Rejaf Language Conference in 1928 which has been modified to some extent by
missionaries since then (Omniglot, 2017).

2.3 Related Works

Cavnar & Trenkle, (1994) used n-gram rank ordering technique to test language
identification on newsgroup text in 8 languages. These languages were English, Polish,
Dutch and German (Germanic group), Portuguese, French, Spanish and Italian (Romance
Group). Training sets for the languages varied between 20,000 and 120,000 characters.
Using the 300 most frequent 3-grams the best accuracy of 98.6% was achieved for a test
string smaller than 300 characters. From the results, it is difficult to conclude what the
performance on smaller test strings (e.g. < 20 characters) would be.

Dunning, (1994) used a Naïve Bayes approach to estimate the likelihood that a string
belongs to a language model. He did his experiments on English and Spanish parallel
texts. He reported an achieved accuracy of 92% with 50,000 characters of training and a
20-character string. The accuracy increased to 99% when the test string size was
increased to 500 characters. For 5K characters of training and a 500-character test string
the accuracy dropped to 97%. The evaluated languages were from different language
families (Romance and Germanic) and therefore it is difficult to compare results with
classification done on languages from the same family groups.

Kikui, (1994) proposed an algorithm for simultaneously identifying the coding system
and the language of a given string. The coding system was identified heuristically, where
the language identification part was based on the Naïve Bayes model. Documents were
collected from the Internet and grouped into 700 training and 640 testing documents. The
tests included 9 languages (English, German, Spanish, French, Italian, Portuguese,
Japanese, Chinese and Korean) and 11 coding schemes. They found that n=4 gave the
lowest error rates and that their accuracies were equivalent to the results presented in
Cavnar & Trenkle, (1994).

Grefenstette, (1995) compared two feature statistics in building a likelihood model. The
one statistical model contains the likelihood of all tri-grams that appeared more than 100
times in a training set. The other model contained the likelihood of all words which are
smaller than 5 characters and occur more than 3 times in the training text. It is found that

for sentences smaller than 15 words the tri-gram method performs best but for a greater
number of words both methods perform well. A 15-word window can roughly
approximate 100 characters. Therefore, it’s difficult to conclude what the error rate for
much smaller test strings would be.

Kruengkrai, Srichaivattana, Sorlertlamvanich, & Isahara, (2005) compared the
performance of two kernel classifiers: the centroid method and an SVM. A string kernel
was implemented which computes the inner product in the feature space generated by all
subsequences of length k. A subsequence can be of any combination of k characters. The
difference between n-grams and these combinations is that the subsequences do not need
to follow contiguously. The subsequences are weighted by an exponentially decaying
factor, so subsequences that follow contiguously weigh more. A subsequence length of
k=5 and a decay factor of 1 gave the best results. As a baseline for comparison, an
n-gram rank ordering classification system was used, and 5-fold cross-validation on 17
languages with an average of 578 sentences per language was used to test the performance.

On a test sample (with an average length of 50 character per item) n-gram rank ordering
performed with 90.2% accuracy, the centroid method with 95.9% accuracy and the SVM
gave an overall best performance accuracy of 99.7%.

L-F.Zhai, (2006) found that a reduction in the feature space of the SVM results in a
significant decrease in performance accuracy. They also concluded that the SVM is
highly sensitive to prior distributions. This is due to the nature of the SVM classifier that
does not compensate for different sizes of training data. Therefore, classification is biased
towards the class with the larger training set.

A frequency based n-gram difference based classifier and a support vector machine
(SVM) that uses the n-gram frequencies as features are discussed in Botha, Zimu, &
Barnard, (2006). Error rates of approximately 0.3% are achieved over large text window
sizes. It is also found that the SVM’s performance is better than the n-gram based
estimator’s, but at a much greater computational cost.

H. Lodhi, (2002) showed that an SVM trained with n-gram statistics outperforms an
SVM using string kernels as features. Though one might assume that a string kernel
would capture more information about a language, it probably introduces complexity into
the SVM, which has a negative impact on decision-making.

J.Hakkinen, (2001) compared a Naïve Bayes classifier with a classifier based on a
decision tree. A word was divided into body, head and tail to enhance n-grams and to
train the classifiers on the n-gram statistics of these parts. To compute the likelihood, the
n-gram is chosen according to its position in the input letter sequence. The reason for
creating the enhanced n-grams was to incorporate distinct linguistic characteristics such
as the prefixes and suffixes in a language. The decision tree makes a classification
decision by asking a series of questions about the context of each letter in the input string.
Performance accuracy was tested on short names in 4 languages (English, Finnish,
Spanish and German). The decision tree achieved an accuracy of 66%, whereas the
tri-gram trained Naïve Bayes classifier performed with 71.8% accuracy.

Simões, Almeida, & Byers, (1998) presented a neural network that is able to identify
languages with 96% or 97% accuracy, depending on the number of iterations performed
during the training process. They used two kinds of features: one related to the
language alphabet, and another based on the most frequently occurring character
trigrams. Because they are able to use binary features to classify the alphabet,
the neural network is able to learn much faster to distinguish some collections of
languages. A problem with their approach is that it will perform badly on short snippets
of text (like instant messages or mobile messages), because of the low number of
trigrams selected by language.

S. MacNamara, (1998) compared a Naïve Bayes classifier with a neural net classifier. For
the Naïve Bayes classifier, the 100 most discriminative n-grams and the 100 most
frequent n-grams were extracted from the training text. An input string was first scanned
for discriminant n-grams and a decision was made once one was found. If no
discriminant n-grams occurred the log likelihood was calculated using probabilities of the
100 most frequent n-grams and the language with the highest score was classified as the

language of the input string. Tests were performed on 18 languages. For an average of 2
to 3 words, the accuracy of the neural network was 70%, while the tri-gram trained
Naïve Bayes classifier achieved an accuracy of 83%.

Prager, (1999) implemented dot product calculation by giving each vector item an inverse
document frequency (IDF) weighting which is the inverse of the number of languages
that contain the n-gram combination. For closely-related languages Prager, (1999) found
that this weighting stayed fairly fixed. By combining 4-gram and word statistics (with no
restriction on the word length), best performance was achieved. By evaluating thirteen
West-European languages, the average performance was 85.4% for a 20-character input
string and over 99% for 130 characters and up.

Padro, (2004) compared the Naïve Bayes method, the dot-product classification and the
n-gram rank ordering method to each other. Identification was tested on 6 languages
(English, Catalan, Spanish, Italian, German and Dutch). Overall, the Naïve Bayes
classifier proved best (significantly so for small test samples), followed by the
dot-product classifier and then the n-gram rank ordering method.

Truica, Velcin, & Boicea, (2015) presented a statistical method for automatic language
identification of written text using dictionaries containing stopwords and diacritics. They
proposed different approaches that combine the two dictionaries to accurately determine
the language of textual corpora. They tested their method using a Twitter corpus
containing 500,000 tweets with 100,000 tweets for each studied language and a news
article corpus that contains 250,000 entries with 50,000 articles for each studied
language. Their results show that their proposed method has an accuracy of over 90% for
small texts and over 99.8% for large texts.

To the knowledge of the researcher, the only other research that includes the Ethiopian
languages in a text-based language identification task was done by Desta, (2014). Desta,
(2014) compared language identification accuracy by using character n-gram and
character n-gram location as a feature set for Naïve Bayes and Frequency rank order
models. The results showed that Naïve Bayes classifier achieved highest accuracy for

short, medium, and long string test documents. The identification accuracy of Frequency
Rank Order is low but showed an improvement for long text test documents. When using
character n-grams and their location frequency as the feature set, the accuracy of both
models showed an improvement. Although this is an encouraging result, it’s difficult to
jump to conclusions without considering factors that can determine identification
accuracy such as the amount and variety of training data. He didn’t investigate the effect
that homogeneously distributed training data (equal amounts of training data for each
language class) has on a classifier when compared to a training set with heterogeneously
distributed training data. Moreover, his research is restricted to four Cushitic
languages and didn’t include languages from other linguistic classifications.
Furthermore, his research didn’t consider multilingual identification for mixed
language input. Thus text-based language identification for the Ethiopian languages is
still open for research.

From the literature, it is evident that different approaches to solving the language
identification problem have been investigated. Naive Bayes classifiers prove to be quite
popular and successful with high orders of n for the n-grams used. The use of SVMs also
appears to be a popular choice, achieving good performance. However, not many studies
have been conducted regarding dictionaries containing stopwords and also focusing on
major Ethiopian languages. Identification based on SVM, Naïve Bayes and dictionary
method were used in this research.

3 METHODS AND TECHNIQUES

3.1 Overview

Different methods have been used to compute classification accuracies in different
studies regarding text-based language identification. Different amounts of textual
training data are used to train classifiers, with documents spanning different domains.
The metric for the size of the test string also varies: the number of characters or
words, or the size in bytes, lines or sentences (Botha, Zimu, & Barnard, 2006).

Some tests were performed on languages without any family relationship, others on
languages within the same family. Some studies evaluated performance accuracy using
only a validation set, where others performed a thorough cross-validation test. In
studies that compared different classifiers against each other, the classifiers were
evaluated under the same conditions so that more reliable conclusions could be made.

This study adopts ideas from these earlier evaluation processes and is designed to make
comparisons more reliable. To ensure reliable results, the classifiers were evaluated
under the same circumstances.

In this study, the languages used for the experiments come from different linguistic
classifications: Cushitic and Semitic within the Afro-Asiatic family, and Nilotic
within the Nilo-Saharan family. Seven local languages are used, namely Afar, Amharic,
Nuer, Oromo, Sidamo, Somali and Tigrigna.

3.2 Corpora

Seven corpora were collected to perform the experiments described here. All corpora
consist of different texts compiled from different sources. The data included text from
various sources (such as newspapers, periodicals, books, the Bible and government
documents) and therefore, the corpus spans several domains.

The size of text varied from 4MB to 5MB per language. Due to the variety of sources
used, text was not homogeneous and needed some automatic preprocessing in order to be
used for building models. Therefore, using Python scripts, the extraction, compilation,
cleaning and indexing of all articles was carried out before performing the experiments.
For example, consecutive white spaces were substituted with a single space character.
Moreover, numbers, names, abbreviations, punctuation marks and addresses (e.g. e-mail
and links to Internet websites) were removed. After this preprocessing, the size of the text
was reduced to 1MB to 2MB per language.
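A minimal sketch of this cleaning step is shown below. The thesis describes the operations (removing addresses, numbers and punctuation, collapsing whitespace) but not the exact patterns, so the regular expressions here are illustrative assumptions rather than the original scripts:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative preprocessing: remove addresses, numbers and
    punctuation, then collapse whitespace. The exact patterns used in
    the original scripts are not specified; these are assumptions."""
    text = re.sub(r"\S+@\S+|https?://\S+|www\.\S+", " ", raw)  # e-mails and links
    text = re.sub(r"\d+", " ", text)                           # numbers
    text = re.sub(r"[^\w\s]", " ", text)                       # punctuation marks
    return re.sub(r"\s+", " ", text).strip()                   # consecutive spaces

print(clean_text("Contact:  info@example.com, call 0911!  See http://x.org"))
# Contact call See
```

Because Python's `\w` matches Unicode word characters, Ethiopic (Ge'ez) letters survive the punctuation filter unchanged.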

Documents written in 7 languages and language varieties were compiled and processed
resulting in slightly over 11.7 million tokens. Quantitative detail about the corpora
(number of tokens and types) is shown in Table 3.1.

Corpus Size
Languages     Number of Words/Tokens     Number of Types (unique words)
Afar                  17,459                         6,673
Amharic            9,072,217                        50,000
Nuer                   1,337                           433
Oromo              1,265,782                        50,000
Sidamo                81,345                        15,620
Somali             1,262,697                        50,000
Tigrigna              42,391                        12,844
Total             11,743,228                       185,570

Table 3.1 Corpora: Facts and Figures - Number of Words/Tokens and Types

As can be seen in Table 3.1, the number of texts differs across languages. Texts are also
of a different length depending on the data source. It is obvious that both the amount of
training material and the length of documents play an important role in language
identification and text classification tasks in general.

The experimentation conducted in this study is separated into two experiments, with each
experiment differing with regards to the amount of data used for training.

The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. This experiment was conducted by randomly taking 90% of the corpus
of the seven target languages for training the models and the remaining 10% for testing
the models. Table 3.2 shows the corpus size for training (90%) and testing (10%) the
models.

Languages     For training the models (90%)     For testing the models (10%)
              Number of Words/Tokens            Number of Words/Tokens
Afar                   15,713                           1,746
Amharic             8,164,995                         907,222
Nuer                    1,203                             134
Oromo               1,139,204                         126,578
Sidamo                 73,211                           8,134
Somali              1,136,427                         126,270
Tigrigna               38,152                           4,239
Total              10,568,905                       1,174,323

Table 3.2 Corpus size for training (90%) and testing (10%) the models
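The random 90%/10% partition used in the first experiment can be sketched as follows; `split_corpus` is a hypothetical helper, since the thesis does not specify the exact splitting code:

```python
import random

def split_corpus(tokens, train_frac=0.9, seed=42):
    """Randomly partition a token list into training and test portions,
    as in the 90%/10% split of the first experiment."""
    rng = random.Random(seed)
    shuffled = list(tokens)   # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_corpus(range(1000))
print(len(train_set), len(test_set))  # 900 100
```

Fixing the seed makes the partition reproducible across runs, which helps when comparing classifiers under identical conditions.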

The second experiment consists of varying the size of the training data used to train the
classifiers to simulate scarce data availability. Furthermore, this experiment investigates
the effect that homogeneously distributed training data (equal amounts of training data
for each language class) has on a classifier when compared to a training set with
heterogeneously distributed training data. The smallest training data size is the data size
taken for Nuer language. Therefore, similar data sizes for the remaining six languages are
taken. Table 3.3 shows the corpus size for training and testing the models with
homogeneously distributed training data for the seven languages in this study.

Languages     For training the models     For testing the models
              Number of Words/Tokens      Number of Words/Tokens
Afar                  1,203                       134
Amharic               1,203                       134
Nuer                  1,203                       134
Oromo                 1,203                       134
Sidamo                1,203                       134
Somali                1,203                       134
Tigrigna              1,203                       134
Total                 8,421                       938

Table 3.3 Data statistics for training and testing the models with homogeneously
distributed training data per language

To evaluate the performance of the models, windows containing a specified number of
characters, known as character windows, are used. Character windows were used in the
experiments here since they are the most reliable measure of window size. The
experiments are conducted on three different character windows: a 15-character window,
representing a short document (2-3 words); a 100-character window, representing a
medium document (a long sentence); and a 300-character window, representing a long
document (a paragraph). The difficulty of correct classification is influenced by the
choice of these sizes. The average number of characters per word was calculated for
each language; this determines how many words are required to fill each character
window, and test documents are generated based on this analysis. Quantitative detail
about the corpora (average number of
words, and average number of characters for a word in each language) is shown in Table
3.4.

Languages     Number of        Number of       Average Word   15-chw   100-chw   300-chw
              Words/Tokens     Characters      Length
Afar              17,459          115,591         6.62         2.27     15.11     45.32
Amharic        9,072,217       42,596,236         4.70         3.19     21.28     63.83
Nuer               1,337            5,434         4.06         3.69     24.63     73.89
Oromo          1,265,782       10,180,001         8.04         1.87     12.44     37.31
Sidamo            81,345          750,926         9.23         1.63     10.83     32.50
Somali         1,262,697        9,756,935         7.73         1.94     12.94     38.81
Tigrigna          42,391          173,439         4.09         3.67     24.45     73.35

Table 3.4 Word statistics of each language to generate test characters window. Average
word lengths and number of words per character window are indicated.
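The words-per-window figures in Table 3.4 follow directly from dividing the window size by each language's average word length, as this small sketch shows:

```python
def words_per_window(avg_word_length: float, window_chars: int) -> float:
    """Approximate number of words that fit into a character window,
    reproducing the per-language ratios reported in Table 3.4."""
    return round(window_chars / avg_word_length, 2)

# Amharic has an average word length of 4.70 characters:
print(words_per_window(4.70, 15))   # 3.19
print(words_per_window(4.70, 100))  # 21.28
print(words_per_window(4.70, 300))  # 63.83
```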

Languages     Number of 15-chw     Number of 100-chw     Number of 300-chw
              test documents       test documents        test documents
Afar                 582                  125                    41
Amharic          302,408               64,778                21,590
Nuer                  45                   10                     3
Oromo             42,177                9,033                 3,011
Sidamo             2,711                  581                   193
Somali            42,090                9,016                 3,005
Tigrigna           1,413                  303                   101
Total            391,426               83,846                27,944

Table 3.5 Number of test documents for the three characters windows

To evaluate multilingual identification, an artificial corpus containing 1050 documents
was constructed. Mixed input texts (3 languages in the same document) are considered in
this study, and each document is a concatenation of 3 sections in different languages,
constructed randomly from a collection of medium-length texts per language (6 to 50
words).
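A sketch of how such a mixed document could be assembled; the `make_mixed_document` helper and the snippet mapping are hypothetical illustrations, not the thesis's actual corpus-building code:

```python
import random

def make_mixed_document(snippets_by_lang, lang_triple, seed=0):
    """Build one artificial multilingual document by concatenating one
    randomly chosen medium-length section per language. The
    snippets_by_lang mapping {language: [texts]} is hypothetical."""
    rng = random.Random(seed)
    return " ".join(rng.choice(snippets_by_lang[lang]) for lang in lang_triple)

snippets = {
    "Amharic": ["amharic section ..."],
    "Oromo": ["oromo section ..."],
    "Somali": ["somali section ..."],
}
print(make_mixed_document(snippets, ("Amharic", "Oromo", "Somali")))
```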

Multilingual Identification
Languages                   Number of test documents    Languages                   Number of test documents
Afar, Amharic, Nuer 30 Amharic, Nuer, Tigrigna 30
Afar, Amharic, Oromo 30 Amharic, Oromo, Sidamo 30
Afar, Amharic, Sidamo 30 Amharic, Oromo, Somali 30
Afar, Amharic, Somali 30 Amharic, Oromo, Tigrigna 30
Afar, Amharic, Tigrigna 30 Amharic, Sidamo, Somali 30
Afar, Nuer, Oromo 30 Amharic, Sidamo, Tigrigna 30
Afar, Nuer, Sidamo 30 Amharic, Somali, Tigrigna 30
Afar, Nuer, Somali 30 Nuer, Oromo, Sidamo 30
Afar, Nuer, Tigrigna 30 Nuer, Oromo, Somali 30
Afar, Oromo, Sidamo 30 Nuer, Oromo, Tigrigna 30
Afar, Oromo, Somali 30 Nuer, Sidamo, Somali 30
Afar, Oromo, Tigrigna 30 Nuer, Sidamo, Tigrigna 30
Afar, Sidamo, Somali 30 Nuer, Somali, Tigrigna 30
Afar, Sidamo, Tigrigna 30 Oromo, Sidamo, Somali 30
Afar, Somali, Tigrigna 30 Oromo, Sidamo, Tigrigna 30
Amharic, Nuer, Oromo 30 Oromo, Somali, Tigrigna 30
Amharic, Nuer, Sidamo 30 Sidamo, Somali, Tigrigna 30
Amharic, Nuer, Somali 30
Total 1050

Table 3.6 Number of test documents for Multilingual Identification

3.3 Evaluation and Computational Techniques

This section presents the computational techniques behind the automatic language
identification systems used in this thesis. From the literature study in chapter two,
we can see that many approaches can be followed in text-based language identification.
A pure linguistic approach is undeniably the better choice for achieving high
classification accuracy, but it requires a large amount of linguistic expertise;
statistical approaches are therefore a feasible alternative. Statistics of words,
letters or n-grams can be used to build statistical language models. N-gram based
models outperform a word-based model for small text fragments and do equally well for
larger fragments (Grefenstette, 1995). In another study, n-grams achieved better
results than string kernels (Lodhi, Shawe-Taylor, Cristianini, & Watkins, 2002). The
literature shows that n-grams are by far the most popular choice, which is why the
feature sets here are restricted to n-gram based features. Naïve Bayes and SVM are
trained using character n-grams as the feature set.

3.3.1 Evaluation Metrics

To evaluate the extent to which the methods used in this thesis are suitable for
language identification, standard metrics from Natural Language Processing and Text
Classification are used to report results in terms of accuracy and error rate.

The metrics used are presented next and they are based on the possible outcomes of a
confusion matrix.

The confusion matrix contains four possible outcomes: true positives, true negatives,
false positives, and false negatives. The results are obtained per class. To evaluate the
performance of the classifier across all classes it is necessary to calculate the average
(mean) performance of all classes. This allows us to evaluate how well the classifier is
performing when identifying each individual class as well as when distinguishing all
classes.

The evaluation metrics used in this thesis are accuracy and error rate. Accuracy is
calculated as the number of correct predictions divided by the total number of
predictions; the best accuracy is 1.0 (100%) and the worst is 0.0 (0%). Error rate is
calculated as the number of incorrect predictions divided by the total number of
predictions; the best error rate is 0.0 (0%) and the worst is 1.0 (100%). Accuracy and
error rate are calculated as follows (Powers & David, 2011):

Accuracy = (TP + TN) / (TP + TN + FP + FN)                    Equation 3.1

Error rate = (FP + FN) / (TP + TN + FP + FN)                  Equation 3.2
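A direct translation of Equations 3.1 and 3.2 into code; the counts in the example are made up for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Equation 3.1: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Equation 3.2: incorrect predictions over all predictions."""
    return (fp + fn) / (tp + tn + fp + fn)

# 100 test documents: 90 true positives, 5 true negatives,
# 3 false positives, 2 false negatives
print(accuracy(90, 5, 3, 2))    # 0.95
print(error_rate(90, 5, 3, 2))  # 0.05
```

Note that accuracy and error rate always sum to 1, since every prediction is counted exactly once in one of the four cells.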

3.3.2 Smoothing Techniques

A word may exist in the language but may not appear in a given corpus. Some words
might be very rare and may not occur no matter how big a corpus is. However, this does
not mean they will not occur in other samples. Therefore, attributing a zero probability to
them would spoil the calculation. The same is true for character sequences, due to the
simple fact that in any given language some character sequences are more frequent than
others.

Manning & Schütze, (1999) state ‘regardless of how the probability is computed, there is
still the need to assign a non-zero probability estimate to words or n-grams that are not
present in our training corpus’. This method is called smoothing and there are a number
of smoothing techniques that are used in natural language processing and in language
identification. A very simple one used in Dunning, (1994) and Zampieri, Gebre, &
Diwersy, (2012) is Laplace smoothing.

In the context of language modelling, Laplace smoothing is calculated as follows:

Plap(w1 … wn) = (C(w1 … wn) + 1) / (N + B)                    Equation 3.3

The formula is the aforementioned Maximum Likelihood Estimation (MLE) modified by
adding 1 to the numerator (to assign a non-zero probability) and adding B, the total
number of possible unique n-grams, to the denominator.

In this study, Laplace smoothing is used to avoid zero probabilities for n-grams that
were not seen during model training. This is done by adding one to each n-gram
frequency. If an n-gram does not appear at all in the training phase, it is discarded
from the calculation; since the training corpus is large, the effect of zero-frequency
n-grams on classification accuracy is minimal. This process continues until the
probability of the language given the character n-grams has been calculated for each of
the languages. The language with the maximum such probability is considered the
language of the unknown text document.
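A sketch of Laplace-smoothed scoring with the argmax decision described above, using log probabilities to avoid numerical underflow. The toy counts and the two-language model mapping are illustrative, not drawn from the thesis data:

```python
import math

def smoothed_log_prob(ngram, counts, total, vocab_size):
    """Equation 3.3 (Laplace): (C(ngram) + 1) / (N + B), returned in
    log space so products of many small probabilities do not underflow."""
    return math.log((counts.get(ngram, 0) + 1) / (total + vocab_size))

def identify(text_ngrams, models):
    """Score the text under every language model and return the
    language with the highest log-likelihood (argmax decision)."""
    scores = {
        lang: sum(smoothed_log_prob(g, counts, total, vocab) for g in text_ngrams)
        for lang, (counts, total, vocab) in models.items()
    }
    return max(scores, key=scores.get)

# Toy models: {language: (trigram counts, total trigrams, vocabulary size B)}
models = {
    "am": ({"abc": 8, "bcd": 2}, 10, 100),
    "om": ({"xyz": 9, "abc": 1}, 10, 100),
}
print(identify(["abc", "bcd"], models))  # am
```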

3.3.3 Classification Methods

In this section, the algorithms used in this thesis are presented. Implementations of
Naive Bayes and Support Vector Machines, two popular machine learning classifiers, are
used. A third technique used in this thesis is dictionaries containing stopwords.

Although a large number of classifiers have been applied to text-based language
identification, experimenting with all possible combinations was not feasible. The
classification algorithms were selected for their proven performance in published
studies, as well as their ability to clarify theoretical issues, as discussed below.

The vast majority of state-of-the-art language identification methods rely on
(character) n-gram language models. The application of n-gram language models to language
identification can be traced back to the work of (Dunning, 1994). According to Shannon,
(1951), n-gram language models are simple statistical language models calculated based
on the co-occurrence of words or characters across text samples. Words can occur alone
or in sequence in a text or a corpus and their probability can be estimated by n-gram
models. We can apply the same to character sequences arranged in the form of n-grams.
As features for classification, n-gram statistics were used.

An algorithm capable of identifying the language of a text can be designed based on
probabilities of occurrence of letters and letter sequences.

[Figure 3.1 omitted: a block diagram in which corpora for Language 1 through Language N
feed a training stage that produces a model; an unknown text is then passed to an
identification stage, which outputs the most probable language.]

Figure 3.1 Steps to identify a particular language.

During the training stage, all possible character n-grams were extracted. The core benefits of n-gram models (and algorithms that use them) are their relative simplicity and the ability to scale up by simply increasing n. The model can be used to store more context with a well-understood space-time tradeoff, enabling small experiments to scale up very efficiently. The n-gram decomposition of the probability of a sequence is given by:

P(X_1 … X_n) = P(X_1) P(X_2|X_1) … P(X_n|X_1^(n-1)) = ∏_{k=1}^{n} P(X_k|X_1^(k-1))        Equation 3.4

The language identification system has two components: a Language Profile Generator and a Classifier. During training, the language profile generator produces all possible n-grams for each language's corpus and saves them into the corresponding language files. During identification, the n-gram profile of the text to be identified is calculated and compared with the language-specific n-gram profiles: the likelihood of the test sample is calculated under each model, and the language that gives the best likelihood is selected.

TRAINING

ALGORITHM: LANGUAGE PROFILE GENERATION

Input: Language corpus of Afar, Amharic, Nuer, Oromo, Sidamo, Somali, Tigrigna.

Output: n-grams of these languages

1. Take the documents from the corpus one by one and preprocess them,
   i.e. remove special characters, digits etc.
2. Tokenize the text into words (tokens).
3. Generate all possible character n-grams (n = 3 for Naïve Bayes and SVM) for each
   language.
4. Store the n-gram profiles of all languages for use by the classifier.
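The training steps above can be sketched in Python. This is an illustrative reconstruction, not the thesis code; the function names and the cleaning regular expression are assumptions:

```python
import re
from collections import Counter

def preprocess(text):
    # Step 1: remove digits, punctuation and other special characters,
    # keeping letters (including non-ASCII Ethiopic script) and spaces.
    return re.sub(r"[^\w\s]|\d", " ", text)

def char_ngrams(token, n=3):
    # Generate all character n-grams of a single token.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def build_profile(corpus_text, n=3):
    # Steps 2-4: tokenize, extract trigrams, and store their counts
    # as the language profile.
    profile = Counter()
    for token in preprocess(corpus_text).split():
        profile.update(char_ngrams(token, n))
    return profile

profile = build_profile("afaan oromoo afaan")
```

The resulting counter would be written to the per-language profile file described above.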

TESTING

ALGORITHM: LANGUAGE IDENTIFICATION

Input: Text in Afar, Amharic, Nuer, Oromo, Sidamo, Somali, Tigrigna.

Output: Language identified for the given text.

1. Read the text to be identified from the user.
2. Remove the special characters and tokenize the text into tokens.
3. Generate all possible n-grams for the text.
4. Save them into the corresponding language files.
5. Calculate the similarity index for each language by comparing the n-grams with
   the output of the language profile generator.
6. Select the language corresponding with the highest similarity value.
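The testing steps can likewise be sketched. The similarity index used here is one simple choice (the number of the text's n-gram occurrences that are found in a language profile); the thesis does not prescribe this exact formula, so treat it as an assumption:

```python
from collections import Counter

def similarity(text_ngrams, language_profile):
    # Similarity index: count the text's n-gram occurrences that also
    # appear in the language profile (one simple choice of measure).
    return sum(cnt for ng, cnt in text_ngrams.items() if ng in language_profile)

def identify(text, profiles, n=3):
    # Extract the unknown text's n-grams, score them against every
    # language profile, and select the best-scoring language.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return max(profiles, key=lambda lang: similarity(grams, profiles[lang]))
```

Here `profiles` maps a language name to its stored set (or counter) of n-grams.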

The first algorithm used in this thesis is Naive Bayes, which uses character n-grams of size 3 as a feature set. The Naive Bayesian method seeks to maximize the probability of a language given a document, P(L|D), where L is a language and D is a document. Using Bayes' rule, we rewrite this as

max P(L|D) = max P(L) P(D|L) / P(D)        Equation 3.5

By calculating the maximum probability of a document given a language, we calculate


the most probable language given that document. If n-gram statistics are used in the
classification model, it is assumed that successive n-grams are independent of one
another, so that the various log likelihoods can be added together.

For each language, a vector of n-gram probabilities is computed by

l_j = f_j / |f_j|        Equation 3.6

where fj is a vector of n-gram frequencies calculated from a language document of class j.


The probability of a test string of size α is calculated in the logarithmic domain. The log likelihood simplifies calculations by adding logarithmic probabilities and can be expressed as

P(L|D) = Σ_{i=1}^{n−α+1} ln l_j(c_i)        Equation 3.7

where l_j(c_i) is the probability of the n-gram c_i in the language model l_j.

After calculating the probabilities for all languages, the most likely language profile is
selected as the language of the test string.
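A minimal sketch of Equations 3.6 and 3.7, with unseen n-grams skipped as described earlier. The guard that returns negative infinity when no n-gram matches is an assumption added for this sketch, so that a language with no matching n-grams cannot win with an empty sum of zero:

```python
import math
from collections import Counter

def train_model(corpus_text, n=3):
    # Equation 3.6: relative frequency of each character n-gram
    # in the language corpus (the vector l_j = f_j / |f_j|).
    freqs = Counter(corpus_text[i:i + n] for i in range(len(corpus_text) - n + 1))
    total = sum(freqs.values())
    return {ng: count / total for ng, count in freqs.items()}

def log_likelihood(text, model, n=3):
    # Equation 3.7: sum the log probabilities of the test string's n-grams.
    # N-grams unseen in training are skipped, as described in the text.
    seen = [text[i:i + n] for i in range(len(text) - n + 1) if text[i:i + n] in model]
    if not seen:
        return float("-inf")  # no evidence at all for this language
    return sum(math.log(model[ng]) for ng in seen)

def classify(text, models):
    # The language profile with the highest log likelihood is selected.
    return max(models, key=lambda lang: log_likelihood(text, models[lang]))
```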

Naive Bayes is also trained to identify intermixed text in multiple languages. For mixed input texts (two or three languages in the same document), the classifier recognizes the mixed inputs and returns up to the three most likely languages.

The second algorithm used in the thesis is the Support Vector Machine (SVM), employed from the class of more complex classifiers. The SVM showed good results in different comparisons regarding language identification and generalization in high-dimensional spaces (Kruengkrai, Srichaivattana, Sorlertlamvanich, & Isahara, 2005), and has also shown good performance in many other pattern-recognition tasks. An available software module with full SVM functionality was used, so the classifier was not implemented from first principles.

The library for support vector machines, LIBSVM (Chang & Lin, 2001), provides a full implementation of several SVMs. A language model was built from samples of size α drawn from the training set. These samples contain a frequency count of each n-gram combination in the character string; thus, the feature dimension of the SVM equals the number of n-gram combinations. Samples of the testing set are created using the same character window as used to build the language model. After training the SVM language model, the test samples can be classified according to language. Details about how the SVM works can be found in Section 2.1.2.2.
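How the α-sized samples become fixed-dimension frequency vectors can be illustrated as follows. This is a sketch under stated assumptions (function name and vocabulary handling are illustrative); the resulting vectors, paired with language labels, would then be handed to LIBSVM's training interface:

```python
from collections import Counter

def feature_vectors(text, alpha=15, n=3, vocab=None):
    # Cut the text into non-overlapping samples of size alpha; each sample
    # becomes a frequency count over the n-gram vocabulary, so the SVM's
    # feature dimension equals the number of n-gram combinations.
    samples = [text[i:i + alpha] for i in range(0, len(text) - alpha + 1, alpha)]
    if vocab is None:
        # The vocabulary is fixed at training time; test samples reuse it.
        vocab = sorted({s[i:i + n] for s in samples for i in range(len(s) - n + 1)})
    vectors = []
    for s in samples:
        counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
        vectors.append([counts.get(ng, 0) for ng in vocab])
    return vocab, vectors
```

Test texts must be vectorized against the same `vocab` produced during training so that feature indices line up.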

[Figure 3.2 shows two phases. Training phase: language corpus, corpus preprocessing, feature extraction (character n-gram), language profile (n-gram, count, language tag), identification model. Prediction phase: unknown document, document preprocessing, feature extraction (character n-gram), identification model, identified language.]

Figure 3.2 Naïve Bayes and SVM models applied to text languages in this study

Language Corpus: Language corpus that includes the seven languages in this study is
read in from a file during training phase.

Unknown Document: An unknown document is read in from a file or raw texts are
typed in during prediction phase.

Corpus Preprocessing: Corpus Preprocessing is the process of preparing and cleaning


the corpus for classification. All numbers, special characters and punctuation marks are
removed.

Feature Extraction: The feature extractor extracts the features for the experiment. Character n-grams were used as the feature set. For Naïve Bayes and SVM, n-grams of size 3 are extracted from the tokens of the training corpus.

Language Profile: After feature extraction, the counts of each n-gram feature set and
their respective tag will be stored for each of the seven languages as language profile.
This is done during training phase. The language profile created is read in during
prediction phase.

Model: Naïve Bayes and SVM classifiers were used. Implementations of the classifiers
follow the approaches discussed in Section 3.2.3. The training phase results in a model
for language identification.

Language: After using the constructed model, the language of the unknown document is
identified and displayed during the prediction phase.

The last approach used in this thesis relies on dictionaries containing stopwords. Some languages share similar words and special characters, but these are not all common across languages. Since stopwords are very specific to a language, they are usually a good signal for automatic language detection.

The stopwords dictionaries were created before applying the algorithm. They include approximately 185,570 stopwords in total for the seven languages in this study. The algorithm takes the text, finds its most common words and compares them with the stopwords. It counts how many unique stopwords of each language are seen in the analyzed text, stores this in a dictionary of per-language ratios, and selects the language with the highest ratio, i.e. the language with the most matching stopwords, as the identified language.

Dictionary Method

Languages     Dictionary Size (Stopwords)
Afar            6,673
Amharic        50,000
Nuer              433
Oromo          50,000
Sidamo         15,620
Somali         50,000
Tigrigna       12,844
Total         185,570

Table 3.7 Dictionary size (number of stopwords) for the Dictionary Method

TESTING

ALGORITHM: LANGUAGE IDENTIFICATION USING STOPWORDS

Input: Text in Afar, Amharic, Nuer, Oromo, Sidamo, Somali, Tigrigna.

Output: Language identified for the given text.

1. Read the text to be identified from the user.
2. Remove the special characters and tokenize the text into tokens.
3. Capture the stopwords from the corpus.
4. Intersect the stopwords with the tokenized words.
5. Count the stopwords of the given text and compare with the stopwords from the corpus.
6. Find the maximum count.
7. Select the language corresponding with the highest count.

[Figure 3.3 shows the pipeline: the unknown document is preprocessed and tokenized, the tokens are intersected with the stopwords dictionary, and the language with the maximum count is selected.]

Figure 3.3 Dictionary Method applied to text languages in this study

An unknown document is read in from a file, or raw text is typed in, during the prediction phase, and then document preprocessing is done. Document preprocessing involves removing all numbers, special characters and punctuation marks. Tokenization, the task of chopping a running text up into pieces called tokens, is then performed; the standard (whitespace) tokenization that separates words by the 'space' character is used. Then the intersection between the stopwords dictionary and the tokenized words is taken, and the language with the maximum count is identified as the language of the unknown document.
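A minimal sketch of the dictionary method; the function name is illustrative, and the example stopwords below are assumptions for demonstration, not entries taken from the thesis dictionaries:

```python
def detect_by_stopwords(text, stopword_dicts):
    # Tokenize on whitespace, intersect the tokens with each language's
    # stopword set, and pick the language with the most unique hits.
    tokens = set(text.lower().split())
    hits = {lang: len(tokens & stops) for lang, stops in stopword_dicts.items()}
    return max(hits, key=hits.get)
```

Dividing each hit count by the number of tokens would give the per-language ratio described in the text; the ranking, and hence the result, is the same for a single document.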

4 RESULTS AND DISCUSSION

4.1 EXPERIMENT AND RESULT

The experimentation conducted in this study is separated into two experiments, each differing in the amount of data used for training. The results are analyzed individually at the end of each experiment, and an overall analysis summarizes the observations.

The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. This experiment was conducted by randomly taking 90% of the corpus
of the seven target languages for training the models and the remaining 10% for testing
the models.

Accuracy and error rates were calculated for comparing the classifiers, as discussed in Section 3.3.1 of Chapter 3. The data statistics, including the results of all tests, are given in the tables below. 500,216 test documents of differing size were used for the seven languages. These were divided into short documents (2-3 words), medium documents (6-50 words) and long documents (more than 50 words, i.e. a long sentence or a paragraph); the three document sizes correspond to the three character windows.

Naïve Bayes (n=3)
(Tests = number of test documents; Correct = number correctly identified; % = percentage correctly identified.)

              15-character window        100-character window       300-character window
Languages     Tests     Correct  %       Tests    Correct  %        Tests    Correct  %
Afar          582       560      96.23   125      123      98.4     41       41       100
Amharic       302,408   292,096  96.59   64,778   64,435   99.47    21,590   21,590   100
Nuer          45        41       91.11   10       10       100      3        3        100
Oromo         42,177    40,679   96.45   9,033    8,842    97.89    3,011    2,982    99.04
Sidamo        2,711     2,618    96.57   581      565      97.25    193      192      99.48
Somali        42,090    39,556   93.98   9,016    8,777    97.35    3,005    2,975    99
Tigrigna      1,413     1,361    96.32   303      301      99.34    101      101      100
Total         391,426   376,911  96.29   83,846   83,053   99.05    27,944   27,884   99.76

Table 4.1 Evaluation on test data for Naïve Bayes Classifier (n=3)

Figure 4.1 Evaluation (Accuracy) on test data for Naïve Bayes Classifier (n=3)

For a 15-character window, the 3-gram Naïve Bayes classifier showed 96.29% accuracy on average; for a 100-character window, 99.05%. For the 300-character window, accuracy again improves with the larger character window: at 99.76% average accuracy, the 3-gram Naïve Bayes classifier does better than the 15-character window by 3.47% and better than the 100-character window by 0.71% on average.

SVM (n=3)
(Tests = number of test documents; Correct = number correctly identified; % = percentage correctly identified.)

              15-character window        100-character window       300-character window
Languages     Tests     Correct  %       Tests    Correct  %        Tests    Correct  %
Afar          582       576      98.97   125      125      100      41       41       100
Amharic       302,408   302,231  99.94   64,778   64,778   100      21,590   21,590   100
Nuer          45        43       95.56   10       10       100      3        3        100
Oromo         42,177    40,882   96.93   9,033    8,865    98.14    3,011    2,985    99.14
Sidamo        2,711     2,683    98.97   581      577      99.31    193      193      100
Somali        42,090    40,688   96.67   9,016    8,770    97.27    3,005    2,985    99.33
Tigrigna      1,413     1,409    99.72   303      303      100      101      101      100
Total         391,426   388,512  99.26   83,846   83,428   99.50    27,944   27,898   99.84

Table 4.2 Evaluation on test data for SVM Classifier (n=3)

Figure 4.2 Evaluation (Accuracy) on test data for SVM Classifier (n=3)

For a 15-character window, the SVM classifier showed 99.26% accuracy on average; for a 100-character window, 99.50%. For the 300-character window, accuracy again improves with the larger character window: at 99.84% average accuracy, the SVM classifier does better than the 15-character window by 0.58% and better than the 100-character window by 0.34% on average.

The second experiment consists of varying the size of the training data used to train the
classifiers to simulate scarce data availability. Furthermore, this experiment investigates
the effect that homogeneously distributed training data (equal amounts of training data
for each language class) has on a classifier when compared to a training set with
heterogeneously distributed training data. The models are re-trained and tested with the
same data size for the seven languages in this study. The smallest training data size is the
data size taken for Nuer language (4,889 training characters). Therefore, similar data
sizes of 4,889 training characters were used per language. Tables 4.3 and 4.4 show the
test result for Naïve Bayes and SVM models for the seven languages in this study trained
under the same corpus size.

Naïve Bayes (n=3)
(Tests = number of test documents; Correct = number correctly identified; % = percentage correctly identified.)

              15-character window        100-character window       300-character window
Languages     Tests     Correct  %       Tests    Correct  %        Tests    Correct  %
Afar          582       533      91.58   125      120      96       41       41       100
Amharic       302,408   278,064  91.95   64,778   62,090   95.85    21,590   21,150   97.96
Nuer          45        39       86.67   10       10       100      3        3        100
Oromo         42,177    38,723   91.81   9,033    8,658    95.85    3,011    2,950    97.97
Sidamo        2,711     2,492    91.92   581      557      95.85    193      189      97.93
Somali        42,090    37,603   89.34   9,016    8,642    95.85    3,005    2,944    97.97
Tigrigna      1,413     1,296    91.72   303      291      96.04    101      99       98.02
Total         391,426   358,750  91.65   83,846   80,368   95.85    27,944   27,376   97.97

Table 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language
Figure 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language

SVM (n=3)
(Tests = number of test documents; Correct = number correctly identified; % = percentage correctly identified.)

              15-character window        100-character window       300-character window
Languages     Tests     Correct  %       Tests    Correct  %        Tests    Correct  %
Afar          582       547      93.99   125      121      96.8     41       40       97.56
Amharic       302,408   287,106  94.94   64,778   62,640   96.7     21,590   21,229   98.33
Nuer          45        41       91.11   10       10       100      3        3        100
Oromo         42,177    38,773   91.93   9,033    8,567    94.84    3,011    2,935    97.48
Sidamo        2,711     2,548    93.99   581      558      96.04    193      190      98.45
Somali        42,090    38,584   91.67   9,016    8,472    93.97    3,005    2,935    97.67
Tigrigna      1,413     1,338    94.69   303      293      96.7     101      99       98.02
Total         391,426   368,937  94.25   83,846   80,661   96.2     27,944   27,431   98.16

Table 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language
Figure 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language

From the above results, we can notice the effect that homogeneously distributed training
data (equal amounts of training data for each language class) has on a classifier when
compared to a training set with heterogeneously distributed training data.

For a 15-character window, the 3-gram Naïve Bayes and SVM classifiers showed 91.65% and 94.25% average accuracy respectively; for a 100-character window, 95.85% and 96.2% respectively. For the 300-character window, accuracy again improves with the larger character window. The 3-gram Naïve Bayes classifier, at 97.97% average accuracy, does better than the 15-character window by 6.32% and the 100-character window by 2.12% on average, while the SVM classifier, at 98.16% average accuracy, does better than the 15-character window by 3.91% and the 100-character window by 1.96% on average.

Dictionary Method
(Tests = number of test documents; Correct = number correctly identified; % = percentage correctly identified.)

              15-character window        100-character window       300-character window
Languages     Tests     Correct  %       Tests    Correct  %        Tests    Correct  %
Afar          582       198      34.02   125      80       64       41       39       95.12
Amharic       302,408   300,049  99.22   64,778   64,778   100      21,590   21,590   100
Nuer          45        44       97.78   10       10       100      3        3        100
Oromo         42,177    17,437   41.34   9,033    5,200    57.57    3,011    2,456    81.57
Sidamo        2,711     1,338    49.35   581      305      52.49    193      175      90.67
Somali        42,090    14,151   33.62   9,016    5,127    56.87    3,005    2,374    79
Tigrigna      1,413     1,401    99.15   303      303      100      101      101      100
Total         391,426   334,618  85.49   83,846   75,803   90.41    27,944   26,738   95.68

Table 4.5 Evaluation on test data for Dictionary Method

Figure 4.5 Evaluation (Accuracy) on test data for Dictionary Method

For a 15-character window, the dictionary method showed 85.49% accuracy on average; for a 100-character window, 90.41%. For the 300-character window, accuracy again improves with the larger character window: at 95.68% average accuracy, the dictionary method does better than the 15-character window by 10.19% and the 100-character window by 5.27% on average.

4.2 Confusion Matrix

Higher error rates were noticed on the smallest character window for all classifiers in this study. To understand the results better, the output of all classifiers was analyzed. The types of errors that occurred were investigated by using confusion matrices (Table 4.6 to Table 4.11). Each row represents the correct language of a set of samples, and each column the language selected by the classifier. The diagonal of the matrix holds the correctly identified samples, so a classifier with better overall accuracy has more samples on the diagonal.
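Building such a confusion matrix from pairs of (true, predicted) labels can be sketched as follows; this is an illustrative helper, not the evaluation code used in the thesis:

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, languages):
    # Rows: correct language of the samples; columns: language selected
    # by the classifier. Diagonal cells hold correctly identified samples.
    pairs = Counter(zip(true_labels, predicted_labels))
    return [[pairs[(t, p)] for p in languages] for t in languages]
```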

           Afar           Amharic           Nuer          Oromo            Sidamo           Somali            Tigrigna
Afar       560 (96.23%)   0 (0%)            2 (4.44%)     206 (0.49%)      0 (0.04%)        697 (1.66%)       0 (0%)
Amharic    0 (0%)         292,096 (96.59%)  0 (0%)        0 (0%)           0 (0%)           0 (0%)            52 (3.68%)
Nuer       2 (0.34%)      0 (0%)            41 (91.11%)   0 (0%)           0 (0%)           0 (0%)            0 (0%)
Oromo      10 (1.72%)     0 (0%)            0 (0%)        40,679 (96.45%)  60 (2.21%)       949 (2.25%)       0 (0%)
Sidamo     5 (0.86%)      0 (0%)            0 (0%)        249 (0.59%)      2,618 (96.57%)   888 (2.11%)       0 (0%)
Somali     5 (0.86%)      0 (0%)            2 (4.44%)     1,043 (2.47%)    33 (1.22%)       39,556 (93.98%)   0 (0%)
Tigrigna   0 (0%)         10,312 (3.41%)    0 (0%)        0 (0%)           0 (0%)           0 (0%)            1,361 (96.32%)

Table 4.6 Confusion matrix of Naïve Bayes Classifier (n=3) for 15-chw

Afar     Amharic   Nuer    Oromo   Sidamo   Somali   Tigrigna   Overall
3.77%    3.41%     8.89%   3.55%   3.43%    6.02%    3.68%      3.71%

Table 4.7 Error rates for languages calculated from confusion matrices in Table 4.6

           Afar           Amharic           Nuer          Oromo            Sidamo           Somali            Tigrigna
Afar       576 (98.97%)   0 (0%)            1 (2.22%)     155 (0.37%)      0 (0%)           947 (2.23%)       0 (0%)
Amharic    0 (0%)         302,231 (99.94%)  0 (0%)        0 (0%)           0 (0%)           0 (0%)            4 (0.28%)
Nuer       1 (0.17%)      0 (0%)            43 (95.56%)   0 (0%)           0 (0%)           0 (0%)            0 (0%)
Oromo      2 (0.34%)      0 (0%)            0 (0%)        40,882 (96.93%)  22 (0.81%)       256 (0.61%)       0 (0%)
Sidamo     1 (0.17%)      0 (0%)            0 (0%)        429 (1.02%)      2,683 (98.97%)   199 (0.48%)       0 (0%)
Somali     2 (0.34%)      0 (0%)            1 (2.22%)     711 (1.69%)      6 (0.22%)        40,688 (96.67%)   0 (0%)
Tigrigna   0 (0%)         177 (0.06%)       0 (0%)        0 (0%)           0 (0%)           0 (0%)            1,409 (99.72%)

Table 4.8 Confusion matrix of SVM Classifier (n=3) for 15-chw

Afar Amharic Nuer Oromo Sidamo Somali Tigrigna Overall
1.03% 0.06% 4.44% 3.07% 1.03% 3.33% 0.28% 0.74%

Table 4.9 Error rates for languages calculated from confusion matrices in Table 4.8

           Afar           Amharic           Nuer          Oromo            Sidamo           Somali            Tigrigna
Afar       198 (34.02%)   0 (0%)            0 (0%)        2,205 (5.23%)    214 (7.89%)      7,961 (18.91%)    0 (0%)
Amharic    0 (0%)         300,049 (99.22%)  0 (0%)        0 (0%)           0 (0%)           0 (0%)            12 (0.85%)
Nuer       0 (0%)         0 (0%)            44 (97.78%)   0 (0%)           0 (0%)           4 (0.01%)         0 (0%)
Oromo      184 (31.62%)   0 (0%)            0 (0%)        17,437 (41.34%)  879 (32.42%)     15,152 (36%)      0 (0%)
Sidamo     63 (10.82%)    0 (0%)            0 (0%)        5,795 (13.74%)   1,338 (49.35%)   4,822 (11.46%)    0 (0%)
Somali     137 (23.54%)   0 (0%)            1 (2.22%)     16,740 (39.69%)  280 (10.33%)     14,151 (33.62%)   0 (0%)
Tigrigna   0 (0%)         2,359 (0.78%)     0 (0%)        0 (0%)           0 (0%)           0 (0%)            1,401 (99.15%)

Table 4.10 Confusion matrix of Dictionary Method for 15-chw

Afar     Amharic   Nuer    Oromo    Sidamo   Somali   Tigrigna   Overall
65.98%   0.78%     2.22%   58.66%   50.65%   66.38%   0.85%      14.51%

Table 4.11 Error rates for languages calculated from confusion matrices in Table 4.10

From the above three confusion matrices for a window size of 15 characters, we can see that errors arise from confusions among all seven languages. When using the 3-gram Naïve Bayes classifier, the most wrongly classified language is Nuer, with an error rate of 8.89%; Nuer test documents were classified as Afar and as Somali 4.44% of the time each. The second most wrongly classified language is Somali, with an error rate of 6.02%; Somali test documents were classified as Afar 1.66% of the time, as Oromo 2.25% of the time and as Sidamo 2.11% of the time. When using the SVM classifier, the most wrongly classified language is again Nuer, with an error rate of 4.44%; Nuer test documents were classified as Afar and as Somali 2.22% of the time each. The second most wrongly classified language is Somali, with an error rate of 3.33%; Somali test documents were classified as Afar 2.23% of the time, as Oromo 0.61% of the time and as Sidamo 0.48% of the time. When using the Dictionary Method, the most wrongly classified language is Somali, with an error rate of 66.38%; Somali test documents were classified as Afar 18.91% of the time, as Nuer 0.01% of the time, as Oromo 36% of the time and as Sidamo 11.46% of the time. This shows that Nuer has a higher relationship with Afar and Somali, and that Somali has a higher relationship with Afar, Oromo and Sidamo, and vice versa.

4.3 Discussion

From the above results, we can see that classification accuracy increases as the test window becomes larger, to the extent that the SVM classifier with a 300-character window achieved an error rate of 0.16%. For the smallest test samples, accurate classification was difficult. Considering the best classifier in this research, the SVM, the lowest error rate was 0.06% and the highest was 4.44%. Increasing the size of the character windows improved the identification accuracy in all cases.

We can also notice the effect that homogeneously distributed training data (equal
amounts of training data for each language class) has on a classifier when compared to a
training set with heterogeneously distributed training data from the above results.

For the 15-character window, the SVM classifier trained with 3-grams did slightly better than the 3-gram Naïve Bayes classifier on the same dataset of 4,889 training characters per language; the difference between the two is 2.6% on average.

For a 100-character window, the SVM classifier trained with 3-grams outperforms the 3-gram Naïve Bayes classifier, which has 95.85% accuracy; the Naïve Bayes classifier is 0.35% lower on average. For a 300-character window, as with the 100-character window, the SVM classifier slightly outperforms the 3-gram Naïve Bayes classifier, by 0.19% on average. The improvement in accuracy with increasing size of character windows is again noticed. We can also see that the accuracies of the SVM and Naïve Bayes classifiers decrease when smaller training data are employed.

Tracking the performance of the classifiers across different window sizes for every
language is shown in Figures 4.6 to 4.12.

[Figures 4.6 to 4.12 plot, for each language, the error rates (%) of the Naïve Bayes, SVM and Dictionary Method classifiers across the 15-, 100- and 300-character windows.]

Figure 4.6 Classification Results for Afar Language
Figure 4.7 Classification Results for Amharic Language
Figure 4.8 Classification Results for Nuer Language
Figure 4.9 Classification Results for Oromo Language
Figure 4.10 Classification Results for Sidamo Language
Figure 4.11 Classification Results for Somali Language
Figure 4.12 Classification Results for Tigrigna Language

4.4 Multilingual Identification

The Naïve Bayes classifier is used for multilingual language identification. For mixed-language input, the top three languages found are returned. To evaluate multilingual identification, an artificial corpus containing 1,050 documents was constructed. Each document is a concatenation of three sections in different languages, constructed randomly from a collection of medium-length texts per language (6 to 50 words).
Multilingual Identification

Languages                    No. of test   No. correctly   % correctly
                             documents     identified      identified
Afar, Amharic, Nuer               30             30            100
Afar, Amharic, Oromo              30             28             93.33
Afar, Amharic, Sidamo             30             30            100
Afar, Amharic, Somali             30             28             93.33
Afar, Amharic, Tigrigna           30             29             96.67
Afar, Nuer, Oromo                 30             28             93.33
Afar, Nuer, Sidamo                30             30            100
Afar, Nuer, Somali                30             28             93.33
Afar, Nuer, Tigrigna              30             30            100
Afar, Oromo, Sidamo               30             26             86.67
Afar, Oromo, Somali               30             28             93.33
Afar, Oromo, Tigrigna             30             28             93.33
Afar, Sidamo, Somali              30             28             93.33
Afar, Sidamo, Tigrigna            30             30            100
Afar, Somali, Tigrigna            30             28             93.33
Amharic, Nuer, Oromo              30             30            100
Amharic, Nuer, Sidamo             30             30            100
Amharic, Nuer, Somali             30             30            100
Amharic, Nuer, Tigrigna           30             29             96.67
Amharic, Oromo, Sidamo            30             28             93.33
Amharic, Oromo, Somali            30             28             93.33
Amharic, Oromo, Tigrigna          30             29             96.67
Amharic, Sidamo, Somali           30             30            100
Amharic, Sidamo, Tigrigna         30             29             96.67
Amharic, Somali, Tigrigna         30             29             96.67
Nuer, Oromo, Sidamo               30             28             93.33
Nuer, Oromo, Somali               30             28             93.33
Nuer, Oromo, Tigrigna             30             30            100
Nuer, Sidamo, Somali              30             28             93.33
Nuer, Sidamo, Tigrigna            30             30            100
Nuer, Somali, Tigrigna            30             30            100
Oromo, Sidamo, Somali             30             26             86.67
Oromo, Sidamo, Tigrigna           30             28             93.33
Oromo, Somali, Tigrigna           30             28             93.33
Sidamo, Somali, Tigrigna          30             28             93.33
Total                          1,050          1,005             95.71

Table 4.12 Evaluation on test data for Multilingual Identification

[Figure 4.13 plots the accuracy (%) for each of the three-language combinations listed in Table 4.12.]

Figure 4.13 Evaluation (Accuracy) on test data for Multilingual Identification

When the top three identified languages are returned, the approximate percentage of the total text bytes per language is also returned (e.g. 80% Amharic and 20% Somali out of 1,000 bytes of text means about 800 bytes of Amharic and 200 bytes of Somali). From the above table, we can see that 45 out of 1,050 documents were wrongly classified, which corresponds to 95.71% accuracy. We can also see that the highest error rates (13.33%) result from confusions within the Afar, Oromo, Sidamo and Somali languages. The most wrongly classified language combinations are 'Afar, Oromo, Sidamo' and 'Oromo, Sidamo, Somali', where Sidamo was classified as Oromo in both cases. This shows that Sidamo has a higher relationship with Oromo and vice versa.
5 CONCLUSION AND RECOMMENDATION

5.1 Conclusion

In this research, an adequate mechanism for efficient language identification of major Ethiopian languages was presented. The languages used for the experiment come from different linguistic classifications, namely the Cushitic and Semitic branches of the Afro-Asiatic family and the Nilotic branch of the Nilo-Saharan family. Seven local languages are used in this study: Afar, Amharic, Nuer, Oromo, Sidamo, Somali and Tigrigna. These languages were chosen because they are spoken by more than 79.3% of the total population of Ethiopia.

Factors affecting accuracy such as the size and variety of training data, the size of the
string to be identified, and the type of classifier employed were investigated. Three
approaches were considered in this study: Naïve Bayes, SVM, and the dictionary method.
Character n-gram statistics were used as classification features for Naïve Bayes and SVM.
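As an illustration of the character n-gram features mentioned above, the following is a minimal sketch of trigram extraction and counting. It is a simplified stand-in for the feature extractor used in the experiments, with space padding at word boundaries (a common convention, assumed here rather than taken from the thesis code):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of a text, padded with spaces
    so that word boundaries are captured."""
    padded = " " + text.strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Frequency profile of trigrams for a toy training sample
profile = Counter(char_ngrams("selam selam"))
print(profile.most_common(3))
```

A per-language profile built this way provides the feature counts that a Naïve Bayes or SVM classifier can be trained on.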

The Naïve Bayes and SVM classifiers were trained by using character n-gram of size 3 as
a feature set. The dictionary method uses stopwords. Three different sizes of character
windows were used to perform the tests. These sizes were chosen to provide a range of
challenges for classification (the larger the test string, the easier the classification task
becomes). Text from various domains was collected and was preprocessed before
building the models. Naïve Bayes and SVM classifiers were trained with heterogeneously
distributed training data at the first experiment and then with homogeneously distributed
training data (equal amounts of training data for each language class) at the second
experiment. All classifiers were tested under the same conditions.

In the first experiment, which utilizes the entire data set to train and test the classifiers
with seven language classes, the 3-gram Naïve Bayes and SVM classifiers showed an
average classification accuracy of 98.37% and 99.53% respectively. In the second
experiment, which focused on varying the size of the training data used to train the
classifiers to simulate scarce data availability and which investigated the effect that
homogeneously distributed training data has on a classifier, the 3-gram Naïve Bayes and
SVM classifiers showed an average classification accuracy of 95.16% and 96.2%
respectively. The dictionary method showed an average classification accuracy of
90.53%.

A confusion matrix analysis made it clear that samples were often classified incorrectly
as other languages within the same language families. For the larger window sizes, the
confusion is not that significant and classification is better. Increasing the character
window size and the training data size improved the identification accuracy in all
cases.
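The confusion-matrix analysis described above can be sketched in a few lines of standard-library Python; the labels and predictions below are invented for illustration, not the actual test outputs:

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; off-diagonal entries are confusions."""
    return Counter(zip(true_labels, predicted_labels))

# Toy example: Sidamo misclassified as Oromo once (illustrative data only)
y_true = ["Sidamo", "Sidamo", "Oromo", "Amharic"]
y_pred = ["Sidamo", "Oromo", "Oromo", "Amharic"]
cm = confusion_matrix(y_true, y_pred)
print(cm[("Sidamo", "Oromo")])
```

Inspecting the off-diagonal counts per language pair is what reveals the within-family confusions reported above.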

For multilingual language identification, the Naïve Bayes classifier is used. The top
three languages found are returned for mixed-language input. Therefore, a concatenation
of three sections in different languages, constructed randomly from a collection of
medium-length texts per language (6 to 50 words), is used for evaluating multilingual
identification. This led to 95.71% accuracy.
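Once a Naïve Bayes classifier produces a score per language, selecting the top three is a simple sort. The scores below are invented for illustration and are not actual classifier outputs:

```python
def top_k_languages(scores, k=3):
    """Return the k best-scoring languages, highest score first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented log-probability scores for a mixed Amharic/Somali input
scores = {"Amharic": -120.5, "Somali": -130.2, "Oromo": -210.7,
          "Tigrigna": -215.0, "Afar": -300.1}
print(top_k_languages(scores))  # ['Amharic', 'Somali', 'Oromo']
```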

The main challenge in this study is identification of closely related languages that share
similar character sequences and lexical units (e.g. Amharic and Tigrigna). Another
challenge faced by the researchers is identifying the language of short excerpts from
texts, particularly those containing non-standard language, as well as the unavailability
of a standard corpus.

5.2 Recommendation

The results of this research are beneficial and can be applied in any area that uses
language identification. Language identification of text is important because large
quantities of text are processed or filtered automatically for tasks such as spell
checking, information retrieval and machine translation. The results of this research
can therefore be used by anyone who is interested in these research areas.

Since Ethiopia is a multilingual country, other languages from different linguistic
classifications can be added, which would improve the coverage of language identification
systems by increasing the number of languages that the systems are able to recognize.

For further studies, the researchers recommend the use of classification methods
combined with linguistically motivated features such as POS tags and morphological
information which can provide empirical evidence on the convergences and divergences
of language varieties in terms of lexicon, orthography, morphology and syntax.

APPENDIX

Appendix 1 Sample Codes

Dictionary Method
# -*- coding: utf-8 -*-
# Stopword (dictionary) based language detector with a simple Tkinter GUI
import Tkinter
import tkMessageBox
from Tkinter import *
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

win = Tk()
win.title("Language Detector")
win.geometry("500x500")

def act():
    languages_ratios = {}
    # Split all punctuation into separate tokens
    tokens = wordpunct_tokenize(svalue.get())
    words = [word.lower() for word in tokens]
    # For each language included in nltk, count the number of unique
    # stopwords appearing in the analyzed text
    #lan = ['Afar', 'Amharic', 'Nuer', 'Oromo', 'Sidamo', 'Somali', 'Tigrigna']
    #for language in lan:
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)
    # The language whose stopword list overlaps the text the most wins;
    # if no stopwords were matched at all, report failure
    most_rated_language = max(languages_ratios, key=languages_ratios.get)
    if languages_ratios[most_rated_language] == 0:
        tkMessageBox.showinfo("Language is", "Can't detect this text")
    else:
        tkMessageBox.showinfo("Language is", most_rated_language)

svalue = StringVar()  # defines the widget state as string
w = Entry(win, textvariable=svalue, width=60)  # adds a text-entry widget
foo = Button(win, text="Check Language", command=act, fg="black")
foo.config(width=20, height=2)
w.pack(ipady=50, pady=20)
foo.pack()
win.mainloop()

Generate N-Grams Frequency Profiles for Languages
import operator
import os
import sys

try:
    from nltk.tokenize import RegexpTokenizer
    from nltk.util import ngrams
except ImportError:
    print '[!] You need to install nltk (http://nltk.org/index.html)'
    sys.exit(-1)

LANGDATA_FOLDER = '.'

class NGramProfiler:  # wrapper class (name assumed); the methods use self

    def __init__(self):
        """Constructor"""
        self._languages_statistics = {}
        self._tokenizer = RegexpTokenizer("[a-zA-Z]+")
        self._langdata_path = LANGDATA_FOLDER

    def _tokenize_text(self, raw_text):
        tokens = self._tokenizer.tokenize(raw_text)
        return tokens

    def _generate_ngrams(self, tokens):
        generated_ngrams = []
        for token in tokens:
            for x in xrange(2, 5):  # generate N-grams, for N=2 to 4
                xngrams = ngrams(token, x, pad_left=True,
                                 pad_right=True, pad_symbol=' ')
                for xngram in xngrams:
                    ngram = ''.join(xngram)
                    generated_ngrams.append(ngram)
        return generated_ngrams

    def _count_ngrams_and_hash_them(self, ngrams_list):
        # Build a frequency dictionary mapping each n-gram to its count
        ngrams_statistics = {}
        for ngram in ngrams_list:
            if ngram not in ngrams_statistics:
                ngrams_statistics[ngram] = 1
            else:
                ngrams_statistics[ngram] += 1
        return ngrams_statistics

    def _calculate_ngram_occurrences(self, text):
        tokens = self._tokenize_text(text)
        ngrams_list = self._generate_ngrams(tokens)
        ngrams_statistics = self._count_ngrams_and_hash_them(ngrams_list)
        # Keep the 10000 most frequent n-grams, sorted by count
        ngrams_statistics_sorted = sorted(ngrams_statistics.iteritems(),
                                          key=operator.itemgetter(1),
                                          reverse=True)[0:10000]
        return ngrams_statistics_sorted

    def generate_ngram_frequency_profile_from_raw_text(self, raw_text,
                                                       output_filename):
        output_filenamepath = os.path.join(self._langdata_path,
                                           output_filename)
        profile_ngrams_sorted = self._calculate_ngram_occurrences(raw_text)
        fd = open(output_filenamepath, mode='w')
        for ngram in profile_ngrams_sorted:
            fd.write('%s\t%s\n' % (ngram[0], ngram[1]))
        fd.close()

    def generate_ngram_frequency_profile_from_file(self, file_path,
                                                   output_filename):
        raw_text = open(file_path, mode='r').read()
        self.generate_ngram_frequency_profile_from_raw_text(raw_text,
                                                            output_filename)

Appendix 2 Sample Outputs

Monolingual Identification (Naïve Bayes)

Multilingual Identification (Naïve Bayes)

