AUTOMATIC IDENTIFICATION OF
MAJOR ETHIOPIAN LANGUAGES
Tadesse, Biruk
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES
FACULTY OF COMPUTING
A thesis submitted to the School of Research and Postgraduate Studies of Bahir Dar
Institute of Technology, BDU, in partial fulfillment of the requirements for the degree
of
Master of Science in Computer Science
ACKNOWLEDGMENTS

First and foremost, I would like to thank the almighty God who made all things possible. I would like to express my sincere thanks to my advisor, Professor Bandaru Rama Krishna Rao, for his important comments and encouragement. Most importantly, I thank my parents and my family for giving me the possibility to become who I am today. I am also grateful to my instructors, my colleagues, my friends and my classmates for the support and encouragement they provided me at all times.

Last but not least, I would like to thank Bahir Dar University for being a healthy environment where I can learn and be challenged, and Wachemo University for granting me study leave to attend the graduate program of the School of Computing in Bahir Dar University.
ABSTRACT
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
1. INTRODUCTION
2 LITERATURE REVIEW
2.1.2.3 Detection based on dictionaries containing stopwords
REFERENCES
APPENDIX
LIST OF ABBREVIATIONS
LIST OF FIGURES
Figure 2.1 An illustration of the rank order method (Adapted from Cavnar & Trenkle, 1994)
Figure 2.2 Support vector machine concept
Figure 2.3 Decision tree: play tennis example (Mitchell, 1997)
Figure 2.4 Neural network architecture
Figure 3.1 Steps to identify a particular language
Figure 3.2 Naïve Bayes and SVM models applied to text languages in this study
Figure 3.3 Dictionary Method applied to text languages in this study
Figure 4.1 Evaluation (Accuracy) on test data for Naïve Bayes Classifier (n=3)
Figure 4.2 Evaluation (Accuracy) on test data for SVM Classifier (n=3)
Figure 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with homogeneously distributed training data per language
Figure 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously distributed training data per language
Figure 4.5 Evaluation (Accuracy) on test data for Dictionary Method
Figure 4.6 Classification Results for Afar Language
Figure 4.7 Classification Results for Amharic Language
Figure 4.8 Classification Results for Nuer Language
Figure 4.9 Classification Results for Oromo Language
Figure 4.10 Classification Results for Sidamo Language
Figure 4.11 Classification Results for Somali Language
Figure 4.12 Classification Results for Tigrigna Language
Figure 4.13 Evaluation (Accuracy) on test data for Multilingual Identification
LIST OF TABLES
Table 3.1 Corpora: Facts and Figures - Number of Words/Tokens and Types
Table 3.2 Corpus size for training (90%) and testing (10%) the models
Table 3.3 Data statistics for training and testing the models with homogeneously distributed training data per language
Table 3.4 Word statistics of each language to generate test characters window. Average word lengths and number of words per character window are indicated.
Table 3.5 Number of test documents for the three characters windows
Table 3.6 Number of test documents for Multilingual Identification
Table 3.7 Dictionary size (number of stopwords) for the Dictionary Method
Table 4.1 Evaluation on test data for Naïve Bayes Classifier (n=3)
Table 4.2 Evaluation on test data for SVM Classifier (n=3)
Table 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with homogeneously distributed training data per language
Table 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously distributed training data per language
Table 4.5 Evaluation on test data for Dictionary Method
Table 4.6 Confusion matrix of Naïve Bayes Classifier (n=3) for 15-chw
Table 4.7 Error rates for languages calculated from confusion matrices in Table 4.6
Table 4.8 Confusion matrix of SVM Classifier (n=3) for 15-chw
Table 4.9 Error rates for languages calculated from confusion matrices in Table 4.8
Table 4.10 Confusion matrix of Dictionary Method for 15-chw
Table 4.11 Error rates for languages calculated from confusion matrices in Table 4.10
Table 4.12 Evaluation on test data for Multilingual Identification
1. INTRODUCTION
1.1 Background
Ethiopia is a diverse country with various cultural and traditional differences. According
to Ethnologue, (2017), there are ninety individual languages spoken in Ethiopia. Most
people in the country speak Afro-asiatic languages of the Cushitic or Semitic branches.
The Cushitic Languages are mostly spoken in central, southern and eastern Ethiopia
(mainly in Afar, Oromia and Somali regions). The Semitic Languages are spoken in
northern, central and eastern Ethiopia (mainly in Tigray, Amhara, Harar and northern part
of the Southern Peoples' State regions). They use the Ge'ez script that is unique to the
country. The Omotic Languages are predominantly spoken between the Lakes of
southern Rift Valley and the Omo River. The Nilo-Saharan Languages are largely spoken
in the western part of the country along the border with Sudan (mainly in Gambella and
Benshangul regions) (Ethiopian Languages, 2008).
The language identification problem has long been researched both in the text domain and in the speech domain (House & Neuburg, 1977). Computational methods can be applied to determine a
document’s language before undertaking further processing. State-of-the-art methods of
language identification for most European languages present satisfactory results above
95% accuracy (Martins & Silva, 2005).
Digital document collections grow larger every day, and dealing with large amounts of textual data from external sources is a problem when many languages are involved. Because users may wish to work with information in potentially any language, it is necessary to have language identification systems that provide identifiers for all languages. Information is usually provided in a context that eliminates ambiguities; however, if we do not recognize the language, processing stops. In this case, knowing the name of the language would help.
Large quantities of text are processed automatically for tasks such as spelling and
grammar checking, information retrieval, search engines, language translation and text
mining. In a multilingual environment, language processing is often initiated with some
form of language identification. A document processing system needs to determine the
language of textual contents before topic identification, translation, stemming or search
and a text-to-speech system needs to determine the source language of short pieces of text
to determine pronunciation rules, prosodic models and phrasing strategies.
Language identification exploits the fact that human language has structure. For example, in English it is very common for the letter 'u' to follow the letter 'q', while this is not the case in Amharic or other Ethiopian languages. Certain combinations of letters are more likely in some languages than in others, and n-grams work by capturing this structure, as the sketch below illustrates.
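A minimal sketch of character n-gram extraction follows (not the thesis's actual code); the function name and the space padding, which lets word boundaries form n-grams too, are illustrative choices.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all character n-grams of a text (n=3 gives trigrams)."""
    padded = " " + text.strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# 'qu' appears among English trigrams but rarely in Ethiopian languages,
# so trigram frequency profiles already separate many languages.
print(Counter(char_ngrams("the quick question")).most_common(5))
```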
Ethiopia has a large number of languages, with speakers ranging from thousands to millions; those with the most speakers are considered major languages. In most cases, frequent code switching and code mixing are also observed. If we could segment multilingual documents language-wise, it would be very useful both for the exploration of linguistic phenomena, such as code switching and code mixing, and for the appropriate computational processing of each segment.
The seven languages studied here belong to the Cushitic and Semitic branches of the Afro-Asiatic linguistic classification and to the Nilotic branch of the Nilo-Saharan linguistic classification. These languages were chosen because they are spoken by more than 79.3% of the total population of Ethiopia.
Advances in language identification have been made, but coverage is still very limited and available for only a small number of languages. Although it is unpublished, the only other research (to my knowledge) that includes Ethiopian languages in a text-based language identification task was done by Desta, (2014). Although he reported a substantially high performance for four Cushitic languages of Ethiopia, his research did not consider factors that can determine identification accuracy, such as the amount and variety of training data. He did not investigate the effect that homogeneously distributed training data (equal amounts of training data for each language class) has on a classifier when compared to a training set with heterogeneously distributed training data. Moreover, his research is restricted to four Cushitic languages and did not include languages from other linguistic classifications. Furthermore, his research did not consider multilingual identification for mixed-language input. Thus, text-based language identification for the Ethiopian languages is still open for research.
The aim of this research is to investigate text-based language identification and to explore
factors that determine how accurately Ethiopian languages can be identified based on
written text in their respective writing systems.
The research is intended to answer the following research questions.
1.3 Objective of the study
General objective: The general objective of this study is to propose text-based language
identifier techniques for the seven major Ethiopian languages.
Specific objectives:
The main intent of the study is to explore existing text language identification techniques
and compare their performances by considering factors that determine identification
accuracy on seven major Ethiopian languages. To achieve the objectives of the study,
machine learning methods (Naïve Bayes and support vector machine) that use character
n-grams of size 3 as a feature set, and dictionaries of stopwords, are used. Documents
written in 7 languages are compiled and processed from various sources resulting in
slightly over 11.7 million tokens. The research includes both monolingual and
multilingual language identification.
A large number of classifiers have been applied to text-based language identification, and experimenting with all possible combinations was not feasible. Using a purely linguistic approach is undeniably the better choice to achieve high classification accuracy, but it requires a large amount of linguistic expertise. Therefore, using statistical approaches is a feasible alternative.
Although this research aimed to include all languages of Ethiopia, the data used in this study was collected from only seven languages. These seven languages were chosen because they are spoken by more than 79.3% of the total population and are considered major languages in Ethiopia.
Language identification methods are vital to many NLP and information retrieval
applications. They are necessary, for example, to aid document collection creation in
scenarios where the languages of documents are not known. A clear example is documents crawled from the Internet, which are often unlabeled with regard to their language; language identification methods are then applied to identify the language of each document (Ljubesic, Fiser, & Erjavec, 2014). Machine translation (MT) also benefits
from language identification methods, because to translate a document to a target
language, it is first necessary to determine its source language. These features are present
in different translation tools and web-browsers such as Google Chrome. Finally, corpus
creation for low-resource languages is another area that uses language identification
methods (Emerson, Tan, Fertmann, Palmer, & Regneri, 2014).
Identification of the language of a given text is therefore an important problem in the Ethiopian context, and this study could also serve as a baseline for further studies on language identification in Ethiopia.
In order to achieve the specific and general objectives of the study and answer the
research questions, the following research methods are used.
A research design includes the structure of a study and the strategies for conducting that
study (Kerlinger, 1973). For the purpose of conducting this research, experimental
research design is used. Experimental research design is a way to carefully plan
experiments in advance so that results are both objective and valid (Stephanie, 2018).
The experimentation conducted in this study is separated into two experiments, with each
experiment differing with regards to the amount of data used for training.
The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. The second experiment consists of varying the size of the training data
used to train the classifiers to simulate scarce data availability. Furthermore, this
experiment investigates the effect that homogeneously distributed training data (equal
amounts of training data for each language class) has on a classifier when compared to a
training set with heterogeneously distributed training data. The results of the experiments
are analyzed individually at the end of each experiment and an overall analysis is done to
summarize the observations.
Literature review
Related literature, including books, journal articles and prior research works, is reviewed to understand language identification, its development tools, techniques, procedures and methodologies.
Text corpora of the major Ethiopian languages are collected from sources such as newspapers, TV/radio news, religious books (Bible, Quran), academic books, and language dictionaries, to ensure the corpus spans various domains. Documents written in 7 languages are compiled and processed from these sources, resulting in slightly over 11.7 million tokens. The WebCorp tool, a suite of tools that allows access to the World Wide Web as a corpus, is used to collect text from different websites. WebCorp was created and is operated and maintained by the Research and Development Unit for English Studies (RDUES) in the School of English at Birmingham City University (WebCorp, 2018). Different data preprocessing techniques are applied to prepare the data used by the algorithms chosen for the research. Python scripts are used for data cleaning and preprocessing (Python, 2018). After data preprocessing, the classifiers are trained using 90% of the collected corpus and tested using the remaining 10%, with the partition generated randomly from the collected corpus. This data partitioning is chosen because it has been suggested in several works in the machine learning and pattern recognition literature. Moreover, a series of runs with different amounts of training data was performed, and no performance enhancement was achieved by further increasing the size of the training corpus. Quantitative detail about the corpora can be found in section 3.2.
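A minimal sketch of such a 90%/10% random partition is shown below; the function and variable names are illustrative, not the actual scripts used in this study.

```python
import random

def split_corpus(tokens, train_ratio=0.9, seed=42):
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)    # random but reproducible partition
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]    # (training 90%, testing 10%)

train, test = split_corpus(["afaan", "oromoo", "barnoota", "kitaaba"] * 100)
print(len(train), len(test))                 # 360 40
```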
Model evaluation
The classifiers and the ranking methods are implemented in Python and use functions available in the Natural Language Toolkit (NLTK). The classification techniques used in this research are evaluated on a test dataset based on their identification accuracy. Accuracy and error rate are calculated to evaluate the performance of the models. Details about evaluation and computational techniques can be found in section 3.3.
1.8 Structure of the thesis
The thesis is organized into 5 chapters. The first chapter covers introduction, background,
problem statement, objective of the study, scope and limitation of the study, significance
of the study, and research methods used. The second chapter covers literature review,
overview of major Ethiopian languages, approaches to language identification, evaluation
of language identification methods, and related works. The third chapter covers the
methodology and the techniques followed in this research. It discusses data preparation
for the experiment and models that are implemented. Chapter four discusses the
experiment and the results. The last chapter presents conclusions and recommendations.
2 LITERATURE REVIEW
2.1.1 Overview
There are a number of situations in which the source language of a document is unknown
and computational methods can be applied to determine it. This makes language
identification a relevant task that can be integrated with most NLP applications such as
machine translation (for cases in which knowing the source language or language variety
of a document is vital for further processing) or information retrieval.
As discussed by Lui, (2014), language identification systems are typically divided into four main steps. Given a set of documents written in different languages, the system implements the following:
Classification function: define a function that best represents the similarity between a
document and each language model;
Prediction or output: compute the highest-scoring model to determine the language of the
document.
For languages with a unique alphabet not used by any other languages, language
identification is determined by character set identification. Similarly, character set
identification can be used to narrow the task of language identification to a smaller
number of languages that all share many characters such as Arabic vs. Persian, Russian
vs. Ukrainian, or Norwegian vs. Swedish (Palmer, 2010).
At this point it is not difficult to recognize a number of challenges faced by the language
identification systems. One of them is the identification of closely related languages that
share similar character sequences and lexical units (e.g. Amharic and Tigrigna). Another
challenge faced by language identification systems is identifying the language of short
excerpts from texts particularly those containing non-standard language. Systems have
difficulty when confronted with small excerpts from texts (a few words) that do not
provide enough data for algorithms to classify them correctly. This difficulty is even
more evident for texts available on the Internet, such as tweets, because these are often
noisy and contain non-standard spelling and/or code alternation (Nguyen & Dogruoz,
2013).
For example, the word "example" yields the following bi-grams: ex, xa, am, mp, pl, le.
N-grams are simple to compute for any given text, and in many language identification and related tasks involving several languages they have been shown to perform well. They automatically capture the roots of the most frequent words, operate independently of language, and are tolerant of spelling errors and of distortions caused by optical scanners. They also do not require the stopword removal or stemming steps that improve the performance of word-based systems.
The most widely used ranking method for language identification and text categorization is the one by Cavnar & Trenkle, (1994). This method generates language-specific profiles which contain the most frequent character n-grams of the training corpus, sorted by frequency. Figure 2.1 illustrates the ranking method.
In this approach, n-gram combinations in the training and testing sets are ordered from
most to least frequent. Cavnar and Trenkle build n-gram frequency “profiles” for several
languages and classify text by matching it to the profiles.
We can create a similar text profile from the text to be classified and then calculate a cumulative n-gram rank difference measure between the text profile and each language profile, which determines how far out of place any n-gram in one profile is from its place in the other profile. If an n-gram in the text profile is absent from the language profile, the maximum distance is assigned, equal to the number of n-grams in the language profile. To identify the language of the test text, the sum of all rank distances is calculated, and the most probable language is the one with the smallest total distance.
language profile    text profile    out-of-place measure
(most frequent)
TH                  TH              0
ER                  ING             3
ON                  ON              0
LE                  ER              2
ING                 AND             1
AND                 ED              no match = max
...                 ...
(least frequent)                    sum = total distance
Figure 2.1 An illustration of the rank order method (Adapted from Cavnar & Trenkle, 1994)
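The sketch below implements the out-of-place measure just described, assuming each profile is a list of n-grams sorted from most to least frequent; it reproduces the distances shown in Figure 2.1.

```python
def out_of_place(text_profile, lang_profile):
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_dist = len(lang_profile)             # "no match = max" penalty
    total = 0
    for i, gram in enumerate(text_profile):
        total += abs(i - rank[gram]) if gram in rank else max_dist
    return total

lang = ["TH", "ER", "ON", "LE", "ING", "AND"]   # language profile
text = ["TH", "ING", "ON", "ER", "AND", "ED"]   # text profile
print(out_of_place(text, lang))                 # 0+3+0+2+1+6 = 12
# With several language profiles, the language with the smallest
# total distance is chosen.
```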
The features are assumed to be statistically independent given the language. This assumption is embodied in the naive Bayesian classifier by Good, (1965), and Equation 2.1 reduces to Equation 2.2.
\[ P(c_j \mid D) = P(c_j)\cdot \frac{\prod_{i=1}^{m} P(f_i \mid c_j)}{P(D)} \qquad \text{Equation 2.2} \]
To find the most probable language c of the document, a maximum a posteriori (MAP) classifier c_MAP is constructed in Equation 2.3; it maximizes the posterior P(c_j|D). In Equation 2.5, P(D) is eliminated, as it is a constant for all languages. The probability of occurrence of each language, P(c), is assumed equal and is also excluded. Therefore, c_MAP becomes equal to the maximum likelihood classifier, as shown in Equation 2.6.
\[ c_{\mathrm{MAP}} = \arg\max_{c \in C} P(c \mid D) \qquad \text{Equation 2.3} \]
\[ = \arg\max_{c \in C} \left\{ P(c)\cdot \frac{\prod_{i=1}^{m} P(f_i \mid c)}{P(D)} \right\} \qquad \text{Equation 2.4} \]
\[ = \arg\max_{c \in C} \left\{ P(c)\cdot \prod_{i=1}^{m} P(f_i \mid c) \right\} \qquad \text{Equation 2.5} \]
\[ = \arg\max_{c \in C} \prod_{i=1}^{m} P(f_i \mid c) \qquad \text{Equation 2.6} \]
As a result, for this classifier, a list of N-grams with possible duplicates is generated and
their internal frequencies, obtained from the training data, are multiplied. The language
with the maximal result is chosen as the language of the document.
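A minimal sketch of this classifier follows. Likelihoods are multiplied in log space to avoid numerical underflow, and add-one smoothing handles unseen trigrams; both are standard choices rather than details specified in the text, and the toy corpora are illustrative.

```python
import math
from collections import Counter

def train_nb(corpora):                        # corpora: {language: training text}
    models = {}
    for lang, text in corpora.items():
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        models[lang] = (counts, sum(counts.values()), len(counts) + 1)
    return models

def classify_nb(text, models):
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    def log_likelihood(lang):
        counts, total, vocab = models[lang]   # add-one smoothed probabilities
        return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return max(models, key=log_likelihood)    # maximum likelihood language

models = train_nb({"amh": "selam new dehna neh", "orm": "akkam jirta nagaa"})
print(classify_nb("nagaa jirta", models))     # -> 'orm'
```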
Support Vector Machines are able to handle large feature spaces. This is particularly helpful when classifying texts, because algorithms often have to deal with a very large number of features. The learning algorithm finds a linear classification boundary (hyperplane) that separates the training examples and maximizes the margin to the closest training examples with regard to their class labels (Kruengkrai, Srichaivattana, Sorlertlamvanich, & Isahara, 2005).
\[ f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + b \qquad \text{Equation 2.7} \]
where the x_i are the support vectors, obtained from the training set by an optimization process (Collobert & Bengio, 2001). The target values t_i are either 1 or -1, depending upon whether the corresponding support vector is in class 0 or class 1. For classification, a class decision is based upon whether the value f(x) is above or below a threshold.
The kernel K(·,·) is constrained to have certain properties (the Mercer condition), so that K(·,·) can be expressed as
\[ K(x, y) = b(x) \cdot b(y) \qquad \text{Equation 2.8} \]
where b(x) is a mapping from the input space (where x lives) to a possibly infinite dimensional space.
The optimization relies upon a maximum margin concept; see Figure 2.2. For a separable data set, the system places a hyperplane in a high dimensional space so that the hyperplane has maximum margin. The data points from the training set lying on the boundaries (indicated by the solid lines in the figure) are the support vectors in Equation (2.7). The focus of the SVM training process is thus to model the boundary, as opposed to a traditional Gaussian mixture model, which would model the probability distributions of the documents (a generative approach).
[Figure: a separating hyperplane with maximum margin between Class 0, where f(x) > 0, and Class 1, where f(x) < 0]
Figure 2.2: Support vector machine concept
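As an illustration, the sketch below trains a linear SVM on character trigram counts using scikit-learn; the toy snippets and the choice of LinearSVC are assumptions for the example, not the exact setup used in this thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts  = ["selam new dehna neh", "akkam jirta nagaa", "nabad miyaa walaal"]
labels = ["amh", "orm", "som"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # trigram counts
    LinearSVC())                                           # max-margin boundary
clf.fit(texts, labels)
print(clf.predict(["akkam bulte"]))                        # -> ['orm']
```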
Decision Trees
A decision tree is a machine learning technique that is used to solve sequential decision-
making problems. It has been shown to be effective in classification problems across
domains. One of the attractive features of decision trees is the ease at which one can
visualise the decision-making process (Ishwaran & Rao, 2009). Another attractive quality
of decision trees is their effectiveness in utilizing contextual information (Hakkinen &
Tian, 2001).
Decision trees are made up of nodes. The initial node of the decision tree is known as the root node. Internal nodes are nodes where decisions are made; the data at each internal node is partitioned into child nodes, and the resulting child nodes can be either internal nodes or leaf nodes. Leaf nodes do not have any child nodes, and the final classification for an example is given at these nodes (Kingsford & Salzberg, 2008).
A decision tree classifier follows a two-phase approach. The first phase is the
training phase where a set of labelled examples is used to train a decision tree model. The
second phase is the classification phase where the model created can be used to predict
the classification of an unlabelled example. The class given is determined by the features
the example possesses. The decision tree is traversed by answering a series of questions
starting at the root node and filtering down to a leaf node where the final classification is
given (Kingsford & Salzberg, 2008) (Cavnar & Trenkle, 1994).
The famous 'play tennis given the weather forecast' decision tree is a very good example of how conditions are tested in a decision tree. There are two labels to be attributed to each instance, yes or no, given a set of conditions determined by three attributes: humidity, outlook, and wind. In this example (see Figure 2.3) each attribute has two or three values. Conditions are tested to determine whether, given the weather forecast, it is best to play tennis or not (e.g. if the outlook is sunny and the humidity normal, play tennis; if the outlook is rain and the wind is strong, then don't play tennis).
[Figure: decision tree rooted at Outlook, with branches Sunny, Overcast and Rain leading to further tests and to Yes/No leaf nodes]
Figure 2.3 Decision tree: play tennis example (Mitchell, 1997)
Neural Networks
Neural networks have been in use for some time, and their design and implementation for standard situations is well known (Haykin, 1994). Most of the work using neural networks aims at the classification of objects. In this case, the network works as a hypothesis function hΘ(X) that, based on a set of matrices Θ previously computed on a training set, is able to classify an object based on a set of features extracted from that object. In the case of language identification, the training set is a set of texts manually classified into one language, together with an algorithm to extract features from them. These features are then fed into the neural network training algorithm, which computes the set of matrices Θ.
These matrices are then used to identify the language of new texts. For that, the features
X are extracted from the text to be classified, and the hypothesis function is called. The
resulting vector will include the probabilities of that text being identified as each one of
the trained languages.
As implemented here, using the more common definition of a neural network (Haykin, 1994), the network works as follows:
The network is organized in layers of units; each unit a_i(l) is indexed by the layer l to which it belongs and by its order i within that layer. All units from a specific layer are connected to all units from the next layer. This connection is controlled by a matrix Θ(l), for each layer l.
The first layer is known as the input layer. It has the same number of units as there are features to be analyzed. Whenever the network hypothesis function is evaluated, each cell a_i(1) is filled in with an observed feature value. The next layer, a_i(2), is computed using the previous layer and the matrix Θ(1). This process is done for every layer l ≤ L. The layer L is known as the output layer. There are as many units in this layer as the number of classes K into which the network will classify objects. Therefore, if the network is trained to detect 25 languages, then there are 25 units in the output layer.
Each unit in the output layer will, optimally, get a value that is either 1 or 0, meaning that the object is, or is not, in the respective class. In practice, the result is a value in this range that represents the probability of the object belonging to that specific class.
The other layers, 1 < l < L, are known as the hidden layers. There can be as many hidden layers as one might want, but there is at least one hidden layer. Adding new layers can make the network return better results, but it will take more time to train the network and more time to run the network hypothesis function.
The implementation of the neural network was based on the logistic function g(z). This function's range is [0,1], and its result can be considered a probability measure. The logistic function is defined as:
\[ g(z) = \frac{1}{1 + e^{-z}} \qquad \text{Equation 2.9} \]
The neural network hypothesis function hΘ(X) is defined by two matrices, Θ(1) and Θ(2). These matrices of weights are used to compute the network. The input values, obtained from the computed features, are stored in the vector X. This vector is multiplied by the first weight matrix, and the logistic function is applied to each value of the resulting vector. The resulting vector is denoted a(2) and corresponds to the values of the second layer of the network (the hidden layer). It is then possible to multiply the a(2) vector by the weights of Θ(2) and, after applying the sigmoid function to each element of the resulting multiplication, we obtain a(3). This is the output layer, and each value of this vector corresponds to the probability that the document being analyzed is written in a specific language. This algorithm is known as forward propagation and is defined by:
\[ a^{(1)} = x; \qquad a^{(l)} = g\left(\Theta^{(l-1)} a^{(l-1)}\right) \ \text{for } l = 2 \text{ to } L \qquad \text{Equation 2.10} \]
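The sketch below implements Equations 2.9 and 2.10 with NumPy; the layer sizes and random weight matrices are placeholders rather than trained values, and bias units are omitted for brevity.

```python
import numpy as np

def g(z):                                    # logistic function, Equation 2.9
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):                      # thetas = [Theta(1), ..., Theta(L-1)]
    a = x                                    # a(1) = x
    for theta in thetas:                     # for l = 2 to L
        a = g(theta @ a)
    return a                                 # output layer: one score per language

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(10, 300)),        # 300 features -> 10 hidden units
          rng.normal(size=(7, 10))]          # 10 hidden units -> 7 languages
print(forward(rng.normal(size=300), thetas))
```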
The main problem behind this implementation is how to obtain the weight values. The usual methodology is to define a cost function and try to minimize it, that is, to find the Θ values for which the hypothesis function has the smallest error on the training set.
\[ J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{j,i}^{(l)}\right)^2 \qquad \text{Equation 2.11} \]
The regularization is controlled by the coefficient λ, which can be used to tweak how much the absolute values of the Θ weights are allowed to grow. The minimization of the cost function J(Θ) is computed by an algorithm known as gradient descent. This algorithm uses the partial derivatives (Equation 2.12) to compute the direction in which to move to reach the function minimum. The algorithm iterates until the difference between successive costs is very small, or until a limit on the number of iterations is met.
\[ \frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) \qquad \text{Equation 2.12} \]
Using the Dot Product
Let the vector c represent all the n-gram combinations. The dot-product method builds a language vector l_j from the n-gram statistics of a document, where l_ij is the frequency of the n-gram c_i. To classify a test string, a vector x is built from its n-gram statistics. The normalized dot product between a language vector l_j and the test string vector x is then computed by (Lipschutz & Lipson, 2009):
\[ x \cdot l_j = \frac{\sum_{i=1}^{N} x_i\, l_{ij}}{\lvert x \rvert\, \lvert l_j \rvert} \qquad \text{Equation 2.13} \]
The computed measure indicates how close the two vectors are, and therefore how close the text string is to the language model. The closer the measurement is to 1, the more similar the vectors are. Thus, the language model with the highest measurement is identified as the language of the text segment.
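A small sketch of Equation 2.13 follows, assuming both vectors are indexed by the same n-gram list c; the frequencies shown are toy values.

```python
import math

def normalized_dot(x, l):
    dot = sum(xi * li for xi, li in zip(x, l))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(li * li for li in l))
    return dot / norm if norm else 0.0

x  = [3, 0, 1, 2]              # n-gram frequencies of the test string
lj = [5, 1, 2, 4]              # n-gram frequencies of one language model
print(normalized_dot(x, lj))   # a value close to 1 means similar vectors
```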
Using the Relative Entropy
Relative entropy (also known as the Kullback Leibler distance) is used in information
theory to measure the difference between two probability distributions. For a discrete
random variable x, and probability distributions p and q, the relative entropy is given by
(Kullback & Leibler, 1951):
\[ D_{KL}(p \parallel q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \qquad \text{Equation 2.14} \]
Reynar, (1996) associated the p distribution with the test set and the q distribution with the training set, where X is the set of samples in distribution p. If an n-gram is not found in the trained model, the equation above is undefined; therefore, a penalty of 0.5 is added for such events.
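A minimal sketch of Equation 2.14 with Reynar's 0.5 penalty follows; p and q are assumed to be dictionaries mapping n-grams to probabilities.

```python
import math

def kl_distance(p, q, penalty=0.5):
    d = 0.0
    for gram, p_x in p.items():
        if gram in q:
            d += p_x * math.log(p_x / q[gram])
        else:
            d += penalty       # n-gram absent from the trained model
    return d

# The language whose model q gives the smallest distance is selected.
```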
The language profile with the smallest relative entropy is used to assign the language of
the test string. Data was obtained from the European Corpus Initiatives CD-ROM and a
subset of documents in 18 Roman alphabet languages were used for training and testing
purposes. Classification was compared by training the classifier with uni-gram and bi-
gram statistics. For a 200-line training set and 1 test line, the average accuracy is 78.6%
for uni-grams and 90.2% for bi-grams. If the number of test lines is increased to 10, the
uni-gram accuracy increases to 98.5% and the bi-gram accuracy to 99.9%. The accuracy
for a 2000-line training set and 1 test line is 81.5% for uni-grams and 94.1% for bi-grams.
For 10 lines the uni-gram accuracy increases to 99.0% and the bi-gram accuracy to
99.9%.
This method uses dictionaries containing stopwords. Stopwords are very specific to a language; although some languages share some similar words and special characters, they are not all common. Stopwords are usually a good measure for automatic language detection.
The stopword dictionaries are created before applying the algorithm. The algorithm takes the text, finds its most common words, and compares them with the stopwords. The language with the most stopwords is selected as the identified language. The algorithm counts how many unique stopwords are seen in the analyzed text, records this in a language-ratios dictionary, and detects the language based on the ratios in that dictionary.
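A sketch of this procedure follows; the tiny stopword sets are illustrative stand-ins for the dictionaries built in this study (see Table 3.7).

```python
def detect_by_stopwords(text, stopword_dicts):
    words = set(text.lower().split())
    ratios = {}
    for lang, stopwords in stopword_dicts.items():
        seen = words & stopwords                # unique stopwords in the text
        ratios[lang] = len(seen) / len(stopwords)
    return max(ratios, key=ratios.get)          # language with the highest ratio

dicts = {"orm": {"fi", "kan", "akka"}, "som": {"iyo", "oo", "waa"}}
print(detect_by_stopwords("akka kan fi kun", dicts))   # -> 'orm'
```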
Multilingual documents are documents that contain text in more than one language.
Recent research has investigated how to make use of multilingual documents from
sources such as web crawls (King & Abney, 2013).
Kralisch & Mandl, (2006) detect "language shift" using an eight-word sliding window. Rehurek & Kolkus, (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms, and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages.
Yamaguchi & Tanaka-Ishii, (2012) use a minimum description length approach,
embedding a compressive model to compute the description length of text segments in
each language. They present a linear-time dynamic programming solution to optimize the
location of segment boundaries and language labels.
Lui & Baldwin, (2011) presented a system for language identification in multilingual
documents using a generative mixture model inspired by supervised topic modeling
algorithms, combined with a document representation based on previous research in
language identification for monolingual documents. They showed that their system is
able to accurately estimate the proportion of the document written in each of the
languages identified.
Given a set of evaluation documents, each having a known correct label from a closed set
of labels, and a predicted label for each document from the same set, the document-level
accuracy is the proportion of documents that are correctly labeled over the entire
evaluation collection. This is the most often-reported metric, and conveys the same
information as the error rate, which is simply the proportion of documents that are
incorrectly labeled (i.e. 1 − accuracy).
Authors sometimes provide a per-language breakdown of results. There are two distinct
ways in which results are generally summarized per-language (Powers & David, 2011):
(1) precision, in which documents are grouped according to their predicted language; and
(2) recall, in which documents are grouped according to what language they are actually
written in.
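The sketch below computes these metrics from parallel lists of true and predicted labels; the label values are toy examples, not results from this study.

```python
def accuracy(y_true, y_pred):
    # error rate = 1 - accuracy
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, lang):
    predicted = [t for t, p in zip(y_true, y_pred) if p == lang]  # grouped by prediction
    actual    = [p for t, p in zip(y_true, y_pred) if t == lang]  # grouped by true label
    precision = predicted.count(lang) / len(predicted) if predicted else 0.0
    recall    = actual.count(lang) / len(actual) if actual else 0.0
    return precision, recall

y_true = ["amh", "tir", "amh", "orm"]
y_pred = ["amh", "amh", "amh", "orm"]
print(accuracy(y_true, y_pred))                 # 0.75
print(precision_recall(y_true, y_pred, "amh"))  # (0.666..., 1.0)
```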
In addition to evaluating performance for each individual language, authors have also
sought to convey the relationship between classification errors and specific sets of
languages. Errors in language identification systems are generally not random; rather,
certain sets of languages are much more likely to be confused. For example, Grefenstette,
(1995) found that Norwegian documents had an elevated chance of being misclassified as
Swedish, compared to a range of other European languages.
In terms of writing systems, Ethiopia's principal orthography is the Ge'ez script. Employed as an abugida for several of the country's languages, it first came into usage in the sixth and fifth centuries BC as an abjad to transcribe the Semitic Ge'ez language (Rodolfo, 2003). Ge'ez now serves as the liturgical language of the Ethiopian and
Eritrean Orthodox Tewahedo churches. Other writing systems have also been used over
the years by different Ethiopian communities. These include Arabic script for writing
some Ethiopian languages spoken by Muslim populations (Pankurst, 1991) and
Sheikh Bakri Sapalo's script for Oromo (Hayward & Hassan, 1981). Today, many
Cushitic, Omotic, and Nilo-Saharan languages are written in Roman/Latin script.
According to the 2007 Ethiopian census, the largest first languages are: Oromo,
24,930,424 speakers or 33.80% of the total population; Amharic, 21,634,396 speakers or
29.30% of the total population; Somali 4,609,274 speakers or 6.25% of the total
population; Tigrinya, 4,324,476 speakers or 5.86% of the total population; Sidamo,
2,981,471 speakers or 4.84% of the total population; Wolaytta, 1,627,784 speakers or
2.21% of the total population; Gurage, 1,481,783 speakers or 2.01% of the total
population; and Afar, 1,281,278 speakers or 1.74% of the total population (Central Statistical Agency, 2010).
Widely spoken foreign languages include English (the major foreign language taught in schools), Arabic, and Italian (Central Intelligence Agency, n.d.). Amharic is the official language in which all federal laws are published, and it is spoken by millions of Ethiopians as a second language. In most regions, it is the primary second language in the school curriculum.
Amharic has over 21 million native speakers in Ethiopia and 15 million secondary speakers in Ethiopia (Central Statistical Agency, 2010; Meyer, 2006). Additionally, 3 million emigrants outside of Ethiopia speak the language. Most of the Ethiopian Jewish communities in Ethiopia and Israel speak Amharic. In Washington DC, Amharic became one of the six non-English languages in the Language Access Act of 2004, which allows government services and education in Amharic (Language Access Act Fact Sheet, 2011).
The Amharic script is Ge’ez, and the graphemes of the Amharic writing system are
called fidel (Hudson, 2009). Each character represents a consonant + vowel sequence, but
the basic shape of each character is determined by the consonant, which is modified for
the vowel. Some consonant phonemes are written by more than one series of characters.
This is because these fidel originally represented distinct sounds, but phonological
changes merged them (Hudson, 2009). The citation form for each series is the consonant
+ ä form, i.e. the first column of the fidel. The Amharic script is included in Unicode, and
glyphs are included in fonts available with major operating systems.
Sakuye dialects); Eastern Oromo (Harar); Orma (Munyo, Orma, Waata/Sanye); West–
Central Oromo (Western Oromo and Central Oromo, incl. Mecha/Wollega, Raya,
Wello(Kemise), Tulema/Shewa).
Oromo is written with a Latin alphabet called Qubee which was formally adopted in 1991
(University of Pennsylvania, 1977). Various versions of the Latin-based orthography had
been used previously, mostly by Oromos outside of Ethiopia and by the OLF by the late
1970s (Heine, 1981). With the adoption of Qubee, it is believed more texts were written
in the Oromo language between 1991 and 1997 than in the previous 100 years. In Kenya,
the Borana and Waata also use Roman letters but with different systems. The Sapalo
script was an indigenous Oromo script invented by Sheikh Bakri Sapalo (also known by
his birth name, Abubaker Usman Odaa) in the years following Italian invasion of
Ethiopia, and used underground afterwards (Hassan, 1981). The Arabic script has also
been used intermittently in areas with Muslim populations. The first comprehensive
online Afaan Oromo dictionary was developed by the Jimma Times Oromiffa Group
(JTOG) in cooperation with SelamSoft (Jimma_Times, n.d.). Oromo and Qubee are
currently utilized by the Ethiopian government's state radios, TV stations and regional
government newspaper.
preserving the Somali language (COMESA, n.d.). As of 2013, Somali is also one of the
featured languages available on Google Translate (Google, 2013).
The Somali language is written officially with the Latin alphabet. In 1961 both the Latin
and Osmanya scripts were adopted for use in Somalia, but in 1969 there was a coup, with
one of its stated aims the resolution of the debate over the country's writing system. The
Latin alphabet was finally adopted in 1972 and at the same time Somali was made the
sole official language of Somalia. Shire Jama Ahmed is credited with the invention of
this spelling system, and his system was chosen from among eighteen competing new
orthographies (Omniglot, 2017).
Tigrinya is written in the Ge'ez script, originally developed for Ge'ez, also called
Ethiopic. The Ge'ez script is an abugida: each symbol represents a consonant + vowel
syllable, and the symbols are organized in groups of similar symbols on the basis of both
the consonant and the vowel (Ethnologue, 2017).
In terms of its writing, Sidaama used an Ethiopic script up until 1993, from which point
forward it has used a Latin script (Raymond, 2005).
The Afar language is an Afro-asiatic language, belonging to the family's Cushitic branch.
It is spoken by the Afar people in Djibouti, Eritrea and Ethiopia. It is categorized in
the Lowland East Cushitic sub-group, along with Saho and Somali (Lewis, 1998). The
Afar language is spoken as a mother tongue by the Afar people in Djibouti, Eritrea, and
the Afar Region of Ethiopia. According to Ethnologue, there are 1,379,200 total Afar
speakers. Of these, 1,280,000 were recorded in the 2007 Ethiopian census, with 906,000
monolinguals registered in the 1994 census (Ethnologue, 2017).
In Ethiopia, Afar is written with the Ethiopic or Ge'ez script. Since around 1849,
the Latin script has been used in other areas to transcribe the language (Ethnologue,
2017). Additionally, Afar is also transcribed using the Arabic script. In the early 1970s,
two Afar intellectuals and nationalists, Dimis and Redo, formalized the Afar alphabet.
Known as Qafar Feera, the orthography is based on the Latin script (Omniglot, 2017).
Officials from the Institut des Langues de Djibouti, the Eritrean Ministry of Education,
and the Ethiopian Afar Language Studies and Enrichment Center have since worked with
Afar linguists, authors and community representatives to select a standard orthography
for Afar from among the various existing writing systems used to transcribe the language.
Nuer is a member of the Western Nilotic group of the Nilo-Saharan languages spoken in
southern Sudan and western Ethiopia by about 800,000 people. Nuer is one of eastern and
central Africa's most widely spoken languages, along with the Dinka language. The
language is very similar to the languages of Jieng and Chollo (Gurtong, n.d.)
(Ethnologue, 2017).
Nuer is written with a version of the Latin alphabet using an orthography adopted at the
Rejaf Language Conference in 1928 which has been modified to some extent by
missionaries since then (Omniglot, 2017).
2.3 Related Works
Cavnar & Trenkle, (1994) used the n-gram rank ordering technique to test language
identification on newsgroup text in 8 languages. These languages were English, Polish,
Dutch and German (Germanic group), Portuguese, French, Spanish and Italian (Romance
Group). Training sets for the languages varied between 20,000 and 120,000 characters.
Using the 300 most frequent 3-grams the best accuracy of 98.6% was achieved for a test
string smaller than 300 characters. From the results, it is difficult to conclude what the
performance on smaller test strings (e.g. < 20 characters) would be.
Dunning, (1994) used a Naïve Bayes approach to estimate the likelihood that a string belongs to a language model. He did his experiments on English and Spanish parallel texts. He reported an accuracy of 92% with 50,000 characters of training data and a 20-character test string. The accuracy increased to 99% when the test string size was increased to 500 characters. For 5K characters of training data and a 500-character test string, the accuracy dropped to 97%. The evaluated languages were from different language families (Romance and Germanic), and therefore it is difficult to compare these results with classification done on languages from the same family group.
Kikui, (1994) proposed an algorithm for simultaneously identifying the coding system
and the language of a given string. The coding system was identified heuristically, while the language identification part was based on the Naïve Bayes model. Documents were
collected from the Internet and grouped into 700 training and 640 testing documents. The
tests included 9 languages (English, German, Spanish, French, Italian, Portuguese,
Japanese, Chinese and Korean) and 11 coding schemes. They found that n=4 gave the
lowest error rates and that their accuracies were equivalent to the results presented in
Cavnar & Trenkle, (1994).
Grefenstette, (1995) compared two feature statistics in building a likelihood model. One statistical model contained the likelihoods of all tri-grams that appeared more than 100 times in a training set. The other model contained the likelihoods of all words that are shorter than 5 characters and occur more than 3 times in the training text. It was found that
for sentences smaller than 15 words the tri-gram method performs best but for a greater
number of words both methods perform well. A 15-word window roughly approximates 100 characters; therefore, it is difficult to conclude what the error rate for much smaller test strings would be.
On a test sample (with an average length of 50 characters per item), n-gram rank ordering performed with 90.2% accuracy, the centroid method with 95.9% accuracy, and the SVM gave the overall best performance with an accuracy of 99.7%.
L-F.Zhai, (2006) found that a reduction in the feature space of the SVM results in a significant decrease in performance accuracy. They also concluded that the SVM is highly sensitive to prior distributions. This is due to the nature of the SVM classifier, which does not compensate for different sizes of training data; therefore, classification is biased towards the class with the larger training set.
A frequency-based n-gram difference classifier and a support vector machine (SVM) that uses the n-gram frequencies as features are discussed in Botha, Zimu, & Barnard, (2006). Error rates of approximately 0.3% are achieved over large text window sizes. It is also found that the SVM's performance is better than the n-gram based estimator's, but at a much greater computational cost.
H. Lodhi, (2002) showed that an SVM trained with n-gram statistics outperforms an SVM using string kernels as features. Though it might be assumed that a string kernel would capture more information about a language, it probably introduces complexity into the SVM, which has a negative impact on decision-making.
Simões, Almeida, & Byers, (1998) presented a neural network that is able to identify languages with 96% to 97% accuracy, depending on the number of iterations performed during the training process. For that they used two kinds of features: one related to the language alphabet, and another related to the character trigrams with the highest occurrence. Given that they are able to use binary features to classify the alphabet, the neural network is able to learn much faster to distinguish some collections of languages. A problem with their approach is that it will perform badly on short snippets of text (like instant messages or mobile messages), because of the low number of trigrams selected per language.
S. MacNamara, (1998) compared a Naïve Bayes classifier with a neural net classifier. For the Naïve Bayes classifier, the 100 most discriminative n-grams and the 100 most frequent n-grams were extracted from the training text. An input string was first scanned for discriminant n-grams, and a decision was made once one was found. If no discriminant n-grams occurred, the log likelihood was calculated using the probabilities of the 100 most frequent n-grams, and the language with the highest score was chosen as the language of the input string. Tests were performed on 18 languages. For an average of 2 to 3 words, the accuracy of the neural network was 70%, and for the tri-gram trained Naïve Bayes classifier an accuracy of 83% was found.
Prager, (1999) implemented the dot product calculation by giving each vector item an inverse document frequency (IDF) weighting, which is the inverse of the number of languages that contain the n-gram combination. For closely related languages, Prager, (1999) found that this weighting stayed fairly fixed. The best performance was achieved by combining 4-gram and word statistics (with no restriction on word length). Evaluating thirteen West-European languages, the average performance was 85.4% for a 20-character input string and over 99% for 130 characters and up.
Padro, (2004) compared the Naïve Bayes method, dot-product classification and the n-gram rank ordering method. Identification was tested on 6 languages (English, Catalan, Spanish, Italian, German and Dutch). Overall, the Naïve Bayes classifier proved best (significantly so for small test samples), followed by the dot-product classifier and then the n-gram rank ordering method.
Truica, Velcin, & Boicea, (2015) presented a statistical method for automatic language
identification of written text using dictionaries containing stopwords and diacritics. They
proposed different approaches that combine the two dictionaries to accurately determine
the language of textual corpora. They tested their method using a Twitter corpus
containing 500,000 tweets with 100,000 tweets for each studied language and a news
article corpus that contains 250,000 entries with 50,000 articles for each studied
language. Their results show that their proposed method has an accuracy of over 90% for
small texts and over 99.8% for large texts.
To the knowledge of the researcher, the only other research that includes the Ethiopian
languages in a text-based language identification task was done by Desta, (2014). Desta,
(2014) compared language identification accuracy by using character n-gram and
character n-gram location as a feature set for Naïve Bayes and Frequency rank order
models. The results showed that the Naïve Bayes classifier achieved the highest accuracy for
short, medium and long test documents. The identification accuracy of the frequency rank order model is low but showed an improvement for long test documents. When using character n-grams and their location frequencies as the feature set, the accuracy of both models showed an improvement. Although this is an encouraging result, it is difficult to jump to conclusions without considering factors that can determine identification accuracy, such as the amount and variety of training data. He did not investigate the effect that homogeneously distributed training data (equal amounts of training data for each language class) has on a classifier when compared to a training set with heterogeneously distributed training data. Moreover, his research is restricted to four Cushitic languages and did not include languages from other linguistic classifications. Furthermore, his research did not consider multilingual identification for mixed-language input. Thus, text-based language identification for the Ethiopian languages is still open for research.
From the literature, it is evident that different approaches to solving the language identification problem have been investigated. Naive Bayes classifiers prove to be quite popular and successful with high orders of n for the n-grams used. The use of SVMs also appears to be a popular choice, achieving good performance. However, not many studies have been conducted regarding dictionaries containing stopwords, or focusing on the major Ethiopian languages. Identification based on the SVM, Naïve Bayes and dictionary methods is therefore used in this research.
3 METHODS AND TECHNIQUES
3.1 Overview
In previous studies, languages without any family relationship and others within the same language families were used to perform the tests. Some studies evaluated performance accuracy using only a validation set, while others performed a thorough cross-validation test. In studies that compared different classifiers against each other, the classifiers were evaluated under the same conditions so that more reliable conclusions could be made.

This study borrows ideas from previous evaluation processes and is designed to make comparisons more reliable. To ensure reliable results, the classifiers were evaluated under the same circumstances.
The languages used in the experiments of this study come from different linguistic classifications, namely the Cushitic and Semitic branches of the Afro-Asiatic classification and the Nilotic branch of the Nilo-Saharan classification. Seven local languages are used, namely Afar, Amharic, Nuer, Oromo, Sidamo, Somali and Tigrigna.
3.2 Corpora
Seven corpora were collected to perform the experiments described here. All corpora
consist of different texts compiled from different sources. The data included text from
various sources (such as newspapers, periodicals, books, the Bible and government
documents) and therefore, the corpus spans several domains.
The size of the text varied from 4MB to 5MB per language. Due to the variety of sources used, the text was not homogeneous and needed some automatic preprocessing in order to be used for building models. Therefore, using Python scripts, the extraction, compilation, cleaning and indexing of all articles was carried out before performing the experiments. For example, consecutive white spaces were substituted with a single space character. Moreover, numbers, names, abbreviations, punctuation marks and addresses (e.g. e-mail addresses and links to Internet websites) were removed. After this preprocessing, the size of the text was reduced to 1MB to 2MB per language. A simplified sketch of this cleaning step is shown below.
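The patterns here are simplified stand-ins for the thesis's actual Python cleaning scripts; name and abbreviation removal are omitted, and the example sentence is illustrative.

```python
import re
import string

def clean(text):
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text)   # e-mail addresses and links
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return re.sub(r"\s+", " ", text).strip()             # collapse white space

print(clean("Gatiin isaa birrii 50 dha!  info@example.com"))
```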
Documents written in 7 languages and language varieties were compiled and processed
resulting in slightly over 11.7 million tokens. Quantitative detail about the corpora
(number of tokens and types) is shown in Table 3.1.
Languages    Number of Words/Tokens    Number of Types (unique words)
Afar         17,459                    6,673
Amharic      9,072,217                 50,000
Nuer         1,337                     433
Oromo        1,265,782                 50,000
Sidamo       81,345                    15,620
Somali       1,262,697                 50,000
Tigrigna     42,391                    12,844
Total        11,743,228                185,570
Table 3.1 Corpora: Facts and Figures - Number of Words/Tokens and Types
As can be seen in Table 3.1, the amount of text differs across languages. Texts are also
of a different length depending on the data source. It is obvious that both the amount of
training material and the length of documents play an important role in language
identification and text classification tasks in general.
The experimentation conducted in this study is separated into two experiments, with each
experiment differing with regards to the amount of data used for training.
The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. This experiment was conducted by randomly taking 90% of the corpus
of the seven target languages for training the models and the remaining 10% for testing
the models. Table 3.2 shows the corpus size for training (90%) and testing (10%) the
models.
Table 3.2 Corpus size for training (90%) and testing (10%) the models
The second experiment consists of varying the size of the training data used to train the
classifiers to simulate scarce data availability. Furthermore, this experiment investigates
the effect that homogeneously distributed training data (equal amounts of training data
for each language class) has on a classifier when compared to a training set with
heterogeneously distributed training data. The smallest training data size is the data size
taken for Nuer language. Therefore, similar data sizes for the remaining six languages are
taken. Table 3.3 shows the corpus size for training and testing the models with
homogeneously distributed training data for the seven languages in this study.
Languages    Words/Tokens for training    Words/Tokens for testing
Afar         1,203                        134
Amharic      1,203                        134
Nuer         1,203                        134
Oromo        1,203                        134
Sidamo       1,203                        134
Somali       1,203                        134
Tigrigna     1,203                        134
Total        8,421                        938
Table 3.3 Data statistics for training and testing the models with homogeneously
distributed training data per language
Languages    Words/Tokens    Characters    Avg. Word Length    Words per 15-chw    Words per 100-chw    Words per 300-chw
Afar         17,459          115,591       6.62                2.27                15.11                45.32
Amharic      9,072,217       42,596,236    4.70                3.19                21.28                63.83
Nuer         1,337           5,434         4.06                3.69                24.63                73.89
Oromo        1,265,782       10,180,001    8.04                1.87                12.44                37.31
Sidamo       81,345          750,926       9.23                1.63                10.83                32.50
Somali       1,262,697       9,756,935     7.73                1.94                12.94                38.81
Tigrigna     42,391          173,439       4.09                3.67                24.45                73.35
Table 3.4 Word statistics of each language used to generate the test character windows.
Average word lengths and number of words per character window are indicated.
Table 3.5 Number of test documents for the three character windows
To evaluate multilingual identification, test documents were also constructed for
combinations of three languages; each document is a concatenation of sections in
different languages which are constructed randomly from a collection of medium-length
texts per language (6 to 50 words).
Multilingual Identification (30 test documents per language combination)

Afar, Amharic, Nuer          Amharic, Nuer, Tigrigna
Afar, Amharic, Oromo         Amharic, Oromo, Sidamo
Afar, Amharic, Sidamo        Amharic, Oromo, Somali
Afar, Amharic, Somali        Amharic, Oromo, Tigrigna
Afar, Amharic, Tigrigna      Amharic, Sidamo, Somali
Afar, Nuer, Oromo            Amharic, Sidamo, Tigrigna
Afar, Nuer, Sidamo           Amharic, Somali, Tigrigna
Afar, Nuer, Somali           Nuer, Oromo, Sidamo
Afar, Nuer, Tigrigna         Nuer, Oromo, Somali
Afar, Oromo, Sidamo          Nuer, Oromo, Tigrigna
Afar, Oromo, Somali          Nuer, Sidamo, Somali
Afar, Oromo, Tigrigna        Nuer, Sidamo, Tigrigna
Afar, Sidamo, Somali         Nuer, Somali, Tigrigna
Afar, Sidamo, Tigrigna       Oromo, Sidamo, Somali
Afar, Somali, Tigrigna       Oromo, Sidamo, Tigrigna
Amharic, Nuer, Oromo         Oromo, Somali, Tigrigna
Amharic, Nuer, Sidamo        Sidamo, Somali, Tigrigna
Amharic, Nuer, Somali
Total test documents: 1,050
Table 3.6 Language combinations and number of test documents for multilingual identification
3.3 Evaluation and Computational Techniques
This section presents the computational techniques behind the automatic language
identification systems used in this thesis. From the literature study in Chapter 2, we
can see that many approaches to text-based language identification can be followed.
Using a pure linguistic approach is undeniably the better choice to achieve high
classification accuracy, but it requires a large amount of linguistic expertise;
statistical approaches are therefore a feasible alternative. Statistics of words,
letters or n-grams can be used to build statistical language models. N-gram based models
outperform word-based models for small text fragments and do equally well for larger
fragments (Grefenstette, 1995). In another study, n-grams achieved better results than
string kernels (Lodhi, Shawe-Taylor, Cristianini, & Watkins, 2002). The literature shows
that this is by far the most popular choice, which is why the feature sets here are
restricted to n-gram based features. Naïve Bayes and SVM are trained using character
n-grams as the feature set.
To evaluate the extent to which the methods used in this thesis are suitable for
language identification, standard metrics from Natural Language Processing and Text
Classification are used to report results in terms of accuracy and error rate.
The metrics are based on the four possible outcomes of a confusion matrix: true
positives, true negatives, false positives, and false negatives. The results are
obtained per class; to evaluate the performance of a classifier across all classes, the
average (mean) performance over all classes is calculated. This allows us to evaluate
how well the classifier performs when identifying each individual class as well as when
distinguishing all classes.
The evaluation metrics used in this thesis are accuracy and error rate. Accuracy is
calculated as the number of correct predictions divided by the total number of
predictions; the best value is 1.0 (100%) and the worst is 0.0 (0%). Error rate is
calculated as the number of incorrect predictions divided by the total number of
predictions; the best value is 0.0 (0%) and the worst is 1.0 (100%). Accuracy and error
rate are calculated as follows (Powers & David, 2011):

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$   (Equation 3.1)

$\text{Error rate} = \frac{FP + FN}{TP + TN + FN + FP}$   (Equation 3.2)
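As a small illustration of Equations 3.1 and 3.2 (the counts below are invented for the
example):

def accuracy(tp, tn, fp, fn):
    # Equation 3.1: correct predictions over all predictions.
    return (tp + tn) / float(tp + tn + fn + fp)

def error_rate(tp, tn, fp, fn):
    # Equation 3.2: incorrect predictions over all predictions.
    return (fp + fn) / float(tp + tn + fn + fp)

print(accuracy(90, 890, 10, 10))    # (90 + 890) / 1000 = 0.98
print(error_rate(90, 890, 10, 10))  # (10 + 10) / 1000 = 0.02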
A word may exist in a language but not appear in a given corpus. Some words are so rare
that they may not occur no matter how big the corpus is; this does not mean they will
not occur in other samples. Attributing a zero probability to them would therefore spoil
the calculation. The same is true for character sequences, due to the simple fact that
in any given language some character sequences are more frequent than others.
Manning & Schütze (1999) state that 'regardless of how the probability is computed,
there is still the need to assign a non-zero probability estimate to words or n-grams
that are not present in our training corpus'. This process is called smoothing, and a
number of smoothing techniques are used in natural language processing and in language
identification. A very simple one, used in Dunning (1994) and Zampieri, Gebre, &
Diwersy (2012), is Laplace smoothing.
$P_{lap}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n) + 1}{N + B}$   (Equation 3.3)
The formula is the aforementioned Maximum Likelihood Estimation (MLE) modified by
adding 1 to the numerator (to assign a non-zero probability) and adding B, the total
number of possible unique n-grams, to the denominator.
In this study, Laplace smoothing is used to avoid multiplication by zero for n-grams
that were not seen during model training: one is added to each n-gram frequency. If an
n-gram does not occur in the training data at all, it is discarded from the calculation;
since the training corpus is large, the effect of zero-frequency n-grams on the
classification accuracy is minimal. This process continues until the probability of the
language given the character n-grams has been calculated for each of the languages. The
language with the maximum such probability is taken as the language of the unknown text
document.
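A minimal sketch of this smoothed estimate, assuming `counts` maps each character n-gram
of a language model to its training frequency (the names are illustrative):

import math

def laplace_log_prob(ngram, counts, n_total, n_unique):
    # Equation 3.3: (C(ngram) + 1) / (N + B), computed in log space.
    return math.log((counts.get(ngram, 0) + 1.0) / (n_total + n_unique))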
In this section, the algorithms used in this thesis are presented. Implementations of
Naive Bayes and Support Vector Machines, which are popular machine learning classifiers,
are used; the other technique used is dictionaries containing stopwords. Although a
large number of classifiers have been applied to text-based language identification,
experimenting with all possible combinations was not feasible. The classification
algorithms were therefore chosen for their proven performance in published studies, as
well as their ability to clarify theoretical issues, as discussed below.
An algorithm capable of identifying the language of a text can be designed based on
probabilities of occurrence of letters and letter sequences.
Figure 3.1 Steps to identify a particular language: Unknown Text → Identification Model → Most Probable Language
During training stage, all possible character n-grams were extracted. The core benefits of
n-gram models (and algorithms that use them) are relative simplicity and the ability to
scale up by simply increasing n. The model can be used to store more contexts with a
well understood space-time tradeoff, enabling small experiments to scale up very
efficiently. The n-gram approximation for calculating the next word in the sequence is
given by:
$P(x_1^n) \approx \prod_{k=1}^{n} P(x_k \mid x_{k-N+1}^{k-1})$
The language identification system has two components: a language profile generator and
a classifier. For language identification, the language profile generator calculates the
n-gram profile of a text to be identified and compares it to language-specific n-gram
profiles. For each language, it generates all possible n-grams from the text and saves
them into the corresponding language file. In the classification step, given a test
sample, its likelihood is calculated under all the models, and the language that gives
the best likelihood is selected.
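A minimal sketch of such a profile generator, assuming whitespace-tokenized text and
character 3-grams (a fuller version appears in the Appendix):

from collections import Counter

def ngram_profile(text, n=3):
    # Count every character n-gram of every token in the text.
    profile = Counter()
    for token in text.split():
        for i in range(len(token) - n + 1):
            profile[token[i:i + n]] += 1
    return profile

# Usage: build one profile per language from its training corpus, e.g.
# profiles = {lang: ngram_profile(text) for lang, text in corpora.items()}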
TRAINING
Input: Language corpus of Afar, Amharic, Nuer, Oromo, Sidamo, Somali, Tigrigna.
1. The documents from the corpus are taken one by one and preprocessed, i.e. special
characters, digits, etc. are removed.
2. Tokenize the text into words (tokens).
3. Generate all possible character n-grams (n =3 for Naïve Bayes and SVM) for each
language.
4. Store n-gram profiles of all language sets (classifier).
TESTING
The first algorithm used in this thesis is Naive Bayes, which uses character n-grams of
size 3 as the feature set. The Naive Bayesian method seeks to maximize the probability
of a language given a document, P(L|D), where L is a language and D is a document. Using
Bayes' rule and estimating n-gram probabilities by their relative frequencies, this is
computed as

$l_j = \frac{f_j}{|f_j|}$   (Equation 3.6)

$P(L|D) = \sum_{i=1}^{n-\alpha+1} \ln l_j(c_i)$   (Equation 3.7)

where $l_j(c_i)$ is the probability of the n-gram $c_i$ in the language model $l_j$, and
the sum runs over the character α-grams of a document of n characters. After calculating
the probabilities for all languages, the most likely language profile is selected as the
language of the test string.
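A minimal sketch of this scoring step, combining Equations 3.6 and 3.7 with the Laplace
smoothing of Equation 3.3 (the names are illustrative; unseen n-grams are smoothed here
rather than discarded):

import math

def classify(text, profiles):
    # profiles: {language: Counter of character 3-gram frequencies}.
    scores = {}
    for lang, counts in profiles.items():
        total, vocab = sum(counts.values()), len(counts)
        score = 0.0
        for token in text.split():
            for i in range(len(token) - 2):
                gram = token[i:i + 3]
                # Log of the Laplace-smoothed relative frequency.
                score += math.log((counts.get(gram, 0) + 1.0) / (total + vocab))
        scores[lang] = score
    # The language with the maximum log-probability wins.
    return max(scores, key=scores.get)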
Naive Bayes is also trained to identify intermixed text in multiple languages. For mixed
input texts (2-3 languages in the same document), the classifier recognizes all the
mixed inputs and returns up to the three most likely languages.
The second algorithm used in the thesis is the Support Vector Machine (SVM), employed
from the class of more complex classifiers. The SVM showed good results in different
comparison tests regarding language identification and generalization in
high-dimensional spaces (Kruengkrai, Srichaivattana, Sorlertlamvanich, & Isahara, 2005),
and has also shown good performance in many other pattern-recognition tasks. An
available software module with full SVM functionality was used; the classifier was
therefore not implemented from first principles.
The LIBSVM library for support vector machines (Chang & Lin, 2001) provides a full
implementation of several SVMs. A language model was built from samples of size α taken
from the training set. These samples contain a frequency count of each n-gram
combination in the character string; thus, the feature dimension of the SVM equals the
number of n-gram combinations. Samples of the testing set are created using the same
character window as used to build the language model. After training the SVM language
model, the test samples can be classified according to language. Details about how the
SVM works can be found in Section 2.1.2.2.
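As an equivalent sketch of this setup (using scikit-learn's linear SVM and character
3-gram counts in place of the LIBSVM interface; the toy texts and labels below are
invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Character 3-gram frequency counts as the feature vector, linear SVM on top.
model = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(3, 3)),
    LinearSVC())

train_texts = ['selam new', 'nagaa jirtuu', 'akkam jirta']   # toy examples
train_labels = ['Amharic', 'Oromo', 'Oromo']
model.fit(train_texts, train_labels)
print(model.predict(['nagaa']))   # e.g. ['Oromo']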
[Diagram: Training Phase - Language Corpus → Language Profile (n-gram, count, lang tag) → Model; Prediction Phase - Unknown Document → Identification Model → Language]
Figure 3.2 Naïve Bayes and SVM models applied to text languages in this study
Language Corpus: Language corpus that includes the seven languages in this study is
read in from a file during training phase.
Unknown Document: An unknown document is read in from a file or raw texts are
typed in during prediction phase.
Feature Extraction: The feature extractor extracts the features for the experiment.
Character n-grams were used as the feature set; for Naïve Bayes and SVM, n-grams of
size 3 are extracted from the tokens of the training corpus.
Language Profile: After feature extraction, the counts of each n-gram feature and their
respective language tag are stored for each of the seven languages as a language
profile. This is done during the training phase; the language profile is read in during
the prediction phase.
Model: Naïve Bayes and SVM classifiers were used. Implementations of the classifiers
follow the approaches discussed in Section 3.2.3. The training phase results in a model
for language identification.
Language: After using the constructed model, the language of the unknown document is
identified and displayed during the prediction phase.
The last approach used in this thesis uses dictionaries containing stopwords. It is
known that some languages share similar words and special characters, but these are not
common to all. Since stopwords are very specific to a language, they are usually a good
measure for automatic language detection.
The stopwords dictionaries were created before applying the algorithm. They include a
total of approximately 185,570 stopwords for the seven languages in this study. The
algorithm takes the text, finds the most common words and compares them with the
stopwords. The language with the most stopwords is selected as the identified language.
The algorithm counts how many unique stopwords are seen in the analyzed text, puts this
in a language-ratios dictionary, and detects the language based on the ratios in that
dictionary.
Dictionary Method
Languages Dictionary Size (Stopwords)
Afar 6,673
Amharic 50,000
Nuer 433
Oromo 50,000
Sidamo 15,620
Somali 50,000
Tigrigna 12,844
Total 185,570
Table 3.7 Dictionary size (number of stopwords) for the Dictionary Method
TESTING
7. Select the language corresponding to the highest similarity value.
Figure 3.3 Dictionary Method applied to text languages in this study: Unknown Document → Document Preprocessing → Intersection → Max Count → Language
An unknown document is read in from a file, or raw text is typed in, during the
prediction phase; document preprocessing is then done. Document preprocessing involves
removing all numbers, special characters and punctuation marks. Tokenization, the task
of chopping a running text up into pieces called tokens, is then performed; the standard
(white-space) tokenization that separates words by the 'space' character is used. Then
the intersection between the stopwords dictionary and the tokenized words is taken, and
the language with the maximum count is identified as the language of the unknown
document.
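A minimal sketch of this intersection step, assuming `stopword_dicts` maps each of the
seven languages to its stopword set (the Appendix lists the NLTK-based GUI version):

def detect_by_stopwords(text, stopword_dicts):
    # Tokenize on white space and lower-case the tokens.
    tokens = set(text.lower().split())
    # Count the unique stopwords of each language seen in the text.
    overlap = {lang: len(tokens & sw) for lang, sw in stopword_dicts.items()}
    # The language with the maximum count wins.
    return max(overlap, key=overlap.get)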
4 RESULTS AND DISCUSSION
The experimentation conducted in this study is separated into two experiments, with each
experiment differing with regards to the amount of data used for training. The results of
the experiments are analyzed individually at the end of each experiment and an overall
analysis is done to summarize the observations.
The first experiment utilizes the entire data set to train and test the classifiers with seven
language classes. This experiment was conducted by randomly taking 90% of the corpus
of the seven target languages for training the models and the remaining 10% for testing
the models.
Accuracy and error rates were calculated for comparing the classifiers, as discussed in
Section 3.3.1 of Chapter 3. The data statistics, including all results for all tests,
can be seen in the tables below. 503,216 test documents of differing size were used for
the seven languages. These were divided into short documents (2-3 words), medium
documents (6-50 words) and long documents (more than 50 words, i.e. a long sentence or
a paragraph); these divisions correspond to the three character windows.
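For reference, a sketch of how a running text can be cut into the fixed-size character
windows used below (an assumed reading of the procedure, not the thesis's exact script):

def char_windows(text, size):
    # Consecutive, non-overlapping windows of `size` characters.
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]

# e.g. char_windows(test_text, 15), char_windows(test_text, 100), char_windows(test_text, 300)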
Naïve Bayes (n=3)
(per window: Docs = number of test documents; Correct = number correctly identified; % = % correctly identified)

             15-character window        100-character window       300-character window
Languages    Docs     Correct  %        Docs    Correct  %         Docs    Correct  %
Afar         582      560      96.23    125     123      98.4      41      41       100
Amharic      302,408  292,096  96.59    64,778  64,435   99.47     21,590  21,590   100
Nuer         45       41       91.11    10      10       100       3       3        100
Oromo        42,177   40,679   96.45    9,033   8,842    97.89     3,011   2,982    99.04
Sidamo       2,711    2,618    96.57    581     565      97.25     193     192      99.48
Somali       42,090   39,556   93.98    9,016   8,777    97.35     3,005   2,975    99
Tigrigna     1,413    1,361    96.32    303     301      99.34     101     101      100
Total        391,426  376,911  96.29    83,846  83,053   99.05     27,944  27,884   99.76
Table 4.1 Evaluation on test data for Naïve Bayes Classifier (n=3)
[Bar chart: accuracy (%) per language for the Naïve Bayes classifier (n=3) across the three character windows]
Figure 4.1 Evaluation (Accuracy) on test data for Naïve Bayes Classifier (n=3)
For a 15-character window, the 3-gram Naïve Bayes classifier showed 96.29% accuracy on
average; for a 100-character window, 99.05%. For a 300-character window, the improvement
in accuracy with increasing window size is evident: at 99.76% average accuracy, the
3-gram Naïve Bayes classifier does better than on the 15-character window by 3.47% and
better than on the 100-character window by 0.71%.
SVM (n=3)
(per window: Docs = number of test documents; Correct = number correctly identified; % = % correctly identified)

             15-character window        100-character window       300-character window
Languages    Docs     Correct  %        Docs    Correct  %         Docs    Correct  %
Afar         582      576      98.97    125     125      100       41      41       100
Amharic      302,408  302,231  99.94    64,778  64,778   100       21,590  21,590   100
Nuer         45       43       95.56    10      10       100       3       3        100
Oromo        42,177   40,882   96.93    9,033   8,865    98.14     3,011   2,985    99.14
Sidamo       2,711    2,683    98.97    581     577      99.31     193     193      100
Somali       42,090   40,688   96.67    9,016   8,770    97.27     3,005   2,985    99.33
Tigrigna     1,413    1,409    99.72    303     303      100       101     101      100
Total        391,426  388,512  99.26    83,846  83,428   99.50     27,944  27,898   99.84
Table 4.2 Evaluation on test data for SVM Classifier (n=3)
[Bar chart: accuracy (%) per language for the SVM classifier (n=3) across the three character windows]
Figure 4.2 Evaluation (Accuracy) on test data for SVM Classifier (n=3)
For a 15-character window, the SVM classifier showed 99.26% accuracy on average; for a
100-character window, 99.50%. For a 300-character window, the improvement in accuracy
with increasing window size is evident: at 99.84% average accuracy, the SVM classifier
does better than on the 15-character window by 0.58% and better than on the
100-character window by 0.34%.
The second experiment varies the size of the training data used to train the classifiers
to simulate scarce data availability. Furthermore, it investigates the effect that
homogeneously distributed training data (equal amounts of training data for each
language class) has on a classifier when compared to a training set with
heterogeneously distributed training data. The models are re-trained and tested with the
same data size for the seven languages in this study. The smallest training set is that
of the Nuer language (4,889 training characters); therefore, 4,889 training characters
were used per language. Tables 4.3 and 4.4 show the test results for the Naïve Bayes and
SVM models trained under the same corpus size.
Naïve Bayes (n=3)
(per window: Docs = number of test documents; Correct = number correctly identified; % = % correctly identified)

             15-character window        100-character window       300-character window
Languages    Docs     Correct  %        Docs    Correct  %         Docs    Correct  %
Afar         582      533      91.58    125     120      96        41      41       100
Amharic      302,408  278,064  91.95    64,778  62,090   95.85     21,590  21,150   97.96
Nuer         45       39       86.67    10      10       100       3       3        100
Oromo        42,177   38,723   91.81    9,033   8,658    95.85     3,011   2,950    97.97
Sidamo       2,711    2,492    91.92    581     557      95.85     193     189      97.93
Somali       42,090   37,603   89.34    9,016   8,642    95.85     3,005   2,944    97.97
Tigrigna     1,413    1,296    91.72    303     291      96.04     101     99       98.02
Total        391,426  358,750  91.65    83,846  80,368   95.85     27,944  27,376   97.97
Table 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language
[Bar chart: accuracy (%) per language for the Naïve Bayes classifier (n=3) trained with homogeneously distributed data]
Figure 4.3 Evaluation on test data for Naïve Bayes Classifier (n=3) trained with
homogeneously distributed training data per language
SVM (n=3)
(per window: Docs = number of test documents; Correct = number correctly identified; % = % correctly identified)

             15-character window        100-character window       300-character window
Languages    Docs     Correct  %        Docs    Correct  %         Docs    Correct  %
Afar         582      547      93.99    125     121      96.8      41      40       97.56
Amharic      302,408  287,106  94.94    64,778  62,640   96.7      21,590  21,229   98.33
Nuer         45       41       91.11    10      10       100       3       3        100
Oromo        42,177   38,773   91.93    9,033   8,567    94.84     3,011   2,935    97.48
Sidamo       2,711    2,548    93.99    581     558      96.04     193     190      98.45
Somali       42,090   38,584   91.67    9,016   8,472    93.97     3,005   2,935    97.67
Tigrigna     1,413    1,338    94.69    303     293      96.7      101     99       98.02
Total        391,426  368,937  94.25    83,846  80,661   96.2      27,944  27,431   98.16
Table 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language
[Bar chart: accuracy (%) per language for the SVM classifier (n=3) trained with homogeneously distributed data]
Figure 4.4 Evaluation on test data for SVM Classifier (n=3) trained with homogeneously
distributed training data per language
From the above results, we can see the effect that homogeneously distributed training
data (equal amounts of training data for each language class) has on a classifier when
compared to a training set with heterogeneously distributed training data.
For a 15-character window, the 3-gram Naïve Bayes and SVM classifiers showed 91.65% and
94.25% average accuracy respectively; for a 100-character window, 95.85% and 96.2%
respectively. For a 300-character window, the improvement in accuracy with increasing
window size is again evident: the 3-gram Naïve Bayes classifier, at 97.97% average
accuracy, does better than on the 15-character window by 6.32% and better than on the
100-character window by 2.12%, while the SVM classifier, at 98.16% average accuracy,
does better than on the 15-character window by 3.91% and better than on the
100-character window by 1.96%.
Dictionary Method
(per window: Docs = number of test documents; Correct = number correctly identified; % = % correctly identified)

             15-character window        100-character window       300-character window
Languages    Docs     Correct  %        Docs    Correct  %         Docs    Correct  %
Afar         582      198      34.02    125     80       64        41      39       95.12
Amharic      302,408  300,049  99.22    64,778  64,778   100       21,590  21,590   100
Nuer         45       44       97.78    10      10       100       3       3        100
Oromo        42,177   17,437   41.34    9,033   5,200    57.57     3,011   2,456    81.57
Sidamo       2,711    1,338    49.35    581     305      52.49     193     175      90.67
Somali       42,090   14,151   33.62    9,016   5,127    56.87     3,005   2,374    79
Tigrigna     1,413    1,401    99.15    303     303      100       101     101      100
Total        391,426  334,618  85.49    83,846  75,803   90.41     27,944  26,738   95.68
Table 4.5 Evaluation on test data for the Dictionary Method
[Figure 4.5: bar chart of accuracy (%) per language for the Dictionary Method across the three character windows]
For a 15-character window, the dictionary method showed 85.49% accuracy on average; for
a 100-character window, 90.41%. For a 300-character window, the improvement in accuracy
with increasing window size is again evident: at 95.68% average accuracy, the dictionary
method does better than on the 15-character window by 10.19% and better than on the
100-character window by 5.27%.
Higher error rates were noticed on the smallest character window for all classifiers in
this study. To understand the results better, the output of all classifiers was analyzed
and the types of errors that occurred were investigated using confusion matrices
(Tables 4.6 to 4.11). The correct language of a set of samples is represented on each
row; the languages selected by the classifiers are indicated by the columns. A
classifier with better overall accuracy has more samples on the diagonal of the matrix.
             Afar          Amharic           Nuer         Oromo            Sidamo          Somali           Tigrigna
Afar         560 (96.23%)  0 (0%)            2 (4.44%)    206 (0.49%)      0 (0.04%)       697 (1.66%)      0 (0%)
Amharic      0 (0%)        292,096 (96.59%)  0 (0%)       0 (0%)           0 (0%)          0 (0%)           52 (3.68%)
Nuer         2 (0.34%)     0 (0%)            41 (91.11%)  0 (0%)           0 (0%)          0 (0%)           0 (0%)
Oromo        10 (1.72%)    0 (0%)            0 (0%)       40,679 (96.45%)  60 (2.21%)      949 (2.25%)      0 (0%)
Sidamo       5 (0.86%)     0 (0%)            0 (0%)       249 (0.59%)      2,618 (96.57%)  888 (2.11%)      0 (0%)
Somali       5 (0.86%)     0 (0%)            2 (4.44%)    1,043 (2.47%)    33 (1.22%)      39,556 (93.98%)  0 (0%)
Tigrigna     0 (0%)        10,312 (3.41%)    0 (0%)       0 (0%)           0 (0%)          0 (0%)           1,361 (96.32%)
Table 4.6 Confusion matrix of Naïve Bayes Classifier (n=3) for 15-chw
Table 4.7 Error rates for languages calculated from confusion matrices in Table 4.6
Afar     Amharic    Nuer     Oromo    Sidamo    Somali    Tigrigna    Overall
1.03%    0.06%      4.44%    3.07%    1.03%     3.33%     0.28%       0.74%
Table 4.9 Error rates for languages calculated from confusion matrices in Table 4.8
Table 4.11 Error rates for languages calculated from confusion matrices in Table 4.10
From the above three confusion matrices on a window size of 15 characters, we can see
that errors result from confusions across all seven languages. When using the 3-gram
Naïve Bayes classifier, the most often misclassified language is Nuer, with an error
rate of 8.89%: Nuer test documents were classified as Afar and as Somali 4.44% of the
time each. The second most often misclassified language is Somali, with an error rate of
6.02%: Somali test documents were classified as Afar 1.66% of the time, as Oromo 2.25%
of the time and as Sidamo 2.11% of the time. When using the SVM classifier, the most
often misclassified language is again Nuer, with an error rate of 4.44%: Nuer test
documents were classified as Afar and as Somali 2.22% of the time each. The second most
often misclassified language is Somali, with an error rate of 3.33%: Somali test
documents were classified as Afar 2.23% of the time, as Oromo 0.61% of the time and as
Sidamo 0.48% of the time. When using the Dictionary Method, the most often misclassified
language is Somali, with an error rate of 66.38%: Somali test documents were classified
as Afar 18.91% of the time, as Nuer 0.01% of the time, as Oromo 36% of the time and as
Sidamo 11.46% of the time. This shows that Nuer has a higher relationship with Afar and
Somali, and vice versa; it also shows that Somali has a higher relationship with Afar,
Oromo and Sidamo, and vice versa.
4.3 Discussion
From the above results, we can see that classification accuracy increases as the test
window becomes larger, to the extent that the SVM classifier with a 300-character window
achieved an error rate of 0.16%. For the smallest test samples, accurate classification
was difficult: considering the best classifier in this research, the SVM, the lowest
error rate was 0.06% and the highest was 4.44%. Increasing the size of the character
windows improved identification accuracy in all cases.
The above results also show the effect that homogeneously distributed training data
(equal amounts of training data for each language class) has on a classifier when
compared to a training set with heterogeneously distributed training data.
For the 15-character window, the SVM classifier trained with 3-grams did slightly better
than the 3-gram Naïve Bayes classifier on the same dataset of 4,889 training characters
per language; the difference between the SVM and the Naïve Bayes classifier is 2.6% on
average.
For a 100-character window, the SVM classifier trained with 3-grams outperforms the
3-gram Naïve Bayes classifier, which reaches 95.85% accuracy; the Naïve Bayes classifier
showed a 0.35% lower average accuracy than the SVM classifier trained with 3-grams.
For a 300-character window, as with the 100-character window, the SVM classifier trained
with 3-grams slightly outperforms the 3-gram Naïve Bayes classifier, by 0.19% on
average. The improvement in accuracy with increasing character-window size is again
evident. We can also see that the accuracies of the SVM and Naïve Bayes classifiers
decrease when smaller training data are employed.
The performance of the classifiers across the different window sizes for every language
is shown in Figures 4.6 to 4.12.
[Line charts: error rate (%) at the 15-, 100- and 300-character windows for the Naïve Bayes classifier, the SVM classifier and the Dictionary Method, one chart per language]
Figure 4.6 Classification Results for Afar Language
Figure 4.7 Classification Results for Amharic Language
Figure 4.8 Classification Results for Nuer Language
Figure 4.9 Classification Results for Oromo Language
Figure 4.10 Classification Results for Sidamo Language
Figure 4.11 Classification Results for Somali Language
Figure 4.12 Classification Results for Tigrigna Language
The Naïve Bayes classifier is used for multilingual language identification. For
mixed-language input, the top three languages found are returned, and approximate
percentages of the total text bytes are also returned (e.g. 80% Amharic and … of
Somali). To evaluate multilingual identification, an artificial corpus containing 1,050
documents was constructed. Each document is a concatenation of three sections in
different languages, constructed randomly from a collection of medium-length texts per
language (6 to 50 words).
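A minimal sketch of this top-three selection, reusing the Laplace-smoothed
log-probability scoring sketched in Chapter 3 (the helper names are illustrative):

import math

def nb_log_prob(text, counts):
    # Sum of Laplace-smoothed log-probabilities of the character 3-grams.
    total, vocab = sum(counts.values()), len(counts)
    return sum(math.log((counts.get(text[i:i + 3], 0) + 1.0) / (total + vocab))
               for i in range(len(text) - 2))

def top3_languages(text, profiles):
    # Rank the languages by log-probability and keep the best three.
    scores = {lang: nb_log_prob(text, c) for lang, c in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)[:3]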
Multilingual Identification

Languages                      % correctly identified
Afar, Amharic, Oromo           93.33
Afar, Amharic, Sidamo          100
Afar, Amharic, Somali          93.33
Afar, Amharic, Tigrigna        96.67
Afar, Nuer, Oromo              93.33
Afar, Nuer, Sidamo             100
Afar, Nuer, Somali             93.33
Afar, Nuer, Tigrigna           100
Afar, Oromo, Sidamo            86.67
Afar, Oromo, Somali            93.33
Afar, Oromo, Tigrigna          93.33
Afar, Sidamo, Somali           93.33
Afar, Sidamo, Tigrigna         100
Afar, Somali, Tigrigna         93.33
Amharic, Nuer, Oromo           100
Amharic, Nuer, Sidamo          100
Amharic, Nuer, Somali          100
Amharic, Nuer, Tigrigna        96.67
Amharic, Oromo, Sidamo         93.33
Amharic, Oromo, Somali         93.33
Amharic, Oromo, Tigrigna       96.67
Amharic, Sidamo, Somali        100
Amharic, Sidamo, Tigrigna      96.67
Amharic, Somali, Tigrigna      96.67
Nuer, Oromo, Sidamo            93.33
Nuer, Oromo, Somali            93.33
Nuer, Oromo, Tigrigna          100
Nuer, Sidamo, Somali           93.33
…
Total                          1,005 of 1,050 test documents correctly identified (95.71%)

[Bar chart: accuracy (%) per language combination]
Figure 4.13 Evaluation (Accuracy) on test data for Multilingual Identification

From the above table, we can see that 45 out of 1,050 documents were wrongly classified;
this corresponds to 95.71% accuracy. We can also see that the highest error rates
(13.33%) result from confusions within the Afar, Oromo, Sidamo and Somali languages. The
most wrongly classified language combinations are 'Afar, Oromo, Sidamo' and 'Oromo,
Sidamo, Somali', where Sidamo was classified as Oromo in both cases. This shows us that
Sidamo has a higher relationship with Oromo and vice versa.
5 CONCLUSION AND RECOMMENDATION
5.1 Conclusion
Factors affecting accuracy, such as the size and variety of the training data, the size
of the string to be identified, and the type of classifier employed, were investigated.
Three approaches were considered in this study: Naïve Bayes, SVM and the dictionary
method. N-gram statistics were used as classification features for Naïve Bayes and SVM.
The Naïve Bayes and SVM classifiers were trained using character n-grams of size 3 as
the feature set; the dictionary method uses stopwords. Three different sizes of
character windows were used to perform the tests. These sizes were chosen to provide a
range of challenges for classification (the larger the test string, the easier the
classification task becomes). Text from various domains was collected and preprocessed
before building the models. The Naïve Bayes and SVM classifiers were trained with
heterogeneously distributed training data in the first experiment and with
homogeneously distributed training data (equal amounts of training data for each
language class) in the second experiment. All classifiers were tested under the same
conditions.
In the first experiment, which utilizes the entire data set to train and test the classifiers
with seven language classes, the 3-gram Naïve Bayes and SVM classifiers showed an
average classification accuracy of 98.37% and 99.53% respectively. In the second
experiment, which focused on varying the size of the training data used to train the
classifiers to simulate scarce data availability and which investigated the effect that
homogeneously distributed training data has on a classifier, the 3-gram Naïve Bayes and
SVM classifiers showed an average classification accuracy of 95.16% and 96.2%
respectively. The dictionary method showed an average classification accuracy of
90.53%.
A confusion matrix analysis made it clear that samples were often classified incorrectly
as other languages within the same language families. For the larger window sizes, the
confusion is not that significant and classification is better. Increase in size of the
character windows and training data size improved the identification accuracy in all
cases.
For multilingual language identification, the Naïve Bayes classifier is used: the top
three languages found are returned for mixed-language input. Multilingual identification
was therefore evaluated on documents, each a concatenation of three sections in
different languages constructed randomly from a collection of medium-length texts per
language (6 to 50 words). This led to 95.71% accuracy.
The main challenge in this study is the identification of closely related languages that
share similar character sequences and lexical units (e.g. Amharic and Tigrigna). Other
challenges were identifying the language of short excerpts, particularly those
containing non-standard language, and the unavailability of a standard corpus.
5.2 Recommendation
The results of this research are beneficial and can be used in any of the areas of
language identification applications. Language identification of text is important
because large quantities of text are processed or filtered automatically for tasks such
as spell checking, information retrieval and machine translation. The results of this
research can therefore be used by anyone who is interested in these research areas.
For further studies, the researchers recommend the use of classification methods
combined with linguistically motivated features such as POS tags and morphological
information which can provide empirical evidence on the convergences and divergences
of language varieties in terms of lexicon, orthography, morphology and syntax.
REFERENCES
Botha, G., Zimu, V., & Barnard, E. (2006). Text-based Language Identification for the South
African Languages. SAIEE Africa Research Journal .
Cavnar, W., & Trenkle, J. (1994). N-gram-Based Text Categorization. Proceedings of the Third
Annual Symposium on Document Analysis and Information Retrieval.
Central Intelligence Agency. (n.d.). Ethiopia. Retrieved April 13, 2017, from The World
Factbook: https://www.cia.gov/library/publications/the-world-factbook/geos/et.html
Central Statistical Agency. (2010). Population and Housing Census 2007 Report. Retrieved April
13, 2017, from http://catalog.ihsn.org/index.php/catalog/3583/download/50086
Chang, C., & Lin, C. (2001). LIBSVM: a library for support vector machines. Retrieved August
2, 2017, from http://www.csie.ntu.edu.tw/cjlin/libsvm
Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale
regression problems. Journal of Machine Learning Research, I, 143-160.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support Vector Machines: and
other kernel-based learning methods. New York: Cambridge University Press.
Desta, L. W. (2014). Modeling Text Language Identification for Ethiopian Cushitic Languages.
Master's thesis, HiLCoE School of Computer Science.
Dunning, T. (1994). Statistical identification of language. Computing Research Lab, New Mexico
State University.
Emerson, G., Tan, L., Fertmann, S., Palmer, A., & Regneri, M. (2014). SeedLing: Building and
using a seed corpus for the Human Language Project. 77–85.
Ethiopian Languages. (2008). Retrieved November 28, 2017, from Dinknesh Ethiopia Tour:
http://www.dinkneshethiopiatour.com/index.htm
Google, T. (2013). Google Translate - now in 80 languages. Retrieved December 6, 2017
Hakkinen, J., & Tian, J. (2001). n-gram and decision tree based language identification for written
words. Automatic Speech Recognition and Understanding, 335–338.
Hassan, R. J. (1981). The Oromo Orthography of Shaykh Bakri Saṗalō. Bulletin of the School of
Oriental and African Studies, 550-566.
Hayward, & Hassan. (1981). The Oromo Orthography of Shaykh Bakri Saṗalō. Bulletin of the
School of Oriental and African Studies, 551.
Heine, B. (1981). The Waata Dialect of Oromo: Grammatical Sketch and Vocabulary.
House, S. A., & Neuburg, P. E. (1977). Toward automatic identification of the language of an
utterance. I. Preliminary methodological considerations. Journal of the Acoustical Society
of America, 708–713.
Ishwaran, H., & Rao, J. (2009). Decision tree: introduction. In Encyclopedia of Medical Decision
Making (pp. 323–328).
Judith, R. (2013). Globally Speaking: Motives for Adopting English Vocabulary in Other
Languages (Multilingual Matters). 165.
Kikui, G.-I. (1994). Identifying the coding system and language of on-line documents on the
internet. Proceedings of the 16th International Conference on Computational Linguistics,
Denmark.
King, B., & Abney, S. (2013). Labeling the languages of words in mixed- language documents
using weakly supervised methods. Proceedings of the 2013 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies.
Kingsford, C., & Salzberg, S. L. (2008). What are decision trees? Nature biotechnology.
Kralisch, Anett, & Mandl, T. (2006). Barriers to information access across languages on the
internet: Network and language effects. Proceedings of the 39th Annual Hawaii
International Conference on System Sciences, volume 3.
Kullback, S., & Leibler, R. (1951). On information and sufficiency. In Annals of Mathematical
Statistics (pp. 79–86).
Language Access Act Fact Sheet. (2011). Retrieved April 13, 2017, from
http://ohr.dc.gov/sites/default/files/dc/sites/ohr/publication/attachments/LAAFactSheet-
English.pdf
Lewis, I. (1998). Peoples of the Horn of Africa: Somali, Afar and Saho. Red Sea Press, 11.
Lipschutz, S., & Lipson, M. (2009). Linear Algebra (Schaum’s Outlines). McGraw Hill.
Ljubesic, Fiser, & Erjavec. (2014). Tweet-CaT: a Tool for Building Twitter Corpora of Smaller
Languages. Proceedings of the International Conference on Language Resources and
Evaluation (LREC), 2279–2283.
Lodhi, Shawe-Taylor, Cristianini, & Watkins. (2002). Text classification using string kernels.
Journal of Machine Learning Research.
Lui, M., & Baldwin, T. (2011). Cross-domain feature selection for language identification.
Proceedings of the 5th International Joint Conference on Natural Language Processing.
Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
MIT Press.
Martins, B., & Silva, M. (2005). Language Identification in Web Pages. Proceedings of the 20th
ACM Symposium on Applied Computing (SAC), 763-768.
Meyer, R. (2006). Amharic as lingua franca in Ethiopia. Journal of African Languages and
Linguistics, 117–131.
Nguyen, D., & Dogruoz, A. S. (2013). Word Level Language Identification in Online
Multilingual Communication. Proceedings of the International Conference on Empirical
Methods in Natural Language Processing (EMNLP), 857–862.
Omniglot. (2017). Afar alphabets, pronunciation and language. Retrieved December 6, 2017,
from http://www.omniglot.com/writing/afar.htm
Omniglot. (2017). Nuer alphabets, pronunciation and language. Retrieved December 6, 2017,
from http://www.omniglot.com/writing/nuer.htm
Omniglot. (2017). Somali alphabets, pronunciation and language. Retrieved December 6, 2017,
from http://www.omniglot.com/writing/somali.htm
Palmer, D. (2010). Text Processing. (N. Indurkhya, & F. Damerau, Eds.) Handbook of Natural
Language Processing, 9–30.
Pankurst, A. (1991). Indigenising Islam in Wällo: ajäm, Amharic verse written in Arabic script.
Proceedings of the Xlth International Conference of Ethiopian Studies, Addis Ababa.
Peng, F., & Schuurmans, D. (2003). Combining naive Bayes and n-gram language models for text
classification. Springer.
Peng, F., Schuurmans, D., & Wang, S. (2003). Augmenting Naive Bayes Classifiers with
Statistical Language. University of Waterloo Canada.
Powers, & David, M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC,
Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, II,
37–63.
Rehurek, Radim, & Kolkus, M. (2009). Language identification on the web: Extending the
dictionary method. Proceedings of the 10th International Conference on Intelligent Text
Processing and Computational Linguistics (CICLing-2009), 357-368.
Reynar, P. S. (1996). Language identification: Examining the issues. Proceedings of the 5th
Symposium on Document Analysis and Information Retrieval, 125-135.
Shannon, C. (1951). Prediction and Entropy of Printed English. Bell system technical journal.
Simões, A., Almeida, J. J., & Byers, S. D. (1998). Language Identification: a Neural Network
Approach. ACM.
Stephanie. (2018). Experimental Design. Retrieved February 4, 2018, from Statistics How To:
http://www.statisticshowto.com/
Teahan, W. J. (2000). Text classification and segmentation using minimum cross- entropy.
Proceedings of the 6th International Conference Recherche d’Information Assistee par
Ordinateur (RIAO’00), 943–961.
Truica, C.-O., Velcin, J., & Boicea, A. (2015). Automatic Language Identification for Romance
Languages using Stopwords and Diacritics. University Politehnica of Bucharest.
WebCorp. (2018). WebCorp. (B. C. University, Producer, & Birmingham City University)
Retrieved February 2, 2018, from WebCorp: http://www.webcorp.org.uk/live/
Yamaguchi, Hiroshi, & Tanaka-Ishii, K. (2012). Text segmentation by language using minimum
description length. Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics (ACL 2012), 969-978.
Zampieri, M., Gebre, B. G., & Diwersy, S. (2012). Classifying pluricentric languages: Extending
the monolingual model. Proceedings of the Fourth Swedish Language Technlogy
Conference (SLTC).
APPENDIX
Dictionary Method
# -*- coding: utf-8 -*-
import Tkinter
import tkMessageBox
from Tkinter import *
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

win = Tk()
win.title("Language Detector")
win.geometry("500x500")

def act():
    languages_ratios = {}
    # Split all punctuation marks into separate tokens.
    tokens = wordpunct_tokenize(svalue.get())
    words = [word.lower() for word in tokens]
    # Compute, per language included in nltk, the number of unique
    # stopwords appearing in the analyzed text.
    # lan = ['Afar', 'Amharic', 'Nuer', 'Oromo', 'Sidamo', 'Somali', 'Tigrigna']
    # for language in lan:
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)
    # Calculate the score of the given text for the several languages and
    # return the highest scored; it uses a stopwords-based approach, counting
    # how many unique stopwords are seen in the analyzed text.
    most_rated_language = max(languages_ratios, key=languages_ratios.get)
    if languages_ratios[most_rated_language] == 0:
        # No stopword of any language was found in the input.
        tkMessageBox.showinfo("Language is", "Can't detect this text")
    else:
        tkMessageBox.showinfo("Language is", most_rated_language)

svalue = StringVar()  # defines the widget state as string
w = Entry(win, textvariable=svalue, width=60)  # adds a text-entry widget
foo = Button(win, text="Check Language", command=act, fg="black")
foo.config(width=20, height=2)
w.pack(ipady=50, pady=20)
foo.pack()
win.mainloop()
Generate N-Grams Frequency Profiles for Languages
import operator
import os
import sys

try:
    from nltk.tokenize import RegexpTokenizer
    from nltk.util import ngrams
except ImportError:
    print '[!] You need to install nltk (http://nltk.org/index.html)'
    sys.exit(-1)

LANGDATA_FOLDER = '.'

# The class wrapper below is reconstructed; the original definition line was
# lost in extraction.
class NGramFrequencyProfiler(object):

    def __init__(self):
        """Constructor"""
        self._languages_statistics = {}
        self._tokenizer = RegexpTokenizer("[a-zA-Z]+")
        self._langdata_path = LANGDATA_FOLDER

    def _tokenize_text(self, raw_text):
        tokens = self._tokenizer.tokenize(raw_text)
        return tokens

    def _generate_ngrams(self, tokens):
        # Generate the character 3-grams of every token (body reconstructed
        # using nltk.util.ngrams, which was imported above).
        generated_ngrams = []
        for token in tokens:
            for ngram in ngrams(token, 3):
                generated_ngrams.append(''.join(ngram))
        return generated_ngrams

    def _count_ngrams_and_hash_them(self, ngrams_list):
        # Count the occurrences of every n-gram.
        ngrams_statistics = {}
        for ngram in ngrams_list:
            if not ngrams_statistics.has_key(ngram):
                ngrams_statistics.update({ngram: 1})
            else:
                ngram_occurrences = ngrams_statistics[ngram]
                ngrams_statistics.update({ngram: ngram_occurrences + 1})
        return ngrams_statistics

    def _calculate_ngram_occurrences(self, text):
        tokens = self._tokenize_text(text)
        ngrams_list = self._generate_ngrams(tokens)
        ngrams_statistics = self._count_ngrams_and_hash_them(ngrams_list)
        # Keep the 10,000 most frequent n-grams, sorted by frequency.
        ngrams_statistics_sorted = sorted(ngrams_statistics.iteritems(),
                                          key=operator.itemgetter(1),
                                          reverse=True)[0:10000]
        return ngrams_statistics_sorted

    def generate_ngram_frequency_profile_from_raw_text(self, raw_text, output_filename):
        # Write one "ngram<TAB>count" line per n-gram into the profile file.
        output_filenamepath = os.path.join(self._langdata_path, output_filename)
        profile_ngrams_sorted = self._calculate_ngram_occurrences(raw_text)
        fd = open(output_filenamepath, mode='w')
        for ngram in profile_ngrams_sorted:
            fd.write('%s\t%s\n' % (ngram[0], ngram[1]))
        fd.close()

    def generate_ngram_frequency_profile_from_file(self, file_path, output_filename):
        raw_text = open(file_path, mode='r').read()
        self.generate_ngram_frequency_profile_from_raw_text(raw_text, output_filename)
Appendix 2 Sample Outputs
Multilingual Identification (Naïve Bayes)