Development of Amharic Morphological Analyzer Using Memory-Based Learning
1 Introduction
Morphological analysis helps to find the minimal units of a word that hold linguistic information for further processing. It plays a critical role in the development of natural language processing (NLP) applications. In most practical language technology applications, morphological analysis is used to perform lemmatization, in which words are segmented into their minimal meaningful units [11]. In morphologically complex languages, morphological analysis is also a core component in information retrieval, text summarization, question answering, machine translation, etc. There are two broad categories of approaches in computational morphology: rule-based and corpus-based. Currently, the most widely applied rule-based approach to computational morphology uses the two-level formalism. In the rule-based approach, the formulation of rules makes the development of morphological analysis systems costly and time consuming [4,11]. Because of the need for hand-crafted rules covering the morphology of a language, and the intensive involvement of linguistic experts that rule-based approaches require, there is considerable interest in robust machine learning approaches to morphology, which extract linguistic knowledge automatically from an annotated or unannotated corpus.
Like other Semitic languages, Amharic is one of the most morphologically complex languages. It exhibits a root-pattern morphological phenomenon [1]. A root is a set of consonants (also called radicals) which carries a basic lexical meaning [12]. A pattern consists of a set of vowels which are inserted among the consonants of a root to form a stem. Semitic verbal stems, and Amharic verbal stems in particular, consist of a 'root + vowels + template' merger. For instance, the root sbr combined with the vowel pattern ee and the template CVCVC forms the stem seber ('broke'). In addition to such root-pattern morphology, Amharic words are inflected with a variety of affixes.
[Fig. 1. Architecture of the system: inflected words from a source document text pass through morpheme annotation and feature extraction to produce morphologically annotated words; memory-based learning builds a learning model; analysis then proceeds through feature extraction, morpheme identification (classification and extrapolation), stem extraction (reconstruction and morpheme insertion), and root extraction, yielding morphemes with their functions.]
Morpheme Annotation
Amharic nouns take more than two affixes in the prefix position and more than seven in the suffix position. The affixation is not arbitrary; rather, affixes attach in an ordered manner. An Amharic noun consists of a lexical part, or stem, and one or more grammatical parts. This is easy to see with an example such as the Amharic noun bEtocacewn ('their houses'). The lexical part is the stem bEt ('house'); this conveys most of the important content in the noun. Since the stem cannot be broken into smaller meaningful units [8], it is a morpheme (a primitive unit of meaning). The word contains three grammatical suffixes, each of which provides information that is more abstract and less crucial to the understanding of the word than the information provided by the stem: -oc, -acew, and -n. Each of these suffixes can be seen as providing a value for a particular grammatical feature (or dimension along which Amharic nouns can vary): -oc (plural marker), -acew (third person plural neuter), and -n (accusative). Since none of these suffixes can be broken down further, each of them is also a morpheme. Generally, these grammatical morphemes play a great role in understanding the semantics of the whole word [7,12].
The following tasks were identified and performed to prepare the annotated datasets used for training: identifying inflected words; segmenting each word into prefix, stem, and suffix; putting boundary markers between segments; and describing the representation of each marker. Morphemes attached after the stem (as suffixes) may serve seven purposes: plurality/possession, derivation, relativization, definiteness, negation, causative, and conjunction. The annotation follows the prefix-stem-suffix ([P]-[S]-[S]) structure as shown in Table 1. The brackets ([ ]) are filled with the appropriate grammatical features for each segment, where S, M, 1, K, D, and O indicate end of stem, plural, possession, preposition, derivative, and object markers, respectively. Lexicons were prepared manually in such a way as to be suitable for the extraction purpose.
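To make the annotation scheme concrete, the following is a minimal sketch in Python of how such an entry can be parsed into (segment, marker) pairs. The sample string for bEtocacewn is our own hypothetical rendering of the marker scheme described above, not a line taken from the paper's lexicon.

import re

def parse_annotation(entry):
    # Split an annotated entry into (segment, marker) pairs, where each
    # segment is immediately followed by its bracketed grammatical marker.
    return re.findall(r"([^\[\]]+)\[([^\]]+)\]", entry)

# Hypothetical annotation: S = end of stem, M = plural, 1 = possession,
# O = object marker, following the scheme above.
print(parse_annotation("bEt[S]oc[M]acew[1]n[O]"))
# [('bEt', 'S'), ('oc', 'M'), ('acew', '1'), ('n', 'O')]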
Amharic verbs have four slots for prefixes and four slots for suffixes [1,7,10]. The affixes occupy fixed positions, where prep stands for preposition; conj for conjunction; rel for relativization; neg for negation; subj for subject; appl for applicative; obj for object; def for definiteness; and acc for accusative.
In addition to the analysis of all these affixes, the root-template pattern of Amharic verbs makes morphological analysis complex. Representing these features in a form suitable for memory-based learning is a challenging task. Generally, Amharic verb stems are broken into verb roots and grammatical templates. A given root can be combined with more than 40 templates [1]. The stem is the lexical part of the verb and also the source of most of its complexity. To cover all morphologically productive verb types, we need a morphologically annotated word list with its possible inflection forms. The tokens are then manually annotated in a similar fashion to nouns and adjectives, following the prefix[], stem[], and suffix[] pattern, where '[]' is filled with the appropriate grammatical features for each segment. The sample annotation for verbs is shown in Table 2.
Feature Extraction
Once the annotated words are stored in a database, instances are extracted automatically from the morphological database based on the windowing method [3], with a fixed length of left and right context. Each instance is associated with a class, which represents the morphological category to which the given focus character belongs. An instance consists of a fixed-length vector of feature-value pairs. Each example focuses on one letter and includes a fixed number of left and right neighbor letters in an 8-1-8 window, which yields eighteen features. The window size is chosen based on the longest word in the manually annotated database. The character in focus, plus the eight preceding and eight following characters, are placed in the window. Character-based analysis considers every character or letter individually. From the basic annotation, instances were automatically extracted into a form suitable for memory-based learning by sliding a window over each word in the lexicon. We used Algorithm 1 to extract features based on this character-level analysis.
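Since Algorithm 1 itself is not reproduced here, the following is a minimal sketch of the windowing idea in Python. The padding symbol '_', the segmentation of begoc into beg + oc, and the class labels ('0' for non-boundary positions, 'S' and 'M' marking the final character of the stem and of the plural suffix) are illustrative assumptions rather than the paper's exact annotation.

def extract_instances(word, classes, left=8, right=8, pad='_'):
    # Slide a fixed window over the word: one instance per focus character.
    # classes[i] is the gold class label of the i-th character.
    padded = pad * left + word + pad * right
    instances = []
    for i, label in enumerate(classes):
        window = padded[i : i + left + 1 + right]   # 8 left + focus + 8 right
        instances.append(list(window) + [label])
    return instances

# Illustrative labels for beg + oc: 'S' ends the stem, 'M' ends the plural suffix.
for instance in extract_instances('begoc', ['0', '0', 'S', '0', 'M']):
    print(' '.join(instance))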
Memory-Based Learning
Memory-based approaches borrow some of the advantages of both probabilistic and knowledge-based methods and have been successfully applied to NLP tasks [5]. They perform classification by analogy, and a variety of NLP classification problems can be learned by reusing the same algorithms and data structures. We used TiMBL as the learning tool for our task [3]. There are a number of parameters to be tuned in memory-based learning with TiMBL. Therefore, to get optimal accuracy of the model, we started from the default settings and also tuned some of the parameters. The optimized parameters are the MVDM (modified value difference metric) and chi-square from the distance metrics, IG (information gain) from the weighting metrics, ID (inverse distance) from the class voting weights, and k, the number of nearest neighbors. These optimized parameters are used together with the different classifiers. The classifier engines we used are IGTree and IB1, which construct databases of instances in memory during the learning process. The procedure for building an IGTree is described in [6]. Instances are classified by IGTree or IB1 by matching them against the instances in the instance base. As a result of this process, we get a memory-based learning model which is used later during the morphological analysis phase.
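As an illustration of classification by analogy, the following is a minimal sketch of the IB1 idea: all training instances are kept in memory, and a new instance is classified by a majority vote of its k nearest neighbors under a feature-weighted overlap distance. This is a simplification for exposition only, not the TiMBL implementation, which additionally provides MVDM, chi-square, IGTree, and distance-weighted class voting.

from collections import Counter

def overlap_distance(a, b, weights):
    # Weighted overlap metric: sum the weights of mismatching positions.
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def ib1_classify(memory, weights, instance, k=1):
    # memory is a list of (feature_vector, class) pairs stored verbatim.
    ranked = sorted(memory, key=lambda m: overlap_distance(m[0], instance, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

Here weights would hold, for example, the information gain of each of the seventeen character positions in the window.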
Feature Extraction
Memory-based learning handles new instances by storing previous training data in memory. When a new word is given to the system to be analyzed, it is deconstructed into instances so that its representation matches the one stored in memory. Feature extraction at this stage differs from that of the training phase: the word is deconstructed into fixed-length instances without class labels at the last index. When a previously unseen word (one not found in memory) needs to be segmented, it is deconstructed and represented as instances using the same windowing information. Each such instance is compared to every instance in the training set recorded by the memory-based learner, and the classifier tries to find the training instance in memory that most closely resembles it. For instance, the word begoc is segmented and its features are extracted as shown in Fig. 2.
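A minimal sketch of this analysis-phase deconstruction, mirroring extract_instances above but producing classless instances:

def deconstruct(word, left=8, right=8, pad='_'):
    # Same windowing as in training, but with no class label appended.
    padded = pad * left + word + pad * right
    return [list(padded[i : i + left + 1 + right]) for i in range(len(word))]

for instance in deconstruct('begoc'):
    print(' '.join(instance))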
Morpheme Identification
When new or unknown inflected words are deconstructed into instances and given to the system to be analyzed, extrapolation is performed to assign the most likely neighborhood class, with its morphemes determined by their boundaries. The extrapolation is based on the similarity metric applied to the training data. If there is an exact match in memory, the classifier returns (extrapolates) the class of that instance to the new instance. Otherwise, the new instance is classified by analogy with instances in memory that have similar feature vectors, extrapolating a decision from their classes. The new instance is compared to every instance in the training set recorded by the memory-based learner, and the classifier tries to find the training instance that most closely resembles it. Taking the features of lenegerecw as shown in Fig. 3, the closest match might be instance 10 in Table 3, as they share almost all features (L8, L7, L5, L3-L1, F, R1-R8), except L6 and L4. In this case, the memory-based learner extrapolates the class of that training instance and predicts it to be the class of the new instance.
Stem Extraction
After the appropriate morphemes are identified, the next step is stem extraction. In stem extraction, individual instances are reconstructed into meaningful morphemes (their original word form) and the identified morphemes are inserted at their segmentation points. During this process, the system searches for resembling instances among previously stored patterns in memory. If there is no similar instance in memory, it uses a distance similarity metric to find the nearest neighbors. The modified value difference metric (MVDM), which looks at the co-occurrence of feature values with the target classes, is used to determine the similarity of feature values. For example, the reconstruction of the whole set of instances of the word slenegerecw is shown in Fig. 4. In the example, four non-null classes are predicted in the classification step. In the second step, the letters of the morphemic segments are concatenated and the morphemes are inserted. Then, root extraction is performed in the third step.
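A minimal sketch of this reconstruction step: the per-character class predictions are concatenated back into a word, and a boundary marker is inserted wherever a non-null class was predicted. The word begoc and the labels below are illustrative, and the exact insertion convention is an assumption.

def reconstruct(word, predicted):
    out = []
    for ch, label in zip(word, predicted):
        out.append(ch)
        if label != '0':            # non-null class => segmentation point
            out.append('[%s]' % label)
    return ''.join(out)

print(reconstruct('begoc', ['0', '0', 'S', '0', 'M']))   # beg[S]oc[M]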
Root Extraction
The smallest morphemic unit for nouns and adjectives is the stem; thus, the root extraction process is not applied to nouns and adjectives. Root extraction from verbal stems is not a complex task in Amharic, as roots are the consonants of verbal stems. To extract the root from a verbal stem, we simply remove the vowels. However, there are exceptions, as vowels in some verbal stems (e.g., when the verbal stem starts with a vowel) serve as consonants. In addition, vowels should not be removed from mono- and bi-radical verb types, since these have valid meaning when they end with vowels.
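A minimal sketch of this vowel-stripping step, assuming the Latin transliteration used in the examples above, with a, e, i, o, u, and E taken as the vowels; the exceptions just noted (stem-initial vowels, mono- and bi-radical verbs) would need explicit handling.

VOWELS = set('aeiouE')

def extract_root(stem):
    # Keep only the consonantal radicals of a verbal stem.
    return ''.join(c for c in stem if c not in VOWELS)

print(extract_root('seber'))   # sbr, the root of seber ('broke')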
4 Experiment
4.1 The Corpus
In order to evaluate the performance of the model and the learnability of the dataset, we conducted the experiment by combining nouns and verbs. To get an unbiased estimate of the accuracy of a model learned through machine learning, it should be tested on unseen data not present in the training set. Therefore, we split our dataset into training and testing sets. Our corpus contains a total of 1022 words, of which 841 are verbs and 181 are nouns (adjectives are treated as nouns as they have a similar analysis). The number of instances extracted from nouns and adjectives is 1356 and from verbs 6719, which amounts to a total of 8075 instances. A total of 26 different class labels occur within these instances.
The parameters we optimized are the modified value difference metric and chi-square from the distance metrics, information gain from the weighting metrics, inverse distance from the class voting weights, and k, the number of nearest neighbors. For various combinations of parameter values, we tuned the parameters until no better result was found.
Simply splitting the corpus into a single training and testing set may not give the best estimate of the system's performance. Thus, we used the 10-fold cross-validation technique to test the performance of the system with the IB1 and IGTree algorithms. The data is split into ten equal partitions, and each of these is used once as a test set, with the other nine as the corresponding training set. This way, every example is used exactly once as a test item, while training and test data are kept carefully separated, and the memory-based classifier is trained each time on 90% of the available training data. We also used leave-one-out cross-validation for the IB1 algorithm, which uses all available data except one (n-1) example as training material and tests the classifier on the single held-out example, repeating this for all examples. However, we found it too time consuming to use leave-one-out cross-validation with the IGTree algorithm. Table 4 shows the performance of the system for the optimized parameters.
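For illustration, a sketch of the 10-fold procedure over the extracted instances, using scikit-learn's KFold only for the splits; the experiments themselves rely on TiMBL's own cross-validation, and train_and_score is a placeholder standing in for training and evaluating the classifier on one fold.

from sklearn.model_selection import KFold

def cross_validate(instances, labels, train_and_score, folds=10):
    scores = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True).split(instances):
        train = [(instances[i], labels[i]) for i in train_idx]
        test = [(instances[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)   # mean accuracy over the ten folds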
In memory-based learning, the minimum size of the training set is not specified in advance. However, the size of the training data affects the learning performance of the algorithm. Hence, it is useful to draw learning curves in addition to reporting the experimental results. We performed a series of experiments by systematically increasing the amount of training data up to the currently available dataset of 1022 words. When drawing a learning curve, the learning is usually measured against a fixed test set. The learning curve of the system is shown in Fig. 5.
Compared to previous works, our system performed well and provided promising results. For example, the system of Gasser [7], which is rule-based, does not handle unseen or unknown words. To overcome this problem, Mulugeta and Gasser [10] developed an Amharic morphological analyzer using inductive logic programming. However, our system still performs better in terms of accuracy.
References
1. Amsalu, S., Gibbon, D.: Finite state morphology of Amharic. In: Proc. of Inter.
Conf. on Recent Advances in Natural Language Processing, Borovets, pp. 47–51
(2005)
2. Bosch, A., Busser, B., Canisius, S., Daelemans, W.: An efficient memory-based morpho-syntactic tagger and parser for Dutch. In: Proc. of the 17th Meeting Comp. Ling. in the Netherlands, Leuven, Belgium (2007)
3. Bosch, A., Daelemans, W.: Memory-based morphological analysis. In: Proc. of the
37th Annual Meeting of the Association for Computational Linguistics, Strouds-
burg (1999)
4. Clark, A.: Memory-Based Learning of Morphology with Stochastic Transducers.
In: Proc. of the 40th Annual Meeting of the Assoc. for Comp. Ling., Philadelphia
(2002)