Learning Morphological Rules for Amharic Verbs using Inductive Logic Programming
All content following this page was uploaded by Wondwossen Mulugeta Gewe on 27 October 2014.
Abstract
This paper presents a supervised machine learning approach to morphological analysis of Amharic verbs. We use Inductive Logic
Programming (ILP), implemented in CLOG. CLOG learns rules as a first order predicate decision list. Amharic, an under-resourced
African language, has very complex inflectional and derivational verb morphology, with four and five possible prefixes and suffixes
respectively. While the affixes are used to show various grammatical features, this paper addresses only subject prefixes and suffixes.
The training data used to learn the morphological rules are manually prepared according to the structure of the background
predicates used for the learning process. The training resulted in 108 stem extraction and 19 root template extraction rules from the
examples provided. After combining the various rules generated, the program has been tested using a test set containing 1,784
Amharic verbs. An accuracy of 86.99% has been achieved, encouraging further application of the method for complex Amharic
verbs and other parts of speech.
Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL8/AfLaT2012)
goal of this predicate is to separate the vowels and the consonants of a Stem. In this predicate we have used the utility predicate ‘merge’ to perform the permutation. For example, if Stem is seber and the example associates this stem with the Root sbr, then ‘root_temp’, using ‘merge’, will generate many patterns, one of which would be sbree. Ultimately, this lets the system learn that the vowel pattern [ee] is valid within a stem.

c) Learning stem internal alternations:
Another challenge for Amharic verb morphology learning is handling stem internal alternations. For this purpose, we have used the background predicate ‘set_internal_alter’:

set_internal_alter(Stem,Valid_Stem,St1,St2):-
    split(Stem,P1,X1),
    split(Valid_Stem,P1,X2),
    split(X1,St1,Y1),
    split(X2,St2,Y1).
Figure 6: Stem internal alternation extractor

This predicate works much like the ‘set_affix’ predicate except that it replaces a substring which is found in the middle of Stem by another substring from Valid_Stem. In order to learn stem alternations, we require a different set of training data showing examples of stem internal alternations. Figure 7 shows some of the examples used for learning such rules.

alter([h,e,d],[h,y,e,d]).
alter([m,o,t],[m,e,w,o,t]).
alter([s,a,m],[s,e,?,a,m]).
Figure 7: Examples for internal stem alternation learning

The first example in Figure 7 shows that for the words hed and hyed to unify, the e in the first argument should be replaced with ye.

Along with the three experiments for learning various aspects of verb morphology, we have also used two utility predicates to support the integration between the learned rules and to include some language-specific features. These predicates are ‘template’ and ‘feature’:

‘template’: used to extract the valid template for Stem. The predicate manipulates the stem to identify positions for the vowels. This predicate uses the list of vowels (vocal) in the language to assign ‘0’ for the vowels and ‘1’ for the consonants.

template([],[]).
template([X|T1],[Y|B]):-
    template(T1,B),
    (vocal(X)->Y=0;Y=1).
Figure 8: CV pattern decoding predicate

For the stem seber this predicate tries each character separately and finally generates the pattern [1,0,1,0,1] and for the stem sebr, it generates [1,0,1,1] to show the valid template of Amharic verbs.

‘feature’: used to associate the identified affixes and root CV pattern with the known grammatical features from the example. This predicate uses a codified representation of the eight subjects and four tense-aspect-mood features (‘tam’) of Amharic verbs, which is also encoded as background knowledge. This predicate is the only language-dependent background knowledge we have used in our implementation.

feature([X,Y],[X1,Y1]):-
    tam([X],X1),
    subj([Y],Y1).
Figure 9: Grammatical feature assignment predicate

6. Experiments and Result

For CLOG to learn a set of rules, the predicate and arity for the rules must be provided. Since we are learning words by associating them with their stem, root and grammatical features, we use the predicate schemas rule(stem(_,_,_,_)) for set_affix and root_vocal, and rule(alter(_,_)) for set_internal_alter. The training examples are also structured according to these predicate schemas.

The training set contains 216 manually prepared Amharic verbs. The examples contain all possible combinations of tense and subject features. Each word is first romanized, then segmented into the stem and grammatical features, as required by the ‘stem’ predicate in the background knowledge. When the word results from the application of one or more alternation rules, the stem appears in the canonical form. For example, for the word gdey, the stem specified is gdel (see the second example in Table 1).

Characters in the Amharic orthography represent syllables, hiding the detailed interaction between the consonants and the vowels. For example, the masculine imperative verb ‘ግደል’ gdel can be made feminine by adding the suffix ‘i’ (gdel-i). But, in Amharic, when the dental ‘l’ is followed by the vowel ‘i’, it is palatalized, becoming ‘y’. Thus, the feminine form would be written ‘ግደይ’, where the character ‘ይ’ ‘y’ corresponds to the sequence ‘l-i’.

To perform the romanization, we have used our own Prolog script which maps Amharic characters directly to sequences of roman consonants and vowels, using the familiar SERA transliteration scheme. Since the mapping is reversible, it is straightforward to convert extracted forms back to Amharic script.

After training the program using the example set, which took around 58 seconds, 108 rules for affix extraction, 18 rules for root template extraction and 3 rules for internal stem alternation have been learned. A sample rule generated for affix identification and for associating the word constituents with the grammatical features is shown in Figure 10.
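As an illustration only, the behaviour of the ‘template’ predicate (Figure 8) and the ‘set_internal_alter’ predicate (Figure 6) can be sketched in Python. The sketch is not the Prolog implementation used in the experiments: the vowel inventory VOWELS, the function names, and the reading of ‘split’ as plain list concatenation are assumptions made for this example.

```python
# Illustrative sketch only; not the authors' Prolog/CLOG code.
# Assumption: the romanized vowels are a, e, i, o, u (the paper's 'vocal' facts
# are not listed in the text).
VOWELS = set("aeiou")


def template(stem):
    """Map each character to 0 (vowel) or 1 (consonant), as in Figure 8."""
    return [0 if ch in VOWELS else 1 for ch in stem]


def internal_alternations(stem, valid_stem):
    """Yield (st1, st2) with stem = p + st1 + y and valid_stem = p + st2 + y.

    Assumption: 'split(L, A, B)' in Figure 6 means that L is A followed by B.
    """
    for i in range(len(stem) + 1):              # i = length of the shared prefix p
        if stem[:i] != valid_stem[:i]:
            break
        for j in range(i, len(stem) + 1):       # st1 = stem[i:j], y = stem[j:]
            y = stem[j:]
            end = len(valid_stem) - len(y)      # where y would start in valid_stem
            if end >= i and valid_stem[end:] == y:
                yield stem[i:j], valid_stem[i:end]


def apply_alter(stem, st1, st2):
    """Return every string obtained by rewriting one occurrence of st1 as st2."""
    n = len(st1)
    return [stem[:k] + st2 + stem[k + n:]
            for k in range(len(stem) - n + 1)
            if stem[k:k + n] == st1]


if __name__ == "__main__":
    print(template("seber"))
    print(template("sebr"))
    print(("e", "ye") in set(internal_alternations("hed", "hyed")))
    print(apply_alter("mot", "o", "ewo"))
```

Under these assumptions the sketch reproduces the templates given for seber and sebr, and the pair (e, ye) is among the decompositions found for the first example of Figure 7. Like the Prolog predicate, internal_alternations returns several candidate substitutions for one pair of stems; choosing among them is left to the learner.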
stem(Word, Stem, [2, 7]):-
    set_affix(Word, Stem, [y], [], [u], []),
    feature([2, 7], [imperfective, tppn]),
    template(Stem, [1, 0, 1, 1]).
Figure 10: Learned affix identification rule example

The above rule declares that, if the word starts with y and ends with u and if the stem extracted from the word after stripping off the affixes has a CVCC ([1,0,1,1]) pattern, then that word is imperfective with third person plural neutral subject (tppn).

alter(Stem,Valid_Stem):-
    set_internal_alter(Stem,Valid_Stem, [o], [e, w, o]).
Figure 11: Learned internal alternation rule example

The above rule substitutes the vowel o, in specific circumstances (which are included in the program), with ewo to transform the initial stem to a valid stem in the language. For example, if the Stem is zor, then o will be replaced with ewo to give zewor.

The other part of the program handles formation of the root of the verb by extracting the template and the vowel sequence from the stem. A sample rule generated to handle the task looks like the following:

root(Stem, Root):-
    root_vocal(Stem, Root, [e, e]),
    template(Stem, [1, 0, 1, 0, 1]).
Figure 12: Learned root-template extraction rule example

The above rule declares that, as long as the consonant vowel sequence of a word is CVCVC and both vowels are e, the stem is a possible valid verb. Our current implementation does not use a dictionary to validate whether the verb is an existing word in Amharic.

Finally, we have combined the background predicates used for the three learning tasks and the utility predicates. We have also integrated all the rules learned in each experiment with the background predicates. The integration involves the combination of the predicates in the appropriate order: stem analysis followed by internal stem alternation and root extraction.

After building the program, to test the performance of the system, we started with verbs in their third person singular masculine form, selected from the list of verbs transcribed from the appendix of Armbruster (1908)³. We then inflected the verbs for the eight subjects and four tense-aspect-mood features of Amharic, resulting in 1,784 distinct verb forms. The following are sample analyses, produced by the program, of new verbs that are not part of the training set:

InputWord: [a, t, e, m, k, u]
Stem: [?, a, t, e, m]
Template: [1, 0, 1, 0, 1]
Root: [?, t, m]
GrammaticalFeature: [perfective, fpsn*]
Figure 13: Sample Test Result (with boundary alternation)
*fpsn: first person singular neuter

The above example shows that the suffix that needs to be stripped off is [k,u] and that there is an alternation rule that changes ‘a’ to ‘?,a’ at the beginning of the word.

InputWord: [t, k, e, f, y, a, l, e, x]
Stem: [k, e, f, l]
Template: [1, 0, 1, 1]
Root: [k, f, l]
GrammaticalFeature: [imperfective, spsf*]
Figure 14: Sample Test Result (Internal alternation)
*spsf: second person singular feminine

The above example shows that the prefix and suffix that need to be stripped off are [t] and [a,l,e,x] respectively and that there is an alternation rule that changes ‘y’ to ‘l’ at the end of the stem after removing the suffix.

The system is able to correctly analyze 1,552 words, resulting in 86.99% accuracy. With the small set of training data, the result is encouraging and we believe that the performance will be enhanced with more training examples of various grammatical combinations. The wrong analyses and test cases that are not handled by the program are attributed to the absence of such examples in the training set and to an inappropriate alternation rule resulting in multiple analyses of a single test word.

Test Word          Stem           Root       Feature
[s,e,m,a,c,h,u]    [s,e,m,a,?]    [s,m,?]    perfective, sppn
[s,e,m,a,c,h,u]    [s,e,y,e,m]    [s,y,m]    gerundive, sppn
[l,e,g,u,m,u]      [l,e,g,u,m]    NA         NA
Table 2: Example of wrong analysis

Table 2 shows some of the wrong analyses and words that are not analyzed at all. The second example shows that an alternation rule has been applied to the stem, resulting in a wrong analysis (the stem should have been the one in the first example). The last example generated a stem with the vowel sequence ‘eu’, which is not found anywhere in the training set, so the word falls into the not-analyzed category.

³ Available online at: http://nlp.amharic.org/resources/lexical/word-lists/verbs/c-h-armbruster-initia-amharica/ (accessed February 12, 2012).

7. Future work

ILP has proven to be applicable for word formation rule extraction for languages with simple rules like English. Our experiment shows that the approach can also be used for complex languages with more sophisticated background predicates and more examples.

While Amharic has more prefixes and suffixes for various morphological features, our system is limited to only subject markers. Moreover, all possible combinations of subject and tense-aspect-mood have been provided in the training examples. This approach is not practical if all the prefixes and suffixes are going to be included in the learning process.

One of the limitations observed in ILP for morphology learning is the inability to learn rules from incomplete examples. In languages such as Amharic, there is a range of complex interactions among the
different morphemes, but we cannot expect every one of the thousands of morpheme combinations to appear in the training set. When examples are limited to only some of the legal morpheme combinations, CLOG is inadequate because it is not able to use variables as part of the body of the predicates to be learned.

An example of a rule that could be learned from partial examples is the following: “if a word has the prefix 'te', then the word is passive no matter what the other morphemes are”. This rule (not learned by our system) is shown in Figure 15.

stem(Word, Stem, Root, GrmFeatu):-
    set_affix(Word, Stem, [t,e], [], S, []),
    root_vocal(Stem, Root, [e, e]),
    template(Stem, [1, 0, 1, 0, 1]),
    feature(GrmFeatu, [Ten, passive, Sub]).
Figure 15: Possible stem analysis rule with partial feature

That is, S is one of the valid suffixes, Ten is the Tense, and Sub is the subject, which can take any of the possible values.

Moreover, as shown in section 2, in Amharic verbs, some grammatical information is shown by various combinations of affixes. The various constraints on the co-occurrence of affixes are the other problem that needs to be tackled. For example, the 2nd person masculine singular imperfective suffix aleh can only co-occur with the 2nd person prefix t in words like t-sebr-aleh. At the same time, the same prefix can occur with the suffix alachu for the 2nd person plural imperfective form. To represent these constraints, we apparently need explicit predicates that are specific to the particular affix relationship. However, CLOG is limited to learning only the predicates that it has been provided with.

We are currently experimenting with genetic programming as a way to learn new predicates based on the predicates that are learned using CLOG.

8. Conclusion

We have shown in this paper that ILP can be used to fast-track the process of learning morphological rules of complex languages like Amharic with a relatively small number of examples. Our implementation goes beyond simple affix identification and confronts one of the challenges in template morphology by learning the root-template extraction as well as stem-internal alternation rule identification exhibited in Amharic and other Semitic languages. Our implementation also succeeds in learning to relate grammatical features with word constituents.

9. References

Armbruster, C. H. (1908). Initia Amharica: an Introduction to Spoken Amharic. Cambridge: Cambridge University Press.
Beesley, K. R. and L. Karttunen. (2003). Finite State Morphology. Stanford, CA, USA: CSLI Publications.
Bender, M. L. (1968). Amharic Verb Morphology: A Generative Approach. Ph.D. thesis, Graduate School of Texas.
Bratko, I. and King, R. (1994). Applications of Inductive Logic Programming. SIGART Bull. 5, 1, 43-49.
Dawkins, C. H., (1960). The Fundamentals of Amharic. Sudan Interior Mission, Addis Ababa, Ethiopia.
De Pauw, G. and P.W. Wagacha. (2007). Bootstrapping Morphological Analysis of Gĩkũyũ Using Unsupervised Maximum Entropy Learning. Proceedings of the Eighth INTERSPEECH Conference, Antwerp, Belgium.
Gasser, M. (2011). HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya. Conference on Human Language Technology for Development, Alexandria, Egypt.
Goldsmith, J. (2001). The unsupervised learning of natural language morphology. Computational Linguistics, 27: 153-198.
Hammarström, H. and L. Borin. (2011). Unsupervised learning of morphology. Computational Linguistics, 37(2): 309-350.
Kazakov, D. (2000). Achievements and Prospects of Learning Word Morphology with ILP, Learning Language in Logic, Lecture Notes in Computer Science.
Kazakov, D. and S. Manandhar. (2001). Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming. Machine Learning, 43:121–162.
Koskenniemi, K. (1983). Two-level Morphology: a General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki, Technical Report No. 11.
Manandhar, S., Džeroski, S. and Erjavec, T. (1998). Learning multilingual morphology with CLOG. Proceedings of Inductive Logic Programming. 8th International Workshop in Lecture Notes in Artificial Intelligence. Page, David (Eds) pp.135–44. Berlin: Springer-Verlag.
Mooney, R. J. (2003). Machine Learning. Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 376-394.
Mooney, R. J. and Califf, M.E. (1995). Induction of first-order decision lists: results on learning the past tense of English verbs, Journal of Artificial Intelligence Research, v.3 n.1, p.1-24.
Oflazer, K., M. McShane, and S. Nirenburg. (2001). Bootstrapping morphological analyzers by combining human elicitation and machine learning. Computational Linguistics, 27(1):59–85.
Sieber, G. (2005). Automatic Learning Approaches to Morphology, University of Tübingen, International Studies in Computational Linguistics.
Yimam, B. (1995). Yamarigna Sewasiw (Amharic Grammar). Addis Ababa: EMPDA.
Zdravkova, K., A. Ivanovska, S. Dzeroski and T. Erjavec, (2005). Learning Rules for Morphological Analysis and Synthesis of Macedonian Nouns. In Proceedings of SIKDD 2005, Ljubljana.