Amharic Sentence Parsing Using Base Phrase Chunking
1 Introduction
To process and understand natural languages, the linguistic structures of texts need to be organized at different levels. A structured text increases the capability of NLP applications [2], [4]. The syntactic level of linguistic analysis concerns how words are put together to form correct sentences and determines what structural role each word plays in a sentence. Broadly speaking, syntactic analysis of a sentence generally consists of segmenting the sentence into words, grouping these words into syntactic structural units, and recognizing syntactic elements and their relationships within a structure. The syntactic level also indicates how words are grouped together into phrases, which words modify other words, and which words are of central importance in the sentence [2], [7]. Parsing can be described as a procedure that searches through various ways of combining grammatical rules to find a combination that generates a tree representing the syntactic structure of the input sentence. Parsing uses the syntax of a language to determine the functions of words in a sentence in order to generate a data structure that can help to analyze the meaning of the sentence [7]. In addition, parsing deals with a number of subproblems such as identifying constituents that can fit together. In general, parsing helps us understand how words are put together to form correct phrases or sentences, along with the structural roles of the words, and it plays a significant role in many NLP applications as it helps to reduce the overall structural complexity of sentences [13]. Some of the NLP applications where a parser is used as a component are
Abeba Ibrahim and Yaregal Assabie (2014). "Amharic Sentence Parsing Using Base Phrase Chunking". In: A. Gelbukh (Ed.), Proceedings of the 15th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2014), Springer Lecture Notes in Computer Science (LNCS), Vol. 8403, pp. 297–306, Kathmandu, Nepal. © Springer-Verlag Berlin Heidelberg 2014.
be made from a single head word or from a combination of other words. Unlike other phrase constructions, a preposition alone cannot constitute a phrase. Instead, it must be combined with other constituents, and those constituents may come either before or after the preposition. If the complements are nouns or NPs, the preposition is placed in front of the complements, whereas if the complements are PPs, it shifts to the end of the phrase. Examples are: እንደ ሰው (ĭndä säw/like a human), ከቤቱ አጠገብ (käbetu aţägäb/close to the house), etc. In Amharic phrase construction, the head of the phrase is always found at the end of the phrase, except for prepositional phrases.
Amharic follows the subject-object-verb (SOV) grammatical pattern, unlike, for example, English, which has a subject-verb-object word order [3], [19]. For instance, the Amharic equivalent of the sentence "John killed the lion" is written as "ጆን (jon/John) አንበሳውን (anbäsawn/the lion) ገደለው (gädäläw/killed)". Amharic sentences can be constructed from a simple or complex NP and a simple or complex VP. Simple sentences are constructed from a simple NP followed by a simple VP, which contains only a single verb. Complex sentences contain at least one complex NP, one complex VP, or both. Complex NPs are phrases that contain at least one embedded sentence in the phrase construction. The embedded sentence can serve as a complement.
This section discusses the Amharic base phrase chunker we used as a component to develop the parser; the chunker system is described in further detail in [9]. The output of the system, i.e. the tags of chunks, can be noun phrases, verb phrases, adjectival phrases, etc., in line with the construction rules of the language. In order to identify the boundaries of each chunk in sentences, the following boundary types are used [15]: IOB1, IOB2, IOE1, IOE2, IO, "[", and "]". The first four formats are complete chunk representations, which can identify both the beginning and the end of phrases, while the last three are partial chunk representations. All boundary types use an "I" tag for words that are inside a phrase and an "O" tag for words that are outside a phrase. They differ in their treatment of chunk-initial and chunk-final words.
− IOB1: the first word inside a phrase immediately following another phrase receives a B tag.
− IOB2: all phrase-initial words receive a B tag.
− IOE1: the final word inside a phrase immediately preceding another phrase of the same type receives an E tag.
− IOE2: all phrase-final words receive an E tag.
− IO: words inside a phrase receive an I tag; all other words receive an O tag.
− "[": all phrase-initial words receive a "[" tag; all other words receive a "." tag.
− "]": all phrase-final words receive a "]" tag; all other words receive a "." tag.
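The differences between these boundary types can be illustrated with a short sketch (our own code, not part of the chunker). For simplicity it ignores phrase types, so the "same phrase type" condition of IOB1/IOE1 is approximated by mere adjacency of chunks:

```python
# Illustration of the chunk boundary types (simplified: phrase types are
# ignored, so adjacency stands in for the "same phrase type" condition).

def encode(n_tokens, chunks, scheme):
    """chunks: list of (start, end_exclusive) spans over n_tokens tokens."""
    starts = {s for s, _ in chunks}
    ends = {e for _, e in chunks}              # exclusive end positions
    tags = ["."] * n_tokens if scheme in ("[", "]") else ["O"] * n_tokens
    for s, e in chunks:
        if scheme not in ("[", "]"):
            for i in range(s, e):
                tags[i] = "I"                  # words inside a phrase
        if scheme == "IOB2" or (scheme == "IOB1" and s in ends):
            tags[s] = "B"                      # chunk-initial word
        if scheme == "IOE2" or (scheme == "IOE1" and e in starts):
            tags[e - 1] = "E"                  # chunk-final word
        if scheme == "[":
            tags[s] = "["
        if scheme == "]":
            tags[e - 1] = "]"
    return tags

# Two adjacent chunks covering tokens 0-1 and 2, with token 3 outside:
for scheme in ("IOB1", "IOB2", "IOE1", "IOE2", "IO", "[", "]"):
    print(scheme, encode(4, [(0, 2), (2, 3)], scheme))
```

Note how only the complete representations (IOB1, IOB2, IOE1, IOE2) let a reader recover both chunk boundaries from the tags alone.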
We considered six different kinds of chunks, namely noun phrase (NP), verb phrase (VP), adjectival phrase (AdjP), adverbial phrase (AdvP), prepositional
(PP) and sentence (S). To identify the chunks, it is necessary to find the positions
where a chunk can end and a new chunk can begin. The part-of-speech (POS) tag
assigned to every token is used to discover these positions. We used the IOB2 tag set
to identify the boundaries of each chunk in sentences extracted from chunk tagged
text. Using the IOB2 tag set along with the chunk types considered, a total of 13
phrase tags were used in this work. These are: B-NP, I-NP, B-VP, I-VP, B-PP, I-PP,
B-ADJP, I-ADJP, B-ADVP, I-ADVP, B-S, I-S and O. For example, the IOB2 chunk
representation for the sentence ካሳ ያመጣው ትንሽ ልጅ እንደ አባቱ በጣም ታመመ (kasa
yamäţaw tĭnĭš lĭj ĭndä abatu bäţam tamämä/The little boy that Kassa has brought
became very sick like his father) is shown in Table 2. Accordingly, the chunk tagged
sentence would be "ካሳ N B-S ያመጣው VREL I-S ትንሽ ADJ B-NP ልጅ N I-NP እንደ P B-PP አባቱ N I-PP በጣም ADJ B-VP ታመመ V I-VP".
Table 2. IOB2 chunk representation for "ካሳ ያመጣው ትንሽ ልጅ እንደ አባቱ በጣም ታመመ"

Words                                       IOB2 chunk representation
ካሳ (kasa/Kassa)                             B-S
ያመጣው (yamäţaw/that [Kassa] has brought)     I-S
ትንሽ (tĭnĭš/little)                          B-NP
ልጅ (lĭj/boy)                                I-NP
እንደ (ĭndä/like)                             B-PP
አባቱ (abatu/his father)                      I-PP
በጣም (bäţam/very)                            B-VP
ታመመ (tamämä/became sick)                    I-VP
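As an illustration of what the IOB2 representation encodes, the following sketch (our own code, not the paper's) recovers the chunks from a tagged word sequence such as the one in Table 2:

```python
# Sketch: group an IOB2-tagged word sequence back into labeled chunks.

def chunks_from_iob2(pairs):
    """pairs: list of (word, tag) with tags like B-NP / I-NP / O."""
    chunks, current = [], None
    for word, tag in pairs:
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [word])        # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)            # continue the open chunk
        else:                                  # "O" or an inconsistent I- tag
            if current:
                chunks.append(current)
                current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tagged = [("ካሳ", "B-S"), ("ያመጣው", "I-S"), ("ትንሽ", "B-NP"), ("ልጅ", "I-NP"),
          ("እንደ", "B-PP"), ("አባቱ", "I-PP"), ("በጣም", "B-VP"), ("ታመመ", "I-VP")]
print(chunks_from_iob2(tagged))
# [('S', 'ካሳ ያመጣው'), ('NP', 'ትንሽ ልጅ'), ('PP', 'እንደ አባቱ'), ('VP', 'በጣም ታመመ')]
```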
To implement the chunker component, we use a hidden Markov model (HMM) enhanced by a set of rules used to prune errors. In the training phase, the system first accepts words with POS tags and chunk tags. Then, the HMM is trained with a training set built from sentences whose words are tagged with parts of speech and chunks. Likewise, in the test phase, the system accepts words with POS tags and outputs an appropriate chunk tag sequence against each POS tag sequence using the HMM model. We use a POS-tagged sentence as input, from which we observe the sequence of POS tags, and we hypothesize that the corresponding sequence of chunk tags has hidden Markovian properties. Thus, we used an HMM in which chunk tags serve as hidden states and POS tags as observations. The model is trained with sequences of POS tags and chunk tags extracted from the training corpus, and is then used to predict the sequence of chunk tags for a given sequence of POS tags by making use of the Viterbi algorithm. The output of the decoder is the sequence of chunk tags, which groups words based on syntactic correlations. The output chunk sequence is then analyzed to improve the result by applying linguistic rules derived from the grammar
of Amharic. For a given Amharic word w, linguistic rules (sample rules are shown in Algorithm 1) were used to correct wrongly chunked words ("w-1" and "w+1" denote the previous and next words, respectively).
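To make the decoding step concrete, the following is a minimal Viterbi sketch under this setup, with chunk tags as hidden states and POS tags as observations. The probability tables are invented for illustration; in practice they would be estimated from counts in the training corpus:

```python
# Minimal Viterbi decoding sketch: chunk tags are hidden states, POS tags
# are observations. All probabilities below are toy values for illustration.

states = ["B-NP", "I-NP", "B-VP"]
start  = {"B-NP": 0.7, "I-NP": 0.0, "B-VP": 0.3}
trans  = {"B-NP": {"B-NP": 0.1, "I-NP": 0.6, "B-VP": 0.3},
          "I-NP": {"B-NP": 0.2, "I-NP": 0.4, "B-VP": 0.4},
          "B-VP": {"B-NP": 0.5, "I-NP": 0.0, "B-VP": 0.5}}
emit   = {"B-NP": {"ADJ": 0.5, "N": 0.5, "V": 0.0},
          "I-NP": {"ADJ": 0.2, "N": 0.8, "V": 0.0},
          "B-VP": {"ADJ": 0.1, "N": 0.1, "V": 0.8}}

def viterbi(obs):
    """Return the most likely chunk tag sequence for a POS tag sequence."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # best predecessor state for s at this position
            prev, p = max(((q, V[-1][q] * trans[q][s]) for q in states),
                          key=lambda x: x[1])
            col[s], ptr[s] = p * emit[s][o], prev
        V.append(col)
        back.append(ptr)
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for ptr in reversed(back):         # follow back-pointers to recover path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["ADJ", "N", "V"]))      # ['B-NP', 'I-NP', 'B-VP']
```

The rule-based pruning step described above would then post-process such decoded sequences.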
4 Sentence Parser
In this work, a bottom-up approach is employed for sentence parsing: the output of the chunker is used as input, and head words are recursively removed to make new phrases until individual words are reached. The parse tree is constructed while head words are recursively removed and new phrases are formed. When no new phrases are obtained during the recursive process, parsing is complete. The algorithm used for parsing is given in Algorithm 2.
[Figure 1: Architecture of the system. The chunker trains an HMM model from word, POS tag and chunk tag sequences and applies it during testing; the parser then repeatedly replaces base phrases with their heads, checking whether a new base phrase has been formed and whether a base phrase is a PP or S.]
The Amharic base phrase chunker was integrated in the parser. The overall archi-
tecture of the parser including the chunker is shown in Figure 1.
The following example shows how parsing is performed using the proposed algorithm for a given POS tagged sentence: "ወንበዴዎች N በጎፈቃደኞች NPREP የገነቡትን VREL ድርጅት N ከጥቅም NPREP ውጭ PREP አደረጉት V".
Step 3: ['ወንበዴዎች N', ('በጎፈቃደኞች NPREP የገነቡትን VREL ድርጅት N', 'NP'), ('ከጥቅም NPREP ውጭ PREP አደረጉት V', 'VP')]

The successive states of the parse, from the POS tagged sentence to the final tree, are:

ወንበዴዎች N በጎፈቃደኞች NPREP የገነቡትን VREL ድርጅት N ከጥቅም NPREP ውጭ PREP አደረጉት V
(('በጎፈቃደኞች NPREP የገነቡትን VREL', 'S') ድርጅት N, 'NP') (('ከጥቅም NPREP ውጭ PREP', 'PP') አደረጉት V, 'VP')
((('በጎፈቃደኞች NPREP የገነቡትን VREL', 'S') ድርጅት N, 'NP') (('ከጥቅም NPREP ውጭ PREP', 'PP') አደረጉት V, 'VP'), 'VP')
(('ወንበዴዎች N', 'NP') ((('በጎፈቃደኞች NPREP የገነቡትን VREL', 'S') ድርጅት N, 'NP') (('ከጥቅም NPREP ውጭ PREP', 'PP') አደረጉት V, 'VP'), 'VP'), 'S')
5 Experiment
The major source of the dataset we used for training and testing the system was Walta
Information Center (WIC) news corpus which is at present widely used for research
on Amharic natural language processing. The corpus contains 8067 sentences where
words are annotated with POS tags. Furthermore, we also collected additional text
from an Amharic grammar book authored by Yimam [19]. The sentences in the cor-
pus are classified as training data set and testing data set using 10 fold cross validation
technique.
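The 10-fold splitting procedure can be sketched as follows (a generic illustration of the technique, not the authors' code):

```python
# Generic 10-fold cross-validation split: each fold holds out every k-th
# sentence for testing and trains on the remainder.

def ten_fold(sentences, k=10):
    """Yield (train, test) splits over the sentence list."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test

# With the 8067 WIC sentences, each fold tests on roughly a tenth of the data:
folds = list(ten_fold(list(range(8067))))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

The reported scores would then be averaged over the ten folds.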
6 Conclusion
References
1. Abney, S.: Parsing by chunks. In: Berwick, R., Abney, S., Tenny, C. (eds.) Principle-
Based Parsing. Kluwer Academic Publishers (1991)
2. Abney, S.: Chunks and Dependencies: Bringing Processing Evidence to Bear on Syntax.
In: Computational Linguistics and the Foundations of Linguistic Theory. CSLI (1995)
3. Amare, G.: ዘመናዊ የአማርኛ ሰዋስው በቀላል አቀራረብ (Modern Amharic Grammar in a Sim-
ple Approach), Addis Ababa, Ethiopia (2010)
4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media Inc., Sebastopol (2009)
5. Earley, J.: An efficient context-free parsing algorithm. Communications of the
ACM 13(2), 94–102 (1970)
6. Grover, C., Tobin, R.: Rule-based chunking and reusability. In: Proceedings of the Fifth
International Conference on Language Resources and Evaluation, LREC 2006 (2006)
7. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn. Prentice-Hall (2009)
8. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages,
And Computation, ch. 7, pp. 228–302. Addison-Wesley (2001)
9. Ibrahim, A., Assabie, Y.: Hierarchical Amharic Base Phrase Chunking Using HMM With
Error Pruning. In: Proceedings of the 6th Conference on Language and Technology,
Poznan, Poland, pp. 328–332 (2013)
10. Kutlu, M.: Noun phrase chunker for Turkish using dependency parser. Doctoral dissertation.
Bilkent University (2010)
11. Lewis, M.P., Simons, G.F., Fennig, C.D.: Ethnologue: Languages of the World, 17th edn. SIL International, Dallas (2013)
12. Li, S.J.: Chunk parsing with maximum entropy principle. Chinese Journal of Computers:
Chinese Edition 26(12), 1722–1727 (2003)
13. Manning, C., Schuetze, H.: Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge (1999)
14. Molina, A., Pla, F.: Shallow parsing using specialized HMMs. The Journal of Machine
Learning Research 2, 595–613 (2002)
15. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Proceedings of the Third ACL Workshop on Very Large Corpora, pp. 82–94 (1995)
16. Thao, H., Thai, P., Minh, N., Thuy, Q.: Vietnamese noun phrase chunking based on condi-
tional random fields. In: International Conference on Knowledge and Systems Engineering
(KSE 2009), pp. 172–178. IEEE (2009)
17. Tjong Kim Sang, E.F., Buchholz, S.: Introduction to the CoNLL-2000 shared task: Chunking. In: Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 127–132 (2000)
18. Xu, F., Zong, C., Zhao, J.: A Hybrid Approach to Chinese Base Noun Phrase Chunking.
In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney
(2006)
19. Yimam, B.: የአማርኛ ሰዋስው (Amharic Grammar), Addis Ababa, Ethiopia (2000)