0% found this document useful (0 votes)
12 views

Lecture03-03

Uploaded by

mengesha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views

Lecture03-03

Uploaded by

mengesha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Development of Amharic Grammar Checker Using Morphological

Features of Words and N-Gram Based Probabilistic Methods

Aynadis Temesgen Yaregal Assabie


Department of Computer Science Department of Computer Science
Addis Ababa University Addis Ababa University
[email protected] [email protected]

implement grammar checkers are rule-based, sta-


Abstract tistical and hybrid (Tsuruga and Aizu, 2011; Ehsan
and Faili, 2013; Xing et al, 2013). Rule-based sys-
Amharic is one of the most morphologically tems check grammars based on a set of manually
complex and under-resourced languages which developed rules which are used to match against
effectively hinder the development of efficient the text. However, it is very difficult to understand
natural language processing applications. Am- and include all grammatical rules of languages,
haric words, especially verbs, are marked for a
especially for complex sentences. On the other
combination of several grammatical functions
which makes grammar checking complex. This hand, in statistical grammar checking, part-of-
paper describes the design and development of speech (POS)-annotated corpus is used to automat-
statistical grammar checker for Amharic by ically build the grammatical rules by identifying
treating its morphological features. In a given the patterns of POS tag sequences in which case
Amharic sentence, the morphologies of individ- common sequences that occur often can be consi-
ual words making up the sentence are analyzed dered correct and the uncommon ones are reported
and then n-gram based probabilistic methods are to be incorrect. This has lead statistical approaches
used to check grammatical errors in the sen- to become popular methods to build efficient
tence. The system is tested with a test corpus and grammar checkers. However, it is very difficult to
experimental results are reported.
understand error messages suggested by such
checking system as there is no specific error mes-
1 Introduction sage. Hybrid grammar checking is then introduced
to benefit from the synergy effect of both ap-
With the rise of electronic documents, the need of proaches (Xing et al, 2013). A number of grammar
natural language processing (NLP) applications checkers have been developed so far for many lan-
that automatically process texts has drastically in- guages around the world. Among the most notable
creased. One of such important NLP applications is grammar checkers are those developed over the
grammar checker which automatically checks past few years for resourceful languages such as
grammatical errors in texts and also possibly sug- English (Richardson, 1997; Naber, 2003), Swedish
gests the user to choose among other alternatives. (Arppe, 2000; Domeij et al, 2000), German
Initially, most of the grammar checkers were based Schmidt-Wigger, (1998), and Arabic (Shaalan,
on checking styles, uncommon words and sentence 2005), etc. However, to our best knowledge, there
structures, but now they are upgraded to high ca- is no commercial Amharic grammar checker or
pacity with the capability of analyzing complex published article that presents grammar checking
sentence structures, not only as a part of other pro- for Amharic.
grams but also as easy software to be installed in This paper presents statistical-based Amharic
many operating system (Richardson, 1997; Liddy, grammar checker developed by treating the mor-
2001; Mudge, 2010; Mozgovoy, 2011). Various phological features of the language. The organiza-
techniques and methods have been proposed so far tion of the remaining part of the paper is as fol-
to build systems that could check the grammars of lows. Section 2 discusses an overview of the
texts. Among the most widely used approaches to grammatical structure of Amharic. Section 3

106
Aynadis Temesgen and Yaregal Assabie (2013). “Development of Amharic Grammar Checker Using Morphological Features
of Words and N-Gram Based Probabilistic Methods”, In Proceedings of the The 13th International Conference on Parsing
Technologies (IWPT2013), pp. 106-112, Nara, Japan.
presents the statistical methods applied to develop ወሰደ (wäsädä/he took)
the system. Experimental results are presented in ወሰደች (wäsädäč/she broke)
Section 4. In Section 5, we present our conclusion ወሰድኩ (wäsädku/I broke)
and recommendation for future works. A list of ወሰድኩት (alwäsädkutĭm/I took [it/him])
references is provided at the end. አልወሰድኩም (alwäsädkum/I didn’t take)
አልወሰደችም (alwäsädäčĭm/she didn’t take)
2 Grammatical Structure of Amharic አልወሰደም (alwäsädäm/he didn’t take)
2.1 Amharic Morphology አልወሰደኝም (alwäsädäňĭm/he didn’t take me)
አስወሰደ (aswäsädä/he let [someone] to take)
Amharic is the working language of Ethiopia ተወሰደ (täwäsädä/[it/he] was taken)
having a population of over 90 million at present. ስለተወሰደ (sĭlätäwäsädä/as [it/he] was taken)
Even though many languages are spoken in ከተወሰደ (kätäwäsädä/ if [it/he] is taken)
Ethiopia, Amharic is the dominant language that is እስኪወሰድ (ĭskiwäsäd/until [it/he] is taken)
spoken as a mother tongue by a large segment of ሲወሰድ (sĭwäsäd/when [it/he] is taken)
the population and it is the most commonly learned ..
.
second language throughout the country (Lewis et etc.
al, 2013). Amharic is written using its own script
which has 33 consonants (base characters) out of Amharic verbs are marked for any combination
which six other characters representing of person, gender, number, case, tense/aspect, and
mood resulting in thousands of words from a
combinations of vowels and consonants are single verbal root. As a result, a single word may
derived for each character. represent a complete sentence cosutructed with
Amharic is one of the most morphologically subject, verb and object. For example, ይወስደኛል
complex languages. Amharic nouns and adjectives (yĭwäsdäňal/[he/it] will take me) is a sentence
are marked for any combination of number, where the verbal stem ወስድ (wäsd/ will take) is
definiteness, gender and case. Morover, they are marked for various grammatical functions as
affixed with prepositions. For example, from the shown in Figure 1.
noun ተማሪ (tämari/student), the following words
are generated through inflection and affixation: ይ ወ ስ ደ ኛ ል
ተማሪዎች (tämariwoč/students), ተማሪው (tämariw/
the student {masculine}/his student), ተማሪየ yĭ wä s dä ňa l
(tämariyän/my student), ተማሪየን (tämariyän/my
student {objective case}), ተማሪሽ (tämariš/your
{feminine} student), ለተማሪ (lätämari/for student), The verbal stem ወስድ -ኧኛ- (-äňa-)
ከተማሪ (kätämari/ from student), etc. Similarly, we (wäsd/ will take) Marker for the
can generate the following words from the objective case “me”
adjective ፈጣን (fäţan/fast): ፈጣኑ (fäţanu/fast,
{definite} {masculine} { singular}), ፈጣኖች ይ....ል/yĭ....l
(fäţanoč/fast {plural}), ፈጣኖቹ (fäţanoču/fast Marker for the subject “he/it”
{definite} {plural}), etc. Figure 1. Morphology of the word ይወስደኛል.
Amharic verb inflections and derivations are
even more complex than those of nouns and 2.2 Grammatical Rules of Amharic
adjectives. Several verbs in surface forms are Common for most languages, if not for all, gram-
derived from a single verbal stem, and several mar checking starts with checking the validity of
stems in turn are derived from a single verbal root. the sequence of words in the sentence. This is also
For example, from the verbal root ውስድ (wsd/to true for Amharic. In addition, since Amharic is
take), we can derive verbal stems such as wäsd, morphologically complex language where verbs,
wäsäd, wasd, wäsasd, täwäsasd, etc. From each of nouns and adjectives are marked for various
these verbal stems we can derive many verbs in grammatical functions, the following agreements
their surface forms. For example, from the stem are required to be checked: adjective-noun, adjec-
wäsäd the following verbs can be derived:

107
tive-verb, subject-verb, object-verb, and adverb- are computed from the linguistic properties of
verb (Yimam, 2000; Amare, 2010). words in a sentence.
Word Sequence: Amharic language follows
subject-object-verb (SOV) grammatical pattern as 3.1 Representation of the Morphological
opposed to, for example, English language which Properties of Words
has SVO sequence of words. For instance, the To check grammatical errors in an Amharic sen-
Amharic equivalent of sentence “John ate bread” is tence, the morphological properties of words is
written as “ጆን (jon/John) ዳቦ (dabo/bread) በላ required. The morphological property of Amharic
(bäla/ate)”. Here, the part-of-speech (POS) tags of words contains linguistic information such as
individual words are used as inputs to check the number, gender, person, etc. Such linguistic infor-
validity of grammatical patterns. mation is used to check whether the linguistic
Adjective-Noun Agreement: Amharic nouns properties of one word agree with that of the other
are required to agree for number of modifying ad- words in the sentence. For this task, we used an
jectives. For example, ረጃጅም ልጆች (räjajĭm lĭjoč/ Amharic morphological analyzer known as
tall {plural} children) is a valid noun phrase whe- HornMorpho developed by Gasser (2011). After
reas ረጃጅም ልጅ (räjajĭm lĭj/ tall {plural} child) is performing morphological analysis for a given
an invalid noun phrase construction. word, the morphological property of the word is
Subject-Verb Agreement: Amharic verbs are stored along with its POS tag using a structure with
marked for number, person and gender of subjects. four slots as shown in Figure 2.
For example, ልጆቹ መስኮት ሰበሩ (lĭjoču mäskot
säbäru/the children broke a window) is a valid
Word < POS Number Person Gender >
Amharic sentence. However, ልጅቷ መስኮት ሰበረ
(lĭjtwa mäskot säbärä/the girl broke {masculine} a
window) is not a valid Amharic sentence since the Slot1 Slot2 Slot3 Slot4
subject ልጅቷ (lĭjtwa/the girl) is feminine and the Figure 2: A structure for representing the linguistic
verb ሰበረ (säbärä /broke {masculine}) is marked properties of words
for masculine.
Object-Verb Agreement: Amharic verbs are Slot1: This slot contains information about the
also marked for number, person and gender of ob- POS tag of the word. The corpus we used in this
jective cases. For example, in the sentence ልጆቹ work contains 31 types of POS tags, and the value
መስኮቶቹን ሰበሯቸው (lĭjoču mäskotočun säbärwa- for this slot is retrieved from the corpus. In addi-
čäw/the children broke {plural} the windows), the tion to checking the correct POS tag sequence in a
verb ሰበሯቸው (säbäru/broke {plural}) is marked for sentence, this slot is required to check agreements
the plural property of the object መስኮቶቹን in number, person and gender as well.
(mäskotočun/the windows). Slot2: This slot holds number information about
Adverb-Verb Agreement: Tenses of verbs are the word, i.e. whether the word is plural (P), singu-
required to agree with time adverbs. For example, lar (S), or unknown (U). In the case of nouns and
ትላንት ሰበሩ (tĭlant säbäru/ [they] broke yesterday) adjectives, it has three values: P, S, or U. Since
is a valid verb phrase construction whereas ትላንት Amharic verbs are marked for numbers of subject
and object, the value for this slot are combinations
ይሰብራሉ (tĭlant yĭsäbralu/ [they] will break yester-
of the aforementioned values for subject and objec-
day) is an invalid construction.
tive cases. We use the symbol “^” to represent
3 The Proposed Grammar Checker such combinations. For example, a verb marked
for plural subject and singular object is represented
The proposed grammar checker for Amharic text as SP^OS; a verb marked for singular subject and
passes through three phases: singular object is represented as SS^OS; etc.
• Checking word sequences; Slot3: This slot stores person information about
• Checking adjective-noun-verb agreements; the word, i.e. first person (P1), second person (P2),
• Checking adverb-verb agreement. third person (P3), or unknown (PU). The slot has
In the first two phases, we employ the n-gram four different possible values for the nouns and
based statistical method. The n-gram probabilities adjectives: P1, P2, P3 and PU. However, verbs can

108
have a combination of these four values for subject tracted. For each unique sequence of POS tags, the
and object grammatical functions. Examples of slot probability of the occurrence of the sequence is
values for verbs are the following. computed using n-gram models. The n-gram prob-
SP1^OP1: verb marked for first person subject abilities of POS tag sequences stored in the perma-
and first person object nent repository are accessed to check grammatical
SP2^OP1: verb marked for second person sub- errors in a given sentence. The probability pst of
ject and first person object the correctness of the POS tag sequence of words
SP3^OP1: verb marked for third person subject in a given sentence construction is computed as:
and first person object n
SP2^OP3: verb marked for second person sub- p st = ∏i =1 pti (2)
ject and third person object
..
. where n is the number of POS tags extracted in the
etc. sentence. Sentence with higher values of pst are
considered to be having a valid sequence of words
Slot4: This slot holds gender information about whereas those with low values are regarded as hav-
the word, i.e. whether the word is masculine (M), ing unlikely sequence of words. Finally, the deci-
feminine (F), or unknown (U). In the case of nouns sion is made based on a threshold value set by em-
and adjectives, it has three values: M, F, or U. The pirical method. The training process is illustrated
values of this slot for verbs are are combinations of in Figure 3.
the aforementioned values for subject and objec-
tive cases. Accordingly, a verb marked for mascu-
line subject and feminine object is represented as Training corpus n
SM^OF; a verb marked for feminine subject and
masculine object is represented as SF^OM; etc. Sentence list Sentence Splitter
For example, the linguistic information built for
the noun ፕሬዚዳንቱ (prezidantu/the president {mas- Request for another sentence
culine}) is: ፕሬዚዳንቱ <N|S|P3|M>. Likewise, the (until all are fetched)
Temporary
linguistic information for the verb ደረሰችበት Sentence
(däräsäčĭbät/she reached at him) is: ደረሰችበት Sentence Sequence Extractor
Repository
<V|SS^OS|SP3^OP3|SF^OM>. Accordingly, the
linguistic information about each word in the entire Sequence list
Permanent
corpus is automatically constructed so as to use it Sequence
for training and testing. Repository
Temporary Sequence
3.2 Word Sequences Sequence
Repository Sequence
To check the validity of POS tag sequence for a
given sentence, we use n-gram probability pt com- Sequence exists in perma-
puted as: nent repository?
Yes
count ( w1 w2 ...wn −1 wn ) Unique sequence
p t ( wn | w1 w2 ...wn −1 ) = (1) Request for
count ( w1 w2 ...wn −1 ) No with probability
another sentence
(until all are fetched)
where n is the number of words in a sequence and
Sequence Probability Calculator
w is POS tags of words. We have calculated n-
gram values for n=2 (bigram) and n=3 (trigram) Figure 3: A flowchart of the training process for check-
where they are saved in repository and used in ing sequences of words.
grammar checking process. The probabilities of
sequence occurrences are determined from the cor- 3.3 Adjective-Noun-Verb Agreements
pus, which is used to train the system. The training The agreements between words serving various
process starts by accepting the training corpus and grammatical functions in Amharic sentence are
the n-value as inputs. For each sentence in the cor- also checked using n-gram approach. Number, per-
pus, the sequences of POS tags of words are ex-

109
son and gender agreements are checked at this have correct agreement. Otherwise, it is reported as
phase. We perform this task by analyzing the four grammatically incorrect sentence.
slots representing linguistic information about
words as discussed in Section 3.1. Since the values 4 Experiment
for number, person, and gender depends on the
4.1 The Corpus
word class, the POS tag information is required.
Thus, for each word in the corpus, we extract such We used Walta Information Center (WIC) news
information as <slot1,slot2>, <slot1|slot3> and corpus which contains 8067 sentences where
<slot1|slot4> where slot1, slot2, slot3 and slot4 words are annotated with POS tags. We used 7964
represent POS tag, number, person and gender in- sentences for training and the remaining for test-
formation, respectively. Given the POS tag w of a ing. In addition, to test the performance of the sys-
word, the sequence probability pa of adjective- tem with grammatically wrong sentences, we also
noun-verb agreement for a slot is computed as: used manually prepared sentences which are
grammatically incorrect.
count ( w1v1...wn vn ) (3)
pa ( wn vn | w1v1...wn −1vn −1 ) = 4.2 Test Results
count( w1v1...wn −1vn −1 )
In order to test the performance of the grammar
where v is the value of the slot. The n-gram proba- checker, we are required to compute the number of
bility values for each unique pattern was computed actual errors in the test set, number of errors re-
and stored in a permanent repository which would ported by the system and the number of false posi-
be later accessed to adjective-noun-verb agree- tives generated by the system. These numbers were
ments in a given sentence. The probability psa of then used to calculate the precision and recall of
the correctness of the adjective-noun-verb agree- the system as follows.
ments in a given sentence is then computed as:
number of correctly flagged errors
n precision = *100% (5)
p sa = ∏i =1 p ai (4) total number of flag ged errors

number of correctly flagged errors (6)


recall = *100%
3.4 Adverb-Verb Agreement total number of grammatical errors

Amharic adverbs usually come before the verb Accordingly, we tested the system with simple
they modify. When adverb appears in the sentence and complex sentences where we obtained experi-
it usually modifies the next verb that comes after it. mental results as shown in Table1.
There could be a number of other words in be-
tween the adverb and the verb, but the modified Type of n-gram Precision Recall
verb appears next to the modifier before any other Sentence model
verb in the sentence. As Amharic adverbs are few Simple Bigram 59.72% 82.69%
in number, adverb-verb agreement was not Trigram 67.14% 90.38%
checked in the previous phases. To check time
agreement between the adverb and the verb, the Complex Bigram 57.82% 65.38%
tense for the verb that the adverb modifies should
Trigram 63.76% 67.69%
be identified. In this work, we considered four dif-
ferent types of tenses: perfective, imperfective, Table 1: Experimental results.
jussive/ imperative and gerundive. The pattern of
time adverbs associated with each tense type was Experimental results were also analyzed to eva-
extracted from the corpus and stored in repository. luate the performance of the system with regard to
Whenever these time adverbs are found in the sen- identifying various types of grammatical errors.
tence to be checked, the tense type of the next verb The detection rate of the various grammatical error
is extracted by using morphological analysis. If the types is shown in Table 2.
tense type extracted from the given sentence
matches with an adverb-tense pattern in the reposi-
tory, the adverb and the verb are considered to

110
greatly enhanced by using a more effective Amhar-
Error type Detection rate (%)
ic morphological analyzer.
Incorrect word order 73 Test results have shown that trigram models per-
form better than bigram models. In Amharic, head
Number disagreement 80 words in verb phrases, noun phrases, adjective
Person disagreement 52 phrases are located at the end of the phrases (Yi-
Gender disagreement 60 mam, 2000). This means that, for verb phrases, the
nouns and adjectives for which verbs are marked
Adjective-noun disagreement 55 come immediately before the head word (which is
Adverb-verb disagreement 90 a verb). Likewise, sequences of adjectives modify-
ing nouns in noun phrases come immediately be-
Table 2: Detection rate by error types.
fore the head word (which is a noun). Thus, se-
quences of multiple words in phrases are better
4.3 Discussion
captured by trigrams than bigrams. We have also
A complete system that checks Amharic gram- seen that grammatical errors in simple sentences
matical errors is developed. To train and test our are detected more accurately than in complex sen-
system, we used WIC corpus which is manually tences. The reason is that complex sentences have
annotated with POS tags. However, we have ob- complex phrasal structures which could not be di-
served that a number of words are tagged with rectly treated by trigram and bigram models.
wrong POS and many of them are also misspelled. However, the performance of the system can be
Since Amharic is one of the less-resourced lan- improved by using a parser that generates phrasal
guages, to our best knowledge, there is no tool that structures hierarchically at different levels. We can
checks and corrects the spelling of Amharic words. then systematically check grammatical errors at
Although attempts have been made to correct some various levels in line with the parse result.
of the erroneously tagged words in the corpus,
were were unable to manually correct all wrongly 5 Conclusion and Future Works
tagged words. POS tag errors cause the wrong tag
Amharic is one of the most morphologically com-
patterns to be interpreted as correct ones during the
plex languages. Furthermore, it is considered to be
training process which would ultimately affect the
less-resourced language. Despite its importance,
performance of the system. Thus, the performance
these circumstances lead to unavailability of effi-
of the system can be maximized if the system is
cient NLP tools that automatically process Amhar-
trained with error-free corpus. Moreover, since the
ic texts at present. This work is aimed at contribut-
corpus is collected from news items, most of the
ing to the ever-increasing need of developing Am-
sentences contain words which refer to third per-
haric NLP tools. Accordingly, the development of
son. For this reason, occurrence of first and second
Amharic grammar checker using morphological
person in the corpus is very small. This has af-
features and n-gram probabilities is presented. In
fected the system while checking person disagree-
this work, we have systematically treated the mor-
ment. This is evidenced by the low accuracy ob-
phological features of the language where we
tained while the system detects number disagree-
represented grammar dependency rules extracted
ment (see Table 2).
from the morphological structures of words.
To our best knowledge, HornMorpho is the only
However, lack of error-free corpus and effective
tool at present publicly available to morphological-
morphological analyzer are observed to be affect-
ly analyze Amharic words. However, the tool ana-
ing the performance of the developed grammar
lyses only some specific types of verbs and nouns.
checker. Thus, future works are recommended to
Adjectives analyzed as nouns and adverbs are not
be directed at improving linguistic resources and
analyzed at all. Since Amharic is morphologically
developing effective NLP tools such morphologi-
very complex language where combinations of var-
cal analyzer, parser, spell checker, etc. for Amhar-
ious linguistic information are encoded in a single
ic. The efficiency of these components is crucial
word, the effectiveness of grammar checking is
not only for Amharic grammar checking but also
hugely compromised if words are not properly ana-
for many Amharic NLP applications.
lyzed. Thus, the performance of the system can be

111
References Conference on Computer Science and Information
Systems (FedCSIS) Sept. 18-21, 2011, pp. 209-212,
Amare, G. (2010). ዘመናዊ የአማርኛ ሰዋስው በቀላል አቀራረብ Szczecin, Poland.
(Modern Amharic Grammar in a Simple Approach).
Addis Ababa, Ethiopia. Mudge, R. (2010). “The design of a proofreading soft-
ware service”; In Proceedings of the NAACL HLT
Arppe, A. (2000). “Developing a Grammar Checker for 2010 Workshop on Computational Linguistics and
Swedish”; In Proceeding of the 12th Nordic confe- Writing: Writing Processes and Authoring Aids. pp
rence on computational linguistic. pp 9 – 27. 24-32, Stroudsburg, PA, USA.
Domeij, R., Knutsson, O., Carlberger, J. and Kann, V. Naber, D. (2003). “A Rule-Based Style and Grammar
(2000). “ Granska: An efficient hybrid system for Checker”; PhD Thesis, Bielefeld University, Germa-
Swedish grammar checking”; In Proceedings of the ny.
12th Nordic conference on computational linguistics,
Richardson, S. (1997). “Microsoft Natuaral language
Nodalida- 99.
Understanding System and Grammar checker”; Mi-
Ehsan, N. and Faili, H. (2013), “Grammatical and con- crosoft, USA,.
text-sensitive error correction using a statistical ma-
Schmidt-Wigger, A. (1998). “Grammar And Style
chine translation framework”; Softw: Pract. Exper.,
Checking in German”; In Proceedings of CLAW.
43: 187–206. doi: 10.1002/spe.2110.
Vol 98.
Gasser, M. (2011). “HornMorpho: a system for morpho-
Shaalan, K. (2005). “Arabic Gramcheck: A Grammar
logical analysis and generation of Amharic, Oromo,
Checker for Arabic”; The British University in Du-
and Tigrinya words”’ In Proceedings of the Confe-
bai, United Arab Emirates.
rence on Human Language Technology for Devel-
opment. Tsuruga, I. and Aizu W. (2011). “Dependency-Based
Rules for Grammar Checking with LanguageTool”.
Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig
Maxim Mozgovoy. University of Aizu. IEEE. Japan,
(2013); Ethnologue: Languages of the World, Seven-
2011.
teenth edition. Dallas, Texas: SIL International.
Xing, J., Wang, L., Wong, D. F., Chao, S., and Zeng, X.
Liddy, E. D. (2001) “Natural language processing”, In
(2013). “UM-Checker: A Hybrid System for English
Encyclopedia of Library and Information Science, 2nd
Grammatical Error Correction”; In Proceedings of
Ed. Marcel Decker, Inc.
CoNLL-2013, vol. 34.
Mozgovoy, M. (2011). “Dependency-based rules for
Yimam, B. (2000). የአማርኛ ሰዋስው (Amharic Grammar).
grammar checking with LanguageTool”, Federated
Addis Ababa, Ethiopia.

112

You might also like