A Rule Based Kannada Morphological Analyzer and Generator Using Finite State Transducer
International Journal of Computer Applications (0975 – 8887)
Volume 27– No.10, August 2011
Table 2. Input/Output examples for morphological generator

Input | Output
(Ane+gaLu) | (AnegaLu)
(hOgu+utt+Ene) | (hOguttEne)

2. LITERATURE SURVEY
Several approaches have been attempted for developing morphological analyzers. In 1983 Kimmo Koskenniemi developed the two-level morphology approach and tested the formalism on Finnish [2]. In this two-level representation, the surface level describes word forms as they occur in written text, and the lexical level encodes lexical units such as stems and suffixes. In 1984 the same formalism was extended to other languages such as Arabic, Dutch, English, French, German, Italian, Japanese, Portuguese, Swedish and Turkish, and morphological analyzers were developed successfully. Around the same time, a rule based heuristic analyzer for Finnish nominal and verb forms was developed by Jappinen [3]. In 1996, Beesley developed an Arabic finite state transducer for MA using the Xerox finite state transducer (XFST), extensively reworking the lexicon and rules in the Kimmo style [4]. In 2000, Agirve introduced a word-grammar based morphological analyzer using the two-level and a unification-based formalism for Basque, a highly agglutinative language [5]. Similarly, using XFST, Karine built a Persian MA in 2004 and Wintner developed a morphological analyzer for Hebrew in 2005 [6, 7]. Oflazer Kamel developed a Finite State Machine (FSM) based Turkish morphological analyzer. In 2008, Robert Elwell identified the features present in a Swahili (Kiswahili) word using syllables and surface level clues.

Among Indian languages, the AU-KBC Research Centre of Anna University developed a finite state automata based morphological analyzer for Tamil [8]. Dr. Shailly Goyal and Dr. Niladri Chatterjee of the Indian Institute of Technology Delhi worked on Hindi noun phrase morphology for developing a link grammar based parser [9]. Mrs. Rita Mathu, Dr. Madhavi Sinha and Prof. Rekha Govil also worked on Hindi morphology. Many attempts have been made on Bengali and Marathi morphology. For Bengali, an unsupervised methodology was used to develop a morphological analyzer, and a two-level morphology approach was used by Sajib Dasgupta in 2007 to handle Bengali compound words [10]. Manish Shrivastav, Nitin Agrawal, Bibhuti Mahapatra, Smriti Singh and Pushpak Bhattacharyya worked on morphology based natural language processing tools for Indian languages. Morphological analyzers and generators for Telugu, Tamil and Kannada were developed by the University of Hyderabad [2]. Rule based morphological analyzers have been developed for Sanskrit and Oriya by Girish Nath Jha and Mohanty respectively.

Our literature survey on Kannada natural language processing found the following developments. A Kannada indexing software prototype was developed by Settar in 2002 [11]. A Kannada WordNet was attempted by Sahoo and Vidyasagar of Indian Institute of Technology, Bangalore, in 2003 [12]. T. N. Vikram and Shalini R Urs developed a prototype morphological analyzer for Kannada (2007) based on a Finite State Machine [13]. This is only a prototype: it does not handle compound formation morphology and can handle at most 500 distinct nouns and verbs. "A Paradigm based Morphological Analyzer for Kannada Language Using Machine Learning Approach" was developed by Antony P J and Dr Soman K P of Amrita Vishwa Vidyapeetham in 2010 [14]. This is a morphological analyzer for Kannada verbs that can also handle compound verb morphology. Uma Maheshwar Rao G and Parameshwari K of CALTS, University of Hyderabad, attempted to develop morphological analyzers and generators for South Dravidian languages in 2010 [15]. A network and process model for Kannada morphological analysis/generation was developed by K. Narayana Murthy; the performance of the system is 60 to 70% on general texts [16]. Recently (Jan 2011) Shambhavi B. R and Dr. Ramakanth Kumar of RV College, Bangalore, developed a paradigm based morphological generator and analyzer using a trie based data structure [17]. The disadvantage of a trie is that it consumes more memory, as each node can have at most 'y' children, where y is the alphabet count of the language. As a result it can handle at most 3700 root words and around 88K inflected words.

3. CHALLENGES IN KANNADA MORPHOLOGY
Kannada is a verb-final inflectional language with a relatively free word order. Kannada morphology is characterized as agglutinative or concatenative, i.e., words are formed by adding suffixes to the root word in a series. Most words may change spelling when stems are inflected, and a root word is normally affixed with several morphemes to generate thousands of word forms. The complexity of developing a MAG for a Dravidian language like Kannada is comparatively higher than for languages like English. To build an effective morphological analyzer one should carefully analyze and identify all these roots and morphemes.

Due to the highly agglutinating nature of the Kannada language and the morphophonemic variations that take place at the point of agglutination, it is very difficult to mark word boundaries [14]. The design should, as far as possible, cover all types of inflections. For example, the different meaningful parts of the word 'OdikoMDiddavana' -> 'the one (masculine) who was reading' are:

Odu + i + koLLu + MD + u + iru + dd + a + avanu + a
Root + VBP + AUXV + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC

3.1 Types and Features of Kannada Words
In general, there are three types of Kannada words: i) namapada (declinable words or nouns), ii) kriyapada (conjugable words or verbs) and iii) avyaya (uninflected words). Nouns, pronouns and adjectives belong to the declinable words and are inflected for case, number and gender. Conjugable words are inflected to mark differences of person, gender, number, aspect, mood and tense. All Kannada words are of three genders: masculine, feminine and neuter. Declinable and conjugable words have two numbers: singular and plural. The singular has no particular distinguishing marker added. The plural marker is usually "gaLu", but there are some exceptions as
follows: Masculine nouns (e.g., huDuga) ending in "a" and some feminine nouns (e.g., hemgasu) ending in "u" form the plural with "aru". Feminine nouns ending in "i" (e.g., huDugi) or "e" (e.g., atte) form the plural with "yaru". For kinship terms (e.g., aNNa), the plural marker is often "aMdiru". Some nouns have irregular plurals, such as "makkaLu", which is the plural of "magu".

Table 3. Dative case characteristic suffixes for nouns

Noun type | Ends with | Dative suffix | Example noun | Dative form
Neuter noun | ಅ (a) | (kke) | ಮರ (mara) | (marakke)
Neuter noun | ಎ, ಇ, ಉ (e, i, u) | (ge) | (mane) | (manege)
Neuter noun | consonants | (ige) | (Uru) | (Urige)
Neuter determinative | - | (akke) | (idu) | (idakke)
Rational noun | - | (nige) | (aNNa) | (aNNanige)

Table 4 below shows the different cases and their corresponding characteristic suffixes for nouns.

Table 4. Noun cases and their characteristic suffixes

Feature | Characteristic suffix (singular) | Characteristic suffix (plural)
Nominative (Prathama) | (vu) / (yu) / ಉ (u) / (nu) | (gaLu) / (Mdiru) / (yaru)
Accusative (Dwitiya) | (vannu) / (yannu) / (annu) / (nannu) | (Mdirannu) / (yarannu) / (rannu)
Instrumental (Tritiya) | (diMda) / (yiMda) / (iMda) / (niMda) | (MdiriMda) / (yariMda) / (riMda)
Dative (Chaturthi) | (kke) / (ge) / (ige) / (nige) / (akke) | (Mdirige) / (yarige) / (rige)
Ablative (Pachami) | (deseyiMda) | (gaLadeseyiMda) / (MdiradeseyiMda) / (yaradeseyiMda) / (radeseyiMda)
Vocative (Sambhodana) | ಏ (E) / (vE) / ಆ (A) / ಈ (I) | (MdirE) / (yare)

3.3 Verb Morphology
Compared with other Dravidian languages like Malayalam, the morphological structure of Kannada is more complex because it inflects for person, gender and number [14]. In verb morphology each root word combines with auxiliaries that indicate aspect, mood, causation, attitude, etc. The uniqueness of the structure of the verbal complex makes it very challenging to capture in a machine analyzable and generatable format. The formation of the verbal complex also involves the arrangement of the verbal units and the interpretation of their combinatory meaning. Phonology also plays a small role in word formation, in terms of 'morphophonemic' and 'sandhi' rules which account for the shape changes due to inflection.

Verb forms can be broadly classified into two types: finite verbs and non-finite verbs. Finite verbs usually occur at the end of a sentence and, with the exception of clitics, can have nothing added to them. The general sentence pattern with a finite verb is Subject-Object-Verb. Some of the finite forms of the verbs are imperatives, present and past forms marked with PNG, modals and verbal/participle nouns. The tense can be past, present or future if the verb is in the affirmative. The negative form does not take tense. Non-finite verbs, in contrast, cannot stand alone and must have some other form following them. Non-finite verb forms include infinitives, verbal and adjectival participles and tense-marked verb stems [19]. The non-past denotes both present and future tenses and, unlike Malayalam (another South Dravidian language), all tenses have different tense markers in Kannada. Mood is another important feature of Kannada and is associated with statements of fact versus possibility, supposition, etc. [20]. Four moods are expressed in Kannada: infinitive, imperative, affirmative and negative. Kannada also has some additional modal forms such as indicative, conditional, optative, potential, monitory and conjunctive.

In addition to simple verb stems, Kannada also has past verb stems, which are used in forming the past tense, past participles, conditionals and some other constructions. The past
stems also form the base to which contingent PNG markers are added. The contingent form is another distinguishing feature of Kannada that is not present in any other Dravidian language [21]. Table 5 shows the verb features of Kannada with examples.

Table 5. Verb features and characteristic suffixes

Feature | Characteristic suffixes | Example
Infinitive | (al) / (Okke) | (baru) 'come' + (al) + (illa) 'negative' = (baralilla) 'didn't come'
Imperative | O (yO) / E (yO) / iri (yiri) | (hOgO) / (hOgE) / (hOgiri)
Negative imperative | (bAradu) / (bEDa) / (kUDadu) | (hOgu) + ಅ (a) + (bAradu) = (hOgabAradu) 'don't go'
Optative | ಇ (i) | (mADu) + (al) + ಇ (i) = (mADali) 'let do'
Hortative | ಓಣ (ONa) | (mADu) + ಓಣ (ONa) = (mADONa) 'let's do'
Participle | ಆ (A) / ಇ (i) / (ade) / (adu) / ಅದ (ada) | (nODu + ade = nODade) 'without seeing'
Verbal aspect markers | (biDu) / (hOgu) | (biTTu biDu) 'let go'
Causative suffix | (isu) / (yisu) | (kali 'learn') + (isu) -> (kalisu 'teach')
Conditional suffix | (are) | (hOdare) 'if (someone) goes, (then…)'

The Person-Number-Gender (PNG) marker and the tense marker concatenated to the verb stem are the two important aspects of verb morphology [14]. The verbal inflectional morphemes attach to the verbs, providing information about syntactic aspects such as number, person, case-ending relation and tense. Kannada verbs usually follow a regular pattern of suffixation. Table 6 shows the various PNG suffixes that can be attached to any verb root.

Table 6. Kannada PNG suffixes

Person | Number | Gender | Present | Future | Past | Contingent
First | Singular | Masculine/Feminine | (Ene) | (enu, e) | (enu, e) | (Enu)
First | Plural | Masculine/Feminine | (Eve) | (evu) | (evu) | (Evu)
Second | Singular | Masculine/Feminine | (I, Iye) | (i, iye) | (i, iye) | (Izha)
Second | Plural | Masculine/Feminine | (Iri) | (iri) | (iri) | (Iri)
Third | Singular | Masculine | (Ane) | (anu) | (anu) | (Anu)
Third | Singular | Feminine | (Ale) | (aLu) | (aLu) | (ALu)
Third | Plural | Masculine/Feminine | (Are) | (aru) | (aru) | (Aru)
Third | Singular | Neuter | (ide) | (udu) | (ittu) | (Ittu)
Third | Plural | Neuter | (ive) | (avu) | (avu) | (Avu)

4. IMPLEMENTATION OF MAG MODEL
The proposed rule based MAG tool was developed using the AT&T Finite State Machine. This section explains the various efforts required to create the proposed MAG system.
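As a small illustration of how the PNG suffixes of Table 6 combine with a verb stem and a tense marker, the sketch below encodes only the present-tense column of Table 6 as a lookup table. The names `PRESENT_PNG`, `PRESENT_MARKER` and `lexical_form`, the key layout, and the use of "utt" as the present-tense marker (taken from the hOgu+utt+Ene example of Table 2) are assumptions made for illustration. It is plain Python, not the AT&T FSM implementation described in this paper, and it returns the unjoined lexical form, before any sandhi rule is applied.

```python
# Present-tense PNG suffixes transcribed from Table 6 (transliterated).
# Keys are (person, number, gender); "MF" means masculine/feminine.
# The key layout is a hypothetical choice made for this sketch.
PRESENT_PNG = {
    (1, "sg", "MF"): "Ene",
    (1, "pl", "MF"): "Eve",
    (2, "sg", "MF"): "I",    # Table 6 also lists the variant "Iye"
    (2, "pl", "MF"): "Iri",
    (3, "sg", "M"): "Ane",
    (3, "sg", "F"): "Ale",
    (3, "pl", "MF"): "Are",
    (3, "sg", "N"): "ide",
    (3, "pl", "N"): "ive",
}

# Assumed present-tense marker, as in the Table 2 example hOgu+utt+Ene.
PRESENT_MARKER = "utt"


def lexical_form(stem, person, number, gender):
    """Return the '+'-separated lexical form stem+tense+PNG."""
    key = (person, number, gender)
    if key not in PRESENT_PNG:
        raise ValueError(f"no present-tense PNG suffix for {key}")
    return "+".join([stem, PRESENT_MARKER, PRESENT_PNG[key]])


if __name__ == "__main__":
    print(lexical_form("hOgu", 1, "sg", "MF"))  # hOgu+utt+Ene
```

The other tense columns of Table 6 could be added as further dictionaries of the same shape; the lookup itself would not change.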
4.1 Classifying Verb Paradigms
One of the most important steps in the creation of the MAG is to classify the verb paradigms from a computational perspective. In most cases the problem arises from past tense markers that change from one paradigm to another [22]. Past verbs are broadly classified into two types: regular and irregular (or semi-regular). In the regular case, the different word forms are obtained by adding 'id' to the verb stem. In the other case, the different word forms are obtained by adding one of the past tense markers shown in Table 7. To resolve the computational challenges in verb morphological analysis we have classified verbs into 35 distinct paradigms, and verbs are grouped according to their class paradigms [14].

Table 7. Proposed Kannada Verb Paradigms

Paradigm | Past tense marker | Description and example
Class-1 | -tt- | Verbs ending with 'Ayu', 'Iyu', 'ILu'. E.g.: sAyu, Iyu, kILu, etc.
Class-2 | -tt- | Verbs ending with 'eru', 'aLu', 'uLu'. E.g.: heru, horu, aLu, uLu, etc.
Class-3 | -tt- | Verbs ending with 'aLu', 'uLu'. E.g.: aLu, uLu
Class-4 | -Mt- | Verbs ending with 'illu'. E.g.: nillu
Class-5 | -t- | Verbs ending with 'I' and 'e'. E.g.: kali, bali, mere, koLe, etc.
Class-6 | -t- | Verbs ending with 'ULu'. E.g.: hULu
Class-7 | -t- | Verbs ending with 'Olu', 'Ulu', 'Elu'. E.g.: jOlu, sOlu, nUlu, hElu, etc.
Class-8 | -d- | Verbs ending with 'Ayu', 'Oyu', 'Eyu', 'Iyu'. E.g.: kAyu, kOyu, tEyu, sIyu, hAyu, etc.
Class-9 | -d- | Verbs ending with 'Agu', 'Ogu'. E.g.: hOgu, Agu, etc.
Class-10 | -d- | Verbs ending with 'are'. E.g.: bare
Class-11 | -d- | Verbs ending with 'ge' and 'gi'. E.g.: age, agi
Class-12 | -d- | Verbs ending with 'yyu'. E.g.: koyyu, geyyu, hoyyu, bayyu, suyyu, etc.
Class-13 | -d- | Verbs ending with 'nnu'. E.g.: annu, tinnu, ennu, etc.
Class-14 | -d- | Verbs ending with 'Eyu'. E.g.: gEyu, nEyu, etc.
Class-15 | -d- | Verbs ending with 'Ayu'. E.g.: Ayu
Class-16 | -dd- | Verbs ending with 'iru'. E.g.: iru
Class-17 | -dd- | Verbs ending with 'kaLu'. E.g.: kaLu
Class-18 | -dd- | Verbs ending with 'ILu', 'ELu'. E.g.: bILu, ELu, etc.
Class-19 | -dd- | Verbs ending with 'Eyu'. E.g.: mEyu
Class-20 | -dd- | Verbs ending with 'ellu'. E.g.: gellu
Class-21 | -dd- | Verbs ending with 'ADu', 'ODu'. E.g.: ADu, nODu, kADu, tODu, etc.
Class-22 | -id- | Verbs ending with 'TTu', 'ddu', 'bbu', 'ttu', 'llu', 'ccu'. E.g.: aTTu, addu, ubbu, kuttu, cellu, heTTu, beccu, hottu, etc.
Class-23 | -id- | Verbs ending with 'Oru', 'Eru'. E.g.: tOru, sEru, hEru, hOru, etc.
Class-24 | -id- | Verbs ending with 'ju', 'Du', 'su'. E.g.: mOju, ADu, aMkurisu, etc.
Class-25 | -id- | Verbs ending with 'MTu', 'Mju', 'Mcu'. E.g.: IMTu, aMju, hoMju, etc.
Class-26 | -id- | Verbs ending with 'ELu', 'ILu'. E.g.: hELu, sILu, etc.
Class-27 | -nd- | Verbs ending with 'Eyu', 'Oyu'. E.g.: bEyu, nOyu, etc.
Class-28 | -nd- | Verbs ending with 'A' (aru). E.g.: taru (tA), baru (bA), etc.
Class-29 | -nd- | Verbs ending with 'ollu', 'ellu', 'allu'. E.g.: kollu, mellu, sallu, etc.
Class-30 | -nD- | Verb stems ending with 'ANu'. E.g.: kANu
Class-31 | -nD- | Verbs ending with 'oLLu'. E.g.: koLLu
Class-32 | -T- | Verbs ending with 'aDu', 'eDu', 'oDu', 'iDu', 'uDu'. E.g.: aDu, keDu, koDu, naDu, iDu, uDu, toDu, paDu, haDu, etc.
Class-33 | -k- | Verbs ending with 'ggu' and 'gu'. E.g.: oggu, miggu (migu), hoggu, sigu, nagu, etc.
Class-34 | -d- | Verbs ending with 'kAyu'. E.g.: kAyu, dArikAyu
Class-35 | -nD- | Verbs ending with 'kAyu'. E.g.: baggiko, bEDiko, etc.

4.2 Information required to build MAG
The following information is required to build a morphological analyzer and generator:

4.2.1 Lexicon
The list of stems and affixes together with basic information about them (noun stem, verb stem, etc.).

4.2.2 Morphotactics
The model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word, e.g., the rule that the Kannada plural morpheme follows the noun stem rather than preceding it.

4.2.3 Orthographic rules
These are spelling rules used to model the changes that occur in a word, usually when two morphemes combine. For example, insert a "yu" on the surface tape just when the lexical tape has a morpheme ending in 'e' (or 'i', etc.) and the next morphemes are "tt" (PRES) and "Ane" (3SM).
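The spelling rule described above can be sketched as a small function. This is a plain Python stand-in for a two-level rule, not the transducer built in this paper. The function name `present_3sm` is hypothetical, the stems in the usage lines ('bare' and 'kali' from Table 7, 'hOgu' from Table 2) are used only as illustrations, and treating every stem that does not end in 'e' or 'i' as needing no insertion is a simplifying assumption.

```python
def present_3sm(stem):
    """Join stem + "tt" (PRES) + "Ane" (3SM).

    A "yu" is inserted on the surface side only when the stem ends in
    'e' or 'i', following the orthographic rule in Section 4.2.3.
    The transliteration is case sensitive, so 'E' and 'I' do not match.
    """
    pres, png = "tt", "Ane"
    glide = "yu" if stem.endswith(("e", "i")) else ""
    return stem + glide + pres + png


if __name__ == "__main__":
    for stem in ("bare", "kali", "hOgu"):
        print(stem, "->", present_3sm(stem))
```

In a finite state setting the same condition would be written as a context-dependent insertion rule and composed with the lexicon, rather than applied as a string operation after the fact.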
[Figure: finite state transducer diagram. Surviving labels: "Input", "Lexical Level", arc labels "+N:ε" and "+PL:".]
The output of the first level becomes the input to the second level, where the orthographic (sandhi) rules are handled, as shown in Figure 5. If the input is accepted, the inflected word is generated.
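The two-level flow described above can be sketched as two functions applied in sequence. This is a simplified illustration, not the AT&T FSM implementation of this paper: the tiny suffix lexicon, the tag names `+N`, `+PL`, `+V`, `+PRES` and `+1S`, and the single sandhi rule (a stem-final 'u' is dropped before a suffix-initial 'u') are assumptions chosen so that the two examples of Table 2 come out as printed.

```python
# Level 1 data: hypothetical mini-lexicon mapping tags to morphemes.
SUFFIXES = {"+PL": "gaLu", "+PRES": "utt", "+1S": "Ene"}

# Morphotactics: which tag sequences may follow each category tag.
ALLOWED = {
    "+N": [["+PL"]],
    "+V": [["+PRES", "+1S"]],
}


def level_one(stem, category, tags):
    """Map stem + tags to a morpheme list, or None if the order is rejected."""
    if list(tags) not in ALLOWED.get(category, []):
        return None
    return [stem] + [SUFFIXES[tag] for tag in tags]


def level_two(morphemes):
    """Join morphemes, applying one assumed sandhi rule: u + u -> u."""
    surface = morphemes[0]
    for morpheme in morphemes[1:]:
        if surface.endswith("u") and morpheme.startswith("u"):
            surface = surface[:-1]
        surface += morpheme
    return surface


def generate(stem, category, tags):
    """Run both levels; return the inflected word or None if rejected."""
    morphemes = level_one(stem, category, tags)
    return None if morphemes is None else level_two(morphemes)


if __name__ == "__main__":
    print(generate("Ane", "+N", ["+PL"]))            # AnegaLu
    print(generate("hOgu", "+V", ["+PRES", "+1S"]))  # hOguttEne
```

Keeping morpheme ordering and spelling changes in separate stages mirrors the split between the first and second levels: a form that fails the ordering check never reaches the sandhi stage.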
the proposed MAG can be improved. A rule based machine translation system for English to Kannada language was developed using the proposed MAG.

[9] Shailly Goyal, 'Parsing Aligned Parallel Corpus by Projecting Syntactic Relations from Annotated Source Corpus', Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 301–308, Sydney, July 2006. Association for Computational Linguistics.
[10] Sajib Dasgupta, 'Morphological Analysis of Inflecting Compound Words in Bangla', BRAC University, Dhaka, Bangladesh.
[23] Dr. A.G. Menon, S. Saravanan, R. Loganathan and Dr. K. Soman, Amrita University, Coimbatore, India. 'Amrita Morph Analyzer and Generator for Tamil: A Rule Based Approach'.