Chapter 1
1.0 Introduction
Morphology is the branch of linguistics that deals with word formation, analysis and
generation. It is concerned with how words are formed in a language from smaller units
in a systematic way, and it seeks both to find and to describe the mechanism
behind this process. Each language has morphemes, which are its smallest meaningful
units. The realization of a morpheme as part of a word is called a morph. Forming words
from these smaller units involves processes such as inflection, derivation and
compounding. Also involved is morphotactics, which determines how
morphs may be put together to form words.
Low-level applications
1
Harald Trost, Entry under “Morphology” in Oxford Handbook of Computational Linguistics
Hyphenation is almost exclusively done automatically. Segmenting words correctly into
their morphs helps to solve the task. The major problem is spurious segmentations.
Spelling correction, in its simplest form, can be a mere comparison of the input against a
list of word forms. But such a list will never contain all the words occurring in a text and
enlarging the list will include more and more obscure words that will match with typos
thus preventing their detection. Therefore, most systems use a root lexicon, plus a
relatively small set of affixes and simple rules to cover morphotactics.
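The root-lexicon approach to spelling correction described above can be sketched in a few lines. This is only a minimal illustration, not the design of any particular spell checker: the toy English lexicon `ROOTS`, the suffix table `SUFFIXES` and the function name `is_valid_word` are assumptions made for the example.

```python
# Toy data (assumed for illustration): each root carries a lexical category,
# and each suffix lists the categories it may attach to (the morphotactics).
ROOTS = {"walk": "verb", "talk": "verb", "cat": "noun", "dog": "noun"}
SUFFIXES = {
    "": {"verb", "noun"},
    "s": {"verb", "noun"},
    "ed": {"verb"},
    "ing": {"verb"},
}


def is_valid_word(word: str) -> bool:
    """Accept a word if it splits into a known root plus a permitted suffix."""
    for suffix, allowed_categories in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        root = word[: len(word) - len(suffix)]
        category = ROOTS.get(root)
        if category is not None and category in allowed_categories:
            return True
    return False
```

With this table, "walked" and "cats" are accepted, while "cated" is rejected because the suffix "ed" is not allowed on a noun root. A full word-form list would need one entry per inflected form to reach the same decision.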
Stemmers are used in information retrieval to reduce as many related words and word
forms as possible to a common canonical form which can then be used in the retrieval
process. However, this canonical form is not necessarily the base form. The main
requirement is – like in all the above tasks – robustness.
2
http://en.wikipedia.org/wiki/Natural_language_processing
and information extraction to dialogue systems and machine translation. For many
current applications, only inflectional morphology is considered.
In generation, on the other hand, the task is to produce the correct word form from the
base form plus the relevant set of morphosyntactic features.
Speech applications
A text-to-speech system takes (electronically stored) text as input and produces speech
from it. Morphological analysis helps to solve two different tasks in such systems. One is
to guide the grapheme-to-phoneme conversion. Characters are often ambiguous with
respect to their translation into phonemes. Finding out the underlying morphological
structure is necessary for solving the task correctly.
A less obvious application is the use of morphological analysis to help in determining the
part-of-speech category of words. This is an important prerequisite of syntactic analysis
which is the basis for coming up with a correct prosody.
Speech recognition is a field where morphological analysis can become ever more
important. At the moment most available systems make use of full form lexicons and
perform their analysis on a word basis. Increasing demands on the lexicon size on the one
hand and the need to limit the necessary training time on the other hand will make
morph-based recognition systems more attractive.
1.1 Morphological Analysis and Morphological Analyzers
Morphological analysis is the process in which a word is analyzed into its root word and
associated morphemes. Morphological analyzers are programs that provide a
morphological analysis of a language. They include a recognition engine, a lexicon/thesaurus,
and algorithms to find the stem within an input word and identify its suffixes.3 They
take a sentence, i.e. a sequence of words, as input and retrieve the related grammatical
information, such as lexical category, gender, number, person and tense, for every word.
For nouns, for example, they provide gender, number and case information along with
the root word. For verbs, information related to tense, aspect, mood etc. is retrieved. If a
word has more than one meaning, the analyzer is supposed to return the grammatical
information for every meaning separately. 4
The morphological analyzer is the basic tool used in spell checkers, grammar checkers,
parsers and machine translation systems. To build a syntactic representation of the input
sentence, a parser must map each word in the text to some canonical representation and
recognize its morphological properties.
For the analysis of morphologically rich languages, the root and the
morphemes of each word have to be identified. The combination of a surface form and its
analysis as a canonical form and inflection is called a lemma. The main problems
involved in the process are:
3
http://www.acroterion.ca/Morphological_Analysis.html
4
Akshar Bharati, Vineet Chaitanya and Rajeev Sangal. "Natural Language Processing: A Paninian
Perspective", Prentice-Hall of India, New Delhi, 1995.
There are many approaches for morphological analysis. A popular approach to
morphological alternation is the cut-and-paste method in which the canonical form is
derived by removing and adding letters to the end of a string. Another is finite-state
technology, which was introduced in the early 1980s and is applied to the automatic
recognition and generation of word forms. It maintains that rules for morphological
alternations can be implemented by finite-state transducers. It was also widely recognized
that possible combinations of stems and affixes can be encoded as a finite-state network.
An automaton containing inflected word forms can be upgraded to a morphological
analyzer, for example, by adding a code to the end of the inflected form that triggers
some predefined cut-and-paste operation to produce the lemma. Instead of cutting and
pasting it at runtime, the entire lemma can be computed in advance and stored as a finite-
state transducer whose arcs are labeled by a pair of forms. The number of nodes in this
type of network is small, but the number of arc-label pairs is very large as there is one
symbol for each morpheme-allomorph pair. A more optimal lexical transducer can be
developed by constructing a finite-state network of lexical forms, augmented with
inflectional tags, and composing it with a set of rule transducers. Lexical transducers can
be constructed from descriptions containing any number of levels. This facilitates the
description of phenomena that are difficult to describe within the constraints of the two-
level model. Because lexical transducers are bidirectional, they are generally non-
deterministic in both directions. If a system is only to be used for analysis, a simple
finite-state network derived just for that purpose may be faster to operate. 5
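The cut-and-paste idea mentioned above can be made concrete with a small sketch. It assumes, purely for illustration, that each stored word form carries a code of the shape (number of letters to cut, string to paste); the table `FORMS` and the function `lemmatize` are hypothetical names, and the English entries are toy data.

```python
# Each inflected form is stored with a cut-and-paste code:
# (how many letters to remove from the end, what to append afterwards).
FORMS = {
    "walked": (2, ""),
    "running": (4, ""),
    "cities": (3, "y"),
    "went": (4, "go"),
}


def lemmatize(form: str):
    """Return the lemma for a known form, or None if the form is not listed."""
    code = FORMS.get(form)
    if code is None:
        return None
    cut, paste = code
    return form[: len(form) - cut] + paste
```

Precomputing the lemma, as the text describes for lexical transducers, amounts to storing the result of `lemmatize` directly instead of the code, trading storage for runtime work.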
A morphological generator performs the reverse function of an analyzer. For every root
and set of features provided as input, it generates the word forms in the given paradigms.
A morphological analyzer and a morphological generator are two essential and basic tools
for building any language processing application for a natural language.
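A paradigm-driven generator of the kind just described might look like the following sketch. The paradigm names, the feature labels and the toy English entries are assumptions chosen for readability; a real generator would also need rules for irregular forms and boundary changes.

```python
# Hypothetical paradigm tables: feature label -> suffix.
PARADIGMS = {
    "regular_verb": {"present": "", "third_singular": "s", "past": "ed", "progressive": "ing"},
    "regular_noun": {"singular": "", "plural": "s"},
}

# Hypothetical root lexicon: root -> paradigm it follows.
LEXICON = {"walk": "regular_verb", "book": "regular_noun"}


def generate_forms(root: str) -> dict:
    """Generate every word form of a root according to its paradigm."""
    paradigm = PARADIGMS[LEXICON[root]]
    return {feature: root + suffix for feature, suffix in paradigm.items()}
```

Calling `generate_forms("walk")` returns one form per feature label of the verb paradigm, which is the reverse of what an analyzer does when it maps a form back to root and features.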
5
http://www.languageinindia.com/aug2006/parsingrajendran.pdf
1.1.1 India’s linguistic scenario
6
http://www.languageinindia.com/july2005/morphologynortheast1.html
Andamanese: Andamanese is the fifth language family of India.7 It comprises the four
language groups represented by the four speech communities of the Great Andamanese,
the Jarawa, the Onge and the Sentinelese. Great Andamanese is arguably a language family
in itself.8 All these languages are primitive languages, and the communities that speak
them are primitive tribes who have been living in these islands for
thousands of years, untouched by civilized society. They are people of the Negrito
stock, and it is not clearly known where they came from.
7
Abbi, Anvita. 2006. Endangered Languages of Andaman Islands. Lincom Europa.
8
Ibid.
9
http://tdil.mit.gov.in/
10
http://tdil.mit.gov.in/languagetechnologyresourcesapril03.pdf
List of Resource Centers
11
Source: TDIL website, http://tdil.mit.gov.in/
A few other funding agencies are also supporting language technology efforts.
Several individuals in India and abroad, and some industries, have also been developing
different language products and resources. One example is a morphological verb analyzer
developed at JNU by a research scholar12 for Great Andamanese, an extremely
endangered language. Because the efforts are so widely distributed and cover
so many languages, it is very difficult to keep track of the activities in this area.
TDIL has therefore tried to evolve an indexing system for language technology resources, so
that a suitable portal can be created to offer a variety of services to persons concerned
with Indian language technologies. If all the resource centres follow this indexing and
share their documents with TDIL, it will be easy to keep track of progress and
possible to encourage more people to participate in the development activities.
12
Choudhary, Narayan Kumar, 2006, “Developing a Computational Framework for the Verb Morphology
of Great Andamanese”, Unpublished dissertation at Centre for Linguistics, JNU.
The task of the morphological analyzer is to identify the structural components of a word
and to glean information about it. While Indian languages in general are morphologically
richer than languages like English, Dravidian languages are considerably more complex.
Clearly, there is no way to list all forms of all words in a dictionary. Therefore,
morphological analysis is essential.
Natural Language Processing in India has made significant progress
since its beginning. A variety of tools have been developed for different purposes. These
include machine translation systems, morphological analyzers and generators, POS
taggers, spell checkers, text processors, word nets and many other tools. Given the
morphological richness of these languages, numerous NLP tools related to morphology
have been developed. They range from major languages like Hindi, Telugu and
Bengali to languages like Manipuri and Assamese. A survey of morphological
analyzers, generators and other tools for the morphological understanding of various Indian
languages is given below. It focuses mainly on morphological analyzers and generators,
taggers, spell-checkers and text-processors. However, only the efforts and tools of government
institutions have been covered, leaving out the initiatives of the private sector, which
is also putting a lot of effort into this field.
13
http://ccat.sas.upenn.edu/plc/tamilweb/hindi.html
Telugu is a highly inflectional and agglutinative language. It exhibits features like vowel
harmony, sandhi etc. Clitics, particles and vocatives are all part of the word. Auxiliary verbs
are used in various combinations to indicate complex aspects.
RCILTS for Telugu, Univ. of Hyderabad claims to have developed the only large-scale
morphological analyzer system built for Telugu.14 It has been built, tested against corpora
and refined over the past 10 years. The system uses a root word dictionary of 64,000
entries and a suffix list categorized into a number of paradigm classes. The basic
methodology is to look for suffixes, remove them while taking care of sandhi changes, and then
cross-check the result against the dictionary. Inflection, derivation and external sandhi are
all handled. There is also a separate morphological generator that can put together roots and
affixes to construct complete word forms.
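The basic methodology described above (find a suffix, strip it, undo boundary changes, cross-check the dictionary) can be sketched as follows. This is not the RCILTS system: the English toy data stand in for a Telugu lexicon, and the two "repair" steps in `restore_candidates` are simple stand-ins for real sandhi rules. All names are hypothetical.

```python
ROOTS = {"make", "run", "walk"}  # toy root dictionary (assumed)
SUFFIX_CLASSES = {"ing": "progressive", "s": "third_singular", "ed": "past"}


def restore_candidates(stem: str):
    """Yield possible roots by undoing toy boundary changes on the stem."""
    yield stem                      # no change at the boundary
    yield stem + "e"                # a final vowel was dropped before the suffix
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        yield stem[:-1]             # a final consonant was doubled


def analyze(word: str) -> list:
    """Return (root, label) analyses whose root is found in the dictionary."""
    analyses = []
    if word in ROOTS:
        analyses.append((word, "bare"))
    for suffix, label in SUFFIX_CLASSES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            for root in restore_candidates(stem):
                if root in ROOTS:   # cross-check against the dictionary
                    analyses.append((root, label))
                    break
    return analyses
```

The dictionary check is what keeps the suffix stripper honest: a suffix match alone never produces an analysis unless some restored stem is a known root.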
14
TDIL magazine ‘vishvabharata’ July 2003, accessible@ tdil.mit.gov.in/Telugu-UOHJuly03.pdf
15
Ibid.
Morphological Analyzer and Synthesizer for Hindi 16, 17
Media Lab Asia Research Laboratory at IIT Kharagpur, WB has developed a
morphological analyzer for Hindi. It can identify the tense, aspect, modality, person,
gender and number along with the root (‘dhaatu’) of an inflected verb form. For this, the
morphological analyzer uses a Directed Acyclic Morphological (DAM) Structure. For nouns, the
analyzer has to determine the vibhakti (inflection), suffixes and prefixes. It analyzes the
lexical word groups corresponding to a noun and determines the karaka (semantic role).
The group also aims to perform the decompositions for sandhi and samaasa (joining
and compounding words) so as to have a powerful vocabulary for the system, and a
generalized prefix and suffix handler. The DAM structure is based on the fact
that the possible inflections of a word can be identified by scanning the word backward from
the end, one step at a time.
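The text does not specify the internals of the DAM structure, so the sketch below substitutes a plain reversed-suffix trie to illustrate only the backward scan it relies on. The suffix inventory is toy English data, and `build_reverse_trie`, `scan_backward` and the `END` marker are hypothetical names.

```python
END = ""  # terminal marker; a single character of a word is never the empty string


def build_reverse_trie(suffixes: dict) -> dict:
    """Index suffixes by their characters read from right to left."""
    root = {}
    for suffix, label in suffixes.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = label
    return root


def scan_backward(word: str, trie: dict) -> list:
    """Scan the word from its last character, one step at a time,
    and report every (stem, suffix, label) split that matches a suffix."""
    matches = []
    node = trie
    for i in range(len(word) - 1, -1, -1):
        node = node.get(word[i])
        if node is None:
            break
        if END in node:
            matches.append((word[:i], word[i:], node[END]))
    return matches


TOY_TRIE = build_reverse_trie({"s": "plural", "es": "plural", "ness": "nominalizer"})
```

Because the scan stops at the first character that matches no suffix, its cost depends on the length of the longest suffix rather than on the length of the word or the size of the lexicon.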
A morphological synthesizer has also been developed to generate the full word form
given the root of the word and the tense, aspect, mode, number, gender and person.
Hindi is a GNP language, i.e. verb forms are inflected mainly according to
gender, number and person (GNP) information.
Bengali is a classifier language, in which inflection depends mainly on tense,
aspect and mode (TAM). For second and third persons, there is the concept of familiarity. The
morphological analyzer has been modeled as finite state transducers which are given a
16
http://www.mla.iitkgp.ernet.in/technology.html
17
http://www.mla.iitkgp.ernet.in/Resource/index.html
18
http://www.mla.iitkgp.ernet.in/technology.html
19
http://www.mla.iitkgp.ernet.in/Resource/index.html
root and TAM and GNP information as input. The transducer’s job is to generate the
correct inflected form of the word by adding the appropriate suffix for Bengali.
Verbs are classified according to the vowels present in them. Given an input, the
synthesizer first determines the category of the root. It then
synthesizes the final form of the word according to the input TAM and GNP parameters.
The number of exceptions is limited, and they are therefore easily traceable.
The OMA (Oriya Morphological Analyzer) has several constituent parts: 21 the OriNet
Database (OD), which stores the Oriya lexicon (root words only); the OMA Engine (OE), which
drives the system; and Morphological Parsing (MP), which parses the word according to
orthographic rules. A Decision Tree (DT) is used to classify the morphemes and the different
functional programmes within the OMA.
The OMA system has been designed on the basis of Object-Oriented Approach (OOA) so
that different functions can be easily added to or deleted from it. Pronoun Morphology
and Inflection Morphology have been implemented in the OMA in such a manner that it
successfully runs with the OriNet system, Oriya Spell Checker (OSC) and Oriya
Grammar Checker (OGC). The OSC handles any type of word (derived, inflectional or
root) using the OMA. It also provides sufficient interface for applications involved in
Oriya Machine Translation (OMT), Word-Net for Oriya (OriNet), Oriya Spell Checker
20
http://www.ilts-utkal.org/oriyamorphological.htm
(OSC) and Oriya Grammar Checker (OGC). All these developments have been worked
out on the basis of the syntactic approach of the Sanskrit language.
The developers have implemented the Decision Tree (DT) and a corresponding
algorithm for each type of morphology, on which the OMA runs. While
performing morphological analysis, the OMA deals not only with words but also
with their morphemes. The OMA system is useful for various applications in developing NLP
tools. For example, OMT needs root words, which are obtained through the
OMA. Similarly, it is typically used in OriNet, the Word-Net for the Oriya language, for
searching any type of lexicon. Its Morphological Parser (MP) handles
ambiguous morphemes and provides different alternatives for them.
21
http://www.ilts-utkal.org/publication/NLP/Abstract/abs-12.pdf
22
tdil.mit.gov.in/TDIL-OCT-2003/morph%20analyzer.pdf
tagger tags the lexical category of the root and the grammatical category of the affixes.
The code is written in Perl and the user interface has been developed using
Perl/Tk. The dictionaries are stored in an MS-Access database.
Tamil is a verb-final, relatively free-word-order, morphologically rich and agglutinative
language. Computationally, each root word in Tamil can take a few thousand inflected
word forms. Subject-verb agreement is required for the grammaticality of a Tamil
sentence. Tamil allows subject and object drop as well as verbless sentences. In addition,
the subject of a sentence or a clause can be a possessive Noun Phrase (NP) or an NP in
nominative or dative case. As Tamil is an agglutinative language, each root word can
combine with multiple morphemes to generate word forms.
1. Rajendran’s Morphological Analyzer for Tamil: This was one of the very first
efforts towards building a morphological analyzer for Tamil. It was initiated by the
anusaraka group of researchers, under whose guidance Rajendran, Tamil University,
prepared this morphological analyzer for translating Tamil into Hindi at the
word level.
2. AUKBC Morphological Parser for Tamil: 25 The AUKBC NLP team, under the
supervision of Rajendran, prepared a morphological parser for Tamil. The API Processor
of AUKBC makes use of finite state machinery like PC Kimmo. It parses, but does
not generate.
23
http://www.iitg.ernet.in/rcilts
24
“Parsing in Tamil: Present State of Art”, S. Rajendran, Ph.D.
http://www.languageinindia.com/aug2006/parsingrajendran.pdf
25
http://www.au-kbc.org/frameresearch.html
3. Vaishnavi’s Morphological Generator and Analyzer for Tamil: Vaishnavi has built
generators and analyzers for Tamil morphology. The generator implements the item-and-
process model of linguistic description. It works by the synthesis method of PC Kimmo.
The analyzer uses a hybrid model for Tamil. It is theoretically rooted in a blend of the
item-and-arrangement (IA) and item-and-process (IP) models of morphology. It includes a
built-in lexicon and decomposes words into morphemes within the model to realize
well-formed surface word forms. Thus it tries to define a transformation depending on the
morphemic nature of the word stem. The analysis scans the string from the right periphery
to the left, stripping one suffix at a time and reconstructing the rest of the word with the
aid of phonological and morphophonemic rules exemplified in each instance. This goes
on till the string is exhausted. For the sake of comparison, AMPLE and KIMMO models
are implemented.
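The right-to-left analysis loop described above, which strips one suffix at a time until nothing is left to strip, can be illustrated with the sketch below. It is a greedy toy version with English data and no backtracking or morphophonemic reconstruction; `ROOTS`, `SUFFIXES` and `strip_suffixes` are hypothetical names.

```python
ROOTS = {"care", "hope"}                      # toy built-in lexicon (assumed)
SUFFIXES = ["ness", "less", "ful", "ly", "s"]  # longer suffixes are tried first


def strip_suffixes(word: str):
    """Peel suffixes off the right edge until a known root remains.

    Returns [root, suffix1, suffix2, ...] in surface order, or None if the
    string cannot be reduced to a root in the lexicon.
    """
    morphs = []
    rest = word
    while rest not in ROOTS:
        for suffix in SUFFIXES:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                morphs.insert(0, suffix)      # keep surface order
                rest = rest[: -len(suffix)]
                break
        else:
            return None                       # no suffix matched: analysis fails
    return [rest] + morphs
```

For an agglutinative language the loop would typically run several times per word, which is why the suffix inventory and the reconstruction rules, omitted here, carry most of the linguistic knowledge.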
26
http://ns.annauniv.edu
postpositions. The verb section takes tense and PNG markers, relative and verbal
participle suffixes, and auxiliary verbs. It uses sandhi rules and 125 morphological rules.
It handles adjectives and adverbs. It has word and sentence generator interfaces.
Others
• Ganesan’s Morphological Analyzer for Tamil was built to analyze the CIIL corpus. It uses
phonological and morphophonemic rules and takes into account the morphotactic
constraints of Tamil. An efficient morphological parser has also been built.
• Kapilan’s Morphological Analyzer is built especially for analyzing Tamil
verbal forms.
• Deivasundaram’s Morphological Parser was built for a Tamil word processor. He
too makes use of phonological and morphophonemic rules and the morphotactic
constraints of the language in developing his parser.
• Ramasamy’s Morphological Generator for Tamil: Ramasamy has prepared a
morphological generator for Tamil.
• Winston Cruz’s Parsing and Generation of Tamil Verbs makes use of the GS
morph method for parsing Tamil verbs, which handles morphotactics by indexing.
The algorithm simply looks up two files to see whether the indices match. The
processor generates as many forms as it parses and uses only two files.
• Dhurai Pandi’s Morphological Generator and Parsing Engine for Tamil
Verb Forms: It is a full-fledged morphological generator and parsing engine for
verb patterns in modern Tamil.
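The text gives no detail of the GS morph method beyond index matching over two files, so the following sketch is only one possible reading of "morphotactics by indexing". The two dictionaries stand in for the two files, and the English data, the index values and the function names are all assumptions.

```python
# Stand-ins for the two lookup files: one indexes stems, the other suffixes.
STEM_INDEX = {"walk": 1, "sing": 2}
SUFFIX_INDEX = {"ed": {1}, "ing": {1, 2}, "s": {1, 2}}


def parse(word: str) -> list:
    """Accept a stem + suffix split only if the two indices match."""
    results = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem in STEM_INDEX and STEM_INDEX[stem] in SUFFIX_INDEX.get(suffix, ()):
            results.append((stem, suffix))
    return results


def generate(stem: str) -> list:
    """Produce every form licensed by the same index tables."""
    index = STEM_INDEX[stem]
    return [stem + suffix for suffix, classes in SUFFIX_INDEX.items() if index in classes]
```

Because parsing and generation read the same two tables, every generated form is also parsable, which matches the remark that the processor generates as many forms as it parses.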
The Advanced Centre for Technical Development of Punjabi Language, Literature and
Culture, Punjabi University, Patiala has developed a Morphological Analyzer and
Generator for Punjabi.27 As a generator, the software displays the list of all the possible
word forms of any Punjabi root word, along with their respective grammatical
information. As an analyzer, it identifies the grammatical attributes of any Punjabi word and
27
http://www.advancedcentrepunjabi.org/punjabi_mor_ana.asp
can also be used to search for any Punjabi word in it to know its root and other
grammatical information.
The application areas of the software include automatic spelling and grammar checking,
natural language understanding, machine translation, speech recognition, speech
synthesis, part-of-speech tagging and parsing. Ordinary users can also get in-depth
information about Punjabi words from the software. Knowing the grammatical
information of a word helps in its proper and error-free use in sentences. It can also help
beginners to learn new words and specialists to create new terminology.
The database used in the software consists of more than 1.72 lakh (172,000) Punjabi words,
grouped into 22 word classes such as noun, personal pronoun, reflexive pronoun, verb,
inflected and uninflected adverb, inflected and uninflected adjective, conjunction,
interjection etc.
IIT Bombay
28
http://ltrc.iiit.net/showfile.php?filename=onlineServices/morph/index.htm
(UNL) as interlingua. WordNets for Hindi and Marathi, with a good deal of literature on
them, can be downloaded from the CFILT portal 29 . It also offers an overview of the UNL based
Interlingua approach to MT with its online versions of English to Hindi MTS, Hindi to
UNL conversion system, UNL to Hindi generation system. This centre is developing a
POS tagger for Hindi using a data-driven approach based on Conditional Random
Fields, a graphical model. The tagger makes use of the spelling of the
words, their lexical attributes and their suffixes to achieve an accuracy of about 90% on the
BBC Hindi news corpora. It has a tagset of 30 tags, with two pseudo tags, S-START and
S-END, marking the sentence boundaries.30
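The kinds of evidence the tagger is said to use (spelling, suffixes, sentence boundaries) can be illustrated by a feature-extraction function such as the one below. It is only a sketch of how such features are commonly prepared for a CRF; the feature names, the suffix lengths and the reuse of S-START and S-END as padding tokens are assumptions, and no CRF training is shown.

```python
def token_features(tokens: list, i: int) -> dict:
    """Build a feature dictionary for the i-th token of a sentence."""
    # Pad the sentence so that the first and last words still have neighbours.
    padded = ["S-START"] + list(tokens) + ["S-END"]
    j = i + 1
    word = padded[j]
    return {
        "word": word,
        "length": len(word),
        "suffix1": word[-1:],   # last character
        "suffix2": word[-2:],   # last two characters
        "suffix3": word[-3:],   # last three characters
        "prev": padded[j - 1],
        "next": padded[j + 1],
    }
```

A sequence of such dictionaries, one per token, is the usual input format for CRF toolkits, with lexical attributes from a dictionary added as further keys.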
Bangla Spell-checker
29
Resource Centre for Indian Language Technology Solutions (CFILT) - http://www.cfilt.iitb.ac.in/
(accessed: 14.10.2006)
30
Srivastava, Manish et al., 2006, ‘Conditional Random Field Based POS Tagger for Hindi’, in the
Proceedings of MSPIL, IIT-B, Mumbai
31
tdil.mit.gov.in/Telugu-UOHJuly03.pdf
RCILTS-Bengali Indian Statistical Institute, Kolkata has developed a Bangla Spell-
Checker. 32 In this spell-checker, only non-word errors are considered and not word
errors like substitution, deletion, insertion, and transposition error. In Bangla, errors may
occur due to phonetic similarity of characters or typographical mistakes. In this spell-
checker, the main technique of error detection is based on matching the candidate string
in the normal as well as in the reversed dictionary. Moreover, this approach is combined
with a phonetic similarity key based approach where phonetically similar characters are
mapped into a single symbol and a nearly-phonetic dictionary of words is formed. Using
this dictionary, phonetic errors can be easily detected and corrected. A candidate string
first passes through the phonetic dictionary. If the word is not found there and
no suggestion can be produced, the checker tries to divide the word into a root part and a
suffix part and verifies each separately. If an error is found, the spell-checker tries to provide
suggestions. If that fails, it checks whether the string is a conjunct word formed by
appending two nouns and a suffix. An option for adding new words permanently or
temporarily is provided in the spell checker.
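The phonetic similarity key described above can be sketched as follows. The sketch covers only the key-based lookup, not the reversed dictionary or the root and suffix verification, and it uses Latin letters in place of Bangla characters; the character classes, the word list and the function names are all assumptions.

```python
# Phonetically similar characters are mapped to a single symbol (toy classes).
PHONETIC_CLASSES = {"c": "k", "q": "k", "z": "s", "v": "w"}

WORDS = ["cat", "kite", "zebra"]  # toy dictionary


def phonetic_key(word: str) -> str:
    """Replace every character by the symbol of its phonetic class."""
    return "".join(PHONETIC_CLASSES.get(ch, ch) for ch in word)


# The nearly-phonetic dictionary: key -> words sharing that key.
PHONETIC_DICT = {}
for entry in WORDS:
    PHONETIC_DICT.setdefault(phonetic_key(entry), []).append(entry)


def check(word: str):
    """Return (is_correct, suggestions) for a candidate string."""
    if word in WORDS:
        return True, []
    return False, PHONETIC_DICT.get(phonetic_key(word), [])
```

A misspelling that swaps phonetically similar characters has the same key as the intended word, so the suggestion is found with a single dictionary lookup instead of a search over edit operations.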
The spell-checker uses several files containing root words and suffix words. The main
dictionary contains about 60,000 root-words and 100,000 inflected words. Noun and verb
suffix files are also used. The spell-checker is fast and detects non-word errors with
high accuracy. However, it produces about 5% false alarms, owing to conjunct words formed
by euphony and assimilation as well as proper nouns in the corpus.
32
tdil.mit.gov.in/Bengali-ISIKolkataJuly03.pdf
33
tdil.mit.gov.in/Gujarati-MSUniversityBarodaJuly03.pdf
rigorous testing to find all the faults in the spell checker and morphological analyzer.
The developers plan to enlarge the root dictionary to cover as many words of the
language as possible and to improve the algorithms for the correctness and
efficiency of the spell checker. They also aim to build an independent spell checker
which can be used on Unicode or ISCII compatible text.
RCILTS for Tamil at Anna University, Chennai has developed a spell checker for Tamil
that takes care of the rich morphological structure of Tamil. After tokenizing the
document into a list of words, each word is passed to the morphological analyzer which
analyzes only the correct words. The morphological analyzer tries to split the suffix. In
34
tdil.mit.gov.in/Malayalam-CDACThiruvananthapuramJuly03.pdf
case of error, it passes the word to the spelling verification and correction phase to correct the
mistake. When the correction of errors is completed, the root word and all components are
sent to the morphological generator (for word forming), which then generates the possible
corrected words as suggestions.
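The pipeline described above (tokenize, analyze, correct the root on failure, regenerate suggestions) can be sketched as follows. This is not the Anna University implementation: the data are toy English, the correction step simply uses `difflib.get_close_matches` from the Python standard library, and the similarity cutoff of 0.7 is an arbitrary assumption.

```python
import difflib

ROOTS = {"walk", "talk"}          # toy root dictionary (assumed)
SUFFIXES = ["ing", "ed", "s", ""]  # toy suffix list (assumed)


def analyze(word: str):
    """Split a word into (root, suffix) if the root is known, else None."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in ROOTS:
            return stem, suffix
    return None


def suggest(word: str) -> list:
    """Correct the root part and regenerate full word forms as suggestions."""
    suggestions = []
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        for root in difflib.get_close_matches(stem, sorted(ROOTS), n=3, cutoff=0.7):
            candidate = root + suffix  # the "generator" step
            if candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions


def check_text(text: str) -> dict:
    """Tokenize on whitespace and report suggestions for unanalyzable words."""
    return {w: suggest(w) for w in text.split() if analyze(w) is None}
```

Correcting the root and then regenerating the form keeps the suggestions inflected the same way as the misspelled input, which is the point of routing corrections through the morphological generator.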
The spell checker for Tamil helps the user to identify most of the errors that may occur
while typing. The checks implemented in the Tamil spell checker are case marker,
postposition and adjective checking for nouns, PNG marker checking for verbs,
adverb checking, and adjacent key error checking.
1.1.3.3 Text-processors for Indian languages
35
http://tdil.mit.gov.in/Punjabi-TIETPatialaJuly03.pdf
36
http://ildc.gov.in/telugu/htm/Akshara.htm
processing in Punjabi Language through the common typewriter keyboard layout. It has
MS-Word compatible features and commands. It provides a number of features that make
the use of Punjabi Language on a computer easy and provides a number of tools to
increase the efficiency of the user. These tools include Bilingual Spell Checker with
suggestion list, on-screen keyboard layouts with composition reference for Punjabi
language typing, bilingual search and replace, sorting as per the language, alphabetical
order, technical glossaries and onscreen Bilingual dictionaries.
The main features of Likhari are:
• Very simple user interface.
• Online active keyboard for users who do not know how to type in Punjabi.
• Choice of Phonetic, Remington and Alphabetic keyboard layouts with
composition reference.
• Bilingual Spell Checker for Punjabi and English.
• Bilingual Search and Replace.
• Support for sorting the text in English or Punjabi as per alphabetical order.
• Extensive help at various levels to make it easy for the user to learn.
Nashir supports Nastaliq True Type fonts (presently 2 fonts) and Naskh fonts (presently
12 fonts) along with fonts for Sindhi and Kashmiri. It supports C-DAC & Phonetic
Keyboards as well as user-defined keyboards. It provides various drawing objects and
37
http://tdil.mit.gov.in/Punjabi-TIETPatialaJuly03.pdf
38
tdil.mit.gov.in/UrduSindhiKashmiri-CDACPuneJuly03.pdf
also supports OLE Automation. Nashir supports Urdu, Sindhi & Kashmiri along with
English. A transliteration engine (uTRANS) has also been implemented, which converts
an “.aci” (ISCII) file into a transliterated version in Urdu script (Naskh/ Nastaliq). Rule-
based transliteration has been developed for Hindi & Punjabi. The user can save the
document as an HTML page, and thus Naskh as well as Nastaliq scripts can be viewed on the
Internet.
A phonetic keyboard is supported. Both horizontal and vertical kerning are provided to
adjust a text manually. It has horizontal and vertical rulers in the GUI and dynamic font
settings for the Urdu and English fonts.
1.2.1 Inflection
Sanskrit is very rich in inflections. Inflectional morphology involves the formation of two
kinds of words or padas: subanta padas (nominal words) and tiṅanta padas (verb forms).
The Pāṇinian analysis of Sanskrit has categorized each and every usable word under
these two categories. A word cannot be used in the language unless it is one of them.
Even the indeclinable words (avyayas) are first assigned to the first category, and
the characteristics of that category are later deleted. The two categories of padas are named
according to the set of affixes which is affixed to them. These are called sup suffixes and tiṅ
suffixes respectively, thus creating sup+anta = subanta and tiṅ+anta = tiṅanta padas.
Nominal inflection
Nominal inflection in Sanskrit includes all non-verb categories, i.e. subanta-padas.
Sanskrit subanta forms can be potentially very complex. They can include primary
(kṛdanta) and secondary (taddhitānta) derivatives, feminine forms (strīpratyayānta) and
compound nouns (samāsa). They can also include upasargas and avyayas etc. According to
Pāṇini, there are 21 morphological suffixes called sup suffixes to be attached to the nominal
bases (prātipadika).
39
George Cardona, 1988 Pāṇini, His Work and its Traditions, vol. i (Delhi: MLBD, 1988)
40
svaujasamauṭchaṣṭābhyāmbhisṅebhyāmbhyasṅasibhyāmbhyasṅasosāmṅyossup
41
supaḥ
42
dvyekayordvivacanaikavacane
43
arthavadadhāturapratyayaḥ prātipadikam 1/2/45; kṛttaddhitasamāsāśca 1/2/46
Verbal inflection
The tiṅ terminations are 18 in number. They are added to the verb root, which can be
primitive or derived. The tiṅ terminations are divided into 2 equal sets for the two padas:
parasmai and ātmane. The terminations in each set are again viewed as 3x3, i.e. a
combination of three persons (first, second and third) with three numbers.
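The arithmetic behind the 18 terminations can be checked by enumerating the grid. The sketch below also enumerates the 21 sup slots mentioned earlier, on the assumption (standard in Sanskrit grammar but not spelled out in the text) that they arise from seven cases and three numbers, the numbers being singular, dual and plural. Only the slots are enumerated, not the suffixes themselves, and the ordering is arbitrary rather than Pāṇini's own.

```python
from itertools import product

PADAS = ("parasmai", "ātmane")
PERSONS = ("first", "second", "third")
NUMBERS = ("singular", "dual", "plural")
CASES = range(1, 8)  # seven case slots, numbered here rather than named

# One slot per (pada, person, number) combination: 2 x 3 x 3.
TIN_SLOTS = list(product(PADAS, PERSONS, NUMBERS))

# One slot per (case, number) combination: 7 x 3.
SUP_SLOTS = list(product(CASES, NUMBERS))

print(len(TIN_SLOTS), len(SUP_SLOTS))
```

An analyzer can use such slot tuples directly as the feature bundle it returns for a recognized termination.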
1.2.2 Derivation
Sanskrit is rich in derivation. Both subanta padas and tiṅanta padas, before they are
inflected, generally undergo certain derivational operations. Derivation includes suffixes
which are added to both nominal stems and verbal roots to form new words, again
nominal or verbal. In other words, newly derived nominal words or verb roots can be
formed from nominal stems as well as from verbal stems. The suffixes are well categorized
with their meanings and environments.
Nominal Derivatives:
Nominal derivatives are the words formed by adding derivational suffixes to nominal
words. These new words can be nominal stems as well as verbal stems. The nominal
stems which are derived from nominal stems can be mainly categorized under
taddhitāntas and strīpratyayāntas. The nouns derived from verb roots are called kṛdantas.
These are derived by affixing kṛt suffixes to the verb roots.
The nominal forms formed from other nominal stems are called taddhitāntas, derived by
adding taddhita suffixes to nominal stems. The secondary derivative affixes are called
taddhita, which derive secondary nouns from prātipadikas. For example - dāśarathī,
gaua etc. Pāṇini has listed many taddhita suffixes some of which are- a, aka–c, ac, a–,
a, at, i han, īyasun, kan, hak, ha–, tamap, tarap, tayap, tal, tyap, tral, dvayasac, fak,
matup, mātrac, yak, yat, ya–, āc, kha, gha, cha, uraca, hak, ha–, han, na, ha, va, vatup
etc. For example, dākī, kva, aśvaka, viśvajanīnam, katriya, mālīya, raivatika,
dā ika, laghutama, gurutara, gārgyāyaa, iha, balavān etc.
Sanskrit also has eight feminine suffixes, ṭāp, cāp, ḍāp, ṅīṣ, ṅīn, ṅīp, ūṅ and ti, and the
words ending in these suffixes are called strīpratyayānta. For example - ajā, gaurī,
mūṣikā, indrāṇī, gopī, aṣṭādhyāyī, kurucarī, yuvatī, karabhorū etc.
The nominal morphology of Sanskrit distinguishes nouns and adjectives on the one hand
and pronouns on the other. There is no absolute distinction between nouns and adjectives
with respect to inflection, but modifiers take the number and gender of the terms they
qualify.
Pronouns are personal (first and second person), demonstrative (deictic), interrogative
and relative. Pronouns other than personal pronouns observe gender distinctions. There
are also endings particular to pronouns.
Sanskrit also has indeclinable terms, including particles such as the connective ca ‘and’
(sentence and nominal connective) and the negative particle na, as well as terms such as
yad, yadi ‘if’, tad, tarhi ‘then’, karhi ‘when?’, tataḥ ‘thence’, yataḥ ‘whence’, kutaḥ
‘whence?’, yadā ‘when’, tadā ‘then’, kadā ‘when?’, yatra ‘where’, tatra ‘there’, kutra ‘where?’,
derived from pronominals.
There are also preverbs, which regularly occur immediately preceding a verb or another
preverb (although in Vedic they can be separated from a verb or follow it), as well as pre-
and post-positional terms like adhi, anu, which co-occur with particular case-forms.
Indeclinable NPs
avyaya subanta-padas remain unchanged under all morphological conditions44.
According to Pāṇini [2.4.82]45, the affixes cāp, ṭāp, ḍāp (feminine suffixes) and sup are
deleted by luk when they occur after an avyaya. Pāṇini also gives definitions for identifying
certain word forms as avyayas, which include compound forms.
44
sadṛśaṃ triṣu liṅgeṣu sarvāsu ca vibhaktiṣu |
vacaneṣu ca sarveṣu yan na vyeti tad avyayam || - Gopatha Brāhmaṇa
45
avyayādāpsupaḥ 2/4/82
Verbal Derivatives:
Verbal derivatives may be nominal stems or verbal stems derived from verb roots or
stems. The nouns that are derived immediately from verbs comprehend a great variety of
terms. They fall into two principal classes:
• Adjectives (Attributives)
• Names (Substantives)
These different nouns are formed by affixing certain terminations to the verb root, which
is modified to a greater or lesser degree and then forms the inflective base. These
terminations are numerous. Most of them have a very limited application, while
others cover a large class of words. The primary derivatives are called kṛdanta.
The primary (kṛt) affixes are added to verbs to derive substantives, adjectives or
indeclinable kṛt forms. For example - paṭhitavyam, pātavya, paṭhanīya, pacelima, jeyam, deyam,
karttā, kumbhakāra, janamejaya, pāṭhaka, paṭhantī, gantum, khāditum, svapnam,
gati, gatvā, vihāya, ādāya etc.
1.2.3 Compounding
1.2.4 Sanskrit Verb-Morphology:
The verb forms are derived from verb-roots or dhātus. These dhātus are encoded with the
core meaning of the verb. They can be primitive46 or derived47. Primitive verb-roots,
which are around 2000 in number, are listed in a lexicon named ‘dhātupāṭha’.
They are divided into 10 groups/classes called gaṇas. All the verb-roots of a group undergo
a broadly similar inflectional process. Derived verb-roots may be derived from primitive
verb-roots or from nominal forms. Prefixes also play an important role as they can
change the meaning of a verb root. These roots then take various inflectional
suffixes that represent different paradigms. In this process, the base or root may also
change. The chart below gives an overview of Sanskrit verb roots.
Verb-roots
├ Primitive (2000 listed), divided in 10 gaṇas
└ Derived
   ├ Derived from verbs
   │  ├ Causal
   │  ├ Desiderative
   │  └ Frequentative
   └ Denominatives
Derived Verb-roots:
Derived from verb-roots:
1. Causals (ṇijanta)- Causals are formed by adding the ‘ṇic’ suffix to a primitive verb
root. They convey the sense that a person or thing causes another person or thing
to perform the action or to undergo the state denoted by the root.
46
bhūvādayo dhātavaḥ (Pāṇini 1/3/1)
2. Desideratives (sannanta)- Desiderative of a primitive verb root is formed by
adding ‘san’ affix to it. It conveys the sense that a person or thing wishes to
perform the action or is about to undergo the state indicated by the desiderative
form. Any basic verb-root or its causal base may have a desiderative form.
47
sanādyantā dhātavaḥ (Pāṇini 3/1/32)
Process of formation of Sanskrit verb forms:
A Sanskrit verb root may take various forms. There are ten lakāras that represent Tense,
Aspect and Mood. Inflectional terminations are 18 in number. These are divided into two
groups, Parasmaipada and Ātmanepada, each having 9 affixes, one for each combination of
3 persons x 3 numbers. A verb is conjugated in either pada, though some of the roots are
conjugated in both. For each lakāra, a root is affixed with these 9 terminations.
Again, there are three voices: Active, Passive and Impersonal. Transitive verbs are used
in the Active and Passive voices, while intransitive verbs are conjugated in the Active and
Impersonal voices. Addition of one or more of the 22 prefixes (upasargas) to verb roots can
result in a greater variety of forms. Derivative verb roots, whether derived from verb roots or
from nominal words, follow the same process to form verb forms. There can be some
specific rules and exceptions in some cases. The chart given here gives a rough estimate of
all the possible verb-forms of Sanskrit.
VR [2000]
├ Causal (ṇijanta)
├ Desiderative (sannanta)
└ Frequentative (yaṅanta/yaṅluganta)
+ One normal form
Denominatives
            ↓
TAM [10 lakāras]
            ↓
     ┌──────┴─────────────┐
Active Voice      Passive/Impersonal Voice
     ↓                    ↓
10x9(3x3) forms    10x9(3x3) forms
     ↓                    ↓
22(+) upasarga     22(+) upasarga
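Read as a counting recipe, the chart gives an upper bound on the number of finite verb forms. The short sketch below simply multiplies out the figures quoted in the chart. It assumes, purely for illustration, that every root takes every base, lakāra, voice group and prefix, which the text itself qualifies by noting specific rules and exceptions. Denominatives are left out because no count is given for them, only "no prefix or exactly one prefix" is counted, and all names in the code are illustrative.

```python
# Rough upper bound on Sanskrit finite verb forms, using the chart's figures.
# Assumption: every combination is possible; real roots do not satisfy this.

PRIMITIVE_ROOTS = 2000   # primitive roots listed in the dhātupāṭha (approximate)
BASES_PER_ROOT = 4       # normal form + causal + desiderative + frequentative
LAKARAS = 10             # tense/aspect/mood categories
PERSON_NUMBER = 3 * 3    # 3 persons x 3 numbers = 9 terminations
VOICE_GROUPS = 2         # active, and passive/impersonal (as in the chart)
UPASARGAS = 22           # prefixes; combinations of several prefixes are ignored


def unprefixed_forms() -> int:
    """Forms without any upasarga."""
    return (PRIMITIVE_ROOTS * BASES_PER_ROOT * LAKARAS
            * PERSON_NUMBER * VOICE_GROUPS)


def forms_with_single_prefix() -> int:
    """Forms with no prefix or exactly one of the 22 prefixes."""
    return unprefixed_forms() * (UPASARGAS + 1)


if __name__ == "__main__":
    print(f"without prefixes:      {unprefixed_forms():,}")
    print(f"with at most 1 prefix: {forms_with_single_prefix():,}")
```

Under these assumptions the count is 1,440,000 forms without prefixes and 33,120,000 with at most one prefix, which indicates the order of magnitude an analyzer has to cope with rather than an attested inventory.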
48
Mishra Sudhir K., Jha, Girish N., 2004, Identifying Verb Inflections in Sanskrit morphology, In proc.
of SIMPLE 04, IIT Kharagpur, pp. 79-81.
The verb roots of gaṇas 1, 4, 6 and 10 adopt certain terminations when tiṅ affixes are
added to them. Consequently, the verb roots of these classes form a base ending in ‘a’. The
tiṅ affixation also influences the verb root, which undergoes several morpho-phonemic
changes; for example, its final vowel takes guṇa. The verb root can undergo certain further
operations resulting in the final verb-form.
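As an illustration of the guṇa change and of the base ending in ‘a’, the following sketch forms the present base of a few vowel-final gaṇa 1 roots and attaches the third person singular parasmaipada ending -ti. It is a deliberately narrow example and not a statement of Pāṇini's procedure: the function names, the use of IAST transliteration and the restriction to roots ending in i, ī, u, ū or ṛ are assumptions made for this sketch.

```python
import unicodedata


def _nfc(text: str) -> str:
    """Normalize so that letters such as 'ū' are single code points."""
    return unicodedata.normalize("NFC", text)


# Guṇa substitute for a root-final vowel (IAST transliteration).
GUNA = {_nfc(k): v for k, v in
        {"i": "e", "ī": "e", "u": "o", "ū": "o", "ṛ": "ar"}.items()}

# Shape of the guṇa vowel before the class vowel 'a': e + a -> aya, o + a -> ava.
BEFORE_A = {"e": "ay", "o": "av", "ar": "ar"}


def present_base(root: str) -> str:
    """Return the base ending in 'a' for a vowel-final gaṇa 1 root."""
    root = _nfc(root)
    final = root[-1]
    if final not in GUNA:
        raise ValueError(f"unsupported root ending in this sketch: {root!r}")
    return root[:-1] + BEFORE_A[GUNA[final]] + "a"


def third_singular_present(root: str) -> str:
    """Attach the parasmaipada third person singular ending -ti."""
    return present_base(root) + "ti"


if __name__ == "__main__":
    for r in ["bhū", "nī", "ji", "hṛ"]:
        print(r, present_base(r), third_singular_present(r))
```

For the roots bhū, nī, ji and hṛ this yields the bases bhava, naya, jaya and hara. Consonant-final roots and the other three classes need further rules that are not modelled here.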
C-DAC
49
Deśika (Natural Language Understanding System), http://tdil.mit.gov.in/download/Desika.htm
50
C-DAC R&D Activities: Development of Desika A Natural language Understanding (NLU) system for
Sanskrit. [1990 – 1994], http://www.cdac.in/html/ihg/activity.asp, accessed on July 16, 2006
51
C-DAC, R&D Activities: Developing Sanskrit Authoring System (VYASA),
http://www.cdac.in/html/ihg/activity.asp and The House Magazine of C-DAC, Pen to Paper Developing
Sanskrit Authoring System – VYASA, http://www.cdac.in/html/connect/3q2000/art10a.htm
searching facilities, indexing, and concordance. It also claims to provide tools for
analyses at morphological, syntactic and semantic levels. Tools for
searching/indexing/sorting, lexical updating, lexical tagging, extraction/indexing of
quotations in commentaries/explanations, transliteration, word splitting for
sandhi and samāsa, poetry analysis (textual/metric/statistical), statistical tools like
concordance, thesauri, and electronic dictionaries are also included. However, the system
is not available anywhere for evaluation or checking.
52
Language Processing Tools: TDIL website, http://tdil.mit.gov.in/nlptools/ach-nlptools.htm
navīna nyāya system), Śemuṣī (subanta generator/analyser), Prajñā (tiṅanta
generator/analyser), Cetanā (kṛdanta generator/analyser), Pāṇini (sandhi joiner
according to Pāṇinian rules) etc. which are to be released.53
RSV Tirupati has put much effort into developing linguistic resources for NLP in Sanskrit.
Prof. K.V. Ramakrishnamacharyulu (presently V.C. of Rashtriya Sanskrit Sansthan,
Jaipur) and Dr. Srinivasa Varkhedi, along with Prof. Vineet Chaitanya and Amba P.
Kulkarni, have initiated many projects and developed many tools. One of these is pada-ccheda,
which splits Sanskrit compound words into their components and works on Sanskrit
ISCII text in a Linux environment. RSV Tirupati is also developing a Sanskrit morphological
analyzer; an initial analyzer developed in collaboration with IIIT-H is already online.
Apart from this, it is working on kṛdanta and tiṅanta analyzers and on
generators for subanta, tiṅanta and samāsa. RSV Tirupati, together with C-DAC Bangalore,
Ahobila Mutt Sanskrit College Madhurantakam Tamil Nadu, Poorna Prajna
Samshodhana Mandiram Bangalore, Chinmaya International Foundation Veliyanad
Kerala, ASR Melkote Karnataka, IIIT-H and the Dept. of Sanskrit, H.S. Gour University, Sagar,
Madhya Pradesh, has a joint initiative to develop a large Sanskrit corpus (presently
this is not online).54
Resource Centre for Indian Language Technology Solutions (RCILTS) – Oriya Centre at
the Department of Computer Science and Application, Utkal University, led by Prof.
Sangamitra Mohanty and funded by Ministry of Information Technology (MIT), has been
working on Oriya-English-Hindi Trilingual Dictionary, Oriya and Sanskrit WordNet,
Trilingual Word Processor, Grammar and Spell checkers and MorphAs for Oriya and
53
Academy of Sanskrit Research, Melkote, http://www.sanskritacademy.org/About.htm
54
RSV Tirupati, http://rsvidyapeetha.ac.in and http://www.sansknet.org
also English to Oriya Machine Translation System. It has received IPR for four of its
products from the Ministry of Human Resource Development (MHRD), Govt. of India.
Dr. Gerard Huet, Director, INRIA, has developed various computational tools for
Sanskrit which are available online56. The Declension Engine generates all the nominal
inflectional forms, together with the ‘compound base’, for a query word given with its gender
information. The Conjugation
Engine gives all the possible forms of a verb root with its ātmane and/or parasmai
terminations, in kartari and karmaṇi/bhāve voices, with its Desiderative forms, in eight
lakāra-s. It also gives a few participle and indeclinable kṛt forms57. He claims that from
535 roots, his engine generates 14554 noun forms, 113935 root verbal forms, 203281 root
participial forms, 14311 iic and periphrastic forms, and 737 indeclinable forms, totaling
roughly half a million forms58. The Lemmatizer and the Sanskrit Reader are the analyzers.
While the lemmatizer tries to tag a given simple inflected noun or verb (without
upasarga-s), the Sanskrit Reader Companion analyses a given phrase or a simple
sentence, segments it into individual words, tags each word and parses the input. Modular
transducers are applied to constrain the lexical analyzer to recognize the stream of forms
as a regular expression over 14 phases, specifying the geometry of Sanskrit morphology, to get the
55
RC-ILTS – Computer Science & Application, Utkal University, http://www.ilts-utkal.org/sanskrit.htm
56
The Sanskrit Heritage Site, Huet, Gerard. http://sanskrit.inria.fr/
57
Ibid.
right composition of compound chunks. Overgeneration of segmentations and ambiguity
of tags are checked by semantic role analysis similar to Pāṇini’s kāraka theory and also
by governance patterns of verbs.59
The Sanskrit Library Project, directed by Dr. Peter M. Scharf, Classics Dept., Brown
University, has developed a web-based reading room holding Kramapāṭha, a Sanskrit
independent-study reader on texts like Pañcatantra, Rāmopākhyāna and Pāṇini’s
Aṣṭādhyāyī. The reader gives the text in Devanagari with Roman transliteration, sandhi
analysis, a detailed grammatical analysis and lexical analysis of the text, with notes and
translation. As part of independent-study texts, a detailed classification of grammatical
categories (tag set) is made.
A nominal inflection generator tool is also available which generates nominal inflections
for a given word with gender and class information.60
IIIT Hyderabad
Language Technology and Research Centre (LTRC), IIIT-H, is one of the leading
NLP centres in India. IIT Kanpur and later LTRC, IIIT-H, have been pioneers in the
field of language technology and in initiating the Pāṇinian approach to NLP in India. LTRC
has several ongoing activities with Govt. of India, Carnegie Mellon University's
Language Technology Institute, UPENN, HP Labs, Google, Nokia and TCS besides
several academic institutions in India. Prof. Rajeev Sangal and Prof. Vineet Chaitanya
with the Akshar Bharati61 group have developed many NLP tools for Indian languages
like MorphA-s (morphological analyzers), Anusaraka-s62 (language access tool), Shakti
58
Huet, Gerard, 2006, Parsing Sanskrit by Computer, (abstract) in the Proceedings of the 13th World
Sanskrit Conference, Edinburgh, UK
59
Ibid.
60
The Sanskrit Library Reading Room - http://cgi-user.brown.edu/cgi-user/sanskrit/login
61
Akshar Bharati, is a personification of groups (@ IITK, IIIT-H, University of Hyderabad, etc.) working
on NLP with special emphasis to Indian Languages giving due attention to Indian theories of grammar and
language.
62
Anusaaraka or a Language Accessor is a computer software which renders text from one Indian language
into another (Kannada-Hindi, Marathi-Hindi, Punjabi-Hindi and Telugu-Hindi). It produces output not in
Machine translation System. MorphA-s for Hindi, Telugu, Marathi, Kannada and Punjabi
were developed as part of the Anusaraka-s; they are now available online
and can also be downloaded from their website. The Telugu MorphA is said to have coverage
of 95% for any arbitrary modern standard Telugu text, and the Hindi MorphA has 88%
coverage.63 MorphA-s for these languages were developed based upon the paradigm
model.64 For a given word the MorphA gives the root word with its feature information
like gender, number, person, tense etc.65
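The paradigm model can be pictured as a table of endings shared by all words of one inflection class, plus a root lexicon that assigns each root to a paradigm. The sketch below only illustrates that idea and is not the IIIT-H implementation: the two-entry lexicon, the paradigm name `a_masc` and the handful of endings for masculine a-stems are assumptions chosen for brevity.

```python
from typing import Dict, List, NamedTuple, Tuple


class Analysis(NamedTuple):
    root: str
    case: str
    number: str
    gender: str


# Hypothetical paradigm table: (ending replacing the stem-final 'a', case, number).
# Only a few cells of the masculine a-stem paradigm are listed.
PARADIGMS: Dict[str, List[Tuple[str, str, str]]] = {
    "a_masc": [
        ("aḥ", "nominative", "singular"),
        ("au", "nominative", "dual"),
        ("am", "accusative", "singular"),
        ("au", "accusative", "dual"),
        ("asya", "genitive", "singular"),
        ("e", "locative", "singular"),
    ],
}

# Hypothetical root lexicon: root -> (paradigm name, gender).
LEXICON: Dict[str, Tuple[str, str]] = {
    "deva": ("a_masc", "masculine"),
    "nara": ("a_masc", "masculine"),
}


def analyse(word: str) -> List[Analysis]:
    """Return every root + feature reading that regenerates the input word."""
    results = []
    for root, (paradigm, gender) in LEXICON.items():
        stem = root[:-1]  # drop the stem-final 'a'
        for ending, case, number in PARADIGMS[paradigm]:
            if word == stem + ending:
                results.append(Analysis(root, case, number, gender))
    return results


if __name__ == "__main__":
    print(analyse("devau"))    # ambiguous: nominative and accusative dual
    print(analyse("narasya"))
```

An ambiguous form such as devau comes back with both readings, which matches the behaviour of an analyzer that reports all possible answers and leaves disambiguation to a later stage.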
Sanskrit MorphA was developed by Vinish Jain, an M.Tech student of IIIT-H, under the
guidance of Amba P. Kulkarni, as a further development of an earlier working
Sanskrit MorphA developed at ASR Melkote. This MorphA was developed using the
paradigm approach on the model of the other MorphA-s at IIIT-H and used a lexicon
derived from Monier Williams' dictionary. It could only handle subanta-s. Later, in
collaboration with RSVP Tirupati, the present Sanskrit MorphA66 was developed by
Amba P. Kulkarni and V. Sheeba. This MorphA has separate modules to handle subanta,
tiṅanta, kṛdanta and samāsa. Each word is passed through all four modules and all
possible answers are produced. The subanta module contains 222 paradigms for nouns and
pronouns with a root dictionary of around 1.5 M words extracted from Monier Williams'
dictionary. The tiṅanta module has a database of ~10 M verb forms (no prefixed verbs).
The kṛdanta module can handle ~42 K of 20 types of kṛdanta. This approach has the
advantage that it can be developed further without disturbing the main program, as the program is
independent of the data and more modules can be added.67 This MorphA can handle
only sandhi-free text and cannot filter the multiple tags to disambiguate a
word in a given context.
the target language but close to it, which is comprehensible to the reader who is trained to read the
output.
63
Morphological Analysers – IIIT Hyderabad – http://www.iiit.net/ltrc/morph/morph_analyser.html
64
Bharati, Akshar et al, 1999, Natural Language Processing: A Paninian Perspective, Prentice Hall Pvt.
Ltd. New Delhi, pp.39-43.
65
Ibid.
66
RSVP, http://rsvidyapeetha.ac.in/~anusaraka (accessed: 13.10.2006)
Jawaharlal Nehru University (JNU)
The RCILTS – Sanskrit, Japanese, Chinese unit of JNU, under the leadership of Prof.
G.V.Singh, claims to have developed a web-based Sanskrit Language Learning System for
the use of scholars in designing knowledge-based systems based on the Indian traditions.
The unit has developed a computational module of the Aṣṭādhyāyī of Pāṇini, a Sanskrit-English
lexicon, an English-Sanskrit lexicon and a lexicon of Nyāya terms. It also reports
some work on a sandhi analysis system.68
Girish Nath Jha69 developed a Nominal Inflection Generator for Sanskrit for his M. Phil.
dissertation. The program, written in Prolog, generates all the subanta inflections of
a Sanskrit word given with its gender and ending-letter information.
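The input and output of such a generator can be sketched as a table lookup keyed on gender and ending letter. The example below is written in Python rather than Prolog and is not the program described above: the `decline` function, the table layout and the choice to cover only masculine stems ending in ‘a’ are assumptions made for illustration, and retroflexion of n (as in rāmeṇa) is not handled.

```python
from typing import Dict, List, Tuple

CASES = ["nominative", "accusative", "instrumental", "dative",
         "ablative", "genitive", "locative", "vocative"]

# Endings replace the stem-final vowel; each row is (singular, dual, plural).
ENDINGS: Dict[Tuple[str, str], List[Tuple[str, str, str]]] = {
    ("masculine", "a"): [
        ("aḥ", "au", "āḥ"),
        ("am", "au", "ān"),
        ("ena", "ābhyām", "aiḥ"),
        ("āya", "ābhyām", "ebhyaḥ"),
        ("āt", "ābhyām", "ebhyaḥ"),
        ("asya", "ayoḥ", "ānām"),
        ("e", "ayoḥ", "eṣu"),
        ("a", "au", "āḥ"),
    ],
}


def decline(word: str, gender: str, ending: str) -> Dict[str, Tuple[str, ...]]:
    """Generate the 8 x 3 subanta forms for a supported gender/ending class."""
    if not word.endswith(ending):
        raise ValueError(f"{word!r} does not end in {ending!r}")
    table = ENDINGS.get((gender, ending))
    if table is None:
        raise NotImplementedError(f"no paradigm for {gender} stems in {ending!r}")
    stem = word[: -len(ending)]
    return {case: tuple(stem + e for e in row) for case, row in zip(CASES, table)}


if __name__ == "__main__":
    for case, forms in decline("deva", "masculine", "a").items():
        print(f"{case:13s} {' '.join(forms)}")
```

Supporting another class, such as feminine stems in ā, would mean adding one more entry to the table while leaving the function unchanged.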
Special Centre for Sanskrit Studies, JNU, established in 2002, has undertaken the task of
developing NLP tools for Sanskrit. Under the guidance of Dr. Girish Nath Jha, the
students have developed several useful programs for computational processing of Sanskrit
texts. A list of works is given below:
The NLP team at the centre has developed an Online Multilingual Amarakosha (OMA)
under a project funded by UGC under the UPOE program. It is Unicode-based software
which supports seven languages (Sanskrit, Hindi, Kannada, Punjabi, Bangla, Oriya and
English) and allows the user to look up synonyms of a word from one language in another. The
output displays the grammatical and semantic category of the word, its base word,
reference and ontological information. The software also provides the facility to enter and
67
Bharati Akshar, Amba Kulkarni, V Sheeba, “Building a Wide Coverage Sanskrit Morphological
Analyzer: A Practical Approach”, in the Proceedings of First National Symposium on Modelling and
Shallow Parsing of Indian Languages, 2-4 April 2006, IIT Bombay, Mumbai
68
RCILTS, JNU – Achievements: http://tdil.mit.gov.in/SanskritJapaneseChinese-JNUJuly03.pdf (accessed
15.10.2006)
69
Jha, Girish Nath, 1993, ‘Morphology of Sanskrit Case Affixes: A Computational Analysis’, M.Phil.,
submitted to JNU, New Delhi.
edit the data by language experts. The software is expected to be extended as a
multilingual interface, search engine and text processing tool in the near future.70
R. Chandrashekhara71 developed tag-sets for POS tagging of Sanskrit text as part of his
Ph.D. thesis. The tagger (POST), an online system running on the Apache Tomcat
platform using Java Servlets, takes sandhi-free classical Sanskrit prose text as input and
provides the tagged text as output. The system is important for further R&D on
Sanskrit-Indian Languages Machine Translation Systems (MTS).
Sudhir Kumar Mishra72 has recently completed his Ph.D. research on a Kāraka Analyzer
for Laukika Sanskrit prose text based on Pāṇini’s kāraka formulations. Kāraka rules are
of central importance in Sanskrit syntactic structure. As part of this research work he also
worked on identification of verb inflections in Sanskrit morphology73.
Subash Chandra74 developed a Sanskrit Subanta Recognizer and Analyser System (SRAS)
for his M.Phil. dissertation. SRAS is also an online system running on the Apache Tomcat
platform, using Java Servlets with MS SQL Server 2005 as the back end. The system has been
developed according to the Pāṇinian formulation, accepts only non-joint (sandhi-rahita)
Sanskrit text in Devanāgarī script and depends on a rule base, an example base
and a database of other linguistic resources. It has been tested on some
selected corpora and is reported to give an average accuracy of 91.65%.
Research work is also being done on learning the Sanskrit language using an e-learning
approach75, a sandhi analyzer applying Pāṇinian and some heuristic rules76, online
70
System is accessible @ http://sanskrit.jnu.ac.in
71
Chandrashekhar, R, 2006, ‘POS Tagging for Sanskrit’, submitted for Ph.D degree at SCSS, JNU.
72
Mishra, Sudhir Kumar, 2007, ‘Sanskrit Karaka Analyzer for Machine Translation’, submitted for Ph.D.
degree at SCSS, JNU
73
Mishra, Sudhir Kumar & Girish Nath Jha, 2005, ‘Identifying Verb Inflections in Sanskrit Morphology’,
in the proceedings of SIMPLE-05, IIT-Kharagpur, pp.79-81
74
Chandra, Subash, 2006, ‘Machine Recognition and Morphological Analysis of Subanta-padas’, submitted
for M.Phil degree at SCSS, JNU
75
Bhowmik, Preeti & Jha, Girish Nath, 2006, ‘Sanskrit Language Pedagogy: an e-learning approach’, In
the Souvenir Abstracts of 28th AICL, BHU, Varanasi, p.150.
76
Kumar, Sachin & Jha, Girish Nath, 2006, ‘Issues in sandhi processing of Sanskrit’, In the Souvenir
Abstracts of 28th AICL, BHU, Varanasi, p.129.
indexing of Ādiparva of Mahābhārata77, computational analysis of Sanskrit gender and
analysis of derived nouns in Sanskrit.78
77
Mani, Diwakar, & Jha, Girish Nath, 2006, ‘Online indexing of Ādiparva of Mahābhārata’, In the
Souvenir Abstracts of 28th AICL, BHU, Varanasi, p. 125.
78
Singh, Surjit Kumar & Jha, Girish Nath, 2006, ‘Strategies for Identifying and Processing Derived Nouns
in Sanskrit’, In the Souvenir Abstracts of 28th AICL, BHU, Varanasi, p. 131.
79
Subramanian, Aparna, Jan 2005, “Sanskrit to English Translator”, Language in India, Vol. 5:1.