A Computational Grammar of Sinhala

Conference Paper · March 2012


DOI: 10.1007/978-3-642-28604-9_16

Chamila Liyanage1, Randil Pushpananda1, Dulip Lakmal Herath2, and Ruvan Weerasinghe1
1,2 University of Colombo School of Computing, 35, Reid Avenue, Colombo 00700, Sri Lanka
{cml,rpn,arw}@ucsc.lk, [email protected]

Abstract. A computational grammar for a language is a very useful resource for carrying out language processing tasks for that language, such as grammar checking, machine translation and question answering. As is the case in most South Indian languages, Sinhala is a highly inflected language with three gender forms and two number forms, among other grammatical features. While piecemeal descriptions of Sinhala grammar are reported in the literature, no comprehensive effort has been made to develop a context-free grammar (CFG) that accounts for any significant coverage of the language. This paper describes the development of a feature-based CFG for non-trivial sentences in Sinhala. The resulting grammar covers a significant subset of Sinhala as described in a well-known grammar book. A parser that produces the appropriate parse tree(s) for input sentences was also developed using NLTK. The grammar also detects, and so rejects, ungrammatical sentences. Two hundred sample sentences taken from primary grade Sinhala grammar books were used to test the grammar. The grammar covered 60% of these sentences.

Keywords: Natural Language Processing, Context Free Grammar, Sinhala Grammar, Computational Grammar.

1 Introduction

Sinhala is an official language of Sri Lanka and is spoken by a majority of Sri Lankans, nearly 70% of the population [7]. Historically, Sinhala is a modern Indo-Aryan language related to the Vedic language, or Old Sanskrit, of India [10]. Modern Sinhala has subsequently been enriched through its contact with Tamil, English, Portuguese, and Dutch [8]. There are two main varieties of Sinhala based on usage, the literary and the spoken, which differ from each other in important ways [5]. In addition, Sinhala has an alphasyllabary writing system, also called an abugida: a segmental writing system in which consonant-vowel sequences are written as single units [18].
Natural Language Processing (NLP) is an area of research that explores how computers can be used to understand and manipulate natural languages [16]. Currently there are many NLP research areas for Sinhala, including speech processing, machine translation, information retrieval and text summarization. A computational grammar for Sinhala can benefit such efforts. In this paper we therefore report work on developing a feature-based context-free grammar for Sinhala using the open source Natural Language Toolkit, NLTK [3].

A. Gelbukh (Ed.): CICLing 2012, Part I, LNCS 7181, pp. 188–200, 2012.
© Springer-Verlag Berlin Heidelberg 2012

2 Related Work

Very little research has been reported in the literature on efforts to develop a formal grammar for Sinhala. This section gives a brief survey of grammar development reported for Indic languages, including Sinhala.
Hettige and Karunananda have implemented a computational model of grammar
for Sinhala [9]. Morphological and syntactic analysis of Sinhala has been considered
in this work, which is modeled using a Finite State Transducer (FST) and a Context-
Free grammar. Developed as part of a Machine Translation system, the parser in this
system handles only simple sentences containing 8 constituents, namely, Attributive
adjunct of Subject, Subject, Attributive adjunct of Object, Object, Attributive adjunct
of Predicate, Attributive adjunct of the complement of predicate, Complement of
predicate and Predicate.
In research carried out for the Kannada language, Sagar et al. modeled noun phrase and verb phrase agreement in Kannada sentences [17]. They classified noun phrases into three subcategories (adjective noun, noun and pronoun), but considered only gender and number as features of the grammar. As in Sinhala, Kannada verbs need to agree with the subject of the sentence in number and gender. The suffix of the verb is therefore extracted to check for masculine, feminine and plural verb endings. They used a context-free grammar (CFG) to write the grammar rules and Python as the programming language. A Recursive Descent Parser, a simple top-down parser from NLTK, was used to test the grammar. This work is limited to resolving noun-verb agreement and indicating whether a sentence is syntactically acceptable or not.
In another study, Sagar et al. describe the process of generating a context-free grammar for simple Kannada sentences [16]. Here they checked the sentences with both a top-down (Recursive Descent) parser and a bottom-up (Shift-Reduce) parser. According to the authors, two kinds of conflict, shift-reduce and reduce-reduce, occurred when the sentences were parsed using the bottom-up parser. The top-down parser was therefore selected as the more suitable parser for the given sentences.
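To make the top-down strategy mentioned above concrete, the following is a minimal sketch of a recursive descent recognizer in plain Python. It is not NLTK's implementation and it does not reproduce the Kannada grammars of [16] or [17]; the toy rules and the tag names (S, NP, VP, Det, N, V) are assumptions chosen only for illustration.

```python
# Toy verb-final grammar (an assumption for illustration, not from the cited work).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["Det", "N"]],
    "VP": [["V"], ["NP", "V"]],
}


def expand(symbol, tags, pos):
    """Yield every position at which `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:
        # Terminal: a part-of-speech tag that must match the input directly.
        if pos < len(tags) and tags[pos] == symbol:
            yield pos + 1
        return
    # Non-terminal: try each production in turn (top-down, with backtracking).
    for rhs in GRAMMAR[symbol]:
        yield from match_sequence(rhs, tags, pos)


def match_sequence(rhs, tags, pos):
    """Yield end positions for matching the symbols in `rhs` left to right."""
    if not rhs:
        yield pos
        return
    for middle in expand(rhs[0], tags, pos):
        yield from match_sequence(rhs[1:], tags, middle)


def accepts(tags):
    """Return True if the whole tag sequence can be derived from S."""
    return any(end == len(tags) for end in expand("S", tags, 0))
```

Because the recognizer starts from S and predicts downwards, it never faces a shift-reduce decision; the price is that it would loop on left-recursive rules, which the toy grammar avoids.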
Mosaddeque and Haque have proposed a way of producing a context-free grammar for Bangla [2]. They report that only sentences of seven to eight words in length were used for testing. They took 10 ad hoc sentences from a newspaper article as the basis for designing the grammar, tagged all the words in these sentences with their respective part-of-speech (POS) tags, and used NLTK’s Shift-Reduce Parser to test the grammar. Only one of these ten sentences was parsed successfully.
Naira Khan and Mumit Khan have implemented a computational grammar for Bengali using the Head-Driven Phrase Structure Grammar (HPSG) formalism [14]. The Linguistic Knowledge Building (LKB) system, which allows the user to build a parser along with a generator, was used to implement this grammar. Their paper gives a set of instructions for using the HPSG formalism to parse and to generate grammatical sentences of Bengali.

3 Structure of Sinhala
Sinhala is a free word order language. Its unmarked word order is SOV; variant orders are also possible, with discourse-pragmatic effects. With proper intonation, a sentence can have all the possible orders of its main constituents [11]. Figure 1 shows the verb-final word order variants of the Sinhala equivalent of the English sentence “Father hit the younger brother with a stick”.

i. තාත්තා | මල්ලීට | ෙකෝටුවකින් | ගැසුෙව් ය.
t̪a:t̪t̪a: | malli:ʈǝ | ko:ʈuvǝkin | gæsuve:yǝ
Father | to the younger brother | with a stick | hit
ii. තාත්තා | ෙකොටුවකින් | මල්ලීට | ගැසුෙව් ය.
t̪a:t̪t̪a: | ko:ʈuvǝkin | malli:ʈǝ | gæsuve:yǝ
Father | with a stick | to the younger brother | hit
iii. මල්ලීට | තාත්තා | ෙකෝටුවකින් | ගැසුෙව් ය.
malli:ʈǝ | t̪a:t̪t̪a: | ko:ʈuvǝkin | gæsuve:yǝ
to the younger brother | Father | with a stick | hit
iv. මල්ලීට | ෙකෝටුවකින් | තාත්තා | ගැසුෙව් ය.
malli:ʈǝ | ko:ʈuvǝkin | t̪a:t̪t̪a: | gæsuve:yǝ
to the younger brother | with a stick | Father | hit
v. ෙකෝටුවකින් | තාත්තා | මල්ලීට | ගැසුෙව් ය.
ko:ʈuvǝkin | t̪a:t̪t̪a: | malli:ʈǝ | gæsuve:yǝ
with a stick | Father | to the younger brother | hit
vi. ෙකෝටුවකින් | මල්ලීට | තාත්තා | ගැසුෙව් ය.
ko:ʈuvǝkin | malli:ʈǝ | t̪a:t̪t̪a: | gæsuve:yǝ
with a stick | to the younger brother | Father | hit
Fig. 1. Free word order in Sinhala
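The six orders in Figure 1 are exactly the permutations of the three pre-verbal constituents with the verb kept in final position. The short sketch below enumerates them from the English glosses; the function name and the use of glosses rather than Sinhala tokens are choices made here for illustration only.

```python
from itertools import permutations

# English glosses of the three pre-verbal constituents in Figure 1.
PRE_VERBAL = ["Father", "to the younger brother", "with a stick"]
VERB = "hit"


def verb_final_orders(constituents, verb):
    """Return every ordering of the constituents, each followed by the verb."""
    return [list(order) + [verb] for order in permutations(constituents)]


if __name__ == "__main__":
    for number, order in enumerate(verb_final_orders(PRE_VERBAL, VERB), start=1):
        print(number, " | ".join(order))
```

With three constituents there are 3! = 6 orders, which is why a CFG that lists one production per order grows quickly as more constituents are allowed.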

Sinhala is a head-final language, in which complements and modifiers appear before their heads [11].

(NP) ගෙම් මිනිස්සු
/game: minissu/
Village-GENITIVE people
‘People of the village’

(ADJP) ෙබොෙහොම ලස්සන
/bohomǝ lassǝnǝ/
Much beautiful
‘Very beautiful’

(VP) ෙසමින් කියවයි
/semin kiyǝvayi/
Slowly Read-non past/3rd person singular
‘Read slowly’

Traditionally, a sentence is divided into two parts: Noun Phrase (NP) and Verb Phrase (VP). In Sinhala grammar, uktha (subject) and akyatha (predicate) are the two parts of a sentence. Subject and predicate in Sinhala sentences agree in number, gender and person [12].
The sentence structures of Sinhala have been studied by a number of scholars [8] [1] [4]. According to Abhayasinghe [1], Sinhala has 25 types of simple sentence structures. In the present work, however, we cover only the main sentence structures and a few complex structures. These are described in the following sections.

3.1 Noun Phrase


The noun phrase, denoted by NP, can be headed by a common noun (N), a pronoun (PrN) or a proper noun (PropN). In addition to the head noun, the Sinhala noun phrase may contain adjectival phrases and determiners. The Sinhala NP has a very complex grammatical structure: it can contain various clause structures, such as adjectival clauses, relative clauses and subordinate clauses. Building a computational grammar that covers all NP structures is therefore complex. Figure 2 below shows the NP structure covered by the grammar developed in this work.
An adjectival phrase (ADJP) is built from adjectives. In Sinhala, an adjectival phrase comes before the noun (N) and after the determiner (Det), if the noun phrase has one. A qualitative adjective can be combined with a degree word (Deg) to intensify its meaning.

Det Deg ADJ N

Fig. 2. Structure of the NP in Sinhala
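The linear order Det, Deg, ADJ, N can be checked mechanically over a sequence of part-of-speech tags. The sketch below does this with a regular expression. It rests on assumptions that go slightly beyond the figure: the adjective group is allowed to repeat, and pronouns and proper nouns are treated as bare heads without modifiers.

```python
import re

# Either a bare pronoun or proper noun, or an optional determiner followed by
# zero or more (optionally degree-modified) adjectives and a common noun.
NP_PATTERN = re.compile(r"PrN|PropN|(Det )?((Deg )?ADJ )*N")


def is_noun_phrase(tags):
    """Return True if the tag sequence has the NP shape sketched in Fig. 2."""
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None
```

For example, `is_noun_phrase(["Det", "Deg", "ADJ", "N"])` holds, while a sequence with the adjective before the determiner, such as `["ADJ", "Det", "N"]`, is rejected.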

In the traditional grammar of Sinhala, nama visheshana (adjectives) denote some quality or attribute of the noun. They can be divided into three classes: qualitative, quantitative and demonstrative [8]. However, our grammar does not consider the features of the ADJP. Words that denote the degree of an adjective are added only before qualitative adjectives. For example, the ADJP ‘ඉතා ෙහොඳ’ /it̪a: hond̪ǝ/ (very good) consists of a degree word ‘ඉතා’ /it̪a:/ (very), followed by a qualitative adjective ‘ෙහොඳ’ /hond̪ǝ/ (good).

3.2 Verb Phrase


According to generative grammar, a Verb Phrase (VP) is a phrase headed by a verb. In addition to the verb, it may contain noun phrases and adverbial phrases (ADVP). Verbs in Sinhala can be categorized as single verbs, compound verbs and auxiliary verbs; in this grammar we consider only single verbs. An ADVP generally occurs before the verb in Sinhala sentences, although its exact position depends on the features of the adverb. Figure 3 below shows the VP structure of Sinhala that is covered by the grammar. In this structure the verb appears in final position, and an ADVP may appear before the verb and after the NP. If the ADVP contains an adverb of manner, the adverb can be combined with a degree word to intensify its meaning. For example, ‘ඉතා ෙව්ගෙයන්’ /it̪a: ve:gǝyen/ (very fast) is an ADVP containing an adverb of manner.

NP ADVP Deg ADV V

Fig. 3. Structure of the VP in Sinhala

4 Grammatical Features of Sinhala

In Sinhala, the noun is inflected for number, gender, person, case and definiteness. The verb is inflected for number, gender, person, tense and volition. Subject and predicate agree in number, gender and person.

4.1 Grammatical Features of the NP

The words that are marked for grammatical features linga (gender), vachana
(number), niyatha-aniyatha (definiteness) and vibhakthi (case), are recognized as
nouns in Sinhala [12]. Therefore in developing the grammar, we consider the features
of number, gender, case and definiteness for common nouns; number, gender and case
for proper nouns; and number, gender, case and person for pronouns.
Sinhala is a highly inflected language, and its common nouns are inflected for number, definiteness and case. Sinhala nouns are also divided into animate and inanimate classes on the basis of their inflection. Animate nouns inflect for number (singular and plural), definiteness (definite and indefinite) and five cases (nominative, accusative, dative, genitive and instrumental). The definiteness distinction applies only to the singular forms of nouns [6].
As enumerated in Table 1, Sinhala nouns, which are inflected for five cases, have five forms. However, the same form may serve several cases. For example, form 5 can occur in the instrumental, ablative and auxiliary cases. We have therefore defined five cases, one per form, in this grammar specification. All inflections of animate nouns are shown in the table.

Table 1. Examples for inflections of animate common nouns

Form | Case | Singular masculine, def. | Singular masculine, indef. | Singular feminine, def. | Singular feminine, indef. | Plural
1 | Nominative | /min̪isa:/ (the man) | /min̪isek/ (a man) | /kella/ (the girl) | /kellak/, /kellek/ (a girl) | /min̪issu/ (men)
2 | Accusative | /min̪isa:/ (the man) | /min̪iseku/ (a man) | /kella/ (the girl) | /kellakǝ/ (a girl) | /min̪isun/ (men)
3 | Dative | /min̪isa:ʈǝ/ (to the man) | /min̪isekuʈǝ/ (to a man) | /kellaʈǝ/ (to the girl) | /kellekuʈǝ/, /kellakǝʈǝ/ (to a girl) | /min̪isunʈǝ/ (to men)
4 | Genitive | /min̪isa:ge:/ (the man’s) | /min̪isekuge:/ (a man’s) | /kellage:/ (the girl’s) | /kellakǝge:/ (a girl’s) | /min̪isunge:/ (men’s)
5 | Instrumental | /min̪isa:gen̪/ (from the man) | /min̪isekugen̪/ (from a man) | /kellagen̪/ (from the girl) | /kellakǝgen̪/ (from a girl) | /min̪isun̪gen̪/ (from men)

Sinhala inanimate nouns are inflected similarly to the animate nouns for number
and definiteness. However they only have four cases – direct, dative, genitive and
instrumental [6]. Forms 3, 4, and 5 in Table 2 are similar to those in Table 1. However
form 1 accounts for the direct case. Table 2 shows all the inflections for an inanimate
noun as covered in the grammar developed.

Table 2. Examples for inflections of inanimate common nouns

Form | Case | Singular, definite | Singular, indefinite | Plural
1 | Direct | /gasǝ/ (the tree) | /gasak/ (a tree) | /gas/ (trees)
3 | Dative | /gasǝʈǝ/ (to the tree) | /gasǝkǝʈǝ/ (to a tree) | /gaswǝlǝʈǝ/ (to the trees)
4 | Genitive | /gase:/, /gasehi/ (on the tree) | /gasǝkǝ/ (on a tree) | /gaswǝlǝ/ (on the trees)
5 | Instrumental | /gasen̪/, /gasin̪/ (from the tree) | /gasǝkin̪/ (from a tree) | /gaswǝlin̪/ (from trees)
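The shape of the two paradigms can be summarized by enumerating their cells: every case combines with definite and indefinite in the singular, and with no definiteness value in the plural. The sketch below counts the cells per noun class for a single gender. The data layout and the names `CASES` and `paradigm_cells` are hypothetical and are not part of the grammar described in this paper.

```python
from itertools import product

# Case inventories as listed in the text for each noun class.
CASES = {
    "animate": ["nominative", "accusative", "dative", "genitive", "instrumental"],
    "inanimate": ["direct", "dative", "genitive", "instrumental"],
}


def paradigm_cells(noun_class):
    """List (number, definiteness, case) cells for one noun of the given class."""
    cases = CASES[noun_class]
    # Definiteness is distinguished only in the singular.
    singular = [("sg", definiteness, case)
                for definiteness, case in product(("def", "indef"), cases)]
    plural = [("pl", None, case) for case in cases]
    return singular + plural
```

Under these assumptions an animate noun has 15 cells and an inanimate noun has 12, which matches the number of filled cells per gender in Table 1 and in Table 2.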

Determiners of Sinhala do not carry any grammatical features and can engage with
any noun without agreement of features. Therefore grammatical features for
determiners were not considered. e.g. ‘ඒ’ /e:/ is a determiner of Sinhala which
combines with any noun without considering grammatical features of number, gender
or case. The noun phrases ‘ඒ ළමයා’ /e: lamǝja:/ (that child) and ‘ඒ ළමයි’ /e: lamaji/
(those children) differ in the number feature, but have the identical determiner ‘ඒ’.
4.2 Grammatical Features of the VP

Number, gender, tense, person and volition are considered as the grammatical features of the verb phrase (VP) in Sinhala. There are two tenses in Sinhala: past and non-past. The non-past form can refer to either the present or the future; the future is expressed using time adverbials. The single form used to denote both present and future is therefore termed non-past. For example, ‘යයි’ /jaji/ is a verb form in Sinhala which means ‘goes’ and carries the grammatical features singular, 3rd person, non-past. The sentence ‘ඔහු පාසල් යයි’ /ohu pa:sal jaji/ (he goes to school) is in the present tense. If we add the time adverbial ‘ෙහට’ /heʈǝ/ (tomorrow) to denote the future, the sentence is in the future tense: ‘ඔහු ෙහට පාසල් යයි’ /ohu heʈǝ pa:sal jaji/ (he will go to school tomorrow). In these two sentences the verb ‘යයි’ /jaji/ is a pure verb. In addition to the pure form, there is another form used to denote the non-past tense, called the krudantha. The above sentence can be rewritten with a krudantha form as ‘ඔහු ෙහට පාසල් යන්ෙන්ය’ /ohu heʈǝ pa:sal jan̪n̪e:jǝ/. In this sentence ‘යන්ෙන්ය’ /jan̪n̪e:jǝ/ is similar to the form ‘යයි’. However, according to Kekulawala [13], the non-past krudantha form, which uses the suffix -න්ෙන- /-n̪n̪e-/, is used to denote the future tense in Sinhala. In modern Sinhala writing, krudantha forms are used more frequently than the pure forms in both past and non-past tenses. In this grammar, we treat tense as past and non-past rather than as past, present and future.
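The two-way tense feature and the role of time adverbials can be illustrated with a small function that derives a present or future reading from a non-past verb. This is only a sketch of the idea in the paragraph above: the adverbial list contains just the one example from the text, the tokens are the phonemic transcriptions used in this paper, and the function name is hypothetical.

```python
# Transcription of the time adverbial meaning 'tomorrow', taken from the example.
TOMORROW = "heʈǝ"
FUTURE_ADVERBIALS = {TOMORROW}


def temporal_reading(verb_tense, words):
    """Map the grammar's two-way tense value to a past/present/future reading."""
    if verb_tense == "past":
        return "past"
    if verb_tense != "nPast":
        raise ValueError("tense must be 'past' or 'nPast'")
    # A non-past verb is read as future only when a future adverbial is present.
    if any(word in FUTURE_ADVERBIALS for word in words):
        return "future"
    return "present"
```

The grammar itself only needs the two values past and non-past; the three-way reading is recovered from the rest of the sentence.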
Volition (VLT) is another feature of the verb which in our grammar is considered
to be either true or false. For example ‘යයි’ is a volitive form of the verb “go”, while
its equivalent involitive form is ‘යැෙවයි’ /jæveji/. The other features of number,
gender and person are the same as their equivalents in the NP. Figure 4 gives an
overview of the grammatical features of the Sinhala Verb.

Verb
  Tense: Past, Non-past
  Number: Singular, Plural
  Gender: Masculine, Feminine, Neuter
  Person: 1st, 2nd, 3rd
  Volition: True, False

Fig. 4. Grammatical Features of the Verb Phrase

5 The Sinhala CFG


According to Abhayasinghe (1998) there are 25 types of simple sentence structures in
Sinhala [1]. In this research, we considered the following ten types of sentence
structures in developing a CFG for Sinhala.
1. සඳ | බැබෙලයි.
sand̪ǝ | bæbǝleji
moon | is shining.
The moon is shining.
2. ගෙසන් | බිමට | ෙගඩියක් | වැටිණි.
gasen̪ | bimǝʈǝ | geɖijak | væʈiɳi
from the tree | to the floor | a fruit | fell
A fruit fell (to the floor) from the tree.
3. බල්ෙලෝ | බුරති.
ballo: | burǝt̪i
dogs | bark
Dogs bark.
4. අයියා | අඹ | කඩයි.
ajja: | ambǝ | kaɖaji
the elder brother | mangoes | plucks
The elder brother plucks mangoes.
5. තාත්තා | පුතාට | තෑග්ගක් | ෙදයි.
t̪a:t̪t̪a: | put̪a:ʈǝ | t̪æ:ggak | deji
father | to the son | a gift | gives
Father gives a gift to the son.
6. මිනිසා | ගසට | නගියි.
min̪isa: | gasaʈǝ | n̪agiji
the man | to the tree | climbs
The man climbs the tree.
7. හිඟන්නා | මෙගන් | රුපියලක් | ඉල්ලීය.
hiŋgan̪n̪a: | magen̪ | rupiyǝlak | illi:jǝ
the beggar | from me | one rupee | asked
The beggar asked me for one rupee.
8. ඔහුට | වත්තක් | තිෙබ්.
ohuʈǝ | vat̪t̪ak | t̪ibe:
to him | an estate | has
He has an estate.
9. මට | සින්දුවක් | ඇෙසයි.
maʈǝ | sin̪duvak | æseji
to me | a song | hear
I hear a song.
10. ළමයාට | ඇඬිණි.
ɭamǝja:ʈǝ | ænɖiɳi
to the child | cried
The child cried.

After analyzing each of the above sentences, all their constituents were identified. Based on this constituent structure, separate CFG productions were written for each type of sentence. Once all the grammar rules needed to cover the phenomena above had been identified, they were merged and optimized to form a generic CFG for Sinhala. In addition, some further complexity was introduced into the grammatical rules in order to increase the overall coverage. The grammatical and lexical productions of the CFG developed are given below.

Grammar Productions

---------------------------- S expansion productions--------------------------------


S -> NP[NUM=?n, GEN=?G, PER=?P, DEF=?TF, CASE=F1] VP[NUM=?n, GEN=?G, PER=?P, CASE=?CS]
S -> NP[NUM=?n, GEN=?G, PER=?P, DEF=?TF, CASE=F3] VP[NUM=?n, GEN=?G, PER=?P, CASE=?CS]
S -> NP[NUM=?n, GEN=?G, PER=?P, DEF=?TF, CASE=F4] VP[NUM=?n, GEN=?G, PER=?P, CASE=?CS]
S -> NP[NUM=?n, GEN=?G, PER=?P, DEF=?TF, CASE=F5] VP[NUM=?n, GEN=?G, PER=?P, CASE=?CS]
----------------------------- NP expansion productions------------------------------
NP[NUM=?n, CASE=?CS, GEN=?G, DEF=?TF] -> N[NUM=?n, CASE=?CS, GEN=?G]
NP[NUM=?n, CASE=?CS, GEN=?G, PER=?P] -> PrN[NUM=?n, CASE=?CS, GEN=?G, PER=?P]
NP[NUM=?n, CASE=?CS] -> PropN[NUM=?n, CASE=?CS]
NP[NUM=?n, CASE=?CS, GEN=?G, DEF=?TF] -> Det N[NUM=?n, CASE=?CS, GEN=?G]
NP[NUM=?n, CASE=?CS, GEN=?G, DEF=?TF] -> ADJP N[NUM=?n, CASE=?CS, GEN=?G]
NP[NUM=?n, CASE=?CS, GEN=?G, DEF=?TF] -> Det ADJP N[NUM=?n, CASE=?CS, GEN=?G]
----------------------------- VP expansion productions ------------------------------
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> IV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP IV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP NP TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP NP IV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP NP ADVP TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> ADVP IV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> ADVP TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP ADVP IV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
VP[TENSE=?t, NUM=?n, GEN=?G, PER=?P] -> NP ADVP TV[TENSE=?t, NUM=?n, GEN=?G, PER=?P]
------------------------------ADJP expansion productions-----------------------------
ADJP -> Adj
ADJP -> Adj ADJP
----------------------------- ADVP expansion productions-----------------------------
ADVP -> Adv
ADVP -> Adv ADVP
------------------------------Sample Lexical Productions-------------------------------
N[NUM=sg, GEN=MA, CASE=F1, DEF=TRue] -> 'බල්ලා' | 'මිනිසා' | 'ළමයා' | 'තාත්තා' | 'හිඟන්නා' | 'මල්ලී' | 'අයියා'
N[NUM=sg, GEN=MA, CASE=F2, DEF=TRue] -> 'බල්ලා' | 'මිනිසා' | 'ළමයා' | 'ෙකොල්ලා' | 'තාත්තා' | 'සර්පයා'
N[NUM=sg, GEN=MA, CASE=F3, DEF=TRue] -> 'බල්ලාට' | 'මිනිසාට' | 'ළමයාට' | 'පුතාට' | 'සර්පයාට' | 'අයියාට'
N[NUM=sg, GEN=MA, CASE=F1, DEF=False] -> 'බල්ෙලක්' | 'මිනිෙසක්' | 'ළමෙයක්' | 'ෙකොල්ෙලක්' | 'හිඟන්ෙනක්' | 'පුෙතක්'
N[NUM=sg, GEN=NE, CASE=F1, DEF=False] -> 'තෑග්ගක්' | 'රුපියලක්' | 'වත්තක්' | 'සින්දුවක්' | 'ෙපොතක්' | 'ෙගඩියක්'
N[NUM=sg, GEN=NE, CASE=F3, DEF=TRue] -> 'ගසට' | 'මලට' | 'අත්තට' | 'බතට' | 'ෙගදරට' | 'ෙපොතට' | 'ෙකෝටුවට' | 'බිමට'
N[NUM=sg, GEN=NE, CASE=F5, DEF=TRue] -> 'ගෙසන්' | 'මෙලන්' | 'ෙගදරින්' | 'ෙපොතින්' | 'ෙකෝටුෙවන්' | 'ෙපොරෙවන්'
N[NUM=sg, GEN=NE, CASE=F1, DEF=TRue] -> 'ගස' | 'මල' | 'අත්ත' | 'අඹ' | 'ෙගදර' | 'ෙපොත' | 'ෙකෝටුව' | 'කළය' | 'සඳ'
N[NUM=pl, GEN=MA, CASE=F1] -> 'බල්ෙලෝ' | 'මිනිස්සු' | 'ළමයි'
N[NUM=pl, GEN=FE, CASE=F2] -> 'ගැහැණුන්' | 'ෙකල්ලන්'
PrN[NUM=sg, CASE=F3, PER=F] -> 'මට'
PrN[NUM=sg, CASE=F5, PER=F] -> 'මෙගන්' | 'මාෙගන්'
PrN[NUM=pl, CASE=F1, PER=T] -> 'ඔවුහු' | 'ඒෙගොල්ෙලො'
Det -> 'ඒ' | 'ෙම්' | 'අර' | 'ඔය' | 'සමහර' | 'ඇතැම්'
Adj -> 'ලස්සන' | 'කැත' | 'මහත' | 'සුදු' | 'කලු' | 'ෙලොකු' | 'ෙපොඩි' | 'පුංචි' | 'උස'
Adv -> 'පන්සල්' | 'ෙගදර' | 'පාසලට' | 'නගරයට' | 'ෙව්ගෙයන්' | 'ලස්සනට' | 'ෙහොඳට' | 'ඉක්මනින්' | 'ෙසෙමන්'
TV[TENSE=nPast, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'ෙදයි' | 'කයි' | 'කඩයි' | 'කියයි' | 'ගසයි' | 'සිටියි' | 'තිෙබ්'
TV[TENSE=nPast, NUM=sg, VLT=False] -> 'කැෙවයි' | 'කියෙවයි' | 'ඇෙසයි' | 'ෙපෙවයි'
TV[TENSE=nPast, NUM=pl, VLT=True, PER=F] -> 'ෙදමු' | 'කමු' | 'ෙබොමු' | 'කියමු'
TV[TENSE=past, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'කෑෙව්ය' | 'දුන්ෙන්ය' | 'කීෙව්ය' | 'ඉල්ලීය' | 'ගැසුෙව්ය'
IV[TENSE=nPast, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'බුරයි' | 'නගියි' | 'ඇවිදියි' | 'යයි'
IV[TENSE=nPast, NUM=sg, GEN=NE, VLT=False, PER=T] -> 'බැබෙලයි' | 'පිෙපයි'
IV[TENSE=nPast, NUM=pl, VLT=True, PER=S] -> 'බුරහු' | 'ඇවිදිහු' | 'යහු'
IV[TENSE=past, NUM=sg, VLT=False, PER=T] -> 'බිරිණි' | 'ඇඬිණි' | 'වැටිණි'
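
The variables ?n, ?G and ?P in the S productions above enforce agreement: whatever value the subject NP supplies must also be acceptable to the VP. The sketch below imitates that behaviour with plain dictionaries. It is a deliberately simplified stand-in for NLTK's feature unification, not its actual algorithm; the helper names `bind` and `s_rule_applies` are hypothetical, and a feature missing from a constituent is assumed to be compatible with anything.

```python
def bind(pattern, features, bindings):
    """Match a category's feature pattern against concrete features.

    Returns the extended variable bindings, or None if a value clashes.
    """
    result = dict(bindings)
    for name, wanted in pattern.items():
        actual = features.get(name)
        if actual is None:
            # Unspecified on the constituent: nothing to check for this feature.
            continue
        if wanted.startswith("?"):
            # Variable: bind it on first use, otherwise require the same value.
            if wanted in result and result[wanted] != actual:
                return None
            result[wanted] = actual
        elif wanted != actual:
            # Constant (for example CASE=F1) must match exactly.
            return None
    return result


# Feature patterns of the first S production: the subject NP, then the VP.
S_RULE = [
    {"NUM": "?n", "GEN": "?G", "PER": "?P", "CASE": "F1"},
    {"NUM": "?n", "GEN": "?G", "PER": "?P"},
]


def s_rule_applies(np_features, vp_features):
    """Return True if the NP and VP can be combined by the S production."""
    bindings = bind(S_RULE[0], np_features, {})
    if bindings is None:
        return False
    return bind(S_RULE[1], vp_features, bindings) is not None
```

A singular masculine form-1 subject combines with a singular masculine verb phrase, while a plural verb phrase or a form-2 subject is rejected, because ?n or the constant CASE=F1 fails to match.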

The following parse trees were produced using the Recursive Descent parser from the NLTK toolkit [3].

----------------------------------Sentence 3--------------------------------
(S[]
  (NP[CASE='F1', DEF=?TF, GEN='MA', NUM='pl']
    (N[CASE='F1', GEN='MA', NUM='pl'] බල්ෙලෝ))
  (VP[GEN=?G, NUM='pl', PER='T', TENSE='pres']
    (IV[NUM='pl', PER='T', TENSE='nPast', +VLT] බුරති)))

-----------------------------------Sentence 5--------------------------------
(S[]
  (NP[CASE='F1', DEF=?TF, GEN='MA', NUM='sg']
    (N[CASE='F1', DEF='TRue', GEN='MA', NUM='sg'] තාත්තා))
  (VP[GEN='MA', NUM='sg', PER='T', TENSE='pres']
    (NP[CASE='F3', DEF=?TF, GEN='MA', NUM='sg']
      (N[CASE='F3', DEF='TRue', GEN='MA', NUM='sg'] පුතාට))
    (NP[CASE='F1', DEF=?TF, GEN='NE', NUM='sg']
      (N[CASE='F1', -DEF, GEN='NE', NUM='sg'] තෑග්ගක්))
    (TV[GEN='MA', NUM='sg', PER='T', TENSE='nPast', +VLT] ෙදයි)))

--------------------------------------Sentence 9--------------------------------
(S[]
  (NP[CASE='F3', GEN=?G, NUM='sg', PER='F']
    (PrN[CASE='F3', NUM='sg', PER='F'] මට))
  (VP[GEN=?G, NUM='sg', PER=?P, TENSE='pres']
    (NP[CASE='F1', DEF=?TF, GEN='NE', NUM='sg']
      (N[CASE='F1', -DEF, GEN='NE', NUM='sg'] සින්දුවක්))
    (TV[NUM='sg', TENSE='nPast', -VLT] ඇෙසයි)))

6 Evaluation and Results


In order to test and evaluate the grammar, two hundred sample sentences taken from primary grade Sinhala grammar books [19] [20] were used. In the test, 118 sentences were parsed correctly by the grammar and 82 were not parsed. Of these 82 sentences, two are structured incorrectly and were therefore rightly rejected by the grammar. Several sentences were not parsed because of free word order. For example, this grammar places the ADVP before the verb and after the NP, so sentences that have an ADVP at the beginning were not parsed.
If an inanimate noun occurs in the subject NP, it does not agree in number with the predicate VP. For example, the sentence ‘මල පිෙපයි’ /malǝ pipeji/ (the flower blooms) contains a singular NP and a singular VP, while ‘මල් පිෙපයි’ /mal pipeji/ (flowers bloom) contains a plural NP and a singular VP. Both of these sentences are correct in Sinhala. However, sentences of the second type, in which number agreement is not observed, are not covered by this grammar. Sentences with compound verbs, auxiliary verbs, present participles, past participles, verbs in the imperative mood or negated verbs are also not parsed by this grammar.
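
One way to picture the missing case is an agreement check that skips the number feature when the subject is inanimate. The sketch below is only an illustration of that idea and is not part of the grammar evaluated here: the ANIMATE feature and the function name are assumptions, and the feature values reuse the abbreviations of the lexical productions (sg, pl, MA, NE, T).

```python
AGREEMENT_FEATURES = ("GEN", "PER", "NUM")


def subject_verb_agree(subject, verb):
    """Check agreement, ignoring number when the subject is inanimate."""
    for feature in AGREEMENT_FEATURES:
        # Hypothetical relaxation: inanimate subjects need not agree in number.
        if feature == "NUM" and not subject.get("ANIMATE", True):
            continue
        subject_value = subject.get(feature)
        verb_value = verb.get(feature)
        if subject_value is not None and verb_value is not None \
                and subject_value != verb_value:
            return False
    return True


# Feature bundles for the two example sentences and one animate contrast.
FLOWER = {"NUM": "sg", "GEN": "NE", "PER": "T", "ANIMATE": False}
FLOWERS = {"NUM": "pl", "GEN": "NE", "PER": "T", "ANIMATE": False}
DOGS = {"NUM": "pl", "GEN": "MA", "PER": "T", "ANIMATE": True}
BLOOMS = {"NUM": "sg", "GEN": "NE", "PER": "T"}
BARKS = {"NUM": "sg", "GEN": "MA", "PER": "T"}
```

With this relaxation both the singular and the plural inanimate subject combine with the singular verb, while a plural animate subject with a singular verb is still rejected.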

The test results are shown below.

Total number of sentences | 200
Correct sentences parsed | 118
Correct sentences not parsed | 80
Incorrect sentences not parsed | 2

According to these results, the accuracy of the grammar is 60%.
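
The paper does not spell out how the 60% figure is computed. One reading that reproduces it exactly is to count both the correctly parsed sentences and the correctly rejected incorrect sentences as right decisions. The sketch below works through that arithmetic from the four counts above; the function names are hypothetical and the formula is an assumption, not a statement of the authors' method.

```python
TOTAL = 200
CORRECT_PARSED = 118
CORRECT_NOT_PARSED = 80
INCORRECT_NOT_PARSED = 2


def parse_rate(parsed, total):
    """Share of all test sentences that received a parse."""
    return parsed / total


def accuracy(parsed, rejected_incorrect, total):
    """Share of sentences on which the grammar made the right decision."""
    return (parsed + rejected_incorrect) / total


if __name__ == "__main__":
    print(f"parse rate: {parse_rate(CORRECT_PARSED, TOTAL):.1%}")
    print(f"accuracy:   {accuracy(CORRECT_PARSED, INCORRECT_NOT_PARSED, TOTAL):.1%}")
```

Under this reading, 118 of 200 sentences (59%) are parsed, and 120 of 200 decisions (60%) are correct.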

7 Discussion
Free word order
The grammar developed covers the default Sinhala sentence structure in SOV order. The first two sentences of Figure 1 are in SOV order, and only they can be successfully parsed using the grammar developed. The remaining sentence structures cannot be parsed using the existing grammar. In natural language processing, dependency grammars are often used to address the free word order problem.

Word segmentation
In written Sinhala there is no unique method for word segmentation. The linguistics literature reports collections of rules for segmenting Sinhala words [15]. However, most users of the language are not aware of these rules and do not follow them closely. For example, the word-ending particle ‘ය’ is often used inconsistently. The Sinhala language has two types of verbs, namely shudda kriya ‘pure verbs’ and krudanta kriya ‘participial verbs’. When a participial verb occurs in sentence-final position there are two ways to write it. One is to separate the sentence-ending particle, as in ‘ගිෙය් ය’ “(he) went”; the other is to attach it to the participial verb, as in ‘ගිෙය්ය’. It is therefore desirable to have a word segmentation algorithm that checks whether the text is in a normalized form before the CFG parser is employed.
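
As a rough illustration of such a normalization step, the sketch below attaches a free-standing sentence-final particle to the preceding word, so that both spellings reach the parser in one form. It is a minimal sketch and not the segmentation algorithm envisaged here: choosing the attached spelling as the normal form is an arbitrary assumption, the function name is hypothetical, and only the final token of the sentence is examined.

```python
# The Sinhala letter used as the sentence-ending particle (U+0DBA).
PARTICLE = "\u0dba"


def attach_final_particle(sentence):
    """Join a separate sentence-final particle to the word before it."""
    tokens = sentence.split()
    # The last token may carry a full stop, so compare without it.
    if len(tokens) >= 2 and tokens[-1].rstrip(".") == PARTICLE:
        tokens[-2:] = [tokens[-2] + tokens[-1]]
    return " ".join(tokens)
```

A lexicon that lists only the attached forms, like the sample lexical productions in Section 5, could then be used for text written in either style.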
Non-verbal sentences
There are a number of sentence structures in Sinhala which do not contain a verb. These types of sentences end with adjectives, oblique nominals, locative predicates and adverbials, among others, and the current grammar does not cover such non-verbal sentences of Sinhala.

8 Conclusion and Future Work


This paper describes the development of a CFG for a non-trivial subset of Sinhala using the NLTK toolkit. Ten simple sentence structures were selected and used to design the grammar. Two hundred simple sentences were used to test the grammar, and 60% of the sentences were analyzed accurately by the parser. In the future, we hope to use a morphological analyzer and a word segmentation algorithm to develop a wider-coverage grammar for Sinhala.

Acknowledgment. We are grateful to all the members of the Language Technology Research Laboratory of the University of Colombo School of Computing, Sri Lanka, who helped in various ways to make this work bear fruit.

References
1. Abhayasinghe, A.A.: Sinhala bhashave sarala vakya vibagaya (1998)
2. Mosaddeque, A.B., Haque, N.: Context-Free Grammar for Bangla. BRAC
University, Dhaka
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text
with the Natural Language Toolkit. O’Reilly Media (2009)
4. Disanayaka, J.B.: Bashavaka rata samudaya. Lake house investment Co. Ltd., Colombo 2
(1969)
5. Fairbanks, G.H., Gair, J.W., Silva, M.W.S.D.: Colloquial Sinhalese. Cornell University,
New York (1968)
6. Gair, J.W., Karunatilaka, W.S.: Literary Sinhala inflected forms: A Synopsis with a
Translation Guide to Sinhala script. Cornell University, New York
7. Gair, J.W., Karunatilaka, W.S.: Literary Sinhala. Cornell University, New York (1974)
8. Gunasekara, A.M.: A Comprehensive Grammar of the Sinhalese Language. Godage
International Publishers (PVT) Ltd. (2008)
9. Hettige, B., Karunananda, A.S.: Computational Model of Grammar for English to Sinhala
Machine Translation. In: Proceedings of the International Conference on Advances in ICT
for Emerging Regions (2011)
10. Jayawardhane, T.: The surface case system in Sinhala. KALYANI, pp. 264–277.
University of Kelaniya (1996)
11. Kariyakarawana, S.M.: The Syntax of Focus and Wh-Questions in Sinhala. Karunaratne &
Sons Ltd. (1998)
12. Karunatilaka, W.S.: Sinhala bhasha vyakaranaya. M. D. Gunasena & Co. Ltd. (2009)
13. Kekulawala, S.L.: The future tense in Sinhalese – an ‘unorthodox’ point of view. Journal
of the Vidyalankara University of Ceylon (1972)
14. Khan, N., Khan, M.: Developing a Computational Grammar for Bengali Using the HPSG
Formalism. In: Proceedings of the 9th International Conference on Computer and
Information Technology, ICCIT 2006 (2006)
15. Rajapaksha, D.: Sinhala bhashave pada bedima saha virama lakshana bhavithaya (2008)
16. Sagar, B.M., Shobha, G., Kumar, R.: Context Free Grammar (CFG) Analysis for simple
Kannada sentences. In: Proceedings of the International Conference [ACCTA-2010] on
Special Issue of IJCCT, vol. 1(2, 3, 4) (2010)
17. Sagar, B.M., Shobha, G., Kumar, R.: Solving the Noun Phrase and Verb Phrase Agreement
in Kannada Sentences. International Journal of Computer Theory and Engineering 1(3)
(August 2009)
18. Wikipedia (English), http://en.wikipedia.org/wiki/Sinhala_language
19. Dasanayaka, A.E.S.: Kumara rachanaya; Grade 4. M. D. Gunasena & Co. Ltd. (1990)
20. Dasanayaka, A.E.S.: Kumara rachanaya; Grade 5, M. D. Gunasena & Co. Ltd. (2005)
