A Computational Grammar of Sinhala
All content following this page was uploaded by Chamila Liyanage on 29 December 2021.
1 Introduction
Sinhala is the official language of Sri Lanka and it is the language spoken by a
majority of Sri Lankans – nearly 70% of the population [7]. From a historical point of
view, Sinhala is a modern Indo-Aryan language, which is related to the Vedic
language or Old Sanskrit in India [10]. Modern Sinhala has subsequently gained
through its association with Tamil, English, Portuguese, and Dutch [8]. There are two
main varieties of Sinhala based on its usage, namely the literary and the spoken,
which differ from each other in important ways [5]. In addition, Sinhala has an
alphasyllabary writing system, also called an abugida: a segmental writing system
in which consonant-vowel sequences are written as single units [18].
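The notion of a written unit in an abugida can be illustrated with a minimal sketch. It assumes that a unit is a base character followed by any combining marks (Unicode categories Mn and Mc, which include the Sinhala vowel signs and the al-lakuna); the function name `split_aksharas` is hypothetical, and conjuncts formed with a zero-width joiner are not handled.

```python
import unicodedata

def split_aksharas(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        # Mn/Mc cover the Sinhala vowel signs and the al-lakuna (vowel killer).
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# "father", spelled with explicit code points: ta+aa, ta+al-lakuna, ta+aa
word = "\u0dad\u0dcf\u0dad\u0dca\u0dad\u0dcf"
print(split_aksharas(word))
```

Under these assumptions the six code points of the word are grouped into three written units.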
Natural Language Processing (NLP) is an area of research that explores how
computers can be used to understand and manipulate natural languages [16].
Currently there are many research areas related to NLP in Sinhala such as Speech
A. Gelbukh (Ed.): CICLing 2012, Part I, LNCS 7181, pp. 188–200, 2012.
© Springer-Verlag Berlin Heidelberg 2012
2 Related Work
Very little research has been reported in the literature on efforts to develop a formal
grammar for Sinhala. This section gives a brief survey of grammar
development reported for Indic languages, including Sinhala.
Hettige and Karunananda have implemented a computational model of grammar
for Sinhala [9]. Morphological and syntactic analysis of Sinhala has been considered
in this work, which is modeled using a Finite State Transducer (FST) and a Context-
Free grammar. Developed as part of a Machine Translation system, the parser in this
system handles only simple sentences containing 8 constituents, namely, Attributive
adjunct of Subject, Subject, Attributive adjunct of Object, Object, Attributive adjunct
of Predicate, Attributive adjunct of the complement of predicate, Complement of
predicate and Predicate.
In research carried out for the Kannada language, Sagar et al. modeled noun phrase
and verb phrase agreement in Kannada sentences [17]. They
classified noun phrases into three subcategories (adjective noun, noun and
pronoun), but considered only gender and number as features of the
grammar. As in Sinhala, Kannada verbs need to agree
with the subject of their sentences in number and gender. Therefore, the suffix of the
verb is extracted to check masculine, feminine and plural verb endings. They
used a context-free grammar (CFG) to write the grammar rules and
Python as the programming language. A Recursive Descent Parser, a simple top-down
parser from NLTK, was used to test the grammar. The work is limited to resolving noun
and verb agreement and indicating whether a sentence is syntactically acceptable or not.
Sagar et al. carried out another study, which describes the process of generating
a Context Free Grammar for simple Kannada sentences [16]. They checked
the sentences with both a Top-Down (Recursive Descent) Parser and a Bottom-Up
(Shift-Reduce) Parser. According to the authors, two kinds of conflict, Shift-Reduce and
Reduce-Reduce, occurred when the sentences were parsed using the Bottom-Up
parser. Therefore the Top-Down parser was selected as the more suitable parser for
the given sentences.
Mosaddeque and Haque proposed a way of producing a
context-free grammar for Bangla [2]. Their work reports that only sentences of
seven to eight words in length were used for testing. They took 10 ad hoc
sentences from a newspaper article as the basis for designing the grammar,
tagged all the words in the sentences with their respective parts-of-speech (POS)
tags, and used NLTK’s Shift-Reduce Parser to test the grammar. Only one
of these ten sentences was parsed successfully.
190 C. Liyanage et al.
Naira Khan and Mumit Khan have implemented a Computational Grammar for
Bengali using the Head-Driven Phrase Structure Grammar (HPSG) formalism [14].
The Linguistic Knowledge Building (LKB) system was used to implement this
grammar, which allows the user to build a parser along with a generator. A set of
instructions for using the HPSG formalism to parse the grammar and to generate
grammatical sentences of Bengali is given in this paper.
3 Structure of Sinhala
Sinhala is a free word order language. Its unmarked word order is SOV; variant orders
are also possible, with discourse-pragmatic effects. With proper intonation, the main
constituents of a sentence can appear in any order [11]. Figure 1 shows
all the free word order forms of the Sinhala equivalent of the English sentence
“Father hit the younger brother with a stick”.
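The size of this search space can be illustrated with a short sketch. It assumes, purely for illustration, that the example sentence is treated as four main constituents (subject, object, instrument and verb) and uses English glosses in place of the Sinhala words.

```python
from itertools import permutations

# English glosses standing in for the four main constituents (an assumption).
constituents = ["father", "younger-brother", "with-a-stick", "hit"]

# Every ordering of the constituents is a candidate surface order.
orders = [" ".join(p) for p in permutations(constituents)]

print(len(orders))
for order in orders[:3]:
    print(order)
```

With four constituents there are 4! = 24 candidate orders, which is why a grammar written for one fixed order covers only a small part of the possible surface forms.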
Traditionally, a sentence is divided into two parts: Noun Phrase (NP) and Verb
Phrase (VP). In Sinhala grammar, uktha (subject) and akyatha (predicate) are the two
parts of a sentence. Subject and predicate in Sinhala sentences agree in number,
gender and person [12].
The sentence structures of Sinhala have been studied by a number of
scholars [8] [1] [4]. According to Abhayasinghe [1], Sinhala has 25 types of simple
sentence structures. However, in the present work we have covered only the main
sentence structures and a few complex structures. These are described in the
following sections.
hond̪ǝ/ (very good) is an adjectival phrase; it consists of a degree word ‘ඉතා’ /it̪a:/ (very),
which appears before the adjective, and a qualitative adjective ‘ෙහොඳ’ /hond̪ǝ/ (good).
In Sinhala, the noun is inflected for number, gender, person,
tense, case and definiteness. The verb is inflected for number, gender, person, tense
and volition. Subject and predicate agree in the features of number, gender and
person.
Words that are marked for the grammatical features linga (gender), vachana
(number), niyatha-aniyatha (definiteness) and vibhakthi (case) are recognized as
nouns in Sinhala [12]. Therefore, in developing the grammar, we consider the features
of number, gender, case and definiteness for common nouns; number, gender and case
for proper nouns; and number, gender, case and person for pronouns.
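The feature inventory just described can be written down as a small lookup table. The sketch below is illustrative only: the feature labels NUM, GEN, CASE, DEF and PER follow the labels used in the parse trees of this paper, while the category keys and the helper `unexpected_features` are hypothetical names.

```python
# Features considered for each nominal category, as listed in the text.
NOMINAL_FEATURES = {
    "common_noun": {"NUM", "GEN", "CASE", "DEF"},
    "proper_noun": {"NUM", "GEN", "CASE"},
    "pronoun": {"NUM", "GEN", "CASE", "PER"},
}

def unexpected_features(category, entry):
    """Return the features of a lexical entry that its category does not use."""
    return sorted(set(entry) - NOMINAL_FEATURES[category])

# A proper noun is not marked for definiteness, so DEF is reported.
print(unexpected_features("proper_noun", {"NUM": "sg", "GEN": "MA", "DEF": True}))
```

A check of this kind could be used to catch lexical entries that carry a feature their category does not inflect for.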
Sinhala is a highly inflected language, and its common nouns are inflected for number,
definiteness and case. Sinhala nouns are also divided into animate and inanimate
classes on the basis of their inflection. Animate nouns inflect for number (singular
and plural), definiteness (definite and indefinite) and five cases (nominative,
accusative, dative, genitive, and instrumental). The definiteness distinction applies
only in the singular form of nouns [6].
As enumerated in Table 1, Sinhala nouns, which are inflected for five cases, have five
forms. However, the same form may serve several cases. For example,
form 5 can occur in the instrumental, ablative and auxiliary cases. Therefore we have
defined only five cases, one per form, in this grammar specification. All inflections relating to animate
nouns are shown in the table.
Sinhala inanimate nouns are inflected similarly to the animate nouns for number
and definiteness. However, they have only four cases: direct, dative, genitive and
instrumental [6]. Forms 3, 4, and 5 in Table 2 are similar to those in Table 1, whereas
form 1 accounts for the direct case. Table 2 shows all the inflections for an inanimate
noun as covered in the grammar developed.
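The shape of the two paradigms can be enumerated with a short sketch. It assumes only what the text states: definiteness is distinguished in the singular alone, animate nouns have five cases and inanimate nouns four. The function name and the value labels are illustrative.

```python
from itertools import product

ANIMATE_CASES = ["nominative", "accusative", "dative", "genitive", "instrumental"]
INANIMATE_CASES = ["direct", "dative", "genitive", "instrumental"]

def paradigm_cells(cases):
    """List (number, definiteness, case) cells; definiteness is singular-only."""
    singular = [("sg", d, c) for d, c in product(["def", "indef"], cases)]
    plural = [("pl", None, c) for c in cases]
    return singular + plural

print(len(paradigm_cells(ANIMATE_CASES)))
print(len(paradigm_cells(INANIMATE_CASES)))
```

Under these assumptions an animate noun has 2 x 5 + 5 = 15 paradigm cells and an inanimate noun 2 x 4 + 4 = 12, although several cells may share the same surface form.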
Table column headings: Form; Case; Singular (Definite, Indefinite); Plural.
Determiners of Sinhala do not carry any grammatical features and can combine with
any noun without feature agreement. Therefore grammatical features for
determiners were not considered. For example, ‘ඒ’ /e:/ is a determiner of Sinhala which
combines with any noun regardless of the grammatical features of number, gender
or case. The noun phrases ‘ඒ ළමයා’ /e: lamǝja:/ (that child) and ‘ඒ ළමයි’ /e: lamaji/
(those children) differ in the number feature, but have the identical determiner ‘ඒ’.
Number, gender, tense, person and volition are considered as the grammatical features
of the verb phrase (VP) in Sinhala. There are two tenses in Sinhala: past and non-past.
The non-past form can refer either to the present or to the future; the future is expressed
using time adverbials. The single form that is used to denote both the present and the
future is therefore termed non-past. For example, ‘යයි’ /jaji/ is a verb form in
Sinhala which means ‘goes’ and carries the grammatical features singular, 3rd person,
non-past; thus the sentence ‘ඔහු පාසල් යයි’ /ohu pa:sal jaji/ (he goes to school) is in the
present tense. If we add the time adverbial ‘ෙහට’ /heʈǝ/ (tomorrow) to denote the future, then the
sentence is in the future tense: ‘ඔහු ෙහට පාසල් යයි’ /ohu heʈǝ pa:sal jaji/ (he will
go to school tomorrow). In these two sentences the verb ‘යයි’ /jaji/ is a pure verb. In
addition to the pure form, there is a second form used to denote the non-past tense, called
the krudantha. We can rewrite the above sentence with a krudantha form as ‘ඔහු ෙහට
පාසල් යන්ෙන්ය’ /ohu heʈǝ pa:sal jan̪n̪e:jǝ/. In this sentence ‘යන්ෙන්ය’ /jan̪n̪e:jǝ/ is
similar to the form ‘යයි’. However, according to Kekulawala [13], the non-past krudantha
form, which uses the suffix -න්ෙන- /-n̪n̪e-/, is used to denote the future tense in
Sinhala. In modern Sinhala writing, krudantha forms are used more frequently than
the pure forms in both past and non-past tenses. In this grammar, we treat tense as a
two-way distinction between past and non-past, rather than a three-way distinction
between past, present and future.
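The relation between the two-valued tense feature and the three readings discussed above can be sketched as follows. This is an illustration under stated assumptions, not part of the grammar itself: tokens are given in the phonemic transcription used in the text, the tense values `past` and `nPast` follow the labels of the lexical productions, and the adverbial list contains only the one adverbial mentioned here.

```python
# Only /heʈǝ/ (tomorrow) comes from the text; a real list would be longer.
FUTURE_ADVERBIALS = {"heʈǝ"}

def temporal_reading(tense, tokens):
    """Map the grammar's TENSE value to a past/present/future reading."""
    if tense == "past":
        return "past"
    if tense != "nPast":
        raise ValueError(f"unknown tense value: {tense!r}")
    # A non-past verb is read as future only if a future adverbial is present.
    return "future" if FUTURE_ADVERBIALS & set(tokens) else "present"

print(temporal_reading("nPast", ["ohu", "pa:sal", "jaji"]))
print(temporal_reading("nPast", ["ohu", "heʈǝ", "pa:sal", "jaji"]))
```

The verb form is identical in both calls; only the adverbial changes the reading, which is the motivation for a single non-past value in the grammar.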
Volition (VLT) is another feature of the verb which in our grammar is considered
to be either true or false. For example ‘යයි’ is a volitive form of the verb “go”, while
its equivalent involitive form is ‘යැෙවයි’ /jæveji/. The other features of number,
gender and person are the same as their equivalents in the NP. Figure 4 gives an
overview of the grammatical features of the Sinhala Verb.
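Subject and predicate agreement over these shared features can be sketched as a simple unification of feature dictionaries. The sketch assumes that a feature missing from one side is unspecified and therefore compatible with any value, in the manner of a feature-grammar variable; the function names are hypothetical and the feature values follow the labels used in the productions of this paper.

```python
AGREEMENT = ("NUM", "GEN", "PER")

def unify(a, b):
    """Merge two feature dicts; return None if a shared feature clashes."""
    merged = dict(a)
    for feat, val in b.items():
        if feat in merged and merged[feat] != val:
            return None
        merged[feat] = val
    return merged

def agrees(np, vp):
    """Subject NP and VP agree if their NUM/GEN/PER values unify."""
    def pick(fs):
        return {f: fs[f] for f in AGREEMENT if f in fs}
    return unify(pick(np), pick(vp)) is not None

subject = {"NUM": "sg", "GEN": "MA", "PER": "T", "CASE": "F1"}
verb_sg = {"NUM": "sg", "GEN": "MA", "PER": "T", "TENSE": "nPast", "VLT": True}
verb_pl = {"NUM": "pl", "PER": "F", "TENSE": "nPast", "VLT": True}
print(agrees(subject, verb_sg), agrees(subject, verb_pl))
```

Features such as CASE, TENSE and VLT are ignored by the check because they belong to only one of the two phrases.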
1. සඳ | බැබෙලයි.
sand̪ǝ | bæbǝleji
moon | is shining.
The moon is shining.
2. ගෙසන් | බිමට | ෙගඩියක් | වැටිණි.
gasen̪ | bimǝʈǝ | geɖijak | væʈiɳi
from the tree | to the floor | a fruit | fell
A fruit fell (to the floor) from the tree.
3. බල්ෙලෝ | බුරති.
ballo: | burǝt̪i
dogs | bark
Dogs bark.
4. අයියා | අඹ | කඩයි.
ajja: | ambǝ | kaɖaji
the elder brother | mangoes | plucks
The elder brother plucks mangoes.
5. තාත්තා | පුතාට | තෑග්ගක් | ෙදයි.
t̪a:t̪t̪a: | put̪a:ʈǝ | t̪æ:ggak | deji
father | to the son | a gift | gives
Father gives a gift to the son.
6. මිනිසා | ගසට | නගියි.
min̪isa: | gasaʈǝ | n̪agiji
the man | to the tree | climbs
The man climbs the tree.
7. හිඟන්නා | මෙගන් | රුපියලක් | ඉල්ලීය.
hiŋgan̪n̪a: | magen̪ | rupiyǝlak | illi:jǝ
the beggar | from me | one rupee | asked for
The beggar asked me for one rupee.
8. ඔහුට | වත්තක් | තිෙබ්.
ohuʈǝ | vat̪t̪ak | t̪ibe:
to him | an estate | has
He has an estate.
9. මට | සින්දුවක් | ඇෙසයි.
maʈǝ | sin̪duvak | æseji
to me | a song | hear
I hear a song.
10. ළමයාට | ඇඬිණි.
ɭamǝja:ʈǝ | ænɖiɳi
to the child | cried
The child cried.
After analyzing each of the above sentences, all their constituents were identified.
According to this constituent structure, separate CFG productions were generated for
each type of sentence. After identifying all the grammar rules needed to cover the
phenomena above, they were merged and optimized to form a generic CFG
for Sinhala. In addition, some further complexity was introduced into the
grammatical rules of the Sinhala CFG in order to increase its overall coverage. The
grammatical and lexical productions of the CFG developed are given below.
Grammar Productions
TV[TENSE=nPast, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'ෙදයි' | 'කයි' | 'කඩයි' | 'කියයි' | 'ගසයි' | 'සිටියි' | 'තිෙබ්'
TV[TENSE=nPast, NUM=sg, VLT=False] -> 'කැෙවයි' | 'කියෙවයි' | 'ඇෙසයි' | 'ෙපෙවයි'
TV[TENSE=nPast, NUM=pl, VLT=True, PER=F] -> 'ෙදමු' | 'කමු' | 'ෙබොමු' | 'කියමු'
TV[TENSE=past, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'කෑෙව්ය' | 'දුන්ෙන්ය' | 'කීෙව්ය' | 'ඉල්ලීය' | 'ගැසුෙව්ය'
IV[TENSE=nPast, NUM=sg, GEN=MA, VLT=True, PER=T] -> 'බුරයි' | 'නගියි' | 'ඇවිදියි' | 'යයි'
IV[TENSE=nPast, NUM=sg, GEN=NE, VLT=False, PER=T] -> 'බැබෙලයි' | 'පිෙපයි'
IV[TENSE=nPast, NUM=pl, VLT=True, PER=S] -> 'බුරහු' | 'ඇවිදිහු' | 'යහු'
IV[TENSE=past, NUM=sg, VLT=False, PER=T] -> 'බිරිණි' | 'ඇඬිණි' | 'වැටිණි'
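As an aid to reading the lexical productions above, the following sketch splits one production line into its category, feature values and word list. It is a plain-text illustration that assumes the exact layout shown above (one bracketed, comma-separated feature list followed by quoted alternatives); it is not the NLTK grammar reader, and the function name is hypothetical.

```python
import re

PRODUCTION = re.compile(r"^(\w+)\[(.*?)\]\s*->\s*(.+)$")

def parse_lexical_production(line):
    """Split "CAT[F=v, ...] -> 'w1' | 'w2'" into (category, features, words)."""
    match = PRODUCTION.match(line.strip())
    if match is None:
        raise ValueError(f"not a lexical production: {line!r}")
    category, feature_text, rhs = match.groups()
    # Feature values are kept as strings, e.g. VLT is "True" rather than True.
    features = dict(item.strip().split("=", 1) for item in feature_text.split(","))
    words = [w.strip().strip("'") for w in rhs.split("|")]
    return category, features, words

line = "IV[TENSE=nPast, NUM=pl, VLT=True, PER=S] -> 'බුරහු' | 'ඇවිදිහු' | 'යහු'"
category, features, words = parse_lexical_production(line)
print(category, features, len(words))
```

A lexicon indexed in this way makes it easy to see that a production which omits a feature, such as GEN in the line used here, leaves that feature unspecified.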
The following parse trees were produced using the Recursive Descent
parser from the NLTK toolkit [3].
----------------------------------Sentence 3--------------------------------
(S[]
(NP[CASE='F1', DEF=?TF, GEN='MA', NUM='pl']
(N[CASE='F1', GEN='MA', NUM='pl'] බල්ෙලෝ))
(VP[GEN=?G, NUM='pl', PER='T', TENSE='pres']
(IV[NUM='pl', PER='T', TENSE='nPast', +VLT] බුරති)))
-----------------------------------Sentence 5--------------------------------
(S[]
(NP[CASE='F1', DEF=?TF, GEN='MA', NUM='sg']
(N[CASE='F1', DEF='TRue', GEN='MA', NUM='sg'] තාත්තා))
(VP[GEN='MA', NUM='sg', PER='T', TENSE='pres']
(NP[CASE='F3', DEF=?TF, GEN='MA', NUM='sg']
(N[CASE='F3', DEF='TRue', GEN='MA', NUM='sg'] පුතාට))
(NP[CASE='F1', DEF=?TF, GEN='NE', NUM='sg']
(N[CASE='F1', -DEF, GEN='NE', NUM='sg'] තෑග්ගක්))
(TV[GEN='MA', NUM='sg', PER='T', TENSE='nPast', +VLT]
ෙදයි)))
--------------------------------------Sentence 9--------------------------------
(S[]
(NP[CASE='F3', GEN=?G, NUM='sg', PER='F']
(PrN[CASE='F3', NUM='sg', PER='F'] මට))
(VP[GEN=?G, NUM='sg', PER=?P, TENSE='pres']
(NP[CASE='F1', DEF=?TF, GEN='NE', NUM='sg']
(N[CASE='F1', -DEF, GEN='NE', NUM='sg']
සින්දුවක්))
(TV[NUM='sg', TENSE='nPast', -VLT] ඇෙසයි)))
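The top-down strategy behind these trees can be sketched with a small recursive descent recognizer. The sketch is a simplification under stated assumptions: it works on category labels rather than words, ignores all features, only accepts or rejects a sequence instead of building a tree, and its rule set is a hypothetical reduction modelled on the three trees above rather than the grammar developed in this work.

```python
# Toy CFG over category symbols (hypothetical, modelled on the trees above).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PrN"]],
    "VP": [["IV"], ["NP", "TV"], ["NP", "NP", "TV"]],
}

def expand(symbols, tags, pos):
    """Yield each input position reachable after deriving symbols from pos."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Non-terminal: try each production top-down, left to right.
        for production in GRAMMAR[first]:
            yield from expand(production + rest, tags, pos)
    elif pos < len(tags) and tags[pos] == first:
        # Terminal category: consume one tag and continue.
        yield from expand(rest, tags, pos + 1)

def recognizes(tags):
    """True if the whole tag sequence can be derived from S."""
    return any(end == len(tags) for end in expand(["S"], tags, 0))

print(recognizes(["N", "IV"]))            # shape of sentence 3
print(recognizes(["N", "N", "N", "TV"]))  # shape of sentence 5
print(recognizes(["TV", "N", "N", "N"]))  # a verb-initial order
```

Because the rules fix the verb in final position, the verb-initial sequence is rejected, which mirrors the word order limitation of a grammar written for a single constituent order.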
sentences two sentences are structured incorrectly and therefore they were excluded
from the grammar. Several sentences were not parsed because of the free word order.
For example, in this grammar ADVP is placed after the NP and before the verb.
Sentences which have ADVP at the beginning were therefore not parsed
by the grammar.
If an inanimate noun occurs in the subject NP, it does not agree in number with the
predicate VP. For example, the sentence ‘මල පිෙපයි’ /malǝ pipeji/ (the flower
blooms) contains a singular NP and a singular VP, while ‘මල් පිෙපයි’ /mal pipeji/
(flowers bloom) contains a plural NP and a singular VP. In Sinhala,
both of these sentences are correct. However, the second type of sentence,
in which number agreement is not observed, is not covered by this grammar. Sentences
which have compound verbs, auxiliary verbs, present participles, past participles,
verbs in the imperative mood, or negated verbs are also not parsed
by this grammar.
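One possible way of expressing the inanimate-subject exception is sketched below. It is not part of the grammar described in this paper: the `ANIMATE` feature and the function name are hypothetical, and the sketch simply drops the number check when the subject is marked as inanimate.

```python
def subject_verb_agree(subject, verb):
    """Check GEN/PER agreement, and NUM only when the subject is animate."""
    features = ["GEN", "PER"]
    if subject.get("ANIMATE", True):
        features.append("NUM")
    return all(
        subject[f] == verb[f]
        for f in features
        if f in subject and f in verb
    )

flowers = {"NUM": "pl", "GEN": "NE", "PER": "T", "ANIMATE": False}
blooms = {"NUM": "sg", "GEN": "NE", "PER": "T"}
dogs = {"NUM": "pl", "GEN": "MA", "PER": "T", "ANIMATE": True}
bark_sg = {"NUM": "sg", "GEN": "MA", "PER": "T"}
print(subject_verb_agree(flowers, blooms), subject_verb_agree(dogs, bark_sg))
```

With this relaxation a plural inanimate subject is accepted with a singular verb, while the same mismatch is still rejected for an animate subject.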
7 Discussion
Free word order
The grammar developed covers the default Sinhala sentence structure in the SOV
order. The first two sentences of Figure 1 are in SOV order, and only they can be
successfully parsed using the grammar developed. The rest of the sentence structures
cannot be parsed using the existing grammar. In natural language processing,
dependency grammars are often used to handle the free word-order problem.
Word segmentation
In written Sinhala there is no unique method for word segmentation. The linguistics
literature reports on collections of rules for segmenting Sinhala words [15]. However
most users of the language are not aware of these rules and do not follow them closely
for word segmentation. For example, the word-ending particle ‘ය’ is often used
inconsistently. The Sinhala language has two types of verbs, namely shudda kriya
‘pure verbs’ and krudantha kriya ‘participial verbs’. When a participial verb occurs in
the sentence-ending position there are two ways to write it. One is to separate the
sentence-ending particle, as in ‘ගිෙය් ය’ “(he) went”; the other is to attach it to the
participial verb, as in ‘ගිෙය්ය’. Owing to this, it is desirable to have a word segmentation
algorithm to check whether the text is in a normalized form before the CFG parser is
employed.
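A normalization step of this kind could look like the following minimal sketch. It is deliberately naive and rests on an assumption that would not hold in general: any detached sentence-final ‘ය’ is attached to the preceding token, without checking that the preceding token is a participial verb. The function name is hypothetical, and the Sinhala strings are given as code points to avoid ambiguity.

```python
PARTICLE = "\u0dba"  # the sentence-ending particle 'ya'

def attach_final_particle(tokens):
    """Join a detached sentence-final particle to the preceding token."""
    if len(tokens) >= 2 and tokens[-1] == PARTICLE:
        return tokens[:-2] + [tokens[-2] + PARTICLE]
    return list(tokens)

# Participial verb "(he) went", written without the final particle.
went = "\u0d9c\u0dd2\u0dba\u0dda"
print(attach_final_particle([went, PARTICLE]))
print(attach_final_particle([went + PARTICLE]))
```

Both spellings are mapped to the same single token, so the lexicon of the CFG needs to list only the attached form.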
References
1. Abhayasinghe, A.A.: Sinhala bhashave sarala vakya vibagaya (1998)
2. Mosaddeque, A.B., Haque, N.: Context-Free Grammar for Bangla. BRAC
University, Dhaka
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text
with the Natural Language Toolkit. O’Reilly Media (2009)
4. Disanayaka, J.B.: Bashavaka rata samudaya. Lake house investment Co. Ltd., Colombo 2
(1969)
5. Fairbanks, G.H., Gair, J.W., Silva, M.W.S.D.: Colloquial Sinhalese. Cornell University,
New York (1968)
6. Gair, J.W., Karunatilaka, W.S.: Literary Sinhala inflected forms: A Synopsis with a
Translation Guide to Sinhala script. Cornell University, New York
7. Gair, J.W., Karunatilaka, W.S.: Literary Sinhala. Cornell University, New York (1974)
8. Gunasekara, A.M.: A Comprehensive Grammar of the Sinhalese Language. Godage
International Publishers (PVT) Ltd. (2008)
9. Hettige, B., Karunananda, A.S.: Computational Model of Grammar for English to Sinhala
Machine Translation. In: Proceedings of the International Conference on Advances in ICT
for Emerging Regions (2011)
10. Jayawardhane, T.: The surface case system in Sinhala. KALYANI, pp. 264–277.
University of Kelaniya (1996)
11. Kariyakarawana, S.M.: The Syntax of Focus and Wh-Questions in Sinhala. Karunaratne &
Sons Ltd. (1998)
12. Karunatilaka, W.S.: Sinhala bhasha vyakaranaya. M. D. Gunasena & Co. Ltd. (2009)
13. Kekulawala, S.L.: The future tense in Sinhalese – an ‘unorthodox’ point of view. Journal
of the Vidyalankara University of Ceylon (1972)
14. Khan, N., Khan, M.: Developing a Computational Grammar for Bengali Using the HPSG
Formalism. In: Proceedings of the 9th International Conference on Computer and
Information Technology, ICCIT 2006 (2006)
15. Rajapaksha, D.: Sinhala bhashave pada bedima saha virama lakshana bhavithaya (2008)
16. Sagar, B.M., Shobha, G., Kumar, R.: Context Free Grammar (CFG) Analysis for simple
Kannada sentences. In: Proceedings of the International Conference [ACCTA-2010] on
Special Issue of IJCCT, vol. 1(2, 3, 4) (2010)
17. Sagar, B.M., Shobha, G., Kumar, R.: Solving the Noun Phrase and Verb Phrase Agreement
in Kannada Sentences. International Journal of Computer Theory and Engineering 1(3)
(August 2009)
18. Wikipedia (English), http://en.wikipedia.org/wiki/Sinhala_language
19. Dasanayaka, A.E.S.: Kumara rachanaya; Grade 4. M. D. Gunasena & Co. Ltd. (1990)
20. Dasanayaka, A.E.S.: Kumara rachanaya; Grade 5. M. D. Gunasena & Co. Ltd. (2005)