A Rule-based Lemmatizing Approach for Sinhala Language

Maheshi Nandathilaka, Supunmali Ahangama, G. Thilini Weerasuriya
Faculty of Information Technology, University of Moratuwa, Moratuwa, Sri Lanka

Abstract—Speech recognition, natural language processing, language translation and deep learning research are bridging the communication gap between humans as well as between humans and machines. Sinhala is a native language of Sri Lanka, used by approximately 19 million people. Sinhala natural language processing tools have developed more slowly than those for European and other Asian languages. A lemmatizer for Sinhala can be used for morphological analysis and is an essential module in Sinhala language processing. Lemmatizing is a complex process in morphological analysis in which the base/root of a word is derived. Little published work focuses on lemmatizing approaches for Sinhala. This paper presents a rule-based lemmatizing approach which can be used to determine the base form of Sinhala words with an accuracy of 77.3%. It differs from similar works because the data used in the research are extracted from social media.

Keywords—Sinhala Morphology, Lemmatization, Inflection, Rule-based, Social media data

I. INTRODUCTION

Morphology is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes [1]. The smallest grammatical unit of a language is known as a ‘morpheme’. It can be a word or a part of a word that has meaning [2]. Two types of morphemes can be identified in a language: inflectional and derivational. The difference between the two categories is that inflectional morphemes do not change the grammatical category, while derivational morphemes often change the part of speech of the word.

The Sinhalese language, also called Sinhala, is one of the two official languages of Sri Lanka, with about 19 million speakers out of a total population of 21 million, and is known as a morphologically rich language. Sinhala is an Indo-Aryan language like Pali, Hindi and Sanskrit [3]. For languages like Sinhala, an in-depth analysis of words needs to be done in order to extract meaning and grammatical information. Lemmatization and stemming are the two approaches used in morphological analysis to derive the base form/root of a word.

In Sinhala, verbs are conjugated by person, mood, number, gender and tense. A Sinhala base noun can be inflected into 27 forms. A single root word can therefore have multiple forms, making further processing harder without knowledge of the morphological rules of the Sinhala language. Morphological rules are considered when lemmatizing. Thus, a lemmatizing approach is more suited to Sinhala base form retrieval than a stemming approach. Researchers have used lemmatizing when developing Sinhala natural language processing tools, but not much information is available on the approaches they followed.

This paper presents a method to derive base forms of Sinhala words through an inflectional analysis using a rule-based approach. During this research, a list of suffixes was developed to analyze social media content. Furthermore, rules were designed to derive the base form of a given word according to its part of speech. The rest of the paper covers similar approaches, the linguistic background of Sinhala, the proposed method, its evaluation and the conclusion.

II. RELATED WORK

The Lovins stemmer, developed in 1968 by Julie Beth Lovins, is credited as the first stemmer [5]. Later, the Lovins stemmer was further improved by Martin Porter for the English language [4]. The Porter algorithm is considered one of the most accepted methods for stemming, in which affixes are removed automatically from English words. Müller et al. considered the strong mutual dependency between lemmatizing a form in context and disambiguating its part of speech (POS) and morphological attributes, and developed the tool ‘LEMMING’. LEMMING is a token-based statistical lemmatization tool that works for six European languages [6].

In contrast to European languages, there is a dearth of studies that have developed natural language processing (NLP) applications for the Sinhala language, since Sri Lanka is the only country where Sinhala is spoken. This reduces the commercial interest in developing Sinhala NLP applications on a global scale [7].

Indo-Aryan languages are used in the South Asian region by more than 800 million people. The main difference between European languages and Indo-Aryan languages is the richness and complexity of Indo-Aryan morphologies. Therefore, the process of finding the base form or dictionary form in Indo-Aryan languages differs from that in European languages. Though stemming techniques work well for European languages, they are less effective for Indo-Aryan languages such as Bengali and Hindi [9].

Paul et al. developed a lemmatizing algorithm for Hindi using a rule-based approach, creating a knowledge base which contains common Hindi words [9]. Their approach aimed to optimize time and generate an accurate result in a very short period, and has shown 91% accuracy. A computational grammar of Sinhala for English-Sinhala machine translation is a study carried out by Hettige et
al., through which a ‘Sinhala Morphological Generator’ was developed to fulfill the requirement of generating base forms of nouns and verbs [8]. The Sinhala language contains many conjugation forms for nouns and verbs. Of these, Hettige’s BEES Sinhala Morphological Generator handles 85 grammar rules for Sinhala nouns and 36 grammar rules for Sinhala verbs [8]. Fundamentals of Sinhala grammar such as ‘Prakurthi’, ‘Nama’ and ‘Kriyagana’ are used to implement these rules. In another study, a shallow stemming algorithm for Sinhala was evaluated and reached an accuracy of 56.04% [10]. The algorithm was evaluated using predefined lexical roots.

Hindi and Bengali are morphologically rich languages that can be considered siblings of Sinhala, as all these languages have evolved from Sanskrit. Das et al. proposed a stemming approach that removes affixes to generate base forms for the Bengali language [11]. It has shown satisfactory performance and can be considered when developing a lemmatizing algorithm for Sinhala.

III. SINHALA MORPHOLOGY

Sinhala is one of several morphologically rich languages for which there are currently no completed repositories such as WordNet or subjective lexicons. Sinhala has two varieties: literary and spoken forms. In this research, we focus on posts published on social media. Social media has been characterized as platforms on which people communicate in an almost speech-like manner, and hence is more informal. Language use in social media differs significantly between communities, as is the case in other media [12]. On social media people tend to use more emotive and personal language. Language processing techniques designed for data in a formal/general format may not provide accurate results when used to process this type of data. A domain-specific study is required to address this issue.

In the study of morphology, which is concerned with the structure of words, a distinction has traditionally been drawn between two types of affixes, inflectional and derivational. Inflection is often defined as a type of affix that distinguishes grammatical forms of the same lexeme, while derivation refers to an affix that indicates a change of grammatical category [13]. Table I shows an example of inflected and derived forms.

TABLE I. SINHALA WORD INFLECTION AND DERIVATION

Base word | Inflected forms | Derived forms
හොඳ (good) (adjective) | හොඳට (For good) | නොහොඳ (Bad)
 | හොඳින් (In a good manner) | නොහොඳින් (In a bad manner)
 | හොඳම (The best) | නොහොඳම (The worst)

The Sinhala noun is a word class that covers the noun, pronoun and adjective of the English language. The Sinhala noun has four types of inflection: gender, number, person and case.

27 forms of a noun can be generated by inflecting a single root word, called ‘prakurthi’ in Sinhala. This inflecting is called ‘Nama varanagilla - නාම වරනැගිල්ල’ (word conjugation). Sinhala has more than a hundred rules for conjugating a noun from a given base form (Prakurthi - ප්‍රකෘති) [14]. The conjugation of a masculine noun by number (singular, plural) and case is shown in Table II.

TABLE II. WORD CONJUGATION OF A SINHALA MASCULINE COMMON NOUN

Case | Masculine singular | Masculine plural
Nominative | නායකයා | නායකයෝ
Accusative | නායකයා | නායකයන්
Dative | නායකයාට | නායකයන්ට
Ablative | නායකයාගෙන් | නායකයන්ගෙන්
Genitive | නායකයාගේ | නායකයන්ගේ
Vocative | නායකයනි | නායකයනි

There are 15 patterns identified for Sinhala nouns. Nouns belonging to the same pattern show similar characteristics when conjugated. These patterns are called ‘Gana’ (ගණ).

More than 18 inflected forms are available for a Sinhala base verb, covering inflection by tense, number and person. From the morphological point of view, a verb contains two parts: a base verb and a suffix. The base verb is a prakurthi (ප්‍රකෘති), and it is named kriya prakurthi (ක්‍රියා ප්‍රකෘති). Different verb forms are generated by adding different suffixes to the ‘kriya prakurthi’ [14]. Table III shows verb conjugation in Sinhala based on tense and person (first, second and third person).

TABLE III. VERB CONJUGATION IN SINHALA

Person | Number | Active present | Active past | Active future
First | Singular | කරමි | කලෙමි | කරන්නෙමි
First | Plural | කරමු | කළෙමු | කරන්නෙමු
Second | Singular | කරහි | කලෙහි | කරන්නෙහි
Second | Plural | කරහු | කලෙහු | කරන්නෙහු
Third | Singular | කරයි | කළේය | කරන්නේය
Third | Plural | කරති | කලෝය | කරන්නෝය

In this research, we focus only on finding the base word of Sinhala words which are in inflected forms.

IV. PROPOSED METHOD

The aim of this paper is to propose a method to develop a lemmatizer for the Sinhala language. Though there are many lemmatizing methods, the rule-based approach has shown more accuracy than others. The list of suffixes published by the Language Technology Research Laboratory, University of Colombo is used in this research.
A. Approach

• Stemming: Input word (studies) → Remove suffix (-ies) → Return stem (studi)
• Lemmatizing: Input word (studies) → Analyze morphological information → Return lemma (study)

Fig. 1. Difference between stemming and lemmatizing processes

Base form retrieval for European languages such as English is done using stemming or lemmatizing algorithms. The core functionality of both is the same: each reduces a given word to a base form, the ‘stem’ in stemming and the ‘lemma’ in lemmatizing, as shown in Fig. 1. The difference between the methods is that the part of speech/context of the word is not considered in stemming, while lemmatizers consider the context of the word [15].

As explained above, Sinhala is a morphologically rich language for which accurate base forms cannot be derived using stemming. Therefore, the authors of this paper followed a lemmatizing approach.

The proposed approach is as follows:

1. Input text
2. Check the part of speech
3. Check the suffix
4. Check the pre-base word (the pre-base word is the word after removing the suffix)
5. Check for the rule that satisfies all the conditions (conditions include having specific characters at the end, ‘pillam’ in the word, etc.)
6. Apply the relevant rule
7. Return base form

Correct grammar and formatting cannot be expected in texts published on social media. Therefore, existing lemmatizing approaches that check the grammar, tense or case of a sentence will not provide accurate outcomes on this data. Both POS tagging and lemmatization consider the context in which a word is used. This approach, which is based on the dependency between POS and lemmatization, is proposed to overcome the above issue.

B. Preprocessing

Raw text has to be preprocessed prior to the main process. In this research, we used phrases in Sinhala which were published on social media, mainly on Facebook, as the input dataset. The initial dataset contained 200 posts. The dataset was studied to identify how the inflection of words differs from other media.

As the first step of preprocessing, extracted texts were cleaned. Posts published on social media are in free form and have no guarantee of correct spelling and grammar. Though we extracted posts written in Sinhala, there were words written in English too. Before proceeding further, those texts were translated into Sinhala using the ‘Ingiya English-Sinhala Dictionary’ published by the Language Technology Research Lab, University of Colombo.

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens [16]. For this research, phrases were tokenized into sentences using the Python Natural Language Toolkit.

Part-of-speech (POS) tagging is the process of assigning a POS or other lexical class marker to each word in a sentence [17]. The input to a tagging algorithm is a string of words and a tag set. The output is a single best tag for each word. The POS-tagged Sinhala corpus of the University of Colombo was used as the training dataset for the perceptron tagger used in this research. As the output, phrases are returned with each word's corresponding tag according to the LTRL/UCSC POS tag set for Sinhala.

Stop words are a class of words in natural language. Stop words are removed from a text to make it lighter and to keep the content that matters to analysts. Removing stop words reduces the dimensionality of the term space [18].

It was found that Sinhala satisfies Zipf’s law. Zipf’s law states that, given a text corpus, if f is the word count and r is the rank of a word when words are sorted by word count, then

$$ f \cdot r \approx t $$

where t is approximately constant. This means that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. [19]

[ද, සේ, ය, ◌ා, ඒ, ම, ඇත, බව, ද, වන] are identified as the top ten stop words used in the Sinhala language [20]. In addition to these words, another set of stop words was retrieved from our dataset after considering their frequencies and importance to sentences. ඒක, බං, ඕයි, වගේ, අනේ, හැටි, මෙහෙම are some of the 20 words identified as stop words from the dataset.

After performing the steps mentioned above, the results given below are achieved.

මලිංග/NNPA ලෝකයේ/NNN ඉන්න/VFM අංක/NNN එකේ/QFNUM වේග/JJ පන්දු/NNN යවන්නෙක්./NNM

Tags shown in the above output [21]:

NNPA - Proper Noun Animate
NNN - Common Noun Neuter
VFM - Verb Finite Main
QFNUM - Number Quantifier
JJ - Adjective
NNM - Common Noun Masculine

Each word is sent to the relevant lemmatizing algorithm according to its POS tag.

C. Suffix Identification

The suffix list published by the University of Colombo, which contains 413 Sinhala suffixes, was used as the reference when developing separate suffix lists for masculine nouns, feminine nouns, pronouns, neuter nouns and verbs. The suffix retrieval algorithm is given below:
1. Take the word as input
2. Calculate its word length and derive the initial suffix length k from it
3. Traverse the word from the end until a suffix is matched from the relevant suffixes list (see Table IV)

    def find_suffix(word, suffixes_list):
        # Try the longest candidate suffix first, always leaving a non-empty pre-base word.
        k = len(word) - 1
        while k >= 1:
            pre_base_word = word[:-k]
            suffix = word[-k:]
            if suffix in suffixes_list:
                return suffix, pre_base_word
            k = k - 1
        return "", word  # no suffix matched

TABLE IV. EXAMPLE OF SUFFIX IDENTIFICATION

Word | Suffix | Pre-base word
අලියෙක් | යෙක් | අලි
මිනිසෙක්ගේ | ◌ෙක්ගේ | මිනිස
කරමින් | මින් | කර
කෙල්ලට | ට | කෙල්ල
නටන්නේ | ◌්නේ | නටන

D. Rules

After the suffix is eliminated, the pre-base word is obtained (see Table V). The pre-base word may or may not be the exact base word. Most words require further processing of the pre-base word to derive the exact base form. Rules are used for this further processing.

After deriving the suffix of the word, the pre-base word is matched with the relevant rule to output the base word. The rules include adding or removing characters (letters or ‘pillam’).

30 rules were created for the lemmatizing process of Sinhala nouns. When creating the rules, more attention was paid to nouns which can be categorized as informal/impolite or hateful. Most of these words are coined by people themselves and spread through social media. Those words do not follow the general rules of Sinhala. The rules needed to be developed in such a manner that these words get lemmatized too.

TABLE V. APPLICATION OF RULES

Word | Suffix | Pre-base word | Rule: remove | Rule: add | Base word
ඇතුන්ට | ◌ුන්ට | ඇත | - | ◌් | ඇත්
කපුටන්ට | න්ට | කපුට | - | ◌ු | කපුටු
වස්සන්ට | න්ට | වස්ස | ◌්ස (last two characters) | ◌ු | වසු
අලියන්ට | යන්ට | අලි | - | - | අලි
ලෙඩ්ඩුන්ට | ◌ුන්ට | ලෙඩ්ඩ | ◌්ඩ (last two characters) | - | ලෙඩ
ගැහැනියට | ◌ියට | ගැහැන | - | ◌ු | ගැහැනු
තරුනියට | ◌ියට | තරුන | - | ◌ි | තරුනි
මැහැල්ලට | ල්ලට | මැහැ | - | ලි | මැහැලි
පොතකට | කට | පොත | - | ◌් | පොත්
භාෂාවකට | වකට | භාෂා | - | - | භාෂා
පිල්ලකට | කට | පිල්ල | ◌්ල (last two characters) | ◌ි | පිලි
අත්තකට | කට | අත්ත | ◌්ත (last two characters) | ◌ු | අතු

Table V shows that even when the eliminated suffixes are the same, the rules that are applied differ according to the nature of the pre-base word.

The suffix ‘න්ට’ is used in both words, ‘කපුටන්ට’ and ‘වස්සන්ට’. In the algorithm, however, the nature of the pre-base word is checked. In this case, if the second character from the end of the pre-base word is a ‘◌ු’ or a ‘◌ූ’, a ‘◌ු’ character is added at the end of the pre-base word.

Otherwise, if the second character from the end of the pre-base word is a ‘◌්’, the last two (2) characters are removed and a ‘◌ු’ is added at the end.

කපුටන්ට → කපුටු, වස්සන්ට → වසු
බමුණන්ට → බමුණු, කොල්ලන්ට → කොලු

Another case is that both the words ‘ඇතුන්ට’ and ‘ලෙඩ්ඩුන්ට’ have the suffix ‘◌ුන්ට’, but the rules applied to them are not the same. There are specific letters in Sinhala which cannot carry the ‘◌්’ when the letter is at the end of a word, such as ‘ග’, ‘ච’, ‘ඡ’, ‘ට’, ‘ඩ’, ‘ද’, ‘ප’. The pre-base word of ‘ලෙඩ්ඩුන්ට’ is ‘ලෙඩ්ඩ’, which has ‘ඩ’ as its last letter. Adding a ‘◌්’ at the end is against the rules of Sinhala. Therefore, words of this type follow a different rule.

ඇතුන්ගේ → ඇත්, බඹරුන්ගේ → බඹර
මිනිසාගේ → මිනිස්, මොණරාගේ → මොණර

V. EVALUATION

The approach was evaluated using 300 words extracted from Facebook.

The dataset consisted of phrases such as:

“අපි කොහෙටහරි ටීෂර්ට් එකක් ගහගෙන, ඩෙනිමක් ගහගෙන යන්න ගියොත් තමා මැරෙන්න හදන්නෙ මෙතන. අපිට මෙච්චර සල්ලි නෑනෙ ඉතිං”
“මෙහෙම ගියොත් ලෝක සිතියමෙනුත් මේ රට අයින් කරල දානව අපිව”
“ඇත්ත කතාව. සිස්ටම් එක වෙනස් කරන්න ආස නම් ඒකට සෑහෙන්න බලයක්, හැකියාවක් තියාගෙන කරන්න ඕන”
“මූ බයිසිකල් පිස්සෙක්, සාමාන්‍යෙන් ඔය වයසෙ කොල්ලොන්ට ඒවගේ පිස්සු තියන එක සාමාන්‍ය දෙයක්, නමුත් ඒ බයිසිකලයක් නිසා මේ කොල්ලට අද ඉන්න වෙලා තියෙන්නෙ ICU එකක ඇදක් උඩ”
“කොළඹ වටේම බස්වල රස්තියාදු වෙලා කරපු එල කොල්ලෙක් තමා ඉතිං”

The dataset is characterized by many Sinhala words which cannot be found in dictionaries or other formal written sources, and by the absence of correct grammar usage.

For the evaluation, the preprocessed dataset was tagged using the trained perceptron tagger. The proposed algorithm was applied to the tagged dataset.

From the experimental dataset, 68 words produced incorrect results.

The accuracy of the approach was calculated as 77.33% (232 of the 300 words lemmatized correctly) using the following equation.

$$ \text{Accuracy} = \frac{\text{Number of correctly lemmatized words}}{\text{Total number of words}} \times 100\% $$
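As a rough illustration of the pipeline evaluated here, the sketch below strips a suffix, applies the two rules quoted in Section IV-D for the suffix ‘න්ට’, and computes accuracy from the word counts reported in Section V. The names `NOUN_SUFFIXES`, `split_suffix`, `apply_nta_rule`, `lemmatize` and `accuracy_from_counts` are hypothetical, and the one-entry suffix set is an assumption standing in for the published 413-suffix list. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
# Sketch only: all names and the one-entry suffix set are illustrative assumptions.
HAL = "\u0dca"               # '◌්'
PAPILLA = "\u0dd4"           # '◌ු'
DIGA_PAPILLA = "\u0dd6"      # '◌ූ'
NTA = "\u0db1\u0dca\u0da7"   # the suffix 'න්ට'

NOUN_SUFFIXES = {NTA}        # assumption: stands in for the full suffix list


def split_suffix(word, suffixes):
    """Return (pre_base_word, suffix), trying the longest suffix first."""
    for k in range(len(word) - 1, 0, -1):
        if word[-k:] in suffixes:
            return word[:-k], word[-k:]
    return word, ""


def apply_nta_rule(pre_base):
    """Apply the two rules quoted in the text for the suffix 'න්ට'."""
    if len(pre_base) < 2:
        return pre_base
    second_last = pre_base[-2]
    if second_last in (PAPILLA, DIGA_PAPILLA):
        return pre_base + PAPILLA        # e.g. කපුට -> කපුටු
    if second_last == HAL:
        return pre_base[:-2] + PAPILLA   # e.g. වස්ස -> වසු
    return pre_base


def lemmatize(word):
    """Lemmatize a noun; only the 'න්ට' suffix is handled in this sketch."""
    pre_base, suffix = split_suffix(word, NOUN_SUFFIXES)
    return apply_nta_rule(pre_base) if suffix == NTA else word


def accuracy_from_counts(total, incorrect):
    """Accuracy in percent: correctly lemmatized words over total words."""
    return 100.0 * (total - incorrect) / total


if __name__ == "__main__":
    for w in ["කපුටන්ට", "වස්සන්ට", "බමුණන්ට"]:
        print(w, "->", lemmatize(w))
    print(round(accuracy_from_counts(300, 68), 2))  # 77.33
```

A fuller version would dispatch on the POS tag to pick the suffix list and rule set, which is why a tagging error propagates directly into the lemma.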
Comparison of the results with other existing approaches for the Sinhala language is limited, since the required details and the manually created, social-media-specific lexicons needed to apply those algorithms are not published.

It was found that the accuracy of the POS tagger directly affects the accuracy of the final output. This is because the applicable rule set for a word is chosen based on the word’s POS tag. If the word is tagged incorrectly, irrelevant rules may be applied to the word, producing an incorrect result.

VI. CONCLUSION AND FURTHER WORK

This paper describes the development of a lemmatizer for the Sinhala language, which is important for tasks related to natural language processing. A rule-based approach is proposed and has so far been implemented for nouns, including masculine, feminine and neuter nouns and pronouns. It works only for inflected forms.

As further improvements, this algorithm can be extended to verbs and derived forms of words. To increase the accuracy of the lemmatizer, a dictionary with lemmas of common Sinhala words can be used in future.

REFERENCES

[1] "Morphology", Encyclopedia Britannica, 2016. [Online]. Available: https://www.britannica.com/topic/morphology-linguistics/. [Accessed: 11-Nov-2018].
[2] "What are Morphemes? | SEA - Supporting English Acquisition", Ntid.rit.edu, 2018. [Online]. Available: https://www.ntid.rit.edu/sea/processes/wordknowledge/grammatical/whatare. [Accessed: 11-Nov-2018].
[3] "Sinhalese language", Encyclopedia Britannica, 2018. [Online]. Available: https://www.britannica.com/topic/Sinhalese-language. [Accessed: 11-Nov-2018].
[4] M. Porter, "An algorithm for suffix stripping", Program, Vol. 14, No. 3, pp. 130-137, 1980.
[5] J. B. Lovins, "Development of a Stemming Algorithm", Mechanical Translation and Computational Linguistics, Vol. 11, No. 1, pp. 22-23, 1968.
[6] T. Müller, R. Cotterell, A. Fraser and H. Schütze, "Joint Lemmatization and Morphological Tagging with Lemming", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[7] I. Wijesiri, B. Gunathilaka, D. Wimalasuriya, R. Paranavithana, M. Gallage, M. Lakjeewa, G. Dias and N. de Silva, "Building a WordNet for Sinhala", 7th Global Wordnet Conference, 2014.
[8] B. Hettige and A. S. Karunananda, "A Computational grammar of Sinhala for English-Sinhala machine translation", Proc. of 2011 Int. Conf. on Advances in ICT for Emerging Regions (ICTer), pp. 26-31, 2011.
[9] S. Paul, I. Mathur and N. Joshi, "Development of a Hindi Lemmatizer", International Journal of Computational Linguistics and Natural Language Processing, Vol. 2, Issue 5, 2013.
[10] V. Welgama, "Evaluation of a shallow stemming algorithm for Sinhala", Poster in Sri Lanka Student Workshop on Computer Science, 2011.
[11] S. Das and P. Mitra, "A rule-based approach of stemming for inflectional and derivational words in Bengali", Students' Technology Symposium (TechSym), 2011 IEEE, 2011.
[12] C. Paris, P. Thomas and S. Wan, "Differences in language and style between two social media communities", 6th AAAI International Conference on Weblogs and Social Media, 2012.
[13] "Linguistics for laypeople: inflection vs derivation", Quirky Case, 2013. [Online]. Available: https://quirkycase.wordpress.com/2013/04/17/linguistics-for-laypeople-inflection-vs-derivation/. [Accessed: 11-Nov-2018].
[14] J. B. Disanayake, "BasakaMahima 6: Prakurthi", Colombo 10, Sri Lanka: S. Godage and Brothers, 2000.
[15] A. G. Jivani, "A Comparative Study of Stemming Algorithms", Int. J. Comp. Tech. Appl., Vol. 2, No. 6, pp. 1930-1938, 2011.
[16] A. M. Gunasekara, "A Comprehensive Grammar of the Sinhalese Language", Asian Educational Services, New Delhi, Madras, India, 1999.
[17] K. Soman, "Parts of Speech Tagging for Indian Languages: A Literature Survey", International Journal of Computer Applications (0975-8887), Vol. 34, No. 8, pp. 22-29, 2011.
[18] S. Vijayarani, J. Ilamathi, Nithya, "Preprocessing Techniques for Text Mining - An Overview", International Journal of Computer Science & Communication Networks, Vol. 5(1), pp. 7-16.
[19] "Zipf's law", Encyclopedia Britannica, 2013. [Online]. Available: tps://www.britannica.com/topic/Zipfs-law. [Accessed: 14-Nov-2018].
[20] S. Gallage, "Analysis of Sinhala Using Natural Language Processing Techniques", University of Wisconsin.
[21] "Tools & Resources | Language Technology Research Lab", Ltrl.ucsc.lk, 2018. [Online]. Available: http://ltrl.ucsc.lk/tools-and-resourses/. [Accessed: 11-Nov-2018].

978-1-5386-4417-1/18/$31.00 ©2018 IEEE
