A Rule-Based Lemmatizing Approach For Sinhala Language
Maheshi Nandathilaka, Supunmali Ahangama, G. Thilini Weerasuriya
Faculty of Information Technology, University of Moratuwa, Moratuwa, Sri Lanka
Abstract—Speech recognition, natural language processing, language translation and deep learning research are bridging the communication gap between humans as well as between humans and machines. Sinhala is a native language of Sri Lanka that is used by approximately 19 million people. Natural language processing tools for Sinhala have grown more slowly than those for European and other Asian languages. A lemmatizer for Sinhala can be used for morphological analysis and is an essential module in Sinhala language processing mechanisms. Lemmatizing is a complex process in morphological analysis in which the base/root forms of words are derived. Little work has been published on lemmatizer approaches for Sinhala. This paper presents a rule-based lemmatizing approach which can be used to determine the base form of Sinhala words with an accuracy of 77.3%. It differs from similar works because the data used in the research are extracted from social media.

Keywords—Sinhala Morphology, Lemmatization, Inflection, Rule-based, Social media data

I. INTRODUCTION

Morphology is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes [1]. The smallest grammatical unit of a language is known as a ‘morpheme’. It can be a word or a part of a word that has meaning [2]. Two types of morphemes can be identified in a language: inflectional and derivational. The difference between the two categories is that inflectional morphemes do not change the grammatical category, while derivational morphemes often change the part of speech of the word.

The Sinhalese language, also called Sinhala, is one of the two official languages of Sri Lanka, with about 19 million speakers out of a total population of 21 million, and is known as a morphologically rich language. Sinhala is an Indo-Aryan language like Pali, Hindi and Sanskrit [3]. For languages like Sinhala, an in-depth analysis of words needs to be done in order to extract the meaning and grammatical information. Lemmatization and stemming are the two approaches used in morphological analysis to derive the base form/root of a word.

In Sinhala, verbs are conjugated by person, mood, number, gender and tense. A Sinhala base noun can be inflected into 27 forms. Because a single root word can have multiple forms, it is hard to process further without knowledge of the morphological rules of the Sinhala language. Morphological rules are considered when lemmatizing. Thus, a lemmatizing approach is better suited to Sinhala base form retrieval than a stemming approach. Researchers have used lemmatizing when developing Sinhala Natural Language Processing tools, but little information is available on the approaches they followed.

This paper presents a method to derive the base forms of Sinhala words through inflectional analysis using a rule-based approach. During this research, a list of suffixes was developed to analyze social media content. Furthermore, rules were designed to derive the base form of a given word according to its part of speech. The rest of the paper covers similar approaches, the linguistic background of Sinhala and the conclusion.

II. RELATED WORK

The Lovins stemmer, developed in 1968 by Julie Beth Lovins, is credited as the first stemmer [5]. The Lovins stemmer was later improved by Martin Porter for the English language [4]. The Porter algorithm is considered one of the most accepted methods for stemming, in which affixes are removed automatically from English words. Müller et al. considered the strong mutual dependency between lemmatizing a form in context and disambiguating its part of speech (POS) and morphological attributes, and developed the tool ‘LEMMING’. LEMMING is a token-based statistical lemmatization tool that works for six European languages [6].

In contrast to European languages, there is a dearth of studies that have developed natural language processing (NLP) applications for the Sinhala language, since Sri Lanka is the only country where Sinhala is spoken. This reduces the commercial interest in developing Sinhala NLP applications on a global scale [7].

Indo-Aryan languages are used in the South Asian region by more than 800 million people. The main difference between European languages and Indo-Aryan languages is the richness and complexity of Indo-Aryan morphology. Therefore, the process of finding the base form or dictionary form of a word in an Indo-Aryan language differs from that for European languages. Though stemming techniques work well for European languages, they are less effective for Indo-Aryan languages such as Bengali and Hindi [9].

Paul et al. developed a lemmatizing algorithm for Hindi using a rule-based approach, creating a knowledge base which contains common Hindi words [9]. Their approach aimed to optimize time and generate an accurate result in a very short period, and showed 91% accuracy. A computational grammar of Sinhala for English-Sinhala machine translation is a research carried out by Hettige et
• Stemming: Input word (studies) → Remove suffix (-ies) → Return stem (studi)
• Lemmatizing: Input word (studies) → Analyze morphological information → Return lemma (study)

Fig. 1 Difference between Stemming and Lemmatizing processes

Base form retrieval for European languages such as English is done with stemming or lemmatizing algorithms. The core functionality of both is the same: each method trims a given word, to its ‘stem’ in stemming and to its ‘lemma’ in lemmatizing, as shown in Fig. 1. The difference between the methods is that stemming does not consider the part of speech or context of the word, while lemmatizers do [15].

As explained above, Sinhala is a morphologically rich language for which accurate base forms cannot be derived using stemming. Therefore, the authors of this paper followed a lemmatizing approach.

The proposed approach is as follows:

1. Input text
2. Check the part of speech
3. Check the suffix
4. Check the pre-base word (the word that remains after removing the suffix)
5. Check for the rule that satisfies all the conditions (conditions include having specific characters at the end, ‘pillam’ in the word, etc.)
6. Apply the relevant rule
7. Return base form

Texts published on social media rarely follow correct grammar and formats. Therefore, applying existing lemmatizing approaches that check the grammar, tense or case of a sentence to this data will not provide accurate outcomes. POS tagging and lemmatization both consider the context in which a word is used. The proposed approach, which is based on the dependency between POS tagging and lemmatization, is intended to overcome this issue.

B. Preprocessing

Raw text has to be preprocessed prior to the main process. In this research, we used Sinhala phrases published on social media, mainly on Facebook, as the input dataset. The initial dataset contained 200 posts. The dataset was studied to identify how the inflection of words differs from other media.

As the first step of preprocessing, the extracted texts were cleaned. Posts published on social media are free-form and have no guarantee of correct spelling and grammar. Though we extracted posts written in Sinhala, they also contained words written in English. Before proceeding further, those texts were translated into Sinhala using the

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens [16]. For this research, phrases were tokenized into sentences using the Python Natural Language Toolkit.

Part-of-speech (POS) tagging is the process of assigning a POS or other lexical class marker to each word in a sentence [17]. The input to a tagging algorithm is a string of words and a tag set. The output is a single best tag for each word. The POS-tagged Sinhala corpus of the University of Colombo was used as the training dataset for the perceptron tagger used in this research. As the output, phrases can be retrieved with their corresponding tags according to the LTRL / UCSC POS tag set for Sinhala.

Stop words are a category of words in natural language. They are removed from a text with the intention of making the text lighter and leaving the terms that matter more to analysts. Removing stop words reduces the dimensionality of the term space [18].

It was found that Sinhala is a language which satisfies Zipf's law. Zipf's law states that, given a text corpus, if f is the word count and r is the rank of a word when sorted by word count, then

$$f \times r \approx \text{constant}$$

This means that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. [19]

[ද, ෙසේ, ය, ◌ා, ඒ, ම, ඇත, බව, ද, වන] are identified as the top ten stop words used in the Sinhala language [20]. In addition to these words, another set of stop words was retrieved from our dataset after considering their frequencies and importance to sentences. ඒක, බං, ඕයි, වෙග්, අෙන්, හැටි, ෙමෙහම are some of the 20 words that were identified as stop words from the dataset.

After performing the steps mentioned above, the results given below are obtained.

මලිංග/NNPA ෙලෝකෙය්/NNN ඉන්න/VFM අංක/NNN එෙක්/QFNUM ෙව්ග/JJ පන්දු/NNN යවන්ෙනක්./NNM

Tags shown in the above output:

NNPA - Proper Noun Animate
NNN - Common Noun Neuter
VFM - Verb Finite Main
QFNUM - Number Quantifier
JJ - Adjective
NNM - Common Noun Masculine [21]

Each word is then routed to the relevant lemmatizing algorithm according to its POS tag.

C. Suffix Identification

The list of 413 Sinhala suffixes published by the University of Colombo was consulted when developing separate suffix lists for masculine nouns, feminine nouns, pronouns, neuter nouns and verbs. The suffix retrieval algorithm is given below:
1. Take the word as input
2. Calculate its word length as k
3. Traverse the word from the end until a suffix is matched from the relevant suffixes list (see Table IV).

    def find_suffix(word, nouns_suffixes_list):
        k = len(word)                      # word length
        while k > 0:                       # shorten the candidate suffix one character at a time
            pre_base_word = word[:-k]      # part of the word before the candidate suffix
            suffix = word[-k:]             # last k characters
            if suffix in nouns_suffixes_list:
                return suffix, pre_base_word
            k = k - 1
        return None, word                  # no listed suffix matched

TABLE IV. EXAMPLE OF SUFFIX IDENTIFICATION

Word | Suffix | Pre-base word
අලිෙයක් | ෙයක් | අලි
මිනිෙසක්ෙග් | ෙ◌ක්ෙග් | මිනිස
කරමින් | මින් | කර
ෙකල්ලට | ට | ෙකල්ල
නටන්ෙන් | ◌්ෙන් | නටන

D. Rules

After the suffix is eliminated, the pre-base word is obtained (see Table V). The pre-base word may or may not be the exact base word. Most words require further processing of the pre-base word to derive the exact base form. Rules are used for this further processing.

After deriving the suffix of the word, the pre-base word is matched with the relevant rule to output the base word. The rules include adding or removing characters (letters or ‘pillam’).

Thirty rules were created for lemmatizing Sinhala nouns. When creating the rules, more attention was paid to nouns which can be categorized as informal, impolite or hateful. Most of these words are coined by people themselves and spread through social media. Those words do not follow the general rules of Sinhala, so the rules had to be developed in a manner that lemmatizes such words too.

TABLE V. APPLICATION OF RULES

Word | Suffix | Pre-base word | Rule: Remove | Rule: Add | Base word
ඇතුන්ට | ◌ුන්ට | ඇත | - | ◌් | ඇත්
කපුටන්ට | න්ට | කපුට | - | ◌ු | කපුටු
වස්සන්ට | න්ට | වස්ස | ◌්ස (last two characters) | ◌ු | වසු
අලියන්ට | යන්ට | අලි | - | - | අලි
ෙලඩ්ඩුන්ට | ◌ුන්ට | ෙලඩ්ඩ | ◌්ඩ (last two characters) | - | ෙලඩ
ගැහැනියට | ◌ියට | ගැහැන | - | ◌ු | ගැහැනු
තරුනියට | ◌ියට | තරුන | - | ◌ි | තරුනි
මැහැල්ලට | ල්ලට | මැහැ | - | ලි | මැහැලි
ෙපොතකට | කට | ෙපොත | - | ◌් | ෙපොත්
භාෂාවකට | වකට | භාෂා | - | - | භාෂා
පිල්ලකට | කට | පිල්ල | ◌්ල (last two characters) | ◌ි | පිලි
අත්තකට | කට | අත්ත | ◌්ත (last two characters) | ◌ු | අතු

Table V shows that even when the eliminated suffixes are the same, the rules that are applied differ according to the nature of the pre-base word.

The suffix ‘න්ට’ is used in both of the words ‘කපුටන්ට’ and ‘වස්සන්ට’, but the algorithm checks the nature of the pre-base word. In this case, if the second character from the end of the pre-base word is a '◌ු' or a '◌ූ', a '◌ු' character is added at the end of the pre-base word.

Otherwise, if the second character from the end of the pre-base word is a ‘◌්’, the last two (2) characters are removed and a '◌ු' is added at the end.

කපුටන්ට → කපුටු        වස්සන්ට → වසු
බමුණන්ට → බමුණු        ෙකොල්ලන්ට → ෙකොලු

Another case is that both of the words ‘ඇතුන්ට’ and ‘ෙලඩ්ඩුන්ට’ have the suffix ‘◌ුන්ට’, but the rules applied to them are not the same. There are specific letters in Sinhala, such as 'ග', 'ච', 'ඡ', 'ට', 'ඩ', 'ද', 'ප', which cannot carry the ‘◌්’ when the letter is at the end of a word. The pre-base word of ‘ෙලඩ්ඩුන්ට’ is ‘ෙලඩ්ඩ’, which has ‘ඩ’ as its last letter. Adding a ‘◌්’ at the end is against the rules of Sinhala. Therefore, words of this type follow a different rule.

ඇතුන්ෙග් → ඇත්        බඹරුන්ෙග් → බඹර
මිනිසාෙග් → මිනිස්        ෙමොණරාෙග් → ෙමොණර

V. EVALUATION

The approach was evaluated using 300 words extracted from Facebook.

The dataset consisted of phrases such as:

“අපි ෙකොෙහටහරි ටීෂර්ට් එකක් ගහෙගන, ෙඩනිමක් ගහෙගන යන්න ගිෙයොත් තමා මැෙරන්න හදන්ෙන ෙමතනඅපිට ෙමච්චර සල්ලි . නෑෙන ඉතිං”
“ෙමෙහම ගිෙයොත් ෙලෝක සිතියෙමනුත් ෙම් රට අයින් කරල දානව අපිව”
“ඇත්ත කතාව. සිස්ටම් එක ෙවනස් කරන්න ආස නම් ඒකට සෑෙහන්න බලයක් , හැකියාවක් තියාෙගන කරන්න ඕන”
“මූ බයිසිකල් පිස්ෙසක්, සාමාෙන්යන් ඔය වයෙස ෙකොල්ෙලොන්ට ඒවෙග් පිස්සු තියන එක සාමාන්ය ෙදයක්, නමුත් ඒ බයිසිකලයක් නිසා ෙම් ෙකොල්ලට අද ඉන්න ෙවලා තිෙයන්ෙන ICU එකක ඇදක් උඩ”
“ෙකොළඹ වෙට්ම බස්වල රස්තියාදු ෙවලා කරපු එල ෙකොල්ෙලක් තමා ඉතිං”

The dataset is characterized by many Sinhala words that cannot be found in dictionaries or other formal written sources, and by the absence of correct grammar usage.

For the evaluation, the preprocessed dataset was tagged using the trained perceptron tagger. The proposed algorithm was then applied to the tagged dataset.

From the experimental dataset, 68 words produced incorrect results.

The accuracy of the approach was calculated as 77.33% using the following equation.

$$\text{Accuracy} = \frac{\text{Number of correctly lemmatized words}}{\text{Total number of words}} \times 100\%$$
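As a quick check of the reported figure, the accuracy calculation can be reproduced from the counts given in this section (300 evaluated words, 68 of them lemmatized incorrectly). The function name `accuracy_percent` is a hypothetical helper introduced only for this illustration and is not part of the authors' implementation.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Share of correctly lemmatized words, expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total * 100


# Counts reported in the evaluation: 300 words, 68 lemmatized incorrectly.
total_words = 300
incorrect_words = 68
correct_words = total_words - incorrect_words  # 232 correctly lemmatized words

print(f"{accuracy_percent(correct_words, total_words):.2f}%")  # prints 77.33%
```

With 232 of 300 words lemmatized correctly, the result agrees with the 77.33% stated above.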
Comparison of the results with other existing approaches for the Sinhala language is limited, since the required details and the manually created, social media domain-specific lexicons needed to apply those algorithms have not been published.

It was found that the accuracy of the POS tagger directly affects the accuracy of the final output. This is because the applicable rule set for a word is chosen based on the word's POS tag. If a word is tagged incorrectly, irrelevant rules may be applied to it, producing an incorrect result.

VI. CONCLUSION AND FURTHER WORK

This paper describes the development of a lemmatizer for the Sinhala language, which is important for tasks related to Natural Language Processing. A rule-based approach is proposed and has so far been implemented for nouns, including masculine, feminine and neuter nouns and pronouns. It only works for inflected forms.

As further improvements, this algorithm can be extended to verbs and derived forms of words. To increase the accuracy of the lemmatizer, a dictionary containing the lemmas of common Sinhala words can be used in future.

REFERENCES

[1] "Morphology", Encyclopedia Britannica, 2016. [Online]. Available: https://www.britannica.com/topic/morphology-linguistics/. [Accessed: 11-Nov-2018].
[2] "What are Morphemes? | SEA - Supporting English Acquisition", Ntid.rit.edu, 2018. [Online]. Available: https://www.ntid.rit.edu/sea/processes/wordknowledge/grammatical/whatare. [Accessed: 11-Nov-2018].
[3] "Sinhalese language", Encyclopedia Britannica, 2018. [Online]. Available: https://www.britannica.com/topic/Sinhalese-language. [Accessed: 11-Nov-2018].
[4] M. Porter, "An algorithm for suffix stripping", Program, Vol. 14, No. 3, pp. 130-137, 1980.
[5] J. B. Lovins, "Development of stemming Algorithm", Mechanical Translation and Computational Linguistics, Vol. 11, No. 1, pp. 22-23, 1968.
[6] T. Müller, R. Cotterell, A. Fraser and H. Schütze, "Joint Lemmatization and Morphological Tagging with Lemming", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[7] I. Wijesiri, B. Gunathilaka, D. Wimalasuriya, R. Paranavithana, M. Gallage, M. Lakjeewa, G. Dias and N. de Silva, "Building a WordNet for Sinhala", 7th Global Wordnet Conference, 2014.
[8] B. Hettige and A. S. Karunananda, "A Computational grammar of Sinhala for English-Sinhala machine translation", Proc. of 2011 Int. Conf. on Advances in ICT for Emerging Regions (ICTer), pp. 26-31, 2011.
[9] S. Paul, I. Mathur and N. Joshi, "Development of a Hindi Lemmatizer", International Journal of Computational Linguistics and Natural Language Processing, Vol. 2, Issue 5, 2013.
[10] V. Welgama, "Evaluation of a shallow stemming algorithm for Sinhala", Poster in Sri Lanka Student Workshop on Computer Science, 2011.
[11] S. Das and P. Mitra, "A rule-based approach of stemming for inflectional and derivational words in Bengali", Students' Technology Symposium (TechSym), 2011 IEEE. IEEE, 2011.
[12] C. Paris, P. Thomas and S. Wan, "Differences in language and style between two social media communities", 6th AAAI International Conference on Weblogs and Social Media, 2012.
[13] "Linguistics for laypeople: inflection vs derivation", Quirky Case, 2013. [Online]. Available: https://quirkycase.wordpress.com/2013/04/17/linguistics-for-laypeople-inflection-vs-derivation/. [Accessed: 11-Nov-2018].
[14] J. B. Disanayake, "BasakaMahima 6: Prakurthi", Colombo 10, Sri Lanka: S. Godage and Brothers, 2000.
[15] A. G. Jivani, "A Comparative Study of Stemming Algorithms", Int. J. Comp. Tech. Appl., Vol. 2, No. 6, pp. 1930-1938, 2011.
[16] A. M. Gunasekara, "A Comprehensive Grammar of the Sinhalese Language", Asian Educational Services, New Delhi, Madras, India, 1999.
[17] K. Soman, "Parts of Speech Tagging for Indian Languages: A Literature Survey", International Journal of Computer Applications (0975-8887), Vol. 34, No. 8, pp. 22-29, 2011.
[18] S. Vijayarani, J. Ilamathi and Nithya, "Preprocessing Techniques for Text Mining - An Overview", International Journal of Computer Science & Communication Networks, Vol. 5(1), pp. 7-16.
[19] "Zipf's law", Encyclopedia Britannica, 2013. [Online]. Available: https://www.britannica.com/topic/Zipfs-law. [Accessed: 14-Nov-2018].
[20] S. Gallage, "Analysis of Sinhala Using Natural Language Processing Techniques", University of Wisconsin.
[21] "Tools & Resources | Language Technology Research Lab", Ltrl.ucsc.lk, 2018. [Online]. Available: http://ltrl.ucsc.lk/tools-and-resourses/. [Accessed: 11-Nov-2018].