International Journal of Research in Advent Technology (E-ISSN: 2321-9637) Special Issue
1st International Conference on Advent Trends in Engineering, Science and Technology
“ICATEST 2015”, 08 March 2015
Survey Paper of Different Lemmatization Approaches
Riddhi Dave1, Prem Balani2
1 ME Student, Information Technology Department, GCET, GTU affiliated, V.V. Nagar, Gujarat, India, [email protected]
2 Assistant Professor, Information Technology Department, GCET, GTU affiliated, V.V. Nagar, Gujarat, India, [email protected]

Abstract: Lemmatization is used to normalize the inflectional forms of a word to its root word, so it can be used as a pre-processing step in any natural language processing application. Lemmatization is an important approach for the information retrieval process. It reduces the different inflectional as well as derivational forms of a word to its root or head word, which is called its 'lemma'. A 'lemma' is simply the "dictionary form" of a word. With lemmatization, different grammatical forms of a word can be analyzed as a single word. In this paper we discuss five different lemmatization approaches. The first is the edit-distance-on-dictionary algorithm, which is a combination of string matching and a most-frequent-inflectional-suffixes model. The second is the morphological analyzer, which is based on finite state automata. The third approach uses the radix trie data structure, which allows retrieving the possible lemmas of a given inflected or derived form. The fourth approach is the affix lemmatizer, which is a combination of a rule-based approach and supervised training, and the last approach is the fixed-length truncation approach.

Keywords – Lemmatization, Information Retrieval
1. INTRODUCTION

As language is an important tool for communication, natural language processing (NLP) is concerned with the interaction between human languages and computers. NLP involves enabling computers to derive meaning from human or natural language input. Natural language processing is a very active research topic nowadays, as it is used in most linguistic activities.

Information retrieval is a major activity in natural language processing. Information retrieval is the process of obtaining the resources a user needs from the available resources.

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.[4]

"Lemmatization" refers to normalizing different inflectional forms as well as derivational forms to their head word. This task can be used as a pre-processing step for many natural language processing applications (e.g. morphological analyzers, electronic dictionaries, spell-checkers, stemmers, etc.). It may also be useful as a generic keyword generator for search engines and other data mining, clustering and classification tools.[1]

Normalization is a very important task in any natural language processing application. Stemming or lemmatization is used as a normalization technique that reduces different grammatical words to their head word by applying a set of rules. Both stemming and lemmatization can be used as pre-processing steps in IR applications.

Stemming is the process of reducing different inflectional forms to their stem by applying a set of rules. The aim of stemming is just to reduce a word to its stem without considering its part of speech (POS). It is used in most text mining applications where the goal is simply to reduce the form of a word without worrying about its occurrence in the given context. The result of stemming is called a stem, and it is not always a dictionary word.

In linguistics, a lemma (from the Greek noun "lemma", "headword") is the "dictionary" or "canonical" form of a set of words. More specifically, a lemma is the canonical form of a lexeme, where lexeme refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen as the base form to represent the lexeme.[2] Lemmatization is the most frequently used normalization technique in information retrieval applications such as indexing and searching.

Lemmatization aims to remove inflectional endings only and to return the dictionary form of a word, and it may use a vocabulary and/or morphological analysis of words. Therefore lemmatizers require much more
knowledge about the language than stemmers, and they do not use language-specific rules the way stemmers do. Lemmatization is closely related to stemming; however, stemming operates only on a single word at a time. Lemmatization, instead, may operate on the full text and can therefore discriminate between words that have different meanings depending on part of speech. On the other hand, stemmers are typically easier to implement and run faster. Hence, lemmatizers play a significant role in IR, and the ability to lemmatize words efficiently and effectively is thus important.[2]

In this paper we discuss five different approaches to lemmatization. The first approach is the edit-distance-on-dictionary approach. It is a combination of string matching and a most-frequent-inflectional-suffix model: string matching is performed between the dictionary words and the word given in the query string. The second is the morphological analyzer, which is based on finite state automata. The third is the radix trie approach. It is also known as a tree approach, so the search for a given query string can be done from top to bottom. The fourth is the affix lemmatizer. It is a rule-based approach where a set of rules is defined based on language knowledge. Using the defined set of rules, affixes are removed from inflected and derived words to produce the lemma. In addition to affix removal it uses training data, which makes it more accurate; it is also the fastest approach among all. The fifth is the fixed-length truncation approach, where a fixed-size suffix is removed from the given word and the rest is returned as the result.

The rest of the paper is organized as follows: Section 2 explains the different approaches to lemmatization, Section 3 concludes the paper and Section 4 contains future enhancements.

2. APPROACHES OF LEMMATIZER

We have studied five lemmatization approaches. The first approach is a string-matching dictionary-based approach. The second is based on finite state automata. The third is based on a trie, also known as a tree approach; the trie approach retrieves all possible lemmas of a given inflected word. The fourth approach is an affix removal approach, and the last one is the fixed-length truncation approach. The last approach is mostly used for languages where the average word length is more than 7 characters, so removing a fixed-size suffix can produce good results.

a) Levenshtein Distance Dictionary based Approach

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.[5]

Searching similar sequences of data is of great importance to many applications, such as gene similarity determination, speech recognition, database and/or Internet search engines, handwriting recognition, spell-checkers and other biology, genomics and text processing applications. Therefore, algorithms that can efficiently manipulate sequences of data (in terms of time and/or space) are highly desirable, even with modest approximation guarantees.[1]

The Levenshtein distance of two strings a and b is the minimum number of character transformations required to convert string a into string b. It is given by lev_{a,b}(|a|, |b|), where

    lev_{a,b}(i, j) = max(i, j)                                     if min(i, j) = 0,
    lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                           lev_{a,b}(i, j-1) + 1,
                           lev_{a,b}(i-1, j-1) + 1_{(a_i != b_j)} )  otherwise.

Equation 1: Levenshtein distance between two strings, where 1_{(a_i != b_j)} is the indicator function equal to 0 when a_i = b_j and equal to 1 otherwise.

Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion and the third to match or mismatch, depending on whether the respective symbols are the same.[5]

The edit distance algorithm is performed using the three most "primitive edit operations". By primitive edit operation we refer to the substitution of one character for another, the deletion of a character and the insertion of a character. So the algorithm can be performed with the three basic operations of insertion, deletion and substitution.

Some approaches focus on suffix phenomena only, but this approach deals with both suffixes and prefixes, so it handles the full affixation phenomenon.

Sometimes suffixes are added to words based on grammatical rules; for example, for the word "going" this approach returns the headword "go". But for an irregular form such as "went", the dictionary contains a discrete entry for its lemma.

The idea is to find all possible lemmas for the user's input word. The approach uses a file in which 30,000 possible lemmas are stored.
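A minimal sketch of this lookup, assuming a plain Python list stands in for the stored lemma file (the function and parameter names here are illustrative, not taken from [1]): Equation 1 is computed bottom-up with dynamic programming, and every stored lemma within the chosen approximation of the minimum distance is returned.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming form of Equation 1 (two-row variant)."""
    prev = list(range(len(b) + 1))                  # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                                  # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # match / mismatch
        prev = curr
    return prev[-1]

def candidate_lemmas(word, lemma_list, approximation=0):
    """Return all stored lemmas within (minimum + approximation) distance."""
    scored = [(levenshtein(word, lemma), lemma) for lemma in lemma_list]
    best = min(d for d, _ in scored)
    return [lemma for d, lemma in scored if d <= best + approximation]
```

For instance, candidate_lemmas("spies", ["go", "spy", "study"]) returns ["spy"], while widening the approximation to 1 also admits "study" at one extra edit beyond the minimum.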
For each of the target words, the similarity distance between the source and the target word is calculated and stored. When this process is completed, the algorithm returns the set of target words having the minimum edit distance from the source word.[1]

So the algorithm compares the user input to all the available stored lemmas and retrieves the minimum-distance words among the target words.

The algorithm provides the option to select the value of the approximation that the system considers as the desired similarity distance (e.g. if the user enters zero as the desired approximation, then only the target words with the minimum edit distance will be returned, whereas if he/she enters e.g. 2 as the desired approximation, then the returned set will contain all the target words having a distance <= (minimum + 2) from the source word).[1]

This approach also distinguishes words like "entertained" and "entertainment". It returns "entertain" for "entertained" but not for "entertainment", because "entertainment" is itself a noun and is different from "entertained".

b) Morphological Analyzer based Approach

A morphological analyzer gives all possible analyses for a given word. It is based on finite state technology, and it produces the morphological analysis of the word form as its output.[2]

This approach uses finite state automata and two-level morphology to build a lexicon for a language with an infinite vocabulary.

Two-level rules are declarative constraints that describe morphological alternations, such as the y->ie alternation in the plural of some English nouns (spy->spies).[6]

The aim of this approach is to convert two-level rules into deterministic, minimized finite-state transducers. The rule compiler described in [6] defines the format of two-level grammars, the rule formalism, and the user interface to the compiler, and explains how the compiler can assist the user in the development of a two-level grammar.[6]

A finite state transducer (FST) is a finite state machine with two tapes: an input tape and an output tape. This contrasts with an ordinary finite state automaton (or finite state acceptor), which has a single tape.[7] A transducer translates a word from one state to another using these two tapes.

A finite state transducer is a 6-tuple (Q, Σ, Γ, I, F, δ) such that:

Q is a finite set, the set of states;
Σ is a finite set, called the input alphabet;
Γ is a finite set, called the output alphabet;
I is a subset of Q, the set of initial states;
F is a subset of Q, the set of final states; and
δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q (where ε is the empty string) is the transition relation.[7]

A finite state machine is given input actions, and its output depends only on the state; the state changes from the input tape to the output tape based on the action performed.

For example, the entry action in state "Opening" starts a motor opening the door, and the entry action in state "Closing" starts a motor in the other direction, closing the door. States "Opened" and "Closed" stop the motor when the door is fully opened or closed, and signal to the outside world (e.g., to other state machines) the situation "door is open" or "door is closed".[7]

So an FST takes an action as input, which can be any rule or operation, and generates the output tape from the current input tape. This approach is mostly used for computational morphology and phonology.

c) Radix Trie based Approach

In computer science, a radix tree (also patricia trie, radix trie or compact prefix tree) is a space-optimized trie data structure where each node with only one child is merged with its parent. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.[9]

A trie is a data structure that allows retrieving all possible lemmas. Each node holds a single character, and nodes are connected by edges. A word is retrieved byte by byte. This approach also involves backtracking to get the appropriate result.

Figure 1: A simple trie storing Hindi words[8]
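To make the traversal concrete, here is a small illustrative sketch (the Trie class and the three English lemmas are invented for the example; the cited system stores Hindi words): each node holds one character, and a search remembers the deepest node that ended a stored lemma, which is the point from which backtracking can resume.

```python
class Trie:
    """Character-level trie; nodes are flagged where a stored lemma ends."""

    def __init__(self):
        self.children = {}
        self.is_lemma = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_lemma = True

    def longest_lemma_prefix(self, word):
        """Walk the trie character by character, remembering the last node
        that completed a stored lemma (the backtracking target)."""
        node, best = self, None
        for i, ch in enumerate(word):
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.is_lemma:
                best = word[: i + 1]
        return best

trie = Trie()
for lemma in ("act", "action", "activate"):
    trie.insert(lemma)
```

With this toy lexicon, looking up the inflected form "actions" walks past the shorter match "act" and settles on "action", the deepest stored lemma on the path.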
A word is stored starting from the root, character by character in Unicode byte order. The user's input word is searched from the first node, traversing the tree up to the last character of the word. It is possible that the traversal needs to backtrack for some levels.

The lemmatizer gives the following output:

Figure 2: Example of searching the Hindi word "Ladkiyan"[8]

The search proceeds up to the third word, "Ladka", as shown in Figure 2. After that it needs to backtrack one level to reach the word "Ladk", and then traverse again to get "Ladki", the correct word, as shown in Figure 1.

So using the radix tree approach alone cannot give an accurate result, but with backtracking of one or two levels it gives the most accurate result.

d) Affix Lemmatizer

The most common approach to word normalization is to remove affixes from a given word. Suffixes or prefixes are removed according to rules defined from grammatical knowledge of the language. Simply removing a suffix or prefix from a word cannot give an accurate head word or root word.

Since a purely rule-based approach cannot give accurate results, combining the rule-based approach with a statistical approach such as supervised training can give more accurate results.

The supervised training algorithm generates a data structure consisting of rules that a lemmatizer must traverse to arrive at a rule that is elected to fire.[10]

After training, the data structure of rules is made permanent and can be consulted by a lemmatizer. The lemmatizer must elect and fire rules in the same way as the training algorithm, so that all words from the training set are lemmatized correctly. It may, however, fail to produce the correct lemmas for words that were not in the training set – the out-of-vocabulary (OOV) words.[10]

For the training words this approach uses prime and derived rules. The prime rule is the least specific rule needed to lemmatize a training word, whereas derived rules are more specific rules that can be created by adding or removing characters.

For example, a rule can handle "watcha", which is derived from "what are you", or "yer", which is derived from "you are" rather than "your".

This approach is more generalized than a suffix-removal-only approach.

The bulk of 'normal' training words must be bigger for the new affix-based lemmatizer than for the suffix lemmatizer. This is because the new algorithm generates immense numbers of candidate rules with only marginal differences in accuracy, requiring many examples to find the best candidate.[10]

e) Fixed length truncation

In this approach, we simply truncate the words and use the first 5 or 7 characters of each word as its lemma. Words with fewer than n characters are used as the lemma with no truncation.[2]

This approach is most appropriate for languages like Turkish, whose average word length is 7.07 letters.

So this approach is used when time is the highest-priority issue. It is the simplest approach, not dependent on any language or grammar, so it can be applied to any language.

3. CONCLUSION

As we have discussed, only a rule-based approach can give the root word, but it is not always an efficient solution because the space needed for storing the predefined rules is a big issue. Combining the rule-based approach with some statistical approach can give more accurate results.

Using a language-independent approach is an efficient solution. By the term "language-independent" we mean that the algorithm can perform sufficiently well for a variety of languages regardless of the specific grammar and inflectional rules that apply to them. For a language-independent approach, the Levenshtein edit distance is the best solution.

Another solution is to use a data structure such as the radix tree, which can be optimal: its longest-prefix-match functionality is able to find the most appropriate lemma for the input word.

4. FUTURE ENHANCEMENT

Although research has been done on developing lemmatizers, there are still statistical approaches and data structures available that are used for linguistic purposes. By using them we can achieve a better lemmatizer that saves both time and space.

REFERENCES

[1]. Dimitrios P. Lyras, Kyriakos N. Sgarbas, Nikolaos D. Fakotakis, "Using the Levenshtein Edit Distance for Automatic Lemmatization: A Case Study for Modern Greek and English," Tools with Artificial Intelligence, 19th IEEE International Conference, pp.429-435, 29-31 October, 2007.
[2].Okan Ozturkmenoglu, Adil Alpkocak,
"Comparison of Different Lemmatization
Approaches for Information Retrieval on Turkish
Text Collection," Innovations in Intelligent Systems and Applications (INISTA), 2012 IEEE International Symposium, pp.1-5, July 2012.
[3]. Snigdha Paul, Nisheeth Joshi, Iti Mathur, "Development of a Hindi Lemmatizer," International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), vol.2, pp.380-384, May 2013.
[4]. http://en.wikipedia.org/wiki/Information_retrieval
[5]. http://en.wikipedia.org/wiki/Levenshtein_distance
[6]. L. Karttunen, K. R. Beesley, "Two-level rule compiler," Palo Alto, XEROX: Research Center Technical Report, 1992.
[7]. http://en.wikipedia.org/wiki/Finite_state_transducer
[8]. Pushpak Bhattacharyya, Ankit Bahuguna, Lavita Talukdar and Bornali Phukan, "Facilitating Multi-Lingual Sense Annotation: Human Mediated Lemmatizer", Global Wordnet Conference (GWC 2014), Tartu, Estonia, 25-29 January, 2014.
[9]. http://en.wikipedia.org/wiki/Radix_tree
[10]. Bart Jongejan, Hercules Dalianis, "Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike", 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP) of the AFNLP, pp.145-153, August 2009.