International Journal of Applied Information Systems (IJAIS) – ISSN: 2249-0868
Foundation of Computer Science FCS, New York, USA
Volume 4, No. 3, September 2012 – www.ijais.org

Stemming Algorithms: A Comparative Study and Their Analysis

Deepika Sharma (ME CSE)
Department of Computer Science and Engineering, Thapar University
Patiala, Punjab, India

ABSTRACT
Stemming is an approach used to reduce a word to its stem or root form; it is widely used in information retrieval tasks to increase the recall rate and return the most relevant results. There are a number of ways to perform stemming, ranging from manual to automatic methods and from language-specific to language-independent, each having its own advantages over the others. This paper presents a comparative study of the various stemming alternatives widely used to enhance the effectiveness and efficiency of information retrieval.

Keywords
Information Retrieval, Stemming Algorithm, Conflation Methods

1. INTRODUCTION
With the enormous amount of data available online, it is essential to retrieve accurate data for a user query. There are many approaches used to increase the effectiveness of online data retrieval. The traditional approach to retrieving data for a user query is to search the documents present in the corpus word by word for the given query. This approach is very time-consuming, and it may miss some related documents of equal importance. To avoid these situations, stemming has been used extensively in various Information Retrieval Systems to increase retrieval accuracy.

Stemming is the conflation of the variant forms of a word into a single representation, i.e. the stem. For example, the terms presentation, presenting, and presented could all be stemmed to present. The stem does not need to be a valid word, but it must capture the meaning of the word. In Information Retrieval Systems, stemming is used to conflate a word with its variant forms to avoid mismatches between the query asked by the user and the words present in the documents. For example, if a user wants to search for a document on "How to cook" and submits a query on "cooking", he may not get all the relevant results. However, if the query is stemmed, so that "cooking" becomes "cook", then retrieval will be successful.

Stemming has been used extensively to increase the performance of Information Retrieval Systems. For some international languages like Hebrew, Portuguese, Hungarian [3], Czech, and French, and for many Indian languages like Bengali, Marathi, and Hindi [2], stemming increases the number of documents retrieved by between 10 and 50 times. For English the results are less dramatic, but still better than the baseline approach where no stemming is used. Stemming is also used to reduce the size of index files: since a single stem typically corresponds to several full terms, by storing stems instead of terms a compression factor of 50 percent can be achieved.

2. CONFLATION METHODS
To achieve stemming we need to conflate a word with its variants. Figure 1 shows the various conflation methods that can be used in stemming. Conflation of words, or so-called stemming, can either be done manually, using some kind of regular expressions, or automatically, using stemmers. There are four automatic approaches, namely the Affix Removal method, the Successor Variety method, the n-gram method, and the Table Lookup method [1] [7].

Figure 1: Conflation methods. Conflation is either Manual or Automatic (Stemmers); the automatic methods are Affix Removal (subdivided into Longest Match and Simple Removal), Successor Variety, Table Lookup, and n-gram.

2.1 Affix Removal Method
The affix removal method removes suffixes or prefixes from words so as to convert them into a common stem form. Most of the stemmers currently in use follow this approach to conflation. The affix removal method is based on two principles: one is iteration and the other is longest match. An iterative stemming algorithm is simply a recursive procedure, as its name implies, which removes strings in each order-class one at a time, starting at the end of a word and working toward its beginning. No more than one match is allowed within a single order-class, by definition.


Iteration is usually based on the fact that suffixes are attached to stems in a certain order; that is, there exist order-classes of suffixes. The longest-match principle states that within any given class of endings, if more than one ending provides a match, the longest one should be removed. The first stemmer based on this approach is the one developed by Lovins (1968); M.F. Porter (1980) also used this method. However, Porter's stemmer is more compact and easier to use than Lovins'. YASS is another stemmer based on the same approach; it is, however, language-independent in nature.

2.2 Successor Variety Method
Successor variety stemmers [8] use the frequencies of letter sequences in a body of text as the basis of stemming. In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider a body of text consisting of the following words, for example:

back, beach, body, backward, boy

To determine the successor varieties for "battle," for example, the following process would be used. The first letter of battle is "b." "b" is followed in the text body by three characters: "a," "e," and "o." Thus, the successor variety of "b" is three. The next successor variety for battle would be one, since only "c" follows "ba" in the text. When this process is carried out using a large body of text, the successor variety of substrings of a term will decrease as more characters are added, until a segment boundary is reached. At this point, the successor variety will sharply increase. This information is used to identify stems.

2.3 Table Lookup Method
Terms and their corresponding stems can also be stored in a table. Stemming is then done via lookups in the table. One way to do stemming is to store a table of all index terms and their stems. Terms from queries and indexes could then be stemmed via table lookup [6]. Using a B-tree or hash table, such lookups would be very fast. For example, presented, presentable, and presenting can all be stemmed to the common stem present. There are problems with this approach. The first is that for building these lookup tables we need to work extensively on a language, and there is some probability that the tables may miss some exceptional cases. Another problem is the storage overhead for such a table.

2.4 n-gram Method
Another method of conflating terms, called the shared digram method, was given in 1974 by Adamson and Boreham [9]. A digram is a pair of consecutive letters. Besides digrams we can also use trigrams, and hence it is called the n-gram method in general [4]. In this approach, pairs of words are associated on the basis of the unique digrams they both possess. For calculating this association measure we use Dice's coefficient [1]. For example, the terms information and informative can be broken into digrams as follows:

information => in nf fo or rm ma at ti io on
unique digrams = in nf fo or rm ma at ti io on
informative => in nf fo or rm ma at ti iv ve
unique digrams = in nf fo or rm ma at ti iv ve

Thus, "information" has ten digrams, all of which are unique, and "informative" also has ten digrams, all of which are unique. The two words share eight unique digrams: in, nf, fo, or, rm, ma, at, and ti.

Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed. The similarity measure used is Dice's coefficient, which is defined as:

S = 2C / (A + B)

where A is the number of unique digrams in the first word, B the number of unique digrams in the second, and C the number of unique digrams shared by the two. For the example above, Dice's coefficient would equal (2 x 8) / (10 + 10) = 0.80. Such similarity measures are determined for all pairs of terms in the database. Once the similarity has been computed for all the word pairs, they are clustered into groups. The value of Dice's coefficient gives the hint that the stem for this pair of words lies in the first eight shared digrams.

3. CLASSIFICATION OF STEMMING ALGORITHMS
Stemming algorithms can be broadly classified into two categories, namely Rule-Based and Statistical.

Figure 2: Types of stemming approach. Stemming is either Rule-Based or Statistical.

A rule-based stemmer encodes language-specific rules, whereas a statistical stemmer employs statistical information from a large corpus of a given language to learn the morphology.

3.1 Rule-Based Approach
In a rule-based approach, language-specific rules are encoded and stemming is performed based on these rules. In this approach, various conditions are specified for converting a word to its derivational stem, a list of all valid stems is given, and there are also some exceptional rules used to handle the exceptional cases. In the Lovins stemmer, stemming comprises two phases [11]: in the first phase, the algorithm retrieves the stem of a word by removing its longest possible ending, matching the endings against a list of suffixes stored in the computer; in the second phase, spelling exceptions are handled. For example, the word "absorption" is derived from the stem "absorpt" while "absorbing" is derived from the stem "absorb". The problem of spelling exceptions arises in the above case when we try to match the two stems "absorpt" and "absorb". Such exceptions are handled very carefully by introducing recording and partial matching techniques in the stemmer as post-stemming procedures.
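The digram extraction and Dice similarity described in Section 2.4 can be sketched in a few lines. This is a minimal illustration, not code from the paper; Python is used here for exposition:

```python
def digrams(word):
    # Unique pairs of consecutive letters in the word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(w1, w2):
    # Dice's coefficient S = 2C / (A + B) over unique digrams.
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice("information", "informative"))  # 0.8
```

For "information" and "informative" the sets share eight of ten unique digrams each, reproducing the value (2 x 8) / (10 + 10) = 0.80 computed above.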

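The successor variety computation of Section 2.2 can likewise be sketched directly. The snippet below is an illustrative sketch over the five-word sample body of text, not code from the paper:

```python
def successor_variety(prefix, corpus):
    # Number of distinct characters that follow `prefix`
    # across all words in the text body.
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

corpus = ["back", "beach", "body", "backward", "boy"]
print(successor_variety("b", corpus))   # 3  ('a', 'e', 'o')
print(successor_variety("ba", corpus))  # 1  (only 'c')
```

A segment boundary is hypothesized where this count rises sharply as the prefix is extended.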

Recording [11] occurs immediately following the removal of an ending and makes such changes at the end of the resultant stem as are necessary to allow the ultimate matching of varying stems. These changes may involve turning one stem into another (e.g. the rule rpt -> rb changes absorpt to absorb), changing both stems involved by recording their terminal consonants to some neutral element (absorb -> absor-, absorpt -> absor-), or removing some of these letters entirely, that is, changing them to nullity (absorb -> absor, absorpt -> absor).

The main difference between recording and partial matching is that a recording procedure is part of the stemming algorithm, whereas the partial matching procedure is applied to the output of the stemming algorithm, where the stems derived from the catalogue terms are searched for matches to the user's query.

Apart from the Lovins method, one more rule-based method is given by M.F. Porter; it comprises a set of conditional rules [10]. These conditions are applied either on the stem, on the suffix, or on the stated rules. As per the conditions, a word can be represented in the general form:

[C](VC)^m[V]

where C represents a sequence of consonants, V represents a sequence of vowels, and m represents the measure of the word. For example:

m=0: RA, EE, BI
m=1: TREES, OATS
m=2: TEACHER, TROUBLES, RATES

The general rule for removing a suffix is given as:

(condition) S1 -> S2

where the condition is stated on the stem; if the condition is satisfied, suffix S1 is replaced by suffix S2. For example:

(m > 1) ION -> null

Here S1 is ION and S2 is null. This would map EDUCATION to EDUCAT, since EDUCAT is a word part for which m > 1.

3.1.1 ADVANTAGES
1. Rule-based stemmers are fast, i.e. the computation time needed to find a stem is low.
2. The retrieval results for English using a rule-based stemmer are very good.

3.1.2 DISADVANTAGES
1. One of the main disadvantages of rule-based stemmers is that one needs extensive language expertise to build them.
2. The procedure used in this approach handles individual words: it has no access to information about their grammatical and semantic relations with one another.
3. A large amount of storage is required to store the rules for stem extraction as well as the exceptional cases.
4. These stemmers may over-stem or under-stem words.

3.2 Statistical Approach
Statistical stemming is an effective and popular approach in information retrieval [16] [5]. Some recent studies [17] [18] show that statistical stemmers are good alternatives to rule-based stemmers. Additionally, their advantage lies in the fact that they do not require language expertise. Rather, they employ statistical information from a large corpus of a given language to learn the morphology of words. A lot of research has been done in the area of statistical stemming; some of the latest works are stated below.

3.2.1 YET ANOTHER SUFFIX STRIPPER (YASS)
Most popular stemmers encode a large number of language-specific rules built over a length of time. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing methods have been successfully used to improve the performance of IR systems. Yet Another Suffix Stripper (YASS) is one such statistics-based, language-independent stemmer [18]. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved, and it addresses the challenge of retrieval from languages with poor resources.

In this approach, a set of string distance measures [12] is defined, and complete-linkage clustering is used to discover equivalence classes from the lexicon. A string distance measure is used to check the similarity between two words by calculating the distance between the two strings; the distance function maps a pair of strings a and b to a real number r, where a smaller value of r indicates greater similarity between a and b. A set of string distance measures {D1, D2, D3, D4} is used for clustering the words. The main reason for calculating these distances is to reward long matching prefixes and to penalize an early mismatch.

Given two strings X = x0 x1 ... xn and Y = y0 y1 ... yn', we first define a Boolean function pi as the penalty for an early mismatch:

pi = 0 if xi = yi, for 0 <= i <= min(n, n'); pi = 1 otherwise

Thus, pi is 1 if there is a mismatch in the ith position of X and Y. If X and Y are of unequal length, we pad the shorter string with null characters to make the string lengths equal. Let the length of the strings be n+1. We define D1 as follows:

D1(X, Y) = sum (i = 0 to n) of pi / 2^i    (1)
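Porter's measure m can be computed by collapsing a word into consonant (C) and vowel (V) runs and counting the VC sequences in the form [C](VC)^m[V]. The sketch below is illustrative only (not Porter's implementation); it treats a, e, i, o, u as vowels and ignores Porter's context-dependent handling of 'y':

```python
import re

def measure(word):
    # Collapse the word into maximal vowel/consonant runs,
    # e.g. "troubles" -> C V C V C, then count "VC" pairs,
    # which is m in the form [C](VC)^m[V].
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    pattern = "".join("V" if r[0] in "aeiou" else "C" for r in runs)
    return pattern.count("VC")

print(measure("tree"))      # m = 0
print(measure("trees"))     # m = 1
print(measure("troubles"))  # m = 2
```

Under this measure, a rule such as (m > 1) ION -> null fires only when the remaining stem is long enough, which is what blocks over-stemming of short words.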


Accordingly, we define D2, D3 and D4 as follows:

D2(X, Y) = (1/m) * sum (i = m to n) of 1/2^(i-m),  if m > 0; infinity otherwise    (2)

D3(X, Y) = ((n-m+1)/m) * sum (i = m to n) of 1/2^(i-m),  if m > 0; infinity otherwise    (3)

D4(X, Y) = ((n-m+1)/(n+1)) * sum (i = m to n) of 1/2^(i-m)    (4)

where m represents the position of the first mismatch between X and Y. In Figure 3, we consider the two pairs of strings {independence, independently} and {indecent, independence}, and the values of the various distance measures for these two pairs are calculated as below. Clearly we can infer that indecent and independence are farther apart than independence and independently.

Pair 1 (positions 0-12, the shorter string padded with nulls):
I N D E P E N D E N C E *
I N D E P E N D E N T L Y

D1 = 1/2^11 + 1/2^12 = 0.00073
D2 = (1/11)(1/2^0 + 1/2^1) = 0.1363
D3 = (2/11)(1/2^0 + 1/2^1) = 0.2727
D4 = (2/13)(1/2^0 + 1/2^1) = 0.2307
Edit Distance = 2

Pair 2 (positions 0-11, the shorter string padded with nulls):
I N D E C E N T * * * *
I N D E P E N D E N C E

D1 = 1/2^4 + 1/2^5 + ... + 1/2^11 = 0.1245
D2 = (1/4)(1/2^0 + 1/2^1 + ... + 1/2^7) = 0.4980
D3 = (8/4)(1/2^0 + 1/2^1 + ... + 1/2^7) = 3.984
D4 = (8/12)(1/2^0 + 1/2^1 + ... + 1/2^7) = 1.328
Edit Distance = 8

Figure 3: Calculations of the various distance measures.

The edit distance counts the minimum number of edit operations (inserting, deleting, or substituting a letter) required to transform one string into the other. Once the similarity between pairs of words has been calculated using a distance measure, clusters of words are formed using the complete-linkage algorithm. In the complete-linkage algorithm [13], the similarity of two clusters is calculated as the minimum similarity between any member of one cluster and any member of the other; the probability of an element merging with a cluster is thus determined by the least similar member of the cluster.

3.2.2 GRAPH BASED STEMMER (GRAS)
GRAS is a graph-based, language-independent stemming algorithm for information retrieval [19]. The following features make this algorithm attractive and useful: (1) retrieval effectiveness, (2) generality, that is, its language-independent nature, and (3) low computational cost. The steps followed in this approach can be summarized as below:

1. Find long common prefixes among the word pairs present in the documents. For this, we consider word pairs of the form W1 = PS1 and W2 = PS2, where P is the long common prefix between W1 and W2.
2. The suffix pair S1 and S2 should be a pair of valid suffixes, i.e. other word pairs should also have a common initial part followed by these suffixes, such that W1' = P'S1 and W2' = P'S2. S1 and S2 form a candidate suffix pair if a large number of word pairs is of this form. Thus, suffixes are considered in pairs rather than individually.
3. Look for pairs of words that are morphologically related, i.e. pairs that share a non-empty common prefix and whose suffix pair is a valid candidate suffix pair.
4. These word relationships are modelled using a graph, where nodes represent the words and edges connect the related words.
5. A pivot node is identified; the pivot is a node that is connected by edges to a large number of other nodes.
6. In the final step, a word that is connected to a pivot is put in the same class as the pivot if it shares many common neighbours with the pivot.

Once such word classes are formed, stemming is done by mapping all the words in a class to the pivot for that class. This stemming algorithm has outperformed rule-based stemmers, statistical stemmers (YASS, Linguistica [15], etc.), and the baseline strategy.

3.2.3 ADVANTAGES
1. Statistical stemmers are useful for languages having scarce resources. For example, the Asian languages are heavily used in the Asian subcontinent, but very little research has been done on them.
2. This approach yields the best retrieval results for suffixing languages, or languages which are morphologically more complex, like French, Portuguese, Hindi, Marathi, and Bengali, rather than English.
3. They are considered recall-enhancing devices, as they increase the value of recall at a given rate.
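The distance measures of Section 3.2.1 and the second worked example of Figure 3 can be reproduced as below. This is an illustrative sketch, not the YASS implementation; following the worked calculations in the figure, every position from the first mismatch m onward is treated as penalized (rather than summing the pointwise penalties pi), which is an assumption of this sketch:

```python
def yass_distances(x, y):
    # Pad the shorter string so both have length n + 1 (indices 0..n).
    n = max(len(x), len(y)) - 1
    x, y = x.ljust(n + 1, "*"), y.ljust(n + 1, "*")
    m = next((i for i in range(n + 1) if x[i] != y[i]), None)
    if m is None:                       # identical strings
        return 0.0, 0.0, 0.0, 0.0
    tail = sum(1 / 2 ** (i - m) for i in range(m, n + 1))
    d1 = tail / 2 ** m                  # = sum of 1/2^i for i = m..n
    d2 = tail / m if m > 0 else float("inf")
    d3 = (n - m + 1) / m * tail if m > 0 else float("inf")
    d4 = (n - m + 1) / (n + 1) * tail
    return d1, d2, d3, d4

d1, d2, d3, d4 = yass_distances("indecent", "independence")
print(round(d1, 4), round(d2, 4), round(d3, 3), round(d4, 3))
# 0.1245 0.498 3.984 1.328
```

The first mismatch for this pair is at position 4 (c versus p), and the four values match those given for the pair {indecent, independence} in Figure 3.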

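The edit distance used for comparison in Figure 3 is the standard Levenshtein distance [12]. A minimal dynamic-programming sketch (illustrative, not from the paper), applied here to the stems from the recording example of Section 3.1:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix
    # of `a` and b[:j]; rows are filled one character of `a` at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("absorpt", "absorb"))  # 2
```

Here "absorpt" becomes "absorb" in two operations (substitute p -> b, delete t), which is the kind of near-match that the recording step is designed to reconcile.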

3.2.4 DISADVANTAGES
1. Most statistical stemmers do their statistical analysis based on some sample of the actual corpus. As the sample size decreases, the possibility of covering most morphological variants also decreases. Naturally, this results in a stemmer with poorer coverage.
2. For the Bengali lexicon, there are a few instances where two semantically different terms fall in the same cluster due to their string similarity. For example, Akram (the name of a cricketer from Pakistan) and akraman (to attack) fall in the same cluster, as they share a significant prefix [18]. Such cases might lead to unsatisfactory results.
3. Statistical stemmers are time-consuming, because for these stemmers to work we need complete language coverage in terms of the morphology of words, their variants, etc.

4. COMPARISON AMONG THESE APPROACHES
Here we compare the performance of the various stemming approaches discussed so far. In this comparison we consider one rule-based approach and compare it with statistical approaches like YASS and GRAS. The parameters used in the comparison are each stemmer's strength and the computation time required by each stemmer to compute the stem.

4.1 Stemmer Strength
We now present a comparative study of the various stemmers in terms of stemmer strength. Stemmer strength [14] generally represents the extent to which a stemming method changes words to their stems. One well-known measure of stemmer strength is the average number of words per conflation class. Formally, if Na, Nw, and Ns denote the mean number of words per conflation class, the number of distinct words before stemming, and the number of unique stems after stemming, respectively, then Na = Nw / Ns [19].

Figure 4: Stemmer strength (value of Na for the rule-based stemmer, YASS, and GRAS on English, French, Bengali, and Marathi).

Figure 4 gives the value of Na for the various stemming methods; clearly, a higher value of Na indicates a more aggressive stemmer. Among the three stemmers that we have considered, YASS appears to be particularly aggressive on all languages and produces the largest Na value for English, French and Bengali. On the other hand, GRAS is the most aggressive on Marathi, while it works equally well as the rule-based stemmer for the other languages, English, French and Bengali.

4.2 Computation Time
The comparison above clearly shows that YASS outperforms all the other stemmers. One more parameter used by researchers for comparing the performance of stemmers is computation time, which includes the time from submitting a query to its processing and final retrieval. Figure 5 clearly shows that, for an equal number of words in various languages like English, French, Bengali and Marathi, the computation time of YASS is far greater than that of its closest competitor, GRAS [19]. So we conclude that GRAS is far faster than YASS. In GRAS, the two aspects that influence the processing time are the density of the graph, that is, the average degree of a node, and the length of the suffix.

Figure 5: Computation time (in increasing order) of YASS and GRAS on English, French, Bengali, and Marathi.

5. CONCLUSION
In the past few years, the amount of information on the Web has grown exponentially. The information present on the Web covers practically all topics and many languages. Some of these languages have not received much attention, and for these the language resources are scarce. To make this available information useful, it has to be indexed and made searchable by an Information Retrieval System. Stemming is one such approach used in the indexing process.

We have presented a comparative study of various stemming methods. In this study we saw that stemming significantly increases the retrieval results for both the rule-based and the statistical approach. It is also useful in reducing the size of index files, as the number of words to be indexed is reduced to common forms, or so-called stems. The performance of statistical stemmers is far superior to some well-known rule-based stemmers, and among the statistical stemmers GRAS has outperformed YASS, which is a clustering-based suffix stripping algorithm. But the main drawback that we have seen in these statistical stemmers is poor coverage of the language: they do not include all the documents of the corpus in the statistical analysis, as that is very time-consuming; rather, they consider a sample of documents from the corpus for the analysis, and this small collection may lead to poor coverage of


the words. The performance of GRAS is also dependent on the density of the graph, but studies have shown that it is capable of handling an interesting class of languages and improves the performance of monolingual information retrieval significantly, with a low computation cost and comparatively low processing time.

6. FUTURE SCOPE
Despite the fact that stemming greatly enhances the performance of Information Retrieval Systems, there are still some open issues in this field that have to be dealt with properly. In GRAS, most of the time is spent on graph construction. These graphs are dynamic in nature: as more words are introduced into the corpus, more nodes are made and the graph becomes more complex and dense. Also, the size of the sample considered in statistical stemming is under debate: if a smaller sample is considered, then stemming will be faster but language coverage will be in doubt, and if larger samples are taken, then stemming itself will take a very long time. So, some optimum sample size must be found that covers the maximum lexicon of a language.

7. REFERENCES
[1] W.B. Frakes 1992. "Stemming Algorithms", in Information Retrieval: Data Structures and Algorithms, Chapter 8, pages 132-139.
[2] A. Ramanathan and D. Rao 2003. "A lightweight stemmer for Hindi". In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Workshop on Computational Linguistics for South Asian Languages (Budapest, Apr.).
[3] J. Savoy 2008. "Searching strategies for the Hungarian language". Inf. Process. Manage. 44, 1, 310-324.
[4] P. McNamee and J. Mayfield 2004. "Character n-gram tokenization for European language text retrieval". Inf. Retr. 7(1-2), 73-97.
[5] D.W. Oard, G.A. Levow and C.I. Cabezas 2001. "CLEF experiments at Maryland: Statistical stemming and backoff translation". In Revised Papers from the Workshop of the Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation (CLEF), Springer, London, 176-187.
[6] W.B. Frakes 1984. "Term Conflation for Information Retrieval", in Research and Development in Information Retrieval, ed. C. van Rijsbergen. New York: Cambridge University Press.
[7] W.B. Frakes 1992. "LATTIS: A Corporate Library and Information System for the UNIX Environment". Proceedings of the National Online Meeting, Medford, N.J.: Learned Information Inc., 137-142.
[8] M. Hafer and S. Weiss 1974. "Word Segmentation by Letter Successor Varieties". Information Storage and Retrieval, 10, 371-385.
[9] G. Adamson and J. Boreham 1974. "The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles". Information Storage and Retrieval, 10, 253-260.
[10] M.F. Porter 1980. "An Algorithm for Suffix Stripping". Program, 14(3), 130-137.
[11] J.B. Lovins 1968. "Development of a Stemming Algorithm". Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.
[12] V.I. Levenshtein 1966. "Binary codes capable of correcting deletions, insertions and reversals". Soviet Physics Doklady, 10, 8, 707-710.
[13] A.K. Jain, M.N. Murthy, and P.J. Flynn 1999. "Data clustering: A review". ACM Comput. Surv. 31, 3, 264-323.
[14] W.B. Frakes and C.J. Fox 2003. "Strength and similarity of affix removal stemming algorithms". SIGIR Forum.
[15] J. Goldsmith 2001. "Linguistica: Unsupervised learning of the morphology of a natural language". Comput. Linguist. 27, 2, 153-198.
[16] J. Xu and W.B. Croft 1998. "Corpus-based stemming using co-occurrence of word variants". ACM Trans. Inf. Syst. 16, 1, 61-81.
[17] M. Bacchin, N. Ferro, and M. Melucci 2005. "A probabilistic model for stemmer generation". Inf. Process. Manage. 41, 1, 121-137.
[18] P. Majumder, M. Mitra, S.K. Parui, G. Kole, P. Mitra, and K.K. Dutta 2007. "YASS: Yet Another Suffix Stripper". ACM Transactions on Information Systems (TOIS), 25(4), October 2007.
[19] J.H. Paik, M. Mitra, S.K. Parui, and K. Jarvelin 2011. "GRAS: An effective and efficient stemming algorithm for information retrieval". ACM Transactions on Information Systems (TOIS), 29(4), December 2011.
