NLTK
NLTK
When working with Natural Language, we are not much interested in the
form of words – rather, we are concerned with the meaning that the words
intend to convey. Thus, we try to map every word of the language to its
root/base form. This process is called canonicalization.
E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same
action – hence, we can map them all to their base form i.e. ‘play’.
Now, there are two widely used canonicalization
techniques: Stemming and Lemmatization.
Stemming
Stemming generates the base word from the inflected word by removing
the affixes of the word. It has a set of pre-defined rules that govern the
dropping of these affixes. It must be noted that stemmers might not
always result in semantically meaningful base words. Stemmers are
faster and computationally less expensive than lemmatizers.
In the following code, we will be stemming words using Porter Stemmer –
one of the most widely used stemmers:
from nltk.stem import PorterStemmer
Lemmatization