NLTK - Stem
from nltk.stem import PorterStemmer

# Porter Stemmer
stemmer = PorterStemmer()
print("Stemmer:")
print("running ->", stemmer.stem("running"))  # Output: run (a valid word)
print("better ->", stemmer.stem("better"))    # Output: better (unchanged; not linked to "good")
print("corpora ->", stemmer.stem("corpora"))  # Output: corpora (unchanged; the lemma is "corpus")
Stemmer:
running -> run
better -> better
corpora -> corpora
Lemmatizer:
running -> running
better -> better
better (as adjective) -> good
corpora -> corpus
Porter Stemmer
Simpler and faster: It uses a rule-based approach to chop suffixes off words, with no dictionary lookup.
Less accurate: It does not always produce actual words and can make stemming errors. For instance, stemming "running" gives "run", which is a valid word, but stemming "studies" gives "studi", which is not; a cruder suffix stripper would even reduce "caring" to "car", which is a different word in this context.
Doesn't consider context: It focuses solely on the word itself, ignoring its part of speech (POS) and the surrounding words.
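To make the rule-based idea concrete, here is a minimal sketch of a suffix stripper in plain Python. It is a toy written for illustration, not the Porter algorithm and not NLTK code; the names `SUFFIX_RULES`, `toy_stem`, and `min_stem_len` are made up for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Each rule is (suffix to remove, text to put in its place); first match wins.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    # No rule applied: the word is returned unchanged.
    return word


for w in ["studies", "caring", "jumped", "corpora"]:
    print(w, "->", toy_stem(w))
```

The rules only look at the characters of the word, so "caring" loses its "-ing" and becomes "car", and "corpora" is left alone because no rule knows about irregular plurals. The real Porter stemmer has more rules and extra conditions, but it works on the same principle.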
WordNet Lemmatizer
More complex and slower: It relies on a lexical database (WordNet) to map words to their dictionary base forms (lemmas).
More accurate: It aims to return actual words that exist in the language.
Considers context (ideally): It can take a part-of-speech (POS) tag to choose the most appropriate lemma (e.g., "running" as a verb form of "run" vs. "running" as a noun). Without a POS tag it treats every word as a noun, which is why "better" is returned unchanged in the output above and becomes "good" only when tagged as an adjective.
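The effect of the POS tag can be sketched with a small lookup table. This is a toy stand-in for a lexical database, not WordNet and not NLTK code; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the table holds only the handful of entries needed for the example. The single-letter tags ("n", "v", "a") and the noun default are chosen to mirror the way the WordNet lemmatizer is usually called.

```python
# Toy dictionary-based lemmatizer: a hand-written table, NOT WordNet.
# Keys are (word, part-of-speech tag); values are the lemma for that reading.
LEMMA_TABLE = {
    ("running", "v"): "run",      # verb form of "run"
    ("running", "n"): "running",  # the noun, as in "running is fun"
    ("better", "a"): "good",      # comparative adjective
    ("corpora", "n"): "corpus",   # irregular plural
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return LEMMA_TABLE.get((word, pos), word)


print("running ->", toy_lemmatize("running"))
print("running (as verb) ->", toy_lemmatize("running", pos="v"))
print("better ->", toy_lemmatize("better"))
print("better (as adjective) ->", toy_lemmatize("better", pos="a"))
print("corpora ->", toy_lemmatize("corpora"))
```

Because the lookup key includes the tag, the same surface word can map to different lemmas, which is something a suffix-stripping stemmer cannot do.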