Lemmatization Approaches with Examples in Python
by Selva Prabhakaran (https://www.machinelearningplus.com/author/selva86/)
Contents
1. Introduction
2. Wordnet Lemmatizer
3. Wordnet Lemmatizer with appropriate POS tag
4. spaCy Lemmatization
5. TextBlob Lemmatizer
6. TextBlob Lemmatizer with appropriate POS tag
7. Pattern Lemmatizer
8. Stanford CoreNLP Lemmatization
9. Gensim Lemmatize
10. TreeTagger
11. Comparing NLTK, TextBlob, spaCy, Pattern and Stanford CoreNLP
12. Conclusion
1. Introduction
Lemmatization is the process of converting a word to its base form. The difference
between stemming and lemmatization is that lemmatization considers the context and
converts the word to its meaningful base form, whereas stemming just removes the
last few characters, often leading to incorrect meanings and spelling errors.
For example, lemmatization would correctly reduce ‘caring’ to its base form ‘care’,
whereas stemming would cut off the ‘ing’ part and convert it to ‘car’.
‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’
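The contrast can be sketched in a few lines of plain Python. This is a toy illustration only: naive_stem() and toy_lemmatize() are made-up helpers, and the tiny LEMMA_LOOKUP table merely stands in for the dictionary a real lemmatizer would consult.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.

# Hypothetical lookup table standing in for a real lexical database.
LEMMA_LOOKUP = {"caring": "care", "feet": "foot", "bats": "bat"}


def naive_stem(word):
    """Chop a common suffix off the end of the word, ignoring its meaning."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the lowercased word."""
    return LEMMA_LOOKUP.get(word.lower(), word.lower())


print(naive_stem("caring"))     # car
print(toy_lemmatize("caring"))  # care
```

The stemmer never checks whether its result is a real word, which is exactly how ‘caring’ ends up as ‘car’.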
Also, the same word can sometimes have multiple different lemmas. So, based on
the context in which it is used, you should identify the ‘part-of-speech’ (POS) tag for
the word in that specific context and extract the appropriate lemma. Examples of
implementing this come in the following sections.
Today, we will see how to implement lemmatization using the following Python
packages.
1. Wordnet Lemmatizer
2. Spacy Lemmatizer
3. TextBlob
4. Pattern
5. Stanford CoreNLP
6. Gensim Lemmatizer
7. TreeTagger
2. Wordnet Lemmatizer
Follow the instructions below to install nltk and download wordnet.
# How to install and import NLTK
# In terminal or prompt:
# pip install nltk
# Download Wordnet through NLTK in a Python console:
import nltk
nltk.download('wordnet')
To lemmatize a sentence, we first tokenize it into words using
nltk.word_tokenize and then call lemmatizer.lemmatize() on each word. This
can be done in a list comprehension (the for-loop inside square brackets that builds a
list).
Notice that it didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’
is not converted to ‘hang’ as expected. This can be corrected if we provide the
correct ‘part-of-speech’ tag (https://www.clips.uantwerpen.be/pages/MBSP-tags) (POS
tag) as the second argument to lemmatize(). Without that argument, lemmatize()
treats every word as a noun.
Sometimes, the same word can have multiple lemmas based on the meaning /
context.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
3. Wordnet Lemmatizer with appropriate POS tag
It may not be possible to manually provide the correct POS tag for every word in large
texts. So, instead, we will find the correct POS tag for each word, map it to the
right input character that the WordNetLemmatizer accepts and pass it as the second
argument to lemmatize().
So how do we get the POS tag for a given word?
In nltk, it is available through the nltk.pos_tag() method. It accepts only a list (list of
words), even if it's a single word.
import nltk
print(nltk.pos_tag(['feet']))
#> [('feet', 'NNS')]
sentence = "The striped bats are hanging on their feet for best"
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
#> [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on
nltk.pos_tag() returns a list of (word, tag) tuples. The key here is to map NLTK’s POS
tags to the format the wordnet lemmatizer accepts. The get_wordnet_pos() function
defined below does this mapping job.
# Lemmatize with POS Tag
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()
# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))
# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
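The same mapping can also be applied to tags that were assigned in sentence context, that is, to the (word, tag) pairs from one nltk.pos_tag() call on the tokenized sentence, instead of tagging each word in isolation. Here is a minimal sketch, under a few assumptions: penn_to_wordnet() and to_wordnet_pairs() are helper names I made up, and the letters 'a', 'n', 'v', 'r' are the values behind wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV. It uses the first five pairs from the nltk.pos_tag() output for the example sentence, so it runs without NLTK.

```python
# Map Penn Treebank tags (as returned by nltk.pos_tag) to WordNet POS letters.
# 'a', 'n', 'v', 'r' are the values of wordnet.ADJ, NOUN, VERB and ADV.
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn tag, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper(), "n")


def to_wordnet_pairs(tagged_tokens):
    """Convert (word, penn_tag) pairs into (word, wordnet_pos) pairs."""
    return [(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# First five pairs from the nltk.pos_tag() output for the example sentence.
tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"),
          ("are", "VBP"), ("hanging", "VBG")]

pairs = to_wordnet_pairs(tagged)
print(pairs)
```

Each resulting pair can then be passed straight to lemmatizer.lemmatize(word, pos).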
4. spaCy Lemmatization
spaCy is a relatively new entrant in this space and is billed as an industrial-strength NLP
engine. It comes with pre-built models (https://spacy.io/usage/models) that can parse
text and compute various NLP-related features through one single function call.
Of course, it provides the lemma of the word too.
Before we begin, let’s install spaCy and download the ‘en’ model.
spaCy determines the part-of-speech tag by default and assigns the corresponding
lemma. It comes with a bunch of prebuilt models, and the ‘en’ model we just downloaded
is one of the standard ones for English.
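import spacy
# Initialize spacy 'en' model, disabling the parser and NER components not needed for lemmatization
# (spaCy v3+ dropped the 'en' shortcut; load a full model name such as 'en_core_web_sm' there)
nlp = spacy.load('en', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet for best"
# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)
# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])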
It did all the lemmatizations that the Wordnet Lemmatizer supplied with the correct POS
tag did. Plus, it also lemmatized ‘best’ to ‘good’. Nice!
You will see the -PRON- placeholder come up whenever spaCy detects a pronoun.
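If the placeholder gets in the way downstream, one option is to fall back to the lowercased token text. A minimal sketch, assuming you have already collected (text, lemma) pairs, for example from token.text and token.lemma_; restore_pronouns() is a name I made up and the sample pairs are hand-written.

```python
PRON_PLACEHOLDER = "-PRON-"


def restore_pronouns(text_lemma_pairs):
    """Replace the -PRON- placeholder with the lowercased original token."""
    return [
        text.lower() if lemma == PRON_PLACEHOLDER else lemma
        for text, lemma in text_lemma_pairs
    ]


# Hand-written sample pairs, e.g. collected as [(t.text, t.lemma_) for t in doc].
sample = [("on", "on"), ("Their", "-PRON-"), ("feet", "foot")]
print(" ".join(restore_pronouns(sample)))  # on their foot
```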
5. TextBlob Lemmatizer
TextBlob is a powerful, fast and convenient NLP package as well. Using the Word
and TextBlob objects, it's quite straightforward to parse and lemmatize words and
sentences respectively.
# pip install textblob
from textblob import TextBlob, Word
# Lemmatize a word
word = 'stripes'
w = Word(word)
w.lemmatize()
#> stripe
However, to lemmatize a sentence or paragraph, we parse it using TextBlob and call
the lemmatize() function on the parsed words.
# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])
#> 'The striped bat are hanging on their foot for best'
It did not do a great job at the outset because, like NLTK, TextBlob also uses
wordnet internally. So, let’s pass the appropriate POS tag to the lemmatize() method.
6. TextBlob Lemmatizer with appropriate POS tag
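The lemmatize_with_postag() function called in the code that follows is not defined anywhere in this section, so here is a minimal sketch of how it could look. It assumes TextBlob's tags property, which yields (Word, tag) pairs with Penn Treebank tags, and Word.lemmatize(), which takes a WordNet POS letter. wordnet_letter() is a helper name I made up. TextBlob is imported inside the function so that the mapping helper can be tried on its own.

```python
# Penn Treebank tag prefix -> WordNet POS letter accepted by Word.lemmatize()
TAG_DICT = {"J": "a", "N": "n", "V": "v", "R": "r"}


def wordnet_letter(penn_tag):
    """Map a Penn tag such as 'VBG' to a WordNet POS letter (noun by default)."""
    return TAG_DICT.get(penn_tag[:1], "n")


def lemmatize_with_postag(sentence):
    """Lemmatize every word of the sentence using its POS tag (sketch)."""
    from textblob import TextBlob  # needs `pip install textblob` plus its corpora

    blob = TextBlob(sentence)
    # blob.tags is a list of (Word, penn_tag) pairs
    return " ".join(word.lemmatize(wordnet_letter(tag)) for word, tag in blob.tags)


print(wordnet_letter("VBG"), wordnet_letter("NNS"), wordnet_letter("DT"))  # v n n
```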
# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)
7. Pattern Lemmatizer
Pattern by CLiPS (https://www.clips.uantwerpen.be/pages/pattern) is a versatile module
with many useful NLP capabilities.
If you run into issues while installing pattern, check out the known issues on GitHub
(https://github.com/clips/pattern/issues). I myself faced this issue
(https://github.com/clips/pattern/issues/203) when installing on a Mac.
import pattern
from pattern.en import lemma, lexeme
sentence = "The striped bats were hanging on their feet and ate best fishes"
" ".join([lemma(wd) for wd in sentence.split()])
#> 'the stripe bat be hang on their feet and eat best fishes'
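You can also view the possible lexemes for each word.
[lexeme(wd) for wd in sentence.split()]
You can also obtain the lemma by parsing the text.
from pattern.en import parse
print(parse('The striped bats were hanging on their feet and ate best fishes',
            lemmata=True, tags=False, chunks=False))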
8. Stanford CoreNLP Lemmatization
Before lemmatizing with Stanford CoreNLP, you need to download Java and the Stanford
CoreNLP software. Make sure you have the following requirements before getting to the
lemmatization code:
Step 1: Java 8 installed. Mac users can check the Java version by typing java -version
in the terminal. If it is 1.8+, then it's OK. Else follow the steps below.
brew update
brew install jenv
brew cask install java
Step 2: Download the Stanford CoreNLP software
(https://stanfordnlp.github.io/CoreNLP/index.html#download) and unzip it.
Step 3: Start the Stanford CoreNLP server from the terminal. How? cd to the folder you
just unzipped and run the command below in the terminal:
cd stanford-corenlp-full-2018-02-27
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
This will start a StanfordCoreNLPServer listening at port 9000. Now, we are ready to
extract the lemmas in Python.
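# Run `pip install stanfordcorenlp` to install stanfordcorenlp package
from stanfordcorenlp import StanfordCoreNLP
import json
# Connect to the CoreNLP server we just started
nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)
# Define the properties needed to get the lemma
props = {'annotators': 'pos,lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
sentence = "The striped bats were hanging on their feet and ate best fishes"
parsed_str = nlp.annotate(sentence, properties=props)
parsed_dict = json.loads(parsed_str)
parsed_dict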
#> {'sentences': [{'index': 0,
#> 'tokens': [{'after': ' ',
#> 'before': '',
#> 'characterOffsetBegin': 0,
#> 'characterOffsetEnd': 3,
#> 'index': 1,
#> 'lemma': 'the', << ----------- LEMMA
#> 'originalText': 'The',
#> 'pos': 'DT',
#> 'word': 'The'},
#> {'after': ' ',
#> 'before': ' ',
#> 'characterOffsetBegin': 4,
#> 'characterOffsetEnd': 11,
#> 'index': 2,
#> 'lemma': 'striped', << ----------- LEMMA
#> 'originalText': 'striped',
#> 'pos': 'JJ',
#> 'word': 'striped'},
#> {'after': ' ',
#> 'before': ' ',
#> 'characterOffsetBegin': 12,
#> 'characterOffsetEnd': 16,
#> 'index': 3,
#> 'lemma': 'bat', << ----------- LEMMA
#> 'originalText': 'bats',
#> 'pos': 'NNS',
#> 'word': 'bats'}
#> ...
#> ...
The output of nlp.annotate() was converted to a dict using json.loads. Now the
lemma we need is embedded a couple of layers inside the parsed_dict. So here, we
need to extract just the lemma value from each dict. I use a list comprehension below
to do the trick.
lemma_list = [tok['lemma'] for tok in parsed_dict['sentences'][0]['tokens']]
" ".join(lemma_list)
#> 'the striped bat be hang on they foot and eat best fish'
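To try the extraction without a running server, here is a small sketch on a hand-built dictionary that mimics the structure of the output shown earlier. The dictionary is trimmed to a few fields and the second sentence entry is invented for illustration; looping over every entry in 'sentences' (rather than only the first) is my own choice for inputs that CoreNLP splits into several sentences.

```python
# Hand-built stand-in for json.loads(nlp.annotate(...)), trimmed to the
# fields used here; a second sentence is added to show the nested loop.
parsed_dict = {
    "sentences": [
        {"index": 0, "tokens": [
            {"word": "The", "lemma": "the", "pos": "DT"},
            {"word": "striped", "lemma": "striped", "pos": "JJ"},
            {"word": "bats", "lemma": "bat", "pos": "NNS"},
        ]},
        {"index": 1, "tokens": [
            {"word": "feet", "lemma": "foot", "pos": "NNS"},
        ]},
    ]
}

# Walk every sentence, then every token, and keep only the lemma.
lemma_list = [
    token["lemma"]
    for sent in parsed_dict["sentences"]
    for token in sent["tokens"]
]

print(" ".join(lemma_list))  # the striped bat foot
```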
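import string
def lemmatize_corenlp(conn_nlp, sentence):
    props = {'annotators': 'pos,lemma',
             'pipelineLanguage': 'en',
             'outputFormat': 'json'}
    # tokenize and drop punctuation (reconstructed step; the original lines are missing here)
    sents_no_punct = [s for s in conn_nlp.word_tokenize(sentence) if s not in string.punctuation]
    # form sentence
    sentence2 = " ".join(sents_no_punct)
    # annotate, then extract the lemma of each token
    parsed_str = conn_nlp.annotate(sentence2, properties=props)
    parsed_dict = json.loads(parsed_str)
    return " ".join(tok['lemma'] for tok in parsed_dict['sentences'][0]['tokens'])
lemmatize_corenlp(conn_nlp=nlp, sentence=sentence)
#> 'the striped bat be hang on they foot and eat best fish'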
9. Gensim Lemmatize
Gensim provides lemmatization facilities based on the pattern package. It can be
implemented using the lemmatize() method in the utils module. By default,
lemmatize() allows only the ‘JJ’, ‘VB’, ‘NN’ and ‘RB’ tags.
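To make the tag filter concrete, here is a plain-Python sketch of the idea: keep only the tokens whose tag starts with one of the four allowed prefixes. filter_by_tag() and the sample pairs are made up for illustration and this is not Gensim's implementation; the 'lemma/TAG' output format is my recollection of what utils.lemmatize() returned. Also note that, as far as I know, utils.lemmatize() was removed in Gensim 4.0, so the real call needs an older release.

```python
import re

# Default tag filter described above: adjectives, verbs, nouns and adverbs.
ALLOWED_TAGS = re.compile(r"(NN|VB|JJ|RB)")


def filter_by_tag(lemma_tag_pairs, allowed_tags=ALLOWED_TAGS):
    """Keep only the pairs whose POS tag starts with an allowed prefix."""
    return [
        f"{lemma}/{tag[:2]}"
        for lemma, tag in lemma_tag_pairs
        if allowed_tags.match(tag)
    ]


# Hand-written (lemma, tag) pairs for illustration; not real Gensim output.
pairs = [("the", "DT"), ("bat", "NNS"), ("be", "VBD"), ("hang", "VBG"), ("on", "IN")]
print(filter_by_tag(pairs))  # ['bat/NN', 'be/VB', 'hang/VB']
```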
10. TreeTagger
TreeTagger is a part-of-speech tagger for many languages, and it provides the lemma
of the word as well.
You will need to download and install the TreeTagger software
(http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) itself in order to use it,
by following the steps mentioned there.
# pip install treetaggerwrapper
TreeTagger indeed does a good job in converting ‘best’ to ‘good’, and for other words
as well. For further reading, refer to TreeTaggerWrapper’s documentation
(https://treetaggerwrapper.readthedocs.io/en/latest/).
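TreeTagger reports each token on its own line as word, POS tag and lemma separated by tabs, so pulling out the lemmas is mostly string splitting. A minimal sketch, assuming that three-column layout; parse_treetagger_lines() is a made-up helper and the sample lines are hand-written, not real tagger output. (The treetaggerwrapper package offers its own make_tags() helper for this job.)

```python
from collections import namedtuple

# One tagged token: surface form, POS tag and lemma.
Tag = namedtuple("Tag", ["word", "pos", "lemma"])


def parse_treetagger_lines(lines):
    """Split tab-separated 'word<TAB>pos<TAB>lemma' lines into Tag tuples."""
    tags = []
    for line in lines:
        parts = line.split("\t")
        if len(parts) == 3:  # skip anything that is not a plain token line
            tags.append(Tag(*parts))
    return tags


# Hand-written sample lines in TreeTagger's output layout (not real output).
sample_lines = ["bats\tNNS\tbat", "feet\tNNS\tfoot"]
tags = parse_treetagger_lines(sample_lines)
print(" ".join(t.lemma for t in tags))  # bat foot
```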
11. Comparing NLTK, TextBlob, spaCy, Pattern and Stanford CoreNLP
sentence = """Following mice attacks, caring farmers were marching to Delhi for better living
Delhi police on Tuesday fired water cannons and teargas shells at protesting farmers as they tr
break barricades with their cars, automobiles and tractors."""
# NLTK
# ('Following mouse attack care farmer be march to Delhi for well living '
# 'condition Delhi police on Tuesday fire water cannon and teargas shell at '
# 'protest farmer a they try to break barricade with their car automobile and '
# 'tractor')
# Spacy
import spacy
# ('follow mice attack , care farmer be march to delhi for good living condition '
# '. delhi police on tuesday fire water cannon and teargas shell at protest '
# 'farmer as -PRON- try to break barricade with -PRON- car , automobile and '
# 'tractor .')
# TextBlob
pprint(lemmatize_with_postag(sentence))
# ('Following mouse attack care farmer be march to Delhi for good living '
# 'condition Delhi police on Tuesday fire water cannon and teargas shell at '
# 'protest farmer a they try to break barricade with their car automobile and '
# 'tractor')
# Pattern
from pattern.en import lemma
# ('follow mice attacks, care farmer be march to delhi for better live '
# 'conditions. delhi police on tuesday fire water cannon and tearga shell at '
# 'protest farmer a they try to break barricade with their cars, automobile and '
# 'tractors.')
# Stanford
pprint(lemmatize_corenlp(conn_nlp=conn_nlp, sentence=sentence))
# ('follow mouse attack care farmer be march to Delhi for better living '
# 'condition Delhi police on Tuesday fire water cannon and tearga shell at '
# 'protest farmer as they try to break barricade with they car automobile and '
# 'tractor')
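Differences like these are easier to spot programmatically than by eye. Here is a small sketch that compares two outputs token by token; diff_tokens() is a helper I made up, and it assumes both outputs split into the same number of tokens (zip stops at the shorter one).

```python
def diff_tokens(a, b):
    """Return (position, token_a, token_b) for every position where they differ."""
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a.split(), b.split()))
        if x != y
    ]


# Opening words of the NLTK and TextBlob outputs listed above.
nltk_out = "Following mouse attack care farmer be march to Delhi for well living condition"
textblob_out = "Following mouse attack care farmer be march to Delhi for good living condition"

print(diff_tokens(nltk_out, textblob_out))  # [(10, 'well', 'good')]
```

Outputs that keep punctuation as separate tokens (spaCy, Pattern) would need to be aligned first before a positional comparison like this makes sense.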
12. Conclusion
So those are the methods you can use the next time you take up an NLP project. I
would be happy to hear about any new approaches or suggestions in the
comments. Happy learning!