Lemmatization Approaches

Uploaded by

Luciano



Machine Learning Plus

(https://www.machinelearningplus.com/)
Let's Data Science

Lemmatization Approaches with Examples in Python
by Selva Prabhakaran (https://www.machinelearningplus.com/author/selva86/)

Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

Comparing Lemmatization Approaches in Python. Photo by Jasmin Schreiber (https://unsplash.com/photos/Fpi3B9RMe5E?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Contents
1. Introduction
2. Wordnet Lemmatizer
3. Wordnet Lemmatizer with appropriate POS tag
4. spaCy Lemmatization
5. TextBlob Lemmatizer
6. TextBlob Lemmatizer with appropriate POS tag
7. Pattern Lemmatizer
8. Stanford CoreNLP Lemmatization
9. Gensim Lemmatize
10. TreeTagger
11. Comparing NLTK, TextBlob, spaCy, Pattern and Stanford CoreNLP
12. Conclusion

1. Introduction
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, lemmatization would correctly identify the base form of ‘caring’ as ‘care’, whereas stemming would cut off the ‘ing’ part and convert it to ‘car’.

‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’

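To make the contrast concrete, here is a minimal sketch. Both helpers are hypothetical and written only for illustration: naive_stem() blindly chops a few common suffixes (real stemmers such as Porter's apply additional rules and may behave differently), and toy_lemmatize() looks words up in a tiny hand-made dictionary that stands in for a real lexical database.

```python
# Illustrative only: neither helper comes from a library.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Chop the first matching suffix, with no regard for meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hand-made lookup table (an assumption standing in for a real dictionary).
TOY_LEMMAS = {"caring": "care", "feet": "foot", "bats": "bat"}


def toy_lemmatize(word):
    """Return the dictionary base form, or the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ["Caring", "feet", "bats"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
```

The suffix stripper turns ‘Caring’ into ‘car’ and leaves ‘feet’ untouched, while the lookup returns ‘care’ and ‘foot’, which is the gap the rest of this post is about.
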
Also, the same word can sometimes have multiple different lemmas. So, based on the context in which it is used, you should identify the ‘part-of-speech’ (POS) tag for the word in that specific context and extract the appropriate lemma. Examples of implementing this come in the following sections.

Today, we will see how to implement lemmatization using the following Python packages.

1. Wordnet Lemmatizer
2. spaCy Lemmatizer
3. TextBlob
4. CLiPS Pattern
5. Stanford CoreNLP
6. Gensim Lemmatizer
7. TreeTagger

2. Wordnet Lemmatizer with NLTK
Wordnet (https://wordnet.princeton.edu/) is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.

NLTK offers an interface to it, but you have to download it first in order to use it.

Follow the instructions below to install nltk and download wordnet.

# How to install and import NLTK
# In terminal or prompt:
# pip install nltk

# Download Wordnet through NLTK in the Python console:
import nltk
nltk.download('wordnet')

# The tokenizer and POS tagger used later need their own resources
# (resource names can differ between NLTK versions):
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In order to lemmatize, you need to create an instance of WordNetLemmatizer() and call the lemmatize() function on a single word.

import nltk
from nltk.stem import WordNetLemmatizer

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"))
#> bat

print(lemmatizer.lemmatize("are"))
#> are

print(lemmatizer.lemmatize("feet"))
#> foot


Let’s lemmatize a simple sentence. We first tokenize the sentence into words using nltk.word_tokenize and then call lemmatizer.lemmatize() on each word. This can be done in a list comprehension (the for-loop inside square brackets that builds a list).

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best

The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.


Notice it didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. This happens because lemmatize() treats every word as a noun unless told otherwise. It can be corrected if we provide the correct ‘part-of-speech’ tag (https://www.clips.uantwerpen.be/pages/MBSP-tags) (POS tag) as the second argument to lemmatize().

Sometimes, the same word can have multiple lemmas based on the meaning / context.

print(lemmatizer.lemmatize("stripes", 'v'))
#> strip

print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe

3. Wordnet Lemmatizer with appropriate POS tag
It may not be possible to manually provide the correct POS tag for every word in large texts. So, instead, we will find out the correct POS tag for each word, map it to the right input character that the WordNetLemmatizer accepts and pass it as the second argument to lemmatize().

So how do you get the POS tag for a given word?

In nltk, it is available through the nltk.pos_tag() method. It accepts only a list (list of words), even if it’s a single word.

print(nltk.pos_tag(['feet']))
#> [('feet', 'NNS')]

print(nltk.pos_tag(nltk.word_tokenize(sentence)))
#> [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on

nltk.pos_tag() returns a list of (word, POS tag) tuples. The key here is to map NLTK’s POS tags to the format the wordnet lemmatizer would accept. The get_wordnet_pos() function defined below does this mapping job.

# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']

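One caveat: get_wordnet_pos() tags each word in isolation, so the tagger never sees the surrounding words. A possible variation, sketched below under the assumption that nltk and its tokenizer, tagger and wordnet data are installed, tags the whole sentence once and maps each Penn Treebank tag to a wordnet POS character. The names penn_to_wordnet() and lemmatize_sentence() are mine, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a wordnet POS character.

    The characters match wordnet.ADJ, wordnet.NOUN, wordnet.VERB and
    wordnet.ADV in nltk.corpus; unknown tags fall back to noun.
    """
    return {"J": "a", "N": "n", "V": "v", "R": "r"}.get(tag[:1].upper(), "n")


def lemmatize_sentence(sentence):
    """Tag the full sentence once, then lemmatize each token with its tag."""
    # Imported here so penn_to_wordnet() stays usable without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]


# The mapping itself can be checked without any NLTK data.
print(penn_to_wordnet("VBG"), penn_to_wordnet("NNS"), penn_to_wordnet("DT"))
```

Calling lemmatize_sentence(sentence) should give a list comparable to the one above, although individual tags (and therefore lemmas) can differ once the tagger sees each word in context.
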

4. spaCy Lemmatization
spaCy is relatively new in the space and is billed as an industrial-strength NLP engine. It comes with pre-built models (https://spacy.io/usage/models) that can parse text and compute various NLP-related features through a single function call. Of course, it provides the lemma of the word too.

Before we begin, let’s install spaCy and download the ‘en’ model.


# Install spaCy (run in a Jupyter notebook cell; the leading `!` runs a shell command)
import sys
!{sys.executable} -m pip install spacy

# Download spaCy's 'en' Model
!{sys.executable} -m spacy download en

spaCy determines the part-of-speech tag by default and assigns the corresponding lemma. It comes with a bunch of prebuilt models, and the ‘en’ model we just downloaded is one of the standard ones for English.

import spacy

# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'

It did all the lemmatizations that the Wordnet Lemmatizer did when supplied with the correct POS tag. Plus, it also lemmatized ‘best’ to ‘good’. Nice!

You’d see the -PRON- placeholder coming up whenever spaCy detects a pronoun.
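
If the -PRON- placeholder is not what you want downstream, one option is to fall back to the lowercased original token. The helper below is a hypothetical post-processing step, not a spaCy function; it only assumes you have parallel lists of token texts and lemmas, such as [t.text for t in doc] and [t.lemma_ for t in doc]. Other spaCy releases may treat pronouns differently, so read it as a sketch for the output shown above.

```python
def replace_pron(tokens, lemmas, placeholder="-PRON-"):
    """Swap the pronoun placeholder for the lowercased original token."""
    return [tok.lower() if lem == placeholder else lem
            for tok, lem in zip(tokens, lemmas)]


# Token texts and lemmas copied from the example above.
tokens = "The striped bats are hanging on their feet for best".split()
lemmas = "the strip bat be hang on -PRON- foot for good".split()

print(" ".join(replace_pron(tokens, lemmas)))
#> the strip bat be hang on their foot for good
```
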

5. TextBlob Lemmatizer
TextBlob is a powerful, fast and convenient NLP package as well. Using the Word and TextBlob objects, it’s quite straightforward to parse and lemmatize words and sentences respectively.

# pip install textblob

from textblob import TextBlob, Word

# Lemmatize a word
word = 'stripes'
w = Word(word)
w.lemmatize()

#> stripe

However, to lemmatize a sentence or paragraph, we parse it using TextBlob and call the lemmatize() function on the parsed words.

# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])
#> 'The striped bat are hanging on their foot for best'

It did not do a great job at the outset, because, like NLTK, TextBlob also uses
wordnet internally. So, let’s pass the appropriate POS tag to the lemmatize() method.

6. TextBlob Lemmatizer with appropriate POS tag
# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)
#> 'The striped bat be hang on their foot for best'

7. Pattern Lemmatizer
Pattern by CLiPS (https://www.clips.uantwerpen.be/pages/pattern) is a versatile module with many useful NLP capabilities.

!pip install pattern

If you run into issues while installing pattern, check out the known issues on GitHub (https://github.com/clips/pattern/issues). I myself faced this issue (https://github.com/clips/pattern/issues/203) when installing on a Mac.

import pattern
from pattern.en import lemma, lexeme

sentence = "The striped bats were hanging on their feet and ate best fishes"
" ".join([lemma(wd) for wd in sentence.split()])
#> 'the stripe bat be hang on their feet and eat best fishes'

You can also view the possible lexemes for each word.

# Lexemes for each word
[lexeme(wd) for wd in sentence.split()]
#> [['the', 'thes', 'thing', 'thed'],
#> ['stripe', 'stripes', 'striping', 'striped'],
#> ['bat', 'bats', 'batting', 'batted'],
#> ['be', 'am', 'are', 'is', 'being', 'was', 'were', 'been',
#>  'am not', "aren't", "isn't", "wasn't", "weren't"],
#> ['hang', 'hangs', 'hanging', 'hung'],
#> ['on', 'ons', 'oning', 'oned'],
#> ['their', 'theirs', 'theiring', 'theired'],
#> ['feet', 'feets', 'feeting', 'feeted'],
#> ['and', 'ands', 'anding', 'anded'],
#> ['eat', 'eats', 'eating', 'ate', 'eaten'],
#> ['best', 'bests', 'besting', 'bested'],
#> ['fishes', 'fishing', 'fishesed']]

You could also obtain the lemma by parsing the text.

from pattern.en import parse

print(parse('The striped bats were hanging on their feet and ate best fishes',
            lemmata=True, tags=False, chunks=False))

#> The/DT/the striped/JJ/striped bats/NNS/bat were/VBD/be hanging/VBG/hang on/IN/on their/PRP$/

#> feet/NNS/foot and/CC/and ate/VBD/eat best/JJ/best fishes/NNS/fish

8. Stanford CoreNLP Lemmatization

Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/index.html) is a popular NLP tool that was originally implemented in Java. There are many Python wrappers written around it. The one I use below is quite convenient.

But before that, you need to download Java and the Stanford CoreNLP software. Make sure you have the following requirements before getting to the lemmatization code:

Step 1: Java 8 Installed

You can download and install it from the Java download page (https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).

Mac users can check the Java version by typing java -version in the terminal. If it’s 1.8+, then it’s OK. Otherwise, follow the steps below.

brew update
brew install jenv
brew cask install java

Step 2: Download the Stanford CoreNLP software (https://stanfordnlp.github.io/CoreNLP/index.html#download) and unzip it.

Step 3: Start the Stanford CoreNLP server from the terminal. How? cd to the folder you just unzipped and run the command below:

cd stanford-corenlp-full-2018-02-27

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit

This will start a StanfordCoreNLPServer listening on port 9000. Now, we are ready to extract the lemmas in Python.

In the stanfordcorenlp package, the lemma is embedded in the output of the annotate() method of the StanfordCoreNLP connection object (see code below).

# Run `pip install stanfordcorenlp` to install stanfordcorenlp package
from stanfordcorenlp import StanfordCoreNLP
import json

# Connect to the CoreNLP server we just started
nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)

# Define properties needed to get lemma
props = {'annotators': 'pos,lemma',
         'pipelineLanguage': 'en',
         'outputFormat': 'json'}

sentence = "The striped bats were hanging on their feet and ate best fishes"
parsed_str = nlp.annotate(sentence, properties=props)
parsed_dict = json.loads(parsed_str)
parsed_dict
#> {'sentences': [{'index': 0,
#> 'tokens': [{'after': ' ',
#> 'before': '',
#> 'characterOffsetBegin': 0,
#> 'characterOffsetEnd': 3,
#> 'index': 1,
#> 'lemma': 'the', << ----------- LEMMA
#> 'originalText': 'The',
#> 'pos': 'DT',
#> 'word': 'The'},
#> {'after': ' ',
#> 'before': ' ',
#> 'characterOffsetBegin': 4,
#> 'characterOffsetEnd': 11,

#> 'index': 2,
#> 'lemma': 'striped', << ----------- LEMMA
#> 'originalText': 'striped',
#> 'pos': 'JJ',
#> 'word': 'striped'},
#> {'after': ' ',
#> 'before': ' ',
#> 'characterOffsetBegin': 12,
#> 'characterOffsetEnd': 16,
#> 'index': 3,
#> 'lemma': 'bat', << ----------- LEMMA
#> 'originalText': 'bats',
#> 'pos': 'NNS',
#> 'word': 'bats'}
#> ...
#> ...

The output of nlp.annotate() was converted to a dict using json.loads. Now the lemma we need is embedded a couple of layers inside the parsed_dict. So here, we need to get just the lemma value from each dict. I use a list comprehension below to do the trick.

lemma_list = [v for d in parsed_dict['sentences'][0]['tokens'] for k,v in d.items() if k == 'l

" ".join(lemma_list)
#> 'the striped bat be hang on they foot and eat best fish'
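
The same extraction can be written by reading the 'lemma' key directly, and by looping over every entry in 'sentences' rather than only the first. The sketch below uses a small hand-written dict that mimics the structure printed above, so it runs without a CoreNLP server; extract_lemmas() is an illustrative name, not part of the stanfordcorenlp package.

```python
def extract_lemmas(parsed):
    """Collect the 'lemma' of every token in every sentence, in order."""
    return [token["lemma"]
            for sent in parsed.get("sentences", [])
            for token in sent.get("tokens", [])]


# Trimmed-down stand-in for json.loads(nlp.annotate(...)).
sample = {
    "sentences": [
        {"index": 0, "tokens": [
            {"word": "The", "lemma": "the", "pos": "DT"},
            {"word": "striped", "lemma": "striped", "pos": "JJ"},
            {"word": "bats", "lemma": "bat", "pos": "NNS"},
        ]},
    ]
}

print(" ".join(extract_lemmas(sample)))
#> the striped bat
```
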

Let’s generalize this into a nice function so as to handle larger paragraphs.

from stanfordcorenlp import StanfordCoreNLP
import json, string

def lemmatize_corenlp(conn_nlp, sentence):
    props = {
        'annotators': 'pos,lemma',
        'pipelineLanguage': 'en',
        'outputFormat': 'json'
    }

    # tokenize into words
    sents = conn_nlp.word_tokenize(sentence)

    # remove punctuations from tokenised list
    sents_no_punct = [s for s in sents if s not in string.punctuation]

    # form sentence
    sentence2 = " ".join(sents_no_punct)

    # annotate to get lemma
    parsed_str = conn_nlp.annotate(sentence2, properties=props)
    parsed_dict = json.loads(parsed_str)

    # extract the lemma for each word
    lemma_list = [v for d in parsed_dict['sentences'][0]['tokens'] for k,v in d.items() if k ==

    # form sentence and return it
    return " ".join(lemma_list)

# make the connection and call `lemmatize_corenlp`
nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)
lemmatize_corenlp(conn_nlp=nlp, sentence=sentence)
#> 'the striped bat be hang on they foot and eat best fish'

9. Gensim Lemmatize
Gensim provides lemmatization facilities based on the pattern package. It can be implemented using the lemmatize() method in the utils module. By default, lemmatize() allows only the ‘JJ’, ‘VB’, ‘NN’ and ‘RB’ tags.

from gensim.utils import lemmatize

sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']
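
Each item returned by lemmatize() in the example above is a byte string of the form lemma/TAG, which is why the code decodes it and splits on '/'. The sketch below reproduces just that parsing step on hand-written sample values (the exact tags are assumed for illustration), so it does not need gensim or pattern installed; split_lemma_tag() is a hypothetical helper.

```python
def split_lemma_tag(item):
    """Turn b'bat/NN' into ('bat', 'NN'); tag is None if there is no '/'."""
    text = item.decode("utf-8") if isinstance(item, bytes) else item
    lemma, sep, tag = text.rpartition("/")
    return (lemma, tag) if sep else (text, None)


# Assumed sample values in the lemma/TAG shape described above.
sample = [b"striped/JJ", b"bat/NN", b"be/VB", b"hang/VB"]

pairs = [split_lemma_tag(item) for item in sample]
print([lemma for lemma, _ in pairs])
#> ['striped', 'bat', 'be', 'hang']

# Keeping the tag makes it easy to filter, e.g. verbs only.
print([lemma for lemma, tag in pairs if tag == "VB"])
#> ['be', 'hang']
```
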

10. TreeTagger
TreeTagger is a part-of-speech tagger for many languages, and it provides the lemma of the word as well.

You will need to download and install the TreeTagger software (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) itself in order to use it, following the steps mentioned there.

# pip install treetaggerwrapper

import treetaggerwrapper as ttpw

tagger = ttpw.TreeTagger(TAGLANG='en', TAGDIR='/Users/ecom-selva.p/Documents/MLPlus/11_Lemmatiz

tags = tagger.tag_text("The striped bats were hanging on their feet and ate best fishes")

lemmas = [t.split('\t')[-1] for t in tags]

#> ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'and', 'eat', 'good', 'fish']

TreeTagger indeed does a good job of converting ‘best’ to ‘good’, and of handling the other words as well. For further reading, refer to TreeTaggerWrapper’s documentation (https://treetaggerwrapper.readthedocs.io/en/latest/).
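
The split('\t')[-1] step in the code above works because each tagged item is a tab-separated string whose last field is the lemma. Here is a small sketch of that parsing on hand-written lines. The word, POS, lemma field order and the tag values are assumptions for illustration, and TreeTagger itself is not needed to run it.

```python
from collections import namedtuple

# Illustrative record type; the field order is assumed as word, POS, lemma.
Tagged = namedtuple("Tagged", ["word", "pos", "lemma"])


def parse_tagged_line(line):
    """Split one 'word<TAB>pos<TAB>lemma' line into a Tagged record."""
    parts = line.split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    return Tagged(*parts)


# Hand-written sample lines, not real TreeTagger output.
sample = ["bats\tNNS\tbat", "were\tVBD\tbe", "best\tJJS\tgood"]

records = [parse_tagged_line(line) for line in sample]
print([r.lemma for r in records])
#> ['bat', 'be', 'good']
```
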

11. Comparing NLTK, TextBlob, spaCy, Pattern and Stanford CoreNLP
Let’s run lemmatization using the 5 implementations on the following sentence and compare the output.

sentence = """Following mice attacks, caring farmers were marching to Delhi for better living

Delhi police on Tuesday fired water cannons and teargas shells at protesting farmers as they tr
break barricades with their cars, automobiles and tractors."""

# NLTK
from pprint import pprint
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

pprint(" ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(senten

# ('Following mouse attack care farmer be march to Delhi for well living '
# 'condition Delhi police on Tuesday fire water cannon and teargas shell at '
# 'protest farmer a they try to break barricade with their car automobile and '
# 'tractor')

# Spacy
import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])

doc = nlp(sentence)
pprint(" ".join([token.lemma_ for token in doc]))

# ('follow mice attack , care farmer be march to delhi for good living condition '
# '. delhi police on tuesday fire water cannon and teargas shell at protest '

# 'farmer as -PRON- try to break barricade with -PRON- car , automobile and '

# 'tractor .')

# TextBlob

pprint(lemmatize_with_postag(sentence))
# ('Following mouse attack care farmer be march to Delhi for good living '

# 'condition Delhi police on Tuesday fire water cannon and teargas shell at '

# 'protest farmer a they try to break barricade with their car automobile and '
# 'tractor')

# Pattern
from pattern.en import lemma

pprint(" ".join([lemma(wd) for wd in sentence.split()]))

# ('follow mice attacks, care farmer be march to delhi for better live '
# 'conditions. delhi police on tuesday fire water cannon and tearga shell at '

# 'protest farmer a they try to break barricade with their cars, automobile and '

# 'tractors.')

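# Stanford
# `nlp` now holds the spaCy model, so open a separate CoreNLP connection
conn_nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)
pprint(lemmatize_corenlp(conn_nlp=conn_nlp, sentence=sentence))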

# ('follow mouse attack care farmer be march to Delhi for better living '
# 'condition Delhi police on Tuesday fire water cannon and tearga shell at '
# 'protest farmer as they try to break barricade with they car automobile and '

# 'tractor')
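
Reading five long strings side by side is tedious, so a small helper can point out where the tools disagree. The sketch below is only an illustration: lemma_differences() is a hypothetical function, and the inputs are the first lines of three of the outputs above, copied by hand. It lists the lowercased lemmas that only one tool produced.

```python
def lemma_differences(outputs):
    """Return, per tool, the lowercased lemmas that no other tool produced."""
    token_sets = {name: set(text.lower().split()) for name, text in outputs.items()}
    unique = {}
    for name, tokens in token_sets.items():
        # Union of every other tool's lemmas.
        others = set().union(*(t for n, t in token_sets.items() if n != name))
        unique[name] = sorted(tokens - others)
    return unique


# Short excerpts copied from the outputs above.
outputs = {
    "nltk": "Following mouse attack care farmer be march to Delhi for well living",
    "textblob": "Following mouse attack care farmer be march to Delhi for good living",
    "corenlp": "follow mouse attack care farmer be march to Delhi for better living",
}

print(lemma_differences(outputs))
#> {'nltk': ['well'], 'textblob': ['good'], 'corenlp': ['better', 'follow']}
```

On these excerpts the disagreement is concentrated on ‘Following’ and ‘better’, which matches what you can spot by eye in the full outputs.
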

12. Conclusion
So those are the methods you can use the next time you take up an NLP project. I would be happy to hear about any new approaches or suggestions in the comments. Happy learning!

(/#facebook) (/#twitter) (/#whatsapp)

(/#linkedin) (/#reddit)

/
(/#google_bookmarks) (/#google_gmail)

(https://www.addtoany.com/share#url=https%3A%2F%2Ffanyv88.com%3A443%2Fhttps%2Fwww.machinelearnin
examples-python%2F&title=Lemmatization%20Approaches%20with%20Exam

ALSO ON MACHINELEARNINGPLUS.COM

List Comprehensions How to Train spaCy to Topic modeli

in Python – … Autodetect … visualization
2 years ago • 6 comments 2 months ago • 5 comments 2 years ago • 25 c

List comprehensions is a Named-entity recognition In this post, we

pythonic way of expressing (NER) is the process of structured appro
a 'For Loop' that appends to automatically identifying the gensim's topic m

10 Comments machinelearningplus.com 🔒 Privacy Policy


1 Login

 Recommend 3 t Tweet f Share Sort by Newest

Join the discussion…

(https://www.ezoic.com/what-is-
ezoic/)
report this ad

Home (https://www.machinelearningplus.com) Contact Us (https://www.machinelearningplus.com/contact-us/)
Privacy Policy (https://www.machinelearningplus.com/privacy-policy/) About Selva (https://www.machinelearningplus.com/about/)
Terms and Conditions (https://www.machinelearningplus.com/terms-of-use/)



Copyright Machine Learning Plus (https://www.machinelearningplus.com/) . All rights reserved.
