
Experiment 3 Manual


Assignment No.

SEMESTER: VII (2023-2024) DATE OF DECLARATION:

SUBJECT: CSDL7013-NLP Lab DATE OF SUBMISSION:

NAME OF THE STUDENT: ROLL NO.:

AIM To perform stemming and lemmatization using NLTK, spaCy and TextBlob for English sentences.

LEARNING OBJECTIVE To highlight and identify the various preprocessing techniques for natural language text processing.

LEARNING OUTCOME The student will be able to highlight and identify the various natural language text preprocessing techniques.

COURSE OUTCOME CSDL7013.1 Apply various text processing techniques.

PROGRAM
OUTCOME

BLOOM'S TAXONOMY LEVEL Remember

THEORY

Introduction to Text Normalization

In any natural language, words can be written or spoken in more than one form depending on the situation. That's what makes language such a thrilling part of our lives, right? For example:

1. Lisa ate the food and washed the dishes.

2. They were eating noodles at a cafe.

3. Don’t you want to eat before we leave?

4. We have just eaten our breakfast.

5. It also eats fruit and vegetables.

In all these sentences, we can see that the word eat has been used in multiple forms. For us, it is easy to
understand that eating is the activity here. So it doesn’t really matter to us whether it is ‘ate’, ‘eat’, or
‘eaten’ – we know what is going on.

Unfortunately, that is not the case with machines. They treat these words differently. Therefore, we need
to normalize them to their root word, which is “eat” in our example.

Hence, text normalization is the process of transforming a word into a single canonical form. This can be done by two techniques: stemming and lemmatization. Let's understand what they are in detail.

What are Stemming and Lemmatization?

Stemming and lemmatization are both forms of word normalization, which means reducing a word to its root form.

Stemming

Stemming is a text normalization technique that cuts off the end or beginning of a word, taking into account a list of common prefixes or suffixes that could be found in that word. It is a rudimentary, rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
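For instance, NLTK's PorterStemmer (the same class used in the lab exercise below) applies exactly such suffix rules. A minimal sketch:

```python
from nltk.stem import PorterStemmer  # rule-based suffix stripper; needs no corpus download

ps = PorterStemmer()
for w in ["eating", "eats", "eaten", "studies", "quickly"]:
    print(w, "->", ps.stem(w))  # e.g. 'eating' -> 'eat', 'studies' -> 'studi'
```

Notice that the result need not be a dictionary word ('studi'), which is exactly the rudimentary nature of stemming.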

Lemmatization

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammatical relations).

Why do we need to Perform Stemming or Lemmatization?

Let’s consider the following two sentences:

1. He was driving

2. He went for a drive

We can easily state that both sentences convey the same meaning, that is, a driving activity in the past. A machine, however, will treat the two sentences differently. Thus, to make the text understandable to the machine, we need to perform stemming or lemmatization.

Another benefit of text normalization is that it reduces the number of unique words in the text data. This
helps in bringing down the training time of the machine learning model (and don’t we all want that?).

So, which one should we prefer?

A stemming algorithm works by cutting the suffix or prefix from the word. Lemmatization is a more powerful operation, as it takes the morphological analysis of the word into consideration.

Lemmatization returns the lemma, which is the root word of all its inflected forms.

We can say that stemming is a quick-and-dirty method of chopping words down to their root form, while lemmatization is an intelligent operation that uses dictionaries built with in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.

Methods to Perform Text Normalization

1. Text Normalization using NLTK

2. Text Normalization using spaCy

3. Text Normalization using TextBlob

LAB EXERCISE

Methods to perform Text Normalization

1. Text Normalization using NLTK

The NLTK library has a lot of amazing methods to perform different steps of data preprocessing. There
are methods like PorterStemmer() and WordNetLemmatizer() to perform stemming and lemmatization,
respectively.

Code: STEMMING

import nltk
nltk.download('stopwords', quiet=True)  # stop word lists (one-time download)
nltk.download('punkt', quiet=True)      # tokenizer models (one-time download)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# tokenize and remove stop words before stemming
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# stem each remaining token
ps = PorterStemmer()
Stem_words = []
for w in filtered_sentence:
    rootWord = ps.stem(w)
    Stem_words.append(rootWord)

print(filtered_sentence)
print(Stem_words)

Output:

Filtered_sentence

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase
rights become much less valuable, indeed vaguest idea wood river question.

Stem_words

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas right become
much less valuabl, inde vaguest idea wood river question.

Code: LEMMATIZATION

import nltk
nltk.download('stopwords', quiet=True)  # stop word lists (one-time download)
nltk.download('punkt', quiet=True)      # tokenizer models (one-time download)
nltk.download('wordnet', quiet=True)    # WordNet corpus for the lemmatizer

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# tokenize and remove stop words before lemmatizing
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)

# lemmatize each token as a noun, then a verb, then an adjective
wordnet_lemmatizer = WordNetLemmatizer()
lemma_word = []
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

Output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase
rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase
right become much le valuable, indeed vaguest idea wood river question.

Here, v stands for verb, a stands for adjective and n stands for noun. The lemmatizer only lemmatizes those words which match the pos parameter of the lemmatize method; other words are returned unchanged. Lemmatization is thus done on the basis of part-of-speech (POS) tagging.

2. Text Normalization using spaCy

spaCy is an amazing NLP library. It provides many industry-level methods to perform lemmatization. Unfortunately, spaCy has no module for stemming. To perform lemmatization, check out the code below:

Code: Lemmatization

# make sure to download the English model with "python -m spacy download en_core_web_sm"
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.""")

# collect the lemma of every token in the document
lemma_word1 = []
for token in doc:
    lemma_word1.append(token.lemma_)
print(lemma_word1)

Output:

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim to the
wood-cuting and \n fishery rihgts at once. -PRON- be the more ready to do this becuase the right have
become much less valuable, and -PRON- have \n indeed the vague idea where the wood and river in
question be.

Note: Here -PRON- is the placeholder that spaCy v2 uses for pronoun lemmas; it can easily be removed using regular expressions (spaCy v3 and later return the pronoun itself as the lemma instead). The benefit of spaCy is that we do not have to pass any pos parameter to perform lemmatization.

3. Text Normalization using TextBlob

TextBlob is a Python library made especially for preprocessing text data. It is built on top of the NLTK library. We can use TextBlob to perform lemmatization. However, there is no module for stemming in TextBlob.

Code:

# import the Word class from the textblob library
from textblob import Word

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# lemmatize each word as a noun, then a verb, then an adjective
lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
print(lem)

Output:

He determine to drop his litigation with the monastry, and relinguish his claim to the

wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right

have become much le valuable, and he have indeed the vague idea where the wood and river in question were.

REFERENCES 1. Steven Bird, Ewan Klein, Edward Loper, Natural Language Processing with Python, O'Reilly
