
Experiment 3 Manual


Assignment No.

SEMESTER: VII (2023-2024) DATE OF DECLARATION:

SUBJECT: CSDL7013-NLP Lab DATE OF SUBMISSION:

NAME OF THE STUDENT: ROLL NO.:

AIM To perform stemming and lemmatization using NLTK, spaCy and TextBlob for English sentences.

LEARNING OBJECTIVE To highlight and identify the various preprocessing techniques for natural language text processing.

LEARNING OUTCOME The student will be able to highlight and identify the various natural language text preprocessing techniques.

COURSE OUTCOME CSDL7013.1 Apply various text processing techniques.

PROGRAM
OUTCOME

BLOOM'S TAXONOMY LEVEL Remember

THEORY

Introduction to Text Normalization

In any natural language, words can be written or spoken in more than one form depending on the situation. That's what makes language such a thrilling part of our lives, right? For example:

1. Lisa ate the food and washed the dishes.

2. They were eating noodles at a cafe.

3. Don’t you want to eat before we leave?

4. We have just eaten our breakfast.

5. It also eats fruit and vegetables.

In all these sentences, we can see that the word eat has been used in multiple forms. For us, it is easy to
understand that eating is the activity here. So it doesn’t really matter to us whether it is ‘ate’, ‘eat’, or
‘eaten’ – we know what is going on.

Unfortunately, that is not the case with machines. They treat these words differently. Therefore, we need
to normalize them to their root word, which is “eat” in our example.

Hence, text normalization is the process of transforming a word into a single canonical form. This can be done by two techniques: stemming and lemmatization. Let's understand what they are in detail.

What are Stemming and Lemmatization?

Stemming and lemmatization are both forms of word normalization, which means reducing a word to its root form.

Stemming

Stemming is a text normalization technique that cuts off the end or beginning of a word, taking into account a list of common prefixes or suffixes that could be found in that word. It is a rudimentary, rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
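For instance, NLTK's PorterStemmer (the same class used in the lab exercise below) applies exactly such suffix rules. A minimal sketch:

```python
from nltk.stem import PorterStemmer  # rule-based suffix stripper; needs no corpus download

ps = PorterStemmer()
for w in ["eating", "eats", "eaten", "studies", "quickly"]:
    print(w, "->", ps.stem(w))  # e.g. 'eating' -> 'eat', 'studies' -> 'studi'
```

Notice that the result need not be a dictionary word ('studi'), which is exactly the rudimentary nature of stemming.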

Lemmatization

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammatical relations).

Why do we need to Perform Stemming or Lemmatization?

Let’s consider the following two sentences:

1. He was driving

2. He went for a drive

We can easily state that both sentences convey the same meaning, that is, a driving activity in the past. A machine, however, will treat the two sentences differently. Thus, to make the text understandable to the machine, we need to perform stemming or lemmatization.

Another benefit of text normalization is that it reduces the number of unique words in the text data. This
helps in bringing down the training time of the machine learning model (and don’t we all want that?).

So, which one should we prefer?

A stemming algorithm works by cutting the suffix or prefix from the word. Lemmatization is a more powerful operation, as it takes the morphological analysis of the word into consideration.

Lemmatization returns the lemma, which is the root word of all its inflected forms.

We can say that stemming is a quick-and-dirty method of chopping words down to their root form, while lemmatization is an intelligent operation that uses dictionaries built with in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.

Methods to Perform Text Normalization

1. Text Normalization using NLTK

2. Text Normalization using spaCy

3. Text Normalization using TextBlob

LAB EXERCISE

Methods to perform Text Normalization

1. Text Normalization using NLTK

The NLTK library has a lot of amazing methods to perform different steps of data preprocessing. There
are methods like PorterStemmer() and WordNetLemmatizer() to perform stemming and lemmatization,
respectively.

Code: STEMMING

import nltk
nltk.download('stopwords', quiet=True)  # stop word lists (one-time download)
nltk.download('punkt', quiet=True)      # tokenizer models (one-time download)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# tokenize and remove stop words before stemming
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# stem each remaining token
ps = PorterStemmer()
Stem_words = []
for w in filtered_sentence:
    rootWord = ps.stem(w)
    Stem_words.append(rootWord)

print(filtered_sentence)
print(Stem_words)

Output:

Filtered_sentence

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase
rights become much less valuable, indeed vaguest idea wood river question.

Stem_words

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas right become
much less valuabl, inde vaguest idea wood river question.

Code: LEMMATIZATION

import nltk
nltk.download('stopwords', quiet=True)  # stop word lists (one-time download)
nltk.download('punkt', quiet=True)      # tokenizer models (one-time download)
nltk.download('wordnet', quiet=True)    # WordNet corpus for the lemmatizer

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# tokenize and remove stop words before lemmatizing
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)

# lemmatize each token as a noun, then a verb, then an adjective
wordnet_lemmatizer = WordNetLemmatizer()
lemma_word = []
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

Output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase
rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase
right become much le valuable, indeed vaguest idea wood river question.

Here, v stands for verb, a stands for adjective and n stands for noun. The lemmatizer only lemmatizes those words which match the pos parameter of the lemmatize method; other words are returned unchanged. Lemmatization is thus done on the basis of part-of-speech (POS) tagging.

2. Text Normalization using spaCy

spaCy is an amazing NLP library. It provides many industry-level methods to perform lemmatization. Unfortunately, spaCy has no module for stemming. To perform lemmatization, check out the code below:

Code: Lemmatization

# make sure to download the English model with "python -m spacy download en_core_web_sm"
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.""")

# collect the lemma of every token in the document
lemma_word1 = []
for token in doc:
    lemma_word1.append(token.lemma_)
print(lemma_word1)

Output:

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim to the
wood-cuting and \n fishery rihgts at once. -PRON- be the more ready to do this becuase the right have
become much less valuable, and -PRON- have \n indeed the vague idea where the wood and river in
question be.

Note: Here -PRON- is the placeholder that spaCy v2 uses for pronoun lemmas; it can easily be removed using regular expressions (spaCy v3 and later return the pronoun itself as the lemma instead). The benefit of spaCy is that we do not have to pass any pos parameter to perform lemmatization.

3. Text Normalization using TextBlob

TextBlob is a Python library made especially for preprocessing text data. It is built on top of the NLTK library. We can use TextBlob to perform lemmatization. However, there is no module for stemming in TextBlob.

Code:

# import the Word class from the textblob library
from textblob import Word

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# lemmatize each word as a noun, then a verb, then an adjective
lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
print(lem)

Output:

He determine to drop his litigation with the monastry, and relinguish his claim to the

wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right

have become much le valuable, and he have indeed the vague idea where the wood and river in question were.

REFERENCES 1. Steven Bird, Ewan Klein, Edward Loper, Natural Language Processing with Python, O'Reilly
