NLP CT1

15 marks:

1. Stemming and Lemmatization:

Stemming

Stemming is the process of finding the root form of words.

Stemming is the simpler of the two approaches. With stemming, words are reduced to their word stems. A word stem need not be the same as a dictionary-based morphological root; it only has to be an equal or shorter form of the word.

When you break down words with stemming, the roots it finds can sometimes be erroneous or absurd. Because stemming is rule-based, it cuts off suffixes according to fixed rules, and this leads to two kinds of inconsistency: overstemming (cutting too much, so unrelated words share a stem) and understemming (cutting too little, so related forms end up with different stems).

Types of stemming algorithms:

1. Porter’s Stemmer

Porter's Stemmer is one of the oldest stemming algorithms used in computer science. It was first described in 1980 in the paper "An algorithm for suffix stripping" by Martin Porter, and it is one of the most widely used stemmers available in nltk.

Example code:

from nltk.stem import PorterStemmer

porter = PorterStemmer()

porter.stem('amazing')   # returns 'amaz'

This stem is produced because "ing" is such a common ending in English words that the word "amazing" gets stemmed to "amaz". The stem "amaz" is also produced by the words "amazement", "amaze" and "amazed":

porter.stem('amazement')   # returns 'amaz'

porter.stem('amaze')       # returns 'amaz'

porter.stem('amazed')      # returns 'amaz'

2. Snowball Stemmer

The Snowball stemmer (also known as Porter2) is an updated version of Porter's Stemmer, with new rules introduced and some of the existing rules in Porter's Stemmer modified.

The logic and process are exactly the same as in Porter's Stemmer: the word is stemmed sequentially through the five phases of the stemmer.

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

porter.stem('fairly')     # returns 'fairli'

snowball.stem('fairly')   # returns 'fair'

In the example above, Snowball does a much better job of normalizing the adverb "fairly", producing the stem "fair", while Porter's produces the stem "fairli". This makes the stem of the word "fairly" the same as the adjective "fair", which makes sense from a normalization perspective.

3. Lancaster Stemmer

The Lancaster Stemmer is a stemmer developed and presented in the paper "Another Stemmer" by Chris Paice from Lancaster University.

Its rules are more aggressive than Porter's and Snowball's, and it is one of the most aggressive stemmers, as it tends to overstem a lot of words.

from nltk.stem import LancasterStemmer

lanc = LancasterStemmer()

Let's see some examples of how words are stemmed with the Lancaster Stemmer, comparing the results with the Snowball Stemmer, beginning with the word "salty":

snowball.stem('salty')   # returns 'salti'

lanc.stem('salty')       # returns 'sal'

snowball.stem('sales')   # returns 'sale'

lanc.stem('sales')       # returns 'sal'

4. RegexpStemmer

NLTK has a RegexpStemmer class with which we can easily implement regular-expression-based stemming. It takes a single regular expression and removes any prefix or suffix that matches the expression.

import nltk

from nltk.stem import RegexpStemmer

Reg_stemmer = RegexpStemmer('ing')

Reg_stemmer.stem('eating')   # returns 'eat'

Reg_stemmer.stem('ingeat')   # returns 'eat'

Lemmatization

Lemmatization is the process of finding the dictionary form of a word. It is different from stemming and is computationally more involved. Let's examine a definition of it.

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As
opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical
knowledge bases to get the correct base forms of words.

import nltk

from nltk.stem import WordNetLemmatizer

# The lemmatizer uses the WordNet corpus; run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

# Lemmatize single words
print(lemmatizer.lemmatize("workers"))   # prints 'worker'

print(lemmatizer.lemmatize("beeches"))   # prints 'beech'

2. Text Encoding:

Text encoding is the process of converting meaningful text into a number / vector representation in a way that preserves the context and relationships between words and sentences, so that a machine can understand the patterns in the text and make out the context of sentences.

There are many methods to convert text into numerical vectors, including:

- One Hot Encoding

- Index-Based Encoding

- Bag of Words (BOW)

- TF-IDF Encoding

- Word2Vec (Word-to-Vector) Encoding

- BERT Encoding

One Hot Encoding:

In one-hot encoding, every word (and even every symbol) that is part of the given text data is written as a vector consisting only of 1s and 0s. So a one-hot vector is a vector whose elements are only 1 and 0. Each word is encoded as a unique one-hot vector. This allows a word to be identified uniquely by its one-hot vector and vice versa; no two words have the same one-hot vector representation. The sketch below shows one-hot encoding of the words in a sentence.
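A minimal sketch of one-hot encoding in plain Python (the example sentence is an illustrative assumption, not taken from these notes):

# One-hot encoding sketch: each vocabulary word gets a vector with a
# single 1 at its own position and 0 everywhere else.
sentence = "this is a good phone"
vocab = sorted(set(sentence.split()))

one_hot = {
    word: [1 if i == vocab.index(word) else 0 for i in range(len(vocab))]
    for word in vocab
}

for word, vector in one_hot.items():
    print(word, vector)
# e.g. 'a' -> [1, 0, 0, 0, 0], 'good' -> [0, 1, 0, 0, 0]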

Index-Based Encoding:

As the name suggests, with index-based encoding we need to give all the unique words an index. Since we have already separated out our data corpus, we can now index the words individually, like:

a : 1

bad : 2

...

this : 13

Now that we have assigned a unique index to every word, so that each word can be identified by its index, we can convert our sentences using this index-based method.

It is trivial to understand: we are just replacing the words in each sentence with their respective indexes, as in the sketch below.
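A minimal sketch of index-based encoding, assuming the corpus and example sentence used elsewhere in these notes:

corpus = ["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not",
          "phone", "she", "temper", "this"]

# Assign every word a unique 1-based index.
word_to_index = {word: i + 1 for i, word in enumerate(corpus)}

sentence = "this is a good phone"

# Replace each word in the sentence with its index.
encoded = [word_to_index[word] for word in sentence.split()]
print(encoded)   # [13, 7, 1, 4, 10]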

Bag of Words (BOW):

Bag of Words (BoW) is another form of encoding where we use the whole data corpus to encode our sentences. It will make sense once we see how it is actually done.

Data Corpus:

["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not", "phone", "she", "temper", "this"]

Since our data corpus will never change, if we use it as the baseline to create encodings for our sentences, we have the advantage of not needing to pad any extra words.

Now, the first sentence we have is: "this is a good phone".

How do we use the whole corpus to represent this sentence? For every word in the corpus we record how many times it occurs in the sentence, giving one count vector per sentence, as in the sketch below.
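A minimal bag-of-words sketch over this corpus in plain Python:

corpus = ["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not",
          "phone", "she", "temper", "this"]

sentence = "this is a good phone"
words = sentence.split()

# For each corpus word, count how many times it occurs in the sentence.
bow_vector = [words.count(word) for word in corpus]
print(bow_vector)   # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]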


TF-IDF Encoding:

Term Frequency — Inverse Document Frequency

As the name suggests, here we give every word a relative frequency score with respect to the current sentence and the whole corpus.

Term Frequency: the number of occurrences of the current word in the current sentence, divided by the total number of words in that sentence.

Inverse Document Frequency: the logarithm of the total number of sentences in the corpus divided by the number of sentences containing the current word.

TF(w, s) = (count of w in sentence s) / (total number of words in s)

IDF(w) = log(N / n_w), where N is the total number of sentences and n_w is the number of sentences containing w

TF-IDF(w, s) = TF(w, s) x IDF(w)

One thing to note here is that we have to calculate the term frequency of each word for that particular sentence, because the TF value can change depending on the number of times a word occurs in a sentence, whereas the IDF value remains constant unless new sentences are added.
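A minimal sketch of these formulas in plain Python, assuming each sentence in the corpus is given as a list of words (the example sentences are illustrative, not taken from these notes):

import math

sentences = [
    ["this", "is", "a", "good", "phone"],
    ["this", "is", "a", "bad", "mobile"],
    ["she", "has", "a", "bad", "temper"],
]

def tf(word, sentence):
    # Occurrences of the word in the sentence / total words in the sentence.
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    # log(total number of sentences / number of sentences containing the word).
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

def tf_idf(word, sentence, sentences):
    return tf(word, sentence) * idf(word, sentences)

print(tf_idf("good", sentences[0], sentences))   # > 0, "good" appears in only one sentence
print(tf_idf("a", sentences[0], sentences))      # 0.0, "a" appears in every sentence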

5 marks:

Tokenization

Tokenization, in natural language processing, is the process of breaking down the given text into the smallest units of a sentence, called tokens. Punctuation marks, words, and numbers can all be considered tokens.

Tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types:

1. Word Tokenization

2. Character Tokenization

3. Subword (n-gram characters) Tokenization

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on spaces. Assuming space as the delimiter, the tokenization of the sentence results in 3 tokens: Never-give-up.

As each token is a word, this is an example of word tokenization. Similarly, tokens can be either characters or subwords. For example, consider the word "smarter":

1. Character tokens: s-m-a-r-t-e-r

2. Subword tokens: smart-er
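A minimal word-tokenization sketch with nltk (the punkt tokenizer data may need to be downloaded once):

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # tokenizer models, needed on first use

print(word_tokenize("Never give up"))   # ['Never', 'give', 'up']

# Character tokens can be produced directly from the string.
print(list("smarter"))                  # ['s', 'm', 'a', 'r', 't', 'e', 'r']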
Challenges of NLP

Contextual words and phrases and homonyms

The same words and phrases can have different meanings according to the context of a sentence, and many words, especially in English, have the exact same pronunciation but totally different meanings.

Synonyms

Synonyms can lead to issues similar to contextual understanding, because we use many different words to express the same idea. Furthermore, some of these words may convey exactly the same meaning, while some differ only in degree (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.

Irony and sarcasm

Irony and sarcasm present problems for machine learning models because they generally use words
and phrases that, strictly by definition, may be positive or negative, but actually connote the
opposite.

Errors in text and speech

Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.

2. Regular Expressions

A regular expression (RegEx) is defined as a sequence of characters that is mainly used to find or replace patterns in text. In simple words, a regular expression is a set of characters, or a pattern, used to find substrings in a given string. A regular expression (RE) is a language for specifying text search strings. It helps us match or extract strings, or sets of strings, using the specialized syntax of a pattern.

How can Regular Expressions be used in NLP?

In NLP, we can use regular expressions in many places, such as the following (a short sketch follows this list):

1. To validate data fields.

For example: dates, email addresses, URLs, abbreviations, etc.

2. To filter a particular text from the whole corpus.

For example: spam, disallowed websites, etc.

3. To identify particular strings in a text.

For example: token boundaries.

4. To convert the output of one processing component into the format required by a second component.
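A minimal sketch of the first three uses with Python's re module (the patterns and sample strings below are illustrative assumptions, not from these notes):

import re

text = "Contact us at support@example.com or visit https://example.com by 2024-01-31."

# 1. Validate / extract data fields such as email addresses and dates.
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text))   # ['support@example.com']
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))          # ['2024-01-31']

# 2. Filter text, e.g. drop lines that mention a disallowed website.
lines = ["see badsite.example for more", "this line is fine"]
print([line for line in lines if not re.search(r"badsite\.example", line)])
# ['this line is fine']

# 3. Identify particular strings, e.g. rough token boundaries.
print(re.findall(r"\w+|[^\w\s]", "Never give up!"))    # ['Never', 'give', 'up', '!']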
