NLP CT1

15 marks:

1. Stemming and Lemmatization:

Stemming

Stemming is the process of finding the root form of words.

Stemming is the simpler of the two approaches. With stemming, words are reduced to their word stems. A word stem need not be the same as a dictionary-based morphological root; it only has to be an equal or shorter form of the word.

When you break down words with stemming, the roots it finds can sometimes be erroneous or absurd. Because stemming is rule-based, it cuts off suffixes according to fixed rules, and this leads to two kinds of inconsistency: overstemming (cutting too much, so unrelated words share a stem) and understemming (cutting too little, so related forms end up with different stems).

Types of stemming algorithms:

1. Porter’s Stemmer

Porter's Stemmer is one of the oldest stemming algorithms used in computer science. It was first described in 1980 in the paper "An algorithm for suffix stripping" by Martin Porter, and it is one of the most widely used stemmers available in nltk.

Example code:

from nltk.stem import PorterStemmer

porter = PorterStemmer()

porter.stem('amazing')   # returns 'amaz'

This stem is produced because "ing" is such a common ending in English words that the word "amazing" gets stemmed to "amaz". The stem "amaz" is also produced by the words "amazement", "amaze" and "amazed":

porter.stem('amazement')   # returns 'amaz'

porter.stem('amaze')       # returns 'amaz'

porter.stem('amazed')      # returns 'amaz'

2. Snowball Stemmer

The Snowball stemmer (also known as Porter2) is an updated version of Porter's Stemmer, with new rules introduced and some of the existing rules in Porter's Stemmer modified.

The logic and process are exactly the same as in Porter's Stemmer: the word is stemmed sequentially through the five phases of the stemmer.

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

porter.stem('fairly')     # returns 'fairli'

snowball.stem('fairly')   # returns 'fair'

In the example above, Snowball does a much better job of normalizing the adverb "fairly", producing the stem "fair", while Porter's produces the stem "fairli". This makes the stem of the word "fairly" the same as the adjective "fair", which makes sense from a normalization perspective.

3. Lancaster Stemmer

The Lancaster Stemmer is a stemmer developed and presented in the paper "Another Stemmer" by Chris Paice from Lancaster University.

Its rules are more aggressive than Porter's and Snowball's, and it is one of the most aggressive stemmers, as it tends to overstem a lot of words.

from nltk.stem import LancasterStemmer

lanc = LancasterStemmer()

Let's see some examples of how words are stemmed with the Lancaster Stemmer, comparing the results with the Snowball Stemmer, beginning with the word "salty":

snowball.stem('salty')   # returns 'salti'

lanc.stem('salty')       # returns 'sal'

snowball.stem('sales')   # returns 'sale'

lanc.stem('sales')       # returns 'sal'

4. RegexpStemmer

NLTK has a RegexpStemmer class with which we can easily implement regular-expression-based stemming. It takes a single regular expression and removes any prefix or suffix that matches the expression.

import nltk

from nltk.stem import RegexpStemmer

Reg_stemmer = RegexpStemmer('ing')

Reg_stemmer.stem('eating')   # returns 'eat'

Reg_stemmer.stem('ingeat')   # returns 'eat'

Lemmatization

Lemmatization is the process of finding the dictionary form of a word. It is different from stemming and is computationally more involved. Let's examine a definition of it.

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As
opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical
knowledge bases to get the correct base forms of words.

import nltk

from nltk.stem import WordNetLemmatizer

# The lemmatizer uses the WordNet corpus; run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

# Lemmatize single words
print(lemmatizer.lemmatize("workers"))   # prints 'worker'

print(lemmatizer.lemmatize("beeches"))   # prints 'beech'

2. Text Encoding:

Text encoding is the process of converting meaningful text into a number / vector representation in a way that preserves the context and relationships between words and sentences, so that a machine can understand the patterns in the text and make out the context of sentences.

There are many methods to convert text into numerical vectors, including:

- One Hot Encoding

- Index-Based Encoding

- Bag of Words (BOW)

- TF-IDF Encoding

- Word2Vec (Word-to-Vector) Encoding

- BERT Encoding

One Hot Encoding:

In one-hot encoding, every word (and even every symbol) that is part of the given text data is written as a vector consisting only of 1s and 0s. So a one-hot vector is a vector whose elements are only 1 and 0. Each word is encoded as a unique one-hot vector. This allows a word to be identified uniquely by its one-hot vector and vice versa; no two words have the same one-hot vector representation. The sketch below shows one-hot encoding of the words in a sentence.
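A minimal sketch of one-hot encoding in plain Python (the example sentence is an illustrative assumption, not taken from these notes):

# One-hot encoding sketch: each vocabulary word gets a vector with a
# single 1 at its own position and 0 everywhere else.
sentence = "this is a good phone"
vocab = sorted(set(sentence.split()))

one_hot = {
    word: [1 if i == vocab.index(word) else 0 for i in range(len(vocab))]
    for word in vocab
}

for word, vector in one_hot.items():
    print(word, vector)
# e.g. 'a' -> [1, 0, 0, 0, 0], 'good' -> [0, 1, 0, 0, 0]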

Index-Based Encoding:

As the name suggests, with index-based encoding we need to give all the unique words an index. Since we have already separated out our data corpus, we can now index the words individually, like:

a : 1

bad : 2

...

this : 13

Now that we have assigned a unique index to every word, so that each word can be identified by its index, we can convert our sentences using this index-based method.

It is trivial to understand: we are just replacing the words in each sentence with their respective indexes, as in the sketch below.
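A minimal sketch of index-based encoding, assuming the corpus and example sentence used elsewhere in these notes:

corpus = ["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not",
          "phone", "she", "temper", "this"]

# Assign every word a unique 1-based index.
word_to_index = {word: i + 1 for i, word in enumerate(corpus)}

sentence = "this is a good phone"

# Replace each word in the sentence with its index.
encoded = [word_to_index[word] for word in sentence.split()]
print(encoded)   # [13, 7, 1, 4, 10]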

Bag of Words (BOW):

Bag of Words (BoW) is another form of encoding where we use the whole data corpus to encode our sentences. It will make sense once we see how it is actually done.

Data Corpus:

["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not", "phone", "she", "temper", "this"]

Since our data corpus will never change, if we use it as the baseline to create encodings for our sentences, we have the advantage of not needing to pad any extra words.

Now, the first sentence we have is: "this is a good phone".

How do we use the whole corpus to represent this sentence? For every word in the corpus we record how many times it occurs in the sentence, giving one count vector per sentence, as in the sketch below.
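A minimal bag-of-words sketch over this corpus in plain Python:

corpus = ["a", "bad", "cat", "good", "has", "he", "is", "mobile", "not",
          "phone", "she", "temper", "this"]

sentence = "this is a good phone"
words = sentence.split()

# For each corpus word, count how many times it occurs in the sentence.
bow_vector = [words.count(word) for word in corpus]
print(bow_vector)   # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]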


TF-IDF Encoding:

Term Frequency — Inverse Document Frequency

As the name suggests, here we give every word a relative frequency score with respect to the current sentence and the whole corpus.

Term Frequency: the number of occurrences of the current word in the current sentence, divided by the total number of words in that sentence.

Inverse Document Frequency: the logarithm of the total number of sentences in the corpus divided by the number of sentences containing the current word.

TF(w, s) = (count of w in sentence s) / (total number of words in s)

IDF(w) = log(N / n_w), where N is the total number of sentences and n_w is the number of sentences containing w

TF-IDF(w, s) = TF(w, s) x IDF(w)

One thing to note here is that we have to calculate the term frequency of each word for that particular sentence, because the TF value can change depending on the number of times a word occurs in a sentence, whereas the IDF value remains constant unless new sentences are added.
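A minimal sketch of these formulas in plain Python, assuming each sentence in the corpus is given as a list of words (the example sentences are illustrative, not taken from these notes):

import math

sentences = [
    ["this", "is", "a", "good", "phone"],
    ["this", "is", "a", "bad", "mobile"],
    ["she", "has", "a", "bad", "temper"],
]

def tf(word, sentence):
    # Occurrences of the word in the sentence / total words in the sentence.
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    # log(total number of sentences / number of sentences containing the word).
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

def tf_idf(word, sentence, sentences):
    return tf(word, sentence) * idf(word, sentences)

print(tf_idf("good", sentences[0], sentences))   # > 0, "good" appears in only one sentence
print(tf_idf("a", sentences[0], sentences))      # 0.0, "a" appears in every sentence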

5 marks:

Tokenization

Tokenization, in natural language processing, is the process of breaking down the given text into the smallest units of a sentence, called tokens. Punctuation marks, words, and numbers can all be considered tokens.

Tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types:

1. Word Tokenization

2. Character Tokenization

3. Subword (n-gram characters) Tokenization

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on spaces. Assuming space as the delimiter, the tokenization of the sentence results in 3 tokens: Never-give-up.

As each token is a word, this is an example of word tokenization. Similarly, tokens can be either characters or subwords. For example, consider the word "smarter":

1. Character tokens: s-m-a-r-t-e-r

2. Subword tokens: smart-er
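A minimal word-tokenization sketch with nltk (the punkt tokenizer data may need to be downloaded once):

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # tokenizer models, needed on first use

print(word_tokenize("Never give up"))   # ['Never', 'give', 'up']

# Character tokens can be produced directly from the string.
print(list("smarter"))                  # ['s', 'm', 'a', 'r', 't', 'e', 'r']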
Challenges of NLP

Contextual words and phrases and homonyms

The same words and phrases can have different meanings according to the context of a sentence, and many words, especially in English, have the exact same pronunciation but totally different meanings.

Synonyms

Synonyms can lead to issues similar to contextual understanding, because we use many different words to express the same idea. Furthermore, some of these words may convey exactly the same meaning, while some differ only in degree (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.

Irony and sarcasm

Irony and sarcasm present problems for machine learning models because they generally use words
and phrases that, strictly by definition, may be positive or negative, but actually connote the
opposite.

Errors in text and speech

Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.

2. Regular Expressions

A regular expression (RegEx) is defined as a sequence of characters that is mainly used to find or replace patterns in text. In simple words, a regular expression is a set of characters, or a pattern, used to find substrings in a given string. A regular expression (RE) is a language for specifying text search strings. It helps us match or extract strings, or sets of strings, using the specialized syntax of a pattern.

How can Regular Expressions be used in NLP?

In NLP, we can use regular expressions in many places, such as the following (a short sketch follows this list):

1. To validate data fields.

For example: dates, email addresses, URLs, abbreviations, etc.

2. To filter a particular text from the whole corpus.

For example: spam, disallowed websites, etc.

3. To identify particular strings in a text.

For example: token boundaries.

4. To convert the output of one processing component into the format required by a second component.
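A minimal sketch of the first three uses with Python's re module (the patterns and sample strings below are illustrative assumptions, not from these notes):

import re

text = "Contact us at support@example.com or visit https://example.com by 2024-01-31."

# 1. Validate / extract data fields such as email addresses and dates.
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text))   # ['support@example.com']
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))          # ['2024-01-31']

# 2. Filter text, e.g. drop lines that mention a disallowed website.
lines = ["see badsite.example for more", "this line is fine"]
print([line for line in lines if not re.search(r"badsite\.example", line)])
# ['this line is fine']

# 3. Identify particular strings, e.g. rough token boundaries.
print(re.findall(r"\w+|[^\w\s]", "Never give up!"))    # ['Never', 'give', 'up', '!']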
