
BCA & BSCIT SEM-6

SUBJECT: Machine Learning with Python

CH-4: Natural Language Processing
 What is NLP
Natural Language Processing (NLP) is a fascinating and rapidly evolving field that intersects computer science, artificial intelligence, and linguistics. NLP focuses on the interaction between computers and human language, enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful. With the increasing volume of text data generated every day, from social media posts to research articles, NLP has become an essential tool for extracting valuable insights and automating various tasks.

Text Preprocessing For NLP


Data preprocessing is an essential step for any machine learning model: how well the raw data has been cleaned and prepared plays a major role in the performance of the model. Likewise, in NLP the very first step is text preprocessing.

The various preprocessing steps that are involved are :

1. Lower Casing
2. Tokenization
3. Punctuation Mark Removal
4. Stop Word Removal
5. Stemming
6. Lemmatization

Let us explore them one at a time!

Text Pre-processing Using Lower Casing


It's quite evident from the name itself that we are trying to convert our text data into lower case. But why is this step needed?

When we have a text input, such as a paragraph, we find words in both lower and upper case. However, the same word written in different cases is treated as a different entity by the computer. For example, 'Girl' and 'girl' are considered two separate words by the computer even though they mean the same thing.

In order to resolve this issue, we must convert all the words to lower
case. This provides uniformity in the text.

sentence = "This text is used to demonstrate Text Preprocessing in


NLP."
sentence = sentence.lower()
print(sentence)

Output: this text is used to demonstrate text preprocessing in nlp.

Understand Tokenization In Text Pre-processing


The next text preprocessing step is Tokenization. Tokenization is the
process of breaking up the paragraph into smaller units such as
sentences or words. Each unit is then considered as an individual
token. The fundamental principle of Tokenization is to try to
understand the meaning of the text by analyzing the smaller units or
tokens that constitute the paragraph.

To do this, we shall use the NLTK library. NLTK (Natural Language Toolkit) is a Python library widely used for text preprocessing.

import nltk
nltk.download('punkt')

Sentence Tokenize

Now we shall take a paragraph as input and tokenize it into its constituent sentences. The result is a list stored in the variable 'sentences', containing each sentence of the paragraph. The length of the list gives us the total number of sentences.

import nltk
# nltk.download('punkt')

paragraph = "Linguistics is the scientific study of language. It encompasses the analysis of every aspect of language, as well as the methods for studying and modeling them. The traditional areas of linguistic analysis include phonetics, phonology, morphology, syntax, semantics, and pragmatics."

# Tokenize Sentences
sentences = nltk.sent_tokenize(paragraph.lower())
print("-------------------------------------")
print(sentences)
print("-------------------------------------")
print(len(sentences))
Word Tokenize

Similarly, we can also tokenize the paragraph into words. The result is
a list called ‘words’, containing each word of the paragraph. The length
of the list gives us the total number of words present in our paragraph.

# Tokenize Words
words = nltk.word_tokenize(paragraph.lower())
print(words)
print(len(words))

 Punctuation Mark Removal

This brings us to the next step. We must now remove the punctuation
marks from our list of words. Let us first display our original list of
words.

print(words)

Now we can remove all the punctuation marks from our list of words by keeping only the alphanumeric elements. This can be easily done in this manner:

new_words = [word for word in words if word.isalnum()]
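As a quick illustration (a minimal sketch with a made-up token list, not the paragraph used above), str.isalnum() keeps ordinary word tokens and drops pure punctuation tokens:

# Hypothetical token list for illustration only
tokens = ['linguistics', ',', 'the', 'study', 'of', 'language', '.']

# Keep only tokens made up entirely of letters/digits; ',' and '.' are dropped
filtered = [tok for tok in tokens if tok.isalnum()]
print(filtered)  # ['linguistics', 'the', 'study', 'of', 'language']

Note that this check also drops tokens containing apostrophes or hyphens (for example the "n't" token produced by word_tokenize), which may or may not be what you want.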

 Stop Word Removal

Have you ever observed that certain words pop up very frequently in
any language irrespective of what you are writing?

These words are our stop words!

Stop words are a collection of words that occur frequently in any language but do not add much meaning to the sentences. These are common words that are part of the grammar of any language, and every language has its own set of stop words. For example, some of the English stop words are "the", "he", "him", "his", "her", "herself", etc.

To do this, we will have to import stopwords from nltk.corpus

from nltk.corpus import stopwords


nltk.download('stopwords')


Once this is done, we can now display the stop words of any language
by simply using the following command and passing the language
name as a parameter.

print(stopwords.words("english"))

These are all the English stop words.

You can also get the stop words of other languages by simply changing the parameter. Have some fun and try passing "spanish" or "french" as the parameter!

Since these stop words do not add much value to the overall meaning
of the sentence, we can easily remove these words from our text data.
This helps in dimensionality reduction by eliminating unnecessary
information.
WordSet = []
for word in new_words:
    if word not in set(stopwords.words("english")):
        WordSet.append(word)
print(WordSet)
print(len(WordSet))

Output: 24

We observe that all the stop words have been successfully removed
from our set of words. On printing the length of our new word list we
see that the length is now 24 which is much less than our original word
list length which was 49. This shows how we can effectively reduce the
dimensionality of our text dataset by removal of stop words without
losing any vital information. This becomes extremely useful in the case
of large text datasets.

 Stemming

Now, what do you mean by stemming?

As the name suggests, stemming is the process of reducing a word to its root or stem word. The word affixes are removed, leaving behind only the root form or stem.

For example, the words "connecting", "connect", "connection", and "connects" are all reduced to the root form "connect". The words "studying", "studies", and "study" are all reduced to "studi".

Let us see how this can be done.

To do this we must first import PorterStemmer from nltk.stem and create an object of the PorterStemmer class.

from nltk.stem import PorterStemmer


ps = PorterStemmer()

After that, using the PorterStemmer object 'ps', we shall call the stem method to perform stemming on our word list.

WordSetStem = []
for word in WordSet:
    WordSetStem.append(ps.stem(word))
print(WordSetStem)

Output: ['linguist', 'scientif', 'studi', 'languag', 'encompass', 'analysi', 'everi', 'aspect', 'languag', 'well', 'method', 'studi', 'model', 'tradit', 'area', 'linguist', 'analysi', 'includ', 'phonet', 'phonolog', 'morpholog', 'syntax', 'semant', 'pragmat']

Carefully observe the result that we have obtained. All the words in our list have been reduced to their stem words. For example, "linguistics" has been reduced to "linguist", the word "scientific" has been reduced to "scientif", and so on.

Note: The word list obtained after performing stemming does not
always contain words that are a part of the English vocabulary. In our
example, words such as “scientif“, “studi“, “everi” are not proper
words, i.e. they do not make sense to us.

 Lemmatization

We have just seen, how we can reduce the words to their root words
using Stemming.

However, Stemming does not always result in words that are part of
the language vocabulary. It often results in words that have no
meaning to the users. In order to overcome this drawback, we shall
use the concept of Lemmatization.
Let’s dive into the code.

from nltk.stem import WordNetLemmatizer


lm= WordNetLemmatizer()
nltk.download('wordnet')


WordSetLem = []
for word in WordSet:
    WordSetLem.append(lm.lemmatize(word))
print(WordSetLem)

We see that the words in our list have been lemmatized. Each word
has been converted into a meaningful parent word.

Another key difference between stemming and lemmatization is that in the case of lemmatization we can pass a POS parameter. This is used to provide the context in which we wish to lemmatize our words by mentioning the part of speech (POS). If nothing is mentioned, the default is 'noun'.

Let’s see this in action!

When we do not pass any parameter, or specify pos as "n" (noun), we get the following output:

test = []
for word in ["changing", "dancing", "is", "was"]:
    test.append(lm.lemmatize(word, pos="n"))
print(test)

Output: [‘changing’, ‘dancing’, ‘is’, ‘wa’]

Here we see that "changing" and "dancing" remain unchanged after lemmatization. This is because they have been treated as nouns. Now let us change the part of speech to verb by specifying pos as "v".

test = []
for word in ["changing", "dancing", "is", "was"]:
    test.append(lm.lemmatize(word, pos="v"))
print(test)

Output: [‘change’, ‘dance’, ‘be’, ‘be’]

Now the words have been changed to their proper root words. The
words such as “is” and “was” have also been converted to “be“. Thus
we observe that we can accurately specify the context of
lemmatization by passing in the desired parts of speech in the
parameter of the lemmatize method.

 Chunking

Chunking is the process of extracting phrases from unstructured text by evaluating a sentence and determining its elements (noun groups, verbs, verb groups, etc.). However, it does not describe their internal structure or their role in the main sentence.

Recall from English grammar that there are eight parts of speech: noun, verb, adjective, adverb, preposition, conjunction, pronoun, and interjection. In the definition of chunking above, short phrases are phrases generated by combining any of these parts of speech.

Chunking can be used to identify and group noun phrases or nouns alone, adjectives or adjective phrases, and so on.

Consider the following sentence:

"I had my breakfast, lunch and dinner."

In this case, if we wish to group or chunk noun phrases, we will get "breakfast", "lunch", and "dinner", which are the nouns or noun groups of the sentence.
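This kind of phrase-level chunking can be sketched with NLTK's POS tagger and a regular-expression chunk grammar. The pattern below is only one illustrative way to define a noun phrase, and depending on your NLTK version the tagger resource may be named 'averaged_perceptron_tagger_eng' instead:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "I had my breakfast, lunch and dinner."

# Tokenize and POS-tag the sentence
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# A simple noun-phrase grammar: optional determiner/possessive pronoun,
# any number of adjectives, then one or more nouns
grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"
chunk_parser = nltk.RegexpParser(grammar)

# Parse into a chunk tree and print only the NP chunks
tree = chunk_parser.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(subtree)

On this sentence the NP chunks that come out are roughly "my breakfast", "lunch", and "dinner", matching the noun groups mentioned above (the exact grouping depends on the tagger and the grammar you choose).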

Why do we need Chunking?

It's important to understand that a statement may contain several different entities, such as a person, a date, and a location. Taken as isolated single words, these pieces of information are of limited use on their own.

Chunking can break down sentences into phrases that are more useful
than single words and provide meaningful outcomes.

Chunking is critical when extracting information from text, such as place and person names (entity extraction).

There are two types of chunking:

Chunking Up

Chunking Down

Implementation of chunking in Python

Chunking in Python can also refer to splitting a sequence or list into smaller, evenly sized chunks. This is often useful when dealing with large datasets or when processing data in batches.

def chunk_list(lst, chunk_size):
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
chunk_size = 3
chunks = chunk_list(my_list, chunk_size)
print(chunks)

Output:
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

The function chunk_list takes a list lst and a chunk_size as input and returns a list of lists, each containing up to chunk_size elements from the original list (the final chunk may be shorter).
The range() function is used to iterate over the indices of the original
list, and list slicing is used to extract chunks of the specified size.

Why is Chunking Important in NLP?


When working with large datasets, the need for chunking arises from
various considerations, including:

1. Memory Management: Large texts can overwhelm memory, especially in low-resource environments. Chunking helps break down data into manageable pieces.

2. Improved Model Efficiency: Smaller chunks reduce computational complexity, making it easier for models to process large volumes of data quickly.

3. Context Management: Proper chunking helps retain the context within each segment, which is essential for understanding relationships between parts of the text.

4. Better Retrieval and Search: Chunks make it easier to search for and retrieve specific pieces of information, boosting the accuracy of information retrieval tasks.

5. Parallel Processing: Smaller chunks can be processed independently, enabling parallelization, which speeds up model training and inference (see the sketch after this list).
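As a small sketch of the batch/parallel idea above, the chunks produced by chunk_list (the helper defined earlier) can be handed to a process pool so each piece is handled independently. The token list and the per-chunk task here are made up for illustration:

from concurrent.futures import ProcessPoolExecutor

def chunk_list(lst, chunk_size):
    # Same helper as above: split a list into evenly sized chunks
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def count_tokens(chunk):
    # Stand-in task: in practice this could be any per-chunk NLP step
    return len(chunk)

if __name__ == "__main__":
    tokens = ["token"] * 1000          # hypothetical tokenized corpus
    chunks = chunk_list(tokens, 100)   # 10 chunks of 100 tokens each

    # Each chunk is processed independently, so the work parallelizes cleanly
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(count_tokens, chunks))

    print(results)       # per-chunk counts
    print(sum(results))  # combined result across all chunks: 1000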

 Text classification in NLP

Text classification is the process of labeling or organizing text data into groups. It forms a fundamental part of Natural Language Processing. In the digital age we live in, we are surrounded by text on our social media accounts, in commercials, on websites, in e-books, etc. The majority of this text data is unstructured, so classifying it can be extremely useful.

Applications

Text classification has a wide array of applications. Some popular uses are:

 Spam detection in emails
 Sentiment analysis of online reviews
 Topic labeling of documents such as research papers
 Language detection like in Google Translate
 Age/gender identification of anonymous users
 Tagging online content
 Speech recognition used in virtual assistants like Siri and Alexa

Approaches

Text Classification can be achieved through three main approaches:

1. Rule-based approaches
These approaches make use of handcrafted linguistic rules to classify text. One way to group text is to create a list of words related to a certain category and then judge the text based on the occurrences of these words. For example, words like "fur", "feathers", "claws", and "scales" could help a zoologist identify texts talking about animals online. These approaches require a lot of domain knowledge to be extensive, take a lot of time to compile, and are difficult to scale.
2. Machine learning approaches
We can use machine learning to train models on large sets of text data to predict categories of new text. To train models, we need to transform text data into numerical data – this is known as feature extraction. Important feature extraction techniques include bag of words and n-grams (a short sketch of this approach follows this list).
There are several useful machine learning algorithms we can use for text classification. The most popular ones are:
o Naive Bayes classifiers
o Support vector machines
o Deep learning algorithms
3. Hybrid approaches
These approaches are a combination of the two algorithms
above. They make use of both rule-based and machine learning
techniques to model a classifier that can be fine-tuned in certain
scenarios.
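To make the machine learning approach concrete, here is a minimal sketch using scikit-learn (assumed to be installed). The tiny training set and its labels are invented purely for illustration; a bag-of-words feature extractor feeds a Naive Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training texts: animal-related vs. vehicle-related
train_texts = [
    "the cat has soft fur and sharp claws",
    "birds have feathers and can fly",
    "fish have scales and live in water",
    "the car has four wheels and an engine",
    "trains run on rails between stations",
    "planes have wings and jet engines",
]
train_labels = ["animal", "animal", "animal", "vehicle", "vehicle", "vehicle"]

# Bag-of-words features + Naive Bayes classifier in one pipeline;
# CountVectorizer(ngram_range=(1, 2)) would add bigram (n-gram) features
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict categories for unseen text
print(model.predict(["it has fur and claws"]))         # likely ['animal']
print(model.predict(["it has an engine and wheels"]))  # likely ['vehicle']

Swapping MultinomialNB for a support vector machine such as sklearn.svm.LinearSVC gives the second algorithm from the list above with the same pipeline.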
Reasons to consider text classification

Text classification can help you with:

Identifying problems users have with your product

Most customer service requests end up in a backlog while the product team is prioritizing new features. With a structured system to categorize requests, you'll have a better overview of the problems users are facing.

Recognizing user segments to improve your targeting

You may segment your audience depending on the words and phrases
they use, allowing you to develop more focused campaigns.

Getting ideas for new features

One of your users could tweet “If this product would have a logo
generation feature, it would be perfect for me.” This is valuable
feedback and you can leverage it to make your product more useful.

Analyzing data in real-time

Automated text classification can track your brand mentions in real time, allowing you to see timely posts and take immediate action.

Eliminating human error

Humans aren't machines, and they are prone to errors. Machine learning examines all data and outcomes through the same filter and parameters. Once correctly trained, a text classification model performs with consistent reliability.
