
CHAPTER 6

Text Preprocessing for Natural Language Data
This module provides a comprehensive overview of NLP and text
preprocessing techniques, complete with discussions, explanations, and practical
examples. Each step is crucial for preparing text data for analysis and improving
the performance of NLP models.

Learning Outcomes
At the end of this exercise, students should be able to:
1. Demonstrate an understanding of the terms and concepts
pertaining to natural language processing.
2. Implement text preprocessing techniques using Python libraries such
as the Natural Language Toolkit (NLTK).

Learning Content

Definition and Importance of NLP


Natural Language Processing (NLP) is a field of artificial intelligence that
focuses on the interaction between computers and humans using natural
language. The primary goal is to enable computers to understand, interpret, and
generate human language in a way that is meaningful and useful.
NLP is essential for developing applications that can understand and
respond to human language, making technology more accessible and intuitive.

Examples:
• Chatbots: Virtual assistants like Siri and Alexa that understand and respond
to voice commands.
• Sentiment Analysis: Analyzing customer reviews to determine whether they
are positive, negative, or neutral.
• Machine Translation: Translating text from one language to another, as
seen in Google Translate.

Applications of NLP
NLP is used in various domains such as healthcare, finance, customer
service, and more. Applications range from text classification, sentiment analysis,
machine translation, and information retrieval to more complex tasks like question
answering and summarization.

Examples:
• Healthcare: Analyzing patient records to extract relevant information for
diagnosis.
• Finance: Automatically categorizing transaction data for expense tracking.
• Customer Service: Implementing chatbots to handle customer inquiries.

Basic Concepts in NLP
• Tokens: The smallest units of text, such as words or punctuation marks.
• Corpora: Large collections of text data used for training NLP models.
• Syntax: The arrangement of words to form sentences.
• Semantics: The meaning of words and sentences.
• Pragmatics: The context in which language is used, affecting its
interpretation.
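
A tiny sketch tying a few of these terms together: tokens from word_tokenize, and part-of-speech tags as a rough proxy for syntax. The tags in the comment are illustrative; the exact output depends on the installed NLTK models.

import nltk
from nltk.tokenize import word_tokenize

# one-time downloads (resource names can vary slightly across NLTK versions)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "Dogs bark loudly."
tokens = word_tokenize(sentence)  # tokens: the smallest units of text
tagged = nltk.pos_tag(tokens)     # part-of-speech tags reflect syntax

print(tokens)  # ['Dogs', 'bark', 'loudly', '.']
print(tagged)  # e.g. [('Dogs', 'NNS'), ('bark', 'VBP'), ('loudly', 'RB'), ('.', '.')]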

Text Preprocessing Techniques


1. Tokenization
Tokenization is the process of breaking down text into smaller units
called tokens. This step is fundamental as it converts raw text into a
structured format that can be easily analyzed.
• Word Tokenization: Splitting text into individual words.
• Sentence Tokenization: Splitting text into individual sentences.

Example:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download('punkt')  # one-time download of the tokenizer models

text = "Natural Language Processing with Python. It's a powerful tool for text analysis."
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)

print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)

Output:
Word Tokens: ['Natural', 'Language', 'Processing', 'with', 'Python', '.', 'It', "'s", 'a', 'powerful', 'tool', 'for', 'text', 'analysis', '.']
Sentence Tokens: ['Natural Language Processing with Python.', "It's a powerful tool for text analysis."]

2. Normalization
Normalization involves transforming text into a standard format,
making it consistent and reducing variability.
• Lowercasing: Converting all characters to lowercase.
• Removing Punctuation: Eliminating punctuation marks to focus on
the textual content.
• Handling Special Characters: Removing or transforming special
characters such as hashtags and mentions (a sketch follows the
output below).

Example:
import re

text = "Natural Language Processing with Python!"

text_lower = text.lower()                        # lowercase everything
text_clean = re.sub(r'[^\w\s]', '', text_lower)  # strip punctuation marks

print("Lowercased Text:", text_lower)
print("Cleaned Text:", text_clean)

Output:
Lowercased Text: natural language processing with python!
Cleaned Text: natural language processing with python
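
The special-character bullet can be handled the same way. A minimal sketch, assuming tweet-style input; the URL and handle are made up for illustration:

import re

tweet = "Loving #NLP with @python_team! Visit https://example.com"

tweet = re.sub(r'https?://\S+', '', tweet)  # drop URLs
tweet = re.sub(r'[@#]\w+', '', tweet)       # drop @mentions and #hashtags
tweet = ' '.join(tweet.split())             # collapse leftover whitespace

print(tweet)  # Loving with ! Visit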

3. Stop Word Removal


Stop words are common words that typically do not add significant
meaning to text and can be removed to reduce noise.
Examples of stop words include 'is', 'and', 'the', 'in', etc.

Example:
from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time download of the stop word lists

stop_words = set(stopwords.words('english'))
word_tokens = ["Natural", "Language", "Processing", "with", "Python"]
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)

Output:
Filtered Words: ['Natural', 'Language', 'Processing', 'Python']

4. Stemming and Lemmatization


a. Stemming:
Stemming reduces words to their root form, which might not
be a real word but is sufficient for certain text processing tasks.
Common stemming algorithms include the Porter, Snowball, and
Lancaster stemmers; the latter two are compared in a sketch after the
example below.

Example:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "runs", "runner", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]

print("Stemmed Words:", stemmed_words)

Output:
Stemmed Words: ['run', 'run', 'runner', 'easili', 'fairli']
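
The Snowball and Lancaster stemmers mentioned above are also available in NLTK; a quick comparison sketch:

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")  # a.k.a. Porter2, slightly gentler than Porter
lancaster = LancasterStemmer()         # the most aggressive of the three

words = ["running", "runs", "easily", "fairly"]
print("Snowball:", [snowball.stem(w) for w in words])
print("Lancaster:", [lancaster.stem(w) for w in words])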

b. Lemmatization:
Lemmatization reduces words to their base or dictionary form,
known as the lemma, which is a real word. It considers the context and
grammatical role of the word.
Tools for lemmatization include the WordNet Lemmatizer in NLTK
and spaCy's lemmatizer; a short spaCy sketch follows the output below.

Example:
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "runner", "easily", "fairly"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # 'v' indicates verb

print("Lemmatized Words:", lemmatized_words)

Output:
Lemmatized Words: ['run', 'run', 'runner', 'easily', 'fairly']
(Note that 'runner' is a noun, so lemmatizing it as a verb leaves it unchanged.)
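
For comparison, a minimal spaCy sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("running runs runner easily fairly")

# spaCy infers each token's part of speech from context, so no pos= argument is needed
print([token.lemma_ for token in doc])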

5. Handling Numerical Data


Depending on the context, numerical data can be removed if
irrelevant or retained if it carries significant information. Special
consideration is given to dates, quantities, and other relevant numeric
data.

Example:
import re

text = "In 2020, the revenue was $5 million."

text_no_numbers = re.sub(r'\d+', '', text)  # remove every run of digits

print("Text without Numbers:", text_no_numbers)

Output:
Text without Numbers: In , the revenue was $ million.
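
When numbers do carry information (dates, amounts), a common alternative is to replace them with a placeholder token instead of deleting them. A minimal sketch; the <NUM> token is an arbitrary choice:

import re

text = "In 2020, the revenue was $5 million."

text_placeholder = re.sub(r'\d+', '<NUM>', text)  # keep a marker where each number was

print(text_placeholder)  # In <NUM>, the revenue was $<NUM> million.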

6. Text Cleaning
Text cleaning involves removing unnecessary characters and
formatting to reduce noise.
• Removing HTML Tags: Useful for web-scraped text.
• Removing Whitespace: Trimming excessive spaces, tabs, and
newlines.
Example:
from bs4 import BeautifulSoup

raw_html = "<html><body><p>Natural Language Processing with Python.</p></body></html>"

clean_text = BeautifulSoup(raw_html, "html.parser").get_text()  # strip the HTML tags
clean_text = ' '.join(clean_text.split())                       # collapse extra whitespace

print("Cleaned Text:", clean_text)

Output:
Cleaned Text: Natural Language Processing with Python.

7. Handling Misspelled Words
Misspelled words can affect the quality of text analysis and should
be corrected. Libraries like TextBlob and pyspellchecker can be used for
spell correction.

Example:
from textblob import TextBlob

text = "Natural Langage Processing with Pyhton is intresting."

corrected_text = str(TextBlob(text).correct())  # correct() returns a TextBlob

print("Corrected Text:", corrected_text)

Output:
Corrected Text: Natural Language Processing with Python is interesting.
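
The same correction with pyspellchecker (installed via pip install pyspellchecker); a minimal sketch:

from spellchecker import SpellChecker

spell = SpellChecker()
words = "Natural Langage Processing with Pyhton is intresting".split()

# correction() returns the most likely fix, or None if the word is unknown
corrected = [spell.correction(w) or w for w in words]
print(" ".join(corrected))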

8. Text Augmentation Techniques


Text augmentation involves generating variations of the text to
enhance model training and robustness.
• Synonym Replacement: Replacing words with their synonyms to
create diverse text.
• Back Translation: Translating text to another language and back to
the original to generate variations.

Example:
from nltk.corpus import wordnet

# nltk.download('wordnet')  # one-time download of the WordNet data

def get_synonyms(word):
    # collect the lemma names from every WordNet synset of the word
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return list(synonyms)

word = "interesting"
synonyms = get_synonyms(word)
print("Synonyms for 'interesting':", synonyms)

Output (order may vary, since Python sets are unordered):
Synonyms for 'interesting': ['interest', 'interesting', 'matter_to', 'concern', 'occupy', 'worry']
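
Building on get_synonyms above, a minimal synonym-replacement sketch; the replace_with_synonym helper is illustrative, not a standard API:

import random

def replace_with_synonym(tokens, index):
    # swap the token at `index` for a random WordNet synonym, if one exists
    candidates = [s for s in get_synonyms(tokens[index]) if s != tokens[index]]
    if candidates:
        tokens = tokens[:index] + [random.choice(candidates)] + tokens[index + 1:]
    return tokens

sentence = ["NLP", "is", "interesting"]
print(replace_with_synonym(sentence, 2))  # e.g. ['NLP', 'is', 'occupy']

As the example output hints, raw WordNet lemmas do not always fit the sentence grammatically, which is why synonym replacement is usually combined with some filtering.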

9. Vectorization Techniques
Vectorization converts text into numerical representations that can
be used by machine learning models.
• Bag of Words (BoW): Represents text as a collection of word counts
(see the sketch after this list).
• TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word
counts based on their importance in the document.
• Word Embeddings: Dense vector representations of words (e.g.,
Word2Vec, GloVe); a gensim sketch follows the TF-IDF example below.

• Advanced Embeddings: Contextual embeddings like BERT, ELMo.
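
Before the TF-IDF example, a minimal Bag-of-Words sketch using scikit-learn's CountVectorizer on the same three documents:

from sklearn.feature_extraction.text import CountVectorizer

documents = ["Natural Language Processing with Python.",
             "Python for Data Science.",
             "Text Processing and Analysis."]

vectorizer = CountVectorizer()           # raw word counts per document
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is a document, each column a vocabulary word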

Example (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Natural Language Processing with Python.",
             "Python for Data Science.",
             "Text Processing and Analysis."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print("TF-IDF Matrix:\n", X.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())

Output (values rounded to four decimal places):
TF-IDF Matrix:
[[0.     0.     0.     0.     0.4905 0.4905 0.373  0.373  0.     0.     0.4905]
 [0.     0.     0.5286 0.5286 0.     0.     0.     0.402  0.5286 0.     0.    ]
 [0.5286 0.5286 0.     0.     0.     0.     0.402  0.     0.     0.5286 0.    ]]
Feature Names: ['analysis' 'and' 'data' 'for' 'language' 'natural' 'processing' 'python' 'science' 'text' 'with']
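
For the word-embeddings bullet, a minimal gensim Word2Vec sketch; the three-sentence corpus is a toy, and useful embeddings require far more text:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["python", "for", "text", "processing"],
             ["text", "analysis", "with", "python"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

vector = model.wv["processing"]  # a 50-dimensional dense vector
print(vector.shape)              # (50,)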
