
CHAPTER 6

Text Preprocessing for Natural Language Data
This module provides a comprehensive overview of NLP and text
preprocessing techniques, complete with discussions, explanations, and practical
examples. Each step is crucial for preparing text data for analysis and improving
the performance of NLP models.

Learning Outcomes
At the end of this exercise, students should be able to:
1. Demonstrate an understanding of the terms and concepts
pertaining to natural language processing.
2. Implement text preprocessing techniques using Python libraries such
as the Natural Language Toolkit (NLTK).

Learning Content

Definition and Importance of NLP


Natural Language Processing (NLP) is a field of artificial intelligence that
focuses on the interaction between computers and humans using natural
language. The primary goal is to enable computers to understand, interpret, and
generate human language in a way that is meaningful and useful.
NLP is essential for developing applications that can understand and
respond to human language, making technology more accessible and intuitive.

Examples:
• Chatbots: Virtual assistants like Siri and Alexa that understand and respond
to voice commands.
• Sentiment Analysis: Analyzing customer reviews to determine whether they
are positive, negative, or neutral.
• Machine Translation: Translating text from one language to another, as
seen in Google Translate.

Applications of NLP
NLP is used in various domains such as healthcare, finance, customer
service, and more. Applications range from text classification, sentiment analysis,
machine translation, and information retrieval to more complex tasks like question
answering and summarization.

Examples:
• Healthcare: Analyzing patient records to extract relevant information for
diagnosis.
• Finance: Automatically categorizing transaction data for expense tracking.
• Customer Service: Implementing chatbots to handle customer inquiries.

Basic Concepts in NLP
• Tokens: The smallest units of text, such as words or punctuation marks.
• Corpora: Large collections of text data used for training NLP models.
• Syntax: The arrangement of words to form sentences.
• Semantics: The meaning of words and sentences.
• Pragmatics: The context in which language is used, affecting its
interpretation.
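
A tiny sketch tying a few of these terms together: tokens from word_tokenize, and part-of-speech tags as a rough proxy for syntax. The tags in the comment are illustrative; the exact output depends on the installed NLTK models.

import nltk
from nltk.tokenize import word_tokenize

# one-time downloads (resource names can vary slightly across NLTK versions)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "Dogs bark loudly."
tokens = word_tokenize(sentence)  # tokens: the smallest units of text
tagged = nltk.pos_tag(tokens)     # part-of-speech tags reflect syntax

print(tokens)  # ['Dogs', 'bark', 'loudly', '.']
print(tagged)  # e.g. [('Dogs', 'NNS'), ('bark', 'VBP'), ('loudly', 'RB'), ('.', '.')]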

Text Preprocessing Techniques


1. Tokenization
Tokenization is the process of breaking down text into smaller units
called tokens. This step is fundamental as it converts raw text into a
structured format that can be easily analyzed.
• Word Tokenization: Splitting text into individual words.
• Sentence Tokenization: Splitting text into individual sentences.

Example:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download('punkt')  # one-time download of the tokenizer models

text = "Natural Language Processing with Python. It's a powerful tool for text analysis."
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)

print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)

Output:
Word Tokens: ['Natural', 'Language', 'Processing', 'with', 'Python', '.', 'It', "'s", 'a', 'powerful', 'tool', 'for', 'text', 'analysis', '.']
Sentence Tokens: ['Natural Language Processing with Python.', "It's a powerful tool for text analysis."]

2. Normalization
Normalization involves transforming text into a standard format,
making it consistent and reducing variability.
• Lowercasing: Converting all characters to lowercase.
• Removing Punctuation: Eliminating punctuation marks to focus on
the textual content.
• Handling Special Characters: Removing or transforming special
characters such as hashtags and mentions (a sketch follows the
output below).

Example:
import re

text = "Natural Language Processing with Python!"

text_lower = text.lower()                        # lowercase everything
text_clean = re.sub(r'[^\w\s]', '', text_lower)  # strip punctuation marks

print("Lowercased Text:", text_lower)
print("Cleaned Text:", text_clean)

Output:
Lowercased Text: natural language processing with python!
Cleaned Text: natural language processing with python
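
The special-character bullet can be handled the same way. A minimal sketch, assuming tweet-style input; the URL and handle are made up for illustration:

import re

tweet = "Loving #NLP with @python_team! Visit https://example.com"

tweet = re.sub(r'https?://\S+', '', tweet)  # drop URLs
tweet = re.sub(r'[@#]\w+', '', tweet)       # drop @mentions and #hashtags
tweet = ' '.join(tweet.split())             # collapse leftover whitespace

print(tweet)  # Loving with ! Visit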

3. Stop Word Removal


Stop words are common words that typically do not add significant
meaning to text and can be removed to reduce noise.
Examples of stop words include 'is', 'and', 'the', 'in', etc.

Example:
from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time download of the stop word lists

stop_words = set(stopwords.words('english'))
word_tokens = ["Natural", "Language", "Processing", "with", "Python"]
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)

Output:
Filtered Words: ['Natural', 'Language', 'Processing', 'Python']

4. Stemming and Lemmatization


a. Stemming:
Stemming reduces words to their root form, which might not
be a real word but is sufficient for certain text processing tasks.
Common stemming algorithms include the Porter, Snowball, and
Lancaster stemmers; the latter two are compared in a sketch after the
example below.

Example:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "runs", "runner", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]

print("Stemmed Words:", stemmed_words)

Output:
Stemmed Words: ['run', 'run', 'runner', 'easili', 'fairli']
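
The Snowball and Lancaster stemmers mentioned above are also available in NLTK; a quick comparison sketch:

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")  # a.k.a. Porter2, slightly gentler than Porter
lancaster = LancasterStemmer()         # the most aggressive of the three

words = ["running", "runs", "easily", "fairly"]
print("Snowball:", [snowball.stem(w) for w in words])
print("Lancaster:", [lancaster.stem(w) for w in words])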

b. Lemmatization:
Lemmatization reduces words to their base or dictionary form,
known as the lemma, which is a real word. It considers the context and
grammatical role of the word.
Tools for lemmatization include the WordNet Lemmatizer in NLTK
and spaCy's lemmatizer; a short spaCy sketch follows the output below.

Example:
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "runner", "easily", "fairly"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # 'v' indicates verb

print("Lemmatized Words:", lemmatized_words)

Output:
Lemmatized Words: ['run', 'run', 'runner', 'easily', 'fairly']
(Note that 'runner' is a noun, so lemmatizing it as a verb leaves it unchanged.)
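
For comparison, a minimal spaCy sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("running runs runner easily fairly")

# spaCy infers each token's part of speech from context, so no pos= argument is needed
print([token.lemma_ for token in doc])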

5. Handling Numerical Data


Depending on the context, numerical data can be removed if
irrelevant or retained if it carries significant information. Special
consideration is given to dates, quantities, and other relevant numeric
data.

Example:
import re

text = "In 2020, the revenue was $5 million."

text_no_numbers = re.sub(r'\d+', '', text)  # remove every run of digits

print("Text without Numbers:", text_no_numbers)

Output:
Text without Numbers: In , the revenue was $ million.
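
When numbers do carry information (dates, amounts), a common alternative is to replace them with a placeholder token instead of deleting them. A minimal sketch; the <NUM> token is an arbitrary choice:

import re

text = "In 2020, the revenue was $5 million."

text_placeholder = re.sub(r'\d+', '<NUM>', text)  # keep a marker where each number was

print(text_placeholder)  # In <NUM>, the revenue was $<NUM> million.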

6. Text Cleaning
Text cleaning involves removing unnecessary characters and
formatting to reduce noise.
• Removing HTML Tags: Useful for web-scraped text.
• Removing Whitespace: Trimming excessive spaces, tabs, and
newlines.
Example:
from bs4 import BeautifulSoup

raw_html = "<html><body><p>Natural Language Processing with Python.</p></body></html>"

clean_text = BeautifulSoup(raw_html, "html.parser").get_text()  # strip the HTML tags
clean_text = ' '.join(clean_text.split())                       # collapse extra whitespace

print("Cleaned Text:", clean_text)

Output:
Cleaned Text: Natural Language Processing with Python.

7. Handling Misspelled Words
Misspelled words can affect the quality of text analysis and should
be corrected. Libraries like TextBlob and pyspellchecker can be used for
spell correction.

Example:
from textblob import TextBlob

text = "Natural Langage Processing with Pyhton is intresting."

corrected_text = str(TextBlob(text).correct())  # correct() returns a TextBlob

print("Corrected Text:", corrected_text)

Output:
Corrected Text: Natural Language Processing with Python is interesting.
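
The same correction with pyspellchecker (installed via pip install pyspellchecker); a minimal sketch:

from spellchecker import SpellChecker

spell = SpellChecker()
words = "Natural Langage Processing with Pyhton is intresting".split()

# correction() returns the most likely fix, or None if the word is unknown
corrected = [spell.correction(w) or w for w in words]
print(" ".join(corrected))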

8. Text Augmentation Techniques


Text augmentation involves generating variations of the text to
enhance model training and robustness.
• Synonym Replacement: Replacing words with their synonyms to
create diverse text.
• Back Translation: Translating text to another language and back to
the original to generate variations.

Example:
from nltk.corpus import wordnet

# nltk.download('wordnet')  # one-time download of the WordNet data

def get_synonyms(word):
    # collect the lemma names from every WordNet synset of the word
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return list(synonyms)

word = "interesting"
synonyms = get_synonyms(word)
print("Synonyms for 'interesting':", synonyms)

Output (order may vary, since Python sets are unordered):
Synonyms for 'interesting': ['interest', 'interesting', 'matter_to', 'concern', 'occupy', 'worry']
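
Building on get_synonyms above, a minimal synonym-replacement sketch; the replace_with_synonym helper is illustrative, not a standard API:

import random

def replace_with_synonym(tokens, index):
    # swap the token at `index` for a random WordNet synonym, if one exists
    candidates = [s for s in get_synonyms(tokens[index]) if s != tokens[index]]
    if candidates:
        tokens = tokens[:index] + [random.choice(candidates)] + tokens[index + 1:]
    return tokens

sentence = ["NLP", "is", "interesting"]
print(replace_with_synonym(sentence, 2))  # e.g. ['NLP', 'is', 'occupy']

As the example output hints, raw WordNet lemmas do not always fit the sentence grammatically, which is why synonym replacement is usually combined with some filtering.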

9. Vectorization Techniques
Vectorization converts text into numerical representations that can
be used by machine learning models.
• Bag of Words (BoW): Represents text as a collection of word counts
(see the sketch after this list).
• TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word
counts based on their importance in the document.
• Word Embeddings: Dense vector representations of words (e.g.,
Word2Vec, GloVe); a gensim sketch follows the TF-IDF example below.

• Advanced Embeddings: Contextual embeddings like BERT, ELMo.
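
Before the TF-IDF example, a minimal Bag-of-Words sketch using scikit-learn's CountVectorizer on the same three documents:

from sklearn.feature_extraction.text import CountVectorizer

documents = ["Natural Language Processing with Python.",
             "Python for Data Science.",
             "Text Processing and Analysis."]

vectorizer = CountVectorizer()           # raw word counts per document
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is a document, each column a vocabulary word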

Example (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Natural Language Processing with Python.",
             "Python for Data Science.",
             "Text Processing and Analysis."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print("TF-IDF Matrix:\n", X.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())

Output (values rounded to four decimal places):
TF-IDF Matrix:
[[0.     0.     0.     0.     0.4905 0.4905 0.373  0.373  0.     0.     0.4905]
 [0.     0.     0.5286 0.5286 0.     0.     0.     0.402  0.5286 0.     0.    ]
 [0.5286 0.5286 0.     0.     0.     0.     0.402  0.     0.     0.5286 0.    ]]
Feature Names: ['analysis' 'and' 'data' 'for' 'language' 'natural' 'processing' 'python' 'science' 'text' 'with']
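
For the word-embeddings bullet, a minimal gensim Word2Vec sketch; the three-sentence corpus is a toy, and useful embeddings require far more text:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["python", "for", "text", "processing"],
             ["text", "analysis", "with", "python"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

vector = model.wv["processing"]  # a 50-dimensional dense vector
print(vector.shape)              # (50,)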
