CSDM2-Text Preprocessing For NL Data - 011050
Learning Outcomes
At the end of this exercise, students should be able to:
1. Demonstrate an understanding of the terms and concepts
pertaining to natural language processing.
2. Implement text preprocessing techniques using Python libraries such
as the Natural Language Toolkit (NLTK).
Learning Content
Introduction to NLP
Natural language processing (NLP) is a field of artificial intelligence
that enables computers to understand, interpret, and generate human
language.
Examples:
• Chatbots: Virtual assistants like Siri and Alexa that understand and respond
to voice commands.
• Sentiment Analysis: Analyzing customer reviews to determine whether they
are positive, negative, or neutral.
• Machine Translation: Translating text from one language to another, as
seen in Google Translate.
Applications of NLP
NLP is used in various domains such as healthcare, finance, customer
service, and more. Applications range from text classification, sentiment analysis,
machine translation, and information retrieval to more complex tasks like question
answering and summarization.
Examples:
• Healthcare: Analyzing patient records to extract relevant information for
diagnosis.
• Finance: Automatically categorizing transaction data for expense tracking.
• Customer Service: Implementing chatbots to handle customer inquiries.
Basic Concepts in NLP
• Tokens: The smallest units of text, such as words or punctuation marks.
• Corpora: Large collections of text data used for training NLP models.
• Syntax: The arrangement of words to form sentences.
• Semantics: The meaning of words and sentences.
• Pragmatics: The context in which language is used, affecting its
interpretation.
1. Tokenization
Tokenization splits text into smaller units called tokens, such as
words or sentences.
Example:
from nltk.tokenize import word_tokenize, sent_tokenize
# requires: nltk.download('punkt')
text = "Natural Language Processing with Python. It's a powerful tool for text analysis."
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)
print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)
Output:
Word Tokens: ['Natural', 'Language', 'Processing', 'with', 'Python', '.', 'It', "'s", 'a', 'powerful', 'tool', 'for', 'text', 'analysis', '.']
Sentence Tokens: ['Natural Language Processing with Python.', "It's a powerful tool for text analysis."]
2. Normalization
Normalization involves transforming text into a standard format,
making it consistent and reducing variability.
• Lowercasing: Converting all characters to lowercase.
• Removing Punctuation: Eliminating punctuation marks to focus on
the textual content.
• Handling Special Characters: Removing or transforming special
characters such as hashtags, mentions, and URLs (see the sketch after
the example below).
Example:
import re
text = "Natural Language Processing with Python!"
# Lowercasing
text_lower = text.lower()
print("Lowercased Text:", text_lower)
# Removing punctuation
text_clean = re.sub(r'[^\w\s]', '', text_lower)
print("Cleaned Text:", text_clean)
Output:
Lowercased Text: natural language processing with python!
Cleaned Text: natural language processing with python
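The special-character bullet can be handled with a regular expression; a
minimal sketch, where the sample tweet and the pattern are illustrative
assumptions rather than part of the original handout:
import re
tweet = "Loving #NLP with @python_user, see https://example.com"
# Strip hashtags, mentions, and URLs, then collapse leftover whitespace
tweet_clean = re.sub(r'#\w+|@\w+|https?://\S+', '', tweet)
print("Without Special Characters:", " ".join(tweet_clean.split()))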
3. Stopword Removal
Stopwords are common words (such as "the", "is", and "with") that
carry little meaning and are often removed before analysis.
Example:
from nltk.corpus import stopwords
# requires: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = ["Natural", "Language", "Processing", "with", "Python"]
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Output:
Filtered Words: ['Natural', 'Language', 'Processing', 'Python']
4. Stemming and Lemmatization
a. Stemming:
Stemming reduces words to their root form by stripping affixes;
the result is not always a real dictionary word.
Example:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "runs", "runner", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)
Output:
Stemmed Words: ['run', 'run', 'runner', 'easili', 'fairli']
b. Lemmatization:
Lemmatization reduces words to their base or dictionary form,
known as the lemma, which is a real word. It considers the context and
grammatical role of the word.
Tools for lemmatization include the WordNet Lemmatizer in NLTK
and spaCy's lemmatizer (a spaCy sketch follows the example below).
Example:
from nltk.stem import WordNetLemmatizer
# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "runner", "easily", "fairly"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # 'v' indicates verb
print("Lemmatized Words:", lemmatized_words)
Output:
Lemmatized Words: ['run', 'run', 'runner', 'easily', 'fairly']
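Since spaCy's lemmatizer is mentioned above, here is a minimal sketch; it
assumes the small English model has been installed with
python -m spacy download en_core_web_sm:
import spacy
# Assumes the en_core_web_sm model is installed
nlp = spacy.load("en_core_web_sm")
doc = nlp("running runs runner easily fairly")
print("spaCy Lemmas:", [token.lemma_ for token in doc])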
5. Removing Numbers
Numbers are often removed when they are not relevant to the
analysis.
Example:
import re
# The original figures were not preserved; the values below are illustrative
text = "In 2023, the revenue was $5 million."
text_no_numbers = re.sub(r'\d+', '', text)
print("Text without Numbers:", text_no_numbers)
Output:
Text without Numbers: In , the revenue was $ million.
6. Text Cleaning
Text cleaning involves removing unnecessary characters and
formatting to reduce noise.
• Removing HTML Tags: Useful for web-scraped text.
• Removing Whitespace: Trimming excessive spaces, tabs, and
newlines (see the sketch after the example below).
Example:
from bs4 import BeautifulSoup
# The original HTML snippet was not preserved; this markup is illustrative
html = "<p>Natural Language Processing with <b>Python</b>.</p>"
text_clean = BeautifulSoup(html, "html.parser").get_text()
print("Cleaned Text:", text_clean)
Output:
Cleaned Text: Natural Language Processing with Python.
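For the whitespace bullet, a minimal sketch using Python's built-in string
methods (the sample string is an illustrative assumption):
text = "  Natural   Language\tProcessing \n with Python.  "
# split() breaks on any run of whitespace; join() rebuilds with single spaces
text_clean = " ".join(text.split())
print("Trimmed Text:", text_clean)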
7. Handling Misspelled Words
Misspelled words can degrade the quality of text analysis and should
be corrected. Libraries like TextBlob and pyspellchecker can be used for
spell correction (a pyspellchecker sketch follows the example below).
Example:
from textblob import TextBlob
# The original misspelled sentence was not preserved; this input is
# illustrative, and correction quality depends on TextBlob's dictionary
text = "Natural Languag Processing with Python is intresting."
corrected = str(TextBlob(text).correct())
print("Corrected Text:", corrected)
Output:
Corrected Text: Natural Language Processing with Python is interesting.
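pyspellchecker works at the word level; a minimal sketch, assuming the
package is installed (pip install pyspellchecker):
from spellchecker import SpellChecker
spell = SpellChecker()
words = ["natural", "languag", "processing"]  # illustrative input
misspelled = spell.unknown(words)  # words not found in the dictionary
for word in misspelled:
    print(word, "->", spell.correction(word))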
8. Handling Synonyms
Grouping or replacing synonyms can reduce vocabulary size and make
text more consistent. WordNet can be used to look up synonyms.
Example:
from nltk.corpus import wordnet
# requires: nltk.download('wordnet')
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return list(synonyms)

word = "interesting"
synonyms = get_synonyms(word)
print("Synonyms for 'interesting':", synonyms)
Output:
Synonyms for 'interesting': ['interest', 'interestingly', 'matter_to', 'fascinating', 'absorbing', 'engaging']
9. Vectorization Techniques
Vectorization converts text into numerical representations that can
be used by machine learning models.
• Bag of Words (BoW): Represents text as a collection of word counts.
• TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word
counts based on their importance in the document.
• Word Embeddings: Dense vector representations of words (e.g.,
Word2Vec, GloVe); see the sketch after this list.
• Advanced Embeddings: Contextual embeddings like BERT, ELMo.
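For the word-embedding bullet, a minimal Word2Vec sketch using gensim
(the library choice and the toy corpus are assumptions; the handout does
not name a specific tool):
from gensim.models import Word2Vec
# Tiny toy corpus; real embeddings need far more text to be meaningful
sentences = [["natural", "language", "processing"],
             ["text", "analysis", "with", "python"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)
print("Vector for 'python':", model.wv["python"][:5])  # first 5 dimensions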
Example (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer
# The original document list was not preserved; these strings are
# illustrative, so running them gives different numbers than shown below
documents = ["Natural language processing with Python",
             "Python is great for text analysis",
             "Text analysis with natural language"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print("TF-IDF Matrix:")
print(X.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())
Output:
TF-IDF Matrix:
[[0.         0.46979108 0.58028582 0.46979108 0.         0.46979108 0.35872874]
 [0.         0.         0.         0.         0.         0.70710678 0.70710678]
 [0.50709255 0.         0.62559262 0.         0.62559262 0.         0.        ]]
Feature Names: ['analysis' 'language' 'natural' 'processing' 'text' 'with' 'python']
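For comparison, the Bag-of-Words bullet can be sketched with scikit-learn's
CountVectorizer, which produces raw counts instead of weighted scores (the
two-document corpus is an illustrative assumption):
from sklearn.feature_extraction.text import CountVectorizer
documents = ["Natural language processing with Python",
             "Python is great for text analysis"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print("BoW Matrix:")
print(X.toarray())  # each row is a document, each column a word count
print("Feature Names:", vectorizer.get_feature_names_out())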