

Text Preprocessing with spaCy and nltk


This lab program demonstrates how to preprocess text using two powerful libraries:
spaCy and nltk. Text preprocessing is a crucial step in Natural Language Processing
(NLP) pipelines. It involves cleaning and preparing text data for analysis or model
training.

Steps Covered
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Lemmatization
5. Stemming (nltk)

Let's get started!

Install and Import Libraries

In [ ]: !pip install spacy nltk -q

# Download spaCy model and nltk resources
!python -m spacy download en_core_web_sm -q
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Import libraries
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

Text Dataset
For this lab, we'll use a small dataset of sentences that simulate real-world text data.
You can replace this with any dataset of your choice.

In [2]: # Define a sample text dataset
text = """
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
It focuses on enabling computers to understand, interpret, and respond to human language.
With the rise of large language models, the scope of NLP has expanded significantly.
"""

Tokenization
Tokenization splits raw text into smaller units such as sentences and words.

In [ ]: print("Tokenization with nltk:")


sentences_nltk = sent_tokenize(text)
print("Sentences:", sentences_nltk)

words_nltk = word_tokenize(text)
print("\nWords:", words_nltk)

# Tokenization with spaCy


nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print("\nTokenization with spaCy:")


print("Tokens:", [token.text for token in doc])

Lowercasing
Lowercasing converts all text to lowercase so that variants such as "The" and "the" are treated as the same token.

In [ ]: # Lowercasing with nltk
lowercased_words_nltk = [word.lower() for word in words_nltk]
print("Lowercased Words (nltk):", lowercased_words_nltk)

# Lowercasing with spaCy
lowercased_words_spacy = [token.text.lower() for token in doc]
print("Lowercased Words (spaCy):", lowercased_words_spacy)

Stopword Removal
Stopwords are common words (like "the", "is", "and") that add little meaning to text
and can be removed.

In [ ]: # Stopword removal with nltk
stop_words = set(stopwords.words("english"))
filtered_words_nltk = [word for word in words_nltk if word.lower() not in stop_words]
print("Filtered Words (nltk):", filtered_words_nltk)

# Stopword removal with spaCy
filtered_words_spacy = [token.text for token in doc if not token.is_stop]
print("Filtered Words (spaCy):", filtered_words_spacy)

Lemmatization

Lemmatization reduces words to their base or root form (e.g., "running" becomes
"run").

In [ ]: # Lemmatization with nltk
lemmatizer = WordNetLemmatizer()
lemmatized_words_nltk = [lemmatizer.lemmatize(word) for word in filtered_words_nltk]
print("Lemmatized Words (nltk):", lemmatized_words_nltk)

# Lemmatization with spaCy
lemmatized_words_spacy = [token.lemma_ for token in doc if not token.is_stop]
print("Lemmatized Words (spaCy):", lemmatized_words_spacy)

Stemming
Stemming reduces words to their root form by chopping off suffixes. spaCy does not ship a stemmer, so this step uses nltk only.

In [ ]: stemmer = PorterStemmer()
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_words_nltk]
print("Stemmed Words (nltk):", stemmed_words_nltk)

Conclusion
In this lab, we explored various text preprocessing steps using nltk and spaCy. These
steps are foundational for any NLP task and play a vital role in improving the
performance of machine learning models in NLP. Feel free to experiment with
different datasets and observe the results!
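
One way to consolidate the steps above is a single reusable function. The sketch below is our own arrangement (spaCy-only, with punctuation and whitespace filtering added), not a prescribed part of the lab:

In [ ]: def preprocess(raw_text):
    """Tokenize, remove stopwords/punctuation, lemmatize, and lowercase with spaCy."""
    doc = nlp(raw_text)
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]

print(preprocess("The striped bats were hanging on their feet."))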

Key Takeaways
nltk and spaCy provide powerful tools for text preprocessing.
Both libraries have unique strengths, with nltk offering traditional NLP tools and
spaCy excelling in modern NLP pipelines.
