Text Preprocessing
Steps Covered
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Lemmatization
5. Stemming (nltk)
# Import libraries
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Fetch the nltk resources used below ("punkt_tab" is needed on newer nltk versions)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource)
Text Dataset
For this lab, we'll use a small dataset of sentences that simulate real-world text data.
You can replace this with any dataset of your choice.
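A minimal placeholder cell is below; the sample sentences are an assumption, since the original dataset cell did not survive the export.

# Small sample text; swap in any dataset of your choice
text = ("Text preprocessing is an important step in NLP. "
        "It involves cleaning and preparing raw text for analysis. "
        "The models were running smoothly on the cleaned sentences.")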
Tokenization
Tokenization splits raw text into smaller units, such as words or sentences.
# Tokenize the text into individual word tokens
words_nltk = word_tokenize(text)
print("\nWords:", words_nltk)
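Since sent_tokenize is imported alongside word_tokenize, a sentence-level pass works the same way (a small sketch, not part of the original notebook):

# Split the same text into sentences
sentences_nltk = sent_tokenize(text)
print("Sentences:", sentences_nltk)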
Lowercasing
Lowercasing converts all text to lowercase, which helps in standardising text.
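A one-line sketch over the word tokens from the previous step (the words_lower name is our own):

# Map every token to lowercase so "The" and "the" compare equal
words_lower = [word.lower() for word in words_nltk]
print("Lowercased Words:", words_lower)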
Stopword Removal
Stopwords are common words (like "the", "is", "and") that add little meaning to text
and can be removed.
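A sketch using nltk's built-in English stopword list; it builds on words_lower from the lowercasing step and produces the filtered_words_nltk list that the stemming cell below expects:

# Keep only tokens absent from nltk's English stopword list
# (isalpha() also drops punctuation tokens left over from tokenization)
stop_words = set(stopwords.words("english"))
filtered_words_nltk = [word for word in words_lower
                       if word not in stop_words and word.isalpha()]
print("Filtered Words:", filtered_words_nltk)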
Lemmatization
Lemmatization reduces words to their base or root form (e.g., "running" becomes
"run").
Stemming
Stemming reduces words to their root form by chopping off suffixes. Unlike lemmatization, the result may not be a real word (e.g., "studies" becomes "studi").
# Apply the Porter stemmer to the stopword-filtered tokens
stemmer = PorterStemmer()
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_words_nltk]
print("Stemmed Words (nltk):", stemmed_words_nltk)
Conclusion
In this lab, we explored various text preprocessing steps using nltk and spaCy. These steps are foundational for almost any NLP task and play a vital role in improving the performance of downstream machine learning models. Feel free to experiment with different datasets and observe the results!
Key Takeaways
- nltk and spaCy provide powerful tools for text preprocessing.
- Both libraries have unique strengths: nltk offers traditional NLP tools, while spaCy excels in modern NLP pipelines (see the spaCy sketch below).
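For contrast, a minimal spaCy sketch covering the same steps in one pass. It assumes the en_core_web_sm model is installed (python -m spacy download en_core_web_sm):

# One pipeline call handles tokenization, stopword flags, and lemmas
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print("Tokens (spaCy):", [token.text for token in doc])
print("Filtered (spaCy):", [token.lower_ for token in doc
                            if not token.is_stop and not token.is_punct])
print("Lemmas (spaCy):", [token.lemma_ for token in doc
                          if not token.is_stop and not token.is_punct])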