Text Preprocessing
Steps Covered
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Lemmatization
5. Stemming (nltk)
# Import libraries
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Fetch the nltk resources used below ("punkt_tab" is needed on newer nltk versions)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource)
Text Dataset
For this lab, we'll use a small dataset of sentences that simulate real-world text data.
You can replace this with any dataset of your choice.
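A minimal placeholder cell is below; the sample sentences are an assumption, since the original dataset cell did not survive the export.

# Small sample text; swap in any dataset of your choice
text = ("Text preprocessing is an important step in NLP. "
        "It involves cleaning and preparing raw text for analysis. "
        "The models were running smoothly on the cleaned sentences.")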
Tokenization
Tokenization splits raw text into smaller units, such as words or sentences.
# Tokenize the text into individual word tokens
words_nltk = word_tokenize(text)
print("\nWords:", words_nltk)
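Since sent_tokenize is imported alongside word_tokenize, a sentence-level pass works the same way (a small sketch, not part of the original notebook):

# Split the same text into sentences
sentences_nltk = sent_tokenize(text)
print("Sentences:", sentences_nltk)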
Lowercasing
Lowercasing converts all text to lowercase, which helps in standardising text.
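A one-line sketch over the word tokens from the previous step (the words_lower name is our own):

# Map every token to lowercase so "The" and "the" compare equal
words_lower = [word.lower() for word in words_nltk]
print("Lowercased Words:", words_lower)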
Stopword Removal
Stopwords are common words (like "the", "is", "and") that add little meaning to text
and can be removed.
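A sketch using nltk's built-in English stopword list; it builds on words_lower from the lowercasing step and produces the filtered_words_nltk list that the stemming cell below expects:

# Keep only tokens absent from nltk's English stopword list
# (isalpha() also drops punctuation tokens left over from tokenization)
stop_words = set(stopwords.words("english"))
filtered_words_nltk = [word for word in words_lower
                       if word not in stop_words and word.isalpha()]
print("Filtered Words:", filtered_words_nltk)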
Lemmatization
Lemmatization reduces words to their base or root form (e.g., "running" becomes
"run").
Stemming
Stemming reduces words to their root form by chopping off suffixes. Unlike lemmatization, the result may not be a real word (e.g., "studies" becomes "studi").
# Apply the Porter stemmer to the stopword-filtered tokens
stemmer = PorterStemmer()
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_words_nltk]
print("Stemmed Words (nltk):", stemmed_words_nltk)
Conclusion
In this lab, we explored various text preprocessing steps using nltk and spaCy. These steps are foundational for almost any NLP task and play a vital role in improving the performance of downstream machine learning models. Feel free to experiment with different datasets and observe the results!
Key Takeaways
- nltk and spaCy provide powerful tools for text preprocessing.
- Both libraries have unique strengths: nltk offers traditional NLP tools, while spaCy excels in modern NLP pipelines (see the spaCy sketch below).
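For contrast, a minimal spaCy sketch covering the same steps in one pass. It assumes the en_core_web_sm model is installed (python -m spacy download en_core_web_sm):

# One pipeline call handles tokenization, stopword flags, and lemmas
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print("Tokens (spaCy):", [token.text for token in doc])
print("Filtered (spaCy):", [token.lower_ for token in doc
                            if not token.is_stop and not token.is_punct])
print("Lemmas (spaCy):", [token.lemma_ for token in doc
                          if not token.is_stop and not token.is_punct])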