
Preprocessing in Python

The document discusses basic preprocessing techniques for natural language processing (NLP) data, focusing on converting text to lower-case, removing HTML tags, and eliminating stop words. It explains the importance of these steps in cleaning and preparing text data for analysis, using examples from an IMDB movie reviews dataset. Additionally, it introduces the use of regular expressions (regex) for text manipulation and provides code snippets for implementing these preprocessing tasks in Python.


• Unlike traditional ML problems, NLP problems come with data in text format. As with traditional ML, we need to perform preprocessing on this (text) data.
• Preprocessing of text data is done at either a basic level or an advanced level. In this lecture, we cover the basic preprocessing techniques.
• We will cover several preprocessing techniques, but this does not mean that every NLP application uses all of them. Some techniques are not applicable to some NLP problems.

• Let's start with the first basic preprocessing method: converting the text into lower-case letters.
Lower-case Letters
• But why do we need to convert text into lower-case letters?
• Suppose we have the following sentence, which we then tokenize:

“The students take lectures in the weekends”

[“The”], [“students”], [“take”], [“lectures”], [“in”], [“the”], [“weekends”]
• Now the sentence contains both “The” and “the”, and since Python is a case-sensitive language, it considers “The” and “the” to be two different words, not the same word.

• To cope with this, we convert the whole text to lower-case letters (we could convert to upper-case too, but lower-case is the more widely used convention), as the short check below shows.
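A quick way to see the case-sensitivity issue in Python (a minimal illustration, not from the slides):

print("The" == "the")          # False: string comparison is case-sensitive
print("The".lower() == "the")  # True once the text is lower-cased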
• Suppose we have the IMDB dataset (CSV format), which contains about 50k movie reviews. It has two columns: review and sentiment (positive or negative).

df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
What is df?

• df stands for DataFrame, a data structure provided by the pandas library in Python.

• The pandas library is used for handling tabular data.

• A DataFrame is like a table (think of it as an Excel sheet) where:
- Rows represent individual entries (e.g., one email per row).
- Columns represent different properties or features of the data (e.g., the email text and its label).
import pandas as pd

# Create a dictionary to represent the data
data = {
    'Email': ["WIN $10,000 NOW!!!", "Meeting tomorrow at 3 PM."],
    'Label': ['Spam', 'Non-spam']
}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
• The DataFrame df will look like this:

Email                        Label
WIN $10,000 NOW!!!           Spam
Meeting tomorrow at 3 PM.    Non-spam
With head(), we can display the first few records of the dataset to see what our data looks like.
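For example (a minimal illustration, assuming df holds the IMDB reviews loaded above):

print(df.head())    # first 5 rows by default
print(df.head(10))  # first 10 rows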
• Let's convert the fourth (4th) review into lower-case letters with the following command:

df['review'][3].lower()

df is the DataFrame (the table where the data is stored), 'review' is the column name, 3 is the index of the fourth review (indexing starts at 0), and lower() is the string method that converts the text into lower-case letters.
• Now we have about 50k reviews in total in our dataset, so to convert all the reviews into lower-case letters, we can use the following code:

df['review'].str.lower()

This converts the entire column into lower-case letters.
• Assign the result back to the column:

df['review'] = df['review'].str.lower()
• So we usually perform lower-case conversion first as a preprocessing task.

• When that is done, we remove the unimportant text from the dataset.
Remove HTML tags

• Sometimes when we scrape/download data from the web, it includes stray HTML tags like <b>, etc. For example, this data was downloaded from the IMDB website.
Regex (short for Regular Expression)

• Regex in NLP is a powerful tool for searching, matching, and manipulating patterns in text.

• Think of regex as a "search pattern" that helps you find specific text, like words, numbers, or formats, within a larger text.
• It is commonly used in NLP for tasks like:

- Text Cleaning: Removing unwanted characters like punctuation or special symbols.
- Pattern Matching: Finding specific patterns, such as email addresses or dates.
Imagine you have a piece of text

"I have 3 apples and 5 bananas."

• Suppose you want to extract all the numbers from this text.

Regex Pattern: \d+

• \d matches any digit (0-9).
• + means "one or more" of the preceding element.
• Using this regex on the text will extract 3 and 5.
• Pattern: \d+
• \d: Matches any single digit (0-9).
• +: Matches one or more digits in a row.

"My phone number is 123456."

The regex \d+ will match 123456 (all consecutive digits).


Pattern: a+

a: Matches the character 'a'.


+: Matches one or more 'a's in a row.

For the text: "aaaabbbccaaa"

The regex a+ will match:

aaaa (the first group of 'a's)
aaa (the second group of 'a's)
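The same can be verified with re.findall (a minimal check, mirroring the digit example below):

import re

print(re.findall(r'a+', "aaaabbbccaaa"))  # Output: ['aaaa', 'aaa']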
import re

text = "I have 3 apples and 5 bananas."

# Find all numbers in the text
numbers = re.findall(r'\d+', text)

print(numbers)  # Output: ['3', '5']


• The r in r'\d+' stands for raw string in Python.

• It tells Python to treat the string as a "raw" string, meaning that special characters (like \) are not interpreted as escape characters.
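A quick way to see the difference (a minimal illustration, not from the slides):

print('C:\new_folder')   # \n is interpreted as a newline: prints 'C:' then 'ew_folder'
print(r'C:\new_folder')  # raw string keeps the backslash: prints 'C:\new_folder'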
Why is Regex Useful in NLP?

• It helps quickly identify and process specific parts of text, saving time
and effort in cleaning or extracting structured information from
unstructured data.
• We first want to see whether our data contains any HTML tags or punctuation marks.
We will use regex (the re module) for that.

import re

The re module allows us to use regular expressions (regex), which are patterns used to match specific text sequences.

Regular expressions are very useful when we need to identify or extract certain patterns, like HTML tags or punctuation marks, from strings.
def contains_html_or_punctuation(text):

• We wrap the check in a function to make the code reusable: we can test whether a string (in this case, a review) contains HTML tags or punctuation without repeating the logic every time.
• The text parameter represents each individual review in the DataFrame.
html_pattern = re.compile(r'<.*?>')  # Match HTML tags
text = re.sub(r'[^\w\s]', '', text)  # Keep only letters, numbers, and whitespace
In HTML, tags are usually enclosed in angle brackets (e.g., <b>, <p>, etc.). We use the pattern r'<.*?>' to match
anything that starts with <, ends with >, and contains any characters in between.

<.*?>: The dot . matches any character (except newline), the asterisk * means "zero or more of those
characters", and the question mark ? makes the match non-greedy (matches the shortest possible sequence).

Punctuation: r'[^\w\s]': This matches everything except:

\w: Word characters (letters, numbers, and underscore).

\s: Whitespace characters (spaces, tabs, etc.).

By replacing matches with an empty string (''), all punctuation (e.g., -, ?, ", etc.) is removed.

Preserves:

Words.
Spaces between words.

Why these specific characters? Because they are commonly used punctuation marks that could appear in the text
and may need to be cleaned up in preprocessing.
# Apply the cleaning function to the 'text' column
df['cleaned_text'] = df['text'].apply(clean_text)

# Display the first 10 rows of cleaned text
print(df[['cleaned_text', 'label']].head(10))
.head(10) is a Pandas method that returns the first 10 rows of the DataFrame.

• We only need to show a small sample to examine the kinds of rows that contain unwanted characters (HTML tags, punctuation) before preprocessing.
Removing Numbers

• Remove numbers from the text, as they often do not carry significant
meaning (depending on the task).

• Numbers like "1234" might not be relevant for tasks like sentiment
analysis.
# Remove numbers
df['no_numbers'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))

# Display the first few rows
print(df[['no_numbers', 'label']].head(10))
What Are Stop Words?

• Stop words are common words in a language (like "is," "the," "and,"
etc.) that don't carry significant meaning for tasks like text
classification, sentiment analysis, or other NLP tasks.

• Removing them reduces noise in the text, allowing the model to focus
on more meaningful words.
• Import Required Libraries:

• nltk: A popular library for Natural Language Processing (NLP).

• stopwords: A module in nltk that provides a predefined list of common stop words in different languages.
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

(This downloads the list of stop words (a collection of common words like "the," "a," "in," etc.)
provided by the nltk library.)
• Define the Stop Words Set

stop_words = set(stopwords.words('english'))

• stopwords.words('english'): Retrieves the list of English stop words.
• set(): Converts the list into a set for faster lookup. (Checking membership in a set is quicker than in a list.)
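A quick membership check against the set (a minimal illustration):

print('the' in stop_words)    # True: 'the' would be removed
print('movie' in stop_words)  # False: 'movie' would be kept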
df['no_stopwords'] = df['no_numbers'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

• (If you remember, 'no_numbers' is the column we created in the previous step, when we removed numbers from the data.)
.apply() applies a function to each row (or element) in the column.
lambda x: ' '.join([word for word in x.split() if word not in stop_words])

How Does the Lambda Function Work?

x.split(): Splits the text (x) into individual words (creating a list of words).

[word for word in x.split() if word not in stop_words]: Filters out words that are in the stop words set.

For example: If x = "I like to learn NLP", and stop_words = {“to”, “I”}, then this becomes ["like", "learn",
"NLP"].

' '.join(...): Joins the filtered words back into a single string with spaces.
Result: "like learn NLP"
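The same toy example as runnable code (a minimal illustration using the made-up stop-word set from above):

x = "I like to learn NLP"
stop_words_demo = {"to", "I"}  # toy stop-word set, not the real NLTK list
print(' '.join([word for word in x.split() if word not in stop_words_demo]))
# Output: like learn NLP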

The Final Output Column:

df['no_stopwords']: A new column in the DataFrame where stop words have been removed from the
original text.
Code for removing stopwords

from nltk.corpus import stopwords
import nltk

# Download the list of stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Remove stop words
df['no_stopwords'] = df['no_numbers'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

# Display the first few rows
print(df[['no_stopwords', 'label']].head(10))
