
Preprocessing in Python

The document discusses basic preprocessing techniques for natural language processing (NLP) data, focusing on converting text to lower-case, removing HTML tags, and eliminating stop words. It explains the importance of these steps in cleaning and preparing text data for analysis, using examples from an IMDB movie reviews dataset. Additionally, it introduces the use of regular expressions (regex) for text manipulation and provides code snippets for implementing these preprocessing tasks in Python.


• Unlike traditional ML problems, NLP problems come with data in text format. As with traditional ML, we need to perform preprocessing on this (text) data.
• Preprocessing of text data is done at either a basic level or an advanced level. In this lecture, we cover the basic preprocessing techniques.
• We will cover several preprocessing techniques, but this does not mean that every NLP application uses all of them. Some techniques are not applicable to some NLP problems.

• Let's start with the first basic preprocessing method: converting the text into lower-case letters.
Lower-case Letters
• But why do we need to convert text into lower-case letters?
• Suppose we have the following sentence, which we then tokenize:

“The students take lectures in the weekends”

[“The”], [“students”], [“take”], [“lectures”], [“in”], [“the”], [“weekends”]
• Now the sentence contains both “The” and “the”, and since Python is a case-sensitive language, it considers “The” and “the” to be two different words, not the same word.

• To cope with this, we convert the whole text to lower-case letters (we could convert to upper-case too, but lower-case is the more widely used convention), as the short check below shows.
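A quick way to see the case-sensitivity issue in Python (a minimal illustration, not from the slides):

print("The" == "the")          # False: string comparison is case-sensitive
print("The".lower() == "the")  # True once the text is lower-cased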
• Suppose we have the IMDB dataset (CSV format), which contains about 50k movie reviews. It has two columns: review and sentiment (positive or negative).

df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
What is df?

• df stands for DataFrame, a data structure provided by the pandas library in Python.

• The pandas library is used for handling tabular data.

• A DataFrame is like a table (think of it as an Excel sheet) where:
- Rows represent individual entries (e.g., one email per row).
- Columns represent different properties or features of the data (e.g., the email text and its label).
import pandas as pd

# Create a dictionary to represent the data
data = {
    'Email': ["WIN $10,000 NOW!!!", "Meeting tomorrow at 3 PM."],
    'Label': ['Spam', 'Non-spam']
}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
• The DataFrame df will look like this:

Email                        Label
WIN $10,000 NOW!!!           Spam
Meeting tomorrow at 3 PM.    Non-spam
With head(), we can display the first few records of the dataset to see what our data looks like.
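For example (a minimal illustration, assuming df holds the IMDB reviews loaded above):

print(df.head())    # first 5 rows by default
print(df.head(10))  # first 10 rows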
• Let's convert the fourth (4th) review into lower-case letters with the following command:

df['review'][3].lower()

df is the DataFrame (the table where the data is stored), 'review' is the column name, 3 is the index of the fourth review (indexing starts at 0), and lower() is the string method that converts the text into lower-case letters.
• Now we have about 50k reviews in total in our dataset, so to convert all the reviews into lower-case letters, we can use the following code:

df['review'].str.lower()

This converts the entire column into lower-case letters.
• Assign the result back to the column:

df['review'] = df['review'].str.lower()
• So we usually perform lower-case conversion first as a preprocessing task.

• When that is done, we remove the unimportant text from the dataset.
Remove HTML tags

• Sometimes when we scrape/download data from the web, it includes stray HTML tags like <b>, etc. For example, this data was downloaded from the IMDB website.
Regex (short for Regular Expression)

• Regex in NLP is a powerful tool for searching, matching, and manipulating patterns in text.

• Think of regex as a "search pattern" that helps you find specific text, like words, numbers, or formats, within a larger text.
• It is commonly used in NLP for tasks like:

- Text Cleaning: Removing unwanted characters like punctuation or special symbols.
- Pattern Matching: Finding specific patterns, such as email addresses or dates.
Imagine you have a piece of text

"I have 3 apples and 5 bananas."

• Suppose you want to extract all the numbers from this text.

Regex Pattern: \d+

• \d matches any digit (0-9).
• + means "one or more" of the preceding element.
• Using this regex on the text will extract 3 and 5.
• Pattern: \d+
• \d: Matches any single digit (0-9).
• +: Matches one or more digits in a row.

"My phone number is 123456."

The regex \d+ will match 123456 (all consecutive digits).


Pattern: a+

a: Matches the character 'a'.


+: Matches one or more 'a's in a row.

For the text: "aaaabbbccaaa"

The regex a+ will match:

aaaa (the first group of 'a's)
aaa (the second group of 'a's)
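The same can be verified with re.findall (a minimal check, mirroring the digit example below):

import re

print(re.findall(r'a+', "aaaabbbccaaa"))  # Output: ['aaaa', 'aaa']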
import re

text = "I have 3 apples and 5 bananas."

# Find all numbers in the text
numbers = re.findall(r'\d+', text)

print(numbers)  # Output: ['3', '5']


• The r in r'\d+' stands for raw string in Python.

• It tells Python to treat the string as a "raw" string, meaning that special characters (like \) are not interpreted as escape characters.
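A quick way to see the difference (a minimal illustration, not from the slides):

print('C:\new_folder')   # \n is interpreted as a newline: prints 'C:' then 'ew_folder'
print(r'C:\new_folder')  # raw string keeps the backslash: prints 'C:\new_folder'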
Why is Regex Useful in NLP?

• It helps quickly identify and process specific parts of text, saving time
and effort in cleaning or extracting structured information from
unstructured data.
• We first want to see whether our data contains any HTML tags or punctuation marks.
We will use regex (the re module) for that.

import re

The re module allows us to use regular expressions (regex), which are patterns used to match specific text sequences.

Regular expressions are very useful when we need to identify or extract certain patterns, like HTML tags or punctuation marks, from strings.
def contains_html_or_punctuation(text):

• We wrap the check in a function to make the code reusable: we can test whether a string (in this case, a review) contains HTML tags or punctuation without repeating the logic every time.
• The text parameter represents each individual review in the DataFrame.
html_pattern = re.compile(r'<.*?>')  # Match HTML tags
text = re.sub(r'[^\w\s]', '', text)  # Keep only letters, numbers, and whitespace
In HTML, tags are usually enclosed in angle brackets (e.g., <b>, <p>, etc.). We use the pattern r'<.*?>' to match
anything that starts with <, ends with >, and contains any characters in between.

<.*?>: The dot . matches any character (except newline), the asterisk * means "zero or more of those
characters", and the question mark ? makes the match non-greedy (matches the shortest possible sequence).

Punctuation: r'[^\w\s]': This matches everything except:

\w: Word characters (letters, numbers, and underscore).

\s: Whitespace characters (spaces, tabs, etc.).

By replacing matches with an empty string (''), all punctuation (e.g., -, ?, ", etc.) is removed.

Preserves:

Words.
Spaces between words.

Why these specific characters? Because they are commonly used punctuation marks that could appear in the text
and may need to be cleaned up in preprocessing.
# Apply the cleaning function to the 'text' column
df['cleaned_text'] = df['text'].apply(clean_text)

# Display the first 10 rows of cleaned text
print(df[['cleaned_text', 'label']].head(10))
.head(10) is a Pandas method that returns the first 10 rows of the DataFrame.

• We only need to show a small sample to examine the kinds of rows that contain unwanted characters (HTML tags, punctuation) before preprocessing.
Removing Numbers

• Remove numbers from the text, as they often do not carry significant
meaning (depending on the task).

• Numbers like "1234" might not be relevant for tasks like sentiment
analysis.
# Remove numbers
df['no_numbers'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))

# Display the first few rows
print(df[['no_numbers', 'label']].head(10))
What Are Stop Words?

• Stop words are common words in a language (like "is," "the," "and,"
etc.) that don't carry significant meaning for tasks like text
classification, sentiment analysis, or other NLP tasks.

• Removing them reduces noise in the text, allowing the model to focus
on more meaningful words.
• Import Required Libraries:

• nltk: A popular library for Natural Language Processing (NLP).

• stopwords: A module in nltk that provides a predefined list of common stop words in different languages.
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

(This downloads the list of stop words (a collection of common words like "the," "a," "in," etc.)
provided by the nltk library.)
• Define the Stop Words Set

stop_words = set(stopwords.words('english'))

• stopwords.words('english'): Retrieves the list of English stop words.
• set(): Converts the list into a set for faster lookup. (Checking membership in a set is quicker than in a list.)
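A quick membership check against the set (a minimal illustration):

print('the' in stop_words)    # True: 'the' would be removed
print('movie' in stop_words)  # False: 'movie' would be kept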
df['no_stopwords'] = df['no_numbers'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

• (If you remember, 'no_numbers' is the column we created in the previous step, when we removed numbers from the data.)
.apply() applies a function to each row (or element) in the column.
lambda x: ' '.join([word for word in x.split() if word not in stop_words])

How Does the Lambda Function Work?

x.split(): Splits the text (x) into individual words (creating a list of words).

[word for word in x.split() if word not in stop_words]: Filters out words that are in the stop words set.

For example: If x = "I like to learn NLP", and stop_words = {“to”, “I”}, then this becomes ["like", "learn",
"NLP"].

' '.join(...): Joins the filtered words back into a single string with spaces.
Result: "like learn NLP"
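The same toy example as runnable code (a minimal illustration using the made-up stop-word set from above):

x = "I like to learn NLP"
stop_words_demo = {"to", "I"}  # toy stop-word set, not the real NLTK list
print(' '.join([word for word in x.split() if word not in stop_words_demo]))
# Output: like learn NLP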

The Final Output Column:

df['no_stopwords']: A new column in the DataFrame where stop words have been removed from the
original text.
Code for removing stopwords

from nltk.corpus import stopwords
import nltk

# Download the list of stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Remove stop words
df['no_stopwords'] = df['no_numbers'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

# Display the first few rows
print(df[['no_stopwords', 'label']].head(10))
