Beginner's Guide To Data Cleaning and Feature Extraction in NLP - by Enes Gokce - Towards Data Science
This article explains the steps of data cleaning and feature extraction for text analysis done using Natural Language Processing (NLP).
On the internet, there are many great text cleaning guides. Some of them perform feature extraction after text cleaning, while others do it before. Both approaches work fine. However, here is the issue that gets little attention: during the data cleaning process, we lose some possible features (variables). For that reason, we need feature extraction before the data cleaning. On the other hand, some features make sense only when they are extracted after the data cleaning. Thus, we also need feature extraction after the data cleaning. This study pays attention to this point, and that is what makes it unique.
In order to address the points stated above, this study follows three steps in order:
1. Feature extraction before data cleaning
2. Data cleaning
3. Feature extraction after data cleaning
This article is part of an Amazon review analysis with NLP methods. Here is my GitHub repo for the Colab notebook with the code for the main study, and the code for this study.
Brief information about the data I use: The data used in this project was
downloaded from Kaggle. It was uploaded by the Stanford Network Analysis Project.
The original data comes from the study ‘From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews’ by J. McAuley and J. Leskovec (2013). This data set consists of reviews of fine foods from
Amazon. The data includes all 568,454 reviews spanning 1999 to 2012. Reviews
include product and user information, ratings, and a plain text review.
1. Number of stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Python’s nltk package ships with 127 default English stop words. When stop word removal is applied, these 127 words are discarded. Before removing them, let’s record the ‘number of stop words’ in each review as a feature.
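A minimal sketch of this count, assuming nltk’s default English list and the reviews in df['Text'] (the stop word corpus may need to be downloaded on a first run):

from nltk.corpus import stopwords
# nltk.download('stopwords')  # may be needed once

stop_words = set(stopwords.words('english'))

# Count how many stop words appear in each review before they are removed
df['stopwords'] = df['Text'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))
df[['Text','stopwords']].head()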
2. Number of punctuation marks: Another feature that can’t be obtained after the data cleaning, because punctuation will be deleted.
import string

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return count

# Apply the defined function on the text data
df['punctuation'] = df['Text'].apply(lambda x: count_punct(x))
df[['Text','punctuation']].head()
5. Number of uppercase words: Emotions such as anger and rage are quite often expressed in UPPERCASE words, which makes counting them a useful feature. During the data cleaning, all letters will be converted to lowercase, so this feature must also be extracted beforehand.
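The ‘upper’ column shown below can be created with a count of fully uppercase tokens; the exact creation line is not in the excerpt, so this is an assumed sketch:

# Count words written entirely in uppercase
df['upper'] = df['Text'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))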
df[['Text','upper']].head()
Now, we are done with features that can only be obtained before data cleaning. We
are ready to clean the data.
1. Make all text lower case: The first pre-processing step is transforming the reviews into lower case. This avoids having multiple copies of the same word. For example, while calculating the word count, ‘Dog’ and ‘dog’ will be taken as different words if we skip this transformation.
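A minimal sketch of this step, assuming the reviews are kept in df['Text']:

# Lowercase every word in every review
df['Text'] = df['Text'].apply(lambda x: " ".join(w.lower() for w in str(x).split()))
df['Text'].head()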
3) Removal of Stop Words: With this step, I removed all default English stop words in the nltk package.
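A sketch of this removal, again using nltk’s default English list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Drop every default English stop word from the reviews
df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop_words))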
Adding your own stopwords: At this point, you may want to add your own stopwords.
I do this mostly after checking the most frequent words. We can check the most
frequent words in this way:
import pandas as pd
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:20]
freq
From these words, I want to remove ‘br’, ‘get’, and ‘also’ because they don’t make
much sense. Let’s add them to the list of the stopwords:
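One way to do this (the words below are just the ones identified above; extend the list with whatever your own frequency check reveals):

from nltk.corpus import stopwords

# Extend the default list with corpus-specific noise words
other_stopwords = ['br', 'get', 'also']
stop_words = set(stopwords.words('english')).union(other_stopwords)
df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop_words))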
Note: In other guides, you may come across the TF-IDF method. TF-IDF is another way to get rid of words that have little semantic value in the text data. If you are using TF-IDF, you don’t need to apply stop word removal (though applying both does no harm).
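For reference only, a minimal TF-IDF sketch with scikit-learn (not part of the cleaning pipeline in this article; max_features is an arbitrary choice here):

from sklearn.feature_extraction.text import TfidfVectorizer

# Words that appear in almost every review receive very low TF-IDF weights
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['Text'].astype(str))
print(tfidf_matrix.shape)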
4) Removing URLs: URLs are another source of noise in the data and were removed.
import re

def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

# remove all URLs from df
df['Text'] = df['Text'].apply(lambda x: remove_url(x))
5) Removing HTML tags: HTML is used extensively on the Internet, but HTML tags themselves are not helpful when processing text. Thus, all HTML tags were removed from the reviews.
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

# remove all html tags from df
df['Text'] = df['Text'].apply(lambda x: remove_html(x))
6) Removing emojis: Emojis are another type of noise in the reviews and were removed.

# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
#Example
remove_emoji("Omg another Earthquake 😔😔")
# remove all emojis from df
df['Text'] = df['Text'].apply(lambda x: remove_emoji(x))
7) Removing emoticons: An emoticon is not the same thing as an emoji:
· :-) is an emoticon.
· 😜 is an emoji.
!pip install emot #This may be required for the Colab notebook
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)
#Example
remove_emoticons("Hello :-)")
Spelling correction is a useful pre-processing step because it also helps reduce multiple copies of the same word. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.
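One common option is TextBlob’s correct() method. It is slow on a large corpus, so this sketch only corrects a single word and a small sample of reviews:

from textblob import TextBlob

# Spell-correct a single misspelled word
print(TextBlob("analytcs").correct())

# Correcting every review is slow, so apply it to a sample first
df['Text'][:100].apply(lambda x: str(TextBlob(str(x)).correct()))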
Python’s NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.
import nltk
from nltk.stem import WordNetLemmatizer
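Continuing from the imports above, a sketch of applying the lemmatizer to the reviews (the WordNet data may need to be downloaded once):

nltk.download('wordnet')  # the WordNet database is needed once

lemmatizer = WordNetLemmatizer()
df['Text'] = df['Text'].apply(lambda x: " ".join(lemmatizer.lemmatize(w) for w in str(x).split()))
df['Text'].head()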
For more detailed background about lemmatization, you can check Datacamp.
Here, I will stop cleaning the data. However, as a researcher, you may need more text cleaning depending on your data. For example, you may want to consider the following:
⚫ Different packages use different numbers of stop words. You can try other NLP packages (see the quick comparison below).
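For instance, spaCy ships a larger default English stop word list than nltk; a quick, assumed comparison (both packages must be installed):

from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS

# Compare the sizes of the two default English stop word lists
print(len(stopwords.words('english')))  # nltk
print(len(STOP_WORDS))                  # spaCy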
Now that the text is clean, we can extract the features that only make sense after data cleaning.
1. Number of Words: This feature tells how many words there are in each review.
df['word_count'] = df['Text'].apply(lambda x: len(str(x).split(" ")))
df[['Text','word_count']].head()
2. Average word length: This feature gives the average length of the words used in the review.

def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / (len(words) + 0.000001)

df['avg_word'] = df['Text'].apply(lambda x: avg_word(x))
df[['Text','avg_word']].head()
Let’s check how the extracted features look in the dataframe:
df.sample(5)
Conclusion
This study explains the steps of text cleaning. In addition, this guide is unique in that feature extraction is done in two rounds: before the text cleaning and after the text cleaning. We need to remember that, for an actual study, text cleaning is a recursive process: once we see an anomaly, we go back and do more cleaning to address it.