
Beginner’s Guide to Data Cleaning and Feature Extraction in NLP

Enes Gokce
Published in Towards Data Science · 8 min read · May 12, 2020


Source: Levgeniiya Ocheretna via Shutterstock

This article explains the steps of data cleaning and feature extraction for text
analysis using Natural Language Processing (NLP).

There are many great text cleaning guides on the internet. Some of them perform
feature extraction after text cleaning, while others do it before. Both approaches
work fine. However, there is an issue that gets little attention: in the data
cleaning process, we lose some possible features (variables), so we need feature
extraction before the data cleaning. On the other hand, some features make sense
only when they are extracted after the data cleaning, so we also need feature
extraction after it. This study pays attention to both points, and that is what
makes it unique.

To address the points stated above, this study follows three steps in order:

1. Feature Extraction — Round 1

2. Data Cleaning

3. Feature Extraction — Round 2

This article is part of an Amazon review analysis with NLP methods. My GitHub repo
contains the Colab notebooks with the code for the main study and the code for this
study.

Brief information about the data: the data used in this project was downloaded
from Kaggle, where it was uploaded by the Stanford Network Analysis Project. The
original data comes from the study ‘From amateurs to connoisseurs: modeling the
evolution of user expertise through online reviews’ by J. McAuley and J. Leskovec
(2013). The data set consists of reviews of fine foods from Amazon: 568,454 reviews
spanning 1999 to 2012, including product and user information, ratings, and a
plain-text review.
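All of the snippets below assume the reviews have already been loaded into a pandas DataFrame named df, with the raw review text in a Text column. Here is a minimal sketch of that setup (the file name Reviews.csv is an assumption based on the Kaggle dataset, not something specified in the article):

import pandas as pd

# Load the Amazon Fine Food Reviews data (assumed file name: Reviews.csv)
df = pd.read_csv('Reviews.csv')

# The review body lives in the 'Text' column; drop rows with no text
df = df.dropna(subset=['Text'])
print(df.shape)
df[['Text']].head()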

Feature Extraction — Round 1


In this part, we extract the features that would be impossible to obtain after data
cleaning.

1. Number of stop words: A stop word is a commonly used word (such as “the”,
“a”, “an”, “in”) that a search engine has been programmed to ignore, both
when indexing entries for searching and when retrieving them as the result
of a search query. Python’s nltk package ships with 127 English stop words by
default; applying the stop word list means these 127 words are ignored. Before
removing the stop words, let’s store the ‘number of stop words’ as a variable.

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))  # run nltk.download('stopwords') once if needed

df['stopwords'] = df['Text'].apply(lambda x: len([w for w in x.split() if w in stop]))
df[['Text','stopwords']].head()

2. Number of punctuation marks: Another feature that can’t be obtained after the
data cleaning, because punctuation will be deleted.

import string

def count_punct(text):
    # Count characters that appear in string.punctuation
    return sum(1 for char in text if char in string.punctuation)

# Apply the defined function on the text data
df['punctuation'] = df['Text'].apply(count_punct)

# Let's check the dataset
df[['Text','punctuation']].head()
3. Number of hashtags: One more interesting feature we can extract from text data
is the number of hashtags or mentions present in it. During the data cleaning,
hashtags will be deleted and we will lose access to this information. Therefore,
let’s extract this feature while we still can.

df['hastags'] = df['Text'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
df[['Text','hastags']].head()

4. Number of numerical characters: The number of numeric values present in the
reviews can also be useful.

df['numerics'] = df['Text'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
df[['Text','numerics']].head()

5. Number of uppercase words: Emotions such as anger and rage are often expressed
in UPPERCASE words, which makes counting them a useful operation for identifying
such reviews. During the data cleaning, all letters will be converted to lowercase,
so this feature must be extracted now.

df['upper'] = df['Text'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
df[['Text','upper']].head()

Now, we are done with features that can only be obtained before data cleaning. We
are ready to clean the data.

Text Cleaning Techniques


Before applying NLP techniques, the data needs to be cleaned and prepared for the
analysis. If this process is not done properly, it can ruin the analysis entirely.
Here are the steps that were applied to the data:

1. Make all text lower case: The first pre-processing step is transforming the
reviews into lower case. This avoids having multiple copies of the same
words. For example, while calculating the word count, ‘Dog’ and ‘dog’ will be
taken as different words if we ignore this transformation.

df['Text'] = df['Text'].apply(lambda x: " ".join(w.lower() for w in x.split()))
df['Text'].head()

2) Removing Punctuation: Punctuation creates noise in the data and should be
cleared. For now, NLP methods do not have a meaningful way to analyze punctuation,
so it was removed from the text data. With this step, these characters were
removed: [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]

# Drop every character that is not a word character or whitespace
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)
df['Text'].head()

3) Removal of Stop Words: With this step, I removed all default English stop words
in the nltk package.

from nltk.corpus import stopwords

stop = stopwords.words('english')

df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df['Text'].sample(10)

Adding your own stopwords: At this point, you may want to add your own stopwords.
I do this mostly after checking the most frequent words. We can check the most
frequent words in this way:

import pandas as pd
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:20]
freq

Most common 20 words

From these words, I want to remove ‘br’, ‘get’, and ‘also’ because they don’t make
much sense. Let’s add them to the list of the stopwords:

# Adding common words from our document to the stop words
add_words = ["br", "get", "also"]

stop_words = set(stopwords.words("english"))
stop_added = stop_words.union(add_words)

df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_added))
df['Text'].sample(10)

Note: In other guides, you may come across the TF-IDF method. TF-IDF is another
way to get rid of words that carry little semantic value in the text data. If you
use TF-IDF, you don’t need to apply stop word removal (though applying both does
no harm).
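The original article does not include TF-IDF code; as a rough sketch of that alternative, assuming scikit-learn is available, something like the following could stand in for explicit stop word removal:

from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.9 drops terms that appear in more than 90% of reviews, so very
# common words fade out without an explicit stop word list;
# min_df=5 ignores very rare terms (typos, one-off tokens)
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5)
X = vectorizer.fit_transform(df['Text'])

print(X.shape)                                   # (number of reviews, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])   # a sample of the kept vocabulary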

4) Removing URLs: URLs are another source of noise in the data and were removed.

import re

def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

# remove all urls from df
df['Text'] = df['Text'].apply(remove_url)

5) Removing HTML tags: HTML is used extensively on the internet, but HTML tags
themselves are not helpful when processing text. Thus, all HTML tags will be
removed.

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

# remove all html tags from df
df['Text'] = df['Text'].apply(remove_html)

6) Removing Emojis: Emojis can be an indicator of emotions related to customer
satisfaction. Unfortunately, we need to remove the emojis in this text analysis
because, for now, the NLP methods used here cannot analyze them.

# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Example
remove_emoji("Omg another Earthquake 😔😔")

# remove all emojis from df
df['Text'] = df['Text'].apply(remove_emoji)

7) Removing Emoticons: What is the difference between an emoji and an emoticon?

· :-) is an emoticon.

· 😜 is an emoji.

!pip install emot  # This may be required for the Colab notebook
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Function for removing emoticons
def remove_emoticons(text):
    # re.escape is needed because emoticons contain regex metacharacters such as ( and )
    emoticon_pattern = re.compile(u'(' + u'|'.join(re.escape(k) for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

# Example
remove_emoticons("Hello :-)")

df['Text'] = df['Text'].apply(remove_emoticons)

8) Spell Correction: Amazon reviews contain a plethora of spelling mistakes;
product reviews are sometimes hastily written and barely legible.

In that regard, spelling correction is a useful pre-processing step because it
also helps reduce multiple copies of the same word. For example, “Analytics” and
“analytcs” would otherwise be treated as different words even when used in the
same sense.

from textblob import TextBlob

# correct() is slow, so as a demonstration it is applied to the first five reviews only
df['Text'][:5].apply(lambda x: str(TextBlob(x).correct()))
9. Lemmatization: Lemmatization is the process of converting a word to its base
form. Lemmatization considers the context and converts the word to its meaningful
base form. For example:

‘Caring’ -> Lemmatization -> ‘Care’

Python’s NLTK provides the WordNet Lemmatizer, which uses the WordNet database to
look up lemmas of words.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # needed once for the WordNet database

# Init the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word in every review
df['Text'] = df['Text'].apply(lambda x: " ".join(lemmatizer.lemmatize(w) for w in x.split()))

For more detailed background about lemmatization, you can check Datacamp.

Here, I will stop cleaning the data. However, as a researcher, you may need more
text cleaning depending on your data. For example, you may want to try:

⚫ Stemming the text data (see the sketch after this list)

⚫ Alternative approaches to spell correction, such as isolated-term correction and
context-sensitive correction

⚫ Other NLP packages, since different packages ship with different stop word lists
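As a small sketch of the stemming option mentioned above (not part of the original pipeline; the Text_stemmed column name is purely illustrative), NLTK’s PorterStemmer can be applied word by word:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# 'caring' -> 'care', 'running' -> 'run', but also 'studies' -> 'studi'
df['Text_stemmed'] = df['Text'].apply(lambda x: " ".join(stemmer.stem(w) for w in x.split()))
df[['Text', 'Text_stemmed']].head()

Unlike lemmatization, stemming chops words down to a crude root form without consulting a dictionary, so it is faster but can produce non-words.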

Feature Extraction — Round 2


Some features are extracted after text cleaning because they are more meaningful
at this step. For example, the number of characters would be badly distorted by
URL links if we extracted this feature before the data cleaning. At this point, we
should try to extract as many features as possible, since extra features have a
chance to provide useful information during the text analysis. We don’t have to
worry about whether every feature will actually be useful; in the worst case, we
simply don’t use them.

1. Number of words: This feature tells how many words there are in the review.

df['word_count'] = df['Text'].apply(lambda x: len(str(x).split(" ")))
df[['Text','word_count']].head()

2. Number of characters: How many characters are contained in the review.

df['char_count'] = df['Text'].str.len()  # this also includes spaces
df[['Text','char_count']].head()

3. Average Word Length: Average number of letters in the words in a review.

def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / (len(words) + 0.000001)

df['avg_word'] = df['Text'].apply(lambda x: avg_word(x)).round(1)
df[['Text','avg_word']].head()

Let’s check how the extracted features look in the dataframe:

df.sample(5)

Conclusion
This study explains the steps of text cleaning. In addition, this guide is unique
in that feature extraction is done in two rounds: before the text cleaning and
after the text cleaning. We need to remember that in an actual study, text cleaning
is an iterative process: once we see an anomaly, we go back and do more cleaning to
address it.

*Special thanks to my friend Tabitha Stickel for proofreading this article.

NLP · Python · Text Mining · Data Cleaning · Data Science

Written by Enes Gokce
Data Scientist at Native, Ph.D. student at Penn State University, and writer for Towards Data Science.
