Beginner's Guide To Data Cleaning and Feature Extraction in NLP - by Enes Gokce - Towards Data Science
This article explains the steps of data cleaning and feature extraction for text analysis done using Natural Language Processing (NLP).
On the internet, there are many great text cleaning guides. Some of them perform feature extraction after text cleaning, while others do it before. Both approaches work fine. However, here is the issue that gets little attention: during the data cleaning process, we lose some possible features (variables). For that reason, we need feature extraction before the data cleaning. On the other hand, some features make sense only when they are extracted after the data cleaning. Thus, we also need feature extraction after the data cleaning. This study pays attention to this point, and that is what makes it unique.
In order to address the points stated above, this study follows three steps in order:
1. Feature extraction before data cleaning
2. Data cleaning
3. Feature extraction after data cleaning
This article is part of an Amazon review analysis with NLP methods. Here is my GitHub repo for the Colab notebook with the code for the main study, and the code for this study.
Brief information about the data I use: The data used in this project was
downloaded from Kaggle. It was uploaded by the Stanford Network Analysis Project.
The original data comes from the study ‘From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews’ by J. McAuley and J. Leskovec (2013). This data set consists of reviews of fine foods from
Amazon. The data includes all 568,454 reviews spanning 1999 to 2012. Reviews
include product and user information, ratings, and a plain text review.
1. Number of stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Python’s nltk package ships with 127 default English stop words. When stop word removal is applied, these 127 words are discarded. Before removing them, let’s record the ‘number of stop words’ in each review as a feature.
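A minimal sketch of this count, assuming nltk’s default English list and the reviews in df['Text'] (the stop word corpus may need to be downloaded on a first run):

from nltk.corpus import stopwords
# nltk.download('stopwords')  # may be needed once

stop_words = set(stopwords.words('english'))

# Count how many stop words appear in each review before they are removed
df['stopwords'] = df['Text'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))
df[['Text','stopwords']].head()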
2. Number of punctuation marks: Another feature that can’t be obtained after the data cleaning, because punctuation will be deleted.
import string

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return count

# Apply the defined function on the text data
df['punctuation'] = df['Text'].apply(lambda x: count_punct(x))
df[['Text','punctuation']].head()
5. Number of uppercase words: Emotions such as anger and rage are quite often expressed in UPPERCASE words, which makes counting them a useful feature. During the data cleaning, all letters will be converted to lowercase, so this feature must also be extracted beforehand.
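The ‘upper’ column shown below can be created with a count of fully uppercase tokens; the exact creation line is not in the excerpt, so this is an assumed sketch:

# Count words written entirely in uppercase
df['upper'] = df['Text'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))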
df[['Text','upper']].head()
Now, we are done with features that can only be obtained before data cleaning. We
are ready to clean the data.
1. Make all text lower case: The first pre-processing step is transforming the reviews into lower case. This avoids having multiple copies of the same word. For example, while calculating the word count, ‘Dog’ and ‘dog’ will be taken as different words if we skip this transformation.
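A minimal sketch of this step, assuming the reviews are kept in df['Text']:

# Lowercase every word in every review
df['Text'] = df['Text'].apply(lambda x: " ".join(w.lower() for w in str(x).split()))
df['Text'].head()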
3) Removal of Stop Words: With this step, I removed all default English stop words in the nltk package.
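A sketch of this removal, again using nltk’s default English list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Drop every default English stop word from the reviews
df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop_words))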
Adding your own stopwords: At this point, you may want to add your own stopwords.
I do this mostly after checking the most frequent words. We can check the most
frequent words in this way:
import pandas as pd
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:20]
freq
From these words, I want to remove ‘br’, ‘get’, and ‘also’ because they don’t make
much sense. Let’s add them to the list of the stopwords:
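One way to do this (the words below are just the ones identified above; extend the list with whatever your own frequency check reveals):

from nltk.corpus import stopwords

# Extend the default list with corpus-specific noise words
other_stopwords = ['br', 'get', 'also']
stop_words = set(stopwords.words('english')).union(other_stopwords)
df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop_words))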
Note: In other guides, you may come across the TF-IDF method. TF-IDF is another way to get rid of words that have little semantic value in the text data. If you are using TF-IDF, you don’t need to apply stop word removal (though applying both does no harm).
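For reference only, a minimal TF-IDF sketch with scikit-learn (not part of the cleaning pipeline in this article; max_features is an arbitrary choice here):

from sklearn.feature_extraction.text import TfidfVectorizer

# Words that appear in almost every review receive very low TF-IDF weights
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['Text'].astype(str))
print(tfidf_matrix.shape)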
4) Removing URLs: URLs are another source of noise in the data and were removed.
import re

def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

# remove all URLs from df
df['Text'] = df['Text'].apply(lambda x: remove_url(x))
5) Removing HTML tags: HTML is used extensively on the Internet, but HTML tags themselves are not helpful when processing text. Thus, all HTML tags were removed from the reviews.
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

# remove all html tags from df
df['Text'] = df['Text'].apply(lambda x: remove_html(x))
6) Removing emojis: Emojis are another type of noise in the reviews and were removed.

# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
#Example
remove_emoji("Omg another Earthquake 😔😔")
# remove all emojis from df
df['Text'] = df['Text'].apply(lambda x: remove_emoji(x))
7) Removing emoticons: An emoticon is not the same thing as an emoji:
· :-) is an emoticon.
· 😜 is an emoji.
!pip install emot #This may be required for the Colab notebook
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)
#Example
remove_emoticons("Hello :-)")
Spelling correction is a useful pre-processing step because it also helps reduce multiple copies of the same word. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.
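One common option is TextBlob’s correct() method. It is slow on a large corpus, so this sketch only corrects a single word and a small sample of reviews:

from textblob import TextBlob

# Spell-correct a single misspelled word
print(TextBlob("analytcs").correct())

# Correcting every review is slow, so apply it to a sample first
df['Text'][:100].apply(lambda x: str(TextBlob(str(x)).correct()))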
Python’s NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.
import nltk
from nltk.stem import WordNetLemmatizer
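Continuing from the imports above, a sketch of applying the lemmatizer to the reviews (the WordNet data may need to be downloaded once):

nltk.download('wordnet')  # the WordNet database is needed once

lemmatizer = WordNetLemmatizer()
df['Text'] = df['Text'].apply(lambda x: " ".join(lemmatizer.lemmatize(w) for w in str(x).split()))
df['Text'].head()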
For more detailed background about lemmatization, you can check Datacamp.
Here, I will stop cleaning the data. However, as a researcher, you may need more text cleaning depending on your data. For example, you may want to consider the following:
⚫ Different packages use different numbers of stop words. You can try other NLP packages (see the quick comparison below).
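For instance, spaCy ships a larger default English stop word list than nltk; a quick, assumed comparison (both packages must be installed):

from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS

# Compare the sizes of the two default English stop word lists
print(len(stopwords.words('english')))  # nltk
print(len(STOP_WORDS))                  # spaCy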
Now that the text is clean, we can extract the features that only make sense after data cleaning.
1. Number of Words: This feature tells how many words there are in each review.
df['word_count'] = df['Text'].apply(lambda x: len(str(x).split(" ")))
df[['Text','word_count']].head()
2. Average word length: This feature gives the average length of the words used in the review.

def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / (len(words) + 0.000001)

df['avg_word'] = df['Text'].apply(lambda x: avg_word(x))
df[['Text','avg_word']].head()
Let’s check how the extracted features look in the dataframe:
df.sample(5)
Conclusion
This study explains the steps of text cleaning. In addition, this guide is unique in that feature extraction is done in two rounds: before the text cleaning and after the text cleaning. We need to remember that, for an actual study, text cleaning is a recursive process: once we see an anomaly, we go back and do more cleaning to address it.