18 Text Mining - Text Preprocessing

Prof. V.V. Subrahmanyam
School of Computer and Information Sciences
Indira Gandhi National Open University (IGNOU)
New Delhi
Date: 22nd Aug, 2024  Time: 4:00 PM to 4:30 PM
Text Mining
 Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights.
Text Preprocessing - Introduction
 Text data derived from natural language is unstructured and noisy.
 Text preprocessing is therefore a critical step: it transforms messy, unstructured text data into a form that can be used effectively to train machine learning models, leading to better results and insights.
Text Preprocessing
 Text preprocessing refers to a series of techniques used to clean, transform and prepare raw textual data into a format suitable for Natural Language Processing (NLP), Text Mining or Machine Learning (ML) tasks.
Goal of Text Preprocessing
 The goal of text preprocessing is to enhance the quality and usability of the text data for subsequent analysis or modeling.
Common Text Preprocessing / Cleaning Steps
 Lower Casing
 Removal of Punctuations
 Removal of Stopwords
 Removal of Frequent words
 Removal of Rare words
 Stemming
 Lemmatization
 Removal of emojis
 Removal of emoticons
 Conversion of emoticons to words
 Conversion of emojis to words
 Removal of URLs
 Removal of HTML tags
 Chat words conversion
 Spelling correction
Lower Casing
 Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that, for example, 'text', 'Text' and 'TEXT' are treated the same way.
 This is especially helpful for text featurization techniques like frequency counts or TF-IDF, as it combines occurrences of the same word, reducing duplication and giving correct counts / TF-IDF values.
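A minimal sketch in Python; the sample string is illustrative:

text = "Sample TEXT with Mixed CASE"
# str.lower() converts every character to its lower-case form
print(text.lower())  # sample text with mixed case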
Removal of Punctuations
 This is again a text standardization step that helps to treat 'hurray' and 'hurray!' in the same way.
 We also need to carefully choose the list of punctuations to exclude, depending on the use case.
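A minimal sketch using the built-in string module; the punctuation set should be tuned to the use case:

import string

text = "hurray! we won, finally..."
# str.maketrans builds a table that deletes every character in
# string.punctuation; adjust this set for your own use case
print(text.translate(str.maketrans('', '', string.punctuation)))
# hurray we won finally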
Removal of Stopwords
 Stopwords are commonly occurring words in a language, like 'the', 'a' and so on.
 They can usually be removed from the text, as they don't provide valuable information for downstream analysis.
 In cases like Part of Speech (POS) tagging, however, we should not remove them, as they provide very valuable information about the POS.
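A minimal sketch using NLTK's English stopword list (the list must be downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))

text = "this is a sample sentence for the demo"
print([w for w in text.split() if w not in stop_words])
# ['sample', 'sentence', 'demo']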
Removal of Frequent Words
 In the previous preprocessing step, we removed stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words that are of not much importance to us.
 So this step removes the frequent words in the given corpus. If we use something like TF-IDF, this is automatically taken care of.
Some of the Domain-Specific Corpus Frequent Words….
 I, us, DM, Help, We, Hi, Please, Get, Thanks etc.
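A minimal sketch that treats the top-N corpus words as "frequent"; the corpus and cutoff here are illustrative:

from collections import Counter

corpus = ["hi please help us", "hi we need help", "please get back to us"]

# count every word across the corpus
word_counts = Counter(w for doc in corpus for w in doc.split())
# take the 3 most common words as "frequent"; the cutoff is arbitrary
frequent = {w for w, _ in word_counts.most_common(3)}

cleaned = [" ".join(w for w in doc.split() if w not in frequent)
           for doc in corpus]
print(cleaned)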
Removal of Rare Words
 This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.
 We can combine all the lists of words (stopwords, frequent words and rare words) into a single list and remove them at once.
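A similar sketch, treating words that occur only once as "rare" (the threshold is arbitrary):

from collections import Counter

corpus = ["hi please help us", "hi we need help", "please get back to us"]

word_counts = Counter(w for doc in corpus for w in doc.split())
# words seen only once are treated as "rare"
rare = {w for w, c in word_counts.items() if c == 1}

cleaned = [" ".join(w for w in doc.split() if w not in rare)
           for doc in corpus]
print(cleaned)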
Stemming
 Stemming is the process of reducing inflected or derived words to their word stem, base or root form.
 For example, if there are two words in the corpus, walks and walking, then stemming will strip the suffix to make them walk.
 But say, in another example, we have the two words console and consoling; the stemmer will remove the suffix and make them consol, which isn't a proper English word.
Contd…
 There are several types of stemming algorithms available, and one of the most famous is the Porter stemmer, which is widely used.
 The Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer.
Stemming Example
 We can see that words like private and propose have their e at the end chopped off due to stemming. This is not intended.
 What can we do about that? We can use Lemmatization in such cases.
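A minimal sketch with NLTK's PorterStemmer reproducing this behaviour:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["walks", "walking", "consoling", "private", "propose"]
# for other languages, NLTK also provides SnowballStemmer,
# e.g. SnowballStemmer("spanish")
print([stemmer.stem(w) for w in words])
# ['walk', 'walk', 'consol', 'privat', 'propos']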
Lemmatization
 Lemmatization is similar to stemming in reducing inflected words to their word stem, but differs in that it makes sure the root word (also called the lemma) belongs to the language.
 Examples: Propose, Private
Illustration of Lemmatization and Stemming
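A minimal NLTK sketch comparing the two, assuming the WordNet corpus has been downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# pos='v' tells the lemmatizer to treat each word as a verb;
# the default part of speech is noun
for word in ["propose", "private", "walking"]:
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos='v'))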
Removal of Emojis
 With more and more usage of social media platforms, there has been an explosion in the use of emojis in our day-to-day life as well. We might need to remove these emojis for some of our textual analyses.
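A common regex-based sketch; the codepoint ranges cover most, but not all, emoji:

import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) symbols
    "]+"
)

print(emoji_pattern.sub("", "Great job 😀🚀"))  # Great job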
Removal of Emoticons
 There is a minor difference between emojis and emoticons.
 An emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression; an emoji is an actual image.
 :-) is an emoticon
 😀 is an emoji
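A minimal sketch using a small, illustrative emoticon list; a real application would use a much larger dictionary:

import re

# a tiny illustrative set; real emoticon dictionaries are much larger
EMOTICONS = [":-)", ":)", ":-(", ":(", ":D", ";-)"]
emoticon_pattern = re.compile("|".join(re.escape(e) for e in EMOTICONS))

print(emoticon_pattern.sub("", "good game :-) well played :("))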
Conversion of Emoticon to Words
 In the previous step, we removed the emoticons. But in use cases like sentiment analysis, the emoticons carry valuable information, so removing them might not be a good solution. What can we do in such cases?
 One way is to convert the emoticons to word format so that they can be used in downstream modeling processes.
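A minimal sketch with an illustrative emoticon-to-word mapping:

# an illustrative mapping; extend it with a full emoticon dictionary
EMOTICON_WORDS = {":-)": "happy_face", ":-(": "sad_face", ":D": "big_grin"}

def convert_emoticons(text):
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, word)
    return text

print(convert_emoticons("good game :-) tough loss :-("))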
Conversion of Emoji to Words
 Now let us do the same for emojis as well.
 We may make use of a dictionary to convert the emojis to corresponding words.
 Again, this conversion might be better than emoji removal for certain use cases. Please use the one that is suitable for the use case.
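A minimal dictionary-based sketch; the third-party emoji package (and its demojize function) is a more complete alternative:

# an illustrative mapping; the third-party `emoji` package's
# demojize() offers a complete version of this idea
EMOJI_WORDS = {"😀": "grinning_face", "🔥": "fire", "❤": "red_heart"}

def convert_emojis(text):
    for emo, word in EMOJI_WORDS.items():
        text = text.replace(emo, " " + word + " ")
    return text

print(convert_emojis("that performance was 🔥"))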
Removal of URLs
 The next preprocessing step is to remove any URLs present in the data.
 For example, if we are doing an X (Twitter) analysis, then there is a good chance that the tweets will have URLs in them. We might need to remove them for our further analysis.
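A minimal regex sketch; the pattern is a heuristic, not a full URL grammar:

import re

# matches http(s) URLs and bare www. links
url_pattern = re.compile(r"https?://\S+|www\.\S+")

tweet = "check this out https://example.com/post and www.example.org"
print(url_pattern.sub("", tweet))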
Removal of HTML Tags
 Another common preprocessing technique that comes in handy in multiple places is the removal of HTML tags.
 This is especially useful if we scrape data from different websites; we might end up having HTML strings as part of our text.
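A minimal regex sketch; regexes are fragile for real-world HTML, where a parser such as the third-party BeautifulSoup library is more robust:

import re

# strips anything between < and >; fine for simple markup,
# fragile for messy real-world HTML
html_pattern = re.compile(r"<[^>]+>")

snippet = "<div><p>Text mining is <b>fun</b></p></div>"
print(html_pattern.sub("", snippet))  # Text mining is fun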
Chat Words Conversion
 This is an important text preprocessing step if we are dealing with chat data.
 People use a lot of abbreviated words in chat, so it might be helpful to expand those words for our analysis purposes; a sketch follows the examples below.
Examples
 AFAIK = As Far As I Know
 AFK = Away From Keyboard
 ASAP = As Soon As Possible
 ATK = At The Keyboard
 ATM = At The Moment
 A3 = Anytime, Anywhere, Anyplace
 BAK = Back At Keyboard
 BBL = Be Back Later
 BBS = Be Back Soon
 BFN = Bye For Now
 B4N = Bye For Now
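A minimal sketch using a few entries from the list above:

# a few entries from the list above; extend as needed
CHAT_WORDS = {
    "AFAIK": "as far as i know",
    "ASAP": "as soon as possible",
    "BBL": "be back later",
}

def expand_chat_words(text):
    # look each token up in the dictionary; keep it unchanged if absent
    return " ".join(CHAT_WORDS.get(w.upper(), w) for w in text.split())

print(expand_chat_words("will reply ASAP AFAIK"))
# will reply as soon as possible as far as i know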
Spelling Correction
 Another important text preprocessing step is spelling correction.
 Typos are common in text data, and we might want to correct those spelling mistakes before we do our analysis.
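A minimal sketch using the third-party TextBlob library, one common option for this step:

from textblob import TextBlob  # third-party: pip install textblob

text = "speling correctin is a comon preprocessing step"
# correct() returns a TextBlob with its best guess for each word
print(str(TextBlob(text).correct()))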
Tokenization
 Tokenization is the process of breaking up text into separate tokens, which can be individual words, phrases, or whole sentences.
 In some cases, punctuation and special characters (symbols like %, &, $) are discarded in the process.
Contd…
A few common operations that require tokenization
include:
 Finding how many words or sentences appear in text
 Determining how many times a specific word or
phrase exists
 Accounting for which terms are likely to co-occur
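A minimal sketch with NLTK's tokenizers (the punkt models must be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = "Text mining is fun. Tokenization splits text into tokens."
print(sent_tokenize(text))       # list of sentences
print(word_tokenize(text))       # list of word and punctuation tokens
print(len(word_tokenize(text)))  # how many tokens appear in the text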
Parts of Speech (POS) Tagging
 This is one of the more advanced text preprocessing techniques.
 This step augments the input text with additional information about the sentence's grammatical structure.
 Each word is, therefore, placed into one of the predefined categories such as noun, verb, adjective, etc.
 This step is also sometimes referred to as grammatical tagging.
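A minimal NLTK sketch (the tagger model must be downloaded once):

import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS tagger model

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
# each token is paired with a tag such as DT (determiner),
# JJ (adjective), NN (noun) or VBZ (verb)
print(pos_tag(tokens))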
Term Frequency
 Term frequency tells you how often a term occurs in a document.
 Terms can be either individual words or phrases containing multiple words.
 Since documents differ in length, a term is likely to appear more times in longer documents than in shorter ones.
Contd…
 Thus, you can calculate term frequency by dividing the number of times the term appears by the total number of terms in the document, as a way of normalization.
 Term Frequency = [Number of times the term appears in the document] / [Total number of terms in the document]
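A minimal sketch computing this formula:

from collections import Counter

def term_frequency(term, document):
    # term frequency = count of the term / total number of terms
    tokens = document.lower().split()
    return Counter(tokens)[term] / len(tokens)

doc = "text mining turns text into insights"
print(term_frequency("text", doc))  # 2 / 6 = 0.333...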
While Working with Python Language….
 We will be using the NLTK (Natural Language Toolkit).

# import the necessary libraries
import nltk     # NLP toolkit: tokenizers, stemmers, taggers, corpora
import string   # constants such as string.punctuation
import re       # regular expressions for pattern-based cleaning
To Remove Punctuation
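A regex-based sketch, an alternative to the translate approach shown earlier:

import re

text = "hurray!!! we won, at last..."
# \w matches letters, digits and underscore; \s matches whitespace;
# everything else (i.e. punctuation) is removed
print(re.sub(r"[^\w\s]", "", text))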
To remove white space
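A minimal sketch that collapses runs of whitespace into single spaces:

text = "  too   much \t whitespace \n here  "
# split() breaks on any whitespace run; join() rebuilds with single spaces
print(" ".join(text.split()))  # too much whitespace here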
THANK YOU
Email: [email protected]
