18 Text Mining - Text Preprocessing
Subrahmanyam
School of Computer and Information Sciences
Indira Gandhi National Open University (IGNOU)
New Delhi
Date: 22nd Aug, 2024 Time : 4-00PM to 4-30PM
Text Mining
Text mining, also known
as text data mining, is the
process of transforming
unstructured text into
a structured format to
identify meaningful
patterns and new insights.
Text Preprocessing - Introduction
Text data derived from natural language is
unstructured and noisy.
So text preprocessing is a critical step to
transform messy, unstructured text data
into a form that can be effectively used to
train machine learning models, leading to
better results and insights.
Text Preprocessing
Text preprocessing refers to a series of
techniques used to clean, transform and
prepare raw textual data into a format that
is suitable for natural language processing
(NLP), text mining or machine learning
(ML) tasks.
Goal of Text Preprocessing
The goal of text preprocessing is to
enhance the quality and usability of
the text data for subsequent analysis
or modeling.
Common Text Preprocessing / Cleaning Steps
Lower Casing
Removal of Punctuation
Removal of Stopwords
Removal of Frequent words
Removal of Rare words
Stemming
Lemmatization
Removal of emojis
Removal of emoticons
Conversion of emoticons to words
Conversion of emojis to words
Removal of URLs
Removal of HTML tags
Chat words conversion
Spelling correction
Lower Casing
Lower casing is a common text preprocessing
technique. The idea is to convert the input text
to a single case so that, for example, 'text',
'Text' and 'TEXT' are treated the same way.
This is especially helpful for text featurization
techniques based on word frequency or TF-IDF, as it
merges case variants of the same word, thereby
reducing duplication and giving correct counts /
TF-IDF values.
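A minimal sketch of the effect on word counts, assuming simple whitespace tokenization. The sample sentence and the helper name lowercase are illustrative only and not part of the lecture:

```python
from collections import Counter


def lowercase(text: str) -> str:
    """Convert the text to lower case (hypothetical helper name)."""
    return text.lower()


# Illustrative sentence containing three case variants of the same word.
tokens = "Text mining turns raw TEXT into structured text".split()

# Without lower casing, each variant is counted as a separate word.
raw_counts = Counter(tokens)

# After lower casing, the three variants merge into one entry.
lower_counts = Counter(lowercase(t) for t in tokens)

print(raw_counts["text"])    # count of the exact token 'text' only
print(lower_counts["text"])  # count of 'text', 'Text' and 'TEXT' combined
```

Lower casing can discard useful signal in some tasks (for example, when capitalization marks proper nouns), so whether to apply it depends on the use case.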
Removal of Punctuation
This is again a text standardization process
that helps to treat 'hurray' and 'hurray!'
in the same way.
We also need to carefully choose which
punctuation marks to remove, depending on the
use case.
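One possible way to do this in Python is with str.translate and the standard library's string.punctuation. This is a sketch, not the only approach; the function name remove_punctuation and its punctuation parameter are assumptions made here so that the set of characters to remove can be adjusted per use case:

```python
import string


def remove_punctuation(text: str, punctuation: str = string.punctuation) -> str:
    """Delete every character listed in `punctuation` from `text`.

    By default this uses string.punctuation (ASCII punctuation only).
    Pass a custom string to control which marks are removed.
    """
    return text.translate(str.maketrans("", "", punctuation))


# 'hurray' and 'hurray!' become the same token.
print(remove_punctuation("hurray!"))

# Remove only '!' and '.', keeping apostrophes and '#'.
print(remove_punctuation("Hurray! It's #1.", punctuation="!."))
```

Note that string.punctuation covers only ASCII characters; curly quotes or other Unicode punctuation would need to be added to the list explicitly.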
Removal of Stopwords
Stopwords are commonly occurring words in a language,
like 'the', 'a' and so on.
They can be removed from the text most of the time,
as they don't provide valuable information for
downstream analysis.
In cases like Part of Speech (POS) tagging, we should
not remove them, as they provide very valuable information
about the POS.
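A minimal sketch of stopword removal, assuming whitespace tokenization. The small STOPWORDS set below is a hand-written illustration, not a standard list; in practice a fuller list from an NLP library would normally be used:

```python
# Tiny illustrative stopword set (an assumption for this sketch,
# far smaller than the lists shipped with NLP libraries).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def remove_stopwords(text: str, stopwords: set = STOPWORDS) -> str:
    """Drop every whitespace-separated word that is in `stopwords`.

    The comparison is done on the lower-cased word, so 'The' is
    removed as well as 'the'.
    """
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


print(remove_stopwords("The cat is on a mat"))
```

Because the stopword set is a parameter, the step can be skipped or given an empty set for tasks such as POS tagging where these words matter.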
Removal of Frequent Words
In the previous preprocessing step, we removed
stopwords based on language information. But
if we have a domain-specific corpus, we
might also have some frequent words that are
not of much importance to us.
So this step removes the most frequent words in the
given corpus. If we use something like TF-IDF, this is
largely taken care of automatically, since words that
appear in most documents receive low weights.
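A rough sketch of corpus-based frequent word removal, assuming the text has already been lower-cased and can be split on whitespace. The function names, the cutoff n and the three-sentence corpus are all made up for illustration:

```python
from collections import Counter


def most_frequent_words(docs: list, n: int) -> set:
    """Return the n most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(n)}


def remove_frequent_words(docs: list, n: int = 10) -> list:
    """Remove the n most frequent corpus words from every document."""
    frequent = most_frequent_words(docs, n)
    return [
        " ".join(word for word in doc.split() if word not in frequent)
        for doc in docs
    ]


# Hypothetical domain-specific corpus: 'patient' and 'reports'
# appear often but carry little information here.
corpus = [
    "patient reports mild fever",
    "patient reports severe cough",
    "patient denies chest pain",
]

cleaned = remove_frequent_words(corpus, n=2)
print(cleaned)
```

The cutoff n is a judgment call: it is worth inspecting the list of frequent words before removing them, since some may be informative for the task.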
Some of the Domain Specific Corpus
Frequent Words….