18 Text Mining - Text Preprocessing

Prof. V.V. Subrahmanyam
School of Computer and Information Sciences
Indira Gandhi National Open University (IGNOU)
New Delhi
Date: 22nd Aug, 2024  Time: 4:00 PM to 4:30 PM
Text Mining
 Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights.
Text Preprocessing - Introduction
 Text data derived from natural language is unstructured and noisy.
 Text preprocessing is therefore a critical step: it transforms messy, unstructured text data into a form that can be used effectively to train machine learning models, leading to better results and insights.
Text Preprocessing
 Text preprocessing refers to a series of techniques used to clean, transform and prepare raw textual data into a format suitable for Natural Language Processing (NLP), Text Mining or Machine Learning (ML) tasks.
Goal of Text Preprocessing
 The goal of text preprocessing is to enhance the quality and usability of the text data for subsequent analysis or modeling.
Common Text Preprocessing / Cleaning Steps
 Lower Casing
 Removal of Punctuations
 Removal of Stopwords
 Removal of Frequent words
 Removal of Rare words
 Stemming
 Lemmatization
 Removal of emojis
 Removal of emoticons
 Conversion of emoticons to words
 Conversion of emojis to words
 Removal of URLs
 Removal of HTML tags
 Chat words conversion
 Spelling correction
Lower Casing
 Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that, for example, 'text', 'Text' and 'TEXT' are treated the same way.
 This is especially helpful for text featurization techniques like frequency counts or TF-IDF, as it combines occurrences of the same word, reducing duplication and giving correct counts / TF-IDF values.
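A minimal sketch in Python; the sample string is illustrative:

text = "Sample TEXT with Mixed CASE"
# str.lower() converts every character to its lower-case form
print(text.lower())  # sample text with mixed case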
Removal of Punctuations
 This is again a text standardization step that helps to treat 'hurray' and 'hurray!' in the same way.
 We also need to carefully choose the list of punctuations to exclude, depending on the use case.
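A minimal sketch using the built-in string module; the punctuation set should be tuned to the use case:

import string

text = "hurray! we won, finally..."
# str.maketrans builds a table that deletes every character in
# string.punctuation; adjust this set for your own use case
print(text.translate(str.maketrans('', '', string.punctuation)))
# hurray we won finally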
Removal of Stopwords
 Stopwords are commonly occurring words in a language, like 'the', 'a' and so on.
 They can usually be removed from the text, as they don't provide valuable information for downstream analysis.
 In cases like Part of Speech (POS) tagging, however, we should not remove them, as they provide very valuable information about the POS.
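A minimal sketch using NLTK's English stopword list (the list must be downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))

text = "this is a sample sentence for the demo"
print([w for w in text.split() if w not in stop_words])
# ['sample', 'sentence', 'demo']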
Removal of Frequent Words
 In the previous preprocessing step, we removed stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words that are of not much importance to us.
 So this step removes the frequent words in the given corpus. If we use something like TF-IDF, this is automatically taken care of.
Some of the Domain-Specific Corpus Frequent Words….
 I, us, DM, Help, We, Hi, Please, Get, Thanks etc.
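A minimal sketch that treats the top-N corpus words as "frequent"; the corpus and cutoff here are illustrative:

from collections import Counter

corpus = ["hi please help us", "hi we need help", "please get back to us"]

# count every word across the corpus
word_counts = Counter(w for doc in corpus for w in doc.split())
# take the 3 most common words as "frequent"; the cutoff is arbitrary
frequent = {w for w, _ in word_counts.most_common(3)}

cleaned = [" ".join(w for w in doc.split() if w not in frequent)
           for doc in corpus]
print(cleaned)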
Removal of Rare Words
 This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.
 We can combine all the lists of words (stopwords, frequent words and rare words) into a single list and remove them at once.
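A similar sketch, treating words that occur only once as "rare" (the threshold is arbitrary):

from collections import Counter

corpus = ["hi please help us", "hi we need help", "please get back to us"]

word_counts = Counter(w for doc in corpus for w in doc.split())
# words seen only once are treated as "rare"
rare = {w for w, c in word_counts.items() if c == 1}

cleaned = [" ".join(w for w in doc.split() if w not in rare)
           for doc in corpus]
print(cleaned)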
Stemming
 Stemming is the process of reducing inflected or derived words to their word stem, base or root form.
 For example, if there are two words in the corpus, walks and walking, then stemming will strip the suffix to make them walk.
 But say, in another example, we have the two words console and consoling; the stemmer will remove the suffix and make them consol, which isn't a proper English word.
Contd…
 There are several types of stemming algorithms available, and one of the most famous is the Porter stemmer, which is widely used.
 The Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer.
Stemming Example
 We can see that words like private and propose have their e at the end chopped off due to stemming. This is not intended.
 What can we do about that? We can use Lemmatization in such cases.
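A minimal sketch with NLTK's PorterStemmer reproducing this behaviour:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["walks", "walking", "consoling", "private", "propose"]
# for other languages, NLTK also provides SnowballStemmer,
# e.g. SnowballStemmer("spanish")
print([stemmer.stem(w) for w in words])
# ['walk', 'walk', 'consol', 'privat', 'propos']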
Lemmatization
 Lemmatization is similar to stemming in reducing inflected words to their word stem, but differs in that it makes sure the root word (also called the lemma) belongs to the language.
 Examples: Propose, Private
Illustration of Lemmatization and Stemming
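A minimal NLTK sketch comparing the two, assuming the WordNet corpus has been downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# pos='v' tells the lemmatizer to treat each word as a verb;
# the default part of speech is noun
for word in ["propose", "private", "walking"]:
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos='v'))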
Removal of Emojis
 With more and more usage of social media platforms, there has been an explosion in the use of emojis in our day-to-day life as well. We might need to remove these emojis for some of our textual analyses.
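A common regex-based sketch; the codepoint ranges cover most, but not all, emoji:

import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) symbols
    "]+"
)

print(emoji_pattern.sub("", "Great job 😀🚀"))  # Great job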
Removal of Emoticons
 There is a minor difference between emojis and emoticons.
 An emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression; an emoji is an actual image.
 :-) is an emoticon
 😀 is an emoji
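A minimal sketch using a small, illustrative emoticon list; a real application would use a much larger dictionary:

import re

# a tiny illustrative set; real emoticon dictionaries are much larger
EMOTICONS = [":-)", ":)", ":-(", ":(", ":D", ";-)"]
emoticon_pattern = re.compile("|".join(re.escape(e) for e in EMOTICONS))

print(emoticon_pattern.sub("", "good game :-) well played :("))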
Conversion of Emoticon to Words
 In the previous step, we removed the emoticons. But in use cases like sentiment analysis, the emoticons carry valuable information, so removing them might not be a good solution. What can we do in such cases?
 One way is to convert the emoticons to word format so that they can be used in downstream modeling processes.
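A minimal sketch with an illustrative emoticon-to-word mapping:

# an illustrative mapping; extend it with a full emoticon dictionary
EMOTICON_WORDS = {":-)": "happy_face", ":-(": "sad_face", ":D": "big_grin"}

def convert_emoticons(text):
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, word)
    return text

print(convert_emoticons("good game :-) tough loss :-("))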
Conversion of Emoji to Words
 Now let us do the same for emojis as well.
 We may make use of a dictionary to convert the emojis to corresponding words.
 Again, this conversion might be better than emoji removal for certain use cases. Please use the one that is suitable for the use case.
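A minimal dictionary-based sketch; the third-party emoji package (and its demojize function) is a more complete alternative:

# an illustrative mapping; the third-party `emoji` package's
# demojize() offers a complete version of this idea
EMOJI_WORDS = {"😀": "grinning_face", "🔥": "fire", "❤": "red_heart"}

def convert_emojis(text):
    for emo, word in EMOJI_WORDS.items():
        text = text.replace(emo, " " + word + " ")
    return text

print(convert_emojis("that performance was 🔥"))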
Removal of URLs
 The next preprocessing step is to remove any URLs present in the data.
 For example, if we are doing an X (Twitter) analysis, then there is a good chance that the tweets will have URLs in them. We might need to remove them for our further analysis.
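A minimal regex sketch; the pattern is a heuristic, not a full URL grammar:

import re

# matches http(s) URLs and bare www. links
url_pattern = re.compile(r"https?://\S+|www\.\S+")

tweet = "check this out https://example.com/post and www.example.org"
print(url_pattern.sub("", tweet))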
Removal of HTML Tags
 Another common preprocessing technique that comes in handy in multiple places is the removal of HTML tags.
 This is especially useful if we scrape data from different websites; we might end up having HTML strings as part of our text.
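A minimal regex sketch; regexes are fragile for real-world HTML, where a parser such as the third-party BeautifulSoup library is more robust:

import re

# strips anything between < and >; fine for simple markup,
# fragile for messy real-world HTML
html_pattern = re.compile(r"<[^>]+>")

snippet = "<div><p>Text mining is <b>fun</b></p></div>"
print(html_pattern.sub("", snippet))  # Text mining is fun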
Chat Words Conversion
 This is an important text preprocessing step if we are dealing with chat data.
 People use a lot of abbreviated words in chat, so it might be helpful to expand those words for our analysis purposes; a sketch follows the examples below.
Examples
 AFAIK = As Far As I Know
 AFK = Away From Keyboard
 ASAP = As Soon As Possible
 ATK = At The Keyboard
 ATM = At The Moment
 A3 = Anytime, Anywhere, Anyplace
 BAK = Back At Keyboard
 BBL = Be Back Later
 BBS = Be Back Soon
 BFN = Bye For Now
 B4N = Bye For Now
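A minimal sketch using a few entries from the list above:

# a few entries from the list above; extend as needed
CHAT_WORDS = {
    "AFAIK": "as far as i know",
    "ASAP": "as soon as possible",
    "BBL": "be back later",
}

def expand_chat_words(text):
    # look each token up in the dictionary; keep it unchanged if absent
    return " ".join(CHAT_WORDS.get(w.upper(), w) for w in text.split())

print(expand_chat_words("will reply ASAP AFAIK"))
# will reply as soon as possible as far as i know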
Spelling Correction
 Another important text preprocessing step is spelling correction.
 Typos are common in text data, and we might want to correct those spelling mistakes before we do our analysis.
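A minimal sketch using the third-party TextBlob library, one common option for this step:

from textblob import TextBlob  # third-party: pip install textblob

text = "speling correctin is a comon preprocessing step"
# correct() returns a TextBlob with its best guess for each word
print(str(TextBlob(text).correct()))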
Tokenization
 Tokenization is the process of breaking up text into separate tokens, which can be individual words, phrases, or whole sentences.
 In some cases, punctuation and special characters (symbols like %, &, $) are discarded in the process.
Contd…
A few common operations that require tokenization
include:
 Finding how many words or sentences appear in text
 Determining how many times a specific word or
phrase exists
 Accounting for which terms are likely to co-occur
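A minimal sketch with NLTK's tokenizers (the punkt models must be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = "Text mining is fun. Tokenization splits text into tokens."
print(sent_tokenize(text))       # list of sentences
print(word_tokenize(text))       # list of word and punctuation tokens
print(len(word_tokenize(text)))  # how many tokens appear in the text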
Parts of Speech (POS) Tagging
 This is one of the more advanced text preprocessing techniques.
 This step augments the input text with additional information about the sentence's grammatical structure.
 Each word is, therefore, placed into one of the predefined categories such as noun, verb, adjective, etc.
 This step is also sometimes referred to as grammatical tagging.
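A minimal NLTK sketch (the tagger model must be downloaded once):

import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS tagger model

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
# each token is paired with a tag such as DT (determiner),
# JJ (adjective), NN (noun) or VBZ (verb)
print(pos_tag(tokens))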
Term Frequency
 Term frequency tells you how often a term occurs in a document.
 Terms can be either individual words or phrases containing multiple words.
 Since documents differ in length, a term is likely to appear more times in longer documents than in shorter ones.
Contd…
 Thus, you can calculate term frequency by dividing the number of times the term appears by the total number of terms in the document, as a way of normalization.
 Term Frequency = [Number of times the term appears in the document] / [Total number of terms in the document]
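A minimal sketch computing this formula:

from collections import Counter

def term_frequency(term, document):
    # term frequency = count of the term / total number of terms
    tokens = document.lower().split()
    return Counter(tokens)[term] / len(tokens)

doc = "text mining turns text into insights"
print(term_frequency("text", doc))  # 2 / 6 = 0.333...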
While Working with Python Language….
 We will be using the NLTK (Natural Language Toolkit).

# import the necessary libraries
import nltk     # NLP toolkit: tokenizers, stemmers, taggers, corpora
import string   # constants such as string.punctuation
import re       # regular expressions for pattern-based cleaning
To Remove Punctuation
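A regex-based sketch, an alternative to the translate approach shown earlier:

import re

text = "hurray!!! we won, at last..."
# \w matches letters, digits and underscore; \s matches whitespace;
# everything else (i.e. punctuation) is removed
print(re.sub(r"[^\w\s]", "", text))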
To remove white space
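A minimal sketch that collapses runs of whitespace into single spaces:

text = "  too   much \t whitespace \n here  "
# split() breaks on any whitespace run; join() rebuilds with single spaces
print(" ".join(text.split()))  # too much whitespace here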
THANK YOU
Email: [email protected]
