Text Mining
What Is Text Mining?
• One of the domains that has created a lot of buzz in today's technological field is Text Mining. It is also called Text Data Mining, Information Extraction or Knowledge Discovery in Databases (KDD). For a newbie, trying to understand this vast domain might seem a cumbersome task, so let us look into it from scratch.
• "Text Mining is the discovery, by computer, of new, previously unknown information, by automatically extracting information from different written resources." This mainly involves finding novel insights, trends or patterns in text-based data. Such insights can be highly valuable in fields like business. The main sources of data for text mining are customer and technical support, emails and memos, advertising and marketing, human resources, as well as competitors.
Process of Text Mining
1. Text Preprocessing
2. Text Transformation
3. Feature Selection
4. Data Mining
5. Evaluation
1. Text Preprocessing
The raw text data obtained will be unstructured in nature, so it first needs to be cleaned. There are a few steps in this preprocessing.
1.1 Text Normalization
1.2 Tokenization
1.3 Stemming
1.4 Lemmatization
1.5 Part-of-speech Tagging
1.6 Chunking
1.7 Named Entity Recognition (NER)
1.8 Relationship Extraction
Example
• “It would be unfair to demand that people cease pirating files when
those same people aren’t paid for their participation in very lucrative
network schemes. Ordinary people are relentlessly spied on, and not
compensated for information taken from them. While I’d like to see
everyone eventually pay for music and the like, I’d not ask for it until
there’s reciprocity.”
1.1 Text Normalization
This process involves converting the data into a standard format. Here, the whole text is converted to upper or lower case, and the numbers, punctuation, accent marks and other diacritics are removed; white spaces can be collapsed and stop words removed as well. Python can be used to implement this.
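As a rough illustration, here is a minimal normalization sketch in Python (standard library only; the regular expression is one possible choice and the sample text is shortened):

import re

def normalize(text):
    # Convert the whole text to lowercase
    text = text.lower()
    # Remove numbers, punctuation, apostrophes and other non-letter characters
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse runs of white space into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize("It would be unfair to demand that people cease pirating files..."))
# it would be unfair to demand that people cease pirating files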
After Text Normalization
After text normalization, the example provided would look like this:
"it would be unfair to demand that people cease pirating files when those same people arent paid for their
participation in very lucrative network schemes ordinary people are relentlessly spied on and not
compensated for information taken from them while id like to see everyone eventually pay for music and the
like id not ask for it until theres reciprocity"
In this normalized text:
- All letters are converted to lowercase.
- Punctuation, apostrophes, accent marks, and other diacritics are removed (there were no numbers to remove here).
- White spaces between words are retained.
- Stop words (such as "to", "that", "for", "and", etc.) are not removed in this example, but they could be as part of the normalization process if desired.
1.2 Tokenization
In this process, the whole text is split into smaller parts called tokens. Numbers, punctuation marks, words, etc. can all be considered tokens. Natural Language Toolkit (NLTK), spaCy and Gensim are a few tools that can be used for tokenization.
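For instance, a minimal tokenization sketch with NLTK (the 'punkt' models are a one-time download):

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

text = "it would be unfair to demand that people cease pirating files"
print(word_tokenize(text))
# ['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files']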
After Tokenization
• After tokenization, the example provided would be split into individual tokens. Here's how it might look:
• ['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people',
'arent', 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', 'ordinary', 'people', 'are',
'relentlessly', 'spied', 'on', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', 'while', 'id',
'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', 'id', 'not', 'ask', 'for', 'it', 'until',
'theres', 'reciprocity']
• In this tokenized text:
- Each word is a separate token (punctuation was already removed during normalization).
- White spaces are not retained.
- Numbers are not present in this example, but if they were, they would also be separate tokens.
1.3 Stemming
It is the process of reducing words to their stem, base or root form. The two main algorithms used for this process are the Porter stemming algorithm and the Lancaster stemming algorithm. NLTK as well as Snowball can be used for this.
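For example, a minimal stemming sketch comparing the two algorithms in NLTK:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["people", "participation", "lucrative", "relentlessly"]:
    # Each stemmer reduces the word to its (possibly non-word) stem
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
# Porter, for instance, gives participation -> particip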
After Stemming
After stemming, words are reduced to their base or root form. Here's how the example might look after stemming using the
Porter stemming algorithm:
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'arent', 'paid', 'for',
'their', 'particip', 'in', 'veri', 'lucrat', 'network', 'scheme', 'ordinari', 'peopl', 'are', 'relentless', 'spied', 'on', 'and', 'not', 'compens',
'for', 'inform', 'taken', 'from', 'them', 'while', 'id', 'like', 'to', 'see', 'everyon', 'eventu', 'pay', 'for', 'music', 'and', 'the', 'like', 'id',
'not', 'ask', 'for', 'it', 'until', 'there', 'reciproci']
- Words like "people" become "peopl", "participation" becomes "particip", "lucrative" becomes "lucrat", etc.
- The words are reduced to their base form, which may not always be a valid word but captures the essence of the word's meaning.
1.4 Lemmatization
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. But, compared to stemming, lemmatization does not simply chop off the inflections. Instead, it uses lexical knowledge bases such as WordNet to get the correct base (dictionary) forms of words.
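A minimal lemmatization sketch with NLTK's WordNet lemmatizer (the WordNet data is a one-time download):

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("files"))              # file (nouns are the default)
print(lemmatizer.lemmatize("pirating", pos="v"))  # pirate (pos='v' requests verb lookup)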
After Lemmatization
• After lemmatization, words are reduced to their base or dictionary form (lemma). Here's how the example might
look after lemmatization:
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirate', 'file', 'when', 'those', 'same', 'people', 'arent',
'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'scheme', 'ordinary', 'people', 'are', 'relentlessly',
'spied', 'on', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', 'while', 'id', 'like', 'to', 'see',
'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', 'id', 'not', 'ask', 'for', 'it', 'until', 'there', 'reciprocity']
In this lemmatized text:
- Inflected forms are mapped to dictionary words: "pirating" becomes "pirate", "files" becomes "file", "schemes" becomes "scheme", while words like "demand" and "participation" remain unchanged.
- The words are reduced to their base form, which is a valid word found in the dictionary. Lemmatization aims to bring words to their canonical form.
1.5 Part-of-speech Tagging
It aims to assign a part of speech to each word of a given text based on its meaning and context. NLTK, spaCy and Pattern are a few libraries that can be used for this.
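For example, a minimal POS-tagging sketch with NLTK (the tagger model is a one-time download):

import nltk
nltk.download("averaged_perceptron_tagger")  # one-time download of the tagger model
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("it would be unfair to demand that people cease pirating files")
print(pos_tag(tokens))
# [('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'), ...]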
After POS Tagging
After part-of-speech (POS) tagging, each word in the example is labeled with its corresponding part of speech. Here's how the
example might look after POS tagging:
[('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'), ('to', 'TO'), ('demand', 'VB'), ('that', 'IN'), ('people', 'NNS'), ('cease',
'VBP'), ('pirating', 'VBG'), ('files', 'NNS'), ('when', 'WRB'), ('those', 'DT'), ('same', 'JJ'), ('people', 'NNS'), ('arent', 'JJ'),
('paid', 'VBN'), ('for', 'IN'), ('their', 'PRP$'), ('participation', 'NN'), ('in', 'IN'), ('very', 'RB'), ('lucrative', 'JJ'), ('network',
'NN'), ('schemes', 'NNS'), ('ordinary', 'JJ'), ('people', 'NNS'), ('are', 'VBP'), ('relentlessly', 'RB'), ('spied', 'VBN'), ('on', 'IN'),
('and', 'CC'), ('not', 'RB'), ('compensated', 'VBN'), ('for', 'IN'), ('information', 'NN'), ('taken', 'VBN'), ('from', 'IN'), ('them',
'PRP'), ('while', 'IN'), ('id', 'NN'), ('like', 'IN'), ('to', 'TO'), ('see', 'VB'), ('everyone', 'NN'), ('eventually', 'RB'), ('pay', 'VB'),
('for', 'IN'), ('music', 'NN'), ('and', 'CC'), ('the', 'DT'), ('like', 'NN'), ('id', 'NN'), ('not', 'RB'), ('ask', 'VB'), ('for', 'IN'), ('it',
'PRP'), ('until', 'IN'), ('theres', 'NNS'), ('reciprocity', 'NN')]
- Each word is paired with its corresponding part of speech tag. For example, "it" is tagged as PRP (personal pronoun),
"would" as MD (modal), "be" as VB (verb), and so on.
- These tags provide information about the syntactic role of each word in the sentence.
1.6 Chunking
It is a natural language process that identifies constituent parts of sentences (such as noun phrases and verb phrases) and links them to higher-order units that have discrete grammatical meanings. NLTK is a good tool for this.
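A minimal chunking sketch with NLTK's RegexpParser; the chunk grammar below is one simple illustrative choice:

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

# NP: optional determiner, any adjectives, then a noun or pronoun
# VP: one or more consecutive verb forms
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*|PRP>}
  VP: {<VB.*>+}
"""
parser = RegexpParser(grammar)
tagged = pos_tag(word_tokenize("It is going to rain today."))
print(parser.parse(tagged))
# (S (NP It/PRP) (VP is/VBZ going/VBG) to/TO (VP rain/VB) (NP today/NN) ./.)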
After Chunking
• After chunking, the text is represented as a hierarchical structure that groups the tagged words of each sentence into phrases. Here's how a simple sentence such as "It is going to rain today." might look after chunking with a tool like NLTK:
(S
  (NP (PRP It))
  (VP
    (VBZ is)
    (VBG going))
  (TO to)
  (VP (VB rain))
  (NP (NN today)))
1.7 Named Entity Recognition (NER)
NER locates named entities in text and classifies them into categories such as people, organizations and locations. Its output is often written in IOB format, where B- marks the beginning of an entity, I- its continuation, and O a token outside any entity. Let's say our example text is: "This helps in identifying relations among named entities like people, organizations, etc. It allows to get structured information from unstructured sources such as raw text." Tagged in IOB format it looks like this:
This/O helps/O in/O identifying/O relations/O among/O named/B-ORG entities/I-ORG like/O people/O ,/O organizations/O ,/O etc/O ./O It/O allows/O to/O get/O structured/O information/O from/O unstructured/O sources/O such/O as/O raw/O text/O ./O
1.8 Relationship Extraction
This step identifies relations among the named entities found by NER, such as those between people and organizations. It allows structured information to be extracted from unstructured sources such as raw text.
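A rough NER sketch using NLTK's built-in named-entity chunker (exact entity labels depend on the model, so treat the output comment as indicative):

import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)  # one-time downloads for tokenizer, tagger and NE chunker
from nltk import word_tokenize, pos_tag, ne_chunk

tree = ne_chunk(pos_tag(word_tokenize("John Smith works at Google in London.")))
print(tree)
# Entities appear as labelled subtrees, e.g. (PERSON John/NNP Smith/NNP),
# (ORGANIZATION Google/NNP) and (GPE London/NNP)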
2. Text Transformation
After preprocessing, each document is transformed into a numeric vector, most commonly with TF-IDF:
- The value of each element represents the TF-IDF score of the corresponding term in the document.
- Stop words and punctuation have been removed, and terms have been stemmed or lemmatized as appropriate.
- This vector representation allows us to perform various mathematical operations and comparisons to analyze the similarity or dissimilarity between documents.
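As a quick practical illustration (using scikit-learn here, an assumption, since the examples above use NLTK), each document becomes one row of a TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["it is going to rain today",
        "today it is not going to rain"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # the vocabulary (one column per term)
print(X.toarray())                         # TF-IDF score of each term in each document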
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) scores how important a word is to a document within a collection. The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:
• Clean / preprocess the data: standardise it, normalize it (all lower case) and lemmatize it (reduce all words to their root words).
• Tokenize the words with their frequencies.
• Find the TF for each word.
• Find the IDF for each word.
• Vectorize the vocabulary.
How Do You Calculate TF and IDF?
TF = (number of repetitions of a word in a document) / (number of words in the document)
IDF = log[(number of documents) / (number of documents containing the word)]
The TF-IDF score of a word in a document is simply TF × IDF. To find TF-IDF we need to perform the steps we laid out above, so let's get to it.
Step 1: Clean Data and Tokenize
Step 2: Find TF
Take the sentence "It is going to rain today."
Its TF = (number of repetitions of the word in the document) / (number of words in the document). The sentence has 6 words, each occurring once, so every word has TF = 1/6.
Continue for the rest of the sentences.
Step 3: Find IDF
IDF = log[(number of documents) / (number of documents containing the word)]. A word that appears in every document gets IDF = log(1) = 0, so words common to the whole collection are weighted down, while rarer words are weighted up.
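Putting the three steps together, here is a minimal by-hand TF-IDF sketch for a tiny corpus (the second and third sentences are made up for illustration):

import math

docs = [
    "it is going to rain today".split(),
    "today i am not going outside".split(),
    "i am going to watch the match".split(),
]

def tf(word, doc):
    # TF = repetitions of the word in the document / words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF = log(number of documents / number of documents containing the word)
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

for i, doc in enumerate(docs):
    print(f"doc {i}: tf-idf('today') = {tf('today', doc) * idf('today', docs):.3f}")
# 'today' is in 2 of the 3 documents, so IDF = log(3/2) ≈ 0.405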