
Gaurav Ojha
Visiting Faculty

Text Mining
What Is Text Mining?
• One of the domains that has created a lot of buzz in today's technological field is Text Mining. It is also called Text Data Mining, and it is closely related to Information Extraction and Knowledge Discovery in Databases (KDD). For a newbie, trying to understand this vast domain might seem a cumbersome task, so let us look into it from scratch.
• "Text Mining is the discovery, by computer, of new, previously unknown information, by automatically extracting information from different written resources." This mainly involves finding novel insights, trends, or patterns in text-based data. Such insights can be highly valuable in fields like business. The main sources of data for text mining are customer and technical support, emails and memos, advertising and marketing, human resources, and competitor material.
Process of Text Mining
1. Text Preprocessing
2. Text Transformation
3. Feature Selection
4. Data Mining
5. Evaluation
1. Text Preprocessing
The raw text data obtained will be unstructured in nature. First, it needs to be cleaned. There are a few steps in this pre-
processing.
1.1 Text Normalization
1.2 Tokenization
1.3 Stemming
1.4 Lemmatization
1.5 Part-of-speech Tagging
1.6 Chunking
1.7 Named Entity Recognition (NER)
1.8 Relationship Extraction
Example
• “It would be unfair to demand that people cease pirating files when
those same people aren’t paid for their participation in very lucrative
network schemes. Ordinary people are relentlessly spied on, and not
compensated for information taken from them. While I’d like to see
everyone eventually pay for music and the like, I’d not ask for it until
there’s reciprocity.”
1.1 Text Normalization
This process involves converting the data into a standard format. Here, the whole text is converted to upper or lower case, and numbers, punctuation, accent marks, extra white space, and other diacritics are removed; stop words may also be removed. Python can be used to implement this.
After Text Normalization
After text normalization, the example provided would look like this:
"it would be unfair to demand that people cease pirating files when those same people arent paid for their
participation in very lucrative network schemes ordinary people are relentlessly spied on and not
compensated for information taken from them while id like to see everyone eventually pay for music and the
like id not ask for it until theres reciprocity"
In this normalized text:
- All letters are converted to lowercase.
- Punctuation, apostrophes, and accent marks are removed (there are no numbers in this example).
- Single spaces between words are retained.
- Stop words (such as "to", "that", "for", and "and") are not removed here, but they could be as a further part of the normalization process if desired.
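A minimal sketch of this step, using only the Python standard library (stop-word removal is omitted so the output matches the example above):

import string

def normalize(text: str) -> str:
    # Lowercase, then strip ASCII punctuation plus curly quotes/apostrophes.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + "’‘“”"))
    # Collapse runs of white space into single spaces.
    return " ".join(text.split())

raw = ("It would be unfair to demand that people cease pirating files "
       "when those same people aren’t paid for their participation in "
       "very lucrative network schemes.")
print(normalize(raw))
# it would be unfair to demand that people cease pirating files when
# those same people arent paid for their participation in very
# lucrative network schemes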
1.2 Tokenization
In this process, the whole text is split into smaller parts called tokens. The numbers,
punctuation marks, words, etc. can be considered as tokens. Natural Language
Toolkit (NLTK), Spacy and Gensim are a few tools that can be used for
tokenization.
After Tokenization
• After tokenization, the example provided would be split into individual tokens. Here's how it might look:
• ['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people',
'arent', 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', 'ordinary', 'people', 'are',
'relentlessly', 'spied', 'on', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', 'while', 'id',
'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', 'id', 'not', 'ask', 'for', 'it', 'until',
'theres', 'reciprocity']
• In this tokenized text:
- Each word is a separate token; punctuation was already removed during normalization, but punctuation marks would otherwise be tokens too.
- White space is discarded.
- Numbers are not present in this example, but if they were, they would also be considered separate tokens.
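A sketch of this step with NLTK (the 'punkt' tokenizer models, or 'punkt_tab' in newer NLTK releases, must be downloaded on first use):

import nltk
nltk.download("punkt", quiet=True)  # tokenizer models; 'punkt_tab' in newer NLTK
from nltk.tokenize import word_tokenize

normalized = "it would be unfair to demand that people cease pirating files"
print(word_tokenize(normalized))
# ['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files']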
1.3 Stemming
It is the process of reducing words to their stem, base, or root form.
The two main algorithms used for this are the Porter stemming
algorithm and the Lancaster stemming algorithm. NLTK as well as
Snowball can be used for this.
After Stemming
After stemming, words are reduced to their base or root form. Here's how the example might look after stemming using the
Porter stemming algorithm:

['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'arent', 'paid', 'for',
'their', 'particip', 'in', 'veri', 'lucrat', 'network', 'scheme', 'ordinari', 'peopl', 'are', 'relentless', 'spied', 'on', 'and', 'not', 'compens',
'for', 'inform', 'taken', 'from', 'them', 'while', 'id', 'like', 'to', 'see', 'everyon', 'eventu', 'pay', 'for', 'music', 'and', 'the', 'like', 'id',
'not', 'ask', 'for', 'it', 'until', 'there', 'reciproci']

In this stemmed text:

- Words like "participation" become "particip", "lucrative" becomes "lucrat", "eventually" becomes "eventu", and so on, while words like "demand" are already in stem form and remain unchanged.

- The words are reduced to a base form that may not always be a valid word but captures the essence of the word's meaning.
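A sketch of this step with NLTK's Porter stemmer (exact stems can vary slightly between stemmer implementations):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["people", "participation", "compensated", "ordinary", "eventually"]
print([stemmer.stem(t) for t in tokens])
# approximately: ['peopl', 'particip', 'compens', 'ordinari', 'eventu']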
1.4 Lemmatization
The aim of lemmatization, like stemming, is to reduce
inflectional forms to a common base form. But unlike
stemming, lemmatization does not simply strip off the
inflections. Instead, it uses lexical knowledge bases such as
WordNet to get the correct base forms of words.
After Lemmatization
• After lemmatization, words are reduced to their base or dictionary form (lemma). Here's how the example might
look after lemmatization:
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirate', 'file', 'when', 'those', 'same', 'people', 'arent',
'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'scheme', 'ordinary', 'people', 'are', 'relentlessly',
'spied', 'on', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', 'while', 'id', 'like', 'to', 'see',
'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', 'id', 'not', 'ask', 'for', 'it', 'until', 'there', 'reciprocity']
In this lemmatized text:
- Words like "pirating" become "pirate" and "files" becomes "file", while words such as "demand", "participation", and "lucrative" are already lemmas and remain unchanged.
- The words are reduced to their base form, which is a valid dictionary word. Lemmatization aims to bring words to their canonical form.
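A sketch of this step with NLTK's WordNet lemmatizer (the 'wordnet' corpus must be downloaded first; passing a part-of-speech hint gives better results than the noun default):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("files"))              # file
print(lemmatizer.lemmatize("schemes"))            # scheme
print(lemmatizer.lemmatize("pirating", pos="v"))  # pirate
print(lemmatizer.lemmatize("spied", pos="v"))     # spy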
1.5 Part-of-speech Tagging
It aims to assign a part of speech to each word of a given text based on
its meaning and context. NLTK, spaCy, and Pattern are a few tools that
can be used for this.
After POS Tagging
After part-of-speech (POS) tagging, each word in the example is labeled with its corresponding part of speech. Here's how the
example might look after POS tagging:

[('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'), ('to', 'TO'), ('demand', 'VB'), ('that', 'IN'), ('people', 'NNS'), ('cease',
'VBP'), ('pirating', 'VBG'), ('files', 'NNS'), ('when', 'WRB'), ('those', 'DT'), ('same', 'JJ'), ('people', 'NNS'), ('arent', 'JJ'),
('paid', 'VBN'), ('for', 'IN'), ('their', 'PRP$'), ('participation', 'NN'), ('in', 'IN'), ('very', 'RB'), ('lucrative', 'JJ'), ('network',
'NN'), ('schemes', 'NNS'), ('ordinary', 'JJ'), ('people', 'NNS'), ('are', 'VBP'), ('relentlessly', 'RB'), ('spied', 'VBN'), ('on', 'IN'),
('and', 'CC'), ('not', 'RB'), ('compensated', 'VBN'), ('for', 'IN'), ('information', 'NN'), ('taken', 'VBN'), ('from', 'IN'), ('them',
'PRP'), ('while', 'IN'), ('id', 'NN'), ('like', 'IN'), ('to', 'TO'), ('see', 'VB'), ('everyone', 'NN'), ('eventually', 'RB'), ('pay', 'VB'),
('for', 'IN'), ('music', 'NN'), ('and', 'CC'), ('the', 'DT'), ('like', 'NN'), ('id', 'NN'), ('not', 'RB'), ('ask', 'VB'), ('for', 'IN'), ('it',
'PRP'), ('until', 'IN'), ('theres', 'NNS'), ('reciprocity', 'NN')]

In this tagged text:

- Each word is paired with its corresponding part of speech tag. For example, "it" is tagged as PRP (personal pronoun),
"would" as MD (modal), "be" as VB (verb), and so on.

- These tags provide information about the syntactic role of each word in the sentence.
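A sketch of this step with NLTK's default Penn Treebank tagger (the 'averaged_perceptron_tagger' model, or 'averaged_perceptron_tagger_eng' in newer NLTK releases, must be downloaded first):

import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = ["it", "would", "be", "unfair", "to", "demand"]
print(nltk.pos_tag(tokens))
# [('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'), ('to', 'TO'), ('demand', 'VB')]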
1.6 Chunking
It is a natural language processing step that identifies constituent parts
of sentences (such as noun phrases) and links them to higher-order units
that have discrete grammatical meaning. NLTK is a good tool for this.
After Chunking
• After chunking, the text is represented as a shallow hierarchical structure that groups words into constituent phrases. Here is a fragment of how the running example might look after chunking with a tool like NLTK:
(S
  (NP (PRP it))
  (VP
    (MD would)
    (VB be)
    (ADJP (JJ unfair)) ...))

In this chunked representation:

- The text is broken down into its constituent parts, such as noun phrases (NP), verb phrases (VP), and so on.
- Each part is nested within its parent part to show the hierarchical structure of the sentence.
- This hierarchical structure helps in understanding the syntactic relationships between different parts of the sentence.
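A sketch of NP-chunking with NLTK's RegexpParser. The grammar below is a deliberately simple illustrative pattern (optional determiner, any number of adjectives, then one or more nouns), not a full parser:

import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # a minimal noun-phrase pattern
chunker = nltk.RegexpParser(grammar)

tagged = [("ordinary", "JJ"), ("people", "NNS"), ("are", "VBP"),
          ("relentlessly", "RB"), ("spied", "VBN"), ("on", "IN")]
print(chunker.parse(tagged))
# (S (NP ordinary/JJ people/NNS) are/VBP relentlessly/RB spied/VBN on/IN)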
1.7 Named Entity Recognition (NER)
It aims to find named entities in text and classify them into
pre-defined categories. NLTK and spaCy can be used for this.
After NER
Applying NER to the sentence "It aims to find named entities in text and classify them into pre-defined categories." gives:
[('It', 'O'), ('aims', 'O'), ('to', 'O'), ('find', 'O'), ('named', 'O'), ('entities', 'O'), ('in', 'O'),
('text', 'O'), ('and', 'O'), ('classify', 'O'), ('them', 'O'), ('into', 'O'), ('pre-defined', 'O'),
('categories', 'O'), ('.', 'O')]
In this annotated example:
• Named entities, where present, are tagged with labels like "ORG" (organization), "LOC" (location), or "PER" (person).
• Non-entity words are tagged "O" (outside); this sentence contains no named entities, so every token is tagged "O".
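A sketch of this step with spaCy's small English model (installed with: python -m spacy download en_core_web_sm). A sentence with real entities is used here, since the example above contains none:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Apple', 'ORG'), ('Steve Jobs', 'PERSON'), ('California', 'GPE')]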
1.8 Relationship Extraction
This helps in identifying relations among named entities like people, organizations,
etc. It allows us to obtain structured information from unstructured sources such as
raw text.
After Relationship Extraction
After relation extraction, the example text would showcase identified relationships among named entities. Here's how the
example might look:

This/O helps/O in/O identifying/O relations/O among/O named/B-ORG entities/I-ORG like/O people/O ,/O
organizations/O ,/O etc/O ./O It/O allows/O to/O get/O structured/O information/O from/O unstructured/O sources/O
such/O as/O raw/O text/O ./O

In this annotated example:

- Candidate entity spans are marked with IOB tags: B- begins an entity, I- continues it, and O marks tokens outside any entity (here the phrase "named entities" is tagged purely for illustration).
- Relations among these entities would then be recognized, although no explicit relation is present in this short example.
- The annotation helps in understanding the structured information extracted from the unstructured text.
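A rough relation-extraction sketch using spaCy's dependency parse to pull (subject, verb, object) triples. Real systems use richer patterns or learned models; this only illustrates the idea:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs founded Apple in California.")

for token in doc:
    if token.pos_ == "VERB":
        # Direct grammatical subject and object of each verb.
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects = [c for c in token.children if c.dep_ == "dobj"]
        for s in subjects:
            for o in objects:
                print((s.text, token.text, o.text))
# ('Jobs', 'founded', 'Apple')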
2. Text Transformation
This process mainly involves the document representation by the text it contains and the number of
occurrences. There are mainly two approaches for this step.
2.1 Bag of Words: A text is represented as the bag (multiset) of its words, disregarding grammar and
even word order, but keeping multiplicity.
2.2 Vector Space: In this model, a document is converted into a vector of index terms derived from
words. Each dimension of the vector corresponds to a term that appears in the text, and its weight
records the importance of that term to the text.
2.1 BoW
After using the bag-of-words representation, the example text would be transformed into a vector of word
counts or frequencies. For a short sentence in which every word occurs exactly once, it might look like this:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In this bag-of-words representation:
- Each element in the vector corresponds to a unique word in the text.
- The value of each element represents the count of occurrences of the corresponding word in the text.
- Stop words and punctuation may be removed beforehand depending on preprocessing choices.
- The order of the elements typically doesn't matter, as it's based on the frequency of words in the text rather
than their sequence.
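A sketch of this step with scikit-learn's CountVectorizer (each column is a unique word; each value is that word's count in the document):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This helps in identifying relations among named entities "
        "like people, organizations, etc."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # every word occurs once here, so the vector is all 1s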
2.2 Vector Space
After representing the text in a vector space model, each document (or in this case, sentence) would be transformed into a numerical vector
where each dimension represents a term and its value represents some measure of the term's importance in the document. This could be
based on word frequency, TF-IDF (Term Frequency-Inverse Document Frequency), or other measures. Here's a hypothetical example using
TF-IDF:

Let's say our example sentence is: "This helps in identifying relations among named entities like people, organizations, etc."

Using TF-IDF, the vector representation might look like this:

[0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0, 0]

In this vector representation:

- Each element corresponds to a term in the vocabulary.

- The value of each element represents the TF-IDF score of the corresponding term in the document.

- Terms not present in the document have a TF-IDF score of 0.

- Stop words and punctuation have been removed, and terms have been stemmed or lemmatized as appropriate.

- This vector representation allows us to perform various mathematical operations and comparisons to analyze the similarity or dissimilarity
between documents.
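A sketch of this representation with scikit-learn's TfidfVectorizer. Note that scikit-learn's TF-IDF uses smoothing and L2 normalization, so its values differ from the simple textbook formula:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This helps in identifying relations among named entities.",
    "It allows us to get structured information from raw text.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # one row per document, one column per term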
What is TF-IDF?
The process of finding the meaning of documents using TF-IDF is very similar to bag of words:
• Clean / preprocess the data: standardize it, normalize it (all lower case), and lemmatize it (reduce all words to their root words).
• Tokenize the words with their frequencies.
• Find the TF for each word.
• Find the IDF for each word.
• Vectorize the vocabulary.
How Do You Calculate TF and IDF?
TF = (number of repetitions of the word in a document) / (number of words in the document)
IDF = log[(number of documents) / (number of documents containing the word)]
The Term Frequency-Inverse Document Frequency score of a word in a document is the product of the two: TF-IDF = TF × IDF.
Example
Let’s cover an example of 3 documents -
• Document 1 It is going to rain today.

• Document 2 Today I am not going outside.

• Document 3 I am going to watch the season premiere.

To find TF-IDF we need to perform the steps we laid out above, let’s get to it.
Step 1: Clean Data and Tokenize
After cleaning, the documents become "it is going to rain today", "today i am not going outside", and "i am going to watch the season premiere", each tokenized into its words.
Step 2: Find TF
For Document 1, "It is going to rain today.", find each word's TF = (number of repetitions of the word in the document) / (number of words in the document). Each of its 6 words appears once, so each has TF = 1/6. Continue for the rest of the documents.
Step 3: Find IDF
Find the IDF for each word (we do this for feature names only, i.e. vocabulary words with stop words removed):
IDF = log[(number of documents) / (number of documents containing the word)]
For example, "going" appears in all 3 documents, so IDF("going") = log(3/3) = 0, while "rain" appears in only 1, so IDF("rain") = log(3/1) = log 3.
Step 4: Build the model, i.e., stack all the words next to each other with their TF-IDF scores (a from-scratch implementation of these steps is sketched below).
Step 5: Compare the results and use the table to ask questions of the data.
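A from-scratch sketch of the TF-IDF computation above, using the three example documents. Base-10 logs are used here; libraries often use natural logs and add smoothing, so exact values differ:

import math

docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.split() for d in docs]

def tf(word, doc):
    # Share of the document's words that are this word.
    return doc.count(word) / len(doc)

def idf(word):
    containing = sum(1 for d in tokenized if word in d)
    return math.log10(len(tokenized) / containing)

for word in ["going", "rain", "today"]:
    print(word, [round(tf(word, d) * idf(word), 3) for d in tokenized])
# 'going' is in all 3 documents, so its IDF (and TF-IDF) is 0 everywhere;
# 'rain' is only in document 1, so it scores highest there.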
3. Feature Selection
Feature selection is also known as attribute selection or variable selection. It is the
selection of the most relevant features from the available variables, i.e. those that give
the most information for your prediction task. Irrelevant features can increase the
complexity and decrease the accuracy of the analysis. The Pearson correlation
coefficient, chi-squared tests, recursive feature elimination, lasso regression, and tree-based
algorithms are a few methods that can be used for this (a chi-squared sketch follows below).
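A sketch of chi-squared feature selection with scikit-learn, keeping the k terms most associated with the class labels. The documents and labels here are invented toy data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great product, works well", "terrible, broke in a day",
        "works perfectly, great value", "awful quality, terrible support"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=4).fit(X, labels)
print(vec.get_feature_names_out()[selector.get_support()])
# the 4 terms whose counts are most associated with the labels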
4. Data Mining
Data Mining: Here we combine the process of text mining with traditional data
mining techniques. Once the data has been structured by the above processes, classic
data mining techniques are applied to it to retrieve information. These techniques
include classification, clustering, regression, outlier detection, sequential patterns,
prediction, and association rules.
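As a minimal end-to-end sketch, bag-of-words features can feed a classic classifier such as multinomial Naive Bayes (the tiny dataset below is invented for illustration only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "meeting at 10 tomorrow",
               "free cash offer", "project report attached"]
train_labels = ["spam", "ham", "spam", "ham"]

# Vectorize the texts, then fit the classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["claim your free prize"]))  # likely ['spam']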
5. Evaluation
After the data mining techniques are applied, we get an end result, which
must then be evaluated and checked for predictive accuracy.
Relevance and Applications of Text Mining
Huge amounts of data are created every day through economic, academic, and social activities.
All this information can be utilized optimally with the right combination of skill sets, and data
and text mining and analytics can help with this. Text mining is used extensively in today's
business and corporate domains. Some of its applications are given below.
I. Risk Management
• The humongous amount of textual data available helps companies take a deeper look into
their health and performance.
• Risk analysis is an important factor in the development of every company.
• Insufficient risk analysis can result in major failures for the company.
• Text mining can enable a company to mitigate risk factors, and by analyzing the documents
and profiles of various clients it can help in deciding which firms to invest in, which people
to give loans to, and much more.
II. Customer Care Services
• Text mining and natural language processing have been used extensively to
enhance the customer experience.
• Nowadays, chatbots that mimic human customer care officers are used on many
websites to make the user experience more customized.
• Text mining is used to provide rapid, automated responses to customers, which
has reduced their reliance on call-center operators to solve their problems.
III. Personalized Advertising:
• The field of digital advertising has been revolutionized by the development of
text and web mining, and this is one of the latest applications of text mining.
• The text data related to everything a person types or searches online is shared
with other companies, which in turn show ads that have a higher probability of
being clicked and converted into a sale.
IV. Spam Filtering:
• E-mail is one of the most widely used means of official communication.
• It has a really wide application, but its darker side is the spam mail that infests users'
inboxes.
• These spam mails use up a lot of storage, and they can also be an entry point for viruses
and scams.
• Various companies use intelligent text mining software as well as traditional keyword-matching
techniques to identify and filter spam mails.
V. Social Media Analysis and Crime Prevention
• Social media has been trending for a long time, and millions of ordinary users rely on it as a
means of communication.
• The anonymous nature of the internet has made it easy for many criminals to plan their various
strategies online.
• Identifying potentially threatening messages among normal ones has been made possible by the
use of advanced text mining software.
• Also, online text analysis is a good way to find what is 'hot' or trending at a particular time,
which can be highly beneficial for various commercial companies.