Text Mining - Analytics
Natural language processing
Natural language processing (NLP) is an intersection between the fields of computer
science, linguistics and artificial intelligence. NLP is concerned with the interactions
between computers and human (natural) languages, in particular how to program
computers to process, analyze and model large amounts of natural language data.
What is Text Mining?
Introduction
1. Text data requires special preparation before we can start using it for
predictive modeling.
2. The text must be parsed to split it into words, a step called tokenization.
3. Then the words need to be encoded as integers or floating-point values for
use as input to a machine learning algorithm, a step called feature extraction (or
vectorization).
4. The scikit-learn library offers easy-to-use tools to perform both tokenization
and feature extraction of text data.
5. We now learn to prepare text data for predictive modeling in Python with
scikit-learn.
6. Learn to use the following classes:
a. CountVectorizer: convert text to word-count vectors
b. TfidfVectorizer: convert text to word-frequency (TF-IDF) vectors
c. HashingVectorizer: convert text to fixed-length vectors of hashed word counts
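Of the three, only HashingVectorizer is not revisited later in this deck. A minimal sketch, assuming scikit-learn is installed; the two example sentences are made up for illustration:

```python
# HashingVectorizer maps tokens to column indices with a hash function,
# so no vocabulary is stored and no fit step is needed (it is stateless).
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["The quick brown fox", "jumped over the lazy dog"]

# n_features fixes the output width; 2**4 is tiny and only for illustration
# (real applications use 2**18 or more to keep hash collisions rare).
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.transform(docs)

print(X.shape)      # (2, 16)
print(X.toarray())  # signed, L2-normalized values by default
```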
Applications of Text Mining
1. Analyze open-ended responses, where respondents give their views or
opinions without any constraints.
4. Auto-documentation
Text Mining Techniques
3. Clustering – find similar documents, for instance similar queries in tech-support
databases, for automated resolution
4. Summarization – find the key parts of a document, or what the document
refers to, and summarize the details
Formulating a Text Analytics Task
Text Mining Challenges
Image Source: http://www.ruwhim.com/?p=47532
Stop words
1. Many of the most frequently used words in English are not useful
in text analytics – these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain-specific stop-word list may be
constructed, as in the sketch below
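A minimal sketch of stop-word removal, using scikit-learn's built-in English stop-word list; the domain-specific additions and the sample sentence are hypothetical:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# Extend the generic English list with a (hypothetical) domain-specific list.
domain_stop_words = {"patient", "hospital"}
stop_words = list(ENGLISH_STOP_WORDS | domain_stop_words)

vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(["The patient walked to the hospital and waited"])
print(vectorizer.get_feature_names_out())  # ['waited' 'walked']
```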
Bag of Words model
Word Counts with CountVectorizer
1. CountVectorizer tokenizes a collection of text documents and
builds a vocabulary of known words (textClassificationWithML.ipynb)
2. We can use it as follows:
a. Create an instance of the CountVectorizer class.
b. Call the fit() function in order to learn a vocabulary from one or more
documents.
c. Call the transform() function on one or more documents as needed to
encode each document as a vector.
d. An encoded vector is returned with a length of the entire vocabulary
and an integer count for the number of times each word appeared in
the document.
e. The vectors returned from a call to transform() are sparse vectors, and
we can convert them to NumPy arrays by calling the toarray() function to
inspect them and better understand what is going on.
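A minimal sketch of steps (a)–(e), using a one-sentence toy document (the same sentence used in the Bag of Words example later in this deck):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back"]

vectorizer = CountVectorizer()        # (a) create an instance
vectorizer.fit(docs)                  # (b) learn the vocabulary
print(vectorizer.vocabulary_)         # term -> column index mapping

vector = vectorizer.transform(docs)   # (c) encode the document(s)
print(vector.shape)                   # (d) (1, 9): one doc, 9 vocabulary terms
print(type(vector))                   # (e) sparse matrix...
print(vector.toarray())               # ...dense view; "the" has count 2
```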
Document Term Matrix
textClassificationWithML.ipynb
Example: 10 documents × 6 terms
Feature Selection
• Performance of text classification algorithms can be optimized
by selecting only a subset of the discriminative terms
– even after stop-word removal
• Greedy search
– Start from the full set and delete one term at a time
– Find the least important variable
• The Gini index can be used for this in a classification problem (a sketch follows)
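A minimal sketch of selecting only discriminative terms; scikit-learn does not ship a Gini-based scorer for feature selection, so a chi-squared score is substituted here, and the documents and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["free prize money", "meeting agenda attached",
        "win free money now", "project meeting tomorrow"]
labels = [1, 0, 1, 0]                    # 1 = spam, 0 = ham (hypothetical)

X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_reduced = selector.transform(X)        # keep only the 3 best-scoring terms
print(X_reduced.shape)                   # (4, 3)
```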
Distances in document-term (DT) matrices
Bag of Words / Boolean model

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

Each column represents the view of a particular document: what terms are
contained in this document?
Each row represents the view of a particular term: what documents contain
this term?
To execute a query, pick out the rows corresponding to the query terms and
then apply the logic table of the corresponding Boolean operator.
Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
Bag of Words / Boolean model (contd.)
1. Query terms are combined logically using the Boolean operators AND,
OR, and NOT.
a. E.g., ((data AND mining) AND (NOT text))
2. Retrieval
a. Given a Boolean query, the system retrieves every document that makes
the query logically true.
b. This is called exact match.
3. The retrieval results are usually quite poor because term frequency is not
considered.
Bag of Words / Boolean model (contd.)

Example: dog AND fox

Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog            0      0      1      0      1      0      0      0
fox            0      0      1      0      1      0      1      0
dog AND fox    0      0      1      0      1      0      0      0

dog AND fox → Doc 3, Doc 5

Example: good AND party (the AND row follows from the two term rows)

Term           Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good             0      1      0      1      0      1      0      1
party            0      0      0      0      0      1      0      1
good AND party   0      0      0      0      0      1      0      1

good AND party → Doc 6, Doc 8
Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
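A minimal sketch of this exact-match retrieval in NumPy, using the dog and fox rows from the incidence matrix above:

```python
import numpy as np

docs = ["Doc %d" % i for i in range(1, 9)]
incidence = {
    "dog": np.array([0, 0, 1, 0, 1, 0, 0, 0], dtype=bool),
    "fox": np.array([0, 0, 1, 0, 1, 0, 1, 0], dtype=bool),
}

# dog AND fox: element-wise AND of the two term rows
hits = incidence["dog"] & incidence["fox"]
print([d for d, h in zip(docs, hits) if h])  # ['Doc 3', 'Doc 5']
```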
Bag of Words / Vector Space model
1. Each document is represented as a vector. The term weights are no longer 0
or 1.
2. Each term weight is computed based on term frequency. But raw term
frequency can mislead, since a term may occur frequently across all documents
in all classes (making it a poor discriminator).
3. TF-IDF uses this idea: the weight given to terms that are frequent across all
documents is reduced relative to other words.
Example: “The quick brown fox jumped over the lazy dog’s back”

Document vector in feature space: [1 1 1 1 1 1 1 1 2]

1st position corresponds to “back”
2nd position corresponds to “brown”
3rd position corresponds to “dog”
4th position corresponds to “fox”
5th position corresponds to “jump”
6th position corresponds to “lazy”
7th position corresponds to “over”
8th position corresponds to “quick”
9th position corresponds to “the” (which appears twice, hence the count of 2)

Image Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
Bag of Words / Vector Space model
• Represent a document by a term vector
– Term: a basic concept, e.g., a word or phrase
– Each term defines one dimension
– N terms define an N-dimensional space
– Each element of the vector corresponds to a term weight
– E.g., d = (x1, …, xN), where xi is the “importance” of term i
• A new document is assigned to the most likely category based on
vector similarity.
Bag of Words / Vector Space model
1. How to select terms to represent the documents as vectors
– Remove fluff words (stop words)
• e.g., “and”, “the”, “always”, “along”
– Use word stems so that variants of the same word do not become
separate dimensions (see the sketch after this list)
• e.g., “training”, “trainer”, “trained” => “train”
– Latent semantic indexing
2. How to assign weights to terms
– Not all words are equally important: some are more indicative
than others
• e.g., “Automobile” vs. “Car”
3. How to measure the similarity between document vectors
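A minimal sketch of the stemming step mentioned in point 1, using NLTK's PorterStemmer (an assumption: any stemmer would do, and NLTK must be installed separately):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["training", "trained", "trains", "connection", "connected"]
print({w: stemmer.stem(w) for w in words})
# "training", "trained", "trains" all map to "train"; stemming is a
# heuristic suffix-stripping process, so not every variant collapses.
```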
Bag of Words / Vector Space model
1. Given two document vectors, d1 = (x1, …, xN) and d2 = (y1, …, yN)
2. Similarity definition
– dot product: sim(d1, d2) = d1 · d2 = x1·y1 + x2·y2 + … + xN·yN
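A minimal sketch of the dot-product similarity, reusing the nine-term example vector from earlier; the second document vector is hypothetical. In practice the dot product is often length-normalized, giving the cosine similarity:

```python
import numpy as np

d1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2])  # the example vector from earlier
d2 = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1])  # a hypothetical second document

dot = np.dot(d1, d2)                                      # raw dot product
cosine = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))  # length-normalized
print(dot, round(cosine, 3))
```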
Lab: SMS classification
textClassificationWithML.ipynb
Word Frequencies with TfidfVectorizer
1. One issue with simple counts is that some words other than stop words appear
many times.
2. Document vectors with large count attributes will not be very meaningful in
the encoded vectors; it is like a dimension with a larger scale overwhelming the
others.
3. An alternative is to calculate word frequencies and standardize them, which is
done using TF-IDF.
4. This is an acronym that stands for “Term Frequency – Inverse Document
Frequency”, the components of the resulting scores assigned to each word.
5. Term Frequency: this summarizes how often a given word appears within a
document.
6. Inverse Document Frequency: this downscales words that appear a lot across
documents.
7. Without going into the math, TF-IDF scores are word-frequency scores that try
to highlight words that are more interesting, e.g. frequent in a document but not
across documents.
Bag of Words / Vector Space model
1. TF: Term Frequency, which measures how frequently a term occurs in a
document.
a. Since every document is different in length, it is possible that a term would
appear many more times in long documents than in shorter ones.
b. Thus, the term frequency is often divided by the document length (i.e., the
total number of terms in the document) as a way of normalization:
c. TF(t) = (Number of times term t appears in a document) / (Total number of
terms in the document)
Bag of Words / Vector Space model

w_ij = tf_ij × log(N / n_i)

where:
– w_ij = weight assigned to term i in document j
– tf_ij = number of occurrences of term i in document j
– N = total number of documents in the collection
– n_i = number of documents containing term i

[Worked example table for terms such as “information” and “interesting” not
reproduced.]
Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
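A minimal sketch of the weight formula, with made-up counts; a base-10 logarithm is assumed here, matching the worked values on the original slide:

```python
import math

tf_ij = 3   # occurrences of term i in document j (hypothetical)
N = 10      # total documents in the collection (hypothetical)
n_i = 2     # documents containing term i (hypothetical)

w_ij = tf_ij * math.log10(N / n_i)
print(round(w_ij, 3))  # 3 * log10(5) ≈ 2.097
```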
Bag of Words / Vector Space model
[Worked example table of length-normalized weights w′_ij scored against a
query (e.g., the term “siberia”, weight 0.71) not reproduced.]
Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
Word Frequencies with TfidfVectorizer
1. The TfidfVectorizer will tokenize documents, learn the
vocabulary and inverse document frequency weightings, and
allow you to encode new documents.
2. Alternately, if we already have a learned CountVectorizer, we can
use it with a TfidfTransformer to just calculate the inverse
document frequencies and start encoding documents.
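A minimal sketch of the first route on toy documents (TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)       # learn vocabulary + IDF, then encode

print(vectorizer.get_feature_names_out())
print(vectorizer.idf_)                   # "the" is in every doc -> lowest IDF
print(X.toarray().round(3))              # rows are L2-normalized by default
```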
Questions?