Text Mining & Analytics

Natural language processing
Natural language processing (NLP) lies at the intersection of computer science,
linguistics, and artificial intelligence. NLP is concerned with the interactions
between computers and human (natural) languages, in particular with how to program
computers to process, analyze, and model large amounts of natural language data.

What is Text Mining?

1. Text mining is the process of extracting useful insights /
   information from a body of text (classified as unstructured data).

2. The volume of unstructured data generated today is far greater than
   the volume of structured data (roughly 90:10).

3. Sources of text data include:
   a. e-mails, corporate Web pages, customer surveys
   b. social media and more…
   (Image source: twitter.com/timothy_hughes/status/619075227021090817)

4. A lot of information is locked up in text data.

Introduction

1. Text data requires special preparation before we can start using it for
   predictive modelling.
2. The text must first be parsed to split it into words (tokens), a step called
   tokenization.
3. Then the words need to be encoded as integers or floating-point values for
   use as input to a machine learning algorithm, a step called feature extraction (or
   vectorization).
4. The scikit-learn library offers easy-to-use tools to perform both tokenization
   and feature extraction of text data.
5. We now learn to prepare text data for predictive modeling in Python with
   scikit-learn.
6. Learn to use the following algorithms:
   a. CountVectorizer: convert text to word-count vectors
   b. TfidfVectorizer: convert text to word-frequency (TF-IDF) vectors
   c. HashingVectorizer: map words to integer column indices via hashing and build
      count vectors (see the sketch below)
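A minimal sketch of the hashing option (the two sentences are just sample text; HashingVectorizer is stateless, so no vocabulary is stored and the word-to-column mapping is one-way):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# n_features fixes the output dimensionality up front; hash collisions are
# possible, which is the price paid for not keeping a vocabulary in memory
vectorizer = HashingVectorizer(n_features=32)
X = vectorizer.transform(docs)   # no fit() needed: the hash function is fixed
print(X.shape)                   # (2, 32)
print(X.toarray())               # hashed, L2-normalized token counts
```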

Applications of Text Mining

1. Analyze open-ended responses where respondents give their views or opinions
   without any constraints.

2. Automatically process huge volumes of electronic messages and e-mails.

3. Classify e-mails (text) as spam or non-spam.

4. Analyze warranty or insurance claims for suspicious / anomalous claims.

5. Diagnose situations described by people in the form of text, e.g.
   customer-experience feedback.

6. Investigate competitors by crawling their web sites (be careful, as crawling is
   prohibited by many websites).
Applications of Text Mining: Legal Domain
1. Judgement summarization

2. Similar-judgement identification

3. Legal case outcome prediction
   a. This is domain specific, and making it generic is a challenge
   b. One can focus on consumer-complaint judgements if interested

4. Auto-documentation

5. Automation of judicial processes (this is a broader problem)

Text Mining Techniques

1. Information Extraction – analyzes unstructured text by identifying entities
   and their relationships.

2. Categorization – classifies a text document under one or more pre-determined
   categories, e.g. spam vs. ham mails, where each mail is a document.

3. Clustering – finds similar documents, for instance similar queries in tech-support
   databases, for automated resolution.

4. Summarization – finds the key parts of a document, or what the document
   refers to, and summarizes the details.

Formulating a Text Analytics Task

Text Mining Challenges

1. Relations among word surface forms and their senses:
   a. Homonymy: same form, but different meaning (e.g. bank: river bank,
      financial institution)
   b. Polysemy: same form, related meaning (e.g. bank: blood bank, financial
      institution)
   c. Synonymy: different form, same meaning (e.g. singer, vocalist)
   d. Hyponymy: one word denotes a subclass of another (e.g. breakfast,
      meal)

2. Word frequencies in texts follow a power-law distribution (Zipf's law):
   a. a small number of very frequent words (usually uninformative words)
   b. a large number of low-frequency words (a long tail of useful words)

Image source: http://www.ruwhim.com/?p=47532
Stop words
1. Many of the most frequently used words in English are not useful
   in text analytics – these words are called stop words.
   – the, of, and, to, …
   – Typically about 400 to 500 such words
   – For an application, an additional domain-specific stop-word list may be
     constructed

2. Why do we need to remove stop words?
   – To reduce the indexing (or data) file size
     • Stop words account for 20–30% of total word counts
   – To improve efficiency
     • Stop words are not useful for searching or text mining
     • Stop words always produce a large number of hits
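As a minimal sketch (the two sentences below are the same example documents used later in these slides), scikit-learn's CountVectorizer can drop a built-in English stop-word list during vectorization:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# stop_words='english' applies scikit-learn's built-in English stop-word list;
# a custom, domain-specific list can be passed instead
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # remaining vocabulary after stop-word removal
print(X.toarray())                          # document-term counts
```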

Bag Of Words model

1. We cannot work with text directly when using machine learning
   algorithms; we need to convert the text to numbers.
2. We may want to perform classification of documents, so each
   document is an “input” and a class label is the “output” for our
   predictive algorithm.
3. Algorithms take vectors of numbers as input, therefore we need
   to convert documents to fixed-length vectors of numbers.
4. A simple and effective model for thinking about text documents
   in machine learning is the Bag-of-Words model, or BoW.
5. The model is simple in that it throws away all of the order
   information in the words and focuses on the occurrence of
   words in a document.

Bag Of Words model

6. This can be done by assigning each word a unique number. Any
   document we see can then be encoded as a fixed-length vector
   with the length of the vocabulary of known words. The value in
   each position in the vector is filled with the count or
   frequency of the corresponding word in the encoded document.
7. This is the bag-of-words model, where we are only concerned
   with encoding schemes that represent which words are present, or
   the degree to which they are present, in encoded documents,
   without any information about order.
8. The scikit-learn library provides three different APIs for this
   (CountVectorizer, TfidfVectorizer and HashingVectorizer).

Word Counts with CountVectorizer
1. CountVectorizer tokenizes a collection of text documents and
   builds a vocabulary of known words (textClassificationWithML.ipynb).
2. We can use it as follows:
   a. Create an instance of the CountVectorizer class.
   b. Call the fit() function to learn a vocabulary from one or more
      documents.
   c. Call the transform() function on one or more documents as needed to
      encode each document as a vector.
   d. An encoded vector is returned with the length of the entire vocabulary
      and an integer count for the number of times each word appeared in
      the document.
   e. The vectors returned from a call to transform() are sparse vectors; we
      can convert them back to NumPy arrays, by calling the toarray() function,
      to inspect them and better understand what is going on.
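A minimal sketch of steps (a)–(e), using the same two example sentences that appear later in these slides (the exact column order of the vocabulary is decided by scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

vectorizer = CountVectorizer()         # (a) create an instance
vectorizer.fit(docs)                   # (b) learn the vocabulary
vectors = vectorizer.transform(docs)   # (c) encode each document as a vector

print(vectorizer.vocabulary_)   # word -> column index mapping
print(vectors.shape)            # (d) one row per document, one column per vocabulary word
print(vectors.toarray())        # (e) dense view of the sparse count vectors
```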

Document Term Matrix

• The most common form of representation in text mining is the
  term-document matrix
  – Term: typically a single word, but could be a word
    phrase like “data mining”
  – Document: a generic term meaning a collection of text to be retrieved
  – Can be large – terms often number 50k or more, and
    documents can be in the billions!
  – Entries can be binary, or counts (frequency counts)
textClassificationWithML.ipynb
Document Term Matrix
Example: 10 documents, 6 terms

        ML   Spark  Kafka  BigData  NoSql  SVM
D1      24    21      9       0       0      3
D2      32    10      5       0       3      0
D3      12    16      5       0       0      0
D4       6     7      2       0       0      0
D5      43    31     20       0       3      0
D6       2     0      0      18       7      6
D7       0     0      1      32      12      0
D8       3     0      0      22       4      4
D9       1     0      0      34      27     25
D10      6     0      0      17       4     23

• Each document is now just a vector of term counts, sometimes boolean
Document Term Matrix

1. The semantic content of the text is ignored
2. Before creating the DTM, all occurrences of the same term should look the same
3. Remove words carrying no information (stop words)
4. Express words in their root form (stemming)

Feature Selection
• The performance of text-classification algorithms can be improved
  by selecting only a subset of the discriminative terms
  – even after stop-word removal

• Greedy search
  – Start from the full set and delete one term at a time
  – Find the least important variable
    • The Gini index can be used for this in a classification problem (see the
      sketch below)

• Often performance does not degrade even with order-of-magnitude
  reductions
  – e.g. only 140 out of 20,000 terms needed for classification!
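The greedy backward-deletion procedure above is expensive on large vocabularies; a simpler, related sketch is to rank terms by the Gini-based importances of a tree ensemble and keep only the top k (the toy corpus and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# toy labelled corpus, invented for illustration only
docs = ["free offer win money now", "meeting agenda attached for review",
        "win a free prize today", "project review meeting tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Random forests split on the Gini criterion by default, so feature_importances_
# gives a Gini-based ranking of how discriminative each term is
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
top_k = 5
keep = np.argsort(forest.feature_importances_)[::-1][:top_k]
print(terms[keep])  # the k most discriminative terms
```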

Distances in DT matrices

• Given a document-term matrix representation, we can now define
  distances between documents (or terms!)
• Elements of the matrix can be 0/1 or term frequencies
  (sometimes normalized)
• We can use Euclidean or cosine distance
• Cosine similarity is the cosine of the angle between the two vectors
  (cosine distance is 1 minus this similarity)
• Not intuitive, but it has been proven to work well
• If two documents point in the same direction the cosine similarity is 1;
  if they have nothing in common it is 0
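A small sketch of cosine similarity between two document vectors, reusing rows D1 and D6 of the document-term matrix shown earlier:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows over the terms [ML, Spark, Kafka, BigData, NoSql, SVM]
d1 = np.array([[24, 21, 9, 0, 0, 3]])   # D1
d6 = np.array([[2, 0, 0, 18, 7, 6]])    # D6

print(cosine_similarity(d1, d6))   # low value: the documents share little
print(cosine_similarity(d1, d1))   # 1.0: identical documents

# equivalently, by hand: dot product divided by the product of the vector norms
print((d1 @ d6.T) / (np.linalg.norm(d1) * np.linalg.norm(d6)))
```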


Bag Of Words / Boolean model

1. Documents (including the query document) are treated as a
   collection / bag of words

2. Word sequence / semantics is not considered

3. Given a collection of documents D, let the vocabulary V = {t1,
   t2, ..., t|V|} be the set of distinct words in D

4. A weight wij > 0 is associated with each term ti of a document
   dj ∈ D. For a term that does not appear in document dj, wij = 0.
   dj = (w1j, w2j, ..., w|V|j)

BAG Of Words / Boolean Model

Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: for, is, of, the, to

Term     Doc 1  Doc 2
aid        0      1
all        0      1
back       1      0
brown      1      0
come       0      1
dog        1      0
fox        1      0
good       0      1
jump       1      0
lazy       1      0
men        0      1
now        0      1
over       1      0
party      0      1
quick      1      0
their      0      1
time       0      1

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
Bag Of Words / Boolean model

Term     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid        0      0      0      1      0      0      0      1
all        0      1      0      1      0      1      0      0
back       1      0      1      0      0      0      1      0
brown      1      0      1      0      1      0      1      0
come       0      1      0      1      0      1      0      1
dog        0      0      1      0      1      0      0      0
fox        0      0      1      0      1      0      1      0
good       0      1      0      1      0      1      0      1
jump       0      0      1      0      0      0      0      0
lazy       1      0      1      0      1      0      1      0
men        0      1      0      1      0      0      0      1
now        0      1      0      0      0      1      0      1
over       1      0      1      0      1      0      1      1
party      0      0      0      0      0      1      0      1
quick      1      0      1      0      0      0      0      0
their      1      0      0      0      1      0      1      0
time       0      1      0      1      0      1      0      0

• Each column represents the view of a particular document: what terms
  are contained in this document?
• Each row represents the view of a particular term: what documents
  contain this term?
• To execute a query, pick out the rows corresponding to the query terms and
  then apply the logic table of the corresponding Boolean operator.

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Boolean model (contd)
1. Query terms are combined logically using the Boolean operators AND,
   OR, and NOT.
   a. E.g., ((data AND mining) AND (NOT text))

2. Retrieval
   a. Given a Boolean query, the system retrieves every document that makes
      the query logically true.
   b. This is called exact match.

3. The retrieval results are usually quite poor because term frequency is not
   considered.

BAG Of Words / Boolean model (contd)

Term          Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog             0      0      1      0      1      0      0      0
fox             0      0      1      0      1      0      1      0
dog AND fox     0      0      1      0      1      0      0      0    → Doc 3, Doc 5
dog OR fox      0      0      1      0      1      0      1      0    → Doc 3, Doc 5, Doc 7
dog NOT fox     0      0      0      0      0      0      0      0    → empty
fox NOT dog     0      0      0      0      0      0      1      0    → Doc 7

Term                     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good                       0      1      0      1      0      1      0      1
party                      0      0      0      0      0      1      0      1
good AND party             0      0      0      0      0      1      0      1    → Doc 6, Doc 8
over                       1      0      1      0      1      0      1      1
good AND party NOT over    0      0      0      0      0      1      0      0    → Doc 6

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
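A minimal sketch of this exact-match retrieval with boolean arrays, using the dog / fox incidence vectors from the table above (documents assumed to be numbered Doc 1–Doc 8 left to right):

```python
import numpy as np

# term incidence vectors over Doc 1 .. Doc 8, taken from the table above
dog = np.array([0, 0, 1, 0, 1, 0, 0, 0], dtype=bool)
fox = np.array([0, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

def matching_docs(mask):
    """Translate a boolean result vector into document names."""
    return [f"Doc {i + 1}" for i in np.flatnonzero(mask)]

print(matching_docs(dog & fox))   # dog AND fox      -> ['Doc 3', 'Doc 5']
print(matching_docs(dog | fox))   # dog OR fox       -> ['Doc 3', 'Doc 5', 'Doc 7']
print(matching_docs(fox & ~dog))  # fox AND NOT dog  -> ['Doc 7']
```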
BAG Of Words / Vector Space model

1. Each document is represented as a vector. The term weights are no longer 0
   or 1.
2. Each term weight is computed based on term frequency. But raw term frequency
   can mislead, since some terms occur frequently across all documents in all
   classes (poor discriminators).
3. In TF-IDF, the weight given to terms that are frequent across all documents
   is reduced relative to other words.

Example: the document “The quick brown fox jumped over the lazy dog’s back”
becomes the vector [1 1 1 1 1 1 1 1 2] in feature space, where:
  1st position corresponds to “back”
  2nd position corresponds to “brown”
  3rd position corresponds to “dog”
  4th position corresponds to “fox”
  5th position corresponds to “jump”
  6th position corresponds to “lazy”
  7th position corresponds to “over”
  8th position corresponds to “quick”
  9th position corresponds to “the”

Image source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Vector Space model
• Represent a document by a term vector
  – Term: a basic concept, e.g., a word or phrase
  – Each term defines one dimension
  – N terms define an N-dimensional space
  – Each element of the vector corresponds to a term weight
  – E.g., d = (x1, …, xN), where xi is the “importance” of term i
• A new document is assigned to the most likely category based on
  vector similarity.

BAG Of Words / Vector Space model
1. How to select terms to represent the documents as vectors
   – Remove fluff words (stop words)
     • e.g. “and”, “the”, “always”, “along”
   – Use word stems so that variants of the same word do not become
     multiple dimensions (see the stemming sketch below)
     • e.g. “training”, “trainer”, “trained” => “train”
   – Latent semantic indexing
2. How to assign weights to terms
   – Not all words are equally important: some are more indicative
     than others
     • e.g. “Automobile” vs. “Car”
3. How to measure the similarity between document vectors
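A small sketch of stemming with NLTK's PorterStemmer (NLTK is an assumed extra dependency here; stemmers are heuristic, so the exact stem produced for a given word depends on the stemmer chosen):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["training", "trainer", "trained", "trains"]

# map each surface form to its stem; related forms collapse onto a shared
# root, reducing the number of dimensions in the term vector
print({w: stemmer.stem(w) for w in words})
```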

BAG Of Words / Vector Space model
1. Given two documents represented as weight vectors

2. Similarity definitions
   – dot product
   – normalized dot product (or cosine)
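In the weight notation introduced earlier (w_ij is the weight of term t_i in document d_j), these two measures can be written as follows; this is the standard formulation, stated here for completeness:

  sim(d_j, d_k) = \sum_{i=1}^{|V|} w_{ij} \, w_{ik}

  cos(d_j, d_k) = \frac{\sum_{i=1}^{|V|} w_{ij} \, w_{ik}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \; \sqrt{\sum_{i=1}^{|V|} w_{ik}^2}}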

Lab: SMS classification

textClassificationWithML.ipynb

Word Frequencies with TFIDFVectorizer
1. One issue with simple counts is that some words other than stop words appear
   many times.
2. Attributes with very large counts dominate the encoded vectors and are not very
   meaningful; it is like one dimension with a larger scale overwhelming the others.
3. An alternative is to calculate word frequencies and standardize them, which is done
   using TF-IDF.
4. This is an acronym that stands for “Term Frequency – Inverse Document
   Frequency”, which are the components of the resulting scores assigned to
   each word.
5. Term Frequency: summarizes how often a given word appears within a
   document.
6. Inverse Document Frequency: downscales words that appear a lot across
   documents.
7. Without going into the math, TF-IDF scores try to highlight words that are
   more interesting, i.e. frequent in a document but not across documents.

BAG Of Words / Vector Space model
1. TF: Term Frequency measures how frequently a term occurs in a
   document.
   a. Since every document is different in length, a term may appear many more
      times in a long document than in a shorter one.
   b. Thus, the term frequency is often divided by the document length (i.e. the total
      number of terms in the document) as a way of normalization:
   c. TF(t) = (Number of times term t appears in a document) / (Total number of
      terms in the document)

2. IDF: Inverse Document Frequency measures how important a term is.
   a. Certain terms may appear many times across documents yet carry little
      importance.
   b. Weigh down the frequent terms while scaling up the rare ones, by computing:
   c. IDF(t) = log_e(Total number of documents / Number of documents
      with term t in it)

Source: http://www.tfidf.com/

BAG Of Words / Vector Space model
The TF-IDF weight of term i in document j is

  w_i,j = tf_i,j × log(N / n_i)

where
  w_i,j  = weight assigned to term i in document j
  tf_i,j = number of occurrences of term i in document j
  N      = number of documents in the entire collection
  n_i    = number of documents containing term i

• Consider a document containing 1000 words wherein the word ML appears 30
  times. The term frequency (tf) for ML is then 30 / 1000 = 0.03. Now, assume we
  have 10 million documents and the word ML appears in one thousand of these.
  Then, the inverse document frequency (idf) is calculated as
  log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these
  quantities: 0.03 × 4 = 0.12.
• This is the weight of the term ML in the given document.
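A quick check of this arithmetic in Python (using the base-10 logarithm, as in the worked example above):

```python
import math

tf = 30 / 1000                          # term frequency of "ML" in the document
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency over 10M documents
print(tf, idf, tf * idf)                # 0.03 4.0 0.12
```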
BAG Of Words / Vector Space model

[Worked example (table): raw term frequencies (tf), idf values, tf-idf weights (W_i,j) and
length-normalized weights (W'_i,j) for the terms complicated, contaminated, fallout,
information, interesting, nuclear, retrieval and siberia across four documents.
"information" appears in every document, so its idf — and hence its weight — is 0.]

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Vector Space model

[Worked example (table), continued: a query containing "contaminated" (weight 3) and
"retrieval" (weight 1) is scored against the four documents using their normalized
weights W'_i,j. The resulting similarity scores are 0.87, 1.16, 0.47 and 0.57 for
documents 1–4, giving the ranked list: Doc 2, Doc 1, Doc 4, Doc 3.]

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html

Word Frequencies with TFIDFVectorizer
1. The TfidfVectorizer will tokenize documents, learn the
vocabulary and inverse document frequency weightings, and
allow you to encode new documents.
2. Alternatively, if we already have a learned CountVectorizer, we can
use it with a TfidfTransformer to just calculate the inverse
document frequencies and start encoding documents.
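A minimal sketch of both routes, reusing the two example sentences from the earlier slides:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# Route 1: TfidfVectorizer tokenizes, learns the vocabulary and idf weights,
# and encodes documents in one object
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.idf_)      # learned inverse document frequencies
print(X.toarray())     # tf-idf encoded documents

# Route 2: reuse an already-fitted CountVectorizer and add a TfidfTransformer on top
counts = CountVectorizer().fit_transform(docs)
print(TfidfTransformer().fit_transform(counts).toarray())
```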

Questions?

