Text Mining & Analytics

Natural language processing
Natural language processing (NLP) lies at the intersection of computer science,
linguistics, and artificial intelligence. NLP is concerned with the interactions
between computers and human (natural) languages, in particular with how to program
computers to process, analyze, and model large amounts of natural language data.

What is Text Mining?

1. Text mining is the process of extracting useful insights /
   information from a body of text (classified as unstructured data).

2. The volume of unstructured data generated today is far greater than
   the volume of structured data (roughly 90:10).

3. Sources of text data include:
   a. e-mails, corporate Web pages, customer surveys
   b. social media and more…
   (Image source: twitter.com/timothy_hughes/status/619075227021090817)

4. A lot of information is locked up in text data.

Introduction

1. Text data requires special preparation before we can start using it for
   predictive modelling.
2. The text must first be parsed to split it into words (tokens), a step called
   tokenization.
3. Then the words need to be encoded as integers or floating-point values for
   use as input to a machine learning algorithm, a step called feature extraction (or
   vectorization).
4. The scikit-learn library offers easy-to-use tools to perform both tokenization
   and feature extraction of text data.
5. We now learn to prepare text data for predictive modeling in Python with
   scikit-learn.
6. Learn to use the following algorithms:
   a. CountVectorizer: convert text to word-count vectors
   b. TfidfVectorizer: convert text to word-frequency (TF-IDF) vectors
   c. HashingVectorizer: map words to integer column indices via hashing and build
      count vectors (see the sketch below)
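A minimal sketch of the hashing option (the two sentences are just sample text; HashingVectorizer is stateless, so no vocabulary is stored and the word-to-column mapping is one-way):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# n_features fixes the output dimensionality up front; hash collisions are
# possible, which is the price paid for not keeping a vocabulary in memory
vectorizer = HashingVectorizer(n_features=32)
X = vectorizer.transform(docs)   # no fit() needed: the hash function is fixed
print(X.shape)                   # (2, 32)
print(X.toarray())               # hashed, L2-normalized token counts
```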

Applications of Text Mining

1. Analyze open-ended responses where respondents give their views or opinions
   without any constraints.

2. Automatically process huge volumes of electronic messages and e-mails.

3. Classify e-mails (text) as spam or non-spam.

4. Analyze warranty or insurance claims for suspicious / anomalous claims.

5. Diagnose situations described by people in the form of text, e.g.
   customer-experience feedback.

6. Investigate competitors by crawling their web sites (be careful, as crawling is
   prohibited by many websites).
Applications of Text Mining: Legal Domain
1. Judgement summarization

2. Similar-judgement identification

3. Legal case outcome prediction
   a. This is domain specific, and making it generic is a challenge
   b. One can focus on consumer-complaint judgements if interested

4. Auto-documentation

5. Automation of judicial processes (this is a broader problem)

Text Mining Techniques

1. Information Extraction – analyzes unstructured text by identifying entities
   and their relationships.

2. Categorization – classifies a text document under one or more pre-determined
   categories, e.g. spam vs. ham mails, where each mail is a document.

3. Clustering – finds similar documents, for instance similar queries in tech-support
   databases, for automated resolution.

4. Summarization – finds the key parts of a document, or what the document
   refers to, and summarizes the details.

Formulating a Text Analytics Task

Text Mining Challenges

1. Relations among word surface forms and their senses:
   a. Homonymy: same form, but different meaning (e.g. bank: river bank,
      financial institution)
   b. Polysemy: same form, related meaning (e.g. bank: blood bank, financial
      institution)
   c. Synonymy: different form, same meaning (e.g. singer, vocalist)
   d. Hyponymy: one word denotes a subclass of another (e.g. breakfast,
      meal)

2. Word frequencies in texts follow a power-law distribution (Zipf's law):
   a. a small number of very frequent words (usually uninformative words)
   b. a large number of low-frequency words (a long tail of useful words)

Image source: http://www.ruwhim.com/?p=47532
Stop words
1. Many of the most frequently used words in English are not useful
   in text analytics – these words are called stop words.
   – the, of, and, to, …
   – Typically about 400 to 500 such words
   – For an application, an additional domain-specific stop-word list may be
     constructed

2. Why do we need to remove stop words?
   – To reduce the indexing (or data) file size
     • Stop words account for 20–30% of total word counts
   – To improve efficiency
     • Stop words are not useful for searching or text mining
     • Stop words always produce a large number of hits
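As a minimal sketch (the two sentences below are the same example documents used later in these slides), scikit-learn's CountVectorizer can drop a built-in English stop-word list during vectorization:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# stop_words='english' applies scikit-learn's built-in English stop-word list;
# a custom, domain-specific list can be passed instead
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # remaining vocabulary after stop-word removal
print(X.toarray())                          # document-term counts
```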

Bag Of Words model

1. We cannot work with text directly when using machine learning
   algorithms; we need to convert the text to numbers.
2. We may want to perform classification of documents, so each
   document is an “input” and a class label is the “output” for our
   predictive algorithm.
3. Algorithms take vectors of numbers as input, therefore we need
   to convert documents to fixed-length vectors of numbers.
4. A simple and effective model for thinking about text documents
   in machine learning is the Bag-of-Words model, or BoW.
5. The model is simple in that it throws away all of the order
   information in the words and focuses on the occurrence of
   words in a document.

Bag Of Words model

6. This can be done by assigning each word a unique number. Any
   document we see can then be encoded as a fixed-length vector
   with the length of the vocabulary of known words. The value in
   each position in the vector is filled with the count or
   frequency of the corresponding word in the encoded document.
7. This is the bag-of-words model, where we are only concerned
   with encoding schemes that represent which words are present, or
   the degree to which they are present, in encoded documents,
   without any information about order.
8. The scikit-learn library provides three different APIs for this
   (CountVectorizer, TfidfVectorizer and HashingVectorizer).

Word Counts with CountVectorizer
1. CountVectorizer tokenizes a collection of text documents and
   builds a vocabulary of known words (textClassificationWithML.ipynb).
2. We can use it as follows:
   a. Create an instance of the CountVectorizer class.
   b. Call the fit() function to learn a vocabulary from one or more
      documents.
   c. Call the transform() function on one or more documents as needed to
      encode each document as a vector.
   d. An encoded vector is returned with the length of the entire vocabulary
      and an integer count for the number of times each word appeared in
      the document.
   e. The vectors returned from a call to transform() are sparse vectors; we
      can convert them back to NumPy arrays, by calling the toarray() function,
      to inspect them and better understand what is going on.
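A minimal sketch of steps (a)–(e), using the same two example sentences that appear later in these slides (the exact column order of the vocabulary is decided by scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

vectorizer = CountVectorizer()         # (a) create an instance
vectorizer.fit(docs)                   # (b) learn the vocabulary
vectors = vectorizer.transform(docs)   # (c) encode each document as a vector

print(vectorizer.vocabulary_)   # word -> column index mapping
print(vectors.shape)            # (d) one row per document, one column per vocabulary word
print(vectors.toarray())        # (e) dense view of the sparse count vectors
```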

Document Term Matrix

• The most common form of representation in text mining is the
  term-document matrix
  – Term: typically a single word, but could be a word
    phrase like “data mining”
  – Document: a generic term meaning a collection of text to be retrieved
  – Can be large – terms often number 50k or more, and
    documents can be in the billions!
  – Entries can be binary, or counts (frequency counts)
textClassificationWithML.ipynb
Document Term Matrix
Example: 10 documents, 6 terms

        ML   Spark  Kafka  BigData  NoSql  SVM
D1      24    21      9       0       0      3
D2      32    10      5       0       3      0
D3      12    16      5       0       0      0
D4       6     7      2       0       0      0
D5      43    31     20       0       3      0
D6       2     0      0      18       7      6
D7       0     0      1      32      12      0
D8       3     0      0      22       4      4
D9       1     0      0      34      27     25
D10      6     0      0      17       4     23

• Each document is now just a vector of term counts, sometimes boolean
Document Term Matrix

1. The semantic content of the text is ignored
2. Before creating the DTM, all occurrences of the same term should look the same
3. Remove words carrying no information (stop words)
4. Express words in their root form (stemming)

Feature Selection
• The performance of text-classification algorithms can be improved
  by selecting only a subset of the discriminative terms
  – even after stop-word removal

• Greedy search
  – Start from the full set and delete one term at a time
  – Find the least important variable
    • The Gini index can be used for this in a classification problem (see the
      sketch below)

• Often performance does not degrade even with order-of-magnitude
  reductions
  – e.g. only 140 out of 20,000 terms needed for classification!
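The greedy backward-deletion procedure above is expensive on large vocabularies; a simpler, related sketch is to rank terms by the Gini-based importances of a tree ensemble and keep only the top k (the toy corpus and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# toy labelled corpus, invented for illustration only
docs = ["free offer win money now", "meeting agenda attached for review",
        "win a free prize today", "project review meeting tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Random forests split on the Gini criterion by default, so feature_importances_
# gives a Gini-based ranking of how discriminative each term is
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
top_k = 5
keep = np.argsort(forest.feature_importances_)[::-1][:top_k]
print(terms[keep])  # the k most discriminative terms
```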

Distances in DT matrices

• Given a document-term matrix representation, we can now define
  distances between documents (or terms!)
• Elements of the matrix can be 0/1 or term frequencies
  (sometimes normalized)
• We can use Euclidean or cosine distance
• Cosine similarity is the cosine of the angle between the two vectors
  (cosine distance is 1 minus this similarity)
• Not intuitive, but it has been proven to work well
• If two documents point in the same direction the cosine similarity is 1;
  if they have nothing in common it is 0
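A small sketch of cosine similarity between two document vectors, reusing rows D1 and D6 of the document-term matrix shown earlier:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows over the terms [ML, Spark, Kafka, BigData, NoSql, SVM]
d1 = np.array([[24, 21, 9, 0, 0, 3]])   # D1
d6 = np.array([[2, 0, 0, 18, 7, 6]])    # D6

print(cosine_similarity(d1, d6))   # low value: the documents share little
print(cosine_similarity(d1, d1))   # 1.0: identical documents

# equivalently, by hand: dot product divided by the product of the vector norms
print((d1 @ d6.T) / (np.linalg.norm(d1) * np.linalg.norm(d6)))
```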


Bag Of Words / Boolean model

1. Documents (including the query document) are treated as a
   collection / bag of words

2. Word sequence / semantics is not considered

3. Given a collection of documents D, let the vocabulary V = {t1,
   t2, ..., t|V|} be the set of distinct words in D

4. A weight wij > 0 is associated with each term ti of a document
   dj ∈ D. For a term that does not appear in document dj, wij = 0.
   dj = (w1j, w2j, ..., w|V|j)

BAG Of Words / Boolean Model

Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: for, is, of, the, to

Term     Doc 1  Doc 2
aid        0      1
all        0      1
back       1      0
brown      1      0
come       0      1
dog        1      0
fox        1      0
good       0      1
jump       1      0
lazy       1      0
men        0      1
now        0      1
over       1      0
party      0      1
quick      1      0
their      0      1
time       0      1

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
Bag Of Words / Boolean model

Term     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid        0      0      0      1      0      0      0      1
all        0      1      0      1      0      1      0      0
back       1      0      1      0      0      0      1      0
brown      1      0      1      0      1      0      1      0
come       0      1      0      1      0      1      0      1
dog        0      0      1      0      1      0      0      0
fox        0      0      1      0      1      0      1      0
good       0      1      0      1      0      1      0      1
jump       0      0      1      0      0      0      0      0
lazy       1      0      1      0      1      0      1      0
men        0      1      0      1      0      0      0      1
now        0      1      0      0      0      1      0      1
over       1      0      1      0      1      0      1      1
party      0      0      0      0      0      1      0      1
quick      1      0      1      0      0      0      0      0
their      1      0      0      0      1      0      1      0
time       0      1      0      1      0      1      0      0

• Each column represents the view of a particular document: what terms
  are contained in this document?
• Each row represents the view of a particular term: what documents
  contain this term?
• To execute a query, pick out the rows corresponding to the query terms and
  then apply the logic table of the corresponding Boolean operator.

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Boolean model (contd)
1. Query terms are combined logically using the Boolean operators AND,
   OR, and NOT.
   a. E.g., ((data AND mining) AND (NOT text))

2. Retrieval
   a. Given a Boolean query, the system retrieves every document that makes
      the query logically true.
   b. This is called exact match.

3. The retrieval results are usually quite poor because term frequency is not
   considered.

BAG Of Words / Boolean model (contd)

Term          Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog             0      0      1      0      1      0      0      0
fox             0      0      1      0      1      0      1      0
dog AND fox     0      0      1      0      1      0      0      0    → Doc 3, Doc 5
dog OR fox      0      0      1      0      1      0      1      0    → Doc 3, Doc 5, Doc 7
dog NOT fox     0      0      0      0      0      0      0      0    → empty
fox NOT dog     0      0      0      0      0      0      1      0    → Doc 7

Term                     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good                       0      1      0      1      0      1      0      1
party                      0      0      0      0      0      1      0      1
good AND party             0      0      0      0      0      1      0      1    → Doc 6, Doc 8
over                       1      0      1      0      1      0      1      1
good AND party NOT over    0      0      0      0      0      1      0      0    → Doc 6

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
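A minimal sketch of this exact-match retrieval with boolean arrays, using the dog / fox incidence vectors from the table above (documents assumed to be numbered Doc 1–Doc 8 left to right):

```python
import numpy as np

# term incidence vectors over Doc 1 .. Doc 8, taken from the table above
dog = np.array([0, 0, 1, 0, 1, 0, 0, 0], dtype=bool)
fox = np.array([0, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

def matching_docs(mask):
    """Translate a boolean result vector into document names."""
    return [f"Doc {i + 1}" for i in np.flatnonzero(mask)]

print(matching_docs(dog & fox))   # dog AND fox      -> ['Doc 3', 'Doc 5']
print(matching_docs(dog | fox))   # dog OR fox       -> ['Doc 3', 'Doc 5', 'Doc 7']
print(matching_docs(fox & ~dog))  # fox AND NOT dog  -> ['Doc 7']
```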
BAG Of Words / Vector Space model

1. Each document is represented as a vector. The term weights are no longer 0
   or 1.
2. Each term weight is computed based on term frequency. But raw term frequency
   can mislead, since some terms occur frequently across all documents in all
   classes (poor discriminators).
3. In TF-IDF, the weight given to terms that are frequent across all documents
   is reduced relative to other words.

Example: the document “The quick brown fox jumped over the lazy dog’s back”
becomes the vector [1 1 1 1 1 1 1 1 2] in feature space, where:
  1st position corresponds to “back”
  2nd position corresponds to “brown”
  3rd position corresponds to “dog”
  4th position corresponds to “fox”
  5th position corresponds to “jump”
  6th position corresponds to “lazy”
  7th position corresponds to “over”
  8th position corresponds to “quick”
  9th position corresponds to “the”

Image source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Vector Space model
• Represent a document by a term vector
  – Term: a basic concept, e.g., a word or phrase
  – Each term defines one dimension
  – N terms define an N-dimensional space
  – Each element of the vector corresponds to a term weight
  – E.g., d = (x1, …, xN), where xi is the “importance” of term i
• A new document is assigned to the most likely category based on
  vector similarity.

BAG Of Words / Vector Space model
1. How to select terms to represent the documents as vectors
   – Remove fluff words (stop words)
     • e.g. “and”, “the”, “always”, “along”
   – Use word stems so that variants of the same word do not become
     multiple dimensions (see the stemming sketch below)
     • e.g. “training”, “trainer”, “trained” => “train”
   – Latent semantic indexing
2. How to assign weights to terms
   – Not all words are equally important: some are more indicative
     than others
     • e.g. “Automobile” vs. “Car”
3. How to measure the similarity between document vectors
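A small sketch of stemming with NLTK's PorterStemmer (NLTK is an assumed extra dependency here; stemmers are heuristic, so the exact stem produced for a given word depends on the stemmer chosen):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["training", "trainer", "trained", "trains"]

# map each surface form to its stem; related forms collapse onto a shared
# root, reducing the number of dimensions in the term vector
print({w: stemmer.stem(w) for w in words})
```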

BAG Of Words / Vector Space model
1. Given two documents represented as weight vectors

2. Similarity definitions
   – dot product
   – normalized dot product (or cosine)
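In the weight notation introduced earlier (w_ij is the weight of term t_i in document d_j), these two measures can be written as follows; this is the standard formulation, stated here for completeness:

  sim(d_j, d_k) = \sum_{i=1}^{|V|} w_{ij} \, w_{ik}

  cos(d_j, d_k) = \frac{\sum_{i=1}^{|V|} w_{ij} \, w_{ik}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \; \sqrt{\sum_{i=1}^{|V|} w_{ik}^2}}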

Lab: SMS classification

textClassificationWithML.ipynb

Word Frequencies with TFIDFVectorizer
1. One issue with simple counts is that some words other than stop words appear
   many times.
2. Attributes with very large counts dominate the encoded vectors and are not very
   meaningful; it is like one dimension with a larger scale overwhelming the others.
3. An alternative is to calculate word frequencies and standardize them, which is done
   using TF-IDF.
4. This is an acronym that stands for “Term Frequency – Inverse Document
   Frequency”, which are the components of the resulting scores assigned to
   each word.
5. Term Frequency: summarizes how often a given word appears within a
   document.
6. Inverse Document Frequency: downscales words that appear a lot across
   documents.
7. Without going into the math, TF-IDF scores try to highlight words that are
   more interesting, i.e. frequent in a document but not across documents.

BAG Of Words / Vector Space model
1. TF: Term Frequency measures how frequently a term occurs in a
   document.
   a. Since every document is different in length, a term may appear many more
      times in a long document than in a shorter one.
   b. Thus, the term frequency is often divided by the document length (i.e. the total
      number of terms in the document) as a way of normalization:
   c. TF(t) = (Number of times term t appears in a document) / (Total number of
      terms in the document)

2. IDF: Inverse Document Frequency measures how important a term is.
   a. Certain terms may appear many times across documents yet carry little
      importance.
   b. Weigh down the frequent terms while scaling up the rare ones, by computing:
   c. IDF(t) = log_e(Total number of documents / Number of documents
      with term t in it)

Source: http://www.tfidf.com/

BAG Of Words / Vector Space model
The TF-IDF weight of term i in document j is

  w_i,j = tf_i,j × log(N / n_i)

where
  w_i,j  = weight assigned to term i in document j
  tf_i,j = number of occurrences of term i in document j
  N      = number of documents in the entire collection
  n_i    = number of documents containing term i

• Consider a document containing 1000 words wherein the word ML appears 30
  times. The term frequency (tf) for ML is then 30 / 1000 = 0.03. Now, assume we
  have 10 million documents and the word ML appears in one thousand of these.
  Then, the inverse document frequency (idf) is calculated as
  log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these
  quantities: 0.03 × 4 = 0.12.
• This is the weight of the term ML in the given document.
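A quick check of this arithmetic in Python (using the base-10 logarithm, as in the worked example above):

```python
import math

tf = 30 / 1000                          # term frequency of "ML" in the document
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency over 10M documents
print(tf, idf, tf * idf)                # 0.03 4.0 0.12
```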
BAG Of Words / Vector Space model

[Worked example (table): raw term frequencies (tf), idf values, tf-idf weights (W_i,j) and
length-normalized weights (W'_i,j) for the terms complicated, contaminated, fallout,
information, interesting, nuclear, retrieval and siberia across four documents.
"information" appears in every document, so its idf — and hence its weight — is 0.]

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html
BAG Of Words / Vector Space model

[Worked example (table), continued: a query containing "contaminated" (weight 3) and
"retrieval" (weight 1) is scored against the four documents using their normalized
weights W'_i,j. The resulting similarity scores are 0.87, 1.16, 0.47 and 0.57 for
documents 1–4, giving the ranked list: Doc 2, Doc 1, Doc 4, Doc 3.]

Source: http://lintool.github.io/UMD-courses/LBSC796-INFM718R-2006-Spring/syllabus.html

Word Frequencies with TFIDFVectorizer
1. The TfidfVectorizer will tokenize documents, learn the
vocabulary and inverse document frequency weightings, and
allow you to encode new documents.
2. Alternatively, if we already have a learned CountVectorizer, we can
use it with a TfidfTransformer to just calculate the inverse
document frequencies and start encoding documents.
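A minimal sketch of both routes, reusing the two example sentences from the earlier slides:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["The quick brown fox jumped over the lazy dog's back.",
        "Now is the time for all good men to come to the aid of their party."]

# Route 1: TfidfVectorizer tokenizes, learns the vocabulary and idf weights,
# and encodes documents in one object
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.idf_)      # learned inverse document frequencies
print(X.toarray())     # tf-idf encoded documents

# Route 2: reuse an already-fitted CountVectorizer and add a TfidfTransformer on top
counts = CountVectorizer().fit_transform(docs)
print(TfidfTransformer().fit_transform(counts).toarray())
```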

Questions?

