Lec 5 e Text Analytics Vector Space TF IDF

Big Data Analytics

Text Analytics
Sources of Text
Applications of Text Analytics
Text Analytics Concepts & Terminology
Text EDA
Vector Space Modeling
Set-of-Words: Binary word occurrences
Bag-of-Words: Word occurrences
tf-idf
Word embedding

Imdad ullah Khan


Imdad ullah Khan (LUMS) Text Analytics 1 / 51
Text Analytics
Applying data analytics to derive knowledge from text

A huge amount of textual data is available in the form of:

Social media posts


Tweets
Question answer forums
Blogs
YouTube video comments
SMS
Product reviews
News articles



How much textual data is produced?

2.5 quintillion bytes of data created each day (Forbes)


More than 65 billion messages sent on WhatsApp every day (Statista)
500 million tweets per day



Stakeholders of text analytics

Government
What is people's response to a particular policy?
Advertisers
What is trending that could be used for advertisement?
Careem used LUMSU as a promo code

Movie Makers
What did people dislike about a movie?
This information is used to deliver what people want in the future
Brand Managers
What value-added services do people want from a brand?
How do people respond to a brand's social responsibility campaigns?
Academia
Is this document plagiarized?
Retrieve similar documents



Structured Vs Unstructured Data

source: Google images

Unstructured (text) vs. structured (database) data in 1996 (left) and 2006 (right)
The market cap of unstructured data has grown massively
Better techniques are needed to handle queries and search on unstructured data



Text Analytics: Applications



Text Analytics: Tasks
Document Classification: Classify texts into fixed categories

Apply classification after text analytics

source: towardsdatascience.com



Text Analytics: Tasks
Sentiment Analysis and Emotion Mining
Determine if the sentiment in the text is positive or negative
▷ Emotion Mining is fine-grained Sentiment Analysis
Figure panels: Sentiment Analysis; Emotion Mining

Determine how the product is perceived by public from reviews


The Obama administration used it to gauge public opinion on policies
and campaign messages ahead of 2012 election
Given news headlines for last n days, would the stock market go up?



Text Analytics: Tasks
Topic Modeling: Determine the topics and subject of documents
Document clustering, information retrieval, reviewer assignment



Text Analytics: Tasks
Author profiling: Determine author attributes (age, gender, name etc.)

Security: Who is behind anonymous threat message?


Sales and marketing: Determine the demographic of the people
behind online reviews who liked or disliked the products

Figure credit: Francisco Rangel & Paolo Rosso [Universitat Politècnica de València]



Text Analytics: Tasks
Fake News Identification: Determine if a news item is fake

Filtering and blocking of misleading information


Identify trustworthy news sources
Choraś et al. (2018) Pattern Recognition Solutions for Fake News Detection



Text Analytics: Tasks
Paraphrase Identification: Find paraphrases or duplicate texts

Used for document clustering, information retrieval, plagiarism detection


Useful for question-answer forums, where an answer could be
retrieved if a question has already been asked and answered
source: Google AI blog



Text Analytics: Basic Concepts
Vocabulary (language lexicon): Unique words that may appear in texts
n-gram: a (sub)sequence of n contiguous words in text (aka shingle)
Texts are treated as sequences of n-grams; a larger n captures more context

source: devopedia.org

In computational biology, they are called k-mers

Tokenization: Break a character sequence into predefined units


Can be character level or word level, n-gram tokens
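
As a minimal sketch of tokenization and n-gram extraction, assuming whitespace-separated words and no punctuation handling (the helper names `tokenize` and `ngrams` are illustrative, not from the slides):

```python
def tokenize(text, level="word"):
    """Break a character sequence into word-level or character-level tokens."""
    if level == "word":
        return text.split()
    if level == "char":
        return list(text)
    raise ValueError("level must be 'word' or 'char'")


def ngrams(tokens, n):
    """Return all contiguous length-n subsequences (n-grams / shingles)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = tokenize("the quick brown fox jumps")
print(ngrams(words, 2))                            # word bigrams
print(ngrams(tokenize("ACGT", level="char"), 3))   # character 3-grams (k-mers)
```

The same `ngrams` function serves both word n-grams and character-level k-mers, since it only sees a list of tokens.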



Text Analytics: Basic Concepts



Text Analytics: Text Normalization
Text Normalization
Initial pre-processing of the text dataset
The goal is to standardize sentence structure and vocabulary
Helps reduce the number of variables (dimensionality)
The exact preprocessing steps depend on the application; they include
Remove duplicate whitespace, punctuation, accents, capital letters, and special characters
Substitute word numerals with numbers (thirty → 30), values with their type
($100 → currency/money), contractions with phrases (I’ve → I have)
Standardize formats (e.g. dates), replace abbreviations (e.g. USA)
Stopwords removal
Stemming
Lemmatization
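
Several of the normalization steps above can be sketched in a few lines. This is only an illustration: the lookup tables `CONTRACTIONS` and `NUMERALS` and the `<money>` placeholder are hypothetical and far smaller than anything a real pipeline would need.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"i've": "i have", "don't": "do not"}
NUMERALS = {"thirty": "30", "ten": "10"}


def normalize(text):
    text = text.lower()                                   # remove capital letters
    text = text.replace("\u2019", "'")                    # curly apostrophe -> straight
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for short, full in CONTRACTIONS.items():              # contractions -> phrases
        text = text.replace(short, full)
    text = re.sub(r"\$\d+(?:\.\d+)?", "<money>", text)    # values -> type
    text = re.sub(r"[^\w\s<>]", " ", text)                # punctuation, special characters
    words = [NUMERALS.get(w, w) for w in text.split()]    # word numerals -> numbers
    return " ".join(words)                                # also collapses whitespace


print(normalize("I\u2019ve paid   $100 for THIRTY cafés!"))
# prints: i have paid <money> for 30 cafes
```

The order of the steps matters: contractions are expanded before punctuation is stripped, otherwise the apostrophe would already be gone.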
Text Analytics: Basic Concepts
Stop words
Common words that provide little useful information, e.g. the, it, is, are, an, a
Often removed (filtered out) during pre-processing
No universally good list of stop words
Reduces time/space complexity, can improve analytics quality
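
A minimal sketch of stop-word filtering, assuming the text is already tokenized. The stop list below is just the six example words from this slide; as noted above, no single list works everywhere.

```python
# Illustrative stop list only; there is no universally good list.
STOP_WORDS = {"the", "it", "is", "are", "an", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The car is driven on the road".split()
print(remove_stop_words(tokens))
# prints: ['car', 'driven', 'on', 'road']
```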

M Qasim (2018) Mining health reviews from online blogs and news



Text Analytics: Basic Concepts
Stemming and Lemmatization
Convert different variations of a word to a common root form

Stemming: crude heuristic way of chopping off ends of words


Lemmatization: replaces words with grammatically sound base forms (lemmas)
am, are, is −→ be
car, cars, car’s, cars’ −→ car
“the boy’s cars are different colors” −→ “the boy car be differ color”
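
The difference between the two approaches can be sketched with a toy example. The suffix list and lemma dictionary here are hypothetical and chosen only to reproduce the sentence above; real stemmers and lemmatizers use much richer rules and dictionaries.

```python
# Hypothetical, deliberately tiny rule sets for illustration only.
SUFFIXES = ["'s", "s'", "ent", "s"]          # tried in this order
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_root(word):
    """Dictionary lookup first (lemmatization), suffix chopping as fallback."""
    return LEMMAS.get(word, crude_stem(word))


sentence = "the boy's cars are different colors"
print(" ".join(to_root(w) for w in sentence.split()))
# prints: the boy car be differ color
```

Note that "differ" comes from blind suffix chopping, while "be" can only come from a dictionary: no suffix rule turns "are" into "be".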



Text EDA



Text Analytics: Where to start?
First step in text analytics is Exploratory Data Analysis (EDA)

Gives insight about the data such as:


Class distribution
Top occurring words in the dataset
Distribution of words per document

These insights help in formulating solution strategies for the task


What preprocessing should be used?
What classifier should be used?
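
The three insights listed above can be computed with the standard library alone. The sketch below uses a made-up four-review dataset (hypothetical, not the clothing-review data shown on the following slides).

```python
from collections import Counter

# Hypothetical mini-dataset of (review text, sentiment label) pairs.
reviews = [
    ("love this dress it fits well", 1),
    ("the fabric is cheap and thin", -1),
    ("love the color love the fit", 1),
    ("it is fine", 0),
]

class_distribution = Counter(label for _, label in reviews)
word_counts = Counter(w for text, _ in reviews for w in text.split())
words_per_doc = [len(text.split()) for text, _ in reviews]

print(class_distribution)          # how many reviews per sentiment label
print(word_counts.most_common(3))  # top occurring words
print(words_per_doc)               # document lengths, ready for a histogram
```

Even on this toy data the top words include "the", which hints at why stop-word removal is a common preprocessing choice.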



Text Exploratory Data Analysis
Sentiment Polarity Detection Dataset
Clothing products review text, Reviewer info, rating and sentiment
Sentiment labels ∈ {−1, 0, 1} = {Negative, Neutral, Positive}
The problem is treated as Regression

source:chart-studio.plotly.com



Text Exploratory Data Analysis

Rating distribution

source:chart-studio.plotly.com



Text Exploratory Data Analysis

Distribution of age of the reviewers

source:chart-studio.plotly.com



Text Exploratory Data Analysis

Distribution of the text length of the reviews

source:kdnuggets.com



Text Exploratory Data Analysis

Reviews per department

source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top unigrams before removing stopwords


source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top unigrams after removing stopwords


source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top bigrams before removing stopwords


source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top bigrams after removing stopwords

source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top trigrams before removing stopwords


source:kdnuggets.com



Text Exploratory Data Analysis

Frequency of top trigrams after removing stopwords


source:kdnuggets.com



Text Exploratory Data Analysis
Part-of-Speech (POS) tagging is the process of assigning a part of speech
(noun, verb, adjective, etc.) to each word

source:kdnuggets.com



Text Exploratory Data Analysis
Visualizing class-wise polarity distribution
Shows the threshold of sentiment score after which people tend to
recommend clothing

source:kdnuggets.com



Text Exploratory Data Analysis

Visualizing department wise sentiment polarity via boxplot


Shows the statistical summary of the values
source:kdnuggets.com



Text Exploratory Data Analysis

An integral tool for text EDA is Word Cloud


What could be said about the texts by looking at the examples below?



Vector Space Models



Vector Space Models
Algorithms cannot work with raw texts directly
How do we calculate the similarity or difference between two documents?
Convert texts to vectors: this is Vector Space Modeling

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python

Extract features from texts to reflect linguistic properties of the text


Popular feature extraction methods (VSM variations) are
Set-of-Words: Binary word occurrences
Bag-of-Words: Word occurrences
tf-idf
Word embedding
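
Once documents are vectors, similarity becomes a geometric question. The slides do not name a specific measure; cosine similarity is assumed here as one common choice, and the three count vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical count vectors over a shared 4-word vocabulary.
doc_a = [1, 2, 0, 1]
doc_b = [1, 1, 0, 0]
doc_c = [0, 0, 3, 0]
print(cosine_similarity(doc_a, doc_b))  # shares words with doc_a, so well above 0
print(cosine_similarity(doc_a, doc_c))  # no words in common, so 0.0
```

Cosine similarity depends only on direction, so a long document and a short one with the same word proportions are treated as identical.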
Set and Bag of Words Models
Text represented as a set or a bag (multiset) of words it contains
Disregard grammar and word order
Binary Word Occurrences (Set of Words)

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python

Word Occurrences (aka Term Frequency) (Bag of Words)


The Bag-of-Words model is like Set-of-Words, but it also accounts for word frequencies

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python



The Set-of-Words Model

Set-of-Words: Documents are represented by vectors $\in \{0, 1\}^{|\Sigma|}$
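
A minimal sketch of the Set-of-Words representation, assuming lowercase whitespace tokenization and an alphabetically sorted vocabulary (both are illustrative choices, as are the two example documents):

```python
def build_vocabulary(docs):
    """Sorted list of the unique words across all documents (the vocabulary Σ)."""
    return sorted({w for doc in docs for w in doc.lower().split()})


def set_of_words(doc, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(doc.lower().split())
    return [1 if w in present else 0 for w in vocab]


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
print(vocab)                        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for doc in docs:
    print(set_of_words(doc, vocab))
# prints: [1, 0, 1, 1, 1, 1]
#         [0, 1, 0, 0, 1, 1]
```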



The Bag of words Model

Bag-of-Words: Documents are represented by term-frequency vectors $\in \mathbb{N}^{|\Sigma|}$
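
A minimal sketch of the Bag-of-Words representation under the same illustrative assumptions (lowercase whitespace tokens, sorted vocabulary, made-up documents). The only change from a binary vector is that each entry holds a count.

```python
from collections import Counter


def bag_of_words(doc, vocab):
    """Term-frequency vector: how many times each vocabulary word occurs."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]


docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for doc in docs for w in doc.lower().split()})
for doc in docs:
    print(bag_of_words(doc, vocab))
# prints: [1, 0, 1, 1, 1, 2]
#         [0, 1, 0, 0, 1, 1]
```

The last entry of the first vector is 2 because "the" occurs twice in that document.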



Bag of Words
Issues with Sets and Bag of Words

Set representations have a high computational cost

Dimensionality blow-up: |Σ| could be very large
Set-of-Words (SoW) treats the mere appearance of a word as a feature of the document
(a word appearing 1000 times counts the same as one appearing only once)



tf-idf - Motivation
tf-idf is a more refined model for selecting features to represent texts
The key idea is to find special words that characterize the document
Reflects how significant a word is to a “document” in a “collection”
Frequency: Are the most frequent words the most significant in a document?
Actually, exactly the opposite is true
The most frequent words (“the”, “are”, “and”) support English structure and
help build ideas, but they are not significant in characterizing documents
Rarity: Rare words are indicators of topics
Words that are rare overall but concentrated in a few documents, e.g. “batsman”,
“prime-minister”
ball, bat, pitch, catch, run ⟹ cricket-related document
An indicator word is likely to be repeated if it appears once



tf-idf
tf-idf value increases proportionally to the number of times a word
appears in a document
Offset by the number of documents in corpus containing that word
Best known weighting scheme in IR. Value for a term increases with
Number of occurrences within a document
Rarity of the term in collection

Helps to adjust for the fact that some words appear more frequently
in general (frequent words are less meaningful than the rare ones)
Involves two characteristics of words (or of terms such as bigrams and trigrams)
Term frequency
Inverse document frequency



tf-idf: Term Frequency
Documents: $D_1, \dots, D_N$. Terms ($\Sigma$): $t_1, \dots, t_m$

Frequency, $f_{ij}$: frequency of term $t_i$ in document $D_j$

Find a parameter to measure the importance of $t_i$ to $D_j$
$f_{ij}$ alone is not good (very high for stop words in all documents)
It is also possible that a large document $D_j$ (e.g. a book) has a larger $f_{ij}$ than the $f_{ij'}$
of a short document $D_{j'}$, even if $t_i$ is more important for $D_{j'}$ than for $D_j$
Recall normalization and scaling
Term Frequency: $\mathrm{tf}_{ij} := \frac{f_{ij}}{\max_k f_{kj}}$
The most frequent term in $D_j$ gets $\mathrm{tf}_{ij} = 1$; all less frequent terms get $\mathrm{tf}_{ij} < 1$
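
The normalization above can be sketched as follows, assuming the document is already a list of tokens. The function name `term_frequency` is illustrative.

```python
from collections import Counter


def term_frequency(doc_tokens):
    """Map each term to its count divided by the largest count in the document."""
    counts = Counter(doc_tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


tf = term_frequency("the car is driven on the road".split())
print(tf["the"], tf["car"])
# prints: 1.0 0.5
```

Dividing by the maximum count keeps every value in (0, 1], so a long book and a short note become comparable.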



tf-idf: Inverse Document Frequency
Documents: $D_1, \dots, D_N$. Terms ($\Sigma$): $t_1, \dots, t_m$

Term frequency considers all $t_i$ equally important

Stop words appear frequently but have little importance
Need to weigh down the frequent terms while scaling up the rare ones
Some terms are rare but appear in many documents a few times
Weigh $\mathrm{tf}_{ij}$ (inversely) by the term’s overall popularity in the collection
Suppose the term $t_i$ appears in $n_i$ out of $N$ documents. Then
Inverse Document Frequency: $\mathrm{idf}_i := \log\left(\frac{N}{n_i + 1}\right)$
The +1 in the denominator avoids dividing by 0 if $t_i$ doesn’t appear in any document
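
A minimal sketch of the idf formula above, including the +1 in the denominator. The natural logarithm is an assumption (the slides do not state a base), and the four-document collection is hypothetical.

```python
import math


def inverse_document_frequency(docs_tokens):
    """idf = log(N / (n + 1)) for every term seen in the collection."""
    n_docs = len(docs_tokens)
    doc_freq = {}
    for tokens in docs_tokens:
        for term in set(tokens):          # count each document at most once
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log(n_docs / (n + 1)) for t, n in doc_freq.items()}


# Hypothetical four-document collection.
docs = [d.split() for d in ["the bat", "the ball", "the pitch", "the batsman bat"]]
idf = inverse_document_frequency(docs)
print(idf["the"], idf["bat"], idf["batsman"])
```

Here "the" (in all 4 documents) gets log(4/5), which is slightly below zero, while "batsman" (in 1 document) gets log(4/2), the largest value of the three.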



tf-idf: Term frequency-inverse document frequency
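Documents: $D_1, \dots, D_N$. Terms ($\Sigma$): $t_1, \dots, t_m$
Finally, the weight or importance of a term $t_i$ in document $D_j$ is given as
$\text{tf-idf}(i, j) = \mathrm{tf}_{ij} \times \mathrm{idf}_i$
Check the extreme cases
If $t_i$ appears in all the documents, then $\mathrm{idf}_i = \log\left(\frac{N}{N + 1}\right)$, which is slightly negative and approaches 0 as $N$ grows, so $\text{tf-idf}(i, j) \approx 0$ in all $D_j$ (it is exactly 0 under the unsmoothed $\mathrm{idf}_i = \log(N / n_i)$)
Many stop words would get a score close to 0
A term appearing frequently in some documents gets a higher score there

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python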



tf-idf: Example
D1 : “The car is driven on the road”
D2 : “The truck is driven on the highway”

Common words score zero under the unsmoothed $\mathrm{idf}_i = \log(N / n_i)$ (not significant)

The scores of “car”, “truck”, “road”, and “highway” are non-zero
(significant words)
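
The example can be checked with a short sketch. It assumes the unsmoothed idf = log(N / n), because that is the variant under which words shared by both documents score exactly zero. With the +1 in the denominator from the earlier slide and only N = 2 documents, the shared words would instead get log(2/3) < 0 and the distinguishing words log(2/2) = 0. Natural log and lowercase tokens are further assumptions.

```python
import math
from collections import Counter


def tf_idf(docs_tokens):
    """One {term: weight} dict per non-empty document: (f / max f) * log(N / n)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        max_count = max(counts.values())
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


d1 = "the car is driven on the road".split()
d2 = "the truck is driven on the highway".split()
w1, w2 = tf_idf([d1, d2])
print({term: round(w, 3) for term, w in w1.items()})
print({term: round(w, 3) for term, w in w2.items()})
```

In both documents "the", "is", "driven" and "on" come out as 0.0, while "car", "road", "truck" and "highway" each get 0.5 × log 2 ≈ 0.347.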



The tf-idf Model

Each document is represented by a real vector of tf-idf weights $\in \mathbb{R}^{|\Sigma|}$



Vector Space Models
“worst acting, worst plot, worst movie ever”
“best acting, best movie ever”
Set of Words

Bag of Words

tf-idf



Vector Space Models
Problems with the previous three vector space models
Dimensionality blow-up: |Σ| could be very large
None of them preserves word order, which carries contextual information
The following two documents produce identical vectors (in all 3 models),
although their context and meaning are very different
Mary is faster than John
John is faster than Mary
They ignore synonyms (“old bike” vs “used bike”) and homonyms
An n-gram vocabulary captures context to some extent
Solution: Word embedding
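
The word-order problem, and the partial fix from n-grams, can be seen directly on the two sentences above. The helper `bag_of_ngrams` is illustrative and assumes lowercase whitespace tokens.

```python
from collections import Counter


def bag_of_ngrams(text, n):
    """Multiset of the contiguous word n-grams in a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "Mary is faster than John"
b = "John is faster than Mary"
print(bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1))  # True: unigram bags are identical
print(bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2))  # False: bigrams tell them apart
```

Bigrams such as ("mary", "is") and ("than", "john") occur in only one of the two sentences, which is how the n-gram vocabulary recovers some of the lost order.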



Vector Space Models: Word embedding
Represent each word with an n-dimensional dense vector ▷ word2vec
Words appearing in similar contexts are mapped to nearby points in $\mathbb{R}^n$
Neural networks are used to learn these mappings ▷ See svd



Vector Space Models: Document embedding
Can be extended to learn document level embeddings
Following is a 2-D representation of n-D document embeddings. (Can
convert n-D vectors to 2-D vectors by tSNE or PCA)

