Lec 5 e Text Analytics Vector Space TF IDF


Big Data Analytics

Text Analytics
Sources of Text
Applications of Text Analytics
Text Analytics Concepts & Terminology
Text EDA
Vector Space Modeling
Set-of-Words: Binary word occurrences
Bag-of-Words: Word occurrences
tf-idf
Word embedding

Imdad ullah Khan

Imdad ullah Khan (LUMS) Text Analytics 1 / 51
Text Analytics
Applying data analytics to derive knowledge from text

A huge amount of textual data is available in the form of

Social media posts

Tweets
Question answer forums
Blogs
YouTube video comments
SMS
Product reviews
News articles


How much textual data is produced?

2.5 quintillion bytes of data created each day (Forbes)

More than 65 billion messages sent on WhatsApp every day (Statista)
500 million tweets per day


Stakeholders of text analytics

Government
What is the response of people towards a particular policy?
Advertisers
What is trending that could be used for advertisement?
Careem used LUMSU as promo code

Movie Makers
What did people dislike about a movie?
This information is used to deliver what people want in the future
Brand Managers
What value-added services do people want from a brand?
How do people respond to a brand’s social responsibility campaigns?
Academia
Is this document plagiarized?
Retrieve similar documents


Structured Vs Unstructured Data

source: Google images

Unstructured (text) vs. structured (database) data in 1996 (left) and 2006 (right)
Market cap of unstructured data has grown massively
Need better techniques to handle queries/search on unstructured data


Text Analytics: Applications


Text Analytics: Tasks
Document Classification: Classify texts into fixed categories

Apply classification after text analytics

source: towardsdatascience.com


Text Analytics: Tasks
Sentiment Analysis and Emotion Mining
Determine if the sentiment in the text is positive or negative
▷ Emotion Mining is fine-grained Sentiment Analysis

Determine from reviews how the product is perceived by the public

The Obama administration used it to gauge public opinion on policies
and campaign messages ahead of 2012 election
Given news headlines for last n days, would the stock market go up?


Text Analytics: Tasks
Topic Modeling: Determine the topics and subject of documents
Document clustering, information retrieval, reviewer assignment


Text Analytics: Tasks
Author profiling: Determine author attributes (age, gender, name etc.)

Security: Who is behind an anonymous threat message?

Sales and marketing: Determine the demographic of the people
behind online reviews who liked or disliked the products

Figure credit: Francisco Rangel & Paolo Rosso [Universitat Politècnica de València]


Text Analytics: Tasks
Fake News Identification: Determine if a news item is fake

Filtering and blocking of misleading information

Identify trustworthy news sources
Choraś et al. (2018) Pattern Recognition Solutions for Fake News Detection


Text Analytics: Tasks
Paraphrase Identification: Find paraphrases or duplicate texts

Used for document clustering, information retrieval, plagiarism detection

Useful for question-answer forums, where an answer could be
retrieved if a question has already been asked and answered
source: Google AI blog


Text Analytics: Basic Concepts
Vocabulary (language lexicon): Unique words that may appear in texts
n-gram: a (sub)sequence of n contiguous words in text (aka shingle)
Texts are treated as sequences of n-grams; a larger n captures more context

source: devopedia.org

In computational biology, they are called k-mers

Tokenization: Break a character sequence into predefined units

Can be character level or word level, n-gram tokens
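
The tokenization and n-gram definitions above can be sketched in a few lines of Python. This is a minimal illustration only: the regular expression and the helper names `word_tokens` and `ngrams` are assumptions made for this example, not something prescribed by the slides.

```python
import re

def word_tokens(text):
    """Word-level tokenization: lowercase, then keep runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous n-gram (as a tuple) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = word_tokens("The car is driven on the road")
bigrams = ngrams(tokens, 2)               # word-level 2-grams
char_trigrams = ngrams(list("road"), 3)   # character-level 3-grams

print(tokens)
print(bigrams)
print(char_trigrams)
```

The same `ngrams` helper works at character level because it only assumes a sequence of units, which mirrors the remark that tokens can be characters, words, or n-grams.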


Text Analytics: Basic Concepts


Text Analytics: Text Normalization
Text Normalization
Initial pre-processing of the text dataset
The goal is to standardize sentence structure and vocabulary
Helps reduce the number of variables (dimensionality)
Exact preprocessing steps depend on the application; they include
Remove duplicate whitespace, punctuation, accents, capitalization, and special characters
Substitute word numerals by numbers (thirty → 30), values by type ($100 → currency/money), contractions by phrases (I’ve → I have)
Standardize formats (e.g. dates), replace abbreviations (e.g. USA)
Stopwords removal
Stemming
Lemmatization
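
A few of the normalization steps listed above could be chained as follows. This is a rough sketch under stated assumptions: the lookup tables `CONTRACTIONS` and `NUMBER_WORDS` are hypothetical and deliberately tiny, the placeholder token `currency` is an arbitrary choice, and the order of the steps is one reasonable option rather than a fixed recipe.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables (assumptions for this sketch).
CONTRACTIONS = {"i've": "i have", "don't": "do not", "it's": "it is"}
NUMBER_WORDS = {"ten": "10", "twenty": "20", "thirty": "30"}

def normalize(text):
    # Lowercase and unify the typographic apostrophe.
    text = text.lower().replace("’", "'")
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Expand contractions into phrases.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Replace money amounts by their type.
    text = re.sub(r"\$\d+(?:\.\d+)?", " currency ", text)
    # Replace word numerals by digits.
    text = re.sub(r"[a-z]+", lambda m: NUMBER_WORDS.get(m.group(0), m.group(0)), text)
    # Drop remaining punctuation and special characters, collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

example = normalize("I’ve  paid $100 for thirty cafés!")
print(example)
```

Stop word removal, stemming, and lemmatization would follow this step and operate on the cleaned string.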
Text Analytics: Basic Concepts
Stop words
Common words that provide little useful information, e.g. the, it, is, are, an, a
Often removed (filtered out) during pre-processing
No universally good list of stop words
Reduces time/space complexity, can improve analytics quality
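
Filtering stop words is a simple set lookup, as the sketch below suggests. The stop list here is just the six words named above; since there is no universally good list, a real application would supply its own (the name `STOP_WORDS` is an assumption of this example).

```python
# Stop list taken from the examples above; real lists are application-specific.
STOP_WORDS = {"the", "it", "is", "are", "an", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the car is driven on the road".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Note that "on" survives because it is not in this particular list, which illustrates why the choice of list matters.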

M Qasim (2018) Mining health reviews from online blogs and news


Text Analytics: Basic Concepts
Stemming and Lemmatization
Convert different variations of a word to a common root form

Stemming: a crude heuristic that chops off the ends of words

Lemmatization: replaces each word with its grammatically sound base form (lemma)
am, are, is −→ be
car, cars, car’s, cars’ −→ car
“the boy’s cars are different colors” −→ “the boy car be differ color”
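
To make the difference concrete, here is a toy sketch that reproduces the example sentence above. It is not the Porter stemmer or any library lemmatizer: the suffix list `SUFFIXES`, the minimum stem length of 3, and the lemma table `LEMMAS` are all assumptions chosen just for this illustration.

```python
# Hypothetical suffix list and lemma table, chosen only for this illustration.
SUFFIXES = ("ing", "ent", "ed", "es", "s")
LEMMAS = {"am": "be", "are": "be", "is": "be"}

def crude_stem(word):
    """Chop a possessive marker and at most one known suffix off the end of a word."""
    word = word.lower().replace("’", "'")
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    for suffix in SUFFIXES:
        # Keep at least 3 characters so that very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def to_root(word):
    """Use the lemma table when the word is listed, otherwise fall back to stemming."""
    return LEMMAS.get(word.lower(), crude_stem(word))

stems = [crude_stem(w) for w in ["car", "cars", "car's", "cars'"]]
normalized = " ".join(to_root(w) for w in "the boy's cars are different colors".split())
print(stems)
print(normalized)
```

The stemmer alone cannot map "are" to "be"; that step needs vocabulary knowledge, which is what the lemma lookup stands in for here.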


Text EDA


Text Analytics: Where to start?
First step in text analytics is Exploratory Data Analysis (EDA)

Gives insight about the data such as:

Class distribution
Top occurring words in the dataset
Distribution of words per document

These insights help in formulating solution strategies for the task

What preprocessing should be used?
What classifier should be used?
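
The three insights listed above (class distribution, top occurring words, words per document) can each be computed with a counter. The labelled mini-corpus below is hypothetical and exists only to make the sketch runnable; the labels follow the {−1, 0, 1} convention as an assumption.

```python
from collections import Counter

# Hypothetical labelled mini-corpus: (review text, sentiment label).
docs = [
    ("love this dress fits well", 1),
    ("poor quality fabric", -1),
    ("dress is ok", 0),
    ("love the fabric love the fit", 1),
]

# Class distribution: how many documents carry each label.
class_distribution = Counter(label for _, label in docs)

# Top occurring words over the whole dataset.
word_counts = Counter(word for text, _ in docs for word in text.split())
top_words = word_counts.most_common(3)

# Distribution of words per document.
words_per_doc = [len(text.split()) for text, _ in docs]

print(class_distribution)
print(top_words)
print(words_per_doc)
```

A skewed class distribution or a top-words list dominated by stop words would be the kind of signal that informs the preprocessing and classifier choices mentioned above.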


Text Exploratory Data Analysis
Sentiment Polarity Detection Dataset
Clothing products review text, Reviewer info, rating and sentiment
Sentiment labels ∈ {−1, 0, 1} = {Negative, Neutral, Positive}
The problem is treated as Regression


Text Exploratory Data Analysis
Visualizing class-wise polarity distribution
Shows the threshold of sentiment score after which people tend to
recommend clothing

source:kdnuggets.com


Text Exploratory Data Analysis

Visualizing department wise sentiment polarity via boxplot

Shows the statistical summary of the values
source:kdnuggets.com


Text Exploratory Data Analysis

An integral tool for text EDA is Word Cloud

What could be said about the texts by looking at below examples?


Vector Space Models


Vector Space Models
Algorithms cannot work with raw texts directly
How do we calculate the similarity/difference between two documents?
Convert texts to vectors: Vector Space Modeling (VSM)

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python

Extract features from texts to reflect linguistic properties of the text

Popular feature extraction methods (VSM variations) are
Set-of-Words: Binary word occurrences
Bag-of-Words: Word occurrences
tf-idf
Word embedding
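
Once texts are vectors, document similarity becomes a vector computation. The sketch below uses cosine similarity on word-count vectors; cosine is a common choice but an assumption here, since no particular measure is named above, and the two example sentences are used purely for illustration.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_u * norm_v)

doc_a = "the car is driven on the road".split()
doc_b = "the truck is driven on the highway".split()

# Shared vocabulary fixes the coordinate order of both vectors.
vocabulary = sorted(set(doc_a) | set(doc_b))
counts_a, counts_b = Counter(doc_a), Counter(doc_b)
vec_a = [counts_a[t] for t in vocabulary]
vec_b = [counts_b[t] for t in vocabulary]

similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 4))
```

The choice of feature extraction method determines what `vec_a` and `vec_b` contain; the similarity function itself stays the same.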
Set and Bag of Words Models
Text represented as a set or a bag (multiset) of words it contains
Disregard grammar and word order
Binary Word Occurrences (Set of Words)

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python

Word Occurrences (aka Term Frequency) (Bag of Words)

The Bag-of-Words model extends Set-of-Words by also recording word frequencies

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python


The Set-of-Words Model

Set-of-Words: Documents represented by vectors $\in \{0, 1\}^{|\Sigma|}$


The Bag of words Model

Bag-of-Words: Documents represented by term-frequency vectors $\in \mathbb{N}^{|\Sigma|}$
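
The two representations differ only in whether a coordinate stores presence or a count. The sketch below builds both for two short review snippets; the tokenizer regex and the alphabetical ordering of the vocabulary are assumptions of this example.

```python
import re
from collections import Counter

docs = [
    "worst acting, worst plot, worst movie ever",
    "best acting, best movie ever",
]
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# The vocabulary fixes one coordinate per distinct word.
vocabulary = sorted({t for tokens in tokenized for t in tokens})

def set_of_words(tokens):
    """Binary vector: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [int(term in present) for term in vocabulary]

def bag_of_words(tokens):
    """Count vector: number of occurrences of each word in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

sow = [set_of_words(t) for t in tokenized]
bow = [bag_of_words(t) for t in tokenized]

print(vocabulary)
print(sow)
print(bow)
```

In the first document "worst" occurs three times, which the bag-of-words vector records as 3 while the set-of-words vector flattens it to 1.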


Bag of Words
Issues with Sets and Bag of Words

Set representations carry a high computational cost

Dimensionality blow-up: |Σ| could be very large
Set-of-Words treats the mere appearance of a word as a feature of the document
(a word appearing 1000 times counts the same as one appearing only once)


tf-idf - Motivation
tf-idf is a more refined model for selecting features to represent texts
The key idea is to find special words that characterize the document
It reflects how significant a word is to a “document” in a “collection”
Frequency: a naive assumption is that the most frequent words are the most significant in a document
Actually, exactly the opposite is true
The most frequent words (“the”, “are”, “and”) provide English structure and help build ideas, but are not significant in characterizing documents
Rarity: indicators of topics are rare words
Words that are rare overall but concentrated in a few docs, e.g. “batsman”, “prime-minister”
ball, bat, pitch, catch, run =⇒ cricket-related doc
An indicator word is likely to be repeated if it appears once


tf-idf
tf-idf value increases proportionally to the number of times a word
appears in a document
Offset by the number of documents in corpus containing that word
Best known weighting scheme in IR. Value for a term increases with
Number of occurrences within a document
Rarity of the term in collection

Helps to adjust for the fact that some words appear more frequently
in general (frequent words are less meaningful than the rare ones)
Involves two characteristics of words (or of longer terms such as bigrams and trigrams)
Term frequency
Inverse document frequency


tf-idf: Term Frequency
Documents: $D_1, \ldots, D_N$. Terms ($\Sigma$): $t_1, \ldots, t_m$

Frequency, $f_{ij}$: frequency of term $t_i$ in document $D_j$

Find a parameter to measure the importance of $t_i$ to $D_j$
$f_{ij}$ alone is not good (it is very high for stop words in all documents)
A large document $D_j$ (e.g. a book) may also have a larger $f_{ij}$ than the $f_{ij'}$ of a short document $D_{j'}$, even if $t_i$ is more important for $D_{j'}$ than for $D_j$
Recall normalization and scaling
Term Frequency: $tf_{ij} := \dfrac{f_{ij}}{\max_{k} f_{kj}}$ (the maximum is over all terms $t_k$ in $D_j$)
The most frequent term(s) in $D_j$ get $tf_{ij} = 1$; all others get values below 1
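
The max-normalized term frequency defined above is a one-liner once raw counts are available. A minimal sketch follows; the function name `term_frequencies` and the empty-document convention are assumptions of this example.

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw count of each term divided by the count of the most frequent term."""
    counts = Counter(tokens)
    if not counts:
        return {}  # convention for an empty document
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies("the car is driven on the road".split())
print(tf)
```

Dividing by the document's own maximum makes values comparable between a long document and a short one, which is the scaling concern raised above.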


tf-idf: Inverse Document Frequency
Documents: $D_1, \ldots, D_N$. Terms ($\Sigma$): $t_1, \ldots, t_m$

Term frequency considers all $t_i$ equally important

Stop words appear frequently but have little importance
Need to weigh down the frequent terms and scale up the rare ones
Some terms are rare but appear in many documents a few times
Weigh $tf_{ij}$ (inversely) by the term’s overall popularity in the collection
Suppose the term $t_i$ appears in $n_i$ out of $N$ documents. Then

Inverse Document Frequency: $idf_i := \log \dfrac{N}{n_i + 1}$
The +1 in the denominator avoids dividing by 0 if $t_i$ doesn’t appear in any doc
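
The smoothed inverse document frequency above can be computed as sketched below. The natural logarithm is an assumption (no base is specified), and the four-document corpus is hypothetical, with each document already reduced to its set of distinct terms.

```python
import math

def inverse_document_frequencies(docs, vocabulary):
    """idf_i = log(N / (n_i + 1)), where n_i is the number of documents containing term i."""
    n_docs = len(docs)
    idf = {}
    for term in vocabulary:
        n_i = sum(1 for doc in docs if term in doc)
        idf[term] = math.log(n_docs / (n_i + 1))
    return idf

# Hypothetical corpus: each document is the set of distinct terms it contains.
docs = [
    {"the", "car", "road"},
    {"the", "truck", "highway"},
    {"the", "batsman", "pitch"},
    {"the", "car", "pitch"},
]
idf = inverse_document_frequencies(docs, ["the", "car", "batsman", "unicorn"])
print(idf)
```

Rarer terms receive larger weights, and the unseen term "unicorn" is handled without a division by zero. A term present in every document gets log(N / (N + 1)), which is slightly negative and approaches 0 as N grows.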


tf-idf: Term frequency-inverse document frequency
Documents: $D_1, \ldots, D_N$. Terms ($\Sigma$): $t_1, \ldots, t_m$
Finally, the weight or importance of a term $t_i$ in document $D_j$ is given as
$\text{tf-idf}(i, j) = tf_{ij} \times idf_i$
Check the extreme cases
If $t_i$ appears in all the documents ($n_i = N$), then $idf_i = \log \frac{N}{N+1} \approx 0$ for large $N$ (exactly 0 without the +1), so $\text{tf-idf}(i, j) \approx 0$ in all $D_j$
Many stop words would get a score close to 0
A term frequently appearing in some docs gets a higher score there

Bengfort, Bilbro & Ojeda: Applied Text Analysis with Python


tf-idf: Example
D1 : “The car is driven on the road”
D2 : “The truck is driven on the highway”

The common words score zero (not significant); this holds for idf computed as $\log(N / n_i)$, without the +1

The scores of “car”, “truck”, “road”, and “highway” are non-zero
(significant words)
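
The example can be checked numerically. The sketch below computes tf-idf for D1 and D2 twice: once with the unsmoothed idf log(N / n_i), which reproduces the scores described above, and once with the +1 in the denominator. The natural logarithm, whitespace tokenization, and the `smooth` flag are assumptions of this sketch.

```python
import math
from collections import Counter

docs = {
    "D1": "The car is driven on the road",
    "D2": "The truck is driven on the highway",
}
tokenized = {name: text.lower().split() for name, text in docs.items()}
n_docs = len(tokenized)
vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})
# n_i: number of documents that contain each term.
doc_freq = {t: sum(t in tokens for tokens in tokenized.values()) for t in vocabulary}

def tf_idf(tokens, smooth):
    """tf (max-normalized) times idf; smooth=True adds 1 to the idf denominator."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    weights = {}
    for term in vocabulary:
        tf = counts[term] / max_count
        idf = math.log(n_docs / (doc_freq[term] + (1 if smooth else 0)))
        weights[term] = tf * idf
    return weights

unsmoothed = {name: tf_idf(tokens, smooth=False) for name, tokens in tokenized.items()}
smoothed = {name: tf_idf(tokens, smooth=True) for name, tokens in tokenized.items()}

print(unsmoothed["D1"])
print(smoothed["D1"])
```

With only two documents the smoothing term dominates: under the +1 variant "car" scores 0 and "the" becomes negative, so the pattern in this example is visible only with the unsmoothed idf. On a large corpus the two variants behave almost identically.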


The tf-idf Model

Each document is represented by a real vector of tf-idf weights $\in \mathbb{R}^{|\Sigma|}$


Vector Space Models
“worst acting, worst plot, worst movie ever”
“best acting, best movie ever”
Set of Words

Bag of Words

tf-idf


Vector Space Models
Problems with the previous three vector space models
Dimensionality blow-up: |Σ| could be very large
None of them preserves word order, which carries contextual information
The following two documents produce identical vectors (in all 3 models), although their context and meaning are very different
Mary is faster than John
John is faster than Mary
They ignore synonyms (“old bike” vs “used bike”) and homonyms
An n-gram vocabulary takes care of context to some extent
Solution: Word embedding
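
The word-order problem in the Mary/John example is easy to verify. In the sketch below the two sentences have identical unigram bags, while a bigram vocabulary separates them; the helper name `bigrams` is an assumption of this example.

```python
from collections import Counter

def bigrams(tokens):
    """Contiguous word pairs of a token sequence."""
    return list(zip(tokens, tokens[1:]))

a = "Mary is faster than John".lower().split()
b = "John is faster than Mary".lower().split()

# Word-level bags ignore order, so the two sentences collapse to the same multiset.
same_unigram_bag = Counter(a) == Counter(b)

# Bags of bigrams keep local order and therefore differ.
same_bigram_bag = Counter(bigrams(a)) == Counter(bigrams(b))
shared_bigrams = set(bigrams(a)) & set(bigrams(b))

print(same_unigram_bag, same_bigram_bag)
print(shared_bigrams)
```

Only the middle bigrams are shared, which is the sense in which n-grams recover context "to some extent" at the price of an even larger vocabulary.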


Vector Space Models: Word embedding
Represent each word with an n-dimensional dense vector ▷ word2vec
Words appearing in similar contexts are mapped to close-by points in $\mathbb{R}^n$
Neural networks are used to learn these mappings ▷ See svd


Vector Space Models: Document embedding
Can be extended to learn document level embeddings
Following is a 2-D representation of n-D document embeddings. (Can
convert n-D vectors to 2-D vectors by tSNE or PCA)
