Lecture 7
● Use Cases
● Text Analysis Steps
● Text Processing
● TF-IDF
● PageRank
● Topic Modeling
● Sentiment Analysis
Prepared by Dr. Sadeem Alsudais
Use Cases
1. Document search
2. Determining sentiment
3. Identifying topics
Text Analysis
Text Analysis Steps
1. Collecting raw text. Web scraping and crawling → parsing pages → corpus (a large collection of texts).
a. Parsing. The process that takes unstructured text and imposes a structure for further analysis.
2. Representing text. Tokenization → dropping stop words → normalization → stemming and lemmatization.
3. Text mining. Uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.
a. Text retrieval. The identification of the documents in a corpus that contain
search items such as specific words, phrases, topics, or entities like people or
organizations.
b. Topic modeling
c. Sentiment analysis
4. Gain insights.
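The "Representing text" step above can be sketched in Python as follows. This is a toy illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a crude suffix stripper rather than a real stemmer, and all function names are made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(text):
    """Split on anything that is not a letter, digit, or apostrophe."""
    return [t for t in re.split(r"[^A-Za-z0-9']+", text) if t]

def drop_stop_words(tokens):
    """Remove stop words; the comparison ignores case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def normalize(tokens):
    """Case-fold every token."""
    return [t.lower() for t in tokens]

def naive_stem(token):
    """Toy suffix stripper (an assumption, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def represent(text):
    """Tokenization -> dropping stop words -> normalization -> stemming."""
    tokens = tokenize(text)
    tokens = drop_stop_words(tokens)
    tokens = normalize(tokens)
    return [naive_stem(t) for t in tokens]
```

Calling `represent("The crawlers are parsing pages of the Web")` keeps four terms. The stems it produces need not be dictionary words, which is one reason lemmatization is listed alongside stemming.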
Tokenization
● Tokenization is the task of separating words (tokens) from the body of text.
● A common approach is to tokenize on spaces and punctuation.
● Is it always good to split on punctuation? What are the tokens for “Wi-Fi” and “aren’t”? What about IP addresses, e.g., “142.32.48.231”?
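The questions above can be explored with a short Python sketch. Both function names are hypothetical, and the second rule (keeping a hyphen, apostrophe, or period that sits between word characters) is only one assumed alternative, not a recommended tokenizer.

```python
import re

def split_on_space_and_punct(text):
    """Treat every run of non-alphanumeric characters as a delimiter."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def split_keeping_inner_punct(text):
    """Assumed alternative: keep -, ', ’ and . when they sit between word characters."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'’.][A-Za-z0-9]+)*", text)

sample = "Wi-Fi aren't 142.32.48.231"
naive_tokens = split_on_space_and_punct(sample)     # breaks all three items apart
careful_tokens = split_keeping_inner_punct(sample)  # keeps each item whole
```

The first rule breaks “Wi-Fi” into two tokens and the IP address into four. The second keeps them whole, but it would also glue together text such as "end.Start" when a space is missing, so neither rule is right for every input.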
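Dictionary-based segmentation of the string s = “iamace” (end = start + length)
Dictionary: i, am, a, ace
Entry (start, end) is T when the characters from start through end can be split into dictionary words, and F otherwise.

          end    0    1    2    3    4    5
                 i    a    m    a    c    e
start 0          T    T    T    T    F    T
start 1               T    T    T    F    T
start 2                    F    F    F    F
start 3                         T    F    T
start 4                              F    F
start 5                                   F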
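A table like the one for “iamace” can be filled in by dynamic programming over increasing substring lengths. The sketch below is one way to do it; the function name `segmentable_table` and the inclusive `end` index are assumptions made for this illustration.

```python
def segmentable_table(s, dictionary):
    """table[start][end] is True when s[start:end + 1] splits into dictionary words."""
    n = len(s)
    table = [[False] * n for _ in range(n)]
    for length in range(n):                  # length = end - start
        for start in range(n - length):
            end = start + length
            if s[start:end + 1] in dictionary:
                table[start][end] = True
                continue
            # Otherwise try every split point; both halves were filled in earlier
            # because they are shorter than the current substring.
            table[start][end] = any(
                table[start][k] and table[k + 1][end] for k in range(start, end)
            )
    return table

table = segmentable_table("iamace", {"i", "am", "a", "ace"})
can_segment_whole_string = table[0][5]
```

The whole string is segmentable exactly when the top-right entry, `table[0][len(s) - 1]`, is true.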
● What if every candidate word is in the dictionary, but the segmentation is still poor?
● Example: “the table down there” written without spaces is “thetabledownthere”.
○ BAD: theta bled own there
○ BAD: the tabled own there
○ GOOD: the table down there
● Solution:
○ Each word in the dictionary has a probability (relative frequency).
○ For each path (segmentation), multiply the probabilities of the words on the path, assuming the words occur independently.
○ Choose the path with the maximal probability.
● What about Java’s java.util.StringTokenizer?
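The probability-based solution above can be sketched in Python. The probabilities in `WORD_PROB` are invented for illustration and do not come from any real corpus. The function name `best_segmentation` is also hypothetical. Logarithms are summed instead of multiplying raw probabilities, which gives the same ranking while avoiding underflow on long inputs.

```python
import math

# Hypothetical word probabilities, invented for this example only.
WORD_PROB = {
    "the": 0.05, "table": 0.002, "down": 0.003, "there": 0.004,
    "theta": 0.00001, "bled": 0.00002, "own": 0.001, "tabled": 0.00001,
}

def best_segmentation(s, word_prob):
    """Return (words, probability) of the most probable split of s, or (None, 0.0)."""
    n = len(s)
    # best[i] holds (log-probability, words) for the best split of the prefix s[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = s[start:end]
            if best[start] is None or word not in word_prob:
                continue
            score = best[start][0] + math.log(word_prob[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    if best[n] is None:
        return None, 0.0
    return best[n][1], math.exp(best[n][0])

words, probability = best_segmentation("thetabledownthere", WORD_PROB)
```

With these assumed probabilities, the path through common words wins over the paths through rare words such as "theta" or "tabled", even though all three candidate segmentations use only dictionary words.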
Normalization
Stemming and Lemmatization
● Example stemmers
Search and Indexing
Indexing
● Also called an inverted index.
● Example: search “Julius Caesar”. Which order is better?
Positional Indexing Example
● Example:
Query: “to₁ be₂ or₃ not₄ to₅ be₆” (subscripts are word positions in the query)
TO, 993427:
‹ 1: ‹7, 18, 33, 72, 86, 231›;
2: ‹1, 17, 74, 222, 255›;
4: ‹8, 16, 190, 429, 433›;
5: ‹363, 367›;
7: ‹13, 23, 191›; . . . ›
BE, 178239:
‹ 1: ‹17, 25›;
4: ‹17, 191, 291, 430, 434›;
5: ‹14, 19, 101›; . . . ›
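Document 4 is a match: TO occurs at positions 429 and 433 and BE at positions 430 and 434, exactly the spacing the query requires.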
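The positional lookup in this example can be sketched in Python. The postings below are copied from the lists shown above; documents hidden behind ". . ." are not included. Postings for OR and NOT are not given, so the sketch checks only the TO and BE offsets. The names `INDEX`, `QUERY`, and `match_offsets` are hypothetical.

```python
# Postings copied from the example: term -> {document id: [positions]}.
INDEX = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

def match_offsets(index, constraints):
    """constraints is a list of (term, offset); return {doc id: [start positions]}."""
    # Only documents that contain every term can match.
    docs = set.intersection(*(set(index[term]) for term, _ in constraints))
    matches = {}
    for doc in sorted(docs):
        positions = {term: set(index[term][doc]) for term, _ in constraints}
        first_term, first_offset = constraints[0]
        starts = [
            p - first_offset
            for p in index[first_term][doc]
            if all((p - first_offset + off) in positions[term] for term, off in constraints)
        ]
        if starts:
            matches[doc] = starts
    return matches

# "to be or not to be": TO at offsets 0 and 4, BE at offsets 1 and 5.
QUERY = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
result = match_offsets(INDEX, QUERY)
```

Intersecting the document lists first means position checks run only on documents 1, 4, and 5, the ones that contain both terms.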
● The more often the terms are mentioned in the document the more relevant
the document is.
● Term frequency can be normalized.
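A small Python sketch of term frequency follows. Dividing by document length is only one common normalization and is assumed here, since the text does not say which one is meant. Both function names are hypothetical.

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)

def normalized_term_frequencies(tokens):
    """Counts divided by document length (assumed normalization)."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}

doc = "to be or not to be".split()
raw = term_frequencies(doc)
normalized = normalized_term_frequencies(doc)
```

Normalizing keeps a long document from outranking a short one merely because it contains more words overall.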
TF-IDF - Ranking Documents
TF-IDF Example
TF-IDF Example Continued
term/doc Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8 Doc9 Doc10
Information Retrieval on the Web
PageRank Calculation
● P is a page
● d is a damping factor (usually 0.85)
● P1...Pn are pages that link to P
● PR(Pi) is the PageRank of Pi
● C(Pi) is the number of outgoing links from Pi
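The symbols above match the original PageRank formula, PR(P) = (1 - d) + d * (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn)). That form is assumed here because the formula itself is not reproduced in the text. A minimal iterative sketch in Python follows; the function name `pagerank`, the `links` dictionary, the starting value of 1.0, and the fixed iteration count are illustrative choices.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}   # C(Pi)
    pr = {p: 1.0 for p in pages}                            # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(Pi) / C(Pi) over every page Pi that links to p.
            incoming = sum(
                pr[q] / out_count[q] for q in pages if p in links.get(q, [])
            )
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, d=0.5)
```

For this small graph with d = 0.5, the values settle near 1.077 for A, 0.769 for B, and 1.154 for C. C ranks highest because it receives links from both other pages.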
PageRank Calculation Example
...
Topic Modeling
● Topic
○ Consists of a cluster of words that frequently occur together and
share the same theme.
○ Formally defined as a distribution over a fixed vocabulary of words.
○ A cluster of words with related meanings, and each word has a
corresponding weight inside this topic.
● Different topics have different distributions over the same
vocabulary.
● Documents can be grouped with clustering methods; however, a better approach is to use topic modeling.
● The simplest topic model is latent Dirichlet allocation (LDA).
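The idea of a topic as a distribution over a fixed vocabulary can be illustrated with a short Python sketch. The vocabulary, the two topics, and every weight below are invented for illustration. This is not LDA: LDA infers such distributions from a corpus and lets each document mix several topics, while the sketch only scores a word list against hand-written topics.

```python
import math

VOCABULARY = ["goal", "match", "team", "vote", "party", "election"]

# Hypothetical weights; each topic is a probability distribution over VOCABULARY.
TOPICS = {
    "sports": {"goal": 0.35, "match": 0.30, "team": 0.25,
               "vote": 0.04, "party": 0.03, "election": 0.03},
    "politics": {"goal": 0.03, "match": 0.03, "team": 0.09,
                 "vote": 0.30, "party": 0.25, "election": 0.30},
}

def log_likelihood(tokens, topic):
    """Log-probability of the tokens if each word were drawn independently from one topic."""
    return sum(math.log(topic[t]) for t in tokens if t in topic)

def most_likely_topic(tokens):
    """Name of the topic under which the tokens are most probable."""
    return max(TOPICS, key=lambda name: log_likelihood(tokens, TOPICS[name]))

guess = most_likely_topic(["vote", "election", "team"])
```

Both topics assign a weight to every word in the same vocabulary; only the weights differ. That is why "team" can appear in a politics document without changing the outcome.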
LDA
Topic Modeling Example
Sentiment Analysis
Gain Insights by Visualization