Lecture 7

SWE564

Software Data Mining


Fall 2024
Text Analytics

Prepared by Dr. Sadeem Alsudais


Overview

● Use Cases
● Text Analysis Steps
● Text Processing
● TF-IDF
● PageRank
● Topic Modeling
● Sentiment Analysis

Use Cases

1. Document search
2. Determining sentiments
3. Identifying topics

Text Analysis

● Text analysis is the representation, processing, and modeling of textual data to derive useful insights.
● Text analysis suffers from the curse of high dimensionality.
○ Every distinct term is a dimension.
● The data is unstructured.

Text Analysis Steps
1. Collecting raw text. Web scraping and crawling → parsing pages → a corpus (a large collection of texts).
a. Parsing: the process that takes unstructured text and imposes a structure on it for further analysis.
2. Representing text. Tokenization → dropping stop words → normalization → stemming and lemmatization.
3. Text mining. Uses the terms and indexes produced by the prior two steps to discover meaningful insights pertaining to the domains or problems of interest.
a. Text retrieval: the identification of the documents in a corpus that contain search items such as specific words, phrases, topics, or entities like people or organizations.
b. Topic modeling
c. Sentiment analysis
4. Gaining insights.

Tokenization

● Tokenization is the task of separating words (tokens) from the body of text.
● A common approach is tokenizing on spaces and punctuation (a sketch follows below).
● Is it always good to split on punctuation? What are the tokens for the words “Wi-Fi” and “aren’t”? What about IP addresses, e.g., “142.32.48.231”?
● Should we always split on spaces? Should “Saudi Arabia” be considered a single token or two? What about words that can be written both with and without a space, e.g., “whitespace” vs. “white space”?
● What about East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words?
● These tokenization issues are language-specific.
● One approach to address them is word segmentation.
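As an illustration of how naive this common approach is, here is a minimal Python sketch that tokenizes on anything that is not alphanumeric; tokenize is a hypothetical helper written for this slide, not a standard function:

import re

# A naive tokenizer: treat every maximal run of alphanumeric characters
# as a token, discarding spaces and punctuation.
def tokenize(text):
    return re.findall(r"\w+", text.lower())

print(tokenize("Wi-Fi aren't 142.32.48.231"))
# ['wi', 'fi', 'aren', 't', '142', '32', '48', '231'] -- the meaningful
# units "Wi-Fi", "aren't", and the IP address are all broken apart.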
Tokenization - Word Segmentation

● Suppose blank spaces and punctuation are removed.
● Example: “i am ace” → “iamace”
○ We want to break it into a sequence of words that exist in a given dictionary.
○ WRONG: ia ma ce
○ WRONG: i ama ce
○ GOOD: i am ace
○ Source: https://www.youtube.com/watch?v=WepWFGxiwRs

[Figure: String s, with a candidate word spanning from index start to end = start + length.]

© UCI’s CS221


Tokenization - Word Segmentation

Dictionary: {i, a, am, ace}. Cell (start, end) is T if the substring of “iamace” from index start through index end (inclusive) can be segmented into dictionary words:

                end=0  end=1  end=2  end=3  end=4  end=5
                 (i)    (a)    (m)    (a)    (c)    (e)
start=0 (i)       T      T      T      T      F      T
start=1 (a)              T      T      T      F      T
start=2 (m)                     F      F      F      F
start=3 (a)                            T      F      T
start=4 (c)                                   F      F
start=5 (e)                                          F

© UCI’s CS221
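The table can be computed with dynamic programming. Below is a minimal Python sketch (the function name and structure are ours, not from any library) that checks segmentability the same way:

from functools import lru_cache

DICTIONARY = {"i", "a", "am", "ace"}   # the dictionary from the table above
s = "iamace"

@lru_cache(maxsize=None)
def breakable(start):
    """True if the suffix s[start:] can be segmented into dictionary words."""
    if start == len(s):
        return True                    # an empty suffix is trivially segmentable
    return any(s[start:end] in DICTIONARY and breakable(end)
               for end in range(start + 1, len(s) + 1))

print(breakable(0))                    # True, via the split "i am ace"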


Tokenization - Word Segmentation

● What happens if the words are in the dictionary but the split is a low-quality segmentation?
● Example: “the table down there” → “thetabledownthere”
○ BAD: theta bled own there
○ BAD: the tabled own there
○ GOOD: the table down there
● Solution (a sketch follows below):
○ Each word has a probability/frequency in the dictionary.
○ For each path/segmentation, multiply the probabilities of the words on the path, assuming independent distributions of words.
○ Choose the path with the maximal probability.
● What about Java’s java.util.StringTokenizer?

© UCI’s CS221
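A minimal Python sketch of this maximum-probability segmentation; the word probabilities below are illustrative, not taken from any real corpus:

# Illustrative word probabilities; a real system would estimate these
# from corpus frequencies.
WORD_PROB = {"the": 0.05, "table": 0.01, "down": 0.02, "there": 0.02,
             "theta": 0.0001, "bled": 0.0001, "own": 0.005, "tabled": 0.0005}

def best_segmentation(s, memo=None):
    """Return (probability, word list) for the best split of s."""
    if memo is None:
        memo = {}
    if s == "":
        return (1.0, [])
    if s not in memo:
        best = (0.0, None)             # probability 0 if no split exists
        for i in range(1, len(s) + 1):
            word = s[:i]
            if word in WORD_PROB:
                p_rest, rest = best_segmentation(s[i:], memo)
                p = WORD_PROB[word] * p_rest   # independence assumption
                if p > best[0]:
                    best = (p, [word] + rest)
        memo[s] = best
    return memo[s]

print(best_segmentation("thetabledownthere")[1])
# ['the', 'table', 'down', 'there'] beats 'theta bled own there'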


Dropping Stop Words

● Common words that are of little value in helping select documents matching a user’s need are called stop words.
● They are excluded from the vocabulary. Why? For search efficiency.
● How do we determine the list of stop words (also called a “stop list”)? Based on frequency, i.e., the total number of times each term appears in the document collection.
● Often, the list is hand-filtered.
● NLTK provides a stop list (a sketch follows below).
● Is removing stop words always good? Is a result for “flights TO AlUla FROM Riyadh” the same as for “flights FROM AlUla TO Riyadh”?
● Solution:
○ In applications such as sentiment analysis and topic modeling, the stop words are removed; in IR applications, we instead use statistical approaches that exploit term frequencies.

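A minimal sketch of stop-word removal with NLTK’s English stop list (assuming nltk is installed and its “stopwords” corpus can be downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)          # fetch NLTK's stop list once
stop_list = set(stopwords.words("english"))

tokens = ["flights", "to", "alula", "from", "riyadh"]
print([t for t in tokens if t not in stop_list])
# ['flights', 'alula', 'riyadh'] -- "to" and "from" are dropped, so the
# two flight queries from the slide become indistinguishable.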
Normalization

● Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
○ KSA = Saudi Arabia = Kingdom of Saudi Arabia
○ car = automobile = vehicle
● Normalization implicitly creates equivalence classes, which are normally named after one member of the set.
● Approaches (a sketch follows below):
○ Removing accents, e.g., naive and naïve. Not applicable in Arabic.
○ Capitalization/case-folding, e.g., KSU and ksu. Not applicable in Arabic.
● The appropriate normalization depends on the language.

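A minimal sketch of case-folding plus accent removal using only Python’s standard library (the normalize helper is ours):

import unicodedata

def normalize(token):
    token = token.casefold()           # case-folding: "KSU" -> "ksu"
    # Decompose characters, then drop combining marks: "naïve" -> "naive"
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(normalize("KSU"))    # ksu
print(normalize("naïve"))  # naive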
Stemming and Lemmatization

● Both reduce derivationally related forms of a word to a common base form.
● Example: car, cars, car’s, cars’ ⇒ car
● Stemming refers to a crude heuristic process that chops off the ends of words.
● Lemmatization aims to return the base or dictionary form of a word, which is known as the lemma.
● Example: “saw”
○ Stemming ⇒ “s”
○ Lemmatization ⇒ “see” (as a verb) or “saw” (as a noun)

Stemming and Lemmatization

● Example stemmers (the original slide showed a figure of sample stemmer outputs; a sketch follows below).
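For instance, a minimal sketch contrasting NLTK’s Porter stemmer with its WordNet lemmatizer (assuming nltk and its “wordnet” corpus are available):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)                 # lemma dictionary
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cars"))                          # car
print(stemmer.stem("operational"))                   # oper (crude chopping)
print(lemmatizer.lemmatize("cars"))                  # car (noun by default)
print(lemmatizer.lemmatize("saw", pos="v"))          # see (verb lemma)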
Stemming and Lemmatization

● واصل (one word) → “he continued” → lemmatized to وصل
● Possible stemming rules?
○ واصل → remove 1 prefix → اصل (WRONG, a different word meaning “origin”)
○ واصل → remove 2 prefixes → صل (WRONG, the verb “pray” or an order to connect)
● Challenges in Arabic stemming:
○ Removing prefixes can lead to completely different meanings.
○ We cannot simply remove the prefixes.

Search and Indexing

● Given the search query “Riyadh Season”, how do we know whether to retrieve documents relevant to the annual Riyadh event, the weather in Riyadh, the Four Seasons hotel in Riyadh, etc.?
● The problem arises because the tokens of this query are just “Riyadh” and “Season”.
● Solutions:
○ Part-of-Speech (POS) tagging
○ n-grams (e.g., indexing a phrase such as “All rights reserved” as a single term)
○ Positional indexing

Indexing

● A term-to-documents index is also called an inverted index.
● [Figure: an inverted index; to answer the search “Julius Caesar”, the postings lists of the two terms are intersected. Which order of processing the terms is better?]
Positional Indexing Example

● Example:
Query: “to1 be2 or3 not4 to5 be6”

TO, 993427:
‹ 1: ‹7, 18, 33, 72, 86, 231›;
  2: ‹1, 17, 74, 222, 255›;
  4: ‹8, 16, 190, 429, 433›;
  5: ‹363, 367›;
  7: ‹13, 23, 191›; . . . ›

BE, 178239:
‹ 1: ‹17, 25›;
  4: ‹17, 191, 291, 430, 434›;
  5: ‹14, 19, 101›; . . . ›

Document 4 is a match: “to” occurs at positions 429 and 433 and “be” at 430 and 434, so “to be” appears at 429–430 and again four positions later at 433–434, matching the query pattern.

© UCI’s CS221
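A minimal sketch of answering a two-word phrase query from such a positional index; the index layout and function below are illustrative, not a standard API:

# term -> {doc_id: sorted list of positions}, copied from the example above.
positional_index = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

def docs_matching_phrase(index, w1, w2):
    """Return doc ids where w2 occurs immediately after w1."""
    matches = []
    for doc in index[w1].keys() & index[w2].keys():   # docs containing both
        pos2 = set(index[w2][doc])
        if any(p + 1 in pos2 for p in index[w1][doc]):
            matches.append(doc)
    return sorted(matches)

print(docs_matching_phrase(positional_index, "to", "be"))  # [4]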


TFIDF - Ranking Documents

● Term Frequency (TF): given a term t and a document d, the term frequency tf(t, d) is the number of times t appears in d.
● The more often a term is mentioned in a document, the more relevant the document is.
● Term frequency can be normalized, e.g., logarithmically: tf_norm(t, d) = 1 + log10(tf(t, d)) when tf(t, d) > 0.
● Term frequency by itself suffers from a critical problem: it regards each stand-alone document as the entire world.
● Using term frequency alone, the search engine cannot properly assess how relevant each document is in relation to the search query, because terms that are common across the whole corpus count just as much as rare, discriminative ones.

TFIDF - Ranking Documents

● Document Frequency (DF): the number of documents in the corpus that contain term t.
○ Contrast with collection frequency: the total number of occurrences of t across the entire corpus.
● Let corpus D contain N documents. The document frequency of a term t in D is written df(t, D).
● The Inverse Document Frequency (IDF) of a term t is obtained by dividing N by the document frequency of the term and then taking the logarithm of that quotient:

  idf(t, D) = log10( N / df(t, D) )

(In practice, 1 can be added to the denominator so as to not divide by 0.)

● TF-IDF (or TFIDF) is a measure that considers both the prevalence of a term within a document (TF) and the scarcity of the term over the entire corpus (IDF):

  tfidf(t, d, D) = tf(t, d) × idf(t, D)

TFIDF Example

The corpus contains N = 10 documents.

df(car) = 6   →  idf(car) = log10(10/6) = 0.22
df(auto) = 7  →  idf(auto) = log10(10/7) = 0.15
df(best) = 6  →  idf(best) = log10(10/6) = 0.22

TFIDF Example Continued

TF-IDF values (tf × idf) per term and document:

term  | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | Doc6 | Doc7 | Doc8 | Doc9 | Doc10
car   | 0.66 | 0    | 0    | 1.1  | 2.64 | 0    | 0    | 0.44 | 1.76 | 0.22
auto  | 1.2  | 0.9  | 0    | 1.8  | 0    | 0    | 1.35 | 0.15 | 0.45 | 1.5
best  | 0    | 0.22 | 1.54 | 0    | 0.22 | 1.1  | 2.64 | 0    | 0.44 | 0

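Putting TF and IDF together, here is a minimal sketch matching the slides’ formula tfidf = tf × log10(N/df); the tiny corpus is made up for illustration:

import math

corpus = {
    "Doc1": ["car", "car", "car", "auto", "auto"],   # illustrative counts
    "Doc2": ["auto", "best"],
    "Doc3": ["best", "best"],
}
N = len(corpus)

def tf(term, doc):
    return corpus[doc].count(term)

def idf(term):
    df = sum(1 for words in corpus.values() if term in words)
    return math.log10(N / df)   # assumes df > 0; add 1 to df to avoid div-by-0

print(round(tf("car", "Doc1") * idf("car"), 3))   # tf=3, df=1 -> 3*log10(3)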
Information Retrieval on the Web

● We can think of webpages as documents in the web corpus.
● The web can be thought of as a graph, where nodes are webpages and links are endorsements.
● PageRank scoring is a way of measuring the importance of website pages.
○ More important websites are likely to receive more links.

PageRank Calculation

● Algebraic form (equivalently, a Markov chain over pages):

  PR(P) = (1 - d) + d × Σi PR(Pi)/C(Pi)
        = (1 - d) + d × (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn))

where:
● P is a page.
● d is a damping factor (usually 0.85).
● P1...Pn are the pages that link to P.
● PR(Pi) is the PageRank of Pi.
● C(Pi) is the number of outgoing links from Pi.

PageRank Calculation Example

[Graph: three pages, where A links to B, B links to C, and C links to both A and B.]

PR(A) = 0.15 + 0.85 * PR(C)/2
PR(B) = 0.15 + 0.85 * (PR(A)/1 + PR(C)/2)
PR(C) = 0.15 + 0.85 * (PR(B)/1)

Starting with PR = 1 for every page, each iteration uses the previous iteration’s values:

Iter 1:
PR(A) = 0.15 + 0.85 * 1/2 = 0.575
PR(B) = 0.15 + 0.85 * (1 + 1/2) = 1.425
PR(C) = 0.15 + 0.85 * 1 = 1

Iter 2:
PR(A) = 0.15 + 0.85 * 1/2 = 0.575
PR(B) = 0.15 + 0.85 * (0.575 + 1/2) = 1.06375
PR(C) = 0.15 + 0.85 * 1.425 = 1.36125

...
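A minimal sketch of the iteration above, assuming the A → B, B → C, C → A/B link structure reconstructed from the equations:

# Pages that link TO each page, and each page's out-degree.
links_to = {"A": ["C"], "B": ["A", "C"], "C": ["B"]}
out_degree = {"A": 1, "B": 1, "C": 2}
d = 0.85

pr = {p: 1.0 for p in links_to}        # start every page at PR = 1
for it in range(1, 21):
    prev = pr.copy()                   # use the previous iteration's values
    pr = {p: (1 - d) + d * sum(prev[q] / out_degree[q] for q in links_to[p])
          for p in links_to}
    if it <= 2:
        print(it, {p: round(v, 5) for p, v in pr.items()})
# Iter 1 prints A: 0.575, B: 1.425, C: 1.0, matching the slide.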
Topic Modeling

● Topic:
○ Consists of a cluster of words that frequently occur together and share the same theme.
○ Formally defined as a distribution over a fixed vocabulary of words.
○ A cluster of words with related meanings, where each word has a corresponding weight inside the topic.
● Different topics have different distributions over the same vocabulary.
● Document grouping can be achieved with clustering methods; however, a better approach is to use topic modeling, which lets a document mix several topics rather than belong to a single cluster.
● The simplest topic model is Latent Dirichlet Allocation (LDA).

LDA

● LDA assumes that there is a fixed vocabulary of words and that the number of latent topics is predefined and remains constant.
● LDA assumes that each latent topic follows a Dirichlet distribution over the vocabulary, and each document is represented as a random mixture of latent topics.
● Example: use pyLDAvis for visualization (a fitting sketch follows below).

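A minimal sketch of fitting LDA with scikit-learn (pyLDAvis could then visualize the fitted model); the four-document corpus is made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the car engine and auto parts",
    "the best car insurance deal",
    "topic models find hidden themes in text",
    "text mining extracts themes from documents",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # document-term counts

lda = LatentDirichletAllocation(n_components=2,  # topic count fixed upfront
                                random_state=0)
doc_topics = lda.fit_transform(X)                # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):    # per-topic word weights
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])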
Topic Modeling Example

[Figure: an example topic model visualized (e.g., with pyLDAvis, as suggested on the previous slide).]
Sentiment Analysis

● A group of tasks that use statistics and natural language processing to mine opinions and to identify and extract subjective information from texts.
● Approaches:
○ Manually construct lists of words with positive sentiments (such as brilliant, awesome, and spectacular) and negative sentiments (such as awful, stupid, and hideous) → achieves accuracy around 60%.
○ Use classification methods such as SVM → these classifiers can score around 80% accuracy (a sketch follows below).
● Use a confusion matrix to evaluate the performance of a model:
○ Precision: the % of documents in the results that are relevant.
○ Recall: the % of returned documents among all relevant documents in the corpus.

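A minimal sketch of the SVM approach with scikit-learn, using TF-IDF features and a confusion matrix; the six labeled reviews are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

texts = ["brilliant and awesome film", "spectacular acting",
         "awful plot", "stupid and hideous ending",
         "awesome, simply brilliant", "hideous, awful writing"]
labels = [1, 1, 0, 0, 1, 0]                 # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = LinearSVC().fit(X[:4], labels[:4])    # train on the first four reviews
pred = clf.predict(X[4:])                   # predict the held-out two
print(confusion_matrix(labels[4:], pred))   # rows: true class, cols: predicted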
Gain Insights by Visualization

References

● Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing, and Presenting Data, Chapter 9.
● Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Chapters 2 and 6.
● UC Irvine’s CS221.
