
Clustering and Similarity: Retrieving Documents
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Retrieving documents of interest



Document retrieval
• Currently reading an article you like
• Goal: Want to find similar articles


Challenges
•  How do we measure similarity?
•  How do we search over articles?



Word count representation for measuring similarity


Word count document representation
• Bag of words model
- Ignore order of words
- Count # of instances of each word in vocabulary

Example: “Carlos calls the sport futbol. Emily calls the sport soccer.”
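A minimal sketch of how these counts could be computed in Python (the whitespace tokenization and lowercasing here are assumptions for illustration; the slides do not specify a tokenizer):

```python
# Sketch: bag-of-words counts for the example sentence.
# Tokenization is deliberately simple: split on whitespace, strip punctuation, lowercase.
from collections import Counter

doc = "Carlos calls the sport futbol. Emily calls the sport soccer."
tokens = [w.strip(".,").lower() for w in doc.split()]
word_counts = Counter(tokens)

print(word_counts)
# e.g. Counter({'calls': 2, 'the': 2, 'sport': 2, 'carlos': 1, 'futbol': 1, ...})
```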


Measuring similarity

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Doc 2: [3 0 0 0 2 0 0 1 0 1 0 0 0]

Similarity = 1*3 + 5*2 = 13


Measuring similarity

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Doc 3: [0 0 1 0 0 0 9 0 0 6 0 4 0]

Similarity = 0
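The similarity used on these slides is the dot product of the word-count vectors; both examples can be reproduced directly, as in this NumPy sketch:

```python
# Sketch: dot-product similarity between the word-count vectors from the slides.
import numpy as np

doc1 = np.array([1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0])
doc2 = np.array([3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0])
doc3 = np.array([0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0])

print(np.dot(doc1, doc2))  # 13 -- some shared words
print(np.dot(doc1, doc3))  # 0  -- no shared words
```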


Issues with word counts – Doc length

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]   Doc 2: [3 0 0 0 2 0 0 1 0 1 0 0 0]   → Similarity = 13
Doubled: [2 0 0 0 10 6 0 0 2 0 0 0 0]   [6 0 0 0 4 0 0 2 0 2 0 0 0]   → Similarity = 52

Same content, but the longer (doubled) documents get a much larger similarity score.


Solution = normalize

Count vector: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Norm: √(1² + 5² + 3² + 1²) = 6
Normalized vector: [1/6 0 0 0 5/6 3/6 0 0 1/6 0 0 0 0]
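A sketch of the normalization step, using the count vectors from the earlier slides, and of how it removes the document-length effect:

```python
# Sketch: normalize count vectors to unit length so that document length
# does not inflate the similarity score.
import numpy as np

doc1 = np.array([1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0], dtype=float)
doc2 = np.array([3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0], dtype=float)

def normalize(v):
    return v / np.linalg.norm(v)  # divide by sqrt of the sum of squared counts

print(np.linalg.norm(doc1))                              # 6.0
print(np.dot(normalize(doc1), normalize(doc2)))          # ~0.56
print(np.dot(normalize(2 * doc1), normalize(2 * doc2)))  # same ~0.56: doubling no longer inflates it
```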
Prioritizing important words with tf-idf


Issues with word counts – Rare words

Common words in doc (“the”, “player”, “field”, “goal”) dominate rare words like “futbol”, “Messi”.
Document frequency
• What characterizes a rare word?
- Appears infrequently in the corpus
• Emphasize words appearing in few docs
- Equivalently, discount word w based on # of docs containing w in corpus


Important words
• Do we want only rare words to dominate?
• What characterizes an important word?
- Appears frequently in document (common locally)
- Appears rarely in corpus (rare globally)
• Trade-off between local frequency and global rarity


TF-IDF document representation
• Term frequency – inverse document frequency (tf-idf)
• Term frequency
- Same as word counts
• Inverse document frequency
• tf-idf = tf * idf
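A sketch of one common tf-idf weighting. The exact idf formula below (log of #docs over 1 + #docs containing the word) is an assumption for illustration; the handwritten formulas on the original slides are not preserved in the text:

```python
# Sketch: tf-idf weights for a tiny corpus of word-count dictionaries.
# Assumed idf formula: idf(w) = log( #docs / (1 + #docs containing w) )
import math
from collections import Counter

corpus = [
    Counter("carlos calls the sport futbol emily calls the sport soccer".split()),
    Counter("the game ended with a late goal by messi".split()),
    Counter("the markets closed higher on tuesday".split()),
]

num_docs = len(corpus)
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(doc.keys())  # in how many docs does each word appear?

def tf_idf(doc):
    return {w: tf * math.log(num_docs / (1.0 + doc_freq[w])) for w, tf in doc.items()}

print(tf_idf(corpus[0]))
# 'the' (in every doc) is heavily discounted; rare words like 'futbol' keep positive weight
```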
Retrieving similar documents



Nearest neighbor search
•  Query article:

•  Corpus:

•  Specify: Distance metric


•  Output: Set of most similar articles
1 – Nearest neighbor
• Input: Query article
• Output: Most similar article
• Algorithm:
- Search over each article in corpus
• Compute s = similarity(query article, current article)
• If s > Best_s, record the current article as most similar and set Best_s = s
- Return the most similar article
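A minimal sketch of this linear scan (vectors are assumed to be precomputed, e.g. normalized tf-idf vectors, and similarity is taken as the dot product; the names are illustrative):

```python
# Sketch: 1-nearest-neighbor search by linear scan over the corpus.
import numpy as np

def similarity(a, b):
    return np.dot(a, b)  # assumes vectors are already normalized

def nearest_neighbor(query_vec, corpus_vecs):
    best_s, best_idx = -np.inf, None
    for i, doc_vec in enumerate(corpus_vecs):
        s = similarity(query_vec, doc_vec)
        if s > best_s:               # keep the most similar article seen so far
            best_s, best_idx = s, i
    return best_idx, best_s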
k – Nearest neighbor
•  Input: Query article
•  Output: List of k similar articles

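The k-nearest-neighbor variant keeps the k best matches instead of a single one; a sketch under the same assumptions:

```python
# Sketch: k-nearest-neighbor search -- return the k most similar articles.
import numpy as np

def k_nearest_neighbors(query_vec, corpus_vecs, k):
    sims = [np.dot(query_vec, doc_vec) for doc_vec in corpus_vecs]
    # indices of the k largest similarities, most similar first
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
```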


Clustering documents



Structure documents by topic
• Discover groups (clusters) of related articles, e.g., SPORTS, WORLD NEWS


What if some of the labels are known?
• Training set of labeled docs, e.g., SPORTS, WORLD NEWS, ENTERTAINMENT, SCIENCE
Multiclass classification problem
• Assign a new doc to one of several classes: WORLD NEWS, SPORTS, ENTERTAINMENT, SCIENCE, TECHNOLOGY
• Example of supervised learning
Clustering
• No labels provided
• Want to uncover cluster structure
• Input: docs as vectors
• Output: cluster labels
• An unsupervised learning task
What defines a cluster?
• Cluster defined by center & shape/spread
• Assign observation (doc) to cluster (topic label)
- Score under cluster is higher than under others
- Often, just more similar to assigned cluster center than to other cluster centers


k-means
• Assume
- Similarity metric = distance to cluster center (smaller is better)
k-means algorithm
0. Initialize cluster centers
1. Assign observations to closest cluster center
2. Revise cluster centers as mean of assigned observations
3. Repeat 1.+2. until convergence
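A compact sketch of these steps (a basic Lloyd-style k-means on document vectors; initializing centers from randomly chosen documents is an assumption, not prescribed by the slides):

```python
# Sketch: k-means on document vectors (rows of X), following the steps above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 0. initialize cluster centers
    for _ in range(n_iters):
        # 1. assign each observation to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. revise each center as the mean of its assigned observations
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # 3. repeat until convergence
            break
        centers = new_centers
    return labels, centers
```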




Other examples



Clustering images
•  For search, group as:
- Ocean
- Pink flower
- Dog
- Sunset
- Clouds
- …



Grouping patients by medical condition
• Better characterize subpopulations and diseases


Example: Patients and seizures are diverse
(Figure: recordings across channels over time)


Cluster seizures by observed time courses



Products on Amazon
• Discover product categories from purchase histories (e.g., “furniture”, “baby”)
• Or discover groups of users
Structuring web search results
•  Search terms can have multiple meanings
•  Example: “cardinal”

•  Use clustering to structure output


Discovering similar neighborhoods
• Task 1: Estimate price at a small regional level
• Challenge:
- Only a few (or no!) sales in each region per month
• Solution:
- Cluster regions with similar trends and share information within a cluster

(Figures: estimated global price trend using the seasonality decomposition approach of Cleveland et al. (1990), after adjusting for hedonic effects; map of cluster assignments across tracts in the City of Seattle)
Discovering similar neighborhoods
• Task 2: Forecast violent crimes to better task police
• Again, cluster regions and share information!
• Leads to improved predictions compared to examining each region independently

(Figure: map of Washington, DC)
Summary for clustering and similarity


What you can do now…
• Describe ways to represent a document (e.g., raw word counts, tf-idf, …)
• Measure the similarity between two documents
• Discuss issues related to using raw word counts
- Normalize counts to adjust for document length
- Emphasize important words using tf-idf
• Implement a nearest neighbor search for document retrieval
• Describe the input (unlabeled observations) and output (labels) of a clustering algorithm
• Determine whether a task is supervised or unsupervised
• Cluster documents using k-means (algorithmic details to come…)
• Describe other applications of clustering
