Clustering and Similarity: Retrieving Documents
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Retrieving documents of interest
Word-count vectors for two articles:
  [1 0 0 0 5 3 0 0 1 0 0 0 0]
  [3 0 0 0 2 0 0 1 0 1 0 0 0]
Similarity = 1*3 + 5*2 = 13
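The similarity on this slide is the dot product of the two word-count vectors: multiply matching entries and add the results. A minimal sketch in Python, using the two vectors from the slide (the function name `dot_similarity` is illustrative and not from the lecture):

```python
def dot_similarity(x, y):
    """Sum of element-wise products of two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


# Word-count vectors taken from the slide.
doc_a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
doc_b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]

# Only positions where both counts are nonzero contribute: 1*3 + 5*2.
print(dot_similarity(doc_a, doc_b))  # 13
```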
Word-count vectors for a second pair of articles:
  1 0 0 0 5 3 0 0 1 0 0 0 0
  0 0 0 1 0 0 0 9 0 0 6 0 4 0
Original articles:
  [1 0 0 0 5 3 0 0 1 0 0 0 0]
  [3 0 0 0 2 0 0 1 0 1 0 0 0]
  Similarity = 13
Same articles with every count doubled:
  [2 0 0 0 10 6 0 0 2 0 0 0 0]
  [6 0 0 0 4 0 0 2 0 2 0 0 0]
  Similarity = 52
Normalizing a word-count vector:
  [1 0 0 0 5 3 0 0 1 0 0 0 0]
  norm = √(1² + 5² + 3² + 1²) = 6
  normalized: [1/6 0 0 0 5/6 3/6 0 0 1/6 0 0 0 0]
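Doubling every count raises the raw similarity from 13 to 52 even though the articles cover the same words in the same proportions. Dividing each vector by its Euclidean norm removes that dependence on document length. A minimal sketch, with illustrative helper names `normalize` and `dot_similarity` that are not from the lecture:

```python
import math


def normalize(x):
    """Divide each entry by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [a / norm for a in x]


def dot_similarity(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


doc_a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
doc_b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
doc_a2 = [2 * a for a in doc_a]  # same article, every count doubled
doc_b2 = [2 * b for b in doc_b]

raw = dot_similarity(doc_a, doc_b)            # 13
raw_doubled = dot_similarity(doc_a2, doc_b2)  # 52

# After normalization the two pairs receive the same score.
sim = dot_similarity(normalize(doc_a), normalize(doc_b))
sim_doubled = dot_similarity(normalize(doc_a2), normalize(doc_b2))
print(raw, raw_doubled, math.isclose(sim, sim_doubled))
```

The normalized score is the cosine of the angle between the two vectors, so it depends on the mix of words and not on how long the article is.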
Prioritizing important words with tf-idf
tf * idf
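The slide gives only the product tf * idf. The sketch below assumes that tf is the raw count of a word in a document and that idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. That idf formula is one common choice, not something stated on the slide, and other variants add smoothing. The function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: tf * idf} dict per tokenized document.

    Assumed definitions: tf = raw count, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return weighted


# Toy corpus, invented for this example.
corpus = [
    "the player scored the goal".split(),
    "the senate passed the bill".split(),
    "the team won the match".split(),
    "the court heard the case".split(),
]
weights = tf_idf(corpus)
print(weights[0]["the"], weights[0]["goal"])
```

Under these assumptions a word that appears in every document, such as "the", gets weight zero, while a word found in only one document, such as "goal", is weighted up.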
Retrieving similar documents
• Corpus: collection of articles to search
• Algorithm:
  - Search over each article in corpus
    • Compute s = similarity(query article, article)
    • If s > Best_s, record best article = article
      and set Best_s = s
  - Return best article
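The search loop on this slide can be sketched as a linear scan that keeps the best score seen so far. The names `nearest_neighbor` and `dot_similarity` are illustrative, the similarity function is passed in as a parameter, and the toy corpus is made up for the example (only the query and the first corpus vector come from the earlier slide).

```python
def nearest_neighbor(query, corpus, similarity):
    """Return (best_article, best_score) by scanning every article in corpus."""
    best_article, best_s = None, float("-inf")
    for article in corpus:
        s = similarity(query, article)
        if s > best_s:  # strictly better: remember this article
            best_article, best_s = article, s
    return best_article, best_s


def dot_similarity(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


query = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
corpus = [
    [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0],  # score 13
    [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0],  # score 0 (no shared words)
    [0, 0, 0, 0, 1, 4, 0, 0, 2, 0, 0, 0, 0],  # score 5 + 12 + 2 = 19
]
best, score = nearest_neighbor(query, corpus, dot_similarity)
print(best, score)
```

Starting `best_s` at negative infinity means the first article always becomes the initial best, so the function returns an article whenever the corpus is non-empty.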
k-Nearest neighbor
• Input: Query article
• Output: List of k similar articles
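One way to sketch the k-nearest-neighbor query is to score every article against the query and keep the k highest scores. The version below uses `heapq.nlargest` from the Python standard library. The function name `k_nearest_neighbors`, the dot-product similarity, and the toy corpus are assumptions for illustration.

```python
import heapq


def dot_similarity(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def k_nearest_neighbors(query, corpus, similarity, k):
    """Return the k articles most similar to query, most similar first."""
    return heapq.nlargest(
        k, corpus, key=lambda article: similarity(query, article)
    )


query = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
corpus = [
    [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0],  # score 13
    [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0],  # score 0
    [0, 0, 0, 0, 1, 4, 0, 0, 2, 0, 0, 0, 0],  # score 19
]
top_two = k_nearest_neighbors(query, corpus, dot_similarity, k=2)
print(top_two)
```

With k = 1 this reduces to the single nearest-neighbor search; if k exceeds the corpus size, every article is returned in order of similarity.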
Multiclass classification problem
• Labels: WORLD NEWS, SPORTS, ENTERTAINMENT, SCIENCE, TECHNOLOGY
• Unlabeled article: ?
• Example of supervised learning
Clustering
• No labels provided
• Want to uncover cluster structure
An unsupervised learning task
What defines a cluster?
• Cluster defined by
center & shape/spread
• Example groups: “furniture”, “baby”
• Or discovering groups of users
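To make "center & shape/spread" concrete, the sketch below summarizes a group of points by its mean (as the center) and by the root-mean-square distance to that mean (as a single-number spread), then assigns a new point to the closest center. These particular definitions are assumptions chosen for simplicity; the slide does not specify them, and a single spread value ignores cluster shape. All names and data points are made up for illustration.

```python
import math


def center(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def spread(points):
    """Root-mean-square distance of the points from their center."""
    c = center(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


def nearest_center(point, centers):
    """Index of the center closest to point in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# Two toy groups of 2-D points.
cluster_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cluster_b = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]
centers = [center(cluster_a), center(cluster_b)]

print(centers, spread(cluster_a))
print(nearest_center((3.0, 3.0), centers))  # index of the closer center
```

`math.dist` requires Python 3.8 or later.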
Structuring web search results
• Search terms can have multiple meanings
• Example: “cardinal”
• Solution:
  - Cluster regions with similar …
  (figure: Washington, DC)
Summary for clustering and similarity