Flat Clustering PDF
Flat Clustering
Introduction to Information Retrieval
Clustering: Definition
Propose algorithm for finding the cluster structure in this example
Applications of clustering in IR
Application | What is clustered? | Benefit
Search result clustering | search results | more effective information presentation to user
Scatter-Gather | (subsets of) collection | alternative user interface: “search without typing”
Collection clustering | collection | effective information presentation for exploratory browsing
Cluster-based retrieval | collection | higher efficiency: faster search
Scatter-Gather
Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
Not tractable
Effective heuristic method: K-means algorithm
Outline
Clustering: Introduction
Clustering in IR
K-means
Evaluation
Internal criteria
Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification
Rand index
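Definition: $\mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}}$
TP, FP, FN and TN count pairs of documents: a pair is "positive" if both documents are in the same cluster, and "true" if that decision agrees with the human-defined classes.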
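A minimal sketch of the Rand index as a pair-counting measure: every pair of documents is a decision (same cluster or not), and the index is the fraction of decisions that agree with the human-defined classes. The function names and the four-document toy data are made up for illustration.

```python
from itertools import combinations


def pair_counts(clusters, classes):
    """Return (TP, FP, FN, TN) over all pairs of documents.

    clusters[i] is the cluster assigned to document i,
    classes[i] is its human-defined class.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(clusters, classes):
    """Fraction of correct pair decisions; needs at least two documents."""
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)


# Toy data (hypothetical): four documents, two clusters, two classes.
print(rand_index([0, 0, 1, 1], ["a", "a", "a", "b"]))  # 0.5
```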
F-Score
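The slide text for this heading did not survive extraction, so the following is only a sketch under an assumption: that the F-score here is the usual pair-counting F-measure, built from pair-level precision P = TP/(TP+FP) and recall R = TP/(TP+FN), with a weight beta that emphasizes recall when beta > 1. The function name and the example counts are illustrative.

```python
def pair_f_measure(tp, fp, fn, beta=1.0):
    """F_beta computed from pair counts (assumed pair-counting variant).

    tp, fp, fn are counts of document pairs; beta > 1 weights recall
    more heavily, i.e. penalizes false negatives more than false positives.
    """
    if tp == 0:
        return 0.0  # no correct "same cluster" decisions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Illustrative pair counts, not taken from the slides.
print(round(pair_f_measure(20, 20, 24), 3))          # 0.476
print(round(pair_f_measure(20, 20, 24, beta=5), 3))  # 0.456
```

Unlike the Rand index, this measure ignores true negatives, and beta lets separating similar documents count as a worse error than merging dissimilar ones.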
https://link.springer.com/content/pdf/10.1007%2Fs00357-006-0017-z.pdf
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
K-means
Each cluster in K-means is defined by a centroid.
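Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall definition of centroid: $\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$, where $\omega$ denotes a cluster.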
K-means algorithm
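The algorithm itself is missing from the extracted slides, so here is a minimal sketch of the standard K-means loop: pick K random seeds, then alternate reassignment (each point goes to its closest centroid) and recomputation (each centroid becomes the mean of its points) until the assignment stops changing. Function names, the `max_iter` cap, and the empty-cluster handling are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids, rss) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random seed selection
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, rss


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, rss = kmeans(data, k=2)
print(assignment, rss)
```

The returned `rss` is the residual sum of squares, the internal criterion that K-means decreases at every step.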
Optimality of K-means
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
• Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
• Use hierarchical clustering to find good seeds
• Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
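The last strategy (several random seed sets, keep the clustering with lowest RSS) can be sketched as follows. The compact `kmeans` helper is a plain K-means written for this example, and `best_of_restarts` and its parameters are hypothetical names.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, max_iter=100):
    """Plain K-means from random seeds; returns (rss, assignment)."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
               for p in points]
        if new == assignment:
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return rss, assignment


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the run with the lowest RSS."""
    rng = random.Random(seed)  # one generator, so every run draws new seeds
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


# Toy data (hypothetical): three small groups on a line.
data = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0)]
rss, assignment = best_of_restarts(data, k=3)
print(rss, assignment)
```

Restarts do not guarantee the global optimum; they only make a poor local optimum less likely.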
Basic idea:
Start with 1 cluster (K = 1)
Keep adding clusters (= keep increasing K)
Add a penalty for each new cluster
Trade off cluster penalties against average squared distance from centroid
Choose the value of K with the best tradeoff
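One way to make this tradeoff concrete is to add a fixed penalty per cluster to the distortion obtained for each K and pick the K with the smallest total. The sketch below assumes a linear penalty `penalty * K` and takes the per-K distortion (RSS or average squared distance) as precomputed input. The function name and the numbers in the example are made up for illustration.

```python
def choose_k(distortion_by_k, penalty):
    """Return the K minimizing distortion(K) + penalty * K.

    distortion_by_k maps each candidate K to the best distortion found
    for that K (e.g. lowest RSS over several K-means runs).
    penalty is the assumed cost of adding one more cluster.
    """
    return min(sorted(distortion_by_k),
               key=lambda k: distortion_by_k[k] + penalty * k)


# Hypothetical distortions: large gains up to K = 3, small gains after.
distortions = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}
print(choose_k(distortions, penalty=5.0))  # 3
print(choose_k(distortions, penalty=0.5))  # 5
```

With no penalty the largest K always wins, because distortion keeps falling as clusters are added; the penalty is what makes the choice meaningful, and its value has to be set from experience with similar data.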
Silhouette method
• a(i) is the average distance between i and all other data points in the same cluster.
• b(i) is the smallest average distance from i to the points of any other cluster, i.e., any cluster of which i is not a member.
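The two quantities are usually combined into the silhouette value s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1. The formula is not in the extracted slide text, so treat it as the standard definition rather than a quote. A minimal sketch, assuming Euclidean distance, at least two clusters, and the common convention s(i) = 0 for a point that is alone in its cluster:

```python
import math


def silhouette_values(points, labels):
    """Return s(i) for every point; requires at least two clusters."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists:  # p is the only member of its cluster
            values.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        values.append((b - a) / max(a, b))
    return values


# Toy data (hypothetical): two tight, well-separated pairs.
pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_values(pts, [0, 0, 1, 1])
bad = silhouette_values(pts, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate that a point sits well inside its cluster; negative values indicate that it is closer to another cluster. Averaging s(i) over all points for several values of K gives one way to compare them.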
k-medoids
1. Initialize: select k of the n data points as the medoids
EM Algorithm