Lecture 14: Clustering
What is clustering?
Clustering: the process of grouping a set of objects
into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
A common and important task that finds many
applications in IR and other places
How would you design an algorithm for finding the three clusters in this case?
(Figure: a scatter plot of points falling into three visually apparent clusters.)
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo relevance feedback)
For better navigation of search results
Effective “user recall” will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity (docs as vectors)
Cosine similarity
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
We will mostly speak of Euclidean distance
But real implementations use cosine similarity
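As a minimal sketch (not from the IIR text), cosine similarity and a derived distance for dense numpy vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two document vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    """Turn the similarity into a distance-like score in [0, 2]."""
    return 1.0 - cosine_similarity(x, y)

# Toy term-weight vectors over a 4-term vocabulary
d1 = np.array([2.0, 1.0, 0.0, 1.0])
d2 = np.array([1.0, 1.0, 1.0, 0.0])
print(cosine_similarity(d1, d2))  # ≈ 0.71
```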
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K-means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes the
chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions
Intractable for many objective functions
Ergo, effective heuristic methods: K-means and K-medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity
or mean) of points in a cluster, c:
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
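Concretely, the centroid is just the component-wise mean of the cluster's vectors; a short numpy illustration:

```python
import numpy as np

cluster = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 0.0]])
mu = cluster.mean(axis=0)  # centroid μ(c) = [3.0, 2.0]
```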
K-Means Algorithm
Select K random docs {s1, s2,…, sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
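A compact runnable sketch of this algorithm in numpy, assuming dense vectors and Euclidean distance as in the pseudocode (the function name and the empty-cluster guard are my additions, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means. X: (N, M) array of N docs in M dimensions."""
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial seeds.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Assignment step: each doc joins the cluster of its
        # nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster went empty).
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(K)])
        # Stop when centroid positions no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Usage: `labels, centroids = kmeans(X, K=2)`.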
K-Means Example (K = 2)
(Figure: 2D points; ×'s mark the current centroids at each step.)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Convergence
Why should the K-means algorithm ever reach a
fixed point?
A state in which clusters don’t change.
K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
EM is known to converge.
Number of iterations could be large.
But in practice usually isn’t
Convergence of K-Means
Residual Sum of Squares (RSS), a goodness
measure of a cluster, is the sum of squared
distances from the cluster centroid:
$$\text{RSS}_j = \sum_{\vec{d}_i \in c_j} |\vec{d}_i - \vec{\mu}(c_j)|^2 \qquad \text{RSS} = \sum_j \text{RSS}_j$$
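A direct sketch of the definition (illustrative names):

```python
import numpy as np

def rss(X, labels, centroids):
    """Residual sum of squares: squared distance of every doc
    to the centroid of its assigned cluster, summed over clusters."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))
```

Both K-means steps can only lower (or preserve) RSS, and there are only finitely many partitions, which is why the algorithm must converge.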
Time Complexity
Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations,
or O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
Seed Choice
Results can vary based on random seed selection.
(Figure: example showing sensitivity to seeds.)
Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
(Example dendrogram: animal splits into vertebrate and invertebrate.)
Hierarchical Agglomerative Clustering (HAC)
Starts with each doc in a separate cluster
then repeatedly joins the closest pair of
clusters, until there is only one cluster.
The history of merging forms a binary tree
or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
Complete Link
Use minimum similarity of pairs:
$$sim(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$
Naive HAC algorithm:
1. Find closest pair of documents/clusters to merge: O(N²)
2. Repeat N − 1 times
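A naive sketch of this loop using the complete-link criterion defined above: it rescans all cluster pairs on each of the N − 1 merges, giving O(N³) overall. The function name and details are illustrative, not from the slides.

```python
import numpy as np

def complete_link_hac(X, target_clusters=1):
    """Naive HAC: repeatedly merge the two clusters whose
    complete-link similarity is highest, until target_clusters remain."""
    # Normalize docs so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                          # all pairwise doc similarities
    clusters = [[i] for i in range(len(X))]  # start: each doc is a cluster
    while len(clusters) > target_clusters:
        best_s, best_a, best_b = -np.inf, 0, 1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: cluster similarity = minimum
                # similarity over all cross-cluster doc pairs.
                s = min(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best_s:
                    best_s, best_a, best_b = s, a, b
        clusters[best_a].extend(clusters[best_b])
        del clusters[best_b]
    return clusters
```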
Group Average
Similarity of two clusters = average similarity of all pairs
within merged cluster.
$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \;\; \sum_{\substack{\vec{y} \in c_i \cup c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$$
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
Computing group-average similarity efficiently: maintain the sum vector $\vec{s}(c) = \sum_{\vec{x} \in c} \vec{x}$ for each cluster. Assuming unit-length vectors (cosine similarity), the group-average similarity of a merge can then be computed in constant time:

$$sim(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$$
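A sketch of this constant-time computation, assuming unit-length document vectors and one maintained sum vector per cluster (names are mine):

```python
import numpy as np

def group_average_sim(s_i, n_i, s_j, n_j):
    """Group-average similarity of the merged cluster, from each
    cluster's sum vector s and size n. Assumes unit-length docs."""
    s = s_i + s_j           # sum vector of the merged cluster
    n = n_i + n_j           # number of docs in the merged cluster
    # s·s sums sim(x, y) over ALL ordered pairs, including the n
    # self-pairs, each of which contributes x·x = 1: subtract them.
    return (np.dot(s, s) - n) / (n * (n - 1))
```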
Rand Index example
Counts of document pairs, comparing the clustering with the ground truth:

                                    Same cluster    Different clusters
Same class in ground truth               20                 24
Different classes in ground truth        20                 72
Rand Index
$$RI = \frac{A + D}{A + B + C + D}$$
(A, B, C, D are the four pair counts from the table: A = same class & same cluster, B = different class & same cluster, C = same class & different cluster, D = different class & different cluster.)
Compare with standard Precision and Recall:
$$P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}$$
People also define and use a cluster F-
measure, which is probably a better measure.
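Plugging the pair counts from the example table into these formulas as a quick check:

```python
# Pair counts from the contingency table above
A, B, C, D = 20, 20, 24, 72   # A: TP pairs, B: FP, C: FN, D: TN

RI = (A + D) / (A + B + C + D)   # 92 / 136 ≈ 0.68
P  = A / (A + B)                 # 20 / 40  = 0.50
R  = A / (A + C)                 # 20 / 44  ≈ 0.45
print(RI, P, R)
```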
Resources
IIR 16 except 16.5
IIR 17.1–17.3