Notes 1149 Unit 3
What is clustering?
Clustering: the process of grouping a set of objects
into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
A common and important task that finds many
applications in IR and other places
Ch. 16
How would you design an algorithm for finding the three clusters in this case?
Sec. 16.2
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity
or mean) of points in a cluster, c:
μ(c) = (1/|c|) Σx∈c x
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seeds.
Until clustering converges (or other stopping criterion):
For each doc di:
Assign di to the cluster cj such that dist(xi, sj) is minimal.
(Next, update the seeds to the centroid of each cluster)
For each cluster cj:
sj = μ(cj)
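A minimal NumPy sketch of this loop (the function name kmeans and the parameters max_iters and seed are illustrative, not part of the notes):

import numpy as np

def kmeans(docs, K, max_iters=100, seed=0):
    # docs: (N, M) array of real-valued document vectors.
    rng = np.random.default_rng(seed)
    # Select K random docs {s1, ..., sK} as seeds.
    centroids = docs[rng.choice(len(docs), size=K, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Assign each doc di to the cluster cj whose centroid is closest.
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # doc partition unchanged -> converged
        assignments = new_assignments
        # Update each seed to the centroid (mean) of its cluster.
        for j in range(K):
            members = docs[assignments == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assignments, centroids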
Sec. 16.4
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
Sec. 16.4
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
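A small sketch combining these criteria, assuming NumPy arrays for the assignments and centroids (all names here are illustrative):

import numpy as np

def converged(iteration, max_iters, old_assign, new_assign,
              old_centroids, new_centroids, tol=1e-6):
    # 1. A fixed number of iterations.
    if iteration >= max_iters:
        return True
    # 2. Doc partition unchanged.
    if old_assign is not None and np.array_equal(old_assign, new_assign):
        return True
    # 3. Centroid positions don't change (within a small tolerance).
    return old_centroids is not None and np.allclose(old_centroids, new_centroids, atol=tol)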
Convergence
Why should the K-means algorithm ever reach a
fixed point?
A state in which clusters don’t change.
K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
EM is known to converge.
Number of iterations could be large.
But in practice usually isn’t
Sec. 16.4
Convergence of K-Means
Define goodness measure of cluster k as sum of
squared distances from cluster centroid:
Gk = Σi (di – ck)² (sum over all di in cluster k)
G = Σk Gk
Reassignment monotonically decreases G since
each vector is assigned to the closest centroid.
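A direct sketch of this measure, using the same array conventions as the kmeans sketch above:

import numpy as np

def goodness(docs, assignments, centroids):
    # G = Σk Gk, where Gk is the sum of squared distances to centroid ck.
    G = 0.0
    for k, ck in enumerate(centroids):
        members = docs[assignments == k]
        G += ((members - ck) ** 2).sum()
    return G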
Sec. 16.4
Convergence of K-Means
Recomputation monotonically decreases each Gk
since (mk is number of members in cluster k):
Σi (di – a)² reaches its minimum when:
Σi –2(di – a) = 0
Σi di = Σi a = mk a
a = (1/mk) Σi di = ck
K-means typically converges quickly
Sec. 16.4
Time Complexity
Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations,
or O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
Sec. 16.4
Seed Choice
Results can vary based on random seed selection.
[Figure: example showing sensitivity to seeds]
Dhillon et al. (ICDM 2002): a variation that fixes some issues with small document clusters.
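One common, simple mitigation (distinct from the Dhillon et al. variation) is to run K-means from several random seeds and keep the clustering with the lowest goodness G; a sketch reusing the kmeans and goodness functions above:

def best_of_restarts(docs, K, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        assignments, centroids = kmeans(docs, K, seed=seed)
        G = goodness(docs, assignments, centroids)
        if best is None or G < best[2]:
            best = (assignments, centroids, G)
    return best  # (assignments, centroids, G) of the best restart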
How Many Clusters?
Number of clusters K is given
Partition n docs into predetermined number of clusters
Finding the “right” number of clusters is part of the
problem
Given docs, partition into an “appropriate” number of
subsets.
E.g., for query results - ideal value of K not known up front
- though UI may impose limits.
Can usually take an algorithm for one flavor and
convert to the other.
K not specified in advance
Say, the results of a query.
Solve an optimization problem: penalize having
lots of clusters
application dependent, e.g., compressed summary
of search results list.
Tradeoff between having more clusters (better
focus within each cluster) and having too many
clusters
K not specified in advance
Given a clustering, define the Benefit for a
doc to be the cosine similarity to its
centroid
Define the Total Benefit to be the sum of
the individual doc Benefits.
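One way to turn this into the optimization problem mentioned above is to subtract a per-cluster penalty from the Total Benefit; the penalty weight below is a hypothetical, application-dependent parameter, not something the notes specify:

import numpy as np

def total_benefit(docs, assignments, centroids):
    # Benefit of a doc = cosine similarity to its cluster centroid.
    total = 0.0
    for d, k in zip(docs, assignments):
        c = centroids[k]
        total += (d @ c) / (np.linalg.norm(d) * np.linalg.norm(c) + 1e-12)
    return total

def penalized_value(docs, assignments, centroids, penalty_per_cluster=1.0):
    # Trade off better focus within each cluster against having too many clusters.
    return total_benefit(docs, assignments, centroids) - penalty_per_cluster * len(centroids)

Choosing K can then be sketched as trying several values and keeping the clustering with the highest penalized value.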
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
[Dendrogram example: animal → vertebrate, invertebrate]
Sec. 17.1
Hierarchical Agglomerative Clustering
(HAC)
Starts with each doc in a separate cluster
then repeatedly joins the closest pair of
clusters, until there is only one cluster.
The history of merging forms a binary tree
or hierarchy.
Note: the resulting clusters are still “hard” and induce a partition
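A naive O(N^3) sketch of this procedure; sim_fn stands for whichever cluster-similarity definition is plugged in (single link, complete link, or group average, discussed next):

def hac(docs, sim_fn):
    # Start with each doc in its own cluster (clusters hold doc indices).
    clusters = [[i] for i in range(len(docs))]
    merges = []  # merge history; forms a binary tree over the docs
    while len(clusters) > 1:
        # Find the closest (most similar) pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim_fn(docs, clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges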
Sec. 17.2
Complete Link
Use the minimum similarity of pairs:
sim(ci, cj) = min x∈ci, y∈cj sim(x, y)
Sec. 17.2
Group Average
Similarity of two clusters = average similarity of all pairs
within merged cluster.
sim(ci, cj) = [1 / (|ci ∪ cj| (|ci ∪ cj| − 1))] Σx∈(ci∪cj) Σy∈(ci∪cj): y≠x sim(x, y)
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
Sec. 17.3
With length-normalized document vectors and dot-product similarity, the group-average similarity can be computed from the cluster sum vectors s(ci) = Σx∈ci x:
sim(ci, cj) = [(s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|)] / [(|ci| + |cj|)(|ci| + |cj| − 1)]
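A sketch of this computation from maintained sum vectors, assuming length-normalized document vectors and dot-product similarity (so each of the self-similarities the formula removes equals 1):

import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    # sum_i = s(ci), sum_j = s(cj): vector sums of the docs in each cluster.
    s = sum_i + sum_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

Because s(ci) can be updated in O(M) when two clusters merge, this avoids recomputing the average over all pairs.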