Introduction to
Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
A common and important task that finds many applications in IR and other places
Introduction to Information Retrieval Ch. 16
(Figure: a set of points forming three groups.) How would you design an algorithm for finding the three clusters in this case?
Sec. 16.1
Applications of clustering in IR
Whole corpus analysis/navigation
  Better user interface: search without typing
For improving recall in search applications
  Better search results (like pseudo relevance feedback)
For better navigation of search results
  Effective “user recall” will be higher
For speeding up vector space retrieval
  Cluster-based retrieval gives faster search
Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity
  We will use cosine similarity.
  Docs as vectors.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
We will mostly speak of Euclidean distance
  But real implementations use cosine similarity
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Partitioning Algorithms
Partitioning method: Construct a partition of n documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
  Globally optimal: intractable for many objective functions, since it would require exhaustively enumerating all partitions
  Effective heuristic methods: K-means and K-medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
Sec. 16.4
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

  μ(c) = (1/|c|) Σ_{x ∈ c} x
K-Means Algorithm
Select K random docs {s1, s2, … sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
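The two steps above (assignment, then centroid update) can be sketched in plain Python. This is a minimal illustration, not an IR-grade implementation: it uses squared Euclidean distance on dense vectors, and the function and variable names are my own.

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    """Minimal K-means sketch: docs are equal-length lists of floats."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [None] * len(docs)
    for _ in range(iters):
        # Assignment step: each doc goes to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(d, centroids[j])))
            for d in docs
        ]
        if new_assignment == assignment:  # doc partition unchanged: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Note that the stopping test used here is one of the termination conditions discussed below: the doc partition no longer changes.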
K-Means Example (K=2)
(Figure: points in the plane, with ×'s marking the centroids at each step.)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Convergence
Why should the K-means algorithm ever reach a fixed point?
  A state in which clusters don’t change.
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  EM is known to converge.
  Number of iterations could be large.
    But in practice usually isn’t
Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  G_k = Σ_i (d_i – c_k)²   (sum over all d_i in cluster k)
  G = Σ_k G_k
Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Convergence of K-Means
Recomputation monotonically decreases each G_k since (m_k is the number of members in cluster k):
  Σ_i (d_i – a)² reaches its minimum for:
  Σ_i –2(d_i – a) = 0
  Σ_i d_i = Σ_i a
  m_k a = Σ_i d_i
  a = (1/m_k) Σ_i d_i = c_k
K-means typically converges quickly
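The derivation above says the centroid (mean) minimizes the sum of squared distances. A quick numeric check, shown here for scalar points for simplicity:

```python
def sse(points, a):
    # Sum of squared distances from scalar points to candidate center a.
    return sum((p - a) ** 2 for p in points)

points = [1.0, 2.0, 4.0, 9.0]
mean = sum(points) / len(points)  # 4.0

# The mean achieves a lower (or equal) SSE than nearby candidates.
assert all(sse(points, mean) <= sse(points, mean + delta)
           for delta in (-2.0, -0.5, 0.5, 2.0))
```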
Time Complexity
Computing distance between two docs is O(M) where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations, or O(KNM).
Computing centroids: Each doc gets added once to some centroid: O(NM).
Assume these two steps are each done once for I iterations: O(IKNM).
Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
  Try out multiple starting points
  Initialize with the results of another method.
(Example showing sensitivity to seeds: for the pictured points A–F, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.)
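The "multiple starting points" heuristic can be sketched as: run K-means from several random seeds and keep the clustering with the lowest G (the sum-of-squared-distances measure defined earlier). The sketch below bundles a tiny self-contained K-means; names like `best_of_restarts` are illustrative.

```python
import random

def run_kmeans(docs, k, seed):
    """Tiny K-means; returns (G, assignment) for one random seeding."""
    rng = random.Random(seed)
    cents = [list(d) for d in rng.sample(docs, k)]
    assign = []
    for _ in range(50):
        new = [min(range(k),
                   key=lambda j: sum((x - c) ** 2 for x, c in zip(d, cents[j])))
               for d in docs]
        if new == assign:
            break
        assign = new
        for j in range(k):
            mem = [d for d, a in zip(docs, assign) if a == j]
            if mem:
                cents[j] = [sum(col) / len(mem) for col in zip(*mem)]
    # G = total squared distance of docs to their assigned centroids.
    G = sum(sum((x - c) ** 2 for x, c in zip(d, cents[a]))
            for d, a in zip(docs, assign))
    return G, assign

def best_of_restarts(docs, k, n_restarts=10):
    # Try several seeds; keep the clustering with the lowest G.
    return min((run_kmeans(docs, k, s) for s in range(n_restarts)),
               key=lambda r: r[0])
```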
Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters
Purity example
(Figure with example clusters omitted.)
Rand Index example: counts of document pairs

                                       Same cluster   Different clusters
  Same class in ground truth                20                24
  Different classes in ground truth         20                72
Sec. 16.3
Rand Index
  RI = (A + D) / (A + B + C + D)
where A = pairs in the same cluster and same class, B = same cluster but different classes, C = different clusters but same class, D = different clusters and different classes.
Compare with standard Precision and Recall:
  P = A / (A + B)    R = A / (A + C)
People also define and use a cluster F-measure, which is probably a better measure.
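These pair-counting measures are straightforward to compute. Using the counts from the contingency table above (A = 20, B = 20, C = 24, D = 72; the function name is illustrative):

```python
def pair_counting_scores(A, B, C, D):
    """A = TP (same cluster, same class), B = FP, C = FN, D = TN."""
    ri = (A + D) / (A + B + C + D)   # Rand Index
    precision = A / (A + B)
    recall = A / (A + C)
    return ri, precision, recall

ri, p, r = pair_counting_scores(A=20, B=20, C=24, D=72)
# ri = 92/136 ≈ 0.676, p = 0.5, r = 20/44 ≈ 0.455
```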
Ch. 17
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
  (Dendrogram figure: animal → vertebrate, invertebrate, …)
Sec. 17.1
Note: the resulting clusters are still “hard” and induce a partition
Sec. 17.2
Complete Link
Use minimum similarity of pairs:
  sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)
(Figure showing clusters c_i, c_j, c_k omitted.)
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
In each of the subsequent N – 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
  Often O(N³) if done naively or O(N² log N) if done more cleverly
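A sketch of the naive approach for complete-link HAC, which recomputes cluster-pair similarities from scratch and so is O(N³) overall (illustrative names; squared Euclidean distance stands in for a similarity):

```python
def complete_link_hac(points, n_clusters):
    """Naive agglomerative clustering with complete link.

    Merges the pair of clusters whose *maximum* cross-pair distance
    (i.e., minimum similarity) is smallest; O(N^3) overall."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def complete_link(ci, cj):
        # Complete link = largest distance over all cross pairs.
        return max(dist2(x, y) for x in ci for y in cj)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under complete link.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: complete_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```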
Sec. 17.3
Group Average
Similarity of two clusters = average similarity of all pairs within the merged cluster:

  sim(c_i, c_j) = (1 / (|c_i ∪ c_j| (|c_i ∪ c_j| – 1))) Σ_{x ∈ c_i ∪ c_j} Σ_{y ∈ c_i ∪ c_j, y ≠ x} sim(x, y)

Compromise between single and complete link.
Two options:
  Averaged across all ordered pairs in the merged cluster
  Averaged over all pairs between the two original clusters
No clear difference in efficacy
Maintaining the sum of vectors s(c) = Σ_{x ∈ c} x for each cluster, the ordered-pair group-average similarity of unit-length (cosine-normalized) vectors can be computed in constant time:

  sim(c_i, c_j) = [ (s(c_i) + s(c_j)) · (s(c_i) + s(c_j)) – (|c_i| + |c_j|) ] / [ (|c_i| + |c_j|)(|c_i| + |c_j| – 1) ]
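A small check that this constant-time form matches the brute-force average over all ordered pairs in the merged cluster, assuming unit-length vectors so that x · x = 1 (helper names are illustrative):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two small clusters of unit vectors.
ci = [normalize(v) for v in [[1.0, 0.2], [0.9, 0.3]]]
cj = [normalize(v) for v in [[0.1, 1.0]]]

merged = ci + cj
n = len(merged)

# Brute force: average cosine similarity over all ordered pairs x != y.
brute = sum(dot(x, y) for x in merged for y in merged if y is not x) / (n * (n - 1))

# Constant-time form: with s = s(ci) + s(cj), sim = (s.s - n) / (n(n - 1)),
# since s.s counts every ordered pair plus the n self-similarities (each 1).
s = [sum(col) for col in zip(*merged)]
fast = (dot(s, s) - n) / (n * (n - 1))

assert abs(brute - fast) < 1e-12
```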
Resources
IIR 16 except 16.5
IIR 17.1–17.3