Clustering

Clustering is the process of grouping similar documents together, with applications in information retrieval, summarization, topic segmentation, and lexical semantics. The document outlines various clustering techniques, including K-means and hierarchical clustering, and discusses their properties, advantages, and computational complexities. It emphasizes the importance of clustering in organizing information and improving retrieval effectiveness.

What is clustering?

Applications of clustering in information retrieval
K-means algorithm

Introduction to hierarchical clustering

Single-link and complete-link clustering


Definition
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.


Difference between classification & clustering

Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm; output = membership in class only.

Clustering: unsupervised learning; clusters are inferred from the data without human input; output = membership in cluster + distance from centroid (“degree of cluster membership”).
Cluster hypothesis

Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Van Rijsbergen’s original wording (1979): “closely associated documents tend to be relevant to the same requests”.
Applications of Clustering
IR: presentation of results (clustering of documents)

Summarisation: clustering of similar documents for multi-document summarisation; clustering of similar sentences for re-generation of sentences


Topic Segmentation: clustering of similar paragraphs (adjacent or non-adjacent) for detection of topic structure/importance

Lexical semantics: clustering of words by cooccurrence patterns


Clustering search results
Clustering news articles
Clustering Words

https://colah.github.io/posts/2015-01-Visualizing-Representations/
Types of Clustering
Hard clustering v. soft clustering

Hard clustering: every object is member in only one cluster

Soft clustering: objects can be members in more than one cluster

Hierarchical v. non-hierarchical clustering

Hierarchical clustering: pairs of most-similar clusters are iteratively linked until all objects are in a clustering relationship

Non-hierarchical clustering results in flat clusters of “similar” documents


Desiderata for clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set we are clustering.
Initially, we will assume the number of clusters K is given.
There also exist semiautomatic methods for determining K
Secondary goals in clustering
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Non-hierarchical (partitioning) clustering
Partitional clustering algorithms produce a set of k non-nested partitions corresponding to k clusters of n objects.

Advantage: it is not necessary to compare each object to every other object; only comparisons between objects and cluster centroids are needed.

Optimal partitioning clustering algorithms are O(kn)

Main algorithm: K-means


K-means: Basic idea
K-means algorithm
Worked Example (figures): starting from a set of points and random seeds, the algorithm alternately assigns points to the closest centroid and recomputes the cluster centroids, repeating these two steps until the centroids and assignments no longer change (convergence).
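As a concrete stand-in for the worked-example figures, here is a minimal NumPy sketch of the two alternating steps; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch. X: (n, m) array of n vectors; k: number of clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random seeds: pick k distinct points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: assign each vector to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # convergence: assignments no longer change
        assignment = new_assignment
        # Recomputation step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members):  # keep the old centroid if a cluster went empty
                centroids[j] = members.mean(axis=0)
    return assignment, centroids
```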
K-means is guaranteed to converge: Proof
RSS decreases during each reassignment step, because each vector is moved to a closer centroid.
RSS decreases during each recomputation step: this follows from the definition of a centroid, since the new centroid is the vector for which RSSk reaches its minimum.
There is only a finite number of clusterings.
Thus: we must reach a fixed point.
Finite set & monotonically decreasing evaluation function ⇒ convergence
Assumption: ties are broken consistently
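The proof refers to RSS (residual sum of squares), which is not spelled out in this excerpt. In the usual K-means notation (assumed here), with clusters ω_1, …, ω_K and centroid μ(ω_k):

```latex
\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} \lVert \vec{x} - \vec{\mu}(\omega_k) \rVert^2,
\qquad
\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k
```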
Other properties of K-means
Fast convergence

K-means typically converges in around 10–20 iterations (if we don’t care about a few documents switching back and forth)

However, complete convergence can take many more iterations.

Non-optimality

K-means is not guaranteed to find the optimal solution.

If we start with a bad set of seeds, the resulting clustering can be horrible.

Dependence on initial centroids

Solution 1: run K-means i times with different random seeds and choose the clustering with the lowest RSS

Solution 2: use a prior hierarchical clustering step to find seeds with good coverage of the document space
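Solution 1 can be tried out directly with scikit-learn, assuming it is available: the n_init parameter runs K-means that many times from different random seeds and keeps the run with the lowest RSS (called "inertia" there). The data below is a random stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 5))  # stand-in for document vectors

# n_init=10: run K-means from 10 different random seed sets,
# keep the clustering with the lowest RSS ("inertia" in scikit-learn).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.inertia_)      # RSS of the best of the 10 runs
print(km.labels_[:10])  # cluster membership of the first 10 points
```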
Time complexity of K-means
Reassignment step: O(KNM) (we need to compute KN document–centroid distances, each of which costs O(M))

Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions


Hierarchical clustering
Imagine we now want to create a hierarchy in the form of a binary tree.

Assumes a similarity measure for determining the similarity of two clusters.

Up to now, our similarity measures were for documents.

We will look at different cluster similarity measures.

Main algorithm: HAC (hierarchical agglomerative clustering)


HAC: Basic algorithm
Start with each document in a separate cluster

Then repeatedly merge the two clusters that are most similar

Until there is only one cluster.

The history of merging is a hierarchy in the form of a binary tree.

The standard way of depicting this history is a dendrogram.


A dendrogram
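The dendrogram figure itself is not reproduced here. A minimal sketch of producing one with SciPy and Matplotlib (assumed available; the toy data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # toy stand-ins for document vectors

# linkage() records the HAC merge history: each row is one merge of two clusters.
Z = linkage(X, method='single')
dendrogram(Z, labels=[f"doc{i}" for i in range(len(X))])
plt.show()
```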
Term–document matrix to document–document matrix
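One way to carry out this conversion, sketched in NumPy (the matrix below is a made-up example): normalise each document column to unit length, and the document–document cosine similarities are then the pairwise dot products.

```python
import numpy as np

# Hypothetical term-document matrix A: rows = terms, columns = documents.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 0.0]])

A_hat = A / np.linalg.norm(A, axis=0)  # unit-length document columns
S = A_hat.T @ A_hat                    # S[i, j] = cosine similarity of docs i and j
```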
Hierarchical clustering: agglomerative (BottomUp, greedy)
Computational complexity of the basic algorithm
Hierarchical clustering: similarity functions
Example: hierarchical clustering; similarity functions
Single Link is O(n²)
Clustering Result under Single Link
Complete Link
Clustering result under complete link
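To see how the two criteria differ in practice, both can be run on the same data with SciPy (an illustrative sketch: 'single' merges by the closest pair of points between clusters, 'complete' by the farthest):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(10, 2))  # toy data

Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

# Cut each tree into 2 flat clusters and compare the assignments.
print(fcluster(Z_single, t=2, criterion='maxclust'))
print(fcluster(Z_complete, t=2, criterion='maxclust'))
```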
Example: gene expression data
An example from biology: cluster genes by function
Survey 112 rat genes which are suspected to participate in development of CNS
Take 9 data points: 5 embryonic (E11, E13, E15, E18, E21), 3 postnatal (P0, P7, P14) and one adult
Measure the expression of each gene (how much mRNA is in the cell?)
These measures are normalised logs; for our purposes, we can consider them as weights
Cluster analysis determines which genes operate at the same time
Rat CNS gene expression data (excerpt)
Rat CNS gene clustering – single link
Rat CNS gene clustering – complete link
Rat CNS gene clustering – group average link
Flat or hierarchical clustering?
When a hierarchical structure is desired: hierarchical algorithm

Humans are bad at interpreting hierarchical clusterings (unless cleverly visualised)

For high efficiency, use flat clustering

For deterministic results, use HAC

HAC can also be applied if K cannot be predetermined (it can start without knowing K)
Take-away
Partitional clustering
Provides less information but is more efficient (best: O(kn))
K-means
Hierarchical clustering
Best algorithms have O(n²) complexity
Single-link vs. complete-link (vs. group-average)

Hierarchical and non-hierarchical clustering fulfil different needs
