12 Text Clustering
Week 12
Text Clustering
Clustering (grouping)
[Figure: example data points grouped into clusters]
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.
Clustering Approaches
Hierarchical Agglomerative Clustering
(HAC)
• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
HAC Algorithm
1. Start with all instances in their own cluster.
2. Until there is only one cluster:
   a. Determine the two clusters that are currently most similar, ci and cj.
   b. Replace ci and cj with a single cluster ci ∪ cj.
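A minimal Python sketch of this loop, using single-link cluster similarity; the sim(x, y) function on individual instances and all other names are assumptions of this sketch, and the naive all-pairs search shown here is slower than the O(n²) scheme discussed on the complexity slide below:

def hac(instances, sim):
    # 1. Start with all instances in their own cluster.
    clusters = [[x] for x in instances]
    merges = []                                   # merge history = the hierarchy
    # 2. Until there is only one cluster:
    while len(clusters) > 1:
        # a. Find the two most similar clusters (single link: most similar members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # b. Replace the two clusters with their union.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges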
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute similarity of two clusters each
possibly containing multiple instances?
– Single Link: Similarity of two most similar members.
– Complete Link: Similarity of two least similar members.
Single Link Example
Complete Link Agglomerative Clustering
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
Computing Cluster Similarity
– Single Link:
    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
– Complete Link:
    sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
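These update rules are what make the constant-time requirement from the complexity slide achievable: the similarity of a merged cluster to any other cluster ck depends only on two values that were already stored before the merge. A small sketch (function and variable names are illustrative):

def merged_similarity(sim_i_k, sim_j_k, linkage="single"):
    # sim_i_k = sim(ci, ck) and sim_j_k = sim(cj, ck), both stored before the merge.
    if linkage == "single":       # most similar members
        return max(sim_i_k, sim_j_k)
    if linkage == "complete":     # least similar members
        return min(sim_i_k, sim_j_k)
    raise ValueError(linkage)

# Example: if sim(ci, ck) = 0.8 and sim(cj, ck) = 0.3, then after merging ci and cj:
print(merged_similarity(0.8, 0.3, "single"))    # 0.8
print(merged_similarity(0.8, 0.3, "complete"))  # 0.3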
Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
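A compact sketch of the whole K-means procedure built from these steps (random seeds, assign each point to its nearest centroid, recompute centroids, repeat until nothing changes); the dist and mean arguments and the toy data are assumptions of this sketch:

import random

def kmeans(points, k, dist, mean, max_iters=100):
    # Randomly choose k instances as seeds, one per cluster.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Form clusters by assigning each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:            # converged: centroids stopped moving
            break
        centroids = new_centroids
    return clusters, centroids

# Example with 2-D points, Euclidean distance, and coordinate-wise means:
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
mean = lambda c: tuple(sum(v) / len(c) for v in zip(*c))
print(kmeans(pts, 2, dist, mean)[1])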
Distance Metrics
• L1 norm:
    L1(x, y) = Σ_{i=1..m} |x_i − y_i|
• Cosine similarity (transform to a distance by subtracting from 1):
    d(x, y) = 1 − (x · y) / (‖x‖ ‖y‖)
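For concreteness, both metrics computed on plain Python lists (a small sketch, not tied to any particular library):

import math

def l1_distance(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between x and y.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norms

print(l1_distance([1, 0, 2], [0, 1, 2]))   # 2
print(cosine_distance([1, 0], [0, 1]))     # 1.0 (orthogonal vectors)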
K-Means Algorithm
[Figure: K-means iterations on example data, ending when the clustering has converged]
Text Clustering
• HAC and K-Means have been applied to text in a
straightforward way.
• Typically use normalized, TF/IDF-weighted vectors and cosine similarity (see the sketch after this list).
• Optimize computations for sparse vectors.
• Applications:
– During retrieval, add other documents in the same cluster
as the initial retrieved documents to improve recall.
– Clustering of results of retrieval to present more organized
results to the user.
– Automated production of hierarchical taxonomies of
documents for browsing purposes.
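As an illustration of the recipe above, one way to do it with scikit-learn; the library choice, parameters, and toy documents are assumptions of this sketch, not something the slides prescribe. TfidfVectorizer produces L2-normalized sparse vectors, so ordinary Euclidean K-means behaves like cosine-similarity clustering:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today",
        "investors sold their shares"]

# TF/IDF-weighted, length-normalized (L2) sparse document vectors.
X = TfidfVectorizer().fit_transform(docs)

# With unit-length vectors, minimizing squared Euclidean distance is equivalent to
# maximizing cosine similarity, so plain K-means gives cosine-style clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. the two pet documents in one cluster, the two finance ones in the other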
Soft Clustering
• Clustering typically assumes that each instance is
given a “hard” assignment to exactly one cluster.
• Does not allow uncertainty in class membership or
for an instance to belong to more than one cluster.
• Soft clustering gives probabilities that an instance
belongs to each of a set of clusters.
• Each instance is assigned a probability distribution
across a set of discovered categories (probabilities
of all categories must sum to 1).
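One common way to obtain such per-instance distributions is a mixture model fit with EM; the Gaussian mixture below is an illustrative assumption (the slides define soft clustering in general, not this specific model):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.5, 2.5]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Each row is one instance's probability distribution over the discovered clusters;
# the entries in every row sum to 1.
print(gm.predict_proba(X).round(3))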
Issues in Clustering
• How to evaluate clustering? (see the sketch after this list)
  – Internal:
    • Tightness and separation of clusters (e.g., the k-means objective)
    • Fit of a probabilistic model to the data
  – External:
    • Compare to known class labels on benchmark data
• Improving search to converge faster and
avoid local minima.
• Overlapping clustering.
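A short sketch of both evaluation styles using scikit-learn metrics: silhouette as an internal measure, adjusted Rand index as an external one. The toy data and the specific metric choices are assumptions of this sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
true_labels = [0, 0, 0, 1, 1, 1]            # known class labels on benchmark data

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal: tightness and separation of the clusters themselves (no labels needed).
print("silhouette:", silhouette_score(X, pred))
# External: agreement of the clustering with the known class labels.
print("adjusted Rand index:", adjusted_rand_score(true_labels, pred))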