CS463 - Natural Language Processing
Week 12: Text Clustering

Clustering

• Clustering is the process of partitioning examples from a heterogeneous dataset into homogeneous subsets (called clusters).
• Partition unlabeled examples into disjoint clusters, such that:
  – Examples within a cluster are very similar.
  – Examples in different clusters are very different.
• Clustering discovers new categories in an unsupervised manner (no category labels are provided).
Clustering Example

[Figure: a scatter of points grouped into clusters]
Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples into clusters immediately.
Clustering Approaches

[Figure: overview of clustering approaches]

Direct Clustering Method

• Direct clustering methods require a specification of the desired number of clusters, k.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function (see the sketch below).
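A minimal sketch of this idea, assuming scikit-learn and NumPy are available; the slides do not name a specific evaluation function, so the silhouette score is used here as one illustrative choice, and the helper name choose_k is hypothetical.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, candidate_ks=range(2, 10)):
    """Cluster X for several values of k and keep the best-scoring result."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)   # clustering evaluation function
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Example usage with random data standing in for document vectors.
X = np.random.rand(100, 5)
k, labels = choose_k(X)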
Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.
• Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
• The history of merging forms a binary tree (hierarchy).
HAC Algorithm

1. Start with each instance in its own cluster.
Until there is only one cluster:
2. Among the current clusters, determine the two clusters, ci and cj, that are most similar.
3. Replace ci and cj with a single cluster ci ∪ cj.

A sketch of this loop appears below.
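A minimal sketch of the HAC loop, assuming NumPy; the linkage argument is any cluster-similarity function (single link, complete link, or group average, as defined on the next slide), and all names here are illustrative rather than taken from the slides.

import numpy as np

def hac(X, linkage):
    """Merge the two most similar clusters until only one cluster remains."""
    clusters = [[i] for i in range(len(X))]   # each instance in its own cluster
    merges = []                               # history of merges (the hierarchy)
    while len(clusters) > 1:
        # among the current clusters, find the pair ci, cj that is most similar
        bi, bj = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage([X[a] for a in clusters[p[0]]],
                                  [X[b] for b in clusters[p[1]]]),
        )
        merges.append((list(clusters[bi]), list(clusters[bj])))
        clusters[bi] = clusters[bi] + clusters[bj]   # replace ci, cj with ci ∪ cj
        del clusters[bj]
    return merges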
Cluster Similarity

• Assume a similarity function that determines the similarity of two instances: sim(x, y).
  – For example, cosine similarity of document vectors.
• How do we compute the similarity of two clusters, each possibly containing multiple instances?
  – Single Link: similarity of the two most similar members.
  – Complete Link: similarity of the two least similar members.
  – Group Average: average similarity between members.

The sketch below writes the three linkage criteria as functions.
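A minimal sketch of the three linkage criteria, assuming NumPy arrays and cosine similarity between instances; the function names are illustrative. The last line shows how one of them would plug into the hac sketch above.

import numpy as np

def sim(x, y):
    """Cosine similarity of two document vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def single_link(ci, cj):
    """Similarity of the two most similar members."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    """Similarity of the two least similar members."""
    return min(sim(x, y) for x in ci for y in cj)

def group_average(ci, cj):
    """Average similarity between members of the two clusters."""
    return float(np.mean([sim(x, y) for x in ci for y in cj]))

# Example usage with the hac sketch from the previous slide.
merges = hac(np.random.rand(8, 4), single_link)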
Single Link Agglomerative Clustering

• Use the maximum similarity of pairs:

  sim(c_i, c_j) = \max_{x \in c_i,\ y \in c_j} sim(x, y)

• Can result in "straggly" (long and thin) clusters due to the chaining effect.
  – Appropriate in some domains, such as clustering islands.
Single Link Example

[Figure: single-link clustering example]
Complete Link Agglomerative Clustering

• Use the minimum similarity of pairs:

  sim(c_i, c_j) = \min_{x \in c_i,\ y \in c_j} sim(x, y)

• Produces tighter, more spherical clusters that are typically preferable.
Complete Link Example

[Figure: complete-link clustering example]
Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, they must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.
Computing Cluster Similarity

• After merging c_i and c_j, the similarity of the resulting cluster to any other cluster, c_k, can be computed by:

  – Single Link:
    sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k),\ sim(c_j, c_k))

  – Complete Link:
    sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k),\ sim(c_j, c_k))

A sketch of this update appears below.
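A minimal sketch of the constant-time update, assuming cluster similarities are stored in a dictionary keyed by frozensets of cluster ids; the data structure and names are illustrative, not taken from the slides.

def update_after_merge(sims, ci, cj, merged, others, linkage="single"):
    """Compute sim(ci ∪ cj, ck) for every other cluster ck from stored values."""
    combine = max if linkage == "single" else min   # "complete" uses min
    for ck in others:
        sims[frozenset((merged, ck))] = combine(
            sims[frozenset((ci, ck))],
            sims[frozenset((cj, ck))],
        )
    return sims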
Non-Hierarchical Clustering

• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.
K-Means

• Assumes instances are real-valued vectors.
• Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):

  \mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

• Reassignment of instances to clusters is based on distance to the current cluster centroids.
Distance Metrics

• Euclidean distance (L2 norm):

  L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

• L1 norm:

  L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|

• Cosine similarity (transformed to a distance by subtracting from 1):

  1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}

The sketch below implements these three measures.
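A minimal sketch of the three distance measures, assuming NumPy arrays; the function names are illustrative.

import numpy as np

def euclidean_distance(x, y):
    """L2 norm: square root of the sum of squared coordinate differences."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def l1_distance(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def cosine_distance(x, y):
    """Cosine similarity transformed to a distance by subtracting from 1."""
    return 1.0 - float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))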
K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, ..., sk} as seeds.
Until the clustering converges (or another stopping criterion is met):
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)

A runnable sketch of this algorithm appears below.
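A minimal sketch of the k-means loop in NumPy, assuming Euclidean distance; the seeding and convergence test are simple illustrative choices.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k clusters; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Select k random instances as seeds.
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each instance to the cluster whose seed (centroid) is closest.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the seeds to the centroid of each cluster (keep old seed if a cluster is empty).
        new_seeds = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
            for j in range(k)
        ])
        if np.allclose(new_seeds, seeds):   # converged
            break
        seeds = new_seeds
    return labels, seeds

labels, centroids = kmeans(np.random.rand(50, 3), k=2)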
K-Means Example (k = 2)

[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
Text Clustering

• HAC and K-Means have been applied to text in a straightforward way.
• Typically use normalized, TF/IDF-weighted vectors and cosine similarity (see the sketch after this list).
• Optimize computations for sparse vectors.
• Applications:
  – During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
  – Cluster the retrieval results to present more organized results to the user.
  – Automated production of hierarchical taxonomies of documents for browsing.
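A minimal sketch of clustering documents via TF-IDF vectors, assuming scikit-learn (the slides do not name a library); the toy documents are invented for illustration. TfidfVectorizer L2-normalizes the vectors by default, so Euclidean k-means on them closely tracks cosine similarity, and both TfidfVectorizer and KMeans work with sparse vectors.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market fell sharply today",
    "investors sold shares amid market fears",
    "the team won the championship game",
    "the striker scored twice in the final game",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # sparse TF-IDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster ids per document, e.g. [0 0 1 1] (assignment may vary)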
Soft Clustering

• Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
• This does not allow uncertainty in class membership, or an instance to belong to more than one cluster.
• Soft clustering gives probabilities that an instance belongs to each of a set of clusters.
• Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).
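A minimal sketch of soft clustering, assuming scikit-learn; a Gaussian mixture model is used here as one example of a method that assigns each instance a probability distribution over clusters, since the slides do not prescribe a specific soft-clustering algorithm.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)                       # random data standing in for instances
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gmm.predict_proba(X)                     # one row per instance, one column per cluster
assert np.allclose(probs.sum(axis=1), 1.0)       # probabilities sum to 1 for each instance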
Issues in Clustering

• How to evaluate a clustering?
  – Internal:
    • Tightness and separation of clusters (e.g., the k-means objective)
    • Fit of a probabilistic model to the data
  – External:
    • Compare to known class labels on benchmark data (see the sketch below)
• Improving the search to converge faster and avoid local minima.
• Overlapping clustering.
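A minimal sketch of internal and external evaluation, assuming scikit-learn; the silhouette score stands in for an internal measure and the adjusted Rand index for an external comparison to known labels. Both are illustrative choices, and the data here is random.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X = np.random.rand(100, 5)
true_labels = np.random.randint(0, 3, size=100)   # known class labels on benchmark data

pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("internal (silhouette):", silhouette_score(X, pred_labels))
print("external (adjusted Rand):", adjusted_rand_score(true_labels, pred_labels))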
