clustering
Clustering examples
1
Unsupervised learning
2
Example: a cholera outbreak in London
[Figure: plot of points marked X]
3
Conceptual Clustering
4
Conceptual Clustering (cont’d)
5
Curse of dimensionality
6
Higher dimensional examples
7
SkyServer
8
Sloan Digital Sky Survey
9
Clustering CDs
10
The space of CDs
11
Clustering documents
13
Analyzing protein sequences
14
Measuring distance
15
K-dimensional Euclidean space
$\max_{i=1}^{k} |a_i - b_i|$
16
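A minimal Python sketch of the maximum coordinate-wise difference between two k-dimensional points. The function name `max_coordinate_distance` and the sample points are assumptions made for illustration only.

```python
def max_coordinate_distance(a, b):
    """Return the largest |a_i - b_i| over all coordinates i."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical 3-dimensional points: the differences are 3, 4 and 0.
print(max_coordinate_distance((1, 5, 2), (4, 1, 2)))  # prints 4
```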
Non-Euclidean spaces
17
Non-Euclidean spaces (cont’d)
similarity(object1, object2) = 3/4
similarity(object1, object3) =
similarity(object2, object3) = 1/4
Note that it is possible to assign different
weights to features.
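One possible reading of this feature-based similarity, sketched in Python: the similarity is the (optionally weighted) fraction of features on which two objects agree. The objects, feature names, and weights below are hypothetical and are not the ones from the slides.

```python
from fractions import Fraction


def similarity(obj1, obj2, weights=None):
    """Weighted fraction of features on which two objects agree.

    Assumes both objects have the same feature names and integer weights.
    """
    features = list(obj1)
    if weights is None:
        weights = {f: 1 for f in features}  # every feature counts equally
    total = sum(weights[f] for f in features)
    matched = sum(weights[f] for f in features if obj1[f] == obj2[f])
    return Fraction(matched, total)


# Hypothetical objects with four features each.
object1 = {"color": "red", "shape": "round", "size": "small", "texture": "smooth"}
object2 = {"color": "red", "shape": "round", "size": "small", "texture": "rough"}

print(similarity(object1, object2))  # 3/4
heavy_texture = {"color": 1, "shape": 1, "size": 1, "texture": 3}
print(similarity(object1, object2, weights=heavy_texture))  # 1/2
```

Giving the mismatching feature a larger weight lowers the similarity, which is the effect the note about weights describes.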
18
Approaches to Clustering
19
The k-means algorithm
•Pick k cluster centroids.
•Assign points to clusters by picking the
closest centroid to the point in question. As
points are assigned to clusters, the centroid of
the cluster may migrate.
Example: Suppose that k = 2 and we assign
points 1, 2, 3, 4, 5, in that order. Outline circles
represent points, filled circles represent
centroids.
[Figure: the example points, labeled 1 to 5]
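The assignment procedure described in the bullets can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes the initial centroids are only seeds (the first point assigned to a cluster replaces the seed), and the coordinates are made up, since the slide's point positions are not recoverable.

```python
import math


def sequential_kmeans(points, initial_centroids):
    """Assign points one at a time; each assignment moves the chosen centroid."""
    centroids = [list(c) for c in initial_centroids]
    counts = [0] * len(centroids)  # points assigned so far (seeds are not counted)
    labels = []
    for p in points:
        # Pick the closest centroid to the point in question.
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        counts[j] += 1
        # Running-mean update: the centroid migrates toward the new point.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
        labels.append(j)
    return labels, centroids


# Hypothetical data: five points assigned in order, k = 2.
points = [(1, 1), (9, 9), (2, 1), (8, 9), (1, 2)]
labels, centroids = sequential_kmeans(points, [(0, 0), (10, 10)])
print(labels)  # [0, 1, 0, 1, 0]
```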
20
The k-means algorithm example (cont’d)
[Figure: four panels, each showing points 1 to 5]
21
Issues
22
Issues (cont’d)
• How to determine k?
One can try different values of k and pick the
smallest k such that increasing k further does
not much decrease the average distance of
points to their centroids.
[Figure: plot of points marked X]
23
Determining k
[Figure: plot of points marked X, for k = 1]
When k = 1, all the points are in one cluster, and the average
distance to the centroid will be high.
[Figure: plot of points marked X, for k = 2]
When k = 2, one of the clusters will be by itself and the other
two will be forced into one cluster. The average distance of
points to the centroid will shrink considerably.
24
Determining k (cont’d)
[Figure: plot of points marked X, for k = 3]
When k = 3, each of the apparent clusters should be a cluster
by itself, and the average distance from the points to their
centroids shrinks again.
[Figure: plot of average radius against k, for k = 1 to 4]
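A small Python sketch of this way of choosing k, under stated assumptions: the data set is synthetic (three tight groups of four points), k-means is a plain batch version seeded with the first k points, and the 20% threshold in the hypothetical helper `pick_k` is an arbitrary choice for "does not much decrease".

```python
import math


def kmeans(points, k, iterations=20):
    """Plain batch k-means, seeded with the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def average_radius(points, centroids):
    """Average distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


def pick_k(radii, min_gain=0.2):
    """Smallest k for which going to k + 1 improves the radius by less than min_gain."""
    for k in range(1, len(radii)):
        if (radii[k - 1] - radii[k]) / radii[k - 1] < min_gain:
            return k
    return len(radii)


# Synthetic data: three apparent clusters near (0, 0), (10, 0) and (0, 10).
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
corners = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for dx, dy in offsets for cx, cy in corners]

radii = [average_radius(points, kmeans(points, k)) for k in range(1, 5)]
print([round(r, 2) for r in radii])  # average radius for k = 1, 2, 3, 4
print(pick_k(radii))  # 3
```

On this data the radius drops sharply up to k = 3 and only slightly at k = 4, which is the flattening the plot of average radius against k is meant to show.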
26
The CLUSTER/2 algorithm
27
The CLUSTER/2 algorithm (cont’d)
28
The CLUSTER/2 algorithm (cont’d)
29
The steps of a CLUSTER/2 run
30
Document clustering
31
Hierarchical Agglomerative Clustering (HAC)
32
Example
     A    B    C    D    E
A    -    2    7    9    4
B    2    -    9   11   14
C    7    9    -    4    8
D    9   11    4    -    2
E    4   14    8    2    -
The pair with the highest similarity is: B-E = 14
33
Example
[Figure: dendrogram over leaves A, C, D, B, E; B and E joined as BE]
34
Example
     A   BE    C    D
A    -    2    7    9
BE   2    -    8    2
C    7    8    -    4
D    9    2    4    -
To compute (A,BE): take the minimum of (A,B)=2 and (A,E)=4.
This is called complete linkage.
35
Example
[Figure: dendrogram over leaves A, D, C, B, E; A and D joined as AD, B and E joined as BE]
36
Example
     AD   BE    C
AD    -    2    4
BE    2    -    8
C     4    8    -
37
Example
[Figure: dendrogram over leaves A, D, C, B, E; clusters AD, BE, and BCE]
38
Example
Everything has been clustered.
[Figure: dendrogram over leaves A, D, C, B, E; clusters AD, BE, BCE, and ABCDE]
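The merge sequence of this example can be reproduced with a short Python sketch of agglomerative clustering under complete linkage. The pairwise similarities are the ones from the example matrix; the function name `hac_complete_linkage` and the dictionary layout are assumptions made for illustration, and the sketch makes no attempt to be efficient.

```python
from itertools import combinations


def hac_complete_linkage(labels, sim):
    """Repeatedly merge the two most similar clusters.

    The similarity of two clusters is the minimum pairwise similarity
    between their members (complete linkage on similarities).
    """
    def link(c1, c2):
        return min(sim[frozenset((x, y))] for x in c1 for y in c2)

    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        merged = tuple(sorted(a + b))
        merges.append(("".join(merged), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


# Pairwise similarities from the example matrix.
pairs = {
    "AB": 2, "AC": 7, "AD": 9, "AE": 4,
    "BC": 9, "BD": 11, "BE": 14,
    "CD": 4, "CE": 8, "DE": 2,
}
sim = {frozenset(p): s for p, s in pairs.items()}

merges = hac_complete_linkage("ABCDE", sim)
print(merges)  # [('BE', 14), ('AD', 9), ('BCE', 8), ('ABCDE', 2)]
```

Each entry gives the newly formed cluster and the similarity at which it was formed, matching the order BE, AD, BCE, ABCDE in the example.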
39
Time complexity analysis
41
Example
[Figure: points A, B, C, D]
42
Example
[Figure: points A, B, C, D with cluster AB]
43
Example
[Figure: points A, B, C, D with cluster AB]
44
Example
[Figure: points A, B, C, D, E with clusters AB and DE]
45
Example
[Figure: points A, B, C, D with clusters AB and CDE]
46
Time complexity analysis
47
Remember k-means clustering
48
Time complexity analysis
K-means requires:
• Each node gets added to a cluster, so there
are n clustering steps.
• For each addition, we need to compare the
node to k centroids.
• We also need to recompute the centroid after
adding the new node; this takes a constant
amount of time (say c).
• The total time needed is (k + c) n = O(n)
for a fixed k.
• So it is a linear algorithm!
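One way to see why recomputing the centroid takes constant time is to write it as a running mean. The symbols here are notation introduced for this note, not taken from the slides: $\mu_m$ is the centroid of a cluster after m nodes have been added, and $x_{m+1}$ is the node being added.

```latex
\mu_{m+1} = \frac{m\,\mu_m + x_{m+1}}{m+1}
          = \mu_m + \frac{x_{m+1} - \mu_m}{m+1}
```

Only the previous centroid and the count m are needed, so the update does not revisit the nodes already in the cluster.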
49
But there are problems…
A B C
D E F
1. To get a byte.
2. Many things…
One option is to use the slow algorithm on a
portion of the problem to obtain a better
starting point for the fast algorithm.
51
Buckshot clustering
52
Getting the k clusters
ABCDE
AD BCE
BE
A D C B E
53
Effect of document order
54
Computing the distance (time)
55
Computing the distance (methods)
56
More on document clustering
• Applications
Structuring search results
Suggesting related pages
Automatic directory construction / update
Finding near identical pages
Finding mirror pages (e.g., for propagating updates)
Plagiarism detection
different times)
• Problems
Polysemy, e.g., “bat,” “Washington,” “Banks”
Multiple aspects of a single topic
Ultimately amounts to general problem of information
structuring
57
Clustering vs. classification
58
How many possible clusterings?
59
Cluster structure
• Hierarchical vs flat
• Overlap
Disjoint partitioning, e.g., partition congressmen by state
Multiple dimensions of partitioning, each disjoint, e.g.,
partition congressmen by state; by party; by
House/Senate
Arbitrary overlap, e.g., partition bills by congressmen
who voted for them
60
Measuring the quality of the clusters
61
Related communities
62