cs4811-ch10c-clustering
2
Example: a cholera outbreak in London
[Figure: scatter plot of outbreak cases, each marked X]
3
Conceptual Clustering
4
Conceptual Clustering (cont’d)
5
Curse of dimensionality
6
Higher dimensional examples
7
Skycat software
8
Skycat software (cont’d)
9
Clustering CDs
10
The space of CDs
11
Clustering documents
12
Clustering documents (cont’d)
13
Analyzing protein sequences
14
Measuring distance
15
K-dimensional Euclidean space
$\max_{i=1}^{k} |a_i - b_i|$
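The max-of-coordinates distance on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming points are equal-length sequences of numbers; the name max_distance is hypothetical and not from the slides.

```python
def max_distance(a, b):
    """Largest absolute coordinate difference between points a and b."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return max(abs(x - y) for x, y in zip(a, b))


# Example with two made-up points in 3-dimensional space.
d = max_distance((1, 5, 2), (4, 1, 2))
```

Here the coordinate differences are 3, 4 and 0, so the distance is decided by the second coordinate alone.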
16
Non-Euclidean spaces
17
Non-Euclidean spaces (cont’d)
similarity(object1, object2) = 3 / 4
similarity(object1, object3) =
similarity(object2, object3) = 1/4
Note that it is possible to assign different
weights to features.
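One way to read these values is as the fraction of features on which two objects agree, with optional per-feature weights. The sketch below rests on that assumption; the feature tuples are made-up stand-ins rather than the objects from the slide, and the function name similarity simply mirrors the notation above.

```python
def similarity(obj1, obj2, weights=None):
    """Weighted fraction of features on which obj1 and obj2 agree."""
    if len(obj1) != len(obj2):
        raise ValueError("objects must have the same number of features")
    if weights is None:
        weights = [1] * len(obj1)  # unweighted: every feature counts equally
    matched = sum(w for a, b, w in zip(obj1, obj2, weights) if a == b)
    return matched / sum(weights)


# Hypothetical objects described by (size, color, material, shape).
first = ("small", "red", "rubber", "ball")
second = ("small", "blue", "rubber", "ball")

unweighted = similarity(first, second)               # 3 of 4 features match
color_heavy = similarity(first, second, [1, 3, 1, 1])  # color counts triple
```

Giving color a weight of 3 lowers the score for this pair, because the one feature on which they differ now carries half of the total weight.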
18
Approaches to Clustering
19
The k-means algorithm
• Pick k cluster centroids.
• Assign points to clusters by picking the
closest centroid to the point in question. As
points are assigned to clusters, the centroid of
the cluster may migrate.
Example: Suppose that k = 2 and we assign
points 1, 2, 3, 4, 5, in that order. Outline circles
represent points, filled circles represent
centroids.
[Figure: points labeled 1 to 5]
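The two steps above can be sketched as a single pass over the points. This is a rough illustration under stated assumptions, not the exact procedure from the slides: the initial centroids are passed in as seed positions, each centroid becomes the running mean of the points assigned to it so far, and the coordinates are invented for the example.

```python
import math


def assign_sequentially(points, seeds):
    """Assign each point, in order, to its closest centroid.

    Assumption: a seed only marks a starting position. Once points are
    assigned, the centroid is the mean of the cluster's members, so it
    migrates as the cluster grows.
    """
    centroids = [tuple(s) for s in seeds]
    counts = [0] * len(centroids)
    labels = []
    for p in points:
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        counts[j] += 1
        # Incremental mean update: move the centroid toward the new point.
        centroids[j] = tuple(c + (x - c) / counts[j] for c, x in zip(centroids[j], p))
        labels.append(j)
    return labels, centroids


# k = 2, five hypothetical points assigned in order.
points = [(1, 1), (9, 9), (2, 2), (8, 10), (1, 3)]
labels, centroids = assign_sequentially(points, seeds=[(0, 0), (10, 10)])
```

The result depends on the order in which points arrive, which is one reason the choice of initial centroids matters.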
20
The k-means algorithm example (cont’d)
[Figure: four panels showing points 1 to 5 at successive steps of the assignment]
21
Issues
22
Issues (cont’d)
• How to determine k?
One can try different values of k and choose the
smallest k such that increasing k further does not
much decrease the average distance of points to
their centroids.
[Figure: scatter plot of points forming three apparent clusters]
23
Determining k
When k = 1, all the points are in one cluster, and the average
distance to the centroid will be high.
When k = 2, one of the clusters will be by itself and the other
two will be forced into one cluster. The average distance
of points to the centroid will shrink considerably.
[Figure: the scatter plot partitioned for k = 1 and for k = 2]
24
Determining k (cont’d)
When k = 3, each of the apparent clusters should be a
cluster by itself, and the average distance from the
points to their centroids shrinks again.
[Figure: plot of average radius against k, for k = 1, 2, 3, 4]
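The procedure for choosing k can be tried on a toy data set. The sketch below is illustrative only: it uses a plain batch k-means with a deterministic farthest-first seeding (an assumption made to keep the run repeatable, not something the slides prescribe), twelve invented points in three tight groups, and a hypothetical threshold min_drop for deciding that a decrease is no longer "much".

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def farthest_first(points, k):
    """Seed with the first point, then repeatedly add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iters=100):
    """Batch k-means: alternate assignment and centroid update."""
    centroids = farthest_first(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def average_radius(points, centroids):
    """Average distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


def smallest_good_k(radii, min_drop):
    """Smallest k whose successor lowers the average radius by less than min_drop.

    radii[i] is the average radius for k = i + 1.
    """
    for i in range(len(radii) - 1):
        if radii[i] - radii[i + 1] < min_drop:
            return i + 1
    return len(radii)


# Three invented, well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]
radii = [average_radius(points, k_means(points, k)) for k in (1, 2, 3, 4)]
chosen_k = smallest_good_k(radii, min_drop=0.5)
```

On this data the average radius falls sharply from k = 1 to k = 2 and again to k = 3, then barely moves at k = 4, so the sketch settles on k = 3. The threshold is a judgment call; real data rarely shows so clean a bend.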
26
The CLUSTER/2 algorithm
27
The CLUSTER/2 algorithm (cont’d)
28
The CLUSTER/2 algorithm (cont’d)
29
The steps of a CLUSTER/2 run
30
A COBWEB clustering for four one-celled organisms (Gennari et al., 1989)
32
Clustering vs. classification
33
Cluster structure
• Hierarchical vs flat
• Overlap
Disjoint partitioning, e.g., partition congressmen by state
Multiple dimensions of partitioning, each disjoint, e.g.,
partition congressmen by state; by party; by
House/Senate
Arbitrary overlap, e.g., partition bills by congressmen
who voted for them
34
More on document clustering
• Applications
Structuring search results
Suggesting related pages
Automatic directory construction / update
Finding near identical pages
Finding mirror pages (e.g., for propagating updates)
Plagiarism detection
different times)
• Problems
Polysemy, e.g., “bat,” “Washington,” “Banks”
Multiple aspects of a single topic
Ultimately amounts to general problem of information
structuring
35