Clustering
1
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense as
close to each other as possible.
2
Example
[Figure: a 2-D scatter of points (drawn as x's) that falls naturally into a few well-separated clusters.]
3
Problems With Clustering
Clustering in two dimensions looks easy.
Clustering small amounts of data looks easy.
The Curse of Dimensionality
Many applications involve not 2, but 10 or
10,000 dimensions.
4
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
5
Distance Measures
Each clustering problem is based on some
kind of “distance” between points.
Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean
6
Euclidean Vs. Non-Euclidean
A Euclidean space has some number of real-
valued dimensions and “dense” points.
There is a notion of “average” of two points.
A Euclidean distance is based on the locations of
points in such a space.
A Non-Euclidean distance is based on
properties of points, but not their “location” in a
space.
7
Some Euclidean Distances
L2 norm : d(x,y) = square root of the sum of
the squares of the differences between x and
y in each dimension.
The most common notion of “distance.”
L1 norm : sum of the differences in each
dimension.
Manhattan distance = distance if you had to travel
along coordinates only.
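A minimal Python sketch of the two norms described above (illustrative, not from the original slides):

import math

def l2_distance(x, y):
    # Euclidean (L2) norm: square root of the sum of squared per-dimension differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    # Manhattan (L1) norm: sum of absolute per-dimension differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical points for illustration.
print(l2_distance((0, 0), (3, 4)))  # 5.0
print(l1_distance((0, 0), (3, 4)))  # 7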
8
Non-Euclidean Distances
Jaccard distance for sets = 1 minus ratio of
sizes of intersection and union.
Jaccard(x, y) = 1 − |x ∩ y| / |x ∪ y|
Cosine distance = angle between vectors
from the origin to the points in question.
Edit distance = number of inserts and deletes
to change one string into another.
9
Jaccard Distance for Bit-Vectors
Example: p1 = 10111; p2 = 10011.
Size of intersection = 3; size of union = 4, Jaccard
similarity (not distance) = 3/4.
Need to make a distance function satisfying
triangle inequality and other laws.
d(x,y) = 1 – (Jaccard similarity) works.
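A small Python sketch of the Jaccard distance on bit-vectors, reproducing the p1 = 10111, p2 = 10011 example above (illustrative only):

def jaccard_distance(p1, p2):
    # Treat each bit-vector as the set of positions that hold a 1.
    a = {i for i, bit in enumerate(p1) if bit == '1'}
    b = {i for i, bit in enumerate(p2) if bit == '1'}
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance("10111", "10011"))  # 1 - 3/4 = 0.25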
10
Cosine Distance
Think of a point as a vector from the origin
(0,0,…,0) to its location.
Two points’ vectors make an angle, whose
cosine is the normalized dot-product of the
vectors: p1.p2/|p2||p1|.
Example p1 = 00111; p2 = 10011.
p1.p2 = 2; |p1| = |p2| = √3.
cos(θ) = 2/3; θ is about 48 degrees.
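A short Python sketch of the cosine distance, reproducing the example above (function and variable names are illustrative):

import math

def cosine_distance(x, y):
    # Angle (in degrees) between the vectors from the origin to x and y.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return math.degrees(math.acos(dot / (norm_x * norm_y)))

p1 = (0, 0, 1, 1, 1)
p2 = (1, 0, 0, 1, 1)
print(cosine_distance(p1, p2))  # about 48.19 degrees, since cos(θ) = 2/3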
11
Edit Distance
The edit distance of two strings is the number
of inserts and deletes of characters needed to
turn one into the other.
Equivalently: d(x,y) = |x| + |y| -2|LCS(x,y)|.
LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
12
Example
x = abcde ; y = bcduve.
Turn x into y by deleting a, then inserting u
and v after d.
Edit-distance = 3.
Or, LCS(x,y) = bcde.
|x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4 = 3.
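A compact Python sketch of the LCS-based formula, reproducing the abcde / bcduve example (insert/delete edits only; purely illustrative):

def lcs_length(x, y):
    # Standard dynamic-programming table for the longest common subsequence.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # d(x, y) = |x| + |y| - 2 * |LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3, since LCS = "bcde"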
13
Clustering Algorithms
k -Means Algorithms
Hierarchical Clustering
14
Methods of Clustering
Point assignment (partitioning, “flat” algorithms):
Usually start with a random (partial) partitioning and maintain a set of clusters.
Refine it iteratively:
Place points into their “nearest” cluster.
k-means / k-medoids clustering
Model-based clustering
Hierarchical (Agglomerative):
Initially, each point is in a cluster by itself.
Repeatedly combine the two “nearest” clusters into one.
15
Partitional Clustering
Also called flat clustering
The most famous algorithm is K-Means
16
k-Means Algorithm(s)
Assumes Euclidean space.
Start by picking k, the number of clusters.
Initialize clusters by picking one point per
cluster.
For instance, pick one point at random, then k-1
other points, each as far away as possible from
the previous points.
17
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster
center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2,3 until convergence (change
in cluster assignments less than a
threshold)
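A compact sketch of the four steps above in plain Python with Euclidean distance (the function name, sample data, and convergence test are illustrative assumptions, not from the slides):

import math
import random

def kmeans(points, k, max_iters=100):
    # 1) Pick k cluster centers at random from the data.
    centers = random.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # 2) Assign every point to its nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # 3) Move each cluster center to the mean of its assigned points.
        new_centers = [
            tuple(sum(d) / len(cluster) for d in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        # 4) Repeat until the centers (and hence the assignments) stop changing.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data.
data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(data, k=2)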
18
K-means example, step 1
Pick 3 initial cluster centers k1, k2, k3 (randomly).
[Figure: scatter of points in the X-Y plane with the three centers marked.]
19
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: points grouped by their nearest center k1, k2, or k3.]
20
K-means example, step 3
Move each cluster center to the mean of its cluster.
[Figure: k1, k2, k3 shift from their old positions to the cluster means.]
21
K-means example, step 4
Reassign the points that are now closest to a different cluster center.
Q: Which points are reassigned?
[Figure: the same scatter with the moved centers k1, k2, k3.]
22
K-means example, step 4 …
A: three points change clusters (highlighted in the slide animation).
[Figure: the three reassigned points marked near k1, k2, k3.]
23
K-means example, step 4b
Re-compute the cluster means.
[Figure: updated means for the clusters around k1, k2, k3.]
24
K-means example, step 5
Move the cluster centers to the new cluster means.
[Figure: final positions of k1, k2, k3.]
25
Discussion
26
Issue 1: How Many Clusters?
Sometimes the number of clusters k is given:
partition n docs into a predetermined number of clusters.
Otherwise, finding the “right” number of clusters is part of the problem.
27
Getting k Right
Try different k, looking at the change in the
average distance to centroid, as k increases.
Average falls rapidly until right k, then
changes little.
[Figure: average distance to centroid plotted against k; the curve falls steeply until the best value of k, then flattens.]
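A rough sketch of this elbow heuristic, assuming the hypothetical kmeans helper and data sketched earlier and Euclidean distance (illustrative only):

import math

def average_distance_to_centroid(points, k):
    # Run k-means, then average each point's distance to its cluster centroid.
    centers, clusters = kmeans(points, k)
    total, n = 0.0, 0
    for center, cluster in zip(centers, clusters):
        for p in cluster:
            total += math.dist(p, center)
            n += 1
    return total / n

# Try several k and look for the point where the average stops falling quickly.
for k in range(1, 6):
    print(k, average_distance_to_centroid(data, k))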
28
Example
Too few clusters: many long distances to the centroid.
[Figure: the scatter from before grouped into too few clusters.]
29
Example
Just right: distances to the centroids are rather short.
[Figure: the scatter grouped into its natural clusters.]
30
Example
Too many clusters: little improvement in the average distance.
[Figure: the scatter split into more clusters than needed.]
31
Issue 2: Initial Seeds
Result can vary significantly depending on the initial choice of seeds (number and position).
Can get trapped in a local minimum.
Example: [Figure: a set of instances with a poor choice of initial cluster centers.]
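One common mitigation (not spelled out on the slide) is to run k-means several times from different random seeds and keep the best run. A hedged sketch with scikit-learn, assuming that library is available:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])

# n_init=10 runs the algorithm from 10 random initializations and keeps the
# result with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)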
34
K-means clustering: outliers?
What can be done about outliers?
35
K-means clustering summary
Advantages: simple, understandable; items are assigned to clusters automatically.
Disadvantages: must pick the number of clusters beforehand; all items are forced into a cluster.
36
Clustering Algorithms
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive
37
Hierarchical Clustering
Two important questions:
1. How do you determine the “nearness” of
clusters?
2. How do you represent a cluster of more than
one point?
38
Hierarchical Clustering --- (2)
Key problem: as you build clusters, how do
you represent the location of each cluster, to
tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid =
average of its points.
Measure intercluster distances by distances of
centroids.
39
Example
[Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); cluster centroids x at (1.5,1.5), (1,1), (4.5,0.5), (4.7,1.3) as the clusters merge.]
40
And in the Non-Euclidean Case?
The only “locations” we can talk about are the
points themselves.
I.e., there is no “average” of two points.
Approach 1: clustroid = point “closest” to
other points.
Treat clustroid as if it were centroid, when
computing intercluster distances.
41
“Closest” Point?
Possible meanings:
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points.
3. Smallest sum of squares of distances to other
points.
4. Etc., etc.
42
Example
[Figure: two clusters of numbered points (1-6); each cluster's clustroid is marked, and the intercluster distance is measured between the two clustroids.]
43
*Hierarchical clustering
Bottom up
Start with single-instance clusters
At each step, join the two closest clusters
Design decision: distance between clusters
E.g. two closest instances in clusters
vs. distance between means
Top down
Start with one universal cluster
Find two clusters
Proceed recursively on each subset
Can be very fast
Both methods produce a
dendrogram
[Figure: dendrogram over the items a through k.]
44
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a
set of documents.
[Figure: example taxonomy tree with root “animal” splitting into “vertebrate” and “invertebrate”.]
45
Hierarchical Agglomerative
Clustering (HAC)
Assumes a similarity function for determining
the similarity of two instances.
Starts with all instances in a separate cluster
and then repeatedly joins the two clusters
that are most similar until there is only one
cluster.
The history of merging forms a binary tree or
hierarchy.
46
A Dendrogram: Hierarchical Clustering
Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters).
47
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two
clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.
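A minimal sketch of this merge loop using centroid distance in a Euclidean space (quadratic-time and purely illustrative; the HAC slide itself does not fix a similarity measure):

import math

def centroid(cluster):
    # Component-wise mean of the points in the cluster.
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def hac(points):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Replace c_i and c_j with the single merged cluster c_i ∪ c_j.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges  # the merge history forms the binary tree / dendrogram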
48
Hierarchical Clustering algorithms
Agglomerative (bottom-up):
Start with each item being a single cluster.
Divisive (top-down):
Start with all items belonging to the same cluster.
Eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance
Needs a termination/readout condition
49
Dendrogram: Document Example
As clusters agglomerate, docs likely to fall
into a hierarchy of “topics” or concepts.
[Figure: dendrogram over documents d1-d5: d1 and d2 merge, d4 and d5 merge, then d3 joins d4,d5.]
50
“Closest pair” of clusters
Many variants to defining closest pair of clusters
“Center of gravity”
Clusters whose centroids (centers of gravity) are the most
cosine-similar
Single-link
Similarity of the most similar (single-link)
Complete-link
Similarity of the “furthest” (least similar) points
Average-link
Average similarity between pairs of elements
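These linkage variants are available off the shelf; a hedged SciPy sketch, assuming that library is installed (the slides do not prescribe any particular implementation, and SciPy's linkage defaults to Euclidean rather than cosine distance):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])

# method can be 'single', 'complete', 'average', or 'centroid',
# matching the variants listed above.
Z = linkage(X, method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)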
51
Major issue - labeling
After clustering algorithm finds clusters - how
can they be useful to the end user?
Need pithy label for each cluster
52
How to Label Clusters
Show titles of typical documents
Titles are easy to scan
But you can only show a few titles, which may not fully represent the cluster
Differential labeling
But harder to scan
53
Evaluation of clustering
54
Approaches to evaluating
Anecdotal
User inspection
Ground “truth” comparison
Cluster retrieval
Purely quantitative measures
Average distance between cluster members
Microeconomic / utility
55
Anecdotal evaluation
Probably the commonest (and surely the easiest)
“I wrote this clustering algorithm and look what it
found!”
No benchmarks, no comparison possible
Any clustering algorithm will pick up the easy stuff
like partition by languages
Generally, unclear scientific value.
56
User inspection
Induce a set of clusters or a navigation tree
Have subject matter experts evaluate the results and
score them
some degree of subjectivity
57
Ground “truth” comparison
Take a union of docs from a taxonomy & cluster
Yahoo!, ODP, newspaper sections …
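One standard score for this kind of ground-truth comparison (not named on the slide, but widely used) is purity; a small hedged Python sketch:

from collections import Counter

def purity(cluster_assignments, true_labels):
    # For each cluster, count its most common true label; purity is the fraction
    # of items that carry their cluster's majority label.
    clusters = {}
    for c, t in zip(cluster_assignments, true_labels):
        clusters.setdefault(c, []).append(t)
    majority_total = sum(Counter(labels).most_common(1)[0][1] for labels in clusters.values())
    return majority_total / len(true_labels)

# Hypothetical example: 6 docs in 2 clusters, labels taken from a taxonomy.
print(purity([0, 0, 0, 1, 1, 1], ["sports", "sports", "news", "news", "news", "news"]))  # 5/6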
58
Microeconomic viewpoint
Anything - including clustering - is only as good as
the economic utility it provides
For clustering: net economic gain produced by an
approach (vs. another approach)
Examples
recommendation systems
59
Other Clustering Approaches
EM – probability based clustering
Bayesian clustering
SOM – self-organizing maps
…
60
Soft Clustering
61