Clustering Class
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
• Partitional clustering
– A division of the points into non-overlapping clusters such that each point is in exactly one cluster
[Figure: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.]
Distance Measures
• Each clustering problem is based on
some kind of “distance” between points.
• Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean
Euclidean vs. Non-Euclidean
• A Euclidean space has some number of
real-valued dimensions and “dense”
points.
– There is a notion of “average” of two points.
– A Euclidean distance is based on the
locations of points in such a space.
• A Non-Euclidean distance is based on
properties of points, but not their “location”
in a space.
Axioms of a Distance
Measure
• d is a distance measure if it is a
function from pairs of points to reals such
that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
Some Euclidean Distances
• L2 norm : d(x,y) = square root of the sum of
the squares of the differences between x
and y in each dimension.
– The most common notion of “distance.”
• L1 norm : d(x,y) = sum of the absolute differences between x
and y in each dimension.
– Manhattan distance = distance if you had to
travel along coordinates only.
Examples of Euclidean Distances
• x = (5,5), y = (9,8)
• L2-norm: dist(x,y) = √(4² + 3²) = 5
• L1-norm: dist(x,y) = 4 + 3 = 7
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the absolute differences
between x and y in any dimension.
• Note: the maximum is the limit, as n goes to ∞, of the Ln norm:
take the nth power of the absolute differences, sum, and take
the nth root.
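To make these norms concrete, here is a minimal sketch in Python (NumPy assumed available) that computes the L1, L2, and L∞ distances for the example points used above.

```python
import numpy as np

def lp_distances(x, y):
    """Return the L1, L2, and L-infinity distances between two points."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return diff.sum(), np.sqrt((diff ** 2).sum()), diff.max()

# The worked example from above: x = (5,5), y = (9,8)
l1, l2, linf = lp_distances((5, 5), (9, 8))
print(l1, l2, linf)   # 7.0 5.0 4.0
```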
Non-Euclidean Distances
Hierarchical:
▪ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
▪ Divisive (top down):
▪ Start with one cluster and recursively split it
Point assignment:
▪ Maintain a set of clusters
▪ Points belong to “nearest” cluster
Key operation: Repeatedly combine two
nearest clusters
(1) How to represent a cluster of many points?
▪ Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
Euclidean case: each cluster has a
centroid = average of its (data)points
(2) How to determine “nearness” of clusters?
▪ Measure cluster distances by distances of centroids
[Figure: agglomerative clustering in Euclidean space with the resulting dendrogram. Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3). Centroids (x) of the merged clusters: (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3).]
What about the Non-Euclidean case?
The only “locations” we can talk about are the
points themselves
▪ i.e., there is no “average” of two points
Approach 1:
▪ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
▪ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
(1) How to represent a cluster of many points?
clustroid = point “closest” to other points
Possible meanings of “closest”:
▪ Smallest maximum distance to other points
▪ Smallest average distance to other points
▪ Smallest sum of squares of distances to other points
▪ For a distance metric d, the clustroid c of cluster C is the point c ∈ C minimizing Σ_{x∈C} d(x, c)²
[Figure: the datapoints of a cluster and its centroid.]
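As a concrete illustration, a minimal sketch (Python; the distance function d is supplied by the caller) of picking the clustroid by the sum-of-squared-distances criterion:

```python
import math

def clustroid(cluster, d):
    """Return the point of `cluster` minimizing the sum of squared
    distances (under the metric d) to all points of the cluster."""
    return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))

# Illustration with Euclidean distance on 2-D tuples; any metric
# (e.g., edit distance on strings) could be plugged in instead.
pts = [(0, 0), (1, 2), (2, 1)]
print(clustroid(pts, math.dist))
```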
Approach 3.1: Use the diameter of the
merged cluster = maximum distance between
points in the cluster
Approach 3.2: Use the average distance
between points in the cluster
Approach 3.3: Use a density-based approach
▪ Take the diameter or avg. distance, e.g., and divide
by the number of points in the cluster
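A minimal sketch of the cohesion measures just listed (diameter, average inter-point distance, and a density-style score), again assuming an arbitrary distance function d:

```python
from itertools import combinations

def diameter(cluster, d):
    """Maximum distance between any two points of the cluster."""
    return max(d(a, b) for a, b in combinations(cluster, 2))

def avg_distance(cluster, d):
    """Average distance over all pairs of points of the cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(d(a, b) for a, b in pairs) / len(pairs)

def density_score(cluster, d):
    """Density-based variant: diameter divided by the number of points."""
    return diameter(cluster, d) / len(cluster)
```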
(3) When to stop clustering?
Stop when we have K clusters
Stop if diameter/radius of cluster that results from
the best merger exceeds a threshold.
Stop if density is below some threshold
▪ Density = number of cluster points per unit volume of the cluster
(i.e., the number of cluster points divided by some power of the
diameter or radius)
Stop if evidence suggests that merging will produce a bad cluster
▪ E.g., a sudden increase in cluster diameter
Naïve implementation of hierarchical clustering:
▪ At each step, compute pairwise distances between all pairs of
clusters, then merge the closest pair
▪ O(N³)
Careful implementation using a priority queue can reduce the
time to O(N² log N)
▪ Still too expensive for really big datasets that do not fit in memory
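As a concrete sketch of the naïve approach (Python/NumPy, Euclidean case, merging by closest centroids; stopping at k clusters is just one of the possible stopping criteria discussed above):

```python
import numpy as np

def naive_agglomerative(points, k):
    """Merge the two clusters with the nearest centroids until k clusters remain."""
    clusters = [[p] for p in points]                      # each point starts as its own cluster
    while len(clusters) > k:
        centroids = [np.mean(c, axis=0) for c in clusters]
        # Find the pair of clusters whose centroids are closest
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: np.linalg.norm(centroids[ab[0]] - centroids[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]           # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(naive_agglomerative([np.array(p) for p in [(0,0),(1,2),(2,1),(4,1),(5,0),(5,3)]], 2))
```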
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point, of
the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when no new assignments are made
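A minimal sketch of these steps in Python (NumPy assumed); the initial seeds are simply the first k points, an arbitrary choice, and empty clusters are not handled:

```python
import numpy as np

def kmeans(points, k, max_iters=100):
    """Assign points to the nearest center, recompute means, repeat until stable."""
    X = np.asarray(points, dtype=float)
    centers = X[:k].copy()                       # arbitrary initial seed points
    for _ in range(max_iters):
        # Assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):    # stop when the means no longer change
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans([(0,0),(1,2),(2,1),(4,1),(5,0),(5,3)], k=2)
```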
The K-Means Clustering Method
[Figure: k-means with K = 2 on a small 2-D data set. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat updating and reassigning until the assignments no longer change.]
Assumes Euclidean space/distance
1) For each point, place it in the cluster whose
current centroid it is nearest
[Figures: the example data (x = data point, centroids marked) showing the clusters after round 1, after round 2, and at the end.]
How to select k?
Try different k, looking at the change in the
average distance to centroid as k increases
Average falls rapidly until right k, then
changes little
[Plot: average distance to centroid vs. k. The curve falls rapidly up to the best value of k, then changes little.]
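A small sketch of this heuristic, reusing the kmeans sketch shown earlier; the "right" k is read off where the curve stops falling rapidly:

```python
import numpy as np

def avg_dist_to_centroid(points, k):
    """Average distance from each point to the centroid of its assigned cluster."""
    centers, labels = kmeans(points, k)          # kmeans sketch defined earlier
    X = np.asarray(points, dtype=float)
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

# Look for the k beyond which the average distance stops falling rapidly
for k in range(1, 6):
    print(k, avg_dist_to_centroid([(0,0),(1,2),(2,1),(4,1),(5,0),(5,3)], k))
```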
[Three scatter plots of the same data with centroids marked:
– Too few clusters: many long distances to centroid.
– Just right: distances to centroid are rather short.
– Too many clusters: little improvement in average distance.]
Extension of k-means to large data
BFR [Bradley-Fayyad-Reina] is a
variant of k-means designed to
handle very large (disk-resident) data sets
[Figure: a cluster, all of whose points are in the DS, with its centroid marked; nearby compressed sets, whose points are in the CS.]
2d + 1 values represent a cluster of any size:
▪ d = number of dimensions
▪ N = number of points; SUM = per-dimension sum of the points;
SUMSQ = per-dimension sum of squares of the points
Average in each dimension (the centroid) can be calculated as SUMi / N
▪ SUMi = ith component of SUM
Variance of a cluster’s discard set in dimension i is: (SUMSQi / N) – (SUMi / N)²
▪ And the standard deviation is the square root of that
Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require
storing full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
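A minimal sketch (Python/NumPy) of this summary representation, showing how the centroid, per-dimension variance, and standard deviation fall out of N, SUM, and SUMSQ:

```python
import numpy as np

class ClusterSummary:
    """BFR-style summary: N, SUM, and SUMSQ (2d + 1 numbers for d dimensions)."""
    def __init__(self, d):
        self.N = 0
        self.SUM = np.zeros(d)
        self.SUMSQ = np.zeros(d)

    def add(self, point):
        point = np.asarray(point, dtype=float)
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def std(self):
        return np.sqrt(self.variance())

s = ClusterSummary(d=2)
for p in [(1, 2), (3, 4), (5, 6)]:
    s.add(p)
print(s.centroid(), s.variance(), s.std())   # centroid (3, 4); variance (8/3, 8/3)
```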
Processing the “Memory-Load” of points (1):
1) Find those points that are “sufficiently
close” to a cluster centroid and add those
points to that cluster and the DS
▪ These points are so close to the centroid that
they can be summarized and then discarded
2) Use any main-memory clustering algorithm
to cluster the remaining points and the old RS
▪ Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Processing the “Memory-Load” of points (2):
3) DS set: Adjust statistics of the clusters to
account for the new points
▪ Add Ns, SUMs, SUMSQs
4) Consider merging compressed sets in the CS
5) If this is the last round, merge all compressed
sets in the CS and all RS points into their nearest
cluster
[Figure: points in the RS (isolated points) and the compressed sets, whose points are in the CS.]
Q1) We need a way to decide whether to put
a new point into a cluster (and discard)
One answer: the normalized Euclidean distance from the centroid, i.e., each per-dimension difference divided by the cluster's standard deviation in that dimension; add (and discard) the point if this distance is below a threshold.
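A minimal sketch of that distance in terms of the ClusterSummary sketch above; the acceptance threshold (here 3 "standard deviations") is an assumption, not a prescribed value:

```python
import numpy as np

def normalized_distance(summary, point):
    """Euclidean distance from the cluster centroid with each dimension
    scaled by that dimension's standard deviation."""
    z = (np.asarray(point, dtype=float) - summary.centroid()) / summary.std()
    return np.sqrt((z ** 2).sum())

def close_enough(summary, point, threshold=3.0):
    # e.g., accept points within ~3 "standard deviations" of the centroid
    return normalized_distance(summary, point) < threshold
```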
Q2) Should 2 CS subclusters be combined?
Compute the variance of the combined
subcluster
▪ N, SUM, and SUMSQ allow us to make that
calculation quickly
Combine if the combined variance is
below some threshold
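A minimal sketch of that check, again using the N/SUM/SUMSQ summaries; the variance threshold is left as a parameter:

```python
import numpy as np

def combined_variance(a, b):
    """Per-dimension variance of the union of two summarized subclusters."""
    N = a.N + b.N
    SUM = a.SUM + b.SUM
    SUMSQ = a.SUMSQ + b.SUMSQ
    return SUMSQ / N - (SUM / N) ** 2

def should_merge(a, b, threshold):
    # Merge only if no dimension's combined variance exceeds the threshold
    return np.all(combined_variance(a, b) < threshold)
```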
Extension of k-means to clusters of arbitrary shapes: the CURE algorithm
Problem with BFR/k-means:
▪ Assumes clusters are normally
distributed in each dimension
▪ And axes are fixed – ellipses at
an angle are not OK
[Scatter plot: salary vs. age, with points labeled e and h forming two natural clusters that are not axis-aligned ellipses.]
2-pass algorithm. Pass 1:
0) Pick a random sample of points that fit in
main memory
1) Initial clusters:
▪ Cluster these points hierarchically – group
nearest points/clusters
2) Pick representative points:
▪ For each cluster, pick a sample of points, as
dispersed as possible
▪ From the sample, pick representatives by moving
them (say) 20% toward the centroid of the cluster
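A minimal sketch (Python/NumPy) of picking dispersed representative points and shrinking them toward the centroid; the values 4 and 20% follow the "(say)" figures on the slides:

```python
import numpy as np

def pick_representatives(cluster, n_reps=4, shrink=0.2):
    """Pick n_reps well-dispersed points from the cluster, then move each
    one `shrink` of the way toward the cluster centroid."""
    X = np.asarray(cluster, dtype=float)
    centroid = X.mean(axis=0)
    reps = [X[np.argmax(np.linalg.norm(X - centroid, axis=1))]]   # start with the farthest point
    while len(reps) < min(n_reps, len(X)):
        # Greedily add the point farthest from the already-chosen representatives
        dists = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[np.argmax(dists)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)                      # move 20% toward the centroid
```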
[Figures: the salary-vs-age example through Pass 1. The sampled points are clustered hierarchically; then (say) 4 remote points are picked for each cluster; then those points are moved (say) 20% toward the centroid of their cluster.]
Pass 2:
Now, rescan the whole dataset and visit each point p in the data set
▪ Place p in the cluster of the closest representative point
Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
Algorithms:
▪ Agglomerative hierarchical clustering:
▪ Centroid and clustroid
▪ k-means:
▪ Initialization, picking k
▪ BFR
▪ CURE
Dealing With a Non-Euclidean Space
• Problem: clusters cannot be represented by
centroids.
• Why? Because the “average” of “points” might
not be a point in the space.
• Best substitute: the clustroid = point in the
cluster that minimizes the sum of the squares
of distances to the points in the cluster.
Representing Clusters in Non-Euclidean Spaces
• Recall BFR represents a Euclidean cluster
by N, SUM, and SUMSQ.
• A non-Euclidean cluster is represented by:
– N.
– The clustroid.
– Sum of the squares of the distances from
clustroid to all points in the cluster.
The GRGPF Algorithm
• From Ganti et al.
• Works for non-Euclidean distances.
• Works for massive (disk-resident) data.
• Hierarchical clustering.
• Clusters are grouped into a tree of disk
blocks (like a B-tree or R-tree).
Information Retained About a Cluster
1. N, clustroid, SUMSQ.
2. The p points closest to the clustroid, and
their values of SUMSQ.
3. The p points of the cluster that are
furthest away from the clustroid, and
their SUMSQ’s.
At Interior Nodes of the Tree
[Figure: samples kept in main memory at the interior nodes of the tree; the clusters themselves are on disk.]
Initialization
[Figure: centroids at (3,3), (12,12), (12,2), and (18,-2), with their weights; a new centroid is marked.]
Example: New Costs
[Figure: the same configuration after a point is added near (12,12); costs are recomputed.]