ch07 Clustering
[Figure: course topic map — Locality-sensitive hashing; Clustering; Dimensionality reduction; PageRank, SimRank; Recommender systems; SVM; Filtering data streams; Queries on streams; Duplicate document detection; Spam detection; Perceptron, kNN.]
[Figure: a scatter of points with a dense group labeled “Cluster” and an isolated point labeled “Outlier”.]
Point assignment:
Maintain a set of clusters
Points belong to “nearest” cluster
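The point-assignment step can be sketched as a nearest-centroid lookup (a minimal sketch; `assign_point` and the sample centroids are illustrative, not from the slides):

```python
import math

def assign_point(point, centroids):
    """Return the index of the nearest centroid (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

centroids = [(1.0, 1.0), (5.0, 5.0)]
assign_point((1.2, 0.8), centroids)  # → 0 (nearest to (1,1))
```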
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
Hierarchical Clustering
Key operation:
Repeatedly combine
two nearest clusters
[Figure: hierarchical clustering of six data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); merged-cluster centroids (x): (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3). Legend: o = data point, x = centroid. The merge order is shown as a dendrogram.]
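A naive sketch of this merge loop (O(n³); the `hierarchical` helper and the centroid-distance merge criterion are illustrative choices), run on the six example points:

```python
import math

def centroid(cluster):
    """Component-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def hierarchical(points, k):
    """Repeatedly merge the two clusters with the nearest centroids
    until only k clusters remain (naive O(n^3) version)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: math.dist(centroid(clusters[ab[0]]),
                                            centroid(clusters[ab[1]])))
        clusters[i].extend(clusters.pop(j))  # j > i, so pop is safe
    return clusters

data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
hierarchical(data, 2)  # two clusters of three points each
```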
And in the Non-Euclidean Case?
The only “locations” we can talk about are the points themselves
i.e., there is no “average” of two points
Approach 1:
(1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
(2) How do you determine the “nearness” of clusters?
Treat the clustroid as if it were the centroid when computing inter-cluster distances
“Closest” Point?
(1) How to represent a cluster of many points?
clustroid = point “closest” to other points
Possible meanings of “closest”:
Smallest maximum distance to other points
Smallest average distance to other points
Smallest sum of squares of distances to other points
For distance metric d, the clustroid c of cluster C (using the sum-of-squares criterion) is:
c = argmin_{c ∈ C} Σ_{x ∈ C} d(x, c)²
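The sum-of-squares criterion can be sketched as follows (the `clustroid` helper and sample points are illustrative):

```python
import math

def clustroid(cluster, d):
    """The point in the cluster minimizing the sum of squared
    distances to all other points in the cluster."""
    return min(cluster, key=lambda c: sum(d(c, x) ** 2 for x in cluster))

pts = [(0, 0), (2, 0), (0, 2), (1, 1)]
clustroid(pts, math.dist)  # → (1, 1), the most central point
```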
[Three figures on the same data: cluster assignments and centroids after round 1, after round 2, and at the end (x = data point; centroids marked separately).]
Best value of k
[Figure: average distance to centroid plotted against k; the curve drops steeply and then flattens at the best value of k.]
[Figures: with too few clusters, many points have long distances to their centroid; with k just right, distances to the centroid are rather short.]
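A sketch of the quantity being plotted — the average distance from each point to its own cluster’s centroid — on toy data (names and data are illustrative):

```python
import math

def avg_dist_to_centroid(clusters):
    """Average distance from each point to its own cluster's centroid."""
    total, n = 0.0, 0
    for cluster in clusters:
        dim = len(cluster[0])
        c = [sum(p[i] for p in cluster) / len(cluster) for i in range(dim)]
        total += sum(math.dist(p, c) for p in cluster)
        n += len(cluster)
    return total / n

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# One big cluster has many long distances; two tight clusters do not:
avg_dist_to_centroid([pts]) > avg_dist_to_centroid([pts[:3], pts[3:]])  # True
```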
[Figure: a cluster, all of whose points are in the DS, with its centroid marked; nearby compressed sets, whose points are in the CS.]
Summarizing Points: Comments
2d + 1 values represent a cluster of any size
d = number of dimensions
The average in each dimension (the centroid) can be calculated as SUMi / N
SUMi = i-th component of SUM
The variance of a cluster’s discard set in dimension i is: (SUMSQi / N) − (SUMi / N)²
And the standard deviation is the square root of that
Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require storing the full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d × d matrix, which is too big!
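Recovering the centroid and per-dimension variance from the 2d + 1 summary values can be sketched as (the `summarize` helper is illustrative):

```python
def summarize(N, SUM, SUMSQ):
    """Recover the centroid, per-dimension variance, and standard
    deviation from the 2d+1 summary values (N, SUM, SUMSQ)."""
    centroid = [s / N for s in SUM]
    variance = [sq / N - (s / N) ** 2 for s, sq in zip(SUM, SUMSQ)]
    std = [v ** 0.5 for v in variance]
    return centroid, variance, std

# Cluster of the 2-dimensional points (1,2), (3,4), (5,6):
N, SUM, SUMSQ = 3, [9, 12], [35, 56]
summarize(N, SUM, SUMSQ)  # centroid [3.0, 4.0]; variance 8/3 per dimension
```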
The “Memory-Load” of Points
Processing the “Memory-Load” of points (1):
1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS
These points are so close to the centroid that they can be summarized and then discarded
2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
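The text does not define “sufficiently close”; one common choice (an assumption here, not stated above) is the distance to the centroid with each dimension scaled by that dimension’s standard deviation, computable directly from the (N, SUM, SUMSQ) summary:

```python
def sufficiently_close(point, summary, threshold=3.0):
    """Closeness test (an assumed criterion): distance to the centroid,
    with each dimension scaled by that dimension's standard deviation."""
    N, SUM, SUMSQ = summary
    total = 0.0
    for x, s, sq in zip(point, SUM, SUMSQ):
        mean = s / N
        std = (sq / N - mean * mean) ** 0.5
        total += ((x - mean) / std) ** 2
    return total ** 0.5 <= threshold

# Summary of the four points (0,0), (2,0), (0,2), (2,2):
summary = (4, [4.0, 4.0], [8.0, 8.0])
sufficiently_close((1.5, 1.5), summary)    # True: well within threshold
sufficiently_close((10.0, 10.0), summary)  # False: far outside
```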
The “Memory-Load” of Points
Processing the “Memory-Load” of points (2):
3) DS set: Adjust statistics of the clusters to
account for the new points
Add Ns, SUMs, SUMSQs
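Step 3 above can be sketched as (the `add_points` helper is illustrative):

```python
def add_points(summary, points):
    """Fold new points into a cluster summary (N, SUM, SUMSQ)."""
    N, SUM, SUMSQ = summary
    for p in points:
        N += 1
        SUM = [s + x for s, x in zip(SUM, p)]
        SUMSQ = [sq + x * x for sq, x in zip(SUMSQ, p)]
    return N, SUM, SUMSQ

summary = (2, [4.0, 6.0], [10.0, 20.0])  # two points already summarized
add_points(summary, [(2.0, 2.0)])        # → (3, [6.0, 8.0], [14.0, 24.0])
```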
[Two figures: points labeled h and e plotted as salary vs. age, forming two natural clusters.]
[Figure: the same salary-vs-age data — pick (say) 4 remote points for each cluster.]
[Figure: the same salary-vs-age data — move the picked points (say) 20% toward the centroid.]
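Moving the picked representative points a fixed fraction of the way toward the centroid can be sketched as (names and sample data are illustrative):

```python
def move_toward_centroid(reps, centroid, fraction=0.2):
    """Move each representative point the given fraction of the way
    toward the centroid."""
    return [tuple(x + fraction * (c - x) for x, c in zip(p, centroid))
            for p in reps]

reps = [(0.0, 0.0), (10.0, 0.0)]
move_toward_centroid(reps, (5.0, 5.0))  # → [(1.0, 1.0), (9.0, 1.0)]
```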