TM3 ch07 Clustering
Clustering
(Chapter 7 of MMDS book)
DAMA60 Group
School of Science and Technology
Hellenic Open University
Contents
• Preliminaries
• The problem of clustering
• Distance measures and metrics
• Cluster metrics
High Dimensional Data
• Given a cloud of data points we want to understand its structure
The Problem of Clustering
Distance
● Numerical measure of how different two data objects are
○ A function that maps pairs of objects to real values
○ Lower when objects are more alike
● A distance d is a metric if, for all points x, y, z:
1. d(x,y) ≥ 0 (non-negativity)
2. d(x,y) = 0 iff x = y (identity)
3. d(x,y) = d(y,x) (symmetry)
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
[Figure: Euclidean distance between two points in the plane, e.g. y = (9,8)]
Geometrical Properties of Distances
• ||x|| = ||x − 0|| = SQRT( Σ_{i=1..n} xᵢ² )
• ||y|| = ||y − 0|| = SQRT( Σ_{i=1..n} yᵢ² )
• ||·|| : the L2 norm, i.e. a vector’s Euclidean length (its distance from the origin 0)
• x = [1, 2, -1], y = [2, 1, 1]
• The dot product x · y = 1·2 + 2·1 + (-1)·1 = 3
• ||x|| = SQRT(1² + 2² + (-1)²) = SQRT(6)
• ||y|| = SQRT(2² + 1² + 1²) = SQRT(6)
• cos(x, y) = 3 / (SQRT(6) · SQRT(6)) = 3/6 = ½
• The angle whose cosine is ½ is 60 degrees
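A minimal Python sketch reproducing the computation above (the function name cosine_angle_deg is illustrative):

```python
import math

def cosine_angle_deg(x, y):
    """Cosine distance expressed as the angle (in degrees) between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (norm_x * norm_y)))

print(cosine_angle_deg([1, 2, -1], [2, 1, 1]))  # ~60.0
```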
Cosine Distance as a Distance Measure
• Non-negativity: defined so that values lie in [0°, 90°] (only vectors with non-negative components are considered), so distances are never negative
• d(x,y) ≥ 0 (non-negativity)
• Identity: two vectors have angle 0 if and only if they point in the same direction
• Treat vectors that are multiples of one another as the same direction
• d(x,y) = 0 iff x = y (identity)
• Symmetry: the angle between x and y is the same as the angle between y and x
• d(x,y) = d(y,x) (symmetry)
Distance between strings
Apparent vs Apparrent
Hear vs Here
Sea vs See
• Approximate matching
Why Edit Distance Is a Distance Metric
● Triangle inequality:
○ Changing x to z and then z to y is one way to change x to y, so d(x,y) ≤ d(x,z) + d(z,y) (see the sketch below).
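For reference, a minimal Python sketch of edit distance in its Levenshtein variant (allowing substitutions as well as insertions and deletions; some treatments count insertions and deletions only):

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # match / substitute
        prev = cur
    return prev[-1]

print(edit_distance("apparent", "apparrent"))  # 1
print(edit_distance("hear", "here"))           # 2
print(edit_distance("sea", "see"))             # 1
```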
Unsupervised Measures: Cohesion and Separation
[Figure: cohesion measures how closely related the objects within a cluster are; separation measures how distinct a cluster is from the other clusters]
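One common way to make these measures concrete (an assumption here, since the slide details did not survive extraction) is cohesion as the within-cluster sum of squared errors (SSE) and separation as the between-group sum of squares (SSB). A minimal Python sketch:

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cohesion(clusters):
    """Within-cluster SSE: squared distance of every point to its own
    cluster's centroid (lower = more cohesive)."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def separation(clusters):
    """Between-group sum of squares: each cluster's size times the squared
    distance of its centroid from the overall centroid (higher = better separated)."""
    everything = [p for c in clusters for p in c]
    m = centroid(everything)
    return sum(len(c) * sq_dist(centroid(c), m) for c in clusters)
```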
Silhouette Coefficient
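The standard definition: for a point, let a be its average distance to the other points of its own cluster (cohesion) and b the lowest average distance to the points of any other cluster (separation); then s = (b − a) / max(a, b), which lies in [−1, 1], with values near 1 indicating a well-placed point. A minimal Python sketch (assuming the point's own cluster has at least two members):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(point, own_cluster, other_clusters):
    """s = (b - a) / max(a, b): a = mean distance to the rest of the point's
    own cluster, b = mean distance to the nearest other cluster."""
    rest = [p for p in own_cluster if p is not point]  # exclude the point itself
    a = sum(dist(point, p) for p in rest) / len(rest)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)
```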
k-means clustering
k–means Algorithm(s)
K-Means basics
Another Algorithm’s Variation
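A minimal Python sketch of the basic iteration (Lloyd's algorithm): assign each point to its nearest center, recompute each center as the mean of its assigned points, and repeat until the centers stabilize. The function name and defaults are illustrative:

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means on a list of coordinate tuples."""
    random.seed(seed)
    centers = random.sample(points, k)          # initialize with k random points
    for _ in range(iters):
        # Assign: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # Re-center: each center moves to the mean of its cluster.
        new_centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:              # converged: centers are stable
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(2, 1), (1, 7), (8, 1), (2, 5), (4, 6), (9, 0), (10, 2)], k=3)
```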
K-means with MapReduce
○ Assign
■ Assign each point to its closest cluster center.
■ Performed during the map function.
○ Re-center
■ Revise cluster centers by computing the mean of the assigned points.
■ Performed during the reduce function.
Map function
● Assign points to the closest cluster center
● Each mapper gets:
○ A fraction of the points to be clustered
○ The current cluster centers
● For each point xi, find its closest center Ci, then:
■ Emit(Ci, xi)
Reduce function
● Recalculate centers.
● For each Reduce Task:
○ Get a current center and the list of its closest points:
(C, [x1, x2, …])
○ C: a current cluster center.
○ Emit the new center: the mean of the points in the list (a sketch of both functions follows below).
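A minimal Python sketch of one iteration expressed as map and reduce functions, matching the Emit(Ci, xi) convention above (function names are illustrative):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_fn(point_chunk, centers):
    """Map: each mapper sees a fraction of the points plus the current
    centers (a dict {id: coords}); for each point x it emits
    (id of closest center Ci, x)."""
    for x in point_chunk:
        ci = min(centers, key=lambda c: dist(x, centers[c]))
        yield (ci, x)

def reduce_fn(ci, xs):
    """Reduce: given (C, [x1, x2, ...]), emit the revised center for C,
    i.e. the mean of the assigned points."""
    n = len(xs)
    return (ci, tuple(sum(v) / n for v in zip(*xs)))
```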
• Points:
• S1 (2,1)
• S2 (1,7)
• S3 (8,1)
• S4 (2,5)
• S5 (4,6)
• S6 (9,0)
• S7 (10,2)
• Initial Centers:
• C1 = S1, C2 = S2, C3 = S6
[Figure: the seven points S1–S7 plotted in the plane]
Map Processes
• We have two Map processes: M1 and M2
• Also, a single Reduce process: R
• M1 processes points:
• S1 (2,1)
• S2 (1,7)
• S3 (8,1)
• M2 processes points:
• S4 (2,5)
• S5 (4,6)
• S6 (9,0)
• S7 (10,2)
• Initial Centers: C1 = S1, C2 = S2, C3 = S6
[Figure: the same points plotted, partitioned between M1 and M2]
Calculations at M1 using Euclidean Distance:
• S1
• d(S1, C1) = 0
• d(S1, C2) = SQRT((2-1)² + (1-7)²) = SQRT(1+36) = SQRT(37)
• d(S1, C3) = SQRT((2-9)² + (1-0)²) = SQRT(49+1) = SQRT(50)
• S2
• d(S2, C1) = d(S1, S2) = SQRT((2-1)² + (1-7)²) = SQRT(1+36) = SQRT(37)
• d(S2, C2) = 0
• d(S2, C3) = SQRT((1-9)² + (7-0)²) = SQRT(64+49) = SQRT(113)
• S3
• d(S3, C1) = SQRT((2-8)² + (1-1)²) = SQRT(36+0) = SQRT(36) = 6
• d(S3, C2) = SQRT((1-8)² + (7-1)²) = SQRT(49+36) = SQRT(85)
• d(S3, C3) = SQRT((9-8)² + (0-1)²) = SQRT(1+1) = SQRT(2)
Intermediate key-value pairs produced at M1
• C1 : [S1]
• XC1 = X1 = 2
• YC1 = Y1 = 1
• C2 : [S2]
• XC2 = X2 = 1
• YC2 = Y2 = 7
• C3 : [S3]
• XC3 = X3 = 8
• YC3 = Y3 = 1
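Running the same logic over all seven points (both map tasks merged for brevity) reproduces the assignments above and yields the re-centered means. A minimal, self-contained sketch:

```python
import math
from collections import defaultdict

points = {'S1': (2, 1), 'S2': (1, 7), 'S3': (8, 1), 'S4': (2, 5),
          'S5': (4, 6), 'S6': (9, 0), 'S7': (10, 2)}
centers = {'C1': points['S1'], 'C2': points['S2'], 'C3': points['S6']}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Map phase (M1 handles S1-S3, M2 handles S4-S7; shown together here).
groups = defaultdict(list)
for name, p in points.items():
    groups[min(centers, key=lambda c: dist(p, centers[c]))].append(p)

# Reduce phase: the new center is the mean of the assigned points.
for c, pts in sorted(groups.items()):
    print(c, tuple(round(sum(v) / len(pts), 2) for v in zip(*pts)))
# C1 (2.0, 1.0)    <- {S1}
# C2 (2.33, 6.0)   <- {S2, S4, S5}
# C3 (9.0, 1.0)    <- {S3, S6, S7}
```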
BFR Algorithm
• Points are read from disk one main-memory-full at a time
• Most points from previous memory loads are summarized by simple statistics
• To begin, from the initial load we select the initial k centroids by some sensible approach:
• Take k random points
• Take a small random sample and cluster optimally
• Take a sample; pick a random point, and then k−1 more points, each as far from the previously selected points as possible (see the sketch below)
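A minimal Python sketch of the third initialization option (pick a random point from the sample, then repeatedly take the sample point farthest from every center chosen so far); names are illustrative:

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(sample, k, seed=0):
    """Pick a random point, then k-1 more, each maximizing its distance
    to the nearest already-chosen center."""
    random.seed(seed)
    centers = [random.choice(sample)]
    while len(centers) < k:
        centers.append(max(sample,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers
```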
Three Classes of Points
3 sets of points which we keep track of:
• Discard set (DS):
• Points close enough to a centroid to be summarized
• Compression set (CS):
• Groups of points that are close together but not close to any centroid; they are summarized, but not assigned to a cluster
• Retained set (RS):
• Isolated points, kept in memory until they can be assigned
BFR: “Galaxies” Picture
[Figure: a cluster, with all its points in the DS and its centroid marked; compressed sets, whose points are in the CS; isolated points in the RS]
Summarizing Points: Comments
• 2d + 1 values represent a cluster of any size
• d = number of dimensions
• The 2d + 1 values are N (the number of points), the d components of SUM (per-dimension sums), and the d components of SUMSQ (per-dimension sums of squares)
• Average in each dimension (the centroid) can be calculated as SUMi / N
• SUMi = ith component of SUM
• Variance of a cluster’s discard set in dimension i is: (SUMSQi / N) − (SUMi / N)² (see the sketch after this list)
• And the standard deviation σ is the square root of the variance
• Next step: Actual clustering
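A minimal Python sketch of the 2d + 1 summary and the centroid/variance formulas above (function names are illustrative):

```python
def summarize(points):
    """BFR summary of d-dimensional points: N, SUM, SUMSQ (2d + 1 numbers)."""
    n, d = len(points), len(points[0])
    s = [sum(p[i] for p in points) for i in range(d)]
    sq = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return n, s, sq

def centroid_and_variance(n, s, sq):
    """centroid_i = SUM_i / N;  variance_i = SUMSQ_i / N - (SUM_i / N)**2."""
    centroid = [si / n for si in s]
    variance = [qi / n - (si / n) ** 2 for si, qi in zip(s, sq)]
    return centroid, variance

# e.g. summarize([(2, 1), (1, 7)]) -> (2, [3, 8], [5, 50]);
# from that, centroid (1.5, 4.0) and variance (0.25, 9.0).
```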
The “Memory-Load” of Points
Processing the “Memory-Load” of points (1):
• 1) Find those points that are “sufficiently close” to a cluster centroid; add those points to that cluster and to the DS
• These points are so close to the centroid that they can be summarized and then discarded
• 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
• Clusters go to the CS; outlying points to the RS
How Close is Close Enough?
Mahalanobis Distance
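The slide's details did not survive extraction; in the BFR setting, the relevant (axis-aligned) form normalizes each dimension by that cluster's standard deviation: d(x, c) = SQRT( Σᵢ ((xᵢ − cᵢ) / σᵢ)² ). A point is then typically admitted to a cluster's DS when this distance falls below a modest threshold (e.g., around 3, i.e. roughly three standard deviations). A minimal sketch:

```python
import math

def mahalanobis(x, centroid, std):
    """Axis-aligned Mahalanobis distance: normalize each dimension by
    that dimension's standard deviation before measuring."""
    return math.sqrt(sum(((xi - ci) / si) ** 2
                         for xi, ci, si in zip(x, centroid, std)))

# A common admission rule (threshold is a tunable assumption):
# add x to the cluster's DS if mahalanobis(x, centroid, std) < 3.
```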
Should 2 CS clusters be combined?
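One natural answer, exploiting the fact that (N, SUM, SUMSQ) summaries are additive: tentatively add the two summaries and combine the clusters only if the variance of the merged cluster stays below a chosen threshold in every dimension. A minimal sketch under that assumption:

```python
def merge_summaries(s1, s2):
    """(N, SUM, SUMSQ) summaries are additive, so a tentative merge of
    two CS clusters costs only O(d)."""
    n1, sum1, sq1 = s1
    n2, sum2, sq2 = s2
    return (n1 + n2,
            [a + b for a, b in zip(sum1, sum2)],
            [a + b for a, b in zip(sq1, sq2)])

def variances(summary):
    """Per-dimension variance: SUMSQ_i / N - (SUM_i / N)**2."""
    n, s, sq = summary
    return [qi / n - (si / n) ** 2 for si, qi in zip(s, sq)]

def should_combine(s1, s2, threshold):
    """Heuristic: combine if the merged cluster stays tight in every dimension."""
    return all(v < threshold for v in variances(merge_summaries(s1, s2)))
```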