
Tutorial Meeting #3

Data Science and Machine Learning

Clustering
(Chapter 7 of MMDS book)

DAMA60 Group
School of Science and Technology
Hellenic Open University
Contents

• Preliminaries
  • The problem of clustering
  • Distance measures and metrics
  • Cluster metrics
• Methods and Algorithms
  • K-means (revision)
  • K-means & MapReduce (application)
  • BFR: K-means-based algorithm for big data
High Dimensional Data
• Given a cloud of data points we want to understand its structure

The Problem of Clustering

• Given a set of points, with a notion of distance between points, group the
  points into some number of clusters, so that:
  • Members of a cluster are close / similar to each other
  • Members of different clusters are dissimilar
• Usually:
  • Points are in a high-dimensional space
  • Similarity is defined using a distance measure
    • Euclidean, Cosine, Jaccard, edit distance, etc.
Distance
● Numerical measure of how different two data objects are
  ○ A function that maps pairs of objects to real values
  ○ Lower when objects are more alike
  ○ Higher when two objects are different
● The minimum distance is 0, when comparing an object with itself.
● The upper limit varies.
Distance Metric
A distance function d is a distance metric if it is a function from pairs of
objects to real numbers such that:
1. d(x,y) ≥ 0 (non-negativity)
2. d(x,y) = 0 if and only if x = y (identity)
3. d(x,y) = d(y,x) (symmetry)
4. d(x,y) ≤ d(x,z) + d(y,z) (triangle inequality)

● The triangle inequality guarantees that the distance function is well-behaved.
  ○ The direct connection is the shortest distance.
● It is also useful for proving properties about the data.
Distances for real vectors

● Vectors x = (x1, ..., xd) and y = (y1, ..., yd)

● Lp-norms or Minkowski distance:
  Lp(x, y) = ( |x1 − y1|^p + ... + |xd − yd|^p )^(1/p)

● L2-norm: Euclidean distance:
  L2(x, y) = sqrt( |x1 − y1|^2 + ... + |xd − yd|^2 )

● L1-norm: Manhattan distance:
  L1(x, y) = |x1 − y1| + ... + |xd − yd|

● L∞-norm:
  L∞(x, y) = max{ |x1 − y1|, ..., |xd − yd| }
  ● The limit of Lp as p goes to infinity.

● Minkowski distances are metrics.
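A minimal Python sketch of these distance functions (the function names are illustrative, not from the slides):

```python
import numpy as np

def minkowski(x, y, p):
    """Lp (Minkowski) distance between two real-valued vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def euclidean(x, y):   # L2 norm
    return minkowski(x, y, 2)

def manhattan(x, y):   # L1 norm
    return minkowski(x, y, 1)

def chebyshev(x, y):   # L-infinity norm: the limit of Lp as p grows
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.max(diff))

# The worked example on the next slide: x = (5, 5), y = (9, 8)
print(euclidean((5, 5), (9, 8)))   # 5.0
print(manhattan((5, 5), (9, 8)))   # 7.0
print(chebyshev((5, 5), (9, 8)))   # 4.0
```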
Example of Distances

x = (5, 5), y = (9, 8)

Distance using the L2 norm:
  L2(x, y) = sqrt( |5 − 9|^2 + |5 − 8|^2 ) = sqrt( 4^2 + 3^2 ) = 5

Distance using the L1 norm:
  L1(x, y) = |5 − 9| + |5 − 8| = 4 + 3 = 7

Distance using the L∞ norm:
  L∞(x, y) = max{ |5 − 9|, |5 − 8| } = max{ 4, 3 } = 4

(Figure: x and y joined by a right triangle with legs 4 and 3; the hypotenuse,
of length 5, is the Euclidean distance.)
Geometrical Properties of Distances

Green: all points y with L1(x, y) = r (a diamond around x)

Blue: all points y with L2(x, y) = r (a circle around x)

Red: all points y with L∞(x, y) = r (a square around x)


Cosine Distance
• Applicable to spaces that have dimensions, including Euclidean spaces.
• We do not distinguish between a vector and a multiple of that vector.
• The cosine distance between two points is the angle that the vectors to those points make.
  • The angle will be in the range 0 to 180 degrees, regardless of how many dimensions the space has.
• Calculation:
  • Compute the cosine of the angle.
  • Apply the arc-cosine function to translate it to an angle in [0, 180] degrees.
x y
cos( x, y ) =  
x  y
 
x  y = x1 , x2 ,..., xn   y1 , y2 ,..., yn  = i =1 xi yi
n


 
n n
x = xi − 0 =
2 2
i =1
x
i =1 i

 ||.|| : L2 Norm
 
n n
y = yi − 0 =
2 2
i =1
y
i =1 i

Dividing by the norms in the denominator scales both vectors to unit length.
This normalization is important so that we can compare vectors of different
magnitudes.
Cosine Distance Example Calculation
 
x y
cos( x, y ) =  
x  y
 
x  y = x1 , x2 ,..., xn   y1 , y2 ,..., yn  = i =1 xi yi
n


 
n n
x = xi − 0 =
2 2
i =1
x
i =1 i


 
n n
y = yi − 0 =
2 2
i =1
y
i =1 i

• x = [1, 2, -1], y = [2, 1, 1]
• The dot product x · y = 1*2 + 2*1 + (-1)*1 = 3
• ||x|| = SQRT(1^2 + 2^2 + (-1)^2) = SQRT(6)
• ||y|| = SQRT(2^2 + 1^2 + 1^2) = SQRT(6)
• cos(x, y) = 3 / (SQRT(6) * SQRT(6)) = 3/6 = ½
• The angle whose cosine is ½ is 60 degrees
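A small Python sketch of this computation (the helper name `cosine_distance` is illustrative):

```python
import numpy as np

def cosine_distance(x, y):
    """Angle between two vectors, in degrees (0 to 180)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cos_xy = np.clip(cos_xy, -1.0, 1.0)          # guard against rounding error
    return float(np.degrees(np.arccos(cos_xy)))

print(cosine_distance([1, 2, -1], [2, 1, 1]))    # 60.0, as in the example
```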

Cosine Distance as a Distance Measure
• Non-negativity: the angle is defined to lie in [0, 180] degrees, so only
  non-negative distances arise.
  • d(x,y) ≥ 0 (non-negativity)

• Identity: two vectors have angle 0 if and only if they point in the same direction.
  • Treat vectors that are multiples of one another as the same direction.
  • d(x,y) = 0 iff x = y (identity)

• Symmetry: the angle between x and y is the same as the angle between y and x.
  • d(x,y) = d(y,x) (symmetry)

• Triangle inequality: physical reasoning.
  • One way to rotate from x to y is to rotate to z and then to y.
  • The sum of those two rotations cannot be less than the rotation directly
    from x to y.
  • d(x,y) ≤ d(x,z) + d(y,z) (triangle inequality)
Distance between strings

• How do we define similarity between strings?

Apparent vs Apparrent
Hear vs Here
Sea vs See

• Important in various domains:


• Recognizing and correcting typing errors

• Approximate matching

• Analyzing DNA sequences


Edit Distance for strings
• The edit distance of two strings is the minimum number of
operations on characters needed to turn one string into the
other.
• Different types of edit distance allow different sets of string
operations
• Levenshtein distance assumes insertion, deletion and
substitution

• Example: x = ebook; y = bucks. Turn x into y:


• Delete e
• Substitute o with u
• Substitute o with c
• Insert s
• Edit distance = 4.
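A minimal Python sketch of Levenshtein distance via dynamic programming (not part of the slides; the function name is illustrative):

```python
def levenshtein(x, y):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string x into string y."""
    prev = list(range(len(y) + 1))              # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                              # deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,        # delete cx
                            curr[j - 1] + 1,    # insert cy
                            prev[j - 1] + cost))# substitute (or keep) cx -> cy
        prev = curr
    return prev[-1]

print(levenshtein("ebook", "bucks"))   # 4, as in the example
```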

Why Edit Distance Is a Distance Metric

● d(x,x) = 0 because 0 edits suffice.

● d(x,y) = d(y,x) because insertions and deletions are inverses of each other.

● d(x,y) ≥ 0: there is no notion of negative edits.

● Triangle inequality:
  ○ Changing x to z and then z to y is one way to change x to y,
    so d(x,y) cannot exceed d(x,z) + d(z,y).
Unsupervised Measures: Cohesion and Separation

• A proximity graph-based approach can also be used for cohesion and separation
  (a small code sketch is given below).
  • Cluster cohesion is the sum of the weights of all links within a cluster.
  • Cluster separation is the sum of the weights of links between nodes in the
    cluster and nodes outside the cluster.

(Figure: cohesion counts edges within a cluster; separation counts edges
between clusters.)
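A small sketch of these two quantities, assuming the proximity graph is given as a symmetric weight matrix and cluster membership as one label per node (all names are illustrative):

```python
import numpy as np

def cohesion_and_separation(weights, labels, cluster):
    """Cohesion: total weight of links inside `cluster`.
    Separation: total weight of links from `cluster` to the rest."""
    weights = np.asarray(weights, dtype=float)   # symmetric proximity matrix
    labels = np.asarray(labels)
    inside = labels == cluster
    cohesion = weights[np.ix_(inside, inside)].sum() / 2.0   # each link counted once
    separation = weights[np.ix_(inside, ~inside)].sum()
    return cohesion, separation

# Toy example: 4 points, two clusters labeled 0 and 1
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 2],
              [1, 0, 0, 4],
              [0, 2, 4, 0]], dtype=float)
print(cohesion_and_separation(W, [0, 0, 1, 1], 0))   # (5.0, 3.0)
```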
Unsupervised Measures: Cohesion and Separation
Silhouette Coefficient

• The silhouette coefficient combines ideas of both cohesion and separation,
  but for individual points, as well as for clusters and clusterings.
• For an individual point i:
  • Calculate a = average distance of i to the points in its own cluster
  • Calculate b = min (average distance of i to the points in each other cluster)
  • The silhouette coefficient for the point is then
    s = (b – a) / max(a, b)
  • The value can vary between -1 and 1, but typically ranges between 0 and 1.
  • The closer to 1, the better.

(Figure: distances from i to points in its own cluster are used for a;
distances from i to points in the nearest other cluster are used for b.)
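As a sketch, the silhouette coefficient of a single point can be computed directly from a distance matrix (illustrative code, not from the slides):

```python
import numpy as np

def silhouette_point(dist, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for point i,
    given a full pairwise distance matrix and cluster labels.
    Assumes the point's own cluster has at least two members."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    own = (labels == labels[i])
    own[i] = False                                   # exclude the point itself
    a = dist[i, own].mean()                          # cohesion: own-cluster average
    b = min(dist[i, labels == c].mean()              # separation: nearest other cluster
            for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)
```

If scikit-learn is available, `sklearn.metrics.silhouette_score(X, labels)` returns the same measure averaged over all points.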
k-means clustering
k–means Algorithm(s)

K-Means basics

(Figure: another variation of the algorithm.)
K-means with MapReduce

● K-means has two phases. These are implemented as


follows:
○ Classify
■ Assign points to closest cluster center.

■ Performed during map function.

○ Re-center
■ Revise cluster centers by computing the mean of

assigned points.
■ Performed during reduce function.

Map function
● Assign points to the closest cluster center (a sketch of the mapper is given below).
● Each mapper gets:
  ○ A fraction of the points to be clustered
  ○ All cluster centers
● For each point:
  ○ Compute the distance from each cluster center.
  ○ Emit(Ci, xi)
    ● xi: the point being examined
    ● Ci: the closest center to xi
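A minimal Python stand-in for such a map function (names and structure are illustrative, not a real MapReduce framework):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def kmeans_map(points, centers):
    """Classify step: emit (closest_center_id, point) for each point.
    `centers` is a dict {center_id: coordinates} shared by all mappers."""
    for x in points:
        ci = min(centers, key=lambda c: euclidean(x, centers[c]))
        yield (ci, x)   # Emit(Ci, xi)

# M1's share of the worked example below: S1, S2, S3 with C1=S1, C2=S2, C3=S6
centers = {"C1": (2, 1), "C2": (1, 7), "C3": (9, 0)}
print(list(kmeans_map([(2, 1), (1, 7), (8, 1)], centers)))
# [('C1', (2, 1)), ('C2', (1, 7)), ('C3', (8, 1))]
```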

Reduce function
● Recalculate the centers (a sketch follows below).
● Each Reduce task:
  ○ Gets a current center and the list of points assigned to it:
    (C, [x1, x2, …])
    ○ C: a current cluster center
    ○ [x1, x2, …]: the points closest to C
  ○ Calculates the new cluster center:
    ○ For each dimension, compute the average (or median) of the assigned points
  ○ Emits (Cluster_Center_id, Cluster_Center_Coordinates)
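A matching sketch of the reduce step, using the mean of each dimension (again illustrative):

```python
def kmeans_reduce(center_id, points):
    """Re-center step: average the assigned points dimension by dimension."""
    n = len(points)
    new_center = tuple(sum(coords) / n for coords in zip(*points))
    return (center_id, new_center)   # emit(Cluster_Center_id, Cluster_Center_Coordinates)

# C2's group in the worked example that follows: S2, S4, S5
print(kmeans_reduce("C2", [(1, 7), (2, 5), (4, 6)]))   # ('C2', (2.333..., 6.0))
```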
Classify:

• Points:
  • S1 (2,1), S2 (1,7), S3 (8,1), S4 (2,5), S5 (4,6), S6 (9,0), S7 (10,2)
• Initial Centers:
  • C1 = S1, C2 = S2, C3 = S6

(Figure: the seven points plotted in the plane.)
Map Processes
• We have two Map processes, M1 and M2, and a single Reduce process, R.
• M1 processes points S1 (2,1), S2 (1,7), S3 (8,1)
• M2 processes points S4 (2,5), S5 (4,6), S6 (9,0), S7 (10,2)
• Initial Centers: C1 = S1, C2 = S2, C3 = S6
Calculations at M1 using Euclidean Distance:
• S1
  • d(S1, C1) = 0
  • d(S1, C2) = SQRT((2-1)^2 + (1-7)^2) = SQRT(1+36) = SQRT(37)
  • d(S1, C3) = SQRT((2-9)^2 + (1-0)^2) = SQRT(49+1) = SQRT(50)
• S2
  • d(S2, C1) = d(S1, S2) = SQRT((2-1)^2 + (1-7)^2) = SQRT(1+36) = SQRT(37)
  • d(S2, C2) = 0
  • d(S2, C3) = SQRT((1-9)^2 + (7-0)^2) = SQRT(64+49) = SQRT(113)
• S3
  • d(S3, C1) = SQRT((2-8)^2 + (1-1)^2) = SQRT(36+0) = SQRT(36)
  • d(S3, C2) = SQRT((1-8)^2 + (7-1)^2) = SQRT(49+36) = SQRT(85)
  • d(S3, C3) = SQRT((9-8)^2 + (0-1)^2) = SQRT(1+1) = SQRT(2)
Intermediate key-value pairs produced at M1

•For each point:


• Key: Closest center to the Point
• Value: The Point

•Key value pairs produced:


•(C1, S1)
•(C2, S2)
•(C3, S3)
Calculations at M2 using Euclidean Distance:
• S4
  • d(S4, C1) = SQRT((2-2)^2 + (1-5)^2) = SQRT(0+16) = SQRT(16)
  • d(S4, C2) = SQRT((2-1)^2 + (7-5)^2) = SQRT(1+4) = SQRT(5)
  • d(S4, C3) = SQRT((2-9)^2 + (0-5)^2) = SQRT(49+25) = SQRT(74)
• S5
  • d(S5, C1) = SQRT((2-4)^2 + (1-6)^2) = SQRT(4+25) = SQRT(29)
  • d(S5, C2) = SQRT((1-4)^2 + (7-6)^2) = SQRT(9+1) = SQRT(10)
  • d(S5, C3) = SQRT((9-4)^2 + (0-6)^2) = SQRT(25+36) = SQRT(61)
• S6
  • d(S6, C1) = SQRT((2-9)^2 + (1-0)^2) = SQRT(49+1) = SQRT(50)
  • d(S6, C2) = SQRT((1-9)^2 + (7-0)^2) = SQRT(64+49) = SQRT(113)
  • d(S6, C3) = 0
• S7
  • d(S7, C1) = SQRT((2-10)^2 + (1-2)^2) = SQRT(64+1) = SQRT(65)
  • d(S7, C2) = SQRT((1-10)^2 + (7-2)^2) = SQRT(81+25) = SQRT(106)
  • d(S7, C3) = SQRT((10-9)^2 + (0-2)^2) = SQRT(1+4) = SQRT(5)
Intermediate key-value pairs produced at M2

•For each point:


• Key: Closest center to the Point
• Value: The Point

•Key value pairs produced:


•(C2, S4)
•(C2, S5)
•(C3, S6)
•(C3, S7)
Reduce step: Evaluate new centers

• C1: [S1]
  • XC1 = X1 = 2
  • YC1 = Y1 = 1

• C2: [S2, S4, S5]
  • XC2 = (X2+X4+X5)/3 = (1+2+4)/3 = 7/3
  • YC2 = (Y2+Y4+Y5)/3 = (7+5+6)/3 = 18/3 = 6

• C3: [S3, S6, S7]
  • XC3 = (X3+X6+X7)/3 = (8+9+10)/3 = 27/3 = 9
  • YC3 = (Y3+Y6+Y7)/3 = (1+0+2)/3 = 3/3 = 1
Reduce step: Final key value pairs
• (1, (2, 1))
• (2, (7/3, 6))
• (3, (9, 1))
The BFR Algorithm

Extension of k-means to large data


BFR Algorithm

BFR Algorithm
• Points are read from disk one main-memory-full at a time.
• Most points from previous memory loads are summarized by simple statistics.
• To begin, from the initial load we select the initial k centroids by some
  sensible approach:
  • Take k random points
  • Take a small random sample and cluster it optimally
  • Take a sample; pick a random point, and then k–1 more points, each as far
    from the previously selected points as possible
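A small sketch of the third initialization idea (pick a random point, then repeatedly pick the point farthest from those already chosen); the names are illustrative:

```python
import math
import random

def farthest_first_centroids(sample, k, seed=0):
    """Pick k initial centroids from a sample: a random first point, then each
    next point as far as possible from the centroids chosen so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(sample)]
    while len(centroids) < k:
        # Already-chosen points have distance 0, so they are never re-picked
        # unless every remaining distance is also 0.
        nxt = max(sample, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

print(farthest_first_centroids([(2, 1), (1, 7), (8, 1), (2, 5), (4, 6), (9, 0), (10, 2)], 3))
```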
Three Classes of Points
3 sets of points which we keep track of:
•Discard set (DS):
• Points close enough to a centroid to be
summarized

•Compression set (CS):


• Groups of points that are close together but not
close to any existing centroid
• These points are summarized, but not assigned
to a cluster

•Retained set (RS):


• Isolated points waiting to be assigned to a
compression set

BFR: “Galaxies” Picture
(Figure: isolated points in the RS; compressed sets, whose points are in the CS;
a cluster with its centroid, whose points are in the DS.)

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Summarizing Sets of Points
For each cluster, the discard set (DS) is summarized
by:
• The number of points, N
• The vector SUM, whose ith component is the sum of
the coordinates of the points in the ith dimension
• The vector SUMSQ: ith component = sum of squares
of coordinates in ith dimension

(Figure: a cluster whose points are all in the DS, together with its centroid.)
Summarizing Points: Comments
• 2d + 1 values represent a cluster of any size
  • d = number of dimensions
• The average in each dimension (the centroid) can be calculated as SUMi / N
  • SUMi = ith component of SUM
• The variance of a cluster’s discard set in dimension i is (SUMSQi / N) – (SUMi / N)^2
  • The standard deviation σ is the square root of the variance
• Next step: actual clustering

Note: Dropping the “axis-aligned” clusters assumption would require storing a
full covariance matrix to summarize the cluster. Instead of SUMSQ being a
d-dimensional vector, it would be a d × d matrix, which is too big!
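A minimal sketch of this (N, SUM, SUMSQ) summary and the derived centroid and variance (class and method names are illustrative):

```python
import numpy as np

class DiscardSetSummary:
    """Summarize a cluster's discard set with 2d + 1 values: N, SUM, SUMSQ."""
    def __init__(self, d):
        self.n = 0
        self.sum = np.zeros(d)      # per-dimension sum of coordinates
        self.sumsq = np.zeros(d)    # per-dimension sum of squared coordinates

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.sum += p
        self.sumsq += p ** 2

    def centroid(self):
        return self.sum / self.n                           # SUM_i / N

    def variance(self):
        return self.sumsq / self.n - self.centroid() ** 2  # SUMSQ_i/N - (SUM_i/N)^2

ds = DiscardSetSummary(2)
for p in [(2, 1), (4, 3), (3, 2)]:
    ds.add(p)
print(ds.centroid())            # [3. 2.]
print(np.sqrt(ds.variance()))   # per-dimension standard deviation sigma
```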

The “Memory-Load” of Points
Processing the “Memory-Load” of points (1):
• 1) Find those points that are “sufficiently close” to a cluster centroid and
  add those points to that cluster and to the DS.
  • These points are so close to the centroid that they can be summarized and
    then discarded.
• 2) Use any main-memory clustering algorithm to cluster the remaining points
  and the old RS.
  • Clusters go to the CS; outlying points go to the RS.

Discard set (DS): Close enough to a centroid to be summarized.


Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points

The “Memory-Load” of Points

Discard set (DS): Close enough to a centroid to be summarized.


Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points

BFR: “Galaxies” Picture
(Figure: isolated points in the RS; compressed sets, whose points are in the CS;
a cluster with its centroid, whose points are in the DS.)

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
A Few Details…

How Close is Close Enough?

Mahalanobis Distance

σi … the standard deviation of the points in the cluster in the ith dimension
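A sketch of this normalized Mahalanobis distance, i.e. the Euclidean distance after dividing each dimension by the cluster's standard deviation in that dimension; the threshold in the usage line is an assumed example, not taken from the slides:

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    """Distance of `point` from `centroid`, with each dimension normalized
    by the cluster's standard deviation sigma_i in that dimension."""
    point, centroid, sigma = (np.asarray(a, dtype=float) for a in (point, centroid, sigma))
    return float(np.sqrt(np.sum(((point - centroid) / sigma) ** 2)))

# Example decision rule (threshold chosen only for illustration):
d = mahalanobis([3.0, 2.5], [3.0, 2.0], [0.8, 0.8])
print(d, d < 3.0)   # accept the point into the cluster's DS if it is close enough
```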
Mahalanobis Distance

Should 2 CS clusters be combined?

