Data Mining - Clustering

What is Cluster Analysis?

 Cluster: A collection of data objects
   similar (or related) to one another within the same group
   dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
   Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
 Typical applications
   As a stand-alone tool to get insight into data distribution
   As a preprocessing step for other algorithms
Quality: What Is Good Clustering?

 A good clustering method will produce high quality clusters
   high intra-class similarity: cohesive within clusters (intra-cluster distances are minimized)
   low inter-class similarity: distinctive between clusters (inter-cluster distances are maximized)
 The quality of a clustering method depends on
   the similarity measure used by the method
   its implementation, and
   its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering

 Dissimilarity/Similarity metric
   Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
   The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
   Weights should be associated with different variables based on applications and data semantics
 Quality of clustering:
   There is usually a separate "quality" function that measures the "goodness" of a cluster.
   It is hard to define "similar enough" or "good enough"
   ○ The answer is typically highly subjective
Requirements and Challenges

 Scalability
   Clustering all the data instead of only on samples
 Ability to deal with different types of attributes
   Numerical, binary, categorical, ordinal, linked, and mixtures of these
 Constraint-based clustering
   ○ User may give inputs on constraints
   ○ Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
   Discovery of clusters with arbitrary shape
   Ability to deal with noisy data
   Incremental clustering and insensitivity to input order
   High dimensionality
Major Clustering Approaches (I)

 Partitioning approach:
   Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
   Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
   Create a hierarchical decomposition of the set of data (or objects) using some criterion
   A set of nested clusters organized as a hierarchical tree
   Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 Density-based approach:
   Based on connectivity and density functions
   Typical methods: DBSCAN, OPTICS, DenClue
Partitional Clustering

[Figure: a set of original points and a partitional clustering of those points]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering of the same points with its dendrogram]


Other Distinctions Between Sets of Clusters

 Exclusive versus non-exclusive
   In non-exclusive clusterings, points may belong to multiple clusters.
   Can represent multiple classes or 'border' points
 Fuzzy versus non-fuzzy
   In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
   Weights must sum to 1
   Probabilistic clustering has similar characteristics
 Partial versus complete
   In some cases, we only want to cluster some of the data
 Heterogeneous versus homogeneous
   Clusters of widely different sizes, shapes, and densities
Partitioning Algorithms: Basic Concept

 Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

   E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
   Global optimal: exhaustively enumerate all partitions
   Heuristic methods: k-means and k-medoids algorithms
   k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster
   k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
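 As a small illustration, the minimal sketch below (helper names are assumptions, not part of the original material) computes this squared-error criterion E for a candidate partition, using the mean of each cluster as its centroid:

# Minimal sketch (assumed example): the sum-of-squared-distances criterion E
# for a given partition, with the mean of each cluster taken as its centroid.

def sq_dist(p, c):
    """Squared Euclidean distance between two points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def centroid(cluster):
    """Mean point of a cluster (list of points)."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]

def sse(partition):
    """E = sum over clusters C_i, sum over p in C_i, of (p - c_i)^2."""
    return sum(sq_dist(p, centroid(cluster))
               for cluster in partition
               for p in cluster)

# Example with two 1-D clusters: each contributes 1 + 0 + 1 = 2, so E = 4.
print(sse([[(1,), (2,), (3,)], [(8,), (9,), (10,)]]))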
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
   1. Partition objects into k nonempty subsets
   2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
   3. Assign each object to the cluster with the nearest seed point
   4. Go back to Step 2; stop when the assignment does not change
K-means Clustering

 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple (a sketch follows the example below)

An Example of K-Means Clustering (K = 2)

[Figure: the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned to the nearest centroid, and the loop repeats if needed]

 Partition objects into k nonempty subsets
 Repeat
   Compute centroid (i.e., mean point) for each partition
   Assign each object to the cluster of its nearest centroid
 Until no change
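 A compact Python sketch of this basic loop is shown below. It is a minimal illustration (function and variable names are assumptions), using Euclidean distance and random initial centroids:

# Minimal k-means sketch (assumed illustration): repeat "assign each point to
# its nearest centroid, then recompute centroids" until the centroids settle.
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # arbitrary initial centroids
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)]
print(kmeans(pts, 2))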
Working Example

 Cluster the following eight points (with (x, y) representing locations) into three clusters:
   A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
 Let's assume the initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
 The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
   ρ(a, b) = |x2 – x1| + |y2 – y1|
Working Example: Iteration I

Given point    Distance from C1 (2, 10)   Distance from C2 (5, 8)   Distance from C3 (1, 2)   Belongs to cluster
A1 (2, 10)              0                          5                         9                       C1
A2 (2, 5)               5                          6                         4                       C3
A3 (8, 4)              12                          7                         9                       C2
A4 (5, 8)               5                          0                        10                       C2
A5 (7, 5)              10                          5                         9                       C2
A6 (6, 4)              10                          5                         7                       C2
A7 (1, 2)               9                         10                         0                       C3
A8 (4, 9)               3                          2                        10                       C2
Working Example

 New clusters are as follows:
   C1 contains only one point, i.e. A1 (2, 10), and it is the cluster center
   C2 contains: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
     Cluster center of C2: (6, 6)
   C3 contains: A2(2, 5), A7(1, 2)
     Cluster center of C3: (1.5, 3.5)
Iteration II

Given point    Distance from C1 (2, 10)   Distance from C2 (6, 6)   Distance from C3 (1.5, 3.5)   Belongs to cluster
A1 (2, 10)              0                          8                           7                        C1
A2 (2, 5)               5                          5                           2                        C3
A3 (8, 4)              12                          4                           7                        C2
A4 (5, 8)               5                          3                           8                        C2
A5 (7, 5)              10                          2                           7                        C2
A6 (6, 4)              10                          2                           5                        C2
A7 (1, 2)               9                          9                           2                        C3
A8 (4, 9)               3                          5                           8                        C1
Working Example

 New cluster centers:
   For C1: (3, 9.5)
   For C2: (6.5, 5.25)
   For C3: (1.5, 3.5)
 The process is repeated until there is no change in the cluster centers (a small script reproducing these iterations is sketched below).
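 For checking the arithmetic, here is a minimal sketch (names and structure are assumptions) that runs k-means with the Manhattan distance ρ(a, b) = |x2 – x1| + |y2 – y1| on these eight points, starting from the given centers:

# Sketch (assumed illustration): k-means on the worked example above,
# using the Manhattan distance from the slides.

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]   # initial centers A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for it in range(1, 5):
    # Assign each point to the cluster of its nearest center.
    clusters = [[] for _ in centers]
    for name, p in points.items():
        j = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[j].append(p)
    # Recompute each center as the mean of its cluster.
    new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    print(f"Iteration {it}: centers = {new_centers}")
    if new_centers == centers:
        break
    centers = new_centers

# Iteration 1 prints (2, 10), (6, 6), (1.5, 3.5);
# Iteration 2 prints (3, 9.5), (6.5, 5.25), (1.5, 3.5), matching the tables above.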
Exercise

 Use the k-means algorithm to divide the following points into 2 clusters:
   A (2, 2), B (3, 2), C (1, 1), D (3, 1), E (1.5, 0.5)

Solution

 After Iteration I, the clusters are as follows:
   C1: A (2, 2), B (3, 2), D (3, 1)
   C2: C (1, 1), E (1.5, 0.5)

K-means Clustering – Details

 Initial centroids are often chosen randomly.
   Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
 K-means will converge for the common similarity measures mentioned above.
   Most of the convergence happens in the first few iterations.
   Often the stopping condition is changed to 'until relatively few points change clusters'
 Complexity is O(n * K * I * d)
   n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Limitations of K-means

 K-means has problems when clusters are of differing
   Sizes
   Densities
   Non-globular shapes
 K-means has problems when the data contains outliers.
A drawback of K-means

 Consider seven points in 1-D space having the values 1, 2, 3, 8, 9, 10, and 25, respectively.
 By visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier.
 If K = 2, one clustering can be {{1, 2, 3}, {8, 9, 10, 25}}, which has the within-cluster variation:
   (1 − 2)² + (2 − 2)² + (3 − 2)² + (8 − 13)² + (9 − 13)² + (10 − 13)² + (25 − 13)² = 196
 There can be another partitioning, {{1, 2, 3, 8}, {9, 10, 25}}, for which k-means computes the within-cluster variation as:
   (1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (8 − 3.5)² + (9 − 14.67)² + (10 − 14.67)² + (25 − 14.67)² = 189.67
 So k-means prefers the second partition, even though it splits the natural cluster {8, 9, 10}, because the outlier 25 drags the mean of its cluster away from the other points.
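 The two variations can be checked with a few lines of Python (a sketch added here for verification; function names are assumptions):

# Sketch: verify the within-cluster variation of the two partitions above.

def within_cluster_variation(partition):
    total = 0.0
    for cluster in partition:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total

print(within_cluster_variation([[1, 2, 3], [8, 9, 10, 25]]))   # 196.0
print(within_cluster_variation([[1, 2, 3, 8], [9, 10, 25]]))   # ~189.67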
What is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
   Since an object with an extremely large value may substantially distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

[Figure: two scatter plots contrasting a mean-based centroid with a medoid-based representative on the same data]
PAM: A Typical K-Medoids Algorithm (K = 2)

[Figure: arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid with O_random (total cost = 26); swap if the quality is improved; loop until no change]
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters
   PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
   ○ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering (see the sketch below)
   ○ PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
 Efficiency improvements on PAM
   CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
   CLARANS (Ng & Han, 1994): Randomized re-sampling
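 To make the swap step concrete, here is a simplified PAM-style sketch (helper names and the exhaustive swap search are assumptions, not the textbook's exact procedure):

# Simplified PAM-style sketch (assumed illustration): repeatedly try swapping
# a medoid with a non-medoid object and keep the swap if it lowers the total
# cost, i.e., the sum of distances from each point to its nearest medoid.
import math

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                     # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate            # keep the improving swap
                    improved = True
    return medoids

pts = [(1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (25, 25)]
print(pam(pts, 2))   # the chosen medoids are actual data objects, not means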
Hierarchical Clustering

 Produces a set of nested clusters organized as a hierarchical tree
 Can be visualized as a dendrogram
   A tree-like diagram that records the sequences of merges or splits

[Figure: a set of nested clusters over six points and the corresponding dendrogram, with merge heights on the vertical axis]
Strengths of Hierarchical Clustering

 Do not have to assume any particular number of clusters
   Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
 Representing data objects in the form of a hierarchy is useful for data summarization and visualization
Hierarchical Clustering

 Two main types of hierarchical clustering
   Agglomerative:
   ○ Start with the points as individual clusters
   ○ At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
   Divisive:
   ○ Start with one, all-inclusive cluster
   ○ At each step, split a cluster until each cluster contains a single point (or there are k clusters)
 Traditional hierarchical algorithms use a similarity or distance matrix
   Merge or split one cluster at a time
Hierarchical Clustering

 Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) runs the same steps in reverse order]

Dendrogram representation for Hierarchical clustering

[Figure: dendrogram]
AGNES (AGglomerative NESting)

 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Uses the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing how AGNES progressively merges nearby points into larger clusters]
Dendrogram: Shows How Clusters are Merged

 Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
Agglomerative Clustering Algorithm

 More popular hierarchical clustering technique
 Basic algorithm is straightforward (a sketch follows below)
   1. Compute the proximity matrix
   2. Let each data point be a cluster
   3. Repeat
   4.   Merge the two closest clusters
   5.   Update the proximity matrix
   6. Until only a single cluster remains
 Key operation is the computation of the proximity of two clusters
   Different approaches to defining the distance between clusters distinguish the different algorithms
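 As an illustration of this loop, here is a short Python sketch of agglomerative clustering over a distance matrix (an assumed, naive implementation using single link, i.e., MIN; other linkage choices are discussed on the following slides, and the worked example later in this deck measures distances to cluster centroids, so its merge order can differ):

# Sketch (assumed illustration): naive single-link agglomerative clustering.
# Steps 1-6 from the slide: start with singleton clusters, repeatedly merge
# the two closest clusters, and continue until only one cluster remains.
import math

def single_link(dist, ci, cj):
    """Distance between two clusters = smallest pairwise point distance."""
    return min(dist[i][j] for i in ci for j in cj)

def agglomerative(points):
    # Step 1: proximity matrix of pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 2: each data point starts as its own cluster (stored by index).
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Steps 3-6: merge the two closest clusters until one remains.
    while len(clusters) > 1:
        a, b = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(dist, clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

pts = [(10, 5), (1, 4), (5, 8), (9, 2), (12, 10), (15, 8), (7, 7)]  # A..G below
names = "ABCDEFG"
for left, right in agglomerative(pts):
    print([names[i] for i in left], "+", [names[i] for i in right])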
Starting Situation

 Start with clusters of individual points and a proximity matrix

[Figure: a proximity matrix with one row and one column per point p1, p2, p3, …]
Intermediate Situation

 After some merging steps, we have some clusters

[Figure: five clusters C1–C5 and the corresponding 5×5 proximity matrix]
Intermediate Situation

 We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1–C5 with C2 and C5 marked for merging, alongside the proximity matrix]
After Merging

 The question is: "How do we update the proximity matrix?"

[Figure: the proximity matrix after merging C2 and C5, with the entries involving the new cluster C2 ∪ C5 marked "?"]
How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 and their proximity matrix, with the pairwise distances used by each linkage highlighted]

 MIN
 MAX
 Group Average
 Distance Between Centroids
 Other methods driven by an objective function
   – Ward's Method uses squared error
Distance between Clusters

 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min over tip ∈ Ki, tjq ∈ Kj of d(tip, tjq)
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max over tip ∈ Ki, tjq ∈ Kj of d(tip, tjq)
 Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg over tip ∈ Ki, tjq ∈ Kj of d(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
   Medoid: a chosen, centrally located object in the cluster
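 These definitions map directly onto small helper functions. The sketch below is an assumed illustration (clusters are lists of points, d is the Euclidean point-to-point distance):

# Sketch: the cluster-to-cluster distances listed above, written as helpers.
import math
from itertools import product

d = math.dist   # point-to-point distance (Euclidean here)

def single_link(Ki, Kj):
    return min(d(p, q) for p, q in product(Ki, Kj))

def complete_link(Ki, Kj):
    return max(d(p, q) for p, q in product(Ki, Kj))

def average_link(Ki, Kj):
    return sum(d(p, q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    mean = lambda K: tuple(sum(c) / len(K) for c in zip(*K))
    return d(mean(Ki), mean(Kj))

Ki = [(1, 1), (2, 1)]
Kj = [(4, 1), (5, 2)]
print(single_link(Ki, Kj), complete_link(Ki, Kj),
      average_link(Ki, Kj), centroid_link(Ki, Kj))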
Cluster Similarity: MIN or Single Link

 Similarity of two clusters is based on the two most similar (closest) points in the different clusters
 Determined by one pair of points, i.e., by one link in the proximity graph.

       I1     I2     I3     I4     I5
I1    1.00   0.90   0.10   0.65   0.20
I2    0.90   1.00   0.70   0.60   0.50
I3    0.10   0.70   1.00   0.40   0.30
I4    0.65   0.60   0.40   1.00   0.80
I5    0.20   0.50   0.30   0.80   1.00
Hierarchical Clustering: MIN

[Figure: nested clusters of six points produced by single-link clustering and the corresponding dendrogram]
Cluster Similarity: MAX or Complete Linkage

 Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
 Determined by all pairs of points in the two clusters

       I1     I2     I3     I4     I5
I1    1.00   0.90   0.10   0.65   0.20
I2    0.90   1.00   0.70   0.60   0.50
I3    0.10   0.70   1.00   0.40   0.30
I4    0.65   0.60   0.40   1.00   0.80
I5    0.20   0.50   0.30   0.80   1.00
Hierarchical Clustering: MAX

[Figure: nested clusters of six points produced by complete-link clustering and the corresponding dendrogram]


Major Weaknesses

 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n²), where n is the total number of objects
Working Example

Clustering the following 7 data points:

     X1   X2
A    10    5
B     1    4
C     5    8
D     9    2
E    12   10
F    15    8
G     7    7
Working Example contd..

[Figure: scatter plot of the seven points]

Working Example contd..

 Step 1: Calculate the distances between all data points using the Euclidean distance function. The shortest distance is between data points C and G.

        A       B       C       D       E       F
B     9.06
C     5.83    5.66
D     3.16    8.25    7.21
E     5.39   12.53    7.28   14.42
F     5.83   14.56   10.00   16.16    3.61
G     3.61    6.71    2.24    8.60    5.83    8.06
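 This distance matrix can be reproduced with a few lines of Python; the snippet below is an added helper sketch (names are assumptions) for checking the numbers:

# Sketch: pairwise Euclidean distances for the seven points of the example.
import math

pts = {"A": (10, 5), "B": (1, 4), "C": (5, 8), "D": (9, 2),
       "E": (12, 10), "F": (15, 8), "G": (7, 7)}

names = list(pts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"d({a},{b}) = {math.dist(pts[a], pts[b]):.2f}")
# The smallest value is d(C,G) = 2.24, so C and G are merged first.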


Working Example contd..

 Step 2: We measure the distance between the "C,G" cluster and the other data points (the slide labels this "Average Linkage"; the values shown are in fact the distances to the centroid of C and G, which is (6, 7.5)).

        A       B      C,G      D       E
B     9.06
C,G   4.72    6.10
D     3.16    8.25    6.26
E     5.39   12.53    6.50   14.42
F     5.83   14.56    9.01   16.16    3.61
Working Example contd..

 The smallest distance is now between A and D (3.16), so A and D are merged; distances to the new cluster are measured to its centroid (9.5, 3.5).

        A,D      B      C,G      E
B     8.51
C,G   5.32    6.10
E     6.96   12.53    6.50
F     7.11   14.56    9.01    3.61
Working Example contd..

 Next, E and F are merged (distance 3.61):

        A,D      B      C,G
B     8.51
C,G   5.32    6.10
E,F   6.80   13.46    7.65
Working Example contd..

 Next, the clusters {A,D} and {C,G} are merged (distance 5.32):

         A,D,C,G      B
B          6.91
E,F        6.73    13.46
Working Example contd..

 Next, {A,D,C,G} and {E,F} are merged (distance 6.73); finally B joins the remaining cluster at distance 9.07:

        A,D,C,G,E,F
B           9.07
Exercise

 Consider the following distance matrix:
 Apply hierarchical clustering using MIN distance (single linkage)
DIANA (DIvisive ANAlysis)

 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots showing how DIANA progressively splits one all-inclusive cluster into smaller clusters]
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

 Centroid: the "middle" of a cluster

   C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

 Radius: square root of the average distance from any point of the cluster to its centroid

   R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

 Diameter: square root of the average mean squared distance between all pairs of points in the cluster

   D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
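 For numeric clusters these three statistics are straightforward to compute; the following is a small added sketch (function names are assumptions):

# Sketch: centroid, radius and diameter of a cluster of numeric points.
import math
from itertools import combinations

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def radius(cluster):
    cm = centroid(cluster)
    n = len(cluster)
    return math.sqrt(sum(math.dist(t, cm) ** 2 for t in cluster) / n)

def diameter(cluster):
    n = len(cluster)
    # The double sum over all ordered pairs equals twice the sum over
    # unordered pairs, divided by N(N-1) as in the formula above.
    pair_sum = sum(math.dist(t, u) ** 2 for t, u in combinations(cluster, 2))
    return math.sqrt(2 * pair_sum / (n * (n - 1)))

cl = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(centroid(cl), radius(cl), diameter(cl))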
Density-Based Clustering Methods

 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
   Discover clusters of arbitrary shape
   Handle noise
   One scan
   Need density parameters as termination condition
 Several interesting studies:
   DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Ester, et al. (KDD'96)
   OPTICS: Ankerst, et al. (SIGMOD'99)
   DENCLUE: Hinneburg & Keim (KDD'98)
   CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Definitions

 Two parameters:
   Eps: Maximum radius of the neighborhood
   MinPts: Minimum number of points in an Eps-neighborhood of that point
 NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
Density-Based Clustering: Basic Definitions

 The density of a neighborhood can be measured simply by the number of objects in the neighborhood.
 An object is a core object if the Eps-neighborhood of the object contains at least MinPts objects
   Core point condition: |NEps(q)| ≥ MinPts
 Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
   p belongs to NEps(q)

[Figure: point p lies inside the Eps = 1 cm neighborhood of core point q, with MinPts = 5]

 Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the Eps-neighborhood of q.
Density-Reachable and Density-Connected

 Density-reachable:
   A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
 Density-connected:
   A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
Example

 Consider the following figure, where MinPts = 3:

[Figure: points m, p, o, q and r with their Eps-neighborhoods]

 m, p, o, r are core objects because each is in an Eps-neighborhood containing at least three points.
 Object q is directly density-reachable from m; object m is directly density-reachable from p and vice versa.
 Object q is (indirectly) density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p.
 However, p is not density-reachable from q. Why? Because q is not a core object, and density-reachability chains can only start from core objects.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

 Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
 Border point: A point which has fewer than MinPts within Eps, but is in the neighborhood of a core point.
 Noise or outlier: A point which is neither a core point nor a border point.

[Figure: core, border and outlier points for Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm

 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed (a sketch of this loop follows below)
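 The loop above can be sketched in a few dozen lines of Python. This is a simplified, assumed illustration (names and the brute-force neighborhood query are not from the original material):

# Simplified DBSCAN sketch (assumed illustration): brute-force
# Eps-neighborhood queries, expanding clusters outward from core points.
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point (maybe noise)
            labels[i] = NOISE
            continue
        cluster_id += 1                      # start a new cluster at core point i
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:           # border point: absorb into cluster
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand further
                seeds.extend(j_neighbors)
    return labels

pts = [(1, 2), (1, 2.5), (1.2, 2.5), (5, 5), (5.1, 5.2), (5.2, 5.1), (9, 9)]
print(dbscan(pts, eps=0.6, min_pts=3))   # two clusters plus one noise point (-1)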
Working Example

 Let's take a dataset of 13 points, as shown and plotted below:

[Figure/table: the 13-point dataset and its scatter plot]

 Let's choose eps = 0.6 and MinPts = 4.

Working Example contd..

 Let's consider the first data point in the dataset, (1, 2), and calculate its distance from every other data point in the data set.

[Table: distances from (1, 2) to the other 12 points]

 As evident from the above table, the point (1, 2) has only two other points in its neighborhood, (1, 2.5) and (1.2, 2.5), for the assumed value of eps.
 As this is less than MinPts, we can't declare it a core point.
 Let's repeat the above process for every point in the dataset and find out the neighborhood of each.

Working Example contd..

[Figure/table: neighborhoods of the remaining points]
Cluster Validity

 For supervised classification we have a variety of measures to evaluate how good our model is
   Accuracy, precision, recall
 For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
 But "clusters are in the eye of the beholder"!
 Then why do we want to evaluate them?
   To avoid finding patterns in noise
   To compare clustering algorithms
   To compare two sets of clusters
   To compare two clusters
Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   - Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

 For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity

 Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
   External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
   ○ Entropy
   Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
   ○ Sum of Squared Error (SSE)
   Relative Index: Used to compare two different clusterings or clusters.
   ○ Often an external or internal index is used for this function, e.g., SSE or entropy
 Sometimes these are referred to as criteria instead of indices
   However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation

 Two matrices
   Proximity Matrix
   "Incidence" Matrix
   ○ One row and one column for each data point
   ○ An entry is 1 if the associated pair of points belongs to the same cluster
   ○ An entry is 0 if the associated pair of points belongs to different clusters
 Compute the correlation between the two matrices
   Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
 High correlation indicates that points that belong to the same cluster are close to each other.
 Not a good measure for some density- or contiguity-based clusters.
Using Similarity Matrix for Cluster Validation

 Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a scatter plot of well-separated clusters and the corresponding similarity matrix, which shows a clear block-diagonal structure]
Using Similarity Matrix for Cluster Validation

 Clusters in random data are not so crisp

[Figure: a DBSCAN clustering of random data and its similarity matrix, which shows much weaker block structure]
Internal Measures: SSE

 Clusters in more complicated figures aren't well separated
 Internal Index: Used to measure the goodness of a clustering structure without respect to external information
   SSE
 SSE is good for comparing two clusterings or two clusters (average SSE).
 Can also be used to estimate the number of clusters

[Figure: a two-dimensional data set and a plot of SSE versus the number of clusters K]
Internal Measures: Cohesion and Separation

 Cluster Cohesion: Measures how closely related the objects in a cluster are
   Example: SSE
 Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
   Example: Squared Error
 Cohesion is measured by the within-cluster sum of squares (SSE):

   WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2

 Separation is measured by the between-cluster sum of squares:

   BSS = \sum_{i} |C_i| (m - m_i)^2

   where |C_i| is the size of cluster i, m_i is its mean, and m is the overall mean
Internal Measures: Cohesion and Separation

 Example: SSE on the points 1, 2, 4, 5 (overall mean m = 3; cluster means m1, m2)
   BSS + WSS = constant

 K=1 cluster (all points, mean 3):
   WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
   BSS = 4 × (3 − 3)² = 0
   Total = 10 + 0 = 10

 K=2 clusters ({1, 2} with mean 1.5 and {4, 5} with mean 4.5):
   WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
   BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
   Total = 1 + 9 = 10
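 A quick way to see that BSS + WSS stays constant is to compute both quantities for any partition of the points; the snippet below is an added sketch (names are assumptions):

# Sketch: WSS (cohesion) and BSS (separation) for a partition of 1-D points.

def wss_bss(partition):
    all_points = [x for cluster in partition for x in cluster]
    m = sum(all_points) / len(all_points)            # overall mean
    wss, bss = 0.0, 0.0
    for cluster in partition:
        mi = sum(cluster) / len(cluster)             # cluster mean
        wss += sum((x - mi) ** 2 for x in cluster)   # within-cluster SSE
        bss += len(cluster) * (m - mi) ** 2          # between-cluster term
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))        # (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))      # (1.0, 9.0)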
Internal Measures: Cohesion and Separation

 A proximity-graph-based approach can also be used for cohesion and separation.
   Cluster cohesion is the sum of the weights of all links within a cluster.
   Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

[Figure: a proximity graph illustrating within-cluster links (cohesion) and between-cluster links (separation)]
Internal Measures: Silhouette Coefficient

 The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
 For an individual point i:
   Calculate a = average distance of i to the points in its cluster
   Calculate b = min (average distance of i to points in another cluster)
   The silhouette coefficient for the point is then given by
     s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case)
   Typically between 0 and 1.
   The closer to 1 the better.
 Can calculate the average silhouette width for a cluster or a clustering (a small sketch follows below)
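 Here is a brief added sketch of the per-point silhouette coefficient as defined above (function and variable names are assumptions):

# Sketch: silhouette coefficient s(i) = 1 - a/b (for a < b), where
# a = average distance of point i to the other points in its own cluster and
# b = smallest average distance of i to the points of any other cluster.
import math

def silhouette(point, own_cluster, other_clusters):
    others_in_own = [p for p in own_cluster if p != point]
    a = sum(math.dist(point, p) for p in others_in_own) / len(others_in_own)
    b = min(sum(math.dist(point, p) for p in cl) / len(cl)
            for cl in other_clusters)
    return 1 - a / b if a < b else b / a - 1

c1 = [(1, 1), (1, 2), (2, 1)]
c2 = [(8, 8), (8, 9), (9, 8)]
print(silhouette((1, 1), c1, [c2]))   # close to 1: the point sits well in its cluster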
Final Comment on Cluster Validity
“The validation of clustering structures is the
most difficult and frustrating part of cluster
analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only
to those true believers who have experience
and great courage.”

Algorithms for Clustering Data, Jain and Dubes
