
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 17: Clustering Validation

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 1 / 58
Clustering Validation and Evaluation

Cluster validation and assessment encompasses three main tasks:
Clustering evaluation seeks to assess the goodness or quality of the clustering.
Clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters.
Clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure.

Validity measures can be divided into three main types:


External: External validation measures employ criteria that are not inherent
to the dataset, e.g., class labels.
Internal: Internal validation measures employ criteria that are derived from
the data itself, e.g., intracluster and intercluster distances.
Relative: Relative validation measures aim to directly compare different
clusterings, usually those obtained via different parameter settings
for the same algorithm.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 2 / 58
External Measures

External measures assume that the correct or ground-truth clustering is known a priori, which is used to evaluate a given clustering.
Let D = {x_i}_{i=1}^n be a dataset consisting of n points in a d-dimensional space, partitioned into k clusters. Let y_i ∈ {1, 2, . . . , k} denote the ground-truth cluster membership or label information for each point.
The ground-truth clustering is given as T = {T_1, T_2, . . . , T_k}, where the cluster T_j consists of all the points with label j, i.e., T_j = {x_i ∈ D | y_i = j}. We refer to T as the ground-truth partitioning, and to each T_i as a partition.
Let C = {C_1, . . . , C_r} denote a clustering of the same dataset into r clusters, obtained via some clustering algorithm, and let ŷ_i ∈ {1, 2, . . . , r} denote the cluster label for x_i.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 3 / 58
External Measures

External evaluation measures try to capture the extent to which points from the
same partition appear in the same cluster, and the extent to which points from
different partitions are grouped in different clusters.
All of the external measures rely on the r × k contingency table N that is induced
by a clustering C and the ground-truth partitioning T , defined as follows

N(i, j) = nij = |Ci ∩ Tj |

The count nij denotes the number of points that are common to cluster Ci and
ground-truth partition Tj .
Let ni = |Ci | denote the number of points in cluster Ci , and let mj = |Tj | denote
the number of points in partition Tj .
The contingency table can be computed from T and C in O(n) time by examining the partition and cluster labels, y_i and ŷ_i, for each point x_i ∈ D and incrementing the corresponding count n_{ŷ_i, y_i}.
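A minimal NumPy sketch of this O(n) construction (the slides give no code; the function name, the 0-based labels, and NumPy itself are assumptions):

import numpy as np

def contingency_table(y_hat, y, r, k):
    """Build the r x k contingency table N with N[i, j] = |C_i ∩ T_j|.
    y_hat: cluster labels in {0, ..., r-1}; y: ground-truth labels in {0, ..., k-1}.
    Runs in O(n) time, one pass over the points."""
    N = np.zeros((r, k), dtype=int)
    for ci, tj in zip(y_hat, y):
        N[ci, tj] += 1
    return N

# Example: the "good" Iris clustering discussed later in this chapter yields
# the table [[0, 47, 14], [50, 0, 0], [0, 3, 36]].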

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 4 / 58
Matching Based Measures: Purity

Purity quantifies the extent to which a cluster Ci contains entities from only one
partition:
purity_i = (1/n_i) max_{j=1}^{k} {n_ij}

The purity of clustering C is defined as the weighted sum of the clusterwise purity values:

purity = Σ_{i=1}^{r} (n_i/n) purity_i = (1/n) Σ_{i=1}^{r} max_{j=1}^{k} {n_ij}

where the ratio n_i/n denotes the fraction of points in cluster C_i.
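A short NumPy sketch of the purity computation, using the contingency table of the "good" Iris clustering shown later in the chapter (NumPy and the variable names are illustrative assumptions):

import numpy as np

N = np.array([[0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])               # rows = clusters C_i, columns = partitions T_j
n = N.sum()                                # total number of points

purity_i = N.max(axis=1) / N.sum(axis=1)   # clusterwise purity
purity = N.max(axis=1).sum() / n           # weighted sum = (1/n) Σ_i max_j n_ij
print(purity_i, purity)                    # purity ≈ 0.887, as on the example slide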

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 5 / 58
Matching Based Measures: Maximum Matching

The maximum matching measure selects the mapping between clusters and
partitions, such that the sum of the number of common points (nij ) is maximized,
provided that only one cluster can match with a given partition.
Let G be a bipartite graph over the vertex set V = C ∪ T , and let the edge set be
E = {(Ci , Tj )} with edge weights w (Ci , Tj ) = nij . A matching M in G is a subset
of E , such that the edges in M are pairwise nonadjacent, that is, they do not have
a common vertex.
The maximum weight matching in G is given as:

match = arg max_M { w(M) / n }

where w(M) is the sum of all the edge weights in matching M, given as w(M) = Σ_{e∈M} w(e).
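One hedged way to compute the maximum weight matching is the Hungarian method; the sketch below assumes SciPy's linear_sum_assignment (with maximize=True, available in recent SciPy versions) and reuses the Iris contingency table:

import numpy as np
from scipy.optimize import linear_sum_assignment

N = np.array([[0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])               # contingency table for the "good" Iris clustering
n = N.sum()

# Maximum weight bipartite matching between clusters and partitions.
rows, cols = linear_sum_assignment(N, maximize=True)
match = N[rows, cols].sum() / n
print(match)                               # ≈ 0.887 (C1→T2, C2→T1, C3→T3)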

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 6 / 58
Matching Based Measures: F-measure

Given cluster C_i, let j_i denote the partition that contains the maximum number of points from C_i, that is, j_i = arg max_{j=1}^{k} {n_ij}.
The precision of a cluster C_i is the same as its purity:

prec_i = (1/n_i) max_{j=1}^{k} {n_ij} = n_{i j_i} / n_i

The recall of cluster C_i is defined as

recall_i = n_{i j_i} / |T_{j_i}| = n_{i j_i} / m_{j_i}

where m_{j_i} = |T_{j_i}|.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 7 / 58
Matching Based Measures: F-measure

The F-measure is the harmonic mean of the precision and recall values for each C_i:

F_i = 2 / (1/prec_i + 1/recall_i) = (2 · prec_i · recall_i) / (prec_i + recall_i) = 2 n_{i j_i} / (n_i + m_{j_i})

The F-measure for the clustering C is the mean of the clusterwise F-measure values:

F = (1/r) Σ_{i=1}^{r} F_i
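A NumPy sketch of the clusterwise precision, recall, and F-measure from the contingency table (the variable names and NumPy usage are illustrative assumptions):

import numpy as np

N = np.array([[0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])               # rows = clusters C_i, columns = partitions T_j

n_i = N.sum(axis=1)                        # cluster sizes
m_j = N.sum(axis=0)                        # partition sizes
j_star = N.argmax(axis=1)                  # j_i: partition with most points from C_i
n_iji = N[np.arange(N.shape[0]), j_star]

prec = n_iji / n_i
recall = n_iji / m_j[j_star]
F_i = 2 * n_iji / (n_i + m_j[j_star])      # harmonic mean of prec_i and recall_i
F = F_i.mean()
print(F_i, F)                              # F ≈ 0.885, as on the example slide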

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 8 / 58
K-means: Iris Principal Components Data
Good Case
[Scatter plot of the Iris principal components data (axes u1, u2) showing the good K-means clustering: C1 (squares), C2 (circles), C3 (triangles).]

Contingency table:

                iris-setosa   iris-versicolor   iris-virginica
                T1            T2                T3               n_i
C1 (squares)     0            47                14               61
C2 (circles)    50             0                 0               50
C3 (triangles)   0             3                36               39
m_j             50            50                50               n = 150

purity = 0.887, match = 0.887, F = 0.885.


Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 9 / 58
K-means: Iris Principal Components Data
Bad Case
[Scatter plot of the Iris principal components data (axes u1, u2) showing the bad K-means clustering: C1 (squares), C2 (circles), C3 (triangles).]

Contingency table:

                iris-setosa   iris-versicolor   iris-virginica
                T1            T2                T3               n_i
C1 (squares)    30             0                 0               30
C2 (circles)    20             4                 0               24
C3 (triangles)   0            46                50               96
m_j             50            50                50               n = 150

purity = 0.667, match = 0.560, F = 0.658


Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 10 / 58
Entropy-based Measures: Conditional Entropy

The entropy of a clustering C and of the partitioning T are given as

H(C) = − Σ_{i=1}^{r} p_{C_i} log p_{C_i}        H(T) = − Σ_{j=1}^{k} p_{T_j} log p_{T_j}

where p_{C_i} = n_i/n and p_{T_j} = m_j/n are the probabilities of cluster C_i and partition T_j.

The cluster-specific entropy of T, that is, the conditional entropy of T with respect to cluster C_i, is defined as

H(T | C_i) = − Σ_{j=1}^{k} (n_ij/n_i) log(n_ij/n_i)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 11 / 58
Entropy-based Measures: Conditional Entropy

The conditional entropy of T given clustering C is defined as the weighted sum:

H(T | C) = Σ_{i=1}^{r} (n_i/n) H(T | C_i) = − Σ_{i=1}^{r} Σ_{j=1}^{k} p_ij log(p_ij / p_{C_i})
         = H(C, T) − H(C)

where p_ij = n_ij/n is the probability that a point belongs to cluster C_i and partition T_j, and where H(C, T) = − Σ_{i=1}^{r} Σ_{j=1}^{k} p_ij log p_ij is the joint entropy of C and T.

H(T | C) = 0 if and only if T is completely determined by C, corresponding to the ideal clustering. If C and T are independent of each other, then H(T | C) = H(T).

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 12 / 58
Entropy-based Measures: Normalized Mutual Information
The mutual information tries to quantify the amount of shared information between the clustering C and partitioning T, and it is defined as

I(C, T) = Σ_{i=1}^{r} Σ_{j=1}^{k} p_ij log( p_ij / (p_{C_i} · p_{T_j}) )

When C and T are independent then p_ij = p_{C_i} · p_{T_j}, and thus I(C, T) = 0. However, there is no upper bound on the mutual information.

The normalized mutual information (NMI) is defined as the geometric mean:

NMI(C, T) = sqrt( I(C, T)/H(C) · I(C, T)/H(T) ) = I(C, T) / sqrt( H(C) · H(T) )

The NMI value lies in the range [0, 1]. Values close to 1 indicate a good clustering.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 13 / 58
Entropy-based Measures: Variation of Information

This criterion is based on the mutual information between the clustering C and the
ground-truth partitioning T , and their entropy; it is defined as

VI(C, T) = (H(T) − I(C, T)) + (H(C) − I(C, T)) = H(T) + H(C) − 2 I(C, T)

Variation of information (VI) is zero only when C and T are identical. Thus, the lower the VI value the better the clustering C.

VI can also be expressed as:

VI(C, T) = H(T | C) + H(C | T)

VI(C, T) = 2 H(T, C) − H(T) − H(C)
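The entropy-based measures all follow from the joint EPMF p_ij = n_ij/n. A hedged NumPy sketch, using log base 2, which reproduces the H(T|C), NMI, and VI values reported for the good Iris clustering on the next slide:

import numpy as np

N = np.array([[0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]], dtype=float)  # contingency table, good Iris clustering
n = N.sum()

p_ij = N / n
p_C = p_ij.sum(axis=1)                     # cluster probabilities n_i / n
p_T = p_ij.sum(axis=0)                     # partition probabilities m_j / n

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_C, H_T, H_CT = H(p_C), H(p_T), H(p_ij.ravel())
H_T_given_C = H_CT - H_C                   # conditional entropy H(T|C)
I = H_C + H_T - H_CT                       # mutual information I(C,T)
NMI = I / np.sqrt(H_C * H_T)
VI = H_T + H_C - 2 * I                     # variation of information
print(H_T_given_C, NMI, VI)                # ≈ 0.418, 0.742, 0.812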

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 14 / 58
K-means: Iris Principal Components Data
Good Case

[Scatter plots of the Iris principal components data (axes u1, u2): (a) K-means good clustering; (b) K-means bad clustering.]

           purity   match    F       H(T|C)   NMI     VI
(a) Good   0.887    0.887    0.885   0.418    0.742   0.812
(b) Bad    0.667    0.560    0.658   0.743    0.587   1.200

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 15 / 58
Pairwise Measures
Given clustering C and ground-truth partitioning T, let x_i, x_j ∈ D be any two points, with i ≠ j. Let y_i denote the true partition label and let ŷ_i denote the cluster label for point x_i.
If both x_i and x_j belong to the same cluster, that is, ŷ_i = ŷ_j, we call it a positive event, and if they do not belong to the same cluster, that is, ŷ_i ≠ ŷ_j, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:
True Positives: x_i and x_j belong to the same partition in T, and they are also in the same cluster in C. The number of true positive pairs is given as

TP = |{(x_i, x_j) : y_i = y_j and ŷ_i = ŷ_j}|

False Negatives: x_i and x_j belong to the same partition in T, but they do not belong to the same cluster in C. The number of all false negative pairs is given as

FN = |{(x_i, x_j) : y_i = y_j and ŷ_i ≠ ŷ_j}|

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 16 / 58
Pairwise Measures

False Positives: x_i and x_j do not belong to the same partition in T, but they do belong to the same cluster in C. The number of false positive pairs is given as

FP = |{(x_i, x_j) : y_i ≠ y_j and ŷ_i = ŷ_j}|

True Negatives: x_i and x_j neither belong to the same partition in T, nor do they belong to the same cluster in C. The number of such true negative pairs is given as

TN = |{(x_i, x_j) : y_i ≠ y_j and ŷ_i ≠ ŷ_j}|

Because there are N = (n choose 2) = n(n − 1)/2 pairs of points, we have the following identity:

N = TP + FN + FP + TN

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 17 / 58
Pairwise Measures: TP, TN, FP, FN
They can be computed efficiently using the contingency table N = {n_ij}. The number of true positives is given as

TP = (1/2) ( Σ_{i=1}^{r} Σ_{j=1}^{k} n_ij² − n )

The false negatives can be computed as

FN = (1/2) ( Σ_{j=1}^{k} m_j² − Σ_{i=1}^{r} Σ_{j=1}^{k} n_ij² )

The number of false positives are:

FP = (1/2) ( Σ_{i=1}^{r} n_i² − Σ_{i=1}^{r} Σ_{j=1}^{k} n_ij² )

Finally, the number of true negatives can be obtained via

TN = N − (TP + FN + FP) = (1/2) ( n² − Σ_{i=1}^{r} n_i² − Σ_{j=1}^{k} m_j² + Σ_{i=1}^{r} Σ_{j=1}^{k} n_ij² )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 18 / 58
Pairwise Measures: Jaccard Coefficient, Rand Statistic

Jaccard Coefficient: measures the fraction of true positive point pairs, but after ignoring the true negatives:

Jaccard = TP / (TP + FN + FP)

Rand Statistic: measures the fraction of true positives and true negatives over all point pairs:

Rand = (TP + TN) / N

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 19 / 58
Pairwise Measures: FM Measure

Fowlkes–Mallows Measure: Define the overall pairwise precision and pairwise recall values for a clustering C, as follows:

prec = TP / (TP + FP)        recall = TP / (TP + FN)

The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall:

FM = sqrt( prec · recall ) = TP / sqrt( (TP + FN)(TP + FP) )
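A NumPy sketch that derives TP, FN, FP, TN from the contingency table and then evaluates the Jaccard, Rand, and Fowlkes–Mallows measures (the helper names are assumptions, not from the slides):

import numpy as np
from math import comb

N = np.array([[0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])               # contingency table, good Iris clustering
n = int(N.sum())
n_i, m_j = N.sum(axis=1), N.sum(axis=0)
n_pairs = comb(n, 2)                       # total number of point pairs

def pairs(a):
    return (a * (a - 1)) // 2              # elementwise "a choose 2"

TP = pairs(N).sum()
FN = pairs(m_j).sum() - TP
FP = pairs(n_i).sum() - TP
TN = n_pairs - (TP + FN + FP)

jaccard = TP / (TP + FN + FP)
rand = (TP + TN) / n_pairs
FM = TP / np.sqrt((TP + FN) * (TP + FP))
print(TP, FN, FP, TN)                      # 3030, 645, 766, 6734
print(jaccard, rand, FM)                   # ≈ 0.682, 0.874, 0.811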

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 20 / 58
K-means: Iris Principal Components Data
Good Case

[Scatter plot of the Iris principal components data (axes u1, u2) showing the good K-means clustering: C1 (squares), C2 (circles), C3 (triangles).]

Contingency table:

        setosa   versicolor   virginica
        T1       T2           T3
C1       0       47           14
C2      50        0            0
C3       0        3           36

The number of true positives is:

TP = (47 choose 2) + (14 choose 2) + (50 choose 2) + (3 choose 2) + (36 choose 2) = 1081 + 91 + 1225 + 3 + 630 = 3030

Likewise, we have FN = 645, FP = 766, TN = 6734, and N = (150 choose 2) = 11175.
We therefore have: Jaccard = 0.682, Rand = 0.874, FM = 0.811.
For the "bad" clustering, we have: Jaccard = 0.477, Rand = 0.717, FM = 0.657.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 21 / 58
Correlation Measures: Hubert statistic
Let X and Y be two symmetric n × n matrices, and let N = (n choose 2). Let x, y ∈ R^N denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of X and Y.
Let µ_X denote the element-wise mean of x, given as

µ_X = (1/N) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X(i, j)

and let z_x denote the centered x vector, defined as z_x = x − 1 · µ_X.

The Hubert statistic is defined as

Γ = (1/N) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X(i, j) · Y(i, j) = (1/N) x^T y

The normalized Hubert statistic is defined as the element-wise correlation

Γ_n = z_x^T z_y / (‖z_x‖ · ‖z_y‖) = cos θ

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 22 / 58
Correlation-based Measure: Discretized Hubert Statistic
Let T and C be the n × n matrices defined as

T(i, j) = 1 if y_i = y_j and i ≠ j, 0 otherwise        C(i, j) = 1 if ŷ_i = ŷ_j and i ≠ j, 0 otherwise

Let t, c ∈ R^N denote the N-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of T and C. Let z_t and z_c denote the centered t and c vectors.
The discretized Hubert statistic is computed by setting x = t and y = c:

Γ = (1/N) t^T c = TP / N

The normalized version of the discretized Hubert statistic is simply the correlation between t and c:

Γ_n = z_t^T z_c / (‖z_t‖ · ‖z_c‖) = ( TP/N − µ_T µ_C ) / sqrt( µ_T µ_C (1 − µ_T)(1 − µ_C) )

where µ_T = (TP + FN)/N and µ_C = (TP + FP)/N.
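Given the pair counts, the discretized Hubert statistic and its normalized version reduce to a few lines; the sketch below plugs in the counts from the earlier Iris example (NumPy is an assumption):

import numpy as np

TP, FN, FP, TN = 3030, 645, 766, 6734      # pair counts for the good Iris clustering
N = TP + FN + FP + TN                      # n(n-1)/2 = 11175 point pairs

gamma = TP / N                             # discretized Hubert statistic: (1/N) t^T c
mu_T = (TP + FN) / N                       # fraction of pairs together in the ground truth
mu_C = (TP + FP) / N                       # fraction of pairs together in the clustering
gamma_n = (gamma - mu_T * mu_C) / np.sqrt(mu_T * mu_C * (1 - mu_T) * (1 - mu_C))
print(gamma, gamma_n)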
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 23 / 58
Internal Measures

Internal evaluation measures do not have recourse to the ground-truth partitioning. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, with usually a trade-off in maximizing these two aims.
The internal measures are based on the n × n distance matrix, also called the proximity matrix, of all pairwise distances among the n points:

W = { ‖x_i − x_j‖ }_{i,j=1}^{n}

where ‖x_i − x_j‖ is the Euclidean distance between x_i, x_j ∈ D.

The proximity matrix W is the adjacency matrix of the weighted complete graph G over the n points, that is, with nodes V = {x_i | x_i ∈ D}, edges E = {(x_i, x_j) | x_i, x_j ∈ D}, and edge weights w_ij = W(i, j) for all x_i, x_j ∈ D.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 24 / 58
Internal Measures
The clustering C can be considered as a k-way cut in G. Given any subsets S, R ⊂ V, define W(S, R) as the sum of the weights on all edges with one vertex in S and the other in R, given as

W(S, R) = Σ_{x_a ∈ S} Σ_{x_b ∈ R} w_ab

We denote by S̄ = V − S the complementary set of vertices.

The sum of all the intracluster and intercluster weights are given as

W_in = (1/2) Σ_{i=1}^{k} W(C_i, C_i)        W_out = (1/2) Σ_{i=1}^{k} W(C_i, C̄_i) = Σ_{i=1}^{k−1} Σ_{j>i} W(C_i, C_j)

The number of distinct intracluster and intercluster edges is given as

N_in = Σ_{i=1}^{k} (n_i choose 2)        N_out = Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} n_i · n_j

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 25 / 58
Clusterings as Graphs: Iris
Only intracluster edges shown.

Good clustering: [scatter plots of the Iris principal components data (axes u1, u2) with the intracluster edges of the good K-means clustering drawn.]

Bad clustering: [the same plots for the bad K-means clustering.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 26 / 58
Internal Measures: BetaCV and C-index
BetaCV Measure: The BetaCV measure is the ratio of the mean intracluster
distance to the mean intercluster distance:
BetaCV = (W_in / N_in) / (W_out / N_out) = (N_out / N_in) · (W_in / W_out) = (N_out / N_in) · Σ_{i=1}^{k} W(C_i, C_i) / Σ_{i=1}^{k} W(C_i, C̄_i)

The smaller the BetaCV ratio, the better the clustering.

C-index: Let Wmin (Nin ) be the sum of the smallest Nin distances in the proximity
matrix W , where Nin is the total number of intracluster edges, or point pairs. Let
Wmax (Nin ) be the sum of the largest Nin distances in W .
The C-index measures to what extent the clustering puts together the Nin points
that are the closest across the k clusters. It is defined as

Cindex = ( W_in − W_min(N_in) ) / ( W_max(N_in) − W_min(N_in) )

The C-index lies in the range [0, 1]. The smaller the C-index, the better the
clustering.
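A rough SciPy/NumPy sketch of BetaCV and the C-index from a data matrix and cluster labels (the function name and the use of pdist/squareform are assumptions, not something the slides prescribe):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def betacv_cindex(X, labels):
    labels = np.asarray(labels)
    D = squareform(pdist(X))                       # proximity matrix W
    iu = np.triu_indices(len(labels), k=1)         # each point pair counted once
    d = D[iu]
    same = (labels[:, None] == labels[None, :])[iu]    # True for intracluster pairs

    W_in, W_out = d[same].sum(), d[~same].sum()
    N_in, N_out = int(same.sum()), int((~same).sum())
    betacv = (W_in / N_in) / (W_out / N_out)       # lower is better

    d_sorted = np.sort(d)
    W_min = d_sorted[:N_in].sum()                  # sum of the N_in smallest pairwise distances
    W_max = d_sorted[-N_in:].sum()                 # sum of the N_in largest pairwise distances
    cindex = (W_in - W_min) / (W_max - W_min)      # in [0, 1], lower is better
    return betacv, cindex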
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 27 / 58
Internal Measures: Normalized Cut and Modularity
Normalized Cut Measure: The normalized cut objective for graph clustering can
also be used as an internal clustering evaluation measure:

NC = Σ_{i=1}^{k} W(C_i, C̄_i) / vol(C_i) = Σ_{i=1}^{k} W(C_i, C̄_i) / W(C_i, V)

where vol(C_i) = W(C_i, V) is the volume of cluster C_i. The higher the normalized cut value the better.

Modularity: The modularity objective is given as

Q = Σ_{i=1}^{k} [ W(C_i, C_i)/W(V, V) − ( W(C_i, V)/W(V, V) )² ]

The smaller the modularity measure the better the clustering.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 28 / 58
Internal Measures: Dunn Index
The Dunn index is defined as the ratio between the minimum distance between
point pairs from different clusters and the maximum distance between point pairs
from the same cluster

Dunn = W_out^min / W_in^max

where W_out^min is the minimum intercluster distance:

W_out^min = min_{i, j>i} { w_ab | x_a ∈ C_i, x_b ∈ C_j }

and W_in^max is the maximum intracluster distance:

W_in^max = max_i { w_ab | x_a, x_b ∈ C_i }

The larger the Dunn index the better the clustering because it means even the
closest distance between points in different clusters is much larger than the
farthest distance between points in the same cluster.
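A small sketch of the Dunn index under the same assumptions as above (SciPy pairwise distances, NumPy):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def dunn_index(X, labels):
    labels = np.asarray(labels)
    D = squareform(pdist(X))
    iu = np.triu_indices(len(labels), k=1)
    d = D[iu]
    same = (labels[:, None] == labels[None, :])[iu]
    return d[~same].min() / d[same].max()   # min intercluster / max intracluster; larger is better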
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 29 / 58
Internal Measures: Davies-Bouldin Index
Let µ_i denote the cluster mean:

µ_i = (1/n_i) Σ_{x_j ∈ C_i} x_j

Let σ_{µ_i} denote the dispersion or spread of the points around the cluster mean:

σ_{µ_i} = sqrt( Σ_{x_j ∈ C_i} δ(x_j, µ_i)² / n_i ) = sqrt( var(C_i) )

The Davies–Bouldin measure for a pair of clusters C_i and C_j is defined as the ratio

DB_ij = ( σ_{µ_i} + σ_{µ_j} ) / δ(µ_i, µ_j)

DB_ij measures how compact the clusters are compared to the distance between the cluster means. The Davies–Bouldin index is then defined as

DB = (1/k) Σ_{i=1}^{k} max_{j≠i} { DB_ij }

The smaller the DB value the better the clustering.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 30 / 58
Silhouette Coefficient
Define the silhouette coefficient of a point x_i as

s_i = ( µ_out^min(x_i) − µ_in(x_i) ) / max{ µ_out^min(x_i), µ_in(x_i) }

where µ_in(x_i) is the mean distance from x_i to points in its own cluster ŷ_i:

µ_in(x_i) = Σ_{x_j ∈ C_ŷi, j≠i} δ(x_i, x_j) / (n_ŷi − 1)

and µ_out^min(x_i) is the mean of the distances from x_i to points in the closest cluster:

µ_out^min(x_i) = min_{j≠ŷi} { Σ_{y ∈ C_j} δ(x_i, y) / n_j }

The s_i value lies in the interval [−1, +1]. A value close to +1 indicates that x_i is much closer to points in its own cluster, a value close to zero indicates x_i is close to the boundary, and a value close to −1 indicates that x_i is much closer to another cluster, and therefore may be mis-clustered.
The silhouette coefficient is the mean s_i value: SC = (1/n) Σ_{i=1}^{n} s_i. A value close to +1 indicates a good clustering.
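In practice the silhouette coefficient is often taken from a library; the sketch below assumes scikit-learn's silhouette_samples and silhouette_score and uses the raw 4-dimensional Iris data, so the numbers will differ from the 2-D PCA values shown on these slides:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)       # per-point silhouette coefficients s_i
SC = silhouette_score(X, labels)        # mean silhouette coefficient
SC_i = [s[labels == c].mean() for c in np.unique(labels)]   # clusterwise SC values
print(SC, SC_i)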
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 31 / 58
Iris Data: Good vs. Bad Clustering

[Scatter plots of the Iris principal components data (axes u1, u2): (a) good K-means clustering; (b) bad K-means clustering.]

           Lower better                      Higher better
           BetaCV   Cindex   Q       DB      NC     Dunn   SC     Γ      Γn
(a) Good   0.24     0.034    −0.23   0.65    2.67   0.08   0.60   8.19   0.92
(b) Bad    0.33     0.08     −0.20   1.11    2.56   0.03   0.55   7.32   0.83

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 32 / 58
Relative Measures: Silhouette Coefficient

The silhouette coefficient s_j of each point, and the average SC value, can be used to estimate the number of clusters in the data.
The approach consists of plotting the s_j values in descending order for each cluster, and noting the overall SC value for a particular value of k, as well as the clusterwise SC values:

SC_i = (1/n_i) Σ_{x_j ∈ C_i} s_j

We then pick the value k that yields the best clustering, with many points having high s_j values within each cluster, as well as high values for SC and SC_i (1 ≤ i ≤ k).

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 33 / 58
Iris K-means: Silhouette Coefficient Plot (k = 2)

[Silhouette coefficient plot for k = 2: s_i values in descending order per cluster; SC1 = 0.662 (n1 = 97), SC2 = 0.785 (n2 = 53).]

(a) k = 2, SC = 0.706

k = 2 yields the highest silhouette coefficient, with the two clusters essentially well
separated. C1 starts out with high si values, which gradually drop as we get to
border points. C2 is even better separated, since it has a higher silhouette
coefficient and the pointwise scores are all high, except for the last three points.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 34 / 58
Iris K-means: Silhouette Coefficient Plot (k = 3)
[Silhouette coefficient plot for k = 3: SC1 = 0.466 (n1 = 61), SC2 = 0.818 (n2 = 50), SC3 = 0.52 (n3 = 39).]

(b) k = 3, SC = 0.598

C1 from k = 2 has been split into two clusters for k = 3, namely C1 and C3 . Both
of these have many bordering points, whereas C2 is well separated with high
silhouette coefficients across all points.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 35 / 58
Iris K-means: Silhouette Coefficient Plot (k = 4)
[Silhouette coefficient plot for k = 4: SC1 = 0.376 (n1 = 49), SC2 = 0.534 (n2 = 28), SC3 = 0.787 (n3 = 50), SC4 = 0.484 (n4 = 23).]

(c) k = 4, SC = 0.559

C3 is the well separated cluster, corresponding to C2 (in k = 2 and k = 3), and the
remaining clusters are essentially subclusters of C1 for k = 2. Cluster C1 also has
two points with negative si values, indicating that they are probably misclustered.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 36 / 58
Relative Measures: Calinski–Harabasz Index

Given the dataset D = {x_i}_{i=1}^n, the scatter matrix for D is given as

S = n Σ = Σ_{j=1}^{n} (x_j − µ)(x_j − µ)^T

where µ = (1/n) Σ_{j=1}^{n} x_j is the mean and Σ is the covariance matrix. The scatter matrix can be decomposed into two matrices S = S_W + S_B, where S_W is the within-cluster scatter matrix and S_B is the between-cluster scatter matrix, given as

S_W = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} (x_j − µ_i)(x_j − µ_i)^T

S_B = Σ_{i=1}^{k} n_i (µ_i − µ)(µ_i − µ)^T

where µ_i = (1/n_i) Σ_{x_j ∈ C_i} x_j is the mean for cluster C_i.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 37 / 58
Relative Measures: Calinski–Harabasz Index

The Calinski–Harabasz (CH) variance ratio criterion for a given value of k is defined as follows:

CH(k) = [ tr(S_B)/(k − 1) ] / [ tr(S_W)/(n − k) ] = ((n − k)/(k − 1)) · tr(S_B)/tr(S_W)

where tr is the trace of the matrix.

We plot the CH values and look for a large increase in the value followed by little or no gain. We choose the value k ≥ 3 that minimizes the term

∆(k) = ( CH(k + 1) − CH(k) ) − ( CH(k) − CH(k − 1) )

The intuition is that we want to find the value of k for which CH(k) is much higher than CH(k − 1) and there is only a little improvement or a decrease in the CH(k + 1) value.
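A hedged sketch of this selection rule, assuming scikit-learn's KMeans and calinski_harabasz_score (the slides use the Iris principal components; the raw Iris data is used here for brevity, so the values will differ):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = load_iris().data
CH = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=50, random_state=0).fit_predict(X)
    CH[k] = calinski_harabasz_score(X, labels)

# Delta(k) = (CH(k+1) - CH(k)) - (CH(k) - CH(k-1)); pick the k (>= 3) minimizing it.
delta = {k: (CH[k + 1] - CH[k]) - (CH[k] - CH[k - 1]) for k in range(3, 9)}
print(min(delta, key=delta.get))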

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 38 / 58
Calinski–Harabasz Variance Ratio
CH ratio for various values of k on the Iris principal components data, using the
K-means algorithm, with the best results chosen from 200 runs.

[Plot of CH(k) versus k for k = 2, . . . , 9.]
The successive CH(k) and ∆(k) values are as follows:

k       2        3        4        5        6        7        8        9
CH(k)   570.25   692.40   717.79   683.14   708.26   700.17   738.05   728.63
∆(k)    –        −96.78   −60.03   59.78    −33.22   45.97    −47.30   –
∆(k) suggests k = 3 as the best (lowest) value.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 39 / 58
Relative Measures: Gap Statistic

The gap statistic compares the sum of intracluster weights W_in for different values of k with their expected values assuming no apparent clustering structure, which forms the null hypothesis.

Let C_k be the clustering obtained for a specified value of k. Let W_in^k(D) denote the sum of intracluster weights (over all clusters) for C_k on the input dataset D.

We would like to compute the probability of the observed W_in^k value under the null hypothesis. To obtain an empirical distribution for W_in, we resort to Monte Carlo simulations of the sampling process.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 40 / 58
Relative Measures: Gap Statistic

We generate t random samples, each comprising n points. Let R_i ∈ R^{n×d}, 1 ≤ i ≤ t, denote the ith sample, and let W_in^k(R_i) denote the sum of intracluster weights for a given clustering of R_i into k clusters.
From each sample dataset R_i, we generate clusterings for different values of k, and record the intracluster values W_in^k(R_i).
Let µ_W(k) and σ_W(k) denote the mean and standard deviation of these intracluster weights for each value of k. The gap statistic for a given k is then defined as

gap(k) = µ_W(k) − log W_in^k(D)

Choose k as follows:

k* = arg min_k { gap(k) ≥ gap(k + 1) − σ_W(k + 1) }
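A Monte Carlo sketch of the gap statistic, assuming scikit-learn's KMeans, a uniform reference distribution over the bounding box of the data, and log base 2 (these are implementation choices, not prescribed by the slides):

import numpy as np
from sklearn.cluster import KMeans

def intracluster_weight(X, labels):
    # W_in^k: sum over clusters of all pairwise intracluster distances.
    W = 0.0
    for c in np.unique(labels):
        P = X[labels == c]
        diff = P[:, None, :] - P[None, :, :]
        W += np.sqrt((diff ** 2).sum(-1)).sum() / 2.0   # each pair counted once
    return W

def gap_statistic(X, k_max=9, t=50, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gap, sigma = {}, {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        obs = np.log2(intracluster_weight(X, km.fit_predict(X)))
        ref = []
        for _ in range(t):
            R = rng.uniform(lo, hi, size=X.shape)        # null-hypothesis sample
            ref.append(np.log2(intracluster_weight(R, km.fit_predict(R))))
        gap[k] = np.mean(ref) - obs                      # mu_W(k) - log2 W_in^k(D)
        sigma[k] = np.std(ref)
    # smallest k with gap(k) >= gap(k+1) - sigma(k+1)
    return next((k for k in range(1, k_max) if gap[k] >= gap[k + 1] - sigma[k + 1]), k_max)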

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 41 / 58
Gap Statistic: Randomly Generated Data
A random sample of n = 150 points, which does not have any apparent cluster
structure.

[Scatter plot of the randomly generated data with the k = 3 clustering found by K-means.]

(a) Randomly generated data (k = 3)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 42 / 58
Gap Statistic: Intracluster Weights and Gap Values

We generate t = 200 random datasets, and compute both the expected intracluster weight µ_W(k) and the observed (Iris) intracluster weight, for each value of k. The observed W_in^k(D) values are smaller than the expected values µ_W(k).

[(b) Plot of the expected µ_W(k) and observed log2 W_in^k values versus k. (c) Plot of the gap statistic gap(k) versus k.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 43 / 58
Gap Statistic as a Function of k
k gap(k) σW (k) gap(k) − σW (k)
1 0.093 0.0456 0.047
2 0.346 0.0486 0.297
3 0.679 0.0529 0.626
4 0.753 0.0701 0.682
5 0.586 0.0711 0.515
6 0.715 0.0654 0.650
7 0.808 0.0611 0.746
8 0.680 0.0597 0.620
9 0.632 0.0606 0.571
The optimal value for the number of clusters is k = 4 because
gap(4) = 0.753 > gap(5) − σW (5) = 0.515

However, if we relax the gap test to be within two standard deviations, then the
optimal value is k = 3 because
gap(3) = 0.679 > gap(4) − 2σW (4) = 0.753 − 2 · 0.0701 = 0.613
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 44 / 58
Cluster Stability

The main idea behind cluster stability is that the clusterings obtained from several
datasets sampled from the same underlying distribution as D should be similar or
“stable.”
Stability can be used to find a good value for k, the correct number of clusters.
We generate t samples of size n by sampling from D with replacement. Let
Ck (D i ) denote the clustering obtained from sample D i , for a given value of k.
Next, we compare the distance between all pairs of clusterings Ck (D i ) and Ck (D j )
using several of the external cluster evaluation measures. From these values we
compute the expected pairwise distance for each value of k. Finally, the value k ∗
that exhibits the least deviation between the clusterings obtained from the
resampled datasets is the best choice for k because it exhibits the most stability.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 45 / 58
Clustering Stability Algorithm
ClusteringStability (A, t, k_max, D):
 1  n ← |D|
 2  for i = 1, 2, . . . , t do
 3      D_i ← sample n points from D with replacement
 4  for i = 1, 2, . . . , t do
 5      for k = 2, 3, . . . , k_max do
 6          C_k(D_i) ← cluster D_i into k clusters using algorithm A
 7  foreach pair D_i, D_j with j > i do
 8      D_ij ← D_i ∩ D_j                              // create common dataset
 9      for k = 2, 3, . . . , k_max do
10          d_ij(k) ← d( C_k(D_i), C_k(D_j), D_ij )   // distance between clusterings
11  for k = 2, 3, . . . , k_max do
12      µ_d(k) ← ( 2/(t(t − 1)) ) Σ_{i=1}^{t} Σ_{j>i} d_ij(k)
13  k* ← arg min_k µ_d(k)
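A rough Python rendering of this procedure, assuming scikit-learn's KMeans and using the Fowlkes–Mallows score as the pairwise similarity (so the best k maximizes the expected value rather than minimizing a distance); keeping only the unique bootstrap indices is a simplification of D_i:

import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def clustering_stability(X, t=20, k_max=9, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # bootstrap samples, kept as sorted unique index arrays
    samples = [np.unique(rng.integers(0, n, n)) for _ in range(t)]
    mu = {}
    for k in range(2, k_max + 1):
        labels = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
                  for idx in samples]
        scores = []
        for (i, a), (j, b) in combinations(enumerate(samples), 2):
            common = np.intersect1d(a, b)                # D_ij = D_i ∩ D_j
            la = labels[i][np.searchsorted(a, common)]   # labels of the common points in C_k(D_i)
            lb = labels[j][np.searchsorted(b, common)]
            scores.append(fowlkes_mallows_score(la, lb))
        mu[k] = np.mean(scores)
    return max(mu, key=mu.get)   # FM is a similarity, so the most stable k maximizes mu(k)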
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 46 / 58
Clustering Stability: Iris Data
t = 500 bootstrap samples; best K-means from 100 runs

Both the Variation of Information and the Fowlkes-Mallows measures indicate that
k = 2 is the best value. VI indicates the least expected distance between pairs of
clusterings, and FM indicates the most expected similarity between clusterings.

[Plot of the expected pairwise values versus k: µ_s(k) for FM (similarity) and µ_d(k) for VI (distance).]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 47 / 58
Clustering Tendency: Spatial Histogram

Clustering tendency or clusterability aims to determine whether the dataset D has


any meaningful groups to begin with.
Let X_1, X_2, . . . , X_d denote the d dimensions. Given b, the number of bins for each dimension, we divide each dimension X_j into b equi-width bins, and simply count how many points lie in each of the b^d d-dimensional cells.
From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset D:

f(i) = P(x_j ∈ cell i) = |{x_j ∈ cell i}| / n

where i = (i_1, i_2, . . . , i_d) denotes a cell index, with i_j denoting the bin index along dimension X_j.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 48 / 58
Clustering Tendency: Spatial Histogram

We generate t random samples, each comprising n points within the same


d-dimensional space as the input dataset D. Let R j denote the jth such random
sample. We then compute the corresponding EPMF gj (i ) for each R j , 1 ≤ j ≤ t.
We next compute how much the distribution f differs from gj (for j = 1, . . . , t),
using the Kullback–Leibler (KL) divergence from f to gj , defined as
 
KL(f | g_j) = Σ_i f(i) log( f(i) / g_j(i) )

The KL divergence is zero only when f and gj are the same distributions. Using
these divergence values, we can compute how much the dataset D differs from a
random dataset.
Its main limitation is that the number of cells (b^d) increases exponentially with
the dimensionality, and, with a fixed sample size n, most of the cells will have
none or one point, making it hard to estimate the divergence. The method is also
sensitive to the choice of parameter b.
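A hedged NumPy sketch of the spatial-histogram test: build the EPMF with np.histogramdd, then compare against uniform samples via the KL divergence (the epsilon smoothing for empty reference cells is an implementation choice, not part of the slides):

import numpy as np

def spatial_histogram_kl(X, b=5, t=500, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], b + 1) for j in range(X.shape[1])]

    def epmf(P):
        H, _ = np.histogramdd(P, bins=edges)     # counts over the b^d spatial cells
        return (H / len(P)).ravel()

    f = epmf(X)                                  # EPMF of the input data D
    kl = []
    for _ in range(t):
        g = epmf(rng.uniform(lo, hi, size=X.shape))          # EPMF of a random sample R_j
        mask = f > 0                                         # cells with f(i) = 0 contribute 0
        kl.append(np.sum(f[mask] * np.log2(f[mask] / (g[mask] + eps))))   # KL(f || g_j)
    return np.mean(kl), np.std(kl)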

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 49 / 58
Spatial Histogram: Iris PCA Data versus Uniform
Uniform has n = 150 points

[(a) Iris PCA data (axes u1, u2) overlaid with the 5 × 5 grid of spatial cells. (b) Uniform sample with the same spatial cells.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 50 / 58
Spatial Histogram: Empirical PMF
Using b = 5 bins per dimension results in 25 spatial cells.

[(c) Empirical probability mass functions over the 25 spatial cells: Iris (f) versus Uniform (g_j).]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 51 / 58
Spatial Histogram: KL Divergence Distribution

[(d) Distribution of the KL divergence values over the random samples.]

We generated t = 500 random samples from the null distribution, and computed
the KL divergence from f to gj for each 1 ≤ j ≤ t.
The mean KL value is µ_KL = 1.17, with a standard deviation of σ_KL = 0.18; that is, the Iris PCA data differs markedly from the null distribution and is clusterable.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 52 / 58
Clustering Tendency: Distance Distribution

We can compare the pairwise point distances from D, with those from the
randomly generated samples R i from the null distribution.
We create the EPMF from the proximity matrix W for D by binning the distances into b bins:

f(i) = P(w_pq ∈ bin i | x_p, x_q ∈ D, p < q) = |{w_pq ∈ bin i}| / ( n(n − 1)/2 )

Likewise, for each of the samples R j , we determine the EPMF for the pairwise
distances, denoted gj .
Finally, we compute the KL divergences between f and gj . The expected
divergence indicates the extent to which D differs from the null (random)
distribution.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 53 / 58
Iris PCA Data × Uniform: Distance Distribution

The distance distribution is obtained by binning the edge weights between all pairs
of points using b = 25 bins.

[(a) Pairwise distance distributions: Iris (f) versus Uniform (g_j), using b = 25 bins.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 54 / 58
Iris PCA Data × Uniform: Distance Distribution
We compute the KL divergence from D to each Rj , over t = 500 samples. The
mean divergence is µKL = 0.18, with standard deviation σKL = 0.017. Even though
the Iris dataset has a good clustering tendency, the KL divergence is not very large.

[(b) Distribution of the KL divergence values over the t = 500 samples.]

We conclude that, at least for the Iris dataset, the distance distribution is not as
discriminative as the spatial histogram approach for clusterability analysis.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 55 / 58
Clustering Tendency: Hopkins Statistic
Given a dataset D comprising n points, we generate t uniform subsamples R i of
m points each, sampled from the same dataspace as D.
We also generate t subsamples of m points directly from D, using sampling
without replacement. Let D i denote the ith direct subsample.
Next, we compute the minimum distance between each point x_j ∈ D_i and points in D:

δ_min(x_j) = min_{x_i ∈ D, x_i ≠ x_j} { ‖x_j − x_i‖ }

We also compute the minimum distance δ_min(y_j) between a point y_j ∈ R_i and points in D.
The Hopkins statistic (in d dimensions) for the ith pair of samples R_i and D_i is then defined as

HS_i = Σ_{y_j ∈ R_i} (δ_min(y_j))^d / ( Σ_{y_j ∈ R_i} (δ_min(y_j))^d + Σ_{x_j ∈ D_i} (δ_min(x_j))^d )

If the data is well clustered we expect the δ_min(x_j) values to be smaller compared to the δ_min(y_j) values, and in this case HS_i tends to 1.
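A sketch of the Hopkins statistic using a SciPy k-d tree for the nearest-neighbor queries (cKDTree, the subsample size m, and the seeding are assumptions):

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=30, t=500, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    tree = cKDTree(X)
    hs = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))             # uniform sample R_i from the dataspace
        Di = X[rng.choice(n, size=m, replace=False)]     # subsample D_i drawn from D itself
        d_rand, _ = tree.query(R, k=1)                   # delta_min(y_j): nearest point of D
        d_data, _ = tree.query(Di, k=2)                  # k=2 so the point itself is skipped
        num = np.sum(d_rand ** d)
        hs.append(num / (num + np.sum(d_data[:, 1] ** d)))
    return np.mean(hs), np.std(hs)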
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 56 / 58
Iris PCA Data × Uniform: Hopkins Statistic Distribution
Number of sample pairs t = 500, subsample size m = 30.

[Histogram of the Hopkins statistic over the t = 500 sample pairs.]

The Hopkins statistic has µHS = 0.935 and σHS = 0.025.


Given the high value of the statistic, we conclude that the Iris dataset has a good
clustering tendency.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 57 / 58
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 17: Clustering Validation

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 17: Clustering Validation 58 / 58
