Data Mining and Machine Learning: Fundamental Concepts and Algorithms
1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Clustering Validation and Evaluation
External Measures
External Measures
External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters.
All of the external measures rely on the r × k contingency table N induced by a clustering C and the ground-truth partitioning T, where the count nij denotes the number of points common to cluster Ci and ground-truth partition Tj.
Let ni = |Ci | denote the number of points in cluster Ci , and let mj = |Tj | denote
the number of points in partition Tj .
The contingency table can be computed from T and C in O(n) time by examining the partition and cluster labels, yi and ŷi, for each point xi ∈ D and incrementing the corresponding count n_{ŷi, yi}.
Matching Based Measures: Purity
Purity quantifies the extent to which a cluster Ci contains entities from only one
partition:
purity_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \}
The purity of clustering C is defined as the weighted sum of the clusterwise purity
values:
purity = \sum_{i=1}^{r} \frac{n_i}{n} \, purity_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k} \{ n_{ij} \}
where the ratio n_i/n denotes the fraction of points in cluster Ci.
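Purity can be computed directly from the contingency table. A minimal NumPy sketch, using the good-case K-means contingency table for the Iris data that appears later in these slides:

```python
import numpy as np

# Contingency table N (rows = clusters C_i, columns = ground-truth partitions T_j),
# taken from the good-case K-means clustering of the Iris PCA data.
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])

n = N.sum()                                      # total number of points
cluster_purity = N.max(axis=1) / N.sum(axis=1)   # purity_i = max_j n_ij / n_i
purity = (N.sum(axis=1) / n) @ cluster_purity    # weighted by cluster fractions n_i / n

print(cluster_purity)   # per-cluster purity values
print(purity)           # overall purity = 133/150, about 0.887 for this table
```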
Matching Based Measures: Maximum Matching
The maximum matching measure selects the mapping between clusters and
partitions, such that the sum of the number of common points (nij ) is maximized,
provided that only one cluster can match with a given partition.
Let G be a bipartite graph over the vertex set V = C ∪ T , and let the edge set be
E = {(Ci , Tj )} with edge weights w (Ci , Tj ) = nij . A matching M in G is a subset
of E , such that the edges in M are pairwise nonadjacent, that is, they do not have
a common vertex.
The maximum weight matching in G is given as
match = \arg\max_{M} \left\{ \frac{w(M)}{n} \right\}
where w(M) is the sum of all the edge weights in matching M, given as w(M) = \sum_{e \in M} w(e).
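The maximum weight matching can be found with the Hungarian algorithm. A sketch using SciPy's linear_sum_assignment on the same contingency table (the use of SciPy is an assumption of this sketch; for this particular table the optimal matching happens to coincide with the clusterwise maxima):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

N = np.array([[ 0, 47, 14],     # rows = clusters, columns = partitions
              [50,  0,  0],
              [ 0,  3, 36]])

# Hungarian algorithm: one-to-one assignment of clusters to partitions
rows, cols = linear_sum_assignment(N, maximize=True)
w_M = N[rows, cols].sum()        # w(M), total weight of the matching
print(list(zip(rows, cols)))     # matched (cluster, partition) pairs
print(w_M / N.sum())             # match = w(M)/n = 133/150, about 0.887 here
```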
Matching Based Measures: F-measure
Given cluster Ci, let ji denote the partition that contains the maximum number of points from Ci, that is, j_i = \arg\max_{j=1}^{k} \{ n_{ij} \}.
The precision of a cluster Ci is the same as its purity:
prec_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \} = \frac{n_{i j_i}}{n_i}
The recall of cluster Ci is defined as
recall_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}
Matching Based Measures: F-measure
The F-measure of cluster Ci is the harmonic mean of its precision and recall values:
F_i = \frac{2 \cdot prec_i \cdot recall_i}{prec_i + recall_i} = \frac{2\, n_{i j_i}}{n_i + m_{j_i}}
The F-measure for the clustering C is the mean of the clusterwise F-measure values:
F = \frac{1}{r} \sum_{i=1}^{r} F_i
K-means: Iris Principal Components Data
(Figures: scatter plots of the K-means clusterings on the Iris principal components data, with clusters drawn as squares, circles, and triangles; plot details omitted.)
Good case, contingency table:
                 T1 (iris-setosa)  T2 (iris-versicolor)  T3 (iris-virginica)    ni
C1 (squares)             0                  47                    14             61
C2 (circles)            50                   0                     0             50
C3 (triangles)           0                   3                    36             39
mj                      50                  50                    50          n = 150

Bad case, contingency table:
                 T1 (iris-setosa)  T2 (iris-versicolor)  T3 (iris-virginica)    ni
C1 (squares)            30                   0                     0             30
C2 (circles)            20                   4                     0             24
C3 (triangles)           0                  46                    50             96
mj                      50                  50                    50          n = 150
Entropy-based Measures
The entropy of a clustering C and of a partitioning T are defined as
H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i} \qquad H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}
where p_{C_i} = n_i/n and p_{T_j} = m_j/n are the probabilities of cluster Ci and partition Tj.
Entropy-based Measures: Conditional Entropy
The conditional entropy of T given the clustering C is defined as
H(T|C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T|C_i) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i}} = H(C,T) - H(C)
where p_{ij} = n_{ij}/n is the probability that a point belongs to cluster Ci and to partition Tj, and where H(C,T) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij} is the joint entropy of C and T.
Entropy-based Measures: Normalized Mutual Information
The mutual information tries to quantify the amount of shared information
between the clustering C and partitioning T , and it is defined as
I(C,T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \left( \frac{p_{ij}}{p_{C_i} \cdot p_{T_j}} \right)
When C and T are independent then pij = pCi · pTj , and thus I (C, T ) = 0.
However, there is no upper bound on the mutual information.
The normalized mutual information (NMI) is defined as the geometric mean:
NMI(C,T) = \sqrt{ \frac{I(C,T)}{H(C)} \cdot \frac{I(C,T)}{H(T)} } = \frac{I(C,T)}{\sqrt{H(C) \cdot H(T)}}
The NMI value lies in the range [0, 1]. Values close to 1 indicate a good clustering.
Entropy-based Measures: Variation of Information
This criterion is based on the mutual information between the clustering C and the ground-truth partitioning T, and their entropy; it is defined as
VI(C,T) = H(T|C) + H(C|T) = H(C) + H(T) - 2\, I(C,T)
Variation of information (VI) is zero only when C and T are identical. Thus, the
lower the VI value the better the clustering C.
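A NumPy sketch that computes I(C,T), NMI, and VI from a contingency table. Base-2 logarithms are an arbitrary choice of this sketch (NMI is independent of the base, while the VI value is not); the table is again the good-case Iris table from the earlier example.

```python
import numpy as np

N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]], dtype=float)
p = N / N.sum()            # joint probabilities p_ij
p_C = p.sum(axis=1)        # cluster marginals p_Ci
p_T = p.sum(axis=0)        # partition marginals p_Tj

def H(q):                  # Shannon entropy (base 2), ignoring zero entries
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

nz = p > 0                 # treat 0 * log 0 as 0
I = (p[nz] * np.log2(p[nz] / np.outer(p_C, p_T)[nz])).sum()

NMI = I / np.sqrt(H(p_C) * H(p_T))
VI = H(p_C) + H(p_T) - 2 * I
print(I, NMI, VI)
```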
K-means: Iris Principal Components Data
Good Case
(Figure: two side-by-side scatter plots of the Iris principal components data with cluster memberships indicated by symbol; plot details omitted.)
Pairwise Measures
Given clustering C and ground-truth partitioning T, let xi, xj ∈ D be any two points, with i ≠ j. Let yi denote the true partition label and let ŷi denote the cluster label for point xi.
If both xi and xj belong to the same cluster, that is, ŷi = ŷj, we call it a positive event, and if they do not belong to the same cluster, that is, ŷi ≠ ŷj, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:
True Positives: xi and xj belong to the same partition in T, and they are also in the same cluster in C. The number of true positive pairs is given as
TP = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
False Negatives: xi and xj belong to the same partition in T, but they do not belong to the same cluster in C. The number of all false negative pairs is given as
FN = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|
Pairwise Measures
False Positives: xi and xj do not belong to the same partition in T, but they do belong to the same cluster in C. The number of false positive pairs is given as
FP = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
True Negatives: xi and xj neither belong to the same partition in T, nor do they belong to the same cluster in C. The number of such true negative pairs is given as
TN = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|
Because there are N = \binom{n}{2} = \frac{n(n-1)}{2} pairs of points, we have the following identity:
N = TP + FN + FP + TN
Pairwise Measures: TP, TN, FP, FN
They can be computed efficiently using the contingency table N = {nij }. The
number of true positives is given as
TP = \frac{1}{2} \left( \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 \; - \; n \right)
The other counts follow from the row and column sums of the contingency table:
FN = \frac{1}{2} \left( \sum_{j=1}^{k} m_j^2 - n \right) - TP \qquad FP = \frac{1}{2} \left( \sum_{i=1}^{r} n_i^2 - n \right) - TP \qquad TN = N - TP - FN - FP
Pairwise Measures: Jaccard Coefficient, Rand Statistic
Jaccard Coefficient: measures the fraction of true positive point pairs, but after ignoring the true negatives:
Jaccard = \frac{TP}{TP + FN + FP}
Rand Statistic: measures the fraction of true positives and true negatives over all point pairs:
Rand = \frac{TP + TN}{N}
Pairwise Measures: FM Measure
The Fowlkes-Mallows (FM) measure is the geometric mean of the pairwise precision and recall, where prec = TP/(TP + FP) and recall = TP/(TP + FN):
FM = \sqrt{prec \cdot recall} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}
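All four pair counts, and the measures built from them, follow directly from the contingency table. A NumPy sketch using the good-case Iris table; the values in the comments are simply what the arithmetic gives for this particular table.

```python
import numpy as np
from math import comb

N = np.array([[ 0, 47, 14],          # rows = clusters, columns = partitions
              [50,  0,  0],
              [ 0,  3, 36]])
n = int(N.sum())
num_pairs = comb(n, 2)               # N = n(n-1)/2 point pairs

TP = (int((N ** 2).sum()) - n) // 2                       # (sum_ij n_ij^2 - n) / 2
FN = sum(comb(int(m), 2) for m in N.sum(axis=0)) - TP     # same partition, different cluster
FP = sum(comb(int(c), 2) for c in N.sum(axis=1)) - TP     # same cluster, different partition
TN = num_pairs - TP - FN - FP

jaccard = TP / (TP + FN + FP)
rand = (TP + TN) / num_pairs
FM = TP / np.sqrt((TP + FN) * (TP + FP))

print(TP, FN, FP, TN)        # 3030 645 766 6734
print(jaccard, rand, FM)     # about 0.68, 0.87, 0.81
```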
K-means: Iris Principal Components Data
Good Case
(Figure: scatter plot of the good-case K-means clustering on the Iris principal components data; plot details omitted.)
Contingency table:
        T1 (setosa)   T2 (versicolor)   T3 (virginica)
C1           0               47                14
C2          50                0                 0
C3           0                3                36
Correlation Measures: Hubert statistic
Let X and Y be two symmetric n × n matrices, and let N = \binom{n}{2}. Let x, y ∈ R^N denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of X and Y.
Let μX denote the element-wise mean of x, given as
\mu_X = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j)
with μY defined analogously for y. The Hubert statistic is the averaged element-wise product of X and Y:
\Gamma = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) \cdot Y(i,j) = \frac{1}{N} x^T y
The normalized Hubert statistic is the correlation between x and y, that is, the cosine of the angle between the centered vectors z_x = x - \mu_X \mathbf{1} and z_y = y - \mu_Y \mathbf{1}:
\Gamma_n = \frac{z_x^T z_y}{\|z_x\| \cdot \|z_y\|} = \cos\theta
Correlation-based Measure: Discretized Hubert Statistic
Let T and C be the n × n matrices defined as
T(i,j) = \begin{cases} 1 & \text{if } y_i = y_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad C(i,j) = \begin{cases} 1 & \text{if } \hat{y}_i = \hat{y}_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases}
The normalized version of the discretized Hubert statistic is simply the correlation between t and c:
\Gamma_n = \frac{z_t^T z_c}{\|z_t\| \cdot \|z_c\|} = \frac{\frac{TP}{N} - \mu_T \mu_C}{\sqrt{\mu_T \mu_C (1 - \mu_T)(1 - \mu_C)}}
where \mu_T = \frac{TP + FN}{N} and \mu_C = \frac{TP + FP}{N}.
Internal Measures
Internal measures are based on the n × n matrix of pairwise point distances
W = \left\{ \|x_i - x_j\| \right\}_{i,j=1}^{n}
which can also be viewed as the weighted adjacency matrix of a complete graph G over the n points.
Internal Measures
The clustering C can be considered as a k-way cut in G . Given any subsets
S, R ⊂ V , define W (S, R) as the sum of the weights on all edges with one vertex
in S and the other in R, given as
W(S,R) = \sum_{x_i \in S} \sum_{x_j \in R} w_{ij}
Clusterings as Graphs: Iris
Only intracluster edges shown.
Good clustering: (figure of the clustering drawn as a graph over the Iris principal components data; plot details omitted)
Bad clustering: (figure of the clustering drawn as a graph over the Iris principal components data; plot details omitted)
Internal Measures: BetaCV and C-index
BetaCV Measure: The BetaCV measure is the ratio of the mean intracluster
distance to the mean intercluster distance:
BetaCV = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{W_{in}}{W_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}
where W_in and N_in are the sum and the number of intracluster distances (point pairs), and W_out and N_out are the sum and the number of intercluster distances. The smaller the BetaCV ratio, the better the clustering.
C-index: Let Wmin(Nin) be the sum of the smallest Nin distances in the proximity matrix W, where Nin is the total number of intracluster edges, or point pairs. Let Wmax(Nin) be the sum of the largest Nin distances in W.
The C-index measures to what extent the clustering puts together the Nin points that are the closest across the k clusters. It is defined as
C\text{-index} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}
The C-index lies in the range [0, 1]. The smaller the C-index, the better the
clustering.
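A NumPy sketch that computes both measures from a data matrix and a label vector; the data and labels at the bottom are synthetic placeholders, included only so the snippet runs on its own, and the sketch assumes at least one intracluster and one intercluster pair.

```python
import numpy as np

def betacv_cindex(X, labels):
    """BetaCV and C-index from an n x d data matrix X and a 1-d NumPy label array."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    iu = np.triu_indices(n, k=1)                                 # each pair counted once
    same = (labels[:, None] == labels[None, :])[iu]              # intracluster pair mask
    d = D[iu]

    W_in, N_in = d[same].sum(), same.sum()          # intracluster weight and pair count
    W_out, N_out = d[~same].sum(), (~same).sum()    # intercluster weight and pair count

    betacv = (W_in / N_in) / (W_out / N_out)
    d_sorted = np.sort(d)
    c_index = (W_in - d_sorted[:N_in].sum()) / (d_sorted[-N_in:].sum() - d_sorted[:N_in].sum())
    return betacv, c_index

# Illustration only: random data with arbitrary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
labels = rng.integers(0, 3, size=60)
print(betacv_cindex(X, labels))
```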
Internal Measures: Normalized Cut and Modularity
Normalized Cut Measure: The normalized cut objective for graph clustering can
also be used as an internal clustering evaluation measure:
NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{vol(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}
where vol(Ci) = W(Ci, V) is the volume of cluster Ci and \overline{C_i} = V \setminus C_i is its complement. The higher the normalized cut value the better.
Modularity Measure: The modularity of the clustering is given as
Q = \sum_{i=1}^{k} \left( \frac{W(C_i, C_i)}{W(V,V)} - \left( \frac{W(C_i, V)}{W(V,V)} \right)^2 \right)
Because the edge weights here are distances, the smaller the modularity value, the better the clustering.
Internal Measures: Dunn Index
The Dunn index is defined as the ratio between the minimum distance between
point pairs from different clusters and the maximum distance between point pairs
from the same cluster
Dunn = \frac{W_{out}^{min}}{W_{in}^{max}}
where W_{out}^{min} is the minimum intercluster distance:
W_{out}^{min} = \min_{i,\ j > i} \left\{ w_{ab} \mid x_a \in C_i,\ x_b \in C_j \right\}
and W_{in}^{max} is the maximum intracluster distance:
W_{in}^{max} = \max_{i} \left\{ w_{ab} \mid x_a, x_b \in C_i \right\}
The larger the Dunn index the better the clustering because it means even the
closest distance between points in different clusters is much larger than the
farthest distance between points in the same cluster.
Internal Measures: Davies-Bouldin Index
Let µi denote the cluster mean
\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j
Let σµi denote the dispersion or spread of the points around the cluster mean
\sigma_{\mu_i} = \sqrt{ \frac{ \sum_{x_j \in C_i} \delta(x_j, \mu_i)^2 }{ n_i } } = \sqrt{ \mathrm{var}(C_i) }
The Davies–Bouldin measure for a pair of clusters Ci and Cj is defined as the ratio
DB_{ij} = \frac{ \sigma_{\mu_i} + \sigma_{\mu_j} }{ \delta(\mu_i, \mu_j) }
DB ij measures how compact the clusters are compared to the distance between
the cluster means. The Davies–Bouldin index is then defined as
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \{ DB_{ij} \}
The smaller the DB value, the better the clustering.
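A direct NumPy sketch of the index following the formulas above; X is an n x d array and labels a length-n vector (scikit-learn's davies_bouldin_score computes an equivalent quantity).

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB = (1/k) * sum_i max_{j != i} (sigma_i + sigma_j) / dist(mu_i, mu_j)."""
    ks = np.unique(labels)
    mu = np.array([X[labels == c].mean(axis=0) for c in ks])     # cluster means
    sigma = np.array([np.sqrt(((X[labels == c] - m) ** 2).sum(axis=1).mean())
                      for c, m in zip(ks, mu)])                  # spread around each mean
    k = len(ks)
    total = 0.0
    for i in range(k):
        total += max((sigma[i] + sigma[j]) / np.linalg.norm(mu[i] - mu[j])
                     for j in range(k) if j != i)
    return total / k
```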
Silhouette Coefficient
Define the silhouette coefficient of a point xi as
s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\left\{ \mu_{out}^{min}(x_i),\ \mu_{in}(x_i) \right\}}
where μin(xi) is the mean distance from xi to points in its own cluster ŷi:
\mu_{in}(x_i) = \frac{ \sum_{x_j \in C_{\hat{y}_i},\ j \neq i} \delta(x_i, x_j) }{ n_{\hat{y}_i} - 1 }
and μout^min(xi) is the mean of the distances from xi to points in the closest other cluster:
\mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i} \left\{ \frac{ \sum_{y \in C_j} \delta(x_i, y) }{ n_j } \right\}
The si value lies in the interval [−1, +1]. A value close to +1 indicates that x i is much
closer to points in its own cluster, a value close to zero indicates x i is close to the
boundary, and a value close to −1 indicates that x i is much closer to another cluster,
and therefore may be mis-clustered.
The silhouette coefficient is the mean s_i value: SC = \frac{1}{n} \sum_{i=1}^{n} s_i. A value close to +1 indicates a good clustering.
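A NumPy sketch of the pointwise silhouette values and their mean SC, assuming every cluster contains at least two points; scikit-learn's silhouette_score yields the same mean under the Euclidean metric.

```python
import numpy as np

def silhouette(X, labels):
    """Return the pointwise silhouette values s_i and the mean coefficient SC."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    ks = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        mu_in = D[i, own].sum() / (own.sum() - 1)                # excludes d(x_i, x_i) = 0
        mu_out = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```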
Iris Data: Good vs. Bad Clustering
(Figure: side-by-side scatter plots of a good and a bad clustering of the Iris principal components data; plot details omitted.)
Relative Measures: Silhouette Coefficient
The silhouette coefficient sj of each point, together with the average SC value, can be used to estimate the number of clusters in the data.
The approach consists of plotting the sj values in descending order for each
cluster, and to note the overall SC value for a particular value of k, as well as
clusterwise SC values:
SC_i = \frac{1}{n_i} \sum_{x_j \in C_i} s_j
We then pick the value k that yields the best clustering, with many points having
high sj values within each cluster, as well as high values for SC and SCi
(1 ≤ i ≤ k).
Iris K-means: Silhouette Coefficient Plot (k = 2)
(Figure: silhouette coefficient plot, with pointwise si values shown in descending order within each cluster.)
SC1 = 0.662 (n1 = 97), SC2 = 0.785 (n2 = 53)
(a) k = 2, SC = 0.706
k = 2 yields the highest silhouette coefficient, with the two clusters essentially well
separated. C1 starts out with high si values, which gradually drop as we get to
border points. C2 is even better separated, since it has a higher silhouette
coefficient and the pointwise scores are all high, except for the last three points.
Iris K-means: Silhouette Coefficient Plot (k = 3)
(Figure: silhouette coefficient plot.)
SC1 = 0.466 (n1 = 61), SC2 = 0.818 (n2 = 50), SC3 = 0.52 (n3 = 39)
(b) k = 3, SC = 0.598
C1 from k = 2 has been split into two clusters for k = 3, namely C1 and C3 . Both
of these have many bordering points, whereas C2 is well separated with high
silhouette coefficients across all points.
Iris K-means: Silhouette Coefficient Plot (k = 4)
(Figure: silhouette coefficient plot.)
SC1 = 0.376 (n1 = 49), SC2 = 0.534 (n2 = 28), SC3 = 0.787 (n3 = 50), SC4 = 0.484 (n4 = 23)
(c) k = 4, SC = 0.559
C3 is the well separated cluster, corresponding to C2 (in k = 2 and k = 3), and the
remaining clusters are essentially subclusters of C1 for k = 2. Cluster C1 also has
two points with negative si values, indicating that they are probably misclustered.
Relative Measures: Calinski–Harabasz Index
The total scatter matrix for the dataset is given as
S = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^T
where \mu = \frac{1}{n} \sum_{j=1}^{n} x_j is the dataset mean and S = n\Sigma, with \Sigma the covariance matrix. The scatter matrix can be decomposed into two matrices S = S_W + S_B, where S_W is the within-cluster scatter matrix and S_B is the between-cluster scatter matrix, given as
S_W = \sum_{i=1}^{k} \sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T
S_B = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T
where \mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j is the mean for cluster Ci.
Relative Measures: Calinski–Harabasz Index
The Calinski-Harabasz variance ratio for a given value of k is defined as
CH(k) = \frac{tr(S_B)/(k-1)}{tr(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{tr(S_B)}{tr(S_W)}
The intuition is that we want to find the value of k for which CH(k) is much higher than CH(k-1) and there is only a little improvement or a decrease in the CH(k+1) value. One way to formalize this is to choose the k that minimizes the pairwise difference
\Delta(k) = \big( CH(k+1) - CH(k) \big) - \big( CH(k) - CH(k-1) \big)
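Because only the traces of S_W and S_B are needed, CH(k) reduces to sums of squared distances. A sketch, together with a helper for the Delta(k) differences mentioned above; X is an n x d array and labels a length-n label vector.

```python
import numpy as np

def ch_index(X, labels):
    """CH(k) = ((n - k)/(k - 1)) * tr(S_B) / tr(S_W)."""
    ks = np.unique(labels)
    n, k = len(X), len(ks)
    mu = X.mean(axis=0)
    tr_SW = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in ks)
    tr_SB = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - mu) ** 2).sum() for c in ks)
    return (n - k) / (k - 1) * tr_SB / tr_SW

def ch_delta(CH):
    """Delta(k) = (CH(k+1) - CH(k)) - (CH(k) - CH(k-1)) for a dict {k: CH(k)}."""
    return {k: (CH[k + 1] - CH[k]) - (CH[k] - CH[k - 1])
            for k in CH if k - 1 in CH and k + 1 in CH}
```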
Calinski–Harabasz Variance Ratio
CH ratio for various values of k on the Iris principal components data, using the
K-means algorithm, with the best results chosen from 200 runs.
(Figure: plot of the CH ratio against k = 2, ..., 9.)
The successive CH(k) and ∆(k) values are as follows:
k 2 3 4 5 6 7 8 9
CH(k) 570.25 692.40 717.79 683.14 708.26 700.17 738.05 728.63
∆(k) – −96.78 −60.03 59.78 −33.22 45.97 −47.30 –
∆(k) suggests k = 3 as the best (lowest) value.
Relative Measures: Gap Statistic
The gap statistic compares the sum of intracluster weights Win for different values
of k with their expected values assuming no apparent clustering structure, which
forms the null hypothesis.
Let Ck be the clustering obtained for a specified value of k, and let W_{in}^{k}(D) denote the sum of intracluster weights (over all clusters) for Ck on the input dataset D.
We would like to compute the probability of the observed W_{in}^{k} value under the null hypothesis. To obtain an empirical distribution for W_in, we resort to Monte Carlo simulations of the sampling process.
Relative Measures: Gap Statistic
Let μ_W(k) and σ_W(k) denote the mean and standard deviation of log_2 W_{in}^{k} over the t Monte Carlo samples. The gap statistic is then
gap(k) = \mu_W(k) - \log_2 W_{in}^{k}(D)
Choose k as the smallest value whose gap is within one standard deviation of the gap at k + 1:
k^* = \arg\min_{k} \left\{ k \mid gap(k) \geq gap(k+1) - \sigma_W(k+1) \right\}
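A Monte Carlo sketch of the gap statistic. The use of scikit-learn's KMeans as the clustering algorithm, base-2 logarithms (as in the figures that follow), and a uniform null distribution over the bounding box of the data are all choices of this sketch rather than requirements.

```python
import numpy as np
from sklearn.cluster import KMeans

def intracluster_weight(X, labels):
    """W_in: sum of pairwise distances over all intracluster point pairs."""
    total = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        total += np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1).sum() / 2
    return total

def gap_statistic(X, k, t=200, seed=0):
    """Return gap(k) and sigma_W(k) against a uniform null over the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    obs = np.log2(intracluster_weight(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
    null = np.array([
        np.log2(intracluster_weight(R, KMeans(n_clusters=k, n_init=10).fit_predict(R)))
        for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))
    ])
    return null.mean() - obs, null.std()
```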
Gap Statistic: Randomly Generated Data
A random sample of n = 150 points, which does not have any apparent cluster
structure.
(Figure: scatter plot of the randomly generated data, clustered with K-means using k = 3.)
(a) Randomly generated data (k = 3)
Gap Statistic: Intracluster Weights and Gap Values
We generate t = 200 random datasets and compute the expected intracluster weight μW(k) for each value of k, which we compare against the observed intracluster weight W_{in}^{k}(D) on the Iris data. The observed W_{in}^{k}(D) values are smaller than the expected values μW(k).
(Figure: (b) expected μW(k) versus observed log2 W_{in}^{k} intracluster weights; (c) the gap statistic gap(k) as a function of k.)
Gap Statistic as a Function of k
k gap(k) σW (k) gap(k) − σW (k)
1 0.093 0.0456 0.047
2 0.346 0.0486 0.297
3 0.679 0.0529 0.626
4 0.753 0.0701 0.682
5 0.586 0.0711 0.515
6 0.715 0.0654 0.650
7 0.808 0.0611 0.746
8 0.680 0.0597 0.620
9 0.632 0.0606 0.571
The optimal value for the number of clusters is k = 4 because
gap(4) = 0.753 > gap(5) − σW (5) = 0.515
However, if we relax the gap test to be within two standard deviations, then the
optimal value is k = 3 because
gap(3) = 0.679 > gap(4) − 2σW (4) = 0.753 − 2 · 0.0701 = 0.613
Cluster Stability
The main idea behind cluster stability is that the clusterings obtained from several
datasets sampled from the same underlying distribution as D should be similar or
“stable.”
Stability can be used to find a good value for k, the correct number of clusters.
We generate t samples of size n by sampling from D with replacement. Let
Ck (D i ) denote the clustering obtained from sample D i , for a given value of k.
Next, we compare the distance between all pairs of clusterings Ck (D i ) and Ck (D j )
using several of the external cluster evaluation measures. From these values we
compute the expected pairwise distance for each value of k. Finally, the value k ∗
that exhibits the least deviation between the clusterings obtained from the
resampled datasets is the best choice for k because it exhibits the most stability.
Clustering Stability Algorithm
ClusteringStability (A, t, k_max, D):
  n ← |D|
  for i = 1, 2, ..., t do
      D_i ← sample n points from D with replacement
  for i = 1, 2, ..., t do
      for k = 2, 3, ..., k_max do
          C_k(D_i) ← cluster D_i into k clusters using algorithm A
  foreach pair D_i, D_j with j > i do
      D_ij ← D_i ∩ D_j                                  // create common dataset
      for k = 2, 3, ..., k_max do
          d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij)          // distance between clusterings
  for k = 2, 3, ..., k_max do
      μ_d(k) ← \frac{2}{t(t-1)} \sum_{i=1}^{t} \sum_{j>i} d_ij(k)
  k* ← arg min_k μ_d(k)
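A compact Python sketch of the same procedure, using scikit-learn's KMeans as the clustering algorithm A and the Fowlkes-Mallows score as the pairwise similarity, so the best k maximizes the expected value (with a distance such as VI one would take the arg min instead). These library choices are assumptions of the sketch, not part of the original algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def clustering_stability(X, kmax, t=50, seed=0):
    """Expected pairwise FM similarity between clusterings of bootstrap samples, per k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    samples = [rng.integers(0, n, size=n) for _ in range(t)]   # bootstrap index vectors
    mu_s = {}
    for k in range(2, kmax + 1):
        labels = [KMeans(n_clusters=k, n_init=10).fit_predict(X[idx]) for idx in samples]
        vals = []
        for i in range(t):
            for j in range(i + 1, t):
                # restrict to points present in both samples (the common dataset D_ij)
                common, pi, pj = np.intersect1d(samples[i], samples[j], return_indices=True)
                vals.append(fowlkes_mallows_score(labels[i][pi], labels[j][pj]))
        mu_s[k] = np.mean(vals)
    return mu_s        # choose the k with the largest expected similarity
```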
Clustering Stability: Iris Data
t = 500 bootstrap samples; best K-means from 100 runs
Both the Variation of Information and the Fowlkes-Mallows measures indicate that
k = 2 is the best value. VI indicates the least expected distance between pairs of
clusterings, and FM indicates the most expected similarity between clusterings.
(Figure: expected pairwise Fowlkes-Mallows similarity μ_s(k) and expected pairwise VI distance μ_d(k) as functions of k.)
Clustering Tendency: Spatial Histogram
The spatial histogram approach divides the data space into a grid by discretizing each of the d dimensions into b equi-width bins, yielding b^d spatial cells. Let f denote the empirical joint probability mass function (EPMF) over the cells for the points in D, and let gj denote the EPMF for a sample Rj of n points generated uniformly at random over the same data space, for j = 1, ..., t. The two distributions are compared via the Kullback-Leibler (KL) divergence
KL(f \mid g_j) = \sum_{i} f(i) \log \frac{f(i)}{g_j(i)}
Clustering Tendency: Spatial Histogram
The KL divergence is zero only when f and gj are the same distributions. Using
these divergence values, we can compute how much the dataset D differs from a
random dataset.
Its main limitation is that the number of cells (b^d) increases exponentially with the dimensionality, and, with a fixed sample size n, most of the cells will contain no points or only a single point, making it hard to estimate the divergence reliably. The method is also sensitive to the choice of the parameter b.
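A NumPy sketch of the spatial-histogram comparison described above; the base-2 logarithm and the small epsilon added to empty cells of the random samples are assumptions of this sketch.

```python
import numpy as np

def spatial_histogram_kl(D, b=5, t=500, seed=0):
    """KL divergences between the spatial EPMF of D and t uniform samples."""
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], b + 1) for j in range(D.shape[1])]

    def epmf(points):
        counts, _ = np.histogramdd(points, bins=edges)
        return counts.ravel() / len(points)

    f = epmf(D)
    mask, eps = f > 0, 1e-12
    kls = []
    for _ in range(t):
        g = epmf(rng.uniform(lo, hi, size=D.shape)) + eps   # avoid log(0) on empty cells
        kls.append(np.sum(f[mask] * np.log2(f[mask] / g[mask])))
    return np.array(kls)          # e.g. report kls.mean() and kls.std()
```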
Spatial Histogram: Iris PCA Data versus Uniform
The uniform sample also has n = 150 points.
u2 u2
bC bC bC
bC bC bC bC
bC Cb bC
bC bC bC Cb bC
bC bC
bC bC bC Cb bC
Cb bC
1.0 1.0 bC bC bC bC bC
bC Cb bC Cb bC bC bC
bC bC bC bC
bC bC bC bC
bC bC bC bC bC
bC Cb
bC bC bC bC bC
bC bC bC bC bC bC
0.5 bC bC bC bC Cb bC bC bC 0.5 bC bC
bC bC
bC bC Cb bC Cb Cb Cb bC bC bC bC bC bC
bC bC Cb bC bC bC bC Cb Cb bC bC bC bC bC
bC bC Cb bC bC bC bC Cb bC bC bC Cb bC
bC bC bC Cb bC bC bC bC bC bC bC
bC bC bC bC
bC bC bC C b bC bC Cb bC bC
C b bC bC bC bC bC bC
bC bC Cb bC bC bC bC bC bC bC bC bC
0 bC bC bC 0 bC bC
C b Cb bC bC bC bC bC bC Cb bC bC bC bC bC bC bC
bC bC Cb bC bC bC
bC bC bC bC Cb bC bC bC bC bC Cb bC bC
bC bC Cb bC bC bC Cb bC Cb bC bC bC
bC bC
bC bC bC bC bC Cb bC Cb bC bC bC
bC bC bC bC bC Cb bC bC bC bC bC
−0.5 bC bC bC Cb bC bC −0.5 bC bC bC bC bC bC
bC
bC
bC Cb Cb bC bC bC bC bC bC
bC bC bC bC bC
bC bC bC bC
bC bC bC
bC bC bC
bC bC bC bC Cb bC bC bC bC
−1.0 −1.0 Cb bC bC bC
bC bC bC
bC
bC bC bC bC bC bC
−1.5 u1 −1.5 u1
−4 −3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3
Spatial Histogram: Empirical PMF
Using b = 5 bins per dimension yields 25 spatial cells.
(Figure: the empirical probability mass functions f for Iris and gj for the uniform sample over the 25 spatial cells.)
(c) Empirical probability mass function
Spatial Histogram: KL Divergence Distribution
(Figure: histogram of the KL divergence values over the t random samples.)
(d) KL-divergence distribution
We generated t = 500 random samples from the null distribution and computed the KL divergence from f to gj for each 1 ≤ j ≤ t. The mean KL value is μKL = 1.17, with a standard deviation of σKL = 0.18. Because the divergence is significantly larger than zero, we conclude that the Iris PCA data is clusterable.
Clustering Tendency: Distance Distribution
We can compare the pairwise point distances from D, with those from the
randomly generated samples R i from the null distribution.
We create the EPMF from the proximity matrix W for D by binning the distances
into b bins:
f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\ p < q) = \frac{ \left| \{ w_{pq} \in \text{bin } i \} \right| }{ n(n-1)/2 }
Likewise, for each of the samples R j , we determine the EPMF for the pairwise
distances, denoted gj .
Finally, we compute the KL divergences between f and gj . The expected
divergence indicates the extent to which D differs from the null (random)
distribution.
Iris PCA Data × Uniform: Distance Distribution
The distance distribution is obtained by binning the edge weights between all pairs
of points using b = 25 bins.
(Figure: the pairwise distance distributions f for Iris and gj for the uniform sample, over b = 25 bins.)
(a)
Iris PCA Data × Uniform: Distance Distribution
We compute the KL divergence from D to each Rj , over t = 500 samples. The
mean divergence is µKL = 0.18, with standard deviation σKL = 0.017. Even though
the Iris dataset has a good clustering tendency, the KL divergence is not very large.
(Figure: histogram of the KL divergence values between the distance distributions over the t random samples.)
(b)
We conclude that, at least for the Iris dataset, the distance distribution is not as
discriminative as the spatial histogram approach for clusterability analysis.
Clustering Tendency: Hopkins Statistic
Given a dataset D comprising n points, we generate t uniform subsamples R i of
m points each, sampled from the same dataspace as D.
We also generate t subsamples of m points directly from D, using sampling
without replacement. Let D i denote the ith direct subsample.
Next, we compute the minimum distance between each point x j ∈ D i and points
in D
\delta_{min}(x_j) = \min_{x_i \in D,\ x_i \neq x_j} \left\{ \| x_j - x_i \| \right\}
Similarly, let δmin(yj) denote the minimum distance from each point yj in the uniform subsample Ri to the points in D. The Hopkins statistic for the ith pair of subsamples (in d dimensions) is then
HS_i = \frac{ \sum_{y_j \in R_i} \left( \delta_{min}(y_j) \right)^{d} }{ \sum_{x_j \in D_i} \left( \delta_{min}(x_j) \right)^{d} + \sum_{y_j \in R_i} \left( \delta_{min}(y_j) \right)^{d} }
Values close to 0.5 indicate that the data is essentially random, whereas values close to 1 indicate that D is highly clusterable; the statistic is averaged over the t subsample pairs.
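A NumPy sketch of the Hopkins statistic as outlined above; the use of the d-th power of the nearest-neighbor distances and the bounding-box uniform sampling are assumptions of this sketch.

```python
import numpy as np

def hopkins(D, m=30, t=10, seed=0):
    """Hopkins statistic: about 0.5 for random data, close to 1 for clusterable data."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    lo, hi = D.min(axis=0), D.max(axis=0)

    def nn_dist(points, exclude_self=False):
        # minimum distance from each query point to the points of D
        dist = np.linalg.norm(points[:, None, :] - D[None, :, :], axis=-1)
        if exclude_self:
            dist[dist == 0] = np.inf   # drop the zero self-distance (assumes no duplicates)
        return dist.min(axis=1)

    vals = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))              # uniform subsample of the dataspace
        Dm = D[rng.choice(n, size=m, replace=False)]      # direct subsample of D
        u = nn_dist(R) ** d
        w = nn_dist(Dm, exclude_self=True) ** d
        vals.append(u.sum() / (u.sum() + w.sum()))
    return float(np.mean(vals))
```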