Data Mining and Machine Learning: Fundamental Concepts and Algorithms
1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Clustering Validation and Evaluation
External Measures
External Measures
External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters.
All of the external measures rely on the r × k contingency table N induced by a clustering C and the ground-truth partitioning T, where the count nij denotes the number of points common to cluster Ci and ground-truth partition Tj.
Let ni = |Ci | denote the number of points in cluster Ci , and let mj = |Tj | denote
the number of points in partition Tj .
The contingency table can be computed from T and C in O(n) time by examining the partition and cluster labels, yi and ŷi, for each point xi ∈ D and incrementing the corresponding count n_{ŷi, yi}.
Matching Based Measures: Purity
Purity quantifies the extent to which a cluster Ci contains entities from only one
partition:
purity_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \}
The purity of clustering C is defined as the weighted sum of the clusterwise purity
values:
purity = \sum_{i=1}^{r} \frac{n_i}{n} \, purity_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k} \{ n_{ij} \}
where the ratio n_i/n denotes the fraction of points in cluster Ci.
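Purity can be computed directly from the contingency table. A minimal NumPy sketch, using the good-case K-means contingency table for the Iris data that appears later in these slides:

```python
import numpy as np

# Contingency table N (rows = clusters C_i, columns = ground-truth partitions T_j),
# taken from the good-case K-means clustering of the Iris PCA data.
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])

n = N.sum()                                      # total number of points
cluster_purity = N.max(axis=1) / N.sum(axis=1)   # purity_i = max_j n_ij / n_i
purity = (N.sum(axis=1) / n) @ cluster_purity    # weighted by cluster fractions n_i / n

print(cluster_purity)   # per-cluster purity values
print(purity)           # overall purity = 133/150, about 0.887 for this table
```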
Matching Based Measures: Maximum Matching
The maximum matching measure selects the mapping between clusters and
partitions, such that the sum of the number of common points (nij ) is maximized,
provided that only one cluster can match with a given partition.
Let G be a bipartite graph over the vertex set V = C ∪ T , and let the edge set be
E = {(Ci , Tj )} with edge weights w (Ci , Tj ) = nij . A matching M in G is a subset
of E , such that the edges in M are pairwise nonadjacent, that is, they do not have
a common vertex.
The maximum weight matching in G is given as
match = \arg\max_{M} \left\{ \frac{w(M)}{n} \right\}
where w(M) is the sum of all the edge weights in matching M, given as w(M) = \sum_{e \in M} w(e).
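The maximum weight matching can be found with the Hungarian algorithm. A sketch using SciPy's linear_sum_assignment on the same contingency table (the use of SciPy is an assumption of this sketch; for this particular table the optimal matching happens to coincide with the clusterwise maxima):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

N = np.array([[ 0, 47, 14],     # rows = clusters, columns = partitions
              [50,  0,  0],
              [ 0,  3, 36]])

# Hungarian algorithm: one-to-one assignment of clusters to partitions
rows, cols = linear_sum_assignment(N, maximize=True)
w_M = N[rows, cols].sum()        # w(M), total weight of the matching
print(list(zip(rows, cols)))     # matched (cluster, partition) pairs
print(w_M / N.sum())             # match = w(M)/n = 133/150, about 0.887 here
```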
Matching Based Measures: F-measure
Given cluster Ci, let ji denote the partition that contains the maximum number of points from Ci, that is, j_i = \arg\max_{j=1}^{k} \{ n_{ij} \}.
The precision of a cluster Ci is the same as its purity:
prec_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \} = \frac{n_{i j_i}}{n_i}
The recall of cluster Ci is defined as
recall_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}
Matching Based Measures: F-measure
The F-measure of cluster Ci is the harmonic mean of its precision and recall values:
F_i = \frac{2 \cdot prec_i \cdot recall_i}{prec_i + recall_i} = \frac{2\, n_{i j_i}}{n_i + m_{j_i}}
The F-measure for the clustering C is the mean of the clusterwise F-measure values:
F = \frac{1}{r} \sum_{i=1}^{r} F_i
K-means: Iris Principal Components Data
(Figures: scatter plots of the K-means clusterings on the Iris principal components data, with clusters drawn as squares, circles, and triangles; plot details omitted.)
Good case, contingency table:
                 T1 (iris-setosa)  T2 (iris-versicolor)  T3 (iris-virginica)    ni
C1 (squares)             0                  47                    14             61
C2 (circles)            50                   0                     0             50
C3 (triangles)           0                   3                    36             39
mj                      50                  50                    50          n = 150

Bad case, contingency table:
                 T1 (iris-setosa)  T2 (iris-versicolor)  T3 (iris-virginica)    ni
C1 (squares)            30                   0                     0             30
C2 (circles)            20                   4                     0             24
C3 (triangles)           0                  46                    50             96
mj                      50                  50                    50          n = 150
Entropy-based Measures
The entropy of a clustering C and of a partitioning T are defined as
H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i} \qquad H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}
where p_{C_i} = n_i/n and p_{T_j} = m_j/n are the probabilities of cluster Ci and partition Tj.
Entropy-based Measures: Conditional Entropy
The conditional entropy of T given the clustering C is defined as
H(T|C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T|C_i) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i}} = H(C,T) - H(C)
where p_{ij} = n_{ij}/n is the probability that a point belongs to cluster Ci and to partition Tj, and where H(C,T) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij} is the joint entropy of C and T.
Entropy-based Measures: Normalized Mutual Information
The mutual information tries to quantify the amount of shared information
between the clustering C and partitioning T , and it is defined as
I(C,T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \left( \frac{p_{ij}}{p_{C_i} \cdot p_{T_j}} \right)
When C and T are independent then pij = pCi · pTj , and thus I (C, T ) = 0.
However, there is no upper bound on the mutual information.
The normalized mutual information (NMI) is defined as the geometric mean:
NMI(C,T) = \sqrt{ \frac{I(C,T)}{H(C)} \cdot \frac{I(C,T)}{H(T)} } = \frac{I(C,T)}{\sqrt{H(C) \cdot H(T)}}
The NMI value lies in the range [0, 1]. Values close to 1 indicate a good clustering.
Entropy-based Measures: Variation of Information
This criterion is based on the mutual information between the clustering C and the ground-truth partitioning T, and their entropy; it is defined as
VI(C,T) = H(T|C) + H(C|T) = H(C) + H(T) - 2\, I(C,T)
Variation of information (VI) is zero only when C and T are identical. Thus, the
lower the VI value the better the clustering C.
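A NumPy sketch that computes I(C,T), NMI, and VI from a contingency table. Base-2 logarithms are an arbitrary choice of this sketch (NMI is independent of the base, while the VI value is not); the table is again the good-case Iris table from the earlier example.

```python
import numpy as np

N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]], dtype=float)
p = N / N.sum()            # joint probabilities p_ij
p_C = p.sum(axis=1)        # cluster marginals p_Ci
p_T = p.sum(axis=0)        # partition marginals p_Tj

def H(q):                  # Shannon entropy (base 2), ignoring zero entries
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

nz = p > 0                 # treat 0 * log 0 as 0
I = (p[nz] * np.log2(p[nz] / np.outer(p_C, p_T)[nz])).sum()

NMI = I / np.sqrt(H(p_C) * H(p_T))
VI = H(p_C) + H(p_T) - 2 * I
print(I, NMI, VI)
```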
K-means: Iris Principal Components Data
Good Case
(Figure: two side-by-side scatter plots of the Iris principal components data with cluster memberships indicated by symbol; plot details omitted.)
Pairwise Measures
Given clustering C and ground-truth partitioning T, let xi, xj ∈ D be any two points, with i ≠ j. Let yi denote the true partition label and let ŷi denote the cluster label for point xi.
If both xi and xj belong to the same cluster, that is, ŷi = ŷj, we call it a positive event, and if they do not belong to the same cluster, that is, ŷi ≠ ŷj, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:
True Positives: xi and xj belong to the same partition in T, and they are also in the same cluster in C. The number of true positive pairs is given as
TP = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
False Negatives: xi and xj belong to the same partition in T, but they do not belong to the same cluster in C. The number of all false negative pairs is given as
FN = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|
Pairwise Measures
False Positives: xi and xj do not belong to the same partition in T, but they do belong to the same cluster in C. The number of false positive pairs is given as
FP = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
True Negatives: xi and xj neither belong to the same partition in T, nor do they belong to the same cluster in C. The number of such true negative pairs is given as
TN = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|
Because there are N = \binom{n}{2} = \frac{n(n-1)}{2} pairs of points, we have the following identity:
N = TP + FN + FP + TN
Pairwise Measures: TP, TN, FP, FN
They can be computed efficiently using the contingency table N = {nij }. The
number of true positives is given as
TP = \frac{1}{2} \left( \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 \; - \; n \right)
The other counts follow from the row and column sums of the contingency table:
FN = \frac{1}{2} \left( \sum_{j=1}^{k} m_j^2 - n \right) - TP \qquad FP = \frac{1}{2} \left( \sum_{i=1}^{r} n_i^2 - n \right) - TP \qquad TN = N - TP - FN - FP
Pairwise Measures: Jaccard Coefficient, Rand Statistic
Jaccard Coefficient: measures the fraction of true positive point pairs, but after ignoring the true negatives:
Jaccard = \frac{TP}{TP + FN + FP}
Rand Statistic: measures the fraction of true positives and true negatives over all point pairs:
Rand = \frac{TP + TN}{N}
Pairwise Measures: FM Measure
The Fowlkes-Mallows (FM) measure is the geometric mean of the pairwise precision and recall, where prec = TP/(TP + FP) and recall = TP/(TP + FN):
FM = \sqrt{prec \cdot recall} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}
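All four pair counts, and the measures built from them, follow directly from the contingency table. A NumPy sketch using the good-case Iris table; the values in the comments are simply what the arithmetic gives for this particular table.

```python
import numpy as np
from math import comb

N = np.array([[ 0, 47, 14],          # rows = clusters, columns = partitions
              [50,  0,  0],
              [ 0,  3, 36]])
n = int(N.sum())
num_pairs = comb(n, 2)               # N = n(n-1)/2 point pairs

TP = (int((N ** 2).sum()) - n) // 2                       # (sum_ij n_ij^2 - n) / 2
FN = sum(comb(int(m), 2) for m in N.sum(axis=0)) - TP     # same partition, different cluster
FP = sum(comb(int(c), 2) for c in N.sum(axis=1)) - TP     # same cluster, different partition
TN = num_pairs - TP - FN - FP

jaccard = TP / (TP + FN + FP)
rand = (TP + TN) / num_pairs
FM = TP / np.sqrt((TP + FN) * (TP + FP))

print(TP, FN, FP, TN)        # 3030 645 766 6734
print(jaccard, rand, FM)     # about 0.68, 0.87, 0.81
```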
K-means: Iris Principal Components Data
Good Case
(Figure: scatter plot of the good-case K-means clustering on the Iris principal components data; plot details omitted.)
Contingency table:
        T1 (setosa)   T2 (versicolor)   T3 (virginica)
C1           0               47                14
C2          50                0                 0
C3           0                3                36
Correlation Measures: Hubert statistic
Let X and Y be two symmetric n × n matrices, and let N = \binom{n}{2}. Let x, y ∈ R^N denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of X and Y.
Let μX denote the element-wise mean of x, given as
\mu_X = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j)
with μY defined analogously for y. The Hubert statistic is the averaged element-wise product of X and Y:
\Gamma = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j) \cdot Y(i,j) = \frac{1}{N} x^T y
The normalized Hubert statistic is the correlation between x and y, that is, the cosine of the angle between the centered vectors z_x = x - \mu_X \mathbf{1} and z_y = y - \mu_Y \mathbf{1}:
\Gamma_n = \frac{z_x^T z_y}{\|z_x\| \cdot \|z_y\|} = \cos\theta
Correlation-based Measure: Discretized Hubert Statistic
Let T and C be the n × n matrices defined as
T(i,j) = \begin{cases} 1 & \text{if } y_i = y_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad C(i,j) = \begin{cases} 1 & \text{if } \hat{y}_i = \hat{y}_j,\ i \neq j \\ 0 & \text{otherwise} \end{cases}
The normalized version of the discretized Hubert statistic is simply the correlation between t and c:
\Gamma_n = \frac{z_t^T z_c}{\|z_t\| \cdot \|z_c\|} = \frac{\frac{TP}{N} - \mu_T \mu_C}{\sqrt{\mu_T \mu_C (1 - \mu_T)(1 - \mu_C)}}
where \mu_T = \frac{TP + FN}{N} and \mu_C = \frac{TP + FP}{N}.
Internal Measures
Internal measures are based on the n × n matrix of pairwise point distances
W = \left\{ \|x_i - x_j\| \right\}_{i,j=1}^{n}
which can also be viewed as the weighted adjacency matrix of a complete graph G over the n points.
Internal Measures
The clustering C can be considered as a k-way cut in G . Given any subsets
S, R ⊂ V , define W (S, R) as the sum of the weights on all edges with one vertex
in S and the other in R, given as
W(S,R) = \sum_{x_i \in S} \sum_{x_j \in R} w_{ij}
Clusterings as Graphs: Iris
Only intracluster edges shown.
Good clustering: (figure of the clustering drawn as a graph over the Iris principal components data; plot details omitted)
Bad clustering: (figure of the clustering drawn as a graph over the Iris principal components data; plot details omitted)
Internal Measures: BetaCV and C-index
BetaCV Measure: The BetaCV measure is the ratio of the mean intracluster
distance to the mean intercluster distance:
BetaCV = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{W_{in}}{W_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}
where W_in and N_in are the sum and the number of intracluster distances (point pairs), and W_out and N_out are the sum and the number of intercluster distances. The smaller the BetaCV ratio, the better the clustering.
C-index: Let Wmin(Nin) be the sum of the smallest Nin distances in the proximity matrix W, where Nin is the total number of intracluster edges, or point pairs. Let Wmax(Nin) be the sum of the largest Nin distances in W.
The C-index measures to what extent the clustering puts together the Nin points that are the closest across the k clusters. It is defined as
C\text{-index} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}
The C-index lies in the range [0, 1]. The smaller the C-index, the better the
clustering.
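A NumPy sketch that computes both measures from a data matrix and a label vector; the data and labels at the bottom are synthetic placeholders, included only so the snippet runs on its own, and the sketch assumes at least one intracluster and one intercluster pair.

```python
import numpy as np

def betacv_cindex(X, labels):
    """BetaCV and C-index from an n x d data matrix X and a 1-d NumPy label array."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    iu = np.triu_indices(n, k=1)                                 # each pair counted once
    same = (labels[:, None] == labels[None, :])[iu]              # intracluster pair mask
    d = D[iu]

    W_in, N_in = d[same].sum(), same.sum()          # intracluster weight and pair count
    W_out, N_out = d[~same].sum(), (~same).sum()    # intercluster weight and pair count

    betacv = (W_in / N_in) / (W_out / N_out)
    d_sorted = np.sort(d)
    c_index = (W_in - d_sorted[:N_in].sum()) / (d_sorted[-N_in:].sum() - d_sorted[:N_in].sum())
    return betacv, c_index

# Illustration only: random data with arbitrary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
labels = rng.integers(0, 3, size=60)
print(betacv_cindex(X, labels))
```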
Internal Measures: Normalized Cut and Modularity
Normalized Cut Measure: The normalized cut objective for graph clustering can
also be used as an internal clustering evaluation measure:
NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{vol(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}
where vol(Ci) = W(Ci, V) is the volume of cluster Ci and \overline{C_i} = V \setminus C_i is its complement. The higher the normalized cut value the better.
Modularity Measure: The modularity of the clustering is given as
Q = \sum_{i=1}^{k} \left( \frac{W(C_i, C_i)}{W(V,V)} - \left( \frac{W(C_i, V)}{W(V,V)} \right)^2 \right)
Because the edge weights here are distances, the smaller the modularity value, the better the clustering.
Internal Measures: Dunn Index
The Dunn index is defined as the ratio between the minimum distance between
point pairs from different clusters and the maximum distance between point pairs
from the same cluster
Dunn = \frac{W_{out}^{min}}{W_{in}^{max}}
where W_{out}^{min} is the minimum intercluster distance:
W_{out}^{min} = \min_{i,\ j > i} \left\{ w_{ab} \mid x_a \in C_i,\ x_b \in C_j \right\}
and W_{in}^{max} is the maximum intracluster distance:
W_{in}^{max} = \max_{i} \left\{ w_{ab} \mid x_a, x_b \in C_i \right\}
The larger the Dunn index the better the clustering because it means even the
closest distance between points in different clusters is much larger than the
farthest distance between points in the same cluster.
Internal Measures: Davies-Bouldin Index
Let µi denote the cluster mean
\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j
Let σµi denote the dispersion or spread of the points around the cluster mean
\sigma_{\mu_i} = \sqrt{ \frac{ \sum_{x_j \in C_i} \delta(x_j, \mu_i)^2 }{ n_i } } = \sqrt{ \mathrm{var}(C_i) }
The Davies–Bouldin measure for a pair of clusters Ci and Cj is defined as the ratio
DB_{ij} = \frac{ \sigma_{\mu_i} + \sigma_{\mu_j} }{ \delta(\mu_i, \mu_j) }
DB ij measures how compact the clusters are compared to the distance between
the cluster means. The Davies–Bouldin index is then defined as
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \{ DB_{ij} \}
The smaller the DB value, the better the clustering.
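A direct NumPy sketch of the index following the formulas above; X is an n x d array and labels a length-n vector (scikit-learn's davies_bouldin_score computes an equivalent quantity).

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB = (1/k) * sum_i max_{j != i} (sigma_i + sigma_j) / dist(mu_i, mu_j)."""
    ks = np.unique(labels)
    mu = np.array([X[labels == c].mean(axis=0) for c in ks])     # cluster means
    sigma = np.array([np.sqrt(((X[labels == c] - m) ** 2).sum(axis=1).mean())
                      for c, m in zip(ks, mu)])                  # spread around each mean
    k = len(ks)
    total = 0.0
    for i in range(k):
        total += max((sigma[i] + sigma[j]) / np.linalg.norm(mu[i] - mu[j])
                     for j in range(k) if j != i)
    return total / k
```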
Silhouette Coefficient
Define the silhouette coefficient of a point xi as
s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\left\{ \mu_{out}^{min}(x_i),\ \mu_{in}(x_i) \right\}}
where μin(xi) is the mean distance from xi to points in its own cluster ŷi:
\mu_{in}(x_i) = \frac{ \sum_{x_j \in C_{\hat{y}_i},\ j \neq i} \delta(x_i, x_j) }{ n_{\hat{y}_i} - 1 }
and μout^min(xi) is the mean of the distances from xi to points in the closest other cluster:
\mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i} \left\{ \frac{ \sum_{y \in C_j} \delta(x_i, y) }{ n_j } \right\}
The si value lies in the interval [−1, +1]. A value close to +1 indicates that x i is much
closer to points in its own cluster, a value close to zero indicates x i is close to the
boundary, and a value close to −1 indicates that x i is much closer to another cluster,
and therefore may be mis-clustered.
The silhouette coefficient is the mean s_i value: SC = \frac{1}{n} \sum_{i=1}^{n} s_i. A value close to +1 indicates a good clustering.
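A NumPy sketch of the pointwise silhouette values and their mean SC, assuming every cluster contains at least two points; scikit-learn's silhouette_score yields the same mean under the Euclidean metric.

```python
import numpy as np

def silhouette(X, labels):
    """Return the pointwise silhouette values s_i and the mean coefficient SC."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    ks = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        mu_in = D[i, own].sum() / (own.sum() - 1)                # excludes d(x_i, x_i) = 0
        mu_out = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```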
Iris Data: Good vs. Bad Clustering
(Figure: side-by-side scatter plots of a good and a bad clustering of the Iris principal components data; plot details omitted.)
Relative Measures: Silhouette Coefficient
The silhouette coefficient sj of each point, together with the average SC value, can be used to estimate the number of clusters in the data.
The approach consists of plotting the sj values in descending order for each
cluster, and to note the overall SC value for a particular value of k, as well as
clusterwise SC values:
SC_i = \frac{1}{n_i} \sum_{x_j \in C_i} s_j
We then pick the value k that yields the best clustering, with many points having
high sj values within each cluster, as well as high values for SC and SCi
(1 ≤ i ≤ k).
Iris K-means: Silhouette Coefficient Plot (k = 2)
(Figure: silhouette coefficient plot, with pointwise si values shown in descending order within each cluster.)
SC1 = 0.662 (n1 = 97), SC2 = 0.785 (n2 = 53)
(a) k = 2, SC = 0.706
k = 2 yields the highest silhouette coefficient, with the two clusters essentially well
separated. C1 starts out with high si values, which gradually drop as we get to
border points. C2 is even better separated, since it has a higher silhouette
coefficient and the pointwise scores are all high, except for the last three points.
Iris K-means: Silhouette Coefficient Plot (k = 3)
(Figure: silhouette coefficient plot.)
SC1 = 0.466 (n1 = 61), SC2 = 0.818 (n2 = 50), SC3 = 0.52 (n3 = 39)
(b) k = 3, SC = 0.598
C1 from k = 2 has been split into two clusters for k = 3, namely C1 and C3 . Both
of these have many bordering points, whereas C2 is well separated with high
silhouette coefficients across all points.
Iris K-means: Silhouette Coefficient Plot (k = 4)
(Figure: silhouette coefficient plot.)
SC1 = 0.376 (n1 = 49), SC2 = 0.534 (n2 = 28), SC3 = 0.787 (n3 = 50), SC4 = 0.484 (n4 = 23)
(c) k = 4, SC = 0.559
C3 is the well separated cluster, corresponding to C2 (in k = 2 and k = 3), and the
remaining clusters are essentially subclusters of C1 for k = 2. Cluster C1 also has
two points with negative si values, indicating that they are probably misclustered.
Relative Measures: Calinski–Harabasz Index
The total scatter matrix for the dataset is given as
S = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^T
where \mu = \frac{1}{n} \sum_{j=1}^{n} x_j is the dataset mean and S = n\Sigma, with \Sigma the covariance matrix. The scatter matrix can be decomposed into two matrices S = S_W + S_B, where S_W is the within-cluster scatter matrix and S_B is the between-cluster scatter matrix, given as
S_W = \sum_{i=1}^{k} \sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T
S_B = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T
where \mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j is the mean for cluster Ci.
Relative Measures: Calinski–Harabasz Index
The Calinski-Harabasz variance ratio for a given value of k is defined as
CH(k) = \frac{tr(S_B)/(k-1)}{tr(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{tr(S_B)}{tr(S_W)}
The intuition is that we want to find the value of k for which CH(k) is much higher than CH(k-1) and there is only a little improvement or a decrease in the CH(k+1) value. One way to formalize this is to choose the k that minimizes the pairwise difference
\Delta(k) = \big( CH(k+1) - CH(k) \big) - \big( CH(k) - CH(k-1) \big)
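Because only the traces of S_W and S_B are needed, CH(k) reduces to sums of squared distances. A sketch, together with a helper for the Delta(k) differences mentioned above; X is an n x d array and labels a length-n label vector.

```python
import numpy as np

def ch_index(X, labels):
    """CH(k) = ((n - k)/(k - 1)) * tr(S_B) / tr(S_W)."""
    ks = np.unique(labels)
    n, k = len(X), len(ks)
    mu = X.mean(axis=0)
    tr_SW = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in ks)
    tr_SB = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - mu) ** 2).sum() for c in ks)
    return (n - k) / (k - 1) * tr_SB / tr_SW

def ch_delta(CH):
    """Delta(k) = (CH(k+1) - CH(k)) - (CH(k) - CH(k-1)) for a dict {k: CH(k)}."""
    return {k: (CH[k + 1] - CH[k]) - (CH[k] - CH[k - 1])
            for k in CH if k - 1 in CH and k + 1 in CH}
```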
Calinski–Harabasz Variance Ratio
CH ratio for various values of k on the Iris principal components data, using the
K-means algorithm, with the best results chosen from 200 runs.
(Figure: plot of the CH ratio against k = 2, ..., 9.)
The successive CH(k) and ∆(k) values are as follows:
k 2 3 4 5 6 7 8 9
CH(k) 570.25 692.40 717.79 683.14 708.26 700.17 738.05 728.63
∆(k) – −96.78 −60.03 59.78 −33.22 45.97 −47.30 –
∆(k) suggests k = 3 as the best (lowest) value.
Relative Measures: Gap Statistic
The gap statistic compares the sum of intracluster weights Win for different values
of k with their expected values assuming no apparent clustering structure, which
forms the null hypothesis.
Let Ck be the clustering obtained for a specified value of k, and let W_{in}^{k}(D) denote the sum of intracluster weights (over all clusters) for Ck on the input dataset D.
We would like to compute the probability of the observed W_{in}^{k} value under the null hypothesis. To obtain an empirical distribution for W_in, we resort to Monte Carlo simulations of the sampling process.
Relative Measures: Gap Statistic
Let μ_W(k) and σ_W(k) denote the mean and standard deviation of log_2 W_{in}^{k} over the t Monte Carlo samples. The gap statistic is then
gap(k) = \mu_W(k) - \log_2 W_{in}^{k}(D)
Choose k as the smallest value whose gap is within one standard deviation of the gap at k + 1:
k^* = \arg\min_{k} \left\{ k \mid gap(k) \geq gap(k+1) - \sigma_W(k+1) \right\}
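A Monte Carlo sketch of the gap statistic. The use of scikit-learn's KMeans as the clustering algorithm, base-2 logarithms (as in the figures that follow), and a uniform null distribution over the bounding box of the data are all choices of this sketch rather than requirements.

```python
import numpy as np
from sklearn.cluster import KMeans

def intracluster_weight(X, labels):
    """W_in: sum of pairwise distances over all intracluster point pairs."""
    total = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        total += np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1).sum() / 2
    return total

def gap_statistic(X, k, t=200, seed=0):
    """Return gap(k) and sigma_W(k) against a uniform null over the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    obs = np.log2(intracluster_weight(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
    null = np.array([
        np.log2(intracluster_weight(R, KMeans(n_clusters=k, n_init=10).fit_predict(R)))
        for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))
    ])
    return null.mean() - obs, null.std()
```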
Gap Statistic: Randomly Generated Data
A random sample of n = 150 points, which does not have any apparent cluster
structure.
(Figure: scatter plot of the randomly generated data, clustered with K-means using k = 3.)
(a) Randomly generated data (k = 3)
Gap Statistic: Intracluster Weights and Gap Values
We generate t = 200 random datasets and compute the expected intracluster weight μW(k) for each value of k, which we compare against the observed intracluster weight W_{in}^{k}(D) on the Iris data. The observed W_{in}^{k}(D) values are smaller than the expected values μW(k).
(Figure: (b) expected μW(k) versus observed log2 W_{in}^{k} intracluster weights; (c) the gap statistic gap(k) as a function of k.)
Gap Statistic as a Function of k
k gap(k) σW (k) gap(k) − σW (k)
1 0.093 0.0456 0.047
2 0.346 0.0486 0.297
3 0.679 0.0529 0.626
4 0.753 0.0701 0.682
5 0.586 0.0711 0.515
6 0.715 0.0654 0.650
7 0.808 0.0611 0.746
8 0.680 0.0597 0.620
9 0.632 0.0606 0.571
The optimal value for the number of clusters is k = 4 because
gap(4) = 0.753 > gap(5) − σW (5) = 0.515
However, if we relax the gap test to be within two standard deviations, then the
optimal value is k = 3 because
gap(3) = 0.679 > gap(4) − 2σW (4) = 0.753 − 2 · 0.0701 = 0.613
Cluster Stability
The main idea behind cluster stability is that the clusterings obtained from several
datasets sampled from the same underlying distribution as D should be similar or
“stable.”
Stability can be used to find a good value for k, the correct number of clusters.
We generate t samples of size n by sampling from D with replacement. Let
Ck (D i ) denote the clustering obtained from sample D i , for a given value of k.
Next, we compare the distance between all pairs of clusterings Ck (D i ) and Ck (D j )
using several of the external cluster evaluation measures. From these values we
compute the expected pairwise distance for each value of k. Finally, the value k ∗
that exhibits the least deviation between the clusterings obtained from the
resampled datasets is the best choice for k because it exhibits the most stability.
Clustering Stability Algorithm
ClusteringStability (A, t, k_max, D):
  n ← |D|
  for i = 1, 2, ..., t do
      D_i ← sample n points from D with replacement
  for i = 1, 2, ..., t do
      for k = 2, 3, ..., k_max do
          C_k(D_i) ← cluster D_i into k clusters using algorithm A
  foreach pair D_i, D_j with j > i do
      D_ij ← D_i ∩ D_j                                  // create common dataset
      for k = 2, 3, ..., k_max do
          d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij)          // distance between clusterings
  for k = 2, 3, ..., k_max do
      μ_d(k) ← \frac{2}{t(t-1)} \sum_{i=1}^{t} \sum_{j>i} d_ij(k)
  k* ← arg min_k μ_d(k)
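A compact Python sketch of the same procedure, using scikit-learn's KMeans as the clustering algorithm A and the Fowlkes-Mallows score as the pairwise similarity, so the best k maximizes the expected value (with a distance such as VI one would take the arg min instead). These library choices are assumptions of the sketch, not part of the original algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def clustering_stability(X, kmax, t=50, seed=0):
    """Expected pairwise FM similarity between clusterings of bootstrap samples, per k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    samples = [rng.integers(0, n, size=n) for _ in range(t)]   # bootstrap index vectors
    mu_s = {}
    for k in range(2, kmax + 1):
        labels = [KMeans(n_clusters=k, n_init=10).fit_predict(X[idx]) for idx in samples]
        vals = []
        for i in range(t):
            for j in range(i + 1, t):
                # restrict to points present in both samples (the common dataset D_ij)
                common, pi, pj = np.intersect1d(samples[i], samples[j], return_indices=True)
                vals.append(fowlkes_mallows_score(labels[i][pi], labels[j][pj]))
        mu_s[k] = np.mean(vals)
    return mu_s        # choose the k with the largest expected similarity
```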
Clustering Stability: Iris Data
t = 500 bootstrap samples; best K-means from 100 runs
Both the Variation of Information and the Fowlkes-Mallows measures indicate that
k = 2 is the best value. VI indicates the least expected distance between pairs of
clusterings, and FM indicates the most expected similarity between clusterings.
(Figure: expected pairwise Fowlkes-Mallows similarity μ_s(k) and expected pairwise VI distance μ_d(k) as functions of k.)
Clustering Tendency: Spatial Histogram
The spatial histogram approach divides the data space into a grid by discretizing each of the d dimensions into b equi-width bins, yielding b^d spatial cells. Let f denote the empirical joint probability mass function (EPMF) over the cells for the points in D, and let gj denote the EPMF for a sample Rj of n points generated uniformly at random over the same data space, for j = 1, ..., t. The two distributions are compared via the Kullback-Leibler (KL) divergence
KL(f \mid g_j) = \sum_{i} f(i) \log \frac{f(i)}{g_j(i)}
Clustering Tendency: Spatial Histogram
The KL divergence is zero only when f and gj are the same distributions. Using
these divergence values, we can compute how much the dataset D differs from a
random dataset.
Its main limitation is that the number of cells (b^d) increases exponentially with the dimensionality, and, with a fixed sample size n, most of the cells will contain no points or only a single point, making it hard to estimate the divergence reliably. The method is also sensitive to the choice of the parameter b.
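A NumPy sketch of the spatial-histogram comparison described above; the base-2 logarithm and the small epsilon added to empty cells of the random samples are assumptions of this sketch.

```python
import numpy as np

def spatial_histogram_kl(D, b=5, t=500, seed=0):
    """KL divergences between the spatial EPMF of D and t uniform samples."""
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], b + 1) for j in range(D.shape[1])]

    def epmf(points):
        counts, _ = np.histogramdd(points, bins=edges)
        return counts.ravel() / len(points)

    f = epmf(D)
    mask, eps = f > 0, 1e-12
    kls = []
    for _ in range(t):
        g = epmf(rng.uniform(lo, hi, size=D.shape)) + eps   # avoid log(0) on empty cells
        kls.append(np.sum(f[mask] * np.log2(f[mask] / g[mask])))
    return np.array(kls)          # e.g. report kls.mean() and kls.std()
```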
Spatial Histogram: Iris PCA Data versus Uniform
The uniform sample also has n = 150 points.
u2 u2
bC bC bC
bC bC bC bC
bC Cb bC
bC bC bC Cb bC
bC bC
bC bC bC Cb bC
Cb bC
1.0 1.0 bC bC bC bC bC
bC Cb bC Cb bC bC bC
bC bC bC bC
bC bC bC bC
bC bC bC bC bC
bC Cb
bC bC bC bC bC
bC bC bC bC bC bC
0.5 bC bC bC bC Cb bC bC bC 0.5 bC bC
bC bC
bC bC Cb bC Cb Cb Cb bC bC bC bC bC bC
bC bC Cb bC bC bC bC Cb Cb bC bC bC bC bC
bC bC Cb bC bC bC bC Cb bC bC bC Cb bC
bC bC bC Cb bC bC bC bC bC bC bC
bC bC bC bC
bC bC bC C b bC bC Cb bC bC
C b bC bC bC bC bC bC
bC bC Cb bC bC bC bC bC bC bC bC bC
0 bC bC bC 0 bC bC
C b Cb bC bC bC bC bC bC Cb bC bC bC bC bC bC bC
bC bC Cb bC bC bC
bC bC bC bC Cb bC bC bC bC bC Cb bC bC
bC bC Cb bC bC bC Cb bC Cb bC bC bC
bC bC
bC bC bC bC bC Cb bC Cb bC bC bC
bC bC bC bC bC Cb bC bC bC bC bC
−0.5 bC bC bC Cb bC bC −0.5 bC bC bC bC bC bC
bC
bC
bC Cb Cb bC bC bC bC bC bC
bC bC bC bC bC
bC bC bC bC
bC bC bC
bC bC bC
bC bC bC bC Cb bC bC bC bC
−1.0 −1.0 Cb bC bC bC
bC bC bC
bC
bC bC bC bC bC bC
−1.5 u1 −1.5 u1
−4 −3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3
Spatial Histogram: Empirical PMF
Using b = 5 bins per dimension yields 25 spatial cells.
(Figure: the empirical probability mass functions f for Iris and gj for the uniform sample over the 25 spatial cells.)
(c) Empirical probability mass function
Spatial Histogram: KL Divergence Distribution
(Figure: histogram of the KL divergence values over the t random samples.)
(d) KL-divergence distribution
We generated t = 500 random samples from the null distribution and computed the KL divergence from f to gj for each 1 ≤ j ≤ t. The mean KL value is μKL = 1.17, with a standard deviation of σKL = 0.18. Because the divergence is significantly larger than zero, we conclude that the Iris PCA data is clusterable.
Clustering Tendency: Distance Distribution
We can compare the pairwise point distances from D, with those from the
randomly generated samples R i from the null distribution.
We create the EPMF from the proximity matrix W for D by binning the distances
into b bins:
f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\ p < q) = \frac{ \left| \{ w_{pq} \in \text{bin } i \} \right| }{ n(n-1)/2 }
Likewise, for each of the samples R j , we determine the EPMF for the pairwise
distances, denoted gj .
Finally, we compute the KL divergences between f and gj . The expected
divergence indicates the extent to which D differs from the null (random)
distribution.
Iris PCA Data × Uniform: Distance Distribution
The distance distribution is obtained by binning the edge weights between all pairs
of points using b = 25 bins.
(Figure: the pairwise distance distributions f for Iris and gj for the uniform sample, over b = 25 bins.)
(a)
Iris PCA Data × Uniform: Distance Distribution
We compute the KL divergence from D to each Rj , over t = 500 samples. The
mean divergence is µKL = 0.18, with standard deviation σKL = 0.017. Even though
the Iris dataset has a good clustering tendency, the KL divergence is not very large.
(Figure: histogram of the KL divergence values between the distance distributions over the t random samples.)
(b)
We conclude that, at least for the Iris dataset, the distance distribution is not as
discriminative as the spatial histogram approach for clusterability analysis.
Clustering Tendency: Hopkins Statistic
Given a dataset D comprising n points, we generate t uniform subsamples R i of
m points each, sampled from the same dataspace as D.
We also generate t subsamples of m points directly from D, using sampling
without replacement. Let D i denote the ith direct subsample.
Next, we compute the minimum distance between each point x j ∈ D i and points
in D
\delta_{min}(x_j) = \min_{x_i \in D,\ x_i \neq x_j} \left\{ \| x_j - x_i \| \right\}
Similarly, let δmin(yj) denote the minimum distance from each point yj in the uniform subsample Ri to the points in D. The Hopkins statistic for the ith pair of subsamples (in d dimensions) is then
HS_i = \frac{ \sum_{y_j \in R_i} \left( \delta_{min}(y_j) \right)^{d} }{ \sum_{x_j \in D_i} \left( \delta_{min}(x_j) \right)^{d} + \sum_{y_j \in R_i} \left( \delta_{min}(y_j) \right)^{d} }
Values close to 0.5 indicate that the data is essentially random, whereas values close to 1 indicate that D is highly clusterable; the statistic is averaged over the t subsample pairs.
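A NumPy sketch of the Hopkins statistic as outlined above; the use of the d-th power of the nearest-neighbor distances and the bounding-box uniform sampling are assumptions of this sketch.

```python
import numpy as np

def hopkins(D, m=30, t=10, seed=0):
    """Hopkins statistic: about 0.5 for random data, close to 1 for clusterable data."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    lo, hi = D.min(axis=0), D.max(axis=0)

    def nn_dist(points, exclude_self=False):
        # minimum distance from each query point to the points of D
        dist = np.linalg.norm(points[:, None, :] - D[None, :, :], axis=-1)
        if exclude_self:
            dist[dist == 0] = np.inf   # drop the zero self-distance (assumes no duplicates)
        return dist.min(axis=1)

    vals = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))              # uniform subsample of the dataspace
        Dm = D[rng.choice(n, size=m, replace=False)]      # direct subsample of D
        u = nn_dist(R) ** d
        w = nn_dist(Dm, exclude_self=True) ** d
        vals.append(u.sum() / (u.sum() + w.sum()))
    return float(np.mean(vals))
```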