
Chapter 6

Unsupervised Learning – Clustering
Assoc. Prof. Dr. Duong Tuan Anh
Faculty of Computer Science and Engineering,
HCMC Univ. of Technology
3/2015

1
Outline

 1 Introduction to unsupervised learning and clustering
 2 Partitional clustering (k-Means algorithm)
 3 Hierarchical clustering
 4 Fuzzy c-means clustering algorithm
 5 Incremental Clustering
 6 Clustering Evaluation Criteria

2
1. Introduction to clustering
 Clustering is the process of grouping a set of patterns. It
generates a partition consisting of groups or clusters from a
given collection of patterns.
 Representations or descriptions of the clusters formed are used
in decision making – classification, prediction, outlier detection.
 A clustering-based classification scheme is very useful in
solving large-scale classification problems in data mining.
 Patterns to be clustered are either labeled or unlabeled. We have:
 Clustering algorithms which group sets of unlabeled patterns. These approaches are popular.
 Algorithms which cluster labeled patterns. These approaches are practically important and are called supervised clustering. Supervised clustering is helpful in identifying clusters within collections of labeled patterns and in producing abstractions, in the form of cluster representatives/descriptions, which are useful for efficient classification (e.g., in data reduction for classification).

3
Clustering
 The process of clustering is carried out so that patterns
in the same cluster are similar in some sense and
patterns in different clusters are dissimilar in a
corresponding sense.

Figure 6.1 The Euclidean distance between any two points characterizes similarity: intra-cluster distance is small and inter-cluster distance is large.

4
Centroid and medoid

 Clustering is useful for generating data abstraction. A cluster of points is represented by its centroid or its medoid.
 A centroid stands for the sample mean of the points in cluster C; it is given by (1/N_C) Σ_{Xi ∈ C} Xi, where N_C is the number of points in cluster C.
 The medoid is the most centrally located point in the cluster: it is the point in the cluster for which the sum of the distances to the other points in the cluster is minimum.
 Figure 6.2 also shows a point that is far from all the points in the cluster; such a point is an outlier.
 Clustering algorithms that use medoids are more robust in the
presence of noisy patterns or outliers.

5
Centroid and medoid (cont.)

Figure 6.2 Centroid and medoid

6
An example of medoid
Given a cluster with the five following patterns:
X1 = (1, 1), X2 = (1, 2), X3 = (2, 1), X4 = (1.6, 1.4), X5 = (2, 2)
We calculate the distances from each pattern to all other patterns:
d(X1, X2) = 1, d(X1, X3) = 1, d(X1, X4) = 0.72, d(X1, X5) = 1.41
  Σ d(X1, Xi) = 4.13
d(X2, X1) = 1, d(X2, X3) = 1.41, d(X2, X4) = 0.84, d(X2, X5) = 1
  Σ d(X2, Xi) = 4.25
d(X3, X1) = 1, d(X3, X2) = 1.41, d(X3, X4) = 0.56, d(X3, X5) = 1
  Σ d(X3, Xi) = 3.97
d(X4, X1) = 0.72, d(X4, X2) = 0.84, d(X4, X3) = 0.56, d(X4, X5) = 0.72
  Σ d(X4, Xi) = 2.84
d(X5, X1) = 1.41, d(X5, X2) = 1, d(X5, X3) = 1, d(X5, X4) = 0.72
  Σ d(X5, Xi) = 4.13
So X4 = (1.6, 1.4) is the medoid since X4 is the most centrally-located pattern.
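
The medoid computation above can be verified with a short script. A minimal NumPy sketch, assuming Euclidean distance as in the example (variable names are my own):

import numpy as np

# The five patterns of the example
X = np.array([[1, 1], [1, 2], [2, 1], [1.6, 1.4], [2, 2]])

# Pairwise Euclidean distances; entry (i, j) is d(Xi, Xj)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# The medoid minimizes the total distance to the other points
totals = dists.sum(axis=1)
print(totals.round(2))        # approx. [4.13 4.25 3.97 2.84 4.13]
print(X[totals.argmin()])     # [1.6 1.4], i.e. X4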

7
2. Partitional Clustering
Partitional clustering algorithms generate a partition of the data. The most popular algorithm of this kind is the k-means algorithm.

A simple description of the k-means algorithm is given below.


 Step 1: Select k out of the given n patterns as the initial cluster centers. Assign each of the remaining n - k patterns to one of the k clusters; a pattern is assigned to its closest center/cluster.
 Step 2: Compute the cluster centers based on the current assignment of patterns.
 Step 3: Assign each of the n patterns to its closest center/cluster.
 Step 4: If there is no change in the assignment of patterns to clusters during two successive iterations, then stop; else, go to Step 2.


Selecting the initial cluster centers is a very important issue.
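
The four steps can be written directly in NumPy. This is a minimal illustrative sketch of the description above, not a reference implementation; it takes the first k patterns as the initial centers and assumes no cluster ever becomes empty:

import numpy as np

def k_means(X, k, max_iter=100):
    # Step 1: take the first k patterns as the initial cluster centers
    centers = X[:k].copy()
    prev_labels = None
    for _ in range(max_iter):
        # Steps 1 and 3: assign each pattern to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: stop if the assignment did not change between two iterations
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break
        prev_labels = labels
        # Step 2: recompute each center as the mean of its assigned patterns
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels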

8
K-Means algorithm

Figure 6.3 An example of k-means


9
Example of k-Means
 We illustrate k-Means with the two-dimensional data set of 7 points
shown in Fig 6.4 and 6.5
 The collection of points is grouped into 3 clusters (k = 3). The patterns are located at A = (1, 1), B = (1, 2), C = (2, 2), D = (6, 2), E = (7, 2), F = (6, 6), G = (7, 6).
 Suppose A, D and F are selected as the initial centers. Cluster 1 has (1, 1) as its cluster center, Cluster 2 has (6, 2) as its cluster center and Cluster 3 has (6, 6) as its cluster center. Now, B, C ∈ Cluster 1; E ∈ Cluster 2; and G ∈ Cluster 3.
 The new cluster center of Cluster 1 will be the mean of the patterns in Cluster 1, which is (1.33, 1.67). The new cluster center of Cluster 2 will be (6.5, 2) and the new cluster center of Cluster 3 will be (6.5, 6). Now, A, B, C ∈ Cluster 1, D, E ∈ Cluster 2 and F, G ∈ Cluster 3. Since there is no change in the clusters formed, this is the final set of clusters.
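
For comparison, the same partition can be reproduced with scikit-learn's KMeans by passing A, D and F as the initial centers. A usage sketch (the exact numeric formatting of the output depends on the library version):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 2], [6, 2], [7, 2], [6, 6], [7, 6]])  # A..G
init_centers = np.array([[1, 1], [6, 2], [6, 6]])                       # A, D, F

km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X)
print(km.labels_)           # e.g. [0 0 0 1 1 2 2] -> {A, B, C}, {D, E}, {F, G}
print(km.cluster_centers_)  # approx. [[1.33 1.67] [6.5 2.] [6.5 6.]]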

10
This gives a good partition
of three clusters {A, B, C},
{D, E} and {F, G}.

Figure 6.4 Optimal partition when A, D and F are the initial centers

11
Example of K-Means

If starting with A, B and C as the initial centers, we end up with the clusters shown in Fig 6.5. This partition has smaller variances in two clusters and a large variance in the third.

Figure 6.5 A non-optimal partition when A, B and C are the initial centers

12
K-Means (cont.)
 An important property of the k-means algorithm is that it
minimizes the sum of squared deviations of patterns in a cluster
from the center.
 If Ci is the i-th cluster and μi is its center, then the criterion function minimized by the algorithm is

Σ_{i=1}^{k} Σ_{x ∈ Ci} ||x − μi||²

The time complexity of the algorithm is O(nkdl), where l is the number of iterations and d is the dimensionality. The space requirement is O(kd). These features make the algorithm very attractive.
k-Means is one of the most frequently used algorithms in a variety of applications, and it is one of the top ten algorithms in data mining.

13
Weaknesses of k-Means

 Users have to determine k, the number of clusters, and the initial centers.
 k-Means cannot handle non-spherical clusters.
 k-Means is sensitive to data that contains outliers.
 k-Means is restricted to data for which there is a notion of centroid.

 K-medoids is another partitional clustering method.

 We can estimate the number of clusters (k) by using the Elbow method.

14
3. Hierarchical algorithms
 Hierarchical algorithms produce a nested sequence of data
partitions. The sequence can be depicted using a tree structure that
is known as a dendrogram.
 The algorithms are either divisive or agglomerative.
 The divisive approach starts with a single cluster containing all the patterns; at each successive step, a cluster is split. This process continues until each cluster contains a single pattern (a collection of singletons) or the required number of clusters is reached.
 Divisive algorithms use a top-down strategy for generating
partitions of the data.
 Agglomerative algorithms, on the other hand, use a bottom-up
strategy. They start with n singleton clusters when the input data
set is of size n, where each input pattern is in a different cluster. At
successive levels, the most similar pair of clusters is merged to
reduce the size of the partition by one.

15
Hierarchical algorithms (cont.)

 An important property of agglomerative algorithms is that once two patterns are placed in the same cluster at some level, they remain in the same cluster at all subsequent levels.

 Similarly, in divisive algorithms, once two patterns are placed in two different clusters at a level, they remain in different clusters at all subsequent levels.

16
Agglomerative clustering
Typically, an agglomerative clustering algorithm goes through
the following steps:
 Step 1: Compute the similarity/dissimilarity matrix between all pairs of patterns. Initialize each cluster with a distinct pattern.
 Step 2: Find the closest pair of clusters and merge them. Update the proximity matrix to reflect the merge.
 Step 3: If all the patterns are in one cluster, or the algorithm has reached the required number of clusters, stop. Else, go to Step 2.

Step 1 in the above algorithm requires O(n²) time to compute pair-wise similarities and O(n²) space to store the values, where n is the number of patterns in the data set.
Note that reaching a specified number of clusters is an alternative stopping condition for the iteration.

17
Agglomerative clustering
(cont.)
There are several ways of implementing the second step.
 In the single-link algorithm, the distance between two clusters C1 and C2 is the minimum of the distances d(X, Y), where X ∈ C1 and Y ∈ C2.
 In the complete-link algorithm, the distance between two clusters C1 and C2 is the maximum of the distances d(X, Y), where X ∈ C1 and Y ∈ C2.

Figure 6.6

18
Example of HAC

 Example: Consider the dataset in Figure 6.7, consisting of 8 data points.
 There are 8 clusters to start with. Manhattan distance is used in this example.

19
Example of HAC

Figure 6.7 The data set


A = (0.5, 0.5);
B = (2, 1.5);
C = (2, 0.5);
D = (5, 1); E = (5.75, 1)
F = (5, 3); G = (5.5, 3);
H = (2, 3)

Since the clusters {F} and {G} are the closest to each other with a distance 0.5,
they are merged.

20
The initial distance matrix (Manhattan distance):

      A     B     C     D     E     F     G     H
A     0     2.5   1.5   5     5.75  7     7.5   4
B     2.5   0     1.0   3.5   4.25  4.5   5     1.5
C     1.5   1.0   0     3.5   4.25  5.5   6     2.5
D     5     3.5   3.5   0     0.75  2     2.5   5
E     5.75  4.25  4.25  0.75  0     2.75  2.25  5.75
F     7     4.5   5.5   2     2.75  0     0.5   3
G     7.5   5     6     2.5   2.25  0.5   0     3.5
H     4     1.5   2.5   5     5.75  3     3.5   0

The updated matrix after merging {F} and {G} into one cluster (single-link is used here):

      A     B     C     D     E     F,G   H
A     0     2.5   1.5   5     5.75  7     4
B     2.5   0     1.0   3.5   4.25  4.5   1.5
C     1.5   1.0   0     3.5   4.25  5.5   2.5
D     5     3.5   3.5   0     0.75  2     5
E     5.75  4.25  4.25  0.75  0     2.25  5.75
F,G   7     4.5   5.5   2     2.25  0     3
H     4     1.5   2.5   5     5.75  3     0

21
{D} merges with {E}; {B} merges with {C}; {B, C} merges with {A}; {A, B, C} merges with {H}; {D, E} merges with {F, G}. At this stage, there are 2 clusters and the process can be stopped. The dendrogram given in Figure 6.8 shows the merging of clusters at various levels.

Figure 6.8 Dendrogram for the single-link algorithm

22
Applying the complete-link on the data set given in Figure 6.7, the
dendrogram generated by the complete-link algorithm is shown in Figure
6.9

Figure 6.9 Dendrogram for the complete-link algorithm

23
Complete-link and single-link

 In general, the complete-link algorithm generates compact clusters as it relates every pattern in a cluster with every other pattern in the cluster.
 The single-link algorithm characterizes the presence of a pattern in a cluster based on its nearest neighbor in the cluster. The single-link algorithm is highly versatile and can generate clusters of different shapes.

 Note: HAC is available in Scikit-learn
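
As the note says, HAC is available in scikit-learn (AgglomerativeClustering); the dendrograms of Figures 6.8 and 6.9 can also be reproduced with SciPy's hierarchy module. A sketch, assuming Manhattan (cityblock) distance as in the example:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# The eight points of Figure 6.7 (A..H)
X = np.array([[0.5, 0.5], [2, 1.5], [2, 0.5], [5, 1],
              [5.75, 1], [5, 3], [5.5, 3], [2, 3]])

Z_single = linkage(X, method='single', metric='cityblock')      # Figure 6.8
Z_complete = linkage(X, method='complete', metric='cityblock')  # Figure 6.9

# Cut the single-link tree into two clusters, as in the example
print(fcluster(Z_single, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2 2 1]
# dendrogram(Z_single)  # draws the tree when matplotlib is available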

24
Divisive clustering
 Divisive algorithms are either polythetic (where the division is based
on more than one feature) or monothetic when only one feature is
considered at a time.
 A scheme for polythetic clustering requires (1) determining the most
suitable cluster to be divided and (2) finding all possible 2-partitions
of the data and choosing the best partition.
 Here, the partition with the least sum of the sample variances of the two clusters is chosen as the best.
 From the resulting partition, the cluster with the maximum sample variance is selected and is split into an optimal 2-partition. This process is repeated until we get singleton clusters.
 The sum of the sample variances is calculated as follows. If the patterns are split into 2 partitions with m patterns X1,…,Xm in one cluster and n patterns Y1,…,Yn in the other cluster, with the centroids being C1 and C2, then the sum of sample variances is

Σ_{i=1}^{m} (Xi − C1)² + Σ_{j=1}^{n} (Yj − C2)²

25
Example of polythetic clustering
 Given the dataset in Figure 6.7.
 At the top there is a single cluster consisting of all the 8 patterns. By considering all possible 2-partitions (2⁷ − 1 = 127), the best 2-partition {{A, B, C, H}, {D, E, F, G}} is obtained. At the next level, {D, E, F, G} → {D, E} & {F, G}. So far, we have 3 clusters.
 At the next level, {A, B, C, H} → {A, B, C} & {H} and, at the subsequent level, the cluster {A, B, C} → {A} & {B, C}. Now, we have 5 clusters.
 Similarly, the dendrogram depicts partitions having 6, 7 and 8 clusters of the data at successive levels. At the final level, each cluster has only one point; such clusters are singleton clusters.
 The dendrogram is given in Figure 6.10

26
Figure 6.10 Dendrogram for a polythetic clustering of the eight 2-dimensional points

27
Monothetic clustering
It is possible to use one feature at a time to partition (monothetic
clustering) the given data set.
 In such a case, a feature direction is considered and the data is partitioned into clusters based on the gap in the projected values along the feature direction. That is, the data set is split into two parts at the midpoint of the maximum gap found among the sorted values of the chosen feature.
 Each of these clusters is further partitioned sequentially using the remaining features.

Example: We have 8 two-dimensional patterns; x1 and x2 are the two features used to describe these patterns. Taking feature x1, the data is split into two clusters based on the maximum inter-pattern gap between two successive patterns in the x1 direction. If we consider the x1 values of the patterns in increasing order, they are:
A: 0.5, B: 2, H: 2, C: 2, D: 5, F: 5, G: 5.5, and E: 5.75

28
So, the maximum inter-pattern gap (5 – 2 = 3) occurs between C and D. We select
the mid-point between 2 and 5 which is 3.5 and use it to split the data into two
clusters. They are: C1 = {A, B, C, H} and C2 = {D, E, F, G}
Each of these clusters is split further into two clusters based on the values of x2.
Ordering the patterns in C1 based on their x2 values, we get A: 0.5, C: 0.5, B: 1.5, H: 3.0. Here, the maximum gap of 1.5 occurs between B and H. So, splitting C1 at the mid-point, 2.25, along the x2 direction, we get two clusters: C11 = {H}, C12 = {A, B, C}.
Similarly, splitting C2 at the value 2 of x2 gives: C21 = {F, G}, C22 = {D, E}. A small sketch of this split rule follows.

Figure 6.11 Monothetic divisive clustering
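
The split rule (cut one feature at the midpoint of its largest gap) can be written in a few lines. A minimal sketch; the function name is my own:

import numpy as np

def monothetic_split(X, feature):
    # Sort the values of the chosen feature and locate the largest gap
    values = np.sort(X[:, feature])
    i = np.argmax(np.diff(values))
    threshold = (values[i] + values[i + 1]) / 2   # midpoint of the largest gap
    left = X[X[:, feature] <= threshold]
    right = X[X[:, feature] > threshold]
    return threshold, left, right

# Splitting the eight points of Figure 6.7 on x1 gives threshold 3.5,
# i.e. C1 = {A, B, C, H} and C2 = {D, E, F, G}.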
29
Complexity of hierarchical
clustering
 Hierarchical clustering algorithms are computationally
expensive.
 Agglomerative algorithms require the computation and storage of a similarity or dissimilarity matrix, which has O(n²) time and space requirements. They can be used in applications where hundreds of patterns are to be clustered. However, when the data sets are larger, these algorithms are not feasible because of the quadratic time and space demands.
 Divisive algorithms require time that is exponential in the number of patterns or the number of features. So they too do not scale up well to large-scale problems involving millions of patterns.

30
Density-based clustering

 Density-based clustering methods aim to discover clusters with arbitrary shapes.
 These methods regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).
 DBSCAN is a well-known density-based clustering method.

Model-based clustering: e.g., the EM (Expectation Maximization) algorithm.

31
4. Fuzzy-c-means clustering
 The k-Means clustering is an example of partitional clustering
where the data are divided between non-overlapping clusters,
each represented by a prototype which is the centroid of the
objects in a cluster. In such clustering, each data object belongs
to just one cluster.
 However, in fuzzy clustering, each object can belong to more
than one cluster, and associated with each object is a set of
membership weights, wij, representing the strength of the
association between that xi and a particular cluster Cj.
Membership weights vary between 0 and 1, and all the weights
for a particular object, xi, add up to 1.
 Fuzzy clustering is a process of assigning these membership
weights.

32
Fuzzy-c-means clustering (cont.)
 Fuzzy clustering is called soft clustering (as opposed to hard clustering).
 One of the most widely used fuzzy clustering algorithms is the
fuzzy c-means algorithm (Bezdek 1981), which is a direct
generalization of the k-means clustering algorithm.
 With fuzzy c-means, the centroid of a cluster, cj, is the
weighted centroid of all the points, weighted by their
membership weight or degree of belonging to that particular
cluster.
 The membership weight of an object is inversely related to
the distance of the object to the cluster center as calculated
on the previous pass.

33
Fuzzy-c-means clustering algorithm
The fuzzy-c-means algorithm consists of a very simple
iterative scheme which is similar to k-means.
Step 1: Choose the number of clusters, c, and randomly assign membership weights, wij, to each of the m objects for being in the clusters.
Step 2: Compute the centroid of each cluster, cj, as the weighted mean of the objects:

cj = ( Σ_{i=1}^{m} wij^p · xi ) / ( Σ_{i=1}^{m} wij^p )        (1)

where p is an exponent, the fuzzifier.


34
Fuzzy-c-means clustering algorithm (cont.)
 Step 3: For each object, update its membership weights of being in the clusters by minimizing the SSE:

SSE(C1, C2, ..., Cc) = Σ_{j=1}^{c} Σ_{i=1}^{m} wij^p · dist(xi, cj)²

where xi is an object (i = 1,.., m) and cj is the centroid of cluster j (j = 1,.., c).
In order to minimize the SSE, each membership weight wij should be calculated as follows:

wij = ( Σ_{k=1}^{c} ( dist(xi, cj) / dist(xi, ck) )^(2/(p−1)) )^(−1)        (2)

35
Fuzzy-c-means clustering algorithm (cont.)

 subject to the constraint that the weights sum to 1, where p is an exponent, the fuzzifier, that has a value between 1 and ∞ and determines the influence of the weights and, therefore, the level of cluster fuzziness. A large value results in smaller values of the membership weights and, hence, fuzzier clusters. Commonly p is set to 2. If p is close to 1, then the membership weights, wij, converge to 0 or 1, and the algorithm behaves like k-means clustering.
 Step 4: Finally, return to Step 2 until convergence (i.e.,
when the change of the weights is no more than a given
sensitivity threshold).
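
A compact NumPy sketch of Steps 1-4, assuming Euclidean distance and random initial weights (the function and variable names are my own; this is an illustrative sketch, not a reference implementation):

import numpy as np

def fuzzy_c_means(X, c, p=2, tol=0.01, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random membership weights, each row normalized to sum to 1
    W = rng.random((len(X), c))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: weighted centroids, Formula (1)
        Wp = W ** p
        centers = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Step 3: update the membership weights, Formula (2)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)            # guard against division by zero
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2 / (p - 1))
        W_new = 1.0 / ratio.sum(axis=2)
        # Step 4: stop when the largest change in the weights is below the threshold
        if np.abs(W_new - W).max() < tol:
            return centers, W_new
        W = W_new
    return centers, W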

36
Benefits and weaknesses of fuzzy-c-means

 The fuzzy-c-means algorithm minimizes intra-cluster variance and is less susceptible to outliers.
 It still suffers from the problem that the minimum found is likely to be a local minimum rather than the global minimum.
 The clustering results depend on the initial choice of weights.

37
Example
 Given a set of points:
X1 = (1, 3), X2 = (1.5, 3.2), X3 = (1.3, 2.8), X4 =(3,1)
 Assume that: the number of clusters c = 2, the fuzzifier p = 2,
convergence criterion L= 0.01. The membership weights:
w11 = 1, w21 = 1, w31 = 1, w41 = 0
w12 = 0, w22 = 0, w32 = 0, w42 = 1
 Now, we move to Step 2 to compute the centroids, using the
formula (1):

38
Example (cont.)
The first centroid
c1 = ((1*1.0 + 1*1.5 + 1*1.3)/3, (1*3.0+ 1*3.2 + 1*2.8)/3)
= (1.26, 3)
The second centroid c2 = (3,1)
Now, we move to Step 3
dist(x1, c1) = sqrt((1 − 1.26)² + (3 − 3)²) = 0.26
dist(x2, c1) = sqrt((1.5 − 1.26)² + (3.2 − 3)²) = 0.31
dist(x3, c1) = sqrt((1.3 − 1.26)² + (2.8 − 3)²) = 0.20
dist(x4, c1) = sqrt((3 − 1.26)² + (1 − 3)²) = 2.65
dist(x1, c2) = sqrt((1 − 3)² + (3 − 1)²) = 2.82
dist(x2, c2) = sqrt((1.5 − 3)² + (3.2 − 1)²) = 2.66
dist(x3, c2) = sqrt((1.3 − 3)² + (2.8 − 1)²) = 2.47
dist(x4, c2) = sqrt((3 − 3)² + (1 − 1)²) = 0
39
Example (cont.)
With these distances, we can update the membership weights of all objects using Formula (2):

w11 = ((dist(x1, c1)/dist(x1, c1))² + (dist(x1, c1)/dist(x1, c2))²)⁻¹
    = ((0.26/0.26)² + (0.26/2.82)²)⁻¹ = (1 + (0.26/2.82)²)⁻¹ = 0.991
w21 = ((dist(x2, c1)/dist(x2, c1))² + (dist(x2, c1)/dist(x2, c2))²)⁻¹
    = (1 + (0.31/2.66)²)⁻¹ = 0.986
w31 = ((dist(x3, c1)/dist(x3, c1))² + (dist(x3, c1)/dist(x3, c2))²)⁻¹
    = (1 + (0.20/2.47)²)⁻¹ = 0.993
w41 = ((dist(x4, c1)/dist(x4, c1))² + (dist(x4, c1)/dist(x4, c2))²)⁻¹
    = (1 + (2.65/0)²)⁻¹ = 0
40
Example (cont.)
 w12 = ((dist(x1, c2)/dist(x1, c1))² + (dist(x1, c2)/dist(x1, c2))²)⁻¹
    = ((2.82/0.26)² + (2.82/2.82)²)⁻¹ = ((2.82/0.26)² + 1)⁻¹ = 0.009
 w22 = ((dist(x2, c2)/dist(x2, c1))² + (dist(x2, c2)/dist(x2, c2))²)⁻¹
    = ((2.66/0.31)² + 1)⁻¹ = 0.014
 w32 = ((dist(x3, c2)/dist(x3, c1))² + (dist(x3, c2)/dist(x3, c2))²)⁻¹
    = ((2.47/0.20)² + 1)⁻¹ = 0.007
 w42 = ((dist(x4, c2)/dist(x4, c1))² + (dist(x4, c2)/dist(x4, c2))²)⁻¹
    = ((0/2.65)² + 1)⁻¹ = 1

41
Example (cont.)

w11 = 0.991, w21 = 0.986, w31 = 0.993, w41 = 0
w12 = 0.009, w22 = 0.014, w32 = 0.007, w42 = 1
Since max |wij(new) − wij(old)| = 0.014 > L = 0.01, the new membership weights do not satisfy the convergence condition, so we have to perform another iteration.

Note: Fuzzy-c-means is available in MATLAB.

42
5. Incremental clustering
 Incremental clustering is based on the assumption that it is
possible to consider patterns one at a time and assign them to
existing clusters. A new data pattern is assigned to a cluster
without affecting the existing clusters significantly.

 An incremental clustering algorithm (Leader algorithm):
Step 1: Assign the first data pattern, P1, to cluster C1; set i = 1 and j = 1 and let the leader Li be Pj.
Step 2: Set j = j + 1; consider clusters C1 to Ci in increasing order of the index and assign Pj to cluster Cm (1 ≤ m ≤ i) if the distance between Lm and Pj is less than a user-specified threshold T; if no cluster satisfies this property, then set i = i + 1, assign Pj to the new cluster Ci, and set the leader Li to be Pj.
 Repeat Step 2 until all the data patterns are assigned to clusters.
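
A direct Python sketch of the Leader algorithm above, assuming Euclidean distance on NumPy arrays (the threshold T is user-specified; the function name is my own):

import numpy as np

def leader_clustering(patterns, T):
    leaders = [patterns[0]]      # Step 1: the first pattern starts cluster C1
    assignments = [0]
    for P in patterns[1:]:       # Step 2: a single pass over the remaining patterns
        for m, L in enumerate(leaders):          # clusters in increasing order of index
            if np.linalg.norm(P - L) < T:
                assignments.append(m)            # assign to the first leader within T
                break
        else:
            leaders.append(P)                    # no leader close enough: new cluster
            assignments.append(len(leaders) - 1)
    return leaders, assignments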
43
Incremental clustering (cont.)
 The Leader algorithm requires a single database
scan. It is very efficient.
 Unfortunately, the incremental algorithm is order-
dependent.
 Leader is used for continuous data.
 Squeezer is the Leader-like counterpart for categorical data.

Note: There are several well-known incremental


clustering algorithms, for example, BIRCH algorithm.

T. Zhang, R. Ramakrishnan, M. Livny, “BIRCH: an efficient data


clustering method for very large databases”, Proc. of ACM
SIGMOD Int. Conf. on Management of Data, 1996.

44
6. Clustering evaluation criteria
 We can use classified datasets and compare how well the clustering results fit the data labels. Five clustering evaluation criteria can be used in this case: Jaccard, Rand, FM, CSM and NMI. (This is external evaluation.)
 Besides, we can evaluate the quality of clustering by using the objective function:

F = Σ_{m=1}^{k} Σ_{xi ∈ Cm} ||xi − cm||

where the xi are the objects and the cm are the cluster centers.
This criterion considers only compactness. It is an internal evaluation criterion: the smaller the criterion value, the better the clustering quality.
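
Given integer cluster labels and the corresponding centers, F can be computed directly; a small sketch:

import numpy as np

def compactness(X, labels, centers):
    # Sum of the distances from each object to the center of its own cluster
    return sum(np.linalg.norm(X[labels == m] - c, axis=1).sum()
               for m, c in enumerate(centers))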
45
The modified Hubert statistic

 The definition of this internal evaluation criterion is given by the equation

Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j) · Q(i, j)

where M = N(N−1)/2, P is the proximity matrix of the dataset and Q is an N×N matrix whose (i, j) element is equal to the distance between the representative points (ci, cj) of the clusters to which the objects xi and xj belong.
This criterion considers only separation. The bigger the criterion value, the better the clustering quality.

46
External Evaluation Criteria
 Consider G = G1, G2, …, GM as the clusters (classes) of a classified dataset, and A = A1, A2, …, AM as those obtained by a clustering algorithm. Denote by D the dataset of patterns. For all pairs of patterns (Di, Dj) in D, we count the following quantities:
 a is the number of pairs whose members belong to the same class in G and are clustered together in A.
 b is the number of pairs whose members belong to the same class in G but are not clustered together in A.
 c is the number of pairs whose members are clustered together in A but do not belong to the same class in G.
 d is the number of pairs whose members are neither clustered together in A nor belong to the same class in G.
 The clustering evaluation criteria are defined as below:
1. Jaccard score (Jaccard):

Jaccard = a / (a + b + c)
47
2. Rand statistic (Rand):

Rand = (a + d) / (a + b + c + d)

3. Folkes and Mallow index (FM):

FM = sqrt( (a / (a + b)) · (a / (a + c)) )

4. Cluster Similarity Measure (CSM):

CSM(G, A) = (1/M) Σ_{i=1}^{M} max_{1 ≤ j ≤ M} sim(Gi, Aj)

where

sim(Gi, Aj) = 2 |Gi ∩ Aj| / (|Gi| + |Aj|)

|Ai| is the number of patterns in cluster Ai, |Gi| is the number of patterns in class Gi, M is the number of classes (clusters), and |Gi ∩ Aj| is the number of patterns in class Gi which are also in cluster Aj.
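
The pair counts a, b, c, d and the first three criteria can be computed with a short function. A sketch over two equal-length label sequences (ground-truth classes G and cluster labels A); the names are my own:

from itertools import combinations
from math import sqrt

def pair_counts(G, A):
    # Count the four kinds of pairs defined above
    a = b = c = d = 0
    for i, j in combinations(range(len(G)), 2):
        same_class = G[i] == G[j]
        same_cluster = A[i] == A[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d

def jaccard_rand_fm(G, A):
    a, b, c, d = pair_counts(G, A)
    jaccard = a / (a + b + c)
    rand = (a + d) / (a + b + c + d)
    fm = sqrt((a / (a + b)) * (a / (a + c)))   # assumes a + b > 0 and a + c > 0
    return jaccard, rand, fm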
48
 5. Normalized Mutual Information (NMI):
where N is the number of patterns in the dataset, |Ai| is the number of patterns in cluster Ai, |Gi| is the number of patterns in class Gi, and Ni,j = |Gi ∩ Aj| is the number of patterns in class Gi which are also in cluster Aj.
All the evaluation criteria have values ranging from 0 to 1, where 1 corresponds to the case when G and A are identical. The bigger the criterion value, the better the clustering quality.

49
References

 M. N. Murty and V. S. Devi, 2011, Pattern Recognition –


An Algorithmic Approach, Springer.
 S. Marsland, 2015, Machine Learning: An Algorithmic Perspective, 2nd Edition, Chapman & Hall/CRC.

50
How to determine the parameter k of k-means:
The elbow method
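
A common way to apply the elbow method is to run k-Means for a range of k values, plot the within-cluster sum of squared distances against k, and pick the k at the "elbow" where the curve flattens. A sketch using scikit-learn (which exposes this quantity as inertia_), with placeholder data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('within-cluster SSE (inertia)')
plt.show()    # choose k at the elbow of the curve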

51
Appendix: how to compute dissimilarity
on categorical data.
 When attributes are not continuous, we apply the following distance
measure to patterns with categorical attributes.
 Given two patterns X and Y with m categorical attributes, the distance between X and Y can be defined as the total number of differences between the corresponding attributes of X and Y. The smaller this total number of differences, the smaller the distance between the two patterns. That means:

d(X, Y) = Σ_{j=1}^{m} δ(xj, yj)

where

δ(xj, yj) = 0 if xj = yj, and 1 if xj ≠ yj
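
A direct implementation of this mismatch count; a minimal sketch for two equal-length sequences of categorical attribute values:

def categorical_distance(X, Y):
    # Number of attribute positions where the two patterns differ
    return sum(1 for x, y in zip(X, Y) if x != y)

# Hypothetical example:
# categorical_distance(['red', 'small', 'round'], ['red', 'large', 'round'])  -> 1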

52
 S. Boriah, V. Chandola, V.J. Kumar, Similarity
measures for categorical data: a comparative
evaluation, Proc. of 8th SIAM International
Conference in Data Mining, 2008, pp. 242-254.

53
Terminology
 Clustering: gom cụm, partition: phân hoạch, data reduction: thu giảm
tập dữ liệu, labeled pattern: mẫu có gắn nhãn lớp, unlabeled pattern:
mẫu không có nhãn lớp, intra-cluster distance: khoảng cách giữa các
điểm trong một cụm, inter-cluster distance: khoảng cách giữa các
cụm, data abstraction: trích yếu dữ liệu, outlier: điểm ngoại biên,
centroid: trung tâm cụm, partitional clustering: gom cụm phân hoạch,
hierarchical clustering: gom cụm phân cấp, divisive hierarchical
clustering : gom cụm phân cấp tách, agglomerative hierarchical
clustering: gom cụm phân cấp gộp, single-link: liên kết đơn,
complete-link: liên kết đầy đủ, polythetic clustering: gom cụm tách
dựa vào nhiều thuộc tính, monothetic clustering: gom cụm phân cấp
tách dựa vào một thuộc tính, fuzzy clustering: gom cụm mờ, soft
clustering: gom cụm mềm, fuzzifier: hệ số mờ hóa, internal
evaluation: đánh giá nội, modified Hubert statistic: độ đo Hubert cải
tiến, objective function: hàm mục tiêu, external evaluation: đánh giá
ngoại, density-based clustering: gom cụm dựa vào mật độ,
incremental clustering: gom cụm gia tăng, non-spherical cluster: cụm
không phải hình cầu.
54
