Chapter 6
Unsupervised
Learning – Clustering
Assoc. Prof. Dr. Duong Tuan Anh
Faculty of Computer Science and Engineering,
HCMC Univ. of Technology
3/2015
1
Outline
1. Introduction to clustering
2. Partitional clustering
3. Hierarchical algorithms
4. Fuzzy-c-means clustering
5. Incremental clustering
6. Clustering evaluation criteria
2
1. Introduction to clustering
Clustering is the process of grouping a set of patterns. It
generates a partition consisting of groups or clusters from a
given collection of patterns.
Representations or descriptions of the clusters formed are used
in decision making – classification, prediction, outlier detection.
A clustering-based classification scheme is very useful in
solving large-scale classification problems in data mining.
Patterns to be clustered may be either labeled or unlabeled; clustering
algorithms typically group sets of unlabeled patterns.
3
Clustering
The process of clustering is carried out so that patterns
in the same cluster are similar in some sense and
patterns in different clusters are dissimilar in a
corresponding sense.
Figure 6.1: The Euclidean distance between any two points characterizes similarity: intra-cluster distance is small and inter-cluster distance is large.
4
Centroid and medoid
5
Centroid and medoid (cont.)
6
An example of medoid
Given a cluster with the following five patterns:
X1 = (1, 1), X2 = (1, 2), X3 = (2, 1), X4 = (1.6, 1.4), X5 = (2, 2)
We calculate the distances from each pattern to all the other patterns:
d(X1, X2) = 1, d(X1, X3) = 1, d(X1, X4) = 0.72, d(X1, X5) = 1.41;  Σi d(X1, Xi) = 4.13
d(X2, X1) = 1, d(X2, X3) = 1.41, d(X2, X4) = 0.84, d(X2, X5) = 1;  Σi d(X2, Xi) = 4.25
d(X3, X1) = 1, d(X3, X2) = 1.41, d(X3, X4) = 0.56, d(X3, X5) = 1;  Σi d(X3, Xi) = 3.97
d(X4, X1) = 0.72, d(X4, X2) = 0.84, d(X4, X3) = 0.56, d(X4, X5) = 0.72;  Σi d(X4, Xi) = 2.84
d(X5, X1) = 1.41, d(X5, X2) = 1, d(X5, X3) = 1, d(X5, X4) = 0.72;  Σi d(X5, Xi) = 4.13
So X4 = (1.6, 1.4) is the medoid since X4 is the most centrally-located pattern.
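A minimal NumPy sketch of this computation is shown below; it finds the pattern with the smallest total distance to all the others and recovers X4 as the medoid (the variable names are illustrative, not from the slides).

```python
import numpy as np

# Find the medoid: the pattern whose total Euclidean distance to all others is smallest.
patterns = np.array([[1, 1], [1, 2], [2, 1], [1.6, 1.4], [2, 2]])

diffs = patterns[:, None, :] - patterns[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))       # pairwise Euclidean distances

total = dist.sum(axis=1)                       # total distance from each pattern to the others
medoid_index = int(np.argmin(total))
print(total.round(2))                          # approximately [4.13, 4.25, 3.98, 2.85, 4.13]
print(patterns[medoid_index])                  # [1.6 1.4] -> X4 is the medoid
```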
7
2. Partitional Clustering
Partitional clustering algorithms generate a partition of the data. The
most popular algorithm of this kind is the k-means algorithm, which proceeds as follows:
Step 1: Choose the number of clusters, k.
Step 2: Select k initial cluster centers, for example k of the given set of patterns.
Step 3: Assign each of the n patterns to its closest center/cluster.
Step 4: Recompute each cluster center as the mean of the patterns assigned to it, and repeat Steps 3 and 4 until the cluster assignments no longer change.
8
K-Means algorithm
10
This gives a good partition into three clusters: {A, B, C}, {D, E} and {F, G}.
Figure 6.4 Optimal partition when A, D and F are the initial centers
11
Example of K-Means
12
K-Means (cont.)
An important property of the k-means algorithm is that it
minimizes the sum of squared deviations of patterns in a cluster
from the center.
If $C_i$ is the i-th cluster and $\mu_i$ is its center, then the criterion
function minimized by the algorithm is
$$\sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2$$
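A minimal NumPy sketch of the algorithm and its criterion function is given below; it follows the steps listed earlier, but the function name and parameters are illustrative, not taken from the slides.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k of the given patterns as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each pattern to its closest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned patterns.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when the centers stabilise
            break
        centers = new_centers
    # Criterion value: sum of squared deviations of patterns from their cluster centers.
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

labels, centers, sse = k_means(np.random.default_rng(1).random((100, 2)), k=3)
```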
13
Weaknesses of k-Means
14
3. Hierarchical algorithms
Hierarchical algorithms produce a nested sequence of data
partitions. The sequence can be depicted using a tree structure that
is known as a dendrogram.
The algorithms are either divisive or agglomerative.
A divisive algorithm starts with a single cluster containing all the patterns; at
each successive step, a cluster is split. This process continues until
we end up with one pattern per cluster (a collection of singletons) or
we reach the required number of clusters.
Divisive algorithms use a top-down strategy for generating
partitions of the data.
Agglomerative algorithms, on the other hand, use a bottom-up
strategy. They start with n singleton clusters when the input data
set is of size n, where each input pattern is in a different cluster. At
successive levels, the most similar pair of clusters is merged to
reduce the size of the partition by one.
15
Hierarchical algorithms (cont.)
16
Agglomerative clustering
Typically, an agglomerative clustering algorithm goes through
the following steps:
Step 1: Compute the similarity/dissimilarity matrix between all pairs of patterns, and treat each pattern as a singleton cluster.
Step 2: Using the matrix, find the closest pair of clusters and merge them; update the matrix to reflect the merge.
Step 3: If all patterns are in one cluster, or the required number of clusters is reached, stop; otherwise go to Step 2.
17
Agglomerative clustering
(cont.)
There are several ways of implementing the second (merging) step.
In the single-link algorithm, the distance between two clusters C1 and C2 is the minimum of the distances d(X, Y), where X ∈ C1 and Y ∈ C2.
In the complete-link algorithm, the distance between two clusters C1 and C2 is the maximum of the distances d(X, Y), where X ∈ C1 and Y ∈ C2.
Figure 6.6
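As an illustration, the sketch below uses SciPy's hierarchical clustering routines to contrast the two linkage rules; the 2-D points are made up for the example and are not the data of Figure 6.7.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up 2-D patterns (not the data of Figure 6.7).
X = np.array([[0.5, 0.5], [2.0, 1.5], [2.0, 0.5], [5.0, 1.0],
              [5.5, 1.5], [6.5, 3.0], [7.0, 3.0], [2.5, 3.0]])

d = pdist(X)                                # Step 1: dissimilarity matrix (condensed form)
single = linkage(d, method='single')        # merge using the minimum inter-cluster distance
complete = linkage(d, method='complete')    # merge using the maximum inter-cluster distance

# Cut each dendrogram into two clusters and compare the resulting partitions.
print(fcluster(single, t=2, criterion='maxclust'))
print(fcluster(complete, t=2, criterion='maxclust'))
```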
18
Example of HAC
19
Example of HAC
Since the clusters {F} and {G} are the closest to each other, with a distance of 0.5,
they are merged.
20
The initial distance matrix:
      A     B     C     D     E     F     G     H
A     0     2.5   1.5   5     5.75  7     7.5   4
B     2.5   0     1.0   3.5   4.25  4.5   5     1.5
C     1.5   1     0     3.5   4.25  5.5   6     2.5
D     5     3.5   3.5   0     0.75  2     2.5   5
E     5.75  4.25  4.25  0.75  0     2.75  2.25  5.75
F     7     4.5   5.5   2     2.75  0     0.5   3
G     7.5   5     6     2.5   2.5   0.5   0     3.5
H     4     1.5   2.5   2.5   5.75  3     3.5   0
The updated matrix after merging {F} and {G} into one cluster (single-link is used here):
      A     B     C     D     E     F,G   H
A     0     2.5   1.5   5     5.75  7     4
B     2.5   0     1.0   3.5   4.25  4.5   1.5
C     1.5   1     0     3.5   4.25  5.5   2.5
D     5     3.5   3.5   0     0.75  2     5
E     5.75  4.25  4.25  0.75  0     2.25  5.75
F,G   7     4.5   5.5   2     2.75  0     3
H     4     1.5   2.5   2.5   5.75  3     0
21
{D} merges with {E}; {B} merges with {C}; {B, C} merges with {A}; {A, B, C} merges with {H}; {D, E}
merges with {F, G}. At this stage, there are 2 clusters, and the process can be stopped.
The dendrogram given in Figure 6.8 shows the merging of clusters at the various levels.
22
Applying the complete-link algorithm to the data set given in Figure 6.7, the
resulting dendrogram is shown in Figure 6.9.
23
Complete-link and single-link
24
Divisive clustering
Divisive algorithms are either polythetic (where the division is based
on more than one feature) or monothetic (where only one feature is
considered at a time).
A scheme for polythetic clustering requires (1) determining the most
suitable cluster to be divided and (2) finding all possible 2-partitions
of the data and choosing the best partition.
Here, the partition with the least sum of the sample variances of the two
clusters is chosen as the best 2-partition, i.e. the partition {C1, C2} minimizing
$$\sum_{X \in C_1} (X - \bar{C}_1)^2 + \sum_{Y \in C_2} (Y - \bar{C}_2)^2$$
where $\bar{C}_1$ and $\bar{C}_2$ are the centers of the two clusters. A brute-force sketch is given below.
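The sketch below enumerates every 2-partition of a (small) cluster and keeps the split with the least sum of within-cluster squared deviations. The helper name is hypothetical, and the approach is only feasible for tiny clusters because there are 2^(n-1) - 1 candidate splits.

```python
import numpy as np
from itertools import combinations

def best_two_partition(X):
    """Return the 2-partition of the rows of X with the least sum of squared deviations."""
    n = len(X)
    best, best_score = None, np.inf
    for r in range(1, n // 2 + 1):
        for left in combinations(range(n), r):
            right = tuple(i for i in range(n) if i not in left)
            # Sum of squared deviations of each side from its own mean.
            score = sum(((X[list(part)] - X[list(part)].mean(axis=0)) ** 2).sum()
                        for part in (left, right))
            if score < best_score:
                best, best_score = (left, right), score
    return best, best_score
```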
25
Example of polythetic clustering
Given the dataset in Figure 6.7.
At the top there is a single cluster consisting of all the 8 patterns.
By considering all possible 2-partitions (2⁷ - 1 = 127), the best 2-partition,
{{A, B, C, H}, {D, E, F, G}}, is obtained. At the next level, {D, E, F, G} is split
into {D, E} and {F, G}. So far, we have 3 clusters.
At the next level, {A, B, C, H} is split into {A, B, C} and {H}, and at the
subsequent level the cluster {A, B, C} is split into {A} and {B, C}. Now, we
have 5 clusters.
Similarly, the dendrogram depicts partitions having 6, 7 and 8 clusters of the
data at successive levels. At the final level, each cluster has only one point;
such clusters are singleton clusters.
The dendrogram is given in Figure 6.10
26
Figure 6.10 Dendrogram for a polythetic clustering of the eight 2-dimensional points
27
Monothetic clustering
It is possible to use one feature at a time to partition (monothetic
clustering) the given data set.
In such a case, a feature direction is considered and the data is
partitioned into clusters based on the gaps in the projected values
along that feature direction. That is, the data set is split into two
parts at the mid-point of the maximum gap found among the sorted
values of the chosen feature.
Each of these clusters is further partitioned sequentially using the
remaining features.
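A minimal sketch of one monothetic split might look as follows: sort the values of the chosen feature, find the largest gap between consecutive values, and cut the data at the mid-point of that gap (the function name is illustrative).

```python
import numpy as np

def monothetic_split(X, feature):
    values = np.unique(X[:, feature])             # sorted distinct values of the feature
    gaps = np.diff(values)
    g = int(np.argmax(gaps))                      # position of the largest gap
    threshold = (values[g] + values[g + 1]) / 2   # mid-point of that gap
    left = X[X[:, feature] <= threshold]
    right = X[X[:, feature] > threshold]
    return threshold, left, right
```

Applying such a split recursively, feature by feature, yields the kind of partition shown in Figure 6.11.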
28
So the maximum inter-pattern gap along x1 (5 - 2 = 3) occurs between C and D. We select
the mid-point between 2 and 5, which is 3.5, and use it to split the data into two
clusters: C1 = {A, B, C, H} and C2 = {D, E, F, G}.
Each of these clusters is split further into two clusters based on the values of x2.
Ordering the patterns in C1 by their x2 values, we get A: 0.5, C: 0.5, B: 1.5, H: 3.0.
Here, the maximum gap of 1.5 occurs between B and H. So, splitting C1 at the
mid-point 2.25 along the x2 direction, we get two clusters: C11 = {H} and C12 = {A, B, C}.
Similarly, splitting C2 at the value 2 of x2 gives C21 = {F, G} and C22 = {D, E}.
Figure 6.11 Monothetic divisive clustering
29
Complexity of hierarchical
clustering
Hierarchical clustering algorithms are computationally
expensive.
An agglomerative algorithm requires the computation and storage of a
similarity or dissimilarity matrix, which takes O(n²) time and space.
Such algorithms can be used in applications where hundreds of patterns
are to be clustered.
However, when the data sets are larger, these algorithms are not feasible
because of the quadratic time and space demands.
Divisive algorithms require time that is exponential in the number of
patterns or the number of features, so they too do not scale up to
large-scale problems involving millions of patterns.
30
Density-based clustering
Model-based clustering.
EM (Expectation Maximization)
31
4. Fuzzy-c-means clustering
The k-means clustering is an example of partitional clustering,
where the data are divided among non-overlapping clusters,
each represented by a prototype which is the centroid of the
objects in the cluster. In such clustering, each data object belongs
to just one cluster.
In fuzzy clustering, however, each object can belong to more than
one cluster, and associated with each object xi is a set of
membership weights wij representing the strength of the
association between xi and a particular cluster Cj.
Membership weights vary between 0 and 1, and all the weights
for a particular object xi add up to 1.
Fuzzy clustering is a process of assigning these membership
weights.
32
Fuzzy-c-means clustering (cont.)
Fuzzy clustering is also called soft clustering (as opposed to hard clustering).
One of the most widely used fuzzy clustering algorithms is the
fuzzy c-means algorithm (Bezdek 1981), which is a direct
generalization of the k-means clustering algorithm.
With fuzzy c-means, the centroid of a cluster, cj, is the
weighted centroid of all the points, weighted by their
membership weight or degree of belonging to that particular
cluster.
The membership weight of an object is inversely related to
the distance of the object to the cluster center as calculated
on the previous pass.
33
Fuzzy-c-means clustering algorithm
The fuzzy-c-means algorithm consists of a very simple
iterative scheme which is similar to k-means.
Step 1: Choose the number of clusters, c, and randomly assign membership
weights, wij, to each of the m objects for being in the clusters.
Step 2: Compute the centroid of each cluster, cj, as the weighted mean of the objects:
$$c_j = \frac{\sum_{i=1}^{m} w_{ij}^{\,p}\, x_i}{\sum_{i=1}^{m} w_{ij}^{\,p}} \qquad (1)$$
35
Fuzzy-c-means clustering algorithm (cont.)
Step 3: Compute the distance from each object xi to each centroid, and update the membership weights using
$$w_{ij} = \left( \sum_{k=1}^{c} \left( \frac{\mathrm{dist}(x_i, c_j)}{\mathrm{dist}(x_i, c_k)} \right)^{2/(p-1)} \right)^{-1} \qquad (2)$$
Step 4: Repeat Steps 2 and 3 until the change in the membership weights is smaller than the convergence criterion L.
36
Benefits and weaknesses of fuzzy-c-means
37
Example
Given a set of points:
X1 = (1, 3), X2 = (1.5, 3.2), X3 = (1.3, 2.8), X4 =(3,1)
Assume that: the number of clusters c = 2, the fuzzifier p = 2,
convergence criterion L = 0.01. The initial membership weights are:
w11 = 1, w21 = 1, w31 = 1, w41 = 0
w12 = 0, w22 = 0, w32 = 0, w42 = 1
Now, we move to Step 2 to compute the centroids, using the
formula (1):
38
Example (cont.)
The first centroid
c1 = ((1*1.0 + 1*1.5 + 1*1.3)/3, (1*3.0+ 1*3.2 + 1*2.8)/3)
= (1.26, 3)
The second centroid c2 = (3,1)
Now, we move to Step 3
dist(x1, c1) = sqrt((1 - 1.26)² + (3 - 3)²) = 0.26
dist(x2, c1) = sqrt((1.5 - 1.26)² + (3.2 - 3)²) = 0.31
dist(x3, c1) = sqrt((1.3 - 1.26)² + (2.8 - 3)²) = 0.20
dist(x4, c1) = sqrt((3 - 1.26)² + (1 - 3)²) = 2.65
dist(x1, c2) = sqrt((1 - 3)² + (3 - 1)²) = 2.82
dist(x2, c2) = sqrt((1.5 - 3)² + (3.2 - 1)²) = 2.66
dist(x3, c2) = sqrt((1.3 - 3)² + (2.8 - 1)²) = 2.47
dist(x4, c2) = sqrt((3 - 3)² + (1 - 1)²) = 0
39
Example (cont.)
With these distances, we can update the membership weights of
all objects using Formula (2).
w11 = ((dist(x1, c1)/dist(x1, c1))² + (dist(x1, c1)/dist(x1, c2))²)⁻¹
    = ((0.26/0.26)² + (0.26/2.82)²)⁻¹ = (1 + (0.26/2.82)²)⁻¹
    = 0.991
w21 = ((dist(x2, c1)/dist(x2, c1))² + (dist(x2, c1)/dist(x2, c2))²)⁻¹
    = (1 + (0.31/2.66)²)⁻¹
    = 0.986
w31 = ((dist(x3, c1)/dist(x3, c1))² + (dist(x3, c1)/dist(x3, c2))²)⁻¹
    = (1 + (0.20/2.47)²)⁻¹
    = 0.993
w41 = ((dist(x4, c1)/dist(x4, c1))² + (dist(x4, c1)/dist(x4, c2))²)⁻¹
    = (1 + (2.65/0)²)⁻¹ = 0
40
Example (cont.)
w12 = ((dist(x1, c2)/dist(x1, c1))² + (dist(x1, c2)/dist(x1, c2))²)⁻¹
    = ((2.82/0.26)² + (2.82/2.82)²)⁻¹ = ((2.82/0.26)² + 1)⁻¹
    = 0.009
w22 = ((dist(x2, c2)/dist(x2, c1))² + (dist(x2, c2)/dist(x2, c2))²)⁻¹
    = ((2.66/0.31)² + 1)⁻¹
    = 0.014
w32 = ((dist(x3, c2)/dist(x3, c1))² + (dist(x3, c2)/dist(x3, c2))²)⁻¹
    = ((2.47/0.20)² + 1)⁻¹
    = 0.007
w42 = ((dist(x4, c2)/dist(x4, c1))² + (dist(x4, c2)/dist(x4, c2))²)⁻¹
    = ((0/2.65)² + 1)⁻¹ = 1
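The computations above can be packaged into a small NumPy sketch of fuzzy c-means with fuzzifier p = 2; the function and parameter names are mine, and an object lying exactly on a centroid is given (almost) full membership in that cluster via a small epsilon.

```python
import numpy as np

def fuzzy_c_means(X, c, p=2, n_iter=100, tol=0.01, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), c))
    W /= W.sum(axis=1, keepdims=True)          # the memberships of each object sum to 1
    for _ in range(n_iter):
        # Formula (1): centroids as membership-weighted means of the objects.
        centers = (W.T ** p) @ X / (W.T ** p).sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # avoid division by zero at a centroid
        # Formula (2): inverse-distance membership update.
        e = 2.0 / (p - 1)
        W_new = 1.0 / (d ** e * (1.0 / d ** e).sum(axis=1, keepdims=True))
        done = np.abs(W_new - W).max() < tol   # convergence criterion L
        W = W_new
        if done:
            break
    return W, centers

# Running it on X1..X4 from the example should converge to a similar two-cluster
# structure (the cluster labels may be swapped because of the random initialization).
X = np.array([[1, 3], [1.5, 3.2], [1.3, 2.8], [3, 1]], dtype=float)
W, centers = fuzzy_c_means(X, c=2)
```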
41
Example (cont.)
42
5. Incremental clustering
Incremental clustering is based on the assumption that it is
possible to consider patterns one at a time and assign them to
existing clusters. A new data pattern is assigned to a cluster
without affecting the existing clusters significantly.
43
Incremental clustering (cont.)
The Leader algorithm requires a single scan of the database, so it is
very efficient.
Unfortunately, this incremental algorithm is order-dependent.
Leader is used for continuous data; Squeezer is a corresponding
algorithm for categorical data.
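A minimal sketch of the Leader algorithm is given below; the distance threshold is an assumed parameter and, as noted above, the resulting clusters depend on the order in which the patterns are presented.

```python
import numpy as np

def leader(X, threshold):
    leaders = [X[0]]                  # the first pattern starts the first cluster
    labels = [0]
    for x in X[1:]:
        d = [np.linalg.norm(x - l) for l in leaders]
        j = int(np.argmin(d))
        if d[j] <= threshold:
            labels.append(j)          # absorb the pattern into the nearest existing cluster
        else:
            leaders.append(x)         # the pattern becomes the leader of a new cluster
            labels.append(len(leaders) - 1)
    return np.array(leaders), np.array(labels)
```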
44
6. Clustering evaluation criteria
We can use classified datasets and compare how well the clustering
results fit the class labels of the data. Five clustering evaluation
criteria can be used in this case: Jaccard, Rand, FM, CSM and NMI.
(External evaluation)
Besides, we can evaluate the quality of a clustering internally, for example by
the objective function
$$F = \sum_{m=1}^{k} \sum_{x_i \in C_m} \| x_i - c_m \|$$
or by the modified Hubert statistic
$$\Gamma = \frac{1}{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} P(i, j)\, Q(i, j)$$
where M = N(N - 1)/2 is the number of pairs of patterns, P is the proximity matrix
of the patterns, and Q(i, j) is the distance between the centers of the clusters
to which patterns i and j belong. (Internal evaluation)
46
External Evaluation Criteria
Consider G = G1, G2, …, GM as the classes of a classified dataset, and
A = A1, A2, …, AM as the clusters obtained by a clustering algorithm.
Denote by D the dataset of patterns. For all pairs of patterns (Di, Dj) in D,
we count the following quantities:
a is the number of pairs that belong to the same class in G and are
clustered together in A.
b is the number of pairs that belong to the same class in G but are not
clustered together in A.
c is the number of pairs that are clustered together in A but do not
belong to the same class in G.
d is the number of pairs that are neither clustered together in A nor
belong to the same class in G.
The clustering evaluation criteria are defined as below:
1. Jaccard score (Jaccard):
$$\mathrm{Jaccard} = \frac{a}{a + b + c}$$
47
2. Rand statistic (Rand):
$$\mathrm{Rand} = \frac{a + d}{a + b + c + d}$$
3. Fowlkes and Mallows index (FM):
$$\mathrm{FM} = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$$
4. Cluster Similarity Measure (CSM):
$$\mathrm{CSM}(G, A) = \frac{1}{M} \sum_{i=1}^{M} \max_{1 \le j \le M} \mathrm{sim}(G_i, A_j)$$
where
$$\mathrm{sim}(G_i, A_j) = \frac{2\, |G_i \cap A_j|}{|G_i| + |A_j|}$$
|Ai| is the number of patterns in cluster Ai, |Gi| is the number of patterns in
class Gi, M is the number of classes (clusters), and |Gi ∩ Aj| is the number of
patterns in the class Gi which are also in cluster Aj.
48
5. Normalized Mutual Information (NMI):
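A sketch of the pair-counting criteria is shown below; `truth` holds the class labels G and `pred` the cluster labels A (the example labels are made up), and scikit-learn's NMI is used for criterion 5, whose exact normalization may differ from the definition intended on this slide.

```python
from itertools import combinations
from math import sqrt
from sklearn.metrics import normalized_mutual_info_score

def pair_counts(truth, pred):
    """Count the pairs a, b, c, d defined above."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d

truth = [0, 0, 0, 1, 1, 1]   # class labels G (made-up example)
pred = [0, 0, 1, 1, 1, 1]    # cluster labels A (made-up example)
a, b, c, d = pair_counts(truth, pred)
jaccard = a / (a + b + c)
rand = (a + d) / (a + b + c + d)
fm = sqrt(a / (a + b) * a / (a + c))
nmi = normalized_mutual_info_score(truth, pred)
print(jaccard, rand, fm, nmi)
```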
49
References
50
How to determine the parameter k of k-means:
The elbow method
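A hedged sketch of the elbow method with scikit-learn: run k-means for a range of k values and look for the "elbow" where the within-cluster sum of squares (inertia) stops dropping sharply. The data and the range of k here are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))      # placeholder data

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                           # within-cluster sum of squared distances
for k, value in sse.items():
    print(k, round(value, 2))                      # plot k vs. SSE and pick the elbow
```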
51
Appendix: how to compute dissimilarity
on categorical data.
When attributes are not continuous, we can apply the following distance
measure to patterns with categorical attributes.
Let X and Y be two patterns with m categorical attributes. The distance
between X and Y can be defined as the total number of mismatches between
the corresponding attributes of X and Y: the smaller the number of
mismatches, the smaller the distance between the two patterns. That is:
$$d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)$$
where
$$\delta(x_j, y_j) = \begin{cases} 0 & \text{if } x_j = y_j \\ 1 & \text{if } x_j \ne y_j \end{cases}$$
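A small sketch of this dissimilarity measure simply counts the mismatching attributes (the example tuples are made up):

```python
def categorical_distance(x, y):
    """Number of attributes on which the two patterns disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(categorical_distance(("red", "small", "round"), ("red", "large", "round")))  # 1
```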
52
S. Boriah, V. Chandola, V.J. Kumar, Similarity
measures for categorical data: a comparative
evaluation, Proc. of the 8th SIAM International
Conference on Data Mining, 2008, pp. 242-254.
53
Terminology
Clustering: gom cụm, partition: phân hoạch, data reduction: thu giảm
tập dữ liệu, labeled pattern: mẫu có gắn nhãn lớp, unlabeled pattern:
mẫu không có nhãn lớp, intra-cluster distance: khoảng cách giữa các
điểm trong một cụm, inter-cluster distance: khoảng cách giữa các
cụm, data abstraction: trích yếu dữ liệu, outlier: điểm ngoại biên,
centroid: trung tâm cụm, partitional clustering: gom cụm phân hoạch,
hierarchical clustering: gom cụm phân cấp, divisive hierarchical
clustering : gom cụm phân cấp tách, agglomerative hierarchical
clustering: gom cụm phân cấp gộp, single-link: liên kết đơn,
complete-link: liên kết đầy đủ, polythetic clustering: gom cụm tách
dựa vào nhiều thuộc tính, monothetic clustering: gom cụm phân cấp
tách dựa vào một thuộc tính, fuzzy clustering: gom cụm mờ, soft
clustering: gom cụm mềm, fuzzifier: hệ số mờ hóa, internal
evaluation: đánh giá nội, modified Hubert statistic: độ đo Hubert cải
tiến, objective function: hàm mục tiêu, external evaluation: đánh giá
ngoại, density-based clustering: gom cụm dựa vào mật độ,
incremental clustering: gom cụm gia tăng, non-spherical cluster: cụm
không phải hình cầu.
54