Clustering Techniques
Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Topics to be covered…
Introduction to clustering
Clustering techniques
Partitioning algorithms
Hierarchical algorithms
Density-based algorithms
Hierarchical: DIANA (divisive algorithm), AGNES (agglomerative algorithm), ROCK algorithm
Density-based: DBSCAN
3. Compute the “cluster centers” of each cluster. These become the new cluster
centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied
5. Stop
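A minimal NumPy sketch of steps 1-5, assuming numeric data; the function name, the random initialisation, and the tolerance-based stopping rule are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-Means following steps 1-5 above (minimal sketch for numeric data)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid (Euclidean / L2 norm).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centre of each cluster is the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:            # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: repeat until the centroids stop moving (convergence criterion).
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```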
Objects in the illustration (attributes A1 and A2):

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

[Figure: scatter plot of the 16 objects, A1 on the x-axis (0-12) and A2 on the y-axis (0-25)]

Initial centroids:

      A1    A2
c1    3.8   9.9
c2    7.8   12.2
c3    6.2   18.5
• Let us consider the Euclidean distance (L2 norm) as the distance measure in our illustration.
• Let d1, d2, and d3 denote the distances from an object to c1, c2, and c3, respectively. The distance calculations are shown in Table 16.2.
• The assignment of each object to its nearest centroid is shown in the right-most column, and the clustering so obtained is shown in Fig 16.2.
Table 16.2: Distances to the initial centroids and the resulting assignment

A1    A2     d1    d2    d3     Cluster
6.8   12.6   4.0   1.1   5.9    2
0.8   9.8    3.0   7.4   10.2   1
1.2   11.6   3.1   6.6   8.5    1
2.8   9.6    1.0   5.6   9.5    1
3.8   9.9    0.0   4.6   8.9    1
4.4   6.5    3.5   6.6   12.1   1
4.8   1.1    8.9   11.5  17.5   1
6.0   19.9   10.2  7.9   1.4    3
6.2   18.5   8.9   6.5   0.0    3
7.6   17.4   8.4   5.2   1.8    3
7.8   12.2   4.6   0.0   6.5    2
6.6   7.7    3.6   4.7   10.8   1
8.2   4.5    7.0   7.7   14.1   1
8.4   6.9    5.5   5.3   11.8   2
9.0   3.4    8.3   8.9   15.4   1
9.6   11.1   5.9   2.1   8.1    2
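As a cross-check of Table 16.2, the following NumPy sketch recomputes d1, d2, d3 and the nearest-centroid assignment for the 16 objects (variable names are illustrative assumptions):

```python
import numpy as np

# The 16 objects (A1, A2) of the illustration and the initial centroids c1, c2, c3.
X = np.array([[6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6], [3.8, 9.9],
              [4.4, 6.5], [4.8, 1.1], [6.0, 19.9], [6.2, 18.5], [7.6, 17.4],
              [7.8, 12.2], [6.6, 7.7], [8.2, 4.5], [8.4, 6.9], [9.0, 3.4],
              [9.6, 11.1]])
C = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])        # c1, c2, c3

# d1, d2, d3: Euclidean (L2) distance from every object to every centroid.
D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
cluster = D.argmin(axis=1) + 1                               # 1-based, as in Table 16.2

for (a1, a2), (d1, d2, d3), c in zip(X, D.round(1), cluster):
    print(f"{a1:4.1f} {a2:5.1f}  {d1:5.1f} {d2:5.1f} {d3:5.1f}  -> cluster {c}")
```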
Illustration of the k-Means clustering algorithm
The calculation of the new centroids of the three clusters, using the mean of the attribute values A1 and A2, is shown in the table below. The clusters with the new centroids are shown in Fig 16.3.
New centroids (mean of A1 and A2 over the objects of each cluster):

      A1    A2
c1    4.6   7.1
c2    8.2   10.7
c3    6.6   18.6

Centroids after the next reassignment and update:

      A1    A2
c1    5.0   7.1
c2    8.1   12.0
c3    6.6   18.6
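Continuing the previous sketch (reusing X and the cluster labels computed there), the centroid-update step reproduces the first set of new centroids:

```python
# Step 3 of the algorithm: each new centroid is the mean of the objects assigned to it.
new_C = np.array([X[cluster == j].mean(axis=0) for j in (1, 2, 3)])
print(np.round(new_C, 1))    # approximately [[4.6 7.1], [8.2 10.7], [6.6 18.6]]
```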
• For example, there are a huge number of different ways (on the order of 10^10) to cluster 20 items into 4 clusters! The count can be computed exactly as sketched after this list.
• Thus, this exhaustive strategy has its own limitations and is practical only if
1) the sample is relatively small (~100-1000), and
2) k is relatively small compared to n (i.e., k ≪ n).
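The number of distinct partitions alluded to in the first bullet is the Stirling number of the second kind, S(n, k). A small sketch, with an assumed function name, that evaluates it via the standard recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n distinct items into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(20, 4))   # tens of billions of partitions: exhaustive search is hopeless
```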
The Manhattan distance (L1 norm) is used as a proximity measure, where the objective is to minimize the sum-of-absolute error, denoted SAE and defined as

$$\mathrm{SAE} \;=\; \sum_{i=1}^{k}\,\sum_{x \in C_i} \lVert x - c_i \rVert_1, \qquad \lVert x - c_i \rVert_1 = \sum_{j=1}^{d} \lvert x_j - c_{ij} \rvert,$$

where $C_i$ is the $i$-th cluster, $c_i$ is its centroid, and $d$ is the number of attributes.
• In other words, the mean calculation assumes that each object is described by numerical attribute(s). Thus, we cannot apply k-Means to objects that are described by categorical attributes.
• More precisely, the k-Means algorithm requires that some definition of a cluster mean exists; it does not necessarily have to be the mean defined in the above equation.
• In fact, k-Means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents and time series.
The above two interpretations can be readily verified. Differentiating SSE with respect to $c_i$ and equating the result to zero gives

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x .$$

Thus, the best centroid for minimizing the SSE of a cluster is the mean of the objects in the cluster. An analogous argument for SAE gives

$$c_i = \operatorname{median}\{\, x \mid x \in C_i \,\} ,$$

so the best centroid for minimizing the SAE of a cluster is the median of the objects in the cluster.
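Both claims can be checked numerically on a toy one-dimensional cluster; the sample values and the candidate grid below are arbitrary choices for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])      # objects of one cluster (with an outlier)
candidates = np.linspace(0, 12, 1201)         # candidate centroid positions (step 0.01)

sse = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)   # sum-of-squared error
sae = np.abs(x[None, :] - candidates[:, None]).sum(axis=1)    # sum-of-absolute error

print(candidates[sse.argmin()], x.mean())       # SSE is minimised at the mean (3.6)
print(candidates[sae.argmin()], np.median(x))   # SAE is minimised at the median (2.0)
```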
Thus, the time requirement is linear in the number of objects, and the algorithm runs in a modest time if k ≪ n and the number of iterations t ≪ n (the iterations can be controlled by monitoring the change in the value of SSE).
• It is also efficient from both the storage-requirement and execution-time points of view. By saving distance information from one iteration to the next, the actual number of distance calculations that must be made can be reduced (especially as the algorithm approaches termination).
Limitations:
• The k-Means algorithm is not suitable for all types of data. For example, k-Means does not work on categorical data because the mean cannot be defined.
• k-Means finds a local optimum and may in fact miss the global optimum.
[Fig 16.6: Some failure instances of the k-Means algorithm, e.g., non-convex shaped clusters]
Different variants of the k-Means algorithm
There are quite a few variants of the k-Means algorithm. They may differ in the procedure for selecting the initial k means, the calculation of proximity, and the strategy for calculating cluster means. Other variants of k-Means cluster categorical data.
A few variants of the k-Means algorithm include:
• Bisecting k-Means (addressing the issue of the initial choice of cluster means); a sketch is given after this list.
M. Steinbach, G. Karypis and V. Kumar, “A Comparison of Document Clustering Techniques”, Proceedings of the KDD Workshop on Text Mining, 2000.
• Mean of clusters (proposing various strategies to define means and variants of means):
B. Zhang, “Generalized k-Harmonic Means – Dynamic Weighting of Data in Unsupervised Learning”, Technical Report, HP Labs, 2000.
• A. D. Chaturvedi, P. E. Green and J. D. Carroll, “k-Modes Clustering”, Journal of Classification, Vol. 18, pp. 35-36, 2001.
• D. Pelleg and A. Moore, “x-Means: Extending k-Means with Efficient Estimation of the Number of Clusters”, Proceedings of the 17th International Conference on Machine Learning, 2000.
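As a rough illustration of the bisecting k-Means idea mentioned at the top of this list, the sketch below repeatedly splits the cluster with the largest SSE using ordinary 2-means; the function name and the use of scikit-learn's KMeans are assumptions, not part of the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Split the cluster with the largest SSE until k clusters remain (a sketch)."""
    clusters = [np.asarray(X, dtype=float)]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE around its own mean.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        # Bisect it with ordinary 2-means and keep the two halves.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(worst)
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters
```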
Illustration of PAM
• Suppose there is a set of 12 objects and we are to cluster them into four clusters. At some instant, the four clusters are as shown in Fig. 16.7(a); assume that one object in each cluster currently acts as its medoid. For this clustering we can calculate the SAE.
• There are many ways to choose a non-medoid object to replace any one medoid object. Out of these, suppose one particular swap of a non-medoid object for a current medoid yields the lowest SAE. Then the set of medoids is updated with that swap, and the new clustering is shown in Fig 16.7(b).
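A condensed sketch of the swap-based search just described: every (medoid, non-medoid) pair is evaluated, and the swap that most reduces the SAE is accepted until no swap improves it. The function names, the random initial medoids, and the Manhattan-distance choice are illustrative assumptions:

```python
import numpy as np
from itertools import product

def sae(X, medoids):
    """Sum of Manhattan (L1) distances from each object to its nearest medoid."""
    D = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
    return D.min(axis=1).sum()

def pam(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        best_cost, best_medoids = sae(X, medoids), None
        # Try replacing each current medoid by each non-medoid object.
        for m, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[m] = o
            cost = sae(X, trial)
            if cost < best_cost:
                best_cost, best_medoids, improved = cost, trial, True
        if improved:
            medoids = best_medoids                               # accept the best swap
    return medoids
```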
PAM (Partitioning Around Medoids)
11. Stop
References:
For PAM and CLARA:
• L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John Wiley & Sons, 1990.
For CLARANS:
• R. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining”, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94), 1994.