Unit V - Clustering
• Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The
goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more
similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help
identify patterns or relationships within the data that may not be immediately obvious.
Clustering Methods:
1. Partitioning Method: It is used to make partitions on the data in order to form clusters. If "n" partitions are made
on "p" objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need
to be satisfied by this partitioning clustering method are:
• Each object must belong to exactly one group.
• There should be no group without even a single object.
2. Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created.
There are two types of approaches for the creation of hierarchical decomposition, they are:
•Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each
object in the given data forms its own separate group. The method then keeps on merging the objects or the
groups that are close to one another, i.e., that exhibit similar properties. This merging process
continues until the termination condition holds.
•Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we
start with all the data objects in a single cluster. This cluster is divided into smaller
clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster
contains one object.
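To illustrate the agglomerative (bottom-up) approach in code, here is a minimal Python sketch. It assumes the SciPy library is available; the sample points are made up for illustration only.

# Minimal agglomerative (bottom-up) hierarchical clustering sketch.
from scipy.cluster.hierarchy import linkage, fcluster

# Small 2-D dataset: each row is one object (x, y).
data = [(2, 6), (3, 4), (3, 8), (7, 3), (7, 4), (8, 5)]

# Build the merge hierarchy: repeatedly merge the two closest groups,
# starting from each object in its own separate group.
Z = linkage(data, method="single")

# Terminate by cutting the hierarchy when 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label for each object, e.g. [1 1 1 2 2 2]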
3. Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster
keeps on growing as long as the density in the neighbourhood exceeds some threshold, i.e., for each
data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
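DBSCAN is a well-known algorithm of this kind: its two parameters correspond directly to the neighbourhood radius and the minimum number of points mentioned above. A minimal sketch, assuming scikit-learn is installed; the sample data is illustrative.

# Minimal density-based clustering sketch using DBSCAN.
# eps is the neighbourhood radius; min_samples is the minimum number
# of points that must lie inside that radius for a cluster to keep growing.
from sklearn.cluster import DBSCAN

data = [(1, 1), (1.2, 1.1), (0.9, 1.3), (8, 8), (8.1, 7.9), (50, 50)]
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(data)
print(labels)  # e.g. [0 0 0 1 1 -1]; -1 marks a noise point (outlier)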
4. Grid-Based Method: In the grid-based method, a grid is formed over the objects, i.e., the object space is
quantized into a finite number of cells that form a grid structure.
5. Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters in order to find the
best fit of the data to the given model. Clustering by a density function is used to locate the clusters for a given model. It reflects the
spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
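A widely used model-based technique is the Gaussian mixture model, in which each cluster is hypothesized to follow a Gaussian density function. A minimal sketch, assuming scikit-learn is installed; using BIC to pick the number of clusters is one example of the "standard statistics" mentioned above.

# Minimal model-based clustering sketch using a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array([(1, 1), (1.2, 0.9), (0.8, 1.1),
                 (8, 8), (8.2, 7.9), (7.9, 8.1)])

# Fit mixtures with 1-3 Gaussian components; keep the lowest-BIC model,
# so the number of clusters is chosen automatically by a standard statistic.
models = [GaussianMixture(n_components=k, random_state=0).fit(data)
          for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(data))
print(best.n_components)   # number of clusters selected by BIC
print(best.predict(data))  # cluster label for each object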
6. Constraint-Based Method: The constraint-based clustering method is performed by incorporating application- or
user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering
results. Constraints provide an interactive way of communicating with the clustering process; they can be specified by the user
or by the application requirements.
Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
Disadvantage of Cluster Analysis:
1. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set
into several exclusive groups or clusters. To keep the problem specification concise, we can assume that the
number of clusters is given as background knowledge. This parameter is the starting point for partitioning
methods.
Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the
objects into k partitions (k <= n), where each partition represents a cluster.
• The algorithm first chooses k objects as the initial cluster centers. Each of the remaining objects is assigned to
the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster
mean. The k-means algorithm then iteratively improves the within-cluster variation.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of
the objects in the cluster.
• Input:
k: the number of clusters,
D: a data set containing n objects.
• Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in
the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
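A minimal from-scratch Python sketch of this algorithm, with comments keyed to steps (1)-(5); the two-dimensional sample points and the value of k are placeholders.

# Minimal k-means sketch following steps (1)-(5) above (2-D points).
import math
import random

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def kmeans(D, k):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    while True:  # (2) repeat
        # (3) (re)assign each object to the cluster whose mean is nearest
        clusters = [[] for _ in range(k)]
        for obj in D:
            nearest = min(range(k), key=lambda i: euclidean(obj, centers[i]))
            clusters[nearest].append(obj)
        # (4) update the cluster means
        new_centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # (5) until no change
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans([(185, 72), (170, 56), (168, 60), (179, 68)], k=2)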
Implement k-means – to form clusters, for the given dataset (i = object number; x, y = observed values):

i    x    y
1    185  72
2    170  56
3    168  60
4    179  68
5    182  72
6    188  77
7    180  71
8    180  70
9    183  84
10   180  88
11   180  67

1. Calculate the Euclidean distance of each object (xo, yo) from each cluster centroid (xc, yc):
   ED = √((xo – xc)² + (yo – yc)²)
2. Whichever calculation of ED (K1, K2) gives the lesser value, the object is added to the
   corresponding cluster: K1 – {1}, K2 – {2, 3}.
3. Calculate the new centroid for the cluster to which the object was added to frame the cluster,
   i.e., for K2 = ((170 + 168)/2, (60 + 56)/2) = (169, 58).
4. Calculate the Euclidean distance based on the newly generated centroid values, i.e., K1 (185, 72)
   and K2 (169, 58), for the remaining observations.
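The first pass of these steps can be checked with a short Python sketch (plain arithmetic, no libraries); the centroid values K1 (185, 72) and K2 (169, 58) are the ones derived above.

# Euclidean distances of the remaining observations (4-11) from the
# centroids obtained after assigning objects 1-3, as in step 4 above.
import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67)]
k1, k2 = (185, 72), (169, 58)

def ed(p, c):
    return math.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)

for i, p in enumerate(data[3:], start=4):
    cluster = "K1" if ed(p, k1) < ed(p, k2) else "K2"
    print(i, round(ed(p, k1), 2), round(ed(p, k2), 2), "->", cluster)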
• Advantages of k-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice
for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to handle even
larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used with different distance metrics and
initialization methods.
• Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal
solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the
algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.
K-Medoids
The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is a
point very different from the rest of the points. Outlier data pulls the mean of a cluster toward itself,
distorting the resulting clusters. Hence, K-Means clustering is highly affected by outlier data.
The K-Medoids (Partitioning Around Medoids - PAM) algorithm was proposed in 1987 by Kaufman and
Rousseeuw. A medoid can be defined as a point in the cluster whose dissimilarity with all the other points
in the cluster is minimum.
The dissimilarity between an object Pi and its medoid Ci is calculated as E = |Pi – Ci| (in the worked
example below, the Manhattan distance |a – c| + |b – d| is used).
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
• Input:
• k: the number of clusters,
• D: a data set containing n objects.
• Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
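A minimal Python sketch of PAM, with comments keyed to steps (1)-(7). As an assumption for illustration, it uses the Manhattan distance as the dissimilarity (matching the worked example below) and tries every possible swap instead of a single random one, which is a common deterministic variant.

# Minimal PAM (k-medoids) sketch following steps (1)-(7) above.
import random

def manhattan(p, q):  # dissimilarity |a - c| + |b - d|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(D, medoids):
    # sum of each object's dissimilarity to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in D)

def pam(D, k):
    medoids = random.sample(D, k)        # (1) arbitrary initial seeds
    while True:                          # (2) repeat
        cost = total_cost(D, medoids)    # cost of the assignment in step (3)
        best_swap, best_S = None, 0
        for j in range(k):
            for o_rand in D:             # (4) candidate nonrepresentative object
                if o_rand in medoids:
                    continue
                trial = medoids[:j] + [o_rand] + medoids[j + 1:]
                S = total_cost(D, trial) - cost  # (5) total cost of swapping
                if S < best_S:           # (6) accept only if S < 0
                    best_swap, best_S = trial, S
        if best_swap is None:            # (7) until no change
            return medoids
        medoids = best_swap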
Example: Apply the k-medoid algorithm, with k = 2

Dataset:
i     x   y
x1    2   6
x2    3   4
x3    3   8
x4    4   7
x5    6   2
x6    6   4
x7    7   3
x8    7   4
x9    8   5
x10   7   6

1. Select two random representative objects (medoids):
   C1 (3, 4) – x2
   C2 (7, 4) – x8

2. Form a table, to calculate the distance cost with respect to C1 (3, 4) – distance cost = |a – c| + |b – d|:

i     x   y   C1       Distance cost       Cost
x1    2   6   (3, 4)   |2 – 3| + |6 – 4|   3
x3    3   8   (3, 4)   |3 – 3| + |8 – 4|   4
x4    4   7   (3, 4)   |4 – 3| + |7 – 4|   4
x5    6   2   (3, 4)   |6 – 3| + |2 – 4|   5
x6    6   4   (3, 4)   |6 – 3| + |4 – 4|   3
x7    7   3   (3, 4)   |7 – 3| + |3 – 4|   5
x9    8   5   (3, 4)   |8 – 3| + |5 – 4|   6
x10   7   6   (3, 4)   |7 – 3| + |6 – 4|   6
Form a table, to calculate the distance cost with respect to C2 (7, 4) – distance cost = |a – c| + |b – d|:

i     x   y   C2       Distance cost       Cost
x1    2   6   (7, 4)   |2 – 7| + |6 – 4|   7
x3    3   8   (7, 4)   |3 – 7| + |8 – 4|   8
x4    4   7   (7, 4)   |4 – 7| + |7 – 4|   6
x5    6   2   (7, 4)   |6 – 7| + |2 – 4|   3
x6    6   4   (7, 4)   |6 – 7| + |4 – 4|   1
x7    7   3   (7, 4)   |7 – 7| + |3 – 4|   1
x9    8   5   (7, 4)   |8 – 7| + |5 – 4|   2
x10   7   6   (7, 4)   |7 – 7| + |6 – 4|   2
3. Compare the costs of C1 and C2 for every i and select the minimum one – form clusters:
   Cluster 1 : {(3,4),(2,6),(3,8),(4,7)}
   Cluster 2 : {(7,4),(6,2),(6,4),(7,3),(8,5),(7,6)}
   Total cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
4. Randomly select a nonrepresentative object, e.g., O' = (7, 3).
5. Recompute the distance costs with the medoids C1 (3, 4) and O' (7, 3); the new total cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.
6. Compute the cost of swapping: S = 22 – 20 = 2 > 0.
7. The new medoid cost is higher by 2 than the previous cost, so swapping in O' would be a bad idea; therefore keeping the previous
   choice of medoids would be a good one.
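The cost tables and the swap test above can be reproduced in a few lines of Python; the points and medoids are exactly those of the example.

# Reproduce the k-medoid cost computation for the example above.
data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def total_cost(points, medoids):
    # Manhattan distance cost |a - c| + |b - d| to the nearest medoid
    return sum(min(abs(x - mx) + abs(y - my) for mx, my in medoids)
               for x, y in points)

print(total_cost(data, [(3, 4), (7, 4)]))  # 20: cost with medoids x2, x8
print(total_cost(data, [(3, 4), (7, 3)]))  # 22: cost after swapping in O' = (7, 3)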
Advantages of using K-Medoids:
1. It is robust to outliers and noise, since a medoid is an actual data point and is far less influenced by extreme values than a mean.
2. It can be used with any dissimilarity measure (e.g., the Manhattan distance used above), not only the Euclidean distance.
Disadvantages:
1. It is more computationally expensive than K-Means, since the cost of every candidate swap has to be evaluated.
2. The number of clusters k still needs to be specified in advance, and the result depends on the initial choice of medoids.
Question 1: Apply the k-medoid algorithm, with k = 2
i   x   y
0   5   4
1   7   7
2   1   3
3   8   6
4   4   9
Question 2: Apply the k-medoid algorithm, with k = 2
S. No. X Y
1 9 6
2 10 4
3 4 4
4 5 8
5 3 8
6 2 5
7 8 5
8 4 6
9 8 4
10 9 3
Question 3: Apply the k-medoid algorithm, with k = 2
Question 4: Implement k-means – to form clusters, for the given dataset.
Cluster the following eight points, with (x, y) representing locations:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)