
Unit V - Clustering

• Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The
goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more
similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help
identify patterns or relationships within the data that may not be immediately obvious.

Clustering Methods:

1. Partitioning Method: It is used to make partitions on the data in order to form clusters. If "k" partitions are made of the "n" objects in the database, then each partition is represented by a cluster, with k ≤ n. The two conditions which need
to be satisfied by this Partitioning Clustering Method are:
• Each object must belong to exactly one group.
• Each group must contain at least one object.

2. Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created.

There are two types of approaches for the creation of hierarchical decomposition, they are:

• Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e., that exhibit similar properties. This merging process continues until the termination condition holds.

• Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in a single cluster. This cluster is divided into smaller clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster contains exactly one object.
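To make the two approaches concrete, here is a minimal sketch of agglomerative (bottom-up) clustering using SciPy; the sample points and the choice of average linkage are illustrative assumptions, not part of these notes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five sample points; initially each one is its own cluster.
points = np.array([[1.0, 1.0], [1.5, 1.0], [4.0, 5.0], [8.0, 8.0], [8.0, 8.5]])

# linkage() repeatedly merges the two closest clusters (here "closest"
# means smallest average inter-cluster distance) until one cluster remains.
Z = linkage(points, method="average")

# Cut the resulting hierarchy so that at most 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2]
```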

3. Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within the cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
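A minimal sketch of this idea using DBSCAN, a well-known density-based algorithm, via scikit-learn; the eps radius, min_samples threshold, and sample data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps is the neighbourhood radius; min_samples is the minimum number of
# points that radius must contain for the cluster to keep growing.
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # [0 0 0 1 1 -1]; label -1 marks a noise point
```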

4. Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure, and clustering operations are then performed on this grid rather than on the raw objects.
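The quantization step can be sketched in a few lines of plain Python; the cell size and sample points below are illustrative assumptions.

```python
from collections import Counter

points = [(1.2, 0.8), (1.4, 1.1), (1.3, 0.9), (7.8, 8.1), (7.9, 8.3)]
cell_size = 2.0

# Quantize the object space into a finite number of grid cells by integer
# division; clustering then works on the cells rather than the raw points.
cells = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
print(Counter(cells))  # Counter({(0, 0): 3, (3, 4): 2}); dense cells are cluster candidates
```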

5. Model-Based Method: In the model-based method, a model is hypothesized for each cluster, and the best fit of the data to the given model is found. A density function is used to locate the clusters for a given model; it reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers and noise into account. It therefore yields robust clustering methods.
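A minimal sketch using a Gaussian mixture model from scikit-learn, one common model-based method; the sample data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.7], [7.9, 8.2]])

# Hypothesize that each cluster follows a Gaussian density and fit the
# mixture; the fitted means reflect the spatial distribution of the data.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))  # cluster label per point, e.g. [0 0 0 1 1 1]
print(gm.means_)      # fitted cluster centres
```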

6. Constraint-Based Method: Constraint-based clustering is performed by incorporating application- or user-oriented constraints. A constraint expresses the user's expectations or describes properties that the desired clustering results should have. Constraints provide an interactive way of communicating with the clustering process; they can be specified by the user or by the application's requirements.

Applications Of Cluster Analysis:

• It is widely used in image processing, data analysis, and pattern recognition.


• It helps marketers discover distinct groups in their customer base and characterize those groups by their
purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the same
capabilities.
• It also helps in information discovery by classifying documents on the web.

Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.

2. It can be used for exploratory data analysis and can help with feature selection.

3. It can be used to reduce the dimensionality of the data.

4. It can be used for anomaly detection and outlier identification.

5. It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis:


1. It can be sensitive to the choice of initial conditions and the number of clusters.

2. It can be sensitive to the presence of noise or outliers in the data.

3. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.

4. It can be computationally expensive for large datasets.


5. The results of the analysis can be affected by the choice of clustering algorithm.
k-Means: A Centroid-Based Technique

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set
into several exclusive groups or clusters. To keep the problem specification concise, we can assume that the
number of clusters is given as background knowledge. This parameter is the starting point for partitioning
methods.

Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the
objects into k partitions (k <= n), where each partition represents a cluster.

“How does the k-means algorithm work?”


• The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster.
It proceeds as follows. First, it randomly selects k of the objects in D, each of which initially represents a cluster
mean or center.

• For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on
the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively
improves the within-cluster variation.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of
the objects in the cluster.
• Input:
k: the number of clusters,
D: a data set containing n objects.
• Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in
the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
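A minimal pure-Python sketch of this listing; the helper names are illustrative, and the comments refer to the numbered steps (1)-(5) above.

```python
import math
import random

def euclidean(p, q):
    # Euclidean distance between two points given as tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(D, k, seed=0):
    random.seed(seed)
    centers = random.sample(D, k)      # (1) arbitrary initial cluster centers
    while True:                        # (2) repeat
        # (3) (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for obj in D:
            i = min(range(k), key=lambda j: euclidean(obj, centers[j]))
            clusters[i].append(obj)
        # (4) update the cluster means (keep the old center if a cluster empties)
        new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:     # (5) until no change
            return centers, clusters
        centers = new_centers
```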

Implement k-means to form clusters for the given dataset:

S.No   Height   Weight
1      185      72
2      170      56
3      168      60
4      179      68
5      182      72
6      188      77
7      180      71
8      180      70
9      183      84
10     180      88
11     180      67
12     177      76

1. Take k1 = (185, 72) and k2 = (170, 56) as the initial centroids and calculate the Euclidean distance of observation 3 to each:

   ED = √((x_o - x_c)² + (y_o - y_c)²)

2. Whichever of the two distances ED(k1), ED(k2) is smaller decides the cluster the observation is added to: K1 = {1}, K2 = {2, 3}.

3. Calculate the new centroid for the cluster the observation was added to, i.e., for k2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58).

4. Calculate the Euclidean distances based on the newly generated centroid values, i.e., k1 = (185, 72), k2 = (169, 58), for observation 4.

5. Repeat the process until the clusters no longer change.
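Continuing the sketch, the snippet below reuses euclidean() and kmeans() from the listing above to check the first hand-computed distances and then run a full clustering; note that kmeans() picks random initial centers, while the worked example fixes them to observations 1 and 2.

```python
data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

# Step 1: observation 3 = (168, 60) is nearer k2 = (170, 56) than k1 = (185, 72),
# so K1 = {1}, K2 = {2, 3}, and k2 moves to ((170+168)/2, (56+60)/2) = (169, 58).
print(euclidean((168, 60), (185, 72)))  # ~20.81
print(euclidean((168, 60), (170, 56)))  # ~4.47

centers, clusters = kmeans(data, k=2)
print(centers)
```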

• Advantages of k-means

1. Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice
for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to handle even
larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used with different distance metrics and
initialization methods.

• Disadvantages of K-Means

1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal
solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the
algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.
K-Medoids

The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is a point that differs markedly from the rest of the points. Outlier data points can end up in a cluster of their own and pull other clusters toward them, because outlier data increases the mean of a cluster. Hence, K-Means
clustering is highly affected by outlier data.

The K-Medoids (Partitioning Around Medoids, PAM) algorithm was proposed in 1987 by Kaufman and
Rousseeuw. A medoid is a point in the cluster whose total dissimilarity to all the other points
in the cluster is minimal.

The dissimilarity between the medoid (Ci) and an object (Pi) is calculated as E = |Pi - Ci|; the worked example below uses the Manhattan distance |a - c| + |b - d|.

The total cost in the K-Medoids algorithm is the sum of these dissimilarities over all clusters:

Cost = Σ over clusters Ci of Σ over points Pi in Ci of |Pi - Ci|
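A minimal sketch of this cost in Python, assuming two-dimensional points and the Manhattan dissimilarity used in the worked example below; the helper names are illustrative.

```python
def manhattan(p, c):
    # Dissimilarity E = |Pi - Ci|, taken coordinate-wise (Manhattan distance).
    return abs(p[0] - c[0]) + abs(p[1] - c[1])

def total_cost(medoids, clusters):
    # clusters[i] holds the points assigned to medoids[i]; the total cost is
    # the sum of each point's dissimilarity to its own medoid.
    return sum(manhattan(p, m) for m, cluster in zip(medoids, clusters) for p in cluster)
```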

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
• Input:
• k: the number of clusters,
• D: a data set containing n objects.
• Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;

PAM, a k-medoids partitioning algorithm
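A minimal Python sketch following this listing; it reuses manhattan() from the cost sketch above, and simplifies the "until no change" loop to a fixed number of random candidate swaps.

```python
import random

def cost(points, medoids):
    # Total cost: each non-medoid contributes its distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

def pam(points, k, iterations=100, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)   # (1) initial representative objects
    for _ in range(iterations):          # (2) repeat (simplified termination)
        # (4) randomly select a nonrepresentative object o_random
        o_random = random.choice([p for p in points if p not in medoids])
        j = random.randrange(k)          # representative o_j to try swapping out
        trial = medoids[:j] + [o_random] + medoids[j + 1:]
        # (5)-(6) swap only if the total cost decreases, i.e. S < 0
        if cost(points, trial) < cost(points, medoids):
            medoids = trial
    # (3) final assignment of each remaining object to its nearest medoid
    clusters = [[] for _ in medoids]
    for p in points:
        if p not in medoids:
            i = min(range(k), key=lambda m: manhattan(p, medoids[m]))
            clusters[i].append(p)
    return medoids, clusters
```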

Example: Apply the k-medoid algorithm, with k = 2

i     x   y
x1    2   6
x2    3   4
x3    3   8
x4    4   7
x5    6   2
x6    6   4
x7    7   3
x8    7   4
x9    8   5
x10   7   6

1. Select two random representative objects (medoids):
   C1 = (3, 4), i.e., x2
   C2 = (7, 4), i.e., x8

2. Form a table to calculate the distance cost with respect to C1 = (3, 4), using distance cost = |a - c| + |b - d|:

i     x   y   C1       Distance cost       C
x1    2   6   (3, 4)   |2 - 3| + |6 - 4|   3
x3    3   8   (3, 4)   |3 - 3| + |8 - 4|   4
x4    4   7   (3, 4)   |4 - 3| + |7 - 4|   4
x5    6   2   (3, 4)   |6 - 3| + |2 - 4|   5
x6    6   4   (3, 4)   |6 - 3| + |4 - 4|   3
x7    7   3   (3, 4)   |7 - 3| + |3 - 4|   5
x9    8   5   (3, 4)   |8 - 3| + |5 - 4|   6
x10   7   6   (3, 4)   |7 - 3| + |6 - 4|   6

Form a table to calculate the distance cost with respect to C2 = (7, 4):

i     x   y   C2       Distance cost       C
x1    2   6   (7, 4)   |2 - 7| + |6 - 4|   7
x3    3   8   (7, 4)   |3 - 7| + |8 - 4|   8
x4    4   7   (7, 4)   |4 - 7| + |7 - 4|   6
x5    6   2   (7, 4)   |6 - 7| + |2 - 4|   3
x6    6   4   (7, 4)   |6 - 7| + |4 - 4|   1
x7    7   3   (7, 4)   |7 - 7| + |3 - 4|   1
x9    8   5   (7, 4)   |8 - 7| + |5 - 4|   2
x10   7   6   (7, 4)   |7 - 7| + |6 - 4|   2

3. Compare the costs for C1 and C2 for every i and pick the minimum one to form the clusters:
   Cluster 1: {(3, 4), (2, 6), (3, 8), (4, 7)}
   Cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

4. Calculate the total cost (each point's distance to its medoid):
   = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20
   These were random medoids, so take another medoid for comparison.
5. Select one of the non-medoids, o`: let o` = (7, 3), i.e., x7. Generate the table to calculate the distance cost with respect to C1 = (3, 4) (x2) and o` = (7, 3) (x7), again using |a - c| + |b - d|; the resulting (current) total cost is 22.

6. Cost of swapping the medoid from C2 to o`:

S = current total cost - past total cost
  = 22 - 20 = 2

Since the swap cost S > 0, we undo the swap.

7. The new medoid cost is higher by 2 than the past cost, so moving to o` would be a bad idea; keeping the previous choice is the better one.

8. If the swap cost is negative, accept the swap and keep iterating.
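The worked example can be checked with the cost() helper from the PAM sketch above: the original medoids give total cost 20, the trial swap gives 22, so S = 2 > 0 and the swap is rejected.

```python
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

print(cost(points, [(3, 4), (7, 4)]))  # 20 (past total cost)
print(cost(points, [(3, 4), (7, 3)]))  # 22 (current cost after trying o`)
```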

Advantages of using K-Medoids:

1. Deals with noise and outlier data effectively


2. Easily implementable and simple to understand
3. Faster compared to other partitioning algorithms

Disadvantages:

1. Not suitable for clustering arbitrarily shaped groups of data points.


2. As the initial medoids are chosen randomly, the results might vary based on the choice in different runs.

Question 1: Apply k-medoid algorithm, with k = 2

i   x   y
0   5   4
1   7   7
2   1   3
3   8   6
4   4   9
Question 2: Apply k-medoid algorithm, with k = 2

S. No. X Y
1 9 6
2 10 4
3 4 4
4 5 8
5 3 8
6 2 5
7 8 5
8 4 6
9 8 4
10 9 3

Question 3: Apply k-medoid algorithm, with k = 2

Question 4: Implement k-means – to form clusters, for the given dataset
Cluster the following eight points (with (x, y) representing locations :
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Question 5: Construct a decision tree, using the following dataset:

