Unit V - Clustering
• Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The
goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more
similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help
identify patterns or relationships within the data that may not be immediately obvious.
Clustering Methods:
1. Partitioning Method: It is used to make partitions on the data in order to form clusters. If "n" partitions are made
on "p" objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need
to be satisfied by this partitioning clustering method are:
• Each object must belong to exactly one group.
• There should be no group without even a single object.
2. Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created.
There are two types of approaches for the creation of hierarchical decomposition, they are:
•Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each
object in the given data forms its own separate group. The method then keeps on merging the objects or the
groups that are close to one another, i.e., that exhibit similar properties. This merging process
continues until the termination condition holds.
•Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we
start with all the data objects in a single cluster. This cluster is divided into smaller
clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster
contains one object.
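To illustrate the agglomerative (bottom-up) approach in code, here is a minimal Python sketch. It assumes the SciPy library is available; the sample points are made up for illustration only.

# Minimal agglomerative (bottom-up) hierarchical clustering sketch.
from scipy.cluster.hierarchy import linkage, fcluster

# Small 2-D dataset: each row is one object (x, y).
data = [(2, 6), (3, 4), (3, 8), (7, 3), (7, 4), (8, 5)]

# Build the merge hierarchy: repeatedly merge the two closest groups,
# starting from each object in its own separate group.
Z = linkage(data, method="single")

# Terminate by cutting the hierarchy when 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label for each object, e.g. [1 1 1 2 2 2]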
3. Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster
keeps on growing as long as the density in the neighbourhood exceeds some threshold, i.e., for each
data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
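DBSCAN is a well-known algorithm of this kind: its two parameters correspond directly to the neighbourhood radius and the minimum number of points mentioned above. A minimal sketch, assuming scikit-learn is installed; the sample data is illustrative.

# Minimal density-based clustering sketch using DBSCAN.
# eps is the neighbourhood radius; min_samples is the minimum number
# of points that must lie inside that radius for a cluster to keep growing.
from sklearn.cluster import DBSCAN

data = [(1, 1), (1.2, 1.1), (0.9, 1.3), (8, 8), (8.1, 7.9), (50, 50)]
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(data)
print(labels)  # e.g. [0 0 0 1 1 -1]; -1 marks a noise point (outlier)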
4. Grid-Based Method: In the grid-based method, a grid is formed over the objects, i.e., the object space is
quantized into a finite number of cells that form a grid structure.
5. Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters in order to find the
best fit of the data to the given model. Clustering by a density function is used to locate the clusters for a given model. It reflects the
spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
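A widely used model-based technique is the Gaussian mixture model, in which each cluster is hypothesized to follow a Gaussian density function. A minimal sketch, assuming scikit-learn is installed; using BIC to pick the number of clusters is one example of the "standard statistics" mentioned above.

# Minimal model-based clustering sketch using a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array([(1, 1), (1.2, 0.9), (0.8, 1.1),
                 (8, 8), (8.2, 7.9), (7.9, 8.1)])

# Fit mixtures with 1-3 Gaussian components; keep the lowest-BIC model,
# so the number of clusters is chosen automatically by a standard statistic.
models = [GaussianMixture(n_components=k, random_state=0).fit(data)
          for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(data))
print(best.n_components)   # number of clusters selected by BIC
print(best.predict(data))  # cluster label for each object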
6. Constraint-Based Method: The constraint-based clustering method is performed by incorporating application- or
user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering
results. Constraints provide an interactive way of communicating with the clustering process; they can be specified by the user
or by the application requirements.
Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
Disadvantage of Cluster Analysis:
1. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set
into several exclusive groups or clusters. To keep the problem specification concise, we can assume that the
number of clusters is given as background knowledge. This parameter is the starting point for partitioning
methods.
Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the
objects into k partitions (k <= n), where each partition represents a cluster.
• The algorithm first chooses k objects as the initial cluster centers. Each of the remaining objects is assigned to
the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster
mean. The k-means algorithm then iteratively improves the within-cluster variation.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of
the objects in the cluster.
• Input:
k: the number of clusters,
D: a data set containing n objects.
• Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in
the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
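A minimal from-scratch Python sketch of this algorithm, with comments keyed to steps (1)-(5); the two-dimensional sample points and the value of k are placeholders.

# Minimal k-means sketch following steps (1)-(5) above (2-D points).
import math
import random

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def kmeans(D, k):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(D, k)
    while True:  # (2) repeat
        # (3) (re)assign each object to the cluster whose mean is nearest
        clusters = [[] for _ in range(k)]
        for obj in D:
            nearest = min(range(k), key=lambda i: euclidean(obj, centers[i]))
            clusters[nearest].append(obj)
        # (4) update the cluster means
        new_centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # (5) until no change
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans([(185, 72), (170, 56), (168, 60), (179, 68)], k=2)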
Implement k-means – to form clusters, for the given dataset (i = object number; x, y = observed values):

i    x    y
1    185  72
2    170  56
3    168  60
4    179  68
5    182  72
6    188  77
7    180  71
8    180  70
9    183  84
10   180  88
11   180  67

1. Calculate the Euclidean distance of each object (xo, yo) from each cluster centroid (xc, yc):
   ED = √((xo – xc)² + (yo – yc)²)
2. Whichever calculation of ED (K1, K2) gives the lesser value, the object is added to the
   corresponding cluster: K1 – {1}, K2 – {2, 3}.
3. Calculate the new centroid for the cluster to which the object was added to frame the cluster,
   i.e., for K2 = ((170 + 168)/2, (60 + 56)/2) = (169, 58).
4. Calculate the Euclidean distance based on the newly generated centroid values, i.e., K1 (185, 72)
   and K2 (169, 58), for the remaining observations.
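The first pass of these steps can be checked with a short Python sketch (plain arithmetic, no libraries); the centroid values K1 (185, 72) and K2 (169, 58) are the ones derived above.

# Euclidean distances of the remaining observations (4-11) from the
# centroids obtained after assigning objects 1-3, as in step 4 above.
import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67)]
k1, k2 = (185, 72), (169, 58)

def ed(p, c):
    return math.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)

for i, p in enumerate(data[3:], start=4):
    cluster = "K1" if ed(p, k1) < ed(p, k2) else "K2"
    print(i, round(ed(p, k1), 2), round(ed(p, k2), 2), "->", cluster)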
• Advantages of k-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice
for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to handle even
larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used with different distance metrics and
initialization methods.
• Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal
solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the
algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.
K-Medoids
The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is a
point very different from the rest of the points. Outlier data pulls the mean of a cluster toward itself,
distorting the resulting clusters. Hence, K-Means clustering is highly affected by outlier data.
The K-Medoids (Partitioning Around Medoids - PAM) algorithm was proposed in 1987 by Kaufman and
Rousseeuw. A medoid can be defined as a point in the cluster whose dissimilarity with all the other points
in the cluster is minimum.
The dissimilarity between an object Pi and its medoid Ci is calculated as E = |Pi – Ci| (in the worked
example below, the Manhattan distance |a – c| + |b – d| is used).
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
• Input:
• k: the number of clusters,
• D: a data set containing n objects.
• Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
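A minimal Python sketch of PAM, with comments keyed to steps (1)-(7). As an assumption for illustration, it uses the Manhattan distance as the dissimilarity (matching the worked example below) and tries every possible swap instead of a single random one, which is a common deterministic variant.

# Minimal PAM (k-medoids) sketch following steps (1)-(7) above.
import random

def manhattan(p, q):  # dissimilarity |a - c| + |b - d|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(D, medoids):
    # sum of each object's dissimilarity to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in D)

def pam(D, k):
    medoids = random.sample(D, k)        # (1) arbitrary initial seeds
    while True:                          # (2) repeat
        cost = total_cost(D, medoids)    # cost of the assignment in step (3)
        best_swap, best_S = None, 0
        for j in range(k):
            for o_rand in D:             # (4) candidate nonrepresentative object
                if o_rand in medoids:
                    continue
                trial = medoids[:j] + [o_rand] + medoids[j + 1:]
                S = total_cost(D, trial) - cost  # (5) total cost of swapping
                if S < best_S:           # (6) accept only if S < 0
                    best_swap, best_S = trial, S
        if best_swap is None:            # (7) until no change
            return medoids
        medoids = best_swap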
Example: Apply the k-medoid algorithm, with k = 2

Dataset:
i     x   y
x1    2   6
x2    3   4
x3    3   8
x4    4   7
x5    6   2
x6    6   4
x7    7   3
x8    7   4
x9    8   5
x10   7   6

1. Select two random representative objects (medoids):
   C1 (3, 4) – x2
   C2 (7, 4) – x8

2. Form a table, to calculate the distance cost with respect to C1 (3, 4) – distance cost = |a – c| + |b – d|:

i     x   y   C1       Distance cost       Cost
x1    2   6   (3, 4)   |2 – 3| + |6 – 4|   3
x3    3   8   (3, 4)   |3 – 3| + |8 – 4|   4
x4    4   7   (3, 4)   |4 – 3| + |7 – 4|   4
x5    6   2   (3, 4)   |6 – 3| + |2 – 4|   5
x6    6   4   (3, 4)   |6 – 3| + |4 – 4|   3
x7    7   3   (3, 4)   |7 – 3| + |3 – 4|   5
x9    8   5   (3, 4)   |8 – 3| + |5 – 4|   6
x10   7   6   (3, 4)   |7 – 3| + |6 – 4|   6
Form a table, to calculate the distance cost with respect to C2 (7, 4) – distance cost = |a – c| + |b – d|:

i     x   y   C2       Distance cost       Cost
x1    2   6   (7, 4)   |2 – 7| + |6 – 4|   7
x3    3   8   (7, 4)   |3 – 7| + |8 – 4|   8
x4    4   7   (7, 4)   |4 – 7| + |7 – 4|   6
x5    6   2   (7, 4)   |6 – 7| + |2 – 4|   3
x6    6   4   (7, 4)   |6 – 7| + |4 – 4|   1
x7    7   3   (7, 4)   |7 – 7| + |3 – 4|   1
x9    8   5   (7, 4)   |8 – 7| + |5 – 4|   2
x10   7   6   (7, 4)   |7 – 7| + |6 – 4|   2
3. Compare the costs of C1 and C2 for every i and select the minimum one – form clusters:
   Cluster 1 : {(3,4),(2,6),(3,8),(4,7)}
   Cluster 2 : {(7,4),(6,2),(6,4),(7,3),(8,5),(7,6)}
   Total cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
4. Randomly select a nonrepresentative object, e.g., O' = (7, 3).
5. Recompute the distance costs with the medoids C1 (3, 4) and O' (7, 3); the new total cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.
6. Compute the cost of swapping: S = 22 – 20 = 2 > 0.
7. The new medoid cost is higher by 2 than the previous cost, so swapping in O' would be a bad idea; therefore keeping the previous
   choice of medoids would be a good one.
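The cost tables and the swap test above can be reproduced in a few lines of Python; the points and medoids are exactly those of the example.

# Reproduce the k-medoid cost computation for the example above.
data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def total_cost(points, medoids):
    # Manhattan distance cost |a - c| + |b - d| to the nearest medoid
    return sum(min(abs(x - mx) + abs(y - my) for mx, my in medoids)
               for x, y in points)

print(total_cost(data, [(3, 4), (7, 4)]))  # 20: cost with medoids x2, x8
print(total_cost(data, [(3, 4), (7, 3)]))  # 22: cost after swapping in O' = (7, 3)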
Advantages of using K-Medoids:
1. It is robust to outliers and noise, since a medoid is an actual data point and is far less influenced by extreme values than a mean.
2. It can be used with any dissimilarity measure (e.g., the Manhattan distance used above), not only the Euclidean distance.
Disadvantages:
1. It is more computationally expensive than K-Means, since the cost of every candidate swap has to be evaluated.
2. The number of clusters k still needs to be specified in advance, and the result depends on the initial choice of medoids.
Question 1: Apply the k-medoid algorithm, with k = 2
i   x   y
0   5   4
1   7   7
2   1   3
3   8   6
4   4   9
Question 2: Apply the k-medoid algorithm, with k = 2
S. No. X Y
1 9 6
2 10 4
3 4 4
4 5 8
5 3 8
6 2 5
7 8 5
8 4 6
9 8 4
10 9 3
Question 3: Apply the k-medoid algorithm, with k = 2
Question 4: Implement k-means – to form clusters, for the given dataset.
Cluster the following eight points, with (x, y) representing locations:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)