Unit 7 Clustering (P)

K-means clustering partitions a dataset into k clusters based on a proximity measure, typically Euclidean distance. The algorithm iteratively assigns data points to the nearest centroid and recalculates centroids until no significant changes occur. Clustering effectiveness is evaluated with metrics such as the sum of squared errors (SSE) and the Davies-Bouldin index, which assess cluster cohesiveness and separation.


Clustering – K-MEANS CLUSTERING

• k-Means clustering creates k partitions in n-dimensional space, where n is the number of attributes in a given dataset.
• To partition the dataset, a proximity measure has to be defined. The most commonly used measure for numeric attributes is the Euclidean distance.
• Fig. 7.3 illustrates the clustering of the Iris dataset using only the petal length and petal width attributes.
• This Iris dataset is two-dimensional (selected for easy visual explanation), with numeric attributes, and k is specified as 3.
• The outcome of k-means clustering provides a clear partition space for Cluster 1 and a narrow space for the other two clusters, Cluster 2 and Cluster 3.



[Fig. 7.3: k-means clustering of the Iris dataset on the petal length and petal width attributes, k = 3]

How it works
• The logic of finding k clusters within a given dataset is rather simple and always converges to a solution.
• However, the final result in most cases will be locally optimal; the solution will not necessarily converge to the best global solution.
• The process of k-means clustering is similar to Voronoi iteration, where the objective is to divide a space into cells around points.
• The difference is that Voronoi iteration partitions the space, whereas k-means clustering partitions the points in the data space.


Ex: Fig. 7.4, a two-dimensional dataset, k = 3.

• Step 1: Initiate Centroids
• The number of clusters, k, should be specified by the user. In this case, three centroids are initiated in the given data space.
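A minimal sketch of this step in NumPy (the function name, the fixed seed, and the choice of sampling existing data points as initial centroids are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility (illustrative)

def init_centroids(X, k):
    """Pick k distinct data points at random to serve as the initial centroids."""
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]
```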



Fig. 7.5: Each initial centroid is given a shape (with a circle to differentiate centroids from other data points) so that the data points assigned to a centroid can be indicated by the same shape.

• Step 2: Assign Data Points
• Once centroids have been initiated, all the data points are assigned to the nearest centroid to form a cluster.
• In this context, the “nearest” is calculated by a proximity measure. Euclidean distance is the most common proximity measure, though other measures such as the Manhattan distance and the Jaccard coefficient can be used. Between two data points X (x1, x2, ..., xn) and C (c1, c2, ..., cn) with n attributes, the Euclidean distance is given by:

$$d(X, C) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

• All the data points associated with a centroid now have the same shape as their corresponding centroid, as in Fig. 7.6. This step also partitions the data space into Voronoi partitions, with lines shown as boundaries.
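As a rough sketch of the assignment step in NumPy (the function name is an assumption), compute the Euclidean distance from every point to every centroid and take the nearest:

```python
import numpy as np

def assign_points(X, centroids):
    """Assign each data point to its nearest centroid by Euclidean distance."""
    # Broadcasting yields a (num_points, num_centroids) distance matrix.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)  # cluster index per data point
```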


• Step 3: Calculate New Centroids
• For each cluster, a new centroid can now be calculated; it is also the prototype of the cluster group.
• This new centroid is the most representative data point of the cluster.
• Mathematically, this step can be expressed as minimizing the sum of squared errors (SSE) of all data points in a cluster with respect to the centroid of the cluster.
• The overall objective of the step is to minimize the SSE of the individual clusters. The SSE of a cluster can be calculated using Eq. (7.2):

$$SSE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2 \qquad (7.2)$$

where Ci is the ith cluster, j indexes the data points in a given cluster, μi is the centroid of the ith cluster, and xj is a specific data object.
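A direct translation of Eq. (7.2) into NumPy might look like the following sketch (the function name is an assumption):

```python
import numpy as np

def cluster_sse(X, labels, centroids):
    """Per-cluster SSE (Eq. 7.2): sum of squared distances to the centroid."""
    return np.array([
        ((X[labels == i] - mu) ** 2).sum()
        for i, mu in enumerate(centroids)
    ])
```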

• The centroid with the minimal SSE for the given cluster i is the new mean of the cluster.
• The mean of the cluster can be calculated using Eq. (7.3):

$$\mu_i = \frac{1}{|C_i|} \sum_{X \in C_i} X \qquad (7.3)$$

where X is the data object vector (x1, x2, ..., xn). In the case of k-means clustering, the new centroid is the mean of all the data points in the cluster.
• k-Medoid clustering is a variation of k-means clustering in which the median is calculated instead of the mean. Fig. 7.7 shows the location of the new centroids.
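The corresponding update step, per Eq. (7.3), is just a per-cluster mean (a sketch that assumes no cluster ends up empty):

```python
import numpy as np

def update_centroids(X, labels, k):
    """New centroid of each cluster = mean of its member points (Eq. 7.3).
    A k-medoid variant would pick a median/medoid here instead of the mean."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```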


• Step 4: Repeat Assignment and Calculate New Centroids
• Once the new centroids have been identified, the assignment of data points to the nearest centroid is repeated, so that data points may be reassigned to the new centroids.
• In Fig. 7.8, note the change in assignment of three data points that belonged to different clusters in the previous step.
• Step 5: Termination
• Step 3 (calculating new centroids) and Step 4 (assigning data points to new centroids) are repeated until there is no further change in the assignment of data points.
• In other words, no significant change in the centroids is noted. The final centroids are declared the prototypes of the clusters, and they are used to describe the whole clustering model.
• Each data point in the dataset is now tagged with a new cluster ID attribute that identifies its cluster.
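Putting the helper sketches above together gives a minimal k-means loop whose termination condition matches Step 5 (assignments stop changing); max_steps is an illustrative safeguard:

```python
import numpy as np

def k_means(X, k, max_steps=100):
    """Minimal k-means: repeat assignment (Step 4) and centroid updates (Step 3)
    until no data point changes cluster (Step 5), or max_steps is reached."""
    centroids = init_centroids(X, k)
    labels = assign_points(X, centroids)
    for _ in range(max_steps):
        centroids = update_centroids(X, labels, k)
        new_labels = assign_points(X, centroids)
        if np.array_equal(new_labels, labels):  # assignments stable -> converged
            break
        labels = new_labels
    return centroids, labels
```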


• Special Cases
• Even though k-means clustering is simple and easy to implement, one of its key drawbacks is that the algorithm seeks a local optimum, which may not be the globally optimal clustering.
• The algorithm starts with an initial configuration of centroids and continuously improves to find the best solution possible for that initial configuration.
• Since the solution is only optimal for that initial configuration, there might be a better solution if the initial configuration changes; hence, the success of a k-means run depends heavily on the initiation of centroids.
• This limitation can be addressed by using multiple random initiations, as sketched below: in each run, the cohesiveness of the clusters is measured by a performance criterion, and the clustering run with the best performance metric is chosen as the final run.
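A sketch of the multiple-initiation remedy, reusing the k_means and cluster_sse helpers from earlier and using total SSE as the performance criterion (lower is better):

```python
def k_means_best_of(X, k, runs=10):
    """Run k-means from several random initiations; keep the lowest-SSE run."""
    best_sse, best_result = float("inf"), None
    for _ in range(runs):
        centroids, labels = k_means(X, k)
        total_sse = cluster_sse(X, labels, centroids).sum()
        if total_sse < best_sse:
            best_sse, best_result = total_sse, (centroids, labels)
    return best_result
```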


• Evaluation of Clusters
• Evaluation of k-means clustering differs from that of regression and classification algorithms because in clustering there are no known external labels for comparison.
• The evaluation parameters have to be developed from the very dataset that is being evaluated.
• This is called unsupervised or internal evaluation. Evaluation of clustering can be as simple as computing the total SSE.


• Evaluation of Clusters (cont’d)
• Good models will have a low SSE within each cluster and a low overall SSE across all clusters. SSE can also be reported as the average within-cluster distance, calculated for each cluster and then averaged over all the clusters.
• Another commonly used evaluation measure is the Davies-Bouldin index, a measure of the uniqueness of the clusters that takes into consideration both the cohesiveness of a cluster (the distance between the data points and the center of the cluster) and the separation between clusters.
• It is a function of the ratio of within-cluster scatter to between-cluster separation.
• The lower the value of the Davies-Bouldin index, the better the clustering.
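For reference, a standard formulation of the index (not shown in the slides) for k clusters with centroids $\mu_i$ and average within-cluster distances $s_i$ is:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(\mu_i, \mu_j)}$$

Each cluster is scored against its worst (most similar) neighbor, so the index is small only when every cluster is both compact and well separated.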


How to implement
• The implementation needs one operator for modeling and one for unsupervised evaluation.
• In the modeling step, the parameter for the number of clusters, k, is specified as desired. The output model is a list of centroids for each cluster, and a new attribute carrying the cluster ID is attached to the original input dataset. The cluster label is appended to the original dataset for each data point and can be visually evaluated after clustering.
• A model evaluation step is required to calculate the average cluster distance and the Davies-Bouldin index.
• The example uses the Iris dataset (4 attributes, 150 data objects).
• Even though a class label is not needed for clustering, it is kept here for later explanation, to see whether the clusters identified from the unlabeled dataset are similar to the natural clusters of species in the dataset.


How to implement (cont’d)
• Step 1: Data Preparation
• k-Means clustering accepts both numeric and polynominal data types; however, the distance measures are more effective with numeric data types.
• Every additional attribute increases the dimensionality of the clustering space.
• In this example the number of attributes has been limited to two by selecting petal length (a3) and petal width (a4) using the Select Attributes operator.
• It is easy to visualize the mechanics of the k-means algorithm by looking at two-dimensional plots of the clustering. In practical implementations, clustering datasets will have more attributes.
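The slides use RapidMiner for this step; as a rough Python analogue (using scikit-learn's bundled copy of the Iris dataset, in which petal length and petal width are columns 2 and 3), one might write:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]   # keep only petal length and petal width
y = iris.target         # species label: kept for later comparison only,
                        # not used by the clustering itself
```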


• Step 2: Clustering Operator and Parameters
• The k-means modeling operator is available in the Modeling > Clustering and Segmentation folder of RapidMiner. The parameters are as follows:
• k: The desired number of clusters.
• Add cluster as attribute: Appends the cluster label (ID) to the original dataset.
• Max runs: Multiple runs are required to select the clustering with the lowest SSE. The number of such runs can be specified here.
• Measure type: The default and most common measure is Euclidean distance (L2). Other options include Manhattan distance (L1), the Jaccard coefficient, and cosine similarity for document data.
• Max optimization steps: The number of iterations of assigning data objects to centroids and calculating new centroids.
• The output is the cluster model with k centroid data objects, plus the initial dataset appended with cluster labels. Cluster labels are named generically: cluster_0, cluster_1, ..., cluster_(k-1).
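These parameters map loosely onto scikit-learn's KMeans; the mapping below is a hedged analogue, not the RapidMiner operator itself (note that scikit-learn's KMeans is tied to Euclidean distance):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]          # petal length and width, as in Step 1

model = KMeans(
    n_clusters=3,     # k
    n_init=10,        # ~ Max runs: best of 10 random initiations (lowest SSE)
    max_iter=300,     # ~ Max optimization steps
    init="random",    # random centroid initiation, as in the text
)
labels = model.fit_predict(X)         # generic cluster IDs: 0, 1, 2
centroids = model.cluster_centers_    # k centroid data objects
```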


• Step 3: Evaluation
• Since the attributes used in the dataset are numeric, the effectiveness of the cluster groups needs to be evaluated using SSE and the Davies-Bouldin index.
• In RapidMiner, the Cluster Model Visualizer operator, under Modeling > Segmentation, is available for performance evaluation and visualization of cluster groups.
• The Cluster Model Visualizer operator needs both outputs from the modeling step: the cluster centroid vector (the model) and the labeled dataset.
• The two measurement outputs of the evaluation are the average cluster distance and the Davies-Bouldin index.
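Continuing the scikit-learn analogue, both metrics can be computed directly; the Davies-Bouldin index is available as sklearn.metrics.davies_bouldin_score, and one reasonable reading of "average cluster distance" (an assumption here) is the mean distance of points to their own centroid:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data[:, 2:4]
model = KMeans(n_clusters=3, n_init=10).fit(X)
labels, centroids = model.labels_, model.cluster_centers_

# Average within-cluster distance: each point's distance to its own centroid.
avg_distance = np.linalg.norm(X - centroids[labels], axis=1).mean()
print("average within-cluster distance:", avg_distance)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better
```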


• Step 4: Execution and Interpretation
The following outputs can be observed in the results window:
• Cluster Model (Clustering): The model output contains the centroid for each of the k clusters, along with their attribute values.
• Labeled example set: The cluster value is appended as a new special polynominal attribute and takes a generic label format.
• Visualizer and Performance vector: The output of the Cluster Model Visualizer shows the centroid charts, table, scatter plots, heatmaps, and performance evaluation metrics such as the average distance measured and the Davies-Bouldin index.
