Unit 7 Clustering (P)

K-means clustering partitions a dataset into k clusters based on a proximity measure, typically Euclidean distance. The algorithm iteratively assigns data points to the nearest centroid and recalculates centroids until no significant changes occur. Clustering effectiveness is evaluated with metrics such as the sum of squared errors (SSE) and the Davies-Bouldin index, which assess cluster cohesiveness and separation.


Clustering – K-MEANS CLUSTERING

• k-Means clustering creates k partitions in n-dimensional space, where n is the number of attributes in a given dataset.
• To partition the dataset, a proximity measure has to be defined. The most commonly used measure for numeric attributes is the Euclidean distance.
• Fig. 7.3 illustrates the clustering of the Iris dataset using only the petal length and petal width attributes.
• This Iris dataset is two-dimensional (selected for easy visual explanation), with numeric attributes, and k is specified as 3.
• The outcome of k-means clustering provides a clear partition space for Cluster 1 and a narrow space for the other two clusters, Cluster 2 and Cluster 3.



[Fig. 7.3: k-means clustering of the Iris dataset on the petal length and petal width attributes, k = 3]

How it works
• The logic of finding k clusters within a given dataset is rather simple and always converges to a solution.
• However, the final result in most cases will be locally optimal; the solution will not necessarily converge to the best global solution.
• The process of k-means clustering is similar to Voronoi iteration, where the objective is to divide a space into cells around points.
• The difference is that Voronoi iteration partitions the space, whereas k-means clustering partitions the points in the data space.


Ex: Fig. 7.4, a two-dimensional dataset, k = 3.

• Step 1: Initiate Centroids
• The number of clusters, k, should be specified by the user. In this case, three centroids are initiated in the given data space.
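A minimal sketch of this step in NumPy (the function name, the fixed seed, and the choice of sampling existing data points as initial centroids are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility (illustrative)

def init_centroids(X, k):
    """Pick k distinct data points at random to serve as the initial centroids."""
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]
```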



Fig. 7.5: Each initial centroid is given a shape (with a circle to differentiate centroids from other data points) so that the data points assigned to a centroid can be indicated by the same shape.

• Step 2: Assign Data Points
• Once centroids have been initiated, all the data points are assigned to the nearest centroid to form a cluster.
• In this context, the “nearest” is calculated by a proximity measure. Euclidean distance is the most common proximity measure, though other measures such as the Manhattan distance and the Jaccard coefficient can be used. Between two data points X (x1, x2, ..., xn) and C (c1, c2, ..., cn) with n attributes, the Euclidean distance is given by:

$$d(X, C) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

• All the data points associated with a centroid now have the same shape as their corresponding centroid, as in Fig. 7.6. This step also partitions the data space into Voronoi partitions, with lines shown as boundaries.
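As a rough sketch of the assignment step in NumPy (the function name is an assumption), compute the Euclidean distance from every point to every centroid and take the nearest:

```python
import numpy as np

def assign_points(X, centroids):
    """Assign each data point to its nearest centroid by Euclidean distance."""
    # Broadcasting yields a (num_points, num_centroids) distance matrix.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)  # cluster index per data point
```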


• Step 3: Calculate New Centroids
• For each cluster, a new centroid can now be calculated; it is also the prototype of the cluster group.
• This new centroid is the most representative data point of the cluster.
• Mathematically, this step can be expressed as minimizing the sum of squared errors (SSE) of all data points in a cluster with respect to the centroid of the cluster.
• The overall objective of the step is to minimize the SSE of the individual clusters. The SSE of a cluster can be calculated using Eq. (7.2):

$$SSE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2 \qquad (7.2)$$

where Ci is the ith cluster, j indexes the data points in a given cluster, μi is the centroid of the ith cluster, and xj is a specific data object.
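A direct translation of Eq. (7.2) into NumPy might look like the following sketch (the function name is an assumption):

```python
import numpy as np

def cluster_sse(X, labels, centroids):
    """Per-cluster SSE (Eq. 7.2): sum of squared distances to the centroid."""
    return np.array([
        ((X[labels == i] - mu) ** 2).sum()
        for i, mu in enumerate(centroids)
    ])
```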

• The centroid with the minimal SSE for the given cluster i is the new mean of the cluster.
• The mean of the cluster can be calculated using Eq. (7.3):

$$\mu_i = \frac{1}{|C_i|} \sum_{X \in C_i} X \qquad (7.3)$$

where X is the data object vector (x1, x2, ..., xn). In the case of k-means clustering, the new centroid is the mean of all the data points in the cluster.
• k-Medoid clustering is a variation of k-means clustering in which the median is calculated instead of the mean. Fig. 7.7 shows the location of the new centroids.
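The corresponding update step, per Eq. (7.3), is just a per-cluster mean (a sketch that assumes no cluster ends up empty):

```python
import numpy as np

def update_centroids(X, labels, k):
    """New centroid of each cluster = mean of its member points (Eq. 7.3).
    A k-medoid variant would pick a median/medoid here instead of the mean."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```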


• Step 4: Repeat Assignment and Calculate New Centroids
• Once the new centroids have been identified, the assignment of data points to the nearest centroid is repeated, so that data points may be reassigned to the new centroids.
• In Fig. 7.8, note the change in assignment of three data points that belonged to different clusters in the previous step.
• Step 5: Termination
• Step 3 (calculating new centroids) and Step 4 (assigning data points to new centroids) are repeated until there is no further change in the assignment of data points.
• In other words, no significant change in the centroids is noted. The final centroids are declared the prototypes of the clusters, and they are used to describe the whole clustering model.
• Each data point in the dataset is now tagged with a new cluster ID attribute that identifies its cluster.
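Putting the helper sketches above together gives a minimal k-means loop whose termination condition matches Step 5 (assignments stop changing); max_steps is an illustrative safeguard:

```python
import numpy as np

def k_means(X, k, max_steps=100):
    """Minimal k-means: repeat assignment (Step 4) and centroid updates (Step 3)
    until no data point changes cluster (Step 5), or max_steps is reached."""
    centroids = init_centroids(X, k)
    labels = assign_points(X, centroids)
    for _ in range(max_steps):
        centroids = update_centroids(X, labels, k)
        new_labels = assign_points(X, centroids)
        if np.array_equal(new_labels, labels):  # assignments stable -> converged
            break
        labels = new_labels
    return centroids, labels
```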


• Special Cases
• Even though k-means clustering is simple and easy to implement, one of its key drawbacks is that the algorithm seeks a local optimum, which may not be the globally optimal clustering.
• The algorithm starts with an initial configuration of centroids and continuously improves to find the best solution possible for that initial configuration.
• Since the solution is only optimal for that initial configuration, there might be a better solution if the initial configuration changes; hence, the success of a k-means run depends heavily on the initiation of centroids.
• This limitation can be addressed by using multiple random initiations, as sketched below: in each run, the cohesiveness of the clusters is measured by a performance criterion, and the clustering run with the best performance metric is chosen as the final run.
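A sketch of the multiple-initiation remedy, reusing the k_means and cluster_sse helpers from earlier and using total SSE as the performance criterion (lower is better):

```python
def k_means_best_of(X, k, runs=10):
    """Run k-means from several random initiations; keep the lowest-SSE run."""
    best_sse, best_result = float("inf"), None
    for _ in range(runs):
        centroids, labels = k_means(X, k)
        total_sse = cluster_sse(X, labels, centroids).sum()
        if total_sse < best_sse:
            best_sse, best_result = total_sse, (centroids, labels)
    return best_result
```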


• Evaluation of Clusters
• Evaluation of k-means clustering differs from that of regression and classification algorithms because in clustering there are no known external labels for comparison.
• The evaluation parameters have to be developed from the very dataset that is being evaluated.
• This is called unsupervised or internal evaluation. Evaluation of clustering can be as simple as computing the total SSE.


• Evaluation of Clusters (cont’d)
• Good models will have a low SSE within each cluster and a low overall SSE across all clusters. SSE can also be reported as the average within-cluster distance, calculated for each cluster and then averaged over all the clusters.
• Another commonly used evaluation measure is the Davies-Bouldin index, a measure of the uniqueness of the clusters that takes into consideration both the cohesiveness of a cluster (the distance between the data points and the center of the cluster) and the separation between clusters.
• It is a function of the ratio of within-cluster scatter to between-cluster separation.
• The lower the value of the Davies-Bouldin index, the better the clustering.
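For reference, a standard formulation of the index (not shown in the slides) for k clusters with centroids $\mu_i$ and average within-cluster distances $s_i$ is:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(\mu_i, \mu_j)}$$

Each cluster is scored against its worst (most similar) neighbor, so the index is small only when every cluster is both compact and well separated.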


How to implement
• The implementation needs one operator for modeling and one for unsupervised evaluation.
• In the modeling step, the parameter for the number of clusters, k, is specified as desired. The output model is a list of centroids for each cluster, and a new attribute carrying the cluster ID is attached to the original input dataset. The cluster label is appended to the original dataset for each data point and can be visually evaluated after clustering.
• A model evaluation step is required to calculate the average cluster distance and the Davies-Bouldin index.
• The example uses the Iris dataset (4 attributes, 150 data objects).
• Even though a class label is not needed for clustering, it is kept here for later explanation, to see whether the clusters identified from the unlabeled dataset are similar to the natural clusters of species in the dataset.


How to implement (cont’d)
• Step 1: Data Preparation
• k-Means clustering accepts both numeric and polynominal data types; however, the distance measures are more effective with numeric data types.
• Every additional attribute increases the dimensionality of the clustering space.
• In this example the number of attributes has been limited to two by selecting petal length (a3) and petal width (a4) using the Select Attributes operator.
• It is easy to visualize the mechanics of the k-means algorithm by looking at two-dimensional plots of the clustering. In practical implementations, clustering datasets will have more attributes.
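The slides use RapidMiner for this step; as a rough Python analogue (using scikit-learn's bundled copy of the Iris dataset, in which petal length and petal width are columns 2 and 3), one might write:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]   # keep only petal length and petal width
y = iris.target         # species label: kept for later comparison only,
                        # not used by the clustering itself
```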


• Step 2: Clustering Operator and Parameters
• The k-means modeling operator is available in the Modeling > Clustering and Segmentation folder of RapidMiner. The parameters are as follows:
• k: The desired number of clusters.
• Add cluster as attribute: Appends the cluster label (ID) to the original dataset.
• Max runs: Multiple runs are required to select the clustering with the lowest SSE. The number of such runs can be specified here.
• Measure type: The default and most common measure is Euclidean distance (L2). Other options include Manhattan distance (L1), the Jaccard coefficient, and cosine similarity for document data.
• Max optimization steps: The number of iterations of assigning data objects to centroids and calculating new centroids.
• The output is the cluster model with k centroid data objects, plus the initial dataset appended with cluster labels. Cluster labels are named generically: cluster_0, cluster_1, ..., cluster_(k-1).
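These parameters map loosely onto scikit-learn's KMeans; the mapping below is a hedged analogue, not the RapidMiner operator itself (note that scikit-learn's KMeans is tied to Euclidean distance):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]          # petal length and width, as in Step 1

model = KMeans(
    n_clusters=3,     # k
    n_init=10,        # ~ Max runs: best of 10 random initiations (lowest SSE)
    max_iter=300,     # ~ Max optimization steps
    init="random",    # random centroid initiation, as in the text
)
labels = model.fit_predict(X)         # generic cluster IDs: 0, 1, 2
centroids = model.cluster_centers_    # k centroid data objects
```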


• Step 3: Evaluation
• Since the attributes used in the dataset are numeric, the effectiveness of the cluster groups needs to be evaluated using SSE and the Davies-Bouldin index.
• In RapidMiner, the Cluster Model Visualizer operator, under Modeling > Segmentation, is available for performance evaluation and visualization of cluster groups.
• The Cluster Model Visualizer operator needs both outputs from the modeling step: the cluster centroid vector (the model) and the labeled dataset.
• The two measurement outputs of the evaluation are the average cluster distance and the Davies-Bouldin index.
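Continuing the scikit-learn analogue, both metrics can be computed directly; the Davies-Bouldin index is available as sklearn.metrics.davies_bouldin_score, and one reasonable reading of "average cluster distance" (an assumption here) is the mean distance of points to their own centroid:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data[:, 2:4]
model = KMeans(n_clusters=3, n_init=10).fit(X)
labels, centroids = model.labels_, model.cluster_centers_

# Average within-cluster distance: each point's distance to its own centroid.
avg_distance = np.linalg.norm(X - centroids[labels], axis=1).mean()
print("average within-cluster distance:", avg_distance)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better
```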


• Step 4: Execution and Interpretation
The following outputs can be observed in the results window:
• Cluster Model (Clustering): The model output contains the centroid for each of the k clusters, along with their attribute values.
• Labeled example set: The cluster value is appended as a new special polynominal attribute and takes a generic label format.
• Visualizer and Performance vector: The output of the Cluster Model Visualizer shows the centroid charts, table, scatter plots, heatmaps, and performance evaluation metrics such as the average distance measured and the Davies-Bouldin index.
