Unit 7 Clustering (P)
How it works
The logic of finding k clusters within a given dataset is rather simple, and the algorithm always converges to a solution.
However, in most cases the final result is only locally optimal: the solution converges to a local optimum rather than the best global one.
The process of k-means clustering is similar to Voronoi iteration, where the objective
is to divide a space into cells around points.
The difference is that Voronoi iteration partitions the space, whereas k-means
clustering partitions the points in the data space.
All the data points associated with a centroid now have the same shape as their
corresponding centroid, as in Fig. 7.6. This step also partitions the data space
into Voronoi partitions, with the boundaries shown as lines.
The sum of squared errors (SSE) of the clustering is

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$

where $C_i$ is the $i$th cluster, $j$ indexes the data points in a given cluster, $\mu_i$ is the centroid for the $i$th cluster, and $x_j$ is a specific data object.
The new centroid is computed as

$$\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{X \in C_i} X$$

where $X$ is the data object vector $(x_1, x_2, \ldots, x_n)$. In the case of k-means clustering, the new centroid is the mean of all the data points in the cluster.
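To make these steps concrete, here is a minimal NumPy sketch of a single k-means iteration (the variable names and the random test data are illustrative, not from the text): assignment to the nearest centroid produces the Voronoi partition, the SSE is computed as defined above, and each centroid is updated to the mean of its cluster.

```python
import numpy as np

def kmeans_iteration(points, centroids):
    """One k-means iteration: assign points to the nearest centroid
    (the Voronoi partition), then recompute each centroid as the
    mean of its assigned points."""
    # Squared Euclidean distance from every point to every centroid.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Assignment step: each point joins the cluster of its nearest centroid.
    labels = dists.argmin(axis=1)
    # SSE: sum of squared distances from each point to its own centroid.
    sse = dists[np.arange(len(points)), labels].sum()
    # Update step: the new centroid mu_i is the mean of the points in C_i;
    # an empty cluster keeps its old centroid.
    new_centroids = np.array([
        points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
        for i in range(len(centroids))
    ])
    return labels, new_centroids, sse

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
centroids = X[rng.choice(len(X), size=3, replace=False)]
labels, centroids, sse = kmeans_iteration(X, centroids)
print(f"SSE after one iteration: {sse:.2f}")
```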
k-Medoid clustering is a variation of k-means clustering, where the median is
calculated instead of the mean. Fig. 7.7 shows the location of the new centroids.
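Following this description, the only change to the sketch above is the update step: the component-wise median replaces the mean. This is a hedged illustration of the median-based variant as the text describes it; standard k-medoids implementations instead choose an actual data point as the cluster center.

```python
import numpy as np

def median_update(points, labels, k):
    """Update step for the median-based variant: each cluster center
    is the component-wise median of its members, which is more robust
    to outliers than the mean. Assumes no cluster is empty (a real
    implementation would need to guard against that)."""
    return np.array([np.median(points[labels == i], axis=0)
                     for i in range(k)])
```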
Special Cases
Even though k-means clustering is simple and easy to implement, one of its
key drawbacks is that the algorithm converges to a local optimum, which may not
be the globally optimal clustering.
In this approach, the algorithm starts with an initial configuration (centroids) and
continuously improves to find the best solution possible for that initial configuration.
Since the solution is optimal only with respect to the initial configuration, a better
solution might exist under a different initial configuration; the success of a k-means
algorithm therefore depends heavily on the initialization of the centroids.
This limitation can be addressed by using multiple random initializations.
In each run, the cohesiveness of the clusters can be measured by a performance
criterion, and the run with the best performance metric can be chosen as the
final clustering.
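As an illustration, this restart strategy could look like the following scikit-learn sketch, which uses the total SSE (exposed as inertia_) as the cohesiveness criterion; scikit-learn's own n_init parameter automates the same idea.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Run k-means from several random initial configurations and keep
# the run whose clusters are most cohesive (lowest total SSE).
best_model, best_sse = None, float("inf")
for seed in range(10):
    model = KMeans(n_clusters=3, init="random", n_init=1,
                   random_state=seed).fit(X)
    if model.inertia_ < best_sse:  # inertia_ is the total SSE
        best_model, best_sse = model, model.inertia_

print(f"Best SSE over 10 initializations: {best_sse:.2f}")
```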
Evaluation of Clusters
Evaluation of k-means clustering differs from the evaluation of regression and
classification algorithms because clustering has no known external
labels for comparison.
The evaluation parameters have to be developed from the very dataset
that is being evaluated.
This is called unsupervised or internal evaluation. Evaluation of clustering
can be as simple as computing the total SSE.
How to implement
The implementation requires two operators: one for modeling and one for unsupervised evaluation.
In the modeling step, the parameter for the number of clusters, k, is specified as
desired. The output model is a list of centroids for each cluster, and a new attribute
containing the cluster ID is attached to the original input dataset. The cluster label is
appended to each data point in the original dataset and can be visually evaluated after
the clustering.
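Outside RapidMiner, an analogous modeling step (a centroid list as the model output, plus a cluster ID appended to each row of the dataset) might look like this pandas/scikit-learn sketch; parameter values such as k = 3 are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()

# Fit the model; the "model output" is the list of cluster centroids.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df)
print(model.cluster_centers_)

# Append the cluster ID as a new attribute on the original dataset.
df["cluster"] = model.labels_
print(df.head())
```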
A model evaluation step is required to calculate the average cluster distance and
Davies-Bouldin index.
This example uses the Iris dataset (4 attributes, 150 data objects).
Even though a class label is not needed for clustering, it was kept so that the
clusters identified from the unlabeled dataset can later be compared with the
natural clusters of species in the dataset.
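For example, one way to make that comparison is to cross-tabulate the discovered cluster IDs against the retained species label, as in this scikit-learn sketch (the RapidMiner process does the equivalent visually).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X, species = iris.data, iris.target_names[iris.target]

# Cluster without using the class label.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Rows: cluster IDs found without labels; columns: actual species.
print(pd.crosstab(labels, species, rownames=["cluster"], colnames=["species"]))
```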
Step 3: Evaluation
Since the attributes used in the dataset are numeric, the effectiveness of the clustering
groups needs to be evaluated using SSE and the Davies-Bouldin index.
In RapidMiner, the Cluster Model Visualizer operator, under Modeling >
Segmentation, is available for performance evaluation and visualization of
cluster groups.
The Cluster Model Visualizer operator needs two inputs from the modeling step: the cluster
centroid vector (model) and the labeled dataset.
The two measurement outputs of the evaluation are the average cluster distance and the
Davies-Bouldin index.
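For comparison, both metrics can be approximated outside RapidMiner: scikit-learn provides davies_bouldin_score, and the average cluster distance can be computed directly as the mean distance from each point to its own centroid (RapidMiner's exact definitions may differ slightly).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Average distance from each point to its own cluster centroid.
dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
print(f"Average cluster distance: {dists.mean():.3f}")

# Davies-Bouldin index: lower values indicate better-separated clusters.
print(f"Davies-Bouldin index: {davies_bouldin_score(X, model.labels_):.3f}")
```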