Module 3
• Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
• Unsupervised learning (No predefined classes)
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups
• Partitioning algorithms:
• These algorithms divide a dataset into a predetermined number of clusters.
• Each data point is assigned to exactly one cluster based on similarity.
• A well-known example is k-Means, which minimizes the variance within clusters.
• These methods work well when the number of clusters is known and the data is well-
separated.
• Hierarchical algorithms:
• These algorithms create a tree-like structure (dendrogram) by repeatedly merging or splitting
clusters.
• Agglomerative (Bottom-Up): Starts with individual points as clusters and merges them
iteratively.
• Divisive (Top-Down): Starts with a single cluster and splits it into smaller ones.
• Example: Hierarchical Clustering.
• Graph-based:
• Represents data as a graph, where nodes represent data points, and edges represent
relationships (similarities) between them.
• This method is particularly useful when the relationships between data points are complex
and not easily captured by distance metrics alone.
• Model-based:
• Assumes that the data is generated by a mixture of underlying probability distributions.
• Tries to find the best fit for each cluster using statistical models.
• Can be useful when data follows a known distribution.
• Example: Gaussian Mixture Models (GMM).
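• As an illustration of the model-based approach, here is a minimal sketch of fitting a GMM with scikit-learn; the dataset and parameter values are made up for illustration and are not part of the original material.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D dataset; replace with the data you actually want to cluster.
X = np.random.default_rng(0).normal(size=(300, 2))

# Fit a mixture of 3 Gaussians; each component plays the role of one cluster.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)      # hard assignment: most probable component per point
probs = gmm.predict_proba(X)     # soft assignment: membership probability per component
```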
• Clustering techniques (with representative algorithms):
  • Hierarchical methods
    • Divisive methods: DIANA [1990]
    • Agglomerative methods: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
  • Density-based methods: DBSCAN [1996], STING [1997], DENCLUE [1998], CLIQUE [1998], OPTICS [1999], Wave Cluster [1998]
  • Model-based clustering: EM Algorithm [1977], COBWEB [1987], AutoClass [1996], ANN Clustering [1982, 1989]
❑ Partitioning
❑ k-Means algorithm
❑ Hierarchical
❑ Divisive algorithm
❑ Agglomerative algorithm
❑ Density based
❑ DBSCAN
▪ Given a set of n distinct objects, the k-Means clustering algorithm partitions the objects into k clusters such that intra-cluster similarity is high and inter-cluster similarity is low.
▪ The algorithm proceeds as follows:
1. Choose k objects (usually at random) as the initial cluster centroids.
2. Assign each object to the cluster whose centroid is nearest to it.
3. Re-compute the cluster centers by calculating the mean of each cluster. These become the new cluster centroids.
4. Repeat steps 2 and 3 until the centroids (and hence the cluster assignments) no longer change.
▪ k-Means is simple and can be used for a wide variety of object types.
▪ It is also efficient in terms of both storage requirements and execution time.
▪ By saving distance information from one iteration to the next, the actual number of distance calculations can be reduced (especially as the algorithm approaches termination).
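▪ A minimal NumPy sketch of the steps above (the function name, random initialization, and stopping rule are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```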
• Sensitive to Outliers:
• Outliers can significantly distort the cluster centroids, leading to incorrect clustering.
• Computational Complexity:
• Determining medians can be computationally more expensive than computing means,
especially for large datasets, as sorting or median-finding operations are required at
each iteration.
• Sensitivity to Initial Centers:
• Similar to k-means, k-medians can converge to local minima. The choice of initial cluster
centers strongly influences results, and poor initialization can yield suboptimal
clustering.
• Less Efficient for High-Dimensional Data:
• When dealing with high-dimensional spaces, the performance of k-medians often
deteriorates due to the complexity of median calculations in each dimension.
• Not Completely Robust to Outliers:
• While k-medians are more robust to outliers than k-means (due to using median rather
than mean), significant outliers can still distort clustering, especially in smaller clusters.
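• For contrast with k-Means, a sketch of the k-Medians center-update step (assuming cluster labels have already been computed; the per-dimension median minimizes the within-cluster sum of Manhattan distances):

```python
import numpy as np

def k_medians_update(X, labels, k):
    # New center of each cluster = per-dimension median of its members.
    # This is the update that requires median-finding (sorting) in every dimension,
    # which is the source of the extra cost mentioned above.
    return np.array([np.median(X[labels == j], axis=0) for j in range(k)])
```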
▪ In k-Medoids clustering, the quality of the clustering is measured by the sum of absolute errors (SAE):

$$SAE = \sum_{i=1}^{k} \; \sum_{x \in C_i,\; x \notin M} \lvert x - c_m \rvert, \qquad c_m \in M$$

where M is the set of medoids and c_m is the medoid of the cluster C_i to which x belongs.
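▪ A small helper (a sketch; the function and argument names are illustrative) that evaluates the SAE of a clustering given the medoid of each cluster:

```python
import numpy as np

def sae(X, labels, medoids):
    """Sum of absolute (Manhattan) errors of all points to their cluster medoid."""
    total = 0.0
    for j, m in enumerate(medoids):
        members = X[labels == j]
        total += np.abs(members - m).sum()   # the medoid itself contributes 0
    return total
```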
Point   Coordinates
A1      (2, 6)
A2      (3, 8)
A3      (4, 7)
A4      (6, 2)
A5      (6, 4)
A6      (7, 3)
A7      (7, 4)
A8      (8, 5)
A9      (7, 6)
A10     (3, 4)

• Suppose that we want to group the above dataset into two clusters.
• The following two points from the dataset have been selected as the initial medoids:
  • M1 = (3, 4)
  • M2 = (7, 3)
• Use the Manhattan distance measure.
• Apply k-Medoids clustering to form 2 clusters.
➢ First, calculate the Manhattan distance of each point from the medoids M1 = (3, 4) and M2 = (7, 3) and assign it to the nearer medoid.

Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (7,3)   Assigned Cluster
A1      (2, 6)        3                        8                        Cluster 1
A2      (3, 8)        4                        9                        Cluster 1
A3      (4, 7)        4                        7                        Cluster 1
A4      (6, 2)        5                        2                        Cluster 2
A5      (6, 4)        3                        2                        Cluster 2
A6      (7, 3)        5                        0                        Cluster 2
A7      (7, 4)        4                        1                        Cluster 2
A8      (8, 5)        6                        3                        Cluster 2
A9      (7, 6)        6                        3                        Cluster 2
A10     (3, 4)        0                        5                        Cluster 1

➢ The clusters made with medoids (3, 4) and (7, 3) are as follows:
➢ Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
➢ Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
➢ After assigning clusters, we calculate the cost for each cluster and find their sum. The cost is simply the sum of distances of all the data points from the medoid of the cluster they belong to.
➢ Hence, the cost for the current clustering is 3 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 + 0 = 22.
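➢ The table above can be reproduced with a few lines of NumPy (a sketch; the variable names are illustrative):

```python
import numpy as np

points = np.array([[2, 6], [3, 8], [4, 7], [6, 2], [6, 4],
                   [7, 3], [7, 4], [8, 5], [7, 6], [3, 4]])
medoids = np.array([[3, 4], [7, 3]])                               # M1, M2

d = np.abs(points[:, None, :] - medoids[None, :, :]).sum(axis=2)   # Manhattan distances
assigned = d.argmin(axis=1)                                        # 0 -> cluster 1, 1 -> cluster 2
cost = d[np.arange(len(points)), assigned].sum()                   # 22 for these medoids
print(assigned, cost)
```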
• Now, we select another non-medoid point (7, 4) and make it a temporary medoid for the second cluster. Hence,
  • M1 = (3, 4)
  • M2 = (7, 4)
• Now, let us calculate the distance between all the data points and the current medoids.

Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (7,4)   Assigned Cluster
A1      (2, 6)        3                        7                        Cluster 1
A2      (3, 8)        4                        8                        Cluster 1
A3      (4, 7)        4                        6                        Cluster 1
A4      (6, 2)        5                        3                        Cluster 2
A5      (6, 4)        3                        1                        Cluster 2
A6      (7, 3)        5                        1                        Cluster 2
A7      (7, 4)        4                        0                        Cluster 2
A8      (8, 5)        6                        2                        Cluster 2
A9      (7, 6)        6                        2                        Cluster 2
A10     (3, 4)        0                        4                        Cluster 1

➢ The data points haven't changed clusters after changing the medoids. Hence, the clusters are:
➢ Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
➢ Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
➢ Now, let us again calculate the cost for each cluster and find their sum. The total cost this time is 3 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 + 0 = 20.
➢ Here, the current cost is less than the cost calculated in the previous iteration. Hence, we make the swap permanent and make (7, 4) the medoid for cluster 2.
➢ If the cost this time had been greater than the previous cost, i.e. 22, we would have had to revert the change. The new medoids after this iteration are (3, 4) and (7, 4), with no change in the clusters.
➢ Now, let us again change the medoid of cluster 2, this time to (6, 4).
➢ Hence, the new medoids for the clusters are M1 = (3, 4) and M2 = (6, 4).

Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (6,4)   Assigned Cluster
A1      (2, 6)        3                        6                        Cluster 1
A2      (3, 8)        4                        7                        Cluster 1
A3      (4, 7)        4                        5                        Cluster 1
A4      (6, 2)        5                        2                        Cluster 2
A5      (6, 4)        3                        0                        Cluster 2
A6      (7, 3)        5                        2                        Cluster 2
A7      (7, 4)        4                        1                        Cluster 2
A8      (8, 5)        6                        3                        Cluster 2
A9      (7, 6)        6                        3                        Cluster 2
A10     (3, 4)        0                        3                        Cluster 1

➢ Again, the clusters haven't changed. Hence, the clusters are:
➢ Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
➢ Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
➢ Now, let us again calculate the cost for each cluster and find their sum. The total cost this time is 3 + 4 + 4 + 2 + 0 + 2 + 1 + 3 + 3 + 0 = 22.
➢ The current cost is 22, which is greater than the cost in the previous iteration, i.e. 20. Hence, we revert the change and the point (7, 4) is again made the medoid for cluster 2.
➢ So, the clusters after this iteration are:
  • cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
  • cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
  • The medoids are (3, 4) and (7, 4).
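➢ The swap test carried out above can be written as a short loop (a sketch of the swap step only, not the full PAM algorithm; it reproduces the costs 22, 20, and 22 and keeps the cheapest configuration):

```python
import numpy as np

points = np.array([[2, 6], [3, 8], [4, 7], [6, 2], [6, 4],
                   [7, 3], [7, 4], [8, 5], [7, 6], [3, 4]])

def total_cost(medoids):
    d = np.abs(points[:, None, :] - np.asarray(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()          # each point pays the distance to its nearest medoid

best_m2, best_cost = (7, 3), total_cost([(3, 4), (7, 3)])   # initial cost: 22
for candidate in [(7, 4), (6, 4)]:      # candidate replacements for the cluster-2 medoid
    c = total_cost([(3, 4), candidate])
    print(candidate, c)                 # (7, 4) -> 20, (6, 4) -> 22
    if c < best_cost:                   # keep the swap only if it lowers the cost
        best_m2, best_cost = candidate, c
print("final medoids:", (3, 4), best_m2, "cost:", best_cost)
```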
• The main disadvantage of k-Medoid algorithms is that they are not suitable for clustering non-spherical (arbitrarily shaped) groups of objects.
• They may produce different results on different runs over the same dataset, because the initial k medoids are chosen randomly.
• Not as scalable: they work well for small to medium-sized datasets but can be slow for large datasets.
▪ Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters.
▪ The assumption is that data points close to each other are more similar or related than data points
farther apart.
• Bottom-up strategy
• Each cluster starts with only one object.
• Clusters are merged into larger and larger clusters until:
All the objects are in a single cluster
• Top-down strategy
• Start with all objects in one cluster
• Clusters are subdivided into smaller and smaller clusters until:
Each object forms a cluster on its own
1. Single Linkage
The distance between two clusters is the minimum distance between any point in one cluster and any point in the other.
2. Complete Linkage
The distance between two clusters is the maximum distance between any point in one cluster and any point in the other.
3. Average Linkage
The distance between two clusters is the average of all pairwise distances between points in the two clusters.
4. Centroid Linkage
The distance between two clusters is the distance between their centroids (mean points).
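▪ A tiny NumPy illustration of these linkage definitions on two made-up clusters (the data is hypothetical):

```python
import numpy as np

# Two small hypothetical clusters of 2-D points.
A = np.array([[1.0, 1.0], [2.0, 1.0]])
B = np.array([[5.0, 4.0], [6.0, 5.0]])

# All pairwise Euclidean distances between points of A and points of B.
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pair.min()                                        # single linkage: closest pair
complete = pair.max()                                        # complete linkage: farthest pair
average  = pair.mean()                                       # average linkage: mean of all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # centroid linkage

print(single, complete, average, centroid)
```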
▪ To choose the number of clusters in hierarchical clustering, we make use of a concept called a dendrogram.
▪ A dendrogram is a tree-like diagram that shows the hierarchical relationship between the observations. It records the sequence of merges (or splits) and the distances at which they occur, i.e. the "memory" of the hierarchical clustering algorithm.
▪ Just by looking at the dendrogram, you can tell how the clusters are formed.
▪ Let's see how to form the dendrogram for the data points.
▪ The height of the vertical lines (blocks) in the dendrogram represents the distance between the clusters being merged.
▪ But the question still remains: how do we find the number of clusters using a dendrogram, or where should we stop merging the clusters? Observations are allocated to clusters by drawing a horizontal line (a cutting line) through the dendrogram.
▪ Generally, we cut the dendrogram in such a way that it cuts the tallest vertical line. In the above example, we have two clusters: one cluster has observations A and B, and the second cluster has C, D, E, and F.
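▪ A sketch using SciPy to build the merge tree, draw the dendrogram, and cut it into a chosen number of flat clusters (the data and the choice of linkage method are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))    # hypothetical 2-D data

Z = linkage(X, method="average")   # agglomerative merge history; try "single", "complete", "centroid", "ward"
dendrogram(Z)                      # plot the tree of merges; bar heights = merge distances
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # "cut" the dendrogram into 2 flat clusters
```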
▪ Mean shift clustering is a non-parametric, density-based algorithm that does not assume any specific shape for
the clusters. It is particularly useful for discovering clusters of arbitrary shapes and sizes in datasets.
▪ The core idea behind mean shift clustering is that it attempts to identify high-density regions of the data by
iteratively shifting data points toward regions of higher density. It essentially "shifts" data points to the mode
(peak) of the data distribution.
1. Kernel Density Estimation: Mean shift estimates the density of the data at a point x using a kernel density estimate (one standard form, stated here for completeness):

$$\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where h is a bandwidth parameter and the kernel K is commonly a Gaussian. The kernel function smooths the contribution of each data point, ensuring that points closer to x have a higher influence on the density estimate.
2. Kernel Function: Mean shift uses a kernel function to define the neighborhood of each data point. The kernel is
often a Gaussian kernel, but other kernel types can also be used.
1. Initialization:
• Begin by considering each data point in the dataset as an initial cluster center. These are the points that the
algorithm will try to "shift" toward denser regions.
• The first step is to define a bandwidth (hyper-parameter) which determines the size of the neighborhood around
each data point (often referred to as the window size).
• A kernel function is applied to each data point. The kernel typically assigns weights to points depending on their
distance from the current point of interest. The most commonly used kernel is the Gaussian kernel.
• The idea behind the kernel is that nearby points (in terms of the bandwidth) will have a higher influence than
distant points.
3. Shift Step:
• For each data point, a local mean is calculated using all the points within its bandwidth (window). The mean
is a weighted average of the points in the neighborhood.
• The mean shift vector is computed as the difference between the current point and the mean of its
neighborhood.
• This vector is then used to "shift" the current point toward the denser region (higher density area).
▪ Mathematically (one standard formulation, with kernel K, bandwidth h, and neighbourhood N(x)), the point x is moved to the kernel-weighted mean of its neighbours,

$$m(x) = \frac{\sum_{x_i \in N(x)} K\!\left(\frac{x - x_i}{h}\right) x_i}{\sum_{x_i \in N(x)} K\!\left(\frac{x - x_i}{h}\right)},$$

and the mean shift vector is $m(x) - x$.
4. Iterative Shifting:
• The point is shifted by the computed vector and the process is repeated.
• This shifting continues until the movement of the point becomes very small, i.e., when convergence is
achieved.
5. Convergence:
• The algorithm stops when the data points stop shifting significantly, meaning that each point has converged
to a location of higher density (the mode of the data distribution).
• Once convergence is achieved, all the points that have converged to the same position (mode) are grouped
together as a single cluster.
6. Cluster Formation:
• After convergence, all points that are "close" to each other (in terms of the distance between their final mean
shift positions) are assigned to the same cluster.
• Typically, a distance threshold is used to merge points that are close enough to each other.
• The clusters correspond to local maxima (modes) of the data density, where a large number of points are
gathered in the same area.
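▪ A compact NumPy sketch of steps 1–6 with a Gaussian kernel (the bandwidth, tolerance, and merging threshold are illustrative assumptions):

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, max_iter=300, tol=1e-4, merge_tol=0.5):
    shifted = X.astype(float).copy()      # step 1: every point starts as its own "center"
    for _ in range(max_iter):
        new = np.empty_like(shifted)
        for i, x in enumerate(shifted):
            d2 = ((X - x) ** 2).sum(axis=1)
            w = np.exp(-d2 / (2 * bandwidth ** 2))           # step 2: Gaussian kernel weights
            new[i] = (w[:, None] * X).sum(axis=0) / w.sum()  # step 3: weighted local mean
        moved = np.linalg.norm(new - shifted, axis=1).max()
        shifted = new
        if moved < tol:                    # steps 4-5: iterate until points stop moving
            break
    # step 6: points whose final positions are close share a mode, i.e. a cluster
    modes, labels = [], np.full(len(X), -1)
    for i, p in enumerate(shifted):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < merge_tol:
                labels[i] = k
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return np.array(modes), labels
```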
▪ No Need to Predefine the Number of Clusters: Unlike k-means, mean shift clustering does not
require you to specify the number of clusters beforehand. It finds the number of clusters automatically
▪ Arbitrary Shape Clusters: Mean shift can discover clusters of any shape, unlike k-means, which
assumes spherical or circular clusters. It is especially useful for irregularly shaped clusters.
▪ Robust to Outliers: Outliers have less impact on the mean shift algorithm because they do not lie in high-density regions and therefore contribute little to the kernel-weighted means.
• Computational Complexity: Mean shift is computationally expensive, especially for large datasets. It
involves multiple iterations to shift each data point, which can make it slow for high-dimensional data.
• Sensitivity to Bandwidth Parameter: The bandwidth parameter significantly impacts the results. A small
bandwidth might lead to many small clusters, while a large bandwidth might merge distinct clusters.
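• In practice the bandwidth is often estimated from the data. A sketch using scikit-learn (the quantile value and toy data are illustrative assumptions) shows how this choice drives the number of clusters found:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.default_rng(0).normal(size=(200, 2))   # hypothetical data

# estimate_bandwidth picks a bandwidth from pairwise distances;
# a smaller quantile gives a smaller bandwidth and hence more clusters.
bw = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bw)        # try a much smaller or larger bandwidth to see over-/under-segmentation
labels = ms.fit_predict(X)
print("number of clusters found:", len(ms.cluster_centers_))
```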