Clustering
K-Means clustering allows us to group the data into different clusters and provides a convenient way to discover the categories in an unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it can no longer improve the clusters. The value of k must be predetermined in this algorithm.
The following stages will help us understand how the K-Means clustering technique works:
1. Select the number of clusters, k.
2. Choose k initial centroids (for example, k random points from the dataset).
3. Assign each data point to its nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the assignments no longer change.
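As a rough sketch of these stages in code, the example below uses scikit-learn's KMeans on a small made-up dataset; the data and parameter values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with two obvious groups (illustrative only).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# k must be chosen in advance; here k = 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index assigned to each point
print(labels)                        # e.g. [0 0 0 1 1 1] (label order may vary)
print(kmeans.cluster_centers_)       # final centroids after convergence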
K-Medoids Algorithm
K-Medoids is an unsupervised clustering algorithm in which data points called "medoids" act as the cluster's center. A medoid is a point in the cluster whose sum of distances (also called dissimilarity) to all the objects in the cluster is minimal. The distance can be the Euclidean distance, Manhattan distance, or any other suitable distance function.
Therefore, the K-medoids algorithm divides the data into K clusters by selecting K medoids from our data sample.
Working of the Algorithm
The steps taken by the K-medoids algorithm for clustering can be explained as follows:
1. Randomly select k points from the data (k is the number of clusters to be formed). These k points act as our initial medoids.
2. The distances between the medoid points and the non-medoid points are calculated, and each point is assigned to the cluster of its nearest medoid.
3. Calculate the cost as the total sum of the distances (also called dissimilarities) of the data points from their assigned medoids.
4. Swap one medoid point with a non-medoid point (from the same cluster as the medoid point) and recalculate the cost.
5. If the calculated cost with the new medoid point is more than the previous cost, we undo the swap and the algorithm converges; else, we repeat step 4.
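A minimal NumPy sketch of this procedure is given below. It is illustrative only: the function name k_medoids is made up, Manhattan distance is an arbitrary choice, and for simplicity every non-medoid point is tried in step 4 rather than only points from the same cluster.

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    # Illustrative implementation of the steps above (not a library routine).
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = rng.choice(n, size=k, replace=False)    # step 1: random initial medoids

    def assign(medoids):
        # step 2: Manhattan distance from every point to every medoid
        d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)                     # nearest medoid per point
        cost = d[np.arange(n), labels].sum()          # step 3: total dissimilarity
        return labels, cost

    labels, cost = assign(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                            # step 4: try medoid/non-medoid swaps
            for p in range(n):
                if p in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = p
                trial_labels, trial_cost = assign(trial)
                if trial_cost < cost:                 # keep the swap only if the cost decreases
                    medoids, labels, cost = trial, trial_labels, trial_cost
                    improved = True
        if not improved:                              # step 5: no beneficial swap left
            break
    return medoids, labels, cost

Calling k_medoids(X, k=2) on a small 2-D array returns the indices of the chosen medoids, a cluster label for each point, and the final cost.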
1. Agglomerative:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of one cluster with all the other clusters (calculate the proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
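As a small illustration of these steps, the sketch below uses SciPy's hierarchical-clustering routines; the toy data and the choice of "average" linkage are assumptions made for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy dataset: two compact groups of points.
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [6.0, 6.0], [6.2, 5.9], [5.9, 6.1]])

# linkage() performs the agglomerative merging: it starts from singleton
# clusters, repeatedly merges the closest pair, and records every merge.
Z = linkage(X, method="average")           # "average" proximity between clusters

# Cut the resulting tree (dendrogram) into a chosen number of clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                              # e.g. [1 1 1 2 2 2]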
2. Divisive:
Divisive clustering is the opposite, top-down approach: all data points start in a single cluster, which is repeatedly split into smaller clusters until each data point forms its own cluster or a stopping criterion is met.
Agglomerative vs. Divisive Clustering, by parameter:

3. Complexity level: Agglomerative clustering is generally more computationally expensive, especially for large datasets, as this approach requires the calculation of all pairwise distances between data points. Divisive clustering is comparatively less expensive, as it only requires the calculation of distances between sub-clusters, which can reduce the computational burden.

4. Outliers: Agglomerative clustering can handle outliers better than divisive clustering, since outliers can be absorbed into larger clusters. Divisive clustering may create sub-clusters around outliers, leading to suboptimal clustering results.

5. Interpretability: Agglomerative clustering tends to produce more interpretable results, since the dendrogram shows the merging process of the clusters and the user can choose the number of clusters based on the desired level of granularity. Divisive clustering can be more difficult to interpret, since the dendrogram shows the splitting process of the clusters and the user must choose a stopping criterion to determine the number of clusters.

6. Implementation: Scikit-learn provides multiple linkage methods for agglomerative clustering, such as “ward,” “complete,” “average,” and “single.” Divisive clustering is not currently implemented in Scikit-learn.
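As a brief illustration of the scikit-learn implementation mentioned in the comparison above, the snippet below fits AgglomerativeClustering with one of the listed linkage methods; the data and the choice of "ward" linkage are illustrative.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [6.0, 6.0], [6.2, 5.9], [5.9, 6.1]])

# linkage can be "ward", "complete", "average", or "single".
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)        # e.g. [0 0 0 1 1 1]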
Common distance measures used to compute these proximities include:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
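For reference, the sketch below evaluates these four measures with SciPy's distance functions; the sample vectors are arbitrary.

from scipy.spatial.distance import euclidean, cityblock, minkowski, hamming

u, v = [1, 2, 3], [4, 6, 3]

print(euclidean(u, v))        # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
print(cityblock(u, v))        # Manhattan distance: |1-4| + |2-6| + |3-3| = 7
print(minkowski(u, v, p=3))   # Minkowski distance of order p (p=2 is Euclidean)
print(hamming(u, v))          # fraction of positions that differ = 2/3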
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work well for finding spherical-shaped or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure
below.
2. Data may contain noise.
The figure above shows a data set containing non-convex clusters and outliers. Given such data, the k-means algorithm has difficulty identifying clusters with these arbitrary shapes.
Parameters Required For DBSCAN Algorithm
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps', then they are considered neighbors. If the eps value is chosen too small, then a large part of the data will be considered outliers. If it is chosen very large, then the clusters will merge and the majority of the data points will end up in the same cluster. One way to find a suitable eps value is the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. MinPts must be chosen to be at least 3.
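A short sketch of how these two parameters are used with scikit-learn's DBSCAN is shown below; the dataset, eps, and min_samples values are illustrative, and the k-distance computation is only a rough aid for choosing eps.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [6.0, 6.0], [6.2, 5.9], [5.9, 6.1],
              [20.0, 20.0]])                      # last point is an outlier

# Optional: k-distance graph to help pick eps. Sort each point's distance to
# its k-th nearest neighbor and look for the "elbow" in the sorted curve.
k = 3
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
print(np.sort(distances[:, -1]))

db = DBSCAN(eps=0.5, min_samples=3)               # eps and MinPts
labels = db.fit_predict(X)
print(labels)                                     # noise points are labeled -1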
The expansion step of DBSCAN, which grows a cluster C from a core point's eps-neighborhood N, can be written in pseudocode as:

for each point p' in N:
    N' = the points within eps of p'
    if |N'| >= MinPts:
        N = N U N'
    if p' is not a member of any cluster:
        add p' to cluster C