Clustering Algorithm
For example, say you want to organize music. One approach might be to look for meaningful groups or collections.
You might organize music by genre, while your friend might organize music by
decade. How you choose to group items helps you to understand more about them
as individual pieces of music.
You might find that you have an affinity for rock and further break down the genre into
different styles or music from different locations.
On the other hand, your friend might look at music from the 1980s and be able to
understand how the music across genres at that time was influenced by the socio-political climate.
In both cases, you and your friend have learned something interesting about music,
even though you took different approaches.
For instance, consider a shoe data set with only one feature: shoe size. You can
quantify how similar two shoes are by calculating the difference between their sizes.
The smaller the numerical difference between sizes, the greater the similarity
between shoes. This is called a manual similarity measure.
Suppose the model has two features: shoe size and shoe price. Since both
features are numeric, you can combine them into a single number representing
similarity as follows.
Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then
normalize the data.
Price (p): The data is probably a Poisson distribution. Confirm this. If you have
enough data, convert the data to quantiles and scale to [0,1].
Combine the data by using root mean squared error (RMSE). For two shoes, take the scaled difference in size (s) and the scaled difference in price (p) and compute
RMSE = √((s² + p²) / 2)
The similarity is then 1 − RMSE.
Let’s calculate similarity for two shoes with US sizes 8 and 11, and prices 120 and
150. Since we don’t have enough data to understand the distribution, we’ll simply
scale the data without normalizing or using quantiles.
1. Scale the size: assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55.
2. Scale the price: divide 120 and 150 by the maximum price 150 to get 0.8 and 1.
3. Find the difference in scaled sizes: 0.55 − 0.4 = 0.15.
4. Find the difference in scaled prices: 1 − 0.8 = 0.2.
5. Find the RMSE: √((0.2² + 0.15²) / 2) ≈ 0.17.
6. Similarity = 1 − 0.17 = 0.83.
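As a rough sketch, the worked example above can be reproduced in Python; the function name and the maximums (size 20, price 150) are taken directly from the example:

```python
# A minimal sketch of the manual similarity measure from the worked example above.
import math

def shoe_similarity(size_a, size_b, price_a, price_b,
                    max_size=20.0, max_price=150.0):
    """Similarity in [0, 1]: 1 minus the RMSE of the scaled feature differences."""
    s = abs(size_a - size_b) / max_size      # scaled size difference
    p = abs(price_a - price_b) / max_price   # scaled price difference
    rmse = math.sqrt((s ** 2 + p ** 2) / 2)
    return 1 - rmse

# ~0.82; the worked example rounds the RMSE to 0.17, giving 0.83.
print(shoe_similarity(8, 11, 120, 150))
```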
What if you wanted to find similarities between shoes by using both size and color?
Color is categorical data, and is harder to combine with the numerical size data.
Clustering has many uses, for example:
market segmentation
social network analysis
search result grouping
medical imaging
image segmentation
anomaly detection
generalization
data compression
privacy preservation.
After clustering, each cluster is assigned a cluster ID, and you can condense the entire feature set for an example into its cluster ID. Clustering can therefore simplify large datasets and make the data easier to manage.
Clustering:
Grouping related examples, particularly during unsupervised learning. Once all the
examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters
examples based on their proximity to a centroid.
A human researcher could then review the clusters and, for example, label cluster 1
as "dwarf trees" and cluster 2 as "full-size trees."
Types of Clustering:
Each approach is best suited to a particular data distribution. Below is a short
discussion of four common approaches, focusing on centroid-based clustering using
k-means.
Centroid-based Clustering:
Centroid-based clustering organizes the data into non-hierarchical clusters, in
contrast to hierarchical clustering defined below.
Density-based Clustering:
Density-based clustering connects areas of high example density into clusters. This
allows for arbitrary-shaped distributions as long as dense areas can be connected.
These algorithms have difficulty with data of varying densities and high dimensions.
Further, by design, these algorithms do not assign outliers to clusters.
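As an illustrative aside (not from the text), DBSCAN is a widely used density-based algorithm; the dataset and the eps/min_samples values below are arbitrary choices:

```python
# A minimal density-based clustering sketch using scikit-learn's DBSCAN.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are illustrative, untuned values.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points that could not be attached to any dense region are labelled -1 (outliers).
print(set(labels))
```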
Distribution-based Clustering:
This clustering approach assumes data is composed of distributions, such
as Gaussian distributions.
For example, a distribution-based algorithm might cluster data into three Gaussian
distributions. As the distance from a distribution's centre increases, the probability
that a point belongs to that distribution decreases. When you do not know the type of
distribution in your data, you should use a different algorithm.
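A minimal sketch of distribution-based clustering using scikit-learn's GaussianMixture; the three synthetic Gaussian groups below are invented to mirror the three-distribution example:

```python
# Distribution-based clustering with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),   # three made-up Gaussian groups
    rng.normal(loc=(5, 5), scale=1.0, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.8, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # membership probabilities, which drop with distance from each centre
print(probs[:3].round(3))
```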
Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters. It is well suited to hierarchical
data. Another advantage is that any number of clusters can be chosen by cutting the
tree at the right level.
Inter-cluster distance is the distance between objects in different clusters.
Data preparation:
In clustering, calculate the similarity between two examples by combining all the
feature data for those examples into a numeric value. Combining feature data
requires that the data have the same scale.
Normalizing: min-max scaling or standardization (z-score).
Transforming: log transform for heavily skewed data.
Quantile bucketing: Distributing a feature's values into buckets so that each bucket
contains the same (or almost the same) number of examples. For example, the
following figure divides 44 points into 4 buckets, each of which contains 11 points. In
order for each bucket in the figure to contain the same number of points, some
buckets span a different width of x-values.
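A small illustrative sketch of these preparation steps with scikit-learn; the price values are invented for demonstration:

```python
# Normalizing, transforming, and quantile bucketing a single skewed feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer

prices = np.array([[10.0], [12.0], [15.0], [40.0], [80.0], [200.0], [500.0]])

minmax = MinMaxScaler().fit_transform(prices)    # normalizing to [0, 1]
zscore = StandardScaler().fit_transform(prices)  # normalizing to zero mean, unit std
logged = np.log(prices)                          # log transform for skewed data
quantiled = QuantileTransformer(n_quantiles=7).fit_transform(prices)  # quantiles scaled to [0, 1]

print(np.hstack([minmax, zscore, logged, quantiled]).round(2))
```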
K-means clustering in Machine Learning:
K-means clustering is one of the simplest and most popular unsupervised machine
learning algorithms.
The objective of K-means is simple: group similar data points together and discover
underlying patterns.
To achieve this objective, K-means looks for a fixed number (k) of clusters in a
dataset.
The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs
to only one group. It tries to make the intra-cluster data points as similar as possible
while also keeping the clusters as different (far) as possible. It assigns data points to
a cluster such that the sum of the squared distance between the data points and the
cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is
at the minimum. The less variation we have within clusters, the more homogeneous
(similar) the data points are within the same cluster.
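A minimal k-means sketch using scikit-learn; the synthetic blobs dataset and k = 3 are illustrative assumptions, not from the text:

```python
# K-means: assign each point to the nearest centroid, minimizing the within-cluster
# sum of squared distances.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # centroids (arithmetic means of each cluster)
print(kmeans.inertia_)          # sum of squared distances to the nearest centroid
print(kmeans.labels_[:10])      # cluster ID assigned to each example
```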
Elbow Method:
The elbow method gives us an idea of what a good number of clusters k would be, based
on the sum of squared distances (SSE) between data points and their assigned
clusters’ centroids. We pick k at the spot where the SSE curve starts to flatten out and
forms an elbow.
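A possible sketch of the elbow method, computing the SSE (scikit-learn's inertia_) for a range of k values on an illustrative dataset and plotting the curve:

```python
# Elbow method: plot SSE against k and look for the point where the curve flattens.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()
```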
Silhouette Analysis:
Silhouette analysis can be used to determine the degree of separation between
clusters.
• The optimal number of clusters k is the one that maximizes the average silhouette
over a range of possible values for k.
For each data point i:
Compute ai, the average distance to all other data points in the same cluster.
Compute bi, the average distance to all data points in the closest (neighbouring) cluster.
Compute the silhouette coefficient: si = (bi − ai) / max(ai, bi).
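A minimal sketch of silhouette analysis using scikit-learn's silhouette_score, which averages the per-point coefficient defined above; the dataset is illustrative:

```python
# Silhouette analysis: pick the k with the highest average silhouette coefficient.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```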
Gap Statistic Method:
The gap statistic compares the total intra-cluster variation for different values of k with
its expected value under a null reference distribution of the data, i.e. a distribution with
no obvious clustering. That is, for each variable (xi) in the data set we compute its range
[min(xi), max(xi)] and generate values for the n reference points uniformly from that
interval; this is repeated to produce B reference data sets.
Cluster the observed data, varying the number of clusters from k = 1, …, kmax, and
compute the corresponding total intra-cluster variation Wk.
For the observed data and the reference data, the total intra-cluster variation is
computed using different values of k. The gap statistic for a given k is then defined as
the deviation of the observed log(Wk) value from its expected value under the null
hypothesis:
Gap(k) = (1/B) Σ_{b=1..B} log(W*kb) − log(Wk)
The estimate of the optimal clusters will be the value that maximizes the gap
statistic.
This means that the clustering structure is far away from the random uniform
distribution of points.
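A rough sketch of the gap statistic as defined above, taking Wk as the k-means inertia and drawing B reference datasets uniformly from each feature's [min, max] range; B = 10 and the dataset are illustrative choices:

```python
# Gap statistic: compare log(Wk) on the observed data with its average over
# B uniform reference datasets; the k with the largest gap is the estimate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k, B=10, seed=0):
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    log_wkb = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
        for _ in range(B)
    ]
    return np.mean(log_wkb) - log_wk

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
print({k: round(gap_statistic(X, k), 3) for k in range(1, 7)})
```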
Drawbacks:
The k-means algorithm is good at capturing the structure of the data if the clusters have a
spherical-like shape; it always tries to construct a nice spherical shape around the
centroid. That means that as soon as the clusters have complicated geometric shapes,
k-means does a poor job of clustering the data. We’ll illustrate three cases where
k-means will not perform well.
First, the k-means algorithm doesn’t let data points that are far away from each other
share the same cluster, even though they obviously belong to the same cluster.
An example is data points lying on two different horizontal lines: k-means tries to
group half of the data points of each horizontal line together.
Second, suppose we have 3 groups of data, each generated from a different
multivariate normal distribution (different mean/standard deviation), and one group
has far more data points than the other two combined. Next, run k-means on the
data with K=3 and see whether it is able to cluster the data correctly.
Third, consider data with complicated geometric shapes, such as moons or circles
nested within each other, and test k-means on both of these datasets.
However, we can help k-means cluster these kinds of datasets perfectly if we use
kernel methods. The idea is to transform the data to a higher-dimensional representation
that makes it linearly separable (the same idea that we use in SVMs). Different
kinds of algorithms, such as spectral clustering, work very well in such scenarios.
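An illustrative comparison of k-means and spectral clustering on the "moons" shape mentioned above; the parameters are arbitrary:

```python
# Spectral clustering recovers the two moons that k-means splits incorrectly.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        random_state=0).fit_predict(X)

print("k-means agreement with true moons: ", round(adjusted_rand_score(y_true, km), 3))
print("spectral agreement with true moons:", round(adjusted_rand_score(y_true, sc), 3))
```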
Hierarchical clustering Technique:
Hierarchical clustering is one of the most popular and easiest-to-understand clustering
techniques. It is divided into two types:
Agglomerative
Divisive
MIN:
Also known as the single-linkage algorithm: the similarity of two clusters C1 and C2 is
defined as the minimum of the similarities between points Pi and Pj such that Pi
belongs to C1 and Pj belongs to C2.
In simple words, pick the two closest points such that one point lies in cluster C1
and the other lies in cluster C2, take their similarity, and declare it as the
similarity between the two clusters.
Pros of MIN:
This approach can separate non-elliptical shapes as long as the gap between the
two clusters is not small.
Figure: original data vs. clustered data using the MIN approach.
Cons of MIN:
MIN approach cannot separate clusters properly if there is noise between clusters.
MAX:
Also known as the complete linkage algorithm, this is exactly opposite to
the MIN approach. The similarity of two clusters C1 and C2 is equal to
the maximum of the similarity between points Pi and Pj such that Pi belongs to C1
and Pj belongs to C2.
In simple words, pick the two farthest points such that one point lies in cluster C1
and the other lies in cluster C2, take their similarity, and declare it as the
similarity between the two clusters.
Pros of MAX:
MAX approach does well in separating clusters if there is noise between clusters.
Cons of MAX:
The MAX approach is biased towards globular clusters.
Group Average:
Take all pairs of points, one from each cluster, compute their similarities, and calculate
the average of these similarities:
sim(C1, C2) = (1 / (|C1|·|C2|)) Σ sim(Pi, Pj), where Pi ∈ C1 and Pj ∈ C2
The group Average approach does well in separating clusters if there is noise
between clusters
Ward’s Method:
This approach to calculating the similarity between two clusters is essentially the same
as the group-average approach, except that Ward’s method uses the sum of the squares
of the distances between Pi and Pj.
Hierarchical clustering has high space and time complexity, so this clustering
approach cannot be used when we have huge amounts of data.
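A minimal sketch of these linkage criteria with scikit-learn's AgglomerativeClustering (single = MIN, complete = MAX, average = group average, plus Ward); the dataset is illustrative:

```python
# Agglomerative (bottom-up) hierarchical clustering with different linkage criteria.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])
```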