Clustering Algorithm Explained

What is Clustering?

A cluster refers to a collection of data points aggregated together because of certain similarities.

Grouping unlabelled data is called clustering.

For example, say you want to organize your music. One approach might be to look for meaningful groups or collections. You might organize music by genre, while your friend might organize music by decade. How you choose to group items helps you to understand more about them as individual pieces of music.

You might find that you have an affinity for rock and further break down the genre into different approaches or music from different locations.

On the other hand, your friend might look at music from the 1980s and be able to understand how the music across genres at that time was influenced by the socio-political climate.

In both cases, you and your friend have learned something interesting about music,
even though you took different approaches.

In machine learning too, we often group examples as a first step to understand a subject (data set) in a machine learning system.

Grouping unlabelled examples is called clustering.

As the examples are unlabelled, clustering relies on unsupervised machine learning. If the examples are labelled, the problem becomes classification.

Unlabelled examples grouped into three clusters based on feature similarity.


Similarity Measure:
A numerical value that quantifies the similarity between two data points is called a similarity measure.

For instance, consider a shoe data set with only one feature: shoe size. You can
quantify how similar two shoes are by calculating the difference between their sizes.
The smaller the numerical difference between sizes, the greater the similarity
between shoes. This is called a manual similarity measure.

Suppose the model has two features: shoe size and shoe price. Since both features are numeric, you can combine them into a single number representing similarity as follows.

Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then
normalize the data.

Price (p): The data is probably a Poisson distribution. Confirm this. If you have
enough data, convert the data to quantiles and scale to [0,1].

Combine the two scaled differences using the root mean squared error (RMSE):

RMSE = √((s² + p²) / 2)

where s and p are the scaled differences in size and price. The similarity is then 1 − RMSE.

Let's calculate the similarity for two shoes with US sizes 8 and 11, and prices 120 and 150 (a code sketch of this calculation follows the steps below). Since we don't have enough data to understand the distribution, we'll simply scale the data without normalizing or using quantiles.

1. Scale the size: Assume a maximum possible shoe size of 20. Divide 8 and 11 by
the maximum size 20 to get 0.4 and 0.55

2. Scale the price: Divide 120 and 150 by the maximum price 150 to get 0.8 and 1

3. Find the difference in size: 0.55 − 0.4 = 0.15

4. Find the difference in price: 1 − 0.8 = 0.2

5. Find the RMSE: √((0.2² + 0.15²) / 2) ≈ 0.17

6. Similarity = 1 − 0.17 = 0.83
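Below is a minimal Python sketch of this manual similarity calculation. The maximum size (20) and maximum price (150) are the same illustrative assumptions used in the steps above.

import math

def shoe_similarity(size_a, size_b, price_a, price_b,
                    max_size=20.0, max_price=150.0):
    # Scale each feature to [0, 1] using an assumed maximum value.
    s = abs(size_a - size_b) / max_size      # scaled size difference
    p = abs(price_a - price_b) / max_price   # scaled price difference
    # Combine the scaled differences with RMSE, then convert to similarity.
    rmse = math.sqrt((s ** 2 + p ** 2) / 2)
    return 1 - rmse

print(shoe_similarity(8, 11, 120, 150))  # ~0.83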

What if you wanted to find similarities between shoes by using both size and color?
Color is categorical data, and is harder to combine with the numerical size data.

In such cases the similarity cannot easily be calculated manually. That's when you switch to a supervised similarity measure, where a deep neural network calculates the similarity.
Loss functions for supervised similarity measure calculation (a minimal sketch follows):

Mean squared error (MSE) for numerical outputs.

Log loss / softmax cross-entropy loss for categorical outputs.
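The following is a minimal NumPy sketch of these two losses, not tied to any particular deep learning framework; the example values are illustrative only.

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error for numerical outputs.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def softmax_cross_entropy(logits, label_index):
    # Softmax cross-entropy for a single categorical example.
    logits = np.asarray(logits, dtype=float)
    logits = logits - logits.max()              # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label_index])

print(mse([1.0, 2.0], [1.1, 1.8]))              # ~0.025
print(softmax_cross_entropy([2.0, 0.5, 0.1], 0))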

What are the Uses of Clustering?


Clustering has a myriad of uses in a variety of industries. Some common
applications for clustering include the following:

 market segmentation
 social network analysis
 search result grouping
 medical imaging
 image segmentation
 anomaly detection
 generalization
 data compression
 privacy preservation.

After clustering, each cluster is assigned a number called a cluster ID.

Now, you can condense the entire feature set for an example into its cluster ID.

Representing a complex example by a simple cluster ID makes clustering powerful.

Clustering can simplify large datasets and make them easier to manage.

For example, you can group items by different features as follows:

Group documents by topic.

Group stars by brightness.

Group books by category


Machine learning systems can then use cluster IDs to simplify the processing of
large datasets. Thus, clustering’s output serves as feature data for downstream ML
systems.
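As an illustration, here is a minimal sketch (assuming scikit-learn is available) of feeding cluster IDs to a downstream system as an extra feature column; the toy data and parameter values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 4)    # toy feature matrix

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_                 # one cluster ID per example

# Append the cluster ID as an extra feature column for a downstream model.
X_augmented = np.column_stack([X, cluster_ids])
print(X_augmented.shape)                     # (100, 5)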

Clustering:
Grouping related examples, particularly during unsupervised learning. Once all the
examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid. A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point.

Types of Clustering:
Each approach is best suited to a particular data distribution. Below is a short
discussion of four common approaches, focusing on centroid-based clustering using
k-means.
Centroid-based Clustering:
Centroid-based clustering organizes the data into non-hierarchical clusters, in
contrast to hierarchical clustering defined below.

K-means is the most widely used centroid-based clustering algorithm; it is an efficient, effective, and simple clustering algorithm.

Centroid-based algorithms are efficient but sensitive to outliers.

Example of centroid-based clustering.

Density-based Clustering:
Density-based clustering connects areas of high example density into clusters. This
allows for arbitrary-shaped distributions as long as dense areas can be connected.
These algorithms have difficulty with data of varying densities and high dimensions.
Further, by design, these algorithms do not assign outliers to clusters.

Example of density-based clustering.

Distribution-based Clustering:
This clustering approach assumes data is composed of distributions, such
as Gaussian distributions.
In the example below, the distribution-based algorithm clusters the data into three Gaussian distributions. As the distance from a distribution's centre increases, the probability that a point belongs to that distribution decreases. The bands show that decrease in probability. When you do not know the type of distribution in your data, you should use a different algorithm.

Example of distribution-based clustering.

Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters. It is well suited to hierarchical data. Another advantage is that any number of clusters can be chosen by cutting the tree at the right level.

Example of a hierarchical tree clustering animals.


Objective of Cluster Analysis:
Intra-cluster distance is the sum of distances between objects in the same cluster.

This distance should always be minimized.

Inter-cluster distance is the distance between objects in different clusters.

This distance should always be maximized.
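The following is a minimal NumPy sketch of these two quantities for two toy clusters; the points and the choice of Euclidean distance are illustrative assumptions.

import numpy as np
from itertools import combinations

def intra_cluster_distance(cluster):
    # Sum of pairwise distances between objects in the same cluster.
    return sum(np.linalg.norm(a - b) for a, b in combinations(cluster, 2))

def inter_cluster_distance(cluster_a, cluster_b):
    # Sum of distances between objects in different clusters.
    return sum(np.linalg.norm(a - b) for a in cluster_a for b in cluster_b)

c1 = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])
c2 = np.array([[3.0, 3.0], [3.2, 2.9]])
print(intra_cluster_distance(c1))       # small: points are close together
print(inter_cluster_distance(c1, c2))   # large: clusters are far apart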

Data preparation:
In clustering, calculate the similarity between two examples by combining all the
feature data for those examples into a numeric value. Combining feature data
requires that the data have the same scale.

Normalizing: min-max scaling or standardization (z-score).

Transforming: log transformation.

Quantile bucketing: Distributing a feature's values into buckets so that each bucket
contains the same (or almost the same) number of examples. For example, the
following figure divides 44 points into 4 buckets, each of which contains 11 points. In
order for each bucket in the figure to contain the same number of points, some
buckets span a different width of x-values.
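Here is a minimal sketch of these three preparation steps with NumPy and pandas; the lognormal toy data mirror the 44-points / 4-buckets figure described above and are otherwise illustrative assumptions.

import numpy as np
import pandas as pd

values = np.random.RandomState(0).lognormal(mean=3.0, sigma=1.0, size=44)

# Normalizing: min-max scaling to [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Transforming: log transform for a long-tailed distribution.
logged = np.log(values)

# Quantile bucketing: 4 buckets with (almost) the same number of examples.
buckets = pd.qcut(values, q=4, labels=False)
print(np.bincount(buckets))   # roughly 11 examples per bucket
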
K-means clustering in Machine Learning:
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The objective of K-means is simple: group similar data points together and discover
underlying patterns.

To achieve this objective, K-means looks for a fixed number (k) of clusters in a
dataset.

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The K-means algorithm works as follows (a minimal code sketch follows the list):

 Specify the number of clusters K.
 Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
 Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.
   o Compute the sum of the squared distance between data points and all centroids.
   o Assign each data point to the closest cluster (centroid).
   o Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.
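The sketch below implements this loop with NumPy only; the toy data, the seed, and the maximum number of iterations are illustrative assumptions, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K data points without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Compute squared distances from every point to every centroid.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to the closest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                       # centroids stopped changing
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
labels, centroids = kmeans(X, k=3)
print(centroids)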

Determining the Optimal Number of Clusters:

Elbow Method:
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE curve starts to flatten out, forming an elbow.
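A minimal sketch of the elbow method using scikit-learn's KMeans is shown below; inertia_ is the SSE between points and their assigned centroids, and the toy data and range of k values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

for k, s in zip(range(1, 10), sse):
    print(k, round(s, 1))   # look for the k where the SSE starts to flatten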
Silhouette Analysis:
Silhouette analysis can be used to determine the degree of separation between
clusters.

It computes the average silhouette of observations for different values of k.


• It measures the quality of a clustering, i.e. it determines how well each object lies within its cluster.
• A high average silhouette coefficient indicates a good clustering.

• The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.

For each sample:

 Compute the average distance to all other data points in the same cluster (ai).
 Compute the average distance to all data points in the closest neighboring cluster (bi).
 Compute the silhouette coefficient: s(i) = (bi − ai) / max(ai, bi).
The coefficient can take values in the interval [-1, 1].

 If it is 0 –> the sample is very close to the neighboring clusters.
 If it is 1 –> the sample is far away from the neighboring clusters.
 If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible and close to 1 to have good clusters.
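Below is a minimal sketch of silhouette analysis using scikit-learn, picking the k with the largest average silhouette coefficient; the toy data and range of k values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

for k in range(2, 8):                       # silhouette requires k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))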

Gap Statistic Method:


The gap statistic compares the total within intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value of k that maximizes the gap statistic (i.e., the value that yields the largest gap statistic). This means that the clustering structure is far away from the random uniform distribution of points.

Bootstrapping is used to generate B copies of the reference dataset and to compute the average log(Wk) over them. The gap statistic measures the deviation of the observed Wk value from its expected value under the null hypothesis. The estimate of the optimal number of clusters is the value of k that maximizes Gap(k); this means that the clustering structure is far away from the uniform distribution of points.

That is, for each variable (xi) in the data set we compute its range [min(xi), max(xi)] and generate values for the n points uniformly from this interval.

The algorithm works as follows:

Cluster the observed data, varying the number of clusters from k = 1, …, kmax, and
compute the corresponding total within intra-cluster variation Wk.

Generate B bootstrapped reference data sets with a random uniform distribution. Cluster each of these reference data sets with a varying number of clusters k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wkb.

For the observed data and the reference data, the total intra-cluster variation is
computed using different values of k. The gap statistic for a given k is defined as
follows. Compute the estimated gap statistic as the deviation of the
observed Wk value from its expected value Wkb under the null hypothesis:
Gap(k) = (1/B) ∑b=1..B log(W*kb) − log(Wk)

The estimate of the optimal clusters will be the value that maximizes the gap
statistic.

This means that the clustering structure is far away from the random uniform
distribution of points.
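Here is a rough sketch of the procedure above, assuming scikit-learn is available; Wk is approximated by the K-means inertia (the within-cluster sum of squares), and B, kmax, and the toy data are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    # Total within-cluster sum of squares for a K-means clustering of X.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, k_max=8, B=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(within_dispersion(X, k))
        # B reference data sets drawn uniformly from each feature's range.
        log_wkb = [np.log(within_dispersion(
            rng.uniform(mins, maxs, size=X.shape), k)) for _ in range(B)]
        gaps.append(np.mean(log_wkb) - log_wk)
    return gaps

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
for k, gap in enumerate(gap_statistic(X), start=1):
    print(k, round(gap, 3))   # pick the k that maximizes the gap statistic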

Drawbacks:
The K-means algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape. It always tries to construct a nice spherical shape around the centroid. That means that the minute the clusters have complicated geometric shapes, K-means does a poor job of clustering the data. We'll illustrate three cases where K-means will not perform well.

First, the K-means algorithm doesn't let data points that are far away from each other share the same cluster, even though they obviously belong to the same cluster. An example is data points lying on two different horizontal lines: K-means tries to group half of the data points of each horizontal line together.

Second, suppose we have 3 groups of data where each group was generated from a different multivariate normal distribution (different mean/standard deviation), and one group has far more data points than the other two combined. Next, run K-means on the data with K=3 and see whether it is able to cluster the data correctly.

Third, consider data that have complicated geometric shapes, such as moons and circles within each other, and test K-means on both of these datasets.

However, we can help K-means cluster these kinds of datasets if we use kernel methods. The idea is to transform the data to a higher-dimensional representation that makes it linearly separable (the same idea that we use in SVMs). Other algorithms, such as spectral clustering, also work very well in such scenarios.
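As a minimal illustration (assuming scikit-learn is available), the sketch below contrasts K-means with spectral clustering on the "two moons" dataset; the sample size, noise level, and affinity choice are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(X)

# K-means splits each moon roughly in half; spectral clustering, which works
# on a nearest-neighbour similarity graph, recovers the two moons.
print(kmeans_labels[:10], spectral_labels[:10])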
Hierarchical clustering Technique:
Hierarchical clustering is one of the most popular and easy-to-understand clustering techniques. This clustering technique is divided into two types:

Agglomerative

Divisive

1. Agglomerative Hierarchical clustering Technique:

In this technique, initially each data point is considered an individual cluster. At each iteration, similar clusters merge with other clusters until one cluster or K clusters are formed.

2. Divisive Hierarchical clustering Technique:

Divisive Hierarchical clustering is exactly the opposite of Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we consider all the data points as a single cluster, and in each iteration we separate from the cluster the data points which are not similar. Each data point which is separated is considered an individual cluster. In the end, we are left with n clusters.
Calculating the Similarity Between Two Clusters:
 MIN
 MAX
 Group Average
 Distance Between Centroids
 Ward’s Method

MIN:
Also known as the single-linkage algorithm: the similarity of two clusters C1 and C2 is the minimum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.

Mathematically this can be written as,

Sim(C1,C2) = Min Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2

In simple words, pick the two closest points such that one point lies in cluster one and the other lies in cluster two, and take their similarity as the similarity between the two clusters.

Pros of MIN:
This approach can separate non-elliptical shapes as long as the gap between the
two clusters is not small.
Original data vs Clustered data using MIN approach

Cons of MIN:
MIN approach cannot separate clusters properly if there is noise between clusters.

Original data vs Clustered data using MIN approach

MAX:
Also known as the complete-linkage algorithm, this is exactly the opposite of the MIN approach. The similarity of two clusters C1 and C2 is the maximum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.

Mathematically this can be written as,

Sim(C1,C2) = Max Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2

In simple words, pick the two farthest points such that one point lies in cluster one and the other lies in cluster two, and take their similarity as the similarity between the two clusters.
Pros of MAX:
MAX approach does well in separating clusters if there is noise between clusters.

Original data vs Clustered data using MAX approach

Cons of Max:
Max approach is biased towards globular clusters.

Max approach tends to break large clusters.

Original data vs Clustered data using MAX approach

Group Average:
Take all pairs of points with one point in each cluster, compute their similarities, and calculate the average of the similarities.

Mathematically this can be written as,

sim(C1,C2) = ∑ sim(Pi, Pj) / (|C1| * |C2|), where Pi ∈ C1 & Pj ∈ C2

The group average approach does well in separating clusters if there is noise between clusters.

Distance between centroids:


Compute the centroids of two clusters C1 & C2 and take the similarity between the
two centroids as the similarity between two clusters. This is a less popular technique
in the real world.

Ward’s Method:
This approach to calculating the similarity between two clusters is exactly the same as Group Average, except that Ward's method calculates the sum of the squares of the distances between Pi and Pj.

Mathematically this can be written as,

sim(C1,C2) = ∑ (dist(Pi, Pj))² / (|C1| * |C2|)

Pros of Ward's method:

Ward's method also does well in separating clusters if there is noise between clusters.

Cons of Ward's method:

Ward's method is also biased towards globular clusters.
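As a minimal illustration of these linkage criteria (assuming SciPy is available), the sketch below runs agglomerative clustering with single (MIN), complete (MAX), average, centroid, and Ward linkage on toy data; the data and the cut into two clusters are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5)])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # build the cluster tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes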

Space and Time Complexity of the Hierarchical clustering Technique

Space complexity: The space required for hierarchical clustering is very high when the number of data points is large, because we need to store the similarity matrix in RAM; for n points this is on the order of n² entries.

Time complexity: Since we have to perform n iterations, and in each iteration we need to update and search the similarity matrix, the time complexity is also very high (roughly cubic in n for a naive implementation).

Because of this high space and time complexity, hierarchical clustering cannot be used when we have huge amounts of data.
