Clustering
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence or absence of those patterns.
It is an unsupervised learning method; no supervision is provided to the algorithm, and it works with an unlabelled dataset.
After applying the clustering technique, each cluster or group is assigned a cluster ID, which the ML system can use to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall: when we visit a mall, we can observe that items with similar uses are grouped together. T-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique is used in a wide variety of tasks. Some of the most common uses of this technique are:
Market segmentation
Image segmentation
Apart from these general uses, Amazon applies clustering in its recommendation system to provide recommendations based on a user's past product searches, and Netflix uses it to recommend movies and web series based on watch history. The diagram below illustrates how a clustering algorithm works: different fruits are divided into several groups with similar properties.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group). Various other approaches also exist. Below are the main clustering methods used in machine learning:
Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
Hierarchical Clustering
Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k pre-defined groups. The cluster centers are chosen so that each data point is closer to its own cluster centroid than to the centroid of any other cluster.
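As a minimal sketch of partitioning clustering, the snippet below runs scikit-learn's KMeans on a small made-up dataset; the data and the choice of k = 3 are illustrative assumptions, not part of the original example:

import numpy as np
from sklearn.cluster import KMeans

# Nine illustrative 2-D points forming three loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 8], [6, 9], [5, 9]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster ID assigned to each point
print(kmeans.cluster_centers_)  # the learned centroids

Each point receives the ID of the centroid it is closest to, which is exactly the cluster-ID assignment described earlier.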
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm does this by identifying the dense regions in the data space and joining them into clusters; the dense areas are separated from each other by sparser areas.
These algorithms can struggle to cluster the data points if the dataset has varying densities or high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
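A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture follows; the two synthetic Gaussian blobs are an illustrative assumption:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two illustrative Gaussian blobs centred at 0 and 5
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment to the most likely component
probs = gmm.predict_proba(X)  # per-component membership probabilities

The predict_proba output makes the probabilistic nature of the method visible: each point gets a probability of belonging to each Gaussian component rather than only a single hard label.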
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters. In this technique, the dataset is divided into clusters to create a tree-like structure, also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the agglomerative hierarchical algorithm.
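The following sketch, using SciPy's hierarchical-clustering routines on illustrative random data, shows the dendrogram-and-cut workflow described above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(10, 2)  # illustrative data
Z = linkage(X, method='ward')             # bottom-up (agglomerative) merging
# "Cut" the tree to obtain a chosen number of flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself

Cutting at a different level (a different t) yields a different number of clusters without re-running the algorithm, which is the practical advantage mentioned above.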
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients that express its degree of membership in each cluster. The Fuzzy C-means algorithm is the best-known example of this type of clustering; it is sometimes also called the fuzzy k-means algorithm.
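Fuzzy C-means is not part of scikit-learn, so the sketch below implements its two alternating update steps directly in NumPy; the parameter values (c clusters, fuzzifier m) are illustrative assumptions:

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    u = rng.rand(len(X), c)
    u /= u.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m
        # Update centres as the membership-weighted mean of the points
        centers = um.T @ X / um.sum(axis=0)[:, None]
        # Update memberships from the distances to the new centres
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u  # u[i, j] = degree to which point i belongs to cluster j

Unlike k-means, the returned membership matrix u assigns each point a degree of belonging to every cluster, which is what makes the method "soft".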
K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into clusters of roughly equal variance. The number of clusters must be specified in advance. It is fast, requiring relatively few computations, with linear complexity O(n).
Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
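A minimal mean-shift sketch with scikit-learn; the random data and the quantile used to estimate the bandwidth (the size of the region mentioned above) are illustrative:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.RandomState(0).randn(100, 2)
bw = estimate_bandwidth(X, quantile=0.3)  # radius of the region around each candidate
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)  # the density modes the candidates converged to

Note that the number of clusters is not specified; it emerges from how many modes the candidate centroids converge to.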
DBSCAN algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some notable advantages. In this algorithm, areas of high density are separated by areas of low density, so clusters can be found in any arbitrary shape.
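A minimal DBSCAN sketch with scikit-learn; eps and min_samples, which together define what counts as a "dense" region, are illustrative values:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(100, 2)
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(db.labels_)  # label -1 marks noise points that belong to no cluster

The -1 labels are one of the notable advantages mentioned above: points in low-density areas are explicitly flagged as noise instead of being forced into a cluster.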
Expectation-Maximization clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
Agglomerative hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. Each pair of data points exchanges messages until convergence. Its O(N²T) time complexity (for N points and T iterations) is the main drawback of this algorithm.
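A minimal sketch with scikit-learn's AffinityPropagation on illustrative data, showing that no cluster count is passed in:

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.RandomState(0).randn(60, 2)
ap = AffinityPropagation(random_state=0).fit(X)  # no n_clusters argument
print(len(ap.cluster_centers_indices_), "clusters found")
print(ap.labels_)  # cluster assignment for each point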
Applications of Clustering
Below are some commonly known applications of the clustering technique in Machine Learning:
Identification of cancer cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide cancerous and non-cancerous data points into different groups.
Search engines: Search engines also work on the clustering technique. Search results appear based on the objects closest to the search query; similar data objects are grouped together, far from dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
Customer segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
Biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
Land use: The clustering technique is used to identify areas of similar land use in a GIS database. This is very useful for determining the purpose for which a particular plot of land is most suitable.
Spectral Clustering
Spectral clustering is a variant of clustering that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a matrix derived from the data to project the data into a lower-dimensional space in which the points are clustered. It is based on a graph representation of the data, in which data points are represented as nodes and the similarity between data points is represented by weighted edges.
Building the similarity graph of the data: This step builds the similarity graph in the form of an adjacency matrix, represented by A. The adjacency matrix can be built in the following ways:
Epsilon-neighbourhood graph: A parameter epsilon is fixed beforehand. Each point is then connected to all points that lie within its epsilon radius. If all distances between pairs of points are on a similar scale, then the weights of the edges (i.e., the distances between the points) are typically not stored, since they provide no additional information. In this case, the graph built is undirected and unweighted.
K-nearest neighbours: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. Note that this leads to a weighted and directed graph, because it is not always the case that when u has v among its k-nearest neighbours, v also has u among its k-nearest neighbours. To make this graph undirected, one of the following approaches is used (a NumPy sketch of these constructions follows the next item):
Connect u and v with an undirected edge if v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
Connect u and v with an undirected edge if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v (the mutual k-NN graph).
Fully connected graph: To build this graph, each point is connected to every other point by an undirected edge weighted by the distance between the two points. Since this approach is used to model local neighbourhood relationships, the Gaussian similarity metric, w(u, v) = exp(−‖u − v‖² / (2σ²)), is typically used to compute the edge weights.
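The sketch below builds all three similarity graphs described above in NumPy; the data and the epsilon, k, and sigma values are illustrative assumptions:

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.RandomState(0).rand(20, 2)
D = cdist(X, X)  # pairwise Euclidean distances

# Epsilon-neighbourhood graph: undirected and unweighted
eps = 0.3
A_eps = (D <= eps).astype(float) - np.eye(len(X))  # drop self-loops

# K-nearest-neighbour graph, made undirected by the two rules above
k = 3
nn = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbours of each point
B = np.zeros_like(D)
B[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1
A_knn_or = np.maximum(B, B.T)   # OR rule: either direction suffices
A_knn_and = B * B.T             # AND rule: the mutual k-NN graph

# Fully connected graph with Gaussian similarity; sigma controls locality
sigma = 0.5
A_full = np.exp(-D ** 2 / (2 * sigma ** 2))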
1) Preprocessing: Construct the matrix representation of the graph (the adjacency matrix A and the graph Laplacian derived from it).
2) Decomposition: Compute the eigenvalues and eigenvectors of that matrix, and embed each data point in a lower-dimensional space using the eigenvectors.
3) Grouping: Assign the points to clusters, typically by running k-means on the embedded representation (a sketch of the full pipeline follows).
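Putting the three steps together, the following sketch performs (unnormalised) spectral clustering by hand; in practice sklearn.cluster.SpectralClustering wraps this whole pipeline, and the data and parameters here are illustrative:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(30, 2)

# 1) Preprocessing: similarity matrix A and graph Laplacian L = Deg - A
A = np.exp(-cdist(X, X) ** 2 / (2 * 0.5 ** 2))
L = np.diag(A.sum(axis=1)) - A

# 2) Decomposition: eigenvectors for the k smallest eigenvalues of L
k = 2
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :k]  # each row embeds one data point

# 3) Grouping: assign points to clusters by k-means on the embedding
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)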
What is subspace clustering?
Subspace clustering is a technique that finds clusters within different subspaces (a selection of one or more dimensions). The underlying assumption is that we can find valid clusters defined by only a subset of dimensions (agreement across all N features is not required). For example, if we take as input patient data recording gene expression levels (there can be more than 20,000 features), a cluster of patients suffering from Alzheimer's may be found by looking at the expression data of a subset of only 100 genes; stated differently, the cluster exists in a 100-dimensional subspace. Subspace clustering is thus an extension of traditional N-dimensional cluster analysis that allows features and observations to be grouped simultaneously by creating both row and column clusters. The resulting clusters may overlap in both the space of features and the space of observations.
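Classic subspace-clustering algorithms (e.g., CLIQUE, PROCLUS) are not available in scikit-learn, but the closely related idea of simultaneously clustering rows and columns is: the sketch below uses SpectralCoclustering on an illustrative observations-by-features matrix:

import numpy as np
from sklearn.cluster import SpectralCoclustering

X = np.random.RandomState(0).rand(20, 15)  # 20 observations, 15 features
model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)
print(model.row_labels_)     # a cluster per observation (row)
print(model.column_labels_)  # a cluster per feature (column)

Each bicluster pairs a subset of observations with the subset of features on which they agree, mirroring the row and column clusters described above.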
High-dimensional data consists of input having from a few dozen to many thousands of features (or dimensions). This context is typically encountered, for instance, in bioinformatics (all sorts of sequencing data) or in NLP, where the size of the vocabulary is very high. High-dimensional data is challenging because:
It makes visualization, and thus understanding of the input, difficult; it often requires applying a dimensionality reduction technique beforehand.
It leads to the 'curse of dimensionality', which means that the complete enumeration of all subspaces becomes intractable with increasing dimensionality.
Most underlying clustering techniques depend on the results and the choice of the dimensionality reduction technique.
Many dimensions may be irrelevant and can mask existing clusters in noisy data.
One common technique is to perform feature selection (removing irrelevant dimensions), but there are cases where identifying redundant dimensions is not easy.
Bottom-up approaches start by finding clusters in low-dimensional (1-D) spaces and iteratively merge them to process higher-dimensional spaces (up to N-D).
Top-down approaches find clusters in the full set of dimensions and then evaluate the subspace of each cluster.