Unit-5 Clustering (March 16, 24)
Unit-5 Clustering (March 16, 24)
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much similar as a human learns to think by their
own experiences, which makes it closer to the real AI.
o Unsupervised learning works on unlabeled and uncategorized data which
make unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding
output so to solve such cases, we need unsupervised learning.
Here, we have taken an unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to
the machine learning model in order to train it. Firstly, it will interpret the raw data
to find the hidden patterns from the data and then will apply suitable algorithms
such as k-means clustering, Decision tree, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into
groups according to the similarities and difference between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of
problems:
It does it by finding some similar patterns in the unlabelled dataset such as shape,
size, color, behavior, etc., and divides them as per the presence and absence of
those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a
cluster-ID. ML system can use this id to simplify the processing of large and
complex datasets.
Example: Let's understand the clustering technique with the real-world example
of Mall: When we visit any shopping mall, we can observe that the things with
similar usage are grouped together. Such as the t-shirts are grouped in one section,
and trousers are at other sections, similarly, at vegetable sections, apples, bananas,
Mangoes, etc., are grouped in separate sections, so that we can easily find out the
things. The clustering technique also works in the same way. Other examples of
clustering are grouping documents according to the topic.
The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by the Amazon in its recommendation
system to provide the recommendations as per the past search of
products. Netflix also uses this technique to recommend the movies and web-
series to its users as per the watch history.
The below diagram explains the working of the clustering algorithm. We can see
the different fruits are divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (datapoint
belongs to only one group) and Soft Clustering (data points can belong to another
group also). But there are also other various approaches of Clustering exist. Below
are the main clustering methods used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define
the number of pre-defined groups. The cluster center is created in such a way that
the distance between the data points of one cluster is minimum as compared to
another cluster centroid.
Partitioning clustering is further subdivided into:
➢ K-Means clustering
➢ Fuzzy C-Means clustering
Here, the features or characteristics are compared, and all objects having similar
characteristics are clustered together.
• Fuzzy c-means is very similar to k-means in the sense that it clusters objects
that have similar characteristics together. In k-means clustering, a single
object cannot belong to two different clusters. But in c-means, objects can
belong to more than one cluster, as shown.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters,
and the arbitrarily shaped distributions are formed as long as the dense region can
be connected. This algorithm does it by identifying different clusters in the dataset
and connects the areas of high densities into clusters. The dense areas in data space
are divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is
done by assuming some distributions commonly Gaussian Distribution.
Hierarchical Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to
more than one group or cluster. Each dataset has a set of membership coefficients,
which depend on the degree of membership to be in a cluster. Fuzzy C-means
algorithm is the example of this type of clustering; it is sometimes also known as
the Fuzzy k-means algorithm.
Clustering Algorithms
The Clustering algorithms can be divided based on their models that are explained
above. There are different types of clustering algorithms published, but only a few
are commonly used. The clustering algorithm is based on the kind of data that we
are using. Such as, some algorithms need to guess the number of clusters in the
given dataset, whereas some are required to find the minimum distance between
the observation of the dataset.
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchal clustering
o Anomaly detection
o Neural Networks
o Principle Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Here we are discussing mainly popular Clustering algorithms that are widely used
in machine learning:
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely
used for the identification of cancerous cells. It divides the cancerous and
non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique.
The search result appears based on the closest object to the search query. It
does it by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of
the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used in identifying the area of
similar lands use in the GIS database. This can be very useful to find that for
what purpose the particular land should be used, that means for which
purpose it is more suitable.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and
this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they both differ depending on how they work. As there is no
requirement to predetermine the number of clusters as we did in the K-Means
algorithm.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster left. So, we will get the
following clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
2. Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single-linkage.
From the above-given approaches, we can apply any of them according to the type
of problem or business requirement.
The dendrogram is a tree-like structure that is mainly used to store each step as a
memory that the HC algorithm performs. In the dendrogram plot, the Y-axis shows
the Euclidean distances between the data points, and the x-axis shows all the data
points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.
We can cut the dendrogram tree structure at any level as per our requirement.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters
in such a way that each dataset belongs only one group that has similar properties.
It allows us to cluster the data into different groups and a convenient way to
discover the categories of groups in the unlabeled dataset on its own without the
need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-
number of clusters, and repeats the process until it does not find the best clusters.
The value of k should be predetermined in this algorithm.
Hence each cluster has datapoints with some commonalities, and it is away from
other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-2: Select random K points or centroids. (It can be other from the input
dataset).
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third steps, which means reassign each datapoint to the new
closest centroid of each cluster.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put
them into different clusters. It means here we will try to group these datasets
into two different clusters.
o We need to choose some random k points or centroid to form the cluster.
These points can be either the points from the dataset or any other point. So,
here we are selecting the below two points as k points, which are not the part
of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point
or centroid. We will compute it by applying some mathematics that we have
studied to calculate the distance between two points. So, we will draw a
median between both the centroids. Consider the below image:
From the above image, it is clear that points left side of the line is near to the K1
or blue centroid, and points to the right of the line are close to the yellow centroid.
Let's color them as blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by
choosing a new centroid. To choose the new centroids, we will compute the
center of gravity of these centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like
below image:
From the above image, we can see, one yellow point is on the left side of the line,
and two blue points are right to the line. So, these three points will be assigned to
new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding
new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so
the new centroids will be as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign
the data points. So, the image will be:
o We can see in the above image; there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below
As our model is ready, so we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image: