
Chapter 13

Unsupervised Learning: Clustering


Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses. In some pattern recognition problems, the training data
consists of a set of input vectors x without any corresponding target values.
The goal of this unsupervised machine learning technique is to find
similarities in the data and group similar data points together.

13.1 What is Clustering?


Unlike supervised learning, clustering is considered an unsupervised
learning method. “Clustering” is the process of grouping similar entities
together.

Clustering is one of the most common exploratory data analysis techniques,
used to get an intuition about the structure of the data. It can be defined as
the task of identifying subgroups in the data such that data points in the
same subgroup (cluster) are very similar, while data points in different
clusters are very different. In other words, we try to find homogeneous
subgroups within the data such that data points in each cluster are as
similar as possible according to a similarity measure such as Euclidean-
based distance or correlation-based distance. The decision of which
similarity measure to use is application-specific.

Example: A bank wants to give credit card offers to its customers.
Currently, it looks at the details of each customer and, based on this
information, decides which offer should be given to which customer. Now,
the bank can potentially have millions of customers. It does not make sense
to look at the details of each customer separately and then make a decision;
that is a manual process and would take a huge amount of time. So what can
the bank do? One option is to segment its customers into different groups. For
instance, the bank can group the customers based on their income: high,
average, and low. Now, instead of devising different strategies for individual
customers, the bank only has to make 3 strategies. This will reduce the effort
as well as the time.

13.2 Applications of Clustering


Clustering is a widely used technique in industry and has many
applications. It is used in almost every domain, ranging from
banking to recommendation engines, and from document clustering to image
segmentation.
 Customer Segmentation: One of the most common applications of
clustering is customer segmentation, and it isn't just limited to
banking. This strategy is used across functions, including telecom,
e-commerce, sports, advertising, sales, etc.
 Document Clustering: This is another common application of
clustering. Let's say you have multiple documents and you need to
cluster similar documents together. Clustering helps us group these
documents such that similar documents end up in the same clusters,
e.g. Google News.
 Image Segmentation: We can also use clustering to perform image
segmentation by clubbing similar pixels in the image together. We can
apply clustering to create groups of similar pixels, where each group
represents an individual object in the image.
 Recommendation Engines: Clustering can also be used in
recommendation engines. Let's say you want to recommend songs
to your friends. You can look at the songs liked by a person, then
use clustering to find similar songs, and finally recommend the
most similar ones.

13.3 Clustering Algorithms


Many algorithms have been developed to implement the clustering technique.
The two most popular and widely used algorithms are:
1. Hierarchical Clustering
2. K-means Clustering
13.4 Hierarchical Clustering
Hierarchical clustering starts by assigning each data point to its own
cluster. As the name suggests, it builds a hierarchy: in each step, it
combines the two nearest clusters and merges them into one. This
bottom-up approach is called agglomerative clustering, and its reverse
(one group splitting into many) is called divisive clustering.
1. Assign each data point to its own cluster.
2. Find the closest pair of clusters using Euclidean distance and merge
them into a single cluster.
3. Keep calculating the distance between the two nearest clusters and
combining them until all items are merged into a single cluster.
In this technique, you can decide the optimal number of clusters from the
resulting dendrogram: find the vertical lines that can be cut by a horizontal
line without intersecting any cluster merge while covering the maximum
distance; the number of lines cut gives the number of clusters. A minimal
code sketch follows.
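
Here is a minimal sketch of agglomerative clustering using SciPy, on a
small made-up 2-D dataset (the data and the choice of Ward linkage are
illustrative assumptions, not from the chapter):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two loose groups of 2-D points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Build the merge hierarchy bottom-up (agglomerative).
# Ward linkage merges the pair of clusters that least increases
# within-cluster variance; it uses Euclidean distance.
Z = linkage(X, method="ward")

# Cut the hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree used
# to pick the number of clusters (requires matplotlib).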

13.5 K-means Clustering


The K-means algorithm is an iterative algorithm that tries to partition the
dataset into K pre-defined, distinct, non-overlapping subgroups (clusters)
such that each data point belongs to only one group. It tries to make the
intra-cluster data points as similar as possible while also keeping the
clusters as different (far apart) as possible. It assigns data points to a
cluster such that the sum of the squared distances between the data points
and the cluster's centroid (the arithmetic mean of all the data points that
belong to that cluster) is at a minimum. The less variation we have within
clusters, the more homogeneous (similar) the data points are within the
same cluster.
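
Formally, if μ_j denotes the centroid (mean) of cluster C_j, the objective
being minimized (the within-cluster sum of squares, often called inertia)
can be written as

\[
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 .
\]

With this objective in mind, the K-means algorithm works as follows: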
1. It starts with K as input, which is how many clusters you want to
find, and places K centroids at random locations in your space.
2. Using the Euclidean distance between data points and centroids, it
assigns each data point to the cluster whose centroid is closest to it.
3. It recalculates each cluster center as the mean of the data points
assigned to it.
4. It repeats steps 2 and 3 until no further changes occur. (A minimal
sketch of these steps in code is shown after the list.)
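
The following is a minimal NumPy sketch of these four steps on made-up
data (the dataset, seed, and helper name are hypothetical; it assumes no
cluster ends up empty, which is fine for this illustration):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place K centroids at random locations
    # (here: K distinct points sampled from the data).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster whose
        # centroid is closest (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of
        # the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: stop when no further changes occur.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical usage with 8 made-up 2-D points:
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0],
              [1.0, 0.6], [9.0, 11.0], [8.0, 2.0], [10.0, 2.0]])
labels, centroids = kmeans(X, k=2)
print(labels)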

Example:
Let’s now take an example to understand how K-Means actually works:
We have these 8 points and we want to apply k-means to create clusters for
these points. Here’s how we can do it.

Step 1: Choose the number of clusters k.


Let the value of k=2

Step 2: Select k random points from the data as centroids


Next, we randomly select a centroid for each cluster. Since we want to
have 2 clusters, k is equal to 2 here, so we randomly select two
centroids:

Here, the red and green circles represent the centroids of these clusters.
Step 3: Assign all the points to the closest cluster centroid
Once we have initialized the centroids, we assign each point to the closest
cluster centroid:

Here you can see that the points which are closer to the red point are
assigned to the red cluster whereas the points which are closer to the green
point are assigned to the green cluster.

Step 4: Re-compute the centroids of newly formed clusters


Now, once we have assigned all of the points to one of the two clusters, the
next step is to compute the centroids of the newly formed clusters:

Here, the red and green crosses are the new centroids.

Step 5: Repeat steps 3 and 4
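
We keep repeating steps 3 and 4 until the centroids stop changing (the
stopping criteria are discussed next). The same procedure can be reproduced
with scikit-learn's KMeans; the 8 points below are hypothetical stand-ins
for the ones shown in the figures:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],
              [8, 8], [9, 8], [8, 9], [9, 9]])

# Step 1: k = 2. Steps 2-5 (random centroids, assignment,
# re-computation, repetition) all happen inside fit().
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # final centroids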

13.6 Stopping Criteria for K-Means Clustering


There are essentially three stopping criteria that can be adopted to stop the
K-means algorithm:
 Centroids of newly formed clusters do not change
 Points remain in the same cluster
 Maximum number of iterations is reached

We can stop the algorithm if the centroids of newly formed clusters are not
changing. Even after multiple iterations, if we are getting the same
centroids for all the clusters, we can say that the algorithm is not learning
any new pattern and it is a sign to stop the training.

Another clear sign that we should stop the training process is when the
points remain in the same cluster even after training the algorithm for
multiple iterations.

Finally, we can stop the training if the maximum number of iterations is
reached. Suppose we have set the number of iterations to 100; the process
will then repeat for at most 100 iterations before stopping.
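
In scikit-learn's KMeans, these criteria map onto the max_iter and tol
parameters; here is a minimal sketch on made-up data (the parameter values
are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))

# max_iter caps the number of assignment/update rounds; tol stops
# the run early once the centroids move less than this threshold
# between iterations (i.e. they have converged).
km = KMeans(n_clusters=3, max_iter=100, tol=1e-4, n_init=10,
            random_state=0).fit(X)
print(km.n_iter_)  # iterations actually run before stopping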

13.7 How do I decide the value of K in the first step?


One of the methods, called the "Elbow" method, can be used to decide an
optimal number of clusters. Here you would run K-means clustering on a
range of K values and plot the "percentage of variance explained" on the Y-
axis and K on the X-axis.

In a typical elbow plot you would notice that adding more clusters after a
certain point (say, 3) does not give much better modeling of the data. The
first few clusters add much information, but at some point the marginal
gain starts dropping; the K at that "elbow" is a good choice. A sketch of
this procedure is shown below.
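
A minimal sketch of the elbow method on made-up data, plotting
scikit-learn's inertia_ (the within-cluster sum of squares, which falls as
K grows) instead of the rising "variance explained" curve; both show the
same elbow:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical data: three Gaussian blobs, so the elbow lands near 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Run K-means over a range of K and record the within-cluster
# sum of squares (sklearn exposes it as inertia_).
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()  # pick K at the 'elbow' where the curve flattens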
