UNIT - 3 - Clustering
Clustering is the task of dividing a population or set of data points into groups such that the
data points in one group are more similar to each other than to the data points in other groups.
In simple words, the aim is to separate data with similar traits and assign them to clusters.
Clustering is used to identify groups of similar objects in datasets with two or more variables.
K – Means Algorithm:
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset
into different clusters. Here K is the number of clusters to be created, fixed in advance:
if K=2, there will be two clusters, for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a
way that each data point belongs to only one group, whose members have similar properties.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters,
and repeats the process until it finds the best clusters, i.e., until the assignments stop
changing. The value of K must be chosen before the algorithm runs.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a
particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is separated from the other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take the number of clusters as K=2. It means here we will try to group the data
points into two different clusters.
o We need to choose K random points as initial centroids to form the clusters. These points
can be either points from the dataset or any other points. So, here we are selecting the
below two points as K-points, which are not part of our dataset. Consider the below
image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We
compute this with the usual formula for the distance between two points. With two centroids,
this is the same as drawing a median line between them (the perpendicular bisector of the
segment joining the two centroids). Consider the below image:
From the above image, it is clear that points to the left of the line are nearer to K1, the blue
centroid, and points to the right of the line are nearer to the yellow centroid. Let's color them
blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process with new centroids.
To choose the new centroids, we compute the center of gravity (the mean) of the data points
currently assigned to each cluster, and take these as the new centroids, as below:
Next, we will reassign each data point to its nearest new centroid. For this, we will repeat the
same process of finding a median line. The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two
blue points are to the right of the line. So, these three points will be assigned to the other
centroid.
As reassignment has taken place, we will again go back to the step of finding new
centroids or K-points.
o We will repeat the process by finding the center of gravity of the points in each cluster,
so the new centroids will be as shown in the below image:
As we have new centroids, we will again draw the median line and reassign the data points. So,
the image will be:
We can see in the above image that no data point lies on the wrong side of the line, so no
assignment changes, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:
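The walkthrough above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used later in these notes: the function name k_means, the toy points, and the starting centroids are all assumptions made up for the example.

```python
import numpy as np

def k_means(points, initial_centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes cluster (the "model is formed" condition).
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # leave a centroid in place if its cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two visibly separate groups of three points each.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
# Starting centroids chosen outside the dataset, as in the walkthrough (K=2).
initial_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels, centroids = k_means(points, initial_centroids)
print(labels)
print(centroids)
```

The loop stops as soon as an assignment step leaves every label unchanged, which corresponds to the point in the walkthrough where no data point is on the wrong side of the median line.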
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the clusters that
it forms, and choosing the optimal number of clusters is a difficult task. There are several
ways to find the optimal number of clusters, but here we discuss one widely used
method to find the number of clusters or value of K. The method is given below:
Elbow Method:
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares,
which defines the total variations within a cluster. The formula to calculate the value of WCSS
(for 3 clusters) is given below:
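$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \mathrm{distance}(P_i, C_3)^2$$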
Since the plot of WCSS against the number of clusters shows a sharp bend that looks like an
elbow, it is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose
the number of clusters equal to the data points, then the value of WCSS becomes zero, and
that will be the endpoint of the plot.
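As a small check of the WCSS definition and of the note above, the sum can be computed directly with NumPy. This is a sketch under assumptions: the helper name wcss and the four toy points are hypothetical and not part of the mall dataset.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its own cluster."""
    diffs = points - centroids[labels]  # centroids[labels] picks each point's centroid
    return float((diffs ** 2).sum())

# Hypothetical toy data: two clusters of two points each.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
two_cluster_wcss = wcss(points, labels, centroids)
print(two_cluster_wcss)  # every point is at distance 1 from its centroid

# One cluster per data point: every point is its own centroid, so WCSS is zero.
own_labels = np.arange(len(points))
per_point_wcss = wcss(points, own_labels, points)
print(per_point_wcss)
```

The second call illustrates why the elbow plot ends at zero when the number of clusters equals the number of data points.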
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending
Score (a calculated value of how much a customer has spent in the mall; the higher the
value, the more the customer has spent). From this dataset, we need to find some patterns; as it
is an unsupervised method, we don't know in advance exactly what to look for.
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
The first step is data pre-processing, as we did in our earlier topics of Regression and
Classification. But for the clustering problem, it differs from the other models. Let's
discuss it:
o Importing Libraries
As we did in previous topics, we first import the libraries for our model, which is
part of data pre-processing. The code is given below:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for performing mathematical calculations,
matplotlib is for plotting the graph, and pandas is for managing the dataset.
o Importing the Dataset:
Next, we will import the dataset that we need to use. Here, we are using the
Mall_Customers_data.csv dataset. It can be imported using the below code:
dataset = pd.read_csv('Mall_Customers_data.csv')
Here we don't need a dependent variable in the data pre-processing step, as it is a clustering
problem and we have no target to predict. So we will just add a line of code for the
matrix of features.
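A possible form of that line is sketched below. It assumes the columns appear in the order listed earlier (Customer_Id, Gender, Age, Annual Income, Spending Score), so that positions 3 and 4 hold the two features plotted later. The small DataFrame is a hypothetical stand-in with illustrative values so the example runs without the CSV file; the column names are also assumptions.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the mall dataset (illustrative values only).
dataset = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Female"],
    "Age": [19, 21, 20, 23],
    "Annual Income (k$)": [15, 15, 16, 16],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

# Matrix of features: keep only annual income and spending score (assumed positions 3 and 4).
x = dataset.iloc[:, [3, 4]].values
print(x.shape)
print(x)
```

Only two features are kept so that the clusters can be drawn on a 2-D scatter plot; with the real file, dataset would come from pd.read_csv instead.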
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem.
So, as discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS
values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the
value for WCSS for different k values ranging from 1 to 10. Below is the code for it:
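A sketch of what this code might look like is given here. To keep it runnable without the CSV file, x is replaced by synthetic points drawn around five made-up centers; that data, the init='k-means++' setting, the explicit n_init=10, and random_state=42 are assumptions, not values confirmed by these notes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the matrix of features x: 5 groups of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
x = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])

# Compute WCSS for k = 1 .. 10; scikit-learn exposes it as the inertia_ attribute.
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

print(wcss_list)
```

Plotting range(1, 11) against wcss_list, for example with mtp.plot followed by mtp.title, mtp.xlabel, mtp.ylabel, and mtp.show, gives the elbow curve described below.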
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to
form the clusters.
Next, we have created the wcss_list variable and initialized it as an empty list, which holds
the value of WCSS computed for each value of k from 1 to 10.
After that, we have written a for loop that iterates over the values of k from 1 to 10; since
range() in Python excludes the upper bound, it is given as 11 so that the 10th value is
included.
The rest of the code is similar to what we did in earlier topics: we have fitted the model on the
matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be
5.
Step-3: Training the K-means algorithm on the training dataset
As we now have the number of clusters, we can train the model on the dataset.
To train the model, we will use the same two lines of code as in the above section,
but here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed.
The code is given below:
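A hedged sketch of these two lines follows. As before, x is a synthetic stand-in for the matrix of features, and init='k-means++', n_init=10, and random_state=42 are assumptions; the last two statements are the two lines the text refers to.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the matrix of features x: 5 groups of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
x = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])

# Create the KMeans object with the number of clusters found by the elbow method.
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
# Fit the model and get the cluster label (0 to 4) of every row in one call.
y_predict = kmeans.fit_predict(x)

print(y_predict[:10])
print(kmeans.cluster_centers_)
```

fit_predict both trains the model and returns the label of each row, so no separate predict call is needed on the training data.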
The first line is the same as above, creating the object of the KMeans class.
In the second line of code, we have created the variable y_predict, which holds the cluster
label that the trained model assigns to each row.
By executing the above lines of code, we will get the y_predict variable. We can check it
under the variable explorer option in the Spyder IDE. We can now compare the values of
y_predict with our original dataset. Consider the below image:
From the above image, we can see that CustomerID 1 belongs to cluster 3
(as the index starts from 0, label 2 is counted as cluster 3), CustomerID 2 belongs to
cluster 4, and so on.
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize
each cluster one by one.
To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of
matplotlib.
# Visualizing the clusters: one scatter call per cluster label, then the centroids
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid') # cluster centers
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the above lines of code, we have written one scatter call for each cluster, from 1 to 5. The
first argument of mtp.scatter, i.e., x[y_predict == 0, 0], contains the x values (the first column
of the matrix of features) of the rows assigned to cluster 0. The labels in y_predict range from
0 to 4, while the column index takes the values 0 and 1.
Output:
The output image clearly shows the five different clusters in different colors. The clusters
are formed between two parameters of the dataset: Annual Income and Spending Score.
We can change the colors and labels as required. We can also observe some
points from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending so we can
categorize these customers as
o Cluster2 shows the customers with high income but low spending, so we can categorize
them as careful.
o Cluster3 shows the customers with low income and also low spending, so they can be
categorized as sensible.
o Cluster4 shows the customers with low income and very high spending, so they can be
categorized as careless.
o Cluster5 shows the customers with high income and high spending, so they can be
categorized as target; these customers can be the most profitable customers for the
mall owner.
https://www.youtube.com/watch?v=P2KZisgs4A4
Hierarchical Clustering:
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall
into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data
point is first treated as a single cluster, and pairs of clusters are then successively merged or
agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram
or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all
the data points are treated as one big cluster and the process of clustering involves dividing
(top-down approach) the one big cluster into various small clusters.
The hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped
structure is known as the dendrogram.
Simply speaking, hierarchical clustering is about separating data into groups based on some
measure of similarity, finding a technique to quantify how they are alike and different, and
narrowing down the data.
Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters.
Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram and cut it to divide the clusters as per the problem.
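Steps 1 to 4 can be sketched as a naive loop in NumPy. The notes do not say how the distance between two clusters is measured, so this example assumes single linkage (the smallest distance between any two members); the function name agglomerative_merges and the toy points are hypothetical.

```python
import numpy as np

def agglomerative_merges(points):
    """Naive single-linkage agglomerative clustering; returns the merge history."""
    points = np.asarray(points, dtype=float)
    # Step 1: every point starts as its own cluster (stored as a list of point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Steps 2-4: keep merging the two closest clusters until one cluster is left.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): smallest distance between any pair of members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return merges

# Hypothetical toy data: two tight pairs and one far-away point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])
merges = agglomerative_merges(points)
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

The merge history (which clusters were joined, and at what distance) is exactly the information a dendrogram draws; N points always produce N-1 merges.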
1. Easy to understand: hierarchical clustering doesn't use any complex methods that are
too hard to understand; instead, it uses simple methods that can be easily understood by
anyone regardless of their familiarity with the topic.
For More :
https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
Partitioning Clustering:
This clustering method classifies the data into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the number of
clusters that have to be generated.
In the partitioning method, given a database (D) that contains multiple (N) objects, the
method constructs a user-specified number (K) of partitions of the data, in which each partition
represents a cluster and a particular region. Many algorithms come under the
partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA
(Clustering Large Applications).
Partitional clustering (or partitioning clustering) refers to clustering methods used to classify
observations, within a data set, into multiple groups based on their similarity. The algorithms
require the analyst to specify the number of clusters to be generated.
The main drawback of this approach is that whenever a point is close to the center of another
cluster, it gives poor results due to overlapping of data points [3]. There are many methods of
partitioning clustering, including K-Means, the Bisecting K-Means method, the Medoids method,
and PAM (Partitioning Around Medoids).
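To show how a medoid-based method differs from K-Means, here is a minimal K-Medoids sketch in NumPy. It uses a simple alternating scheme (assign points, then pick the best medoid inside each cluster) rather than the full swap search of PAM, and the function name k_medoids, the toy points, and the starting medoids are assumptions made up for the example.

```python
import numpy as np

def k_medoids(points, initial_medoids, max_iter=100):
    """Simplified alternating k-medoids (not the full PAM swap procedure)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = []
        for j in range(len(medoids)):
            members = np.flatnonzero(labels == j)
            if members.size == 0:  # keep the old medoid if its cluster is empty
                new_medoids.append(medoids[j])
                continue
            # Update step: the member with the smallest total distance to the others.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[costs.argmin()]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    # Final labels for the medoids that are returned.
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids

# Hypothetical toy data and starting medoids (given as row indices into points).
points = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                   [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
labels, medoids = k_medoids(points, initial_medoids=[1, 4])
print(labels)
print(medoids)
```

Unlike a K-Means centroid, a medoid is always one of the actual data points, which is why the function returns row indices rather than coordinates.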
For More:
https://medium.com/analytics-vidhya/partitional-clustering-181d42049670