Lecture 18: K-Means Clustering

CLUSTERING

Clustering is a type of unsupervised learning method.

An unsupervised learning method is one in which we draw inferences from datasets consisting of input data without labeled responses.

Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
DEFINITION: CLUSTERING

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups.

Clustering is important because it reveals the intrinsic grouping present in the unlabelled data.
◾ K-means clustering
◾ Hierarchical clustering

K-MEANS CLUSTERING
◾ Let us assume that we have a dataset.
◾ The scatter plot is shown in the figure.
◾ We want to find the clusters in the data.
◾ At first glance, we can see that there are three clusters in the data.
STEPS FOR K-MEANS CLUSTERING
◾ Can you visually identify the number of clusters in this dataset? (not very easy!)
◾ Let us assume that we have identified the optimal number of clusters to be 2.
◾ Let us assume that we select the red and blue points as the centroids.
◾ We know from geometry that the points on the green line are equidistant from the red and blue centroids.
◾ Now it becomes clear which points will belong to cluster 1 and which to cluster 2.
◾ "Closest centroid" is a relative term.
◾ We are using Euclidean distance here.
◾ In other scenarios, other distance measures may be more appropriate.
◾ Compute the new centroid of each cluster by taking the average (center of gravity) of all the points in that cluster (excluding the old centroid itself).
◾ New centroids have been assigned.
◾ So, we again plot the equidistant line through the scatter plot.
◾ We can see that three data points are in the wrong cluster.
◾ Now, we will recolor those three points to assign them to the correct cluster.
◾ Since some reassignment has taken place, we go back to step 4.
◾ Compute the center of gravity for the new clusters.
◾ The new centroids have been assigned.
◾ We again draw the line to check whether any data points are in the wrong cluster.
◾ We see that there is only one point in the wrong cluster.
◾ The point has been reassigned to the blue cluster.
◾ Next, we need to recompute the centroids.
◾ The centroids have been relocated.
◾ Now only one point needs to be reassigned.
◾ The data point gets reassigned.
◾ Compute the new centroids for the clusters.
◾ This time we do not need to reassign any data points.
◾ The algorithm has converged (a minimal sketch of the complete loop is given below).
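Putting these steps together, here is a minimal sketch of the assign-then-update loop in Python with NumPy. The names `X` (data matrix) and `k` are placeholders, not from the slides, and the sketch assumes every cluster keeps at least one point.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: new centroid = center of gravity of the points in the cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: no centroid moved, so no point would be reassigned
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```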
◾ From the initial data points to the clustered output.
◾ Right away we can tell which three clusters will be formed.
◾ Even if we move the centroids around a little, nothing is going to change.
◾ These are the clusters we are going to end up with.
◾ Again, we will go through the steps of k-means clustering.
◾ This time the initial random selection of centroids is not as good as before.
◾ The three clusters will be formed as follows.
◾ Recompute the centroids.
◾ Now, no data point will be reassigned.
◾ The algorithm has converged.
K-MEANS SOLUTION 2 vs. K-MEANS SOLUTION 1

◾ The clusters formed are different depending on the initial centroids.
◾ What is the solution to this random initialization problem (trap)?
◾ The solution is the k-means++ algorithm.
◾ The Python library we use takes care of this and implements the k-means++ algorithm.
Drawback of the standard K-means algorithm:
One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids (the mean points).
If a centroid is initialized to a "far-off" point, it might end up with no points associated with it, while more than one cluster might end up linked to a single centroid.
Similarly, more than one centroid might be initialized into the same cluster, resulting in poor clustering. For example, consider the figure below, where a poor initialization of centroids resulted in poor clustering.
k-means++

To overcome the above-mentioned drawback, we use K-means++.
This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering.
Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm.
That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.

1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its distance from the nearest previously chosen centroid (i.e., the point having the maximum distance from its nearest centroid is the most likely to be selected next as a centroid).
4. Repeat steps 2 and 3 until k centroids have been sampled (see the sketch below).
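Below is a minimal sketch of this initialization in Python with NumPy; `X` and `k` are placeholder names. The sketch follows the slides and weights points by their distance; note that scikit-learn's k-means++ uses the squared distance instead.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of the k-means++ initialization described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select the first centroid from the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: distance of each point to its nearest already-chosen centroid
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1,
        )
        # Step 3: pick the next centroid with probability proportional to that
        # distance (scikit-learn weights by the squared distance instead)
        probs = d / d.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```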
◾ We will now learn how to decide the correct number of clusters.
◾ Let us assume we have the following dataset.
◾ If we run the k-means clustering algorithm with k=3, the results are shown below.
◾ We need a metric to identify whether a certain number of clusters provides an optimal solution for a dataset.
◾ Preferably, that metric should be quantifiable.
◾ The metric is called the within-cluster sum of squares (WCSS).
◾ For three clusters, WCSS = Σ_{Pi in cluster 1} dist(Pi, C1)² + Σ_{Pi in cluster 2} dist(Pi, C2)² + Σ_{Pi in cluster 3} dist(Pi, C3)²
◾ where C1, C2 and C3 are the centroids of cluster 1, cluster 2 and cluster 3, respectively, and Pi is the ith data point in the respective cluster.
◾ The WCSS is a good metric for comparing the solutions obtained using different values of k for the k-means clustering algorithm (a small sketch for computing it is shown below).
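As a sketch, the WCSS of a given clustering can be computed directly from the points, cluster labels, and centroids; the names here are illustrative, not from the slides.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: for each cluster, sum the squared
    Euclidean distances from its points to its centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        diffs = X[labels == j] - c
        total += np.sum(diffs ** 2)
    return total
```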
◾ Let us see how the WCSS metric changes with different values of k.
◾ Here is the solution with k=1.
◾ When we compute the WCSS, we get quite a large value, since the single centroid is far from many of the data points Pi; consequently, the distances between the centroid and the data points are large.
◾ Let us now increase the number of clusters to 2 and see how the WCSS changes.
◾ Since we now have two centroids, the distances are computed within each cluster and do not need to reach all the way to the middle of the whole dataset.
◾ We can see the value of WCSS will decrease compared to when we had only one centroid, i.e., one cluster.
◾ Now we increase the number of clusters to 3.
◾ There is no change in cluster 1, so there is no change in the distances for cluster 1.
◾ The distances in cluster 2 and cluster 3 will decrease compared to when there were only two clusters.
◾ What is the upper limit on the number of clusters in the K-means algorithm?
◾ The maximum number of clusters can be equal to the number of data points.
◾ If we reach the maximum number of clusters, the WCSS reaches a value of zero (each point is its own centroid).
◾ The chart above shows the WCSS value as k, i.e., the number of clusters, increases.
◾ We can see that the WCSS starts off with a high value and then decreases substantially as k increases.
◾ For example, when k increases from 1 to 2, the decrease on the y-axis is 8000 - 3000 = 5000 units.
◾ When k increases from 2 to 3, the decrease on the y-axis is 3000 - 1000 = 2000 units.
◾ When k increases from 3 to 4, the decrease on the y-axis is 1000 - 700 = 300 units.
◾ We can see that the change in WCSS was large in the beginning and small at the end.
◾ So, we can use the elbow method to determine the optimal number of clusters, i.e., look for the point where the curve bends sharply and further increases in k yield only small decreases in WCSS.
Evaluating the Clustering Algorithm

There are three commonly used evaluation metrics:
◾ Silhouette score
◾ Calinski-Harabasz index
◾ Davies-Bouldin index
Silhouette Score

Silhouette analysis can be used to study the separation distance between the clusters formed by the algorithm.
The Silhouette Coefficient is calculated for each sample from the mean intra-cluster distance and the mean nearest-cluster distance.
The Silhouette Coefficient ranges over [-1, 1].
The higher the Silhouette Coefficient (the closer to +1), the better the separation between clusters.
A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, whereas a negative value indicates that the sample might have been assigned to the wrong cluster.
s = (nc - ic) / max(ic, nc)
where,
ic = mean intra-cluster distance of the sample
nc = mean nearest-cluster distance of the sample
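As a small self-contained sketch (synthetic data via make_blobs; all names are illustrative), the silhouette score can be computed with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, init='k-means++', random_state=42).fit_predict(X)

# Mean silhouette coefficient over all samples; closer to +1 means better separation
print(silhouette_score(X, labels))
```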
Calinski-Harabasz Index

The Calinski-Harabasz index is based on the principle of the variance ratio. The ratio is calculated between two quantities: within-cluster dispersion and between-cluster dispersion. The higher the index, the better the clustering.
The formula used is
CH(k) = [B(k) / W(k)] × [(n − k) / (k − 1)]
where,
n = number of data points
k = number of clusters
W(k) = within-cluster variation
B(k) = between-cluster variation.
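A brief sketch of this metric with scikit-learn, reusing the `X` and `labels` from the silhouette sketch above:

```python
from sklearn.metrics import calinski_harabasz_score

# Ratio of between-cluster to within-cluster dispersion; higher is better
print(calinski_harabasz_score(X, labels))
```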
Davies-Bouldin Index

The Davies-Bouldin index is based on the principle of within-cluster and between-cluster distances. It is commonly used for deciding the number of clusters into which the data points should be grouped.
It differs from the other two metrics in that a smaller value of this index is better, so the main motive is to decrease the DB index.
The formula used to calculate the DB index is
DB(C) = (1/C) Σ_{i=1..k} max_{j≤k, j≠i} Dij
Dij = (di + dj) / dij
where,
Dij = within-to-between cluster distance ratio for the ith and jth clusters
di, dj = average distances of the points in clusters i and j to their respective centroids
dij = distance between the centroids of clusters i and j
C = number of clusters
i, j = indices of clusters which come from the same partitioning.
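Likewise, a brief sketch of the Davies-Bouldin index with scikit-learn on the same `X` and `labels`; here a smaller value is better:

```python
from sklearn.metrics import davies_bouldin_score

# Average worst-case within-to-between cluster distance ratio; lower is better
print(davies_bouldin_score(X, labels))
```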
PYTHON IMPLEMENTATION
PROBLEM
◾ The dataset was provided by the strategy team of a mall.
◾ The information in the dataset is:
◾ Customer ID
◾ Gender
◾ Age
◾ Annual Income
◾ Spending Score (1-100) (a lower score represents less spending and a higher score represents more spending)
◾ The goal is to identify patterns within the customers.
◾ This is unsupervised learning, so there is no target variable to predict.
◾ Instead, we will create a dependent variable (the cluster number), which will represent the class of each customer based on the independent variables.
STEPS FOR IMPLEMENTATION

◾ Importing the libraries
◾ Importing the dataset
◾ Using the elbow method to find the optimal number of clusters
◾ Training the k-means model on the dataset
◾ Visualizing the clusters
IMPORTING THE LIBRARIES AND DATASET
◾ The Customer ID is of no importance to us, so we will discard it.
◾ All the other independent variables, i.e., Gender, Age, Annual Income, and Spending Score, are relevant to our problem.
◾ However, we need to visualize the results, which is only straightforward for a dataset with two independent variables.
◾ So, we select two independent variables, 'Annual Income' and 'Spending Score', as our independent variables of choice. A sketch of this step is shown below.
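A minimal sketch of this step, assuming the file is named 'Mall_Customers.csv' and that Annual Income and Spending Score are the 4th and 5th columns; adjust the filename and column indices to the actual dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the mall customers dataset (assumed filename)
dataset = pd.read_csv('Mall_Customers.csv')

# Keep only 'Annual Income' and 'Spending Score' (assumed to be columns 3 and 4)
X = dataset.iloc[:, [3, 4]].values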
USING THE ELBOW METHOD TO FIND THE OPTIMAL NUMBER OF CLUSTERS
◾ The elbow of the WCSS curve occurs at K = 5, so we choose 5 clusters (a sketch of this step is shown below).
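A sketch of the elbow computation with scikit-learn, continuing from the import step above; the WCSS is exposed by KMeans as `inertia_`.

```python
from sklearn.cluster import KMeans

# Compute the WCSS (inertia) for k = 1..10
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot WCSS vs. k and look for the elbow
plt.plot(range(1, 11), wcss, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.show()
```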
TRAINING THE K-MEANS MODEL ON THE DATASET
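A sketch of this step, continuing from above with the chosen k = 5:

```python
# Train K-means with the chosen number of clusters (k-means++ initialization)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)

# fit_predict returns the cluster label (0..4) for each customer,
# i.e., the dependent variable we set out to create
y_kmeans = kmeans.fit_predict(X)
```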
VISUALIZING THE CLUSTERS
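A sketch of the visualization, plotting each cluster in a different color along with the final centroids; the colors and labels are arbitrary choices.

```python
# Scatter plot of each cluster in a different color
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for j in range(5):
    plt.scatter(X[y_kmeans == j, 0], X[y_kmeans == j, 1],
                s=50, c=colors[j], label=f'Cluster {j + 1}')

# Plot the final centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', edgecolors='black', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```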
Evaluating the Clustering Algorithm
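To close the implementation, here is a sketch of how the three metrics introduced earlier could be applied to the clustering obtained above:

```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Higher is better for the first two; lower is better for Davies-Bouldin
print('Silhouette score: ', silhouette_score(X, y_kmeans))
print('Calinski-Harabasz:', calinski_harabasz_score(X, y_kmeans))
print('Davies-Bouldin:   ', davies_bouldin_score(X, y_kmeans))
```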
