Lecture 18: K-Means Clustering

CLUSTERING

Clustering is a type of unsupervised learning method.

An unsupervised learning method is one in which we draw inferences from datasets consisting of input data without labeled responses.

Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
DEFINITION: CLUSTERING

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups.

Clustering is important because it reveals the intrinsic grouping present in the unlabelled data.
◾ K-means clustering
◾ Hierarchical clustering

K-MEANS CLUSTERING
◾ Let us assume that we have a dataset.
◾ The scatter plot is shown in the figure.
◾ We want to find the clusters in the data.
◾ At first glance, we can see that there are three clusters in the data.
STEPS FOR K-MEANS CLUSTERING
◾ Can you visually identify the number of clusters in this dataset? (not very easy!)
◾ Let us assume that we have identified the optimal number of clusters to be 2.
◾ Let us assume that we select the red and blue points as the centroids.
◾ We know from geometry that the points on the green line are equidistant from the red and blue centroids.
◾ Now it becomes clear which points will belong to cluster 1 and which to cluster 2.
◾ "Closest centroid" is a relative term.
◾ We are using Euclidean distance here.
◾ In other scenarios, other distance measures may be more appropriate.
◾ Compute the new centroid of each cluster by taking the average (center of gravity) of all the points in that cluster (excluding the old centroid itself).
◾ New centroids have been assigned.
◾ So, we again plot the equidistant line through the scatter plot.
◾ We can see that three data points are in the wrong cluster.
◾ Now, we will recolor those three points to assign them to the correct cluster.
◾ Since some reassignment has taken place, we go back to step 4.
◾ Compute the center of gravity for the new clusters.
◾ The new centroids have been assigned.
◾ We again draw the line to check whether any data points are in the wrong cluster.
◾ We see that there is only one point in the wrong cluster.
◾ The point has been reassigned to the blue cluster.
◾ Next, we need to recompute the centroids.
◾ The centroids have been relocated.
◾ Now only one point needs to be reassigned.
◾ The data point gets reassigned.
◾ Compute the new centroids for the clusters.
◾ This time we do not need to reassign any data points.
◾ The algorithm has converged (a minimal sketch of the complete loop is given below).
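Putting these steps together, here is a minimal sketch of the assign-then-update loop in Python with NumPy. The names `X` (data matrix) and `k` are placeholders, not from the slides, and the sketch assumes every cluster keeps at least one point.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: new centroid = center of gravity of the points in the cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: no centroid moved, so no point would be reassigned
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```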
◾ From the initial data points to the clustered output.
◾ Right away we can tell which three clusters will be formed.
◾ Even if we move the centroids around a little, nothing is going to change.
◾ These are the clusters we are going to end up with.
◾ Again, we will go through the steps of k-means clustering.
◾ This time the initial random selection of centroids is not as good as before.
◾ The three clusters will be formed as follows.
◾ Recompute the centroids.
◾ Now, no data point will be reassigned.
◾ The algorithm has converged.
K-MEANS SOLUTION 2 vs. K-MEANS SOLUTION 1

◾ The clusters formed are different depending on the initial centroids.
◾ What is the solution to this random initialization problem (trap)?
◾ The solution is the k-means++ algorithm.
◾ The Python library we use takes care of this and implements the k-means++ algorithm.
Drawback of the standard K-means algorithm:
One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids (the mean points).
If a centroid is initialized to a "far-off" point, it might end up with no points associated with it, while more than one cluster might end up linked to a single centroid.
Similarly, more than one centroid might be initialized into the same cluster, resulting in poor clustering. For example, consider the figure below, where a poor initialization of centroids resulted in poor clustering.
k-means++

To overcome the above-mentioned drawback, we use K-means++.
This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering.
Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm.
That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.

1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its distance from the nearest previously chosen centroid (i.e., the point having the maximum distance from its nearest centroid is the most likely to be selected next as a centroid).
4. Repeat steps 2 and 3 until k centroids have been sampled (see the sketch below).
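Below is a minimal sketch of this initialization in Python with NumPy; `X` and `k` are placeholder names. The sketch follows the slides and weights points by their distance; note that scikit-learn's k-means++ uses the squared distance instead.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of the k-means++ initialization described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select the first centroid from the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: distance of each point to its nearest already-chosen centroid
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1,
        )
        # Step 3: pick the next centroid with probability proportional to that
        # distance (scikit-learn weights by the squared distance instead)
        probs = d / d.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```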
◾ We will now learn how to decide the correct number of clusters.
◾ Let us assume we have the following dataset.
◾ If we run the k-means clustering algorithm with k=3, the results are shown below.
◾ We need a metric to identify whether a certain number of clusters provides an optimal solution for a dataset.
◾ Preferably, that metric should be quantifiable.
◾ The metric is called the within-cluster sum of squares (WCSS).
◾ For three clusters, WCSS = Σ_{Pi in cluster 1} dist(Pi, C1)² + Σ_{Pi in cluster 2} dist(Pi, C2)² + Σ_{Pi in cluster 3} dist(Pi, C3)²
◾ where C1, C2 and C3 are the centroids of cluster 1, cluster 2 and cluster 3, respectively, and Pi is the ith data point in the respective cluster.
◾ The WCSS is a good metric for comparing the solutions obtained using different values of k for the k-means clustering algorithm (a small sketch for computing it is shown below).
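As a sketch, the WCSS of a given clustering can be computed directly from the points, cluster labels, and centroids; the names here are illustrative, not from the slides.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: for each cluster, sum the squared
    Euclidean distances from its points to its centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        diffs = X[labels == j] - c
        total += np.sum(diffs ** 2)
    return total
```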
◾ Let us see how the WCSS metric changes with different values of k.
◾ Here is the solution with k=1.
◾ When we compute the WCSS, we get quite a large value, since the single centroid is far from many of the data points Pi; consequently, the distances between the centroid and the data points are large.
◾ Let us now increase the number of clusters to 2 and see how the WCSS changes.
◾ Since we now have two centroids, the distances are computed within each cluster and do not need to reach all the way to the middle of the whole dataset.
◾ We can see the value of WCSS will decrease compared to when we had only one centroid, i.e., one cluster.
◾ Now we increase the number of clusters to 3.
◾ There is no change in cluster 1, so there is no change in the distances for cluster 1.
◾ The distances in cluster 2 and cluster 3 will decrease compared to when there were only two clusters.
◾ What is the upper limit on the number of clusters in the K-means algorithm?
◾ The maximum number of clusters can be equal to the number of data points.
◾ If we reach the maximum number of clusters, the WCSS reaches a value of zero (each point is its own centroid).
◾ The chart above shows the WCSS value as k, i.e., the number of clusters, increases.
◾ We can see that the WCSS starts off with a high value and then decreases substantially as k increases.
◾ For example, when k increases from 1 to 2, the decrease on the y-axis is 8000 - 3000 = 5000 units.
◾ When k increases from 2 to 3, the decrease on the y-axis is 3000 - 1000 = 2000 units.
◾ When k increases from 3 to 4, the decrease on the y-axis is 1000 - 700 = 300 units.
◾ We can see that the change in WCSS was large in the beginning and small at the end.
◾ So, we can use the elbow method to determine the optimal number of clusters, i.e., look for the point where the curve bends sharply and further increases in k yield only small decreases in WCSS.
Evaluating the Clustering Algorithm

There are three commonly used evaluation metrics:
◾ Silhouette score
◾ Calinski-Harabasz index
◾ Davies-Bouldin index
Silhouette Score

Silhouette analysis can be used to study the separation distance between the clusters formed by the algorithm.
The Silhouette Coefficient is calculated for each sample from the mean intra-cluster distance and the mean nearest-cluster distance.
The Silhouette Coefficient ranges over [-1, 1].
The higher the Silhouette Coefficient (the closer to +1), the better the separation between clusters.
A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, whereas a negative value indicates that the sample might have been assigned to the wrong cluster.
s = (nc - ic) / max(ic, nc)
where,
ic = mean intra-cluster distance of the sample
nc = mean nearest-cluster distance of the sample
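As a small self-contained sketch (synthetic data via make_blobs; all names are illustrative), the silhouette score can be computed with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, init='k-means++', random_state=42).fit_predict(X)

# Mean silhouette coefficient over all samples; closer to +1 means better separation
print(silhouette_score(X, labels))
```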
Calinski-Harabasz Index

The Calinski-Harabasz index is based on the principle of the variance ratio. The ratio is calculated between two quantities: within-cluster dispersion and between-cluster dispersion. The higher the index, the better the clustering.
The formula used is
CH(k) = [B(k) / W(k)] × [(n − k) / (k − 1)]
where,
n = number of data points
k = number of clusters
W(k) = within-cluster variation
B(k) = between-cluster variation.
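A brief sketch of this metric with scikit-learn, reusing the `X` and `labels` from the silhouette sketch above:

```python
from sklearn.metrics import calinski_harabasz_score

# Ratio of between-cluster to within-cluster dispersion; higher is better
print(calinski_harabasz_score(X, labels))
```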
Davies-Bouldin Index

The Davies-Bouldin index is based on the principle of within-cluster and between-cluster distances. It is commonly used for deciding the number of clusters into which the data points should be grouped.
It differs from the other two metrics in that a smaller value of this index is better, so the main motive is to decrease the DB index.
The formula used to calculate the DB index is
DB(C) = (1/C) Σ_{i=1..k} max_{j≤k, j≠i} Dij
Dij = (di + dj) / dij
where,
Dij = within-to-between cluster distance ratio for the ith and jth clusters
di, dj = average distances of the points in clusters i and j to their respective centroids
dij = distance between the centroids of clusters i and j
C = number of clusters
i, j = indices of clusters which come from the same partitioning.
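Likewise, a brief sketch of the Davies-Bouldin index with scikit-learn on the same `X` and `labels`; here a smaller value is better:

```python
from sklearn.metrics import davies_bouldin_score

# Average worst-case within-to-between cluster distance ratio; lower is better
print(davies_bouldin_score(X, labels))
```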
PYTHON IMPLEMENTATION
PROBLEM
◾ The dataset was provided by the strategy team of a mall.
◾ The information in the dataset is:
◾ Customer ID
◾ Gender
◾ Age
◾ Annual Income
◾ Spending Score (1-100) (a lower score represents less spending and a higher score represents more spending)
◾ The goal is to identify patterns within the customers.
◾ This is unsupervised learning, so there is no target variable to predict.
◾ Instead, we will create a dependent variable (the cluster number), which will represent the class of each customer based on the independent variables.
STEPS FOR IMPLEMENTATION

◾ Importing the libraries
◾ Importing the dataset
◾ Using the elbow method to find the optimal number of clusters
◾ Training the k-means model on the dataset
◾ Visualizing the clusters
IMPORTING THE LIBRARIES AND DATASET
◾ The Customer ID is of no importance to us, so we will discard it.
◾ All the other independent variables, i.e., Gender, Age, Annual Income, and Spending Score, are relevant to our problem.
◾ However, we need to visualize the results, which is only straightforward for a dataset with two independent variables.
◾ So, we select two independent variables, 'Annual Income' and 'Spending Score', as our independent variables of choice. A sketch of this step is shown below.
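A minimal sketch of this step, assuming the file is named 'Mall_Customers.csv' and that Annual Income and Spending Score are the 4th and 5th columns; adjust the filename and column indices to the actual dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the mall customers dataset (assumed filename)
dataset = pd.read_csv('Mall_Customers.csv')

# Keep only 'Annual Income' and 'Spending Score' (assumed to be columns 3 and 4)
X = dataset.iloc[:, [3, 4]].values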
USING THE ELBOW METHOD TO FIND THE OPTIMAL NUMBER OF CLUSTERS
◾ The elbow of the WCSS curve occurs at K = 5, so we choose 5 clusters (a sketch of this step is shown below).
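A sketch of the elbow computation with scikit-learn, continuing from the import step above; the WCSS is exposed by KMeans as `inertia_`.

```python
from sklearn.cluster import KMeans

# Compute the WCSS (inertia) for k = 1..10
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot WCSS vs. k and look for the elbow
plt.plot(range(1, 11), wcss, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.show()
```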
TRAINING THE K-MEANS MODEL ON THE DATASET
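A sketch of this step, continuing from above with the chosen k = 5:

```python
# Train K-means with the chosen number of clusters (k-means++ initialization)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)

# fit_predict returns the cluster label (0..4) for each customer,
# i.e., the dependent variable we set out to create
y_kmeans = kmeans.fit_predict(X)
```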
VISUALIZING THE CLUSTERS
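A sketch of the visualization, plotting each cluster in a different color along with the final centroids; the colors and labels are arbitrary choices.

```python
# Scatter plot of each cluster in a different color
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for j in range(5):
    plt.scatter(X[y_kmeans == j, 0], X[y_kmeans == j, 1],
                s=50, c=colors[j], label=f'Cluster {j + 1}')

# Plot the final centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', edgecolors='black', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```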
Evaluating the Clustering Algorithm
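To close the implementation, here is a sketch of how the three metrics introduced earlier could be applied to the clustering obtained above:

```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Higher is better for the first two; lower is better for Davies-Bouldin
print('Silhouette score: ', silhouette_score(X, y_kmeans))
print('Calinski-Harabasz:', calinski_harabasz_score(X, y_kmeans))
print('Davies-Bouldin:   ', davies_bouldin_score(X, y_kmeans))
```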
