K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number (k) of clusters. It works by assigning each data point to the cluster with the nearest mean and recalculating the means for each cluster iteratively until convergence is reached. Key steps include initializing k cluster centers, calculating distances between data points and centers, assigning points to the closest clusters, and updating cluster centers. The algorithm aims to minimize within-cluster variance, but its results depend on initialization, and it assumes spherical clusters of equal size and density.


K-means Clustering
• What is clustering?
• Why would we want to cluster?
• How would you determine clusters?
• How can you do this efficiently?
Clustering
• Unsupervised learning
– Requires data, but no labels
• Detect patterns, e.g., in
– Group emails or search results
– Customer shopping patterns
– Regions of images
• Useful when you don’t know what you’re looking for
• Basic idea: group together similar instances
• What could “similar” mean?
– One option: Euclidean distance
• Clustering results are crucially dependent on the
measure of similarity (or distance) between “points”
to be clustered
K-means Clustering
Basic Algorithm:
• Step 0: Select K.
• Step 1: Randomly select any K data points as the initial cluster centers.
• Step 2: Calculate the distance from each object to each cluster center.
• What type of distance should we use?
– Squared Euclidean distance, or another given distance function
K-means Clustering
• Step 3: Assign each object to the closest cluster
• Step 4: Compute the new centroid for each cluster
– The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
• Iterate steps 2 to 4:
– Calculate the distance from each object to each cluster centroid.
– Assign each object to its closest cluster.
– Recalculate the new centroids.
K-means Clustering
• Stop based on a convergence criterion (see the sketch below):
– Centers of the newly formed clusters do not change
– Data points remain in the same clusters
– The maximum number of iterations is reached
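
A minimal NumPy sketch of steps 0–4 and these stopping rules; the function name, the empty-cluster guard, and the default arguments are illustrative choices, not prescribed by the slides:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Plain K-means on an (n, d) float array X; returns (centers, labels)."""
        rng = np.random.default_rng(seed)
        # Step 1: randomly select k data points as the initial cluster centers
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iters):                   # stop: max iterations reached
            # Step 2: squared Euclidean distance from every point to every center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # Step 3: assign each object to the closest cluster
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                # stop: assignments unchanged
            labels = new_labels
            # Step 4: move each center to the mean of its assigned points
            for j in range(k):
                if np.any(labels == j):              # guard against empty clusters
                    centers[j] = X[labels == j].mean(axis=0)
        return centers, labels

    # usage: centers, labels = kmeans(np.random.rand(100, 2), k=3)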
Example 3
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
– Ρ(a, b) = |x2 – x1| + |y2 – y1|
• Use the K-Means algorithm to find the three cluster centers after the second iteration.
• Calculating the distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
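
These three distances are easy to check in a few lines of Python (a quick verification, not part of the original slides):

    # Manhattan distance between two points a and b
    def manhattan(a, b):
        return abs(b[0] - a[0]) + abs(b[1] - a[1])

    A1 = (2, 10)
    for name, c in [("C1", (2, 10)), ("C2", (5, 8)), ("C3", (1, 2))]:
        print(name, manhattan(A1, c))   # prints 0, 5, 9 -> A1 joins Cluster-01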
• In the same way, calculate the distance of every remaining point from each of the three centers and assign each point to its closest center. The new clusters are:

• Cluster-01: First cluster contains points:
– A1(2, 10)

• Cluster-02: Second cluster contains points:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)
– A8(4, 9)

• Cluster-03: Third cluster contains points:
– A2(2, 5)
– A7(1, 2)
• Re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.

• For Cluster-01:
– We have only one point A1(2, 10) in Cluster-01.
– So, the cluster center remains the same.

• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)

• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

• This completes Iteration-01.


Iteration-02
• Calculate the distance of each point from each of the three cluster centers.
• The distance is calculated using the given distance function.
• Calculating the distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(6, 6):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |6 – 2| + |6 – 10| = 4 + 4 = 8
• Calculating the distance between A1(2, 10) and C3(1.5, 3.5):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7
New clusters:

• Cluster-01: First cluster contains points:
– A1(2, 10)
– A8(4, 9)

• Cluster-02: Second cluster contains points:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)

• Cluster-03: Third cluster contains points:
– A2(2, 5)
– A7(1, 2)
New cluster centers:

• For Cluster-01:
– Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)

• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)

• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

• These are the three cluster centers after the second iteration.
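
The whole walkthrough can be reproduced in a short script; its output matches the centers derived above (a sketch with illustrative variable names):

    import numpy as np

    points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                       (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)  # A1..A8
    centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)        # A1, A4, A7

    for iteration in (1, 2):
        # Manhattan distance from every point to every center
        dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)   # assign each point to its closest center
        centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
        print(f"after iteration {iteration}:", centers.round(2).tolist())
    # after iteration 1: [[2.0, 10.0], [6.0, 6.0], [1.5, 3.5]]
    # after iteration 2: [[3.0, 9.5], [6.5, 5.25], [1.5, 3.5]]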
K-means Clustering
• Strengths
– Simple iterative method
– Guaranteed to converge in a finite number of iterations
– User provides “K”
– Running time per iteration:
• Assigning data points to the closest cluster center takes O(KN) time
• Updating each cluster center to the average of its assigned points takes O(N) time
• Weaknesses
– Often too simple → bad results
– Cannot handle noisy data and outliers
– Difficult to guess the correct “K”
– Not suitable for identifying clusters with varying sizes, different densities, or non-convex shapes
K-means Issues
• Distance measure is squared Euclidean
– Scale should be similar in all dimensions
• Rescale data?
– Not good for nominal data. Why?
• Approach tries to minimize the within-cluster sum of
squares error (WCSS) or inertia
– Implicit assumption that SSE is similar for each group
WCSS
• The overall WCSS is given by:
– WCSS = Σ_{i=1..k} Σ_{x ∈ Ci} ||x – μi||², where μi is the centroid (mean) of cluster Ci

• The goal is to find the cluster assignment with the smallest WCSS
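
For a given assignment, the WCSS can be computed with a small helper (a sketch, assuming centers and labels in the form returned by the kmeans sketch earlier):

    import numpy as np

    def wcss(X, centers, labels):
        """Sum of squared distances from each point to its cluster's centroid."""
        return sum(((X[labels == j] - c) ** 2).sum()
                   for j, c in enumerate(centers))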


• Does this depend on the initial seed values?
• Possibly.
• The figure shows two suboptimal solutions that the algorithm can converge to if you are not lucky with the random initialization step.
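
A common remedy is to run K-means several times from different random initializations and keep the run with the lowest WCSS. scikit-learn's KMeans does this through its n_init parameter; a minimal sketch on toy data:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, not from the slides

    # n_init=10 fits K-means from 10 random initializations and keeps
    # the solution with the lowest inertia (WCSS)
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.inertia_)   # WCSS of the best of the 10 runs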
Finding the optimal number of clusters

• The inertia is not a good performance metric when trying to


choose k because it keeps getting lower as we increase k.
• Indeed, the more clusters there are, the closer each instance
will be to its closest centroid, and therefore the lower the
inertia will be
• When plotting the inertia as a function of the number of clusters k, the curve often contains an inflexion point called the “elbow”: the curve has roughly the shape of an arm, and the “elbow” marks a reasonable choice of k (see the sketch below).
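
A typical elbow plot loops over candidate values of k and records each fit's inertia (a sketch using scikit-learn and matplotlib; the random data is a stand-in):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))   # stand-in data

    ks = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, "o-")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia (WCSS)")
    plt.show()   # look for the “elbow” where the curve stops dropping sharply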
• K-Means can fail to cluster ellipsoidal (elongated) blobs properly.
Image segmentation using K-Means with various numbers of color clusters
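
Color-based segmentation clusters the pixel colors and repaints every pixel with its cluster's centroid color. A sketch, where photo.png is a hypothetical input file:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    img = plt.imread("photo.png")[..., :3]   # hypothetical image; keep RGB, drop alpha
    pixels = img.reshape(-1, 3)              # one row per pixel

    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
    segmented = km.cluster_centers_[km.labels_].reshape(img.shape)  # mean colors

    plt.imshow(segmented)
    plt.show()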
Bottom Line
• K-means
– Easy to use
– Need to know K
– May need to scale data
– A good first method to try
• Local optima
– No guarantee of optimal solution
– Repeat with different starting values
