
k-Means clustering algorithm

Abdul Kader Sarah


Bogdanova Polina
Radovanov Angelina
Schoener Juliana
Tampir Martina
Agenda
- k-Means clustering algorithm: introduction
- NLP problems k-means can solve
- Learning and algorithm process behind it
- The k in k-means: what is it?
- Loss function
- Decision boundary
- k-Means inference
- Hyperparameters
- Pros & cons
- Improvement methods
- Quiz time!
Clustering
- Introduced in 1932 by H. E. Driver and A. L. Kroeber in their paper "Quantitative
Expression of Cultural Relationships"
- Grouping objects so that objects in the same group (cluster) are more
similar to each other than to objects in other groups (clusters)
- Based on similarities vs. dissimilarities
- Works on unlabeled data
- Unsupervised learning
k-Means clustering algorithm
- The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in
1957; the term "k-means" was first used by James MacQueen in 1967
- One of the most popular clustering algorithms
- Fast: runs in time linear in the number of data points, O(n) per iteration (for fixed k and dimension)
- Brief description:
1. Choose your number of clusters k
2. Place k centroids in your data
3. Assign each data point to the centroid nearest to it
4. Update the centroids and repeat
5. No more updates > you're done!
Example usage
- Dog snacks?
- Measured per dog: "good boy" score and run-for-a-ball speed
Example usage
- Collected data on a scatter plot
- Nice and slow
- Nice and fast
- Fast and not so nice > 3 groups
Example usage
- k-Means applied
- Data clustered
Example usage
- Dog snacks?
- Yes!
NLP problems & k-means
- Unsupervised document classification
- Automatically discover groups of similar documents within a collection of
documents > high-level overview of the information
- Works on numerical data only > word frequencies, embeddings or TF-IDF values
- Examples
– Fake news identification
– Topic modeling
– Get an overview of search results
– Sentiment analysis
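Document clustering can be sketched in a few lines: text is first turned into TF-IDF vectors, then clustered. The four toy documents and k = 2 below are illustrative assumptions, not taken from the slides.

```python
# Sketch: clustering a small toy document collection with TF-IDF + k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog chased the ball",
    "a fast dog runs for the ball",
    "the stock market fell sharply today",
    "investors sold shares as the stock market dropped",
]

# k-means needs numbers, so turn the text into TF-IDF vectors first.
X = TfidfVectorizer().fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # the two dog documents share one label,
                       # the two finance documents the other
```

The same pipeline scales to topic discovery or search-result grouping by swapping in a larger corpus and a suitable k.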
How does k-means learn?

- k-means groups data points into clusters based on their similarity and proximity
- Two alternating phases: Assignment and Update
- The algorithm finds these groups (clusters) and their centroids, so that the data points in each
group are as close to their centroid as possible
What is the algorithm process behind it?
- Steps of the algorithm:
- 1. First, decide the number of clusters you want: k
- 2. Next, initialize the centroids randomly
- 3. Assign each data point to the nearest centroid
- 4. Once the points are grouped, update the centroids
- 5. Repeat steps 3 & 4 until nothing changes!
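The steps above can be sketched from scratch with NumPy. This is a minimal illustration of the loop, not the optimized scikit-learn implementation; the two-blob toy data is an assumption.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the five steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: no more movement -> done.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Each blob ends up in its own cluster, with the centroids near the blob centers.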
What is the k in k-means and how to choose a good k?
- k is the number of clusters you want to find in your data; if you set k = 3, the algorithm will divide your data into 3 groups
- Methods to find a good k:
- 1. Domain knowledge: predefined or given
- 2. Elbow method
- 3. Silhouette method
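The elbow and silhouette methods can be tried with scikit-learn. The 3-blob toy dataset and the k range below are illustrative assumptions.

```python
# Sketch: scoring candidate values of k with the elbow method (inertia)
# and the silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias, silhouettes = [], []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                         # elbow: total loss
    silhouettes.append(silhouette_score(X, km.labels_))  # fit quality

# Inertia always shrinks as k grows; look for the "elbow" where the drop
# flattens. The silhouette score tends to peak near the true k.
for k, i, s in zip(range(2, 7), inertias, silhouettes):
    print(k, round(i, 1), round(s, 3))
```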
Loss function

- Common distance measures: Euclidean distance, Manhattan distance, intra-cluster distance
- k-means minimizes the sum of squared Euclidean distances of all points to their assigned centroids:

J = Σ_{i=1}^{m} ∥x_i − μ_{c_i}∥²

● m: number of data points
● K: number of clusters
● c_i: cluster assignment of data point x_i
● μ_k: centroid of the cluster k
● ∥x_i − μ_k∥²: squared Euclidean distance between point x_i and centroid μ_k
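This objective is exactly what scikit-learn reports as `inertia_`, so it can be checked by hand. The four-point dataset is illustrative.

```python
# Sketch: computing the loss J by hand and comparing it with
# scikit-learn's `inertia_` (the same quantity).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# J = sum over all points of the squared Euclidean distance to the
# centroid of the cluster each point is assigned to.
J = sum(np.sum((x - km.cluster_centers_[c]) ** 2)
        for x, c in zip(X, km.labels_))
print(J, km.inertia_)  # both equal 0.25 for this data
```

Each pair of nearby points gets a centroid at its midpoint, so every point contributes 0.25² = 0.0625 and J = 4 × 0.0625 = 0.25.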
Decision boundary
- The linear decision boundaries between neighboring centroids partition the data space into a Voronoi tessellation/diagram
- Interactive demo: https://k-means-explorable.vercel.app/
Which hyperparameters can one control when using k-Means?
Number of clusters (n_clusters): the k itself; choose it with e.g.
- Elbow method: when should the algorithm stop building more clusters?
- Silhouette score: how well do samples fit in their clusters?
Initialization method (init):
- k-means++: initial centroids are spread out intelligently
- random: centroids are picked randomly
- custom array: choose centroids manually
Number of initializations (n_init): independent restarts, default 10; the best run is kept
Maximum iterations (max_iter): upper limit on assignment/update rounds per run
Tolerance (tol): how small should the movement of the centroids get before the algorithm stops?
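All of these map directly onto arguments of scikit-learn's `KMeans`; the values below are illustrative, not recommendations.

```python
# Sketch: the hyperparameters above as scikit-learn KMeans arguments.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km = KMeans(
    n_clusters=3,      # the k itself
    init="k-means++",  # or "random", or a custom array of start centroids
    n_init=10,         # number of independent restarts; best run is kept
    max_iter=300,      # hard cap on assignment/update iterations per run
    tol=1e-4,          # stop once the centroids barely move
    random_state=0,
).fit(X)
print(km.n_iter_)  # iterations the best run actually needed
```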
How does the k-Means inference work?
- Once the centroids are finalized:
-> What happens when new data enters?
- The Euclidean distance is used to calculate the distance of a new data point x to every centroid
-> the data point x gets assigned to the nearest centroid
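In scikit-learn, this nearest-centroid assignment is what `predict` does on a fitted model; the two-blob data and the query point are illustrative.

```python
# Sketch: inference with a fitted model. predict() assigns a new point
# to the nearest learned centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_point = np.array([[9.5, 10.2]])
label = km.predict(new_point)    # index of the nearest centroid
dists = km.transform(new_point)  # Euclidean distance to every centroid
print(label, dists.round(2))
```

The new point near (10, 10) receives the same label as the training points from that blob.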


Pros and Cons of k-Means
+ simple to understand and implement
+ works well and fast with large datasets
+ output useful as input to other ML algorithms
- k has to be defined manually
- assumes clusters are circular (spherical) and of similar size
- sensitive to outliers
- struggles with fuzzy, overlapping data
What happens when clusters are not circular and even?
What methods are out there that improve the classical/simplified algorithm? What is "k-means++"?

Methods:
- Elkan's Algorithm (optimized distance calculations)
- Mini-Batch k-Means
- Bisecting k-Means
- Weighted k-Means
- Kernel k-Means
- k-Means++ Initialization
How does k-means++ initialize the centroids?

Video: https://www.youtube.com/watch?v=4qJWhvFQb9g
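The seeding idea can be sketched directly (this is an illustration of the principle, not scikit-learn's exact code): pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative k-means++ seeding sketch."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform pick
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # far-away points are strongly favored
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two tight groups at (0, 0) and (5, 5): the second seed must land in
# the group the first seed did not come from.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
print(kmeans_pp_init(X, k=2))
```

Spreading the seeds out this way makes convergence to a bad local optimum much less likely than with purely random initialization.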
Python Code Example using scikit-learn package
from sklearn.cluster import KMeans  # note the capitalization: KMeans, not Kmeans

kmeans = KMeans(init="k-means++", n_clusters=10, n_init=4)

kmeans.fit(data)  # data: an (n_samples, n_features) numeric array
Quiz time!
The code 5223 6971
What type of machine learning does k-means belong to?
1. Supervised
2. Unsupervised
3. Reinforcement
What type of machine learning does k-means belong to?
1. Supervised
2. Unsupervised ✓
3. Reinforcement
What type of data does k-means operate on?
1. Labeled
2. Unlabeled
What type of data does k-means operate on?
1. Labeled
2. Unlabeled ✓
k-Means is doing what task?
1. Classification
2. Regression
3. Clustering
k-Means is doing what task?
1. Classification
2. Regression
3. Clustering ✓
What is it called when the linear decision boundaries divide the
data space into regions that contain the clusters?
1. Voronoi tessellation or diagram
2. Archimedean solids
3. Roman partition
What is it called when the linear decision boundaries divide the
data space into regions that contain the clusters?
1. Voronoi tessellation or diagram ✓
2. Archimedean solids
3. Roman partition
How does k-means learn?
1. Initialization and Optimization
2. Assignment and Update
3. Selection and Transformation
How does k-means learn?
1. Initialization and Optimization
2. Assignment and Update ✓
3. Selection and Transformation
What does the k in k-means represent?
1. The number of iterations the algorithm runs
2. The size of each cluster
3. The number of clusters to form in the data
What does the k in k-means represent?
1. The number of iterations the algorithm runs
2. The size of each cluster
3. The number of clusters to form in the data ✓
What distance is used when we work with numerical data?
1. Manhattan distance
2. Euclidean distance
3. Intra-cluster distance
What distance is used when we work with numerical data?
1. Manhattan distance
2. Euclidean distance ✓
3. Intra-cluster distance
Which shape does k-means assume clusters are?
1. Circular
2. Oval
3. Triangle
Which shape does k-means assume clusters are?
1. Circular ✓
2. Oval
3. Triangle
Sources
Ang, Yi Zhe: "K-Means Clustering." https://k-means-explorable.vercel.app/

Attae, Pedram: "Silhouette or Elbow? That is the question." https://towardsdatascience.com/silhouette-or-elbow-that-is-the-question-a1dda4fb974

Anktia: "K-Means, getting the optimal number of clusters." https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/#Methods_to_Find_the_Best_Value_of_K

Bouley, Carter: "K-Means Clustering." https://towardsdatascience.com/k-means-clustering-fa4df5990fff

GeeksforGeeks: "Euclidean distance." https://www.geeksforgeeks.org/euclidean-distance/

Koufos, Nikos; Martin, Brendan: "K-Means & Other Clustering Algorithms: A Quick Intro with Python." https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/

Harris, Naftali: "Visualizing K-Means Clustering." https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Saji, Basil: "Elbow Method for Optimal Cluster Number in k-means." https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/

Sharma, Natasha: "K-Means Clustering Explained." 15.04.2024. https://neptune.ai/blog/k-means-clustering

Sharma, Pulkit: "An Introduction to K-Means Clustering." https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/#h-objective-of-k-means-clustering
Sources
Wang, X.; Ling, B. W.-K.: "Decision regions and decision boundaries of generalized K mean algorithm based on various norm criteria." Multimed Tools Appl 79, 30669–30684 (2020). https://doi.org/10.1007/s11042-020-09402-7

Yu, Zengchen; Wang, Ke; Xie, Shuxuan; Zhong, Yuanfeng; Lyu, Zhihan (2022): "Prototypical Network Based on Manhattan Distance." Computer Modeling in Engineering & Sciences 131, 1-21. doi:10.32604/cmes.2022.019612

Zhang, Wei: "K-Means." https://wei2624.github.io/MachineLearning/usv_kmeans/
