
k-Means clustering algorithm

Abdul Kader Sarah


Bogdanova Polina
Radovanov Angelina
Schoener Juliana
Tampir Martina
Agenda
- k-Means clustering algorithm: introduction
- NLP problems k-means can solve
- Learning and algorithm process behind it
- The k in k-means: what is it?
- Loss function
- Decision boundary
- k-Means inference
- Hyperparameters
- Pros & cons
- Improvement methods
- Quiz time!
Clustering
- Introduced in 1932 by H. E. Driver and A. L. Kroeber in their paper "Quantitative
Expression of Cultural Relationships"
- Grouping objects so that objects in the same group (cluster) are more
similar to each other than to objects in other groups (clusters)
- Based on similarities vs. dissimilarities
- Works on unlabeled data
- Unsupervised learning
k-Means clustering algorithm
- The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in
1957; the term "k-means" was first used by James MacQueen in 1967
- One of the most popular clustering algorithms
- Fast: runs in time linear in the number of data points, O(n) per iteration (for fixed k and dimension)
- Brief description:
1. Choose your number of clusters k
2. Place k centroids in your data
3. Assign each data point to the centroid nearest to it
4. Update the centroids and repeat
5. No more updates > you're done!
Example usage
- Dog snacks?
- Measured per dog: "good boy" score and run-for-a-ball speed
Example usage
- Collected data on a scatter plot
- Nice and slow
- Nice and fast
- Fast and not so nice > 3 groups
Example usage
- k-Means applied
- Data clustered
Example usage
- Dog snacks?
- Yes!
NLP problems & k-means
- Unsupervised document classification
- Automatically discover groups of similar documents within a collection of
documents > high-level overview of the information
- Works on numerical data only > word frequencies, embeddings or TF-IDF values
- Examples
– Fake news identification
– Topic modeling
– Get an overview of search results
– Sentiment analysis
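Document clustering can be sketched in a few lines: text is first turned into TF-IDF vectors, then clustered. The four toy documents and k = 2 below are illustrative assumptions, not taken from the slides.

```python
# Sketch: clustering a small toy document collection with TF-IDF + k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog chased the ball",
    "a fast dog runs for the ball",
    "the stock market fell sharply today",
    "investors sold shares as the stock market dropped",
]

# k-means needs numbers, so turn the text into TF-IDF vectors first.
X = TfidfVectorizer().fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # the two dog documents share one label,
                       # the two finance documents the other
```

The same pipeline scales to topic discovery or search-result grouping by swapping in a larger corpus and a suitable k.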
How does k-means learn?

- k-means groups data points into clusters based on their similarity and proximity
- Two alternating phases: Assignment and Update
- The algorithm finds these groups (clusters) and their centroids, so that the data points in each
group are as close to their centroid as possible
What is the algorithm process behind it?
- Steps of the algorithm:
- 1. First, decide the number of clusters you want: k
- 2. Next, initialize the centroids randomly
- 3. Assign each data point to the nearest centroid
- 4. Once the points are grouped, update the centroids
- 5. Repeat steps 3 & 4 until nothing changes!
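The steps above can be sketched from scratch with NumPy. This is a minimal illustration of the loop, not the optimized scikit-learn implementation; the two-blob toy data is an assumption.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the five steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: no more movement -> done.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Each blob ends up in its own cluster, with the centroids near the blob centers.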
What is the k in k-means and how to choose a good k?
- k is the number of clusters you want to find in your data; if you set k = 3, the algorithm will divide your data into 3 groups
- Methods to find a good k:
- 1. Domain knowledge: predefined or given
- 2. Elbow method
- 3. Silhouette method
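The elbow and silhouette methods can be tried with scikit-learn. The 3-blob toy dataset and the k range below are illustrative assumptions.

```python
# Sketch: scoring candidate values of k with the elbow method (inertia)
# and the silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias, silhouettes = [], []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                         # elbow: total loss
    silhouettes.append(silhouette_score(X, km.labels_))  # fit quality

# Inertia always shrinks as k grows; look for the "elbow" where the drop
# flattens. The silhouette score tends to peak near the true k.
for k, i, s in zip(range(2, 7), inertias, silhouettes):
    print(k, round(i, 1), round(s, 3))
```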
Loss function

- Common distance measures: Euclidean distance, Manhattan distance, intra-cluster distance
- k-means minimizes the sum of squared Euclidean distances of all points to their assigned centroids:

J = Σ_{i=1}^{m} ∥x_i − μ_{c_i}∥²

● m: number of data points
● K: number of clusters
● c_i: cluster assignment of data point x_i
● μ_k: centroid of the cluster k
● ∥x_i − μ_k∥²: squared Euclidean distance between point x_i and centroid μ_k
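This objective is exactly what scikit-learn reports as `inertia_`, so it can be checked by hand. The four-point dataset is illustrative.

```python
# Sketch: computing the loss J by hand and comparing it with
# scikit-learn's `inertia_` (the same quantity).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# J = sum over all points of the squared Euclidean distance to the
# centroid of the cluster each point is assigned to.
J = sum(np.sum((x - km.cluster_centers_[c]) ** 2)
        for x, c in zip(X, km.labels_))
print(J, km.inertia_)  # both equal 0.25 for this data
```

Each pair of nearby points gets a centroid at its midpoint, so every point contributes 0.25² = 0.0625 and J = 4 × 0.0625 = 0.25.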
Decision boundary
- The linear decision boundaries between neighboring centroids partition the data space into a Voronoi tessellation/diagram
- Interactive demo: https://k-means-explorable.vercel.app/
Which hyperparameters can one control when using k-Means?
Number of clusters (n_clusters): the k itself; choose it with e.g.
- Elbow method: when should the algorithm stop building more clusters?
- Silhouette score: how well do samples fit in their clusters?
Initialization method (init):
- k-means++: initial centroids are spread out intelligently
- random: centroids are picked randomly
- custom array: choose centroids manually
Number of initializations (n_init): independent restarts, default 10; the best run is kept
Maximum iterations (max_iter): upper limit on assignment/update rounds per run
Tolerance (tol): how small should the movement of the centroids get before the algorithm stops?
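All of these map directly onto arguments of scikit-learn's `KMeans`; the values below are illustrative, not recommendations.

```python
# Sketch: the hyperparameters above as scikit-learn KMeans arguments.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km = KMeans(
    n_clusters=3,      # the k itself
    init="k-means++",  # or "random", or a custom array of start centroids
    n_init=10,         # number of independent restarts; best run is kept
    max_iter=300,      # hard cap on assignment/update iterations per run
    tol=1e-4,          # stop once the centroids barely move
    random_state=0,
).fit(X)
print(km.n_iter_)  # iterations the best run actually needed
```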
How does the k-Means inference work?
- Once the centroids are finalized:
-> What happens when new data enters?
- The Euclidean distance is used to calculate the distance of a new data point x to every centroid
-> the data point x gets assigned to the nearest centroid
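In scikit-learn, this nearest-centroid assignment is what `predict` does on a fitted model; the two-blob data and the query point are illustrative.

```python
# Sketch: inference with a fitted model. predict() assigns a new point
# to the nearest learned centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_point = np.array([[9.5, 10.2]])
label = km.predict(new_point)    # index of the nearest centroid
dists = km.transform(new_point)  # Euclidean distance to every centroid
print(label, dists.round(2))
```

The new point near (10, 10) receives the same label as the training points from that blob.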


Pros and Cons of k-Means
+ simple to understand and implement
+ works well and fast with large datasets
+ output useful as input to other ML algorithms
- k has to be defined manually
- assumes clusters are circular (spherical) and of similar size
- sensitive to outliers
- struggles with fuzzy, overlapping data
What happens when clusters are not circular and even?
What methods are out there that improve the classical/simplified algorithm? What is "k-means++"?

Methods:
- Elkan's Algorithm (optimized distance calculations)
- Mini-Batch k-Means
- Bisecting k-Means
- Weighted k-Means
- Kernel k-Means
- k-Means++ Initialization
How does k-means++ initialize the centroids?

Video: https://www.youtube.com/watch?v=4qJWhvFQb9g
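The seeding idea can be sketched directly (this is an illustration of the principle, not scikit-learn's exact code): pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative k-means++ seeding sketch."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform pick
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # far-away points are strongly favored
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two tight groups at (0, 0) and (5, 5): the second seed must land in
# the group the first seed did not come from.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
print(kmeans_pp_init(X, k=2))
```

Spreading the seeds out this way makes convergence to a bad local optimum much less likely than with purely random initialization.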
Python Code Example using scikit-learn package
from sklearn.cluster import KMeans  # note the capitalization: KMeans, not Kmeans

kmeans = KMeans(init="k-means++", n_clusters=10, n_init=4)

kmeans.fit(data)  # data: an (n_samples, n_features) numeric array
Quiz time!
The code 5223 6971
What type of machine learning does k-means belong to?
1. Supervised
2. Unsupervised
3. Reinforcement
What type of machine learning does k-means belong to?
1. Supervised
2. Unsupervised ✓
3. Reinforcement
What type of data does k-means operate on?
1. Labeled
2. Unlabeled
What type of data does k-means operate on?
1. Labeled
2. Unlabeled ✓
k-Means is doing what task?
1. Classification
2. Regression
3. Clustering
k-Means is doing what task?
1. Classification
2. Regression
3. Clustering ✓
What is it called when the linear decision boundaries divide the
data space into regions that contain the clusters?
1. Voronoi tessellation or diagram
2. Archimedean solids
3. Roman partition
What is it called when the linear decision boundaries divide the
data space into regions that contain the clusters?
1. Voronoi tessellation or diagram ✓
2. Archimedean solids
3. Roman partition
How does k-means learn?
1. Initialization and Optimization
2. Assignment and Update
3. Selection and Transformation
How does k-means learn?
1. Initialization and Optimization
2. Assignment and Update ✓
3. Selection and Transformation
What does the k in k-means represent?
1. The number of iterations the algorithm runs
2. The size of each cluster
3. The number of clusters to form in the data
What does the k in k-means represent?
1. The number of iterations the algorithm runs
2. The size of each cluster
3. The number of clusters to form in the data ✓
What distance is used when we work with numerical data?
1. Manhattan distance
2. Euclidean distance
3. Intra-cluster distance
What distance is used when we work with numerical data?
1. Manhattan distance
2. Euclidean distance ✓
3. Intra-cluster distance
Which shape does k-means assume clusters are?
1. Circular
2. Oval
3. Triangle
Which shape does k-means assume clusters are?
1. Circular ✓
2. Oval
3. Triangle
Sources
Ang, Yi Zhe: "K-Means Clustering." https://k-means-explorable.vercel.app/

Attae, Pedram: "Silhouette or Elbow? That is the question." https://towardsdatascience.com/silhouette-or-elbow-that-is-the-question-a1dda4fb974

Anktia: "K-Means, getting the optimal number of clusters." https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/#Methods_to_Find_the_Best_Value_of_K

Bouley, Carter: "K-Means Clustering." https://towardsdatascience.com/k-means-clustering-fa4df5990fff

GeeksforGeeks: "Euclidean distance." https://www.geeksforgeeks.org/euclidean-distance/

Koufos, Nikos; Martin, Brendan: "K-Means & Other Clustering Algorithms: A Quick Intro with Python." https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/

Harris, Naftali: "Visualizing K-Means Clustering." https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Saji, Basil: "Elbow Method for Optimal Cluster Number in k-means." https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/

Sharma, Natasha: "K-Means Clustering Explained." 15.04.2024. https://neptune.ai/blog/k-means-clustering

Sharma, Pulkit: "An Introduction to K-Means Clustering." https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/#h-objective-of-k-means-clustering
Sources
Wang, X.; Ling, B. W.-K.: "Decision regions and decision boundaries of generalized K mean algorithm based on various norm criteria." Multimed Tools Appl 79, 30669–30684 (2020). https://doi.org/10.1007/s11042-020-09402-7

Yu, Zengchen; Wang, Ke; Xie, Shuxuan; Zhong, Yuanfeng; Lyu, Zhihan (2022): "Prototypical Network Based on Manhattan Distance." Computer Modeling in Engineering & Sciences 131, 1-21. doi:10.32604/cmes.2022.019612

Zhang, Wei: "K-Means." https://wei2624.github.io/MachineLearning/usv_kmeans/
