K-Means Clustering Algorithm
- Dog snacks?
- Good boy
- Run-for-a-ball-speed
Example usage
- Collected data on a scatter plot
- Nice and slow
- Nice and fast
- Fast and not so nice > 3 groups
- k-Means
- Data clustered
- Dog snacks?
- Yes!
NLP problems & k-means
- Unsupervised Document Classification
- Automatically discover groups of similar documents within a collection > high-level overview of the information
- k-means only works on numerical data > represent documents as word frequencies, TF-IDF values or embeddings
- Examples
– Topic Modeling
– Sentiment Analysis
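A minimal sketch of unsupervised document grouping with TF-IDF values and scikit-learn. The four example sentences are a made-up mini-corpus, and k = 2 is an assumption chosen to match it.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two documents about dogs, two about programming.
documents = [
    "the dog chased the ball in the park",
    "my dog loves snacks and the ball",
    "python code uses loops and functions",
    "functions and loops make python code reusable",
]

# Turn text into numerical TF-IDF vectors, since k-means needs numbers.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Group the documents into k = 2 clusters (k is an assumption here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, doc in zip(labels, documents):
    print(label, doc)
```

The cluster numbers themselves carry no meaning; only which documents share a number matters.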
How does k-means learn?
- k-means groups data points into clusters based on their similarity and vicinity
- It learns by alternating two steps: Assignment and Update
- The algorithm finds these groups (clusters) and their centroids so that the data points in each group are as close to their centroid as possible
What is the algorithm process behind it?
- Steps of the algorithm:
- 1. First, decide the number of clusters you want: k
- 2. Next, initialize the centroids randomly
- 3. Assign each data point to the nearest centroid
- 4. Once the points are grouped, update each centroid to the mean of its assigned points
- 5. Repeat 3 & 4 until the centroids stop moving!
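The steps above can be sketched in plain NumPy. This is an illustrative version, not the scikit-learn implementation: the function name `k_means`, its parameters and the synthetic three-blob data are assumptions made for the example.

```python
import numpy as np


def k_means(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: repeat until the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels


# Synthetic data: three blobs of 50 points each (made up for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])
centroids, labels = k_means(X, k=3)
print(centroids)
```

Because the start is random, a single run can end in a poor grouping; running it with several seeds and keeping the best result is the usual remedy.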
What is the k in k-means and how to choose a good k?
- k is the number of clusters you want to find in your data. If you set k = 3, the algorithm will divide your data into 3 groups
- Methods to find a good k:
- 1. Domain Knowledge - predefined or given
- 2. Elbow Method
- 3. Silhouette Method
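A small sketch of the elbow and silhouette methods with scikit-learn. The data is synthetic (three blobs at hand-picked centers), and the range k = 2..6 is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (an assumption for illustration).
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 0], [4, 7]], cluster_std=0.6, random_state=0
)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # elbow method: look for the bend in this curve
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```

For the elbow method, plot the inertia against k and pick the k where the curve stops dropping sharply. The silhouette score gives a single number per k, so the maximum can be read off directly.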
Loss function
https://k-means-explorable.vercel.app/
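The loss that k-means minimizes is the sum of squared Euclidean distances between each data point and the centroid of its own cluster (scikit-learn calls this value inertia). A short sketch on random, made-up data that computes the loss by hand and compares it with the `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical data, for illustration only

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]
# Loss: sum of squared distances from each point to its own centroid.
loss = np.sum((X - assigned_centroids) ** 2)

print(loss, model.inertia_)
```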
Which hyperparameters can one control when using k-Means?
Number of clusters (n_clusters):
- Elbow method: At which k does adding more clusters stop paying off?
- Silhouette score: How well do data points fit in their clusters?
Maximum iterations (max_iter): How many assignment/update rounds may a single run take at most?
Initialization method (init):
- k-means++: Initial centroids are spread intelligently
- Random: Centroids are picked randomly
- Custom array: Choose centroids manually
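Number of initializations (n_init): How often is the algorithm restarted with new initial centroids? The best run is kept (default 10 in older scikit-learn versions, "auto" since version 1.4)
Tolerance (tol): How small should the movement of the centroids get before the algorithm stops?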
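A sketch of how these hyperparameters are passed to scikit-learn's `KMeans`. The values shown are illustrative choices, not recommendations, and the data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # synthetic data

kmeans = KMeans(
    n_clusters=3,      # k, the number of clusters
    init="k-means++",  # or "random", or an array of shape (k, n_features)
    n_init=10,         # number of restarts; the run with the lowest inertia is kept
    max_iter=300,      # upper bound on iterations per run
    tol=1e-4,          # stop once the centroids move less than this tolerance
    random_state=0,    # fixed seed so the example is reproducible
)
kmeans.fit(X)
print(kmeans.n_iter_, kmeans.inertia_)
```

`n_iter_` reports how many iterations the best run actually needed, which is often far below `max_iter`.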
How does the k-Means inference work?
- Once training is done, the centroids are finalized and stay fixed
- A new data point x is assigned to the cluster of its nearest centroid, measured by Euclidean distance
Methods:
https://www.youtube.com/watch?v=4qJWhvFQb9g
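A sketch of inference on a new data point: the nearest centroid is computed by hand with the Euclidean distance and compared with the result of `predict`. The training data and the point `x_new` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # synthetic data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

x_new = np.array([[0.0, 2.0]])  # hypothetical new data point

# Euclidean distance from x_new to every finalized centroid.
distances = np.linalg.norm(kmeans.cluster_centers_ - x_new, axis=1)
manual_label = int(distances.argmin())

# predict() applies the same nearest-centroid rule.
print(manual_label, kmeans.predict(x_new)[0])
```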
Python Code Example using scikit-learn package
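from sklearn.cluster import KMeans
# data: numerical array of shape (n_samples, n_features)
kmeans = KMeans(n_clusters=3)  # k = 3 is an example value, as in the three dog groups
kmeans.fit(data)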
Quiz time!
What type of machine learning does k-means belong to?
1. Supervised
2. Unsupervised
3. Reinforcement
Answer: 2. Unsupervised
What type of data does k-means operate on?
1. Labeled
2. Unlabeled
Answer: 2. Unlabeled
k-Means is doing what task?
1. Classification
2. Regression
3. Clustering
Answer: 3. Clustering
What is it called when linear decision boundaries divide the data space into regions, one for each cluster?
1. Voronoi tessellation or diagram
2. Archimedean solids
3. Roman partition
Answer: 1. Voronoi tessellation or diagram
How does k-means learn?
1. Initialization and Optimization
2. Assignment and Update
3. Selection and Transformation
Answer: 2. Assignment and Update
What does the k in k-means represent?
1. The number of iterations the algorithm runs
2. The size of each cluster
3. The number of clusters to form in the data
Answer: 3. The number of clusters to form in the data
What distance is used when we work with numerical data?
1. Manhattan distance
2. Euclidean distance
3. Intra-cluster distance
Answer: 2. Euclidean distance
Which shape does k-means assume clusters have?
1. Circular
2. Oval
3. Triangle
Answer: 1. Circular
Sources
Ang, Yi Zhe: “K-Means Clustering.” https://k-means-explorable.vercel.app/.
Koufos, Nikos; Martin, Brendan: “K-Means & Other Clustering Algorithms: A Quick Intro with Python.” https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/.
Yu, Zengchen; Wang, Ke; Xie, Shuxuan; Zhong, Yuanfeng; Lyu, Zhihan: “Prototypical Network Based on Manhattan Distance.” Computer Modeling in Engineering & Sciences, 131, 1-21 (2022). doi:10.32604/cmes.2022.019612.