"These Are Just Rough Notes For References" What Is K-Means Clustering
"These Are Just Rough Notes For References" What Is K-Means Clustering
Algorithm
1. Choose your value of K (elbow method or silhouette analysis)
2. Randomly select K data points to serve as the initial cluster centroids
3. Assign every other data point to its nearest cluster centroid
4. Reposition each cluster centroid to the average (mean) of the points in its cluster
5. Repeat steps 3 & 4 until there are no changes in any cluster
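To make the steps above concrete, here is a minimal NumPy sketch of the whole loop. The names (kmeans, X, k, max_iter, tol) are my own choices rather than anything from a particular library, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (note: empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```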
Now there are different ways to initialize the centroids: you can either choose them at random, or
sort the dataset, split it into K portions, and pick one data point from each portion as a centroid.
From here onwards, the model performs the calculations on its own and assigns a cluster to each
data point. The model calculates the distance between a data point and all the centroids, and the
data point is assigned to the cluster with the nearest centroid. Again, there are different ways you
can calculate this distance, each with its own pros and cons. Usually we use the L2 distance.
The picture below shows how to calculate the L2 distance between the centroid and a data point. Every
time a data point is assigned to a cluster, the following steps are followed.
L2 or Euclidean distance
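Since the picture itself is not reproduced in these notes, here is a rough NumPy sketch of the same calculation; the arrays point and centroids are made-up illustrative values.

```python
import numpy as np

point = np.array([2.0, 3.0])
centroids = np.array([[1.0, 1.0], [5.0, 4.0], [2.5, 3.5]])

# L2 distance: d(x, c) = sqrt(sum_j (x_j - c_j)^2) for each centroid c
l2_distances = np.sqrt(((point - centroids) ** 2).sum(axis=1))
nearest_cluster = l2_distances.argmin()   # index of the closest centroid
```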
3. Updating Centroids
Because the initial centroids were chosen arbitrarily, your model then updates them with new cluster
values. The new value might or might not occur in the dataset; in fact, it would be a coincidence if it
did. This is because the updated cluster centroid is the average, or mean, value of all the data points
within that cluster.
Updating cluster centroids
Now if some other algorithm, like K-Modes or K-Medians, were used, the mode or the median would be
taken respectively, instead of the average value.
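A small sketch of the update step, assuming X holds the data points and labels holds the current cluster assignments (both names are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])           # cluster assigned to each data point

# New centroid of each cluster = mean of the points currently assigned to it
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
# K-Medians would use np.median instead of the mean here.
```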
4. Stopping Criterion
Since steps 2 and 3 are performed iteratively, the process would go on forever if we don't set a stopping
criterion. The stopping criterion tells our algorithm when to stop updating the clusters. It is important to
note that setting a stopping criterion does not necessarily return THE BEST clusters, but to make sure it
returns reasonably good clusters, and more importantly at least returns some clusters, we need to have a
stopping criterion.
Like everything else, there are different ways to set the stopping criterion. You can even set multiple
conditions that, if met, would stop the iteration and return the results. Some of the stopping conditions
are:
The data points assigned to each cluster remain the same (can take too much time)
The distance of the data points from their centroid is at a minimum (below the threshold you've set)
A fixed number of iterations has been reached (too few iterations → poor results, so choose the
maximum number of iterations wisely)
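As a rough sketch, a combined stopping check could look like the following; all names here are my own and not from a specific library.

```python
import numpy as np

def converged(old_centroids, new_centroids, old_labels, new_labels,
              iteration, tol=1e-4, max_iter=300):
    if np.array_equal(old_labels, new_labels):                # assignments unchanged
        return True
    if np.linalg.norm(new_centroids - old_centroids) < tol:   # centroids barely moved
        return True
    return iteration >= max_iter                              # iteration budget exhausted
```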
The two methods to measure the cluster quality are described below:
Inertia: Intuitively, inertia tells how far apart the points within a cluster are. Therefore, a small value of
inertia is aimed for. Inertia's value starts at zero and goes up.
Silhouette score: The silhouette score tells how far away the data points in one cluster are from the
data points in another cluster. The range of the silhouette score is from -1 to 1. The score should be
closer to 1 than to -1.
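Both measures are easy to read off with scikit-learn; the sketch below assumes a toy 2-D dataset X just for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.random.RandomState(0).rand(200, 2)               # toy data just for illustration

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("inertia:", km.inertia_)                          # >= 0, smaller is better
print("silhouette:", silhouette_score(X, km.labels_))   # in [-1, 1], closer to 1 is better
```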
Choosing K (2 ways)
Elbow method :
The elbow method uses the sum of squared distances (SSE) to choose an ideal value of k
based on the distance between the data points and their assigned cluster centroids. We
choose a value of k where the SSE begins to flatten out and we see an inflection point.
When visualized, this graph looks somewhat like an elbow, hence the name of the
method.
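A rough sketch of the elbow method with scikit-learn (the toy dataset X and the range of k values are arbitrary choices):

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.random.RandomState(0).rand(300, 2)   # toy data just for illustration

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                 # SSE of points to their assigned centroids

# Plotting k against sse (e.g. with matplotlib) should show the curve flattening
# out after the "elbow"; that k is the candidate value.
```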
Silhouette Analysis:
Silhouette analysis can be used to determine the degree of separation between clusters.
For each sample:
o Compute the average distance from all data points in the same cluster (ai).
o Compute the average distance from all data points in the closest other cluster (bi).
o The silhouette coefficient for the sample is then (bi - ai) / max(ai, bi).
Therefore, we want the coefficients to be as big as possible and close to 1 to have good
clusters.
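A rough sketch of silhouette analysis over several values of k, again on arbitrary toy data; silhouette_score averages the per-sample coefficients described above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.random.RandomState(0).rand(300, 2)    # toy data just for illustration

for k in range(2, 11):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))    # pick the k with the highest average score
```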
In theory, k-means++ is a much better initialization. It is a biased random sampling that prefers points
that are farther from each other and avoids picking nearby points. Random initialization may be unlucky
and choose centers that are close together.
So in theory, k-means++ should require fewer iterations and have a higher chance of
finding the global optimum.
K-Means is an iterative clustering method which randomly assigns initial centroids and shifts
them to minimize the sum of squares. One problem is that, because the centroids are
initially random, a bad starting position could cause the algorithm to converge at a local
optimum (multiple centroids might end up close to each other).
K-Means++ was designed to combat this: it chooses the initial centroids using a weighted
method which makes it more likely that points farther away from the already chosen
centroids will be picked. The idea is that while initialization is more complex and takes
longer, the centroids will be more accurate, so fewer iterations are needed and the
overall run time is reduced.
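In scikit-learn, k-means++ is the default initialization of KMeans; the sketch below simply compares it with plain random initialization on toy data to illustrate the idea.

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.random.RandomState(0).rand(500, 2)   # toy data just for illustration

km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)

print("k-means++ iterations:", km_pp.n_iter_, "inertia:", km_pp.inertia_)
print("random    iterations:", km_rand.n_iter_, "inertia:", km_rand.inertia_)
```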
Hierarchical Clustering — Explained
One of the advantages of hierarchical clustering is that we do not have to specify the number of
clusters (but we can). Let’s dive into details after this short introduction.
Agglomerative clustering
Agglomerative clustering is kind of a bottom-up approach. Each data point is assumed to be a
separate cluster at first. Then the similar clusters are iteratively combined. Let’s go over an
example to explain the concept clearly.
We have a dataset consisting of 9 samples. I chose numbers associated with these samples to
demonstrate the concept of similarity. At each iteration (or level), the closest numbers (i.e.
samples) are combined together. As you can see in the figure below, we start with 9 clusters.
The closest ones are combined at the first level and then we have 7 clusters. The number of
black lines that intersect with the blue lines represents the number of clusters.
This is a very simple data set used to illustrate the idea, but real-life data sets are obviously more
complex. We mentioned that the "closest" data points (or clusters) are combined together. But how do
the algorithms identify the closest ones? There are 4 different methods implemented in scikit-learn
to measure the similarity:
Ward's linkage: Minimizes the variance of the clusters being merged; the merge that causes the
least increase in total variance around the cluster centroids is chosen.
Average linkage: Uses the average distance between all pairs of data points in the two clusters.
Complete (maximum) linkage: Uses the maximum distance among all pairs of data points in the two clusters.
Single (minimum) linkage: Uses the minimum distance among all pairs of data points in the two clusters.
One of the advantages of hierarchical clustering is that we do not have to specify the number of
clusters beforehand. However, it is not wise to combine all data points into one cluster; we
should stop combining clusters at some point. Scikit-learn provides two options for this (see the
sketch after this list):
Stop after a number of clusters is reached (n_clusters)
Set a threshold value for the linkage distance (distance_threshold). If the distance between two
clusters is above the threshold, these clusters will not be merged.
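A rough sketch of both stopping options with scikit-learn's AgglomerativeClustering, on arbitrary toy data:

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.random.RandomState(0).rand(100, 2)   # toy data just for illustration

# Option 1: stop when a fixed number of clusters is reached
agg_k = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

# Option 2: stop merging once the linkage distance exceeds a threshold
# (n_clusters must be set to None in this case)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                linkage="average").fit(X)

print("labels with n_clusters=3:", np.unique(agg_k.labels_))
print("clusters found with threshold:", agg_t.n_clusters_)
```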
Divisive Clustering
Divisive clustering is not commonly used in real life, so I will mention it briefly. A simple yet clear
explanation is that divisive clustering is the opposite of agglomerative clustering. We start with
one giant cluster including all data points. Then data points are separated into different clusters.
It is a top-down approach.