"These Are Just Rough Notes For References" What Is K-Means Clustering
"These Are Just Rough Notes For References" What Is K-Means Clustering
Algorithm
1. Choose your value of K (elbow method or silhouette analysis)
2. Randomly select K data points to serve as the initial cluster centroids
3. Assign every other data point to its nearest cluster centroid
4. Reposition each cluster centroid to the average (mean) of the points in its cluster
5. Repeat steps 3 & 4 until there are no changes in any cluster
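To make the steps above concrete, here is a minimal NumPy sketch of the whole loop. The names (kmeans, X, k, max_iter, tol) are my own choices rather than anything from a particular library, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (note: empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```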
Now there are different ways to initialize the centroids: you can either choose them at random, or
sort the dataset, split it into K portions, and pick one data point from each portion as a centroid.
From here onwards, the model performs the calculations on its own and assigns a cluster to each
data point. The model calculates the distance between a data point and all the centroids, and the
data point is assigned to the cluster with the nearest centroid. Again, there are different ways you
can calculate this distance, each with its own pros and cons. Usually we use the L2 distance.
The picture below shows how to calculate the L2 distance between the centroid and a data point. Every
time a data point is assigned to a cluster, the following steps are followed.
L2 or Euclidean distance
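Since the picture itself is not reproduced in these notes, here is a rough NumPy sketch of the same calculation; the arrays point and centroids are made-up illustrative values.

```python
import numpy as np

point = np.array([2.0, 3.0])
centroids = np.array([[1.0, 1.0], [5.0, 4.0], [2.5, 3.5]])

# L2 distance: d(x, c) = sqrt(sum_j (x_j - c_j)^2) for each centroid c
l2_distances = np.sqrt(((point - centroids) ** 2).sum(axis=1))
nearest_cluster = l2_distances.argmin()   # index of the closest centroid
```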
3. Updating Centroids
Because the initial centroids were chosen arbitrarily, your model then updates them with new cluster
values. The new value might or might not occur in the dataset; in fact, it would be a coincidence if it
did. This is because the updated cluster centroid is the average, or mean, value of all the data points
within that cluster.
Updating cluster centroids
Now if some other algorithm, like K-Modes or K-Medians, were used, the mode or the median would be
taken respectively, instead of the average value.
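A small sketch of the update step, assuming X holds the data points and labels holds the current cluster assignments (both names are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])           # cluster assigned to each data point

# New centroid of each cluster = mean of the points currently assigned to it
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
# K-Medians would use np.median instead of the mean here.
```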
4. Stopping Criterion
Since steps 2 and 3 are performed iteratively, the process would go on forever if we don't set a stopping
criterion. The stopping criterion tells our algorithm when to stop updating the clusters. It is important to
note that setting a stopping criterion does not necessarily return THE BEST clusters, but to make sure it
returns reasonably good clusters, and more importantly at least returns some clusters, we need to have a
stopping criterion.
Like everything else, there are different ways to set the stopping criterion. You can even set multiple
conditions that, if met, would stop the iteration and return the results. Some of the stopping conditions
are:
The data points assigned to each cluster remain the same (can take too much time)
The distance of the data points from their centroid is at a minimum (below the threshold you've set)
A fixed number of iterations has been reached (too few iterations → poor results, so choose the
maximum number of iterations wisely)
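As a rough sketch, a combined stopping check could look like the following; all names here are my own and not from a specific library.

```python
import numpy as np

def converged(old_centroids, new_centroids, old_labels, new_labels,
              iteration, tol=1e-4, max_iter=300):
    if np.array_equal(old_labels, new_labels):                # assignments unchanged
        return True
    if np.linalg.norm(new_centroids - old_centroids) < tol:   # centroids barely moved
        return True
    return iteration >= max_iter                              # iteration budget exhausted
```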
The two methods to measure the cluster quality are described below:
Inertia: Intuitively, inertia tells how far apart the points within a cluster are. Therefore, a small value of
inertia is aimed for. Inertia's value starts at zero and goes up.
Silhouette score: The silhouette score tells how far away the data points in one cluster are from the
data points in another cluster. The range of the silhouette score is from -1 to 1. The score should be
closer to 1 than to -1.
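Both measures are easy to read off with scikit-learn; the sketch below assumes a toy 2-D dataset X just for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.random.RandomState(0).rand(200, 2)               # toy data just for illustration

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("inertia:", km.inertia_)                          # >= 0, smaller is better
print("silhouette:", silhouette_score(X, km.labels_))   # in [-1, 1], closer to 1 is better
```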
Choosing K (2 ways)
Elbow method :
The elbow method uses the sum of squared distances (SSE) to choose an ideal value of k
based on the distance between the data points and their assigned cluster centroids. We
choose a value of k where the SSE begins to flatten out and we see an inflection point.
When visualized, this graph looks somewhat like an elbow, hence the name of the
method.
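A rough sketch of the elbow method with scikit-learn (the toy dataset X and the range of k values are arbitrary choices):

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.random.RandomState(0).rand(300, 2)   # toy data just for illustration

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                 # SSE of points to their assigned centroids

# Plotting k against sse (e.g. with matplotlib) should show the curve flattening
# out after the "elbow"; that k is the candidate value.
```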
Silhouette Analysis:
Silhouette analysis can be used to determine the degree of separation between clusters.
For each sample:
o Compute the average distance from all data points in the same cluster (ai).
o Compute the average distance from all data points in the closest other cluster (bi).
o The silhouette coefficient for the sample is then (bi - ai) / max(ai, bi).
Therefore, we want the coefficients to be as big as possible and close to 1 to have good
clusters.
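A rough sketch of silhouette analysis over several values of k, again on arbitrary toy data; silhouette_score averages the per-sample coefficients described above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.random.RandomState(0).rand(300, 2)    # toy data just for illustration

for k in range(2, 11):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))    # pick the k with the highest average score
```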
In theory, k-means++ is a much better initialization. It is a biased random sampling that prefers points
that are farther from each other and avoids picking nearby points. Random initialization may be unlucky
and choose centers that are close together.
So in theory, k-means++ should require fewer iterations and have a higher chance of
finding the global optimum.
K-Means is an iterative clustering method which randomly assigns initial centroids and shifts
them to minimize the sum of squares. One problem is that, because the centroids are
initially random, a bad starting position could cause the algorithm to converge at a local
optimum (multiple centroids might end up close to each other).
K-Means++ was designed to combat this: it chooses the initial centroids using a weighted
method which makes it more likely that points farther away from the already chosen
centroids will be picked. The idea is that while initialization is more complex and takes
longer, the centroids will be more accurate, so fewer iterations are needed and the
overall run time is reduced.
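In scikit-learn, k-means++ is the default initialization of KMeans; the sketch below simply compares it with plain random initialization on toy data to illustrate the idea.

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.random.RandomState(0).rand(500, 2)   # toy data just for illustration

km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)

print("k-means++ iterations:", km_pp.n_iter_, "inertia:", km_pp.inertia_)
print("random    iterations:", km_rand.n_iter_, "inertia:", km_rand.inertia_)
```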
Hierarchical Clustering — Explained
One of the advantages of hierarchical clustering is that we do not have to specify the number of
clusters (but we can). Let’s dive into details after this short introduction.
Agglomerative clustering
Agglomerative clustering is kind of a bottom-up approach. Each data point is assumed to be a
separate cluster at first. Then the similar clusters are iteratively combined. Let’s go over an
example to explain the concept clearly.
We have a dataset consisting of 9 samples. I chose numbers associated with these samples to
demonstrate the concept of similarity. At each iteration (or level), the closest numbers (i.e.
samples) are combined together. As you can see in the figure below, we start with 9 clusters.
The closest ones are combined at the first level and then we have 7 clusters. The number of
black lines that intersect with the blue lines represents the number of clusters.
This is a very simple data set used to illustrate the idea, but real-life data sets are obviously more
complex. We mentioned that the "closest" data points (or clusters) are combined together. But how do
the algorithms identify the closest ones? There are 4 different methods implemented in scikit-learn
to measure the similarity:
Ward's linkage: Minimizes the variance of the clusters being merged; the merge that causes the
least increase in total variance around the cluster centroids is chosen.
Average linkage: Uses the average distance between all pairs of data points in the two clusters.
Complete (maximum) linkage: Uses the maximum distance among all pairs of data points in the two clusters.
Single (minimum) linkage: Uses the minimum distance among all pairs of data points in the two clusters.
One of the advantages of hierarchical clustering is that we do not have to specify the number of
clusters beforehand. However, it is not wise to combine all data points into one cluster; we
should stop combining clusters at some point. Scikit-learn provides two options for this (see the
sketch after this list):
Stop after a number of clusters is reached (n_clusters)
Set a threshold value for the linkage distance (distance_threshold). If the distance between two
clusters is above the threshold, these clusters will not be merged.
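A rough sketch of both stopping options with scikit-learn's AgglomerativeClustering, on arbitrary toy data:

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.random.RandomState(0).rand(100, 2)   # toy data just for illustration

# Option 1: stop when a fixed number of clusters is reached
agg_k = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

# Option 2: stop merging once the linkage distance exceeds a threshold
# (n_clusters must be set to None in this case)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                linkage="average").fit(X)

print("labels with n_clusters=3:", np.unique(agg_k.labels_))
print("clusters found with threshold:", agg_t.n_clusters_)
```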
Divisive Clustering
Divisive clustering is not commonly used in real life, so I will mention it briefly. A simple yet clear
explanation is that divisive clustering is the opposite of agglomerative clustering. We start with
one giant cluster including all data points. Then data points are separated into different clusters.
It is a top-down approach.