K-Means Clustering
k-means clustering is a method of classifying/grouping items into k groups (where k is the number of prechosen groups). The grouping is done by minimizing the sum of squared distances (Euclidean distances)
between items and the corresponding centroid.
A centroid is "the center of mass of a geometric object of uniform density", though here, we'll consider
mean vectors as centroids.
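As a small illustration (the function name and toy data are our own, not from the source), the quantity k-means minimizes can be computed like this:

```python
def wcss(clusters, centroids):
    # Within-cluster sum of squared Euclidean distances: the quantity
    # k-means tries to minimize. Each cluster is a list of points and
    # is paired with its mean vector (centroid).
    return sum((x - m) ** 2
               for cluster, centroid in zip(clusters, centroids)
               for point in cluster
               for x, m in zip(point, centroid))

# Two toy clusters, each paired with its mean vector:
print(wcss([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]],
           [(1.0, 0.0), (5.0, 5.0)]))  # 2.0
```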
Figure 1. A clustered scatter plot. The black dots are data points. The red lines illustrate the partitions
created by the k-means algorithm. The blue dots represent the centroids which define the partitions.
Choosing the Initial Cluster Means
Dynamically Chosen: This method works well when the amount of data is expected to grow. The initial cluster means can simply be the first few items of data from the set. For instance, if the data will be grouped into 3 clusters, then the initial cluster means will be the first 3 items of data.
Randomly Chosen: Almost self-explanatory: the initial cluster means are randomly chosen values within the same range as the highest and lowest of the data values.
Choosing from Upper and Lower Bounds: Depending on the types of data in the set, the highest and lowest values (or at least the extremities) of the data range are chosen as the initial cluster means. The example below uses this method.
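The three strategies above might be sketched as follows (the function names are illustrative, not from the source):

```python
import random

def init_dynamic(data, k):
    # Dynamically chosen: the first k items serve as the initial means.
    return [list(p) for p in data[:k]]

def init_random(data, k):
    # Randomly chosen: values drawn within each variable's observed range.
    dims = range(len(data[0]))
    lows = [min(p[d] for p in data) for d in dims]
    highs = [max(p[d] for p in data) for d in dims]
    return [[random.uniform(lows[d], highs[d]) for d in dims]
            for _ in range(k)]

def init_bounds(data):
    # Upper and lower bounds: the two items furthest apart (by squared
    # Euclidean distance) become the initial means.
    a, b = max(((p, q) for p in data for q in data),
               key=lambda pair: sum((x - y) ** 2 for x, y in zip(*pair)))
    return [list(a), list(b)]
```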
The Algorithm
This outline of the algorithm assumes two clusters, and that each individual's scores include two variables (as in the example below). Adding more clusters is as easy as adding another step like Step 4 or Step 5. Adding another variable for each individual is as easy as adding another calculation within each such step.
Step 1:
Choose the number of clusters.
Step 2:
Set the initial partition, and the initial mean vectors for each cluster.
Step 3:
For each remaining individual...
Step 4:
Get averages for comparison to Cluster 1:
Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by the total
number of scores that were summed.
Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by the total
number of scores that were summed.
Step 5:
Get averages for comparison to Cluster 2:
Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total
number of scores that were summed.
Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total
number of scores that were summed.
Step 6:
If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to
Cluster 1, and the averages found now become the new mean vectors for Cluster 1.
If closer to Cluster 2, then it goes to Cluster 2, along with the averages as new mean vectors.
Step 7:
If there are more individuals to process, continue again with Step 4. Otherwise go to Step 8.
Step 8:
Now compare each individual's distance to its own cluster's mean vector, and to that of the opposite cluster. The distance to its own cluster's mean vector should be smaller than its distance to the other vector. If not, relocate the individual to the opposite cluster.
Step 9:
If any relocations occurred in Step 8, the algorithm must continue again with Step 3, using all individuals
and the new mean vectors.
If no relocations occurred, stop. Clustering is complete.
In case the algorithm never settles on a final solution, it may be a good idea to implement a maximum number of iterations check.
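The steps above can be sketched in Python. This is a simplified reading (our own, not the source's exact procedure): each remaining individual is assigned to the nearest current mean, which has the same effect as comparing candidate averages in Steps 4-6, and a max_iter bound guards the relocation loop:

```python
import math

def mean(points):
    # Mean vector (centroid) of a list of equal-length tuples.
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))

def kmeans_incremental(data, seeds, max_iter=100):
    """Steps 2-9: seed each cluster, assign the remaining individuals to
    the nearest running mean, then relocate until stable (bounded by
    max_iter in case the algorithm never settles)."""
    clusters = [[s] for s in seeds]
    for p in (q for q in data if q not in seeds):      # Steps 3-7
        centroids = [mean(c) for c in clusters]
        best = min(range(len(clusters)),
                   key=lambda i: math.dist(p, centroids[i]))
        clusters[best].append(p)
    for _ in range(max_iter):                          # Steps 8-9
        centroids = [mean(c) for c in clusters]
        moved = False
        for i, cluster in enumerate(clusters):
            for p in list(cluster):
                nearest = min(range(len(clusters)),
                              key=lambda j: math.dist(p, centroids[j]))
                if nearest != i:
                    cluster.remove(p)
                    clusters[nearest].append(p)
                    moved = True
        if not moved:
            break
    return clusters

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
c1, c2 = kmeans_incremental(points, seeds=[(1.0, 1.0), (5.0, 7.0)])
print(mean(c1), mean(c2))  # (1.25, 1.5) (3.9, 5.1)
```

Run on the data set used later in this article, the sketch reproduces the final partition of the worked example.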
Weaknesses:
With fewer samples of data, the initial grouping significantly determines the resulting clusters.
We never know which variable contributes more to the clustering process, since each variable is assumed to have the same weight.
The accuracy of mathematical averaging weakens because of outliers, which may pull a centroid away from its true position.
The results are clusters with circular or spherical shapes, because of the use of Euclidean distance.
Possible Solutions:
Include as many samples of data as possible (the more data, the more accurate the
results).
To avoid distortions caused by excessive outliers, the median can be used instead of the mean.
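To illustrate the median suggestion above (the function name and toy data are our own), a centroid can be computed coordinate-wise with the median, which an extreme outlier barely moves:

```python
import statistics

def median_centroid(points):
    # Coordinate-wise median: far less sensitive to outliers than the mean.
    return tuple(statistics.median(p[d] for p in points)
                 for d in range(len(points[0])))

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (100.0, 100.0)]  # one extreme outlier
print(median_centroid(pts))  # (2.25, 3.0) -- the mean would be about (26.4, 26.8)
```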
As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of
two variables on each of seven individuals:
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
Step   Cluster 1                         Cluster 2
       Individual    Mean Vector         Individual    Mean Vector
                     (centroid)                        (centroid)
1      1             (1.0, 1.0)          4             (5.0, 7.0)
2      1, 2          (1.2, 1.5)          4             (5.0, 7.0)
3      1, 2, 3       (1.8, 2.3)          4             (5.0, 7.0)
4      1, 2, 3       (1.8, 2.3)          4, 5          (4.2, 6.0)
5      1, 2, 3       (1.8, 2.3)          4, 5, 6       (4.3, 5.7)
6      1, 2, 3       (1.8, 2.3)          4, 5, 6, 7    (4.1, 5.4)
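These running means can be reproduced with a short trace (a sketch in our own names; the assignment order 2, 3, 5, 6, 7 follows the example, and rounding to one decimal is for display only):

```python
points = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
          5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}

def dist(p, q):
    # Euclidean distance between two (A, B) pairs.
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def mean(ids):
    # Mean vector (centroid) of the listed individuals.
    pts = [points[i] for i in ids]
    return (sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts))

c1, c2 = [1], [4]            # initial partition: the two seeds
for i in (2, 3, 5, 6, 7):    # remaining individuals, in sequence
    target = c1 if dist(points[i], mean(c1)) <= dist(points[i], mean(c2)) else c2
    target.append(i)
    print(i, [round(v, 1) for v in mean(c1)], [round(v, 1) for v in mean(c2)])
# final clusters: c1 = [1, 2, 3], c2 = [4, 5, 6, 7]
```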
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
            Individual     Mean Vector (centroid)
Cluster 1   1, 2, 3        (1.8, 2.3)
Cluster 2   4, 5, 6, 7     (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster. And we find:
Individual   Distance to mean          Distance to mean
             (centroid) of Cluster 1   (centroid) of Cluster 2
1            1.5                       5.4
2            0.4                       4.3
3            2.1                       1.8
4            5.7                       1.8
5            3.2                       0.7
6            3.8                       0.6
7            2.8                       1.1
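Step 8's check, applied to the table above, can be written directly (a sketch; the centroids are the rounded means from the example):

```python
points = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
          5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}
clusters = {1: [1, 2, 3], 2: [4, 5, 6, 7]}
centroids = {1: (1.8, 2.3), 2: (4.1, 5.4)}

def dist(p, q):
    # Euclidean distance between two (A, B) pairs.
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# An individual is misplaced if some other cluster's centroid is closer
# than its own cluster's centroid.
misplaced = [i for c, members in clusters.items() for i in members
             if dist(points[i], centroids[c]) >
                min(dist(points[i], m) for m in centroids.values())]
print(misplaced)  # [3]
```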
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
            Individual       Mean Vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer its own cluster mean than that of the other cluster, so the iteration stops, and the latest partitioning is the final cluster solution.
Also, it is possible that the k-means algorithm won't find a final solution. In this case it would be a good
idea to consider stopping the algorithm after a pre-chosen maximum of iterations.