06 - K Means Clustering
K-MEANS CLUSTERING
Clustering
• Basic idea: group together similar instances
Example: 2D point patterns
– Unsupervised learning
– Requires data, but no labels
– Detect patterns e.g. in
• Group emails or search results
• Customer shopping patterns
• Regions of images
– Useful when you don't know what you're looking for
Clustering
– Clustering results are crucially dependent on the measure of similarity (or distance) between the "points" to be clustered.
[Scatter plot: example data points, Variable 2 (x-axis) vs. Variable 1 (y-axis)]
A Simple Example: k-means (using K=2)
Step 1:
Initialization: we randomly choose the following two centroids (K=2) for the two clusters. In this case the two centroids are C1 = (1.0, 1.0) and C2 = (5.0, 7.0).
A Simple Example: k-means (using K=2)
Step 2: compute the Euclidean distance from each point to each centroid, and assign each point to its nearest centroid.

Point   Distance to Centroid 1 (1.0, 1.0)      Distance to Centroid 2 (5.0, 7.0)
1       √((1 − 1)² + (1 − 1)²) = 0             √((5 − 1)² + (7 − 1)²) = 7.21
3       √((1 − 3)² + (1 − 4)²) = 3.61          √((5 − 3)² + (7 − 4)²) = 3.61
4       √((1 − 5)² + (1 − 7)²) = 7.21          √((5 − 5)² + (7 − 7)²) = 0
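As a quick check of the arithmetic in the table, here is a minimal numpy sketch that recomputes these distances. The point coordinates are inferred from the formulas above, and only the points listed in the table are included:

import numpy as np

# Points and initial centroids from the slide (coordinates inferred from the
# distance formulas above).
points = {1: (1.0, 1.0), 3: (3.0, 4.0), 4: (5.0, 7.0)}
c1, c2 = np.array([1.0, 1.0]), np.array([5.0, 7.0])

for name, p in points.items():
    p = np.array(p)
    d1 = np.linalg.norm(p - c1)   # Euclidean distance to centroid 1
    d2 = np.linalg.norm(p - c2)   # Euclidean distance to centroid 2
    print(name, round(d1, 2), round(d2, 2), "-> cluster", 1 if d1 <= d2 else 2)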
A Simple Example: k-means (using K=2)
Step 3: recompute each centroid as the mean of the points assigned to it, then recompute the distances.

Point   Distance to Centroid 1 (1.83, 2.33)        Distance to Centroid 2 (4.12, 5.38)
7       √((1.83 − 3.5)² + (2.33 − 4.5)²) = 2.74    √((4.12 − 3.5)² + (5.38 − 4.5)²) = 1.08
A Simple Example: k-means (using K=2)
Step 3 (continued): after reassigning the points, the centroids are updated again:
C1 = (1.25, 1.5)
C2 = (3.9, 5.1)
A Simple Example: k-means (using K=2)
Step 4: recompute the distances using the updated centroids.

Point   Distance to Centroid 1 (1.25, 1.5)        Distance to Centroid 2 (3.9, 5.1)
7       √((1.25 − 3.5)² + (1.5 − 4.5)²) = 3.75    √((3.9 − 3.5)² + (5.1 − 4.5)²) = 0.72
A Simple Example k-means (using K=2)
➢ Therefore, there is no change in the cluster assignments, and the algorithm has converged.
[Scatter plot: final clusters of the example points, Variable 2 (x-axis) vs. Variable 1 (y-axis)]
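The loop the example walks through (assign points to the nearest centroid, update the centroids, repeat until nothing changes) can be sketched in plain numpy. This is a generic illustration with made-up points, not the exact data from the slides:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random points from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # (Empty clusters are not handled; this is only a sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: no change
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 4.5]])
centroids, labels = kmeans(X, k=2)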
Another Example k-means
➢ Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.
Another Example k-means
➢ Step 1: Use initial seed points for partitioning
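A possible sklearn sketch of this setup, using hypothetical pH/weight values and the first two medicines as the initial seed points (the slide does not list the actual numbers, so these are placeholders):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (pH, weight index) values for the four medicines.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [4.0, 3.0],
              [5.0, 4.0]])

# Step 1: use the first two medicines as the initial seed points.
initial_seeds = X[:2]
kmeans = KMeans(n_clusters=2, init=initial_seeds, n_init=1).fit(X)
print(kmeans.labels_)           # cluster assignment of each medicine
print(kmeans.cluster_centers_)  # final centroids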
Disadvantages of k-means
1. The value of k must be chosen manually.
2. Results depend on the initial centroid values.
3. Performance degrades as the number of dimensions grows.
Example
We will be using California housing data from Kaggle. We will use the location data (latitude and longitude) as well as the median house value. We will cluster the houses by location and observe how house prices fluctuate across California. We save the dataset as a CSV file called ‘housing.csv’ in our working directory and read it using pandas.
• longitude: A value representing how far west a house is. Higher values represent houses that are further west.
• latitude: A value representing how far north a house is. Higher values represent houses that are further north.
• median_house_value: The median house price within a block, measured in USD.
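A minimal sketch of loading the file with pandas, assuming the column names described above:

import pandas as pd

# Read the dataset saved as 'housing.csv' in the working directory.
housing = pd.read_csv("housing.csv")

# Keep only the columns used in this example.
housing = housing[["latitude", "longitude", "median_house_value"]]
print(housing.head())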
Example
Visualize the Data
We start by visualizing our housing data. We look at the location data with a heatmap based on the median price in a block. We will use Seaborn to quickly create plots.
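One way such a plot could be produced (a scatter plot of locations colored by median price is assumed here):

import seaborn as sns
import matplotlib.pyplot as plt

# Plot house locations, colored by the median house value in each block.
sns.scatterplot(data=housing, x="longitude", y="latitude", hue="median_house_value")
plt.show()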
Next, we split the data into training and test sets and normalize them using the preprocessing.normalize() method from sklearn.
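A sketch of that step; the train/test split parameters are assumptions, since the text only says the data are split and normalized:

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Cluster on location only; keep the median house value aside for inspecting
# the clusters later.
X_train, X_test, y_train, y_test = train_test_split(
    housing[["latitude", "longitude"]],
    housing["median_house_value"],
    test_size=0.33,
    random_state=0,
)

# preprocessing.normalize() rescales each row (sample) to unit norm.
X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)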
Example
Fitting and Evaluating the Model
For the first iteration, we will arbitrarily choose a number of clusters (referred to as k) of 3. Building and fitting models in sklearn is very simple. We will create an instance of KMeans, define the number of clusters using the n_clusters attribute, set n_init, which defines the number of times the algorithm will be run with different centroid seeds, to “auto,” and set the random_state to 0 so we get the same result each time we run the code. We can then fit the model to the normalized training data using the fit() method.
Once the model is fit, we can access the cluster labels from the labels_ attribute. Below, we visualize the data we just fit.
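Along the lines described above, a sketch of the fit and the follow-up plot (n_init="auto" assumes a reasonably recent sklearn release):

from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt

# Fit k-means with k = 3 on the normalized training data.
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=0)
kmeans.fit(X_train_norm)

# Visualize the result: color the training points by their cluster label.
sns.scatterplot(x=X_train["longitude"], y=X_train["latitude"], hue=kmeans.labels_)
plt.show()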
Example
We see that the data are now clearly split into 3 distinct groups (Northern California, Central California, and Southern California).
We can evaluate the performance of the clustering algorithm using the silhouette score, which is part of sklearn.metrics, where a higher score represents a better fit.
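For example, using the fitted model from above:

from sklearn.metrics import silhouette_score

# Silhouette score lies in [-1, 1]; higher values indicate better-separated clusters.
print(silhouette_score(X_train_norm, kmeans.labels_))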
Example
Choosing the best number of clusters
The weakness of k-means clustering is that we don’t know how many clusters we need by just running the
model. We need to test ranges of values and make a decision on the best value of k. We typically make a
decision using the Elbow method to determine the optimal number of clusters where we are both not overfitting
the data with too many clusters, and also not underfitting with too few.
We create the below loop to test and store different model results so that we can make a decision on the best
number of clusters.
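A sketch of such a loop, storing each fitted model and its silhouette score (the range of k tried is an assumption):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try a range of candidate values for k.
K = range(2, 8)
fits, scores = [], []

for k in K:
    # Fit one model per candidate number of clusters and store the results.
    model = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(X_train_norm)
    fits.append(model)
    scores.append(silhouette_score(X_train_norm, model.labels_))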
Example
We can then visually look at a few different values of k.
First we look at k = 2.
Example
The model does a reasonable job of splitting the state into two halves, but it probably doesn't capture enough nuance in the California housing market.
Next, we look at k = 4.
Example
We see that this plot groups California into more logical clusters based on how far north or south the houses are in the state. This model most likely captures more nuance in the housing market as we move across the state.
Finally, we look at k = 7.
Example
The above graph appears to have too many clusters. We have sacrificed easy interpretation of the clusters for a “more accurate” geo-clustering result.
Typically, as we increase the value of k, we see improvements in the clusters and what they represent, up to a certain point. We then start to see diminishing returns or even worse performance. We can visualize this, to help make a decision on the value of k, by using a line plot where the y-axis is a measure of goodness of fit and the x-axis is the value of k.
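A sketch of that line plot, reusing the scores collected in the loop above:

import seaborn as sns
import matplotlib.pyplot as plt

# Goodness of fit (here: silhouette score) against the number of clusters k.
sns.lineplot(x=list(K), y=scores, marker="o")
plt.xlabel("k")
plt.ylabel("silhouette score")
plt.show()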