06 - K Means Clustering

The document provides an overview of K-means clustering, an unsupervised machine learning algorithm used to group similar instances without labeled data. It explains the algorithm's steps, advantages, and disadvantages, along with practical examples, including clustering California housing data based on geographical and price attributes. The document also discusses the importance of normalizing data and selecting the optimal number of clusters using methods like the Elbow method.

Machine Learning Algorithms

K-MEANS CLUSTERING
Clustering
• Basic idea: group together similar instances
  Example: 2D point patterns

– Unsupervised learning
– Requires data, but no labels
– Detects patterns, e.g. in:
  • Grouping emails or search results
  • Customer shopping patterns
  • Regions of images
– Useful when you don't know what you're looking for
Clustering
– Clustering results depend crucially on the measure of similarity (or distance) between the "points" to be clustered.
– One option: treat points as similar when the (squared) Euclidean distance between them is small.
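For reference (an added formula, not from the original slide): the Euclidean distance between two d-dimensional points x and y, and its squared form used as the similarity measure here, are

$$ d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \qquad d(x, y)^2 = \lVert x - y \rVert^2 = \sum_{i=1}^{d} (x_i - y_i)^2 $$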
How does the K-Means clustering algorithm work?
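Before the worked examples, here is a minimal NumPy sketch of the loop those examples walk through (an illustrative implementation, not the lecture's code): initialize centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the assignments stop changing.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means on an (n, d) array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this simple sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice, library implementations such as sklearn's KMeans (used later in this document) add smarter initialization and multiple restarts on top of this basic loop.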
K-means clustering: Example
A Simple Example k-means (using K=2)

We have seven data points to cluster. Their coordinates (as used in the distance calculations below) are:

Point 1: (1.0, 1.0)
Point 2: (1.5, 2.0)
Point 3: (3.0, 4.0)
Point 4: (5.0, 7.0)
Point 5: (3.5, 5.0)
Point 6: (4.5, 5.0)
Point 7: (3.5, 4.5)

[Scatter plot of the seven points, Variable 1 vs. Variable 2]
A Simple Example k-means (using K=2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are C1 = (1.0, 1.0) and C2 = (5.0, 7.0).
A Simple Example k-means (using K=2)
Step 2:
We compute the Euclidean distance from each point to both centroids and assign each point to the nearer one.

Point   Distance to Centroid 1 (1.0, 1.0)   Distance to Centroid 2 (5.0, 7.0)
1 √( 1 – 1 )² +( 1 – 1 )² = 0 √( 5 – 1 )² + ( 7 – 1 )² = 7.21

2 √( 1 – 1.5 )² +( 1 – 2 )² = 1.12 √( 5 – 1.5 )² + ( 7 – 2 )² = 6.10

3 √( 1 – 3 )² +( 1 – 4 )² = 3.61 √( 5 – 3 )² + ( 7 – 4 )² = 3.61

4 √( 1 – 5 )² +( 1 – 7 )² = 7.21 √( 5 – 5 )² + ( 7 – 7 )² = 0

5 √( 1 – 3.5 )² +( 1 – 5 )² = 4.72 √( 5 – 3.5 )² + ( 7 – 5 )² = 2.5

6 √( 1 – 4.5 )² +( 1 – 5 )² = 5.31 √( 5 – 4.5 )² + ( 7 – 5 )² = 2.06

7 √( 1 – 3.5 )² +( 1 – 4.5 )² = 4.30 √( 5 – 3.5 )² + ( 7 – 4.5 )² = 2.92


Step 2 (result):
Based on these distances, the clusters are {1, 2, 3} and {4, 5, 6, 7} (point 3 is equidistant from both centroids and is placed in cluster 1).
The new centroids are C1 = (1.83, 2.33) and C2 = (4.12, 5.38).

A Simple Example k-means (using K=2)
Step 3:
Point   Distance to Centroid 1 (1.83, 2.33)   Distance to Centroid 2 (4.12, 5.38)

1 √( 1.83 – 1 )² +(2.33 – 1 )² = 1.57 √( 4.12 – 1 )² + ( 5.38 – 1 )² = 5.38

2 √(1.83 – 1.5 )² + (2.33 – 2 )² = 0.47 √( 4.12 – 1.5 )² + ( 5.38 – 2 )² = 4.29

3 √(1.83 – 3 )² + (2.33 – 4 )² = 2.04 √( 4.12 – 3 )² + ( 5.38 – 4 )² = 1.78

4 √(1.83 – 5 )² + (2.33 – 7 )² = 5.64 √( 4.12 – 5 )² + ( 5.38 – 7 )² = 1.84

5 √(1.83 – 3.5 )² + (2.33 – 5 )² = 3.15 √( 4.12 – 3.5 )² + ( 5.38 – 5 )² = 0.73

6 √(1.83 – 4.5 )² + (2.33 – 5 )² = 3.78 √( 4.12 – 4.5 )² + ( 5.38 – 5 )² = 0.54

7 √(1.83 – 3.5 )² + (2.33 – 4.5 )² = 2.74 √( 4.12 – 3.5 )² + ( 5.38 – 4.5 )² = 1.08
A Simple Example k-means (using K=2)
Step 3:

Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}, with new centroids:

C1 = (1.25, 1.5)
C2 = (3.9, 5.1)
A Simple Example k-means (using K=2)
Step 4:
Point   Distance to Centroid 1 (1.25, 1.5)   Distance to Centroid 2 (3.9, 5.1)

1 √( 1.25 – 1 )² + ( 1.5 – 1 )² = 0.56      √( 3.9 – 1 )² + ( 5.1 – 1 )² = 5.02

2 √( 1.25 – 1.5 )² + ( 1.5 – 2 )² = 0.56    √( 3.9 – 1.5 )² + ( 5.1 – 2 )² = 3.92

3 √( 1.25 – 3 )² + ( 1.5 – 4 )² = 3.05      √( 3.9 – 3 )² + ( 5.1 – 4 )² = 1.42

4 √(1.25 – 5 )² + (1.5 – 7 )² = 6.66 √( 3.9 – 5 )² + ( 5.1 – 7 )² = 2.20

5 √(1.25 – 3.5 )² + (1.5 – 5 )² = 4.16 √( 3.9 – 3.5 )² + ( 5.1 – 5 )² = 0.41

6 √(1.25 – 4.5 )² + (1.5 – 5 )² = 4.78 √( 3.9 – 4.5 )² + ( 5.1 – 5 )² = 0.61

7 √(1.25 – 3.5 )² + (1.5 – 4.5 )² = 3.75 √( 3.9 – 3.5 )² + ( 5.1 – 4.5 )² = 0.72
A Simple Example k-means (using K=2)
➢ Therefore, there is no change in the clusters.

➢ Thus, the algorithm comes to a halt here.

➢ The final result consists of 2 clusters: {1, 2} and {3, 4, 5, 6, 7}.

[Scatter plot of the final clustering, Variable 1 vs. Variable 2: cluster {1, 2} and cluster {3, 4, 5, 6, 7}]
Another Example k-means
➢ Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.
Another Example k-means
➢ Step 1: Use initial seed points for partitioning.

Assign each object to the cluster with the nearest seed point.
Another Example k-means
➢ Step 2: Compute new centroids of the current partition.

Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.
Another Example k-means

➢ Step 2: Renew membership based on the new centroids.

Compute the distance of all objects to the new centroids and assign membership to each object accordingly.

Another Example k-means

➢ Step 3: Repeat the first two steps until convergence.

Knowing the members of each cluster, we again compute the new centroid of each group based on these new memberships.
Another Example k-means

➢ Step 3: Repeat the first two steps until convergence.

Compute the distance of all objects to the new centroids.

Stop: there are no new assignments, and the membership of each cluster no longer changes.
Exercise: For the medicine data set, use K-means with the Manhattan distance metric for clustering analysis, setting K = 2 and initializing the seeds as C1 = A and C2 = C.
Answer the following three questions (a code sketch for the computation follows below):
1) How many steps are required for convergence?
2) What are the memberships of the two clusters after convergence?
3) What are the centroids of the two clusters after convergence?
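The following is a hedged sketch of how the exercise could be checked in code. The actual pH and weight-index values appear only in the slide figure and are not reproduced here, so the `points` array is left as a placeholder; the update rule below uses the cluster mean, as in the worked examples above.

```python
import numpy as np

def manhattan_kmeans(X, init_centroids, max_iters=100):
    """K-means with the Manhattan (L1) distance; returns (steps, labels, centroids)."""
    centroids = init_centroids.astype(float).copy()
    for step in range(1, max_iters + 1):
        # Assign each object to the centroid with the smallest L1 distance.
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster's members.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            return step, labels, centroids  # converged
        centroids = new_centroids
    return max_iters, labels, centroids

# Example call (fill in the [pH, weight index] values of A, B, C, D from the slide):
# points = np.array([[pA, wA], [pB, wB], [pC, wC], [pD, wD]])
# steps, labels, centroids = manhattan_kmeans(points, init_centroids=points[[0, 2]])
```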
k-means
Advantages of k-means
1. Relatively simple to implement.
2. Scales to large data sets.
3. Guarantees convergence.
4. Easily adapts to new examples.

Disadvantages of k-means
1. The number of clusters k must be chosen manually.
2. Results depend on the initial centroid values (see the sketch below for common mitigations).
3. Performance degrades as the number of dimensions grows.
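As an aside on disadvantage 2, most library implementations soften the dependence on initial values with smarter seeding and multiple restarts. A minimal sklearn sketch (parameter names are sklearn's, not from the slides):

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=3,      # k still has to be chosen manually (disadvantage 1)
    init="k-means++",  # spread-out initial centroids instead of purely random ones
    n_init=10,         # run 10 different initializations and keep the best result
    random_state=0,    # make the seeding reproducible
)
```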
Example
We will be using California housing data from Kaggle. We will use location data (latitude and longitude) as well
as the median house value. We will cluster the houses by location and observe how house prices fluctuate
across California. We save the dataset as a csv file called ‘housing.csv’ in our working directory and read it
using pandas.
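A minimal sketch of that loading step, assuming the file name and column names described in this example (the exact Kaggle schema may differ):

```python
import pandas as pd

# Read the saved dataset and keep the three variables used in this example.
housing = pd.read_csv("housing.csv")
X = housing[["latitude", "longitude", "median_house_value"]]
print(X.head())
```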

The data include 3 variables:

• longitude: A value representing how far west a house is. Higher values represent houses that are further west.
• latitude: A value representing how far north a house is. Higher values represent houses that are further north.
• median_house_value: The median house price within a block, measured in USD.
Example
Visualize the Data
We start by visualizing our housing data. We look at the location data with a heatmap based on the median price in a block. We will use Seaborn to quickly create plots.
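One plausible way to produce that view with Seaborn (a sketch; the original notebook code is not shown) is a scatter plot of longitude against latitude, colored by median house value:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=housing, x="longitude", y="latitude",
                hue="median_house_value", palette="viridis", s=10)
plt.title("Median house value by location")
plt.show()
```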

We see that most of the expensive houses are on the west coast of California, with different areas that have clusters of moderately priced houses. This is expected, as waterfront properties are typically worth more than houses that are not on the coast.

Clusters are often easy to spot when you are only using 2 features. It becomes increasingly difficult or impossible when the number of features exceeds 2.
Example
Normalizing the Data
When working with distance-based algorithms, like k-Means Clustering, we must normalize the data. If we do
not normalize the data, variables with different scaling will be weighted differently in the distance formula that is
being optimized during training.

Next, we normalize the training and test data using the preprocessing.normalize() method from sklearn.
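A minimal sketch of this step; the train/test split itself is assumed here, since the text does not show how it was created:

```python
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Hold out a test set, then scale each row to unit norm with preprocessing.normalize.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)
```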
Example
Fitting and Evaluating the Model
For the first iteration, we will arbitrarily choose a number of clusters (referred to as k) of 3. Building and fitting models in sklearn is very simple. We will create an instance of KMeans, define the number of clusters using the n_clusters parameter, set n_init, which defines the number of times the algorithm is run with different centroid seeds, to "auto", and set random_state to 0 so we get the same result each time we run the code. We can then fit the model to the normalized training data using the fit() method.

Once the data are fit, we can access labels from the labels_
attribute. Below, we visualize the data we just fit.
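A sketch of the fitting step as described (n_init="auto" requires a reasonably recent version of sklearn):

```python
from sklearn.cluster import KMeans

# Fit K-means with k = 3 on the normalized training data.
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=0)
kmeans.fit(X_train_norm)

# Cluster assignment for each training point, used for the plots that follow.
labels = kmeans.labels_
```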
Example
We see that the data are now clearly split into 3 distinct groups (Northern California, Central California, and
Southern California).

We can also look at the distribution of median house prices in these 3 groups using a boxplot.
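A hedged sketch of that boxplot, reusing the training data and labels from the fitting step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Attach the cluster labels to the (un-normalized) training data and plot
# the distribution of median house values within each cluster.
plot_df = X_train.copy()
plot_df["cluster"] = labels
sns.boxplot(data=plot_df, x="cluster", y="median_house_value")
plt.show()
```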
Example
We clearly see that the Northern and Southern clusters have similar distributions of median house values
(clusters 0 and 2) that are higher than the prices in the central cluster (cluster 1).

We can evaluate the performance of the clustering algorithm using the silhouette score, which is part of sklearn.metrics. Scores range from -1 to 1, and a higher score represents a better fit.
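A sketch of that evaluation step:

```python
from sklearn.metrics import silhouette_score

# Silhouette score of the k = 3 fit on the normalized training data.
score = silhouette_score(X_train_norm, kmeans.labels_, metric="euclidean")
print(f"Silhouette score: {score:.3f}")
```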
Example
Choosing the best number of clusters
The weakness of k-means clustering is that we don’t know how many clusters we need by just running the
model. We need to test ranges of values and make a decision on the best value of k. We typically make a
decision using the Elbow method to determine the optimal number of clusters where we are both not overfitting
the data with too many clusters, and also not underfitting with too few.

We create the below loop to test and store different model results so that we can make a decision on the best
number of clusters.
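A hedged sketch of such a loop, scoring each value of k with the silhouette score and plotting the result so the elbow can be read off (the same approach works with inertia as the goodness-of-fit measure):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 11)
scores = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(X_train_norm)
    scores.append(silhouette_score(X_train_norm, model.labels_))

# Plot score against k and look for where the improvements flatten out.
plt.plot(list(k_values), scores, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("silhouette score")
plt.show()
```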
Example
We can then first visually look at a few different values of k.

First we look at k = 2.
Example
The model does an ok job of splitting the state into two halves, but probably doesn’t capture enough nuance in
the California housing market.

Next, we look at k = 4.
Example
We see this plot groups California into more logical clusters across the state based on how far North or South
the houses are in the state. This model most likely captures more nuance in the housing market as we move
across the state.
Finally, we look at k = 7.
Example
The above graph appears to have too many clusters. We have sacrificed easy interpretation of the clusters for a "more accurate" geo-clustering result.

Typically, as we increase the value of k, we see improvements in the clusters and what they represent, up to a certain point. We then start to see diminishing returns or even worse performance. We can visualize this to help make a decision on the value of k by using a line plot where the y-axis is a measure of goodness of fit and the x-axis is the value of k.

We typically choose the point where the improvements in performance start to flatten or get worse. We see that k = 5 is probably the best we can do without overfitting.

We can also see that the clusters do a relatively good job of breaking California into distinct groups, and these clusters map relatively well to different price ranges, as seen below.
Example
[Figure: final clustering (k = 5) and the corresponding distribution of median house values per cluster]