Clustering With K-Means


Introduction

This lesson and the next make use of what are known as unsupervised learning algorithms. Unsupervised
algorithms don't make use of a target; instead, their purpose is to learn some property of the data, to
represent the structure of the features in a certain way. In the context of feature engineering for prediction,
you could think of an unsupervised algorithm as a "feature discovery" technique.

Clustering simply means the assigning of data points to groups based upon how similar the points are to
each other. A clustering algorithm makes "birds of a feather flock together," so to speak.

When used for feature engineering, we could attempt to discover groups of customers representing a
market segment, for instance, or geographic areas that share similar weather patterns. Adding a feature of
cluster labels can help machine learning models untangle complicated relationships of space or proximity.

Cluster Labels as a Feature

Applied to a single real-valued feature, clustering acts like a traditional "binning" or "discretization"
(https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html)
transform. On multiple features, it's like "multi-dimensional binning" (sometimes called vector quantization).
Left: Clustering a single feature. Right: Clustering across two features.

Added to a dataframe, a feature of cluster labels might look like this:

Longitude   Latitude   Cluster
-93.619     42.054     3
-93.619     42.053     3
-93.638     42.060     1
-93.602     41.988     0
It's important to remember that this Cluster feature is categorical. Here, it's shown with a label encoding
(that is, as a sequence of integers) as a typical clustering algorithm would produce; depending on your
model, a one-hot encoding may be more appropriate.
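A one-hot encoding is only a line of pandas. A minimal sketch, using a small hypothetical dataframe matching the table above:

import pandas as pd

# Hypothetical dataframe matching the table above
df = pd.DataFrame({
    "Longitude": [-93.619, -93.619, -93.638, -93.602],
    "Latitude": [42.054, 42.053, 42.060, 41.988],
    "Cluster": [3, 3, 1, 0],
})

# Mark the integer labels as categorical, then expand them into
# one indicator column per cluster (Cluster_0, Cluster_1, Cluster_3)
df["Cluster"] = df["Cluster"].astype("category")
df = pd.get_dummies(df, columns=["Cluster"])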

The motivating idea for adding cluster labels is that the clusters will break up complicated relationships
across features into simpler chunks. Our model can then learn the simpler chunks one by one instead
of having to learn the complicated whole all at once. It's a "divide and conquer" strategy.

Clustering the YearBuilt feature helps this linear model learn its relationship to SalePrice.

The figure shows how clustering can improve a simple linear model. The curved relationship between
YearBuilt and SalePrice is too complicated for this kind of model -- it underfits. On smaller chunks,
however, the relationship is almost linear, and that is something the model can learn easily.
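If you'd like to reproduce this kind of piecewise fit, scikit-learn's KBinsDiscretizer supports a k-means binning strategy. Here is a sketch on synthetic data standing in for YearBuilt and SalePrice (the real Ames columns aren't loaded here):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-ins for Ames' YearBuilt and SalePrice
rng = np.random.default_rng(0)
year_built = rng.integers(1900, 2010, size=(500, 1))
sale_price = 50 + 0.002 * (year_built.ravel() - 1900) ** 2 + rng.normal(0, 5, 500)

# One-hot bins found by k-means let the linear model fit each
# chunk of the curve separately instead of one global line
model = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="kmeans"),
    LinearRegression(),
)
model.fit(year_built, sale_price)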

k-Means Clustering

There are a great many clustering algorithms. They differ primarily in how they measure "similarity" or
"proximity" and in what kinds of features they work with. The algorithm we'll use, k-means, is intuitive and
easy to apply in a feature engineering context. Depending on your application another algorithm might be
more appropriate.

K-means clustering measures similarity using ordinary straight-line distance (Euclidean distance, in other
words). It creates clusters by placing a number of points, called centroids, inside the feature-space. Each
point in the dataset is assigned to the cluster of whichever centroid it's closest to. The "k" in "k-means" is
how many centroids (that is, clusters) it creates. You define the k yourself.
You could imagine each centroid capturing points through a sequence of radiating circles. When sets of
circles from competing centroids overlap, they form a line. The result is what's called a Voronoi tessellation.
The tessellation shows you which clusters future data will be assigned to; the tessellation is essentially what
k-means learns from its training data.
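Prediction is nothing more than this nearest-centroid rule. A tiny NumPy sketch (with made-up centroids and a made-up point) of what KMeans.predict effectively does:

import numpy as np

# Hypothetical learned centroids in a two-feature space
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

def assign(point, centroids):
    # Euclidean distance from the point to every centroid;
    # the index of the closest one is the cluster label
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances))

print(assign(np.array([1.0, 4.0]), centroids))  # -> 2, the nearest centroid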

The clustering on the Ames (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
dataset above is a k-means clustering. Here is the same figure with the tessellation and centroids shown.

K-means clustering creates a Voronoi tessellation of the feature space.

Let's review how the k-means algorithm learns the clusters and what that means for feature engineering.
We'll focus on three parameters from scikit-learn's implementation: n_clusters, max_iter, and n_init.

It's a simple two-step process. The algorithm starts by randomly initializing some predefined number
(n_clusters) of centroids. It then iterates over these two operations:

1. assign points to the nearest cluster centroid

2. move each centroid to minimize the distance to its points

It iterates over these two steps until the centroids aren't moving anymore, or until some maximum number
of iterations has passed (max_iter).

It often happens that the initial random position of the centroids ends in a poor clustering. For this reason
the algorithm repeats a number of times (n_init) and returns the clustering that has the least total
distance between each point and its centroid, the best of the repeated runs.
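The whole procedure fits in a few lines of NumPy. This is only an illustrative sketch of the loop just described, not scikit-learn's actual implementation (it ignores edge cases like empty clusters):

import numpy as np

def kmeans(X, n_clusters, max_iter=300, n_init=10, seed=0):
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):  # several random restarts
        # initialize by picking k random data points as centroids
        centroids = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(max_iter):
            # step 1: assign each point to its nearest centroid
            dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)
            # step 2: move each centroid to the mean of its points
            new_centroids = np.array(
                [X[labels == k].mean(axis=0) for k in range(n_clusters)]
            )
            if np.allclose(new_centroids, centroids):
                break  # converged: the centroids stopped moving
            centroids = new_centroids
        # keep the run with the least total point-to-centroid distance
        inertia = (dists.min(axis=1) ** 2).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels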

The animation below shows the algorithm in action. It illustrates the dependence of the result on the initial
centroids and the importance of iterating until convergence.
The K-means clustering algorithm on Airbnb rentals in NYC.

You may need to increase max_iter for a large number of clusters or n_init for a complex dataset.
Ordinarily though, the only parameter you'll need to choose yourself is n_clusters (k, that is). The best
partitioning for a set of features depends on the model you're using and what you're trying to predict, so it's
best to tune it like any hyperparameter (through cross-validation, say).
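As a sketch of what that tuning could look like (the downstream model and the range of k here are illustrative, not a recommendation): since k-means never sees the target, it's generally safe to fit it on the full feature table before cross-validating the model.

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def score_k(X, y, k):
    # add the cluster labels as an extra (integer-coded) feature
    X_k = X.copy()
    X_k["Cluster"] = KMeans(n_clusters=k, n_init=10).fit_predict(X_k)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X_k, y, cv=5).mean()

# e.g. best_k = max(range(2, 12), key=lambda k: score_k(X, y, k))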

Example - California Housing

As spatial features, California Housing (https://www.kaggle.com/camnugent/california-housing-prices)'s
'Latitude' and 'Longitude' make natural candidates for k-means clustering. In this example we'll
cluster these with 'MedInc' (median income) to create economic segments in different regions of
California.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

df = pd.read_csv("../input/fe-course-data/housing.csv")
X = df.loc[:, ["MedInc", "Latitude", "Longitude"]]
X.head()

Out[1]:

   MedInc  Latitude  Longitude
0  8.3252     37.88    -122.23
1  8.3014     37.86    -122.22
2  7.2574     37.85    -122.24
3  5.6431     37.85    -122.25
4  3.8462     37.85    -122.25

Since k-means clustering is sensitive to scale, it can be a good idea to rescale or normalize data with extreme
values. Our features are already roughly on the same scale, so we'll leave them as-is.
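Had the scales differed, a standardization step before clustering would be one reasonable approach; a minimal sketch with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Rescale each column to zero mean and unit variance so no single
# feature dominates the Euclidean distances k-means relies on
X_scaled = StandardScaler().fit_transform(X)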
In [2]:
# Create cluster feature
kmeans = KMeans(n_clusters=6)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")

X.head()

Out[2]:

   MedInc  Latitude  Longitude  Cluster
0  8.3252     37.88    -122.23        5
1  8.3014     37.86    -122.22        5
2  7.2574     37.85    -122.24        5
3  5.6431     37.85    -122.25        5
4  3.8462     37.85    -122.25        1

Now let's look at a couple plots to see how effective this was. First, a scatter plot that shows the
geographic distribution of the clusters. It seems like the algorithm has created separate segments for
higher-income areas on the coasts.
In [3]:
sns.relplot(
    x="Longitude", y="Latitude", hue="Cluster", data=X, height=6,
);

The target in this dataset is MedHouseVal (median house value). These box-plots show the distribution of
the target within each cluster. If the clustering is informative, these distributions should, for the most part,
separate across MedHouseVal, which is indeed what we see.
In [4]:
X["MedHouseVal"] = df["MedHouseVal"]
sns.catplot(x="MedHouseVal", y="Cluster", data=X, kind="boxen", height=
6);

Your Turn

Add a feature of cluster labels (https://www.kaggle.com/kernels/fork/14393920) to Ames and learn about
another kind of feature clustering can create.
