Chapter 3

The document provides an overview of k-means clustering, highlighting its advantages over hierarchical clustering, particularly in terms of runtime efficiency. It details the steps to implement k-means in Python, including generating cluster centers and labels, and discusses methods for determining the optimal number of clusters using the elbow method. Additionally, it addresses limitations of k-means, such as sensitivity to initial seed values and a bias towards equal-sized clusters.


Basics of k-means clustering
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Why k-means clustering?
A critical drawback of hierarchical clustering: runtime

k-means runs significantly faster on large datasets


Step 1: Generate cluster centers
kmeans(obs, k_or_guess, iter, thresh, check_finite)

obs : standardized observations

k_or_guess : number of clusters

iter : number of iterations (default: 20)

thresh : threshold (default: 1e-05)

check_finite : whether to check if observations contain only finite numbers (default: True)

Returns two objects: cluster centers, distortion
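As a quick illustration of this signature (the array data below is made up for the example and is not the course dataset), observations can be standardized with whiten() before calling kmeans():

import numpy as np
from scipy.cluster.vq import whiten, kmeans

data = np.random.rand(100, 2) * 10        # hypothetical raw observations
obs = whiten(data)                        # standardize each column by its standard deviation
cluster_centers, distortion = kmeans(obs, 3, iter=20, thresh=1e-05)
print(cluster_centers.shape, distortion)  # typically (3, 2), plus a single distortion value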


How is distortion calculated?

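The original slide showed this as a figure. As a rough sketch of the definition used in the course (sum of squared distances from each point to its nearest center), with illustrative names that are not from the slides; note that, according to the SciPy documentation, kmeans() itself reports the mean non-squared Euclidean distance, so its returned value will differ from this formula:

import numpy as np

def distortion(points, centers):
    # Euclidean distance from every point to every center, shape (n_points, n_centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # each point contributes the squared distance to its nearest center
    return np.sum(dists.min(axis=1) ** 2)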


Step 2: Generate cluster labels
vq(obs, code_book, check_finite=True)

obs : standardized observations

code_book : cluster centers

check_finite : whether to check if observations contain only finite numbers (default: True)

Returns two objects: a list of cluster labels, a list of distortions


A note on distortions
kmeans returns a single value of distortion

vq returns a list of distortions, one per observation
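A minimal sketch of what these return values look like (the data is random and only for illustration):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

obs = whiten(np.random.rand(50, 2))   # hypothetical standardized observations
centers, distortion = kmeans(obs, 3)  # distortion: one number summarizing the whole fit
labels, dists = vq(obs, centers)      # dists: one distance per observation
print(distortion)
print(labels.shape, dists.shape)      # (50,) (50,)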


Running k-means
# Import kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Plotting libraries used below
import seaborn as sns
import matplotlib.pyplot as plt

# Generate cluster centers and labels
cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)
df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)

# Plot clusters
sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()


Next up: exercises!

How many clusters?

Shaumik Daityari
Business Analyst
How to find the right k?
No absolute method to find the right number of clusters (k) in k-means clustering

Elbow method

How do we find a suitable k? Use the elbow method: look for the point where the
elbow plot bends. Beyond that number of clusters, the decrease in distortion is
much smaller.

By contrast, if the elbow plot is roughly a straight sloping line, there may be
no clearly optimal value of k.


Distortions revisited
Distortion: sum of squared distances of points from cluster centers

Decreases with an increasing number of clusters

Becomes zero when the number of clusters equals the number of points

Elbow plot: line plot of the number of clusters against distortion


Elbow method
Elbow plot: plot of the number of clusters and distortion

Elbow plot helps indicate number of clusters present in data



Elbow method in Python
# Import required libraries
from scipy.cluster.vq import kmeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Declaring variables for use
distortions = []
num_clusters = range(2, 7)

# Populating distortions for various clusters
for i in num_clusters:
    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)
    distortions.append(distortion)

# Plotting elbow plot data
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,
                                'distortions': distortions})

sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot_data)
plt.show()
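As a small follow-up (not part of the original slides), the successive drops in distortion can be printed to help see where the curve flattens; this continues from the distortions list built above:

import numpy as np

# drop in distortion between consecutive values of k;
# the decrease typically becomes small beyond the elbow
print(np.diff(distortions))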


Final thoughts on using the elbow method
Only gives an indication of the optimal k (number of clusters)

Does not always pinpoint the optimal k

Other methods: average silhouette and gap statistic
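The average silhouette method is not covered in these slides, but as a rough sketch it could look like the following, assuming scikit-learn is available (scikit-learn is not used elsewhere in the course, and the data here is random and only for illustration):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq
from sklearn.metrics import silhouette_score  # assumes scikit-learn is installed

obs = whiten(np.random.rand(300, 2))  # hypothetical standardized observations
for k in range(2, 7):
    centers, _ = kmeans(obs, k)
    labels, _ = vq(obs, centers)
    # a higher average silhouette suggests better-separated clusters
    print(k, silhouette_score(obs, labels))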


Next up: exercises
Limitations of k-means clustering

Shaumik Daityari
Business Analyst
Limitations of k-means clustering
How to find the right k (number of clusters)?

Impact of seeds

Biased towards equal-sized clusters


Impact of seeds
Initialize a random seed before running k-means:

from numpy import random
random.seed(12)

Seed [1000, 2000]: cluster sizes 29, 29, 43, 47, 52
Seed [1, 2, 3]: cluster sizes 26, 31, 40, 50, 53
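A minimal sketch of how such cluster sizes could be produced (the data here is randomly generated for illustration and is not the course dataset):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

np.random.seed(12)                     # the seed also fixes k-means' random initialization
obs = whiten(np.random.rand(200, 2))   # hypothetical standardized observations
centers, _ = kmeans(obs, 5)
labels, _ = vq(obs, centers)
print(np.bincount(labels))             # number of points assigned to each of the 5 clusters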


Impact of seeds: plots
Plots compared: seed [1000, 2000] vs. seed [1, 2, 3]

When the clusters in the data are not clearly separated, the initial seed affects how
points are assigned to clusters, so it is best to set the same random seed at the start
for reproducible results.

If the data already separates into distinct clusters, the seed has no real effect.


Uniform clusters in k-means


Uniform clusters in k-means: a comparison
K-means clustering with 3 clusters vs. hierarchical clustering with 3 clusters

In the k-means result, one cluster is quite spread out: points at its edge are even
farther from their own cluster center than from the center of the cluster to its left.
Since k-means tries to minimize distortion, and clusters need not contain the same
number of data points, some points end up grouped with a neighboring cluster.

This is because the very idea of k-means clustering is to minimize distortions. The
result is clusters that have similar areas, but not necessarily a similar number of
data points.


Final thoughts
Each technique has its pros and cons

Consider your data size and patterns before deciding on an algorithm

Clustering is an exploratory phase of analysis


Next up: exercises
# Import the required functions and plotting libraries
from numpy import random
from scipy.cluster.vq import kmeans, vq
import seaborn as sns
import matplotlib.pyplot as plt

# Set up a random seed in numpy
random.seed([1000, 2000])

# Fit the data into a k-means algorithm
cluster_centers, _ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)

# Assign cluster labels
fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)

# Display cluster centers
print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())

# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_def', y='scaled_phy', hue='cluster_labels', data=fifa)
plt.show()
