Chapter 3

The document provides an overview of k-means clustering, highlighting its advantages over hierarchical clustering, particularly in terms of runtime efficiency. It details the steps to implement k-means in Python, including generating cluster centers and labels, and discusses methods for determining the optimal number of clusters using the elbow method. Additionally, it addresses limitations of k-means, such as sensitivity to initial seed values and a bias towards equal-sized clusters.


Basics of k-means clustering
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Why k-means clustering?
A critical drawback of hierarchical clustering: runtime

k-means runs significantly faster on large datasets


Step 1: Generate cluster centers
kmeans(obs, k_or_guess, iter, thresh, check_finite)

obs : standardized observations

k_or_guess : number of clusters

iter : number of iterations (default: 20)

thresh : threshold (default: 1e-05)

check_finite : whether to check if observations contain only finite numbers (default: True)

Returns two objects: cluster centers, distortion
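As a quick illustration of this signature (the array data below is made up for the example and is not the course dataset), observations can be standardized with whiten() before calling kmeans():

import numpy as np
from scipy.cluster.vq import whiten, kmeans

data = np.random.rand(100, 2) * 10        # hypothetical raw observations
obs = whiten(data)                        # standardize each column by its standard deviation
cluster_centers, distortion = kmeans(obs, 3, iter=20, thresh=1e-05)
print(cluster_centers.shape, distortion)  # typically (3, 2), plus a single distortion value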


How is distortion calculated?

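The original slide showed this as a figure. As a rough sketch of the definition used in the course (sum of squared distances from each point to its nearest center), with illustrative names that are not from the slides; note that, according to the SciPy documentation, kmeans() itself reports the mean non-squared Euclidean distance, so its returned value will differ from this formula:

import numpy as np

def distortion(points, centers):
    # Euclidean distance from every point to every center, shape (n_points, n_centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # each point contributes the squared distance to its nearest center
    return np.sum(dists.min(axis=1) ** 2)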


Step 2: Generate cluster labels
vq(obs, code_book, check_finite=True)

obs : standardized observations

code_book : cluster centers

check_finite : whether to check if observations contain only finite numbers (default: True)

Returns two objects: a list of cluster labels, a list of distortions


A note on distortions
kmeans returns a single value of distortion

vq returns a list of distortions, one per observation
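A minimal sketch of what these return values look like (the data is random and only for illustration):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

obs = whiten(np.random.rand(50, 2))   # hypothetical standardized observations
centers, distortion = kmeans(obs, 3)  # distortion: one number summarizing the whole fit
labels, dists = vq(obs, centers)      # dists: one distance per observation
print(distortion)
print(labels.shape, dists.shape)      # (50,) (50,)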


Running k-means
# Import kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Plotting libraries used below
import seaborn as sns
import matplotlib.pyplot as plt

# Generate cluster centers and labels
cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)
df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)

# Plot clusters
sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()


Next up: exercises!

How many clusters?

Shaumik Daityari
Business Analyst
How to find the right k?
No absolute method to find the right number of clusters (k) in k-means clustering

Elbow method

How do we find a suitable k? Use the elbow method: look for the point where the
elbow plot bends. Beyond that number of clusters, the decrease in distortion is
much smaller.

By contrast, if the elbow plot is roughly a straight sloping line, there may be
no clearly optimal value of k.


Distortions revisited
Distortion: sum of squared distances of points from cluster centers

Decreases with an increasing number of clusters

Becomes zero when the number of clusters equals the number of points

Elbow plot: line plot of the number of clusters against distortion


Elbow method
Elbow plot: plot of the number of clusters and distortion

Elbow plot helps indicate number of clusters present in data



Elbow method in Python
# Import required libraries
from scipy.cluster.vq import kmeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Declaring variables for use
distortions = []
num_clusters = range(2, 7)

# Populating distortions for various clusters
for i in num_clusters:
    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)
    distortions.append(distortion)

# Plotting elbow plot data
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,
                                'distortions': distortions})

sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot_data)
plt.show()
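As a small follow-up (not part of the original slides), the successive drops in distortion can be printed to help see where the curve flattens; this continues from the distortions list built above:

import numpy as np

# drop in distortion between consecutive values of k;
# the decrease typically becomes small beyond the elbow
print(np.diff(distortions))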


Final thoughts on using the elbow method
Only gives an indication of the optimal k (number of clusters)

Does not always pinpoint the optimal k

Other methods: average silhouette and gap statistic
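The average silhouette method is not covered in these slides, but as a rough sketch it could look like the following, assuming scikit-learn is available (scikit-learn is not used elsewhere in the course, and the data here is random and only for illustration):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq
from sklearn.metrics import silhouette_score  # assumes scikit-learn is installed

obs = whiten(np.random.rand(300, 2))  # hypothetical standardized observations
for k in range(2, 7):
    centers, _ = kmeans(obs, k)
    labels, _ = vq(obs, centers)
    # a higher average silhouette suggests better-separated clusters
    print(k, silhouette_score(obs, labels))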


Next up: exercises
Limitations of k-means clustering

Shaumik Daityari
Business Analyst
Limitations of k-means clustering
How to find the right k (number of clusters)?

Impact of seeds

Biased towards equal-sized clusters


Impact of seeds
Initialize a random seed before running k-means:

from numpy import random
random.seed(12)

Seed [1000, 2000]: cluster sizes 29, 29, 43, 47, 52
Seed [1, 2, 3]: cluster sizes 26, 31, 40, 50, 53
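A minimal sketch of how such cluster sizes could be produced (the data here is randomly generated for illustration and is not the course dataset):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

np.random.seed(12)                     # the seed also fixes k-means' random initialization
obs = whiten(np.random.rand(200, 2))   # hypothetical standardized observations
centers, _ = kmeans(obs, 5)
labels, _ = vq(obs, centers)
print(np.bincount(labels))             # number of points assigned to each of the 5 clusters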


Impact of seeds: plots
Plots compared: seed [1000, 2000] vs. seed [1, 2, 3]

When the clusters in the data are not clearly separated, the initial seed affects how
points are assigned to clusters, so it is best to set the same random seed at the start
for reproducible results.

If the data already separates into distinct clusters, the seed has no real effect.


Uniform clusters in k-means


Uniform clusters in k-means: a comparison
K-means clustering with 3 clusters vs. hierarchical clustering with 3 clusters

In the k-means result, one cluster is quite spread out: points at its edge are even
farther from their own cluster center than from the center of the cluster to its left.
Since k-means tries to minimize distortion, and clusters need not contain the same
number of data points, some points end up grouped with a neighboring cluster.

This is because the very idea of k-means clustering is to minimize distortions. The
result is clusters that have similar areas, but not necessarily a similar number of
data points.


Final thoughts
Each technique has its pros and cons

Consider your data size and patterns before deciding on an algorithm

Clustering is an exploratory phase of analysis


Next up: exercises
# Import the required functions and plotting libraries
from numpy import random
from scipy.cluster.vq import kmeans, vq
import seaborn as sns
import matplotlib.pyplot as plt

# Set up a random seed in numpy
random.seed([1000, 2000])

# Fit the data into a k-means algorithm
cluster_centers, _ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)

# Assign cluster labels
fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)

# Display cluster centers
print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())

# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_def', y='scaled_phy', hue='cluster_labels', data=fifa)
plt.show()
