Chapter 3
clustering
CLUSTER ANALYSIS IN PYTHON
Shaumik Daityari
Business Analyst
Why k-means clustering?
A critical drawback of hierarchical clustering: runtime
check_finite : whether to check if observations contain only finite numbers (default: True)
# Plot clusters
sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()
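The plotting call above assumes a DataFrame `df` that already holds `scaled_x`, `scaled_y` and `cluster_labels`. A minimal sketch of how those columns could be produced with `scipy.cluster.vq` follows; the toy data, the random seed and the choice of two clusters are assumptions made for illustration and do not come from the slides.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data: two well-separated groups of 2-D points
np.random.seed(0)
points = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
df = pd.DataFrame(points, columns=['x', 'y'])

# Scale each feature to unit standard deviation before clustering
scaled = whiten(df[['x', 'y']].to_numpy())
df['scaled_x'] = scaled[:, 0]
df['scaled_y'] = scaled[:, 1]

# Step 1: compute cluster centers (k = 2 is an assumption for this toy data)
features = df[['scaled_x', 'scaled_y']].to_numpy()
cluster_centers, distortion = kmeans(features, 2, check_finite=True)

# Step 2: assign each observation to its nearest cluster center
labels, distances = vq(features, cluster_centers, check_finite=True)
df['cluster_labels'] = labels
```

With `df` built this way, the `sns.scatterplot` call shown above can be applied unchanged.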
How to find the right k?
No absolute method to find the right number of
clusters (k) in k-means clustering
Elbow method
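Note: to find a suitable k with the elbow method, look for the point where the curve bends (the elbow).
Beyond that number of clusters, the distortion decreases only by a small amount.
If the elbow plot is instead a straight sloping line,
there may be no single best value of k.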
num_clusters = range(2, 7)
sns.lineplot(x='num_clusters', y='distortions',
data = elbow_plot_data)
plt.show()
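The plotting call above assumes a DataFrame `elbow_plot_data` with one distortion value per candidate k. A minimal sketch of how it could be built is shown below; the toy data (three blobs) and the seed are assumptions for illustration, and the features are taken to be already scaled.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans

# Hypothetical toy data: three well-separated blobs
np.random.seed(0)
blob_centers = [[0, 0], [6, 0], [3, 6]]
features = np.vstack([
    np.random.normal(loc=c, scale=0.5, size=(40, 2)) for c in blob_centers
])

# Run k-means for each candidate k and record the distortion it returns
num_clusters = range(2, 7)
distortions = []
for k in num_clusters:
    _, distortion = kmeans(features, k)
    distortions.append(distortion)

# Table in the shape expected by the line plot
elbow_plot_data = pd.DataFrame({
    'num_clusters': list(num_clusters),
    'distortions': distortions,
})
```

On data like this, the drop in distortion from k = 2 to k = 3 should be much larger than any later drop, which is what produces the elbow at k = 3.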
Limitations of k-means clustering
How to find the right K (number of clusters)?
Impact of seeds
Seed: np.array([1, 2, 3])
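Note: when the clusters in a dataset are not clearly separated,
the seed used for initialization affects the resulting cluster assignment,
so it is best to fix the same random seed from the start.
If the dataset has clearly separated clusters to begin with,
the seed has little effect.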
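The effect of the seed can be checked with a small experiment. The sketch below is an assumption-laden illustration: the helper `sorted_centers`, both toy datasets and the seed values are made up for the example, and it relies on `kmeans` drawing its initial centers from NumPy's global random state when no seed argument is passed.

```python
import numpy as np
from scipy.cluster.vq import kmeans

def sorted_centers(features, k, seed):
    """Run k-means after seeding NumPy; return centers sorted by x (hypothetical helper)."""
    np.random.seed(seed)
    centers, _ = kmeans(features, k)
    return centers[np.argsort(centers[:, 0])]

# Separate generator for the data, so seeding only affects the initialization
data_rng = np.random.RandomState(42)

# Clearly separated clusters: two tight blobs far apart
clear = np.vstack([
    data_rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    data_rng.normal(loc=[5, 5], scale=0.3, size=(100, 2)),
])

# No clear clusters: points spread uniformly over the unit square
vague = data_rng.uniform(0, 1, size=(200, 2))

clear_a = sorted_centers(clear, 2, [1000, 2000])
clear_b = sorted_centers(clear, 2, [1, 2, 3])
vague_a = sorted_centers(vague, 2, [1000, 2000])
vague_b = sorted_centers(vague, 2, [1, 2, 3])

print('clear data, max center difference:', np.abs(clear_a - clear_b).max())
print('vague data, centers with seed [1000, 2000]:', vague_a.tolist())
print('vague data, centers with seed [1, 2, 3]:', vague_b.tolist())
```

On the separated data both seeds should lead to the same centers; on the uniform data the centers may differ from one seed to the next.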
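Note: this cluster, for example, is more spread out.
Points on its periphery are farther from its own center
than they are from the center of the cluster on the left.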
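The situation in this note can be reproduced numerically. The sketch below is an illustration under stated assumptions: the two centers, the spreads and the seed are invented, and the true centers are used as the code book instead of centers fitted by `kmeans`, to keep the example deterministic.

```python
import numpy as np
from scipy.cluster.vq import vq

rng = np.random.RandomState(0)

# Hypothetical setup: a tight cluster on the left, a spread-out cluster on the right
tight_center = np.array([0.0, 0.0])
spread_center = np.array([3.0, 0.0])
tight = rng.normal(loc=tight_center, scale=0.2, size=(200, 2))
spread = rng.normal(loc=spread_center, scale=2.0, size=(200, 2))

# Assign every point of the spread-out cluster to its nearest center
centers = np.vstack([tight_center, spread_center])
labels, _ = vq(spread, centers)

# Points generated by the spread-out cluster that lie closer to the left center
closer_to_tight = int(np.sum(labels == 0))
print(closer_to_tight, 'of', len(spread), 'points are nearer to the left center')
```

Because k-means assigns each point to the nearest center, such peripheral points end up labeled with the left cluster even though they were generated by the right one.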