INSY446 - 10 - Clustering Part 2
Example 1
Clustering Iris
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
# Develop clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(iris.data)
Measuring Cluster Performance
§ If we assume that the “ground truth” is known, we can measure cluster performance using methods similar to classification accuracy.
§ The most common technique in this group is the Rand index.
§ It is similar to classification accuracy, with a few adjustments:
– The Rand index accounts for permuted cluster labels
– The adjusted Rand index corrects for the chance of random labeling, using it as a baseline
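For instance (a minimal sketch using sklearn.metrics; the toy labelings are ours), permuting the cluster labels leaves the adjusted Rand index unchanged, while chance correction pulls the adjusted score well below the raw Rand index:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

truth = [0, 0, 0, 1, 1, 1]

# Same grouping with permuted cluster labels: still a perfect 1.0
print(adjusted_rand_score(truth, [1, 1, 1, 0, 0, 0]))

# An imperfect clustering: the raw Rand index (~0.667) looks much
# better than the chance-corrected adjusted Rand index (~0.324)
pred = [0, 0, 1, 1, 1, 1]
print(rand_score(truth, pred))
print(adjusted_rand_score(truth, pred))
```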
Example 2
Rand Index
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
# Develop clusters (k = 3 to compare against the 3 species)
labels = KMeans(n_clusters=3).fit_predict(iris.data)
print(adjusted_rand_score(iris.target, labels))
Measuring Cluster Performance
§ In cases where the “ground truth” is not known, there are multiple methods to measure the clusters’ goodness of fit, which is commonly used to reflect cluster performance
§ In this class, we cover three methods:
– Elbow method
– Silhouette score
– Pseudo-F statistic
The Elbow Method
§ Examine the within-cluster variation for different numbers of clusters
§ Adding another cluster almost always reduces the variance
§ The number of clusters k that minimizes the variance is k = n, where the variance becomes 0
§ We should therefore choose the number k at which adding another cluster does not provide a much better result
Example 3
Elbow Method
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:,2:]
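Building on this snippet, the elbow curve plots the within-cluster sum of squares (the fitted model's inertia_ attribute) against k; this is a sketch that assumes matplotlib is available:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data[:, 2:]  # petal length and petal width

# Within-cluster sum of squares (SSE / inertia) for k = 1..9
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('within-cluster SSE')
plt.show()  # choose k at the "elbow", where the curve flattens
```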
Can We Do Better?
§ Cluster cohesion refers to how tightly related the records within the individual clusters are
§ Cluster separation represents how distant the clusters are from each other
§ However, the sum of squares error (SSE) accounts only for cluster cohesion and decreases monotonically as the number of clusters increases
§ A good measure should incorporate both, as the silhouette and pseudo-F statistic do
The Silhouette Method
§ For each data value i, the silhouette is used to gauge how good the cluster assignment is for that point:

Silhouette_i = s_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the mean distance from point i to the other points in its own cluster, and b_i is the mean distance from point i to the points in the nearest other cluster.
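As a from-scratch sketch of this formula (the helper name silhouette_values is ours): a_i is the mean distance to the point's own cluster, and b_i is the smallest mean distance to any other cluster:

```python
import numpy as np

def silhouette_values(X, labels):
    """Compute s_i = (b_i - a_i) / max(a_i, b_i) for each point."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    scores = []
    for i in range(len(X)):
        same = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
        # a_i: mean distance to the other points in i's own cluster
        a = np.mean([np.linalg.norm(X[i] - X[j]) for j in same]) if same else 0.0
        # b_i: smallest mean distance to the points of any other cluster
        b = min(
            np.mean([np.linalg.norm(X[i] - X[j])
                     for j in range(len(X)) if labels[j] == c])
            for c in set(labels.tolist()) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores
```

For example, silhouette_values([[0], [2], [4], [6], [10]], [0, 0, 0, 1, 1]) returns approximately [0.625, 0.67, 0.25, 0, 0.5].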
The Silhouette Method (Cont’d)
§ A positive value indicates that the assignment is good, with higher values better than lower values
§ A value close to zero is considered weak, since the observation could have been assigned to the next closest cluster with little negative consequence
§ A negative value indicates the point is likely misclustered, since assignment to the next closest cluster would have been better
The Average Silhouette Value
§ The average silhouette value over all records
yields a measure of how well the cluster
solution fits.
§ A thumbnail interpretation, meant as a guide
only:
– 0.5 or better provides good evidence of the reality of
the clusters in the data
– 0.25 – 0.5 provides some evidence of the reality of
the clusters in the data.
– Less than 0.25 provides scant evidence of cluster
reality
Silhouette Examples
§ Apply k-means clustering to the following data set:
x1 = 0, x2 = 2, x3 = 4, x4 = 6, x5 = 10
§ With k = 2, the clusters are {0, 2, 4} and {6, 10}, giving for each point:

 x_i | a_i | b_i | max(a_i, b_i) | s_i = (b_i − a_i) / max(a_i, b_i)
   0 |  3  |  8  |       8       | 5/8 = 0.625
   2 |  2  |  6  |       6       | 4/6 = 0.67
   4 |  3  |  4  |       4       | 1/4 = 0.25
   6 |  4  |  4  |       4       | 0/4 = 0
  10 |  4  |  8  |       8       | 4/8 = 0.5

Mean Silhouette = 0.4083
Example 4
Silhouette Method
X = [[0],[2],[4],[6],[10]]
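A runnable sketch of this example, assuming (as in the worked table) clusters {0, 2, 4} and {6, 10}; silhouette_samples and silhouette_score are sklearn's per-point and mean-silhouette helpers:

```python
from sklearn.metrics import silhouette_samples, silhouette_score

X = [[0], [2], [4], [6], [10]]
labels = [0, 0, 0, 1, 1]  # clusters {0, 2, 4} and {6, 10}

# Per-point silhouettes: approximately 0.625, 0.67, 0.25, 0, 0.5
print(silhouette_samples(X, labels))

# Mean silhouette, matching the hand computation (~0.4083)
print(silhouette_score(X, labels))
```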
Example 5
Silhouette Method (2)
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import pandas
iris = datasets.load_iris()
X = iris.data[:,2:]
# Develop clusters and compute the per-record silhouette values
labels = KMeans(n_clusters=3).fit_predict(X)
silhouette = silhouette_samples(X, labels)
df = pandas.DataFrame({'label':labels,'silhouette':silhouette})
Silhouette Analysis of Iris Data
§ Scatterplot of petal width vs. petal length by
species, where two of the species seem to
blend into one another
Silhouette Analysis of Iris Data
§ Results of k=3 k-means clustering do not
match perfectly with flower types:
Silhouette Analysis of Iris Data
§ The average silhouette scores for each cluster
and overall model are as follows:
Silhouette Analysis of Iris Data
§ Results of k=2 k-means clustering combine the
Versicolor and Virginica into a single cluster:
Silhouette Analysis of Iris Data
§ The silhouette plot has fewer low values than for k=3, and the mean silhouette value is higher
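This comparison can be reproduced with a short loop (a sketch; random_state is fixed only for repeatability):

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
X = iris.data[:, 2:]  # petal length and petal width

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, scores[k])
# On this data the k=2 solution has the higher mean silhouette
```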
The Pseudo-F Statistic
§ Let:
k be the number of clusters
N be the total number of records
n_i be the number of records in cluster i
m_i be the centroid (cluster center) of cluster i
M be the grand mean (centroid of all the data)
x_ij be the jth record in cluster i
The Pseudo-F Statistic (Cont’d)
§ Then the sum of squares between the clusters is:

SSB = Σ_{i=1}^{k} n_i · Distance²(m_i, M)
The Pseudo-F Statistic (Cont’d)
§ The hypotheses being tested are:
H_0: There is no cluster in the data.
H_a: There are k clusters in the data.
The Pseudo-F Statistic (Cont’d)
§ The pseudo-F statistic should not be used to
determine the presence of clusters but can be
used to select the optimal number of clusters
as follows:
1. Use a clustering algorithm to develop a clustering
solution for a variety of values of k.
2. Calculate the pseudo-F statistic for each candidate
and select the candidate with the highest pseudo-
F statistic as the best clustering solution.
Pseudo-F Statistic Examples
§ Apply k-means clustering to the following data set:
x1 = 0, x2 = 2, x3 = 4, x4 = 6, x5 = 10
§ With k = 2 the clusters are {0, 2, 4} and {6, 10}, so m_1 = 2, m_2 = 8, the grand mean M = 4.4, and

SSB = 3 · (2 − 4.4)² + 2 · (8 − 4.4)² = 43.2

§ And

SSE = Σ_{i=1}^{k} Σ_{j=1}^{n_i} Distance²(x_ij, m_i) = 4 + 0 + 4 + 4 + 4 = 16

§ And

F = MSB / MSE = (SSB / (k − 1)) / (SSE / (N − k)) = (43.2 / 1) / (16 / 3) = 43.2 / 5.33 = 8.1
Pseudo-F Statistic Examples
§ The distribution of the F statistic gives a p-value of 0.06532.
Example 6
Pseudo-F Statistic
X = [[0],[2],[4],[6],[10]]
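A runnable sketch of this example: sklearn exposes the pseudo-F as calinski_harabasz_score (the Calinski-Harabasz index), so it reproduces the hand computation:

```python
from sklearn.metrics import calinski_harabasz_score

X = [[0], [2], [4], [6], [10]]
labels = [0, 0, 0, 1, 1]  # clusters {0, 2, 4} and {6, 10}

# (SSB/(k-1)) / (SSE/(N-k)) = (43.2/1) / (16/3) = 8.1
print(calinski_harabasz_score(X, labels))
```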
Example 7
Pseudo-F Statistic (2)
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:,2:]
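Continuing this example, steps 1-2 of the selection procedure (fit several values of k, keep the highest pseudo-F) can be sketched as follows; random_state is fixed only for repeatability:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

iris = datasets.load_iris()
X = iris.data[:, 2:]  # petal length and petal width

# 1. Develop a clustering solution for a variety of values of k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# 2. Select the candidate with the highest pseudo-F statistic
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```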
Exercise #1
§ Use the RidingMower dataset
§ Based on Income and Lot Size, develop clusters using the k-means clustering algorithm with k = 3
§ Print the average silhouette score of each cluster and the overall average silhouette score for this cluster model
Exercise #2
§ Use the same dataset as in Exercise #1
§ Based on Income and Lot Size, develop clusters using the k-means clustering algorithm with different values of k
§ Find the best k based on the average silhouette score