
INSY 446 – Winter 2023
Data Mining for Business Analytics

Session 10 – Unsupervised Learning Part 2

March 20, 2023
Dongliang Sheng
Measuring Cluster Performance
§ In supervised machine learning, we can measure the performance of the model(s) directly (e.g., using MSE or an accuracy score)
§ In unsupervised machine learning, it is difficult to measure the performance of the model(s), since we usually do not know the "ground truth"
§ For example:
– What is the optimal number of clusters to identify?
– How do we measure whether one set of clusters is preferable to another?
Example 1
Clustering Iris

from sklearn import datasets

iris = datasets.load_iris()

# Use only the petal measurements as predictors
X = iris.data[:,2:]

# Standardize predictors using min-max normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Develop clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X_std)

# Plot cluster membership
from matplotlib import pyplot
pyplot.scatter(X_std[:,0], X_std[:,1], c=labels)
pyplot.show()
Measuring Cluster Performance
§ If we assume that the "ground truth" is known, we can measure cluster performance using methods similar to classification accuracy
§ The most common technique in this group is the Rand index
§ It is similar to classification accuracy, with a few adjustments (illustrated in the sketch below):
– The Rand index accounts for permuted cluster labels
– The adjusted Rand index accounts for the agreement expected from random labeling and uses it as a baseline
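As a quick illustration (a minimal sketch, not from the original slides): relabeling the clusters does not change the Rand index, since it only asks whether pairs of records are grouped together.

from sklearn.metrics import rand_score, adjusted_rand_score

truth = [0, 0, 1, 1]
labels = [1, 1, 0, 0]   # same grouping, cluster labels permuted

print(rand_score(truth, labels))           # 1.0 - permuted labels still agree
print(adjusted_rand_score(truth, labels))  # 1.0 - chance-corrected version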

Example 2
Rand Index

from sklearn import datasets

iris = datasets.load_iris()

# Use only the petal measurements as predictors
X = iris.data[:,2:]

# Standardize predictors using min-max normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Develop clusters (three clusters to match the three species)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Measure cluster performance using the Rand index,
# treating the species labels as the ground truth
from sklearn import metrics
print(metrics.rand_score(iris.target, labels))
print(metrics.adjusted_rand_score(iris.target, labels))
Measuring Cluster Performance
§ In cases where the "ground truth" is not known, there are multiple methods to measure cluster goodness-of-fit, which is commonly used to reflect cluster performance
§ In this class, we cover three methods:
– Elbow method
– Silhouette score
– Pseudo-F statistic
The Elbow Method
§ Examine the within-cluster variation for different numbers of clusters
§ Adding another cluster almost always reduces the variance
§ The number of clusters k that minimizes the variance is k = n, where the variance becomes 0
§ We should instead choose the value of k at which adding another cluster no longer provides a substantially better result
Example 3
Elbow Method

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Record the within-cluster sum of squares (inertia) for each candidate k
from sklearn.cluster import KMeans
withinss = []
for k in range(2, 9):   # candidate numbers of clusters (illustrative range)
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_std)
    withinss.append(kmeans.inertia_)

# Plot within-cluster variation against k and look for the "elbow"
from matplotlib import pyplot
pyplot.plot(range(2, 9), withinss)
pyplot.show()
Can We Do Better?
§ Cluster cohesion refers to how tightly related the records within an individual cluster are
§ Cluster separation represents how distant the clusters are from each other
§ However, the sum of squares error (SSE) only accounts for cluster cohesion and decreases monotonically as the number of clusters increases
§ Good measures should incorporate both, as the silhouette score and the pseudo-F statistic do
The Silhouette Method
§ For each data value $i$, the silhouette is used to gauge how good the cluster assignment is for that point:

$$\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance between $i$ and all other data within the same cluster (represents cohesion) and $b_i$ is the average distance of $i$ to all points in the next closest cluster (represents separation)
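To make the formula concrete, here is a minimal from-scratch sketch (an illustration, not from the original slides) that computes $s_i$ for a single point, assuming one-dimensional data and absolute distance:

def silhouette_point(x, own_cluster, nearest_cluster):
    # a_i: average distance from x to the other points in its own cluster
    a = sum(abs(x - p) for p in own_cluster if p != x) / (len(own_cluster) - 1)
    # b_i: average distance from x to the points in the next closest cluster
    b = sum(abs(x - p) for p in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)

print(silhouette_point(0, [0, 2, 4], [6, 10]))  # 0.625, as in the worked example below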

The Silhouette Method (Cont’d)
§ A positive value indicates that the assignment is good, with higher values better than lower values
§ A value close to zero is considered weak, since the observation could have been assigned to the next closest cluster with little negative consequence
§ A negative value indicates the observation is misclustered, since assignment to the next closest cluster would have been better
The Average Silhouette Value
§ The average silhouette value over all records yields a measure of how well the cluster solution fits
§ A thumbnail interpretation, meant as a guide only:
– 0.5 or better provides good evidence of the reality of the clusters in the data
– 0.25 – 0.5 provides some evidence of the reality of the clusters in the data
– Less than 0.25 provides scant evidence of cluster reality
Silhouette Examples
§ Apply k-means clustering to the following data set:

$x_1 = 0, \; x_2 = 2, \; x_3 = 4, \; x_4 = 6, \; x_5 = 10$

§ The first three data values are assigned to Cluster 1 and the last two to Cluster 2
§ The center for Cluster 1 is $m_1 = 2$ and for Cluster 2 is $m_2 = 8$
§ The values $a_i$ are the average distance between $x_i$ and all other data within the same cluster; the values $b_i$ are the average distance between $x_i$ and all the data in the other cluster
Silhouette Examples (Cont’d)
§ Calculations for the individual data values, using $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$:

x_i   a_i   b_i   max(a_i, b_i)   s_i
0     3     8     8               (8 − 3)/8 = 0.625
2     2     6     6               (6 − 2)/6 ≈ 0.67
4     3     4     4               (4 − 3)/4 = 0.25
6     4     4     4               (4 − 4)/4 = 0
10    4     8     8               (8 − 4)/8 = 0.5

Mean Silhouette = 0.4083
Example 4
Silhouette Method

X = [[0],[2],[4],[6],[10]]

from sklearn.cluster import KMeans
labels = KMeans(n_clusters=2).fit_predict(X)

from sklearn.metrics import silhouette_samples
print(silhouette_samples(X, labels))   # silhouette value for each observation

from sklearn.metrics import silhouette_score
print(silhouette_score(X, labels))     # mean silhouette = 0.4083
Example 5
Silhouette Method (2)

from sklearn import datasets
import numpy

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Silhouette value for each individual observation
from sklearn.metrics import silhouette_samples
silhouette = silhouette_samples(X_std, labels)

import pandas
df = pandas.DataFrame({'label':labels,'silhouette':silhouette})

print('Average Silhouette Score for Cluster 0: ',numpy.average(df[df['label'] == 0].silhouette))
print('Average Silhouette Score for Cluster 1: ',numpy.average(df[df['label'] == 1].silhouette))
print('Average Silhouette Score for Cluster 2: ',numpy.average(df[df['label'] == 2].silhouette))

# Overall average silhouette score
from sklearn.metrics import silhouette_score
print('Overall Average Silhouette Score: ', silhouette_score(X_std, labels))
Silhouette Analysis of Iris Data
§ Scatterplot of petal width vs. petal length by species, where two of the species seem to blend into one another
Silhouette Analysis of Iris Data
§ Results of k=3 k-means clustering do not match perfectly with the flower types
Silhouette Analysis of Iris Data
§ The average silhouette scores for each cluster and for the overall model are as follows:

                  Cluster 1   Cluster 2   Cluster 3   Overall
Mean Silhouette   0.90        0.61        0.51        0.68

§ Cluster 1 is the best defined, with the highest average silhouette score
§ Clusters 2 and 3 are also reasonably well defined, with average silhouette scores of 0.61 and 0.51
Silhouette Analysis of Iris Data
§ Results of k=2 k-means clustering combine Versicolor and Virginica into a single cluster
Silhouette Analysis of Iris Data
§ The silhouette plot has fewer low values than for k=3, and the mean values are higher:

                  Cluster 1   Cluster 2   Overall
Mean Silhouette   0.92        0.65        0.74

§ So, if we measure cluster performance using the silhouette score, the model with k=2 is better than the model with k=3
§ Should we use k=3 or k=2?
The Pseudo-F Statistic
§ Let:
– $k$ be the number of clusters
– $\sum n_i = N$ be the total sample size
– $x_{ij}$ refer to the $j$th data value in the $i$th cluster
– $m_i$ refer to the cluster center (centroid) of the $i$th cluster
– $M$ represent the grand mean of all the data
– $\mathrm{Distance}(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
The Pseudo-F Statistic (Cont’d)
§ Then the sum of squares between the clusters is:

$$SSB = \sum_{i=1}^{k} n_i \cdot \mathrm{Distance}^2(m_i, M)$$

§ And the sum of squares error, or the sum of squares within the clusters, is:

$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathrm{Distance}^2(x_{ij}, m_i)$$

§ And the pseudo-F statistic is:

$$F = \frac{MSB}{MSE} = \frac{SSB/(k-1)}{SSE/(N-k)}$$
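These formulas translate directly into code. A minimal NumPy sketch (an illustration, not from the original slides), using the cluster means as centroids and one-dimensional data:

import numpy as np

def pseudo_f(clusters):
    # clusters: a list of 1-D arrays, one array per cluster
    all_data = np.concatenate(clusters)
    N, k = len(all_data), len(clusters)
    M = all_data.mean()   # grand mean of all the data
    ssb = sum(len(c) * (c.mean() - M) ** 2 for c in clusters)   # between clusters
    sse = sum(((c - c.mean()) ** 2).sum() for c in clusters)    # within clusters
    return (ssb / (k - 1)) / (sse / (N - k))

print(pseudo_f([np.array([0, 2, 4]), np.array([6, 10])]))  # 8.1, as computed below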

The Pseudo-F Statistic (Cont’d)
§ The hypotheses being tested are:
$H_0$: There is no cluster in the data.
$H_a$: There are $k$ clusters in the data.

§ Reject $H_0$ for a sufficiently small p-value, where:

p-value $= P(F_{k-1,\,N-k} > \text{pseudo-}F\text{ value})$
The Pseudo-F Statistic (Cont’d)
§ The pseudo-F statistic should not be used to determine the presence of clusters, but it can be used to select the optimal number of clusters as follows (see the sketch below):
1. Use a clustering algorithm to develop a clustering solution for a variety of values of k.
2. Calculate the pseudo-F statistic for each candidate and select the candidate with the highest pseudo-F statistic as the best clustering solution.
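A minimal sketch of this procedure (illustrative; it assumes X_std holds the standardized data from the earlier examples, and the candidate range of k is arbitrary):

from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

scores = {}
for k in range(2, 9):   # candidate numbers of clusters
    labels = KMeans(n_clusters=k).fit_predict(X_std)
    scores[k] = calinski_harabasz_score(X_std, labels)   # pseudo-F for this k

best_k = max(scores, key=scores.get)   # highest pseudo-F wins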

Pseudo-F Statistic Examples
§ Apply k-means clustering to the following data set:

$x_1 = 0, \; x_2 = 2, \; x_3 = 4, \; x_4 = 6, \; x_5 = 10$

§ The first three data values are assigned to Cluster 1 and the last two to Cluster 2
§ The center for Cluster 1 is $m_1 = 2$ and for Cluster 2 is $m_2 = 8$
§ There are $n_1 = 3$ and $n_2 = 2$ data values, N = 5, and the grand mean is M = 4.4. And, because we are in one dimension, $\mathrm{Distance}(m_i, M) = |m_i - M|$
Pseudo-F Statistic Examples
§ Then
$$SSB = \sum_{i=1}^{k} n_i \cdot \mathrm{Distance}^2(m_i, M) = 3 \cdot (2 - 4.4)^2 + 2 \cdot (8 - 4.4)^2 = 43.2$$

§ And
$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathrm{Distance}^2(x_{ij}, m_i) = (0-2)^2 + (2-2)^2 + (4-2)^2 + (6-8)^2 + (10-8)^2 = 16$$

§ And
$$F = \frac{MSB}{MSE} = \frac{SSB/(k-1)}{SSE/(N-k)} = \frac{43.2/1}{16/3} = \frac{43.2}{5.33} = 8.1$$
Pseudo-F Statistic Examples
§ The distribution of the F statistic, with k − 1 = 1 and N − k = 3 degrees of freedom, gives a p-value of 0.06532
Example 6
Pseudo-F Statistic

X = [[0],[2],[4],[6],[10]]

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X)

# The pseudo-F statistic is also known as the Calinski-Harabasz score
from sklearn.metrics import calinski_harabasz_score
pseudo_f = calinski_harabasz_score(X, labels)
print(pseudo_f)   # 8.1

# p-value from the F distribution
from scipy.stats import f
df1 = 1 # df1 = k-1
df2 = 3 # df2 = n-k
print(f.sf(pseudo_f, df1, df2))   # 0.06532
Example 7
Pseudo-F Statistic (2)

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Pseudo-F statistic (Calinski-Harabasz score) and its p-value
from sklearn.metrics import calinski_harabasz_score
pseudo_f = calinski_harabasz_score(X_std, labels)
print(pseudo_f)

from scipy.stats import f
df1 = 2    # df1 = k - 1 = 3 - 1
df2 = 147  # df2 = N - k = 150 - 3
print(f.sf(pseudo_f, df1, df2))
Exercise #1
§ Use the RidingMower dataset
§ Based on Income and Lot size, develop clusters using the k-means clustering algorithm with k=3
§ Print the average silhouette score of each cluster and the overall average silhouette score for this cluster model
Exercise #2
§ Use the same dataset as in Exercise #1
§ Based on Income and Lot size, develop clusters using the k-means clustering algorithm with different values of k
§ Find the best k based on the average silhouette score
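One possible approach (a sketch only; the file name and the column names Income and Lot_Size are assumptions, so adjust them to match your copy of the dataset):

import pandas
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pandas.read_csv('RidingMowers.csv')   # file name assumed
X_std = MinMaxScaler().fit_transform(df[['Income', 'Lot_Size']])   # column names assumed

for k in range(2, 7):   # candidate values of k (illustrative range)
    labels = KMeans(n_clusters=k).fit_predict(X_std)
    print(k, silhouette_score(X_std, labels))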
