
INSY 446 – Winter 2023
Data Mining for Business Analytics

Session 10 – Unsupervised Learning Part 2

March 20, 2023
Dongliang Sheng
Measuring Cluster Performance
§ In supervised machine learning, we can measure the performance of the model(s) directly (e.g., using MSE or an accuracy score)
§ In unsupervised machine learning, it is difficult to measure the performance of the model(s), since we usually do not know the "ground truth"
§ For example:
– What is the optimal number of clusters to identify?
– How do we measure whether one set of clusters is preferable to another?
Example 1
Clustering Iris

from sklearn import datasets

iris = datasets.load_iris()

# Use only the petal measurements as predictors
X = iris.data[:,2:]

# Standardize predictors using min-max normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Develop clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X_std)

# Plot cluster membership
from matplotlib import pyplot
pyplot.scatter(X_std[:,0], X_std[:,1], c=labels)
pyplot.show()
Measuring Cluster Performance
§ If we assume that the "ground truth" is known, we can measure cluster performance using methods similar to classification accuracy
§ The most common technique in this group is the Rand index
§ It is similar to classification accuracy, with a few adjustments (illustrated in the sketch below):
– The Rand index accounts for permuted cluster labels
– The adjusted Rand index accounts for the agreement expected from random labeling and uses it as a baseline
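As a quick illustration (a minimal sketch, not from the original slides): relabeling the clusters does not change the Rand index, since it only asks whether pairs of records are grouped together.

from sklearn.metrics import rand_score, adjusted_rand_score

truth = [0, 0, 1, 1]
labels = [1, 1, 0, 0]   # same grouping, cluster labels permuted

print(rand_score(truth, labels))           # 1.0 - permuted labels still agree
print(adjusted_rand_score(truth, labels))  # 1.0 - chance-corrected version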

Example 2
Rand Index

from sklearn import datasets

iris = datasets.load_iris()

# Use only the petal measurements as predictors
X = iris.data[:,2:]

# Standardize predictors using min-max normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Develop clusters (three clusters to match the three species)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Measure cluster performance using the Rand index,
# treating the species labels as the ground truth
from sklearn import metrics
print(metrics.rand_score(iris.target, labels))
print(metrics.adjusted_rand_score(iris.target, labels))
Measuring Cluster Performance
§ In cases where the "ground truth" is not known, there are multiple methods to measure cluster goodness-of-fit, which is commonly used to reflect cluster performance
§ In this class, we cover three methods:
– Elbow method
– Silhouette score
– Pseudo-F statistic
The Elbow Method
§ Examine the within-cluster variation for different numbers of clusters
§ Adding another cluster almost always reduces the variance
§ The number of clusters k that minimizes the variance is k = n, where the variance becomes 0
§ We should instead choose the value of k at which adding another cluster no longer provides a substantially better result
Example 3
Elbow Method

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

# Record the within-cluster sum of squares (inertia) for each candidate k
from sklearn.cluster import KMeans
withinss = []
for k in range(2, 9):   # candidate numbers of clusters (illustrative range)
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_std)
    withinss.append(kmeans.inertia_)

# Plot within-cluster variation against k and look for the "elbow"
from matplotlib import pyplot
pyplot.plot(range(2, 9), withinss)
pyplot.show()
Can We Do Better?
§ Cluster cohesion refers to how tightly related the records within an individual cluster are
§ Cluster separation represents how distant the clusters are from each other
§ However, the sum of squares error (SSE) only accounts for cluster cohesion and decreases monotonically as the number of clusters increases
§ Good measures should incorporate both, as the silhouette score and the pseudo-F statistic do
The Silhouette Method
§ For each data value $i$, the silhouette is used to gauge how good the cluster assignment is for that point:

$$\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance between $i$ and all other data within the same cluster (represents cohesion) and $b_i$ is the average distance of $i$ to all points in the next closest cluster (represents separation)
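To make the formula concrete, here is a minimal from-scratch sketch (an illustration, not from the original slides) that computes $s_i$ for a single point, assuming one-dimensional data and absolute distance:

def silhouette_point(x, own_cluster, nearest_cluster):
    # a_i: average distance from x to the other points in its own cluster
    a = sum(abs(x - p) for p in own_cluster if p != x) / (len(own_cluster) - 1)
    # b_i: average distance from x to the points in the next closest cluster
    b = sum(abs(x - p) for p in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)

print(silhouette_point(0, [0, 2, 4], [6, 10]))  # 0.625, as in the worked example below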

The Silhouette Method (Cont’d)
§ A positive value indicates that the assignment is good, with higher values better than lower values
§ A value close to zero is considered weak, since the observation could have been assigned to the next closest cluster with little negative consequence
§ A negative value indicates the observation is misclustered, since assignment to the next closest cluster would have been better
The Average Silhouette Value
§ The average silhouette value over all records yields a measure of how well the cluster solution fits
§ A thumbnail interpretation, meant as a guide only:
– 0.5 or better provides good evidence of the reality of the clusters in the data
– 0.25 – 0.5 provides some evidence of the reality of the clusters in the data
– Less than 0.25 provides scant evidence of cluster reality
Silhouette Examples
§ Apply k-means clustering to the following data set:

$x_1 = 0, \; x_2 = 2, \; x_3 = 4, \; x_4 = 6, \; x_5 = 10$

§ The first three data values are assigned to Cluster 1 and the last two to Cluster 2
§ The center for Cluster 1 is $m_1 = 2$ and for Cluster 2 is $m_2 = 8$
§ The values $a_i$ are the average distance between $x_i$ and all other data within the same cluster; the values $b_i$ are the average distance between $x_i$ and all the data in the other cluster
Silhouette Examples (Cont’d)
§ Calculations for the individual data values, using $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$:

x_i   a_i   b_i   max(a_i, b_i)   s_i
0     3     8     8               (8 − 3)/8 = 0.625
2     2     6     6               (6 − 2)/6 ≈ 0.67
4     3     4     4               (4 − 3)/4 = 0.25
6     4     4     4               (4 − 4)/4 = 0
10    4     8     8               (8 − 4)/8 = 0.5

Mean Silhouette = 0.4083
Example 4
Silhouette Method

X = [[0],[2],[4],[6],[10]]

from sklearn.cluster import KMeans
labels = KMeans(n_clusters=2).fit_predict(X)

from sklearn.metrics import silhouette_samples
print(silhouette_samples(X, labels))   # silhouette value for each observation

from sklearn.metrics import silhouette_score
print(silhouette_score(X, labels))     # mean silhouette = 0.4083
Example 5
Silhouette Method (2)

from sklearn import datasets
import numpy

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Silhouette value for each individual observation
from sklearn.metrics import silhouette_samples
silhouette = silhouette_samples(X_std, labels)

import pandas
df = pandas.DataFrame({'label':labels,'silhouette':silhouette})

print('Average Silhouette Score for Cluster 0: ',numpy.average(df[df['label'] == 0].silhouette))
print('Average Silhouette Score for Cluster 1: ',numpy.average(df[df['label'] == 1].silhouette))
print('Average Silhouette Score for Cluster 2: ',numpy.average(df[df['label'] == 2].silhouette))

# Overall average silhouette score
from sklearn.metrics import silhouette_score
print('Overall Average Silhouette Score: ', silhouette_score(X_std, labels))
Silhouette Analysis of Iris Data
§ Scatterplot of petal width vs. petal length by species, where two of the species seem to blend into one another
Silhouette Analysis of Iris Data
§ Results of k=3 k-means clustering do not match perfectly with the flower types
Silhouette Analysis of Iris Data
§ The average silhouette scores for each cluster and for the overall model are as follows:

                  Cluster 1   Cluster 2   Cluster 3   Overall
Mean Silhouette   0.90        0.61        0.51        0.68

§ Cluster 1 is the best defined, with the highest average silhouette score
§ Clusters 2 and 3 are also reasonably well defined, with average silhouette scores of 0.61 and 0.51
Silhouette Analysis of Iris Data
§ Results of k=2 k-means clustering combine Versicolor and Virginica into a single cluster
Silhouette Analysis of Iris Data
§ The silhouette plot has fewer low values than for k=3, and the mean values are higher:

                  Cluster 1   Cluster 2   Overall
Mean Silhouette   0.92        0.65        0.74

§ So, if we measure cluster performance using the silhouette score, the model with k=2 is better than the model with k=3
§ Should we use k=3 or k=2?
The Pseudo-F Statistic
§ Let:
– $k$ be the number of clusters
– $\sum n_i = N$ be the total sample size
– $x_{ij}$ refer to the $j$th data value in the $i$th cluster
– $m_i$ refer to the cluster center (centroid) of the $i$th cluster
– $M$ represent the grand mean of all the data
– $\mathrm{Distance}(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
The Pseudo-F Statistic (Cont’d)
§ Then the sum of squares between the clusters is:

$$SSB = \sum_{i=1}^{k} n_i \cdot \mathrm{Distance}^2(m_i, M)$$

§ And the sum of squares error, or the sum of squares within the clusters, is:

$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathrm{Distance}^2(x_{ij}, m_i)$$

§ And the pseudo-F statistic is:

$$F = \frac{MSB}{MSE} = \frac{SSB/(k-1)}{SSE/(N-k)}$$
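These formulas translate directly into code. A minimal NumPy sketch (an illustration, not from the original slides), using the cluster means as centroids and one-dimensional data:

import numpy as np

def pseudo_f(clusters):
    # clusters: a list of 1-D arrays, one array per cluster
    all_data = np.concatenate(clusters)
    N, k = len(all_data), len(clusters)
    M = all_data.mean()   # grand mean of all the data
    ssb = sum(len(c) * (c.mean() - M) ** 2 for c in clusters)   # between clusters
    sse = sum(((c - c.mean()) ** 2).sum() for c in clusters)    # within clusters
    return (ssb / (k - 1)) / (sse / (N - k))

print(pseudo_f([np.array([0, 2, 4]), np.array([6, 10])]))  # 8.1, as computed below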

The Pseudo-F Statistic (Cont’d)
§ The hypotheses being tested are:
$H_0$: There is no cluster in the data.
$H_a$: There are $k$ clusters in the data.

§ Reject $H_0$ for a sufficiently small p-value, where:

p-value $= P(F_{k-1,\,N-k} > \text{pseudo-}F\text{ value})$
The Pseudo-F Statistic (Cont’d)
§ The pseudo-F statistic should not be used to determine the presence of clusters, but it can be used to select the optimal number of clusters as follows (see the sketch below):
1. Use a clustering algorithm to develop a clustering solution for a variety of values of k.
2. Calculate the pseudo-F statistic for each candidate and select the candidate with the highest pseudo-F statistic as the best clustering solution.
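A minimal sketch of this procedure (illustrative; it assumes X_std holds the standardized data from the earlier examples, and the candidate range of k is arbitrary):

from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

scores = {}
for k in range(2, 9):   # candidate numbers of clusters
    labels = KMeans(n_clusters=k).fit_predict(X_std)
    scores[k] = calinski_harabasz_score(X_std, labels)   # pseudo-F for this k

best_k = max(scores, key=scores.get)   # highest pseudo-F wins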

Pseudo-F Statistic Examples
§ Apply k-means clustering to the following data set:

$x_1 = 0, \; x_2 = 2, \; x_3 = 4, \; x_4 = 6, \; x_5 = 10$

§ The first three data values are assigned to Cluster 1 and the last two to Cluster 2
§ The center for Cluster 1 is $m_1 = 2$ and for Cluster 2 is $m_2 = 8$
§ There are $n_1 = 3$ and $n_2 = 2$ data values, N = 5, and the grand mean is M = 4.4. And, because we are in one dimension, $\mathrm{Distance}(m_i, M) = |m_i - M|$
Pseudo-F Statistic Examples
§ Then
$$SSB = \sum_{i=1}^{k} n_i \cdot \mathrm{Distance}^2(m_i, M) = 3 \cdot (2 - 4.4)^2 + 2 \cdot (8 - 4.4)^2 = 43.2$$

§ And
$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathrm{Distance}^2(x_{ij}, m_i) = (0-2)^2 + (2-2)^2 + (4-2)^2 + (6-8)^2 + (10-8)^2 = 16$$

§ And
$$F = \frac{MSB}{MSE} = \frac{SSB/(k-1)}{SSE/(N-k)} = \frac{43.2/1}{16/3} = \frac{43.2}{5.33} = 8.1$$
Pseudo-F Statistic Examples
§ The distribution of the F statistic, with k − 1 = 1 and N − k = 3 degrees of freedom, gives a p-value of 0.06532
Example 6
Pseudo-F Statistic

X = [[0],[2],[4],[6],[10]]

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X)

# The pseudo-F statistic is also known as the Calinski-Harabasz score
from sklearn.metrics import calinski_harabasz_score
pseudo_f = calinski_harabasz_score(X, labels)
print(pseudo_f)   # 8.1

# p-value from the F distribution
from scipy.stats import f
df1 = 1 # df1 = k-1
df2 = 3 # df2 = n-k
print(f.sf(pseudo_f, df1, df2))   # 0.06532
Example 7
Pseudo-F Statistic (2)

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:,2:]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_std = scaler.fit_transform(X)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X_std)

# Pseudo-F statistic (Calinski-Harabasz score) and its p-value
from sklearn.metrics import calinski_harabasz_score
pseudo_f = calinski_harabasz_score(X_std, labels)
print(pseudo_f)

from scipy.stats import f
df1 = 2    # df1 = k - 1 = 3 - 1
df2 = 147  # df2 = N - k = 150 - 3
print(f.sf(pseudo_f, df1, df2))
Exercise #1
§ Use the RidingMower dataset
§ Based on Income and Lot size, develop clusters using the k-means clustering algorithm with k=3
§ Print the average silhouette score of each cluster and the overall average silhouette score for this cluster model
Exercise #2
§ Use the same dataset as in Exercise #1
§ Based on Income and Lot size, develop clusters using the k-means clustering algorithm with different values of k
§ Find the best k based on the average silhouette score
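One possible approach (a sketch only; the file name and the column names Income and Lot_Size are assumptions, so adjust them to match your copy of the dataset):

import pandas
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pandas.read_csv('RidingMowers.csv')   # file name assumed
X_std = MinMaxScaler().fit_transform(df[['Income', 'Lot_Size']])   # column names assumed

for k in range(2, 7):   # candidate values of k (illustrative range)
    labels = KMeans(n_clusters=k).fit_predict(X_std)
    print(k, silhouette_score(X_std, labels))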
