Clustering Techniques

This document discusses different clustering techniques. It defines clustering as an unsupervised learning technique that groups similar data points together into clusters. It describes two main clustering methods: hierarchical clustering and k-means clustering. Hierarchical clustering uses distance as a measure of similarity to group data points into a hierarchical tree structure. K-means clustering iteratively assigns data points to clusters to form stable clusters. The document also discusses different distance measures and linkage algorithms used in hierarchical clustering.


Clustering

Earning is in Learning
- Rajesh Jakhotia
Content
• Clustering Definition
• Distance Measure
• Hierarchical Clustering
• K Means Clustering

Learning Objectives
• Why Clustering?
• What is Clustering?
• Various Distance Measures
• Hierarchical Clustering
• K Means Clustering

Clustering Definitions
Distance Measures
Why Clustering? Applications of Clustering
• Why Clustering?
  – To group similar objects / data points
  – To find homogeneous sets of customers
  – To segment the data into similar groups

• Applications:
  – Marketing: Customer Segmentation & Profiling
  – Libraries: Book classification
  – Retail: Store categorization
What is Clustering?
• Clustering is a technique for finding similar groups in data, called clusters.

• Clustering is an Unsupervised Learning technique.

• Clustering can also be thought of as a case-reduction technique, wherein it groups together similar records into clusters.

• Clustering helps simplify data by reducing many data points into a few clusters (segments).
What is a Cluster?
• A cluster can be defined as a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.

  Shopper   Price Conscious   Brand Loyalty
  A                2                4
  B                8                2
  C                9                3
  D                1                5
  E                8                1

  [Scatter plot of the five shoppers: Price Conscious on the x-axis (0 to 10), Brand Loyalty on the y-axis.]

• How do we define "similar" in clustering?
  – Based on Distance
How do we define "(dis)similar"?
• "Similar" in clustering is based on Distance
• Various distance measures:
  – Euclidean Distance
  – Chebyshev Distance
  – Manhattan Distance …and more

  [City-block grid figure: points A and B lie 8 blocks apart in one direction and 4 blocks apart in the other.]
  Manhattan Distance = 8 + 4 = 12
  Chebyshev Distance = Max(8, 4) = 8
  Euclidean Distance = sqrt(8^2 + 4^2) = 8.94
Distance Computation
[Points A and B on a straight line, 7 units apart]
What is the distance between Point A and B?
Ans: 7

[Points A and B in the plane, not on the same line]
What is the distance between Point A and B?
Ans: AB = sqrt((x2 − x1)^2 + (y2 − y1)^2)
(Remember the Pythagoras Theorem)
Euclidean Distance
 What is the distance between Point A and B in n-dimensional space?
 If A (x1, y1, …, z1) and B (x2, y2, …, z2) are Cartesian coordinates,
 then by using the Euclidean Distance we get the distance AB as:
   DAB = sqrt[(x2−x1)^2 + (y2−y1)^2 + … + (z2−z1)^2]
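In R this n-dimensional formula is simply the square root of the summed squared differences; a tiny sketch with two assumed 3-dimensional points:

## Euclidean distance between two illustrative points (coordinates assumed for the example)
A <- c(1, 2, 3)
B <- c(4, 6, 3)
sqrt(sum((B - A)^2))    # differences 3, 4, 0  ->  distance 5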
Chebyshev Distance
• In mathematics, the Chebyshev distance is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension.

• Assume two vectors: A (x1, y1, …, z1) & B (x2, y2, …, z2)

• Chebyshev Distance = Max( |x2 − x1|, |y2 − y1|, …, |z2 − z1| )

Reference Link:
https://en.wikipedia.org/wiki/Chebyshev_distance
Manhattan Distance
• Manhattan Distance is also called City Block Distance

• Assume two vectors: A (x1, y1, …, z1) & B (x2, y2, …, z2)

• Manhattan Distance = |x2 − x1| + |y2 − y1| + … + |z2 − z1|

  [City-block grid figure, as before: A and B lie 8 blocks apart in one direction and 4 blocks apart in the other.]
  Manhattan Distance = 8 + 4 = 12
  Chebyshev Distance = Max(8, 4) = 8
  Euclidean Distance = sqrt(8^2 + 4^2) = 8.94
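As a quick check, R's dist() function reproduces all three measures for the block example; the coordinates below are assumed for illustration (the two points differ by 8 units on one axis and 4 on the other):

## Illustrative coordinates: A and B differ by 8 blocks east-west and 4 blocks north-south
pts <- rbind(A = c(0, 4),
             B = c(8, 0))

dist(pts, method = "euclidean")   # sqrt(8^2 + 4^2) = 8.94
dist(pts, method = "maximum")     # Chebyshev: max(8, 4) = 8
dist(pts, method = "manhattan")   # 8 + 4 = 12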
Types of Clustering
Types of Clustering Procedures
 Hierarchical clustering is characterized by a tree-like structure and uses distance as a measure of (dis)similarity.

 Partitioning algorithms start with a set of partitions as clusters and iteratively refine the partitions to form stable clusters.
Steps involved in Clustering
Formulate the problem – Select variables to be used for clustering

Decide the Clustering Procedure (Hierarchical / Partitioning)

Select the measure of similarity (dis-similarity)

Choose cluster linkage algorithm (applicable in hierarchical clustering)

Decide on the number of clusters

Interpret the cluster output (Profile the clusters)

Validate the clusters

Hierarchical Clustering
Hierarchical Clustering
• Hierarchical Clustering is a clustering technique that tends to create clusters in a hierarchical, tree-like structure.

• Hierarchical clustering makes use of Distance as a measure of similarity.

• The tree-like cluster output is called a dendrogram.
Hierarchical Clustering | Agglomerative Clustering Steps
• Starts with each record as a cluster of one record each

• Sequentially merges the 2 closest records/clusters, using distance as the measure of (dis)similarity, to form a new cluster. This reduces the number of clusters by 1

• Repeat the above step with the new cluster and all remaining clusters till we have one big cluster

  [Illustration, steps 0 to 4: a and b merge into (a,b); d and e merge into (d,e); c joins (d,e) to form (c,d,e); finally (a,b) and (c,d,e) merge into (a,b,c,d,e).]

How do you measure the distance between cluster (a,b) and (c), or between cluster (a,b) and (d,e)?
Agglomerative Clustering Linkage Algorithms
• Single linkage – Minimum distance or Nearest-neighbour rule

• Complete linkage – Maximum distance or Farthest-neighbour rule

• Average linkage – Average of the distances between all pairs of records across the two clusters

• Centroid method – Combine the clusters with the minimum distance between the centroids of the two clusters

• Ward's method – Combine the clusters for which the increase in within-cluster variance is the smallest

(Each of these corresponds to a value of the method argument of R's hclust(), as sketched below.)
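A short sketch of how each linkage is selected in hclust(); d.euc is the Euclidean distance object built a few slides later and is assumed here only for illustration:

## Each linkage algorithm corresponds to a 'method' value in hclust()
clus.single   <- hclust(d.euc, method = "single")    # nearest-neighbour rule
clus.complete <- hclust(d.euc, method = "complete")  # farthest-neighbour rule
clus.average  <- hclust(d.euc, method = "average")   # average of all pairwise distances
clus.centroid <- hclust(d.euc, method = "centroid")  # distance between cluster centroids
clus.ward     <- hclust(d.euc, method = "ward.D2")   # smallest increase in within-cluster variance

The examples that follow in this deck use method = "average".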
Hierarchical Clustering for Retail Customers

## Let us find the clusters in the given Retail Customer Spends data
## We will use the Hierarchical Clustering technique
## Let us first set the working directory path and import the data
setwd("D:/K2Analytics/Clustering/")
RCDF <- read.csv("datafiles/Cust_Spend_Data.csv", header = TRUE)
View(RCDF)

HyperMarket Customer Spend MetaData:
  AVG_Mthly_Spend: The average monthly amount spent by the customer
  No_of_Visits: The number of times a customer visited the HyperMarket in a month
  Item Counts: Counts of Apparel, Fruits and Vegetable, and Staple items purchased in a month
Building the hierarchical clusters (without variable scaling)
?dist    ## to get help on the distance function
d.euc <- dist(x = RCDF[, 3:7], method = "euclidean")

## we will use the hclust function to build the cluster
?hclust  ## to get help on the hclust function
clus1 <- hclust(d.euc, method = "average")
plot(clus1, labels = as.character(RCDF[, 2]))

Note: The two clusters formed are primarily on the basis of AVG_MTHLY_SPEND. The Euclidean distance computation in this case is dominated by the AVG_MTHLY_SPEND variable, as its range is far larger than that of the other variables. To avoid this problem, we should scale the variables used for clustering.
Building the hierarchical clusters (with variable scaling)

## the scale function standardizes the values
scaled.RCDF <- scale(RCDF[, 3:7])
head(scaled.RCDF, 10)

d.euc <- dist(x = scaled.RCDF, method = "euclidean")
clus2 <- hclust(d.euc, method = "average")
plot(clus2, labels = as.character(RCDF[, 2]))
rect.hclust(clus2, k = 3, border = "red")
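For one column, scale() simply centres the values on the column mean and divides by the column standard deviation; a small sketch, assuming the column name AVG_Mthly_Spend from the metadata slide:

## z-score for a single variable: equivalent to the corresponding column of scale(RCDF[, 3:7])
z <- (RCDF$AVG_Mthly_Spend - mean(RCDF$AVG_Mthly_Spend)) / sd(RCDF$AVG_Mthly_Spend)
head(z)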
Understanding the Height Calculation in Clustering

## Let us see the distance matrix
d.euc

Dist.     A      B      C      D      E      F      G      H      I
B       4.25
C       3.41   3.84
D       2.51   3.47   1.26
E       4.27   2.70   2.92   3.20
F       3.98   2.21   3.58   2.85   3.43
G       4.38   3.02   3.38   3.35   1.41   3.17
H       3.40   3.60   3.66   2.93   3.24   2.35   2.46
I       3.53   3.39   4.05   3.21   3.48   2.18   2.61   0.73
J       4.55   2.97   3.59   3.04   3.41   1.24   2.80   2.12   2.06

## Let us see the heights at which the clusters merge
clus2$height
Worked example (using the distance matrix above):

First merges – the closest pairs:
  (C,D) at 1.26     (H,I) at 0.73     (F,J) at 1.24     (E,G) at 1.41

Next merges – the height is the average of all pairwise distances between the two groups:
  A with (C,D):      A,C = 3.41; A,D = 2.51                           -> height 2.96
  (H,I) with (F,J):  H,F = 2.35; H,J = 2.12; I,F = 2.18; I,J = 2.06   -> height 2.17
  B with (E,G):      B,E = 2.70; B,G = 3.02                           -> height 2.86

Merge of (H,I,F,J) with (B,E,G):
          B      E      G
  H     3.60   3.24   2.46
  I     3.39   3.48   2.61
  F     2.21   3.43   3.17
  J     2.97   3.41   2.80
  -> height 3.06

Final merge of (A,C,D) with (H,I,F,J,B,E,G):
          H      I      F      J      B      E      G
  A     3.40   3.53   3.98   4.55   4.25   4.27   4.38
  C     3.66   4.05   3.58   3.59   3.84   2.92   3.38
  D     2.93   3.21   2.85   3.04   3.47   3.20   3.35
  -> height 3.59
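These merge heights can be cross-checked in R from the distance matrix; a small sketch, assuming the 10 rows of scaled.RCDF correspond to customers A to J in order (so H, I, F, J are rows 8, 9, 6, 10):

## Turn the dist object into a full matrix so groups of rows/columns can be indexed
m <- as.matrix(d.euc)

## Average-linkage height for merging (H,I) with (F,J):
## the mean of all pairwise distances between the two groups (rows 8,9 vs rows 6,10)
mean(m[c(8, 9), c(6, 10)])                      # ~2.17, as computed above

## Average-linkage height for the final merge of (A,C,D) with (H,I,F,J,B,E,G)
mean(m[c(1, 3, 4), c(8, 9, 6, 10, 2, 5, 7)])    # ~3.59

## The same values appear among the merge heights reported by hclust
clus2$height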
Profiling the clusters
## profiling the clusters
## assign each record its cluster membership by cutting the tree at k = 3
RCDF$Clusters <- cutree(clus2, k = 3)
## mean of each clustering variable per cluster (drop the ID, name and Clusters columns)
aggr <- aggregate(RCDF[, -c(1, 2, 8)], list(RCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[, 1],
                           Freq    = as.vector(table(RCDF$Clusters)),
                           aggr[, -1])
View(clus.profile)
Partitioning Clustering

K Means Clustering
K Means Clustering
• K-Means is the most widely used non-hierarchical clustering technique

• It is not based directly on Distance…

• It is based on within-cluster Variation, in other words the Squared Distance of each record from the Centre of its Cluster

• The algorithm aims at segmenting the data such that the within-cluster variation is reduced
K Means Algorithm
• Input Required: the number of clusters to be formed (say K)

• Steps
  1. Assume K Centroids (for K Clusters)
  2. Compute the Euclidean distance of each object from these Centroids
  3. Assign each object to the cluster whose centroid is nearest
  4. Compute the new centroid (mean) of each cluster based on the objects assigned to it. The K means obtained become the new centroids of the clusters
  5. Repeat steps 2 to 4 till there is convergence
     • i.e. there is no movement of objects from one cluster to another
     • or a threshold number of iterations has occurred
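The steps above can be written out as a small R sketch; this is illustrative only (empty clusters and other edge cases are not handled), and in practice the built-in kmeans() function shown later in the deck is what you would use:

## Minimal k-means sketch following steps 1-5 above (illustration only)
simple_kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  ## Step 1: assume K centroids (here, K randomly chosen records)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    ## Step 2: Euclidean distance of every record to every centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    ## Step 3: assign each record to the cluster with the nearest centroid
    new.assign <- apply(d, 1, which.min)
    ## Step 5: stop when no record changes cluster (convergence)
    if (all(new.assign == assign)) break
    assign <- new.assign
    ## Step 4: recompute each centroid as the mean of its assigned records
    for (j in 1:k) centroids[j, ] <- colMeans(x[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = centroids)
}

## e.g. simple_kmeans(scaled.RCDF, k = 3)$cluster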
K-means advantages
• K-means is considered superior to the hierarchical technique as it is less impacted by outliers

• Computationally, it is much faster than hierarchical clustering

• It is preferable to use it on interval or ratio-scaled data, as it uses Euclidean distance… it is desirable to avoid using it on ordinal data

• Challenge – the number of clusters has to be pre-defined and provided as input to the process
Why find the optimal No. of Clusters?
 Two Clusters – 2 possible solutions
 Three Clusters – multiple possible solutions

  [Illustration: the same data, plotted on dimensions D1 and D2, can be partitioned into clusters C1 and C2 in two different ways, and into clusters C1, C2 and C3 in several different ways.]
R code to get Optimal No. of Clusters
## code taken from the R-statistics blog http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
## Identifying the optimal number of clusters from the WSS (elbow) plot
wssplot <- function(data, nc = 15, seed = 1234) {
  ## WSS for k = 1 is the total sum of squares of the data
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  ## WSS for k = 2 .. nc, from kmeans
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

wssplot(scaled.RCDF, nc = 5)
Using NbClust to get optimal No. of Clusters
## Identifying the optimal number of clusters
## install.packages("NbClust")
library(NbClust)
set.seed(1234)
## KRCDF is assumed to be the customer spend data frame used for the k-means example
## (columns 1 and 2, the ID and name columns, are excluded)
nc <- NbClust(KRCDF[, c(-1, -2)], min.nc = 2, max.nc = 4, method = "kmeans")
table(nc$Best.n[1,])

barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
K Means Clustering R Code
?kmeans
kmeans.clus = kmeans(x=scaled.RCDF, centers = 3, nstart = 25)
## x = data frame to be clustered
## centers = No. of clusters to be created
## nstart = No. of random sets to be used for clustering
kmeans.clus

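Beyond printing the whole object, a few components of the fitted kmeans object are worth inspecting when judging the solution:

kmeans.clus$cluster        # cluster membership of each record
kmeans.clus$centers        # the K centroids (in the scaled variable space)
kmeans.clus$size           # number of records in each cluster
kmeans.clus$withinss       # within-cluster sum of squares, per cluster
kmeans.clus$tot.withinss   # total within-cluster variation minimised by the algorithm
kmeans.clus$betweenss      # between-cluster sum of squares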
Plotting the clusters
## plotting the clusters
## install.packages("fpc")
library(fpc)
plotcluster(scaled.RCDF, kmeans.clus$cluster)
Profiling the clusters
## profiling the clusters
KRCDF$Clusters <- kmeans.clus$cluster
aggr <- aggregate(KRCDF[, -c(1, 2, 8)], list(KRCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[, 1],
                           Freq    = as.vector(table(KRCDF$Clusters)),
                           aggr[, -1])

View(clus.profile)
Next steps after clustering
• Clustering provides you with clusters in the given dataset

• Clustering does not provide you with rules to classify future records

• To be able to classify future records you may do the following:
  – Build a Discriminant Model on the clustered data
  – Build a Classification Tree Model on the clustered data (see the sketch below)
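For instance, a classification tree can be grown with the cluster label as the target; a minimal sketch, assuming the rpart package and the RCDF data frame carrying the Clusters column created during profiling:

## Learn classification rules for the clusters (illustrative sketch)
# install.packages("rpart")
library(rpart)

## Drop the ID and name columns; Clusters (added earlier) becomes the target
tree.model <- rpart(as.factor(Clusters) ~ ., data = RCDF[, -c(1, 2)], method = "class")
print(tree.model)

## Future records can then be assigned to a cluster with predict()
# predict(tree.model, newdata = new.records, type = "class")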
References
• Chapter 9: Cluster Analysis (http://www.springer.com)
  – Google search: "www.springer.com cluster analysis chapter 9"

• http://sites.stat.psu.edu/~ajw13/stat505/fa06/19_cluster/09_cluster_wards.html

• https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/
Thank you

Contact us:
[email protected]
