Clustering Techniques
Earning is in Learning
- Rajesh Jakhotia
Content
• Clustering Definition
• Distance Measures
• Hierarchical Clustering
• K-Means Clustering
Learning Objectives
• Why Clustering?
• What is Clustering?
• Various Distance Measures
• Hierarchical Clustering
• K-Means Clustering
Clustering Definitions
Distance Measures
Why Clustering? Applications of Clustering
• Why Clustering?
  – To group similar objects / data points
  – To find homogeneous sets of customers
  – To segment the data into similar groups
• Applications:
  – Marketing: Customer Segmentation & Profiling
  – Libraries: Book classification
  – Retail: Store Categorization
What is Clustering?
• Clustering is a technique for finding similar groups in
data, called clusters.
What is a Cluster?
• A cluster can be defined as a collection of objects which are similar to one another within the cluster and dissimilar to objects in other clusters

Shopper   Price Conscious   Brand Loyalty
A         2                 4
B         8                 2

[Scatter plot: Brand Loyalty vs. Price Conscious for shoppers A–E]

• How do we define “Similar” in clustering?
  – Based on Distance
How do we define “(dis)Similar”?
• Similar in clustering is based on Distance
• Various distance measures
  – Euclidean Distance
  – Chebyshev Distance
  – Manhattan Distance …and more
[Figure: city-block path between points A and B; Manhattan Distance = 8 + 4 = 12]
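A quick check of the block example in R, assuming hypothetical coordinates (0, 0) and (8, 4) for A and B:

pts <- rbind(A = c(0, 0), B = c(8, 4))   ## 8 blocks apart horizontally, 4 vertically
dist(pts, method = "manhattan")          ## 12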
Distance Computation
[Figure: points A and B plotted on a grid]
• What is the distance between Point A and B? Ans: 7
• In general, for A(x₁, y₁) and B(x₂, y₂):
  Ans: D_AB = √[(x₂ − x₁)² + (y₂ − y₁)²]
Euclidean Distance
What is the distance between Point A and B in n-dimensional space?
If A(x₁, y₁, …, z₁) and B(x₂, y₂, …, z₂) are Cartesian coordinates, then using the Euclidean distance we get the distance AB as
D_AB = √[(x₂ − x₁)² + (y₂ − y₁)² + … + (z₂ − z₁)²]
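A quick sketch in R with two hypothetical 3-dimensional points:

A <- c(1, 2, 3); B <- c(4, 6, 3)          ## hypothetical points
sqrt(sum((B - A)^2))                      ## 5, i.e. sqrt(9 + 16 + 0)
dist(rbind(A, B), method = "euclidean")   ## same result via dist()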
Chebyshev Distance
• In mathematics, Chebyshev distance is a
metric defined on a vector space where the
distance between two vectors is the
greatest of their differences along any
coordinate dimension
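Equivalently, D(A, B) = max over i of |aᵢ − bᵢ|. A quick sketch with the same hypothetical points; in R's dist() this metric is method = "maximum":

A <- c(1, 2, 3); B <- c(4, 6, 3)        ## hypothetical points
max(abs(B - A))                         ## 4 = max(3, 4, 0)
dist(rbind(A, B), method = "maximum")   ## same result via dist()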
Types of Clustering
Types of Clustering Procedures
• Hierarchical clustering is characterized by a tree-like structure and uses distance as a measure of (dis)similarity
• Partitioning algorithms start with a set of partitions as clusters and iteratively refine the partitions to form stable clusters
Steps involved in Clustering
Formulate the problem – Select variables to be used for clustering
Hierarchical Clustering
• Hierarchical Clustering is a clustering technique that creates clusters in a hierarchical, tree-like structure
Hierarchical Clustering | Agglomerative Clustering Steps
• Start with each record as a cluster of one record each
• Sequentially merge the 2 closest records, using distance as a measure of (dis)similarity, to form a cluster. This reduces the number of clusters by 1
• Repeat the above step with the new cluster and all remaining clusters till we have one big cluster
[Dendrogram, Step 0 to Step 4: a and b merge into (ab); d and e merge into (de); c joins (de) to form (cde); finally (ab) and (cde) merge into (abcde)]
How do you measure the distance between cluster (a,b) and (c), or the cluster (a,b) and (d,e)?
Agglomerative Clustering Linkage Algorithms
• Single linkage – Minimum distance or Nearest neighbour rule
• Complete linkage – Maximum distance or Farthest neighbour rule
• Average linkage – Average of all pairwise distances between the two clusters (the rule used in the R code later, and sketched below)
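For illustration, this is how the linkage rule is passed to hclust(); d.euc is the Euclidean distance object built on the next slides, and the hc.* names are hypothetical:

hc.single   <- hclust(d.euc, method = "single")    ## nearest-neighbour rule
hc.complete <- hclust(d.euc, method = "complete")  ## farthest-neighbour rule
hc.average  <- hclust(d.euc, method = "average")   ## mean of all pairwise distances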
Hierarchical Clustering for Retail Customers
Building the hierarchical clusters (without variable scaling)

?dist ## to get help on the dist function
d.euc <- dist(x=RCDF[,3:7], method = "euclidean")

## we will use the hclust function to build the cluster
?hclust ## to get help on the hclust function

clus1 <- hclust(d.euc, method = "average")
plot(clus1, labels = as.character(RCDF[,2]))

Note: The two clusters formed are primarily on the basis of AVG_MTHLY_SPEND. The Euclidean distance computation in this case is dominated by the AVG_MTHLY_SPEND variable, as the range of this variable is too large compared to the other variables. To avoid this problem, we should scale the variables used for clustering.
Building the hierarchical clusters (with variable scaling)
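A minimal sketch of the scaled version, assuming the same five variables are standardized with scale(); the object names scaled.RCDF and clus2 match the ones used on the later slides:

scaled.RCDF <- scale(RCDF[,3:7])   ## mean 0, sd 1 for every variable
d.euc.scaled <- dist(x=scaled.RCDF, method = "euclidean")
clus2 <- hclust(d.euc.scaled, method = "average")
plot(clus2, labels = as.character(RCDF[,2]))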
Understanding the Height Calculation in Clustering
Dist.    A     B     C     D     E     F     G     H     I
B      4.25
C      3.41  3.84
D      2.51  3.47  1.26
E      4.27  2.70  2.92  3.20
F      3.98  2.21  3.58  2.85  3.43
G      4.38  3.02  3.38  3.35  1.41  3.17
H      3.40  3.60  3.66  2.93  3.24  2.35  2.46
I      3.53  3.39  4.05  3.21  3.48  2.18  2.61  0.73
J      4.55  2.97  3.59  3.04  3.41  1.24  2.80  2.12  2.06

Suppose the records have merged into two clusters: (A, C, D) and (H, I, F, J, B, E, G). With average linkage, the height at which these two clusters merge is the average of all pairwise distances between a member of the first cluster and a member of the second:

         H     I     F     J     B     E     G
A      3.40  3.53  3.98  4.55  4.25  4.27  4.38
C      3.66  4.05  3.58  3.59  3.84  2.92  3.38
D      2.93  3.21  2.85  3.04  3.47  3.20  3.35

(A, C, D) , (H, I, F, J, B, E, G): average of the 21 distances = 3.59
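As a check, a small R sketch (the object name cross is hypothetical) reproducing the 3.59 from the cross-cluster distances above:

## average-linkage height between (A, C, D) and (H, I, F, J, B, E, G)
cross <- matrix(c(3.40, 3.53, 3.98, 4.55, 4.25, 4.27, 4.38,
                  3.66, 4.05, 3.58, 3.59, 3.84, 2.92, 3.38,
                  2.93, 3.21, 2.85, 3.04, 3.47, 3.20, 3.35),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A","C","D"),
                                c("H","I","F","J","B","E","G")))
round(mean(cross), 2)  ## 3.59 – the height at which the two clusters merge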
Profiling the clusters
## profiling the clusters
## cut the dendrogram (clus2) into k = 3 clusters and attach the membership
RCDF$Clusters <- cutree(clus2, k=3)
## cluster-wise means of the clustering variables
## (dropping the identifier columns and the new Clusters column)
aggr <- aggregate(RCDF[, -c(1, 2, 8)], list(RCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[,1],
                           Freq = as.vector(table(RCDF$Clusters)),
                           aggr[,-1])
View(clus.profile)
Partitioning Clustering
K-Means Clustering
• K-Means is the most widely used non-hierarchical clustering technique
K-Means Algorithm
• Input required: the number of clusters to be formed (say K)
• Steps (a minimal from-scratch sketch follows this list)
  1. Assume K centroids (for K clusters)
  2. Compute the Euclidean distance of each object from these centroids
  3. Assign each object to the cluster with the shortest distance
  4. Compute the new centroid (mean) of each cluster based on the objects assigned to it; the K means obtained become the new centroids of the clusters
  5. Repeat steps 2 to 4 till there is convergence
     • i.e. there is no movement of objects from one cluster to another
     • or a threshold number of iterations has occurred
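A minimal from-scratch sketch of these steps, not the production kmeans() call shown later; the function name my.kmeans is hypothetical:

my.kmeans <- function(data, k, max.iter = 100, seed = 1234) {
  data <- as.matrix(data)
  set.seed(seed)
  centroids <- data[sample(nrow(data), k), , drop = FALSE]  ## step 1: assume K centroids
  assign.old <- rep(0, nrow(data))
  for (iter in 1:max.iter) {                                ## step 5: iterate till convergence
    ## step 2: Euclidean distance of every object from every centroid
    d <- sapply(1:k, function(j) sqrt(rowSums(sweep(data, 2, centroids[j, ])^2)))
    assign.new <- max.col(-d)                                ## step 3: nearest centroid wins
    if (all(assign.new == assign.old)) break                 ## no object moved: converged
    assign.old <- assign.new
    ## step 4: new centroid = mean of the objects assigned to each cluster
    ## (a real implementation would also handle empty clusters)
    centroids <- t(sapply(1:k, function(j)
      colMeans(data[assign.new == j, , drop = FALSE])))
  }
  list(cluster = assign.new, centers = centroids)
}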
K-Means advantages
• K-Means is a superior technique compared to the hierarchical technique, as it is less impacted by outliers
Why find the optimal No. of Clusters?
[Figure: the same data plotted on axes D1 and D2 – the data to be clustered, followed by two possible clustering solutions with different cluster centres C1, C2, C3]
R code to get Optimal No. of Clusters
## code taken from the R-statistics blog http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
## Identifying the optimal number of clusters form WSS
wssplot <- function(data, nc=15, seed=1234) {
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc) {
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(scaled.RCDF, nc=5)
Using NbClust to get optimal No. of Clusters
## Identifying the optimal number of clusters
## install.packages("NbClust")
library(NbClust)
set.seed(1234)
nc <- NbClust(KRCDF[,c(-1,-2)], min.nc=2, max.nc=4, method="kmeans")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")
K-Means Clustering R Code
?kmeans
kmeans.clus = kmeans(x=scaled.RCDF, centers = 3, nstart = 25)
## x = data frame to be clustered
## centers = No. of clusters to be created
## nstart = No. of random sets to be used for clustering
kmeans.clus
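A few components of the returned object worth inspecting; these are standard fields of the object returned by kmeans():

kmeans.clus$centers                         ## cluster centroids (in scaled units)
kmeans.clus$size                            ## number of records in each cluster
kmeans.clus$betweenss / kmeans.clus$totss   ## share of total variance explained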
Plotting the clusters
## plotting the clusters
## install.packages("fpc")
library(fpc)
plotcluster(scaled.RCDF, kmeans.clus$cluster)
Profiling the clusters
## profiling the clusters
## attach the K-Means cluster membership and compute cluster-wise means
KRCDF$Clusters <- kmeans.clus$cluster
aggr <- aggregate(KRCDF[, -c(1, 2, 8)], list(KRCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[,1],
                           Freq = as.vector(table(KRCDF$Clusters)),
                           aggr[,-1])
View(clus.profile)
Next steps after clustering
• Clustering provides you with clusters in the given dataset
Thank you
Contact us:
[email protected]