Clustering Techniques

This document discusses different clustering techniques. It defines clustering as an unsupervised learning technique that groups similar data points together into clusters. It describes two main clustering methods: hierarchical clustering and k-means clustering. Hierarchical clustering uses distance as a measure of similarity to group data points into a hierarchical tree structure. K-means clustering iteratively assigns data points to clusters to form stable clusters. The document also discusses different distance measures and linkage algorithms used in hierarchical clustering.


Clustering

Earning is in Learning
- Rajesh Jakhotia
Content
• Clustering Definition
• Distance Measure
• Hierarchical Clustering
• K Means Clustering

Learning Objectives
• Why Clustering?
• What is Clustering?
• Various Distance Measures
• Hierarchical Clustering
• K Means Clustering

Clustering Definitions
Distance Measures
Why Clustering? Applications of Clustering
• Why Clustering?
  – To group similar objects / data points
  – To find homogeneous sets of customers
  – To segment the data into similar groups

• Applications:
  – Marketing: Customer Segmentation & Profiling
  – Libraries: Book classification
  – Retail: Store categorization
What is Clustering?
• Clustering is a technique for finding similar groups in data, called clusters.

• Clustering is an Unsupervised Learning technique.

• Clustering can also be thought of as a case-reduction technique, wherein it groups together similar records into clusters.

• Clustering helps simplify data by reducing many data points into a few clusters (segments).
What is a Cluster?
• A cluster can be defined as a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.

  Shopper   Price Conscious   Brand Loyalty
  A                2                4
  B                8                2
  C                9                3
  D                1                5
  E                8                1

  [Scatter plot of the five shoppers: Price Conscious on the x-axis (0 to 10), Brand Loyalty on the y-axis.]

• How do we define "similar" in clustering?
  – Based on Distance
How do we define "(dis)similar"?
• "Similar" in clustering is based on Distance
• Various distance measures:
  – Euclidean Distance
  – Chebyshev Distance
  – Manhattan Distance …and more

  [City-block grid figure: points A and B lie 8 blocks apart in one direction and 4 blocks apart in the other.]
  Manhattan Distance = 8 + 4 = 12
  Chebyshev Distance = Max(8, 4) = 8
  Euclidean Distance = sqrt(8^2 + 4^2) = 8.94
Distance Computation
[Points A and B on a straight line, 7 units apart]
What is the distance between Point A and B?
Ans: 7

[Points A and B in the plane, not on the same line]
What is the distance between Point A and B?
Ans: AB = sqrt((x2 − x1)^2 + (y2 − y1)^2)
(Remember the Pythagoras Theorem)
Euclidean Distance
 What is the distance between Point A and B in n-dimensional space?
 If A (x1, y1, …, z1) and B (x2, y2, …, z2) are Cartesian coordinates,
 then by using the Euclidean Distance we get the distance AB as:
   DAB = sqrt[(x2−x1)^2 + (y2−y1)^2 + … + (z2−z1)^2]
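In R this n-dimensional formula is simply the square root of the summed squared differences; a tiny sketch with two assumed 3-dimensional points:

## Euclidean distance between two illustrative points (coordinates assumed for the example)
A <- c(1, 2, 3)
B <- c(4, 6, 3)
sqrt(sum((B - A)^2))    # differences 3, 4, 0  ->  distance 5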
Chebyshev Distance
• In mathematics, the Chebyshev distance is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension.

• Assume two vectors: A (x1, y1, …, z1) & B (x2, y2, …, z2)

• Chebyshev Distance = Max( |x2 − x1|, |y2 − y1|, …, |z2 − z1| )

Reference Link:
https://en.wikipedia.org/wiki/Chebyshev_distance
Manhattan Distance
• Manhattan Distance is also called City Block Distance

• Assume two vectors: A (x1, y1, …, z1) & B (x2, y2, …, z2)

• Manhattan Distance = |x2 − x1| + |y2 − y1| + … + |z2 − z1|

  [City-block grid figure, as before: A and B lie 8 blocks apart in one direction and 4 blocks apart in the other.]
  Manhattan Distance = 8 + 4 = 12
  Chebyshev Distance = Max(8, 4) = 8
  Euclidean Distance = sqrt(8^2 + 4^2) = 8.94
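As a quick check, R's dist() function reproduces all three measures for the block example; the coordinates below are assumed for illustration (the two points differ by 8 units on one axis and 4 on the other):

## Illustrative coordinates: A and B differ by 8 blocks east-west and 4 blocks north-south
pts <- rbind(A = c(0, 4),
             B = c(8, 0))

dist(pts, method = "euclidean")   # sqrt(8^2 + 4^2) = 8.94
dist(pts, method = "maximum")     # Chebyshev: max(8, 4) = 8
dist(pts, method = "manhattan")   # 8 + 4 = 12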
Types of Clustering
Types of Clustering Procedures
 Hierarchical clustering is characterized by a tree-like structure and uses distance as a measure of (dis)similarity.

 Partitioning algorithms start with a set of partitions as clusters and iteratively refine the partitions to form stable clusters.
Steps involved in Clustering
Formulate the problem – Select variables to be used for clustering

Decide the Clustering Procedure (Hierarchical / Partitioning)

Select the measure of similarity (dis-similarity)

Choose cluster linkage algorithm (applicable in hierarchical clustering)

Decide on the number of clusters

Interpret the cluster output (Profile the clusters)

Validate the clusters

Hierarchical Clustering
Hierarchical Clustering
• Hierarchical Clustering is a clustering technique that tends to create clusters in a hierarchical, tree-like structure.

• Hierarchical clustering makes use of Distance as a measure of similarity.

• The tree-like cluster output is called a dendrogram.
Hierarchical Clustering | Agglomerative Clustering Steps
• Starts with each record as a cluster of one record each

• Sequentially merges the 2 closest records/clusters, using distance as the measure of (dis)similarity, to form a new cluster. This reduces the number of clusters by 1

• Repeat the above step with the new cluster and all remaining clusters till we have one big cluster

  [Illustration, steps 0 to 4: a and b merge into (a,b); d and e merge into (d,e); c joins (d,e) to form (c,d,e); finally (a,b) and (c,d,e) merge into (a,b,c,d,e).]

How do you measure the distance between cluster (a,b) and (c), or between cluster (a,b) and (d,e)?
Agglomerative Clustering Linkage Algorithms
• Single linkage – Minimum distance or Nearest-neighbour rule

• Complete linkage – Maximum distance or Farthest-neighbour rule

• Average linkage – Average of the distances between all pairs of records across the two clusters

• Centroid method – Combine the clusters with the minimum distance between the centroids of the two clusters

• Ward's method – Combine the clusters for which the increase in within-cluster variance is the smallest

(Each of these corresponds to a value of the method argument of R's hclust(), as sketched below.)
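A short sketch of how each linkage is selected in hclust(); d.euc is the Euclidean distance object built a few slides later and is assumed here only for illustration:

## Each linkage algorithm corresponds to a 'method' value in hclust()
clus.single   <- hclust(d.euc, method = "single")    # nearest-neighbour rule
clus.complete <- hclust(d.euc, method = "complete")  # farthest-neighbour rule
clus.average  <- hclust(d.euc, method = "average")   # average of all pairwise distances
clus.centroid <- hclust(d.euc, method = "centroid")  # distance between cluster centroids
clus.ward     <- hclust(d.euc, method = "ward.D2")   # smallest increase in within-cluster variance

The examples that follow in this deck use method = "average".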
Hierarchical Clustering for Retail Customers

## Let us find the clusters in the given Retail Customer Spends data
## We will use the Hierarchical Clustering technique
## Let us first set the working directory path and import the data
setwd("D:/K2Analytics/Clustering/")
RCDF <- read.csv("datafiles/Cust_Spend_Data.csv", header = TRUE)
View(RCDF)

HyperMarket Customer Spend MetaData:
  AVG_Mthly_Spend: The average monthly amount spent by the customer
  No_of_Visits: The number of times a customer visited the HyperMarket in a month
  Item Counts: Counts of Apparel, Fruits and Vegetable, and Staple items purchased in a month
Building the hierarchical clusters (without variable scaling)
?dist    ## to get help on the distance function
d.euc <- dist(x = RCDF[, 3:7], method = "euclidean")

## we will use the hclust function to build the cluster
?hclust  ## to get help on the hclust function
clus1 <- hclust(d.euc, method = "average")
plot(clus1, labels = as.character(RCDF[, 2]))

Note: The two clusters formed are primarily on the basis of AVG_MTHLY_SPEND. The Euclidean distance computation in this case is dominated by the AVG_MTHLY_SPEND variable, as its range is far larger than that of the other variables. To avoid this problem, we should scale the variables used for clustering.
Building the hierarchical clusters (with variable scaling)

## the scale function standardizes the values
scaled.RCDF <- scale(RCDF[, 3:7])
head(scaled.RCDF, 10)

d.euc <- dist(x = scaled.RCDF, method = "euclidean")
clus2 <- hclust(d.euc, method = "average")
plot(clus2, labels = as.character(RCDF[, 2]))
rect.hclust(clus2, k = 3, border = "red")
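For one column, scale() simply centres the values on the column mean and divides by the column standard deviation; a small sketch, assuming the column name AVG_Mthly_Spend from the metadata slide:

## z-score for a single variable: equivalent to the corresponding column of scale(RCDF[, 3:7])
z <- (RCDF$AVG_Mthly_Spend - mean(RCDF$AVG_Mthly_Spend)) / sd(RCDF$AVG_Mthly_Spend)
head(z)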
Understanding the Height Calculation in Clustering

## Let us see the distance matrix
d.euc

Dist.     A      B      C      D      E      F      G      H      I
B       4.25
C       3.41   3.84
D       2.51   3.47   1.26
E       4.27   2.70   2.92   3.20
F       3.98   2.21   3.58   2.85   3.43
G       4.38   3.02   3.38   3.35   1.41   3.17
H       3.40   3.60   3.66   2.93   3.24   2.35   2.46
I       3.53   3.39   4.05   3.21   3.48   2.18   2.61   0.73
J       4.55   2.97   3.59   3.04   3.41   1.24   2.80   2.12   2.06

## Let us see the heights at which the clusters merge
clus2$height
Worked example (using the distance matrix above):

First merges – the closest pairs:
  (C,D) at 1.26     (H,I) at 0.73     (F,J) at 1.24     (E,G) at 1.41

Next merges – the height is the average of all pairwise distances between the two groups:
  A with (C,D):      A,C = 3.41; A,D = 2.51                           -> height 2.96
  (H,I) with (F,J):  H,F = 2.35; H,J = 2.12; I,F = 2.18; I,J = 2.06   -> height 2.17
  B with (E,G):      B,E = 2.70; B,G = 3.02                           -> height 2.86

Merge of (H,I,F,J) with (B,E,G):
          B      E      G
  H     3.60   3.24   2.46
  I     3.39   3.48   2.61
  F     2.21   3.43   3.17
  J     2.97   3.41   2.80
  -> height 3.06

Final merge of (A,C,D) with (H,I,F,J,B,E,G):
          H      I      F      J      B      E      G
  A     3.40   3.53   3.98   4.55   4.25   4.27   4.38
  C     3.66   4.05   3.58   3.59   3.84   2.92   3.38
  D     2.93   3.21   2.85   3.04   3.47   3.20   3.35
  -> height 3.59
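These merge heights can be cross-checked in R from the distance matrix; a small sketch, assuming the 10 rows of scaled.RCDF correspond to customers A to J in order (so H, I, F, J are rows 8, 9, 6, 10):

## Turn the dist object into a full matrix so groups of rows/columns can be indexed
m <- as.matrix(d.euc)

## Average-linkage height for merging (H,I) with (F,J):
## the mean of all pairwise distances between the two groups (rows 8,9 vs rows 6,10)
mean(m[c(8, 9), c(6, 10)])                      # ~2.17, as computed above

## Average-linkage height for the final merge of (A,C,D) with (H,I,F,J,B,E,G)
mean(m[c(1, 3, 4), c(8, 9, 6, 10, 2, 5, 7)])    # ~3.59

## The same values appear among the merge heights reported by hclust
clus2$height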
Profiling the clusters
## profiling the clusters
## assign each record its cluster membership by cutting the tree at k = 3
RCDF$Clusters <- cutree(clus2, k = 3)
## mean of each clustering variable per cluster (drop the ID, name and Clusters columns)
aggr <- aggregate(RCDF[, -c(1, 2, 8)], list(RCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[, 1],
                           Freq    = as.vector(table(RCDF$Clusters)),
                           aggr[, -1])
View(clus.profile)
Partitioning Clustering

K Means Clustering
K Means Clustering
• K-Means is the most widely used non-hierarchical clustering technique

• It is not based directly on Distance…

• It is based on within-cluster Variation, in other words the Squared Distance of each record from the Centre of its Cluster

• The algorithm aims at segmenting the data such that the within-cluster variation is reduced
K Means Algorithm
• Input Required: the number of clusters to be formed (say K)

• Steps
  1. Assume K Centroids (for K Clusters)
  2. Compute the Euclidean distance of each object from these Centroids
  3. Assign each object to the cluster whose centroid is nearest
  4. Compute the new centroid (mean) of each cluster based on the objects assigned to it. The K means obtained become the new centroids of the clusters
  5. Repeat steps 2 to 4 till there is convergence
     • i.e. there is no movement of objects from one cluster to another
     • or a threshold number of iterations has occurred
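The steps above can be written out as a small R sketch; this is illustrative only (empty clusters and other edge cases are not handled), and in practice the built-in kmeans() function shown later in the deck is what you would use:

## Minimal k-means sketch following steps 1-5 above (illustration only)
simple_kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  ## Step 1: assume K centroids (here, K randomly chosen records)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    ## Step 2: Euclidean distance of every record to every centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    ## Step 3: assign each record to the cluster with the nearest centroid
    new.assign <- apply(d, 1, which.min)
    ## Step 5: stop when no record changes cluster (convergence)
    if (all(new.assign == assign)) break
    assign <- new.assign
    ## Step 4: recompute each centroid as the mean of its assigned records
    for (j in 1:k) centroids[j, ] <- colMeans(x[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = centroids)
}

## e.g. simple_kmeans(scaled.RCDF, k = 3)$cluster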
K-means advantages
• K-means is considered superior to the hierarchical technique as it is less impacted by outliers

• Computationally, it is much faster than hierarchical clustering

• It is preferable to use it on interval or ratio-scaled data, as it uses Euclidean distance… it is desirable to avoid using it on ordinal data

• Challenge – the number of clusters has to be pre-defined and provided as input to the process
Why find the optimal No. of Clusters?
 Two Clusters – 2 possible solutions
 Three Clusters – multiple possible solutions

  [Illustration: the same data, plotted on dimensions D1 and D2, can be partitioned into clusters C1 and C2 in two different ways, and into clusters C1, C2 and C3 in several different ways.]
R code to get Optimal No. of Clusters
## code taken from the R-statistics blog http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
## Identifying the optimal number of clusters from the WSS (elbow) plot
wssplot <- function(data, nc = 15, seed = 1234) {
  ## WSS for k = 1 is the total sum of squares of the data
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  ## WSS for k = 2 .. nc, from kmeans
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

wssplot(scaled.RCDF, nc = 5)
Using NbClust to get optimal No. of Clusters
## Identifying the optimal number of clusters
## install.packages("NbClust")
library(NbClust)
set.seed(1234)
## KRCDF is assumed to be the customer spend data frame used for the k-means example
## (columns 1 and 2, the ID and name columns, are excluded)
nc <- NbClust(KRCDF[, c(-1, -2)], min.nc = 2, max.nc = 4, method = "kmeans")
table(nc$Best.n[1,])

barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
K Means Clustering R Code
?kmeans
kmeans.clus = kmeans(x=scaled.RCDF, centers = 3, nstart = 25)
## x = data frame to be clustered
## centers = No. of clusters to be created
## nstart = No. of random sets to be used for clustering
kmeans.clus

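Beyond printing the whole object, a few components of the fitted kmeans object are worth inspecting when judging the solution:

kmeans.clus$cluster        # cluster membership of each record
kmeans.clus$centers        # the K centroids (in the scaled variable space)
kmeans.clus$size           # number of records in each cluster
kmeans.clus$withinss       # within-cluster sum of squares, per cluster
kmeans.clus$tot.withinss   # total within-cluster variation minimised by the algorithm
kmeans.clus$betweenss      # between-cluster sum of squares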
Plotting the clusters
## plotting the clusters
## install.packages("fpc")
library(fpc)
plotcluster(scaled.RCDF, kmeans.clus$cluster)
Profiling the clusters
## profiling the clusters
KRCDF$Clusters <- kmeans.clus$cluster
aggr <- aggregate(KRCDF[, -c(1, 2, 8)], list(KRCDF$Clusters), mean)
clus.profile <- data.frame(Cluster = aggr[, 1],
                           Freq    = as.vector(table(KRCDF$Clusters)),
                           aggr[, -1])

View(clus.profile)
Next steps after clustering
• Clustering provides you with clusters in the given dataset

• Clustering does not provide you with rules to classify future records

• To be able to classify future records you may do the following:
  – Build a Discriminant Model on the clustered data
  – Build a Classification Tree Model on the clustered data (see the sketch below)
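For instance, a classification tree can be grown with the cluster label as the target; a minimal sketch, assuming the rpart package and the RCDF data frame carrying the Clusters column created during profiling:

## Learn classification rules for the clusters (illustrative sketch)
# install.packages("rpart")
library(rpart)

## Drop the ID and name columns; Clusters (added earlier) becomes the target
tree.model <- rpart(as.factor(Clusters) ~ ., data = RCDF[, -c(1, 2)], method = "class")
print(tree.model)

## Future records can then be assigned to a cluster with predict()
# predict(tree.model, newdata = new.records, type = "class")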
References
• Chapter 9: Cluster Analysis (http://www.springer.com)
  – Google search: "www.springer.com cluster analysis chapter 9"

• http://sites.stat.psu.edu/~ajw13/stat505/fa06/19_cluster/09_cluster_wards.html

• https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/
Thank you

Contact us:
[email protected]
