Clustering
Serge Nyawa
October 2023
Roadmap
▶ Objectives
▶ Techniques and Algorithms
▶ K-means: an overview
▶ K-means with R: a case study
Introductory Example: Customer Segmentation for a
Telecom Company
A mobile telecommunications company has approximately 2 million
customers. Its storage systems hold a tremendous amount of
call-detail and customer data. The company wants to carry out
specific marketing actions for different groups of customers in order
to meet specific business objectives. It wants to divide customers
into homogeneous groups on the basis of common attributes
(habits, tastes, etc.). It needs a clustering algorithm to do that.
Other examples
Distribution-based clustering
▶ All data points in a cluster are likely to belong to the same
distribution (for example, a normal, i.e. Gaussian, distribution)
▶ The expectation-maximization (EM) algorithm is an example: it uses
multivariate normal distributions
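The idea behind expectation-maximization can be sketched in base R. To stay short, this sketch fits a mixture of two univariate normal distributions rather than the multivariate case mentioned above; the synthetic data, the starting values, and the fixed number of iterations are all assumptions made for illustration.

```r
# Synthetic one-dimensional data from two normal distributions (hypothetical)
set.seed(3)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 6, sd = 1))

# Initial guesses for the mixing weight, means, and standard deviations
w  <- 0.5
mu <- c(min(x), max(x))
s  <- c(sd(x), sd(x))

for (iter in 1:100) {
  # E-step: probability that each point belongs to component 1
  d1 <- w * dnorm(x, mu[1], s[1])
  d2 <- (1 - w) * dnorm(x, mu[2], s[2])
  r  <- d1 / (d1 + d2)

  # M-step: re-estimate the parameters from the weighted points
  w  <- mean(r)
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  s  <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
          sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}

round(mu, 1)                       # estimated component means
cluster <- ifelse(r > 0.5, 1, 2)   # hard assignment from the soft memberships
table(cluster)
```

Unlike k-means, each point receives a membership probability `r` rather than a single label; the hard assignment in the last lines is only one possible way to turn those probabilities into clusters.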
Density-based clustering
▶ Clusters are defined as areas of higher density than the
remainder of the data set
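The notion of "higher density" can be illustrated by counting, for each point, how many other points lie within a fixed radius. This is only a simplified sketch of the density idea, not a full density-based clustering algorithm; the radius `eps`, the threshold `min_pts`, and the synthetic data are assumed values.

```r
# Synthetic data: one dense blob plus a few scattered points (hypothetical)
set.seed(5)
dense  <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
sparse <- matrix(runif(20, min = -5, max = 5), ncol = 2)
x <- rbind(dense, sparse)

eps     <- 0.5   # neighborhood radius (assumed value)
min_pts <- 5     # minimum number of neighbors to count as "dense" (assumed)

# For each point, count the other points within distance eps
d <- as.matrix(dist(x))
n_neighbors <- rowSums(d <= eps) - 1   # subtract 1 to exclude the point itself

# Points in high-density areas versus the remainder of the data set
is_dense <- n_neighbors >= min_pts
table(is_dense)
```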
k-means: an overview
▶ Step 1: Choose the value of k and the k initial guesses for the
centroids
▶ Step 2: Compute the distance from each data point to each
centroid. Assign each point to the closest centroid. The
Euclidean distance can be used:
\[ d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \]
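The distance computation and the assignment in Step 2 can be sketched in R as follows. This is a minimal illustration on made-up points; the helper names `euclid` and `assign_to_nearest` and the toy data are assumptions, not part of the case study.

```r
# Euclidean distance between two numeric vectors of equal length
euclid <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# Assign each row of `points` to the index of its closest row in `centroids`
assign_to_nearest <- function(points, centroids) {
  apply(points, 1, function(p) {
    d <- apply(centroids, 1, function(ctr) euclid(p, ctr))
    which.min(d)
  })
}

# Toy data (hypothetical): four points and two initial centroid guesses
pts <- rbind(c(0, 0), c(1, 0), c(9, 9), c(10, 10))
ctr <- rbind(c(0, 0), c(10, 10))

euclid(c(0, 0), c(3, 4))        # 5
assign_to_nearest(pts, ctr)     # 1 1 2 2
```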
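Within sum of squares (WSS) over M data points with n coordinates
each, where x_{cj,k} is the j-th coordinate of the centroid of the
cluster k to which point i is assigned:
\[ \mathrm{WSS} = \sum_{i=1}^{M} \sum_{j=1}^{n} (x_{ij} - x_{cj,k})^2 \]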
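As a check on the double sum, the quantity can be computed by hand and compared with the `tot.withinss` component returned by `kmeans()`. The data below are synthetic and the object names (`x`, `km_toy`, `wss_manual`) are assumptions for this sketch.

```r
# Synthetic two-dimensional data (hypothetical; not the customer data)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

km_toy <- kmeans(x, centers = 2, nstart = 10)

# Squared distance of every point to the centroid of its own cluster,
# summed over all points and all coordinates
wss_manual <- sum((x - km_toy$centers[km_toy$cluster, ])^2)

all.equal(wss_manual, km_toy$tot.withinss)   # TRUE
```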
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)
k-means with R: a case study
customers2[1:8, ]
[Figure: WSS plotted against Number of Clusters (x-axis from 0 to 30; y-axis ticks from 2000 to 8000)]
▶ The WSS is greatly reduced when k increases from one to
two. Another substantial reduction in WSS occurs at k=5.
However, the improvement in WSS is fairly linear for k>5.
Therefore, the k-means analysis will be conducted for k = 5.
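A curve of WSS against k of the kind discussed above could be produced along the following lines. This is a sketch on synthetic data standing in for `customers2`; the object names `toy`, `k_max`, and `wss` and the choice of 10 as the largest k are assumptions.

```r
# Synthetic data with three groups (hypothetical; stands in for customers2)
set.seed(42)
toy <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 5), ncol = 2),
             matrix(rnorm(200, mean = 10), ncol = 2))

# Total within sum of squares for k = 1, ..., k_max
k_max <- 10
wss <- numeric(k_max)
for (k in 1:k_max) {
  wss[k] <- kmeans(toy, centers = k, nstart = 25)$tot.withinss
}

# Plot WSS against k and look for the "elbow"
plot(1:k_max, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Sum of Squares")
```

On this toy data the curve should flatten after k = 3, the number of groups that were simulated; on real data the elbow is often less clear-cut and the choice of k remains a judgment call.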
km <- kmeans(customers2, 5, nstart = 25)
km
## K-means clustering with 5 clusters of sizes 645, 1384, 687, 19, 1128
##
## Cluster means:
## recency.z frequency.z monetary.z
## 1 -1.33843402 1.5413678 1.2616943
## 2 0.91680685 -0.8288942 -0.6516600
## 3 -0.73655436 -0.5117379 -0.4794205
## 4 0.72543111 -0.9601901 -6.2924848
## 5 0.07682528 0.4634884 0.4760848
##
## Clustering vector:
## [1] 4 1 1 1 1 2 5 5 2 1 3 1 2 5 2 3 2 2 5 2 3 1 5 1 5 5 3 2 3 5 2 2 5 5 1 2 2
## [38] 2 1 5 2 2 2 2 1 2 5 2 3 5 2 5 5 2 3 2 3 3 2 1 2 3 1 2 5 1 1 1 2 5 2 1 3 1
## [75] 2 2 5 3 5 2 2 1 5 1 5 5 3 5 5 5 3 2 5 1 1 1 1 1 3 1 2 1 3 1 2 3 4 2 3 1 2
## [112] 2 2 5 3 2 3 3 2 2 1 3 3 5 3 2 5 2 5 1 2 2 1 2 2 2 1 1 1 5 5 3 1 1 5 1 5 3
## [149] 5 3 5 2 2 5 5 5 3 2 2 2 3 5 5 5 1 2 5 3 2 2 2 2 5 3 2 3 1 2 2 2 3 1 3 2 1
## [186] 5 3 2 1 1 3 5 5 1 2 1 1 2 2 1 3 1 2 5 5 5 1 5 2 5 5 2 2 5 5 1 2 2 5 2 2 5
## [223] 2 5 1 5 1 2 5 3 2 3 3 5 1 5 2 3 5 5 5 5 3 2 5 5 2 5 5 5 3 3 3 5 2 2 1 2 1
## [260] 1 2 5 2 3 3 5 2 2 5 1 1 5 5 3 5 1 2 5 5 5 2 2 2 5 2 2 5 1 2 2 2 2 1 1 5 2
## [297] 3 5 2 5 2 2 2 2 5 5 2 5 5 3 4 3 2 2 2 2 1 5 1 1 1 1 3 3 2 5 3 3 2 5 5 2 2
## [334] 5 2 3 2 2 2 1 3 2 2 5 2 2 1 2 5 1 5 3 5 3 5 5 3 1 5 2 3 1 1 2 5 2 3 5 3 5
## [371] 2 2 2 1 2 2 5 2 2 2 3 3 2 2 2 2 2 2 2 2 1 3 4 5 2 2 2 2 5 2 5 5 5 2 5 2 2
## [408] 3 2 2 1 2 5 5 2 2 3 3 5 1 5 2 5 5 3 1 2 2 5 5 5 1 3 3 5 3 5 2 1 3 1 2 2 5
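The pieces of the printed output above (cluster sizes, cluster means, clustering vector) can also be read directly from the object returned by `kmeans()`. The sketch below uses synthetic data with the same column names as the case study; the data and the name `km_toy` are assumptions.

```r
# Synthetic stand-in for the standardized customer data (hypothetical)
set.seed(7)
toy <- scale(rbind(matrix(rnorm(150, mean = 0), ncol = 3),
                   matrix(rnorm(150, mean = 3), ncol = 3)))
colnames(toy) <- c("recency.z", "frequency.z", "monetary.z")

km_toy <- kmeans(toy, centers = 2, nstart = 25)

km_toy$size            # number of points in each cluster
km_toy$centers         # cluster means, one row per cluster
head(km_toy$cluster)   # cluster label assigned to the first few rows
table(km_toy$cluster)  # same counts as km_toy$size
```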
[Figure: three scatter plots of the customers, colored by cluster (1 to 5)]
▶ Panel 1: x-axis customers$recency.z (-2 to 1); y-axis label not recovered (ticks from -1 to 4)
▶ Panel 2: x-axis customers$recency.z (-2 to 1); y-axis customers$monetary.z (-5.0 to 2.5)
▶ Panel 3: x-axis customers$frequency.z (-1 to 5); y-axis customers$monetary.z (-5.0 to 2.5)
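Scatter plots colored by cluster, like the ones summarized above, could be drawn with ggplot2 roughly as follows. The data are synthetic and only two clusters are used; the names `toy`, `km_toy`, and `p` are assumptions, and the original slides may have built their plots differently.

```r
library(ggplot2)

# Synthetic stand-in for the standardized customer data (hypothetical)
set.seed(7)
toy <- as.data.frame(scale(rbind(matrix(rnorm(150, mean = 0), ncol = 3),
                                 matrix(rnorm(150, mean = 3), ncol = 3))))
names(toy) <- c("recency.z", "frequency.z", "monetary.z")

km_toy <- kmeans(toy, centers = 2, nstart = 25)

# Store the labels as a factor so ggplot2 uses a discrete color scale
toy$cluster <- factor(km_toy$cluster)

# One pair of variables; repeat with other pairs for the remaining panels
p <- ggplot(toy, aes(x = recency.z, y = monetary.z, color = cluster)) +
  geom_point()
p
```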