
Clustering

Serge Nyawa

October 2023
Roadmap

▶ Objectives
▶ Techniques and Algorithms
▶ K-means: an overview
▶ K-means with R: a case study
Introductory Example: Customer Segmentation for a
Telecom Company
A mobile telecommunications company has approximately 2 million
customers. Its storage systems hold a tremendous amount of
call-detail and customer data. The company wants to carry out
specific marketing actions for different groups of customers in
order to meet specific business objectives. It wants to divide
customers into homogeneous groups on the basis of common attributes
(habits, tastes, etc.). A clustering algorithm is needed to do that.
Other examples

▶ Cluster customers based on their purchase histories


▶ Cluster products based on the sets of customers who
purchased them
▶ Cluster documents based on similar words or shingles
Objectives

By the end of this course, you should be able:


▶ To group multi-dimensional data sets into closely related
groups
▶ To classify each data point into a specific group
▶ To put in the same group data points with similar properties
and/or features
▶ To use unsupervised techniques to construct clusters: the labels to
apply to the clusters are not known in advance
Techniques and Algorithms

There are at least four clustering algorithms that data scientists need
to know:
▶ Connectivity-based clustering (hierarchical clustering)
▶ Centroid-based clustering
▶ Distribution-based clustering
▶ Density-based clustering
Connectivity-based clustering (hierarchical clustering)
▶ Each data point is treated as a single cluster
▶ A distance metric measures the distance between two clusters
▶ On each iteration, two clusters with the smallest distance
between each other are combined
▶ Iterations are repeated until a single cluster containing all
data points remains; a minimal R sketch follows below
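
A minimal base-R sketch of agglomerative clustering on illustrative data
(hclust() and cutree() are from base R's stats package):

x <- matrix(rnorm(50 * 3), ncol = 3)   # illustrative data, one row per observation
d <- dist(x, method = "euclidean")     # pairwise distances between observations
hc <- hclust(d, method = "complete")   # repeatedly merge the two closest clusters
plot(hc)                               # dendrogram of the successive merges
groups <- cutree(hc, k = 4)            # cut the tree to obtain 4 clusters
table(groups)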
Centroid-based clustering

▶ Similarity is derived from the closeness of a data point to the
centroid of the cluster
▶ Clusters are represented by a central vector, which may not
necessarily be a member of the data set
▶ The number of clusters has to be specified beforehand
▶ K-means clustering is the canonical example
▶ Derive the k cluster centers
▶ Objects are assigned to the nearest cluster center, such that
the squared distances from the cluster centers are minimized
Distribution-based clustering

▶ All data points in a cluster are assumed to come from the same
distribution (for example, a Gaussian/normal distribution)
▶ The expectation-maximization (EM) algorithm is an example: it fits
multivariate normal distributions; a short sketch follows below
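
As an illustration (not part of the original slides), the mclust package fits
Gaussian mixture models by expectation-maximization; a minimal sketch, assuming
mclust is installed:

# install.packages("mclust")
library(mclust)
x <- matrix(rnorm(200 * 2), ncol = 2)  # illustrative data only
fit <- Mclust(x, G = 1:5)              # EM fit; the number of components G is chosen by BIC
summary(fit)
head(fit$classification)               # cluster membership of each observation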
Density-based clustering
▶ Clusters are defined as areas of higher density than the
remainder of the data set
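
A minimal sketch of density-based clustering, assuming the dbscan package is
installed (it is not used elsewhere in this course):

# install.packages("dbscan")
library(dbscan)
x <- matrix(rnorm(200 * 2), ncol = 2)   # illustrative data only
db <- dbscan(x, eps = 0.5, minPts = 5)  # eps: neighbourhood radius; minPts: minimum points per dense region
table(db$cluster)                       # cluster 0 collects the noise points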
k-means: an overview
▶ Step 1: Choose the value of k and the k initial guesses for the
centroids
▶ Step 2: Compute the distance from each data point to each
centroid and assign each point to the closest centroid. The
Euclidean distance can be used:
d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}

▶ Step 3: Compute the centroid, the center of mass, of each
newly defined cluster from Step 2:

x_c = \left( \frac{\sum_{i=1}^{m} x_{i1}}{m}, \ldots, \frac{\sum_{i=1}^{m} x_{in}}{m} \right)
▶ Repeat Steps 2 and 3 until the algorithm converges.
Convergence is reached when the computed centroids do not
change.
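
A didactic base-R sketch of Steps 1 to 3 (in practice, the built-in kmeans()
function used in the case study below should be preferred):

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)   # illustrative data, one row per point
k <- 3

# Step 1: choose k initial centroids at random among the data points
centroids <- x[sample(nrow(x), k), ]

for (iter in 1:50) {
  # Step 2: assign each point to its closest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)

  # Step 3: recompute each centroid as the mean of its assigned points
  # (empty clusters are not handled in this sketch)
  new_centroids <- t(sapply(1:k, function(j)
    colMeans(x[assignment == j, , drop = FALSE])))

  # convergence: stop when the centroids no longer change
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids <- new_centroids
}
table(assignment)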
How to determine the Number of Clusters k?

▶ k is the integer such that the sum of the squared distances between
each data point and its closest centroid is minimal
▶ k minimizes the Within Sum of Squares (WSS) metric:

\mathrm{WSS}(k) = \sum_{i=1}^{M} \sum_{j=1}^{n} \left( x_{ij} - x^{i}_{cj,k} \right)^2

where x_i = (x_{i1}, \ldots, x_{in}) is the i-th point, and
x^{i}_{c,k} = (x^{i}_{c1,k}, \ldots, x^{i}_{cn,k}) is the closest centroid associated
with the i-th point when k clusters are considered.
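
The WSS in the formula above can also be computed directly from a fitted
k-means model; a small sketch on illustrative data (the case study below uses
the same quantity via the $withinss component):

x <- matrix(rnorm(100 * 3), ncol = 3)        # illustrative data only
fit <- kmeans(x, centers = 4, nstart = 25)

# squared distance from each point to its assigned (closest) centroid, summed over all points
wss_manual <- sum((x - fit$centers[fit$cluster, ])^2)
wss_manual
fit$tot.withinss                             # the same quantity reported by kmeans()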
k-means with R: a case study

A retail company wants to carry out a customer segmentation using
customer purchase behavior. The available data are stored in the
dataset customers.csv and contain metrics for the customer's recency
of last purchase (recency), frequency of purchase (frequency), and
monetary value (monetary). These three variables are often used
in customer segmentation for marketing purposes. The recency
variable refers to the number of days that have elapsed since the
customer last purchased something. Frequency refers to the
number of invoices with purchases during the year. Monetary value
is the amount that the customer spent during the year.
k-means with R: a case study
Since k-means clustering requires quantitative variables and works
best with relatively normally distributed data, the variables are
standardized in the following way:
▶ Log-transform positively skewed variables
▶ customers$recency.log <- log(customers$recency)
▶ customers$frequency.log <- log(customers$frequency)
▶ customers$monetary.log <- customers$monetary + 0.1 # can't
take log(0), so add a small value to remove zeros
▶ customers$monetary.log <- log(customers$monetary.log)
▶ Z-scores
▶ customers$recency.z <- scale(customers$recency.log,
center=TRUE, scale=TRUE)
▶ customers$frequency.z <- scale(customers$frequency.log,
center=TRUE, scale=TRUE)
▶ customers$monetary.z <- scale(customers$monetary.log,
center=TRUE, scale=TRUE)
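
A quick sanity check (not in the original slides): each z-score column should
now have mean 0 and standard deviation 1.

z <- cbind(customers$recency.z, customers$frequency.z, customers$monetary.z)
colnames(z) <- c("recency.z", "frequency.z", "monetary.z")
round(colMeans(z), 3)        # close to 0 for each variable
round(apply(z, 2, sd), 3)    # close to 1 for each variable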
k-means with R: a case study

▶ Useful packages are installed

▶ install.packages("plyr")
▶ install.packages("ggplot2")
▶ install.packages("cluster")
▶ install.packages("lattice")
▶ install.packages("graphics")
▶ install.packages("grid")
▶ install.packages("gridExtra")
k-means with R: a case study

▶ Useful packages are loaded

library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)
k-means with R: a case study

▶ Import the dataset


customers <- read.csv("C://Users//s.nyawa//Documents//R for Data Scientist TBS//customers.csv")

▶ The dataset is transformed into a matrix

customers2 <- as.matrix(customers[, c("X", "CustomerID", "recency", "frequency",
    "monetary", "recency.log", "frequency.log", "monetary.log",
    "recency.z", "frequency.z", "monetary.z")])
k-means with R: a case study
▶ We keep only the z-scores in the analysis

customers2 <- customers2[,9:11]

▶ An overview of the dataset

customers2[1:8, ]

## recency.z frequency.z monetary.z


## [1,] 1.4769737 -1.0419194 -6.4333701
## [2,] -1.9505774 1.5373188 1.3061430
## [3,] -2.7537602 4.8703550 2.7425188
## [4,] -1.7402564 0.7608907 1.3119950
## [5,] -1.7402564 0.5109367 0.2765832
## [6,] 1.1726513 -1.0419194 -1.4229360
## [7,] 0.3626359 -0.2654913 0.2581795
## [8,] 0.4027055 0.7608907 0.7343981
k-means with R: a case study

▶ To determine an appropriate value for k, the k-means
algorithm is used to identify clusters for k = 1, 2, ..., 30

wss <- numeric(30)
for (k in 1:30) {
  wss[k] <- sum(kmeans(customers2, centers = k, nstart = 25,
                       iter.max = 50)$withinss)
}

▶ The option nstart=25 specifies that the k-means algorithm
will be repeated 25 times, each starting with k random initial
centroids.
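
Because the initial centroids are chosen at random, results can differ between
runs; a seed can be set for reproducibility. A small sketch (using k = 5 as an
example) comparing a single start with 25 starts; the multi-start fit keeps the
solution with the lowest total WSS:

set.seed(123)                                              # reproducible random starts
kmeans(customers2, centers = 5, nstart = 1)$tot.withinss   # one random start
kmeans(customers2, centers = 5, nstart = 25)$tot.withinss  # best of 25 starts, typically lower or equal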
k-means with R: a case study
▶ The Within Sum of Squares metric (wss) is plotted against
the respective number of centroids

plot(1:30, wss, type="b", xlab="Number of Clusters",
     ylab="Within Sum of Squares")
[Figure: Within Sum of Squares plotted against the Number of Clusters (k = 1 to 30)]
k-means with R: a case study
▶ The WSS is greatly reduced when k increases from one to
two. Another substantial reduction in WSS occurs at k=5.
However, the improvement in WSS is fairly linear for k>5.
Therefore, the k-means analysis will be conducted for k = 5.
km<-kmeans(customers2,5,nstart=25)
km

## K-means clustering with 5 clusters of sizes 645, 1384, 687, 19, 1128
##
## Cluster means:
## recency.z frequency.z monetary.z
## 1 -1.33843402 1.5413678 1.2616943
## 2 0.91680685 -0.8288942 -0.6516600
## 3 -0.73655436 -0.5117379 -0.4794205
## 4 0.72543111 -0.9601901 -6.2924848
## 5 0.07682528 0.4634884 0.4760848
##
## Clustering vector:
## [1] 4 1 1 1 1 2 5 5 2 1 3 1 2 5 2 3 2 2 5 2 3 1 5 1 5 5 3 2 3 5 2 2 5 5 1 2 2
## [38] 2 1 5 2 2 2 2 1 2 5 2 3 5 2 5 5 2 3 2 3 3 2 1 2 3 1 2 5 1 1 1 2 5 2 1 3 1
## [75] 2 2 5 3 5 2 2 1 5 1 5 5 3 5 5 5 3 2 5 1 1 1 1 1 3 1 2 1 3 1 2 3 4 2 3 1 2
## [112] 2 2 5 3 2 3 3 2 2 1 3 3 5 3 2 5 2 5 1 2 2 1 2 2 2 1 1 1 5 5 3 1 1 5 1 5 3
## [149] 5 3 5 2 2 5 5 5 3 2 2 2 3 5 5 5 1 2 5 3 2 2 2 2 5 3 2 3 1 2 2 2 3 1 3 2 1
## [186] 5 3 2 1 1 3 5 5 1 2 1 1 2 2 1 3 1 2 5 5 5 1 5 2 5 5 2 2 5 5 1 2 2 5 2 2 5
## [223] 2 5 1 5 1 2 5 3 2 3 3 5 1 5 2 3 5 5 5 5 3 2 5 5 2 5 5 5 3 3 3 5 2 2 1 2 1
## [260] 1 2 5 2 3 3 5 2 2 5 1 1 5 5 3 5 1 2 5 5 5 2 2 2 5 2 2 5 1 2 2 2 2 1 1 5 2
## [297] 3 5 2 5 2 2 2 2 5 5 2 5 5 3 4 3 2 2 2 2 1 5 1 1 1 1 3 3 2 5 3 3 2 5 5 2 2
## [334] 5 2 3 2 2 2 1 3 2 2 5 2 2 1 2 5 1 5 3 5 3 5 5 3 1 5 2 3 1 1 2 5 2 3 5 3 5
## [371] 2 2 2 1 2 2 5 2 2 2 3 3 2 2 2 2 2 2 2 2 1 3 4 5 2 2 2 2 5 2 5 5 5 2 5 2 2
## [408] 3 2 2 1 2 5 5 2 2 3 3 5 1 5 2 5 5 3 1 2 2 5 5 5 1 3 3 5 3 5 2 1 3 1 2 2 5
k-means with R: a case study

▶ The displayed contents of the variable km include the
following:
▶ The location of the cluster means
▶ A clustering vector that defines the membership of each
customer in one of the clusters 1, 2, 3, 4, or 5
▶ The WSS of each cluster
▶ A list of all the available k-means components
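
These components can be accessed directly from the fitted object, for example:

km$centers        # location of the 5 cluster means (in z-score units)
head(km$cluster)  # cluster membership of the first customers
km$withinss       # within-cluster sum of squares, one value per cluster
km$size           # number of customers in each cluster
names(km)         # all available k-means components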
k-means with R: a case study

▶ The ggplot2 package is used to visualize the identified
customer clusters and centroids
df=as.data.frame(customers2)
df$cluster=factor(km$cluster)
centers=as.data.frame(km$centers)

g1 <- ggplot(data=df, aes(x=recency.z, y=frequency.z, color=cluster)) +
  geom_point() +
  geom_point(data=centers, aes(x=recency.z, y=frequency.z),
             color=c("indianred1", "khaki3", "lightgreen",
                     "lightskyblue", "plum2"), size=8, show.legend=FALSE)

g2 <- ggplot(data=df, aes(x=recency.z, y=monetary.z, color=cluster)) +
  geom_point() +
  geom_point(data=centers, aes(x=recency.z, y=monetary.z),
             color=c("indianred1", "khaki3", "lightgreen",
                     "lightskyblue", "plum2"), size=8, show.legend=FALSE)

g3 <- ggplot(data=df, aes(x=frequency.z, y=monetary.z, color=cluster)) +
  geom_point() +
  geom_point(data=centers, aes(x=frequency.z, y=monetary.z),
             color=c("indianred1", "khaki3", "lightgreen",
                     "lightskyblue", "plum2"), size=8, show.legend=FALSE)
k-means with R: a case study
grid.arrange(g1,g2,g3)
[Figure: three scatter plots arranged vertically — frequency.z vs. recency.z,
monetary.z vs. recency.z, and monetary.z vs. frequency.z — with points colored
by cluster (1 to 5) and large circles marking the cluster centroids]
k-means with R: a case study

▶ The large circles represent the locations of the cluster means
▶ The small dots represent the customers, colored according to
their assigned cluster
▶ The plots indicate the five clusters of customers (a sketch for
profiling each cluster follows the list below):
▶ Cluster 1:
▶ Cluster 2:
▶ Cluster 3:
▶ Cluster 4:
▶ Cluster 5:
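
One way (not shown in the original slides) to label these clusters is to average
the raw RFM variables within each cluster; a sketch assuming the customers data
frame still holds the original recency, frequency and monetary columns:

# mean recency, frequency and monetary value per cluster, to help name the segments
profiles <- aggregate(customers[, c("recency", "frequency", "monetary")],
                      by = list(cluster = km$cluster),
                      FUN = mean)
profiles
km$size   # number of customers in each cluster, for context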
