Clustering Algorithms for Customer Segmentation
Submitted by:
Jagriti Lakher
Date: 31st March 2023
Abstract
Customer segmentation is the process of dividing a customer base into groups of
individuals with similar characteristics to create targeted marketing strategies and
improve customer experience. Clustering algorithm is a type of unsupervised machine
learning algorithm that groups a set of data into clusters based on similarities. Clustering
algorithms play a crucial role in customer segmentation by identifying groups of
customers who share similar characteristics or behavior. It helps businesses to better
understand their customer base and make data-driven decisions about product
development, pricing, and customer service. Through Customer segmentation, a business
can identify areas for improvement, track trends and optimize their operations to better
serve their customers. Out of many clustering algorithms, three representative algorithms
namely k-means, k-means++, and mini-batch k-means are used. The performance of the
three algorithms will be compared on the basis of silhouette score, computational
efficiency, and adjusted rand index (ARI) parameters. Overall, this project has the
potential to provide valuable insights into customer behavior and preferences leading to
more effective business strategies.
Table of Contents
Abstract
Table of Contents
List of Abbreviations
List of Figures
CHAPTER-1
1. Introduction
1.3 Objectives
CHAPTER-2
2. Literature Review
CHAPTER-3
3. Methodology
CHAPTER-4
4. Implementation
4.1 Algorithms
4.1.1 K-Means
4.1.2 K-Means++
CHAPTER-5
CHAPTER-6
6.1 Conclusion
7. References
8. Annex
List of Abbreviations
Abbreviation Definition
ARI Adjusted Rand Index
B2B Business to business
B2C Business to consumer
CLR Cyclical Learning Rate
HDD Hard Disk Drive
IDE Integrated Development Environment
RAM Random Access Memory
RI Rand Index
SGD Stochastic Gradient Descent
WCSS Within-cluster sum of squares
List of Figures
Figure 1: Conceptual framework for classification
Figure 2: Number of males and females (500 data sets)
Figure 3: Distplot of age, annual income and spending score (500 data sets)
Figure 4: Clustering using k-means (500 data sets)
Figure 5: Clustering using k-means++ (500 data sets)
Figure 6: Clustering using mini batch k-means (500 data set with 250 batch size)
Figure 7: Clustering using mini batch k-means (500 data set with 500 batch size)
Figure 8: Number of males and females (200 data sets)
Figure 9: Distplot of age, annual income and spending score (200 data sets)
Figure 10: Clustering using k-means (200 data sets)
Figure 11: Clustering using k-means++ (200 data sets)
Figure 12: Clustering using mini batch k-means (200 data set with 100 batch size)
Figure 13: Clustering using mini batch k-means (200 data set with 200 batch size)
Figure 14: Number of males and females (100 data sets)
Figure 15: Distplot of age, annual income and spending score (100 data sets)
Figure 16: Clustering using k-means (100 data sets)
Figure 17: Clustering using k-means++ (100 data sets)
Figure 18: Clustering using mini batch k-means (100 data set with 50 batch size)
Figure 19: Clustering using mini batch k-means (100 data set with 100 batch size)
CHAPTER-1
1. Introduction
1.2 Problem Statement
When customer segmentation is not performed, the data may remain unexplored, and the relationships between data points may go unnoticed. Customer segmentation can reveal patterns and insights that are not immediately apparent from looking at individual data points. Without clustering, it can be difficult to identify groups or segments within the data set that share similar characteristics or behaviors, which makes it harder to draw meaningful conclusions and make informed decisions about the data. For example, clustering can be used to group customers based on their purchasing behavior or demographics; without it, it may be challenging to identify which customers are more likely to purchase a particular product or service.
Customer segmentation is a widely used marketing strategy that involves dividing
customers into groups based on their common characteristics, needs, and preferences. The
goal is to create targeted marketing campaigns that resonate with each customer group,
resulting in higher engagement, customer loyalty, and revenue. K-means, k-means++, and
mini-batch k-means are three popular clustering algorithms used for customer
segmentation. Each algorithm has its own advantages and disadvantages in terms of
accuracy, efficiency, and scalability. The problem is to evaluate the performance of these
three algorithms on a real-world customer dataset and determine which one produces the
most accurate and meaningful customer segments. This will help businesses make
informed decisions about which algorithm to use for customer segmentation and optimize
their marketing efforts to achieve the best possible results. This problem can be addressed
by comparing the performance of each algorithm on the same dataset and evaluating the
resulting clusters based on criteria such as within-cluster sum of squares, silhouette score,
and cluster purity.
1.3 Objectives
The main objectives of this project are as follows:
To implement the K-Means, K-Means++ and Mini Batch K-Means algorithms for customer segmentation.
To perform a comparative analysis of those algorithms on a customer dataset based on the following parameters: silhouette score, adjusted Rand index (ARI), and within-cluster sum of squares (WCSS).
CHAPTER-2
2. Literature Review
Historically, market segments were mostly formed on two foundations: business to business (B2B) or business to consumer (B2C). Nowadays, segmentation is based on many more variables. In practice, it is common for companies and other organizations to segment their market using any foundation that is identifiable, measurable, actionable and stable.
According to Sandström (2003), segmentation has its basis in the concept that the consumers who buy a company's products and services are not all equally valuable [1]. Customers are of different significance to a company, and to stay in the market, companies need to distribute their attention unevenly, shifting it from less profitable consumers to those with higher profit. For a company to continue gaining profit, it needs to direct most of its attention to the customers who consume its products or services frequently or in greater volumes, so that the resulting groups are fruitful.
A company provides the market with either a service or a product. Because of this, it is vital, according to Fang, Palmatier and Steenkamp, for a company to get the customer service elements right in order to please its customers [3]. Well-established service companies have the right skill set and the right knowledge to fulfill the demands, expectations and needs of their customers (Mattson, 2004) [4].
The concept of customer service can be defined as what a company does to include the purchasers, sellers and other groups that can boost its product or service. Successful customer segmentation within services helps the company enhance its relationships with purchasers and sellers, which also contributes to enhanced competitiveness (Pauline, 2009) [5].
CHAPTER-3
3. Methodology
In this section, partitioning-based clustering algorithms are applied to the data sets. The three methods are explained in the chart below:
Figure 1: Conceptual Framework for classification
Data cleaning: This involves removing missing or incorrect data, duplicates and
irrelevant data. In our case we removed the data that are unsuitable for cluster
analysis due to our parametric range restrictions.
Data transformation: This involves converting data from one format to another,
such as converting categorical data into numerical data.
Data integration: This involves combining data from different sources and formats
into a single dataset.
Data reduction: This involves reducing the amount of data by sampling or
aggregating.
Data discretization: This involves dividing continuous data into discrete categories. In our case we divided the 500-record dataset into subsets of 200 and 100 records respectively and performed the clustering algorithms on each.
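As an illustration, the preprocessing steps above could be sketched with pandas. The records and exact column names below are invented for the example; they are only assumed from the features (gender, age, annual income, spending score) described in this report.

```python
import pandas as pd

# Illustrative records only; the real dataset's columns are assumed
# from the features described in this report.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None, "Male"],
    "Age": [19, 21, 20, 23, 23],
    "Annual Income": [15, 15, 16, 16, 16],
    "Spending Score": [39, 81, 6, 77, 40],
})

# Data cleaning: drop rows with missing values and exact duplicates.
df = df.dropna().drop_duplicates()

# Data transformation: encode the categorical Gender column as numbers.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# Data reduction: draw a smaller random sample, analogous to reducing
# the 500-record dataset to 200 or 100 records.
subset = df.sample(n=3, random_state=42)
```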
The following hardware will be used for the implementation of the system
The following software will be used for the implementation of the system
Python Libraries: NumPy, Pandas, Matplotlib, and Scikit-learn libraries must be
installed.
Integrated Development Environment (IDE): A Python IDE like PyCharm or
Jupyter Notebook must be installed to write and execute the Python code.
The silhouette score is calculated for each object in the dataset and then the average score
across all objects is used as the final score for the clustering. The final score is between -1
and 1, where a score closer to 1 indicates a better clustering and a score closer to -1
indicates a poor clustering.
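A minimal sketch of computing the silhouette score with scikit-learn; the two-blob data here is invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated blobs: clustering them with k=2 should give a
# silhouette score close to 1.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(5, 0.1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # in [-1, 1]; close to 1 here
```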
The adjusted Rand index (ARI) measures the similarity between the true class labels and the predicted cluster labels. A higher ARI means there is a better match between the true and predicted values. The ARI is a corrected-for-chance version of the Rand index (RI), which measures the similarity between two clusterings by counting the number of pairs of objects that are in the same cluster in both clusterings, and the number of pairs of objects that are in different clusters in both clusterings.
The formula for the ARI is:
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
It is commonly used to evaluate the performance of clustering algorithms and to compare
different clusterings of the same data.
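A small sketch with scikit-learn's `adjusted_rand_score`; the label vectors are invented for illustration:

```python
from sklearn.metrics import adjusted_rand_score

# Identical groupings score 1.0 even when the cluster ids are renamed,
# because the ARI only looks at which pairs of objects end up together.
true_labels = [0, 0, 1, 1, 2, 2]
pred_same = [1, 1, 0, 0, 2, 2]    # same grouping, relabelled
pred_other = [0, 1, 0, 1, 0, 1]   # unrelated grouping

print(adjusted_rand_score(true_labels, pred_same))   # 1.0
print(adjusted_rand_score(true_labels, pred_other))  # negative: worse than chance
```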
The WCSS is often used as a measure of the quality of clustering algorithms, particularly
in k-means clustering. A lower WCSS value indicates that the clusters are more compact,
and thus the cluster assignment is better.
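In scikit-learn, the WCSS of a fitted k-means model is available as the `inertia_` attribute; a brief sketch on invented data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 2)  # invented data

# WCSS always decreases as k grows, so it is usually inspected across
# several k values (the "elbow" method) rather than minimized outright.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in (1, 2, 3)]
```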
CHAPTER-4
4. Implementation
4.1 Algorithms
4.1.1 K-Means
K-means is a popular clustering algorithm used to group similar data points together.
The goal of the algorithm is to partition a set of n data points into k clusters, where
each data point belongs to the cluster with the nearest mean (Sharma, 2021).
Step 1: Initialize k centroids, which will serve as the centers of the clusters. This can be done randomly or by choosing k data points from the dataset.
Step 2: For each data point, calculate the distance to each centroid and assign the
data point to the cluster with the closest centroid.
Step 3: Recalculate the centroid for each cluster as the mean of all the data points
in that cluster.
Step 4: Repeat steps 2 and 3 until the centroids no longer change, or a maximum
number of iterations is reached.
Step 5: The final result is k clusters, each represented by its centroid, and a set of
data points that belong to each cluster.
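The five steps above can be sketched directly in NumPy; a minimal illustration on invented data, not a production implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: initialize k centroids by picking k random data points.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: the result is k centroids plus the cluster of each point.
    return centroids, labels

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
```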
4.1.2 K-Means++
K-means++ is a variation of the K-means algorithm that addresses the problem of poor
initialization of centroids. The basic idea of the algorithm is to choose the initial centroids
in a more intelligent way, to avoid getting stuck in poor local optima.
The algorithm works as follows:
Step 1: Select one data point randomly from the dataset as the first centroid.
Step 2: For each data point, calculate the distance to the closest centroid already
chosen.
Step 3: Select the next centroid randomly from the remaining data points, with probability proportional to the squared distance calculated in step 2.
Step 4: Repeat steps 2 and 3 for k-1 times, until all k centroids have been chosen.
Step 5: Run the standard k-means algorithm using the chosen centroids as the
initial centroids.
The k-means++ seeding tends to choose initial centroids that are far apart from each other, which makes the algorithm less likely to converge to poor local optima. It also tends to create more spherical clusters and to avoid empty clusters.
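In scikit-learn, this seeding is selected with `init="k-means++"` (and is the default); a brief sketch on invented data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.3, (30, 2)) for loc in (0, 5, 10)])

# init="k-means++" performs the distance-proportional seeding of
# steps 1-4, then runs standard k-means (step 5).
model = KMeans(n_clusters=3, init="k-means++", n_init=10,
               random_state=0).fit(X)
centers = sorted(np.round(model.cluster_centers_[:, 0]).astype(int).tolist())
```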
4.1.3 Mini Batch K-Means
Mini batch k-means is a variant of k-means designed for large datasets. Instead of using the full dataset at every iteration, it draws small random batches of samples, assigns each batch point to its nearest centroid, and then moves each centroid towards the batch points assigned to it using a per-centroid learning rate that decreases as more points are absorbed. Because each update touches only a batch of points rather than the whole dataset, the algorithm converges with far less computation than standard k-means, usually at the cost of clusters of slightly lower quality. The batch size is a key parameter: larger batches give results closer to full k-means, while smaller batches make each iteration cheaper.
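Scikit-learn's `MiniBatchKMeans` exposes the batch size through its `batch_size` parameter; a brief sketch on invented data, using one of the batch sizes compared in this report:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (250, 2)), rng.normal(10, 0.5, (250, 2))])

# batch_size controls how many samples each incremental centroid update
# sees; 250 mirrors the "500 data set with 250 batch size" configuration.
model = MiniBatchKMeans(n_clusters=2, batch_size=250,
                        n_init=10, random_state=0).fit(X)
labels = model.predict(X)
```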
CHAPTER-5
The number of males and females in the dataset is shown in the bar diagram below.
The distribution plot of age, annual income and spending score is shown below.
Figure 3: Distplot of age, annual income and spending score (500 data sets)
The clustering of datasets using K-Means, K-Means++ and Mini Batch K-Means
algorithm is shown in the diagrams below:
Figure 6: Clustering using mini batch k means (500 data set with 250 batch size)
Silhouette score: 0.5461
Adjusted Rand index: 0.0040
WCSS value: 158629.0
Convergence time: 1.4028 seconds
Figure 7: Clustering using mini batch k means (500 data set with 500 batch size)
The number of males and females in the dataset is shown in the bar diagram below.
The distribution plot of age, annual income and spending score is shown below.
Figure 9: Distplot of age, annual income and spending score (200 data sets)
The clustering of datasets using K-Means, K-Means++ and Mini Batch K-Means
algorithm is shown in the diagrams below:
Figure 11: Clustering using k means++ (200 data sets)
Figure 12: Clustering using mini batch k means (200 data set with 100 batch size)
Figure 13: Clustering using mini batch k means (200 data set with 200 batch size)
Figure 14: Number of males and Females (100 data sets)
The distribution plot of age, annual income and spending score is shown below.
Figure 15: Distplot of age, annual income and spending score (100 data sets)
The clustering of datasets using K-Means, K-Means++ and Mini Batch K-Means
algorithm is shown in the diagrams below:
Figure 16: Clustering using k means (100 data sets)
Figure 18: Clustering using mini batch k means (100 data set with 50 batch size)
Figure 19: Clustering using mini batch k means (100 data set with 100 batch size)
CHAPTER-6
6.1 Conclusion
In this project three clustering algorithms, k-means, k-means++ and mini batch k-means, are comparatively analyzed for customer segmentation. The performance parameters silhouette score, adjusted Rand index and within-cluster sum of squares are calculated for each of the datasets, and performance is assessed on the basis of these scores. The means of the silhouette score, adjusted Rand index and within-cluster sum of squares for the mini batch algorithm are 0.56, 0.003945 and 96128 respectively. From these scores we can conclude that the mini batch algorithm is the more effective and accurate clustering algorithm. However, mini batch takes longer to converge than k-means and k-means++. In summary, the mini batch algorithm performs better on silhouette score, adjusted Rand index and within-cluster sum of squares, but it is slower in terms of convergence time.
7. References
Available: https://www.3tl.com/blog/importance-of-customer-segmentation. [Accessed 29 Jan 2023].
[10] E. L. Melnic, "How to strengthen customer loyalty, using customer segmentation?," Bulletin of the Transilvania University of Braşov, Series V: Economic Sciences, 2016, pp. 52-60.
8. Annex
8.1.2 Source code for k-means++ algorithm
8.1.3 Source code for mini batch algorithm