Clustering Algorithms for Customer Segmentation

This document discusses customer segmentation using clustering algorithms, specifically k-means, k-means++, and mini-batch k-means, to improve targeted marketing strategies and customer experience. The performance of these algorithms is evaluated based on silhouette score, computational efficiency, and adjusted Rand index (ARI) to determine the most effective method for segmenting customers. The project aims to provide insights into customer behavior, enabling businesses to make data-driven decisions for enhanced marketing efforts.


Submitted to:

Artificial Neural Networks

Submitted by:
Jagriti Lakher
Date: 31st March 2023
Abstract
Customer segmentation is the process of dividing a customer base into groups of
individuals with similar characteristics in order to create targeted marketing strategies
and improve customer experience. A clustering algorithm is a type of unsupervised
machine learning algorithm that groups a set of data into clusters based on similarity.
Clustering algorithms play a crucial role in customer segmentation by identifying groups
of customers who share similar characteristics or behavior, helping businesses better
understand their customer base and make data-driven decisions about product
development, pricing, and customer service. Through customer segmentation, a business
can identify areas for improvement, track trends, and optimize its operations to better
serve its customers. Of the many clustering algorithms available, three representative
algorithms are used here: k-means, k-means++, and mini-batch k-means. Their
performance is compared on the basis of silhouette score, computational efficiency, and
adjusted Rand index (ARI). Overall, this project has the potential to provide valuable
insights into customer behavior and preferences, leading to more effective business
strategies.

Keywords: Adjusted Rand Index (ARI), K-means, K-means++, Mini-batch K-means,
Silhouette score.

Table of Contents
Abstract.................................................................................................................................i

Table of Contents.................................................................................................................ii

List of Abbreviations...........................................................................................................iii

List of Figures......................................................................................................................v

CHAPTER-1........................................................................................................................1

1. Introduction...............................................................................................................1

1.1 Introduction to Customer Segmentation................................................................1

1.2 Problem Statement.................................................................................................2

1.3 Objectives..............................................................................................................2

CHAPTER-2........................................................................................................................3

2. Literature Review......................................................................................................3

CHAPTER-3........................................................................................................................5

3. Methodology.............................................................................................................5

3.1 Data Preprocessing................................................................................................6

3.2 Dataset Collection.................................................................................................6

3.3 Hardware and Software Requirements..................................................................6

3.3.1 Hardware Requirements..........................................................................................

3.3.2 Software Requirements...........................................................................................

3.4 Analysis Parameters..............................................................................................7

3.4.1 The silhouette scores...............................................................................................

3.4.2 The Adjusted Rand index (ARI).............................................................................

3.4.3 The within-cluster sum of squares...........................................................................

CHAPTER-4........................................................................................................................9

4. Implementation:........................................................................................................9

4.1 Algorithms.............................................................................................................9

4.1.1 K- Means.................................................................................................................

4.1.2 K- Means++...........................................................................................................

4.1.3 Mini Batch K-means.............................................................................................

CHAPTER-5......................................................................................................................13

5. Result Analysis and Comparison............................................................................13

5.1 Performance comparison on first dataset............................................................13

5.2 Performance comparison on second dataset........................................................16

5.3 Performance comparison on third dataset...........................................................19

5.4 Overall Performance Comparison.......................................................................22

CHAPTER-6......................................................................................................................23

6. Conclusion and Future Works.................................................................................23

6.1 Conclusion...........................................................................................................23

6.2 Future Works.......................................................................................................23

7. References...............................................................................................................24

8. Annex......................................................................................................................25

8.1 Source code and function....................................................................................25

8.1.1 Source code for k-means algorithm.......................................................................

8.1.2 Source code for k-means++ algorithm..................................................................

8.1.3 Source code for mini batch algorithm...................................................................

List of Abbreviations

Abbreviation Definition
B2C Business to consumer
B2B Business to business
CLR Cyclical Learning Rate
HDD Hard Disk Drive
IDE Integrated Development Environment
RAM Random Access Memory
RI Rand Index
SGD Stochastic Gradient Descent
WCSS Within-cluster sum of squares

List of Figures
Figure 1: Conceptual framework for classification
Figure 2: Number of males and females (500 data sets)
Figure 3: Distplot of age, annual income and spending score (500 data sets)
Figure 4: Clustering using k means (500 data sets)
Figure 5: Clustering using k means++ (500 data sets)
Figure 6: Clustering using mini batch k means (500 data set with 250 batch size)
Figure 7: Clustering using mini batch k means (500 data set with 500 batch size)
Figure 8: Number of males and females (200 data sets)
Figure 9: Distplot of age, annual income and spending score (200 data sets)
Figure 10: Clustering using k means (200 data sets)
Figure 11: Clustering using k means++ (200 data sets)
Figure 12: Clustering using mini batch k means (200 data set with 100 batch size)
Figure 13: Clustering using mini batch k means (200 data set with 200 batch size)
Figure 14: Number of males and females (100 data sets)
Figure 15: Distplot of age, annual income and spending score (100 data sets)
Figure 16: Clustering using k means (100 data sets)
Figure 17: Clustering using k means++ (100 data sets)
Figure 18: Clustering using mini batch k means (100 data set with 50 batch size)
Figure 19: Clustering using mini batch k means (100 data set with 100 batch size)
CHAPTER-1

1. Introduction

1.1 Introduction to Customer Segmentation


Customer segmentation is a marketing technique that involves dividing a customer base
into groups based on common characteristics such as age, gender, income, and spending
behavior. It is a crucial task for businesses, as it helps them understand their customers'
needs, preferences, and behaviors. By segmenting their customers, businesses can create
targeted marketing campaigns, provide personalized experiences, and increase customer
satisfaction and loyalty. Customer segmentation can also help businesses identify
profitable and high-potential customer segments, leading to increased revenue and
profits.

Clustering is an unsupervised learning technique that involves grouping similar data
points together. It is a popular technique in data analysis and machine learning, including
customer segmentation. Several clustering algorithms are available, including k-means,
k-means++, and mini-batch k-means. These algorithms work by minimizing the distance
between data points within each cluster and maximizing the distance between different
clusters. The choice of clustering algorithm depends on the size and complexity of the
dataset, the desired level of accuracy, and the available computational resources.

The combination of customer segmentation and clustering algorithms can provide
businesses with valuable insights into their customers' behavior and preferences. By using
clustering algorithms to group customers with similar characteristics, businesses can
create targeted marketing campaigns, personalized offers, and tailored experiences that
meet their customers' needs and expectations. Moreover, customer segmentation based on
clustering algorithms can help businesses identify profitable customer segments and
prioritize their marketing efforts, leading to increased revenue and profitability.
Therefore, the use of clustering algorithms in customer segmentation is a valuable
technique for businesses looking to improve their marketing strategies and customer
engagement.

1.2 Problem Statement
When customer segmentation is not performed, the data may remain unexplored, and the
relationships between data points may go unnoticed. Customer segmentation can reveal
patterns and insights that might not be immediately apparent from looking at individual
data points. Without clustering, it can be difficult to identify groups or segments within
the data set that share similar characteristics or behaviors, which makes it harder to draw
meaningful conclusions and make informed decisions about the data. For example,
clustering can be used to group customers based on their purchasing behavior or
demographics. Without clustering, it may be challenging to identify which customers are
more likely to purchase a particular product or service.
Customer segmentation is a widely used marketing strategy that involves dividing
customers into groups based on their common characteristics, needs, and preferences. The
goal is to create targeted marketing campaigns that resonate with each customer group,
resulting in higher engagement, customer loyalty, and revenue. K-means, k-means++, and
mini-batch k-means are three popular clustering algorithms used for customer
segmentation. Each algorithm has its own advantages and disadvantages in terms of
accuracy, efficiency, and scalability. The problem is to evaluate the performance of these
three algorithms on a real-world customer dataset and determine which one produces the
most accurate and meaningful customer segments. This will help businesses make
informed decisions about which algorithm to use for customer segmentation and optimize
their marketing efforts to achieve the best possible results. This problem can be addressed
by comparing the performance of each algorithm on the same dataset and evaluating the
resulting clusters based on criteria such as within-cluster sum of squares, silhouette score,
and cluster purity.

2.3 Objectives
The main objectives of this project are as follows:
 To implement the K-Means, K-Means++ and Mini Batch K-Means algorithms for
customer segmentation.
 To perform a comparative analysis of those algorithms based on the following
parameters: the silhouette score, the adjusted Rand Index (ARI) and the within-cluster
sum of squares, on a customer dataset.

CHAPTER-2

2. Literature Review

Historically, market segments were mostly defined on two foundations: business to
business (B2B) or business to consumer (B2C). Today, segmentation is based on many
more variables. In practice, it is common for companies and other organizations to
segment their market using any foundation that is identifiable, measurable, actionable
and stable.

According to Sandström (2003), segmentation is based on the idea that the consumers
who purchase a company's products and services are not all equally valuable[1].
Customers are of different significance to a company, and to stay in the market,
companies need to distribute their attention unevenly, moving attention from less
profitable consumers to those with higher profit. To keep gaining profit, a company needs
to direct much of its attention to the customers that consume its products or services
frequently or in greater volumes, creating groups that are fruitful.

Lambert states the importance of segmenting markets in emerging production industries
as well as in service industries. All kinds of organizations need to find a method that fits
them to categorize the market into different segments, in order to meet customer demand
in the best way and increase the company's revenue. According to Lambert, when an
organization starts the segmentation process it is recommended to look at customers from
a need-based point of view, where customers with high demand should be prioritized[2].

A company provides the market with either a service or a product. Because of this,
according to Fang, Palmatier and Steenkamp, it is vital for a company to deliver the
customer service elements that please its customers[3].

Well-established service companies have the right skill set and knowledge to fulfil the
demands, expectations and needs of their customers (Mattson, 2004)[4].

The concept of customer service can be defined as what a company does to engage the
purchasers, sellers and other groups that can boost its product or service. Successful
customer segmentation within services helps a company enhance its relationships with
purchasers and sellers, which also contributes to enhanced competitiveness
(Pauline, 2009)[5].

CHAPTER-3

3. Methodology

In this section, partition-based clustering algorithms are used to cluster the data sets.
Three different methods are used, as explained in the chart below:
Figure 1: Conceptual Framework for classification

3.1 Data Preprocessing

Data preprocessing is the process of transforming raw data into a format that is more
suitable for analysis. It involves cleaning, transforming, and organizing data so that it is
easier to understand and analyze. The purpose of data preprocessing is to improve the
quality of the data and prepare it for analysis.
Here are some common techniques used in data preprocessing:

 Data cleaning: This involves removing missing or incorrect data, duplicates and
irrelevant data. In our case we removed the data that are unsuitable for cluster
analysis due to our parametric range restrictions.
 Data transformation: This involves converting data from one format to another,
such as converting categorical data into numerical data.
 Data integration: This involves combining data from different sources and formats
into a single dataset.
 Data reduction: This involves reducing the amount of data by sampling or
aggregating.
 Data discretization: This involves dividing continuous data into discrete
categories. In our case we divided 500 datasets in to 200 and 100 data respectively
and performed the clustering algorithms.
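As a minimal sketch of the cleaning and transformation steps above, using pandas on a small made-up customer table (the real dataset is not reproduced here, so the values are illustrative only):

```python
import pandas as pd

# Toy customer records standing in for the real dataset; "Gender" is
# categorical, the other columns are numeric.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None, "Male"],
    "Age": [19, 21, 21, 35, 30],
    "Annual Income ($)": [15000, 35000, 35000, None, 59000],
})

# Data cleaning: drop rows with missing values, then exact duplicates.
df = df.dropna().drop_duplicates()

# Data transformation: encode the categorical Gender column numerically.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df)
```

The same two steps (dropping unusable rows, encoding Gender) are what the distance-based clustering algorithms below require, since they operate on numeric features only.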

3.2 Dataset Collection


The "Customer.csv" dataset is a secondary dataset obtained from Kaggle, a platform
where data scientists and machine learning practitioners can find and share datasets. It
includes 1000 data points, each with 5 attributes: Customer ID, Gender, Age, Annual
Income ($), and Spending Score (1-100). Because the dataset is user-contributed, it is
important to examine and preprocess the data to ensure its quality before performing any
clustering analysis.
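Loading and inspecting such a dataset can be sketched as follows; the column names follow the schema described above, and a small inline table stands in for the actual Kaggle file:

```python
import pandas as pd

# In practice: data = pd.read_csv("Customer.csv")
# A small inline stand-in with the same schema is used here.
data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 20, 23],
    "Annual Income ($)": [15, 35, 86, 59],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

print(data.shape)           # (rows, columns)
print(data.dtypes)          # check each column's type before clustering
print(data.isnull().sum())  # missing values per column
```

Checking shape, dtypes and missing values this way is the quality examination step mentioned above.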

3.3 Hardware and Software Requirements

3.3.1 Hardware Requirements

The following hardware will be used for the implementation of the system

 Processor: Dual-core processor or higher
 RAM: Minimum 4 GB RAM or higher
 Storage: At least 1 GB of free disk space

3.3.2 Software Requirements

The following software will be used for the implementation of the system

 Operating System: Windows, macOS, or Linux
 Python: The latest version of Python (3.x) must be installed on the system.

 Python Libraries: NumPy, Pandas, Matplotlib, and Scikit-learn libraries must be
installed.
 Integrated Development Environment (IDE): A Python IDE like PyCharm or
Jupyter Notebook must be installed to write and execute the Python code.

3.4 Analysis Parameters

These algorithms can be analyzed on the basis of the following parameters:

3.4.1 The silhouette score

It measures how similar a point is to its own cluster compared to other clusters.

The silhouette score of a point i is:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where:
a(i) is the average distance between point i and all other points in the same cluster as i
b(i) is the lowest average distance between point i and all points in any other cluster

The silhouette score is calculated for each object in the dataset and then the average score
across all objects is used as the final score for the clustering. The final score is between -1
and 1, where a score closer to 1 indicates a better clustering and a score closer to -1
indicates a poor clustering.
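As a sketch, the silhouette score can be computed with scikit-learn's `silhouette_score`; synthetic `make_blobs` data stands in here for the customer features, since the real dataset is not reproduced in this document:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data standing in for the customer features.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean of (b(i) - a(i)) / max(a(i), b(i)) over all points.
score = silhouette_score(X, labels)
print(round(score, 3))
```

On well-separated blobs like these the score is close to 1; overlapping clusters pull it toward 0.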

3.4.2 The Adjusted Rand Index (ARI)

It measures the similarity between the true class labels and the predicted cluster labels; a
higher ARI indicates a better match between them. The ARI is a chance-corrected version
of the Rand Index (RI), which measures the similarity between two clusterings by
counting the pairs of objects that are in the same cluster in both clusterings and the pairs
that are in different clusters in both clusterings.
The formula for ARI is:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

It is commonly used to evaluate the performance of clustering algorithms and to compare
different clusterings of the same data.
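A small sketch with scikit-learn's `adjusted_rand_score` illustrates the metric; note that the cluster label values themselves do not matter, only the grouping, because ARI is invariant to label permutation:

```python
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]
pred_perfect = [1, 1, 0, 0, 2, 2]  # same grouping, clusters renamed
pred_poor    = [0, 1, 0, 1, 0, 1]  # grouping unrelated to the truth

print(adjusted_rand_score(true_labels, pred_perfect))  # 1.0
print(adjusted_rand_score(true_labels, pred_poor))     # near or below 0
```

A perfect match scores 1.0 regardless of cluster numbering, while a grouping no better than chance scores around 0 (and can be slightly negative).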

3.4.3 The within-cluster sum of squares

The within-cluster sum of squares (WCSS) is a measure of the compactness of a
clustering. It is defined as the sum of the squared distances between the data points in a
cluster and the centroid (mean) of that cluster.
Mathematically, for a clustering with K clusters, the WCSS can be calculated as:

WCSS = Σ_{k=1..K} Σ_{x(i) ∈ Ck} ||x(i) - μ(k)||²

where:
x(i) is a data point assigned to cluster Ck
μ(k) is the centroid of the k-th cluster

The WCSS is often used as a measure of the quality of clustering algorithms, particularly
in k-means clustering. A lower WCSS value indicates that the clusters are more compact,
and thus the cluster assignment is better.
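scikit-learn exposes the WCSS of a fitted k-means model as its `inertia_` attribute; a sketch on synthetic stand-in data shows it decreasing as k grows, which is the basis of the elbow method for choosing k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss_by_k = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss_by_k[k] = model.inertia_  # sum of squared distances to centroids
    print(k, round(model.inertia_, 1))
```

Because WCSS always shrinks as k increases, it is used to compare runs at the same k, or to look for the "elbow" where adding clusters stops paying off.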

CHAPTER-4

4. Implementation

4.1 Algorithms

4.1.1 K-Means

K-means is a popular clustering algorithm used to group similar data points together.
The goal of the algorithm is to partition a set of n data points into k clusters, where
each data point belongs to the cluster with the nearest mean (Sharma, 2021).

The algorithm works as follows:

Step 1: Initialize k centroids, which will serve as the centers of the clusters. This
can be done randomly or by choosing k data points from the dataset.
Step 2: For each data point, calculate the distance to each centroid and assign the
data point to the cluster with the closest centroid.
Step 3: Recalculate the centroid for each cluster as the mean of all the data points
in that cluster.
Step 4: Repeat steps 2 and 3 until the centroids no longer change, or a maximum
number of iterations is reached.
Step 5: The final result is k clusters, each represented by its centroid, and a set of
data points that belong to each cluster.

The k-means algorithm is sensitive to initial conditions, so it is generally
recommended to run the algorithm multiple times with different initial centroids
and choose the best solution. The algorithm also has some limitations: it assumes
that all clusters are spherical and of roughly equal size and density.
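Steps 1-5 above can be sketched in NumPy as follows; this is a minimal illustration without the empty-cluster handling and multiple restarts a production implementation would need:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: return the final clusters and their centroids.
    return labels, centroids

# Two obvious groups of points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

On this toy input the two left points and the two right points end up in separate clusters, whatever the random initialization picks.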

4.1.2 K-Means++

K-means++ is a variation of the K-means algorithm that addresses the problem of poor
initialization of centroids. The basic idea of the algorithm is to choose the initial centroids
in a more intelligent way, to avoid getting stuck in poor local optima.
The algorithm works as follows:
Step 1: Select one data point at random from the dataset as the first centroid.
Step 2: For each data point, calculate the distance to the closest centroid already
chosen.
Step 3: Select the next centroid at random from the remaining data points, with a
probability proportional to the squared distance calculated in step 2.
Step 4: Repeat steps 2 and 3 until all k centroids have been chosen.
Step 5: Run the standard k-means algorithm using the chosen centroids as the
initial centroids.
K-means++ algorithm ensures that the initial centroids are chosen far away from each
other, which makes it less likely to converge to poor local optima. It also tends to create
more spherical clusters and avoid empty clusters.

In summary, the K-means++ algorithm improves on the standard k-means algorithm by
choosing the initial centroid positions in a more intelligent way, resulting in better
solutions, faster convergence, and greater robustness against poor initialization.
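In scikit-learn this seeding is the default; a sketch comparing it with purely random initialization on synthetic stand-in data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# init="k-means++" is scikit-learn's default; init="random" picks the
# initial centroids uniformly at random from the data.
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
km_rand = KMeans(n_clusters=4, init="random", n_init=10, random_state=42).fit(X)

print(round(km_pp.inertia_, 1), round(km_rand.inertia_, 1))
```

On easy, well-separated data with many restarts both initializations usually reach the same optimum; the advantage of k-means++ shows on harder data or with fewer restarts, where random seeding more often gets stuck in poor local optima.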

4.1.3 Mini Batch K-Means

Mini-batch k-means is a variation of the standard k-means algorithm designed for large
datasets. Instead of using the entire dataset in every iteration, it updates the centroids
using small random subsets (mini-batches) of the data. Because each update touches only
a fraction of the data, the algorithm converges much faster than standard k-means, usually
at the cost of a slightly worse clustering.

The algorithm works as follows:

Step 1: Initialize k centroids, for example with k-means++ seeding.
Step 2: Draw a mini-batch of b data points at random from the dataset.
Step 3: Assign each point in the mini-batch to its nearest centroid.
Step 4: Move each centroid toward the points assigned to it, using a per-centroid
learning rate that decreases as the centroid accumulates more assigned
points.
Step 5: Repeat steps 2-4 for a fixed number of iterations or until the centroids
stabilize.

The choice of batch size depends on the size of the dataset and the available
computational resources. A smaller batch size leads to faster iterations but noisier
centroid updates, so the resulting within-cluster sum of squares is typically slightly
higher than that of full-batch k-means.
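A sketch using scikit-learn's `MiniBatchKMeans`; synthetic data stands in for the 500-record customer dataset, and `batch_size=250` mirrors one of the configurations used in the experiments:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 500-record customer dataset.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Each update step uses a random batch of 250 points instead of all 500.
mbk = MiniBatchKMeans(n_clusters=4, batch_size=250, n_init=10,
                      random_state=42).fit(X)

print(round(mbk.inertia_, 1))
print(round(silhouette_score(X, mbk.labels_), 3))
```

The interface mirrors `KMeans`, so the same silhouette, ARI and WCSS comparisons used in the next chapter apply unchanged.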

CHAPTER-5

5. Result Analysis and Comparison

5.1 Performance comparison on first dataset


In this section, the K-Means, K-Means++ and Mini Batch K-Means algorithms are
applied to the first dataset, named "Customer.csv", which contains 500 records. The
performance of each algorithm is measured by silhouette score, within-cluster sum of
squares and adjusted Rand index.

The number of male and female customers in this dataset is shown in the bar diagram below:

Figure 2: Number of males and Females (500 data sets)

The distplot of age, annual income and spending score is shown below:

Figure 3: Distplot of age, annual income and spending score (500 data sets)

The clustering of datasets using K-Means, K-Means++ and Mini Batch K-Means
algorithm is shown in the diagrams below:

Figure 4: Clustering using k means (500 data sets)

Silhouette score 0.445701075586
Adjusted Rand Index -0.00092764378
WCSS value 156472.0
Convergence time 0.1526 seconds

Figure 5: Clustering using k means++ (500 data sets)

Silhouette score 0.445701075586
Adjusted Rand Index -0.00092764378
WCSS value 156472.0
Convergence time 1.4055 seconds

Figure 6: Clustering using mini batch k means (500 data set with 250 batch size)

Silhouette score 0.5461125816171075
Adjusted Rand Index 0.003977272727272726
WCSS value 158629.0
Convergence time 1.4028229713439941 seconds

Figure 7: Clustering using mini batch k means (500 data set with 500 batch size)

Silhouette score 0.5553885438696844
Adjusted Rand Index 0.003977272727272726
WCSS value 156472.0
Convergence time 0.28032612800598145 seconds

5.2 Performance comparison on second dataset

In this section, the K-Means, K-Means++ and Mini Batch K-Means algorithms are
applied to the second dataset, named "Customer1.csv", which contains 200 records
extracted from the main dataset. The performance of each algorithm is measured by
silhouette score, within-cluster sum of squares and adjusted Rand index.

The number of male and female customers in this dataset is shown in the bar diagram below:

Figure 8: Number of males and Females (200 data sets)

The distplot of age, annual income and spending score is shown below:

Figure 9: Distplot of age, annual income and spending score (200 data sets)
The clustering of datasets using K-Means, K-Means++ and Mini Batch K-Means
algorithm is shown in the diagrams below:

Figure 10: Clustering using k means (200 data sets)

Silhouette score 0.5653234146367464
Adjusted Rand Index -0.0007102272727272737
WCSS value 44527.0
Convergence time 0.0828 seconds

Figure 11: Clustering using k means++ (200 data sets)

Silhouette score 0.5653234146367464
Adjusted Rand Index -0.0007102272727272737
WCSS value 44527.0
Convergence time 0.9330 seconds

Figure 12: Clustering using mini batch k means (200 data set with 100 batch size)

Silhouette score 0.6670631370414486
Adjusted Rand Index -0.0007102272727272737
WCSS value 44527.0
Convergence time 0.9246363639831543 seconds

Figure 13: Clustering using mini batch k means (200 data set with 200 batch size)

Silhouette score 0.6670631370414486
Adjusted Rand Index -0.0007102272727272737
WCSS value 44527.0
Convergence time 0.10625696182250977 seconds

5.3 Performance comparison on third dataset

In this section, the K-Means, K-Means++ and Mini Batch K-Means algorithms are
applied to the third dataset, which contains 100 records extracted from the main dataset.
The performance of each algorithm is measured by silhouette score, within-cluster sum
of squares and adjusted Rand index.
The number of male and female customers in this dataset is shown in the bar diagram below:

Figure 14: Number of males and Females (100 data sets)

The distplot of age, annual income and spending score is shown below:

Figure 15: Distplot of age, annual income and spending score (100 data sets)

The clustering of the dataset using the K-Means, K-Means++ and Mini Batch K-Means algorithms is shown in the diagrams below:

Figure 16: Clustering using k means (100 data sets)

Silhouette score: 0.3976
Adjusted Rand index: 0.00081
WCSS: 87387.0
Convergence time: 0.1033 seconds

Figure 17: Clustering using k means++ (100 data sets)

Silhouette score: 0.3979
Adjusted Rand index: 0.00081
WCSS: 87452.0
Convergence time: 0.6157 seconds

Figure 18: Clustering using mini batch k means (100 data set with 50 batch size)

Silhouette score: 0.5034
Adjusted Rand index: 0.00398
WCSS: 95896.0
Convergence time: 0.8452 seconds

Figure 19: Clustering using mini batch k means (100 data set with 100 batch size)

Silhouette score: 0.5074
Adjusted Rand index: 0.00081
WCSS: 89532.0
Convergence time: 0.8449 seconds

6.4 Overall Performance Comparison


The performance comparison across these three datasets is based on the silhouette score, the adjusted Rand index and the within-cluster sum of squares. One additional parameter, the convergence time, is also used to evaluate the algorithms:
Silhouette score: The silhouette scores for the K-means and K-means++ algorithms are identical, so no difference can be concluded on this basis. The silhouette score for mini batch K-means, however, is higher than for the other two algorithms, so we can conclude that mini batch K-means produces better clustering than K-means and K-means++.
Adjusted Rand index: The ARI scores for K-means and K-means++ are identical, so no difference can be concluded on this basis. The ARI for mini batch K-means, however, is positive, so we can conclude that its predicted clusters agree more closely with the true classes.
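A small illustration of how the ARI behaves, using scikit-learn (identical labelings score exactly 1.0, chance-level labelings score near 0):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)
true = rng.integers(0, 5, size=200)

identical = adjusted_rand_score(true, true)                      # perfect agreement -> 1.0
random_lbl = adjusted_rand_score(true, rng.integers(0, 5, 200))  # chance-level labels, near 0
print(identical, round(random_lbl, 4))
```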
Within-cluster sum of squares: The WCSS values of these algorithms on the given datasets are equal or nearly equal, so no difference can be concluded from this score.
Convergence time: In general, a larger batch size can slow convergence, because each update of the model parameters is computed from a larger subset of data points, making individual updates slower. On our datasets, the K-means algorithm converges fastest, K-means++ takes slightly longer, and mini batch K-means takes slightly longer still.
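Convergence times like those reported above can be measured with a sketch along these lines (synthetic stand-in data; the exact figures will differ by machine and dataset):

```python
import time
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))  # stand-in for a 200-record dataset

def timed_fit(model):
    # Wall-clock time of a single fit() call.
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start

timings = {
    "k-means (random init)": timed_fit(KMeans(n_clusters=5, init="random",
                                              n_init=10, random_state=3)),
    "k-means++": timed_fit(KMeans(n_clusters=5, init="k-means++",
                                  n_init=10, random_state=3)),
    "mini batch k-means": timed_fit(MiniBatchKMeans(n_clusters=5, batch_size=100,
                                                    n_init=10, random_state=3)),
}
for name, t in timings.items():
    print(f"{name}: {t:.4f} s")
```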

CHAPTER-7

7. Conclusion and Future Works

7.1 Conclusion
In this project, three clustering algorithms (K-means, K-means++ and mini batch K-means) are comparatively analyzed for customer segmentation. The performance parameters silhouette score, adjusted Rand index and within-cluster sum of squares are calculated for each dataset, and performance is assessed from these scores. For the mini batch algorithm, the means of the silhouette score, adjusted Rand index and within-cluster sum of squares are 0.56, 0.003945 and 96128 respectively, so we conclude that mini batch K-means is the more effective and accurate clustering algorithm. On the basis of convergence time, however, mini batch K-means takes longer than K-means and K-means++. In summary, mini batch K-means is better on the silhouette score, adjusted Rand index and within-cluster sum of squares, but slower in convergence time.

7.2 Future Works


This project can be extended to other, more diverse datasets, which may further improve the performance of the algorithms. It can also be enhanced with additional features, which can improve accuracy by strengthening the performance indices. The performance index values vary with the number of records in the dataset, so increasing the amount of data can increase accuracy.


9. Annex

9.1 Source code and functions

9.1.1 Source code for k-means algorithm

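The original listing was embedded as an image; a minimal reconstruction with scikit-learn, using synthetic stand-in data in place of the actual CSV (file and column names here are assumptions), might look like:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer dataset (the actual file is not reproduced here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, 200),
    "Annual Income (k$)": rng.integers(15, 140, 200),
    "Spending Score (1-100)": rng.integers(1, 101, 200),
})
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].to_numpy(dtype=float)

# init="random" gives plain k-means seeding (k-means++ is scikit-learn's default).
km = KMeans(n_clusters=5, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print("WCSS:", round(km.inertia_, 1))
```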
9.1.2 Source code for k-means++ algorithm

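The original listing was embedded as an image; a minimal reconstruction with scikit-learn, using synthetic stand-in features (an assumption, not the actual data), might look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic income/spending columns standing in for the real dataset.
X = np.column_stack([rng.integers(15, 140, 200),
                     rng.integers(1, 101, 200)]).astype(float)

# init="k-means++" spreads the initial centroids apart (the k-means++ seeding step).
kmpp = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = kmpp.fit_predict(X)
print("Silhouette:", round(silhouette_score(X, labels), 4))
print("WCSS:", round(kmpp.inertia_, 1))
```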
9.1.3 Source code for mini batch algorithm

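The original listing was embedded as an image; a minimal reconstruction with scikit-learn, using synthetic stand-in features (an assumption, not the actual data), might look like:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Synthetic income/spending columns standing in for the real dataset.
X = np.column_stack([rng.integers(15, 140, 200),
                     rng.integers(1, 101, 200)]).astype(float)

# batch_size controls how many samples feed each incremental centroid update.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, n_init=10, random_state=0)
labels = mbk.fit_predict(X)
print("WCSS:", round(mbk.inertia_, 1))
```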
