IEEE Conference Template 5
Abstract: Customer segmentation has developed into one of the most important and practical strategies for e-marketing and e-commerce in recent years. It is essential to online product recommendation systems and aids in the understanding of local and international wholesale and retail markets. Customer segmentation is the process of grouping like-minded customers into the same segment based on shared traits such as gender, age, location, and ratings, and it is aided by clustering algorithms. The most commonly used clustering methods are K-means, density-based, and hierarchical clustering.
In this paper, an improved customer segmentation framework has been developed using the best clustering model for insight and recommendation. The framework has five components: the Cluster Tendency Test, building a clustering model, comparison between K-means and Hierarchical Clustering, comparison between K-means and DBSCAN, and insight and recommendation.
Index Terms: Customer segmentation, Hierarchical clustering, Clustering algorithms, Density based clustering, Distortion score.

I. INTRODUCTION
The emergence of the information age and the rapid advancement of computer network technology have fundamentally altered the nature of market competitiveness. In addition to being quicker and more convenient in terms of time and space, the internet-based business model also, to a significant extent, gives businesses an effective way to gather customer resources and market knowledge [1]. E-commerce websites provide a very beneficial platform for the online buying and selling of a variety of goods from various locations. E-commerce websites have a specific marketing objective: to engage with customers. Every customer is unique, and one problem that arises is information overload caused by the many products offered by e-commerce sites [2]. To overcome the information overload problem, customer segmentation needs to be implemented in e-commerce services. The traditional method of customer segmentation is relatively simple and rough and cannot cater well to the market's business model [3]. For an e-commerce site, selecting the appropriate machine learning algorithm can effectively extract the relevant features of customer behavior and use these features to divide customers into different groups. Cluster analysis is a type of method frequently employed in machine learning to obtain differentiated interpretations of various clusters. It is mostly utilized in the analysis of enterprise data to identify the distribution characteristics present in large datasets (Oña et al. 2016).
The aim of this paper is to determine the types of customers, such as best customers, potential loyal customers, and at-risk customers, and to determine the value of customers so that e-commerce sites can decide which types of customer will provide robust revenue and which will not, and also what new market strategy they can apply to promote their growth in revenue.

II. RELATED WORK
The customer segmentation approach has been employed by several researchers in a variety of fields. A methodology for customer segmentation based on clusters was investigated, and a hierarchical clustering method named HACNJ, which is based on the Q-criterion, was suggested [4]. An online store may recognize customers by using data mining methods. Customers can therefore receive personalized services through marketing strategies suited to their requirements [5]. To analyze customer characteristics efficiently, a method was proposed in which a retail supermarket was used as the research object. Data mining techniques were then used to identify retail enterprise customer segments, and association rules obtained using the Apriori algorithm were applied to various customer groups [6].

III. BACKGROUND OF THE STUDY
The theoretical concepts that underlie this study, such as clustering techniques, are briefly described here.
A. Clustering Algorithm
The task of grouping a collection of objects into groups that contain only objects of the same type is known as
cluster analysis or clustering. Cluster analysis methodologies have been covered in detail in many studies (see [1], [4], [5], [8], [16]). Our study focuses on three algorithms: k-means, DBSCAN, and hierarchical clustering.
1) K-means: The k-means algorithm divides the data into a specified number of clusters. The initial cluster centers are chosen at random. In each iteration, every observation is reassigned to the closest cluster center, and the cluster centers are then recalculated. The process is repeated until every observation is assigned to its nearest cluster center.
2) DBSCAN Algorithm: The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is founded on the common understanding of "clusters" and "noise". The main principle is that, for each point within a cluster, at least a certain number of points must be present within a particular radius of it. Instead of estimating the number of clusters, DBSCAN requires two hyperparameters, epsilon and MinPts, to be defined.
• Epsilon (ϵ): The radius used to locate neighboring points and determine the density around any given point.
• MinPts (n): The minimum number of points necessary to form a cluster.
MinPts(n) ≥ D + 1
where D is the number of dimensions of the dataset.
3) Hierarchical Clustering: Hierarchical clustering (also known as hierarchical cluster analysis, or HCA) is another unsupervised machine learning approach used to cluster unlabeled datasets. There are two varieties of hierarchical clustering: Agglomerative Hierarchical Clustering and Divisive Hierarchical Clustering.

IV. METHODOLOGY
The required libraries are imported, and then the data is imported. In this project, clustering models including the K-means algorithm, hierarchical clustering, and DBSCAN are built and then compared. After the comparison, the best model is selected, and the k-means clustering algorithm is applied to customer segmentation. The integrated customer segmentation framework is shown in Fig. 1.
Fig. 1. Integrated Customer Segmentation Framework

V. EXPERIMENTAL RESULT ANALYSIS
A. Setup
This experiment is carried out on an x64-based Intel Core i5 processor using Jupyter Notebook and Python 3. Pandas version 1.4.3, numpy version 1.20.3, and seaborn version 0.12.0 were installed.
B. Dataset
The data was compiled into an Excel file. Some customer data were collected from the e-commerce website and some data were updated, including details from customer registration, login, and ratings. The dataset contains 400 rows and 7 attributes, which are listed in Table I.

TABLE I
FEATURES OF DATA
No  Feature             Description
1   CustomerId          Unique customer ID
2   Age                 Age of customer
3   Annual Income(k$)   Annual income of customer
4   Gender              Gender of customer (male or female)
5   ratings             Ratings of customer on basis of service
6   Age                 Age of customer
7   Location            The location of customer

C. Cluster Tendency Test
The main purpose of the cluster tendency test is to determine whether clustering is applicable, so a cluster tendency test was carried out. The Hopkins Test is a well-known test for cluster tendency. It determines whether or not observations are spread randomly across the space.
Hopkins Score: 0.09322634776929459
A dataset is suitable for clustering when its Hopkins score is less than 0.5, and unsuitable when its score is more than 0.5. Because the score was below the cutoff, the dataset was appropriate for clustering.
D. Building the Clustering Model
Many clustering methods, such as K-means or hierarchical clustering, first require the number of clusters, but determining the optimal number can be challenging. The optimal number of clusters can be found in a variety of ways. The Elbow Method was employed in this instance.
• Distortion score:
Fig. 2 demonstrates how the distortion score decreases as the number of clusters rises. However, no obvious "elbow" is visible. Four clusters were recommended by the underlying algorithm.
• Silhouette score:
Plotting the silhouette score as a function of the number of clusters is another method for determining the ideal number of clusters. The silhouette score measures how well samples are clustered with other samples that are similar to them, in order to assess the quality of clusters produced by clustering algorithms.
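The distortion and silhouette scores described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the distortion score is the sum of squared distances from each sample to its assigned centroid (which scikit-learn exposes as `inertia_`), and it uses synthetic blobs as a stand-in for the customer data, which is not reproduced in this paper. The helper name `score_cluster_counts` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer table (assumption: four well-separated groups).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=0,
)


def score_cluster_counts(X, k_values, seed=0):
    """Return {k: (distortion, silhouette)} for each candidate number of clusters."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the sum of squared distances of samples to their closest centroid.
        scores[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return scores


scores = score_cluster_counts(X, range(2, 9))
# Pick the candidate k with the highest mean silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])

for k, (distortion, silhouette) in scores.items():
    print(f"k={k}: distortion={distortion:.1f}, silhouette={silhouette:.3f}")
print("k with the highest silhouette score:", best_k)
```

When the distortion curve shows no clear elbow, as reported here, the silhouette score gives a second, independent criterion for choosing the number of clusters.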
Fig. 4. Determine optimal number of clusters
Where, Yi is the centroid for observation Xi.
In Fig. 4, it is observed that, according to the elbow model, segmenting customers into four categories would be optimal. The optimal number of clusters was therefore 4. The Kmeans
In Table II, it is clear that DBSCAN was unable to produce rational clusters. DBSCAN will provide less-than-ideal results if one of the clusters is less dense than the others, because it will not classify the least dense group as a cluster.

VI. INSIGHT AND RECOMMENDATION
According to the findings of the analysis, there are four groups or segments that can be used to target promotions at specific customer groups.
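One way to turn four k-means segments into per-segment insight is to average each feature within a segment, as in the sketch below. This is an illustration under stated assumptions, not the authors' pipeline: the rows are random placeholders that reuse column names from Table I, the standardization step and the helper name `profile_segments` are hypothetical choices, and the resulting numbers say nothing about the real customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder rows: random values under column names taken from Table I.
# The real customer file is not available here, so this data is an assumption.
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "Age": rng.integers(18, 70, size=400),
    "Annual Income(k$)": rng.integers(15, 140, size=400),
    "ratings": rng.integers(1, 6, size=400),
})


def profile_segments(df, features, n_segments=4, seed=0):
    """Cluster customers with k-means and summarise each segment."""
    # Standardize so that no single feature dominates the distance computation.
    scaled = StandardScaler().fit_transform(df[features])
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=seed).fit_predict(scaled)
    labelled = df.assign(segment=labels)
    # Mean of every feature per segment, plus the number of customers in it.
    profile = labelled.groupby("segment")[features].mean()
    profile["size"] = labelled.groupby("segment").size()
    return labelled, profile


labelled, profile = profile_segments(customers, ["Age", "Annual Income(k$)", "ratings"])
print(profile.round(1))
```

Each row of `profile` then describes one segment, which can be compared against labels such as best, potentially loyal, or at-risk customers.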