
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/342571209

Customer Segmentation Based on RFM Model Using K-Means, Hierarchical and Fuzzy C-Means Clustering Algorithms

Research · August 2019


DOI: 10.13140/RG.2.2.15379.71201


All content following this page was uploaded by Surefunmi Idowu on 30 June 2020.



Customer Segmentation Based on RFM Model Using K-Means, Hierarchical and Fuzzy C-Means Clustering Algorithms

Oluwasurefunmi Idowu, National College of Ireland, MSc. Data Analytics, x18158188
Adithya Annam, National College of Ireland, MSc. Data Analytics, x18134963
Eashwar Rangarajan, National College of Ireland, MSc. Data Analytics, x18140386
Srivatsav Kattukottai, National College of Ireland, MSc. Data Analytics, x18145922

Abstract - E-commerce is the purchase of products via an online platform. Driven by many technological improvements, the e-commerce industry has grown considerably in recent years. Firms in this domain aim to increase profit by analyzing previous customer purchase data. By deploying a data mining technique, customer data can be analyzed to help an organization understand its customers and take the necessary decisions. This study presents the analysis of three clustering algorithms - K-means Clustering, Fuzzy C-means Clustering, and Hierarchical Clustering - which were built using the R programming language to understand customer behavior and segment customers based on their purchase pattern through the Recency, Frequency and Monetary (RFM) model. The analysis followed the KDD methodology, and all three algorithms were evaluated internally and externally using validation measures such as Silhouette Width, Dunn Index, Adjusted Rand Index and Variation Index. The k-means and fuzzy c-means results were similar, as five clusters were realized and named, with the largest cluster being the high spending customers. On the other hand, the hierarchical model produced two clusters. With a Dunn Index of 1.58, it was concluded that the hierarchical model performed best.

Keywords: Customer Segmentation, Clustering, RFM, K-means, Hierarchical, Fuzzy C-means.

I. INTRODUCTION

Business is always an outcome of supply and demand. Any industry revolves around its customers and consumers. Hence, it is necessary for firms to understand their customers and decipher their wants and needs. The competitiveness in every industry pushes organizations to stay up-to-date and provide the best possible service to their customers. Over the past few decades, with advancements in technology, the digitalization of businesses has grown manifold. This has paved the way for the availability of large volumes of data, thereby allowing analysis to be performed by categorizing and grouping customers and in turn classifying their needs. Many technological transformations have happened in the last three decades, causing the internet to serve as a platform for running businesses. This gave birth to the term e-commerce. While analyzing this data is an interesting topic, it is more fascinating to develop generic models which would perform this operation on any similar data. The e-commerce industry is likely to expand dramatically in the coming years. By developing a machine learning model that can help us understand customers and their purchase pattern, organizations can make critical business decisions to attract more customers and serve them better, thereby gaining a competitive advantage. Consequently, the topic has attracted a lot of research by marketers and data analysts, who have developed several machine learning models to aid profit making in this domain. One of the widely known techniques is clustering, which is simply defined by [19, p. 286] as an unsupervised classification model which finds patterns in a dataset. These patterns can be found using analyses such as market basket analysis [11] and the RFM model [7].

This analysis is inspired by [2], which develops a k-means clustering model to understand the customer purchase pattern and sequence. That research also tries to find the characteristics of customers by segmenting them based on the Recency, Frequency and Monetary (RFM) model. It employed the k-means clustering algorithm through the cluster node in Statistical Analysis Systems (SAS) Enterprise Miner. The present study aims to develop three different clustering algorithms: k-means, fuzzy c-means and hierarchical clustering for analyzing customer purchase pattern. The customers will be clustered based on the RFM model and the techniques will be compared in order to choose the best one.

The next section of this paper highlights the research question and objectives. Afterwards, the work is justified based on literature published in past and recent times. Subsequently, the methodology and necessary steps are explained, followed by the evaluation metrics. Finally, the paper is summarized, and possible future work is proposed.

II. RESEARCH QUESTION
“Can hierarchical clustering model perform better than k-means and fuzzy c-means algorithm when analyzing customer purchase pattern using RFM analysis?”

Hence, the objectives of this research are the following:
• To implement k-means, fuzzy c-means and hierarchical clustering in analyzing customer purchase pattern using RFM model.
• To evaluate each model internally and externally.
• To derive customer segments using the mentioned clustering techniques.
• To pick the best performing clustering model.

III. LITERATURE REVIEW

A. Introduction
Clustering is generally used to find similarity in a dataset [1]. It has been applied by many authors through different methods and in a variety of fields. For instance, in the medical field, doctors use it to group patients with similar diseases and prescribe the same treatment for them. In the marketing field, retailers use it to understand customer buying behavior in order to enhance their offers and make more profit. In the banking field [13], it is used to gain key insights into customer value and determine strategies for customer segments. These days, companies invest large amounts of money in developing marketing strategies aimed at customer segmentation, thereby maximizing customer satisfaction and retention. To do this, they analyze large volumes of customer data to gain insights that enable stronger business decisions and a competitive advantage in the market [14]. For customer segmentation to be useful, it must be easily understood, identifiable, relevant, significant and reachable¹. Customers could be segmented based on their visits [8][11], purchase history and details [15], payment details and so on.

¹ https://askpivot.com/blog

B. Clustering Techniques
There are two main methods of clustering: hard (or crisp) clustering and soft clustering [1]. Hard clustering includes k-means (used for numeric data) and k-modes (for categorical data), while soft methods include fuzzy c-means [2], possibilistic c-means [10], and evidential c-means [3]. K-means clustering is one of the most popular clustering techniques. It is only useful for a dataset with numeric variables; since it was designed that way, it cannot cluster objects with categorical variables. Although k-means is easy to implement and obtains good results, it has some limitations highlighted by [19, p. 289], in the sense that it is not guaranteed to find the optimum number of clusters because it uses “an element of random chances.” It was also stated that it is not ideal for clusters which have considerable differences in density. Moreover, research has shown that the k-means clustering model does not work well with clusters of varying size and density [3]. Therefore, in cases where we cannot define a k-value, a hierarchical clustering model can be used, which provides more intuitive results. Subsequently, researchers [5] have tried to combine k-means with other clustering algorithms in order to yield better performance. According to [5, p. 609], clustering algorithms can be classified into four groups: k-means, hierarchical, density based and self-organization maps. Furthermore, in the literature, a hybrid clustering technique was implemented, which is a combination of k-means and hierarchical clustering, aiming to organize large amounts of text data into significant clusters. It was found that in hierarchical clustering, prototype vectors were picked out and were found to hide noise from the technique, but no comparison was made between the techniques. Although some authors have compared and contrasted several clustering algorithms [4], the specific combination of k-means, fuzzy c-means and hierarchical clustering has not been evaluated together.

One important issue in clustering is the fact that the data is not completely understood. This phenomenon is called imperfect knowledge of the dataset. To handle these imperfections, several uncertainty theories have been recommended, one of which is fuzzy sets [16]. Taking a further look into clustering methods, [10] proposed a possibilistic meta-clustering model where two granules are created for customers and the products bought frequently. Data inputs were derived from the database of a retail store by segmenting customers and their purchasing pattern, and the model was implemented with the help of the k-modes clustering algorithm, leading to better decision-making results. While [5] evaluated clustering techniques using precision, accuracy and recall, [16] used the kappa index and the Rand index. Some papers [11] use the market basket analytics method to analyze customers' shopping preferences by segmenting customers’ visits with the help of a feature selection approach, collaborative filtering and association rules [8].

C. Recency Frequency and Monetary (RFM) Model
The RFM model, originally known as RFM analysis, is a widely known standard technique used to evaluate customer lifetime value, especially in the retail industry. It was first proposed by [17]. Recency stands for the last purchase date within a specific period, Frequency stands for the number of purchases within a specific period, and Monetary is the value of purchases within a specific period. The RFM model is calculated thus:

RFM score = (rs x rw) + (fs x fw) + (ms x mw)

where:
rs = recency score and rw = recency weight
fs = frequency score and fw = frequency weight, and
ms = monetary score and mw = monetary weight

The model has been used by several authors [7][20]. While [7] developed a clustering model to identify customer segments of one of Turkey’s largest sport retail stores using two-step cluster analysis and k-means clustering, [20] used a combination of regression and k-means. In [7], the RFM values were used as indicators to cluster customers. For two-step clustering, the number of clusters was not fixed, whereas in k-means clustering, the best model was built when the value of k was 4. The same approach has been used in other retail stores, but in different countries like India [12] and Romania [6].

D. Conclusion
In this paper, we compare three clustering techniques (k-means, fuzzy c-means and hierarchical clustering) to determine customer segments using data of a Brazilian retail store, sourced from Kaggle. The clusters are evaluated internally and externally, and the best algorithm is picked based on the different parameters.

IV. METHODOLOGY AND IMPLEMENTATION

The methodology followed in this research is KDD (Knowledge Discovery in Databases). KDD is chosen because this research follows a data-driven approach and aims at acquiring knowledge from a database, namely classifying the customers based on purchase pattern. The purchase pattern is analyzed using clustering models by identifying the Recency, Frequency and Monetary value of customers. This research is implemented through various stages that are explained below:
• Data Collection.
• Data pre-processing.
• Data transformation.
• Data mining.
• Evaluation.

Fig. 1. KDD Process Flow

A. Data Selection:
To analyze the purchase pattern of customers in retail shops, the dataset of a Brazilian retail store (Olist) was downloaded from Kaggle; it contains customer location, products ordered and payment related information. Since the purchase pattern can be analyzed by considering how recently and frequently customers have placed orders, as well as the price of the products purchased, only 3 datasets were used for segmentation: customer details, orders and payment data.

Fig. 2. Data Architecture

Figure 2 explains the merging of the individual datasets to form the database, which is done in R Studio. These tables were joined based on order_id and customer_id.

B. Data Cleaning and Pre-processing:
The target data is cleaned before proceeding further. Cleaning here includes handling missing values, detecting and handling outliers, etc. The merged database has 103,886 records. This database does not contain any missing values. Outliers are detected once RFM values are calculated, as all fields will then be numeric, and the missing values are checked once again. As part of data pre-processing, the date was extracted from the “order_approved_at” column, which holds the timestamp of the invoice generation, by splitting it to keep only the date. Thus, the pre-processed data is generated.

C. Data Transformation:
Cleaned and pre-processed data is transformed in order to build machine learning models on it. In the transformation stage, customers are grouped, and Recency, Frequency and Monetary values are calculated for each customer. Recency is calculated as the difference between the current date and the latest invoice date. Frequency is calculated as the number of orders by a customer. Monetary is calculated as the total amount of all orders by the customer. Thus, the transformed
dataset contains Customer ID, Recency (in days), Frequency, and Monetary. This transformed dataset is checked for any missing values and outliers. The plot below (Fig. 3) displays the ratio of missing entries. As the ratio of missing values is less than 0.1%, we can ignore the missing entries and consider only the complete cases for further analysis.

Fig. 3. Missingness Map

Thus, the data is transformed and is made ready for the models to be implemented. Outliers are detected and removed accordingly.

D. Data Mining:
(i) Elbow Method for K-means:
The elbow method is commonly used to identify the optimal number of clusters. For a range of k values, k-means is executed and the total within-cluster sum of squares (WSS) is plotted. The value of k at which the plot bends is taken as the optimal number of clusters. As shown in the plot (Fig. 4), after k=5 the total WSS does not improve much further. Hence, the number of clusters is chosen as 5. The same number of clusters is used to develop the Fuzzy C-means model as well.

Fig. 4. Elbow Plot to identify optimal number of clusters

(ii) Dendrograms for Hierarchical Clustering:
Dendrograms contain the memory of the hierarchical clustering algorithm. Each vertical line represents a distance or dissimilarity threshold, and the longest vertical line that is not crossed by any horizontal line is selected. A horizontal cut-off line is then drawn through it to find the optimal number of clusters; from the dendrogram plot below (Fig. 5), the number of clusters obtained is 2.

Fig. 5. Dendrogram plot to identify the optimal clusters

E. Machine Learning Models:
(i) K-means Clustering:
K-means is executed with the number of clusters set to 5. The output obtained is presented in Figure 6 below.

Fig. 6. K-means output

The 5 clusters represent 5 groups of customers as below:
1. High spending customers
2. Less spending, recent customers
3. Average spending, recent customers
4. Less spending, less recent customers
5. Average spending, less recent customers

This k-means model is evaluated, and various parameters are discussed in Section V.
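The elbow procedure and the k-means run described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper's models were built in R, whereas this sketch uses plain Python, and the function names (`kmeans`, `elbow_curve`), the deterministic farthest-point initialisation and the toy data `TOY_RFM` are all assumptions introduced here. The toy values stand in for already-scaled (Recency, Frequency, Monetary) triples and are not taken from the Olist data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an assumption made here
    so the sketch is reproducible; it is not the paper's initialisation)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, total within-cluster SS)."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss


def elbow_curve(points, ks):
    """Total within-cluster sum of squares for each candidate k."""
    return {k: kmeans(points, k)[2] for k in ks}


# Hypothetical, already-scaled (recency, frequency, monetary) values.
TOY_RFM = [
    (0.1, 0.1, 0.1), (0.2, 0.1, 0.2), (0.1, 0.2, 0.1),
    (0.9, 0.1, 0.1), (0.8, 0.2, 0.1), (0.9, 0.2, 0.2),
    (0.1, 0.9, 0.9), (0.2, 0.8, 0.9), (0.1, 0.9, 0.8),
]

if __name__ == "__main__":
    for k, wss in elbow_curve(TOY_RFM, range(1, 6)).items():
        print(f"k={k}: total WSS = {wss:.4f}")
```

Plotting the printed WSS values against k gives the elbow plot; the bend is then read off visually, as in the paper, rather than chosen automatically.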
(ii) Fuzzy c-means:
Fuzzy C-means was built to identify 5 clusters. Unlike k-means, it is a soft clustering method, and it categorizes the customers as:
1. High spending customers,
2. Average spending customers,
3. Less spending, less recent customers,
4. Less spending, recent customers,
5. Least spending, recent customers.

Fig. 7. Fuzzy c-means output

Various performance metrics are calculated and discussed in the next section.

(iii) Hierarchical Clustering

Fig. 8. Hierarchical output

By considering the dendrogram, we can see the optimal number of clusters is 2, which groups the customers as below:
1. Low spending customers and
2. High spending customers.

The performance of this model can be measured by various cluster metrics, which are explained in the section below.

V. RESULTS AND DISCUSSION
The performance of cluster models is validated in two categories. They are explained below.

A. Internal Cluster Validation:
The goal of a cluster model is to have lower intra-cluster distance (between objects of the same cluster) and higher inter-cluster distance (between different clusters). The performance of cluster models is evaluated using the measures below:
• (i) Silhouette Width: It measures how close each point is to its own cluster in comparison to other clusters. It ranges from -1 to 1, with values near 1 indicating the observations are well clustered. Silhouette widths for the three cluster models are listed in the table below.
• (ii) Dunn Index: It is the ratio of the smallest inter-cluster distance to the largest intra-cluster distance. Ideally, a higher value of Dunn Index is desired.

                   K-means    Hierarchical   Fuzzy c-means
Silhouette Width   0.39418    0.381595       0.35838
Dunn Index         1.46043    1.580904       1.34221

As indicated in the table, K-Means has a marginally higher Silhouette width than the Hierarchical model. It should be noted that K-Means and Fuzzy C-means were built with 5 clusters, whereas Hierarchical was built with only two clusters. Based on the Dunn Index, the Hierarchical model provided better results, but the business value of its results is limited, as it shows only two clusters.

B. External Validation:
External validation is done to compare cluster models and study the similarity between different cluster models. It is evaluated using the parameters below:
• (i) Adjusted Rand Index: It is a measure of the agreement between different cluster models. Its maximum value of 1 indicates maximum agreement, values near 0 indicate agreement no better than chance, and it can be negative.

Adjusted Rand Index   K-means   Hierarchical   Fuzzy c-means
K-means               N/A       0.32519        0.51907
Hierarchical          0.32519   N/A            0.317132
Fuzzy c-means         0.51907   0.317132       N/A
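As an illustration of how a pairwise Adjusted Rand Index such as those tabulated above can be obtained from two cluster label vectors, the following is a minimal sketch in plain Python. It uses the standard contingency-table (Hubert and Arabie) form of the index; the function name `adjusted_rand_index` and the toy label vectors are assumptions made here, and the sketch is not the routine the authors used in R.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)

    # Contingency counts: objects sharing cluster i in A and cluster j in B.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())

    # Pairs placed together within each partition on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement of identical partitions
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_joint - expected) / (max_index - expected)


if __name__ == "__main__":
    # Same grouping under different label names: perfect agreement.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Groupings that agree less than chance give a negative index.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which objects are grouped together, it is unaffected by how the clusters are numbered, which is what makes it usable for comparing the five-cluster and two-cluster models directly.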
The Adjusted Rand Index table indicates that K-Means and Fuzzy C-means produce relatively similar results: their index of about 0.52 is the highest pairwise agreement among the three models.

• (ii) Variation of Information Index: It is a measure of variation between two cluster models. In general, a lower Variation Index is desired.

VI Index        K-means   Hierarchical   Fuzzy c-means
K-means         N/A       1.184          1.2072
Hierarchical    1.184     N/A            1.89365
Fuzzy c-means   1.2072    1.89365        N/A

The Variation of Information table highlights the variation between the Hierarchical and Fuzzy C-means models, which is the largest of the three pairs, and helps in studying the difference in results between cluster models.

VI. CONCLUSION AND FUTURE WORK
In this research, various cluster models were developed to segment customers based on the Recency, Frequency and Monetary (RFM) model, and these models were evaluated. The Dunn Index indicates that the hierarchical model performed better than the k-means and fuzzy c-means models in terms of producing a good cluster. Likewise, the Adjusted Rand Index score indicates that the k-means and hierarchical models provided more varying results. These results indicate that the hierarchical model provided better clusters that satisfy the primary goal of lower intra-cluster distance and higher inter-cluster distance. However, the performance of these models can be evaluated with more parameters like entropy, partition coefficient, and so on, thereby comparing results. Furthermore, hierarchical models with different distance and partitioning methods such as Manhattan, bit-vector, and Hamming can be implemented and compared with k-means.

REFERENCES
[1] A. Ammar, Z. Elouedi, and P. Lingras, “Meta-clustering of possibilistically segmented retail datasets,” Fuzzy Sets and Systems, vol. 286, pp. 173–196, Mar. 2016.
[2] S. C. Oner and B. Oztaysi, “An interval type 2 hesitant fuzzy MCDM approach and a fuzzy c means clustering for retailer clustering,” Soft Comput, vol. 22, no. 15, pp. 4971–4987, Aug. 2018.
[3] M.-H. Masson and T. Doneux, “ECM: An evidential version of the fuzzy c-means algorithm,” Pattern Recognition, vol. 41, no. 4, pp. 1384–1397, Apr. 2008.
[4] S. Ghosh and S. K. Dubey, “Comparative analysis of k-means and fuzzy c-means algorithms,” IJACSA, vol. 4, no. 4, pp. 35-39, 2013.
[5] G. Singh and N. Kaur, “Implementation of hybrid clustering algorithm with enhanced k-means and hierarchal clustering,” IJARCSSE, vol. 3, no. 8, pp. 608-618, Aug. 2013.
[6] B. Romania, “Clustering the Grocery Retail Market,” in Business Excellence Challenges During Economic Crisis in 7th Int. Conf. on Business Excellence, Ohio, USA, pp. 98-312, Oct. 12-13, 2013.
[7] O. Doğan, E. Ayçin, and Z. A. Bulut, “Customer segmentation by using RFM model and clustering methods: A case study in retail industry,” IJCEAS, vol. 8, no. 1, pp. 1-19, 2018.
[8] M. Khalaji and S. J. Mirabedini, “Recommender System Based on Association of Complementary and Similarity in Electronic Market,” Middle East Journal of Scientific Research, vol. 13, no. 6, pp. 823-828, 2013. DOI: 10.5829/idosi.mejsr.2013.13.6.2497
[9] S. Škerlič and R. Muha, “Identifying warehouse location using hierarchical clustering,” Transport Problems, vol. 11, no. 3, pp. 121–129, 2017.
[10] A. Ammar, Z. Elouedi, and P. Lingras, “Meta-clustering of possibilistically segmented retail datasets,” Fuzzy Sets and Systems, vol. 286, pp. 173–196, Mar. 2016.
[11] A. Griva, C. Bardaki, K. Pramatari, and D. Papakiriakopoulos, “Retail business analytics: Customer visit segmentation using market basket data,” 2018.
[12] M. Abirami and V. Pattabiraman, “Data Mining Approach for Intelligent Customer Behavior Analysis for a Retail Store,” in Proceedings of the 3rd International Symposium on Big Data and Cloud Computing Challenges (ISBCC – 16’), 2016, pp. 283–291.
[13] A. Ansari and A. Riasi, “Taxonomy of Marketing Strategies Using Bank Customers’ Clustering,” IJBM, vol. 11, no. 7, p. 106, Jun. 2016.
[14] A. Riasi and S. Pourmiri, “Effects of online marketing on Iranian ecotourism industry: Economic, sociological, and cultural aspects,” Management Science Letters, vol. 5, no. 10, pp. 915–926, 2015.
[15] C. H. Park, Y. H. Park, and D. A. Schweidel, “A multi-category customer base analysis,” International Journal of Research in Marketing, vol. 31, no. 3, pp. 266–279, Sep. 2014.
[16] Z. Ji, Z. Xia, Q. Sun, and G. Cao, “Interval-valued possibilistic fuzzy C-means clustering algorithm,” Fuzzy Sets and Systems, vol. 253, pp. 138-156, 2014.
[17] K.B. Munroe, Pricing: Making Profitable Decisions, McGraw-Hill, New York, 1990.
[18] M. Khajvand and M. J. Tarokh, “Estimating customer future value of different customer segments based on adapted RFM model in retail banking context,” Procedia Computer Science, vol. 3, pp. 1327–1332, Jan. 2011.
[19] B. Lantz, Machine learning with R, 2nd ed, Birmingham: Packt Publishing Ltd., 2015.
[20] F. Yoseph and M. Heikkila, “Segmenting Retail Customers with an Enhanced RFM and a Hybrid Regression/Clustering Method,” 2018 International Conference on Machine Learning and Data Engineering (ICMLDE), Sydney, Australia, pp. 108-116, 2018.
