Customer Segmentation Using Data Science
Abstract: Customers have always been, and always will be, the center of the market and its most important aspect. As time progresses, the diverse nature of customers and how they affect the market and product schemes as a whole is brought forward, especially through the heaps of data available these days. Most customers prefer online shopping, especially after the pandemic, which has helped increase the data about customers, their interests, and their characteristics, and has hence helped companies understand their needs better. One way to better provide for customers is customer segmentation. Customer segmentation is the technique through which we can form clusters of customers based on different aspects of their already collected data, such as gender, region, or age. Practicing customer segmentation helps a company understand its potential audience better, which in turn helps it provide better marketing schemes targeting special zones, which can help boost product growth. Here we have implemented customer segmentation with the above in mind and tried to form a system that can help anticipate the products a customer might be willing to buy, using K-means clustering with the elbow method. To implement the system, we have used the K-means method in Python with the help of machine learning and data science approaches. The data set used here provides real-time data about products bought, along with other important information.

Keywords: Customer segmentation, K-means, Euclidean distance, Classification, Data science, Machine learning

I. INTRODUCTION

Customers are people who purchase and use the firm's goods. Customers have wants, and they are the ones who ultimately decide whether the company's product meets those demands. Customers represent the company's market share, and sales and profits are generated by them. Customer loyalty to a product or service is advantageous since customers will continue to look for the thing they want [1]. To make a business better and more flourishing, knowing your customer is most important. This could range from knowing the gender of your customer to knowing which shift in marketing scheme got your product the most attention from customers. Knowing your customer is important, but being able to attract more customers is even more important.

We know that good reviews and better-quality products make customers look twice at any business, and as word spreads new customers can flock in. Analysing all this and classifying customers into circumstance- and need-based clusters can help a businessperson handle their reach better. As we know, the market keeps changing overnight; to stay in front, analysis is the most important factor, and customer segmentation helps in achieving it. Customer segmentation refers to the process of determining how to interact with consumers in various groups to amplify the value of each customer to the company. Marketers can use customer segmentation to reach out to each consumer in the most efficient way feasible [2-3]. Through these groups, customers' behavioural, demographic, and other patterns can be recognised, which can further help improve the business, as it could pave the way for bringing in more profit; in the end, any business's main goal is to bring in as much profit as possible. This could also help attract new potential customers. For this a great marketing strategy is required, and that strategy could very well be designed for a specific group of customers based on the characteristics of their cluster. Clustering has been proved to be a good way to implement consumer segmentation. Clustering, which is the capacity to uncover categories in unlabelled datasets, is an example of unsupervised learning. K-means, hierarchical clustering, DBSCAN clustering, and other approaches are among the clustering methods available [4]. Through this project we will develop a model that will help us classify customers from a dataset of user purchases into segments. By further working on these segments, we will anticipate the purchases that will be made by a new customer during the following year.

II. LITERATURE SURVEY

In [5], the customer lifetime value (LTV) model and the recency, frequency, and monetary (RFM) model are suggested for assessing customer classification for campaign management. A general technique is also provided for locating more suitable customers for each marketing approach. Targeting and segmenting customers is the marketing plan. A Nissan car retailer dataset of more
than 4000 clients is divided for assessment of the suggested method's effectiveness. When it comes to client targeting, the suggested model is more effective than random selection.

In [6], the RFM, Kano, and BG/NBD models are suggested for consumer segmentation on e-commerce websites. Based on factors like customer lifetime worth, customer satisfaction, and customer engagement, customers are divided into various categories. The customers are divided into ten groups in accordance with the marketing strategies. By categorizing and targeting customers, segmentation increases earnings for companies.

In [7], an RFM paradigm was suggested. To create segment-level models that can be predicted, a chosen data model is submitted to and implemented with pattern-based clustering and signature finding methods. In this instance, credit card consumption data is used, and the methods are applied to produce a financial matrix and a fluctuate-rate matrix that assist in the investigation of various modes. Using clustering on the two vectors, various customer traits are examined. A two-dimensional customer segmentation model based on consumption is created with the aid of these factors.

In [8], the grocery store business is the example: it uses the new RFM-based model LRFMP, which stands for Length, Recency, Frequency, Monetary, and Periodicity, to categorize customers and find various client segments. Customers are segmented using real-world data from a Turkish grocery chain and a combination of the LRFMP model and a clustering method. This research paves a simple path for researchers and practitioners to gather useful insights for identifying various customer profiles based on the LRFMP model, and primarily assists decision makers in developing useful customer relationships and distinctive marketing strategies to reach more customers.

In [9], the network of online banking customers has expanded rapidly in recent years, and clustering-based consumer segmentation of unstructured transactional data is urgently needed. The most popular clustering methods, K-means and K-medoids, are applied to datasets based on the RFM score of a customer's online banking activities. The K-means strategy outperformed the K-medoids method based on intra-cluster distance when the effectiveness of both methods was evaluated and contrasted.

In [10], neural networks and support vector machines are suggested as machine learning methods to segment clients in the private banking industry. Customers are divided into groups based on factors such as character identification, credit default, fraud forecast, and outlook for the foreign exchange market. Customer segmentation is the first stage in the private banking industry towards achieving the most lucrative company growth.

In [11], from the viewpoints of network operators, handset makers, and application writers, the authors demonstrated user segmentation on pertinent metrics. To analyze the results of a smartphone measurement study with 129 subjects, they used latent class analysis. The information is then connected to psychographic and socioeconomic information to support behaviors. Different service groups can be identified in terms of network traffic (phone, SMS, and internet), as well as the use of content services (i.e., applications and URLs).

In [12], the conventional K-means clustering method is thoroughly examined, and a modelling procedure based on the least squares idea is presented for telco consumer segmentation. A clustering technology based on the K-means method is provided for Changzhou Telecom in Jiangsu Province, and real results demonstrate that it offers an effective and successful resolution of customer segmentation for Telecom, bringing services closer to the customers.

In [13], a decision tree analysis method is given for full-service restaurants to classify patrons according to their dining tastes. When making purchases, consumers are differentiated based on five factors: menu, ambiance, pricing, health, and brand. In total, 390 surveys were used to gather the data. With decision tree analysis, a researcher can target and locate suitable consumers. Customers are divided into five segments based on a series of criteria. The administration and advertisers of restaurants can benefit from customer segmentation.

In [14], maintaining customer loyalty and attention span is presently one of the biggest challenges confronting the retail industry. The methods used in marketing are constantly evolving. The variables that have the biggest effects on historical correlation are determined by sales information acquired through a transaction. Based on groups, suitable resources can be allocated, using machine learning to route traffic to algorithms for happy consumers. Singular Value Decomposition is used for offering, and K-means clustering is used for client classification.

help us while forming clusters further in the project.
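Several of the surveyed works ([5] to [9]) build on the RFM model. As a rough illustration of what an RFM table looks like, the sketch below derives recency, frequency, and monetary values per customer with pandas. The column names (`customer_id`, `invoice_id`, `date`, `amount`), the helper `rfm_table`, and the toy transactions are hypothetical and are not taken from any of the cited datasets.

```python
import pandas as pd


def rfm_table(transactions: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Build one RFM row per customer (hypothetical column names)."""
    grouped = transactions.groupby("customer_id")
    return pd.DataFrame({
        # Recency: days between the snapshot date and the last purchase.
        "recency_days": (snapshot - grouped["date"].max()).dt.days,
        # Frequency: number of distinct invoices.
        "frequency": grouped["invoice_id"].nunique(),
        # Monetary: total amount spent.
        "monetary": grouped["amount"].sum(),
    })


# Toy data, invented for illustration only.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "invoice_id": ["a", "b", "c", "d", "e", "f"],
    "date": pd.to_datetime([
        "2023-01-05", "2023-03-01", "2023-02-10",
        "2023-01-20", "2023-02-20", "2023-03-10",
    ]),
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 40.0],
})

rfm = rfm_table(transactions, snapshot=pd.Timestamp("2023-03-31"))
print(rfm)
```

The three resulting columns can then be scaled and passed to a clustering method such as K-means, which is how works like [9] use RFM scores.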
D. Creating clusters:
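The abstract states that clusters are formed with K-means and the elbow method. A minimal sketch of that idea with scikit-learn follows. It runs on synthetic two-dimensional points rather than the paper's dataset, and the rule used to pick the elbow (largest second difference of the inertia curve) is one common heuristic assumed here, not necessarily the authors' procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (assumption): three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means for a range of k and record the inertia, i.e. the
# within-cluster sum of squared Euclidean distances to the centroids.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Elbow heuristic (assumption): choose the k where the curve bends most,
# measured by the largest second difference of the inertia values.
second_diff = np.diff(inertias, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

# Final clustering with the selected k.
labels = KMeans(n_clusters=elbow_k, n_init=10, random_state=0).fit_predict(X)
print("chosen k:", elbow_k)
```

In practice the inertia curve is usually also plotted and inspected by eye, since an automatic rule like the one above can be unreliable when clusters overlap.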
E. Classification:
Fig3 Learning curve for KNN
When the model is fitted with the training data using Decision Tree, the precision comes out to be 85.73%.
Fig4 Learning curve for decision tree
Fig6 Learning curve for AdaBoost Classifier
When the model is fitted using the gradient boosting classifier, the precision is 89.47%.
Fig7 Learning curve for Gradient Boosting
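The classifiers named in this section (KNN, decision tree, random forest, AdaBoost, gradient boosting) can be compared on precision along the following lines. This is only a sketch under stated assumptions: the data is synthetic (`make_classification`), all hyperparameters are scikit-learn defaults, and weighted averaging of precision across classes is assumed, so the scores it prints will not reproduce the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for "customer features -> segment label" (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score precision on the held-out split.
precisions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precisions[name] = precision_score(
        y_test, y_pred, average="weighted", zero_division=0
    )

for name, value in precisions.items():
    print(f"{name}: {value:.4f}")
```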
4) Random Forest
A. Loading Dataset
B. Data Preprocessing
The data was cleaned by eliminating null values from all columns and removing unnecessary features. Additionally, any missing values were replaced with the mean value of the respective term.
Fig11 Distribution based on the frequency of words occurring in the product description
Based on distributions like the one above, we have formed 11 clusters of products.
D. Testing Predictions