DSML - Project Report - Group 3

This document presents the analysis of customer transaction data from a supermarket over 117 weeks. The objectives are to perform customer segmentation based on purchasing behavior and create targeted marketing strategies. The CRISP-DM methodology is used, including data understanding, preparation, exploratory analysis, clustering techniques and conclusions. Key findings are that customer lifestyle data is most incomplete, while store, date and product information are complete.


IIM Kashipur

Master of Business Administration


Master of Business Administration (Analytics)

Data Science and Machine Learning, Term-IV


Year 2022-2023

Submitted To: Prof. Venkataraghavan Krishnaswamy

Submitted By: Group 3

Anika Garg MBAA21006

Anurag Shelar MBAA21009

Avijit Kirttania MBAA21015

Ayushi Sharma MBAA21017

Paresh Kumar Jaiswal MBAA21023

Yukti Dewan MBAA21038

Table of Contents

Abstract
1. Business Understanding
2. Objective
3. Methodology
4. Data Understanding
5. Data Preparation
6. Exploratory Data Analysis
7. Clustering Techniques
8. Conclusion and Future Recommendation

Abstract

In retail, basket analytics is a powerful tool for learning about consumers' shopping
preferences and patterns. In this report, we present a business analytics method that uses
basket-level sales data to extract customer visit groups. We apply data mining techniques such
as clustering and association rules to support a targeted marketing strategy. Our goal is to
develop a methodology that shows retailers how to segment their stores based on multiple data
sources and how to create a marketing strategy for each segment rather than relying on mass
marketing.

Business Understanding

Retail, by definition, is the exchange of products or services between a company and a
customer for the latter's personal use. Small-scale purchases of items are handled in retail
transactions, while bulk purchases of commodities are handled in wholesale transactions.
Everyday life depends heavily on retail establishments. Customers can find a wide range of
goods and services through retailers all over the world. For groceries, clothing, convenience
items, and other goods, retail stores are a one-stop shop. The retail industry is a crucial
component of the supply chain because it connects the manufacturer with the consumer.

What is Customer Segmentation?

Customer segmentation is the process of grouping consumers based on shared behaviour or
other characteristics. Each group should be homogeneous internally and clearly distinct from
the other groups. The overarching goal of this approach is to identify high-value customers:
those who are the most lucrative or who have the most potential for growth.

Significance of Customer Segmentation

Customer segmentation is one of the tools that can help a business to better understand its
target audience by grouping customers based on common characteristics. Therefore, it can
also aid businesses in tailoring marketing techniques to meet the specific needs of each
consumer group.
A medium- to large-sized retail establishment must invest both in attracting new customers
and in keeping existing ones. The 'best' or most valuable customers are often the main source
of income for a firm. Because a company's resources are constrained, it is essential to
identify and target these customers. It is equally important to find customers who are
inactive or at high risk of leaving, so that their issues can be addressed. Companies employ
customer segmentation for this reason.

Objectives

In this project we analysed a transaction data set of customers purchasing from a
supermarket over a period of 117 weeks. The objective of the study is to perform customer
segmentation based on insights into purchasing interest and consumer psychology, drawing on
the following analyses:

 Price sensitivity of customers
 Rush hour for the supermarket
 Average spend and basket size per hour
 Seasonality of purchases
 Demand forecasting

Methodology

In this study, the Cross-Industry Standard Process for Data Mining (CRISP-DM) has been
used to analyse the chosen dataset. The steps in the CRISP-DM approach are as follows:

1. Business Understanding: The main goal here is to obtain a basic grasp of the company's
aims and requirements. To identify the importance of data in the business model, the goals
are first examined in terms of what the customer wants to achieve. The project plan is then
created from the problem description and scenario. The development team is aware of this
stage of the process.

2. Data Understanding: This phase begins with acquiring the basic datasets, going through
the data, and examining it to discover what can be expected and done with it. It also
considers data consistency, completeness, and value distribution, in addition to data
governance compliance.

3. Data Preparation: This stage of the process entails converting raw data into usable form.
It is a time-consuming procedure since it includes all the actions undertaken to generate the
final dataset. Data transformation and cleansing are performed several times, as are
operations such as record, format, and attribute selection.

4. Modeling: At this stage, several candidate models are built and tested using different
modeling techniques. Because each technique has its own requirements for the type of data it
uses, it is often necessary to return to data preparation at some point in the CRISP-DM
lifecycle before a good enough model is found.

5. Evaluation: By this point, a model that appears to be of high quality has been built.
Prior to deployment, the model is thoroughly evaluated, and the work completed is verified to
ensure that the model meets the business objectives. At this step, the testing data is used
to check that the model properly represents reality.

6. Deployment: This is the sixth and last stage. It consists of presenting the final report
and its results in a helpful and intelligible manner so that the project can achieve its
objectives. It is the only step that is not part of a loop.

Data Understanding

 The dataset was obtained from dunnhumby.com.
 The dataset contains dummy transaction data from a store, spanning a period of
117 weeks (two years and a quarter).

Features of the dataset are as follows:

 shop_week - Identifies the week of the basket
 shop_date - Date when the shopping was done
 shop_weekday - Identifies the day of the week (1=Sunday, 2=Monday, …, 7=Saturday)
 shop_hour - Hour slot of the shopping
 quantity - Number of items of the same product bought in this basket
 spend - Spend associated with the items bought
 prod_code - Product Code
 prod_code_10 - Product Hierarchy Level 10 Code
 prod_code_20 - Product Hierarchy Level 20 Code
 prod_code_30 - Product Hierarchy Level 30 Code
 prod_code_40 - Product Hierarchy Level 40 Code
 cust_code - Customer Code
 cust_price_sensitivity - Customer price sensitivity (LA=Less Affluent, MM=Mid-Market,
UM=Up Market, XX=unclassified)
 cust_lifestage - Customer's Lifestage (YA=Young Adults, OA=Older Adults,
YF=Young Families, OF=Older Families, PE=Pensioners, OT=Other,
XX=unclassified)
 basket_id - All items in a basket share the same basket_id value
 basket_size - Basket size (L=Large, M=Medium, S=Small)
 basket_price_sensitivity - Basket price sensitivity
 basket_type - Basket type (Small Shop, Top Up, Full Shop, XX)
 basket_dominant_mission - Shopping dominant mission (Fresh, Grocery, Mixed,
Non-Food, XX)
 store_code - Store Code
 store_format - Format of the Store
 store_region - Region the store belongs to (E02, W01, E01, N03)

Data Preparation

 For data preparation, the dataset was loaded in Python.
 The dataset was checked for missing values.
 The missingno library was used to understand the presence and distribution of missing
data within the data frame. From the matrix obtained:
o The customer life stage attribute (cust_lifestage) had the most missing values.
o cust_code and cust_price_sensitivity follow the same missing-value pattern.
 Rows where the customer code was null were removed.
 The date-time format was changed and duplicate values were dropped from the data
before EDA was carried out.
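
The preparation steps above can be sketched in pandas as follows. This is a minimal illustration, not the code used in the project: the five-row data frame is invented, the lowercase column names follow the feature list, and the YYYYMMDD date format is an assumption. The per-column null counts are a numeric counterpart to what a missingno matrix shows visually.

```python
import pandas as pd

# Toy stand-in for the transaction file; values are invented for illustration.
raw = pd.DataFrame({
    "shop_date": ["20060413", "20060413", "20060414", "20060414", "20060415"],
    "cust_code": ["CUST01", "CUST01", None, "CUST02", "CUST03"],
    "cust_price_sensitivity": ["LA", "LA", None, "MM", "UM"],
    "cust_lifestage": ["YF", "YF", None, None, "PE"],
    "spend": [1.50, 1.50, 0.80, 3.20, 2.10],
})

# Count missing values per column.
missing_per_column = raw.isna().sum()

# Drop rows without a customer code, parse the date (format assumed),
# then drop exact duplicate rows.
clean = raw.dropna(subset=["cust_code"]).copy()
clean["shop_date"] = pd.to_datetime(clean["shop_date"], format="%Y%m%d")
clean = clean.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(clean)
```

Restricting dropna() to the cust_code column keeps rows whose life stage is unknown, which matters here because that attribute has the most missing values.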

Exploratory Data Analysis

 The dataset consists of 14999 records, each with 22 columns. We have 7 columns of
numerical data and 15 descriptive columns.

The dropna() function was used to remove the rows that contain NULL values.

 This graph shows the distribution of customers by hour.
 Most customers come to the supermarket around 12 PM, 2 PM and 9 PM.
 The customers who shop at 2 PM and 9 PM are therefore the rush-hour customers. The
remaining are general-hour customers.

 This graph shows the relationship between average quantity and the hours the market is
open.
 The customers who come at 8:00 AM and 12:00 PM purchase larger quantities of goods.
Customers who come during other hours buy less in comparison.
 Therefore, we can say that the customers who buy a lot of items in the early morning
and late evening are the customers with high-level consumption. The remaining are
customers with general-level consumption.
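
The hourly figures discussed above (visits per hour and average quantity per hour) could be derived along the following lines. This is a sketch under assumptions: the rows are invented, and line items are first collapsed to one row per basket so that a basket with many products is counted as a single visit. The report does not state whether its graphs count baskets or line items.

```python
import pandas as pd

# Invented line-item rows: one row per product in a basket.
tx = pd.DataFrame({
    "basket_id": [1, 1, 2, 3, 3, 4, 5, 6],
    "shop_hour": [12, 12, 12, 14, 14, 14, 21, 9],
    "quantity":  [2, 1, 1, 3, 1, 2, 1, 1],
    "spend":     [3.0, 1.5, 0.9, 4.2, 1.1, 2.4, 0.8, 1.0],
})

# Collapse line items to one row per basket before counting visits.
baskets = tx.groupby("basket_id").agg(
    shop_hour=("shop_hour", "first"),
    items=("quantity", "sum"),
    spend=("spend", "sum"),
)

# Visits, average items and average spend for each hour slot.
hourly = baskets.groupby("shop_hour").agg(
    visits=("spend", "size"),
    avg_items=("items", "mean"),
    avg_spend=("spend", "mean"),
)

print(hourly.sort_values("visits", ascending=False))
```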

 This graph shows the relationship between basket size and the hours the market is
open.
 Customers with large baskets tend to shop around 3 PM, and customers with small and
medium baskets around 2:30 PM.
 No clear relationship is visible between basket size and shopping hour, so we cannot
segment customers on this basis.

 This graph shows the relationship between price sensitivity and the amount customers
spend in the shop.
 Unclassified customers have the highest spending, followed by mid-market customers;
however, unclassified customers account for the fewest transactions.

 The graph above shows the store regions S02, N01, N03 and W02, where customer traffic
is concentrated; S02 has the most.

 It can also be inferred from the graph below that purchasing between December and
February is about twice that of other periods, coinciding with Christmas and Chinese
New Year. Purchasing from April to June (early summer) also appears higher than in
other seasons, which is close to graduation time.

 Customer flow reaches its peak around midday (13-14) and in the evening (21).
 This is probably because these periods fall outside work and school hours, so
customers are more likely to go shopping then.

Relationship between customer flow and turnover

 Turnover rises with the number of customers, and the relationship is almost linear.
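
One way to put a number on "almost linear" is the Pearson correlation between customer count and turnover per period. The sketch below uses invented hourly totals; the column names customers and turnover are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Invented hourly totals; in practice these would be aggregated from the
# cleaned transactions.
flow = pd.DataFrame({
    "customers": [5, 12, 20, 31, 18, 9],
    "turnover":  [14.0, 30.5, 52.0, 80.0, 47.5, 22.0],
})

# Pearson correlation: values close to 1 indicate a near-linear,
# positive relationship.
r = flow["customers"].corr(flow["turnover"])
print(round(r, 3))
```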

 The least favourite shopping day differs by price-sensitivity group:
o For less affluent families, Saturday is the least favourite day for shopping.
o For mid-market families, Sunday is the least favourite day for shopping.
o For up market families, Tuesday is the least favourite day for shopping.
o For other families, Wednesday is the least favourite day for shopping.

K-Means
 From the scree plot, the number of clusters for k-means was chosen as 3, using two
features, “SPEND” and “QUANTITY”.
 Of the three clusters, cluster 3 showed the highest average spend and quantity bought,
indicating that this is the premium customer segment, which has buying power and can be
pitched more luxury items.
 Clusters 1 and 2 are the ones that need to be targeted with suitable discount offers
and schemes.
 The silhouette coefficient was 0.745, indicating that the clusters overlap very little
and suggesting that three is a reasonable number of clusters.
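
The procedure described above (inspect inertia across candidate values of k, fit k = 3, then compute the silhouette coefficient) can be sketched with scikit-learn. The points below are synthetic stand-ins for the spend and quantity features, so the printed score will not match 0.745. Standardising the features is an assumption; the report does not say whether the features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic (spend, quantity) points in three loose groups; not the report's data.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 1.0], [10.0, 5.0], [30.0, 15.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Assumption: scale both features so neither dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# Inertia for k = 1..6 is the quantity a scree (elbow) plot displays.
inertia = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
}

# Fit the chosen k and score how well separated the clusters are.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
score = silhouette_score(X_scaled, model.labels_)

print(inertia)
print(round(score, 3))
```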

 Cluster Formation through K-means

 Of the three clusters, the third has the highest average spend and quantity, followed
by cluster 1 and cluster 2. Cluster 3 therefore represents customers who belong to the
premium category and are affluent enough to make large purchases.

Recommendations
The retail store should curate discount offers and schemes to target customers in
cluster 1, who are potential buyers.

DBSCAN

The estimated number of clusters is 3, the same as identified through K-means clustering.
However, the silhouette coefficient and the CH index are both lower than for K-means, which
indicates that the DBSCAN clusters are less dense and less well separated.

Validation Parameter        K-Means    DBSCAN

Silhouette Coefficient      0.745      0.455

Calinski-Harabasz Index     7070       1715
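
A comparison of this kind can be reproduced with scikit-learn's silhouette_score and calinski_harabasz_score. The sketch below runs on synthetic points, so its numbers will differ from the table. The eps and min_samples values are illustrative guesses rather than the project's settings, and excluding DBSCAN noise points (label -1) before scoring is an assumed design choice.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic 2-D points in three groups; not the report's data.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(60, 2)) for c in centers])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# eps and min_samples are illustrative guesses, not the report's settings.
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1; they do not form a cluster.
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)


def validate(points, labels):
    """Return (silhouette, Calinski-Harabasz) computed on non-noise points."""
    keep = labels != -1
    return (
        silhouette_score(points[keep], labels[keep]),
        calinski_harabasz_score(points[keep], labels[keep]),
    )


km_sil, km_ch = validate(X, km_labels)
db_sil, db_ch = validate(X, db_labels)
print(f"K-Means: silhouette={km_sil:.3f}, CH={km_ch:.0f}")
print(f"DBSCAN : clusters={n_db_clusters}, silhouette={db_sil:.3f}, CH={db_ch:.0f}")
```

For both metrics, higher values indicate denser and better separated clusters, which is the basis for preferring K-means in the table.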

Conclusion

Based on our analysis, we identified three types of customers: new customers, lost
customers and best customers.

Cluster  Type of Customer  Definition                              Recommended Action

0        New Customers     Customers who have made recent          The company should try to enhance their
                           purchases, purchase infrequently and    purchasing experience by providing good
                           spend little money.                     quality products and services.

1        Lost Customers    Customers who make the fewest           Understand the reason why the customers
                           transactions and spend the least        are lost so that we can strategize
                           money.                                  accordingly.

2        Best Customers    Most frequent customers with the        Potential target for new products
                           highest spending.                       released by the company. Heavy discounts
                                                                   are not required.
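
The definitions in this table refer to how recently, how often and how much each customer buys. The report does not state which features produced these segments, so the sketch below assumes a recency, frequency and monetary profile per customer, built from invented rows. A profile of this kind could then be passed to a clustering step.

```python
import pandas as pd

# Invented transactions; column names follow the feature list.
tx = pd.DataFrame({
    "cust_code": ["A", "A", "A", "B", "C", "C"],
    "basket_id": [1, 2, 3, 4, 5, 6],
    "shop_date": pd.to_datetime([
        "2007-01-05", "2007-03-10", "2007-06-20",
        "2006-05-01", "2007-06-01", "2007-06-25",
    ]),
    "spend": [20.0, 35.0, 40.0, 5.0, 3.0, 4.0],
})

# Recency is measured back from the latest date in the data.
snapshot = tx["shop_date"].max()

profile = tx.groupby("cust_code").agg(
    recency_days=("shop_date", lambda d: (snapshot - d.max()).days),
    frequency=("basket_id", "nunique"),
    monetary=("spend", "sum"),
)

print(profile)
```

In this toy data, customer A resembles a best customer (frequent, high spend), B a lost customer (one old, small purchase) and C a new customer (recent, low spend).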

Future Recommendation:
 Analyse the time variable and shop week: compute the number of days since the first
transaction by each customer. This will tell us how long each customer has been with
the store.
 Offer a loyalty bonus to long-term engaged customers (with high tenure) so that they
do not leave the service.
 Fresh food basket and mid-market customer segments should be targeted.
 Conduct deeper segmentation of customers based on their geographical location and
demographic and psychographic factors.
 Incorporate data from the Google Analytics account of the business: Google
Analytics is a useful resource for tracking many important business metrics such as
Customer Lifetime Value, traffic source/medium, page views per visit, and the bounce
rate of the company's website.
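
The tenure measure proposed in the first recommendation (days since each customer's first transaction) could be computed as sketched below. The rows are invented, and using the latest date in the data as the reference date is an assumption.

```python
import pandas as pd

# Invented transactions; only the customer code and date are needed.
tx = pd.DataFrame({
    "cust_code": ["A", "A", "B", "B", "C"],
    "shop_date": pd.to_datetime([
        "2006-04-10", "2008-06-30", "2007-01-15", "2007-02-14", "2008-06-01",
    ]),
})

# First purchase per customer, measured against the last date in the data.
first_seen = tx.groupby("cust_code")["shop_date"].min()
reference_date = tx["shop_date"].max()
tenure_days = (reference_date - first_seen).dt.days

print(tenure_days.sort_values(ascending=False))
```

Customers at the top of this ranking would be the candidates for the loyalty bonus.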
