DSML - Project Report - Group 3

This document presents an analysis of customer transaction data from a supermarket over 117 weeks. The objectives are to segment customers by purchasing behavior and to create targeted marketing strategies. The CRISP-DM methodology is followed, covering data understanding, preparation, exploratory analysis, clustering techniques and conclusions. A key finding is that the customer life-stage data is the most incomplete, while store, date and product information are complete.

Uploaded by

Deepak Bhatt


IIM Kashipur

Master of Business Administration

Master of Business Administration (Analytics)

Data Science and Machine Learning, Term-IV

Year 2022-2023

Submitted To: Prof. Venkataraghavan Krishnaswamy

Submitted By: Group 3

Anika Garg MBAA21006

Anurag Shelar MBAA21009

Avijit Kirttania MBAA21015

Ayushi Sharma MBAA21017

Paresh Kumar Jaiswal MBAA21023

Yukti Dewan MBAA21038

Table of Contents

Abstract
1. Business Understanding
2. Objective
3. Methodology
4. Data Understanding
5. Data Preparation
6. Exploratory Data Analysis
7. Clustering Techniques
8. Conclusion and Future Recommendation

Abstract

In retail, basket analytics is a potent tool for learning about consumer shopping
preferences and patterns. In this report, we present a business analytics method that uses
basket-level sales data to extract customer visit groups. We use data mining techniques such
as clustering and association rules to support a targeted marketing strategy. Our goal is to
develop a methodology that shows retailers how to segment their stores based on multiple data
sources and how to create a marketing strategy for each segment rather than relying on mass
marketing.

Business Understanding

Retail, by definition, is the exchange of products or services between a company and a
customer for the latter's personal use. Small-scale purchases of items are handled in retail
transactions, while bulk purchases of commodities are handled in wholesale transactions.
Everyday life today depends heavily on retail establishments. Customers can find
a wide range of goods and services through retailers all over the world. For groceries,
clothing, convenience items, and other goods, retail stores are a one-stop shop. The retail
industry is a crucial component of the supply chain since it connects the manufacturer with the
consumer.

What is Customer Segmentation?

Customer segmentation is the process of grouping consumers based on shared behaviour or
other characteristics. Customers inside each group should be as similar as possible, while the
groups themselves should differ clearly from one another. The overarching goal of this approach
is to discover high-value clients, that is, those who have the most potential for growth or are
the most lucrative.

Significance of Customer Segmentation

Customer segmentation is one of the tools that can help a business to better understand its
target audience by grouping customers based on common characteristics. Therefore, it can
also aid businesses in tailoring marketing techniques to meet the specific needs of each
consumer group.
A medium- to large-sized retail establishment must make investments in both
attracting new clients and keeping existing ones. The 'best' or most valuable clients are often
the main source of income for firms. It is essential to identify and target these clients because
a company's resources are constrained. Finding consumers who are inactive or at high risk of
leaving is equally crucial to address their issues. Companies employ the strategy of consumer
segmentation for this reason.

Objectives

In this project we analysed the transaction data set of customers purchasing from a
supermarket over a period of 117 weeks. The objective of the study is to perform customer
segmentation based on insights about purchasing interest and consumer psychology, drawing on
the inferences below:

 Price sensitivity of customers

 What is the rush hour for the supermarket?
 Average spends and basket size per hour
 Seasonality of purchase
 Demand forecasting

Methodology

In this study, the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology
has been used for the analysis of the chosen dataset. The steps in the CRISP-DM approach
are as follows:

1. Business Understanding: The main goal here is to obtain a basic grasp of the company's
aims and requirements. To identify the importance of data in the business model, first
examine the goals in terms of what the customer wants to achieve. The project plan is then
created from the problem description and scenario, so that the development team shares a
common understanding of this stage of the process.

2. Data Understanding: This phase begins with acquiring the basic datasets, going through
the data, and examining it to discover what can be expected from it and done with it. It
also considers data consistency, completeness, and value distribution, in addition to data
governance compliance.

3. Data Preparation: This stage of the process entails converting raw data into usable form.
It is a time-consuming procedure since it includes all the actions undertaken to generate the
final dataset. Data transformation and cleansing are performed several times, as are
operations such as record, format, and attribute selection.

4. Modeling: At this stage, several candidate models are built and tested using different
modeling techniques. To find a good enough model within the CRISP-DM lifecycle, one often
has to return to data preparation, because each technique has different requirements for the
type of data it can use.

5. Evaluation: By this point, a model that appears to be of good quality has been built.
Prior to deployment, the model is thoroughly evaluated, and the work completed is
reviewed to ensure that the model meets our business objectives. At this step, the testing data
is used to check that the model properly represents reality.

6. Deployment: This is the sixth and last stage. It consists of presenting the final report
and its results in a helpful and intelligible manner, so that the project can
achieve its objectives. It is the only step that is not part of a loop.

Data Understanding

 The dataset was obtained from dunnhumby.com

 The dataset includes dummy data of transactions at a store, spanning over a period of
117 weeks (two years & a quarter)

Features of the dataset are as follows:

 shop_week - Identifies week of the basket
 shop_date - Date when shopping has been done
 shop_weekday - Identifies the day of the week (1=Sunday, 2=Monday, …, 7=Saturday)
 shop_hour - Hour slot of the shopping
 Quantity - Number of items of the same product bought in this basket
 spend - Spend associated to the items bought
 prod_code - Product Code
 prod_code_10 - Product Hierarchy Level 10 Code
 prod_code_20 - Product Hierarchy Level 20 Code
 prod_code_30 - Product Hierarchy Level 30 Code
 prod_code_40 - Product Hierarchy Level 40 Code
 cust_code - Customer Code
 cust_price_sensitivity - Customer price sensitivity (LA=Less Affluent, MM=Mid-Market,
UM=Up Market, XX=unclassified)
 cust_Lifestage - Customer’s Lifestage (YA=Young Adults, OA=Older Adults,
YF=Young Families, OF=Older Families, PE=Pensioners, OT=Other,
XX=unclassified)
 basket_Id - All items in a basket share the same basket_id value
 basket_size - Basket size (L=Large, M=Medium, S=Small)
 basket_price_sensitivity - Basket price sensitivity
 basket_type - Basket type (Small Shop, Top Up, Full Shop, XX)
 basket_dominant_mission - Shopping dominant mission (Fresh, Grocery, Mixed,
Non-Food, XX)
 store_Code - Store Code
 store_format - Format of the Store
 store_region - Region the store belongs to (E02, W01, E01, N03)
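
The coded attributes above are easier to read in charts once they are mapped to their labels. The following is a minimal sketch of that decoding step, assuming the data sits in a pandas DataFrame; the sample rows are made up for illustration, and only the column names and code meanings are taken from the feature list.

```python
import pandas as pd

# Made-up sample rows; only column names and code meanings come from the feature list.
df = pd.DataFrame({
    "cust_price_sensitivity": ["LA", "MM", "UM", "XX"],
    "cust_Lifestage": ["YA", "PE", "OF", "XX"],
    "shop_weekday": [1, 2, 7, 4],
})

PRICE_SENSITIVITY = {
    "LA": "Less Affluent", "MM": "Mid-Market",
    "UM": "Up Market", "XX": "Unclassified",
}
LIFESTAGE = {
    "YA": "Young Adults", "OA": "Older Adults", "YF": "Young Families",
    "OF": "Older Families", "PE": "Pensioners", "OT": "Other",
    "XX": "Unclassified",
}
# 1=Sunday ... 7=Saturday, as stated in the feature list.
WEEKDAY = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
           5: "Thursday", 6: "Friday", 7: "Saturday"}

# Add human-readable label columns next to the raw codes.
df["price_sensitivity_label"] = df["cust_price_sensitivity"].map(PRICE_SENSITIVITY)
df["lifestage_label"] = df["cust_Lifestage"].map(LIFESTAGE)
df["weekday_label"] = df["shop_weekday"].map(WEEKDAY)

print(df[["price_sensitivity_label", "lifestage_label", "weekday_label"]])
```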

Data Preparation

 For data preparation, the dataset was loaded in Python.

 The dataset was checked for missing values.
 The Missingno library was used to understand the presence and distribution of missing
data within the data frame. From the matrix obtained:
o It was found that the Customer Life stage attribute had the most missing
values.
o cust_code and cust_price_sensitivity follow the same missingness pattern.
o Rows were removed where the customer code was null.
o The date-time format was changed, duplicate values were dropped from the data,
and EDA was carried out.
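
The preparation steps listed above could look roughly like the sketch below. It uses plain pandas on a few made-up rows rather than the real file, and it assumes that shop_date is stored as a YYYYMMDD integer, which the report does not state. The Missingno matrix itself is not reproduced here; a per-column count of missing values stands in for it.

```python
import pandas as pd

# Made-up rows; the YYYYMMDD integer date format is an assumption.
df = pd.DataFrame({
    "shop_date": [20060413, 20060413, 20060414, 20060415],
    "cust_code": ["CUST001", "CUST001", None, "CUST002"],
    "cust_price_sensitivity": ["LA", "LA", None, "MM"],
    "cust_Lifestage": ["YA", "YA", None, None],
    "spend": [1.5, 1.5, 0.9, 3.2],
})

# Count missing values per column (a numeric stand-in for the Missingno matrix).
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop rows where the customer code is null.
clean = df.dropna(subset=["cust_code"])

# Convert the date column to a proper datetime type.
clean = clean.assign(
    shop_date=pd.to_datetime(clean["shop_date"].astype(str), format="%Y%m%d")
)

# Drop exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)
print(clean)
```

Dropping only on cust_code, rather than calling dropna() on every column, keeps rows whose life stage is unknown; which of the two the report applied is not fully clear from the text.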

Exploratory Data Analysis

 The dataset consists of 14999 records, each with 22 columns: 7 columns of
numerical data and 15 descriptive columns.

The dropna() function is used to remove the rows that contain NULL values.

 This graph shows the distribution of customers by hour of the day.
 We can interpret that most customers come to the supermarket around 12 PM and 2 PM,
and then around 9 PM.
 The customers who shop at 2 PM and 9 PM are therefore the rush-hour customers. The
remaining are general-hour customers.
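
A count like the one behind this graph can be sketched as follows. The rows are made up, and treating each distinct basket_Id as one customer visit is an assumption of this sketch, not something the report specifies.

```python
import pandas as pd

# Made-up item-level rows: one row per product line in a basket.
df = pd.DataFrame({
    "basket_Id": [1, 1, 2, 3, 4, 5, 6],
    "shop_hour": [12, 12, 12, 14, 14, 14, 21],
})

# Count distinct baskets (visits) per hour slot.
baskets_per_hour = df.groupby("shop_hour")["basket_Id"].nunique().sort_index()
print(baskets_per_hour)

# The busiest hour slots are candidates for "rush hour".
top_hours = baskets_per_hour.nlargest(2).index.tolist()
print("Busiest hours:", top_hours)
```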

 This graph shows the relationship between average quantity and the hours the market is
open.
 The customers who come at 8:00 AM and 12:00 PM purchase larger quantities of
goods. The customers who come during other hours buy comparatively less.
 Therefore, we can say that the customers who buy a lot of items in the early morning
and late evening are the customers with high-level consumption. The remaining are
customers with general-level consumption.
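
One way to obtain an average quantity (and spend) per hour is to total each basket first and then average the basket totals within each hour slot. This is a sketch on made-up rows; the report does not say whether it averaged per basket or per item line, so the per-basket choice here is an assumption.

```python
import pandas as pd

# Made-up item-level rows.
df = pd.DataFrame({
    "basket_Id": [1, 1, 2, 3],
    "shop_hour": [8, 8, 8, 15],
    "Quantity": [3, 2, 1, 2],
    "spend": [4.0, 1.0, 2.0, 1.0],
})

# Step 1: total quantity and spend for each basket.
basket_totals = df.groupby(["basket_Id", "shop_hour"], as_index=False)[
    ["Quantity", "spend"]
].sum()

# Step 2: average those basket totals within each hour slot.
hourly_avg = basket_totals.groupby("shop_hour")[["Quantity", "spend"]].mean()
print(hourly_avg)
```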

 This graph shows the relationship between basket size and the hours the market is
open.
 The customers with large baskets shop at around 3 PM, and the customers with
small and medium baskets shop at around 2:30 PM.
 There is no clear relationship between basket size and shopping hour, so we cannot
divide customers on this basis.

 This graph shows the relationship between price sensitivity and the amount spent by
customers in the shop.
 Unclassified customers have the highest spending, followed by mid-market customers;
however, we need to consider the fact that unclassified customers account for the fewest
transactions.
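
The caveat in the last point can be checked by looking at average spend and the number of rows per segment side by side. The sketch below uses made-up values chosen only to show the mechanics; it does not reproduce the report's figures.

```python
import pandas as pd

# Made-up rows: XX has a high spend but very few records.
df = pd.DataFrame({
    "cust_price_sensitivity": ["LA", "LA", "LA", "MM", "MM", "XX"],
    "spend": [1.0, 2.0, 3.0, 2.0, 4.0, 10.0],
})

# Average spend and number of rows for each price-sensitivity segment.
summary = df.groupby("cust_price_sensitivity")["spend"].agg(
    avg_spend="mean", n_rows="count"
)
# Share of all rows, to show how thin a segment is.
summary["share_of_rows"] = summary["n_rows"] / summary["n_rows"].sum()
print(summary)
```

A segment with a high average but a small share of rows rests on little data, which is why the average alone can mislead.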

 The graph above shows, by area, the stores S02, N01, N03 and W02 where the most customers
gather; S02 has the most.

 It can also be inferred from the graph below that purchasing between December and
February is about twice that of other periods, which is when people celebrate
Christmas and Chinese New Year. Purchasing from April to June (early summer) also
appears higher than in the remaining seasons, which is close to graduation time.
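
Seasonality of this kind can be examined by totalling spend per calendar month. The following sketch uses made-up dates and amounts and again assumes a YYYYMMDD integer date; the ratio it prints is a property of the sample rows only, not of the report's data.

```python
import pandas as pd

# Made-up rows; dates assumed to be stored as YYYYMMDD integers.
df = pd.DataFrame({
    "shop_date": [20061215, 20070110, 20070220, 20070415, 20070720],
    "spend": [9.0, 9.0, 9.0, 6.0, 6.0],
})

dates = pd.to_datetime(df["shop_date"].astype(str), format="%Y%m%d")

# Total spend per calendar month (1-12).
monthly_spend = df.groupby(dates.dt.month)["spend"].sum()
print(monthly_spend)

# Compare December-February against the remaining months.
winter_months = [12, 1, 2]
winter_avg = monthly_spend.loc[winter_months].mean()
other_avg = monthly_spend.drop(winter_months).mean()
print("Winter vs other months ratio:", winter_avg / other_avg)
```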

 Around noon (13-14) and in the evening (21), the customer flow reaches its peak.
 This is probably because these periods are off-work and off-school time, so customers
are more likely to go shopping then.

Relationship between customer flow and turnover

 The more customers there are, the higher the turnover, and the relationship is almost linear.
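
A simple way to quantify "almost linear" is the Pearson correlation between customer flow and turnover per hour slot. This is a sketch on made-up rows; measuring flow as distinct customers per hour and turnover as total spend per hour are assumptions about how the report defined the two quantities.

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame({
    "shop_hour": [9, 13, 13, 13, 21, 21],
    "cust_code": ["A", "A", "B", "C", "A", "B"],
    "spend": [2.0, 2.0, 2.0, 2.0, 2.0, 2.5],
})

# Customer flow and turnover for each hour slot.
hourly = df.groupby("shop_hour").agg(
    customers=("cust_code", "nunique"),
    turnover=("spend", "sum"),
)
print(hourly)

# Pearson correlation; values near 1 indicate a near-linear relationship.
r = hourly["customers"].corr(hourly["turnover"])
print("Pearson r:", round(r, 3))
```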

 The least favourite shopping day differs by segment: Saturday for less affluent
families, Sunday for mid-market families, Tuesday for up-market families, and
Wednesday for other families.

K-Means
 From scree plot analysis, the number of clusters for k-means was chosen as 3, using
two features, namely “SPEND” and “QUANTITY”.
 Out of the three clusters, customers in cluster 3 showed the highest average spend
and quantity bought, suggesting that this is the premium customer segment, which has
the buying power and can be pitched more luxury items.
 Clusters 1 and 2 are the ones that need to be targeted with proper discount
offers and schemes.
 The silhouette coefficient was found to be 0.745, indicating that the clusters are
well separated with little overlap, which supports the choice of three clusters.
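
The selection procedure described above can be sketched with scikit-learn as follows. The data is synthetic (three made-up groups of spend and quantity values), and standardising the two features before clustering is an assumption of this sketch; the report does not say whether the features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic (spend, quantity) points around three made-up centres.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 2.0], [20.0, 8.0], [60.0, 25.0]])
X = np.vstack([c + rng.normal(scale=[1.0, 0.5], size=(50, 2)) for c in centers])

# Put both features on the same scale (an assumption, see lead-in).
X_scaled = StandardScaler().fit_transform(X)

# Scree/elbow data: within-cluster sum of squares for k = 1..6.
inertia = {}
for k in range(1, 7):
    inertia[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
print(inertia)

# Fit the chosen model and measure cluster separation.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
score = silhouette_score(X_scaled, model.labels_)
print("Silhouette coefficient:", round(score, 3))

# Mean spend and quantity per cluster, in original units.
means = np.array([X[model.labels_ == c].mean(axis=0) for c in range(3)])
premium_cluster = int(means[:, 0].argmax())
print("Cluster with highest average spend:", premium_cluster)
```

Cluster indices returned by KMeans are arbitrary, so the premium segment is identified by its mean spend rather than by its label number.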

 Cluster Formation through K-means

 Out of the three clusters, the third cluster was found to have the maximum average spend
and quantity, followed by cluster 1 and cluster 2. Therefore, cluster 3 represents people
who belong to the premium category and are affluent enough to make large purchases.

Recommendations
The retail store should work on curating discount offers and schemes to target customers in
cluster 1, who can be potential buyers.

DB-SCAN

The estimated number of clusters is 3, similar to what was identified through K-means
clustering. However, since both the Silhouette coefficient and the CH index are lower than
for K-means, the DB-SCAN clusters are less dense and less well separated than the K-means
clusters.

Validation Parameters       K-Means    DB Scan
Silhouette Coefficient      0.745      0.455
Calinski-Harabasz Index     7070       1715
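
A comparison of this kind can be sketched with scikit-learn as shown below. The data is synthetic, the eps and min_samples values are illustrative rather than the report's settings, and excluding DBSCAN noise points (label -1) before scoring is a choice made for this sketch. The printed scores therefore will not match the table above.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)


def validation_scores(X, labels):
    """Return (silhouette, Calinski-Harabasz), ignoring noise points."""
    keep = labels != -1  # DBSCAN marks noise as -1; K-Means never does
    return (
        silhouette_score(X[keep], labels[keep]),
        calinski_harabasz_score(X[keep], labels[keep]),
    )


kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# Number of clusters DBSCAN found, not counting noise.
n_dbscan_clusters = len(set(dbscan_labels.tolist()) - {-1})
print("DBSCAN clusters:", n_dbscan_clusters)

scores = {
    "K-Means": validation_scores(X, kmeans_labels),
    "DBSCAN": validation_scores(X, dbscan_labels),
}
for name, (sil, ch) in scores.items():
    print(f"{name}: silhouette={sil:.3f}, CH index={ch:.0f}")
```

For both metrics, higher values indicate more compact and better separated clusters.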

Conclusion

Based on our analysis, we identified three types of customers: new customers, lost
customers and best customers.

Cluster 0 - Type of Customer: New Customers
Definition: Customers who have made recent purchases, purchase infrequently and
spend little money.
Recommended Action: The company should try to enhance their purchasing experience by
providing good quality products and services.

Cluster 1 - Type of Customer: Lost Customers
Definition: Customers who make the fewest transactions and spend the least money.
Recommended Action: Understand the reason why the customers are lost so that we can
strategize accordingly.

Cluster 2 - Type of Customer: Best Customers
Definition: Most frequent customers with the highest spending.
Recommended Action: Potential target for new products released by the company. Heavy
discount not required.

Future Recommendation:
 Analysis of the time variable and shop week: compute the number of days since the first
transaction by each customer. This will tell us how long each customer has been with
the system.
 Offer a loyalty bonus to long-term engaged customers (with high tenure) to make sure
they don’t leave the service.
 Fresh food basket and mid-market customer segments should be targeted.
 Conduct deeper segmentation of customers based on their geographical location
and demographic and psychographic factors.
 Incorporate data from the Google Analytics account of the business: Google
Analytics is a useful resource for tracking many important business metrics such as
Customer Lifetime Value, traffic source/medium, page views per visit, and bounce rate
of the company’s website.
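
The tenure measure proposed in the first recommendation could be computed along the lines below. This is a sketch on made-up rows: the YYYYMMDD date format and the use of the latest date in the data as the reference point are both assumptions.

```python
import pandas as pd

# Made-up rows; dates assumed to be stored as YYYYMMDD integers.
df = pd.DataFrame({
    "cust_code": ["A", "A", "B", "B"],
    "shop_date": [20060101, 20060201, 20060120, 20060131],
})
df["shop_date"] = pd.to_datetime(df["shop_date"].astype(str), format="%Y%m%d")

# First transaction date of each customer.
first_purchase = df.groupby("cust_code")["shop_date"].min()

# Reference point: the most recent date in the data (an assumption).
reference_date = df["shop_date"].max()

# Tenure in days since each customer's first transaction.
tenure_days = (reference_date - first_purchase).dt.days
print(tenure_days)
```

Customers with high tenure_days would be the candidates for the loyalty bonus mentioned above.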
