DSML - Project Report - Group 3
Table of Contents
Abstract
1. Business Understanding
2. Objective
3. Methodology
4. Data Understanding
5. Data Preparation
6. Exploratory Data Analysis
7. Clustering Techniques
8. Conclusion and Future Recommendation
Abstract
In retail, basket analytics is a potent tool for learning about consumer shopping
preferences and patterns. In this report, we present a business analytics method that uses
basket-level sales data to extract customer visit groups. We use data mining techniques such
as clustering and association rules to support a targeted marketing strategy. Our goal is to
develop a methodology that shows retailers how to segment their stores based on multiple data
sources and how to create marketing strategies for each segment rather than rely on mass
marketing.
Business Understanding
Customer segmentation is one of the tools that can help a business to better understand its
target audience by grouping customers based on common characteristics. Therefore, it can
also aid businesses in tailoring marketing techniques to meet the specific needs of each
consumer group.
A medium- to large-sized retail establishment must make investments in both
attracting new clients and keeping existing ones. The 'best' or most valuable clients are often
the main source of income for firms. It is essential to identify and target these clients because
a company's resources are constrained. Finding consumers who are inactive or at high risk of
leaving is equally crucial to address their issues. Companies employ the strategy of consumer
segmentation for this reason.
Objectives
In this project we analysed a transactions data set of customers purchasing from a
supermarket over a period of 117 weeks. The objective of the study is to perform customer
segmentation based on insights about purchasing interest and consumer psychology.
Methodology
In this study, the Cross-Industry Standard Process for Data Mining (CRISP-DM) has been
used to analyse the chosen dataset. The steps in the CRISP-DM approach are as follows:
1. Business Understanding: The main goal here is to obtain a basic grasp of the company's
aims and requirements. To identify the importance of data in the business model, first
examine the goals in terms of what the customer wants to achieve. The project plan is then
created using the problem description and scenario. The development team is aware of this
stage of the process.
2. Data Understanding: This phase begins with acquiring the basic datasets, going through
the data, and examining the data to discover what can be expected and done from the data. It
also considers data consistency, completeness, and value distribution, in addition to data
governance compliance.
3. Data Preparation: This stage of the process entails converting raw data into usable form.
It is a time-consuming procedure since it includes all the actions undertaken for the final
dataset generation. Data transformation and cleansing are performed several times, as are
operations such as record, format, and attribute selection.
4. Modeling: At this stage, several candidate models are built and tested using different
modeling techniques. Because each technique has its own requirements for the form of the
input data, finding a good enough model often means returning to data preparation at some
point in the CRISP-DM lifecycle.
5. Evaluation: By this point, a model that appears to be of high quality has been built.
Prior to deployment, the model is thoroughly evaluated, and the work completed is
verified to ensure that the model meets our business objectives. At this step, the testing data
is used to check that the model properly represents reality.
6. Deployment: This is the sixth and last stage, which consists of presenting the final report
and its results in a helpful and intelligible manner so that the project can achieve its
objectives. It is the only step that is not part of a loop.
Data Understanding
prod_code_20 - Product Hierarchy Level 20 Code
prod_code_30 - Product Hierarchy Level 30 Code
prod_code_40 - Product Hierarchy Level 40 Code
cust_code - Customer Code
cust_price_sensitivity - Customer price sensitivity (LA=Less Affluent,
MM=Mid-Market, UM=Up Market, XX=unclassified)
cust_Lifestage - Customer’s Lifestage (YA=Young Adults, OA=Older Adults,
YF=Young Families, OF=Older Families, PE=Pensioners, OT=Other,
XX=unclassified)
basket_Id - All items in a basket share the same basket_id value
basket_size - Basket size (L=Large, M=Medium, S=Small)
basket_price_sensitivity - Basket price sensitivity
basket_type - Basket type (Small Shop, Top Up, Full Shop, XX)
basket_dominant_mission - Shopping dominant mission (Fresh, Grocery, Mixed,
Non-Food, XX)
store_Code - Store Code
store_format - Format of the Store
store_region - Region the store belongs to (E02, W01, E01, N03)
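The coded fields in the list above can be turned into readable labels before analysis. The following is a minimal sketch, assuming the data is held in a pandas DataFrame whose columns use the names listed above; the three sample rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary above.
PRICE_SENSITIVITY = {"LA": "Less Affluent", "MM": "Mid-Market",
                     "UM": "Up Market", "XX": "Unclassified"}
LIFESTAGE = {"YA": "Young Adults", "OA": "Older Adults", "YF": "Young Families",
             "OF": "Older Families", "PE": "Pensioners", "OT": "Other",
             "XX": "Unclassified"}
BASKET_SIZE = {"L": "Large", "M": "Medium", "S": "Small"}

# Hypothetical sample rows; real values would come from the transactions file.
sample = pd.DataFrame({
    "cust_price_sensitivity": ["LA", "UM", "XX"],
    "cust_Lifestage": ["YF", "PE", "XX"],
    "basket_size": ["L", "S", "M"],
})

# Replace each code with its label; codes missing from a mapping become NaN.
decoded = sample.assign(
    cust_price_sensitivity=sample["cust_price_sensitivity"].map(PRICE_SENSITIVITY),
    cust_Lifestage=sample["cust_Lifestage"].map(LIFESTAGE),
    basket_size=sample["basket_size"].map(BASKET_SIZE),
)
print(decoded)
```

Keeping the mappings in plain dictionaries makes it easy to check them against the data dictionary and to spot codes that were not documented.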
Data Preparation
Exploratory Data Analysis
The dataset consists of 14,999 records, each with 22 columns: 7 columns of
numerical data and 15 descriptive columns.
The dropna() function is used to remove the rows that contain NULL values.
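As an illustration of the cleaning step, the sketch below applies pandas' dropna() to a small invented table. The column names SPEND, QUANTITY and cust_code follow the report; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the transactions table with two incomplete rows.
df = pd.DataFrame({
    "SPEND": [1.50, np.nan, 3.20, 0.99],
    "QUANTITY": [1, 2, 1, 3],
    "cust_code": ["C001", "C002", None, "C004"],
})

# Count missing values per column before deciding to drop anything.
print(df.isna().sum())

# Drop every row that has at least one missing value.
clean = df.dropna()
print(len(df), "rows before,", len(clean), "rows after")
```

Checking the per-column counts first shows how many records the step will discard.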
This graph shows the distribution of customers by hour of the day.
Most customers come to the supermarket around 12 PM and 2 PM, and then around
9 PM.
The customers who shop at 2 PM and 9 PM are therefore the rush-hour customers. The
remaining are general-hour customers.
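The hourly split described above could be computed along the following lines. This is only a sketch: the "hour" column name, the sample visits, and the rule that the two busiest hours count as rush hours are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical visits: one row per basket with the hour of purchase.
visits = pd.DataFrame({
    "cust_code": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "hour": [9, 12, 14, 14, 14, 21, 21, 17],
})

# Number of distinct customers seen in each hour of the day.
customers_per_hour = visits.groupby("hour")["cust_code"].nunique()

# Assumed rule: the two busiest hours are treated as rush hours.
rush_hours = set(customers_per_hour.nlargest(2).index.tolist())

# Label every visit according to the hour it falls in.
visits["segment"] = visits["hour"].map(
    lambda h: "rush-hour" if h in rush_hours else "general-hour"
)
print(customers_per_hour)
print(visits)
```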
This graph shows the relationship between average quantity and the hours the market is
open.
Customers who come at 8:00 AM and 12:00 PM purchase larger quantities of
goods. Customers who come during other hours buy less by comparison.
Therefore, we can say that the customers who buy a lot of items in the early morning
and late evening are the customers with high-level consumption. The remaining are
customers with general-level consumption.
This graph shows the relationship between basket size and the hours the market is
open.
Customers with large baskets tend to shop around 3 PM, and customers with
small and medium baskets around 2:30 PM.
Because these times are so close, there is no meaningful relationship between basket
size and shopping hour, so we cannot segment customers by it.
This graph shows the relationship between price sensitivity and customer spend in
the shop.
Unclassified customers have the highest spending, followed by mid-market customers;
however, unclassified customers account for the fewest transactions overall.
The graph above shows customer traffic by store region for S02, N01, N03 and W02;
S02 is the busiest.
It can also be inferred from the graph below that purchasing between December and
February is twice that of other periods, when people celebrate Christmas and
Chinese New Year. Purchasing from April to June, in early summer and close to
graduation, also appears higher than in the remaining seasons.
At noon (13-14) and in the evening (21), the customer flow reaches its peak.
This is probably because these periods are off-work and off-school times, when
customers are more likely to go shopping.
The more customers, the higher the turnover, and the relationship is almost linear.
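One way to check how close to linear the customer-turnover relationship is would be to compute the Pearson correlation and fit a straight line. The sketch below uses invented hourly aggregates, not the report's data, and treats turnover as total spend.

```python
import numpy as np

# Hypothetical hourly aggregates: distinct customers and total spend (turnover).
customers = np.array([12, 20, 35, 41, 58, 66], dtype=float)
turnover = np.array([110.0, 185.0, 330.0, 372.0, 540.0, 610.0])

# Pearson correlation: values close to 1 indicate a nearly linear increase.
r = np.corrcoef(customers, turnover)[0, 1]

# Least-squares line: turnover is approximated by slope * customers + intercept.
slope, intercept = np.polyfit(customers, turnover, deg=1)
print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The slope can be read as the approximate extra turnover per additional customer in an hour.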
For less affluent families, Saturday is the least favourite day for shopping.
For mid-market families, Sunday is the least favourite day for shopping.
For up-market families, Tuesday is the least favourite day for shopping.
For other families, Wednesday is the least favourite day for shopping.
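The least favourite shopping day per price-sensitivity segment could be derived as sketched below. The "weekday" column name and the sample baskets are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical baskets: price-sensitivity code and weekday of the visit.
baskets = pd.DataFrame({
    "cust_price_sensitivity": ["LA", "LA", "LA", "MM", "MM", "MM", "UM", "UM", "UM"],
    "weekday": ["Mon", "Mon", "Sat", "Fri", "Fri", "Sun", "Wed", "Wed", "Tue"],
})

# Count baskets per (segment, weekday) pair.
# Days with no visits at all do not appear here; a full analysis should
# add them with a count of zero before taking the minimum.
counts = (baskets.groupby(["cust_price_sensitivity", "weekday"])
          .size()
          .reset_index(name="n_baskets"))

# For each segment, keep the row with the smallest basket count.
least = counts.loc[counts.groupby("cust_price_sensitivity")["n_baskets"].idxmin()]
least_favourite = dict(zip(least["cust_price_sensitivity"], least["weekday"]))
print(least_favourite)
```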
K-Means
Based on scree plot analysis, the number of clusters for k-means was set to 3, using
two features, “SPEND” and “QUANTITY”.
Of the three clusters, customers in cluster 3 showed the highest average spend
and quantity bought, revealing that this is the premium customer segment, which has
the buying power and can be pitched more luxury items.
Clusters 1 and 2 are the ones that need to be targeted with suitable discount
offers and schemes.
The silhouette coefficient was found to be 0.745, indicating that the clusters overlap
very little and that three is a reasonable number of clusters.
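The K-means workflow described above (scree curve, three clusters, silhouette check) might look roughly like the following with scikit-learn. This is a sketch on synthetic data standing in for SPEND and QUANTITY; the scaling step, the random seeds and the range of k are assumptions, and the score it prints is not the report's 0.745.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for per-customer SPEND and QUANTITY (three loose groups).
centers = np.array([[5.0, 3.0], [20.0, 10.0], [60.0, 30.0]])
X = np.vstack([c + rng.normal(scale=[1.5, 1.0], size=(50, 2)) for c in centers])

# Scaling is an assumption; the report does not say whether it was applied.
X_scaled = StandardScaler().fit_transform(X)

# Inertia for k = 1..6 gives the scree (elbow) curve used to choose k.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
]

# Fit the chosen model and measure cluster separation.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
score = silhouette_score(X_scaled, model.labels_)
print("inertia by k:", np.round(inertias, 2))
print("silhouette for k=3:", round(float(score), 3))
```

A silhouette value near 1 means points sit much closer to their own cluster than to the nearest other cluster.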
Cluster Formation through K-means
Out of the three clusters, the third cluster was found to have the maximum average spend
and quantity, followed by cluster 1 and cluster 2. Therefore, cluster 3 represents customers
who belong to the premium category and are affluent enough to make large purchases.
Recommendations
The retail store should work on curating discount offers and schemes to target customers in
cluster 1, who can be potential buyers.
DBSCAN
The estimated number of clusters is 3, matching what was identified through K-means
clustering. However, the silhouette coefficient and the CH (Calinski-Harabasz) index are
both lower than for K-means, which shows that the DBSCAN clusters are less dense and less
well separated than the K-means clusters.
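A comparison of this kind can be sketched with scikit-learn as follows. The data is synthetic, and the eps and min_samples values are illustrative guesses rather than the settings used in this study; on such clean data the two methods may score almost identically, unlike the result reported above.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for per-customer SPEND and QUANTITY.
centers = np.array([[5.0, 3.0], [20.0, 10.0], [60.0, 30.0]])
X = np.vstack([c + rng.normal(scale=[1.5, 1.0], size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are assumed values chosen for this toy data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# DBSCAN labels noise points as -1; leave them out of the count and the scores.
core = db_labels != -1
n_clusters = len(set(db_labels[core].tolist()))
print("DBSCAN clusters:", n_clusters, "noise points:", int((~core).sum()))

km_sil = silhouette_score(X_scaled, km_labels)
km_ch = calinski_harabasz_score(X_scaled, km_labels)
print(f"K-means silhouette = {km_sil:.3f}, CH index = {km_ch:.1f}")

# Both scores need at least two clusters to be defined.
if n_clusters >= 2:
    db_sil = silhouette_score(X_scaled[core], db_labels[core])
    db_ch = calinski_harabasz_score(X_scaled[core], db_labels[core])
    print(f"DBSCAN  silhouette = {db_sil:.3f}, CH index = {db_ch:.1f}")
```

Higher values of both the silhouette coefficient and the CH index indicate denser, better separated clusters.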
Conclusion
Based on our analysis, we identified 3 types of customers: new customers, lost
customers and best customers.
1. Lost Customers: Customers who make the fewest transactions and spend the least money.
Understand the reason why the customers are lost so that we can strategize accordingly.
Future Recommendation:
Analysis of Time variable and shop week: The number of days since the first
transaction by each customer. This will tell us how long each customer has been with
the system.
Offer a loyalty bonus to long-term engaged customers (with high tenure) so that
they do not leave the service.
Fresh food basket and mid-market customer segments should be targeted.
Conduct deeper segmentation of customers based on their geographical location
and on demographic and psychographic factors.
Incorporate data from the Google Analytics account of the business: Google
Analytics is a great resource for tracking many important business metrics such as
Customer Lifetime Value, traffic source/medium, page views per visit, and the bounce
rate of the company’s website.
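The tenure measure proposed in the first recommendation (days since each customer's first transaction) could be computed as in the sketch below. The "shop_date" column name, the sample dates, and the choice of the last observed date as the reference point are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; "shop_date" is an assumed column name.
tx = pd.DataFrame({
    "cust_code": ["A", "A", "B", "B", "C"],
    "shop_date": pd.to_datetime(
        ["2007-01-05", "2007-06-01", "2007-03-10", "2007-03-20", "2007-07-01"]
    ),
})

# Reference date: the last day observed in the data (an assumed convention).
as_of = tx["shop_date"].max()

# Tenure = days between each customer's first transaction and the reference date.
first_seen = tx.groupby("cust_code")["shop_date"].min()
tenure_days = (as_of - first_seen).dt.days
print(tenure_days)
```

Customers with high tenure values would be the candidates for the loyalty bonus.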