CUSTOMER ANALYSIS - Report
Problem Statement:
We want to understand how customers shop at a supermarket by looking at their
transaction data. This means finding out what products are popular, which ones aren’t, and
how buying patterns change over time. By creating new metrics and visualizing the data, we
aim to gather useful insights to improve sales strategies and the overall shopping experience.
Removing outliers will help ensure that our findings are accurate and practical for making
smart business decisions.
Dataset used:
Link: https://github.com/Aegon127bc/Supermarket-CustomerAnalysis.git
Loading the dataset: the data is loaded into a Pandas DataFrame, and the BasketDate
attribute is converted from 'str' to 'datetime'.
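A minimal sketch of this loading step; the file name supermarket.csv and the read options are assumptions for illustration, not taken from the repository:

import pandas as pd

# Load the transactions into a Pandas DataFrame (file name is assumed for illustration)
df = pd.read_csv("supermarket.csv")

# Convert BasketDate from 'str' to 'datetime'; unparsable values become NaT
df["BasketDate"] = pd.to_datetime(df["BasketDate"], errors="coerce")

print(df.dtypes)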
Types of attributes, null values, and description of the dataset.
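One way to inspect the attribute types, null counts, and summary statistics in Pandas (a sketch, assuming the DataFrame df from the loading step above):

# Attribute types and non-null counts
df.info()

# Number of null values per attribute
print(df.isnull().sum())

# Summary statistics of the numeric attributes
print(df.describe())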
All the ProdDescr missing values are already included in the CustomerID ones: if we count
the rows where both attributes are null, the total matches the number of rows with a missing
ProdDescr.
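A quick check of this claim (a sketch; it only assumes the column names ProdDescr and CustomerID used in the report):

# Rows where the product description is missing
missing_descr = df["ProdDescr"].isnull()

# Rows where the customer identifier is missing
missing_cust = df["CustomerID"].isnull()

# If every missing ProdDescr also has a missing CustomerID,
# the two counts below are equal.
print(missing_descr.sum())                    # rows with null ProdDescr
print((missing_descr & missing_cust).sum())   # rows with both null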
Analysis of single attributes.
Sales are roughly constant throughout the year and increase in the run-up to Christmas. In
fact, the day with the most sales was 2011-12-05 and the day with the fewest sales was
2010-12-22. A simple explanation is that the store started its activity around that first
Christmas and steadily grew its clientele over time.
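A sketch of how this daily pattern can be inspected; here "sales per day" is taken to mean the number of transaction rows per day, which is an assumption about the report's definition:

import matplotlib.pyplot as plt

# Count transaction rows per calendar day
sales_per_day = df.groupby(df["BasketDate"].dt.date).size()

# Day with the most and the fewest sales
print("Busiest day:", sales_per_day.idxmax())
print("Slowest day:", sales_per_day.idxmin())

# Plot the trend over the whole period
sales_per_day.plot(title="Sales per day")
plt.show()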
Correlation:
To calculate pairwise correlation, we converted some attributes into categorical codes. For
implementation reasons, ProdDescr had to be treated differently from the other attributes: we
introduced a dictionary in which each distinct string (key) is assigned an incremental
identifier (value).
We then replaced every description with its associated identifier and, in this way,
proceeded to calculate the pairwise correlation, represented as a heatmap.
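A sketch of this encoding and heatmap step; the exact set of encoded columns and the seaborn styling are assumptions, since the report does not spell them out:

import seaborn as sns
import matplotlib.pyplot as plt

# Map each distinct description to an incremental identifier (the dictionary described above)
descr_to_id = {descr: i for i, descr in enumerate(df["ProdDescr"].dropna().unique())}

df_corr = df.copy()
df_corr["ProdDescr"] = df_corr["ProdDescr"].map(descr_to_id)

# Encode the remaining non-numeric attributes as categorical codes (assumed approach)
for col in ["BasketID", "ProdID", "CustomerID"]:
    df_corr[col] = df_corr[col].astype("category").cat.codes

# Turn BasketDate into a numeric rank so it can enter the correlation matrix
df_corr["BasketDate"] = df_corr["BasketDate"].rank(method="dense")

# Pairwise correlation, visualised as a heatmap
corr = df_corr.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()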
Correlation in the original dataset is not high for most of the considered pairs.
Exceptions are:
• BasketIDs and BasketDates: all the transactions that belong to the same basket are
made on the same date.
• ProdID and ProdDescr: the same item (usually) has the same description.
So, given the high correlation score (0.98) between descriptions and items, we
can safely assume that ProdDescr is a superfluous attribute and can be
dropped in future studies.
On the other hand, our new attribute Amount naturally has a very high correlation
with the Qta attribute, since the two are directly proportional.
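A brief sketch of these two points; the definition Amount = Qta × unit price is an assumption about how the new attribute was derived, and the unit-price column name Sale is hypothetical:

# Drop the redundant description column for future analyses
df = df.drop(columns=["ProdDescr"])

# Assumed derivation of the new Amount attribute: quantity times unit price.
# The column name "Sale" is hypothetical.
df["Amount"] = df["Qta"] * df["Sale"]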
Outliers:
The last step of the data quality assessment was outlier detection, which was
performed only on the new dataset (the one with all the new attributes).
For outlier analysis and removal, we decided to use the Z-score, a measurement
that tells how many standard deviations a value lies above or below the mean of
the dataset.
Z-score normalization refers to the process of rescaling every value in a dataset so that
the mean of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a dataset:
New value = (x – μ) / σ
where:
x: Original value
μ: Mean of data
σ: Standard deviation of data
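A minimal sketch of the Z-score outlier removal; the screened columns (Qta and Amount) and the common |z| > 3 cutoff are assumptions, as the report does not state them:

# Columns to screen for outliers; the exact set is an assumption
cols = ["Qta", "Amount"]

# Compute Z-scores column by column: z = (x - mean) / std
z = (df[cols] - df[cols].mean()) / df[cols].std()

# Keep only the rows whose Z-scores stay within the chosen threshold on every column
threshold = 3
df_clean = df[(z.abs() <= threshold).all(axis=1)]

print(f"Removed {len(df) - len(df_clean)} outlier rows")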