CUSTOMER ANALYSIS - Report

The document outlines a customer analysis project aimed at understanding shopping behaviors in a supermarket through transaction data. It details the dataset attributes, preprocessing steps, and analysis methods including correlation and outlier detection using Z-Score normalization. The findings suggest that certain product descriptions can be considered redundant and that sales patterns fluctuate throughout the year, notably increasing during the Christmas season.


CUSTOMER ANALYSIS

15/07/2024 Name: Nitin S


Batch no: 07 Roll no: 22i338

Problem Statement:
We want to understand how customers shop at a supermarket by analysing their
transaction data: which products are popular, which are not, and how buying patterns
change over time. By deriving new metrics and visualising the data, we aim to gather
useful insights to improve sales strategies and the overall shopping experience.
Removing outliers helps ensure that our findings are accurate and practical for making
sound business decisions.

Dataset used:

Link: https://github.com/Aegon127bc/Supermarket-CustomerAnalysis.git

Attribute         Type          Description
BasketID          Numerical     Unique alphanumeric ID assigned to a basket, defined as a set of transactions.
BasketDate        Numerical     The date on which the transaction was made.
Sale              Numerical     Cost of one unit of product.
CustomerID        Numerical     The ID of the customer.
CustomerCountry   Categorical   The customer's country of origin.
ProdID            Numerical     The ID of the product.
ProdDescr         Categorical   A description of the product.
Qta               Numerical     The number of units bought in the transaction.
Data Preprocessing:

Loading the dataset: the data is loaded into a Pandas DataFrame, and BasketDate is
converted from 'str' to 'datetime'. We then inspect the attribute types, the null
values, and a statistical description of the dataset.
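The loading step can be sketched as follows; the file name and sample rows are hypothetical stand-ins for the real CSV in the linked repository, using the column names from the table above.

```python
import io
import pandas as pd

# Inline sample standing in for the real CSV (invented values, real column names).
csv_data = io.StringIO(
    "BasketID,BasketDate,Sale,CustomerID,CustomerCountry,ProdID,ProdDescr,Qta\n"
    "A536365,2010-12-01,2.55,17850,United Kingdom,85123,WHITE HANGING HEART,6\n"
    "A536366,2010-12-22,3.39,17850,United Kingdom,71053,WHITE METAL LANTERN,8\n"
)
df = pd.read_csv(csv_data)

# BasketDate arrives as 'str'; convert to 'datetime' for time-based analysis.
df["BasketDate"] = pd.to_datetime(df["BasketDate"])

print(df.dtypes)
print(df.describe(include="all"))
```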

Count of null values, for each attribute.

All the ProdDescr missing values are already included among the CustomerID ones: if we
count the rows where both attributes are null, we obtain the same number as the
ProdDescr null count.
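The null-count check can be sketched as below; the frame is a small invented sample mirroring the pattern described above, where every row missing ProdDescr also misses CustomerID.

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows that miss ProdDescr also miss CustomerID.
df = pd.DataFrame({
    "CustomerID": [17850, np.nan, np.nan, 12583],
    "ProdDescr":  ["LANTERN", np.nan, np.nan, "HEART"],
})

# Null count per attribute.
print(df.isnull().sum())

# Rows where BOTH attributes are null; if this equals the ProdDescr null
# count, every missing description also lacks a CustomerID.
both_null = (df["CustomerID"].isnull() & df["ProdDescr"].isnull()).sum()
print(both_null)
```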
Analysis of single attributes.
Sales are roughly constant throughout the year and increase in the run-up to
Christmas. The day with the most sales was 2011-12-05 and the day with the fewest was
2010-12-22. A simple explanation is that the store started its activity around that
first Christmas and steadily grew its clientele over time.
Correlation:
To calculate pairwise correlation, we transformed some attributes into categorical ones. Due
to implementation reasons, ProdDescr had to be treated differently from other attributes. For
this reason, we introduced a dictionary: for each string (key) we assigned an incremental
identifier (value).
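The dictionary construction can be sketched as below, with invented sample descriptions; each unseen string receives the next incremental identifier (pandas' `factorize` achieves the same encoding in one call).

```python
import pandas as pd

# Invented sample descriptions standing in for the ProdDescr column.
descrs = pd.Series(["WHITE LANTERN", "RED HEART", "WHITE LANTERN", "BLUE BOX"])

# For each string (key), assign an incremental identifier (value).
descr_to_id = {}
for s in descrs:
    if s not in descr_to_id:
        descr_to_id[s] = len(descr_to_id)  # next unused identifier

# Replace every description with its associated identifier.
encoded = descrs.map(descr_to_id)
print(descr_to_id)
print(encoded.tolist())
```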

Then we replaced all the descriptions with their associated identifiers and calculated
the pairwise correlation, represented as a heatmap.
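A minimal sketch of the correlation heatmap, using a tiny invented frame and a plain matplotlib rendering (seaborn's `heatmap` is a common alternative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # file-only backend so this also runs headless
import matplotlib.pyplot as plt

# Invented sample; ProdDescr is already encoded as integer identifiers.
df = pd.DataFrame({
    "ProdID":    [1, 2, 3, 1, 2],
    "ProdDescr": [0, 1, 2, 0, 1],
    "Qta":       [6, 8, 2, 6, 4],
})

# Pairwise (Pearson) correlation matrix.
corr = df.corr()

# Render the matrix as a heatmap and save it to disk.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar(label="correlation")
plt.title("Pairwise correlation")
plt.savefig("correlation_heatmap.png", bbox_inches="tight")
```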
Correlation in the original dataset is low for most of the considered pairs.
Exceptions are:
• BasketID and BasketDate: all the transactions that belong to the same basket are
made on the same date.
• ProdID and ProdDescr: the same item (usually) has the same description.
So, given the high correlation score (0.98) between items and descriptions, we can
safely assume that ProdDescr is a superfluous attribute, and it can therefore be
dropped in future studies.
On the other hand, our new attribute Amount naturally has a very high correlation
with the Qta attribute, simply because the two are directly proportional.
Outliers:
The last step of data quality assessment was outlier detection, performed only on the
new dataset (the one with all the new attributes).
For outlier analysis and removal we used the Z-score, an important measurement that
tells how many standard deviations a value lies above or below the mean of the
dataset.

Z-score normalization refers to the process of normalizing every value in a dataset such that
the mean of all of the values is 0 and the standard deviation is 1.

We use the following formula to perform a z-score normalization on every value in a dataset:

New value = (x – μ) / σ

where:

x: Original value
μ: Mean of data
σ: Standard deviation of data
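The formula above translates directly into a filter; the sketch below drops rows whose Qta z-score exceeds 3 in absolute value (the threshold of 3 is a common convention and an assumption here, since the report does not state one; the data is invented).

```python
import pandas as pd

# Invented sample: twenty typical quantities plus one extreme value.
df = pd.DataFrame({"Qta": [2] * 20 + [100]})

# New value = (x - mu) / sigma, applied to the Qta column.
mu = df["Qta"].mean()
sigma = df["Qta"].std()
z = (df["Qta"] - mu) / sigma

# Keep only rows within 3 standard deviations of the mean (assumed threshold).
cleaned = df[z.abs() <= 3]
print(len(df), "->", len(cleaned))
```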
