
br17 Final Project Report

The document discusses analyzing retail transaction data using machine learning techniques to answer questions about customer segmentation, sales forecasting, and purchase affinity. It describes exploring the data, identifying challenges like data discrepancies, and preparing the data for modeling to provide insights to retail owners.

RETAIL IN DETAIL

ECE/CS 498 DSG Final Project


Spring 2020
Badrinarayanan R (br17), Industrial Engineering
Anirudh Sharma (ashar29), Information Management
Anunay Sharma (anunays2), Industrial Engineering

Abstract—The retail industry is a vast and major sector of the economy, comprising companies that sell finished products to customers. Around 66% of the U.S. gross domestic product (GDP) comes from retail consumption. The industry started with brick-and-mortar retailers, which sell products from physical locations where customers purchase onsite. The last 7 years have seen the inception of e-tailers, where products are promoted online and arrive at the customer's doorstep with a quick turnaround time. Online purchases take place at the tap of a button on a mobile device and make customers' lives easier than ever. Although the internet era has made the retail segment more convenient, retailers face day-to-day challenges across multiple domains that need more than human predictions to be managed successfully and to provide an efficient and sustainable solution considering factors like time, money, and volume. Statistics and machine learning models have become widespread and have made their way into the field of retail too. Retail data is an invaluable resource in the current period and is fed as input to these analytical models to derive actionable insights for retail stores. The scope of this project covers the primary challenges in retail: customer segmentation, inventory and sales planning, and promotion strategies. Each of these problems is broken down to understand the existing scenario and is finally treated with a solution that has a solid statistical foundation.

I. INTRODUCTION
Retailing is an integral part of modern society. Consumers depend heavily on retail stores (both online and offline) for their various goods and service requirements. Earlier, goods and services were made available through the process of trading. In present times, trading has been replaced by buying and selling goods, which makes retail stores an important part of the supply chain, and retail thus remains an attractive business sector for many. With a presence worth several trillion dollars, retail remains a significant contributor to the global economy [1]. With recent technological advancements, analytics in retail has gained much popularity for its aid in helping the business make key decisions. It enables the retailer to come up with standard methodologies that dissect the customer segments and product categories and can help boost revenue. This project puts various concepts of retail analytics to use, providing analytical insights that can be essential for making marketing and procurement decisions. We aim to leverage several machine learning techniques to analyze the retail business for a single retailer.
After a careful study of the data, we decided to answer 3 main questions that could prove beneficial for the retail store owner of this data set:
1) Customer Segmentation: Grouping customers into several clusters based on their purchase attributes and other relevant features, thereby enabling the client to target a precise set of audiences for relevant promotions and offers (targeted marketing).
2) Sales Forecasting: Analyzing the trends in the data to predict the sales of products in the upcoming weeks for the retail company.
3) Affinity/Purchase/Basket Analysis: Exploring customer purchase behavior on a day-to-day basis to understand product affinity (i.e., Product X sells along with Y/Z).
There are precedents of companies using these techniques successfully to enhance profits. Infiniti Research, a leading market intelligence solutions provider located in London, recently announced the completion of its latest success story on market segmentation analysis. A leading player in the money transfer market wanted to conduct a global agent satisfaction survey in a set of European countries. With the help of a customer intelligence solution, the client was able to improve its agent management program by gaining an understanding of what the agents value. They were able to ensure a better satisfaction level and hence improve sales and customer satisfaction. Sales forecasting is used by many e-commerce giants; Amazon, for example, offers an automated solution called Amazon Forecast, which uses time-series data to predict future sales and product demand.
We make an effort to cover all the problems proposed herein in as much depth as possible and to provide insights that can be beneficial to retail store owners.

II. EXPLORATORY ANALYSIS AND CHALLENGES
The dataset for this project is sourced from the Kaggle website. No problem statement had been created for this data, nor had any analysis been done on it to generate insights. All the approaches and problems discussed in this paper are self-defined and were not inspired
by any source. There are a total of 3 different data sources, namely the customer data, the product mapping data, and the transaction data for each customer who visited the store. The transaction data records a total of 4 years of data for analysis and modeling.
As a first step, we perform exploratory data analysis to find patterns in the data. Fig. 1 shows the sales performance of different store types across months since 2011. The e-shop dominates, with twice the sales of the others. This is quite reasonable, as this was around the time when customers were shifting to e-commerce platforms like Amazon and eBay. An interesting observation to keep in mind from the graph is the sudden decline in sales from 2014. This phenomenon is not temporary, as it continues until the end of 2014.

Fig. 1. Sales trend across months for different store types

Fig. 2. Total sales amount for different product categories filtered on gender and age group

Fig. 2 indicates that middle-aged customers are the major contributors to sales across all product categories, which is expected, as people generally become capable of spending high amounts around this age. Another observation from this visualization is that the books and bags segments are respectively the highest and lowest sales categories for both genders (male and female). From Fig. 3 it can be seen that teleshops and e-shops contribute more than 60% of the transactions in the retail store.

Fig. 3. Distribution of different store types

A. Roadblocks
One of the major roadblocks in any data science problem is data discrepancy. Some of the discrepancies found are as follows:
• The data for 2014 looks very suspicious, as there is a 144% decrease in overall sales.
• There are transactions where the quantity is negative. These could be considered returns by the customer, but no earlier transactions were found in which the customer had purchased the item.
• There is no information on multiple purchases made by a customer in a single transaction. This limits the ability to provide a product/SKU-level recommendation.
B. Measures
In order to make the profiling and modelling process unbiased, we took certain measures to transform the data and make it ready for further work:
• Restrict the data to 2011-2013
• Remove all transactions having negative quantities
• The most significant metrics, such as sales, number of transactions, and quantity, go through a process called winsorization, a transformation that limits the extreme values of the data to reduce the effect of outliers
III. PROBLEM 1 - CUSTOMER SEGMENTATION
Customer segmentation will allow the retail store to learn a great deal about its customers so that it can cater to their needs more efficiently. It will also allow the store to tailor its communication to the customer's life cycle and to prepare a better acquisition strategy. The method used for this problem is K-means clustering. The first step for this method was feature selection. Since there were only a few numerical columns in the data set, an iterative approach was implemented to select the germane variables for clustering.
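As a rough illustration of the workflow used in this section, the sketch below winsorizes and standardizes three features, then runs a plain K-means (random seeds, nearest-centroid assignment, centroid update, repeated until the centroids stop moving) for k = 1 to 10 and records the SSE for each k, which is the quantity an elbow curve plots. The data is synthetic, and the function names, the 1st/99th percentile limits, the standardization step, and the five restarts per k are assumptions made for illustration; the report does not state which limits or library were used.

```python
import numpy as np


def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the given percentiles (assumed limits) to tame outliers."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, sse)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its closest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and SSE: sum of squared distances to assigned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, sse


# Synthetic stand-in for (sales per transaction, quantity, number of transactions).
rng = np.random.default_rng(42)
centers = np.array([[20.0, 2.0, 3.0], [80.0, 4.0, 5.0], [35.0, 10.0, 15.0]])
raw = np.vstack([c + rng.normal(scale=1.0, size=(100, 3)) for c in centers])

# Winsorize each column, then standardize so no feature dominates the distance.
X = np.column_stack([winsorize(raw[:, j]) for j in range(raw.shape[1])])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Elbow data: best SSE over a few restarts for each k from 1 to 10.
sse_by_k = {
    k: min(kmeans(X, k, seed=s)[2] for s in range(5))
    for k in range(1, 11)
}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

Plotting `sse_by_k` against k and looking for the point where the curve flattens gives the elbow. The restarts are there because random seeding can converge to a poor local optimum for a given k.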
A. Approach
The k-means algorithm works as follows:
• Step 1: Randomly choose k data points (seeds) to be the initial centroids, i.e., cluster centers
• Step 2: Assign each data point to the closest centroid
• Step 3: Re-compute the centroids using the current cluster members
• Step 4: If a convergence criterion is not met, go to Step 2; the algorithm continues until convergence is met.
Stopping/convergence criterion:
• No (or minimal) re-assignments of data points to different clusters
• No (or minimal) change of centroids
• Minimal decrease in the sum of squared error (SSE)

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \tag{1}$$

$$m_j = \frac{1}{n_j} \sum_{x \in C_j} x \tag{2}$$

where
$C_j$ is the $j$-th cluster,
$m_j$ is the centroid of cluster $C_j$,
$n_j$ is the number of points in cluster $C_j$,
$\mathrm{dist}(x, m_j)$ is the distance between data point $x$ and centroid $m_j$ (generally Euclidean).

Fig. 4. Elbow curve to choose the optimum number of clusters

The variables considered are sales per transaction (basket value), age (customer's age), number of transactions (frequency of purchase), quantity, and quantity per transaction (basket size). The K-means algorithm was run with different combinations of these variables and, finally, we could see 3 well-distinguished clusters for the following set of variables: sales per transaction, quantity, and number of transactions. The optimum number of clusters was chosen based on the elbow curve, which shows the within-cluster sum of squared distances for different numbers of clusters in Fig. 4.
The elbow method runs k-means clustering on the data set for a range of values of k (from 1 to 10) and then computes, for each value of k, a distortion score across all clusters. The distortion score is computed as the sum of squared distances from each point to its assigned center. As the drop in inertia is not that significant after the 3rd cluster, the optimum number of clusters selected was 3.
B. Results
Fig. 5 and Fig. 6 show the clusters in a 3D space. Although there is some overlap between the clusters, a decision boundary can be seen. After analyzing the data points in these clusters carefully, it was inferred that the three clusters are the Bargain Hunters/seasonal purchasers (BH) (lower basket value and fewer transactions), the High Spenders (HS) (higher basket value), and the Regular Shoppers (RS) (more frequent purchases). This makes sense from a business standpoint and can be a great insight for the retail store when running promotions or offers.

Fig. 5. 3-dimensional scatter plot for K-means

Fig. 6. Top view of the scatter plot for K-means

IV. PROBLEM 2 - SALES FORECASTING
What if the retail shop wants to be well prepared to handle a large crowd of customers during an occasion? It needs to have a well-defined strategy for setting up logistics and inventory planning. Prior to this, every company engages in
a process of estimating its future sales, which can be very beneficial in telling the company how to manage its resources and inventory efficiently. According to research, companies with accurate sales forecasts are 10% more likely to grow their revenue year over year.
Fig. 7 shows the weekly sales distribution of the retail store. This data exhibits different aspects of a time series. The red boxes marked in the graph denote a pattern known as a cycle, which is discussed in later sections.

Fig. 7. Weekly sales trend

A. Approach
Linear Regression
There are different ways to approach the problem of forecasting. One of the classic statistical ways of predicting a real-valued output is linear regression. It models the relationship between a dependent variable and several independent variables to find the line that best fits the data with the least squared error. The optimization problem for linear regression is as follows:

$$\min_{w,\,b} \sum_{i=1}^{n} \left( y^{(i)} - w^{T} x^{(i)} - b \right)^{2} \tag{3}$$

where
$y^{(i)}$ is the dependent variable for observation $i$,
$x^{(i)}$ is the vector of independent variables for observation $i$,
$w$ and $b$ are the coefficients and the intercept of the equation.
There are five principal assumptions which justify the use of linear regression models for purposes of prediction:
• Linearity: The relationship between the dependent and independent variables should be linear
• Homoscedasticity: The variance of the residuals should be the same for any value of the independent variables
• No autocorrelation: The error terms should not be correlated with one another
• Normality: The errors should be normally distributed
• Independence: The independent variables should be independent of each other
Time Series Analysis
Time series analysis refers to the analysis of data points collected at constant intervals to determine the behavior or pattern of the variables. This analysis helps us move towards the goal of forecasting and prediction. What distinguishes time series analysis from linear regression is the time dependency: the observations are dependent on one another.
An important criterion that needs to be satisfied to use the time series formulation is stationarity. The assumption states that the mean and variance of a time series are constant over time. Once they are ensured to be constant, it is easier to solve the modeling problem. Most time-series data have at least one of these kinds of patterns: trend, seasonality, or cycles [2].
Trend - The trend describes the general behavior of a time series. If a time series has a positive long-term slope over time, it has an upward trend, and if it has a negative slope, it has a downward trend.
Seasonality - A seasonal pattern is any kind of fluctuation in a time series that is caused by calendar-related events. These events can be the time of year, the time of day, or the day of the week. Seasonality always has fixed frequencies. The seasonal patterns start and end in the same period of a week. Consider the example of Black Friday and Cyber Monday. Sales in this period are expected to go up, and this will be observed distinctly in the data.
Cycle - Cycles are defined as rises and falls with non-fixed magnitudes that can last more than a calendar year. They are not repetitive. Usually, they result from external factors, which makes them much harder to predict. Time series forecasting leverages these patterns to produce reliable predictions. Fig. 8 below shows the split of the weekly sales data into its different components.

Fig. 8. Components of time series

B. Methodology
In this project, we started the modeling process with both the approaches mentioned above. The class of time series models implemented is ARIMA (Auto-Regressive Integrated Moving Average). These models take into account the lagged values of the variable along with the lagged forecast errors as the only predictors for forecasting.
Our main objective is to forecast the sales value of the retail store. We want to estimate sales at a weekly level, as the lower the granularity, the more accurate the results. There are a total of 155 weeks of data from 2011-2013. This data goes through a series of steps to arrive at the final predictions, explained in Fig. 9:
• The dataset is split into training and testing sets. The training set is the set on which the algorithm trains, capturing the variations in the past. The test set serves as the validation set, where we see how good our model is. As a rule of thumb, the data is split 80:20 between training and testing.
• The data is then checked against the assumptions of linear regression and time series analysis. Here, the assumptions of linear regression are not met due to the presence of autocorrelation, but we still force-fit the model to obtain a baseline score from the most classic statistical procedure.
• The ARIMA model has several parameters:
AR term (p): the number of lagged observations to be considered in the model.
MA term (q): the number of lagged forecast errors to be taken into account.
Differencing term (d): the number of times the observations are differenced to make the time series stationary.
All these parameters are derived from the autocorrelation (ACF) and partial autocorrelation (PACF) plots shown in Fig. 10.
• From the plots, we could see that the values of p and q should be ~3 or 4; the final values were found by iterating over different combinations.
• Various accuracy metrics are captured to understand the performance of both models.

Fig. 10. PACF and ACF plots
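The steps above can be sketched end to end on a synthetic weekly series. To keep the example dependency-free beyond NumPy, it fits an ARIMA(p, 1, 0)-style model (first differencing plus an autoregression on the lagged differences, estimated by ordinary least squares) rather than the full ARIMA with moving-average terms that the report uses, so it is a simplified stand-in and not the authors' implementation. The series, the helper names, and the choice p = 3 are assumptions made for illustration.

```python
import numpy as np


def fit_ar_on_differences(series, p):
    """Fit diff_t = c + sum_i phi_i * diff_{t-i} by least squares (d = 1)."""
    diff = np.diff(series)  # first differencing to remove the trend
    # Each row holds the p previous differences, most recent first.
    rows = [diff[t - p:t][::-1] for t in range(p, len(diff))]
    A = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(A, diff[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]


def forecast(series, coef, steps):
    """Roll the fitted model forward and undo the differencing."""
    p = len(coef) - 1
    diff = list(np.diff(series))
    level = series[-1]
    out = []
    for _ in range(steps):
        lags = diff[-p:][::-1]  # most recent difference first
        next_diff = coef[0] + float(np.dot(coef[1:], lags))
        diff.append(next_diff)
        level = level + next_diff
        out.append(level)
    return np.array(out)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


# Synthetic weekly sales: 155 weeks with trend, yearly seasonality, and noise.
rng = np.random.default_rng(0)
weeks = np.arange(155)
sales = (1000 + 3.0 * weeks
         + 50 * np.sin(2 * np.pi * weeks / 52)
         + rng.normal(0, 10, 155))

# 80:20 split in time order (no shuffling for time series).
split = len(sales) * 8 // 10
train, test = sales[:split], sales[split:]

coef = fit_ar_on_differences(train, p=3)
pred = forecast(train, coef, steps=len(test))
score = mape(test, pred)
print(f"train={len(train)} test={len(test)} MAPE={score:.2f}%")
```

In practice a statistics library would also estimate the MA terms and report AIC and p-values; this sketch only shows how the split, the differencing, the lag terms, and the MAPE fit together.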

Fig. 11. ARIMA - Summary


Fig. 9. Flowchart for Time series modelling

C. Forecasting Results
The linear regression outputs a linear equation with a large negative intercept of -4742.8. This value is hard to interpret (as sales can't be negative), since it essentially means that, without the effect of any of the lag variables, the model would predict this value as the future sales. Moreover, the model yields an R² score of ~98%, which looks very suspicious. Additionally, it should be kept in mind that the data violates the no-autocorrelation assumption of linear regression.
When we look at the results of the ARIMA model in Fig. 11, we see some interesting observations. The p-values of most of the variables used in the model are less than 0.05, so these terms are statistically significant. The AIC score (3047) reflects the amount of information lost by the model; this score is the lowest among all the iterations tried. Finally, the best model achieves a MAPE of 9.76%.
D. Seasonal ARIMA (SARIMA) - Results
Although the ARIMA model performs well with the obtained parameters, there is always room to improve by taking more intricate factors into account. The SARIMA model [3] adds seasonal terms to provide better results than the existing model. The seasonal counterparts of the ARIMA parameters p, d, and q are denoted by P, D, and Q. A series of iterations using pmdarima was done to find the best seasonal parameters (Fig. 12). The outputs are shown in the figure below.
The results of SARIMA suggest that the model was able to capture more variation than the previous models (Fig. 13). The seasonal component plays a major role in retail sales, as there are several offers available during different seasons and occasions. The MAPE of the model is 10.7%, slightly higher than ARIMA, but the AIC (1817) and BIC (1828) of the
model are greatly reduced. Also, the p-values of the variables are within the standard significance level of 0.05. Thus, the SARIMA model gives the best results for the retail sales data, and the forecast is shown in Fig. 14.

Fig. 12. Parameter Tuning - SARIMA

Fig. 13. SARIMA - Summary

Fig. 14. SARIMA Forecast

V. CUSTOMER PRODUCT AFFINITY
Till now, two main aspects of retail marketing were covered. The former deals with the dissection of customers and the latter helps with sales prediction to enable better logistics and inventory management. A missing piece between these is the type of products to suggest to each segment of customers so that there is a higher chance of up-selling and cross-selling products. This is very similar to Market Basket Analysis (MBA), a widely used technique to identify the best possible combinations of products or services frequently bought by customers. With the given constraints of the data, the best possible solution was to apply advanced data analysis and mining techniques to provide close-to-accurate recommendations.

A. Approach
The approach to this problem is to understand the purchases of different sub-categories by each customer of the retail store. The line of analysis we propose involves looking at the top 3 sub-category purchases by each customer along with the chain of sub-categories purchased by them. To make the recommendations more accurate, we leverage the customer segmentation to provide recommendations to different groups. The final step is to provide offers on the set of preferred sub-categories to specific customers. Fig. 15 illustrates this.

Fig. 15. Product Affinity Methodology

B. Results
The top 2 tables in Fig. 16 show the purchase analysis for the Regular Shoppers (RS) segment. It is seen that the top 3 purchases and the chain correspond to women, men, and kids. This lets us conclude with some confidence that Regular Shoppers are more inclined towards the apparel section. Similarly, for the bottom 2 tables, the chain is in line with the top 2 and top 3 purchases, which are the personal appliances and academic categories. This behavior is exhibited by both Bargain Hunters (BH) and High Spenders (HS), and providing such suggestions can yield high sales conversions, eventually leading to higher profits for the retail store. It is noted that the customer segmentation had a big impact in grouping the customers and narrowing down the analysis among the clusters.
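A minimal sketch of the top-3 sub-category idea from this section is given below. It counts purchases per customer, keeps each customer's three most frequent sub-categories, and then tallies how often each sub-category appears in those lists within a segment. The transaction rows, customer IDs, and function names are hypothetical and only loosely mirror the categories named in the results; the report's actual tables and its "chain" analysis are not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (customer_id, segment, sub_category).
transactions = [
    ("c1", "RS", "Women"), ("c1", "RS", "Men"), ("c1", "RS", "Women"),
    ("c1", "RS", "Kids"), ("c2", "RS", "Men"), ("c2", "RS", "Women"),
    ("c3", "HS", "Personal Appliances"), ("c3", "HS", "Academic"),
    ("c3", "HS", "Personal Appliances"),
    ("c5", "HS", "Personal Appliances"),
    ("c4", "BH", "Academic"),
]


def top_subcategories(rows, n=3):
    """Return each customer's n most frequently purchased sub-categories."""
    per_customer = defaultdict(Counter)
    for customer, _segment, sub_cat in rows:
        per_customer[customer][sub_cat] += 1
    return {
        customer: [sub_cat for sub_cat, _ in counts.most_common(n)]
        for customer, counts in per_customer.items()
    }


def segment_preferences(rows, n=3):
    """Rank sub-categories by how many customers' top-n lists they appear in."""
    segment_of = {customer: segment for customer, segment, _ in rows}
    prefs = defaultdict(Counter)
    for customer, tops in top_subcategories(rows, n).items():
        prefs[segment_of[customer]].update(tops)
    return {
        segment: [sub_cat for sub_cat, _ in counts.most_common(n)]
        for segment, counts in prefs.items()
    }


print(top_subcategories(transactions))
print(segment_preferences(transactions))
```

Offers for a segment could then be drawn from that segment's preferred sub-categories, which is the final step the approach describes.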
Fig. 16. Left - Cluster Regular Shoppers and Right - Cluster Bargain Hunters and High Spenders

VI. DISCUSSION
To summarize, three problems are addressed in this report. The customer segments are obtained from the K-means algorithm and yield 3 categories: Bargain Hunters (BH), High Spenders (HS), and Regular Shoppers (RS). This looks very relevant, as around 41% of transactions have a basket value higher than the average basket value for the given period. Similarly, around 33% of all customers visited more often than the average frequency of customer visits. For a retail store, managing inventory and logistics plays a vital role in the movement of goods. Hence, forecasting techniques help retail stores to be well prepared for a large inflow of customers during certain periods. The final problem suggests the kind of products to be promoted to customers. A brief look at the sales of different sub-categories completes this analysis. The products suggested to the regular shoppers have the highest sales share in the retail store, while the ones suggested to the high spenders and bargain hunters have the lowest share. This suggests that the latter are very seasonal and are quickly grabbed by customers to take advantage of offers.

VII. MEMBER CONTRIBUTIONS
• Badrinarayanan R – Studied and implemented different time series models by reading up on different techniques in the domain of retail forecasting – 28%
• Anirudh Sharma – Worked on the clustering methodologies by understanding the type of data available and the different kinds of distance measures to be considered for the problem – 28%
• Anunay Sharma – Discussed and carried out the product sub-category recommendation analysis through several logical formulations – 28%
• As a team – Exploratory data analysis to get a sense of the data – 5% each

VIII. ACKNOWLEDGMENT
We would like to thank the TAs and the Professor for guiding us and providing suggestions on framing different problem statements during the initial review session.

REFERENCES
[1] N. T. Karim et al., "Customer and target individual face analysis for retail analytics," 2018 International Workshop on Advanced Image Technology (IWAIT), pp. 1-4, 2018.
[2] [Online]. Available: https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb
[3] T.-M. Choi, Y. Yu, and K.-F. Au, "A hybrid SARIMA wavelet transform method for sales forecasting," Decision Support Systems, vol. 51, no. 1, pp. 130-140, 2011.
