br17 Final Project Report
Abstract—The retail industry is a major sector of the economy, comprising companies that sell finished products to customers. Around 66% of the U.S. gross domestic product (GDP) comes from retail consumption. The industry started with brick-and-mortar retailers selling products from physical locations where customers purchase onsite. The last 7 years have seen the rise of e-tailers, whose products are promoted online and arrive at the customer's doorstep with a quick turnaround time. Online purchases take place at the tap of a button on a mobile device and make customers' lives easier than ever. Although the internet era has made the retail segment more convenient, retailers face day-to-day challenges across multiple domains that need more than human prediction to be managed successfully and to yield an efficient and sustainable solution considering factors like time, money, and volume. Statistical and machine learning models are now widespread and have entered the retail field as well. Retail data is an invaluable resource and is fed as input to these analytical models to derive actionable insights for retail stores. The scope of this project covers the primary challenges in retail: customer segmentation, inventory and sales planning, and promotion strategies. Each of these problems is broken down to understand the existing scenario and is then treated with a solution that has a solid statistical foundation.
I. INTRODUCTION
Retailing is an integral part of modern society. Consumers depend heavily on retail stores (both online and offline) for their various goods and service requirements. Earlier, goods and services were made available through trading. Today trading has been replaced by buying and selling goods, which makes retail stores an important part of the supply chain and an attractive business sector for many. With a presence worth several trillion dollars, retail remains a significant contributor to the global economy [1]. With recent technological advancements, analytics in retail has gained much popularity for helping businesses make key decisions. It enables the retailer to come up with standard methodologies that dissect the customer segments and product categories and can help boost revenue. This project puts various concepts of retail analytics to use, providing analytical insights that can be essential for making marketing and procurement decisions. We aim to leverage several machine learning techniques to analyze the retail business for a single retailer.
After careful understanding of the data, we have decided to answer 3 main questions that could prove beneficial for the retail store owner of this data set:
1) Customer Segmentation: Grouping customers into several clusters based on their purchase attributes and other relevant features, thereby enabling the client to target a precise set of audiences for relevant promotions and offers (targeted marketing).
2) Sales Forecasting: Analyzing the trends in the data to predict the sales of products to be sold in the upcoming weeks for the retail company.
3) Affinity/Purchase/Basket Analysis: Exploring the customer purchase behavior on a day-to-day basis to understand product affinity (i.e., Product X sells along with Y/Z).
There are precedent examples of various companies using these techniques successfully to enhance profits. Infiniti Research, a leading market intelligence solutions provider located in London, recently announced the completion of its latest success story on market segmentation analysis. A leading player in the money transfer market wanted to conduct a global agent satisfaction survey in a set of European countries. With the help of a customer intelligence solution, the client was able to improve its agent management program by gaining an understanding of what the agents value. They were able to ensure a better satisfaction level and hence improve sales and customer satisfaction. Sales forecasting is used by many e-commerce giants; Amazon, for example, offers an automated solution called Amazon Forecast, which uses time-series data to predict future sales and product demand.
There will be an effort to cover all the problems proposed herein in as much depth as possible and to provide insights that can be beneficial to the retail store owners.
II. EXPLORATORY ANALYSIS AND CHALLENGES
The dataset for this project is sourced from a repository on the Kaggle website. No problem statement has been created for this data, nor has any analysis been done on it to generate insights. All the approaches and problems which will be discussed in this paper are self-defined and were not inspired
by any source. There are a total of 3 different data sources, namely the customer data, the product mapping data, and the transaction data for each customer who visited the store. The transaction data records a total of 4 years of data for analysis and modeling.
As a first step, we perform exploratory data analysis to find patterns in the data. Fig. 1 shows the sales performance of different store types across months since 2011. The e-shop dominates with twice the sales of the others. This is quite reasonable, as this was around the time when customers were shifting to e-commerce platforms like Amazon and eBay. An interesting observation from the graph is the sudden decline in sales from 2014. This phenomenon is not temporary, as it continues until the end of 2014.
Fig. 1. Sales trend across months for different store types
Fig. 2. Total sales amount for different product categories filtered on gender and age group
Fig. 2 indicates that middle-aged customers are the major contributors to sales across all product categories, which is expected as people generally become capable of spending high amounts around this age. Another observation from this visualization is that the books and bags segments are respectively the highest and lowest sales categories for both genders (male and female). From Fig. 3 it can be seen that teleshops and e-shops contribute more than 60% of the transactions in the retail store.
A. Roadblocks
One of the major roadblocks in any data science problem is data discrepancy. Some of the discrepancies found are as follows:
• The data of 2014 seems very suspicious, as there is a 144% decrease in the overall sales.
• There are transactions where the quantity is negative. These could be treated as customer returns, but no earlier transactions were found in which the customer had purchased the returned item.
• There is no information on multiple purchases made by the customer in a single transaction. This limits the ability to provide a product/SKU-level recommendation.
B. Measures
To keep the profiling and modelling process unbiased, we took certain measures to transform the data and prepare it for further work:
• Restrict the data to 2011-2013
• Remove all transactions with negative quantities
• Winsorize the most significant metrics (sales, number of transactions, and quantity). Winsorization is a transformation that limits the extreme values of the data to reduce the effect of outliers
III. PROBLEM 1 - CUSTOMER SEGMENTATION
Customer segmentation allows the retail stores to learn a great deal about their customers so that they can cater to their needs more efficiently. It also allows them to tailor their communication to the customer's life cycle and to prepare a better acquisition strategy. The method used for this problem is K-means clustering. The first step was feature selection. Since there were only a few numerical columns in the data set, an iterative approach was implemented to select the germane variables for clustering.
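The data-preparation measures listed under B. Measures can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record fields (`date`, `qty`, `sales`), the 5th/95th percentile cut-offs, and the toy rows are all assumptions, since the report does not state the limits that were used.

```python
from datetime import date

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the chosen percentiles to those percentile values."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[lower_pct * (n - 1) // 100]   # value at the lower percentile
    hi = ordered[upper_pct * (n - 1) // 100]   # value at the upper percentile
    return [min(max(v, lo), hi) for v in values]

def prepare(transactions):
    """Keep 2011-2013 rows with non-negative quantity, then winsorize sales."""
    kept = [t for t in transactions
            if 2011 <= t["date"].year <= 2013 and t["qty"] >= 0]
    capped = winsorize([t["sales"] for t in kept])
    return [dict(t, sales=s) for t, s in zip(kept, capped)]

# Hypothetical toy transactions (field names are assumed, not from the dataset).
rows = [
    {"date": date(2011, 3, 1), "qty": 2, "sales": 40.0},
    {"date": date(2012, 7, 9), "qty": 1, "sales": 55.0},
    {"date": date(2013, 1, 5), "qty": 3, "sales": 5000.0},   # extreme value
    {"date": date(2014, 2, 2), "qty": 1, "sales": 30.0},     # outside 2011-2013
    {"date": date(2012, 8, 1), "qty": -1, "sales": -20.0},   # negative quantity
]
prepared = prepare(rows)
print([r["sales"] for r in prepared])
```

The same clamping would be applied to the number of transactions and the quantity; only sales is shown to keep the sketch short.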
A. Approach
The k-means algorithm works as follows:
• Step 1: Randomly choose k data points (seeds) to be the initial centroids, i.e., cluster centers
• Step 2: Assign each data point to the closest centroid
• Step 3: Re-compute the centroids using the current cluster members
• Step 4: If a convergence criterion is not met, go to Step 2; the algorithm continues until convergence is met
Stopping/Convergence criterion:
• No (or minimal) re-assignments of data points to different clusters
• No (or minimal) change of centroids
• Minimal decrease in the sum of squared error (SSE)
$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \qquad (1)$$
$$m_j = \frac{1}{n_j} \sum_{x \in C_j} x \qquad (2)$$
where,
$C_j$ is the $j$-th cluster,
$m_j$ is the centroid of cluster $C_j$,
$n_j$ is the number of points in cluster $C_j$,
$\mathrm{dist}(x, m_j)$ is the distance between data point $x$ and centroid $m_j$ (generally Euclidean)
The elbow method runs k-means clustering on the data set for a range of values of k (from 1 to 10) and computes a distortion score for each value of k. The distortion score (also called inertia) is the sum of squared distances from each point to its assigned center. As the drop in inertia is not significant after the 3rd cluster, the optimum number of clusters selected was 3.
B. Results
Fig. 5 and Fig. 6 show the clusters in a 3D space. Although there is some overlap between the clusters, a decision boundary can be seen. After analyzing the data points in these clusters carefully, it was inferred that the three clusters are the Bargain Hunters/seasonal purchasers (BH) (lower basket value and fewer transactions), the High Spenders (HS) (higher basket value), and the Regular Shoppers (RS) (more frequent purchases). This makes sense from a business perspective and can be a great insight for the retail store when running promotions or offers.
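The four k-means steps, the SSE of Eq. (1), the centroid update of Eq. (2), and the elbow scan from the Approach can be sketched in plain Python as below. It is an illustrative sketch under assumptions: the two toy features (basket value and visit frequency), the number of random restarts, and the seeds are hypothetical, and the project itself may well have used a library implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment, sse) for one run of k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                      # Step 1: random seeds
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                          for p in points]                 # Step 2: closest centroid
        if new_assignment == assignment:                   # Step 4: no re-assignments
            break
        assignment = new_assignment
        for j in range(k):                                 # Step 3: Eq. (2)
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))  # Eq. (1)
    return centroids, assignment, sse

def elbow(points, k_values, n_init=10):
    """Lowest SSE found for each k over n_init random restarts."""
    return {k: min(kmeans(points, k, seed=s)[2] for s in range(n_init))
            for k in k_values}

# Hypothetical customers as (basket value, visit frequency) pairs.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (1, 10), (1, 11), (2, 10), (2, 11)]
for k, sse in elbow(points, range(1, 6)).items():
    print(k, round(sse, 2))
```

Plotting the printed SSE against k and looking for the point where the drop flattens is the elbow reading described above; the restarts are there because a single random initialization can settle in a poor local optimum.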
where,
$y^{(i)}$ is the dependent variable
$x^{(i)}$ is a matrix of independent variables
$w$ and $b$ are the coefficient and the intercept for the equation
There are five principal assumptions which justify the use of linear regression models for purposes of prediction:
• Linearity: The relationship between the dependent and independent variables should be linear
• Homoscedasticity: The variance of the residuals should be the same for any value of the independent variable
• No autocorrelation: The error terms should not be correlated with one another
• Normality: The errors should be normally distributed
• Independence: The independent variables are independent of each other
Time Series Analysis
Time series analysis refers to the analysis of data points collected at constant intervals to determine the behavior or pattern of the variables. This analysis helps us move towards the goal of forecasting and prediction. What distinguishes time series analysis from linear regression is the time dependency: the observations are dependent on one another.
An important criterion that needs to be satisfied to use the time series formulation is stationarity. The assumption
Fig. 8. Components of time series
B. Methodology
In this project, we started the modeling process with both of the algorithms mentioned above. The special class of time series models implemented is ARIMA (Auto-Regressive Integrated Moving Average). These models use the lag terms of the variable along with the forecast errors as the only predictors for forecasting.
Our main objective is to forecast the sales value of the retail store. We estimate the sales at a weekly level, since a finer granularity gives more accurate results. There are a total of 155 weeks of data from 2011-2013. This data goes through a series of steps to reach the final predictions, as explained in Fig. 9:
• The dataset is split into a training set and a testing set. The training set is the set on which the algorithm trains by capturing the variations in the past. The test set serves as the validation set, on which we get to understand how good our model is. As a rule of thumb, training and testing sets are split 80:20.
• The data is then tested against the assumptions of linear regression and time series analysis. Here, the assumptions of linear regression are not met due to the presence of autocorrelation, but we still force-fit the model to obtain a baseline score from the most classic statistical procedure
• The ARIMA model has several parameters:
AR term (p) is the number of lagged observations to be considered in the model.
MA term (q) is the number of lagged forecast errors to be taken into account.
Differencing term (d) is the number of times the observations are differenced to make the time series stationary. All these parameters are derived from the Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots shown in Fig. 10
• From the plots, we could see that the values of p and q should be approximately 3 or 4; the final values are found by iterating over different combinations
• Various accuracy metrics are captured to understand the performance of both models.
Fig. 10. PACF and ACF plots
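Several of the steps above (the chronological 80:20 split, the force-fitted linear baseline, differencing for the d term, the autocorrelations behind an ACF plot, and accuracy metrics) can be sketched with the standard library alone. This is a sketch under assumptions: the weekly sales series is synthetic, RMSE and MAPE are chosen as example metrics because the report does not name the ones it used, and fitting ARIMA itself is not shown here because it would normally be delegated to a forecasting library.

```python
from math import sqrt

def chronological_split(series, train_pct=80):
    """Split a time-ordered series without shuffling (80:20 by default)."""
    cut = len(series) * train_pct // 100
    return series[:cut], series[cut:]

def difference(series, d=1):
    """Apply first-order differencing d times (the d term of ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)) / var
        for lag in range(max_lag + 1)
    ]

def fit_linear_trend(train):
    """Ordinary least squares of sales on the week index; returns (w, b)."""
    n = len(train)
    t_mean = (n - 1) / 2
    y_mean = sum(train) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(train))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    w = sxy / sxx
    return w, y_mean - w * t_mean

def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical weekly sales: upward trend plus a repeating 4-week pattern.
sales = [100 + 2 * t + 10 * ((t % 4) - 1.5) for t in range(155)]
train, test = chronological_split(sales)
w, b = fit_linear_trend(train)
baseline = [w * t + b for t in range(len(train), len(sales))]
print(len(train), len(test))
print("baseline RMSE:", round(rmse(test, baseline), 2),
      "MAPE %:", round(mape(test, baseline), 2))
print("ACF of differenced train:", [round(r, 2) for r in acf(difference(train), 4)])
```

The split is taken in time order rather than at random so that the test weeks always lie after the training weeks, which is what a forecast has to face in practice.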
Fig. 12. Parameter Tuning - SARIMA
model is greatly reduced. Also, the p-values of the variables are below the standard significance level of 0.05. Thus, the SARIMA model gives the best results for the retail sales data, and the forecast is shown in Fig. 14.
Fig. 14. SARIMA Forecast
V. CUSTOMER PRODUCT AFFINITY
Till now, two main aspects of retail marketing were covered. The former deals with the dissection of customers and the
A. Approach
The approach to this problem is to understand the purchases of different sub-categories by each customer of the retail store. A new line of analysis proposed from our end involves looking at the top 3 sub-category purchases by each customer along with the chain of sub-categories purchased by them. To make this recommendation more accurate, we leverage the customer segmentation to provide recommendations to different groups. The final step is to provide offers on the set of preferred sub-categories to specific customers. Fig. 15 illustrates this.
B. Results
The top 2 tables in Fig. 16 show the purchase analysis for the segment Regular Shoppers (RS). It is seen that the top 3 and the chain correspond to women, men, and kids. This lets us conclude with a certain confidence that the Regular Shoppers are more inclined towards the apparel section. Similarly, for the bottom 2 tables, the chain is more in line with the top 2 and 3 purchases, which show the personal appliances and academic categories. This behavior is exhibited by both Bargain Hunters (BH) and High Spenders (HS), and providing such suggestions can have high sales conversions, eventually leading to higher profits for the retail store. It is noted that the customer segmentation has had a bigger impact in grouping the customers and narrowing down the analysis among the clusters.
Fig. 16. Left - Cluster Regular Shoppers and Right - Cluster Bargain Hunters and High Spenders
VI. DISCUSSION
To summarize, a total of three problems are addressed in this report. The customer segments are obtained from the K-means algorithm and yield 3 categories: Bargain Hunters (BH), High Spenders (HS), and Regular Shoppers (RS). This looks very relevant, as around 41% of transactions have a basket value higher than the average basket value for the given period. Similarly, around 33% of all customers visited more often than the average frequency of customer visits. For a retail store, managing inventory and logistics plays a vital role in the movement of goods. Hence, forecasting techniques help retail stores to be well prepared for a large inflow of customers during certain periods. The final problem suggests the kind of products to be promoted to customers. A brief look at the sales of different sub-categories completes this analysis. The products suggested to the regular shoppers have the highest sales share in the retail store, while the ones suggested to the high spenders and bargain hunters have the lowest share. This suggests that the latter are very seasonal and are quickly grabbed by customers to take advantage of offers.
VIII. ACKNOWLEDGMENT
We would like to thank the TAs and the Professor for guiding us and providing suggestions on framing different problem statements during the initial review session.
REFERENCES
[1] N. T. Karim et al., "Customer and target individual face analysis for retail analytics," 2018 International Workshop on Advanced Image Technology (IWAIT), pp. 1-4, 2018.
[2] [Online]. Available: https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb
[3] T.-M. Choi, Y. Yu, and K.-F. Au, "A hybrid SARIMA wavelet transform method for sales forecasting," Decision Support Systems, vol. 51, no. 1, pp. 130-140, 2011.