Walmart Capstone Project

The Walmart Capstone Project aims to analyze weekly sales data across various stores to identify factors affecting sales, such as unemployment rates, seasonal trends, temperature, and the Consumer Price Index (CPI). The project involves statistical analysis and predictive modeling to forecast future sales, helping Walmart manage inventory and make informed investment decisions. Key findings include the impact of holidays on sales, correlations between sales and external factors, and the identification of top and worst-performing stores.

Uploaded by

Matthews Zen
WALMART

CAPSTONE PROJECT

Project by
Lakshmanan Ravi
Table of Contents: -

Index  Content

1   Problem Statement
2   Dataset Information
3   Task to be completed
4   Understanding, Cleaning, and Exploring Data
5   TASK - 1: Weekly sales vs. unemployment rate for various stores and their correlation
6   TASK - 2: Finding the seasonal trends in the weekly sales
7   TASK – 3: Weekly Sales vs. Temperature and its correlation on various stores
8   TASK – 4: Weekly Sales vs. CPI and its correlation on various stores
9   TASK – 5: Best and Worst Performing stores and corresponding difference between them
10  TASK – 6: Predicting and forecasting Weekly Sales of each individual store using ARIMA and SARIMAX Models
11  Reference
Problem Statement: -
A retail store that has multiple outlets across the country is facing issues
in managing the inventory - to match the demand with respect to supply.

There are many reasons why sales can be significantly higher or lower than average. If the
company does not know about these seasonal swings, it can lose a lot of money. Predicting future
sales is one of the most crucial planning tasks for a company. Sales forecasting helps the
company arrange stock, calculate revenue, and decide whether to make a new investment.
Another advantage of knowing future sales is that achieving predetermined targets from the
beginning of the season can have a positive effect on stock prices and investors' perceptions.
Conversely, not reaching the projected target could significantly damage stock prices, which
is a big problem for a company as large as Walmart.

Aim:
My aim in this project is to build a model which predicts sales of the stores. With this model,
Walmart authorities can decide their future plans which are very important for arranging
stocks, calculating revenue, and deciding whether to make new investments or not.

Advantages of forecasting:

With accurate predictions, the company can: -

• Determine seasonal demand and act on it
• Protect against financial loss, because achieving sales targets can have a positive effect
on stock prices and investors' perceptions
• Forecast revenue easily and accurately
• Manage inventories
• Run more effective campaigns

Note: - Forecasting of the weekly_sales has been provided at the end of this report.

Dataset Information:

The walmart.csv contains 6435 rows and 8 columns.

Feature Name     Description

Store            Store number
Date             Week of Sales
Weekly_Sales     Sales for the given store in that week
Holiday_Flag     If it is a holiday week
Temperature      The temperature on the day of the sale
Fuel_Price       Cost of fuel in the region
CPI              Consumer Price Index
Unemployment     Unemployment Rate

Note: -

1) CPI stands for Consumer Price Index, and it is a commonly used measure of inflation
in the United States and many other countries.

2) Holiday_Flag – In this dataset, the complete week is shown as holiday rather than the
individual day.

3) Store number indicates the various store outlets present across the USA

Task to be completed:

1. You are provided with the weekly sales data for their various outlets. Use statistical
analysis, EDA, and outlier analysis, and handle the missing values to come up with various
insights that can give them a clear perspective on the following:

a. If the weekly sales are affected by the unemployment rate, if yes - which stores
are suffering the most?

b. If the weekly sales show a seasonal trend, when and what could be the reason?

c. Does temperature affect the weekly sales in any manner?

d. How is the Consumer Price Index affecting the weekly sales of various stores?

e. Top performing stores according to historical data.

f. The worst performing store, and how significant is the difference between the
highest and lowest performing stores.

2. Use predictive modeling techniques to forecast the sales for each store for the next 12
weeks.

Understanding, Cleaning, and Exploring Data:

1. You are provided with the weekly sales data for their various outlets. Use statistical
analysis, EDA, and outlier analysis, and handle the missing values to come up with various
insights that can give them a clear perspective on the following:

Importing Libraries: -

Importing Dataset: -
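
The original import and loading code is not reproduced in this text, so the following is only a minimal sketch of how the file might be read with pandas. The helper name load_walmart, the day-month-year date format, and the two sample rows (made-up values that merely mimic the eight columns of walmart.csv) are assumptions for illustration.

```python
import io

import pandas as pd


def load_walmart(source):
    """Read the weekly sales file and parse the Date column (hypothetical helper)."""
    df = pd.read_csv(source)
    # Assumption: dates are stored as day-month-year strings such as 05-02-2010.
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
    # Sort so that each store's weeks appear in chronological order.
    return df.sort_values(["Store", "Date"]).reset_index(drop=True)


# Illustrative in-memory sample with the same eight columns (values are made up).
sample_csv = io.StringIO(
    "Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment\n"
    "1,12-02-2010,1500000.0,1,40.0,2.5,211.0,8.1\n"
    "1,05-02-2010,1600000.0,0,42.0,2.6,211.1,8.1\n"
)
df = load_walmart(sample_csv)
print(df.shape)
```

With the real file, the call would presumably be load_walmart("walmart.csv"), which should report the 6435 rows and 8 columns stated above.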

Overview of the Dataset: -
To look at the overall structure of the dataset, let us use the following functions: -

1) info(): - This function reports:
• The total number of entries (rows)
• The data type of each column
• The number of non-null values in each column
• Memory usage information

This function is very helpful in the initial stages of data analysis and preprocessing, as it gives
you a quick overview of the dataset's structure, which can help you identify potential issues
such as missing values or unexpected data types.

2) describe(): -

This function provides a quick way to understand the basic statistics of your data (count, mean,
standard deviation, minimum, quartiles, and maximum), which can help you identify outliers,
skewed distributions, and other patterns that might be present in your dataset. Keep in mind
that describe() summarizes only numeric columns by default.

3) Basic Functions: -

• shape: - To know the total number of rows and columns in the DataFrame.
• duplicated(): - To check whether our data contains any duplicated rows; the index is not
included in the comparison.
• isnull(): - To find whether any NaN (missing) values are present in our dataset.
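
As a rough sketch of how these three checks can be combined, the snippet below wraps them in one helper. The function name basic_checks and the small demo frame are hypothetical and are not taken from the report.

```python
import numpy as np
import pandas as pd


def basic_checks(df):
    """Collect shape, duplicate-row count and missing-value counts in one dict."""
    return {
        "shape": df.shape,                                  # (rows, columns)
        "duplicate_rows": int(df.duplicated().sum()),       # fully repeated rows
        "missing_per_column": df.isnull().sum().to_dict(),  # NaN count per column
    }


# Made-up frame: one duplicated row and one missing value.
demo = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Weekly_Sales": [100.0, 100.0, 250.0, np.nan],
})
report = basic_checks(demo)
print(report)
```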

4) Heatmap & corr(): -

Before proceeding further, let us find out whether any correlation exists between these
columns. A heatmap of the corr() output is a convenient way to visualize and understand the
correlation between the columns of the DataFrame.
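
A minimal sketch of the correlation step is shown here on synthetic data, since the report's own heatmap code is not available in this text. The demo columns reuse names from the dataset, but the values are randomly generated purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
unemployment = rng.normal(8.0, 1.0, n)

# Synthetic data: sales are built to fall as unemployment rises.
demo = pd.DataFrame({
    "Weekly_Sales": 1_000_000 - 50_000 * unemployment + rng.normal(0, 10_000, n),
    "Unemployment": unemployment,
    "Temperature": rng.normal(60.0, 15.0, n),
})

# Pairwise Pearson correlation between the numeric columns.
corr = demo.select_dtypes("number").corr()
print(corr.round(2))
```

The resulting matrix can then be drawn as a heatmap, for example with seaborn's heatmap function, which is presumably how the figure in the report was produced.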

Understanding the Dataset: -

Let us start to understand our dataset by viewing the table. Here we can see that the
weekly sales of the WALMART stores across the United States are given for the period
from 2010 to 2012.

Here we notice that the dataset covers 45 WALMART stores, which was observed by
grouping on the ‘Store’ column.

We have been provided with four external parameters: ‘Unemployment’, ‘Temperature’,
‘CPI’, and ‘Fuel_Price’.

Our task is to analyze these parameters, to find whether each parameter has any effect on
the overall weekly sales of the individual stores, and to forecast the weekly sales for the next
few weeks based on the historical data.

Step 1: - Create a new column that assigns each date to a quarter of the year: Q1, Q2,
Q3, or Q4.

Step 2: - Merge the year and quarter columns and create a new column named yearly_quarter.

Step 3: - Create a new DataFrame 'quarterwise_sales' by grouping on Store and yearly_quarter
and aggregating with the mean of Weekly_Sales, Unemployment, Temperature, CPI, and
Fuel_Price, to understand how each individual store behaves on the various parameters in
each quarter of the year.
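
Steps 1 to 3 can be sketched roughly as follows. The helper names, the underscore separator in yearly_quarter, and the demo rows are assumptions made for illustration; only the column names come from the dataset description.

```python
import pandas as pd


def add_quarter_columns(df):
    """Steps 1 and 2: add quarter (Q1..Q4) and yearly_quarter (e.g. 2010_Q1)."""
    out = df.copy()
    out["year"] = out["Date"].dt.year
    out["quarter"] = "Q" + out["Date"].dt.quarter.astype(str)
    # Assumption: year and quarter are joined with an underscore.
    out["yearly_quarter"] = out["year"].astype(str) + "_" + out["quarter"]
    return out


def quarterwise_means(df):
    """Step 3: mean of each parameter per store and quarter."""
    cols = ["Weekly_Sales", "Unemployment", "Temperature", "CPI", "Fuel_Price"]
    return df.groupby(["Store", "yearly_quarter"], as_index=False)[cols].mean()


# Made-up rows for two stores.
demo = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-03-05", "2010-05-07", "2010-02-05"]),
    "Weekly_Sales": [100.0, 200.0, 300.0, 50.0],
    "Unemployment": [8.0, 8.0, 7.8, 9.0],
    "Temperature": [40.0, 50.0, 70.0, 35.0],
    "CPI": [211.0, 211.2, 211.5, 126.0],
    "Fuel_Price": [2.5, 2.6, 2.8, 2.7],
})
quarterwise_sales = quarterwise_means(add_quarter_columns(demo))
print(quarterwise_sales)
```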

Step 4: - Before starting our observation of the individual stores, let us visualize the overall
trend of the various parameters by yearly_quarter, irrespective of the store.

a) Correlation of the various parameters after aggregating the values by quarter of the
year.

Note: - What can we interpret from the above heatmap?

Initially, when we took the raw data, we could not find any clear correlation between the
various parameters. After aggregating the data by quarter, taking the mean values of the
parameters, and combining them irrespective of the store, we find that trends do exist.
Let us visualize the trends of the parameter values: -

MinMaxScaler(): - Let us use min-max scaling to normalize the parameter values so
that we can plot them on the same scale.
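
Min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min), which is also what scikit-learn's MinMaxScaler computes with its default settings. The sketch below applies the formula directly with pandas; the helper name and the three quarterly values are made up for illustration.

```python
import pandas as pd


def min_max_scale(frame):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())


# Made-up quarterly means on very different scales.
trend = pd.DataFrame(
    {"Weekly_Sales": [1.0e6, 1.2e6, 1.5e6], "Unemployment": [8.5, 8.0, 7.5]},
    index=["2010_Q1", "2010_Q2", "2010_Q3"],
)
scaled = min_max_scale(trend)
print(scaled)
```

Note that a column with a constant value would give a zero denominator and therefore NaN values in this simple version.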

b) Sales and Unemployment rate: - The unemployment rate gradually decreased
from 2010 to 2012. Sales usually increased in the fourth quarter of the year. The main
reason for this is the festival season (Christmas) and the year-end. The effect of
holidays on the weekly sales of the stores is discussed in more detail later in this
report.

c) CPI and Sales: - CPI stands for Consumer Price Index, and it is a commonly used
measure of inflation in the United States and many other countries. The CPI tracks the
average change over time in the prices paid by urban consumers for a market basket
of consumer goods and services, such as food, clothing, rent, healthcare, and
entertainment. We can see that the CPI increases in each quarter.

d) Fuel price and sales: - The fuel price gradually increased from 2010 to 2012. There was
a peak in Q2 of 2011, which may be due to Middle East supply disruptions and
increased demand from an improving global economy. (Source: New York Times)

e) Temperature and sales: -

Step 5: - Let us analyze the correlation between the various parameters and the
quarterly sales of individual stores.

Note: - There are 45 stores in the given dataset, and we have grouped the dataset by store
and quarter for each store. We have not used the weekly sales of the stores because
the CPI and unemployment rate are given for each quarter of the year, so it is more meaningful
to consider the correlation of the various parameters on a quarterly cycle to get a better
understanding of the dataset.
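
The per-store correlation described in Step 5 might be computed along the following lines. The function name store_correlations and the demo values are hypothetical; the idea is simply one Pearson correlation per store between quarterly mean sales and a chosen parameter.

```python
import pandas as pd


def store_correlations(quarterly, target="Weekly_Sales", factor="Unemployment"):
    """Pearson correlation between target and factor, computed store by store."""
    rows = {
        store: grp[target].corr(grp[factor])
        for store, grp in quarterly.groupby("Store")
    }
    # Sort so the most negatively correlated stores come first.
    return pd.Series(rows, name=f"corr_{factor}").sort_values()


# Made-up quarterly means: store 1 sells more as unemployment falls,
# store 2 sells more as unemployment rises.
demo = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0, 10.0, 12.0, 14.0, 16.0],
    "Unemployment": [9.0, 8.5, 8.0, 7.5, 7.0, 7.5, 8.0, 8.5],
})
corr_by_store = store_correlations(demo)
print(corr_by_store)
```

Passing factor="CPI" or factor="Temperature" would give the corresponding tables for the other parameters.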

Note: - Here store no 44 is highly negatively correlated, whereas store no 36 is highly
positively correlated.

Correlation Table for 45 stores: -

TASK – 1
a. If the weekly sales are affected by the unemployment rate, if yes - which stores
are suffering the most?

1) Initially, when we look at the heatmap, we cannot find any clear relationship between
weekly sales and the unemployment rate, since the correlation between them is only
-0.11 (a very weak negative correlation).

2) When we analyze the various parameters at a quarterly level, irrespective of
the store, we can see certain trends in the unemployment rate, fuel price, and CPI.

Note: The unemployment rate decreased while the CPI and fuel prices increased over the
period from 2010 to 2012.

3) Now let us look at the correlation of the unemployment rate with sales for each store.
Here we can find that certain stores are affected by the unemployment rate.

Here, store numbers 44, 38, 42, 39, 41, 37, and 4 show a negative correlation with sales,
which means that as the unemployment rate increases, the sales of these stores usually
decrease. On the other hand, store numbers 35 and 36 show a positive correlation, which means
that as the unemployment rate decreases, the sales of these stores usually decrease.

Let us visualize the store numbers 44 and 36 with respect to sales: -

Conclusion: -
Store numbers 44, 38, 42, 39, 41, 37, and 4 suffer the most from the unemployment rate:
as the unemployment rate increases, the sales of these stores decrease. One possible
reason is that these stores are located near corporate offices.

It is interesting to note that stores 35 and 36 show a positive correlation, which means the
sales of these stores decrease as the unemployment rate decreases. These stores may be
located near a tourist spot, since many people do not tend to visit such locations during
working days.

TASK - 2
b) If the weekly sales show a seasonal trend, when and what could be the reason?

Yes, weekly sales show seasonal trends. We can notice that sales increase at the end
of each year. This is mainly because of Thanksgiving, Christmas, and New Year's Day. The
seasonality in the sales mainly depends on holidays and non-holidays.

Let us analyze the sales of the stores in terms of holidays and non-holidays.

Note: - The record shown above is the total sales of all the stores combined during working
days and on holidays.

Let us analyse the sales of individual stores on holiday and on working days: -

Sum of total sales: - As there are far more working days than holidays, the total
sales will naturally be higher on working days, so for a fairer comparison let us view the
mean sales of individual stores during working days and holidays.

Mean Sales of individual stores: - Here we can see that in most of the stores the mean
sales are higher on holidays than on working days.
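
A small sketch of the mean comparison is given below. It assumes Holiday_Flag takes the values 0 and 1, as the dataset description suggests; the helper name, the output column names, and the demo numbers are illustrative only.

```python
import pandas as pd


def mean_sales_by_holiday(df):
    """Mean weekly sales per store, split into holiday and non-holiday weeks."""
    table = df.pivot_table(
        index="Store", columns="Holiday_Flag", values="Weekly_Sales", aggfunc="mean"
    )
    # Assumption: 0 marks a non-holiday week and 1 marks a holiday week.
    return table.rename(columns={0: "non_holiday_mean", 1: "holiday_mean"})


# Made-up weeks for two stores.
demo = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Holiday_Flag": [0, 0, 1, 0, 0, 1],
    "Weekly_Sales": [100.0, 120.0, 150.0, 80.0, 100.0, 70.0],
})
table = mean_sales_by_holiday(demo)
print(table)
```

Comparing the two columns row by row shows which stores sell more, on average, in holiday weeks.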

Four holiday weeks are considered here: ‘Super Bowl’, ‘Labour Day’, ‘Thanksgiving’, and
‘Christmas’ week. There are 10 holiday weeks present in the dataset over the period from
2010 to 2012.

An extra column, Holidays, is added to our DataFrame to make the kind of holiday explicit
and for visualization purposes.
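
The report does not show how the Holidays column was built. One possible sketch, under the assumption that each flagged week can be identified by its calendar month (February for Super Bowl, September for Labour Day, November for Thanksgiving, December for Christmas), is the following. The mapping, the helper name, and the demo dates are all assumptions.

```python
import pandas as pd

# Assumption: the month of a flagged week is enough to identify the holiday.
HOLIDAY_BY_MONTH = {2: "Super Bowl", 9: "Labour Day", 11: "Thanksgiving", 12: "Christmas"}


def add_holiday_name(df):
    """Add a Holidays column naming the holiday for weeks with Holiday_Flag == 1."""
    out = df.copy()
    names = out["Date"].dt.month.map(HOLIDAY_BY_MONTH)
    # Keep the holiday name only for flagged weeks; everything else is a normal week.
    out["Holidays"] = names.where(out["Holiday_Flag"] == 1, "Non-Holiday")
    # Flagged weeks in an unmapped month get a generic label.
    out["Holidays"] = out["Holidays"].fillna("Other Holiday")
    return out


# Made-up rows for illustration.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-12", "2010-02-19", "2010-11-26", "2010-12-31"]),
    "Holiday_Flag": [1, 0, 1, 1],
})
labelled = add_holiday_name(demo)
print(labelled)
```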

Note: - The figure shows the total sales corresponding to each holiday week for the years
2010 – 2012. One thing to notice here is that sales for ‘Christmas’ and ‘Thanksgiving’ are
not included for the year 2012.

Conclusion: -
Seasonality in the weekly sales has been observed, as there is a consistent peak in
sales in the fourth quarter (Q4) of each year. This is mainly because of the holiday weeks
in this quarter, namely ‘Thanksgiving’ and ‘Christmas’. We can also see that for most
stores the mean sales are higher in holiday weeks than in non-holiday weeks.

TASK - 3
c. Does temperature affect the weekly sales in any manner?

When we plot temperature vs. weekly sales for a single store (in this case let us consider
store number 28), no clear pattern can be read from the plot.

So let us instead compare the average monthly temperature with the mean of the weekly sales
for that month.
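
The monthly comparison can be sketched as below: group one store's rows by calendar month, average temperature and sales, and then correlate the two. The helper name monthly_profile and the demo readings are made up for illustration.

```python
import pandas as pd


def monthly_profile(df, store):
    """Mean temperature and mean weekly sales per calendar month for one store."""
    one = df[df["Store"] == store]
    monthly = one.groupby(one["Date"].dt.month)[["Temperature", "Weekly_Sales"]].mean()
    monthly.index.name = "month"
    return monthly


# Made-up weeks for store 28: cold months sell more than hot months.
demo = pd.DataFrame({
    "Store": [28, 28, 28, 28, 28],
    "Date": pd.to_datetime(
        ["2011-01-07", "2011-01-14", "2011-04-08", "2011-07-01", "2011-07-08"]
    ),
    "Temperature": [30.0, 34.0, 60.0, 90.0, 86.0],
    "Weekly_Sales": [200.0, 220.0, 160.0, 100.0, 120.0],
})
monthly = monthly_profile(demo, store=28)
temp_sales_corr = monthly["Temperature"].corr(monthly["Weekly_Sales"])
print(monthly)
print(temp_sales_corr)
```

A negative value of temp_sales_corr corresponds to the pattern described for store 28, where sales are higher in colder months.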

We can see that sales usually increase during November to January (winter), when
the temperature is lowest, and decrease considerably during May to July (summer).
This is the case for store number 28; let us find the correlation for the
other stores as well.

Store numbers 28, 10, 37, 12, 3, 8, and 34 show a negative correlation, which means the sales
of these stores are highest when the temperature is lowest, whereas store numbers
19, 44, and 26 show a positive correlation.

Let us further visualize sales vs. temperature for each quarter of the year
for store number 28.

Correlation on monthly basis: - Correlation on quarterly basis: -

Conclusion: -
Sales of certain stores do depend on the temperature, including store numbers
28, 10, 37, 12, and 3. The reason may be that holiday weeks fall in the winter season, or that
customer purchases usually increase in winter. These purchases may include
sweaters, heating equipment, and other accessories for the Christmas festival.

Store numbers 44 and 26 show a positive correlation, which means these stores benefit
during the summer season, possibly from products like ice cream, soft drinks, etc.

TASK – 4
d. How is the Consumer Price Index affecting the weekly sales of various stores?

CPI stands for Consumer Price Index, and it is a commonly used measure of inflation in the
United States and many other countries. The CPI tracks the average change over time in the
prices paid by urban consumers for a market basket of consumer goods and services, such
as food, clothing, rent, healthcare, and entertainment. We can see that the CPI index increases
in each quarter.

When we combine the overall CPI values of all the stores with their combined sales, we can
find that the CPI increases as the years pass, which means that inflation in the United
States increased from 2010 to 2012.

CPI is also used to gauge changes in purchasing power. A rising CPI together with rising
sales is consistent with increased economic activity in the USA over this period, although
the CPI alone does not establish it.

Our task is to find whether the sales of the individual stores are affected by the CPI values
and, if so, which stores are most strongly correlated with the CPI.

Here we can see that store numbers 38, 44, 42, 17, 39, 41, 37, and 4 show a positive correlation,
whereas store numbers 36, 14, and 35 show a negative correlation. It is to be noted that the
overall trend of the CPI is always increasing.

Let us visualize store numbers 36 and 38, with respect to their corresponding sales to get a
better understanding.

Note: - The CPI values increase for both store numbers 36 and 38, but store 36 sees a
negative impact on its weekly sales, whereas store number 38 sees a positive impact.

Conclusion: -

An increase in CPI generally indicates rising prices for goods and services. If wages and
income do not keep pace with inflation, consumers may experience a decrease in purchasing
power. This can lead to reduced spending on non-essential items, including retail goods,
which could negatively impact sales. Thus, store numbers 36, 14, and 35 have seen a negative
impact on sales.

Geographic and Demographic Variations: The impact of CPI on retail sales can also vary by
region and demographic group. Some areas or consumer segments may be more sensitive
to CPI changes than others.

But in general, many stores have seen sales rise along with the CPI. As economic
activity increases, the sales of the stores also increase.

TASK – 5

e. Top performing stores according to the historical data.


f. The worst performing store, and how significant is the difference between the
highest and lowest performing stores?

Store with Maximum Weekly Sales: -

Store with Minimum Weekly Sales: -

Difference between the highest and lowest performing stores: -

Conclusion: -

Here we assume that the top-performing stores are the ones with the highest total
sales and the worst-performing stores are the ones with the lowest total sales.

Store number 20 has the highest total sales of 301397792.46 dollars and store number 33 has the
lowest total sales of 37160221.96 dollars.

Difference in Sales amount = 301397792.46 - 37160221.96
= $264237570.50
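
As a sketch of how the ranking and the difference might be computed, the helper below sums weekly sales per store and picks the extremes. The function name and the demo rows are hypothetical; the two totals in the Decimal check are the ones quoted above, and Decimal is used only to avoid floating-point rounding noise in the subtraction.

```python
from decimal import Decimal

import pandas as pd


def rank_stores(df):
    """Return (best store, worst store, gap) based on total weekly sales."""
    totals = df.groupby("Store")["Weekly_Sales"].sum()
    best, worst = totals.idxmax(), totals.idxmin()
    return best, worst, totals[best] - totals[worst]


# Made-up weekly sales for three stores.
demo = pd.DataFrame({
    "Store": [20, 20, 33, 33, 5, 5],
    "Weekly_Sales": [300.0, 310.0, 40.0, 35.0, 100.0, 100.0],
})
best, worst, gap = rank_stores(demo)
print(best, worst, gap)

# Exact difference between the totals reported for stores 20 and 33.
reported_gap = Decimal("301397792.46") - Decimal("37160221.96")
print(reported_gap)
```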

TASK – 6

2. Use predictive modeling techniques to forecast the sales for each store for the next
12 weeks.

To analyze and forecast the sales for the upcoming 12 weeks for each individual store, we can
use ARIMA or SARIMAX models. Since our data shows yearly seasonality for each
individual store, we use the SARIMAX model for better forecasting.

Let us explore the candidate forecasting models on store number 1 and then apply the same
method to the remaining stores for a better result.

Sales for store 1 from 2010 to 2012: -

There is not much trend in the weekly sales for store 1, but seasonality does exist with a
period of one year, that is, 52 weeks (since the sales data is given as weekly sales).
Let us now check whether the sales series is stationary or non-stationary. We are
using the Adfuller method (the Augmented Dickey-Fuller test) to check stationarity. If the
p-value from the Adfuller method is more than 0.05, we treat the series as non-stationary.

The p-value obtained from the Adfuller method is much smaller than 0.05, so we can treat the
series as stationary. However, we can still see seasonality in the series, so let us
try to remove it.

Using the shift method with a lag of 52 (the number of weeks in a year), we difference each
value against the value 52 weeks earlier to remove the seasonality from our dataset.
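
Assuming the shift step is ordinary seasonal differencing (each value minus the value 52 weeks earlier), it can be sketched as follows. The helper name and the perfectly periodic synthetic series of 143 weeks are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def seasonal_difference(series, period=52):
    """Subtract the value one season earlier; the first `period` rows are dropped."""
    return (series - series.shift(period)).dropna()


# Synthetic series with an exact 52-week cycle and 143 weekly points.
weeks = np.arange(143)
sales = pd.Series(100.0 + 20.0 * np.sin(2 * np.pi * weeks / 52))
deseasonalised = seasonal_difference(sales)
print(len(deseasonalised), deseasonalised.abs().max())
```

Because the first 52 values have no earlier counterpart, 143 weekly points leave 91 differenced points.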

In a stationary series, the mean and variance remain constant over time, making it easier to
make predictions about future values. From the above plot, we can see that the mean of the
data points lies close to zero and the variance of the data points is almost constant.

Thus, we can conclude that we have obtained a stationary series, and we can now apply our
forecasting models to it to get a better result.

Many time series models, such as ARIMA (Autoregressive Integrated Moving Average),
assume stationarity in the data or require differencing to achieve stationarity.

From the above summary chart, we can find that our model may perform well with the
order (p, d, q) = (5,0,1). Let us check the ACF and PACF graphs to confirm it.

From the ACF and PACF graphs, we can interpret that the PACF graph has 6 major spikes,
suggesting an AR value of 6, and that the ACF graph suggests an MA value of 1, as the main
change occurs at point 2 (so either 1 or 3 could be taken). From the ACF and PACF graphs,
the order therefore tends to be (6,0,1). We will take this order for our model building.

Training and testing our Model: -

A) ARIMA Model

Let us train our model using ARIMA with the order (6,0,1), indicating an AR value
of 6 and an MA value of 1.

There are 143 rows of data for each store, so let us take our train size as 120 data points and
test size as 20 data points.

The predicted values do not seem to overlap the actual values. Let us try to build the model
using SARIMAX.

B) SARIMAX Model

In the above model, because of the seasonality in the dataset, the ARIMA model could not
predict the values correctly. Let us try using the SARIMAX model with a seasonal period of 52.

We can see that the SARIMAX predictions mostly cover the test data points, hence the
prediction is close to the actual values.

FORECASTING: -

Note: - The task asks us to forecast the sales for the next 12 weeks, but for
better visualization, the forecasting has been done for the next year.

Weekly Sales for the remaining stores: -

Forecasting for the remaining stores: -

Reference: -

1. Statista, Walmart's net sales worldwide 2023 (for understanding the retail store business).
2. Brendan Artley, Time Series Forecasting with ARIMA, SARIMA, and SARIMAX.
3. A Brief Introduction to ARIMA and SARIMAX Modeling in Python.
4. Pandas documentation, Python Pandas DataFrame.
5. Matplotlib, Seaborn, and Plotly gallery documentation.
6. GeeksforGeeks and W3Schools, Pandas basics and methods.

----------------Thank you-----------------
