Walmart Sales Prediction
Table of Contents
1. Introduction
2. Data Wrangling
3. Exploratory Data Analysis
4. Model Selection and Evaluation
5. Conclusions
6. References
1. Introduction
Sales analysis and forecasting are essential tools for businesses to understand and improve their sales
performance and make informed decisions about their future sales goals (B2B International, 2018). By analyzing
past sales data, businesses can identify trends and patterns that can help them understand what is driving their
sales and where they may need to focus their efforts to improve (Small Business Administration, 2021).
Forecasting allows businesses to project future sales based on these trends and patterns, helping them to set
realistic goals and allocate resources appropriately (Business News Daily, 2021).
The importance of sales analysis and forecasting extends beyond just understanding sales performance. It is also
crucial for budgeting and financial planning. By understanding their expected sales, businesses can better plan
for expenses and allocate resources to meet their financial goals (Small Business Administration, 2021).
Additionally, sales analysis and forecasting can help businesses identify opportunities for growth and new areas
for expansion (B2B International, 2018).
In this project, we will analyze Walmart's weekly sales data across 45 different stores to gain insights into their
sales performance and identify trends and patterns. We will then use this data to forecast future sales, which can
assist Walmart in making informed strategic decisions. This analysis will provide Walmart with valuable
information on how to optimize their sales and allocate resources effectively. Additionally, by understanding
how sales vary across different stores, Walmart can identify areas for improvement and potential opportunities
for growth. Overall, this project will enable Walmart to gain a deeper understanding of their sales performance
and make data-driven decisions to drive future success.
# import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import datetime
import datetime as dt
# import train/test split utility
from sklearn.model_selection import train_test_split
# import metrics
from sklearn.metrics import mean_squared_error
# import warnings
import warnings
warnings.filterwarnings('ignore')
2. Data Wrangling
In this section, we will identify and address any errors, inconsistencies, missing values, or duplicate entries in the
dataset. This will ensure that the dataset is accurate, consistent, and complete, and will make it more suitable for
analysis.
Thus we will address the following questions to ensure the quality and reliability of the dataset:
1. Are there any missing values in the dataset, and if so, what is their extent and data type?
2. Are there any duplicate entries in the dataset?
3. Are there any outliers in the dataset that may impact the analysis?
4. Does the dataset require any feature engineering to better support the analysis goals?
1. Are there any missing values in the dataset, and if so, what is their extent and data type?
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 6435 non-null int64
1 Date 6435 non-null object
2 Weekly_Sales 6435 non-null float64
3 Holiday_Flag 6435 non-null int64
4 Temperature 6435 non-null float64
5 Fuel_Price 6435 non-null float64
6 CPI 6435 non-null float64
7 Unemployment 6435 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB
We can see that the dataset contains 6435 rows and 8 columns. All of the columns are in the appropriate
numeric data types, with the exception of the 'Date' column, which needs to be converted to a datetime
type. In addition, the feature names will be converted to lowercase for consistency. This information helps us
understand the structure and content of the dataset, and identify any necessary data type conversions or
formatting changes.
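A minimal sketch of these two conversions, assuming pandas' default date parsing:
# convert the 'Date' strings to datetime objects
sales['Date'] = pd.to_datetime(sales['Date'])
# lowercase the feature names for consistency
sales.columns = sales.columns.str.lower()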
In [7]: # check
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 6435 non-null int64
1 Date 6435 non-null datetime64[ns]
2 Weekly_Sales 6435 non-null float64
3 Holiday_Flag 6435 non-null int64
4 Temperature 6435 non-null float64
5 Fuel_Price 6435 non-null float64
6 CPI 6435 non-null float64
7 Unemployment 6435 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(2)
memory usage: 402.3 KB
In [9]: # check
sales.columns
3. Are there any outliers in the dataset that may impact the analysis?
Outliers, or data points that are significantly different from the rest of the data, can affect the accuracy and
reliability of statistical measures such as the mean and standard deviation. To ensure that these measures
accurately represent the data, it is necessary to identify and properly handle outliers. To address this issue, we
will develop two functions: one to detect outliers and another to count them. These functions will help us
identify and understand the impact of outliers on our data, and allow us to make informed decisions about how
to handle them.
In [ ]: def find_outlier_rows(df, col, level='both'):
    """
    This function takes a dataframe and a column as input, and returns the rows
    with outliers in the given column. Outliers are identified using the
    interquartile range (IQR) formula. The optional level parameter allows the
    caller to specify the level of outliers to return, i.e., lower, upper, or both.
    Args:
        df: The input dataframe.
        col: The name of the column to search for outliers.
        level: The level of outliers to return, i.e., 'lower', 'upper', or 'both'.
            Defaults to 'both'.
    Returns:
        A dataframe containing the rows with outliers in the given column.
    """
    # compute the interquartile range
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    # compute the lower and upper outlier bounds
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # return the rows outside the requested bound(s)
    if level == 'lower':
        return df[df[col] < lower_bound]
    if level == 'upper':
        return df[df[col] > upper_bound]
    return df[(df[col] < lower_bound) | (df[col] > upper_bound)]
In [ ]: def count_outliers(df):
    """
    This function takes in a DataFrame and returns a DataFrame containing the count and
    percentage of outliers in each numeric column of the original DataFrame.
    Input:
        df: a Pandas DataFrame containing numeric columns
    Output:
        a Pandas DataFrame containing two columns:
            'outlier_counts': the number of outliers in each numeric column
            'outlier_percent': the percentage of outliers in each numeric column
    """
    # select numeric columns
    df_numeric = df.select_dtypes(include=['int', 'float'])
    outlier_cols = df_numeric.columns
    # count the outliers and compute the percentage of outliers for each column
    outlier_counts, outlier_percents = {}, {}
    for col in outlier_cols:
        outlier_count = len(find_outlier_rows(df_numeric, col))
        all_entries = len(df[col])
        outlier_percent = round(outlier_count * 100 / all_entries, 2)
        outlier_counts[col] = outlier_count
        outlier_percents[col] = outlier_percent
    # assemble the results into a dataframe indexed by column name
    return pd.DataFrame({'outlier_counts': outlier_counts, 'outlier_percent': outlier_percents})
The above dataframe shows that the weekly_sales, holiday_flag, temperature, and unemployment
columns all contain outliers, with unemployment having the largest outlier percentage at 7%. Let's examine the
outliers in each column to decide how to handle them.
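The summary below can be produced by describing the unemployment outliers returned by find_outlier_rows; a minimal sketch:
# summarise the unemployment outliers detected by the IQR rule
find_outlier_rows(sales, 'unemployment')['unemployment'].describe()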
Out[14]: count    481.000000
         mean      11.447480
         std        3.891387
         min        3.879000
         25%       11.627000
         50%       13.503000
         75%       14.021000
         max       14.313000
         Name: unemployment, dtype: float64
The minimum and maximum values of these outliers are 3.89% and 14.31% respectively, and the majority (at least
75%) are 11.6% or higher. Such values are plausible in reality and will be left intact for the analysis. The median,
which is robust to outliers, will therefore be used to measure the centre of the unemployment rate distribution.
Out[15]: count    450.0
         mean       1.0
         std        0.0
         min        1.0
         25%        1.0
         50%        1.0
         75%        1.0
         max        1.0
         Name: holiday_flag, dtype: float64
We can see that all the special holiday weeks form the outliers. This is because most of the weeks,
93%, are non-holiday weeks. These values will also be left intact.
We can also note that all the weekly_sales outliers occur in either November or December, with a single
outlier occurring in October.
4. Does the dataset require any feature engineering to better support the analysis goals?
The employment rate may be correlated with weekly sales, so it will be derived from the unemployment rate. The
date column will also be split into three parts so that the data can be analysed by year, month, or day, as sketched below.
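A minimal sketch of these transformations, assuming the feature names have already been lowercased:
# employment rate as the complement of the unemployment rate
sales['employment'] = 100 - sales['unemployment']
# split the date into year, month and day for time-based analysis
sales['year'] = sales['date'].dt.year
sales['month'] = sales['date'].dt.month
sales['day'] = sales['date'].dt.day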
Out[17]:
   store       date  weekly_sales  holiday_flag  temperature  fuel_price         cpi  unemployment  employment  year
0      1 2010-05-02    1643690.90             0        42.31       2.572  211.096358         8.106      91.894  2010
1      1 2010-12-02    1641957.44             1        38.51       2.548  211.242170         8.106      91.894  2010
2      1 2010-02-19    1611968.17             0        39.93       2.514  211.289143         8.106      91.894  2010
The weekly transactions occurred over a three-year period (2010-2012) across 45 stores. The maximum weekly
sales figure is $3.8 million and the hottest recorded temperature was 100°F.
In [19]: # histograms
sales.hist(figsize=(30,20));
From the above histograms, we can see that:
The number of transactions is spread almost evenly across the various stores and years.
The distribution of weekly_sales is right-skewed; only a few weekly sales are above 2 million USD.
The distribution of temperature is approximately normal.
The distribution of fuel_price is bimodal.
CPI forms two clusters.
The unemployment rate is approximately normally distributed.
The four consecutive months from November to February recorded the highest sales.
Sales trend analysis involves examining the historical sales data of a business or product over time to understand
patterns, trends, and changes in sales performance. It is an important tool for businesses to identify
opportunities for growth, understand their customers' behaviour, optimise resources, and make informed
decisions about future sales.
We will aggregate the average weekly sales by month across the three years and visualise the trend using a line
plot, as sketched below.
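A minimal sketch of this aggregation and plot, using the month column created earlier:
# average weekly sales per calendar month across the three years
monthly_avg = sales.groupby('month')['weekly_sales'].mean()
# visualise the monthly trend
monthly_avg.plot(kind='line', marker='o', figsize=(10, 5))
plt.title('Average weekly sales by month (2010-2012)')
plt.xlabel('month')
plt.ylabel('average weekly sales (USD)')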
plt.show()
The line plot reveals that weekly sales at Walmart generally remain stable throughout the year, with the
exception of November and December, which experience a significant increase in sales. This trend is likely due to
the holiday season, when consumers typically make more purchases and retailers offer promotions and
discounts. To capitalize on this behavior, Walmart could consider offering seasonal discounts and promotions, as
well as ensuring a seamless and enjoyable shopping experience through their mobile and web outlets during
festive periods. By doing so, they can encourage more customers to make purchases and potentially drive up
sales.
Seasonality trends analysis can be extremely valuable for businesses, as it allows us to better forecast future
sales, make more informed decisions about inventory and staffing, and understand the drivers of customer
demand leading to improved efficiency and profitability.
We will create a pivot table that groups the data by month and year and calculates the average sales for each
period. We will then plot the table as a line chart, one line per year, as sketched below. This will allow us to
see whether any patterns in the data repeat at regular intervals.
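A minimal sketch of this pivot table and plot:
# average weekly sales grouped by month and year
monthly_by_year = sales.pivot_table(values='weekly_sales', index='month', columns='year', aggfunc='mean')
# plot one line per year to compare the seasonal pattern
monthly_by_year.plot(kind='line', marker='o', figsize=(10, 5))
plt.title('Average weekly sales by month and year')
plt.xlabel('month')
plt.ylabel('average weekly sales (USD)')
plt.show()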
[Line chart: average monthly sales for 2010, 2011, and 2012, with month on the x-axis]
We can observe that the line charts for the three years follow a similar sawtooth shape from January to October,
with large rises in November and December due to the holidays. This indicates a seasonal trend, as the same
months consistently record larger or smaller sales across the three years. We can also observe that although 2011
performed worse than 2010 in terms of average sales, the trend was reversed in 2012, which performed better
than 2010. However, the data for 2012 ends in October, which explains the sharp drop shown at the end of that year.
5. Which stores had the highest and lowest average revenues over the years?
Identifying the top performing and low performing stores or products in sales analysis can be useful for a variety
of purposes. By analysing the sales data for different stores, businesses can identify opportunities for growth,
understand customer preferences, optimise inventory levels, and identify potential problems or areas for
improvement. Understanding the performance of different stores can inform product development and
marketing efforts, as well as help businesses allocate resources more effectively and make more informed
business decisions.
We will create a function that takes a dataframe as input and generates two plots showing the top and bottom
performing stores in terms of average sales.
In [23]: def plot_top_and_bottom_stores(df, col):
    """
    Plot the top and bottom 5 stores based on their average weekly sales.
    Parameters:
        df (pandas DataFrame): The dataframe containing the sales data.
        col (str): The name of the column to group the data by.
    Returns:
        None
    """
    # Group the data by the specified column and sort the average weekly sales in descending order
    avg_sales = df.groupby(col)['weekly_sales'].mean().sort_values(ascending=False)
    # Plot the top 5 and bottom 5 stores side by side
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    avg_sales.head(5).plot(kind='bar', ax=axes[0], title='Top 5 stores by average weekly sales')
    avg_sales.tail(5).plot(kind='bar', ax=axes[1], title='Bottom 5 stores by average weekly sales')
    axes[0].set_ylabel('average weekly sales (USD)')
    plt.tight_layout()
    plt.show()
On the other hand, the lowest performing stores show greater variation in sales, with the highest of them
averaging around $0.38 million. This suggests that there is more variability in the sales performance of these stores.
6. Are there correlations between the features of the dataset and weekly_sales?
Linear Regression
Decision Tree Regressor
Random Forest Regressor
Support Vector Regressor, etc.
We will fit each of these regressors to our training data and make predictions on the test set. Then, we will
calculate the RMSE of the predictions and compare the results to choose the best regressor.
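A possible set-up for this comparison; the variable names regressors and regressor_names are assumptions chosen to match the comparison function defined later:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# candidate models to compare on the test set
regressors = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), SVR()]
regressor_names = ['Linear Regression', 'Decision Tree Regressor',
                   'Random Forest Regressor', 'Support Vector Regressor']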
To ensure that the original dataset is not modified during the modeling process and to facilitate debugging if
needed, we will create a copy of the preprocessed dataset before fitting our various models. This will help to
preserve the integrity of the original data and allow us to refer to it if any issues arise during the modeling
process.
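A minimal sketch of this step, assuming the working copy is named model_df and that the raw date and unemployment columns are dropped in favour of the engineered features:
# work on a copy so the original preprocessed dataset stays intact
model_df = sales.drop(columns=['date', 'unemployment']).copy()
model_df.head()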
Out[29]: [preview of the modelling dataframe with columns: store, weekly_sales, holiday_flag, temperature, fuel_price, cpi, employment, year, month, day]
Scaling is a preprocessing step that transforms the features of a dataset so that they have a similar scale and can
improve the performance of some regression algorithms and facilitate comparison of the model's coefficients. In
this project we will use standard scaler to standardize the features of the dataset.
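A minimal sketch of the scaling step, assuming the modelling copy model_df from above:
from sklearn.preprocessing import StandardScaler

# separate the features from the target
X = model_df.drop(columns=['weekly_sales'])
y = model_df['weekly_sales']

# standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)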
To properly evaluate the performance of our models and prevent overfitting, we can use validation
techniques. One such technique is to split the dataset into a training set and a testing set. The training set is
used to train the model, while the testing set is used to evaluate the model's performance. This can help us
determine how well the model generalizes to unseen data and can identify any issues with overfitting. It is
important to randomly shuffle the data before splitting it into the train and test sets, as this can help ensure that
the data is representative of the overall population and not biased in any way.
In [32]: # split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
In this subsection, we will create a function that will train multiple regressors and compare their performance
using the root mean square error (RMSE) metric. We will use the RMSE values to compare the performance of
the various regressors and determine which model has the lowest error and is therefore the best fit for our data.
In [ ]: def evaluate_model(model, X_train, y_train, X_test, y_test):
    """
    Train a model on the training data and return its RMSE on the test data.
    Parameters
    ----------
    model : object
        A scikit-learn estimator object.
    X_train : array-like or pd.DataFrame
        Training data with shape (n_samples, n_features).
    y_train : array-like
        Training labels with shape (n_samples,).
    X_test : array-like or pd.DataFrame
        Test data with shape (n_samples, n_features).
    y_test : array-like
        Test labels with shape (n_samples,).
    Returns
    -------
    rmse : float
        Root mean squared error between the test labels and the predictions.
    """
    # train
    model.fit(X_train, y_train)
    # predict
    y_pred = model.predict(X_test)
    # calculate MSE
    mse = mean_squared_error(y_test, y_pred)
    # calculate RMSE
    rmse = np.sqrt(mse)
    return rmse
In [ ]: def compare_regressors(regressors, regressor_names, X_train, y_train, X_test, y_test):
    """
    Train multiple regressors and compare their performance using RMSE.
    Parameters:
    -----------
    regressors (list): a list of scikit-learn compatible regression models
    regressor_names (list): a list of strings containing the names of the regression models
    X_train (pandas DataFrame): a pandas DataFrame containing the features for training the models
    y_train (pandas Series): a pandas Series containing the target values for training the models
    X_test (pandas DataFrame): a pandas DataFrame containing the features for testing the models
    y_test (pandas Series): a pandas Series containing the target values for testing the models
    Returns:
    --------
    pandas DataFrame: a dataframe containing the names of the regressors and their corresponding RMSE values
    """
    # evaluate the models and compute their RMSE on the test data
    rmses = [evaluate_model(regressor, X_train, y_train, X_test, y_test) for regressor in regressors]
    # assemble the results into a dataframe
    return pd.DataFrame({'regressor': regressor_names, 'rmse': rmses})
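A possible call to this comparison, using the assumed regressors and regressor_names lists from earlier:
# compare the candidate models by their test-set RMSE
results = compare_regressors(regressors, regressor_names, X_train, y_train, X_test, y_test)
results.sort_values(by='rmse')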
Result interpretation
Let's interpret the results by evaluating the RMSE value of the best model, the Random Forest Regressor.
The above table shows that the Random Forest Regressor outperformed all the other regressors with an RMSE of
1.17e+05. This provides a good estimate for future sales, as the error amounts to about 12% of the median weekly sale.
5. Conclusions
Our analysis shows that sales during holiday weeks are significantly higher than during non-holiday weeks, with
sales doubling on average. Additionally, there is a strong seasonal component to the sales data. The average
sales of the top performing stores are up to 500% higher than the lowest performing stores.
The best model for predicting future sales is the Random Forest Regressor, which achieved an RMSE of
1.17e+05. This is a good estimate, as the error amounts to only about 12% of the median weekly sale.
These findings have important implications for businesses as they can inform decisions about inventory, staffing,
and marketing efforts. By understanding the factors that drive sales and using a reliable model to forecast future
sales, businesses can better plan for the future and optimize their resources.
Future work
One area that future studies could explore is the relationship between festive sales and profit margins. By
augmenting the dataset with expenses data, it would be possible to investigate whether larger festive sales
always translate to larger profit margins. This can inform decisions about marketing and pricing strategies.
The analysis showed a 500% difference in average sales between the top performing and lowest performing stores,
which is a significant gap. This suggests that there may be underlying factors driving the performance of these
stores. To better understand these factors, it would be necessary to gather additional data and parameters that
may be influencing sales at the top and bottom performing stores.
Hyperparameter tuning involves adjusting the parameters of a model in order to improve its performance on a
given dataset. By iteratively tuning the parameters of the best model, it may be possible to achieve an even
better model.
6. References
Kaggle
B2B International (2018). Sales Forecasting: The Importance and Benefits.
Business News Daily (2021). Sales Forecasting: The Importance of Accurate Sales Forecasts.
Small Business Administration (2021). The Importance of Sales Forecasting.