Predicting Sales of Rossmann Stores: Machine Learning Engineer Nanodegree
Chirag Jhamb
16 August 2016
I. Definition
Problem Overview:
Rossmann operates over 3,000 drug stores in 7 European countries.
Currently, Rossmann store managers are tasked with predicting
their daily sales for up to six weeks in advance. Store sales are
influenced by many factors, including promotions, competition,
school and state holidays, seasonality, and locality. With thousands
of individual managers predicting sales based on their unique
circumstances, the accuracy of results can be quite varied.
The goal of this project is to create a model that predicts daily sales for
1,115 stores located across Germany. Reliable sales forecasts
enable store managers to create effective staff schedules that
increase productivity and motivation. By helping Rossmann create a
robust prediction model, managers can stay focused on what’s most
important to them: their customers and their teams, instead of
worrying about profits.
For this problem, Rossmann has provided the datasets “train.csv” and
“store.csv”, containing daily sales information and information
about the various stores, respectively. There is also a test set
containing features similar to those of “train.csv” but without the
“Sales” feature.
Since all the features are given along with the result, this is a supervised
learning problem.
Note: The actual training dataset contains 1,017,209 rows. Since the
data was too large to work with comfortably, only 75% of it was used.
The sampling method can be seen in the file “create_files.py”.
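A minimal sketch of how such a sample might be drawn with pandas (the actual code lives in “create_files.py”; the output file name and random seed here are assumptions):

```python
import pandas as pd

# Load the full training data (about 1,017,209 rows).
train = pd.read_csv("inputs/train.csv", low_memory=False)

# Keep a random 75% of the rows. The seed is an assumed value,
# used only to make the sample reproducible.
sample = train.sample(frac=0.75, random_state=42)
sample.to_csv("inputs/train_sample.csv", index=False)
```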
Problem Statement:
Primary goal: given historical sales data for 1,115 Rossmann stores,
forecast the "Sales" column for the test set. The data comes from the
Kaggle “Rossmann Store Sales” competition.
To achieve the primary goal, the training data will be divided into
two parts: the first will contain 75% of the data, and the other will
be used to test the accuracy of predictions made by the model
designed and trained on the first part. Several regression algorithms,
such as decision tree and gradient boosting regressors, will be used
to create the model. The model with the highest accuracy and the
lowest cost (time consumption) will be chosen, trained
over the entire dataset, and then used to make
predictions on the “test.csv” file.
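A minimal sketch of the 75/25 split and the candidate regressors, assuming scikit-learn (the model list and parameters are illustrative, and the features are assumed to already be merged and numerically encoded):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Assumes the features have already been merged and encoded as numbers.
data = pd.read_csv("inputs/train_sample.csv")
y = data["Sales"]
X = data.drop(columns=["Sales"])

# 75% of the rows train the models; the remaining 25% test them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

# Candidate regressors to compare on accuracy and time.
models = [
    DecisionTreeRegressor(random_state=42),
    GradientBoostingRegressor(random_state=42),
    KNeighborsRegressor(),
]
```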
The solution must contain precise sales values that each store will
achieve in the future.
Metrics
By now, we know this is a regression problem. To test the
accuracy of each model, we create a test set by dividing the training
data.
We then use the root mean squared prediction error to measure the
accuracy of our predictions. The root-mean-square error (RMSE),
or RMSD, is a frequently used measure of the differences between
the values predicted by a model or an estimator
and the values actually observed. The RMSE represents the sample
standard deviation of the differences between predicted values and
actual values; hence a score of 0 implies a perfect prediction.
Mathematically,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ the observed value, and $n$ the number of predictions.
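A small sketch of this metric in Python, matching the definition above:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# A perfect prediction scores 0.
assert rmse([1.0, 2.0], [1.0, 2.0]) == 0.0
```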
II. Analysis
Data Exploration
The data is present in the “inputs” folder of the repository. There
are three files: “train.csv”, containing the sales records
to be used as training data; “store.csv”, containing more
information about each store and the features affecting its sales, which
can be merged with the training data; and “test.csv”, containing
records for each day but not the sales, which it is the goal
of this project to predict.
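A minimal sketch of loading the three files and merging the store-level features, assuming pandas (paths follow the repository layout described above):

```python
import pandas as pd

train = pd.read_csv("inputs/train.csv", low_memory=False)
store = pd.read_csv("inputs/store.csv")
test = pd.read_csv("inputs/test.csv")

# Every row in train.csv and test.csv carries a Store id,
# so the store-level features can be joined on that column.
train = train.merge(store, on="Store", how="left")
test = test.merge(store, on="Store", how="left")
```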
These are all the features given in the dataset, along with a short
description of each:
•Id - an Id that represents a (Store, Date) duple within the test set
•Store - a unique Id for each store
•Sales - the turnover for any given day (this is what you are
predicting)
•Customers - the number of customers on a given day
•Open - an indicator for whether the store was open: 0 = closed, 1 =
open
•StateHoliday - indicates a state holiday. Normally all stores, with
few exceptions, are closed on state holidays. Note that all schools
are closed on public holidays and weekends. a = public holiday, b =
Easter holiday, c = Christmas, 0 = None
•SchoolHoliday - indicates if the (Store, Date) was affected by the
closure of public schools
•DayOfWeek – the day of the week
•StoreType - differentiates between 4 different store models: a, b,
c, d
•Assortment - describes an assortment level: a = basic, b = extra, c
= extended
•CompetitionDistance - distance in meters to the nearest
competitor store
•CompetitionOpenSince[Month/Year] - gives the approximate
year and month of the time the nearest competitor was opened
•Promo - indicates whether a store is running a promo on that day
•Promo2 - Promo2 is a continuing and consecutive promotion for
some stores: 0 = store is not participating, 1 = store is participating
•Promo2Since[Year/Week] - describes the year and calendar
week when the store started participating in Promo2
•PromoInterval - describes the consecutive intervals Promo2 is
started, naming the months the promotion is started anew. E.g.
"Feb,May,Aug,Nov" means each round starts in February, May,
August, November of any given year for that store
This part explores all the features by visualising the data,
using our intuition to anticipate the correlations between them.
We start by exploring the median sales and customers that
stores get on each day of the week, viewed as a timeline for the
week:
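A sketch of how such a weekly view might be produced with pandas and matplotlib, reusing the merged `train` frame from above (column names follow the dataset description):

```python
import matplotlib.pyplot as plt

# Median sales and customers per day of week (1 = Monday ... 7 = Sunday),
# computed over open days only.
weekly = (train[train["Open"] == 1]
          .groupby("DayOfWeek")[["Sales", "Customers"]]
          .median())

weekly.plot(subplots=True, marker="o")
plt.xlabel("Day of week")
plt.show()
```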
Now let's look at the performance of the stores over the entire
timeline, from the first months present in the training data to the
last:
Figures 4a and 4b measure the performance of stores and share the
same x-axis. The first and foremost thing that is obvious from the plot is
that sales and customers are highly correlated: the majority
of customers walking through the store are contributing to sales.
However, starting in 2015, customer numbers diverge
slightly from sales, with fewer customers contributing to more
sales. Customer growth is not evident. Sales and customer numbers
appear to spike just before Christmas and fall back down again
during the new year. If more customers could be enticed into the
store, better sales could be achieved. Let's plot the sales and
customer data against Promo, StateHoliday and SchoolHoliday to
visualise their behaviour on those days.
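One simple way to sketch this comparison, assuming the merged `train` frame from above (the actual figures in the report may have used a different plot type):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ["Promo", "StateHoliday", "SchoolHoliday"]):
    # Median sales for each level of the indicator, over open days only.
    train[train["Open"] == 1].groupby(col)["Sales"].median().plot.bar(ax=ax)
    ax.set_title(f"Median sales by {col}")
plt.tight_layout()
plt.show()
```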
Benchmark
First, the steps were divided into various functions so that the
time taken by each could be measured in order to check the cost.
The first function trains the model passed to it as an argument: it
applies the fit method and reports the time. The second
function runs the prediction over the training set itself and returns
the root mean squared error, along with the time taken to
make predictions over the training set. The third function
makes predictions over the test set and reports the time and the
score.
First, the sales data is converted into log values in order to make
predictions easier. Then each of the models mentioned in the earlier
section is passed as an argument to these functions, and the
time and score are reported for each.
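A minimal sketch of these three helper functions, reusing the `rmse` helper and the model list from the earlier sketches (the exact signatures in the project code may differ):

```python
import time
import numpy as np

def train_model(model, X_train, y_train):
    """Fit the model and report the training time."""
    start = time.time()
    model.fit(X_train, y_train)
    print(f"Trained {type(model).__name__} in {time.time() - start:.2f}s")

def score_on_train(model, X_train, y_train):
    """Predict on the training set itself; return RMSE and prediction time."""
    start = time.time()
    predictions = model.predict(X_train)
    return rmse(y_train, predictions), time.time() - start

def score_on_test(model, X_test, y_test):
    """Predict on the held-out set; return RMSE and prediction time."""
    start = time.time()
    predictions = model.predict(X_test)
    return rmse(y_test, predictions), time.time() - start

# Log-transform the sales first; log1p also handles zero-sales (closed) days.
y_train_log, y_test_log = np.log1p(y_train), np.log1p(y_test)
for model in models:
    train_model(model, X_train, y_train_log)
    print(score_on_train(model, X_train, y_train_log))
    print(score_on_test(model, X_test, y_test_log))
```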
Once the scores are reported, the most efficient model is chosen and
its feature importance is calculated. Feature importance tells us
which features were the most relevant in making predictions; this
can be compared to our analysis from the data exploration.
The feature importances are then represented on a bar graph.
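A sketch of extracting and plotting the importances, assuming a fitted tree-based scikit-learn model was chosen (such models expose `feature_importances_` after fitting; `best_model` is a hypothetical name for the chosen model):

```python
import pandas as pd
import matplotlib.pyplot as plt

# best_model is assumed to be the fitted tree-based model chosen above.
importances = pd.Series(best_model.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(title="Feature importances")
plt.tight_layout()
plt.show()
```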
Finally, the chosen model is trained over the entire dataset, and the
final predictions are made and saved to the test file.
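A sketch of this final step, assuming the log transform is inverted with `expm1` before saving (`X_submit` is a hypothetical frame holding the test.csv features prepared the same way as the training features, and the output file name is illustrative):

```python
import numpy as np

# Retrain the chosen model on the full training data.
best_model.fit(X, np.log1p(y))

# X_submit is assumed to hold the test.csv features, prepared the same
# way as the training features; expm1 undoes the log transform.
test["Sales"] = np.expm1(best_model.predict(X_submit))
test[["Id", "Sales"]].to_csv("predictions.csv", index=False)
```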
Refinement
For KNeighbours:
In the previous section, the benchmark value for the error rate was
almost 0.33. Thanks to the DecisionTree regressor, the error rate was
almost half of the expected value. If sales can be predicted with
an error rate of only 0.16, it would be very easy for a
manager to make the necessary changes and see precisely what increases or
decreases the sales.
DecisionTree regression created a model that predicts the value of
Sales by learning simple decision rules inferred from the data
features. The error was so low because if-then-else decision rules
were applied to each feature, and since the features are
numeric as well as categorical, the model was a very good fit.
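The if-then-else structure can be inspected directly; a sketch using scikit-learn's `export_text` on a shallow tree (the depth is limited purely for readability, and the real model would be deeper):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# A shallow tree keeps the printout readable; the real model is deeper.
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train_log)
print(export_text(tree, feature_names=list(X_train.columns)))
```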
Thus, the model is trained on the entire dataset in the end and the
Sales of the test set are predicted. It can be said with confidence
that this model will have precisely predicted the required values.
Hence, the task of this project is completed!
V. Conclusion
Free-Form Visualization
Finally, once the optimal model has been trained, we can look at
which features affected the predictions the most
and evaluate the prediction we made earlier in this project. Let's look at
the importances:
Improvement
I do believe that the features given were enough for making an
optimal precison model. There might certain algorithms such
XGBoost which might do a better job of prediction than decisiontree
if fully optimised, but it can’t be said for sure. If the final solution
was used as a benchmark, then it would be tough to beat but there
has been better models having ever lower error rates. Thus, there is
wide possibility of improvement which is followed by more research
on regression models.