
Capstone Project

Predicting sales of Rossman stores

Machine Learning Engineer Nanodegree

Chirag Jhamb
16 August 2016

I. Definition

Problem Overview:
Rossmann operates over 3,000 drug stores in 7 European countries.
Currently, Rossmann store managers are tasked with predicting
their daily sales for up to six weeks in advance. Store sales are
influenced by many factors, including promotions, competition,
school and state holidays, seasonality, and locality. With thousands
of individual managers predicting sales based on their unique
circumstances, the accuracy of results can be quite varied.
The goal of this project is to create a model that predicts daily sales for
1,115 stores located across Germany. Reliable sales forecasts
enable store managers to create effective staff schedules that
increase productivity and motivation. By helping Rossmann create a
robust prediction model, managers can stay focused on what's most
important to them: their customers and their teams, instead of
worrying about profits.
For this problem, Rossmann has provided the datasets "train.csv" and
"store.csv", containing daily sales information and information
about the individual stores, respectively. There is also a test set
containing features similar to those of "train.csv" but without the
"Sales" feature.
Since all the features are given along with the target, this is a supervised
learning problem.
Note: The actual training dataset contains 1,017,209 rows. Since the
data was too large, only 75% of it was used. The method used
can be seen in the file "create_files.py".
Problem Statement:
Primary goal: given historical sales data for 1,115 Rossmann stores,
forecast the "Sales" column for the test set. The data can be found here.
To achieve the primary goal, the training data will be divided into
two parts: the first will contain 75% of the data and the remaining
25% will be used to test the accuracy of predictions made by the model
designed and trained on the first part. Several regression algorithms
will be used to build candidate models, such as DecisionTree and
GradientBoost. The model with the highest accuracy and the
lowest cost (time consumption) will be chosen, retrained
over the entire dataset, and then used to make
predictions on the "test.csv" file.
The solution must provide precise sales values for each store
in the future.

Metrics
By now, we know this is a regression problem. To test the
accuracy of each model, we create a test set by splitting the training
data.
We then use the root mean squared prediction error to measure the
accuracy of our predictions. The root-mean-square error (RMSE,
or RMSD) is a frequently used measure of the differences between
the values predicted by a model or an estimator and the values
actually observed. The RMSE represents the sample standard
deviation of the differences between predicted values and actual
values, so a score of 0 implies a perfect fit.

Mathematically,

RMSE = sqrt( (1/n) * sum_{i=1..n} (yhat_i - y_i)^2 )

where yhat_i is the predicted value and y_i is the observed value.
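For reference, a minimal sketch of how this metric can be computed with scikit-learn; the helper name rmse and the example values are illustrative, not taken from the project code.

    import numpy as np
    from sklearn.metrics import mean_squared_error

    def rmse(y_true, y_pred):
        # Square root of the mean squared difference between actual and predicted values.
        return np.sqrt(mean_squared_error(y_true, y_pred))

    # Illustrative values only:
    print(rmse([10.0, 12.0, 14.0], [11.0, 12.0, 13.0]))  # ~0.816
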
II. Analysis
Data Exploration
The data is present in the "inputs" folder of the repository. There
are three files: "train.csv", containing the sales records used as
training data; "store.csv", containing additional information about
each store and the features affecting its sales, which can be merged
with the training data; and "test.csv", containing records for each
day but without the sales, since predicting those sales is the goal
of this project.
These are all the features given in the dataset, along with a short
description of each:
• Id - an Id that represents a (Store, Date) tuple within the test set
• Store - a unique Id for each store
• Sales - the turnover for any given day (this is what you are predicting)
• Customers - the number of customers on a given day
• Open - an indicator for whether the store was open: 0 = closed, 1 = open
• StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
• SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
• DayOfWeek - the day of the week
• StoreType - differentiates between 4 different store models: a, b, c, d
• Assortment - describes an assortment level: a = basic, b = extra, c = extended
• CompetitionDistance - distance in meters to the nearest competitor store
• CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
• Promo - indicates whether a store is running a promo on that day
• Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
• Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
• PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

We can expect outliers in the "Sales" and "Customers" features of
the training set, as the rest of the values are categorical or in the
form of dates.
An interesting observation is that there are a lot of dates to be
handled in the data. This could help us build a timeline view.
On some exploration, there were certain NaN values in columns
such as "Open" and "Promo2", where the values should be either 1 or 0.
In the case of promotions it is safe to assume there were no promotions
at all, which is why the column hasn't been filled. In the case of features
such as "Open" we can simply check whether there were any sales that
day; if there were no sales, it is safe to assume that the store
was closed, and replace the NaN with zero. A minimal sketch of this
imputation idea is shown below.
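A minimal pandas sketch of the imputation described above, assuming the training file sits at inputs/train.csv as noted earlier; the exact code used in the project may differ.

    import pandas as pd

    train = pd.read_csv("inputs/train.csv", low_memory=False)

    # Promotion columns: treat a missing value as "no promotion that day".
    train["Promo"] = train["Promo"].fillna(0)

    # "Open": if a day recorded zero sales, assume the store was closed.
    closed = train["Open"].isnull() & (train["Sales"] == 0)
    train.loc[closed, "Open"] = 0
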
Exploratory Visualization:

This part explores the features by visualising the data, and we use
our intuition to anticipate the correlations between them.
We start by exploring the median sales and customers that stores get
on each day of the week, viewed as a timeline for the week.

That is a very interesting plot. Let's assume day 1 is Monday and
day 7 is Sunday. Throughout the week until Friday, fewer customers
are ringing in more sales. On Saturday, both the number of sales
and customers drop to their lowest and spike again on Sunday.
However, only on Sunday is there the opposite behaviour, i.e. more
customers are ringing in fewer sales; in other words there is more
customer traffic but fewer sales. The one conclusion from this plot
is that the customers who walk into the store during the week are
serious customers who will most likely make a purchase, whereas
customers walking in on Sunday appear to be distracted,
disinterested or window-shopping! One area of improvement would be
to look at converting the customer numbers to sales on the weekend,
maybe with targeted promotions.

Now let's look at the performance of the stores over the entire
timeline, from the first months present in the training data to the
last.
Figures 4a and 4b measure the performance of the stores and share the
same x-axis. The first and foremost thing that is obvious from the plot
is that sales and customers are highly correlated, so the majority of
customers walking through the store are contributing to sales.
However, starting in 2015, customer numbers diverge slightly from
sales, with fewer customers contributing to more sales. Customer
growth is not evident. Sales and customer numbers appear to spike
just before Christmas and fall back down again during the new year.
If more customers could be enticed into the store, better sales could
be achieved. Let's plot the sales and customer data against Promo,
StateHoliday and SchoolHoliday to visualise their behaviour on those
days.

State Holidays (Fig. 5)
I think there is something amiss about the "StateHoliday" data.
From August 2014, there appear to be no state holidays at all, which
is highly unlikely; most likely this is a case of missing data. That
can be fixed by filling the values using the previous years' state
holidays, as state holidays rarely change.
School Holidays (Fig. 6)
School holidays appear to be seasonal, with July-August being the
highest, presumably because July-August is the summer vacation
period. An interesting observation is that there are minimal school
holidays in November, i.e. before the Christmas vacation. Another
observation is the decreasing trend of school holidays, reaching a
minimum just before the vacation starts. On the other hand, after
the vacation when school restarts there is an increasing trend of
school holidays with an abrupt drop, after which the decreasing
trend takes over again. This abrupt drop might correspond to a
mid-term break after school has restarted.
Promotions (Fig. 7)
Looking at the number of promotions being held, it appears that
every month there were promotions at various stores. An interesting
observation is that there is a massive drop in sales and customers in
January 2014, and there were no promotions subsequent to this drop
in numbers, which suggests that management was unable to predict it;
had they predicted the drop, they would have organised more
promotions to lure in customers. This is evident from January 2015,
where more promotions were organised, learning from the mistakes of
2014. Subsequently, there appear to be more sales, albeit a small
change compared to January-February 2014, with the same number of
customers. More promotions were organised during school holidays,
but they did not have much positive impact on sales or customer
numbers. Another interesting observation is that sales always show
an increasing trend during the days when schools are in session.
Algorithms and Techniques
It is obvious from the exploration above that sales cannot be
predicted just from customer data or just from promotions; sales are
affected by every attribute. To prepare for prediction, we will first
get rid of the outliers in the Customers and Sales features, since
there may be exceptional days when the numbers are unusually high or
low, and such days would affect the precision of the model.
After dealing with outliers, we can start preprocessing the data.
This includes getting rid of null values and label-encoding features
such as StoreType and StateHoliday. As discussed earlier, converting
the dates is going to be important for prediction, which is also
borne out by the visualisations, since sales change a lot with dates,
months and days.
We can also process the competition details, as these would affect
the number of customers, which would in turn affect sales.
Promotions might help bring customers back, which is why we need to
encode all the Promo columns as well.
Once we have preprocessed the data, we can split it using the
cross_validation.train_test_split method. This method randomly
shuffles the data and returns two sets, training and testing; the
size of the test set can be defined, as sketched below.
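A sketch of the split; "data" and "feature_cols" are placeholder names for the preprocessed training DataFrame and its encoded feature columns. Note that newer scikit-learn releases expose train_test_split from sklearn.model_selection rather than the older sklearn.cross_validation module mentioned above.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Placeholder: load the raw training file; in the project the merged,
    # encoded DataFrame would be used instead.
    data = pd.read_csv("inputs/train.csv", low_memory=False)
    feature_cols = [c for c in data.columns if c not in ("Sales", "Date", "Customers")]

    X_train, X_test, y_train, y_test = train_test_split(
        data[feature_cols], data["Sales"], test_size=0.25, random_state=42)
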
The training set is then used to train three models:
1. DecisionTree regression - This model predicts the value of a
target variable by learning simple decision rules inferred from the
data features. A decision tree takes the features and works out a
way to get to the target variable using if-then-else decision rules
on the input features.
2. KNeighbors regression - The principle behind nearest-neighbour
methods is to find a predefined number of training samples closest
in distance to the new point and predict the label from these.
K-neighbours regression implements learning based on the k nearest
neighbours of each query point, where k is an integer value specified
by the user. An example of a KNeighbors regression fit is shown in
the accompanying figure.

3. GradientBoost regressor - Gradient boosted regression is a
generalization of boosting to arbitrary differentiable loss
functions. Gradient boosting produces a prediction model in the form
of an ensemble of weak prediction models, typically decision trees.
It builds the model in a stage-wise fashion like other boosting
methods do, and it generalizes them by allowing optimization of an
arbitrary differentiable loss function.
These models have been chosen for this problem because of the
following properties (a sketch of instantiating them follows this list):
• Given the features affecting sales, the data can easily be split into deciding characteristics using if-then-else decision rules on the input features. This requires data preparation, which can be done easily, so the main shortcoming of decision trees is not a concern here. Although a decision tree can be unstable, it is very good at handling categorical as well as numerical data, so the multiple feature types in our dataset won't be a problem. We can only find out whether it is unstable on our data by looking at the test score of the model and then deciding whether it can be used.
• Our dataset has a large number of datapoints. KNeighbors can use brute force to perform fast computation, which helps reduce the cost of the model. On the other hand, it works well only with continuous values, which is not the case for some of the input features. We can find out how much that affects the accuracy of the model from the test scores.
• GradientBoost is a slow model, so its cost is going to be high. On the other hand, it uses a different approach, optimising an arbitrary differentiable loss function. If we are unable to get a good score from any other model, we might be able to get one through this approach.
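As referenced above, a sketch of instantiating the three candidate models with scikit-learn; the hyperparameters shown are library defaults, not values tuned in the project.

    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import GradientBoostingRegressor

    # Three candidate regressors to be trained and compared on the same split.
    models = [
        DecisionTreeRegressor(random_state=42),
        KNeighborsRegressor(n_neighbors=5),
        GradientBoostingRegressor(random_state=42),
    ]
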

Benchmark

Given that we are going to be experimenting with three models, I
expect at least one model to have a good RMSE score; note that for
RMSE, lower is better, with 0 being the perfect score.
To have a good idea about sales, the store managers need a certain
amount of confidence in the accuracy of the model. A certain rate of
error can be accepted, but if the error is almost as much as one
third, then the model is no good, since with experience the store
managers would themselves be able to predict sales with roughly that
error rate. If the models cannot do better than that, then the data
requires more processing and the selected models need to be
optimized or changed. Hence the benchmark RMSE is set at 0.33.
III. Methodology
Data Processing-
First we start by handling the outliers. Detecting outliers is an
extremely important part of the data preprocessing step of any
analysis, since their presence can skew results that take those data
points into consideration. There are many "rules of thumb" for what
constitutes an outlier in a dataset. Here, we use Tukey's method for
identifying outliers: an outlier step is calculated as 1.5 times the
interquartile range (IQR), and a data point with a feature value
beyond an outlier step outside the IQR for that feature is considered
abnormal.
I therefore wrote code that finds the outliers in Customers and in
Sales and then checks which outliers are common to both features.
The common outliers are then dropped. Here is the result:
Number of outliers in sales = 15863
Number of outliers in customers = 22600
Number of common outliers = 11334
Dropping common outliers...
1.90838602133 % of data dropped

As can be seen, the number of common outliers is 11334. This might
seem like a high number, but compared to the number of data points
given to us it is only about 1.9% of the data, so it is feasible to
drop these points. A sketch of this Tukey-based filtering is shown
below.
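A sketch of the Tukey-based filtering described above, assuming the training data is a pandas DataFrame named train; the helper name is illustrative, not the project's exact code.

    import numpy as np
    import pandas as pd

    def tukey_outlier_index(series, factor=1.5):
        # Rows lying more than 1.5 * IQR below Q1 or above Q3 are flagged as outliers.
        q1, q3 = np.percentile(series, [25, 75])
        step = factor * (q3 - q1)
        return series[(series < q1 - step) | (series > q3 + step)].index

    sales_out = tukey_outlier_index(train["Sales"])
    cust_out = tukey_outlier_index(train["Customers"])
    common = sales_out.intersection(cust_out)   # outliers shared by both features
    train = train.drop(common)
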

As discussed earlier, the data requires a moderate amount of
processing. Only some features with numerical values were used
directly: 'Store', 'CompetitionDistance', 'Promo', 'Promo2',
'SchoolHoliday'. The initial step of processing was to fill all the
NaN values with zeroes, on the assumption that a column was not
filled because of the absence of that feature. Then, to make the
processing faster, I dropped the rows where the store is closed,
i.e. where Open is set to zero, as we are only interested in
training the models on days when the store was open and hence had
sales. Then the features with categorical values, 'StoreType',
'Assortment' and 'StateHoliday', had all their values replaced by
labels that can be used in the models.
After that we move to the dates. The given date format is arbitrary
and needs to be worked on, so all the dates are split into the
features 'DayOfWeek', 'Month', 'Day', 'Year' and 'WeekOfYear'. Next
we process the competition dates, which are given as a year and a
month; we convert all the values into months so that there is one
single unit for comparison. The same step is repeated for
"PromoOpen", which is given as a year and a week. Finally,
"IsPromoMonth" is mapped against the month values and assigned 0 or
1 accordingly.
To make model creation a little faster, all rows having Sales equal
to 0 are dropped, as such a value is probably unfilled and would
only affect the model negatively. A sketch of the date handling is
shown below.
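A sketch of the date handling, assuming the merged train/store data is in a DataFrame named data with the columns listed earlier; the derived column name "CompetitionOpen" is a placeholder, and older pandas versions expose the week number as dt.weekofyear instead of isocalendar().

    import pandas as pd

    data["Date"] = pd.to_datetime(data["Date"])
    data["DayOfWeek"] = data["Date"].dt.dayofweek + 1          # 1 = Monday ... 7 = Sunday
    data["Day"] = data["Date"].dt.day
    data["Month"] = data["Date"].dt.month
    data["Year"] = data["Date"].dt.year
    data["WeekOfYear"] = data["Date"].dt.isocalendar().week.astype(int)

    # Express the competitor-opening date as months relative to the current row.
    data["CompetitionOpen"] = (
        12 * (data["Year"] - data["CompetitionOpenSinceYear"])
        + (data["Month"] - data["CompetitionOpenSinceMonth"])
    )
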
Implementation

First, the steps were divided into separate functions so that the
time taken by each could be measured in order to check the cost.
The first function trains the regressor it receives as an argument:
it applies the fit method and reports the time taken. The second
function runs predictions over the training set itself and returns
the root mean squared error, along with the time taken to make those
predictions. The third function makes predictions over the test set
and reports the time and the score.
Before training, the sales data is converted into log values in
order to make predictions easier. Each of the models mentioned in
the earlier section is then passed to these functions, and the time
and score are reported for each.
Once the scores are reported, the most efficient model is chosen and
its feature importances are calculated. Feature importance tells us
which features were the most relevant in making predictions, and
this can be compared to our analysis while exploring the data. The
feature importances are then represented in a bar graph.
Finally, the entire dataset is trained with the chosen model, and
the final predictions are made and saved to the test file. A sketch
of the helper functions described above follows.
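A sketch of the helper functions described above; the names and print formats are illustrative, not the project's exact code.

    import time
    import numpy as np
    from sklearn.metrics import mean_squared_error

    def train_model(model, X_train, y_train):
        # Fit the given regressor and report how long training took.
        start = time.time()
        model.fit(X_train, y_train)
        print("Trained model in {:.4f} seconds".format(time.time() - start))

    def score_model(model, X, y):
        # Predict on the given set, report the prediction time and return the RMSE.
        start = time.time()
        preds = model.predict(X)
        print("Made predictions in {:.4f} seconds.".format(time.time() - start))
        return np.sqrt(mean_squared_error(y, preds))

    # Sales are log-transformed before fitting, e.g. y_train_log = np.log1p(y_train);
    # predictions can be mapped back with np.expm1 when writing the submission file.
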

Refinement

The DecisionTree regressor was chosen for regression. Initially, the
error rate was high, so the Sales data used for training was
converted into logs to decrease it. The root mean squared error over
the test set was 0.1819, which is much better than I had expected.
Then, by applying GridSearchCV, the error was reduced to 0.164.
GridSearchCV exhaustively considers all parameter combinations
passed in the parameter grid; in this case, the minimum leaf samples
and minimum sample split values of the decision tree algorithm were
optimised to obtain the best score. A sketch of the search is shown
below.
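A sketch of the grid search over the two parameters mentioned above; the candidate values shown are illustrative, not the grid actually used in the project.

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    param_grid = {
        "min_samples_leaf": [1, 5, 10, 20],
        "min_samples_split": [2, 10, 20, 50],
    }
    grid = GridSearchCV(
        DecisionTreeRegressor(random_state=42),
        param_grid,
        scoring="neg_mean_squared_error",   # minimise squared error, hence RMSE
        cv=3,
    )
    # grid.fit(X_train, y_train_log)
    # best_model = grid.best_estimator_
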
IV. Results
Model Evaluation and Validation
Let's look at the results for each model.

For the DecisionTree regressor:
Training a DecisionTreeRegressor using a training set size of 392592...
Trained model in 6.3615 seconds
Made predictions in 0.6946 seconds.
mean_squared_error for training set: 0.0000.
Made predictions in 0.1140 seconds.
mean_squared_error for test set: 0.1819.

For the KNeighbors regressor:
Training a KNeighborsRegressor using a training set size of 392592...
Trained model in 3.6165 seconds
Made predictions in 23.1225 seconds.
mean_squared_error for training set: 0.1927.
Made predictions in 5.8234 seconds.
mean_squared_error for test set: 0.2470.

For the GradientBoost regressor:
Training a GradientBoostingRegressor using a training set size of 392592...
Trained model in 71.3005 seconds
Made predictions in 1.1283 seconds.
mean_squared_error for training set: 0.3151.
Made predictions in 0.2588 seconds.
mean_squared_error for test set: 0.3181.
For KNeighbors, the cost of the model is low as expected, and the
error rate is below the benchmark value that was decided. It is a
good model, but its error rate is higher than that of the
DecisionTree model, so it is not used for the final predictions.
The GradientBoost regressor has an extremely high training cost,
which was expected, but it also gives the highest error rate, so it
does not qualify as the optimal model.
The DecisionTree regressor is the clear winner: it has the lowest
error rate and doesn't take too long to train. Even though its
training time is higher than KNeighbors, the time taken to make
predictions is much lower. The benchmark RMSE was 0.33, and this
model achieved 0.18, which is even better, making DecisionTree the
most suitable model.
After optimisation of the DecisionTree through grid search, the
model's error rate was reduced to 0.16.
Justification

In the previous section, the benchmark value for the error rate was
set at roughly 0.33. Thanks to the DecisionTree regressor, the
achieved error rate is almost half of that value. If sales can be
predicted with an error rate of only 0.16, it becomes much easier
for a manager to make changes and see precisely what increases or
decreases sales.
DecisionTree regression created a model that predicts the Sales
value by learning simple decision rules inferred from the data
features. The error was low because if-then-else decision rules were
applied to each feature to predict sales, and since the features are
both numeric and categorical, the model was a good fit.
Thus, the model is finally trained on the entire dataset and the
Sales of the test set are predicted. It can be said with reasonable
confidence that this model predicts the required values precisely.
Hence, the task of this project is completed.
V. Conclusion
Free-Form Visualization

Finally, once the optimal model has been trained, we can look at
which features affected the predictions the most and evaluate the
expectations formed earlier in this project. Let's look at the
importances.
Interestingly, Store, DayOfWeek and Date were the most important
features, i.e. the ones that seemed to make the most difference to
sales, so the earlier expectation was right. On the other hand, the
holiday and promo features also seemed to make a lot of difference,
yet were given lower ranks. A sketch of how such an importance chart
can be produced is shown below.
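A sketch of how such an importance chart can be produced, assuming best_model is the tuned DecisionTreeRegressor and feature_cols the list of its input columns (both placeholder names).

    import pandas as pd
    import matplotlib.pyplot as plt

    importances = pd.Series(best_model.feature_importances_, index=feature_cols)
    importances.sort_values().plot(kind="barh")   # horizontal bar chart of importances
    plt.title("Feature importances of the tuned DecisionTreeRegressor")
    plt.tight_layout()
    plt.show()
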
Reflection
The initial analysis was an interesting part of the project, as it
was able to indicate which features affect sales almost as precisely
as the feature_importances_ property of the DecisionTree regressor
later confirmed. There were difficulties in visualising the data
because it had not yet been preprocessed; the many NaN values in the
data were degrading the quality of the outputs at almost every
phase.
I expected the final model to be less time consuming, but 6-7
seconds of training time is still acceptable. Optimising the model
was a challenge, as there were indexing errors that had to be dealt
with (as commented in the code).
This model can now be used for predicting sales, and even if more
stores or different businesses come along, the model can produce
proper predictions as long as similar features are available.

Improvement
I believe the given features were enough to build a model of optimal
precision. There may be algorithms, such as XGBoost, that could do a
better job of prediction than a decision tree if fully optimised,
but that cannot be said for sure. If the final solution were used as
a benchmark it would be tough to beat, but better models with even
lower error rates exist, so there is wide scope for improvement
through further research on regression models.
