0% found this document useful (0 votes)

10 views

Machine Learning Project

Uploaded by

Himanshu Saxena

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views

Machine Learning Project

Uploaded by

Himanshu Saxena

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 10

NextHikes

Pharmaceutical Sales prediction

across multiple stores
Overview

Business Need
You work at Nexthikes as a Machine Learning Engineer and a company called Rossmann
Pharmaceuticals has given you a project on sales forecasting. The finance team wants to
forecast sales in all their stores across several cities six weeks ahead of time. Managers
in individual stores rely on their years of experience as well as their personal judgement
to forecast sales.

The data team identified factors such as promotions, competition, school and state
holidays, seasonality, and locality as necessary for predicting sales across the various
stores.

Your job is to build and serve an end-to-end product that delivers this prediction to
analysts in the finance team.

Data and Features

The data for this challenge can be found here.

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that
aren't.
Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what you are predicting)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are
closed on state holidays. Note that all schools are closed on public holidays and
weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public
schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended. Read
more about assortment here
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the
time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 =
store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store
started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the
months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round
starts in February, May, August, November of any given year for that store

Learning Outcomes
- Technical Skills: Pandas, Matplotlib, Numpy, HTML and CSS ,Flask. Interns will
also improve their code modularization skills.
- Creation of new features
- Predictive pipeline: Exploratory data analysis, data wrangling, building and
fine-tuning models
- Building model using MLOps Techniques
- Deployment: Interns will know how to serve predictions in a web app.

Badges
Each week, one user will be awarded one of the badges below for the best performance
in the category below.

In addition to being the badge holder for that badge, each badge winner will get +20
points to the overall score.
Visualization - quality of visualizations, understandability, skimmability, choice of
visualization
Quality of code - reliability, maintainability, efficiency, commenting - in future this
will be CICD/CML
Innovative approach to analysis -using latest algorithms, adding in research
paper content and other innovative approaches
Writing and presentation - clarity of written outputs, clarity of slides, overall
production value
Most supportive in the community - helping others, adding links, tutoring those
struggling

The goal of this approach is to support and reward expertise in different parts of the
Machine learning engineering toolbox.

Badges
Each week, one Intern will be awarded one of the badges below for the best
performance in the category below.

In addition to being the badge holder for that badge, each badge winner will get +20
points to the overall score.

Visualization - quality of visualizations, understandability, skimmability, choice of

visualization
Quality of code - reliability, maintainability, efficiency, commenting - in future this
will be CICD
Innovative approach to analysis -using latest algorithms, adding in research
paper content and other innovative approaches
Writing and presentation - clarity of written outputs, clarity of slides, overall
production value
Most supportive in the community - helping others, adding links, tutoring those
struggling

The goal of this approach is to support and reward expertise in different parts of the Data
Scientist toolbox.

Late Submission Policy

Our goal is to prepare successful learners for the work and submitting late, when given
enough notice, shouldn’t be necessary.
For interim submissions, those submitted 1-6 hours late will receive a maximum of 50%
of the total possible grade. Those submitted >6 hours late may receive feedback, but
will not receive a grade.

For final submissions, those submitted 1-24 hours late, will receive a maximum of 50% of
the total possible grade. Those submitted >24 hours late may receive feedback, but will
not receive a grade.
Instructions
The task is divided into the following objectives
- Exploration of customer purchasing behavior
- Prediction of store sales
- Machine learning approach
- Deep Learning approach
- Serving predictions on a web interface

Task 1 - Exploration of customer purchasing behaviour

Exploratory data analysis is the lifeblood of every meaningful machine-learning project.

It helps us unravel the nature of the data and sometimes informs how you go about
modelling. A careful exploration of the data encapsulates checking all available features,
checking their interactions and correlation as well as their variability with respect to the
target.

In this task, you seek to explore the behaviour of customers in the various stores. Our
goal is to check how some measures such as promos and opening of new stores affect
purchasing behavior.

To achieve this goal, you need to first clean the data. The data cleaning process will
involve building pipelines to detect and handle outlier and missing data. This is
particularly important because you don’t want to skew our analysis.

Visualizing various features and interactions is necessary for clearly communicating our
findings. It is a powerful tool in the data science toolbox. Communicate the findings
below via the necessary plots.

You can use the following questions as a guide during your analysis. It is important to
come up with more questions to explore. This is part of our expectation for an excellent
analysis.

● Check for distribution in both training and test sets - are the promotions
distributed similarly between these two groups?
● Check & compare sales behavior before, during and after holidays
● Find out any seasonal (Christmas, Easter etc) purchase behaviours,
● What can you say about the correlation between sales and number of
customers?
● How does promo affect sales? Are the promos attracting more customers? How
does it affect already existing customers?
● Could the promos be deployed in more effective ways? Which stores should
promos be deployed in?
● Trends of customer behavior during store open and closing times
● Which stores are opened on all weekdays? How does that affect their sales on
weekends?
● Check how the assortment type affects sales
● How does the distance to the next competitor affect sales? What if the store and
its competitors all happen to be in city centres, does the distance matter in that
case?
● How does the opening or reopening of new competitors affect stores? Check for
stores with NA as competitor distance but later on has values for competitor
distance

Deliver your exploratory analysis notebook - make sure you answer all the questions
asked in task 1 using the appropriate plots or summary tables and give useful insights. A
3 - 5 slides presentation is enough for interim submission.

1.2 - Logging
Log your steps using the logger library in python.

Task 2 - Prediction of store sales

Prediction of sales is the central task in this challenge. you want to predict daily sales in
various stores up to 6 weeks ahead of time. This will help the company plan ahead of
time.

The following steps outline the various sub tasks needed to effectively do this:

2.1 Preprocessing
It is important to process the data into a format where it can be fed to a machine learning
model. This typically means converting all non-numeric columns to numeric, handling
NaN values and generating new features from already existing features.
In our case, you have a few datetime columns to preprocess. you can extract the
following from them:
- weekdays
- weekends
- number of days to holidays
- Number of days after holiday
- Beginning of month, mid month and ending of month
- (think of more features to extract), extra marks for it

As a final thing, you have to scale the data. This helps with predictions especially when
using machine learning algorithms that use Euclidean distances. you can use the
standard scaler in sklearn for this.

2.2 Building models with sklearn pipelines

At this point, all our features are numeric. Since our problem is a regression problem, you
can narrow down the list of algorithms you can use for modelling.

A reasonable starting point will be to use any of the tree based algorithms. Random
forests Regressor will make for a good start.

Also, for the sake of this challenge, work with sklearn pipelines. This makes modeling
modular and more reproducible. Working with pipelines will also significantly reduce
your workload when you are moving your setup into files for the next part of the
challenge. Extra marks will be awarded for doing this.

2.3 Choose a loss function

Loss functions indicate how well our model is performing. This means that the loss
functions affect the overall output of sales prediction.
Different loss functions have different use cases.

In this challenge, you’re allowed to choose your own loss function. you need to defend
the loss function you choose for this challenge. Feel free to be creative with your choice.
You might want to use loss functions that are easily interpretable.

2.4 Post Prediction analysis

Explore the feature importance from our modelling. Creatively deduce a way to estimate
the confidence interval of your predictions. Extra marks will be given for this.
2.5 Serialize models
To serve the models you built above, you need to serialize them. Save the model with
the timestamp(eg. 10-08-2020-16-32-31-00.pkl). This is necessary so that you can track
predictions from various models.

Assume that you’ll make daily predictions. This means you’ll have various models for
predictions hence the reason for serializing the models in the format above.

2.6 Building model with deep learning

Deep Learning techniques can be used to predict various outcomes including but not
limited to future sales. Your task is to create a deep learning model of the Long Short
Term Memory which is a type of Recurrent Neural Network .

You can use either Tensorflow or Pytorch libraries for model building. The model should
not be very deep (Two layers) due to the computational requirements, it should
comfortably run in google colab.
1. Isolate the Rossmann Store Sales dataset into time series data
2. Check whether your time Series Data is Stationary
3. Depending on your conclusion from 2 above difference your time series data
4. Check for autocorrelation and partial autocorrelation of your data
5. Transform the time series data into supervised learning data by creating a new
y(target) column. For example as illustrated here in the Sliding Window For Time
Series Data section
6. Scale your data in the (-1, 1) range
7. Build a LSTM Regression model to predict the next sale.

2.7 Using MLFlow to serve the prediction

Use the code snippet provided by mlflow to make inference on your test data

Task 3 - Serving predictions on a web interface

Use one of the platforms of your choice (Flask, Streamlit, pure javascript, etc.) to design,
and build a backend to make inference using your trained model and input parameters
collected through a frontend interface.

Your dashboard should provide an easy way for a user (in this case managers of the
stores) to enter required input parameters, and output the predicted sales amount and
customer numbers.

The input fields in the frontend are for example

● Store_id
● Upload csv file with a columns name
○ Date
○ IsHoliday
○ IsWeekend
○ IsPromo
○ Any other parameter which is dependent on the date
● Any any other parameter requires as input for your model that is not dependent
on date

Finally your dashboard should show a plot that shows the predicted sales amount and
number of customers. It should also allow the user to download the prediction in the
form of a csv table.

Hosting
you can use heroku for hosting or any other platform that allows you to host your App.
Create a free heroku instance and deploy your code. Submit a link to your site.

Interim Submissions

To be categorised to the different weeks till 4th

week
● Your employer wants a quick meeting after you’ve done a first quick pass of the
data and wants to know whether further investigation is useful. To achieve this,
summarize your findings from Exploration of customer purchasing behavior (task
1) in 3 - 5 slides.
● Link to your Github code that includes your Jupyter notebook.
● Submit Screenshots showing:
■ Multiple data versions in your DVC store
■ Multiple model versions in your MLFlow dashboard

Feedback
You may not receive detailed comments on your interim submission, but will receive a
grade.

Final Submission (Date)

● PDF suitable to submit as a blog on your analysis. More emphasis on Deep
Learning Modelling is rewarded.
● Link to your Github code that includes your Jupyter notebook.
● Link to your Deployed Application and/or screenshot of your dashboard
● Screenshots demonstrating anything else you have done

Feedback
You will receive comments/feedback in addition to a grade.

(Assignment Template) : ILM Level 5 Certificate in Coaching and Mentoring
75% (4)
(Assignment Template) : ILM Level 5 Certificate in Coaching and Mentoring
16 pages
Capstone Project 1 1
33% (3)
Capstone Project 1 1
4 pages
Data Analysis On BigMart Sales
67% (3)
Data Analysis On BigMart Sales
17 pages
SS Teamproject Documentation
No ratings yet
SS Teamproject Documentation
33 pages
Black Friday Sales
No ratings yet
Black Friday Sales
26 pages
Fundamentals of Technical Product Management
From Everand
Fundamentals of Technical Product Management
Tariq Syed
4.5/5 (12)
7 Actions How to Mentally Prepare as an Entrepreneur
From Everand
7 Actions How to Mentally Prepare as an Entrepreneur
Cheok Tuan Eng
No ratings yet
Setting Up an Online Business
From Everand
Setting Up an Online Business
KingsHub
No ratings yet
Retail Sales Prediction Model
No ratings yet
Retail Sales Prediction Model
50 pages
Rashmi Jeswani Capstone
No ratings yet
Rashmi Jeswani Capstone
84 pages
Predicting Sales of Rossman Stores: Machine Learning Engineer Nanodegree
No ratings yet
Predicting Sales of Rossman Stores: Machine Learning Engineer Nanodegree
23 pages
Rossmann Sales Analysis PDF
No ratings yet
Rossmann Sales Analysis PDF
3 pages
Rossmann Sales Prediction: Computing For Data Sciences-Final Project
100% (1)
Rossmann Sales Prediction: Computing For Data Sciences-Final Project
46 pages
Cost of goods sold A Clear and Concise Reference
From Everand
Cost of goods sold A Clear and Concise Reference
Gerardus Blokdyk
No ratings yet
Mastering New Customer Acquisition
From Everand
Mastering New Customer Acquisition
Clifford Woods
No ratings yet
Session 6 MKTG3528 Final Project
No ratings yet
Session 6 MKTG3528 Final Project
47 pages
Market data Complete Self-Assessment Guide
From Everand
Market data Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Multichannel Second Edition
From Everand
Multichannel Second Edition
Gerardus Blokdyk
No ratings yet
How to Start Your Online Business
From Everand
How to Start Your Online Business
Yassir Albonie
No ratings yet
Multimedia Third Edition
From Everand
Multimedia Third Edition
Gerardus Blokdyk
No ratings yet
Project_details
No ratings yet
Project_details
5 pages
Target market Second Edition
From Everand
Target market Second Edition
Gerardus Blokdyk
No ratings yet
Succeed Through Customer Knowledge: A Guide to Practical Customer-Originated Marketing
From Everand
Succeed Through Customer Knowledge: A Guide to Practical Customer-Originated Marketing
Kazuki Nishiguchi
No ratings yet
Applied Datascience - Phase3
No ratings yet
Applied Datascience - Phase3
8 pages
NodeMCU A Complete Guide
From Everand
NodeMCU A Complete Guide
Gerardus Blokdyk
No ratings yet
Data cube A Clear and Concise Reference
From Everand
Data cube A Clear and Concise Reference
Gerardus Blokdyk
No ratings yet
How to do an analysis of exceptional dice for sales - definitive guide to commercial success
From Everand
How to do an analysis of exceptional dice for sales - definitive guide to commercial success
Digital World
No ratings yet
Product market A Complete Guide
From Everand
Product market A Complete Guide
Gerardus Blokdyk
No ratings yet
Creating A Digital Product- Beginner’s
From Everand
Creating A Digital Product- Beginner’s
Natisha Peters
No ratings yet
How To Win Customers Every Day _ Volume 7: Data-Driven Selling: The Complete Guide to Success
From Everand
How To Win Customers Every Day _ Volume 7: Data-Driven Selling: The Complete Guide to Success
MAX EDITORIAL
No ratings yet
ISO 91-2 1991 Standard Requirements
From Everand
ISO 91-2 1991 Standard Requirements
Gerardus Blokdyk
No ratings yet
Evidence Guided: Creating High Impact Products in the Face of Uncertainty
From Everand
Evidence Guided: Creating High Impact Products in the Face of Uncertainty
Itamar Gilad
No ratings yet
dk-phase2 (1)
No ratings yet
dk-phase2 (1)
5 pages
New production Complete Self-Assessment Guide
From Everand
New production Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Analytical project using python BMBA-252
No ratings yet
Analytical project using python BMBA-252
4 pages
Data model Second Edition
From Everand
Data model Second Edition
Gerardus Blokdyk
No ratings yet
Operating model A Clear and Concise Reference
From Everand
Operating model A Clear and Concise Reference
Gerardus Blokdyk
No ratings yet
Internship Report Data Science
100% (1)
Internship Report Data Science
58 pages
Business Problem: Time Series: Problem Statement
No ratings yet
Business Problem: Time Series: Problem Statement
2 pages
Carbon Markets Third Edition
From Everand
Carbon Markets Third Edition
Gerardus Blokdyk
No ratings yet
How to Develop New Products and Services
From Everand
How to Develop New Products and Services
Dr Jim Porter
No ratings yet
Marketing Affinity A Complete Guide - 2020 Edition
From Everand
Marketing Affinity A Complete Guide - 2020 Edition
Gerardus Blokdyk
No ratings yet
Marketing Analytics A Complete Guide - 2020 Edition
From Everand
Marketing Analytics A Complete Guide - 2020 Edition
Gerardus Blokdyk
No ratings yet
SAP IQ Third Edition
From Everand
SAP IQ Third Edition
Gerardus Blokdyk
No ratings yet
IP multicast Third Edition
From Everand
IP multicast Third Edition
Gerardus Blokdyk
No ratings yet
Side Hustle Success Secrets. The Ultimate AI Prompt Toolkit for Building Profitable Income Streams.
From Everand
Side Hustle Success Secrets. The Ultimate AI Prompt Toolkit for Building Profitable Income Streams.
The Suburban Guru
No ratings yet
Sasi 2nd
No ratings yet
Sasi 2nd
34 pages
Carbon Market Standard Requirements
From Everand
Carbon Market Standard Requirements
Gerardus Blokdyk
No ratings yet
Mass-media The Ultimate Step-By-Step Guide
From Everand
Mass-media The Ultimate Step-By-Step Guide
Gerardus Blokdyk
No ratings yet
Product Development
From Everand
Product Development
Ethan Evans
No ratings yet
News analytics Third Edition
From Everand
News analytics Third Edition
Gerardus Blokdyk
No ratings yet
Project Phase A Complete Guide
From Everand
Project Phase A Complete Guide
Gerardus Blokdyk
No ratings yet
Synopsis Format
No ratings yet
Synopsis Format
8 pages
Market access Complete Self-Assessment Guide
From Everand
Market access Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Product marketing The Ultimate Step-By-Step Guide
From Everand
Product marketing The Ultimate Step-By-Step Guide
Gerardus Blokdyk
No ratings yet
Walmart Capstone Project
No ratings yet
Walmart Capstone Project
46 pages
Object model Third Edition
From Everand
Object model Third Edition
Gerardus Blokdyk
No ratings yet
Earned media A Complete Guide
From Everand
Earned media A Complete Guide
Gerardus Blokdyk
No ratings yet
Street marketing The Ultimate Step-By-Step Guide
From Everand
Street marketing The Ultimate Step-By-Step Guide
Gerardus Blokdyk
No ratings yet
Running Costs Second Edition
From Everand
Running Costs Second Edition
Gerardus Blokdyk
No ratings yet
Market research Third Edition
From Everand
Market research Third Edition
Gerardus Blokdyk
No ratings yet
Outcomes Map For Weeks 1-10: How Can I Improve Health and Wellness For Myself and Others With Routine Physical Activity?
No ratings yet
Outcomes Map For Weeks 1-10: How Can I Improve Health and Wellness For Myself and Others With Routine Physical Activity?
4 pages
Behavior Modification Questions
No ratings yet
Behavior Modification Questions
2 pages
COB - Income, Social Class and Consumer Behaviour - Group13
No ratings yet
COB - Income, Social Class and Consumer Behaviour - Group13
4 pages
Econometrics Essay - Bridging The Gap Between Theory and Data
No ratings yet
Econometrics Essay - Bridging The Gap Between Theory and Data
3 pages
Marketing Management-CP 1
0% (1)
Marketing Management-CP 1
14 pages
Relay Protection System of Transmission Line Based On AI - PLOS ONE
No ratings yet
Relay Protection System of Transmission Line Based On AI - PLOS ONE
13 pages
Chapter Two Literature Review
No ratings yet
Chapter Two Literature Review
80 pages
Human Relation in An Organization
No ratings yet
Human Relation in An Organization
51 pages
2018-19 Course Outline - MGMT 2008
No ratings yet
2018-19 Course Outline - MGMT 2008
10 pages
Psychology and Marketing - 2024 - Nasr - Self‐Comparing With Virtual Influencers Effects on Followers Wellbeing
No ratings yet
Psychology and Marketing - 2024 - Nasr - Self‐Comparing With Virtual Influencers Effects on Followers Wellbeing
19 pages
HRM Practices Used To Promote Pro Environmental Behavior A UK Survey
No ratings yet
HRM Practices Used To Promote Pro Environmental Behavior A UK Survey
23 pages
Child and Adolescent Development Activity # 1
No ratings yet
Child and Adolescent Development Activity # 1
11 pages
Update - Barbie Project
100% (1)
Update - Barbie Project
28 pages
Parents Choices of Preschools - Issues and Challenges
No ratings yet
Parents Choices of Preschools - Issues and Challenges
11 pages
Nature, History & Scope of Social Psychology
100% (3)
Nature, History & Scope of Social Psychology
35 pages
Ebooks File (Ebook PDF) Consumer Behaviour 4th Edition by Zubin Sethna All Chapters
100% (4)
Ebooks File (Ebook PDF) Consumer Behaviour 4th Edition by Zubin Sethna All Chapters
41 pages
JER Azlia Maharani 324-334
No ratings yet
JER Azlia Maharani 324-334
11 pages
Social Class & Strarification
No ratings yet
Social Class & Strarification
10 pages
Revised CBSE Class 11 Psychology Syllabus 2020-21
No ratings yet
Revised CBSE Class 11 Psychology Syllabus 2020-21
6 pages
Developing Effective Behavior Intervention Plans Killu PDF
No ratings yet
Developing Effective Behavior Intervention Plans Killu PDF
11 pages
Topic 1 Business Ethics
No ratings yet
Topic 1 Business Ethics
7 pages
Conflict Management short note
No ratings yet
Conflict Management short note
9 pages
Barista Latest
0% (1)
Barista Latest
53 pages
2 VERSION Expository Essay - Donadio
No ratings yet
2 VERSION Expository Essay - Donadio
2 pages
Spatial Forms and Signage in Wayfinding Decision Points For Hospital Outpatient Services
No ratings yet
Spatial Forms and Signage in Wayfinding Decision Points For Hospital Outpatient Services
9 pages
CP Project-1-1
No ratings yet
CP Project-1-1
43 pages
Advertising Help Paper
No ratings yet
Advertising Help Paper
111 pages
To Analyze The Motivation Behind Consumers Purchasing Airline Tickets Online
No ratings yet
To Analyze The Motivation Behind Consumers Purchasing Airline Tickets Online
18 pages
Sahin Et Al 2017 Investigating The Impacts of Brand Experience and Service Quality
No ratings yet
Sahin Et Al 2017 Investigating The Impacts of Brand Experience and Service Quality
18 pages

Machine Learning Project

Uploaded by

Machine Learning Project

Uploaded by

NextHikes

Pharmaceutical Sales prediction

Data and Features

Visualization - quality of visualizations, understandability, skimmability, choice of

Late Submission Policy

Task 1 - Exploration of customer purchasing behaviour

Exploratory data analysis is the lifeblood of every meaningful machine-learning project.

Task 2 - Prediction of store sales

2.2 Building models with sklearn pipelines

2.3 Choose a loss function

2.4 Post Prediction analysis

2.6 Building model with deep learning

2.7 Using MLFlow to serve the prediction

Task 3 - Serving predictions on a web interface

The input fields in the frontend are for example

To be categorised to the different weeks till 4th

Final Submission (Date)

You might also like