DS For Business Home Assignments
15 points
Suggestions
1. Use Aggregation.
2. Use Aggregation twice, the second time with an empty group-by. Use the fact that the
total number of deaths in a country is maximal on the last available day.
3. Use Differentiate; use Filter Examples.
4. Use Join; use Generate Attribute; use Filter Examples.
5. Use Generate Attribute; you can get the weekday with the following function:
date_str_custom(Date, "E"), where Date is the name of the column with the date;
use Pivot.
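If you are working in Python rather than RapidMiner, suggestion 5 maps onto pandas roughly as follows. The frame below is a toy stand-in for the preprocessed COVID table, and the column names (`Date`, `Country`, `New_deaths`) are assumptions, not the real ones:

```python
import pandas as pd

# Toy frame standing in for the preprocessed COVID data; the real
# column names depend on your preprocessing and are assumptions here.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-13", "2020-04-14", "2020-04-18"]),
    "Country": ["X", "X", "X"],
    "New_deaths": [5, 7, 3],
})

# Rough equivalent of date_str_custom(Date, "E"): abbreviated weekday name.
df["Weekday"] = df["Date"].dt.strftime("%a")

# Pivot: countries as rows, weekdays as columns, summed new deaths.
pivoted = df.pivot_table(index="Country", columns="Weekday",
                         values="New_deaths", aggfunc="sum")
print(pivoted)
```

`%a` gives a locale-dependent abbreviation ("Mon", "Tue", ...), which may differ slightly from RapidMiner's "E" pattern.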
You are provided with a starter process that downloads and preprocesses the latest
available COVID data:
1. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
2. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
3. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
However, for the purpose of this assignment you should use the data from April 18,
2020 (files attached on the wiki page).
You should submit your answers to the Google Form (link on the wiki page).
Home assignment 2. 20 points, 1 bonus point
Analyze Titanic data.
I. Start with basic EDA (Exploratory data analysis):
1. Compute the average `Age` of passengers and the numbers of passengers who
survived and who did not, grouped by `Sex` and `Passenger Class` (24
numbers);
2. What can you say about survivors based on the resulting table (open
question)? E.g., what is the survival ratio for females in First Class compared
to the Second and Third?
Your answer is limited to 150 words.
3. What is the average number of males and females across all boats (rounded to
the nearest integer)?
Do not forget to filter out all `?` values in the `Life Boat` attribute.
Suggestions.
I.1 Use the Aggregation block.
I.3 Use the Aggregation block twice.
II.2 Use the Generate Attribute block.
II.3 Use the Aggregation block with the `count` aggregation function; use the Join block.
II.4 Use the example block from Seminar 3.
II.5 Use the One Hot Encoding block.
II.6 Use the Set Role block.
III.1 Use the example from Seminar 3.
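If you solve part I in Python, suggestions I.1 and I.3 roughly correspond to the sketch below. The rows are toy stand-ins and the column names (`Sex`, `Pclass`, `Age`, `Survived`, `Boat`) are assumptions borrowed from the common Titanic dataset layout:

```python
import pandas as pd

# Toy rows in place of the full Titanic table; column names here
# are assumptions and may differ in the provided file.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 3],
    "Age": [30.0, 40.0, 20.0, 30.0],
    "Survived": [1, 1, 0, 1],
    "Boat": ["5", "5", "?", "13"],
})

# I.1: average age and survivor / non-survivor counts per Sex x Class.
summary = df.groupby(["Sex", "Pclass"]).agg(
    avg_age=("Age", "mean"),
    survived=("Survived", "sum"),
    died=("Survived", lambda s: (s == 0).sum()),
)

# I.3: drop '?' boats, count males/females per boat, then average
# over boats (two aggregations, as the suggestion hints).
boats = df[df["Boat"] != "?"]
per_boat = boats.groupby(["Boat", "Sex"]).size().unstack(fill_value=0)
avg_per_boat = per_boat.mean().round().astype(int)
print(summary)
print(avg_per_boat)
```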
Upload your solution (.py, .r, .ipynb, .rmp). Answers without an uploaded solution file will not
be graded!
You should submit your answers to the Google Form (link on the wiki page).
Home assignment 3. 35 points, 10 bonus points
“You are provided with historical sales data for 45 Walmart stores located in
different regions. Each store contains a number of departments, and you are tasked
with predicting the department-wide sales for each store.”
Data description
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
Before starting this assignment I highly recommend watching the recording of the
last seminar, as it answers many questions about the data and the task.
If you are working in Python you can find a description of all preprocessing steps
in Seminar 4 plan.
For bonus points you should complete III.6 and provide an additional report
file under the same rules (150-350 words and up to 4 graphs). Name your report with
the bonus task the following way:
DS_HA3_[Surname]_[Name]_[chosen bonus task]_bonus.pdf
For example:
DS_HA3_Kurmukov_Anvar_4_bonus.pdf
which would mean that task II.4 was chosen as the bonus task.
For this task your main goal is to decrease the company's losses due to customer
churn. We will compare two discount strategies: providing a 20% discount with a
75% acceptance rate, and a 30% discount with a 90% acceptance rate.
For this assignment you should use this data (subset of the seminar’s data):
https://yadi.sk/d/K6DApyOjp42IYA
Comments:
1. For section III you may use any classification model you want (you are
not restricted to the two models above).
2. You may want to use grid search to look for the best parameters of the
model(s).
1. Use the default threshold of 0.5 to compute the confusion matrix. Based on
this confusion matrix report (5 points):
a. TP, FP, TN, FN
b. Losses if you do not apply any discount strategy.
c. Total gains from the discount strategy B.
d. Total costs of the discount strategy B.
e. Total losses of the discount strategy B.
f. Total profit of the discount strategy B.
g. Profit per customer p_d (using strategy B).
2. Use 9 different thresholds: 0.1, 0.2, ..., 0.9. Answer the following
questions (10 points):
a. What is the threshold with the highest accuracy, using strategy B?
b. What is the threshold with the highest profit, using strategy B? What
is the highest profit, using strategy B?
c. What is the threshold with the highest profit per customer, using
strategy B? What is the highest profit per customer, using strategy
B?
d. What is the ratio of the profit per customer p (obtained in the previous
step) to p_d?
e. Which strategy yields the highest profit (A or B)? What are the TP,
FP, TN, FN in this case? What is the highest profit in that case?
f. Which strategy yields the highest profit per customer (A or B)? What
are the TP, FP, TN, FN in this case? What is the highest profit per
customer in that case?
3. Prepare a report (10 points). Your report must summarize your results.
Reports that simply copy-paste the results will be graded with 0 points!
Some example questions (you are not limited or restricted to them):
● Do thresholds for the highest profit and highest profit per
customer coincide or not? Why?
● Which would you choose, and under what circumstances (how
many clients would you lose in each situation; what should the
decision criteria be)?
● How much does the profit per customer decrease for customers
to whom you provide a discount, compared to customers to
whom you do not? Compare this number with the discount.
4. *For the bonus 10 points you need to redo all computations, but now instead
of the average p you should use each customer's MonthlyCharges. All results for
the bonus task must be summarized in an additional report. You must
provide a comparison with the regular case (when you use p).
I. Clustering (20 points)
Python examples:
Collaborative filtering example
https://realpython.com/build-recommendation-engine-collaborative-filtering/
MF recommender systems library https://github.com/lyst/lightfm
1. Prepare user-item data as it was done during the seminar: User, Item,
Score. You may construct Score (e.g. see seminar) any way you want,
but you must explain it in your report.
2. Split your data into train and test sets (as Leonid explained during the
lecture): some of the user-item pairs go to the train set and some to the
test set.
3. Build a recommender system using cluster groups (if you have about
20-40 clusters) or item subcategories (the 75 most frequent values of the
`dbi_item_famly_name` attribute) as items and `dd_card_number` as
users. You may want to play with the number of neighbours in your
KNN recommender model.
4. Compute 3 different recommender performance scores that were
explained during the lecture or seminar to assess the quality of your
recommendations (use appropriate metrics).
5. Write a report. In your report you should present the following
information:
a. Report computed performance scores.
b. Elaborate on the quality of your recommendations.
c. Provide 3-5 examples of `good` recommendations suggested by
your recommender system.
d. Provide 3-5 examples of `bad` recommendations suggested by
your recommender system.
e. You may report any additional information you find potentially
useful for assessing the quality of your recommendations: e.g., for a
couple of customers, compute the price of their average
purchase (or of an item in a purchase) and compare it with the
average price of the recommended items.
f. You may use any visualisations you find useful.
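A minimal item-based recommender in the spirit of step 3 can be sketched with plain numpy. The toy matrix below is an assumption; in practice the columns would be your clusters or `dbi_item_famly_name` subcategories, the rows `dd_card_number` users, and you might prefer a library such as lightfm:

```python
import numpy as np

# Toy user x item score matrix (rows = users, columns = items);
# replace with your real user-item Score table.
scores = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
])

# Item-item cosine similarity, computed directly.
items = scores.T
norms = np.linalg.norm(items, axis=1, keepdims=True)
sim = (items @ items.T) / (norms * norms.T)

# Recommend for user 0: score items by similarity-weighted sums of
# the user's existing scores, then mask out items already seen.
user = 0
seen = scores[user] > 0
pred = sim @ scores[user]
pred[seen] = -np.inf                  # do not re-recommend seen items
print("top recommendation for user 0: item", int(np.argmax(pred)))
```

A KNN variant would keep only the k most similar items per item when forming `pred`; that is the "number of neighbours" you are asked to play with.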
Your report should consist of 2 parts: a report for the clustering part and a report for the
recommendation part. Each part is limited to 350 words and 5 pictures. Name
your report the following way:
DS_HA5_[Surname]_[Name].pdf
We are using house sale price data from King County, Washington, USA. This
dataset is in the public domain and can be obtained from Kaggle:
https://www.kaggle.com/harlfoxem/housesalesprediction
8. Split your data into train (70%) and test (30%) parts. How many records
(rows) do you have in the train and test tables? (list of int)
9. Create a predictive regression model of a house price.
a. Use decision tree regression
b. Use k nearest neighbours regression
10. Use grid search to select the optimal hyperparameters of your models.
a. Depth for the tree
b. Number of neighbours for the kNN
11. Compute train and test mean squared error for your best models (list of
float).
a. Train, test MSE using decision tree regression
b. Train, test MSE using k nearest neighbours regression
12. Normalize your numerical features and repeat steps 9-11. Do the train and test
MSE change for the kNN model? For the decision tree model?
13. Write a short (3-5 sentences) report on your solution.
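Steps 8-12 can be sketched with scikit-learn as below. The synthetic data stands in for the King County CSV, and the hyperparameter grids are illustrative assumptions, not prescribed values:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the King County table; replace with the CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=500)

# Step 8: 70/30 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 9-11: grid search over depth (tree) and neighbours (kNN),
# then train/test MSE for each best model.
tree = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"max_depth": [3, 5, 7, 9]}, cv=3)
knn = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}, cv=3)
for name, model in [("tree", tree), ("knn", knn)]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")

# Step 12: normalization usually matters for kNN (distance-based)
# but not for the tree (splits are scale-invariant).
scaler = StandardScaler().fit(X_tr)
knn.fit(scaler.transform(X_tr), y_tr)
print("knn test MSE (scaled):",
      round(mean_squared_error(y_te, knn.predict(scaler.transform(X_te))), 3))
```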