DS For Business Home Assignments

Home assignment 1. 15 points

1. Compute statistics (over countries) of the total number of confirmed cases on the
10th day since 50 confirmed cases:
a. Mean
b. Median
c. Max
d. Min
If March 13 is the first day with >= 50 confirmed cases, then March 22 is the 10th day.

2. Compute statistics (over countries) of the total number of deaths on the last available
day:
a. Mean
b. Median
c. Max
d. Min
3. What was the average number of new cases over the last 10 days in Germany?
4. Compute the case fatality rate (deaths to total cases ratio) for the last available day in
countries with more than 10 000 reported cases (in total).
a. What is the biggest case fatality rate? Write the percentage rounded to 2
decimal places.
b. What is the lowest? Write the percentage rounded to 2 decimal places.
c. Plot a scatter plot: Total number of cases vs Case fatality rate; color
points according to the country.
5. On which weekday were the most cases reported in France on average? On which
weekday were the fewest cases reported in Italy on average?

Write all numbers rounded to 2 decimal places.

Suggestions

1. Use Aggregation.
2. Use Aggregation twice, the second time with an empty groupby. Use the fact that the
total number of deaths in a country is maximal on the last available day.
3. Use Differentiate; use Filter Examples.
4. Use Join; use Generate Attribute; use Filter Examples.
5. Use Generate Attribute; you can get the weekday using the following function:
date_str_custom(Date, "E"), where Date is the name of the column with the date;
use Pivot.
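The suggestions above refer to RapidMiner operators. If you are solving the assignment in Python instead, a minimal pandas sketch for task 1 could look as follows (it assumes the JHU wide format with one row per country/province and one column per date; the file name is illustrative):

```python
import pandas as pd

# Hypothetical file name: the attached April 18, 2020 confirmed-cases table (JHU wide format).
confirmed = pd.read_csv("time_series_covid19_confirmed_global.csv")

# Date columns follow Province/State, Country/Region, Lat, Long;
# sum provinces so that each country becomes one row of daily cumulative counts.
date_cols = confirmed.columns[4:]
by_country = confirmed.groupby("Country/Region")[date_cols].sum()

def cases_on_10th_day(row, threshold=50, day=10):
    """Cumulative confirmed cases on the 10th day since the first day with >= threshold cases."""
    above = row[row >= threshold]
    if len(above) < day:
        return float("nan")  # the country has not yet reached day 10 after the threshold
    return above.iloc[day - 1]

tenth_day = by_country.apply(cases_on_10th_day, axis=1).dropna()
print(tenth_day.agg(["mean", "median", "max", "min"]).round(2))
```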

You are provided with the starter process to download and preprocess the latest
available COVID data:
1. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
2. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
3. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv

However, for the purposes of this assignment you should use the data from April 18,
2020 (files attached on the wiki page).

You should submit your answers to the google form (link on wiki page).
Home assignment 2. 20 points, 1 bonus point
Analyze Titanic data.
I. Start with basic EDA (Exploratory data analysis):
1. Compute the average `Age` of passengers and the number of passengers who
survived and did not survive, grouped by `Sex` and `Passenger Class` (24
numbers);
2. What can you say about survivors based on the resulting table (open
question), e.g. what is the survival ratio for females in First class compared
to the Second and Third?
This answer is limited to 150 words.
3. What is the average number of males and females on all boats (rounded to
the closest integer)?
Do not forget to filter out all `?` values in the `Life Boat` attribute.

II. Proceed with feature generation.


1. Drop the column `Life Boat`.
2. Generate a new attribute `Family size`: sum up `No of Parents or Children on
Board` and `No of Siblings or Spouses on Board` and add 1 (for the passenger
himself, thanks to @pianovanastya). What is the average family size? In
which class did the biggest family travel?
In this case, isn't it better to group people not by ticket number, but by family size?
Then we can divide the number of people with the same family size by the family size
value and receive the number of families for each family size.
Do not drop the original attributes.
3. It seems that `Passenger Fare` is the total among all passengers with the same
`Ticket Number`: create a new attribute `Single passenger fare`. For every
passenger you need to compute the number of passengers with the same
`Ticket Number` and then use this number as a divisor for `Passenger Fare`
(see the sketch after this list).
Do not drop the original attribute.
4. Impute missing values: for numerical attributes use averaging over three
groups: `Passenger Class`, `Sex`, `Embarkation Port`; for every numerical
attribute create a separate column that contains 1 for an imputed value and 0 for
an originally present one.
This step is mainly for practicing your groupby/join skills. In real tasks this kind of
imputation is relatively rare.
5. Pre-process categorical attributes: for every categorical attribute create a
separate column that contains 1 for a missing value and 0 for an originally
present one. One-hot encode categorical attributes with fewer than 20 unique
values and drop the other categorical attributes; drop the original (that you
pre-processed during this step) attributes.
6. Set the role of the `Survived` attribute to `label`.
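For those working in Python rather than RapidMiner, here is a minimal pandas sketch of II.2 and II.3 (the file name is hypothetical; the column names follow the assignment text):

```python
import pandas as pd

# Hypothetical file name; use the Titanic table provided for the assignment.
titanic = pd.read_csv("titanic.csv")

# II.2: family size = parents/children + siblings/spouses + the passenger himself.
titanic["Family size"] = (titanic["No of Parents or Children on Board"]
                          + titanic["No of Siblings or Spouses on Board"] + 1)
print(round(titanic["Family size"].mean(), 2))

# II.3: passengers sharing a ticket also share the listed fare, so divide
# `Passenger Fare` by the number of passengers with the same `Ticket Number`.
ticket_counts = titanic.groupby("Ticket Number")["Ticket Number"].transform("count")
titanic["Single passenger fare"] = titanic["Passenger Fare"] / ticket_counts
```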

III. Finish by building a classification model using preprocessed data


1. Compute classification accuracy on a train-test setup:
a. Create a Cross Validation block; fix the random_state parameter to
2020.
b. Use a decision tree with `maximal depth` = 7; uncheck the `apply
pruning` box; leave all other parameters at their defaults.
c. Use accuracy as the performance metric (a Python sketch is given after
the suggestions below).
2. Analyze the resulting confusion matrix: which error is larger, Type I or Type
II?
3. Provide a short analysis of the results, based on your answers to III.1-III.2, e.g.:
What are the splitting features of the first 3 levels of the best tree (up to 7
attributes)? Do these results coincide with your intuition? You may include
some misclassified examples along with explanations of why they were
misclassified.
This answer is limited to 250 words.

Suggestions.
I.1 Use the Aggregation block.
I.3 Use the Aggregation block twice.
II.2 Use Generate attribute block.
II.3 Use Aggregation block with `count` aggregation function; use Join block
II.4 Use example block from Seminar 3.
II.5 Use One Hot encoding block
II.6 Use Set Role block
III.1 Use example from Seminar 3.
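If you solve part III in Python, a rough sklearn analogue of the setup in III.1 is sketched below (assumptions: `titanic_preprocessed` is the table produced in part II, and the number of folds is set to 10; sklearn's DecisionTreeClassifier does not prune unless asked to, which roughly matches the unchecked `apply pruning` box):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# `titanic_preprocessed` is assumed to be the table produced in part II.
X = titanic_preprocessed.drop(columns=["Survived"])
y = titanic_preprocessed["Survived"]

# Depth-7 tree; sklearn does not prune unless ccp_alpha > 0, which matches
# leaving the `apply pruning` box unchecked.
tree = DecisionTreeClassifier(max_depth=7, random_state=2020)

cv = KFold(n_splits=10, shuffle=True, random_state=2020)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
print(round(scores.mean(), 3))
```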
Upload your solution (.py, .r, .ipynb, .rpm). Answers without an uploaded solution file will not
be graded!
You should submit your answers to the google form (link on wiki page).
Home assignment 3. 35 points, 10 bonus points
“You are provided with historical sales data for 45 Walmart stores located in
different regions. Each store contains a number of departments, and you are tasked
with predicting the department-wide sales for each store.”

Data description
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

Your goal for this task is threefold:

1. Make an EDA of the presented data to understand the underlying processes.
2. Build a complete pipeline (process) from tables to predictions.
3. Carefully analyze your predictions.

Before starting this assignment I highly recommend you watch the recording of the
last seminar, as it answers many questions regarding the data and the task.
If you are working in Python, you can find a description of all preprocessing steps
in the Seminar 4 plan.

Before you start this home assignment you should:

1. Join the 3 tables: train, features, stores;
2. Generate Week and Year attributes;
3. Establish NA values; sort your data by Date, Store, Dept;
4. Generate a `sample_weights` attribute and set the roles for `sample_weights`
(weight) and `Weekly_Sales` (label);
Set the role weight for the sample_weights attribute after all feature preprocessing & engineering
steps (after completing task I). See the Q&A for a more detailed description.
5. Split the data into local train (same as on the seminar*) and test parts (the last 39
weeks of the training period, same as on the seminar*).
This is time-based data, so we need a specific type of train-test split (time-based), so that
we do not predict the "future" using the "future" (which is mostly impossible in real life).
*The yellow box in the RM seminar process suggests using the 39 last weeks as a test set; however,
during the seminar I made a mistake and used the 40 last weeks. Your test set should be 39 weeks
long (the last 39 weeks of the available data).
Do not drop any attributes at this point. Whenever I use the words "train" /
"training data" / "training part" / "train period" I mean the local train. Unless otherwise
stated, you should do computations (plot graphs) using the training data.
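A rough pandas sketch of these preliminary steps, assuming the Kaggle file and column names (train.csv, features.csv, stores.csv; Date, Store, Dept, IsHoliday) and the competition's wMAE weighting (5 for holiday weeks, 1 otherwise):

```python
import pandas as pd

# Hypothetical file names; the three tables come from the Kaggle data page.
train = pd.read_csv("train.csv", parse_dates=["Date"])
features = pd.read_csv("features.csv", parse_dates=["Date"])
stores = pd.read_csv("stores.csv")

# 1. Join the three tables.
data = (train
        .merge(features, on=["Store", "Date", "IsHoliday"], how="left")
        .merge(stores, on="Store", how="left"))

# 2. Generate Week and Year.
data["Week"] = data["Date"].dt.isocalendar().week
data["Year"] = data["Date"].dt.year

# 3. Sort over Date, Store, Dept.
data = data.sort_values(["Date", "Store", "Dept"]).reset_index(drop=True)

# 4. sample_weights as in the competition's wMAE: holiday weeks weigh 5, the rest 1.
data["sample_weights"] = data["IsHoliday"].map({True: 5, False: 1})

# 5. Time-based split: the last 39 weeks of the available dates form the local test set.
cutoff = data["Date"].drop_duplicates().sort_values().iloc[-39]
local_train = data[data["Date"] < cutoff]
local_test = data[data["Date"] >= cutoff]
```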
Part I is worth 20 points (1 point for every question, e.g. I.1.a.ii is a single question
as it will appear in google form)
Part II is worth 5 points. (for II.1 and II.2)
Part III is worth 10 points + 10 bonus points (5 for task II.3 and 5 for task II.4 +
analysis of the results.)

I. Exploratory Data Analysis & Feature Generation (20 points)
1. Analyze the target variable.
a. Compare different stores by `Weekly_Sales`:
i. What is the store (Store ID) with the biggest total sales in
September 2011; the smallest total sales in January 2011? (two
numbers divided by a comma, e.g. 1, 2)
ii. Did the store with the largest sales in March change from 2010
to 2011; from 2011 to 2012? (Yes/No)
b. Compare different departments by `Weekly_Sales`:
i. How many departments have substantially larger sales during
holidays (>= 200%) compared to regular weeks (averaged over
the whole train period, over all stores)? (single integer number, e.g. 1)
E.g. $1000 during regular weeks and >$2000 during holiday weeks.
ii. How many departments have substantially smaller sales during
holidays (<= 50%) compared to regular weeks (averaged over
the whole train period, over all stores)? (single integer number, e.g. 1)
E.g. $1000 during regular weeks and <$500 during holiday weeks.
iii. Generate a new attribute `Department_Type` (1, 2, 3): 1 for
departments that have substantially smaller sales during
holidays (<= 50%) compared to regular weeks (averaged over
the whole train period, over all stores); 3 for departments that have
substantially larger sales during holidays (>= 200%) compared
to regular weeks (averaged over the whole train period, over all stores);
2 for all other departments. How many departments of each
type do you have? (3 integer numbers, divided by commas, e.g.
1, 2, 3) See the sketch at the end of Part I.
If you want, you may do that for every separate store.
2. Analyze `Size` and `Type`.
a. Compute the correlation between total `Weekly_Sales` (in millions) over
the whole training set and `Size`. (single number rounded to 3 decimal places,
e.g. 0.001)
b. Plot a scatter plot of total (over the whole training set) `Weekly_Sales` (in
millions) in a store vs the `Size` of the store.
c. Plot a boxplot of `Weekly_Sales` for the different store types (3 boxes,
one for every Type). For this plot filter out all `Weekly_Sales` bigger
than 70000.
d. Find stores with Type == `B` which have total `Weekly_Sales` (in
millions), over the whole training set, greater than the 80th percentile of total
`Weekly_Sales` of stores with Type == `A`. (list of store IDs, e.g. 1,
2, 3)
e. Generate a new attribute `Total_Sales_Store_Type`: 5 groups (from 1 to 5)
generated using percentiles (0<=p<20, 20<=p<40, 40<=p<60,
60<=p<80, 80<=p<=100) of total `Weekly_Sales` (in millions) over
the whole training set. How many stores of each type do you have? (5
integer numbers, one per type)
3. Analyze `Is_Holiday`.
a. How many holiday weeks are in the train set; in the test set? (two integer
numbers, divided by a comma)
b. Separate out 3 important American holidays: Super Bowl, Black Friday,
Christmas:
i. Generate 5 separate attributes (nominal True-False or 1-0):
`Is_SuperBowl`, `Is_BlackFriday`, `Is_Christmas`,
`Other_holidays` and `Regular_weeks` (`Other_holidays` are
all weeks which are not Super Bowl, Black Friday or
Christmas but have `Is_Holiday` = True; `Regular_weeks` are
weeks which are not Super Bowl, Black Friday or Christmas
and have `Is_Holiday` = False). How many `Other_holidays` weeks
do you have in the whole training set?
Dates will differ slightly from year to year; use the following Google
search pattern: <us holiday_name year dates>, e.g. <us super bowl 2011
dates>. You may share these dates with your colleagues.
ii. Select the 10 stores with the highest total sales in 2011. Compute the
percentage of sales during the Black Friday week compared to total
sales; during the Super Bowl week; during the Christmas week. (three
integer numbers 0-100, e.g. 10, 20, 30)
iii. Select the 10 stores with the lowest total sales in 2011. Compute the
percentage of sales during the Black Friday week compared to total
sales; during the Super Bowl week; during the Christmas week. (three
integer numbers 0-100, e.g. 10, 20, 30)
4. Analyze `Temperature`:
a. Plot a line graph of temperature over time (averaged over all stores).
b. Plot a scatter plot of `Temperature` vs `Weekly_Sales` (every point
corresponds to a single date and a single store). Compute the correlation
between them. (single number rounded to 3 decimal places, e.g.
0.001)
c. Find the 2 stores with the biggest difference in temperature in July 2010.
(two store IDs, e.g. 1, 2)
d. Plot a line graph of temperature over time for these two stores; use
different colors for the different stores.
e. Plot a scatter plot of `Temperature` vs `Weekly_Sales`; use different
colors for points corresponding to the different stores. Compute the
correlation between temperature and sales separately for each of the 2
stores. (two numbers rounded to 3 decimal places, e.g. 0.001, 0.002)
Do they differ from each other? Do they differ from I.4.b?
f. Generate a new attribute `Average_Temperature_month`: the average
temperature over the current month for this particular store.
You may use any reasonable approach: e.g. for the current week the average of the
last 4 weeks, or the average of all weeks from the same month one year ago.
5. Drop `Fuel_Price`, `MarkDown1`, `MarkDown2`, `MarkDown3`,
`MarkDown4`, `MarkDown5`, `CPI`, `Unemployment`, `Temperature`, as
they only appear in the downloaded train table (they are not available for
the "future").
6. Generate the `sample_weights` attribute and set the roles for `sample_weights`
(weight) and `Weekly_Sales` (label); drop the `Is_Holiday` and `Type`
attributes.
You might use so-called "lag" features: e.g. to predict sales in January 2014, use
`Unemployment` for the last available period (July 2013). But you should do this with extreme
caution.
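The sketch below shows one possible way to compute `Department_Type` (I.1.b.iii) in pandas; it assumes the joined `local_train` table from the preliminary steps, with `Dept`, `IsHoliday` and `Weekly_Sales` columns:

```python
# Average weekly sales per department on holiday vs regular weeks (local train only).
dept_sales = (local_train
              .groupby(["Dept", "IsHoliday"])["Weekly_Sales"]
              .mean()
              .unstack("IsHoliday"))  # columns: False (regular weeks), True (holiday weeks)

ratio = dept_sales[True] / dept_sales[False]

def department_type(r):
    # 1: holiday sales <= 50% of regular, 3: holiday sales >= 200% of regular, 2: the rest.
    if r <= 0.5:
        return 1
    if r >= 2.0:
        return 3
    return 2

dept_type = ratio.apply(department_type).rename("Department_Type")
print(dept_type.value_counts().sort_index())  # how many departments of each type

# Attach the new attribute back to the data.
local_train = local_train.merge(dept_type.reset_index(), on="Dept", how="left")
```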

II. Building a regression model (5 points)

This is actually the simplest part; a Python sketch of II.1-II.2 is given at the end of this part.
1. Train a Random Forest model with least_squares (RandomForestRegressor
in sklearn) on the train set; set the number of trees to 10 and the depth of the trees to
10.
a. Make predictions on the test set; on the train set.
b. Compute the performance of your model on the test set; on the train set. Use
wMAE (absolute_error in RM, mean_absolute_error in sklearn). What
is the difference between the train score and the test score? (single number
rounded to 3 decimal places, e.g. 123.456)
E.g. if you obtain 1234.567 wMAE on the train set and 2345.678 on the test set, then the
difference is 1234.567 - 2345.678 = -1111.111.
2. Run a Grid Search to look for the best set of Random Forest parameters. Use
the following grid:
■ Number of trees: from 5 to 50, with step 2 (5, 7, 9, ...)
■ Depth of the tree: from 2 to 20, with step 1 (2, 3, 4, ...)
■ Subset ratio (sklearn max_features): from 0.1 to 1, with step 0.1
(0.1, 0.2, 0.3, ...)
In order to speed up the GridSearch computation you may use a sample instead of the
whole train set. If your sample represents the original train distribution, the
best parameters obtained on this sample will be close to the best parameters obtained
on the whole train set. I recommend using a sample of stores (try to preserve the
distribution of store types).
a. What are the resulting parameters of the best model? (3 numbers, e.g.
5, 2, 0.1)
b. Make predictions on the test set (using your best model); on the train set;
save the results to a file (you will need them in part III).
c. Compute the performance of your model on the test set; on the train set. Use
wMAE (absolute_error in RM, mean_absolute_error in sklearn).
What is the difference between the train score and the test score? (single
number rounded to 3 decimal places, e.g. 123.456)
d. What is the difference between the test score using the default parameters
(II.1.b) and the test score using the parameters obtained with
GridSearch? (single number rounded to 3 decimal places, e.g.
123.456)
3. * Train separate models for every `Total_Sales_Store_Type`.
a. Tune every model using Grid Search as in II.2.
b. Make predictions on the test set; on the train set; save the results to a file
(you will need them in part III).
c. Compute train and test performance using wMAE.
4. * Train separate models for different types of weeks: holiday weeks and
regular weeks.
You may train separate models for every type of week (Christmas, etc.), but you will probably
lack training data for all weeks except regular ones.
a. Tune every model using Grid Search as in II.2.
b. Make predictions on the test set; on the train set; save the results to a file
(you will need them in part III).
c. Compute train and test performance using wMAE.
You might want to compare your local test score with the test score on the
leaderboard of the competition (predict sales for the test table and upload your
predictions to Kaggle to see the results).
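A rough sklearn sketch of II.1-II.2, assuming the preprocessed `local_train`/`local_test` tables contain only numeric feature columns besides `Weekly_Sales`, `sample_weights` and `Date` (the random_state and the cross-validation settings inside the grid search are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

feature_cols = [c for c in local_train.columns
                if c not in ("Weekly_Sales", "sample_weights", "Date")]
X_tr, y_tr, w_tr = local_train[feature_cols], local_train["Weekly_Sales"], local_train["sample_weights"]
X_te, y_te, w_te = local_test[feature_cols], local_test["Weekly_Sales"], local_test["sample_weights"]

# II.1: Random Forest with 10 trees of depth 10; wMAE is the weighted mean_absolute_error.
rf = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=2020)
rf.fit(X_tr, y_tr, sample_weight=w_tr)
train_wmae = mean_absolute_error(y_tr, rf.predict(X_tr), sample_weight=w_tr)
test_wmae = mean_absolute_error(y_te, rf.predict(X_te), sample_weight=w_te)
print(round(train_wmae - test_wmae, 3))

# II.2: grid search over the grid from the assignment (a sample of stores can speed this up).
grid = {
    "n_estimators": list(range(5, 51, 2)),
    "max_depth": list(range(2, 21)),
    "max_features": [round(0.1 * i, 1) for i in range(1, 11)],
}
search = GridSearchCV(RandomForestRegressor(random_state=2020), grid,
                      scoring="neg_mean_absolute_error", cv=3, n_jobs=-1)
search.fit(X_tr, y_tr, sample_weight=w_tr)
print(search.best_params_)
```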

III. Analyzing the resulting model (10 points + 10 bonus points)

For this part you may choose any 2 questions out of 1, 2, 3,
4 (e.g. III.1 and III.4); question 5 is mandatory.
1. Analyze the results for different stores:
a. Compute wMAE separately for stores of different types
(`Total_Sales_Store_Type`).
b. Is there any difference between the errors obtained for different types of
stores? Plot a boxplot or bar plot of the error distribution (1 box/bar for
every store type).
2. Analyze the results for different departments (Dept):
a. Compute wMAE separately for the different department types
(`Department_Type`).
b. Is there any difference between the errors obtained for different types of
departments? Plot a boxplot or bar plot of the error distribution (1
box/bar for every department type).
3. Analyze the results for different weeks:
a. Compare wMAE separately for regular and holiday weeks;
b. Is there any difference between the errors obtained for different types of
weeks (`Is_SuperBowl`, `Is_BlackFriday`, `Is_Christmas`,
`Other_holidays` and `Regular_weeks`)? Plot a boxplot or bar plot of
the error distribution (1 box/bar for every week type).
4. Split the testing part into 4-week parts (1-4, 5-8, ..., 37-39).
a. Compute wMAE for these time periods.
b. Is there any difference between the first 4 weeks and the last 4 (3) weeks?
Plot a box, bar or line plot of the error distribution.
5. Make an overall conclusion. Your conclusion should include answers to (but
not limited to) the following questions:
■ How good is your final model for small/huge stores?
■ What kind of additional features could you suggest for this
task?
■ Would you suggest building individual models for
Stores/Departments/Regular-Holiday weeks/Nearest-Farthest
weeks?
■ Which features were most useful? Least useful?
6. * If you want to get the bonus points you should complete II.3 or II.4 and make a
separate conclusion (III.5) for it.

For this section you should prepare a short report in a .pdf
file that includes your graphs and conclusion. Your
conclusion is limited to 350 words (no fewer than 150
words). You may include up to 4 graphs in your report.
You will upload your report via the Google form.

Name your report the following way:

DS_HA3_[Surname]_[Name]_[chosen task at section III]_[chosen task
at section III].pdf
For example:
DS_HA3_Kurmukov_Anvar_1_3.pdf
This means that for section III I decided to complete tasks III.1 and III.3.

For bonus points you should complete III.6 and provide an additional report
file, with the same rules (150-350 words and up to 4 graphs). Name your report with
the bonus task the following way:
DS_HA3_[Surname]_[Name]_[chosen bonus task]_bonus.pdf
For example:
DS_HA3_Kurmukov_Anvar_4_bonus.pdf
This means that for the bonus task I took task II.4.

Do not include any names/surnames inside the report!

Wrongly named reports will be graded with 0 points!
Home assignment 4. 35 points, 10 bonus points
Seminar 5 plan:
https://docs.google.com/document/d/1zDlYT7LU3NQqV74InsvSj8fyE3hOWDWTsHYBJhfsgXg/edit?usp=sharing

For this task your main goal is to decrease company losses due to customer
churn. We will compare two discount strategies: providing a 20% discount with a
75% acceptance rate and a 30% discount with a 90% acceptance rate.

For this assignment you should use this data (subset of the seminar’s data):
https://yadi.sk/d/K6DApyOjp42IYA

I. Data preprocessing (5).

This time we will skip all the exploration steps and only do some simple feature
preprocessing (a pandas sketch follows the list):
1. Replace "No internet service" with "No" for the following attributes:
`OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`,
`StreamingTV`, `StreamingMovies`. Already done in the HA data.
2. Generate a `tenure_group` attribute: discretize `tenure` into 6 groups: "0-12",
"12-24", "24-36", "36-48", "48-60", "60+" (all are left-closed intervals: [0,
12), [12, 24), [24, 36), ...). What are the sizes of these groups?
Tenure refers to the number of months that a customer has been subscribed.
Do not drop the `tenure` column.
3. Preprocess categorical columns with only 2 unique values ("binary"
columns): replace one unique value with 0 and the other with 1 (label
encoding). How many such columns do you have?
E.g. for the `gender` attribute you may replace Female with 1 and Male with
0, or vice versa.
4. Preprocess categorical columns with more than 2 unique values using
dummy encoding (= one-hot encoding). How many such columns (before
dummy encoding) do you have?
5. Drop the customerID attribute.
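A minimal pandas sketch of steps I.2-I.5, assuming the Telco churn column names used in the text (`tenure`, `gender`, `customerID`) and a hypothetical file name:

```python
import pandas as pd

# Hypothetical file name; use the subset of the seminar's data linked above.
churn = pd.read_csv("telco_churn_subset.csv")

# I.2: discretize tenure into left-closed bins [0, 12), [12, 24), ..., [60, inf).
bins = [0, 12, 24, 36, 48, 60, float("inf")]
labels = ["0-12", "12-24", "24-36", "36-48", "48-60", "60+"]
churn["tenure_group"] = pd.cut(churn["tenure"], bins=bins, labels=labels, right=False)
print(churn["tenure_group"].value_counts())

# I.3: label-encode "binary" categorical columns (exactly 2 unique values).
cat_cols = churn.select_dtypes(include="object").columns.drop("customerID")
binary_cols = [c for c in cat_cols if churn[c].nunique() == 2]
for c in binary_cols:
    churn[c] = (churn[c] == churn[c].unique()[0]).astype(int)

# I.4: one-hot (dummy) encode the remaining categorical columns.
multi_cols = [c for c in cat_cols if c not in binary_cols]
churn = pd.get_dummies(churn, columns=multi_cols)

# I.5: drop the identifier.
churn = churn.drop(columns=["customerID"])
```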

II. Build a churn model (5).


1. Build 2 classification models to predict customer churn:
a. Logistic Regression. What is the ROC AUC of this model?
b. Random Forest. What is the ROC AUC of this model?
In this task I suggest you deviate from the train-test strategy and use a k-fold
approach to train and predict on the whole dataset: train on 4/5 of the data, predict on
the remaining 1/5, and repeat this 5 times; thus you will get predictions for the whole dataset.
Recall that in this case you will actually have 5 trained classifiers, not 1, but for
our purposes this is ok.
After this section you must have predictions for all customers in the dataset
(~6k) obtained using cross-validation (see the sketch below).

Comments:
1. For section III you may use any classification model you want (you are
not restricted to the two models above).
2. You may want to use grid search to look for the best parameters of the
model(s).
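A minimal sklearn sketch of the k-fold approach described above (`churn` is the preprocessed table from part I; the `Churn` label name and the model parameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# `churn` is the preprocessed table from part I; `Churn` is assumed to be the 0/1 label column.
X = churn.drop(columns=["Churn"])
y = churn["Churn"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=2020))]:
    # Out-of-fold predicted churn probabilities for every customer in the dataset.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    print(name, "ROC AUC:", round(roc_auc_score(y, proba), 3))
```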

III. Compare two discount strategies (25 + 10).


Assumptions:
1. Every customer pays the same price p which is the average of
`MonthlyCharges`.
2. If we decide to provide a discount we provide it to all the customers who are
predicted as Churn=Yes.
3. When we compute gains, costs and losses we compute them for the short
term.
Therefore all the computations from the seminar hold (except you need to
recompute the coefficients).
4. Strategy’s profit is the difference between gains, costs and losses:
profit = gains - costs - losses
5. Profit per customer is the total profit divided by the number of customers (if
the person churns the person is not a customer anymore).

Strategy A: Provide a 20% discount with a 75% acceptance rate.
Strategy B: Provide a 30% discount with a 90% acceptance rate.
In the seminar we had a 30% discount with an 80% acceptance rate.

You are not obligated to use any particular software.

You may use any classifier you want.
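A minimal sketch of how the confusion-matrix counts in questions 1-2 can be computed from the cross-validated probabilities (`y` and `proba` come from the previous sketch); the gains, costs, losses and profit follow the seminar's formulas, so they are only indicated in the comments:

```python
import numpy as np

def confusion_counts(y_true, proba, threshold):
    """TP, FP, TN, FN for churn predictions at a given probability threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn

# Question 1 uses threshold 0.5; question 2 sweeps the 9 thresholds below. For every
# threshold, plug TP/FP/TN/FN into the seminar's gains/costs/losses formulas together
# with p (the average `MonthlyCharges`) and the strategy's discount and acceptance rate.
for threshold in [round(0.1 * i, 1) for i in range(1, 10)]:
    tp, fp, tn, fn = confusion_counts(y.to_numpy(), proba, threshold)
    print(threshold, tp, fp, tn, fn)
```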

1. Use the default threshold of 0.5 to compute the confusion matrix. Based on
this confusion matrix report (5 points):
a. TP, FP, TN, FN
b. Losses if you do not apply any discount strategy.
c. Total gains from discount strategy B.
d. Total costs of discount strategy B.
e. Total losses of discount strategy B.
f. Total profit of discount strategy B.
g. Profit per customer pd (using strategy B).
2. Use 9 different thresholds: 0.1, 0.2, ..., 0.9. Answer the following
questions (10 points):
a. What is the threshold with the highest accuracy, using strategy B?
b. What is the threshold with the highest profit, using strategy B? What
is the highest profit, using strategy B?
c. What is the threshold with the highest profit per customer, using
strategy B? What is the highest profit per customer, using strategy
B?
d. What is the ratio of the profit per customer pd (obtained in the previous
step) to p?
e. Which strategy yields the highest profit (A or B)? What are the TP,
FP, TN, FN in this case? What is the highest profit in that case?
f. Which strategy yields the highest profit per customer (A or B)? What
are the TP, FP, TN, FN in this case? What is the highest profit per
customer in that case?
3. Prepare a report (10 points). Your report must summarize your results.
Reports with a simple copy-paste of the results will be graded with 0 points!
Some example questions (you are not limited or restricted to them):
● Do the thresholds for the highest profit and the highest profit per
customer coincide or not? Why?
● Which would you decide to choose? Under what circumstances (how
many clients will you lose in both situations; what should be the
decision criteria)?
● How much does your profit per customer decrease for customers
to whom you provide a discount, compared to customers to
whom you do not provide a discount? Compare this number
with the discount.
4. * For the bonus 10 points you need to redo all the computations, but now instead
of the average p you should use each customer's MonthlyCharges. All the results for
the bonus task must be summarized in an additional report. You must
provide a comparison of the results (with the regular case when you use p).

Your report must be prepared as a .pdf file. It is limited
to 350 words. You may include up to 4 graphs in your
report. You will upload your report via the Google form.

Name your report the following way:


DS_HA4_[Surname]_[Name].pdf
For example:
DS_HA4_Kurmukov_Anvar.pdf
Name your report for the bonus task (same rules: up to 350 words and 4
graphs) the following way:
DS_HA4_[Surname]_[Name]_bonus.pdf
For example:
DS_HA4_Kurmukov_Anvar_bonus.pdf

Do not include any names/surnames inside the report!

Wrongly named reports (or reports with your identification
in them) will be graded with 0 points!
Home assignment 5. 45 points.

Clustering seminar (6) plan:
https://docs.google.com/document/d/1Ptj7J1ikOVsuGmY5rPNFCo0p3hrDcLyQAwotDxxbBiQ/edit?usp=sharing

Your goal for this task is twofold:

1. Cluster all the products into distinct groups (clusters).
2. Build a recommender system for customers, but instead of products we will
recommend categories.

I. Clustering (20)

1. Feature generation. Use the examples from the Seminar 6 plan to generate
features for products clustering. You may generate any number of
features, but you must generate at least 3 features which differ from
those proposed in the plan.
2. Cluster all products into distinct groups (clusters). You may use any
clustering algorithm you want. If you use distance-based clustering
(e.g. k-means), do not forget to preprocess your features
(normalization, z-scoring or standard scaling). Try different numbers
of groups (e.g. from 5 to 30). A couple of useful links for selecting the
number of clusters (a short Python sketch follows them):
a. Rapid Miner:
i. https://community.rapidminer.com/discussion/39731/dynamically-determine-number-of-clusters-k-means
ii. https://mgaproyekto.blogspot.com/2018/06/rapidminer-tutorial-k-means-clustering.html
b. Python/R:
i. https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
ii. https://medium.com/@masarudheena/4-best-ways-to-find-optimal-number-of-clusters-for-clustering-with-python-code-706199fa957c
3. Write a report. In your report you should present the following
information:
a. Put an example screenshot of your features.
b. Explain (in a similar way to how I explain them in the plan) every
single feature you use (you may skip the features from the seminar plan).
c. Cluster information: how many clusters do you have, and how
many objects are in these clusters.
d. Cluster interpretation. Try to provide an interpretation of
every single cluster (or group of clusters) you end up with. For
example:
"Cluster `1` includes hot drinks and beverages often bought in combination in the
morning."
e. You may include any visualization you find necessary, e.g.:
colored PCA components, histograms or pie charts of cluster
sizes, "elbows" used for selecting the number of clusters (if you
have used them).
Python visualization examples:
https://www.kaggle.com/python10pm/plotting-with-python-learn-80-plots-step-by-step
Rapid Miner visualization examples:
https://www.ou.nl/documents/40554/349790/IM0503_ChartingInRapidMiner.pdf/c5ede337-5287-14f6-bc3d-f9221efc4fea,
https://michael.hahsler.net/SMU/EMIS3309/slides/Essential_Visualizations.pdf
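If you cluster in Python, a minimal sketch of the scaling and k-means steps is given below; `product_features` stands for whatever numeric feature table you engineered in I.1, and the chosen number of clusters is a placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `product_features`: one row per product with the numeric features engineered in I.1.
X = StandardScaler().fit_transform(product_features)

# Try several numbers of clusters and inspect the inertia curve ("elbow").
inertias = {}
for k in range(5, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=2020)
    km.fit(X)
    inertias[k] = km.inertia_

# Fit the chosen model and attach the cluster labels back to the products.
best_k = 20  # placeholder: pick it from the elbow / silhouette analysis
product_features["cluster"] = KMeans(n_clusters=best_k, n_init=10,
                                     random_state=2020).fit_predict(X)
```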
II. Recommender system (25)

Python examples:
Collaborative filtering example:
https://realpython.com/build-recommendation-engine-collaborative-filtering/
MF recommender systems library: https://github.com/lyst/lightfm

1. Prepare user-item data as it was done during the seminar: User, Item,
Score. You may construct the Score (e.g. see the seminar) any way you want,
but you must explain it in your report.
2. Split your data into train and test sets (as Leonid explained during the
lecture): some of the user-item pairs go to the train set and some to the
test set.
3. Build a recommender system using the cluster groups (if you have about
20-40 clusters) or item subcategories (the 75 most frequent values of the
`dbi_item_famly_name` attribute) as items and `dd_card_number` as
users. You may want to play with the number of neighbours in your
KNN recommender model (see the sketch after this list).
4. Compute 3 different recommender performance scores, which were
explained during the lecture or seminar, to assess the quality of your
recommendations (use appropriate metrics).
5. Write a report. In your report you should present the following
information:
a. Report the computed performance scores.
b. Elaborate on the quality of your recommendations.
c. Provide 3-5 examples of `good` recommendations suggested by
your recommender system.
d. Provide 3-5 examples of `bad` recommendations suggested by
your recommender system.
e. You may report any additional information you find potentially
useful for assessing the quality of your recommendations: e.g. for a
couple of customers compute the price of their average
purchase (or of an item in a purchase) and compare it with the
average price of the recommended items.
f. You may use any visualisations you find useful.
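One simple way to build such a recommender in Python is a KNN over a pivoted user-item matrix, sketched below (the `ratings` table with user/item/score columns is the one prepared in II.1; libraries such as lightfm are an equally valid choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# `ratings` has one row per (user, item, score) pair as prepared in II.1, where "item"
# is the cluster group or item subcategory and "user" is `dd_card_number`.
user_item = ratings.pivot_table(index="user", columns="item", values="score", fill_value=0)

# User-based KNN with cosine similarity.
knn = NearestNeighbors(metric="cosine", n_neighbors=20)
knn.fit(user_item.values)

def recommend(user_id, top_n=5):
    """Recommend categories the user has not scored yet, based on similar users."""
    distances, indices = knn.kneighbors(user_item.loc[[user_id]].values)
    neighbours = user_item.iloc[indices[0]]  # note: the query user is among its own neighbours
    scores = neighbours.mean(axis=0)
    scores[user_item.loc[user_id] > 0] = -np.inf  # hide already-purchased categories
    return scores.sort_values(ascending=False).head(top_n)

print(recommend(user_item.index[0]))
```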

Your report should consist of 2 parts: a report for the clustering part and a report for the
recommendation part. Every part is limited to 350 words and 5 pictures. Name
your report the following way:

DS_HA5_[Surname]_[Name].pdf

Do not include any names/surnames inside the report!

Wrongly named reports (or reports with your identification
in them) will be graded with 0 points!
September exam

We are using house sale price data from King County, Washington, USA. This
dataset is in the public domain and can be obtained from Kaggle:
https://www.kaggle.com/harlfoxem/housesalesprediction

1. Observe the top 10 observations (int):
a. What is the price of the house with `id` == 7237550310?
b. How many bedrooms does the house with `id` == 7237550310 have?
c. When was the house with `id` == 2414600126 built (`yr_built`)?
2. Observe the last 10 observations (int):
a. What is the price of the house with `id` == 263000018?
b. How many bedrooms does the house with `id` == 291310100 have?
c. When was the house with `id` == 1523300141 built (`yr_built`)?
3. Display some column statistics (list of floats, rounded to 3 digits, e.g.
1.234):
a. What are the max, min, mean and std of the `floors` column?
b. What are the max, min, mean and std of the `sqft_living` column?
c. What are the max, min, mean and std of the `price` column?
4. Select rows/columns (int):
a. How many houses were built during the American Great Depression
(1929-1939)? Include both the start and end year.
b. How many houses built before the first human in space (<1961) have
a high condition (=5)?
c. How many houses with a waterfront (=1) were built during Nixon's
presidency (1969-1974)? Include both the start and end year.
d. How many houses were sold for 256000 dollars?
5. Select rows/columns and compute simple statistics (float):
a. What was the average (sold) price of the houses built in the year of the
Cuban Missile Crisis (1962)?
b. What was the price of the most expensive house sold that was built between
1991 and 2000?
6. Create new columns using the old ones (new column in your DataFrame):
a. Create a `sqft_tot_area` column (the sum of all columns with the `sqft_`
prefix) using any method above.
b. Create a new column `sqm_tot_area` using `sqft_tot_area` and the fact
that 1 foot = 0.3048 meters.
c. Create a new column `sqm_aver_floor_area` by dividing the total area (in
meters) by the number of floors.
d. Create a new bool column `high_class`: it is True if the house has
grade >= 9 and condition >= 4.
7. Create some groupby features:
a. `price_by_class`: group by `high_class` and compute the median `price`.
b. `area_by_price`: group by `price_cat` and compute the average
`sqft_living`.
c. `floors_by_age`: group by `floors` and compute the average age of a house.

Drop the features `price_by_class` and `area_by_price`, and preprocess the categorical and
date features.
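A pandas sketch for a few of the selection, feature-generation and groupby questions above (the file name and column names are taken from the Kaggle dataset and may differ in your copy):

```python
import pandas as pd

houses = pd.read_csv("kc_house_data.csv")  # the Kaggle King County file

# 4.a: houses built during the Great Depression (1929-1939, inclusive).
print(houses["yr_built"].between(1929, 1939).sum())

# 5.a: average sold price of houses built in 1962.
print(houses.loc[houses["yr_built"] == 1962, "price"].mean())

# 6.a-6.c: total area in square feet and square meters, average floor area.
sqft_cols = [c for c in houses.columns if c.startswith("sqft_")]
houses["sqft_tot_area"] = houses[sqft_cols].sum(axis=1)
houses["sqm_tot_area"] = houses["sqft_tot_area"] * 0.3048 ** 2  # 1 ft = 0.3048 m
houses["sqm_aver_floor_area"] = houses["sqm_tot_area"] / houses["floors"]

# 6.d: high_class flag.
houses["high_class"] = (houses["grade"] >= 9) & (houses["condition"] >= 4)

# 7.a: median price per high_class group, mapped back onto every row.
houses["price_by_class"] = houses.groupby("high_class")["price"].transform("median")
```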

8. Split your data into train (70%) and test (30%) parts. How many records
(rows) do you have in the train and test tables? (list of int)
9. Create a predictive regression model of the house price:
a. Use decision tree regression.
b. Use k nearest neighbours regression.
10. Use grid search to select the optimal hyperparameters of your models:
a. Depth for the tree.
b. Number of neighbours for the knn.
11. Compute the train and test mean squared error for your best models (list of
floats):
a. Train and test MSE using decision tree regression.
b. Train and test MSE using k nearest neighbours regression.
12. Normalize your numerical features and repeat steps 9-11. Does the train/test
MSE change for the KNN model? For the decision tree model?
13. Write a short (3-5 sentences) report on your solution.
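A rough sklearn sketch for questions 8-11 (the random_state, the fold count and the exact grids are illustrative; `houses` is the prepared table from the previous sketch):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Keep only numeric predictors; `houses` is the prepared table from the previous sketch.
X = houses.drop(columns=["price", "id", "date"], errors="ignore").select_dtypes("number")
y = houses["price"]

# 8: 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
print(len(X_train), len(X_test))

# 9-11: tune each model with grid search and report train/test MSE for the best one.
models = {
    "tree": (DecisionTreeRegressor(random_state=2020), {"max_depth": list(range(2, 21))}),
    "knn": (KNeighborsRegressor(), {"n_neighbors": list(range(1, 31))}),
}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, scoring="neg_mean_squared_error", cv=5)
    search.fit(X_train, y_train)
    best = search.best_estimator_
    print(name, search.best_params_,
          round(mean_squared_error(y_train, best.predict(X_train)), 3),
          round(mean_squared_error(y_test, best.predict(X_test)), 3))
```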
