DS For Business Home Assignments

Home assignment 1. 15 points

1. Compute statistics (over countries) of the total number of confirmed cases on the
10th day since 50 confirmed cases:
a. Mean
b. Median
c. Max
d. Min
If March 13 is the first day with >= 50 confirmed cases, then March 22 is the 10th day.

2. Compute statistics (over countries) of the total number of deaths on the last available
day:
a. Mean
b. Median
c. Max
d. Min
3. What was the average number of new cases over the last 10 days in Germany?
4. Compute the case fatality rate (deaths to total cases ratio) for the last available day in
countries with more than 10 000 reported cases (in total).
a. What is the biggest case fatality rate? Write the percentage rounded to 2
decimal places.
b. What is the lowest? Write the percentage rounded to 2 decimal places.
c. Plot a scatter plot: Total number of cases vs Case fatality rate; color
points according to the country.
5. On which weekday were the most cases reported in France on average? On which
weekday were the fewest cases reported in Italy on average?

Write all numbers rounded to 2 decimal places.

Suggestions

1. Use Aggregation.
2. Use Aggregation twice, the second time with an empty groupby. Use the fact that the
total number of deaths in a country is maximal on the last available day.
3. Use Differentiate; use Filter Examples.
4. Use Join; use Generate Attribute; use Filter Examples.
5. Use Generate Attribute; you can get the weekday using the following function:
date_str_custom(Date, "E"), where Date is the name of the column with the date;
use Pivot.
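The suggestions above refer to RapidMiner operators. If you are solving the assignment in Python instead, a minimal pandas sketch for task 1 could look as follows (it assumes the JHU wide format with one row per country/province and one column per date; the file name is illustrative):

```python
import pandas as pd

# Hypothetical file name: the attached April 18, 2020 confirmed-cases table (JHU wide format).
confirmed = pd.read_csv("time_series_covid19_confirmed_global.csv")

# Date columns follow Province/State, Country/Region, Lat, Long;
# sum provinces so that each country becomes one row of daily cumulative counts.
date_cols = confirmed.columns[4:]
by_country = confirmed.groupby("Country/Region")[date_cols].sum()

def cases_on_10th_day(row, threshold=50, day=10):
    """Cumulative confirmed cases on the 10th day since the first day with >= threshold cases."""
    above = row[row >= threshold]
    if len(above) < day:
        return float("nan")  # the country has not yet reached day 10 after the threshold
    return above.iloc[day - 1]

tenth_day = by_country.apply(cases_on_10th_day, axis=1).dropna()
print(tenth_day.agg(["mean", "median", "max", "min"]).round(2))
```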

You are provided with the starter process to download and preprocess the latest
available COVID data:
1. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
2. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
3. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv

However, for the purposes of this assignment you should use the data from April 18,
2020 (files attached on the wiki page).

You should submit your answers to the google form (link on wiki page).
Home assignment 2. 20 points, 1 bonus point
Analyze Titanic data.
I. Start with basic EDA (Exploratory data analysis):
1. Compute the average `Age` of passengers and the number of passengers who
survived and did not survive, grouped by `Sex` and `Passenger Class` (24
numbers);
2. What can you say about survivors based on the resulting table (open
question), e.g. what is the survival ratio for females in First class compared
to the Second and Third?
This answer is limited to 150 words.
3. What is the average number of males and females on all boats (rounded to
the closest integer)?
Do not forget to filter out all `?` values in the `Life Boat` attribute.

II. Proceed with feature generation.


1. Drop the column `Life Boat`.
2. Generate a new attribute `Family size`: sum up `No of Parents or Children on
Board` and `No of Siblings or Spouses on Board` and add 1 (for the passenger
himself, thanks to @pianovanastya). What is the average family size? In
which class did the biggest family travel?
In this case, isn't it better to group people not by ticket number, but by family size?
Then we can divide the number of people with the same family size by the family size
value and receive the number of families for each family size.
Do not drop the original attributes.
3. It seems that `Passenger Fare` is the total among all passengers with the same
`Ticket Number`: create a new attribute `Single passenger fare`. For every
passenger you need to compute the number of passengers with the same
`Ticket Number` and then use this number as a divisor for `Passenger Fare`
(see the sketch after this list).
Do not drop the original attribute.
4. Impute missing values: for numerical attributes use averaging over three
groups: `Passenger Class`, `Sex`, `Embarkation Port`; for every numerical
attribute create a separate column that contains 1 for an imputed value and 0 for
an originally present one.
This step is mainly for practicing your groupby/join skills. In real tasks this kind of
imputation is relatively rare.
5. Pre-process categorical attributes: for every categorical attribute create a
separate column that contains 1 for a missing value and 0 for an originally
present one. One-hot encode categorical attributes with fewer than 20 unique
values and drop the other categorical attributes; drop the original (that you
pre-processed during this step) attributes.
6. Set the role of the `Survived` attribute to `label`.
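For those working in Python rather than RapidMiner, here is a minimal pandas sketch of II.2 and II.3 (the file name is hypothetical; the column names follow the assignment text):

```python
import pandas as pd

# Hypothetical file name; use the Titanic table provided for the assignment.
titanic = pd.read_csv("titanic.csv")

# II.2: family size = parents/children + siblings/spouses + the passenger himself.
titanic["Family size"] = (titanic["No of Parents or Children on Board"]
                          + titanic["No of Siblings or Spouses on Board"] + 1)
print(round(titanic["Family size"].mean(), 2))

# II.3: passengers sharing a ticket also share the listed fare, so divide
# `Passenger Fare` by the number of passengers with the same `Ticket Number`.
ticket_counts = titanic.groupby("Ticket Number")["Ticket Number"].transform("count")
titanic["Single passenger fare"] = titanic["Passenger Fare"] / ticket_counts
```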

III. Finish by building a classification model using preprocessed data


1. Compute classification accuracy on a train-test setup:
a. Create a Cross Validation block; fix the random_state parameter to
2020.
b. Use a decision tree with `maximal depth` = 7; uncheck the `apply
pruning` box; leave all other parameters at their defaults.
c. Use accuracy as the performance metric (a Python sketch is given after
the suggestions below).
2. Analyze the resulting confusion matrix: which error is larger, Type I or Type
II?
3. Provide a short analysis of the results, based on your answers to III.1-III.2, e.g.:
What are the splitting features of the first 3 levels of the best tree (up to 7
attributes)? Do these results coincide with your intuition? You may include
some misclassified examples along with explanations of why they were
misclassified.
This answer is limited to 250 words.

Suggestions.
I.1 Use the Aggregation block.
I.3 Use the Aggregation block twice.
II.2 Use Generate attribute block.
II.3 Use Aggregation block with `count` aggregation function; use Join block
II.4 Use example block from Seminar 3.
II.5 Use One Hot encoding block
II.6 Use Set Role block
III.1 Use example from Seminar 3.
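If you solve part III in Python, a rough sklearn analogue of the setup in III.1 is sketched below (assumptions: `titanic_preprocessed` is the table produced in part II, and the number of folds is set to 10; sklearn's DecisionTreeClassifier does not prune unless asked to, which roughly matches the unchecked `apply pruning` box):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# `titanic_preprocessed` is assumed to be the table produced in part II.
X = titanic_preprocessed.drop(columns=["Survived"])
y = titanic_preprocessed["Survived"]

# Depth-7 tree; sklearn does not prune unless ccp_alpha > 0, which matches
# leaving the `apply pruning` box unchecked.
tree = DecisionTreeClassifier(max_depth=7, random_state=2020)

cv = KFold(n_splits=10, shuffle=True, random_state=2020)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
print(round(scores.mean(), 3))
```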
Upload your solution (.py, .r, .ipynb, .rpm). Answers without an uploaded solution file will not
be graded!
You should submit your answers to the google form (link on wiki page).
Home assignment 3. 35 points, 10 bonus points
“You are provided with historical sales data for 45 Walmart stores located in
different regions. Each store contains a number of departments, and you are tasked
with predicting the department-wide sales for each store.”

Data description
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

Your goal for this task is threefold:

1. Make an EDA of the presented data to understand the underlying processes.
2. Build a complete pipeline (process) from tables to predictions.
3. Carefully analyze your predictions.

Before starting this assignment I highly recommend you watch the recording of the
last seminar, as it answers many questions regarding the data and the task.
If you are working in Python, you can find a description of all preprocessing steps
in the Seminar 4 plan.

Before you start this home assignment you should:

1. Join the 3 tables: train, features, stores;
2. Generate Week and Year attributes;
3. Establish NA values; sort your data by Date, Store, Dept;
4. Generate a `sample_weights` attribute and set the roles for `sample_weights`
(weight) and `Weekly_Sales` (label);
Set the role weight for the sample_weights attribute after all feature preprocessing & engineering
steps (after completing task I). See the Q&A for a more detailed description.
5. Split the data into local train (same as on the seminar*) and test parts (the last 39
weeks of the training period, same as on the seminar*).
This is time-based data, so we need a specific type of train-test split (time-based), so that
we do not predict the "future" using the "future" (which is mostly impossible in real life).
*The yellow box in the RM seminar process suggests using the 39 last weeks as a test set; however,
during the seminar I made a mistake and used the 40 last weeks. Your test set should be 39 weeks
long (the last 39 weeks of the available data).
Do not drop any attributes at this point. Whenever I use the words "train" /
"training data" / "training part" / "train period" I mean the local train. Unless otherwise
stated, you should do computations (plot graphs) using the training data.
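A rough pandas sketch of these preliminary steps, assuming the Kaggle file and column names (train.csv, features.csv, stores.csv; Date, Store, Dept, IsHoliday) and the competition's wMAE weighting (5 for holiday weeks, 1 otherwise):

```python
import pandas as pd

# Hypothetical file names; the three tables come from the Kaggle data page.
train = pd.read_csv("train.csv", parse_dates=["Date"])
features = pd.read_csv("features.csv", parse_dates=["Date"])
stores = pd.read_csv("stores.csv")

# 1. Join the three tables.
data = (train
        .merge(features, on=["Store", "Date", "IsHoliday"], how="left")
        .merge(stores, on="Store", how="left"))

# 2. Generate Week and Year.
data["Week"] = data["Date"].dt.isocalendar().week
data["Year"] = data["Date"].dt.year

# 3. Sort over Date, Store, Dept.
data = data.sort_values(["Date", "Store", "Dept"]).reset_index(drop=True)

# 4. sample_weights as in the competition's wMAE: holiday weeks weigh 5, the rest 1.
data["sample_weights"] = data["IsHoliday"].map({True: 5, False: 1})

# 5. Time-based split: the last 39 weeks of the available dates form the local test set.
cutoff = data["Date"].drop_duplicates().sort_values().iloc[-39]
local_train = data[data["Date"] < cutoff]
local_test = data[data["Date"] >= cutoff]
```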
Part I is worth 20 points (1 point for every question, e.g. I.1.a.ii is a single question
as it will appear in google form)
Part II is worth 5 points. (for II.1 and II.2)
Part III is worth 10 points + 10 bonus points (5 for task II.3 and 5 for task II.4 +
analysis of the results.)

I. Exploratory Data Analysis & Feature Generation (20 points)
1. Analyze the target variable.
a. Compare different stores by `Weekly_Sales`:
i. What is the store (Store ID) with the biggest total sales in
September 2011; the smallest total sales in January 2011? (two
numbers divided by a comma, e.g. 1, 2)
ii. Did the store with the largest sales in March change from 2010
to 2011; from 2011 to 2012? (Yes/No)
b. Compare different departments by `Weekly_Sales`:
i. How many departments have substantially larger sales during
holidays (>= 200%) compared to regular weeks (averaged over
the whole train period, over all stores)? (single integer number, e.g. 1)
E.g. $1000 during regular weeks and >$2000 during holiday weeks.
ii. How many departments have substantially smaller sales during
holidays (<= 50%) compared to regular weeks (averaged over
the whole train period, over all stores)? (single integer number, e.g. 1)
E.g. $1000 during regular weeks and <$500 during holiday weeks.
iii. Generate a new attribute `Department_Type` (1, 2, 3): 1 for
departments that have substantially smaller sales during
holidays (<= 50%) compared to regular weeks (averaged over
the whole train period, over all stores); 3 for departments that have
substantially larger sales during holidays (>= 200%) compared
to regular weeks (averaged over the whole train period, over all stores);
2 for all other departments. How many departments of each
type do you have? (3 integer numbers, divided by commas, e.g.
1, 2, 3) See the sketch at the end of Part I.
If you want, you may do that for every separate store.
2. Analyze `Size` and `Type`.
a. Compute the correlation between total `Weekly_Sales` (in millions) over
the whole training set and `Size`. (single number rounded to 3 decimal places,
e.g. 0.001)
b. Plot a scatter plot of total (over the whole training set) `Weekly_Sales` (in
millions) in a store vs the `Size` of the store.
c. Plot a boxplot of `Weekly_Sales` for the different store types (3 boxes,
one for every Type). For this plot filter out all `Weekly_Sales` bigger
than 70000.
d. Find stores with Type == `B` which have total `Weekly_Sales` (in
millions), over the whole training set, greater than the 80th percentile of total
`Weekly_Sales` of stores with Type == `A`. (list of store IDs, e.g. 1,
2, 3)
e. Generate a new attribute `Total_Sales_Store_Type`: 5 groups (from 1 to 5)
generated using percentiles (0<=p<20, 20<=p<40, 40<=p<60,
60<=p<80, 80<=p<=100) of total `Weekly_Sales` (in millions) over
the whole training set. How many stores of each type do you have? (5
integer numbers, one per type)
3. Analyze `Is_Holiday`.
a. How many holiday weeks are in the train set; in the test set? (two integer
numbers, divided by a comma)
b. Separate out 3 important American holidays: Super Bowl, Black Friday,
Christmas:
i. Generate 5 separate attributes (nominal True-False or 1-0):
`Is_SuperBowl`, `Is_BlackFriday`, `Is_Christmas`,
`Other_holidays` and `Regular_weeks` (`Other_holidays` are
all weeks which are not Super Bowl, Black Friday or
Christmas but have `Is_Holiday` = True; `Regular_weeks` are
weeks which are not Super Bowl, Black Friday or Christmas
and have `Is_Holiday` = False). How many `Other_holidays` weeks
do you have in the whole training set?
Dates will differ slightly from year to year; use the following Google
search pattern: <us holiday_name year dates>, e.g. <us super bowl 2011
dates>. You may share these dates with your colleagues.
ii. Select the 10 stores with the highest total sales in 2011. Compute the
percentage of sales during the Black Friday week compared to total
sales; during the Super Bowl week; during the Christmas week. (three
integer numbers 0-100, e.g. 10, 20, 30)
iii. Select the 10 stores with the lowest total sales in 2011. Compute the
percentage of sales during the Black Friday week compared to total
sales; during the Super Bowl week; during the Christmas week. (three
integer numbers 0-100, e.g. 10, 20, 30)
4. Analyze `Temperature`:
a. Plot a line graph of temperature over time (averaged over all stores).
b. Plot a scatter plot of `Temperature` vs `Weekly_Sales` (every point
corresponds to a single date and a single store). Compute the correlation
between them. (single number rounded to 3 decimal places, e.g.
0.001)
c. Find the 2 stores with the biggest difference in temperature in July 2010.
(two store IDs, e.g. 1, 2)
d. Plot a line graph of temperature over time for these two stores; use
different colors for the different stores.
e. Plot a scatter plot of `Temperature` vs `Weekly_Sales`; use different
colors for points corresponding to the different stores. Compute the
correlation between temperature and sales separately for each of the 2
stores. (two numbers rounded to 3 decimal places, e.g. 0.001, 0.002)
Do they differ from each other? Do they differ from I.4.b?
f. Generate a new attribute `Average_Temperature_month`: the average
temperature over the current month for this particular store.
You may use any reasonable approach: e.g. for the current week the average of the
last 4 weeks, or the average of all weeks from the same month one year ago.
5. Drop `Fuel_Price`, `MarkDown1`, `MarkDown2`, `MarkDown3`,
`MarkDown4`, `MarkDown5`, `CPI`, `Unemployment`, `Temperature`, as
they only appear in the downloaded train table (they are not available for
the "future").
6. Generate the `sample_weights` attribute and set the roles for `sample_weights`
(weight) and `Weekly_Sales` (label); drop the `Is_Holiday` and `Type`
attributes.
You might use so-called "lag" features: e.g. to predict sales in January 2014, use
`Unemployment` for the last available period (July 2013). But you should do this with extreme
caution.
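The sketch below shows one possible way to compute `Department_Type` (I.1.b.iii) in pandas; it assumes the joined `local_train` table from the preliminary steps, with `Dept`, `IsHoliday` and `Weekly_Sales` columns:

```python
# Average weekly sales per department on holiday vs regular weeks (local train only).
dept_sales = (local_train
              .groupby(["Dept", "IsHoliday"])["Weekly_Sales"]
              .mean()
              .unstack("IsHoliday"))  # columns: False (regular weeks), True (holiday weeks)

ratio = dept_sales[True] / dept_sales[False]

def department_type(r):
    # 1: holiday sales <= 50% of regular, 3: holiday sales >= 200% of regular, 2: the rest.
    if r <= 0.5:
        return 1
    if r >= 2.0:
        return 3
    return 2

dept_type = ratio.apply(department_type).rename("Department_Type")
print(dept_type.value_counts().sort_index())  # how many departments of each type

# Attach the new attribute back to the data.
local_train = local_train.merge(dept_type.reset_index(), on="Dept", how="left")
```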

II. Building a regression model (5 points)

This is actually the simplest part; a Python sketch of II.1-II.2 is given at the end of this part.
1. Train a Random Forest model with least_squares (RandomForestRegressor
in sklearn) on the train set; set the number of trees to 10 and the depth of the trees to
10.
a. Make predictions on the test set; on the train set.
b. Compute the performance of your model on the test set; on the train set. Use
wMAE (absolute_error in RM, mean_absolute_error in sklearn). What
is the difference between the train score and the test score? (single number
rounded to 3 decimal places, e.g. 123.456)
E.g. if you obtain 1234.567 wMAE on the train set and 2345.678 on the test set, then the
difference is 1234.567 - 2345.678 = -1111.111.
2. Run a Grid Search to look for the best set of Random Forest parameters. Use
the following grid:
■ Number of trees: from 5 to 50, with step 2 (5, 7, 9, ...)
■ Depth of the tree: from 2 to 20, with step 1 (2, 3, 4, ...)
■ Subset ratio (sklearn max_features): from 0.1 to 1, with step 0.1
(0.1, 0.2, 0.3, ...)
In order to speed up the GridSearch computation you may use a sample instead of the
whole train set. If your sample represents the original train distribution, the
best parameters obtained on this sample will be close to the best parameters obtained
on the whole train set. I recommend using a sample of stores (try to preserve the
distribution of store types).
a. What are the resulting parameters of the best model? (3 numbers, e.g.
5, 2, 0.1)
b. Make predictions on the test set (using your best model); on the train set;
save the results to a file (you will need them in part III).
c. Compute the performance of your model on the test set; on the train set. Use
wMAE (absolute_error in RM, mean_absolute_error in sklearn).
What is the difference between the train score and the test score? (single
number rounded to 3 decimal places, e.g. 123.456)
d. What is the difference between the test score using the default parameters
(II.1.b) and the test score using the parameters obtained with
GridSearch? (single number rounded to 3 decimal places, e.g.
123.456)
3. * Train separate models for every `Total_Sales_Store_Type`.
a. Tune every model using Grid Search as in II.2.
b. Make predictions on the test set; on the train set; save the results to a file
(you will need them in part III).
c. Compute train and test performance using wMAE.
4. * Train separate models for different types of weeks: holiday weeks and
regular weeks.
You may train separate models for every type of week (Christmas, etc.), but you will probably
lack training data for all weeks except regular ones.
a. Tune every model using Grid Search as in II.2.
b. Make predictions on the test set; on the train set; save the results to a file
(you will need them in part III).
c. Compute train and test performance using wMAE.
You might want to compare your local test score with the test score on the
leaderboard of the competition (predict sales for the test table and upload your
predictions to Kaggle to see the results).
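A rough sklearn sketch of II.1-II.2, assuming the preprocessed `local_train`/`local_test` tables contain only numeric feature columns besides `Weekly_Sales`, `sample_weights` and `Date` (the random_state and the cross-validation settings inside the grid search are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

feature_cols = [c for c in local_train.columns
                if c not in ("Weekly_Sales", "sample_weights", "Date")]
X_tr, y_tr, w_tr = local_train[feature_cols], local_train["Weekly_Sales"], local_train["sample_weights"]
X_te, y_te, w_te = local_test[feature_cols], local_test["Weekly_Sales"], local_test["sample_weights"]

# II.1: Random Forest with 10 trees of depth 10; wMAE is the weighted mean_absolute_error.
rf = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=2020)
rf.fit(X_tr, y_tr, sample_weight=w_tr)
train_wmae = mean_absolute_error(y_tr, rf.predict(X_tr), sample_weight=w_tr)
test_wmae = mean_absolute_error(y_te, rf.predict(X_te), sample_weight=w_te)
print(round(train_wmae - test_wmae, 3))

# II.2: grid search over the grid from the assignment (a sample of stores can speed this up).
grid = {
    "n_estimators": list(range(5, 51, 2)),
    "max_depth": list(range(2, 21)),
    "max_features": [round(0.1 * i, 1) for i in range(1, 11)],
}
search = GridSearchCV(RandomForestRegressor(random_state=2020), grid,
                      scoring="neg_mean_absolute_error", cv=3, n_jobs=-1)
search.fit(X_tr, y_tr, sample_weight=w_tr)
print(search.best_params_)
```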

III. Analyzing the resulting model (10 points + 10 bonus points)

For this part you may choose any 2 questions out of 1, 2, 3,
4 (e.g. III.1 and III.4); question 5 is mandatory.
1. Analyze the results for different stores:
a. Compute wMAE separately for stores of different types
(`Total_Sales_Store_Type`).
b. Is there any difference between the errors obtained for different types of
stores? Plot a boxplot or bar plot of the error distribution (1 box/bar for
every store type).
2. Analyze the results for different departments (Dept):
a. Compute wMAE separately for the different department types
(`Department_Type`).
b. Is there any difference between the errors obtained for different types of
departments? Plot a boxplot or bar plot of the error distribution (1
box/bar for every department type).
3. Analyze the results for different weeks:
a. Compare wMAE separately for regular and holiday weeks;
b. Is there any difference between the errors obtained for different types of
weeks (`Is_SuperBowl`, `Is_BlackFriday`, `Is_Christmas`,
`Other_holidays` and `Regular_weeks`)? Plot a boxplot or bar plot of
the error distribution (1 box/bar for every week type).
4. Split the testing part into 4-week parts (1-4, 5-8, ..., 37-39).
a. Compute wMAE for these time periods.
b. Is there any difference between the first 4 weeks and the last 4 (3) weeks?
Plot a box, bar or line plot of the error distribution.
5. Make an overall conclusion. Your conclusion should include answers to (but
not limited to) the following questions:
■ How good is your final model for small/huge stores?
■ What kind of additional features could you suggest for this
task?
■ Would you suggest building individual models for
Stores/Departments/Regular-Holiday weeks/Nearest-Farthest
weeks?
■ Which features were most useful? Least useful?
6. * If you want to get the bonus points you should complete II.3 or II.4 and make a
separate conclusion (III.5) for it.

For this section you should prepare a short report in a .pdf
file that includes your graphs and conclusion. Your
conclusion is limited to 350 words (no fewer than 150
words). You may include up to 4 graphs in your report.
You will upload your report via the Google form.

Name your report the following way:

DS_HA3_[Surname]_[Name]_[chosen task at section III]_[chosen task
at section III].pdf
For example:
DS_HA3_Kurmukov_Anvar_1_3.pdf
This means that for section III I decided to complete tasks III.1 and III.3.

For bonus points you should complete III.6 and provide an additional report
file, with the same rules (150-350 words and up to 4 graphs). Name your report with
the bonus task the following way:
DS_HA3_[Surname]_[Name]_[chosen bonus task]_bonus.pdf
For example:
DS_HA3_Kurmukov_Anvar_4_bonus.pdf
This means that for the bonus task I took task II.4.

Do not include any names/surnames inside the report!

Wrongly named reports will be graded with 0 points!
Home assignment 4. 35 points, 10 bonus points
Seminar 5 plan:
https://docs.google.com/document/d/1zDlYT7LU3NQqV74InsvSj8fyE3hOWDWTsHYBJhfsgXg/edit?usp=sharing

For this task your main goal is to decrease company losses due to customer
churn. We will compare two discount strategies: providing a 20% discount with a
75% acceptance rate and a 30% discount with a 90% acceptance rate.

For this assignment you should use this data (subset of the seminar’s data):
https://yadi.sk/d/K6DApyOjp42IYA

I. Data preprocessing (5).

This time we will skip all the exploration steps and only do some simple feature
preprocessing (a pandas sketch follows the list):
1. Replace "No internet service" with "No" for the following attributes:
`OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`,
`StreamingTV`, `StreamingMovies`. Already done in the HA data.
2. Generate a `tenure_group` attribute: discretize `tenure` into 6 groups: "0-12",
"12-24", "24-36", "36-48", "48-60", "60+" (all are left-closed intervals: [0,
12), [12, 24), [24, 36), ...). What are the sizes of these groups?
Tenure refers to the number of months that a customer has been subscribed.
Do not drop the `tenure` column.
3. Preprocess categorical columns with only 2 unique values ("binary"
columns): replace one unique value with 0 and the other with 1 (label
encoding). How many such columns do you have?
E.g. for the `gender` attribute you may replace Female with 1 and Male with
0, or vice versa.
4. Preprocess categorical columns with more than 2 unique values using
dummy encoding (= one-hot encoding). How many such columns (before
dummy encoding) do you have?
5. Drop the customerID attribute.
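A minimal pandas sketch of steps I.2-I.5, assuming the Telco churn column names used in the text (`tenure`, `gender`, `customerID`) and a hypothetical file name:

```python
import pandas as pd

# Hypothetical file name; use the subset of the seminar's data linked above.
churn = pd.read_csv("telco_churn_subset.csv")

# I.2: discretize tenure into left-closed bins [0, 12), [12, 24), ..., [60, inf).
bins = [0, 12, 24, 36, 48, 60, float("inf")]
labels = ["0-12", "12-24", "24-36", "36-48", "48-60", "60+"]
churn["tenure_group"] = pd.cut(churn["tenure"], bins=bins, labels=labels, right=False)
print(churn["tenure_group"].value_counts())

# I.3: label-encode "binary" categorical columns (exactly 2 unique values).
cat_cols = churn.select_dtypes(include="object").columns.drop("customerID")
binary_cols = [c for c in cat_cols if churn[c].nunique() == 2]
for c in binary_cols:
    churn[c] = (churn[c] == churn[c].unique()[0]).astype(int)

# I.4: one-hot (dummy) encode the remaining categorical columns.
multi_cols = [c for c in cat_cols if c not in binary_cols]
churn = pd.get_dummies(churn, columns=multi_cols)

# I.5: drop the identifier.
churn = churn.drop(columns=["customerID"])
```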

II. Build a churn model (5).


1. Build 2 classification models to predict customer churn:
a. Logistic Regression. What is the ROC AUC of this model?
b. Random Forest. What is the ROC AUC of this model?
In this task I suggest you deviate from the train-test strategy and use a k-fold
approach to train and predict on the whole dataset: train on 4/5 of the data, predict on
the remaining 1/5, and repeat this 5 times; thus you will get predictions for the whole dataset.
Recall that in this case you will actually have 5 trained classifiers, not 1, but for
our purposes this is ok.
After this section you must have predictions for all customers in the dataset
(~6k) obtained using cross-validation (see the sketch below).

Comments:
1. For section III you may use any classification model you want (you are
not restricted to the two models above).
2. You may want to use grid search to look for the best parameters of the
model(s).
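A minimal sklearn sketch of the k-fold approach described above (`churn` is the preprocessed table from part I; the `Churn` label name and the model parameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# `churn` is the preprocessed table from part I; `Churn` is assumed to be the 0/1 label column.
X = churn.drop(columns=["Churn"])
y = churn["Churn"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=2020))]:
    # Out-of-fold predicted churn probabilities for every customer in the dataset.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    print(name, "ROC AUC:", round(roc_auc_score(y, proba), 3))
```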

III. Compare two discount strategies (25 + 10).


Assumptions:
1. Every customer pays the same price p which is the average of
`MonthlyCharges`.
2. If we decide to provide a discount we provide it to all the customers who are
predicted as Churn=Yes.
3. When we compute gains, costs and losses we compute them for the short
term.
Therefore all the computations from the seminar hold (except you need to
recompute the coefficients).
4. Strategy’s profit is the difference between gains, costs and losses:
profit = gains - costs - losses
5. Profit per customer is the total profit divided by the number of customers (if
the person churns the person is not a customer anymore).

Strategy A: Provide a 20% discount with a 75% acceptance rate.
Strategy B: Provide a 30% discount with a 90% acceptance rate.
In the seminar we had a 30% discount with an 80% acceptance rate.

You are not obligated to use any particular software.

You may use any classifier you want.
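A minimal sketch of how the confusion-matrix counts in questions 1-2 can be computed from the cross-validated probabilities (`y` and `proba` come from the previous sketch); the gains, costs, losses and profit follow the seminar's formulas, so they are only indicated in the comments:

```python
import numpy as np

def confusion_counts(y_true, proba, threshold):
    """TP, FP, TN, FN for churn predictions at a given probability threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn

# Question 1 uses threshold 0.5; question 2 sweeps the 9 thresholds below. For every
# threshold, plug TP/FP/TN/FN into the seminar's gains/costs/losses formulas together
# with p (the average `MonthlyCharges`) and the strategy's discount and acceptance rate.
for threshold in [round(0.1 * i, 1) for i in range(1, 10)]:
    tp, fp, tn, fn = confusion_counts(y.to_numpy(), proba, threshold)
    print(threshold, tp, fp, tn, fn)
```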

1. Use the default threshold of 0.5 to compute the confusion matrix. Based on
this confusion matrix report (5 points):
a. TP, FP, TN, FN
b. Losses if you do not apply any discount strategy.
c. Total gains from discount strategy B.
d. Total costs of discount strategy B.
e. Total losses of discount strategy B.
f. Total profit of discount strategy B.
g. Profit per customer pd (using strategy B).
2. Use 9 different thresholds: 0.1, 0.2, ..., 0.9. Answer the following
questions (10 points):
a. What is the threshold with the highest accuracy, using strategy B?
b. What is the threshold with the highest profit, using strategy B? What
is the highest profit, using strategy B?
c. What is the threshold with the highest profit per customer, using
strategy B? What is the highest profit per customer, using strategy
B?
d. What is the ratio of the profit per customer pd (obtained in the previous
step) to p?
e. Which strategy yields the highest profit (A or B)? What are the TP,
FP, TN, FN in this case? What is the highest profit in that case?
f. Which strategy yields the highest profit per customer (A or B)? What
are the TP, FP, TN, FN in this case? What is the highest profit per
customer in that case?
3. Prepare a report (10 points). Your report must summarize your results.
Reports with a simple copy-paste of the results will be graded with 0 points!
Some example questions (you are not limited or restricted to them):
● Do the thresholds for the highest profit and the highest profit per
customer coincide or not? Why?
● Which would you decide to choose? Under what circumstances (how
many clients will you lose in both situations; what should be the
decision criteria)?
● How much does your profit per customer decrease for customers
to whom you provide a discount, compared to customers to
whom you do not provide a discount? Compare this number
with the discount.
4. * For the bonus 10 points you need to redo all the computations, but now instead
of the average p you should use each customer's MonthlyCharges. All the results for
the bonus task must be summarized in an additional report. You must
provide a comparison of the results (with the regular case when you use p).

Your report must be prepared as a .pdf file. It is limited
to 350 words. You may include up to 4 graphs in your
report. You will upload your report via the Google form.

Name your report the following way:


DS_HA4_[Surname]_[Name].pdf
For example:
DS_HA4_Kurmukov_Anvar.pdf
Name your report for the bonus task (same rules: up to 350 words and 4
graphs) the following way:
DS_HA4_[Surname]_[Name]_bonus.pdf
For example:
DS_HA4_Kurmukov_Anvar_bonus.pdf

Do not include any names/surnames inside the report!

Wrongly named reports (or reports with your identification
in them) will be graded with 0 points!
Home assignment 5. 45 points.

Clustering seminar (6) plan:
https://docs.google.com/document/d/1Ptj7J1ikOVsuGmY5rPNFCo0p3hrDcLyQAwotDxxbBiQ/edit?usp=sharing

Your goal for this task is twofold:

1. Cluster all the products into distinct groups (clusters).
2. Build a recommender system for customers, but instead of products we will
recommend categories.

I. Clustering (20)

1. Feature generation. Use the examples from the Seminar 6 plan to generate
features for products clustering. You may generate any number of
features, but you must generate at least 3 features which differ from
those proposed in the plan.
2. Cluster all products into distinct groups (clusters). You may use any
clustering algorithm you want. If you use distance-based clustering
(e.g. k-means), do not forget to preprocess your features
(normalization, z-scoring or standard scaling). Try different numbers
of groups (e.g. from 5 to 30). A couple of useful links for selecting the
number of clusters (a short Python sketch follows them):
a. Rapid Miner:
i. https://community.rapidminer.com/discussion/39731/dynamically-determine-number-of-clusters-k-means
ii. https://mgaproyekto.blogspot.com/2018/06/rapidminer-tutorial-k-means-clustering.html
b. Python/R:
i. https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
ii. https://medium.com/@masarudheena/4-best-ways-to-find-optimal-number-of-clusters-for-clustering-with-python-code-706199fa957c
3. Write a report. In your report you should present the following
information:
a. Put an example screenshot of your features.
b. Explain (in a similar way to how I explain them in the plan) every
single feature you use (you may skip the features from the seminar plan).
c. Cluster information: how many clusters do you have, and how
many objects are in these clusters.
d. Cluster interpretation. Try to provide an interpretation of
every single cluster (or group of clusters) you end up with. For
example:
"Cluster `1` includes hot drinks and beverages often bought in combination in the
morning."
e. You may include any visualization you find necessary, e.g.:
colored PCA components, histograms or pie charts of cluster
sizes, "elbows" used for selecting the number of clusters (if you
have used them).
Python visualization examples:
https://www.kaggle.com/python10pm/plotting-with-python-learn-80-plots-step-by-step
Rapid Miner visualization examples:
https://www.ou.nl/documents/40554/349790/IM0503_ChartingInRapidMiner.pdf/c5ede337-5287-14f6-bc3d-f9221efc4fea,
https://michael.hahsler.net/SMU/EMIS3309/slides/Essential_Visualizations.pdf
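If you cluster in Python, a minimal sketch of the scaling and k-means steps is given below; `product_features` stands for whatever numeric feature table you engineered in I.1, and the chosen number of clusters is a placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `product_features`: one row per product with the numeric features engineered in I.1.
X = StandardScaler().fit_transform(product_features)

# Try several numbers of clusters and inspect the inertia curve ("elbow").
inertias = {}
for k in range(5, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=2020)
    km.fit(X)
    inertias[k] = km.inertia_

# Fit the chosen model and attach the cluster labels back to the products.
best_k = 20  # placeholder: pick it from the elbow / silhouette analysis
product_features["cluster"] = KMeans(n_clusters=best_k, n_init=10,
                                     random_state=2020).fit_predict(X)
```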
II. Recommender system (25)

Python examples:
Collaborative filtering example:
https://realpython.com/build-recommendation-engine-collaborative-filtering/
MF recommender systems library: https://github.com/lyst/lightfm

1. Prepare user-item data as it was done during the seminar: User, Item,
Score. You may construct the Score (e.g. see the seminar) any way you want,
but you must explain it in your report.
2. Split your data into train and test sets (as Leonid explained during the
lecture): some of the user-item pairs go to the train set and some to the
test set.
3. Build a recommender system using the cluster groups (if you have about
20-40 clusters) or item subcategories (the 75 most frequent values of the
`dbi_item_famly_name` attribute) as items and `dd_card_number` as
users. You may want to play with the number of neighbours in your
KNN recommender model (see the sketch after this list).
4. Compute 3 different recommender performance scores, which were
explained during the lecture or seminar, to assess the quality of your
recommendations (use appropriate metrics).
5. Write a report. In your report you should present the following
information:
a. Report the computed performance scores.
b. Elaborate on the quality of your recommendations.
c. Provide 3-5 examples of `good` recommendations suggested by
your recommender system.
d. Provide 3-5 examples of `bad` recommendations suggested by
your recommender system.
e. You may report any additional information you find potentially
useful for assessing the quality of your recommendations: e.g. for a
couple of customers compute the price of their average
purchase (or of an item in a purchase) and compare it with the
average price of the recommended items.
f. You may use any visualisations you find useful.
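One simple way to build such a recommender in Python is a KNN over a pivoted user-item matrix, sketched below (the `ratings` table with user/item/score columns is the one prepared in II.1; libraries such as lightfm are an equally valid choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# `ratings` has one row per (user, item, score) pair as prepared in II.1, where "item"
# is the cluster group or item subcategory and "user" is `dd_card_number`.
user_item = ratings.pivot_table(index="user", columns="item", values="score", fill_value=0)

# User-based KNN with cosine similarity.
knn = NearestNeighbors(metric="cosine", n_neighbors=20)
knn.fit(user_item.values)

def recommend(user_id, top_n=5):
    """Recommend categories the user has not scored yet, based on similar users."""
    distances, indices = knn.kneighbors(user_item.loc[[user_id]].values)
    neighbours = user_item.iloc[indices[0]]  # note: the query user is among its own neighbours
    scores = neighbours.mean(axis=0)
    scores[user_item.loc[user_id] > 0] = -np.inf  # hide already-purchased categories
    return scores.sort_values(ascending=False).head(top_n)

print(recommend(user_item.index[0]))
```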

Your report should consist of 2 parts: a report for the clustering part and a report for the
recommendation part. Every part is limited to 350 words and 5 pictures. Name
your report the following way:

DS_HA5_[Surname]_[Name].pdf

Do not include any names/surnames inside the report!

Wrongly named reports (or reports with your identification
in them) will be graded with 0 points!
September exam

We are using house sale price data from King County, Washington, USA. This
dataset is in the public domain and can be obtained from Kaggle:
https://www.kaggle.com/harlfoxem/housesalesprediction

1. Observe the top 10 observations (int):
a. What is the price of the house with `id` == 7237550310?
b. How many bedrooms does the house with `id` == 7237550310 have?
c. When was the house with `id` == 2414600126 built (`yr_built`)?
2. Observe the last 10 observations (int):
a. What is the price of the house with `id` == 263000018?
b. How many bedrooms does the house with `id` == 291310100 have?
c. When was the house with `id` == 1523300141 built (`yr_built`)?
3. Display some column statistics (list of floats, rounded to 3 digits, e.g.
1.234):
a. What are the max, min, mean and std of the `floors` column?
b. What are the max, min, mean and std of the `sqft_living` column?
c. What are the max, min, mean and std of the `price` column?
4. Select rows/columns (int):
a. How many houses were built during the American Great Depression
(1929-1939)? Include both the start and end year.
b. How many houses built before the first human in space (<1961) have
a high condition (=5)?
c. How many houses with a waterfront (=1) were built during Nixon's
presidency (1969-1974)? Include both the start and end year.
d. How many houses were sold for 256000 dollars?
5. Select rows/columns and compute simple statistics (float):
a. What was the average (sold) price of the houses built in the year of the
Cuban Missile Crisis (1962)?
b. What was the price of the most expensive house sold that was built between
1991 and 2000?
6. Create new columns using the old ones (new column in your DataFrame):
a. Create a `sqft_tot_area` column (the sum of all columns with the `sqft_`
prefix) using any method above.
b. Create a new column `sqm_tot_area` using `sqft_tot_area` and the fact
that 1 foot = 0.3048 meters.
c. Create a new column `sqm_aver_floor_area` by dividing the total area (in
meters) by the number of floors.
d. Create a new bool column `high_class`: it is True if the house has
grade >= 9 and condition >= 4.
7. Create some groupby features:
a. `price_by_class`: group by `high_class` and compute the median `price`.
b. `area_by_price`: group by `price_cat` and compute the average
`sqft_living`.
c. `floors_by_age`: group by `floors` and compute the average age of a house.

Drop the features `price_by_class` and `area_by_price`, and preprocess the categorical and
date features.
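A pandas sketch for a few of the selection, feature-generation and groupby questions above (the file name and column names are taken from the Kaggle dataset and may differ in your copy):

```python
import pandas as pd

houses = pd.read_csv("kc_house_data.csv")  # the Kaggle King County file

# 4.a: houses built during the Great Depression (1929-1939, inclusive).
print(houses["yr_built"].between(1929, 1939).sum())

# 5.a: average sold price of houses built in 1962.
print(houses.loc[houses["yr_built"] == 1962, "price"].mean())

# 6.a-6.c: total area in square feet and square meters, average floor area.
sqft_cols = [c for c in houses.columns if c.startswith("sqft_")]
houses["sqft_tot_area"] = houses[sqft_cols].sum(axis=1)
houses["sqm_tot_area"] = houses["sqft_tot_area"] * 0.3048 ** 2  # 1 ft = 0.3048 m
houses["sqm_aver_floor_area"] = houses["sqm_tot_area"] / houses["floors"]

# 6.d: high_class flag.
houses["high_class"] = (houses["grade"] >= 9) & (houses["condition"] >= 4)

# 7.a: median price per high_class group, mapped back onto every row.
houses["price_by_class"] = houses.groupby("high_class")["price"].transform("median")
```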

8. Split your data into train (70%) and test (30%) parts. How many records
(rows) do you have in the train and test tables? (list of int)
9. Create a predictive regression model of the house price:
a. Use decision tree regression.
b. Use k nearest neighbours regression.
10. Use grid search to select the optimal hyperparameters of your models:
a. Depth for the tree.
b. Number of neighbours for the knn.
11. Compute the train and test mean squared error for your best models (list of
floats):
a. Train and test MSE using decision tree regression.
b. Train and test MSE using k nearest neighbours regression.
12. Normalize your numerical features and repeat steps 9-11. Does the train/test
MSE change for the KNN model? For the decision tree model?
13. Write a short (3-5 sentences) report on your solution.
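A rough sklearn sketch for questions 8-11 (the random_state, the fold count and the exact grids are illustrative; `houses` is the prepared table from the previous sketch):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Keep only numeric predictors; `houses` is the prepared table from the previous sketch.
X = houses.drop(columns=["price", "id", "date"], errors="ignore").select_dtypes("number")
y = houses["price"]

# 8: 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
print(len(X_train), len(X_test))

# 9-11: tune each model with grid search and report train/test MSE for the best one.
models = {
    "tree": (DecisionTreeRegressor(random_state=2020), {"max_depth": list(range(2, 21))}),
    "knn": (KNeighborsRegressor(), {"n_neighbors": list(range(1, 31))}),
}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, scoring="neg_mean_squared_error", cv=5)
    search.fit(X_train, y_train)
    best = search.best_estimator_
    print(name, search.best_params_,
          round(mean_squared_error(y_train, best.predict(X_train)), 3),
          round(mean_squared_error(y_test, best.predict(X_test)), 3))
```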
