Computer Lab 2 Block 1-3
Divide the data randomly into training and test sets (50/50) using the code from the
lectures.
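As a rough illustration of such a split (this is not the lecture code, and the seed value and the name `split_half` are assumptions made for this sketch), a reproducible 50/50 partition could look like this:

```python
import random

def split_half(rows, seed=12345):
    """Randomly split rows into two halves (train, test)."""
    rng = random.Random(seed)        # fixed seed (an arbitrary choice) makes the split reproducible
    idx = list(range(len(rows)))
    rng.shuffle(idx)                 # random permutation of the row indices
    n_train = len(rows) // 2         # floor(n / 2) observations go to training
    train = [rows[i] for i in idx[:n_train]]
    test = [rows[i] for i in idx[n_train:]]
    return train, test

rows = list(range(11))               # stand-in for the rows of the data set
train, test = split_half(rows)
```

Every row ends up in exactly one of the two sets; with an odd number of rows the test set receives the extra observation.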
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid',
'management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note:
'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate',
'professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute
highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not
known before a call is performed. Also, after the end of the call y is obviously known. Thus,
this input should only be included for benchmark purposes and should be discarded if the
intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client
(numeric, includes last contact)
732A99/732A68/ TDDE01 Machine Learning
Division of Statistics and Machine Learning
Department of Computer and Information Science
13 - pdays: number of days that passed by after the client was last contacted from a
previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client
(numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical:
'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
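Two notes in the variable list affect preprocessing: `duration` is not known before a call, and `pdays = 999` is a code for "not previously contacted" rather than a real number of days. A minimal sketch of handling both, assuming each client is held as a dictionary (the function name `prepare_record` and the extra key `previously_contacted` are hypothetical, not part of the data set):

```python
def prepare_record(record, realistic=True):
    """Return a copy of one client record prepared for modelling."""
    out = dict(record)                  # leave the original record untouched
    if realistic:
        out.pop("duration", None)       # duration is only known after the call
    # 999 is a sentinel value, so expose it as an explicit flag (hypothetical column)
    out["previously_contacted"] = out["pdays"] != 999
    return out

example = {"age": 41, "duration": 0, "pdays": 999, "y": "no"}
prepared = prepare_record(example)
```

Passing `realistic=False` keeps `duration`, which matches the benchmark use mentioned in the description of attribute 11.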
and report the misclassification rates for the training and validation data.
Which model is the best one among these three? Report how changing the
deviance and node size affected the size of the trees and explain why.
3. Use training and validation sets to choose the optimal tree depth in the
model 2c: study the trees up to 50 leaves. Present a graph of the dependence
of deviances for the training and the validation data on the number of leaves
and interpret this graph in terms of the bias-variance tradeoff. Report the
optimal number of leaves and which variables seem to be most important for
decision making in this tree. Interpret the information provided by the tree
structure (not everything, but the most important findings).
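The selection in step 3 can be sketched as follows, assuming the deviance of a classification tree is taken as minus twice the log-likelihood of the observed classes and that a validation deviance has already been computed for each candidate number of leaves. The numbers in `valid_dev` are made up purely for illustration, and both function names are assumptions of this sketch:

```python
import math

def deviance(y_true, p_yes, eps=1e-12):
    """Binomial deviance: -2 * sum of log-probabilities of the observed classes."""
    total = 0.0
    for y, p in zip(y_true, p_yes):
        p_obs = p if y == "yes" else 1.0 - p   # probability given to the observed class
        total += math.log(max(p_obs, eps))     # clip to avoid log(0)
    return -2.0 * total

def best_leaves(valid_dev_by_leaves):
    """Return the number of leaves with the smallest validation deviance."""
    return min(valid_dev_by_leaves, key=valid_dev_by_leaves.get)

# Made-up validation deviances, keyed by number of leaves (illustration only)
valid_dev = {2: 310.0, 3: 295.5, 4: 290.1, 5: 292.8, 6: 301.4}
chosen = best_leaves(valid_dev)
```

Plotting the training and validation deviances against the number of leaves gives the graph the step asks for; the training curve typically keeps decreasing, while the validation curve turns upward once the tree starts to overfit.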
4. Estimate the confusion matrix, accuracy and F1 score for the test data by
using the optimal model from step 3. Comment on whether the model has
good predictive power and which of the measures (accuracy or F1 score)
should be preferred here.
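For reference, the three quantities in step 4 can be computed from true and predicted labels as in the sketch below, which assumes 'yes' is treated as the positive class. The toy labels are invented to show how the two measures can disagree when one class is rare:

```python
def confusion(y_true, y_pred, positive="yes"):
    """Return the counts (TP, FP, FN, TN) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn             # F1 = 2TP / (2TP + FP + FN)
    return 2 * tp / denom if denom else 0.0

# Toy example: a classifier that always answers 'no' on imbalanced data
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 10
counts = confusion(y_true, y_pred)
```

Here `accuracy(*counts)` is 0.8 while `f1_score(*counts)` is 0.0, because no positive case is ever detected.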
5. Perform a decision tree classification of the test data with the following loss
matrix,
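$$
L = \begin{pmatrix} 0 & 5 \\ 1 & 0 \end{pmatrix}
$$
where the rows correspond to the observed class (yes, no) and the columns to the predicted class (in the same order, so that the zero diagonal corresponds to correct predictions),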
and report the confusion matrix for the test data. Compare the results with
those from step 4 and discuss how the rates have changed and why.
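One way to read the loss matrix in step 5 is to predict the class with the smallest expected loss under the estimated class probabilities. The sketch below assumes the rows of the matrix are observed classes and the columns predicted classes, both ordered (yes, no); the name `predict_with_loss` is hypothetical, and a tree-fitting library may apply a loss matrix differently:

```python
# LOSS[observed][predicted], read from the matrix under the stated assumption
LOSS = {"yes": {"yes": 0, "no": 5},
        "no":  {"yes": 1, "no": 0}}

def predict_with_loss(p_yes, loss=LOSS):
    """Pick the prediction that minimises the expected loss."""
    p = {"yes": p_yes, "no": 1.0 - p_yes}
    expected = {
        pred: sum(p[obs] * loss[obs][pred] for obs in p)
        for pred in ("yes", "no")
    }
    return min(expected, key=expected.get)
```

Under this reading, predicting 'yes' costs `1 - p_yes` in expectation and predicting 'no' costs `5 * p_yes`, so 'yes' is chosen whenever `p_yes` exceeds 1/6 instead of the usual 1/2.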
6. Use the optimal tree and a logistic regression model to classify the test data by
using the following principle:
$$\hat{Y} = \text{yes if } p(Y = \text{'yes'} \mid X) > \pi, \text{ otherwise } \hat{Y} = \text{no},$$
where $\pi = 0.05, 0.1, 0.15, \ldots, 0.9, 0.95$. Compute the TPR and FPR values for the
two models and plot the corresponding ROC curves. What do you conclude? Why could a
precision-recall curve be a better option here?
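The threshold rule in step 6 turns a vector of predicted probabilities into one (FPR, TPR) point per value of π. A minimal sketch, with invented labels and probabilities and a hypothetical function name `roc_points`:

```python
def roc_points(y_true, p_yes, thresholds):
    """Return one (threshold, FPR, TPR) tuple per threshold."""
    pos = sum(1 for y in y_true if y == "yes")
    neg = len(y_true) - pos
    points = []
    for pi in thresholds:
        # Classify as 'yes' when the predicted probability exceeds pi
        tp = sum(1 for y, p in zip(y_true, p_yes) if p > pi and y == "yes")
        fp = sum(1 for y, p in zip(y_true, p_yes) if p > pi and y != "yes")
        tpr = tp / pos if pos else 0.0
        fpr = fp / neg if neg else 0.0
        points.append((pi, fpr, tpr))
    return points

thresholds = [k / 100 for k in range(5, 100, 5)]   # 0.05, 0.10, ..., 0.95
y_true = ["yes", "no", "yes", "no"]                # toy labels
p_yes = [0.9, 0.4, 0.3, 0.05]                      # toy predicted probabilities
points = roc_points(y_true, p_yes, thresholds)
```

Running the same function on the probabilities from the tree and from the logistic regression gives two point sets that can be drawn in one plot, with FPR on the horizontal axis and TPR on the vertical axis.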
to optimize this cost with starting point $\theta^0 = 0$ and compute training and
test errors for every iteration number. Present a plot showing dependence of
both errors on the iteration number and comment which iteration number is
optimal according to the early stopping criterion. Compute the training and
test error in the optimal model, compare them with results in step 3 and
make conclusions.
a. Hint 1: don’t store the parameters from each iteration (that would
take a lot of memory); instead, compute and store the test errors directly.
b. Hint 2: discard some of the initial iterations, e.g. the first 500, in your
plot to make the dependences visible.
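The bookkeeping suggested in Hint 1 can be sketched as below. The cost function and optimizer of this assignment are not reproduced here, so a one-parameter least-squares cost, plain gradient descent and toy data stand in for them; these, along with the names `mse` and `run_with_error_trace`, are assumptions of the sketch:

```python
def mse(theta, xs, ys):
    """Mean squared error of the one-parameter model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def run_with_error_trace(x_tr, y_tr, x_te, y_te, n_iter=200, lr=0.01):
    theta = 0.0                               # starting point at zero
    train_err, test_err = [], []
    for _ in range(n_iter):
        # Gradient of the training MSE with respect to theta
        grad = 2 * sum(x * (theta * x - y) for x, y in zip(x_tr, y_tr)) / len(x_tr)
        theta -= lr * grad
        # Store only the errors, never the parameter values themselves
        train_err.append(mse(theta, x_tr, y_tr))
        test_err.append(mse(theta, x_te, y_te))
    return train_err, test_err

train_err, test_err = run_with_error_trace([1, 2, 3], [2, 4, 6], [1, 2], [2.2, 3.8])
# Early stopping: the iteration (0-based index) with the smallest test error
best_iter = min(range(len(test_err)), key=test_err.__getitem__)
```

In this toy run the training error keeps falling, while the test error reaches its minimum at an intermediate iteration and then rises slightly, which is the pattern the early stopping criterion looks for.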
1. submits the group report using the Lab X item in the Submissions folder
before the deadline, making sure that the report contains the Statement of
Contribution describing how each group member has contributed to the
group report.
2. goes to LISAM → Course Documents → Deadlines.PDF and finds the deadline
(date and time) for the current lab.
3. goes to LISAM → Course Documents → Seminars.PDF and finds the group
number of the opponent group.
4. goes to LISAM → Course Documents → Groups.PDF and finds the email
addresses of the students in the opponent group.
5. goes to LISAM → Outlook app and, in the Outlook web client, creates a new
message in which he or she:
• specifies “Lab X report” as the title (X is the lab number)
• specifies the email addresses of the opponents in the “To:” field
• attaches the group PDF report
• Important: clicks on the arrow next to the “Send” button, chooses
“Send Later” and specifies the lab deadline as the message
delivery time stamp (see figure)
contains the Statement of Contribution describing how each group member has
contributed to the group report.
• After the deadline for the lab has passed, you should receive the speakers' PDF
report by email. Compile it, read it carefully and prepare (in
cooperation with your group comrade) at least three
questions/comments/improvement suggestions per lab assignment to
raise at the seminar.