Midterm - APS1070 - 2020 - 05 Summer
“In submitting this assessment, I confirm that my conduct during this quiz adheres to the Code
of Behaviour on Academic Matters. I confirm that I did NOT act in such a way that would
constitute cheating, misrepresentation, or unfairness, including but not limited to, using
unauthorized aids and assistance, impersonating another person, and committing plagiarism. I
pledge upon my honour that I have not violated the Faculty of Applied Science & Engineering’s
Honour Code during this assessment.”
3. [2] In lecture, we discussed decision trees – an intuitive classification model that splits on
different attributes, creating a tree-like structure. A data scientist is given a large data set and
uses part of the data to train a very large decision tree, with many branches and nodes, that
perfectly fits the training data. When they apply it to the validation data, overall accuracy is only 78%.
a. Why is validation performance so poor?
b. What can the data scientist do to improve the model?
4. [2] A data scientist has a data set with a lot of features and chooses to use some of these
features to train a model on training data and evaluate performance on testing data. They find
that both training and testing accuracy are poor. Which would you recommend: (i) removing a few
features or (ii) adding more features? Explain.
5. [2] A data set with 4 features has the following covariance matrix:
       A        B        C        D
A    0.5      0.018    0.11     0.048
B    0.018    0.01     0.0025   0.14
C    0.11     0.0025   0.023    0.0055
D    0.048    0.14     0.0055   6
You’re asked to remove a highly correlated feature from the data set. Which one would you
remove?
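Covariance entries depend on the scale of each feature, so they are usually normalized to correlations before being compared: corr(X, Y) = cov(X, Y) / (sd(X) * sd(Y)), where sd(X) is the square root of the diagonal entry for X. The following is a minimal sketch of that normalization applied to the matrix above. The function names are illustrative and not part of the course material. Because the printed entries are rounded, a computed correlation may fall slightly outside [-1, 1].

```python
import math

# Covariance matrix copied from the question (rows and columns ordered A, B, C, D).
features = ["A", "B", "C", "D"]
cov = [
    [0.5,   0.018,  0.11,   0.048],
    [0.018, 0.01,   0.0025, 0.14],
    [0.11,  0.0025, 0.023,  0.0055],
    [0.048, 0.14,   0.0055, 6.0],
]


def correlation_from_covariance(cov):
    """Return the correlation matrix: cov[i][j] / (sd_i * sd_j)."""
    n = len(cov)
    # Standard deviations are the square roots of the diagonal (variances).
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


def most_correlated_pair(corr, names):
    """Return (name_i, name_j, correlation) for the off-diagonal pair
    with the largest absolute correlation."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if best is None or abs(corr[i][j]) > abs(best[2]):
                best = (names[i], names[j], corr[i][j])
    return best


corr = correlation_from_covariance(cov)
pair = most_correlated_pair(corr, features)

# Print the full correlation matrix, then the strongest pair.
for name, row in zip(features, corr):
    print(name, [round(value, 3) for value in row])
print("Most correlated pair:", pair[0], pair[1])
```

Either feature in the reported pair is a candidate for removal, since each carries largely the same information as the other.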
6. [4] You have two binary classification models (P_1 and P_2) that use a series of features to
predict the probability of emails being spam. The computed probabilities are shown in the table
below, along with the actual labels, for six validation examples.