Final Assignment

This document contains a multi-part assignment involving KNN classification, decision tree classification, and clustering on a small dataset with variables X1, X2, and a binary outcome Y. For KNN, the student is asked to make predictions for different values of k and distance metrics. For decision trees, the student computes a split's Gini index and argues for different split points. Finally, the student is tasked with clustering the records based only on X1 and X2.

Uploaded by

hyperloke
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views7 pages

Final Assignment

This document contains a multi-part assignment involving KNN classification, decision tree classification, and clustering on a small dataset with variables X1, X2, and a binary outcome Y. For KNN, the student is asked to make predictions for different values of k and distance metrics. For decision trees, the student computes a split's Gini index and argues for different split points. Finally, the student is tasked with clustering the records based only on X1 and X2.

Uploaded by

hyperloke
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 7

Final Assignment (60 points)

Problem 1 (15 points)

Here is a short dataset consisting of a binary outcome variable Y and two independent variables, X1 and X2. The independent variables are already normalized, so no further normalization is needed.

X1 X2 Y
3 5 1
1 4 0
3 2 0
2 2 1
4 1 1

a) Suppose you are asked to predict the outcome for (X1, X2) = (4, 4). Use KNN with k = 3 to
predict this outcome. You can use Euclidean distance as the distance measure.
b) Predict the outcome with k = 5.

c) Use k = 3 with Manhattan distance and re-evaluate the prediction. Should the prediction
with k = 5 change? Why or why not? Is there another name for the prediction made with
k = 5?
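The KNN steps above (compute distances, take the k nearest, majority-vote) can be sanity-checked with a short Python sketch. The function and variable names here are illustrative, not part of the assignment:

```python
from collections import Counter

# Dataset from Problem 1: (X1, X2, Y)
data = [(3, 5, 1), (1, 4, 0), (3, 2, 0), (2, 2, 1), (4, 1, 1)]
query = (4, 4)

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def knn_predict(query, data, k, dist):
    # Sort records by distance to the query; ties keep dataset order.
    neighbors = sorted(data, key=lambda r: dist((r[0], r[1]), query))[:k]
    votes = Counter(r[2] for r in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(query, data, 3, euclidean))  # part (a): predicts 1
print(knn_predict(query, data, 5, euclidean))  # part (b): all five records vote
print(knn_predict(query, data, 3, manhattan))  # part (c): watch for distance ties
```

Note that with Manhattan distance several records are tied at distance 3 from the query, so the k = 3 answer depends on how you break ties; state your tie-breaking rule in your write-up.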
Problem 2 (30 points)

Consider the same dataset as in Problem 1.

X1 X2 Y
3 5 1
1 4 0
3 2 0
2 2 1
4 1 1

a) We would now like to fit a classification tree to the above dataset. Consider the split X1 =
2.5. Compute the weighted Gini index of this split.
b) Suppose we change the split to X1 = 3.5. Provide an argument for why this split is better or
worse.
c) Based on your judgment, introduce a split on X2 in addition to the previous split on X1
(either 2.5 or 3.5). Show that this split improves fit and draw the corresponding decision tree.
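The weighted Gini calculation in parts (a) and (b) can be checked with a small sketch: compute each child node's impurity 1 − Σ p², then weight by the fraction of records in each node (helper names are illustrative):

```python
from collections import Counter

# Dataset from Problem 2: (X1, X2, Y)
data = [(3, 5, 1), (1, 4, 0), (3, 2, 0), (2, 2, 1), (4, 1, 1)]

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(data, threshold):
    # Split on X1 at the given threshold; weight each child's
    # impurity by its share of the records.
    left = [y for x1, x2, y in data if x1 <= threshold]
    right = [y for x1, x2, y in data if x1 > threshold]
    n = len(data)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(weighted_gini(data, 2.5))  # 7/15 ≈ 0.467
print(weighted_gini(data, 3.5))  # 0.4, the lower (purer) of the two splits
```

A lower weighted Gini index means purer child nodes, which is the basis for the comparison asked for in part (b).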
Problem 3 (15 points)

For the same dataset, ignore the Y variable and simply consider the X variables:

Record # X1 X2
R1 3 5
R2 1 4
R3 3 2
R4 2 2
R5 4 1
We are now interested in a clustering exercise.

a) Fill in a distance matrix giving the distance from each record to every other record.
Use either Euclidean or Manhattan distance, whichever is more convenient.

R1 R2 R3 R4 R5
R1
R2
R3
R4
R5
b) Construct two clusters based on the distance matrix above. Can you improve these clusters?
How would you measure this improvement?
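The distance matrix in part (a), and one possible way to score a clustering for part (b), can be sketched as follows. The choice of Manhattan distance and the example cluster assignment are illustrative, not the required answer:

```python
# Records from Problem 3: (X1, X2)
points = {"R1": (3, 5), "R2": (1, 4), "R3": (3, 2), "R4": (2, 2), "R5": (4, 1)}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Pairwise Manhattan distance matrix (symmetric, zero diagonal).
names = sorted(points)
matrix = {a: {b: manhattan(points[a], points[b]) for b in names} for a in names}
for a in names:
    print(a, [matrix[a][b] for b in names])

def within_cluster_distance(cluster):
    # Sum of pairwise distances inside a cluster: one simple way
    # to measure cluster tightness (smaller is better).
    members = list(cluster)
    return sum(manhattan(points[members[i]], points[members[j]])
               for i in range(len(members)) for j in range(i + 1, len(members)))

# One plausible 2-cluster split suggested by the matrix (illustrative):
clusters = [{"R1", "R2"}, {"R3", "R4", "R5"}]
print(sum(within_cluster_distance(c) for c in clusters))
```

Comparing the total within-cluster distance before and after moving a record between clusters is one concrete way to demonstrate the "improvement" part (b) asks about.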