
CS 4780/5780 Homework 5

Due: Wednesday 04/10/2024 11:59pm on Gradescope

Note: You can work in a group of up to 3. Please include your teammates’ NetIDs
and names on the front page and form a group on Gradescope.

Problem 1: Back AK-NN [16 points]


K-NN is back again! We will first learn, visually, the relation between $k$ in the K-NN algorithm and the bias and variance of the error it produces.
Assume we have a large dataset of patient records, each tagged with three properties: whether atrial fibrillation (AFib) was detected, the age of the patient, and how many minutes of exercise they put in every week on average. This data is plotted in figure 1. The red circles represent patients with AFib (class 1), while the blue circles represent those without (class 0). We want to predict the probability of onset of atrial fibrillation in new patients.

1. (1) Independent of how K-NN treats classification, say you are asked to draw a single line to demarcate the boundary between a high chance of AFib (red area) and a low chance of AFib (blue area). Where would you draw the line? Make a copy of figure 1 in your submission and overlay your line. Keep your line smooth.

While we haven't discussed the mathematical form of the bias, variance, and noise terms for classification models, we can safely assume that the variance will depend on the number of times $h_D(x)$ disagrees with $\bar{h}(x)$, while the bias will depend on how often $\bar{h}(x)$ differs from $\bar{y}(x)$.
Here you can estimate $\bar{y}(x)$ (the expected class for $x$) as the class predicted by the line you drew in part 1, $h_D(x)$ is the class predicted by the K-NN classifier trained on dataset $D$, and $\bar{h}(x)$ can be estimated by taking a majority vote of the classes predicted by the five classifiers $h_D(x)$, one per sampled dataset $D$.
For the sampled datasets $D$ given in figure 2, we will now visualize what happens when $h_D(x)$ is obtained from $A_k(D)$ (the K-NN algorithm with the desired $k$). We will repeat this process of training 5 classifiers for $k = 1, 10, 30$. Each dataset has 30 points: 10 from the red class, 20 from the blue.
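
Note (optional illustration): the disagreement counts described above can also be tallied in code. The following is a minimal Python sketch on synthetic data; the dataset generator, the linear rule standing in for $\bar{y}(x)$, and all constants are invented for illustration and are not the datasets of figure 2. It trains a 1-NN classifier on each of five sampled datasets, estimates $\bar{h}(x)$ by majority vote, and counts how often $h_D(x)$ disagrees with $\bar{h}(x)$ versus how often $\bar{h}(x)$ disagrees with $\bar{y}(x)$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample_dataset(n_red=10, n_blue=20):
    """Hypothetical stand-in for one sampled dataset D: 10 red, 20 blue points (exercise, age)."""
    red = np.column_stack([rng.normal(60, 15, n_red), rng.normal(70, 8, n_red)])
    blue = np.column_stack([rng.normal(150, 30, n_blue), rng.normal(50, 10, n_blue)])
    X = np.vstack([red, blue])
    y = np.array([1] * n_red + [0] * n_blue)
    return X, y

def y_bar(X):
    """Stand-in for the hand-drawn line from part 1 (an arbitrary linear rule)."""
    exercise, age = X[:, 0], X[:, 1]
    return (age - 0.3 * exercise > 40).astype(int)

datasets = [sample_dataset() for _ in range(5)]
classifiers = [KNeighborsClassifier(n_neighbors=1).fit(X, y) for X, y in datasets]

lhs = rhs = 0                                     # LHS ~ variance proxy, RHS ~ bias proxy
for i, (X, _) in enumerate(datasets):
    preds = np.array([clf.predict(X) for clf in classifiers])  # all five h_D evaluated on D_i
    h_bar = (preds.sum(axis=0) >= 3).astype(int)               # majority vote = estimate of h_bar(x)
    lhs += np.sum(preds[i] != h_bar)              # how often h_{D_i} disagrees with h_bar on D_i
    rhs += np.sum(h_bar != y_bar(X))              # how often h_bar disagrees with y_bar on D_i

print("disagreements h_D vs h_bar (variance-like):", lhs)
print("disagreements h_bar vs y_bar (bias-like):  ", rhs)
```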

2. (2) Let's start with $k = 1$. For your reference, we have given the Voronoi cell boundaries for all 5 datasets in figure 3. Refer to them in your submission to state and justify whether each statement below is true or false.

(a) For most datasets, there are blue and red regions both above and below the line you
drew in part 1.
(b) The 1-NN will correctly predict the class for all train points.
Takeaway: This shows whether the classifier is overfitting or underfitting.

Figure 1: Patients with the heart condition (red) and with a healthy heart (blue), plotted against Age (y-axis) and Minutes of Exercise Weekly (x-axis)

Figure 2: Datasets used to train the K-NN classifiers

Figure 3: 1-NN boundaries for given datasets

3. (3) Show, for $k = 1$ and the datasets given, that the variance is higher than the bias. Specifically, show that
\[
\sum_{D} \sum_{(x,y)\in D} \mathbb{I}\big[h_D(x) \neq \bar{h}(x)\big] \;>\; \sum_{D} \sum_{(x,y)\in D} \mathbb{I}\big[\bar{h}(x) \neq \bar{y}(x)\big], \qquad \text{where } h_D = A_1(D)
\]
and $\mathbb{I}$ is the indicator function (in words: you need to show that $h_D(x)$ disagrees with $\bar{h}(x)$ more often than $\bar{h}(x)$ disagrees with $\bar{y}(x)$). You will not need to sum over all points; instead, pick a few points per dataset to show that the RHS is low while the LHS is high.
Hint: focus on the outliers in each dataset, because all of $h_D(x)$, $\bar{h}(x)$, $\bar{y}(x)$ will agree for the other points.
4. (2) Now let's try $k = 30$. Notice that $k = |D|$. First, for each dataset in figure 2, draw the 30-NN classification boundary by shading the area of the plot red and blue based on its prediction. If you are not submitting a colored submission, shade the red area and leave the blue area unshaded.
5. (2) By looking at your answers in part 4, state and justify whether the following statements
are true or false.

(a) For most datasets, there are blue and red regions both above and below the line you
drew in part 1.
(b) The 30-NN will correctly predict the class for all train points.

6. (3) Show, for $k = 30$ and the datasets given, that the bias is higher than the variance. Specifically, show that
\[
\sum_{D} \sum_{(x,y)\in D} \mathbb{I}\big[h_D(x) \neq \bar{h}(x)\big] \;<\; \sum_{D} \sum_{(x,y)\in D} \mathbb{I}\big[\bar{h}(x) \neq \bar{y}(x)\big], \qquad \text{where } h_D = A_{30}(D)
\]
and $\mathbb{I}$ is the indicator function (in words: you need to show that $h_D(x)$ disagrees with $\bar{h}(x)$ less often than $\bar{h}(x)$ disagrees with $\bar{y}(x)$). Again, you will not need to manually sum over all points in this part to show that the LHS is low and the RHS is high for each dataset.

7. Finally, let's look at $k = 5$. For each dataset in figure 2, draw the best approximation of the classification boundaries that a 5-NN would make. You don't have to be exact; a hand-drawn shading will do. (For one way to visualize such boundaries programmatically, see the plotting sketch after this problem.)

8. (3) Simply by eyeballing, what conclusions can you make about the 5-NN classifiers? Is their variance lower than that of 1-NN? Is their bias lower than that of 30-NN? No need to justify. Refer to your answer in part 7.
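
Note (optional illustration): one way to sanity-check hand-drawn boundaries like those in parts 4 and 7 is to evaluate a K-NN classifier on a dense grid and shade its predictions, as in the Python sketch below. The 30-point dataset here is synthetic and only stands in for a dataset from figure 2; the axis ranges and colors are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 30-point dataset (10 red / 20 blue), standing in for one dataset from figure 2.
rng = np.random.default_rng(0)
red = np.column_stack([rng.normal(60, 15, 10), rng.normal(70, 8, 10)])     # exercise, age
blue = np.column_stack([rng.normal(150, 30, 20), rng.normal(50, 10, 20)])
X = np.vstack([red, blue])
y = np.array([1] * 10 + [0] * 20)

# Evaluate each classifier on a dense grid and shade its predicted regions.
xx, yy = np.meshgrid(np.linspace(0, 250, 300), np.linspace(20, 100, 300))
grid = np.column_stack([xx.ravel(), yy.ravel()])

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)
for ax, k in zip(axes, [1, 5, 30]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = clf.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, levels=[-0.5, 0.5, 1.5], colors=["lightblue", "lightcoral"])
    ax.scatter(X[:, 0], X[:, 1], c=np.where(y == 1, "red", "blue"), edgecolor="k")
    ax.set_title(f"{k}-NN decision regions")
    ax.set_xlabel("Minutes of Exercise Weekly")
axes[0].set_ylabel("Age")
plt.tight_layout()
plt.show()
```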

Problem 2: Regressi-knn [15 points]
Enough eye-balling. We will now understand the relation between $k$ in the K-NN algorithm and its error terms mathematically.
We will be using K-NN for a regression task, since it is easiest to do the derivations for regression (and the mean squared error loss).
Suppose we have data generated by a model $y_i = f(x_i) + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. random variables with $\mathbb{E}[\varepsilon_i] = 0$ and $\mathrm{Var}[\varepsilon_i] = \sigma^2$. Denote $D$ as the training set. The expected prediction error at a single $x$ is
\[
\mathrm{EPE}_k(x) = \mathbb{E}_{D,(x,y)}\big[(y - h_k(x))^2\big],
\]
where $y = f(x) + \varepsilon$. (Here, $\varepsilon$ is also i.i.d. and from the same distribution as the $\varepsilon_i$.) For simplicity, we assume that the values of $x_i$ and $x$ in the training sample are fixed in advance (nonrandom), while the values of $y_i$ and $y$ are random variables as defined. In the specific K-NN regression model,
\[
h_k(x) = \frac{1}{k} \sum_{l=1}^{k} y_{(l)} = \frac{1}{k} \sum_{l=1}^{k} \big(f(x_{(l)}) + \varepsilon_{(l)}\big),
\]

where $x_{(l)}$ is the $l$-th closest point to $x$ in $D$.


Decompose $\mathrm{EPE}_k(x)$ into three components: variance, noise, and bias. Each term should be represented in terms of $x_{(1)}, \cdots, x_{(l)}$, $x$, $\sigma$, and $f$. Using your expression, argue that the variance will drop as $k$ is increased.
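
Note (optional illustration): before doing the algebra, it can help to see the decomposition numerically. The Python sketch below fixes the training inputs $x_i$, repeatedly redraws the noise $\varepsilon_i$ to simulate new datasets $D$, refits the K-NN regressor at a single test point, and estimates the variance and squared-bias terms for several $k$. The choices of $f$, $\sigma$, and the $x_i$ are arbitrary assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design: the x_i are nonrandom; only the noise (and hence the y_i) is redrawn each trial.
n, sigma = 100, 0.5
xs = np.sort(rng.uniform(0, 1, n))        # fixed training inputs x_1, ..., x_n
x0 = 0.5                                  # the single test point x

def f(x):
    return np.sin(2 * np.pi * x)          # arbitrary "true" function, chosen only for illustration

def knn_predict(x, xs, ys, k):
    """h_k(x): average the y-values of the k training points closest to x."""
    idx = np.argsort(np.abs(xs - x))[:k]
    return ys[idx].mean()

for k in [1, 5, 25, 75]:
    preds = []
    for _ in range(2000):                 # each draw of the noise plays the role of a new dataset D
        ys = f(xs) + rng.normal(0, sigma, n)
        preds.append(knn_predict(x0, xs, ys, k))
    preds = np.array(preds)
    h_bar = preds.mean()                  # Monte Carlo estimate of the expected predictor at x0
    variance = preds.var()
    bias_sq = (h_bar - f(x0)) ** 2        # here y_bar(x0) = f(x0), since E[eps] = 0
    print(f"k={k:3d}  variance={variance:.4f}  bias^2={bias_sq:.4f}  noise=sigma^2={sigma**2:.4f}")
```

The printed variance should shrink as $k$ grows while the squared bias grows, which is the tradeoff the parts below ask you to reason about.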

1. (2) Let's start off by finding $\bar{y}(x) = \mathbb{E}_{y|x}[y(x)]$.

2. (Bonus) Let $\bar{h}_k(x)$ be the expected classifier. What can we say about $\mathbb{E}[\bar{h}_k(x) - y(x)]$?

3. (5) Prove that
\[
\mathrm{EPE}_k(x) = \mathbb{E}_{D,(x,y)}\big[(y - h_k(x))^2\big]
\]
can be reduced to
\[
\mathrm{EPE}_k(x) = \mathbb{E}_{D,(x,y)}\big[(h_k(x) - \bar{h}_k(x))^2\big] + \mathbb{E}_{D,(x,y)}\big[(\bar{h}_k(x) - \bar{y}(x))^2\big] + \mathbb{E}_{D,(x,y)}\big[(\bar{y}(x) - y(x))^2\big].
\]
Identify which of these terms corresponds to the bias, which to the noise, and which to the variance.

4. (8) Can you simplify the terms further by representing them in terms of $x_{(1)}, \ldots, x_{(l)}$, $x$, $\sigma$, and $f$?

Problem 3: Overfitting/Underfitting [6 points]
Which of the following strategies can be used when overfitting / underfitting happens?

strategy                      | overfitting | underfitting
------------------------------|-------------|-------------
increase the regularization   |             |
decrease the regularization   |             |
use fewer features            |             |
use more features             |             |
use a more complex model      |             |
use a less complex model      |             |

Problem 4: Regularization Mitigates Overfitting [15 points]


In this question, we are going to investigate how adding $\ell_2$ regularization can help mitigate the effect of overfitting for ordinary least squares regression. First, recall that in our notes for lecture 11, we mention that we can rewrite the objective function of $\ell_2$-regularized least squares regression (or ridge regression)
\[
\min_{\vec{w}} \; \sum_{i=1}^{n} (\vec{w}^{\,T} \vec{x}_i - y_i)^2 + \lambda \|\vec{w}\|_2^2
\]
as
\[
\min_{\vec{w}} \; \sum_{i=1}^{n} (\vec{w}^{\,T} \vec{x}_i - y_i)^2 \quad \text{subject to} \quad \|\vec{w}\|_2^2 \le B.
\]
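
Note (optional illustration): the link between the penalized and constrained forms can be seen numerically: as $\lambda$ grows, the norm of the penalized solution shrinks, so each $\lambda$ implicitly selects some budget $B$. The Python sketch below uses the closed-form ridge solution on arbitrary random data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))                      # arbitrary design matrix (rows are x_i^T)
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(0, 0.1, n)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # closed-form minimizer of sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w||_2^2 = {w @ w:.4f}")
```

Each printed norm could serve as the budget $B$ under which the same $\vec{w}$ also solves the constrained form; that correspondence is what the rewriting above asserts.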
To simplify our analysis, we are going to focus on the second expression. In addition, we are going to assume the following:

(i) Each data point $(\vec{x}_i, y_i)$ is drawn identically and independently from the distribution $P$; namely, the dataset $D \sim P^n$.

(ii) For any $(\vec{x}, y)$ sampled from $P$, we have $\|\vec{x}\|_2^2 = 1$.

With the above assumptions, we are going to do the following:

1. Notice that $\vec{w}(D)$ (the minimizer of the constrained problem above) is a function of $D$, and since $D$ is random, so is $\vec{w}(D)$. Define $\bar{w} = \mathbb{E}_D[\vec{w}(D)]$. Show that
\[
\|\vec{w}(D) - \bar{w}\|_2^2 \le 4B^2
\]
using the triangle inequality
\[
\|a - b\|_2 \le \|a\|_2 + \|b\|_2.
\]

2. Define the model $h_D(\vec{x}) = \vec{w}(D)^T \vec{x}$ and $\bar{h}(\vec{x}) = \mathbb{E}_D[h_D(\vec{x})]$. Show that the variance of the model satisfies
\[
\mathbb{E}_{\vec{x},D}\big[(h_D(\vec{x}) - \bar{h}(\vec{x}))^2\big] \le 4B^2
\]
by first showing that
\[
h_D(\vec{x}) - \bar{h}(\vec{x}) = (\vec{w}(D) - \bar{w})^T \vec{x}
\]

and then using the Cauchy-Schwarz inequality
\[
(a^T b)^2 \le (a^T a)(b^T b)
\]
to conclude the result.

Takeaway: By adding regularization, we essentially bound the variance of the model, which reduces overfitting.
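
Note (optional illustration): as a rough empirical companion to this takeaway, the Python sketch below refits ridge regression on many independently sampled datasets and estimates the model variance $\mathbb{E}_{\vec{x},D}[(h_D(\vec{x}) - \bar{h}(\vec{x}))^2]$ for a weak and a strong regularizer; the stronger penalty gives the smaller estimated variance. All data are synthetic and arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, trials = 5, 20, 200, 500
w_true = rng.normal(size=d)

def unit_rows(m):
    """Sample m points with ||x||_2 = 1, matching assumption (ii)."""
    X = rng.normal(size=(m, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X_test = unit_rows(n_test)                       # fixed test points used to approximate E over x

for lam in [0.01, 10.0]:
    preds = np.empty((trials, n_test))
    for t in range(trials):                      # each trial draws a fresh dataset D ~ P^n
        X = unit_rows(n_train)
        y = X @ w_true + rng.normal(0, 0.5, n_train)
        # closed-form ridge solution for the penalized objective with weight lam
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        preds[t] = X_test @ w                    # h_D(x) evaluated on the test points
    h_bar = preds.mean(axis=0)                   # estimate of h_bar(x)
    variance = ((preds - h_bar) ** 2).mean()     # estimate of E_{x,D}[(h_D(x) - h_bar(x))^2]
    print(f"lambda = {lam:5.2f}   estimated model variance = {variance:.4f}")
```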
