
BITS F464

Machine Learning
Aditya Challa Quiz 1

Instructions

• This is a take-home quiz.

• The total marks for the quiz will be scaled to 10.

• You are required to submit your answers via quanta (local). The form to submit
your answers will be available on quanta shortly.

• The last date to submit your answers is 11:59 PM, 30th September 2024.

Exercise 1
1. One way to select a subset of features is naive (exhaustive) subset selection; forward and
backward selection are two popular approximations that reduce its computational complexity.
Another way is to use regularization schemes such as L1 and L2 regularization.
A dataset is provided to you in the files data_problem1.csv and labels_problem1.csv.
Using this data, determine which of the following features would be in the best subset
(features are indexed 0, 1, · · · , 199). Assume that you know exactly 100 features are relevant.

A 121
B 129
C 34
D 173
E 20
F 158

[3 Marks]
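A minimal sketch of the regularization route, assuming the two CSV files load directly into a 200-column feature matrix and a label vector; LassoCV and the top-100 cutoff are illustrative choices, not prescribed by the quiz:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Load the provided data; shapes assumed to be (n_samples, 200) and (n_samples,).
X = pd.read_csv("data_problem1.csv").values
y = pd.read_csv("labels_problem1.csv").values.ravel()

# L1 regularization drives irrelevant coefficients to exactly zero;
# cross-validation picks the regularization strength.
lasso = LassoCV(cv=5).fit(X, y)

# Rank features by absolute coefficient and keep the top 100,
# since the problem states exactly 100 features are relevant.
top100 = np.argsort(np.abs(lasso.coef_))[-100:]
print(sorted(top100))  # check which of the listed option indices appear
```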


Exercise 2
2. We discussed increasing the complexity of a model. Two students propose the following
ways to increase the complexity.
Assume that the data is in 2 dimensions and we are concerned with a classification problem.
Student A uses a random matrix to project the data to a higher dimension and then uses a
linear classifier. This implicitly increases the number of features, hence the number of
parameters in the model, and hence the complexity of the model.
Student B instead generates a lot of random features and concatenates them with the 2
original features, then uses a linear classifier. This also implicitly increases the number of
features, hence the number of parameters in the model, and hence the complexity of the model.
Which of the following statements are true? The comparison is with respect to the original
model with 2 features.
Remark: The words “reduces” and “increases” are used to mean ≤ and ≥ respectively, i.e.,
equality is allowed in both cases.

(A) Training error reduces with Student A’s approach.
(B) Training error increases with Student A’s approach.
(C) Test error reduces with Student A’s approach.
(D) Test error increases with Student A’s approach.
(E) Training error reduces with Student B’s approach.
(F) Training error increases with Student B’s approach.
(G) Test error reduces with Student B’s approach.
(H) Test error increases with Student B’s approach.

[8 Marks]
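For concreteness, a minimal sketch of the two constructions on a stand-in 2-D dataset; make_classification, the 50-dimensional target, and the logistic-regression classifier are illustrative assumptions, not part of the quiz:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Student A: project the 2-D data to a higher dimension with a random matrix.
P = rng.standard_normal((2, 50))
X_a = X @ P  # every new feature is a linear combination of the 2 originals

# Student B: concatenate the original features with pure random noise features.
X_b = np.hstack([X, rng.standard_normal((X.shape[0], 48))])

# Fit the same linear classifier on both constructions and compare.
for name, X_new in [("A", X_a), ("B", X_b)]:
    clf = LogisticRegression(max_iter=1000).fit(X_new, y)
    print(name, "training accuracy:", clf.score(X_new, y))
```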


Exercise 3
3. Consider the following kernel construction. Given the set of data points $\{x_i\}$, construct a
complete graph on all the points, where the edge weight between $x_i$ and $x_j$ is $\exp(-\|x_i - x_j\|^2/\sigma^2)$.
Then define the kernel as

$$K(x, x') = \max_{\pi \in \Pi(x, x')} \; \min_{(x_i, x_j) \in \pi} \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right)$$

where $\Pi(x, x')$ is the set of all paths between $x$ and $x'$, and $(x_i, x_j) \in \pi$ ranges over the edges of the path $\pi$.
Also set $K(x, x) := 1$.
Answer the following questions:

(a) State TRUE/FALSE. The kernel is symmetric, i.e., K(x, x′) = K(x′, x).
(b) State TRUE/FALSE. The kernel is positive definite.
(c) State TRUE/FALSE. The boundary obtained using this kernel changes under the
transformation x → x + c for any vector c.
(d) State TRUE/FALSE. The boundary obtained using this kernel does not change
under the transformation x → Ax + b for matrix A and vector b.
(e) State TRUE/FALSE. The boundary obtained using this kernel does not change
under the transformation x → Ax + b for matrix A which is positive definite and
vector b.

Definitions/Hints:

1. If A is symmetric with positive entries and $A_{ij} \geq \min_k \{A_{ik}, A_{kj}\}$, then A is
positive definite. See https://www.math.kent.edu/~varga/pub/paper_199.pdf.
2. A complete graph with n vertices has n(n − 1)/2 edges, and every point is con-
nected to every other point.

[5 × 2 = 10 Marks]
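A minimal sketch of one way to evaluate this kernel: the max-min path value over a complete graph can be computed with a Floyd-Warshall-style widest-path recursion. The helper below and its name are illustrative, not course-provided code:

```python
import numpy as np
from scipy.spatial.distance import cdist

def max_min_kernel(X, sigma=1.0):
    """Gram matrix of the max-min path kernel over the complete graph on X."""
    # Edge weights of the complete graph: exp(-||xi - xj||^2 / sigma^2).
    W = np.exp(-cdist(X, X, "sqeuclidean") / sigma**2)
    K = W.copy()
    n = len(X)
    # Widest-path recursion: allowing vertex k as an intermediate stop can
    # only raise the bottleneck (minimum edge) weight of the best i-j path.
    for k in range(n):
        K = np.maximum(K, np.minimum(K[:, [k]], K[[k], :]))
    np.fill_diagonal(K, 1.0)  # the problem sets K(x, x) := 1
    return K
```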


Exercise 4
4. Implement the above kernel using the scikit-learn support vector machine library (which
allows user-defined kernels). Answer the following questions based on the results. You can use
the code in kernel_svm.ipynb as a starting point.
Remark 1: You are expected to tweak the hyperparameter (σ) to get the best results and
answer the following questions based on those results.
Remark 2: There is an important subtlety in the way the kernel is defined. Note that
K(x, x′) actually depends on the entire dataset! Hence, the implementation should be
adapted accordingly.
Which of the following statements are true?

(A) The best test accuracy on the make_moons dataset is 0 (assume a very large
number of samples; use n_samples = 1000 and noise = 0.01).
(B) The best test accuracy on the make_moons dataset is 1 (assume a very large
number of samples; use n_samples = 1000 and noise = 0.01).
(C) The best test accuracy on the make_circles dataset is 1 (assume a very large
number of samples; use n_samples = 1000 and noise = 0.01).
(D) The best test accuracy on the make_circles dataset is 0 (assume a very large
number of samples).
(E) Consider the make_blobs dataset: the kernel works well compared to a linear
kernel when n_samples is large and n_features is small.
(F) Consider the make_blobs dataset: the kernel works well compared to a linear
kernel when n_samples is small and n_features is large.

[12 Marks]
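A minimal sketch of wiring such a kernel into scikit-learn via kernel="precomputed", reusing the max_min_kernel helper sketched above (an assumption, not the course notebook). Because K(x, x′) depends on the entire dataset (Remark 2), the Gram matrix is built on all points first and then sliced into train/train and test/train blocks:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.01, random_state=0)
idx_train, idx_test = train_test_split(np.arange(len(X)), random_state=0)

# The kernel depends on the whole dataset, so compute the full Gram matrix
# once, then slice: SVC needs (n_train, n_train) to fit and (n_test, n_train)
# to predict.
K = max_min_kernel(X, sigma=0.5)  # helper from the sketch above (assumed)
K_train = K[np.ix_(idx_train, idx_train)]
K_test = K[np.ix_(idx_test, idx_train)]

clf = SVC(kernel="precomputed").fit(K_train, y[idx_train])
print("test accuracy:", clf.score(K_test, y[idx_test]))
```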


Exercise 5
5. Suppose we have domain knowledge that the ground-truth function is a finite sum of the
form $f(t) = \sum_{i=1}^{p} a_i \cos(2\pi \omega_i t)$, where each $a_i \in \mathbb{R}$ is fixed but unknown, and we know that each
$\omega_i$ belongs to the set of integers between 1 and K, i.e., $\{1, 2, \cdots, K\}$. Sampling from this
function gives the observations $\{(t_i, y_i = f(t_i))\}_{i=1}^{n}$. Assume $t_i$ is uniformly sampled from $[-1, 1]$.
Using the domain knowledge, let us construct the hypothesis class $\mathcal{H}_c$ (indexed by c) as
follows: $\mathcal{H}_c = \{h(t) = \sum_{i=1}^{c} a_i \cos(2\pi \omega_i t)\}$, where each $a_i$ is some real number and $\omega_i \in \{1, 2, \cdots, K\}$.
One way to look at this is as a regression problem. Since K is known, we can generate K
features by transforming $t_i$ to $(\cos(2\pi t_i), \cos(2\pi \cdot 2 t_i), \cdots, \cos(2\pi K t_i))$. We now have K
features, need to identify the values of the $a_i$, and can use the linear regression model to do so.
Which of the following statements are true?

(A) Since there are K unknowns $\{a_i\}$, we need at least K distinct (w.r.t. t) data
points to identify the values of $a_i$.
(B) Since there are K unknowns $\{a_i\}$ and there is no irreducible error, K distinct
(w.r.t. t) data points are sufficient to identify the values of $a_i$.

Note that there is no irreducible error in the ground-truth function; that is, if t is fixed, the
output is also fixed.
Assume that we are sampling t uniformly at random from the continuous distribution
U [−1, 1]. Then which of the following statements are true?

(C) Irrespective of the sample size n, the variance of our estimates is 0.
(D) If the sample size is n ≥ K, then the variance of our estimates is 0.
(E) If the sample size is n ≥ 4K + 4, then the variance of our estimates is 0.

Hint: You are required to use the bootstrap method to estimate the variance of your
estimates, playing with the parameters K, p, and n to see how the variance changes.
Use a log scale on the y-axis for better visualization. Use large values of a (on the order of
$10^9$) to see the effect of n on the variance more clearly. One question which arises while
experimenting is: when can you say the variance is 0? Due to precision errors in the
computation, this can be a tricky question. Take equally spaced samples in [−1, 1] and plot
the variance of your estimates as a function of n. Since it is the same sample each time, we
expect the variance to be 0, so you can check whether the variance you get is approximately 0 or not.

[10 Marks]
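A minimal sketch of the suggested bootstrap experiment; the values of K, p, n, and the $10^9$-scale amplitudes are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, p, n, n_boot = 5, 2, 50, 200

# Ground truth: f(t) = sum_i a_i cos(2*pi*omega_i*t), with frequencies drawn
# from {1, ..., K} and large amplitudes as the hint suggests.
omega = rng.choice(np.arange(1, K + 1), size=p, replace=False)
a = rng.uniform(1e9, 2e9, size=p)

def features(t):
    # The K cosine features: cos(2*pi*1*t), ..., cos(2*pi*K*t).
    return np.cos(2 * np.pi * np.outer(t, np.arange(1, K + 1)))

t = rng.uniform(-1, 1, size=n)
y = features(t)[:, omega - 1] @ a  # noiseless observations

# Bootstrap: refit least squares on resampled data, track the coefficient spread.
coefs = []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    coef, *_ = np.linalg.lstsq(features(t[idx]), y[idx], rcond=None)
    coefs.append(coef)
print("max coefficient variance:", np.var(coefs, axis=0).max())
```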
