ML Midsem 2018 Solutions

Machine Learning, Monsoon 2018
Mid-Semester Exam: CSE/ECE 343/543
22/9/2018
Time Limit: 120 Minutes
Name:                                   Roll No:

Instructions:
Please do not plagiarize. It will be dealt with very strictly!
Try to answer all questions. The last question is Extra Credit. Try to make the most of it.
Good Luck!

1. (20 points) Answer the following True/False questions. An answer is true if it is always true.
You have to justify your answer to get full credit.
(a) (2 points) True or False?
Let A ∈ R^{m×n} and B ∈ R^{n×p} be two arbitrary matrices.
(i) rank(AB) ≥ min(rank(A), rank(B)). (ii) If m = n, rank(A) = rank(A^{-1}).
(b) (2 points) As the number of samples increases, the MAP and MLE estimates become the same.
(c) (2 points) Naive Bayes can only classify linearly separable data.
(d) (2 points) Naive Bayes can handle only discrete valued variables.
(e) (2 points) In logistic regression, there exists a closed-form solution for the parameters that
maximize the conditional log likelihood.
(f) (2 points) Given a linearly separable data, the margin of the decision boundary produced
by SVM will always be greater than or equal to the margin of the decision boundary
produced by any other hyperplane that perfectly classifies that data.
(g) (2 points) If my training error is too high, I should collect more training data to resolve
this problem.
(h) (2 points) If my model is overfitting, I should collect more data to fix it.
(i) (2 points) If my features/observations/input variables are all independent, it would not
matter whether I use a Bayes Classifier or a Naive Bayes classifier.
(j) (2 points) Normalizing data is for kids. Real experts work with raw data!
Solution:
(a) I. False; since the column vectors of AB are all linear combinations of the column vectors
of A, the rank of AB is upper bounded by rank(A). Similarly, since the row vectors of AB
are linear combinations of the row vectors of B, the rank of AB is also upper bounded by
rank(B). Therefore, rank(AB) ≤ min(rank(A), rank(B)).
II. True; if the inverse exists, then the matrix A is nonsingular and therefore rank(A) =
rank(A^{-1}) = m = n.
(b) True. In MAP estimation, we also consider the prior term, in contrast to MLE. In the
log-posterior expression, the log-likelihood is a summation over all n data points, while the
log-prior term depends only on the parameters. As n → ∞, the log-prior becomes insignificant
compared to the likelihood term, so the two estimates converge.
(c) False; In general, Naive Bayes classifier is not linear, as the decision boundary depends
on underlying probability distribution. However, there may be distributions that result in
linear boundaries, e.g., exponential family distributions.

(d) False; Again, as Naive Bayes classifiers rely on underlying probability distributions, they
can handle both Discrete and Continuous valued variables.
(e) False; Because of the logit transformation, there is no closed-form solution for maximizing
the log-likelihood for logistic regression. Usually iterative optimization techniques like
gradient descent are used for fitting the model.
(f) True; since the SVM objective maximizes the minimum distance of the points from the
decision boundary, its margin will be greater than or equal to that of any other hyperplane
that perfectly classifies the given training data.
(g) False; if the training error is already too high, adding more training data will not reduce
it (and will likely increase it). High training error may mean that the learning algorithm was
not run long enough, or that the model does not have enough capacity, e.g., fitting a line to
data that lies along a higher-order polynomial.
(h) True; adding more (diverse) training data should help. Other strategies, such as
regularization, may also help mitigate overfitting.
(i) True; the naive assumption in the Naive Bayes classifier is the independence of the input
variables (given the class). If the variables are actually independent, the Bayes classifier and
the Naive Bayes classifier behave the same.
(j) False; normalizing the data is good practice, as it makes training less sensitive to scale
variations in the input. For example, when minimizing the mean squared error without
normalization, a small-magnitude variable that is highly informative about the output can be
completely overshadowed by a large-magnitude input variable that has little correlation with
the output variable.
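As an illustration of the point in (j), here is a minimal standardization (z-scoring) sketch. The NumPy usage and the toy feature matrix are assumptions added for illustration, not part of the exam.

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma

# A small-magnitude but informative feature next to a large-magnitude one.
X = np.array([[0.01, 1000.0],
              [0.02,  950.0],
              [0.03, 1100.0]])
print(standardize(X))                # both columns are now on a comparable scale
```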

2. (20 points) Consider the following simple linear regression model.

y = wx + ϵ

where y is the sum of a deterministic linear function of x and some noise ϵ. x and y are the
real-valued input and output respectively, and w is the real-valued parameter to be learned.
ϵ ∼ N(0, σ) is a Gaussian random variable representing the noise.
(a) (5 points) Write the expression for the probability distribution of y (i.e., p(y|w, x)) in terms
of N(·, ·), w, σ, x.
(b) (5 points) Suppose we are given m i.i.d. training samples {(y^1, x^1), ..., (y^m, x^m)}, with
Y = {y^1, ..., y^m} and X = {x^1, ..., x^m}. Derive an expression for the conditional data
likelihood (i.e., P(Y|X, w)).
(c) (5 points) Using the conditional data likelihood, derive an expression for the gradient
descent learning rule.
(d) (5 points) Suppose we place a Laplacian prior p(w) ∝ e^{−λ|w|} (λ > 0) over w.
What kind of regularization does it impose on your regular regression problem? Can you
derive the expression for the log posterior? {Hint: Start with MLE, use Bayes rule and
apply log, simple!}
Solution:

(a) y = wx + ϵ, ϵ ∼ N(0, σ) (1 mark)

Therefore y follows the distribution
p(y|x; w) = N(f(x), σ), where f(x) = wx (more generally, f(x) = w_0 + Σ_i w_i x_i) (2 marks)
p(y|x; w) = (1/√(2πσ²)) exp( −(1/2) ((y − f(x))/σ)² ) (2 marks)
(b) From the above expression and the i.i.d. assumption,

P(Y|X; w) = p(y_1, y_2, ..., y_m | x_1, x_2, ..., x_m; w) = Π_i p(y_i | x_i; w) (2 marks)

ln P(Y|X; w) = Σ_i ln p(y_i | x_i; w) (1 mark)
∝ −Σ_i (y_i − f(x_i; w))²  (dropping terms that do not depend on w) (2 marks)
(c) For gradient descent we aim for the maximum conditional likelihood estimate:

w_MCLE = argmax_w Σ_i −(y_i − f(x_i; w))² (0.5 mark)
       = argmin_w Σ_i (y_i − f(x_i; w))² (0.5 mark)

∂/∂w_j Σ_i (y_i − f(x_i; w))² = Σ_i 2(y_i − f(x_i; w)) ∂(y_i − f(x_i; w))/∂w_j (2 marks)
                              = Σ_i −2(y_i − f(x_i; w)) ∂f(x_i; w)/∂w_j (1 mark)

Since f(x) = w_0 + Σ_j w_j x^j, the gradient update rule will be

w_j ← w_j + η Σ_i (y_i − f(x_i; w)) x_i^j (2 marks)

Here the subscript i denotes the sample number and the superscript j denotes the j-th element
of the vector x or w. (A short code sketch of this update is given below.)
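A minimal sketch of the update rule derived in part (c), written for the multivariate case f(x) = w_0 + Σ_j w_j x^j. The learning rate, iteration count, and synthetic data are assumptions added for illustration.

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """w_j <- w_j + eta * sum_i (y_i - f(x_i; w)) * x_i^j, with x^0 = 1 for the bias."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = y - Xb @ w                       # y_i - f(x_i; w)
        w += eta * Xb.T @ residual                  # batch update over all samples
    return w

# Tiny synthetic check: y = 2 + 3x plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 0.1, size=100)
print(gradient_descent(X, y))                       # roughly [2, 3]
```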

(d) y ∼ N(f(x; w), σ), with the Laplacian prior p(w) ∝ e^{−λ|w|} over the parameter.

From Bayes rule we know that
p(w|y, x) = p(y|x, w) p(w) / p(y|x)
⇒ p(w|y, x) ∝ p(y|x, w) p(w) (1 mark)
= (1/√(2πσ²)) exp( −(1/2)((y − f(x; w))/σ)² ) · e^{−λ|w|} (1 mark)
Taking the log, (1 mark)
ln p(w|y, x) = −(1/(2σ²))(y − f(x; w))² − λ|w| + const (1 mark)
∝ −(y − f(x; w))² − λ'|w|, with λ' = 2σ²λ,

which is the expression for Lasso/L1 regularization. (1 mark)
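A hedged sketch of how the Laplacian prior changes the optimization in part (d): the MAP objective adds a λ‖w‖₁ penalty, handled here with a crude subgradient term λ·sign(w). The function name, default values, and the choice not to penalize the bias are assumptions for illustration.

```python
import numpy as np

def lasso_gradient_descent(X, y, lam=1.0, eta=0.005, n_iters=2000):
    """Minimize sum_i (y_i - f(x_i; w))^2 + lam * ||w||_1 via (sub)gradient descent."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = y - Xb @ w                       # y_i - f(x_i; w)
        subgrad = lam * np.sign(w)                  # subgradient of the L1 penalty
        subgrad[0] = 0.0                            # conventionally the bias is not penalized
        w -= eta * (-2 * Xb.T @ residual + subgrad)
    return w
```

In practice the non-smooth |w| term is usually handled with a proximal (soft-thresholding) update rather than a raw subgradient; the sketch only shows where the λ|w| penalty from the log-posterior enters the gradient.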

3. (20 points) Answer the following.


(a) (4 points) Write mathematical formulation of one performance evaluation metric for each
of the following tasks.
(i) Classification, (ii) Regression

(b) (4 points) What are the drawbacks of a holdout set? Also, suggest an alternative method.
(c) (4 points) How is K-fold cross-validation different from leave-one-out cross-validation?
(Write any two differences.) If the data is sparse, which cross-validation method would be
beneficial?
(d) (4 points) Write two pros and cons of choosing (i) a large number of folds, (ii) a small number
of folds in a cross-validation task.
(e) (4 points) Explain how you would detect (i) high variance, (ii) high bias, in your learning
model. Write one solution each for fixing them.
Solution:
(a) For classification (2 marks):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = number of true positives, FP = number of false positives,
TN = number of true negatives, FN = number of false negatives.
For regression (2 marks):
Root Mean Square Error (RMSE) = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )
where y_i denotes the true value for the i-th data point and ŷ_i denotes the predicted value.
(A short sketch computing both metrics follows below.)
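A small sketch computing the two metrics defined in (a). The toy label and prediction arrays are made up for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN), i.e., the fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def rmse(y_true, y_pred):
    """sqrt( (1/n) * sum_i (y_i - yhat_i)^2 )"""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
print(rmse([3.0, 5.0], [2.0, 7.0]))           # sqrt((1 + 4) / 2) ~= 1.58
```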
(b) The disadvantage of using a holdout set is that the evaluation can have high variance, since
it may depend heavily on which data points end up in the training set and which end up
in the test set. Thus, the evaluation may differ significantly depending on the train-test
split. (2 marks)
K-fold cross-validation is one way to improve over the holdout method. (1 mark)
The advantage of this method is that it matters less how the data gets divided. Every
data point is in the test set exactly once and in the training set k − 1 times.
Since the validation error is averaged over k different validation folds (each with roughly
N/k samples), the variance of the error estimate is reduced. (1 mark)
(c) In leave-one-out cross validation, K = the number of data points (N), i.e., training is done
on all the data except for one point. It requires more computation time than K-fold cross
validation, since N training cycles are required. (1 mark each for the 2 differences)
Leave-one-out cross validation is better for sparse data. (1 mark)
Since the data is very sparse, leave-one-out CV allows us to train on as many samples as
possible. (1 mark) A minimal splitting sketch is given below.
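A minimal sketch of the K-fold splitting described in (b) and (c); leave-one-out CV is the special case k = n. The `fit` and `error` callables, the shuffling seed, and the trivial mean-predictor example are assumptions for illustration.

```python
import numpy as np

def k_fold_cv(X, y, k, fit, error, seed=0):
    """Average validation error over k folds; k = len(y) gives leave-one-out CV."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))

# Example with a trivial "predict the training mean" model.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X[:, 0] + 1
fit = lambda Xtr, ytr: float(np.mean(ytr))
error = lambda model, Xva, yva: float(np.mean((yva - model) ** 2))
print(k_fold_cv(X, y, k=5, fit=fit, error=error))
print(k_fold_cv(X, y, k=len(y), fit=fit, error=error))   # leave-one-out
```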
(d) Larger number of folds
Pros:
(a) The bias of the true error rate estimator will be small.
Cons:
(a) The variance of the true error rate estimator will be large.
(b) The computational time will be very large.
Smaller number of folds
Pros:
(a) The computation time is reduced, as there are fewer experiments (training runs).
(b) The variance of the estimator will be small, as each experiment has a larger number
of validation samples.

Cons:
(a) The bias of the estimator will be larger.
(e) High Variance
Detection (1 mark)
The errors over the training set and validation sets would be very different in case of high
variance. This may indicate overfitting of your model on your training set.
Solution (1 mark)
More training examples, smaller set of features. (any one)
High Bias
Detection (1 mark)
If the cross-validation and training errors are similar for a range of training set sizes, and
yet they are significantly larger than what is expected, then the model has high bias. It
indicates that the model is underfitting.
Solution (1 mark)
Larger set of features, or a higher capacity model.
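A rough sketch of the detection logic in (e), comparing training and validation errors against an expected error level. The thresholds are arbitrary assumptions, not prescribed by the solution.

```python
def diagnose(train_err, val_err, target_err):
    """Crude bias/variance check from error estimates (thresholds are arbitrary)."""
    if train_err <= target_err and val_err > 1.5 * train_err:
        return "high variance: low training error but much higher validation error (overfitting)"
    if train_err > target_err and val_err < 1.5 * train_err:
        return "high bias: training and validation errors both high and close (underfitting)"
    return "no obvious bias/variance problem"

print(diagnose(train_err=0.02, val_err=0.15, target_err=0.05))   # high variance
print(diagnose(train_err=0.20, val_err=0.22, target_err=0.05))   # high bias
```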

4. (20 points) Answer the following


(a) (5 points) Explain why the most probable label for the training sample (x1 , x2 , ..., xm ) is
the label l that maximizes the following:

P (X1 = x1 , X2 = x2 , ..., Xm = xm |L = l)P (L = l)

(b) (5 points) In the context of Naive Bayes training, what is the concept of smoothing and
why do we need it? {Hint: What happens if one of the classes has zero training samples?
}
(c) (6 points) True or False? Justify your answer.
(i) In a nearest neighbor classifier, Euclidean distance and squared Euclidean distance are
equivalent.
(ii) The 3-nearest neighbor classifier is always more accurate than the 2-nearest neighbor
classifier.
(iii) With sufficient training data, the error of a nearest neighbor classifier always goes
down to zero.
(d) (2 points) Give scenarios where you would prefer to use k-nearest neighbors instead of
Support Vector Machines.
(e) (2 points) k-nearest neighbors is a nonparametric classifier because you need to retain all
the training samples in order for it to work well. If I say SVMs are not nonparametric,
would I be correct or wrong? Justify your answer.
Solution:

(a) The most probable label for the sample X = (x_1, x_2, ..., x_m) is the one that maximizes
P(L = l | X = (x_1, x_2, ..., x_m)). Now,

P(L = l | X = (x_1, x_2, ..., x_m)) = P(L = l | X_1 = x_1, X_2 = x_2, ..., X_m = x_m) (1)

= P(X_1 = x_1, X_2 = x_2, ..., X_m = x_m | L = l) P(L = l) / P(X_1 = x_1, ..., X_m = x_m) (Bayes rule) (2)

Since we are maximizing P(L = l | X = (x_1, ..., x_m)) over l, and the denominator
P(X_1 = x_1, ..., X_m = x_m) is the same for every l, it does not affect the argmax (although
it affects the value of the maximum). Thus, the l that maximizes
P(X_1 = x_1, ..., X_m = x_m | L = l) P(L = l) also maximizes the posterior in (2), and hence
is the most probable label.
(b) In the context of Naive Bayes training, smoothing corresponds to adding "virtual counts"
when estimating the probabilities. Normally,

P(X_i = u | Y = y) = Count(X_i = u ∧ Y = y) / Count(Y = y) (3)

With add-one (Laplace) smoothing, this changes to:

P(X_i = u | Y = y) = (Count(X_i = u ∧ Y = y) + 1) / (Count(Y = y) + number of possible values of X_i) (4)

Smoothing is required in Naive Bayes training because we may not observe a particular feature
value for a particular class in the training set (the training set is not exhaustive, so it need
not contain every possible feature value). Naive Bayes without smoothing then estimates
P(X_i = u | Y = y) = 0 and assigns zero probability to class y for any sample with X_i = u.
The situation is worse when a particular feature value has not been seen at all (say, with
integer-valued data): if no training sample has X_i = u, Naive Bayes without smoothing assigns
zero probability to every class, which does not give a useful decision. Instead, the classifier
should fall back on the other feature values in such a case, and smoothing ensures exactly
this. (A small code sketch of the smoothed estimate follows below.)
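A minimal sketch of the smoothed estimate in (4) for a single discrete feature. The toy data and the function name are assumptions for illustration.

```python
def smoothed_likelihood(x_values, y_values, u, y, n_feature_values):
    """P(X = u | Y = y) with add-one (Laplace) smoothing."""
    joint = sum(1 for xv, yv in zip(x_values, y_values) if xv == u and yv == y)
    class_count = sum(1 for yv in y_values if yv == y)
    return (joint + 1) / (class_count + n_feature_values)

# Feature value 'c' never co-occurs with class 1, yet gets a non-zero probability.
x = ['a', 'b', 'a', 'c']
labels = [1, 1, 1, 0]
print(smoothed_likelihood(x, labels, 'c', 1, n_feature_values=3))   # (0 + 1) / (3 + 3) ~= 0.17
```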
(c) (i) True. A nearest neighbor classifier only compares distances to decide which points are
closer and which are farther; it does not use the magnitude by which a point is closer or
farther. Since distances are non-negative, x > y ⟺ x² > y² (for x, y ≥ 0), so the ranking of
neighbors is identical under Euclidean and squared Euclidean distance.
(ii) False. The 3-nearest neighbor classifier is not always more accurate than the 2-nearest
neighbor classifier. As an example, consider a test point (red) that clearly belongs to the
yellow class: its nearest neighbor C is yellow, while the next two neighbors A and B are
purple. The 2-nearest neighbor classifier finds A and C as nearest neighbors and, breaking
the tie by distance, classifies the point as yellow. But the 3-nearest neighbor classifier finds
A, B, C as nearest neighbors and thus classifies it into the purple class.
(iii) False. Since the k-nearest neighbor classifier is distance based, when data has distri-
butions with some overlap across classes, k-NN may still make classification mistakes.
(d) Low-dimensional feature spaces, where classes have multi-modal distributions and non-
linear boundaries.
(e) Linear SVMs are parametric classifiers: once the parameters w and b are learned, the entire
training data can be discarded, and, given the dimensionality of the data, the number of
parameters is fixed. For kernelized SVMs, on the other hand, the number of parameters (the
number of support vectors) required to define the decision rule depends on the training data;
if you change the training set, the number of support vectors may change, implying that the
number of parameters is not fixed. Therefore kernelized SVMs are non-parametric.

5. (10 points) {Extra Credit}: There is a heavy traffic jam at Connaught Place. There is bumper-
to-bumper traffic and cars all around the outermost circle are at a standstill. Somehow Google
leaks the GPS locations of all these cars. The typical noise in phone based GPS measurements
would be around 10m. Google wants you to estimate the circumference of the outermost circle
of Connaught Place (Trump asked them not to use other resources, but ask IIITD students
to solve it). Since you have only done linear regression, you need to apply linear regression
for this problem. First of all, can you apply linear regression? What would you need to do?
Explain the entire training and validation pipeline.
Solution:

Since we are given the latitude (X) and longitude (Y) of all the vehicles in the traffic jam, we can
mean-normalize both X and Y. Taking (0, 0) as the (approximate) centre, we can transform the
feature space by including X² and Y². Now we have:

aX² + bY² + cX + dY = R²

where R is the radius of the outermost circle, i.e., the distance from the centre of CP to a car
on that circle. Treating X², Y², X and Y as input features, this equation is linear in the unknown
coefficients, so we can fit it with linear regression, recover R, and report the circumference 2πR.
The noisy GPS points can be split into training and validation sets to check the quality of the fit.
However, we can solve this problem in multiple other ways.
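A hedged sketch of one such pipeline: fit the general circle equation x² + y² = c·x + d·y + e by ordinary least squares (treating x² + y² as the target), recover the centre and radius, and report the circumference. The synthetic GPS points, the 370 m radius, the 10 m noise level, and the simple holdout split are assumptions made for illustration.

```python
import numpy as np

def fit_circle(x, y):
    """Least squares fit of x^2 + y^2 = c*x + d*y + e; returns (center, radius)."""
    A = np.column_stack([x, y, np.ones_like(x)])
    t = x**2 + y**2
    (c, d, e), *_ = np.linalg.lstsq(A, t, rcond=None)
    center = np.array([c / 2, d / 2])              # since c = 2a, d = 2b for centre (a, b)
    radius = np.sqrt(e + center @ center)          # e = r^2 - a^2 - b^2
    return center, radius

# Synthetic "outermost circle" of cars: radius ~ 370 m, GPS noise sigma ~ 10 m.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
x = 370 * np.cos(theta) + rng.normal(0, 10, 500)
y = 370 * np.sin(theta) + rng.normal(0, 10, 500)

# Simple holdout split: fit on 80% of the points, check radial residuals on the rest.
split = 400
center, radius = fit_circle(x[:split], y[:split])
val_dist = np.hypot(x[split:] - center[0], y[split:] - center[1])
print("estimated circumference:", 2 * np.pi * radius)
print("validation RMSE of radial distance:", np.sqrt(np.mean((val_dist - radius) ** 2)))
```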
