
Quiz 1

Course Title: Machine Learning


Date of Examination: 10.09.2020

1. In your experiment, you wish to strongly penalize outliers. Consider the formula for mean
squared error, which is often used as a loss function. Is it a good idea to use “mean cubed
error” (consider power 3 instead of 2) as a loss function instead? (Yes/No).

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$

Ans: No, this is not a good idea. Although the cubic function is also differentiable and could, in theory, be plugged into our model, it would cause the following problems:

• The cubic function can return negative values as well, which may cancel out positive values when summing and lead to a spuriously low loss.
• Using a higher even power would also not be a good idea. Due to the higher degree, it may cause the model to overfit more (relative to the squared function), because even small deviations incur a large penalty. This is essentially a case of penalizing too strongly for it to do any good.
• Implementation constraints: because we are cubing the errors, we will reach the point of numerical overflow much sooner than with squaring.
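A quick numerical sketch (hypothetical values, assuming NumPy) illustrates the sign-cancellation problem with a cubic loss:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([3.0, 0.0, 5.0, 2.0])   # residuals: -2, +2, -2, +2

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)   # (4 + 4 + 4 + 4) / 4 = 4.0
mce = np.mean(residuals ** 3)   # (-8 + 8 - 8 + 8) / 4 = 0.0

print(mse, mce)   # 4.0 0.0 -- the cubic "loss" reports no error at all
```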

2. While training your model, you observe that the loss on the testing set is constantly lower than
the training set. What could be the possible justification for this?

(a) The model is not training well as the testing set loss should always be higher than training
set error.
(b) The model is training properly as the testing set loss should always be lower than training
set error.
(c) The model may be training properly. We cannot comment with certainty about whether
the training or testing set loss should be lower.
(d) The model is training properly. We should ignore the test set loss as it provides us no
useful information. We only need to ensure that the training set loss keeps decreasing.

Ans: C. We may have an unfortunate train-test split in which the test set happens to be easier, so the model makes better predictions on the testing set. However, the relative trends should still remain the same (both training and testing losses decrease until the stopping point; after that, training loss keeps decreasing while testing loss increases).


3. Consider we are trying to learn three different classifiers A, B, C on a given data set such that A
has high training as well as test accuracies, B has high training accuracy but low test accuracy,
while C has low training as well as test accuracies. Which one of the following statements is
correct? Justify your answer.

(a) A is overfitting whereas B is underfitting.
(b) A is overfitting whereas C is underfitting.
(c) B is overfitting whereas C is underfitting.
(d) B is underfitting whereas C is overfitting.
(e) None of the above

Ans: C. Overfitting refers to a model fitting the patterns in the training data so closely that they do not generalize to unseen data. Underfitting, on the other hand, refers to a model failing to capture even the desirable patterns in the data (due to limitations on its representational power). In the case above, A shows the expected trend, as its performance on unseen test data is high. B performs well on the training data but does not generalize, indicating a case of overfitting. C does not perform well even on the training data, suggesting the model is incapable of capturing the patterns in the data, a case of underfitting.

4. You are trying to minimize a convex objective function using gradient descent and the algorithm
has not converged even after completing 10,000 iterations. What might be the possible reasons
for this?

(a) The learning rate is too high
(b) The learning rate is too low
(c) The model may be stuck in a local minimum
(d) None of the above

Ans: A, B. The learning rate is an important hyperparameter that determines how fast and how stably the gradient descent algorithm converges to the global minimum. A very low learning rate means a very small step size and may require a significant amount of time for the algorithm to converge. A very high learning rate means a large step size; the algorithm may overshoot the minimum and never converge due to instability. Therefore, both A and B are correct. A convex function has no local minima distinct from the global minimum, so C is incorrect.
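A minimal sketch of both failure modes on the convex function f(θ) = θ² (illustrative values, plain Python):

```python
def gradient_descent(lr, n_iters=10000, theta0=5.0):
    """Minimize the convex function f(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(n_iters):
        theta -= lr * 2 * theta   # standard update: theta <- theta - lr * grad
    return theta

print(gradient_descent(lr=0.1))               # ~0.0: converges quickly
print(gradient_descent(lr=1e-6))              # ~4.9: barely moved after 10,000 steps
print(gradient_descent(lr=1.1, n_iters=50))   # ~45,500: each step overshoots; diverges
```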

5. In linear regression we define residual as the difference of actual value and predicted value
(ytrue − ypred ). You have trained a linear regression model and plotted the residual and the
predicted values on a plane and observed that there is a relationship between them. What can
you say about the trained model?

(a) The model is trained properly
(b) The model is overfitted
(c) The model has failed to capture the relationship between input vector and output completely
(d) Can't say anything


Ans: C. Since we can observe a relationship between the residuals and the predicted values, the model has failed to capture that part of the input-output relationship and is not trained properly. For a properly trained model, there will be no systematic relationship between the residuals and the predicted outputs.
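As an illustration, a small synthetic sketch (assuming NumPy): we fit a straight line to data generated with a quadratic term, then test the residuals for leftover structure. Note that OLS residuals are uncorrelated with the fitted values by construction, so the telltale "relationship" appears as curvature rather than as linear correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 1, x.size)   # true relation is quadratic

b1, b0 = np.polyfit(x, y, deg=1)   # fit a (misspecified) straight line
residuals = y - (b1 * x + b0)

# Regress the residuals on a quadratic in x: a clearly nonzero leading
# coefficient (~0.5 here) reveals the x**2 term the line failed to capture.
print(np.polyfit(x, residuals, deg=2)[0])
```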
6. Consider the following statements regarding linear regression and select the correct ones :
(a) You will always find the optimum solution with the normal equation method with arbi-
trary input feature matrix.
(b) You will always find the optimum solution with the normal equation method with input
feature matrix which does not contain columns which perfectly correlate with each other.
(c) You will always find the optimum solution with gradient descent method as the loss
surface is convex.
(d) You will find the optimum solution with gradient descent with appropriate hyperparam-
eters.
Ans: B and D are correct. A is incorrect because the matrix X^T X may not be invertible (the feature matrix may contain linearly dependent columns) when calculating the normal solution. C is incorrect because gradient descent is only guaranteed to converge with appropriate hyperparameters, even though the loss surface is convex: for instance, the learning rate might be too high for the model to converge.
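A small sketch (assuming NumPy, with a made-up feature matrix) showing the normal equation failing under perfect correlation while a least-squares solver still returns a minimizer:

```python
import numpy as np

# Design matrix: bias column, a feature, and an exact copy of that feature
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([4.0, 6.0, 10.0])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 2 < 3: X^T X is singular

try:
    theta = np.linalg.inv(XtX) @ X.T @ y   # normal equation
except np.linalg.LinAlgError as e:
    print("normal equation fails:", e)     # Singular matrix

# A least-squares solver still finds *a* minimizer, since the convex loss
# surface has a flat valley of equally optimal solutions.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```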
7. Let f(x) be the sigmoid function. What is the derivative of f(x) in terms of f(x)?
(a) f(x) · (1 − f(x))
(b) f(x) · (f(x) − 1)
(c) f(x) / (1 − f(x))
(d) 1 / (1 − f(x))
Ans: A. f(x) · (1 − f(x)) is the derivative of f(x).
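For reference, a short derivation of this identity:

$$f(x) = \frac{1}{1+e^{-x}}, \qquad f'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = f(x)\,\bigl(1-f(x)\bigr),$$

using the fact that e^{−x}/(1 + e^{−x}) = 1 − 1/(1 + e^{−x}) = 1 − f(x).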
8. Suppose you have been given a fair coin and you want to find out the odds of getting heads.
Which of the following option is true for such a case?
(a) 0
(b) 0.5
(c) 1
(d) None of the above
Ans: C. Odds are defined as the ratio of the probability of success to the probability of failure. For a fair coin, the probability of success is 1/2 and the probability of failure is 1/2, so the odds are (1/2)/(1/2) = 1.
9. You have a model which predicts the marks of students on a test. The marks are given on a scale of 0-10, in increments of 0.5. You have been given a set of features for this. You will perform (a) ———— regression on this. A possible metric to compute the performance of this model is (b) ————.
Ans: (a) Logistic regression, (b) Accuracy. Since the grading buckets are discretized in intervals of 0.5 marks, we must perform logistic regression, as the task is to predict one out of k possible classes (grading buckets between 0 and 10 at intervals of 0.5). You cannot use linear regression, as it predicts a continuous value and will not constrain the result to be a multiple of 0.5.


10. Which loss function, MSE or MAE, is harder to optimize? State the reason.
Ans: MAE is harder to optimize. MAE has no closed-form solution because the absolute value makes it a piecewise function that is not differentiable at zero. For this reason, MAE is computationally more expensive: we cannot solve for it directly in terms of matrix algebra and must rely on iterative approximations.
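The difficulty is visible in the per-example gradients:

$$\frac{\partial}{\partial \hat{y}}\,(y-\hat{y})^2 = -2\,(y-\hat{y}), \qquad \frac{\partial}{\partial \hat{y}}\,|y-\hat{y}| = -\operatorname{sign}(y-\hat{y}).$$

The squared-error gradient is smooth and shrinks as predictions approach the targets, while the absolute-error gradient has constant magnitude and is undefined at the point where the prediction equals the target, which is why iterative methods (e.g., subgradient descent) are needed for MAE.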
11. You have a dataset which has x and y as input variables and z as the output variable. After plotting the data, you realize that the output values form a cone around the positive z-axis. What transformation can you apply to the input variables so that you can train a linear regression model on the dataset?
(a) (x, y) → (x + y)
(b) (x, y) → (x² + y²)
(c) (x, y) → (y, x)
(d) (x, y) → (x)
Ans: B. As the output forms a cone around the positive z-axis, the output shows a linear relationship with the input coordinates' distance from the origin in the x-y plane, r = √(x² + y²). Option B maps the inputs to the squared distance r², the only option that depends on this radial distance, so we can fit a linear regression model on the transformed feature.
12. Which loss function, L1 or L2, penalizes outliers more, and why?
Ans: In L2, the difference between the predicted and expected value is squared, so when the difference is large (as is the case with outliers) the error is relatively much greater than under L1 loss. Therefore L2 penalizes outliers more.
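A quick numerical illustration (made-up residuals, assuming NumPy):

```python
import numpy as np

residuals = np.array([1.0, -1.0, 2.0, 10.0])   # the 10.0 is an outlier

l1 = np.abs(residuals)   # [ 1.  1.  2. 10.]  -> total 14.0
l2 = residuals ** 2      # [ 1.  1.  4. 100.] -> total 106.0

print(l1[-1] / l1.sum())   # ~0.71: the outlier's share of the L1 loss
print(l2[-1] / l2.sum())   # ~0.94: the outlier's share of the L2 loss
```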
13. You have learnt about the gradient descent update equation (see below; the standard update rule is reproduced here, as the original formula was lost in extraction). You want to train your model for 100 iterations. What are the possible issue(s) that may arise due to an incorrect choice of the learning rate (α)?

$$\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$

(a) If the learning rate is too high, the updates will not move towards the intended minima.
Instead, it’ll move in the opposite direction.
(b) If the learning rate is too high, the gradient descent will arrive at the minima in very few
epochs (say 10 iterations). In subsequent epochs, it will move out of the minima.
(c) If the learning rate is too low, the updates will not move towards the intended minima.
Instead, it’ll move in the opposite direction.
(d) If the learning rate is too low, then the parameters may not converge to the optimal values
in 100 iterations even if the updates are moving in the correct direction.
Ans: D. Convergence will be slow if the learning rate is too low; it may require more iterations than the 100 available. A and C are incorrect because the update to the parameters always points in the direction of decreasing loss. B is incorrect because once the minimum is reached, the gradient becomes 0; hence there will be no update at all for the remaining iterations.


14. Imagine a classification problem with highly imbalanced data. The majority class occurs 99%
of the times in the training data. A model trained on this dataset yields 99% accuracy on the
test data. Which of the following can be said in such a case? (Select all that apply)

(a) Accuracy is not a good metric for problems having highly imbalanced data.
(b) Accuracy is a good metric for problems with highly imbalanced data.
(c) Precision, Recall are good metrics for problems with highly imbalanced data.
(d) Precision, Recall are not good metrics for problems with highly imbalanced data

Ans: A, C. In a highly imbalanced dataset, accuracy is not a good metric, since an accuracy of 99% might simply mean that the model always predicts the majority class. To measure the class-wise performance of the classifier, other metrics such as precision and recall should be used.
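A minimal sketch (assuming scikit-learn is available) of a majority-class predictor on a 99:1 dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 990 negatives, 10 positives; the "model" always predicts the majority class
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                     # 0.99 -- looks great
print(precision_score(y_true, y_pred, zero_division=0))   # 0.0 -- no positive ever found
print(recall_score(y_true, y_pred))                       # 0.0 -- every positive missed
```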

15. Consider that you have a linear regression model with two model parameters, θ0 and θ1. Further, consider that after n iterations with m training inputs, the cost function J(θ0, θ1) = 0. Then select all that apply:

(a) All prediction outputs for the training inputs will lie perfectly on a straight line which also coincides with the true outputs for the training inputs.
(b) For the true output matrix Y ∈ R^(1×m), if Y is a null matrix, then θ0 and θ1 can be equal to zero.
(c) For the true output matrix Y ∈ R^(1×m), if Y is a null matrix, then θ0 and θ1 can NOT be equal to zero.
(d) J(θ0, θ1) can NOT be equal to 0.

Ans: A, B. If the cost function is zero and we have only two parameters, then our predicted output = θ0 + θ1·x, which necessarily lies on a straight line. If that straight line coincides with the true outputs, the predicted and true outputs are identical, hence the cost function J = 0. If Y is a null matrix, then all true outputs are zero; if θ0 and θ1 are zero, then all predicted values are zero too, again giving J = 0.

16. Which of the following is true?

(a) The validation set is used to estimate the generalization error of the final model, once all hyperparameters have been chosen.
(b) Model parameters are the parameters in the model that must be determined using the training data set.
(c) The test set is used to estimate the generalization error of each hyperparameter setting.
(d) Hyperparameters are adjustable parameters that must be tuned in order to obtain a model with optimal performance.

Ans: B, D. A and C are incorrect because they swap the roles of the validation and test sets: the validation set is used to estimate the generalization error of each hyperparameter setting, while the test set estimates the generalization error of the final model.


17. Five randomly chosen land areas were sold for the prices mentioned in the table below. A real
estate dealer wants to predict the house price for any given land area based on these samples.
Come up with a linear regression equation that best predicts the house prices.

Ans: This numerical is similar to one posted for Tutorial 3. There are alternate approaches as well (the best would be the normal equation method, if you had access to scientific calculators).

The regression equation is y_pred = b0 + b1·x. We solve for b0 and b1:

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{125}{350} \approx 0.357$$

$$b_0 = \bar{y} - b_1\,\bar{x} = 80 - 0.357 \times 85 = 49.655$$

Final answer: y_pred = 49.655 + 0.357·x. Or, w = 0.357, b = 49.655 (as given in the question).
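A small sketch that reproduces the arithmetic (the original data table is not shown above, so only the quoted summary statistics are used):

```python
# Summary statistics quoted in the solution (the raw table is not reproduced)
sum_dev_xy = 125.0    # sum of (x_i - x_mean) * (y_i - y_mean)
sum_dev_xx = 350.0    # sum of (x_i - x_mean)**2
x_mean, y_mean = 85.0, 80.0

b1 = sum_dev_xy / sum_dev_xx   # 0.35714... (the solution rounds this to 0.357)
b0 = y_mean - b1 * x_mean      # 49.643 at full precision; 49.655 if b1 is
                               # rounded to 0.357 first, as in the solution

print(f"y_pred = {b0:.3f} + {b1:.3f} * x")
```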
