Assignment 2
Sol. (a)
The absolute value of the correlation coefficient denotes the strength of the relationship. Since
the absolute correlation is very low, regressing Y on X1 explains little of the variation in Y.
3. Which of the following is a limitation of subset selection methods in regression?
(a) They tend to produce biased estimates of the regression coefficients.
(b) They cannot handle datasets with missing values.
(c) They are computationally expensive for large datasets.
(d) They assume a linear relationship between the independent and dependent variables.
(e) They are not suitable for datasets with categorical predictors.
Sol. (c)
They are computationally expensive for large datasets.
4. The relation between studying time (in hours) and grade on the final examination (0-100) in
a random sample of students in the Introduction to Machine Learning Class was found to be:
Grade = 30.5 + 15.2 (h)
How will a student’s grade be affected if she studies for four hours?
(d) The grade will remain unchanged.
(e) It cannot be determined from the information given
Sol. (c)
The slope of the regression line gives the average increase in grade for every hour increase in
studying. So, if studying is increased by four hours, the grade will increase by 4(15.2) = 60.8.
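The slope interpretation can be checked numerically. This is a minimal sketch using the fitted line from the question; the function name `predicted_grade` is illustrative and not part of the original.

```python
# Fitted line from the question: Grade = 30.5 + 15.2 * h
INTERCEPT = 30.5
SLOPE = 15.2


def predicted_grade(hours):
    """Return the grade predicted by the fitted line for `hours` of study."""
    return INTERCEPT + SLOPE * hours


# Change in predicted grade when study time rises by four hours.
# The intercept cancels in the difference, so only the slope matters.
increase = predicted_grade(4) - predicted_grade(0)
print(round(increase, 1))
```

The same difference results for any starting point, for example `predicted_grade(5) - predicted_grade(1)`, because the model is linear in h.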
5. Which of the statements is/are True?
(a) Ridge has sparsity constraint, and it will drive coefficients with low values to 0.
(b) Lasso has a closed form solution for the optimization problem, but this is not the case
for Ridge.
(c) Ridge regression does not reduce the number of variables since it never leads a coefficient
to zero but only minimizes it.
(d) If there are two or more highly collinear variables, Lasso will select one of them randomly.
Sol. (c),(d)
Refer to the lecture
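Statement (c) can be illustrated in the special case of orthonormal predictors, where both penalties have well-known coordinate-wise solutions: ridge rescales each least-squares coefficient, while lasso soft-thresholds it. The sketch below assumes that orthonormal setting (the exact constants depend on how the penalty strength is scaled); the coefficient values, `lam`, and the function names are made up for illustration.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge solution for orthonormal predictors: proportional shrinkage."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Lasso solution for orthonormal predictors: soft-thresholding."""
    out = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        if magnitude == 0.0:
            out.append(0.0)  # small coefficients are set exactly to zero
        else:
            out.append(magnitude if b > 0 else -magnitude)
    return out


beta_ols = [3.0, -0.4, 0.2, -2.5]  # hypothetical least-squares coefficients
lam = 0.5                          # hypothetical penalty strength

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)

print("ridge:", ridge)  # every coefficient is smaller but still nonzero
print("lasso:", lasso)  # coefficients with |value| <= lam become 0
```

Ridge keeps all four variables in the model, whereas lasso drops the two whose least-squares coefficients are below the threshold.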
6. Find the mean of squared error for the given predictions:
Y f(x)
1 2
2 3
4 5
8 9
16 15
32 31
Hint: Find the squared error for each prediction and take the mean of that.
(a) 1
(b) 2
(c) 1.5
(d) 0
Sol. (a)
Mean squared error $= \frac{\sum (Y - f(x))^2}{6}$
$= \frac{(-1)^2 + (-1)^2 + (-1)^2 + (-1)^2 + 1^2 + 1^2}{6}$
$= \frac{6}{6}$
$= 1$
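The same calculation can be reproduced in a few lines of code. This is a minimal sketch using the six (Y, f(x)) pairs from the table; the variable names are illustrative.

```python
# (Y, f(x)) pairs from the table in the question.
pairs = [(1, 2), (2, 3), (4, 5), (8, 9), (16, 15), (32, 31)]

# Squared error for each prediction, then the mean over all six.
squared_errors = [(y - fx) ** 2 for y, fx in pairs]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # every error is +1 or -1, so every square is 1
print(mse)
```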
7. Consider the following statements:
Statement A: In Forward stepwise selection, in each step, that variable is chosen which has the
maximum correlation with the residual, then the residual is regressed on that variable, and it
is added to the predictor.
Statement B: In Forward stagewise selection, the variables are added one by one to the previously
selected variables to produce the best fit till then.
Sol. (d)
Refer to the lecture
8. The linear regression model $y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_p x_p$ is to be fitted to a set of N training
data points having p attributes each. Let X be the N × (p + 1) matrix of input values (augmented
by 1's), Y be the N × 1 vector of target values, and θ be the (p + 1) × 1 vector of parameter values
$(a_0, a_1, a_2, \dots, a_p)$. If the sum of squared errors is minimized to obtain the optimal regression model,
which of the following equations holds?
(a) $X^T X = XY$
(b) $X\theta = X^T Y$
(c) $X^T X\theta = Y$
(d) $X^T X\theta = X^T Y$
Sol. (d)
This comes from minimizing the residual sum of squares,
$RSS(\theta) = (Y - X\theta)^T (Y - X\theta)$ (in matrix form).
Taking the derivative with respect to θ and setting it to 0 gives
$X^T (Y - X\theta) = 0$.
So,
$X^T X\theta = X^T Y$, and, when $X^T X$ is invertible, $\theta = (X^T X)^{-1} X^T Y$.
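The normal equations can be verified on a small example. The sketch below is written in plain Python under stated assumptions: the data are hypothetical (generated from y = 1 + 2·x1 − 3·x2), and the helpers `transpose`, `matmul`, and `solve` are illustrative, not from the original.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical training data generated from y = 1 + 2*x1 - 3*x2.
inputs = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 5.0), (4.0, 2.0)]
Y = [1 + 2 * x1 - 3 * x2 for x1, x2 in inputs]
X = [[1.0, x1, x2] for x1, x2 in inputs]  # augmented with a column of 1's

Xt = transpose(X)
XtX = matmul(Xt, X)
XtY = [sum(x * y for x, y in zip(row, Y)) for row in Xt]

# Solve X^T X theta = X^T Y for theta.
theta = solve(XtX, XtY)

# At the solution, X^T (Y - X theta) should be (numerically) zero.
residual = [y - sum(t * x for t, x in zip(theta, row)) for row, y in zip(X, Y)]
gradient_check = [sum(x * r for x, r in zip(row, residual)) for row in Xt]

print(theta)
print(gradient_check)
```

Solving the linear system directly, rather than forming the inverse of $X^T X$, is a common choice because it avoids an explicit matrix inversion.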
9. Which of the following statements is true regarding Partial Least Squares (PLS) regression?
(a) PLS is a dimensionality reduction technique that maximizes the covariance between the
predictors and the dependent variable.
(b) PLS is only applicable when there is no multicollinearity among the independent variables.
(c) PLS can handle situations where the number of predictors is larger than the number of
observations.
(d) PLS estimates the regression coefficients by minimizing the residual sum of squares.
(e) PLS is based on the assumption of normally distributed residuals.
(f) All of the above.
(g) None of the above.
Sol. (a)
PLS is a dimensionality reduction technique that maximizes
the covariance between the predictors and the dependent variable.
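The covariance-maximizing property can be illustrated for the first PLS direction. For centered data, the unit-length weight vector w that maximizes the sample covariance between Xw and y is proportional to $X^T y$. The sketch below assumes two hypothetical predictors, skips the usual standardization step for brevity, and compares that direction against a grid of other unit vectors; all data and names are made up for illustration.

```python
import math


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def covariance_with_y(columns, w, y):
    """Sample covariance between the derived input z = X w and y (both centered)."""
    n = len(y)
    z = [sum(wj * col[i] for wj, col in zip(w, columns)) for i in range(n)]
    return sum(zi * yi for zi, yi in zip(z, y)) / n


# Hypothetical data: two predictors and a response, all centered.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = center([2.0, 1.0, 4.0, 3.0, 6.0])
y = center([1.5, 1.9, 3.2, 3.8, 5.6])
columns = [x1, x2]

# First PLS weight vector: proportional to X^T y, scaled to unit length.
xty = [sum(a * b for a, b in zip(col, y)) for col in columns]
norm = math.sqrt(sum(v * v for v in xty))
w_pls = [v / norm for v in xty]

best = covariance_with_y(columns, w_pls, y)

# Covariance achieved by other unit-length directions, one per degree.
grid = [
    covariance_with_y(columns, [math.cos(a), math.sin(a)], y)
    for a in (k * math.pi / 180 for k in range(360))
]

print(best, max(grid))  # no grid direction exceeds the PLS direction
```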
10. Which of the following statements about principal components in Principal Component
Regression (PCR) is true?
(a) Principal components are calculated based on the correlation matrix of the original
predictors.
(b) The first principal component explains the largest proportion of the variation in the
dependent variable.
(c) Principal components are linear combinations of the original predictors that are
uncorrelated with each other.
(d) PCR selects the principal components with the highest p-values for inclusion in the
regression model.
(e) PCR always results in a lower model complexity compared to ordinary least squares
regression.
Sol. (c)
Principal components are linear combinations of the original predictors
that are uncorrelated with each other.
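The uncorrelatedness of principal components can be checked on a two-predictor example. The sketch below assumes hypothetical, correlated data and uses the closed-form rotation angle for a symmetric 2 × 2 covariance matrix instead of a general eigen-solver; all names are illustrative.

```python
import math


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cov(a, b):
    """Sample covariance of two centered sequences."""
    return sum(p * q for p, q in zip(a, b)) / (len(a) - 1)


# Hypothetical, strongly correlated predictors (centered).
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = center([1.2, 1.9, 3.5, 3.9, 5.4, 5.8])

s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)

# Eigenvectors of a symmetric 2x2 covariance matrix are a rotation by this angle.
angle = 0.5 * math.atan2(2.0 * s12, s11 - s22)
v1 = (math.cos(angle), math.sin(angle))    # first principal direction
v2 = (-math.sin(angle), math.cos(angle))   # second principal direction

# Principal component scores: linear combinations of the original predictors.
z1 = [v1[0] * a + v1[1] * b for a, b in zip(x1, x2)]
z2 = [v2[0] * a + v2[1] * b for a, b in zip(x1, x2)]

print("cov(x1, x2):", s12)          # original predictors are correlated
print("cov(z1, z2):", cov(z1, z2))  # components are uncorrelated (about 0)
```

Note that the components are built from the predictors alone, so the first component captures the most variance in X, not necessarily in the dependent variable.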