
Introduction to Machine Learning

Week 5
Prof. B. Ravindran, IIT Madras

1. (1 Mark) Given a 3-layer neural network which takes 10 inputs, has 5 hidden units, and
produces 10 outputs, how many parameters are present in this network?

(a) 115
(b) 500
(c) 25
(d) 100

Soln. A
Input-to-hidden: 10 × 5 weights + 5 biases = 55; hidden-to-output: 5 × 10 weights + 10 biases = 60; total 115.
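The parameter count above can be checked with a small sketch (layer sizes taken from the question; the helper name is illustrative):

```python
# Count parameters in a 10 -> 5 -> 10 fully connected network:
# each dense layer has a weight matrix plus one bias per output unit.
def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out

total = dense_params(10, 5) + dense_params(5, 10)
print(total)  # 55 + 60 = 115, option (a)
```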

2. (1 Mark) Recall the XOR example (tabulated below) from class, where we applied a feature
transformation to make the data linearly separable. Which of the following transformations
would also work?

X1   X2    Y
-1   -1   -1
 1   -1    1
-1    1    1
 1    1   -1

(a) Rotating x1 and x2 by a fixed angle.
(b) Adding a third dimension z = x1 * x2
(c) Adding a third dimension z = x1^2 + x2^2
(d) None of the above
Sol. (b)
With z = x1 * x2, z = 1 exactly when Y = -1 and z = -1 when Y = 1, so the plane z = 0
separates the two classes. A rotation is a linear map and cannot make XOR linearly separable,
and x1^2 + x2^2 = 2 for all four points, so it adds no information.
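The transformation in (b) can be verified directly on the four XOR points (a minimal check, using the table from the question):

```python
import numpy as np

# The four XOR points and their labels from the table above.
X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])
Y = np.array([-1, 1, 1, -1])

# New third feature z = x1 * x2: after this transform, the sign of -z
# reproduces the label exactly, so the plane z = 0 separates the classes.
z = X[:, 0] * X[:, 1]
print(np.all(-z == Y))  # True
```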

3. We use several techniques to keep the weights of a neural network small (such as random
initialization around 0, or regularisation). What conclusion can we draw if the weights of
our ANN are large?
(a) Model has overfitted.
(b) It was initialized incorrectly.
(c) At least one of (a) or (b).
(d) None of the above.
Sol. (d)
Overfitting may be accompanied by large weights, but the two are not always associated, so neither conclusion necessarily follows.

4. (1 Mark) In a basic neural network, which of the following is generally considered a good
initialization strategy for the weights?
(a) Initialize all weights to zero
(b) Initialize all weights to a constant non-zero value (e.g., 0.5)
(c) Initialize weights randomly with small values close to zero
(d) Initialize weights with large random values (e.g., between -10 and 10)
Soln. C
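A sketch of why option (c) is preferred (an illustration, not the only valid scheme): constant initialization makes every hidden unit compute the same function and receive identical gradients, so the units never differentiate; small random values break this symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Constant initialization: every hidden unit gets identical incoming
# weights, so all units compute the same output and learn identically.
W_const = np.full((10, 5), 0.5)

# Small random values near zero break the symmetry between units
# (scaled variants such as Xavier/He initialization refine this idea).
W_rand = rng.normal(loc=0.0, scale=0.01, size=(10, 5))

print(np.allclose(W_const[:, 0], W_const[:, 1]))  # True: columns identical
print(np.allclose(W_rand[:, 0], W_rand[:, 1]))    # False: symmetry broken
```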
5. (1 Mark) Which of the following is the primary reason for rescaling input features before
passing them to a neural network?
(a) To increase the complexity of the model
(b) To ensure all input features contribute equally to the initial learning process
(c) To reduce the number of parameters in the network
(d) To eliminate the need for activation functions
Soln. B
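A minimal sketch of one common rescaling, standardization (the feature values here are made up for illustration): each column is shifted to zero mean and scaled to unit variance so that a feature measured in thousands does not dominate one measured in tenths at the start of training.

```python
import numpy as np

# Two features on very different scales (illustrative values).
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])

# Standardize each feature: zero mean, unit variance per column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]: both features now contribute comparably
```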
6. (1 Mark) In the Bayesian approach to machine learning, we often use the formula

   P(θ|D) = P(D|θ) P(θ) / P(D)

   where θ represents the model parameters and D represents the observed data.
   Which of the following correctly identifies each term in this formula?
(a) P (θ|D) is the likelihood, P (D|θ) is the posterior, P (θ) is the prior, P (D) is the evidence
(b) P (θ|D) is the posterior, P (D|θ) is the likelihood, P (θ) is the prior, P (D) is the evidence
(c) P (θ|D) is the evidence, P (D|θ) is the likelihood, P (θ) is the posterior, P (D) is the prior
(d) P (θ|D) is the prior, P (D|θ) is the evidence, P (θ) is the likelihood, P (D) is the posterior
Soln. B
7. (1 Mark) Why do we often use log-likelihood maximization instead of directly maximizing the
likelihood in statistical learning?
(a) Log-likelihood provides a different optimal solution than likelihood maximization
(b) Log-likelihood is always faster to compute than likelihood
(c) Log-likelihood turns products into sums, making computations easier and more numerically stable
(d) Log-likelihood allows us to avoid using probability altogether
Soln. C
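The numerical-stability point in (c) is easy to demonstrate: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays perfectly representable.

```python
import math

# 100 independent observations, each with probability 1e-5 (illustrative).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value 1e-500 underflows double precision

log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # about -1151.3, finite and easy to optimize
```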
8. (1 Mark) In machine learning, if you have an infinite amount of data, but your prior distribution
is incorrect, will you still converge to the right solution?

(a) Yes, with infinite data, the influence of the prior becomes negligible, and you will converge
to the true underlying solution.
(b) No, the incorrect prior will always affect the convergence, and you may not reach the true
solution even with infinite data.

(c) It depends on the type of model used; some models may still converge to the right solution,
while others might not.
(d) The convergence to the right solution is not influenced by the prior, as infinite data will
always lead to the correct solution regardless of the prior.

Soln. A
With infinite data, the likelihood term dominates the prior (provided the prior assigns non-zero probability to the true parameters), so the posterior concentrates on the true solution.
9. Statement: Threshold function cannot be used as activation function for hidden layers.
Reason: Threshold functions do not introduce non-linearity.

(a) Statement is true and reason is false.
(b) Statement is false and reason is true.
(c) Both are true and the reason explains the statement.
(d) Both are true and the reason does not explain the statement.
Sol. (a)
The reason is false: a threshold function does introduce non-linearity. The statement is true
because the threshold function is non-differentiable (its derivative is zero everywhere it is
defined), so gradients cannot be propagated back through it during backpropagation.

10. Choose the correct statement (multiple may be correct):

(a) MLE is a special case of MAP when the prior is a uniform distribution.
(b) MLE acts as regularisation for MAP.
(c) MLE is a special case of MAP when the prior is a beta distribution.
(d) MAP acts as regularisation for MLE.

Sol. (a), (d)
With a uniform prior, the prior term is constant, so maximizing the posterior reduces to maximizing the likelihood. A non-uniform prior penalizes unlikely parameter values, which acts as a regularizer on the MLE objective.
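As a sketch of (d) (an illustrative example, not from the lecture): for linear regression with Gaussian noise, placing a zero-mean Gaussian prior on the weights turns the MAP objective into least squares plus an L2 penalty, i.e. ridge regression, which shrinks the weights toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# MLE under Gaussian noise: ordinary least squares.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior on w: ridge regression, where
# lam reflects the prior precision (noise variance / prior variance).
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The prior shrinks the solution toward zero -- a regularizing effect.
print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True
```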


Ref. lecture
