Week 5
Prof. B. Ravindran, IIT Madras
1. (1 Mark) Given a 3-layer neural network that takes 10 inputs, has 5 hidden units, and produces 10 outputs, how many parameters are present in this network?
(a) 115
(b) 500
(c) 25
(d) 100
Soln. A
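The hidden layer has 10 × 5 weights plus 5 biases (55 parameters) and the output layer has 5 × 10 weights plus 10 biases (60 parameters), for a total of 115. A minimal sketch of the count in Python (layer sizes taken from the question; fully connected layers with biases assumed):

# Parameter count for a fully connected 10 -> 5 -> 10 network with biases.
n_in, n_hidden, n_out = 10, 5, 10
hidden_params = n_in * n_hidden + n_hidden   # 50 weights + 5 biases = 55
output_params = n_hidden * n_out + n_out     # 50 weights + 10 biases = 60
print(hidden_params + output_params)         # 115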
2. (1 Mark) Recall the XOR example (tabulated below) from class, where we applied a transformation of the features to make the data linearly separable. Which of the following transformations can also work?
X1 X2 Y
-1 -1 -1
1 -1 1
-1 1 1
1 1 -1
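For illustration, one transformation that makes this XOR table linearly separable is the product feature X1 · X2: with the ±1 encoding above, Y = −X1 · X2, so a single linear threshold on this one feature separates the two classes. A minimal check in Python:

# XOR with +/-1 encoding: y = -(x1 * x2), so the single product feature
# x1 * x2 is enough for a linear threshold to separate the classes.
data = [(-1, -1, -1), (1, -1, 1), (-1, 1, 1), (1, 1, -1)]
for x1, x2, y in data:
    phi = x1 * x2                       # transformed feature
    prediction = -1 if phi > 0 else 1   # linear threshold on phi
    assert prediction == y
print("x1 * x2 separates the XOR table")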
3. We use several techniques to ensure that the weights of the neural network are small (such as random initialization around 0 or regularisation). What conclusions can we draw if the weights of our ANN are high?
(a) Model has overfitted.
(b) It was initialized incorrectly.
(c) At least one of (a) or (b).
(d) None of the above.
Soln. D
High weights may accompany overfitting, but the two are not always associated, so we cannot draw either conclusion with certainty.
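As a side note on the regularisation technique mentioned in the question, a minimal sketch of an L2-regularised gradient step (the names w, grad, lr and lam are illustrative):

# L2 regularisation adds lam * ||w||^2 to the loss, i.e. 2 * lam * w to the
# gradient, which keeps shrinking the weights towards zero during training.
def sgd_step(w, grad, lr=0.1, lam=0.01):
    return [wi - lr * (gi + 2 * lam * wi) for wi, gi in zip(w, grad)]

w = [2.0, -3.0]
print(sgd_step(w, grad=[0.0, 0.0]))   # [1.996, -2.994]: weights shrink even with zero data gradient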
4. (1 Mark) In a basic neural network, which of the following is generally considered a good
initialization strategy for the weights?
(a) Initialize all weights to zero
(b) Initialize all weights to a constant non-zero value (e.g., 0.5)
(c) Initialize weights randomly with small values close to zero
(d) Initialize weights with large random values (e.g., between -10 and 10)
Soln. C
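A minimal sketch of option (c), assuming NumPy; the scale 0.01 is an illustrative choice rather than a prescribed value:

import numpy as np

# Small random weights break the symmetry between hidden units (all-zero or
# constant initialization would give every hidden unit identical gradients)
# while keeping early activations and gradients in a reasonable range.
rng = np.random.default_rng(0)
W1 = rng.normal(loc=0.0, scale=0.01, size=(10, 5))   # input -> hidden
W2 = rng.normal(loc=0.0, scale=0.01, size=(5, 10))   # hidden -> output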
5. (1 Mark) Which of the following is the primary reason for rescaling input features before
passing them to a neural network?
(a) To increase the complexity of the model
(b) To ensure all input features contribute equally to the initial learning process
(c) To reduce the number of parameters in the network
(d) To eliminate the need for activation functions
Soln. B
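A minimal sketch of rescaling (standardising each feature to zero mean and unit variance), assuming NumPy and an illustrative data matrix X with samples as rows:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])             # two features on very different scales
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
# Each column of X_scaled now has mean 0 and standard deviation 1, so no feature
# dominates the initial gradient updates simply because of its numeric range.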
6. (1 Mark) In the Bayesian approach to machine learning, we often use the formula
P(θ|D) = P(D|θ)P(θ) / P(D),
where θ represents the model parameters and D represents the observed data. Which of the following correctly identifies each term in this formula?
(a) P (θ|D) is the likelihood, P (D|θ) is the posterior, P (θ) is the prior, P (D) is the evidence
(b) P (θ|D) is the posterior, P (D|θ) is the likelihood, P (θ) is the prior, P (D) is the evidence
(c) P (θ|D) is the evidence, P (D|θ) is the likelihood, P (θ) is the posterior, P (D) is the prior
(d) P (θ|D) is the prior, P (D|θ) is the evidence, P (θ) is the likelihood, P (D) is the posterior
Soln. B
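A minimal numeric sketch of the four terms, using a made-up two-hypothesis example (the numbers are purely illustrative):

# Two candidate parameter values with prior P(theta) and likelihood P(D | theta).
prior      = {"theta1": 0.5, "theta2": 0.5}
likelihood = {"theta1": 0.8, "theta2": 0.2}

evidence  = sum(likelihood[t] * prior[t] for t in prior)              # P(D)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}   # P(theta | D)
print(posterior)   # {'theta1': 0.8, 'theta2': 0.2}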
7. (1 Mark) Why do we often use log-likelihood maximization instead of directly maximizing the
likelihood in statistical learning?
(a) Log-likelihood provides a different optimal solution than likelihood maximization
(b) Log-likelihood is always faster to compute than likelihood
(c) Log-likelihood turns products into sums, making computations easier and more numerically stable
(d) Log-likelihood allows us to avoid using probability altogether
Soln. C
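A minimal sketch of the numerical point, assuming i.i.d. data so the likelihood is a product of per-sample probabilities (the values are illustrative):

import math

probs = [0.01] * 500                  # 500 small per-sample probabilities

likelihood = 1.0
for p in probs:
    likelihood *= p                   # the product underflows to 0.0 in floating point

log_likelihood = sum(math.log(p) for p in probs)   # the sum of logs stays finite

print(likelihood)       # 0.0
print(log_likelihood)   # about -2302.6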
8. (1 Mark) In machine learning, if you have an infinite amount of data, but your prior distribution
is incorrect, will you still converge to the right solution?
(a) Yes, with infinite data, the influence of the prior becomes negligible, and you will converge
to the true underlying solution.
(b) No, the incorrect prior will always affect the convergence, and you may not reach the true
solution even with infinite data.
(c) It depends on the type of model used; some models may still converge to the right solution,
while others might not.
(d) The convergence to the right solution is not influenced by the prior, as infinite data will
always lead to the correct solution regardless of the prior.
Soln. A
With infinite data the likelihood dominates the prior: as long as the prior assigns non-zero probability to the true parameter, the posterior concentrates on the correct solution.
9. Statement: The threshold function cannot be used as an activation function for hidden layers.
Reason: Threshold functions do not introduce non-linearity.