ML Endsem 2022
END-Sem
Instructions –
• Attempt all questions. MCQs have a single correct option.
• State any assumptions you have made clearly.
• Standard institute plagiarism policy holds.
• No evaluation without suitable justification.
Question 1: [1 Mark] In batch learning, does the order in which the input data is presented influence the learning outcome?
B. No, the order of input data does not influence the learning outcome in batch learning.
D. The order only matters in real-time data streaming, not in batch learning.
Answer: (B) No, the order of input data does not influence the learning outcome in batch learning. In batch learning, the entire dataset (or a large batch of it) is processed collectively during training, so the order in which the data points are presented does not affect the learning process. Training is driven by the aggregate error over the whole dataset or batch rather than by individual data points, making the order of presentation irrelevant in this context.
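A minimal NumPy sketch (illustrative only, with toy data; not part of the original answer) showing that the full-batch gradient is a sum over all examples and is therefore unchanged by shuffling:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))           # toy features
y = rng.normal(size=100)                # toy targets
w = rng.normal(size=3)                  # current weights

def batch_gradient(X, y, w):
    # Gradient of the mean squared error over the WHOLE batch.
    residual = X @ w - y
    return 2 * X.T @ residual / len(y)

perm = rng.permutation(len(y))          # shuffle the dataset
g_original = batch_gradient(X, y, w)
g_shuffled = batch_gradient(X[perm], y[perm], w)

print(np.allclose(g_original, g_shuffled))   # True: presentation order is irrelevant
```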
Question 2: [1 Mark] What impact will increasing the number of nodes in a neural network’s hidden layer have?
A. It will always improve the network’s accuracy on both training and test sets.
B. It may lead to underfitting, resulting in poor accuracy on both training and test sets.
C. It may result in overfitting, leading to higher accuracy on the training set but poor
accuracy on the test set, and increased training time.
D. The number of nodes in the hidden layer has no impact on the network’s performance.
Answer: (C) It may result in overfitting, leading to higher accuracy on the training set but poor accuracy on the test set, and increased training time.
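A small scikit-learn sketch (purely illustrative, with an assumed toy dataset; not part of the original solution) comparing a narrow and a very wide hidden layer: the wide model has far more parameters and can memorize the noisy training set, which typically shows up as a higher train score and a lower test score:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small, noisy dataset so that a very wide hidden layer can memorize it.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for width in (4, 512):
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print(width, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```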
Question 3: [1 Mark] You are designing a deep learning system to detect driver fatigue in cars,
where it is crucial to detect fatigue accurately to prevent accidents. Which of the following
evaluation metrics would be the most appropriate for this system?
A. Precision
B. Recall
C. F1 score
D. Loss value
Answer: (B) Recall. Explanation: Recall is the most appropriate metric in this context because it measures the proportion of actual positive cases (fatigue present) that were correctly identified. In safety-critical applications like driver fatigue detection, it is crucial to minimize false negatives (i.e., failing to detect fatigue when it is present), which is exactly what recall focuses on.
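A quick illustration (hypothetical labels, not from the exam) of how the metrics behave when fatigue cases are missed: false negatives lower recall the most:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = driver fatigued, 0 = alert.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]   # two missed fatigue cases (false negatives)

print("precision", precision_score(y_true, y_pred))   # 2 / 3
print("recall   ", recall_score(y_true, y_pred))      # 2 / 4 -- hurt most by the misses
print("f1       ", f1_score(y_true, y_pred))          # harmonic mean of the two
```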
Question 4: [2 Marks] Assuming ∂L/∂x3 is known, write the weight update for w1 (∂L/∂w1 should be in the expanded form). Input x1 and all weights are positive. All neurons have ReLU activation. Figure attached below.
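The figure is not reproduced in this text, so the following is only a sketch of the expected form, assuming a simple chain x1 → x2 → x3 with x2 = ReLU(w1 x1) and x3 = ReLU(w2 x2); since all inputs and weights are positive, every ReLU derivative equals 1:

$$\frac{\partial L}{\partial w_1}
  = \frac{\partial L}{\partial x_3}\,\frac{\partial x_3}{\partial x_2}\,\frac{\partial x_2}{\partial w_1}
  = \frac{\partial L}{\partial x_3}\cdot w_2 \cdot x_1,
\qquad
 w_1 \leftarrow w_1 - \eta\,\frac{\partial L}{\partial w_1}.$$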
Question 5: [2+2 Marks] Consider an activation function ρ(x) = x · σ(x), where σ(x) is the
sigmoid function.
(a) Compute ρ′ (x) in terms of σ(x).
(b) For large x, compare ρ(x) and ρ′ (x) with standard activation functions. (No derivation
required).
Part (a) Given that σ′(x) = σ(x)(1 − σ(x)), by the product rule:
ρ′(x) = σ(x) + x · σ′(x) = σ(x) + x · σ(x)(1 − σ(x)) = σ(x)[1 + x(1 − σ(x))]
Part (b)
1. For large values of x, σ(x) ≈ 1, thus ρ(x) ≈ x and the function mimics the ReLU activation.
2. For large values of x, σ(x) ≈ 1 and (1 − σ(x)) ≈ 0, so [1 + x(1 − σ(x))] ≈ 1 and ρ′(x) ≈ σ(x); the derivative mimics the sigmoid activation.
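A short NumPy check (illustrative, not part of the original solution) that the closed form ρ′(x) = σ(x)[1 + x(1 − σ(x))] matches a finite-difference estimate, and of the large-x behaviour:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rho(x):                      # rho(x) = x * sigma(x), i.e. SiLU/Swish
    return x * sigmoid(x)

def rho_prime(x):                # closed form: sigma(x) * (1 + x * (1 - sigma(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (rho(x + eps) - rho(x - eps)) / (2 * eps)   # central difference
print(np.allclose(numeric, rho_prime(x), atol=1e-5))  # True

# Large-x behaviour: rho(x) ~ x (ReLU-like), rho'(x) ~ sigma(x) ~ 1.
big = 20.0
print(rho(big), rho_prime(big))
```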
Question 6: [2 Marks] Consider an ensemble of 5 models which are trained on the same ar-
chitecture but have different initializations for a handwritten digit classification task. Does the
guarantee of better performance in expectation (in terms of cross-entropy loss) by averaging
the predictions of all five networks hold if you instead average the weights and biases of the
networks? Why or why not?
Solution: No, the guarantee does not hold, because the loss is not convex with respect to the weights and biases. The guarantee for averaging predictions follows from the convexity of the cross-entropy loss in the predicted probabilities (Jensen’s inequality); no such convexity holds in weight space. Networks starting from different initializations can learn different hidden representations (for example, the same function with hidden units permuted), so averaging their weights and biases generally does not produce a sensible network.
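A small NumPy illustration (not part of the original solution; the layer sizes are arbitrary) of why weight averaging fails: two networks that compute exactly the same function, one being a hidden-unit permutation of the other, give identical averaged predictions, yet averaging their weights mixes mismatched hidden units and changes the function:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)            # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # softmax probabilities

# Network A: random weights; Network B: the same network with its
# hidden units permuted -- functionally identical to A.
W1 = rng.normal(size=(4, 8)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 3)); b2 = rng.normal(size=3)
perm = rng.permutation(8)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(5, 4))
pA = mlp(x, W1, b1, W2, b2)
pB = mlp(x, W1p, b1p, W2p, b2)
print(np.allclose(pA, pB))                 # True: same function
print(np.allclose((pA + pB) / 2, pA))      # True: prediction averaging is harmless here

# Averaging weights mixes mismatched hidden units and changes the function.
pAvgW = mlp(x, (W1 + W1p) / 2, (b1 + b1p) / 2, (W2 + W2p) / 2, b2)
print(np.allclose(pAvgW, pA))              # False (in general)
```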
Question 7: [2 Marks] Prove that approximately 63% of the entire original dataset (total
training set) is present in any of the sampled bootstrap datasets using the Bagging method.
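Expected derivation (a standard argument): each bootstrap dataset is formed by n draws with replacement from the n training points, so

$$P(\text{a given point is never drawn}) = \left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow{\;n \to \infty\;}\; e^{-1} \approx 0.368,$$

$$\Rightarrow\quad P(\text{a given point appears in the bootstrap dataset}) \approx 1 - e^{-1} \approx 0.632 \approx 63\%.$$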
Figure 2: Solution to Q8
Alternate solution for proving non-validity: when xi = −yi, the distance is infinite.
Question 9: [3 Marks] For a CNN-based classifier, calculate the number of weights, number
of biases, and the size of the associated feature maps for each layer, following the notation:
• CONV-K-N denotes a convolutional layer with N filters, each of size K × K. Padding and
stride parameters are always 0 and 1, respectively.
• POOL-K indicates a K × K pooling layer with stride K and padding 0.
• FC-N stands for a fully-connected layer with N neurons.
Successively:
Figure 3: Question 4
Figure 4: Question 9
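Since the actual layer list and the solution table are in the figure, here is a small helper sketch (with a hypothetical layer list) that applies the counting rules implied by the notation: a CONV-K-N layer has K·K·C_in·N weights and N biases and, with padding 0 and stride 1, shrinks each spatial side by K−1; POOL-K has no parameters and divides each side by K; FC-N has (inputs · N) weights and N biases.

```python
def cnn_shapes(input_hwc, layers):
    """Track (H, W, C), weights and biases through CONV-K-N / POOL-K / FC-N layers."""
    h, w, c = input_hwc
    flat = None                                   # set once we hit the first FC layer
    for layer in layers:
        kind, *args = layer.split("-")
        if kind == "CONV":                        # padding 0, stride 1
            k, n = int(args[0]), int(args[1])
            weights, biases = k * k * c * n, n
            h, w, c = h - k + 1, w - k + 1, n
            size = (h, w, c)
        elif kind == "POOL":                      # stride K, padding 0, no parameters
            k = int(args[0])
            weights, biases = 0, 0
            h, w = h // k, w // k
            size = (h, w, c)
        else:                                     # FC-N
            n = int(args[0])
            in_units = flat if flat is not None else h * w * c
            weights, biases = in_units * n, n
            flat, size = n, (n,)
        print(f"{layer:12s} weights={weights:8d} biases={biases:5d} feature map={size}")

# Hypothetical example (the actual layer list is given in the exam figure):
cnn_shapes((32, 32, 3), ["CONV-5-8", "POOL-2", "CONV-3-16", "POOL-2", "FC-10"])
```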
Question 10: [4 Marks] Discuss the role of activation functions in mitigating exploding and
vanishing gradient problems. Provide examples of activation functions that are more or less
prone to these issues and explain why.
Exploding and Vanishing Gradient Issues:
Exploding Gradient: Occurs when gradients become extremely large during backpropagation,
leading to unstable learning. This occurs when the weights become very large.
Vanishing Gradient: Happens when gradients become too small, causing slow or halted
learning for deep networks.
Activation Functions:
Sigmoid and tanh: Prone to vanishing gradients, particularly in deep networks, because they saturate for large |x| and their derivatives are bounded (at most 0.25 for the sigmoid and 1 for tanh); multiplying many such small factors during backpropagation shrinks the gradient, limiting their use in deep networks.
ReLU, Leaky ReLU, PReLU, ELU: Designed to alleviate vanishing gradients, since their derivative is 1 (or close to 1) over the active region, making them suitable for deep architectures.
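A short NumPy illustration (an assumed toy chain, not part of the original solution) of why saturating activations vanish: the backpropagated gradient is a product of per-layer derivative factors, bounded by 0.25 for the sigmoid but equal to 1 on the active side of ReLU, so deep sigmoid chains drive the product toward zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 30
x = 0.5                              # input fed through a chain of unit-weight layers
grad_sigmoid, grad_relu = 1.0, 1.0
a_sig, a_relu = x, x
for _ in range(depth):
    s = sigmoid(a_sig)
    grad_sigmoid *= s * (1 - s)               # sigmoid'(z) <= 0.25, product shrinks fast
    a_sig = s
    grad_relu *= 1.0 if a_relu > 0 else 0.0   # ReLU'(z) = 1 for positive inputs
    a_relu = max(a_relu, 0.0)

print(f"after {depth} layers: sigmoid-chain gradient {grad_sigmoid:.3e}, "
      f"ReLU-chain gradient {grad_relu:.1f}")
```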
Question 11: [2 Marks] Express the derivative of a sigmoid in terms of the sigmoid itself for
positive constants a and b:
(a) A purely positive sigmoid: φj (v) = 1/(1 + exp(−av))
(b) An antisymmetric sigmoid: φj (v) = a tanh(bv)
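No worked solution appears here; the standard expressions, which the derivations should arrive at, are:

(a) With φj(v) = 1/(1 + e^{−av}):
$$\varphi_j'(v) = \frac{a\,e^{-av}}{\left(1+e^{-av}\right)^2} = a\,\varphi_j(v)\bigl(1-\varphi_j(v)\bigr).$$

(b) With φj(v) = a tanh(bv):
$$\varphi_j'(v) = ab\,\operatorname{sech}^2(bv) = ab\left(1-\tanh^2(bv)\right) = \frac{b}{a}\bigl(a-\varphi_j(v)\bigr)\bigl(a+\varphi_j(v)\bigr).$$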
Question 12: [2 Marks] Consider the following patterns, each having four binary-valued attributes:
ω1 1100 0000 1010 0011
ω2 1100 1111 1110 0111
Note especially that the first patterns in the two categories are the same. Identify the root node feature for a binary classification tree for this data so that the leaf nodes have the lowest impurity possible.
Figure 5: Solution to Q12 (all the calculations for entropy/impurity should be shown).
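A small Python sketch (illustrative; the graded calculations are in Figure 5) that computes the information gain of splitting on each of the four binary attributes:

```python
import math
from collections import Counter

patterns = {"w1": ["1100", "0000", "1010", "0011"],
            "w2": ["1100", "1111", "1110", "0111"]}

data = [(p, label) for label, ps in patterns.items() for p in ps]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

parent = entropy([label for _, label in data])
for attr in range(4):                        # candidate root features (bit positions)
    left = [label for p, label in data if p[attr] == "0"]
    right = [label for p, label in data if p[attr] == "1"]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    print(f"attribute {attr + 1}: information gain = {parent - remainder:.3f}")
```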
Question 13: [2 Marks] In the context of behavior simulation and content simulation tasks,
discuss the implications of the observed performance gap between LCBM and large content-only
models like GPT-3.5 and GPT-4. What insights can be drawn from this performance difference,
and how does it contribute to our understanding of the effectiveness of including behavior tokens
in language model training? Provide a brief analysis of the observed trends as presented during the lecture.
Q13: Refer to https://arxiv.org/pdf/2309.00359.pdf for more details.
• LCBM, while being 10x smaller than GPT-3.5 and 4, performs better than them on all
behavior-related tasks.
• Further, we see that there is no significant difference between 10-shot and 2-shot GPT-4
or between GPT-3.5 and GPT-4, indicating that unlike other tasks, it is harder to achieve
good performance through in-context learning on the behavior modality.
• It can be observed that often GPT-3.5 and 4 achieve performance comparable to (or worse
than) random baselines. Interestingly, the performance of GPTs on the content simulation
task is also substantially behind LCBM.
• Given the way we formulate the content simulation task (Listing 5), substantial performance could be achieved through strong content knowledge alone, with behavior contributing little variance.
• We still see a substantial performance gap between the two models. All of this indicates that large models like GPT-3.5 and 4 are not trained on behavior tokens.
Disadvantages:
• Limited flexibility in responding to diverse user inputs; potential to miss nuanced expressions and unique user needs; may feel less conversational and empathetic.
Advantages:
• Enhanced flexibility in addressing various user inputs; can simulate more natural and empathetic conversations; better adaptation to users’ emotional states.
0.25 marks for any one advantage and one disadvantage of each type (0.25 × 4 = 1 mark).
• Implications: Ensures the protection of sensitive health information, guarantees user privacy and control over personal data, and sets standards for cybersecurity and data protection. Give 0.5 marks for any one mentioned.
• To ensure compliance: Implement robust encryption for data transmission, ensure secure storage and access controls for user data, and obtain clear consent for data collection and usage, or something along these lines. 0.5 marks.