
CSE343/CSE543/ECE363/ECE563: Machine Learning Sec A (Monsoon 2023)

END-Sem

Date of Examination: 6.12.2023 Duration: 2 hours Total Marks: 30

Instructions –
• Attempt all questions. MCQs have a single correct option.
• State any assumptions you have made clearly.
• Standard institute plagiarism policy holds.
• No evaluation without suitable justification.
• 0 marks if either the option or the explanation is incorrect.


Question 1: [1 Mark] In the context of batch learning in neural networks, does the order in
which input data is presented to the network influence the learning process?
A. Yes, the order significantly affects learning, leading to different models.

B. No, the order of input data does not influence the learning outcome in batch learning.

C. Yes, but only in networks with more than three layers.

D. The order only matters in real-time data streaming, not in batch learning.
Answer: (B) No, the order of input data does not influence the learning outcome
in batch learning. In batch learning, the gradient update is computed from the
aggregate error over the entire dataset (or a large batch) in one pass. Because
the per-example errors are summed (or averaged), the result is invariant to any
permutation of the data points, so the order of presentation is irrelevant in
this context.
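To make the invariance concrete, here is a minimal numpy sketch (toy data and a linear model, not part of the exam): the full-batch gradient is a sum over examples, so permuting the rows leaves it unchanged.

```python
import numpy as np

# Minimal sketch: full-batch gradient of a linear model's squared loss.
# The gradient is a sum over all examples, so permuting the rows of
# (X, y) cannot change it. Data and model here are illustrative only.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = rng.normal(size=5)

def batch_grad(X, y, w):
    # d/dw of mean squared error: (2/n) * X^T (Xw - y)
    return 2 * X.T @ (X @ w - y) / len(y)

perm = rng.permutation(len(y))
g1 = batch_grad(X, y, w)
g2 = batch_grad(X[perm], y[perm], w)
print(np.allclose(g1, g2))  # True: order does not matter in batch learning
```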

Question 2: [1 Mark] What impact will increasing the number of nodes in a neural network’s
hidden layer have?
A. It will always improve the network’s accuracy on both training and test sets.

B. It may lead to underfitting, resulting in poor accuracy on both training and test sets.

C. It may result in overfitting, leading to higher accuracy on the training set but poor
accuracy on the test set, and increased training time.

D. The number of nodes in the hidden layer has no impact on the network’s performance.
Answer: (C) It may result in overfitting: higher accuracy on the training set but
poor accuracy on the test set, along with increased training time.
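A small illustrative experiment (toy noisy dataset and arbitrary layer widths, assumed for demonstration; exact numbers will vary) typically shows the wide network fitting the training set better while generalizing worse:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hedged sketch: compare a small vs. a very wide hidden layer on a tiny,
# noisy dataset. The wide network tends to fit the training noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for width in (4, 512):
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print(width, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```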

Question 3: [1 Mark] You are designing a deep learning system to detect driver fatigue in cars,
where it is crucial to detect fatigue accurately to prevent accidents. Which of the following
evaluation metrics would be the most appropriate for this system?

A. Precision

B. Recall

C. F1 score

D. Loss value

Answer: (B) Explanation: Recall is the most appropriate metric in this context because it measures the
proportion of actual positive cases (fatigue detected) that were correctly identified. In safety-
critical applications like driver fatigue detection, it’s crucial to minimize false negatives (i.e.,
not detecting fatigue when it is present), which is what Recall focuses on.
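A toy example (made-up labels, with 1 = fatigued) shows how recall penalizes missed fatigue cases:

```python
from sklearn.metrics import recall_score

# Illustrative sketch (toy labels, not exam data): recall = TP / (TP + FN),
# i.e. the fraction of truly fatigued drivers the system actually flags.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = driver is fatigued
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]  # one fatigued driver missed (FN)
print(recall_score(y_true, y_pred))  # 0.75: 3 of 4 fatigue cases detected
```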

Question 4: [2 Marks] Assuming δL/δx3 is known, write the weight update for w1 (δL/δw1
should be in the expanded form). Input x1 and all weights are positive. All neurons have ReLU
activation. (Figure attached below.)

Figure 1: Solution to Q4 [1 mark for derivation, 1 mark for correct answer]
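Since the network figure is not reproduced here, the following sketch assumes a simple chain h1 = ReLU(w1·x1), x3 = ReLU(w2·h1) with a toy loss; the topology is an assumption, not the exam’s figure. With x1 and all weights positive, both ReLUs are active, so δL/δw1 = δL/δx3 · w2 · x1 and the update is w1 ← w1 − η · δL/δw1. A finite-difference check:

```python
import numpy as np

# Hypothetical chain (NOT the exam's figure): h1 = relu(w1*x1), x3 = relu(w2*h1).
# With x1, w1, w2 > 0 both ReLUs are active, so dL/dw1 = dL/dx3 * w2 * x1.
relu = lambda z: np.maximum(z, 0.0)
x1, w1, w2 = 2.0, 0.5, 1.5
L = lambda w1: (relu(w2 * relu(w1 * x1)) - 1.0) ** 2  # toy loss L = (x3 - 1)^2

x3 = relu(w2 * relu(w1 * x1))
dL_dx3 = 2 * (x3 - 1.0)              # "known" upstream gradient
analytic = dL_dx3 * w2 * x1          # expanded chain-rule form
eps = 1e-6
numeric = (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
print(analytic, numeric)             # the two values agree closely
```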

Question 5: [2+2 Marks] Consider an activation function ρ(x) = x · σ(x), where σ(x) is the
sigmoid function.
(a) Compute ρ′(x) in terms of σ(x).
(b) For large x, compare ρ(x) and ρ′(x) with standard activation functions. (No derivation
required.)
Part (a): Given that σ′(x) = σ(x)(1 − σ(x)), by the product rule:

ρ′(x) = x′σ(x) + xσ′(x) = σ(x) + xσ(x)(1 − σ(x)) = σ(x)[1 + x(1 − σ(x))]

Part (b)

1. For large values of x, σ(x) ≈ 1, thus ρ(x) ≈ x and the function mimics the ReLU activation.

2. For large values of x, (1 − σ(x)) ≈ 0, so x(1 − σ(x)) → 0 and [1 + x(1 − σ(x))] ≈ 1; hence
ρ′(x) ≈ σ(x), i.e., the derivative mimics the sigmoid activation.
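As a quick sanity check of part (a) (not required by the exam), a few lines of numpy compare the closed form against a finite difference; ρ here is the SiLU/swish activation:

```python
import numpy as np

# Numerical check: compare sigma(x)[1 + x(1 - sigma(x))] to a finite
# difference of rho(x) = x * sigma(x) (the SiLU/swish activation).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rho = lambda x: x * sigmoid(x)

x = np.linspace(-6, 6, 7)
closed_form = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
eps = 1e-6
finite_diff = (rho(x + eps) - rho(x - eps)) / (2 * eps)
print(np.allclose(closed_form, finite_diff, atol=1e-5))  # True
```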

Question 6: [2 Marks] Consider an ensemble of 5 models which are trained on the same ar-
chitecture but have different initializations for a handwritten digit classification task. Does the
guarantee of better performance in expectation (in terms of cross-entropy loss) by averaging
the predictions of all five networks hold if you instead average the weights and biases of the
networks? Why or why not?

Solution: No, the guarantee does not hold, because the loss is not convex with respect to the
weights and biases. Networks trained from different initializations may learn different hidden
representations (for example, the same features in a different hidden-unit order), so averaging
the weights and biases does not correspond to averaging the networks’ predictions and can yield
an arbitrarily bad model.
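A toy numpy sketch (arbitrary sizes, illustrative only) shows one concrete failure mode: two networks identical up to a permutation of hidden units compute the same function, but their weight average does not:

```python
import numpy as np

# Two networks that are identical up to a permutation of hidden units
# compute the same function, yet their weight-average computes a different one.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
P = np.eye(8)[rng.permutation(8)]          # hidden-unit permutation
W1p, W2p = P @ W1, W2 @ P.T                # same function, reordered units

x = rng.normal(size=4)
f = lambda A, B: (B @ relu(A @ x)).item()
print(np.isclose(f(W1, W2), f(W1p, W2p)))  # True: identical predictions
print(f((W1 + W1p) / 2, (W2 + W2p) / 2))   # generally != f(W1, W2)
```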

Question 7: [2 Marks] Prove that approximately 63% of the entire original dataset (total
training set) is present in any of the sampled bootstrap datasets using the Bagging method.

Solution: We have a dataset with n observations. In bootstrap sampling, we draw with
replacement from this original dataset n times to create a new dataset of the same size n.
For any single observation, the probability of not being chosen on a given draw is: P(not
chosen) = 1 − 1/n [1 mark]
Since each draw is independent of the others, the probability of not being chosen in all n draws
is: P(not chosen in all n draws) = (1 − 1/n)^n
Taking the limit as n → ∞ (which is typically the case in real applications):
lim_{n→∞} (1 − 1/n)^n = e^{−1} ≈ 0.368
Probability of appearing = 1 − 0.368 ≈ 0.632, i.e., roughly 63% [1 mark]
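An empirical check (illustrative simulation, arbitrary n): the fraction of distinct original points appearing in a bootstrap sample concentrates near 1 − e^{−1} ≈ 0.632:

```python
import numpy as np

# Fraction of distinct original points that appear in a bootstrap sample
# of size n, averaged over a few resamples.
rng = np.random.default_rng(0)
n, trials = 10_000, 20
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n
             for _ in range(trials)]
print(np.mean(fractions))  # ~0.632, matching 1 - e^{-1}
```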

Question 8: [2 Marks] Consider the given distance metric:

d(x, y) = Σ_i (x_i − y_i) / (x_i + y_i)

Enumerate and explain the desirable properties of a distance metric. Evaluate the given distance
metric against these properties, demonstrating mathematically or through examples how the
metric adheres to or fails to meet each of the properties.

Figure 2: Solution to Q8

Alternate solution for proving invalidity: when x_i = −y_i, the denominator vanishes and the
distance is infinite (undefined), violating the requirements of a metric.
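A short script (toy vectors, assuming the summand (x_i − y_i)/(x_i + y_i) as reconstructed above) demonstrates the failures numerically:

```python
# Test the candidate metric's properties on toy vectors (the exam figure
# has the full written solution).
def d(x, y):
    return sum((xi - yi) / (xi + yi) for xi, yi in zip(x, y))

x, y = [1.0, 2.0], [3.0, 4.0]
print(d(x, x))               # 0.0: d(x, x) = 0 holds trivially
print(d(x, y), d(y, x))      # opposite signs: symmetry and non-negativity fail
# And with x_i = -y_i the denominator is zero -> division by zero (infinite).
```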

Question 9: [3 Marks] For a CNN-based classifier, calculate the number of weights, number
of biases, and the size of the associated feature maps for each layer, following the notation:
• CONV-K-N denotes a convolutional layer with N filters, each of size K × K. Padding and
stride parameters are always 0 and 1, respectively.
• POOL-K indicates a K × K pooling layer with stride K and padding 0.
• FC-N stands for a fully-connected layer with N neurons.
Successively (layer names inferred from the parameter counts; the full architecture is in
Figure 4, with a 128 × 128 × 3 input):

CONV-9-32: feature map 120 × 120 × 32; parameters 32 × (9 × 9 × 3 + 1)
POOL-2: feature map 60 × 60 × 32; parameters 0
CONV-5-64: feature map 56 × 56 × 64; parameters 64 × (5 × 5 × 32 + 1)
POOL-2: feature map 28 × 28 × 64; parameters 0
CONV-5-64: feature map 24 × 24 × 64; parameters 64 × (5 × 5 × 64 + 1)
POOL-2: feature map 12 × 12 × 64; parameters 0
FC-3: output size 3; parameters 3 × (12 × 12 × 64 + 1)

Figure 3: Question 4
Figure 4: Question 9
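The sizes and parameter counts above can be reproduced mechanically; the sketch below assumes a 128 × 128 × 3 input, inferred from the solution’s numbers rather than stated in the text:

```python
# Reproduce the feature-map sizes and parameter counts for the stack above.
def conv(size, in_ch, k, n):      # CONV-K-N, stride 1, padding 0
    return size - k + 1, n, n * (k * k * in_ch + 1)

def pool(size, ch, k):            # POOL-K, stride K, padding 0
    return size // k, ch, 0

size, ch = 128, 3                 # assumed input: 128 x 128 x 3
for layer in ("CONV-9-32", "POOL-2", "CONV-5-64", "POOL-2",
              "CONV-5-64", "POOL-2"):
    kind, *nums = layer.split("-")
    if kind == "CONV":
        size, ch, params = conv(size, ch, int(nums[0]), int(nums[1]))
    else:
        size, ch, params = pool(size, ch, int(nums[0]))
    print(f"{layer}: {size} x {size} x {ch}, params = {params}")
print("FC-3: params =", 3 * (size * size * ch + 1))
```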

Question 10: [4 Marks] Discuss the role of activation functions in mitigating exploding and
vanishing gradient problems. Provide examples of activation functions that are more or less
prone to these issues and explain why.
Exploding and Vanishing Gradient Issues:
Exploding Gradient: Occurs when gradients become extremely large during backpropagation,
leading to unstable learning; this typically happens when the weights become very large.
Vanishing Gradient: Occurs when gradients become too small, causing slow or halted learning
in deep networks.
Activation Functions:
Sigmoid and tanh: Their derivatives are bounded (at most 0.25 for sigmoid and 1 for tanh)
and shrink toward zero away from the origin, so repeated multiplication across layers makes
them prone to vanishing gradients, limiting their use in deep networks.
ReLU, Leaky ReLU, PReLU, ELU: Have derivative 1 over their active region, so they do not
systematically shrink gradients; they are designed to alleviate vanishing gradients, making them
suitable for deep architectures.
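A minimal illustration (pre-activations held fixed at x = 1 across 20 hypothetical layers, purely for demonstration) shows how the product of derivatives behaves:

```python
import numpy as np

# Product of activation derivatives across depth: sigmoid's derivative is at
# most 0.25, so the product decays geometrically; ReLU's derivative is 1 on
# its active region, so the product survives.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = 1.0
sig_grad, relu_grad = 1.0, 1.0
for _ in range(20):  # 20 layers, pre-activations held at x for simplicity
    s = sigmoid(x)
    sig_grad *= s * (1 - s)   # sigmoid'(x)
    relu_grad *= 1.0          # relu'(x) = 1 for x > 0
print(sig_grad, relu_grad)    # ~(0.197)^20 ≈ 1e-14 vs. 1.0
```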

Question 11: [2 Marks] Express the derivative of a sigmoid in terms of the sigmoid itself for
positive constants a and b:
(a) A purely positive sigmoid: φj(v) = 1 / (1 + exp(−av))
(b) An antisymmetric sigmoid: φj (v) = a tanh(bv)
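A standard derivation (well-known identities, stated here because no written solution follows in the key) expresses each derivative in terms of φ itself:

(a) φ′(v) = a exp(−av) / (1 + exp(−av))² = a φ(v)(1 − φ(v))

(b) φ′(v) = ab sech²(bv) = ab(1 − tanh²(bv)) = (b/a)(a² − φ(v)²)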

Question 12:[2 Marks] Consider the following patterns, each having four binary-valued at-
tributes:
ω1 1100 0000 1010 0011
ω2 1100 1111 1110 0111
Note especially that the first patterns in the two categories are the same. Identify the root
node feature for a binary classification tree for this data so that the leaf nodes have the lowest
impurity possible.

Figure 5: Q12, all the calculations for entropy/impurity should be shown.

Figure 6: Q12, final decision tree.
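A short script (a reconstruction; the official calculations are in Figure 5) computes the information gain of each attribute and identifies the root feature:

```python
import math

# Entropy-based information gain for each of the four binary attributes,
# using the patterns from the question.
w1 = ["1100", "0000", "1010", "0011"]  # class omega_1
w2 = ["1100", "1111", "1110", "0111"]  # class omega_2
data = [(p, 0) for p in w1] + [(p, 1) for p in w2]

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p)
                                         + (1 - p) * math.log2(1 - p))

parent = entropy([c for _, c in data])
for i in range(4):
    split = {b: [c for p, c in data if p[i] == b] for b in "01"}
    child = sum(len(v) / len(data) * entropy(v) for v in split.values())
    print(f"attribute {i + 1}: information gain = {parent - child:.3f}")
# The attribute with the highest gain is the root feature.
```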

Question 13: [2 Marks] In the context of behavior simulation and content simulation tasks,
discuss the implications of the observed performance gap between LCBM and large content-only
models like GPT-3.5 and GPT-4. What insights can be drawn from this performance difference,
and how does it contribute to our understanding of the effectiveness of including behavior tokens
in language model training? Provide a brief analysis of the observed trends as presented during

Page 5
Figure 5: Q12, All the calculations for entropy/impurity should be there.

Figure 6: Q12. Final Decision Tree

the lecture.
Solution (see https://arxiv.org/pdf/2309.00359.pdf for more details):

• LCBM, while being 10x smaller than GPT-3.5 and 4, performs better than them on all
behavior-related tasks.

• Further, we see that there is no significant difference between 10-shot and 2-shot GPT-4
or between GPT-3.5 and GPT-4, indicating that unlike other tasks, it is harder to achieve
good performance through in-context learning on the behavior modality.

• It can be observed that often GPT-3.5 and 4 achieve performance comparable to (or worse
than) random baselines. Interestingly, the performance of GPTs on the content simulation
task is also substantially behind LCBM.

• Given the way the content simulation task is formulated (Listing 5), substantial performance
could be achieved through strong content knowledge alone, with behavior contributing little
variance.

• Even so, we still see a substantial performance gap between the two models. All of this
indicates that large models like GPT-3.5 and GPT-4 are not trained on behavior tokens.

Question 14: [2 Marks] Consider a scenario where a company is developing an AI-based
conversational chatbot focused on mental health support. Before deploying this chatbot, outline
the critical considerations the company should take into account and discuss each of the following
aspects:
• Controlled Generation vs. Free Flow QA: Compare and contrast the advantages and disad-
vantages of implementing a controlled generation approach versus a free-flow question-answer
model in the context of a mental health chatbot.
• Industry-Specific Approvals: Adherence to industry-specific regulations such as HIPAA,
GDPR, NIST, etc., holds significant implications. Explain the implications of adhering to
these regulations and the measures the company should take to ensure compliance in the
development and deployment of the mental health chatbot.
Controlled Generation: Advantages:

• Precision and predictability in responses, Lower risk of generating inappropriate or harmful
content, Aligns with ethical and regulatory standards.

Disadvantages:

• Limited flexibility in responding to diverse user inputs, Potential to miss nuanced expres-
sions and unique user needs, May feel less conversational and empathetic.

Free Flow QA: Advantages:

• Enhanced flexibility in addressing various user inputs, Can simulate more natural and
empathetic conversations, Better adaptation to users’ emotional states.

Disadvantages:

• Higher risk of generating inappropriate or unsafe content, Difficulty in maintaining control
over the conversation, Challenges in aligning with regulatory standards.

0.25 marks for any one advantage and one disadvantage of each type (0.25 × 4 = 1 mark).

• Implications: Ensures the protection of sensitive health information, Guarantees user pri-
vacy and control over personal data, Sets standards for cybersecurity and data protection.
Give 0.5 marks for any one mentioned.

• To ensure compliance: Implement robust encryption for data transmission, Ensure secure
storage and access controls for user data, Obtain clear consent for data collection and
usage, or something along these lines. 0.5 marks.
