
Introduction to Machine Learning - Unit 8 - Week 5

Week 5 : Assignment 5
The due date for submitting this assignment has passed.

Due on 2025-02-26, 23:59 IST.
Assignment submitted on 2025-02-24, 19:38 IST.
1) Consider a feedforward neural network that performs regression on a p-dimensional input to produce a scalar output. It has m hidden layers and each of these layers has k hidden units. What is the total number of trainable parameters in the network? Ignore the bias terms. (1 point)

pk + mk^2 + k

pk + (m - 1)k^2 + k

p + (m - 1)pk + k^2

p^2 + (m - 1)pk + k^2

Yes, the answer is correct.
Score: 1
Accepted Answers:
pk + (m - 1)k^2 + k
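
A quick way to sanity-check the accepted count is to list the bias-free weight matrices the question describes and sum their sizes. This is a minimal Python sketch; the concrete values of p, m and k are arbitrary choices, not part of the question.

# shapes of the weight matrices, ignoring biases:
# input (p) -> first hidden layer (k), then (m - 1) hidden-to-hidden layers, then k -> scalar output
p, m, k = 7, 4, 5                                   # arbitrary example sizes
shapes = [(k, p)] + [(k, k)] * (m - 1) + [(1, k)]
total = sum(rows * cols for rows, cols in shapes)
assert total == p * k + (m - 1) * k ** 2 + k        # matches the accepted option
print(total)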

2) Consider a neural network layer defined as y = ReLU(Wx). Here x ∈ R^p is the input, y ∈ R^d is the output and W ∈ R^(d×p) is the parameter matrix. The ReLU activation (defined as ReLU(z) := max(0, z) for a scalar z) is applied element-wise to Wx. Find ∂y_i/∂W_ij, where i = 1, ..., d and j = 1, ..., p. In the following options, I(condition) is an indicator function that returns 1 if the condition is true and 0 if it is false. (1 point)

I(∑_{k=1}^p W_ik x_k ≤ 0) x_i

I(∑_{k=1}^p W_ik x_k > 0) x_j

I(∑_{k=1}^p W_ik x_k > 0) W_ij x_j

I(∑_{k=1}^p W_ik x_k ≤ 0) W_ij x_j

Yes, the answer is correct.
Score: 1
Accepted Answers:
I(∑_{k=1}^p W_ik x_k > 0) x_j
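
The accepted derivative can be verified numerically: perturb a single entry W_ij, recompute y_i, and compare the finite-difference slope against I(∑_k W_ik x_k > 0) x_j. A minimal sketch; the sizes, seed and the chosen (i, j) are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 4                                        # arbitrary layer sizes
W = rng.normal(size=(d, p))
x = rng.normal(size=p)

relu = lambda z: np.maximum(0.0, z)
y = relu(W @ x)

i, j, eps = 1, 2, 1e-6
W_pert = W.copy()
W_pert[i, j] += eps
numeric = (relu(W_pert @ x)[i] - y[i]) / eps       # finite-difference slope of y_i

analytic = float(W[i] @ x > 0) * x[j]              # I(sum_k W_ik x_k > 0) * x_j
print(numeric, analytic)                           # the two values agree up to ~1e-6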

3) Consider a two-layered neural network y = σ(W^(B) σ(W^(A) x)). Let h = σ(W^(A) x) denote the hidden layer representation. W^(A) and W^(B) are arbitrary weights. Which of the following statement(s) is/are true? Note: ∇_g(f) denotes the gradient of f w.r.t. g. (1 point)

∇_h(y) depends on W^(A)

∇_{W^(A)}(y) depends on W^(B)

∇_{W^(A)}(h) depends on W^(B)

∇_{W^(B)}(y) depends on W^(A)

Yes, the answer is correct.
Score: 1
Accepted Answers:
∇_{W^(A)}(y) depends on W^(B)
∇_{W^(B)}(y) depends on W^(A)
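
A small chain-rule check of the accepted statements, written as a sketch that assumes a scalar output (so W^(B) reduces to a single row w_B); the sizes and seed are arbitrary. The explicit w_B factor in the gradient w.r.t. W^(A) is the dependence the answer refers to, and symmetrically the gradient w.r.t. W^(B) involves h = σ(W^(A) x).

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p, k = 4, 3                                        # arbitrary sizes
x = rng.normal(size=p)
W_A = rng.normal(size=(k, p))
w_B = rng.normal(size=k)                           # scalar-output second layer (assumption)

def forward(W_A_mat):
    h = sigmoid(W_A_mat @ x)
    return sigmoid(w_B @ h), h

y, h = forward(W_A)

# chain rule: dy/dW_A[i, j] = sigma'(w_B @ h) * w_B[i] * sigma'((W_A x)[i]) * x[j];
# the w_B factor shows that the gradient w.r.t. W^(A) depends on W^(B)
grad = (y * (1 - y)) * np.outer(w_B * h * (1 - h), x)

# finite-difference spot check of one entry
i, j, eps = 2, 1, 1e-6
W_pert = W_A.copy()
W_pert[i, j] += eps
print(grad[i, j], (forward(W_pert)[0] - y) / eps)  # the two values match closely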
4) Which of the following statement(s) about the initialization of neural network weights is/are true for a network that uses the sigmoid activation function? (1 point)

Two different initializations of the same network could converge to different minima

For a given initialization, gradient descent will converge to the same minima irrespective of the learning rate.

Initializing all weights to the same constant value leads to undesirable results

Initializing all weights to very large values leads to undesirable results

Yes, the answer is correct.
Score: 1
Accepted Answers:
Two different initializations of the same network could converge to different minima
Initializing all weights to the same constant value leads to undesirable results
Initializing all weights to very large values leads to undesirable results
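
On why constant initialization is undesirable: with identical incoming weights, every hidden unit computes the same activation and receives the same gradient, so the units remain identical after any update. A minimal sketch of one gradient step on a squared-error loss; the network shape, data and learning rate are arbitrary assumptions.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

p, k = 4, 3                                        # arbitrary sizes
x = np.ones(p)
t = 1.0                                            # arbitrary regression target
W1 = np.full((k, p), 0.5)                          # all weights set to the same constant
w2 = np.full(k, 0.5)

h = sigmoid(W1 @ x)                                # every hidden unit computes the same value
y = w2 @ h
err = y - t                                        # derivative of 0.5 * (y - t)^2 w.r.t. y

grad_W1 = np.outer(err * w2 * h * (1 - h), x)      # identical rows: identical per-unit gradients
W1 -= 0.1 * grad_W1                                # one gradient-descent step

print(np.allclose(W1, W1[0]))                      # True: the hidden units never differentiate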
5) Consider the following statements about the derivatives of the sigmoid (σ(x) = 1/(1 + exp(-x))) and tanh (tanh(x) = (exp(x) - exp(-x))/(exp(x) + exp(-x))) activation functions. Which of these statement(s) is/are correct? (1 point)

σ'(x) = σ(x)(1 - σ(x))

0 < σ'(x) ≤ 1/4

tanh'(x) = (1/2)(1 - (tanh(x))^2)

0 < tanh'(x) ≤ 1

Yes, the answer is correct.
Score: 1
Accepted Answers:
σ'(x) = σ(x)(1 - σ(x))
0 < σ'(x) ≤ 1/4
0 < tanh'(x) ≤ 1
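
A quick numerical check of the derivative identities and bounds (a sketch; the grid of x values is arbitrary). Note that the correct tanh derivative is 1 - tanh(x)^2 without the 1/2 factor, which is why the third statement is not accepted.

import numpy as np

x = np.linspace(-10, 10, 10001)
sig = 1.0 / (1.0 + np.exp(-x))

d_sig = sig * (1 - sig)                            # sigma'(x) = sigma(x)(1 - sigma(x))
d_tanh = 1 - np.tanh(x) ** 2                       # tanh'(x) = 1 - tanh(x)^2

print(d_sig.max())                                 # ~0.25, attained at x = 0
print(d_tanh.max())                                # ~1.0, attained at x = 0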
6) A geometric distribution is defined by the p.m.f. f(x; p) = (1 - p)^(x-1) p for x = 1, 2, .... Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the MLE of p. (1 point)

0.111

0.222

0.333

0.444

Yes, the answer is correct.
Score: 1
Accepted Answers:
0.222
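
For this parameterization of the geometric distribution the MLE is the reciprocal of the sample mean, p̂ = n / Σ x_i = 6 / 27 ≈ 0.222. A short sketch that also confirms this by brute-force maximization of the log-likelihood:

import numpy as np

samples = [4, 5, 6, 5, 4, 3]
p_hat = len(samples) / sum(samples)                # closed-form MLE: 1 / sample mean
print(round(p_hat, 3))                             # 0.222

# brute-force cross-check: maximize the log-likelihood over a grid of p values
ps = np.linspace(0.001, 0.999, 999)
loglik = [sum((x - 1) * np.log(1 - p) + np.log(p) for x in samples) for p in ps]
print(ps[int(np.argmax(loglik))])                  # ~0.222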

7) Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We draw samples from this distribution and compute an MAP estimate of p by assuming a prior distribution over p. Let N(μ, σ^2) denote a gaussian distribution with mean μ and variance σ^2. Distributions are normalized as needed. Which of the following statement(s) is/are true? (1 point)

If the prior is N(0.6, 0.1), we will likely require fewer samples for converging to the true value than if the prior is N(0.4, 0.1)

If the prior is N(0.4, 0.1), we will likely require fewer samples for converging to the true value than if the prior is N(0.6, 0.1)

With a prior of N(0.1, 0.001), the estimate will never converge to the true value, regardless of the number of samples used.

With a prior of U(0, 0.5) (i.e. uniform distribution between 0 and 0.5), the estimate will never converge to the true value, regardless of the number of samples used.

No, the answer is incorrect.
Score: 0
Accepted Answers:
If the prior is N(0.6, 0.1), we will likely require fewer samples for converging to the true value than if the prior is N(0.4, 0.1)
With a prior of U(0, 0.5) (i.e. uniform distribution between 0 and 0.5), the estimate will never converge to the true value, regardless of the number of samples used.
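
On the accepted U(0, 0.5) statement: with a prior that is zero outside [0, 0.5], the posterior is the likelihood restricted to that interval, so the MAP estimate is the MLE clipped into [0, 0.5] and can never reach the true value 0.7, however many samples are drawn. A minimal simulation sketch; the sample sizes and seed are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
true_p = 0.7

for n in (100, 10_000, 1_000_000):
    draws = rng.random(n) < true_p                 # Bernoulli(0.7) samples
    mle = draws.mean()
    # uniform prior on [0, 0.5]: posterior = likelihood restricted to [0, 0.5],
    # so the MAP is the MLE clipped into the prior's support
    map_uniform = min(max(mle, 0.0), 0.5)
    print(n, round(float(mle), 3), map_uniform)    # the MAP is stuck at 0.5 < 0.7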

8) Which of the following statement(s) about parameter estimation techniques is/are true? (1 point)

To obtain a distribution over the predicted values for a new data point, we need to compute an integral over the parameter space.

The MAP estimate of the parameter gives a point prediction for a new data point.

The MLE of a parameter gives a distribution of predicted values for a new data point.

We need a point estimate of the parameter to compute a distribution of the predicted values for a new data point.

Yes, the answer is correct.
Score: 1
Accepted Answers:
To obtain a distribution over the predicted values for a new data point, we need to compute an integral over the parameter space.
The MAP estimate of the parameter gives a point prediction for a new data point.
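
To make the accepted statements concrete, consider a Bernoulli likelihood with a Beta prior (an illustrative assumption, not part of the question): the predictive distribution ∫ P(x_new | p) P(p | data) dp has the closed form (a + heads)/(a + b + n), while the MAP estimate yields a single plug-in number. A sketch under those assumptions:

heads, n = 7, 10                                   # illustrative observed data
a, b = 2.0, 2.0                                    # Beta(2, 2) prior (an assumption)

# posterior over p is Beta(a + heads, b + n - heads);
# predictive P(x_new = 1 | data) = integral of p * posterior(p) dp = posterior mean
predictive = (a + heads) / (a + b + n)

# MAP point estimate (mode of the Beta posterior), used as a plug-in point prediction
p_map = (a + heads - 1) / (a + b + n - 2)

print(predictive, p_map)                           # 0.642..., 0.666...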

9) In a classification setting, it is common in machine learning applications to minimize the discrete cross entropy loss given by H_CE(p, q) = -Σ_i p_i log q_i, where p_i and q_i are the true and predicted distributions (p_i ∈ {0, 1} depending on the label of the corresponding entry). Which of the following statement(s) about minimizing the cross entropy loss is/are true? (1 point)

Minimizing H_CE(p, q) is equivalent to minimizing the (self) entropy H(q)

Minimizing H_CE(p, q) is equivalent to minimizing H_CE(q, p).

Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(p||q)

Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(q||p)

Yes, the answer is correct.
Score: 1
Accepted Answers:
Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(p||q)
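
The accepted option follows from the identity H_CE(p, q) = H(p) + D_KL(p || q): since H(p) does not depend on q, the two objectives differ only by a constant. A quick numerical check (the distributions are arbitrary; terms with p_i = 0 are dropped using the 0 log 0 = 0 convention):

import numpy as np

p = np.array([0.0, 1.0, 0.0])                      # one-hot true distribution
q = np.array([0.2, 0.5, 0.3])                      # arbitrary predicted distribution

nz = p > 0                                         # skip 0 * log 0 terms
h_ce = -np.sum(p[nz] * np.log(q[nz]))              # cross entropy H_CE(p, q)
h_p = -np.sum(p[nz] * np.log(p[nz]))               # entropy H(p) (0 for one-hot p)
d_kl = np.sum(p[nz] * np.log(p[nz] / q[nz]))       # KL divergence D_KL(p || q)

print(np.isclose(h_ce, h_p + d_kl))                # True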

10) Which of the following statement(s) about activation functions is/are NOT true? (1 point)

Non-linearity of activation functions is not a necessary criterion when designing very deep neural networks

Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the vanishing gradients problem

Using the ReLU activation function avoids all problems arising due to gradients being too small.

The dead neurons problem in ReLU networks can be fixed using a leaky ReLU activation function

Yes, the answer is correct.
Score: 1
Accepted Answers:
Non-linearity of activation functions is not a necessary criterion when designing very deep neural networks
Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the vanishing gradients problem
Using the ReLU activation function avoids all problems arising due to gradients being too small.
