Week 4
1. We have the following functions: x^3, ln(x), e^x, x and 4. Which of the following functions has the
steepest slope at x = 1?
a) x^3
b) ln(x)
c) e^x
d) 4
Answer: a)
Solution: Calculate the derivatives of the following functions at x = 1 and choose the function
with the highest absolute value.
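As a quick check, the derivatives at x = 1 can be evaluated symbolically; the snippet below is a minimal sketch using sympy (not part of the original question).

```python
import sympy as sp

x = sp.symbols('x')
functions = [x**3, sp.ln(x), sp.exp(x), x, sp.Integer(4)]

# Differentiate each function and evaluate the slope at x = 1.
for f in functions:
    slope = sp.diff(f, x).subs(x, 1)
    print(f, '-> slope at x = 1:', sp.N(slope))

# x**3 gives 3, ln(x) gives 1, exp(x) gives e ~ 2.718, x gives 1, and 4 gives 0,
# so x**3 has the steepest slope.
```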
2. Which of the following represents the contour plot of the function f(x, y) = x^2 - y^2?
a) [contour plot]
b) [contour plot]
c) [contour plot]
d) [contour plot]
Answer: a)
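The level sets of f(x, y) = x^2 - y^2 are hyperbolas around a saddle point at the origin, with positive levels opening along the x-axis and negative levels along the y-axis, which is the pattern in option a). The snippet below is a minimal sketch (assuming standard numpy/matplotlib) that reproduces such a contour plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the same range shown in the options.
xs = np.linspace(-4, 4, 200)
ys = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(xs, ys)
Z = X**2 - Y**2  # saddle surface: positive along the x-axis, negative along the y-axis

cs = plt.contour(X, Y, Z, levels=np.arange(-20, 21, 5))
plt.clabel(cs, inline=True)  # label each contour line with its level
plt.xlabel('x')
plt.ylabel('y')
plt.title('Contours of f(x, y) = x^2 - y^2')
plt.show()
```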
3. Choose the correct options for the given gradient descent update rule w_{t+1} = w_t − η∇w (η is the learning rate). (MSQ)
a) The weight update is tiny at a gentle loss surface
b) The weight update is tiny at a steep loss surface
c) The weight update is large at a steep loss surface
d) The weight update is large at a gentle loss surface
Answer: a), c)
Solution: The gradient is small on a gentle loss surface and large on a steep loss surface, and the gradient determines the size of the update.
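A minimal sketch of the update rule on a hypothetical quadratic loss (the loss, learning rate and starting point below are assumptions for illustration) shows that the step size tracks the gradient magnitude:

```python
def loss_grad(w):
    # Hypothetical loss L(w) = w**2; its gradient is 2*w.
    return 2 * w

eta = 0.1   # learning rate
w = 5.0     # start on a steep part of the loss surface

for t in range(5):
    g = loss_grad(w)
    step = eta * g      # the update size is proportional to the gradient
    w = w - step        # w_{t+1} = w_t - eta * grad
    print(f't={t}  grad={g:.3f}  step={step:.3f}  w={w:.3f}')

# Near the flat region around the minimum the gradient, and hence the update,
# is tiny; on steep regions both are large, matching options a) and c).
```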
4. Which of the following algorithms will result in more oscillations of the parameter during the
training process of the neural network?
Answer: a)
Solution: Since stochastic gradient descent updates the weights based on a single training example, each update is a poorer approximation of the true gradient than, say, a mini-batch or batch gradient update, and hence results in more oscillations.
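A small numpy sketch (hypothetical linear-regression data, not from the question) comparing a single-example gradient with the full-batch gradient illustrates why SGD steps oscillate more:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + noise, squared-error loss on a single weight w.
x = rng.normal(size=100)
y = 3 * x + 0.1 * rng.normal(size=100)
w = 0.0

def grad(w, xb, yb):
    # Gradient of the mean squared error 0.5 * (w*x - y)**2 with respect to w.
    return np.mean((w * xb - yb) * xb)

full_grad = grad(w, x, y)                                       # batch gradient
sgd_grads = [grad(w, x[i:i+1], y[i:i+1]) for i in range(100)]   # one example each

print('batch gradient:', full_grad)
print('per-example gradient std:', np.std(sgd_grads))
# Single-example gradients scatter widely around the batch gradient,
# so the parameter zig-zags (oscillates) more under SGD.
```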
Answer: c)
Solution: It gets stuck before reaching the local minimum, since the effective learning rate for a weight that receives dense (frequent) updates gets reduced aggressively as its squared gradients accumulate.
6. Which of the following is a variant of gradient descent that uses an estimate of the next
gradient to update the current position of the parameters?
a) Momentum optimization
b) Stochastic gradient descent
c) Nesterov accelerated gradient descent
d) Adagrad
Answer: c) Nesterov accelerated gradient descent
Solution: Nesterov accelerated gradient descent estimates the next position of the parameters and calculates the gradient at that look-ahead position. The new position is then determined using this look-ahead gradient together with the accumulated update history.
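A minimal sketch of the look-ahead idea in the standard NAG form (the loss, learning rate and momentum coefficient below are assumed values for illustration):

```python
def grad(w):
    # Hypothetical loss L(w) = w**2, gradient 2*w.
    return 2 * w

eta, gamma = 0.1, 0.9   # learning rate and momentum coefficient (assumed)
w, update = 5.0, 0.0

for t in range(10):
    w_lookahead = w - gamma * update     # estimated next position of the parameter
    g = grad(w_lookahead)                # gradient evaluated at the look-ahead position
    update = gamma * update + eta * g    # combine the accumulated history with that gradient
    w = w - update                       # move to the new position
    print(f't={t}  w={w:.4f}')
```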
Solution: Use the expression v_t = v_{t−1} + (∇w_t)^2 to get v_6 from the list of gradients given in the question, then calculate η_6 = η / sqrt(v_6 + ε).
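A small sketch of that accumulation (the gradient values, learning rate and epsilon below are hypothetical placeholders, not the list from the question):

```python
import math

eta, eps = 0.1, 1e-8                      # base learning rate and epsilon (assumed)
grads = [0.5, 0.3, 0.0, 0.2, 0.4, 0.1]    # hypothetical gradients for t = 1..6

v = 0.0
for t, g in enumerate(grads, start=1):
    v = v + g ** 2                        # v_t = v_{t-1} + (grad_t)**2
    eta_t = eta / math.sqrt(v + eps)      # effective learning rate at step t
    print(f't={t}  v_t={v:.4f}  eta_t={eta_t:.4f}')

# v_t only grows, so the effective learning rate eta_t keeps shrinking --
# the Adagrad behaviour the solution refers to.
```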
8. Which of the following can help avoid getting stuck in a poor local minimum while training a
deep neural network?
Answer: d)
Solution: None of the above methods can prevent the neural network from getting stuck in poor local minima.
9. What are the two main components of the ADAM optimizer?
a) Momentum and learning rate.
b) Gradient magnitude and previous gradient.
c) Exponential weighted moving average and gradient variance.
d) Learning rate and a regularization term.
Answer: c)
Solution: The two main components of the ADAM optimizer are the exponentially weighted moving average of the gradients and the gradient variance (the moving average of the squared gradients).
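A compact sketch of those two components in the usual Adam form (the hyperparameters are the commonly used defaults and the loss is a hypothetical quadratic, both assumed here):

```python
import math

def grad(w):
    # Hypothetical loss L(w) = w**2, gradient 2*w.
    return 2 * w

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # assumed default hyperparameters
w, m, v = 5.0, 0.0, 0.0

for t in range(1, 6):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # EWMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2   # EWMA of squared gradients (gradient variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    print(f't={t}  w={w:.5f}')
```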
10. What is the role of activation functions in deep learning?
(a) Activation functions transform the output of a neuron into a non-linear function,
allowing the network to learn complex patterns.
(b) Activation functions make the network faster by reducing the number of iterations
needed for training.
(c) Activation functions are used to normalize the input data.
(d) Activation functions are used to compute the loss function.
Answer: a)
Solution: Activation functions apply a non-linear transformation to a neuron's output, which is important for learning complex patterns. Without activation functions, neural networks would be limited to linear transformations of the input data.
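A tiny numpy sketch (with hypothetical random weights) showing that without a non-linear activation, two stacked layers collapse into a single linear map, whereas adding ReLU breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))   # hypothetical layer weights
x = rng.normal(size=3)

relu = lambda z: np.maximum(z, 0)

linear_stack = W2 @ (W1 @ x)       # no activation between the layers
collapsed    = (W2 @ W1) @ x       # a single linear layer gives the same result
nonlinear    = W2 @ relu(W1 @ x)   # with ReLU the composition is no longer linear

print(np.allclose(linear_stack, collapsed))   # True: stacked linear layers add no power
print(np.allclose(nonlinear, collapsed))      # generally False
```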