
DEEP LEARNING WEEK 4

1. We have the following functions: x^3, ln(x), e^x, x, and 4. Which of the following functions has the
steepest slope at x = 1?
a) x^3
b) ln(x)
c) e^x
d) 4
Answer: a)
Solution: Calculate the derivative of each function at x = 1 and choose the one with the
highest absolute value. The slopes at x = 1 are 3, 1, e ≈ 2.72, 1, and 0 respectively, so x^3 is the steepest.
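A quick numerical check of this solution (a minimal Python sketch, not part of the original key):

import math

# Derivatives of the candidate functions evaluated at x = 1:
# d/dx x^3 = 3x^2, d/dx ln(x) = 1/x, d/dx e^x = e^x, d/dx x = 1, d/dx 4 = 0.
x = 1.0
slopes = {
    "x^3": 3 * x**2,     # 3.0
    "ln(x)": 1 / x,      # 1.0
    "e^x": math.exp(x),  # ~2.718
    "x": 1.0,
    "4": 0.0,
}
print(max(slopes, key=lambda k: abs(slopes[k])))  # prints: x^3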
2. Which of the following represents the contour plot of the function f(x, y) = x^2 − y^2?
[Four candidate contour plots, labelled a)–d), drawn over the range −4 ≤ x, y ≤ 4; the plot
images are not reproduced here.]
Answer: a)
Solution: f(x, y) = x^2 − y^2 is a saddle, so its level curves x^2 − y^2 = c are hyperbolas,
opening along the x-axis for c > 0 and along the y-axis for c < 0, and degenerating to the
lines y = ±x at c = 0.
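A short matplotlib sketch (assuming numpy and matplotlib are available; illustrative only)
reproduces the saddle-shaped contours of f(x, y) = x^2 − y^2 over the same range as the options:

import numpy as np
import matplotlib.pyplot as plt

# Contour plot of f(x, y) = x^2 - y^2 on [-4, 4] x [-4, 4].
x = np.linspace(-4, 4, 200)
y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y**2

cs = plt.contour(X, Y, Z, levels=15)  # hyperbolic level curves of a saddle
plt.clabel(cs, inline=True, fontsize=8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("f(x, y) = x^2 - y^2")
plt.show()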

3. Choose the correct options for the gradient descent update rule w_{t+1} = w_t − η∇w_t (η is
the learning rate). (MSQ)
a) The weight update is tiny at a gentle loss surface
b) The weight update is tiny at a steep loss surface
c) The weight update is large at a steep loss surface
d) The weight update is large at a gentle loss surface
Answer: a), c)
Solution: The gradient is small on a gentle loss surface and large on a steep one, and the
gradient determines the size of the update.
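A minimal sketch of the update rule (the quadratic loss f(w) = w^2 here is just a hypothetical
example) makes the dependence of the step size on the gradient explicit:

def gradient_descent_step(w, grad, eta=0.1):
    # w_{t+1} = w_t - eta * grad: the step scales with the gradient,
    # so it is tiny on gentle regions and large on steep regions.
    return w - eta * grad

w = 4.0
for _ in range(5):
    w = gradient_descent_step(w, grad=2 * w)  # gradient of f(w) = w^2
print(round(w, 4))  # moves toward the minimum at 0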

4. Which of the following algorithms will result in more oscillations of the parameters during the
training process of the neural network?
a) Stochastic gradient descent
b) Mini-batch gradient descent
c) Batch gradient descent
d) Batch NAG
Answer: a)
Solution: In stochastic gradient descent the weights are updated from a single training example,
so each step uses a much noisier approximation of the true gradient than mini-batch or batch
gradient descent, which results in more oscillations.
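To see why per-example updates are noisier, here is a small sketch with a synthetic one-parameter
regression problem (the data and step size are arbitrary, purely for illustration); it contrasts
a sweep of single-example SGD steps with one full-batch step:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100,))
y = 3 * X + rng.normal(scale=0.5, size=100)  # noisy targets for a 1-parameter model

def grad(w, xb, yb):
    # Gradient of mean squared error for the model y ~ w * x.
    return 2 * np.mean((w * xb - yb) * xb)

eta, w_sgd, w_batch = 0.05, 0.0, 0.0
for i in range(len(X)):
    w_sgd -= eta * grad(w_sgd, X[i:i+1], y[i:i+1])  # one example per step: noisy updates
w_batch -= eta * grad(w_batch, X, y)                # all examples: one smooth step
print(round(w_sgd, 3), round(w_batch, 3))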

5. Which of the following are among the disadvantages of Adagrad?
a) It doesn't work well for sparse matrices.
b) It usually goes past the minima.
c) It gets stuck before reaching the minima.
d) Weight updates are very small at the initial stages of the algorithm.
Answer: c)
Solution: Adagrad can get stuck before reaching the minimum because the accumulated squared
gradients of densely updated weights keep growing, so their effective learning rate decays
sharply and the updates become vanishingly small.
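A minimal Adagrad sketch (hypothetical gradient values, illustrative only) shows the effective
learning rate shrinking for a densely updated weight:

import math

eta, eps = 0.1, 1e-8
v, w = 0.0, 1.0
for t, g in enumerate([0.9, 0.8, 0.85, 0.9, 0.95]):  # large gradients at every step
    v += g ** 2                           # accumulated squared gradients only grow
    effective_lr = eta / math.sqrt(v + eps)
    w -= effective_lr * g
    print(t, round(effective_lr, 4))      # effective learning rate shrinks monotonically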

6. Which of the following is a variant of gradient descent that uses an estimate of the next
gradient to update the current position of the parameters?
a) Momentum optimization
b) Stochastic gradient descent
c) Nesterov accelerated gradient descent
d) Adagrad
Answer: c) Nesterov accelerated gradient descent
Solution: Nesterov accelerated gradient descent first estimates the next position of the
parameters using the momentum history, computes the gradient at that look-ahead position,
and then combines this gradient with the momentum history to determine the new position.
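A minimal NAG sketch (with the hypothetical quadratic loss f(w) = w^2, illustrative only) shows
the look-ahead step: the gradient is evaluated at the position predicted by the momentum history,
and that gradient together with the history gives the new position:

def nag_step(w, u_prev, grad_fn, eta=0.1, beta=0.9):
    # Look ahead along the momentum direction, evaluate the gradient there,
    # then combine it with the accumulated history to move.
    w_lookahead = w - beta * u_prev
    u = beta * u_prev + eta * grad_fn(w_lookahead)
    return w - u, u

grad_fn = lambda w: 2 * w  # gradient of f(w) = w^2
w, u = 4.0, 0.0
for _ in range(20):
    w, u = nag_step(w, u, grad_fn)
print(round(w, 4))  # approaches the minimum at 0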

7. Consider a gradient profile ∇W = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]. Assume
v_{-1} = 0, ε = 0, β = 0.9 and the learning rate η = 0.1. Suppose that we use the Adagrad
algorithm; what is the value of η_6 = η/sqrt(v_6 + ε)?
a) 0.03
b) 0.06
c) 0.08
d) 0.006
Answer: b)
Solution: Use the recurrence v_t = v_{t-1} + (∇W_t)^2 with the gradients listed in the question
to get v_6 = 1^2 + 0.9^2 + 0.6^2 + 0.01^2 + 0.1^2 + 0.2^2 + 0.5^2 ≈ 2.47, and then
η_6 = η/sqrt(v_6 + ε) ≈ 0.1/1.57 ≈ 0.06.
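The arithmetic can be checked with a few lines (the base learning rate η = 0.1 is an assumption
here; it is the value consistent with the stated answer of 0.06):

import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]
eta, eps = 0.1, 0.0       # eta = 0.1 assumed, consistent with the answer 0.06
v = 0.0
for g in grads[:7]:       # accumulate squared gradients up to and including t = 6
    v += g ** 2           # v_t = v_{t-1} + (grad_t)^2
eta_6 = eta / math.sqrt(v + eps)
print(round(v, 4), round(eta_6, 4))  # 2.4701, 0.0636 -> option b) 0.06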
8. Which of the following can help avoid getting stuck in a poor local minimum while training a
deep neural network?

(a) Using a smaller learning rate.


(b) Using a smaller batch size.
(c) Using a shallow neural network instead.
(d) None of the above.

Answer: (d)
Solution: None of the above methods can prevent the neural network from getting stuck in poor
local minima.
9. What are the two main components of the ADAM optimizer?
a) Momentum and learning rate.
b) Gradient magnitude and previous gradient.
c) Exponentially weighted moving average and gradient variance.
d) Learning rate and a regularization term.
Answer: c)
Solution: The two main components of the ADAM optimizer are an exponentially weighted moving
average of the gradients (first moment) and an exponentially weighted average of the squared
gradients, which estimates the gradient variance (second moment).
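A minimal Adam sketch (single scalar parameter, hypothetical quadratic loss, illustrative only)
shows the two components: an exponentially weighted moving average of the gradient (m) and of
the squared gradient (v), with the usual bias correction:

import math

def adam_step(w, m, v, grad, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # EMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2  # EMA of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 4.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # gradient of f(w) = w^2
print(round(w, 3))  # w has moved toward the minimum at 0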
10. What is the role of activation functions in deep learning?

(a) Activation functions transform the output of a neuron into a non-linear function,
allowing the network to learn complex patterns.
(b) Activation functions make the network faster by reducing the number of iterations
needed for training.
(c) Activation functions are used to normalize the input data.
(d) Activation functions are used to compute the loss function.

Answer: a)
Solution: Activation functions apply a non-linear transformation to a neuron's output, which is
what allows the network to learn complex patterns. Without activation functions, a stack of
layers would reduce to a single linear transformation of the input data.
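A small numpy sketch (random weights, purely illustrative) shows why the non-linearity matters:
two linear layers without an activation are exactly equivalent to a single linear layer, while
inserting a ReLU breaks that equivalence:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=(3,))

# Without an activation, two linear layers collapse to one linear map.
two_linear = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(two_linear, collapsed))  # True

# With a ReLU in between, the composition is no longer a single linear map.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
print(np.allclose(with_relu, collapsed))   # False (in general)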
