
DEEP LEARNING WEEK 4

1. Which step does Nesterov accelerated gradient descent perform before finding the update
size?
a) Increase the momentum
b) Estimate the next position of the parameters
c) Adjust the learning rate
d) Decrease the step size
Answer: b) Estimate the next position of the parameters
Solution: Nesterov accelerated gradient descent first estimates the next (look-ahead) position
of the parameters using the accumulated momentum, then computes the gradient at that
look-ahead position and uses it, together with the momentum history, to determine the new
position.
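A minimal sketch of one NAG step in Python (the function and parameter names here are
illustrative, not taken from any particular library):

```python
def nag_update(w, u_prev, grad_fn, lr=0.1, gamma=0.9):
    """One Nesterov accelerated gradient step (illustrative sketch).

    grad_fn returns the gradient of the loss at a given parameter vector;
    gamma is the momentum coefficient and u_prev the previous update.
    """
    w_lookahead = w - gamma * u_prev      # first estimate the next position
    grad = grad_fn(w_lookahead)           # gradient at the look-ahead point
    u = gamma * u_prev + lr * grad        # fold it into the momentum history
    return w - u, u
```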
2. Which parameter of vanilla gradient descent controls the step size in the direction of the
gradient?
a) Learning rate
b) Momentum
c) Gamma
d) None of the above
Answer: a) Learning rate
Solution: The learning rate determines the step size in vanilla gradient descent. Momentum
is not used in plain (vanilla) gradient descent.
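In code, the vanilla update is simply w ← w − η∇w, so η alone scales the step; a one-line
sketch (names are illustrative):

```python
def vanilla_gd_update(w, grad, eta=0.01):
    # The learning rate eta scales the step taken along the negative gradient.
    return w - eta * grad
```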
3. What does the distance between two contour lines on a contour map represent?
a) The change in the output of the function
b) The direction of the function
c) The rate of change of the function
d) None of the above
Answer: c) The rate of change of the function
Solution: The distance between two contour lines indicates the rate of change (steepness) of
the function: closely spaced contours mean the function changes rapidly, widely spaced
contours mean it changes slowly.
4. Which of the following represents the contour plot of the function f(x, y) = x² − y?
a)–d) [Four candidate contour plots over the range −4 ≤ x, y ≤ 4; the figures are not
reproduced here.]

Answer: b)
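One way to verify such a question is to draw the contours yourself; a minimal matplotlib
sketch for f(x, y) = x² − y as stated above (the number of levels is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# Contour plot of f(x, y) = x**2 - y over the same range as the options.
x = np.linspace(-4, 4, 200)
y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y

cs = plt.contour(X, Y, Z, levels=15)
plt.clabel(cs, inline=True, fontsize=8)  # label each contour with its level value
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```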

5. What is the main advantage of using Adagrad over other optimization algorithms?
a) It converges faster than other optimization algorithms.
b) It is less sensitive to the choice of hyperparameters (learning rate).
c) It is more memory-efficient than other optimization algorithms.
d) It is less likely to get stuck in local optima than other optimization algorithms.
Answer: b)
Solution: Adagrad automatically adapts the learning rate for each weight based on the
gradient history, which makes it less sensitive to the choice of hyperparameters than other
optimization algorithms. This can be especially useful when dealing with high-dimensional
datasets or complex models where the manual tuning of hyperparameters can be
time-consuming and error-prone.
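A minimal sketch of the per-parameter adaptation Adagrad performs (the function and
variable names are illustrative):

```python
import numpy as np

def adagrad_update(w, grad, v, lr=0.01, eps=1e-8):
    """One Adagrad step (illustrative sketch).

    v accumulates the squared gradients per parameter, so weights that have
    seen large gradients automatically get a smaller effective learning rate.
    """
    v = v + grad**2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```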
6. We are training a neural network using the vanilla gradient descent algorithm. We observe
that the change in weights is small in successive iterations. What are the possible causes of
this phenomenon? (MSQ)

a) η is large
b) ∇w is small
c) ∇w is large
d) η is small

Answer: (b), (d)
Solution: A small change in the weights means the quantity η∇w is small, which happens
when ∇w is small or when η is small.
7. You are given labeled data, which we call X, where rows are data points and columns are
features. One column has most of its values as 0. Which algorithm should we use here for
faster convergence to the optimal value of the loss function?

a) NAG
b) Adam
c) Stochastic gradient descent
d) Momentum-based gradient descent

Answer: b)
Solution: One of the feature columns is sparse, so the gradient for the corresponding weight
is mostly zero; Adam, which adapts the learning rate per parameter, works best here.
The moving averages used in Adam are initialized to zero, which can result in biased
estimates of the first and second moments of the gradient. To address this, Adam applies
bias-correction terms to the moving averages, which corrects for the initial bias and leads to
more accurate estimates of the moments.
8. What is the update rule for the ADAM optimizer?

a) wt = wt−1 − lr ∗ (mt / (√vt + ϵ))
b) wt = wt−1 − lr ∗ mt
c) wt = wt−1 − lr ∗ (mt / (vt + ϵ))
d) wt = wt−1 − lr ∗ (vt / (mt + ϵ))
Answer: a)
Solution: The update rule for the ADAM optimizer is w = w - lr * (m / (sqrt(v) + eps)),
where w is the weight, lr is the learning rate, m is the first-moment estimate, v is the
second-moment estimate, and eps is a small constant to prevent division by zero.
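The rule from option a), including the bias correction mentioned in question 7, can be
sketched as follows (the hyperparameter defaults are the commonly used ones, not prescribed
by the question):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch); t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction for the zero init
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```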
9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
b) Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch
gradient descent.
c) Mini-batch gradient descent gives us a better solution.
d) Mini-batch gradient descent can converge faster than batch gradient descent.
Answer: a) and d).
Solution: The advantage of using mini-batch gradient descent over batch gradient descent is
that it is more computationally efficient, allows for parallel processing of the training
examples, and can converge faster than batch gradient descent.
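A sketch of the mini-batch loop, assuming a grad_fn(w, X_batch, y_batch) that returns the
gradient of the loss on a batch (all names here are illustrative):

```python
import numpy as np

def minibatch_gd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent (illustrative sketch)."""
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)                   # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])  # one update per mini-batch
    return w
```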
10. Which of the following is a variant of gradient descent that uses an estimate of the next
gradient to update the current position of the parameters?
a) Momentum optimization
b) Stochastic gradient descent
c) Nesterov accelerated gradient descent
d) Adagrad
Answer: c) Nesterov accelerated gradient descent
Solution: Nesterov accelerated gradient descent first estimates the next (look-ahead) position
of the parameters using the accumulated momentum, then computes the gradient at that
look-ahead position and uses it, together with the momentum history, to determine the new
position.
