HW 3
In the previous homework, we proved that the gradient descent (GD) algorithm converges and derived its convergence rate. In this homework, we will add a momentum term and study how it affects the convergence rate. The optimization procedure of gradient descent with momentum is given below:
$$
\begin{aligned}
w_{t+1} &= w_t - \eta z_{t+1} \\
z_{t+1} &= (1 - \beta) z_t + \beta g_t,
\end{aligned} \tag{2}
$$
where $g_t = \nabla L(w_t)$, $\eta$ is the learning rate, and $\beta$ controls how much averaging we apply to the gradient. Note that when $\beta = 1$, the above procedure is just ordinary gradient descent.
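For concreteness, here is a minimal NumPy sketch of the procedure in Eq. 2; the function name gd_with_momentum and its arguments are illustrative placeholders rather than anything defined in this homework.

import numpy as np

def gd_with_momentum(grad, w0, eta, beta, num_steps):
    """Run the averaged-gradient update of Eq. 2 (illustrative sketch).

    grad : callable returning the gradient of L at a given w
    w0   : initial parameter vector
    eta  : learning rate
    beta : gradient-averaging coefficient (beta = 1 recovers plain GD)
    """
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)                 # running average of gradients
    for _ in range(num_steps):
        g = grad(w)                      # g_t = grad L(w_t)
        z = (1 - beta) * z + beta * g    # z_{t+1} = (1 - beta) z_t + beta g_t
        w = w - eta * z                  # w_{t+1} = w_t - eta z_{t+1}
    return w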
Let's investigate the effect of this change. We'll see that this modification can actually 'accelerate' convergence by allowing larger learning rates. Recall the least-squares setting from the previous homework, where $L(w) = \|Xw - y\|^2$: the gradient descent iteration and the optimum are
$$
w_{t+1} = \left(I - 2\eta X^T X\right) w_t + 2\eta X^T y \tag{3}
$$
$$
w^* = (X^T X)^{-1} X^T y \tag{4}
$$
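One way to see why this iteration converges geometrically: since Eq. 4 implies $X^T y = X^T X w^*$, subtracting $w^*$ from both sides of Eq. 3 gives the error recursion
$$
w_{t+1} - w^* = \left(I - 2\eta X^T X\right) w_t + 2\eta X^T X w^* - w^* = \left(I - 2\eta X^T X\right)\left(w_t - w^*\right),
$$
so the error is multiplied by the same matrix at every step.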
The geometric convergence rate (in the sense of what base appears in a bound of the form $\text{rate}^t$) of this procedure is
You saw on the last homework that if we choose the learning rate that maximizes Eq. 5, the optimal learning rate $\eta^*$ is
$$
\eta^* = \frac{1}{\sigma_{\min}^2 + \sigma_{\max}^2}, \tag{6}
$$
where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum singular values of the matrix $X$. The corresponding optimal rate is
$$
\text{optimal rate} = \frac{(\sigma_{\max}/\sigma_{\min})^2 - 1}{(\sigma_{\max}/\sigma_{\min})^2 + 1}. \tag{7}
$$
Therefore, how fast ordinary gradient descent converges is determined by the ratio between the maximum and minimum singular values, as shown above.
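As a quick consistency check of Eqs. 6 and 7 (using the error recursion above): in the basis of right singular vectors of $X$, the error component along the $i$-th direction is multiplied by $(1 - 2\eta\sigma_i^2)$ at every step, and substituting $\eta^*$ from Eq. 6 makes the extreme directions contract at the same worst-case rate,
$$
\left|1 - \frac{2\sigma_{\min}^2}{\sigma_{\min}^2 + \sigma_{\max}^2}\right| = \left|1 - \frac{2\sigma_{\max}^2}{\sigma_{\min}^2 + \sigma_{\max}^2}\right| = \frac{\sigma_{\max}^2 - \sigma_{\min}^2}{\sigma_{\max}^2 + \sigma_{\min}^2} = \frac{(\sigma_{\max}/\sigma_{\min})^2 - 1}{(\sigma_{\max}/\sigma_{\min})^2 + 1},
$$
which matches Eq. 7.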
Now, let's consider using momentum to smooth the gradients before taking a step, as in Eq. 2:
$$
\begin{aligned}
w_{t+1} &= w_t - \eta z_{t+1} \\
z_{t+1} &= (1 - \beta) z_t + \beta\left(2 X^T X w_t - 2 X^T y\right)
\end{aligned} \tag{8}
$$
We can use the SVD of the matrix $X = U \Sigma V^T$, where $\Sigma = \operatorname{diag}(\sigma_{\max}, \sigma_2, \ldots, \sigma_{\min})$ has the same (potentially rectangular) shape as $X$. This allows us to reparameterize the parameters $w_t$ and the averaged gradients $z_t$ as follows:
$$
\begin{aligned}
x_t &= V^T (w_t - w^*) \\
a_t &= V^T z_t.
\end{aligned} \tag{9}
$$
(a) Please rewrite Eq. 8 in terms of the reparameterized variables $x_t[i]$ and $a_t[i]$, where $x_t[i]$ and $a_t[i]$ are the $i$-th components of $x_t$ and $a_t$, respectively.
(b) Notice that the above $2 \times 2$ vector/matrix recurrence has no external input. We can derive the $2 \times 2$ system matrix $R_i$ from above such that
$$
\begin{bmatrix} a_{t+1}[i] \\ x_{t+1}[i] \end{bmatrix} = R_i \begin{bmatrix} a_t[i] \\ x_t[i] \end{bmatrix}. \tag{10}
$$
Derive $R_i$.
(c) Use a computer to symbolically find the eigenvalues of the matrix $R_i$ (a SymPy sketch is given after part (g) below).
When are they purely real? When are they repeated and purely real? When are they complex?
(d) For the case when the eigenvalues are repeated, what is the condition on $\eta$, $\beta$, $\sigma_i$ that keeps them stable (strictly inside the unit circle)? What is the highest learning rate $\eta$, as a function of $\beta$ and $\sigma_i$, that results in repeated eigenvalues?
(e) For the case when the eigenvalues are real, what is the condition on $\eta$, $\beta$, $\sigma_i$ that keeps them stable (strictly inside the unit circle)? What is the upper bound on the learning rate? Express it in terms of $\beta$ and $\sigma_i$.
(f) For the case when the eigenvalues are complex, what is the condition on $\eta$, $\beta$, $\sigma_i$ that keeps them stable (strictly inside the unit circle)? What is the highest learning rate $\eta$, as a function of $\beta$ and $\sigma_i$, that results in complex eigenvalues?
(g) Now, apply what you have learned to the following problem. Assume that $\beta = 0.1$ and we have a problem with two singular values satisfying $\sigma_{\max}^2 = 5$ and $\sigma_{\min}^2 = 0.05$. What learning rate $\eta$ should we choose to get the fastest convergence for gradient descent with momentum? Compare how many iterations it takes to get within 99.9% of the optimal solution (starting at 0) using this learning rate and momentum with how many it would take using ordinary gradient descent.
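For part (c), here is a minimal SymPy sketch of one way to set up the symbolic computation. The entries r11 through r22 are placeholders for the $R_i$ you derive in part (b), and the symbols eta, beta, sigma stand for $\eta$, $\beta$, $\sigma_i$; none of these names come from the homework itself.

import sympy as sp

# Symbols you will need once you substitute your R_i:
# learning rate, momentum coefficient, and the i-th singular value.
eta, beta, sigma = sp.symbols('eta beta sigma', positive=True)

# Placeholder entries: replace r11, ..., r22 with your derived R_i.
r11, r12, r21, r22 = sp.symbols('r11 r12 r21 r22')
R = sp.Matrix([[r11, r12],
               [r21, r22]])

# Eigenvalues are returned as a {eigenvalue: multiplicity} dictionary.
for val, mult in R.eigenvals().items():
    print(sp.simplify(val), '(multiplicity', mult, ')')

# Real vs. repeated vs. complex is governed by the sign of the discriminant
# of the characteristic polynomial, trace(R)^2 - 4*det(R).
disc = sp.simplify(R.trace()**2 - 4 * R.det())
print('discriminant:', disc)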
where $x$ is the input signal and $h$ is the impulse response (also referred to as the filter). Please note that the convolution operation is to 'flip and drag': the filter is flipped before being slid across the input. For neural networks, however, we simply implement the convolutional layer without flipping, and such an operation is called correlation. Interestingly, in a CNN the two operations are equivalent in practice, because the filter weights are learned (initialized and then updated during training): even if you implement 'true' convolution, you just end up learning the flipped kernel. In this question, we will follow the true (flip-and-drag) convolution definition.
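To illustrate the flip (the signals x and h below are arbitrary examples chosen for this note, not part of the assignment), correlating with a time-reversed kernel reproduces true convolution:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])     # example input signal
h = np.array([1.0, 0.0, -1.0])         # example filter / impulse response

conv = np.convolve(x, h)                        # true 'flip and drag' convolution
corr = np.correlate(x, h[::-1], mode='full')    # correlation with the flipped kernel

print(np.allclose(conv, corr))   # True: the two operations agree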
Now let's consider a rectangular signal of length L (sometimes also called the "rect" for short, or, alternatively, the "boxcar" signal). This signal is defined as:
$$
x(n) = \begin{cases} 1 & n = 0, 1, 2, \ldots, L-1 \\ 0 & \text{otherwise} \end{cases}
$$
Here’s an example plot for L = 7, with time indices shown from -2 to 8 (so some implicit zeros are shown):
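The plot itself is not reproduced here, but a minimal NumPy/Matplotlib sketch (an editorial illustration, not part of the original handout) that generates it for L = 7 over indices −2 through 8 is:

import numpy as np
import matplotlib.pyplot as plt

L = 7
n = np.arange(-2, 9)                          # time indices -2, ..., 8
x = np.where((n >= 0) & (n <= L - 1), 1, 0)   # rect / boxcar of length L

plt.stem(n, x)
plt.xlabel('n')
plt.ylabel('x(n)')
plt.title('Rect (boxcar) signal, L = 7')
plt.show()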
3. Feature Dimensions of Convolutional Neural Networks
In this problem, we compute the output feature shape of convolutional layers and pooling layers, which are the building blocks of a CNN. (A small PyTorch sanity-check sketch is given after part (d) below.) Let's assume that the input feature shape is W × H × C, where W is the width, H is the height, and C is the number of channels of the input feature.
(a) A convolutional layer has 4 hyperparameters: the filter size (K), the padding size (P), the stride step size (S), and the number of filters (F). How many weights and biases are in this convolutional layer? And what is the shape of the output feature that this convolutional layer produces?
(b) A pooling layer has 2 hyperparameters: the stride step size (S) and the filter size (K). What is the output feature shape that this pooling layer produces?
(c) Let's assume that we have a CNN model which consists of L successive convolutional layers, where the filter size is K and the stride step size is 1 for every convolutional layer. What is the receptive field size?
(d) Consider a downsampling layer (e.g., a pooling layer or a strided convolutional layer). In this problem, we investigate the pros and cons of downsampling layers. Such a layer reduces the output feature resolution, which implies that the output features lose a certain amount of spatial information. Therefore, when we design a CNN, we usually increase the number of channels to compensate for this loss. For example, if we apply a max pooling layer with kernel size 2 and stride 2, we increase the number of output channels by a factor of 2. If we apply this max pooling layer, by how much does the receptive field increase? Explain the advantage of decreasing the output feature resolution from the perspective of reducing the amount of computation.
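As noted in the introduction to this problem, here is a minimal PyTorch sanity-check sketch with arbitrary example hyperparameters (an editorial illustration, not part of the assignment); it lets you compare your shape and parameter-count answers against actual tensors.

import torch
import torch.nn as nn

# Arbitrary example: input with C = 3 channels, H = W = 32.
x = torch.randn(1, 3, 32, 32)            # (batch, C, H, W)

# Example hyperparameters: K = 5, P = 2, S = 1, F = 8 filters.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, stride=1, padding=2)
pool = nn.MaxPool2d(kernel_size=2, stride=2)   # K = 2, S = 2

y = conv(x)
z = pool(y)

print(y.shape)                             # output feature shape of the conv layer
print(z.shape)                             # output feature shape of the pooling layer
print(conv.weight.shape, conv.bias.shape)  # weight and bias shapes
print(sum(p.numel() for p in conv.parameters()))  # total parameter count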
(a) Implement the forward operation of the convolutional layer and the max pooling layer.
(b) Implement the three-layer ConvNet.
(c) Implement the forward operation of the spatial batch normalization layer.
(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (In the case of a homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here; you'll need to remember it for the self-grade form.
Contributors:
• Suhong Moon.
• Gabriel Goh.
• Anant Sahai.
• Dominic Carrano.
• Babak Ayazifar.
• Sukrit Arora.
• Fei-Fei Li.
• Sheng Shen.
• Jake Austin.
• Kevin Li.