CS5242 Neural Networks and Deep Learning: Quiz 1
A. The BP algorithm updates the parameters based on the gradients of the average loss w.r.t. the parameters.
B. To add a new operator in a neural network, we need to implement the forward and backward methods for the BP algorithm to call.
C. For chain rule I, suppose there is an operator $y = f(x)$, where $x$ and $y$ could be scalars or vectors or matrices. If the gradient of the final output $o$ w.r.t. $y$ is known, then $\frac{\partial o}{\partial x} = \frac{\partial o}{\partial y} \frac{\partial y}{\partial x}$.
D. For chain rule I, suppose the variables $v_1, v_2, \ldots, v_k$ ($k > 3$) follow the topological order; to compute $\frac{\partial v_k}{\partial v_1}$, we have to compute $\frac{\partial v_{k-1}}{\partial v_1}$ explicitly.
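A minimal sketch of the forward/backward operator pattern described in option B (my own illustration with a hypothetical Square operator, not part of the quiz); its backward method applies chain rule I from option C:

```python
import numpy as np

# Hypothetical operator exposing the forward/backward interface the BP algorithm calls.
class Square:
    def forward(self, x):
        self.x = x            # cache the input for the backward pass
        return x ** 2         # y = f(x) = x^2

    def backward(self, grad_y):
        # chain rule I: do/dx = do/dy * dy/dx, with dy/dx = 2x (elementwise)
        return grad_y * 2 * self.x

op = Square()
x = np.array([1.0, -3.0])
y = op.forward(x)                        # y = [1, 9]
grad_x = op.backward(np.ones_like(y))    # do/dx = [2, -6] when do/dy = [1, 1]
print(y, grad_x)
```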
A. Adding the squared L2 norm of the parameters into the loss function prevents the model
parameters from becoming large; therefore, it regularizes the model and reduces the chance
of overfitting.
B. If we do not initialize the parameters randomly in an MLP model, then all parameters will be the same throughout the training process.
C. The Adam algorithm always converges to better optima than mini-batch SGD with momentum.
D. Early stopping cannot regularize the model since it does not add any explicit constraint to the model.
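A small sketch of the symmetry issue behind option B (my own illustration, assuming a two-layer MLP with all parameters initialized to zero, a sigmoid hidden layer, and plain gradient descent on a squared-error loss): the hidden units receive identical gradients, so they stay identical to each other after every update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # toy inputs
t = rng.normal(size=(8, 1))               # toy targets

W1, b1 = np.zeros((3, 4)), np.zeros(4)    # zero init: all hidden units start identical
W2, b2 = np.zeros((4, 1)), np.zeros(1)

lr = 0.1
for _ in range(100):
    h = sigmoid(X @ W1 + b1)              # forward pass
    y = h @ W2 + b2
    grad_y = (y - t) / len(X)             # gradient of the (halved) mean squared error
    grad_W2, grad_b2 = h.T @ grad_y, grad_y.sum(0)
    grad_z = (grad_y @ W2.T) * h * (1 - h)
    grad_W1, grad_b1 = X.T @ grad_z, grad_z.sum(0)
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

# Every column of W1 (one hidden unit each) is still identical to the first column.
print(np.allclose(W1, W1[:, :1]))         # True
```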
7. Suppose we are applying SGD to train a model with two parameters $\mathbf{w} \in \mathbb{R}^2$, and squared L2 norm regularization ($\frac{\lambda}{2}\|\mathbf{w}\|^2$) is also applied. Given the learning rate 0.5, the parameter values as $\mathbf{w} = (1, 2)^T$, the gradient of the cross-entropy loss w.r.t. the parameters as $(2, -2)^T$, and the coefficient $\lambda = 1$; after one step of updating, $\mathbf{w} = (-0.5, 2)^T$. (2 points)
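The update can be checked directly (a verification sketch of my own, not quiz material): the gradient of $\frac{\lambda}{2}\|\mathbf{w}\|^2$ is $\lambda\mathbf{w}$, so one SGD step subtracts the learning rate times the sum of the cross-entropy gradient and $\lambda\mathbf{w}$.

```python
import numpy as np

w = np.array([1.0, 2.0])             # current parameters
grad_ce = np.array([2.0, -2.0])      # gradient of the cross-entropy loss w.r.t. w
lam, lr = 1.0, 0.5                   # regularization coefficient and learning rate

# total gradient = cross-entropy gradient + gradient of (lam/2)*||w||^2 = grad_ce + lam*w
w_new = w - lr * (grad_ce + lam * w)
print(w_new)                         # [-0.5  2. ]
```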
9. Suppose the Leaky ReLU is $f(x) = 0.01x$ if $x < 0$; $x$, otherwise. Given $\mathbf{x} = (2, -1)^T$, and the output is $\mathbf{h}$; if the gradient of the loss $L$ w.r.t. $\mathbf{h}$ is $\frac{\partial L}{\partial \mathbf{h}} = (1, 1)^T$, then $\frac{\partial L}{\partial \mathbf{x}} = (1, 0.01)^T$.
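The same result can be checked numerically (a verification sketch of my own, not quiz material): Leaky ReLU acts elementwise, so its local derivative is 1 where the input is positive and 0.01 where it is negative, and the backward pass multiplies the incoming gradient elementwise by that derivative.

```python
import numpy as np

def leaky_relu_forward(x, slope=0.01):
    # f(x) = 0.01*x if x < 0, else x (applied elementwise)
    return np.where(x < 0, slope * x, x)

def leaky_relu_backward(x, grad_h, slope=0.01):
    # dL/dx = dL/dh * f'(x), with f'(x) = 1 for x > 0 and 0.01 for x < 0
    return grad_h * np.where(x < 0, slope, 1.0)

x = np.array([2.0, -1.0])
h = leaky_relu_forward(x)                              # h = [ 2.   -0.01]
grad_x = leaky_relu_backward(x, np.array([1.0, 1.0]))
print(grad_x)                                          # [1.   0.01]
```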