CS6910_Tutorial5
VERSION I
3. An ordered network is a network where the state variables can be computed one at a time in a specified order. Given the ordered network below, give a formula for calculating the ordered derivative $\frac{\partial y_3}{\partial y_1}$.
[Figure: ordered network over the variables $y_1$, $y_2$ and $y_3$.]
(a)
$\dfrac{dy_3}{dy_1} = \dfrac{\partial y_3}{\partial y_2}\dfrac{dy_2}{dy_1} + \dfrac{\partial y_3}{\partial y_1}$
(b)
$\dfrac{dy_3}{dy_1} = \dfrac{\partial y_3}{\partial y_2}\dfrac{dy_2}{dy_1}\,\dfrac{\partial y_3}{\partial y_1}$
(c)
$\dfrac{dy_3}{dy_1} = \dfrac{\partial y_3}{\partial y_2}\dfrac{dy_2}{dy_1} - \dfrac{\partial y_3}{\partial y_1}$
(d)
None of the above.
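Whichever option you pick can be sanity-checked numerically: instantiate a small ordered network and compare each candidate formula against a finite-difference estimate of $\frac{dy_3}{dy_1}$. The functions below are illustrative assumptions, not the ones in the figure.

```python
import numpy as np

# Illustrative ordered network (the actual functions from the figure are not
# reproduced here): y2 depends on y1, and y3 depends on both y1 and y2.
def y2_of(y1):
    return np.sin(y1)

def y3_of(y1, y2):
    return y1 * y2 + y2 ** 2

def total_derivative(y1, eps=1e-6):
    """Finite-difference estimate of the ordered (total) derivative dy3/dy1."""
    f = lambda t: y3_of(t, y2_of(t))
    return (f(y1 + eps) - f(y1 - eps)) / (2 * eps)

y1 = 0.7
# Partial derivatives, holding the other argument fixed.
d_y3_d_y1 = y2_of(y1)              # partial y3 / partial y1 = y2
d_y3_d_y2 = y1 + 2 * y2_of(y1)     # partial y3 / partial y2 = y1 + 2*y2
d_y2_d_y1 = np.cos(y1)             # dy2/dy1

print("finite difference:", total_derivative(y1))
print("candidate (a):    ", d_y3_d_y2 * d_y2_d_y1 + d_y3_d_y1)
print("candidate (b):    ", d_y3_d_y2 * d_y2_d_y1 * d_y3_d_y1)
print("candidate (c):    ", d_y3_d_y2 * d_y2_d_y1 - d_y3_d_y1)
```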
4. Let φ1 (.) and φ2 (.) denote the sigmoid and the tanh functions re-
spectively. Tick the correct options.
(a) φ1 (-ν) = φ1 (ν) and φ2 (-ν) = 1 - φ2 (ν)
(b) φ1 (-ν) = -φ1 (ν) and φ2 (-ν) = 1 - φ2 (ν)
(c) φ1 (-ν) = 1 - φ1 (ν) and φ2 (-ν) = -φ2 (ν)
(d) None of the above.
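A quick numerical probe of the candidate identities (a minimal NumPy sketch, with $\phi_1$ the sigmoid and $\phi_2$ the tanh, as in the question):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-5, 5, 11)

# Evaluate both sides of each candidate identity; matching rows indicate
# which symmetry actually holds.
print("sigmoid(-v)   :", np.round(sigmoid(-v), 4))
print("1 - sigmoid(v):", np.round(1 - sigmoid(v), 4))
print("tanh(-v)      :", np.round(np.tanh(-v), 4))
print("-tanh(v)      :", np.round(-np.tanh(v), 4))
```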
Version I Page 2 of 9
5. Consider vectors $u, x \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times n}$. The derivative of a scalar $f$ w.r.t. a vector $x$ is itself a vector, given by
$$\nabla_x f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
Compute $\nabla_x\, u^T x$, $\nabla_x\, x^T x$ and $\nabla_x\, x^T A x$.
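To check a derived closed form, a central-difference estimate of each gradient can be computed numerically (a sketch; the dimension and the random $u$, $A$, $x$ are arbitrary):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

n = 4
rng = np.random.default_rng(0)
u = rng.normal(size=n)
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

# Compare these numerical gradients against whatever closed forms you derive.
print("grad of u^T x  :", numerical_gradient(lambda z: u @ z, x))
print("grad of x^T x  :", numerical_gradient(lambda z: z @ z, x))
print("grad of x^T A x:", numerical_gradient(lambda z: z @ A @ z, x))
```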
6. A fair coin results in either Head (1) or Tail (0) with equal probability.
What is the entropy of the random variable indicating the outcome
of the toss? If instead, we had a biased coin with P(H) = 0.7, does
the entropy increase or decrease?
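Both entropies follow directly from the definition $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$; a minimal sketch:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) random variable."""
    q = 1.0 - p
    terms = [t * np.log2(t) for t in (p, q) if t > 0]
    return -sum(terms)

print("H(fair coin, p = 0.5)  :", binary_entropy(0.5))
print("H(biased coin, p = 0.7):", binary_entropy(0.7))
```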
7. Recall the gradient descent update rule that comes from the Taylor series when at each step we ensure $L(\theta_{k+1}) < L(\theta_k)$, where $L$ is the loss function. Suppose we are dealing with a quadratic loss function. Can you come up with a better update rule such that we reach the minimum quickly?
Bonus question: Think about why this is not a widely used update rule even though it is much faster than gradient descent.
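A small harness for experimenting with update rules on a quadratic loss; the particular quadratic below is an assumption for illustration only, and the plain gradient-descent step is the baseline to improve upon:

```python
import numpy as np

# Illustrative quadratic loss L(theta) = 1/2 theta^T A theta - b^T theta,
# with A positive definite so the minimiser is A^{-1} b.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])

def loss(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

theta = np.array([5.0, -5.0])
eta = 0.1
for step in range(20):
    theta = theta - eta * grad(theta)   # baseline gradient-descent update

print("after 20 GD steps:", theta, "loss:", loss(theta))
print("true minimiser   :", np.linalg.solve(A, b))
# Replace the update above with your proposed rule and compare how many steps
# are needed to reach the minimiser.
```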
9. Consider the following computation:
[Figure: input $x$ fed through a unit $\sigma$ producing $f(x)$.]
$$f(x) = \tanh(w \cdot x + b)$$
(a) $\dfrac{\partial L}{\partial w} = 2(y - f(x))^3 f(x)(1 - f(x))^2 x$, $\quad \dfrac{\partial L}{\partial b} = 2(y - f(x))^3 f(x)(1 - f(x))^2$
(b) $\dfrac{\partial L}{\partial w} = 2(y - f(x))^3 (f(x)^2 - 1)\, x$, $\quad \dfrac{\partial L}{\partial b} = 2(y - f(x))^3 (f(x)^2 - 1)$
(c) $\dfrac{\partial L}{\partial w} = 2(f(x) - y)^3 f(x)^2$, $\quad \dfrac{\partial L}{\partial b} = 2(f(x) - y)^3 f(x)^2\, x$
(d) $\dfrac{\partial L}{\partial w} = 2(y - f(x))^3 f(x)(1 - f(x))\, y x$, $\quad \dfrac{\partial L}{\partial b} = 2(y - f(x))^3 f(x)(1 - f(x))\, y$
10. Let $f(x, y) = x^2 + \frac{y^2}{100}$. Gradient descent with a fixed step size $\eta$ is run for finding the minimum value of $f$ from an initial point $(x_0, y_0)$.
1. Give an expression for $(x_t, y_t)$ in terms of $\eta$ and $(x_0, y_0)$.
2. Let $(x_0, y_0) = (10, 0)$; give the range of $\eta$ for which convergence to the solution is guaranteed.
3. Let $(x_0, y_0) = (2, 5)$; give the range of $\eta$ for which convergence to the solution is guaranteed.
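Parts 2 and 3 can be probed empirically by iterating the update from the given starting point and observing whether the iterates shrink or blow up (a minimal sketch; the step sizes tried are arbitrary):

```python
# Iterate gradient descent on f(x, y) = x^2 + y^2/100; the gradient is (2x, y/50).
def run_gd(x0, y0, eta, steps=1000):
    x, y = x0, y0
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * y / 50.0
        if abs(x) > 1e12 or abs(y) > 1e12:
            return "diverged"
    return (x, y)

for eta in [0.1, 0.5, 0.9, 1.0, 1.1]:
    print("eta =", eta, "->", run_gd(10.0, 0.0, eta))
```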
Derive the gradient $\frac{\partial L}{\partial W}$ for the gradient descent update rule.
Answer: $\frac{\partial L}{\partial W} =$
13. Consider a multivariate linear regression problem where the output is $\hat{Y} = XW$, where $X \in \mathbb{R}^{m \times n}$, $m$ is the number of training samples, $n$ is the number of features and $Y$ denotes the true labels, with $Y, \hat{Y} \in \mathbb{R}^m$. The objective is to minimize the squared error
$$L(W) = \frac{1}{m} \sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2.$$
Which of the following gives the minimizing $W$?
(a) $W = X^T X X^T Y$
(b) $W = X^T (X X^T)^{-1} Y$
(c) $W = (X^T X)^T X Y$
(d) $W = (X^T X)^{-1} X^T Y$
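Whichever closed form is chosen, it can be sanity-checked against a numerical least-squares fit on random data (a sketch; the dimensions and data below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 4                      # m training samples, n features
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)

# Numerical minimiser of the squared error, for comparison with the options.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

def loss(W):
    return np.mean((Y - X @ W) ** 2)

print("lstsq solution:", W_lstsq, "loss:", loss(W_lstsq))
# Evaluate loss(W_candidate) for each option and compare against this value.
```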
14. Suppose we train a deep neural network using the cross entropy loss
for classification. Now, instead of minimizing the cross-entropy loss,
suppose we change our objective function (J(θ)) to maximize the
probability of the correct class. What changes will have to be made
in our training setup?
(a) We cannot use backpropagation since it is applicable only in
scenarios where we are minimizing an objective function, not
maximizing it.
(b) We will have to change the update rule to $\theta_j := \theta_j + \alpha \frac{\partial J(\theta)}{\partial \theta_j}$.
(c) We do not need to change anything and the network will still
get trained properly without any modification.
15. Which of the following loss functions when used with logistic re-
gression is a convex loss function? Provide a proof for your an-
swer.
(a) Cross Entropy
(b) Mean Squared error
(c) All of the above
(d) None of the above
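A numerical probe (not a proof) is to fix a one-dimensional dataset and tabulate each loss as a function of the single weight $w$; a convex function never dips below its chords. A minimal sketch with an assumed toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy 1-D dataset (feature x, binary label y).
x = np.array([-2.0, -0.5, 1.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

for w in np.linspace(-10, 10, 9):
    p = np.clip(sigmoid(w * x), 1e-12, 1 - 1e-12)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross entropy
    mse = np.mean((y - p) ** 2)                              # mean squared error
    print(f"w={w:6.2f}  CE={ce:8.4f}  MSE={mse:8.4f}")
```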
16. Suppose we have the following three points: $x_1 = (1, 1)$, $x_2 = (-1, 3)$, $x_3 = (2, 4)$, and $(y_1, y_2, y_3) = (5, 11, 18)$. Find $\min_w \sum_{i=1}^{3} (x_i^T w - y_i)^2$ and also the value of $w$ that leads to this minimum value.
(a) min value = 0, w = [1,4]
(b) min value = 0, w = [4,1]
(c) min value = 1, w = [2,5]
(d) min value = 1, w = [5,2]
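The problem is small enough to verify numerically (a minimal NumPy sketch using the three points above):

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, 3.0], [2.0, 4.0]])
y = np.array([5.0, 11.0, 18.0])

# Least-squares fit of w; compare against your hand computation.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("w that minimises the sum of squares:", w)
print("minimum value:", np.sum((X @ w - y) ** 2))
```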
17. Which of the following metrics can be used to measure the similarity
between two probability distributions?
(a) Jensen-Shannon divergence
(b) Kullback–Leibler(KL) divergence
(c) Cross-Entropy
(d) Mahalanobis divergence
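For concreteness, the information-theoretic quantities among the options can be evaluated for two small discrete distributions (an illustrative sketch; $p$ and $q$ are made up):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

def kl(a, b):
    """Kullback-Leibler divergence KL(a || b), in nats."""
    return np.sum(a * np.log(a / b))

def js(a, b):
    """Jensen-Shannon divergence: symmetrised KL against the mixture."""
    m = 0.5 * (a + b)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

cross_entropy = -np.sum(p * np.log(q))

print("KL(p || q):", kl(p, q))
print("JS(p, q)  :", js(p, q))
print("H(p, q)   :", cross_entropy)
```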
(a) $-\dfrac{y(x + z)}{x(y + z)}$
(b) $-\dfrac{y^2(x + z)}{x^2(y + z)}$
(c) $\dfrac{y(x^2 - z^2)}{x(y^2 - z^2)}$
(d) $-\dfrac{y(x^2 + z^2)}{x(y^2 + z^2)}$
20. Consider this quadratic loss function $J(\theta)$. Which of the following update equations will take the minimum number of steps to reach from point 1 to point 2?
21. We are given astronomical data for star classification. Stars can be
classified into seven main types (O, B, A, F, G, K, M) based on their
surface temperatures. Additionally, there are sub-classes identified
based on their sizes: supergiants, giants, main-sequence stars, and
subdwarfs. Hence, a given training sample can be a supergiant star of O type. What changes should be made to a standard feed-forward neural network to handle such cases, where the classes are not mutually exclusive?
(a) No changes are needed and we can model this problem with the
standard setup.
(b) Use sigmoid instead of softmax as the output activation func-
tion.
(c) Use Swish activation function instead of ReLU.
(d) Use binary cross-entropy loss for each class instead of categorical
cross-entropy loss.
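Whichever option(s) you choose, the contrast between a softmax head and independent per-class sigmoid outputs can be seen in a small sketch (the logits and targets below are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits for 4 classes and a target in which two classes are active
# at once (e.g. "supergiant" and "O type" for the same star).
logits = np.array([2.0, -1.0, 0.5, 1.5])
target = np.array([1.0, 0.0, 0.0, 1.0])

print("softmax probs (sum to 1, mutually exclusive):", np.round(softmax(logits), 3))
probs = sigmoid(logits)
print("per-class sigmoid probs (independent)       :", np.round(probs, 3))

# Binary cross-entropy accumulated over the classes.
bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
print("binary cross-entropy:", bce)
```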
22. An e-commerce company builds a feed-forward neural network that predicts how similar two products are. The network has 2 hidden layers and an output layer.
$$a_1 = x^T W_1 x$$
$$g(x) = x \cdot \mathrm{sigmoid}(x)$$
Compute the backpropagation updates of this network; specifically, derive $\frac{\partial L}{\partial a_3}$, $\frac{\partial L}{\partial a_2}$, $\frac{\partial L}{\partial a_1}$ and $\frac{\partial L}{\partial W_{111}}$.
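The full network is not reproduced above, but the activation $g(x) = x \cdot \mathrm{sigmoid}(x)$ is, and its derivative appears in every requested term; the sketch below checks a hand-derived closed form for $g'(x)$ against finite differences (verify the closed form as part of the exercise):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    # Activation from the question: g(x) = x * sigmoid(x) (Swish / SiLU).
    return x * sigmoid(x)

def g_prime(x):
    # Hand-derived closed form: g'(x) = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid(x)
    return s + x * s * (1 - s)

x = np.linspace(-3, 3, 7)
eps = 1e-6
numerical = (g(x + eps) - g(x - eps)) / (2 * eps)
print("closed form :", np.round(g_prime(x), 6))
print("finite diff :", np.round(numerical, 6))
```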
END