
CS6910: Tutorial 3

VERSION I

1. Show that for a binary classification problem, minimising the cross
entropy loss is the same as minimising the KL divergence between the
true and predicted distributions.

2. You want your neural network based classification model to be highly
confident in addition to being accurate. One way of achieving this
is to ensure that the probability predicted for the correct class yi
is larger than the probabilities predicted for the other classes
by a significant margin ∆ (say, ≥ 0.3). How would you design a loss
function to ensure this? For example, if we have 3 classes and if the
correct class label is 0, and the probabilities predicted by the model
are [y0 = 0.58, y1 = 0.37, y2 = 0.05], then the model should incur: (i)
no loss for the correct class, (ii) a loss for assigning a probability of
0.37 to the 2nd class, since the difference between the probabilities of the
correct class and this incorrect class is only 0.21 (less than ∆), and
(iii) no loss for assigning a probability of 0.05 to the 3rd class, since
the difference in probability is greater than ∆.
Loss function =
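As a sanity check (not the intended derivation), here is a minimal NumPy sketch of one possible hinge-style margin penalty, using ∆ = 0.3 and the example probabilities above; the names margin_loss, probs and correct are illustrative only.

```python
import numpy as np

def margin_loss(probs, correct, delta=0.3):
    """Hinge-style penalty max(0, delta - (p_correct - p_j)), summed over the wrong classes j."""
    gap = probs[correct] - probs          # margin of the correct class over every class
    penalties = np.maximum(0.0, delta - gap)
    penalties[correct] = 0.0              # no loss for the correct class itself
    return penalties.sum()

probs = np.array([0.58, 0.37, 0.05])      # example from the question, correct class = 0
print(margin_loss(probs, correct=0))      # ~0.09: penalises class 1 (gap 0.21 < 0.3) but not class 2
```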

3. An ordered network is a network where the state variables can be
computed one at a time in a specified order. Given the ordered network
below, give a formula for calculating the ordered derivative ∂y3/∂y1
in terms of partial derivatives w.r.t. y1 and y2, where y1, y2 and y3
are the outputs of nodes 1, 2 and 3 respectively.

[Figure: ordered network with nodes 1, 2 and 3 producing outputs y1, y2 and y3.]
(a) dy3/dy1 = (∂y3/∂y2)(dy2/dy1) + ∂y3/∂y1
(b) dy3/dy1 = (∂y3/∂y2)(dy2/dy1)(∂y3/∂y1)
(c) dy3/dy1 = (∂y3/∂y2)(dy2/dy1) − ∂y3/∂y1
(d) None of the above.

4. Let φ1 (·) and φ2 (·) denote the sigmoid and the tanh functions,
respectively. Tick the correct options.
(a) φ1 (-ν) = φ1 (ν) and φ2 (-ν) = 1 - φ2 (ν)
(b) φ1 (-ν) = -φ1 (ν) and φ2 (-ν) = 1 - φ2 (ν)
(c) φ1 (-ν) = 1 - φ1 (ν) and φ2 (-ν) = -φ2 (ν)
(d) None of the above.

5. Consider vectors u, x ∈ Rn, and a matrix A ∈ Rn×n. The derivative of
a scalar f w.r.t. a vector x is a vector itself, given by

∇x f = (∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn)

Derive the expressions for the following derivatives (gradients):

∇x (uᵀx),  ∇x (xᵀx)  and  ∇x (xᵀAx)

(a) uᵀ, xᵀ and Axᵀ
(b) uᵀ, 2xᵀ and 2Axᵀ
(c) u, 2x and 2Ax
(d) u, 2x and Ax
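Whatever option you pick, a finite-difference check on random vectors is a quick way to verify it; a minimal sketch assuming NumPy, with an illustrative helper num_grad:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at the point x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
u, x = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))

print(np.allclose(num_grad(lambda v: u @ v, x), u))                  # candidate: u
print(np.allclose(num_grad(lambda v: v @ v, x), 2 * x))              # candidate: 2x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # (A + Aᵀ)x; equals 2Ax when A is symmetric
```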

6. A fair coin results in either Head (1) or Tail (0) with equal probability.
What is the entropy of the random variable indicating the outcome
of the toss? If instead, we had a biased coin with P(H) = 0.7, does
the entropy increase or decrease?

(a) With fair coin, entropy is 1. With biased coin, it is 0.88.
(b) With fair coin, entropy is 0.88. With biased coin, it is 1.
(c) With fair coin, entropy is 0.88. With biased coin, it is 0.66.
(d) With fair coin, entropy is 0.25. With biased coin, it is 0.90.
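To verify the arithmetic, a minimal sketch using base-2 logarithms:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a coin that lands heads with probability p."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(entropy_bits(0.5))   # fair coin
print(entropy_bits(0.7))   # biased coin
```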

7. Recall the gradient descent update rule that comes from the Taylor
series when at each step we ensure L(θK+1 ) < L(θK ), where L is the
loss function. Suppose we are dealing with a quadratic loss function.
Can you come up with a better update rule such that we reach the
minimum more quickly?
Bonus question: Think about why this is not a widely used update
rule even though it is much faster than gradient descent.
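As a hint of the kind of rule being asked about, the sketch below uses an illustrative 1-D quadratic (the coefficients are arbitrary) and compares one plain gradient step with one step scaled by the inverse second derivative:

```python
# Illustrative 1-D quadratic: L(theta) = 3*theta**2 - 12*theta + 5, minimised at theta = 2.
grad = lambda t: 6 * t - 12     # L'(theta)
hess = 6.0                      # L''(theta), constant for a quadratic

theta0 = 10.0
theta_gd = theta0 - 0.1 * grad(theta0)     # one plain gradient-descent step (eta = 0.1)
theta_2nd = theta0 - grad(theta0) / hess   # one step scaled by the inverse curvature

print(theta_gd)    # 5.2 -- still far from the minimum
print(theta_2nd)   # 2.0 -- lands exactly on the minimum of a quadratic in one step
```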

8. Consider a fully connected network with 3 inputs x1 , x2 , x3 . Suppose
there are two hidden layers, each with 4 neurons having sigmoid activation
functions. Further, the output layer is a softmax layer. Assume that
all the weights in the network are set to 1 and all biases are set to 0.
Write down the output of the network as a function of x = [x1 , x2 , x3 ].
y=
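To check your closed-form expression on concrete inputs, here is a minimal NumPy forward pass under the stated assumptions (all weights 1, all biases 0, two sigmoid hidden layers of 4 neurons, softmax output); the question does not fix the number of output classes, so 2 is assumed here purely for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(x, n_hidden=4, n_out=2):
    """All-ones weights, zero biases: two sigmoid hidden layers, then a softmax output."""
    h1 = sigmoid(np.ones((n_hidden, 3)) @ x)          # every unit sees x1 + x2 + x3
    h2 = sigmoid(np.ones((n_hidden, n_hidden)) @ h1)
    logits = np.ones((n_out, n_hidden)) @ h2          # identical pre-activations for every class ...
    e = np.exp(logits - logits.max())
    return e / e.sum()                                # ... so the softmax output is uniform

print(forward(np.array([1.0, -2.0, 0.5])))            # [0.5 0.5] regardless of x
```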

9. Consider the following computation,

[Diagram: x → σ → f(x)]

f (x) = tanh(w · x + b)

The value L is given by,

L = (1/2) (y − f(x))⁴

Here, x and y are constants and w and b are parameters that can be
modified. In other words, L is a function of w and b.
Derive the partial derivatives ∂L/∂w and ∂L/∂b.

(a) ∂L/∂w = 2(y − f(x))³ f(x)(1 − f(x))² x
    ∂L/∂b = 2(y − f(x))³ f(x)(1 − f(x))²
(b) ∂L/∂w = 2(y − f(x))³ (f(x)² − 1) x
    ∂L/∂b = 2(y − f(x))³ (f(x)² − 1)
(c) ∂L/∂w = 2(f(x) − y)³ f(x)²
    ∂L/∂b = 2(f(x) − y)³ f(x)² x
(d) ∂L/∂w = 2(y − f(x))³ f(x)(1 − f(x)) y x
    ∂L/∂b = 2(y − f(x))³ f(x)(1 − f(x)) y
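A finite-difference check with arbitrary values of x, y, w and b can discriminate between the options; a minimal sketch (the numbers chosen are purely illustrative):

```python
import numpy as np

x, y, w, b = 0.7, 0.3, 0.5, -0.2   # arbitrary illustrative constants and parameters
f = lambda w, b: np.tanh(w * x + b)
L = lambda w, b: 0.5 * (y - f(w, b)) ** 4

eps = 1e-6
dL_dw = (L(w + eps, b) - L(w - eps, b)) / (2 * eps)   # numerical dL/dw
dL_db = (L(w, b + eps) - L(w, b - eps)) / (2 * eps)   # numerical dL/db

# Evaluate each candidate expression at the same x, y, w, b and compare with these numbers.
print(dL_dw, dL_db)
```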

10. Let f(x, y) = x² + y²/100. Gradient descent with a fixed step size η is
run for finding the minimum value of f from an initial point (x0 , y0 ).
1. Give an expression for (xt , yt ) in terms of η and (x0 , y0 ).

2. Let (x0 , y0 ) = (10, 0); give the range of η for which convergence
to the solution is guaranteed.

3. Let (x0 , y0 ) = (2, 5); give the range of η for which convergence
to the solution is guaranteed.
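To experiment with parts 2 and 3 numerically, a minimal sketch (the step sizes tried are arbitrary):

```python
def run_gd(x0, y0, eta, steps=200):
    """Plain gradient descent on f(x, y) = x**2 + y**2 / 100."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * y / 50   # df/dx = 2x, df/dy = y/50
    return x, y

for eta in (0.1, 0.9, 1.1):
    print(eta, run_gd(10.0, 0.0, eta))   # the x-coordinate blows up once eta exceeds 1
for eta in (0.1, 0.9, 1.1):
    print(eta, run_gd(2.0, 5.0, eta))
```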

11. Consider a multivariate linear regression problem where the output is
Ŷ = XW, where X ∈ Rm×n, m is the number of training samples, n
is the number of features, and Y denotes the true labels. The objective is
to minimize the squared error function, where Y, Ŷ ∈ Rm:

L(W) = (1/m) Σ_{i=1}^{m} (Yi − Ŷi)²

Derive the gradient ∂L/∂W for the gradient descent update rule.
Answer:
∂L/∂W =
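Whatever expression you derive, it can be validated against a numerical gradient on random data; a minimal sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
X, Y = rng.normal(size=(m, n)), rng.normal(size=m)
W = rng.normal(size=n)

loss = lambda W: np.mean((Y - X @ W) ** 2)

eps, g = 1e-6, np.zeros(n)
for j in range(n):                       # central finite differences, one coordinate of W at a time
    e = np.zeros(n)
    e[j] = eps
    g[j] = (loss(W + e) - loss(W - e)) / (2 * eps)

print(g)   # compare with your closed-form dL/dW evaluated at the same X, Y, W
```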

12. Consider a binary classification problem. Which of the following loss
functions, when used with a deep neural network (≥ 1 hidden layer)
with non-linear activations, is a convex loss function? Provide a
proof for your answer.
(a) Cross Entropy
(b) Mean Squared error
(c) All of the above
(d) None of the above

13. Consider a multivariate linear regression problem where the output is
Ŷ = XW, where X ∈ Rm×n, m is the number of training samples, n
is the number of features, and Y denotes the true labels. The objective is
to minimize the squared error function, where Y, Ŷ ∈ Rm:

L(W) = (1/m) Σ_{i=1}^{m} (Yi − Ŷi)²

Find a closed form solution to this problem if it exists. Think about
why we use gradient descent (an iterative approach) in practice
instead.

(a) W = XᵀXXᵀY
(b) W = Xᵀ(XXᵀ)⁻¹Y
(c) W = (XᵀX)ᵀXY
(d) W = (XᵀX)⁻¹XᵀY
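The candidate formulas can be compared against a library least-squares solver on random, full-column-rank data; a minimal sketch (np.linalg.lstsq is used here only as a reference):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
X, Y = rng.normal(size=(m, n)), rng.normal(size=m)

W_ref = np.linalg.lstsq(X, Y, rcond=None)[0]   # reference least-squares solution

# Plug the same X, Y into each candidate and compare with W_ref, e.g. for option (d):
W_d = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(W_d, W_ref), W_ref)
```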

14. Suppose we train a deep neural network using the cross entropy loss
for classification. Now, instead of minimizing the cross-entropy loss,
suppose we change our objective function (J(θ)) to maximize the
probability of the correct class. What changes will have to be made
in our training setup?
(a) We cannot use backpropagation since it is applicable only in
scenarios where we are minimizing an objective function, not
maximizing it.
(b) We will have to change the update rule to θj := θj + α ∂J(θ)/∂θj.
(c) We do not need to change anything and the network will still
get trained properly without any modification.

15. Which of the following loss functions, when used with logistic
regression, is a convex loss function? Provide a proof for your
answer.
(a) Cross Entropy
(b) Mean Squared error
(c) All of the above
(d) None of the above

16. Suppose we have the following three points: x1 = (1, 1), x2 = (−1, 3),
x3 = (2, 4) and (y1 , y2 , y3 ) = (5, 11, 18). Find min_w Σ_{i=1}^{3}
(xiᵀw − yi)² and also the value of w that leads to this minimum
value.
(a) min value = 0, w = [1,4]
(b) min value = 0, w = [4,1]
(c) min value = 1, w = [2,5]
(d) min value = 1, w = [5,2]
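The sum of squares here is an ordinary least-squares problem, so a solver can confirm whichever option you pick; a minimal sketch:

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, 3.0], [2.0, 4.0]])   # x1, x2, x3 as rows
y = np.array([5.0, 11.0, 18.0])

w = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares w
print(w)                                   # optimal w
print(np.sum((X @ w - y) ** 2))            # minimum value of the objective
```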

17. Which of the following metrics can be used to measure the similarity
between two probability distributions?
(a) Jensen-Shannon divergence
(b) Kullback–Leibler (KL) divergence
(c) Cross-Entropy
(d) Mahalanobis divergence

18. Consider this function: x²y² + y²z² + z²x² = 0. Compute ∂x/∂y.

(a) y(x + z) / (x(y + z))
(b) y²(x + z) / (x²(y + z))
(c) y(x² − z²) / (x(y² − z²))
(d) y(x² + z²) / (x(y² + z²))
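To double-check the implicit differentiation symbolically, here is a minimal SymPy sketch that treats x as a function of y with z held constant:

```python
import sympy as sp

y, z = sp.symbols('y z')
x = sp.Function('x')(y)                        # x depends on y; z is held constant

F = x**2 * y**2 + y**2 * z**2 + z**2 * x**2    # left-hand side of the constraint F = 0
dxdy = sp.solve(sp.Eq(sp.diff(F, y), 0), sp.diff(x, y))[0]

print(sp.simplify(dxdy))                       # compare against the options above
```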

19. Consider a binary classification problem. Which of the following loss
functions, when used with a deep neural network (≥ 1 hidden layer)
with linear activations, is a convex loss function? Provide a proof for
your answer.

(a) Cross Entropy
(b) Mean Squared error
(c) All of the above
(d) None of the above

20. Consider this quadratic loss function J(θ). Which of the following
update equations will take minimum steps to reach from point 1 to
point 2?

(a) θ∗ = θ0 − 0.5 ∇θ J(θ0 )
(b) θ∗ = θ0 − H⁻¹ ∇θ J(θ0 ), where H is the Hessian of J at θ0
(c) θ∗ = θ0 − 4 ∇θ J(θ0 )
(d) θ∗ = θ0 − 2 ∇θ J(θ0 )

21. We are given astronomical data for star classification. Stars can be
classified into seven main types (O, B, A, F, G, K, M) based on their
surface temperatures. Additionally, there are sub-classes identified
based on their sizes: supergiants, giants, main-sequence stars, and
subdwarfs. Hence, a given training sample can be a supergiant star
of O type. What changes should be made to the standard feed forward
neural network to handle such cases where the classes are not mutually
exclusive?
(a) No changes are needed and we can model this problem with the
standard setup.
(b) Use sigmoid instead of softmax as the output activation func-
tion.
(c) Use Swish activation function instead of ReLU.
(d) Use binary cross-entropy loss for each class instead of categorical
cross-entropy loss.

22. An e-commerce company builds a feed forward neural network that
predicts how similar two products are. The network has 2 hidden
layers and an output layer.

Instead of a linear module, the company decides to have a quadratic
module in the first layer with b1 = 0, where x ∈ Rn. In addition to
this, they use the Swish activation function g(x) instead of ReLU. The loss
is cross-entropy.

a1 = xᵀW1 x
g(x) = x · sigmoid(x)

Compute the backpropagation updates of this network; specifically,
derive ∂L/∂a3 , ∂L/∂a2 , ∂L/∂a1 and ∂L/∂W1_{11} (the (1,1) entry of W1).
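For the quadratic module in isolation, a finite-difference check of ∂a1/∂W1 can validate the first step of your derivation; a minimal sketch with random illustrative values (this exercises only the quadratic form, not the full network):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
x, W1 = rng.normal(size=n), rng.normal(size=(n, n))

a1 = lambda W: x @ W @ x                 # quadratic module with b1 = 0

eps = 1e-6
E = np.zeros((n, n))
E[0, 0] = eps                            # perturb only the (1,1) entry of W1
num = (a1(W1 + E) - a1(W1 - E)) / (2 * eps)

print(num, x[0] * x[0])                  # da1/dW1[i,j] = x_i * x_j, so the (1,1) entry gives x1**2
```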

END
