Deep Learning Quiz Merged

The document contains assignments from the NPTEL Online Certification Courses on Deep Learning, specifically for Weeks 1 and 2, featuring multiple-choice questions (MCQs) related to concepts such as region descriptors, Bayes' classifiers, and Gaussian distributions. Each question includes options and the correct answer along with detailed solutions explaining the reasoning behind the answers. The assignments cover various topics in deep learning and statistical analysis, aimed at enhancing understanding of the subject.

Uploaded by

Ashutosh Kumar

NPTEL Online Certification Courses

Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 1
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 2= 20
______________________________________________________________________________

QUESTION 1:
Which of the following is (are) region descriptor(s)? Choose the correct option.
I) Fourier descriptor II) Co-occurrence matrix III) Intensity histogram IV) Signature

a. Both I and IV
b. Only I
c. Both II and III
d. None of the above

Correct Answer: c

Detailed Solution:

Intensity histogram and co-occurrence matrix are region descriptors; Fourier descriptor and signature are boundary descriptors.

______________________________________________________________________________

QUESTION 2:
Consider a two-class Bayes' minimum risk classifier. The probability of class ω1 is P(ω1) = 0.4, P(x|ω1) = 0.65, P(x|ω2) = 0.5, and the loss matrix values are

λ11 = 0.1, λ12 = 0.9, λ21 = 0.85, λ22 = 0.15

Find the risk R(α1|x).

a. 0.51

b. 0.61

c. 0.53

d. 0.39

Correct Answer: c

Detailed Solution:
P(ω2) = 1 − P(ω1) = 0.6

R(α1|x) = λ11 · P(ω1|x) + λ12 · P(ω2|x)

where P(ω1|x) = P(ω1) · P(x|ω1) / P(x) and P(ω2|x) = P(ω2) · P(x|ω2) / P(x)

Now, P(x) = P(ω1) · P(x|ω1) + P(ω2) · P(x|ω2) = 0.4×0.65 + 0.6×0.5 = 0.56

P(ω1|x) = P(ω1) · P(x|ω1) / P(x) = 0.4×0.65 / 0.56 = 0.464

Similarly, P(ω2|x) = P(ω2) · P(x|ω2) / P(x) = 0.6×0.50 / 0.56 = 0.536

So, R(α1|x) = 0.1×0.464 + 0.9×0.536 ≈ 0.53
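The arithmetic above can be checked with a short Python sketch (variable names are my own):

```python
# Two-class Bayes minimum-risk computation with the values from Question 2.
p_w1, p_w2 = 0.4, 0.6             # priors P(w1), P(w2)
p_x_w1, p_x_w2 = 0.65, 0.5        # likelihoods P(x|w1), P(x|w2)
lam11, lam12 = 0.1, 0.9           # losses for action alpha_1

p_x = p_w1 * p_x_w1 + p_w2 * p_x_w2        # evidence P(x) = 0.56
p_w1_x = p_w1 * p_x_w1 / p_x               # posterior P(w1|x), about 0.464
p_w2_x = p_w2 * p_x_w2 / p_x               # posterior P(w2|x), about 0.536

risk_a1 = lam11 * p_w1_x + lam12 * p_w2_x  # conditional risk R(alpha_1|x)
print(round(risk_a1, 2))                   # 0.53
```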

______________________________________________________________________________
QUESTION 3:
If the larger values of gray co-occurrence matrix are concentrated around the main diagonal,
then which one of the following will be true?

a. The value of entropy will be very low.


b. The value of element difference moment will be high.
c. The value of inverse element difference moment will be high.
d. None of the above.

Correct Answer: c

Detailed Solution:

We cannot conclude anything about the entropy from the concentration of values along the main diagonal alone, because entropy depends on the randomness of the values. The element difference moment, however, will be low, whereas the inverse element difference moment will be high.

______________________________________________________________________________

QUESTION 4:
Suppose the Fourier descriptor of a shape has K coefficients, and we remove the last few coefficients and use only the first m (m < K) coefficients to reconstruct the shape. What will be the effect of using the truncated Fourier descriptor on the reconstructed shape?

a. We will get a smoothed boundary version of the shape.

b. We will get only the fine details of the boundary of the shape.

c. Full shape will be reconstructed without any loss of information.

d. Low frequency component of the boundary will be removed from contour of the
shape.

Correct Answer: a

Detailed Solution:

The low-frequency components of the Fourier descriptor capture the general shape properties of the object, and the high-frequency components capture the finer detail. So, if we remove the last few components, the finer details are lost, and as a result the reconstructed shape is a smoothed version of the original shape. The boundary of the reconstructed shape will be a low-frequency approximation of the original shape boundary.

______________________________________________________________________________

QUESTION 5:
The signature descriptor of an unknown shape is given in the figure. Can you identify the unknown shape?

a. Circle
b. Square
c. Straight line
d. Cannot be predicted
Correct Answer: a
Detailed Solution:
The distance from the centroid to the boundary is the same for every value of θ. This is true for a circle of radius k.

QUESTION 6:
The signature descriptor of an unknown shape is given in the figure. If the value of k is 7 cm, what is the area of the unknown shape?

a. 145 sq. cm.


b. 49 sq cm.
c. 98 sq cm.
d. 154 sq cm.
Correct Answer: d
Detailed Solution:
The distance from the centroid to the boundary is the same for every value of θ. This is true for a circle of radius k. So, with radius 7 cm, the area of the circle is πr² ≈ (22/7) × 49 = 154 sq. cm.

______________________________________________________________________________

QUESTION 7:
Which of the following is not a Co-occurrence matrix-based descriptor?

a. Entropy
b. Uniformity
c. Intensity histogram.
d. All of the above.

Correct Answer: c

Detailed Solution:

Please follow lecture videos.



QUESTION 8:

Given an image I (fig 1), the gray co-occurrence matrix C (fig 2) can be constructed by specifying the displacement vector d = (dx, dy). Let the position operator be specified as (1, 1), which has the interpretation: one pixel to the right and one pixel below. (Both the image and the partial gray co-occurrence matrix are given in figures 1 and 2 respectively. Blank values and the 'X', 'Y' values in the gray co-occurrence matrix are unknown.)

2 0 2 0 1
0 1 1 2 2
2 1 2 2 1
1 2 2 0 1
1 0 1 2 0

Fig 1: I

Fig 2: C (partial matrix, not reproduced in this text)

What is the value of (Y – X) ?

a. 0

b. 1

c. 2

d. 3
Correct Answer: d

Detailed Solution:

Complete the matrix and find the elements. Value of X is 0, Value of Y is 3.

QUESTION 9:
What is the value of maximum probability descriptor?

a. 1/4

b. 3/12

c. 1/3

d. 3/16

Correct Answer: a

Detailed Solution:

Maximum probability = max(cij), where cij is an entry of the normalized co-occurrence matrix. Here, max(cij) = 4/16 = 1/4.
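The counting in Questions 8 and 9 can be reproduced with a small NumPy sketch over the given 5×5 image (the array C below is my own reconstruction of the full co-occurrence matrix):

```python
import numpy as np

# Image I from Question 8; gray levels are 0, 1, 2.
I = np.array([[2, 0, 2, 0, 1],
              [0, 1, 1, 2, 2],
              [2, 1, 2, 2, 1],
              [1, 2, 2, 0, 1],
              [1, 0, 1, 2, 0]])

# Co-occurrence for displacement d = (1, 1): pair each pixel with the
# pixel one step to the right and one step below.
C = np.zeros((3, 3), dtype=int)
for r in range(I.shape[0] - 1):
    for c in range(I.shape[1] - 1):
        C[I[r, c], I[r + 1, c + 1]] += 1

# Maximum probability descriptor over the normalized matrix.
max_prob = C.max() / C.sum()
print(C.sum(), C.max(), max_prob)  # 16 4 0.25
```

There are 4 × 4 = 16 valid pixel pairs, and the most frequent pair occurs 4 times, which reproduces the 4/16 = 1/4 of the solution.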

______________________________________________________________________________

QUESTION 10:
The plot of the distance of the different boundary points from the centroid of the shape, taken at various directions, is known as

a. Signature descriptor

b. Polygonal descriptor

c. Fourier descriptor.

d. Convex Hull
Correct Answer: a

Detailed Solution:

Please refer to the lecture videos.

__________________________________________________________

************END***********

Deep Learning
Assignment- Week 2
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 2 = 20
______________________________________________________________________________

QUESTION 1:
Two random variables X1 and X2 follow Gaussian distributions with the following means and variances:
X1 ~ N(0, 3) and X2 ~ N(0, 2)

Which of the following options is true?

a. Distribution of 𝑋1 will be flatter than the distribution of 𝑋2


b. Distribution of 𝑋2 will be flatter than the distribution of 𝑋1
c. Peak of the both distributions will be at same height
d. None of the above

Correct Answer: a

Detailed Solution:

Since X1 has a larger variance than X2, the distribution of X1 is more spread out than that of X2. So, the distribution of X1 will be flatter than the distribution of X2.

QUESTION 2:
In which scenario will the discriminant function be linear when a two-class Bayesian classifier is used to classify two classes of normally distributed points? Choose the correct option.

I. Σ1 = Σ2 but Σ is not an identity matrix
II. Σ1 = Σ2 and Σ is an identity matrix
III. Σ1 ≠ Σ2

a. Only II
b. Both I and II
c. Only III
d. None of the above

Correct Answer: b

Detailed Solution:

The discriminant function is linear when Σ1 = Σ2.

QUESTION 3:

Choose the correct option regarding discriminant functions gi(x) for multiclass classification (x is the
feature vector to be classified).

Statement i: The risk value R(αi|x) in a Bayes minimum risk classifier can be used as a discriminant function.
Statement ii: The negative of the risk value R(αi|x) in a Bayes minimum risk classifier can be used as a discriminant function.
Statement iii: The negative of the a posteriori probability P(ωi|x) in a Bayes minimum error classifier can be used as a discriminant function.
Statement iv: The a posteriori probability P(ωi|x) in a Bayes minimum error classifier can be used as a discriminant function.

a. Only Statement i is true


b. Both Statements ii and iii are true
c. Both Statements i and iv are true
d. Both Statements ii and iv are true

Correct Answer: d

Detailed Solution:

A discriminant function gi(x) should be largest for the class that is selected. The Bayes decision selects the action with minimum risk, so the negative of the risk, −R(αi|x), can serve as a discriminant function, and the a posteriori probability P(ωi|x) can serve directly. Follow Lecture 06 for a detailed explanation.

QUESTION 4:
If we choose the discriminant function gi(x) as a function of the posterior probability, i.e. gi(x) = f(p(ωi|x)), then which of the following cannot be the function f(·)?

a. f(x) = a^x, where a > 1
b. f(x) = a^(−x), where a > 1
c. f(x) = 2x + 3
d. f(x) = exp(x)

Correct Answer: b

Detailed Solution:

The function f(·) must be monotonically increasing. f(x) = a^(−x) with a > 1 is monotonically decreasing, so it cannot be used.



QUESTION 5:
For a two-class problem, the linear discriminant function is given by g(x) = aᵗy. What is the update rule for finding the weight vector a? Here y is the augmented feature vector.

a. 𝑎( 𝑘 + 1 ) = 𝑎( 𝑘 ) + 𝜂 ∑ 𝑦
b. 𝑎( 𝑘 + 1 ) = 𝑎( 𝑘 ) − 𝜂 ∑ 𝑦
c. 𝑎( 𝑘 + 1 ) = 𝑎(𝑘 − 1) − 𝜂𝑎(𝑘)
d. 𝑎( 𝑘 + 1 ) = 𝑎(𝑘 − 1) + 𝜂𝑎(𝑘)

Correct Answer: a

Detailed Solution:

a(k + 1) = a(k) + η Σ y, where the sum runs over the misclassified samples.

For the derivation, refer to the video lectures.

QUESTION 6:
You are given some data points for two different classes.
Class 1 points: {(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)}
Class 2 points: {(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)}
Compute the mean vectors 𝜇1 and 𝜇 2 for these two classes and choose the correct option.
a. μ1 = (10, 10) and μ2 = (6, 12)
b. μ1 = (12, 8) and μ2 = (12, 7)
c. μ1 = (10, 7) and μ2 = (10, 7)
d. μ1 = (10, 8) and μ2 = (12, 6)

Correct Answer: d

Detailed Solution: Add the points in each class and divide the result by the number of points, i.e., 7.
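The means can be verified with a quick NumPy sketch:

```python
import numpy as np

class1 = np.array([(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)])
class2 = np.array([(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)])

mu1 = class1.mean(axis=0)  # mean of class 1
mu2 = class2.mean(axis=0)  # mean of class 2
print(mu1, mu2)            # [10.  8.] [12.  6.]
```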

QUESTION 7:
You are given some data points for two different classes.
Class 1 points: {(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)}
Class 2 points: {(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)}
Compute the covariance matrices Σ1 and Σ2 for these two classes and choose the correct option.

a. Σ1 = [[8.29, 0], [0, 8.29]] and Σ2 = [[8.29, −1.0], [−1.0, 8.29]]
b. Σ1 = [[1, 0], [0, 1]] and Σ2 = [[1, 0], [0, 1]]
c. Σ1 = [[3.65, −0.85], [−0.85, 3.65]] and Σ2 = [[9.67, −0.85], [−0.85, 9.67]]
d. Σ1 = [[8.29, −0.85], [−0.85, 8.29]] and Σ2 = [[8.29, −0.85], [−0.85, 8.29]]

Correct Answer: d

Detailed Solution:

Σ1 = Σ2 = Σ = [[8.29, −0.85], [−0.85, 8.29]]. Follow the steps mentioned in the lecture video.
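The covariances can be checked with NumPy. Whether the course divides by N or by N − 1 is not stated explicitly; `bias=True` (divide by N) is the choice that reproduces the option values:

```python
import numpy as np

class1 = np.array([(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)])
class2 = np.array([(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)])

# bias=True divides by N (population covariance), matching the options:
# 58/7 ~ 8.29 on the diagonal, -6/7 ~ -0.85 off the diagonal.
S1 = np.cov(class1.T, bias=True)
S2 = np.cov(class2.T, bias=True)
print(np.round(S1, 2), np.round(S2, 2))
```

With the default `bias=False` (divide by N − 1) the diagonal would instead be 58/6 ≈ 9.67, which is the value that appears in option c.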

QUESTION 8:
You are given some data points for two different classes.
Class 1 points: {(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)}
Class 2 points: {(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)}
Assume that the points are samples from normal distributions and a two-class Bayesian classifier is used to classify them. Also assume that the prior probabilities of the classes are equal, i.e.,
p(ω1) = p(ω2)
Which of the following is true about the corresponding decision boundary used in the classifier?
(Choose correct option regarding the given statements)

Statement i: Decision boundary passes through the midpoint of the line segment joining the
means of two classes
Statement ii: Decision boundary will be orthogonal bisector of the line joining the means of two
classes.

a. Only Statement i is true



b. Only Statement ii is true

c. Both Statement i and ii are true

d. None of the statements are true

Correct Answer: a

Detailed Solution:

Σ1 = Σ2 = Σ = [[8.29, −0.85], [−0.85, 8.29]], but Σ is not an identity matrix. So, only option a is correct.

QUESTION 9:
You are given some data points for two different classes.
Class 1 points: {(11,11), (13,11), (8,10), (9,9), (7,7), (7,5), (15,3)}
Class 2 points: {(7,11), (15,9), (15,7), (13,5), (14,4), (9,3), (11,3)}
Classify the following two new samples (A = (6, 11), B = (14, 3)) using the K-nearest neighbour rule with K = 3. Use the Manhattan distance as the distance function.

Given two points (x_1, y_1) and (x_2, y_2), the Manhattan Distance d between them is:
d = |x_1 - x_2| + |y_1 - y_2|

a. A belongs to class 1 and B belongs to class 1.


b. A belongs to class 2 and B belongs to class 2.
c. A belongs to class 1 and B belongs to class 2.
d. A belongs to class 2 and B belongs to class 1.

Correct Answer: c

Detailed Solution:
Calculate the Manhattan distance of each point of Class 1 and Class 2 from A = (6, 11) and find the 3 closest points. Of these, 2 belong to Class 1 and 1 belongs to Class 2, so A is assigned to Class 1. Follow the same procedure for point B.
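The procedure can be sketched as a small 3-NN classifier (function names are my own):

```python
from collections import Counter

class1 = [(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)]
class2 = [(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)]
data = [(p, 1) for p in class1] + [(p, 2) for p in class2]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def knn_predict(x, k=3):
    # Sort all labelled points by Manhattan distance and vote among the k nearest.
    nearest = sorted(data, key=lambda pt: manhattan(x, pt[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((6, 11)), knn_predict((14, 3)))  # 1 2
```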

QUESTION 10:
Suppose if you are solving a four-class problem, how many discriminants function you will need
for solving?

a. 1

b. 2
c. 3
d. 4

Correct Answer: d

Detailed Solution:
For an n-class problem we need n discriminant functions.

Deep Learning
Assignment- Week 3
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Find the scalar projection of vector b = <−3, 2> onto vector a = <1, 1>.

a. 0
b. 1/√2
c. −1/√2
d. −1/2

Correct Answer: c

Detailed Solution:
The scalar projection of b onto a is given by the scalar value (b · a)/|a| = (−3×1 + 2×1)/√(1² + 1²) = −1/√2.
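A minimal check of the projection formula (variable names are my own):

```python
import math

b = (-3, 2)
a = (1, 1)
dot = b[0] * a[0] + b[1] * a[1]  # b . a = -1
proj = dot / math.hypot(*a)      # (b . a) / |a| = -1/sqrt(2), about -0.707
print(proj)
```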

QUESTION 2:
Suppose there is a feature vector represented as [1, 4, 3]. What is the distance of this feature vector from the separating plane x1 + 2x2 − 2x3 + 3 = 0? Choose the correct option.
a. 1

b. 5

c. 3

d. 2

Correct Answer: d

Detailed Solution:
The distance of a vector [y1, y2, y3] from the plane ax1 + bx2 + cx3 + d = 0 is given by

|a·y1 + b·y2 + c·y3 + d| / √(a² + b² + c²) = |1×1 + 2×4 + (−2)×3 + 3| / √(1² + 2² + 2²) = 6/3 = 2
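The formula generalizes to any dimension; a small sketch (the helper `plane_distance` is my own) also covers the 3D plane of Question 10:

```python
import math

def plane_distance(point, coeffs):
    """Distance from `point` to the plane a1*x1 + ... + an*xn + d = 0,
    where coeffs = (a1, ..., an, d)."""
    a, d = coeffs[:-1], coeffs[-1]
    num = abs(sum(ai * pi for ai, pi in zip(a, point)) + d)
    return num / math.sqrt(sum(ai * ai for ai in a))

print(plane_distance((1, 4, 3), (1, 2, -2, 3)))  # 2.0  (this question)
print(plane_distance((-2, 4, 1), (2, 3, 6, 7)))  # 3.0  (Question 10)
```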

QUESTION 3:
If we employ SVM to realize two input logic gates, then which of the following will be true?

a. The weight vector for AND gate and OR gate will be same.
b. The margin for AND gate and OR gate will be same.
c. Both the margin and weight vector will be same for AND gate and OR
gate.
d. None of the weight vector and margin will be same for AND gate and
OR gate.

Correct Answer: b

Detailed Solution:

As we can see, although the weight vectors are not the same, the margin is the same for the AND gate and the OR gate.

QUESTION 4:
Suppose we have the below set of points with their respective classes as shown in the table.
Answer the following question based on the table.

X Y Class Label

1 1 +1

-1 -1 -1

2 2 +1

-1 2 +1

1 -1 -1

What can be a possible decision boundary of the SVM for the given points?

a. 𝑦=0
b. 𝑥=0
c. 𝑥=𝑦
d. 𝑥 +𝑦 = 1

Correct Answer: a

Detailed Solution:

Plot the points to visualize the answer.



QUESTION 5:
Suppose we have the below set of points with their respective classes as shown in the table.
Answer the following question based on the table.

X Y Class Label

1 1 +1

-1 -1 -1

2 2 +1

-1 2 +1

1 -1 -1

Find the decision boundary of the SVM trained on these points and choose which of the
following statements are true based on the decision boundary.
i) The point (-1,-2) is classified as -1
ii) The point (-1,-2) is classified as +1
iii) The point (1,-2) is classified as -1
iv) The point (1,-2) is classified as +1

a. Only statement ii is true


b. Both statements i and iii are true
c. Both statements i and iv are true
d. Both statements ii and iii are true

Correct Answer: b

Detailed Solution:

The decision boundary is y = 0. For the point (−1, −2), the y-coordinate −2 < 0, so the point is classified as −1. Similarly, for the point (1, −2), −2 < 0, so that point is also classified as −1.

QUESTION 6:
The shape of the loss landscape during optimization of an SVM resembles which structure?

a. Linear
b. Ellipsoidal
c. Non-convex with multiple possible local minimum
d. Paraboloid

Correct Answer: d

Detailed Solution:

In an SVM the objective is to find the maximum-margin hyperplane (W) such that

WᵀX + b = 1 on the margin for class +1 and WᵀX + b = −1 on the margin for class −1.

For the max-margin condition to be satisfied we solve to minimize ||W||.

The above optimization is a quadratic optimization with a paraboloid landscape for the loss function.

______________________________________________________________________________

QUESTION 7:
How many local minima can be encountered while solving the optimization for maximizing the margin of an SVM?

a. 2
b. 1
c. ∞ (infinite)
d. 0

Correct Answer: b

Detailed Solution:

In an SVM the objective is to find the maximum-margin hyperplane (W) such that

WᵀX + b = 1 on the margin for class +1 and WᵀX + b = −1 on the margin for class −1.

For the max-margin condition to be satisfied we solve to minimize ||W||.

The above optimization is a quadratic optimization with a paraboloid landscape for the loss function. Since the shape is a paraboloid, there can be only 1 minimum, which is global.

______________________________________________________________________________

QUESTION 8:
Suppose we have one feature x ∈ R and a binary class y. The dataset consists of 3 points: p1: (x1, y1) = (−1, −1), p2: (x2, y2) = (1, 1), p3: (x3, y3) = (3, 1). Which of the following is true with respect to the SVM?

a. Maximum margin will increase if we remove the point p2 from the training set.
b. Maximum margin will increase if we remove the point p3 from the training set.
c. Maximum margin will remain same if we remove the point p2 from the training set.
d. None of the above.

Correct Answer: a

Detailed Solution:

Here the point p2 is a support vector; if we remove p2 from the training set, the maximum margin will increase.

______________________________________________________________________________

QUESTION 9:
Choose the correct option regarding classification using SVM for two classes

Statement i: While designing an SVM for two classes, the equation yi(aᵗxi + b) ≥ 0 is used to choose the separating plane using the training vectors.
Statement ii: During inference, for an unknown vector xj, if yj(aᵗxj + b) ≥ 0, then the vector can be assigned class 1.
Statement iii: During inference, for an unknown vector xj, if (aᵗxj + b) > 0, then the vector can be assigned class 1.
Statement iv: While designing an SVM for two classes, the equation yi(aᵗxi + b) ≥ 1 is used to choose the separating plane using the training vectors.

a. Only Statement i is true


b. Both Statements ii and iii are true
c. Both Statements i and ii are true
d. Both Statements iii and iv are true

Correct Answer: d

Detailed Solution:

During training, the margin constraint yi(aᵗxi + b) ≥ 1 is imposed; at inference, the label yj is unknown, so the sign of (aᵗxj + b) decides the class. Follow the lecture for a detailed explanation.

___________________________________________________________________________

QUESTION 10:
Find the distance of the 3D point, 𝑃 = (−2, 4, 1) from the plane defined by
2𝑥 + 3𝑦 + 6𝑧 + 7 = 0?

a. 3
b. 4
c. 0
d. ∞ (infinity)

Correct Answer: a

Detailed Solution:

Distance = |2×(−2) + 3×4 + 6×1 + 7| / √(2² + 3² + 6²) = 21/7 = 3
______________________________________________________________________________

______________________________________________________________

************END*******

Deep Learning
Assignment- Week 4
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Let X and Y be two features to discriminate between two classes. The values and class labels of the features are given below. What is the minimum number of neuron layers required to design the neural network classifier?

X Y #Class

1 2 Class-II

0 0 Class-I

-2 -2 Class-I

3 2 Class-II

-1 -1 Class-I

a. 2
b. 1
c. 5
d. 4

Correct Answer: b

Detailed Solution:

Please refer to the lectures of week 4. Plot the feature points: they are linearly separable. Hence a single layer is able to do the classification task.

QUESTION 2:
Let us assume that we implement an AND function using a single neuron as shown below. The activation function f_NL(·) of our neuron is defined as: f(y) = 0 for y < 30, f(y) = 1 for y ≥ 30. What would be a possible combination of the weights and bias?

a. Bias = 5, w1 = 5, w2 = 25
b. Bias = 10, w1 = 5, w2 = 5
c. Bias = 10, w1 = 15, w2 = 15
d. Bias = 5, w1 = 10, w2 = 10
Correct Answer: c

Detailed Solution:

For the AND function, (w1·x1 + w2·x2 + bias) should be ≥ 30 only when x1 and x2 are both equal to 1, and the expression should be less than 30 for all other values of x1 and x2. Only option C satisfies that. Please refer to the lectures of week 4.
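Option C can be checked exhaustively over the four inputs (a quick sketch, helper names are my own):

```python
def neuron(x1, x2, w1, w2, bias):
    y = w1 * x1 + w2 * x2 + bias
    return 1 if y >= 30 else 0  # f(y) = 1 for y >= 30, else 0

# Option c: bias = 10, w1 = 15, w2 = 15
truth_table = {(x1, x2): neuron(x1, x2, 15, 15, 10)
               for x1 in (0, 1) for x2 in (0, 1)}
print(truth_table)  # fires only for (1, 1), i.e. AND
```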

QUESTION 3:
Which among the following options give the range for a logistic function?

a. -1 to 1
b. -1 to 0
c. 0 to 1
d. 0 to infinity

Correct Answer: c

Detailed Solution:

Refer to lectures, specifically the formula for logistic function.

QUESTION 4:

Input to SoftMax activation function is [5,3,4]. What will be the output?

a. [0.58,0.11, 0.31]
b. [0.43,0.24, 0.33]
c. [0.60,0.10,0.30]
d. [0.67, 0.09,0.24]

Correct Answer: d

Detailed Solution:

SoftMax: σ(xj) = e^(xj) / Σk e^(xk), for j = 1, 2, …, n.

Therefore σ(x1) = e^5 / (e^5 + e^3 + e^4) ≈ 0.67, and similarly for the other values.
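The formula can be sketched in a few lines (the max-shift is a standard numerical-stability trick, not part of the question):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

out = np.round(softmax([5, 3, 4]), 2)
print(out)  # [0.67 0.09 0.24]
```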

QUESTION 5:
Which of the following options is true?
a. In Batch Gradient Descent, a small batch of sample is selected randomly instead
of the whole data set for each iteration.
b. In Batch Gradient Descent, the whole data set is processed together for update
in each iteration.
c. Batch Gradient Descent considers only one sample for updates and has noisier
updates.
d. Batch Gradient Descent produces noisier updates than Stochastic Gradient
Descent
Correct Answer: b
Detailed Solution:

Batch Gradient Descent considers whole dataset for updates in each iteration.

QUESTION 6:
Choose the correct option:

i) Inability of a model to obtain sufficiently low training error is termed as Overfitting
ii) Inability of a model to reduce the large margin between training and testing error is termed as Overfitting
iii) Inability of a model to obtain sufficiently low training error is termed as Underfitting
iv) Inability of a model to reduce the large margin between training and testing error is termed as Underfitting

a. Only option (i) is correct


b. Both Options (ii) and (iii) are correct
c. Both Options (ii) and (iv) are correct
d. Only option (iv) is correct

Correct Answer: b

Detailed Solution:

Follow lecture 17
______________________________________________________________________________

QUESTION 7:
Choose the correct options about the assumptions generally made during optimization in machine learning.

i) Data samples in each data set are dependent on each other
ii) The data samples present in the training and test sets are independent of each other
iii) The training set and test set overlap each other
iv) The distributions of the training set and test set are assumed to be identical

a. Only option (i) is correct


b. Both Options (ii) and (iii) are correct
c. Both Options (ii) and (iv) are correct
d. Only option (iv) is correct

Correct Answer: c

Detailed Solution:

Follow lectures of week 4

QUESTION 8:
An artificial neuron receives n inputs x1, x2, x3, …, xn with weights w1, w2, w3, …, wn attached to the input links. The weighted sum _________ is computed and passed on to a non-linear filter Φ, called the activation function, to release the output. Fill in the blank by choosing one option from the following.

a. ∑𝑖 𝑤𝑖
b. ∑𝑖 𝑥 𝑖
c. ∑𝑖 𝑤𝑖 + ∑𝑖 𝑥 𝑖
d. ∑𝑖 𝑤𝑖 𝑥 𝑖

Correct Answer: d

Detailed Solution:

Refer to the lecture.



QUESTION 9:
Consider the below neural network. p̂ is the output after applying the non-linearity function f_NL(·) on y. The non-linearity f_NL(·) is given as a step function, i.e., f(v) = 0 if v < 0, and f(v) = 1 if v ≥ 0.

The weights are given as 𝑤1 = 2, 𝑤2 = −1.5, 𝑤3 = 1

Choose the correct outputs generated by the network when the inputs are
{𝑥 1 = 1, 𝑥 2 = 0, 𝑥 3 = 0} and {𝑥 1 = 0, 𝑥 2 = 1, 𝑥 3 = 1}. Outputs are in the same order as inputs.

a. 1, 1
b. 0, 0
c. 1, 0
d. 0, 1

Correct Answer: c

Detailed Solution:

y = x1·w1 + x2·w2 + x3·w3 = 2×1 − 1.5×0 + 1×0 = 2; f(y) = 1 as y ≥ 0

Similarly, for the other point:

y = x1·w1 + x2·w2 + x3·w3 = 2×0 − 1.5×1 + 1×1 = −0.5; f(y) = 0 as y < 0

QUESTION 10:
Consider the below neural network. p̂ is the output after applying the non-linearity function f_NL(·) on y. The non-linearity f_NL(·) is given as a step function, i.e., f(v) = 0 if v < 0, and f(v) = 1 if v ≥ 0.

Choose the correct set of weights 𝑤1 , 𝑤2 and bias for which the network behaves as an OR
function.

a. 𝑤1 = 1, 𝑤2 = 1.5, 𝑏𝑖𝑎𝑠 = 1
b. 𝑤1 = 1, 𝑤2 = 0.5, 𝑏𝑖𝑎𝑠 = −1
c. 𝑤1 = 1, 𝑤2 = 1.5, 𝑏𝑖𝑎𝑠 = −1
d. 𝑤1 = 1, 𝑤2 = −0.5, 𝑏𝑖𝑎𝑠 = 1

Correct Answer: c

For the OR function, y = x1·w1 + x2·w2 + bias, and f(y) = 1 when y ≥ 0. So x1·w1 + x2·w2 + bias should be ≥ 0 for (x1 = 1, x2 = 1), (x1 = 1, x2 = 0), (x1 = 0, x2 = 1), and should be < 0 for (x1 = 0, x2 = 0). Only option C satisfies this condition.
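The condition can be checked exhaustively for option C (a quick sketch, helper names are my own):

```python
def step(v):
    return 1 if v >= 0 else 0

def or_neuron(x1, x2, w1=1, w2=1.5, bias=-1):  # weights from option c
    return step(w1 * x1 + w2 * x2 + bias)

table = {(a, b): or_neuron(a, b) for a in (0, 1) for b in (0, 1)}
print(table)  # 0 only for (0, 0), i.e. OR
```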

___________________________________________________________________

______________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 5
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
What is the main benefit of stacking multiple layers of neuron with non-linear activation
functions over a single layer perceptron?

a. Reduces complexity of the network


b. Reduce inference time during testing
c. Allows to create complex non-linear decision boundaries
d. All of the above

Correct Answer: c

Detailed Solution:

A single-layer perceptron without a non-linear activation function is capable of classifying only linearly separable classes. Stacking multiple layers of neurons helps in creating non-linear decision boundaries and thus can be used for classifying examples belonging to classes which are NOT linearly separable.

QUESTION 2:
For a 2-class classification problem, what is the minimum number of nodes required for the
output layer of a multi-layered neural network?

a. 2
b. 1
c. 3
d. None of the above

Correct Answer: b

Detailed Solution:

Only 1 node is enough. We can expect that node to be activated (have high activation value)
only when class = +1 else the node should NOT be activated (have activation close to zero).
We can use the binary (2-class) cross entropy loss to train such a model.

QUESTION 3:
What will be the output from node a3 in the following neural network setup when the inputs are (x1, x2) = (0, 1)? The activation function used in each of the three nodes a1, a2 and a3 is zero-thresholding, i.e., f(v) = 0 if v < 0, and f(v) = 1 if v ≥ 0.

a. -1
b. 0
c. 1
d. 0.5

Correct Answer: b

Detailed Solution:

Output of 𝒂𝟏 : 𝒇(𝟎. 𝟓 ∗ 𝟏 + −𝟏 ∗ 𝟎 + −𝟏 ∗ 𝟏) = 𝒇(−𝟎. 𝟓) = 𝟎

Output of 𝒂𝟐 : 𝒇(−𝟏. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟎 + 𝟏 ∗ 𝟏) = 𝒇(−𝟎. 𝟓) = 𝟎

Output of 𝒂𝟑 : 𝒇(−𝟎. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟎 + 𝟏 ∗ 𝟎) = 𝒇(−𝟎. 𝟓) = 𝟎

______________________________________________________________________________

QUESTION 4:
Which basic logic gate is implemented by the following neural network setup? The activation function used in each of the three nodes a1, a2 and a3 is zero-thresholding, i.e., f(v) = 0 if v < 0, and f(v) = 1 if v ≥ 0.

a. AND
b. NOR
c. XNOR
d. XOR

Correct Answer: c

Detailed Solution:
for input (𝟎, 𝟎)

Output of 𝒂𝟏 : 𝒇(𝟎. 𝟓 ∗ 𝟏 + −𝟏 ∗ 𝟎 + −𝟏 ∗ 𝟎) = 𝒇(𝟎. 𝟓) = 𝟏

Output of 𝒂𝟐 : 𝒇(−𝟏. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟎 + 𝟏 ∗ 𝟎) = 𝒇(−𝟏. 𝟓) = 𝟎

Output of 𝒂𝟑 : 𝒇(−𝟎. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟏 + 𝟏 ∗ 𝟎) = 𝒇(𝟎. 𝟓) = 𝟏


Similarly for input (𝟏, 𝟏)

Output of 𝒂𝟏 : 𝒇(𝟎. 𝟓 ∗ 𝟏 + −𝟏 ∗ 𝟏 + −𝟏 ∗ 𝟏) = 𝒇(−𝟏. 𝟓) = 𝟎

Output of 𝒂𝟐 : 𝒇(−𝟏. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟏 + 𝟏 ∗ 𝟏) = 𝒇(𝟎. 𝟓) = 𝟏

Output of 𝒂𝟑 : 𝒇(−𝟎. 𝟓 ∗ 𝟏 + 𝟏 ∗ 𝟎 + 𝟏 ∗ 𝟏) = 𝒇(𝟎. 𝟓) = 𝟏


For other inputs (𝟏, 𝟎) and (𝟎, 𝟏) Output of 𝒂𝟑 is 0. So, the network resembles XNOR.
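Reading the weights off the worked solution above, the whole truth table can be generated in a few lines (a sketch; helper names are my own):

```python
def step(v):
    return 1 if v >= 0 else 0

def network(x1, x2):
    a1 = step(0.5 - x1 - x2)    # bias 0.5, weights -1, -1
    a2 = step(-1.5 + x1 + x2)   # bias -1.5, weights 1, 1
    a3 = step(-0.5 + a1 + a2)   # bias -0.5, weights 1, 1
    return a3

outputs = [network(a, b) for (a, b) in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [1, 0, 0, 1] -> XNOR
```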

QUESTION 5:
Find the gradient component ∂J/∂w1 for the network shown below if J(·) = (p̂ − p)² is the loss function, p is the target and the non-linearity f_NL(·) is the sigmoid activation function represented as σ(·).

a. 2𝑝̂ × (1 − 𝜎(𝑦)) × 𝑥 1
b. 2(𝑝̂ − 𝑝) × 𝜎 (𝑦) × (1 − 𝜎 (𝑦)) × 𝑥 1
c. 2(𝑝̂ − 𝑝) × (1 − 𝜎 (𝑦)) × 𝑥 1
d. 2(1 − 𝑝) × (1 − 𝜎 (𝑦)) × 𝑥 1

Correct Answer: b

Detailed Solution:

J(·) = (p̂ − p)²
p̂ = f_NL(y)
y = x1·w1 + x2·w2 + x3·w3 + 1

Using the chain rule,

∂J/∂w1 = (∂J/∂p̂) · (∂p̂/∂y) · (∂y/∂w1)
∂J/∂p̂ = 2(p̂ − p), ∂p̂/∂y = σ(y) × (1 − σ(y)), ∂y/∂w1 = x1

∂J/∂w1 = 2(p̂ − p) × σ(y) × (1 − σ(y)) × x1

QUESTION 6:
Find the output 𝑝̂ corresponding to input {𝑥 1 = 1, 𝑥 2 = 1, 𝑥 3 = 0}, for the network shown
below. The non-linearity 𝑓𝑁𝐿 (∙) is the sigmoid activation function.

The weights are given as 𝑤1 = 2, 𝑤2 = −1, 𝑤3 = 1, 𝑏 = −1.

a. 0
b. 1
c. 0.5
d. 0.25

Correct Answer: c

Detailed Solution:

y = x1·w1 + x2·w2 + x3·w3 + b = 1×2 − 1×1 + 0×1 − 1 = 0; p̂ = σ(y) = σ(0) = 0.5

QUESTION 7:
Find the gradient component ∂J/∂w2 for the network shown below if J(·) = (p̂ − p)² is the loss function, p = 1 is the target and the non-linearity f_NL(·) is the sigmoid activation function represented as σ(·).

The input to the network is {𝑥 1 = 1, 𝑥 2 = 1, 𝑥 3 = 0}

The weights are given as 𝑤1 = 2, 𝑤2 = −1, 𝑤3 = 1, 𝑏 = −1.

a. −0.5
b. −1
c. 0
d. −0.25

Correct Answer: d

Detailed Solution:

J(·) = (p̂ − p)²
p̂ = f_NL(y)
y = x1·w1 + x2·w2 + x3·w3 + b = 1×2 − 1×1 + 0×1 − 1 = 0
p̂ = σ(y) = σ(0) = 0.5

Using the chain rule,

∂J/∂w2 = (∂J/∂p̂) · (∂p̂/∂y) · (∂y/∂w2)
∂J/∂p̂ = 2(p̂ − p), ∂p̂/∂y = σ(y) × (1 − σ(y)), ∂y/∂w2 = x2

∂J/∂w2 = 2(p̂ − p) × σ(y) × (1 − σ(y)) × x2 = 2 × (0.5 − 1) × 0.5 × (1 − 0.5) × 1 = −0.25
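The numbers in this solution can be reproduced directly. A minimal sketch, assuming the single sigmoid neuron with three inputs and a bias, using the stated weights and target:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x1, x2, x3 = 1, 1, 0
w1, w2, w3, b = 2, -1, 1, -1
p_target = 1

y = x1 * w1 + x2 * w2 + x3 * w3 + b       # pre-activation, = 0
p_hat = sigmoid(y)                         # = 0.5
# Chain rule: dJ/dw2 = 2(p_hat - p) * sigma(y)(1 - sigma(y)) * x2
grad_w2 = 2 * (p_hat - p_target) * p_hat * (1 - p_hat) * x2
```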

QUESTION 8:
What will be the updated value of 𝑤2 after the first iteration from the current state of the
network shown below if 𝐽 (∙) = (𝑝̂ − 𝑝)2 is the loss function, 𝑝=1 is the target and the non-
linearity 𝑓𝑁𝐿 (∙) is the sigmoid activation function represented as 𝜎(∙)?

The input to the network is {𝑥 1 = 1, 𝑥 2 = 1, 𝑥 3 = 0}, the learning rate η = 2

The weights of the current state are given as 𝑤1 = 2, 𝑤2 = −1, 𝑤3 = 1, 𝑏 = −1.

a. −0.5
b. −1
c. 0
d. −0.25

Correct Answer: a

Detailed Solution:

J(·) = (p̂ − p)²
p̂ = f_NL(y)
y = x1·w1 + x2·w2 + x3·w3 + b = 1×2 − 1×1 + 0×1 − 1 = 0
p̂ = σ(y) = σ(0) = 0.5

Using the chain rule,

∂J/∂w2 = (∂J/∂p̂) · (∂p̂/∂y) · (∂y/∂w2)
∂J/∂p̂ = 2(p̂ − p), ∂p̂/∂y = σ(y) × (1 − σ(y)), ∂y/∂w2 = x2

∂J/∂w2 = 2(p̂ − p) × σ(y) × (1 − σ(y)) × x2 = 2 × (0.5 − 1) × 0.5 × (1 − 0.5) × 1 = −0.25

Updated w2 = w2 − η × (∂J/∂w2) = −1 − 2 × (−0.25) = −1 + 0.5 = −0.5
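The full update step can be sketched as follows (same assumed single-neuron network; learning rate η = 2 as stated):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [1, 1, 0]
w = [2, -1, 1]
b, p_target, eta = -1, 1, 2

y = sum(xi * wi for xi, wi in zip(x, w)) + b
p_hat = sigmoid(y)
grad_w2 = 2 * (p_hat - p_target) * p_hat * (1 - p_hat) * x[1]
w2_new = w[1] - eta * grad_w2   # -1 - 2 * (-0.25) = -0.5
```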

QUESTION 9:
Suppose a neural network has 3 input nodes, x, y, z. There are 2 neurons, Q and F. Q = x + y and
F = Q * z. What is the gradient of F with respect to x, y and z? Assume, (x, y, z) = (-2, 5, -4).

a. (-4, 3, -3)
b. (-4, -4, 3)
c. (4, 4, -3)
d. (3, 3, 4)

Correct Answer: b

Detailed Solution:

F = Q·z, so ∂F/∂z = Q = x + y = 3
F = Q·z = (x + y)·z, so ∂F/∂x = z = −4 and ∂F/∂y = z = −4
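A quick check of the backward pass through this two-node graph, using the stated values:

```python
x, y, z = -2, 5, -4
Q = x + y            # 3
F = Q * z            # -12
# Backward pass: F = (x + y) * z
dF_dz = Q            # 3
dF_dx = z            # -4
dF_dy = z            # -4
gradient = (dF_dx, dF_dy, dF_dz)
```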

QUESTION 10:
Suppose a fully-connected neural network has a single hidden layer with 15 nodes. The input is
represented by a 5D feature vector and the number of classes is 3. Calculate the number of
parameters of the network. Consider there are NO bias nodes in the network.

a. 225
b. 75
c. 78
d. 120

Correct Answer: d

Detailed Solution:

Number of parameters = (5 * 15) + (15 * 3) = 120
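The same count can be computed with a small helper (the function name is illustrative); it generalizes to any bias-free fully connected stack:

```python
def num_parameters(layer_sizes):
    # Bias-free fully connected network: each consecutive pair of
    # layers contributes in_size * out_size weights
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

params = num_parameters([5, 15, 3])   # (5*15) + (15*3) = 120
```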

************END*******

Deep Learning
Assignment- Week 6
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Which of the following is not true for PCA? Choose the correct option.

a. Rotates the axes to lie along the principal components


b. Is calculated from the covariance matrix
c. Removes some information from the data
d. Eigenvectors describe the length of the principal components

Correct Answer: d

Detailed Solution:

See the definition

Direct from classroom lecture

QUESTION 2:
What is the output range of the sigmoid function for an input with dynamic range [0, ∞)?

a. [0, 1]
b. [−1, 1]
c. [0.5, 1]
d. [0.25, 1]

Correct Answer: c

Detailed Solution:

Sigmoid(x) = 1 / (1 + e^(−x))

If x = 0, Sigmoid(0) = 1 / (1 + e^0) = 1/2 = 0.5
As x → ∞, Sigmoid(x) → 1 / (1 + 0) = 1
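A quick numerical check of both endpoints of the claimed range:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

low = sigmoid(0.0)     # 0.5 — the lower end of the output range
high = sigmoid(50.0)   # numerically indistinguishable from 1
```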

QUESTION 3:

A zero-bias autoencoder has 3 input neurons, 1 hidden neuron and 3 output neurons. If the network is perfectly trained using the input [2, 3, 5]ᵀ, what would be the values of the weights in the autoencoder?

a. [1 1 1], [2 3 5]ᵀ
b. [1 1 1], [0.2 0.3 0.5]ᵀ
c. [0.2 0.3 0.5], [1 1 1]ᵀ
d. [2 3 5], [1 1 1]ᵀ

Correct Answer: b

Detailed Solution:

y = W2 · W1 · x ····· (1)

where W1 is the encoder weight and W2 is the decoder weight.

If the network is perfectly trained, y = x = [2, 3, 5]ᵀ.

Equation (1) is only satisfied if W1 = [1 1 1] and W2 = [0.2 0.3 0.5]ᵀ.

QUESTION 4:
A single hidden and no-bias autoencoder has 100 input neurons and 10 hidden neurons. What
will be the number of parameters associated with this autoencoder?

a. 1000
b. 2000
c. 2110
d. 1010

Correct Answer: b

Detailed Solution:

For a single-hidden-layer, no-bias autoencoder:

Input neurons = 100, Hidden neurons = 10. So Output neurons = 100

Total number of parameters = 100*10+10*100=2000



QUESTION 5:
Consider the 2-layer neural network shown below. The weights are represented as follows: wᵏ_mn = weight between the nth node of the kth layer and the mth node of the (k − 1)th layer. The 0th node is the bias node = 1, as depicted in the diagram.

e.g. w¹₃₂ = weight between the 2nd node of the hidden layer and the 3rd node of the input layer. Refer to the diagram. All weights have not been shown, to maintain clarity.

Sigmoid activation function is applied to both the hidden layer and the output layer. The loss
function is defined as 𝐽(∙) = 0.5(𝑦 − 𝑡) 2 where 𝑡 is the true label.

The initial weights are given as:


W1 = [ −0.4   0.2   0.4  −0.5
        0.2  −0.3   0.1   0.2 ]        W2 = [ 0.1  −0.3  −0.2 ]

Find the output at node 𝑎1 and 𝑎2 for given input {𝑥 1 = 1, 𝑥 2 = 0, 𝑥 3 = 1}?

a. 0.13, 0.54
b. 0.33, 0.52
c. 0.23, 0.51
d. 0.13, 0.51

Correct Answer: b

Detailed Solution:

Let the input vector be X = [1 1 0 1]ᵀ (the leading 1 is the bias node).

a = σ(W1·X)
W1·X = [−0.4×1 + 0.2×1 + 0.4×0 − 0.5×1, 0.2×1 − 0.3×1 + 0.1×0 + 0.2×1]ᵀ = [−0.7, 0.1]ᵀ
σ([−0.7, 0.1]ᵀ) = [0.33, 0.52]ᵀ
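The hidden-layer computation can be reproduced without any libraries; a sketch assuming the bias-augmented input X = [1 1 0 1]ᵀ and the weight matrix W1 from the question:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

W1 = [[-0.4, 0.2, 0.4, -0.5],
      [ 0.2, -0.3, 0.1, 0.2]]
X = [1, 1, 0, 1]   # leading 1 is the bias node

pre = [sum(w * x for w, x in zip(row, X)) for row in W1]   # [-0.7, 0.1]
a = [sigmoid(v) for v in pre]                              # ~[0.33, 0.52]
```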

QUESTION 6:
Consider the 2-layer neural network shown below. The weights are represented as follows: wᵏ_mn = weight between the nth node of the kth layer and the mth node of the (k − 1)th layer. The 0th node is the bias node = 1, as depicted in the diagram.

e.g. w¹₃₂ = weight between the 2nd node of the hidden layer and the 3rd node of the input layer. Refer to the diagram. All weights have not been shown, to maintain clarity.

Sigmoid activation function is applied to both the hidden layer and the output layer. The loss
function is defined as 𝐽(∙) = 0.5(𝑦 − 𝑡) 2 where 𝑡 is the true label.

The initial weights are given as:


W1 = [ −0.4   0.2   0.4  −0.5
        0.2  −0.3   0.1   0.2 ]        W2 = [ 0.1  −0.3  −0.2 ]

Find the final output at node 𝑦 for given input {𝑥 1 = 1, 𝑥 2 = 0, 𝑥 3 = 1}? Choose the closest
answer.

a. 0.13
b. 0.33
c. 0.48
d. 0.51

Correct Answer: c

Detailed Solution:

Let the hidden vector be A = [1 0.33 0.52]ᵀ, as calculated in the previous question.

y = σ(W2·A)
W2·A = 0.1×1 − 0.3×0.33 − 0.2×0.52 = −0.1
σ(−0.1) ≈ 0.48
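The full forward pass can be sketched end to end (same assumed matrices; note the exact value is ≈ 0.474, which the question rounds to 0.48 as the closest option):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

W1 = [[-0.4, 0.2, 0.4, -0.5],
      [ 0.2, -0.3, 0.1, 0.2]]
W2 = [0.1, -0.3, -0.2]
X = [1, 1, 0, 1]   # bias node + inputs

hidden = [sigmoid(sum(w * x for w, x in zip(row, X))) for row in W1]
A = [1] + hidden                                  # prepend hidden bias node
y = sigmoid(sum(w * a for w, a in zip(W2, A)))    # ~0.474
```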

QUESTION 7:
Consider the 2-layer neural network shown below. The weights are represented as follows: wᵏ_mn = weight between the nth node of the kth layer and the mth node of the (k − 1)th layer. The 0th node is the bias node = 1, as depicted in the diagram.

e.g. w¹₃₂ = weight between the 2nd node of the hidden layer and the 3rd node of the input layer. Refer to the diagram. All weights have not been shown, to maintain clarity.

Sigmoid activation function is applied to both the hidden layer and the output layer. The loss
function is defined as 𝐽(∙) = 0.5(𝑦 − 𝑡) 2 where 𝑡 is the true label.

The initial weights are given as:


W1 = [ −0.4   0.2   0.4  −0.5
        0.2  −0.3   0.1   0.2 ]        W2 = [ 0.1  −0.3  −0.2 ]
Find the gradient component ∂J/∂w²₁₁ for t = 1 and given input {x1 = 1, x2 = 0, x3 = 1}. Choose the closest answer.

a. −0.09
b. −0.11
c. −0.13
d. −0.04

Correct Answer: d

Detailed Solution:

J(·) = 0.5(y − t)²

t = 1 and y = 0.48

Let a = W2·A = w²₀₁ + w²₁₁·a1 + w²₂₁·a2 = 0.1×1 − 0.3×0.33 − 0.2×0.52 = −0.1

y = σ(a) and J(·) = 0.5(y − t)²

Using the chain rule,

∂J/∂w²₁₁ = (∂J/∂y) · (∂y/∂a) · (∂a/∂w²₁₁)
∂J/∂y = (y − t), ∂y/∂a = σ(a) × (1 − σ(a)), ∂a/∂w²₁₁ = a1

∂J/∂w²₁₁ = (y − t) × σ(a) × (1 − σ(a)) × a1 = (0.48 − 1) × 0.48 × (1 − 0.48) × 0.33 ≈ −0.04

QUESTION 8:
Consider the 2-layer neural network shown below. The weights are represented as follows: wᵏ_mn = weight between the nth node of the kth layer and the mth node of the (k − 1)th layer. The 0th node is the bias node = 1, as depicted in the diagram.

e.g. w¹₃₂ = weight between the 2nd node of the hidden layer and the 3rd node of the input layer. Refer to the diagram. All weights have not been shown, to maintain clarity.

Sigmoid activation function is applied to both the hidden layer and the output layer. The loss
function is defined as 𝐽(∙) = 0.5(𝑦 − 𝑡) 2 where 𝑡 is the true label.

The initial weights are given as:


W1 = [ −0.4   0.2   0.4  −0.5
        0.2  −0.3   0.1   0.2 ]        W2 = [ 0.1  −0.3  −0.2 ]
Find the updated value of w²₂₁ after 1 iteration for t = 1, the learning rate η = 0.9 and given input {x1 = 1, x2 = 0, x3 = 1}. Choose the closest answer.

a. −0.29
b. −0.1
c. −0.14
d. −0.04

Correct Answer: c

Detailed Solution:

J(·) = 0.5(y − t)²

t = 1 and y = 0.48

Let a = W2·A = w²₀₁ + w²₁₁·a1 + w²₂₁·a2 = 0.1×1 − 0.3×0.33 − 0.2×0.52 = −0.1

y = σ(a)

Using the chain rule,

∂J/∂w²₂₁ = (∂J/∂y) · (∂y/∂a) · (∂a/∂w²₂₁)
∂J/∂y = (y − t), ∂y/∂a = σ(a) × (1 − σ(a)), ∂a/∂w²₂₁ = a2

∂J/∂w²₂₁ = (y − t) × σ(a) × (1 − σ(a)) × a2 = (0.48 − 1) × 0.48 × (1 − 0.48) × 0.52 ≈ −0.07

Updated w²₂₁ = w²₂₁ − η × (∂J/∂w²₂₁) = −0.2 − 0.9 × (−0.07) ≈ −0.2 + 0.06 = −0.14
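The whole update for w²₂₁ can be sketched numerically (using the rounded hidden activations from the solution, so the result matches only approximately):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Rounded hidden activations from the forward pass, as in the solution
A = [1, 0.33, 0.52]
W2 = [0.1, -0.3, -0.2]
t, eta = 1, 0.9

a = sum(w * ai for w, ai in zip(W2, A))          # ~ -0.1
y = sigmoid(a)                                    # ~ 0.47
grad = (y - t) * y * (1 - y) * A[2]               # ~ -0.07
w21_new = W2[2] - eta * grad                      # ~ -0.14
```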

QUESTION 9:
y = min(a, b) and a > b. What are the values of dy/da and dy/db?

a. 1, 0
b. 0, 1
c. 0, 0
d. 1, 1

Correct Answer: b

Detailed Solution:

y = min(a, b) and a > b.

Now y = b. So dy/da = 0 and dy/db = 1.

QUESTION 10:
Let’s say vectors a⃗ = {2; 4} and b⃗ = {n; 1} form the first two principal components after applying PCA. Under such circumstances, which among the following can be a possible value of n?

a. 2
b. -2
c. 0
d. 1

Correct Answer: b

Detailed Solution:

Only option (b) makes the two vectors orthogonal.
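Successive principal components are mutually orthogonal, so the check reduces to a dot product; a quick sketch over the four candidate values of n:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a_vec = (2, 4)
candidates = {n: dot(a_vec, (n, 1)) for n in (2, -2, 0, 1)}
# Only n = -2 makes the dot product zero, i.e. the vectors orthogonal
orthogonal_n = [n for n, d in candidates.items() if d == 0]
```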

____________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 7
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Select the correct option about Sparse Autoencoder?

Statement 1: Sparse autoencoders introduce an information bottleneck by reducing the number of nodes at hidden layers

Statement 2: The idea is to encourage the network to learn an encoding and decoding which only relies on activating a small number of neurons

a. Both the statements are true


b. Statement 1 is true, but Statement 2 is false
c. Statement 1 is false, but statement 2 is true
d. Both the statements are false

Correct Answer: c

Detailed Solution:

Sparse autoencoders introduce an information bottleneck without requiring a reduction in the number of nodes at the hidden layers. They encourage the network to learn an encoding and decoding which only relies on activating a small number of neurons.

______________________________________________________________________________

QUESTION 2:
Select the correct option about Denoising autoencoders?

Statement A: The loss is between the original input and the reconstruction from a noisy version
of the input

Statement B: Denoising autoencoders can be used as a tool for feature extraction.

a. Both the statements are false


b. Statement A is false but Statement B is true

c. Statement A is true but Statement B is false


d. Both the statements are true

Correct Answer: d

Detailed Solution:

For denoising autoencoder, both statement 1 and 2 are true. Thus option (d) is correct

______________________________________________________________________________

QUESTION 3:
Which of the following autoencoder methods uses corrupted versions of the input?

a. Overcomplete design
b. Undercomplete Design
c. Sparse Design
d. Denoising Design

Correct Answer: d

Detailed Solution:

Refer to classroom lecture.

______________________________________________________________________________

QUESTION 4:
Which of the following autoencoder methods uses a hidden layer with fewer units than the
input layer?

a. Overcomplete design
b. Undercomplete Design
c. Sparse Design
d. Denoising Design

Correct Answer: b

Detailed Solution:

Refer to classroom lecture.



QUESTION 5:

Which of the following is false about autoencoder?

a. Autoencoders possesses generalization capabilities


b. Autoencoders are best suited for image captioning task
c. Its objective is to minimize the reconstruction loss so that output is similar to
input
d. It compresses the input into a latent space representation and then reconstruct
the output from it

Correct Answer: b

Detailed Solution:

Except option (b), all the other options are true about autoencoders.

____________________________________________________________________________

QUESTION 6:
Find the value of 𝑑(𝑡 − 34) ∗ 𝑥(𝑡 + 56); 𝑑(𝑡) being the delta function and * being the
convolution operation.

a. 𝑥(𝑡 + 56)
b. 𝑥(𝑡 + 32)
c. 𝑥(𝑡 + 22)
d. 𝑥(𝑡 − 56)

Correct Answer: c

Detailed Solution:

Convolution with a shifted delta shifts the function by the same amount: d(t − 34) ∗ x(t + 56) = x(t + 56 − 34) = x(t + 22).

_____________________________________________________________________________

QUESTION 7:
Impulse response is the output of ________________system due to impulse input applied at
time=0. Fill in the blanks from the options below.

a. Linear
b. Time Varying
c. Time Invariant
d. Linear And Time Invariant

Correct Answer: d

Detailed Solution:

Impulse response is the output of an LTI system due to an impulse input applied at time t = 0 (or n = 0). The behaviour of an LTI system is characterized by its impulse response.

_________________________________________________________________________

QUESTION 8:
The impulse function is ___ when t=0. Fill in the blanks.

a. 1
b. 0
c. Infinity
d. None of the above

Correct Answer: a

Detailed Solution:

By definition, the (discrete-time) unit impulse equals 1 at t = 0 and 0 everywhere else.

______________________________________________________________________________

QUESTION 9:
Given the image below where, Row 1: Original Input, Row 2: Noisy input, Row 3: Reconstructed
output. Choose one of the following variants of autoencoder that is most suited to get Row 3
from Row 2.

a. Stacked autoencoder
b. Sparse autoencoder
c. Denoising autoencoder
d. None of the above

Correct Answer: c

Detailed Solution:

Reconstruction of original noise-free data from noisy input is the tasks of denoising
autoencoder

____________________________________________________________________________

QUESTION 10:
Which of the following is true for Contractive Autoencoders?

a. penalizing instances where a small change in the input leads to a large change in
the encoding space
b. penalizing instances where a large change in the input leads to a small change in
the encoding space
c. penalizing instances where a small change in the input leads to a small change in
the encoding space
d. None of the above

Correct Answer: a

Detailed Solution:

Direct from definition of Contractive autoencoders

______________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 8
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Which of the following is false about CNN?

a. Output should be flattened before feeding it to a fully connected layer


b. There can be only 1 fully connected layer in a CNN
c. We can use as many convolutional layers as needed in a CNN
d. None of the above
Correct Answer: b

Detailed Solution:

Direct from classroom lecture


______________________________________________________________________________

QUESTION 2:
The input image has been converted into a matrix of size 64 X 64 and a kernel/filter of size 5x5
with a stride of 1 and no padding. What will be the size of the convoluted matrix?

a. 5x5
b. 59x59
c. 64x64
d. 60x60

Correct Answer: d

Detailed Solution:

The size of the convoluted matrix is given by CxC where C=((I-F+2P)/S)+1, where C is the
size of the Convoluted matrix, I is the size of the input matrix, F the size of the filter matrix
and P the padding applied to the input matrix. Here P=0, I=64, F=5 and S=1. Therefore,
the answer is 60x60.
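The size formula can be wrapped in a small helper (the function name is illustrative); it also reproduces the other convolution-size computations in this set:

```python
def conv_output_size(i, f, p=0, s=1):
    # Spatial size of a convolution output: (I - F + 2P) / S + 1
    return (i - f + 2 * p) // s + 1

size = conv_output_size(64, 5, p=0, s=1)   # 60
```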
______________________________________________________________________________

QUESTION 3:

Filter size of 3x3 is convolved with matrix of size 4x4 (stride=1). What will be the size of output
matrix if valid padding is applied:

a. 4x4
b. 3x3
c. 2x2
d. 1x1

Correct Answer: c

Detailed Solution:

Valid padding is used when no padding is required (P = 0). The output matrix after convolution has dimension ((n − f + 2P)/S + 1) x ((n − f + 2P)/S + 1) = ((4 − 3)/1 + 1) x ((4 − 3)/1 + 1) = 2x2.

______________________________________________________________________________

QUESTION 4:
Let us consider a Convolutional Neural Network having three different convolutional layers in
its architecture as:

Layer-1: Filter Size – 3 X 3, Number of Filters – 10, Stride – 1, Padding – 0

Layer-2: Filter Size – 5 X 5, Number of Filters – 20, Stride – 2, Padding – 0

Layer-3: Filter Size – 5 X5 , Number of Filters – 40, Stride – 2, Padding – 0

Layer 3 of the above network is followed by a fully connected layer. If we give a 3-D
image input of dimension 39 X 39 to the network, then which of the following is the input
dimension of the fully connected layer.

a. 1960
b. 2200
c. 4563
d. 13690

Correct Answer: a

Detailed Solution:

The input image of dimension 39 X 39 X 3 convolves with 10 filters of size 3 X 3, stride 1 and no padding. After these operations, we get an output of 37 X 37 X 10.

Output of layer 2 would be 17 X 17 X 20.

Output of layer 3 would be 7 X 7 X 40. Flattening this gives 7 x 7 x 40 = 1960.
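Chaining the same output-size formula through the three layers; a sketch assuming square feature maps and the stated layer configurations:

```python
def conv_output_size(i, f, p=0, s=1):
    # Spatial size of a convolution output: (I - F + 2P) / S + 1
    return (i - f + 2 * p) // s + 1

size, depth = 39, 3
# (filter size, number of filters, stride) for each layer
for f, k, s in [(3, 10, 1), (5, 20, 2), (5, 40, 2)]:
    size = conv_output_size(size, f, p=0, s=s)
    depth = k

flattened = size * size * depth   # 7 * 7 * 40 = 1960
```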

______________________________________________________________________________

QUESTION 5:
Suppose you have 40 convolutional kernels of size 3 x 3 with no padding and stride 1 in the first layer of a convolutional neural network. You pass an input of dimension 1024x1024x3 through this layer. What are the dimensions of the data which the next layer will receive?

a. 1020x1020x40
b. 1022x1022x40
c. 1021x1021x40
d. 1022x1022x3

Correct Answer: b

Detailed Solution:

The layer accepts a volume of size W1×H1×D1; in our case, 1024x1024x3.

It requires four hyperparameters: number of filters K=40, their spatial extent F=3, the stride S=1, and the amount of padding P=0.

It produces a volume of size W2×H2×D2 where W2=(W1−F+2P)/S+1 = (1024−3)/1+1 = 1022, H2=(H1−F+2P)/S+1 = (1024−3)/1+1 = 1022 (width and height are computed equally by symmetry), and D2 = number of filters K = 40. So the next layer receives 1022x1022x40.
____________________________________________________________________________

QUESTION 6:
Consider a CNN model which aims at classifying an image as either a rose,or a marigold, or a lily
or orchid (consider the test image can have only 1 of the images at a time) . The last (fully-
connected) layer of the CNN outputs a vector of logits, L, that is passed through a ____
activation that transforms the logits into probabilities, P. These probabilities are the model
predictions for each of the 4 classes.

Fill in the blanks with the appropriate option.

a. Leaky ReLU
b. Tanh
c. ReLU
d. Softmax

Correct Answer: d

Detailed Solution:

Softmax works best if there is one true class per example, because it outputs a probability
vector whose entries sum to 1.

____________________________________________________________________________

QUESTION 7:
Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with
100 filters that are each 5x5. How many parameters does this hidden layer have (without bias)

a. 2501
b. 2600
c. 7500
d. 7600

Correct Answer: c

Detailed Solution:

As we have a RGB Image so each filter would be 3D, whose dimension is 5 * 5 * 3 = 75

Now we have 100 such filters. As there is no bias, the total number of parameters = 5 * 5 * 3 * 100 = 7500.

______________________________________________________________________________

QUESTION 8:
Which of the following activation functions can lead to vanishing gradients?

a. ReLU
b. Sigmoid
c. Leaky ReLU
d. None of the above

Correct Answer: b

Detailed Solution:

For the sigmoid activation, in its saturated regions a large change in the input causes only a small change in the output, so the derivative becomes small. When more and more layers use such an activation, the gradient of the loss function becomes very small, making the network difficult to train.

___________________________________________________________________________

QUESTION 9:
Statement 1: Residual networks can be a solution for vanishing gradient problem

Statement 2: Residual networks provide residual connections straight to earlier layers

Statement 3: Residual networks can never be a solution for vanishing gradient problem

Which of the following option is correct?

a. Statement 2 is correct
b. Statement 3 is correct
c. Both Statement 1 and Statement 2 are correct
d. Both Statement 2 and Statement 3 are correct

Correct Answer: c

Detailed Solution:

Residual networks can be a solution to vanishing gradient problems, as they provide


residual connections straight to earlier layers. This residual connection doesn’t go through
activation functions that “squashes” the derivatives, resulting in a higher overall derivative
of the block.

____________________________________________________________________________

QUESTION 10:
Input to SoftMax activation function is [0.5,0.5,1]. What will be the output?

a. [0.28,0.28,0.44]
b. [0.022,0.956, 0.022]
c. [0.045,0.910,0.045]
d. [0.42, 0.42,0.16]

Correct Answer: a

Detailed Solution:
SoftMax: σ(x_j) = e^(x_j) / Σ_{k=1..n} e^(x_k), for j = 1, 2, …, n

Therefore, σ(0.5) = e^0.5 / (e^0.5 + e^0.5 + e^1) ≈ 0.27 and σ(1) ≈ 0.45, for which option (a) is the closest of the listed choices.
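Computing the softmax directly (with the standard max-subtraction trick for numerical stability); the exact values are ≈ [0.274, 0.274, 0.452], for which option (a) is the closest of the listed choices:

```python
import math

def softmax(xs):
    # Subtracting the max does not change the result but avoids overflow
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.5, 0.5, 1.0])   # ~[0.274, 0.274, 0.452]
```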

______________________________________________________________________

______________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 9
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
What can be a possible consequence of choosing a very small learning rate?
a. Slow convergence
b. Overshooting minima
c. Oscillations around the minima
d. All of the above

Correct Answer: a
Detailed Solution:
Choosing a very small learning rate can lead to slower convergence and thus option (a) is
correct.
______________________________________________________________________________

QUESTION 2:
The following is the equation of update vector for momentum optimizer. Which of the
following is true for 𝛾?
𝑉𝑡 = 𝛾𝑉𝑡−1 + 𝜂∇𝜃 𝐽(𝜃)
a. 𝛾 is the momentum term which indicates acceleration
b. 𝛾 is the step size
c. 𝛾 is the first order moment
d. 𝛾 is the second order moment

Correct Answer: a
Detailed Solution:
A fraction of the update vector of the past time step is added to the current update vector. 𝛾 is
that fraction which indicates how much acceleration you want and its value lies between 0 and 1.
______________________________________________________________________________

QUESTION 3:
Which of the following is true about momentum optimizer?

a. It helps accelerating Stochastic Gradient Descent in right direction


b. It helps prevent unwanted oscillations
c. It helps to know the direction of the next step with knowledge of the previous step
d. All of the above

Correct Answer: d
Detailed Solution:
Option (a), (b) and (c) all are true for momentum optimiser. Thus, option (d) is correct.
______________________________________________________________________________

QUESTION 4:
Let 𝐽(𝜃) be the cost function. Let the gradient descent update rule for 𝜃𝑖 be,

𝜃𝑖+1 = 𝜃𝑖 + ∇𝜃𝑖

What is the correct expression for ∇θᵢ? α is the learning rate.

a. −α · dJ(θᵢ)/dθᵢ
b. α · dJ(θᵢ)/dθᵢ
c. −dJ(θᵢ)/dθᵢ₊₁
d. dJ(θᵢ)/dθᵢ

Correct Answer: a
Detailed Solution:
Gradient descent update rule for θᵢ is θᵢ₊₁ = θᵢ − α · dJ(θᵢ)/dθᵢ, where α is the learning rate. Hence ∇θᵢ = −α · dJ(θᵢ)/dθᵢ.
______________________________________________________________________________

QUESTION 5:
A given cost function is of the form J(θ) = 6θ² − 6θ + 6. What is the weight update rule for gradient descent optimization at step t+1? Consider α to be the learning rate.

a. 𝜃𝑡+1 = 𝜃𝑡 − 6𝛼(2𝜃 − 1)
b. 𝜃𝑡+1 = 𝜃𝑡 + 6𝛼(2𝜃)
c. 𝜃𝑡+1 = 𝜃𝑡 − 𝛼(12𝜃 − 6 + 6)
d. 𝜃𝑡+1 = 𝜃𝑡 − 6𝛼(2𝜃 + 1)

Correct Answer: a
Detailed Solution:
∂J(θ)/∂θ = 12θ − 6 = 6(2θ − 1)

So the weight update will be θₜ₊₁ = θₜ − 6α(2θₜ − 1).
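Iterating this update numerically shows it converging to the minimizer θ = 0.5 (where 12θ − 6 = 0); a sketch with an assumed learning rate α = 0.05:

```python
def grad(theta):
    # dJ/dθ for J(θ) = 6θ² − 6θ + 6
    return 12 * theta - 6

theta, alpha = 0.0, 0.05
for _ in range(200):
    theta = theta - alpha * grad(theta)   # θ ← θ − 6α(2θ − 1)

# theta converges to the minimizer 0.5
```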
______________________________________________________________________________

QUESTION 6:
If the first few iterations of gradient descent cause the function f(θ0,θ1) to increase rather than
decrease, then what could be the most likely cause for this?

a. we have set the learning rate to too large a value


b. we have set the learning rate to zero
c. we have set the learning rate to a very small value
d. learning rate is gradually decreased by a constant value after every epoch

Correct Answer: a
Detailed Solution:
If the learning rate were small enough, gradient descent would take a tiny step downhill and decrease f(θ0,θ1) at least a little. If gradient descent instead increases the objective value, the learning rate is too high.
______________________________________________________________________________

QUESTION 7:
For a function f(θ0,θ1), if θ0 and θ1 are initialized at a global minimum, then what should be the
values of θ0 and θ1 after a single iteration of gradient descent?

a. θ0 and θ1 will update as per gradient descent rule


b. θ0 and θ1 will remain same
c. Depends on the values of θ0 and θ1
d. Depends on the learning rate

Correct Answer: b
Detailed Solution:

At a global minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters.
______________________________________________________________________________

QUESTION 8:
What can be one of the practical problems of exploding gradient?
a. Too large update of weight values leading to unstable network
b. Too small update of weight values inhibiting the network to learn
c. Too large update of weight values leading to faster convergence
d. Too small update of weight values leading to slower convergence

Correct Answer: a
Detailed Solution:
Exploding gradients are a problem where large error gradients accumulate and result in very
large updates to neural network model weights during training. This has the effect of your model
being unstable and unable to learn from your training data.
______________________________________________________________________________

QUESTION 9:
What are the steps for using a gradient descent algorithm?

1. Calculate error between the actual value and the predicted value
2. Update the weights and biases using gradient descent formula
3. Pass an input through the network and get values from output layer
4. Initialize weights and biases of the network with random values
5. Calculate gradient value corresponding to each weight and bias

a. 1, 2, 3, 4, 5
b. 5, 4, 3, 2, 1
c. 3, 2, 1, 5, 4
d. 4, 3, 1, 5, 2

Correct Answer: d
Detailed Solution:
Initialize random weights, and then start passing input instances and calculate error response
from output layer and back-propagate the error through each subsequent layers. Then update the
neuron weights using a learning rate and gradient of error. Please refer to the lectures of week 4.
______________________________________________________________________________

QUESTION 10:
You run gradient descent for 15 iterations with learning rate 𝜂 = 0.3 and compute error after
each iteration. You find that the value of error decreases very slowly. Based on this, which of
the following conclusions seems most plausible?

a. Rather than using the current value of η, use a larger value of η


b. Rather than using the current value of η, use a smaller value of η
c. Keep η = 0.3
d. None of the above

Correct Answer: a
Detailed Solution:
The error is decreasing very slowly; therefore, increasing the learning rate is the most plausible solution.
______________________________________________________________________________

______________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 10
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:

What is not a reason for using batch normalization?

a. Prevent overfitting
b. Faster convergence
c. Faster inference time
d. Prevent covariate shift

Correct Answer: c

Detailed Solution:
Batch normalization adds computation, so inference time does not become faster; if anything, it increases.
____________________________________________________________________________

QUESTION 2:
A neural network has 3 neurons in a hidden layer. The activations of the neurons for three batches are [1, 2, 3]ᵀ, [0, 2, 5]ᵀ, [6, 9, 2]ᵀ respectively. What will be the value of the mean if we use batch normalization in this layer?

a. [2.33, 4.33, 3.33]ᵀ
b. [2.00, 2.33, 5.66]ᵀ
c. [1.00, 1.00, 1.00]ᵀ
d. [0.00, 0.00, 0.00]ᵀ

Correct Answer: a

Detailed Solution:

(1/3) × ([1, 2, 3]ᵀ + [0, 2, 5]ᵀ + [6, 9, 2]ᵀ) = [2.33, 4.33, 3.33]ᵀ
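The per-neuron batch mean can be computed directly; a sketch treating each batch activation as one row:

```python
batch = [[1, 2, 3],   # activations for batch 1
         [0, 2, 5],   # batch 2
         [6, 9, 2]]   # batch 3

# Mean over the batch dimension, one value per neuron
mean = [sum(col) / len(batch) for col in zip(*batch)]
```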
______________________________________________________________________________

QUESTION 3:
How can we prevent underfitting?

a. Increase the number of data samples


b. Increase the number of features
c. Decrease the number of features
d. Decrease the number of data samples

Correct Answer: b

Detailed Solution:
Underfitting happens when the features are not expressive enough to capture the data distribution. We need to increase the number of features so that the data can be fitted well.
______________________________________________________________________________

QUESTION 4:
How do we generally calculate mean and variance during testing?

a. Batch normalization is not required during testing


b. Mean and variance based on test image
c. Estimated mean and variance statistics during training
d. None of the above

Correct Answer: c

Detailed Solution:
We generally calculate batch mean and variance statistics during training and use the estimated
batch mean and variance during testing.
______________________________________________________________________________

QUESTION 5:
Which one of the following is not an advantage of dropout?

a. Regularization
b. Prevent Overfitting
c. Improve Accuracy
d. Reduce computational cost during testing

Correct Answer: d

Detailed Solution:
Dropout zeroes out random features during training, but during testing we do not drop any features. So there is no reduction of computational cost at test time.
______________________________________________________________________________

QUESTION 6:
What is the main advantage of layer normalization over batch normalization?

a. Faster convergence
b. Lesser computation
c. Useful in recurrent neural network
d. None of these

Correct Answer: c

Detailed Solution:
See the lectures/lecture materials.
______________________________________________________________________________

QUESTION 7:
While training a neural network for image recognition task, we plot the graph of training error
and validation error. Which is the best for early stopping?

a. A
b. B
c. C
d. D

Correct Answer: c

Detailed Solution:
The point of minimum validation error is the best for early stopping.
______________________________________________________________________________

QUESTION 8:
Which among the following is NOT a data augmentation technique?

a. Random horizontal and vertical flip of an image
b. Random shuffle of all the pixels of an image
c. Random color jittering
d. All of the above are data augmentation techniques

Correct Answer: b

Detailed Solution:
Randomly shuffling all the pixels of an image destroys its structure, and the neural network will be unable to learn anything from it. So it is not a data augmentation technique.
______________________________________________________________________________

QUESTION 9:
Which of the following is true about model capacity (where model capacity means the ability of
neural network to approximate complex functions)?

a. As the number of hidden layers increases, model capacity increases
b. As the dropout ratio increases, model capacity increases
c. As the learning rate increases, model capacity increases
d. None of these

Correct Answer: a

Detailed Solution:

Dropout and learning rate have nothing to do with model capacity. Increasing the number of hidden layers increases the number of learnable parameters; therefore, model capacity increases.
______________________________________________________________________________

QUESTION 10:
Batch Normalization is helpful because

a. It normalizes all the input before sending it to the next layer
b. It returns the normalized mean and standard deviation of weights
c. It is a very efficient back-propagation technique
d. None of these

Correct Answer: a

Detailed Solution:
Batch normalization layer normalizes the input.

______________________________________________________________________________


************END*******

Deep Learning
Assignment- Week 11
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Which of the following can be a target output of a semantic segmentation problem with 4 classes?

a.
   I        II       III      IV
 0 1 0    1 1 0    0 0 1    0 0 0
 0 1 0    0 0 0    1 0 0    1 0 0
 1 0 0    0 0 0    0 0 0    0 0 0

b.
   I        II       III      IV
 0 1 0    1 0 0    0 0 1    0 1 0
 0 1 0    0 1 0    1 0 0    0 0 0
 1 0 0    0 0 0    0 0 0    0 1 1

c.
   I        II       III      IV
 0 1 0    1 0 0    0 0 1    0 0 0
 0 1 0    0 0 1    1 0 0    0 0 0
 1 0 0    0 0 0    0 0 0    0 1 1

d.
   I        II       III      IV
 0 1 0    1 0 0    0 0 1    0 0 0
 0 1 0    1 0 0    1 0 0    0 0 0
 1 1 0    0 1 0    0 0 1    0 1 1

Correct Answer: c

Detailed Solution:
The target output should be a one-hot encoded vector at every pixel location: a class map is 1 at a pixel if the pixel belongs to that particular class and 0 otherwise, so exactly one of the four maps is 1 at each pixel. Only option (c) satisfies this.
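The one-hot condition can also be verified programmatically: stacking the four class maps, every pixel must sum to exactly 1 across the class axis. A minimal sketch with NumPy, using the maps of option (c):

```python
import numpy as np

# Option (c): one 3x3 binary map per class, stacked along axis 0.
maps = np.array([
    [[0, 1, 0], [0, 1, 0], [1, 0, 0]],  # class I
    [[1, 0, 0], [0, 0, 1], [0, 0, 0]],  # class II
    [[0, 0, 1], [1, 0, 0], [0, 0, 0]],  # class III
    [[0, 0, 0], [0, 0, 0], [0, 1, 1]],  # class IV
])

# A valid semantic-segmentation target is one-hot at every pixel:
# summing over the class axis must give 1 everywhere.
is_one_hot = np.all(maps.sum(axis=0) == 1)
print(is_one_hot)  # True
```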

QUESTION 2:
Suppose you have a 1𝐷 signal 𝑥 = [1,2,3,4,5] and a filter 𝑓 = [1,2,3,4], and you perform stride
2 transpose convolution on the signal 𝑥 by the filter 𝑓 to get the signal 𝑦. What will be the
signal 𝑦 if we don’t perform cropping?

a. 𝑦 = [1,2,5,8,9,14,13,20,19,26,3,4]
b. 𝑦 = [1,2,3,4,5,4,3,2,1]
c. 𝑦 = [1,2,5,8,9,14,13,20,17,26,15,20]
d. 𝑦 = [0,0,5,8,9,14,13,20,19,26,0,0]

Correct Answer: c

Detailed Solution:
1 2 3 4 5

1 1

2 2

1*3+2*1=5 3 1

4*1+2*2=8 4 2

3*2+1*3=9 3 1

4*2+2*3=14 4 2

3*3+1*4=13 3 1

4*3+2*4=20 4 2

3*4+1*5=17 3 1

4*4+2*5=26 4 2

15 3

20 4

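The same result can be reproduced with a direct NumPy sketch of 1D transpose convolution (no cropping): each input element scatters a scaled copy of the filter into the output at offset stride × i.

```python
import numpy as np

def transpose_conv1d(x, f, stride):
    # Output length without cropping: stride*(len(x)-1) + len(f).
    y = np.zeros(stride * (len(x) - 1) + len(f))
    for i, xi in enumerate(x):
        # Each input element adds a scaled copy of the filter
        # beginning at output position stride * i.
        y[stride * i : stride * i + len(f)] += xi * np.asarray(f, dtype=float)
    return y

y = transpose_conv1d([1, 2, 3, 4, 5], [1, 2, 3, 4], stride=2)
print(y.tolist())  # [1, 2, 5, 8, 9, 14, 13, 20, 17, 26, 15, 20]
```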
______________________________________________________________________________

QUESTION 3:
What are the different challenges one faces while creating a facial recognition system?

a. Different illumination conditions
b. Different pose and orientation of face images
c. Limited dataset for training
d. All of the above

Correct Answer: d

Detailed Solution:
Please refer to the lecture of week 11.
______________________________________________________________________________

QUESTION 4:
The Fully Convolutional Network (FCN) became one of the major successful network architectures. Can you identify the advantages of FCN which make it a successful architecture for semantic segmentation?

a. Larger receptive field
b. Mixing of global features
c. Lesser computation required
d. All of the above

Correct Answer: d

Detailed Solution:
FCN has a larger receptive field by using strided convolution layers; it also mixes global features, and the number of computations is reduced because the image resolution is down-sampled. Please refer to the lecture of week 11.
____________________________________________________________________________

QUESTION 5:
In a Deep CNN architecture, the feature map before applying a max-pool layer with a (2×2) kernel is given below.
12 6 15 9
19 2 7 18
14 2 17 6
3 5 19 2
After a few successive convolution layers, the feature map is again up-sampled using max-unpooling. If the following feature map is present before the max-unpooling layer, what will be the output of the max-unpooling layer?
5 6
8 13

a.
0 0 0 0
5 0 0 6
8 0 0 0
0 0 13 0
b.
5 5 6 6
5 5 6 6
8 8 13 13
8 8 13 13
c.
5 0 6 0
0 0 0 0
8 0 13 0
0 0 0 0
d. None of the above

Correct Answer: a

Detailed Solution:
Max-unpooling places each new value at the position where the maximum was recorded during the earlier max-pool operation (19 at (1,0), 18 at (1,3), 14 at (2,0), 19 at (3,2), in (row, column) indexing) and fills the remaining positions with zeros.
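A minimal NumPy sketch of max-unpooling: the argmax position of each 2×2 window of the pre-pool map is recovered, and the new value is scattered back to that position.

```python
import numpy as np

pre_pool = np.array([[12, 6, 15, 9],
                     [19, 2, 7, 18],
                     [14, 2, 17, 6],
                     [3, 5, 19, 2]])
to_unpool = np.array([[5, 6],
                      [8, 13]])

unpooled = np.zeros_like(pre_pool)
for i in range(2):
    for j in range(2):
        # Locate the max inside each 2x2 window of the pre-pool map ...
        window = pre_pool[2*i:2*i+2, 2*j:2*j+2]
        r, c = np.unravel_index(window.argmax(), window.shape)
        # ... and place the corresponding new value at that position.
        unpooled[2*i + r, 2*j + c] = to_unpool[i, j]

print(unpooled)  # 5, 6, 8, 13 land where 19, 18, 14, 19 were; zeros elsewhere
```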


____________________________________________________________________________

QUESTION 6:
What could be thought of as a disadvantage of the fully convolutional network for semantic segmentation, as addressed by other researchers?

a. It has a fixed receptive field, so an object smaller than the receptive field will be missed by the network.
b. Down Sampling the image dimension over the depth makes the feature map
sparse.
c. It requires lot of computation.
d. None of the above

Correct Answer: a

Detailed Solution:
The fixed receptive field is the disadvantage of the fully convolutional network for semantic segmentation addressed by the "Learning Deconvolution Network for Semantic Segmentation" paper.
____________________________________________________________________________

QUESTION 7:
What will be the dice coefficient of the following two one-hot encoded vectors? (|A| = number of 1 bits)

A 1 0 1 0 0 0 1 1 1 0 0 1 0 1
B 1 0 0 0 0 1 1 1 0 0 0 1 0 0

a. 0.83
b. 0.41
c. 0.67
d. 0.90

Correct Answer: c

Detailed Solution:
Number of 1 bits in A = 7
Number of 1 bits in B = 5
Overlapping 1 bits = 4

Dice coefficient = 2·|A∩B| / (|A| + |B|) = (2 × 4) / (7 + 5) = 0.67
______________________________________________________________________________

QUESTION 8:
What will be the value of dice coefficient between A and B?

(Consider, |A|= sum of all elements.)


a. 0.93
b. 0.77
c. 0.11
d. 0.89

Correct Answer: a

Detailed Solution:

|A| = 7.82
|B| = 8

A∩B =
0     0     0     0
0     0     0     0
0.89  0.85  0.88  0.91
0.99  0.97  0.95  0.97

|A∩B| = 7.42

Dice coefficient = 2·|A∩B| / (|A| + |B|) = (2 × 7.42) / (8 + 7.82) = 0.93

______________________________________________________________________________

QUESTION 9:
In FaceNet, why is the L2 normalization layer used?
a. To constrain the embedding function in a d-dimensional hyper-sphere.
b. For regularization of weight vector, i.e. L2 regularization.
c. For getting a sparse embedding function.
d. None of the above.

Correct Answer: a

Detailed Solution:
Using the L2 normalization layer we impose the constraint that ||f(x)||₂² = 1. This constrains the embedding function to live on the d-dimensional hypersphere.
______________________________________________________________________________

QUESTION 10:
What is the use of Skip Connection in image denoising networks?

a. Helping the de-convolution layer to recover an improved clean version of the image.
b. Back-propagating the gradient to bottom layers, which makes the training easy.
c. Creating a direct path between a convolution layer and the corresponding mirror de-convolution layer.
d. All of the above.

Correct Answer: d

Detailed Solution:

Please refer to lecture of week 11.


______________________________________________________________________________


***********END*******

Deep Learning
Assignment- Week 12
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
During training a Variational Auto-encoder (VAE), it is assumed that 𝑃(𝑧|𝑥) ∼ 𝑁(0, 𝐼) i.e., given
an input sample, the encoder is forced to map its latent code to 𝑁(0, 𝐼). After the training is
over, we want to use the VAE as a generative model. What should be the best choice of
distribution from which we should sample a latent vector to generate a novel example?

a. 𝑁(0, 𝐼): Normal distribution with zero mean and identity covariance
b. 𝑁(1, 𝐼): Normal distribution with mean = 1 and identity covariance
c. Uniform distribution between [-1, 1]
d. 𝑁(−1, 𝐼): Normal distribution with mean = -1 and identity covariance

Correct Answer: a
Detailed Solution:
Since during training we forced the latent code to follow N(0, I), the decoder has learnt to map latent codes from that distribution only. If during sampling we provide vectors from any other distribution, the decoder is unlikely to have encountered such vectors during training, which leads to unrealistic reconstructions. So we should sample vectors from N(0, I) when using the pre-trained VAE as a generative model.
______________________________________________________________________________

QUESTION 2:
When the GAN game has converged to its Nash equilibrium (when the Discriminator randomly
makes an error in distinguishing fake samples from real samples), what is the probability (of
belongingness to real class) given by the Discriminator to a fake generated sample?

a. 1
b. 0.5
c. 0
d. 0.25

Correct Answer: b

Detailed Solution:

Nash equilibrium is reached when the generated distribution p_g(x) equals the original data distribution p_data(x), which leads to D(x) = 0.5 for all x.
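This follows from the optimal-discriminator expression D*(x) = p_data(x) / (p_data(x) + p_g(x)) derived in the original GAN analysis; a small numerical sketch (the toy distribution values below are made up for illustration):

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    # Optimal D for a fixed G, from the GAN value function.
    return p_data / (p_data + p_g)

p_data = np.array([0.1, 0.3, 0.4, 0.2])   # toy data distribution
p_g = p_data.copy()                        # at Nash equilibrium, p_g = p_data
print(optimal_discriminator(p_data, p_g))  # [0.5 0.5 0.5 0.5]
```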
______________________________________________________________________________

QUESTION 3:
Why is re-parameterization trick used in VAE?

a. Without re-parameterization, the mean vector of the latent code of the VAE encoder will tend towards zero
b. Sampling from the VAE encoder latent space is non-differentiable, and thus we cannot back-propagate gradients during optimization using gradient descent
c. We need to re-parameterize the Normal distribution over the latent space to a Bernoulli distribution
d. None of the above

Correct Answer: b

Detailed Solution:
We cannot sample in a differentiable manner from within a computational graph present in a neural network. Re-parameterization moves the sampling operation outside the main computational graph, which allows regular gradient-descent optimization.
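A minimal sketch of the trick (the mu and sigma values are hypothetical encoder outputs for illustration): epsilon is sampled from N(0, I) outside the graph, and z is formed as a deterministic, differentiable function of mu and sigma.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one sample.
mu = np.array([0.5, -1.0])
sigma = np.array([0.1, 0.2])

# eps is the only source of randomness; since it is sampled outside the
# computational graph, z is a differentiable function of mu and sigma.
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps
print(z)
```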
______________________________________________________________________________

QUESTION 4:
Which one of the following graphical models fully represents a Variational Auto-encoder (VAE)
realization?

Correct Answer: a

Detailed Explanation:
For practical realization of VAE, we have an encoder 𝑄(∙) which receives an input signal, 𝑥 and
generates a latent code, 𝑧. This part of the network can be denoted by 𝑄(𝑧|𝑥) and directed from
𝑥 to 𝑧. Next, we have a decoder section which takes the encoded z vector to reconstruct the input
signal, 𝑥. This part of the network is represented by 𝑃(𝑥|𝑧) and should be directed from 𝑧 to 𝑥.
______________________________________________________________________________

QUESTION 5:
Which one of the following computational graphs correctly depicts the re-parameterization trick deployed for a practical Variational Auto-encoder (VAE) implementation? Circular nodes represent random nodes in the models and quadrilateral nodes represent deterministic nodes.

a        b        c        d

Correct Answer: a

Detailed Solution:

With the re-parameterization trick, the only random component in the network is the node of ∊
which is sampled from 𝑁(0, 𝐼). The other nodes of μ and σ are deterministic. Since ∊ is sampled
from outside the computational graph, the overall z vector also becomes deterministic
component for a given set of μ, σ and ∊. Also, if z is not deterministic, we cannot back propagate
gradients through it. Also, in the computation graph, the forward arrows will emerge from μ,σ,ϵ
towards z for computing the z vector.
______________________________________________________________________________

QUESTION 6:
For the following min-max game, at which state of (x, y) do we achieve the Nash equilibrium
(the state where change of one variable does not alter the state of the other variable)?

a. x = 0, y = -1
b. x = 0, y = 0
c. x = 0, y = 1
d. x = ∞ (infinite), y = 0

Correct Answer: b

Detailed Solution:
The Nash equilibrium is x = y = 0. This is the only state where the action of one player does not affect the other player's move, i.e., the only state in which no opponent action can change the game outcome.
______________________________________________________________________________

QUESTION 7:
Which of the following losses can be used to optimize for generator’s objective (while training a
Generative Adversarial network) by MINIMIZING with gradient descent optimizer? Consider
cross-entropy loss,

CE(a, b) = - [ a*log(b) + (1-a)*log(1-b)]

and D(G(z)) = probability of belonging to real class as output by the Discriminator for a given
generated sample G(z).

a. CE(1, D(G(z)))
b. CE(1, -D(G(z)))
c. CE(1, 1 - D(G(z)))
d. CE(1, 1 / D(G(z)))

Correct Answer: a

Detailed Solution:
Except for option (a) none of the other objective function are minimized at D(G(z)) = 1 which is
the goal of the Generator, i.e. to force the Discriminator to output probability=1 for a generated
sample. Loss function in option (a) is the only choice which keeps on decreasing as D(G(z))
increases. Also, it is required that D(G(z)) ∈ [0,1].
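The behaviour of option (a) can be checked numerically: with the cross-entropy defined above, CE(1, d) = -log(d), which strictly decreases as d approaches 1.

```python
import numpy as np

def ce(a, b):
    # Binary cross-entropy as defined in the question.
    return -(a * np.log(b) + (1 - a) * np.log(1 - b))

# CE(1, d) = -log(d): strictly decreasing in d, minimized as d -> 1,
# matching the generator's goal of pushing D(G(z)) towards 1.
for d in [0.1, 0.5, 0.9, 0.99]:
    print(f"D(G(z)) = {d}: loss = {ce(1, d):.4f}")
```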
______________________________________________________________________________

QUESTION 8:
While training a Generative Adversarial network, which of the following losses CANNOT be
used to optimize for discriminator objective (while only sampling from the distribution of
generated samples) by MAXIMIZING with gradient ASCENT optimizer? Consider cross-entropy
loss,

CE(a, b) = - [ a*log(b) + (1-a)*log(1-b)]

and D(G(z)) = probability of belonging to real class as output by the Discriminator for a given
generated sample, G(z) from a noise vector, z.

a. CE(1, D(G(z)))
b. -CE(1, D(G(z)))
c. CE(1, 1 + D(G(z)))
d. -CE(1, 1 - D(G(z)))

Correct Answer: b

Detailed Solution:
During optimization of discriminator, when we sample from the distribution of fake/generated
distribution, we want D(G(z)) = 0. Since we want to use gradient ASCENT optimization, the
objective function should increase as we approach D(G(z)) = 0 while the objective value should
decrease with increase in value of D(G(z)). Apart from option (b), all other options satisfy the
above conditions.
______________________________________________________________________________

QUESTION 9:
For training VAE, we want to predict an unknown distribution of latent code given an observed
sample, i.e., P(z|x), but we approximate it with some distribution Q(z|x) which we can control
by varying some known parameters. Which of the following loss functions is used as a loss to
minimize?
a. −Σ_z Q(z|x) log[P(x,z) / Q(z|x)]
b. −Σ_x Q(z|x) log[P(x,z) / Q(z|x)]
c. Σ_z P(z|x) log[P(x,z) / Q(z|x)]
d. None of the above

Correct Answer: a

Detailed Solution:
Since we are trying to approximate P(z|x) with Q(z|x), we minimize the KL divergence KL(Q(z|x) || P(z|x)), which eventually leads to maximization of the well-known variational lower bound Σ_z Q(z|x) log[P(x,z) / Q(z|x)].

So we minimize −Σ_z Q(z|x) log[P(x,z) / Q(z|x)], which is option (a). See the lecture videos for the detailed derivation.
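The link between the two quantities is the standard decomposition of the KL divergence (a one-line derivation, using the same symbols as above):

```latex
\mathrm{KL}\!\left(Q(z|x)\,\|\,P(z|x)\right)
  = \log P(x) \;-\; \sum_{z} Q(z|x)\,\log\frac{P(x,z)}{Q(z|x)}
```

Since log P(x) is fixed with respect to Q, minimizing the KL divergence is equivalent to minimizing the negative sum, i.e. option (a).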
______________________________________________________________________________

QUESTION 10:

The above figure shows latent-vector subtraction of the two concepts "face with glasses" and "glasses". What is expected from the resultant vector?
a. glasses
b. face without glasses
c. face with 2 glasses
d. None of the above

Correct Answer: b

Detailed Solution:
It is expected that the VAE latent space follows vector arithmetic, so the resultant vector is the subtraction of the two concepts and represents a face without glasses:

(face with glasses) - (glasses) = (face without glasses)

_______________________________________________________________________


************END*******
