Deep Learning


Assignment 1

1 point
Consider the following table, where $x_1$ and $x_2$ are features and $y$ is a label:

Assume that the elements in $w$ are initialized to zero and the perceptron learning algorithm is used to
update the weights $w$. If the learning algorithm runs for enough iterations, then

The algorithm never converges


The algorithm converges (i.e., no further weight updates) after some iterations
The classification error remains greater than zero

The classification error becomes zero eventually


Yes, the answer is correct.
Score: 1
Accepted Answers:
The algorithm converges (i.e., no further weight updates) after some iterations
The classification error becomes zero eventually
1 point
In the perceptron model, the weight vector $w$ is perpendicular to the linear decision boundary at all times.

True

False
Yes, the answer is correct.
Score: 1
Accepted Answers:
True
1 point
What is the perceptron algorithm used for?

Clustering data points


Classifying data
Solving optimization problems

Finding the shortest path in a graph


Yes, the answer is correct.
Score: 1
Accepted Answers:
Classifying data
1 point
Choose the correct input-output pair for the given MP neuron: $f(x) = 1$ if $x_1 + x_2 + x_3 > 2$, and $f(x) = 0$ otherwise.

$y = 1$ for $(x_1, x_2, x_3) = (0, 1, 1)$

$y = 0$ for $(x_1, x_2, x_3) = (0, 0, 1)$

$y = 1$ for $(x_1, x_2, x_3) = (0, 0, 0)$

$y = 1$ for $(x_1, x_2, x_3) = (1, 1, 1)$

$y = 0$ for $(x_1, x_2, x_3) = (1, 0, 1)$


Yes, the answer is correct.
Score: 1
Accepted Answers:
$y = 0$ for $(x_1, x_2, x_3) = (0, 0, 1)$
$y = 1$ for $(x_1, x_2, x_3) = (1, 1, 1)$
$y = 0$ for $(x_1, x_2, x_3) = (1, 0, 1)$
1 point
Which of the following Boolean functions can be implemented using a perceptron?

NOR
NAND
NOT

XOR
Yes, the answer is correct.
Score: 1
Accepted Answers:
NOR
NAND
NOT
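The accepted answers above can be checked with a small brute-force search. This is only a sketch: the firing rule (output 1 when the weighted sum plus bias is strictly positive) and the integer search grid are assumptions made for illustration, and the helper names are hypothetical. Finding no weights on a finite grid is not a proof, but it agrees with the known fact that XOR is not linearly separable.

```python
from itertools import product

def perceptron(w1, w2, b, x1, x2):
    # Assumed rule: fire (output 1) when the weighted sum plus bias is strictly positive.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_weights(truth_table, grid=range(-2, 3)):
    # Brute-force search over a small integer grid (an illustrative choice).
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron(w1, w2, b, x1, x2) == y
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None

inputs = list(product([0, 1], repeat=2))
NOR = {x: int(not (x[0] or x[1])) for x in inputs}
NAND = {x: int(not (x[0] and x[1])) for x in inputs}
XOR = {x: x[0] ^ x[1] for x in inputs}

for name, table in [("NOR", NOR), ("NAND", NAND), ("XOR", XOR)]:
    # Prints a weight triple for NOR and NAND, and None for XOR.
    print(name, find_weights(table))
```

NOT takes a single input, so it is left out of this two-input search.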
1 point
Which of the following threshold values of MP neuron implements AND Boolean function? Assume that the
number of inputs to the neuron is 7 and the neuron does not have any inhibitory inputs.

1

3

6

7

8
Yes, the answer is correct.
Score: 1
Accepted Answers:
7
1 point
Suppose we have a boolean function that takes 4 inputs $x_1, x_2, x_3, x_4$. We have an MP
neuron with parameter $\theta = 3$. For how many inputs will this MP neuron give output $y = 1$?

5

4

1

16
Yes, the answer is correct.
Score: 1
Accepted Answers:
5
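The count can be verified by enumerating all 16 input combinations. The sketch below assumes an MP neuron with no inhibitory inputs that fires when the sum of its binary inputs reaches the threshold; the function names are hypothetical.

```python
from itertools import product

def mp_neuron(inputs, theta):
    # McCulloch-Pitts neuron without inhibitory inputs:
    # fire when the sum of the binary inputs reaches the threshold.
    return 1 if sum(inputs) >= theta else 0

def count_firing_inputs(n_inputs, theta):
    # Enumerate every binary input vector and count those that produce output 1.
    return sum(mp_neuron(x, theta) for x in product([0, 1], repeat=n_inputs))

print(count_firing_inputs(4, 3))  # 5: four inputs with exactly three 1s, plus (1, 1, 1, 1)
```

The same helper also covers the AND question: with 7 inputs, only a threshold of 7 makes the neuron fire on the all-ones input alone.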
1 point
Consider the points shown in the picture. The weight vector is $w = [-1, -1]^T$. Which classes will the
perceptron algorithm predict for the data points $x_1$ and $x_2$ under this weight vector?

NOTE:
$y = 1$ if $w^T x > 0$, and $y = -1$ if $w^T x \le 0$

$x_1 = -1$

$x_1 = 1$
$x_2 = -1$

$x_2 = 1$
Yes, the answer is correct.
Score: 1
Accepted Answers:
$x_1 = -1$
$x_2 = 1$
1 point
Consider the following table, where $x_1$ and $x_2$ are features (packed into a single
vector $x = [x_1, x_2]^T$) and $y$ is a label:

Suppose that the perceptron model is used to classify the data points. Suppose further that the
weights $w$ are initialized to $w = [1, 1]^T$. The following rule is used for classification:

$y = 1$ if $w^T x > 0$, and $y = 0$ if $w^T x \le 0$


The perceptron learning algorithm is used to update the weight vector $w$. How many times will the
weight vector $w$ be updated during the entire training process?

2
1
0

Not possible to determine


Yes, the answer is correct.
Score: 1
Accepted Answers:
0
1 point
Which Boolean function with two inputs $x_1$ and $x_2$ is represented by the following decision
boundary? (Points on the boundary or to the right of it are classified as 1.)

AND
OR
XOR

NAND
Yes, the answer is correct.
Score: 1
Accepted Answers:
OR
1 point
Choose the correct input-output pair for the given MP Neuron.

$y = 1$ if $x_1 + x_2 + x_3 \ge 2$, and $y = 0$ otherwise

$y = 1$ for $(x_1, x_2, x_3) = (0, 1, 1)$

$y = 0$ for $(x_1, x_2, x_3) = (0, 0, 1)$

$y = 1$ for $(x_1, x_2, x_3) = (1, 1, 1)$

$y = 0$ for $(x_1, x_2, x_3) = (1, 0, 0)$


Yes, the answer is correct.
Score: 1
Accepted Answers:
$y = 1$ for $(x_1, x_2, x_3) = (0, 1, 1)$
$y = 0$ for $(x_1, x_2, x_3) = (0, 0, 1)$
$y = 1$ for $(x_1, x_2, x_3) = (1, 1, 1)$
$y = 0$ for $(x_1, x_2, x_3) = (1, 0, 0)$
1 point
Suppose we have a boolean function that takes 4 inputs $x_1, x_2, x_3, x_4$. We have an MP neuron with
parameter $\theta = 2$. For how many inputs will this MP neuron give output $y = 1$?

11
21
15

8
No, the answer is incorrect.
Score: 0
Accepted Answers:
11
1 point
We are given the following data:

Can you classify every label correctly by training a perceptron algorithm? (assume bias to be 0 while
training)

Yes
No
Yes, the answer is correct.
Score: 1
Accepted Answers:
No
1 point
We are given the following dataset with features $(x_1, x_2)$ and label $y \in \{-1, 1\}$. Suppose we apply the
perceptron algorithm to this dataset with $w$ initialized as $(0, 0)$. What will be the value of $w$ when
the algorithm converges? (Start the algorithm from $(2, 2)$.)

(-2,2)
(2,1)
(2,-1)

None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
(2,-1)
1 point
Consider points shown in the picture. The vector w is (-1,0). As per this weight vector, the Perceptron
algorithm will predict which classes for the data points x1 and x2.

x1=1
x2=1
x1=-1
x2=-1
Yes, the answer is correct.
Score: 1
Accepted Answers:
x2=1
x1=-1
1 point
Given an MP neuron with inputs $x_1, x_2, x_3, x_4, x_5$ and threshold $\theta = 3$, where $x_5$ is an inhibitory input.
For input $(1, 1, 1, 0, 1)$, what will be the value of $y$?

$y = 0$

$y = 1$ since $\theta \ge 3$

$y = 1/2$

Insufficient information
Yes, the answer is correct.
Score: 1
Accepted Answers:
$y = 0$
1 point
An MP neuron takes two inputs $x_1$ and $x_2$. Its threshold is $\theta = 0$. Select all the boolean functions this
MP neuron may represent.

AND
NOT
OR

NOR
Yes, the answer is correct.
Score: 1
Accepted Answers:
NOR
1 point
What is the output of a perceptron with weight vector $w = [2, -3, 1]^T$ and bias $b = -2$ when
the input is $x = [1, 0, -1]^T$?

0
1
-1

2
Yes, the answer is correct.
Score: 1
Accepted Answers:
-1
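A quick check of the arithmetic, reading the vectors as $w = [2, -3, 1]$ and $x = [1, 0, -1]$ (an interpretation of the flattened notation in the question). The sketch only computes the weighted sum plus bias; the question does not state the activation, but a sign activation with outputs in $\{-1, 1\}$ would give the same value here.

```python
def pre_activation(w, x, b):
    # Weighted sum of the inputs plus the bias term.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [2, -3, 1]
x = [1, 0, -1]
b = -2

z = pre_activation(w, x, b)
print(z)  # -1, since 2*1 + (-3)*0 + 1*(-1) - 2 = -1
```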
1 point
What does the "winter of AI" refer to in the history of artificial intelligence?
The period during winter when AI technologies are least effective due to cold temperatures
A phase marked by decreased funding and interest in AI research.
The season when AI algorithms perform at their peak efficiency.

A period characterized by rapid advancements and breakthroughs in AI technologies.


Yes, the answer is correct.
Score: 1
Accepted Answers:
A phase marked by decreased funding and interest in AI research.

Assignment 2
How many boolean functions can be designed for 3 inputs?

8
16
256

64
Yes, the answer is correct.
Score: 1
Accepted Answers:
256
1 point
Which of the following statements is(are) true about the following function?
$\sigma(z) = \frac{1}{1 + e^{-z}}$

The function is bounded between 0 and 1

The function attains its maximum when $z \to \infty$


The function is continuously differentiable

The function is monotonic


Yes, the answer is correct.
Score: 1
Accepted Answers:
The function is bounded between 0 and 1
The function attains its maximum when $z \to \infty$
The function is continuously differentiable
The function is monotonic
1 point
You are training a model using the gradient descent algorithm and notice that the loss decreases and then
increases after each successive epoch (pass through the data). Which of the following techniques would
you employ to enhance the likelihood of the gradient descent algorithm converging? (Here, $\eta$ refers to the
step size.)

Decrease the value of $\eta$


Increase the value of $\eta$

Set $\eta = 1$

Set $\eta = 0$

Yes, the answer is correct.


Score: 1
Accepted Answers:
Decrease the value of $\eta$
1 point
Which of the following statements is true about the representation power of a multilayer network of
perceptrons?

A multilayer network of perceptrons can represent any function.


A multilayer network of perceptrons can represent any linear function.
A multilayer network of perceptrons can represent any boolean function.

A multilayer network of perceptrons can represent any continuous function.


No, the answer is incorrect.
Score: 0
Accepted Answers:
A multilayer network of perceptrons can represent any boolean function.
1 point
How many boolean functions can be designed for 4 inputs?

65,536
8
256

64
Yes, the answer is correct.
Score: 1
Accepted Answers:
65,536
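Both counting questions in this assignment follow from the same argument: a truth table over $n$ binary inputs has $2^n$ rows, and each row's output can be chosen independently as 0 or 1. A minimal sketch (the function name is hypothetical):

```python
def num_boolean_functions(n_inputs):
    rows = 2 ** n_inputs   # number of rows in the truth table
    return 2 ** rows       # each row's output is chosen independently

print(num_boolean_functions(3))  # 256
print(num_boolean_functions(4))  # 65536
```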
1 point
We have a function that we want to approximate using 150 rectangles (towers). How many neurons are
required to construct the required network?

301
451
150

500
No, the answer is incorrect.
Score: 0
Accepted Answers:
301
1 point
What happens to the output of the sigmoid function as $|x|$ becomes very large for input $x$? Select all
relevant options.
The output approaches 0.5
The output approaches 1.
The output oscillates between 0 and 1.

The output approaches 0.


Yes, the answer is correct.
Score: 1
Accepted Answers:
The output approaches 1.
The output approaches 0.
1 point
We have a classification problem with labels 0 and 1. We train a logistic model and find out
that $\omega_0$ learned by our model is -17. We are to predict the label of a new test point $x$ using this trained
model. If $\omega^T x = 1$, which of the following statements is True?

We cannot make any prediction as the value of $\omega^T x$ does not make sense
The label of the test point is 0.
The label of the test point is 1.

We cannot make any prediction as we do not know the value of $x$.


Yes, the answer is correct.
Score: 1
Accepted Answers:
The label of the test point is 0.
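The accepted answer can be reproduced numerically. The sketch assumes that $\omega_0$ is a bias term added to $\omega^T x$ before the sigmoid, and that the label is 1 when the predicted probability is at least 0.5; neither convention is spelled out in the question.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

omega_0 = -17.0    # learned bias from the question
omega_T_x = 1.0    # given value of omega^T x

# Assumed model: P(y = 1 | x) = sigmoid(omega_0 + omega^T x)
p = sigmoid(omega_0 + omega_T_x)
label = 1 if p >= 0.5 else 0   # assumed decision threshold of 0.5
print(p, label)
```

Because the argument of the sigmoid is $-16$, the probability is very close to 0 and the predicted label is 0.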
1 point
Suppose we have a function $f(x_1, x_2) = x_1^2 + 3x_2 + 25$ that we want to minimize
using the gradient descent algorithm. We initialize $(x_1, x_2) = (0, 0)$. What
will be the value of $x_1$ after ten updates in the gradient descent process? (Let $\eta$ be 1.)

0
-3
−4.5

−3
Yes, the answer is correct.
Score: 1
Accepted Answers:
0
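The ten updates can be traced directly. The partial derivative with respect to $x_1$ is $2x_1$, which is zero at the starting point, so $x_1$ never moves; only $x_2$ changes. A minimal sketch of plain gradient descent on this function:

```python
def grad(x1, x2):
    # Gradient of f(x1, x2) = x1**2 + 3*x2 + 25
    return 2 * x1, 3

x1, x2 = 0.0, 0.0
eta = 1.0

for _ in range(10):
    g1, g2 = grad(x1, x2)
    x1 -= eta * g1   # stays at 0 because the gradient in x1 is 0 there
    x2 -= eta * g2   # decreases by 3 on every update

print(x1, x2)  # 0.0 -30.0
```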
1 point
What is the purpose of the gradient descent algorithm in machine learning?

To minimize the loss function


To maximize the loss function
To minimize the output function

To maximize the output function


Yes, the answer is correct.
Score: 1
Accepted Answers:
To minimize the loss function

Assignment 4
A team has a data set that contains 1000 samples for training a feed-forward neural network. Suppose they
decided to use stochastic gradient descent algorithm to update the weights. How many times do the
weights get updated after training the network for 5 epochs?

1000
5000
100

5
Yes, the answer is correct.
Score: 1
Accepted Answers:
5000
1 point
What is the primary benefit of using Adagrad compared to other optimization algorithms?

It converges faster than other optimization algorithms.


It is more memory-efficient than other optimization algorithms.
It is less sensitive to the choice of hyperparameters(learning rate).

It is less likely to get stuck in local optima than other optimization algorithms.
Yes, the answer is correct.
Score: 1
Accepted Answers:
It is less sensitive to the choice of hyperparameters(learning rate).
1 point
What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?

SGD converges more quickly than vanilla gradient descent.


SGD is computationally efficient for large datasets.
SGD theoretically guarantees that the descent direction is optimal.
SGD experiences less oscillation compared to vanilla gradient descent.
Yes, the answer is correct.
Score: 1
Accepted Answers:
SGD converges more quickly than vanilla gradient descent.
SGD is computationally efficient for large datasets.
1 point
A team has a data set that contains 100 samples for training a feed-forward neural network. Suppose they
decided to use the gradient descent algorithm to update the weights. Suppose further that they use line
search algorithm for the learning rate as follows, $\eta = [0.01, 0.1, 1, 2, 10]$. How many
times do the weights get updated after training the network for 10 epochs? (Note, for each weight update
the loss has to decrease)

100
5
500
10

50
Yes, the answer is correct.
Score: 1
Accepted Answers:
10
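The two update-count questions in this assignment (1000 samples with stochastic gradient descent, and 100 samples with full-batch gradient descent) reduce to counting batches per epoch. The sketch below assumes one weight update per batch; line search tries several learning rates per step but, under this reading, still commits a single update. The function name is hypothetical.

```python
import math

def num_updates(n_samples, epochs, batch_size):
    # One weight update per batch; batches per epoch = ceil(n_samples / batch_size).
    return epochs * math.ceil(n_samples / batch_size)

print(num_updates(1000, 5, batch_size=1))     # stochastic GD: 5000
print(num_updates(100, 10, batch_size=100))   # full-batch GD: 10
```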
1 point
Select the true statements about the factor $\beta$ used in the momentum based gradient descent algorithm.

Setting $\beta = 0.1$ allows the algorithm to move faster than the vanilla gradient descent algorithm

Setting $\beta = 0$ makes it equivalent to the vanilla gradient descent algorithm

Setting $\beta = 1$ makes it equivalent to the vanilla gradient descent algorithm

Oscillation around the minimum will be less if we set $\beta = 0.1$ than setting $\beta = 0.99$

Yes, the answer is correct.


Score: 1
Accepted Answers:
Setting $\beta = 0.1$ allows the algorithm to move faster than the vanilla gradient descent algorithm
Setting $\beta = 0$ makes it equivalent to the vanilla gradient descent algorithm
Oscillation around the minimum will be less if we set $\beta = 0.1$ than setting $\beta = 0.99$
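The equivalence at $\beta = 0$ is easy to see in code. The sketch assumes one common formulation of momentum, $u_t = \beta u_{t-1} + \eta \nabla w_t$ followed by $w_{t+1} = w_t - u_t$; the toy loss $w^2$ and the function names are illustrative choices, not taken from the assignment.

```python
def momentum_gd(grad, w0, eta, beta, steps):
    w, u = w0, 0.0
    for _ in range(steps):
        u = beta * u + eta * grad(w)   # accumulate a decaying history of gradients
        w = w - u
    return w

def vanilla_gd(grad, w0, eta, steps):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def toy_grad(w):
    # Gradient of the toy loss w**2.
    return 2 * w

w_momentum = momentum_gd(toy_grad, 5.0, 0.1, 0.0, 20)
w_vanilla = vanilla_gd(toy_grad, 5.0, 0.1, 20)
print(abs(w_momentum - w_vanilla) < 1e-12)  # True: beta = 0 drops the history term
```

With $\beta > 0$ the history term adds to the current gradient step, which is why the updates are larger than in the vanilla algorithm.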

1 point
What is the advantage of using mini-batch gradient descent over batch gradient descent?

Mini-batch gradient descent is more computationally efficient than batch gradient descent.
Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch gradient
descent.
Mini batch gradient descent gives us a better solution.

Mini-batch gradient descent can converge faster than batch gradient descent.
Partially Correct.
Score: 0.5
Accepted Answers:
Mini-batch gradient descent is more computationally efficient than batch gradient descent.
Mini-batch gradient descent can converge faster than batch gradient descent.
1 point
We have the following functions: $x^3$, $\ln(x)$, $e^x$, $x$ and 4. Which of the following functions has the
steepest slope at $x = 1$?

$x^3$

$\ln(x)$

$e^x$

4
No, the answer is incorrect.
Score: 0
Accepted Answers:
$x^3$
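The accepted answer follows from evaluating each derivative at $x = 1$. A small sketch, with the derivatives written out by hand in the comments:

```python
import math

slopes_at_1 = {
    "x^3": 3 * 1 ** 2,     # d/dx x^3 = 3x^2
    "ln(x)": 1 / 1,        # d/dx ln(x) = 1/x
    "e^x": math.exp(1),    # d/dx e^x = e^x
    "x": 1,                # d/dx x = 1
    "4": 0,                # a constant has zero slope
}

steepest = max(slopes_at_1, key=slopes_at_1.get)
print(steepest)  # x^3, because 3 > e
```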
1 point
Which of the following are among the disadvantages of Adagrad?

It doesn’t work well for the Sparse matrix.


It usually goes past the minima.
It gets stuck before reaching the minima.

Weight updates are very small at the initial stages of the algorithm.
Yes, the answer is correct.
Score: 1
Accepted Answers:
It gets stuck before reaching the minima.
1 point
Which of the following is a variant of gradient descent that uses an estimate of the next gradient to update
the current position of the parameters?

Momentum optimization
Stochastic gradient descent
Nesterov accelerated gradient descent

Adagrad
Yes, the answer is correct.
Score: 1
Accepted Answers:
Nesterov accelerated gradient descent
1 point
What is the role of activation functions in deep learning?

Activation functions transform the output of a neuron into a non-linear function, allowing the network to
learn complex patterns.
Activation functions make the network faster by reducing the number of iterations needed for training.
Activation functions are used to normalize the input data.
Activation functions are used to compute the loss function.
Yes, the answer is correct.
Score: 1
Accepted Answers:
Activation functions transform the output of a neuron into a non-linear function, allowing the network to
learn complex patterns.

Assignment 5
Which of the following is a measure of the amount of variance explained by a principal component in PCA?

Covariance
Correlation
Mean absolute deviation

Eigenvalue
Yes, the answer is correct.
Score: 1
Accepted Answers:
Eigenvalue
1 point
What is/are the limitations of PCA?

It is computationally less efficient than autoencoders


It can only reduce the dimensionality of a dataset by a fixed amount.
It can only identify linear relationships in the data.

It can be sensitive to outliers in the data.


Partially Correct.
Score: 0.5
Accepted Answers:
It can only identify linear relationships in the data.
It can be sensitive to outliers in the data.
1 point
Which of the following is a property of eigenvalues of a symmetric matrix?

Eigenvalues are always positive


Eigenvalues are always negative
Eigenvalues are always real

Eigenvalues can be complex numbers with imaginary parts non-zero


Yes, the answer is correct.
Score: 1
Accepted Answers:
Eigenvalues are always real
1 point
The eigenvalues of $A$ are 3, 4. Which of the following are the eigenvalues of $A^3$?

3, 4
9, 16
27, 64

$\sqrt{3}, \sqrt{4}$
Yes, the answer is correct.
Score: 1
Accepted Answers:
27, 64
1 point
If we have a $12 \times 12$ matrix having entries from $\mathbb{R}$, how many linearly independent eigenvectors
corresponding to real eigenvalues are possible for this matrix?

10
24
12

6
Partially Correct.
Score: 0.33
Accepted Answers:
10
12
6

Questions 6-9 are based on common data.

Consider the following data points $x_1, x_2, x_3$ to answer the following
questions: $x_1 = [2, 2]^T$, $x_2 = [1, 2]^T$, $x_3 = [2, 1]^T$
1 point
What is the mean of the given data points $x_1, x_2, x_3$?

$[5, 5]^T$

$[1.67, 1.67]^T$

$[2, 2]^T$

$[1.5, 1.5]^T$
Yes, the answer is correct.
Score: 1
Accepted Answers:
$[1.67, 1.67]^T$
1 point
The covariance matrix $C = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$ is given by: ($\bar{x}$ is the
mean of the data points)

$\begin{bmatrix} 0.22 & -0.11 \\ -0.11 & 0.22 \end{bmatrix}$

$\begin{bmatrix} 0.33 & -0.17 \\ -0.17 & 0.33 \end{bmatrix}$
$\begin{bmatrix} 0.22 & -0.22 \\ -0.22 & 0.22 \end{bmatrix}$

$\begin{bmatrix} 0.33 & -0.33 \\ -0.33 & 0.33 \end{bmatrix}$
Yes, the answer is correct.
Score: 1
Accepted Answers:
$\begin{bmatrix} 0.22 & -0.11 \\ -0.11 & 0.22 \end{bmatrix}$
1 point
The maximum eigenvalue of the covariance matrix $C$ is:

0.33

0.67

1

0.5
Yes, the answer is correct.
Score: 1
Accepted Answers:
0.33
1 point
The eigenvector corresponding to the maximum eigenvalue of the given matrix $C$ is:

$[0.71, 0.71]^T$

$[-0.71, 0.71]^T$

$[-1, 1]^T$

$[1, 1]^T$
No, the answer is incorrect.
Score: 0
Accepted Answers:
$[-0.71, 0.71]^T$
OR
$[-1, 1]^T$
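The accepted answers to the common-data questions (mean, covariance, largest eigenvalue and its eigenvector) can be reproduced with a few lines of plain Python. This is a sketch for this specific 2-D case: it uses the $1/n$ normalisation from the covariance question and the closed-form eigenvalues of a symmetric $2 \times 2$ matrix, and the eigenvector formula assumes a non-zero off-diagonal entry.

```python
import math

points = [(2.0, 2.0), (1.0, 2.0), (2.0, 1.0)]
n = len(points)

mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Covariance entries with the 1/n normalisation used in the question.
cxx = sum((p[0] - mean[0]) ** 2 for p in points) / n
cyy = sum((p[1] - mean[1]) ** 2 for p in points) / n
cxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points) / n

# Eigenvalues of the symmetric matrix [[cxx, cxy], [cxy, cyy]] in closed form.
half_trace = (cxx + cyy) / 2
radius = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
lam_max, lam_min = half_trace + radius, half_trace - radius

# An eigenvector for lam_max solves (cxx - lam_max) * v1 + cxy * v2 = 0.
# Taking v = (cxy, lam_max - cxx) works whenever cxy is non-zero.
v = (cxy, lam_max - cxx)
norm = math.hypot(v[0], v[1])
v = (v[0] / norm, v[1] / norm)

print([round(m, 2) for m in mean])           # [1.67, 1.67]
print(round(cxx, 2), round(cxy, 2))          # 0.22 -0.11
print(round(lam_max, 2))                     # 0.33
print([round(c, 2) for c in v])              # [-0.71, 0.71]
```

Eigenvectors are only defined up to scale and sign, which is why both $[-0.71, 0.71]^T$ and $[-1, 1]^T$ are accepted.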
What is the determinant of a $2 \times 2$ matrix that has eigenvalues of 4 and 5?

Yes, the answer is correct.


Score: 1
Accepted Answers:
(Type: Numeric) 20

Assignment 6
We are given an autoencoder A. The average activation value of neurons in this network is 0.01. The given
autoencoder is
Contractive autoencoder
Overcomplete neural network
Denoising autoencoder

Sparse autoencoder
No, the answer is incorrect.
Score: 0
Accepted Answers:
Sparse autoencoder
1 point
What is the primary reason for adding corruption to the input data in a denoising autoencoder?

To increase the complexity of the model.


To improve the model’s ability to generalize to unseen data.
To reduce the size of the training dataset.

To increase the training time.


Yes, the answer is correct.
Score: 1
Accepted Answers:
To improve the model’s ability to generalize to unseen data.
1 point
What is/are the primary advantages of Autoencoders over PCA?

Autoencoders are less prone to overfitting than PCA.


Autoencoders are faster and more efficient than PCA.
Autoencoders can capture nonlinear relationships in the input data.

Autoencoders require fewer input data than PCA.


Yes, the answer is correct.
Score: 1
Accepted Answers:
Autoencoders can capture nonlinear relationships in the input data.
1 point
What type of autoencoder is it when the hidden layer’s dimensionality is less than that of the input layer?

Under-complete autoencoder
Complete autoencoder
Overcomplete autoencoder

Sparse autoencoder
Yes, the answer is correct.
Score: 1
Accepted Answers:
Under-complete autoencoder
1 point
Which of the following statements about regularization in autoencoders is always true?

Regularisation reduces the search space of weights for the network.


Regularisation helps to reduce the overfitting in overcomplete autoencoders.
Regularisation shrinks the size of weight vectors learned.

All of these.
No, the answer is incorrect.
Score: 0
Accepted Answers:
Regularisation reduces the search space of weights for the network.
Regularisation helps to reduce the overfitting in overcomplete autoencoders.
1 point
What are the advantages of using a denoising autoencoder?

Robustness to noisy input data


Reduction of the risk of overfitting
Faster training time

It promotes sparsity in the hidden layer


Yes, the answer is correct.
Score: 1
Accepted Answers:
Robustness to noisy input data
Reduction of the risk of overfitting
1 point
We are given an autoencoder A. The average activation value of neurons in this network is 0.06. The given
autoencoder is:

Contractive autoencoder
Overcomplete neural network
Sparse autoencoder

Denoising autoencoder
Yes, the answer is correct.
Score: 1
Accepted Answers:
Sparse autoencoder
1 point
If the dimension of the input layer in an over-complete autoencoder is 5, what is the possible dimension of
the hidden layer?

4
2
8

0
Yes, the answer is correct.
Score: 1
Accepted Answers:
8
1 point
What is the primary objective of sparse autoencoders that distinguishes it from vanilla autoencoder?

They learn a low-dimensional representation of the input data


They minimize the reconstruction error between the input and the output
They capture only the important variations/features in the data

They maximize the mutual information between the input and the output
Yes, the answer is correct.
Score: 1
Accepted Answers:
They capture only the important variations/features in the data
1 point
Suppose for one data point we have
features $x_1, x_2, x_3, x_4, x_5$ as $-2, 12, 4.2, 7.6, 0$. Which of the
following functions should we use on the output layer (decoder)?

Logistic
Relu
Tanh

Linear
Yes, the answer is correct.
Score: 1
Accepted Answers:
Linear

Assignment 8
Which of the following activation functions is not zero-centered?

Sigmoid
Tanh
ReLU

Softmax
Partially Correct.
Score: 0.34
Accepted Answers:
Sigmoid
ReLU
Softmax
1 point
Which of the following are common issues caused by saturating neurons in deep networks?

Vanishing gradients
Slow convergence during training
Overfitting

Increased model complexity


No, the answer is incorrect.
Score: 0
Accepted Answers:
Vanishing gradients
Slow convergence during training
1 point
What are the challenges associated with using the Tanh(x) activation function?
It is not zero centered
Computationally expensive
Non-differentiable at 0

Saturation
Yes, the answer is correct.
Score: 1
Accepted Answers:
Computationally expensive
Saturation
1 point
Which of the following activation functions is preferred to avoid the vanishing gradient problem?

Sigmoid
Tanh
ReLU

None of these
Yes, the answer is correct.
Score: 1
Accepted Answers:
ReLU
1 point
How does pre-training prevent overfitting in deep networks?

It adds regularization
It initializes the weights near local minima
It constrains the weights to a certain region

It eliminates the need for fine-tuning


Partially Correct.
Score: 0.5
Accepted Answers:
It adds regularization
It constrains the weights to a certain region
1 point
We train a feed-forward neural network and notice that all the weights for a particular neuron are equal.
What could be the possible causes of this issue?

Weights were initialized randomly


Weights were initialized to high values
Weights were initialized to equal values

Weights were initialized to zero


Partially Correct.
Score: 0.5
Accepted Answers:
Weights were initialized to equal values
Weights were initialized to zero
1 point
Which of the following methods can help to avoid saturation in deep learning?
Using a different activation function.
Increasing the learning rate.
Increasing the model complexity

All of the above.


Yes, the answer is correct.
Score: 1
Accepted Answers:
Using a different activation function.
1 point
Which of the following is an advantage of unsupervised pre-training in deep learning?

It helps in reducing overfitting


Pre-trained models converge faster
It improves the accuracy of the model

It requires fewer computational resources


Partially Correct.
Score: 0.33
Accepted Answers:
It helps in reducing overfitting
Pre-trained models converge faster
It improves the accuracy of the model
1 point
How can you tell if your network is suffering from the Dead ReLU problem?

The loss function is not decreasing during training


The accuracy of the network is not improving
A large number of neurons have zero output

The network is overfitting to the training data


Yes, the answer is correct.
Score: 1
Accepted Answers:
A large number of neurons have zero output
1 point
In Batch Normalization, which parameter is learned during training?

Mean
Variance

$\gamma$

$\epsilon$
Yes, the answer is correct.
Score: 1
Accepted Answers:
$\gamma$
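The distinction between learned and computed quantities is visible in a minimal forward pass. The sketch below is an illustration for a single feature, not a full layer: the mean and variance are computed from the batch, $\epsilon$ is a fixed constant, and $\gamma$ is the trainable scale. The shift parameter `beta` is the usual companion to $\gamma$ in batch normalization and is not mentioned in the question.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    # Mean and variance are statistics of the batch; they are computed, not learned.
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # gamma (scale) and beta (shift) are the trainable parameters; eps is a fixed constant.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(out)
```

After normalization the outputs are centred on `beta` and scaled by `gamma`, so training these two parameters lets the network undo the normalization if that helps.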

Assignment 9
Consider the following corpus: "human machine interface for computer applications. user opinion of
computer system response time. user interface management system. system engineering for improved
response time". What is the size of the vocabulary of the above corpus?

13
14
15

16
No, the answer is incorrect.
Score: 0
Accepted Answers:
15
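The vocabulary size can be confirmed by tokenizing the corpus and counting distinct words. The tokenization here (drop the periods, split on whitespace) is an assumption; the question does not specify one.

```python
corpus = ("human machine interface for computer applications. "
          "user opinion of computer system response time. "
          "user interface management system. "
          "system engineering for improved response time")

# Drop sentence-ending periods, then split on whitespace.
tokens = corpus.replace(".", " ").split()
vocabulary = set(tokens)

print(len(vocabulary))  # 15
```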
1 point
At the input layer of a continuous bag of words model, we multiply a one-hot vector $x \in \mathbb{R}^{|V|}$ with the
parameter matrix $W \in \mathbb{R}^{k \times |V|}$. What does each column of $W$ correspond to?

the representation of the $i$-th word in the vocabulary

the $i$-th eigenvector of the co-occurrence matrix

No, the answer is incorrect.


Score: 0
Accepted Answers:
the representation of the $i$-th word in the vocabulary

Suppose that we use the continuous bag of words (CBOW) model to find vector representations of words.
Suppose further that we use a context window of size 3 (that is, given the 3 context words, predict the
target word $P(w_t \mid (w_i, w_j, w_k))$). The size of word vectors (vector representation of words)
is chosen to be 100 and the vocabulary contains 10,000 words. The input to the network is the one-hot
encoding (also called 1-of-$V$ encoding) of word(s). How many parameters (weights), excluding bias, are
there in $W_{word}$? Enter the answer in thousands. For example, if your answer is 50,000, then just
enter 50.
No, the answer is incorrect.
Score: 0
Accepted Answers:
(Type: Numeric) 1000
1 point
Let $count(w,c)$ be the number of times the words $w$ and $c$ appear together in the corpus
(i.e., occur within a window of a few words around each other). Further,
let $count(w)$ and $count(c)$ be the total number of times the words $w$ and $c$ appear
in the corpus respectively, and let $N$ be the total number of words in the corpus. The PMI
between $w$ and $c$ is then given by:

$\log \frac{count(w,c) * count(w)}{N * count(c)}$

$\log \frac{count(w,c) * count(c)}{N * count(w)}$

$\log \frac{count(w,c) * N}{count(w) * count(c)}$
No, the answer is incorrect.
Score: 0
Accepted Answers:
$\log \frac{count(w,c) * N}{count(w) * count(c)}$
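The accepted formula translates directly into code. In this sketch the counts are made-up numbers for illustration, and the natural logarithm is an assumption since the question does not fix the base.

```python
import math

def pmi(count_wc, count_w, count_c, n_words):
    # PMI(w, c) = log( count(w, c) * N / (count(w) * count(c)) )
    return math.log(count_wc * n_words / (count_w * count_c))

# Hypothetical counts: the pair co-occurs 10 times in a 1000-word corpus.
print(pmi(count_wc=10, count_w=20, count_c=50, n_words=1000))  # log(10)
```

A PMI of 0 means the two words co-occur exactly as often as independence would predict; positive values mean they co-occur more often than that.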
1 point
Suppose we are learning the representations of words using GloVe representations. If we observe that the
cosine similarity between two representations $v_i$ and $v_j$ for words $i$ and $j$ is very high, which of the
following statements is true? (parameters $b_i = 0.02$ and $b_j = 0.07$)

$X_{ij} = 0.02$

$X_{ij} = 0.2$

$X_{ij} = 0.88$

$X_{ij} = 0$
No, the answer is incorrect.
Score: 0
Accepted Answers:
$X_{ij} = 0.88$
1 point
Which of the following is a disadvantage of one hot encoding?

It requires a large amount of memory to store the vectors


It can result in a high-dimensional sparse representation
It cannot capture the semantic similarity between words

All of the above


No, the answer is incorrect.
Score: 0
Accepted Answers:
All of the above
1 point
Which of the following is true about the input representation in the CBOW model?

Each word is represented as a one-hot vector


Each word is represented as a continuous vector
Each word is represented as a sequence of one-hot vectors

Each word is represented as a sequence of continuous vectors


No, the answer is incorrect.
Score: 0
Accepted Answers:
Each word is represented as a one-hot vector
1 point
What is the role of the softmax function in the skip-gram method?

To calculate the dot product between the target word and the context words
To transform the dot product into a probability distribution
To calculate the distance between the target word and the context words

To adjust the weights of the neural network during training


No, the answer is incorrect.
Score: 0
Accepted Answers:
To transform the dot product into a probability distribution
1 point
What is the computational complexity of computing the softmax function in the output layer of a neural
network?

$O(n)$

$O(n^2)$

$O(n \log n)$

$O(\log n)$
No, the answer is incorrect.
Score: 0
Accepted Answers:
$O(n)$
1 point
How does Hierarchical Softmax reduce the computational complexity of computing the softmax function?

It replaces the softmax function with a linear function


It uses a binary tree to approximate the softmax function
It uses a heuristic to compute the softmax function faster

It does not reduce the computational complexity of computing the softmax function
No, the answer is incorrect.
Score: 0
Accepted Answers:
It uses a binary tree to approximate the softmax function
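The binary-tree idea can be sketched as follows, under some simplifying assumptions: the tree is a complete binary tree stored heap-style (root is node 1, children of node n are 2n and 2n+1), the vocabulary size is a power of two, and the node vectors are random rather than trained. The function name and layout are illustrative, not the course's implementation. Each word's probability is a product of sigmoid decisions along its root-to-leaf path, so only about log2(V) dot products are needed per word instead of V.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(word, hidden, node_vectors):
    """Probability of `word` as a product of left/right decisions on its path.

    node_vectors[k] is the vector of internal node k + 1 (heap numbering).
    Returns the probability and the number of internal nodes visited.
    """
    vocab_size = len(node_vectors) + 1  # complete tree: V leaves, V - 1 internal nodes
    node = vocab_size + word            # leaf index of this word
    prob, steps = 1.0, 0
    while node > 1:
        parent = node // 2
        p_left = sigmoid(node_vectors[parent - 1] @ hidden)
        # Even-numbered nodes are left children, odd-numbered are right children.
        prob *= p_left if node % 2 == 0 else 1.0 - p_left
        node = parent
        steps += 1
    return prob, steps

rng = np.random.default_rng(0)
vocab_size, dim = 8, 5                          # hypothetical sizes
node_vectors = rng.normal(size=(vocab_size - 1, dim))
hidden = rng.normal(size=dim)

results = [word_probability(w, hidden, node_vectors) for w in range(vocab_size)]
total = sum(p for p, _ in results)              # probabilities over all words
path_lengths = [s for _, s in results]          # each is log2(8) = 3
```

Because the left and right probabilities at every internal node sum to 1, the leaf probabilities sum to 1 without ever computing a normalizer over the whole vocabulary.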

Assignment 10
Consider an input image of size 1000×1000×10 where 10 refers to the number of
channels (Such images do exist!). Suppose we want to apply a convolution operation on the entire image
by sliding a kernel of size 1×1×d. What should be the depth d of the kernel?

Yes, the answer is correct.


Score: 1
Accepted Answers:
(Type: Numeric) 10
1 point
1 point
For the same input image in Q1, suppose that we apply the following kernels of differing sizes.

K1: 3×3
K2: 7×7
K3: 17×17
K4: 41×41

Assume that stride s=1 and no zero padding. Among all these kernels, which one shrinks the output
dimensions the most?
K1

K2

K3

K4
Yes, the answer is correct.
Score: 1
Accepted Answers:
K4
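A quick way to check this answer is the usual output-size formula for a convolution along one spatial dimension, floor((W - F + 2P) / S) + 1. The helper name `conv_output_size` below is an illustrative choice, and the sketch assumes a 1000-pixel side as in Q1.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one spatial dimension of a convolution."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Kernel side lengths from the question.
kernels = {"K1": 3, "K2": 7, "K3": 17, "K4": 41}

# Stride 1 and no padding, applied to a 1000-pixel side.
outputs = {name: conv_output_size(1000, k) for name, k in kernels.items()}

# The kernel that leaves the smallest output shrinks the image the most.
smallest = min(outputs, key=outputs.get)
```

With stride 1 and no padding, each side shrinks by F - 1 pixels, so the largest kernel removes the most.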
1 point
Which of the following statements about CNN is (are) true?

CNN is a feed-forward network


Weight sharing helps CNN layers to reduce the number of parameters
CNN is suitable only for natural images

The shape of the input to the CNN network should be square


Yes, the answer is correct.
Score: 1
Accepted Answers:
CNN is a feed-forward network
Weight sharing helps CNN layers to reduce the number of parameters

Consider an input image of size 100×100×1. Suppose that we use a kernel of size 3×3,
zero padding P=1 and stride value S=3. What will be the output dimension?

No, the answer is incorrect.


Score: 0
Accepted Answers:
(Type: Numeric) 34
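The accepted value follows from the standard output-size formula. The symbols W (input side length) and F (kernel side length) are names introduced here for illustration; P and S are as given in the question.

```latex
O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1
  = \left\lfloor \frac{100 - 3 + 2 \cdot 1}{3} \right\rfloor + 1
  = 33 + 1 = 34
```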
0 points
1 point
Consider an input image of size 100×100×3. Suppose that we use 10 kernels (filters) each
of size 1×1, zero padding P=1 and stride value S=2. How many parameters are there?
(assume no bias terms)

5
10
15

30
Yes, the answer is correct.
Score: 1
Accepted Answers:
30
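The count can be reproduced with a short sketch. It assumes, as is standard, that each filter spans all input channels, so a 1×1 filter on a 3-channel image holds 1×1×3 weights. Padding and stride change the output size but not the number of learnable weights. The helper name is hypothetical.

```python
def conv_param_count(num_filters, kernel_h, kernel_w, in_channels, bias=False):
    """Learnable parameters in one convolution layer."""
    weights = num_filters * kernel_h * kernel_w * in_channels
    return weights + (num_filters if bias else 0)

# 10 filters of size 1x1 over 3 input channels, no bias terms.
params = conv_param_count(num_filters=10, kernel_h=1, kernel_w=1, in_channels=3)

# For comparison only: the same layer with one bias per filter.
params_with_bias = conv_param_count(10, 1, 1, 3, bias=True)
```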
1 point
Which statement is true about the size of filters in CNNs?

The size of the filter does not affect the features it captures.
The size of the filter only affects the computation time.
Larger filters capture more global features.

Smaller filters capture more local features.


Yes, the answer is correct.
Score: 1
Accepted Answers:
Larger filters capture more global features.
Smaller filters capture more local features.
1 point
What is the motivation behind using multiple filters in one Convolution layer?

Reduced complexity of the network


Reduced size of the convolved image
Insufficient information

Each filter captures some feature of the image separately


Yes, the answer is correct.
Score: 1
Accepted Answers:
Each filter captures some feature of the image separately
1 point
Which of the following architectures has the highest number of layers?

AlexNet
GoogLeNet
ResNet

VGG
Yes, the answer is correct.
Score: 1
Accepted Answers:
ResNet
1 point
What is the purpose of guided backpropagation in CNNs?

To train the CNN to improve its accuracy on a given task.


To reduce the size of the input images in order to speed up computation.
To visualize which pixels in an image are most important for a particular class prediction.

None of the above.


Yes, the answer is correct.
Score: 1
Accepted Answers:
To visualize which pixels in an image are most important for a particular class prediction.
1 point
Which of the following statements is true regarding the occlusion experiment in a CNN?

It is a technique used to prevent overfitting in deep learning models.


It is used to increase the number of filters in a convolutional layer.
It is used to determine the importance of each feature map in the output of the network.
It involves masking a portion of the input image with a patch of zeroes.
Partially Correct.
Score: 0.5
Accepted Answers:
It is used to determine the importance of each feature map in the output of the network.
It involves masking a portion of the input image with a patch of zeroes.
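The masking step can be sketched as below. This is a toy illustration, not a real CNN: `toy_score` is a hypothetical stand-in for a trained network's class score, and it deliberately depends only on the top-left corner of the image so the effect of occlusion is easy to see.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2):
    """Slide a patch of zeros over the image and record the drop in score."""
    base = score_fn(image)
    h, w = image.shape
    drops = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask this region with zeros
            drops[i // patch, j // patch] = base - score_fn(occluded)
    return drops

def toy_score(img):
    """Hypothetical class score that only looks at the top-left 2x2 region."""
    return float(img[:2, :2].sum())

image = np.ones((6, 6))
heatmap = occlusion_map(image, toy_score)
```

A large drop in the score at some patch position suggests that the occluded region matters for the prediction; here only the top-left patch produces a drop.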

Assignment 11
Suppose that we need to develop an RNN model for sentiment classification. The input to the model is a
sentence composed of five words and the output is the sentiment (positive or negative). Assume that each
word is represented as a vector of size 70×1 and the output labels are one-hot encoded. Further,
the state vector $s_t$ is initialized with all zeros of size 50×1. How many parameters (including bias)
are there in the network?
Yes, the answer is correct.
Score: 1
Accepted Answers:
(Type: Numeric) 6152
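The accepted value can be reproduced with a short sketch, assuming a vanilla RNN with input-to-state weights U, state-to-state weights W, state-to-output weights V, one bias for the state and one for the output, and a 2-dimensional one-hot output. The names U, W, V and the helper function are illustrative assumptions. The sentence length does not enter the count because the weights are shared across time steps.

```python
def rnn_param_count(input_dim, state_dim, output_dim, bias=True):
    """Parameter count of a single-layer vanilla RNN with an output layer."""
    u = state_dim * input_dim    # input-to-state weights U
    w = state_dim * state_dim    # state-to-state weights W
    v = output_dim * state_dim   # state-to-output weights V
    biases = state_dim + output_dim if bias else 0
    return u + w + v + biases

# 70-dimensional word vectors, 50-dimensional state, 2 output classes.
total = rnn_param_count(input_dim=70, state_dim=50, output_dim=2)
```

The terms are 3500 + 2500 + 100 weights plus 50 + 2 biases.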
1 point
1 point
Select the true statements about BPTT.

The gradients of Loss with respect to parameters are added across time steps
The gradients of Loss with respect to parameters are subtracted across time steps
The gradient may vanish or explode, in general, if timesteps are too large

The gradient may vanish or explode if timesteps are too small


Yes, the answer is correct.
Score: 1
Accepted Answers:
The gradients of Loss with respect to parameters are added across time steps
The gradient may vanish or explode, in general, if timesteps are too large
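The second accepted statement can be illustrated with a deliberately simplified case: a scalar, linear recurrence s_t = w * s_{t-1}, for which the gradient of the final state with respect to the initial state is w raised to the number of time steps. This is an assumption-laden toy, not a full BPTT implementation, and the values of w are chosen only for illustration.

```python
def gradient_through_time(w, steps):
    """d s_T / d s_0 for the scalar linear recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step (chain rule)
    return grad

# Recurrent weight below 1: the gradient shrinks towards zero.
vanishing = gradient_through_time(0.5, 50)

# Recurrent weight above 1: the gradient grows without bound.
exploding = gradient_through_time(1.5, 50)

# With only a few time steps, neither effect is severe.
short = gradient_through_time(0.5, 3)
```

In a real RNN the scalar w is replaced by a product of Jacobian matrices, but the same repeated multiplication is what makes long sequences problematic.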
1 point
Select the correct statements about GRUs

GRUs have fewer parameters compared to LSTMs


GRUs use a single gate to control both input and forget mechanisms
GRUs are less effective than LSTMs in handling long-term dependencies

GRUs are a type of feedforward neural network


Yes, the answer is correct.
Score: 1
Accepted Answers:
GRUs have fewer parameters compared to LSTMs
GRUs use a single gate to control both input and forget mechanisms
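The first accepted statement can be checked with a rough count, assuming the common textbook formulations: an LSTM cell has four weight blocks (input, forget, and output gates plus the candidate state) and a GRU cell has three (update and reset gates plus the candidate state), each block acting on the concatenated input and hidden state with one bias vector. The dimensions are hypothetical, and specific libraries or the course's own formulation may count slightly differently.

```python
def gated_rnn_param_count(input_dim, hidden_dim, num_blocks):
    """Parameters of a gated recurrent cell with `num_blocks` weight blocks."""
    # Each block maps [input; hidden] to hidden and has one bias vector.
    per_block = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return num_blocks * per_block

# Hypothetical sizes: 70-dimensional inputs, 50-dimensional hidden state.
lstm_params = gated_rnn_param_count(70, 50, num_blocks=4)
gru_params = gated_rnn_param_count(70, 50, num_blocks=3)
```

Under these assumptions the GRU has three quarters of the LSTM's parameters for the same input and hidden sizes.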
1 point
The statement that LSTMs and GRUs solve both the vanishing and the exploding gradient problems in RNNs
is

True

False
No, the answer is incorrect.
Score: 0
Accepted Answers:
False
1 point
How does LSTM prevent the problem of vanishing gradients?

Different activation functions, such as ReLU, are used instead of sigmoid in LSTM
Gradients are normalized during backpropagation
The learning rate is increased in LSTM

Forget gates regulate the flow of gradients during backpropagation


Yes, the answer is correct.
Score: 1
Accepted Answers:
Forget gates regulate the flow of gradients during backpropagation
We construct an RNN for the sentiment classification of text, where a text can have positive sentiment or
negative sentiment. Suppose the dimension of the one-hot encoded words is $\mathbb{R}^{100 \times 1}$ and the dimension of the
state vector $s_i$ is $\mathbb{R}^{50 \times 1}$. What is the total number of parameters in the network? (Do not include
biases.)
No, the answer is incorrect.
Score: 0
Accepted Answers:
(Type: Range) 7599.5,7601.5
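The accepted range is consistent with the following count, assuming a vanilla RNN with input-to-state weights U, state-to-state weights W, and state-to-output weights V, and a 2-dimensional output (positive or negative). The matrix names are introduced here for illustration.

```latex
\underbrace{50 \times 100}_{U} + \underbrace{50 \times 50}_{W} + \underbrace{2 \times 50}_{V}
  = 5000 + 2500 + 100 = 7600
```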
1 point
1 point
Arrange the following sequence in the order they are performed by LSTM at time step t.
[Selectively read, Selectively write, Selectively forget]

Selectively read, Selectively write, Selectively forget


Selectively write, Selectively read, Selectively forget
Selectively read, Selectively forget, Selectively write

Selectively forget, Selectively write, Selectively read


No, the answer is incorrect.
Score: 0
Accepted Answers:
Selectively read, Selectively forget, Selectively write
1 point
What is the objective (loss) function in the RNN?

Cross Entropy
Sum of cross-entropy
Squared error

Accuracy
No, the answer is incorrect.
Score: 0
Accepted Answers:
Sum of cross-entropy
1 point
Which of the following is a limitation of traditional feedforward neural networks in handling sequential data?

They can only process fixed-length input sequences


They are highly optimizable using the gradient descent methods
They can’t model temporal dependencies between sequential data

All of These
Partially Correct.
Score: 0.5
Accepted Answers:
They can only process fixed-length input sequences
They can’t model temporal dependencies between sequential data
1 point
Which of the following is true about LSTM and GRU networks?

LSTM networks have more gates than GRU networks


GRU networks have more gates than LSTM networks
LSTM and GRU networks have the same number of gates

Both LSTM and GRU networks have no gates


Yes, the answer is correct.
Score: 1
Accepted Answers:
LSTM networks have more gates than GRU networks

Assignment 12
Which of the following are benefits of using attention mechanisms in neural networks?

Improved handling of long-range dependencies


Enhanced interpretability of model predictions
Reduction in model complexity

Ability to handle variable-length input sequences


Partially Correct.
Score: 0.67
Accepted Answers:
Improved handling of long-range dependencies
Enhanced interpretability of model predictions
Ability to handle variable-length input sequences
1 point
Which of the following is a disadvantage of using an encoder-decoder model for sequence-to-sequence
tasks?

The model requires a large amount of training data


The model is slow to train and requires a lot of computational resources
The generated output sequences may be limited by the capacity of the model

The model is prone to overfitting on the training data


Yes, the answer is correct.
Score: 1
Accepted Answers:
The generated output sequences may be limited by the capacity of the model
1 point
Which of the following attention mechanisms is most commonly used in the Transformer model
architecture?

Dot product attention


Additive attention
Multiplicative attention

All of the above


Yes, the answer is correct.
Score: 1
Accepted Answers:
Dot product attention
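For reference, here is a minimal NumPy sketch of dot-product attention in the scaled form used by the Transformer, softmax(QK^T / sqrt(d_k)) V. The matrix shapes and random inputs are assumptions chosen for illustration; a real model would also apply learned projections and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the attention weights."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Row-wise softmax (shifted by the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries of dimension 4 (hypothetical)
k = rng.normal(size=(3, 4))   # 3 keys of dimension 4
v = rng.normal(size=(3, 5))   # 3 values of dimension 5
output, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` is a probability distribution over the keys, which is what makes the attention pattern inspectable.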
1 point
Which scenarios would most benefit from hierarchical attention mechanisms?

Summarizing long text documents


Classifying images in a dataset
Analyzing customer reviews or feedback data

Real-time processing of sensor data


Yes, the answer is correct.
Score: 1
Accepted Answers:
Summarizing long text documents
1 point
In a hierarchical attention network, what are the two primary levels of attention?

Character-level and word-level


Word-level and sentence-level
Sentence-level and document-level

Paragraph-level and document-level


Yes, the answer is correct.
Score: 1
Accepted Answers:
Word-level and sentence-level
1 point
Which of the following is NOT a component of the attention mechanism?

Decoder
Key
Value

Encoder
Partially Correct.
Score: 0.5
Accepted Answers:
Decoder
Encoder
1 point
Which of the following is a major advantage of using an attention mechanism in an encoder-decoder
model?

Reduced computational complexity


Improved generalization to new data
Reduced risk of overfitting

None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
Improved generalization to new data
1 point
Which of the following output functions is most commonly used in the decoder of an encoder-decoder
model for translation tasks?

Sigmoid
ReLU
Softmax

Tanh
Yes, the answer is correct.
Score: 1
Accepted Answers:
Softmax
1 point
In the encoder-decoder model, what is the role of the decoder?

To generate output based on the input representations.


To encode the input
To learn the attention mechanism

None of the above


Yes, the answer is correct.
Score: 1
Accepted Answers:
To generate output based on the input representations.
1 point
We are performing a task where we generate the summary for an image using the encoder-decoder model.
Choose the correct statements.

LSTM is used as the decoder.


CNN is used as the decoder.
LSTM is used as the encoder.

None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
LSTM is used as the decoder.
