Deep Learning
1 point
Consider the following table, where x1 and x2 are features and y is a label
Assume that the elements in w are initialized to zero and the perceptron learning algorithm is used to
update the weights w. If the learning algorithm runs for enough iterations, then
True
False
Yes, the answer is correct.
Score: 1
Accepted Answers:
True
1 point
What is the perceptron algorithm used for?
NOR
NAND
NOT
XOR
Yes, the answer is correct.
Score: 1
Accepted Answers:
NOR
NAND
NOT
1 point
Which of the following threshold values of MP neuron implements AND Boolean function? Assume that the
number of inputs to the neuron is 7 and the neuron does not have any inhibitory inputs.
1
3
6
7
8
Yes, the answer is correct.
Score: 1
Accepted Answers:
7
1 point
Suppose we have a boolean function that takes 4 inputs x1, x2, x3, x4. We have an MP
neuron with parameter θ = 3. For how many inputs will this MP neuron give output y = 1?
5
4
1
16
Yes, the answer is correct.
Score: 1
Accepted Answers:
5
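The two MP neuron questions above can be checked by brute force. The following is a minimal sketch, assuming the usual McCulloch-Pitts rule without inhibitory inputs (output 1 when the sum of the binary inputs reaches the threshold); the function name `mp_neuron` is illustrative and not from the course material.

```python
from itertools import product


def mp_neuron(inputs, theta):
    """MP neuron without inhibitory inputs: fires when sum(inputs) >= theta."""
    return 1 if sum(inputs) >= theta else 0


# Enumerate all 2^4 binary inputs and keep those for which theta = 3 fires.
firing = [x for x in product([0, 1], repeat=4) if mp_neuron(x, 3) == 1]
print(len(firing))  # 5: four inputs with exactly three 1s, plus (1, 1, 1, 1)

# With 7 inputs, theta = 7 fires only when every input is 1, which is AND.
and_like = all(
    mp_neuron(x, 7) == int(all(x)) for x in product([0, 1], repeat=7)
)
print(and_like)  # True
```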
1 point
Consider points shown in the picture. The vector w = [−1, −1]. As per this weight vector, the
Perceptron algorithm will predict which classes for the data points x1 and x2.
NOTE:
y = 1 if w^T x > 0, and y = −1 if w^T x ≤ 0
x1 = −1
x1 = 1
x2 = −1
x2 = 1
Yes, the answer is correct.
Score: 1
Accepted Answers:
x1 = −1
x2 = 1
1 point
Consider the following table, where x1 and x2 are features (packed into a single
vector x = [x1, x2]) and y is a label:
Suppose that the perceptron model is used to classify the data points. Suppose further that the
weights w are initialized to w = [1, 1]. The following rule is used for classification,
2
1
0
8
16
256
64
Yes, the answer is correct.
Score: 1
Accepted Answers:
256
1 point
Which of the following statements is(are) true about the following function?
σ(z) = 1 / (1 + e^(−z))
Set η = 1
Set η = 0
65,536
8
256
64
Yes, the answer is correct.
Score: 1
Accepted Answers:
65,536
1 point
We have a function that we want to approximate using 150 rectangles (towers). How many neurons are
required to construct the required network?
301
451
150
500
No, the answer is incorrect.
Score: 0
Accepted Answers:
301
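The accepted count of 301 is consistent with the common tower construction. The sketch below assumes that each rectangle (tower) is formed as the difference of two steep sigmoid neurons and that one output neuron sums all towers; that construction is an assumption here, since the question does not spell it out.

```python
def neurons_for_towers(num_towers):
    """Neuron count under the assumed construction:
    two sigmoid neurons per tower plus one summing output neuron."""
    return 2 * num_towers + 1


print(neurons_for_towers(150))  # 301
```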
1 point
What happens to the output of the sigmoid function as |x| becomes very large for input x? Select all
relevant operations
We cannot make any prediction as the value of ω^T x does not make sense
The label of the test point is 0.
The label of the test point is 1.
0
-3
−4.5
−3
Yes, the answer is correct.
Score: 1
Accepted Answers:
0
1 point
What is the purpose of the gradient descent algorithm in machine learning?
AND
OR
XOR
NAND
Yes, the answer is correct.
Score: 1
Accepted Answers:
OR
1 point
Choose the correct input-output pair for the given MP Neuron.
y = 1 if x1 + x2 + x3 ≥ 2, and y = 0 otherwise
11
21
15
8
No, the answer is incorrect.
Score: 0
Accepted Answers:
11
1 point
We are given the following data:
Can you classify every label correctly by training a perceptron algorithm? (assume bias to be 0 while
training)
Yes
No
Yes, the answer is correct.
Score: 1
Accepted Answers:
No
1 point
We are given the following dataset with features as (x1,x2) and y as the label (-1,1). If we apply the
perceptron algorithm on the following dataset with w initialized as (0,0), what will be the value of w when
the algorithm converges? (Start the algorithm from (2,2).)
(-2,2)
(2,1)
(2,-1)
None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
(2,-1)
1 point
Consider points shown in the picture. The vector w is (-1,0). As per this weight vector, the Perceptron
algorithm will predict which classes for the data points x1 and x2.
x1=1
x2=1
x1=-1
x2=-1
Yes, the answer is correct.
Score: 1
Accepted Answers:
x2=1
x1=-1
1 point
Given an MP neuron with the inputs as x1,x2,x3,x4,x5 and threshold θ = 3 where x5 is an inhibitory input.
For input (1,1,1,0,1) what will be the value of y?
y = 0
y = 1/2
Insufficient information
Yes, the answer is correct.
Score: 1
Accepted Answers:
y = 0
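The effect of the inhibitory input can be traced with a small sketch. It assumes the standard MP convention that any active inhibitory input forces the output to 0, regardless of the excitatory sum; the function name is illustrative.

```python
def mp_neuron_inhibitory(excitatory, inhibitory, theta):
    """MP neuron where any active inhibitory input suppresses the output."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= theta else 0


# Input (1, 1, 1, 0, 1): x1..x4 are excitatory, x5 is inhibitory.
y = mp_neuron_inhibitory(excitatory=(1, 1, 1, 0), inhibitory=(1,), theta=3)
print(y)  # 0, because x5 = 1 blocks the neuron even though the sum is 3
```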
1 point
An MP neuron takes two inputs x1 and x2. Its threshold is θ = 0. Select all the boolean functions this
MP neuron may represent.
AND
NOT
OR
NOR
Yes, the answer is correct.
Score: 1
Accepted Answers:
NOR
1 point
What is the output of a perceptron with weight vector w = [2, −3, 1] and bias b = −2 when
the input is x = [1, 0, −1]?
0
1
-1
2
Yes, the answer is correct.
Score: 1
Accepted Answers:
-1
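The arithmetic behind this answer can be written out as follows. This is a sketch that assumes the output convention stated in an earlier question (1 if the pre-activation is positive, −1 otherwise); with these numbers the pre-activation itself is also −1.

```python
w = [2, -3, 1]
x = [1, 0, -1]
b = -2

# w . x + b = (2)(1) + (-3)(0) + (1)(-1) + (-2) = -1
pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b

# Assumed output rule: +1 for a positive pre-activation, -1 otherwise.
output = 1 if pre_activation > 0 else -1
print(pre_activation, output)  # -1 -1
```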
1 point
What is the "winter of AI" referring to in the history of artificial intelligence?
The period during winter when AI technologies are least effective due to cold temperatures
A phase marked by decreased funding and interest in AI research.
The season when AI algorithms perform at their peak efficiency.
Assignment 2
How many boolean functions can be designed for 3 inputs?
8
16
256
64
Yes, the answer is correct.
Score: 1
Accepted Answers:
256
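The count follows from the truth table: n binary inputs give 2^n rows, and each row can independently map to 0 or 1, so there are 2^(2^n) functions. A short sketch (function name illustrative) that also confirms the formula by enumeration:

```python
from itertools import product


def num_boolean_functions(n):
    """Number of distinct Boolean functions of n binary inputs."""
    return 2 ** (2 ** n)


print(num_boolean_functions(3))  # 256

# Cross-check by enumerating every possible output column of the 8-row table.
enumerated = sum(1 for _ in product([0, 1], repeat=2 ** 3))
print(enumerated)  # 256
```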
1 point
What happens to the output of the sigmoid function as |x| becomes very large for input x? Select all
relevant operations
The output approaches 0.5
The output approaches 1.
The output oscillates between 0 and 1.
We cannot make any prediction as the value of ω^T x does not make sense
The label of the test point is 0.
The label of the test point is 1.
0
-3
−4.5
−3
Yes, the answer is correct.
Score: 1
Accepted Answers:
0
1 point
What is the purpose of the gradient descent algorithm in machine learning?
Assignment 4
A team has a data set that contains 1000 samples for training a feed-forward neural network. Suppose they
decided to use stochastic gradient descent algorithm to update the weights. How many times do the
weights get updated after training the network for 5 epochs?
1000
5000
100
5
Yes, the answer is correct.
Score: 1
Accepted Answers:
5000
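The update count can be computed directly. The sketch below assumes that stochastic gradient descent here means one weight update per training sample (batch size 1); the `batch_size` parameter is added only to contrast with mini-batch and full-batch training.

```python
import math


def num_updates(num_samples, epochs, batch_size=1):
    """Weight updates = epochs * (number of batches per epoch)."""
    return epochs * math.ceil(num_samples / batch_size)


print(num_updates(1000, 5))                   # 5000 (stochastic, batch size 1)
print(num_updates(1000, 5, batch_size=1000))  # 5 (full-batch gradient descent)
```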
1 point
What is the primary benefit of using Adagrad compared to other optimization algorithms?
It is less likely to get stuck in local optima than other optimization algorithms.
Yes, the answer is correct.
Score: 1
Accepted Answers:
It is less sensitive to the choice of hyperparameters(learning rate).
1 point
What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
100
5
500
10
50
Yes, the answer is correct.
Score: 1
Accepted Answers:
10
1 point
Select the true statements about the factor β used in the momentum-based gradient descent algorithm.
Setting β = 0.1 allows the algorithm to move faster than the vanilla gradient descent algorithm
Oscillation around the minimum will be less if we set β = 0.1 than setting β = 0.99
1 point
What is the advantage of using mini-batch gradient descent over batch gradient descent?
Mini-batch gradient descent is more computationally efficient than batch gradient descent.
Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch gradient
descent.
Mini batch gradient descent gives us a better solution.
Mini-batch gradient descent can converge faster than batch gradient descent.
Partially Correct.
Score: 0.5
Accepted Answers:
Mini-batch gradient descent is more computationally efficient than batch gradient descent.
Mini-batch gradient descent can converge faster than batch gradient descent.
1 point
We have the following functions: x^3, ln(x), e^x, x, and 4. Which of the following functions has the
steepest slope at x=1?
x^3
ln(x)
e^x
4
No, the answer is incorrect.
Score: 0
Accepted Answers:
x^3
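The slopes at x = 1 are 3 for x^3, 1 for ln(x), e ≈ 2.718 for e^x, 1 for x, and 0 for the constant 4. A minimal numerical check using central differences (the helper `slope_at` and the step size are illustrative choices):

```python
import math


def slope_at(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


functions = {
    "x^3": lambda x: x ** 3,
    "ln(x)": math.log,
    "e^x": math.exp,
    "x": lambda x: x,
    "4": lambda x: 4.0,
}

slopes = {name: slope_at(f, 1.0) for name, f in functions.items()}
steepest = max(slopes, key=lambda name: abs(slopes[name]))
print(steepest)  # x^3
```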
1 point
Which of the following are among the disadvantages of Adagrad?
Weight updates are very small at the initial stages of the algorithm.
Yes, the answer is correct.
Score: 1
Accepted Answers:
It gets stuck before reaching the minima.
1 point
Which of the following is a variant of gradient descent that uses an estimate of the next gradient to update
the current position of the parameters?
Momentum optimization
Stochastic gradient descent
Nesterov accelerated gradient descent
Adagrad
Yes, the answer is correct.
Score: 1
Accepted Answers:
Nesterov accelerated gradient descent
1 point
What is the role of activation functions in deep learning?
Activation functions transform the output of a neuron into a non-linear function, allowing the network to
learn complex patterns.
Activation functions make the network faster by reducing the number of iterations needed for training.
Activation functions are used to normalize the input data.
Activation functions are used to compute the loss function.
Yes, the answer is correct.
Score: 1
Accepted Answers:
Activation functions transform the output of a neuron into a non-linear function, allowing the network to
learn complex patterns.
Assignment 5
Which of the following is a measure of the amount of variance explained by a principal component in PCA?
Covariance
Correlation
Mean absolute deviation
Eigenvalue
Yes, the answer is correct.
Score: 1
Accepted Answers:
Eigenvalue
1 point
What is/are the limitations of PCA?
3, 4
9, 16
27, 64
√3, √4
Yes, the answer is correct.
Score: 1
Accepted Answers:
27, 64
1 point
If we have a 12×12 matrix having entries from ℝ, how many linearly independent eigenvectors
corresponding to real eigenvalues are possible for this matrix?
10
24
12
6
Partially Correct.
Score: 0.33
Accepted Answers:
10
12
6
[5, 5]
[1.67, 1.67]
[2, 2]
[1.5, 1.5]
Yes, the answer is correct.
Score: 1
Accepted Answers:
[1.67, 1.67]
1 point
The covariance matrix C = (1/n) Σ_{i=1}^{n} (x − x̄)(x − x̄)^T is given by: (x̄ is the
mean of the data points)
[[0.22, −0.11], [−0.11, 0.22]]
[[0.33, −0.17], [−0.17, 0.33]]
[[0.22, −0.22], [−0.22, 0.22]]
[[0.33, −0.33], [−0.33, 0.33]]
Yes, the answer is correct.
Score: 1
Accepted Answers:
[[0.22, −0.11], [−0.11, 0.22]]
1 point
The maximum eigenvalue of the covariance matrix C is:
0.33
0.67
1
0.5
Yes, the answer is correct.
Score: 1
Accepted Answers:
0.33
1 point
The eigenvector corresponding to the maximum eigenvalue of the given matrix C is:
[0.71, 0.71]
[−0.71, 0.71]
[−1, 1]
[1, 1]
No, the answer is incorrect.
Score: 0
Accepted Answers:
[−0.71, 0.71]
OR
[−1, 1]
What is the determinant of a 2×2 matrix that has eigenvalues of 4 and 5?
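The eigenvalue and eigenvector answers can be reproduced from the accepted covariance matrix with entries 0.22 on the diagonal and −0.11 off the diagonal. The sketch below uses the closed form for a symmetric 2×2 matrix [[a, b], [b, d]]; the function name is illustrative, and the eigenvector formula assumes b ≠ 0.

```python
import math


def max_eig_symmetric_2x2(a, b, d):
    """Largest eigenvalue and a unit eigenvector of [[a, b], [b, d]], b != 0."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    lam = mean + radius
    # (b, lam - a) solves the first row of (C - lam * I) v = 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)


lam, v = max_eig_symmetric_2x2(0.22, -0.11, 0.22)
print(round(lam, 2), [round(c, 2) for c in v])  # 0.33 [-0.71, 0.71]
```

The unnormalized direction is [−1, 1], which is why both forms are accepted; an eigenvector is defined only up to a nonzero scale factor.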
Assignment 6
We are given an autoencoder A. The average activation value of neurons in this network is 0.01. The given
autoencoder is
Contractive autoencoder
Overcomplete neural network
Denoising autoencoder
Sparse autoencoder
No, the answer is incorrect.
Score: 0
Accepted Answers:
Sparse autoencoder
1 point
What is the primary reason for adding corruption to the input data in a denoising autoencoder?
Under-complete autoencoder
Complete autoencoder
Overcomplete autoencoder
Sparse autoencoder
Yes, the answer is correct.
Score: 1
Accepted Answers:
Under-complete autoencoder
1 point
Which of the following statements about regularization in autoencoders is always true?
All of these.
No, the answer is incorrect.
Score: 0
Accepted Answers:
Regularisation reduces the search space of weights for the network.
Regularisation helps to reduce the overfitting in overcomplete autoencoders.
1 point
What are the advantages of using a denoising autoencoder?
Contractive autoencoder
Overcomplete neural network
Sparse autoencoder
Denoising autoencoder
Yes, the answer is correct.
Score: 1
Accepted Answers:
Sparse autoencoder
1 point
If the dimension of the input layer in an over-complete autoencoder is 5, what is the possible dimension of
the hidden layer?
4
2
8
0
Yes, the answer is correct.
Score: 1
Accepted Answers:
8
1 point
What is the primary objective of sparse autoencoders that distinguishes it from vanilla autoencoder?
They maximize the mutual information between the input and the output
Yes, the answer is correct.
Score: 1
Accepted Answers:
They capture only the important variations/features in the data
1 point
Suppose for one data point we have
features x1, x2, x3, x4, x5 as −2, 12, 4.2, 7.6, 0. Which of the
following functions should we use on the output layer (decoder)?
Logistic
Relu
Tanh
Linear
Yes, the answer is correct.
Score: 1
Accepted Answers:
Linear
Assignment 8
Which of the following activation functions is not zero-centered?
Sigmoid
Tanh
ReLU
Softmax
Partially Correct.
Score: 0.34
Accepted Answers:
Sigmoid
ReLU
Softmax
1 point
Which of the following are common issues caused by saturating neurons in deep networks?
Vanishing gradients
Slow convergence during training
Overfitting
Saturation
Yes, the answer is correct.
Score: 1
Accepted Answers:
Computationally expensive
Saturation
1 point
Which of the following activation functions is preferred to avoid the vanishing gradient problem?
Sigmoid
Tanh
ReLU
None of these
Yes, the answer is correct.
Score: 1
Accepted Answers:
ReLU
1 point
How does pre-training prevent overfitting in deep networks?
It adds regularization
It initializes the weights near local minima
It constrains the weights to a certain region
Mean
Variance
γ
ϵ
Yes, the answer is correct.
Score: 1
Accepted Answers:
γ
Assignment 9
Consider the following corpus: "human machine interface for computer applications. user opinion of
computer system response time. user interface management system. system engineering for improved
response time". What is the size of the vocabulary of the above corpus?
13
14
15
16
No, the answer is incorrect.
Score: 0
Accepted Answers:
15
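The vocabulary size can be verified by tokenizing the corpus and counting unique words. This sketch assumes a simple whitespace tokenizer with periods stripped, which is enough for this all-lowercase corpus.

```python
corpus = (
    "human machine interface for computer applications. "
    "user opinion of computer system response time. "
    "user interface management system. "
    "system engineering for improved response time"
)

# Strip sentence-ending periods, split on whitespace, and keep unique tokens.
tokens = corpus.replace(".", " ").split()
vocabulary = set(tokens)
print(len(vocabulary))  # 15
```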
1 point
At the input layer of a continuous bag of words model, we multiply a one-hot vector x ∈ R^|V| with the
parameter matrix W ∈ R^(k×|V|). What does each column of W correspond to?
Suppose that we use the continuous bag of words (CBOW) model to find vector representations of words.
Suppose further that we use a context window of size 3 (that is, given the 3 context words, predict the
target word P(w_t | (w_i, w_j, w_k))). The size of word vectors (vector representation of words)
is chosen to be 100 and the vocabulary contains 10,000 words. The input to the network is the one-hot
encoding (also called 1-of-V encoding) of word(s). How many parameters (weights), excluding bias, are
there in W_word? Enter the answer in thousands. For example, if your answer is 50,000, then just
enter 50.
No, the answer is incorrect.
Score: 0
Accepted Answers:
(Type: Numeric) 1000
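The accepted value follows from the shape of the word matrix. The sketch below assumes W_word is the k × |V| input embedding matrix shared across all context positions, so the context window size does not change its parameter count.

```python
embedding_dim = 100   # k, size of each word vector
vocab_size = 10_000   # |V|

# One k-dimensional column per vocabulary word; the same matrix is reused
# for every context word, so the window size of 3 adds no parameters.
num_params = embedding_dim * vocab_size
print(num_params // 1000)  # 1000 (answer expressed in thousands)
```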
1 point
1 point
Let count(w, c) be the number of times the words w and c appear together in the corpus
(i.e., occur within a window of a few words around each other). Further,
let count(w) and count(c) be the total number of times the words w and c appear
in the corpus, respectively, and let N be the total number of words in the corpus. The PMI
between w and c is then given by:
log[(count(w, c) * count(w)) / (N * count(c))]
log[(count(w, c) * count(c)) / (N * count(w))]
log[(count(w, c) * N) / (count(w) * count(c))]
No, the answer is incorrect.
Score: 0
Accepted Answers:
log[(count(w, c) * N) / (count(w) * count(c))]
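The accepted formula is log of the ratio between the joint probability count(w, c)/N and the product of the marginals (count(w)/N)(count(c)/N). A minimal sketch follows; the counts are made-up illustrative numbers, and the natural logarithm is an assumption since the question does not fix the base.

```python
import math


def pmi(count_wc, count_w, count_c, total_words):
    """Pointwise mutual information from raw co-occurrence counts."""
    return math.log((count_wc * total_words) / (count_w * count_c))


# Hypothetical counts: the pair co-occurs 20 times more often than
# independence would predict, so the PMI is log(20).
print(pmi(count_wc=10, count_w=100, count_c=50, total_words=10_000))

# If the words co-occur exactly as often as chance predicts, PMI is 0.
print(pmi(count_wc=5, count_w=100, count_c=500, total_words=10_000))
```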
1 point
Suppose we are learning the representations of words using Glove representations. If we observe that the
cosine similarity between two representations v_i and v_j for words 'i' and 'j' is very high, which of the
following statements is true? (parameters b_i = 0.02 and b_j = 0.07)
X_ij = 0.02
X_ij = 0.2
X_ij = 0.88
X_ij = 0
No, the answer is incorrect.
Score: 0
Accepted Answers:
X_ij = 0.88
1 point
Which of the following is a disadvantage of one hot encoding?
To calculate the dot product between the target word and the context words
To transform the dot product into a probability distribution
To calculate the distance between the target word and the context words
O(n)
O(n^2)
O(n log n)
O(log n)
No, the answer is incorrect.
Score: 0
Accepted Answers:
O(n)
1 point
How does Hierarchical Softmax reduce the computational complexity of computing the softmax function?
It does not reduce the computational complexity of computing the softmax function
No, the answer is incorrect.
Score: 0
Accepted Answers:
It uses a binary tree to approximate the softmax function
Assignment 10
Consider an input image of size 1000×1000×10 where 10 refers to the number of
channels (Such images do exist!). Suppose we want to apply a convolution operation on the entire image
by sliding a kernel of size 1×1×d. What should be the depth d of the kernel?
K1: 3×3
K2: 7×7
K3: 17×17
K4: 41×41
Assume that stride s = 1 and no zero padding. Among all these kernels which one shrinks the output
dimensions the most?
K1
K2
K3
K4
Yes, the answer is correct.
Score: 1
Accepted Answers:
K4
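The shrinkage follows from the standard output-size formula, floor((W + 2P − F) / S) + 1. The sketch below applies it to the four kernels; the input width of 1000 is a hypothetical value chosen only for illustration, since the kernel question does not state the image size.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


input_size = 1000  # hypothetical input width, not given in the question
kernels = {"K1": 3, "K2": 7, "K3": 17, "K4": 41}

outputs = {name: conv_output_size(input_size, k) for name, k in kernels.items()}
smallest = min(outputs, key=outputs.get)
print(outputs)   # {'K1': 998, 'K2': 994, 'K3': 984, 'K4': 960}
print(smallest)  # K4: the largest kernel removes the most border pixels
```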
1 point
Which of the following statements about CNN is (are) true?
5
10
15
30
Yes, the answer is correct.
Score: 1
Accepted Answers:
30
1 point
Which statement is true about the size of filters in CNNs?
The size of the filter does not affect the features it captures.
The size of the filter only affects the computation time.
Larger filters capture more global features.
AlexNet
GoogleNet
ResNet
VGG
Yes, the answer is correct.
Score: 1
Accepted Answers:
ResNet
1 point
What is the purpose of guided backpropagation in CNNs?
Assignment 11
Suppose that we need to develop an RNN model for sentiment classification. The input to the model is a
sentence composed of five words and the output is the sentiment (positive or negative). Assume that each
word is represented as a vector of length 70×1 and the output labels are one-hot encoded. Further,
the state vector s_t is initialized with all zeros of size 50×1. How many parameters (including bias)
are there in the network?
Yes, the answer is correct.
Score: 1
Accepted Answers:
(Type: Numeric) 6152
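The accepted count can be reproduced by summing the weight matrices of a vanilla RNN. This sketch assumes the usual parameterization (input-to-state U, state-to-state W, state bias b, state-to-output V, output bias c), with parameters shared across the five time steps so sentence length does not matter; the names are conventional rather than taken from the question.

```python
def rnn_param_count(input_dim, state_dim, output_dim):
    """Parameter count of a vanilla RNN with a single output layer."""
    u = state_dim * input_dim    # input-to-state weights: 50 x 70
    w = state_dim * state_dim    # state-to-state weights: 50 x 50
    b = state_dim                # state bias
    v = output_dim * state_dim   # state-to-output weights: 2 x 50
    c = output_dim               # output bias
    return u + w + b + v + c


# Two one-hot output classes: positive and negative.
print(rnn_param_count(input_dim=70, state_dim=50, output_dim=2))  # 6152
```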
1 point
1 point
Select the true statements about BPTT?
The gradients of Loss with respect to parameters are added across time steps
The gradients of Loss with respect to parameters are subtracted across time steps
The gradient may vanish or explode, in general, if timesteps are too large
True
False
No, the answer is incorrect.
Score: 0
Accepted Answers:
False
1 point
How does LSTM prevent the problem of vanishing gradients?
Different activation functions, such as ReLU, are used instead of sigmoid in LSTM
Gradients are normalized during backpropagation
The learning rate is increased in LSTM
Cross Entropy
Sum of cross-entropy
Squared error
Accuracy
No, the answer is incorrect.
Score: 0
Accepted Answers:
Sum of cross-entropy
1 point
Which of the following is a limitation of traditional feedforward neural networks in handling sequential data?
All of These
Partially Correct.
Score: 0.5
Accepted Answers:
They can only process fixed-length input sequences
They can’t model temporal dependencies between sequential data
1 point
Which of the following is true about LSTM and GRU networks?
Assignment 12
Which of the following are benefits of using attention mechanisms in neural networks?
Decoder
Key
Value
Encoder
Partially Correct.
Score: 0.5
Accepted Answers:
Decoder
Encoder
1 point
Which of the following is a major advantage of using an attention mechanism in an encoder-decoder
model?
None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
Improved generalization to new data
1 point
Which of the following output functions is most commonly used in the decoder of an encoder-decoder
model for translation tasks?
Sigmoid
ReLU
Softmax
Tanh
Yes, the answer is correct.
Score: 1
Accepted Answers:
Softmax
1 point
In the encoder-decoder model, what is the role of the decoder?
None of These
Yes, the answer is correct.
Score: 1
Accepted Answers:
LSTM is used as the decoder.