Deep Learning MCQ
DEEP LEARNING
By
YouTube: www.youtube.com/c/UniversityAcademy
3. Supervised learning differs from unsupervised clustering in that supervised learning requires
(A) at least one input attribute.
(B) input attributes to be categorical.
(C) at least one output attribute.
(D) output attributes to be categorical.
Correct Answer: C
8. Assume that we have a dataset containing information about 200 individuals. One
hundred of these individuals have purchased life insurance. A supervised data
mining session has discovered the following rule:
How many individuals in the class life insurance = no have credit card insurance and
are less than 30 years old?
(A) 63
(B) 70
(C) 30
(D) 18
Correct Answer: D
9. Which statement is true about neural network and linear regression models?
(A) Both models require input attributes to be numeric.
(B) Both models require numeric attributes to range between 0 and 1.
(C) The output of both models is a categorical attribute value.
(D) Both techniques build models whose output is determined by a
linear sum of weighted input attribute values.
(E) More than one of (A), (B), (C), or (D) is true.
Correct Answer: A
12. The average positive difference between computed and desired outcome values is called:
(A) root mean squared error
(B) mean squared error
(C) mean absolute error
(D) mean positive error
Correct Answer: D
17. Which of the following statements is true for the Perceptron Learning Algorithm?
(A) I & II
(B) I
(C) II
(D) None of the statements
Correct Answer: C
18. Sigmoid is
(A) Smooth
(B) Continuous
(C) Differentiable
(D) All of these
Correct Answer: D
19. Which kind of optimisation is required in 2-class linear SVM training procedure?
(A) Linear
(B) Quadratic
(C) Cubic
(D) Depends on dimension of the data
Correct Answer: B
20. The number of Lagrange multipliers required for estimating SVM training parameters depends
on
(A) The dimension of the data
(B) The number of classes
(C) The total number of support vectors from training samples
(D) All of the above
Correct Answer: C
21. The distance of the feature vector [2, 3, -2] from the separating plane x1 + 2x2 + 2x3 + 5 = 0 is given
by
(A) 5
(B) 3/13
(C) 3
(D) 13
Correct Answer: C
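A quick numeric check of question 21, as a minimal Python sketch (the distance from a point x to the plane w.x + b = 0 is |w.x + b| / ||w||):

    import numpy as np
    w = np.array([1.0, 2.0, 2.0])              # coefficients of x1 + 2x2 + 2x3 + 5 = 0
    b = 5.0
    x = np.array([2.0, 3.0, -2.0])
    dist = abs(w @ x + b) / np.linalg.norm(w)
    print(dist)                                # |2 + 6 - 4 + 5| / sqrt(9) = 9/3 = 3.0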
22. For separable data with D-dimensional feature vectors, K classes, and a total of N examples, how
many unknown variables are to be estimated in SVM training to get a separating plane which
maximizes the margin? How many constraints?
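The question above is left open in the source; as a sketch for the two-class separable case (an assumption, since the question also mentions K classes), the hard-margin primal is: minimize (1/2)||w||^2 subject to y_i(w^T x_i + b) >= 1 for i = 1, ..., N. This has D + 1 unknowns (the vector w in R^D plus the scalar b) and N constraints; a one-vs-rest extension to K classes would multiply both counts by K.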
23. The values of Lagrange multipliers corresponding to the support vectors can be:
(A) Less than zero
(B) Greater than zero
(C) Any real number
(D) Any real non-zero number
Correct Answer: B
24. A data point with 5 dimensions [27, 40, -15, 30, 38] obtains the scores [18, 20, -5, -15, 19] over five
classes. The hinge loss incurred when the true class is the second class (class 2), with a margin of 5, is:
(A) 37
(B) 7
(C) 3
(D) 120
Correct Answer: B
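A worked check of question 24 as a minimal Python sketch, assuming the standard multi-class hinge loss (a sum over all non-true classes):

    import numpy as np
    scores = np.array([18.0, 20.0, -5.0, -15.0, 19.0])
    true_class = 1                       # class 2, zero-indexed
    margins = np.maximum(0, scores - scores[true_class] + 5.0)
    margins[true_class] = 0              # the true class contributes no loss
    print(margins.sum())                 # max(0, 3) + 0 + 0 + max(0, 4) = 7.0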
(A) True
(B) False
Correct Answer: A
26. In the sigmoid (logistic) function, when W^T X = 0, the sigmoid value is equal to
(A) 0
(B) 1
(C) ∞
(D) 0.5
Correct Answer: D
27. In the logistic function, when we increase the value of w, the function approaches a
(A) Step function
(B) Sigmoid function
(C) quadratic function
(D) None of these
Correct Answer: A
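A minimal sketch illustrating questions 26 and 27: the sigmoid outputs 0.5 when its input is zero, and scaling the weight upward makes the curve approach a step function:

    import numpy as np
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    print(sigmoid(0.0))                  # 0.5 when W^T X = 0
    x = np.array([-1.0, -0.1, 0.1, 1.0])
    print(sigmoid(1.0 * x))              # gentle S-curve for small w
    print(sigmoid(100.0 * x))            # nearly 0/1 for large w: almost a step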
28. Which parameter(s) need to be learned when minimizing the objective function in supervised learning?
(A) Only Weight
(B) Only Bias
(C) Both Weight and Bias
(D) Learning rate
(E) None of these
Correct Answer: C
30. Why have convolutional neural networks taken off quickly in recent times?
(A) Access to large amount of digitized data
(B) Integration of feature extraction within the training process
(C) Availability of more computational power
(D) All of the above
Correct Answer: D
31. "Momentum-based gradient descent and Nesterov accelerated gradient descent are faster
than the stochastic gradient descent algorithm."
(A) True
(B) False
Correct Answer: B
32.Consider the following statement, “It takes less time to navigate the regions having a gentle slope”
The above statement is true in case of
I. Gradient descent algorithm
II. Momentum based gradient descent algorithm
(A) I
(B) II
(C) II & I
Correct Answer: B
33. Identify the technique used to achieve a relatively better learning rate by updating w using a
bunch of different values of η.
(A) Bias Correction
(B) Line Search
(C) Stochastic
(D) All the above
Correct Answer: B
34. "There is no guarantee that the loss decreases at each step in stochastic gradient descent."
(A) True
(B) False
Correct Answer: A
36. Pick the learning-rate annealing method that has only the number of epochs as its
hyperparameter.
(A) Step decay
(B) Exponential Decay
(C) 1/t Decay
Correct Answer: A
37. Adagrad gets stuck when it is close to convergence. How does RMSProp overcome this problem?
(A) More Aggressive on decay
(B) Less Aggressive on decay
(C) No decay
Correct Answer: B
38. Which of the following is the best strategy for arriving at the optimal weights and biases of a neural network?
(A) Assign random values and pray to God they are correct
(B) Search every possible combination of weights and biases till you get the best value
(C) Iteratively check how far you are from the best values after each assignment, and slightly
change the assigned values to make them better
(D) None of these
Correct Answer: C
39. What is the correct order of the following steps when training a neural network?
1. Calculate error between the actual value and the predicted value
2. Reiterate until you find the best weights of network
3. Pass an input through the network and get values from output layer
4. Initialize random weight and bias
5. Go to each neurons which contributes to the error and change its respective
values to reduce the error
(A) 1, 2, 3, 4, 5
(B) 5, 4, 3, 2, 1
(C) 3, 2, 1, 5, 4
(D) 4, 3, 1, 5, 2
Correct Answer: D
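A minimal sketch of that 4 → 3 → 1 → 5 → 2 ordering, assuming a single linear neuron trained with squared error (data, learning rate, and epoch count are hypothetical):

    import numpy as np
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
    w, b = rng.normal(size=3), 0.0           # step 4: initialize random weight and bias
    for epoch in range(100):                 # step 2: reiterate until the weights are good
        y_hat = X @ w + b                    # step 3: pass input through the network
        err = y_hat - y                      # step 1: error between actual and predicted
        w -= 0.01 * (X.T @ err) / len(y)     # step 5: nudge each contributing weight...
        b -= 0.01 * err.mean()               # ...and the bias to reduce the error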
(A) True
(B) False
Correct Answer: B
42. Which of the following techniques performs a similar operation to dropout in a neural network?
(A) Bagging
(B) Boosting
(C) Stacking
(D) None of these
Correct Answer: A
44. Which of the following is true about model capacity (where model capacity means the ability of a
neural network to approximate complex functions)?
(A) As number of hidden layers increase, model capacity increases
(B) As dropout ratio increases, model capacity increases
(C) As learning rate increases, model capacity increases
(D) None of these
Correct Answer: A
45. If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of
test data always decreases. True or False?
(A) True
(B) False
Correct Answer: B
46. Can a neural network model the function (y=1/x)?
(A) Yes
(B) No
Correct Answer: A
47. In which neural network architecture does weight sharing occur?
49. Instead of trying to achieve absolute zero error, we set a target called the Bayes error, which is the
error we hope to achieve. What could be the reason for using the Bayes error?
(A) Input variables may not contain complete information about the output variable
(B) System (that creates input-output mapping) may be stochastic
(C) Limited training data
(D) All the above
Correct Answer: D
50. The number of neurons in the output layer should match the number of classes (where the number
of classes is greater than 2) in a supervised learning task. True or False?
(A) True
(B) False
Correct Answer: B
51.In a neural network, which of the following techniques is used to deal with overfitting?
(A) Dropout
(B) Regularization
(C) Batch Normalization
(D) All of these
Correct Answer: D
Can this equation be represented by a neural network with a single hidden layer of linear threshold units?
(A) Yes
(B) No
Correct Answer: B
What is a dead unit in a neural network?
(A) A unit which doesn't update during training by any of its neighbours
(B) A unit which does not respond completely to any of the training patterns
(C) The unit which produces the biggest sum-squared error
(D) None of these
Correct Answer: A
56. Which gradient descent technique is more advantageous when the data is too big to fit in RAM
all at once?
(A) Full Batch Gradient Descent
(B) Stochastic Gradient Descent
Correct Answer: B
57. What factors govern the choice of depth of a neural network?
58. Consider the scenario. The problem you are trying to solve has a small amount of data.
Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of
the following methodologies would you choose to make use of this pre-trained network?
(A) Re-train the model for the new dataset
(B) Assess on every layer how the model performs and only select a few of them
(C) Fine-tune the last couple of layers only
(D) Freeze all the layers except the last, re-train the last layer
Correct Answer: D
61. Which of the following methods uses arithmetic operations in the evaluation of word representations?
(A) Semantic relatedness
(A) IV, II
(B) I, III
(C) V
(D) None of these
Correct Answer: C
67. The problem of generating a sentence given an image can possibly be solved with the encoder-
decoder architecture.
(A) Yes
(B) No
Correct Answer: A
68. For document classification and summarization, it is important to look at the important sentences
and important words. What kind of “attention” mechanism is required for encoding?
(A) Hierarchical
(B) Ungraded
(C) Sequential
(D) Unordered
Correct Answer: A
69. 48 filters of size 21 x 21 are applied to an RGB image of size 327 x 327, with zero padding and a stride of
3. The depth of each filter is the same as the depth of the image. What will be the
volume of the output?
(A) 103 x 103 x 3
(B) 103 x 103 x 48
(C) 327 x 327 x 3
(D) 327 x 327 x 48
Correct Answer: B
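The arithmetic behind question 69 follows the standard output-size formula, sketched here in Python:

    def conv_output_size(n, f, stride, pad):
        # spatial output size for an n x n input and an f x f filter
        return (n + 2 * pad - f) // stride + 1
    side = conv_output_size(327, 21, stride=3, pad=0)   # (327 - 21) / 3 + 1 = 103
    print(side, side, 48)   # 103 x 103 x 48: output depth equals the number of filters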
76. Which of the following is used to map a sentence plan into sentence structure?
(A) Text planning
(B) Sentence planning
(C) Text Realization
(D) None of the Above
Correct Answer: C
76. Given a sound clip of a person or people speaking, determine the textual representation of the
speech.
(A) Text-to-speech
(B) Speech-to-text
(C) All of the mentioned
(D) None of the mentioned
Correct Answer: B
77. ________ is the step in which an input sentence is converted into a hierarchical structure
that corresponds to the units of meaning in the sentence.
(A) Syntactic Processing
(B) Semantic Processing
(C) Graph Processing
(D) All of the mentioned
Correct Answer: A
78. Which of the following statements about the Skip-gram model is correct?
(A) It predicts the center word from the surrounding context words
(B) The final word vector for a word is the average or sum of the input vector v
and output vector u corresponding to that word
(C) When it comes to a small corpus, it has better performance than GloVe
(D) It makes use of global co-occurrence statistics
Correct Answer: B
79. Which of the following statements is true of language models?
80 Suppose that you are training a neural network for classification, but you notice that
the training loss is much lower than the validation loss. Which of the following can be
used to address the issue (select all that apply)?
(A) Use a network with fewer layers
(B) Modality
(C) Model
(D) Identity
Correct Answer: B
87. What enables people to recognize people, animals and inanimate objects reliably?
(A) Speech
(B) Vision
(C) Hearing
(D) Perception
Correct Answer: B
88. Which provides a framework for studying object recognition?
(A) Learning
(B) Unsupervised learning
(C) Supervised learning
(D) None of the mentioned
Correct Answer: C
89. How can the distance between two shapes be defined?
(A) Weighted sum of the shape
(B) Size of the shape
(C) Shape context
(D) None of the mentioned
Correct Answer: A
94. Inductive learning is based on the knowledge that if something happens
a lot it is likely to be generally _________
(A) True
(B) False
Correct Answer: A
98. Suppose a convolutional neural network is trained on the ImageNet dataset (an object recognition
dataset). This trained model is then given a completely white image as input. The output
probabilities for this input would be equal for all classes. True or False?
(A) True
(B) False
Correct Answer: B
99. Which of the following gives non-linearity to a neural network?
(A) Stochastic Gradient Descent
(B) Rectified Linear Unit
(C) Convolution function
(D) None of the above
Correct Answer: B
101. When performing regression or classification, which of the following is the correct way to
preprocess the data?
(A) Normalize the data → PCA → training
(B) PCA → normalize PCA output → training
(C) Normalize the data → PCA → normalize PCA output → training
(D) None of the above
Correct Answer: A
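A minimal scikit-learn sketch of the normalize → PCA → training order from question 101 (the dataset, component count, and classifier are hypothetical placeholders):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    pipe = Pipeline([
        ("normalize", StandardScaler()),    # 1. normalize the data
        ("pca", PCA(n_components=10)),      # 2. project onto principal components
        ("train", LogisticRegression()),    # 3. train the model
    ])
    # pipe.fit(X_train, y_train)            # X_train, y_train are hypothetical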
102. In the determination of weights by learning, for noisy input vectors, what kind of learning should be
employed?
(A) Hebb learning law
(B) Widrow learning law
(C) Hoff learning law
(D) no learning law
Correct Answer: D
103. The number of output cases depends on what factor?
(A) number of inputs
(B) number of distinct classes
(C) total number of classes
(D) none of the mentioned
Correct Answer: B
104. Which architecture of neural network would be better suited to solve the
problem?
(A) End-to-End fully connected neural network
(B) Convolutional neural network followed by recurrent units
(C) Neural Turing Machine
(D) All of the above
Correct Answer: B
106. Perplexity is a commonly used evaluation metric when applying deep learning to
NLP tasks. Which of the following statements is correct?
(A) Higher the perplexity the better
(B) Lower the perplexity the better
Correct Answer: B
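Perplexity is the exponential of the average negative log-likelihood of the correct tokens, so a lower value is better; a minimal sketch with hypothetical probabilities:

    import math
    probs = [0.2, 0.5, 0.1, 0.4]      # model probabilities of the correct next words
    nll = -sum(math.log(p) for p in probs) / len(probs)
    print(math.exp(nll))              # perplexity; a perfect model scores 1.0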
107. What is generally the sequence followed when building a neural network architecture for
semantic segmentation of images?
(A) Convolutional network on input and deconvolutional network on output
(B) Deconvolutional network on input and convolutional network on output
Correct Answer: A
108. Deep learning can be applied to which of the following NLP tasks?
(A) Machine translation
(B) Sentiment analysis
(C) Question Answering system
(D) All of the above
Correct Answer: D
109. What is the technical difference between vanilla backpropagation algorithm and
backpropagation through time (BPTT) algorithm?
(A) Unlike backprop, in BPTT we sum up the gradients for the corresponding weight at each time step
(B) Unlike backprop, in BPTT we subtract the gradients for the corresponding weight at each time step
Correct Answer: A
Practice Questions
(The option in bold font is the answer)
110. Which of the following is FALSE about Deep Learning and Machine Learning
algorithms?
a. Deep Learning algorithms work efficiently on large amounts of data
b. Feature Extraction needs to be done manually in both ML and DL algorithms
c. Deep Learning algorithms are best suited for unstructured data
d. Deep Learning algorithms require high computational power
111. Which of the following is FALSE about Deep Learning and Machine Learning
algorithms?
a. Data augmentation can be done easily in Deep Learning as compared to Machine
Learning
b. Deep Learning algorithms efficiently solve computer vision problems
c. Deep Learning algorithms are more interpretable as compared to Machine
Learning algorithms
d. None of the above
112. Which of the following is TRUE about Perceptrons?
a. It is an artificial neuron and is the fundamental unit of a neural network
b. It is also called a single-layer neural network or a single-layer binary linear
classifier
c. A neural network is a composition of perceptrons connected in different ways
d. All of the above
113. Which of the following is FALSE about Perceptrons?
a. Perceptron can learn both linearly and non-linearly separable functions
b. It cannot implement XOR gate as it cannot be classified by a linear separator
c. It can easily implement OR and AND gates as these are linearly separable
d. None of the above
114. Which of the following is the structure of the input to an artificial neuron?
a. Weighted sum of inputs + bias
b. Sum of inputs + bias
c. Weighted sum of bias + inputs
d. Sum of bias + inputs
115. Which of the following is FALSE about Neural Networks?
a. During backward propagation, we update the weights using gradient descent
algorithm
b. We can use different activation functions in different layers
c. We can use different gradient descent algorithms in different epochs
d. None of the above
116. Which of the following is FALSE about Hidden Layers in Neural Networks?
a. Abstract representation of the training data is stored in the hidden layers
b. Feature extraction happens at the hidden layers
c. Increasing the number of hidden layers always leads to higher accuracy
d. Increasing the number of hidden layers above a certain point may lead to overfitting
117. Which of the following is FALSE about Deep Neural Networks?
a. These are computationally more complex as compared to shallow networks
c. It is non-linear in nature
d. None of the above
127. Which of the following is FALSE about sigmoid and tanh activation function?
a. Both are non-linear activation functions
b. Output of sigmoid ranges from -1 to 1 while output of tanh ranges from 0 to 1
c. Output of both sigmoid and tanh is smooth, continuous and differentiable
d. None of the above
128. Which of the following is FALSE about sigmoid and tanh activation function?
a. These cannot blow up the activations unlike ReLU
b. Both functions output a probability value instead of discrete values like 0 and 1
c. Sigmoid is zero centered as its values range from -1 to 1
d. None of the above
129. Which of the following is FALSE about sigmoid and tanh activation function?
a. These do not suffer from vanishing and exploding gradient problems unlike
ReLU
b. These involve computing the exponential (which may be expensive) unlike ReLU
c. These are non-linear in nature like ReLU
d. None of the above
130. Output of step (threshold) activation function ranges from:
a. Either 0 or 1
b. 0 to 1
c. -1 to 1
d. Either -1 or 1
131. Output of sigmoid activation function ranges from:
a. 0 to 1
b. -1 to 1
c. -1 to 0
d. 0 to 9
132. Output of tanh activation function ranges from:
a. 0 to 1
b. -1 to 1
c. -1 to 0
d. 0 to 9
133. Output of which of the following activation functions is zero centered?
a. Hyperbolic Tangent
b. Sigmoid
c. Softmax
d. ReLU
134. ReLU activation function outputs zero when:
a. Input is zero
b. Input is less than or equal to zero
c. Input is greater than or equal to zero
d. Input is zero or one
135. ReLU outputs same value as input when:
a. Input is greater than zero
b. Input is greater than or equal to zero
c. Input lies between zero and one
d. Input is zero or one
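A quick numeric check of the ranges asked about in questions 130–135, as a minimal sketch:

    import numpy as np
    z = np.linspace(-6, 6, 7)
    step = (z > 0).astype(float)      # outputs either 0 or 1
    sig = 1 / (1 + np.exp(-z))        # range (0, 1); sigmoid(0) = 0.5
    tanh = np.tanh(z)                 # range (-1, 1); zero-centered
    relu = np.maximum(0, z)           # 0 for z <= 0, z itself for z > 0
    print(step, sig, tanh, relu, sep="\n")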
b. Adagrad
c. AdaDelta
d. RMSprop
155. Which of the following SGD variants is based on adaptive learning?
a. Adam
b. Adagrad
c. AdaDelta
d. All of the above
156. Which of the following SGD variants is based on both momentum and adaptive
learning?
a. RMSprop
b. Adagrad
c. Adam
d. Nesterov
157. Which of the following SGD variants is based on both momentum and adaptive
learning?
a. RMSprop
b. Adagrad
c. AdaDelta
d. None of the above
158. Which of the following is FALSE about Adagrad, AdaDelta, RMSprop and Adam?
a. AdaDelta and RMSprop are an extension of Adagrad
b. RMSprop considers both momentum and adaptive learning
c. Adagrad mainly focuses on adaptive learning rate instead of momentum
d. None of the above
159. Which of the following is TRUE about Nesterov, RMSprop and Adam?
a. A Nesterov and RMSprop are an extension of Adam
b. B. Nesterov considers both momentum and adaptive learning
c. C. RMSprop mainly focuses on adaptive learning rate instead of momentum
a. All of the above
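Adam is the variant that combines a momentum-style first moment with an RMSprop-style adaptive second moment, which is what questions 156–159 probe. A minimal sketch of one Adam update (t is the 1-based step count; the hyperparameter values are the usual defaults):

    import numpy as np
    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad         # momentum: first-moment estimate
        v = b2 * v + (1 - b2) * grad**2      # adaptive: second-moment estimate
        m_hat = m / (1 - b1**t)              # bias correction for the warm-up steps
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v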
160. CNN is best suited for:
a. Image Classification
b. Natural Language Processing
c. Image Captioning
d. All of the above
161. Which of the following layers is NOT a part of CNN?
a. Convolutional Layer
b. Pooling Layer
c. Code Layer
d. Fully connected Layer
162. Which of the following terms is NOT associated with CNN?
a. Filters (Kernels)
b. Forget Gates
c. Zero and Valid Padding
d. Strides
163. Which of the following is FALSE about CNN?
a. We must flatten the output before feeding it to a fully connected layer
b. There can be only one fully connected layer in CNN
172. A filter of size 3X3 is slid over an input matrix of size 4X4 (stride = 1). What will be the
size of the output matrix after applying valid padding?
a. 4X4
b. 3X3
c. 2X2
d. 1X1
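Question 172 uses the same output-size formula as question 69: with valid padding (no zeros added), (4 - 3)/1 + 1 = 2, so the output matrix is 2 x 2:

    print((4 - 3) // 1 + 1)    # 2 -> a 2 x 2 output matrix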
173. Which of the following is FALSE about Pooling Layer in CNN?
a. Pooling layer must be added after each convolutional layer
b. Output of convolutional layer acts as an input to the pooling layer
c. It does down-sampling of an image which reduces dimensions by retaining vital
information
d. It does feature extraction and detects components of the image like edges, corners etc.
174. Which of the following is TRUE about Pooling Layer in CNN?
a. We can use Max, Min, Average or Sum pooling in CNN
b. It helps in retaining the most useful information and throwing away useless
information
c. It reduces resolution and dimension and hence reduces computational complexity
d. All of the above
175. Which of the following is FALSE about Pooling Layer in CNN?
a. It helps in reducing overfitting
b. It reduces computational complexity
c. It increases image resolution
d. None of the above
176. Which of the following is NOT a hyper-parameter in CNN?
a. Code size for compression
b. Number and size of kernels in a convolutional layer
c. Padding in a convolutional layer (zero or valid padding)
d. Number of convolutional layers
177. Which of the following model is best suited for sequential data?
a. Convolutional Neural Networks (ConvNets)
b. Capsule Neural Networks (CapsNets)
c. RNN (Recurrent Neural Network)
d. Autoencoders
178. Which of the following model contains internal memory?
a. Convolutional Neural Networks (ConvNets)
b. Capsule Neural Networks (CapsNets)
c. RNN (Recurrent Neural Network)
d. Autoencoders
179. Which of the following is FALSE about LSTM?
a. LSTM is an extension of RNN which extends its memory
b. LSTM enables RNN to learn long-term dependencies
c. LSTM solves the exploding gradients issue in RNN
d. None of the above
180. Which of the following is TRUE about LSTM?
a. It uses forget gates, input gates and output gates
b. These gates are analog in nature
c. It uses feedback loop to remember the data
209. Using too many epochs while training a network may lead to:
a. High training time
b. Overfitting
c. Unnecessary time wastage
d. All of the above
210. Using large size batches while training a network may lead to:
a. Lesser accuracy
b. Overfitting
c. High consumption of computational resources
d. All of the above
211. Model accuracy is not improving after so many epochs. There could be an issue with:
a. Network architecture
b. Training data
c. Fine-tuning of hyperparameters
d. Any of the above
212. Which of the following is FALSE about Autoencoders?
a. It is an unsupervised deep learning algorithm
b. It is like a data compression algorithm which performs dimensionality reduction
c. The greater the number of code layers, the greater the data compression
d. In autoencoders, output is nearly same as that of the input
213. Which of the following is FALSE about Autoencoders?
a. It can only compress and decompress the data similar to what it has been trained on
b. All autoencoders are generative in nature
c. Difference between the input and output is called reconstruction loss
d. Output is not exactly the same as the input; it is a close but degraded representation
214. Which of the following is FALSE about Autoencoders?
a. It compresses the input into a latent-space representation and then reconstructs the
output from it
b. Its objective is to minimize this reconstruction loss so that the output is similar to the
input
c. Autoencoders are best suited for image coloring, image captioning and image
recognition
d. Autoencoder possess generalization capabilities
215. Which of the following is FALSE about Encoders in Autoencoders?
a. It is the layer in which the model learns how to reduce the input dimensions
b. It compresses the input into a latent space representation
c. Number of nodes per layer increases with each subsequent layer in the encoder
d. None of the above
216. Which of the following is FALSE about Decoders in Autoencoders?
a. It is the layer in which the model learns how to reconstruct the data
b. The decoded image is a lossy reconstruction of the original image
c. Number of nodes per layer decreases with each subsequent layer in the decoder
d. None of the above
217. Which of the following is TRUE about Encoders and Decoders in Autoencoders?
a. Number of nodes per layer increases with each subsequent layer in the encoder
b. Number of nodes per layer decreases with each subsequent layer in the encoder
c. Number of nodes per layer decreases with each subsequent layer in encoder and
decoder
d. Number of nodes per layer increases with each subsequent layer in encoder and
decoder
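A minimal Keras sketch of the shape pattern from questions 215–217: node counts shrink through the encoder down to the code layer, then grow back through the decoder (the layer sizes are hypothetical, and TensorFlow/Keras is assumed to be available):

    from tensorflow import keras
    autoencoder = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # encoder: nodes decrease
        keras.layers.Dense(32, activation="relu"),       # code / bottleneck layer
        keras.layers.Dense(128, activation="relu"),      # decoder: nodes increase
        keras.layers.Dense(784, activation="sigmoid"),   # output: lossy reconstruction
    ])
    autoencoder.compile(optimizer="adam", loss="mse")    # reconstruction loss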
218. Which of the following is FALSE about Bottleneck / Code in Autoencoders?
a. Number of code layers is a hyperparameter which needs to be fine-tuned
b. It decides which aspects of the data are relevant and which aspects can be thrown
away
c. It is the layer that contains the compressed representation of the input data
d. This compressed representation is the lowest possible dimensions of the input data
219. Which of the following is FALSE about code layer in autoencoder?
a. Small code size results in more data compression
b. Small code size results in more generalization
c. Small code size results in more accuracy
d. None of the above
220. Which of the following is FALSE about Reconstruction Loss in Autoencoders?
a. This is the method that measures how well the decoder is performing
b. This is the method that measures how close the output is to the original input
c. For an anomaly, reconstruction loss is less as compared to the regular data
d. None of the above
221. Which of the following is TRUE about Undercomplete autoencoder?
a. We limit the number of nodes in the hidden layers to increase its
generalization capabilities
b. We use L1 and L2 regularization to increase its generalization capabilities
c. We add random noise to the input to increase its generalization capabilities
d. All of the above
222. Which of the following is TRUE about Sparse autoencoder?
a. We limit the number of nodes in the hidden layers to increase its generalization
capabilities
b. We use L1 and L2 regularization to increase its generalization capabilities
c. We add random noise to input to increase its generalization capabilities
d. All of the above
223. Which of the following is TRUE about Denoising autoencoder?
a. We limit the number of nodes in the hidden layers to increase its generalization
capabilities
b. We use L1 and L2 regularization to increase its generalization capabilities
c. We add random noise to input to increase its generalization capabilities
d. All of the above
224. Which of the following is NOT a way to increase generalization in autoencoder?
a. Use larger code size
b. Use L1 and L2 regularization
c. Add some random noise to the input
d. Limit the number of nodes in the hidden layers
225. Which of the following is NOT a hyper-parameter in Autoencoders?
a. Batch size and Code size
b. Number of code layers
c. Number of nodes per layer
d. Number of layers in encoder and decoder
226. Which of the following is NOT a hyper-parameter in Autoencoders?
a. Number and size of Kernels
252. Which of the following is FALSE about Xavier Weight Initialization with ReLU function?
a. We generate random weights from a Gaussian distribution (mean = 0, std dev = 1)
b. Multiply the above random weights by the square root of (1/n), where n is the number
of input units
c. Multiply the above random weights by the square root of (2/n), where n is the number of
input units
d. None of the above
253. Which of the following is FALSE about Xavier Weight Initialization with Sigmoid function?
a. We generate random weights from a Gaussian distribution (mean = 0, std dev = 1)
b. Multiply the above random weights by the square root of (1/n), where n is the number of
input units
c. Multiply the above random weights by the square root of (2/n), where n is the number
of input units
d. None of the above
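The contrast the two questions above draw can be sketched in a few lines (n is the number of input units; the sizes are illustrative):

    import numpy as np
    rng = np.random.default_rng(0)
    n = 256
    w_xavier = rng.normal(0, 1, n) * np.sqrt(1.0 / n)   # sigmoid/tanh: scale by sqrt(1/n)
    w_he = rng.normal(0, 1, n) * np.sqrt(2.0 / n)       # ReLU: scale by sqrt(2/n)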