
Question Bank

PCA
1.

2.

3.
4. If we have 3 data points (2,3), (7,1) and (5,3), what will be the covariance matrix?
5. If (3,2) is an eigenvector, then what will be the length of the eigenvector? Sqrt(3^2+2^2)
How can we make the above eigenvector orthonormal?
6. We have two dimensional 5 data points as X = [[2.5,2.4],[0.5,0.7],[2.2,2.9],[1.9,2.2],[3.1,3.0]]. Find the
mean subtracted data and also the covariance matrix.
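A minimal NumPy sketch (mine, not part of the question bank) for computing the mean-subtracted data and the covariance matrix asked for in question 6:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)          # mean-subtracted data
cov = np.cov(X_centered, rowvar=False)   # 2x2 covariance matrix (divides by n-1)
print(X_centered)
print(cov)
```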
7. Why are the PCA basis vectors the eigenvectors of the correlation matrix? Prove the same.
8. What computational trick is used in PCA for eigenface approach?
9. The variance-covariance matrix is a real symmetric matrix. True
10. Why is PCA not good for classification? Explain by plotting two different data distributions.
11. What is the projection matrix and how is it used for reconstruction?
12. Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8). Calculate the mean vector.
Find the mean subtracted data. Calculate the covariance matrix. Find the eigen values of covariance matrix.
Find the eigen vectors. Find the principal component. Draw the projection direction. Say a given feature
vector (2,1). Transform the feature vector using the first principal component. Find the reconstruction error.
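A hedged sketch (mine) that walks through the steps of question 12: mean vector, covariance, eigen-decomposition, projection onto the first principal component, and reconstruction error for the feature vector (2, 1):

```python
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                      # mean vector
Xc = X - mu                              # mean-subtracted data
C = np.cov(Xc, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order (C is symmetric)
pc1 = eigvecs[:, -1]                     # first principal component

x = np.array([2, 1], dtype=float)
w = (x - mu) @ pc1                       # weight (1-D projection)
x_rec = mu + w * pc1                     # reconstruction from the projection
err = np.linalg.norm(x - x_rec)          # reconstruction error
print(mu, pc1, w, err)
```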
13. Consider a vector (2,1) in a Cartesian coordinate system (x-y) . What are the basis vectors in this
coordinate system? Find the weights by projecting (2,1) vector on the basis vectors. Using weights and
basis vector show that the original (2,1) vector can be obtained.

SVM.
A data point with 5-dimension [27, 40, -15, 30, 38] obtains a score [18, 20, -5, -15, 19]. Find the hinge loss
incurred by second class (class-2) with a margin (Δ) of 5.

Hinge loss (Δ = 5, correct class = class-2 with score 20):
L = max(0, 18 − 20 + 5) + max(0, −5 − 20 + 5) + max(0, −15 − 20 + 5) + max(0, 19 − 20 + 5) = 3 + 0 + 0 + 4 = 7

What is the shape of the loss landscape during optimization of SVM?

In SVM the objective is to find the maximum margin based hyperplane (W) such that W^T x + b = 1 for class =
+1 else W^T x + b = -1.
For the max-margin condition to be satisfied we solve to minimize ||W||. The above optimization is a
quadratic optimization with a paraboloid landscape for the loss function.

How many local minima can be encountered while solving the optimization for maximizing the margin for
SVM?

In SVM the objective is to find the maximum margin-based hyperplane (W) such that W^T x + b = 1 for class =
+1 else W^T x + b = -1.
For the max-margin condition to be satisfied we solve to minimize ||W||. The above optimization is a
quadratic optimization with a paraboloid landscape for the loss function. Since the shape is paraboloid,
there can be only 1 global minimum.

Which of the following classifiers can be replaced by a linear SVM?


Logistic regression framework belongs to the genre of linear classifier which means the decision boundary
can segregate classes only if they are linearly separable. SVM is also capable of doing so and thus can be
used instead of logistic regression classifiers. Neural networks and decision trees are capable of modeling
non-linear decision boundaries which linear SVM cannot model directly.

Consider a 2-class [y= {-1, 1}] classification problem of 2 dimensional feature vectors. The support vectors
and the corresponding class label and lagrangian multipliers are provided. Find the value of SVM weight
matrix W? X1=(-1,1), y1=-1, α1=2 X2=(0,3), y2=1, α2=1 X3=(0,-1), y3=1, α3=1
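A quick check (mine) of W = Σ_i α_i y_i x_i for the given support vectors; it yields W = [2, 0]:

```python
import numpy as np

X = np.array([[-1, 1], [0, 3], [0, -1]], dtype=float)
y = np.array([-1, 1, 1], dtype=float)
alpha = np.array([2, 1, 1], dtype=float)
W = (alpha * y) @ X     # sum of alpha_i * y_i * x_i
print(W)                # [2. 0.]
```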

For a 2-class problem, what is the minimum possible number of support vectors? Assume there are more
than 4 examples from each class.

To determine the separating hyper-plane, we need at least 1 example (which becomes a support vector)
from each of the classes
Suppose we have one feature x ∈ R and binary class y. The dataset consists of 3 points: p1: (x1, y1) = (−1,
−1), p2: (x2, y2) = (1, 1), p3: (x3, y3) = (3, 1). Which of the following is true with respect to SVM?
a. Maximum margin will increase if we remove the point p2 from the training set.
b. Maximum margin will increase if we remove the point p3 from the training set.
c. Maximum margin will remain same if we remove the point p2 from the training set.
d. None of the above.

Here the point p2 is a support vector, if we remove the point p2 then maximum margin will increase.

If we employ SVM to realize two input logic gates, then which of the following will be true?
a. The weight vector for AND gate and OR gate will be same.
b. The margin for AND gate and OR gate will be same.
c. Both the margin and weight vector will be same for AND gate and OR gate.
d. None of the weight vector and margin will be same for AND gate and OR gate.

The values of Lagrange multipliers corresponding to the support vectors can be:
a. Less than zero b. Greater than zero c. Any real number d. Any non zero number.
True or False? In a real problem, you should check to see if the SVM is separable and then include slack variables if it is not separable.
False: you can just run the slack variable problem in either case (but you need to pick C)

True or False? Linear SVMs have no hyperparameters that need to be set by cross-validation.
False: you need to pick C

True or False? Adding slack variables is equivalent to requiring that all of the αi are less than a constant in the dual.

Why are SVMs fast?
• Quadratic optimization (convex!)
• They work in the dual, with relatively few points
• The kernel trick

Why are SVMs often more accurate than logistic regression?
• SVMs use kernels – but regression can, too
• SVMs assume less about the model form
• Logistic regression uses all the data points, assuming a probabilistic model, while SVMs ignore the points that are clearly correct, and give less weight to ones that are wrong
Find the distance from a support vector to the hyperplane.

Consider XOR data, which is nonlinearly separable. Explain whether, if we use a Gaussian kernel to transform the data,
it will become linearly separable.

What is the kernel trick?
What is hinge loss? How does it compare to log loss? (Log loss: −y log ŷ − (1 − y) log(1 − ŷ).)
How can SGD be applied to SVM considering hinge loss?

Logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers.

Logistic loss does not go to zero even if the point is classified sufficiently confidently. This might lead to
minor degradation in accuracy.
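A minimal sketch (mine, not the course's reference solution) of how SGD can be applied to an SVM with hinge loss: the subgradient update only involves the data term when the margin is violated.

```python
import numpy as np

def sgd_hinge(X, y, lr=0.01, lam=0.01, epochs=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                        # hinge loss is active for this sample
                w -= lr * (lam * w - y[i] * X[i])
                b += lr * y[i]
            else:                                 # only the regulariser contributes
                w -= lr * lam * w
    return w, b
```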

In the XOR problem, for example, adding a new feature x1·x2 makes the problem linearly separable.

[Hand-drawn worked example from the source: non-linearly separable data with a nonlinear decision boundary is transformed to a higher-dimensional space, where a linear decision boundary separates the classes.]
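A small sketch (assumptions mine) showing the point of the example above: XOR-style data is not linearly separable in (x1, x2) but becomes separable once the product feature x1*x2 is appended.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                # XOR-style labels
X3 = np.c_[X, X[:, 0] * X[:, 1]]            # append the product feature x1*x2
# In the 3-D feature space, w = (0, 0, -1), b = 0 separates the classes:
w, b = np.array([0.0, 0.0, -1.0]), 0.0
print(np.sign(X3 @ w + b))                  # matches y: [-1, 1, 1, -1]
```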
Decision tree
For XOR data, describe a binary tree with minimum depth.

For the given dataset, X = (-1,-1), (-1,1), (1,-1), (1,1) and y = (-1,-1,-1,1), describe a binary tree with minimum
depth.
Can we get a nonlinear decision boundary with a decision tree? How are axis-aligned splits used in a decision tree? What
is pruning? What are stumps?
The ID3 algorithm is guaranteed to find an optimal decision tree. False
One advantage of decision trees is that they are not easy to overfit. False

15 points

Suppose you are given six training points (listed in Table 1) for a classification problem with two binary
attributes X1, X2, and three classes Y ∈ {1, 2, 3}. We will use a decision tree learner based on information
gain.
[Table 1 from the source (garbled in the scan): six training points given in the form (X1, X2, Y).]
[Points: 12 pts] Calculate the information gain for both X1 and X2. You can use the approximation
log2 3 ≈ 19/12. Report information gains as fractions or as decimals with the precision of three decimal
digits. Show your work and circle your final answers for IG(X1) and IG(X2).
[Points: 4 pts] Report which attribute is used for the first split. Draw the decision tree resulting
from using this split alone. Make sure to label the split attribute, which branch is which, and what
the predicted label is in each leaf. How would this tree classify an example with X1 = 0 and X2 = 1?

What is the entropy of a group with 50% in either class?
– entropy = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
A good training set for learning.

What is the entropy of a group in which all examples belong to the same class?
– entropy = −1 log2(1) = 0
Not a good training set for learning.
Say we have a total of 30 instances: 16 parents and 14 children. We split in such a way that one half has
17 instances and the other half has 13 instances. Out of the 17 instances, 4 are parents and 13 are children,
and out of the 13 instances, 1 is a child and 12 are parents. Find the information gain for that split.
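A hedged worked sketch (my reconstruction of the scanned numbers: 16 parents / 14 children split into {4 parents, 13 children} and {12 parents, 1 child}) for computing the entropies and the information gain:

```python
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_parent = entropy([16, 14])                        # entropy before the split
H_left, H_right = entropy([4, 13]), entropy([12, 1])
IG = H_parent - (17 / 30) * H_left - (13 / 30) * H_right
print(H_parent, H_left, H_right, IG)
```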

Say container A contains 5 red balls and 35 blue balls. Container B has 20 red balls and 20 blue balls. Which
container has higher entropy and why?

How are decision trees used for classification? How is a stopping condition used in a decision tree?

What is entropy? How is this related to information gain? How is information gain used in a decision tree?
What is Gini index? Calculate the Gini index for a simple data set where 9 tuples are in class yes and 5 in
class no. 1-((9/14)^2+(5/14)^2)

Say we have a training set with 3 features and 2 classes, given as (x1, x2, x3, Y):
(1,1,1,1), (1,1,0,1), (0,0,1,2), (1,0,0,2)
How would you distinguish class 1 from class 2? Split on attributes x1, x2, x3 separately. Report
which split is the best split and which one is the worst.
Boosting
What is voting power?

What is a weak classifier?

We have 4 weak classifiers h1, h2, h3 and h4. How can we make a strong classifier from these?

How is voting power related to the error rate of each weak classifier?

What are stumps in a decision tree?

15 points

Say we have 5 data points A,B,C,D,E. and we are allowed to use 6 weak classifiers. We use the decision
tree stumps.
The datapoints are at (1,5,A),(5,5,B),(3,3,C),(1,1,D),(5,1,E) in the form (x_1,x_2, data point)

1. Draw the data points in x_1,x_2 coordinate


2. Find out the misclassification when x1>2, x1<2, x1>4, x1<4, x1>6, x1<6
3. Initialise the weights for each sample point, giving equal weight.
4. According to the misclassifications found above, find the error rate for each weak classifier
5. Find the best weak classifier
6. Append the best weak classifier to the strong classifier
7. Is your strong classifier good enough? If yes, why? If not, suggest the next procedure.
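A rough AdaBoost-style sketch (mine; class labels for A-E are not given in the source, so placeholder labels and stump predictions are used) showing the mechanics asked for above: equal initial weights, the weighted error rate of a stump, its voting power, and the re-weighting for the next round.

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1])            # hypothetical labels for A, B, C, D, E
pred = np.array([1, -1, -1, 1, -1])        # hypothetical predictions of one stump
w = np.full(5, 1 / 5)                      # step 3: equal initial weights
eps = w[pred != y].sum()                   # weighted error rate of the stump
alpha = 0.5 * np.log((1 - eps) / eps)      # voting power of the weak classifier
w = w * np.exp(-alpha * y * pred)          # re-weight samples for the next round
w = w / w.sum()
print(eps, alpha, w)
```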

Random forest combines results at the end of the process. True or False? Gradient boosting combines results
at the end of the process. True or False? False, along the way.
Is it possible to parallelise training of a gradient boosting model? Yes.
Boosting trees are built sequentially while random forest builds trees in parallel. True or False? True

15 points

Say we have 5 data points A,B,C,D,E. and we are allowed to use 6 weak classifiers. We use the decision
tree stumps.
The datapoints are at (1,5,A),(5,5,B),(3,3,C),(1,1,D),(5,1,E) in the form (x_1,x_2, data point)

1. Find out the misclassification when x1>2, x1<2, x1>4, x1<4, x1>6, x1<6
2. Update the weights for each data point up to round 3
3. In each round calculate the error rate
4. In each round calculate the strong classifier.
5. Explain why we can stop after round 3

[2 pts] Why do we want to use “weak” learners when boosting?


Solution: To prevent overfitting, since the complexity of the overall learner increases at
each step. Starting with weak learners implies the final classifier will be less likely to overfit.

What are the parameters used in gradient boosting? Learning rate, number of trees, min samples split, max
depth, subsample, min samples leaf

Say we have a data set (1,1,+),(1,4,+),(2,2,-),(4,1,+),(4,4,+)


Draw the data points with axis labels and data labels.
How many iterations are needed to get the training error to zero?
Ans: 3 iterations. x1<5 all +, x1<2 all +, x1>3 all +

Clustering
15 points
Suppose you are given the following <x,y> pairs. You will simulate the k-means algorithm to identify TWO
clusters in the data.

<x,y> = {(1.90,0.97),(1.76,0.84),(2.32,1.63),(2.31,2.09),(1.14,2.11),(5.02,3.02),(5.74,3.84),(2.25,3.47),
(4.71,3.60),(3.17,4.96)}
Suppose you are given the following <x,y> pairs. You will simulate the k-means algorithm and Gaussian
Mixture Models learning algorithm to identify TWO clusters in the data.

1. Simulate K-Means Clustering (k=2) for one iteration. What are the cluster assignments after one iteration?
2. What are the cluster assignments until convergence?
3.
Assume K means uses Euclidean distance

Give one advantage of hierarchical clustering over K-means clustering, and one advantage of K-means
clustering over hierarchical clustering.
Answer: Many possibilities.
Some advantages of hierarchical clustering:
1. Don’t need to know how many clusters you’re after
2. Can cut hierarchy at any level to get any number of clusters
3. Easy to interpret hierarchy for particular applications
4. Can deal with long stringy data
Some advantages of K-means clustering:
1. Can be much faster than hierarchical clustering, depending on data
2. Nice theoretical framework
3. Can incorporate new data and reform clusters easily
15 points

Say we are given 7 data points:

Data1=(3,8), Data2=(2.5,7), Data3=(2.1,7.5), Data4=(4,8), Data5=(8,3), Data6=(7,0.5), Data7=(3,1)
1. Simulate the K-Means algorithm until convergence considering Data5 and Data7 as the initial centroids.
2. If Data1 and Data6 are taken as initial centroids, how many steps do we need for convergence?
3. If Data2 and Data4 are taken as initial centroids, does it take more iterations to converge compared to
Data5 and Data7 as the initial guess for centroids?
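A short Lloyd's-algorithm sketch (mine) for simulating k-means on the 7 points with Data5 and Data7 as the initial centroids; changing the initial indices answers parts 2 and 3.

```python
import numpy as np

X = np.array([[3, 8], [2.5, 7], [2.1, 7.5], [4, 8], [8, 3], [7, 0.5], [3, 1]])
centroids = X[[4, 6]].astype(float)              # Data5 and Data7 as initial centroids

for step in range(100):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                    # assign each point to its nearest centroid
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):    # converged: assignments stopped changing
        break
    centroids = new_centroids
print(step, labels, centroids)
```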

What is the elbow method? Where is it applied?

Explain briefly Lloyd's algorithm.

Briefly explain why Lloyd’s algorithm is always guaranteed to converge (i.e. stop)
in a finite number of steps.
Answer: The cluster assignments L can take finitely many values (K^n, to be precise). The
cluster centers C are uniquely determined by the assignments L, so after executing step ii the
algorithm can be in finitely many possible states. Thus either the algorithm stops in finitely
many steps, or at least one value of L is repeated more than once in non-consecutive iterations.
However, the latter case is not possible, since after every iteration we have J(C(t), L(t)) ≥ J(C(t+1), L(t+1)),
and the objective strictly decreases whenever the assignment changes, so an assignment cannot recur in non-consecutive iterations.

What is agglomerative clustering?

Agglomerative hierarchical clustering is a family of hierarchical clustering algorithms that, equipped with a
notion of distance between clusters, form a binary tree with leaves for each original data point as follows:
i. Initialize by placing each data point in its own cluster (i.e. singleton trees).
ii. Find the two closest clusters, join them in a single cluster (by creating a new node and making
it the parent of the roots of those two clusters).
iii. If there is more than one cluster (tree) left, repeat from step ii. iv. Return the final tree.

What are the common distance metrics used in agglomerative clustering?


M M
L M 0

PCA more question


Consider the following data points: x1=[1,1]^T, x2=[0,2]^T, x3=[2,0]^T. Find the maximum eigenvalue of the
covariance matrix. Find the eigenvector corresponding to the maximum eigenvalue. Suppose the data points x1, x2
and x3 are projected onto this eigenvector. After projection, what will be the new coordinate of x2?
Debug,Select features, Unit 7


More training data helps with overfitting. True or False? True
Does increasing or decreasing the number of training examples help improve the bias? True or False?
A high-dimensional feature space leads to overfitting. True or False?
1% training error and 2% validation error is a case of low bias and low variance. True or False?
1% training error and 15% validation error leads to a low variance and high bias problem. True or False?
15% training error and 15% validation error leads to a high bias and low variance problem. True or False?

5 points

Our system is suffering from an underfitting problem and we want to increase or decrease the number of
features. Does it help? Draw some data points and show how the underfitting problem occurs.
Using more features helps with underfitting: as the number of features increases, the number of parameters increases,
hence reducing bias, i.e. underfitting.

Say we are solving a regression problem for data related as y = x1^3 + 4x2^2 + x3. We have
developed a model h = 3x1 + 4.
When we plot the hypothesis with the data we will see large training and test error (underfitting). Why is it so? How can
we manage the problem? What happens if we use a model h = x1^4 + x2^3 + 2x1x2 + 6x1^2x2^2 + 4x3^3 (overfitting problem)?

15 points
In a project we have a set of data from patients who have visited ABCD hospital during the year 2022. A set
of features (e.g., bmr, weight) have been also extracted for each patient. The goal of the project is to decide
whether a new visiting patient has any of diabetes, heart disease, or neural disease (a patient can have one
or more of these diseases).
(a) [5 points] We have decided to use a neural network to solve this problem. We have two choices: either to
train a separate neural network for each of the diseases or to train a single neural network with one output
neuron for each disease, but with a shared hidden layer. Which method do you prefer? Justify your answer.

Solution: Siamese Neural Network face Recognition


1- Neural network with a shared hidden layer can capture dependencies between diseases. It can be shown
that in some cases, when there is a dependency between the output nodes, having a shared node in the
hidden layer can improve the accuracy.
2- If there is no dependency between diseases (output neurons), then we would prefer to have a separate
neural network for each disease.
5 points
In our project we first ask our model to predict whether a patient has a disease, and if the classifier is 80%
confident that the patient has a disease, then we will collect additional patient features like temperatures
etc. In this case, which classification methods do you recommend: neural networks, decision tree, or naive
Bayes? Justify your answer.

We expect students to explain how each of these learning techniques can be used to output a confidence
value (any of these techniques can be modified to provide a confidence value). In addition, Naive Bayes is
preferable to other cases since we can still use it for classification when the value of some of the features
are unknown.
We gave partial credit to those who mentioned neural networks because of their non-linear decision
boundary, or decision trees since they give us an interpretable answer.

Unit 3 Practical Aspects of Implementation


A student team reports a low training error and claims their method is good. Is it OK or
problematic? If OK explain why, else justify your answer.

⋆ SOLUTION: Problematic because training error is an optimistic estimator of test error. Low training error
does not tell much about the generalization performance of the model. To prove that a method is good they
should report their error on independent test data.

The same student team claimed great success after achieving 97 percent classification accuracy on a
binary classification task where one class is very rare (e.g., detecting fraud transactions). Their data
consisted of 40 positive examples and 4000 negative examples. Is it OK or problematic? Justify your
answer

Think of classifier which predicts everything as the majority class. The accuracy of that classifier will be
99%. Therefore 97% accuracy is not an impressive result on such an unbalanced problem.

One student group performed a feature selection procedure on the full data and reduced their large feature
set to a smaller set. Then they split the data into test and training portions. They built their model on training
data using several different model settings, and report the best test error they achieved. Is it OK?

Problematic because:
(a) Using the full data for feature selection will leak information from the test examples into the model.
The feature selection should be done exclusively using training and validation data not on test data.
(b) The best parameter setting should not be chosen based on the test error; this has the danger of
overfitting to the test data. They should have used validation data and use the test data only in the final
evaluation step.

1 point

Reducing the number of leaves in decision tree will lower the bias, True or False?
false, increases bias

Reducing the number of leaves in decision tree will increase the variance. True or False?
False, lower the variance

Increasing the number of k in k nearest neighbour increases bias. True or False?


True
Increasing the number of k in k nearest neighbour increases variance. True or False? False, it decreases the
variance: as smoothing increases, the effect of noise reduces, so the variance decreases.
In logistic regression we increase the number of training examples to decrease the bias. Will it work? Yes or
no? No, the number of training examples has no effect on bias.

Averaging in random forest can reduce overfitting. Yes


5 point
Eigenface approach is one practical application of PCA. Say we have 500 face images having dimension
64x64. How could you construct the covariance matrix to find the principal components?
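A hedged sketch (mine) of the usual eigenface trick: with 500 images of 64x64 = 4096 pixels, diagonalise the 500x500 matrix X X^T instead of the 4096x4096 covariance matrix, then map its eigenvectors back.

```python
import numpy as np

X = np.random.rand(500, 64 * 64)          # placeholder data: 500 flattened face images
Xc = X - X.mean(axis=0)                   # subtract the mean (average face)

S_small = Xc @ Xc.T / (Xc.shape[0] - 1)   # 500x500 instead of 4096x4096
vals, V = np.linalg.eigh(S_small)
U = Xc.T @ V                              # map back: eigenvectors of the full covariance
U = U / np.linalg.norm(U, axis=0)         # normalise the eigenfaces
print(U.shape)                            # (4096, 500)
```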

Without normalising the data we continue to build the linear regression model, and after some epochs
gradient descent converges. What will happen when we predict with this model?

What is Xavier initialisation? What are the other forms of initialisation?

Is there any problem of zero initialisation of weights in logistic regression? If not why? If yes justify your
answer. No problem

15 points

1. Say we are developing a logistic regression model for a dataset of 100 points with 4 dimensions. How
can we split the training and test data? [2]
2. If the dimension of the training data is 100x4, then what will be the dimension of the mean of the data?
3. Is it important to standardise the data? How can you standardise it?
4. Which activation function do we use at the output and why?
5. What type of cost function do we use as the objective for optimisation? Can we use the MSE cost function? If yes,
why? If not, justify.
6. How many weight parameters do you use to build the model?
7. What is the proper way of updating the weights in gradient descent? Write it down for each weight in your
model.

Ans: AIML-Lab5th sem 1A_Logistic_Regression.ipynb
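A compact sketch (mine, loosely mirroring what the referenced lab notebook would do): sigmoid output, binary cross-entropy gradients, and the gradient-descent weight updates.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)
        grad_w = X.T @ (y_hat - y) / n     # gradient of binary cross-entropy w.r.t. weights
        grad_b = (y_hat - y).mean()        # gradient w.r.t. bias
        w -= lr * grad_w                   # gradient descent updates
        b -= lr * grad_b
    return w, b
```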

1. Say you have a dataset of MRI images of knee tissue. You want to build a classifier for predicting the
probability of damaged tissue. Between a decision tree and logistic regression, which one would you use? And
why? Ans - Decision trees only provide a label estimate, whereas logistic regression provides the
probability of a label (patient has cancer) for a given input (cellular image).
2. Say we have 950 MRI scans (images) containing no damaged ligaments and 50 MRI scan images of
damaged ligaments. You train a classifier and observe 85% accuracy. Is the result OK? Justify. What
should be the minimum accuracy your classifier should attain?
Suppose the dataset in the previous question had 900 cancer-free images and 100 images from cancer
patients. If I train a classifier which achieves 85% accuracy on this dataset, is it a good classifier?
⋆ SOLUTION: FALSE. This is not a good accuracy on this dataset, since a classifier that outputs "cancer-
free" for all input images will have better accuracy (90%).

4. Are non-parametric models more efficient than parametric models in terms of model storage.
SOLUTION: False. Non-parametric models either need to look at the entire dataset to predict the label of
test points or require the number of parameters to scale with the dataset size,
hence require more storage.
5. Suppose you want to predict the blood sugar level of a person from his/her retinal images using
regression, but you only have 10 subjects and each subject is represented by the retinal activity at
20,000 regions. Which regression would you prefer to use? Least squares regression or
ridge regression? And why?

When the number of datapoints (subjects) is less than number of features, the least squares solution needs
to be regularized to prevent overfitting, hence we prefer ridge regression.

6. You want to predict the chance that Brazil football team will win the next WC at America 2026, which
one you should prefer and why? logistic regression or of decision trees.
Logistic regression will characterize the probability (chance) of the label being win or loss, whereas a decision
tree will simply output the decision (win or loss).

7. What is the K-means cost function? Does the k-means algorithm find the global optimum of this cost function?

The k-means cost function is non-convex and the algorithm is only guaranteed
to converge to a local optimum.

True positive = 300, True Negative = 20, False positive = 30, False Negative = 10.
Find accuracy, precision, recall, …. Which ML technique is used for detecting outliers? Anomaly detection.
How do we handle missing data? Why is recall used for driver drowsiness detection?
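A quick worked computation (mine) for the confusion-matrix numbers above:

```python
TP, TN, FP, FN = 300, 20, 30, 10
accuracy = (TP + TN) / (TP + TN + FP + FN)    # 320/360 ≈ 0.889
precision = TP / (TP + FP)                    # 300/330 ≈ 0.909
recall = TP / (TP + FN)                       # 300/310 ≈ 0.968
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```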

Consider the following design matrix. X=[[4,1],[2,3],[5,4],[1,0]], 4 2D data points. We want to reduce the
dimension from 2 to 1 and represent the data. Show the detail process with numerical steps.

[.707 -0.707]^T

Consider 4 datapoints as (1,0), (4,1), (2,3) and (5,4). We want a one dimensional representation of the data. Draw
the principal component with its direction. Show the projections of all four sample points onto the principal
direction.

Principal coordinates are 1/sqrt(2), 5/sqrt(2), and 9/sqrt(2)

What is the disadvantage of K-means clustering? How can this be overcome by k-means++?

Logistic regression is used as a classification model and provides the output as a 0/1 label. True or false?
Deep Neural Network

Unit-1 see below


5 marks :
1. What is the main benefit of stacking multiple layers of neuron with non-linear activation functions over a
single layer perceptron? A single layer perceptron is capable of classifying only linearly separable
classes. Stacking multiple layers of neurons helps in creating non-linear decision boundaries and thus
can be used for classifying examples belonging to classes which are NOT linearly separable.
Unit-3 Loss, activation, backprop, optimization
5 points

1. You want to solve a classification task. You first train your network on 20 samples. Training converges,
but the training loss is very high. You then decide to train this network on 10,000 examples. Is your
approach to fixing the problem correct? If yes, explain the most likely results of training with 10,000
examples. If not, give a solution to this problem.

Solution: The model is suffering from a bias problem.


Increasing the amount of data reduces the variance, and is not likely to solve the problem.
A better approach would be to decrease the bias of the model by maybe adding more layers/ learnable
parameters. It is possible that training converged to a local optimum. Training longer/using a better
optimizer/ restarting from a different initialization could also work.

2. Which regularization method leads to weight sparsity? Explain why.
3. Suppose a neural network has 3 input nodes, x, y, z. There are 2 neurons, Q and F. Q = 4x + y and F =
Q * z^2. What is the gradient of F with respect to x, y and z? Assume (x, y, z) = (-2, 5, -4).
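A worked chain-rule check (mine) for question 3, taking F = Q · z^2 with Q = 4x + y as stated:

```python
x, y, z = -2, 5, -4
Q = 4 * x + y            # -3
F = Q * z**2             # -48
dF_dQ = z**2             # 16
dF_dx = dF_dQ * 4        # 64
dF_dy = dF_dQ * 1        # 16
dF_dz = Q * 2 * z        # 24
print(F, dF_dx, dF_dy, dF_dz)
```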

4. What will happen if the following activation functions are used in hidden layers in NN? (A) periodic
function (b) Unbounded function (c) Monotonic function. Give example of unbounded and monotonic
function
5. What is the gradient descent update rule for the lth layer? How is the update rule modified in stochastic gradient
descent? If the mini batch size is B, write the update rule for mini batch gradient descent.

15 points

Q1. Why does the ReLU activation function lead to sparse activation maps? (Negative values are mapped to 0.)

We have a problem set with 4 class classification. Outputs are one hot encoded. Say for the 3rd class we
have the softmax output =[0.3,0.02,0.6,0.08]. What will be the cross entropy loss?
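A quick numerical check (mine): with the one-hot target for the 3rd class, only the 0.6 entry contributes, so the loss is −ln(0.6) ≈ 0.511.

```python
import numpy as np

y = np.array([0, 0, 1, 0])
p = np.array([0.3, 0.02, 0.6, 0.08])
loss = -np.sum(y * np.log(p))
print(loss)     # ≈ 0.511 (nats); use log2 instead if the loss is defined in bits
```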

A given cost function J(w) = 2w^2-4w+2. What is the weight update rule for gradient descent optimization
at step t+1. Consider learning rate = 0.01

dJ/dw = 4w − 4, so the update is w(t+1) = w(t) − 0.01 (4 w(t) − 4).

What is the range of tanh activation function? (-1,1)

What is vanishing gradient? Which activation function(s) lead(s) to the same? Explain with example.
When is gradient descent algorithm certain to find global minima?

(Answers: vanishing gradients arise in deep networks when sigmoid activations saturate, so their derivative goes to zero; gradient descent is certain to find the global minimum when the loss function is convex.)

Q2. Say z = [-1,0,3,5] be the input of ith layer of a neural network. On this we apply softmax activation .
What will be the output ?
Show detail calculation.
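A quick numerical check (mine) for Q2: softmax of z = [-1, 0, 3, 5].

```python
import numpy as np

z = np.array([-1, 0, 3, 5], dtype=float)
s = np.exp(z - z.max())          # subtract the max for numerical stability
softmax = s / s.sum()
print(softmax)                   # ≈ [0.0022, 0.0059, 0.1182, 0.8737]
```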

Consider the following neural network shown in the figure with inputs x1, x2 and output Y. The inputs take
the values x1, x2 ∈ {0,1}. Which logical operation is performed by the network?


What will happen if we replace the activation function of the previous question to sigmoid?

What does back propagation actually compute? c. The negative gradient of the error w.r.t. the weights
Q3. Explain briefly what will happen if the learning rate is equal to 0.

No update: with η = 0 the rule w ← w − η ∇w J leaves w unchanged.
If g(x) represents the sigmoid function, find its first derivative. Now show at what point the gradient of g(x) is
maximum.
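A small numerical sketch (mine) confirming that the sigmoid derivative g'(x) = g(x)(1 − g(x)) peaks at x = 0 with value 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.linspace(-6, 6, 1001)
grads = sigmoid(xs) * (1 - sigmoid(xs))
print(xs[grads.argmax()], grads.max())   # ≈ 0.0 and 0.25
```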

If J = loss function, y = output, d = desired output, a = nonlinearity, f = network then how the derivative of
the loss function with respect to the weights in a deep neural network is computed?

We are training a neural network using a normal gradient descent algorithm. We observe
that the change in weights is small in successive iterations. What are the possible causes for the
following phenomenon? Possibility 1: learning rate large, possibility 2 : learning rate small, possibility 3:
weight change small, possibility 4 : weight change large. Justify your answer.

A small update signifies that the quantity η∇w is small. This can happen if ∇w or η is
small.

Given z = Wx+b and a = g(z) , If z, a has dimension 3x1, and x has dimension nx1. What will be the
dimension of b and W?

For the above cost function if we start gradient descent from point C what will happen? What will happen if
we start from A?

Q4. Why stochastic gradient descent will result more oscillations?

Since in stochastic gradient descent we update weights based on one training example it is more prone
to changes and hence more oscillations.

What is mini batch gradient descent? What is the advantage of mini batch GD over SGD and BGD?

What are the advantages of stochastic gradient descent instead of vanilla gradient descent?

: SGD updates weight more frequently hence it converges fast. Since it is computationally faster
than vanilla gradient descent, it works well for large datasets.

Why does batch norm make training faster?


You have a single hidden-layer neural network for a binary classification task. The input is X ∈ Rn×m, output
yˆ ∈ R1×m and true label y ∈ R1×m.
What are the forward propagation equations? What will be the expression of loss function?
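A minimal sketch (mine, following the stated shapes) of the forward-propagation equations and a binary cross-entropy loss; the hidden size and tanh non-linearity are assumptions.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1                   # (h x m)
    A1 = np.tanh(Z1)                   # hidden activation (any non-linearity)
    Z2 = W2 @ A1 + b2                  # (1 x m)
    y_hat = 1 / (1 + np.exp(-Z2))      # sigmoid output for binary classification
    return y_hat

def bce_loss(y_hat, y):
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
```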

What is sparse categorical cross entropy loss function? When it is used?

Q5. For fitting a model to the given data we have constructed the hypothesis h = x^w. If the actual target is
y, then find the loss function in the MSE sense. Also develop the update rule for the weight using gradient descent
optimization.

h = x^w, J = (x^w − y)^2
log h = w log x ⇒ ∂h/∂w = h log x = x^w log x
∂J/∂w = 2 (x^w − y) ∂(x^w − y)/∂w = 2 (x^w − y) x^w log x
Update rule: w ← w − α ∂J/∂w, i.e. over m samples
w ← w − (α/m) Σ_{i=1}^{m} 2 (x_i^w − y_i) x_i^w log x_i
Write one advantage of mini batch normalisation. – Eliminates covariate shift between batches
How do we choose the activation function in the case of (a) binary classification, (b) regression like
temperature prediction, (c) regression like house price prediction? Justify your answer. (Answers: (a) sigmoid, (b) linear, (c) ReLU.)
Is there any problem with the flatness of the activation function in the hidden layer? If yes, what problem will
training face; if no, justify. (Saturation: the derivative tends to zero, leading to vanishing gradients.)
Why do we actually need activation functions? If we replace all activation functions in the hidden layers with
linear functions, what happens? Show your result mathematically. (The network collapses to linear regression.)
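A tiny numerical check (mine) that stacked linear layers collapse to a single linear map, which is why the network reduces to linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

two_layer = W2 @ (W1 @ x + b1) + b2                 # two stacked linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)          # one equivalent linear layer
print(np.allclose(two_layer, one_layer))            # True
```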
Consider all weights w's are scalar, and we have the following neural architecture. O is the output.

X ——(w1)—-> f1 —(w2)—->f2 —(w3)—-> f3 ——(w4)—>f4 ———>O

Write the expression for the output. What will be the expression for delO/delw1 and delO/delw2

O = f4(w4 · f3(w3 · f2(w2 · f1(w1 · X))))
∂O/∂w1 = f4' · w4 · f3' · w3 · f2' · w2 · f1' · X
∂O/∂w2 = f4' · w4 · f3' · w3 · f2' · f1(w1 · X)
Unit-4 Entropy, HMM, Conditional Random Field, Markov

1 marks

1. We are given that the probability of Event A happening is 0.85 and the probability of Event B happening
is 0.05. Event A has a low information content And B has high information content. True or false?
Solution: Events with high probability have low information content, while events with low probability have high
information content.

5 marks
The probability of all the events x1, x2, x2....xn in a system is equal(n > 1). What can you say about the
entropy H(X) of that system?(base of log is 2)

H(X) = − Σ p(x_i) log2 p(x_i) = − Σ (1/n) log2(1/n) = log2 n
For n > 1 the uniform distribution maximises the entropy, so H(X) = log2 n.
Unit5: CNN, Regularization, Dropout, Recurrent, DBN
15 points.

Easy:

(A) You come up with a CNN classifier. For each layer, calculate the number of weights, number of biases
and the size of the associated feature maps.
The notation follows the convention:
• CONV-K-N denotes a convolutional layer with N filters, each of them of size K×K. Padding and stride
parameters are always 0 and 1 respectively.
• POOL-K indicates a K × K pooling layer with stride K and padding 0.
• FC-N stands for a fully-connected layer with N neurons.

Layer Find (1) Activation map dimensions (2) Number of weights (3) Number of biases
The first layer is given.
Layer-1: INPUT ,(1) activation map dimension is 128x128x3, (2) number of weights = 0, (3) number of
biases = 0 , Complete the rest
Layer-2 :CONV-9-32, Layer-3:POOL-2, Layer-4:CONV-5-64 , Layer-6: POOL-2 , Layer-7:CONV-5-64,
Layer-8:POOL-2, Layer-9: FC-3. 9points
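A hedged helper (mine, not part of the question) that computes the activation-map size and parameter count of a CONV-K-N layer with padding 0 and stride 1, shown here for Layer-2 (CONV-9-32):

```python
def conv_layer(h, w, c_in, k, n_filters, pad=0, stride=1):
    h_out = (h - k + 2 * pad) // stride + 1
    w_out = (w - k + 2 * pad) // stride + 1
    weights = k * k * c_in * n_filters
    biases = n_filters
    return (h_out, w_out, n_filters), weights, biases

print(conv_layer(128, 128, 3, k=9, n_filters=32))   # ((120, 120, 32), 7776, 32)
```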

(B) Why is it important to place non-linearities between the layers of neural networks?
Solution: Non-linearity introduces more degrees of freedom to the model. It lets it capture more complex
representations which can be used towards the task at hand. A deep neural network without non-linearities
is essentially a linear regression. Pt 3

(C) Following the last FC-3 layer of your network in Q(A), what activation must be applied? Given a vector
a = [0.3, 0.3, 0.3], what is the result of using your activation on this vector?
Softmax; [0.33, 0.33, 0.33]. Pt 3
Easy

Give three benefits of using convolutional layers instead of fully connected ones for visual tasks.
Solution:
• Uses spatial context (by only assigning weights to nearby pixels) • Translation invariance
• Have lot less parameters, since CNN’s share weights

In VGG-16 Net, the input is RGB with width and height 224 , 224. We are going to classify 1000 classes.
We have [CONV 64 x 2] at the first layer. Where CONV means 3x3 filter, stride = 1, same padding, 64 means the total
no. of filters and 2 means there are two such layers. After applying this, we have the intermediate output
dimension 224x224x64. Now we have the architecture after [CONV 64x2] is , pool, [CONV 128x2] , pool,
[CONV256x3] , pool , [CONV 512x3], pool [CONV 512x3], pool,FC-4096, FC-4096, output .

(A) Report the dimension of intermediate volume after each CONV and pool operation.
(B) What activation you will use at the output layer
(C) What will be the size of output layer
(D) Calculate the total number of parameters. Show the detail calculation with each layer. For this
calculation you have to consider the bias term also.
(E) What do you mean by uniformity of this net.

Moderate

(A) In Alex Net local response normalisation is used. Pictorially show the process of calculation of local
normalisation.
(B) What is the difference between LeNet and Alex Net in terms of activation?
(C) Which one is much deeper? AlexNet or LeNet?
(D) How does CNN share parameters?
(E) What do you mean by sparsity of connection in CNN?
(F) We have the input image of size 32x32x3. We apply 6 numbers of 5x5 filter. What will be the output
size?
(G) How many parameters we need to learn in Q.(F)
(H) In Q.(F) we have some output size. So for a deep net architecture we can assume that the same output
obtained in Q.(F) acts as a hidden layer. Now we construct an MLP with input 32x32x3 and the 1st hidden
layer with number of neurons = size of output in Q.(F). Now for this MLP, calculate the total
number of parameters of the first layer.
(I) Compare the result of (G) and (H) and put your remark.
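A worked comparison (mine) for parts (F)-(I), assuming the MLP input in (H) is the flattened 32x32x3 image:

```python
conv_out = (32 - 5 + 1, 32 - 5 + 1, 6)                 # (F): 28 x 28 x 6
conv_params = 6 * (5 * 5 * 3 + 1)                      # (G): 456 weights + biases
mlp_hidden = 28 * 28 * 6                               # same number of hidden neurons
mlp_params = (32 * 32 * 3) * mlp_hidden + mlp_hidden   # (H): ≈ 14.46 million
print(conv_out, conv_params, mlp_params)               # (I): CNN needs far fewer parameters
```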

Moderate

Suppose you have a RGB image of size 6x6x3. You have applied 6 numbers of 3x3 filters with (1) no
padding and (2) with padding = 1. You have decided to set stride = 1. For both the cases
(A) Write the equation of forward propagation I.e. the linear part for the 1st layer.
(B) Add bias. Show them in the architecture.
(C) You have used ReLU activation in the 1st layer. Show them in the architecture.
(D) How many output channels you will get after activation? Show them in the architecture.
(E) Suppose you are using tensorflow to train this model. How many parameters you need to learn?
(F) What optimization technique you will use?

Suppose you have 3 different kernels to detect the edges of a 2D image. What are those kernels if you want
to detect (1) horizontal edges (2) Vertical Edges and (3) 45 degree edges.
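Common kernel choices (one of several possibilities, not prescribed by the source): Sobel kernels for horizontal and vertical edges and one frequently used diagonal kernel for 45-degree edges.

```python
import numpy as np

horizontal = np.array([[-1, -2, -1],     # Sobel kernel responding to horizontal edges
                       [ 0,  0,  0],
                       [ 1,  2,  1]])
vertical = np.array([[-1, 0, 1],         # Sobel kernel responding to vertical edges
                     [-2, 0, 2],
                     [-1, 0, 1]])
diag_45 = np.array([[ 0,  1, 2],         # one common 45-degree (diagonal) kernel
                    [-1,  0, 1],
                    [-2, -1, 0]])
```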
Hard

(A) Sobel filter is used for edge detection. Show two different types of Sobel filter used for edge detection.
What is the advantage of using Sobel filter ?
(B) With a 7x7 image show the process of strided convolution with stride=2. You can choose any value of
7x7 image and any value for the kernels. You need to show two steps for strided convolution in each
direction.
(C) What are the new innovations used in GoogleNet that led to better results? GoogleNet used different-sized
kernels in a single layer which led to better feature extraction at a single layer. 1x1 convolutions and
average pooling were used to reduce computations which made this technique effective for practical
purposes.

(E) What is inception network? What is 1x1 convolution? What do you mean by network in a network?

Hard

(A) Say we have 4 neurons at the input layer, 4 neurons in the 1st hidden layer, 4 neurons in the 2nd hidden
layer, 4 neurons in the 3rd hidden layer and one neuron in the output layer. Now we have set keep probability = 0.5
for working out the dropout strategy. Draw any instance of the network after application of dropout.
(B) How is inverted dropout implemented? (See the sketch after this list.)
(C) With the dropout technique we have a spreading out of weights. Justify the statement.
(D) How can you implement different dropouts in different layers? Draw any network and justify your
dropout probability setting.
(E) How does data augmentation work as a regulariser?
(F) What is early stopping? Justify why early stopping can prevent the overfitting problem.
(G) What is the problem with early stopping?
(H) In case of sentiment classification we have the input "There is nothing to like in this movie" and the
corresponding output as 1 star. This is a sequential learning problem. Can a standard MLP handle this
problem? If yes, justify your answer. If not, suggest some network.
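A minimal inverted-dropout sketch (mine) for part (B): drop units at training time and rescale the survivors by 1/keep_prob so nothing needs to change at test time.

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.5, training=True):
    if not training:
        return a                                   # test time: use activations as-is
    mask = (np.random.rand(*a.shape) < keep_prob)  # drop units with probability 1 - keep_prob
    return a * mask / keep_prob                    # rescale so the expected activation is unchanged
```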

Hard

(A) In a named entity recognition system we have to recognise the person's name from a given sentence.
The input is given as "The best book of deep learning is written by Goodfellow, Aaron and Bengio". What
will be the corresponding output vector if you want to build a recurrent neural network?
(B) We have a vocabulary of say 10000 words where at position 2 the word is "Aaron", at position 1229 the
word is "Bengio" and at position 2048 the word is "Goodfellow". We want to represent the input as one hot
encoding. What will be the following vectors x<10>, x<11> and x<12>?
(C) A standard network with multiple layers cannot handle this type of sequential problem. Why?
(D) How can a recurrent neural network overcome all these problems? Show a standard recurrent neural network
architecture where the input length and output length are the same.
(E) How will the above network in Q.(D) be changed if the network is used for music generation, where the input
is an integer and the output is a sequence of data?
(F) Show an architecture of recurrent neural network with weights for machine translation problem.
(G) Show any RNN for same length input and output. Write the forward propagation expressions for any
layer.
(H) Draw a computational graph of RNN (input and output length same) showing the loss of individual layer
and the backpropagation flow.
(I) What is the advantage of Bidirectional RNN over standard RNN. Justify with an example.
Unit 2 : feed forward, activation mlp
1 marks

Easy :
1. We have a multi-classification problem that we decide to solve by training a feedforward neural
network. What activation function should we use in the output layer to get the best results? Softmax
2. forward, backward, weight updates , is the proper order to train NN. True or false?

15 points
Which of the following is only an unsupervised learning problem?
a. Digit Recognition b. Image Segmentation c. Image Compression , Justify your answer

5 marks

1. We have data x with the following labels y=[‘orange’, ‘orange’, ‘apple’, ‘guava’, ‘orange’, ‘apple’,
‘orange’, ‘apple’]. Which of the following distribution will give the lowest cross-entropy loss with y?
(Distribution is given in the following order [‘orange’, ‘apple’, ‘guava’])?a)[0.5, 0.375, 0.125] b)[0.56, 0.31,
0.11 ] c)[0.47, 0.34, 0.19 ] d)[0.53, 0.29, 0.18]. Answer: a) Solution: The first distribution is equal to the
original distribution hence the cross-entropy loss is minimum at that point
2. Say we have a neural network with 3 input neurons, 2 hidden layers each having 8 neurons, and 3
neurons at the output layer. Find the total number of biases. Find the total number of weights. Which
loss and activation function for the output layer is best suited for the above-given network? Solution:
Total no of bias parameters b = 8 + 8 + 3 = 19. Total no of weight parameters
w = 3·8 + 8·8 + 8·3 = 24 + 64 + 24 = 112. Solution: The given problem is a classification problem and
from the number of neurons in the output layer we can infer it is a multi-class classification problem. Hence softmax is
the best activation function (with categorical cross-entropy loss).
3. Suppose we have a problem where data x and label y are related by y = x^2 + 1. Which of the following
is not a good choice for the activation function in the hidden layer if the activation function at the output
layer is linear? (a) ReLU (b) Tanh (c) Sigmoid (d) Linear. Solution: (d). If we choose a linear activation function
then the output of the neural network will be a linear function of the data, since the network is just doing
a combination of weights and biases at every layer, hence we won't be able to learn the non-linear
relationship.

15 points

1. Easy: see below

(A) Suppose a fully-connected neural network has a single hidden layer with 10 nodes. The input is
represented by a 5D feature vector and the number of classes is 3. Calculate the number of parameters of
the network. Consider there are NO bias nodes in the network? Number of parameters = (5 * 10) + (10 * 3) =
80

(B) For a 2-class classification problem, what is the minimum number of nodes required for the output layer
of a multi-layered neural network? Why? Only 1 node is enough. We can expect that node to be activated
(have high activation value) only when class = +1 else the node should NOT be activated (have activation
close to zero). We can use the binary (2-class) cross entropy loss to train such a model.

(C) What are the potential benefits of using ReLU activation over sigmoid activation?

ReLu(x) = max(0, x). Since, the values of neurons are clipped to zero for negative values, ReLu helps in
sparse representations since an appreciable fraction of neurons might have negative values. Sigmoid, on
the other hand always outputs some real values for neurons’ activations and thus the representations are
dense. It is preferred to have sparse representations over dense representations. Moreover, the magnitude
of gradient for sigmoid function tends to zero as the value of the node increases. Since the value of the
gradient is essential for update of a neuron during back-propagation, this leads to vanishing gradient
problem which leads to slower learning. ReLu, on the other hand offers a constant gradient for all x >0 and
thus it is free from vanishing gradient problems.

(D)

Hard

Consider a neural net for a binary classification which has one hidden layer as shown in the figure. We use a
linear activation function h(z) = cz at hidden units and a sigmoid activation function g(z) = sigmoid(z)
at the output unit to learn the function for P (y = 1|x, w) where x = (x1, x2) and w = (w1, w2, . . . , w9).

(A) What is the output P (y = 1 | x, w) from the above neural net? Express it in terms of xi, c
and weights wi.
z = w7 + w8 · c(w1 + w2 x1 + w3 x2) + w9 · c(w4 + w5 x1 + w6 x2)
P(y = 1 | x, w) = g(z) = 1 / (1 + exp(−z))
(B) What is the final classification boundary?

w7 + c·w8 (w1 + w2 x1 + w3 x2) + c·w9 (w4 + w5 x1 + w6 x2) = 0

(C) Is it true that any multi-layered neural net with linear activation functions at hidden layers
can be represented as a neural net without any hidden layer? Briefly explain your answer.

Yes. If linear activation functions are used for all the hidden units, output from hidden units
will be written as linear combination of input features. Since these intermediate output serves as
input for the final output layer, we can always find an equivalent neural net which does not have
any hidden layer as seen in the example above.

(D) Draw a neural net with no hidden layer which is equivalent to the given neural net, and
write weights ˜w of this new neural net in terms of c and wi.

w̃0 = w7 + c (w8 w1 + w9 w4), w̃1 = c (w8 w2 + w9 w5), w̃2 = c (w8 w3 + w9 w6)

Moderate

(A) Implement x1 AND x2 with single layer perceptron.


(B) Implement (NOT x1) AND (NOT x2) using single layer perceptron
(C) Implement x1 OR x2 using single layer perceptron

For the above three implementation you can choose any weights. You don’t need to learn. Each case you
should consider bias term. Each case the sigmoid activation should be used. Show the detail calculations.
(D) Implement XNOR with MLP using the weights you have derived earlier.
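One possible choice of weights (mine; any sufficiently large values work) for parts (A)-(C), using a sigmoid unit and reading outputs above 0.5 as 1:

```python
import numpy as np

def unit(x1, x2, w1, w2, b):
    return 1 / (1 + np.exp(-(w1 * x1 + w2 * x2 + b)))

for x1 in (0, 1):
    for x2 in (0, 1):
        AND = unit(x1, x2, 20, 20, -30)          # high only when both inputs are 1
        NOT_AND_NOT = unit(x1, x2, -20, -20, 10) # (NOT x1) AND (NOT x2)
        OR = unit(x1, x2, 20, 20, -10)           # high when either input is 1
        print(x1, x2, round(AND), round(NOT_AND_NOT), round(OR))
```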

Moderate
[Figure: a fully-connected feed-forward network with an input layer, weight matrices W^[1], W^[2], W^[3] and one output unit.]

Consider the above network. Answer the following


(A) What is the size of W^[1]?
(B) How is a^[3] computed? Show all components of a^[3].
(C) How is a^[2] computed? Show all components of a^[2].
(D) Why W^[1],W^[2],W^[3] are vectors or matrices?
(E) Redraw the network with weights and links in each layer.
(F) What is the size of W^[2]?
(G) What is the size of W^[3]?
(H) Redraw the network considering all activation functions are linear except the output.

Hard

(A) What is the vectorised implementation of forward propagation for layer l. Where 1<= l <=L.
(B) We have a MLP having 3 hidden layers. We have initialised the weights with 0 vector. We start training.
After 3rd iteration, what we will see? Each neuron in the first hidden layer will perform the same
computation. So even after multiple iterations of gradient descent each neuron in the layer will be
computing the same thing as other neurons.
(C) Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you
initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to
“break symmetry”, True/False? Justify your answer. False Logistic Regression doesn't have a hidden layer.
If you initialize the weights to zeros, the first example x fed in the logistic regression will output zero but the
derivatives of the Logistic Regression depend on the input x (because there's no hidden layer) which is not
zero. So at the second iteration, the weights values follow x's distribution and are different from each other if
x is not a constant vector.

(D) You have built a network using the tanh activation for all the hidden units. You initialize the weights to
relative large values, using np.random.randn(..,..)*1000. What will happen?

This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The
optimization algorithm will thus become slow.

(E) During forward propagation do we need to know what activation function is used in the 1st layer?
Justify your answer.
(F) During back propagation do we need to know which activation function was used in the last layer? Justify
your answer. During backpropagation you need to know which activation was used in the forward
propagation to be able to compute the correct derivative.
(G) What happens when you increase the regularization hyperparameter lambda? Justify your answer with the
gradient descent update rule. The weights shrink by a factor (1 − αλ/m) at each update.
(H) Suppose batch gradient descent in a deep network is taking excessively long to find a value of the
parameters that achieves a small value for the cost function J(W[1],b[1],...,W[L],b[L]). Which of the following
techniques could help find parameter values that attain a small value for J? You need to write whether it will
work or not. (5 will not work.)

1.Try using Adam


2.Try better random initialization for the weights
3.Try tuning the learning rate α
4.Try mini-batch gradient descent
5.Try initializing all the weights to zero

Easy

Consider 3 layer neural network. At the input layer we have 3 inputs. 1st hidden layer has 4 units. 2nd
hidden layer has 3 units. One output unit. Answer the following
Find the size of
(A) W^[1],W^[2], W^[3], (B) b^[1],b^[2],b^[3] (C) Now draw the MLP with weights and bias links.(D) Consider
all the weights in each layer are 0.1 and biases are 0.5. The activation function we use in 1st layer is ReLu,
2nd layer is ReLU and the output layer is sigmoid. What will be the predicted output for x=[1,1,1]? (E) Consider
all the activation functions are linear function except output. Should we consider the hidden units? Justify.
In that case find the output.
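A worked forward pass (mine) for part (D) with all weights 0.1, all biases 0.5, ReLU in both hidden layers and a sigmoid output:

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.ones(3)
a1 = relu(0.1 * np.ones((4, 3)) @ x + 0.5)      # 4 hidden units, each = 0.8
a2 = relu(0.1 * np.ones((3, 4)) @ a1 + 0.5)     # 3 hidden units, each = 0.82
y = sigmoid(0.1 * np.ones((1, 3)) @ a2 + 0.5)   # sigmoid(0.746) ≈ 0.678
print(a1, a2, y)
```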

Easy

(A) Suppose a fully-connected neural network has a single hidden layer with 10 nodes. The input is
represented by a 5D feature vector and the number of classes is 3. Calculate the number of parameters of
the network. Consider there are NO bias nodes in the network? Number of parameters = (5 * 10) + (10 * 3) =
80

(B) For a 2-class classification problem, what is the minimum number of nodes required for the output layer
of a multi-layered neural network? Why? Only 1 node is enough. We can expect that node to be activated
(have high activation value) only when class = +1 else the node should NOT be activated (have activation
close to zero). We can use the binary (2-class) cross entropy loss to train such a model.

(C) What are the potential benefits of using ReLU activation over sigmoid activation?

ReLu(x) = max(0, x). Since, the values of neurons are clipped to zero for negative values, ReLu helps in
sparse representations since an appreciable fraction of neurons might have negative values. Sigmoid, on
the other hand always outputs some real values for neurons’ activations and thus the representations are
dense. It is preferred to have sparse representations over dense representations. Moreover, the magnitude
of gradient for sigmoid function tends to zero as the value of the node increases. Since the value of the
gradient is essential for update of a neuron during back-propagation, this leads to vanishing gradient
problem which leads to slower learning. ReLu, on the other hand offers a constant gradient for all x >0 and
thus it is free from vanishing gradient problems.

(D) In what case can we not use Adam optimisation? Mini batch; Adam works on batch gradient descent.

(E) The following diagram represents a feed-forward neural network with one hidden layer:

[Figure: a feed-forward network with input nodes 1-2, hidden nodes 3-4 and output nodes 5-6.]
A weight on connection between nodes i and j is denoted by wij, such as w13 is the weight on the
connection between nodes 1 and 3. The following all the weights in the network:

w13=−2 w35=1 w23 = 3 w45 = −1 w14 = 4 w36 = −1 w24=−1 w46=1


Each of the nodes 3, 4, 5 and 6 uses the following activation function: g(z) = 1 if z>=0, otherwise 0. Where z
is the weighted sum of the nodes. Calculate the output of the network (y5, y6) when the input patterns are
(0,0), (1,0), (0,1), (1,1). Ans: (1,1), (0,1), (1,0), (1,1)
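A quick check (mine) of the stated answer using the given weights and the threshold activation:

```python
def g(z):
    return 1 if z >= 0 else 0

w13, w23, w14, w24 = -2, 3, 4, -1
w35, w45, w36, w46 = 1, -1, -1, 1

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    y3 = g(w13 * x1 + w23 * x2)
    y4 = g(w14 * x1 + w24 * x2)
    y5 = g(w35 * y3 + w45 * y4)
    y6 = g(w36 * y3 + w46 * y4)
    print((x1, x2), (y5, y6))    # (1,1), (0,1), (1,0), (1,1)
```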

Unit -6
15 point

Moderate : Two historians approach you for your deep learning expertise. They want to classify images of
historical objects into 3 classes depending on the time they were created:
• Antiquity (y = 0)
• Middle Ages (y = 1)
• Modern Era (y = 2)

(a) Over the last few years, the historians have collected nearly 5,000 hand-labelled RGB images.
(i) (2 points) Before training your model, you want to decide the image resolution to be used.

Why is the choice of image resolution important?


Solution: Trade-off between accuracy and model complexity.

(1 point) If you had 1 hour to choose the resolution to be used, what would you
do?
Solution: (See lecture of Prof. Katanforoosh)
Print pictures of images with different resolutions and ask friends if they can properly recognize images.

(b) You have now figured out a good image resolution to use.
(i) (2 points) How would you partition your dataset? Formulate your answer in percentages.

Solution: Several ratios possible. One way of doing it: split the initial dataset into 64% training/16% dev/
20% testing set. Training on the training set and tuning the hyperparameters after looking at the
performance on the dev set.

(2 points) After visually inspecting the dataset, you realize that the training set only contains pictures taken
during the day, whereas the dev set only has pictures taken at night. Explain what is the issue and how you
would correct it.
Solution:
• It can cause a domain mismatch.
• The difference in the distribution of the images between training and dev might lead to faulty
hyperparameter tuning on the dev set, resulting in poor performance on unseen data.
• Solution: randomly mix pictures taken at day and at night in the two sets and then resplit the data.

(3 points) As you train your model, you realize that you do not have enough data. Cite data augmentation
techniques that can be used to overcome the shortage of data.

(5 pts) Construct a convolutional neural net of your own having 3 conv layers, 2 pooling layers and one fully
connected layer. What activation do you choose for classification?
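One possible construction (an illustrative sketch in Keras, with an assumed 64x64 RGB input and the 3 historical-era classes from the problem above); softmax at the output is the natural choice for classification.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),   # softmax over the 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```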

5 marks

Easy
You have been asked to classify MNIST digits. Can you suggest a deep neural network architecture? You
just need to specify the layers and their sizes. The MNIST data set has images of size 28x28.

Moderate

What is tokenisation in text classification? What is the deep NN architecture for NLP? (Rajiv chopra
NLP)

Unit 4
5 marks
The probability of all the events x1, x2, x2....xn in a system is equal(n > 1). What can you say about the
entropy H(X) of that system?(base of log is 2)

What is Markov process? Show the Markov chain for first order.

15 marks

Consider the Markov chain with three states, S={1,2,3} that has the following transition matrix
P=[[0.5,0.25,0.25],[1/3,0,2/3],[0.5,0.5,0]].
(A) Draw the transition diagram for this chain.
(B) If we know P(x1=1) = P(x1=2) = 0.25, find P(x1=3, x2=2, x3=1).
P(x1=3) = 1 − P(x1=1) − P(x1=2) = 1 − 1/4 − 1/4 = 1/2. P(x1=3, x2=2, x3=1) = P(x1=3)·p32·p21 = (1/2)·(1/2)·(1/3) = 1/12
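A numerical check (mine) of part (B) using the transition matrix P:

```python
import numpy as np

P = np.array([[0.5, 0.25, 0.25],
              [1/3, 0.0, 2/3],
              [0.5, 0.5, 0.0]])
p_x1_3 = 1 - 0.25 - 0.25                 # P(x1 = 3) = 1/2
prob = p_x1_3 * P[2, 1] * P[1, 0]        # P(x1=3) * p(3->2) * p(2->1)
print(prob)                              # 1/12 ≈ 0.0833
```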

[HMM figure: hidden weather states (e.g. Sun, Rain) with initial and transition probabilities, and observable activities (e.g. paint, clean, play) with emission probabilities; the exact values are given in the original figure.]
(A) What are the hidden states in the above example? (B) What are the Observables? (C) What is initial

probabilities? (D) Find the transition probability matrix. (E) Find the Emission probability matrix.

Unit-1
5 marks :
What is the main benefit of stacking multiple layers of neuron with non-linear activation functions over a
single layer perceptron? A single layer perceptron is capable of classifying only linearly separable classes.
Stacking multiple layers of neurons helps in creating non-linear decision boundaries and thus can be used
for classifying examples belonging to classes which are NOT linearly separable.
How do you debug a machine learning problem?
15 marks
Answer hint (perceptron): update rule w <- w ± x if x is misclassified.
(A) Is there any difference between a single layer perceptron and a logistic regression.? If yes point the
difference if no, justify
(B) What is the delta rule? Explain it with its weight-update expression (w_new = w_old - η ∂E/∂w).
(C) If your logistic regression model suffers from overfitting, what will you do? Regularization.
(D) Say we have non-linearly separable data points. We want to classify them with a linear decision boundary. Is it possible? Justify your answer. Possible: use a kernel (as in the SVM kernel trick) to map the data to a space where it becomes linearly separable.
(E) Suppose you have 5 logistic regression models. Can you make a strong classifier with these 5 models?
If no, say why. If yes, justify your answer. Yes: combine them via boosting (e.g., AdaBoost).
(F) Why is the F1-score better than accuracy in a learning problem? For an unbalanced data set.
(G) We have a confusion matrix in which all off-diagonal elements are zero. What can you infer from this result about your designed classifier?
No false positives or false negatives, i.e., perfect classification on that data.
(H) We are given 10 sets of non-linearly separable data with no labels. You want to group this data. What will be your approach? Clustering, e.g., K-means.
(I) How can you manage the exploding and vanishing gradient problem in a deep network? Suggest one network which can help you to prevent the same. Answer hint: a residual network (ResNet); see the sketch below.
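For item (I), a minimal sketch of a residual (skip) connection in Keras; the feature-map shape and filter count are illustrative assumptions:

from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(8, 8, 16))                                          # assumed feature map
x = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(inputs)
x = layers.Conv2D(16, (3, 3), padding='same')(x)
x = layers.Add()([x, inputs])                                             # skip connection: a direct path for gradients
outputs = layers.Activation('relu')(x)
block = Model(inputs, outputs)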

1 marks

Unit 5

1.Weight sharing is a procedure of reducing number of parameters. True or False? True


2. A 3x3 filter is convolved with a 7x7 image. What will be the output? 5x5
3. 30 filters of size 5x5 are convolved with a 228x228x3 input with stride 1 and padding 0. What will be the depth of the output volume? 30
4.In CNN we use only ReLU or tanh, sigmoid and softmax are not used. True or False? False
5.In fractional max pooling what we use? A pseudorandom number generator
6.Several convolution layers are stacked before pooling. True or False?
7.Stride and number of filters are treated as hyperparameters True or False? True
8.In max pooling we consider the weights but average pooling we don’t. True or False? False
9.Recurrent neural network can be used in computer vision task.True or False? True
10.When we use RNN in computer vision problem, the input and output length are not same. True or False?
True
11.In RNN we don’t have the weight sharing advantage. True or False? False
12.With dropout technique what we can prevent? Overfitting
13. Boltzmann machine is a non stochastic learning. True or False? False

Unit 6

1.For speech recognition HMM is used. True or False? True


2. GPU stands for what?
3. Image Net is superior to LeNet in computer vision application, True or False? true
4. For object detection and classification You Only Look Once algorithm is used, True or False?
5. Is Big Data related to Deep Learning in symbolic way? Yes
6. The learning of a distributed representation for each word is called word embedding. True or False? True

Unit 4

1. What is the Markov assumption? Write it in one sentence. Ans: when predicting the future, the past does not matter, only the present.

2.Which algorithm is used for likelihood computation in HMM? Ans Forward algorithm
3.Viterbi Algorithm is used for decoding i.e. to find hidden sequence. True or False? True
4.Only backward algorithm is sufficient for HMM training. True or False? False. Both forward backward.

Unit 3
1.Which loss is used to train a MLP if the outputs are encoded as one hot vector in multiclass
classification? Categorical cross entropy
2.Which loss is used to train a MLP if the outputs are given as {1,2,3…} in this form? Sparse categorical
Cross entropy
3.To converge our gradient descent faster we generally normalise our feature set. true or False. True.
4.SGD is faster than batch gradient descent but noisy. True or Flase. True
5.Vanishing gradient problem may occur in case of tanh activation. True or false? True.
6. LeakyReLU activation can give us both +ve and -ve outputs. True or False? True
7.Adam optimisation can be used in both batch and mini batches. True or False? False only batch
8.We cannot use mean squared error loss function in binary classification problem for ANN.True or False?
True
9.The nature of log-loss cost function may be convex or non convex.True or False? False
10.What is the range of tanh activation function? (-1,1)
11.Vanishing gradient causes deeper layers to learn more slowly than earlier layers True or False. False.
12. Before non linear activation we have linear activation in hidden layers.True or False? True.

Unit-2
1.A 2-layer neural network with 5 neurons in each layer has a total of 6- parameters. True or false. True
2. Leaky ReLU is less likely to suffer from vanishing gradients than sigmoid. True or False, True
3. Xavier initialisation can help prevent vanishing gradient problem. True or False. True.
4. If z=tanh(x) , then what is z’? (1-z^2)
5. Consider a trained logistic regression. It’s weight vector is W and its test accuracy on a given dataset is
A. Assuming no bias, dividing W by 2 what will be test accuracy? A
6. You're solving a binary classification task. The final two layers in your network are a ReLU activation followed by a sigmoid. What will happen? Answer in one sentence. Using ReLU then sigmoid makes every output at least 0.5, so all predictions are positive.
7. Searching for the best learning rate for your model on a log scale is a better method. Is it true? True
8. You are searching a learning rate as 0.1,0.2,0.5 etc. Are you getting good result? Write yes or no. No
9. In logistic regression you initialise all weights to 0.5. Is it a good idea? Yes. Solution: Yes. For logistic
regression with a convex cost function you’ll have just a single optimal point and it does not matter where
you start, the starting point just changes the number of epochs to reach to that optimal point.
10. You try a 4-layer neural network. You initialize all weights to 0.5. Is this a good idea? Yes or No? No. Solution: No, initializing all weights to the same value does not break the symmetry. All hidden units will have identical influence on the cost, which will lead to identical gradients. Thus, the neurons will evolve symmetrically throughout training, effectively preventing different neurons from learning different things.
11. Mini batch gradient descent is advantageous comparing to full-batch gradient descent. True or false?
True.
less computationally expensive, faster convergence
12. An end-to-end learning generally leads to lower bias. True or False? True.
13. Suppose you have a music generation problem Your friend suggest to use a recurrent neural network.
Are you taking his/her suggestion? Yes or No? Yes

Unit 1

You have to recognise the digit 6. To handle overfitting you use data augmentation. Which data augmentation should you not do? Flip vertically

You have developed a classifier. You see during testing there is a large gap between training error and test
error. What is the problem in your model? Overfitting

MCQ
The classifier gives 1% training error and 15% dev error. What is the cause?
1. Low bias
2. high bias
3. low variance
4. high variance
Answer: 4 (high variance)

The classifier gives 15% training error and 15% dev error. What is the cause?

1. Low bias high variance


2. high bias high variance
3. low variance low bias
4. high variance low bias

Find df/dx where f = |x|. (|x| means the absolute value of x.)
a. 1
b. Sign(x)
c. 0
d. ∞

Correct Answer: b
Detailed Solution:
df/dx = 1 for x > 0, -1 for x < 0 (and taken as 0 at x = 0), i.e. df/dx = sign(x)

The first derivative of g(x), if g is the sigmoid function, is (Ans: c)

a. 1 - g(x)
b. 1 + g(x)
c. g(x)(1 - g(x))
d. g(x)(1 + g(x))
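A quick numerical check of answer (c), comparing a central finite difference with g(x)(1 - g(x)) at an arbitrary test point x = 0.7:

import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numerical = (g(x + h) - g(x - h)) / (2 * h)    # central finite difference
analytic = g(x) * (1 - g(x))
print(numerical, analytic)                     # both approximately 0.2217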

There are 5 black and 7 white balls. Assume we have drawn two balls randomly one by one without any replacement. What will be the probability that both balls are black?
a. 20/132
b. 25/144
c. 20/144
d. 25/132

Detailed Solution:
Probability of first ball being black = 5/(5 + 7) = 5/12.
Probability of drawing second ball black is = 4/(4 + 7) = 4 /11.
Now overall probability of both balls being black = (5/12) × (4/11) = 20/132

Two dice are rolled together. What will be the probability of getting 1 and 4 together?
a. 1/18
b. 1/36
c. 1
d. None of the above

Number of possible outcomes = (6 × 6) = 36.


Number of times getting 1 & 4 together = 2 (where 1 in first dice, 4 in second dice or 4 in
first dice, 1 in second dice).
So, probability = 2/36 = 1/18

Matrix inverse of a square matrix A exists if.


a. Determinant of A, det(A) = 0
b. Eigen values of A are non-zero
c. Sum of eigen values are non-zero
d. None of the above

Matrix inverse exists if det(A) is not equal to zero. det(A) = product of all the eigen
values of the square matrix.

x1, x2, x3 are linearly independent vectors. If x1 = [1,3,0], x2 = [-2,4,-5], what is a possible value of x3?

X = [x1 x2 x3]. x1, x2, x3 are linearly independent if det(X) ≠ 0.

det([[1, -2, 3], [3, 4, 4], [0, -5, 5]]) ≠ 0

We can also validate the linear dependency of options a, b, d:
Option a: x1 + x2 = x3
Option b: 2x1 + x2 = x3
Option d: x1 - 2x2 = x3

What are the eigen values of the matrix A?

A = [[5, 4], [-3, -2]]

a. 4, -3
b. 5, -2
c. -2, -1
d. 2, 1

Correct Answer: d
Detailed Solution:
det(λI - A) = 0
or, det([[λ - 5, -4], [3, λ + 2]]) = 0
or, (λ - 5)(λ + 2) + 12 = 0
or, λ^2 - 3λ + 2 = 0
or, λ = 2, 1

From a pack of 52 cards, two cards are drawn together at random. What is the probability of both the cards being kings?
a. 1/15
b. 25/57
c. 35/256
d. 1/221
Correct Answer: d
Detailed Solution:
The probability that the first card is a king is 4/52. After drawing the first card, only 51 cards remain in the pack, of which 3 are kings, so the probability of the second card being a king is 3/51. The joint probability of both cards being kings is 4/52 * 3/51 = 1/13 * 1/17 = 1/221.

For a two class problem, the Bayes minimum error classifier follows which of the following rules?
(The two different classes are ω1 and ω2, and the input feature vector is x)
a. Choose ω1 if P(ω1|x) > P(ω2|x)
b. Choose ω1 if P(ω1) > P(ω2)
c. Choose ω2 if P(ω1) < P(ω2)
d. Choose ω2 if P(ω1|x) > P(ω2|x)

Correct Answer: a

Why is the convolutional neural network taking off quickly in recent times? (Check the options that are true.)
a. Access to large amounts of digitized data
b. Integration of feature extraction within the training process
c. Availability of more computational power
d. All of the above
Correct Answer: d
Detailed Solution:
Convolutional neural networks need a lot of computational power and data for efficient training. Nowadays, because of GPUs, computational power has become accessible to many researchers. Many open public datasets (like ImageNet, MNIST etc.) are also available online for free use, so the requirement of data is also being met. Moreover, since CNNs don't need a separate feature extraction step, they are very convenient to apply with minimum domain knowledge.

The Bayes formula states:
a. posterior = likelihood * prior / evidence
b. posterior = likelihood * evidence / prior
c. posterior = likelihood * prior
d. posterior = likelihood * evidence

Correct Answer: a

Suppose you are solving a four class problem; how many discriminant functions will you need?
a. 1
b. 2
c. 3
d. 4
Correct Answer: d
Detailed Solution: For an n-class problem we need n discriminant functions.

Two random variables X1 and X2 follow Gaussian distributions with the following means and variances.

X1 ~ N(0, 3) and X2 ~ N(0, 2).

Which of the following is true?

a. Distribution of X1 will be more flat than the distribution of X2.


b. Distribution of X2 will be more flat than the distribution of X1.
c. Peak of the both distribution will be same
d. None of above.

Correct Answer: a
Detailed Solution:
As X1 has a larger variance than X2, the distribution of X1 will be more spread out than that of X2. So the distribution of X1 will be flatter than the distribution of X2.

Which of the following is true with respect to the discriminant function for normal density?
a. The decision surface is always the orthogonal bisector of the line joining the two means when the covariance matrices of the different classes are identical but otherwise arbitrary
b. The decision surface is generally not orthogonal to the line joining the two means when the covariance matrices of the different classes are identical but otherwise arbitrary
c. The decision surface is always orthogonal to the line joining the two means but not its bisector when the covariance matrices of the different classes are identical but otherwise arbitrary
d. The decision surface is arbitrary when the covariance matrices of the different classes are identical but otherwise arbitrary

Correct Answer: b
Detailed Solution:
All the options are self explanatory.

In which of following case the decision surface intersect the line joining two means of two class
at midpoint? (Consider class variance is large relative to the difference of two means)
a. When both the covariance matrices are identical and diagonal matrix.
b. When the covariance matrices for both the class are identical but otherwise
arbitrary.
c. When both the covariance matrices are identical and diagonal matrix, and both
the class has equal class probability.
d. When the covariance matrices for both class are arbitrary and different.

Correct Answer: c
Detailed Solution:
When both classes have equal prior probability, the decision surface bisects the line joining the two means; if the priors are not equal, the intersection point shifts away from the more likely mean.

For minimum distance classifier which of the following must be satisfied?

a. All the classes should have identical covariance matrix and diagonal matrix.
b. All the classes should have identical covariance matrix but otherwise arbitrary.
c. All the classes should have equal class probability.
d. None of above.

Correct Answer: c

The decision boundary of linear classifier is given by the following equation.

4x1 + 6x2 − 11 = 0

What will be class of the following two unknown input example? (Consider class 1 as positive
class, and class 2 as the negative class)
a1= [1, 2]
a2= [1, 1]

a. a1 belongs to class 1, a2 belongs to class 2


b. a2 belongs to class 1, a1 belongs to class 2
c. a1 belongs to class 2, a2 belongs to class 2
d. a1 belongs to class 1, a2 belongs to class 1

Correct Answer: a, If we put the value a1 on the decision boundary we get a positive value, and putting a2
will produce a
negative value.

Which logic function cannot be performed using a single-layered Neural Network?

a. AND
b. OR
c. XOR
d. All
Correct Answer: c
Detailed Solution:
A single layer neural network can implement only linearly separable logic functions, and XOR is not linearly separable. Direct from classroom lecture.

Which of the following statement is true?

a. L2 regularization lead to sparse activation maps


b. L1 regularization lead to sparse activation maps
c. Some of the weights are squashed to zero in L2 regularization
d. L2 regularization is also known as Lasso

Correct Answer: b
Detailed Solution:
Regularization basically adds a penalty as model complexity increases. L1 regularization squashes some of the weights to zero, thus producing sparse activation maps.

Which among the following options give the range for a tanh function?

a. -1 to 1
b. -1 to 0
c. 0 to 1
d. 0 to infinity
Correct Answer: a
Detailed Solution:
Refer to lectures, specifically the formula for tanh function.

When is gradient descent algorithm certain to find a global minima?

a. For convex cost plot


b. For concave cost plot
c. For union of 2 convex cost plot
d. For union of 2 concave cost plot

Correct Answer: a
Detailed Solution:
Only for convex cost plot, gradient descent is certain to find a global optima. However,
union of 2 convex cost plot might not be convex, so option c rules out.

Let X = [-1, 0, 3, 5] be the input of the ith layer of a neural network. On this, we want to apply the softmax function. What should be the output of it?

a. [0.368, 1, 20.09, 148.41]


b. [0.002, 0.006, 0.118,0.874]
c. [0.3, 0.05,0.6,0.05]
d. [0.04,0,0.06,0.9]

Correct Answer: b
Detailed Solution:
Softmax performs the following transform on n numbers: softmax(x_i) = e^(x_i) / Σ_j e^(x_j).
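A short check of option (b) under that formula:

import numpy as np

x = np.array([-1.0, 0.0, 3.0, 5.0])
s = np.exp(x) / np.exp(x).sum()
print(np.round(s, 3))                          # [0.002 0.006 0.118 0.874], matching option b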

The activation function which is not analytically differentiable for all real values of the given input is
a. Sigmoid
b. Tanh
c. ReLU
d. Both a & b
Correct Answer: c
Detailed Solution:
Both Sigmoid and Tanh are analytically differentiable for all real input values. ReLU is not differentiable at x = 0.

What is the main benefit of stacking multiple layers of neurons with non-linear activation functions over a single layer perceptron?
a. Reduces complexity of the network
b. Reduces inference time during testing
c. Allows to create non-linear decision boundaries
d. All of the above
Correct Answer: c
Detailed Solution:
A single layer perceptron is capable of classifying only linearly separable classes. Stacking
multiple layers of neurons helps in creating non-linear decision boundaries and thus can be
used for classifying examples belonging to classes which are NOT linearly separable.

Suppose a neural network has 3 input nodes, x, y, z. There are 2 neurons, Q and F. Q = 4x + y and F = Q * z^2. What is the gradient of F with respect to x, y and z? Assume (x, y, z) = (-2, 5, -4).
a. (64, 16, 24)
b. (-24, -4, 16)
c. (4, 4, -13)
d. (13, 13, 24)
Correct Answer: a
Detailed Solution: ∂F/∂x = 4z^2 = 64, ∂F/∂y = z^2 = 16, ∂F/∂z = 2Qz = 2(4(-2)+5)(-4) = 24.

Which of the following properties, if present in an activation function CANNOT be used in a


neural network?

a. The function is periodic


b. The function is monotonic
c. The function is unbounded
d. Both a and b
Correct Answer: a
Detailed Solution:
If the activation function is periodic then several different values (if those are multiples of
the periodicity) of neurons will map to a single scalar value (following the property of
periodicity) after the activation is applied. This will create confusion in the training
procedure. Ideally, we don’t want any periodicity in activation function.
Unbounded is not a problem: e.g ReLU
Monotonicity is not a problem: e.g Sigmoid, Tanh.

For a binary classification setting, what if the probability of belonging to class= +1 is 0.67, what
is the probability of belonging to class= -1?

a. 0
b. 0.33
c. 0.67 * 0.33
d. 1- (0.67 * 0.67)

Correct Answer: b
Detailed Solution:
In the binary classification setting we keep a single output node which can denote the probability
(p) of belonging to class= +1. So, probability of belonging to class= -1 is (1 - p) since the 2
classes are mutually exclusive.

Suppose a fully-connected neural network has a single hidden layer with 10 nodes. The input is
represented by a 5D feature vector and the number of classes is 3. Calculate the number of
parameters of the network. Consider there are NO bias nodes in the network?

a. 80
b. 75
c. 78
d. 120

Correct Answer: a
Detailed Solution:
Number of parameters = (5 * 10) + (10 * 3) = 80

For a 2-class classification problem, what is the minimum number of nodes required for the output layer of a multi-layered neural network?
a. 2
b. 1
c. 3
d. None of the above

Correct Answer: b
Detailed Solution:
Only 1 node is enough. We can expect that node to be activated (have a high activation value) only when class = +1, else the node should NOT be activated (have activation close to zero). We can use the binary (2-class) cross entropy loss to train such a model.

Suppose the input layer of a fully-connected neural network has 4 nodes. The value of a node in the first hidden layer before applying the sigmoid nonlinearity is V. Now, each of the input layer's nodes is scaled up by 8 times. What will be the value of that neuron with the updated input layer?
a. 8V
b. 4V
c. 32V
d. Remain same since scaling of input layers does not affect the hidden layers

Correct Answer: a
Detailed Solution:
Output of the neuron in the first case is given by ∑i Wi * xi = V, where W is the weight vector and x is the input vector.
Output of the neuron in the second case is given by ∑i Wi * 8xi = 8V.

Which of the following are potential benefits of using ReLU activation over sigmoid activation?
a. ReLU helps in creating dense (most of the neurons are active) representations
b. ReLU helps in creating sparse (most of the neurons are non-active) representations
c. ReLU helps in mitigating the vanishing gradient effect
d. Both (b) and (c)
Correct Answer: d
Detailed Solution:
ReLU(x) = max(0, x). Since the values of neurons are clipped to zero for negative inputs, ReLU helps in sparse representations because an appreciable fraction of neurons may have negative pre-activations. Sigmoid, on the other hand, always outputs some real value for a neuron's activation, so the representations are dense. Sparse representations are generally preferred over dense representations.
Moreover, the magnitude of the gradient of the sigmoid function tends to zero as the value of the node increases. Since the value of the gradient is essential for weight updates during back-propagation, this leads to the vanishing gradient problem and slower learning. ReLU, on the other hand, offers a constant gradient for all x > 0 and thus is free from vanishing gradient problems.

Which of the following is considered for correcting a weight during back propagation?

a. Positive gradient of weight


b. Gradient of error
c. Negative gradient of error w.r.t weight
d. Negative gradient of weight

Correct Answer: c
Detailed Solution:
The function increases in the direction of the gradient, and moving in the direction of the negative gradient decreases the function value. In machine learning we minimize an objective (loss) function, so the weights are moved in the direction of the negative gradient of the error.

What will happen when learning rate is set to zero?


a. Weight update will be very slow
b. Weights will be zero
c. Weight update will tend to zero but not exactly zero
d. Weights will not be updated

Correct Answer: d
Detailed Solution:
W(n + 1) = W(n) - η dL(w)/dw; if the learning rate (η) becomes zero, the weight update term η dL(w)/dw will become zero. So, weights will not be updated.

Gradient of sigmoid function is maximum at x=?

a. 0
b. Positive Infinity
c. Negative Infinity
d. 1
Correct Answer: a

The derivative of the loss function with respect to the weights in a deep neural network can be
computed as,

a. Sum of derivative of cost function, derivative of non-linear transfer function and


derivative of linear network.
b. Product of derivative of cost function and derivative of non-linear transfer
function.
c. Product of derivative of cost function, derivative of non-linear transfer function
and derivative of linear network.
d. Sum of derivative of cost function and derivative of non-linear transfer function.

Correct Answer: c iitkgp week6 q5

During back-propagation through max pooling with stride the gradients are

a. Evenly distributed
b. Sparse gradients are generated with non-zero gradient at the max response
location
c. Differentiated with respect to responses
d. None of the above

Correct Answer: b iitkgp week6 q3

Which of the following models can be employed for unsupervised learning?

a. Autoencoder
b. Restricted Boltzmann machines
c. Both a and b
d. None
Correct Answer: c
Detailed Solution:
An autoencoder is a type of artificial neural network used to learn efficient data coding in
an unsupervised manner.

Which of the following is only an unsupervised learning problem?

a. Digit Recognition
b. Image Segmentation
c. Image Compression
d. All of the above

Correct Answer: c
Detailed Solution:
Image compression does not need any supervised labels of the image to compress an image.
Digit recognition requires digit labels and image segmentation requires segmentation map.

What is the dimension of encoder weight matrix of an autoencoder (hidden units=400)


constructed to handle 10-dimensional input samples?
a. rows = 10 and columns = 401
b. rows = 400 and columns = 10
c. rows = 11 and columns = 400
d. rows = 400 and columns = 11

Correct Answer: d iiitkgp week6, q10

Which of the following functions can be used as an activation function in the output layer if we
wish to predict the probabilities of n classes such that the sum of p over all n equals to 1?

a. Softmax
b. ReLU
c. Sigmoid
d. Tanh
Correct Answer: a
Detailed Solution:
Softmax function ensures that the summation of probabilities asserted over the k classes
equals to 1.
The input image has been converted into a matrix of size 256 X 256 and a kernel/filter of size
5x5 with a stride of 1 and no padding. What will be the size of the convoluted matrix?

a. 252x252
b. 3x3
c. 254x254
d. 256x256
Correct Answer: a
Detailed Solution:
The size of the convoluted matrix is given by CxC where C=((I-F+2P)/S)+1, where C is the
size of the Convoluted matrix, I is the size of the input matrix, F the size of the filter matrix
and P the padding applied to the input matrix. Here P=0, I=256, F=5 and S=1. There the
answer is 252x252.
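The same formula can be checked for the other convolution-size questions in this unit with a small helper (the function name is ours, purely illustrative):

def conv_out(i, f, p=0, s=1):
    # output spatial size of a convolution: (I - F + 2P) / S + 1
    return (i - f + 2 * p) // s + 1

print(conv_out(256, 5))    # 252
print(conv_out(228, 5))    # 224 (with 8 filters the output volume is 224x224x8)
print(conv_out(224, 3))    # 222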

What will be the range of output if we apply ReLU non-linearity and then Sigmoid Nonlinearity
subsequently after a convolution layer?

a. [-1, 1]
b. [0, 1]
c. [0.5, 1]
d. [-1, -0.5]
Correct Answer: c
Detailed Solution:
The ReLU output is non-negative, and the sigmoid of a non-negative input lies in [0.5, 1).

The figure below shows image of a face which is input to a convolutional neural net and the
other three images shows different levels of features extracted from the network. Can you
identify from the following options which one is correct?
a. Label 3: Low-level features, Label 2: High-level features, Label 1: Mid-level
features
b. Label 1: Low-level features, Label 3: High-level features, Label 2: Mid-level
features
c. Label 2: Low-level features, Label 1: High-level features, Label 3: Mid-level
features
d. Label 3: Low-level features, Label 1: High-level features, Label 2: Mid-level
features
Correct Answer: b
Convolutional NN will try to learn low-level features such as edges and lines in early layers
then parts of faces of people and then high-level representation of a face.

Suppose you have 8 convolutional kernel of size 5 x 5 with no padding and stride 1 in the first
layer of a convolutional neural network. You pass an input of dimension 228 x 228 x 3 through
this layer. What are the dimensions of the data which the next layer will receive?

a. 224 x 224 x 3
b. 224 x 224 x 8
c. 226 x 226 x 8
d. 225 x 225 x 3
Correct Answer: b
Detailed Solution:
The layer accepts a volume of size W1×H1×D1. In our case, 228x228x3
Requires four hyperparameters: Number of filters K=8, their spatial extent F=5, the stride S=1, the amount of padding P=0.
Produces a volume of size W2×H2×D2 i.e. 224x224x8 where: W2=(W1−F+2P)/S+1
=(228−5)/1+1 =224, H2=(H1−F+2P)/S+1 =(228−5)/1+1 =224, (i.e. width and height are
computed equally by symmetry), D2= Number of filters K=8.

What is the mathematical form of the Leaky ReLU layer?

a. f(x)=max(0,x)
b. f(x)=min(0,x)
c. f(x)=min(0, αx), where α is a small constant
d. f(x)=1(x<0)(αx)+1(x>=0)(x), where α is a small constant

Correct Answer: d
Detailed Solution:
Option d comes from the direct formula.

The input image has been converted into a matrix of size 224 x 224 and convolved with a
kernel/filter of size FxF with a stride of s and padding P to produce a feature map of dimension
222x222. Which among the following is true?

a. F=3x3, s=1, P=1


b. F=3x3, s=0, P=1
c. F=3x3, s=1, P=0
d. F=2x2, s=0, P=0

Correct Answer: c
Detailed Solution:
The size of the convoluted matrix is given by CxC, where C=((I-F+2P)/S)+1, where C is the
size of the convoluted matrix, I is the size of the input matrix, F the size of the filter matrix
and P the padding applied to the input matrix. Here C is given in the question and it is 222.
Therefore, P=0, I=224, F=3 and s=1. Thus option c is the answer.

Statement 1: For a transfer learning task, lower layers are more generally transferred to
another task
Statement 2: For a transfer learning task, last few layers are more generally transferred to
another task
Which of the following option is correct?

a. Statement 1 is correct and Statement 2 is incorrect


b. Statement 1 is incorrect and Statement 2 is correct
c. Both Statement 1 and Statement 2 are correct
d. Both Statement 1 and Statement 2 are incorrect

Correct Answer: a
Detailed Solution:
Lower layers are more general features (for eg: can be edge detectors) and thus can be
transferred well to other task. Higher layers on the other hand are task specific.

Statement 1: Adding more hidden layers will solve the vanishing gradient problem for a 2-layer neural network
Statement 2: Making the network deeper will increase the chance of vanishing gradients.

a. Statement 1 is correct
b. Statement 2 is correct
c. Neither Statement 1 nor Statement 2 is correct
d. Vanishing gradient problem is independent of number of hidden layers of the
neural network.

Correct Answer: b
Detailed Solution:
As more layers using certain activation functions are added to neural networks, the
gradients of the loss function approaches zero, making the network hard to train. Thus
statement 2 is correct.

How many convolution layers are there in a LeNet-5 architecture?

a. 2
b. 3
c. 4
d. 5
Correct Answer: a
Detailed Solution:
There are two convolutional layers and three fully connected layers in LeNet-5
architecture.

Suppose you have trained a neural network 3 times (Experiments A, B and C) with some unknown optimizer. Each time you kept all other hyper-parameters the same but changed only one hyper-parameter. From the three given loss curves, can you identify which hyper-parameter it is?

a. Batch size
b. Learning rate
c. Number of hidden layers
d. Loss function

Correct Answer: a
Detailed Solution: The fluctuating loss of Experiment A could be due to very small batch size, which is little
less when a slightly larger batch size is chosen for Experiment B, and it is smooth when a
large batch size is chosen for Experiment C. Learning rate can’t be the answer as all three
converges at same rate. For the same reason loss functions and hidden layers can’t be the
appropriate answer.

This question has Statement 1 and Statement 2. Of the four choices given after the statements,
choose the one that best describes the two statements.
Statement 1: Mini-batch gradient descent will always overshoot the optimum point
even with a lower learning rate value.
Statement 2: Mini-batch gradient might oscillate in its path towards convergence and
oscillation can reduced by momentum optimizer.
a. Statement 1 is True and Statement 2 is False
b. Statement 1 is False and Statement 2 is True
c. Statement 1 is True and Statement 2 is True
d. Statement 1 is False and Statement 2 is False

Correct Answer: b
Detailed Solution:
Mini-batch gradient descent makes a parameter update after seeing just a subset of

examples, the direction of the update has some variance, and so the path taken by mini-
batch gradient descent will "oscillate" toward convergence. Using momentum can reduce

these oscillations.

This question has Statement 1 and Statement 2. Of the four choices given after the statements, choose the one that best describes the two statements.
Statement 1: Apart from the learning rate, Momentum optimizer has two hyper parameters whereas Adam has just one hyper parameter in its weight update equation
Statement 2: Adam optimizer and stochastic gradient descent have the same weight update rule
a. Statement 1 is True and Statement 2 is False
b. Statement 1 is False and Statement 2 is True
c. Statement 1 is True and Statement 2 is True
d. Statement 1 is False and Statement 2 is False

Correct Answer: d
Detailed Solution:
Adam has more than one hyper parameter and its weight update rule is different from stochastic gradient descent.

Which of the following options is true?

a. Stochastic Gradient Descent has noisier updates
b. In Stochastic Gradient Descent, a small batch of samples is selected randomly instead of the whole data set for each iteration, leading to too large updates of the weight values and faster convergence
c. In big data applications Stochastic Gradient Descent increases the computational burden
d. Stochastic Gradient Descent is a non-iterative process

Correct Answer: a
Detailed Solution:
Stochastic Gradient Descent does not consider the whole batch for an update and thus has noisier updates.

Which of the following is a possible edge of the momentum optimizer over mini-batch gradient descent?

a. Mini-batch gradient descent performs better than momentum optimizer when

the surface of the loss function has a much more elongated curvature along X-
axis than along Y-axis

b. Mini-batch gradient descent always performs better than momentum optimizer


c. Mini-batch gradient descent will always overshoot the optimum point even with
a lower learning rate value
d. Mini-batch gradient might oscillate in its path towards convergence which can
reduced by momentum optimizer

Correct Answer: d
Detailed Solution:
Mini-batch gradient descent makes a parameter update after seeing just a subset of

examples, the direction of the update has some variance, and so the path taken by mini-
batch gradient descent will "oscillate" toward convergence. Using momentum can reduce

these oscillations.

Which of the following is true?

a. Adam is a replacement optimization algorithm for stochastic gradient descent


for training deep learning models in local minima
b. Apart from the learning rate, Momentum optimizer has two hyper parameters
whereas Adam has just one hyper parameter in its weight update equation
c. Adam optimizer and stochastic gradient descent have the same weight update
rule
d. None of the above

Correct Answer: a
Detailed Solution:
Option (a) is self-explanatory.

Which of the following is the correct property of RMSProp optimizer?

a. RMSProp divides the learning rate by an exponentially decaying average of


squared gradients
b. RMSProp has a constant learning rate
c. RMSProp divides the learning rate by an exponentially increasing average of
squared gradients
d. RMSProp decays the learning rate by a constant value

Correct Answer: a
Detailed Solution:
Refer to the lecture.

Why it is at all required to choose different learning rates for different weights?

a. To avoid the problem of diminishing learning rate


b. To avoid overshooting the optimum point
c. To reduce vertical oscillations while navigating the optimum point
d. This would aid to reach the optimum point faster

Correct Answer: d
Detailed Solution:
In case of adaptive learning rate, learning rate will be reduced for parameters with high
gradient and learning rate will be increased for parameter with small gradient. This would
aid in reaching the optimum point much faster. This is the benefit of choosing different
learning rate for different weights.

This question has Statement 1 and Statement 2. Of the four choices given after the statements,
choose the one that best describes the two statements.
Statement 1: The stochastic gradient computes the gradient using a single sample
Statement 2: It converges much faster than the batch gradient
a. Statement 1 is True and Statement 2 is False
b. Statement 1 is False and Statement 2 is True
c. Statement 1 is True and Statement 2 is True
d. Statement 1 is False and Statement 2 is False
Correct Answer: c
Detailed Solution:
Refer to classroom lectures

What is the main purpose of auxiliary classifier in GoogleNet?


a. To increase the number of parameters
b. To avoid vanishing gradient problem
c. To increase the inference speed
d. None of the above

Correct Answer: b
Detailed Solution:

In case of Group Normalization, if group number=1, Group Normalization behaves like

a. Batch Normalization
b. Layer Normalization
c. Instance Normalization
d. None of the above

Correct Answer: b
Detailed Solution:
If group number=1, all the channels will be in a single group. Now we normalize the inputs
across all the channels and spatial dimensions. So, it behaves like layer normalization.

In case of Group Normalization, if group number=number of channels, Group Normalization


behaves like

a. Batch Normalization
b. Layer Normalization
c. Instance Normalization
d. None of the above

Correct Answer: c
Detailed Solution:
If group number = number of channels. There will be 1 channel in a single group.
Therefore, we will normalize the inputs across spatial dimensions. So it will behave like
instance normalization.

When will you do early stopping?


a. Minimum training loss point
b. Minimum validation loss point
c. Minimum test loss point
d. None of these
Correct Answer: b
Detailed Solution:
See the definition

What is the use of learnable parameters in a batch-normalization layer?

a. Calculate mean and variances
b. Perform normalization
c. Renormalize the activations
d. No learnable parameter is present

Correct Answer: c
Detailed Solution:
See the definition.

Which one of the following is not a procedure to prevent overfitting?

a. Reduce feature size


b. Use dropout
c. Use Early stopping
d. Increase training iterations

Correct Answer: d
Detailed Solution:
Increasing training iterations is not a procedure to prevent overfitting.

Suppose, you have used a batch-normalization layer after a convolution block. After that you
train the model using any standard dataset. Now will the extracted feature distribution after
batch normalization layer have zero mean and unit variance if we feed any input image?
a. Yes. Because batch-normalization normalizes the features into zero mean and
unit variance
b. No. It is not possible to normalize the features into zero mean and unit variance
c. Can’t Say. Because the batch-normalization renormalizes the features using
trainable parameters. After training, it may or may not be the zero mean and
unit variance.
d. None of the above

Correct Answer: c
Detailed Solution:
Options are self-explanatory

Which one of the following regularization methods induces sparsity among the trained
weights?

a. L1 regularizer
b. L2 regularizer
c. Both L1& L2
d. None of the above

Correct Answer: a
Detailed Solution:
https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-
regularization

Which one of the following is not an advantage of dropout?

a. Regularization
b. Prevent Overfitting
c. Improve Accuracy
d. Reduce computational cost during testing

Correct Answer: d
Detailed Solution:

Dropout randomly zeroes out some features during training, but at test time no features are zeroed out, so there is no reduction of computational cost during testing.

Which of the following operation does not reduce spatial dimension of features?

a. Max-Pooling
b. Convolution with 3 × 3 Kernel, Stride=2, Padding all sides = 1
c. Convolution with 3 × 3 Kernel, Stride=1, Padding all sides = 1
d. Average-Pooling

Correct Answer: c
Detailed Solution:
A 3 × 3 convolution with stride 1 and padding 1 keeps the spatial dimension unchanged, whereas pooling and a stride of 2 reduce it.

Which of the following functions can’t be learned by the Perceptron algorithm?


a)OR
b)AND
c)XOR
d)NOT
Answer: c) XOR
Solution: Since the perceptron is a linear algorithm (its decision boundary is linear), it can't solve the XOR problem.

Which of the following statement is True?


a)A perceptron can be used to approximate any function to the desired precision.
b)A single-layered network of neurons can be used to approximate any function to the
desired precision.
c)A multi-layered network of neurons with a single hidden layer can approximate any
continuous function to the desired precision.
d)Multi-layered neural networks with multiple hidden layers can’t approximate all continuous
functions.
Answer: c)
Solution: Universal approximation theorem discussed in lectures.

Which of the following architecture has the capacity to solve complex long-time lag tasks?
a)Hopfield network
b)RNN
c)LSTM
d)Elman network
Answer:c)LSTM
Solution:c)

An MP neuron takes two inputs x1 and x2. Its threshold is θ = 0. Select all the boolean
functions this MP neuron may represent.

a)AND
b)NOT
c)OR
d)NOR
Answer: d)
Solution: Only functions with threshold θ = 0 are NOT and NOR. NOT takes only one input
while NOR takes two
7. We are given 4 points in R2, say x1 = (0, -1), x2 = (-1, -1), x3 = (2, 3), x4 = (4, -5). The labels of x1, x2, x3, x4 are given to be 1, 1, -1, 1. We initiate the perceptron algorithm with an initial weight w0 = (0, 0) on this data. What will be the value of w0 after 2 updates? (Take points in sequential order from x1 to x4; an update happens when the value of the weight changes.)
a) (0, 0)
b) (-2, -2)
c) (-2, -3)
d) (1, 1)
Answer: c)
Solution: The first misclassified point is x3, hence w0 changes to w0 - x3 = (0, 0) - (2, 3) = (-2, -3). All the points are correctly classified after this update.

We are given the following data:

x1    x2    y
2     4     1
3     -1    -1
5     6     -1
2     0     1
-1    0     1
-2    -2    1

Can you classify every label correctly by training a perceptron algorithm? (Assume the bias to be 0 while training.)
a) Yes
b) No
Answer: b) No
Solution: By plotting x1 and x2 on graph paper we can observe that the +1 and -1 classes can't be separated using a line passing through the origin. Hence the perceptron will fail to classify all the points correctly.

Suppose we have a boolean function that takes 5 inputs x1, x2, x3, x4, x5? We have an MP
neuron with parameter θ = 1. For how many inputs will this MP neuron give output y = 1?
a)21
b)31
c)30
d)32
Answer: b)
Solution: Total no of possible boolean inputs is 2^5 = 32. The only input that will give output

y = 0 is (0, 0, 0, 0, 0). Hence required answer is 32 − 1 = 31.

Which of the following algorithm is used for training in a neural network?


a)Decision Tree
b)Newton method
c)Backpropagation
d)LASSO
Answer: c)
Solution: c)
How many boolean functions can be designed for 4 inputs?
a)65,536
b)16
c)256
d)64
Answer: a)
Solution: No.of boolean functions are given by 2^(2^4)
= 65, 536.

We have a classification problem with labels 0 and 1. We train a logistic model and find out that w0 learned by our model is -15. We are to predict the label of a new test point x using this trained model. If wᵀx = 1, which of the following statements is True?

a) Label of x predicted by the model is 1
b) Label of x predicted by the model is 0
c) Not enough information to decide the label of x
d) Label of x predicted by the model is 0.37
Answer: b)
Solution: The value of the sigmoid for our test point is 1 / (1 + e^(-(-15+1))) = 1 / (1 + e^14), which is very close to 0.

How many neurons do you need in the hidden layer of a perceptron to learn any boolean
function with 5 inputs? (Only one hidden layer is allowed)
a)16
b)64
c)16
d)32
Answer: d)
Solution: No of neurons needed to represent all boolean functions of n inputs in the
perceptron is 2^n

You are training a model using the gradient descent algorithm. You observe that the loss is
decreasing and then increasing after each successive epoch (pass-through data). Which of the
following techniques will you use to increase the chances of convergence of the gradient
descent algorithm? (η refers to step size)
a)Increase the value of η
b)Decrease the value of η
c)Set η = 0
d)Set η = 1
Answer: b)
Solution; Since the loss is oscillating around minima it means our η is high. Hence, lowering
η will increase the possibility of converging to the minima

We have a function that we want to approximate using 150 rectangles (towers).


How many neurons in the hidden layer are required to construct the required
network?
a)350
b)450
c)400
d)500
Answer: b)
Solution: To approximate one rectangle we need 3 neurons. Hence to approximate the function we would require 450 neurons.

The output layer of a sigmoid neuron has the following values for the inputs x1: 0.9,
x2: 0.1, x3: 0.49, x4: 0.63, x5: 0.33. What will be the label predicted by the sigmoid
neuron for the following inputs? (In sequence from x1 to x5).
a)[1,0,1,0,1]
b)[0,1,1,0,1]
c)[1,0,0,1,0]
d)[0,1,0,1,0]
Answer: c)
Solution: The sigmoid neuron assigns label 0 to all inputs with output value < 0.5 and label 1 to the others.

We are using line search with the following η = [0.1, 1, 0.5, 0.01, 0.01, 10]. At the nth iteration, the η selected was 10. What can we infer about the value ∇wn? (∇wn is positive)
a) ∇wn is large.
b) ∇wn = 0
c) ∇wn = 1
d) ∇wn is small.
Answer: d)
Solution: Since a large η was chosen, the value of ∇wn must be small.

What are the advantages of using stochastic gradient descent instead of vanilla gradient descent? (MSQ)
a) We are theoretically guaranteed the descent direction is the optimal direction in SGD.
b) SGD oscillates less compared to vanilla gradient descent.
c) SGD converges faster than vanilla gradient descent.
d) SGD is computationally efficient for large datasets.
Answer: c), d)
Solution: SGD updates the weights more frequently, hence it converges fast. Since it is computationally cheaper per update than vanilla gradient descent, it works well for large datasets.

We generate the data using the following model y = 5x^3 + 3x^2 + x + 1. We fit two models on

this data and train them using a neural network.


Model 1: y = w_1x + w_0
Model 2: y = w_6x^6 + w_5x^5 + w_4x^4 + w_3x^3 + w_2x^2 + w_1x + w_0

Select the correct options.


a) Model 1 has a higher bias than model 2.
b) Model 2 has a higher variance than model 1.
c) Model 1 has a higher variance than model 2.
d) Model 2 has a higher bias than model 1.
Answer: a), b)
Solution: Model 2 has high capacity, hence it will overfit and have high variance, while model 1 has low capacity, hence it will underfit and have high bias.
Given are two plots depicting two regularization techniques where we will refer to the left
plot as A and the right plot as B. Choose the correct options

a) A represents L1 regularization.
b) A represents L2 regularization.
c) B represents L1 regularization.
d) B represents L2 regularization.
Answer: a),d)
Solution: Regularization in L1 is done using absolute error which results in plot A while
regularization in L2 is using squared error which results in plot B.

We are training a neural network to distinguish cat images. To increase the effectiveness of
learning we blur, rotate, and changed some pixels. Which regularization technique are we
using for the task?
a) Addition of noise during training
b) Data sharing
c) Parameter sharing
d) Data augmentation
Answer: d)
Solution: We are transforming the data in such a way that doesn’t change the label hence
enriching the dataset.

We are adding Gaussian noise to the inputs before feeding them to the feed-forward neural network. Which regularization technique are we using? (MSQ)
a) L2 regularization
b) L1 regularization
c) Addition of noise during training
d) Data augmentation
Answer: a), c)
Solution: The addition of noise during training is equivalent to L2 regularization.

We are training a neural network with the following patience parameter = 2. When will the training stop?

Epoch   Training error   Validation error
1       4                5
2       3.5              3.8
3       3                3.3
4       2.2              3.6
5       2                3.6
6       1                3.4

a) After epoch 5
b) After epoch 4
c) After epoch 6
d) After epoch 3
Answer: a)
Solution: The validation error doesn't decrease in epoch 4 and epoch 5, hence the training stops after epoch 5 since the patience parameter is 2.

We trained different models on data and then we used the bagging technique. We observe
that our test error reduces drastically after using bagging. Choose the correct options.
(MSQ)
a) Different classifiers were selected as models
b) Errors of all the models were correlated.
c) Errors of all the models were uncorrelated(independent).
d) All of these.
Answer: c)
Solution: If the models were correlated then the covariance of the test errors would not be 0, hence the test error wouldn't reduce drastically. If all models have the same hyperparameters and train on the same set of data, then they are correlated.

Given the feed-forward network below, how many thinned networks can be formed from it?

a)32
b)128
c)64
d)50
Answer:c)
Solution: No of thinned networks possible are 2^n and here n=6.

We trained a network using the dropout technique. We observe that the weights learned by
two different neurons(in the same layer) h1 and h2 are the same but while calculating the
output the contribution of h1 is greater than h2. Choose the correct options. (Assume input
received by h1 and h2 is positive and the weights learned by them are also positive, Relu
function was used as activation)
a) h1 got dropped more than h2
b) h2 got dropped more than h1
c) both got dropped an equal no of times
d) Insufficient information to say anything
Answer: b)
Solution: The weights get scaled by the fraction of times the neuron was selected during the
dropout process. Hence, h1 got dropped less since the scaling of weights of h1 is greater than
h2.

We have trained four models on the same dataset with different hyperparameters. The training and validation loss for the models are listed below. Which model should we choose to get the best result on the test dataset?

Model   Training error   Validation error
1       0.3              1.7
2       0.9              1.1
3       0.5              0.7
4       1.9              2.3

a) Model 1
b) Model 2
c) Model 3
d) Model 4
Answer: c)
Solution: Model 3 has both low training loss and low validation loss.

We have observed that the sigmoid neuron has become saturated. What might be the
possible values of output at this neuron? (MSQ)
a)0.94
b)0.04
c)0.65
d)0.35
Answer: a), b)
Solution: Since the neuron has saturated its output values are close to 0 or 1.

Suppose we have the following neural network. The edges from hidden layer 1 to hidden layer 2 have weights w1 and w2. Which of the following are possible values of the gradients ∇w1 and ∇w2, given that sigmoid is used as the activation function?

a) -0.1, 0.1
b) 1, -1
c) 2, 2
d) -1, -0.1
Answer: c), d)
Solution: All the gradients at a layer are either all positive or all negative.

Which of the following problems makes training a neural network harder while using sigmoid as the activation function? (MSQ)
a) Saturation
b) Non-differentiable at 0
c) Computationally expensive
d) Range of the output is (0,1)
Answer: a), c), d)
Solution: Sigmoid is computationally expensive due to the exponentiation process. It saturates easily, and since its range is (0,1), the weight update directions are limited.

What are the problems associated with the Tanh(x) activation function? (MSQ)
a) It is not zero centered
b) Computationally expensive
c) Saturation
d) Non-differentiable at 0
Answer: b), c)
Solution: Tanh(x) is zero-centered but the problem of saturation still persists. It is computationally expensive to do this operation.

We observe that in the following neural network with Relu as the activation function, weights
in the first hidden layer and input layer are not getting updated. Choose the correct option
for the value of bias b going to 2nd hidden layer. (Yellow nodes represent the bias and inputs
are normalized)

a) b=20
b) b=0
c) b=-0.4
d)b=-15
Answer: d)
Solution: Since the weights are not getting updated our neuron is dead which is caused due
to high negative bias.

We train a feed-forward neural network with tanh(x) as an activation function and observe
that the neurons are getting saturated. What are the possible causes of this phenomenon?
a)Weights were initialized to very high values
b)Weights were initialized to low values(close to 0)
c)Weights were initialized to zero
d)All of these
Answer: d)
Solution: Tanh(x) will saturate when the value of weight is very high or very small.

We train a feed-forward neural network and observe that the weights learned are all equal for a single neuron. What are the possible causes of this problem? (MSQ)
a) Weights were initialized randomly.
b) Weights were initialized to high values.
c) Weights were initialized to equal values.
d) Weights were initialized to zero.
Answer: c), d)
Solution: All weights remain equal for a single neuron because the symmetry is never broken, which is caused by equal (including zero) initialization of the weights.

Which of the following is incorrect with respect to the batch normalization process in neural networks?
a) We normalize the output produced at each layer before feeding it into the next layer
b) Variance and mean are not learnable parameters.
c) Batch normalization leads to a better initialization of weights.
d) Backpropagation can be used after batch normalization

Answer: b)
Solution: The network is allowed to adjust variance and mean by making them learnable parameters.

Which of the following statements is true with respect to exponential ReLU, given by a(e^x - 1) if x <= 0? (a = 1)
a) It is discontinuous at 0.
b) It is non-differentiable at 0.
c) It is less computationally expensive than ReLU.
d) Exponential ReLU can have negative values.
Answer: d)
Solution: Exponential ReLU can have negative values; with a = 1 it is continuous and differentiable at 0, so option (d) is the true statement.

Suppose we are given a 9x9 image 'I' to which we apply a convolutional filter 'C' of size 4x4. What will be the size (dimension) of the resultant image 'R'? (Stride 'S' is 1)
a) 6x6
b) 5x5
c) 9x9
d) 4x4
Answer: a)
Solution: We have WR = WI - WC + 1 = 9 - 4 + 1 = 6. Similarly HR = 6. Hence the size (dimension) of the resultant image is 6x6.

Which of the following architectures has the highest number of layers?
a) AlexNet
b) GoogleNet
c) VGG
d) ResNet
Answer: d)
Solution: ResNet has the highest number of layers among all the listed architectures.

We are given the one hot representations of two words below:
APPLE = [1, 0, 0, 0, 0], BALL = [0, 0, 0, 1, 0]
What is the euclidean distance between APPLE and BALL?
a) 1
b) √2
c) √3
d) 2
Answer: b)
Solution: The Euclidean distance between APPLE and BALL is sqrt((1 - 0)^2 + (0 - 1)^2) = √2.

What are the problems in the RNN architecture? (MSQ)
a) Morphing of information stored at each time step.
b) Exploding and vanishing gradient problem.
c) Errors caused at time step tn can't be related to previous time steps far away.
d) All of the above
Answer: d)
Solution: Information stored in the network gets morphed at every time step due to new input. Exploding and vanishing gradient problems are caused by the long dependency chains in RNNs.

Which type of neural network is best suited for processing sequential data?
a) Convolutional Neural Networks (CNN)
b) Recurrent Neural Networks (RNN)
c) Fully Connected Neural Networks (FCN)
d) Deep Belief Networks (DBN)
Answer: b)
Solution: Recurrent Neural Networks (RNN) are best suited for processing sequential data.

What is the objective(loss) function in the RNN?


a)Cross Entropy Page 40 Batch
b)Sum of cross-entropy normalization
c)Squared error

If 11
d)Accuracy
d
If
Answer: b)
Solution: RNN is used for sequential tasks. At each state we have some predicted and
actual output, and the loss between the two is measured by cross-entropy. The overall loss in an RNN
is measured by the sum of the cross-entropy across all such states in the network.
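
A minimal sketch of this loss, assuming the per-step outputs are softmax distributions and the targets are class indices (names and toy values below are assumptions):

    import numpy as np

    def rnn_loss(step_probs, targets):
        # total loss = sum over time steps of the cross-entropy at each step
        return sum(-np.log(p[t]) for p, t in zip(step_probs, targets))

    step_probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
    print(rnn_loss(step_probs, [0, 1]))  # -log(0.7) - log(0.8)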

Select the tasks for which using an RNN will work better than other types of networks. (MSQ)
a)Video captioning
b)Language translation
c)Image Detection
d)Sentiment classification
Answer: a),b),d)
Solution: CNN is better at detecting images than RNN. The rest of the tasks are sequential
tasks hence RNN would perform better.
Linear Regression

1. If you train a linear regression estimator with only half the data, its bias is smaller. SOLUTION:
FALSE. Bias depends on the model you use (in this case linear regression) and not on the amount
of training data.

2. What happens in gradient descent when α < 0? Gradient ascent

3. Can we model non-linear relationships with a linear regression? Yes → transform the features

4. Suppose we have a function f(x1, x2) = x1^2 + 3x2 + 25, which we want to minimize using
the gradient descent algorithm. We initialize (x1, x2) = (0, 0). What will be
the value of (x1, x2) after two updates in the gradient descent process? (Let the learning rate be 1.)
a)(0, −6)
b)(0, −3)
c)(0, −4.5)
d)(1, −3)
Answer: a)
Solution: Gradient of f(x1, x2) at any general point (x1, x2) is (2x1, 3)
Hence, after the first update value of (x1, x2) = (0, 0) − (0, 3) = (0, −3)
Value of (x1, x2) after the second update is given by (x1, x2) = (0, −3) − (0, 3) = (0,−6)
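
The two updates can be verified with a short script (illustrative sketch):

    def grad(x1, x2):
        # gradient of f(x1, x2) = x1^2 + 3*x2 + 25
        return 2 * x1, 3

    x1, x2, lr = 0.0, 0.0, 1.0
    for _ in range(2):
        g1, g2 = grad(x1, x2)
        x1, x2 = x1 - lr * g1, x2 - lr * g2
    print(x1, x2)  # (0.0, -6.0)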

5. In ridge regression the cost function monotonically decreases.


6. If the no. of features = 3, how many parameters are needed for a linear regression model?

7. Bivariate data: (1, 3), (2, 1), (4, 4)


1. Do (simple) linear regression to find the best fitting line. Hint: minimize the total squared error by
taking partial derivatives with respect to a and b.
2. Do linear regression to find the best fitting parabola.
3. Set up the linear regression to find the best fitting cubic, but don’t take derivatives.
4. Find the best fitting exponential y = e^(ax+b).
Hint: take ln(y) and do simple linear regression. (A quick numerical check is sketched below.)
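
One possible NumPy check for sub-parts 1 and 4 (np.polyfit is used here only as a convenience; the question itself asks for the derivation by hand):

    import numpy as np

    x = np.array([1.0, 2.0, 4.0])
    y = np.array([3.0, 1.0, 4.0])

    # sub-part 1: degree-1 least-squares fit; returns slope a and intercept b
    a, b = np.polyfit(x, y, 1)
    print(a, b)  # 0.5 and 1.5

    # sub-part 4: fit y = e^(a*x + b) by fitting a straight line to ln(y)
    a_exp, b_exp = np.polyfit(x, np.log(y), 1)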
8. For the bi-variate data (1,5), (2,6), (3,9), we develop a linear model y = 0.5*x_i + 0.5. Find the total
squared error loss.
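
A short sketch of how the total squared error could be computed for this model (illustrative):

    xs, ys = [1, 2, 3], [5, 6, 9]
    preds = [0.5 * x + 0.5 for x in xs]                  # model: y = 0.5*x + 0.5
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    print(sse)  # (5-1)^2 + (6-1.5)^2 + (9-2)^2 = 85.25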

9. Prove the normal equation for linear regression.

10. Can we use SGD for both ridge regression and lasso regression? If yes, explain why. If not, explain
why, and what other techniques we can use for optimization. Derive the closed-form solution for the ridge
regression model considering maximum likelihood estimation. What is weight decay in ridge regression?
Why is data preprocessing a must in linear regression? Say we have the data X = {(1.2, 3043),
(2.4, 506.3), (3.5, 1127)}. Derive the preprocessed data. If the weights are initialized as (0,0) and the
learning rate = 3, for the above data find the value of the weights after the 2nd epoch. Consider a
regularizing parameter value of 10.
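
For the closed-form part, a minimal sketch of the ridge solution w = (X^T X + λI)^(-1) X^T y; the toy data below are hypothetical (not the values from the question), and whether the bias column is regularized is an assumption:

    import numpy as np

    def ridge_closed_form(X, y, lam):
        # w = (X^T X + lambda * I)^(-1) X^T y
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0]])   # hypothetical data; first column is the bias feature
    y = np.array([1.0, 2.0, 2.5])
    print(ridge_closed_form(X, y, lam=10.0))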

Logistic Regression

The output layer of a sigmoid neuron has the following values for the inputs x1: 0.9,
x2: 0.1, x3: 0.49, x4: 0.63, x5: 0.33. What will be the labels predicted by the sigmoid
neuron for these inputs? (In sequence from x1 to x5.)
a)[1,0,1,0,1]
b)[0,1,1,0,1]
c)[1,0,0,1,0]
d)[0,1,0,1,0]
Answer: c)
Solution: The sigmoid neuron assigns label 0 to every input with an output value < 0.5 and label 1 to the
others.
• Suppose we have a linear regression model as
y = b + w1x1 + w2x2.
What will the decision boundary look like if b = 2, w1 = 2 and w2 = 0?

• Represent the following boolean function with a single logistic threshold unit.

A   B   f(A,B)
1   1   0
0   0   0
1   0   1
0   1   0
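
The table corresponds to f(A, B) = A AND (NOT B). One possible weight setting for a single threshold unit (the weights and bias below are an assumption; many choices work) is sketched here:

    def threshold_unit(a, b, w1=1.0, w2=-1.0, bias=-0.5):
        # fires only for (A, B) = (1, 0): net = 1*1 - 1*0 - 0.5 > 0
        return 1 if w1 * a + w2 * b + bias > 0 else 0

    for a, b in [(1, 1), (0, 0), (1, 0), (0, 1)]:
        print(a, b, threshold_unit(a, b))   # reproduces the truth table above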

19. Say h = g(b + w1x1 + w2x2) and w1 = 1, w2 = 1, b = -3. Write down the equation for the decision
boundary. Say we have four classes. How can we represent these four classes using one-hot
encoding?

20. Suppose we are doing binary sentiment classification on movie review text, and
we would like to know whether to assign the sentiment class + or − to a review
document doc. We’ll represent each input observation by the 6 features x_1 ... x_6 of
the input, as (3,2,1,3,0,4.19).
Say the corresponding learned weights are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1. Find
p(+|x) and p(−|x).
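
A possible way to evaluate these probabilities numerically (illustrative sketch):

    import numpy as np

    x = np.array([3, 2, 1, 3, 0, 4.19])
    w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
    b = 0.1

    z = w @ x + b                   # weighted sum plus bias
    p_pos = 1 / (1 + np.exp(-z))    # p(+|x) = sigmoid(z)
    print(p_pos, 1 - p_pos)         # roughly 0.70 and 0.30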

21. For the above example the actual output is given for five samples as y = [1,0,0,1,1], and the predicted
output is obtained as h = [0.73, 0.3, 0.2, 0.4, 0.95]. Find the binary cross-entropy loss.
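
A short check of the loss, assuming the mean (rather than the sum) over the five samples; both conventions appear in practice:

    import numpy as np

    y = np.array([1, 0, 0, 1, 1])
    h = np.array([0.73, 0.3, 0.2, 0.4, 0.95])

    # mean binary cross-entropy: -(1/N) * sum(y*log(h) + (1-y)*log(1-h))
    bce = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    print(bce)  # roughly 0.37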

22. When and why is softmax activation used? What is the limitation of sigmoid?
23. Find sigmoid(0.3)

24. (a) For a vector z of dimensionality K, what will be softmax(z)? (b) If z = [z_1, z_2, ..., z_K], then what
will be softmax(z)? (c) For example, given a vector z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1], find the resulting
rounded softmax(z). What is conditional maximum likelihood estimation? How is this related to the cross-
entropy loss function? If the raw value at the output neurons is a = [−0.5, 1.2, −0.1, 2.4], then what will be
sigmoid(a) and softmax(a)? Which one is appropriate, and why the difference?
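
A small sketch contrasting the two activations on a = [−0.5, 1.2, −0.1, 2.4] (illustrative only):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    a = np.array([-0.5, 1.2, -0.1, 2.4])
    print(sigmoid(a))   # independent per-neuron probabilities; they do not sum to 1
    print(softmax(a))   # a single distribution over the 4 classes; sums to 1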

Practical Implementation of Machine Learning


1. You have built a classifier which has 1% training error and 15% validation error. What is the
problem?
2. Say classifier A has 0.1% training error and 10% validation error and classifier B has 10%
train error and 15% validation error. Which has higher bias?
3. What is the proper way of splitting the train:val:test set?
4. Normalization of input features mainly speeds up the convergence process. True or False?
5. A high-dimensional feature space results in overfitting. True or False?
6. A large gap between train and test error indicates underfitting. True or False?
7. More training data helps with overfitting. True or False?
8. What is meant by better training?
1. Run for more iterations 2. Use a different algorithm 3. Use a different classifier 4. Play with
regularization
9. The loss function decreases and after some iterations it increases. What is the nature of the cost
function on which GD is applied?
If we are doing gradient descent on a convex function, the objective can’t increase, so the cost function
must be non-convex.
10. If your data set is unbalanced, accuracy may be misleading. Which metric should be used?
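
One common choice is precision/recall/F1 (or balanced accuracy). A minimal sketch, with a hypothetical imbalanced toy example:

    def precision_recall_f1(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # predicting the majority class on a 9:1 imbalanced set gives 90% accuracy but zero recall
    print(precision_recall_f1([0] * 9 + [1], [0] * 10))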
