Deep Learning Questions

Deep learning involves using large datasets and complex algorithms to train neural networks to perform tasks like image and speech recognition. It is inspired by the human brain. A neural network consists of an input layer, one or more hidden layers where feature extraction occurs, and an output layer. The hidden layers use nonlinear activation functions. Deep learning models are trained using methods like backpropagation to minimize a cost function through gradient descent. Hyperparameters like learning rate and epochs are tuned. Techniques like dropout and batch normalization help reduce overfitting. Recurrent neural networks can process sequential data through feedback loops. LSTMs are a type of RNN that can learn long-term dependencies in a sequence.


1. What Is Deep Learning?

Deep Learning involves taking large volumes of structured or unstructured data and using complex algorithms to train neural networks. It performs complex operations to extract hidden patterns and features (for instance, distinguishing the image of a cat from that of a dog).

2. What is a Neural Network?

Neural Networks replicate the way humans learn, inspired by how the neurons in our brains fire, only much simpler.
The most common Neural Networks consist of three network
layers:

1. An input layer
2. A hidden layer (this is the most important layer where
feature extraction takes place, and adjustments are made to
train faster and function better)
3. An output layer
Each layer contains neurons called "nodes," which perform various operations. Neural Networks are used in deep learning algorithms like CNNs, RNNs, and GANs.

3. What Is a Multi-layer Perceptron(MLP)?

As in other Neural Networks, MLPs have an input layer, one or more hidden layers, and an output layer. An MLP has the same structure as a single-layer perceptron but with one or more hidden layers. A single-layer perceptron can classify only linearly separable classes with binary output (0, 1), but an MLP can classify nonlinear classes.
Except for the input layer, each node in the other layers uses a nonlinear activation function: the weighted inputs to a node (plus a bias) are summed and passed through the activation function to produce the node's output.
MLPs use a supervised learning method called "backpropagation." In backpropagation, the neural network calculates the error with the help of a cost function and propagates this error backward to where it came from, adjusting the weights to train the model more accurately.

4. What Is Data Normalization, and Why Do We Need It?

The process of standardizing and reforming data is called "Data Normalization." It's a pre-processing step to eliminate data redundancy. Often, data comes in, and you get the same information in different formats. In these cases, you should rescale values to fit into a particular range, achieving better convergence.
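Not from the original text, but as a small illustration of rescaling values into a particular range, here is a minimal NumPy sketch of min-max normalization and z-score standardization (the numbers are made up):

import numpy as np

# Toy feature values on an arbitrary scale (made up for illustration)
x = np.array([2.0, 8.0, 10.0, 20.0])

# Min-max normalization: rescale into the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # approx. [0.   0.33 0.44 1.  ]
print(x_zscore)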

5. What is the Boltzmann Machine?

One of the most basic Deep Learning models is a Boltzmann Machine, resembling a simplified version of the Multi-Layer Perceptron. This model features a visible input layer and a hidden layer -- just a two-layer neural net that makes stochastic decisions as to whether a neuron should be on or off. Nodes are connected across layers, but no two nodes of the same layer are connected.

6. What Is the Role of Activation Functions in a Neural Network?

At the most basic level, an activation function decides whether a neuron should fire or not. It accepts the weighted sum of the inputs plus a bias as its input. The step function, Sigmoid, ReLU, Tanh, and Softmax are examples of activation functions.

7. What Is the Cost Function?

Also referred to as "loss" or "error," the cost function is a measure of how good your model's performance is. It's used to compute the error of the output layer during backpropagation. We push that error backward through the neural network and use it during the different training functions.
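As a small illustration (not part of the original answer), a mean squared error cost can be computed directly from predictions and targets; the arrays below are made up:

import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # targets (made up)
y_pred = np.array([0.9, 0.2, 0.7])   # model outputs (made up)

# Mean squared error: average squared difference between prediction and target
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # approx. 0.047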

8. What Is Gradient Descent?

Gradient Descent is an optimization algorithm used to minimize the cost function, i.e., to minimize the error. The aim is to find the local or global minimum of a function. The gradient determines the direction the model should take to reduce the error.
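A bare-bones sketch of the idea on a toy one-dimensional cost (the cost function and learning rate here are invented for illustration):

# Gradient descent on the toy cost C(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0              # initial weight
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - 3)             # gradient of the cost at the current weight
    w = w - learning_rate * grad   # step against the gradient

print(w)  # converges close to the minimum at w = 3
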
9. What Do You Understand by Backpropagation?

Backpropagation is a technique to improve the performance of the network. It backpropagates the error and updates the weights to reduce the error.

10. What Is the Difference Between a Feedforward Neural Network and a Recurrent Neural Network?

In a Feedforward Neural Network, signals travel in one direction from input to output. There are no feedback loops; the network considers only the current input. It cannot memorize previous inputs (e.g., CNN).
A Recurrent Neural Network’s signals travel in both directions,
creating a looped network. It considers the current input with
the previously received inputs for generating the output of a
layer and can memorize past data due to its internal memory.

11. What Are the Applications of a Recurrent Neural Network (RNN)?

The RNN can be used for sentiment analysis, text mining, and
image captioning. Recurrent Neural Networks can also address
time series problems such as predicting the prices of stocks in a
month or quarter.

12. What Are the Softmax and ReLU Functions?

Softmax is an activation function that generates outputs between zero and one. It divides each output by the sum of all the outputs so that they total one. Softmax is often used for output layers.
ReLU (Rectified Linear Unit) is the most widely used activation function. It gives an output of x if x is positive and zero otherwise. ReLU is often used for hidden layers.
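A minimal NumPy sketch of both functions (the input vector is made up for illustration):

import numpy as np

def relu(x):
    # x where x is positive, zero otherwise
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max for numerical stability, then normalize so the outputs sum to one
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
print(relu(logits))     # [2. 1. 0.]
print(softmax(logits))  # probabilities that sum to 1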

13. What Are Hyperparameters?

With neural networks, you're usually working with hyperparameters once the data is formatted correctly. A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is trained and the structure of the network (such as the number of hidden units, the learning rate, epochs, etc.).
14. What Will Happen If the Learning Rate Is Set Too Low or
Too High?

When the learning rate is too low, training of the model will
progress very slowly as we are making minimal updates to the
weights. It will take many updates before reaching the
minimum point.

If the learning rate is set too high, drastic updates to the weights cause undesirable divergent behaviour in the loss function. The model may fail to converge (never settling on a good output) or even diverge (the updates are too chaotic for the network to train).

15. What Is Dropout and Batch Normalization?

Dropout is a technique of randomly dropping out hidden and visible units of a network to prevent overfitting (typically dropping 20 percent of the nodes). It doubles the number of iterations needed for the network to converge.

Batch normalization is a technique to improve the performance and stability of neural networks by normalizing the inputs to every layer so that they have a mean output activation of zero and a standard deviation of one.
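A minimal sketch of where these two layers typically sit in a model, assuming the tf.keras API (the layer sizes are arbitrary choices, not from the original text):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.BatchNormalization(),   # normalize the activations of the previous layer
    layers.Dropout(0.2),           # randomly drop 20 percent of the units during training
    layers.Dense(10, activation="softmax"),
])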

16. What Is the Difference Between Batch Gradient Descent and Stochastic Gradient Descent?

Batch Gradient Descent:
 Computes the gradient using the entire dataset.
 Takes time to converge because the volume of data is huge and the weights update slowly.

Stochastic Gradient Descent:
 Computes the gradient using a single sample.
 Converges much faster than batch gradient descent because it updates the weights more frequently.

17. What Is Overfitting and Underfitting, and How to Combat Them?

Overfitting occurs when the model learns the details and noise
in the training data to the degree that it adversely impacts the
execution of the model on new information. It is more likely to
occur with nonlinear models that have more flexibility when
learning a target function. An example would be if a model is
looking at cars and trucks, but only recognizes trucks that have
a specific box shape. It might not be able to notice a flatbed
truck because there's only a particular kind of truck it saw in
training. The model performs well on training data, but not in
the real world.

Underfitting refers to a model that is neither well-trained on the data nor able to generalize to new information. This usually happens when there is too little data, or noisy data, to train the model. An underfit model gives both poor performance and poor accuracy.

To combat overfitting and underfitting, you can resample the data to estimate model accuracy (e.g., k-fold cross-validation) and use a validation dataset to evaluate the model.

18. How Are Weights Initialized in a Network?


There are two methods here: we can either initialize the weights
to zero or assign them randomly.

Initializing all weights to 0: This makes your model similar to a linear model. All the neurons in every layer perform the same operation, giving the same output and making the deep net useless.

Initializing all weights randomly: Here, the weights are assigned randomly, initialized very close to 0. This gives better accuracy to the model since every neuron performs different computations. It is the most commonly used method.

19. What Are the Different Layers on CNN?

There are four layers in CNN:

1. Convolutional Layer - the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.
2. ReLU Layer - it brings non-linearity to the network and
converts all the negative pixels to zero. The output is a
rectified feature map.
3. Pooling Layer - pooling is a down-sampling operation that
reduces the dimensionality of the feature map.
4. Fully Connected Layer - this layer recognizes and classifies
the objects in the image.
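A minimal sketch of these four layer types in order, assuming the tf.keras API (the input size and filter counts are arbitrary choices, not from the original text):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),             # pooling: down-sample the feature map
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # fully connected classification layer
])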

20. What is Pooling on CNN, and How Does It Work?

Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.

21. How Does an LSTM Network Work?

Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are three steps in an LSTM network:

 Step 1: The network decides what to forget and what to remember.
 Step 2: It selectively updates cell state values.
 Step 3: The network decides what part of the current state
makes it to the output.
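A minimal sketch of an LSTM layer on a toy sequence task, assuming the tf.keras API (the sequence length, feature count, and unit count are arbitrary):

from tensorflow.keras import layers, models

model = models.Sequential([
    # 20 timesteps with 8 features each; the LSTM carries a cell state across timesteps
    layers.LSTM(32, input_shape=(20, 8)),
    layers.Dense(1, activation="sigmoid"),   # e.g., one prediction per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")
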
22. What Are Vanishing and Exploding Gradients?

While training an RNN, your slope can become either too small
or too large; this makes the training difficult. When the slope is
too small, the problem is known as a “Vanishing Gradient.”
When the slope tends to grow exponentially instead of
decaying, it’s referred to as an “Exploding Gradient.” Gradient
problems lead to long training times, poor performance, and
low accuracy.

23. What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?

 Epoch - Represents one iteration over the entire dataset (everything put into the training model).
 Batch - Refers to when we cannot pass the entire dataset
into the neural network at once, so we divide the dataset into
several batches.
 Iteration - if we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
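The arithmetic as a quick sketch (numbers taken from the example above):

dataset_size = 10_000   # total training images
batch_size = 200        # images processed per iteration

iterations_per_epoch = dataset_size // batch_size
print(iterations_per_epoch)  # 50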

24. Why Is TensorFlow the Most Preferred Library in Deep Learning?

TensorFlow provides both C++ and Python APIs, making it easier to work with, and it has a faster compilation time than other deep learning libraries like Keras and Torch. TensorFlow supports both CPU and GPU computing devices.

25. What Do You Mean by Tensor in Tensorflow?

A tensor is a mathematical object represented as an array of higher dimensions. These arrays of data with different dimensions and ranks, fed as input to the neural network, are called "Tensors."

26. What Are the Programming Elements in Tensorflow?

Constants - Constants are parameters whose value does not change. To define a constant we use the tf.constant() command. For example:

a = tf.constant(2.0, tf.float32)
b = tf.constant(3.0)
print(a, b)

Variables - Variables allow us to add new trainable parameters to the graph. To define a variable, we use the tf.Variable() command and initialize it before running the graph in a session. An example:

W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)

Placeholders - these allow us to feed data to a TensorFlow model from outside the model. They permit a value to be assigned later. To define a placeholder, we use the tf.placeholder() command. An example:

a = tf.placeholder(tf.float32)
b = a * 2

with tf.Session() as sess:
    result = sess.run(b, feed_dict={a: 3.0})
    print(result)

Sessions - a session is run to evaluate the nodes. This is called the "TensorFlow runtime." For example:

a = tf.constant(2.0)
b = tf.constant(4.0)
c = a + b

# Launch the session
sess = tf.Session()

# Evaluate the tensor c
print(sess.run(c))

27. Explain a Computational Graph.

Everything in TensorFlow is based on creating a computational graph. It is a network of nodes in which nodes represent mathematical operations and edges represent tensors. Since data flows in the form of a graph, it is also called a "DataFlow Graph."

28. Explain Generative Adversarial Network.

Suppose there is a wine shop purchasing wine from dealers, which it resells later. But some dealers sell fake wine. In this case, the shop owner should be able to distinguish between fake and authentic wine.

The forger will try different techniques to sell fake wine and
make sure specific techniques go past the shop owner’s check.
The shop owner would probably get some feedback from wine
experts that some of the wine is not original. The owner would
have to improve how he determines whether a wine is fake or
authentic.

The forger's goal is to create wines that are indistinguishable from the authentic ones, while the shop owner intends to tell accurately whether the wine is real or not.

There is a noise vector coming into the forger, who is generating fake wine.

Here the forger acts as a Generator.

The shop owner acts as a Discriminator.

The Discriminator gets two inputs; one is the fake wine, while
the other is the real authentic wine. The shop owner has to
figure out whether it is real or fake.

So, there are two primary components of a Generative Adversarial Network (GAN):

1. Generator
2. Discriminator
The generator is a CNN that keeps producing images that come closer and closer in appearance to the real images, while the discriminator tries to determine the difference between real and fake images. The ultimate aim is to train the generator until the discriminator can no longer distinguish real images from fake ones.
29. What Is an Auto-encoder?

This Neural Network has three layers in which the number of input neurons equals the number of output neurons. The network's target output is the same as its input. It uses dimensionality reduction to restructure the input. It works by compressing the image input to a latent-space representation and then reconstructing the output from this representation.
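A minimal sketch of such an encoder/decoder pair, assuming the tf.keras API (the 784-dimensional input and 32-dimensional latent space are arbitrary choices):

from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(784,)),  # encoder: compress to the latent space
    layers.Dense(784, activation="sigmoid"),                  # decoder: reconstruct the input
])
# Input and target are the same data, so the model learns to reproduce its input
autoencoder.compile(optimizer="adam", loss="mse")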

30. What Is Bagging and Boosting?

Bagging and Boosting are ensemble techniques that train multiple models using the same learning algorithm and then combine their predictions.
With Bagging, we take a dataset and split it into training data and test data. Then we randomly select samples (with replacement) to place into the bags and train a separate model on each bag.

With Boosting, the emphasis is on selecting the data points that give wrong outputs in order to improve the accuracy.

31. Why is it necessary to introduce non-linearities in a neural network?

Solution: otherwise, we would have a composition of linear functions, which is also a linear function, giving a linear model. A linear model has a much smaller number of parameters and is therefore limited in the complexity it can model.

32. Describe two ways of dealing with the vanishing gradient problem in a neural network.

Solution:

 Using ReLU activation instead of sigmoid.

 Using Xavier initialization.


33. What are some advantages in using a CNN (convolutional
neural network) rather than a DNN (dense neural network) in
an image classification task?

Solution: while both models can capture the relationship between close pixels, CNNs have the following properties:

 It is translation invariant — the exact location of the pixel is irrelevant for the filter.
 It is less likely to overfit — the typical number of parameters in a CNN is much smaller than that of a DNN.
 It gives us a better understanding of the model — we can look at the filters' weights and visualize what the network "learned".
 Hierarchical nature — it learns complex patterns by describing them using simpler ones.

34. Describe two ways to visualize features of a CNN in an image classification task.

Solution:

 Input occlusion — cover a part of the input image and see which part affects the classification the most. For instance, if occluding a particular region causes a trained classifier's confidence for the correct class to drop sharply (say from 98% to 65%), that region is important for the classification.

 Activation Maximization — the idea is to create an artificial input image that maximizes the target response (gradient ascent).

35. Is trying the following learning rates: 0.1, 0.2, …, 0.5 a good strategy to optimize the learning rate?

Solution: No, it is recommended to try a logarithmic scale to optimize the learning rate.

36. Suppose you have a NN with 3 layers and ReLU activations. What will happen if we initialize all the weights with the same value? What if we only had 1 layer (i.e., linear/logistic regression)?
Solution: If we initialize all the weights to the same value, we would not be able to break the symmetry; i.e., all gradients will be updated the same and the network will not be able to learn. In the 1-layer scenario, however, the cost function is convex (linear/sigmoid) and thus the weights will always converge to the optimal point, regardless of the initial value (convergence may be slower).

37. Explain the idea behind the Adam optimizer.

Solution: Adam, or adaptive moment estimation, combines two ideas to improve convergence: per-parameter updates, which give faster convergence, and momentum, which helps to avoid getting stuck in saddle points.

38. Compare batch, mini-batch and stochastic gradient descent.

Solution: batch gradient descent estimates the gradient using the entire dataset, mini-batch by sampling a few datapoints, and SGD updates the gradient using one datapoint at a time. The tradeoff here is between how precise the calculation of the gradient is versus what size of batch we can keep in memory. Moreover, taking a mini-batch rather than the entire batch has a regularizing effect by adding random noise at each step.

39. What is data augmentation? Give examples.

Solution: Data augmentation is a technique to increase the amount of input data by performing manipulations on the original data. For instance, with images one can rotate the image, reflect (flip) the image, or add Gaussian blur.
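A minimal sketch of those three manipulations using Pillow (the file name is a placeholder):

from PIL import Image, ImageFilter, ImageOps

img = Image.open("example.jpg")   # placeholder path

rotated = img.rotate(15)                                   # rotate by 15 degrees
flipped = ImageOps.mirror(img)                             # horizontal reflection
blurred = img.filter(ImageFilter.GaussianBlur(radius=2))   # Gaussian blur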

40. What is the idea behind GANs?

Solution: GANs, or generative adversarial networks, consist of two networks (D, G) where D is the "discriminator" network and G is the "generative" network. The goal is to create data — images, for instance — that are indistinguishable from real images. Suppose we want to generate images of cats. The network G generates images. The network D classifies images according to whether they are a cat or not. The cost function of G is constructed such that it tries to "fool" D — to make D classify its output as a cat.

41. What are the advantages of using Batchnorm?

Solution: Batchnorm accelerates the training process. It also (as a byproduct of including some noise) has a regularizing effect.

42. What is multi-task learning? When should it be used?

Solution: Multi-task learning is useful when we have a small amount of data for some task and would benefit from training a model on a large dataset of another task. The parameters of the models are shared — either in a "hard" way (i.e., the same parameters) or a "soft" way (i.e., a regularization/penalty added to the cost function).
43. What is end-to-end learning? Give a few of its advantages.

Solution: End-to-end learning usually refers to a model which gets the raw data and directly outputs the desired outcome, with no intermediate tasks or feature engineering. It has several advantages, among them: there is no need to handcraft features, and it generally leads to lower bias.

44. What happens if we use a ReLU activation and then a sigmoid as the final layer?

Solution: Since ReLU always outputs a non-negative result, the network will constantly predict one class for all the inputs!

45. How to solve the exploding gradient problem?

Solution: A simple solution to the exploding gradient problem is gradient clipping — taking the gradient to be ±M when its absolute value is bigger than M, where M is some large number.
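A minimal NumPy sketch of that clipping rule (M and the gradient values are made up):

import numpy as np

M = 5.0                              # clipping threshold
grad = np.array([0.3, -12.0, 7.5])   # made-up gradient values

clipped = np.clip(grad, -M, M)       # clip each component to the range [-M, M]
print(clipped)  # [ 0.3 -5.   5. ]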

46. Is it necessary to shuffle the training data when using batch gradient descent?

Solution: No, because the gradient is calculated at each epoch using the entire training data, so shuffling does not make a difference.

47. When using mini-batch gradient descent, why is it important to shuffle the data?
Solution: otherwise, suppose we train a NN classifier and have two classes — A and B, and all the examples of class A come before those of class B in the training data. Without shuffling, each mini-batch would contain examples of only one class, so the gradient updates would repeatedly pull the model toward a single class at a time and training would be unstable.

48. Describe some hyperparameters for transfer learning.

Solution: How many layers to keep, how many layers to add, and how many to freeze.

49. Is dropout used on the test set?

Solution: No! Only on the training set. Dropout is a regularization technique that is applied during the training process.

50. Explain why dropout in a neural network acts as a regularizer.

Solution: There are several (related) explanations of why dropout works. It can be seen as a form of model averaging — at each step we "turn off" a part of the model and average the models we get. It also adds noise, which naturally has a regularizing effect. Finally, it leads to more sparsity of the weights and essentially prevents co-adaptation of neurons in the network.

51. Give examples in which a many-to-one RNN architecture is appropriate.

Solution: A few examples are sentiment analysis and gender recognition from speech.
52. When can’t we use BiLSTM? Explain what assumption has
to be made.

Solution: in any bi-directional model, we assume that we have access to the next elements of the sequence at a given "time". This is the case for text data (e.g., sentiment analysis, translation), but it is not the case for time-series data.

53. True/false: adding L2 regularization to an RNN can help with the vanishing gradient problem.

Solution: false! Adding L2 regularization will shrink the weights towards zero, which can actually make the vanishing gradients worse in some cases.

54. Suppose the training error/cost is high and that the validation cost/error is almost equal to it. What does it mean? What should be done?

Solution: this indicates underfitting. One can add more parameters, increase the complexity of the model, or lower the regularization.

55. Describe how L2 regularization can be explained as a sort of weight decay.

Solution: Suppose our cost function is C(w), and that we add a penalization c|w|². When using gradient descent, the iterations will look like:

w ← w − grad(C)(w) − 2cw = (1 − 2c)w − grad(C)(w)

In this equation, the weight is multiplied by a factor smaller than 1.
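A quick numeric sketch of that shrinking factor (the weight, penalty coefficient, and zero base gradient are made up to isolate the decay effect):

# With a zero base gradient and penalty c*w^2, the update w <- w - grad(C)(w) - 2*c*w
# reduces to w <- (1 - 2*c) * w, i.e. a pure decay of the weight.
c = 0.05
w = 1.0
for step in range(3):
    grad_C = 0.0                 # pretend the base cost contributes no gradient
    w = w - grad_C - 2 * c * w   # the L2 term shrinks w by the factor (1 - 2c)
print(w)  # 0.9 ** 3 = 0.729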

1. Presenting the meaning of Batch Normalization
This can be considered a very good question because it covers
most of the knowledge that candidates need to know when
working with a neural network model. You can answer
differently but need to clarify the following main ideas:
Batch Normalization is an effective method when training a neural network model. The goal of this method is to normalize the features (the output of each layer after going through the activation) to a zero-mean state with a standard deviation of 1. The opposite phenomenon is a non-zero mean. How does it affect model training?

 Firstly, non-zero mean is a phenomenon where the data is not distributed around the value of 0, but instead has most values greater than zero or less than zero. Combined with a high-variance problem, the data becomes very large or very small. This problem is common when training neural networks with many layers. When the features are not distributed within stable intervals (from small to large values), the optimization process of the network is affected.
 As we all know, optimizing a neural network requires derivative calculations. Assuming a simple layer calculation formula is y = (Wx + b), the derivative of y with respect to W is proportional to x (dy = dW·x). Thus the value of x directly affects the value of the derivative (of course, the concept of gradients in neural network models is not this simple, but theoretically x affects the derivative). Therefore, if x brings unstable changes, the derivative may be too big or too small, resulting in an unstable learning model. This also means we can use higher learning rates during training when using Batch Normalization.

 Batch normalization can help us avoid the phenomenon where the value of x falls into saturation after going through non-linear activation functions, so it makes sure that no activation goes either too high or too low. This helps weights that would probably never learn without batch normalization to be learned normally, and it reduces the dependence on the initial values of the parameters.

 Batch Normalization also acts as a form of regularization that helps to minimize overfitting. Using batch normalization, we won't need to use as much dropout, and this makes sense since we won't need to worry about losing too much information when we drop units from the network. However, it is still advisable to use a combination of both techniques.
2. Present the concept and trade-off
relationship between bias and
variance?
What is bias? Understandably, bias is the difference between
the average prediction of the current model and the actual
results that we need to predict. A model with a high bias
indicates that it is less focused on training data. This makes the
model too simple and does not achieve good accuracy on both
training and testing. This phenomenon is also known
as underfitting.

Variance can simply be understood as the spread of the model outputs on a data point. The larger the variance, the more likely it is that the model is paying too much attention to the training data and does not generalize to data it has never encountered. As a result, the model achieves extremely good results on the training data set, but very poor results on the test data set. This is the phenomenon of overfitting.

The correlation between these two concepts can be visualized with the classic bulls-eye diagram: the centre of the circle is a model that perfectly predicts the exact values. In practice, you never find such a good model. As we get farther away from the centre of the circle, our predictions get worse and worse.

We can change the model to increase the number of predictions that fall into the centre of the circle as much as possible. A balance between the Bias and Variance values is needed. If our model is too simple and has very few parameters, then it may have high bias and low variance.

Besides, if our model has a large number of parameters, then it will have high variance and low bias. This is the basis for deciding on the complexity of the model when designing the algorithm.
3. Suppose that a Deep Learning model has produced 10 million face vectors. How do you find a new face fastest by querying them?
This question is about the application of Deep Learning algorithms in practice; the key point is the method of indexing the data. This is the final step in the problem of applying One-Shot Learning for face recognition, but it is the most important step that makes this application easy to deploy in practice.

Basically, with this question, you should first present an overview of the face recognition method using One-Shot Learning. It can be understood simply as turning each face into a vector, so that recognizing a new face means finding the stored vectors that are closest to (most similar to) the input face. Usually, people use a deep learning model with a custom loss function called triplet loss to do that.

However, with the number of images mentioned above, calculating the distance to all 10 million vectors for every identification is not a smart solution and makes the system much slower. We need to think about methods of indexing the data in the real vector space in order to make the query more convenient.

The main idea of these methods is to organize the data into structures that are easy to query (often similar to a tree structure). When new data arrives, querying the tree helps to find the vector with the closest distance very quickly.
There are several methods that can be used for this purpose
such as Locality Sensitive Hashing — LSH, Approximate
Nearest Neighbors Oh Yeah — Annoy Indexing, Faiss…
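As one hedged illustration of such indexing, a sketch using the Annoy library (the 128-dimensional embedding size and the random vectors are placeholders; in a real system the vectors would come from the face model):

from annoy import AnnoyIndex
import random

dim = 128                            # assumed embedding size
index = AnnoyIndex(dim, "angular")

# Add placeholder face vectors (in practice these come from the face-embedding model)
for i in range(10_000):
    index.add_item(i, [random.random() for _ in range(dim)])

index.build(10)                      # build 10 trees; more trees = better accuracy, slower build

query = [random.random() for _ in range(dim)]
nearest_ids = index.get_nns_by_vector(query, 5)   # ids of the 5 closest stored faces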
4. With classification problem, is the
accuracy index completely reliable?
Which metrics do you usually use to
evaluate your model?
With a classification problem, there are many different ways to evaluate a model. As for accuracy, the formula simply takes the number of correctly predicted data points divided by the total number of data points. This sounds reasonable, but in reality, for imbalanced data problems, this quantity is not significant enough. Suppose we are building a prediction model for network attacks (assuming attack requests account for about 1 in 100,000 requests).

If the model predicts that all requests are normal, then its accuracy is still about 99.999%, and this figure is often unreliable for a classification model. The accuracy calculation above usually shows us what percentage of the data is correctly predicted, but it does not indicate how each class is classified in detail. Instead, we can use the Confusion matrix. Basically, the Confusion matrix shows how many data points actually belong to each class and to which class they are predicted to belong.
To express how the True Positive and False Positive rates change with each classification threshold, we have a graph called the Receiver Operating Characteristic — ROC. Based on the ROC we can tell whether the model is effective or not.

An ideal ROC curve hugs the top left corner (i.e., the True Positive rate is high while the False Positive rate stays low); the closer the curve gets to that corner, the better.
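A minimal sketch of both evaluations with scikit-learn (the label arrays are made up for illustration):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]                 # actual classes
y_pred   = [0, 1, 1, 1, 0, 0]                 # predicted classes
y_scores = [0.1, 0.6, 0.8, 0.9, 0.2, 0.4]     # predicted probabilities for class 1

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(roc_auc_score(y_true, y_scores))    # area under the ROC curve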
What is computer vision?

It's a subset of AI. Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from images or videos.
What languages are supported by computer vision libraries?
C++, Python, Matlab
What are computer vision libraries?
OpenCV - Python, Java
How many algorithms are in OpenCV?
2,500 optimized algorithms
What is CUDA?
What is OpenGL?
What are machine learning algorithms available in
opencv ?
Normal Bayes Classifier
K-Nearest Neighbors
Support Vector Machines
Decision Trees
Boosting
Gradient Boosted Trees
Random Trees
Extremely randomized trees
What is image stitching? How can you do it with OpenCV?
What is computational photography? How can you do it with OpenCV?
How can you connect your webcam to OpenCV?
How can you do object detection in OpenCV?
What are face recognition algorithms?
Haar Cascades, Eigenfaces, Fisherfaces
How does the Haar Cascades algorithm work?
How can you do face detection in OpenCV?
How can you detect an eye in OpenCV?
What is a Cascade Classifier in OpenCV?
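A hedged sketch of Haar-cascade face detection with OpenCV's bundled cascade file (the image path is a placeholder):

import cv2

# Load the pre-trained frontal-face Haar cascade that ships with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")                    # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # cascades work on grayscale images

# Returns (x, y, w, h) rectangles around detected faces
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)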
How can you detect corners of images using OpenCV?
How many types of image filters are there in OpenCV?

 Averaging
 Gaussian Filtering
 Median Filtering
 Bilateral Filtering

How can you do feature detection in OpenCV?

How many types of video filters are there in OpenCV?

 Color Conversion
 Thresholding
 Smoothing
 Morphology
 Gradients
 Canny Edge Detection
 Contours
 Histograms

How can you do image processing in OpenCV?

How can you do image compression in OpenCV?
How can you resize an image in OpenCV?
How can you convert a black-and-white image to a color image using computer vision?
What is video analysis in OpenCV?
How can you detect objects in video?
How would you create a 3D model of an object?
How can you remove red eye from photos using OpenCV?
How can you connect a GPU with OpenCV?
How can you integrate OpenCV with Android?
How can you integrate OpenCV with iOS?

Machine Learning Interview Questions

A collection of technical interview questions for machine learning and computer vision engineering positions.

1) What's the trade-off between bias and variance? [src]

If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand if our model has large number of parameters
then it’s going to have high variance and low bias. So we need to find the right/good
balance without overfitting and underfitting the data. [src]

2) What is gradient descent? [src]

[Answer]

3) Explain over- and under-fitting and how to combat them? [src]

[Answer]

4) How do you combat the curse of dimensionality? [src]

 Manual Feature Selection
 Principal Component Analysis (PCA)
 Multidimensional Scaling
 Locally linear embedding
[src]
5) What is regularization, why do we use it, and give some examples of common methods?
[src]

A technique that discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Examples:

 Ridge (L2 norm)
 Lasso (L1 norm)
The obvious disadvantage of ridge regression is model interpretability. It will shrink
the coefficients for least important predictors, very close to zero. But it will never
make them exactly zero. In other words, the final model will include all predictors.
However, in the case of the lasso, the L1 penalty has the effect of forcing some of the
coefficient estimates to be exactly equal to zero when the tuning parameter λ is
sufficiently large. Therefore, the lasso method also performs variable selection and is
said to yield sparse models. [src]

6) Explain Principal Component Analysis (PCA)? [src]

[Answer]

7) Why is ReLU better and more often used than Sigmoid in Neural Networks? [src]

Imagine a network with randomly initialized (or normalised) weights where almost 50% of the network yields 0 activation because of the characteristic of ReLU (output 0 for negative values of x). This means fewer neurons are firing (sparse activation) and the network is lighter. [src]
8) Given stride S and kernel sizes for each layer of a (1-dimensional) CNN, create a function to
compute the  receptive field  of a particular node in the network. This is just finding how many
input nodes actually connect through to a neuron in a CNN. [src]
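A hedged sketch of such a function for a stack of 1-D convolutional layers, using the standard recurrence in which each layer of kernel size k grows the receptive field by (k - 1) times the product of the preceding strides (the example architecture at the bottom is made up):

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf = 1        # a node in the input layer sees exactly one input element
    jump = 1      # distance, in input elements, between adjacent outputs so far
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Made-up architecture: three 1-D conv layers with kernel 3 and strides 1, 2, 2
print(receptive_field([(3, 1), (3, 2), (3, 2)]))  # 9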

9) Implement  connected components  on an image/matrix. [src]

10) Implement a sparse matrix class in C++. [src]

11) Create a function to compute an  integral image, and create another function to get area
sums from the integral image.[src]

12) How would you  remove outliers  when trying to estimate a flat plane from noisy samples?
[src]

13) How does  CBIR  work? [src]

14) How does image registration work? Sparse vs. dense  optical flow  and so on. [src]

15) Describe how convolution works. What about if your inputs are grayscale vs RGB imagery?
What determines the shape of the next layer? [src]

16) Talk me through how you would create a 3D model of an object from imagery and depth
sensor measurements taken at all angles around the object. [src]

17) Implement SQRT(const double & x) without using any special functions, just fundamental
arithmetic. [src]

18) Reverse a bitstring. [src]

19) Implement non maximal suppression as efficiently as you can. [src]

20) Reverse a linked list in place. [src]

21) What is data normalization and why do we need it? [src]

Data normalization is a very important preprocessing step, used to rescale values to fit
in a specific range to assure better convergence during backpropagation. In general,
it boils down to subtracting the mean of each data point and dividing by its standard
deviation. If we don't do this then some of the features (those with high magnitude)
will be weighted more in the cost function (if a higher-magnitude feature changes by
1%, then that change is pretty big, but for smaller features it's quite insignificant).
The data normalization makes all features weighted equally.
22) Why do we use convolutions for images rather than just FC layers? [src]

Firstly, convolutions preserve, encode, and actually use the spatial information from
the image. If we used only FC layers we would have no relative spatial information.
Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.

23) What makes CNNs translation invariant? [src]

As explained above, each convolution kernel acts as its own filter/feature detector.
So let's say you're doing object detection, it doesn't matter where in the image the
object is since we're going to apply the convolution in a sliding window fashion
across the entire image anyways.

24) Why do we have max-pooling in classification CNNs? [src]

Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don't lose too much semantic information since you're taking the maximum activation. There's also a theory that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from Andrew Ng on the benefits of max-pooling.

25) Why do segmentation CNNs typically have an encoder-decoder style / structure? [src]

The encoder CNN can basically be thought of as a feature extraction network, while
the decoder uses that information to predict the image segments by "decoding" the
features and upscaling to the original image size.

26) What is the significance of Residual Networks? [src]

The main thing that residual connections did was allow for direct feature access from
previous layers. This makes information propagation throughout the network much
easier. One very interesting paper about this shows how using local skip connections
gives the network a type of ensemble multi-path structure, giving features multiple
paths to propagate throughout the network.

27) What is batch normalization and why does it work? [src]

Training Deep Neural Networks is complicated by the fact that the distribution of
each layer's inputs changes during training, as the parameters of the previous layers
change. The idea is then to normalize the inputs of each layer in such a way that they
have a mean output activation of zero and standard deviation of one. This is done for
each individual mini-batch at each layer i.e compute the mean and variance of that
mini-batch alone, then normalize. This is analogous to how the inputs to networks
are standardized. How does this help? We know that normalizing the inputs to a
network helps it learn. But a network is just a series of layers, where the output of
one layer becomes the input to the next. That means we can think of any layer in a
neural network as the first layer of a smaller subsequent network. Thought of as a
series of neural networks feeding into each other, we normalize the output of one
layer before applying the activation function, and then feed it into the following layer
(sub-network).

28) Why would you use many small convolutional kernels such as 3x3 rather than a few large
ones? [src]

This is very well explained in the VGGNet paper. There are 2 reasons: First, you can
use several smaller kernels rather than few large ones to get the same receptive field
and capture more spatial context, but with the smaller kernels you are using less
parameters and computations. Secondly, because with smaller kernels you will be
using more filters, you'll be able to use more activation functions and thus have a
more discriminative mapping function being learned by your CNN.

29) Why do we need a validation set and test set? What is the difference between them? [src]

When training a model, we divide the available data into three separate sets:

 The training dataset is used for fitting the model’s parameters. However, the accuracy
that we achieve on the training set is not reliable for predicting if the model will be
accurate on new samples.
 The validation dataset is used to measure how well the model does on examples that
weren’t part of the training dataset. The metrics computed on the validation data can
be used to tune the hyperparameters of the model. However, every time we evaluate
the validation data and we make decisions based on those scores, we are leaking
information from the validation data into our model. The more evaluations, the more
information is leaked. So we can end up overfitting to the validation data, and once
again the validation score won’t be reliable for predicting the behaviour of the model
in the real world.
 The test dataset is used to measure how well the model does on previously unseen
examples. It should only be used once we have tuned the parameters using the
validation set.

So if we omit the test set and only use a validation set, the validation score won’t be
a good estimate of the generalization of the model.

30) What is stratified cross-validation and when should we use it? [src]

Cross-validation is a technique for dividing data between training and validation sets.
On typical cross-validation this split is done randomly. But in stratified cross-
validation, the split preserves the ratio of the categories on both the training and
validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B, and
we use stratified cross-validation, we will have the same proportions in training and
validation. In contrast, if we use simple cross-validation, in the worst case we may
find that there are no samples of category A in the validation set.

Stratified cross-validation may be applied in the following scenarios:

 On a dataset with multiple categories. The smaller the dataset and the more
imbalanced the categories, the more important it will be to use stratified cross-
validation.
 On a dataset with data of different distributions. For example, in a dataset for
autonomous driving, we may have images taken during the day and at night. If we do
not ensure that both types are present in training and validation, we will have
generalization problems.

31) Why do ensembles typically have higher scores than individual models? [src]

An ensemble is the combination of multiple models to create a single prediction. The key idea for making better predictions is that the models should make different errors. That way the errors of one model will be compensated by the right guesses of the other models, and thus the score of the ensemble will be higher.

We need diverse models for creating an ensemble. Diversity can be achieved by:

 Using different ML algorithms. For example, you can combine logistic regression, k-
nearest neighbors, and decision trees.
 Using different subsets of the data for training. This is called bagging.
 Giving a different weight to each of the samples of the training set. If this is done iteratively, weighting the samples according to the errors of the ensemble, it's called boosting.

Many winning solutions to data science competitions are ensembles. However, in real-life machine learning projects, engineers need to find a balance between execution time and accuracy.

32) What is an imbalanced dataset? Can you list some ways to deal with it? [src]

An imbalanced dataset is one that has different proportions of target categories. For
example, a dataset with medical images where we have to detect some illness will
typically have many more negative samples than positive samples—say, 98% of
images are without the illness and 2% of images are with the illness.

There are different options to deal with imbalanced datasets:


 Oversampling or undersampling. Instead of sampling with a uniform distribution
from the training dataset, we can use other distributions so the model sees a more
balanced dataset.
 Data augmentation. We can add data in the less frequent categories by modifying
existing data in a controlled way. In the example dataset, we could flip the images
with illnesses, or add noise to copies of the images in such a way that the illness
remains visible.
 Using appropriate metrics. In the example dataset, if we had a model that always made negative predictions, it would achieve an accuracy of 98%. Metrics such as precision, recall, and the F-score describe the performance of the model better when using an imbalanced dataset.

33) Can you explain the differences between supervised, unsupervised, and reinforcement
learning? [src]

In supervised learning, we train a model to learn the relationship between input data
and output data. We need to have labeled data to be able to do supervised learning.

With unsupervised learning, we only have unlabeled data. The model learns a
representation of the data. Unsupervised learning is frequently used to initialize the
parameters of the model when we have a lot of unlabeled data and a small fraction
of labeled data. We first train an unsupervised model and, after that, we use the
weights of the model to train a supervised model.

In reinforcement learning, the model has some input data and a reward depending
on the output of the model. The model learns a policy that maximizes the reward.
Reinforcement learning has been applied successfully to strategic games such as Go
and even classic Atari video games.

34) What is data augmentation? Can you give some examples? [src]

Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the target is not changed, or it is changed in a known way.

Computer vision is one of fields where data augmentation is very useful. There are
many modifications that we can do to images:

 Resize
 Horizontal or vertical flip
 Rotate
 Add noise
 Deform
 Modify colors

Each problem needs a customized data augmentation pipeline. For example, in OCR, doing flips will change the text and won't be beneficial; however, resizes and small rotations may help.
35) What is the Turing test? [src]

The Turing test is a method to test a machine's ability to match human-level intelligence. A machine challenges human intelligence, and when it passes the test it is considered intelligent. Yet a machine could be viewed as intelligent without knowing enough about people to mimic a human.

36) What is Precision?

Precision (also called positive predictive value) is the fraction of relevant instances
among the retrieved instances
Precision = true positive / (true positive + false positive)
[src]

37) What is Recall?

Recall (also known as sensitivity) is the fraction of relevant instances that have been
retrieved over the total amount of relevant instances. Recall = true positive / (true
positive + false negative)
[src]

38) Define F1-score. [src]

It is the weighted average of precision and recall. It takes both false positives and false negatives into account and is used to measure the model's performance.
F1-Score = 2 * (precision * recall) / (precision + recall)
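A small sketch computing all three metrics from raw counts (the counts are made up for illustration):

# Made-up counts from a confusion matrix
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                            # 0.8
recall    = tp / (tp + fn)                            # approx. 0.67
f1 = 2 * (precision * recall) / (precision + recall)  # approx. 0.73
print(precision, recall, f1)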

39) What is cost function? [src]

A cost function is a scalar function which quantifies the error of the Neural Network: the lower the cost function, the better the Neural Network. E.g.: on the MNIST dataset for classifying digit images, the input image is a 2 and the Neural Network wrongly predicts it to be a 3.

40) List different activation neurons or functions. [src]

 Linear Neuron
 Binary Threshold Neuron
 Stochastic Binary Neuron
 Sigmoid Neuron
 Tanh function
 Rectified Linear Unit (ReLU)
41) Define Learning rate.

Learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. [src]

42) What is Momentum (w.r.t NN optimization)?

Momentum lets the optimization algorithm remember its last step and adds some proportion of it to the current step. This way, even if the algorithm is stuck in a flat region or a small local minimum, it can get out and continue towards the true minimum. [src]

43) What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?

Batch gradient descent computes the gradient using the whole dataset. This is great for convex, or relatively smooth, error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.

Stochastic gradient descent (SGD) computes the gradient using a single sample. SGD
works well (Not well, I suppose, but better than batch gradient descent) for error
manifolds that have lots of local maxima/minima. In this case, the somewhat noisier
gradient calculated using the reduced number of samples tends to jerk the model
out of local minima into a region that hopefully is more optimal. [src]

44) Epoch vs Batch vs Iteration.

Epoch: one forward pass and one backward pass of all the training examples
Batch: examples processed together in one pass (forward and backward)
Iteration: number of training examples / Batch size

45) What is vanishing gradient? [src]

As we add more and more hidden layers, back propagation becomes less and less
useful in passing information to the lower layers. In effect, as information is passed
back, the gradients begin to vanish and become small relative to the weights of the
networks.

46) What are dropouts? [src]

Dropout is a regularization technique in which randomly selected units are ignored ("dropped out") during training, which prevents co-adaptation of neurons and reduces overfitting.
47) Define LSTM. [src]

Long Short-Term Memory networks are explicitly designed to address the long-term dependency problem by maintaining a state that decides what to remember and what to forget.

48) List the key components of LSTM. [src]

 Gates (Forget, Memory, Update & Read)
 tanh(x) (values between -1 and 1)
 Sigmoid(x) (values between 0 and 1)

49) List the variants of RNN. [src]

 LSTM: Long Short Term Memory
 GRU: Gated Recurrent Unit
 End to End Network
 Memory Network

50) What is Autoencoder, name few applications. [src]

An autoencoder is basically used to learn a compressed form of the given data. A few applications include:

 Data denoising
 Dimensionality reduction
 Image reconstruction
 Image colorization

51) What are the components of GAN? [src]

 Generator
 Discriminator

52) What's the difference between boosting and bagging?

Boosting and bagging are similar, in that they are both ensembling techniques,
where a number of weak learners (classifiers/regressors that are barely better than
guessing) combine (through averaging or max vote) to create a strong learner that
can make accurate predictions. Bagging means that you take bootstrap samples
(with replacement) of your data set and each sample trains a (potentially) weak
learner. Boosting, on the other hand, uses all data to train each learner, but instances
that were misclassified by the previous learners are given more weight so that
subsequent learners give more focus to them during training. [src]

53) Explain how a ROC curve works.  [src]

The ROC curve is a graphical representation of the contrast between true positive
rates and the false positive rate at various thresholds. It’s often used as a proxy for
the trade-off between the sensitivity of the model (true positives) vs the fall-out or
the probability it will trigger a false alarm (false positives).

54) What’s the difference between Type I and Type II error?  [src]

Type I error is a false positive, while Type II error is a false negative. Briefly stated,
Type I error means claiming something has happened when it hasn’t, while Type II
error means that you claim nothing is happening when in fact something is. A clever
way to think about this is to think of Type I error as telling a man he is pregnant,
while Type II error means you tell a pregnant woman she isn’t carrying a baby.

55) What’s the difference between a generative and discriminative model?  [src]

A generative model will learn categories of data while a discriminative model will
simply learn the distinction between different categories of data. Discriminative
models will generally outperform generative models on classification tasks.
