
DL KIET Model Question Paper

Module-1
1. What is a Biological Neuron?
Biological Neuron
• Neurons are the nervous system's basic functional components; they generate electrical
signals called action potentials that allow them to send information rapidly over large distances.
• Neurons inspired the 'Perceptron' in AI, just as birds inspired humans to create
airplanes and four-legged animals inspired us to develop cars.
Components and Working of Biological Neural Networks
• Dendrites: Receive signals (or information) from outside.
• Soma: Analyzes the incoming signals and decides whether or not the information should be
forwarded on.
• Axon: Sends signals to target cells, which can include other neurons, muscles, or glands.
• Synapse: The link between an axon and the dendrites of other neurons.

2. What is the difference between a biological neuron and a Perceptron?
The perceptron is a mathematical model of a biological neuron. While in actual neurons the dendrite
receives electrical signals from the axons of other neurons, in the perceptron these electrical signals
are represented as numerical values. At the synapses between dendrites and axons, electrical
signals are modulated in various amounts; this is modeled in the perceptron by multiplying each
input value by a value called the weight. An actual neuron fires an output signal only when the total
strength of the input signals exceeds a certain threshold. We model this phenomenon in a perceptron by
calculating the weighted sum of the inputs to represent the total strength of the input signals, and
applying a step function to the sum to determine its output. As in biological neural networks, this
output is fed to other perceptrons.
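
As a minimal sketch of this computation in Python (the input values, weights, and threshold below are illustrative assumptions, not values from any particular model):

    # Weighted sum of inputs followed by a step function, as described above.
    def perceptron(inputs, weights, threshold):
        total = sum(w * x for w, x in zip(weights, inputs))  # total signal strength
        return 1 if total > threshold else 0                 # step function: fire or not

    # Three input signals, each modulated by a weight (the "synapse").
    print(perceptron([1.0, 0.5, 0.2], [0.4, 0.3, 0.9], threshold=0.5))  # -> 1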

3. Who were McCulloch and Pitts, and what did they do?

McCulloch-Pitts Neuron

The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.

It may be divided into 2 parts. The first part, g, takes an input (ahem, dendrite, ahem), performs an aggregation, and based on the aggregated value the second part, f, makes a decision.

Let's suppose that I want to predict my own decision, whether to watch a random football game or not on TV. The inputs are all boolean, i.e., {0,1}, and my output variable is also boolean {1: Will watch it, 0: Won't watch it}.

• So, x_1 could be isPremierLeagueOn (I like Premier League more)
• x_2 could be isItAFriendlyGame (I tend to care less about the friendlies)
• x_3 could be isNotHome (Can't watch it when I'm running errands. Can I?)
• x_4 could be isManUnitedPlaying (I am a big Man United fan. GGMU!) and so on.

These inputs can either be excitatory or inhibitory. Inhibitory inputs are those that have maximum effect on the decision making irrespective of other inputs, i.e., if x_3 is 1 (not home) then my output will always be 0, i.e., the neuron will never fire, so x_3 is an inhibitory input. Excitatory inputs are NOT the ones that will make the neuron fire on their own, but they might fire it when combined together. Formally, this is what is going on:

g(x_1, x_2, ..., x_n) = Σ x_i,   y = f(g(x)) = 1 if g(x) ≥ θ, otherwise 0

We can see that g(x) is just doing a sum of the inputs, a simple aggregation. And θ (theta) here is called the thresholding parameter. For example, if I always watch the game when the sum turns out to be 2 or more, then θ is 2 here. This is called Thresholding Logic.
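
A small sketch of this neuron for the football example (the variable names follow the text; θ = 2 is the example threshold):

    # McCulloch-Pitts neuron: sum the boolean inputs, then threshold,
    # unless an inhibitory input forces the output to 0.
    def mp_neuron(x, inhibitory, theta):
        if any(x[i] == 1 for i in inhibitory):   # active inhibitory input: never fire
            return 0
        g = sum(x)                               # g(x): simple aggregation
        return 1 if g >= theta else 0            # f(g): thresholding logic

    # x = [isPremierLeagueOn, isItAFriendlyGame, isNotHome, isManUnitedPlaying]
    print(mp_neuron([1, 0, 0, 1], inhibitory=[2], theta=2))  # fires -> 1
    print(mp_neuron([1, 0, 1, 1], inhibitory=[2], theta=2))  # not home -> 0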

4. Explain the Linear Perceptron.

It is a binary linear classifier for supervised learning. The idea behind the binary linear classifier can be described as follows:

h(x) = sign(θ · x + θ₀)

where x is the feature vector, θ is the weight vector, and θ₀ is the bias. The sign function is used to distinguish x as either a positive (+1) or a negative (-1) label. There is a decision boundary to separate the data with different labels, which occurs at

θ · x + θ₀ = 0

The decision boundary separates the hyperplane into two regions. The data will be labeled as positive in the region where θ · x + θ₀ > 0, and labeled as negative in the region where θ · x + θ₀ < 0. If all the instances in a given dataset are linearly separable, there exist a θ and a θ₀ such that y⁽ⁱ⁾(θ · x⁽ⁱ⁾ + θ₀) > 0 for every i-th data point, where y⁽ⁱ⁾ is the label.

Figure 1 illustrates the aforementioned concepts in the 2-D case, where x = [x₁ x₂]ᵀ, θ = [θ₁ θ₂] and θ₀ is an offset scalar. Note that the margin boundaries are related to regularization to prevent overfitting of the data, which is beyond the scope discussed here.
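
A minimal sketch of the classifier and the separability condition above (the toy 2-D data and parameter values are illustrative assumptions):

    import numpy as np

    # h(x) = sign(theta . x + theta_0)
    def classify(x, theta, theta_0):
        return 1 if np.dot(theta, x) + theta_0 > 0 else -1

    theta, theta_0 = np.array([1.0, 1.0]), -1.5   # decision boundary: x1 + x2 = 1.5
    X = [np.array([1, 1]), np.array([0, 0])]
    y = [1, -1]

    # Separability check: y_i * (theta . x_i + theta_0) > 0 for every i.
    print(all(y_i * (np.dot(theta, x_i) + theta_0) > 0
              for x_i, y_i in zip(X, y)))          # True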

5. Describe the Perceptron Learning Algorithm.
Following are the major components of a perceptron:

o Input: All the features become the input for a perceptron. We denote the input of a
perceptron by [x1, x2, x3, .., xn], where x represents the feature value and n represents the
total number of features. We also have a special kind of input called the bias, whose weight
is usually denoted w0.
o Weights: The values that are learned during training of the model. Initially, we
start the weights with some initial value, and these values get updated for each
training error. We represent the weights of a perceptron by [w1, w2, w3, .., wn].
o Bias: A bias neuron allows a classifier to shift the decision boundary left or right. In
algebraic terms, the bias neuron allows a classifier to translate its decision boundary; it aims
to "move every point a constant distance in a specified direction." Bias helps to train the
model faster and with better quality.
o Weighted summation: The sum of the values that we get after multiplying each
weight [wi] by its associated feature value [xi]. We represent the weighted summation
by ∑wixi for all i in [1 to n].
o Step/activation function: Activation functions make neural networks nonlinear. The
perceptron passes the weighted summation through a step function to turn it into a
binary decision.
o Output: The weighted summation is passed to the step/activation function, and whatever
value we get after computation is our predicted output. A short training sketch follows below.
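
A minimal sketch of the perceptron learning algorithm built from these components (the AND dataset, learning rate, and epoch count are illustrative assumptions):

    import numpy as np

    def train_perceptron(X, d, eta=0.1, epochs=20):
        w = np.zeros(X.shape[1] + 1)               # weights, with bias as w[0]
        for _ in range(epochs):
            for x, target in zip(X, d):
                x = np.insert(x, 0, 1.0)           # prepend the bias input (+1)
                y = 1 if np.dot(w, x) > 0 else 0   # step activation
                w += eta * (target - y) * x        # update only on a training error
        return w

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    d = np.array([0, 0, 0, 1])                     # AND is linearly separable
    print(train_perceptron(X, d))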

6. What is Linear Separability?


Linear separability
Linear separability is an important concept in neural networks. The idea is to check if you
can separate points in an n-dimensional space using only n-1 dimensions. Lost it? Here's a
simpler explanation.

One Dimension
Let's say you're on a number line. You take any two numbers. Now, there are two
possibilities:

1. You choose two different numbers
2. You choose the same number

If you choose two different numbers, you can always find another number between them.
This number "separates" the two numbers you chose.

[Figure: one-dimensional separability]

So, you say that these two numbers are "linearly separable".
But, if both numbers are the same, you simply cannot separate them. They're the same. So,
they're "linearly inseparable". (Not just linearly, they aren't separable at all. You cannot
separate something from itself.)

Two Dimensions
On extending this idea to two dimensions, some more possibilities come into existence.
Consider the following:

[Figure: two classes of points]

Here, we'd like to separate the point (1,1) from the other points. You can see that there
exists a line that does this. In fact, there exist infinitely many such lines. So, these two
"classes" of points are linearly separable. The first class consists of the point (1,1) and the
other class has (0,1), (1,0) and (0,0).
Now consider this:

[Figure: linearly inseparable classes]

In this case, you just cannot use one single line to separate the two classes (one containing
the black points and one containing the red points). So, they are linearly inseparable.

Three dimensions
Extending the above example to three dimensions, you need a plane to separate the two
classes.

[Figure: linear separability in 3D space]

The dashed plane separates the red point from the other blue points. So it's linearly
separable. If the bottom-right point on the opposite side were red too, it would become
linearly inseparable.

Extending to n dimensions
Things go up to a lot of dimensions in neural networks. So to separate classes in n
dimensions, you need an (n-1)-dimensional "hyperplane".

7. Please explain the convergence theorem for the Perceptron Learning Algorithm.

Perceptron Convergence Theorem:

In the classification of linearly separable patterns belonging to two classes only, the
training task for the classifier is to find the weight vector w such that

wᵀx > 0 for each x ∈ X₁
wᵀx < 0 for each x ∈ X₂

Completion of training with the fixed-correction training rule, for any initial weight vector
and any correction increment constant, leads to the following weights:

w* = w_{k0} = w_{k0+1} = w_{k0+2} = ...

with w* as the solution vector for the conditions above. The integer k₀ is the training step
number starting at which no more misclassifications occur, and thus no weight adjustments
take place, for k₀ ≥ 0. This theorem is called the "Perceptron Convergence Theorem".

The Perceptron Convergence Theorem states that a classifier for two linearly separable
classes of patterns is always trainable in a finite number of training steps.

In summary, the training of a single discrete perceptron two-class classifier requires a
change of weights if and only if a misclassification occurs. If the reason for misclassification
is wᵀx < 0, then all weights are increased in proportion to xᵢ; if wᵀx > 0, then all weights
are decreased in proportion to xᵢ.

Summary of the Perceptron Convergence Algorithm:

Variables and Parameters:
x(n) = (m+1)-by-1 input vector = [+1, x1(n), x2(n), ..., xm(n)]ᵀ
w(n) = (m+1)-by-1 weight vector = [b(n), w1(n), w2(n), ..., wm(n)]ᵀ
b(n) = bias
y(n) = actual response
d(n) = desired response
η = learning-rate parameter, a positive constant less than unity

1. Initialization: Set w(0) = 0, then perform the following computations for time steps n = 1, 2, ...
2. Activation: At time step n, activate the perceptron by applying input vector x(n) and
desired response d(n).
3. Computation of actual response: Compute the actual response of the perceptron:
y(n) = sgn[wᵀ(n)x(n)]
4. Adaptation of weight vector: Update the weight vector of the perceptron:
w(n+1) = w(n) + η[d(n) − y(n)]x(n)
5. Continuation: Increment time step n by 1 and go back to step 2.

Module-2
1. What is a Feedforward Neural Network?

A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle. The opposite of a feed forward neural network is
a recurrent neural network, in which certain pathways are cycled. The feed forward model is
the simplest form of neural network, as information is only processed in one direction. While
the data may pass through multiple hidden nodes, it always moves in one direction and never
backwards.


How does a Feed Forward Neural Network work?

A Feed Forward Neural Network is commonly seen in its simplest form as a single
layer perceptron. In this model, a series of inputs enter the layer and are multiplied by the
weights. The weighted values are then added together to get a sum of the weighted input
values. If the sum of the values is above a specific threshold, usually set at zero, the value
produced is often 1, whereas if the sum falls below the threshold, the output value is -1. The
single layer perceptron is an important model of feed forward neural networks and is often
used in classification tasks. Furthermore, single layer perceptrons can incorporate aspects of
machine learning. Using a property known as the delta rule, the neural network can compare
the outputs of its nodes with the intended values, allowing the network to adjust its weights
through training in order to produce more accurate output values. This process of training
and learning is a form of gradient descent. In multi-layered perceptrons, the process of
updating weights is analogous; however, the process is defined more specifically as
back-propagation. In such cases, each hidden layer within the network is adjusted according
to the output values produced by the final layer.
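
As a minimal sketch of this one-way flow of information (the layer sizes and random weights below are illustrative assumptions), one forward pass through a small network might look like:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        h = np.tanh(W1 @ x + b1)    # hidden layer: information flows one way
        return W2 @ h + b2          # output layer: no cycles, no feedback

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
    print(forward(x, W1, b1, W2, b2))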


Applications of Feed Forward Neural Networks

While Feed Forward Neural Networks are fairly straightforward, their simplified architecture
can be used as an advantage in particular machine learning applications. For example, one
may set up a series of feed forward neural networks with the intention of running them
independently from each other, but with a mild intermediary for moderation. Like the human
brain, this process relies on many individual neurons in order to handle and process larger
tasks. As the individual networks perform their tasks independently, the results can be
combined at the end to produce a synthesized and cohesive output.

2. Explain the Multilayer Perceptron in your own words.

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of nodes
connected as a directed graph between the input and output layers, which means that the
signal path through the nodes only goes one way. Each node, apart from the input nodes,
has a nonlinear activation function. An MLP uses backpropagation as a supervised learning
technique for training the network. Since there are multiple layers of neurons, MLP is a deep
learning technique.
MLP is widely used for solving problems that require supervised learning, as well as in
research into computational neuroscience and parallel distributed processing. Applications
include speech recognition, image recognition and machine translation.

3. What is Gradient Descent, and how is it useful for neural networks?

Gradient descent is an optimization algorithm commonly used to train machine learning
models and neural networks. Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer, gauging accuracy with
each iteration of parameter updates. Until the function is close to or equal to zero, the model
will continue to adjust its parameters to yield the smallest possible error. Once machine
learning models are optimized for accuracy, they can be powerful tools for artificial
intelligence (AI) and computer science applications.
Types of Gradient Descent
Batch gradient descent

Batch gradient descent sums the error for each point in a training set, updating the model only
after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computational efficiency, it can still have a long processing time
for large training datasets, as it needs to store all of the data in memory. Batch gradient
descent also usually produces a stable error gradient and convergence, but sometimes that
convergence point isn't ideal, finding a local minimum rather than the global one.

Stochastic gradient descent

Stochastic gradient descent (SGD) runs a training epoch for each example within the dataset,
updating the parameters one training example at a time. Since only one training example
needs to be held at a time, it is easier to store in memory. While these frequent updates can
offer more detail and speed, they can result in losses in computational efficiency compared
to batch gradient descent. The frequent updates also result in noisy gradients, but this can
be helpful in escaping a local minimum and finding the global one.

Mini-batch gradient descent

Mini-batch gradient descent combines concepts from both batch gradient descent and
stochastic gradient descent. It splits the training dataset into small batches and performs
updates on each of those batches. This approach strikes a balance between the computational
efficiency of batch gradient descent and the speed of stochastic gradient descent.
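
A sketch contrasting the three variants: they differ only in how many examples contribute to each gradient estimate (the synthetic data, learning rate, and batch size below are illustrative assumptions):

    import numpy as np

    def gradient(w, X, y):
        return 2 * X.T @ (X @ w - y) / len(X)    # gradient of mean squared error

    rng = np.random.default_rng(0)
    X, true_w = rng.normal(size=(100, 2)), np.array([3.0, -1.0])
    y = X @ true_w

    w, lr = np.zeros(2), 0.1
    batch_size = 10        # batch GD: len(X); SGD: 1; mini-batch: in between
    for epoch in range(50):
        idx = rng.permutation(len(X))            # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * gradient(w, X[b], y[b])    # update after each batch
    print(w)                                     # approaches [3, -1]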

4. Explain Backpropagation.

The Backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize
the error function are then considered to be a solution to the learning problem.
Let's understand how it works with an example. You have a dataset which has labels.
Consider the table below:

Input   Desired Output
0       0
1       2
2       4

Now here is the output of your model when the weight 'W' is 3:

Input   Desired Output   Model Output (W=3)
0       0                0
1       2                3
2       4                6

Notice the difference between the actual output and the desired output:

Input   Desired Output   Model Output (W=3)   Absolute Error   Square Error
0       0                0                    0                0
1       2                3                    1                1
2       4                6                    2                4

Let's change the value of 'W'. Notice the error when 'W' = 4:

Input   Desired Output   Model Output (W=3)   Absolute Error   Square Error   Model Output (W=4)   Square Error (W=4)
0       0                0                    0                0              0                    0
1       2                3                    1                1              4                    4
2       4                6                    2                4              8                    16

Now if you notice, when we increase the value of 'W' the error has increased. So, obviously,
there is no point in increasing the value of 'W' further. But what happens if I decrease the
value of 'W'? Consider the table below:

Input   Desired Output   Model Output (W=3)   Absolute Error   Square Error   Model Output (W=2)   Square Error (W=2)
0       0                0                    0                0              0                    0
1       2                3                    1                1              2                    0
2       4                6                    2                4              4                    0

With W = 2 the error is zero for every example, so W = 2 is the weight we were looking for.
Backpropagation automates exactly this search: it propagates the error backwards through the
network and adjusts each weight in the direction that reduces the error.
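
The same search for 'W' can be done by gradient descent instead of manual trial and error. A minimal sketch for the toy dataset above (model: output = W * input; the learning rate is an illustrative assumption):

    inputs, desired = [0, 1, 2], [0, 2, 4]

    W, lr = 3.0, 0.05                        # start from the W = 3 guess
    for step in range(100):
        # derivative of the squared error sum((W*x - d)^2) with respect to W
        grad = sum(2 * (W * x - d) * x for x, d in zip(inputs, desired))
        W -= lr * grad                       # move W against the gradient
    print(round(W, 3))                       # converges to 2.0, where the error is 0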

5. Explain Regularization. Why is it used in neural networks?

Regularization is a set of strategies used in machine learning to reduce the generalization
error. Most models, after training, perform very well on a specific subset of the overall
population but fail to generalize well. This is also known as overfitting. Regularization
strategies aim to reduce overfitting while keeping the training error as low as possible.

6. What is an Autoencoder? And why is it called unsupervised learning?

An autoencoder neural network is an unsupervised machine learning algorithm that
applies backpropagation, setting the target values to be equal to the inputs. Autoencoders are
used to reduce the size of our inputs into a smaller representation. If anyone needs the
original data, they can reconstruct it from the compressed data.

We have a similar machine learning algorithm, i.e., PCA, which does the same task. So you
might be wondering why we need autoencoders at all.

Autoencoders are preferred over PCA because:

▪ An autoencoder can learn non-linear transformations with a non-linear activation
function and multiple layers.
▪ It doesn't have to learn only dense layers. It can use convolutional layers, which are
better for video, image and series data.
▪ It is more efficient to learn several layers with an autoencoder than to learn one
huge transformation with PCA.
▪ An autoencoder provides a representation of each layer as the output.
▪ It can make use of pre-trained layers from another model to apply transfer learning
to enhance the encoder/decoder.

7. Explain difficulties in training neural network.

I think that there are many challenges that neural networks face, but here are a few that I
consider major, where overcoming any one of them could result in a breakthrough.

• Priors: Baking prior knowledge into neural networks is an ongoing, active
area of research. The idea is that priors help with the curse of
dimensionality, which can enable models to learn faster, use less training data
and still be robust and accurate. Priors can be very important for truly
intelligent systems. Modern machine learning (ML) models lack priors,
except for a few such as convolution in convolutional neural networks
(CNN) and recurrent connections in recurrent neural networks (RNN).
• Memory: How does the brain store memories in a scalable manner? How
does it decide what to store in the first place? These are also ongoing
research problems. The use of softly accessible memory in the so-called
memory networks is very problematic and inefficient, though some efforts
like the differentiable neural computer (DNC) are somewhat promising in this
space.
• Reasoning: The fact that feedforward neural networks are general function
approximators limits them to learning mappings from one space to another.
Unfortunately, reasoning is not mapping-based; it is more of a recurrent
process. So can we just use RNNs for reasoning? In theory, yes:
RNNs are Turing complete, so they should be able to implement any
computational function that any computer, or even a human, can do. So
where is the issue? In practice it is not that simple, because learnability
and memory come into play. Some reasoning problems are not differentiable,
so it is hard to learn them with backpropagation plus gradient descent, and we
still need to figure out how large-scale memory should be implemented in
neural networks, because reasoning involves both short-term and long-term
memory to store facts. And no, this is not like the LSTM (long short-term
memory) network.
• Gradual learning: You might train a network on some task, and if you try
to train it on another, the model catastrophically forgets the previous task.
This is another area of active research, as efforts are being made towards
machines that learn gradually. There is a term called online machine
learning, which is about models that adapt to incoming data; this is in fact a
basic form of gradual learning, but it is extremely hard for neural networks
to achieve.
• Overfitting/underfitting: Neural networks normally have a large enough
capacity to store the whole training data, including noise. There are many
ways, like dropout and L2 & L1 regularization, to work around such a
problem, but modern neural networks still have a tendency to overfit. For
example, if a model is trained on synthetic data, say from a simulation or
game, it finds it hard to generalize to real-world data. So how do we
make models that generalize better? This is also an active area of
research.
• Large data requirements: With a huge number of parameters, neural net
models normally need a lot of training data, including data augmentation, in
order to be robust and usable in the real world. But single-shot or zero-shot
learning does take place in humans, animals and insects; how does
it occur? This is another active area of research. I am actually working on
trying to make data- and compute-efficient deep neural networks.

Probably overfitting and underfitting didn't deserve a spot there, but I feel that neural
networks still don't generalize as well as they should, given the amount of data they
consume during learning.

8. What is greedy layer-wise training?

Greedy Layer-wise Unsupervised Pretraining

• A representation learned for one task
  – unsupervised learning, which captures the shape of the input distribution
• is used for another task
  – supervised learning with the same input domain
• Greedy layer-wise pretraining relies on a single-layer representation learning algorithm.

Single-layer representation learning
• We need a single-layer representation learning algorithm, such as:
  – An RBM (a Markov network)
  – A single-layer autoencoder
  – A sparse coding model
  – Or another model that learns latent representations

Single-layer pretraining
• Each layer is pretrained using unsupervised learning,
  – taking the output of the previous layer and producing as output a new
    representation of the data,
• whose distribution (or relation to categories) is simpler.

A high-level sketch of this procedure follows below.
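
This is only a structural sketch: the single-layer learner is a stub standing in for an RBM or autoencoder, and all shapes are illustrative assumptions. A real implementation would optimize each layer's weights before moving on.

    import numpy as np

    def train_autoencoder(data, n_hidden):
        # Placeholder: would learn encoder weights minimizing reconstruction error.
        rng = np.random.default_rng(0)
        return rng.normal(scale=0.1, size=(data.shape[1], n_hidden))

    def greedy_pretrain(X, layer_sizes):
        weights, h = [], X
        for n_hidden in layer_sizes:
            W = train_autoencoder(h, n_hidden)   # pretrain one layer at a time
            h = np.tanh(h @ W)                   # its output feeds the next layer
            weights.append(W)
        return weights   # then fine-tune the whole stack with supervised learning

    layers = greedy_pretrain(np.random.default_rng(1).normal(size=(32, 8)), [6, 4])
    print([W.shape for W in layers])   # [(8, 6), (6, 4)]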

Module-3
1. What is Optimization, and why is it required in neural networks?

Optimization is the problem of finding a set of inputs to an objective function that results in
a maximum or minimum function evaluation.
It is a challenging problem that underlies many machine learning algorithms, from fitting
logistic regression models to training artificial neural networks.
There are perhaps hundreds of popular optimization algorithms, and perhaps tens of
algorithms to choose from in popular scientific code libraries. This can make it challenging to
know which algorithms to consider for a given optimization problem.
The different types of optimizers and their advantages are described below.

Gradient Descent

Gradient Descent is the most basic but most used optimization algorithm. It is used heavily in
linear regression and classification algorithms. Backpropagation in neural networks also uses a
gradient descent algorithm.

Gradient descent is a first-order optimization algorithm which depends on the first-order
derivative of a loss function. It calculates which way the weights should be altered so that the
function can reach a minimum. Through backpropagation, the loss is transferred from one
layer to another, and the model's parameters, also known as weights, are modified depending
on the losses so that the loss can be minimized.

Stochastic Gradient Descent

It's a variant of Gradient Descent that tries to update the model's parameters more frequently.
Here, the model parameters are altered after the computation of the loss on each training
example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000
times in one cycle through the dataset, instead of once as in Gradient Descent.

Mini-Batch Gradient Descent

It's the best among all the variations of gradient descent algorithms, an improvement on both
SGD and standard gradient descent. It updates the model parameters after every batch: the
dataset is divided into various batches, and after every batch the parameters are updated.

Momentum
Momentum was invented to reduce the high variance in SGD and to soften the convergence.
It accelerates convergence towards the relevant direction and reduces fluctuation in
irrelevant directions.

Nesterov Accelerated Gradient

Momentum may be a good method, but if the momentum is too high the algorithm may miss
the minimum and continue past it. To resolve this issue, the NAG algorithm was developed.
It is a look-ahead method: we know we'll be using γV(t−1) to modify the weights, so
θ − γV(t−1) approximately tells us the future location. Now we calculate the cost based on
this future parameter rather than the current one.

Adagrad

One disadvantage of all the optimizers explained so far is that the learning rate is constant
for all parameters and for each cycle. Adagrad changes the learning rate: it adapts the
learning rate 'η' for each parameter at every time step 't', based on the past gradients
computed for that parameter. It works on the derivative of an error function.

AdaDelta

It is an extension of AdaGrad which removes AdaGrad's decaying-learning-rate problem.
Instead of accumulating all previously squared gradients, Adadelta limits the window of
accumulated past gradients to some fixed size w; an exponentially moving average is used
rather than the sum of all the gradients.

Adam

Adam (Adaptive Moment Estimation) works with momentums of first and second order. The
intuition behind Adam is that we don't want to roll so fast that we jump over the minimum;
we want to decrease the velocity a little for a careful search. In addition to storing an
exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an
exponentially decaying average of past gradients, M(t).
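
A sketch of a few of the update rules discussed above, written for a single parameter vector w and gradient g (the hyperparameter values are common defaults, assumed here for illustration):

    import numpy as np

    lr, eps = 0.01, 1e-8                             # assumed hyperparameters

    def sgd(w, g):
        return w - lr * g                            # plain gradient step

    def momentum_step(w, g, v, gamma=0.9):
        v = gamma * v + lr * g                       # accumulate velocity
        return w - v, v

    def adagrad_step(w, g, G):
        G = G + g ** 2                               # sum of all squared gradients
        return w - lr * g / (np.sqrt(G) + eps), G

    def rmsprop_step(w, g, E, rho=0.9):
        E = rho * E + (1 - rho) * g ** 2             # moving average, not a sum
        return w - lr * g / (np.sqrt(E) + eps), E

    def adam_step(w, g, m, v, t, b1=0.9, b2=0.999):
        m = b1 * m + (1 - b1) * g                    # first moment (mean)
        v = b2 * v + (1 - b2) * g ** 2               # second moment (squared grads)
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, g = np.ones(3), np.array([0.2, -0.1, 0.4])
    print(sgd(w, g))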

2. Explain optimization methods in neural networks.
The main optimization methods used in neural networks are the optimizers described under
Question 1 above: gradient descent and its variants, momentum, NAG, Adagrad, AdaDelta
and Adam.

3. What is Stochastic Gradient Descent?


What is Gradient Descent?
Before explaining Stochastic Gradient Descent (SGD), let's first describe what Gradient
Descent is. Gradient Descent is a popular optimization technique in Machine Learning and
Deep Learning, and it can be used with most, if not all, of the learning algorithms. A
gradient is the slope of a function; it measures the degree of change of a variable in
response to the changes of another variable. Mathematically, gradient descent follows the
partial derivatives of the cost function with respect to its parameters; the greater the
gradient, the steeper the slope.
Starting from an initial value, Gradient Descent is run iteratively to find the optimal values
of the parameters that give the minimum possible value of the given cost function.
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Here, we discuss Stochastic Gradient Descent, or SGD.
Stochastic Gradient Descent (SGD):

The word 'stochastic' means a system or a process that is linked with random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the
whole data set for each iteration. In Gradient Descent, there is a term called "batch" which
denotes the total number of samples from a dataset that is used for calculating the gradient
for each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent,
the batch is taken to be the whole dataset. Using the whole dataset is really useful for
getting to the minima in a less noisy and less random manner, but the problem arises when
our datasets get big.
Suppose you have a million samples in your dataset. If you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples to complete one
iteration of Gradient Descent, and it has to be done for every iteration until the minimum is
reached. Hence, it becomes computationally very expensive to perform.
This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample,
i.e., a batch size of one, to perform each iteration. The sample is randomly shuffled and
selected for performing the iteration.

4. What is Mini-Batch Gradient Descent, and why is momentum required here?

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the
training dataset into small batches that are used to calculate model error and update model
coefficients.
Implementations may choose to sum the gradient over the mini-batch, which further reduces
the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic
gradient descent and the efficiency of batch gradient descent. It is the most common
implementation of gradient descent used in the field of deep learning. Because mini-batch
gradients are still noisy estimates, momentum is commonly added: it accumulates an
exponentially decaying average of past gradients, damping the oscillations of the updates
and smoothing convergence.

5. Explain Adam optimizer.

Adam is an optimization algorithm that can be used instead of the classical stochastic
gradient descent procedure to update network weights iteratively based on the training data.
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University
of Toronto in their 2015 ICLR paper (poster) titled "Adam: A Method for Stochastic
Optimization". The algorithm is called Adam; it is not an acronym and is not written as
"ADAM".
When introducing the algorithm, the authors list the attractive benefits of using Adam on
non-convex optimization problems, as follows:
• Straightforward to implement.
• Computationally efficient.
• Little memory requirements.
• Invariant to diagonal rescaling of the gradients.
• Well suited for problems that are large in terms of data and/or parameters.
• Appropriate for non-stationary objectives.
• Appropriate for problems with very noisy and/or sparse gradients.
• Hyper-parameters have intuitive interpretation and typically require little tuning.

6. Explain Adagrad and RMSprop.


The Adaptive Gradient Algorithm (Adagrad) is an algorithm for gradient-based optimization.
The learning rate is adapted component-wise to the parameters by incorporating knowledge
of past observations. It performs larger updates (i.e. higher learning rates) for parameters
that are related to infrequent features, and smaller updates (i.e. lower learning rates) for
frequent ones. As a result, it is well suited when dealing with sparse data (NLP or image
recognition). Each parameter has its own learning rate, which improves performance on
problems with sparse gradients.
Advantages of Using AdaGrad

• It eliminates the need to manually tune the learning rate.
• Convergence is faster and more reliable than simple SGD when the scaling of the
weights is unequal.
• It is not very sensitive to the size of the master step.

RMSprop is a gradient-based optimization technique used in training neural networks. It was
proposed by the father of back-propagation, Geoffrey Hinton. Gradients of very complex
functions like neural networks have a tendency to either vanish or explode as the data
propagates through the function (refer to the vanishing gradients problem). RMSprop was
developed as a stochastic technique for mini-batch learning.

RMSprop deals with the above issue by using a moving average of squared gradients to
normalize the gradient. This normalization balances the step size, decreasing the step for
large gradients to avoid exploding, and increasing the step for small gradients to avoid
vanishing.

7. Why do we have to use second-order optimization?

Second-order optimization techniques are an advance over first-order optimization in neural
networks. They provide additional curvature information about the objective function, which
is used to adaptively estimate the step length along the optimization trajectory in the training
phase of a neural network.
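
A sketch of how curvature sets the step length: a second-order (Newton) step on a quadratic objective reaches the minimum in one update (the Hessian and gradient below are illustrative assumptions):

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])      # assumed positive-definite Hessian
    b = np.array([1.0, -2.0])

    def f_grad(w):
        return A @ w - b                        # gradient of 0.5 w^T A w - b^T w

    w = np.zeros(2)
    w = w - np.linalg.solve(A, f_grad(w))       # Newton step: w - H^{-1} grad
    print(np.allclose(f_grad(w), 0))            # True: one step reaches the minimum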
8. What is the use of Regularization?
Regularization is a technique used for tuning the function by adding an additional penalty
term to the error function. The additional term controls the excessively fluctuating function
so that the coefficients don't take extreme values. This technique of keeping a check on, or
reducing, the value of the coefficients is called a shrinkage method, or weight decay in the
case of neural networks.
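
A minimal sketch of such a penalty term, here L2 weight decay added to a squared error (the lambda value and toy data are illustrative assumptions):

    import numpy as np

    def regularized_loss(w, X, y, lam=0.01):
        error = np.mean((X @ w - y) ** 2)       # data-fit term
        penalty = lam * np.sum(w ** 2)          # shrinks coefficients toward zero
        return error + penalty

    w = np.array([0.5, -1.2])
    X, y = np.eye(2), np.array([1.0, -1.0])
    print(regularized_loss(w, X, y))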
9. List the regularization methods, with examples.
Dropout

One of the most frequently used regularization techniques is dropout. It essentially means
that during training, randomly selected neurons are turned off or 'dropped' out: they are
temporarily prevented from influencing or activating the downstream neurons in the forward
pass, and no weight updates are applied to them on the backward pass.

So if neurons are randomly dropped out of the network during training, the other neurons
step in and make the predictions for the missing neurons. This results in independent internal
representations being learned by the network, making the network less sensitive to the
specific weights of individual neurons. Such a network generalizes better and has fewer
chances of overfitting. A small sketch follows below.
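
A minimal sketch of inverted dropout applied to a layer's activations during training (the drop probability p is an illustrative assumption):

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        if not training:
            return activations                   # no dropout at inference time
        mask = np.random.rand(*activations.shape) > p   # randomly drop neurons
        return activations * mask / (1 - p)     # rescale to keep the expected value

    h = np.ones((2, 4))
    print(dropout(h, p=0.5))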

Early Stopping

It is a kind of cross-validation strategy where one part of the training set is used as a
validation set, and the performance of the model is gauged against this set. If the
performance on this validation set gets worse, training of the model is immediately
stopped.
The main idea behind this technique is that while fitting a neural network on training data,
the model is evaluated on the unseen data, or validation set, after each iteration. If the
performance on this validation set is decreasing or remaining the same for a certain number
of iterations, then the process of model training is stopped.

Data Augmentation

The simplest way to reduce overfitting is to increase the amount of data, and this technique
helps in doing so.
Data augmentation is a regularization technique generally used when we have images as
datasets. It generates additional data artificially from the existing training data by making
minor changes such as rotating, flipping, cropping, or blurring a few pixels in the image,
and this process generates more and more data. Through this regularization technique, the
model variance is reduced, which in turn decreases the generalization error.

Module-4
1. What is RNN (Recurrent Neural Network)?
A recurrent neural network is a type of artificial neural network commonly used in speech
recognition and natural language processing. Recurrent neural networks recognize data's
sequential characteristics and use patterns to predict the next likely scenario.
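
As a minimal sketch of this sequential processing (the weights and layer sizes below are illustrative assumptions), a single recurrent step carries a hidden state from earlier elements of the sequence:

    import numpy as np

    def rnn_step(x_t, h_prev, Wx, Wh, b):
        # The hidden state mixes the current input with the previous state.
        return np.tanh(Wx @ x_t + Wh @ h_prev + b)

    rng = np.random.default_rng(0)
    Wx, Wh, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
    h = np.zeros(4)
    for x_t in rng.normal(size=(5, 3)):         # five timesteps of 3-D input
        h = rnn_step(x_t, h, Wx, Wh, b)
    print(h)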
2. Explain LSTM (Long Short-Term Memory), with an example.

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable
of learning order dependence in sequence prediction problems. This is a behavior required
in complex problem domains like machine translation, speech recognition, and more.
LSTMs are a complex area of deep learning, and it can be hard to get your hands around
what LSTMs are and how terms like bidirectional and sequence-to-sequence relate to the
field. Few are better at clearly and precisely articulating both the promise of LSTMs and
how they work than the research scientists that developed the methods and applied them to
new and important problems, so their original papers are a good place to start.

3. Explain Bidirectional LSTM.

Bidirectional LSTMs are an extension of traditional LSTMs that can improve model
performance on sequence classification problems.
In problems where all timesteps of the input sequence are available, Bidirectional LSTMs
train two LSTMs instead of one on the input sequence: the first on the input sequence as-is,
and the second on a reversed copy of the input sequence. This can provide additional context
to the network and result in faster and even fuller learning on the problem.
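
A minimal Keras-style sketch of this idea (assuming TensorFlow is installed; the sequence length, feature count, and unit counts are illustrative assumptions). One LSTM reads the sequence as-is, the other a reversed copy, and their outputs are merged:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10, 8)),            # 10 timesteps, 8 features
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # sequence classification
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()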

4. What is CNN (Convolutional Neural Network)?


In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural
networks most commonly applied to analyzing visual imagery. When we think of a neural
network we think of matrix multiplications, but that is not the whole story with a ConvNet:
it uses a special technique called convolution. In mathematics, convolution is an operation
on two functions that produces a third function expressing how the shape of one is modified
by the other.
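
A minimal sketch of the convolution a CNN layer performs: a small filter slides over the image and produces a feature map (the filter and toy image below are illustrative assumptions):

    import numpy as np

    def conv2d(image, kernel):
        kh, kw = kernel.shape
        out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                # Elementwise product of the filter with one image patch, summed.
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    edge_filter = np.array([[1, 0, -1]] * 3)               # vertical-edge detector
    image = np.tile([1, 1, 0, 0], (4, 1)).astype(float)    # bright-to-dark edge
    print(conv2d(image, edge_filter))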
5. Explain the architectures of LeNet and AlexNet.
A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer
neural network, designed to recognize visual patterns directly from pixel images with
minimal preprocessing. The ImageNet project is a large visual database designed for use in
visual object recognition software research, and it runs an annual software contest (the
ILSVRC).
LeNet-5 (1998)

LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998 that classifies
digits, was applied by several banks to recognize hand-written numbers on checks (cheques)
digitized in 32x32 pixel greyscale input images. The ability to process higher-resolution
images requires larger and more convolutional layers, so this technique is constrained by the
availability of computing resources.
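
A Keras-style sketch of this layout, assuming TensorFlow is installed (the activation and pooling choices follow common reimplementations rather than the exact 1998 paper):

    import tensorflow as tf

    lenet5 = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),          # 32x32 greyscale input
        tf.keras.layers.Conv2D(6, 5, activation="tanh"),   # C1: 6 feature maps
        tf.keras.layers.AveragePooling2D(2),               # S2: subsampling
        tf.keras.layers.Conv2D(16, 5, activation="tanh"),  # C3: 16 feature maps
        tf.keras.layers.AveragePooling2D(2),               # S4: subsampling
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="tanh"),     # C5
        tf.keras.layers.Dense(84, activation="tanh"),      # F6
        tf.keras.layers.Dense(10, activation="softmax"),   # 10 digit classes
    ])
    lenet5.summary()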

AlexNet (2012)

In 2012, AlexNet significantly outperformed all the prior competitors and won the challenge
by reducing the top-5 error from 26% to 15.3%. The second-place top-5 error rate, which
was not a CNN variation, was around 26.2%.

The network had a very similar architecture to LeNet by Yann LeCun et al., but was deeper,
with more filters per layer, and with stacked convolutional layers. It consisted of 11x11, 5x5
and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations and SGD
with momentum. It attached ReLU activations after every convolutional and fully-connected
layer. AlexNet was trained for 6 days simultaneously on two Nvidia GeForce GTX 580
GPUs, which is the reason why the network is split into two pipelines. AlexNet was designed
by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya
Sutskever.

ZFNet (2013)

Not surprisingly, the ILSVRC 2013 winner was also a CNN, which became known as ZFNet.
It achieved a top-5 error rate of 14.8%, already half of the prior mentioned non-neural error
rate. It was mostly an achievement of tweaking the hyper-parameters of AlexNet while
maintaining the same structure, with additional deep learning elements.

GoogLeNet/Inception (2014)

The winner of the ILSVRC 2014 competition was GoogLeNet (a.k.a. Inception V1) from
Google. It achieved a top-5 error rate of 6.67%! This was very close to human-level
performance, which the organisers of the challenge were now forced to evaluate. As it turns
out, this was actually rather hard to do and required some human training in order to beat
GoogLeNet's accuracy. After a few days of training, the human expert (Andrej Karpathy)
was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The
network used a CNN inspired by LeNet but implemented a novel element dubbed the
inception module. It used batch normalization, image distortions and RMSprop. The
inception module is based on several very small convolutions in order to drastically reduce
the number of parameters. The architecture consists of a 22-layer deep CNN but reduces the
number of parameters from 60 million (AlexNet) to 4 million.

VGGNet (2014)

The runner-up at the ILSVRC 2014 competition was dubbed VGGNet by the community and
was developed by Simonyan and Zisserman. VGGNet consists of 16 convolutional layers and
is very appealing because of its very uniform architecture. Similar to AlexNet it uses only
3x3 convolutions, but lots of filters. It was trained on 4 GPUs for 2-3 weeks. It is currently
the most preferred choice in the community for extracting features from images. The weight
configuration of VGGNet is publicly available and has been used in many other applications
and challenges as a baseline feature extractor. However, VGGNet consists of 138 million
parameters, which can be a bit challenging to handle.

ResNet (2015)
At last, at the ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming
He et al. introduced a novel architecture with "skip connections" and features heavy batch
normalization. Such skip connections are also known as gated units or gated recurrent units
and have a strong similarity to recent successful elements applied in RNNs. Thanks to this
technique they were able to train a network with 152 layers while still having lower
complexity than VGGNet. It achieves a top-5 error rate of 3.57%, which beats human-level
performance on this dataset.

Module-5
1. Explain Autoencoders. How can we use them?

An autoencoder is a type of neural network where the output layer has the same
dimensionality as the input layer. In simpler words, the number of output units in the
output layer is equal to the number of input units in the input layer. An autoencoder
replicates the data from the input to the output in an unsupervised manner and is therefore
sometimes referred to as a replicator neural network.
The autoencoder reconstructs each dimension of the input by passing it through the
network. It may seem trivial to use a neural network for the purpose of replicating the
input, but during the replication process, the size of the input is reduced into its smaller
representation. The middle layers of the neural network have fewer units compared to the
input or output layers. Therefore, the middle layers hold the reduced representation of the
input, and the output is reconstructed from this reduced representation.

Architecture of autoencoders

An autoencoder consists of three components:

• Encoder: An encoder is a feedforward, fully connected neural network that
compresses the input into a latent-space representation, encoding the input
image as a compressed representation in a reduced dimension. The compressed
image is a distorted version of the original image.
• Code: This part of the network contains the reduced representation of the input
that is fed into the decoder.
• Decoder: The decoder is also a feedforward network like the encoder and has a
similar structure. This network is responsible for reconstructing the input back
to the original dimensions from the code. A minimal sketch of this structure
follows below.
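
A Keras-style sketch of the encoder-code-decoder structure, assuming TensorFlow is installed (the 784-dimensional input and 32-dimensional code are illustrative assumptions):

    import tensorflow as tf

    inputs = tf.keras.layers.Input(shape=(784,))
    code = tf.keras.layers.Dense(32, activation="relu")(inputs)       # encoder -> code
    outputs = tf.keras.layers.Dense(784, activation="sigmoid")(code)  # decoder

    autoencoder = tf.keras.Model(inputs, outputs)
    # The target equals the input: the network learns to reconstruct what it is fed.
    autoencoder.compile(optimizer="adam", loss="mse")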

2. What is a Generative Adversarial Network (GAN)? Give a real-world example where
GANs have been used.

Generative Adversarial Networks, or GANs for short, are an approach to generative modeling
using deep learning methods, such as convolutional neural networks.
Generative modeling is an unsupervised learning task in machine learning that involves
automatically discovering and learning the regularities or patterns in input data in such a way
that the model can be used to generate or output new examples that plausibly could have been
drawn from the original dataset.
GANs are a clever way of training a generative model by framing the problem as a
supervised learning problem with two sub-models: the generator model that we train to
generate new examples, and the discriminator model that tries to classify examples as either
real (from the domain) or fake (generated). The two models are trained together in a zero-sum,
adversarial game, until the discriminator model is fooled about half the time, meaning the
generator model is generating plausible examples.
GANs are an exciting and rapidly changing field, delivering on the promise of generative
models in their ability to generate realistic examples across a range of problem domains, most
notably in image-to-image translation tasks such as translating photos of summer to winter or
day to night, and in generating photorealistic photos of objects, scenes, and people that even
humans cannot tell are fake.

3. Explain NLP, and describe one use case of NLP.

Natural language processing (NLP) applications were developed by computer scientists to
ensure that human beings can communicate with computers in their natural language. For
computers to understand unstructured and often ambiguous human speech, they require
input from NLP applications. One common use case is machine translation, where an NLP
system reads text in one human language and produces the equivalent text in another.
4. What is a Variational Autoencoder?
Variational autoencoders (VAEs) are autoencoders that tackle the problem of latent-space
irregularity by making the encoder return a distribution over the latent space instead of a
single point, and by adding to the loss function a regularisation term over that returned
distribution, in order to ensure a better organisation of the latent space.
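
A minimal sketch of such a loss: reconstruction error plus a KL regularisation term pulling the returned latent distribution N(mu, sigma^2) toward N(0, 1) (squared-error reconstruction is an illustrative choice):

    import numpy as np

    def vae_loss(x, x_reconstructed, mu, log_var):
        reconstruction = np.sum((x - x_reconstructed) ** 2)
        kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
        return reconstruction + kl

    mu, log_var = np.zeros(2), np.zeros(2)
    print(vae_loss(np.ones(2), np.ones(2), mu, log_var))  # 0.0: perfect fit, KL = 0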

5. Explain the different types of Autoencoders.


There are, basically, 7 types of autoencoders:

• Denoising autoencoder
• Sparse Autoencoder
• Deep Autoencoder
• Contractive Autoencoder
• Undercomplete Autoencoder
• Convolutional Autoencoder
• Variational Autoencoder
1) Denoising Autoencoder
Denoising autoencoders create a corrupted copy of the input by introducing some noise. This
helps the autoencoder avoid copying the input to the output without learning features about
the data. These autoencoders take a partially corrupted input while training to recover the
original undistorted input. The model learns a vector field for mapping the input data
towards a lower-dimensional manifold which describes the natural data, cancelling out the
added noise.
Advantages-

• It was introduced to achieve a good representation: one that can be obtained robustly
from a corrupted input and that will be useful for recovering the corresponding clean
input.
• Corruption of the input can be done randomly by setting some of the inputs to zero;
the remaining nodes copy the input to the noised input.
• It minimizes the loss function between the output node and the corrupted input.
• Setting up a single-thread denoising autoencoder is easy.
Drawbacks-

• To train an autoencoder to denoise data, it is necessary to perform preliminary
stochastic mapping in order to corrupt the data and use it as input.
• This model isn't able to develop a mapping which memorizes the training data,
because the input and target output are no longer the same. A small sketch of the
corruption step follows below.
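
A minimal sketch of the denoising setup: corrupt the input by zeroing some values, then train against the clean original (the masking fraction is an illustrative assumption):

    import numpy as np

    def corrupt(x, drop_fraction=0.3):
        mask = np.random.rand(*x.shape) > drop_fraction   # randomly zero some inputs
        return x * mask

    x_clean = np.random.rand(64, 784)
    x_noisy = corrupt(x_clean)
    # Training pair: model(x_noisy) should reconstruct x_clean.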
2) Sparse Autoencoder
Sparse autoencoders have more hidden nodes than input nodes, yet they can still discover
important features from the data. In a generic sparse autoencoder visualization, the obscurity
of a node corresponds with its level of activation. A sparsity constraint is introduced on the
hidden layer to prevent the output layer from simply copying the input data. Sparsity may be
obtained by additional terms in the loss function during the training process, either by
comparing the probability distribution of the hidden unit activations with some low desired
value, or by manually zeroing all but the strongest hidden unit activations. Some of the most
powerful AIs in the 2010s involved sparse autoencoders stacked inside deep neural networks.
Advantages-

• Sparse autoencoders have a sparsity penalty, a value close to zero but not exactly
zero. The sparsity penalty is applied on the hidden layer in addition to the
reconstruction error. This prevents overfitting.
• They take the highest activation values in the hidden layer and zero out the rest of the
hidden nodes. This prevents the autoencoder from using all of the hidden nodes at a
time and forces only a reduced number of hidden nodes to be used.
Drawbacks-

• For it to work, it's essential that the individual nodes of a trained model which
activate are data-dependent, and that different inputs result in activations of
different nodes through the network.
3) Deep Autoencoder
Deep autoencoders consist of two identical deep belief networks: one network for encoding
and another for decoding. Typically deep autoencoders have 4 to 5 layers for encoding and
the next 4 to 5 layers for decoding. We use unsupervised layer-by-layer pre-training for this
model. The layers are Restricted Boltzmann Machines, which are the building blocks of
deep belief networks. Processing the benchmark dataset MNIST, a deep autoencoder would
use binary transformations after each RBM. Deep autoencoders are useful for topic
modeling, i.e. statistically modeling abstract topics that are distributed across a collection of
documents. They are also capable of compressing images into vectors of 30 numbers.
Advantages-

• Deep autoencoders can be used for other types of datasets with real-valued data, for
which you would use Gaussian rectified transformations for the RBMs instead.
• The final encoding layer is compact and fast.
Drawbacks-

• Overfitting may occur, since there are more parameters than input data.
• Training may be a nuisance, since at the stage of the decoder's backpropagation,
the learning rate should be lowered or made slower depending on whether binary or
continuous data is being handled.

4) Contractive Autoencoder
The objective of a contractive autoencoder is to have a robust learned representation which is
less sensitive to small variations in the data. Robustness of the representation is achieved by
applying a penalty term to the loss function. The contractive autoencoder is another
regularization technique, just like sparse and denoising autoencoders. However, this
regularizer corresponds to the Frobenius norm of the Jacobian matrix of the encoder
activations with respect to the input. The Frobenius norm of the Jacobian matrix for the
hidden layer is calculated with respect to the input, and it is basically the sum of squares of
all elements.
Advantages-

• A contractive autoencoder is a better choice than a denoising autoencoder for
learning useful feature extraction.
• This model learns an encoding in which similar inputs have similar encodings. Hence,
we're forcing the model to learn how to contract a neighborhood of inputs into a
smaller neighborhood of outputs.

5) Undercomplete Autoencoder
The objective of an undercomplete autoencoder is to capture the most important features
present in the data. Undercomplete autoencoders have a smaller dimension for the hidden
layer compared to the input layer. This helps to obtain important features from the data. It
minimizes the loss function by penalizing g(f(x)) for being different from the input x.
Advantages-

• Undercomplete autoencoders do not need any regularization, as they maximize the
probability of the data rather than copying the input to the output.
Drawbacks-
• Using an overparameterized model due to lack of sufficient training data can create
overfitting.
6) Convolutional Autoencoder
Autoencoders in their traditional formulation do not take into account the fact that a signal
can be seen as a sum of other signals. Convolutional autoencoders use the convolution
operator to exploit this observation. They learn to encode the input as a set of simple signals
and then try to reconstruct the input from them, modifying the geometry or the reflectance
of the image. They are the state-of-the-art tools for unsupervised learning of convolutional
filters. Once these filters have been learned, they can be applied to any input in order to
extract features. These features can then be used for any task that requires a compact
representation of the input, like classification.
Advantages-

• Due to their convolutional nature, they scale well to realistically sized high-
dimensional images.
• They can remove noise from a picture or reconstruct missing parts.
Drawbacks-

• The reconstruction of the input image is often blurry and of lower quality due to
compression, during which information is lost.

7) Variational Autoencoder
Variational autoencoder models make strong assumptions concerning the distribution of
latent variables. They use a variational approach for latent representation learning, which
results in an additional loss component and a specific estimator for the training algorithm
called the Stochastic Gradient Variational Bayes estimator. It assumes that the data is
generated by a directed graphical model and that the encoder is learning an approximation to
the posterior distribution, where Φ and θ denote the parameters of the encoder (recognition
model) and decoder (generative model) respectively. The probability distribution of the
latent vector of a variational autoencoder typically matches that of the training data much
more closely than a standard autoencoder's.
Advantages-
• It gives significant control over how we want to model our latent distribution, unlike
the other models.
• After training, you can simply sample from the distribution and then decode,
generating new data.
Drawbacks-

• When training the model, the relationship of each parameter in the network with
respect to the final output loss must be calculated, using a technique known as
backpropagation. Hence, the sampling process requires some extra attention.

Applications
Autoencoders work by compressing the input into a latent-space representation and then
reconstructing the output from this representation. This kind of network is composed of two
parts:

1. Encoder: This is the part of the network that compresses the input into a latent-space
representation. It can be represented by an encoding function h = f(x).
2. Decoder: This part aims to reconstruct the input from the latent-space representation.
It can be represented by a decoding function r = g(h).
If the only purpose of autoencoders were to copy the input to the output, they would be
useless. We hope that by training the autoencoder to copy the input to the output, the latent
representation will take on useful properties. This can be achieved by creating constraints on
the copying task. If the autoencoder is given too much capacity, it can learn to perform the
copying task without extracting any useful information about the distribution of the data. This
can also occur if the dimension of the latent representation is the same as the input, and in the
overcomplete case, where the dimension of the latent representation is greater than the input.
In these cases, even a linear encoder and linear decoder can learn to copy the input to the
output without learning anything useful about the data distribution. Ideally, one could train
any architecture of autoencoder successfully, choosing the code dimension and the capacity
of the encoder and decoder based on the complexity of the distribution to be modeled.
Autoencoders are learned automatically from data examples. This means that it is easy to
train specialized instances of the algorithm that will perform well on a specific type of input,
and that doing so does not require any new engineering, only the appropriate training data.
However, autoencoders will do a poor job at general-purpose image compression: as an
autoencoder is trained on a given set of data, it will achieve reasonable compression results
on data similar to the training set used, but will be a poor general-purpose image compressor.
Autoencoders are trained to preserve as much information as possible when an input is run
through the encoder and then the decoder, but are also trained to make the new representation
have various nice properties.
