Unit 5
UNIT V
NEURAL NETWORKS
Perceptron - Multilayer perceptron, activation functions, network training –
gradient descent optimization – stochastic gradient descent, error backpropagation,
from shallow networks to deep networks – Unit saturation (aka the vanishing gradient
problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout.
1. What is a perceptron? Explain single-layer and multilayer perceptrons with an
example.
PERCEPTRONS
The perceptron, first proposed by Rosenblatt (1958), is a simple neuron that is
used to classify its input into one of two categories. A perceptron is a single processing
unit of a neural network. This is a good learning tool. This model follows the perceptron
training rule and operates well with linearly separable patterns.
Linear separability is the separation of the input space into regions based
on whether the network response is positive or negative.
A perceptron uses a step function that returns +1 if the weighted sum of its input (v)
is greater than or equal to 0 else it returns -1.
Working of perceptrons
In the biological neurons, the dendrite receives the electrical signals from the
axons of other neurons. The signals are modulated in various amounts before
further transmission.
The signals are transmitted to other neurons only if the modulated signal
exceeds the threshold value. The same principle is applied in perceptron model.
In the perceptron, the input received is always represented as numerical values.
These values are multiplied by the weights.
The total strength of the input is calculated as the weighted sum of the inputs. A
step function (activation function) is applied to determine its output.
This output is fed to the other perceptrons if it exceeds the threshold value.
Fig. 2.4. Perceptron Model (In this model x0 = 1, which is the bias)
The data scientist chooses the activation function based on the problem statement and
the form of the desired output. The activation function may differ between perceptron
models (e.g., Sign, Step, and Sigmoid); the choice is guided by checking whether the
learning process is slow or has vanishing or exploding gradients.
How does Perceptron work?
In Machine Learning, the perceptron is considered a single-layer neural network that
consists of four main parts: input values (input nodes), weights
and bias, net sum, and an activation function. The perceptron model begins with the
multiplication of all input values by their weights, then adds these values together to
create the weighted sum. This weighted sum is then passed to the activation function 'f'
to obtain the desired output. In the basic perceptron, this activation function is the
step function.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight of
input is indicative of the strength of a node. Similarly, an input's bias value gives the
ability to shift the activation function curve up or down.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply all input values by their corresponding weight values and
then add them to determine the weighted sum. Mathematically, we can calculate the
weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted
sum, which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
If the weighted sum of all input values is more than the threshold value,
the perceptron produces an output signal; otherwise, no output is produced.
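The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the weights, and the bias are hypothetical values picked by hand so that the unit behaves like a logical AND gate; they do not come from the notes.

```python
def step(v):
    """Step activation: +1 if v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def perceptron_output(inputs, weights, bias):
    """Step-1: weighted sum plus bias. Step-2: apply the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)


# Hypothetical hand-picked parameters (an assumption for illustration):
# with these values the unit fires (+1) only when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output([x1, x2], and_weights, and_bias))
```

Only the input (1, 1) gives a non-negative weighted sum (0.5), so it is the only input classified as +1.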
Multi-layer Perceptron
MLP networks are used for supervised learning. The typical learning algorithm for
MLP networks is called the backpropagation algorithm.
A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of
nodes connected as a directed graph between the input and output layers. MLP uses
backpropagation for training the network. MLP is a deep learning method.
Activation functions are helpful because they introduce non-linearities into neural
networks and enable the networks to learn powerful operations. If the activation
functions were taken out, a feedforward neural network could be refactored into a
straightforward linear function or matrix transformation of its input.
The activation function takes the weighted sum of the inputs plus the bias and
determines whether a neuron should be turned on. Its purpose is to introduce
nonlinearity into the neuron's output.
As you can see, the linear function is a straight line. Therefore, the output of the function
will not be confined to any range.
Equation : f(x) = x
It doesn’t help with the complexity or various parameters of the usual data that is fed to
neural networks.
This function takes any real value as input and outputs values in the range of 0 to 1.
The larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0, as shown below.
Prepared by N.GOBINATHAN, AP/CSE
Here’s why sigmoid/logistic activation function is one of the most widely used functions:
The function is differentiable and provides a smooth gradient, i.e., preventing jumps in
output values. This is represented by an S-shape of the sigmoid activation function.
It implies that for values greater than 3 or less than -3, the function will have very small
gradients. As the gradient value approaches zero, the network ceases to learn and suffers from
the Vanishing gradient problem.
The output of the logistic function is not symmetric around zero. So the output of all
the neurons will be of the same sign. This makes the training of the neural network
more difficult and unstable.
The tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to -1.0.
The output of the tanh activation function is Zero centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.
It is usually used in hidden layers of a neural network, as its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. It helps
in centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.
Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero centered, and
the gradients are not restricted to move in a certain direction. Therefore, in practice, tanh
nonlinearity is generally preferred to sigmoid nonlinearity.
As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid
activation function. In addition, the gradient of the tanh function is much steeper than that of
the sigmoid function.
The advantage is that the negative inputs will be mapped strongly negative and the zero
inputs will be mapped near zero in the tanh graph.
The function is differentiable.
The function is monotonic while its derivative is not monotonic.
The tanh function is mainly used for classification between two classes.
Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
Sigmoid
The term sigmoid covers two S-shaped functions, logistic and tangential (tanh). The values of
the logistic function range from 0 to 1, and those of the tangential function from -1 to +1.
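The behaviour of the two S-shaped functions and their gradients can be checked numerically. The sketch below uses only the Python standard library; the helper names are assumptions made for this illustration. It prints the function values and the derivatives, showing that both gradients become very small once the input moves beyond about ±3, and that the tanh gradient at zero (1.0) is steeper than the sigmoid gradient at zero (0.25).

```python
import math


def sigmoid(x):
    """Logistic function: output in the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-5.0, -3.0, 0.0, 3.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  d_sigmoid={sigmoid_grad(x):.4f}  "
          f"tanh={math.tanh(x):+.4f}  d_tanh={tanh_grad(x):.4f}")
```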
The perceptron rule fails to converge if the examples are not linearly separable.
Delta rule is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward
a best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the training
examples.
Gradient descent searches the hypothesis space of possible weight vectors to find the
best one. The hypothesis space being searched can contain many different types of
continuously parameterized hypotheses.
To find a local minimum of a function using gradient descent, one takes steps
proportional to the negative of the gradient (or of the approximate gradient) of the
function from the current point.
The training error of a weight vector is measured as
E(→w) = 1/2 * ∑ (td − od)^2, summed over all d in D
where D is the set of training examples, td is the target output (actual output) for
training example d, and od (predicted output) is the output of the linear unit for
training example d.
E(→w) is half the squared difference between the target output and the predicted
output, summed over all training examples. This is the deviation between predicted and
target output. In other words, it is the error in prediction.
The best machine learning model will try to minimise this error value through
continuous learning.
"A gradient measures how much the output of a function changes if you change the
inputs a little bit." —Lex Fridman (MIT)
How big the steps are that gradient descent takes in the direction of the local minimum is
determined by the learning rate, which controls how fast or slow we move towards
the optimal weights.
For gradient descent to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the
steps it takes are too big, it may not reach the local minimum because it bounces back and
forth across the convex loss function (see left image below). If we set the
learning rate to a very small value, gradient descent will eventually reach the local
minimum but that may take a while (see the right image).
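The effect of the learning rate can be seen on a toy problem. The sketch below is an assumption-laden illustration, not part of the notes: it minimizes the hypothetical convex loss E(w) = w², whose gradient is 2w, for three hand-picked learning rates.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize the illustrative loss E(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad  # step in the direction of the negative gradient
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

With the very small rate, w is still far from the minimum at 0 after 20 steps; with the moderate rate it is close to 0; with the too-large rate each step overshoots the minimum and |w| grows instead of shrinking.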
Advantages:
Easy computation.
Easy to implement.
Easy to understand.
Disadvantages:
May get trapped at local minima.
Weights are changed after calculating the gradient on the whole dataset. So, if the
dataset is too large, it may take a very long time to converge to the minimum.
Requires large memory to calculate the gradient on the whole dataset.
Any non-linear function that is differentiable everywhere and increases everywhere with
the weighted sum can be used as an activation function.
Training procedure : The network is usually trained with a large number of input-output
pairs.
1. Initialize the weights to small random values (both positive and negative) to
ensure that the network is not saturated by large values of weights.
2. Choose a training pair from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 to 6 for each input-output pair in the training set until the error for the
entire system is acceptably low.
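The training procedure can be sketched for the simplest possible case, a single linear unit updated with the delta rule after every training pair. Everything specific here is an assumption made for illustration: the training set (targets generated from t = 2*x1 - x2 + 0.5), the learning rate, the stopping threshold, and the helper names.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical training set: targets follow t = 2*x1 - 1*x2 + 0.5.
training_set = [((x1, x2), 2.0 * x1 - 1.0 * x2 + 0.5)
                for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]

# Step 1: small random initial weights of both signs; weights[0] is the bias weight.
weights = [random.uniform(-0.1, 0.1) for _ in range(3)]
learning_rate = 0.1


def output(weights, x):
    """Linear unit: bias weight plus the weighted sum of the inputs."""
    return weights[0] + weights[1] * x[0] + weights[2] * x[1]


def total_error(weights):
    """E(w) = 1/2 * sum over all training examples of (t - o)^2."""
    return 0.5 * sum((t - output(weights, x)) ** 2 for x, t in training_set)


for epoch in range(2000):
    for x, t in training_set:          # Step 2: choose a training pair
        o = output(weights, x)         # Steps 3-4: apply the input, calculate the output
        error = t - o                  # Step 5: calculate the error
        inputs = (1.0, x[0], x[1])     # the bias input is always +1
        for i in range(3):             # Step 6: delta-rule weight adjustment
            weights[i] += learning_rate * error * inputs[i]
    if total_error(weights) < 1e-10:   # Step 7: stop once the error is acceptably low
        break

print([round(w, 3) for w in weights], total_error(weights))
```

Because the targets are an exact linear function of the inputs, the learned weights approach the bias 0.5 and the coefficients 2 and -1 used to generate the data.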
Forward pass and backward pass :
• Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input to the
output.
2. In the backward pass, the calculated error signals propagate backward through the
network, where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out, layer by layer in the
forward direction. The output of one layer is the input to the next layer.
• In the reverse pass,
The weights of the output neuron layer are adjusted first since the target value of
each output neuron is available to guide the adjustment of the associated weights,
using the delta rule.
Next, we adjust the weights of the middle layers. As the middle layer neurons have
no target values, it makes the problem complex.
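The forward and backward passes can be sketched for a tiny network with two inputs, two hidden sigmoid units, and one sigmoid output. This is a minimal illustration under stated assumptions: the network size, the squared-error loss E = 1/2 (t - o)², the random initial weights, and all function names are chosen for the example. The final lines compare one backpropagated gradient with a finite-difference estimate as a sanity check.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass, layer by layer; returns hidden activations and the output."""
    w_hidden, w_out = params
    # Each weight row is [bias, w1, w2]; the bias input is always +1.
    hidden = [sigmoid(row[0] + row[1] * x[0] + row[2] * x[1]) for row in w_hidden]
    out = sigmoid(w_out[0] + w_out[1] * hidden[0] + w_out[2] * hidden[1])
    return hidden, out


def loss(params, x, t):
    """Squared error E = 1/2 * (t - o)^2 for one training pair."""
    _, out = forward(params, x)
    return 0.5 * (t - out) ** 2


def backward(params, x, t):
    """Backward pass: output-layer gradients first, then hidden-layer gradients."""
    w_hidden, w_out = params
    hidden, out = forward(params, x)
    delta_out = (out - t) * out * (1.0 - out)  # dE/dz at the output unit
    grad_out = [delta_out, delta_out * hidden[0], delta_out * hidden[1]]
    grad_hidden = []
    for j, h in enumerate(hidden):
        # Hidden units have no target value; their error signal is the
        # output delta propagated back through the connecting weight.
        delta_h = delta_out * w_out[j + 1] * h * (1.0 - h)
        grad_hidden.append([delta_h, delta_h * x[0], delta_h * x[1]])
    return grad_hidden, grad_out


random.seed(1)
params = ([[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
          [random.uniform(-0.5, 0.5) for _ in range(3)])
x, t = (1.0, 0.0), 1.0

grad_hidden, grad_out = backward(params, x, t)

# Sanity check: central finite difference for one hidden-layer weight.
eps = 1e-6
base = params[0][0][1]
params[0][0][1] = base + eps
loss_plus = loss(params, x, t)
params[0][0][1] = base - eps
loss_minus = loss(params, x, t)
params[0][0][1] = base
numeric = (loss_plus - loss_minus) / (2 * eps)
print(grad_hidden[0][1], numeric)
```

The two printed numbers should agree to many decimal places, which indicates that the hidden-layer gradient obtained by propagating the output error backward matches the true slope of the error.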
Selection of number of hidden units : The number of hidden units depends on the
number of input units.
1. Never choose h to be more than twice the number of input units.
2. You can load p patterns of I elements into log2 p hidden units.
3. Ensure that we have at least 1/e times as many training examples.
4. Feature extraction requires fewer hidden units than inputs.
5. Learning many examples of disjointed inputs requires more hidden units than inputs.
6. The number of hidden units required for a classification task increases with the number
of classes in the task. Large networks require longer training times.
II Year/CSE, CS3491 - Artificial Intelligence and Machine Learning
Factors influencing backpropagation training:
The training time can be reduced by using:
1. Bias : Networks with biases can represent relationships between inputs and outputs more
easily than networks without biases. Adding a bias to each neuron is usually desirable to
offset the origin of the activation function. The weight of the bias is trainable like any
other weight, except that its input is always +1.
2. Momentum : The use of momentum enhances the stability of the training process.
Momentum is used to keep the training process going in the same general direction,
analogous to the way the momentum of a moving object behaves. In backpropagation with
momentum, the weight change is a combination of the current gradient and the previous
weight change.
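A momentum update can be sketched as follows. The loss E(w) = w², the starting weight, the learning rate, and the momentum coefficient 0.9 are all assumptions chosen for illustration; the point is only that each weight change blends the current gradient with the previous weight change.

```python
def train_with_momentum(learning_rate=0.1, momentum=0.9, steps=200, w=5.0):
    """Minimize the illustrative loss E(w) = w**2 using a momentum term."""
    velocity = 0.0  # the previous weight change
    for _ in range(steps):
        grad = 2.0 * w
        # New weight change = momentum * previous change - learning_rate * gradient.
        velocity = momentum * velocity - learning_rate * grad
        w += velocity
    return w


print(train_with_momentum())               # with momentum
print(train_with_momentum(momentum=0.0))   # plain gradient descent for comparison
```

Setting `momentum=0.0` reduces the update to ordinary gradient descent, which makes it easy to compare the two behaviours on the same problem.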
Advantages and Disadvantages
Advantages of backpropagation:
1. It is simple, fast and easy to program.
2. It has no parameters to tune other than the number of inputs.
3. No need to have prior knowledge about the network.
4. It is flexible.
5. A standard approach and works efficiently.
6. It does not require the user to learn special functions.
Disadvantages of backpropagation:
Backpropagation can be sensitive to noisy data and irregularities.
Its performance is highly reliant on the input data.
It needs excessive time for training.
It needs a matrix-based method for backpropagation instead of mini-batch.
5. Explain in detail about shallow and deep networks with an example.
Shallow Networks:
The terms shallow and deep refer to the number of layers in a neural network; a shallow
neural network is a neural network that has a small number of layers, usually
regarded as having a single hidden layer, and a deep neural network is a neural
network that has multiple hidden layers. Each type of network performs certain tasks
better than the other, and selecting the right network depth is important for creating a
successful model.
• In a shallow neural network, the values of the feature vector of the data to be classified
(the input layer) are passed to a hidden layer of nodes (neurons), each of which generates a
response according to some activation function, g, acting on the weighted sum of those
values, z.
The response of each unit in the hidden layer is then passed to a final output layer
(which may consist of a single unit), whose activation produces the classification prediction
output.
Deep Network:
Deep learning is a new area of machine learning research, which has been
introduced with the objective of moving machine learning closer to one of its original goals.
Deep learning is about learning multiple levels of representation and abstraction that help
to make sense of data such as images, sound, and text.
'Deep learning' means using a neural network with several layers of nodes between
input and output. It is generally better than other methods on image, speech and certain
other types of data because the series of layers between input and output do feature
identification and processing in a series of stages, just as our brain seems to.
Deep learning emphasizes the network architecture of today's most successful
machine learning approaches. These methods are based on "deep" multi-layer neural
networks with many hidden layers.
TensorFlow
TensorFlow is one of the most popular frameworks used to build deep learning
models. The framework is developed by Google Brain Team.
Languages like C++, R and Python are supported by the framework to create the
models as well as the libraries. This framework can be accessed from both desktop
and mobile.
Google Translate is a well-known example of an application of TensorFlow. In it, the model
is created by combining the functionalities of text classification, natural language
processing, speech or handwriting recognition, image recognition, etc.
The framework has its own visualization toolkit, named TensorBoard, which helps
in powerful data visualization of the network along with its performance.
One more tool in TensorFlow, TensorFlow Serving, can be used for quick and
easy deployment of newly developed algorithms without introducing any
change in the existing API or architecture.
The TensorFlow framework comes with detailed documentation that lets users
adapt to it quickly and easily, making it one of the most preferred frameworks
for modeling deep learning algorithms.
Some of the characteristics of TensorFlow are:
Multiple GPUs are supported.
One can visualize graphs and queues easily using TensorBoard.
Powerful documentation and large support from the community.
Keras
If you are comfortable programming in Python, then learning Keras will not
prove hard for you. It is the most recommended framework for creating deep learning
models for those with a sound knowledge of Python.
Keras is built purely on Python and can run on top of TensorFlow. Due to its
complexity and use of low-level libraries, TensorFlow can be comparatively harder for
new users to adopt than Keras. Users who are beginners in deep
learning and find its models difficult to understand in TensorFlow generally prefer Keras,
as it lets them build complex models quickly.
Keras has been developed keeping in mind the complexities of deep learning
models, and hence it can produce results in minimal time.
Convolutional as well as Recurrent Neural Networks are supported in Keras. The
framework can run easily on CPU and GPU.
Deep network contains many hidden layers. | Shallow network contains only one hidden layer.
Deep network can compactly express highly complex functions over input space. | Shallow networks with one hidden layer cannot place complex functions over the input space.
Training in DN is easy and no issue of local minima in DN. | Shallow network is more difficult to train with our current algorithms.
DN can fit functions better with fewer parameters than a shallow network. | Shallow nets need more parameters to have a better fit.
It results in models with many layers being rendered unable to learn on a specific
dataset. It could even cause models with many layers to prematurely converge to a
substandard solution.
When the backpropagation algorithm advances downwards or backward going from
the output layer to the input layer, the gradients tend to shrink, becoming smaller
and smaller till they approach zero. This ends up leaving the weights of the initial or
lower layers practically unchanged. In this situation, the gradient descent does not
ever end up converging to the optimum.
Vanishing gradient does not necessarily imply that the gradient vector is all zero. It
implies that the gradients are minuscule, which would cause the learning to be very
slow.
One important solution to the vanishing gradient problem, particularly in recurrent
networks, is a specific type of neural network called Long Short-Term Memory Networks (LSTMs).
7. Explain in detail about RELU function and its usage in hidden layer.
ReLU
• The Rectified Linear Unit (ReLU) helps mitigate the vanishing gradient problem. ReLU is a
non-linear, piecewise linear function that outputs the input directly if it is
positive; otherwise, it outputs zero.
• It is the most commonly used activation function in neural networks, especially
Convolutional Neural Networks (CNNs) and multilayer perceptrons.
• Mathematically, it is expressed as
f(x) = max(0, x)
where x is the input to the neuron.
• The leak helps to increase the range of the ReLU function. Usually, the value of a is
0.01 or so.
• The main motivation for using LReLU instead of ReLU is that constant zero gradients
can also result in slow learning, as when a saturated neuron uses a sigmoid
activation function.
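ReLU and its leaky variant can be sketched directly from the definitions. The leaky form used here, f(x) = x for positive x and a*x otherwise with a = 0.01, is the commonly used definition and is an assumption of this sketch, since the notes give only the value of a; the function names are also chosen for illustration.

```python
def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    """Leaky ReLU (assumed form): x for positive inputs, a*x otherwise."""
    return x if x > 0 else a * x


def leaky_relu_grad(x, a=0.01):
    """Gradient of Leaky ReLU: never exactly zero, unlike ReLU."""
    return 1.0 if x > 0 else a


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs the ReLU gradient is exactly zero, while the leaky variant keeps a small non-zero gradient, which is the motivation stated above.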
EReLU:
• An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform
distribution during training for the positive inputs to control the amount of
non-linearity.
• The EReLU is defined as EReLU(x) = max(Rx, 0), in the output range of [0, ∞), where R is
a random number.
• At the test time, the EReLU becomes the identity function for positive inputs.
Layer Size:
Layer size is defined by the number of neurons in a given layer. Input and output
layers are relatively easy to figure out because they correspond directly to how our
modeling problem handles input and output.
For the input layer, this will match up to the number of features in the input vector.
For the output layer, this will either be a single output neuron or a number of neurons
matching the number of classes we are trying to predict.
A neural network with 3 layers will often give better performance than one
with 2 layers. Increasing beyond 3 layers often helps much less in fully connected networks.
In the case of CNNs, increasing the number of layers often makes the model better.
Magnitude : Learning Rate
The amount by which the weights are updated during training is referred to as the step size
or the learning rate. Specifically, the learning rate is a configurable hyper-parameter
used in the training of neural networks that has a small positive value, often in the
range between 0.0 and 1.0.
For example, if learning rate is 0.1, then the weights in the network are updated 0.1 *
(estimated weight error) or 10 % of the estimated weight error each time the weights
are updated. The learning rate hyper-parameter controls the rate or speed at which the
model learns.
• Learning rates are tricky because they end up being specific to the dataset and even to
other hyper-parameters. This creates a lot of overhead for finding the right setting for
hyper-parameters.
• Large learning rates make the model learn faster, but at the same time they may cause us to
miss the minimum of the loss function and only reach its surroundings. In cases where the
learning rate is too large, the optimizer overshoots the minimum and the loss updates
lead to divergent behavior.
• On the other hand, choosing lower learning rate values gives a better chance of finding
the local minimum, with the trade-off of needing a larger number of epochs and more time.
• Momentum can accelerate learning on those problems where the high-dimensional
weight space that is being navigated by the optimization process has structures that
mislead the gradient descent algorithm, such as flat regions or steep curvature.
Normalization is a data pre-processing tool used to bring the numerical data to a common scale.
Generally, when we input the data to a machine or deep learning algorithm we tend to
change the values to a balanced scale. The reason we normalize is partly to ensure that our
Now coming back to Batch normalization, it is a process to make neural networks faster
and more stable through adding extra layers in a deep neural network. The new layer
performs the standardizing and normalizing operations on the input of a layer coming from
a previous layer.
But what is the reason behind the term “Batch” in batch normalization? A typical neural
network is trained using a collected set of input data called batch. Similarly, the
normalizing process in batch normalization takes place in batches, not as a single input.
Let’s understand this through an example, we have a deep neural network as shown in the
following image.
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-
processing stage. When the input passes through the first layer, it transforms, as a sigmoid
function applied over the dot product of input X and the weight matrix W.
Similarly, this transformation will take place for the second layer and continue up to the last layer L.
Although our input X was normalized, over time the output will no longer be on the same
scale. As the data go through multiple layers of the neural network and L activation
Since by now we have a clear idea of why we need Batch normalization, let’s understand
how it works. It is a two-step process. First, the input is normalized, and later rescaling and
offsetting is performed.
Normalization is the process of transforming the data to have a mean of zero and a standard
deviation of one. In this step we have our batch input from layer h. First, we need to calculate
the mean of the hidden activations.
Once we have the mean, the next step is to calculate the standard deviation of the
hidden activations.
With the mean and the standard deviation ready, we normalize the
hidden activations using these values. For this, we subtract the mean from each input
and divide the result by the sum of the standard deviation and the smoothing term (ε).
The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.
Rescaling and Offsetting
In the final operation, the re-scaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture, γ (gamma) and β (beta). These
parameters are used for re-scaling (γ) and shifting (β) the vector containing values from
the previous operation.
These two are learnable parameters; during training, the neural network learns the
optimal values of γ and β. This enables the accurate normalization of each
batch.
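The two-step process can be sketched for a single hidden unit whose activations are collected over one batch. This is a simplified illustration: the function name, the default values of γ, β and ε, and the example batch are assumptions. The division uses the standard deviation plus ε, as described in the text; many implementations instead add ε to the variance inside the square root.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize the activations of one unit over a batch."""
    m = len(batch)
    mean = sum(batch) / m                               # mean of the hidden activations
    variance = sum((h - mean) ** 2 for h in batch) / m
    std = math.sqrt(variance)                           # standard deviation
    # Step 1: normalize; eps keeps the division safe when std is zero.
    normalized = [(h - mean) / (std + eps) for h in batch]
    # Step 2: rescale with gamma and offset with beta (learnable in a real network).
    return [gamma * h_norm + beta for h_norm in normalized]


activations = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(activations))
print(batch_norm(activations, gamma=2.0, beta=3.0))
```

With the default γ = 1 and β = 0 the output has mean close to 0 and standard deviation close to 1; with γ = 2 and β = 3 the same values are stretched and shifted so that their mean is close to 3.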
Advantages of Batch Normalization
By normalizing the hidden layer activations, batch normalization speeds up the
training process.
Handles internal covariate shift
It solves the problem of internal covariate shift. Through this, we ensure that the
input for every layer is distributed around the same mean and standard deviation. If you
are unaware of what is an internal covariate shift, look at the following example.
Internal covariate shift
Suppose we are training an image classification model that classifies images
into Dog or Not Dog. Let’s say we have images of white dogs only; these images will
have a certain distribution. Using these images, the model will update its parameters.
Later, if we get a new set of images consisting of non-white dogs, these new images will
have a slightly different distribution from the previous images. Now the model will change
its parameters according to these new images. Hence the distribution of the hidden
activations will also change. This change in hidden activations is known as an internal
covariate shift.
However, according to a study by MIT researchers, batch normalization does not solve
the problem of internal covariate shift.
• Just have a look at the above figure, and we can immediately predict that once we
try to cover every minutest feature of the input data, there can be irregularities in
the extracted features, which can introduce noise in the output. This is referred to as
"Overfitting".
• Irregularities may also arise when too few features are extracted, as some of the
important details might be missed. This affects the accuracy of
the outputs produced. This is referred to as "Underfitting".
This also shows that the complexity for processing the input elements increases
with overfitting. Also, neural networks being a complex interconnection of nodes
the issue of overfitting may arise frequently.
• To eliminate this, regularization is used, in which we have to make the slightest
modification in the design of the neural network, and we can get better outcomes.
Consider a simple linear relation:
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
Y = Learned relation
β (beta) = Coefficient estimates for different variables and/or predictors (X)
Now, we introduce a loss function that drives the fitting procedure,
which is referred to as the "Residual Sum of Squares" or RSS.
The coefficients are chosen in such a way that they minimize this loss
function.
If you have studied the concept of regularization in machine learning, you will have a fair
idea that regularization penalizes the coefficients. In deep learning, it actually penalizes the
weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight matrices are
nearly equal to zero.
This will result in a much simpler linear network and slight underfitting of the training
data.
Such a large value of the regularization coefficient is not that useful. We need to optimize
the value of regularization coefficient in order to obtain a well-fitted model as shown in the
image below.
In this case, the regularization term is the squared norm of the weights of each network’s
layer. This matrix norm is called Frobenius norm and, explicitly, it’s computed as follows:
Please note that the weight matrix relative to layer l has n^{[l]} rows and n^{[l-1]} columns.
Finally, the complete cost function under L2 Regularization becomes:
Again, λ is the regularization coefficient, and for λ=0 the effects of L2 Regularization are null.
L2 Regularization shrinks the values of the weights towards zero, resulting in a simpler
model.
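The L2 penalty can be sketched as the sum of the squared Frobenius norms of the layer weight matrices, scaled by λ and added to the unregularized cost J. The exact scaling factor is an assumption here: this sketch uses λ alone, while some formulations divide by the number of training examples or by two. The function names and the example weights are also hypothetical.

```python
def frobenius_norm_squared(weight_matrix):
    """Squared Frobenius norm: the sum of the squares of all entries."""
    return sum(w * w for row in weight_matrix for w in row)


def l2_regularized_cost(base_cost, layer_weights, lam):
    """Cost J plus lam times the summed squared norms of all layers."""
    penalty = sum(frobenius_norm_squared(W) for W in layer_weights)
    return base_cost + lam * penalty


# Hypothetical weights for a network with two layers.
layers = [
    [[1.0, -2.0], [0.5, 0.0]],   # layer 1: 2 rows, 2 columns
    [[3.0, 0.0]],                # layer 2: 1 row, 2 columns
]

print(l2_regularized_cost(1.0, layers, lam=0.0))   # lam = 0: no regularization effect
print(l2_regularized_cost(1.0, layers, lam=0.1))
```

Large weights contribute quadratically to the penalty, so minimizing the regularized cost pushes all weights towards smaller values.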
It regularizes the coefficients of the model: the ridge regression term
reduces the values of the coefficients, which ultimately helps in reducing the
complexity of the machine learning model.
From the above equation, we can observe that if the value of λ tends to zero, the last
term on the right-hand side will tend to zero, thus making the above equation a
representation of a simple linear regression model.
Hence, the lower the value of λ, the more the model will tend to linear regression.
This model is important for machine learning, as ordinary linear regression models
risk failing if there are dependencies between their variables. Hence, ridge
regression is used here.
Lasso Regression (L1 Regularization)
One more technique to reduce overfitting, and thus the complexity of the model,
is lasso regression.
Lasso stands for Least Absolute Shrinkage and Selection Operator, and lasso
regression is also sometimes known as L1 regularization.
The equation for lasso regression is almost the same as that of ridge regression,
except that the penalty term uses the absolute values of the weights instead of
their squares.
The advantage of taking the absolute values is that a slope can shrink exactly to 0,
whereas in ridge regression the slope only shrinks close to 0.
The following equation gives the cost function defined in the Lasso regression:
In L1 Regularization we add the following term to the cost function J:
where the matrix norm is the sum of the absolute value of the weights for each layer 1, …, L
of the network:
λ is the regularization term. It’s a hyperparameter that must be carefully tuned. λ directly
controls the impact of the regularization: as λ increases, the effects on the weights
shrinking are more severe.
The complete cost function under L1 Regularization becomes:
For λ=0, the effects of L1 Regularization are null. In contrast, choosing a value of λ that is
too big will over-simplify the model, probably resulting in an underfitting network.
L1 Regularization can be considered as a sort of neuron selection process because it would
bring to zero the weights of some hidden neurons.
• Because the cost function uses absolute values, some of the features of the
input dataset can be ignored completely while evaluating the machine learning model, and
hence feature selection is achieved and overfitting can be reduced to a large extent.
• On the other hand, ridge regression does not ignore any feature in the model and includes
it all for model evaluation. The complexity of the model can be reduced using the shrinking
of co-efficient in the ridge regression model.
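The L1 penalty and its tendency to produce exact zeros can be sketched as follows. The penalty function follows the description above (sum of absolute weights scaled by λ, with the scaling itself an assumption). The `soft_threshold` helper is not from the notes: it is the standard proximal (soft-thresholding) step, swapped in here only to illustrate how an absolute-value penalty can set small weights exactly to zero. All names and numbers are hypothetical.

```python
def l1_norm(weight_matrix):
    """Sum of the absolute values of all entries."""
    return sum(abs(w) for row in weight_matrix for w in row)


def l1_regularized_cost(base_cost, layer_weights, lam):
    """Cost J plus lam times the summed absolute weights of all layers."""
    return base_cost + lam * sum(l1_norm(W) for W in layer_weights)


def soft_threshold(w, amount):
    """Shrink w toward zero by `amount`, clipping at exactly zero."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0


weights = [1.0, -0.6, 0.05, -0.02]
print([soft_threshold(w, 0.1) for w in weights])   # L1-style step
print([w * (1.0 - 0.1) for w in weights])          # L2-style multiplicative shrinkage
```

The L1-style step sets the two small weights exactly to zero, while the L2-style shrinkage only scales every weight down and leaves all of them non-zero.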
Dropout
Dropout was introduced by Hinton et al. and this method is now very popular. It consists of
setting to zero the output of each hidden neuron in a chosen layer with some probability, and it
has proven to be very effective in reducing overfitting.
To understand dropout, let’s say our neural network structure is akin to the one shown below:
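Dropout on one layer's outputs can be sketched in a few lines. This is an illustrative sketch under assumptions: the function name, the drop probability of 0.5, and the example activations are hypothetical, and the rescaling of the surviving outputs by 1 / (1 - drop_prob), often called inverted dropout, is a common implementation choice that the notes do not describe.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training.

    Requires 0 <= drop_prob < 1. Surviving activations are rescaled by
    1 / (1 - drop_prob) (an assumption of this sketch) so that the
    expected output stays the same.
    """
    if not training or drop_prob == 0.0:
        return list(activations)   # at test time every neuron is kept
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)             # seeded generator for a repeatable run
hidden = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]
print(dropout(hidden, drop_prob=0.5, rng=rng))
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))
```

A different random subset of neurons is silenced on every call during training, so the network cannot rely on any single hidden neuron.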
Drop Connect:
DropConnect, known as the generalized version of Dropout, is a method used for
regularizing deep neural networks.
Fig. DropConnect.
DropConnect has been proposed to add more noise to the network. The primary difference
is that instead of randomly dropping the output of the neurons, we randomly drop the
connection between neurons.
In other words, the fully connected layer with DropConnect becomes a sparsely connected
layer in which the connections are chosen at random during the training stage.
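The difference from Dropout can be sketched by masking individual weights instead of neuron outputs. The function name, the layer shape, and the drop probability are assumptions made for illustration; activation functions and any rescaling are left out to keep the sketch short.

```python
import random


def dropconnect_layer(inputs, weights, drop_prob, rng):
    """Fully connected layer in which each connection is dropped at random.

    weights[j][i] is the weight from input i to output neuron j. Each
    individual connection is removed with probability drop_prob.
    """
    outputs = []
    for row in weights:
        total = 0.0
        for w, x in zip(row, inputs):
            if rng.random() >= drop_prob:   # keep this connection
                total += w * x
        outputs.append(total)
    return outputs


rng = random.Random(0)
inputs = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]
print(dropconnect_layer(inputs, weights, drop_prob=0.5, rng=rng))
print(dropconnect_layer(inputs, weights, drop_prob=0.0, rng=rng))  # nothing dropped
```

With `drop_prob=0.0` the layer is an ordinary fully connected layer; with a non-zero probability each output neuron sees a different random subset of its incoming connections.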
Difference between L1 and L2 Regularization:
S.No | L1 Regularization | L2 Regularization
1. | Penalizes the sum of absolute values of the weights. | Penalizes the sum of squared weights.
2. | It has a sparse solution. | It has a non-sparse solution.
3. | It gives multiple solutions. | It has only one solution.
PART-A
Deep network contains many hidden layers. | Shallow network contains only one hidden layer.
DN can fit functions better with fewer parameters than a shallow network. | Shallow nets need more parameters to have a better fit.