AD601 Deep Learning Unit-2 Notes
Multilayer Perceptron, Gradient Descent, Backpropagation, Empirical Risk Minimization, Regularization,
Auto-encoders.
__________________________________________________________________________________________
Unit-2
Feedforward Networks:
The process of receiving an input and passing it forward through the network to produce an output (a
prediction) is known as feed forward. The feed-forward neural network is the basis of many other important
neural networks, such as the convolutional neural network.
In a feed-forward neural network there are no feedback loops or recurrent connections. In its simplest form
it has an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on the kind of data you are dealing with. The number of
hidden layers is known as the depth of the neural network. A deeper network can represent more complex
functions. The input layer provides the network with data, and the output layer makes predictions on that
data after it has passed through a series of functions. ReLU is the most commonly used activation function
in deep neural networks.
Chameli Devi Group of Institutions
Department of Artificial Intelligence & Data Science
Input layer:
The neurons of this layer receive the input and pass it on to the other layers of the network. The number of
neurons in the input layer must match the number of features (attributes) in the dataset.
Output layer:
This layer produces the predicted value; its form depends on the type of model being built.
Hidden layer:
Hidden layers sit between the input layer and the output layer. Depending on the type of model, there may be
several hidden layers.
Each hidden layer contains several neurons that transform their input before passing it to the next layer.
During training, the weights of the network are updated repeatedly so that its predictions improve.
Neuron weights:
Each connection between two neurons carries a weight, which measures the strength or magnitude of that
connection. Weights play a role similar to the coefficients in linear regression.
Neurons:
Feed-forward networks are built from artificial neurons, which are modeled on biological neurons.
A neuron does two things: first it computes a weighted sum of its inputs, and then it applies an activation
function to that sum to produce its output.
Activation functions can be either linear or nonlinear. Each neuron has one weight per input, and the network
learns these weights during the training phase.
Activation Function:
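The activation function determines a neuron's output from its weighted input sum, and so decides whether the
neuron's response is linear or nonlinear. Because a signal passes through many layers, a bounded activation
function (such as sigmoid or Tanh) also keeps neuron outputs from growing uncontrollably from layer to layer.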
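Three widely used activation functions are the sigmoid, Tanh, and the Rectified Linear Unit (ReLU).
Sigmoid:
Maps any real input to a value between 0 and 1: f(x) = 1 / (1 + e^(-x)).
Tanh:
Maps any real input to a value between -1 and 1, and is centered at zero.
ReLU:
Only positive values are allowed to flow through this function. Negative values get mapped to 0: f(x) = max(0, x).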
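The three activation functions can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are chosen here for readability and are not part of the notes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified Linear Unit: passes positive values, maps negatives to 0."""
    return max(0.0, x)

# Compare the three functions on a negative, a zero, and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```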
Multilayer Perceptron
A Multilayer Perceptron (MLP) is a neural network in which the mapping between inputs and outputs is
non-linear. It has an input layer, an output layer, and one or more hidden layers with many neurons stacked
together. In the classic Perceptron the neuron uses an activation function that imposes a hard threshold,
whereas neurons in a Multilayer Perceptron can use other activation functions, such as sigmoid, Tanh, or ReLU.
The Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined with
the initial weights in a weighted sum and passed through the activation function, just as in the Perceptron.
The difference is that the result of each neuron is propagated to the next layer.
Each layer feeds the next one with the result of its computation, its internal representation of the data.
This goes all the way through the hidden layers to the output layer.
But there is more to it. If the algorithm only computed the weighted sums in each neuron, propagated the results
to the output layer, and stopped there, it would not be able to learn the weights that minimize the cost function.
With only a single forward pass there is no actual learning; the weights must be adjusted iteratively.
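The layer-by-layer propagation described above can be sketched as follows. This is a minimal illustration, not a full MLP implementation: the 3-2-1 layout, the weight values, and the names `layer_forward` and `mlp_forward` are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases, activation):
    """Output of one layer: activation(weighted sum + bias) for each neuron.

    weights[j][i] is the weight from input i to neuron j (layout assumed here).
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(net))
    return outputs

def mlp_forward(inputs, layers):
    """Feed the input through every layer in turn; each layer feeds the next."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = layer_forward(signal, weights, biases, activation)
    return signal

# Illustrative 3-2-1 network with made-up weights.
layers = [
    ([[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]], [0.0, 0.1], sigmoid),  # hidden layer
    ([[0.7, -0.6]], [0.05], sigmoid),                             # output layer
]
prediction = mlp_forward([1.0, 0.5, -1.0], layers)
print(prediction)
```

The forward pass alone only produces a prediction; nothing in it changes the weights.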
Gradient Descent
Optimization is a big part of machine learning. Almost every machine learning algorithm has an optimization
algorithm at its core. Gradient Descent is an optimization technique that is used to improve deep learning and
neural network-based models by minimizing the cost function.
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a
function f that minimize a cost function. In neural network training, backpropagation computes the gradient of
the cost J(w) with respect to each weight w, and gradient descent then moves each weight a small step in the
direction opposite to its gradient. The update is repeated until the cost stops decreasing, that is, until a
minimum of J(w) is reached (the global minimum when J is convex).
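The update rule can be written as w ← w - α · dJ/dw, where α is the learning rate. The sketch below applies it to a simple convex cost, J(w) = (w - 3)^2, chosen here purely for illustration; the function names and the learning rate are assumptions, not values from the notes.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step against the gradient of the cost."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Illustrative convex cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 (w - 3).
def cost_gradient(w):
    return 2.0 * (w - 3.0)

w_min = gradient_descent(cost_gradient, w0=0.0)
print(w_min)  # close to 3.0, the minimizer of J
```

With this cost each step shrinks the distance to the minimum by a constant factor, so the iterate approaches w = 3 quickly. A learning rate that is too large would make the iterates overshoot instead.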
Batch Gradient Descent:
This type of gradient descent processes all the training examples in each iteration.
If the number of training examples is large, batch gradient descent is computationally very expensive and is
therefore not preferred. Instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.
Stochastic Gradient Descent:
This type of gradient descent processes one training example per iteration, so the parameters are updated
after every single example.
Each update is therefore much cheaper than in batch gradient descent. However, when the number of training
examples is large, processing one example at a time means the number of iterations is also very large, which
adds overhead for the system.
Mini-Batch Gradient Descent:
In practice this type of gradient descent often works faster than both batch gradient descent and stochastic
gradient descent.
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training
dataset into small batches that are used to calculate model error and update model coefficients.
Here b examples, where b < m and m is the total number of training examples, are processed per iteration. So
even if the number of training examples is large, it is processed in batches of b training examples in one go.
Thus, it works for large training sets and needs fewer iterations than stochastic gradient descent.
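The three variants differ only in how many examples feed each update. The sketch below makes that explicit with a single `batch_size` parameter; the one-weight linear model, the toy data, and the hyperparameters are assumptions chosen for illustration.

```python
import random

def train_linear(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on the mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    1 < batch_size < m    -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates

# Toy data generated from y = 2x (an assumption for illustration).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]
m = len(xs)

for name, b in (("batch", m), ("stochastic", 1), ("mini-batch", 4)):
    w, updates = train_linear(xs, ys, batch_size=b)
    print(name, round(w, 4), updates)
```

For the same number of passes over the data, the batch variant performs one update per pass, the stochastic variant performs m updates, and the mini-batch variant performs m / b updates.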
Example network values:
Input values: x1 = 0.05, x2 = 0.10
Initial weights: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30, w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55
Bias values: b1 = 0.35, b2 = 0.60
Target values: T1 = 0.01, T2 = 0.99
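A forward pass with the values listed above might look like the sketch below. The notes do not state how the eight weights are wired, so the code assumes a network with two inputs, two hidden neurons (h1, h2), and two outputs (o1, o2), sigmoid activations, and the connection layout given in the comments. The names h1, h2, o1, o2 are introduced here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Values from the notes.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# ASSUMED wiring (not stated in the notes):
#   w1: x1->h1, w2: x2->h1, w3: x1->h2, w4: x2->h2, b1 shared by h1 and h2
#   w5: h1->o1, w6: h2->o1, w7: h1->o2, w8: h2->o2, b2 shared by o1 and o2
h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
h2 = sigmoid(w3 * x1 + w4 * x2 + b1)
o1 = sigmoid(w5 * h1 + w6 * h2 + b2)
o2 = sigmoid(w7 * h1 + w8 * h2 + b2)

# Squared error summed over both outputs.
total_error = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2
print(h1, h2, o1, o2, total_error)
```

Under these assumptions the first forward pass gives outputs of about 0.751 and 0.773, far from the targets 0.01 and 0.99, with a total squared error of about 0.298. Reducing that error is the job of the weight updates.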
Backpropagation-
The back-propagation learning algorithm is one of the most important developments in neural networks. This
learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with
continuous, differentiable activation functions. Networks trained with the back-propagation learning
algorithm are called back-propagation networks (BPNs).
Architecture:
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a
hidden layer, and an output layer. The neurons in the hidden and output layers have biases, which are
connections from units whose activation is always 1. The bias terms also act as weights.
The inputs are sent to the BPN, and the output obtained from the net can be either binary (0, 1) or bipolar
(-1, +1). The activation function can be any function that increases monotonically and is also differentiable.
Training Algorithm:
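Step 0: Initialize the weights and the learning rate α (take small random values for the weights).
Step 1: Perform Steps 2-9 while the stopping condition is false.
Step 2: Perform Steps 3-8 for each training pair.
Step 3: Each input unit receives the input signal $x_i$ and sends it to the hidden units (i = 1 to n).
Step 4: Each hidden unit $z_j$ (j = 1 to p) sums its weighted input signals to calculate the net input:
$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$
Calculate the output of the hidden unit by applying its activation function over $z_{inj}$ (binary or bipolar
sigmoidal activation function):
$z_j = f(z_{inj})$
and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit $y_k$ (k = 1 to m), calculate the net input:
$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$
and apply the activation function to compute the output signal:
$y_k = f(y_{ink})$
Step 6: Each output unit $y_k$ (k = 1 to m) receives a target pattern corresponding to the input training
pattern and computes the error correction term:
$\delta_k = (t_k - y_k) f'(y_{ink})$
The derivative $f'(y_{ink})$ is calculated from the chosen activation function. On the basis of the calculated
error correction term, compute the change in weights and bias:
$\Delta w_{jk} = \alpha \delta_k z_j$
$\Delta w_{0k} = \alpha \delta_k$
Step 7: Each hidden unit $z_j$ (j = 1 to p) sums its delta inputs from the output units:
$\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$
The term $\delta_{inj}$ is multiplied by the derivative of $f(z_{inj})$ to calculate the error term:
$\delta_j = \delta_{inj} f'(z_{inj})$
The derivative $f'(z_{inj})$ is calculated depending on whether the binary or the bipolar sigmoidal function is
used. On the basis of the calculated $\delta_j$, compute the change in weights and bias:
$\Delta v_{ij} = \alpha \delta_j x_i$
$\Delta v_{0j} = \alpha \delta_j$
Step 8: Each output unit $y_k$ (k = 1 to m) updates its bias and weights:
$w_{jk}(new) = w_{jk}(old) + \Delta w_{jk}$
$w_{0k}(new) = w_{0k}(old) + \Delta w_{0k}$
Each hidden unit $z_j$ (j = 1 to p) updates its bias and weights:
$v_{ij}(new) = v_{ij}(old) + \Delta v_{ij}$
$v_{0j}(new) = v_{0j}(old) + \Delta v_{0j}$
Step 9: Check for the stopping condition. The stopping condition may be that a certain number of epochs has
been reached or that the actual output equals the target output.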
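The training steps above can be sketched in plain Python for a single training pair. This is an illustrative sketch rather than a reference implementation: it assumes the binary sigmoid, whose derivative is f'(net) = f(net)(1 - f(net)), and the 2-2-1 network, initial weights, learning rate, and fixed epoch count are made up for the example.

```python
import math

def f(x):
    """Binary sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, t, v, v0, w, w0, alpha):
    """One pass of Steps 3-8 for a single training pair (x, t).

    v[i][j]: weight from input i to hidden unit j;  v0[j]: hidden bias.
    w[j][k]: weight from hidden unit j to output k; w0[k]: output bias.
    Weights are updated in place; the output before the update is returned.
    """
    n, p, m = len(x), len(v0), len(w0)
    # Steps 3-4: hidden layer net input and output.
    z = [f(v0[j] + sum(x[i] * v[i][j] for i in range(n))) for j in range(p)]
    # Step 5: output layer net input and output.
    y = [f(w0[k] + sum(z[j] * w[j][k] for j in range(p))) for k in range(m)]
    # Step 6: output error terms, using f'(net) = f(net) * (1 - f(net)).
    delta_k = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(m)]
    # Step 7: back-propagate to the hidden layer (uses the old w values).
    delta_j = [sum(delta_k[k] * w[j][k] for k in range(m)) * z[j] * (1.0 - z[j])
               for j in range(p)]
    # Step 8: update weights and biases.
    for j in range(p):
        for k in range(m):
            w[j][k] += alpha * delta_k[k] * z[j]
    for k in range(m):
        w0[k] += alpha * delta_k[k]
    for i in range(n):
        for j in range(p):
            v[i][j] += alpha * delta_j[j] * x[i]
    for j in range(p):
        v0[j] += alpha * delta_j[j]
    return y

def squared_error(t, y):
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y))

# Illustrative 2-2-1 network with made-up initial weights.
x, t = [0.0, 1.0], [1.0]
v = [[0.6, -0.1], [-0.3, 0.4]]
v0 = [0.3, 0.5]
w = [[0.4], [0.1]]
w0 = [-0.2]

# Step 9 is simplified here to a fixed number of epochs.
errors = []
for epoch in range(200):
    y = train_step(x, t, v, v0, w, w0, alpha=0.5)
    errors.append(squared_error(t, y))
print(errors[0], errors[-1])
```

The hidden-layer error terms are computed before any weight is changed, so Step 7 uses the same weights that produced the output in Step 5.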
Empirical Risk Minimization
The Empirical Risk Minimization (ERM) principle is a learning paradigm that consists of selecting the model
with the minimal average error over the training set. This training error can be seen as an estimate of the
risk (by the law of large numbers), hence the alternative name, empirical risk.
By minimizing the empirical risk, we hope to obtain a model with a low value of the true risk. The larger the
training set is, the closer the empirical risk is to the true risk.
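As a small illustration of the principle, the sketch below computes the empirical risk of several candidate models and selects the one with the lowest value. The squared-error loss, the toy training set, and the family of candidate models (lines through the origin) are assumptions made for the example.

```python
def empirical_risk(model, data):
    """Average squared error of `model` over the training set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Toy training set (made up for illustration), roughly following y = 2x.
training_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Candidate models: lines through the origin with different slopes.
candidate_slopes = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

def make_model(slope):
    return lambda x: slope * x

# ERM: evaluate every candidate on the training set and keep the best one.
risks = {a: empirical_risk(make_model(a), training_set) for a in candidate_slopes}
best_slope = min(risks, key=risks.get)
print(best_slope, risks[best_slope])
```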
If we were to apply the ERM principle without more care, the model would end up memorizing the training data,
which we know is bad. This issue is more generally related to the overfitting phenomenon, which can be reduced
by restricting the space of possible models when searching for the one with minimal error. The most severe and
yet most common restriction is to consider only linear models, as in linear classification or linear
regression. Another approach consists of controlling the complexity of the model by regularization.
Regularization-
Regularization is the process of introducing some additional information in order to prevent overfitting.
L1 and L2 are the most common types of regularization. These update the general cost function by adding
another term known as the regularization term.
Ridge Regression:
This technique performs L2 regularization. The main idea is to modify the residual sum of squares (RSS) by
adding a penalty equal to the sum of the squared coefficients, scaled by a regularization parameter.
It is commonly used when the data suffers from multicollinearity (independent variables are highly correlated).
Every technique has pros and cons, and so does Ridge regression. It decreases the complexity of a model but
does not reduce the number of variables, since it never shrinks a coefficient exactly to zero; it only reduces
its magnitude. Hence, this model is not a good fit for feature reduction.
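The effect of the L2 penalty can be seen in a one-coefficient model y ≈ w·x, where the penalized cost RSS + λ·w² has the closed-form minimizer w = Σxy / (Σx² + λ). The sketch below uses that special case; the toy data and the names `ridge_cost` and `ridge_fit` are assumptions for illustration.

```python
def ridge_cost(w, xs, ys, lam):
    """RSS plus the L2 penalty lam * w^2, for a one-coefficient model y ~ w * x."""
    rss = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return rss + lam * w ** 2

def ridge_fit(xs, ys, lam):
    """Minimizer of ridge_cost: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # toy data, exactly y = 2x
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, ridge_fit(xs, ys, lam))
```

As λ grows the coefficient shrinks toward zero, but for any finite λ it stays non-zero, which is why the L2 penalty does not remove variables.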
Lasso Regression:
This regularization technique performs L1 regularization. Unlike Ridge regression, it modifies the RSS by
adding a penalty (shrinkage quantity) equal to the sum of the absolute values of the coefficients.
Lasso regression differs from ridge regression in that it uses absolute values in the penalty function
rather than squares.
Penalizing (or, equivalently, constraining) the sum of the absolute values of the estimates causes some of
the parameter estimates to turn out exactly zero.
The larger the penalty, the more the estimates are shrunk toward zero. This helps with variable selection
among the n given variables.
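For a one-coefficient model y ≈ w·x, minimizing RSS + λ·|w| has a closed form based on soft thresholding, which shows how the L1 penalty can produce an exact zero. The sketch below uses that special case; the toy data and function names are assumptions for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_fit(xs, ys, lam):
    """Minimizer of RSS + lam * |w| for a one-coefficient model y ~ w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # toy data, exactly y = 2x
for lam in (0.0, 10.0, 56.0, 100.0):
    print(lam, lasso_fit(xs, ys, lam))
```

Once the penalty is large enough the coefficient becomes exactly zero, so the corresponding variable drops out of the model.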
Auto Encoders
An auto-encoder is an unsupervised artificial neural network that learns how to efficiently compress and encode
data. It is trained with the backpropagation algorithm and is used for feature learning.
It works in two phases: encoding and decoding. In the encoding phase, the input data are mapped to a
low-dimensional representation space to obtain the most relevant features; in the decoding phase, this
representation is mapped back to the input space.
The code is a compact "summary" or "compression" of the input, also called the latent-space representation.
An auto-encoder consists of three components:
• Encoder
• Code
• Decoder
1. Encoder: This part of the network compresses the input into a latent-space representation. For example,
the encoder encodes an input image as a compressed representation in a reduced dimension. The
compressed image is a distorted version of the original image.
2. Code: This part of the network represents the compressed input, which is fed to the decoder.
3. Decoder: This part decodes the encoded image back to the original dimension. The decoded image is
a lossy reconstruction of the original image, reconstructed from the latent-space representation.
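The encoder, code, and decoder can be illustrated with a deliberately tiny sketch that compresses 2-D points into a 1-D code. The weights here are hand-picked rather than learned, and the mapping is linear; a real auto-encoder would learn (usually nonlinear) encoder and decoder weights by backpropagation. All names and values are assumptions for illustration.

```python
import math

# Hand-picked (not learned) unit direction for the 1-D latent space: an assumption.
U = (1.0 / math.sqrt(5.0), 2.0 / math.sqrt(5.0))

def encode(point):
    """Encoder: compress a 2-D input into a 1-D code (projection onto U)."""
    return point[0] * U[0] + point[1] * U[1]

def decode(code):
    """Decoder: map the 1-D code back to the original 2-D space."""
    return (code * U[0], code * U[1])

def reconstruction_error(point):
    """Squared distance between the input and its reconstruction."""
    restored = decode(encode(point))
    return (point[0] - restored[0]) ** 2 + (point[1] - restored[1]) ** 2

on_line = (1.0, 2.0)     # lies along U, so nothing is lost
off_line = (1.0, 2.5)    # has a component the code cannot represent
print(encode(on_line), reconstruction_error(on_line))
print(encode(off_line), reconstruction_error(off_line))
```

The second point is reconstructed only approximately: the part of the input that does not fit the low-dimensional code is discarded, which is what makes the reconstruction lossy.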
An auto-encoder, by design, reduces data dimensions by learning how to ignore the noise in the data.
Types of auto-encoders:
1. Sparse Auto-encoder
2. Deep Auto-encoder
3. Convolutional Auto-encoder
Sparse Autoencoder
1. In a typical visualization of a sparse auto-encoder, the opacity of a node corresponds to its level of
activation. A sparsity constraint is introduced on the hidden layer.
2. Sparsity may be obtained by additional terms in the loss function during the training process, either by
comparing the probability distribution of the hidden unit activations with some low desired value, or
by manually zeroing all but the strongest hidden unit activations.
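Both ways of obtaining sparsity can be sketched briefly. The first function is a Kullback-Leibler divergence penalty, one common choice for comparing average hidden activations with a low target value; the notes do not name a specific formula, so treat it as an assumption. The second keeps only the k strongest activations. Function names and sample values are illustrative.

```python
import math

def kl_sparsity_penalty(rho, rho_hat):
    """KL divergence between target sparsity rho and mean activations rho_hat.

    Each rho_hat[j] must lie strictly between 0 and 1 (e.g. sigmoid outputs).
    """
    return sum(
        rho * math.log(rho / r) + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - r))
        for r in rho_hat
    )

def keep_top_k(activations, k):
    """Zero all but the k strongest hidden activations (ties keep the earliest)."""
    ranked = sorted(range(len(activations)), key=lambda j: activations[j], reverse=True)
    keep = set(ranked[:k])
    return [a if j in keep else 0.0 for j, a in enumerate(activations)]

# Penalty is zero when activations match the target and grows as they move away.
print(kl_sparsity_penalty(0.05, [0.05, 0.05]))
print(kl_sparsity_penalty(0.05, [0.5, 0.5]))
print(keep_top_k([0.1, 0.9, 0.3, 0.7], k=2))
```

The penalty term would be added to the reconstruction loss during training, whereas the top-k rule is applied directly to the hidden layer's output.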
Deep Auto-encoder
1. Deep auto-encoders consist of two symmetrical deep belief networks, one network for encoding and
another for decoding.
2. Typically, deep auto-encoders have 4 to 5 layers for encoding and the next 4 to 5 layers for decoding.
3. Deep auto-encoders are useful in topic modeling, that is, statistically modeling abstract topics that are
distributed across a collection of documents. They are also capable of compressing images into vectors of
30 numbers.
Convolutional Auto-encoder
1. Auto-encoders in their traditional formulation do not take into account the fact that a signal can be
seen as a sum of other signals.
2. Convolutional auto-encoders learn to encode the input as a set of simple signals and then try to reconstruct
the input from them, possibly modifying the geometry or the reflectance of the image.
3. They are state-of-the-art tools for unsupervised learning of convolutional filters.
4. Once these filters have been learned, they can be applied to any input in order to extract features.
5. These features can then be used for any task that requires a compact representation of the input,
like classification.
Applications of auto-encoders:
• Image Coloring
Auto-encoders are used for converting a black-and-white picture into a colored image. Depending on what is
in the picture, it is possible to tell what the color should be.
• Feature variation
The auto-encoder extracts only the required features of an image and generates the output by removing any
noise or unnecessary interruption.
• Dimensionality Reduction
The encoded representation preserves the essential content of the input but has fewer dimensions. It helps
in providing a similar image described by fewer values.
• De-noising Image
The input seen by the auto-encoder is not the raw input but a stochastically corrupted version of it. A
de-noising auto-encoder is thus trained to reconstruct the original input from the noisy version.
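The training data for a de-noising auto-encoder can be sketched as pairs of (corrupted input, clean target). The masking noise used below, which sets components to zero at random, is one common corruption scheme and is an assumption here, as are the function names, the drop probability, and the sample data.

```python
import random

def corrupt(clean, drop_probability, rng):
    """Masking noise: set each component to 0 with the given probability."""
    return [0.0 if rng.random() < drop_probability else value for value in clean]

def make_denoising_pairs(dataset, drop_probability=0.3, seed=0):
    """Build (corrupted input, clean target) training pairs."""
    rng = random.Random(seed)
    return [(corrupt(sample, drop_probability, rng), list(sample)) for sample in dataset]

dataset = [[0.2, 0.8, 0.5, 0.9], [0.7, 0.1, 0.4, 0.6]]
for noisy, target in make_denoising_pairs(dataset):
    print(noisy, "->", target)
```

The network receives the corrupted vector as input, while the loss is computed against the clean target, so it cannot succeed by simply copying its input.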