UNIT_1_DL
UNIT I
Feed Forward Neural Networks - Gradient Descent and the Back Propagation Algorithm - Unit Saturation - Adaptive Gradient Algorithm - Dropout Regularization - Data Augmentation - CNN Architectures - LeNet-5 - AlexNet - VGG-16 - U-Net
1. Feed Forward Neural Networks
Feed forward neural networks, also referred to as multi-layer neural networks, are artificial neural networks characterized by the absence of loops among nodes. In this type of neural network, information moves exclusively in a forward direction. The data flow begins with the input nodes receiving data, which then traverses the hidden layers before ultimately exiting through the output nodes. Notably, there are no connections in the network that allow the transmission of information back from the output nodes.
Gradient descent takes place through the training and learning process. In the case of multi-layered perceptrons, this weight adjustment process is known as back-propagation. In such scenarios, the hidden layers of the network are adjusted based on the output values generated by the final layer.
Input layer:
The neurons of this layer receive input and pass it on to the other layers of the network.
The number of neurons in the input layer must match the number of features or attributes in the dataset.
Output layer:
Depending on the type of model being built, this layer represents the predicted feature.
Hidden layer:
Hidden layers separate the input and output layers. Depending on the type of model, there may be several hidden layers. Hidden layers contain neurons that transform the input before passing it on to the next layer. The weights of the network are constantly updated to make its predictions easier and more accurate.
Neuron weights:
Neurons are connected by weights, which measure the strength or magnitude of the connection. Input weights can be compared to the coefficients of a linear regression. Weights are typically small values, normally between 0 and 1.
Neurons:
Feed forward networks use artificial neurons, which are adapted from biological neurons. A neural network consists of such artificial neurons. Neurons operate in two steps: first, they compute the weighted sum of their inputs, and second, they apply an activation function to normalize the sum. Activation functions can be either linear or nonlinear. Neurons have weights associated with their inputs, and the network learns these weights during the training phase.
Activation Function:
The neurons in this part of the network are responsible for making decisions. According to the activation function, a neuron's output is a linear or nonlinear transformation of its input. Because the signal passes through so many layers, the activation function prevents a cascading effect from blowing up the neuron outputs. Three widely used activation functions are Sigmoid, Tanh, and the Rectified Linear Unit (ReLU).
Sigmoid:
Maps input values to output values between 0 and 1.
Tanh:
Maps input values to output values between -1 and 1.
Rectified linear Unit:
Only positive values are allowed to flow through this function. Negative values get
mapped to 0.
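As a quick illustration, here is a minimal NumPy sketch of these three activation functions (the function names are chosen for illustration and are not part of any particular library API):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through unchanged, maps negatives to 0
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # all outputs lie in (0, 1)
print(tanh(x))     # all outputs lie in (-1, 1)
print(relu(x))     # negative inputs become 0
```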
In a feed forward neural network, the cost function plays an important role. The classification of data points is only slightly affected by minor adjustments to weights and biases. Thus, a smooth cost function can be used to determine how to adjust the weights and biases so that performance improves.
The mean square error cost function is defined as:
MSE = (1/n) Σ (y_i − ŷ_i)^2
where n is the number of training examples, y_i is the actual (target) value and ŷ_i is the value predicted by the network for example i.
Loss function:
The loss function of a neural network is used to determine whether an adjustment needs to be made during the learning process.
The number of neurons in the output layer equals the number of classes, and the loss measures the difference between the predicted and actual probability distributions. The cross-entropy loss for binary classification is:
L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
where y is the true label (0 or 1) and ŷ is the predicted probability.
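A minimal NumPy sketch of both losses follows (the function and variable names are illustrative only):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean square error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(mse_loss(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```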
In the gradient descent algorithm, the next point is calculated by scaling the gradient at the current position by a learning rate and then subtracting the obtained value from the current position. The value is subtracted to decrease the function (to increase it, the value would be added). This procedure can be written as:
p_new = p − η ∇f(p)
The parameter η scales the gradient and thus determines the step size. In machine learning, the learning rate significantly affects performance.
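As a sketch of this update rule, here is a minimal gradient descent loop for the one-dimensional function f(p) = p^2 (the function, starting point, and step count are illustrative assumptions):

```python
# Minimize f(p) = p**2, whose gradient is f'(p) = 2*p
def grad_f(p):
    return 2.0 * p

p = 5.0        # starting point
eta = 0.1      # learning rate (step size)

for step in range(50):
    # p_new = p - eta * gradient at the current position
    p = p - eta * grad_f(p)

print(p)  # close to 0, the minimizer of f
```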
Output units
In the output layer, output units are those units that provide the desired output or prediction, thereby fulfilling the task that the neural network needs to complete.
There is a close relationship between the choice of output units and the cost function.
Any unit that can serve as a hidden unit can also serve as an output unit in a neural
network.
2. Gradient Descent
This method is the key to minimizing the loss function and achieving our target, which is to predict values close to the original ones.
Gradient = dE/dw
Where E is the error and w is the weight.
Batch Gradient Descent: When we train the model to optimize the loss
function using the mean of all the individual losses in our whole dataset, it is
called Batch Gradient Descent.
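As an illustration, here is a minimal sketch of batch gradient descent for a linear model y = w·x, where the gradient dE/dw is averaged over the whole dataset (the toy data and learning rate are assumptions made for this example):

```python
import numpy as np

# Toy dataset: roughly y = 3*x with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0       # initial weight
eta = 0.01    # learning rate

for epoch in range(200):
    y_pred = w * x
    # Error E = mean((y_pred - y)**2); its gradient dE/dw is averaged
    # over the whole dataset, which is what "batch" gradient descent means.
    grad = np.mean(2 * (y_pred - y) * x)
    w = w - eta * grad

print(w)  # close to 3
```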
3. Back propagation
Back propagation trains a neural network using the chain rule. In simple terms, after each forward pass through the network, this algorithm performs a backward pass to adjust the model's parameters (weights and biases). A typical supervised learning algorithm attempts to find a function that maps input data to the right output. Back propagation works with a multi-layered neural network and learns internal representations of the input-to-output mapping.
The example network has four layers: an input layer, hidden layer I, hidden layer II, and a final output layer. So, the three main layer types are:
1. Input layer
2. Hidden layer
3. Output layer
Each layer has its own way of working and its own way of taking action, such that we are able to get the desired results and correlate these scenarios to our conditions.
Back propagation comes in two forms: static and recurrent. The main difference between the two approaches is that static back propagation provides an instant mapping of static inputs to static outputs, while recurrent back propagation does not.
Back propagation has many advantages: it is fast and simple to implement, it has no parameters to tune apart from the number of inputs, and it requires no prior knowledge about the network. A minimal sketch of the backward pass is shown below.
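The following is a minimal NumPy sketch of the forward and backward pass for a tiny network with one hidden layer, using the sigmoid activation and a squared-error loss (all sizes and values here are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny network: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 1))

x = np.array([[0.5, -1.2]])   # one training example
y = np.array([[1.0]])         # its target
eta = 0.1                     # learning rate

for step in range(100):
    # Forward pass
    h = sigmoid(x @ W1)           # hidden activations
    y_pred = sigmoid(h @ W2)      # network output
    # Backward pass (chain rule) for the loss E = (y_pred - y)**2
    d_out = 2 * (y_pred - y) * y_pred * (1 - y_pred)   # dE/d(output pre-activation)
    d_hid = (d_out @ W2.T) * h * (1 - h)               # dE/d(hidden pre-activation)
    # Gradient descent updates of both weight matrices
    W2 -= eta * h.T @ d_out
    W1 -= eta * x.T @ d_hid

print(float(y_pred[0, 0]))  # approaches the target 1.0
```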
4. Unit Saturation
Unit saturation refers to a situation where the neurons (or units) of a neural network become saturated, meaning they reach extreme values (for example, very close to 0 or 1 for a sigmoid unit) and stop learning effectively. This can lead to issues such as vanishing or exploding gradients during the training process.
Vanishing gradients occur when the gradients of the loss function with respect to the
parameters become very small, causing the model to stop learning or learn very slowly.
On the other hand, exploding gradients happen when the gradients become very large,
leading to unstable training. Unit saturation is often related to the choice of activation
functions.
4.1 Activation Functions:
Unit saturation is closely tied to the choice of activation function. Common activation functions include:
Sigmoid: Maps inputs to values between 0 and 1; it saturates for large positive or negative inputs.
Hyperbolic Tangent (tanh): Similar to the sigmoid but maps values between -1 and 1. It also suffers from saturation issues.
Rectified Linear Unit (ReLU): It is popular but can suffer from saturation for negative inputs (the output is a constant 0), leading to dead neurons during training.
Gated units such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) use gated activation functions to address the vanishing gradient problem in recurrent neural networks. A small illustration of saturating gradients follows.
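As a minimal illustration of saturation, the sketch below computes the derivative of the sigmoid at a few inputs; for large positive or negative inputs the derivative is nearly zero, which is what causes vanishing gradients (the sample values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(x))
# At x = +/-10 the gradient is about 4.5e-05: the unit is saturated
# and contributes almost nothing to learning.
```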
5. Adaptive Gradient Algorithm (AdaGrad)
The Adaptive Gradient Algorithm (AdaGrad) adapts the learning rate of each parameter individually. Specifically, it accumulates the sum of the squared gradients over time, one accumulator per parameter, and divides the base learning rate by the square root of this sum. This reduces the effective learning rate for parameters with big gradients while keeping it comparatively large for parameters with modest gradients.
The idea behind this method is that it enables the learning rate to adapt to the geometry of the loss function, taking more conservative steps in steep gradient directions while making faster progress in flatter directions. This may result in quicker convergence and improved generalization.
However, this method has significant downsides. One of the most significant concerns is that the accumulated gradient magnitudes may grow quite large over time, resulting in a very small effective learning rate that can inhibit further learning. Contemporary optimization algorithms such as Adam and RMSProp combine the adaptive learning rate idea with other strategies to limit the growth of gradient magnitudes over time.
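A minimal sketch of the AdaGrad update for a single parameter vector follows (the toy objective, base learning rate, and step count are illustrative assumptions):

```python
import numpy as np

# Minimize f(w) = 0.5 * w[0]**2 + 5 * w[1]**2, which is much steeper
# in the second coordinate than in the first.
def grad(w):
    return np.array([w[0], 10.0 * w[1]])

w = np.array([5.0, 5.0])
eta = 1.0                     # base learning rate
accum = np.zeros_like(w)      # running sum of squared gradients, one per parameter
eps = 1e-8                    # avoids division by zero

for step in range(100):
    g = grad(w)
    accum += g ** 2                          # accumulate squared gradients
    w -= eta * g / (np.sqrt(accum) + eps)    # per-parameter scaled step

print(w)  # both coordinates approach 0 despite very different curvatures
```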
5.1 Types of Gradient Descent
Batch Gradient Descent – This is the most common kind of gradient descent, in which the gradient is calculated at each step using the whole dataset. The approach updates the parameters by taking a step in the direction of the negative gradient of the loss function.
Adaptive learning rate – Modifies the learning rate of each parameter depending on that parameter's past gradients. This implies that for parameters with big gradients the learning rate is lowered, while for parameters with small gradients the learning rate is raised, allowing the algorithm to converge quicker and preventing it from overshooting the ideal solution.
Adaptability to noisy data – This method can smooth out the impact of noisy data by assigning smaller learning rates to parameters whose gradients are large owing to noisy input.
Handling sparse data efficiently – It is particularly good at dealing with sparse data, which is prevalent in NLP and recommendation systems. This is achieved by giving sparse parameters larger learning rates, which may speed up convergence.
6. Dropout
When you have training data, if you train your model too much it might overfit, and when you get the actual test data for making predictions it will probably not perform well. Dropout regularization is one technique used to tackle overfitting problems in deep learning. We will go over some theory first, then write Python code using TensorFlow and see how adding a dropout layer increases the performance of a neural network.
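Below is a sketch of how a dropout layer is added in TensorFlow/Keras (the layer sizes and the dropout rate of 0.5 are illustrative choices, not values prescribed above):

```python
import tensorflow as tf

# A small fully connected classifier with dropout between the dense layers.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),           # e.g. flattened 28x28 images
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # randomly zeroes 50% of activations,
                                                   # during training only
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```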
7. Data Augmentation
Augmented data: Derived from original images with some sort of minor geometric
transformations (such as flipping, translation, rotation, or the addition of noise) in
order to increase the diversity of the training set.
Here are some of the reasons why data augmentation techniques have been gaining
popularity in the last few years.
Improves the performance of ML models (more diverse datasets).
Data augmentation methods are widely used in practically every cutting-edge deep
learning application such as object detection, image classification, image recognition,
natural language understanding, semantic segmentation, and more.
Data collection and data labeling can be time-consuming and expensive processes for
deep learning models. Companies can cut operational expenses by transforming
datasets using data augmentation techniques.
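A minimal TensorFlow/Keras sketch of such geometric augmentations is shown below (the specific layers and factors are illustrative choices):

```python
import tensorflow as tf

# Augmentation pipeline: random flips, rotations, and translations
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),         # up to +/-10% of a full turn
    tf.keras.layers.RandomTranslation(0.1, 0.1), # shift up to 10% in each direction
])

# Apply to a batch of images (random data stands in for real images here)
images = tf.random.uniform((8, 224, 224, 3))
augmented = data_augmentation(images, training=True)
print(augmented.shape)  # (8, 224, 224, 3)
```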
8. CNN Architectures
8.1 LeNet-5
The ability of LeNet-5 to process higher-resolution images requires larger and more numerous convolutional layers, so the technique is constrained by the availability of computing resources.
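For illustration, here is a minimal Keras sketch of the classic LeNet-5 layout (two convolutional layers with pooling followed by three fully connected layers); the activations and exact details are simplified assumptions rather than a faithful reproduction of the original 1998 network:

```python
import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),                   # 32x32 grayscale input
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="tanh"),
    tf.keras.layers.AveragePooling2D(pool_size=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="tanh"),
    tf.keras.layers.AveragePooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),            # 10 digit classes
])

lenet5.summary()
```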
8.2 AlexNet
The AlexNet CNN architecture won the 2012 ImageNet ILSVRC deep learning challenge by a large margin, achieving a top-5 error rate of 17% while the second-best entry achieved 26%.
It was introduced by Alex Krizhevsky, together with Ilya Sutskever and Geoffrey Hinton. It is quite similar to LeNet-5, only much bigger and deeper, and it was the first architecture to stack convolutional layers directly on top of each other, instead of placing a pooling layer on top of each convolutional layer.
AlexNet has 60 million parameters and a total of 8 layers: 5 convolutional and 3 fully connected layers.
AlexNet was the first to use Rectified Linear Units (ReLUs) as activation functions.
It was the first CNN architecture that used GPUs to improve performance.
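For reference, here is a simplified Keras sketch of the 8-layer AlexNet layout (it omits the original local response normalization and the two-GPU split, so it is an approximation rather than the exact 2012 network):

```python
import tensorflow as tf

alexnet = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(227, 227, 3)),
    tf.keras.layers.Conv2D(96, 11, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    # Three convolutional layers stacked directly on top of each other
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation="softmax"),   # 1000 ImageNet classes
])

alexnet.summary()  # roughly 60 million parameters: 5 conv + 3 dense layers
```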
8.3 VGG16 Architecture
VGG16, as its name suggests, is a 16-layer deep neural network. VGG16 is thus a
relatively extensive network with a total of 138 million parameters—it’s huge even by
today’s standards. However, the simplicity of the VGGNet16 architecture is its main
attraction.
The VGGNet architecture incorporates the most important convolution neural network features.
A VGG network consists of small convolution filters. VGG16 has three fully connected
layers and 13 convolutional layers.
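Since VGG16 ships with Keras, a quick way to inspect its 13 convolutional and 3 fully connected layers is shown below (passing weights=None builds the architecture without downloading the pretrained ImageNet weights):

```python
import tensorflow as tf

# Build the VGG16 architecture without pretrained weights
vgg16 = tf.keras.applications.VGG16(weights=None,
                                    input_shape=(224, 224, 3),
                                    classes=1000)

vgg16.summary()  # lists the 13 Conv2D layers, the pooling layers,
                 # and the 3 Dense layers (~138 million parameters)
```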
8.4 U-Net
U-Net is a CNN architecture originally developed for biomedical image segmentation. Two of its defining characteristics are:
1. Encoder-Decoder Structure: A contracting (encoder) path applies convolutions and pooling to capture context, while an expanding (decoder) path uses up-sampling to recover spatial resolution and produce an output of the same size as the input.
4. Skip Connections: Feature maps from each encoder level are concatenated with the corresponding decoder level, so that the fine spatial detail lost during down-sampling can be recovered.
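A minimal two-level U-Net sketch in Keras follows, to make the encoder-decoder structure and skip connections concrete (the depth, filter counts, and input size are illustrative assumptions; the original U-Net is considerably deeper):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, as used at every level of U-Net
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = tf.keras.Input(shape=(128, 128, 1))

# Encoder (contracting path)
e1 = conv_block(inputs, 32)
p1 = layers.MaxPooling2D(2)(e1)
e2 = conv_block(p1, 64)
p2 = layers.MaxPooling2D(2)(e2)

# Bottleneck
b = conv_block(p2, 128)

# Decoder (expanding path) with skip connections
u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
d2 = conv_block(layers.Concatenate()([u2, e2]), 64)   # skip connection from e2
u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(d2)
d1 = conv_block(layers.Concatenate()([u1, e1]), 32)   # skip connection from e1

# 1x1 convolution producing a per-pixel segmentation mask
outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)

unet = tf.keras.Model(inputs, outputs)
unet.summary()
```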