
21AD1601 – DEEP LEARNING

UNIT I

INTRODUCTION TO DEEP LEARNING

Feed forward Neural networks - Gradient descent and the back propagation algorithm
- Unit saturation - Adaptive Gradient Algorithm - Dropout Regularization - Data
Augmentation - CNN Architectures - LeNet-5 - AlexNet - VGG-16 - U-Net

1. Feed forward neural networks

Feed forward neural networks, also referred to as multi-layer neural networks, are
artificial neural networks characterized by the absence of loops among nodes. In this
type of neural network, information moves exclusively in the forward direction. The data
flow begins with input nodes receiving data, which then traverses through hidden
layers before ultimately exiting through output nodes. Notably, there are no connections
in the network that allow the transmission of information back from the output nodes.

A feed forward neural network approximates functions in the following way:

 A classifier computes the mapping y = f*(x), so an input x is assigned to category y.

 According to the feed forward model, y = f(x; θ); the parameters θ are learned so that
this mapping gives the closest approximation of the function.

1.1 Working principle of a feed forward neural network


In its simplest form, a feed forward neural network appears as a single-layer
perceptron.

This model multiplies the inputs by their corresponding weights as they enter the layer.
The weighted input values are then summed, and if the sum surpasses a specified threshold
(typically set at zero), the output is set to 1; otherwise, it is set to -1. The single-layer
perceptron, functioning as a feed-forward neural network, is commonly employed for
classification tasks. Additionally, machine learning can be integrated into single-layer
perceptrons, where training involves adjusting the weights using the delta rule to compare
outputs with intended values.

Through the training and learning process, gradient descent takes place. In the case of
multi-layered perceptrons, this weight adjustment process is known as back-propagation.
In such scenarios, the hidden layers of the network are adjusted based on the output
values generated by the final layer.
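
As an illustration of the working principle above, here is a minimal NumPy sketch of a
single-layer perceptron trained with the delta rule; the toy OR data, learning rate, and
zero threshold are illustrative assumptions, not part of the original material.

```python
import numpy as np

# Minimal single-layer perceptron: weighted sum, threshold at 0, delta-rule update.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])           # targets in {-1, +1} (logical OR)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)     # weights
b = 0.0                               # bias
lr = 0.1                              # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        s = np.dot(w, xi) + b         # weighted sum of the inputs
        out = 1 if s > 0 else -1      # threshold activation
        # Delta rule: adjust weights in proportion to the error
        w += lr * (target - out) * xi
        b += lr * (target - out)

print("learned weights:", w, "bias:", b)
```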

1.2 Layers of feed forward neural network

 Input layer:
The neurons of this layer receive input and pass it on to the other layers of the network.
The number of neurons in the input layer must match the number of features (attributes) in
the dataset.
 Output layer:
Depending on the type of model being built, this layer represents the predicted feature or class.
 Hidden layer:
Hidden layers separate the input and output layers. Depending on the type of model,
there may be several hidden layers. Hidden layers contain several neurons that transform
the input before transferring it to the next layer. The weights of this network are
constantly updated to make prediction easier.
 Neuron weights:
Neurons are connected by weights, which measure the strength or magnitude of the connection.
Input weights can be compared to the coefficients in linear regression.
Weights are typically small values, normally between 0 and 1.
 Neurons:
Feed forward networks are built from artificial neurons, which are adapted from biological
neurons. A neural network consists of artificial neurons. Neurons function in two steps:
first, they compute the weighted sum of their inputs, and second, they pass that sum through
an activation function. Activation functions can be either linear or nonlinear. Neurons have
weights associated with their inputs, and the network learns these weights during the
training phase.
 Activation Function:
The activation function is where a neuron makes its decision. It determines whether the
neuron's response is linear or nonlinear, and because the signal passes through many layers,
it prevents a cascading effect from blowing up the neuron outputs. Activation functions fall
into three major categories: sigmoid, tanh, and Rectified Linear Unit (ReLU).
 Sigmoid:
Maps input values to output values between 0 and 1.
 Tanh:
Maps input values to output values between -1 and 1.
 Rectified Linear Unit (ReLU):
Only positive values are allowed to flow through this function; negative values are
mapped to 0.
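
The three activation functions listed above can be written directly in NumPy; a small
sketch (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    # Maps any real input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Maps any real input to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through unchanged; negative values become 0
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```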

1.3 Functions in a feed forward neural network


 Cost function

In a feed forward neural network, the cost function plays an important role. Minor
adjustments to weights and biases should have only a small effect on the categorized data
points; a smooth cost function therefore makes it possible to determine how to adjust the
weights and biases so that performance improves.
Following is a definition of the mean squared error cost function:

C(w, b) = (1 / 2n) Σ_x ‖y(x) − a‖²

Where,

w = the weights gathered in the network
b = the biases
n = the number of training inputs
a = the vector of outputs from the network when x is the input
y(x) = the desired (target) output for input x
x = a training input
‖v‖ = the length (norm) of vector v
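
A short sketch of how this cost could be computed for a batch of network outputs, assuming
NumPy arrays for the desired outputs y(x) and the actual outputs a; the toy values are
illustrative:

```python
import numpy as np

def mse_cost(desired, outputs):
    # C = 1/(2n) * sum over inputs of ||y(x) - a||^2
    n = len(desired)
    return np.sum(np.linalg.norm(desired - outputs, axis=1) ** 2) / (2 * n)

# Toy example: 3 training inputs, 2 output units each (illustrative values)
y_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
a_out  = np.array([[0.8, 0.1], [0.2, 0.7], [0.9, 0.2]])
print(mse_cost(y_true, a_out))
```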

 Loss function

The loss function of a neural network is used to determine whether an adjustment needs
to be made in the learning process. The number of neurons in the output layer is equal to
the number of classes, and the loss measures the difference between the predicted and
actual probability distributions. Following is the cross-entropy loss for binary
classification, with p the predicted probability of the positive class:

L = −[ y log(p) + (1 − y) log(1 − p) ]

For multiclass categorization, the cross-entropy loss is:

L = − Σ_c y_c log(p_c)
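
A minimal NumPy sketch of both cross-entropy losses, assuming predicted probabilities are
already available; the example values and the epsilon clipping constant are illustrative:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # L = -[y*log(p) + (1-y)*log(1-p)], averaged over examples
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    # L = -sum_c y_c * log(p_c), averaged over examples
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([[1, 0, 0], [0, 1, 0]]),
                                np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])))
```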


 Gradient learning algorithm

In the gradient descent algorithm, the next point is calculated by scaling the gradient
at the current position by a learning rate and subtracting the resulting value from the
current position. The value is subtracted in order to decrease the function (to increase
it, the value would be added). This procedure can be written as:

p(n+1) = p(n) − η ∇f(p(n))

The parameter η scales the gradient and thus determines the step size. In machine
learning, performance is significantly affected by the learning rate.
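
A tiny sketch of this update rule on a one-dimensional convex function; the function
f(w) = (w − 3)², the starting point, the learning rate, and the iteration count are all
illustrative assumptions:

```python
# Plain gradient descent on the convex function f(w) = (w - 3)^2.
def grad(w):
    return 2 * (w - 3)       # derivative of (w - 3)^2

w = 0.0                      # initial position
eta = 0.1                    # learning rate (step size)
for _ in range(100):
    w = w - eta * grad(w)    # next point = current point - eta * gradient

print(w)                     # approaches the minimum at w = 3
```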

 Output units

In the output layer, output units are those units that provide the desired output or
prediction, thereby fulfilling the task that the neural network needs to complete.

There is a close relationship between the choice of output units and the cost function.
Any unit that can serve as a hidden unit can also serve as an output unit in a neural
network.

1.4 Advantages of feed forward Neural Networks

 Machine learning can be boosted with feed forward neural networks' simplified
architecture.
 The multiple networks within a feed forward model operate independently, with a
moderated intermediary.
 Complex tasks need several neurons in the network.
 Neural networks can handle and process nonlinear data easily
compared to perceptrons and sigmoid neurons, which are otherwise
complex.
 A neural network deals with the complicated problem of decision boundaries.
 Depending on the data, the neural network architecture can vary. For example,
convolutional neural networks (CNNs) perform exceptionally well in image
processing, whereas recurrent neural networks (RNNs) perform well in text and
voice processing.
 Neural networks need graphics processing units (GPUs) to handle large
datasets, for massive computational and hardware performance. Several GPU-backed
platforms are widely used, including Kaggle Notebooks and Google Colab
Notebooks.

1.5 Applications of feed forward neural networks

 Physiological feed forward system


It is possible to identify feed forward management in this situation because the
central involuntary nervous system regulates the heartbeat before exercise.
 Gene regulation and feed forward
Detecting persistent (non-temporary) changes in the environment is a function of this
motif as a feed forward system. This pattern is found in many well-known gene
regulatory networks.
 Automation and machine management
Automation control using feed forward is one of the disciplines in automation.

 Parallel feed forward compensation with derivative


An open-loop transfer function converts non-minimum phase systems into minimum phase
systems using this technique.

2. Gradient Descent
This method is the key to minimizing the loss function and achieving our target,
which is to predict close to the original value.

Gradient descent for Mean Squared Error (MSE)

If we plot the loss function against the weights, we see that it has a parabolic (convex)
shape with a single global minimum, which we need to find in order to obtain the minimum
loss value. So, we always try to use a loss function that is convex in shape in order to
get a proper minimum. The predicted results depend on the weights, so the loss can be
viewed as a function with weights on the X-axis and loss on the Y-axis. Initially, the
model assigns random weights to the features; say it initializes the weight to a. This
generates a loss which is far from the minimum point L-min. We can see that if we move
the weights towards the positive x-axis we can optimize the loss function and achieve the
minimum value. But how will the machine know? We need to optimize the weight to minimize
the error, so obviously we need to check how the error varies with the weights. To do this
we find the derivative of the error with respect to the weight. This derivative is called
the gradient.

Gradient = dE/dw
Where E is the error and w is the weight.

2.1 Gradient Descent Methods

 Stochastic Gradient Descent: When we train the model to optimize the loss function
using only one particular example from our dataset at a time, it is called
Stochastic Gradient Descent.

 Batch Gradient Descent: When we train the model to optimize the loss
function using the mean of all the individual losses in our whole dataset, it is
called Batch Gradient Descent.

 Mini-Batch Gradient Descent: As we discussed, batch gradient descent takes a lot
of time and is therefore somewhat inefficient. Stochastic Gradient Descent (SGD),
on the other hand, trains on only one example at a time. How well would a baby
learn if it were shown only one bike and told to learn about all other bikes?
It's simple: its decisions would be biased towards the peculiarities of that single
example. The same holds for SGD: the model may become too biased towards the
peculiarities of a particular example. So, we use the mean loss over a batch of,
say, 10-1000 examples in order to deal with both problems.
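
A rough sketch of how the three variants differ only in how many examples feed each
gradient estimate; the toy linear-regression data and hyperparameters are assumptions:

```python
import numpy as np

# Toy linear regression y = 2x + 1 with noise (illustrative data and settings).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=1000)

def gradient(w, b, xb, yb):
    # Gradient of the mean squared error for the batch (xb, yb)
    err = w * xb[:, 0] + b - yb
    return np.mean(err * xb[:, 0]), np.mean(err)

def train(batch_size, lr=0.1, epochs=20):
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            gw, gb = gradient(w, b, X[batch], y[batch])
            w -= lr * gw
            b -= lr * gb
    return w, b

# With the same epoch budget, SGD and mini-batch make many more (noisier) updates
# per epoch than full-batch gradient descent.
print("SGD (batch_size=1):        ", train(1))
print("Mini-batch (batch_size=32):", train(32))
print("Batch GD (full dataset):   ", train(len(X)))
```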

3. Back propagation

Back propagation is used to train the neural network using the chain rule. In
simple terms, after each forward pass through the network, this algorithm performs a
backward pass to adjust the model's parameters (weights and biases). A typical
supervised learning algorithm attempts to find a function that maps input data to the
right output. Back propagation works with a multi-layered neural network and learns
internal representations of the input-to-output mapping.

3.1 Back propagation working

A typical example network has four layers: an input layer, hidden layer I, hidden layer II
and a final output layer. So, the main three layer types are:

1. Input layer
2. Hidden layer
3. Output layer
Each layer has its own way of working and its own way of taking action, such that we are
able to get the desired results and correlate these scenarios to our conditions.

The functioning of the back propagation approach is as follows:

1. The input layer receives x
2. The input is modeled using weights w
3. Each hidden layer calculates its output, and the data is ready at the output layer
4. The difference between the actual output and the desired output is known as the error
5. Go back to the hidden layers and adjust the weights so that this error is reduced
in future runs
This process is repeated till we get the desired output. The training phase is done with
supervision. Once the model is stable, it is used in production.
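
To make the five steps concrete, here is a small NumPy sketch of back propagation for a
two-layer network trained on XOR; the layer sizes, learning rate, and epoch count are
illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=1.0, size=(2, 4))   # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=1.0, size=(4, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)              # hidden layer activations
    out = sigmoid(h @ W2 + b2)            # network output

    # Backward pass (chain rule): propagate the error from output to hidden layer
    d_out = (out - y) * out * (1 - out)   # gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient at the hidden layer

    # Update weights in the direction that reduces the error
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # outputs should approach [[0], [1], [1], [0]]
```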

3.2 Types of back propagation

There are two types of back propagation networks.

 Static back propagation


 Recurrent back propagation
 Static back propagation
In this network, mapping of a static input generates static output. Static
classification problems like optical character recognition will be a suitable domain
for static back propagation.
 Recurrent back propagation
Recurrent back propagation is conducted until a certain threshold is met. After the
threshold is reached, the error is calculated and propagated backward.

The difference between these two approaches is that static back propagation provides an
instant mapping, whereas recurrent back propagation does not.

3.3 Why do we need back propagation?

Back propagation has many advantages, some of the important ones are listed below-

 Back propagation is fast, simple and easy to implement


 There are no parameters to be tuned
 Prior knowledge about the network is not needed, thus it is a flexible
method
 This approach works very well in most cases
 The model need not learn the features of the function
4. Unit saturation

Unit saturation refers to a situation where the neurons (or units) in a neural
network become saturated, meaning they reach extreme values (for example, very close to
0 or 1 for a sigmoid) and stop learning effectively. This can lead to issues like
vanishing or exploding gradients during the training process.

Vanishing gradients occur when the gradients of the loss function with respect to the
parameters become very small, causing the model to stop learning or learn very slowly.
On the other hand, exploding gradients happen when the gradients become very large,
leading to unstable training. Unit saturation is often related to the choice of activation
functions.
4.1 Activation Functions:
Unit saturation is closely tied to the choice of activation functions. Common
activation functions include:

 Sigmoid Function: It squashes input values between 0 and 1. When inputs
are too large or too small, the sigmoid function saturates, leading to vanishing
gradients.

 Hyperbolic Tangent (tanh): Similar to the sigmoid but maps values between -1 and 1.
It also suffers from saturation issues.

 Rectified Linear Unit (ReLU): It is popular but can saturate for negative inputs
(outputting zero), leading to dead neurons during training.

4.2 Vanishing and Exploding Gradients:

 Vanishing Gradients: During back propagation, gradients diminish as they


are propagated back through the layers. If the gradients become very small,
the model parameters do not update effectively, causing slow or halted
learning.

 Exploding Gradients: Conversely, gradients can become extremely large,


causing the model parameters to update drastically. This can lead to unstable
training.
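
A small numerical sketch of the vanishing-gradient effect: the derivative of the sigmoid
is at most 0.25, so a product of such derivatives over many layers shrinks rapidly. The
fixed pre-activation value and depth are illustrative simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 2.5                      # a moderately large pre-activation (saturating region)
grad = 1.0
for layer in range(10):
    s = sigmoid(x)
    grad *= s * (1 - s)      # multiply by the local sigmoid derivative at each layer
    print(f"after layer {layer + 1}: accumulated gradient factor = {grad:.2e}")
```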

4.3 Impacts on Learning:

 Slow Convergence: Saturated units result in slow convergence during training,
as the model learns at a snail's pace due to small gradient values.

 Limited Representational Capacity: Saturation limits the ability of the


model to represent and learn complex patterns in the data.

4.4 Mitigation Strategies:

 Weight Initialization: Careful initialization of weights helps in preventing


saturation. Techniques like He initialization for ReLU activations or
Xavier/Glorot initialization for sigmoid and tanh activations are common.

 Batch Normalization: Normalizing inputs within a mini-batch helps in
reducing internal covariate shift, mitigating saturation issues.

 Use of Different Activations: Leaky ReLU, Parametric ReLU, and


Exponential Linear Unit (ELU) are variations of ReLU that address its
saturation problems.
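
A short sketch of the two initialization schemes mentioned above for a dense layer; the
fan-in and fan-out values are illustrative:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Xavier/Glorot (uniform): suited to sigmoid/tanh activations
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # He (normal): suited to ReLU activations, variance 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_tanh = xavier_init(256, 128, rng)
W_relu = he_init(256, 128, rng)
print(W_tanh.std(), W_relu.std())
```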

4.5 Advanced Architectures:


 Gated Activation Functions: Architectures like Long Short-Term Memory

(LSTM) and Gated Recurrent Unit (GRU) use gated activation functions to
address the vanishing gradient problem in recurrent neural networks.

4.6 Research Trends:

 Swish Activation: Introduced as an alternative to ReLU, Swish aims to combine
the best of both worlds, avoiding saturation issues while maintaining non-linearity.

To mitigate unit saturation, techniques such as careful weight initialization, using
different activation functions, or employing normalization techniques like batch
normalization can be applied. These approaches help to maintain a balance between the
activations, preventing them from saturating and facilitating better learning during the
training process.

5. Adaptive Gradient Algorithm (AdaGrad)

AdaGrad is a well-known optimization method used in machine learning and deep learning.
AdaGrad's concept is to modify the learning rate for every parameter in a model
depending on the parameter's previous gradients.

Specifically, it accumulates the sum of the squares of the gradients over time, one sum
per parameter, and divides the base learning rate by the square root of this accumulated
sum. This reduces the learning rate for parameters with big gradients while keeping a
relatively larger learning rate for parameters with modest gradients.
The idea behind this particular method is that it enables the learning rate to adapt to
the geometry of the loss function, taking more conservative steps in steep gradient
directions while moving relatively faster in flatter gradient directions. This may result
in quicker convergence and improved generalization.
However, this method has significant downsides. One of the most significant concerns is
that the accumulated gradient magnitudes may get quite big over time, resulting in a
meager effective learning rate that can inhibit further learning. Adam and RMSProp,
two contemporary optimization algorithms, combine an adaptive learning rate with other
strategies (such as exponentially decaying averages of squared gradients) to limit the
growth of the accumulated magnitudes over time.
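
A minimal NumPy sketch of the AdaGrad update described above, applied to a simple
quadratic; the objective, base learning rate, and iteration count are illustrative
assumptions:

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * ||w - [1, -2]||^2
    return w - np.array([1.0, -2.0])

w = np.zeros(2)
eta = 0.5                      # base learning rate
eps = 1e-8                     # small constant for numerical stability
G = np.zeros(2)                # running sum of squared gradients, one per parameter

for _ in range(200):
    g = grad(w)
    G += g ** 2                            # accumulate squared gradients
    w -= eta * g / (np.sqrt(G) + eps)      # per-parameter scaled step

print(w)                       # approaches the minimum at [1, -2]
```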
5.1 Types of Gradient Descent

Gradient Descent is a prominent optimization approach used in machine learning and
deep learning to determine the best values for a model's parameters. It is an iterative
approach that works by minimizing a loss function that quantifies the difference
between the expected and real outputs of the model.
Gradient descent is classified into three types:

 Batch Gradient Descent– This is the standard form of gradient descent, in
which the gradient is calculated at each step using the whole dataset. The
approach updates the parameters by taking a step in the direction of the loss
function's negative gradient.

 Stochastic Gradient Descent (SGD)– In this variation of gradient descent, the


gradient is calculated at each step using a single randomly picked sample from
the dataset. Because the gradient is derived from a single data point, it may not
correctly reflect the general structure of the dataset. This makes the process
quicker but also noisier.

 Mini-batch Gradient Descent– A hybrid of batch gradient descent and
stochastic gradient descent. In mini-batch gradient descent the gradient is produced
using a small batch of randomly chosen samples from the dataset rather than the
complete dataset or a single example. This method strikes a compromise between
SGD's noise and batch gradient descent's computing cost.

5.2 Advantages of AdaGrad

 Easy to use– It’s a reasonably straightforward optimization technique and
may be applied to various models.

 No need for manual tuning– There is no need to manually tune the learning rate,
since this optimization method automatically adjusts it for each parameter.

 Adaptive learning rate– Modifies the learning rate for each parameter
depending on the parameter's past gradients. This implies that for parameters
with big gradients the learning rate is lowered, while for parameters with small
gradients the learning rate remains relatively large, allowing the algorithm to
converge quicker and preventing it from overshooting the ideal solution.

 Adaptability to noisy data– This method can smooth out the impact of noisy data
by assigning smaller learning rates to parameters with strong gradients caused by
noisy input.

 Handles sparse data efficiently– It is particularly good at dealing with sparse
data, which is prevalent in NLP and recommendation systems. This is achieved by
giving sparse parameters relatively larger learning rates, which may speed
convergence.

6. Dropout

In machine learning, “dropout” refers to the practice of disregarding certain nodes in a
layer at random during training. Dropout is a regularization approach that prevents
overfitting by ensuring that no units are co-dependent with one another.

6.1 Dropout Regularization

When you have training data, if you try to train your model too much, it might overfit,
and when you get the actual test data for making predictions, it will probably not
perform well. Dropout regularization is one technique used to tackle overfitting
problems in deep learning. We will first go over some theory, and then write Python
code using TensorFlow to see how adding a dropout layer can improve the performance of
a neural network.
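
A minimal TensorFlow/Keras sketch of adding dropout layers to a small network; the layer
sizes, dropout rate, and input shape are illustrative, not a tuned configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                    # randomly drops 30% of units during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```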

7. Data Augmentation

Data augmentation is a process of artificially increasing the amount of data by


generating new data points from existing data. This includes adding minor alterations to data
or using machine learning models to generate new data points in the latent space of original
data to amplify the dataset.

The difference between augmented data and synthetic data:

 Synthetic data: When data is generated artificially without using real-world images.
Synthetic data is often produced by Generative Adversarial Networks (GANs).

 Augmented data: Derived from original images with some sort of minor geometric
transformation (such as flipping, translation, rotation, or the addition of noise) in
order to increase the diversity of the training set.
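
A small TensorFlow/Keras sketch of augmenting a batch of images with the kinds of
transformations listed above; the specific layers and their parameters are illustrative
choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),        # random horizontal flip
    layers.RandomRotation(0.1),             # rotate by up to +/- 10% of a full turn
    layers.RandomTranslation(0.1, 0.1),     # shift up to 10% in each direction
    layers.GaussianNoise(0.05),             # add a small amount of noise
])

images = tf.random.uniform((8, 224, 224, 3))    # a dummy batch of images
augmented = augment(images, training=True)      # apply the random transformations
print(augmented.shape)
```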

7.1 The importance of Data Augmentation

Here are some of the reasons why data augmentation techniques have been gaining
popularity in the last few years.
 Improves the performance of ML models (more diverse datasets).

 Data augmentation methods are widely used in practically every cutting-edge deep
learning application such as object detection, image classification, image recognition,
natural language understanding, semantic segmentation, and more.

 Augmented data improves the performance and results of deep learning
models by generating new and diverse instances for training datasets.

 Reduces operation costs related to data collection

 Data collection and data labeling can be time-consuming and expensive processes for
deep learning models. Companies can cut operational expenses by transforming
datasets using data augmentation techniques.

7.2 Limitations of Data Augmentation


This method also comes with its own challenges, including:

 Cost of quality assurance of the augmented datasets.

 Research and Development to build synthetic data with advanced applications.

 Verification of image augmentation techniques like GANs is challenging.

 Finding an optimal augmentation strategy for the data is non-trivial.

 The inherent bias of original data persists in augmented data.

8. Convolutional Neural Network (CNN) Architectures


A Convolutional Neural Network (CNN) is a type of Deep Learning neural
network architecture commonly used in Computer Vision. Computer vision is a field
of Artificial Intelligence that enables a computer to understand and interpret the
image or visual data.
Let’s discuss how CNN architectures developed and grew over time.

 8.1 LeNet-5

 The first LeNet-5 architecture is the most widely known CNN architecture. It was
introduced in 1998 and is widely used for handwritten digit recognition.
 LeNet-5 has 2 convolutional layers and 3 fully connected layers.
 The LeNet-5 architecture has about 60,000 parameters.

LeNet-5

 Processing higher-resolution images requires larger and more numerous convolutional
layers, so the applicability of LeNet-5 is constrained by the availability of
computing resources.
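
A Keras sketch of a LeNet-5 style network; the filter counts follow the commonly quoted
layout (2 convolutional and 3 fully connected layers), but the activation and pooling
choices here are an approximation rather than an exact reproduction of the original paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                          # 32x32 grayscale digits
    layers.Conv2D(6, kernel_size=5, activation="tanh"),       # C1: 6 feature maps
    layers.AveragePooling2D(pool_size=2),                     # S2: subsampling
    layers.Conv2D(16, kernel_size=5, activation="tanh"),      # C3: 16 feature maps
    layers.AveragePooling2D(pool_size=2),                     # S4: subsampling
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),                     # C5 / fully connected
    layers.Dense(84, activation="tanh"),                      # F6
    layers.Dense(10, activation="softmax"),                   # output: 10 digit classes
])
lenet5.summary()
```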

 8.2 AlexNet
 The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large
margin: it achieved a top-5 error rate of 17%, while the second-best entry
achieved 26%.
 It was developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. The
architecture is quite similar to LeNet-5, only much bigger and deeper, and it was
the first to stack convolutional layers directly on top of one another instead of
placing a pooling layer on top of every convolutional layer.
 AlexNet has 60 million parameters and a total of 8 layers: 5 convolutional and
3 fully connected.
 AlexNet was among the first architectures to use Rectified Linear Units (ReLUs)
as activation functions.
 It was one of the first CNN architectures to use GPUs to improve performance.

AlexNet
8.3 VGG16 Architecture
VGG16, as its name suggests, is a 16-layer deep neural network. VGG16 is thus a
relatively extensive network with a total of 138 million parameters—it’s huge even by
today’s standards. However, the simplicity of the VGGNet16 architecture is its main
attraction.
The VGGNet architecture incorporates the most important convolutional neural
network features.

A VGG network consists of small convolution filters. VGG16 has three fully connected
layers and 13 convolutional layers.

Here is a quick outline of the VGG architecture:


1. Input—VGGNet receives a 224×224 image input. In the ImageNet competition,
the model’s creators kept the image input size constant by cropping a 224×224
section from the center of each image.
2. Convolutional layers—the convolutional filters of VGG use the smallest
possible receptive field of 3×3. VGG also uses a 1×1 convolution filter as the
input’s linear transformation.
3. ReLU activation—next is the Rectified Linear Unit (ReLU) activation function,
AlexNet’s major innovation for reducing training time. ReLU outputs the input
for positive values and zero for negative values. VGG has a fixed convolution
stride of 1 pixel to preserve the spatial resolution after convolution (the
stride value reflects how many pixels the filter “moves” to cover the entire
space of the image).
4. Hidden layers—all the VGG network’s hidden layers use ReLU instead of Local
Response Normalization as in AlexNet. The latter increases training time and
memory consumption with little improvement to overall accuracy.
5. Pooling layers–A pooling layer follows several convolutional layers—this
helps reduce the dimensionality and the number of parameters of the feature
maps created by each convolution step. Pooling is crucial given the rapid
growth of the number of available filters from 64 to 128, 256, and eventually
512 in the final layers.
6. Fully connected layers—VGGNet includes three fully connected layers. The
first two layers each have 4096 channels, and the third layer has 1000
channels, one for every class.
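
A short sketch of loading the VGG16 architecture through Keras Applications; downloading
the ImageNet weights is optional (pass weights=None for an untrained model):

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet",        # pre-trained ImageNet weights
              include_top=True,          # keep the three fully connected layers
              input_shape=(224, 224, 3))
model.summary()                          # 13 convolutional + 3 fully connected layers
```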

8.4 U-Net architecture:

The U-Net architecture is a convolutional neural network (CNN) architecture designed
for semantic segmentation tasks in computer vision. It was introduced by Olaf
Ronneberger, Philipp Fischer, and Thomas Brox in 2015. The name "U-Net" comes from
the U-shaped architecture of the network. U-Net is particularly effective in tasks where
detailed spatial information is crucial, such as medical image segmentation. Here is an
in-depth overview of the U-Net architecture:

1. Encoder-Decoder Structure:

1. Purpose: Capture contextual information through the encoder and recover
spatial details through the decoder.
2. U-Shape: Resembles a "U" with a contracting path (encoder) and an
expansive path (decoder).

2. Contracting Path (Encoder):

1. Convolution and Max-Pooling: Repeated blocks reduce spatial
dimensions and increase channel depth.

2. Hierarchical Features: Captures hierarchical features, recognizing
complex patterns at different scales.

3. Expansive Path (Decoder):

1. Upsampling and Convolution: Involves upsampling layers and
convolutional layers, gradually increasing spatial dimensions while
decreasing channels.

2. Skip Connections: Connects encoder and decoder layers, preserving
spatial information.

4. Skip Connections:

1. Purpose: Enhances the network's ability to retain spatial details.

2. Concatenation: The output of an encoder layer is concatenated with the
corresponding decoder layer.

5. Activation Functions and Normalization:

1. ReLU Activation: Introduces non-linearity after each convolutional layer.

2. Batch Normalization: Stabilizes and accelerates training by normalizing
layer inputs.

6. Final Layer and Loss Function:

1. Final Convolutional Layer: Typically involves a convolutional layer with
softmax activation.

2. Softmax Cross-Entropy Loss: Commonly used for segmentation tasks,
measuring the difference between predicted and ground-truth pixel-wise class
probabilities.
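
A compact Keras sketch of a U-Net style encoder-decoder with skip connections; the filter
counts, input size, and number of classes are illustrative and much smaller than in the
original paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with batch normalization and ReLU
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

inputs = layers.Input(shape=(128, 128, 1))

# Contracting path (encoder)
c1 = conv_block(inputs, 16)
p1 = layers.MaxPooling2D(2)(c1)
c2 = conv_block(p1, 32)
p2 = layers.MaxPooling2D(2)(c2)

# Bottleneck
b = conv_block(p2, 64)

# Expansive path (decoder) with skip connections
u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
u2 = layers.Concatenate()([u2, c2])          # skip connection from the encoder
c3 = conv_block(u2, 32)
u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
u1 = layers.Concatenate()([u1, c1])          # skip connection from the encoder
c4 = conv_block(u1, 16)

# Final 1x1 convolution with softmax over the segmentation classes
outputs = layers.Conv2D(3, 1, activation="softmax")(c4)

unet = Model(inputs, outputs)
unet.summary()
```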
