
DEEP LEARNING

UNIT-I

Dr.N.Linga Reddy

Associate professor, vgnt, Deshmukhi

Feedforward neural network

Artificial Neural Networks (ANNs) have revolutionized the field of machine learning, offering
powerful tools for pattern recognition, classification, and predictive modeling. Among the various
types of neural networks, the Feedforward Neural Network (FNN) is one of the most fundamental
and widely used.

A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs). The network consists of an input layer, one or more hidden layers, and an output
layer. Information flows in one direction—from input to output—hence the name "feedforward."

Structure of a Feedforward Neural Network

1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron in
the input layer represents a feature of the input data.

2. Hidden Layers: One or more hidden layers are placed between the input and output layers.
These layers are responsible for learning the complex patterns in the data. Each neuron in a
hidden layer applies a weighted sum of inputs followed by a non-linear activation function.

3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or the
number of outputs in a regression problem.

Each connection between neurons in these layers has an associated weight that is adjusted during
the training process to minimize the error in predictions.
Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn and model complex
data patterns. Common activation functions include:

• Sigmoid: σ(x) = 1 / (1 + e^(−x))

• Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

• ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)

• Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
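For reference, these four activation functions can be written directly in NumPy; this is a minimal sketch rather than a framework implementation:

import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope alpha for negative inputs
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")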

Training a Feedforward Neural Network

Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.

1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.

2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared
Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.

3. Backpropagation: In backpropagation, the error is propagated back through the network to update the weights. The gradient of the loss function with respect to each weight is calculated, and the weights are adjusted using gradient descent.
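These three steps can be seen in a tiny NumPy sketch for a one-hidden-layer network on a single example; the layer sizes, random data, and the 0.1 learning rate are illustrative assumptions, not values from the text:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)                     # one training example with 2 features
t = np.array([0.5])                        # target output

W1 = rng.normal(scale=0.5, size=(2, 3))    # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1))    # hidden -> output weights
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward propagation: input flows through the hidden layer to the output
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)

# 2. Loss calculation: mean squared error between prediction and target
loss = 0.5 * np.sum((t - y) ** 2)

# 3. Backpropagation: chain rule gives the gradient for the output-layer weights
dy = (y - t) * y * (1 - y)                 # dLoss/d(net input of the output neuron)
dW2 = np.outer(h, dy)                      # gradient with respect to W2 (W1 is handled analogously)
W2 -= 0.1 * dW2                            # one gradient descent step (learning rate 0.1)
print("loss before update:", loss)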
Gradient Descent
Gradient descent was first proposed by Augustin-Louis Cauchy in 1847. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding a local minimum of a function.

The relationship between gradient descent and the local minima and maxima of a function can be stated as follows:

o If we move in the direction of the negative gradient of the function at the current point (away from the gradient), we move toward the local minimum of that function.

o If we move in the direction of the positive gradient of the function at the current point (toward the gradient), we move toward the local maximum of that function.

Moving toward the positive gradient is known as gradient ascent; moving toward the negative gradient is gradient descent, also called steepest descent. The main objective of the gradient descent algorithm is to minimize the cost function iteratively. To achieve this goal, it performs two steps at every iteration:

o Calculate the first-order derivative of the function to compute the gradient (slope) at the current point.

o Move in the direction opposite to the gradient, taking a step of alpha times the gradient, where alpha is the learning rate: a tuning parameter in the optimization process that decides the length of each step.

Cost function:
The cost function measures the difference (error) between the actual values and the predicted values at the current position, expressed as a single real number. It provides feedback to the model so that the model can reduce its error and find the local or global minimum. Training iterates along the direction of the negative gradient until the cost function approaches its minimum, at which point the model stops learning further. Although the terms cost function and loss function are often used interchangeably, there is a minor difference between them: the loss function refers to the error of a single training example, while the cost function is the average error over the entire training set.

The cost function is evaluated after making a hypothesis with initial parameters; the parameters are then modified with the gradient descent algorithm over known data to reduce the cost function.

Hypothesis: the model's predicted output as a function of its inputs and parameters.
Parameters: the weights and biases to be learned.
Cost function: the average error between predicted and actual values over the training set.
Goal: minimize the cost function.

Gradient Descent working


Before looking at how gradient descent works, we need one basic concept from linear regression: the slope of a line. The equation of simple linear regression is given as:

1. Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point (shown in the figure above) is just an arbitrary point used to evaluate the performance. At this starting point, we compute the first derivative (slope) and use a tangent line to measure its steepness. This slope then informs the updates to the parameters (weights and bias).

The slope is steepest at the arbitrary starting point; as new parameters are generated, the steepness gradually reduces until the lowest point is reached, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual output. Minimizing the cost function requires two things:

o Direction & Learning Rate

These two factors determine the partial-derivative calculations of future iterations and move the parameters toward the point of convergence (a local or global minimum). Let's discuss the learning rate in brief:

Learning Rate:

It is defined as the step size taken toward the minimum (lowest point). It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum, whereas a low learning rate takes small steps, which costs efficiency but gives more precision.
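To make the role of the learning rate concrete, here is a hedged NumPy sketch of gradient descent on the simple quadratic cost f(w) = (w − 3)², chosen purely for illustration:

import numpy as np

def cost(w):
    return (w - 3.0) ** 2            # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)           # first-order derivative of the cost

def gradient_descent(w0, lr, steps=25):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)         # step in the direction opposite to the gradient
    return w

print(gradient_descent(w0=10.0, lr=0.1))    # converges close to the minimum at 3
print(gradient_descent(w0=10.0, lr=1.05))   # too-large learning rate overshoots and diverges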

Types of Gradient Descent

Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) is used to find the error for each point in the training set and update
the model after evaluating all training examples. This procedure is known as the training epoch. In
simple words, it is a greedy approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:

o It produces less noise than the other gradient descent variants.


o It produces stable gradient descent convergence.

o It is computationally efficient, as all resources are used across all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: the parameters are updated after every single example in the dataset. Because only one training example is needed at a time, it is easy to fit in memory. However, the frequent updates make it less computationally efficient than batch gradient descent and produce a noisier gradient. That noise, though, can sometimes help the optimizer escape a local minimum and find the global minimum.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over the other gradient descent variants.

o It is easier to fit in the available memory.

o It is relatively fast to compute compared with batch gradient descent.

o It is more efficient for large datasets.

3. MiniBatch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update after each batch. Splitting the training data into smaller batches balances the computational efficiency of batch gradient descent with the speed of stochastic gradient descent, giving a variant with high computational efficiency and a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.

o It is computationally efficient.

o It produces stable gradient descent convergence.
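The three variants differ mainly in how many examples are used per parameter update. Below is a minimal NumPy sketch of mini-batch gradient descent for linear regression; the dataset, batch size of 32, and learning rate are illustrative assumptions (batch size 1 would give SGD, the full dataset would give batch gradient descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)   # gradient of MSE on this mini-batch
        w -= lr * grad                         # one parameter update per mini-batch

print(w)   # should be close to true_w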

Challenges with the Gradient Descent

Although gradient descent is one of the most popular methods for optimization problems, it still has some challenges, including the following:

1. Local Minima and Saddle Point:

For convex problems, gradient descent can find the global minimum easily, while for non-convex
problems, it is sometimes difficult to find the global minimum, where the machine learning models
achieve the best results.
Whenever the slope of the cost function is at or very close to zero, the model stops learning. Apart from the global minimum, there are other scenarios that can produce this zero slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, with the slope of the cost function increasing on both sides of the current point.

A saddle point, in contrast, has a negative gradient on only one side of the point: it is a local maximum along one direction and a local minimum along another. The name comes from the shape of a horse's saddle.

A local minimum is so called because the value of the loss function is minimal at that point within a local region. In contrast, the global minimum is so called because the value of the loss function is minimal there across the entire domain of the loss function.

2. Vanishing and Exploding Gradient

In a deep neural network, if the model is trained with gradient descent and backpropagation, there
can occur two more issues other than local minima and saddle point.

Vanishing Gradients:

A vanishing gradient occurs when the gradient becomes smaller than expected. During backpropagation, the gradient shrinks as it is propagated backwards, so the earlier layers learn more slowly than the later layers of the network. Once this happens, the weight updates in the earlier layers become so small that they are insignificant, and those layers effectively stop learning.

Exploding Gradient:

An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large, creating an unstable model. In this scenario the model weights grow very large and may eventually be represented as NaN. This problem can be mitigated by reducing the complexity of the model (for example with dimensionality reduction) or by clipping the gradients.
Backpropagation Process
Backpropagation is one of the important concepts of a neural network. Our task is to classify the data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model we use gradient descent to optimize the parameters; similarly, here we also use a gradient descent algorithm, via backpropagation.

For a single training example, the backpropagation algorithm calculates the gradient of the error function. Backpropagation can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach that exploits the chain rule.

The main features of backpropagation are that it is an iterative, recursive and efficient method for calculating the updated weights, improving the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through the mathematics to understand exactly how the weights are updated.

Input values

X1=0.05
X2=0.10

Initial weight

w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55

Bias Values

b1=0.35 b2=0.60
Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the net input of H1, we multiply the inputs by the weights and add the bias:

H1 = x1*w1 + x2*w2 + b1
H1 = 0.05*0.15 + 0.10*0.20 + 0.35
H1 = 0.3775

To calculate the final output of H1, we apply the sigmoid function:

H1final = 1 / (1 + e^(−0.3775)) = 0.593269992

We calculate the value of H2 in the same way as H1:

H2 = x1*w3 + x2*w4 + b1
H2 = 0.05*0.25 + 0.10*0.30 + 0.35
H2 = 0.3925

Applying the sigmoid function:

H2final = 1 / (1 + e^(−0.3925)) = 0.596884378

Now we calculate the values of y1 and y2 in the same way as H1 and H2. To find the net input of y1, we multiply the outputs of H1 and H2 by the weights and add the bias:

y1 = H1final*w5 + H2final*w6 + b2
y1 = 0.593269992*0.40 + 0.596884378*0.45 + 0.60
y1 = 1.10590597

Applying the sigmoid function:

y1final = 1 / (1 + e^(−1.10590597)) = 0.75136507

We calculate the value of y2 in the same way as y1:

y2 = H1final*w7 + H2final*w8 + b2
y2 = 0.593269992*0.50 + 0.596884378*0.55 + 0.60
y2 = 1.2249214

Applying the sigmoid function:

y2final = 1 / (1 + e^(−1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99, and our y1 and y2 values do not match the target values T1 and T2.

Now we find the total error, which combines the squared differences between the outputs and the target outputs. The total error is calculated as

Etotal = Σ ½ (target − output)²

E1 = ½ (T1 − y1final)² = ½ (0.01 − 0.75136507)² = 0.274811083
E2 = ½ (T2 − y2final)² = ½ (0.99 − 0.772928465)² = 0.023560026

So, the total error is

Etotal = E1 + E2 = 0.274811083 + 0.023560026 = 0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight from the total error: the error on weight w is obtained by differentiating the total error with respect to w.

Since we work backwards, we first consider the last-layer weight w5 and the derivative ∂Etotal/∂w5.

Etotal does not contain w5 directly, so we cannot differentiate it with respect to w5 in a single step. We therefore split the derivative into terms, using the chain rule, so that we can differentiate easily with respect to w5:

∂Etotal/∂w5 = (∂Etotal/∂y1final) × (∂y1final/∂y1) × (∂y1/∂w5)

Now we calculate each term one by one, substitute the values from the forward pass, and multiply them to obtain the final result for ∂Etotal/∂w5.

Next, we calculate the updated weight w5new with the help of the following formula:

w5new = w5 − η × ∂Etotal/∂w5

where η is the learning rate. In the same way we calculate w6new, w7new, and w8new, which gives the following values:

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121

Backward pass at Hidden layer

Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and w4, just as we did with the w5, w6, w7, and w8 weights.

We calculate the error at w1 as ∂Etotal/∂w1. Etotal does not contain w1 directly, so we again split the derivative into terms, using the chain rule, so that we can differentiate easily with respect to w1:

∂Etotal/∂w1 = (∂Etotal/∂H1final) × (∂H1final/∂H1) × (∂H1/∂w1)

We split the first term again, because Etotal depends on H1final through both output errors E1 and E2:

∂Etotal/∂H1final = ∂E1/∂H1final + ∂E2/∂H1final

Each of these is split once more, because E1 and E2 contain no H1final term directly; they depend on it through y1 and y2:

∂E1/∂H1final = (∂E1/∂y1final) × (∂y1final/∂y1) × (∂y1/∂H1final)
∂E2/∂H1final = (∂E2/∂y2final) × (∂y2final/∂y2) × (∂y2/∂H1final)

We evaluate each of these terms using the forward-pass values (reusing the output-layer derivatives already computed for w5 to w8), and add the two contributions to obtain ∂Etotal/∂H1final.

We then need ∂H1final/∂H1, the derivative of the sigmoid at H1, and ∂H1/∂w1, the partial derivative of the total net input to H1 with respect to w1, which we calculate the same way as for the output neuron (here ∂H1/∂w1 = x1).

Multiplying all of these terms gives the final result for ∂Etotal/∂w1.

Now, we calculate the updated weight w1new with the help of the following formula:

w1new = w1 − η × ∂Etotal/∂w1

In the same way, we calculate w2new, w3new, and w4new, which gives the following values:

w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229

We have updated all the weights. The error on the network was 0.298371109 when we fed forward the 0.05 and 0.1 inputs. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons produce 0.015912196 and 0.984065734, i.e., close to our target values, when we feed forward 0.05 and 0.1.
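The numbers in this worked example can be checked with a short NumPy script. The learning rate is not stated explicitly above; a value of 0.5 is assumed here because it reproduces the quoted updated weights (e.g., w5new ≈ 0.35891648):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
lr = 0.5   # assumed learning rate, consistent with the quoted updated weights

# forward pass
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # 0.593269992
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # 0.596884378
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# total error
e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2
print("total error:", e_total)          # ~0.298371109

# backward pass for w5: dE/dw5 = dE/dy1final * dy1final/dy1 * dy1/dw5
d_w5 = (y1 - t1) * y1 * (1 - y1) * h1
print("w5 new:", w5 - lr * d_w5)        # ~0.35891648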

Vanishing Gradient Problem


The vanishing gradient problem is a challenge faced in training artificial neural networks, particularly
deep feedforward and recurrent neural networks. This issue arises during
the backpropagation process, which is used to update the weights of the neural network through
gradient descent. The gradients are calculated using the chain rule and propagated back through the
network, starting from the output layer and moving towards the input layer. However, when the
gradients are very small, they can diminish as they are propagated back through the network, leading
to minimal or no updates to the weights in the initial layers. This phenomenon is known as the
vanishing gradient problem.

Causes of Vanishing Gradient Problem

The vanishing gradient problem is often attributed to the choice of activation functions and the architecture of the neural network. Activation functions like the sigmoid or hyperbolic tangent (tanh) have derivatives in the range 0 to 0.25 for sigmoid and 0 to 1 for tanh. When these activation functions are used in deep networks, the gradients of the loss function with respect to the parameters can become very small, effectively preventing the weights from changing their values during training.
Another cause of the vanishing gradient problem is the initialization of weights. If the weights are
initialized too small, the gradients can shrink exponentially as they are propagated back through the
network, leading to vanishing gradients.

Consequences of Vanishing Gradient Problem

The vanishing gradient problem can severely impact the training process of a neural network. Since
the weights in the earlier layers receive minimal updates, these layers learn very slowly, if at all. This
can result in a network that does not perform well on the training data, leading to poor
generalization to new, unseen data. In the worst case, the training process can completely stall, with
the network being unable to learn the complex patterns in the data that are necessary for making
accurate predictions.

Solutions to Vanishing Gradient Problem

To mitigate the vanishing gradient problem, several strategies have been developed:

• Activation Functions: Using activation functions such as Rectified Linear Unit (ReLU) and its
variants (Leaky ReLU, Parametric ReLU, etc.) can help prevent the vanishing gradient
problem. ReLU and its variants have a constant gradient for positive input values, which
ensures that the gradients do not diminish too quickly during backpropagation.

• Weight Initialization: Properly initializing the weights can help prevent gradients from
vanishing. Techniques like Xavier initialization and He initialization are designed to maintain
the variance of the gradients throughout the network.

• Batch Normalization: Applying batch normalization normalizes the output of each layer to
have a mean of zero and a variance of one. This can help maintain stable gradients
throughout the training process.

• Residual Connections: Architectures like Residual Networks (ResNets) introduce skip connections that allow gradients to bypass certain layers in the network, which can help alleviate the vanishing gradient problem.

• Gradient Clipping: This technique involves clipping the gradients during backpropagation to
prevent them from becoming too small (or too large, in the case of the exploding gradient
problem).

• Use of LSTM/GRU in RNNs: For recurrent neural networks, using Long Short-Term
Memory (LSTM) units or Gated Recurrent Units (GRU) can help mitigate the vanishing
gradient problem. These structures have gating mechanisms that control the flow of
gradients and can maintain them over longer sequences.
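To see why sigmoid-based deep networks vanish while ReLU-based ones do not, here is a small NumPy sketch; the depth of 30 layers and the pre-activation value are arbitrary illustrations of the backpropagated chain of local derivatives:

import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # at most 0.25

def relu_grad(z):
    return float(z > 0)             # exactly 1 for positive inputs

depth = 30
z = 0.5                             # arbitrary pre-activation value assumed at every layer

# during backpropagation the gradient is (roughly) a product of one local
# derivative per layer, so small factors shrink it exponentially with depth
sigmoid_chain = sigmoid_grad(z) ** depth
relu_chain = relu_grad(z) ** depth

print("sigmoid chain over 30 layers:", sigmoid_chain)   # ~1e-19, effectively vanished
print("relu chain over 30 layers:   ", relu_chain)      # 1.0, gradient preserved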

ReLU heuristics for avoiding bad local minima


ReLU (Rectified Linear Unit) is a widely-used activation function due to its simplicity and ability to
reduce the vanishing gradient problem in deep networks. However, networks with ReLU are still
susceptible to bad local minima, which can hinder model performance. Here are some heuristics to
help avoid these pitfalls when using ReLU:

1. Proper Weight Initialization

• He (or Kaiming) Initialization: For ReLU networks, initializing weights using the He initialization method can help. It draws the weights as w ∼ N(0, 2/n), where n is the number of input neurons of the layer. This accounts for the fact that ReLU can "shut off" neurons, so a larger weight variance is needed to maintain effective signal flow. (A PyTorch sketch combining this initialization with Leaky ReLU appears after this list.)

• Proper weight initialization prevents the network from starting in regions that might quickly lead to dead neurons (neurons that always output zero).

2. Batch Normalization

• Adding Batch Normalization after each fully-connected or convolutional layer can stabilize
learning by normalizing activations, allowing for higher learning rates and faster
convergence.

• This normalization can help the network avoid bad local minima by making the optimization
landscape smoother, reducing sensitivity to the initial conditions and helping prevent neuron
saturation.

3. Leaky ReLU or Parametric ReLU (PReLU)

• Leaky ReLU: ReLU suffers from "dying ReLU" problem, where neurons get stuck and always
output zero, effectively removing them from the learning process. Leaky ReLU modifies the
ReLU to allow a small, non-zero gradient when inputs are negative: f(x)=max⁡(αx,x)f(x) =
\max(\alpha x, x)f(x)=max(αx,x) where α\alphaα is a small constant (e.g., 0.01).

• Parametric ReLU (PReLU) allows the network to learn α\alphaα, which can further reduce
the likelihood of neurons dying and getting stuck in poor local minima.

4. Adaptive Learning Rate Optimization

• Adam, RMSprop, or AdaGrad optimizers adaptively adjust the learning rate for each weight,
which can help the network avoid bad local minima by quickly adjusting to better solutions
during training.

• These optimizers generally handle gradients well in ReLU networks and help manage the
sometimes erratic gradients that ReLU can produce, particularly when neurons are close to
the threshold of activation.

5. Learning Rate Schedules or Annealing

• Start with a higher learning rate and reduce it gradually during training to avoid large, erratic
updates that could push the model into local minima.

• Learning rate schedules (e.g., exponential decay) or annealing techniques (e.g., cosine
annealing) can be used to guide the network out of regions of bad local minima, especially as
training progresses.

6. Early Stopping and Regularization

• Early stopping can prevent the network from overfitting, which sometimes aligns with
avoiding bad local minima as well. This approach is often paired with dropout regularization
to make the network more resilient to suboptimal pathways by randomly dropping
connections.

• L2 Regularization (weight decay) can also help in smoothing the optimization landscape,
encouraging simpler models and helping to escape sharp, poor local minima.
7. Gradient Clipping

• When ReLU activations lead to large gradients, particularly in deeper networks, gradient
clipping can limit gradients to a certain threshold, reducing the risk of the model getting
stuck in poor local minima.

• Clipping helps stabilize training, particularly when dealing with ReLU activations that can
produce larger-than-usual gradients.

8. Layer-wise Training or Warm Restarts

• Training one layer at a time, or layer-wise pretraining, can help initialize the network in a
better region of the parameter space. This approach is less common today but can be
effective in very deep networks.

• Warm restarts periodically reset the learning rate to a higher value during training, which
allows the network to escape from potential local minima by pushing it into new regions of
the parameter space.
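A hedged PyTorch sketch combining heuristics 1 and 3 above (He/Kaiming initialization with Leaky ReLU); the layer sizes and the 0.01 negative slope are illustrative choices:

import torch.nn as nn

class ReLUNet(nn.Module):
    def __init__(self, in_features=784, hidden=256, out_features=10):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)
        self.act = nn.LeakyReLU(negative_slope=0.01)   # keeps a small gradient for x < 0

        # He/Kaiming initialization, matched to the leaky_relu nonlinearity
        nn.init.kaiming_normal_(self.fc1.weight, a=0.01, nonlinearity='leaky_relu')
        nn.init.kaiming_normal_(self.fc2.weight, a=0.01, nonlinearity='leaky_relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))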

Heuristics for faster training


Training deep learning models can be computationally expensive and time-consuming, especially
with deep architectures. Here are some effective heuristics for speeding up training without
compromising accuracy:

1. Efficient Weight Initialization

• He Initialization (for ReLU and its variants) and Xavier Initialization (for tanh activations) help
stabilize the model and ensure faster convergence by setting weights to appropriate initial
scales.

• Good initialization prevents vanishing or exploding gradients, which can otherwise slow
down training considerably by requiring more gradient updates.

2. Batch Normalization

• Batch Normalization normalizes inputs across mini-batches, which allows for higher learning
rates and reduces sensitivity to initial weights. It also smooths the loss landscape, leading to
faster and more stable training.

• It can also reduce the need for other regularization methods, as it has a slight regularizing
effect.

3. Learning Rate Schedules and Annealing

• Learning Rate Scheduling: Starting with a high learning rate and gradually lowering it can
significantly reduce training time. Common schedules include step decay, exponential decay,
and cosine annealing.

• Warmup and Annealing: Using a warmup period (starting with a low learning rate) followed
by annealing (gradually reducing the rate) helps improve convergence speed and stability,
especially in very deep networks.

4. Adaptive Optimizers
• Adam, RMSprop, and AdaGrad dynamically adjust the learning rate for each parameter,
speeding up training by allowing larger updates early on and refining steps as training
progresses.

• Adam is particularly popular for deep networks as it combines momentum and adaptive
learning rates, balancing fast convergence with stability.

5. Use of Mixed Precision Training

• Mixed Precision involves using 16-bit floating-point precision (FP16) instead of the standard
32-bit (FP32), significantly reducing memory usage and computation time.

• Many hardware accelerators (e.g., NVIDIA GPUs with Tensor Cores) are optimized for mixed
precision, which can nearly double training speeds without major accuracy loss.

6. Gradient Accumulation

• For large models and smaller GPUs, gradient accumulation allows training with larger
effective batch sizes without increasing memory requirements. Instead of updating weights
after every mini-batch, gradients are accumulated over several mini-batches, and weights are
updated less frequently.

• This method achieves the benefits of larger batch sizes (e.g., better generalization and faster
convergence) without needing large GPU memory.

7. Data Augmentation

• Using on-the-fly data augmentation (e.g., rotation, flipping, scaling, and cropping) increases
dataset variability, which can improve model generalization.

• Augmentation can allow models to converge faster by reducing overfitting and enabling the
model to learn more robust features.

8. Reduce Overhead with Data Loaders

• Ensure data loading and preprocessing are not bottlenecks by using optimized data loaders
that allow for parallel data loading, caching, and asynchronous processing.

• Prefetching and caching batches in memory can significantly reduce the idle time of the GPU,
especially for large datasets or complex preprocessing pipelines.

9. Use Larger Batch Sizes (if Hardware Permits)

• Larger batch sizes can improve training speed by increasing parallelism and reducing the
number of parameter updates needed per epoch. However, this can sometimes harm
generalization.

• Gradient accumulation (as mentioned) is an alternative when hardware memory limits are a
concern, enabling training with an effective larger batch size.

10. Early Stopping

• Early Stopping halts training when the model stops improving on a validation metric,
preventing unnecessary epochs and potential overfitting.
• Coupled with validation metrics tracking, early stopping can greatly reduce training time
without sacrificing performance.

11. Transfer Learning and Pretrained Models

• Using pretrained models as a starting point, especially for complex tasks (e.g., image
classification, NLP tasks), can drastically reduce training time.

• Fine-tuning a model with task-specific data is often much faster than training from scratch,
especially with large models like those used in vision or language tasks.

12. Distributed and Parallel Training

• Data Parallelism: Splits batches across multiple GPUs or nodes, each processing a subset of
the batch in parallel, which speeds up training linearly with the number of GPUs.

• Model Parallelism: Splits large models across devices, useful for very large models that
exceed single-GPU memory.

• Hybrid Parallelism: Combines data and model parallelism, commonly used in massive deep
learning models to achieve faster training.

13. Efficient Network Architecture Choices

• Use of efficient architectures like MobileNet, EfficientNet, or SqueezeNet can reduce training
time without sacrificing performance, especially when computational resources are limited.

• Techniques like skip connections (e.g., in ResNets) reduce the risk of vanishing gradients,
enabling deeper networks that converge faster.

14. Pruning and Knowledge Distillation

• Pruning: Removes less important weights or neurons from the model, which can reduce
training time and memory usage.

• Knowledge Distillation: A smaller, faster model (the student) is trained using the outputs of a
larger pretrained model (the teacher), achieving comparable performance with faster
training times.

15. Gradient Clipping for Stability

• Gradient Clipping prevents exploding gradients by limiting gradients to a maximum threshold. This can speed up training in RNNs and deep networks, where large gradients can cause erratic updates that slow down convergence.
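Several of these heuristics can be combined in a single PyTorch training loop. The sketch below illustrates gradient accumulation (heuristic 6), gradient clipping (heuristic 15), and a cosine learning-rate schedule (heuristic 3); the model, the dummy data, and all hyperparameter values are placeholder assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # heuristic 3

# dummy data standing in for a real dataset and loader
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=16)

accum_steps = 4    # heuristic 6: accumulate gradients over 4 mini-batches per update

for epoch in range(10):
    optimizer.zero_grad()
    for step, (xb, yb) in enumerate(loader):
        loss = criterion(model(xb), yb) / accum_steps   # scale so accumulated grads average
        loss.backward()                                  # gradients add up across mini-batches
        if (step + 1) % accum_steps == 0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # heuristic 15
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()    # anneal the learning rate once per epoch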

Nesterov Accelerated Gradient Descent (NAG)


Nesterov Accelerated Gradient Descent (NAG) is an advanced optimization algorithm that builds
upon the traditional momentum-based gradient descent method. It was introduced by Yurii
Nesterov in 1983 and has since become a staple in training deep learning models due to its ability
to accelerate convergence and navigate complex loss landscapes more effectively.

1. Introduction to Gradient Descent and Momentum

Before diving into NAG, it's essential to understand the basics of gradient descent and the
momentum technique.
• Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.

θ_{t+1} = θ_t − η ∇J(θ_t)

where:

o θ represents the parameters.

o η is the learning rate.

o ∇J(θ_t) is the gradient of the loss function J at time step t.

• Momentum: Enhances gradient descent by accumulating a velocity vector in directions of persistent reduction in the loss function, thereby smoothing updates and potentially accelerating convergence.

v_{t+1} = γ v_t + η ∇J(θ_t)
θ_{t+1} = θ_t − v_{t+1}

where:

o v_t is the velocity at time step t.

o γ is the momentum coefficient (typically between 0.9 and 0.99).

2. Nesterov Accelerated Gradient (NAG)

NAG modifies the traditional momentum approach by incorporating a "lookahead" mechanism, which anticipates the future position of the parameters and adjusts the gradient accordingly. This foresight leads to more informed and effective updates.

• NAG Update Rules:

v_{t+1} = γ v_t + η ∇J(θ_t − γ v_t)
θ_{t+1} = θ_t − v_{t+1}

Here, the gradient is evaluated not at the current parameters θ_t, but at the approximated future position θ_t − γ v_t.
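A minimal NumPy sketch of the momentum and NAG update rules on the simple quadratic loss J(θ) = ½θ² makes the difference in where the gradient is evaluated explicit; the loss, learning rate, and momentum coefficient are illustrative choices:

import numpy as np

grad = lambda theta: theta                          # gradient of J(theta) = 0.5 * theta**2

def momentum_step(theta, v, lr=0.1, gamma=0.9):
    v_new = gamma * v + lr * grad(theta)            # gradient at the current position
    return theta - v_new, v_new

def nag_step(theta, v, lr=0.1, gamma=0.9):
    lookahead = theta - gamma * v                   # anticipated future position
    v_new = gamma * v + lr * grad(lookahead)        # gradient at the lookahead point
    return theta - v_new, v_new

theta_m = theta_n = 5.0
v_m = v_n = 0.0
for _ in range(50):
    theta_m, v_m = momentum_step(theta_m, v_m)
    theta_n, v_n = nag_step(theta_n, v_n)
print("momentum:", theta_m, "  NAG:", theta_n)      # both approach 0; NAG converges faster here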

3. Differences Between Momentum and NAG

• Gradient evaluation: Momentum evaluates the gradient at the current position θ_t; NAG evaluates it at the lookahead position θ_t − γ v_t.

• Update step: Momentum updates based on the current velocity v_t; NAG updates based on the anticipated velocity v_{t+1}.

• Performance: Momentum can overshoot and oscillate in certain scenarios; NAG generally makes more responsive and accurate updates, leading to faster convergence.

4. Advantages of NAG
1. Faster Convergence: By anticipating the future position, NAG can make more informed updates, often leading to faster convergence compared to standard momentum.

2. Reduced Oscillations: The lookahead mechanism helps in dampening oscillations, especially in ravines or areas with steep gradients.

3. Better Handling of Non-Convex Loss Landscapes: NAG is more effective in navigating complex, non-convex loss surfaces typical in deep learning.

4. Theoretical Guarantees: Nesterov provided theoretical guarantees for accelerated convergence rates in convex optimization problems.

5. Mathematical Intuition Behind NAG

NAG can be viewed as a correction to the momentum method. While momentum accumulates
gradients based on past velocities, NAG adjusts the current gradient based on where the
parameters are expected to be in the future. This adjustment provides a more accurate gradient
direction, especially in regions where the loss surface curvature changes.

6. Implementation in Popular Deep Learning Frameworks

Most deep learning frameworks provide built-in support for NAG. Here's how to implement it in a
few popular frameworks:

• TensorFlow (using Keras API):


import tensorflow as tf

from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

• PyTorch:


import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

• Keras (Standalone):


from keras.optimizers import SGD


optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

7. Practical Tips for Using NAG

1. Hyperparameter Tuning:

o Learning Rate (η): Start with a moderate learning rate (e.g., 0.01) and adjust based on training performance.

o Momentum Coefficient (γ): Common values range from 0.9 to 0.99. Higher momentum can accelerate training but may lead to overshooting.

2. Combination with Learning Rate Schedules:

o Pair NAG with learning rate schedules (e.g., step decay, cosine annealing) to adjust
the learning rate during training, potentially enhancing convergence speed and
final performance.

3. Initialization:

o Use proper weight initialization methods (e.g., He Initialization for ReLU activations) to complement the effectiveness of NAG.

4. Batch Normalization:

o Incorporate batch normalization to stabilize and accelerate training, especially when using NAG.

5. Monitoring:

o Monitor training and validation loss curves to ensure that NAG is effectively
accelerating convergence without causing instability.

8. Comparison with Other Optimizers

• SGD with Momentum: traditional momentum-based gradient descent. Pros: simpler, fewer hyperparameters. Cons: can oscillate and converge more slowly than NAG.

• Nesterov Accelerated Gradient (NAG): incorporates a lookahead mechanism for more informed updates. Pros: faster convergence, better handling of ravines. Cons: slightly more computational overhead.

• Adam: Adaptive Moment Estimation; combines momentum and adaptive learning rates. Pros: handles sparse gradients, requires less tuning. Cons: can sometimes generalize worse than SGD/NAG.

• RMSprop: maintains a moving average of squared gradients to adapt learning rates. Pros: effective for recurrent neural networks. Cons: requires careful tuning of hyperparameters.

• AdaGrad: adapts learning rates based on past gradients, beneficial for sparse data. Pros: good for sparse data. Cons: learning rate can become too small over time.

NAG vs. Adam:

• NAG often provides better generalization performance on certain tasks compared to Adam,
especially in computer vision.

• Adam converges faster initially but may plateau, whereas NAG continues to make progress
due to its momentum and lookahead.

9. When to Use NAG

• Deep Networks with Complex Loss Landscapes: NAG excels in navigating non-convex
optimization problems common in deep learning.

• Problems with High Curvature: The lookahead mechanism helps in handling areas with
varying curvature, reducing the chances of getting stuck in plateaus.

• When Computational Resources are Limited: Compared to some adaptive methods, NAG
doesn't require storing additional moment estimates per parameter, making it memory-
efficient.

10. Example: NAG in Practice

Here's a simple example of using NAG to train a neural network on the MNIST dataset using
PyTorch.


import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Data loading and preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('.', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Initialize the network, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for data, target in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(train_loader):.4f}")

Explanation:

• Model: A simple feedforward neural network with one hidden layer.

• Optimizer: Stochastic Gradient Descent (SGD) with Nesterov momentum.

• Training Loop: Iterates through the dataset, computes loss, backpropagates gradients, and
updates parameters using NAG.

11. Potential Pitfalls and Considerations

1. Hyperparameter Sensitivity: NAG, like other optimizers, requires careful tuning of the
learning rate and momentum coefficient. Improper settings can lead to suboptimal
performance or training instability.

2. Non-Convex Optimization: While NAG handles non-convex landscapes better than standard momentum, it doesn't guarantee finding the global minimum. It can still get trapped in local minima or saddle points.

3. Computational Overhead: Although minimal, NAG introduces slight computational overhead due to the lookahead gradient calculation. However, this is often negligible compared to the overall training time.

12. Extensions and Variants

• NAG with Adaptive Learning Rates: Combining NAG with adaptive learning rate methods
like AdaGrad or Adam can potentially yield better performance, although this is less
common.

• Scheduled Nesterov Momentum: Adjusting the momentum coefficient during training to optimize convergence dynamics.

Regularization
Regularization is crucial in deep learning for controlling model complexity, improving generalization,
and reducing the risk of overfitting. Overfitting occurs when a model learns patterns specific to the
training data that do not generalize to unseen data. Here are the most widely used regularization
techniques in deep learning, along with explanations of how they help achieve better performance.
1. L1 and L2 Regularization (Weight Regularization)

• L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared weights to
the loss function. This is known as weight decay in deep learning.

Loss_L2 = Loss + λ Σ_i w_i²

• L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of
the weights.

Loss_L1 = Loss + λ Σ_i |w_i|

• Purpose: Encourages smaller weights, leading to simpler models with reduced variance. L1
regularization can lead to sparse weights (some weights become exactly zero), which can be
useful in feature selection.
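A hedged PyTorch sketch of both penalties: L2 regularization is usually applied through the optimizer's weight_decay argument, while an explicit L1 term can be added to the loss; the λ values and the dummy data are illustrative assumptions:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# L2 regularization (weight decay) handled directly by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
l1_lambda = 1e-4

loss = criterion(model(x), y)
# explicit L1 penalty: lambda times the sum of absolute weight values
l1_penalty = l1_lambda * sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()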

2. Dropout

• How It Works: During training, randomly "drops out" (sets to zero) a fraction of neurons in
each layer according to a dropout rate, typically between 0.2 and 0.5. This prevents the
model from relying too heavily on specific neurons and encourages redundancy, making the
network more robust.

h_i^dropout = 0 with probability p, and h_i / (1 − p) with probability (1 − p),

where p is the dropout rate.

• Benefits: Helps reduce overfitting by preventing co-adaptation among neurons, thus encouraging the network to learn more independent and generalized representations.
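A short sketch of inverted dropout: first as a manual NumPy computation matching the formula above, then as the standard nn.Dropout layer in PyTorch; the dropout rate of 0.5 and the layer sizes are illustrative choices:

import numpy as np
import torch.nn as nn

def inverted_dropout(h, p=0.5, seed=0):
    # zero each activation with probability p, scale the survivors by 1/(1-p)
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(8)
print(inverted_dropout(h, p=0.5))   # roughly half zeros, surviving activations scaled to 2.0

# in PyTorch, dropout is a layer that is active only in training mode (model.train())
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))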
