
DEEP LEARNING

UNIT-I

Dr.N.Linga Reddy

Associate professor, vgnt, Deshmukhi

Feedforward neural network

Artificial Neural Networks (ANNs) have revolutionized the field of machine learning, offering
powerful tools for pattern recognition, classification, and predictive modeling. Among the various
types of neural networks, the Feedforward Neural Network (FNN) is one of the most fundamental
and widely used.

A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs). The network consists of an input layer, one or more hidden layers, and an output
layer. Information flows in one direction—from input to output—hence the name "feedforward."

Structure of a Feedforward Neural Network

1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron in
the input layer represents a feature of the input data.

2. Hidden Layers: One or more hidden layers are placed between the input and output layers.
These layers are responsible for learning the complex patterns in the data. Each neuron in a
hidden layer applies a weighted sum of inputs followed by a non-linear activation function.

3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or the
number of outputs in a regression problem.

Each connection between neurons in these layers has an associated weight that is adjusted during
the training process to minimize the error in predictions.
Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn and model complex
data patterns. Common activation functions include:

• Sigmoid: σ(x) = 1 / (1 + e^(−x))

• Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

• ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)

• Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
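For reference, these four activation functions can be written directly in NumPy; this is a minimal sketch rather than a framework implementation:

import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope alpha for negative inputs
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")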

Training a Feedforward Neural Network

Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.

1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.

2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean Squared
Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.

3. Backpropagation: In backpropagation, the error is propagated back through the network to update the weights. The gradient of the loss function with respect to each weight is calculated, and the weights are adjusted using gradient descent.
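These three steps can be seen in a tiny NumPy sketch for a one-hidden-layer network on a single example; the layer sizes, random data, and the 0.1 learning rate are illustrative assumptions, not values from the text:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)                     # one training example with 2 features
t = np.array([0.5])                        # target output

W1 = rng.normal(scale=0.5, size=(2, 3))    # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1))    # hidden -> output weights
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward propagation: input flows through the hidden layer to the output
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)

# 2. Loss calculation: mean squared error between prediction and target
loss = 0.5 * np.sum((t - y) ** 2)

# 3. Backpropagation: chain rule gives the gradient for the output-layer weights
dy = (y - t) * y * (1 - y)                 # dLoss/d(net input of the output neuron)
dW2 = np.outer(h, dy)                      # gradient with respect to W2 (W1 is handled analogously)
W2 -= 0.1 * dW2                            # one gradient descent step (learning rate 0.1)
print("loss before update:", loss)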
Gradient Descent
Gradient descent was first proposed by Augustin-Louis Cauchy in 1847. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding a local minimum of a function.

The relationship between gradient descent and the local minima and maxima of a function can be stated as follows:

o If we move in the direction of the negative gradient of the function at the current point (away from the gradient), we move toward the local minimum of that function.

o If we move in the direction of the positive gradient of the function at the current point (toward the gradient), we move toward the local maximum of that function.

Moving toward the positive gradient is known as gradient ascent; moving toward the negative gradient is gradient descent, also called steepest descent. The main objective of the gradient descent algorithm is to minimize the cost function iteratively. To achieve this goal, it performs two steps at every iteration:

o Calculate the first-order derivative of the function to compute the gradient (slope) at the current point.

o Move in the direction opposite to the gradient, taking a step of alpha times the gradient, where alpha is the learning rate: a tuning parameter in the optimization process that decides the length of each step.

Cost function:
The cost function measures the difference (error) between the actual values and the predicted values at the current position, expressed as a single real number. It provides feedback to the model so that the model can reduce its error and find the local or global minimum. Training iterates along the direction of the negative gradient until the cost function approaches its minimum, at which point the model stops learning further. Although the terms cost function and loss function are often used interchangeably, there is a minor difference between them: the loss function refers to the error of a single training example, while the cost function is the average error over the entire training set.

The cost function is evaluated after making a hypothesis with initial parameters; the parameters are then modified with the gradient descent algorithm over known data to reduce the cost function.

Hypothesis: the model's predicted output as a function of its inputs and parameters.
Parameters: the weights and biases to be learned.
Cost function: the average error between predicted and actual values over the training set.
Goal: minimize the cost function.

Gradient Descent working


Before looking at how gradient descent works, we need one basic concept from linear regression: the slope of a line. The equation of simple linear regression is given as:

1. Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point (shown in the figure above) is just an arbitrary point used to evaluate the performance. At this starting point, we compute the first derivative (slope) and use a tangent line to measure its steepness. This slope then informs the updates to the parameters (weights and bias).

The slope is steepest at the arbitrary starting point; as new parameters are generated, the steepness gradually reduces until the lowest point is reached, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual output. Minimizing the cost function requires two things:

o Direction & Learning Rate

These two factors determine the partial-derivative calculations of future iterations and move the parameters toward the point of convergence (a local or global minimum). Let's discuss the learning rate in brief:

Learning Rate:

It is defined as the step size taken toward the minimum (lowest point). It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum, whereas a low learning rate takes small steps, which costs efficiency but gives more precision.
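To make the role of the learning rate concrete, here is a hedged NumPy sketch of gradient descent on the simple quadratic cost f(w) = (w − 3)², chosen purely for illustration:

import numpy as np

def cost(w):
    return (w - 3.0) ** 2            # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)           # first-order derivative of the cost

def gradient_descent(w0, lr, steps=25):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)         # step in the direction opposite to the gradient
    return w

print(gradient_descent(w0=10.0, lr=0.1))    # converges close to the minimum at 3
print(gradient_descent(w0=10.0, lr=1.05))   # too-large learning rate overshoots and diverges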

Types of Gradient Descent

Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) is used to find the error for each point in the training set and update
the model after evaluating all training examples. This procedure is known as the training epoch. In
simple words, it is a greedy approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:

o It produces less noise than the other gradient descent variants.


o It produces stable gradient descent convergence.

o It is computationally efficient, as all resources are used across all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: the parameters are updated after every single example in the dataset. Because only one training example is needed at a time, it is easy to fit in memory. However, the frequent updates make it less computationally efficient than batch gradient descent and produce a noisier gradient. That noise, though, can sometimes help the optimizer escape a local minimum and find the global minimum.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over the other gradient descent variants.

o It is easier to fit in the available memory.

o It is relatively fast to compute compared with batch gradient descent.

o It is more efficient for large datasets.

3. MiniBatch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update after each batch. Splitting the training data into smaller batches balances the computational efficiency of batch gradient descent with the speed of stochastic gradient descent, giving a variant with high computational efficiency and a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.

o It is computationally efficient.

o It produces stable gradient descent convergence.
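The three variants differ mainly in how many examples are used per parameter update. Below is a minimal NumPy sketch of mini-batch gradient descent for linear regression; the dataset, batch size of 32, and learning rate are illustrative assumptions (batch size 1 would give SGD, the full dataset would give batch gradient descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)   # gradient of MSE on this mini-batch
        w -= lr * grad                         # one parameter update per mini-batch

print(w)   # should be close to true_w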

Challenges with the Gradient Descent

Although gradient descent is one of the most popular methods for optimization problems, it still has some challenges, including the following:

1. Local Minima and Saddle Point:

For convex problems, gradient descent can find the global minimum easily, while for non-convex
problems, it is sometimes difficult to find the global minimum, where the machine learning models
achieve the best results.
Whenever the slope of the cost function is at or very close to zero, the model stops learning. Apart from the global minimum, there are other scenarios that can produce this zero slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, with the slope of the cost function increasing on both sides of the current point.

A saddle point, in contrast, has a negative gradient on only one side of the point: it is a local maximum along one direction and a local minimum along another. The name comes from the shape of a horse's saddle.

A local minimum is so called because the value of the loss function is minimal at that point within a local region. In contrast, the global minimum is so called because the value of the loss function is minimal there across the entire domain of the loss function.

2. Vanishing and Exploding Gradient

In a deep neural network, if the model is trained with gradient descent and backpropagation, there
can occur two more issues other than local minima and saddle point.

Vanishing Gradients:

A vanishing gradient occurs when the gradient becomes smaller than expected. During backpropagation, the gradient shrinks as it is propagated backwards, so the earlier layers learn more slowly than the later layers of the network. Once this happens, the weight updates in the earlier layers become so small that they are insignificant, and those layers effectively stop learning.

Exploding Gradient:

An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large, creating an unstable model. In this scenario the model weights grow very large and may eventually be represented as NaN. This problem can be mitigated by reducing the complexity of the model (for example with dimensionality reduction) or by clipping the gradients.
Backpropagation Process
Backpropagation is one of the important concepts of a neural network. Our task is to classify the data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model we use gradient descent to optimize the parameters; similarly, here we also use a gradient descent algorithm, via backpropagation.

For a single training example, the backpropagation algorithm calculates the gradient of the error function. Backpropagation can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach that exploits the chain rule.

The main features of backpropagation are that it is an iterative, recursive and efficient method for calculating the updated weights, improving the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through the mathematics to understand exactly how the weights are updated.

Input values

X1=0.05
X2=0.10

Initial weight

w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55

Bias Values

b1=0.35 b2=0.60
Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the net input of H1, we multiply the inputs by the weights and add the bias:

H1 = x1*w1 + x2*w2 + b1
H1 = 0.05*0.15 + 0.10*0.20 + 0.35
H1 = 0.3775

To calculate the final output of H1, we apply the sigmoid function:

H1final = 1 / (1 + e^(−0.3775)) = 0.593269992

We calculate the value of H2 in the same way as H1:

H2 = x1*w3 + x2*w4 + b1
H2 = 0.05*0.25 + 0.10*0.30 + 0.35
H2 = 0.3925

Applying the sigmoid function:

H2final = 1 / (1 + e^(−0.3925)) = 0.596884378

Now we calculate the values of y1 and y2 in the same way as H1 and H2. To find the net input of y1, we multiply the outputs of H1 and H2 by the weights and add the bias:

y1 = H1final*w5 + H2final*w6 + b2
y1 = 0.593269992*0.40 + 0.596884378*0.45 + 0.60
y1 = 1.10590597

Applying the sigmoid function:

y1final = 1 / (1 + e^(−1.10590597)) = 0.75136507

We calculate the value of y2 in the same way as y1:

y2 = H1final*w7 + H2final*w8 + b2
y2 = 0.593269992*0.50 + 0.596884378*0.55 + 0.60
y2 = 1.2249214

Applying the sigmoid function:

y2final = 1 / (1 + e^(−1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99, and our y1 and y2 values do not match the target values T1 and T2.

Now we find the total error, which combines the squared differences between the outputs and the target outputs. The total error is calculated as

Etotal = Σ ½ (target − output)²

E1 = ½ (T1 − y1final)² = ½ (0.01 − 0.75136507)² = 0.274811083
E2 = ½ (T2 − y2final)² = ½ (0.99 − 0.772928465)² = 0.023560026

So, the total error is

Etotal = E1 + E2 = 0.274811083 + 0.023560026 = 0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight from the total error: the error on weight w is obtained by differentiating the total error with respect to w.

Since we work backwards, we first consider the last-layer weight w5 and the derivative ∂Etotal/∂w5.

Etotal does not contain w5 directly, so we cannot differentiate it with respect to w5 in a single step. We therefore split the derivative into terms, using the chain rule, so that we can differentiate easily with respect to w5:

∂Etotal/∂w5 = (∂Etotal/∂y1final) × (∂y1final/∂y1) × (∂y1/∂w5)

Now we calculate each term one by one, substitute the values from the forward pass, and multiply them to obtain the final result for ∂Etotal/∂w5.

Next, we calculate the updated weight w5new with the help of the following formula:

w5new = w5 − η × ∂Etotal/∂w5

where η is the learning rate. In the same way we calculate w6new, w7new, and w8new, which gives the following values:

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121

Backward pass at Hidden layer

Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and w4, just as we did with the w5, w6, w7, and w8 weights.

We calculate the error at w1 as ∂Etotal/∂w1. Etotal does not contain w1 directly, so we again split the derivative into terms, using the chain rule, so that we can differentiate easily with respect to w1:

∂Etotal/∂w1 = (∂Etotal/∂H1final) × (∂H1final/∂H1) × (∂H1/∂w1)

We split the first term again, because Etotal depends on H1final through both output errors E1 and E2:

∂Etotal/∂H1final = ∂E1/∂H1final + ∂E2/∂H1final

Each of these is split once more, because E1 and E2 contain no H1final term directly; they depend on it through y1 and y2:

∂E1/∂H1final = (∂E1/∂y1final) × (∂y1final/∂y1) × (∂y1/∂H1final)
∂E2/∂H1final = (∂E2/∂y2final) × (∂y2final/∂y2) × (∂y2/∂H1final)

We evaluate each of these terms using the forward-pass values (reusing the output-layer derivatives already computed for w5 to w8), and add the two contributions to obtain ∂Etotal/∂H1final.

We then need ∂H1final/∂H1, the derivative of the sigmoid at H1, and ∂H1/∂w1, the partial derivative of the total net input to H1 with respect to w1, which we calculate the same way as for the output neuron (here ∂H1/∂w1 = x1).

Multiplying all of these terms gives the final result for ∂Etotal/∂w1.

Now, we calculate the updated weight w1new with the help of the following formula:

w1new = w1 − η × ∂Etotal/∂w1

In the same way, we calculate w2new, w3new, and w4new, which gives the following values:

w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229

We have updated all the weights. The error on the network was 0.298371109 when we fed forward the 0.05 and 0.1 inputs. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons produce 0.015912196 and 0.984065734, i.e., close to our target values, when we feed forward 0.05 and 0.1.
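The numbers in this worked example can be checked with a short NumPy script. The learning rate is not stated explicitly above; a value of 0.5 is assumed here because it reproduces the quoted updated weights (e.g., w5new ≈ 0.35891648):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
lr = 0.5   # assumed learning rate, consistent with the quoted updated weights

# forward pass
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # 0.593269992
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # 0.596884378
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# total error
e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2
print("total error:", e_total)          # ~0.298371109

# backward pass for w5: dE/dw5 = dE/dy1final * dy1final/dy1 * dy1/dw5
d_w5 = (y1 - t1) * y1 * (1 - y1) * h1
print("w5 new:", w5 - lr * d_w5)        # ~0.35891648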

Vanishing Gradient Problem


The vanishing gradient problem is a challenge faced in training artificial neural networks, particularly
deep feedforward and recurrent neural networks. This issue arises during
the backpropagation process, which is used to update the weights of the neural network through
gradient descent. The gradients are calculated using the chain rule and propagated back through the
network, starting from the output layer and moving towards the input layer. However, when the
gradients are very small, they can diminish as they are propagated back through the network, leading
to minimal or no updates to the weights in the initial layers. This phenomenon is known as the
vanishing gradient problem.

Causes of Vanishing Gradient Problem

The vanishing gradient problem is often attributed to the choice of activation functions and the architecture of the neural network. Activation functions like the sigmoid or hyperbolic tangent (tanh) have derivatives in the range 0 to 0.25 for sigmoid and 0 to 1 for tanh. When these activation functions are used in deep networks, the gradients of the loss function with respect to the parameters can become very small, effectively preventing the weights from changing their values during training.
Another cause of the vanishing gradient problem is the initialization of weights. If the weights are
initialized too small, the gradients can shrink exponentially as they are propagated back through the
network, leading to vanishing gradients.

Consequences of Vanishing Gradient Problem

The vanishing gradient problem can severely impact the training process of a neural network. Since
the weights in the earlier layers receive minimal updates, these layers learn very slowly, if at all. This
can result in a network that does not perform well on the training data, leading to poor
generalization to new, unseen data. In the worst case, the training process can completely stall, with
the network being unable to learn the complex patterns in the data that are necessary for making
accurate predictions.

Solutions to Vanishing Gradient Problem

To mitigate the vanishing gradient problem, several strategies have been developed:

• Activation Functions: Using activation functions such as Rectified Linear Unit (ReLU) and its
variants (Leaky ReLU, Parametric ReLU, etc.) can help prevent the vanishing gradient
problem. ReLU and its variants have a constant gradient for positive input values, which
ensures that the gradients do not diminish too quickly during backpropagation.

• Weight Initialization: Properly initializing the weights can help prevent gradients from
vanishing. Techniques like Xavier initialization and He initialization are designed to maintain
the variance of the gradients throughout the network.

• Batch Normalization: Applying batch normalization normalizes the output of each layer to
have a mean of zero and a variance of one. This can help maintain stable gradients
throughout the training process.

• Residual Connections: Architectures like Residual Networks (ResNets) introduce skip connections that allow gradients to bypass certain layers in the network, which can help alleviate the vanishing gradient problem.

• Gradient Clipping: This technique involves clipping the gradients during backpropagation to
prevent them from becoming too small (or too large, in the case of the exploding gradient
problem).

• Use of LSTM/GRU in RNNs: For recurrent neural networks, using Long Short-Term
Memory (LSTM) units or Gated Recurrent Units (GRU) can help mitigate the vanishing
gradient problem. These structures have gating mechanisms that control the flow of
gradients and can maintain them over longer sequences.
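To see why sigmoid-based deep networks vanish while ReLU-based ones do not, here is a small NumPy sketch; the depth of 30 layers and the pre-activation value are arbitrary illustrations of the backpropagated chain of local derivatives:

import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # at most 0.25

def relu_grad(z):
    return float(z > 0)             # exactly 1 for positive inputs

depth = 30
z = 0.5                             # arbitrary pre-activation value assumed at every layer

# during backpropagation the gradient is (roughly) a product of one local
# derivative per layer, so small factors shrink it exponentially with depth
sigmoid_chain = sigmoid_grad(z) ** depth
relu_chain = relu_grad(z) ** depth

print("sigmoid chain over 30 layers:", sigmoid_chain)   # ~1e-19, effectively vanished
print("relu chain over 30 layers:   ", relu_chain)      # 1.0, gradient preserved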

ReLU heuristics for avoiding bad local minima


ReLU (Rectified Linear Unit) is a widely-used activation function due to its simplicity and ability to
reduce the vanishing gradient problem in deep networks. However, networks with ReLU are still
susceptible to bad local minima, which can hinder model performance. Here are some heuristics to
help avoid these pitfalls when using ReLU:

1. Proper Weight Initialization

• He (or Kaiming) Initialization: For ReLU networks, initializing weights using the He initialization method can help. It draws the weights as w ∼ N(0, 2/n), where n is the number of input neurons of the layer. This accounts for the fact that ReLU can "shut off" neurons, so a larger weight variance is needed to maintain effective signal flow. (A PyTorch sketch combining this initialization with Leaky ReLU appears after this list.)

• Proper weight initialization prevents the network from starting in regions that might quickly lead to dead neurons (neurons that always output zero).

2. Batch Normalization

• Adding Batch Normalization after each fully-connected or convolutional layer can stabilize
learning by normalizing activations, allowing for higher learning rates and faster
convergence.

• This normalization can help the network avoid bad local minima by making the optimization
landscape smoother, reducing sensitivity to the initial conditions and helping prevent neuron
saturation.

3. Leaky ReLU or Parametric ReLU (PReLU)

• Leaky ReLU: ReLU suffers from "dying ReLU" problem, where neurons get stuck and always
output zero, effectively removing them from the learning process. Leaky ReLU modifies the
ReLU to allow a small, non-zero gradient when inputs are negative: f(x)=max⁡(αx,x)f(x) =
\max(\alpha x, x)f(x)=max(αx,x) where α\alphaα is a small constant (e.g., 0.01).

• Parametric ReLU (PReLU) allows the network to learn α\alphaα, which can further reduce
the likelihood of neurons dying and getting stuck in poor local minima.

4. Adaptive Learning Rate Optimization

• Adam, RMSprop, or AdaGrad optimizers adaptively adjust the learning rate for each weight,
which can help the network avoid bad local minima by quickly adjusting to better solutions
during training.

• These optimizers generally handle gradients well in ReLU networks and help manage the
sometimes erratic gradients that ReLU can produce, particularly when neurons are close to
the threshold of activation.

5. Learning Rate Schedules or Annealing

• Start with a higher learning rate and reduce it gradually during training to avoid large, erratic
updates that could push the model into local minima.

• Learning rate schedules (e.g., exponential decay) or annealing techniques (e.g., cosine
annealing) can be used to guide the network out of regions of bad local minima, especially as
training progresses.

6. Early Stopping and Regularization

• Early stopping can prevent the network from overfitting, which sometimes aligns with
avoiding bad local minima as well. This approach is often paired with dropout regularization
to make the network more resilient to suboptimal pathways by randomly dropping
connections.

• L2 Regularization (weight decay) can also help in smoothing the optimization landscape,
encouraging simpler models and helping to escape sharp, poor local minima.
7. Gradient Clipping

• When ReLU activations lead to large gradients, particularly in deeper networks, gradient
clipping can limit gradients to a certain threshold, reducing the risk of the model getting
stuck in poor local minima.

• Clipping helps stabilize training, particularly when dealing with ReLU activations that can
produce larger-than-usual gradients.

8. Layer-wise Training or Warm Restarts

• Training one layer at a time, or layer-wise pretraining, can help initialize the network in a
better region of the parameter space. This approach is less common today but can be
effective in very deep networks.

• Warm restarts periodically reset the learning rate to a higher value during training, which
allows the network to escape from potential local minima by pushing it into new regions of
the parameter space.
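A hedged PyTorch sketch combining heuristics 1 and 3 above (He/Kaiming initialization with Leaky ReLU); the layer sizes and the 0.01 negative slope are illustrative choices:

import torch.nn as nn

class ReLUNet(nn.Module):
    def __init__(self, in_features=784, hidden=256, out_features=10):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)
        self.act = nn.LeakyReLU(negative_slope=0.01)   # keeps a small gradient for x < 0

        # He/Kaiming initialization, matched to the leaky_relu nonlinearity
        nn.init.kaiming_normal_(self.fc1.weight, a=0.01, nonlinearity='leaky_relu')
        nn.init.kaiming_normal_(self.fc2.weight, a=0.01, nonlinearity='leaky_relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))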

Heuristics for faster training


Training deep learning models can be computationally expensive and time-consuming, especially
with deep architectures. Here are some effective heuristics for speeding up training without
compromising accuracy:

1. Efficient Weight Initialization

• He Initialization (for ReLU and its variants) and Xavier Initialization (for tanh activations) help
stabilize the model and ensure faster convergence by setting weights to appropriate initial
scales.

• Good initialization prevents vanishing or exploding gradients, which can otherwise slow
down training considerably by requiring more gradient updates.

2. Batch Normalization

• Batch Normalization normalizes inputs across mini-batches, which allows for higher learning
rates and reduces sensitivity to initial weights. It also smooths the loss landscape, leading to
faster and more stable training.

• It can also reduce the need for other regularization methods, as it has a slight regularizing
effect.

3. Learning Rate Schedules and Annealing

• Learning Rate Scheduling: Starting with a high learning rate and gradually lowering it can
significantly reduce training time. Common schedules include step decay, exponential decay,
and cosine annealing.

• Warmup and Annealing: Using a warmup period (starting with a low learning rate) followed
by annealing (gradually reducing the rate) helps improve convergence speed and stability,
especially in very deep networks.

4. Adaptive Optimizers
• Adam, RMSprop, and AdaGrad dynamically adjust the learning rate for each parameter,
speeding up training by allowing larger updates early on and refining steps as training
progresses.

• Adam is particularly popular for deep networks as it combines momentum and adaptive
learning rates, balancing fast convergence with stability.

5. Use of Mixed Precision Training

• Mixed Precision involves using 16-bit floating-point precision (FP16) instead of the standard
32-bit (FP32), significantly reducing memory usage and computation time.

• Many hardware accelerators (e.g., NVIDIA GPUs with Tensor Cores) are optimized for mixed
precision, which can nearly double training speeds without major accuracy loss.

6. Gradient Accumulation

• For large models and smaller GPUs, gradient accumulation allows training with larger
effective batch sizes without increasing memory requirements. Instead of updating weights
after every mini-batch, gradients are accumulated over several mini-batches, and weights are
updated less frequently.

• This method achieves the benefits of larger batch sizes (e.g., better generalization and faster
convergence) without needing large GPU memory.

7. Data Augmentation

• Using on-the-fly data augmentation (e.g., rotation, flipping, scaling, and cropping) increases
dataset variability, which can improve model generalization.

• Augmentation can allow models to converge faster by reducing overfitting and enabling the
model to learn more robust features.

8. Reduce Overhead with Data Loaders

• Ensure data loading and preprocessing are not bottlenecks by using optimized data loaders
that allow for parallel data loading, caching, and asynchronous processing.

• Prefetching and caching batches in memory can significantly reduce the idle time of the GPU,
especially for large datasets or complex preprocessing pipelines.

9. Use Larger Batch Sizes (if Hardware Permits)

• Larger batch sizes can improve training speed by increasing parallelism and reducing the
number of parameter updates needed per epoch. However, this can sometimes harm
generalization.

• Gradient accumulation (as mentioned) is an alternative when hardware memory limits are a
concern, enabling training with an effective larger batch size.

10. Early Stopping

• Early Stopping halts training when the model stops improving on a validation metric,
preventing unnecessary epochs and potential overfitting.
• Coupled with validation metrics tracking, early stopping can greatly reduce training time
without sacrificing performance.

11. Transfer Learning and Pretrained Models

• Using pretrained models as a starting point, especially for complex tasks (e.g., image
classification, NLP tasks), can drastically reduce training time.

• Fine-tuning a model with task-specific data is often much faster than training from scratch,
especially with large models like those used in vision or language tasks.

12. Distributed and Parallel Training

• Data Parallelism: Splits batches across multiple GPUs or nodes, each processing a subset of
the batch in parallel, which speeds up training linearly with the number of GPUs.

• Model Parallelism: Splits large models across devices, useful for very large models that
exceed single-GPU memory.

• Hybrid Parallelism: Combines data and model parallelism, commonly used in massive deep
learning models to achieve faster training.

13. Efficient Network Architecture Choices

• Use of efficient architectures like MobileNet, EfficientNet, or SqueezeNet can reduce training
time without sacrificing performance, especially when computational resources are limited.

• Techniques like skip connections (e.g., in ResNets) reduce the risk of vanishing gradients,
enabling deeper networks that converge faster.

14. Pruning and Knowledge Distillation

• Pruning: Removes less important weights or neurons from the model, which can reduce
training time and memory usage.

• Knowledge Distillation: A smaller, faster model (the student) is trained using the outputs of a
larger pretrained model (the teacher), achieving comparable performance with faster
training times.

15. Gradient Clipping for Stability

• Gradient Clipping prevents exploding gradients by limiting gradients to a maximum threshold. This can speed up training in RNNs and deep networks, where large gradients can cause erratic updates that slow down convergence.
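Several of these heuristics can be combined in a single PyTorch training loop. The sketch below illustrates gradient accumulation (heuristic 6), gradient clipping (heuristic 15), and a cosine learning-rate schedule (heuristic 3); the model, the dummy data, and all hyperparameter values are placeholder assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # heuristic 3

# dummy data standing in for a real dataset and loader
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=16)

accum_steps = 4    # heuristic 6: accumulate gradients over 4 mini-batches per update

for epoch in range(10):
    optimizer.zero_grad()
    for step, (xb, yb) in enumerate(loader):
        loss = criterion(model(xb), yb) / accum_steps   # scale so accumulated grads average
        loss.backward()                                  # gradients add up across mini-batches
        if (step + 1) % accum_steps == 0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # heuristic 15
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()    # anneal the learning rate once per epoch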

Nesterov Accelerated Gradient Descent (NAG)


Nesterov Accelerated Gradient Descent (NAG) is an advanced optimization algorithm that builds
upon the traditional momentum-based gradient descent method. It was introduced by Yurii
Nesterov in 1983 and has since become a staple in training deep learning models due to its ability
to accelerate convergence and navigate complex loss landscapes more effectively.

1. Introduction to Gradient Descent and Momentum

Before diving into NAG, it's essential to understand the basics of gradient descent and the
momentum technique.
• Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.

θ_{t+1} = θ_t − η ∇J(θ_t)

where:

o θ represents the parameters.

o η is the learning rate.

o ∇J(θ_t) is the gradient of the loss function J at time step t.

• Momentum: Enhances gradient descent by accumulating a velocity vector in directions of persistent reduction in the loss function, thereby smoothing updates and potentially accelerating convergence.

v_{t+1} = γ v_t + η ∇J(θ_t)
θ_{t+1} = θ_t − v_{t+1}

where:

o v_t is the velocity at time step t.

o γ is the momentum coefficient (typically between 0.9 and 0.99).

2. Nesterov Accelerated Gradient (NAG)

NAG modifies the traditional momentum approach by incorporating a "lookahead" mechanism, which anticipates the future position of the parameters and adjusts the gradient accordingly. This foresight leads to more informed and effective updates.

• NAG Update Rules:

v_{t+1} = γ v_t + η ∇J(θ_t − γ v_t)
θ_{t+1} = θ_t − v_{t+1}

Here, the gradient is evaluated not at the current parameters θ_t, but at the approximated future position θ_t − γ v_t.
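A minimal NumPy sketch of the momentum and NAG update rules on the simple quadratic loss J(θ) = ½θ² makes the difference in where the gradient is evaluated explicit; the loss, learning rate, and momentum coefficient are illustrative choices:

import numpy as np

grad = lambda theta: theta                          # gradient of J(theta) = 0.5 * theta**2

def momentum_step(theta, v, lr=0.1, gamma=0.9):
    v_new = gamma * v + lr * grad(theta)            # gradient at the current position
    return theta - v_new, v_new

def nag_step(theta, v, lr=0.1, gamma=0.9):
    lookahead = theta - gamma * v                   # anticipated future position
    v_new = gamma * v + lr * grad(lookahead)        # gradient at the lookahead point
    return theta - v_new, v_new

theta_m = theta_n = 5.0
v_m = v_n = 0.0
for _ in range(50):
    theta_m, v_m = momentum_step(theta_m, v_m)
    theta_n, v_n = nag_step(theta_n, v_n)
print("momentum:", theta_m, "  NAG:", theta_n)      # both approach 0; NAG converges faster here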

3. Differences Between Momentum and NAG

• Gradient evaluation: Momentum evaluates the gradient at the current position θ_t; NAG evaluates it at the lookahead position θ_t − γ v_t.

• Update step: Momentum updates based on the current velocity v_t; NAG updates based on the anticipated velocity v_{t+1}.

• Performance: Momentum can overshoot and oscillate in certain scenarios; NAG generally makes more responsive and accurate updates, leading to faster convergence.

4. Advantages of NAG
1. Faster Convergence: By anticipating the future position, NAG can make more informed updates, often leading to faster convergence compared to standard momentum.

2. Reduced Oscillations: The lookahead mechanism helps in dampening oscillations, especially in ravines or areas with steep gradients.

3. Better Handling of Non-Convex Loss Landscapes: NAG is more effective in navigating complex, non-convex loss surfaces typical in deep learning.

4. Theoretical Guarantees: Nesterov provided theoretical guarantees for accelerated convergence rates in convex optimization problems.

5. Mathematical Intuition Behind NAG

NAG can be viewed as a correction to the momentum method. While momentum accumulates
gradients based on past velocities, NAG adjusts the current gradient based on where the
parameters are expected to be in the future. This adjustment provides a more accurate gradient
direction, especially in regions where the loss surface curvature changes.

6. Implementation in Popular Deep Learning Frameworks

Most deep learning frameworks provide built-in support for NAG. Here's how to implement it in a
few popular frameworks:

• TensorFlow (using Keras API):


import tensorflow as tf

from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

• PyTorch:


import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

• Keras (Standalone):


from keras.optimizers import SGD


optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

7. Practical Tips for Using NAG

1. Hyperparameter Tuning:

o Learning Rate (η): Start with a moderate learning rate (e.g., 0.01) and adjust based on training performance.

o Momentum Coefficient (γ): Common values range from 0.9 to 0.99. Higher momentum can accelerate training but may lead to overshooting.

2. Combination with Learning Rate Schedules:

o Pair NAG with learning rate schedules (e.g., step decay, cosine annealing) to adjust
the learning rate during training, potentially enhancing convergence speed and
final performance.

3. Initialization:

o Use proper weight initialization methods (e.g., He Initialization for ReLU activations) to complement the effectiveness of NAG.

4. Batch Normalization:

o Incorporate batch normalization to stabilize and accelerate training, especially when using NAG.

5. Monitoring:

o Monitor training and validation loss curves to ensure that NAG is effectively
accelerating convergence without causing instability.

8. Comparison with Other Optimizers

• SGD with Momentum: traditional momentum-based gradient descent. Pros: simpler, fewer hyperparameters. Cons: can oscillate and converge more slowly than NAG.

• Nesterov Accelerated Gradient (NAG): incorporates a lookahead mechanism for more informed updates. Pros: faster convergence, better handling of ravines. Cons: slightly more computational overhead.

• Adam: Adaptive Moment Estimation; combines momentum and adaptive learning rates. Pros: handles sparse gradients, requires less tuning. Cons: can sometimes generalize worse than SGD/NAG.

• RMSprop: maintains a moving average of squared gradients to adapt learning rates. Pros: effective for recurrent neural networks. Cons: requires careful tuning of hyperparameters.

• AdaGrad: adapts learning rates based on past gradients, beneficial for sparse data. Pros: good for sparse data. Cons: learning rate can become too small over time.

NAG vs. Adam:

• NAG often provides better generalization performance on certain tasks compared to Adam,
especially in computer vision.

• Adam converges faster initially but may plateau, whereas NAG continues to make progress
due to its momentum and lookahead.

9. When to Use NAG

• Deep Networks with Complex Loss Landscapes: NAG excels in navigating non-convex
optimization problems common in deep learning.

• Problems with High Curvature: The lookahead mechanism helps in handling areas with
varying curvature, reducing the chances of getting stuck in plateaus.

• When Computational Resources are Limited: Compared to some adaptive methods, NAG
doesn't require storing additional moment estimates per parameter, making it memory-
efficient.

10. Example: NAG in Practice

Here's a simple example of using NAG to train a neural network on the MNIST dataset using
PyTorch.


import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Data loading and preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('.', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Initialize the network, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for data, target in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(train_loader):.4f}")

Explanation:

• Model: A simple feedforward neural network with one hidden layer.

• Optimizer: Stochastic Gradient Descent (SGD) with Nesterov momentum.

• Training Loop: Iterates through the dataset, computes loss, backpropagates gradients, and
updates parameters using NAG.

11. Potential Pitfalls and Considerations

1. Hyperparameter Sensitivity: NAG, like other optimizers, requires careful tuning of the
learning rate and momentum coefficient. Improper settings can lead to suboptimal
performance or training instability.

2. Non-Convex Optimization: While NAG handles non-convex landscapes better than standard momentum, it doesn't guarantee finding the global minimum. It can still get trapped in local minima or saddle points.

3. Computational Overhead: Although minimal, NAG introduces slight computational overhead due to the lookahead gradient calculation. However, this is often negligible compared to the overall training time.

12. Extensions and Variants

• NAG with Adaptive Learning Rates: Combining NAG with adaptive learning rate methods
like AdaGrad or Adam can potentially yield better performance, although this is less
common.

• Scheduled Nesterov Momentum: Adjusting the momentum coefficient during training to optimize convergence dynamics.

Regularization
Regularization is crucial in deep learning for controlling model complexity, improving generalization,
and reducing the risk of overfitting. Overfitting occurs when a model learns patterns specific to the
training data that do not generalize to unseen data. Here are the most widely used regularization
techniques in deep learning, along with explanations of how they help achieve better performance.
1. L1 and L2 Regularization (Weight Regularization)

• L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared weights to
the loss function. This is known as weight decay in deep learning.

Loss_L2 = Loss + λ Σ_i w_i²

• L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of
the weights.

Loss_L1 = Loss + λ Σ_i |w_i|

• Purpose: Encourages smaller weights, leading to simpler models with reduced variance. L1
regularization can lead to sparse weights (some weights become exactly zero), which can be
useful in feature selection.
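A hedged PyTorch sketch of both penalties: L2 regularization is usually applied through the optimizer's weight_decay argument, while an explicit L1 term can be added to the loss; the λ values and the dummy data are illustrative assumptions:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# L2 regularization (weight decay) handled directly by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
l1_lambda = 1e-4

loss = criterion(model(x), y)
# explicit L1 penalty: lambda times the sum of absolute weight values
l1_penalty = l1_lambda * sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()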

2. Dropout

• How It Works: During training, randomly "drops out" (sets to zero) a fraction of neurons in
each layer according to a dropout rate, typically between 0.2 and 0.5. This prevents the
model from relying too heavily on specific neurons and encourages redundancy, making the
network more robust.

h_i^dropout = 0 with probability p, and h_i / (1 − p) with probability (1 − p),

where p is the dropout rate.

• Benefits: Helps reduce overfitting by preventing co-adaptation among neurons, thus encouraging the network to learn more independent and generalized representations.
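A short sketch of inverted dropout: first as a manual NumPy computation matching the formula above, then as the standard nn.Dropout layer in PyTorch; the dropout rate of 0.5 and the layer sizes are illustrative choices:

import numpy as np
import torch.nn as nn

def inverted_dropout(h, p=0.5, seed=0):
    # zero each activation with probability p, scale the survivors by 1/(1-p)
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(8)
print(inverted_dropout(h, p=0.5))   # roughly half zeros, surviving activations scaled to 2.0

# in PyTorch, dropout is a layer that is active only in training mode (model.train())
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))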
