DL Unit-3
3.1 REGULARIZATION
Regularization in deep learning is a technique used to prevent overfitting, which
occurs when a model is too complex and performs well on training data but poorly
on new, unseen data. Regularization helps to:
1. Reduce model complexity
2. Prevent overfitting
3. Improve generalization
Role Of Regularization
Let's explore the role of regularization in deep learning in more detail:
3. Balancing Bias and Variance: Regularization can help balance the trade-
off between model bias (underfitting) and model variance (overfitting) in
machine learning, which leads to improved performance.
4. Early Stopping: Stops training when the model's performance on the validation
set starts to degrade.
5. Data Augmentation: Increases the size of the training set by applying
transformations to the data.
6. Batch Normalization: Normalizes the inputs to each layer, reducing the impact
of internal covariate shift.
7. Weight Decay: Adds a term to the loss function that is proportional to the
square of the model's weights.
3.3. L2 REGULARIZATION
L2 Regularization, also known as Ridge Regression, is a technique used to reduce
overfitting in machine learning models. It adds a penalty term to the loss function,
proportional to the square of the model's weights. This encourages the model to
have smaller weights, which can lead to a simpler model that generalizes better
to new data.
The L2 Regularization term is the sum of the squares of the model's weights,
multiplied by a hyperparameter alpha, and is added to the base loss, which is
typically mean squared error or cross-entropy:
Loss_total = Loss + alpha * Σ(w_i^2)
The hyperparameter alpha controls the strength of the regularization, with higher
values resulting in smaller weights.
L2 Regularization has several benefits, including reducing overfitting, stabilizing
training, and preventing any single large weight from dominating the model.
Unlike L1 Regularization, L2 Regularization does not set weights to zero, but
instead shrinks them towards zero. This can lead to a more stable model that is
less sensitive to small changes in the data.
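A minimal NumPy sketch of how the L2 penalty enters the loss (the linear model,
alpha value, and variable names here are illustrative assumptions):

import numpy as np

def l2_regularized_mse(w, X, y, alpha=0.01):
    # Base loss: mean squared error of a linear model y_pred = X @ w
    mse = np.mean((X @ w - y) ** 2)
    # L2 penalty: alpha times the sum of squared weights
    return mse + alpha * np.sum(w ** 2)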
3.4 L1 REGULARIZATION
L1 Regularization, also known as Lasso (Least Absolute Shrinkage and Selection
Operator) Regression, is a technique used to reduce overfitting in machine
learning models. It adds a penalty term to the loss function, proportional to the
absolute value of the model's weights. This encourages the model to have smaller
weights, which can lead to a simpler model that generalizes better to new data.
The L1 Regularization term is the sum of the absolute values of the model's
weights, multiplied by a hyperparameter alpha, and is added to the base loss,
which is typically mean squared error or cross-entropy:
Loss_total = Loss + alpha * Σ|w_i|
The hyperparameter alpha controls the strength of the regularization, with higher
values resulting in smaller weights; unlike L2, the L1 penalty can drive some
weights exactly to zero, effectively performing feature selection.
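A matching sketch for the L1 penalty (same illustrative setup as the L2 example
above):

import numpy as np

def l1_regularized_mse(w, X, y, alpha=0.01):
    # Base loss: mean squared error of a linear model y_pred = X @ w
    mse = np.mean((X @ w - y) ** 2)
    # L1 penalty: alpha times the sum of absolute weights
    return mse + alpha * np.sum(np.abs(w))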
EARLY STOPPING
Early Stopping halts training when performance on a validation set stops improving.
- Prevents overfitting by stopping training before the model becomes too complex
- Saves computation by avoiding training iterations that no longer improve the model
How it works:
1. Split data into training and validation sets
2. Train model on training set and evaluate on validation set
3. If performance on validation set starts to degrade, stop training
3.5 DROPOUT
- Randomly sets a fraction of the model's neurons to zero during training
- Prevents overfitting by reducing the impact of individual neurons
- Encourages neurons to learn redundant representations.
Dropout is a regularization technique used in deep learning to prevent overfitting.
It works by randomly dropping out (i.e., setting to zero) a fraction of the neurons
in a layer during training. This forces the model to learn multiple representations
of the data, rather than relying on a single set of weights. By doing so, Dropout
reduces the model's ability to fit the training data too closely, resulting in
improved generalization performance.
The Dropout rate, typically set between 0.2 and 0.5, determines the fraction of
neurons to drop out. During training, each neuron has a probability of being
dropped out, which helps to prevent the model from becoming too dependent on
any single neuron. At test time, the entire network is used, without dropout, to
make predictions.
How it works:
1. Randomly select a fraction of neurons to drop out (set to zero)
2. Train model with dropped-out neurons
3. Repeat the process for each training iteration (see the sketch below)
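A minimal NumPy sketch of this procedure using inverted dropout (scaling the
surviving activations during training so nothing needs to change at test time):

import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        # At test time the full network is used, with no dropout
        return activations
    # Randomly zero out a fraction p of the activations
    mask = (np.random.rand(*activations.shape) > p).astype(activations.dtype)
    # Scale survivors by 1/(1-p) so the expected activation is unchanged
    return activations * mask / (1.0 - p)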
When to use:
- Early Stopping: When training data is limited or model is prone to overfitting
- Dropout: When model has many layers or neurons, or when training data is
noisy
Overall, Early Stopping is a simple yet effective technique for improving the
performance of machine learning models. By monitoring the model's
performance on a validation set and stopping training when it starts to degrade,
Early Stopping can help prevent overfitting and improve the model's ability to
generalize to new data. This makes it a widely used technique in many
applications, including image classification, natural language processing, and
recommender systems.
1. Validation monitoring: Evaluate the model on a held-out validation set at
regular intervals during training.
2. Stopping criterion: Stop training when the performance on the validation set
starts to degrade (e.g., accuracy decreases or loss increases).
3. Patience: Allow training to continue for a few more iterations after the
criterion is first met, stopping only if performance does not recover. A sketch of
this loop follows below.
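A minimal sketch of the full loop (the `train_one_epoch` and `evaluate` callables
are assumed helpers supplied by the user; `evaluate` is assumed to return the
validation loss):

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)        # one pass over the training set
        val_loss = evaluate(model)    # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop once validation loss has not improved for `patience` epochs
        if epochs_without_improvement >= patience:
            break
    return model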
Benefits of Dropout:
1. Prevents Overfitting: Reduces the impact of individual neurons or weights,
preventing overfitting.
2. Encourages Redundancy: Encourages neurons or weights to learn redundant
representations, improving robustness.
3. Improves Generalization: Helps the model generalize better to new data.
CHALLENGES IN NEURAL NETWORK (NN) OPTIMIZATION
1. Local Minima: NNs can get stuck in local minima, which are suboptimal
solutions that prevent the model from reaching the global minimum.
Local Minima is a fundamental challenge in Neural Network (NN) optimization.
It occurs when the optimization algorithm converges to a suboptimal solution,
rather than the global minimum. This happens because the NN loss function is
non-convex, meaning it has multiple local minima. The algorithm gets stuck in
one of these local minima, failing to explore other parts of the loss landscape. As
a result, the model's performance is suboptimal, and it may not generalize well to
new data.
4. Underfitting: NNs can be too simple to capture the underlying patterns in the
data, resulting in poor performance.
Underfitting poses a significant challenge in neural network optimization,
occurring when a model is too simplistic to capture the underlying patterns in the
training data, resulting in poor performance on both the training and testing sets.
This can be due to insufficient model complexity, inadequate training data, or
overly aggressive regularization. Underfitting makes it difficult to achieve good
generalization, as the model fails to learn the relevant features and relationships
in the data. Moreover, underfitting can lead to slow convergence or oscillations
during training, making it challenging to determine the optimal stopping point.
Additionally, underfitting models may require significant hyperparameter tuning,
and the addition of more data or features may not necessarily improve
performance. Addressing underfitting requires careful balancing of model
complexity, regularization, and training data, adding complexity to the
optimization process and potentially requiring significant computational
resources and expertise.
8. Gradient Noise: Gradients can be noisy, which can slow down convergence or
lead to suboptimal solutions.
Gradient noise is a challenging aspect of neural network optimization, as it refers
to the inherent randomness and variability in the gradient estimates used to update
model parameters. This noise can arise from various sources, including stochastic
gradient descent, dropout, and batch normalization. Gradient noise can lead to
unstable training, slow convergence, and suboptimal solutions, making it difficult
to optimize neural networks. Moreover, gradient noise can cause optimization
algorithms to oscillate or diverge, especially in the presence of large learning rates
or complex loss landscapes. Furthermore, gradient noise can interact with other
challenges like non-convexity and overfitting, making it even more difficult to
achieve reliable convergence. As a result, techniques like gradient smoothing,
noise reduction, and adaptive learning rates are essential to mitigate the effects of
gradient noise and ensure robust neural network optimization.
9. Plateaus: NNs can converge to plateaus, where the loss remains constant for a
long time, making it difficult to escape.
Plateaus are a challenging aspect of neural network optimization, where the
training loss or validation performance remains stagnant for an extended period,
despite continued updates to the model parameters. Plateaus can occur due to
various reasons, including local minima, saddle points, or the optimization
algorithm's inability to escape a particular region of the loss landscape. This can
lead to wasted computational resources, as the model is not improving despite
continued updates.
11. Batch Normalization: Batch normalization can help, but it can also introduce
additional challenges, such as choosing the right normalization parameters.
Batch normalization is a challenging aspect of neural network optimization, as it
can introduce additional complexity and instability to the training process. While
batch normalization helps stabilize training by normalizing the inputs to each
layer (reducing internal covariate shift), it makes the forward pass depend on
minibatch statistics, so behavior can differ between training and inference.
Additionally, batch normalization introduces extra hyperparameters, such as the
momentum term used for its running statistics, that require tuning. Moreover,
batch normalization can interact with other optimization challenges, such as
gradient noise and non-convexity, making it difficult to achieve stable and
reliable convergence. Furthermore, batch normalization's reliance on minibatch
statistics means it can become unreliable with very small batch sizes.
13. Learning Rate Scheduling: Scheduling the learning rate during training can
be challenging.
Learning rate scheduling is a challenging aspect of neural network optimization,
as it requires finding the optimal learning rate schedule to adapt to the changing
loss landscape during training. The learning rate needs to be high enough to
escape local minima, but low enough to converge to the global minimum.
Moreover, the optimal learning rate schedule can vary across different layers,
parameters, and training phases, making it difficult to determine a single optimal
schedule. Additionally, learning rate schedules can interact with other
optimization challenges, such as gradient noise and non-convexity, making it
challenging to achieve stable and reliable convergence. Techniques like
annealing, step decay, and cyclical learning rates can help, but require careful
tuning and monitoring to achieve optimal results. Furthermore, the optimal
learning rate schedule can change during training, requiring adaptive scheduling
methods that can adjust to the changing loss landscape. As a result, learning rate
scheduling typically requires experimentation and ongoing monitoring during
training.
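As a small illustrative sketch, a step-decay schedule (the drop factor and interval
here are arbitrary assumptions):

def step_decay_lr(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

# Example: 0.1 for epochs 0-9, 0.05 for 10-19, 0.025 for 20-29, ...
lrs = [step_decay_lr(0.1, e) for e in range(30)]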
14. Model Selection: Selecting the best model from multiple trained models can
be challenging. Model selection is a challenging aspect of neural network
optimization, as it requires choosing the optimal model architecture,
hyperparameters, and training parameters to achieve the best performance on a
given task. With numerous models, hyperparameters, and training options
available, selecting the optimal combination is a daunting task. Moreover, model
selection is often performed using trial-and-error or grid search methods, which
can be computationally expensive and time-consuming. Additionally, the optimal
model can vary depending on the specific problem, dataset, and evaluation
metrics, making it challenging to determine a single optimal model. Furthermore,
model selection can be affected by factors like overfitting, underfitting, and
regularization, requiring careful consideration of these factors during the
selection process. As a result, model selection requires careful expertise,
computational resources, and time to achieve optimal neural network
optimization, and even then, there is no guarantee of finding the absolute best
model.
These challenges highlight the complexity of NN optimization and the need for
careful hyperparameter tuning, regularization, and optimization algorithm
selection to achieve good performance.
GRADIENT DESCENT APPROACH
1. Initialize parameters: Start with initial values for the model's parameters.
Initializing parameters is a crucial step in the gradient descent approach. The
goal is to set the initial values of the model's parameters, such as weights and
biases, to ensure convergence to the optimal solution. Common initialization
strategies include small random values and schemes such as Xavier/Glorot and
He initialization.
2. Forward pass : Compute the predicted output and loss using the current
parameters.
The forward pass in the gradient descent approach involves computing the
predicted output of the model for a given input, using the current values of the
model's parameters. This typically involves the following steps: multiply the
inputs by each layer's weights and add the biases, apply the activation function,
repeat for every layer, and finally compute the loss between the predicted and
true outputs.
The forward pass is a crucial step in the gradient descent algorithm, as it provides
the necessary information for computing the gradients of the loss function with
respect to the model's parameters. These gradients are then used to update the
parameters during the backward pass, ultimately minimizing the loss and
improving the model's performance.
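A minimal single-layer sketch of a forward pass (assuming NumPy, a sigmoid
activation, and a mean-squared-error loss):

import numpy as np

def forward_pass(X, y, W, b):
    # Linear transformation followed by a sigmoid activation
    z = X @ W + b
    y_pred = 1.0 / (1.0 + np.exp(-z))
    # Mean squared error between predictions and targets
    loss = np.mean((y_pred - y) ** 2)
    return y_pred, loss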
3. Backward pass: Compute the gradients of the loss with respect to each
parameter using backpropagation.
Here's an explanation of the backward pass in the gradient descent approach:
Backward Pass:
Steps:
1. Compute the error gradient: Calculate the gradient of the loss function with
respect to the predicted output.
2. Backpropagate: Propagate the error gradient backward through the network,
applying the chain rule to compute the gradient of the loss with respect to each
parameter.
3. Update parameters: Use the gradients to update the parameters with an
optimization algorithm, such as gradient descent.
Key Equations:
By the chain rule, ∂L/∂w = (∂L/∂ŷ) * (∂ŷ/∂w), applied layer by layer from the
output back toward the input.
Purpose:
1. Compute the gradients of the loss function with respect to the model's
parameters.
2. Update the parameters to minimize the loss.
A minimal sketch of the parameter update step (assuming `weights` and
`param_grad` are NumPy arrays of the same shape):
import numpy as np

def update_parameters(weights, param_grad, learning_rate=0.01):
    # Gradient descent step: move the weights against the gradient
    weights -= learning_rate * param_grad
    return weights
4. Update parameters: Update the parameters using the gradients and a learning
rate.
Gradient Descent (GD) is an optimization algorithm used to minimize the loss
function in machine learning models. It works by iteratively updating the model's
parameters in the direction of the negative gradient of the loss function. The goal
is to find the optimal parameters that result in the lowest loss. GD starts with an
initial set of parameters and computes the gradient of the loss function with
respect to each parameter. It then updates the parameters using the update rule: w
= w - α * ∂L/∂w, where w is the parameter, α is the learning rate, and ∂L/∂w is
the gradient of the loss.
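A minimal end-to-end sketch of this update rule on the toy loss L(w) = (w - 2)^2,
whose minimum is at w = 2:

def grad_L(w):
    # Gradient of L(w) = (w - 2)^2
    return 2 * (w - 2)

w = 0.0
alpha = 0.1
for _ in range(100):
    # Update rule: w = w - alpha * dL/dw
    w -= alpha * grad_L(w)
print(w)  # converges close to 2.0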
2. Momentum: Adds a momentum term to the update rule to help escape local
minima.
Momentum is a technique used in the gradient descent algorithm to help escape
local minima and converge faster to the optimal solution.
What is momentum?
Momentum accumulates an exponentially decaying average of past gradients (a
"velocity") and uses it as the update:
v(t) = γ * v(t-1) + α * ∇L(θ(t))
θ(t+1) = θ(t) - v(t)
Where:
- θ(t) is the parameter value at iteration t.
- α is the learning rate.
- ∇L(θ(t)) is the gradient of the loss function at iteration t.
- γ is the momentum coefficient (typically between 0 and 1).
- v(t) is the velocity (accumulated gradient) at iteration t.
Example:
Suppose we're optimizing a loss function L(θ) using gradient descent with
momentum:
def grad_L(theta):
    # Gradient of the toy loss L(theta) = (theta - 2)^2
    return 2 * (theta - 2)

theta = 0.0
alpha = 0.1
gamma = 0.9
v = 0.0
num_iterations = 100
for _ in range(num_iterations):
    v = gamma * v + alpha * grad_L(theta)
    theta -= v

Common momentum values (γ):
- 0.9
- 0.99
- 0.999
A higher momentum value can lead to faster convergence but may also cause
oscillations.
These algorithms build upon the momentum concept and incorporate additional
techniques to improve convergence and stability.
3. Nesterov Accelerated Gradient (NAG): Evaluates the gradient at a "look-ahead"
position, anticipating where the momentum term is about to carry the parameters:
v(t) = γ * v(t-1) + α * ∇L(θ(t) - γ * v(t-1))
θ(t+1) = θ(t) - v(t)
Where:
- θ(t) is the parameter value at iteration t.
- α is the learning rate.
- ∇L(·) is the gradient of the loss function.
- γ is the momentum coefficient.
- v(t) is the velocity at iteration t.
Example
Suppose we're optimizing a loss function L(θ) using NAG:
import numpy as np

def L(theta):
    return np.sum((theta - 2) ** 2)

def grad_L(theta):
    # Gradient of L(theta)
    return 2 * (theta - 2)

theta = 0.0
alpha = 0.1
gamma = 0.9
v = 0.0
num_iterations = 100
for _ in range(num_iterations):
    # Evaluate the gradient at the look-ahead point theta - gamma * v
    v = gamma * v + alpha * grad_L(theta - gamma * v)
    theta -= v
Disadvantages:
- Requires careful tuning of hyperparameters.
- Computationally expensive.
4. Adagrad: Adapts the learning rate for each parameter based on the magnitude
of the gradient.
Adagrad is a variant of the gradient descent algorithm that adapts the learning rate
for each parameter individually, based on the gradient magnitude.
The AdaGrad update rule is:
θ(t+1) = θ(t) - α / √(G(t) + ε) * ∇L(θ(t))
Where:
- θ(t) is the parameter value at iteration t.
- α is the initial learning rate.
- G(t) is the cumulative gradient magnitude at iteration t.
- ε is a small constant for numerical stability.
- ∇L(θ(t)) is the gradient of the loss function at iteration t.
Adagrad Algorithm:
1. Initialize parameters: θ(0), α, ε.
2. Initialize cumulative gradient magnitude: G(0) = 0.
3. For each iteration t:
a. Compute gradient: ∇L(θ(t)).
b. Update cumulative gradient magnitude: G(t) = G(t-1) + ∇L(θ(t))^2.
c. Compute adaptive learning rate: α / √(G(t) + ε).
d. Update parameter: θ(t+1) = θ(t) - α / √(G(t) + ε) * ∇L(θ(t)).
Benefits:
1. Adaptive learning rate.
2. Improved convergence.
3. Reduced oscillations.
Drawbacks:
1. Diminishing learning rate.
2. Requires careful initialization.
Example (a minimal sketch on the toy loss L(θ) = (θ - 2)^2):
import numpy as np

def grad_L(theta):
    return 2 * (theta - 2)

theta = 0.0
alpha = 0.1
epsilon = 1e-8
G = 0.0
num_iterations = 100
for _ in range(num_iterations):
    g = grad_L(theta)
    G += g ** 2  # accumulate squared gradients
    theta -= alpha / np.sqrt(G + epsilon) * g
Common Hyperparameters:
- Initial learning rate (α): 0.01, 0.1.
- Epsilon (ε): 1e-8, 1e-6.
5. RMSProp: Adapts the learning rate using an exponentially decaying average of
squared gradients:
v(t) = γ * v(t-1) + (1 - γ) * ∇L(θ(t))^2
θ(t+1) = θ(t) - α / √(v(t) + ε) * ∇L(θ(t))
Where:
- θ(t) is the parameter value at iteration t.
- α is the initial learning rate.
- v(t) is the moving average of the squared gradient at iteration t.
- ε is a small constant for numerical stability.
- ∇L(θ(t)) is the gradient of the loss function at iteration t.
- γ is the decay rate of the moving average.
RMSProp Algorithm:
1. Initialize parameters: θ(0), α, ε, γ (decay rate).
2. Initialize moving average: v(0) = 0.
3. For each iteration t:
a. Compute gradient: ∇L(θ(t)).
b. Update moving average: v(t) = γ * v(t-1) + (1 - γ) * ∇L(θ(t))^2.
c. Compute adaptive learning rate: α / √(v(t) + ε).
d. Update parameter: θ(t+1) = θ(t) - α / √(v(t) + ε) * ∇L(θ(t)).
Benefits:
1. Adaptive learning rate.
2. Improved convergence.
3. Reduced oscillations.
4. Less sensitive to hyperparameters.
Drawbacks:
1. Requires careful tuning of decay rate (γ).
2. May converge to local minima.
Example (a minimal sketch on the toy loss L(θ) = (θ - 2)^2):
import numpy as np

def grad_L(theta):
    return 2 * (theta - 2)

theta = 0.0
alpha = 0.1
epsilon = 1e-8
gamma = 0.9
v = 0.0
num_iterations = 100
for _ in range(num_iterations):
    g = grad_L(theta)
    v = gamma * v + (1 - gamma) * g ** 2  # decaying average of squared gradients
    theta -= alpha / np.sqrt(v + epsilon) * g
Common Hyperparameters:
- Initial learning rate (α): 0.01, 0.1.
- Epsilon (ε): 1e-8, 1e-6.
- Decay rate (γ): 0.9, 0.99.
RMSProp is a popular optimization algorithm, especially in deep learning.
6. Adam: Combines the benefits of Adagrad and RMSProp with a bias correction
term.
Adam (Adaptive Moment Estimation) is a popular optimization algorithm that
combines the benefits of Adagrad, RMSProp, and Momentum.
The core update is θ(t+1) = θ(t) - α * m̂(t) / (√v̂(t) + ε), where m̂(t) and v̂(t) are
the bias-corrected first and second moment estimates.
Where:
- θ(t) is the parameter value at iteration t.
- α is the initial learning rate.
- β1 is the first moment coefficient (momentum).
- β2 is the second moment coefficient (variance).
- ε is a small constant for numerical stability.
- m(t) is the first moment estimate (momentum).
- v(t) is the second moment estimate (variance).
Adam Algorithm:
1. Initialize parameters: θ(0), α, β1, β2, ε.
2. Initialize first and second moment estimates: m(0) = 0, v(0) = 0.
3. For each iteration t:
a. Compute gradient: ∇L(θ(t)).
b. Update first moment estimate: m(t) = β1 * m(t-1) + (1 - β1) * ∇L(θ(t)).
c. Update second moment estimate: v(t) = β2 * v(t-1) + (1 - β2) * ∇L(θ(t))^2.
d. Compute bias-corrected estimates: m̂(t) = m(t) / (1 - β1^t), v̂(t) = v(t) / (1 - β2^t).
e. Update parameter: θ(t+1) = θ(t) - α * m̂(t) / (√v̂(t) + ε).
Benefits:
1. Adaptive learning rate.
2. Improved convergence.
3. Reduced oscillations.
4. Less sensitive to hyperparameters.
Drawbacks:
1. Computationally expensive.
2. Requires careful tuning of hyperparameters.
Example (a minimal sketch on the toy loss L(θ) = (θ - 2)^2):
import numpy as np

theta = 0.0
alpha = 0.001
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
m, v = 0.0, 0.0
num_iterations = 100
for t in range(1, num_iterations + 1):
    g = 2 * (theta - 2)  # gradient of the loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta -= alpha * m_hat / (np.sqrt(v_hat) + epsilon)
Common Hyperparameters:
- Initial learning rate (α): 0.001, 0.01.
- First moment coefficient (β1): 0.9, 0.99.
- Second moment coefficient (β2): 0.999, 0.9999.
- Epsilon (ε): 1e-8, 1e-6.
7. AdamW: A variant of Adam that decouples weight decay from the optimization
steps.
AdamW is a variant of the Adam optimization algorithm that incorporates weight
decay, which helps to prevent overfitting.
Where:
- θ(t) is the parameter value at iteration t.
- α is the initial learning rate.
- β1 is the first moment coefficient (momentum).
- β2 is the second moment coefficient (variance).
- ε is a small constant for numerical stability.
- m(t) is the first moment estimate (momentum).
- v(t) is the second moment estimate (variance).
- w is the weight decay coefficient.
- ∇L(θ(t)) is the gradient of the loss function at iteration t.
AdamW Algorithm:
1. Initialize parameters: θ(0), α, β1, β2, ε, w.
2. Initialize first and second moment estimates: m(0) = 0, v(0) = 0.
3. For each iteration t:
a. Compute gradient: ∇L(θ(t)).
b. Update first moment estimate: m(t) = β1 * m(t-1) + (1 - β1) * ∇L(θ(t)).
c. Update second moment estimate: v(t) = β2 * v(t-1) + (1 - β2) * ∇L(θ(t))^2.
d. Compute bias-corrected estimates: m̂(t) = m(t) / (1 - β1^t), v̂(t) = v(t) / (1 - β2^t).
e. Update parameter: θ(t+1) = θ(t) - α * m̂(t) / (√v̂(t) + ε) - α * w * θ(t).
Benefits:
1. Adaptive learning rate.
2. Improved convergence.
3. Reduced oscillations.
4. Less sensitive to hyperparameters.
5. Weight decay helps prevent overfitting.
Drawbacks:
1. Computationally expensive.
2. Requires careful tuning of hyperparameters.
Example (a minimal sketch on the toy loss L(θ) = (θ - 2)^2):
import numpy as np

theta = 0.0
alpha = 0.001
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
weight_decay = 0.01
m, v = 0.0, 0.0
num_iterations = 100
for t in range(1, num_iterations + 1):
    g = 2 * (theta - 2)  # gradient of the loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the weights
    theta -= alpha * m_hat / (np.sqrt(v_hat) + epsilon) + alpha * weight_decay * theta
Common Hyperparameters:
- Initial learning rate (α): 0.001, 0.01.
- First moment coefficient (β1): 0.9, 0.99.
- Second moment coefficient (β2): 0.999, 0.9999.
- Epsilon (ε): 1e-8, 1e-6.
- Weight decay coefficient (w): 0.01, 0.1.
AdamW is widely used in deep learning for its ability to prevent overfitting.
1. Adaptive Learning Rate: Adjust the learning rate during training based on the
gradient magnitude or other criteria.
Adaptive Learning Rate is an optimization approach that adjusts the learning rate
of a gradient descent algorithm during training, based on the characteristics of the
loss landscape and the progress of the optimization process. This approach aims
to improve the convergence speed and stability of the training process by adapting
the learning rate to the specific needs of the model and the data. Unlike traditional
gradient descent, which uses a fixed learning rate throughout training, Adaptive
Learning Rate methods adjust the learning rate dynamically, either globally or
per-parameter, based on factors such as the gradient norm, curvature, and
convergence rate. Examples of Adaptive Learning Rate methods include
AdaGrad, RMSProp, Adam, and Nadam, each with its own specific mechanism
for adjusting the learning rate. By adapting the learning rate, these methods can
escape local minima, handle non-stationary objectives, and improve training
stability, leading to faster convergence, improved accuracy, and reduced
overfitting. Additionally, Adaptive Learning Rate methods can automatically
adjust to changes in the loss landscape during training, making them particularly
useful for complex models and datasets.
3. Adagrad: Adapt the learning rate for each parameter based on the magnitude
of the gradient.
Adagrad is an adaptive optimization algorithm that adjusts the learning rate for
each parameter individually, based on the magnitude of the gradient for that
parameter. It was introduced by John Duchi, Elad Hazan, and Yoram Singer in
2011. Adagrad adapts the learning rate by dividing it by the square root of the
sum of the squares of the gradients for each parameter, effectively reducing the
learning rate for parameters with large gradients and increasing it for parameters
with small gradients. This approach helps to stabilize the training process, reduce
oscillations, and handle sparse gradients effectively.
5. Adam: Adapt the learning rate for each parameter based on the magnitude of
the gradient and a bias correction term.
Adam (Adaptive Moment Estimation) is an adaptive optimization algorithm that
adjusts the learning rate for each parameter individually, based on the magnitude
of the gradient for that parameter. It was introduced by Diederik Kingma and
Jimmy Ba in 2014. Adam combines the benefits of two other adaptive
optimization algorithms, Adagrad and RMSProp, to create a more robust and
efficient algorithm. Adam maintains two moving averages for each parameter:
the first moment (mean) and the second moment (uncentered variance). The first
moment supplies the update direction, like momentum, while the second moment
scales the learning rate for each parameter. Adam also uses decay rates to control
the amount of history to keep,
allowing it to adapt to changing gradients over time. The algorithm is particularly
useful for deep neural networks, where the gradients can be sparse and varying
in magnitude. Adam has been shown to be effective in practice, especially for
tasks such as image classification, speech recognition, and natural language
processing. It is also a popular choice for training recurrent neural networks
(RNNs) and long short-term memory (LSTM) networks. Adam's ability to adapt
to changing gradients, stabilize the update, and handle sparse gradients makes it
a widely used optimization algorithm in deep learning. Its default
hyperparameters also make it easy to use and require minimal tuning.
6. AdamW: Decouple weight decay from the optimization steps and adapt the
learning rate.
The AdamW algorithm is a variant of the Adam optimization algorithm that
decouples the weight decay from the optimization steps, providing a more
effective and efficient way to regularize neural networks. Introduced by Ilya
Loshchilov and Frank Hutter in 2019, AdamW modifies the Adam algorithm by
adding a weight decay term to the update rule, which helps to prevent overfitting
by penalizing large weights. Unlike Adam, which applies weight decay to the
gradients, AdamW applies weight decay directly to the model's weights, leading
to a more accurate and efficient optimization process. AdamW also maintains two
moving averages for each parameter, similar to Adam, but uses a different update
rule that takes into account the weight decay term. This approach allows AdamW
to adapt to changing gradients and stabilize the update, while also effectively
regularizing the model. AdamW has been shown to outperform Adam and other
optimization algorithms in various deep learning tasks, including image
classification, natural language processing, and reinforcement learning. Its ability
to effectively regularize neural networks while maintaining adaptive optimization
makes AdamW a popular choice for training deep learning models.
7. LARS: Layer-wise adaptive rate scaling, which adapts the learning rate for
each layer.
LARS (Layer-wise Adaptive Rate Scaling) is an adaptive optimization algorithm
designed for large-scale deep learning tasks, particularly those involving large
batch sizes. LARS scales the learning rate for each layer by the ratio of the norm
of the layer's weights to the norm of its gradient update, keeping the update size
proportionate across layers of very different scales.
8. SGDR: Stochastic gradient descent with restarts, which adaptively adjusts the
learning rate and momentum.
SGDR (Stochastic Gradient Descent with Warm Restarts) is an adaptive
optimization approach that combines the benefits of stochastic gradient descent
(SGD) with the idea of periodic learning rate warm restarts. Introduced by
Loshchilov and Hutter in 2016, SGDR adapts the learning rate during training by
periodically resetting the learning rate to its initial value, allowing the model to
"forget" previous knowledge and adapt to new information. This approach helps
to escape local minima, improve convergence, and reduce overfitting. In each
cycle, the learning rate starts at its maximum value and is annealed, typically with
a cosine schedule, down toward a minimum; at the end of the cycle, the learning
rate is reset to its initial value and the process repeats, often with progressively
longer cycles. SGDR also relies on the stochastic gradient descent update rule,
whose inherent minibatch noise helps the optimizer escape local minima. By
combining the benefits of SGD
with periodic warm restarts, SGDR can effectively adapt to changing loss
landscapes, improve training stability, and achieve better performance than
traditional SGD. SGDR has been shown to be effective in various deep learning
tasks, including image classification, natural language processing, and speech
recognition.
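A minimal sketch of a cosine-annealing-with-warm-restarts schedule (the cycle
length and learning rate bounds are arbitrary assumptions):

import numpy as np

def sgdr_lr(epoch, cycle_length=10, lr_max=0.1, lr_min=0.001):
    # Position within the current cycle, in [0, 1)
    t = (epoch % cycle_length) / cycle_length
    # Cosine annealing from lr_max toward lr_min; resets each cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t))

# The learning rate restarts at lr_max at epochs 0, 10, 20, ...
lrs = [sgdr_lr(e) for e in range(30)]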
10. Adaptive Batch Size: Adjust the batch size during training to improve
convergence and stability. Adaptive Batch Size is an optimization technique that
dynamically adjusts the batch size during training, based on the computational
resources available and the convergence rate of the model. Unlike traditional
training methods, which use a fixed batch size throughout training, Adaptive
Batch Size adjusts the batch size to balance the need for computational efficiency
with the need for accurate gradient estimates. The approach typically involves
monitoring the model's convergence rate and adjusting the batch size accordingly.
For example, if the model is converging slowly, the batch size may be increased
to provide more accurate gradient estimates, while if the model is converging
quickly, the batch size may be decreased to reduce computational costs. Adaptive
Batch Size can help to improve training efficiency, reduce memory usage, and
increase model accuracy. By adjusting the batch size dynamically, the algorithm
can adapt to changing computational resources and model convergence rates,
leading to more effective and efficient training. This approach has been shown to
be effective in various deep learning tasks, including image classification, natural
language processing, and speech recognition. Additionally, Adaptive Batch Size
is often paired with learning rate scaling rules, since larger batches generally
tolerate proportionally larger learning rates.
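A toy sketch of the idea (the improvement threshold and doubling rule are
illustrative assumptions, not a published algorithm):

def adapt_batch_size(batch_size, recent_losses, max_batch_size=1024):
    # If the loss has nearly stopped improving, grow the batch
    # for less noisy gradient estimates
    if len(recent_losses) >= 2:
        improvement = recent_losses[-2] - recent_losses[-1]
        if improvement < 1e-3 and batch_size < max_batch_size:
            batch_size *= 2
    return batch_size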
3.10 ADAGRAD
AdaGrad is an adaptive learning rate optimization algorithm for training neural
networks. It adapts the learning rate for each parameter based on the magnitude
of the gradient, making it more robust to varying gradient scales.
Adagrad is a stochastic gradient descent optimization algorithm that adapts the
learning rate for each parameter individually, based on the gradient history. It was
introduced by John Duchi, Elad Hazan, and Yoram Singer in 2011. Adagrad helps
in converging faster and avoids overshooting by adjusting the learning rate for
each parameter based on the magnitude of the gradient. It accumulates the
squared gradients for each parameter, which helps in stabilizing the learning rate.
1. _Adaptive learning rate_: Adjusts the learning rate for each parameter
individually.
2. _Gradient magnitude_: Scales the learning rate based on the magnitude of the
gradient.
3. _Element-wise adaptation_: Adapts the learning rate for each parameter
element-wise.
The AdaGrad update rule is:
`w_t+1 = w_t - lr / sqrt(G_t + epsilon) * g_t`
where:
- `w_t` is the parameter at time `t`
- `lr` is the base learning rate
- `g_t` is the gradient at time `t`
- `G_t` is the accumulated sum of squared gradients up to time `t`
- `epsilon` is a small constant to prevent division by zero
AdaGrad limitations:
1. _Aggressive adaptation_: Can lead to very small learning rates, causing slow
convergence.
2. _Limited scalability_: Can be computationally expensive for large models.
RMSPROP
1. _Adaptive learning rate_: Adjusts the learning rate for each parameter
individually.
2. _Exponential decay_: Uses an exponentially decaying average of squared
gradients.
3. _Element-wise adaptation_: Adapts the learning rate for each parameter
element-wise.
4. _Non-vanishing learning rate_: Unlike AdaGrad, the effective learning rate does
not shrink monotonically toward zero, because old squared gradients decay away.
The RMSProp update rule is:
`v_t = gamma * v_t-1 + (1 - gamma) * g_t^2`
`w_t+1 = w_t - lr / sqrt(v_t + epsilon) * g_t`
where:
- `w_t` is the parameter at time `t`
- `lr` is the base learning rate
- `g_t` is the gradient at time `t`
- `v_t` is the exponentially decaying average of squared gradients
- `gamma` is the decay rate
- `epsilon` is a small constant to prevent division by zero
RMSProp benefits:
1. Adapts the learning rate for each parameter individually.
2. Avoids AdaGrad's vanishing learning rate problem.
3. Works well on non-stationary objectives.
ADAM
Adam is a popular adaptive learning rate optimization algorithm for training
neural networks. It combines the benefits of AdaGrad and RMSProp, making it
more robust and efficient.
ADAM (Adaptive Moment Estimation) is a stochastic gradient descent
optimization algorithm that adapts the learning rate for each parameter
individually. Introduced by Diederik Kingma and Jimmy Ba in 2014, ADAM
combines the benefits of two other optimization algorithms, RMSProp and
Adagrad. ADAM uses the moving average of the gradient and the squared
gradient to adapt the learning rate, which helps in stabilizing the learning rate.
What is Adam?
1. Adaptive learning rate: Adjusts the learning rate for each parameter
individually.
2. Exponential decay: Uses an exponentially decaying average of squared
gradients (like RMSProp).
3. Bias correction: Corrects for the bias in the exponentially decaying average.
4. Element-wise adaptation: Adapts the learning rate for each parameter element-
wise.
The Adam update rule is:
`m_t = beta1 * m_t-1 + (1 - beta1) * g_t`
`v_t = beta2 * v_t-1 + (1 - beta2) * g_t^2`
`m_hat = m_t / (1 - beta1^t)`, `v_hat = v_t / (1 - beta2^t)`
`w_t+1 = w_t - lr * m_hat / (sqrt(v_hat) + epsilon)`
where:
- `w_t` is the parameter at time `t`
- `lr` is the base learning rate
- `g_t` is the gradient at time `t`
- `m_t`, `v_t` are the moving averages of the gradient and squared gradient
- `beta1`, `beta2` are the decay rates for the moving averages
- `epsilon` is a small constant to prevent division by zero
Adam benefits:
1. Adaptive, per-parameter learning rates.
2. Bias correction makes early updates reliable.
3. Robust default hyperparameters that require minimal tuning.
Adam is often used as a default optimizer in many deep learning frameworks and
libraries, due to its robustness and efficiency.
BATCH NORMALIZATION
1. Calculate mean and variance: Compute the mean and variance of each feature
in the mini-batch.
2. Normalize inputs: Subtract the mean and divide by the square root of the
variance for each feature.
3. Scale and shift: Apply a learned scale and shift to the normalized inputs.
`y = γ(x - μ) / σ + β`
where:
- `x` is the input
- `μ` is the mean
- `σ` is the standard deviation
- `γ` is the learned scale
- `β` is the learned shift
- `y` is the output
Batch Normalization is typically inserted between a layer's linear transformation
and its activation function (though some practitioners apply it after the
activation). It's a simple yet powerful technique that has become a standard
component in many deep learning architectures.
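A minimal NumPy sketch of the batch normalization forward pass for a mini-batch
`x` of shape (batch, features), with learned `gamma` and `beta` assumed already
initialized:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-feature mean and variance over the mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    # Normalize, then apply the learned scale and shift: y = gamma * x_hat + beta
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Example usage on a random mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4)
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))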