
Module 3 - Important Questions & Answers

Optimization for Training Deep Models: Empirical Risk Minimization, Challenges in Neural Network
Optimization, Basic Algorithms: Stochastic Gradient Descent, Parameter Initialization Strategies,
Algorithms with Adaptive Learning Rates: The AdaGrad algorithm, The RMSProp algorithm, Choosing the
Right Optimization Algorithm.

Textbook 1: Chapters 8.1-8.5

How Learning Differs from Pure Optimization

Learning:
Learning involves improving a model's performance by discovering patterns or structures in
data. It is often dynamic, iterative, and driven by exposure to new data or feedback
(e.g., training a neural network with labeled data).
Example: In supervised learning, the model adjusts its parameters based on the error between
predictions and ground truth.
Pure Optimization:
Optimization is the process of finding the best possible solution for a well-defined mathematical
problem, often defined by minimizing or maximizing an objective function. It is static in nature
and focuses purely on achieving a predefined goal.
Example: Finding the minimum of a convex function (e.g., minimizing a cost function using
gradient descent).
8.1 Explain Empirical Risk Minimization (Model QP)

The goal of a machine learning algorithm is to reduce the expected generalization error, a quantity known as the risk:

J*(θ) = E_{(x,y)~p_data} L(f(x; θ), y)

where L is the per-example loss function, f(x; θ) is the model's prediction, and p_data is the true data-generating distribution. The generalization error, or risk, measures how well a machine learning model performs on unseen data. Since p_data is unknown, we instead minimize the average loss over the training set, obtained by replacing p_data with the empirical distribution p̂_data:

E_{(x,y)~p̂_data} L(f(x; θ), y) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

where m is the number of training examples. The training process based on minimizing this average training error is known as empirical risk minimization (ERM). ERM assumes that minimizing the empirical risk on the training set will also reduce the true risk on the underlying distribution, so the learning problem is effectively converted into an optimization problem.
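
As an illustration, here is a minimal NumPy sketch of computing the empirical risk as the mean per-example loss over a training set. The linear model and squared loss are stand-ins chosen for brevity, not textbook code:

```python
import numpy as np

def squared_loss(y_pred, y_true):
    # Per-example loss L(f(x; theta), y)
    return (y_pred - y_true) ** 2

def empirical_risk(theta, X, y):
    # (1/m) * sum_i L(f(x_i; theta), y_i), with f a linear model as a stand-in
    y_pred = X @ theta
    return np.mean(squared_loss(y_pred, y))

# The empirical risk of an untrained model on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, y))
```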

Alternatives to ERM

Due to the challenges of overfitting and non-differentiability, ERM is rarely used directly in deep
learning. Instead, practitioners adopt modified approaches:

(a) Surrogate Loss Functions:

 Replace non-differentiable loss functions (e.g., 0-1 loss) with differentiable alternatives, such as:
o Cross-entropy loss for classification.
o Mean squared error (MSE) for regression.

(b) Regularization:

 Add terms to the loss function to prevent overfitting, such as:


o L2 regularization (weight decay).
o Dropout.

(c) Optimization on Training Loss:


In practice, we optimize a proxy loss function J(θ) that may include regularization terms and is designed for compatibility with gradient-based optimization; a combined sketch is given below.
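
To tie (a)-(c) together, here is a hedged NumPy sketch of a proxy objective: a differentiable surrogate (softmax cross-entropy in place of 0-1 loss) plus an L2 penalty. Function and parameter names are illustrative, not from the textbook:

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy; labels are integer class ids.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def proxy_loss(W, X, labels, weight_decay=1e-4):
    # Surrogate loss plus L2 regularization: J(W) = CE + (lambda/2) * ||W||^2
    return cross_entropy(X @ W, labels) + 0.5 * weight_decay * np.sum(W ** 2)
```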

8.2 Explain the Challenges in Neural Network Optimization (Model QP)


Most prominent challenges involved in optimization for training deep models are described
below:
a) Ill-Conditioning:
 The ill-conditioning problem is generally believed to be present in neural network training problems.
 Ill-conditioning occurs when the Hessian matrix (the matrix of second derivatives of the cost function) has a wide range of eigenvalues. This makes optimization difficult, because the cost surface has steep slopes in some directions and nearly flat regions in others.
 Ill-conditioning can cause gradient descent methods, such as stochastic gradient descent (SGD), to struggle: even small steps can increase the cost function in poorly conditioned directions, and the learning rate must be reduced to maintain stability, which slows convergence. The sketch below illustrates this on a simple quadratic.
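
A minimal sketch (our own toy example, not from the textbook) of gradient descent on an ill-conditioned quadratic f(x) = (1/2) x^T H x, where H has eigenvalues 1 and 100 (condition number 100):

```python
import numpy as np

H = np.diag([1.0, 100.0])   # Hessian with a wide spread of eigenvalues
x = np.array([1.0, 1.0])
lr = 0.01                   # must satisfy lr < 2 / lambda_max for stability
for step in range(200):
    grad = H @ x            # gradient of the quadratic is H x
    x = x - lr * grad
print(x)  # the coordinate along the flat direction converges very slowly
```

Because the learning rate is capped by the steepest direction, progress along the flat direction is painfully slow; this is exactly the slowdown that ill-conditioning causes.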
b) Local Minima
Local minima refer to a situation where the optimization algorithm finds a set of model
parameters that correspond to the minimum value of the loss function in a small region of the
parameter space.
However, this minimum value may not be the global minimum of the loss function, which
corresponds to the smallest value of the loss function across the entire parameter space.
Global minima
Global minima are the absolute lowest points of the loss function, which correspond to the
optimal set of parameters for a model. The goal of any optimization algorithm is to find the
global minimum, which will produce the best results for the given problem.

Local Minima in Convex vs. Non-Convex Optimization:

 Convex Optimization: Any local minimum is a global minimum, or part of a flat region that is an acceptable solution.

 Non-Convex Optimization: Neural networks have a large number of local minima due to their non-convex nature.

c) Plateaus, Saddle Points and Other Flat Regions
The term "saddle point" refers to a point in the optimization landscape of a cost function where the gradient is zero but the point is neither a minimum nor a maximum. A saddle point appears as a local minimum along some cross-sections and as a local maximum along others. The loss surfaces of neural networks contain many saddle points in high dimensions, as shown in the figure (the blue line indicates saddle points).

Optimization challenges with respect to saddle points (a toy example follows the list):

 First-order methods (e.g., gradient descent): often struggle near saddle points because the gradient magnitude becomes small.
 Second-order methods (e.g., Newton's method): explicitly solve for points with zero gradient and are therefore more prone to getting stuck at saddle points.
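
A small sketch (our own example) of the classic saddle of f(x, y) = x^2 - y^2: the gradient is zero at the origin, yet the point is a minimum along x and a maximum along y, and gradient descent started slightly off the saddle drifts away along the negative-curvature direction:

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])   # gradient of f(x, y) = x^2 - y^2

p = np.array([0.0, 1e-3])                  # start just off the saddle at (0, 0)
for _ in range(100):
    p = p - 0.1 * grad_f(p)                # plain gradient descent
print(p)  # y grows geometrically: descent escapes along the -y curvature
```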

Plateau: A plateau is a flat region of the loss landscape where the gradients are very small. This is
often the case when using activation functions like the sigmoid or hyperbolic tangent, which
have flat regions in their output.

Challenges with respect to plateaus:

When an optimization algorithm such as gradient descent encounters a plateau, it can cause
problems because the gradients are very small. The gradient is used to update the parameters of
the model, and if it’s close to zero, the updates will also be very small.

d) Cliffs and Exploding Gradients: A cliff refers to a region in the optimization landscape where the loss function or cost function changes extremely rapidly. These regions exhibit very steep gradients, making optimization challenging for gradient-based methods. Cliffs can occur both in the parameter space of a model and in input data spaces, depending on the context.
The figure shows a cliff (marked with a dark line) where points that are close together in parameter space have very different costs.
The cliff can be dangerous whether we approach it from above or from below, but it can be avoided using the gradient clipping heuristic, sketched below.
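
A minimal sketch of the gradient clipping heuristic (function and parameter names are ours): if the gradient norm exceeds a threshold, rescale the gradient to that norm before taking the step, preserving its direction but capping its magnitude:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the gradient if its norm exceeds max_norm; keep its direction.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Near a cliff the raw gradient can be enormous; clipping keeps the step sane.
raw_grad = np.array([1e6, -2e6])
theta = np.zeros(2)
theta = theta - 0.01 * clip_by_norm(raw_grad, max_norm=5.0)
print(theta)
```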
e) Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs, and recurrent networks construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties: multiplying repeatedly by the same matrix W scales signals by W's eigenvalues raised to the power of the number of steps, so components explode when an eigenvalue's magnitude exceeds 1 and vanish when it is below 1 (see the sketch below).
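
A tiny sketch (our own illustration) of how repeated application of one matrix produces exploding and vanishing components:

```python
import numpy as np

W = np.diag([1.1, 0.9])   # one expanding and one contracting eigenvalue
h = np.array([1.0, 1.0])
for t in range(100):
    h = W @ h             # after t steps, h = W^t h_0
print(h)                  # roughly [1.1**100, 0.9**100]: explodes and vanishes
```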
f) Inexact Gradients
Most optimization algorithms are designed with the assumption that we have access to the
exact gradient or Hessian matrix. In practice, we usually only have a noisy or even biased
estimate of these quantities. In other cases, the objective function we want to minimize is
actually intractable. When the objective function is intractable, typically its gradient is
intractable as well. In such cases we can only approximate the gradient. Various neural network
optimization algorithms are designed to account for imperfections in the gradient estimate.
g) Poor Correspondence between Local and Global Structure:
Gradient descent may succeed locally but fail globally. Local optimization methods (like gradient descent) perform well in determining the best downhill direction at a single point, but the global landscape of the cost function may lead to long, inefficient trajectories.
 For instance, local optimization can fail if the direction of steepest descent does not point toward the global minimum or a better solution.

Figure : Optimization based on local downhill moves can fail if the local surface does not point toward
the global solution.

8.3 Basic Algorithms

I) Stochastic Gradient Descent:

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. It is possible to obtain an unbiased estimate of the gradient by taking the average gradient over a mini-batch of m examples drawn from the data-generating distribution. The update is:

g ← (1/m) Σ_{i=1}^{m} ∇_θ L(f(x^(i); θ), y^(i))
θ ← θ − ε g

where ε is the learning rate, m is the mini-batch size, L is the per-example loss, and f(x^(i); θ) is the prediction for input x^(i).
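
A minimal Python sketch of the SGD loop. Here grad_loss is an assumed helper that returns the average gradient of the loss over a mini-batch; it is a placeholder, not a textbook function:

```python
import numpy as np

def sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(X))   # shuffle examples each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            g = grad_loss(theta, X[batch], y[batch])  # unbiased gradient estimate
            theta = theta - lr * g                    # theta <- theta - eps * g
    return theta
```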
II) Stochastic Gradient Descent with Momentum:

Momentum is designed to improve the speed and stability of stochastic gradient descent (SGD), especially in challenging situations such as:

1. High curvature (e.g., narrow valleys in the loss surface).
2. Small but consistent gradients (slow progress in steady regions).
3. Noisy gradients (oscillations due to stochasticity in updates).

The momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space.

How Momentum Works:

1. Velocity Variable (v):

o v represents the velocity (direction and speed of parameter updates).
o It accumulates the gradients over time, smoothed by an exponential decay factor.

2. Hyperparameter α:

o Determines how much weight is given to past gradients.
o α ∈ [0,1): larger values mean a slower decay of past gradients, allowing stronger momentum.
3. Update Equations:

The momentum-based update has two steps:

o Update the velocity:

v ← α v − ε ∇_θ ( (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)) )

o Update the parameters:

θ ← θ + v
III) Nesterov Momentum (Optional: read once)

In standard momentum, the gradient is computed at the current position of the parameters. In Nesterov momentum, the gradient is computed at a "lookahead" position: the current position plus a fraction of the velocity. The difference between Nesterov momentum and standard momentum is therefore where the gradient is evaluated: with Nesterov momentum the gradient is evaluated after the current velocity has been applied,

v ← α v − ε ∇_θ L(θ + α v),   θ ← θ + v

so one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.
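
A sketch of one Nesterov step; grad_fn is an assumed callable returning the gradient at a given parameter vector:

```python
def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    g = grad_fn(theta + alpha * v)   # gradient at the lookahead position
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v
```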

8.4 Parameter Initialization Strategies

Initializing the parameters of a deep neural network is an important step in the training process, as it can have a significant impact on the convergence and performance of the model.

1. Zero Initialization: Initialize all the weights and biases to zero. This is not
generally used in deep learning as it leads to symmetry in the gradients,
resulting in all the neurons learning the same feature.

2. Random Initialization: Initialize the weights and biases randomly from a uniform
or normal distribution. This is the most common technique used in deep
learning.

3. Xavier Initialization: Initialize the weights from a normal distribution with mean 0 and standard deviation sqrt(1/n) (i.e., variance 1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function.

4. He Initialization: Initialize the weights from a normal distribution with mean 0 and standard deviation sqrt(2/n) (i.e., variance 2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function.
5. Orthogonal Initialization: Initialize the weights with an orthogonal matrix, which
preserves the gradient norm during backpropagation.

6. Uniform Initialization: Initialize the weights with a uniform distribution. This is less commonly used than normal (Gaussian) random initialization.

7. Constant Initialization: Initialize the weights and biases with a constant value.
This is rarely used in deep learning.
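
Hedged sketches of the main initializers above (assuming n_in is the fan-in, i.e., the number of neurons in the previous layer, and n_in >= n_out for the orthogonal case):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Normal, mean 0, std sqrt(1/n_in): suited to sigmoid/tanh layers.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Normal, mean 0, std sqrt(2/n_in): suited to ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def orthogonal_init(n_in, n_out):
    # QR decomposition of a random Gaussian matrix yields orthonormal columns,
    # which helps preserve gradient norms during backpropagation.
    q, _ = np.linalg.qr(rng.normal(size=(n_in, n_out)))
    return q
```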

8.5 Algorithms with Adaptive Learning Rates:

A number of incremental (mini-batch-based) methods have been introduced that adapt the learning rates of individual model parameters.

a) AdaGrad

The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values.

The parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. AdaGrad performs well for some but not all deep learning models.
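
A minimal sketch of one AdaGrad step (delta is a small constant for numerical stability; the names are conventional, not from the textbook):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    r = r + grad ** 2                                   # accumulate squared gradients
    theta = theta - (lr / (delta + np.sqrt(r))) * grad  # per-parameter scaled step
    return theta, r
```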

b) RMSProp and Adam:
The RMSProp algorithm modifies AdaGrad by replacing the accumulated sum of squared gradients with an exponentially weighted moving average, so that the learning rates adapt to the recent gradient history rather than the entire history. Adam is yet another adaptive learning rate optimization algorithm; the name "Adam" derives from the phrase "adaptive moments". Adam (short for Adaptive Moment Estimation) combines ideas from momentum and RMSProp.
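
Minimal sketches of one RMSProp step and one Adam step (hyperparameter names follow common conventions and are assumptions, not textbook code):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * grad ** 2      # moving average of squared gradients
    theta = theta - (lr / np.sqrt(delta + r)) * grad
    return theta, r

def adam_step(theta, s, r, t, grad, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    t = t + 1
    s = beta1 * s + (1 - beta1) * grad       # first moment (momentum-like)
    r = beta2 * r + (1 - beta2) * grad ** 2  # second moment (RMSProp-like)
    s_hat = s / (1 - beta1 ** t)             # bias-corrected first moment
    r_hat = r / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t
```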

Note: you can simplify the algorithms and just write the steps without the equations; just mention the parameters.
