
Module 3 - Important Questions & Answers

Optimization for Training Deep Models: Empirical Risk Minimization, Challenges in Neural Network
Optimization, Basic Algorithms: Stochastic Gradient Descent, Parameter Initialization Strategies,
Algorithms with Adaptive Learning Rates: The AdaGrad algorithm, The RMSProp algorithm, Choosing the
Right Optimization Algorithm.

Textbook 1: Chapters 8.1-8.5

How Learning Differs from Pure Optimization

Learning:
Learning involves improving a model's performance by discovering patterns or structures in
data. It is often dynamic, iterative, and driven by exposure to new data or feedback
(e.g., training a neural network with labeled data).
Example: In supervised learning, the model adjusts its parameters based on the error between
predictions and ground truth.
Pure Optimization:
Optimization is the process of finding the best possible solution for a well-defined mathematical
problem, often defined by minimizing or maximizing an objective function. It is static in nature
and focuses purely on achieving a predefined goal.
Example: Finding the minimum of a convex function (e.g., minimizing a cost function using
gradient descent).
8.1 Explain Empirical Risk Minimization (Model QP)

The goal of a machine learning algorithm is to reduce the expected generalization error, a quantity known as the risk:

J*(θ) = E_{(x,y)~p_data} L(f(x; θ), y)

where L is the per-example loss function, f(x; θ) is the model's prediction, and p_data is the true data-generating distribution. The generalization error, or risk, measures how well a machine learning model performs on unseen data. Since p_data is unknown, we instead minimize the average loss over the training set, obtained by replacing p_data with the empirical distribution p̂_data:

E_{(x,y)~p̂_data} L(f(x; θ), y) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

where m is the number of training examples. The training process based on minimizing this average training error is known as empirical risk minimization (ERM). ERM assumes that minimizing the empirical risk on the training set will also reduce the true risk on the underlying distribution, so the learning problem is effectively converted into an optimization problem.
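
As an illustration, here is a minimal NumPy sketch of computing the empirical risk as the mean per-example loss over a training set. The linear model and squared loss are stand-ins chosen for brevity, not textbook code:

```python
import numpy as np

def squared_loss(y_pred, y_true):
    # Per-example loss L(f(x; theta), y)
    return (y_pred - y_true) ** 2

def empirical_risk(theta, X, y):
    # (1/m) * sum_i L(f(x_i; theta), y_i), with f a linear model as a stand-in
    y_pred = X @ theta
    return np.mean(squared_loss(y_pred, y))

# The empirical risk of an untrained model on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, y))
```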

Alternatives to ERM

Due to the challenges of overfitting and non-differentiability, ERM is rarely used directly in deep
learning. Instead, practitioners adopt modified approaches:

(a) Surrogate Loss Functions:

 Replace non-differentiable loss functions (e.g., 0-1 loss) with differentiable alternatives, such as:
o Cross-entropy loss for classification.
o Mean squared error (MSE) for regression.

(b) Regularization:

 Add terms to the loss function to prevent overfitting, such as:


o L2 regularization (weight decay).
o Dropout.

(c) Optimization on Training Loss:


In practice, we optimize a proxy loss function J(θ) that may include regularization terms and is designed for compatibility with gradient-based optimization; a combined sketch is given below.
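
To tie (a)-(c) together, here is a hedged NumPy sketch of a proxy objective: a differentiable surrogate (softmax cross-entropy in place of 0-1 loss) plus an L2 penalty. Function and parameter names are illustrative, not from the textbook:

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy; labels are integer class ids.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def proxy_loss(W, X, labels, weight_decay=1e-4):
    # Surrogate loss plus L2 regularization: J(W) = CE + (lambda/2) * ||W||^2
    return cross_entropy(X @ W, labels) + 0.5 * weight_decay * np.sum(W ** 2)
```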

8.2 Explain the Challenges in Neural Network Optimization (Model QP)


Most prominent challenges involved in optimization for training deep models are described
below:
a) Ill-Conditioning:
 The ill-conditioning problem is generally believed to be present in neural network training problems.
 Ill-conditioning occurs when the Hessian matrix (the matrix of second derivatives of the cost function) has a wide range of eigenvalues. This makes optimization difficult, because the cost surface has steep slopes in some directions and nearly flat regions in others.
 Ill-conditioning can cause gradient descent methods, such as stochastic gradient descent (SGD), to struggle: even small steps can increase the cost function in poorly conditioned directions, and the learning rate must be reduced to maintain stability, which slows convergence. The sketch below illustrates this on a simple quadratic.
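
A minimal sketch (our own toy example, not from the textbook) of gradient descent on an ill-conditioned quadratic f(x) = (1/2) x^T H x, where H has eigenvalues 1 and 100 (condition number 100):

```python
import numpy as np

H = np.diag([1.0, 100.0])   # Hessian with a wide spread of eigenvalues
x = np.array([1.0, 1.0])
lr = 0.01                   # must satisfy lr < 2 / lambda_max for stability
for step in range(200):
    grad = H @ x            # gradient of the quadratic is H x
    x = x - lr * grad
print(x)  # the coordinate along the flat direction converges very slowly
```

Because the learning rate is capped by the steepest direction, progress along the flat direction is painfully slow; this is exactly the slowdown that ill-conditioning causes.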
b) Local Minima
Local minima refer to a situation where the optimization algorithm finds a set of model
parameters that correspond to the minimum value of the loss function in a small region of the
parameter space.
However, this minimum value may not be the global minimum of the loss function, which
corresponds to the smallest value of the loss function across the entire parameter space.
Global minima
Global minima are the absolute lowest points of the loss function, which correspond to the
optimal set of parameters for a model. The goal of any optimization algorithm is to find the
global minimum, which will produce the best results for the given problem.

Local Minima in Convex vs. Non-Convex Optimization:

 Convex Optimization: Any local minimum is a global minimum, or part of a flat region that is an acceptable solution.

 Non-Convex Optimization: Neural networks have a large number of local minima due to their non-convex nature.

c) Plateaus, Saddle Points and Other Flat Regions
The term "saddle point" refers to a point in the optimization landscape of a cost function where the gradient is zero but the point is neither a minimum nor a maximum. A saddle point appears as a local minimum along some cross-sections and as a local maximum along others. The loss surfaces of neural networks contain many saddle points in high dimensions, as shown in the figure (the blue line indicates saddle points).

Optimization challenges with respect to saddle points (a toy example follows the list):

 First-order methods (e.g., gradient descent): often struggle near saddle points because the gradient magnitude becomes small.
 Second-order methods (e.g., Newton's method): explicitly solve for points with zero gradient and are therefore more prone to getting stuck at saddle points.
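
A small sketch (our own example) of the classic saddle of f(x, y) = x^2 - y^2: the gradient is zero at the origin, yet the point is a minimum along x and a maximum along y, and gradient descent started slightly off the saddle drifts away along the negative-curvature direction:

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])   # gradient of f(x, y) = x^2 - y^2

p = np.array([0.0, 1e-3])                  # start just off the saddle at (0, 0)
for _ in range(100):
    p = p - 0.1 * grad_f(p)                # plain gradient descent
print(p)  # y grows geometrically: descent escapes along the -y curvature
```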

Plateau: A plateau is a flat region of the loss landscape where the gradients are very small. This is
often the case when using activation functions like the sigmoid or hyperbolic tangent, which
have flat regions in their output.

Challenges with respect to plateaus:

When an optimization algorithm such as gradient descent encounters a plateau, it can cause
problems because the gradients are very small. The gradient is used to update the parameters of
the model, and if it’s close to zero, the updates will also be very small.

d) Cliffs and Exploding Gradients: A cliff refers to a region in the optimization landscape where the loss function or cost function changes extremely rapidly. These regions exhibit very steep gradients, making optimization challenging for gradient-based methods. Cliffs can occur both in the parameter space of a model and in input data spaces, depending on the context.
The figure shows a cliff (marked with a dark line) where points that are close together in parameter space have very different costs.
The cliff can be dangerous whether we approach it from above or from below, but it can be avoided using the gradient clipping heuristic, sketched below.
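
A minimal sketch of the gradient clipping heuristic (function and parameter names are ours): if the gradient norm exceeds a threshold, rescale the gradient to that norm before taking the step, preserving its direction but capping its magnitude:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the gradient if its norm exceeds max_norm; keep its direction.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Near a cliff the raw gradient can be enormous; clipping keeps the step sane.
raw_grad = np.array([1e6, -2e6])
theta = np.zeros(2)
theta = theta - 0.01 * clip_by_norm(raw_grad, max_norm=5.0)
print(theta)
```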
e) Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs, and recurrent networks construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties: multiplying repeatedly by the same matrix W scales signals by W's eigenvalues raised to the power of the number of steps, so components explode when an eigenvalue's magnitude exceeds 1 and vanish when it is below 1 (see the sketch below).
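
A tiny sketch (our own illustration) of how repeated application of one matrix produces exploding and vanishing components:

```python
import numpy as np

W = np.diag([1.1, 0.9])   # one expanding and one contracting eigenvalue
h = np.array([1.0, 1.0])
for t in range(100):
    h = W @ h             # after t steps, h = W^t h_0
print(h)                  # roughly [1.1**100, 0.9**100]: explodes and vanishes
```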
f) Inexact Gradients
Most optimization algorithms are designed with the assumption that we have access to the
exact gradient or Hessian matrix. In practice, we usually only have a noisy or even biased
estimate of these quantities. In other cases, the objective function we want to minimize is
actually intractable. When the objective function is intractable, typically its gradient is
intractable as well. In such cases we can only approximate the gradient. Various neural network
optimization algorithms are designed to account for imperfections in the gradient estimate.
g) Poor Correspondence between Local and Global Structure:
Gradient descent may succeed locally but fail globally. Local optimization methods (like gradient descent) perform well in determining the best downhill direction at a single point, but the global landscape of the cost function may lead to long, inefficient trajectories.
 For instance, local optimization can fail if the direction of steepest descent does not point toward the global minimum or a better solution.

Figure : Optimization based on local downhill moves can fail if the local surface does not point toward
the global solution.

8.3 Basic Algorithms

I) Stochastic Gradient Descent:

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. It is possible to obtain an unbiased estimate of the gradient by taking the average gradient over a mini-batch of m examples drawn from the data-generating distribution. The update is:

g ← (1/m) Σ_{i=1}^{m} ∇_θ L(f(x^(i); θ), y^(i))
θ ← θ − ε g

where ε is the learning rate, m is the mini-batch size, L is the per-example loss, and f(x^(i); θ) is the prediction for input x^(i).
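
A minimal Python sketch of the SGD loop. Here grad_loss is an assumed helper that returns the average gradient of the loss over a mini-batch; it is a placeholder, not a textbook function:

```python
import numpy as np

def sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(X))   # shuffle examples each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            g = grad_loss(theta, X[batch], y[batch])  # unbiased gradient estimate
            theta = theta - lr * g                    # theta <- theta - eps * g
    return theta
```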
II) Stochastic Gradient Descent with Momentum:

Momentum is designed to improve the speed and stability of stochastic gradient descent (SGD), especially in challenging situations such as:

1. High curvature (e.g., narrow valleys in the loss surface).
2. Small but consistent gradients (slow progress in steady regions).
3. Noisy gradients (oscillations due to stochasticity in updates).

The momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space.

How Momentum Works:

1. Velocity Variable (v):

o v represents the velocity (direction and speed of parameter updates).
o It accumulates the gradients over time, smoothed by an exponential decay factor.

2. Hyperparameter α:

o Determines how much weight is given to past gradients.
o α ∈ [0,1): larger values mean a slower decay of past gradients, allowing stronger momentum.
3. Update Equations:

The momentum-based update has two steps:

o Update the velocity:

v ← α v − ε ∇_θ ( (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)) )

o Update the parameters:

θ ← θ + v
III) Nesterov Momentum (Optional: read once)

In standard momentum, the gradient is computed at the current position of the parameters. In Nesterov momentum, the gradient is computed at a "lookahead" position: the current position plus a fraction of the velocity. The difference between Nesterov momentum and standard momentum is therefore where the gradient is evaluated: with Nesterov momentum the gradient is evaluated after the current velocity has been applied,

v ← α v − ε ∇_θ L(θ + α v),   θ ← θ + v

so one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.
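
A sketch of one Nesterov step; grad_fn is an assumed callable returning the gradient at a given parameter vector:

```python
def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    g = grad_fn(theta + alpha * v)   # gradient at the lookahead position
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v
```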

8.4 Parameter Initialization Strategies

Initializing the parameters of a deep neural network is an important step in the training process, as it can have a significant impact on the convergence and performance of the model.

1. Zero Initialization: Initialize all the weights and biases to zero. This is not
generally used in deep learning as it leads to symmetry in the gradients,
resulting in all the neurons learning the same feature.

2. Random Initialization: Initialize the weights and biases randomly from a uniform
or normal distribution. This is the most common technique used in deep
learning.

3. Xavier Initialization: Initialize the weights from a normal distribution with mean 0 and standard deviation sqrt(1/n) (i.e., variance 1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function.

4. He Initialization: Initialize the weights from a normal distribution with mean 0 and standard deviation sqrt(2/n) (i.e., variance 2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function.
5. Orthogonal Initialization: Initialize the weights with an orthogonal matrix, which
preserves the gradient norm during backpropagation.

6. Uniform Initialization: Initialize the weights with a uniform distribution. This is less commonly used than normal (Gaussian) random initialization.

7. Constant Initialization: Initialize the weights and biases with a constant value.
This is rarely used in deep learning.
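
Hedged sketches of the main initializers above (assuming n_in is the fan-in, i.e., the number of neurons in the previous layer, and n_in >= n_out for the orthogonal case):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Normal, mean 0, std sqrt(1/n_in): suited to sigmoid/tanh layers.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Normal, mean 0, std sqrt(2/n_in): suited to ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def orthogonal_init(n_in, n_out):
    # QR decomposition of a random Gaussian matrix yields orthonormal columns,
    # which helps preserve gradient norms during backpropagation.
    q, _ = np.linalg.qr(rng.normal(size=(n_in, n_out)))
    return q
```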

8.5 Algorithms with Adaptive Learning Rates:

A number of incremental (mini-batch-based) methods have been introduced that adapt the learning rates of individual model parameters.

a) AdaGrad

The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values.

The parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. AdaGrad performs well for some but not all deep learning models.
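
A minimal sketch of one AdaGrad step (delta is a small constant for numerical stability; the names are conventional, not from the textbook):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    r = r + grad ** 2                                   # accumulate squared gradients
    theta = theta - (lr / (delta + np.sqrt(r))) * grad  # per-parameter scaled step
    return theta, r
```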

b) RMSProp and Adam:
The RMSProp algorithm modifies AdaGrad by replacing the accumulated sum of squared gradients with an exponentially weighted moving average, so that the learning rates adapt to the recent gradient history rather than the entire history. Adam is yet another adaptive learning rate optimization algorithm; the name "Adam" derives from the phrase "adaptive moments". Adam (short for Adaptive Moment Estimation) combines ideas from momentum and RMSProp.
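
Minimal sketches of one RMSProp step and one Adam step (hyperparameter names follow common conventions and are assumptions, not textbook code):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * grad ** 2      # moving average of squared gradients
    theta = theta - (lr / np.sqrt(delta + r)) * grad
    return theta, r

def adam_step(theta, s, r, t, grad, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    t = t + 1
    s = beta1 * s + (1 - beta1) * grad       # first moment (momentum-like)
    r = beta2 * r + (1 - beta2) * grad ** 2  # second moment (RMSProp-like)
    s_hat = s / (1 - beta1 ** t)             # bias-corrected first moment
    r_hat = r / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t
```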

Note: you can simplify the algorithms and just write the steps without the equations; just mention the parameters.
