
Nesterov’s momentum method

Sanjay J. Alex (2023EEY7517), Shubham Baranwal (2023EEY7520), Anurag Rai (2023EEZ7504)

MCL-758 Optimization

Abstract—Gradient descent is essential for minimizing objective functions but faces limitations such as slow progress in flat regions and sensitivity to noisy gradients. The introduction of momentum helps overcome these challenges by providing a more consistent direction for the search. The Nesterov momentum method further refines this approach by anticipating future gradients, offering a more efficient trajectory towards the optimum. This report highlights how Nesterov momentum improves the reliability and efficiency of finding optimal solutions with gradient descent algorithms.

Keywords—Gradient descent, Momentum, Nesterov’s Momentum.
1. Introduction

Gradient descent is an optimization algorithm that follows an objective function’s negative gradient to locate the function’s minimum.

A limitation of gradient descent is that it can get stuck in flat areas or bounce around if the objective function returns noisy gradients. Momentum is an approach that accelerates the progress of the search to skim across flat areas and smooth out bouncy gradients.

In some cases, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. Nesterov momentum is an extension of momentum that involves calculating the decaying moving average of the gradients of projected positions in the search space rather than the actual positions themselves.

This has the effect of harnessing the accelerating benefits of momentum whilst allowing the search to slow down when approaching the optima, reducing the likelihood of missing or overshooting them.

2. Gradient Descent

The gradient descent algorithm requires a target function to be optimized and the derivative function of that objective. The target function 𝑓() returns a score for a given set of inputs, and the derivative function 𝑓′() gives the derivative of the target function for a given set of inputs. The algorithm also requires a starting point 𝑥 in the problem, such as a randomly selected point in the input space.

The derivative is then calculated, and a step is taken in the input space. This is expected to result in a downhill movement in the target function, assuming we are minimizing it.

A downhill movement is made by first calculating how far to move in the input space: the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, i.e. down the target function:

𝑥(𝑡 + 1) = 𝑥(𝑡) − 𝛼 × 𝑓′(𝑥(𝑡))   (1)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step is scaled using a step size hyperparameter.

Step Size (𝛼): Hyperparameter that controls how far to move in the search space against the gradient in each iteration of the algorithm. If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
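As a minimal illustrative sketch, update rule (1) can be implemented as follows; the quadratic objective, starting point, and step size here are assumed purely for illustration:

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.1, n_iter=100):
        """Follow the negative gradient: x(t+1) = x(t) - alpha * f'(x(t))."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = x - alpha * grad_f(x)  # step against the gradient, eq. (1)
        return x

    # Illustrative run on f(x) = ||x||^2, whose gradient is 2x and whose minimum is at 0.
    print(gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]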
3. Momentum

Momentum is an extension of the gradient descent algorithm designed to speed up optimization. It adds history to the parameter update equation based on the gradients encountered during previous updates, resulting in more efficient optimization. The approach is based on the metaphor of momentum from physics, where acceleration in a direction can be accumulated from past updates.

Momentum involves adding a hyperparameter that controls the amount of history (momentum) to include in the update equation, i.e. the step to a new point in the search space. The value of the hyperparameter is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0 is the same as gradient descent without momentum. The update rule for the Momentum method is given by:

𝑣𝑡+1 = 𝛽𝑣𝑡 + (1 − 𝛽)∇𝑓(𝑥𝑡 )   (2)
𝑥𝑡+1 = 𝑥𝑡 − 𝛼𝑣𝑡+1   (3)

where:
• 𝑥𝑡 is the parameter vector at iteration 𝑡,
• ∇𝑓(𝑥𝑡 ) is the gradient of the objective function with respect to 𝑥𝑡 ,
• 𝑣𝑡 is the velocity vector at iteration 𝑡,
• 𝛼 is the learning rate, and
• 𝛽 is the momentum coefficient, typically chosen between 0 and 1.

The momentum term 𝑣𝑡 can be interpreted as an exponentially weighted average of past gradients.
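A minimal sketch of updates (2)–(3), mirroring the gradient descent sketch above; 𝛽 = 0.9 and the quadratic objective are again illustrative assumptions:

    import numpy as np

    def momentum_descent(grad_f, x0, alpha=0.1, beta=0.9, n_iter=100):
        """Gradient descent with momentum, eqs. (2)-(3)."""
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)                       # velocity starts at zero
        for _ in range(n_iter):
            v = beta * v + (1 - beta) * grad_f(x)  # eq. (2): exponentially weighted gradients
            x = x - alpha * v                      # eq. (3): step along the velocity
        return x

    print(momentum_descent(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]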
Figure 1. Standard Momentum vs Nesterov Momentum

4. Nesterov Momentum

Nesterov Momentum is an extension of the gradient descent optimization algorithm. Yurii Nesterov described the approach in his 1983 paper titled “A Method For Solving The Convex Programming Problem With Convergence Rate 𝑂(1∕𝑘²)”.

Nesterov Momentum is just like traditional momentum, except the update is performed using the partial derivative at the projected update rather than at the current variable value. It improves the convergence rate by taking future gradient information into account.

It can be understood in terms of the following four steps (a code sketch of the combined update follows the list):

1. The projected position of the solution is calculated using the change computed in the last iteration of the algorithm:

𝑦𝑡+1 = 𝑥𝑡 + 𝛽𝑣𝑡   (4)

2. The gradient at this new position is then calculated:

∇𝑡+1 = ∇𝑓(𝑦𝑡+1 )   (5)

3. Next, the change in each variable is calculated:

𝑣𝑡+1 = 𝛽𝑣𝑡 − 𝛼∇𝑓(𝑦𝑡+1 )   (6)

4. Finally, the new value for each variable is calculated using the calculated change:

𝑥𝑡+1 = 𝑥𝑡 + 𝑣𝑡+1   (7)

More generally, in the field of convex optimization, Nesterov Momentum is known to improve the rate of convergence of the optimization algorithm, reducing the number of iterations required to find the solution.
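The four steps (4)–(7) translate directly into code. As before, this is a sketch in which the objective and hyperparameter values are illustrative assumptions:

    import numpy as np

    def nesterov_momentum(grad_f, x0, alpha=0.1, beta=0.9, n_iter=100):
        """Nesterov momentum: the gradient is evaluated at the projected position."""
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(n_iter):
            y = x + beta * v          # eq. (4): projected position
            g = grad_f(y)             # eq. (5): gradient at the projected position
            v = beta * v - alpha * g  # eq. (6): change for each variable
            x = x + v                 # eq. (7): apply the change
        return x

    print(nesterov_momentum(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]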
Figure 2. Visualisation of CM and NAG

5. Convergence Results

Like classical momentum, NAG is a first-order optimization method with a better convergence rate guarantee than gradient descent in certain situations. In particular, for general smooth (non-strongly) convex functions and a deterministic gradient, NAG achieves a global convergence rate of 𝑂(1∕𝑇²) (versus the 𝑂(1∕𝑇) of gradient descent), with a constant proportional to the Lipschitz coefficient of the derivative and the squared Euclidean distance to the solution.

Just as second-order methods guarantee improved local convergence rates, Polyak (1964) showed that classical momentum can considerably accelerate convergence to a local minimum, requiring √𝜅 times fewer iterations than steepest descent to reach the same level of accuracy, where 𝜅 = 𝐿∕𝑚 is the condition number of the curvature at the minimum (𝐿 is the Lipschitz coefficient of the gradient and 𝑚 is the strong convexity constant) and 𝛽 is set to (√𝜅 − 1)∕(√𝜅 + 1).

Here, we give the details of the method for both the strongly convex and the non-strongly convex case.

5.1. The smooth and strongly convex case

Theorem 1.1. Let 𝑓 be 𝑚-strongly convex and 𝐿-smooth; then Nesterov’s accelerated gradient descent satisfies

𝑓(𝑦𝑡 ) − 𝑓(𝑥∗ ) ≤ ((𝑚 + 𝐿)∕2) ‖𝑥1 − 𝑥∗ ‖² exp(−(𝑡 − 1)∕√𝜅).

When applied to smooth and strongly convex functions, Nesterov’s Accelerated Gradient offers a similar acceleration to what is observed with Polyak’s momentum but is now applicable to a broader range of functions. It is noteworthy because it extends the acceleration guarantee to a more diverse class of functions. Recall that when 𝑓 is 𝑚-strongly convex and 𝐿-smooth, the condition number 𝜅 equals 𝐿∕𝑚. Hence, the acceleration effect becomes notably pronounced when 𝐿 is significantly larger than 𝑚 (indicating that some dimensions possess very steep gradients).

5.2. The smooth case

Theorem 1.2. Let 𝑓 be a convex and 𝐿-smooth function; then Nesterov’s accelerated gradient descent satisfies

𝑓(𝑦𝑡 ) − 𝑓(𝑥∗ ) ≤ 2𝐿 ‖𝑥1 − 𝑥∗ ‖² ∕ 𝑡².

The convergence rate 𝑂(1∕𝑡²) is optimal: no first-order algorithm can perform better than NAG in terms of convergence rate, so all first-order algorithms can be at most as good as NAG. Table 1 shows a comparison between vanilla gradient descent and Nesterov’s accelerated gradient descent.

To establish the convergence of Nesterov’s momentum algorithm for general convex functions, Nesterov relies on an estimate sequence.

Definition 1 (Estimate Sequence — Definition 2.2.1 from Nesterov’s book [3]). A pair of sequences {𝜙𝑘 (𝑥)}𝑘≥0 and {𝜆𝑘 }𝑘≥0 with 𝜆𝑘 ≥ 0 is called an Estimate Sequence of 𝑓 if the following conditions hold:
• 𝜆𝑘 → 0 when 𝑘 → ∞,
• ∀𝑥 ∈ ℝⁿ , ∀𝑘 ≥ 0: 𝜙𝑘 (𝑥) ≤ (1 − 𝜆𝑘 )𝑓(𝑥) + 𝜆𝑘 𝜙0 (𝑥).

Figure 3. Example of an estimate sequence

When 𝑘 → ∞, 𝜙𝑘 becomes a lower bound of 𝑓, since 𝜆𝑘 𝜙0 (𝑥) → 0 and (1 − 𝜆𝑘 )𝑓(𝑥) → 𝑓(𝑥). The idea is to find an estimate sequence 𝜙 that helps us minimize 𝑓. To do so, the author introduces the following result:

Theorem 2 (Lemma 2.2.1 of Nesterov’s book [3]). Assuming that the estimate sequence is such that ∀𝑘, 𝑓(𝑥𝑘 ) ≤ 𝜙𝑘∗ , where 𝜙𝑘∗ = min𝑥 𝜙𝑘 (𝑥), then:

𝑓(𝑥𝑘 ) − 𝑓(𝑥∗ ) ≤ 𝜆𝑘 [𝜙0 (𝑥∗ ) − 𝑓(𝑥∗ )]

Since 𝜙0 (𝑥∗ ) − 𝑓(𝑥∗ ) is a fixed quantity, the rate of convergence of the sub-optimality depends on how quickly 𝜆𝑘 decreases to 0. Figure 4 presents a visualization of Lemma 2.2.1.

Let us first define the first elements of the sequence:

𝜙0 (𝑥) = 𝑓(𝑥0 ) + (𝛼∕2) ‖𝑥 − 𝑥0 ‖²
𝜆0 = 1

The next elements are of the following form:

𝜙𝑘+1 (𝑥) = (1 − 𝛼𝑘 )𝜙𝑘 (𝑥) + 𝛼𝑘 (𝑓(𝑦𝑘 ) + ⟨∇𝑓(𝑦𝑘 ), 𝑥 − 𝑦𝑘 ⟩ + (𝛼∕2) ‖𝑥 − 𝑦𝑘 ‖²)
𝜆𝑘+1 = (1 − 𝛼𝑘 )𝜆𝑘

where {𝑦𝑘 }𝑘≥0 is an arbitrary sequence and {𝛼𝑘 }𝑘≥0 is a sequence such that 𝛼𝑘 ∈ (0, 1) and ∑𝑘 𝛼𝑘 = ∞.

This carefully designed sequence is then used to show how 𝜆𝑘 converges to 0 efficiently and how the minimum point of 𝜙𝑘 converges to 𝑥∗ .

Figure 4. Illustration of Lemma 2.2.1
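As a toy numerical illustration: for the admissible choice 𝛼𝑘 = 2∕(𝑘 + 3) (an assumed schedule, used here only for illustration), the recursion 𝜆𝑘+1 = (1 − 𝛼𝑘 )𝜆𝑘 telescopes to 𝜆𝑘 = 2∕((𝑘 + 1)(𝑘 + 2)), so 𝜆𝑘 decays like 1∕𝑘², matching the 𝑂(1∕𝑡²) rate:

    # lambda_{k+1} = (1 - alpha_k) * lambda_k, with lambda_0 = 1 and
    # alpha_k = 2/(k+3), which lies in (0, 1) and sums to infinity.
    lam = 1.0
    for k in range(1, 11):
        lam *= 1.0 - 2.0 / (k + 2)           # alpha_{k-1} = 2/((k-1)+3)
        closed = 2.0 / ((k + 1) * (k + 2))   # telescoped closed form
        print(f"k={k}: lambda_k={lam:.5f}, 2/((k+1)(k+2))={closed:.5f}")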

Table 1. Convergence rates for Gradient Descent (GD) and Nesterov’s Accelerated Gradient (NAG)

Class of Function            GD                NAG
Smooth                       𝑂(1∕𝑇)            𝑂(1∕𝑇²)
Smooth & Strongly-Convex     𝑂(exp(−𝑇∕𝜅))      𝑂(exp(−𝑇∕√𝜅))
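The smooth rows of Table 1 can be checked empirically with a short sketch; the ill-conditioned quadratic, hyperparameters, and iteration counts below are assumptions chosen for illustration:

    import numpy as np

    d = np.array([100.0, 1.0])               # f(x) = 0.5 * sum(d * x**2), kappa = 100
    f = lambda x: 0.5 * np.sum(d * x**2)     # minimum value is 0, attained at x = 0
    grad = lambda x: d * x
    alpha, beta = 1.0 / d.max(), 0.9         # step size 1/L; beta = 0.9 is illustrative

    x_gd = np.array([1.0, 1.0])
    x_nag = np.array([1.0, 1.0])
    v = np.zeros(2)
    for t in range(1, 201):
        x_gd = x_gd - alpha * grad(x_gd)     # vanilla gradient descent
        y = x_nag + beta * v                 # Nesterov momentum, steps (4)-(7)
        v = beta * v - alpha * grad(y)
        x_nag = x_nag + v
        if t in (10, 50, 200):
            print(t, f(x_gd), f(x_nag))      # NAG's suboptimality gap shrinks much faster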

6. Limitations of Nesterov’s Momentum Method

Nesterov’s accelerated gradient may not perform significantly better than standard momentum in all cases. The algorithm requires tuning the hyperparameters 𝜇 (momentum constant) and 𝜖 (learning rate or step size). The high-dimensional, non-convex nature of the loss function may lead to different sensitivities along different dimensions, which constant values of these hyperparameters cannot adapt to. However, due to its simplicity and close resemblance to standard momentum, it is very easy to implement and may significantly boost performance.

References

[1] Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton (2013). On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA. JMLR: W&CP volume 28.

[2] Ilya Sutskever (2013). Training Recurrent Neural Networks. PhD thesis, University of Toronto.

[3] Yurii Nesterov (1998). Introductory Lectures on Convex Programming, Volume I: Basic Course.
