
Nesterov’s momentum method

Sanjay J. Alex (2023EEY7517), Shubham Baranwal (2023EEY7520), Anurag Rai (2023EEZ7504)

MCL-758 Optimization

Abstract—Gradient descent is essential for minimizing objective functions but faces limitations such as slow progress in flat regions and sensitivity to noisy gradients. The introduction of momentum helps overcome these challenges by providing a more consistent direction for the search. The Nesterov momentum method further refines this approach by anticipating future gradients, offering a more efficient trajectory towards the optimum. This report highlights how Nesterov momentum improves the reliability and efficiency of finding optimal solutions with gradient descent algorithms.

Keywords—Gradient descent, Momentum, Nesterov’s Momentum.
1. Introduction

Gradient descent is an optimization algorithm that follows an objective function’s negative gradient to locate the function’s minimum.

A limitation of gradient descent is that it can get stuck in flat areas or bounce around if the objective function returns noisy gradients. Momentum is an approach that accelerates the progress of the search to skim across flat areas and smooth out bouncy gradients.

In some cases, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. Nesterov momentum is an extension of momentum that involves calculating the decaying moving average of the gradients of projected positions in the search space rather than the actual positions themselves.

This has the effect of harnessing the accelerating benefits of momentum whilst allowing the search to slow down when approaching the optima, reducing the likelihood of missing or overshooting them.

2. Gradient Descent

The gradient descent algorithm requires a target function to be optimized and the derivative function of that objective. The target function 𝑓() returns a score for a given set of inputs, and the derivative function 𝑓′() gives the derivative of the target function for a given set of inputs. The algorithm also requires a starting point 𝑥 in the problem, such as a randomly selected point in the input space.

The derivative is then calculated, and a step is taken in the input space. This is expected to result in a downhill movement in the target function, assuming we are minimizing it.

A downhill movement is made by first calculating how far to move in the input space: the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, i.e. down the target function:

𝑥(𝑡 + 1) = 𝑥(𝑡) − 𝛼 × 𝑓′(𝑥(𝑡))   (1)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step is scaled using a step size hyperparameter.

Step Size (𝛼): Hyperparameter that controls how far to move in the search space against the gradient in each iteration of the algorithm. If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
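As a minimal illustrative sketch, update rule (1) can be implemented as follows; the quadratic objective, starting point, and step size here are assumed purely for illustration:

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.1, n_iter=100):
        """Follow the negative gradient: x(t+1) = x(t) - alpha * f'(x(t))."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = x - alpha * grad_f(x)  # step against the gradient, eq. (1)
        return x

    # Illustrative run on f(x) = ||x||^2, whose gradient is 2x and whose minimum is at 0.
    print(gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]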
3. Momentum

Momentum is an extension of the gradient descent algorithm designed to speed up optimization. It adds history to the parameter update equation based on the gradients encountered during previous updates, resulting in more efficient optimization. The approach is based on the metaphor of momentum from physics, where acceleration in a direction can be accumulated from past updates.

Momentum involves adding a hyperparameter that controls the amount of history (momentum) to include in the update equation, i.e. the step to a new point in the search space. The value of the hyperparameter is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0 is the same as gradient descent without momentum. The update rule for the Momentum method is given by:

𝑣𝑡+1 = 𝛽𝑣𝑡 + (1 − 𝛽)∇𝑓(𝑥𝑡 )   (2)
𝑥𝑡+1 = 𝑥𝑡 − 𝛼𝑣𝑡+1   (3)

where:
• 𝑥𝑡 is the parameter vector at iteration 𝑡,
• ∇𝑓(𝑥𝑡 ) is the gradient of the objective function with respect to 𝑥𝑡 ,
• 𝑣𝑡 is the velocity vector at iteration 𝑡,
• 𝛼 is the learning rate, and
• 𝛽 is the momentum coefficient, typically chosen between 0 and 1.

The momentum term 𝑣𝑡 can be interpreted as an exponentially weighted average of past gradients.
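A minimal sketch of updates (2)–(3), mirroring the gradient descent sketch above; 𝛽 = 0.9 and the quadratic objective are again illustrative assumptions:

    import numpy as np

    def momentum_descent(grad_f, x0, alpha=0.1, beta=0.9, n_iter=100):
        """Gradient descent with momentum, eqs. (2)-(3)."""
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)                       # velocity starts at zero
        for _ in range(n_iter):
            v = beta * v + (1 - beta) * grad_f(x)  # eq. (2): exponentially weighted gradients
            x = x - alpha * v                      # eq. (3): step along the velocity
        return x

    print(momentum_descent(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]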
Figure 1. Standard Momentum vs Nesterov Momentum

4. Nesterov Momentum

Nesterov Momentum is an extension of the gradient descent optimization algorithm. Yurii Nesterov described the approach in his 1983 paper titled “A Method For Solving The Convex Programming Problem With Convergence Rate 𝑂(1∕𝑘²)”.

Nesterov Momentum is just like traditional momentum, except the update is performed using the partial derivative at the projected update rather than at the current variable value. It improves the convergence rate by taking future gradient information into account.

It can be understood in terms of the following four steps (a code sketch of the combined update follows the list):

1. The projected position of the solution is calculated using the change computed in the last iteration of the algorithm:

𝑦𝑡+1 = 𝑥𝑡 + 𝛽𝑣𝑡   (4)

2. The gradient at this new position is then calculated:

∇𝑡+1 = ∇𝑓(𝑦𝑡+1 )   (5)

3. Next, the change in each variable is calculated:

𝑣𝑡+1 = 𝛽𝑣𝑡 − 𝛼∇𝑓(𝑦𝑡+1 )   (6)

4. Finally, the new value for each variable is calculated using the calculated change:

𝑥𝑡+1 = 𝑥𝑡 + 𝑣𝑡+1   (7)

More generally, in the field of convex optimization, Nesterov Momentum is known to improve the rate of convergence of the optimization algorithm, reducing the number of iterations required to find the solution.
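The four steps (4)–(7) translate directly into code. As before, this is a sketch in which the objective and hyperparameter values are illustrative assumptions:

    import numpy as np

    def nesterov_momentum(grad_f, x0, alpha=0.1, beta=0.9, n_iter=100):
        """Nesterov momentum: the gradient is evaluated at the projected position."""
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(n_iter):
            y = x + beta * v          # eq. (4): projected position
            g = grad_f(y)             # eq. (5): gradient at the projected position
            v = beta * v - alpha * g  # eq. (6): change for each variable
            x = x + v                 # eq. (7): apply the change
        return x

    print(nesterov_momentum(lambda x: 2 * x, x0=[3.0, -2.0]))  # approaches [0, 0]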
Figure 2. Visualisation of CM and NAG

5. Convergence Results

Like classical momentum, NAG is a first-order optimization method with a better convergence rate guarantee than gradient descent in certain situations. In particular, for general smooth (non-strongly) convex functions and a deterministic gradient, NAG achieves a global convergence rate of 𝑂(1∕𝑇²) (versus the 𝑂(1∕𝑇) of gradient descent), with a constant proportional to the Lipschitz coefficient of the derivative and the squared Euclidean distance to the solution.

Just as second-order methods guarantee improved local convergence rates, Polyak (1964) showed that classical momentum can considerably accelerate convergence to a local minimum, requiring √𝜅 times fewer iterations than steepest descent to reach the same level of accuracy, where 𝜅 = 𝐿∕𝑚 is the condition number of the curvature at the minimum (𝐿 is the Lipschitz coefficient of the gradient and 𝑚 is the strong convexity constant) and 𝛽 is set to (√𝜅 − 1)∕(√𝜅 + 1).

Here, we give the details of the method for both the strongly convex and the non-strongly convex case.

5.1. The smooth and strongly convex case

Theorem 1.1. Let 𝑓 be 𝑚-strongly convex and 𝐿-smooth; then Nesterov’s accelerated gradient descent satisfies

𝑓(𝑦𝑡 ) − 𝑓(𝑥∗ ) ≤ ((𝑚 + 𝐿)∕2) ‖𝑥1 − 𝑥∗ ‖² exp(−(𝑡 − 1)∕√𝜅).

When applied to smooth and strongly convex functions, Nesterov’s Accelerated Gradient offers a similar acceleration to what is observed with Polyak’s momentum but is now applicable to a broader range of functions. It is noteworthy because it extends the acceleration guarantee to a more diverse class of functions. Recall that when 𝑓 is 𝑚-strongly convex and 𝐿-smooth, the condition number 𝜅 equals 𝐿∕𝑚. Hence, the acceleration effect becomes notably pronounced when 𝐿 is significantly larger than 𝑚 (indicating that some dimensions possess very steep gradients).

5.2. The smooth case

Theorem 1.2. Let 𝑓 be a convex and 𝐿-smooth function; then Nesterov’s accelerated gradient descent satisfies

𝑓(𝑦𝑡 ) − 𝑓(𝑥∗ ) ≤ 2𝐿 ‖𝑥1 − 𝑥∗ ‖² ∕ 𝑡².

The convergence rate 𝑂(1∕𝑡²) is optimal: no first-order algorithm can perform better than NAG in terms of convergence rate, so all first-order algorithms can be at most as good as NAG. Table 1 shows a comparison between vanilla gradient descent and Nesterov’s accelerated gradient descent.

To establish the convergence of Nesterov’s momentum algorithm for general convex functions, Nesterov relies on an estimate sequence.

Definition 1 (Estimate Sequence — Definition 2.2.1 from Nesterov’s book [3]). A pair of sequences {𝜙𝑘 (𝑥)}𝑘≥0 and {𝜆𝑘 }𝑘≥0 with 𝜆𝑘 ≥ 0 is called an Estimate Sequence of 𝑓 if the following conditions hold:
• 𝜆𝑘 → 0 when 𝑘 → ∞,
• ∀𝑥 ∈ ℝⁿ , ∀𝑘 ≥ 0: 𝜙𝑘 (𝑥) ≤ (1 − 𝜆𝑘 )𝑓(𝑥) + 𝜆𝑘 𝜙0 (𝑥).

Figure 3. Example of an estimate sequence

When 𝑘 → ∞, 𝜙𝑘 becomes a lower bound of 𝑓, since 𝜆𝑘 𝜙0 (𝑥) → 0 and (1 − 𝜆𝑘 )𝑓(𝑥) → 𝑓(𝑥). The idea is to find an estimate sequence 𝜙 that helps us minimize 𝑓. To do so, the author introduces the following result:

Theorem 2 (Lemma 2.2.1 of Nesterov’s book [3]). Assuming that the estimate sequence is such that ∀𝑘, 𝑓(𝑥𝑘 ) ≤ 𝜙𝑘∗ , where 𝜙𝑘∗ = min𝑥 𝜙𝑘 (𝑥), then:

𝑓(𝑥𝑘 ) − 𝑓(𝑥∗ ) ≤ 𝜆𝑘 [𝜙0 (𝑥∗ ) − 𝑓(𝑥∗ )]

Since 𝜙0 (𝑥∗ ) − 𝑓(𝑥∗ ) is a fixed quantity, the rate of convergence of the sub-optimality depends on how quickly 𝜆𝑘 decreases to 0. Figure 4 presents a visualization of Lemma 2.2.1.

Let us first define the first elements of the sequence:

𝜙0 (𝑥) = 𝑓(𝑥0 ) + (𝛼∕2) ‖𝑥 − 𝑥0 ‖²
𝜆0 = 1

The next elements are of the following form:

𝜙𝑘+1 (𝑥) = (1 − 𝛼𝑘 )𝜙𝑘 (𝑥) + 𝛼𝑘 (𝑓(𝑦𝑘 ) + ⟨∇𝑓(𝑦𝑘 ), 𝑥 − 𝑦𝑘 ⟩ + (𝛼∕2) ‖𝑥 − 𝑦𝑘 ‖²)
𝜆𝑘+1 = (1 − 𝛼𝑘 )𝜆𝑘

where {𝑦𝑘 }𝑘≥0 is an arbitrary sequence and {𝛼𝑘 }𝑘≥0 is a sequence such that 𝛼𝑘 ∈ (0, 1) and ∑𝑘 𝛼𝑘 = ∞.

This carefully designed sequence is then used to show how 𝜆𝑘 converges to 0 efficiently and how the minimum point of 𝜙𝑘 converges to 𝑥∗ .

Figure 4. Illustration of Lemma 2.2.1
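As a toy numerical illustration: for the admissible choice 𝛼𝑘 = 2∕(𝑘 + 3) (an assumed schedule, used here only for illustration), the recursion 𝜆𝑘+1 = (1 − 𝛼𝑘 )𝜆𝑘 telescopes to 𝜆𝑘 = 2∕((𝑘 + 1)(𝑘 + 2)), so 𝜆𝑘 decays like 1∕𝑘², matching the 𝑂(1∕𝑡²) rate:

    # lambda_{k+1} = (1 - alpha_k) * lambda_k, with lambda_0 = 1 and
    # alpha_k = 2/(k+3), which lies in (0, 1) and sums to infinity.
    lam = 1.0
    for k in range(1, 11):
        lam *= 1.0 - 2.0 / (k + 2)           # alpha_{k-1} = 2/((k-1)+3)
        closed = 2.0 / ((k + 1) * (k + 2))   # telescoped closed form
        print(f"k={k}: lambda_k={lam:.5f}, 2/((k+1)(k+2))={closed:.5f}")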

Table 1. Convergence rates for Gradient Descent (GD) and Nesterov’s Accelerated Gradient (NAG)

Class of Function            GD                NAG
Smooth                       𝑂(1∕𝑇)            𝑂(1∕𝑇²)
Smooth & Strongly-Convex     𝑂(exp(−𝑇∕𝜅))      𝑂(exp(−𝑇∕√𝜅))
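The smooth rows of Table 1 can be checked empirically with a short sketch; the ill-conditioned quadratic, hyperparameters, and iteration counts below are assumptions chosen for illustration:

    import numpy as np

    d = np.array([100.0, 1.0])               # f(x) = 0.5 * sum(d * x**2), kappa = 100
    f = lambda x: 0.5 * np.sum(d * x**2)     # minimum value is 0, attained at x = 0
    grad = lambda x: d * x
    alpha, beta = 1.0 / d.max(), 0.9         # step size 1/L; beta = 0.9 is illustrative

    x_gd = np.array([1.0, 1.0])
    x_nag = np.array([1.0, 1.0])
    v = np.zeros(2)
    for t in range(1, 201):
        x_gd = x_gd - alpha * grad(x_gd)     # vanilla gradient descent
        y = x_nag + beta * v                 # Nesterov momentum, steps (4)-(7)
        v = beta * v - alpha * grad(y)
        x_nag = x_nag + v
        if t in (10, 50, 200):
            print(t, f(x_gd), f(x_nag))      # NAG's suboptimality gap shrinks much faster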

6. Limitations of Nesterov’s Momentum Method

Nesterov’s accelerated gradient may not perform significantly better than standard momentum in all cases. The algorithm requires tuning the hyperparameters 𝜇 (momentum constant) and 𝜖 (learning rate or step size). The high-dimensional, non-convex nature of the loss function may lead to different sensitivities along different dimensions, which constant values of these hyperparameters cannot adapt to. However, due to its simplicity and close resemblance to standard momentum, it is very easy to implement and may significantly boost performance.

References

[1] Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton (2013). On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA. JMLR: W&CP volume 28.

[2] Ilya Sutskever (2013). Training Recurrent Neural Networks. PhD thesis, University of Toronto.

[3] Yurii Nesterov (1998). Introductory Lectures on Convex Programming, Volume I: Basic Course.
