2. The gradient for this new position is then calculated:
$\nabla_{t+1} = \nabla f(y_{t+1})$   (5)
3. Next, the change in each variable is calculated:
$v_{t+1} = \beta\, v_t - \alpha\, \nabla f(y_{t+1})$   (6)
4. Finally, the new value for each variable is calculated using the calculated change:
$x_{t+1} = x_t + v_{t+1}$   (7)
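As a concrete illustration of updates (5)–(7), here is a minimal Python sketch. It assumes the look-ahead point from step 1 is 𝑦𝑡+1 = 𝑥𝑡 + 𝛽𝑣𝑡 (the usual Nesterov formulation); the quadratic test objective and the hyper-parameter values are illustrative choices, not part of the method.

import numpy as np

def nag_step(grad, x, v, alpha=0.01, beta=0.9):
    """One Nesterov momentum update following Eqs. (5)-(7).

    grad  : callable returning the gradient of the objective
    x, v  : current position and velocity
    alpha : step size, beta : momentum coefficient (illustrative values)
    """
    y = x + beta * v                  # assumed look-ahead point from step 1
    g = grad(y)                       # Eq. (5): gradient at the look-ahead point
    v_new = beta * v - alpha * g      # Eq. (6): velocity update
    x_new = x + v_new                 # Eq. (7): position update
    return x_new, v_new

# Toy run on f(x) = 0.5 * ||x||^2, whose gradient is simply x (purely illustrative).
x, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    x, v = nag_step(lambda z: z, x, v)
print(x)   # close to the minimizer at the origin

Evaluating the gradient at the look-ahead point 𝑦 rather than at 𝑥 is exactly what distinguishes this update from classical (Polyak) momentum.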
In the field of convex optimization, more generally, Nesterov Momentum is known to improve the rate of convergence of the optimization procedure.
5. Convergence Results

Like classical momentum, NAG is a first-order optimization method with a better convergence rate guarantee than gradient descent in certain situations. In particular, for general smooth (non-strongly) convex functions and a deterministic gradient, NAG achieves a global convergence rate of 𝑂(1∕𝑇²) (versus the 𝑂(1∕𝑇) of gradient descent), with constant proportional to the Lipschitz coefficient of the derivative and the squared Euclidean distance to the solution.
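As a quick back-of-the-envelope illustration of this gap (constants ignored): after 𝑇 = 10³ iterations the 𝑂(1∕𝑇²) bound is already of order 10⁻⁶, while the 𝑂(1∕𝑇) bound for gradient descent is only of order 10⁻³; gradient descent would need on the order of 10⁶ iterations to match that guarantee.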
Just as second-order methods guarantee improved local convergence rates, Polyak (1964) showed that classical momentum can considerably accelerate convergence to a local minimum, requiring √𝜅 times fewer iterations than steepest descent to reach the same level of accuracy, where 𝜅 is the condition number of the curvature at the minimum (𝜅 = 𝐿∕𝑚, where 𝐿 is the Lipschitz coefficient and 𝑚 is the strong-convexity constant) and 𝛽 is set to (√𝜅 − 1)∕(√𝜅 + 1).
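As a small numerical illustration of these quantities (the values of 𝐿 and 𝑚 below are made up for the example; the snippet only sketches the arithmetic, it is not part of the method):

import math

L, m = 100.0, 1.0                         # illustrative Lipschitz and strong-convexity constants
kappa = L / m                             # condition number
beta = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)
print(kappa, math.sqrt(kappa), beta)      # 100.0, 10.0, ~0.818

With 𝜅 = 100 the heuristic above promises roughly √𝜅 = 10 times fewer iterations than steepest descent, with momentum coefficient 𝛽 ≈ 0.82.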
Here, we give the details of the method both for the strongly convex and non-strongly convex case.

5.1. The smooth and strongly convex case

Theorem 1.1. Let 𝑓 be 𝑚-strongly convex and 𝐿-smooth, then Nesterov's accelerated gradient descent satisfies
$f(y_t) - f(x^*) \le \frac{m+L}{2}\,\|x_1 - x^*\|^2 \exp\!\left(-\frac{t-1}{\sqrt{\kappa}}\right).$
When applied to smooth and strongly convex functions, Nesterov's Accelerated Gradient offers a similar acceleration to what is observed with Polyak's momentum but is now applicable to a broader range of functions. It's noteworthy because it extends the acceleration guarantee to a more diverse class of functions. Remember, in the scenario where 𝑓 is 𝑚-strongly convex and 𝐿-smooth, the condition number 𝜅 equals 𝐿∕𝑚. Hence, the acceleration effect becomes notably pronounced when 𝐿 is significantly larger than 𝑚 (indicating that some dimensions possess very steep gradients).
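As a rough, purely illustrative reading of Theorem 1.1 (ignoring the constant factor (𝑚 + 𝐿)∕2 ‖𝑥1 − 𝑥∗‖²): the bound drops below a target accuracy 𝜖 once exp(−(𝑡 − 1)∕√𝜅) ≤ 𝜖, i.e. after roughly 𝑡 ≈ 1 + √𝜅 ln(1∕𝜖) iterations. For 𝜅 = 10⁴ and 𝜖 = 10⁻⁶ this is on the order of 10³ iterations, whereas the corresponding gradient-descent bound scales with 𝜅 rather than √𝜅 and would require on the order of 10⁵ iterations.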
5.2. The smooth case

Theorem 1.2. Let 𝑓 be a convex and 𝐿-smooth function, then Nesterov's accelerated gradient descent satisfies
$f(y_t) - f(x^*) \le \frac{2L\,\|x_1 - x^*\|^2}{t^2}.$

Figure 3. Example of an estimate sequence

When 𝑘 → ∞, 𝜙𝑘 becomes a lower bound of 𝑓, since 𝜆𝑘 𝜙0 (𝑥) → 0 and (1 − 𝜆𝑘 ) 𝑓(𝑥) → 𝑓(𝑥). The idea is to find an estimate sequence 𝜙 to help us minimize 𝑓. To do so, the author introduces the following result:

Theorem 2 (Lemma 2.2.1 of Nesterov's book [3]). Assuming that the estimate sequence is such that ∀𝑘, 𝑓(𝑥𝑘 ) ≤ 𝜙𝑘∗ , where 𝜙𝑘∗ = min𝑥 𝜙𝑘 (𝑥), then:
$f(x_k) - f(x^*) \le \lambda_k \left[\phi_0(x^*) - f(x^*)\right].$

Since 𝜙0 (𝑥∗ ) − 𝑓(𝑥∗ ) is a fixed quantity, the rate of convergence for the sub-optimality will depend on how quickly 𝜆𝑘 decreases to 0. Figure 4 presents a visualization of Lemma 2.2.1.
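To make this dependence on 𝜆𝑘 concrete, here is a minimal numerical sketch. It assumes, purely for illustration, the schedule 𝛼𝑘 = 2∕(𝑘 + 3) in the recursion 𝜆𝑘+1 = (1 − 𝛼𝑘 )𝜆𝑘 introduced just below (this schedule is our own example, not one taken from the text, although it does satisfy 𝛼𝑘 ∈ (0, 1) and ∑𝛼𝑘 = ∞). Under it, 𝜆𝑘 = 2∕((𝑘 + 1)(𝑘 + 2)), so by Theorem 2 the sub-optimality is bounded at an 𝑂(1∕𝑘²) rate, consistent with Theorem 1.2.

# Decay of lambda_k under lambda_{k+1} = (1 - alpha_k) * lambda_k,
# with the assumed (illustrative) schedule alpha_k = 2 / (k + 3).
lam = 1.0                                    # lambda_0 = 1
for k in range(10):
    lam *= 1.0 - 2.0 / (k + 3)               # lambda_{k+1}
    closed_form = 2.0 / ((k + 2) * (k + 3))  # closed form 2/((j+1)(j+2)) with j = k+1
    print(k + 1, lam, closed_form)           # the two columns coincide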
Let us first define our first elements in the sequence:
$\phi_0(x) = f(x_0) + \frac{\alpha}{2}\,\|x - x_0\|^2$
$\lambda_0 = 1$
The next elements are of the following form:
$\phi_{k+1}(x) = (1 - \alpha_k)\,\phi_k(x) + \alpha_k \left( f(y_k) + \langle \nabla f(y_k),\, x - y_k \rangle + \frac{\alpha}{2}\,\|x - y_k\|^2 \right)$
$\lambda_{k+1} = (1 - \alpha_k)\,\lambda_k$
where $\{y_k\}_{k=0}^{\infty}$ is an arbitrary sequence, and $\{\alpha_k\}_{k=0}^{\infty}$ is a sequence such that $\alpha_k \in (0, 1)$ and $\sum_{k=0}^{\infty} \alpha_k = \infty$. This carefully designed
Table 1. Convergence rate for Gradient Descent (GD) and Nesterov's Accelerated Gradient (NAG)
References
[1] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2013). On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA. JMLR: W&CP volume 28.