Lecture 5
Spring 2023
• Steepest descent algorithm weaknesses
• Addressing steepest descent weaknesses
• Step size
• Slow convergence
• Zig-zagging
• Momentum-based approaches
• Michel Bierlaire, Optimization: Principles and Algorithms, 2nd Edition, EPFL Press, 2018.
• Mykel J. Kochenderfer, Tim A. Wheeler, Algorithms for Optimization, MIT Press, 2019.
• Jeremy Watt, Reza Borhani, Aggelos K. Katsaggelos, Machine Learning Refined: Foundations, Algorithms, and Applications, 2nd Edition, Cambridge University Press, 2020.
Topic: Gradient descent and its variants
Source: Jeremy Watt, Reza Borhani, Aggelos K. Katsaggelos, Machine Learning Refined: Foundations, Algorithms, and Applications, 2nd Edition, Cambridge University Press, 2020, Ch. 3, Appendix A.
• The steepest descent algorithm generally converges slowly
towards an extremum.
Gradient Descent Weaknesses
• Finding a good step size may take some trial and error for the specific function. It is problem dependent.
• The step size controls whether the algorithm converges to a minimum quickly or slowly. At each step of the algorithm we always move in a descent direction, but how far we move in this direction is controlled by the step size. If the step size is set too small, we descend too slowly; if it is set too large, we may even ascend.
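As a rough illustration (not from the lecture), the sketch below runs gradient descent on the simple quadratic f(x) = x², whose gradient is 2x; the function, starting point, and step sizes are assumed purely for demonstration.

```python
# Minimal sketch (assumed example): the effect of the step size on gradient
# descent for f(x) = x^2, with gradient f'(x) = 2x.

def gradient_descent(alpha, x0=5.0, iterations=20):
    x = x0
    for _ in range(iterations):
        x = x - alpha * 2 * x      # move in the negative gradient direction
    return x

for alpha in (0.01, 0.1, 1.1):     # too small, reasonable, too large
    print(alpha, gradient_descent(alpha))
# alpha = 0.01 creeps toward the minimum, alpha = 0.1 converges quickly,
# and alpha = 1.1 diverges: the iterates "ascend" in function value.
```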
Crawling near stationary points
• The magnitude of the negative gradient can vanish rapidly near stationary points, causing gradient descent to crawl slowly near minima and saddle points and slowing its progress there.
Addressing Steepest Descent Weaknesses
One remedy is to normalize the gradient, so that the length of each step no longer depends on the gradient's magnitude. For example, if
∇f = (0.3, 0.1), then
∇f / ‖∇f‖₂ = (0.3, 0.1) / √(0.3² + 0.1²) = (0.948, 0.316)
Addressing Steepest Descent Weaknesses: Less Zigzagging
For example, if
∇f = (0.3, 10), then
∇f / ‖∇f‖₂ = (0.3, 10) / √(0.3² + 10²) = (0.029, 0.99)
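The sketch below illustrates this normalized (unit-length) descent step in Python with NumPy; the names grad_f and alpha are illustrative, and the two gradients reproduced are the examples above.

```python
import numpy as np

# Minimal sketch (illustration): a steepest descent step that normalizes the
# gradient, so every step has length alpha regardless of how large or small
# the raw gradient is.

def normalized_descent_step(x, grad_f, alpha=0.1):
    g = grad_f(x)
    return x - alpha * g / np.linalg.norm(g)   # unit-length descent direction

# The two gradients from the slides:
for g in (np.array([0.3, 0.1]), np.array([0.3, 10.0])):
    print(g / np.linalg.norm(g))   # approximately (0.95, 0.32) and (0.03, 1.00)
```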
Steepest Descent and Learning Rate
• Finding a good step size may take some trial and error for the specific function. It is problem dependent.
• The step size controls whether the algorithm converges to a minimum quickly or slowly. At each step of the algorithm we always move in a descent direction, but how far we move in this direction is controlled by the step size. If the step size is set too small, we descend too slowly; if it is set too large, we may even ascend.
Steepest Descent with an Adjustable Step Length
• Instead of using a fixed step length rule, it is also possible to change the value of α from one step to another, with what is often referred to as an adjustable step length rule.
Steepest Descent with Optimized Step Size
• Some algorithms attempt to optimize the step size at each iteration so that the step maximally decreases the objective function.
• At iteration k, determine the steepest descent direction d(k) = −∇f(x(k)) and then choose the step size α(k) that minimizes f along this direction.
Use the steepest descent method with optimized step size to minimize the following function:

f(x, y) = (1/2)x² + (9/2)y²

Gradient: ∇f(x, y) = (x, 9y)

The update x(k+1) = x(k) − α∇f(x(k)) gives, as a function of the step size,

h(α) = f(x(k+1)) = (1/2)(x(k))²(1 − α)² + (9/2)(y(k))²(1 − 9α)²

A univariate function: we need to solve min over α of h(α).

dh/dα = −(x(k))²(1 − α) − 81(y(k))²(1 − 9α) = 0

α* = argmin h(α) = ((x(k))² + 81(y(k))²) / ((x(k))² + 729(y(k))²)
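A short Python sketch of this example (illustrative code, not from the slides), applying the derived optimal step size at each iteration starting from x(1) = (9, 3):

```python
import numpy as np

# Sketch of the worked example: steepest descent with the optimized step size
# alpha* = (x^2 + 81 y^2) / (x^2 + 729 y^2) for f(x, y) = x^2/2 + 9 y^2/2.

def f(p):
    x, y = p
    return 0.5 * x**2 + 4.5 * y**2

def grad_f(p):
    x, y = p
    return np.array([x, 9.0 * y])

p = np.array([9.0, 3.0])      # starting point x(1) = (9, 3) used in the slides
for k in range(10):
    x, y = p
    alpha = (x**2 + 81 * y**2) / (x**2 + 729 * y**2)   # optimized step size
    p = p - alpha * grad_f(p)
    print(k + 1, p, f(p))     # f decreases at every iteration
```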
Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
[Figure: steepest descent iterates from x(1) = (9, 3), fixed α = 0.2 versus optimized α.]
Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
[Figure: f(x, y) versus iterations from x(1) = (9, 3) with optimized α.]
Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
Gradient descent with exact line search method
The procedure starts with a point x(1) and involves the following steps:
….
Determine the step size or learning rate α(k) that minimizes h(α).
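As an aside (not code from the slides), an exact line search can be approximated numerically with a one-dimensional minimizer; the sketch below assumes SciPy's minimize_scalar and reuses the earlier quadratic purely as an example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch (illustration): one steepest descent step with an exact line
# search, minimizing h(alpha) = f(x - alpha * grad f(x)) numerically.

def exact_line_search_step(f, grad_f, x):
    g = grad_f(x)
    h = lambda alpha: f(x - alpha * g)     # univariate function of the step size
    alpha_star = minimize_scalar(h).x      # alpha that (approximately) minimizes h
    return x - alpha_star * g

# Example on the quadratic from the worked example (assumed reuse):
f = lambda p: 0.5 * p[0]**2 + 4.5 * p[1]**2
grad_f = lambda p: np.array([p[0], 9.0 * p[1]])
print(exact_line_search_step(f, grad_f, np.array([9.0, 3.0])))
```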
Gradient descent with inexact line search
• A trial and error approach is considered, where various step sizes are tested and the first one that is suitable is accepted.
• Inexact line search is a strategy for efficiently picking a good value of the step size, which results in quicker convergence.
Gradient descent with inexact line search
The procedure starts with a point x(0) and involves the following steps:
1. Check whether x(k) satisfies the termination conditions. If it does, terminate.
2. Determine the steepest descent direction d(k) = −∇f(x(k)).
3. a) Calculate the initial loss f(x(k)) and initialize α(k) to a large value.
   b) Calculate f(x(k) − α(k)∇f(x(k))).
   c) If f(x(k) − α(k)∇f(x(k))) is less than f(x(k)), then this value of α(k) is accepted; otherwise, decrease α(k) and return to step (b).
Wolfe conditions provide an upper and lower bound on the admissible step length values.
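The sketch below is one way to implement step 3 above as a backtracking-style search; the starting value α₀ and the shrink factor 0.5 are assumptions, not values from the slides.

```python
import numpy as np

# Minimal sketch of the inexact (backtracking-style) line search described above.
# alpha0 and the shrink factor are illustrative choices.

def inexact_line_search_step(f, grad_f, x, alpha0=1.0, shrink=0.5, max_tries=50):
    g = grad_f(x)
    loss = f(x)                       # 3a) initial loss; alpha starts large
    alpha = alpha0
    for _ in range(max_tries):
        candidate = x - alpha * g     # 3b) trial point
        if f(candidate) < loss:       # 3c) first suitable step size is accepted
            return candidate
        alpha *= shrink               # otherwise decrease alpha and try again
    return x                          # give up if no acceptable step is found

# Example on the same quadratic (assumed): one step from (9, 3).
f = lambda p: 0.5 * p[0]**2 + 4.5 * p[1]**2
grad_f = lambda p: np.array([p[0], 9.0 * p[1]])
print(inexact_line_search_step(f, grad_f, np.array([9.0, 3.0])))
```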
The zig-zagging behavior of gradient descent
[Figure: iterates of gradient descent with inexact line search zig-zagging toward the minimum.]
Momentum Accelerated Steepest Descent
The gradient descent with momentum algorithm (or momentum for short) borrows an idea from physics. Imagine rolling a ball down the inside of a frictionless bowl: instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth. The ball naturally gathers momentum as gravity causes it to accelerate, just as the gradient causes momentum to accumulate in this descent method.
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
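A standard formulation of the momentum update, given here as an illustrative sketch (the slides' exact notation may differ), keeps a running velocity that blends the previous step with the current negative gradient. The objective below is the earlier quadratic, assumed only for demonstration, together with the example parameters α = 0.1, x(0) = (10, 1), and β ∈ {0, 0.2} that appear later in the lecture.

```python
import numpy as np

# Minimal sketch of gradient descent with momentum (one standard variant).
# beta controls how much of the previous step carries over; beta = 0 recovers
# plain steepest descent with a fixed step size.

def momentum_descent(grad_f, x0, alpha=0.1, beta=0.2, iterations=30):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # accumulated "velocity"
    for _ in range(iterations):
        v = beta * v - alpha * grad_f(x)      # momentum update
        x = x + v
    return x

grad_f = lambda p: np.array([p[0], 9.0 * p[1]])   # assumed: the earlier quadratic
print(momentum_descent(grad_f, [10.0, 1.0], beta=0.2))
print(momentum_descent(grad_f, [10.0, 1.0], beta=0.0))
```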
Momentum Accelerated Steepest Descent
Momentum helps solve the issue of slow convergence.
[Figure in the (ω1, ω2) plane illustrating the effect of momentum on convergence.]
https://medium.com/analytics-vidhya/momentum-a-simple-yet-efficient-optimizing-technique-ef76834e4423
Momentum Accelerated Steepest Descent
• Gradient descent takes a long time to traverse a nearly flat surface, because such surfaces have gradients with small magnitudes and thus require many iterations to cross.
The procedure involves the following steps:
1. Check whether x(k) satisfies the termination conditions. If it does, terminate; ….
Momentum Accelerated Steepest Descent: Example
α = 0.1, x(0) = (10, 1)
[Figure: iterates compared for β = 0 (no momentum) and β = 0.2.]
Momentum Accelerated Steepest Descent: Weakness
• One issue of momentum is that the steps do not slow down enough at the bottom of a valley and tend to overshoot the valley floor.