
AIS323

Spring 2023

Lecture 5
• Steepest descent algorithm weaknesses
• Addressing steepest descent weaknesses
• Step size
• Slow convergence
• Zig-zagging
• Momentum-based approaches

• Michel Bierlaire, Optimization: Principles and Algorithms, 2nd edition, EPFL Press, 2018.
• Mykel J. Kochenderfer, Tim A. Wheeler, Algorithms for Optimization, MIT Press, 2019.
• Jeremy Watt, Reza Borhani, Aggelos K. Katsaggelos, Machine Learning Refined: Foundations, Algorithms, and Applications, 2nd Edition, Cambridge University Press, 2020.
Topic: Problem formulation; exact solution of univariate functions and stationary points
Source: Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018, Ch. 1

Topic: Contour plots; first order and second order derivatives of multivariable functions; exact solution of multivariate functions and stationary points
Source: Mykel J. Kochenderfer, Tim A. Wheeler, Algorithms for Optimization, MIT Press, 2019, Ch. 1, Sec. 2.1 and 2.2

Topic: Gradient descent and its variants
Source: Jeremy Watt, Reza Borhani, Aggelos K. Katsaggelos, Machine Learning Refined: Foundations, Algorithms, and Applications, 2nd Edition, Cambridge University Press, 2020, Ch. 3, Appendix A
Gradient Descent Weaknesses

• The steepest descent algorithm generally converges slowly towards an extremum.

• A natural weakness of steepest descent: the direction of the negative gradient can rapidly oscillate during a run of gradient descent, often producing zig-zagging steps that take considerable time for the algorithm to reach a minimum point.

• A natural weakness of steepest descent: the magnitude of the negative gradient can vanish rapidly near stationary points, leading gradient descent to slowly crawl near minima and saddle (inflection) points.

• If the function has more than one local minimum, the algorithm can get stuck in one of them.
Gradient Descent Weaknesses

• Finding a good step size may take some trial and error for the specific function. It is problem dependent.

• The step size controls whether the algorithm converges to a minimum quickly or slowly. At each step of the algorithm we always move in a descent direction, but how far we move in this direction is controlled by the step size. If the step size is set too small, we descend too slowly; if set too large, we may even ascend.
Crawling Near Stationary Points

• On the way to a minimum, the magnitude of the negative gradient can vanish rapidly near stationary points, leading gradient descent to slowly crawl near minima and saddle points. This slows gradient descent's progress near stationary points.

• A popular solution to this slow-crawling behavior is normalized gradient descent.
Addressing Crawling Near Stationary Points: Normalizing Gradient Descent

Crawling towards stationary points is due to the vanishing gradient. This can be addressed by normalizing gradient descent: we normalize the steepest descent direction by dividing it by its magnitude. Doing so gives a normalized gradient descent step of the form

    x^(k+1) = x^(k) − α^(k) ∇f(x^(k)) / ‖∇f(x^(k))‖

So the movement in the steepest descent direction is controlled entirely by the step size.

Normalized gradient descent empowers the standard gradient descent method to push through flat regions of a function with much greater ease. This includes flat regions that may lead to a local minimum, or the region around a saddle point of a nonconvex function where standard gradient descent can halt.
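As a quick sketch (not from the lecture), the normalized update rule above can be written in a few lines of Python; the names `normalized_gradient_step` and `grad_f` are illustrative:

```python
import numpy as np

def normalized_gradient_step(x, grad_f, alpha):
    """One step of normalized gradient descent:
    x(k+1) = x(k) - alpha * grad f(x(k)) / ||grad f(x(k))||"""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:              # already at a stationary point
        return x.copy()
    return x - alpha * g / norm

# On f(x, y) = 1/2 x^2 + 9/2 y^2 the gradient is (x, 9y).  The normalized
# direction always has unit length, so how far we move is controlled
# entirely by the step size alpha:
grad_f = lambda x: np.array([x[0], 9.0 * x[1]])
x0 = np.array([9.0, 3.0])
x1 = normalized_gradient_step(x0, grad_f, alpha=0.5)
print(np.linalg.norm(x1 - x0))   # step length equals alpha = 0.5
```

Because the direction is rescaled to unit length, each step moves exactly α^(k), no matter how small the gradient is near a stationary point.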
Addressing Steepest Descent Weaknesses: Normalizing Gradient Descent

For example, if

    ∇f = (0.3, 0.1)

then

    ∇f / ‖∇f‖ = (0.3, 0.1) / √(0.3² + 0.1²) = (0.949, 0.316)

[Figure: descent paths before and after normalization; less crawling after normalization.]
Addressing Steepest Descent Weaknesses: Normalizing Gradient Descent

For example, if

    ∇f = (0.3, 10)

then

    ∇f / ‖∇f‖ = (0.3, 10) / √(0.3² + 10²) = (0.030, 0.9996)

[Figure: descent paths before and after normalization; less zigzagging after normalization.]
Steepest Descent and Learning Rate

• Finding a good step size may take some trial and error for the specific function. It is problem dependent.

• The step size controls whether the algorithm converges to a minimum quickly or slowly. At each step of the algorithm we always move in a descent direction, but how far we move in this direction is controlled by the step size. If the step size is set too small, we descend too slowly; if set too large, we may even ascend.
Steepest Descent with Diminishing Step Length

• Instead of using a fixed step length rule, it is also possible to change the value of α from one step to another with what is often referred to as an adjustable step length rule.

• A diminishing step length rule starts with a large α value that gets smaller through the iterations; for example, α = 1/k at the k-th iteration.

• This avoids overshooting the minimum and zigzagging around the minimum.
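As a sketch (not the lecture's code), a diminishing step length rule of the form α_k = α0/k can be plugged into plain gradient descent as follows; the base step α0 = 0.1 and the function names are illustrative assumptions:

```python
import numpy as np

def gd_diminishing(x0, grad_f, f, alpha0=0.1, iters=200):
    """Gradient descent with a diminishing step length rule:
    alpha_k = alpha0 / k at the k-th iteration."""
    x = np.array(x0, dtype=float)
    history = [f(x)]
    for k in range(1, iters + 1):
        alpha = alpha0 / k            # large early steps, smaller later ones
        x = x - alpha * grad_f(x)
        history.append(f(x))
    return x, history

# f(x, y) = 1/2 x^2 + 9/2 y^2 with gradient (x, 9y)
f = lambda x: 0.5 * x[0] ** 2 + 4.5 * x[1] ** 2
grad_f = lambda x: np.array([x[0], 9.0 * x[1]])
x_final, history = gd_diminishing([9.0, 3.0], grad_f, f)
```

On this convex quadratic, every factor (1 − α_k) and (1 − 9α_k) stays in (0, 1) for α0 = 0.1, so the loss decreases at every iteration without overshooting.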
Steepest Descent with Optimized Step Size (Analytically)

• Some algorithms attempt to optimize the step size at each iteration so that the step maximally decreases f along each gradient descent direction (analytical methods, exact line search methods, inexact line search methods, ...).

• Deriving the optimum step size mathematically at every step to force descent is computationally expensive.
Steepest Descent with Optimized Step Size (Analytically)

The procedure starts with a point x^(0) and involves the following steps:

1. Check whether x^(k) satisfies the termination conditions. If it does, terminate; otherwise proceed to the next step.

2. Determine the steepest descent direction d^(k) = −∇f(x^(k)).

3. Construct h(α) = f(x^(k+1)) = f(x^(k) − α∇f(x^(k))) and determine analytically the step size (learning rate) α^(k) that minimizes h(α):

    α^(k) = argmin_α h(α)

4. Compute the next design point according to

    x^(k+1) = x^(k) + α^(k)(−∇f(x^(k)))
Gradient Descent with Optimized Step Size (Analytical): Example

Use the steepest descent method with optimized step size to minimize the following function:

    f(x, y) = (1/2)x² + (9/2)y²

The gradient is

    ∇f(x, y) = (x, 9y)

so the update x^(k+1) = x^(k) − α^(k)∇f(x^(k)) becomes

    (x^(k+1), y^(k+1)) = (x^(k), y^(k)) − α(x^(k), 9y^(k)) = (x^(k)(1 − α), y^(k)(1 − 9α))

Therefore

    h(α) = f(x^(k+1)) = (1/2)(x^(k))²(1 − α)² + (9/2)(y^(k))²(1 − 9α)²

h(α) is a univariate function, and we need to solve min_α h(α):

    dh/dα = −(x^(k))²(1 − α) − 81(y^(k))²(1 − 9α) = 0

    α^(k) = argmin_α h(α) = ((x^(k))² + 81(y^(k))²) / ((x^(k))² + 729(y^(k))²)

Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
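As a sketch (not from the text), the closed-form α^(k) above can be iterated directly in Python for this particular quadratic:

```python
def steepest_descent_optimized(x0, y0, iters=4):
    """Steepest descent on f(x, y) = 1/2 x^2 + 9/2 y^2 using the
    analytically optimal step size derived above:
    alpha = (x^2 + 81 y^2) / (x^2 + 729 y^2)."""
    f = lambda x, y: 0.5 * x ** 2 + 4.5 * y ** 2
    x, y = float(x0), float(y0)
    values = [f(x, y)]
    for _ in range(iters):
        alpha = (x ** 2 + 81.0 * y ** 2) / (x ** 2 + 729.0 * y ** 2)
        # the update (x, y) <- (x(1 - alpha), y(1 - 9 alpha)) from the slide
        x, y = x * (1.0 - alpha), y * (1.0 - 9.0 * alpha)
        values.append(f(x, y))
    return values

print(steepest_descent_optimized(9.0, 3.0))
# -> [81.0, 31.6097..., 12.3355..., 4.8138..., 1.8785...]
```

Starting from (9, 3), this reproduces the sequence of f values reported on the next slide; the optimal α alternates between 5/41 and 5/9 for this function.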


x^(1) = (9, 3), optimized α

• Notice that the gradient vectors of two successive steps are perpendicular.

• The values of f for the first five iterations:

    81
    31.609756097561
    12.3355145746579
    4.81385934620797
    1.87857925705677

Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
x^(1) = (9, 3): fixed α = 0.2 versus optimized α [figure of the two descent paths]

Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
x^(1) = (9, 3), optimized α [figure of f(x, y) versus iterations]

Michel Bierlaire, Optimization: Principles and Algorithms, EPFL Press, 2018.
Gradient Descent with Exact Line Search

The procedure starts with a point x^(0) and involves the following steps:

1. Check whether x^(k) satisfies the termination conditions. If it does, terminate; otherwise proceed to the next step.

2. Determine the steepest descent direction d^(k) = −∇f(x^(k)).

3. Construct h(α) = f(x^(k+1)) = f(x^(k) − α∇f(x^(k))) and determine the step size (learning rate) α^(k) that minimizes h(α),

    α^(k) = argmin_α h(α),

using numerical univariate optimization methods (golden section, Newton's method, ...).

4. Compute the next design point according to

    x^(k+1) = x^(k) + α^(k)(−∇f(x^(k)))

Gradient Descent with Inexact Line Search

• We want to spend less effort calculating the step size. So, instead of trying to solve the univariate optimization problem (analytically or numerically), a trial and error approach is used in which various step sizes are tested and the first one that is suitable is accepted.

• Inexact line search is a strategy for efficiently picking a good value of the step size, which results in quicker convergence.
Gradient Descent with Inexact Line Search (Backtracking Line Search)

The procedure starts with a point x^(0) and involves the following steps:

1. Check whether x^(k) satisfies the termination conditions. If it does, terminate; otherwise proceed to the next step.

2. Determine the steepest descent direction d^(k) = −∇f(x^(k)).

3. a) Calculate the initial loss f(x^(k)) and initialize α^(k) to a large value.
   b) Calculate f(x^(k) − α^(k)∇f(x^(k))).
   c) If f(x^(k) − α^(k)∇f(x^(k))) is less than f(x^(k)), then this value of α^(k) is acceptable; else decrease α^(k) by some factor and repeat b).

4. Compute the next design point according to

    x^(k+1) = x^(k) + α^(k)(−∇f(x^(k)))

Wolfe conditions provide an upper and lower bound on the admissible step length values.
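A minimal sketch of the backtracking procedure in steps 3a-3c; the starting value α = 1, the shrink factor 0.5, and the cap on retries are illustrative assumptions (practical implementations usually require sufficient decrease via an Armijo/Wolfe condition rather than any decrease):

```python
import numpy as np

def backtracking_step(x, f, grad_f, alpha0=1.0, shrink=0.5, max_tries=50):
    """One gradient descent step with backtracking line search (steps 3a-3c):
    start from a large alpha and shrink it until the loss decreases."""
    fx = f(x)                           # 3a: initial loss
    g = grad_f(x)
    alpha = alpha0                      # 3a: initialize alpha to a large value
    for _ in range(max_tries):
        if f(x - alpha * g) < fx:       # 3c: accept the first alpha that lowers f
            return x - alpha * g, alpha
        alpha *= shrink                 # else shrink alpha and repeat 3b
    return x, 0.0                       # no acceptable step found

f = lambda x: 0.5 * x[0] ** 2 + 4.5 * x[1] ** 2
grad_f = lambda x: np.array([x[0], 9.0 * x[1]])
x1, alpha = backtracking_step(np.array([9.0, 3.0]), f, grad_f)
print(alpha)   # 0.125: the first halved step size that lowers f
```

At (9, 3) the candidates α = 1, 0.5, and 0.25 all increase f (the step overshoots along y), so the search settles on α = 0.125.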
The Zig-Zagging Behavior of Gradient Descent

[Figure: an oscillating negative gradient producing zig-zagging steps.]

A popular solution to the zig-zagging behavior is momentum accelerated gradient descent.


Momentum Accelerated Steepest Descent

Momentum is the quantity of motion of a moving body, measured as the product of its mass and velocity.

The gradient descent with momentum algorithm (or momentum for short) borrows this idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth. The ball naturally gathers momentum as gravity causes it to accelerate, just as the gradient causes momentum to accumulate in this descent method.

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Momentum Accelerated Steepest Descent

[Figure]

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Momentum Accelerated Steepest Descent

Momentum helps solve the issue of slow convergence.

[Figures: descent on a loss surface over weights ω1 and ω2.]

https://medium.com/analytics-vidhya/momentum-a-simple-yet-efficient-optimizing-technique-ef76834e4423
Momentum Accelerated Steepest Descent

• Gradient descent takes a long time to traverse a nearly flat surface, since such surfaces have gradients with small magnitudes and thus require many iterations to traverse.

• The momentum term is a function of every negative gradient that precedes it: it captures an average of the preceding directions.

• We modify gradient descent to incorporate momentum. The momentum update equations are:

    d^(k) = βd^(k−1) + (1 − β)∇f(x^(k))
    x^(k+1) = x^(k) − αd^(k)

• For β = 0, we recover gradient descent.

• For larger β, d^(k) represents a summary of the preceding gradients.
Momentum Accelerated Steepest Descent

The procedure starts with a point x^(0), β (typically 0.7), α, and d^(0) = ∇f(x^(0)), and involves the following steps:

1. Check whether x^(k) satisfies the termination conditions. If it does, terminate; otherwise proceed to the next step.

2. Determine the momentum accelerated descent direction

    d^(k) = βd^(k−1) + (1 − β)∇f(x^(k))

3. Compute the next design point according to

    x^(k+1) = x^(k) − αd^(k)


Momentum Accelerated Steepest Descent: Example

Use the momentum accelerated steepest descent method to minimize the following function:

    f(x, y) = 0.5x² + 9.75y²

    α = 0.1, x^(0) = (10, 1)

[Figure: descent paths for β = 0, β = 0.2, and β = 0.7.]

• Both momentum-accelerated versions (β = 0.2 and β = 0.7) clearly outperform the standard scheme (β = 0), in that they reach a point closer to the true minimum of the function, and the overall path taken by gradient descent is smoothest for β = 0.7.
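A sketch of this experiment in Python (not the lecture's code); the 20-iteration budget is an arbitrary illustrative choice:

```python
import numpy as np

f = lambda x: 0.5 * x[0] ** 2 + 9.75 * x[1] ** 2
grad_f = lambda x: np.array([x[0], 19.5 * x[1]])

def momentum_descent(x0, grad_f, alpha, beta, iters):
    """Momentum accelerated steepest descent:
    d(k) = beta * d(k-1) + (1 - beta) * grad f(x(k))
    x(k+1) = x(k) - alpha * d(k), with d(0) = grad f(x(0)).
    beta = 0 recovers plain gradient descent."""
    x = np.array(x0, dtype=float)
    d = grad_f(x)
    for _ in range(iters):
        x = x - alpha * d
        d = beta * d + (1.0 - beta) * grad_f(x)
    return x

x0 = [10.0, 1.0]
f_plain = f(momentum_descent(x0, grad_f, alpha=0.1, beta=0.0, iters=20))
f_mom2 = f(momentum_descent(x0, grad_f, alpha=0.1, beta=0.2, iters=20))
f_mom7 = f(momentum_descent(x0, grad_f, alpha=0.1, beta=0.7, iters=20))
```

Running this, both momentum variants end at a lower f than the β = 0 scheme after the same number of iterations, consistent with the figure on the slide.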
Momentum Accelerated Steepest Descent: Weakness

• One issue with momentum is that the steps do not slow down enough at the bottom of a valley and tend to overshoot the valley floor.
