
06 23ECE216 GradientDescent v2

The document provides an overview of optimization techniques, distinguishing between unconstrained and constrained optimization, and detailing methods such as gradient descent. It explains the concept of gradients, the iterative process of the gradient descent algorithm, and the significance of the learning rate in finding local minima. Several examples illustrate the application of gradient descent in minimizing functions with varying complexity.


Om Namo Bhagavate Vasudevaya

Optimization
Gradient Descent

Dr. Binoy B Nair


Unconstrained Optimization

Definition: Optimization problems without any restrictions or constraints on the variables.

Objective: Minimize or maximize an objective function f(x), where x can take any value in the domain.

Mathematical Formulation: Minimize (or Maximize) f(x)

Methods: Techniques like gradient descent, Newton's method, and quasi-Newton methods are commonly used.

Applications: Often used in machine learning, statistics, and various engineering fields where the solution space is not restricted.
Constrained Optimization

• Definition: Optimization problems that include restrictions or constraints on the variables.

• Objective: Minimize or maximize an objective function f(x) subject to constraints g_i(x) ≤ 0 and/or h_j(x) = 0.

• Methods: Techniques like Lagrange multipliers, Karush-Kuhn-Tucker (KKT) conditions, and interior-point methods are commonly used.

• Applications: Widely used in engineering design, economics, operations research, and resource allocation where specific constraints must be satisfied.
Key Differences
•Constraints:
•Unconstrained: No constraints on the variables.
•Constrained: Includes constraints that the solution must satisfy.

•Complexity:
•Unconstrained: Generally simpler and faster to solve.
•Constrained: More complex due to the need to satisfy constraints, often requiring
advanced mathematical techniques.

•Solution Space:
•Unconstrained: Solution can be anywhere in the domain of the objective function.
•Constrained: Solution is restricted to the feasible region defined by the constraints.
Gradient Descent
• What's the goal when you are hiking down a mountain? To have fun and to reach the bottom. Let's focus on reaching the bottom for now.

• What is the red dot doing when it's hiking down? It's always going in the downward direction, until it hits the bottom.
What is Gradient?
• The gradient is the generalization of the derivative to several variables.

• Let's use x's as our variables in the following function J(x1, x2, x3), where

J(X) = 0.55 − (5x1 + 2x2 − 12x3)

Now the gradient is:

∇J(X) = ⟨∂J/∂x1, ∂J/∂x2, ∂J/∂x3⟩ = ⟨−5, −2, 12⟩

Note: the gradient is written inside ⟨ and ⟩ to denote that the gradient is a vector.
Why do we care about gradient?
• The gradient is a powerful tool in calculus.

• Remember:
• In one variable, the derivative gives us the slope of the tangent line.
• In several variables, the gradient points in the direction of the fastest increase of the function.

• This is extensively used in the Gradient Descent Algorithm. Let's see how.


Idea behind Gradient Descent Algorithm
• The gradient descent algorithm is an iterative process that takes us to a minimum of a function (this will not always happen; there are some caveats!).

• Let's look at the red dot example again:
Idea behind Gradient Descent Algorithm.. Contd.

• If you want to reach the bottom, in which direction would you walk? In the downward direction, right?

• How do we find the downward direction? That's the direction opposite to that of the fastest increase.
The search equation
• This means, if we are at point X(k) and want to move to the lowest nearby point (this is why we say "local minimum"), our next step should be at:

X(k+1) = X(k) − α·∇J(X), evaluated at X = X(k)

Here X(k+1) is the next position, X(k) is the current position, α is the step size, ∇J(X) points in the direction of fastest increase, and the minus sign steps in the opposite direction.
Why α ?
• α is called the Learning Rate or step size. (In some places you may see the symbol η, but they mean the same thing.)
• It means we want to take baby steps so that we don't overshoot the bottom. This is particularly important when we are very close to the minimum.
• A smart choice of α is crucial: when α is too small, it will take our algorithm forever to reach the lowest point, and if α is too big we might overshoot and miss the bottom.
Why the (-) sign?
• The negative sign indicates that we are stepping in the direction opposite to that of ∇J,

• i.e. we are stepping in the direction opposite to that of fastest increase.
Example 1
Q. Find the minimum of the function J(x) = x² using gradient descent.

Sol.

Let α = 0.8 and the initial guess be x(0) = −4.

∇J(x) = 2x

Now, the update equation is x(k+1) = x(k) − α·∇J(x) at x = x(k)

(The graph of (x, J(x)) shown on the slide is for visualization. Note that we generate this plot by plugging in a large number of x values and noting the corresponding J(x) values. This is not practical in the case of multivariable functions.)
Example 1
Sol contd...

Initial: x(0) = −4

Iteration 1: x(1) = x(0) − α·∇J(x) at x = x(0)
x(1) = −4 − 0.8·(2·(−4)) = −4 + 6.4 = 2.4

Iteration 2: x(2) = x(1) − α·∇J(x) at x = x(1)
x(2) = 2.4 − 0.8·(2·2.4) = 2.4 − 3.84 = −1.44

Iteration 3: x(3) = x(2) − α·∇J(x) at x = x(2)
x(3) = −1.44 − 0.8·(2·(−1.44)) = −1.44 + 2.304 = 0.864

And so on…
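A minimal Python sketch of this 1-D gradient descent, using the example's α = 0.8 and x(0) = −4 (the 30-iteration count is an assumption, chosen to match the trace on the following slides):

```python
# Gradient descent on J(x) = x**2, following Example 1.
# alpha = 0.8 and x0 = -4.0 come from the example; 30 iterations is assumed.

def grad_J(x):
    return 2 * x  # derivative of x^2

x, alpha = -4.0, 0.8
for k in range(30):
    x = x - alpha * grad_J(x)  # x(k+1) = x(k) - alpha * J'(x(k))
    if k < 3:
        print(k + 1, x)        # 2.4, -1.44, 0.864, as computed above
print(30, x)                   # about -8.84e-07, essentially the minimum at 0
```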
Gradient Descent

f(x) = x²
Step size: 0.8
x(0) = −4
x(1) = −4 − 0.8·2·(−4) = 2.4
x(2) = 2.4 − 0.8·2·2.4 = −1.44
x(3) = 0.864
x(4) = −0.5184
x(5) = 0.31104
…
x(30) = −8.84296e−07

(The original slides plot each of these iterates step by step on the graph of f(x) = x².)
Role of learning rate

https://www.naukri.com/code360/library/nesterov-accelerated-gradient

Step size matters! On f(x) = x², a large step size (0.9) makes the iterates overshoot the minimum and oscillate from side to side, while a small step size (0.2) approaches the minimum steadily but slowly. (The original slides illustrate both cases graphically.)
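A tiny Python experiment (an assumed setup, not from the slides) makes the effect concrete for f(x) = x²:

```python
# Compare step sizes on f(x) = x**2, starting from x0 = -4.0 as before.
# The step sizes 0.2 and 0.9 mirror the cases shown on the slides.

def run(alpha, x0=-4.0, num_iters=10):
    x, trace = x0, [x0]
    for _ in range(num_iters):
        x = x - alpha * 2 * x  # gradient of x^2 is 2x
        trace.append(round(x, 4))
    return trace

for alpha in (0.2, 0.9):
    print(alpha, run(alpha))
# alpha = 0.2: same-sign, steady but slow decay toward 0
# alpha = 0.9: the iterates overshoot 0 and oscillate in sign
```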
Example 2
Q. Find the minimum of the function J(X) = x1² + x2² using gradient descent.

Sol.

Let α = 0.4 and the initial guess be X(0) = [1, 1].

∇J(X) = [2x1, 2x2]

Now, the update equation is X(k+1) = X(k) − α·∇J(X) at X = X(k)

Initial: X(0) = [1, 1]

Step 1: X(1) = X(0) − α·∇J(X) at X = X(0)
X(1) = [1, 1] − 0.4·[2·1, 2·1] = [0.2, 0.2]

Step 2: X(2) = X(1) − α·∇J(X) at X = X(1)
X(2) = [0.2, 0.2] − 0.4·[2·0.2, 2·0.2] = [0.04, 0.04]

Step 3: X(3) = X(2) − α·∇J(X) at X = X(2)
X(3) = [0.04, 0.04] − 0.4·[2·0.04, 2·0.04] = [0.008, 0.008]

And so on…
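The same update in vector form, as a minimal NumPy sketch using this example's values (three iterations, matching the steps above):

```python
import numpy as np

# Vectorized gradient descent on J(X) = x1**2 + x2**2 (Example 2):
# alpha = 0.4 and X(0) = [1, 1] come from the example.

def grad_J(X):
    return 2 * X  # gradient [2*x1, 2*x2]

X, alpha = np.array([1.0, 1.0]), 0.4
for k in range(3):
    X = X - alpha * grad_J(X)  # X(k+1) = X(k) - alpha * grad J(X(k))
    print(k + 1, X)            # [0.2 0.2], [0.04 0.04], [0.008 0.008]
```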

Example 3
Q. Find the minimum of f(x, y) = (x − 2)² + (y + 3)². Show only the first three iterations.
Consider α = 0.1 and the initial values (x, y) = (0, 0).

Sol.
Iteration 1:
1. Compute the gradient:
∇J(X) = (∂f/∂x, ∂f/∂y) = (2(x − 2), 2(y + 3)). Evaluating at X(0) = (0, 0) we get
∇J([0, 0]) = (2(0 − 2), 2(0 + 3)) = (−4, 6)

Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Sol.
Iteration 1 Contd..
2. Update the variables:
Now, the update equation is X(k+1) = X(k) − α·∇J(X) at X = X(k)

X(1) = X(0) − α·∇J(X) at X = X(0)
= (0, 0) − 0.1·(−4, 6) = (0.4, −0.6)

New point: X(1) = (x1, y1) = (0.4, −0.6)


Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Iteration 2:
1. Compute the gradient: ∇J([0.4, −0.6]) = (2(0.4 − 2), 2(−0.6 + 3)) = (−3.2, 4.8)
2. Update the variables:
X(2) = X(1) − α·∇J(X) at X = X(1)
= (0.4, −0.6) − 0.1·(−3.2, 4.8) = (0.72, −1.08)

• New point: X(2) = (x2, y2) = (0.72, −1.08)


Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Iteration 3:
1. Compute the gradient: ∇f(0.72, −1.08) = (2(0.72 − 2), 2(−1.08 + 3)) = (−2.56, 3.84)
2. Update the variables:
X(3) = X(2) − α·∇J(X) at X = X(2)
= (0.72, −1.08) − 0.1·(−2.56, 3.84) = (0.976, −1.464)

• New point: (x3, y3) = (0.976, −1.464)

And so on…
Gradient descent with momentum
Why do we need momentum?

https://www.naukri.com/code360/library/nesterov-accelerated-gradient
Gradient Descent with Momentum
• The update rule for gradient descent with momentum incorporates a momentum term into the standard gradient descent update rule.
• The momentum helps smooth out the updates and accelerates convergence, especially when the cost function has a lot of small oscillations or noisy gradients.

• Here's the update rule for gradient descent with momentum:

v_x(t+1) = β·v_x(t) + α·∂f/∂x
x(t+1) = x(t) − v_x(t+1)
Steps in Gradient Descent with Momentum
1. Initialize:
o Choose an initial point (x0, y0).
o Set the learning rate (α) and the momentum term (β). Usually α + β = 1.
o Initialize the velocity terms v_x(0) and v_y(0) to 0.

2. Compute the Gradient:

• Calculate the gradient of the function at the current point: ∇f(x, y) = (∂f/∂x, ∂f/∂y)
Steps in Gradient Descent with Momentum
3. Update the Velocity:

• Update the velocity terms using the formula:

v_x(t+1) = β·v_x(t) + α·∂f/∂x
v_y(t+1) = β·v_y(t) + α·∂f/∂y

4. Update the Variables:

• Update the variables using the velocity terms:

x(t+1) = x(t) − v_x(t+1)
y(t+1) = y(t) − v_y(t+1)
Example 1
Q. Find the minimum of the function f(x, y) = (x − 1)² + (y − 2)² using gradient descent with momentum. Consider the initial point (x0, y0) = (0, 0), α = 0.1, and β = 0.9. Show only the first three iterations.

Sol.
• Initialization:
• Initial point: (x0, y0) = (0, 0)
• Initial velocity: (v_x(0), v_y(0)) = (0, 0)
Example 1

• Iteration 1:
Step 1.1: Compute the gradient: ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2(x − 1), 2(y − 2))
At (x0, y0) = (0, 0) we get ∇f(0, 0) = (2(0 − 1), 2(0 − 2)) = (−2, −4)

Step 1.2: Update the velocity:
v_x(1) = β·v_x(0) + α·∂f/∂x = 0.9·0 + 0.1·(−2) = −0.2
v_y(1) = β·v_y(0) + α·∂f/∂y = 0.9·0 + 0.1·(−4) = −0.4

Step 1.3: Update the variables:
x1 = x0 − v_x(1) = 0 − (−0.2) = 0.2
y1 = y0 − v_y(1) = 0 − (−0.4) = 0.4
• New point: (x1, y1) = (0.2, 0.4)
Example 1

• Iteration 2:
Step 2.1: Compute the gradient: ∇f(0.2, 0.4) = (2(0.2 − 1), 2(0.4 − 2)) = (−1.6, −3.2)

Step 2.2: Update the velocity:
v_x(2) = β·v_x(1) + α·∂f/∂x = 0.9·(−0.2) + 0.1·(−1.6) = −0.18 − 0.16 = −0.34
v_y(2) = β·v_y(1) + α·∂f/∂y = 0.9·(−0.4) + 0.1·(−3.2) = −0.36 − 0.32 = −0.68

Step 2.3: Update the variables:
x2 = x1 − v_x(2) = 0.2 − (−0.34) = 0.54
y2 = y1 − v_y(2) = 0.4 − (−0.68) = 1.08
• New point: (x2, y2) = (0.54, 1.08)
Example 1
• Iteration 3:
Step 3.1: Compute the gradient: ∇f(0.54, 1.08) = (2(0.54 − 1), 2(1.08 − 2)) = (−0.92, −1.84)

Step 3.2: Update the velocity:
v_x(3) = β·v_x(2) + α·∂f/∂x = 0.9·(−0.34) + 0.1·(−0.92) = −0.306 − 0.092 = −0.398
v_y(3) = β·v_y(2) + α·∂f/∂y = 0.9·(−0.68) + 0.1·(−1.84) = −0.612 − 0.184 = −0.796

Step 3.3: Update the variables:
x3 = x2 − v_x(3) = 0.54 − (−0.398) = 0.938
y3 = y2 − v_y(3) = 1.08 − (−0.796) = 1.876
• New point: (x3, y3) = (0.938, 1.876)

• And so on…
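A minimal NumPy sketch of these momentum updates, using the example's α = 0.1, β = 0.9, and starting point (0, 0); the three-iteration loop reproduces the points above:

```python
import numpy as np

# Gradient descent with momentum on f(x, y) = (x-1)**2 + (y-2)**2.

def grad_f(theta):
    x, y = theta
    return np.array([2 * (x - 1), 2 * (y - 2)])

theta = np.array([0.0, 0.0])  # (x0, y0)
v = np.zeros(2)               # initial velocity (vx0, vy0)
alpha, beta = 0.1, 0.9

for t in range(3):
    v = beta * v + alpha * grad_f(theta)  # velocity update
    theta = theta - v                     # variable update
    print(t + 1, theta)  # (0.2, 0.4), (0.54, 1.08), (0.938, 1.876)
```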
Adagrad

https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Adagrad
• Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm designed to adapt the
learning rate for each parameter based on its historical gradients.

• This makes it particularly useful for dealing with sparse data and features with different
scales.
Step-by-Step Explanation of Adagrad
Step 1: Initialize Parameters

• Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.

• Learning Rate (η): Set an initial learning rate.

• Accumulator (G): Initialize an accumulator for each parameter to zero. This will store the sum of the squares of the gradients.

• Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
Step-by-Step Explanation of Adagrad
Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter.

• This gradient indicates the direction and magnitude of the update needed to minimize the loss.

• Let's denote this gradient as g_t.


Step-by-Step Explanation of Adagrad

Step 3: Accumulate the Squared Gradients

• Update the accumulator for each parameter by adding the square of the current gradient:

G_t = G_{t−1} + g_t²

Here, G_t is the accumulated sum of squared gradients up to time step t.
Step-by-Step Explanation of Adagrad
Step 4: Update the Parameters

• Adjust the parameters using the accumulated gradients.

• The update rule for each parameter θ_i is:

θ_i = θ_i − (η / √(G_{t,i} + ε)) · g_{t,i}

This formula ensures that parameters with larger accumulated gradients have smaller updates, while those with smaller accumulated gradients have larger updates.
Adagrad Example
Q. Consider the objective function f(x, y) = x² + y². Use Adagrad to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.1.

Sol.

1. Initialization:
• Initial parameters: θ = (x0, y0) = (1.0, 1.0)
• Learning rate: η = 0.1
• Accumulator: G = [0, 0]
• Small constant: ε = 10⁻⁸
Adagrad Example
2. Iterations:

Iteration 1:
2.1.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·1.0, 2·1.0] = [2.0, 2.0]

2.1.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [0, 0] + [2.0², 2.0²] = [4.0, 4.0]

2.1.3 Update Parameters:
• x = x − (η / √(G_x + ε)) · ∇f_x = 1.0 − (0.1 / √(4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.1·1.0 = 0.9
• y = y − (η / √(G_y + ε)) · ∇f_y = 1.0 − (0.1 / √(4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.1·1.0 = 0.9

• Updated parameters: θ = (0.9, 0.9)


Adagrad Example
Iteration 2:
2.2.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·0.9, 2·0.9] = [1.8, 1.8]

2.2.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [4.0, 4.0] + [1.8², 1.8²] = [4.0 + 3.24, 4.0 + 3.24] = [7.24, 7.24]

2.2.3 Update Parameters:
• x = x − (η / √(G_x + ε)) · ∇f_x = 0.9 − (0.1 / √(7.24 + 10⁻⁸)) · 1.8 = 0.9 − 0.1·0.668 = 0.832
• y = y − (η / √(G_y + ε)) · ∇f_y = 0.9 − (0.1 / √(7.24 + 10⁻⁸)) · 1.8 = 0.9 − 0.1·0.668 = 0.832
• Updated parameters: θ = (0.832, 0.832)
Adagrad Example
Iteration 3:
2.3.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·0.832, 2·0.832] = [1.664, 1.664]

2.3.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [7.24, 7.24] + [1.664², 1.664²] = [7.24 + 2.77, 7.24 + 2.77] = [10.01, 10.01]

2.3.3 Update Parameters:
x = x − (η / √(G_x + ε)) · ∇f_x = 0.832 − (0.1 / √(10.01 + 10⁻⁸)) · 1.664 = 0.832 − 0.1·0.526 = 0.779
y = y − (η / √(G_y + ε)) · ∇f_y = 0.832 − (0.1 / √(10.01 + 10⁻⁸)) · 1.664 = 0.832 − 0.1·0.526 = 0.779
• Updated parameters: θ = (0.779, 0.779)
Adagrad Example
After three iterations, the parameters have been updated from [1.0, 1.0] to [0.779, 0.779].

The Adagrad algorithm adapts the learning rate for each parameter based on the accumulated squared gradients, allowing for more efficient optimization, especially in cases with sparse data.
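A minimal NumPy sketch of these Adagrad updates with the example's settings (η = 0.1, start (1.0, 1.0)); exact arithmetic gives 0.9, 0.8331, 0.7804, which the slides round to 0.9, 0.832, 0.779:

```python
import numpy as np

# Adagrad on f(x, y) = x**2 + y**2, following the worked example.

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
G = np.zeros(2)               # accumulated squared gradients
eta, eps = 0.1, 1e-8

for t in range(3):
    g = grad_f(theta)
    G = G + g ** 2                              # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter step size
    print(t + 1, theta)  # ~ (0.9, 0.9), (0.833, 0.833), (0.780, 0.780)
```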
Adagrad Advantages and Limitations
Advantages: Adagrad adapts the learning rate for each parameter,
making it effective for sparse data and features with different scales.
It eliminates the need to manually tune the learning rate.

Limitations: The learning rate can become very small over time due
to the accumulation of squared gradients, which can slow down
convergence.
RMSProp
RMSprop (Root Mean Square Propagation)

• RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to address the diminishing learning rates issue in Adagrad.

• It adjusts the learning rate for each parameter based on a moving average of the squared gradients.
Step-by-Step Explanation of RMSprop
Step 1: Initialize Parameters

Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.

Learning Rate (η): Set an initial learning rate, often around 0.001.

Decay Rate (β): Set a decay rate for the moving average, typically around 0.9.

Accumulator (E[g²]): Initialize an accumulator for each parameter to zero. This will store the exponentially decaying average of past squared gradients.

Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
Step-by-Step Explanation of RMSprop

Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter.

• This gradient indicates the direction and magnitude of the update needed to minimize the loss.

• Let's denote this gradient as g_t.


Step-by-Step Explanation of RMSprop

Step 3: Update the Accumulator

Update the accumulator for each parameter by computing the exponentially decaying average of past squared gradients:

E[g²]_t = β·E[g²]_{t−1} + (1 − β)·g_t²

Here, E[g²]_t is the updated accumulator at time step t.


Step-by-Step Explanation of RMSprop
Step 4: Update the Parameters

Adjust the parameters using the updated accumulator. The update rule for each parameter θ_i is:

θ_i = θ_i − (η / √(E[g²]_t + ε)) · g_t

This formula ensures that parameters with larger accumulated gradients have smaller updates, while those with smaller accumulated gradients have larger updates.
RMSprop- Example
Q. Consider the objective function f(x, y) = x² + y². Use RMSProp to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.01.

Sol.

1. Initialization:

Parameters: θ = (x, y) = (1.0, 1.0)
Learning rate: η = 0.01
Decay rate: β = 0.9
Accumulator: E[g²] = (0, 0)
Small constant: ε = 10⁻⁸
RMSprop- Example
2. Iteration 1:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·1.0, 2·1.0) = (2.0, 2.0)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0, 0) + 0.1·(2.0², 2.0²) = (0.4, 0.4)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 1 − (0.01 / √(0.4 + 10⁻⁸)) · 2.0 = 1 − 0.032 = 0.968
y = y − (η / √(E[g²]_y + ε)) · g_y = 1 − (0.01 / √(0.4 + 10⁻⁸)) · 2.0 = 1 − 0.032 = 0.968

Updated parameters: θ = (0.968, 0.968)
RMSprop- Example

3. Iteration 2:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.968, 2·0.968) = (1.936, 1.936)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0.4, 0.4) + 0.1·(1.936², 1.936²) = (0.36 + 0.375, 0.36 + 0.375) = (0.735, 0.735)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.968 − (0.01 / √(0.735 + 10⁻⁸)) · 1.936 = 0.968 − 0.023 = 0.945
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.968 − (0.01 / √(0.735 + 10⁻⁸)) · 1.936 = 0.968 − 0.023 = 0.945

Updated parameters: θ = (0.945, 0.945)
RMSprop- Example
4. Iteration 3:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.945, 2·0.945) = (1.891, 1.891)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0.735, 0.735) + 0.1·(1.891², 1.891²) = (0.661 + 0.358, 0.661 + 0.358) = (1.019, 1.019)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.945 − (0.01 / √(1.019 + 10⁻⁸)) · 1.891 = 0.945 − 0.019 = 0.926
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.945 − (0.01 / √(1.019 + 10⁻⁸)) · 1.891 = 0.945 − 0.019 = 0.926

Updated parameters: θ = (0.926, 0.926)
RMSprop- Example
• After three iterations, the parameters have been updated from (1.0, 1.0) to approximately (0.926, 0.926).

• RMSprop effectively adjusts the learning rate for each parameter based on the moving average of the squared gradients, allowing for more stable and efficient optimization.
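A minimal NumPy sketch of these RMSprop updates with the example's settings (η = 0.01, β = 0.9, start (1.0, 1.0)); small differences from the slide values come from rounding:

```python
import numpy as np

# RMSprop on f(x, y) = x**2 + y**2, following the worked example.

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
Eg2 = np.zeros(2)                  # moving average of squared gradients
eta, beta, eps = 0.01, 0.9, 1e-8

for t in range(3):
    g = grad_f(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2        # decaying accumulator
    theta = theta - eta / np.sqrt(Eg2 + eps) * g  # adaptive update
    print(t + 1, theta)  # ~ (0.968, ...), (0.946, ...), (0.927, ...)
```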
ADAM
ADAM
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the
advantages of two other popular methods: AdaGrad and RMSprop.
It computes adaptive learning rates for each parameter by estimating the first and second
moments of the gradients.
Step-by-Step Explanation of Adam
Step 1: Initialize Parameters

• Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.
• Learning Rate (η): Set an initial learning rate, often around 0.001.
• Exponential Decay Rates (β1 and β2): Set decay rates for the moment estimates, typically β1 = 0.9 and β2 = 0.999.
• First Moment Vector (m): Initialize the first moment vector to zero.
• Second Moment Vector (v): Initialize the second moment vector to zero.
• Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
• Time Step (t): Initialize the time step to zero.
Step-by-Step Explanation of Adam

Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter. This gradient indicates the direction and magnitude of the update needed to minimize the loss.
• Let's denote this gradient as g_t.

Step 3: Update the Time Step

• Increment the time step: t = t + 1

Step 4: Update Biased First Moment Estimate

• Compute the biased first moment estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t
Step-by-Step Explanation of Adam
Step 5: Update Biased Second Moment Estimate
Compute the biased second moment estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t²

Step 6: Compute Bias-Corrected First Moment Estimate

Correct the bias in the first moment estimate: m̂_t = m_t / (1 − β1^t)

Step 7: Compute Bias-Corrected Second Moment Estimate

Correct the bias in the second moment estimate: v̂_t = v_t / (1 − β2^t)

Step 8: Update Parameters

Adjust the parameters using the bias-corrected moment estimates: θ_t = θ_{t−1} − (η / (√v̂_t + ε)) · m̂_t
ADAM Example
Q. Consider the objective function f(x, y) = x² + y². Use ADAM to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.01, with β1 = 0.9 and β2 = 0.999.

1. Initialization:
Parameters: θ = (x, y) = (1.0, 1.0)
Learning rate: η = 0.01
Decay rates: β1 = 0.9, β2 = 0.999
First moment vector: m = [0, 0]
Second moment vector: v = [0, 0]
Small constant: ε = 10⁻⁸
Time step: t = 0
ADAM Example
2. Iteration 1:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·1.0, 2·1.0) = (2.0, 2.0)
Update Time Step: t = 1
Update Biased First Moment Estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t = 0.9·(0, 0) + 0.1·(2.0, 2.0) = (0.2, 0.2)
Update Biased Second Moment Estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t² = 0.999·(0, 0) + 0.001·(2.0², 2.0²) = (0.004, 0.004)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β1^t) = (0.2, 0.2) / (1 − 0.9¹) = (2.0, 2.0)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β2^t) = (0.004, 0.004) / (1 − 0.999¹) = (4.0, 4.0)
Update Parameters:
x = x − (η / (√v̂_x + ε)) · m̂_x = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.01·1.0 = 0.99
y = y − (η / (√v̂_y + ε)) · m̂_y = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.01·1.0 = 0.99
Updated parameters: θ = (0.99, 0.99)
ADAM Example

3. Iteration 2:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.99, 2·0.99) = (1.98, 1.98)
Update Time Step: t = 2
Update Biased First Moment Estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t = 0.9·(0.2, 0.2) + 0.1·(1.98, 1.98) = (0.378, 0.378)
Update Biased Second Moment Estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t² = 0.999·(0.004, 0.004) + 0.001·(1.98², 1.98²) = (0.00792, 0.00792)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β1^t) = (0.378, 0.378) / (1 − 0.9²) = (1.989, 1.989)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β2^t) = (0.00792, 0.00792) / (1 − 0.999²) = (3.96, 3.96)
Update Parameters:
x = x − (η / (√v̂_x + ε)) · m̂_x = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) · 1.989 = 0.99 − 0.01·1.0 = 0.98
y = y − (η / (√v̂_y + ε)) · m̂_y = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) · 1.989 = 0.99 − 0.01·1.0 = 0.98
Updated parameters: θ = (0.98, 0.98)
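A minimal NumPy sketch of these Adam updates with the example's settings (two iterations, matching the slides):

```python
import numpy as np

# Adam on f(x, y) = x**2 + y**2, following the worked example
# (eta = 0.01, beta1 = 0.9, beta2 = 0.999, start (1.0, 1.0)).

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)   # first and second moment estimates
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 3):
    g = grad_f(theta)
    m = beta1 * m + (1 - beta1) * g          # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    print(t, theta)  # (0.99, 0.99), then (0.98, 0.98)
```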
References
• Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.7 (2011). https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
• Daniel Villarraga, SYSEN 6800 Fall 2021, AdaGrad, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
• Geoffrey Hinton, "Coursera Neural Networks for Machine Learning, lecture 6", 2018.
• Jason Huang, SysEn 6800 Fall 2020, RMSProp, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=RMSProp
• Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
Thank You
