
06 23ECE216 GradientDescent v2

The document provides an overview of optimization techniques, distinguishing between unconstrained and constrained optimization, and detailing methods such as gradient descent. It explains the concept of gradients, the iterative process of the gradient descent algorithm, and the significance of the learning rate in finding local minima. Several examples illustrate the application of gradient descent in minimizing functions with varying complexity.


Om Namo Bhagavate Vasudevaya

Optimization
Gradient Descent

Dr. Binoy B Nair


Unconstrained Optimization

Definition: Optimization problems without any restrictions or constraints on the variables.

Objective: Minimize or maximize an objective function f(x), where x can take any value in the domain.

Mathematical Formulation: Minimize (or Maximize) f(x)

Methods: Techniques like gradient descent, Newton's method, and quasi-Newton methods are commonly used.

Applications: Often used in machine learning, statistics, and various engineering fields where the solution space is not restricted.
Constrained Optimization

• Definition: Optimization problems that include restrictions or constraints on the variables.

• Objective: Minimize or maximize an objective function f(x) subject to constraints g_i(x) ≤ 0 and/or h_j(x) = 0.

• Methods: Techniques like Lagrange multipliers, Karush-Kuhn-Tucker (KKT) conditions, and interior-point methods are commonly used.

• Applications: Widely used in engineering design, economics, operations research, and resource allocation where specific constraints must be satisfied.
Key Differences
•Constraints:
•Unconstrained: No constraints on the variables.
•Constrained: Includes constraints that the solution must satisfy.

•Complexity:
•Unconstrained: Generally simpler and faster to solve.
•Constrained: More complex due to the need to satisfy constraints, often requiring
advanced mathematical techniques.

•Solution Space:
•Unconstrained: Solution can be anywhere in the domain of the objective function.
•Constrained: Solution is restricted to the feasible region defined by the constraints.
Gradient Descent
• What's the goal when you are hiking down a mountain? To have fun and to reach the bottom. Let's focus on reaching the bottom for now.

• What is the red dot doing when it's hiking down? It's always going in the downward direction, until it hits the bottom.
What is Gradient?
• The gradient is the generalization of the derivative to several variables.

• Let's use x's as our variables in the following function J(x1, x2, x3), where

J(X) = 0.55 − (5x1 + 2x2 − 12x3)

Now the gradient is:

∇J(X) = ⟨∂J/∂x1, ∂J/∂x2, ∂J/∂x3⟩ = ⟨−5, −2, 12⟩

Note: the gradient is written inside ⟨ and ⟩ to denote that the gradient is a vector.
Why do we care about gradient?
• The gradient is a powerful tool in calculus.

• Remember:
• In one variable, the derivative gives us the slope of the tangent line.
• In several variables, the gradient points in the direction of the fastest increase of the function.

• This is extensively used in the Gradient Descent Algorithm. Let's see how.


Idea behind Gradient Descent Algorithm
• The gradient descent algorithm is an iterative process that takes us to a minimum of a function (this will not always happen; there are some caveats!).

• Let's look at the red dot example again:
Idea behind Gradient Descent Algorithm.. Contd.

• If you want to reach the bottom, in which direction would you walk? In the downward direction, right?

• How do we find the downward direction? That's the direction opposite to that of the fastest increase.
The search equation
• This means, if we are at point X(k) and want to move to the lowest nearby point (this is why we say "local minimum"), our next step should be at:

X(k+1) = X(k) − α·∇J(X), evaluated at X = X(k)

Here X(k+1) is the next position, X(k) is the current position, α is the step size, ∇J(X) points in the direction of fastest increase, and the minus sign steps in the opposite direction.
Why α ?
• α is called the Learning Rate or step size. (In some places you may see the symbol η, but they mean the same thing.)
• It means we want to take baby steps so that we don't overshoot the bottom. This is particularly important when we are very close to the minimum.
• A smart choice of α is crucial: when α is too small, it will take our algorithm forever to reach the lowest point, and if α is too big we might overshoot and miss the bottom.
Why the (-) sign?
• The negative sign indicates that we are stepping in the direction opposite to that of ∇J,

• i.e. we are stepping in the direction opposite to that of fastest increase.
Example 1
Q. Find the minimum of the function J(x) = x² using gradient descent.

Sol.

Let α = 0.8 and the initial guess be x(0) = −4.

∇J(x) = 2x

Now, the update equation is x(k+1) = x(k) − α·∇J(x) at x = x(k)

(The graph of (x, J(x)) shown on the slide is for visualization. Note that we generate this plot by plugging in a large number of x values and noting the corresponding J(x) values. This is not practical in the case of multivariable functions.)
Example 1
Sol contd...

Initial: x(0) = −4

Iteration 1: x(1) = x(0) − α·∇J(x) at x = x(0)
x(1) = −4 − 0.8·(2·(−4)) = −4 + 6.4 = 2.4

Iteration 2: x(2) = x(1) − α·∇J(x) at x = x(1)
x(2) = 2.4 − 0.8·(2·2.4) = 2.4 − 3.84 = −1.44

Iteration 3: x(3) = x(2) − α·∇J(x) at x = x(2)
x(3) = −1.44 − 0.8·(2·(−1.44)) = −1.44 + 2.304 = 0.864

And so on…
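A minimal Python sketch of this 1-D gradient descent, using the example's α = 0.8 and x(0) = −4 (the 30-iteration count is an assumption, chosen to match the trace on the following slides):

```python
# Gradient descent on J(x) = x**2, following Example 1.
# alpha = 0.8 and x0 = -4.0 come from the example; 30 iterations is assumed.

def grad_J(x):
    return 2 * x  # derivative of x^2

x, alpha = -4.0, 0.8
for k in range(30):
    x = x - alpha * grad_J(x)  # x(k+1) = x(k) - alpha * J'(x(k))
    if k < 3:
        print(k + 1, x)        # 2.4, -1.44, 0.864, as computed above
print(30, x)                   # about -8.84e-07, essentially the minimum at 0
```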
Gradient Descent

f(x) = x²
Step size: 0.8
x(0) = −4
x(1) = −4 − 0.8·2·(−4) = 2.4
x(2) = 2.4 − 0.8·2·2.4 = −1.44
x(3) = 0.864
x(4) = −0.5184
x(5) = 0.31104
…
x(30) = −8.84296e−07

(The original slides plot each of these iterates step by step on the graph of f(x) = x².)
Role of learning rate

https://www.naukri.com/code360/library/nesterov-accelerated-gradient

Step size matters! On f(x) = x², a large step size (0.9) makes the iterates overshoot the minimum and oscillate from side to side, while a small step size (0.2) approaches the minimum steadily but slowly. (The original slides illustrate both cases graphically.)
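A tiny Python experiment (an assumed setup, not from the slides) makes the effect concrete for f(x) = x²:

```python
# Compare step sizes on f(x) = x**2, starting from x0 = -4.0 as before.
# The step sizes 0.2 and 0.9 mirror the cases shown on the slides.

def run(alpha, x0=-4.0, num_iters=10):
    x, trace = x0, [x0]
    for _ in range(num_iters):
        x = x - alpha * 2 * x  # gradient of x^2 is 2x
        trace.append(round(x, 4))
    return trace

for alpha in (0.2, 0.9):
    print(alpha, run(alpha))
# alpha = 0.2: same-sign, steady but slow decay toward 0
# alpha = 0.9: the iterates overshoot 0 and oscillate in sign
```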
Example 2
Q. Find the minimum of the function J(X) = x1² + x2² using gradient descent.

Sol.

Let α = 0.4 and the initial guess be X(0) = [1, 1].

∇J(X) = [2x1, 2x2]

Now, the update equation is X(k+1) = X(k) − α·∇J(X) at X = X(k)

Initial: X(0) = [1, 1]

Step 1: X(1) = X(0) − α·∇J(X) at X = X(0)
X(1) = [1, 1] − 0.4·[2·1, 2·1] = [0.2, 0.2]

Step 2: X(2) = X(1) − α·∇J(X) at X = X(1)
X(2) = [0.2, 0.2] − 0.4·[2·0.2, 2·0.2] = [0.04, 0.04]

Step 3: X(3) = X(2) − α·∇J(X) at X = X(2)
X(3) = [0.04, 0.04] − 0.4·[2·0.04, 2·0.04] = [0.008, 0.008]

And so on…
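The same update in vector form, as a minimal NumPy sketch using this example's values (three iterations, matching the steps above):

```python
import numpy as np

# Vectorized gradient descent on J(X) = x1**2 + x2**2 (Example 2):
# alpha = 0.4 and X(0) = [1, 1] come from the example.

def grad_J(X):
    return 2 * X  # gradient [2*x1, 2*x2]

X, alpha = np.array([1.0, 1.0]), 0.4
for k in range(3):
    X = X - alpha * grad_J(X)  # X(k+1) = X(k) - alpha * grad J(X(k))
    print(k + 1, X)            # [0.2 0.2], [0.04 0.04], [0.008 0.008]
```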

Example 3
Q. Find the minimum of f(x, y) = (x − 2)² + (y + 3)². Show only the first three iterations.
Consider α = 0.1 and the initial values (x, y) = (0, 0).

Sol.
Iteration 1:
1. Compute the gradient:
∇J(X) = (∂f/∂x, ∂f/∂y) = (2(x − 2), 2(y + 3)). Evaluating at X(0) = (0, 0) we get
∇J([0, 0]) = (2(0 − 2), 2(0 + 3)) = (−4, 6)

Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Sol.
Iteration 1 Contd..
2. Update the variables:
Now, the update equation is X(k+1) = X(k) − α·∇J(X) at X = X(k)

X(1) = X(0) − α·∇J(X) at X = X(0)
= (0, 0) − 0.1·(−4, 6) = (0.4, −0.6)

New point: X(1) = (x1, y1) = (0.4, −0.6)


Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Iteration 2:
1. Compute the gradient: ∇J([0.4, −0.6]) = (2(0.4 − 2), 2(−0.6 + 3)) = (−3.2, 4.8)
2. Update the variables:
X(2) = X(1) − α·∇J(X) at X = X(1)
= (0.4, −0.6) − 0.1·(−3.2, 4.8) = (0.72, −1.08)

• New point: X(2) = (x2, y2) = (0.72, −1.08)


Example 3: f(x, y) = (x − 2)² + (y + 3)², α = 0.1, initial values (x, y) = (0, 0)

Iteration 3:
1. Compute the gradient: ∇f(0.72, −1.08) = (2(0.72 − 2), 2(−1.08 + 3)) = (−2.56, 3.84)
2. Update the variables:
X(3) = X(2) − α·∇J(X) at X = X(2)
= (0.72, −1.08) − 0.1·(−2.56, 3.84) = (0.976, −1.464)

• New point: (x3, y3) = (0.976, −1.464)

And so on…
Gradient descent with momentum
Why do we need momentum?

https://www.naukri.com/code360/library/nesterov-accelerated-gradient
Gradient Descent with Momentum
• The update rule for gradient descent with momentum incorporates a momentum term into the standard gradient descent update rule.
• The momentum helps smooth out the updates and accelerates convergence, especially when the cost function has a lot of small oscillations or noisy gradients.

• Here's the update rule for gradient descent with momentum:

v_x(t+1) = β·v_x(t) + α·∂f/∂x
x(t+1) = x(t) − v_x(t+1)
Steps in Gradient Descent with Momentum
1. Initialize:
o Choose an initial point (x0, y0).
o Set the learning rate (α) and the momentum term (β). Usually α + β = 1.
o Initialize the velocity terms v_x(0) and v_y(0) to 0.

2. Compute the Gradient:

• Calculate the gradient of the function at the current point: ∇f(x, y) = (∂f/∂x, ∂f/∂y)
Steps in Gradient Descent with Momentum
3. Update the Velocity:

• Update the velocity terms using the formula:

v_x(t+1) = β·v_x(t) + α·∂f/∂x
v_y(t+1) = β·v_y(t) + α·∂f/∂y

4. Update the Variables:

• Update the variables using the velocity terms:

x(t+1) = x(t) − v_x(t+1)
y(t+1) = y(t) − v_y(t+1)
Example 1
Q. Find the minimum of the function f(x, y) = (x − 1)² + (y − 2)² using gradient descent with momentum. Consider the initial point (x0, y0) = (0, 0), α = 0.1, and β = 0.9. Show only the first three iterations.

Sol.
• Initialization:
• Initial point: (x0, y0) = (0, 0)
• Initial velocity: (v_x(0), v_y(0)) = (0, 0)
Example 1

• Iteration 1:
Step 1.1: Compute the gradient: ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2(x − 1), 2(y − 2))
At (x0, y0) = (0, 0) we get ∇f(0, 0) = (2(0 − 1), 2(0 − 2)) = (−2, −4)

Step 1.2: Update the velocity:
v_x(1) = β·v_x(0) + α·∂f/∂x = 0.9·0 + 0.1·(−2) = −0.2
v_y(1) = β·v_y(0) + α·∂f/∂y = 0.9·0 + 0.1·(−4) = −0.4

Step 1.3: Update the variables:
x1 = x0 − v_x(1) = 0 − (−0.2) = 0.2
y1 = y0 − v_y(1) = 0 − (−0.4) = 0.4
• New point: (x1, y1) = (0.2, 0.4)
Example 1

• Iteration 2:
Step 2.1: Compute the gradient: ∇f(0.2, 0.4) = (2(0.2 − 1), 2(0.4 − 2)) = (−1.6, −3.2)

Step 2.2: Update the velocity:
v_x(2) = β·v_x(1) + α·∂f/∂x = 0.9·(−0.2) + 0.1·(−1.6) = −0.18 − 0.16 = −0.34
v_y(2) = β·v_y(1) + α·∂f/∂y = 0.9·(−0.4) + 0.1·(−3.2) = −0.36 − 0.32 = −0.68

Step 2.3: Update the variables:
x2 = x1 − v_x(2) = 0.2 − (−0.34) = 0.54
y2 = y1 − v_y(2) = 0.4 − (−0.68) = 1.08
• New point: (x2, y2) = (0.54, 1.08)
Example 1
• Iteration 3:
Step 3.1: Compute the gradient: ∇f(0.54, 1.08) = (2(0.54 − 1), 2(1.08 − 2)) = (−0.92, −1.84)

Step 3.2: Update the velocity:
v_x(3) = β·v_x(2) + α·∂f/∂x = 0.9·(−0.34) + 0.1·(−0.92) = −0.306 − 0.092 = −0.398
v_y(3) = β·v_y(2) + α·∂f/∂y = 0.9·(−0.68) + 0.1·(−1.84) = −0.612 − 0.184 = −0.796

Step 3.3: Update the variables:
x3 = x2 − v_x(3) = 0.54 − (−0.398) = 0.938
y3 = y2 − v_y(3) = 1.08 − (−0.796) = 1.876
• New point: (x3, y3) = (0.938, 1.876)

• And so on…
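A minimal NumPy sketch of these momentum updates, using the example's α = 0.1, β = 0.9, and starting point (0, 0); the three-iteration loop reproduces the points above:

```python
import numpy as np

# Gradient descent with momentum on f(x, y) = (x-1)**2 + (y-2)**2.

def grad_f(theta):
    x, y = theta
    return np.array([2 * (x - 1), 2 * (y - 2)])

theta = np.array([0.0, 0.0])  # (x0, y0)
v = np.zeros(2)               # initial velocity (vx0, vy0)
alpha, beta = 0.1, 0.9

for t in range(3):
    v = beta * v + alpha * grad_f(theta)  # velocity update
    theta = theta - v                     # variable update
    print(t + 1, theta)  # (0.2, 0.4), (0.54, 1.08), (0.938, 1.876)
```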
Adagrad

https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Adagrad
• Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm designed to adapt the
learning rate for each parameter based on its historical gradients.

• This makes it particularly useful for dealing with sparse data and features with different
scales.
Step-by-Step Explanation of Adagrad
Step 1: Initialize Parameters

• Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.

• Learning Rate (η): Set an initial learning rate.

• Accumulator (G): Initialize an accumulator for each parameter to zero. This will store the sum of the squares of the gradients.

• Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
Step-by-Step Explanation of Adagrad
Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter.

• This gradient indicates the direction and magnitude of the update needed to minimize the loss.

• Let's denote this gradient as g_t.


Step-by-Step Explanation of Adagrad

Step 3: Accumulate the Squared Gradients

• Update the accumulator for each parameter by adding the square of the current gradient:

G_t = G_{t−1} + g_t²

Here, G_t is the accumulated sum of squared gradients up to time step t.
Step-by-Step Explanation of Adagrad
Step 4: Update the Parameters

• Adjust the parameters using the accumulated gradients.

• The update rule for each parameter θ_i is:

θ_i = θ_i − (η / √(G_{t,i} + ε)) · g_{t,i}

This formula ensures that parameters with larger accumulated gradients have smaller updates, while those with smaller accumulated gradients have larger updates.
Adagrad Example
Q. Consider the objective function f(x, y) = x² + y². Use Adagrad to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.1.

Sol.

1. Initialization:
• Initial parameters: θ = (x0, y0) = (1.0, 1.0)
• Learning rate: η = 0.1
• Accumulator: G = [0, 0]
• Small constant: ε = 10⁻⁸
Adagrad Example
2. Iterations:

Iteration 1:
2.1.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·1.0, 2·1.0] = [2.0, 2.0]

2.1.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [0, 0] + [2.0², 2.0²] = [4.0, 4.0]

2.1.3 Update Parameters:
• x = x − (η / √(G_x + ε)) · ∇f_x = 1.0 − (0.1 / √(4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.1·1.0 = 0.9
• y = y − (η / √(G_y + ε)) · ∇f_y = 1.0 − (0.1 / √(4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.1·1.0 = 0.9

• Updated parameters: θ = (0.9, 0.9)


Adagrad Example
Iteration 2:
2.2.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·0.9, 2·0.9] = [1.8, 1.8]

2.2.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [4.0, 4.0] + [1.8², 1.8²] = [4.0 + 3.24, 4.0 + 3.24] = [7.24, 7.24]

2.2.3 Update Parameters:
• x = x − (η / √(G_x + ε)) · ∇f_x = 0.9 − (0.1 / √(7.24 + 10⁻⁸)) · 1.8 = 0.9 − 0.1·0.668 = 0.832
• y = y − (η / √(G_y + ε)) · ∇f_y = 0.9 − (0.1 / √(7.24 + 10⁻⁸)) · 1.8 = 0.9 − 0.1·0.668 = 0.832
• Updated parameters: θ = (0.832, 0.832)
Adagrad Example
Iteration 3:
2.3.1 Compute Gradient:
∇f(x, y) = [2x, 2y] = [2·0.832, 2·0.832] = [1.664, 1.664]

2.3.2 Accumulate Squared Gradients:
G = G + ∇f(x, y)² = [7.24, 7.24] + [1.664², 1.664²] = [7.24 + 2.77, 7.24 + 2.77] = [10.01, 10.01]

2.3.3 Update Parameters:
x = x − (η / √(G_x + ε)) · ∇f_x = 0.832 − (0.1 / √(10.01 + 10⁻⁸)) · 1.664 = 0.832 − 0.1·0.526 = 0.779
y = y − (η / √(G_y + ε)) · ∇f_y = 0.832 − (0.1 / √(10.01 + 10⁻⁸)) · 1.664 = 0.832 − 0.1·0.526 = 0.779
• Updated parameters: θ = (0.779, 0.779)
Adagrad Example
After three iterations, the parameters have been updated from [1.0, 1.0] to [0.779, 0.779].

The Adagrad algorithm adapts the learning rate for each parameter based on the accumulated squared gradients, allowing for more efficient optimization, especially in cases with sparse data.
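A minimal NumPy sketch of these Adagrad updates with the example's settings (η = 0.1, start (1.0, 1.0)); exact arithmetic gives 0.9, 0.8331, 0.7804, which the slides round to 0.9, 0.832, 0.779:

```python
import numpy as np

# Adagrad on f(x, y) = x**2 + y**2, following the worked example.

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
G = np.zeros(2)               # accumulated squared gradients
eta, eps = 0.1, 1e-8

for t in range(3):
    g = grad_f(theta)
    G = G + g ** 2                              # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter step size
    print(t + 1, theta)  # ~ (0.9, 0.9), (0.833, 0.833), (0.780, 0.780)
```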
Adagrad Advantages and Limitations
Advantages: Adagrad adapts the learning rate for each parameter,
making it effective for sparse data and features with different scales.
It eliminates the need to manually tune the learning rate.

Limitations: The learning rate can become very small over time due
to the accumulation of squared gradients, which can slow down
convergence.
RMSProp
RMSprop (Root Mean Square Propagation)

• RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to address the diminishing learning rates issue in Adagrad.

• It adjusts the learning rate for each parameter based on a moving average of the squared gradients.
Step-by-Step Explanation of RMSprop
Step 1: Initialize Parameters

Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.

Learning Rate (η): Set an initial learning rate, often around 0.001.

Decay Rate (β): Set a decay rate for the moving average, typically around 0.9.

Accumulator (E[g²]): Initialize an accumulator for each parameter to zero. This will store the exponentially decaying average of past squared gradients.

Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
Step-by-Step Explanation of RMSprop

Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter.

• This gradient indicates the direction and magnitude of the update needed to minimize the loss.

• Let's denote this gradient as g_t.


Step-by-Step Explanation of RMSprop

Step 3: Update the Accumulator

Update the accumulator for each parameter by computing the exponentially decaying average of past squared gradients:

E[g²]_t = β·E[g²]_{t−1} + (1 − β)·g_t²

Here, E[g²]_t is the updated accumulator at time step t.


Step-by-Step Explanation of RMSprop
Step 4: Update the Parameters

Adjust the parameters using the updated accumulator. The update rule for each parameter θ_i is:

θ_i = θ_i − (η / √(E[g²]_t + ε)) · g_t

This formula ensures that parameters with larger accumulated gradients have smaller updates, while those with smaller accumulated gradients have larger updates.
RMSprop- Example
Q. Consider the objective function f(x, y) = x² + y². Use RMSProp to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.01.

Sol.

1. Initialization:

Parameters: θ = (x, y) = (1.0, 1.0)
Learning rate: η = 0.01
Decay rate: β = 0.9
Accumulator: E[g²] = (0, 0)
Small constant: ε = 10⁻⁸
RMSprop- Example
2. Iteration 1:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·1.0, 2·1.0) = (2.0, 2.0)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0, 0) + 0.1·(2.0², 2.0²) = (0.4, 0.4)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 1 − (0.01 / √(0.4 + 10⁻⁸)) · 2.0 = 1 − 0.032 = 0.968
y = y − (η / √(E[g²]_y + ε)) · g_y = 1 − (0.01 / √(0.4 + 10⁻⁸)) · 2.0 = 1 − 0.032 = 0.968

Updated parameters: θ = (0.968, 0.968)
RMSprop- Example

3. Iteration 2:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.968, 2·0.968) = (1.936, 1.936)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0.4, 0.4) + 0.1·(1.936², 1.936²) = (0.36 + 0.375, 0.36 + 0.375) = (0.735, 0.735)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.968 − (0.01 / √(0.735 + 10⁻⁸)) · 1.936 = 0.968 − 0.023 = 0.945
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.968 − (0.01 / √(0.735 + 10⁻⁸)) · 1.936 = 0.968 − 0.023 = 0.945

Updated parameters: θ = (0.945, 0.945)
RMSprop- Example
4. Iteration 3:

Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.945, 2·0.945) = (1.891, 1.891)

Update Accumulator: E[g²] = β·E[g²] + (1 − β)·g_t² = 0.9·(0.735, 0.735) + 0.1·(1.891², 1.891²) = (0.661 + 0.358, 0.661 + 0.358) = (1.019, 1.019)

Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.945 − (0.01 / √(1.019 + 10⁻⁸)) · 1.891 = 0.945 − 0.019 = 0.926
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.945 − (0.01 / √(1.019 + 10⁻⁸)) · 1.891 = 0.945 − 0.019 = 0.926

Updated parameters: θ = (0.926, 0.926)
RMSprop- Example
• After three iterations, the parameters have been updated from (1.0, 1.0) to approximately (0.926, 0.926).

• RMSprop effectively adjusts the learning rate for each parameter based on the moving average of the squared gradients, allowing for more stable and efficient optimization.
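A minimal NumPy sketch of these RMSprop updates with the example's settings (η = 0.01, β = 0.9, start (1.0, 1.0)); small differences from the slide values come from rounding:

```python
import numpy as np

# RMSprop on f(x, y) = x**2 + y**2, following the worked example.

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
Eg2 = np.zeros(2)                  # moving average of squared gradients
eta, beta, eps = 0.01, 0.9, 1e-8

for t in range(3):
    g = grad_f(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2        # decaying accumulator
    theta = theta - eta / np.sqrt(Eg2 + eps) * g  # adaptive update
    print(t + 1, theta)  # ~ (0.968, ...), (0.946, ...), (0.927, ...)
```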
ADAM
ADAM
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the
advantages of two other popular methods: AdaGrad and RMSprop.
It computes adaptive learning rates for each parameter by estimating the first and second
moments of the gradients.
Step-by-Step Explanation of Adam
Step 1: Initialize Parameters

• Parameters: Initialize the parameters (weights) you want to optimize, typically with small random values.
• Learning Rate (η): Set an initial learning rate, often around 0.001.
• Exponential Decay Rates (β1 and β2): Set decay rates for the moment estimates, typically β1 = 0.9 and β2 = 0.999.
• First Moment Vector (m): Initialize the first moment vector to zero.
• Second Moment Vector (v): Initialize the second moment vector to zero.
• Small Constant (ε): Use a small constant (e.g., 10⁻⁸) to prevent division by zero.
• Time Step (t): Initialize the time step to zero.
Step-by-Step Explanation of Adam

Step 2: Compute the Gradient

• For each parameter at time step t, compute the gradient of the loss function with respect to that parameter. This gradient indicates the direction and magnitude of the update needed to minimize the loss.
• Let's denote this gradient as g_t.

Step 3: Update the Time Step

• Increment the time step: t = t + 1

Step 4: Update Biased First Moment Estimate

• Compute the biased first moment estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t
Step-by-Step Explanation of Adam
Step 5: Update Biased Second Moment Estimate
Compute the biased second moment estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t²

Step 6: Compute Bias-Corrected First Moment Estimate

Correct the bias in the first moment estimate: m̂_t = m_t / (1 − β1^t)

Step 7: Compute Bias-Corrected Second Moment Estimate

Correct the bias in the second moment estimate: v̂_t = v_t / (1 − β2^t)

Step 8: Update Parameters

Adjust the parameters using the bias-corrected moment estimates: θ_t = θ_{t−1} − (η / (√v̂_t + ε)) · m̂_t
ADAM Example
Q. Consider the objective function f(x, y) = x² + y². Use ADAM to find the minimum of this function. Consider the initial point (x0, y0) = (1.0, 1.0) and an initial learning rate of 0.01, with β1 = 0.9 and β2 = 0.999.

1. Initialization:
Parameters: θ = (x, y) = (1.0, 1.0)
Learning rate: η = 0.01
Decay rates: β1 = 0.9, β2 = 0.999
First moment vector: m = [0, 0]
Second moment vector: v = [0, 0]
Small constant: ε = 10⁻⁸
Time step: t = 0
ADAM Example
2. Iteration 1:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·1.0, 2·1.0) = (2.0, 2.0)
Update Time Step: t = 1
Update Biased First Moment Estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t = 0.9·(0, 0) + 0.1·(2.0, 2.0) = (0.2, 0.2)
Update Biased Second Moment Estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t² = 0.999·(0, 0) + 0.001·(2.0², 2.0²) = (0.004, 0.004)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β1^t) = (0.2, 0.2) / (1 − 0.9¹) = (2.0, 2.0)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β2^t) = (0.004, 0.004) / (1 − 0.999¹) = (4.0, 4.0)
Update Parameters:
x = x − (η / (√v̂_x + ε)) · m̂_x = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.01·1.0 = 0.99
y = y − (η / (√v̂_y + ε)) · m̂_y = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) · 2.0 = 1.0 − 0.01·1.0 = 0.99
Updated parameters: θ = (0.99, 0.99)
ADAM Example

3. Iteration 2:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2·0.99, 2·0.99) = (1.98, 1.98)
Update Time Step: t = 2
Update Biased First Moment Estimate: m_t = β1·m_{t−1} + (1 − β1)·g_t = 0.9·(0.2, 0.2) + 0.1·(1.98, 1.98) = (0.378, 0.378)
Update Biased Second Moment Estimate: v_t = β2·v_{t−1} + (1 − β2)·g_t² = 0.999·(0.004, 0.004) + 0.001·(1.98², 1.98²) = (0.00792, 0.00792)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β1^t) = (0.378, 0.378) / (1 − 0.9²) = (1.989, 1.989)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β2^t) = (0.00792, 0.00792) / (1 − 0.999²) = (3.96, 3.96)
Update Parameters:
x = x − (η / (√v̂_x + ε)) · m̂_x = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) · 1.989 = 0.99 − 0.01·1.0 = 0.98
y = y − (η / (√v̂_y + ε)) · m̂_y = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) · 1.989 = 0.99 − 0.01·1.0 = 0.98
Updated parameters: θ = (0.98, 0.98)
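A minimal NumPy sketch of these Adam updates with the example's settings (two iterations, matching the slides):

```python
import numpy as np

# Adam on f(x, y) = x**2 + y**2, following the worked example
# (eta = 0.01, beta1 = 0.9, beta2 = 0.999, start (1.0, 1.0)).

def grad_f(theta):
    return 2 * theta  # gradient of x^2 + y^2

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)   # first and second moment estimates
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 3):
    g = grad_f(theta)
    m = beta1 * m + (1 - beta1) * g          # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    print(t, theta)  # (0.99, 0.99), then (0.98, 0.98)
```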
References
• Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.7 (2011). https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
• Daniel Villarraga, SYSEN 6800 Fall 2021, AdaGrad, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
• Geoffrey Hinton, "Coursera Neural Networks for Machine Learning, lecture 6", 2018.
• Jason Huang, SysEn 6800 Fall 2020, RMSProp, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=RMSProp
• Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
Thank You
