06 23ECE216 GradientDescent v2
Optimization
Gradient Descent
Objective: Minimize or maximize an objective function f(x), where x can take any value in the domain.
Methods: Techniques like gradient descent, Newton's method, and quasi-Newton methods are
commonly used.
Applications: Often used in machine learning, statistics, and various engineering fields where the
solution space is not restricted.
Constrained Optimization
•Complexity:
•Unconstrained: Generally simpler and faster to solve.
•Constrained: More complex due to the need to satisfy constraints, often requiring
advanced mathematical techniques.
•Solution Space:
•Unconstrained: Solution can be anywhere in the domain of the objective function.
•Constrained: Solution is restricted to the feasible region defined by the constraints.
Gradient Descent
• What's the goal when you are hiking down a mountain?
Note: the gradient is written inside < and > to denote that the gradient is a vector.
Why do we care about gradient?
• Gradient is a pretty powerful tool in calculus.
• Remember:
• In one variable, the derivative gives us the slope of the tangent line.
• In several variables, the gradient points in the direction of the fastest increase of the function.
X(k + 1) = X(k) − α∇J(X), evaluated at X = X(k)
Here X(k + 1) is the next position, X(k) is the current position, α is the step size, ∇J(X) points in the direction of fastest increase, and the minus sign moves us in the opposite direction.
Why α ?
• α is called the Learning Rate or step size. (In some places you may see the symbol η, but they mean the same thing.)
• Which means we want to take baby steps so that we don't overshoot the bottom.
• This is particularly important when we are very close to the minimum.
• A smart choice of α is crucial: when α is too small, it will take our algorithm forever to reach the lowest point, and if α is too big we might overshoot and miss the bottom.
Why the (-) sign?
• The negative sign indicates that we are stepping in the direction
opposite to that of 𝛻𝐽
Consider J(x) = x², so ∇J(x) = 2x, with α = 0.8 and the initial guess x(0) = −4.
Initial: x(0) = −4
Iteration 1: x(1) = x(0) − α∇J(x) at x = x(0)
x(1) = −4 − 0.8 ∗ (2 ∗ (−4)) = −4 + 6.4 = 2.4
Iteration 2: x(2) = x(1) − α∇J(x) at x = x(1)
x(2) = 2.4 − 0.8 ∗ (2 ∗ 2.4) = 2.4 − 3.84 = −1.44
Iteration 3: x(3) = x(2) − α∇J(x) at x = x(2)
x(3) = −1.44 − 0.8 ∗ (2 ∗ (−1.44)) = −1.44 + 2.304 = 0.864
And so on…
Gradient Descent
f(x) = x², step size: 0.8, x(0) = −4
x(1) = 2.4
x(2) = −1.44
x(3) = 0.864
x(4) = −0.5184
x(5) = 0.31104
⋮
x(30) ≈ −8.84296e−07
The iterates oscillate from one side of the minimum to the other and shrink toward x = 0.
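The following short Python sketch (illustrative, not taken from the slides) implements this 1-D update x ← x − α∇J(x) for J(x) = x²; with α = 0.8 and x(0) = −4 it reproduces the iterates above.

# Minimal 1-D gradient descent sketch for J(x) = x^2 (gradient 2x).
# With alpha = 0.8 and x0 = -4 it reproduces the iterates on the slides:
# 2.4, -1.44, 0.864, ... -> about -8.84e-07 after 30 steps.

def grad_J(x):
    return 2 * x  # dJ/dx for J(x) = x^2

def gradient_descent(x0, alpha, num_iters):
    x = x0
    for k in range(num_iters):
        x = x - alpha * grad_J(x)   # step opposite to the gradient
        print(f"x({k + 1}) = {x:.6g}")
    return x

gradient_descent(x0=-4.0, alpha=0.8, num_iters=30)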
Role of learning rate
https://www.naukri.com/code360/library/nesterov-accelerated-gradient
Gradient Descent
(Plots comparing the iterates for step sizes 0.9 and 0.2.)
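As a rough illustration of what those plots show, the short sketch below (illustrative code, not from the slides) reruns the same update on J(x) = x² with a few step sizes: α = 0.2 converges slowly without overshooting, α = 0.9 overshoots back and forth but still converges, and α = 1.1 diverges.

# Illustrative comparison of step sizes for J(x) = x^2 (update: x <- x - alpha * 2x).
for alpha in (0.2, 0.9, 1.1):
    x = -4.0
    for _ in range(10):
        x = x - alpha * 2 * x
    print(f"alpha = {alpha}: x after 10 steps = {x:.6f}")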
Example 2
Q. Find the minimum of the function J(X) = x₁² + x₂² using gradient descent.
Sol.
Let α = 0.4 and the initial guess be X(0) = [1, 1]ᵀ.
∇J(X) = [2x₁, 2x₂]ᵀ
Step 1: X(1) = X(0) − α∇J(X) at X = X(0) = [1, 1]ᵀ − 0.4 ∗ [2, 2]ᵀ = [0.2, 0.2]ᵀ
Step 2: X(2) = X(1) − α∇J(X) at X = X(1) = [0.2, 0.2]ᵀ − 0.4 ∗ [0.4, 0.4]ᵀ = [0.04, 0.04]ᵀ
Example 3
Q. Find the minimum of f(x, y) = (x − 2)² + (y + 3)². Show only the first three iterations.
Consider α = 0.1 and the initial values (x, y) = (0, 0).
Sol.
Iteration 1:
1. Compute the gradient:
∇J(X) = [∂f/∂x, ∂f/∂y]ᵀ = [2(x − 2), 2(y + 3)]ᵀ. Evaluating at X(0) = [0, 0]ᵀ we get
∇J([0, 0]) = [2(0 − 2), 2(0 + 3)]ᵀ = [−4, 6]ᵀ
Example 3 (contd.): f(x, y) = (x − 2)² + (y + 3)²
α = 0.1, initial values (x, y) = (0, 0)
Sol.
Iteration 1 contd.
2. Update the variables:
The update equation is X(k + 1) = X(k) − α∇J(X) at X = X(k)
X(1) = X(0) − α∇J(X) at X = X(0) = [0, 0]ᵀ − 0.1 ∗ [−4, 6]ᵀ = [0.4, −0.6]ᵀ
Iteration 2:
1. Compute the gradient: ∇J([0.4, −0.6]) = [2(0.4 − 2), 2(−0.6 + 3)]ᵀ = [−3.2, 4.8]ᵀ
2. Update the variables:
X(2) = X(1) − α∇J(X) at X = X(1) = [0.4, −0.6]ᵀ − 0.1 ∗ [−3.2, 4.8]ᵀ = [0.72, −1.08]ᵀ
Iteration 3:
1. Compute the gradient: ∇f(0.72, −1.08) = [2(0.72 − 2), 2(−1.08 + 3)]ᵀ = [−2.56, 3.84]ᵀ
2. Update the variables:
X(3) = X(2) − α∇J(X) at X = X(2) = [0.72, −1.08]ᵀ − 0.1 ∗ [−2.56, 3.84]ᵀ = [0.976, −1.464]ᵀ
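A small Python sketch of the same procedure (illustrative; the slides do not include code): with α = 0.1 from (0, 0) it reproduces the iterates above, and swapping in J(X) = x₁² + x₂² with α = 0.4 from (1, 1) reproduces Example 2.

# 2-D gradient descent sketch for f(x, y) = (x - 2)^2 + (y + 3)^2.
# With alpha = 0.1 from (0, 0) the first three iterates match the example:
# (0.4, -0.6), (0.72, -1.08), (0.976, -1.464).

def grad_f(x, y):
    return (2 * (x - 2), 2 * (y + 3))  # (df/dx, df/dy)

def gradient_descent_2d(x, y, alpha, num_iters):
    for k in range(num_iters):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy  # step opposite to the gradient
        print(f"X({k + 1}) = ({x:.3f}, {y:.3f})")
    return x, y

gradient_descent_2d(x=0.0, y=0.0, alpha=0.1, num_iters=3)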
https://www.naukri.com/code360/library/nesterov-accelerated-gradient
Gradient Descent with Momentum
• The update rule for gradient descent with momentum incorporates a
momentum term to the standard gradient descent update rule.
• The momentum helps smooth out the updates and accelerates
convergence, especially when the cost function has a lot of small
oscillations or noisy gradients.
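One common way to write this update uses a velocity vector v, a momentum coefficient β, and the learning rate α:
v(k + 1) = β · v(k) + ∇J(X) at X = X(k)
X(k + 1) = X(k) − α · v(k + 1)
so earlier gradients keep contributing to the current step.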
Example 1
Q. Minimize f(x, y) = (x − 1)² + (y − 2)² using gradient descent with momentum, starting from (0, 0).
Sol.
• Initialization:
• Initial point: (x₀, y₀) = (0, 0)
• Initial velocity: (v_x0, v_y0) = (0, 0)
• Iteration 1:
Step 1.1: Compute the gradient: ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2(x − 1), 2(y − 2)); at (0, 0) this is (−2, −4).
• Iteration 2:
Step 2.1: Compute the gradient: ∇f(0.2, 0.4) = (2(0.2 − 1), 2(0.4 − 2)) = (−1.6, −3.2)
• And so on…
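A minimal Python sketch of gradient descent with momentum on f(x, y) = (x − 1)² + (y − 2)², following the convention above; the values α = 0.1 and β = 0.9 are assumptions (they are not stated in the example), chosen so that the first iterate lands on (0.2, 0.4) as above.

# Gradient descent with momentum for f(x, y) = (x - 1)^2 + (y - 2)^2.
# beta (momentum) and alpha (learning rate) below are assumed values, not
# given on the slide; the first iterate comes out as (0.2, 0.4).

def grad_f(x, y):
    return (2 * (x - 1), 2 * (y - 2))

def momentum_descent(x, y, alpha=0.1, beta=0.9, num_iters=5):
    vx, vy = 0.0, 0.0                      # initial velocity
    for k in range(num_iters):
        gx, gy = grad_f(x, y)
        vx = beta * vx + gx                # accumulate velocity
        vy = beta * vy + gy
        x, y = x - alpha * vx, y - alpha * vy
        print(f"iter {k + 1}: (x, y) = ({x:.3f}, {y:.3f})")
    return x, y

momentum_descent(0.0, 0.0)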
Adagrad
https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Adagrad
• Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm designed to adapt the
learning rate for each parameter based on its historical gradients.
• This makes it particularly useful for dealing with sparse data and features with different
scales.
Step-by-Step Explanation of Adagrad
Step 1: Initialize Parameters
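In symbols, Adagrad keeps a per-parameter running sum of squared gradients and scales each step by it (this is the standard form of the update):
G_t = G_{t−1} + g_t²
θ_t = θ_{t−1} − (η / √(G_t + ε)) · g_t
where g_t is the current gradient and ε is a small constant for numerical stability.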
This formula ensures that parameters with larger accumulated gradients have
smaller updates, while those with smaller accumulated gradients have larger
updates.
Adagrad Example
Q. Consider the objective function 𝑓 𝑥, 𝑦 = 𝑥 2 + 𝑦 2 . Use Adagrad to
find the minimum of this function. Consider the initial point (x0 ,y0) as
(1.0,1.0) and initial learning rate as 0.1.
Sol.
1. Initialization:
• Initial parameters: θ = (x₀, y₀) = (1.0, 1.0)
• Learning rate: η = 0.1
• Accumulator: G = [0, 0]
• Small constant: ε = 10⁻⁸
Adagrad Example
2. Iterations:
Iteration 1:
2.1.1 Compute Gradient: ∇f(x, y) = (2x, 2y); at (1.0, 1.0), g = (2.0, 2.0)
2.1.2 Accumulate Squared Gradients: G = [0, 0] + [2.0², 2.0²] = [4.0, 4.0]
2.1.3 Update Parameters: θ = θ − (η / √(G + ε)) · g = 1.0 − (0.1 / √(4.0 + 10⁻⁸)) ∗ 2.0 = 1.0 − 0.1 = 0.9, so θ = (0.9, 0.9)
Limitations: The learning rate can become very small over time due
to the accumulation of squared gradients, which can slow down
convergence.
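A short Python sketch of the Adagrad update for the example above (f(x, y) = x² + y², η = 0.1, start (1.0, 1.0)); this is an illustrative implementation, not code from the slides.

import math

# Adagrad sketch for f(x, y) = x^2 + y^2, eta = 0.1, start (1.0, 1.0).
# First update: G = [4, 4], theta = 1.0 - (0.1 / sqrt(4)) * 2.0 = 0.9.

def grad_f(theta):
    return [2 * t for t in theta]          # gradient of x^2 + y^2

def adagrad(theta, eta=0.1, eps=1e-8, num_iters=5):
    G = [0.0 for _ in theta]               # accumulated squared gradients
    for k in range(num_iters):
        g = grad_f(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [ti - (eta / math.sqrt(Gi + eps)) * gi
                 for ti, Gi, gi in zip(theta, G, g)]
        print(f"iter {k + 1}: theta = {[round(t, 4) for t in theta]}")
    return theta

adagrad([1.0, 1.0])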
RMSProp
RMSprop (Root Mean Square Propagation)
• It adjusts the learning rate for each parameter based on a moving average
of the squared gradients.
Step-by-Step Explanation of RMSprop
Step 1: Initialize Parameters
Decay Rate ( 𝜷 ): Set a decay rate for the moving average, typically around 0.9.
• For each parameter at time step t, compute the gradient g_t of the loss function with respect to that parameter.
• Maintain the moving average of squared gradients, E[g²]_t = β · E[g²]_{t−1} + (1 − β) · g_t², and update each parameter as
θᵢ = θᵢ − (η / √(E[g²]_t + ε)) · g_t
RMSprop Example
Q. Use RMSprop to minimize f(x, y) = x² + y², starting from (1.0, 1.0).
Sol.
1. Initialization: θ = (x, y) = (1.0, 1.0), η = 0.01, β = 0.9, E[g²] = [0, 0], ε = 10⁻⁸
2. Iteration 1:
Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 1 − (0.01 / √(0.4 + 10⁻⁸)) ∗ 2.0 = 1 − 0.0316 ≈ 0.968
y = y − (η / √(E[g²]_y + ε)) · g_y = 1 − (0.01 / √(0.4 + 10⁻⁸)) ∗ 2.0 = 1 − 0.0316 ≈ 0.968
3. Iteration 2:
Update Parameters:
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.968 − (0.01 / √(0.734 + 10⁻⁸)) ∗ 1.936 = 0.968 − 0.022 ≈ 0.945
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.968 − (0.01 / √(0.734 + 10⁻⁸)) ∗ 1.936 = 0.968 − 0.022 ≈ 0.945
4. Iteration 3:
Update Parameters (with E[g²] = 0.9 ∗ 0.734 + 0.1 ∗ 1.891² ≈ 1.018):
x = x − (η / √(E[g²]_x + ε)) · g_x = 0.945 − (0.01 / √(1.018 + 10⁻⁸)) ∗ 1.891 = 0.945 − 0.019 ≈ 0.926
y = y − (η / √(E[g²]_y + ε)) · g_y = 0.945 − (0.01 / √(1.018 + 10⁻⁸)) ∗ 1.891 = 0.945 − 0.019 ≈ 0.926
• RMSprop effectively adjusts the learning rate for each parameter based on the
moving average of the squared gradients, allowing for more stable and efficient
optimization.
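A short Python sketch of RMSprop on f(x, y) = x² + y², using the settings implied by the worked example (η = 0.01, β = 0.9, start (1.0, 1.0)); illustrative only, the first two iterates come out near 0.968 and 0.946.

import math

# RMSprop sketch for f(x, y) = x^2 + y^2.
# eta = 0.01 and beta = 0.9 are the values implied by the worked example;
# starting from (1.0, 1.0) the first two iterates are ~0.968 and ~0.946.

def grad_f(theta):
    return [2 * t for t in theta]

def rmsprop(theta, eta=0.01, beta=0.9, eps=1e-8, num_iters=3):
    Eg2 = [0.0 for _ in theta]             # moving average of squared gradients
    for k in range(num_iters):
        g = grad_f(theta)
        Eg2 = [beta * e + (1 - beta) * gi * gi for e, gi in zip(Eg2, g)]
        theta = [ti - (eta / math.sqrt(e + eps)) * gi
                 for ti, e, gi in zip(theta, Eg2, g)]
        print(f"iter {k + 1}: theta = {[round(t, 4) for t in theta]}")
    return theta

rmsprop([1.0, 1.0])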
ADAM
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the
advantages of two other popular methods: AdaGrad and RMSprop.
It computes adaptive learning rates for each parameter by estimating the first and second
moments of the gradients.
Step-by-Step Explanation of Adam
Step 1: Initialize Parameters
1.Initialization:
Parameters: (𝜃 = 𝑥, 𝑦 = 1.0, 1.0 )
Learning rate: (𝜂 = 0.01)
Decay rates: (𝛽1 = 0.9), (𝛽2 = 0.999)
First moment vector: (𝑚 = [0, 0])
Second moment vector: (𝑣 = [0, 0])
Small constant: (𝜖 = 10−8 )
Time step: (𝑡 = 0)
ADAM Example
2. Iteration 1:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2 ∗ 1.0, 2 ∗ 1.0) = (2.0, 2.0)
Update Time Step: t = 1
Update Biased First Moment Estimate: m_t = β₁ · m_{t−1} + (1 − β₁) · g_t = 0.9 ∗ (0, 0) + 0.1 ∗ (2.0, 2.0) = (0.2, 0.2)
Update Biased Second Moment Estimate: v_t = β₂ · v_{t−1} + (1 − β₂) · g_t² = 0.999 ∗ (0, 0) + 0.001 ∗ (2.0², 2.0²) = (0.004, 0.004)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β₁ᵗ) = (0.2, 0.2) / (1 − 0.9¹) = (2.0, 2.0)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β₂ᵗ) = (0.004, 0.004) / (1 − 0.999¹) = (4.0, 4.0)
Update Parameters:
x = x − (η / (√v̂_t + ε)) · m̂_t = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) ∗ 2.0 = 1.0 − 0.01 = 0.99
y = y − (η / (√v̂_t + ε)) · m̂_t = 1.0 − (0.01 / (√4.0 + 10⁻⁸)) ∗ 2.0 = 1.0 − 0.01 = 0.99
Updated parameters: θ = (0.99, 0.99)
ADAM Example
3. Iteration 2:
Compute Gradient: ∇f(x, y) = (2x, 2y) = (2 ∗ 0.99, 2 ∗ 0.99) = (1.98, 1.98)
Update Time Step: t = 2
Update Biased First Moment Estimate: m_t = β₁ · m_{t−1} + (1 − β₁) · g_t = 0.9 ∗ (0.2, 0.2) + 0.1 ∗ (1.98, 1.98) = (0.378, 0.378)
Update Biased Second Moment Estimate: v_t = β₂ · v_{t−1} + (1 − β₂) · g_t² = 0.999 ∗ (0.004, 0.004) + 0.001 ∗ (1.98², 1.98²) = (0.00792, 0.00792)
Compute Bias-Corrected First Moment Estimate: m̂_t = m_t / (1 − β₁ᵗ) = (0.378, 0.378) / (1 − 0.9²) = (1.989, 1.989)
Compute Bias-Corrected Second Moment Estimate: v̂_t = v_t / (1 − β₂ᵗ) = (0.00792, 0.00792) / (1 − 0.999²) = (3.96, 3.96)
Update Parameters:
x = x − (η / (√v̂_t + ε)) · m̂_t = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) ∗ 1.989 = 0.99 − 0.01 ≈ 0.98
y = y − (η / (√v̂_t + ε)) · m̂_t = 0.99 − (0.01 / (√3.96 + 10⁻⁸)) ∗ 1.989 = 0.99 − 0.01 ≈ 0.98
Updated parameters: θ = (0.98, 0.98)
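A short Python sketch of the Adam updates above on f(x, y) = x² + y², with η = 0.01, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ as in the initialization; it reproduces θ = (0.99, 0.99) after iteration 1 and ≈ (0.98, 0.98) after iteration 2 (illustrative, not slide code).

import math

# Adam sketch for f(x, y) = x^2 + y^2 with the initialization from the example:
# eta = 0.01, beta1 = 0.9, beta2 = 0.999, eps = 1e-8, start (1.0, 1.0).
# Iteration 1 gives theta = (0.99, 0.99), iteration 2 gives ~(0.98, 0.98).

def grad_f(theta):
    return [2 * t for t in theta]

def adam(theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, num_iters=2):
    m = [0.0 for _ in theta]               # first moment estimate
    v = [0.0 for _ in theta]               # second moment estimate
    for t in range(1, num_iters + 1):
        g = grad_f(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]    # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - (eta / (math.sqrt(vh) + eps)) * mh
                 for ti, vh, mh in zip(theta, v_hat, m_hat)]
        print(f"iter {t}: theta = {[round(ti, 4) for ti in theta]}")
    return theta

adam([1.0, 1.0])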
References
• Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.7 (2011).
• Daniel Villarraga, SYSEN 6800 Fall 2021, AdaGrad, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
• Geoffrey Hinton, "Coursera Neural Networks for Machine Learning, Lecture 6", 2018.
• Jason Huang, SysEn 6800 Fall 2020, RMSProp, Cornell University, https://optimization.cbe.cornell.edu/index.php?title=RMSProp
• Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
Thank You