HMD-Deep Learning-Lecture 2-2024
o Second-order polynomial: f(x) = 3x^2 − 6x + 4 and f'(x) = 6x − 6
  Minimizer: f'(x) = 6x − 6 = 0 ⇒ x* = 1
o Residual SOS1: f(x) = (1/2) ‖Ax − b‖_2^2 and ∇f(x) = A^T(Ax − b) ∈ ℝ^n
o Minimizer of Residual SOS1: Set ∇f(x) = 0 and solve for x*:
  ∇f(x) = A^T(Ax − b) = A^T A x − A^T b = 0 ⇒ x* = (A^T A)^{-1} A^T b,
  provided (A^T A)^{-1} exists. x* is also known as a Least Squares Solution2
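As a small illustration (the 4×2 matrix A and the vector b below are made-up), the least squares solution can be computed in NumPy either from the normal equations or, more stably, with np.linalg.lstsq:

import numpy as np

# Made-up overdetermined system A x = b (4 equations, 2 unknowns).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations: x* = (A^T A)^{-1} A^T b, valid when A^T A is invertible.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically safer route (SVD-based), preferred when A^T A is ill-conditioned.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)   # both return the least squares solution x*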
Optimization: Iterative Approach
o Minimization problem: x* = arg min_{x ∈ 𝓤} f(x)
[Figure: iterates x_0, x_1, x_2, x_3, x_4 approaching the optimal solution x* inside the feasible set 𝓤.]
A walk to the optimal solution.
Optimization: Elements of Iterative Solver
o Initialization
o A good initialization 𝒙𝟎 is one close to the (optimal) solution
o It is quite difficult to come up with an extremely good initialization
o Assume initialization is any point in ℝ𝑛
o Direction
o A unit-norm vector (p_k)
o Points in the direction where the cost function decreases
o Ideally, should be the direction of maximum decrease
o Step size
o Jump size (𝛼𝑘 > 0) along the direction vector
o An approximate jump size is (surprisingly) good enough in practice
o Stopping criteria
o Informs us that we have reached an (optimal) solution
o A surrogate for the condition that ∇𝑓 𝒙𝒌 = 0 (i.e., 𝒙𝒌 is a stationary point)
First-order Method: Gradient Descent
o Minimization problem: x* = arg min_x f(x)
o Pseudo-code
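As a minimal sketch of the pseudo-code, here is a plain gradient-descent loop on the earlier quadratic f(x) = 3x^2 − 6x + 4 (minimizer x* = 1), with an arbitrary initialization, a fixed step size, and a stopping criterion on |f'(x)|; the numerical values are illustrative:

# Gradient descent on f(x) = 3x^2 - 6x + 4, whose derivative is f'(x) = 6x - 6.
def grad(x):
    return 6.0 * x - 6.0

x = 5.0         # initialization x_0 (any starting point works for this convex f)
alpha = 0.1     # fixed step size (learning rate)
tol = 1e-6      # stopping criterion: surrogate for f'(x) = 0

for k in range(1000):
    g = grad(x)
    if abs(g) < tol:      # stationary point reached (up to tolerance)
        break
    p = -g                # descent direction p_k
    x = x + alpha * p     # update x_{k+1} = x_k + alpha_k * p_k

print(x)   # converges to the minimizer x* = 1

The four elements of an iterative solver appear explicitly: initialization, direction, step size, and stopping criterion.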
[Block diagram: input x_i passes through the model f(·; θ), a noise term is added, and the measurement y_i is produced.]
o Forward model:
y_i = f(x_i; θ) + Δ
o Inverse problem:
Given a collection of data {(x_i, y_i)}_{i=1}^m, find the optimal θ*
Parameter Estimation: Mean Sq. Error
o Mean Squared Error (MSE):
o Error is defined as Error(θ; x, y) = measurement − prediction = y − f(x; θ)
o Squared error is:
  SE(θ; x, y) = ‖y − f(x; θ)‖_2^2 = Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Mean squared error is:
  J(θ; x, y) = (1/m) SE = (1/m) ‖y − f(x; θ)‖_2^2 = (1/m) Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Minimization problem: θ* = arg min_θ J(θ; x, y)
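As a small illustration (the linear model f(x; θ) = θ_0 + θ_1·x and the data below are made-up, not the lecture's example), the MSE cost J follows directly from the definition above:

import numpy as np

def f(x, theta):
    # Made-up forward model: a line with parameters theta = (theta_0, theta_1).
    return theta[0] + theta[1] * x

def mse(theta, x, y):
    # J(theta; x, y) = (1/m) * sum_i (y_i - f(x_i; theta))^2
    residual = y - f(x, theta)
    return np.mean(residual ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])   # illustrative data
y = np.array([1.1, 2.9, 5.2, 6.8])
print(mse(np.array([1.0, 2.0]), x, y))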
Variants of FOM: Batch Gradient Descent
Given
a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m,
loss function 𝐽 𝜽; 𝒙, 𝒚 ,
initial model 𝜽0 ,
step size (learning rate) 𝛼0 , and
maximum number of iterations max_epochs
for k = 1: max_epochs
1. Compute a descent direction: p_k = −∇_θ J(θ_k; 𝓓)
2. Choose a learning rate: 𝛼𝑘
3. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
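A minimal NumPy sketch of this loop for the made-up linear model from the MSE example above, using the analytic MSE gradient and a fixed learning rate (illustrative values, not the lecture's reference implementation):

import numpy as np

def bgd(x, y, theta0, alpha=0.05, max_epochs=500):
    # Batch GD on J(theta) = (1/m) * sum_i (y_i - theta_0 - theta_1 * x_i)^2.
    theta = theta0.astype(float).copy()
    m = len(x)
    for k in range(max_epochs):
        residual = y - (theta[0] + theta[1] * x)              # y_i - f(x_i; theta)
        grad = -(2.0 / m) * np.array([residual.sum(),         # dJ/dtheta_0
                                      (residual * x).sum()])  # dJ/dtheta_1
        p = -grad                                             # descent direction p_k
        theta = theta + alpha * p                             # update with step size alpha
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(bgd(x, y, np.array([0.0, 0.0])))   # approaches the least squares fit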
Variants of FOM: Computing Gradient
o Minimization problem: θ* = arg min_θ J(θ; x, y)
o Cost Function:
  J(θ; x, y) = (1/m) SE = (1/m) ‖y − f(x; θ)‖_2^2 = (1/m) Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Analytical solution:
  ∂J(θ; x, y)/∂θ = −(2/m) Σ_{i=1}^m (y_i − f(x_i; θ)) · ∂f(x_i; θ)/∂θ
o Numerical gradient: Given ε = 10^{-4},
  ∂f(x_i; θ_j)/∂θ_j ≈ [f(x_i; θ_j + ε) − f(x_i; θ_j)] / ε
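As a small illustration of the same forward-difference recipe, applied here to the made-up line-fit cost J rather than to f itself (ε = 1e-4 as above), a numerical gradient can be used to check the analytical one:

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Forward-difference approximation of dJ/dtheta_j, one coordinate at a time.
    grad = np.zeros_like(theta, dtype=float)
    for j in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[j] += eps
        grad[j] = (J(theta_plus) - J(theta)) / eps
    return grad

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
J = lambda theta: np.mean((y - (theta[0] + theta[1] * x)) ** 2)   # MSE of a line fit
print(numerical_gradient(J, np.array([1.0, 2.0])))   # compare against the analytic gradient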
Variants of FOM: Properties of BGD
o Computation
o The full gradient ∇_θ J(θ_k; 𝓓) is computed at each update
o Memory
o The complete training dataset 𝓓 must be stored in memory
o Scaling
o Full gradient computation scales with the size of the dataset
o Speed
o Batch gradient descent (BGD) is slow for big-data problems
o Online learning
o The model cannot be updated with new examples on the fly
Variants of FOM: Stochastic Grad. Descent
Given a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m, loss function J(θ; x, y),
initial model θ_0, step size (learning rate) α_0, and maximum number
of iterations max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute a descent direction: p_k = −∇_θ J(θ_k; x_i, y_i)
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
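Compared with the batch version, only the inner loop changes: the dataset is shuffled every epoch and the gradient is computed on one example at a time. A minimal sketch under the same made-up linear model and MSE loss (illustrative hyperparameters):

import numpy as np

def sgd(x, y, theta0, alpha=0.02, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    for k in range(max_epochs):
        for i in rng.permutation(len(x)):            # RandomShuffle(D)
            r = y[i] - (theta[0] + theta[1] * x[i])  # residual on a single example
            grad = -2.0 * np.array([r, r * x[i]])    # per-example gradient
            theta = theta - alpha * grad             # update
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(sgd(x, y, np.array([0.0, 0.0])))   # fluctuates around the least squares fit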
Variants of FOM: Properties of SGD
o Computation
o Redundant computations are avoided
o Memory
o A single example in 𝓓 needs to be stored in memory
o Scaling
o Gradient computation is not affected by the size of data
o Speed
o Much faster than batch gradient descent (BGD)
o Online learning
o The model can be updated with new examples on the fly
o Fluctuations
o SGD performs high-variance updates
Variants of FOM: Minibatch Grad. Descent
Given a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m, number of batches B, loss function
J(θ; x, y), initial model θ_0, step size (learning rate) α_0, and
maximum number of iterations max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑏 = 1: 𝐵
2. Compute a descent direction: p_k = −∇_θ J(θ_k; {(x_i, y_i)}_{i ∈ B_b})
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
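A minimal sketch of the batching logic (shuffle indices, split them into B nearly equal batches, step on each batch's average gradient), again under the made-up linear model and MSE loss; np.array_split handles uneven batch sizes:

import numpy as np

def minibatch_gd(x, y, theta0, num_batches=2, alpha=0.05, max_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    for k in range(max_epochs):
        idx = rng.permutation(len(x))                    # RandomShuffle(D)
        for batch in np.array_split(idx, num_batches):   # b = 1..B
            xb, yb = x[batch], y[batch]
            r = yb - (theta[0] + theta[1] * xb)          # residuals on the minibatch
            grad = -(2.0 / len(batch)) * np.array([r.sum(), (r * xb).sum()])
            theta = theta - alpha * grad                 # update
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(minibatch_gd(x, y, np.array([0.0, 0.0])))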
Variants of FOM: Properties of miniBGD
o Memory
o Memory footprint is limited and depends on the batch size
o Scaling
o Gradient computation scales with the batch size
o Speed
o Sandwiched between batch GD and SGD
o Online learning
o The model can be updated with small batches of examples on the fly
o Fluctuations
o Minibatch GD updates have lower variance than SGD updates
o Minibatch
o Typical batch sizes range from 32 to 512
Variants of FOM: Complexity Trade-off
o Performance
o Accuracy of the updated model
o Time
o Time taken for each update
o Local Minima
o Stationary points may be local minima (or saddle points) for non-convex loss functions
o Local minima may be as good as global minima
o Flat Minima7,8,9
o Loss surfaces do not change much between training and testing
cohorts
o Easy to converge
o Less likely to overshoot
o Robust to low precision arithmetic or noise in parameter space
o Superior generalizability
Flat Minima & Generalizability: Sketch
The surface at the ridge curves much more steeply in the direction of w_1. Gradient descent
bounces along the ridges of the ravine and moves much more slowly toward the minimum.
A smaller learning rate will slow down convergence.11
Variants of SGD: SGD with Momentum
Blue vectors represent momentum. Brown vectors represent jump. Red vectors denote
correction. Green vectors are accumulated gradients (AGs).12
o Remarks
o Momentum method computes the current gradient first and then takes a
big jump in the direction of updated AGs.
o Nesterov momentum first makes a big jump in the direction of the
previous AGs and then makes a correction with the new gradient.
o Updates are thus adapted to the slope of the error surface.
o Anticipatory updates prevent over-speeding and accelerate SGD
Variants of SGD: Adaptive Moments
o First Moment: m_k = β_1 m_{k−1} + (1 − β_1) g_k
o Where g_k = ∇_θ J(θ_k; x_i, y_i)
o Exponentially decaying average of past gradients (𝛽1 = 0.9)
o Similar to the momentum method
o Second Moment: v_k = β_2 v_{k−1} + (1 − β_2) g_k ⊙ g_k
o Uncentered variance
o Exponentially decaying average of past squared gradients
o 𝛽2 = 0.999 is the default value
o Initialization: 𝒎0 = 𝒗0 = 0
o Unbiased Moments: m̂_k = m_k / (1 − β_1^k) and v̂_k = v_k / (1 − β_2^k)
o Adam Update: θ_{k+1} = θ_k − α_0 · m̂_k / (√v̂_k + ε), where ε (= 10^{-8}) is a smoothing term
Variants of SGD: Adaptive Moments
Given 𝓓 = {(x_i, y_i)}_{i=1}^m, loss function J(θ; x, y), initial model θ_0, learning rate
α_0, m_0 = v_0 = 0, β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}, and max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute gradient: g_k = ∇_θ J(θ_k; x_i, y_i)
3. Compute first moment: m_k = β_1 m_{k−1} + (1 − β_1) g_k
4. Compute second moment: v_k = β_2 v_{k−1} + (1 − β_2) g_k ⊙ g_k
5. Unbiased first moment: m̂_k = m_k / (1 − β_1^k)
6. Unbiased second moment: v̂_k = v_k / (1 − β_2^k)
7. Update: θ_{k+1} = θ_k − α_0 · m̂_k / (√v̂_k + ε)   (element-wise operations)
end
end
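A minimal per-example sketch of these steps for the made-up linear model and MSE loss used earlier (data, model, and hyperparameter values are illustrative; element-wise operations throughout):

import numpy as np

def adam(x, y, theta0, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8,
         max_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)                 # first moment m_0 = 0
    v = np.zeros_like(theta)                 # second moment v_0 = 0
    t = 0                                    # update counter for bias correction
    for k in range(max_epochs):
        for i in rng.permutation(len(x)):    # RandomShuffle(D)
            t += 1
            r = y[i] - (theta[0] + theta[1] * x[i])
            g = -2.0 * np.array([r, r * x[i]])        # gradient g_k
            m = beta1 * m + (1 - beta1) * g           # first moment
            v = beta2 * v + (1 - beta2) * g * g       # second moment (uncentered)
            m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
            v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(adam(x, y, np.array([0.0, 0.0])))   # approaches the least squares fit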
Learning Rate Policies: Constant Rate
o Constant Learning Rate
α_k = α_0, ∀k ∈ [0, N]
Constant policy.13
Learning Rate Policies: Auto-reduce
o Constant Learning Rate with Auto-Reduce14
α_k = α_0 η^j, ∀k ∈ [0, N]
where η ∈ (0, 1) is the decay rate and
j = 0, 1, ⋯ counts the number of auto-reduce events
o Cyclical Cosine Annealing:
α_k = (α_0 / 2) [cos(π · mod(k − 1, N/M) / (N/M)) + 1], ∀k ∈ [0, N]
where M is the number of cycles
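As a minimal sketch (the function names and the explicit counter j are illustrative assumptions; in practice j would be advanced by a plateau or validation-loss monitor), the two schedules above can be written as plain functions of the epoch index:

import math

def auto_reduce_lr(alpha0, eta, j):
    # alpha_k = alpha0 * eta^j, with eta in (0, 1) and j the number of auto-reduce events.
    return alpha0 * eta ** j

def cyclical_cosine_lr(alpha0, k, N, M):
    # alpha_k = (alpha0 / 2) * (cos(pi * mod(k - 1, N/M) / (N/M)) + 1), M cycles over N epochs.
    period = N / M
    return 0.5 * alpha0 * (math.cos(math.pi * ((k - 1) % period) / period) + 1.0)

print(auto_reduce_lr(0.1, 0.5, j=2))               # 0.025 after two reductions
print(cyclical_cosine_lr(0.1, k=1, N=100, M=5))    # restarts at alpha0 at the start of each cycle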
Ranking of Learning Rate Policies for the task of Brain Tumor segmentation. Ranking is based
on Final Ranking Score. SGD optimizer is denoted by (S) and Adam optimizer is denoted by (A).
Polynomial Decay with Adam is the top ranked learning rate policy – optimizer pair.18
Learning Rate Policies: Making a Choice
Ranking of Learning Rate Policies for the task of Cardiac MRI segmentation. Ranking is based on
Final Ranking Score. Polynomial Growth with Adam is the top ranked learning rate policy –
optimizer pair.19
Learning Rate Policies: Remarks I
o General: Choosing a proper learning rate is quite difficult
o Specifications: Decide in advance about scheduler and bounds
o Trapped: Difficult to escape suboptimal local minima for non-convex loss functions
o Small Learning Rate:
o Converges very slowly
o More likely to get stuck in local minima
o Large Learning Rate:
o Adversely impacts convergence
o Fluctuates around local solution
o Loss function can diverge
Learning Rate Policies: Remarks II
(Left) Batch Gradient Descent iterations and (Right) SGD iterations with a large learning rate
(step-size).20
Learning Rate Policies: Remarks III
o Choice of Optimizer: The learning rate policy and the optimizer jointly
influence predictive performance18.
o SGD vs Adam3:
o SGD with a tuned learning rate and momentum is competitive with Adam
o Adam converges faster but underperforms SGD on some tasks
o Learning Rate and Batch Size9:
o The ratio Learning Rate / Batch Size is the key control parameter for SGD
o Large values of Learning Rate / Batch Size result in convergence to flat minima
o Large values of Learning Rate / Batch Size improve generalization
References (with hyperlinks)
1. K. Petersen and M. Pedersen, The Matrix Cookbook, 2012.
2. R. Aster et al, Parameter estimation and inverse problems, 3rd Edition, Elsevier, 2018.
3. S. Ruder, An overview of gradient descent optimization algorithms, arXiv, 2016.
4. B. Haeffele and R. Vidal, Global Optimality in Structured Matrix Factorization, Johns Hopkins
University.
5. A. Beck, Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB, SIAM, 2014.
6. Y. Dauphin, Identifying and attacking the saddle point problem in high-dimensional non-
convex optimization, arXiv, 2014.
7. S. Hochreiter and J. Schmidhuber, Flat Minima, Neural Computation, 1997.
8. S. Jastrzebski et al, Finding flatter minima with SGD, ICLR, 2018.
9. S. Jastrzebski et al, Three Factors Influencing Minima in SGD, arXiv, 2018.
10. N. Keskar et al, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp
Minima, ICLR, 2017.
11. A. Kathuria, Intro to optimization in deep learning: Momentum, RMSProp and Adam,
PaperspaceBlog, 2018.
12. G. Hinton, Neural Networks for Machine Learning: Lecture 6a, 2014.
13. S. Bukhari, Impact of Learning Rate Policies on Training a U-Net for Brain Tumor Segmentation,
MS Thesis, LUMS, 2020.
14. Y. Bengio, Practical Recommendations for Gradient-Based Training of Deep Architectures,
Neural Networks: Tricks of the Trade, Springer, 2012.
15. Y. Wu et al, Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural
Networks, IEEE International Conference on Big Data, 2019.
16. L. Smith, Cyclical learning rates for training neural networks, IEEE Winter Conference on
Applications of Computer Vision, 2017.
17. G. Huang et al, Snapshot Ensembles: Train 1, get M for free, arXiv, 2017.
18. S. Bukhari and H. Mohy-ud-Din, A systematic evaluation of learning rate policies in training
CNNs for brain tumor segmentation, Physics in Medicine and Biology, 2021.
19. S. Jabbar et al, Generalizability of CNNs for Cardiac MRI Segmentation and Quantification,
2021.
20. A. Kathuria, Intro to optimization in deep learning: Gradient Descent, PaperspaceBlog, 2018.