HMD-Deep Learning-Lecture 2-2024

2 Gradient-based Optimization Methods

(a) Review of Vector Calculus


(b) Optimization
(c) First-order Method (FOM)
(d) Parameter Estimation
(e) Variants of First-order Method
(f) Loss Functions: Topology, Optimal Solution, Convexity, etc.
(g) Flat Minima and Generalizability
(h) Variants of SGD: Momentum, Nesterov, and Adam
(i) Learning Rate Policies
Review of Calculus: Real-valued Function
o Given a function $f(\mathbf{x})$:
  $f: \mathbb{R}^n \to \mathbb{R}$
  where $\mathbf{x} = [x_1, x_2, x_3, \cdots, x_n]^T \in \mathbb{R}^n$ and $x_i \in \mathbb{R}\ \forall\, i = 1, 2, 3, \cdots, n$
o Second-order polynomial ($n = 1$): $f(x) = 3x^2 - 6x + \frac{9}{4}$
o Residual SOS ($n > 1$): $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{b} \in \mathbb{R}^{m \times 1}$
Review of Calculus: Gradient
o Given a function $f: \mathbb{R}^n \to \mathbb{R}$ where $\mathbf{x} \in \mathbb{R}^n$
o Gradient of $f(\mathbf{x})$:
  $\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}(\mathbf{x}), \frac{\partial f}{\partial x_2}(\mathbf{x}), \cdots, \frac{\partial f}{\partial x_n}(\mathbf{x})\right]^T$ (if all partial derivatives exist)
o Second-order polynomial: $f(x) = 3x^2 - 6x + \frac{9}{4}$ and $f'(x) = 6x - 6$
o Residual SOS1: $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ and $\nabla f(\mathbf{x}) = \mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}) \in \mathbb{R}^n$
o Numerical Gradient: the $i$-th partial derivative can be computed numerically
  $\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{e}_i) - f(\mathbf{x})}{t}$ (if the limit exists)
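The finite-difference definition above maps directly to code. A minimal NumPy sketch (the helper name is illustrative; central differences are used instead of the one-sided quotient for better accuracy):

```python
import numpy as np

def numerical_gradient(f, x, t=1e-6):
    """Approximate each partial derivative of f at x by a central difference."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e_i = np.zeros_like(x, dtype=float)
        e_i[i] = 1.0                                  # standard basis vector e_i
        grad[i] = (f(x + t * e_i) - f(x - t * e_i)) / (2 * t)
    return grad

# Check against f(x) = 3x^2 - 6x + 9/4, whose derivative is 6x - 6
f = lambda x: 3 * x[0]**2 - 6 * x[0] + 9 / 4
print(numerical_gradient(f, np.array([2.0])))  # ≈ [6.0]
```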
Optimization: Minimization Problem
o Given a function 𝑓: ℝ𝑛 → ℝ where 𝒙 ∈ ℝ𝑛
o Minimization problem: Could be minimizing cost, risk, error, loss, etc.

$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$ and $f(\mathbf{x}^*) = \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$
o Minimizer of second-order polynomial: Set $f'(x) = 0$ to solve for $x^*$
  $f'(x) = 6x - 6 = 0 \Rightarrow x^* = 1$
o Minimizer of Residual SOS1: Set $\nabla f(\mathbf{x}) = \mathbf{0}$ to solve for $\mathbf{x}^*$
  $\nabla f(\mathbf{x}) = \mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T\mathbf{A}\mathbf{x} - \mathbf{A}^T\mathbf{b} = \mathbf{0} \Rightarrow \mathbf{x}^* = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}$
  if $(\mathbf{A}^T\mathbf{A})^{-1}$ exists. $\mathbf{x}^*$ is also known as a Least Squares Solution2
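The closed-form least-squares solution can be checked numerically. A small NumPy sketch (the random $\mathbf{A}$ and $\mathbf{b}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))        # A in R^{m×n} with m > n
b = rng.standard_normal(10)             # b in R^m

# Normal-equations solution x* = (A^T A)^{-1} A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq minimizes ||Ax - b||_2 directly and more stably
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_star, x_lstsq))             # both routes agree
print(np.allclose(A.T @ (A @ x_star - b), 0))   # gradient vanishes at x*
```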
Optimization: Iterative Approach
o Minimization problem:

$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$ and $f(\mathbf{x}^*) = \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$

A walk to the optimal solution: iterates $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$ approach $\mathbf{x}^*$ within the neighborhood $\mathcal{U}$.
Optimization: Elements of Iterative Solver
o Initialization
o A good initialization $\mathbf{x}_0$ is one close to the (optimal) solution
o It is quite difficult to come up with an extremely good initialization
o Assume the initialization can be any point in $\mathbb{R}^n$
o Direction
o A unit-norm vector ($\mathbf{p}_k$)
o Points in the direction where the cost function decreases
o Ideally, should be the direction of maximum decrease
o Step size
o Jump size (𝛼𝑘 > 0) along the direction vector
o Approximate jump size is (surprisingly) quite good and more than enough
o Stopping criteria
o Informs us that we have reached an (optimal) solution
o A surrogate for the condition that ∇𝑓 𝒙𝒌 = 0 (i.e., 𝒙𝒌 is a stationary point)
First-order Method: Gradient Descent
o Minimization problem:

$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$ and $f(\mathbf{x}^*) = \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$

o Pseudo-code

Given a starting point 𝒙𝟎 ∈ ℝ𝒏


repeat
1. Determine a descent direction 𝒑𝒌
2. Choose a step size 𝛼𝑘
3. Update 𝒙𝒌+𝟏 = 𝒙𝒌 + 𝛼𝑘 𝒑𝒌
until the stopping criterion is satisfied
First-order Method: Steepest Descent
o Minimization problem:

$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$ and $f(\mathbf{x}^*) = \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$

o Pseudo-code

Given a starting point $\mathbf{x}_0 \in \mathbb{R}^n$ and a tolerance $\epsilon$


repeat
1. Choose steepest descent direction 𝒑𝒌 = −∇𝑓 𝒙𝒌
2. Choose a step size 𝛼𝑘
3. Update 𝒙𝒌+𝟏 = 𝒙𝒌 + 𝛼𝑘 𝒑𝒌
until $\|\nabla f(\mathbf{x}_k)\|_2 \leq \epsilon$
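The steepest-descent loop above can be sketched in NumPy; the fixed step size $\alpha$ and the quadratic test function are illustrative choices:

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until ||grad_f(x_k)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:    # stopping criterion
            break
        x = x - alpha * g               # p_k = -grad f(x_k)
    return x

# f(x) = 3x^2 - 6x + 9/4 has gradient 6x - 6 and minimizer x* = 1
print(steepest_descent(lambda x: 6 * x - 6, np.array([5.0])))  # ≈ [1.0]
```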
Parameter Estimation: Forward Model

A system’s perspective.

[Block diagram: input $x_i$ passes through $f(\cdot\,; \boldsymbol{\theta})$ and a summing junction (noise $\Delta$) to produce $y_i$.]
o Forward model:
  $y_i = f(x_i; \boldsymbol{\theta}) + \Delta$
o Inverse problem:
  Given a collection of data $\{(x_i, y_i)\}_{i=1}^m$, find the optimal $\boldsymbol{\theta}^*$
Parameter Estimation: Mean Sq. Error
o Mean Squared Error (MSE):
o Error is defined as $\mathrm{Error}(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) = \text{measurement} - \text{prediction} = \mathbf{y} - f(\mathbf{x}; \boldsymbol{\theta})$
o Squared error is:
  $\mathrm{SE}(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) = \|\mathbf{y} - f(\mathbf{x}; \boldsymbol{\theta})\|_2^2 = \sum_{i=1}^{m}\left(y_i - f(x_i; \boldsymbol{\theta})\right)^2$
o Mean squared error is:
  $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) = \frac{1}{m}\mathrm{SE} = \frac{1}{m}\|\mathbf{y} - f(\mathbf{x}; \boldsymbol{\theta})\|_2^2 = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - f(x_i; \boldsymbol{\theta})\right)^2$
o Minimization problem:
  $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$
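As a concrete instance of the cost above (the linear model $f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ is an illustrative assumption, not from the slides):

```python
import numpy as np

def mse_loss(theta, x, y):
    """J(theta; x, y) = (1/m) * sum_i (y_i - f(x_i; theta))^2 with f(x) = theta0 + theta1*x."""
    prediction = theta[0] + theta[1] * x
    return np.mean((y - prediction) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x                                 # data generated with theta = (2, 3)
print(mse_loss(np.array([2.0, 3.0]), x, y))       # 0.0 at the true parameters
print(mse_loss(np.array([0.0, 0.0]), x, y))       # 53.5 away from them
```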
Variants of FOM: Batch Gradient Descent
Given
 a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$,
 loss function $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$,
 initial model $\boldsymbol{\theta}_0$,
 step size (learning rate) $\alpha_0$, and
 maximum number of iterations max_epochs
for $k = 1$ : max_epochs
1. Compute a descent direction: 𝒑𝒌 = −∇𝜽 𝐽 𝜽𝑘 ; 𝓓
2. Choose a learning rate: 𝛼𝑘
3. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
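Instantiated for the MSE loss of a simple linear model $f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ (an illustrative choice), the loop above might look like:

```python
import numpy as np

def batch_gradient_descent(x, y, theta0, alpha=0.01, max_epochs=5000):
    """Batch GD: the full gradient over all m examples is used at every update."""
    theta = np.asarray(theta0, dtype=float)
    m = len(x)
    for _ in range(max_epochs):
        residual = y - (theta[0] + theta[1] * x)           # y_i - f(x_i; theta)
        grad = np.array([-2 / m * residual.sum(),          # dJ/dtheta0
                         -2 / m * (residual * x).sum()])   # dJ/dtheta1
        theta = theta - alpha * grad                       # theta <- theta + alpha * p
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(batch_gradient_descent(x, y, [0.0, 0.0]))  # ≈ [2.0, 3.0]
```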
Variants of FOM: Computing Gradient
o Minimization problem: $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$
o Cost Function:
  $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) = \frac{1}{m}\mathrm{SE} = \frac{1}{m}\|\mathbf{y} - f(\mathbf{x}; \boldsymbol{\theta})\|_2^2 = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - f(x_i; \boldsymbol{\theta})\right)^2$
o Analytical solution:
  $\frac{\partial J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})}{\partial \boldsymbol{\theta}} = -\frac{2}{m}\sum_{i=1}^{m}\left(y_i - f(x_i; \boldsymbol{\theta})\right)\cdot\frac{\partial f(x_i; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$
o Numerical gradient: Given $\epsilon = 10^{-4}$
  $\frac{\partial f(\mathbf{x}_i; \theta_j)}{\partial \theta_j} \approx \frac{f(\mathbf{x}_i; \theta_j + \epsilon) - f(\mathbf{x}_i; \theta_j)}{\epsilon}$
Variants of FOM: Properties of BGD
o Computation
o Full gradient computed at each update i.e., ∇𝜽 𝐽 𝜽𝑘 ; 𝓓
o Memory
o Complete Training Dataset (𝓓) must be stored in memory
o Scaling
o Full gradient computation scales with size of data
o Speed
o Batch gradient descent (BGD) is slow for big data problems
o Online learning
o Cannot update with examples on-the-fly
Variants of FOM: Stochastic Grad. Descent
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, loss function $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$,
initial model $\boldsymbol{\theta}_0$, step size (learning rate) $\alpha_0$, and maximum number
of iterations max_epochs
for $k = 1$ : max_epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute a descent direction: 𝒑𝒌 = −∇𝜽 𝐽 𝜽𝑘 ; 𝑥𝑖 , 𝑦𝑖
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
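A sketch of the SGD loop above, again for the illustrative linear model $f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ under MSE:

```python
import numpy as np

def sgd(x, y, theta0, alpha=0.01, max_epochs=200, seed=0):
    """SGD: the data is reshuffled every epoch and theta is updated once per example."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_epochs):
        for i in rng.permutation(len(x)):                 # RandomShuffle(D)
            r = y[i] - (theta[0] + theta[1] * x[i])       # single-example residual
            grad = np.array([-2 * r, -2 * r * x[i]])      # grad of J(theta; x_i, y_i)
            theta = theta - alpha * grad
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(sgd(x, y, [0.0, 0.0]))  # ≈ [2.0, 3.0], via a noisier path than BGD
```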
Variants of FOM: Properties of SGD
o Computation
o Redundant computations are avoided
o Memory
o A single example in 𝓓 needs to be stored in memory
o Scaling
o Gradient computation is not affected by the size of data
o Speed
o Much faster than batch gradient descent (BGD)
o Online learning
o Updates allowed for examples on-the-fly
o Fluctuations
o SGD performs high variance updates
Variants of FOM: Minibatch Grad. Descent
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, number of batches $B$, loss function
$J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$, initial model $\boldsymbol{\theta}_0$, step size (learning rate) $\alpha_0$, and
maximum number of iterations max_epochs
for $k = 1$ : max_epochs
1. RandomShuffle(𝓓)
for 𝑏 = 1: 𝐵
2. Compute a descent direction: $\mathbf{p}_k = -\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_k; \{(x_i, y_i)\}_{i \in B_b})$
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
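The minibatch loop above, sketched for the same illustrative linear model (batch size is an assumed parameter):

```python
import numpy as np

def minibatch_gd(x, y, theta0, batch_size=2, alpha=0.01, max_epochs=500, seed=0):
    """Minibatch GD: one update per batch of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    m = len(x)
    for _ in range(max_epochs):
        order = rng.permutation(m)                      # RandomShuffle(D)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]       # indices of batch B_b
            r = y[idx] - (theta[0] + theta[1] * x[idx])
            grad = np.array([-2 / len(idx) * r.sum(),
                             -2 / len(idx) * (r * x[idx]).sum()])
            theta = theta - alpha * grad
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(minibatch_gd(x, y, [0.0, 0.0]))  # ≈ [2.0, 3.0]
```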
Variants of FOM: Properties of miniBGD
o Memory
o Limited memory footprint depends on the batch size
o Scaling
o Gradient computation scales with the batch size
o Speed
o Sandwiched between batch GD and SGD
o Online learning
o Updates for packets of examples on-the-fly
o Fluctuations
o Minibatch GD updates have less variance compared to SGD
o Minibatch
o Batch sizes typically range from 32 to 512
Variants of FOM: Complexity Trade-off
o Performance
o Accuracy of the updated model
o Time
o Time taken for each update

Computational Complexity Trade-off in Batch GD, SGD, and minibatch GD.3


Loss Function: Topology

(a) and (c) Plateaus


(b) and (d) Global minima
(e) and (g) Local maxima
(f) and (h) Local minima

Topology of loss functions in learning problems.4


Loss Function: Local and Global Optima
𝑥 = −1 strict global maxima
𝑥 = 1 non-strict local minima
𝑥 = 1, 2 non-strict local minima
𝑥 = 1, 2 non-strict local maxima
𝑥 = 2 local maxima
𝑥 = 3 strict local minima
𝑥 = 3 non-strict global minima
𝑥 = 5 strict local maxima
𝑥 = 6.5 strict local minima
𝑥 = 6.5 non-strict global minima
𝑥 = 8 strict local maxima
Topology of a one-dimensional function.5
Loss Functions: Convex Function
o Non-negative curvature over the entire domain
o Every stationary point is a local minima
o Every local minima is a global minima
o Batch Gradient Method converges to global minima
Loss Functions: Concave Function
o Non-positive curvature over the entire domain
o −𝑓 ⋅ is a convex function
o Every stationary point (of −𝑓 ⋅ ) is a local minima
o Every local minima (of −𝑓 ⋅ ) is a global minima
o Batch Gradient Method converges to global minima (of −𝑓 ⋅ )
Loss Functions: Non-Convex Function
o Has non-positive and non-negative curvature
o Stationary point(s) can be a local (or global) minima
o Stationary point(s) can be a local (or global) maxima
o Batch Gradient Method converges to local minima
o SGD can jump to potentially better local minima
Loss Functions: Saddle Points
o Points on the loss surface where:
o one dimension slopes up
o another dimension slopes down
o Saddle points are usually surrounded by:
o plateau of the same error
o gradient ≈ 0 isotropically
o Notoriously hard to escape6
o SGD can be trapped in saddle point
o Sub-optimal local minima for non-convex loss functions
Flat Minima & Generalizability
o Global Minima
o Stationary points are global minima for convex loss functions

o Local Minima
o Stationary points are local minima for non-convex loss functions
o Local minima may be as good as global minima

o Flat Minima7,8,9
o Loss surfaces do not change much between training and testing
cohorts
o Easy to converge
o Less likely to overshoot
o Robust to low precision arithmetic or noise in parameter space
o Superior generalizability
Flat Minima & Generalizability: Sketch

Conceptual sketch of flat and sharp minimum.10


Variants of SGD: Pathological Curvature

Sketch of a loss landscape with pathological curvature. The ravine-like region is marked in blue. Red represents the highest values of the loss function and blue the lowest values.11
Variants of SGD: Pathological Curvature

The surface at the ridge curves much more steeply in the direction of $\mathbf{w}_1$. Gradient descent bounces along the ridges of the ravine and moves much more slowly towards the minima. A smaller learning rate would further slow down convergence.11
Variants of SGD: SGD with Momentum

SGD with momentum.3

o SGD Update: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha_k\left(-\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_k; x_i, y_i)\right)$
o Momentum: $\mathbf{v}_k = \gamma\mathbf{v}_{k-1} + \alpha_k \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_k; x_i, y_i)$
o $\gamma = 0.9$ is the coefficient of momentum
o Exponential average with past gradients
o SGD with Momentum: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{v}_k$
o Accelerates SGD in relevant direction
o Dampens oscillations
Variants of SGD: SGD with Momentum
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, loss function $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$,
initial model $\boldsymbol{\theta}_0$, step size (learning rate) $\alpha_0$, $\mathbf{v}_0 = \mathbf{0}$, $\gamma = 0.9$, and
maximum number of iterations max_epochs
for 𝑘 = 1: max _epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute a direction/gradient: 𝒑𝒌 = +∇𝜽 𝐽 𝜽𝑘 ; 𝑥𝑖 , 𝑦𝑖
3. Choose a learning rate: 𝛼𝑘
4. Exponential average of gradients: 𝒗𝒌 = 𝛾𝒗𝒌−𝟏 + 𝛼𝑘 𝒑𝒌
5. Update: 𝜽𝑘+1 = 𝜽𝑘 − 𝒗𝒌
end
end
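The momentum loop above, sketched for the illustrative linear model $f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ under MSE:

```python
import numpy as np

def sgd_momentum(x, y, theta0, alpha=0.01, gamma=0.9, max_epochs=200, seed=0):
    """SGD with momentum: v_k = gamma*v_{k-1} + alpha*g_k, theta_{k+1} = theta_k - v_k."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(max_epochs):
        for i in rng.permutation(len(x)):
            r = y[i] - (theta[0] + theta[1] * x[i])
            g = np.array([-2 * r, -2 * r * x[i]])  # gradient on example i
            v = gamma * v + alpha * g              # exponential average of gradients
            theta = theta - v
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(sgd_momentum(x, y, [0.0, 0.0]))  # ≈ [2.0, 3.0]
```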
Variants of SGD: Nesterov & Momentum
o Large Updates: Can skip (miss) local minima
o Small Updates: Must take small steps close to minima
o Stationary Point: Local minima have zero gradient i.e., grad = 0
o Ideal March:
o Start with fast march far from local minima where grad ≫ 0
o Followed by slow march as grad → 0
o Requires knowledge of the gradient at a future position
o Prescience: 𝜽𝑘 − 𝛾𝒗𝑘−1
o Nesterov with Momentum: 𝒗𝑘 = 𝛾𝒗𝑘−1 + 𝛼𝑘 ∇𝜽 𝐽 𝜽𝑘 − 𝛾𝒗𝑘−1 ; 𝑥𝑖 , 𝑦𝑖
o SGD with Nesterov and Momentum: 𝜽𝑘+1 = 𝜽𝑘 − 𝒗𝑘
Variants of SGD: Nesterov & Momentum
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, loss function $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$,
initial model $\boldsymbol{\theta}_0$, learning rate $\alpha_0$, $\mathbf{v}_0 = \mathbf{0}$, $\gamma = 0.9$, and max_epochs
for 𝑘 = 1: max _epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute (anticipatory update): $\hat{\boldsymbol{\theta}}_k = \boldsymbol{\theta}_k - \gamma\mathbf{v}_{k-1}$
3. Compute a direction/gradient: $\mathbf{p}_k = +\nabla_{\boldsymbol{\theta}} J(\hat{\boldsymbol{\theta}}_k; x_i, y_i)$
4. Choose a learning rate: 𝛼𝑘
5. Exponential average of gradients: 𝒗𝒌 = 𝛾𝒗𝒌−𝟏 + 𝛼𝑘 𝒑𝒌
6. Update: 𝜽𝑘+1 = 𝜽𝑘 − 𝒗𝒌
end
end
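The anticipatory update is the only change relative to plain momentum; a sketch for the same illustrative linear model:

```python
import numpy as np

def sgd_nesterov(x, y, theta0, alpha=0.01, gamma=0.9, max_epochs=200, seed=0):
    """Nesterov momentum: the gradient is evaluated at the look-ahead point theta - gamma*v."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(max_epochs):
        for i in rng.permutation(len(x)):
            look = theta - gamma * v               # anticipatory update theta_hat
            r = y[i] - (look[0] + look[1] * x[i])  # residual at the look-ahead point
            g = np.array([-2 * r, -2 * r * x[i]])  # grad of J(theta_hat; x_i, y_i)
            v = gamma * v + alpha * g
            theta = theta - v
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(sgd_nesterov(x, y, [0.0, 0.0]))  # ≈ [2.0, 3.0]
```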
Variants of SGD: Nesterov & Momentum

Blue vectors represent momentum. Brown vectors represent jump. Red vectors denote
correction. Green vectors are accumulated gradients (AGs).12

o Remarks
o Momentum method computes the current gradient first and then takes a
big jump in the direction of updated AGs.
o Nesterov with Momentum first makes a big jump in the direction of the
previous AGs and then makes a correction with the new gradient.
o Updates adapt to the slope of the error surface.
o Anticipatory updates prevent over-speeding and accelerate SGD
Variants of SGD: Adaptive Moments
o First Moment: $\mathbf{m}_k = \beta_1\mathbf{m}_{k-1} + (1-\beta_1)\,\mathbf{g}_k$
o Where $\mathbf{g}_k = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_k; x_i, y_i)$
o Exponentially decaying average of past gradients ($\beta_1 = 0.9$)
o Similar to the momentum method
o Second Moment: $\mathbf{v}_k = \beta_2\mathbf{v}_{k-1} + (1-\beta_2)\,\mathbf{g}_k\odot\mathbf{g}_k$
o Uncentered variance
o Exponentially decaying average of past squared gradients
o $\beta_2 = 0.999$ is the default value
o Initialization: $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$
o Unbiased Moments: $\hat{\mathbf{m}}_k = \frac{\mathbf{m}_k}{1-\beta_1^k}$ and $\hat{\mathbf{v}}_k = \frac{\mathbf{v}_k}{1-\beta_2^k}$
o Adam Update: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \frac{\alpha_0}{\sqrt{\hat{\mathbf{v}}_k}+\epsilon}\,\hat{\mathbf{m}}_k$ where $\epsilon\ (= 10^{-8})$ is a smoothing term
Variants of SGD: Adaptive Moments
Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, loss function $J(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y})$, initial model $\boldsymbol{\theta}_0$, learning rate
$\alpha_0$, $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and max_epochs
for 𝑘 = 1: max _epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute gradient: 𝒈𝒌 = +∇𝜽 𝐽 𝜽𝑘 ; 𝑥𝑖 , 𝑦𝑖
3. Compute first moment: $\mathbf{m}_k = \beta_1\mathbf{m}_{k-1} + (1-\beta_1)\,\mathbf{g}_k$
4. Compute second moment: $\mathbf{v}_k = \beta_2\mathbf{v}_{k-1} + (1-\beta_2)\,\mathbf{g}_k\odot\mathbf{g}_k$
5. Unbiased first moment: $\hat{\mathbf{m}}_k = \mathbf{m}_k / (1-\beta_1^k)$
6. Unbiased second moment: $\hat{\mathbf{v}}_k = \mathbf{v}_k / (1-\beta_2^k)$
7. Update: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \frac{\alpha_0}{\sqrt{\hat{\mathbf{v}}_k}+\epsilon}\,\hat{\mathbf{m}}_k$ (element-wise operations)
end
end
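The Adam pseudo-code above, sketched for the illustrative linear model $f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ under MSE (the step size and epoch count are tuning choices, not from the slides):

```python
import numpy as np

def adam(x, y, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8,
         max_epochs=2000, seed=0):
    """Adam with bias-corrected first and second moments, one update per example."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    m1 = np.zeros_like(theta)   # first moment m_k
    m2 = np.zeros_like(theta)   # second moment v_k
    k = 0                       # update counter for the bias correction
    for _ in range(max_epochs):
        for i in rng.permutation(len(x)):
            k += 1
            r = y[i] - (theta[0] + theta[1] * x[i])
            g = np.array([-2 * r, -2 * r * x[i]])
            m1 = beta1 * m1 + (1 - beta1) * g          # decaying average of gradients
            m2 = beta2 * m2 + (1 - beta2) * g * g      # ... of squared gradients
            m1_hat = m1 / (1 - beta1 ** k)             # bias-corrected moments
            m2_hat = m2 / (1 - beta2 ** k)
            theta = theta - alpha * m1_hat / (np.sqrt(m2_hat) + eps)
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(adam(x, y, [0.0, 0.0]))  # approaches [2.0, 3.0]
```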
Learning Rate Policies: Constant Rate
o Constant Learning Rate
  $\alpha_k = \alpha_0\ \forall k \in [0, N]$

Constant policy.13
Learning Rate Policies: Auto-reduce
o Constant Learning Rate with Auto-Reduce14
  $\alpha_k = \alpha_0\,\eta^{j}\ \forall k \in [0, N]$
  where $\eta \in (0,1)$ is the decay rate and
  $j = 0, 1, \cdots$ counts the number of auto-reduce events

Constant policy with Auto-reduce.13
Learning Rate Policies: Polynomial
o Polynomial Schedule15
  $\alpha_k = (\alpha_N - \alpha_0)\left(\frac{k}{N}\right)^{\eta} + \alpha_0\ \forall k \in [0, N]$
  where $\eta > 0$ is the decay rate if $\alpha_0 > \alpha_N$ and
  the growth rate if $\alpha_0 < \alpha_N$

(Left) Polynomial decay policy and (Right) Polynomial growth policy.13
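The schedule is a one-liner in code; a sketch (the decay rate $\eta = 2$ is an illustrative value):

```python
def polynomial_lr(k, N, alpha_0, alpha_N, eta=2.0):
    """alpha_k = (alpha_N - alpha_0) * (k / N)**eta + alpha_0, for k in [0, N]."""
    return (alpha_N - alpha_0) * (k / N) ** eta + alpha_0

# Decay from alpha_0 = 0.1 to alpha_N = 0.001 over N = 100 epochs
print(polynomial_lr(0, 100, 0.1, 0.001))    # 0.1 at the start
print(polynomial_lr(100, 100, 0.1, 0.001))  # 0.001 at the end
```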
Learning Rate Policies: Cyclical Rates
o Cyclical Learning Rate16
  $\alpha_k = (\alpha_{\max} - \alpha_0)\,\eta^{\lfloor k/2I\rfloor}\cdot\frac{2}{\pi}\left|\sin^{-1}\!\left(\sin\frac{\pi k}{2I}\right)\right| + \alpha_0$
  $\forall k \in [0, N]$ where $I > 0$ is the step-size (half-period) and
  $\eta$ is the per-cycle decay rate

(Left) Cyclical learning policy with $\eta = 1$ and (Right) Cyclical-Exponential policy with $\eta < 1$.13
Learning Rate Policies: Cosine Annealing
o Cosine Annealing Schedule17
  $\alpha_k = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi\,\mathrm{mod}(k-1,\,N/M)}{N/M}\right) + 1\right)\ \forall k \in [0, N]$
  where $M$ is the number of cycles

Cosine annealing policy.13
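A direct transcription of the cosine annealing formula:

```python
import math

def cosine_annealing_lr(k, N, alpha_0, M=1):
    """alpha_k = (alpha_0 / 2) * (cos(pi * mod(k-1, N/M) / (N/M)) + 1), M cycles over N epochs."""
    period = N / M
    return alpha_0 / 2 * (math.cos(math.pi * ((k - 1) % period) / period) + 1)

# One cycle over N = 100 epochs starting at alpha_0 = 0.1
print(cosine_annealing_lr(1, 100, 0.1))   # 0.1: cos(0) + 1 = 2
print(cosine_annealing_lr(51, 100, 0.1))  # 0.05: halfway through the cycle
```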


Learning Rate Policies: Making a Choice

Ranking of Learning Rate Policies for the task of Brain Tumor segmentation. Ranking is based
on the Final Ranking Score. SGD optimizer is denoted by (S) and Adam optimizer is denoted by (A).
Polynomial Decay with Adam is the top ranked learning rate policy – optimizer pair.18
Learning Rate Policies: Making a Choice

Ranking of Learning Rate Policies for the task of Cardiac MRI segmentation. Ranking is based on
Final Ranking Score. Polynomial Growth with Adam is the top ranked learning rate policy –
optimizer pair.19
Learning Rate Policies: Remarks I
o General: Choosing a proper learning rate is quite difficult
o Specifications: Decide in advance about scheduler and bounds
o Trapped: Difficult to escape suboptimal local minima for non-convex loss functions
o Small Learning Rate:
o Converges very slowly
o More likely to get stuck in local minima
o Large Learning Rate:
o Adversely impacts convergence
o Fluctuates around local solution
o Loss function can diverge
Learning Rate Policies: Remarks II

(Left) Batch Gradient Descent iterations and (Right) SGD iterations with a large learning rate
(step-size).20
Learning Rate Policies: Remarks III
o Choice of Optimizer: It is the combination of learning rate policy and
optimizer that collaboratively influences predictive performance18.
o SGD vs Adam3:
o SGD with a tuned learning rate and momentum is competitive with Adam
o Adam converges faster but underperforms SGD on some tasks
o Learning Rate and Batch Size9:
o The ratio $\frac{\text{Learning Rate}}{\text{Batch Size}}$ is the key control parameter for SGD
o Large values of $\frac{\text{Learning Rate}}{\text{Batch Size}}$ result in convergence to flat minima
o Large values of $\frac{\text{Learning Rate}}{\text{Batch Size}}$ improve generalization
References (with hyperlinks)
1. K. Petersen and M. Pedersen, The Matrix Cookbook, 2012.
2. R. Aster et al, Parameter estimation and inverse problems, 3rd Edition, Elsevier, 2018.
3. S. Ruder, An overview of gradient descent optimization algorithms, arXiv, 2016.
4. B. Haeffele and R. Vidal, Global Optimality in Structured Matrix Factorization, Johns Hopkins
University.
5. A. Beck, Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB, SIAM, 2014.
6. Y. Dauphin, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, arXiv, 2014.
7. S. Hochreiter and J. Schmidhuber, Flat Minima, Neural Computation, 1997.
8. S. Jastrzebski et al, Finding flatter minima with SGD, ICLR, 2018.
9. S. Jastrzebski et al, Three Factors Influencing Minima in SGD, arXiv, 2018.
10. N. Keskar et al, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp
Minima, ICLR, 2017.
11. A. Kathuria, Intro to optimization in deep learning: Momentum, RMSProp and Adam,
PaperspaceBlog, 2018.
12. G. Hinton, Neural Networks for Machine Learning: Lecture 6a, 2014.
13. S. Bukhari, Impact of Learning Rate Policies on Training a U-Net for Brain Tumor Segmentation,
MS Thesis, LUMS, 2020.
14. Y. Bengio, Practical Recommendations for Gradient-Based Training of Deep Architectures,
Neural Networks: Tricks of the Trade, Springer, 2012.
15. Y. Wu et al, Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural
Networks, IEEE International Conference on Big Data, 2019.
16. L. Smith, Cyclical learning rates for training neural networks, IEEE Winter Conference on
Applications of Computer Vision, 2017.
17. G. Huang et al, Snapshot Ensembles: Train 1, get M for free, arXiv, 2017.
18. S. Bukhari and H. Mohy-ud-Din, A systematic evaluation of learning rate policies in training
CNNs for brain tumor segmentation, Physics in Medicine and Biology, 2021.
19. S. Jabbar et al, Generalizability of CNNs for Cardiac MRI Segmentation and Quantification,
2021.
20. A. Kathuria, Intro to optimization in deep learning: Gradient Descent, PaperspaceBlog, 2018.
