HMD-Deep Learning-Lecture 2-2024
o Second-order polynomial: f(x) = 3x^2 − 6x + 4 and f'(x) = 6x − 6
  Minimizer: f'(x) = 6x − 6 = 0 ⇒ x* = 1
o Residual SOS1: f(x) = (1/2) ‖Ax − b‖_2^2 and ∇f(x) = A^T(Ax − b) ∈ ℝ^n
o Minimizer of Residual SOS1: Set ∇f(x) = 0 and solve for x*:
  ∇f(x) = A^T(Ax − b) = A^T A x − A^T b = 0 ⇒ x* = (A^T A)^{-1} A^T b,
  provided (A^T A)^{-1} exists. x* is also known as a Least Squares Solution2
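As a small illustration (the 4×2 matrix A and the vector b below are made-up), the least squares solution can be computed in NumPy either from the normal equations or, more stably, with np.linalg.lstsq:

import numpy as np

# Made-up overdetermined system A x = b (4 equations, 2 unknowns).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations: x* = (A^T A)^{-1} A^T b, valid when A^T A is invertible.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically safer route (SVD-based), preferred when A^T A is ill-conditioned.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)   # both return the least squares solution x*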
Optimization: Iterative Approach
o Minimization problem: x* = arg min_{x ∈ 𝓤} f(x)
[Figure: iterates x_0, x_1, x_2, x_3, x_4 approaching the optimal solution x* inside the feasible set 𝓤.]
A walk to the optimal solution.
Optimization: Elements of Iterative Solver
o Initialization
o A good initialization 𝒙𝟎 is one close to the (optimal) solution
o It is quite difficult to come up with an extremely good initialization
o Assume initialization is any point in ℝ𝑛
o Direction
o A unit-norm vector (p_k)
o Points in the direction where the cost function decreases
o Ideally, should be the direction of maximum decrease
o Step size
o Jump size (𝛼𝑘 > 0) along the direction vector
o An approximate jump size is (surprisingly) good enough in practice
o Stopping criteria
o Informs us that we have reached an (optimal) solution
o A surrogate for the condition that ∇𝑓 𝒙𝒌 = 0 (i.e., 𝒙𝒌 is a stationary point)
First-order Method: Gradient Descent
o Minimization problem: x* = arg min_x f(x)
o Pseudo-code
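As a minimal sketch of the pseudo-code, here is a plain gradient-descent loop on the earlier quadratic f(x) = 3x^2 − 6x + 4 (minimizer x* = 1), with an arbitrary initialization, a fixed step size, and a stopping criterion on |f'(x)|; the numerical values are illustrative:

# Gradient descent on f(x) = 3x^2 - 6x + 4, whose derivative is f'(x) = 6x - 6.
def grad(x):
    return 6.0 * x - 6.0

x = 5.0         # initialization x_0 (any starting point works for this convex f)
alpha = 0.1     # fixed step size (learning rate)
tol = 1e-6      # stopping criterion: surrogate for f'(x) = 0

for k in range(1000):
    g = grad(x)
    if abs(g) < tol:      # stationary point reached (up to tolerance)
        break
    p = -g                # descent direction p_k
    x = x + alpha * p     # update x_{k+1} = x_k + alpha_k * p_k

print(x)   # converges to the minimizer x* = 1

The four elements of an iterative solver appear explicitly: initialization, direction, step size, and stopping criterion.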
[Block diagram: input x_i passes through the model f(·; θ), a noise term is added, and the measurement y_i is produced.]
o Forward model:
y_i = f(x_i; θ) + Δ
o Inverse problem:
Given a collection of data {(x_i, y_i)}_{i=1}^m, find the optimal θ*
Parameter Estimation: Mean Sq. Error
o Mean Squared Error (MSE):
o Error is defined as Error(θ; x, y) = measurement − prediction = y − f(x; θ)
o Squared error is:
  SE(θ; x, y) = ‖y − f(x; θ)‖_2^2 = Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Mean squared error is:
  J(θ; x, y) = (1/m) SE = (1/m) ‖y − f(x; θ)‖_2^2 = (1/m) Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Minimization problem: θ* = arg min_θ J(θ; x, y)
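As a small illustration (the linear model f(x; θ) = θ_0 + θ_1·x and the data below are made-up, not the lecture's example), the MSE cost J follows directly from the definition above:

import numpy as np

def f(x, theta):
    # Made-up forward model: a line with parameters theta = (theta_0, theta_1).
    return theta[0] + theta[1] * x

def mse(theta, x, y):
    # J(theta; x, y) = (1/m) * sum_i (y_i - f(x_i; theta))^2
    residual = y - f(x, theta)
    return np.mean(residual ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])   # illustrative data
y = np.array([1.1, 2.9, 5.2, 6.8])
print(mse(np.array([1.0, 2.0]), x, y))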
Variants of FOM: Batch Gradient Descent
Given
a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m,
loss function 𝐽 𝜽; 𝒙, 𝒚 ,
initial model 𝜽0 ,
step size (learning rate) 𝛼0 , and
maximum number of iterations max_epochs
for k = 1: max_epochs
1. Compute a descent direction: p_k = −∇_θ J(θ_k; 𝓓)
2. Choose a learning rate: 𝛼𝑘
3. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
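A minimal NumPy sketch of this loop for the made-up linear model from the MSE example above, using the analytic MSE gradient and a fixed learning rate (illustrative values, not the lecture's reference implementation):

import numpy as np

def bgd(x, y, theta0, alpha=0.05, max_epochs=500):
    # Batch GD on J(theta) = (1/m) * sum_i (y_i - theta_0 - theta_1 * x_i)^2.
    theta = theta0.astype(float).copy()
    m = len(x)
    for k in range(max_epochs):
        residual = y - (theta[0] + theta[1] * x)              # y_i - f(x_i; theta)
        grad = -(2.0 / m) * np.array([residual.sum(),         # dJ/dtheta_0
                                      (residual * x).sum()])  # dJ/dtheta_1
        p = -grad                                             # descent direction p_k
        theta = theta + alpha * p                             # update with step size alpha
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(bgd(x, y, np.array([0.0, 0.0])))   # approaches the least squares fit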
Variants of FOM: Computing Gradient
o Minimization problem: θ* = arg min_θ J(θ; x, y)
o Cost Function:
  J(θ; x, y) = (1/m) SE = (1/m) ‖y − f(x; θ)‖_2^2 = (1/m) Σ_{i=1}^m (y_i − f(x_i; θ))^2
o Analytical solution:
  ∂J(θ; x, y)/∂θ = −(2/m) Σ_{i=1}^m (y_i − f(x_i; θ)) · ∂f(x_i; θ)/∂θ
o Numerical gradient: Given ε = 10^{-4},
  ∂f(x_i; θ_j)/∂θ_j ≈ [f(x_i; θ_j + ε) − f(x_i; θ_j)] / ε
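As a small illustration of the same forward-difference recipe, applied here to the made-up line-fit cost J rather than to f itself (ε = 1e-4 as above), a numerical gradient can be used to check the analytical one:

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Forward-difference approximation of dJ/dtheta_j, one coordinate at a time.
    grad = np.zeros_like(theta, dtype=float)
    for j in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[j] += eps
        grad[j] = (J(theta_plus) - J(theta)) / eps
    return grad

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
J = lambda theta: np.mean((y - (theta[0] + theta[1] * x)) ** 2)   # MSE of a line fit
print(numerical_gradient(J, np.array([1.0, 2.0])))   # compare against the analytic gradient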
Variants of FOM: Properties of BGD
o Computation
o The full gradient ∇_θ J(θ_k; 𝓓) is computed at each update
o Memory
o The complete training dataset 𝓓 must be stored in memory
o Scaling
o Full gradient computation scales with the size of the dataset
o Speed
o Batch gradient descent (BGD) is slow for big-data problems
o Online learning
o The model cannot be updated with new examples on the fly
Variants of FOM: Stochastic Grad. Descent
Given a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m, loss function J(θ; x, y),
initial model θ_0, step size (learning rate) α_0, and maximum number
of iterations max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute a descent direction: p_k = −∇_θ J(θ_k; x_i, y_i)
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
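Compared with the batch version, only the inner loop changes: the dataset is shuffled every epoch and the gradient is computed on one example at a time. A minimal sketch under the same made-up linear model and MSE loss (illustrative hyperparameters):

import numpy as np

def sgd(x, y, theta0, alpha=0.02, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    for k in range(max_epochs):
        for i in rng.permutation(len(x)):            # RandomShuffle(D)
            r = y[i] - (theta[0] + theta[1] * x[i])  # residual on a single example
            grad = -2.0 * np.array([r, r * x[i]])    # per-example gradient
            theta = theta - alpha * grad             # update
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(sgd(x, y, np.array([0.0, 0.0])))   # fluctuates around the least squares fit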
Variants of FOM: Properties of SGD
o Computation
o Redundant computations are avoided
o Memory
o A single example in 𝓓 needs to be stored in memory
o Scaling
o Gradient computation is not affected by the size of data
o Speed
o Much faster than batch gradient descent (BGD)
o Online learning
o The model can be updated with new examples on the fly
o Fluctuations
o SGD performs high-variance updates
Variants of FOM: Minibatch Grad. Descent
Given a dataset 𝓓 = {(x_i, y_i)}_{i=1}^m, number of batches B, loss function
J(θ; x, y), initial model θ_0, step size (learning rate) α_0, and
maximum number of iterations max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑏 = 1: 𝐵
2. Compute a descent direction: p_k = −∇_θ J(θ_k; {(x_i, y_i)}_{i ∈ B_b})
3. Choose a learning rate: 𝛼𝑘
4. Update: 𝜽𝑘+1 = 𝜽𝑘 + 𝛼𝑘 𝒑𝒌
end
end
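A minimal sketch of the batching logic (shuffle indices, split them into B nearly equal batches, step on each batch's average gradient), again under the made-up linear model and MSE loss; np.array_split handles uneven batch sizes:

import numpy as np

def minibatch_gd(x, y, theta0, num_batches=2, alpha=0.05, max_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    for k in range(max_epochs):
        idx = rng.permutation(len(x))                    # RandomShuffle(D)
        for batch in np.array_split(idx, num_batches):   # b = 1..B
            xb, yb = x[batch], y[batch]
            r = yb - (theta[0] + theta[1] * xb)          # residuals on the minibatch
            grad = -(2.0 / len(batch)) * np.array([r.sum(), (r * xb).sum()])
            theta = theta - alpha * grad                 # update
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(minibatch_gd(x, y, np.array([0.0, 0.0])))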
Variants of FOM: Properties of miniBGD
o Memory
o Memory footprint is limited and depends on the batch size
o Scaling
o Gradient computation scales with the batch size
o Speed
o Sandwiched between batch GD and SGD
o Online learning
o The model can be updated with small batches of examples on the fly
o Fluctuations
o Minibatch GD updates have lower variance than SGD updates
o Minibatch
o Typical batch sizes range from 32 to 512
Variants of FOM: Complexity Trade-off
o Performance
o Accuracy of the updated model
o Time
o Time taken for each update
o Local Minima
o Stationary points may be local minima (or saddle points) for non-convex loss functions
o Local minima may be as good as global minima
o Flat Minima7,8,9
o Loss surfaces do not change much between training and testing
cohorts
o Easy to converge
o Less likely to overshoot
o Robust to low precision arithmetic or noise in parameter space
o Superior generalizability
Flat Minima & Generalizability: Sketch
The surface at the ridge curves much more steeply in the direction of w_1. Gradient descent
bounces along the ridges of the ravine and moves much more slowly toward the minimum.
A smaller learning rate will slow down convergence.11
Variants of SGD: SGD with Momentum
Blue vectors represent momentum. Brown vectors represent jump. Red vectors denote
correction. Green vectors are accumulated gradients (AGs).12
o Remarks
o Momentum method computes the current gradient first and then takes a
big jump in the direction of updated AGs.
o Nesterov momentum first makes a big jump in the direction of the
previous AGs and then makes a correction with the new gradient.
o Updates are thus adapted to the slope of the error surface.
o Anticipatory updates prevent over-speeding and accelerate SGD
Variants of SGD: Adaptive Moments
o First Moment: m_k = β_1 m_{k−1} + (1 − β_1) g_k
o Where g_k = ∇_θ J(θ_k; x_i, y_i)
o Exponentially decaying average of past gradients (𝛽1 = 0.9)
o Similar to the momentum method
o Second Moment: v_k = β_2 v_{k−1} + (1 − β_2) g_k ⊙ g_k
o Uncentered variance
o Exponentially decaying average of past squared gradients
o 𝛽2 = 0.999 is the default value
o Initialization: 𝒎0 = 𝒗0 = 0
o Unbiased Moments: m̂_k = m_k / (1 − β_1^k) and v̂_k = v_k / (1 − β_2^k)
o Adam Update: θ_{k+1} = θ_k − α_0 · m̂_k / (√v̂_k + ε), where ε (= 10^{-8}) is a smoothing term
Variants of SGD: Adaptive Moments
Given 𝓓 = {(x_i, y_i)}_{i=1}^m, loss function J(θ; x, y), initial model θ_0, learning rate
α_0, m_0 = v_0 = 0, β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}, and max_epochs
for k = 1: max_epochs
1. RandomShuffle(𝓓)
for 𝑖 = 1: 𝑚
2. Compute gradient: g_k = ∇_θ J(θ_k; x_i, y_i)
3. Compute first moment: m_k = β_1 m_{k−1} + (1 − β_1) g_k
4. Compute second moment: v_k = β_2 v_{k−1} + (1 − β_2) g_k ⊙ g_k
5. Unbiased first moment: m̂_k = m_k / (1 − β_1^k)
6. Unbiased second moment: v̂_k = v_k / (1 − β_2^k)
7. Update: θ_{k+1} = θ_k − α_0 · m̂_k / (√v̂_k + ε)   (element-wise operations)
end
end
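A minimal per-example sketch of these steps for the made-up linear model and MSE loss used earlier (data, model, and hyperparameter values are illustrative; element-wise operations throughout):

import numpy as np

def adam(x, y, theta0, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8,
         max_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)                 # first moment m_0 = 0
    v = np.zeros_like(theta)                 # second moment v_0 = 0
    t = 0                                    # update counter for bias correction
    for k in range(max_epochs):
        for i in rng.permutation(len(x)):    # RandomShuffle(D)
            t += 1
            r = y[i] - (theta[0] + theta[1] * x[i])
            g = -2.0 * np.array([r, r * x[i]])        # gradient g_k
            m = beta1 * m + (1 - beta1) * g           # first moment
            v = beta2 * v + (1 - beta2) * g * g       # second moment (uncentered)
            m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
            v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(adam(x, y, np.array([0.0, 0.0])))   # approaches the least squares fit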
Learning Rate Policies: Constant Rate
o Constant Learning Rate
α_k = α_0, ∀k ∈ [0, N]
Constant policy.13
Learning Rate Policies: Auto-reduce
o Constant Learning Rate with Auto-Reduce14
α_k = α_0 η^j, ∀k ∈ [0, N]
where η ∈ (0, 1) is the decay rate and
j = 0, 1, ⋯ counts the number of auto-reduce events
o Cyclical Cosine Annealing:
α_k = (α_0 / 2) [cos(π · mod(k − 1, N/M) / (N/M)) + 1], ∀k ∈ [0, N]
where M is the number of cycles
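As a minimal sketch (the function names and the explicit counter j are illustrative assumptions; in practice j would be advanced by a plateau or validation-loss monitor), the two schedules above can be written as plain functions of the epoch index:

import math

def auto_reduce_lr(alpha0, eta, j):
    # alpha_k = alpha0 * eta^j, with eta in (0, 1) and j the number of auto-reduce events.
    return alpha0 * eta ** j

def cyclical_cosine_lr(alpha0, k, N, M):
    # alpha_k = (alpha0 / 2) * (cos(pi * mod(k - 1, N/M) / (N/M)) + 1), M cycles over N epochs.
    period = N / M
    return 0.5 * alpha0 * (math.cos(math.pi * ((k - 1) % period) / period) + 1.0)

print(auto_reduce_lr(0.1, 0.5, j=2))               # 0.025 after two reductions
print(cyclical_cosine_lr(0.1, k=1, N=100, M=5))    # restarts at alpha0 at the start of each cycle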
Ranking of Learning Rate Policies for the task of Brain Tumor segmentation. Ranking is based
on Final Ranking Score. SGD optimizer is denoted by (S) and Adam optimizer is denoted by (A).
Polynomial Decay with Adam is the top ranked learning rate policy – optimizer pair.18
Learning Rate Policies: Making a Choice
Ranking of Learning Rate Policies for the task of Cardiac MRI segmentation. Ranking is based on
Final Ranking Score. Polynomial Growth with Adam is the top ranked learning rate policy –
optimizer pair.19
Learning Rate Policies: Remarks I
o General: Choosing a proper learning rate is quite difficult
o Specifications: Decide in advance about scheduler and bounds
o Trapped: Difficult to escape suboptimal local minima for non-convex loss functions
o Small Learning Rate:
o Converges very slowly
o More likely to get stuck in local minima
o Large Learning Rate:
o Adversely impacts convergence
o Fluctuates around local solution
o Loss function can diverge
Learning Rate Policies: Remarks II
(Left) Batch Gradient Descent iterations and (Right) SGD iterations with a large learning rate
(step-size).20
Learning Rate Policies: Remarks III
o Choice of Optimizer: The learning rate policy and the optimizer jointly
influence predictive performance18.
o SGD vs Adam3:
o SGD with a tuned learning rate and momentum is competitive with Adam
o Adam converges faster but underperforms SGD on some tasks
o Learning Rate and Batch Size9:
o The ratio Learning Rate / Batch Size is the key control parameter for SGD
o Large values of Learning Rate / Batch Size result in convergence to flat minima
o Large values of Learning Rate / Batch Size improve generalization
References (with hyperlinks)
1. K. Petersen and M. Pedersen, The Matrix Cookbook, 2012.
2. R. Aster et al, Parameter estimation and inverse problems, 3rd Edition, Elsevier, 2018.
3. S. Ruder, An overview of gradient descent optimization algorithms, arXiv, 2016.
4. B. Haeffele and R. Vidal, Global Optimality in Structured Matrix Factorization, Johns Hopkins
University.
5. A. Beck, Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB, SIAM, 2014.
6. Y. Dauphin, Identifying and attacking the saddle point problem in high-dimensional non-
convex optimization, arXiv, 2014.
7. S. Hochreiter and J. Schmidhuber, Flat Minima, Neural Computation, 1997.
8. S. Jastrzebski et al, Finding flatter minima with SGD, ICLR, 2018.
9. S. Jastrzebski et al, Three Factors Influencing Minima in SGD, arXiv, 2018.
10. N. Keskar et al, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp
Minima, ICLR, 2017.
11. A. Kathuria, Intro to optimization in deep learning: Momentum, RMSProp and Adam,
PaperspaceBlog, 2018.
12. G. Hinton, Neural Networks for Machine Learning: Lecture 6a, 2014.
13. S. Bukhari, Impact of Learning Rate Policies on Training a U-Net for Brain Tumor Segmentation,
MS Thesis, LUMS, 2020.
14. Y. Bengio, Practical Recommendations for Gradient-Based Training of Deep Architectures,
Neural Networks: Tricks of the Trade, Springer, 2012.
15. Y. Wu et al, Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural
Networks, IEEE International Conference on Big Data, 2019.
16. L. Smith, Cyclical learning rates for training neural networks, IEEE Winter Conference on
Applications of Computer Vision, 2017.
17. G. Huang et al, Snapshot Ensembles: Train 1, get M for free, arXiv, 2017.
18. S. Bukhari and H. Mohy-ud-Din, A systematic evaluation of learning rate policies in training
CNNs for brain tumor segmentation, Physics in Medicine and Biology, 2021.
19. S. Jabbar et al, Generalizability of CNNs for Cardiac MRI Segmentation and Quantification,
2021.
20. A. Kathuria, Intro to optimization in deep learning: Gradient Descent, PaperspaceBlog, 2018.