Chapter 4 - Optimization

Optimization

With permission from:
Justin Johnson, EECS 498-007 / 598-005
Deep Learning for Computer Vision
University of Michigan

Zafri Baharuddin, EEEB4023/ECEB463 AI, The Energy University, 9/10/2023


Last Time: Linear Classifiers

Algebraic Viewpoint: f(x, W) = Wx + b
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up high-dimensional space
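As a quick reminder of the algebraic viewpoint, here is a minimal score computation in NumPy; the shapes (10 classes, 3072-dimensional flattened inputs) and random values are illustrative only, not data from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072)) * 0.01   # one row (one template) per class
b = np.zeros(10)
x = rng.standard_normal(3072)                # one flattened input image

scores = W @ x + b                           # f(x, W) = Wx + b: one score per class
print(scores.shape)                          # (10,)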



Summary: Loss Functions quantify preferences

• We have some dataset of (x, y)
• We have a score function: Linear Classifier
• We have a loss function: cross-entropy loss, SVM loss, full/average loss

Q: How do we find the best W?
A: Optimization


Optimization

• Gradient
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• SGD + Momentum
• AdaGrad
• RMSProp
• Adam

Follow the slope

In one dimension, the derivative of a function gives the slope:

df(x)/dx = lim_{h→0} (f(x + h) - f(x)) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each
dimension.

The slope in any direction is the dot product of the direction with the gradient.
The direction of steepest descent is the negative gradient.
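A small numerical illustration of the last two points (the gradient values here are made up for illustration):

```python
import numpy as np

# Hypothetical gradient of the loss at the current W (illustrative values only)
grad = np.array([-2.5, 0.6, 0.0])

# Slope in a given unit direction = dot(direction, gradient)
direction = np.array([1.0, 0.0, 0.0])
print(np.dot(direction, grad))       # -2.5: moving along +w1 decreases the loss

# The direction of steepest descent is the negative gradient (normalized)
steepest = -grad / np.linalg.norm(grad)
print(np.dot(steepest, grad))        # the most negative slope achievable: -||grad||
```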



Numeric gradient: evaluate the limit definition with a small h

df(W)/dW = lim_{h→0} (f(W + h) - f(W)) / h,  here with h = 0.0001

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25322
First entry of dL/dW: (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25353
Second entry of dL/dW: (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347
Third entry of dL/dW: (1.25347 - 1.25347) / 0.0001 = 0.0

Repeating this for every dimension fills in the gradient:
dL/dW = [-2.5, 0.6, 0.0, ?, ?, ?, ?, ?, ?, ...]

Numeric Gradient:
- Slow: O(#dimensions) loss evaluations per gradient
- Approximate
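A minimal sketch of this procedure in Python/NumPy; the loss function `f` and the step `h` are placeholders, not code from the lecture:

```python
import numpy as np

def numeric_gradient(f, W, h=1e-4):
    """Approximate dL/dW by perturbing one dimension of W at a time."""
    grad = np.zeros_like(W)
    base = f(W)                      # loss at the current W
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h          # nudge a single dimension
        grad.flat[i] = (f(W) - base) / h
        W.flat[i] = old              # restore
    return grad
```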


Analytic Gradient: Loss is a function of W

We want the gradient of the loss with respect to the weights, dL/dW.

Use calculus to compute an analytic gradient.


current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

dL/dW = ... (some function of the data and W)

gradient dL/dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]

(In practice we will compute dL/dW using backpropagation)


Computing Gradients

• Numeric gradient: approximate, slow, easy to write


• Analytic gradient: exact, fast, error-prone

In practice: Always use the analytic gradient, but check your implementation
with the numerical gradient. This is called a gradient check.
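A sketch of such a gradient check, reusing the `numeric_gradient` helper sketched above; the relative-error threshold and names are illustrative, not from the lecture:

```python
import numpy as np

def gradient_check(f, analytic_grad, W, h=1e-4, tol=1e-5):
    """Compare an analytic gradient against the numeric approximation."""
    num_grad = numeric_gradient(f, W, h)   # numeric_gradient from the sketch above
    # Relative error is more meaningful than absolute error across scales
    denom = np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    rel_error = np.abs(num_grad - analytic_grad) / denom
    return rel_error.max() < tol
```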



Gradient Descent

Iteratively step in the direction of the negative gradient
(the direction of local steepest descent).

Hyperparameters:
- Weight initialization method
- Number of steps
- Learning rate

[Figure: loss contours over (W_1, W_2); starting from the original W, each step moves in the negative gradient direction.]


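A minimal sketch of this loop; the `loss_and_grad` callback, initial weights, step count, and learning rate are placeholders, not values from the lecture:

```python
import numpy as np

def gradient_descent(loss_and_grad, W0, learning_rate=1e-2, num_steps=100):
    """Vanilla (full-batch) gradient descent.

    loss_and_grad(W) is assumed to return (loss, dL/dW) over the whole dataset.
    """
    W = W0.copy()
    for _ in range(num_steps):
        loss, grad = loss_and_grad(W)
        W -= learning_rate * grad     # step along the negative gradient
    return W
```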


Batch Gradient Descent / Full-Batch Gradient Descent

The loss is an average over all N training examples, so each gradient step requires a sum over the whole dataset. This full sum is expensive when N is large!


Stochastic Gradient Descent (SGD)

The full sum is expensive when N is large!
Approximate the sum using a minibatch of examples;
minibatch sizes of 32 / 64 / 128 are common.

Hyperparameters:
- Weight initialization
- Number of steps
- Learning rate
- Batch size
- Data sampling
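A minimal minibatch SGD sketch under the same assumptions as the gradient descent function above; the minibatch loss/gradient callback and dataset arrays are placeholders:

```python
import numpy as np

def sgd(loss_and_grad, W0, X, y, learning_rate=1e-2,
        batch_size=64, num_steps=1000, seed=0):
    """Minibatch SGD: each step approximates the full sum with a random minibatch.

    loss_and_grad(W, X_batch, y_batch) is assumed to return (loss, dL/dW)
    averaged over the minibatch.
    """
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(num_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # data sampling
        loss, grad = loss_and_grad(W, X[idx], y[idx])
        W -= learning_rate * grad
    return W
```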


Problems with SGD

What if the loss changes quickly in one direction and slowly in another?
What does gradient descent do?

Very slow progress along the shallow dimension, jitter along the steep direction.


Problems with SGD

What if the loss function has a local minimum or saddle point?

Zero gradient; gradient descent gets stuck.

[Figure: a 1-D loss curve with a local minimum, and another with a saddle point.]


Problems with SGD

Our gradients come from minibatches, so they can be noisy!

[Figure: a noisy SGD trajectory zig-zagging toward the minimum.]


SGD + Momentum

[Update rules for SGD and SGD+Momentum shown side by side in the original slide.]

- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho = 0.9 or 0.99

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

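A sketch of the standard SGD+Momentum update; this is the common formulation of the technique, not a reproduction of the lecture's exact notation:

```python
def sgd_momentum_step(w, dw, v, learning_rate=1e-2, rho=0.9):
    """One SGD+Momentum step: v is a leaky running mean of gradients ("velocity")."""
    v = rho * v + dw                  # rho acts as friction on the accumulated velocity
    w = w - learning_rate * v         # step along the velocity, not the raw gradient
    return w, v
```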


SGD + Momentum

[Figure: SGD vs SGD+Momentum trajectories on three problem cases: local minima / saddle points, poor conditioning, and gradient noise.]

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013


AdaGrad

Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.

"Per-parameter learning rates" or "adaptive learning rates"

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011




AdaGrad

Q: What happens with AdaGrad?
A: Progress along "steep" directions is damped;
progress along "flat" directions is accelerated.
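A sketch of the standard AdaGrad update described above; the `eps` constant for numerical stability and the default learning rate are assumptions:

```python
import numpy as np

def adagrad_step(w, dw, grad_squared, learning_rate=1e-2, eps=1e-7):
    """One AdaGrad step: scale each dimension by its historical sum of squared gradients."""
    grad_squared = grad_squared + dw * dw                    # per-dimension history
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```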
RMSProp: AdaGrad + decay rate

RMSProp replaces AdaGrad's running sum of squared gradients with a leaky (exponentially decaying) average, so the effective step size does not keep shrinking forever.

Tieleman and Hinton, 2012
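A sketch of the corresponding RMSProp update; the `decay_rate` and `eps` defaults are assumptions:

```python
import numpy as np

def rmsprop_step(w, dw, grad_squared, learning_rate=1e-3, decay_rate=0.99, eps=1e-7):
    """One RMSProp step: a leaky average of squared gradients instead of AdaGrad's full sum."""
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```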


RMSProp

[Figure: optimization trajectories of SGD, SGD+Momentum, and RMSProp on the same loss surface.]


RMSProp

Compare them:
• Different algorithms will converge to different minima.
• Often, SGD and SGD with momentum will converge to the poorer minimum.
• In this visualization, RMSProp converges to the global minimum.

https://emiliendupont.github.io/2018/01/24/optimization-visualization/


Adam: RMSProp + Momentum

[Adam update shown in the original slide, with its momentum and RMSProp components highlighted and compared against SGD+Momentum and RMSProp.]

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015




Adam: RMSProp + Momentum

Q: What happens at t = 0? (Assume beta2 = 0.999.)
A: The second-moment estimate is initialized to zero, so after the first update it is still very small; dividing by its square root can make the very first step extremely large. The bias correction below addresses this.

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015


Adam: RMSProp + Momentum

[Adam update with its momentum, RMSProp, and bias-correction terms highlighted in the original slide.]

Bias correction: for the fact that first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3, 5e-4, or 1e-4 is a great starting point for many models!
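A sketch of the full Adam update with bias correction, following the standard formulation from the paper; the defaults and `eps` are assumptions:

```python
import numpy as np

def adam_step(w, dw, m1, m2, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (m1) + RMSProp-style second moment (m2) + bias correction.

    t is the 1-based step counter; bias correction undoes the zero initialization of m1 and m2.
    """
    m1 = beta1 * m1 + (1 - beta1) * dw            # leaky average of gradients (momentum)
    m2 = beta2 * m2 + (1 - beta2) * dw * dw       # leaky average of squared gradients
    m1_hat = m1 / (1 - beta1 ** t)                # bias correction
    m2_hat = m2 / (1 - beta2 ** t)
    w = w - learning_rate * m1_hat / (np.sqrt(m2_hat) + eps)
    return w, m1, m2
```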


Adam

[Figure: optimization trajectories of SGD, SGD+Momentum, RMSProp, and Adam on the same loss surface.]


Optimization Algorithm Comparison

Algorithm     | Tracks first moments (Momentum) | Tracks second moments (Adaptive learning rates) | Leaky second moments | Bias correction
SGD           | no                              | no                                              | no                   | no
SGD+Momentum  | yes                             | no                                              | no                   | no
AdaGrad       | no                              | yes                                             | no                   | no
RMSProp       | no                              | yes                                             | yes                  | no
Adam          | yes                             | yes                                             | yes                  | yes


Summary

In practice:
• Adam is a good default choice in many cases
• SGD+Momentum can outperform Adam but may require more tuning

• Use Linear Models for image classification problems
• Use Loss Functions to express preferences over different choices of weights
• Use Stochastic Gradient Descent to minimize our loss functions and train the model


Next time:
Neural Networks
