Chapter 4 - Optimization
Recap: we defined loss functions (cross-entropy loss, SVM loss) and the full/average loss over the training set. How do we find the weights W that minimize this loss? A: Optimization
• Gradient
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• SGD + Momentum
• AdaGrad
• RMSProp
• Adam
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension
The slope in any direction is the dot product of the direction with the gradient
The direction of steepest descent is the negative gradient
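A minimal numpy sketch, assuming an illustrative quadratic loss f and step size h, that evaluates a numeric gradient and checks both statements:

```python
import numpy as np

def f(w):
    # Illustrative quadratic loss; any differentiable scalar function of w works.
    return 0.5 * np.sum(w ** 2) + np.sum(w)

def numeric_gradient(f, w, h=1e-5):
    # Central differences along each dimension give the vector of partial derivatives.
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

w = np.array([0.34, -1.11, 0.78])
grad = numeric_gradient(f, w)

# The slope along a unit direction d is the dot product of d with the gradient.
d = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
slope = (f(w + 1e-5 * d) - f(w - 1e-5 * d)) / (2e-5)
print(np.allclose(slope, d @ grad, atol=1e-6))   # True

# No unit direction has a more negative slope than the negative gradient.
steepest = -grad / np.linalg.norm(grad)
print(steepest @ grad <= d @ grad)               # True
```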
Figure: current weights W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] with loss 1.25347, and the gradient we want, dL/dW = [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …] (some function of the data and W). In practice we will compute dL/dW using backpropagation.
Hyperparameters (gradient descent):
- Weight initialization method
- Number of steps
- Learning rate
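A minimal sketch of the vanilla gradient descent loop with these hyperparameters; grad_fn, the learning rate, and the step count are illustrative assumptions (grad_fn here is the gradient w + 1 of the quadratic loss used above):

```python
import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate=1e-2, num_steps=100):
    # grad_fn(w) is assumed to return dL/dW evaluated on the full training set.
    w = w_init.copy()
    for _ in range(num_steps):
        dw = grad_fn(w)
        w -= learning_rate * dw   # step in the direction of steepest descent
    return w

w_init = np.random.randn(3) * 0.01               # weight initialization method
w_star = gradient_descent(lambda w: w + 1.0,     # gradient of the quadratic loss above
                          w_init,
                          learning_rate=0.1,     # learning rate
                          num_steps=100)         # number of steps
print(w_star)   # close to [-1, -1, -1], the minimum of that loss
```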
Hyperparameters (SGD):
- Weight initialization method
- Number of steps
- Learning rate
- Batch size
- Data sampling
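A sketch of the stochastic (minibatch) version; the synthetic data, the linear-regression gradient, and the batch size are assumptions made only for illustration:

```python
import numpy as np

def sgd(grad_fn, X, y, w_init, learning_rate=1e-2, num_steps=1000, batch_size=32):
    # grad_fn(w, X_batch, y_batch) approximates dL/dW from a sampled minibatch.
    w = w_init.copy()
    N = X.shape[0]
    for _ in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)   # data sampling
        dw = grad_fn(w, X[idx], y[idx])
        w -= learning_rate * dw
    return w

# Illustrative example: linear regression gradient on synthetic data.
def linreg_grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / Xb.shape[0]

X = np.random.randn(500, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w_hat = sgd(linreg_grad, X, y, np.zeros(3), learning_rate=0.1,
            num_steps=2000, batch_size=32)
print(w_hat)   # close to [1.0, -2.0, 0.5]
```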
Problems with SGD:
- Poor conditioning: very slow progress along the shallow dimension, jitter along the steep direction
- Zero gradient (e.g. at a local minimum or saddle point): gradient descent gets stuck

SGD+Momentum
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
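A sketch of the SGD+Momentum update; rho = 0.9 is a typical but assumed decay value for the velocity:

```python
import numpy as np

def sgd_momentum(grad_fn, w_init, learning_rate=1e-2, rho=0.9, num_steps=100):
    # Keep a running "velocity" v: an exponentially decaying sum of past
    # gradients. Step along the velocity instead of the raw gradient.
    w = w_init.copy()
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        v = rho * v + dw
        w -= learning_rate * v
    return w
```

The velocity builds up along consistently shallow directions and averages out the jitter along steep ones, and it can carry the update through points where the instantaneous gradient is zero.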
AdaGrad
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
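A sketch of the AdaGrad update; eps is the usual small constant assumed to avoid division by zero:

```python
import numpy as np

def adagrad(grad_fn, w_init, learning_rate=1e-2, eps=1e-7, num_steps=100):
    # Accumulate the sum of squared gradients per parameter and divide the step
    # by its square root: steep dimensions get smaller effective steps, shallow
    # dimensions get larger ones (adaptive learning rates).
    w = w_init.copy()
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared += dw * dw
        w -= learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w
```

Because grad_squared only grows, the effective step size keeps shrinking over training; RMSProp addresses this with a leaky accumulation.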
RMSProp: a “leaky” AdaGrad that decays the accumulated squared gradients instead of summing them forever
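A sketch of the RMSProp update; decay_rate = 0.99 is a typical but assumed value:

```python
import numpy as np

def rmsprop(grad_fn, w_init, learning_rate=1e-3, decay_rate=0.99,
            eps=1e-7, num_steps=100):
    # Same per-parameter scaling as AdaGrad, but the squared-gradient history
    # "leaks": it is an exponential moving average rather than a running sum.
    w = w_init.copy()
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
        w -= learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w
```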
Compare them:
• Different algorithms will converge to different minima.
• Often, SGD and SGD with momentum will converge to the poorer minimum.
• RMSProp will converge to the global minimum.
https://emiliendupont.github.io/2018/01/24/optimization-visualization/
Adam
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Adam combines the first-moment update of Momentum with the second-moment (leaky squared-gradient) update of RMSProp, and adds bias correction to account for the fact that both moment estimates start at zero.
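A sketch of the full Adam update with bias correction; beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the defaults suggested in the paper:

```python
import numpy as np

def adam(grad_fn, w_init, learning_rate=1e-3, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=100):
    # First moment: momentum-style moving average of gradients.
    # Second moment: RMSProp-style moving average of squared gradients.
    # Bias correction compensates for both moments being initialized at zero.
    w = w_init.copy()
    moment1 = np.zeros_like(w)
    moment2 = np.zeros_like(w)
    for t in range(1, num_steps + 1):
        dw = grad_fn(w)
        moment1 = beta1 * moment1 + (1 - beta1) * dw            # Momentum
        moment2 = beta2 * moment2 + (1 - beta2) * dw * dw       # RMSProp
        m1_hat = moment1 / (1 - beta1 ** t)                     # Bias correction
        m2_hat = moment2 / (1 - beta2 ** t)
        w -= learning_rate * m1_hat / (np.sqrt(m2_hat) + eps)
    return w
```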
Figure: optimization trajectories compared for SGD, SGD+Momentum, RMSProp, and Adam.
Algorithm    | Tracks first moments (Momentum) | Tracks second moments (Adaptive learning rates) | Leaky second moments | Bias correction for moment estimates
SGD          | no                              | no                                              | no                   | no
SGD+Momentum | yes                             | no                                              | no                   | no
AdaGrad      | no                              | yes                                             | no                   | no
RMSProp      | no                              | yes                                             | yes                  | no
Adam         | yes                             | yes                                             | yes                  | yes