Optimization Techniques (SGD Alternatives)
optimization algorithms
HYUNG IL KOO
Based on
http://sebastianruder.com/optimizing-gradient-descent/
Problem Statement
• Machine Learning Optimization Problem
• Training samples: $(x^{(i)}, y^{(i)})$
• Challenges
• Gradient descent optimization algorithms
• Momentum
• Adaptive Gradient
• Visualization
Batch Gradient Descent
• Properties
• Very slow
• Intractable for datasets that don't fit in memory
• No online learning
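As a concrete illustration of these trade-offs, a minimal NumPy sketch of batch gradient descent is given below; the toy least-squares problem, data, and learning rate are illustrative assumptions, not taken from the slides.

import numpy as np

# Toy least-squares problem (illustrative): minimize J(theta) = ||X @ theta - y||^2 / (2n)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 training samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.1                                          # learning rate

for epoch in range(100):
    grad = X.T @ (X @ theta - y) / len(y)          # gradient computed over the FULL training set
    theta -= eta * grad                            # a single parameter update per full pass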
Stochastic Gradient Descent (SGD)
• Performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$
• Properties:
• Faster
• Online learning
• Heavy fluctuation
• Ability to jump to new (and potentially better) local minima
• Complicated convergence behavior due to overshooting
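For comparison, a minimal sketch of SGD on the same illustrative least-squares problem; one update is made per training example rather than per full pass, which is what produces the heavy fluctuation.

import numpy as np

# Same toy least-squares setup as in the batch sketch (illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.01                                         # smaller learning rate to tame the fluctuation

for epoch in range(20):
    for i in rng.permutation(len(y)):              # shuffle, then visit one example at a time
        grad_i = X[i] * (X[i] @ theta - y[i])      # gradient from the single pair (x_i, y_i)
        theta -= eta * grad_i                      # one parameter update per training example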
SGD fluctuation
Batch Gradient vs SGD
Momentum
• Properties
• Fast convergence
• Less oscillation
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way (until it reaches its terminal velocity if there is air resistance).
• The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
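In symbols (following the formulation in the referenced blog post), momentum keeps a velocity $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ and updates $\theta \leftarrow \theta - v_t$, with $\gamma$ typically around 0.9. The sketch below applies this to an illustrative ill-conditioned quadratic (a stand-in for a ravine); the objective and step sizes are assumptions chosen for demonstration only.

import numpy as np

def grad_quadratic(theta):
    # Gradient of J(theta) = 0.5 * (10 * theta_0^2 + theta_1^2), an illustrative "ravine"
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
eta, gamma = 0.01, 0.9                             # gamma ~ 0.9 is a common choice

for step in range(300):
    g = grad_quadratic(theta)
    velocity = gamma * velocity + eta * g          # velocity grows along consistently-signed gradients
    theta = theta - velocity                       # update with the velocity, not the raw gradient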
Momentum
• The momentum term is also useful in spaces with long
ravines characterized by sharp curvature across the ravine
and a gently sloping floor
• Sharp curvature tends to cause divergent oscillations across
the ravine
• To avoid this problem, we could decrease the learning rate, but this is too
slow
• The momentum term filters out the high curvature and allows the effective weight
steps to be bigger
• It turns out that ravines are not uncommon in optimization problems, so the use of
momentum can be helpful in many situations
• However, a momentum term can hurt when the search is close to the minimum (think of the error surface as a bowl)
• As the network approaches the bottom of the error surface, it builds enough
momentum to propel the weights in the opposite direction, creating an
undesirable oscillation that results in slower convergence
Smarter Ball?
NAG (Nesterov Accelerated Gradient)
• Nesterov accelerated gradient improves on the momentum method
• It evaluates the gradient at an approximation of the next position of the parameters (a look-ahead position), rather than at the current position
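A sketch of the look-ahead idea, reusing the illustrative quadratic from the momentum sketch: the gradient is evaluated at $\theta - \gamma v$ (the approximate next position) rather than at the current $\theta$.

import numpy as np

def grad_quadratic(theta):
    # Same illustrative quadratic "ravine" as in the momentum sketch
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
eta, gamma = 0.01, 0.9

for step in range(300):
    lookahead = theta - gamma * velocity           # approximate next position of the parameters
    g = grad_quadratic(lookahead)                  # gradient at the look-ahead point, not at theta
    velocity = gamma * velocity + eta * g
    theta = theta - velocity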
Adaptive Gradient Methods
• Methods
• AdaGrad (Adaptive Gradient Method)
• AdaDelta
• RMSProp (Root Mean Square Propagation)
• Adam (Adaptive Moment Estimation)
• These methods use a different learning rate for each parameter $\theta_i \in \mathbb{R}$ at every time step $t$.
• For brevity, we write $g_i^{(t)}$ for the gradient of the objective function w.r.t. $\theta_i$ at time step $t$, so that plain SGD reads
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \cdot g_i^{(t)}$$
• These methods modify the learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$.
Adagrad
• Adagrad modifies the general learning rate 𝜂 at each time step 𝑡
for every parameter 𝜃𝑖 based on the past gradients that have
been computed for 𝜃𝑖 :
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, g_i^{(t)}$$
where
$$G_{t,i} = \sum_{k \le t} \big(g_i^{(k)}\big)^2, \qquad g_i^{(k)} = \left.\frac{\partial J(\theta)}{\partial \theta_i}\right|_{\theta = \theta^{(k)}}$$
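A minimal NumPy sketch of the per-parameter Adagrad update above; the toy gradient function is an illustrative assumption.

import numpy as np

def grad_quadratic(theta):
    # Illustrative per-parameter gradients (steep in one coordinate, shallow in the other)
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)                           # running sum of squared gradients, per parameter
eta, eps = 0.01, 1e-8

for step in range(1000):
    g = grad_quadratic(theta)
    G += g ** 2                                    # G_{t,i} = sum over k <= t of (g_i^(k))^2
    theta -= eta / np.sqrt(G + eps) * g            # per-parameter effective learning rate shrinks over time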
Adagrad
• Pros
• It eliminates the need to manually tune the learning rate. Most
implementations use a default value of 0.01.
• Cons
• Its main weakness is the accumulation of the squared gradients in the denominator: the accumulated sum keeps growing during training. This causes the learning rate to shrink and eventually become infinitesimally small. The following algorithms aim to resolve this flaw.
RMSprop
• RMSprop has been developed to resolve Adagrad's diminishing
learning rates.
$$\lambda_{t,i} = \gamma \, \lambda_{t-1,i} + (1 - \gamma) \big(g_i^{(t)}\big)^2$$
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{\lambda_{t,i} + \epsilon}} \, g_i^{(t)}$$
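A minimal sketch of the RMSprop update above; $\gamma = 0.9$ and $\eta = 0.001$ are commonly suggested defaults, and the toy gradient function is again an illustrative assumption.

import numpy as np

def grad_quadratic(theta):
    return np.array([10.0, 1.0]) * theta           # illustrative per-parameter gradients

theta = np.array([1.0, 1.0])
avg_sq = np.zeros_like(theta)                      # lambda_{t,i}: decaying average of squared gradients
eta, gamma, eps = 0.001, 0.9, 1e-8                 # commonly used defaults

for step in range(3000):
    g = grad_quadratic(theta)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2 # the average does not grow without bound
    theta -= eta / np.sqrt(avg_sq + eps) * g       # learning rate no longer shrinks monotonically to zero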
COMPARISON
Visualization of algorithms
Which optimizer to choose?
• RMSprop is an extension of Adagrad that deals with its
radically diminishing learning rates.