Optimization
S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology
• Momentum
• Nesterov Momentum
• AdaGrad
• RMSProp
• Adam
Properties of slope
update_t = γ update_{t−1} + (1 − γ) α ∇w_t
Choose α := (1 − γ)α
History of Updates
• γ ∈ (0, 1]
• update_t = γ update_{t−1} + α ∇w_t
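The momentum rule above translates directly into code. Below is a minimal NumPy sketch, assuming the weights are moved by w := w − update_t at every step; grad_fn, the toy quadratic loss, and the hyperparameter values are illustrative placeholders, not taken from the slides.

import numpy as np

def momentum_descent(grad_fn, w0, alpha=0.01, gamma=0.9, n_steps=100):
    # Gradient descent with the momentum ("history of updates") rule above.
    w = np.asarray(w0, dtype=float)
    update = np.zeros_like(w)
    for _ in range(n_steps):
        grad = grad_fn(w)                        # ∇w_t: gradient at the current weights
        update = gamma * update + alpha * grad   # update_t = γ update_{t−1} + α ∇w_t
        w = w - update                           # move the weights by the accumulated update
    return w

# Toy usage: minimise L(w) = 0.5 ||w||^2, whose gradient is simply w.
w_min = momentum_descent(lambda w: w, [5.0, -3.0])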
w_{t+1} = w_t − (α / √(ϵ I + diag(G_t))) ⊙ g_t        (1)
where α is the learning rate, ϵ is a small quantity used to avoid division by zero, I is the identity matrix, and g_t is the gradient estimate at time step t:
g_t = (1/d) Σ_{m=1}^{d} ∇_{w_t} L(x_m, y_m)
s_t = s_{t−1} + g_t ⊙ g_t
Let s_t^{(i)}, i = 1, 2, . . . , n be the i-th element of s_t. Now perform the following operation:
s̃_t^{(i)} = α / √(ϵ + s_t^{(i)}),   i = 1, 2, . . . , n
The element-wise operations performed on s_t, as described above, are denoted compactly by
s̃_t = α / √(ϵ + s_t)
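A tiny numerical example of this element-wise notation, using an arbitrary made-up vector s_t (the values are not from the slides):

import numpy as np

alpha, eps = 0.01, 1e-8
s_t = np.array([4.0, 0.25, 1.0])       # accumulated squared gradients (example values)
s_tilde = alpha / np.sqrt(eps + s_t)   # element-wise: α / √(ϵ + s_t^{(i)}) for each i
# s_tilde ≈ [0.005, 0.02, 0.01]: coordinates with larger accumulated gradients get smaller steps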
Algorithm 3 AdaGrad
Initialize w, b
ϵ = 1e−8, s_w = 0, s_b = 0
g_w = 0, g_b = 0, learning rate α
Iterate until convergence
{
    Choose a mini-batch of d training examples
    {
        Gradient computation: g_w := (1/d) Σ_{m=1}^{d} ∇_w L(x_m, y_m)
        Gradient computation: g_b := (1/d) Σ_{m=1}^{d} ∇_b L(x_m, y_m)
        Accumulation of squared gradient: s_w := s_w + g_w ⊙ g_w
        Accumulation of squared gradient: s_b := s_b + g_b ⊙ g_b
        Update: w := w − (α / √(ϵ + s_w)) ⊙ g_w
        Update: b := b − (α / √(ϵ + s_b)) ⊙ g_b
    }
}
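A minimal NumPy sketch of Algorithm 3 follows, assuming the mini-batch gradients with respect to w and b are supplied by placeholder functions grad_w_fn and grad_b_fn, and that mini-batches come from an iterable; these names and the stopping rule are assumptions, not part of the slides.

import numpy as np

def adagrad(grad_w_fn, grad_b_fn, w, b, batches, alpha=0.01, eps=1e-8):
    s_w, s_b = np.zeros_like(w), np.zeros_like(b)   # accumulators of squared gradients
    for batch in batches:                           # stands in for "iterate until convergence"
        g_w = grad_w_fn(w, b, batch)                # mini-batch gradient estimate w.r.t. w
        g_b = grad_b_fn(w, b, batch)                # mini-batch gradient estimate w.r.t. b
        s_w = s_w + g_w * g_w                       # accumulate squared gradients
        s_b = s_b + g_b * g_b
        w = w - (alpha / np.sqrt(eps + s_w)) * g_w  # per-coordinate effective learning rates
        b = b - (alpha / np.sqrt(eps + s_b)) * g_b
    return w, b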
Problems with AdaGrad
w_{t+1} = w_t − (α / √(ŝ_t + ϵ)) ⊙ m̂_t
Suggested defaults: β_1 = 0.9, β_2 = 0.999, ϵ = 1e−8, α = 0.001
Algorithm 5 ADAM
Initialize w, b
ϵ = 1e−8, m_w = 0, m_b = 0, s_w = 0, s_b = 0
g_w = 0, g_b = 0, β_1 = 0.9, β_2 = 0.999, learning rate α
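Since only the initialization of Algorithm 5 is shown above, here is a minimal NumPy sketch of the Adam update described by the equation above, with the usual bias correction of both moment estimates; grad_fn, the single-parameter setup, and the fixed step count are illustrative assumptions, not the slides' exact algorithm.

import numpy as np

def adam(grad_fn, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first-moment estimate (exponentially weighted mean of gradients)
    s = np.zeros_like(w)   # second-moment estimate (exponentially weighted mean of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # update biased first moment
        s = beta2 * s + (1 - beta2) * g * g    # update biased second moment
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment m̂_t
        s_hat = s / (1 - beta2 ** t)           # bias-corrected second moment ŝ_t
        w = w - (alpha / np.sqrt(s_hat + eps)) * m_hat
    return w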