10 - AdaGrad
AdaGrad
© Nisheeth Joshi
Introduction
• Adagrad stands for Adaptive Gradient Optimizer.
• It was developed in 2011 by Duchi, Hazan, and Singer.
• There are optimizers like Gradient Descent, Stochastic Gradient Descent, and mini-batch SGD.
• All are used to reduce the loss function with respect to the weights.
• The weight updating formula is

$w_{new} = w_{old} - \eta \cdot \frac{\partial\, cost}{\partial w_{old}}$
Based on iterations, this formula can be written as

$w_t = w_{t-1} - \eta \cdot \frac{\partial\, cost}{\partial w_{t-1}}$

where
• w(t) = value of w at the current iteration,
• w(t-1) = value of w at the previous iteration, and
• η = learning rate.
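As a minimal Python sketch (not from the slides), the plain update rule above could look like this; the function name and the example gradient are illustrative assumptions.

```python
# Minimal sketch of the plain gradient-descent update
# w_new = w_old - eta * d(cost)/d(w_old).

def gd_step(w_old, grad, eta=0.01):
    return w_old - eta * grad

# Example: one update with an assumed gradient of 0.5
print(gd_step(w_old=12.0, grad=0.5))  # 12.0 - 0.01 * 0.5 = 11.995
```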
Adagrad Explanation
• In SGD and mini-batch SGD, the value of η is the same for each weight, or say for each parameter.
• Typically, η = 0.01.
• But in the Adagrad optimizer the core idea is that each weight has a different learning rate (η).
• This modification has great importance: in real-world datasets, some features are sparse
  • (for example, in Bag of Words most of the features are zero, so it is sparse)
  • and some are dense (most of the features will be non-zero).
• So keeping the same value of the learning rate for all the weights is not good for optimization.
Adagrad Weight Updation Formula

$w_t = w_{t-1} - \eta'_t \cdot \frac{\partial\, cost}{\partial w_{t-1}}$

$\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \varepsilon}}$

Where
• $\alpha_t$ denotes different learning rates for each weight at each iteration
• $\eta$ is a constant learning rate
• $\varepsilon$ is a very small +ve value to avoid divide-by-zero error
Adagrad Intuition

$\alpha_t = \sum_{i=1}^{t} g_i^2$

$g_i = \frac{\partial\, cost}{\partial w_{i-1}}$
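A minimal Python sketch of one AdaGrad step for a single weight, following the two formulas above; the function name, η = 0.01, and ε = 0.001 mirror the worked example later in the deck and are otherwise illustrative.

```python
import math

def adagrad_step(w, grad, alpha, eta=0.01, eps=0.001):
    """One AdaGrad update for a single weight: accumulate the squared
    gradient, then apply eta' = eta / sqrt(alpha + eps)."""
    alpha = alpha + grad ** 2              # alpha_t = sum_{i=1..t} g_i^2
    eta_prime = eta / math.sqrt(alpha + eps)
    w = w - eta_prime * grad               # w_t = w_{t-1} - eta'_t * g_t
    return w, alpha

# Example: starting from w = 12 with an assumed gradient of 0.5
w, alpha = adagrad_step(12.0, grad=0.5, alpha=0.0)
print(round(w, 5))  # 11.99002, matching the worked example at the end of the deck
```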
Adagrad Intuition
• $g_i$ is the derivative of the loss with respect to the weight
• $g_i^2$ will always be positive since it is a squared term, which means that $\alpha_t$ will also remain positive, which implies $\alpha_t \geq \alpha_{t-1}$
• Intuitively,
  • $\alpha_t$ is inversely proportional to $\eta'_t$
  • As $\alpha_t$ increases, $\eta'_t$ will decrease, and vice versa
• This means that
  • as the number of iterations increases,
  • the learning rate will reduce adaptively,
  • so you do not need to manually select the learning rate
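To make the decay concrete, here is a small illustrative loop (not slide content) that prints how $\eta'_t$ shrinks as the squared gradients accumulate; the constant gradient of 0.5 is an assumption.

```python
import math

# Watch eta'_t = eta / sqrt(alpha_t + eps) shrink as alpha_t accumulates.
# A constant gradient of 0.5 is assumed purely for illustration.
eta, eps, alpha = 0.01, 0.001, 0.0
for t in range(1, 6):
    alpha += 0.5 ** 2                          # alpha_t grows monotonically
    print(t, round(eta / math.sqrt(alpha + eps), 5))
# 1 0.01996, 2 0.01413, 3 0.01154, 4 0.01, 5 0.00894 -> adaptively decreasing
```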
Advantages
• No manual tuning of the learning rate required
• Faster convergence
• More reliable
Disadvantage of Adagrad
• One main disadvantage:
  • $\alpha_t$ can become large as the number of iterations increases, and due to this $\eta'_t$ will decrease at a larger rate.
  • This will make the old weight almost equal to the new weight, which may lead to slow convergence.
Sample Dataset

Student | X1 Physics (%) | X2 Chemistry (%) | X3 Hours Studied | Y Mathematics (%)
1       | 60             | 80               | 5                | 82
2       | 70             | 75               | 7                | 94
3       | 50             | 55               | 10               | 45
4       | 40             | 56               | 7                | 43
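For later reference, the table above could be held in arrays like this; NumPy and the split into features X and target y are my choices, not the slides'.

```python
import numpy as np

# Sample dataset from the table above.
# Columns of X: Physics (%), Chemistry (%), Hours Studied; y: Mathematics (%).
X = np.array([[60, 80, 5],
              [70, 75, 7],
              [50, 55, 10],
              [40, 56, 7]], dtype=float)
y = np.array([82, 94, 45, 43], dtype=float)
```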
The Network – 1 Hidden Layer

[Diagram: inputs X1, X2, X3 connect through weights W1–W6 (with biases b1 and b2) to two hidden neurons, each a weighted sum ∑ followed by an activation ∫; the two hidden outputs connect through w7 and w8 (with bias b3) to an output summation node producing y'.]
Linear Operation

[Diagram: for student 1 (inputs 60, 80, 5), the first hidden neuron computes the weighted linear sum z1 of the inputs through W1, W3, W5 plus bias b1; z1 then feeds the activation ∫ to give g1.]
Non-Linear Operation

[Diagram: the same hidden neuron applies the sigmoid activation to z1.]

$g_1 = \frac{1}{1 + e^{-z_1}}$
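A tiny sketch of the sigmoid activation above; the z1 value used in the example is back-calculated from g1 = 0.37 and is not given on the slides.

```python
import math

def sigmoid(z):
    """Non-linear operation applied to the hidden neuron's linear sum."""
    return 1.0 / (1.0 + math.exp(-z))

# g1 = 0.37 on the next slide corresponds to z1 of roughly -0.53
print(round(sigmoid(-0.53), 2))  # 0.37
```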
Output Layer: Linear Summation

With g1 = 0.37, g2 = 0.047, w7 = 12, w8 = 9, and b3 = 20:

y' = w7*g1 + w8*g2 + b3
   = 12*0.37 + 9*0.047 + 20
   = 24.95
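Reproducing the output-layer sum with the rounded values shown (my snippet, not slide content); with these rounded activations it comes to about 24.86, so the slide's 24.95 presumably uses unrounded g1 and g2.

```python
# Output layer: linear summation over the hidden activations.
g1, g2 = 0.37, 0.047
w7, w8, b3 = 12.0, 9.0, 20.0
y_hat = w7 * g1 + w8 * g2 + b3
print(round(y_hat, 2))  # 24.86 with the rounded g1, g2 shown on the slide
```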
[Diagram: the full network for student 1, with hidden activations g1 = 0.37 and g2 = 0.047 feeding the output neuron through w7 and w8 to produce y' = 24.95.]
Let us assume for w7

• $\frac{\partial\, cost}{\partial w_{t-1}} = 0.5 = g_i$

• $\alpha_t = g_i^2 = 0.5 * 0.5 = 0.25$

• $\eta' = \frac{\eta}{\sqrt{\alpha_t + \varepsilon}} = 0.01/\sqrt{0.25 + 0.001} = 0.01/0.501 = 0.01996$
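The same numbers checked in Python (illustrative, following the assumptions above):

```python
import math

# Effective learning rate for w7 with the assumed gradient g = 0.5,
# eta = 0.01 and epsilon = 0.001 from the worked example.
g = 0.5
alpha_t = g ** 2                          # 0.25
eta_prime = 0.01 / math.sqrt(alpha_t + 0.001)
print(round(eta_prime, 5))                # 0.01996
```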
$w_t = w_{t-1} - \eta' \cdot \frac{\partial\, cost}{\partial w_{t-1}}$

$w_7^{t} = w_7^{t-1} - \eta' \cdot \frac{\partial\, cost}{\partial w_7^{t-1}}$

$w_7^{t} = 12 - 0.01996 * 0.5$

$w_7^{t} = 12 - 0.00998 = 11.99002$
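And the resulting update of w7, again as an illustrative check of the arithmetic above:

```python
# One AdaGrad update of w7 using the values worked out above.
w7_old, eta_prime, grad = 12.0, 0.01996, 0.5
w7_new = w7_old - eta_prime * grad
print(round(w7_new, 5))  # 12 - 0.00998 = 11.99002
```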