
Optimizers

AdaGrad

© Nisheeth Joshi
Introduction

• AdaGrad stands for Adaptive Gradient Optimizer.
• It was developed in 2011 by John Duchi, Elad Hazan, and Yoram Singer.
• There are optimizers such as Gradient Descent, Stochastic Gradient Descent (SGD), and mini-batch SGD.
• All are used to reduce the loss function with respect to the weights.
• The weight updating formula is

$w_{new} = w_{old} - \eta \cdot \frac{\partial cost}{\partial w_{old}}$
Based on iterations, this formula can be written as

$w_t = w_{t-1} - \eta \cdot \frac{\partial cost}{\partial w_{t-1}}$

where
• $w_t$ = value of w at the current iteration,
• $w_{t-1}$ = value of w at the previous iteration, and
• $\eta$ = the learning rate.
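As a quick illustration, here is a minimal Python sketch of this plain gradient-descent update. The function and variable names (gd_step, w, grad, lr) are illustrative, not from the slides.

```python
# Minimal sketch of the plain gradient-descent update above.
def gd_step(w, grad, lr=0.01):
    """w_t = w_{t-1} - lr * d(cost)/dw_{t-1}"""
    return w - lr * grad

# One update for a single weight:
w = gd_step(12.0, grad=0.5)   # 12.0 - 0.01 * 0.5 = 11.995
```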

Adagrad Explanation

• In SGD and mini-batch SGD, the value of η used to be the same for each weight, or say for each parameter.
• Typically, η = 0.01.
• In the Adagrad optimizer, the core idea is that each weight has a different learning rate (η).
• This modification has great importance: in real-world datasets, some features are sparse (for example, in Bag of Words most of the feature values are zero, so it is sparse) and some are dense (most of the feature values are non-zero).
• So keeping the same learning rate for all the weights is not good for optimization, as the sketch below illustrates.
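To make the sparse-versus-dense point concrete, here is a purely illustrative sketch (the feature values below are made up): for a linear model, a weight attached to a sparse feature receives a gradient signal only on the few samples where that feature is non-zero, yet a shared η steps it by the same fixed amount as a weight that is updated on every sample.

```python
# Illustrative only: a sparse feature produces far fewer gradient signals
# than a dense one, so one shared learning rate suits them differently.
sparse_feature = [0, 0, 1, 0, 0, 0, 0, 1]   # mostly zeros (e.g. Bag of Words)
dense_feature  = [3, 5, 2, 4, 6, 3, 5, 4]   # mostly non-zero

# For a linear model, the gradient w.r.t. a weight is proportional to its
# input feature, so a zero input means a zero gradient for that weight.
sparse_signals = sum(1 for x in sparse_feature if x != 0)
dense_signals  = sum(1 for x in dense_feature if x != 0)
print(sparse_signals, dense_signals)        # 2 vs 8 non-zero gradients
```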

Adagrad Weight Update Formula

$w_t = w_{t-1} - \eta'_t \cdot \frac{\partial cost}{\partial w_{t-1}}$

$\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \varepsilon}}$

Where
• $\alpha_t$ is the term that gives each weight a different learning rate at each iteration,
• $\eta$ is a constant learning rate, and
• $\varepsilon$ is a very small positive value to avoid a divide-by-zero error.

Adagrad Intuition

$\alpha_t = \sum_{i=1}^{t} g_i^2$

$g_i = \frac{\partial cost}{\partial w_{i-1}}$
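Putting the update formula and the accumulator together, here is a minimal per-weight AdaGrad step in Python. This is a sketch rather than code from the slides; η = 0.01 and ε = 0.001 are the values used in the worked example later on.

```python
import math

def adagrad_step(w, grad, alpha, eta=0.01, eps=0.001):
    """One AdaGrad update for a single weight."""
    alpha += grad ** 2                        # alpha_t = running sum of g_i^2
    eta_prime = eta / math.sqrt(alpha + eps)  # per-weight learning rate eta'_t
    w = w - eta_prime * grad                  # w_t = w_{t-1} - eta'_t * g_t
    return w, alpha

# First update of a weight starting at 12 with gradient 0.5:
w, alpha = adagrad_step(12.0, 0.5, alpha=0.0)   # w ~= 11.99002, alpha = 0.25
```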

Adagrad Intuition

• $g_i$ is the derivative of the loss with respect to the weight.
• $g_i^2$ will always be positive since it is a squared term, which means that $\alpha_t$ can only grow, which implies

$\alpha_t \ge \alpha_{t-1}$

• Intuitively,
  • $\eta'_t$ is inversely proportional to $\sqrt{\alpha_t}$,
  • so as $\alpha_t$ increases, $\eta'_t$ decreases, and vice versa.
• This means that
  • as the number of iterations increases,
  • the learning rate reduces adaptively,
  • so you do not need to manually select the learning rate.

Advantages

• No manual tuning of the learning rate required.
• Faster convergence.
• More reliable.

Disadvantage of Adagrad

• One main disadvantage:
  • $\alpha_t$ can become very large as the number of iterations increases, and because of this $\eta'_t$ decreases at a faster rate.
  • This makes the new weight almost equal to the old weight, which may lead to slow convergence (see the decay sketch below).
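The decay sketch referred to above: assuming, purely for illustration, a constant gradient of 0.5 at every iteration, the effective learning rate η'_t keeps shrinking as α_t accumulates, so later updates become very small.

```python
import math

eta, eps, alpha = 0.01, 0.001, 0.0
for t in range(1, 1001):
    alpha += 0.5 ** 2                         # accumulate the squared gradient
    if t in (1, 10, 100, 1000):
        print(t, round(eta / math.sqrt(alpha + eps), 6))
# 1 0.01996, 10 0.006323, 100 0.002, 1000 0.000632 (the step size keeps shrinking)
```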

Sample Dataset

Student | X1: Physics (%) | X2: Chemistry (%) | X3: Hours Studied | Y: Mathematics (%)
   1    |       60        |        80         |         5         |        82
   2    |       70        |        75         |         7         |        94
   3    |       50        |        55         |        10         |        45
   4    |       40        |        56         |         7         |        43

The Network – 1 Hidden Layer

[Diagram: a feed-forward network with inputs X1, X2, X3; a hidden layer of two neurons, each computing a weighted sum (weights W1–W6, biases b1, b2) followed by a sigmoid activation; and a single linear output neuron y' with weights W7, W8 and bias b3.]
Linear Operation

[Diagram: for student 1, the inputs X1 = 60 and X2 = 80 (together with X3) feed the first hidden neuron through weights W1, W3, W5 and bias b1, producing the weighted sum z1.]
Non-Linear Operation

[Diagram: the same hidden neuron, with X3 = 5 as the third input, passes z1 through a sigmoid activation.]

$g_1 = \frac{1}{1 + e^{-z_1}}$
Output Layer: Linear Summation

With g1 = 0.37, g2 = 0.047, w7 = 12, w8 = 9 and b3 = 20:

y' = w7·g1 + w8·g2 + b3
   = 12·0.37 + 9·0.047 + 20
   = 4.44 + 0.423 + 20
   ≈ 24.86
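A short check of this output-layer sum in Python. The hidden-layer weights are not listed on the slides, so g1 and g2 are taken as given rather than recomputed from the sigmoid.

```python
g1, g2 = 0.37, 0.047        # hidden-layer sigmoid outputs from the slide
w7, w8, b3 = 12, 9, 20      # output-layer weights and bias from the slide
y_hat = w7 * g1 + w8 * g2 + b3
print(y_hat)                # 4.44 + 0.423 + 20 ≈ 24.86
```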
[Diagram: the complete forward pass for student 1, combining the hidden-layer outputs g1 = 0.37 and g2 = 0.047 with the output layer (w7, w8, b3) to produce the prediction y'.]
Let us assume, for w7:

$\frac{\partial cost}{\partial w_{t-1}} = g_i = 0.5$

$\alpha_t = g_i^2 = 0.5 \times 0.5 = 0.25$

$\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \varepsilon}} = \frac{0.01}{\sqrt{0.25 + 0.001}} = \frac{0.01}{0.501} = 0.01996$
$w_t = w_{t-1} - \eta'_t \cdot \frac{\partial cost}{\partial w_{t-1}}$

$w_7^{(t)} = w_7^{(t-1)} - \eta'_t \cdot \frac{\partial cost}{\partial w_7^{(t-1)}}$

$w_7^{(t)} = 12 - 0.01996 \times 0.5$

$w_7^{(t)} = 12 - 0.00998 = 11.99002$
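The same numbers reproduced in Python as a verification sketch, assuming the slide's gradient of 0.5, η = 0.01, and ε = 0.001:

```python
import math

eta, eps = 0.01, 0.001
g = 0.5                                    # d(cost)/dw7 assumed on the slide
alpha = g ** 2                             # 0.25 on the first iteration
eta_prime = eta / math.sqrt(alpha + eps)   # 0.01 / 0.501 ≈ 0.01996
w7_new = 12 - eta_prime * g                # 12 - 0.00998 ≈ 11.99002
print(round(eta_prime, 5), round(w7_new, 5))
```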