S09 DNN Gradients Wip



Gradients and Optimization

Deep Neural Networks


Session 09
Pramod Sharma
[email protected]

2 Agenda

Loss / Cost Optimization

Stochastic Gradient Descent (SGD) and others

Momentum Learning Rates

Adaptive Learning Rates


3 Optimization Problem
 Neural Network: we solve it as an optimization problem

 Classification problem : predicting the probability of an instance belonging to each class

 Regression problem : predicting the actual value of an instance

 Gradient is actually Error Gradient

 Model is estimating weights to map inputs with the target

 Loss function : ℓ (a, y), a, in turn, is a function of W and b

 Major component comes from weights of different layers

 Compute the gradients ∂ℓ/∂W and ∂ℓ/∂b
 Update weights: W = W − α · ∂ℓ/∂W
 Similarly b = b − α · ∂ℓ/∂b
Where α is defined as the learning rate
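As a rough illustration (not from the slides), the update rule can be written in a few lines of NumPy; gradient_step, grad_W and grad_b are hypothetical names, and the gradients are assumed to be computed elsewhere, e.g. by backpropagation.

import numpy as np

def gradient_step(W, b, grad_W, grad_b, alpha=0.01):
    # one vanilla gradient-descent update: move opposite to the gradient
    W = W - alpha * grad_W   # W = W - alpha * dl/dW
    b = b - alpha * grad_b   # b = b - alpha * dl/db
    return W, b

# toy usage with made-up gradients
W, b = np.zeros((3, 1)), 0.0
W, b = gradient_step(W, b, grad_W=np.ones((3, 1)), grad_b=0.5, alpha=0.1)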


Loss / Cost Optimization


5 Optimization Problem

 If dJ/dw is positive, subtract from w
 If dJ/dw is negative, add to w

[Figure: cost J plotted against weight w; the sign of the slope dJ/dw determines the update direction]


6 Loss / Cost Optimization


7 State Space Landscape


 Problem: depending on initial state, can get stuck in local maxima/minima

[Figure: objective function over the state space, marked with the global maximum, local maxima, a “flat” local maximum, a shoulder, and the current vs. preferred state]


8 Learning Rate : Difficult to assess what’s going to work?

[Figure: gradient descent trajectories for α = 0.1 and α = 0.01 (Case I and Case II) compared with the desired path to the minimum]

9 Gradient – Tough Terrain


10 Learning Rate : Tough Terrain


Finding most optimal gradient descent can be
difficult

Q1: How to select right learning rate?

 Too fast and you can miss minima!

 Too slow, you can be struck at local minima! Case II


Case I

 Need to look for learning rate converges


Desired
smoothly and avoids local minima! =>
“Momentum”


11 Learning Rate : Tough Terrain


Question: How to select the right learning rate?

 Need a learning rate that ‘adapts’ to the terrain

 No fixed learning rate

 The learning rate changes as the gradient changes

 Popular algorithms
 SGD
 Adam
 Adadelta
 Adagrad
 RMSProp


12 Gradient Descent
 Gradient descent is one of the most popular algorithms to perform optimization
 Most common way to optimize neural networks

 Every state-of-the-art Deep Learning library contains implementations of these algorithms

 Used as black-box optimizers

 Practical explanations of their strengths and weaknesses are hard to come by

 Makes sense to understand the implementations under-the-hood

 We optimize cost function J ( W, b )

 Learning rate α decides our step size ( Go down the slope… till you reach the valley! )


13

Stochastic Gradient Descent (SGD) and Others


14 Stochastic Gradient Descent (SGD)


 Stochastic refers to a randomly determined process

 In theory it performs a parameter update for each training example

 Gradient computations are tough on computing resources

 Imagine how many calculations will be needed to cover all data points and all batches

 In this method, we pick one point randomly out of the “batch” and compute the loss for the same

 You will see wide fluctuations in the objective function

 The data set is processed in batches to parallelize the computations

 Enables it to jump over local minima, with the hope of finding the global minimum
Our very first model…

15 Batch Gradient Descent


 Stochastic Gradient Descent performs an update for every example, which is computationally expensive overall

 It can also produce a noisy descent, with parameter values jumping around uncontrollably

 In some cases it is so noisy that the implementation needs tweaking

 So we change to batch gradient descent, which performs a model update at the end of each
training epoch

Epoch
Dictionary : “A long period of time, especially one in which there are new developments and great change”
ML: One cycle through the entire training dataset


16 Batch Gradient Descent


 Vanilla gradient descent computes the gradient of the cost function w.r.t. the parameters W for the entire training
dataset:
∂J/∂W = (1/m) · Σ ∂ℓ(a, y)/∂W   and   W = W − α · ∂J/∂W

 We need the gradients for the whole dataset to perform one update, hence “batch” gradient descent

 Most deep learning libraries provide automatic differentiation that efficiently computes the gradient

 Update our parameters in the direction of the gradients with the learning rate

 For non–convex surfaces, it converges to local minima

 We have coded this in recent examples


17 Batch Gradient Descent


 Pros:
 Fewer updates, computationally lightweight
 Fewer updates may result in more steady error gradient and stable convergence
 Error calculation and weight updates are separate; hence, it is easier to implement parallel processing

 Cons:
 More stable gradient may result in premature convergence
 Need additional step of collecting errors across all training examples
 Model updates and training speed may become very slow for large dataset


18 Batch vs. Mini-batch


Split the entire dataset into mini-batches of 128 rows.

Full dataset:                          Split into mini-batches:

Index   X1      X2      y              Batch      Index   X1      X2      y
0       0.871   0.640   0              Batch 1    0       0.660   0.775   1
1       0.987   0.633   0                         1       0.343   0.605   1
2       0.520   0.405   0                         2       0.771   0.958   0
…       …       …       …                         …       …       …       …
127     0.857   0.440   0                         127     0.291   0.231   0
128     0.154   0.161   0              Batch 2    128     0.886   0.255   1
129     0.642   0.722   0                         129     0.630   0.873   1
…       …       …       …                         …       …       …       …
256     0.825   0.844   1              Batch 3    256     0.260   0.880   0
257     0.763   0.244   1                         257     0.002   0.263   1
258     0.562   0.225   0                         258     0.055   0.351   1
…       …       …       …                         …       …       …       …
383     0.220   0.573   0                         383     0.444   0.986   1
384     0.953   0.797   0              …          384     0.997   0.020   0
385     0.118   0.620   1                         385     0.807   0.789   0
…       …       …       …                         …       …       …       …
1152    0.695   0.215   0                         1152    0.695   0.215   0

19 Mini-Batch Gradient Descent


 Instead of using the entire dataset as one batch, create mini-batches

 Good balance between Stochastic Gradient Descent and Batch Gradient Descent

 Pros:
 Frequent model updates
 Maintains the advantage of batched updates
 Mini-batches avoid the need to process the entire training data in one go

 Cons:
 Error information must be accumulated across mini-batches of training examples, like batch gradient descent.


20 Difference
 Batch Gradient Descent : Use all ‘m’ examples then update once…

 Stochastic Gradient Descent : update for each example

 Mini Batch Gradient Descent : Good balance between the two…
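A hedged sketch of the three update schedules in NumPy; grad_fn(X, y, W) is a hypothetical callable returning ∂J/∂W for the examples it is given, and the function names are illustrative only.

import numpy as np

def batch_gd(X, y, W, grad_fn, alpha=0.1, epochs=10):
    # Batch GD: use all m examples, then update once per epoch
    for _ in range(epochs):
        W = W - alpha * grad_fn(X, y, W)
    return W

def sgd(X, y, W, grad_fn, alpha=0.1, epochs=10):
    # SGD: one update per (randomly ordered) training example
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            W = W - alpha * grad_fn(X[i:i + 1], y[i:i + 1], W)
    return W

def mini_batch_gd(X, y, W, grad_fn, alpha=0.1, epochs=10, batch_size=128):
    # Mini-batch GD: one update per mini-batch of batch_size rows
    for _ in range(epochs):
        order = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            sel = order[start:start + batch_size]
            W = W - alpha * grad_fn(X[sel], y[sel], W)
    return W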


21 Overall Challenges – Batch / Stochastic


 What is the proper learning rate???
 Too small → painfully slow convergence
 Too large → the loss function fluctuates around the minimum, or even diverges

 Use learning rate schedules
 Adjust the learning rate during training by reducing it at certain intervals
 Schedules have to be defined in advance and are thus unable to adapt to a dataset’s characteristics

 While minimizing highly non-convex error functions, how do we avoid getting trapped in local minima?

 What about plateaus?

[Figures: learning-rate trajectories (Case I, Case II, desired) and the state-space landscape with global maximum, local maxima, “flat” local maximum, and shoulder]

22

Momentum Learning Rates


23 Momentum Gradient Descent

Approximate weighted averages???



24 Exponentially Weighted (Moving) Average


25 Exponentially Weighted (Moving) Average

v_t = β · v_{t−1} + (1 − β) · θ_t

So we can demonstrate that


v_t = 0.1 · [ θ_t + 0.9 · θ_{t−1} + 0.9² · θ_{t−2} + 0.9³ · θ_{t−3} + ….. ]   (for β = 0.9)
and that’s our exponentially weighted moving average….

If we must, we can correct the bias using the following value


v_t^corrected = v_t / (1 − β^t)
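A minimal sketch of the exponentially weighted moving average with optional bias correction; the function name ewma is hypothetical and the sample values are made up.

import numpy as np

def ewma(theta, beta=0.9, bias_correct=True):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t, optionally divided by (1 - beta^t)
    v, out = 0.0, []
    for t, th in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * th
        out.append(v / (1 - beta ** t) if bias_correct else v)
    return np.array(out)

smoothed = ewma([10, 12, 9, 11, 13], beta=0.9)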


26 Gradient Descent with Momentum


 SGD has trouble navigating ravines
 Areas where the surface curves much more steeply in one dimension than in another

 SGD oscillates across the slopes of the ravine, progress towards bottom is very slow.

 Momentum accelerates SGD in the relevant direction and dampens oscillations

 Momentum Gradient Descent updates as follows:


V_t = β · V_{t−1} + (1 − β) · ∂J/∂W
and W = W − α · V_t

 Congratulations!!! β is another parameter you can tune.


27 Momentum Gradient Descent


 The momentum term β is usually set to 0.9

 It accumulates momentum as it rolls downhill, becoming faster and faster

 The momentum term increases for dimensions whose successive gradients point in the same
directions

 Reduces updates for dimensions whose successive gradients change directions.

 As a result, we gain faster convergence and reduced oscillation.


28 Gradient Descent Example


 Compute dW and db
 Compute V_dW and V_db
 Update V_dW = β · V_dW + (1 − β) · dW and V_db = β · V_db + (1 − β) · db
 Update W = W − α · V_dW and b = b − α · V_db
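The four steps above can be collected into one helper; a minimal sketch, assuming dW and db are the current gradients and v_dW / v_db carry the running momentum terms (all names are illustrative).

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    # V_dW = beta * V_dW + (1 - beta) * dW, and likewise for b
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW       # W = W - alpha * V_dW
    b = b - alpha * v_db       # b = b - alpha * V_db
    return W, b, v_dW, v_db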


29

Adaptive Learning Rates


30 AdaGrad
 SGD needs:
 Starting point to be selected
 Constant learning rate

 AdaGrad tries to overcome these issues
 Adaptively scaled learning rate for each dimension
 It accumulates squares of the gradients in the denominator
 In some cases the learning rate can become infinitesimally small

[Figure: AdaGrad update rule, from the original paper]


31 AdaGrad
 Improved the robustness of SGD

 Being used for training large-scale neural nets (e.g., at Google)

 Also used to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

 Previously, we performed an update for all Weights W at once as every parameter wi used the same
learning rate η ( read 𝜶).

 Adagrad uses a different learning rate for every weight w_i at every time step t

 Let ∂W_{t,i} be the gradient of the objective function w.r.t. the parameter w_i at time step t:
 W = W − ( η / √(G_t + ε) ) · ∂W   (ε is a small smoothing term to avoid division by zero)
 Where G_t is the sum of the element-wise squares of the gradients up to time step t: G_t = ∑ (∂W_i)²

 Out-of-box libraries available

 Some libraries use diagonal matrix instead of using full matrix
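A minimal NumPy sketch of the AdaGrad update described above; adagrad_step is a hypothetical helper, and G must be initialized to zeros with the same shape as W.

import numpy as np

def adagrad_step(W, dW, G, eta=0.01, eps=1e-8):
    # G accumulates element-wise squared gradients; each weight's effective
    # learning rate eta / sqrt(G + eps) therefore shrinks over time
    G = G + dW ** 2
    W = W - (eta / np.sqrt(G + eps)) * dW
    return W, G

W, G = np.zeros(4), np.zeros(4)
W, G = adagrad_step(W, dW=np.array([0.1, -0.2, 0.0, 0.3]), G=G)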


32 AdaGrad

W = W − ( η / √(G_t + ε) ) · ∂W,   where G_t = ∑ (∂W_i)²

[Figure: cost contours over w and b, illustrating the per-dimension AdaGrad updates]

33 Adadelta
 Derived from AdaGrad

 Aimed at reducing aggressive, monotonically decreasing learning rate

 Instead of considering entire history of time steps, use a window.

 Exponential decay ( Exponential Moving Average) is also considered

 Benefits:
 No manual adjustment of a learning rate after initial selection.
 Insensitive to hyperparameters.
 Separate dynamic learning rate per-dimension.
 Minimal computation over gradient descent.
 Robust to large gradients, noise and architecture choice.
 Applicable in both local and distributed environments.

 All libraries have built in functions for Adadelta
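A hedged sketch of one Adadelta update (following Zeiler's formulation, not code from these slides); Eg2 and Edx2 are the decayed averages of squared gradients and squared updates, initialized to zeros.

import numpy as np

def adadelta_step(W, dW, Eg2, Edx2, rho=0.95, eps=1e-6):
    Eg2 = rho * Eg2 + (1 - rho) * dW ** 2                    # windowed E[g^2]
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * dW    # per-dimension step, no global learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2                  # windowed E[dx^2]
    return W + dx, Eg2, Edx2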


34 Resilient Back Propagation - Rprop


 Each weight and bias has a different, variable, implied learning rate

 Each weight has a delta value that increases when the gradient doesn’t change sign (meaning it’s a
step in the correct direction) or decreases when the gradient does change sign

 It’s not commonly used, as its implementation is clumsy and not many out-of-the-box solutions are
available.


35 RMSProp

 Adds an exponential moving average on top of AdaGrad

[Figure: RMSProp update equations; read β for the decay term and α for the learning rate]


36 RMSProp
 Gradients for different weights are different

 Combines the idea of only using the sign of the gradient with the idea of adapting the step size
separately for each weight

 Keep a moving average of the squared gradients for each weight

 Then divide the gradient by the square root of that mean square

 In Momentum, the gradient descent was modified by its exponential moving average

 RMSProp updates by taking the RMS value of the gradient:


 v_t = β · v_{t−1} + (1 − β) · (∂J/∂W)²   (element-wise square)
 and W = W − ( η / √v_t ) · ∂J/∂W   (read α for η)
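A minimal sketch of the RMSProp update above; the small eps term is an assumption added for numerical stability and is not shown in the slide's formula.

import numpy as np

def rmsprop_step(W, dW, v, eta=0.001, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * dW ** 2      # moving average of squared gradients
    W = W - eta * dW / (np.sqrt(v) + eps)    # divide the gradient by its RMS
    return W, v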


37 Adaptive Moment Estimation (Adam)


 Adam is another method that computes adaptive learning rates for each parameter.

 SGD was too simplistic, Adam is an improvement

 Adam was presented by Diederik Kingma (OpenAI) and Jimmy Ba (University of Toronto) in their 2015 ICLR
paper (poster) titled “Adam: A Method for Stochastic Optimization“
 https://arxiv.org/abs/1412.6980

 Its name Adam is derived from Adaptive Moment Estimation

 Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the
learning rate does not change during training.

 Adam is adapted from two other methods


 Adaptive Gradient Algorithm (AdaGrad): works well with sparse gradients
 Root Mean Square Propagation (RMSProp): works well in on-line and non-stationary settings

 It also stores an exponentially decaying average of past squared gradients v_t


38 Adaptive Moment Estimation (Adam)


 Parameters:
 α = 0.001,
 β₁ = 0.9,
 β₂ = 0.999 and
 ε = 10⁻⁸
 Refer Paper cited for details of algorithm

Diederik P. Kingma* Jimmy Lei Ba∗


University of Amsterdam, OpenAI University of Toronto
[email protected] [email protected]
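A minimal sketch of one Adam step using the default parameters listed above; adam_step is a hypothetical helper, with m and v initialized to zeros and t the 1-based step count.

import numpy as np

def adam_step(W, dW, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dW          # first moment: decayed mean of gradients
    v = beta2 * v + (1 - beta2) * dW ** 2     # second moment: decayed mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v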

39 Which one to use???


 For sparse data, use one of the adaptive learning-rate methods
 No need to tune the learning rate,
 Generally work with the default values
 RMSprop:
 is an extension of Adagrad that deals with its radically diminishing learning rates.
 Identical to Adadelta, except that Adadelta uses the RMS of parameter updates
 Adam,
 adds bias-correction and momentum to RMSprop.
 RMSprop, Adadelta, and Adam are very similar algorithms. Pick any and then try others
 Adam slightly outperforms RMSprop towards the end of optimization as gradients become sparser
 Adam may appear to be the best overall choice
 Many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule
 SGD usually achieves a minimum, but it might take significantly longer than with some of the optimizers,
 For fast convergence when training a deep or complex neural network, choose one of the adaptive
learning rate methods


40 Learning Rate Decay


 Idea is to take advantage of large steps in the beginning and then reduce your step size
 With some threshold at a minimum value

 If you continue with the same step size,


 You might keep hopping around the minimum

 A few recommended techniques:


 α = ( 1 / (1 + decay_rate · epoch_num) ) · α₀

 Or α = 0.98^epoch_num · α₀
 Or may be stepped; reduce by 0.1 · α after 100 epochs
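The three schedules above can be sketched as one helper; the schedule names and defaults are illustrative, not a standard API.

def decayed_lr(alpha0, epoch, decay_rate=1.0, schedule="inverse"):
    if schedule == "inverse":       # alpha = alpha0 / (1 + decay_rate * epoch)
        return alpha0 / (1.0 + decay_rate * epoch)
    if schedule == "exponential":   # alpha = 0.98^epoch * alpha0
        return (0.98 ** epoch) * alpha0
    if schedule == "step":          # multiply by 0.1 after every 100 epochs
        return alpha0 * (0.1 ** (epoch // 100))
    return alpha0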



42 Reflect…
 What is Gradient Descent used for?
 A. Clustering
 B. Regression
 C. Dimensionality Reduction
 D. Image Classification

 Answer: B. Regression

 Which of the following best describes Gradient Descent?
 A. An optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.
 B. An algorithm for finding the maximum of a function.
 C. A clustering algorithm based on distance between data points.
 D. A classification algorithm for non-linearly separable data.

 Answer: A. An optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.

 In Gradient Descent, what does the "gradient" represent?
 A. The direction of steepest ascent of the function.
 B. The rate of change of the function at a point.
 C. The distance between data points.
 D. The probability of occurrence of a data point.

 Answer: B. The rate of change of the function at a point.

 What is the role of the learning rate in Gradient Descent?
 A. It determines the number of iterations.
 B. It specifies the initial position of the algorithm.
 C. It controls the size of the steps taken during optimization.
 D. It sets the threshold for convergence.

 Answer: C. It controls the size of the steps taken during optimization.


43 Reflect…
 Which variant of Gradient Descent updates the parameters after evaluating the gradient over the entire dataset?
 A. Stochastic Gradient Descent (SGD)
 B. Mini-batch Gradient Descent
 C. Batch Gradient Descent
 D. Momentum-based Gradient Descent

 Answer: C. Batch Gradient Descent

 Which statement best describes the trade-offs between different variants of Gradient Descent?
 A. Batch Gradient Descent is faster than Stochastic Gradient Descent.
 B. Stochastic Gradient Descent guarantees convergence to the global minimum.
 C. Mini-batch Gradient Descent balances the efficiency of batch GD and the stochastic nature of SGD.
 D. Momentum-based Gradient Descent is less prone to getting stuck in local minima.

 Answer: C. Mini-batch Gradient Descent balances the efficiency of batch GD and the stochastic nature of SGD.

 What is a potential issue with a high learning rate in Gradient Descent?
 A. Slow convergence
 B. Overshooting the minimum
 C. Premature convergence to a local minimum
 D. Increased computational complexity

 Answer: B. Overshooting the minimum

 Which of the following statements is true regarding the convergence of Gradient Descent?
 A. Gradient Descent always converges to the global minimum.
 B. Gradient Descent may converge to a local minimum depending on the initialization and learning rate.
 C. Gradient Descent always converges, regardless of the function being optimized.
 D. Gradient Descent converges faster with a smaller number of iterations.

 Answer: B. Gradient Descent may converge to a local minimum depending on the initialization and learning rate.


44 Reflect…
 Which technique is often used to mitigate the issue of oscillations around the minimum in Gradient
Descent?
 A. Decreasing the learning rate over time
 B. Increasing the learning rate over time
 C. Using a larger batch size
 D. Adding momentum to the updates

 Answer: D. Adding momentum to the updates

 Which of the following is NOT a common variant of Gradient Descent?


 A. Adagrad
 B. K-means
 C. RMSprop
 D. Adam

 Answer: B. K-means


45 Next Session

L1, L2 Regularization

Dropout

Early Stopping

Augmentation


46

Over to Jupyter Notebook



ADDITIONAL MATERIAL

What’s Next


49

Augmented Random Search

Horia Mania Aurelia Guy Benjamin Recht


Department of Electrical Engineering and Computer Science
University of California, Berkeley
March 20, 2018


50 Augmented Random Search


 It’s a Shallow Learning Algorithm
 Using Random Noise
 Exploits genetic evolution theory


51 Perceptron
 Welcome back our perceptron


52 Perceptron

x1, x2 : inputs, with weights w1, w2 and bias b

Forward pass:
z = w1·x1 + w2·x2 + b   (in general z = W·X + b)
ŷ = a = σ(z),   where σ(z) = 1 / (1 + e^(−z))
ℓ(a, y) = −[ y·log(a) + (1 − y)·log(1 − a) ]   (binary cross-entropy)

Backward pass (chain rule):
∂ℓ/∂a = −y/a + (1 − y)/(1 − a)
∂a/∂z = σ(z)·(1 − σ(z)) = a·(1 − a)
∂ℓ/∂z = (∂ℓ/∂a) · (∂a/∂z) = { −y/a + (1 − y)/(1 − a) } · a·(1 − a) = a − y
∂ℓ/∂w1 = x1 · (∂ℓ/∂z) = x1·(a − y)
∂ℓ/∂w2 = x2 · (∂ℓ/∂z) = x2·(a − y)
∂ℓ/∂b = ∂ℓ/∂z = a − y

Updates, where α is the learning rate:
w1 = w1 − α · x1·(a − y)
w2 = w2 − α · x2·(a − y)
b = b − α · (a − y)

The cost function over m examples is J(W, b) = (1/m) · Σ ℓ(a, y),
hence ∂J/∂W = (1/m) · Σ ∂ℓ(a, y)/∂W
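The derivation above translates directly into a vectorized NumPy training step; a minimal sketch, assuming X has shape (m, n), y has shape (m,), and the toy data below are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(W, b, X, y, alpha=0.1):
    m = X.shape[0]
    a = sigmoid(X @ W + b)          # forward pass: a = sigma(z)
    dz = a - y                      # dl/dz = a - y
    dW = X.T @ dz / m               # dJ/dW = (1/m) * sum_i x_i * (a_i - y_i)
    db = dz.mean()                  # dJ/db = (1/m) * sum_i (a_i - y_i)
    return W - alpha * dW, b - alpha * db

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # toy data
y = np.array([0.0, 0.0, 1.0])
W, b = np.zeros(2), 0.0
for _ in range(100):
    W, b = logistic_step(W, b, X, y, alpha=0.5)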

53 Perceptron
 Earlier we were using very elaborate calculations to estimate the direction of gradient descent…..
 Why???
 Can’t we just calculate in both directions and see what’s better?
 Not a bad suggestion… let’s work out a procedure….
 Add random noise to weights W
 Run a trial run
 Move in the direction of improvement

Shape of Weights W (inputs, outputs)


54 Method of Finite Difference


 Generate small random numbers (δW)

 Shape of δW will be same as that of W

 Create W⁺ and W⁻ as follows


 W⁺ = W + δW
 W⁻ = W − δW

 Test both versions and record the loss from each: G⁺ and G⁻

 Update the weights as follows

 W = W + α · (G⁺ − G⁻) · δW

 Keep repeating till it converges to a Minima (We are minimizing the cost!)

 Can easily be implemented for gain in a similar fashion
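A minimal sketch of one finite-difference update; loss_fn(W) is a hypothetical callable returning the cost for weights W. Because we are minimizing a loss here, the step uses (G⁻ − G⁺); with a gain/reward being maximized, the sign flips to the (G⁺ − G⁻) form written above.

import numpy as np

def finite_difference_step(W, loss_fn, alpha=0.02, noise=0.03):
    dW = noise * np.random.randn(*W.shape)              # small random perturbation, same shape as W
    g_plus, g_minus = loss_fn(W + dW), loss_fn(W - dW)  # trial runs in both directions
    return W + alpha * (g_minus - g_plus) * dW          # move along dW toward the lower loss

# tiny demo: minimize ||W||^2
W = np.ones(3)
for _ in range(200):
    W = finite_difference_step(W, loss_fn=lambda w: float(np.sum(w ** 2)))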


55 Learning Rate Decay


 Slowly reduce learning rate.

 As mentioned before mini-batch gradient descent won't reach the optimum point (converge). But by making the
learning rate decay with iterations it will be much closer to it because the steps (and possible oscillations) near the
optimum are smaller.
 One equation is
 learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0
 epoch_num counts passes over all the data (not a single mini-batch)

 Other learning rate decay methods (continuous):


 learning_rate = (0.95 ^ epoch_num) * learning_rate_0
 learning_rate = (k / sqrt(epoch_num)) * learning_rate_0

 Some people perform learning rate decay discretely - repeatedly decrease after some number of epochs

 Some people are making changes to the learning rate manually


 ‘decay_rate’ is another hyperparameter

 Learning rate decay has less priority… last thing to tune in your network
