S09 DNN Gradients (WIP)
2 Agenda
3 Optimization Problem
Neural Network: we solve it as an optimization problem
Compute the gradients ∂J/∂W and ∂J/∂b
Update the weights: W = W − α · ∂J/∂W
Similarly: b = b − α · ∂J/∂b
where α is the learning rate
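A minimal sketch of this update step in NumPy (the cost J is left abstract; the gradient values, shapes, and α below are assumptions made purely for illustration):

import numpy as np

def gradient_descent_step(W, b, dJ_dW, dJ_db, alpha=0.1):
    """One gradient-descent update: move W and b against their gradients."""
    W = W - alpha * dJ_dW   # W = W - α · ∂J/∂W
    b = b - alpha * dJ_db   # b = b - α · ∂J/∂b
    return W, b

# Toy usage with made-up gradient values, just to show the mechanics.
W = np.array([0.5, -0.3])
b = 0.1
dJ_dW = np.array([0.2, -0.1])   # assumed gradient of the cost w.r.t. W
dJ_db = 0.05                    # assumed gradient of the cost w.r.t. b
W, b = gradient_descent_step(W, b, dJ_dW, dJ_db, alpha=0.1)
print(W, b)                     # roughly [0.48 -0.29] and 0.095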
5 Optimization Problem
[Figure: cost J plotted against a weight w, with the slope dJ/dw at the current point.]
[Figure: objective function over a state space, annotated with the global maximum, local maxima, a "flat" local maximum (shoulder), and the current vs. preferred state.]
[Figure: gradient descent trajectories for two learning rates, Case I (α = 0.1) and Case II (α = 0.01), compared with the desired convergence.]
Popular algorithms
SGD
Adam
Adadelta
Adagrad
RMSProp
12 Gradient Descent
Gradient descent is one of the most popular algorithms to perform optimization
Most common way to optimize neural networks
The learning rate α decides our step size (go down the slope… till you reach the valley!)
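As a tiny illustration (my own toy function, not from the slides), plain gradient descent on J(w) = (w − 3)² walks down the slope to the valley at w = 3, with α setting the step size:

def dJ_dw(w):
    return 2 * (w - 3)          # gradient of J(w) = (w - 3)^2

w, alpha = 0.0, 0.1             # assumed starting point and learning rate
for step in range(25):
    w = w - alpha * dJ_dw(w)    # go down the slope...
print(round(w, 4))              # ...till you (almost) reach the valley at w = 3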
13 Stochastic Gradient Descent
Imagine how many calculations will be needed to cover all data points and all batches
In this method, we pick one point randomly out of the “batch” and compute the loss for the same
Enables it to jump over local minima, with the hope of finding the global minimum
Our very first model…
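A minimal sketch of this idea (the linear model, squared loss, and toy data are my own assumptions for illustration): every update uses the loss of one randomly chosen example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: y ≈ 2·x + 1 plus a little noise (assumed just for illustration).
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 1 + 0.1 * rng.normal(size=100)

w, b, alpha = 0.0, 0.0, 0.1
for step in range(1000):
    i = rng.integers(len(X))        # pick ONE point at random out of the batch
    y_hat = w * X[i] + b
    err = y_hat - y[i]              # gradient of 0.5·(ŷ - y)² w.r.t. ŷ
    w -= alpha * err * X[i]         # ∂loss/∂w = err · x
    b -= alpha * err                # ∂loss/∂b = err
print(round(w, 2), round(b, 2))     # close to 2 and 1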
There is a possibility that it makes the gradient descent noisy, with values jumping around uncontrollably
It was so noisy in some cases that we needed to tweak our implementation a bit
So we changed it to batch gradient descent, which performs model updates at the end of each
training epoch
Epoch
Dictionary : “A long period of time, especially one in which there are new developments and great change”
ML: One cycle through the entire training dataset
Batch gradient descent computes the gradients over the whole dataset to perform one update
Most deep learning libraries provide automatic differentiation that efficiently computes these gradients
We then update our parameters in the direction opposite to the gradients, scaled by the learning rate
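A minimal sketch of this using automatic differentiation, assuming PyTorch as the library and a toy linear-regression problem: the gradient is computed over the whole dataset and one update is made per pass.

import torch

# Toy dataset (assumed): y = 3·x - 2.
X = torch.linspace(-1, 1, 200).unsqueeze(1)
y = 3 * X - 2

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1

for epoch in range(200):
    y_hat = X * w + b
    loss = ((y_hat - y) ** 2).mean()    # cost over the WHOLE dataset
    loss.backward()                     # autodiff fills in w.grad and b.grad
    with torch.no_grad():               # one parameter update per full pass
        w -= alpha * w.grad
        b -= alpha * b.grad
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())               # close to 3 and -2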
Cons:
A more stable gradient may result in premature convergence
Needs the additional step of accumulating errors across all training examples
Model updates and training speed may become very slow for large datasets
Cons:
Error information must be accumulated across mini-batches of training examples, like in batch gradient descent
20 Difference
Batch Gradient Descent: use all m examples, then update once…
Stochastic Gradient Descent: use one example, then update
Mini-batch Gradient Descent: use a mini-batch of b examples, then update (see the sketch below)
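A minimal mini-batch sketch (the toy data, model, and batch size are assumed for illustration): shuffle the m examples, split them into mini-batches, and update once per mini-batch.

import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the m examples and yield mini-batches of the given size."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 4 * X[:, 0] + 0.5                      # assumed toy targets

w, b, alpha = 0.0, 0.0, 0.1
for epoch in range(50):
    for Xb, yb in minibatches(X, y, batch_size=16, rng=rng):
        err = Xb[:, 0] * w + b - yb        # residuals for this mini-batch only
        w -= alpha * np.mean(err * Xb[:, 0])
        b -= alpha * np.mean(err)
print(round(w, 2), round(b, 2))            # close to 4 and 0.5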
Use learning rate schedules (Case I vs. Case II)
How do we avoid getting trapped in local minima? And what about plateaus?
[Figure: objective function over the state space, showing local maxima, a "flat" local maximum, and a shoulder.]
vₜ = β·vₜ₋₁ + (1 − β)·θₜ   (an exponential moving average of θ)
SGD oscillates across the slopes of the ravine, so progress towards the bottom is very slow.
The momentum term increases for dimensions whose successive gradients point in the same direction
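A minimal sketch of SGD with momentum using the averaging rule above (the ravine-shaped quadratic, α, and β are my own toy assumptions):

import numpy as np

def grad(theta):
    """Gradient of a ravine-shaped quadratic: steep in x, shallow in y (toy example)."""
    x, y = theta
    return np.array([20.0 * x, 0.2 * y])

theta = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, beta = 0.01, 0.9
for step in range(500):
    g = grad(theta)
    v = beta * v + (1 - beta) * g       # exponential moving average of the gradient
    theta = theta - alpha * v           # step along the smoothed direction
print(np.round(theta, 3))               # the steep x-direction has collapsed to ~0; the shallow y-direction is still creeping down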
30 AdaGrad
SGD needs:
Starting point to be selected
Constant learning rate
31 AdaGrad
Improves the robustness of SGD
Previously, we performed an update for all weights W at once, as every parameter wᵢ used the same
learning rate η (read 𝜶).
AdaGrad instead uses a different learning rate for every weight wᵢ at every time step t.
Let ∂Wₜ be the gradient of the objective function w.r.t. the parameters at time step t. The update is
W = W − (η / √(Gₜ + ε)) · ∂Wₜ
where Gₜ is the sum of the element-wise products of the gradients with themselves (i.e. the squared gradients) up to time step t, Gₜ = Σ (∂Wₜ)², and ε is a small constant to avoid division by zero.
32 AdaGrad
The same rule is applied separately to each parameter:
W = W − (η / √(Gₜᵂ + ε)) · ∂W,  with Gₜᵂ = Σ (∂W)²
b = b − (η / √(Gₜᵇ + ε)) · ∂b,  with Gₜᵇ = Σ (∂b)²
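A minimal AdaGrad sketch following the update above (the toy objective, η, and ε are assumptions made for illustration):

import numpy as np

def adagrad_update(params, grads, G, eta=0.1, eps=1e-8):
    """AdaGrad: each parameter gets its own effective learning rate
    eta / sqrt(G + eps), where G accumulates that parameter's squared gradients."""
    G = G + grads ** 2                            # G_t = G_{t-1} + (∂W)², element-wise
    params = params - eta / np.sqrt(G + eps) * grads
    return params, G

# Toy usage on a quadratic with very different curvatures per coordinate (assumed example).
W = np.array([1.0, 1.0])
G = np.zeros(2)
for step in range(200):
    grads = np.array([10.0 * W[0], 0.1 * W[1]])   # ∂J/∂W for J = 5·w0² + 0.05·w1²
    W, G = adagrad_update(W, grads, G)
print(np.round(W, 3))   # both shrink towards 0 at the same pace despite the 100× gradient-scale gap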
33 Adadelta
Derived from AdaGrad
Benefits:
No manual adjustment of a learning rate after initial selection.
Insensitive to hyperparameters.
Separate dynamic learning rate per-dimension.
Minimal computation over gradient descent.
Robust to large gradients, noise and architecture choice.
Applicable in both local or distributed environments.
Each weight has a delta value that increases when the gradient doesn't change sign (meaning it's a
step in the correct direction) and decreases when the gradient does change sign
It's not commonly used, as its implementation is clumsy and few out-of-the-box solutions are
available.
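For reference, a minimal sketch of the standard Adadelta update rule from Zeiler's 2012 paper (the toy objective, ρ, and ε values are my own assumptions):

import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One Adadelta update: the step size is the ratio of two running RMS values,
    so no learning rate needs to be chosen by hand."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                 # running average of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # per-dimension step
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2                 # running average of squared updates
    return w + dx, Eg2, Edx2

# Toy usage (assumed example): minimize J(w) = w² starting from w = 5.
w, Eg2, Edx2 = 5.0, 0.0, 0.0
for step in range(2000):
    w, Eg2, Edx2 = adadelta_step(w, 2.0 * w, Eg2, Edx2)
print(round(w, 3))   # drifts towards the minimum at 0 without any hand-tuned α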
35 RMSProp
[Equation figure: the RMSProp update, annotated with the decay rate β and the learning rate α.]
36 RMSProp
Gradients for different weights are different
Combines the idea of only using the sign of the gradient with the idea of adapting the step size
separately for each weight
Keep a moving average of the squared gradients, then divide the gradient by the square root of that mean square
In Momentum, gradient descent was modified using an exponential moving average of the gradient; here the moving average is of the squared gradient
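A minimal RMSProp sketch of the idea above (the toy objective and hyperparameter values are assumptions made for illustration):

import numpy as np

def rmsprop_update(w, grad, Eg2, alpha=0.01, beta=0.9, eps=1e-8):
    """RMSProp: keep an exponential moving average of the squared gradient,
    then divide the gradient by its square root (the RMS) before stepping."""
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(Eg2) + eps)
    return w, Eg2

# Toy usage (assumed example): two weights with wildly different gradient scales.
W = np.array([1.0, 1.0])
Eg2 = np.zeros(2)
for step in range(300):
    grads = np.array([100.0 * W[0], 0.01 * W[1]])   # ∂J/∂W for J = 50·w0² + 0.005·w1²
    W, Eg2 = rmsprop_update(W, grads, Eg2)
print(np.round(W, 3))   # both move towards 0 at a similar, scale-free pace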
Adam was presented by Diederik Kingma (OpenAI) and Jimmy Ba (University of Toronto) in their 2015 ICLR
paper (poster) titled "Adam: A Method for Stochastic Optimization"
https://arxiv.org/abs/1412.6980
Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the
learning rate does not change during training.
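A minimal sketch of the Adam update from the paper cited above (the toy objective and hyperparameter values are assumptions made for illustration):

import numpy as np

def adam_update(w, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style first moment plus RMSProp-style second moment,
    both bias-corrected, giving a per-parameter adaptive learning rate."""
    m = beta1 * m + (1 - beta1) * grad          # EMA of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # EMA of squared gradients (2nd moment)
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage (assumed example): a ravine-shaped quadratic.
W = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    grads = np.array([20.0 * W[0], 0.2 * W[1]])
    W, m, v = adam_update(W, grads, m, v, t)
print(np.round(W, 3))   # both coordinates end up close to 0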
42 Reflect…
What is Gradient Descent used for?
A. Clustering
B. Regression
C. Dimensionality Reduction
D. Image Classification
Answer: B. Regression

Which of the following best describes Gradient Descent?
A. An optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.
B. An algorithm for finding the maximum of a function.
C. A clustering algorithm based on distance between data points.
D. A classification algorithm for non-linearly separable data.
Answer: A. An optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.

In Gradient Descent, what does the "gradient" represent?
A. The direction of steepest ascent of the function.
B. The rate of change of the function at a point.
C. The distance between data points.
D. The probability of occurrence of a data point.
Answer: B. The rate of change of the function at a point.

What is the role of the learning rate in Gradient Descent?
A. It determines the number of iterations.
B. It specifies the initial position of the algorithm.
C. It controls the size of the steps taken during optimization.
D. It sets the threshold for convergence.
Answer: C. It controls the size of the steps taken during optimization.
43 Reflect…
Which variant of Gradient Descent updates the parameters after evaluating the gradient over the entire dataset?
A. Stochastic Gradient Descent (SGD)
B. Mini-batch Gradient Descent
C. Batch Gradient Descent
D. Momentum-based Gradient Descent
Answer: C. Batch Gradient Descent

What is a potential issue with a high learning rate in Gradient Descent?
A. Slow convergence
B. Overshooting the minimum
C. Premature convergence to a local minimum
D. Increased computational complexity
Answer: B. Overshooting the minimum

Which statement best describes the trade-offs between different variants of Gradient Descent?
A. Batch Gradient Descent is faster than Stochastic Gradient Descent.
B. Stochastic Gradient Descent guarantees convergence to the global minimum.
C. Mini-batch Gradient Descent balances the efficiency of batch GD and the stochastic nature of SGD.
D. Momentum-based Gradient Descent is less prone to getting stuck in local minima.
Answer: C. Mini-batch Gradient Descent balances the efficiency of batch GD and the stochastic nature of SGD.

Which of the following statements is true regarding the convergence of Gradient Descent?
A. Gradient Descent always converges to the global minimum.
B. Gradient Descent may converge to a local minimum depending on the initialization and learning rate.
C. Gradient Descent always converges, regardless of the function being optimized.
D. Gradient Descent converges faster with a smaller number of iterations.
Answer: B. Gradient Descent may converge to a local minimum depending on the initialization and learning rate.
44 Reflect…
Which technique is often used to mitigate the issue of oscillations around the minimum in Gradient
Descent?
A. Decreasing the learning rate over time
B. Increasing the learning rate over time
C. Using a larger batch size
D. Adding momentum to the updates
Answer: A. Decreasing the learning rate over time
45 Next Session
L1, L2 regularization
Dropout
Early Stopping
Augmentation
ADDITIONAL MATERIAL
What’s Next
51 Perceptron
Welcome back to our perceptron
52 Perceptron
Inputs x1, x2 with weights w1, w2 and bias b:
z = w1·x1 + w2·x2 + b,  ŷ = a = σ(z),  loss ℓ(a, y)

σ(z) = 1 / (1 + e^(−z))
ℓ(a, y) = −[ y·log(a) + (1 − y)·log(1 − a) ]   (binary cross-entropy, for binary classification)

∂ℓ/∂a = −y/a + (1 − y)/(1 − a)
∂a/∂z = σ(z)·(1 − σ(z)) = a·(1 − a)
∂ℓ/∂z = ∂ℓ/∂a · ∂a/∂z = { −y/a + (1 − y)/(1 − a) } · a·(1 − a) = a − y

∂ℓ/∂w1 = x1 · ∂ℓ/∂z = x1·(a − y)
∂ℓ/∂w2 = x2 · ∂ℓ/∂z = x2·(a − y)
∂ℓ/∂b = ∂ℓ/∂z = a − y

Updates, where α is the learning rate:
w1 = w1 − α · ∂ℓ/∂w1 = w1 − α · x1 · (a − y)
w2 = w2 − α · ∂ℓ/∂w2 = w2 − α · x2 · (a − y)
b  = b − α · ∂ℓ/∂b = b − α · (a − y)

Over the whole training set the cost function is
J(W, b) = (1/m) · Σ ℓ(a⁽ⁱ⁾, y⁽ⁱ⁾),  hence  ∂J/∂W = (1/m) · Σ ∂ℓ/∂W
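A minimal NumPy sketch of exactly these gradients and updates (the toy dataset, learning rate, and epoch count are my own assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (assumed): label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
alpha = 0.5

for epoch in range(500):
    z = X @ w + b                 # z = w1·x1 + w2·x2 + b for every example
    a = sigmoid(z)                # ŷ = a = σ(z)
    dz = a - y                    # ∂ℓ/∂z = a - y  (the result derived above)
    dw = X.T @ dz / len(X)        # ∂J/∂w = (1/m) Σ x·(a - y)
    db = dz.mean()                # ∂J/∂b = (1/m) Σ (a - y)
    w -= alpha * dw               # w = w - α·∂J/∂w
    b -= alpha * db
print(np.round(w, 2), round(b, 2))   # the weights point along (1, 1); the bias stays small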
53 Perceptron
Earlier we were using very elaborate calculations to estimate the direction of gradient descent…
Why???
Can't we just calculate in both directions and see what's better?
Not a bad suggestion… let's work out a procedure:
Add random noise to weights W
Run a trial run
Move in the direction of improvement
W = W + α · (G_old − G_new) · δW, where G_old and G_new are the cost before and after the trial with the perturbed weights (if the perturbation lowered the cost, keep moving along δW)
Keep repeating till it converges to a minimum (we are minimizing the cost!)
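A minimal sketch of this perturbation procedure (the toy cost, noise scale, and α are my own assumptions):

import numpy as np

def cost(W):
    """Toy cost to minimize (assumed example): a simple quadratic bowl."""
    return float(np.sum((W - 3.0) ** 2))

rng = np.random.default_rng(0)
W = np.zeros(2)
alpha = 0.5

for trial in range(2000):
    dW = rng.normal(scale=0.1, size=W.shape)   # add random noise to the weights
    G_old = cost(W)                            # cost before the perturbation
    G_new = cost(W + dW)                       # trial run with the perturbed weights
    W = W + alpha * (G_old - G_new) * dW       # move along dW in proportion to the improvement
print(np.round(W, 2))   # creeps towards the minimum at (3, 3)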
As mentioned before, mini-batch gradient descent won't exactly reach the optimum point (converge). But by making the
learning rate decay with iterations, it gets much closer to it, because the steps (and possible oscillations) near the
optimum are smaller.
One common equation is
learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0
where epoch_num counts passes over all the data (not a single mini-batch)
Some people perform learning rate decay discretely, repeatedly decreasing it after some number of epochs
Learning rate decay has lower priority… it is usually the last thing to tune in your network
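A minimal sketch of both decay styles (the starting rate, decay rate, and halving schedule are assumed values):

learning_rate_0 = 0.2      # assumed initial learning rate
decay_rate = 1.0           # assumed decay rate

# Smooth decay: the formula above, evaluated once per epoch.
for epoch_num in range(1, 6):
    learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0
    print(epoch_num, round(learning_rate, 3))   # 0.1, 0.067, 0.05, 0.04, 0.033

# Discrete ("staircase") decay: halve the rate every 10 epochs.
learning_rate = learning_rate_0
for epoch_num in range(30):
    if epoch_num > 0 and epoch_num % 10 == 0:
        learning_rate = learning_rate / 2
print(learning_rate)   # 0.05 after epochs 0-29 (halved at epochs 10 and 20)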