9.b Handout-3-GD Variants
The numerical gradient is very simple to compute using the finite difference approximation, but
the downside is that it is approximate (since we have to pick a small value of h, while the true
gradient is defined as the limit as h goes to zero), and that it is very computationally expensive to
compute. The second way to compute the gradient is analytically using Calculus, which allows us
to derive a direct formula for the gradient (no approximations) that is also very fast to compute.
However, unlike the numerical gradient it can be more error prone to implement, which is why in
practice it is very common to compute the analytic gradient and compare it to the numerical
gradient to check the correctness of your implementation. This is called a gradient check.
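For illustration, here is a minimal sketch of such a gradient check in numpy. The helper names (numerical_gradient, gradient_check), the centered-difference formula, and the step size h = 1e-5 are illustrative choices, not prescribed by these notes:

import numpy as np

def numerical_gradient(loss_fun, w, h=1e-5):
  # centered finite-difference approximation of the gradient at w
  grad = np.zeros_like(w)
  it = np.nditer(w, flags=['multi_index'])
  while not it.finished:
    ix = it.multi_index
    old = w[ix]
    w[ix] = old + h
    fxph = loss_fun(w)              # f(w + h)
    w[ix] = old - h
    fxmh = loss_fun(w)              # f(w - h)
    w[ix] = old                     # restore the original value
    grad[ix] = (fxph - fxmh) / (2 * h)
    it.iternext()
  return grad

def gradient_check(loss_fun, analytic_grad, w):
  # compare the analytic gradient to the numerical one via relative error
  num = numerical_gradient(loss_fun, w)
  ana = analytic_grad(w)
  rel = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
  print('max relative error: %e' % np.max(rel))

A very small maximum relative error (say, below 1e-6) is good evidence that the analytic gradient is implemented correctly.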
Let's use the example of the SVM loss function for a single datapoint:
$$L_i = \sum_{j \neq y_i} \left[ \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right]$$
We can differentiate the function with respect to the weights. For example, taking the gradient
with respect to $w_{y_i}$ we obtain:

$$\nabla_{w_{y_i}} L_i = - \left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i$$
where $\mathbb{1}$ is the indicator function that is one if the condition inside is true or zero otherwise. While
the expression may look scary when it is written out, when you're implementing this in code you'd
simply count the number of classes that didn't meet the desired margin (and hence contributed to
the loss function), and then the data vector $x_i$ scaled by this number is the gradient. Notice that
this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For
the other rows, where $j \neq y_i$, the gradient is:
$$\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i$$
Once you derive the expression for the gradient, it is straightforward to implement it and use it
to perform the gradient update.
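As a sanity check on the two expressions above, here is a minimal numpy sketch for a single data point. The layout of W as a [K x D] matrix with one row of weights per class, and the names below, are illustrative assumptions rather than part of the notes:

import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
  # W: [K x D] weights (one row per class), x: [D] example, y: correct class index
  scores = W.dot(x)                         # class scores, shape [K]
  margins = scores - scores[y] + delta      # margin term for every class
  margins[y] = 0                            # the correct class does not contribute
  loss = np.sum(np.maximum(0, margins))

  dW = np.zeros_like(W)
  positive = margins > 0                    # indicator 1(w_j^T x - w_y^T x + delta > 0)
  dW[positive] = x                          # rows j != y that violated the margin get x
  dW[y] = -np.sum(positive) * x             # correct-class row: -(violation count) * x
  return loss, dW

Note how the code mirrors the math: the row for the correct class is $x_i$ scaled by the (negative) number of margin violations, and every other violating row is simply $x_i$.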
Gradient Descent
Now that we can compute the gradient of the loss function, the procedure of repeatedly
evaluating the gradient and then performing a parameter update is called Gradient Descent. Its
vanilla version looks as follows:
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update
This simple loop is at the core of all Neural Network libraries. There are other ways of performing
the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and
established way of optimizing Neural Network loss functions. Throughout the class we will put
some bells and whistles on the details of this loop (e.g. the exact details of the update equation),
but the core idea of following the gradient until we’re happy with the results will remain the same.
Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the
training data can be on the order of millions of examples. Hence, it seems wasteful to compute the
full loss function over the entire training set in order to perform only a single parameter update. A
very common approach to addressing this challenge is to compute the gradient over batches of
the training data. For example, in current state of the art ConvNets, a typical batch contains 256
examples from the entire training set of 1.2 million. This batch is then used to perform a
parameter update:
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update
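The sample_training_data helper is left unspecified in the notes; one plausible implementation, assuming the data is stored as a tuple of numpy arrays (X, y), would be:

import numpy as np

def sample_training_data(data, batch_size):
  # draw a random mini-batch without replacement; data is assumed to be (X, y)
  X, y = data
  idx = np.random.choice(X.shape[0], batch_size, replace=False)
  return X[idx], y[idx]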
The reason this works well is that the examples in the training data are correlated. To see this,
consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact
duplicates of only 1000 unique images (one for each class, or in other words 1200 identical
copies of each image). Then it is clear that the gradients we would compute for all 1200 identical
copies would all be the same, and when we average the data loss over all 1.2 million images we
would get the exact same loss as if we only evaluated on a small subset of 1000. In practice, of
course, the dataset would not contain duplicate images, but the gradient from a mini-batch is still
a good approximation of the gradient of the full objective. Therefore, much faster convergence can be
achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter
updates.
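This duplication argument is easy to verify numerically. The toy example below uses a simple mean-squared-error loss (an illustrative stand-in for the real data loss) with 1200 identical copies of 100 unique examples:

import numpy as np

np.random.seed(0)
X_small = np.random.randn(100, 10)        # 100 unique examples
y_small = np.random.randn(100)
X_full = np.tile(X_small, (1200, 1))      # 1200 identical copies of each example
y_full = np.tile(y_small, 1200)
w = np.random.randn(10)

def mse_grad(X, y, w):
  # gradient of the mean squared error data loss (1/N) * sum_i (x_i . w - y_i)^2
  return 2.0 * X.T.dot(X.dot(w) - y) / X.shape[0]

# the gradient over all copies equals the gradient over just the unique examples
print(np.allclose(mse_grad(X_full, y_full, w), mse_grad(X_small, y_small, w)))  # True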
The extreme case of this is a setting where the mini-batch contains only a single example. This
process is called Stochastic Gradient Descent (SGD) (or also sometimes on-line gradient
descent). This is relatively uncommon to see because in practice, due to vectorized code
optimizations, it can be computationally much more efficient to evaluate the gradient for 100
examples at once than the gradient for one example 100 times. Even though SGD technically refers to
using a single example at a time to evaluate the gradient, you will hear people use the term SGD
even when referring to mini-batch gradient descent (i.e. mentions of MGD for “Minibatch Gradient
Descent”, or BGD for “Batch gradient descent” are rare to see), where it is usually assumed that
mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common
to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g.
32, 64 or 128. We use powers of 2 in practice because many vectorized operation
implementations work faster when their inputs are sized in powers of 2.
Summary
Summary of the information flow. The dataset of pairs of (x, y) is given and fixed. The weights start out as
random numbers and can change. During the forward pass the score function computes class scores,
stored in vector f. The loss function contains two components: the data loss computes the compatibility
between the scores f and the labels y, and the regularization loss is only a function of the weights. During
Gradient Descent, we compute the gradient on the weights (and optionally on the data if we wish) and use
them to perform a parameter update.
In this section, we saw how to compute the gradient of the loss function analytically, verify it with a
numerical gradient check, and use it to iteratively update the parameters with (mini-batch) gradient descent.