
Lecture 5:

Learning Rate Schedules


Neural Networks

Subhransu Maji, Chuang Gan and TAs


Some slides kindly provided by Fei-Fei Li, Jiajun Wu, Erik Learned-Miller
Sep 17, 2024
Announcements
● Optional discussion this Friday, Sep 20, 11-12pm, CS142
● Topic: Reviewing the chain rule, Applying the chain rule to vectors

● Homework 1 due Thursday, Sept 26, 11:55pm

Recap
- We have some dataset of (x, y), e.g. images x with labels y
- We have a score function: s = f(x; W) = Wx
- We have a loss function:

  Softmax: L_i = −log( e^{s_{y_i}} / Σ_j e^{s_j} )

  SVM: L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)

  Full loss: L = (1/N) Σ_i L_i + λ R(W)

Finding the best W: Optimize with Gradient Descent

Landscape image is CC0 1.0 public domain


Walking man image is CC0 1.0 public domain

Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)


Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive the analytic gradient, then check your implementation with the numerical gradient.
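For instance, a minimal numpy sketch of such a gradient check (the helper name and relative-error rule of thumb are illustrative, not from the homework code):

import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Centered-difference estimate of dL/dW for a scalar loss f(W)."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)                    # loss with this coordinate nudged up
        W[idx] = old - h
        f_minus = f(W)                   # loss with this coordinate nudged down
        W[idx] = old                     # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Compare against the analytic gradient with a relative error, e.g.
# |g_num - g_analytic| / max(|g_num|, |g_analytic|); around 1e-7 or less is reassuring.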

Stochastic Gradient Descent (SGD)
The full sum over all N training examples is expensive when N is large!

Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
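A minimal sketch of the resulting minibatch SGD loop, assuming a hypothetical loss_and_grad(W, X_batch, y_batch) helper that returns the minibatch loss and gradient:

import numpy as np

def train_sgd(W, X_train, y_train, loss_and_grad, lr=1e-3,
              batch_size=128, num_iters=1000):
    num_train = X_train.shape[0]
    for it in range(num_iters):
        # sample a minibatch instead of summing over all N examples
        idx = np.random.choice(num_train, batch_size, replace=False)
        loss, dW = loss_and_grad(W, X_train[idx], y_train[idx])
        W -= lr * dW   # vanilla gradient descent step on the minibatch estimate
    return W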

Last time: fancy optimizers

SGD

SGD+Momentum

RMSProp

Adam
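As a reminder, a minimal sketch of those update rules for a single parameter array w with gradient dw (state arrays start at zero; the hyperparameter values are typical defaults, not prescriptions):

import numpy as np

def sgd(w, dw, lr):
    return w - lr * dw

def sgd_momentum(w, dw, v, lr, rho=0.9):
    v = rho * v + dw                                      # build up a running "velocity"
    return w - lr * v, v

def rmsprop(w, dw, grad_sq, lr, decay=0.99, eps=1e-8):
    grad_sq = decay * grad_sq + (1 - decay) * dw * dw     # leaky average of squared gradients
    return w - lr * dw / (np.sqrt(grad_sq) + eps), grad_sq

def adam(w, dw, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw                      # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw * dw                 # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                          # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v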

Learning rate schedules

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have
learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?

A: In reality, all of these are good learning rates.

Learning Rate Decay

The learning rate typically decays over time. Common schedules:

Step: Reduce the learning rate at a few fixed points. E.g. for ResNets, multiply the LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))
(Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017; Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018; Feichtenhofer et al, “SlowFast Networks for Video Recognition”, arXiv 2018; Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019)

Linear: α_t = α_0 (1 − t/T)
(Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018)

Inverse sqrt: α_t = α_0 / √t
(Vaswani et al, “Attention is all you need”, NIPS 2017)

where α_0 is the initial learning rate, α_t is the learning rate at epoch t, and T is the total number of epochs.
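A minimal sketch of these schedules as functions of the epoch t (the function names are illustrative; the milestones and decay factor in step_decay are the ResNet-style example above):

import math

def step_decay(alpha_0, t, milestones=(30, 60, 90), gamma=0.1):
    # multiply the LR by gamma at each milestone epoch
    return alpha_0 * gamma ** sum(t >= m for m in milestones)

def cosine_decay(alpha_0, t, T):
    return 0.5 * alpha_0 * (1 + math.cos(math.pi * t / T))

def linear_decay(alpha_0, t, T):
    return alpha_0 * (1 - t / T)

def inverse_sqrt_decay(alpha_0, t):
    return alpha_0 / math.sqrt(max(t, 1))   # guard against t = 0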
In practice:
- Adam is a good default choice in many cases; it often works OK even with a constant learning rate.
- SGD+Momentum can outperform Adam but may require more tuning of the learning rate and schedule.

Neural Networks

Neural networks: the original linear classifier

(Before) Linear score function: f = Wx

Neural networks: 2 layers

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

(In practice we will usually add a learnable bias at each layer as well)

Why do we want non-linearity?

Cannot separate red and blue points with a linear classifier.

After applying the feature transform f(x, y) = (r(x, y), θ(x, y)), which maps the points to polar coordinates (radius r, angle θ), the points can be separated by a linear classifier.
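A minimal numpy sketch of that transform (taking the feature map to be the polar-coordinate transform, as the r and θ axes in the figure suggest):

import numpy as np

def polar_features(x, y):
    r = np.sqrt(x ** 2 + y ** 2)   # radius
    theta = np.arctan2(y, x)       # angle
    return r, theta

# Points inside a circle (say r < 1) vs. outside it are not linearly separable
# in (x, y), but in (r, theta) they are separated by the linear rule r = 1.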

Neural networks: also called fully connected network

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)
(In practice we will usually add a learnable bias at each layer as well)

Neural networks: 3 layers

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well)

Neural networks: hierarchical computation
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072) → W1 → h (100) → W2 → s (10)
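A minimal numpy sketch of this computation with the dimensions above (weights drawn randomly just to illustrate the shapes; biases omitted as on the slide):

import numpy as np

x = np.random.randn(3072)               # flattened input image
W1 = 0.01 * np.random.randn(100, 3072)  # first-layer weights
W2 = 0.01 * np.random.randn(10, 100)    # second-layer weights

h = np.maximum(0, W1.dot(x))            # hidden layer: max(0, W1 x), 100 values
s = W2.dot(h)                           # class scores: W2 max(0, W1 x), 10 values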

Neural networks: learning 100s of templates

Learn 100 templates instead of 10. Share templates between classes


Neural networks: why is max operator important?
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

The function max(0, z) is called the activation function.


Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again! Without the non-linearity, f = W2 (W1 x) = (W2 W1) x, which is just another linear function of x.

Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

ReLU is a good default choice for most problems.
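A minimal numpy sketch of these activation functions (Maxout shown for two linear pieces):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def maxout(z1, z2):
    return np.maximum(z1, z2)   # element-wise max over two linear projections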

Neural networks: Architectures

A network with one hidden layer is called a “2-layer Neural Net”, or “1-hidden-layer Neural Net”; a network with two hidden layers is a “3-layer Neural Net”, or “2-hidden-layer Neural Net”. The layers are “fully-connected” layers.

Example feed-forward computation of a neural network
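The slide's code is not reproduced here, but a minimal sketch of such a forward pass for a small 3-layer network (the sigmoid activation and the layer sizes are assumptions for illustration) might look like:

import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))            # sigmoid activation function

# random weights and inputs, purely to illustrate the shapes
x = np.random.randn(3, 1)                          # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))

h1 = f(W1.dot(x) + b1)                             # first hidden layer (4x1)
h2 = f(W2.dot(h1) + b2)                            # second hidden layer (4x1)
out = W3.dot(h2) + b3                              # output score (1x1)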

Full implementation of training a 2-layer Neural Network needs ~20 lines (see the sketch below):

- Define the network
- Forward pass
- Calculate the analytical gradients
- Gradient descent
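The slide's exact code is not reproduced here; below is a minimal sketch in the same spirit, training a tiny 2-layer ReLU network on random toy data (the sizes, toy data, and learning rate are illustrative assumptions):

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)                  # toy inputs
y = np.random.randn(N, D_out)                 # toy targets

# define the network
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
lr = 1e-6

for t in range(500):
    # forward pass
    h = np.maximum(0, x.dot(w1))              # hidden layer (ReLU)
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()

    # calculate the analytical gradients (backprop by hand)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h <= 0] = 0                        # backprop through the ReLU
    grad_w1 = x.T.dot(grad_h)

    # gradient descent
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2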
Setting the number of layers and their sizes

more neurons = more capacity


Do not use the size of the neural network as a regularizer. Use stronger regularization instead:

(Web demo with ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

This image by Fotis Bobolas is licensed under CC-BY 2.0.

A biological neuron: impulses are carried toward the cell body by the dendrites, and carried away from the cell body along the axon, which ends in presynaptic terminals.
This image by Felipe Perucho is licensed under CC-BY 3.0.

In the corresponding artificial neuron model, the cell computes a weighted sum of its inputs and passes it through a sigmoid activation function.

Biological Neurons: complex connectivity patterns.
Neurons in a neural network: organized into regular layers for computational efficiency.

This image is CC0 Public Domain

Biological Neurons: complex connectivity patterns.
But neural networks with random connections can work too!

This image is CC0 Public Domain


Xie et al, “Exploring Randomly Wired Neural Networks for Image Recognition”, arXiv 2019

Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system

[Dendritic Computation. London and Hausser]

Plugging in neural networks with loss functions
Nonlinear score function: s = W2 max(0, W1 x)

SVM loss on the predictions: L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)

Regularization: R(W) = Σ_k W_k²

Total loss (data loss + regularization): L = (1/N) Σ_i L_i + λ R(W)
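A minimal numpy sketch of this total loss for the 2-layer network (W1 of shape (D, H), W2 of shape (H, C); the regularization strength reg is an illustrative value):

import numpy as np

def total_loss(W1, W2, X, y, reg=1e-3):
    """X: (N, D) inputs, y: (N,) integer class labels."""
    N = X.shape[0]
    H = np.maximum(0, X.dot(W1))                           # nonlinear score function
    S = H.dot(W2)                                          # class scores, shape (N, C)

    # multiclass SVM (hinge) loss on the predictions
    correct = S[np.arange(N), y][:, None]
    margins = np.maximum(0, S - correct + 1.0)
    margins[np.arange(N), y] = 0                           # don't count the correct class
    data_loss = margins.sum() / N

    reg_loss = reg * (np.sum(W1 * W1) + np.sum(W2 * W2))   # L2 regularization
    return data_loss + reg_loss                            # data loss + regularization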

Problem: How to compute gradients?
If we can compute the gradients ∂L/∂W1 and ∂L/∂W2 of the total loss, then we can learn W1 and W2.

(Bad) Idea: Derive the gradients on paper

Problem: Very tedious: lots of matrix calculus, need lots of paper.
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch =(
Problem: Not feasible for very complex models!

Next lecture: Computational graphs + Backpropagation

(Computational graph: inputs x and W feed a multiply node (*) that produces the scores s; the scores feed a hinge loss node; W also feeds a regularization node R; the hinge loss and R are added (+) to give the total loss L.)

