
Lecture 4: Neural Networks and Backpropagation

Fei-Fei Li, Yunzhu Li, Ruohan Gao. April 13, 2023
Announcements

- AWS credit: create an account and submit the account number (ID) using the Google form by 4/13.
- Assignment 1 is due Fri 4/21 at 11:59pm.
Administrative: Project Proposal

- Due Mon 4/24.
- TA expertise is posted on the webpage:
  http://cs231n.stanford.edu/office_hours.html
Administrative: Discussion Section

Discussion section tomorrow: Backpropagation
Recap
- We have some dataset of (x, y) pairs, e.g. images and their labels.
- We have a score function: s = f(x; W) = Wx
- We have a loss function, e.g.:
  Softmax: L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) )
  SVM: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) sum_i L_i + R(W)
Finding the best W: Optimize with Gradient Descent

[Figure: a loss landscape with a walker descending it. Landscape image is CC0 1.0 public domain; walking man image is CC0 1.0 public domain.]
Gradient descent

- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation against the numerical gradient (a gradient check; see the sketch below).
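A minimal sketch of such a gradient check, assuming a scalar-valued `loss_fn(W)` callable; the function names and the error tolerance are illustrative, not from the slides.

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Centered-difference estimate of dLoss/dW, one element at a time."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        loss_plus = loss_fn(W)
        W[idx] = old - h
        loss_minus = loss_fn(W)
        W[idx] = old                       # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad

def gradient_check(analytic_grad, loss_fn, W):
    """Compare the analytic gradient against the numerical estimate."""
    num_grad = numerical_gradient(loss_fn, W)
    rel_error = np.abs(analytic_grad - num_grad) / (
        np.maximum(1e-8, np.abs(analytic_grad) + np.abs(num_grad)))
    return rel_error.max()                 # should be tiny, e.g. < 1e-6
```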
Stochastic Gradient Descent (SGD)

The full sum over all N examples is expensive when N is large! Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
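A sketch of the resulting vanilla minibatch SGD loop; `sample_minibatch` and `loss_and_gradient` are hypothetical helpers assumed here, not functions defined on the slides.

```python
# Hypothetical helpers assumed: sample_minibatch(data, size) and
# loss_and_gradient(W, x_batch, y_batch) -> (loss, dW).
def train_sgd(W, data, learning_rate=1e-3, batch_size=64, num_steps=1000):
    for step in range(num_steps):
        x_batch, y_batch = sample_minibatch(data, batch_size)  # e.g. 32/64/128 examples
        loss, dW = loss_and_gradient(W, x_batch, y_batch)      # approximates the full-sum gradient
        W -= learning_rate * dW                                # gradient descent step
    return W
```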
Last time: fancy optimizers

- SGD
- SGD+Momentum
- RMSProp
- Adam
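As a reminder of what two of those optimizers do per step, here is a hedged sketch of the standard update rules; the hyperparameter names and defaults are illustrative.

```python
import numpy as np

def sgd_momentum_step(w, dw, v, lr=1e-3, rho=0.9):
    # Momentum keeps a running velocity and steps along it.
    v = rho * v - lr * dw
    return w + v, v

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps running first and second moments of the gradient.
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw * dw
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```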
Last time: learning rate scheduling

- Step: reduce the learning rate at a few fixed points. E.g. for ResNets, multiply the LR by 0.1 after epochs 30, 60, and 90.
- Cosine: alpha_t = 0.5 * alpha_0 * (1 + cos(t * pi / T))
- Linear: alpha_t = alpha_0 * (1 - t / T)
- Inverse sqrt: alpha_t = alpha_0 / sqrt(t)

where alpha_0 is the initial learning rate, alpha_t is the learning rate at epoch t, and T is the total number of epochs.
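A minimal sketch of those schedules as functions of the epoch t, using the same alpha_0 and T notation as above; milestones and gamma in the step schedule are illustrative.

```python
import math

def step_lr(t, alpha_0, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch passed so far.
    return alpha_0 * gamma ** sum(t >= m for m in milestones)

def cosine_lr(t, alpha_0, T):
    return 0.5 * alpha_0 * (1 + math.cos(math.pi * t / T))

def linear_lr(t, alpha_0, T):
    return alpha_0 * (1 - t / T)

def inv_sqrt_lr(t, alpha_0):
    return alpha_0 / math.sqrt(max(t, 1))  # guard against t = 0
```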
Today: Deep Learning

Dall-E 2

“Teddy bears working on new AI research on the moon in the 1980s.” “Rabbits attending a college seminar on human anatomy.” “A wise cat meditating in the Himalayas searching for enlightenment.”

Image source: Sam Altman, https://openai.com/dall-e-2/, https://twitter.com/sama/status/1511724264629678084
Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.

GPT-4

Image source: https://openai.com/research/gpt-4

Segment Anything Model (SAM)

Kirillov et al., Segment Anything, 2023
Neural Networks

Neural networks: the original linear classifier

(Before) Linear score function: f = Wx

Neural networks: 2 layers

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

(In practice we will usually add a learnable bias at each layer as well)
Why do we want non-linearity?

In the original (x, y) coordinates, the red and blue points cannot be separated by a linear classifier. After applying the feature transform f(x, y) = (r(x, y), θ(x, y)), the points can be separated by a linear classifier in the new (r, θ) coordinates.
Neural networks: also called fully connected network

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

“Neural Network” is a very broad term; these are more accurately called “fully-connected networks” or sometimes “multi-layer perceptrons” (MLP).
(In practice we will usually add a learnable bias at each layer as well)

Neural networks: 3 layers

2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

Neural networks: hierarchical computation

x (3072) --W1--> h (100) --W2--> s (10)

Neural networks: learning 100s of templates

With a hidden layer of size 100, the network learns 100 templates instead of 10, and the second layer shares templates between classes.
Neural networks: why is max operator important?

In f = W2 max(0, W1 x), the max function is called the activation function.

Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again! Without the activation, f = W2 W1 x = W3 x, so the two layers collapse into a single linear transform (see the check below).
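A small numerical check of that claim with random matrices; this is just an illustration, not code from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)
W1 = rng.standard_normal((100, 3072))
W2 = rng.standard_normal((10, 100))

# Without a nonlinearity, stacking two linear layers is just one linear layer.
two_layer_linear = W2 @ (W1 @ x)
single_layer = (W2 @ W1) @ x
print(np.allclose(two_layer_linear, single_layer))   # True

# With the max nonlinearity, the composition is no longer linear in x.
two_layer_relu = W2 @ np.maximum(0, W1 @ x)
```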
Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU.
ReLU is a good default choice for most problems.

Neural networks: Architectures

Left: a “2-layer Neural Net”, or “1-hidden-layer Neural Net”.
Right: a “3-layer Neural Net”, or “2-hidden-layer Neural Net”.
These are “Fully-connected” layers.
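For reference, a hedged sketch of the common activation functions listed above (Maxout is omitted since it takes several linear inputs rather than a single pre-activation).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```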
Example feed-forward computation of a neural network

Full implementation of training a 2-layer Neural Network needs ~20 lines:

- Define the network
- Forward pass
- Calculate the analytical gradients
- Gradient descent

(A sketch of such an implementation follows.)
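A minimal sketch in the spirit of the slide's ~20-line implementation, assuming a sigmoid nonlinearity, a squared-error loss on random toy data, and full-batch gradient descent; the layer sizes and learning rate are illustrative.

```python
import numpy as np

# Define the network: toy data and randomly initialized weights.
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

for t in range(2000):
    # Forward pass: 2-layer net with a sigmoid nonlinearity.
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()          # simple L2 loss for illustration

    # Backward pass: calculate the analytical gradients.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))     # sigmoid local gradient: h * (1 - h)

    # Gradient descent step.
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```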
Setting the number of layers and their sizes

More neurons = more capacity.

Do not use the size of the neural network as a regularizer. Use stronger regularization instead.

(Web demo with ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)
[Image of a biological neuron by Fotis Bobolas, licensed under CC-BY 2.0.]

[Neuron diagram by Felipe Perucho, licensed under CC-BY 3.0. Labels: dendrite, cell body, axon, presynaptic terminal; impulses are carried toward the cell body by the dendrites and carried away from the cell body by the axon. The slide compares this to an artificial neuron with a sigmoid activation function.]

Biological neurons: complex connectivity patterns. Neurons in a neural network: organized into regular layers for computational efficiency. But neural networks with random connections can work too!
(Neuron image is CC0 Public Domain. Xie et al, “Exploring Randomly Wired Neural Networks for Image Recognition”, arXiv 2019)

Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system

[Dendritic Computation. London and Hausser]
Plugging in neural networks with loss functions

- Nonlinear score function: s = W2 max(0, W1 x)
- SVM loss on the predictions
- Regularization
- Total loss: data loss + regularization

Problem: How to compute gradients?

If we can compute dL/dW1 and dL/dW2, then we can learn W1 and W2.
(Bad) Idea: Derive the gradients on paper

- Problem: Very tedious: lots of matrix calculus, need lots of paper.
- Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch.
- Problem: Not feasible for very complex models!

Better Idea: Computational graphs + Backpropagation

[Graph: x and W feed a multiply node (*) that produces the scores s; the scores feed a hinge loss node, W also feeds a regularization node R, and an add node (+) combines the two into the total loss L.]
Convolutional network (AlexNet): a much larger computational graph from the input image, through the weights, to the loss.
(Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.)

Really complex neural networks!! E.g. a Neural Turing Machine is still a computational graph from input to loss.
(Figures reproduced with permission from a Twitter post by Andrej Karpathy.)

Solution: Backpropagation
Backpropagation: a simple example

f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4

Forward pass: q = x + y = 3, then f = q * z = -12.

Want: df/dx, df/dy, df/dz. Work backward through the graph:

- df/df = 1
- df/dz = q = 3
- df/dq = z = -4
- Chain rule: df/dy = (df/dq) * (dq/dy) = (-4) * 1 = -4   [upstream gradient x local gradient]
- Chain rule: df/dx = (df/dq) * (dq/dx) = (-4) * 1 = -4   [upstream gradient x local gradient]

(A code sketch of this example follows.)
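A hedged code sketch of that worked example, writing out the forward pass and then the backward pass by hand.

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: multiply upstream gradients by local gradients (chain rule).
df_df = 1.0
df_dz = q * df_df    # local gradient of f = q*z w.r.t. z is q  ->  3
df_dq = z * df_df    # local gradient w.r.t. q is z             -> -4
df_dx = 1.0 * df_dq  # local gradient of q = x + y w.r.t. x is 1 -> -4
df_dy = 1.0 * df_dq  # local gradient w.r.t. y is 1              -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```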
In general, each node f in the computational graph receives an “upstream gradient” (the gradient of the loss with respect to the node's output), multiplies it by its “local gradient” (the gradient of the node's output with respect to each input), and passes the resulting “downstream gradients” back to its inputs.
Another example: f(w, x) = 1 / (1 + e^(-(w0 x0 + w1 x1 + w2)))

Walking backward through the graph, each step multiplies the upstream gradient by the local gradient. Two highlights from the worked example:

- The add gates pass the upstream gradient through unchanged: [upstream gradient] x [local gradient] = [0.2] x [1] = 0.2 (both inputs!)
- At the multiply gate for the first input pair: w0: [0.2] x [-1] = -0.2, and x0: [0.2] x [2] = 0.4.

The computational graph representation may not be unique. Choose one where the local gradients at each node can be easily expressed! For example, the whole sigmoid function sigma(x) = 1 / (1 + e^(-x)) can be treated as a single node.

Sigmoid local gradient: d sigma(x) / dx = (1 - sigma(x)) * sigma(x)

So the gradient through the sigmoid node is
[upstream gradient] x [local gradient] = [1.00] x [(1 - 1/(1+e^-1)) * (1/(1+e^-1))] = [1.00] x [(1 - 0.73) * (0.73)] = 0.2
Patterns in gradient flow

- add gate: gradient distributor. E.g. inputs 3 and 4 give output 7; an upstream gradient of 2 is passed unchanged to both inputs (2 and 2).
- mul gate: “swap multiplier”. E.g. inputs 2 and 3 give output 6; with upstream gradient 5, the input 2 receives 5*3=15 and the input 3 receives 2*5=10.
- copy gate: gradient adder. E.g. input 7 is copied to two outputs (7 and 7); upstream gradients 4 and 2 add to a downstream gradient of 4+2=6.
- max gate: gradient router. E.g. max(4, 5) = 5; an upstream gradient of 9 is routed entirely to the larger input (9 to the 5, 0 to the 4).
Backprop Implementation: “Flat” code

Forward pass: compute the output.
Backward pass: compute the grads, walking the graph in reverse: base case, sigmoid, add gate, add gate, multiply gate, multiply gate.

(A sketch of such flat code for the sigmoid example above follows.)
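A minimal flat-code sketch for f(w, x) = 1 / (1 + e^(-(w0 x0 + w1 x1 + w2))); the variable names are illustrative, chosen so that each backward line mirrors one forward line.

```python
import math

def f_forward_backward(w0, x0, w1, x1, w2):
    # Forward pass: compute the output, one node at a time.
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = 1.0 / (1.0 + math.exp(-s3))       # sigmoid

    # Backward pass: compute the grads, in reverse order.
    grad_L = 1.0                          # base case
    grad_s3 = (1 - L) * L * grad_L        # sigmoid local gradient
    grad_s2 = 1.0 * grad_s3               # add gate distributes
    grad_w2 = 1.0 * grad_s3
    grad_s0 = 1.0 * grad_s2               # add gate distributes
    grad_s1 = 1.0 * grad_s2
    grad_w0 = x0 * grad_s0                # multiply gate swaps inputs
    grad_x0 = w0 * grad_s0
    grad_w1 = x1 * grad_s1
    grad_x1 = w1 * grad_s1
    return L, (grad_w0, grad_x0, grad_w1, grad_x1, grad_w2)

# Illustrative inputs: w0 = 2 and x0 = -1 match the worked gradients above
# (-0.2 and 0.4); the remaining values are assumed for the sketch.
print(f_forward_backward(2.0, -1.0, -3.0, -2.0, -3.0))
```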
“Flat” Backprop: Do this for assignment 1!

Stage your forward/backward computation! E.g. for the SVM, keep intermediates such as the margins from the forward pass so the backward pass can reuse them; the slide shows the same staging for a two-layer neural net.
Backprop Implementation: Modularized API

A Graph (or Net) object runs the forward pass by calling forward() on each node in topological order, and the backward pass by calling backward() on each node in reverse order (rough pseudo code).

Modularized implementation: forward / backward API

Each Gate / Node / Function object (here a multiply gate z = x * y, where x, y, z are scalars, shown with actual PyTorch code on the slide) implements:
- forward: compute the output z and cache some values for use in backward (here, x and y).
- backward: take the upstream gradient dL/dz and multiply it by the local gradients to get dL/dx and dL/dy.
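A rough sketch of that Gate/Node API in plain Python; the class and method names are illustrative (the slide's version is PyTorch-style).

```python
class MultiplyGate:
    """z = x * y for scalars, exposing a forward/backward API."""

    def forward(self, x, y):
        # Need to cache some values for use in backward.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Multiply the upstream gradient dz by the local gradients.
        dx = self.y * dz     # dz/dx = y
        dy = self.x * dz     # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # -12.0
dx, dy = gate.backward(2.0)      # dx = -8.0, dy = 6.0
```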
Example: PyTorch operators

PyTorch sigmoid layer
- Forward (the forward kernel is actually defined elsewhere in the source)
- Backward

(Source: the PyTorch codebase, linked on the slide.)
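To illustrate the same forward/backward split at the user level, here is a hedged sketch of a custom sigmoid written with torch.autograd.Function; it mirrors, but is not, PyTorch's internal implementation.

```python
import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = 1.0 / (1.0 + torch.exp(-x))
        ctx.save_for_backward(out)               # cache the output for backward
        return out

    @staticmethod
    def backward(ctx, grad_output):
        out, = ctx.saved_tensors
        return grad_output * (1.0 - out) * out   # upstream x local gradient

x = torch.randn(4, requires_grad=True)
y = Sigmoid.apply(x).sum()
y.backward()                                     # populates x.grad
```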
So far: backprop with scalars

What about vector-valued functions?

Recap: Vector derivatives

- Scalar to Scalar: regular derivative. If x changes by a small amount, how much will y change?
- Vector to Scalar: derivative is the Gradient. For each element of x, if it changes by a small amount, then how much will y change?
- Vector to Vector: derivative is the Jacobian. For each element of x, if it changes by a small amount, then how much will each element of y change?
Backprop with Vectors

The loss L is still a scalar! With vector inputs x (size Dx) and y (size Dy) and vector output z (size Dz) at a node f:

- “Upstream gradient”: dL/dz, a vector of size Dz. For each element of z, how much does it influence L?
- “Local gradients”: the Jacobian matrices dz/dx of size [Dx x Dz] and dz/dy of size [Dy x Dz].
- “Downstream gradients”: dL/dx (size Dx) and dL/dy (size Dy), each computed by a matrix-vector multiply of a Jacobian with the upstream gradient.

Gradients of variables wrt the loss have the same dims as the original variable.
Backprop with Vectors: f(x) = max(0, x) (elementwise)

4D input x: [ 1, -2, 3, -1 ]   ->   4D output z: [ 1, 0, 3, 0 ]

Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]

Jacobian dz/dx:
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]

4D dL/dx = [dz/dx] [dL/dz] = [ 4, 0, 5, 0 ]

The Jacobian is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian -- instead use implicit multiplication (see the sketch below).
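A sketch of that implicit Jacobian multiplication for elementwise ReLU in numpy, using the same numbers as the example.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
dL_dz = np.array([4.0, -1.0, 5.0, 9.0])     # upstream gradient

z = np.maximum(0, x)                         # forward: [1, 0, 3, 0]

# Implicit Jacobian multiply: never build the 4x4 Jacobian explicitly;
# just zero out the upstream gradient wherever the input was not positive.
dL_dx = dL_dz * (x > 0)                      # [4, 0, 5, 0]
print(z, dL_dx)
```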
Backprop with Matrices (or Tensors)

The loss L is still a scalar! With matrix (or tensor) inputs x of shape [Dx×Mx] and y of shape [Dy×My] and output z of shape [Dz×Mz]:

- “Upstream gradient”: dL/dz, of shape [Dz×Mz]. For each element of z, how much does it influence L?
- “Local gradients”: the Jacobian matrices dz/dx of shape [(Dx×Mx)×(Dz×Mz)] and dz/dy of shape [(Dy×My)×(Dz×Mz)]. For each element of y, how much does it influence each element of z?
- “Downstream gradients”: dL/dx of shape [Dx×Mx] and dL/dy of shape [Dy×My], computed by a (generalized) matrix-vector multiply of each Jacobian with the upstream gradient.

dL/dx always has the same shape as x!
Backprop with Matrices: Matrix Multiply y = x w

x: [N×D]
[  2 1 -3 ]
[ -3 4  2 ]

w: [D×M]
[ 3 2 1 -1 ]
[ 2 1 3  2 ]
[ 3 2 1 -2 ]

y: [N×M]
[ 13 9 -2 -6 ]
[  5 2 17  1 ]

Upstream gradient dL/dy: [N×M]
[  2 3 -3 9 ]
[ -8 1  4 6 ]

Also see the derivation in the course notes:
http://cs231n.stanford.edu/handouts/linear-backprop.pdf

Jacobians: dy/dx: [(N×D)×(N×M)], dy/dw: [(D×M)×(N×M)]. For a neural net we may have N=64, D=M=4096; each Jacobian takes ~256 GB of memory! Must work with them implicitly.

Q: What parts of y are affected by one element of x?
A: x_{n,d} affects the whole row y_{n,·}.

Q: How much does x_{n,d} affect y_{n,m}?
A: By w_{d,m}.

So dL/dx = (dL/dy) w^T, with shapes [N×D] = [N×M] [M×D].
By similar logic: dL/dw = x^T (dL/dy), with shapes [D×M] = [D×N] [N×M].

These formulas are easy to remember: they are the only way to make the shapes match up!
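A small numpy sketch of those two formulas with random data; the shapes mirror the example above, and the asserts check that each gradient has the same shape as its variable.

```python
import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)

y = x @ w                      # forward: [N×M]
dL_dy = np.random.randn(N, M)  # upstream gradient

# Backward: the only way to make the shapes match up.
dL_dx = dL_dy @ w.T            # [N×M] [M×D] -> [N×D]
dL_dw = x.T @ dL_dy            # [D×N] [N×M] -> [D×M]

assert dL_dx.shape == x.shape and dL_dw.shape == w.shape
```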
Summary for today:

● (Fully-connected) Neural Networks are stacks of linear functions and nonlinear activation functions; they have much more representational power than linear classifiers.
● Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.
● Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory.
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.
Next Time: Convolutional Neural Networks!
