Lecture 4
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 1 April 13, 2023
Announcements
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 2 April 13, 2023
Administrative: Project Proposal
(http://cs231n.stanford.edu/office_hours.html)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 3 April 13, 2023
Administrative: Discussion Section
Backpropagation
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 4 April 13, 2023
Recap
- We have some dataset of (x, y), e.g. images x with labels y
- We have a score function: s = f(x; W) = Wx
- We have a loss function:
  Softmax: L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
  SVM: L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) Σ_i L_i + R(W)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 5 April 13, 2023
Finding the best W: Optimize with Gradient Descent
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 6 April 13, 2023
Gradient descent
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 7 April 13, 2023
Stochastic Gradient Descent (SGD)
Full sum is expensive when N is large!
Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
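A minimal sketch of this minibatch SGD loop in numpy; loss_and_grad is a hypothetical helper that returns the minibatch loss and the gradient dL/dW:

import numpy as np

def train_sgd(W, X, y, loss_and_grad, lr=1e-3, batch_size=128, num_steps=1000):
    # X: [N, D] training data, y: [N] labels, W: current weights
    N = X.shape[0]
    for _ in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)   # sample a minibatch
        loss, dW = loss_and_grad(W, X[idx], y[idx])             # approximate full-sum loss / gradient
        W = W - lr * dW                                         # vanilla gradient descent step
    return W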
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 8 April 13, 2023
Last time: fancy optimizers
SGD
SGD+Momentum
RMSProp
Adam
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 9 April 13, 2023
Last time: learning rate scheduling
Step: Reduce learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
Cosine: α_t = ½ α_0 (1 + cos(t π / T))
Linear: α_t = α_0 (1 - t / T)
Inverse sqrt: α_t = α_0 / √t
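As a small code sketch of these schedules (α0 is the initial learning rate, T the total number of epochs or steps; the function names are illustrative):

import math

def step_lr(a0, epoch, milestones=(30, 60, 90), gamma=0.1):
    return a0 * gamma ** sum(epoch >= m for m in milestones)

def cosine_lr(a0, t, T):
    return 0.5 * a0 * (1 + math.cos(math.pi * t / T))

def linear_lr(a0, t, T):
    return a0 * (1 - t / T)

def inv_sqrt_lr(a0, t):
    return a0 / math.sqrt(max(t, 1))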
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 10 April 13, 2023
Today:
Deep Learning
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 11 April 13, 2023
Dall-E 2
“Teddy bears working on new AI research on the moon in the 1980s.”
“Rabbits attending a college seminar on human anatomy.”
“A wise cat meditating in the Himalayas searching for enlightenment.”
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 12 April 13, 2023
Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 13 April 13, 2023
GPT-4
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 14 April 13, 2023
Segment Anything Model (SAM)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 15 April 13, 2023
Neural Networks
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 16 April 13, 2023
Neural networks: the original linear classifier
(Before) Linear score function: f = Wx
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 17 April 13, 2023
Neural networks: 2 layers
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
(In practice we will usually add a learnable bias at each layer as well)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 18 April 13, 2023
Why do we want non-linearity?
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 19 April 13, 2023
Why do we want non-linearity?
[Figure: points that are not linearly separable in the original (x, y) coordinates become linearly separable after a feature transform to polar coordinates (r, θ)]
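One way to see why the non-linearity matters: without it, stacked linear layers collapse into a single linear map,
f = W2 (W1 x) = (W2 W1) x = W3 x, with W3 = W2 W1,
so a stack of purely linear layers is no more expressive than the original linear classifier. The max(0, ·) between layers is what prevents this collapse.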
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 20 April 13, 2023
Neural networks: also called fully connected network
“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)
(In practice we will usually add a learnable bias at each layer as well)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 21 April 13, 2023
Neural networks: 3 layers
(Now) 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))
(In practice we will usually add a learnable bias at each layer as well)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 22 April 13, 2023
Neural networks: hierarchical computation
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
x (3072) → W1 → h (100) → W2 → s (10)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 23 April 13, 2023
Neural networks: learning 100s of templates
(Before) Linear score function: f = Wx (one template per class)
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
x (3072) → W1 → h (100) → W2 → s (10): the hidden layer h can hold 100s of intermediate templates, which W2 recombines into the 10 class scores.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 24 April 13, 2023
Neural networks: why is max operator important?
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x); the max(0, ·) is the network's non-linearity.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 25 April 13, 2023
Neural networks: why is max operator important?
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x); without the max, W2 W1 x would collapse back into a single linear classifier.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 26 April 13, 2023
Activation functions
ReLU is a good default choice for most problems.
Examples: tanh, ReLU, Maxout, ELU
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 27 April 13, 2023
Neural networks: Architectures
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 28 April 13, 2023
Example feed-forward computation of a neural network
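A sketch of such a forward pass in numpy for a small 3-layer network with sigmoid activations (the layer sizes here are illustrative):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))          # activation function (sigmoid)
x = np.random.randn(3, 1)                        # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)                       # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)                      # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                        # output scores (1x1)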
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 29 April 13, 2023
Full implementation of training a 2-layer Neural Network needs ~20 lines:
Forward pass
Gradient descent
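A sketch of such a training loop in numpy (random data, a sigmoid hidden layer, and an L2 loss; the sizes and learning rate are illustrative, not the exact code from the slide):

import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)           # random training data
w1, w2 = randn(D_in, H), randn(H, D_out)         # randomly initialized weights

for t in range(2000):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))          # forward pass: hidden layer (sigmoid)
    y_pred = h.dot(w2)                            # forward pass: predictions
    loss = np.square(y_pred - y).sum()            # L2 loss

    grad_y_pred = 2.0 * (y_pred - y)              # backward pass (backpropagation)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))       # sigmoid local gradient is h * (1 - h)

    w1 -= 1e-4 * grad_w1                          # gradient descent update
    w2 -= 1e-4 * grad_w2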
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 34 April 13, 2023
Setting the number of layers and their sizes
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 36 April 13, 2023
This image by Fotis Bobolas is
licensed under CC-BY 2.0
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 37 April 13, 2023
Impulses carried toward cell body
dendrite
presynaptic terminal
axon
cell body
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 41 April 13, 2023
Biological neurons: complex connectivity patterns.
Neurons in a neural network: organized into regular layers for computational efficiency.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 42 April 13, 2023
Biological neurons: complex connectivity patterns. But neural networks with random connections can work too!
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 43 April 13, 2023
Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 44 April 13, 2023
Plugging in neural networks with loss functions
Nonlinear score function: s = f(x; W1, W2) = W2 max(0, W1 x)
SVM loss on predictions: L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
Regularization: R(W) = Σ_k W_k²
Total loss: L = (1/N) Σ_i L_i + λ R(W1) + λ R(W2)
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 45 April 13, 2023
Problem: How to compute gradients?
Nonlinear score function: s = f(x; W1, W2) = W2 max(0, W1 x)
The SVM loss on these predictions plus regularization gives the total loss L.
If we can compute ∂L/∂W1 and ∂L/∂W2, we can learn W1 and W2 with gradient descent.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 46 April 13, 2023
(Bad) Idea: Derive the gradients on paper
Problem: Very tedious: lots of matrix calculus, need lots of paper.
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch =(
Problem: Not feasible for very complex models!
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 47 April 13, 2023
Better Idea: Computational graphs + Backpropagation
The graph: inputs x and W feed a multiply node (*) that produces the scores s; s feeds the hinge loss; W also feeds a regularization node R; the hinge loss and R are added (+) to give the total loss L.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 48 April 13, 2023
Convolutional network
(AlexNet)
input image
weights
loss
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 49 April 13, 2023
Really complex neural networks!!
input image
loss
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 50 April 13, 2023
Neural Turing Machine
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 51 April 13, 2023
Solution: Backpropagation
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 52 April 13, 2023
Backpropagation: a simple example
e.g. x = -2, y = 5, z = -4
Want: the gradients of the output with respect to each input.
Work backward through the graph one node at a time, applying the chain rule at every step:
downstream gradient = upstream gradient × local gradient
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 68 April 13, 2023
During the backward pass, each node f receives an “upstream gradient”: the gradient of the loss with respect to the node's output. It multiplies this by its “local gradient” (the gradient of its output with respect to each input) and passes the resulting “downstream gradients” back to its inputs.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 74 April 13, 2023
Another example: a larger computational graph that computes a sigmoid of a linear function of its inputs. The slides step backward through it node by node; at every node the recipe is the same: multiply the upstream gradient by that node's local gradient.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 90 April 13, 2023
Another example, continued: the computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!
Sigmoid function: σ(x) = 1 / (1 + e^(-x))
Sigmoid local gradient: dσ(x)/dx = (1 - σ(x)) σ(x)
Grouping the final chain of nodes into a single “sigmoid” gate therefore makes its backward pass a one-line computation.
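For reference, a short derivation of that local gradient:
dσ/dx = d/dx [ (1 + e^(-x))^(-1) ] = e^(-x) / (1 + e^(-x))²
      = [ (1 + e^(-x) - 1) / (1 + e^(-x)) ] · [ 1 / (1 + e^(-x)) ] = (1 - σ(x)) σ(x)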
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 94 April 13, 2023
Patterns in gradient flow
add gate: gradient distributor
Example: inputs 3 and 4 enter an add node and produce 7. With upstream gradient 2 on the output, both inputs receive downstream gradient 2; the add gate passes the upstream gradient through unchanged.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 95 April 13, 2023
Patterns in gradient flow
add gate: gradient distributor
mul gate: “swap multiplier”
Example: inputs 2 and 3 enter a multiply node and produce 6. With upstream gradient 5 on the output, the input with value 2 gets downstream gradient 3 · 5 = 15 and the input with value 3 gets 2 · 5 = 10; each input's gradient is the upstream gradient times the other input's value.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 98 April 13, 2023
Backprop Implementation: “Flat” code
Forward pass: compute output
Backward pass: compute grads
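The code itself did not survive extraction; a minimal flat-code sketch in the same spirit, assuming the running example f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2))):

import numpy as np

def f(w0, x0, w1, x1, w2):
    # forward pass: compute the output, keeping every intermediate
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = 1.0 / (1.0 + np.exp(-s3))        # sigmoid

    # backward pass: compute grads in reverse order
    grad_L = 1.0                          # base case: dL/dL = 1
    grad_s3 = grad_L * (1 - L) * L        # sigmoid local gradient
    grad_w2 = grad_s3                     # add gate distributes the gradient
    grad_s2 = grad_s3
    grad_s0 = grad_s2                     # add gate again
    grad_s1 = grad_s2
    grad_w0 = grad_s0 * x0                # multiply gate swaps multipliers
    grad_x0 = grad_s0 * w0
    grad_w1 = grad_s1 * x1
    grad_x1 = grad_s1 * w1
    return L, (grad_w0, grad_x0, grad_w1, grad_x1, grad_w2)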
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 99 April 13, 2023
Backprop Implementation: “Flat” code
Backward pass, one line at a time: start from the base case (the gradient of the output with respect to itself is 1), then backprop through the sigmoid, then the add gates, then the multiply gates, each step multiplying an upstream gradient by a local gradient.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 105 April 13, 2023
“Flat” Backprop: Do this for assignment 1!
Stage your forward/backward computation!
E.g. for the SVM: stage the forward pass into scores → margins → loss, then backprop through each stage in reverse.
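A sketch of what this staging might look like for the multiclass SVM loss on a single example (x, y); the variable names are illustrative:

import numpy as np

def svm_loss_staged(W, x, y):
    # forward pass, staged into named intermediates
    scores = W.dot(x)                                    # class scores, shape (C,)
    margins = np.maximum(0, scores - scores[y] + 1.0)    # hinge margins
    margins[y] = 0.0
    loss = margins.sum()

    # backward pass, one stage at a time
    dmargins = np.ones_like(margins)                     # dloss/dmargins
    dscores = (margins > 0).astype(float) * dmargins     # backprop through the max
    dscores[y] -= dscores.sum()                          # scores[y] appears (negated) in every margin
    dW = np.outer(dscores, x)                            # backprop through scores = W x
    return loss, dW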
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 106 April 13, 2023
“Flat” Backprop: Do this for assignment 1!
E.g. for two-layer neural net:
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 107 April 13, 2023
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudo code)
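The pseudo code itself is not captured here; a rough sketch of what such an object might look like, simplified to a plain chain of gates (names are illustrative):

class Net:
    def __init__(self, gates):
        self.gates = gates                        # gates in topologically sorted order

    def forward(self, x):
        for gate in self.gates:                   # run each gate's forward pass in order
            x = gate.forward(x)
        return x                                  # final output, e.g. the loss

    def backward(self):
        grad = 1.0                                # base case: d(loss)/d(loss) = 1
        for gate in reversed(self.gates):         # visit gates in reverse order
            grad = gate.backward(grad)            # chain rule: pass gradient to the inputs
        return grad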
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 108 April 13, 2023
Modularized implementation: forward / backward API
Gate / Node / Function object: Actual PyTorch code
Example: a multiply gate with inputs x and y and output z = x * y (x, y, z are scalars).
Forward: need to cache some values (here x and y) for use in backward.
Backward: multiply the upstream gradient by the local gradients.
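A sketch of the same gate written against the current torch.autograd.Function API (not necessarily the exact code shown on the slide):

import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)     # cache values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):          # grad_z is the upstream gradient dL/dz
        x, y = ctx.saved_tensors
        grad_x = grad_z * y             # local gradient dz/dx = y
        grad_y = grad_z * x             # local gradient dz/dy = x
        return grad_x, grad_y           # downstream gradients for x and y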
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 109 April 13, 2023
Example: PyTorch operators
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 110 April 13, 2023
PyTorch sigmoid layer
Forward
Source
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 111 April 13, 2023
PyTorch sigmoid layer
Forward
Forward actually
defined elsewhere...
Source
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 112 April 13, 2023
PyTorch sigmoid layer
Forward
Forward actually
defined elsewhere...
Backward
Source
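The backward computation follows the same pattern as the scalar case: grad_input = grad_output * (1 - output) * output, reusing the output cached during the forward pass.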
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 113 April 13, 2023
So far: backprop with scalars
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 114 April 13, 2023
Recap: Vector derivatives
Scalar to Scalar: x ∈ ℝ, y ∈ ℝ
Regular derivative: dy/dx ∈ ℝ. If x changes by a small amount, how much will y change?
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 115 April 13, 2023
Recap: Vector derivatives
Vector to Scalar: x ∈ ℝ^N, y ∈ ℝ
Derivative is the gradient: dy/dx ∈ ℝ^N, with (dy/dx)_n = ∂y/∂x_n
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 116 April 13, 2023
Recap: Vector derivatives
Vector to Vector: x ∈ ℝ^N, y ∈ ℝ^M
Derivative is the Jacobian: dy/dx ∈ ℝ^(N×M), with (dy/dx)_{n,m} = ∂y_m/∂x_n
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 117 April 13, 2023
Backprop with Vectors
Loss L is still a scalar!
Inputs x (dimension Dx) and y (dimension Dy) feed a node f that produces z (dimension Dz).
“Upstream gradient”: dL/dz, a vector of size Dz; for each element of z, how much does it influence L?
“Local gradients”: dz/dx and dz/dy are Jacobian matrices of sizes [Dx × Dz] and [Dy × Dz].
“Downstream gradients”: dL/dx = (dz/dx)(dL/dz) and dL/dy = (dz/dy)(dL/dz); each is a matrix-vector multiply, producing vectors of sizes Dx and Dy.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 124 April 13, 2023
Gradients of variables wrt loss have same dims as the original variable
A node f takes x (size Dx) and y (size Dy) and produces z (size Dz). The upstream gradient dL/dz has size Dz; the downstream gradients dL/dx and dL/dy have sizes Dx and Dy, matching their variables.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 125 April 13, 2023
Backprop with Vectors
4D input x: [ 1, -2, 3, -1 ]
f(x) = max(0, x) (elementwise)
4D output z: [ 1, 0, 3, 0 ]
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 126 April 13, 2023
Backprop with Vectors
4D input x: [ 1, -2, 3, -1 ]
f(x) = max(0, x) (elementwise)
4D output z: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 127 April 13, 2023
Backprop with Vectors
4D input x: [ 1, -2, 3, -1 ];  f(x) = max(0, x) (elementwise);  4D output z: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]
The Jacobian dz/dx is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian; instead use implicit multiplication.
4D dL/dx = [dz/dx] [dL/dz]:
[ 1 0 0 0 ]   [  4 ]   [ 4 ]
[ 0 0 0 0 ] · [ -1 ] = [ 0 ]
[ 0 0 1 0 ]   [  5 ]   [ 5 ]
[ 0 0 0 0 ]   [  9 ]   [ 0 ]
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 131 April 13, 2023
Backprop with Vectors
4D input x: [ 1, -2, 3, -1 ];  f(x) = max(0, x) (elementwise);  4D output z: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]
Jacobian is sparse: off-diagonal entries always zero! Never explicitly form the Jacobian; instead use implicit multiplication:
(dL/dx)_i = (dL/dz)_i if x_i > 0, otherwise 0   →   4D dL/dx: [ 4, 0, 5, 0 ]
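As a quick numpy sketch, that implicit multiplication is just an elementwise mask:

import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
dL_dz = np.array([4.0, -1.0, 5.0, 9.0])   # upstream gradient

z = np.maximum(0, x)                       # forward: elementwise ReLU
dL_dx = dL_dz * (x > 0)                    # backward: mask instead of forming a 4x4 Jacobian
print(dL_dx)                               # [4. 0. 5. 0.]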
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 132 April 13, 2023
Backprop with Matrices (or Tensors): Loss L is still a scalar!
Now x, y, and z are matrices: x has shape [Dx×Mx], y has shape [Dy×My], and z has shape [Dz×Mz].
“Upstream gradient”: dL/dz has shape [Dz×Mz]; for each element of z, how much does it influence L?
“Local gradients”: dz/dx and dz/dy are Jacobians relating every element of each input to every element of the output.
“Downstream gradients”: dL/dx [Dx×Mx] and dL/dy [Dy×My] come from (generalized) matrix-vector multiplies of the local Jacobians with the upstream gradient.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 134 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w
x: [N×D] = [  2  1 -3 ]
           [ -3  4  2 ]
w: [D×M] = [ 3 2 1 -1 ]
           [ 2 1 3  2 ]
           [ 3 2 1 -2 ]
y: [N×M] = [ 13 9 -2 -6 ]
           [  5 2 17  1 ]
Upstream gradient dL/dy: [N×M] = [  2 3 -3 9 ]
                                 [ -8 1  4 6 ]
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 137 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w (same x, w, y, and dL/dy as above)
Jacobians: dy/dx has shape [(N×D)×(N×M)]; dy/dw has shape [(D×M)×(N×M)].
For a neural net we may have N = 64, D = M = 4096. Each Jacobian takes ~256 GB of memory! Must work with them implicitly!
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 138 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w
Q: What parts of y are affected by one element of x?
A: An element x_{n,d} affects the whole row y_{n,·}.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 140 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w
Q: How much does x_{n,d} affect y_{n,m}?
A: By w_{d,m}.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 142 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w
Putting it together: dL/dx = (dL/dy) wᵀ
Shapes check out: [N×D] = [N×M] [M×D].
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 143 April 13, 2023
Backprop with Matrices: Matrix Multiply, y = x w
By similar logic: dL/dw = xᵀ (dL/dy)
Shapes: [D×M] = [D×N] [N×M].
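Both formulas are easy to sanity-check numerically; a small numpy sketch (shapes arbitrary, data random):

import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)
y = x.dot(w)                          # forward: [N×M]

dL_dy = np.random.randn(N, M)         # pretend upstream gradient
dL_dx = dL_dy.dot(w.T)                # [N×M][M×D] -> [N×D]
dL_dw = x.T.dot(dL_dy)                # [D×N][N×M] -> [D×M]

# check one entry numerically, treating L = sum(dL_dy * y) so that dL/dy is exactly dL_dy
eps = 1e-6
w_perturbed = w.copy(); w_perturbed[0, 0] += eps
numeric = (np.sum(dL_dy * x.dot(w_perturbed)) - np.sum(dL_dy * y)) / eps
print(np.allclose(numeric, dL_dw[0, 0], atol=1e-4))   # True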
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 144 April 13, 2023
Summary for today:
- (Fully-connected) Neural Networks are stacks of linear functions and non-linear activation functions; they have much more representational power than linear classifiers.
- Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs, parameters, and intermediates.
- Implementations maintain a graph structure where nodes implement a forward() / backward() API: forward computes the result of an operation and caches what backward will need; backward applies the chain rule to pass the loss gradient back to the node's inputs.
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 145 April 13, 2023
Next Time: Convolutional Neural Networks!
Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 4 - 146 April 13, 2023