Lecture 4 PDF
Office hours: http://cs231n.stanford.edu/office_hours.html
Backpropagation
f(x,W) = Wx + b
Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
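To make the minibatch idea concrete, here is a minimal sketch of a vanilla minibatch SGD loop; the fake data, the loss_and_grad stand-in, and the hyperparameters are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def loss_and_grad(W, x_batch, y_batch):
    # Stand-in for a real loss (e.g. softmax or SVM); returns a scalar loss
    # and a gradient with the same shape as W.
    return float(np.sum(W ** 2)), 2 * W

# Fake dataset and linear-classifier weights (CIFAR-10-like shapes)
data = np.random.randn(50000, 3073)
labels = np.random.randint(0, 10, size=50000)
W = 0.001 * np.random.randn(10, 3073)

batch_size, learning_rate = 64, 1e-3      # 32 / 64 / 128 are common batch sizes
for step in range(100):
    idx = np.random.choice(len(data), batch_size, replace=False)
    loss, grad = loss_and_grad(W, data[idx], labels[idx])
    W -= learning_rate * grad             # vanilla gradient descent update
```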
Linear classifier: f(x) = Wx produces the class scores.
Feature Representation
Example: Histogram of Oriented Gradients (HoG). Divide the image into 8x8 pixel regions; within each region, quantize the edge direction into 9 bins.
A 320x240 image gets divided into 40x30 bins; each bin holds 9 numbers, so the feature vector has 40*30*9 = 10,800 numbers.
Lowe, “Object recognition from local scale-invariant features”, ICCV 1999
Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005
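The arithmetic above (8x8 cells, 9 orientation bins, 40*30*9 = 10,800 numbers for a 320x240 image) can be sketched in a few lines. This is a toy HoG-style descriptor for illustration only, not the Dalal-Triggs implementation (no block normalization or interpolation).

```python
import numpy as np

def hog_like_features(gray, cell=8, nbins=9):
    """Toy HoG-style descriptor: per-cell histograms of gradient orientation."""
    H, W = gray.shape
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]        # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]        # vertical gradient
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)         # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)

    ch, cw = H // cell, W // cell                   # e.g. 240x320 image -> 30x40 cells
    feat = np.zeros((ch, cw, nbins))
    for i in range(ch):
        for j in range(cw):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            feat[i, j] = np.bincount(b, weights=m, minlength=nbins)
    return feat.ravel()                             # 30*40*9 = 10,800 numbers

x = hog_like_features(np.random.rand(240, 320))     # a fake grayscale image
print(x.shape)                                      # (10800,)
```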
Extract random patches, then cluster the patches to form a “codebook” of “visual words”.
Fei-Fei and Perona, “A bayesian hierarchical model for learning natural scene categories”, CVPR 2005
[Figure: a hand-designed feature pipeline where only the classifier on top is trained, vs. a network trained end to end; both output 10 numbers giving scores for classes]
(In practice we will usually add a learnable bias at each layer as well)
“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)
x → W1 → h → W2 → s   (sizes: 3072 → 100 → 10)
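As a sanity check on these shapes, a minimal sketch of the two-layer net s = W2 · max(0, W1 · x); the random initialization and the ReLU nonlinearity here are illustrative choices.

```python
import numpy as np

x  = np.random.randn(3072)            # e.g. a flattened 32x32x3 image
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1 @ x)             # hidden layer (ReLU), shape (100,)
s = W2 @ h                            # class scores, shape (10,)
print(s.shape)                        # (10,)
```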
Activation functions: tanh, ReLU, ELU, Maxout, ...
Forward pass
Gradient descent
[Figure: biological neuron, with cell body and axon labeled. Image by Fotis Bobolas, licensed under CC-BY 2.0]
Regularization
[Computational graph: the input image x and the weights W feed a multiply node (*) producing the scores s; s goes into a hinge loss; a regularization term R(W) is added (+) to give the total loss L]
Backpropagation: a simple example
e.g. x = -2, y = 5, z = -4
Want: the gradient of the output with respect to each input
Chain rule: downstream gradient = upstream gradient × local gradient
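Assuming the classic function for these values, f(x, y, z) = (x + y) z, a manual forward/backward sketch looks like this.

```python
# Manual backprop through f(x, y, z) = (x + y) * z  (function assumed, values from the slide)
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y                   # q = 3
f = q * z                   # f = -12

# Backward pass: chain rule, upstream * local
df_df = 1.0                 # base case
df_dq = z * df_df           # local gradient of '*' w.r.t. q is z  -> -4
df_dz = q * df_df           # local gradient of '*' w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq         # '+' distributes the gradient         -> -4
df_dy = 1.0 * df_dq         #                                       -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```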
At each node f in the graph, the backward pass takes the “upstream gradient” (the gradient of the loss with respect to the node's output), multiplies it by the node's local gradients (the derivatives of its output with respect to each input), and passes the results back to its inputs as the “downstream gradients”.
Sigmoid: σ(x) = 1 / (1 + e^(−x))
Sigmoid local gradient: dσ/dx = (1 − σ(x)) σ(x), so the backward pass can be written entirely in terms of the forward output.
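A small sketch of a sigmoid gate that exploits this: the backward pass reuses the cached forward output instead of recomputing anything.

```python
import numpy as np

def sigmoid_forward(x):
    out = 1.0 / (1.0 + np.exp(-x))
    return out, out                      # (output, cache)

def sigmoid_backward(dout, cache):
    s = cache
    return dout * (1.0 - s) * s          # upstream * local gradient

out, cache = sigmoid_forward(np.array([0.0, 2.0, -1.0]))
dx = sigmoid_backward(np.ones(3), cache)
print(dx)                                # approx [0.25, 0.105, 0.197]
```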
Patterns in gradient flow
add gate: gradient distributor. Example: inputs 3 and 4 give output 7; an upstream gradient of 2 is passed unchanged to both inputs.
mul gate: “swap multiplier”. Example: inputs 2 and 3 give output 6; with an upstream gradient of 5, the input with value 2 receives gradient 5*3 = 15 and the input with value 3 receives 2*5 = 10.
Backprop Implementation: “Flat” code
Forward pass: compute the output, one line per node.
Backward pass: compute the grads in reverse order; start from the base case (the gradient of the output with respect to itself is 1), then step back through the sigmoid, add, and multiply gates, each multiplying its local gradient by the upstream gradient.
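A hedged “flat” forward/backward sketch, assuming a function built from the gates listed above, f(w, x) = σ(w0·x0 + w1·x1 + w2); the specific input values are illustrative.

```python
import numpy as np

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass: compute output, one line per node
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
f  = 1.0 / (1.0 + np.exp(-s3))          # sigmoid

# Backward pass: compute grads, in reverse order
df  = 1.0                               # base case
ds3 = (1 - f) * f * df                  # sigmoid local gradient
ds2 = 1.0 * ds3                         # add gate
dw2 = 1.0 * ds3
ds0 = 1.0 * ds2                         # add gate
ds1 = 1.0 * ds2
dw0 = x0 * ds0                          # multiply gate
dx0 = w0 * ds0
dw1 = x1 * ds1                          # multiply gate
dx1 = w1 * ds1
print(dw0, dw1, dw2)                    # approx -0.20, -0.39, 0.20
```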
“Flat” Backprop: Do this for assignment 1!
Stage your forward/backward computation! E.g. for the SVM: compute and keep intermediates such as the margins in the forward pass, then reuse them in the backward pass.
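A hedged sketch of what staging looks like for a multiclass SVM (hinge) loss on one example: every intermediate of the forward pass (scores, margins) gets a matching gradient step in the backward pass. Shapes and names are illustrative, not the assignment's reference code.

```python
import numpy as np

def svm_loss_staged(W, x, y):
    # Forward pass: stage the computation
    scores  = W @ x                                    # (C,)
    margins = np.maximum(0, scores - scores[y] + 1)    # (C,)
    margins[y] = 0
    loss = margins.sum()

    # Backward pass: one gradient per stage, in reverse order
    dmargins = np.ones_like(margins)                   # dloss/dmargins
    dscores  = (margins > 0).astype(float) * dmargins  # hinge routes the gradient
    dscores[y] -= dscores.sum()                        # correct-class score appears in every margin
    dW = np.outer(dscores, x)                          # dloss/dW
    return loss, dW

W = np.random.randn(10, 3072); x = np.random.randn(3072); y = 3
loss, dW = svm_loss_staged(W, x, y)
print(loss, dW.shape)                                  # scalar, (10, 3072)
```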
E.g. for a two-layer neural net: the same pattern, with one staged intermediate per line in the forward pass and one gradient per stage in the backward pass.
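A similar hedged sketch for a two-layer net; the squared-norm “loss” on the scores is just a stand-in to keep the example short (the assignment uses a real classification loss).

```python
import numpy as np

def two_layer_staged(W1, W2, x):
    # Forward pass: stage every intermediate
    a = W1 @ x                 # pre-activation
    h = np.maximum(0, a)       # hidden layer (ReLU)
    s = W2 @ h                 # scores
    loss = 0.5 * np.sum(s * s) # stand-in scalar loss

    # Backward pass: walk the stages in reverse
    ds  = s                    # dloss/ds
    dW2 = np.outer(ds, h)
    dh  = W2.T @ ds
    da  = dh * (a > 0)         # ReLU routes gradient where a > 0
    dW1 = np.outer(da, x)
    return loss, dW1, dW2

W1 = np.random.randn(100, 3072); W2 = np.random.randn(10, 100)
loss, dW1, dW2 = two_layer_staged(W1, W2, np.random.randn(3072))
print(dW1.shape, dW2.shape)    # (100, 3072) (10, 100)
```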
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudocode): forward() runs the gates in topological order; backward() runs them in reverse order, with each gate chaining its local gradient with the upstream gradient.
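A runnable toy in the spirit of the Graph/Net idea: gates stored in topological order, forward() runs them left to right, backward() runs them right to left. Class and method names here are illustrative, not the slide's pseudocode.

```python
import numpy as np

class ScaleGate:                       # z = a * x
    def __init__(self, a): self.a = a
    def forward(self, x):
        return self.a * x
    def backward(self, dz):
        return self.a * dz             # downstream = local * upstream

class SigmoidGate:                     # z = 1 / (1 + exp(-x))
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out
    def backward(self, dz):
        return (1 - self.out) * self.out * dz

class ComputationalGraph:
    def __init__(self, gates): self.gates = gates    # topologically sorted chain
    def forward(self, x):
        for g in self.gates:           # forward pass: left to right
            x = g.forward(x)
        return x
    def backward(self, dloss=1.0):
        for g in reversed(self.gates): # backward pass: right to left
            dloss = g.backward(dloss)
        return dloss                   # gradient w.r.t. the input

graph = ComputationalGraph([ScaleGate(2.0), SigmoidGate()])
print(graph.forward(0.5), graph.backward())   # approx 0.731, 0.393
```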
Modularized implementation: forward / backward API
Gate / Node / Function object (actual PyTorch code): a multiply gate z = x * y, where x, y, z are scalars. The forward pass needs to stash some values for use in backward; the backward pass multiplies the upstream gradient by the local gradients.
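Written against PyTorch's torch.autograd.Function API, a multiply gate looks roughly like this: stash the inputs in forward, multiply upstream by local gradients in backward. This is a sketch in the spirit of the slide, not its exact code.

```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)        # stash values needed for backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):             # grad_z is the upstream gradient
        x, y = ctx.saved_tensors
        grad_x = grad_z * y                # multiply upstream and local gradients
        grad_y = grad_z * x
        return grad_x, grad_y

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0,  requires_grad=True)
z = Multiply.apply(x, y)
z.backward()
print(x.grad, y.grad)                      # tensor(5.) tensor(-2.)
```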
Example: PyTorch operators
PyTorch sigmoid layer (source): the layer exposes Forward and Backward; the forward computation is actually defined elsewhere, deeper in the underlying C/C++ source.
Summary for today:
● (Fully-connected) Neural Networks are stacks of linear functions and
nonlinear activation functions; they have much more representational
power than linear classifiers
● backpropagation = recursive application of the chain rule along a
computational graph to compute the gradients of all
inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement
the forward() / backward() API
● forward: compute result of an operation and save any intermediates
needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss
function with respect to the inputs
So far: backprop with scalars
Next Time: Convolutional Networks!
Recap: Vector derivatives
Scalar to Scalar: x ∈ R, y ∈ R. Regular derivative dy/dx ∈ R: if x changes by a small amount, how much will y change?
Vector to Scalar: x ∈ R^N, y ∈ R. The derivative is the gradient, ∂y/∂x ∈ R^N, with (∂y/∂x)_n = ∂y/∂x_n: for each element of x, if it changes by a small amount, how much will y change?
Vector to Vector: x ∈ R^N, y ∈ R^M. The derivative is the Jacobian, ∂y/∂x ∈ R^(N×M), with (∂y/∂x)_{n,m} = ∂y_m/∂x_n.
Backprop with Vectors
Loss L is still a scalar! Inputs x (dimension Dx) and y (dimension Dy) feed a node f whose output z has dimension Dz.
“Upstream gradient”: dL/dz, a vector of size Dz; for each element of z, how much does it influence L?
“Local gradients”: the Jacobian matrices dz/dx, of size [Dx × Dz], and dz/dy, of size [Dy × Dz].
“Downstream gradients”: dL/dx (size Dx) and dL/dy (size Dy), obtained by a matrix-vector multiply of each local Jacobian with the upstream gradient.
Gradients of variables wrt the loss have the same dims as the original variable.
Backprop with Vectors: elementwise ReLU example
4D input x: [ 1, -2, 3, -1 ]
f(x) = max(0, x) (elementwise)
4D output z: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]
Local Jacobian dz/dx (4x4), with ones on the diagonal where x > 0:
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]
Downstream gradient, 4D dL/dx = [dz/dx] [dL/dz] = [ 4, 0, 5, 0 ]
The Jacobian is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian; instead use implicit multiplication.
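The same example in NumPy, first with the explicit (sparse) Jacobian and then with the implicit elementwise mask that is used in practice.

```python
import numpy as np

x     = np.array([1., -2., 3., -1.])
dL_dz = np.array([4., -1., 5., 9.])           # upstream gradient

z = np.maximum(0, x)                          # forward: [1, 0, 3, 0]

# Explicit Jacobian dz/dx: diagonal, 1 where x > 0
J = np.diag((x > 0).astype(float))
dL_dx_explicit = J @ dL_dz                    # equals [4, 0, 5, 0]

# Implicit multiplication: never form the Jacobian
dL_dx_implicit = np.where(x > 0, dL_dz, 0.0)  # [4, 0, 5, 0]
print(dL_dx_explicit, dL_dx_implicit)
```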
Backprop with Matrices (or Tensors)
Loss L is still a scalar! The inputs are now matrices (or tensors), e.g. x of shape [Dx×Mx] and y of shape [Dy×My], and the node's output z has shape [Dz×Mz].
“Upstream gradient”: dL/dz, of shape [Dz×Mz]; for each element of z, how much does it influence L?
“Local gradients” are Jacobian matrices, now of shape [(Dx×Mx)×(Dz×Mz)] and [(Dy×My)×(Dz×Mz)]. The “downstream gradients” dL/dx and dL/dy have the same shapes as x and y and come from a (generalized) matrix-vector multiply of each local Jacobian with the upstream gradient.
Backprop with Matrices: matrix multiply y = xw
x: [N×D]
[ 2 1 -3 ]
[ -3 4 2 ]
w: [D×M]
[ 3 2 1 -1 ]
[ 2 1 3 2 ]
[ 3 2 1 -2 ]
y: [N×M]
[ 13 9 -2 -6 ]
[ 5 2 17 1 ]
Upstream gradient dL/dy: [N×M]
[ 2 3 -3 9 ]
[ -8 1 4 6 ]
Jacobians: dy/dx has shape [(N×D)×(N×M)] and dy/dw has shape [(D×M)×(N×M)]. For a neural net we may have N=64, D=M=4096, so each Jacobian takes 256 GB of memory! Must work with them implicitly.
Q: What parts of y are affected by one element of x? A: x_{n,d} affects the whole row y_{n,·}.
Q: How much does x_{n,d} affect y_{n,m}? A: w_{d,m}.
So dL/dx = (dL/dy) wᵀ, with shapes [N×D] = [N×M] [M×D].
By similar logic, dL/dw = xᵀ (dL/dy), with shapes [D×M] = [D×N] [N×M].
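A hedged NumPy check of these two formulas on small random matrices, comparing one analytic gradient entry against a finite-difference estimate; the scalar loss used here is an arbitrary stand-in.

```python
import numpy as np

np.random.seed(0)
N, D, M = 2, 3, 4
x, w = np.random.randn(N, D), np.random.randn(D, M)

def loss(x, w):
    y = x @ w
    return np.sum(y ** 2)          # any scalar loss works for the check

y = x @ w
dL_dy = 2 * y                      # upstream gradient for this particular loss
dL_dx = dL_dy @ w.T                # [N x M][M x D] -> [N x D]
dL_dw = x.T @ dL_dy                # [D x N][N x M] -> [D x M]

# Numerical check on one element of x
h = 1e-6
xp = x.copy(); xp[0, 1] += h
print(dL_dx[0, 1], (loss(xp, w) - loss(x, w)) / h)   # should match closely
```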
A vectorized example:
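A hedged sketch of one such vectorized example, assuming f(x, W) = ||W·x||², the function used for this example in the course notes; it stages the forward pass and checks one gradient entry numerically.

```python
import numpy as np

np.random.seed(1)
W = np.random.randn(5, 10)
x = np.random.randn(10)

# Forward pass
q = W @ x                  # intermediate vector
f = np.sum(q ** 2)         # scalar output

# Backward pass
dq = 2 * q                 # df/dq
dW = np.outer(dq, x)       # df/dW, same shape as W
dx = W.T @ dq              # df/dx, same shape as x

# Numerical check on one element of W
h = 1e-6
Wp = W.copy(); Wp[2, 3] += h
print(dW[2, 3], (np.sum((Wp @ x) ** 2) - f) / h)   # should match closely
```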
In discussion section: A matrix example...