
Lecture 4: Neural Networks and Backpropagation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 1 April 08, 2021


Announcements: Assignment 1

Assignment 1 due Fri 4/16 at 11:59pm

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 2 April 08, 2021


Administrative: Project Proposal

Due Mon 4/19

TA expertise is posted on the webpage:
(http://cs231n.stanford.edu/office_hours.html)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 3 April 08, 2021


Administrative: Discussion Section

Discussion section tomorrow:

Backpropagation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 4 April 08, 2021


Administrative: Midterm Updates

- The midterm will be held on Tues, May 4 and is worth 15% of your grade.
- It will be available for 24 hours on Gradescope, from May 4, 12PM PDT to May 5, 11:59 AM PDT.
- You may take it in any consecutive 3-hour timeframe within that window.
- The exam will be designed for 1.5 hours.
- Open book and open internet, but no collaboration.
- Only make private posts during those 24 hours.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 5 April 08, 2021


Recap: from last time

f(x,W) = Wx + b

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 6 April 08, 2021


Recap: loss functions
Linear score function

SVM loss (or softmax)

data loss + regularization

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 7 April 08, 2021


Finding the best W: Optimize with Gradient Descent

Landscape image is CC0 1.0 public domain


Walking man image is CC0 1.0 public domain

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 8 April 08, 2021


Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)


Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive the analytic gradient, then check your implementation with the numerical gradient.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 9 April 08, 2021
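As a concrete sketch of that recipe (illustrative NumPy code, not from the slides), a numerical gradient computed with centered finite differences can be compared against a hand-derived analytic gradient:

import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # Centered finite differences: slow and approximate, but easy to write.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h; fp = f(W)
        W[idx] = old - h; fm = f(W)
        W[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Gradient check on a toy loss f(W) = sum(W**2), whose analytic gradient is 2W:
W = np.random.randn(3, 4)
analytic = 2 * W
numeric = numerical_gradient(lambda W: np.sum(W ** 2), W)
print(np.max(np.abs(analytic - numeric)))   # should be tiny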


Stochastic Gradient Descent (SGD)
The full sum over all N training examples is expensive when N is large!
Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 10 April 08, 2021
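A minimal sketch of the vanilla minibatch SGD loop (the gradient function and hyperparameters here are illustrative placeholders, not from the slides):

import numpy as np

def train_sgd(W, X_train, y_train, grad_fn, lr=1e-3, batch_size=128, num_steps=1000):
    # grad_fn(W, X_batch, y_batch) is assumed to return dL/dW on the minibatch.
    N = X_train.shape[0]
    for step in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)   # sample a minibatch
        grad = grad_fn(W, X_train[idx], y_train[idx])           # approximate full-sum gradient
        W -= lr * grad                                          # parameter update
    return W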


What we are going to discuss today!
Linear score function

SVM loss (or softmax)

data loss + regularization

How to find the best W?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 11 April 08, 2021


Problem: Linear Classifiers are not very powerful

Visual viewpoint: linear classifiers learn only one template per class.
Geometric viewpoint: linear classifiers can only draw linear decision boundaries.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 12 April 08, 2021


Pixel Features

The linear classifier computes class scores directly from the raw pixels: f(x) = Wx

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 13 April 16, 2020


Image Features

First compute a feature representation of the image, then compute class scores on top of the features: f(x) = Wx

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 14 April 16, 2020


Image Features: Motivation

Cannot separate red and blue points with a linear classifier.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 15 April 16, 2020


Image Features: Motivation

f(x, y) = (r(x, y), θ(x, y))

Cannot separate red and blue points with a linear classifier. After applying the feature transform (from Cartesian coordinates (x, y) to polar coordinates (r, θ)), the points can be separated by a linear classifier.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 16 April 16, 2020


Example: Color Histogram

Bin each pixel's color and count: every pixel adds +1 to its bin.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 17 April 16, 2020
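A rough sketch of such a color-histogram feature in NumPy (the hue conversion and the bin count are illustrative choices, not from the slides):

import numpy as np

def color_histogram(img, bins=16):
    # img: H x W x 3 uint8 array; bin each pixel's hue and count (+1 per pixel).
    r, g, b = img[..., 0] / 255.0, img[..., 1] / 255.0, img[..., 2] / 255.0
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)
    d = (mx - mn) + 1e-8
    hue = np.where(mx == r, ((g - b) / d) % 6.0, 0.0)
    hue = np.where(mx == g, (b - r) / d + 2.0, hue)
    hue = np.where(mx == b, (r - g) / d + 4.0, hue)
    hist, _ = np.histogram(hue / 6.0, bins=bins, range=(0.0, 1.0))
    return hist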


Example: Histogram of Oriented Gradients (HoG)

Divide the image into 8x8 pixel regions. Within each region, quantize the edge direction into 9 bins.
Example: a 320x240 image gets divided into 40x30 bins; each bin has 9 numbers, so the feature vector has 30*40*9 = 10,800 numbers.
Lowe, “Object recognition from local scale-invariant features”, ICCV 1999
Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 18 April 16, 2020


Example: Bag of Words
Step 1: Build codebook. Extract random patches, then cluster the patches to form a “codebook” of “visual words”.

Step 2: Encode images

Fei-Fei and Perona, “A bayesian hierarchical model for learning natural scene categories”, CVPR 2005

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 19 April 16, 2020


Image Features

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 20 April 16, 2020


Image features vs ConvNets
Feature extraction pipeline: hand-designed feature extraction followed by a trainable classifier f that outputs 10 numbers giving scores for classes; only the classifier is trained.

ConvNet: the whole network, from the raw image to the 10 numbers giving scores for classes, is trained end-to-end.

Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012.
Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 21 April 16, 2020


One Solution: Feature Transformation
f(x, y) = (r(x, y), θ(x, y))

Transform data with a cleverly chosen feature transform f, then apply a linear classifier.

Examples: Color Histogram, Histogram of Oriented Gradients (HoG)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 22 April 08, 2021


Today: Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 23 April 08, 2021


Neural networks: the original linear classifier

(Before) Linear score function:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 24 April 08, 2021


Neural networks: 2 layers

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

(In practice we will usually add a learnable bias at each layer as well)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 25 April 08, 2021


Neural networks: also called fully connected network

(Before) Linear score function:


(Now) 2-layer Neural Network

“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)
(In practice we will usually add a learnable bias at each layer as well)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 26 April 08, 2021


Neural networks: 3 layers

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 27 April 08, 2021


Neural networks: hierarchical computation
(Before) Linear score function:
(Now) 2-layer Neural Network

x (3072) → W1 → h (100) → W2 → s (10)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 28 April 08, 2021
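In NumPy-like code, this two-layer computation is simply the following (a sketch using the dimensions above; the weight scale is illustrative):

import numpy as np

x = np.random.randn(3072)              # flattened 32x32x3 input image
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1.dot(x))           # hidden layer, 100 values
s = W2.dot(h)                          # class scores, 10 values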


Neural networks: learning 100s of templates
(Before) Linear score function:
(Now) 2-layer Neural Network

x (3072) → W1 → h (100) → W2 → s (10)

Learn 100 templates instead of 10. Share templates between classes

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 29 April 08, 2021


Neural networks: why is max operator important?
(Before) Linear score function:
(Now) 2-layer Neural Network

The max function is called the activation function.


Q: What if we try to build a neural network without one?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 30 April 08, 2021


Neural networks: why is max operator important?
(Before) Linear score function:
(Now) 2-layer Neural Network

The max function is called the activation function.


Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 31 April 08, 2021
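A quick numeric sketch of that answer (shapes are illustrative): composing two linear layers without a nonlinearity collapses into a single linear classifier.

import numpy as np

W1 = np.random.randn(100, 3072)
W2 = np.random.randn(10, 100)
x = np.random.randn(3072)

s_two_layer = W2.dot(W1.dot(x))     # "2-layer" net with no activation function
W_combined = W2.dot(W1)             # collapses to a single (10 x 3072) matrix
assert np.allclose(s_two_layer, W_combined.dot(x))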


Activation functions
Sigmoid Leaky ReLU

tanh Maxout

ReLU ELU

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 32 April 08, 2021


Activation functions

ReLU is a good default choice for most problems

Sigmoid Leaky ReLU

tanh Maxout

ReLU ELU

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 33 April 08, 2021
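For reference, minimal NumPy definitions of the activations named above (Maxout is omitted because it takes several linear inputs; the alpha values are common defaults, not prescribed by the slides):

import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x): return np.tanh(x)
def relu(x): return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))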


Neural networks: Architectures

A “2-layer Neural Net” is also called a “1-hidden-layer Neural Net”; a “3-layer Neural Net” is also called a “2-hidden-layer Neural Net”. Their layers are “fully-connected” layers.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 34 April 08, 2021


Example feed-forward computation of a neural network

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 35 April 08, 2021


Full implementation of training a 2-layer Neural Network needs ~20 lines:

- Define the network
- Forward pass
- Calculate the analytical gradients
- Gradient descent
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 40 April 08, 2021
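The ~20 lines themselves did not survive extraction; the sketch below is in the same spirit (random data, sigmoid hidden layer, squared-error loss, full-batch gradient descent; the sizes and learning rate are illustrative):

import numpy as np
from numpy.random import randn

# Define the network
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # hidden layer (sigmoid)
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()

    # Calculate the analytical gradients (backprop)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2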


Setting the number of layers and their sizes

more neurons = more capacity


Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 4 - 41 13 Jan 2016
Do not use the size of the neural network as a regularizer. Use stronger regularization instead.

(Web demo with ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 4 - 42 13 Jan 2016
This image by Fotis Bobolas is
licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 43 April 08, 2021


[Figure: a biological neuron. Dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body to the presynaptic terminal.]
This image by Felipe Perucho is licensed under CC-BY 3.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 44 April 08, 2021




[Figure: the same biological neuron (dendrites carrying impulses toward the cell body, the axon carrying impulses away, presynaptic terminal), alongside the mathematical model of a neuron whose weighted, summed inputs pass through a sigmoid activation function.]
This image by Felipe Perucho is licensed under CC-BY 3.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 46 April 08, 2021




Biological neurons: complex connectivity patterns. Neurons in a neural network: organized into regular layers for computational efficiency.

This image is CC0 Public Domain

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 48 April 08, 2021


Biological neurons: complex connectivity patterns. But neural networks with random connections can work too!

This image is CC0 Public Domain


Xie et al, “Exploring Randomly Wired Neural Networks for Image Recognition”, arXiv 2019

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 49 April 08, 2021


Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical
system

[Dendritic Computation. London and Hausser]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 50 April 08, 2021


Plugging in neural networks with loss functions
Nonlinear score function
SVM Loss on predictions

Regularization

Total loss: data loss + regularization

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 51 April 08, 2021


Problem: How to compute gradients?
Nonlinear score function
SVM Loss on predictions

Regularization

Total loss: data loss + regularization

If we can compute dL/dW1 and dL/dW2, then we can learn W1 and W2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 52 April 08, 2021


(Bad) Idea: Derive the gradients on paper
Problem: Very tedious: Lots of
matrix calculus, need lots of paper
Problem: What if we want to
change loss? E.g. use softmax
instead of SVM? Need to
re-derive from scratch =(
Problem: Not feasible for very
complex models!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 53 April 08, 2021


Better Idea: Computational graphs + Backpropagation

[Computational graph: inputs x and W feed a multiply node (*) that produces the scores s; a hinge loss node turns the scores into the data loss, a regularization node computes R(W), and an add node (+) combines them into the total loss L.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 54 April 08, 2021


Convolutional network (AlexNet): a much larger computational graph, from the input image and weights through many layers to the loss.

Figure copyright Alex Krizhevsky, Ilya Sutskever, and


Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 55 April 08, 2021


Really complex neural networks!!

[Figure: a very large computational graph from the input image to the loss.]

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 56 April 08, 2021


Neural Turing Machine

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - April 08, 2021


Solution: Backpropagation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 58 April 08, 2021


Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradient of the output with respect to each input.

Work backward from the output node, applying the chain rule at each step:
[downstream gradient] = [Upstream gradient] x [Local gradient]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 74 April 13, 2017
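The expression being differentiated appears only in the slide images; assuming the classic example used with these values, f(x, y, z) = (x + y) * z, the numbers work out as follows (a sketch):

# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y              # q = 3
f = q * z              # f = -12

# Backward pass: chain rule at every node
df_df = 1.0            # base case
df_dz = q * df_df      # mul gate ("swap multiplier"): 3
df_dq = z * df_df      # -4
df_dx = 1.0 * df_dq    # add gate distributes the gradient: -4
df_dy = 1.0 * df_dq    # -4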
At each node f in the graph, the “local gradient” relates the node's output to its inputs; multiplying it by the “Upstream gradient” arriving from the loss gives the “Downstream gradients” passed back to the inputs.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 80 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 81 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 82 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 83 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 84 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 85 April 08, 2021


Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 86 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 87 April 08, 2021


Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 88 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 89 April 08, 2021


Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 90 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 91 April 08, 2021


Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 92 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 93 April 08, 2021


Another example:

[upstream gradient] x [local gradient]


[0.2] x [1] = 0.2
[0.2] x [1] = 0.2 (both inputs!)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 94 April 08, 2021


Another example:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 95 April 08, 2021


Another example:

[upstream gradient] x [local gradient]


w0: [0.2] x [-1] = -0.2
x0: [0.2] x [2] = 0.4

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 96 April 08, 2021


Another example: Sigmoid function

Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

Sigmoid local gradient: d/dx sigma(x) = (1 - sigma(x)) sigma(x), where sigma(x) = 1/(1+e^-x)

[upstream gradient] x [local gradient]
[1.00] x [(1 - 1/(1+e^-1)) (1/(1+e^-1))] = [1.00] x [(1 - 0.73) (0.73)] = 0.2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 100 April 08, 2021
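A quick numeric check of that sigmoid shortcut (values as on the slide):

import numpy as np

sig = 1.0 / (1.0 + np.exp(-1.0))    # sigmoid of the value 1.00 entering the gate: ~0.73
local_grad = (1 - sig) * sig        # (1 - 0.73) * 0.73 ~= 0.20
downstream = 1.00 * local_grad      # [upstream] x [local] ~= 0.2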
Patterns in gradient flow

add gate: gradient distributor. E.g., inputs 3 and 4 give output 7; an upstream gradient of 2 is passed unchanged to both inputs.

mul gate: “swap multiplier”. E.g., inputs 2 and 3 give output 6; with an upstream gradient of 5, the downstream gradients are 5*3 = 15 and 2*5 = 10 (each input receives the upstream gradient times the other input).

copy gate: gradient adder. E.g., an input of 7 is copied to two outputs; upstream gradients of 4 and 2 add up to a downstream gradient of 4+2 = 6.

max gate: gradient router. E.g., inputs 4 and 5 give output 5; an upstream gradient of 9 is routed entirely to the larger input, and the other input receives gradient 0.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 104 April 08, 2021
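A minimal sketch of these four backward rules as functions (the example call uses the slide's mul-gate numbers):

def add_backward(upstream):
    # gradient distributor: every input gets the upstream gradient unchanged
    return upstream, upstream

def mul_backward(a, b, upstream):
    # "swap multiplier": each input gets the upstream gradient times the other input
    return upstream * b, upstream * a

def copy_backward(upstream1, upstream2):
    # gradient adder: upstream gradients from the copies sum up
    return upstream1 + upstream2

def max_backward(a, b, upstream):
    # gradient router: the whole gradient goes to whichever input was larger
    return (upstream, 0.0) if a > b else (0.0, upstream)

print(mul_backward(2, 3, 5))   # (15, 10), matching 5*3 and 2*5 above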
Backprop Implementation: “Flat” code

Forward pass: compute the output.
Backward pass: compute the grads by walking the forward pass in reverse: the base case (the gradient of the output with respect to itself), then the sigmoid gate, the add gates, and the multiply gates.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 111 April 08, 2021
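The staged code itself is only in the slide images; below is a sketch of what such “flat” forward/backward code looks like for a small sigmoid neuron (the input values are partly inferred from the gradients shown earlier and partly assumed, so treat them as illustrative):

import numpy as np

# Forward pass: compute output, staging every intermediate value
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed example inputs
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = 1.0 / (1.0 + np.exp(-s3))      # sigmoid output

# Backward pass: compute grads, walking the forward pass in reverse
grad_L = 1.0                       # base case
grad_s3 = grad_L * (1 - L) * L     # sigmoid gate
grad_w2 = grad_s3                  # add gate distributes
grad_s2 = grad_s3
grad_s0 = grad_s2                  # add gate distributes
grad_s1 = grad_s2
grad_w0 = grad_s0 * x0             # multiply gate swaps inputs
grad_x0 = grad_s0 * w0
grad_w1 = grad_s1 * x1
grad_x1 = grad_s1 * w1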
“Flat” Backprop: Do this for assignment 1!
Stage your forward/backward computation!
E.g. for the SVM: stage intermediate values such as the margins in the forward pass, then backprop through them.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 112 April 08, 2021
“Flat” Backprop: Do this for assignment 1!
E.g. for two-layer neural net:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 113 April 08, 2021
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudo code)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 114 April 08, 2021
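The pseudocode on the slide did not survive extraction; a rough sketch of such a Graph (or Net) object follows (the attribute and method names are illustrative):

class ComputationalGraph:
    def __init__(self):
        self.nodes = []                      # gates/nodes, stored in topological order

    def forward(self):
        for node in self.nodes:              # run each gate's forward pass in order
            node.forward()
        return self.nodes[-1].output         # the loss

    def backward(self):
        for node in reversed(self.nodes):    # visit gates in reverse topological order
            node.backward()                  # chain local gradients with upstream gradients
        return [node.grad for node in self.nodes]   # gradients on inputs/parameters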
Modularized implementation: forward / backward API

Gate / Node / Function object (actual PyTorch code): a multiply gate z = x * y, where x, y, z are scalars.
Forward: need to stash some values for use in backward.
Backward: multiply the upstream and local gradients.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 115 April 08, 2021
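A sketch of such a multiply gate written against that forward/backward API (not the slide's exact code):

class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x, self.y = x, y      # stash values needed for the backward pass
        return z

    def backward(self, dz):        # dz is the upstream gradient dL/dz
        dx = self.y * dz           # [local gradient dz/dx] * [upstream gradient]
        dy = self.x * dz           # [local gradient dz/dy] * [upstream gradient]
        return dx, dy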
Example: PyTorch operators

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 116 April 08, 2021
PyTorch sigmoid layer
Forward

Source

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 117 April 08, 2021
PyTorch sigmoid layer
Forward

Forward actually
defined elsewhere...

Source

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 118 April 08, 2021
PyTorch sigmoid layer
Forward

Forward actually
defined elsewhere...

Backward

Source

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 119 April 08, 2021
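For comparison, a user-level sketch of the same layer written as a custom PyTorch autograd Function (the built-in implementation shown on the slide lives in the library's C/C++ source):

import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = 1.0 / (1.0 + torch.exp(-x))
        ctx.save_for_backward(out)            # stash the output for the backward pass
        return out

    @staticmethod
    def backward(ctx, grad_output):
        out, = ctx.saved_tensors
        return grad_output * (1 - out) * out  # upstream x local sigmoid gradient

x = torch.randn(5, requires_grad=True)
y = Sigmoid.apply(x)
y.sum().backward()                            # populates x.grad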
Summary for today:
● (Fully-connected) Neural Networks are stacks of linear functions and
nonlinear activation functions; they have much more representational
power than linear classifiers
● backpropagation = recursive application of the chain rule along a
computational graph to compute the gradients of all
inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement
the forward() / backward() API
● forward: compute result of an operation and save any intermediates
needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss
function with respect to the inputs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 120 April 08, 2021
So far: backprop with scalars

Next time: vector-valued functions!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 121 April 08, 2021
Next Time: Convolutional Networks!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 122 April 08, 2021
Recap: Vector derivatives
Scalar to Scalar

Regular derivative: if x changes by a small amount, how much will y change?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 123 April 08, 2021
Recap: Vector derivatives
Scalar to Scalar. Regular derivative: if x changes by a small amount, how much will y change?

Vector to Scalar. Derivative is the Gradient: for each element of x, if it changes by a small amount, how much will y change?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 124 April 08, 2021
Recap: Vector derivatives
Scalar to Scalar. Regular derivative: if x changes by a small amount, how much will y change?

Vector to Scalar. Derivative is the Gradient: for each element of x, if it changes by a small amount, how much will y change?

Vector to Vector. Derivative is the Jacobian: for each element of x, if it changes by a small amount, how much will each element of y change?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 125 April 08, 2021
Backprop with Vectors

Loss L is still a scalar! A node f takes vector inputs x (dimension Dx) and y (dimension Dy) and produces a vector output z (dimension Dz).

“Upstream gradient” dL/dz (dimension Dz): for each element of z, how much does it influence L?

“Local gradients” dz/dx and dz/dy are Jacobian matrices of shapes [Dx x Dz] and [Dy x Dz].

“Downstream gradients” dL/dx (dimension Dx) and dL/dy (dimension Dy) are obtained by a matrix-vector multiply of each local Jacobian with the upstream gradient.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 132 April 08, 2021
Gradients of variables wrt loss have the same dims as the original variable

Loss L still a scalar! dL/dx has dimension Dx (like x) and dL/dy has dimension Dy (like y); the upstream gradient dL/dz has dimension Dz (like z): for each element of z, how much does it influence L?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 133 April 08, 2021
Backprop with Vectors: f(x) = max(0, x) (elementwise)

4D input x:  [ 1, -2, 3, -1 ]
4D output z: [ 1, 0, 3, 0 ]

Upstream gradient, 4D dL/dz: [ 4, -1, 5, 9 ]

Jacobian dz/dx:
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]

4D dL/dx = [dz/dx] [dL/dz] = [ 4, 0, 5, 0 ]

The Jacobian is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian -- instead use implicit multiplication (the upstream gradient passes through only where the input is positive).

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 140 April 08, 2021
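In code, the implicit multiplication for this ReLU example is just an elementwise mask (a NumPy sketch with the slide's numbers):

import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
z = np.maximum(0, x)                      # forward: [1, 0, 3, 0]

dL_dz = np.array([4.0, -1.0, 5.0, 9.0])   # upstream gradient
dL_dx = dL_dz * (x > 0)                   # implicit Jacobian multiply: [4, 0, 5, 0]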
Backprop with Matrices (or Tensors)

Loss L still a scalar! Now x has shape [Dx×Mx], y has shape [Dy×My], and the output z has shape [Dz×Mz]. dL/dx always has the same shape as x!

“Upstream gradient” dL/dz has shape [Dz×Mz]: for each element of z, how much does it influence L?

“Local gradients” dz/dx and dz/dy are (generalized) Jacobian matrices of shapes [(Dx×Mx)×(Dz×Mz)] and [(Dy×My)×(Dz×Mz)]: for each element of y, how much does it influence each element of z?

“Downstream gradients” dL/dx [Dx×Mx] and dL/dy [Dy×My] come from a matrix-vector multiply of each local Jacobian with the upstream gradient.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 144 April 08, 2021
Backprop with Matrices: Matrix Multiply

x: [N×D]                w: [D×M]
[  2  1 -3 ]            [ 3 2 1 -1 ]
[ -3  4  2 ]            [ 2 1 3  2 ]
                        [ 3 2 1 -2 ]

y: [N×M]                dL/dy: [N×M]
[ 13  9 -2 -6 ]         [  2  3 -3  9 ]
[  5  2 17  1 ]         [ -8  1  4  6 ]

Jacobians: dy/dx has shape [(N×D)×(N×M)] and dy/dw has shape [(D×M)×(N×M)].
For a neural net we may have N=64, D=M=4096; each Jacobian takes 256 GB of memory! Must work with them implicitly!

Q: What parts of y are affected by one element of x?
A: One element of x affects the whole corresponding row of y.

Q: How much does one element of x affect one element of y?
A: By the corresponding element of w (since y[n,m] is the sum over d of x[n,d] * w[d,m]).

Putting these together:
dL/dx = (dL/dy) w^T        [N×D] = [N×M] [M×D]
By similar logic:
dL/dw = x^T (dL/dy)        [D×M] = [D×N] [N×M]

These formulas are easy to remember: they are the only way to make shapes match up!

Also see the derivation in the course notes:
http://cs231n.stanford.edu/handouts/linear-backprop.pdf

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 152 April 08, 2021
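A NumPy sketch of those two formulas; the shape annotations in the comments show why this is the only way everything lines up (the sizes here are illustrative):

import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)

y = x.dot(w)                     # forward: [N x M]
dL_dy = np.random.randn(N, M)    # upstream gradient: [N x M]

dL_dx = dL_dy.dot(w.T)           # [N x M] [M x D] -> [N x D], same shape as x
dL_dw = x.T.dot(dL_dy)           # [D x N] [N x M] -> [D x M], same shape as w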
A vectorized example:

Always check: The gradient with respect to a variable should have the same shape as the variable.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 4 - 168 April 08, 2021
In discussion section: A matrix example...

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 169 April 13, 2017
