4.2 Backpropagation 1

The document discusses backpropagation and gradient descent in neural networks. It covers computational graphs and how gradients flow backward through the graph via the chain rule. Specifically, it works through examples in which the gradient at each node's inputs is obtained by multiplying the upstream gradient by the node's local gradient. It also discusses common patterns in the backward flow of gradients, such as addition gates acting as gradient distributors and maximum gates acting as gradient routers.


Where we are...

We have a scores function s = f(x; W) = W x, the SVM loss on those scores,
L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1),
and the full loss = data loss + regularization,
L = (1/N) sum_i L_i + lambda R(W).

We want the gradient dL/dW so that we can update the weights.
Optimization

(The slide illustrates gradient descent as walking downhill on a loss landscape; the landscape and walking-man images are CC0 1.0 public domain.)
Gradient descent

● Numerical gradient: slow :(, approximate :(, easy to write :)
● Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation against the numerical gradient (a "gradient check").
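A minimal sketch of such a gradient check in numpy (the function name and test function are illustrative, not from the lecture):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        # centered finite differences: slow and approximate, but easy to write
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            i = it.multi_index
            old = x[i]
            x[i] = old + h; fp = f(x)
            x[i] = old - h; fm = f(x)
            x[i] = old
            grad[i] = (fp - fm) / (2 * h)
            it.iternext()
        return grad

    # compare against an analytic gradient: for f(x) = sum(x**2) it is 2*x
    x = np.random.randn(3, 4)
    num = numerical_gradient(lambda x: np.sum(x ** 2), x)
    print(np.max(np.abs(num - 2 * x)))   # should be tiny (around 1e-9 or smaller)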
Computational graphs

Example graph for a linear classifier: the inputs x and W feed a multiply node (*) that produces the scores s = W x; the scores feed a hinge-loss node that produces the data loss; W also feeds a regularization node R(W); a final add node (+) sums the data loss and the regularization term into the total loss L.
Convolutional network (AlexNet)

The same idea scales up: AlexNet's computational graph takes an input image, passes it through many layers of weights, and ends in a loss. (The slide shows the AlexNet architecture figure.)
Backpropagation: a simple example

f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4

Forward pass: q = x + y = 3, then f = q z = -12.

Want: df/dx, df/dy, df/dz.

Local gradients: dq/dx = 1, dq/dy = 1, df/dq = z, df/dz = q.

Backward pass, applying the chain rule from the output back toward the inputs:

df/df = 1
df/dz = q = 3
df/dq = z = -4
df/dy = df/dq * dq/dy = (-4) * 1 = -4      [upstream gradient] x [local gradient]
df/dx = df/dq * dq/dx = (-4) * 1 = -4      [upstream gradient] x [local gradient]

At every node, the gradient flowing back to an input is the upstream gradient arriving at the node multiplied by the node's local gradient.
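A minimal numeric sketch of this example in plain Python (the variable names are illustrative):

    # forward pass for f(x, y, z) = (x + y) * z
    x, y, z = -2.0, 5.0, -4.0
    q = x + y          # q = 3
    f = q * z          # f = -12

    # backward pass: multiply each local gradient by the upstream gradient
    df = 1.0           # gradient of f with respect to itself
    dz = q * df        # df/dz = q             -> 3
    dq = z * df        # df/dq = z             -> -4
    dx = 1.0 * dq      # dq/dx = 1, chain rule -> -4
    dy = 1.0 * dq      # dq/dy = 1, chain rule -> -4

    print(dx, dy, dz)  # -4.0 -4.0 3.0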
More generally, consider a single node f in the graph with inputs x, y and output z. During the forward pass the node can compute its "local gradients" dz/dx and dz/dy. During the backward pass it receives the upstream gradient dL/dz from further up the graph, multiplies it by each local gradient (chain rule), and passes the resulting gradients dL/dx = dL/dz * dz/dx and dL/dy = dL/dz * dz/dy downstream to the nodes that produced x and y. Each node only ever needs its own local gradients; backpropagation chains them together across the whole graph.
Another example:

f(w, x) = 1 / (1 + e^-(w0 x0 + w1 x1 + w2))

with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3. The forward pass through the graph gives w0 x0 = -2, w1 x1 = 6, w0 x0 + w1 x1 + w2 = 1, then e^-1 = 0.37, 1 + 0.37 = 1.37, and finally f = 1/1.37 = 0.73.

Walking backward from the output, each gate computes [upstream gradient] x [local gradient]:

● 1/x gate: local gradient -1/x^2 = -1/1.37^2, so [1.00] x [-0.53] = -0.53
● +1 gate: local gradient 1, so the gradient stays -0.53
● exp gate: local gradient e^x = e^-1 = 0.37, so [-0.53] x [0.37] = -0.20
● *(-1) gate: local gradient -1, so [-0.20] x [-1] = 0.20
● add gate: local gradient 1 on every input, so
  [0.2] x [1] = 0.2
  [0.2] x [1] = 0.2 (both inputs!)
  and w2 also receives 0.2
● multiply gate (w0 * x0): each input's local gradient is the value of the other input, so
  x0: [0.2] x [2] = 0.4
  w0: [0.2] x [-1] = -0.2
  (and similarly on the w1 * x1 branch: x1 gets [0.2] x [-3] = -0.6, w1 gets [0.2] x [-2] = -0.4)
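A plain-Python sketch of the forward and backward pass through this circuit, gate by gate (the intermediate variable names are illustrative); the final check anticipates the sigmoid-gate shortcut introduced next:

    import math

    w0, x0 = 2.0, -1.0
    w1, x1 = -3.0, -2.0
    w2 = -3.0

    # forward pass
    a = w0 * x0                           # -2
    b = w1 * x1                           #  6
    s = a + b + w2                        #  1
    f = 1.0 / (1.0 + math.exp(-s))        #  0.73

    # backward pass: [upstream gradient] x [local gradient] at every gate
    df = 1.0
    dden = df * (-1.0 / (1.0 + math.exp(-s)) ** 2)   # 1/x gate:   -0.53
    dexp = dden * 1.0                                # +1 gate:    -0.53
    dneg = dexp * math.exp(-s)                       # exp gate:   -0.20
    ds = dneg * -1.0                                 # *(-1) gate:  0.20
    da = db = dw2 = ds                               # add gate distributes: 0.2 each
    dw0, dx0 = da * x0, da * w0                      # mul gate:   -0.2, 0.4
    dw1, dx1 = db * x1, db * w1                      # mul gate:   -0.4, -0.6

    # same result via the sigmoid shortcut: dsigma/ds = (1 - f) * f
    assert abs(ds - (1 - f) * f * df) < 1e-9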
Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

For example, the chain of gates above (*-1, exp, +1, 1/x) computes the sigmoid function sigma(x) = 1 / (1 + e^-x), whose derivative has the convenient form dsigma/dx = (1 - sigma(x)) sigma(x). Collapsing the chain into a single sigmoid gate gives the same gradient in one step:

[upstream gradient] x [local gradient]
[1.00] x [(1 - 0.73) (0.73)] = 0.2
Patterns in backward flow

● add gate: gradient distributor — it passes its upstream gradient, unchanged, to all of its inputs.
● max gate: gradient router — it routes the full upstream gradient to the input that was largest in the forward pass; the other inputs receive gradient 0.
● mul gate: gradient switcher — each input receives the upstream gradient scaled by the value of the other input.
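A minimal sketch of the three backward rules in plain Python (names are illustrative):

    def add_backward(dout):
        # gradient distributor: both inputs get the upstream gradient unchanged
        return dout, dout

    def max_backward(x, y, dout):
        # gradient router: only the winning input gets the upstream gradient
        return (dout, 0.0) if x >= y else (0.0, dout)

    def mul_backward(x, y, dout):
        # gradient switcher: each input gets the upstream gradient times the other input
        return dout * y, dout * x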
Gradients add at branches

When a variable is used by more than one node in the graph, the gradients flowing back along the different branches are summed at that variable (multivariate chain rule).
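A tiny sketch (not from the slides): with f = x*y + x, the variable x feeds two branches and its gradient accumulates with +=:

    x, y = 3.0, 4.0
    a = x * y          # branch 1 uses x
    f = a + x          # branch 2 uses x again

    df = 1.0
    da, dx = df, df    # add gate distributes the upstream gradient
    dx += da * y       # gradient from the multiply branch accumulates into dx
    # dx == y + 1 == 5.0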
Gradients for vectorized code

When x, y, z are now vectors, the "local gradient" dz/dx at a node f is the Jacobian matrix: the derivative of each element of z with respect to each element of x. The backward pass then multiplies the upstream gradient by this Jacobian to obtain the downstream gradient.
Vectorized operations

Consider f(x) = max(0, x) applied elementwise to a 4096-d input vector, producing a 4096-d output vector.

Q: What is the size of the Jacobian matrix?
[4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: What does it look like?
Because the operation is elementwise, the Jacobian is diagonal: entry (i, i) is 1 if x_i > 0 and 0 otherwise. We never form it explicitly; the backward pass simply zeroes the upstream gradient wherever the input was negative.
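A numpy sketch of how this is exploited in practice: instead of ever forming the Jacobian, the ReLU backward pass applies an elementwise mask (the shapes are the ones from the slide; the names are illustrative):

    import numpy as np

    x = np.random.randn(100, 4096)      # a minibatch of 4096-d inputs
    out = np.maximum(0, x)              # forward: elementwise ReLU

    dout = np.random.randn(*out.shape)  # upstream gradient, same shape as out
    dx = dout * (x > 0)                 # backward: the implicit diagonal Jacobian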
A vectorized example: f(x, W) = ||W x||^2 = sum_i (W x)_i^2

Here x is an n-dimensional vector and W is an n x n matrix. Writing q = W x, we have f = sum_i q_i^2, so df/dq_i = 2 q_i. Applying the chain rule:

● gradient with respect to W: df/dW_ij = 2 q_i x_j, i.e. grad_W f = 2 q x^T
● gradient with respect to x: grad_x f = 2 W^T q

Always check: the gradient with respect to a variable should have the same shape as the variable (grad_W f has the shape of W, grad_x f has the shape of x).
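A numpy sketch of this example (the particular numbers are illustrative, not necessarily the slide's):

    import numpy as np

    W = np.array([[0.1, 0.5],
                  [-0.3, 0.8]])
    x = np.array([0.2, 0.4])

    # forward pass
    q = W.dot(x)            # q = W x
    f = np.sum(q ** 2)      # f = ||W x||^2

    # backward pass
    dq = 2.0 * q            # df/dq
    dW = np.outer(dq, x)    # grad_W f = 2 q x^T, same shape as W
    dx = W.T.dot(dq)        # grad_x f = 2 W^T q, same shape as x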
In discussion section: A matrix example...

Modularized implementation: forward / backward API

Graph (or Net) object (rough pseudocode): the graph keeps a topologically sorted list of gate/node objects; its forward() calls each gate's forward pass in order and returns the loss, and its backward() calls each gate's backward pass in reverse order, applying the chain rule piece by piece.
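The slide's pseudocode is not reproduced in this extract; a minimal sketch of the idea, assuming a strictly sequential chain of gates (illustrative only):

    class SequentialGraph(object):
        """A minimal, strictly sequential computational graph (illustrative only)."""
        def __init__(self, gates):
            self.gates = gates                  # list of gate objects, in forward order

        def forward(self, x):
            for gate in self.gates:             # forward the computational graph
                x = gate.forward(x)
            return x                            # output of the last gate (e.g. the loss)

        def backward(self, dout=1.0):
            for gate in reversed(self.gates):   # apply the chain rule gate by gate
                dout = gate.backward(dout)
            return dout                         # gradient w.r.t. the graph's input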
Modularized implementation: forward / backward API

As a concrete gate, consider a multiply node with inputs x, y and output z = x * y (x, y, z are scalars). Its forward function computes z and caches the input values; its backward function takes the upstream gradient dL/dz and returns, for each input variable, the local gradient times the upstream gradient: dL/dx = y * dL/dz and dL/dy = x * dL/dz.
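A sketch of that multiply gate in the same API (reconstructed, so treat the details as illustrative rather than the slide's exact code):

    class MultiplyGate(object):
        def forward(self, x, y):
            z = x * y
            self.x = x          # must cache the inputs: they are
            self.y = y          # the local gradients needed in backward
            return z

        def backward(self, dz):
            dx = self.y * dz    # [local gradient dz/dx] x [upstream gradient]
            dy = self.x * dz    # [local gradient dz/dy] x [upstream gradient]
            return dx, dy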
Example: Caffe layers

Real frameworks are organized the same way. In Caffe (licensed under BSD 2-Clause), each layer implements a forward and a backward method.

Caffe Sigmoid Layer

The Sigmoid layer's forward pass applies sigma(x) elementwise; its backward pass multiplies the incoming top_diff by the local gradient sigma(x) * (1 - sigma(x)) — the chain rule again — to produce the gradient passed down to the layer below.
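A rough Python analogue of such a layer (not Caffe's actual C++ code); note that the cached output is all the backward pass needs:

    import numpy as np

    class SigmoidLayer(object):
        def forward(self, x):
            self.out = 1.0 / (1.0 + np.exp(-x))   # cache the output
            return self.out

        def backward(self, top_diff):
            # local gradient of the sigmoid is out * (1 - out); multiply by top_diff (chain rule)
            return top_diff * self.out * (1.0 - self.out)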
In Assignment 1: Writing SVM / Softmax

Stage your forward/backward computation! E.g. for the SVM, compute the scores, then the margins, then the loss in the forward pass; in the backward pass, walk back through the same intermediate stages (margins, then scores) to get the gradient on W.
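A hedged sketch of what such staged code might look like for the multiclass SVM loss (the names margins, dscores, etc. are illustrative, not the assignment's required interface):

    import numpy as np

    def svm_loss_staged(W, X, y):
        """W: (D, C) weights, X: (N, D) data, y: (N,) correct class indices."""
        N = X.shape[0]

        # forward pass, staged into named intermediates
        scores = X.dot(W)                                   # (N, C)
        correct = scores[np.arange(N), y][:, None]          # (N, 1)
        margins = np.maximum(0, scores - correct + 1.0)     # (N, C)
        margins[np.arange(N), y] = 0.0
        loss = margins.sum() / N

        # backward pass, walking back through the same stages
        dmargins = (margins > 0).astype(float) / N          # max gate routes the gradient
        dscores = dmargins.copy()
        dscores[np.arange(N), y] -= dmargins.sum(axis=1)    # correct-class score appears in every margin
        dW = X.T.dot(dscores)                               # (D, C), same shape as W
        return loss, dW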
Summary so far...
● neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
