Comp3314 7. Gradient Backpropagation

The document discusses gradient backpropagation in neural networks, emphasizing the importance of using analytic gradients for efficiency and accuracy. It explains the concept of computational graphs and the application of the chain rule to compute gradients recursively. Additionally, it highlights the challenges of handling large neural networks and the necessity of maintaining a graph structure for effective implementation.


Gradient Backpropagation and Neural Networks

Slides by Fei-Fei Li, Justin Johnson & Serena Yeung (Lecture 4, April 13, 2017)
Optimization

[Figure: a walker descending a loss landscape, illustrating gradient descent. Landscape and walking-man images are CC0 1.0 public domain.]
Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation with the numerical gradient.
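To make the "check with the numerical gradient" step concrete, here is a minimal gradient-check sketch in numpy (my own illustration, not code from the slides); the test function f(x) = sum(x**2) and the step size h are arbitrary choices.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (x is a numpy array)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                        # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: the analytic gradient of f(x) = sum(x**2) is 2x; the two should agree closely.
x = np.random.randn(5)
analytic = 2 * x
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(np.max(np.abs(analytic - numeric)))      # should be tiny (around 1e-10)
```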

Computational graphs
[Figure: computational graph of a linear classifier. Inputs x and W feed a multiply node (*) producing the scores s; s goes into the hinge loss; the hinge loss and the regularization term R (computed from W) feed an add node (+) producing the total loss L.]
Backpropagation: a simple example

f(x, y, z) = (x + y) · z, e.g. x = -2, y = 5, z = -4

Want: the gradients df/dx, df/dy, df/dz. Introducing the intermediate q = x + y, the chain rule gives, for example, df/dx = (df/dq) · (dq/dx); the slides then fill in the backward pass one gate at a time, working from the output back to the inputs.
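As a worked version of this slide sequence, here is a sketch in plain Python, assuming the example function is f(x, y, z) = (x + y) · z as the chain-rule step above suggests:

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node: q = 3
f = q * z          # output: f = -12

# Backward pass: apply the chain rule gate by gate, from the output to the inputs
df_df = 1.0                  # gradient of the output with respect to itself
df_dq = z * df_df            # multiply gate: d(q*z)/dq = z   -> -4
df_dz = q * df_df            # multiply gate: d(q*z)/dz = q   ->  3
df_dx = 1.0 * df_dq          # add gate: dq/dx = 1            -> -4
df_dy = 1.0 * df_dq          # add gate: dq/dy = 1            -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```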
[Figure, built up over several slides: a single gate f inside a larger computational graph. During the forward pass the gate computes its output from its inputs and can also compute its "local gradients", the derivatives of its output with respect to each input. During the backward pass it receives the upstream gradient (the gradient of the final loss with respect to its output), multiplies it by each local gradient, and passes the resulting gradients on to its inputs.]
Another example: f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2)))

Working backward one gate at a time, each step multiplies [local gradient] x [upstream gradient]:

- The partial derivative of the final output with respect to itself is always 1 (df/df), so the backward pass starts with a gradient of 1.0 at the output.
- At the *(-1) gate the local gradient is -1, so the gradient becomes (-1) * (-0.20) = 0.20.
- At the addition gates the local derivative is always 1, so every input of a + gate receives the upstream gradient unchanged: [1] x [0.2] = 0.2 for both inputs.
- At the multiplication gates the local gradient for each input is the other input's value:
  x0: [2] x [0.2] = 0.4
  w0: [-1] x [0.2] = -0.2
sigmoid function: sigma(x) = 1 / (1 + e^(-x)), whose derivative is dsigma/dx = (1 - sigma(x)) · sigma(x)

sigmoid gate: all the gates after the dot product w0·x0 + w1·x1 + w2 can be collapsed into a single sigmoid gate. Here the sigmoid's input is 1 and its output is 0.73, so its local gradient is (0.73) * (1 - 0.73) ≈ 0.2, the same value obtained gate by gate above.
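Putting the whole example together in code (a sketch, assuming the slide's input values w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3, which are consistent with the numbers annotated above):

```python
import math

# Forward pass: f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0
dot = w0 * x0 + w1 * x1 + w2        # = 1.0, the sigmoid's input
f = 1.0 / (1.0 + math.exp(-dot))    # = 0.73, the sigmoid's output

# Backward pass, using the sigmoid-gate shortcut: dsigma/dx = (1 - sigma) * sigma
ddot = (1.0 - f) * f                # ~ 0.20, gradient at the sigmoid's input
dw0, dx0 = x0 * ddot, w0 * ddot     # multiply gate: ~ -0.20 and ~ 0.39
dw1, dx1 = x1 * ddot, w1 * ddot     # ~ -0.39 and ~ -0.59
dw2 = 1.0 * ddot                    # add gate passes the gradient through: ~ 0.20
```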
Patterns in backward flow

- add gate: gradient distributor. It simply copies the upstream gradient to each of its inputs (the local derivative of addition is 1).
- max gate: gradient router. Only the maximum input reaches the output in the forward pass, so in the backward pass that input receives the full upstream gradient and the other inputs receive 0.
- mul gate: gradient switcher. Each input's gradient is the upstream gradient times the other input's value, i.e. the inputs are "switched".
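The three rules written as scalar backward functions (illustrative helper names, not part of any particular framework):

```python
def add_backward(x, y, dz):
    # gradient distributor: both inputs receive the upstream gradient dz unchanged
    return dz, dz

def max_backward(x, y, dz):
    # gradient router: the input that won the max gets the full gradient, the other gets 0
    return (dz, 0.0) if x >= y else (0.0, dz)

def mul_backward(x, y, dz):
    # gradient switcher: each input's gradient is the other input's value times dz
    return y * dz, x * dz
```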
Gradients add at branches: when a variable feeds into more than one node of the graph, the gradients flowing back along the different branches are summed to give its total gradient.

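A tiny worked example of this rule (my own numbers, not from the slide): x is used by two branches, and its total gradient is the sum of the two branch contributions.

```python
x = 3.0
y = 2.0 * x                 # branch 1
z = x ** 2                  # branch 2
L = y + z                   # scalar output

# Backward: dL/dy = dL/dz = 1; the two branch gradients into x are added
dL_dx = 2.0 * 1.0 + 2.0 * x * 1.0   # dy/dx + dz/dx = 2 + 6 = 8
```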
Gradients for vectorized code (x, y, z are now vectors)

The "local gradient" is now a Jacobian matrix: the derivative of each element of the output z with respect to each element of the input x (entries dz1/dx1, dz2/dx2, ..., and in general dz_i/dx_j). If the final loss L is a scalar and z is a vector, the upstream gradient dL/dz is a (row) vector, and the backward step is a vector-matrix product: row vector x Jacobian = row vector.
Vectorized operations

Example: f(x) = max(0, x) applied elementwise to a 4096-d input vector, producing a 4096-d output vector.

Q: What is the size of the Jacobian matrix?
A: 4096 x 4096. And in practice we process an entire minibatch (e.g. 100 examples) at a time, so the Jacobian would technically be a 409,600 x 409,600 matrix.

Q2: What does it look like?
A: A diagonal matrix, because max(0, x) acts elementwise: each output element depends only on the corresponding input element. The huge Jacobian is therefore never formed explicitly.
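A numpy sketch (not from the slides) of why the full Jacobian is never needed: because the Jacobian of an elementwise max(0, x) is diagonal, multiplying by it reduces to masking the upstream gradient. The minibatch and the upstream gradient below are random stand-ins.

```python
import numpy as np

# Forward: elementwise ReLU on a minibatch of 4096-d vectors
x = np.random.randn(100, 4096)
y = np.maximum(0, x)

# Backward: instead of building a 409,600 x 409,600 Jacobian, just zero out the
# upstream gradient wherever the corresponding input was <= 0.
dL_dy = np.random.randn(*y.shape)       # upstream gradient (stand-in)
dL_dx = dL_dy * (x > 0)
```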
A vectorized example: f(x, W) = ||W·x||^2 = sum_i (W·x)_i^2

Here x is an n-dimensional vector, W is an n x n matrix, and f is a scalar. With the intermediate q = W·x, the forward pass computes f = ||q||^2 (with the slide's numbers, f = 0.22^2 + 0.26^2). Working backward:

- df/dq = 2q
- df/dW = 2·q·x^T (q is a column vector and x^T is a row vector, so their outer product has the same shape as W)
- df/dx = 2·W^T·q

Always check: the gradient with respect to a variable should have the same shape as the variable.
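The same computation in numpy (a sketch; the W and x values below are the ones implied by q = [0.22, 0.26] above, as I read the slide's running example):

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass
q = W.dot(x)                  # q = [0.22, 0.26]
f = np.sum(q ** 2)            # f = 0.22**2 + 0.26**2 = 0.116

# Backward pass
df_dq = 2.0 * q               # gradient of the sum of squares
df_dW = np.outer(df_dq, x)    # 2*q*x^T, same shape as W (2x2)
df_dx = W.T.dot(df_dq)        # 2*W^T*q, same shape as x (2,)
```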
Summary so far...
● neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs / parameters / intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch below)
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
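A minimal sketch of what one node in such a graph might look like, using a multiply gate as the example (the class name and the caching scheme are illustrative, not a specific framework's API):

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the result and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradient times upstream gradient dz = dL/d(output)
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```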

Next: Neural Networks

Neural networks: without the brain stuff

(Before) Linear score function: s = W·x
(Now) 2-layer Neural Network: s = W2·max(0, W1·x)
or 3-layer Neural Network: s = W3·max(0, W2·max(0, W1·x))

In the running example the input x is 3072-d (a flattened 32x32x3 image), the hidden layer h = max(0, W1·x) is 100-d, and the output s = W2·h holds 10 class scores.
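In numpy, the 2-layer score function is only a few lines (a sketch; the random data and the 0.01 initialization scale are illustrative choices):

```python
import numpy as np

x = np.random.randn(3072)                  # one flattened 32x32x3 input image
W1 = np.random.randn(100, 3072) * 0.01     # first layer weights
W2 = np.random.randn(10, 100) * 0.01       # second layer weights

h = np.maximum(0, W1.dot(x))               # 100-d hidden layer (ReLU)
s = W2.dot(h)                              # 10 class scores
```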
Full implementation of training a 2-layer Neural Network needs ~20 lines:
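The code on the original slide is not reproduced here, so the following is a sketch in the same spirit: a tiny 2-layer network trained with gradient descent on random data. The layer sizes, the sigmoid nonlinearity, the squared-error loss, and the learning rate are all assumptions.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H) * 0.01       # small init keeps the sigmoid unsaturated
w2 = np.random.randn(H, D_out) * 0.01

for t in range(500):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # sigmoid hidden layer, shape (N, H)
    y_pred = h.dot(w2)                     # shape (N, D_out)
    loss = np.square(y_pred - y).sum()

    # backward pass (chain rule, layer by layer)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient is h*(1-h)

    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```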

Activation functions

Sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, ELU, Maxout

[Figure: a plot of each activation function]
Neural networks: Architectures

A "2-layer Neural Net" is also called a "1-hidden-layer Neural Net"; a "3-layer Neural Net" is also called a "2-hidden-layer Neural Net". Their layers are "fully-connected": every neuron in one layer is connected to every neuron in the adjacent layers.

[Figure: diagrams of a 2-layer and a 3-layer fully-connected network]
Example feed-forward computation of a neural network

Because each layer's weights form a matrix, we can efficiently evaluate an entire layer of neurons with a single matrix multiply.
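A sketch of this feed-forward computation for a small 3-layer fully-connected network (the layer sizes, random parameters, and the choice of a sigmoid activation are assumptions, not values from the slide):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))     # activation function (sigmoid here)

x = np.random.randn(3, 1)                  # input vector (3-d)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1.dot(x) + b1)                     # first hidden layer: 4 neurons, one matrix multiply
h2 = f(W2.dot(h1) + b2)                    # second hidden layer: 4 neurons
out = W3.dot(h2) + b3                      # output neuron
```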
Summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks

