Comp3314 7. Gradient Backpropagation

The document discusses gradient backpropagation in neural networks, emphasizing the importance of using analytic gradients for efficiency and accuracy. It explains the concept of computational graphs and the application of the chain rule to compute gradients recursively. Additionally, it highlights the challenges of handling large neural networks and the necessity of maintaining a graph structure for effective implementation.


Gradient Backpropagation and Neural Networks

Slides by Fei-Fei Li, Justin Johnson & Serena Yeung (Lecture 4, April 13, 2017)
Optimization

[Figure: a walker descending a loss landscape, illustrating gradient descent. Landscape and walking-man images are CC0 1.0 public domain.]
Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation with the numerical gradient.
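To make the "check with the numerical gradient" step concrete, here is a minimal gradient-check sketch in numpy (my own illustration, not code from the slides); the test function f(x) = sum(x**2) and the step size h are arbitrary choices.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (x is a numpy array)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                        # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: the analytic gradient of f(x) = sum(x**2) is 2x; the two should agree closely.
x = np.random.randn(5)
analytic = 2 * x
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(np.max(np.abs(analytic - numeric)))      # should be tiny (around 1e-10)
```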

Computational graphs
[Figure: computational graph of a linear classifier. Inputs x and W feed a multiply node (*) producing the scores s; s goes into the hinge loss; the hinge loss and the regularization term R (computed from W) feed an add node (+) producing the total loss L.]
Backpropagation: a simple example

f(x, y, z) = (x + y) · z, e.g. x = -2, y = 5, z = -4

Want: the gradients df/dx, df/dy, df/dz. Introducing the intermediate q = x + y, the chain rule gives, for example, df/dx = (df/dq) · (dq/dx); the slides then fill in the backward pass one gate at a time, working from the output back to the inputs.
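As a worked version of this slide sequence, here is a sketch in plain Python, assuming the example function is f(x, y, z) = (x + y) · z as the chain-rule step above suggests:

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node: q = 3
f = q * z          # output: f = -12

# Backward pass: apply the chain rule gate by gate, from the output to the inputs
df_df = 1.0                  # gradient of the output with respect to itself
df_dq = z * df_df            # multiply gate: d(q*z)/dq = z   -> -4
df_dz = q * df_df            # multiply gate: d(q*z)/dz = q   ->  3
df_dx = 1.0 * df_dq          # add gate: dq/dx = 1            -> -4
df_dy = 1.0 * df_dq          # add gate: dq/dy = 1            -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```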
[Figure, built up over several slides: a single gate f inside a larger computational graph. During the forward pass the gate computes its output from its inputs and can also compute its "local gradients", the derivatives of its output with respect to each input. During the backward pass it receives the upstream gradient (the gradient of the final loss with respect to its output), multiplies it by each local gradient, and passes the resulting gradients on to its inputs.]
Another example: f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2)))

Working backward one gate at a time, each step multiplies [local gradient] x [upstream gradient]:

- The partial derivative of the final output with respect to itself is always 1 (df/df), so the backward pass starts with a gradient of 1.0 at the output.
- At the *(-1) gate the local gradient is -1, so the gradient becomes (-1) * (-0.20) = 0.20.
- At the addition gates the local derivative is always 1, so every input of a + gate receives the upstream gradient unchanged: [1] x [0.2] = 0.2 for both inputs.
- At the multiplication gates the local gradient for each input is the other input's value:
  x0: [2] x [0.2] = 0.4
  w0: [-1] x [0.2] = -0.2
sigmoid function: sigma(x) = 1 / (1 + e^(-x)), whose derivative is dsigma/dx = (1 - sigma(x)) · sigma(x)

sigmoid gate: all the gates after the dot product w0·x0 + w1·x1 + w2 can be collapsed into a single sigmoid gate. Here the sigmoid's input is 1 and its output is 0.73, so its local gradient is (0.73) * (1 - 0.73) ≈ 0.2, the same value obtained gate by gate above.
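Putting the whole example together in code (a sketch, assuming the slide's input values w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3, which are consistent with the numbers annotated above):

```python
import math

# Forward pass: f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0
dot = w0 * x0 + w1 * x1 + w2        # = 1.0, the sigmoid's input
f = 1.0 / (1.0 + math.exp(-dot))    # = 0.73, the sigmoid's output

# Backward pass, using the sigmoid-gate shortcut: dsigma/dx = (1 - sigma) * sigma
ddot = (1.0 - f) * f                # ~ 0.20, gradient at the sigmoid's input
dw0, dx0 = x0 * ddot, w0 * ddot     # multiply gate: ~ -0.20 and ~ 0.39
dw1, dx1 = x1 * ddot, w1 * ddot     # ~ -0.39 and ~ -0.59
dw2 = 1.0 * ddot                    # add gate passes the gradient through: ~ 0.20
```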
Patterns in backward flow

- add gate: gradient distributor. It simply copies the upstream gradient to each of its inputs (the local derivative of addition is 1).
- max gate: gradient router. Only the maximum input reaches the output in the forward pass, so in the backward pass that input receives the full upstream gradient and the other inputs receive 0.
- mul gate: gradient switcher. Each input's gradient is the upstream gradient times the other input's value, i.e. the inputs are "switched".
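The three rules written as scalar backward functions (illustrative helper names, not part of any particular framework):

```python
def add_backward(x, y, dz):
    # gradient distributor: both inputs receive the upstream gradient dz unchanged
    return dz, dz

def max_backward(x, y, dz):
    # gradient router: the input that won the max gets the full gradient, the other gets 0
    return (dz, 0.0) if x >= y else (0.0, dz)

def mul_backward(x, y, dz):
    # gradient switcher: each input's gradient is the other input's value times dz
    return y * dz, x * dz
```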
Gradients add at branches: when a variable feeds into more than one node of the graph, the gradients flowing back along the different branches are summed to give its total gradient.

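A tiny worked example of this rule (my own numbers, not from the slide): x is used by two branches, and its total gradient is the sum of the two branch contributions.

```python
x = 3.0
y = 2.0 * x                 # branch 1
z = x ** 2                  # branch 2
L = y + z                   # scalar output

# Backward: dL/dy = dL/dz = 1; the two branch gradients into x are added
dL_dx = 2.0 * 1.0 + 2.0 * x * 1.0   # dy/dx + dz/dx = 2 + 6 = 8
```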
Gradients for vectorized code (x, y, z are now vectors)

The "local gradient" is now a Jacobian matrix: the derivative of each element of the output z with respect to each element of the input x (entries dz1/dx1, dz2/dx2, ..., and in general dz_i/dx_j). If the final loss L is a scalar and z is a vector, the upstream gradient dL/dz is a (row) vector, and the backward step is a vector-matrix product: row vector x Jacobian = row vector.
Vectorized operations

Example: f(x) = max(0, x) applied elementwise to a 4096-d input vector, producing a 4096-d output vector.

Q: What is the size of the Jacobian matrix?
A: 4096 x 4096. And in practice we process an entire minibatch (e.g. 100 examples) at a time, so the Jacobian would technically be a 409,600 x 409,600 matrix.

Q2: What does it look like?
A: A diagonal matrix, because max(0, x) acts elementwise: each output element depends only on the corresponding input element. The huge Jacobian is therefore never formed explicitly.
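A numpy sketch (not from the slides) of why the full Jacobian is never needed: because the Jacobian of an elementwise max(0, x) is diagonal, multiplying by it reduces to masking the upstream gradient. The minibatch and the upstream gradient below are random stand-ins.

```python
import numpy as np

# Forward: elementwise ReLU on a minibatch of 4096-d vectors
x = np.random.randn(100, 4096)
y = np.maximum(0, x)

# Backward: instead of building a 409,600 x 409,600 Jacobian, just zero out the
# upstream gradient wherever the corresponding input was <= 0.
dL_dy = np.random.randn(*y.shape)       # upstream gradient (stand-in)
dL_dx = dL_dy * (x > 0)
```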
A vectorized example: f(x, W) = ||W·x||^2 = sum_i (W·x)_i^2

Here x is an n-dimensional vector, W is an n x n matrix, and f is a scalar. With the intermediate q = W·x, the forward pass computes f = ||q||^2 (with the slide's numbers, f = 0.22^2 + 0.26^2). Working backward:

- df/dq = 2q
- df/dW = 2·q·x^T (q is a column vector and x^T is a row vector, so their outer product has the same shape as W)
- df/dx = 2·W^T·q

Always check: the gradient with respect to a variable should have the same shape as the variable.
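The same computation in numpy (a sketch; the W and x values below are the ones implied by q = [0.22, 0.26] above, as I read the slide's running example):

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass
q = W.dot(x)                  # q = [0.22, 0.26]
f = np.sum(q ** 2)            # f = 0.22**2 + 0.26**2 = 0.116

# Backward pass
df_dq = 2.0 * q               # gradient of the sum of squares
df_dW = np.outer(df_dq, x)    # 2*q*x^T, same shape as W (2x2)
df_dx = W.T.dot(df_dq)        # 2*W^T*q, same shape as x (2,)
```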
Summary so far...
● neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs / parameters / intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch below)
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
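A minimal sketch of what one node in such a graph might look like, using a multiply gate as the example (the class name and the caching scheme are illustrative, not a specific framework's API):

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the result and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradient times upstream gradient dz = dL/d(output)
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```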

Next: Neural Networks

Neural networks: without the brain stuff

(Before) Linear score function: s = W·x
(Now) 2-layer Neural Network: s = W2·max(0, W1·x)
or 3-layer Neural Network: s = W3·max(0, W2·max(0, W1·x))

In the running example the input x is 3072-d (a flattened 32x32x3 image), the hidden layer h = max(0, W1·x) is 100-d, and the output s = W2·h holds 10 class scores.
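In numpy, the 2-layer score function is only a few lines (a sketch; the random data and the 0.01 initialization scale are illustrative choices):

```python
import numpy as np

x = np.random.randn(3072)                  # one flattened 32x32x3 input image
W1 = np.random.randn(100, 3072) * 0.01     # first layer weights
W2 = np.random.randn(10, 100) * 0.01       # second layer weights

h = np.maximum(0, W1.dot(x))               # 100-d hidden layer (ReLU)
s = W2.dot(h)                              # 10 class scores
```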
Full implementation of training a 2-layer Neural Network needs ~20 lines:
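The code on the original slide is not reproduced here, so the following is a sketch in the same spirit: a tiny 2-layer network trained with gradient descent on random data. The layer sizes, the sigmoid nonlinearity, the squared-error loss, and the learning rate are all assumptions.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H) * 0.01       # small init keeps the sigmoid unsaturated
w2 = np.random.randn(H, D_out) * 0.01

for t in range(500):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # sigmoid hidden layer, shape (N, H)
    y_pred = h.dot(w2)                     # shape (N, D_out)
    loss = np.square(y_pred - y).sum()

    # backward pass (chain rule, layer by layer)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient is h*(1-h)

    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```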

Activation functions

Sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, ELU, Maxout

[Figure: a plot of each activation function]
Neural networks: Architectures

A "2-layer Neural Net" is also called a "1-hidden-layer Neural Net"; a "3-layer Neural Net" is also called a "2-hidden-layer Neural Net". Their layers are "fully-connected": every neuron in one layer is connected to every neuron in the adjacent layers.

[Figure: diagrams of a 2-layer and a 3-layer fully-connected network]
Example feed-forward computation of a neural network

Because each layer's weights form a matrix, we can efficiently evaluate an entire layer of neurons with a single matrix multiply.
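A sketch of this feed-forward computation for a small 3-layer fully-connected network (the layer sizes, random parameters, and the choice of a sigmoid activation are assumptions, not values from the slide):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))     # activation function (sigmoid here)

x = np.random.randn(3, 1)                  # input vector (3-d)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1.dot(x) + b1)                     # first hidden layer: 4 neurons, one matrix multiply
h2 = f(W2.dot(h1) + b2)                    # second hidden layer: 4 neurons
out = W3.dot(h2) + b3                      # output neuron
```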
Summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks

