cs231n 2018 Lecture04

The document outlines Lecture 4 by Fei-Fei Li, Justin Johnson, and Serena Yeung, focusing on optimization techniques, including gradient descent and backpropagation. It discusses computational graphs, the significance of local gradients, and examples of vectorized operations in neural networks. Additionally, it provides administrative details regarding assignment deadlines and office hours.


Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 1 April 12, 2018

Administrative

Assignment 1 due Wednesday April 18, 11:59pm

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 2 April 12, 2018
Administrative

All office hours this week will use queuestatus

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 3 April 12, 2018
Where we are...

scores function
SVM loss
full loss = data loss + regularization

want: the gradient of the loss with respect to the weights W

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 4 April 12, 2018
Optimization

Landscape image is CC0 1.0 public domain


Walking man image is CC0 1.0 public domain

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 5 April 12, 2018
Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)


Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive the analytic gradient, then check your implementation with the numerical gradient.
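A minimal sketch of such a numerical gradient check (centered finite differences; the function f and the tolerance are illustrative, not the course's exact code):

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # centered finite-difference estimate of df/dx, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fxph = f(x)   # f(x + h)
        x.flat[i] = old - h; fxmh = f(x)   # f(x - h)
        x.flat[i] = old                    # restore the original value
        grad.flat[i] = (fxph - fxmh) / (2 * h)
    return grad

# compare against the analytic gradient with a relative error, e.g.
# rel_err = np.abs(gn - ga) / np.maximum(1e-8, np.abs(gn) + np.abs(ga))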

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 6 April 12, 2018
Computational graphs

Graph for the regularized SVM loss: inputs x and W feed a multiply node (*) that produces the scores s; the scores go into the hinge loss; a regularization term R(W) is added (+) to give the total loss L.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 7 April 12, 2018
Convolutional network
(AlexNet)

The input image and the weights flow through the network's layers to produce a loss.

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 8 April 12, 2018
Neural Turing Machine

Computational graph from input image to loss.

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 9 April 12, 2018
Neural Turing Machine

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 10 April 12, 2018
Backpropagation: a simple example

f(x, y, z) = (x + y) z
e.g. x = -2, y = 5, z = -4

Want: df/dx, df/dy, df/dz

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 11-20 April 13, 2017
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: df/dx, df/dy, df/dz

Chain rule: df/dy = (df/dq) * (dq/dy), i.e. [upstream gradient] x [local gradient]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 21-23 April 13, 2017
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4, with q = x + y and f = q * z

df/dq = z = -4,  df/dz = q = 3
df/dx = df/dq * dq/dx = z * 1 = -4 * 1 = -4
df/dy = df/dq * dq/dy = z * 1 = -4 * 1 = -4

Chain rule: [upstream gradient] x [local gradient]
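A quick plain-Python check of these numbers (no libraries needed):

x, y, z = -2.0, 5.0, -4.0
q = x + y                   # forward: q = 3
f = q * z                   # forward: f = -12
df_dq = z                   # local gradient of f = q * z w.r.t. q
df_dz = q                   # local gradient w.r.t. z
df_dx = df_dq * 1.0         # chain rule: dq/dx = 1
df_dy = df_dq * 1.0         # chain rule: dq/dy = 1
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0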

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 24 April 13, 2017
During backpropagation, each node f receives an upstream gradient (the gradient of the loss with respect to the node's output), computes the “local gradient” of its output with respect to each input, and multiplies the two to obtain the gradients it passes back to its inputs.

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 25-30 April 12, 2018
Another example: f(w, x) = 1 / (1 + e^(-(w0*x0 + w1*x1 + w2)))

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 31-33 April 12, 2018
Another example:
df/dx = d(x + c)/dx = 1

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 34 April 12, 2018
Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 35 April 12, 2018
Another example:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 36 April 12, 2018
Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 37 April 12, 2018
Another example:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 38 April 12, 2018
Another example:

[upstream gradient] x [local gradient]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 39 April 12, 2018
Another example:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 40 April 12, 2018
Another example:

[upstream gradient] x [local gradient]: 1 * 0.2 = 0.2

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 41 April 12, 2018
Another example:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 42 April 12, 2018
Another example:

[upstream gradient] x [local gradient]


[0.2] x [1] = 0.2
[0.2] x [1] = 0.2 (both inputs!)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 43 April 12, 2018
Another example:

d(xy)/dx = y,  d(xy)/dy = x

2 * 0.2 = 0.4

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 44 April 12, 2018
Another example:

[upstream gradient] x [local gradient]


x0: [0.2] x [2] = 0.4
w0: [0.2] x [-1] = -0.2

w1: [0.2] x [-2] = -0.4
x1: [0.2] x [-3] = -0.6

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 45 April 12, 2018
Computational graph representation may not
be unique. Choose one where local gradients
at each node can be easily expressed!

sigmoid function

sigmoid gate

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 46 April 12, 2018
Computational graph representation may not
be unique. Choose one where local gradients
at each node can be easily expressed!

sigmoid function

sigmoid gate

[upstream gradient] x [local gradient]


[1.00] x [(1 - 0.73)(0.73)] = 0.2
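This shortcut works because the sigmoid's derivative can be written in terms of its own output: dσ(x)/dx = (1 - σ(x)) σ(x). With this node's output σ(x) = 0.73 and upstream gradient 1.00, the input gradient is 1.00 * (1 - 0.73) * 0.73 ≈ 0.2, the same value obtained by backpropagating through the individual gates.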

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 47 April 12, 2018
Patterns in backward flow

add gate: gradient distributor

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 48 April 12, 2018
Patterns in backward flow

add gate: gradient distributor


Q: What is a max gate?

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 49 April 12, 2018
Patterns in backward flow

add gate: gradient distributor


max gate: gradient router

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 50 April 12, 2018
Patterns in backward flow

add gate: gradient distributor


max gate: gradient router
Q: What is a mul gate?

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 51 April 12, 2018
Patterns in backward flow

add gate: gradient distributor


max gate: gradient router
mul gate: gradient switcher
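A minimal sketch of these three patterns for scalar gates (hypothetical helper functions, not code from the slides):

def add_backward(dout):
    # gradient distributor: both inputs receive the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # gradient router: only the input that "won" the max receives the gradient
    return (dout, 0.0) if x > y else (0.0, dout)

def mul_backward(x, y, dout):
    # gradient switcher: each input is scaled by the *other* input's value
    return dout * y, dout * x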

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 52 April 12, 2018
Gradients add at branches
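When a variable feeds into more than one node, the gradients flowing back along each branch are summed. A tiny illustration (my example, not the slide's):

x = 3.0
a = 2.0 * x                  # branch 1
b = x + 5.0                  # branch 2
L = a + b                    # downstream value
# backward: dL/da = 1 and dL/db = 1, so the two contributions to dL/dx add
dx = 1.0 * 2.0 + 1.0 * 1.0   # = 3.0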

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 53 April 12, 2018
Gradients for vectorized code (x, y, z are now vectors)

The “local gradient” of a node f is now the Jacobian matrix: the derivative of each element of z with respect to each element of x.

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 54 April 12, 2018
Vectorized operations

4096-d f(x) = max(0,x) 4096-d


input vector (elementwise) output vector

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 55 April 12, 2018
Vectorized operations

Jacobian matrix

4096-d f(x) = max(0,x) 4096-d


input vector (elementwise) output vector

Q: what is the
size of the
Jacobian matrix?

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 56 April 12, 2018
Vectorized operations

Jacobian matrix

4096-d f(x) = max(0,x) 4096-d


input vector (elementwise) output vector

Q: what is the
size of the
Jacobian matrix?
[4096 x 4096!]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 57 April 12, 2018
Vectorized operations

4096-d f(x) = max(0,x) 4096-d


input vector (elementwise) output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 58 April 12, 2018
Vectorized operations

Jacobian matrix

4096-d f(x) = max(0,x) 4096-d


input vector (elementwise) output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

Q2: what does it look like?
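Since max(0, x) is applied elementwise, each output depends only on the corresponding input, so the Jacobian is diagonal (1 where x > 0, 0 elsewhere) and is never formed explicitly. A minimal numpy sketch of the backward pass (variable names are assumptions):

import numpy as np

def relu_backward(dout, x):
    # multiply by the diagonal Jacobian without materializing it:
    # pass the upstream gradient through where x > 0, zero it elsewhere
    return dout * (x > 0)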

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 59 April 12, 2018
A vectorized example:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 60-71 April 12, 2018
A vectorized example:

Always check: the gradient with respect to a variable should have the same shape as the variable.
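A minimal numpy sketch of this kind of vectorized forward/backward pass, using f(x, W) = ||W·x||² (the sum of squares of q = W·x, the function worked through in this part of the lecture) with illustrative values:

import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

q = W.dot(x)              # forward: q = W x
f = np.sum(q ** 2)        # forward: f = ||q||^2

dq = 2.0 * q              # df/dq
dW = np.outer(dq, x)      # df/dW = 2 q x^T  -- same shape as W
dx = W.T.dot(dq)          # df/dx = 2 W^T q  -- same shape as x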

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 72 April 12, 2018
In discussion section: A matrix example...

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 76 April 13, 2017
Modularized implementation: forward / backward API
Graph (or Net) object (rough pseudo code)
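Roughly, the Graph/Net object keeps its gates in topological order and just loops over them; a sketch along the lines of the slide's pseudo code (names and structure are illustrative):

class Net:
    def __init__(self, gates):
        self.gates = gates                  # gates stored in topologically sorted order
    def forward(self):
        for gate in self.gates:             # run each operation left to right
            gate.forward()
        return self.gates[-1].output        # final output, e.g. the loss
    def backward(self):
        for gate in reversed(self.gates):   # apply the chain rule in reverse order
            gate.backward()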

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 77 April 12, 2018
Modularized implementation: forward / backward API

A single multiply gate: inputs x and y, output z = x * y (x, y, z are scalars). In the backward pass, the local gradient is combined with the upstream gradient variable.
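A minimal sketch of such a gate in this forward/backward style (class and variable names are illustrative, not the slide's exact code):

class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x, self.y = x, y   # cache the inputs; they are needed in backward
        return z
    def backward(self, dz):     # dz is the upstream gradient dL/dz
        dx = self.y * dz        # [local gradient dz/dx = y] x [upstream gradient]
        dy = self.x * dz        # [local gradient dz/dy = x] x [upstream gradient]
        return dx, dy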

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 78 April 12, 2018
Example: Caffe layers

Caffe is licensed under BSD 2-Clause

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 79 April 12, 2018
Caffe Sigmoid Layer

The backward pass multiplies the local sigmoid gradient by top_diff (chain rule).

Caffe is licensed under BSD 2-Clause

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 80 April 12, 2018
In Assignment 1: Writing SVM / Softmax
Stage your forward/backward computation!
E.g. for the SVM: compute the margins as an intermediate stage, then reuse them in the backward pass.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 81 April 12, 2018
Summary so far...
● neural nets will be very large: impractical to write down gradient formula
by hand for all parameters
● backpropagation = recursive application of the chain rule along a
computational graph to compute the gradients of all
inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement
the forward() / backward() API
● forward: compute result of an operation and save any intermediates
needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss
function with respect to the inputs

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 82 April 12, 2018
Next: Neural Networks

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 83 April 12, 2018
Neural networks: without the brain stuff

(Before) Linear score function:

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 84 April 12, 2018
Neural networks: without the brain stuff

(Before) Linear score function:


(Now) 2-layer Neural Network

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 85 April 12, 2018
Neural networks: without the brain stuff
(Before) Linear score function: f = W x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072) --W1--> h (100) --W2--> s (10)
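A tiny numpy sketch of this score function (dimensions as on the slide, weight initialization illustrative):

import numpy as np

x = np.random.randn(3072)               # e.g. a flattened CIFAR-10 image
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1.dot(x))            # hidden layer: elementwise max(0, .)
s = W2.dot(h)                           # 10 class scores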

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 86 April 12, 2018
Neural networks: without the brain stuff

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 88 April 12, 2018
Full implementation of training a 2-layer Neural Network needs ~20 lines:
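A sketch in the spirit of the slide's code: a 2-layer net with a sigmoid nonlinearity and squared-error loss, trained on random data (constants and shapes are assumptions):

import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)        # random training data
w1, w2 = randn(D_in, H), randn(H, D_out)      # random initial weights

for t in range(2000):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))      # forward: hidden layer (sigmoid)
    y_pred = h.dot(w2)                        # forward: predictions
    loss = np.square(y_pred - y).sum()        # squared-error loss

    grad_y_pred = 2.0 * (y_pred - y)          # backprop through the loss
    grad_w2 = h.T.dot(grad_y_pred)            # gradient on w2
    grad_h = grad_y_pred.dot(w2.T)            # gradient on the hidden layer
    grad_w1 = x.T.dot(grad_h * h * (1 - h))   # backprop through the sigmoid
    w1 -= 1e-4 * grad_w1                      # gradient descent step
    w2 -= 1e-4 * grad_w2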

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 89 April 12, 2018
In HW: Writing a 2-layer net

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 90 April 12, 2018
This image by Fotis Bobolas is
licensed under CC-BY 2.0

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 91 April 12, 2018
Figure: a biological neuron. Dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body toward the presynaptic terminals.
This image by Felipe Perucho
is licensed under CC-BY 3.0

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 92 April 12, 2018
In the corresponding artificial neuron, the cell body sums its weighted inputs, Σ_i w_i x_i + b, and the result is passed through an activation function f, here the sigmoid activation function.
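A rough sketch of this neuron model in code (in the spirit of the lecture's example; names are illustrative):

import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.w, self.b = weights, bias
    def forward(self, inputs):
        cell_body_sum = np.sum(inputs * self.w) + self.b    # weighted sum of inputs
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid activation
        return firing_rate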

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 94 April 12, 2018
Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical
system
● Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 96 April 12, 2018
Activation functions
Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
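For reference, most of these can be written as one-line numpy functions (a sketch; Maxout takes multiple sets of weights and is omitted):

import numpy as np

sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh       = np.tanh
relu       = lambda x: np.maximum(0, x)
leaky_relu = lambda x, a=0.01: np.where(x > 0, x, a * x)
elu        = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1))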

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 97 April 12, 2018
Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
“Fully-connected” layers

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 98 April 12, 2018
Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons.
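A sketch along the lines of the slide's code for a 3-layer fully-connected net (random data; weight shapes are illustrative):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))      # activation function (sigmoid here)
x = np.random.randn(3, 1)                   # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)                  # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)                 # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                   # output neuron (1x1)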

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 99 April 12, 2018
Example feed-forward computation of a neural network

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 100 April 12, 2018
Summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it
allows us to use efficient vectorized code (e.g. matrix
multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 101 April 12, 2018
