Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)
Christopher Manning and Richard Socher
1. Introduction
Assignment 2 is all about making sure you really understand the
math of neural networks … then we’ll let the software do it!
NER: Binary classification for center word being location
$$J_t(\theta) = \sigma(s) = \frac{1}{1 + e^{-s}}$$
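A minimal NumPy sketch of this binary classifier, assuming the window model used later in this lecture ($\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$, $\boldsymbol{h} = f(\boldsymbol{z})$, $s = \boldsymbol{u}^\top \boldsymbol{h}$); the array sizes, variable names, and choice of tanh are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative sizes: a window of word vectors concatenated into a 20-d input,
# with an 8-unit hidden layer. These numbers are assumptions for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=20)          # concatenated window of word vectors
W = rng.normal(size=(8, 20))     # hidden-layer weights
b = rng.normal(size=8)           # hidden-layer bias
u = rng.normal(size=8)           # output weights

z = W @ x + b                    # z = Wx + b
h = np.tanh(z)                   # h = f(z), elementwise activation (tanh for illustration)
s = u @ h                        # s = u^T h, a scalar score
J = sigmoid(s)                   # predicted probability that the center word is a location
print(J)
```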
Lecture Plan
Lecture 4: Gradients by hand and algorithmically
1. Introduction (5 mins)
2. Matrix calculus (40 mins)
3. Backpropagation (35 mins)
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “multivariable calculus is just like single-variable calculus if
you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for
intuition; watch last week’s lecture for an example
• Lecture notes and matrix calculus notes cover this
material in more detail
• You might also review Math 51, which has a new online
textbook:
http://web.stanford.edu/class/math51/textbook.html
Gradients
• Given a function with 1 output and 1 input
  $$f(x) = x^3$$
• Its gradient (slope) is its derivative
  $$\frac{df}{dx} = 3x^2$$
“How much will the output change if we change the input a bit?”
Gradients
• Given a function with 1 output and n inputs
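Concretely, for $f(\boldsymbol{x}) = f(x_1, x_2, \dots, x_n)$ the gradient is the vector of partial derivatives with respect to each input:

$$\frac{\partial f}{\partial \boldsymbol{x}} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]$$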
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs
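For $\boldsymbol{f}(\boldsymbol{x}) = [f_1(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)]$, the Jacobian is the $m \times n$ matrix of partial derivatives:

$$\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix},
\qquad
\left( \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}$$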
Chain Rule
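For vector-valued functions, composing functions multiplies Jacobians. A small instance in the notation of the running example used below ($\boldsymbol{h} = f(\boldsymbol{z})$, $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$):

$$\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}$$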
Example Jacobian: Elementwise activation Function
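For an elementwise activation $\boldsymbol{h} = f(\boldsymbol{z})$ with $h_i = f(z_i)$, the Jacobian is diagonal:

$$\left( \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} =
\begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}
\qquad\Rightarrow\qquad
\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} = \operatorname{diag}\!\left( f'(\boldsymbol{z}) \right)$$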
Other Jacobians
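The ones needed below are standard results for this model (the $\boldsymbol{u}^\top \boldsymbol{h}$ case assumes the score layer used later in the lecture):

$$\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}) = \boldsymbol{W}, \qquad
\frac{\partial}{\partial \boldsymbol{b}} (\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}) = \boldsymbol{I}, \qquad
\frac{\partial}{\partial \boldsymbol{u}} (\boldsymbol{u}^\top \boldsymbol{h}) = \boldsymbol{h}^\top$$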
2. Apply the chain rule
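For the running example $s = \boldsymbol{u}^\top \boldsymbol{h}$, $\boldsymbol{h} = f(\boldsymbol{z})$, $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$ (the same model derived element-wise later in this lecture), applying the chain rule to $\frac{\partial s}{\partial \boldsymbol{b}}$ gives:

$$\frac{\partial s}{\partial \boldsymbol{b}} = \frac{\partial s}{\partial \boldsymbol{h}} \, \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}}$$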
3. Write out the Jacobians
Useful Jacobians from previous slide
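Substituting the useful Jacobians into the chain gives (a sketch for the same running example):

$$\frac{\partial s}{\partial \boldsymbol{b}}
= \frac{\partial s}{\partial \boldsymbol{h}} \, \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}}
= \boldsymbol{u}^\top \operatorname{diag}\!\left( f'(\boldsymbol{z}) \right) \boldsymbol{I}
= \boldsymbol{u}^\top \operatorname{diag}\!\left( f'(\boldsymbol{z}) \right)$$

i.e., a row vector whose $i$-th entry is $u_i \, f'(z_i)$.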
Re-using Computation
• 𝛿 is the local error signal
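Written out for the running example, the reusable piece is

$$\boldsymbol{\delta} = \frac{\partial s}{\partial \boldsymbol{z}} = \frac{\partial s}{\partial \boldsymbol{h}} \, \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} = \boldsymbol{u}^\top \operatorname{diag}\!\left( f'(\boldsymbol{z}) \right)$$

and both $\frac{\partial s}{\partial \boldsymbol{b}} = \boldsymbol{\delta}$ and $\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}$ reuse it rather than recomputing it.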
Derivative with respect to Matrix: Output shape
• Following the shape convention, the gradient of the scalar $s$ with respect to $\boldsymbol{W} \in \mathbb{R}^{n \times m}$ has the same shape as $\boldsymbol{W}$
• So $\frac{\partial s}{\partial \boldsymbol{W}}$ is $n$ by $m$
Derivative with respect to Matrix
• Remember $\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}$
• $\boldsymbol{\delta}$ is going to be in our answer
• The other term should be $\boldsymbol{x}$ because $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$
• Answer is: $\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta}^\top \boldsymbol{x}^\top$
Why the Transposes?
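One way to see it is shape bookkeeping: with $\boldsymbol{\delta}$ a $1 \times n$ row vector and $\boldsymbol{x}$ an $m \times 1$ column vector,

$$\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta}^\top \boldsymbol{x}^\top =
\underbrace{\boldsymbol{\delta}^\top}_{n \times 1} \; \underbrace{\boldsymbol{x}^\top}_{1 \times m}
\;\in\; \mathbb{R}^{n \times m},
\qquad
\left( \frac{\partial s}{\partial \boldsymbol{W}} \right)_{ij} = \delta_i \, x_j$$

so the outer product has exactly the shape of $\boldsymbol{W}$, and the $x_j$ factor matches the element-wise result derived on the next slide.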
Deriving local input gradient in backprop
• For this function:
  $$\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta} \, \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} = \boldsymbol{\delta} \, \frac{\partial}{\partial \boldsymbol{W}} \left( \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b} \right)$$
• Let’s consider the derivative of a single weight $W_{ij}$
• $W_{ij}$ only contributes to $z_i$
• For example: $W_{23}$ is only used to compute $z_2$, not $z_1$
[Figure: small network with inputs $x_1, x_2, x_3, +1$, hidden units $h_1 = f(z_1)$, $h_2 = f(z_2)$, weight $W_{23}$, bias $b_2$, and output score $s$ with weights $\boldsymbol{u}$]
$$\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}} \left( \boldsymbol{W}_{i\cdot}\, \boldsymbol{x} + b_i \right) = \frac{\partial}{\partial W_{ij}} \sum_{k=1}^{d} W_{ik} x_k = x_j$$
What shape should derivatives be?
• $\frac{\partial s}{\partial \boldsymbol{b}}$ is a row vector
• But convention says our gradient should be a column vector because $\boldsymbol{b}$ is a column vector…
What shape should derivatives be?
Two options:
1. Use Jacobian form as much as possible, reshape to follow the shape convention at the end:
   • What we just did. But at the end transpose $\frac{\partial s}{\partial \boldsymbol{b}}$ to make the derivative a column vector, resulting in $\boldsymbol{\delta}^\top$
2. Always follow the shape convention:
   • Look at the dimensions to figure out when to transpose and/or reorder terms
Deriving gradients: Tips
• Tip 1: Carefully define your variables and keep track of their
dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:
  $$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{u}} \, \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}}$$
  Keep straight what variables feed into what computations
• Tip 3: For the top softmax part of a model: First consider the derivative wrt $f_c$ when $c = y$ (the correct class), then consider the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes)
• Tip 4: Work out element-wise partial derivatives if you’re getting confused by matrix calculus!
• Tip 5: Use the Shape Convention. Note: The error message 𝜹 that arrives at a hidden layer has the same dimensionality as that hidden layer
3. Backpropagation
Computation Graphs and Backpropagation
[Figure: computation graph with source nodes for the inputs $\boldsymbol{x}, \boldsymbol{W}, \boldsymbol{b}, \boldsymbol{u}$, a + node for $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$, and interior nodes for $\boldsymbol{h} = f(\boldsymbol{z})$ and $s = \boldsymbol{u}^\top \boldsymbol{h}$]
Backpropagation
• Go backwards along the edges, passing gradients along them
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct “downstream gradient”
• Chain rule! downstream gradient = upstream gradient × local gradient
• Multiple inputs → multiple local gradients
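A minimal Python sketch of one such node, a max node with two inputs (the function names and tie-breaking behaviour are illustrative, not from the slides):

```python
def max_forward(x, y):
    return max(x, y)

def max_backward(x, y, upstream):
    # Multiple inputs -> multiple local gradients.
    # The local gradient is 1 for the larger input and 0 for the other,
    # so the upstream gradient is routed to the winning input (ties ignored here).
    dx = upstream * (1.0 if x > y else 0.0)
    dy = upstream * (1.0 if y > x else 0.0)
    return dx, dy

print(max_forward(2.0, 0.0))        # 2.0
print(max_backward(2.0, 0.0, 3.0))  # (3.0, 0.0)
```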
An Example
[Figure: computation graph for $f(x, y, z) = (x + y) \cdot \max(y, z)$ evaluated at $x = 1$, $y = 2$, $z = 0$]
Forward pass: $a = x + y = 3$, $\;b = \max(y, z) = 2$, $\;f = a \cdot b = 6$
Backward pass (upstream * local = downstream):
• At the output: $\frac{\partial f}{\partial f} = 1$
• Through *: $\frac{\partial f}{\partial a} = 1 \cdot b = 1 \cdot 2 = 2$, $\;\frac{\partial f}{\partial b} = 1 \cdot a = 1 \cdot 3 = 3$
• Through +: each input receives $2 \cdot 1 = 2$
• Through max: the larger input ($y = 2$) receives $3 \cdot 1 = 3$, the other ($z = 0$) receives $3 \cdot 0 = 0$
• So the downstream gradients at the inputs are $\frac{\partial f}{\partial x} = 2$, contributions of $2$ and $3$ arriving at $y$, and $\frac{\partial f}{\partial z} = 0$
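A small Python check of the hand-computed values, assuming the reconstructed function above, $f(x, y, z) = (x + y) \cdot \max(y, z)$ at $x = 1$, $y = 2$, $z = 0$ (the function itself is inferred from the figure, so treat this as illustrative):

```python
def f(x, y, z):
    a = x + y          # forward: a = 3
    b = max(y, z)      # forward: b = 2
    return a * b       # forward: f = 6

def grad_f(x, y, z):
    a, b = x + y, max(y, z)
    # upstream gradient at the output is 1
    da, db = b, a                                  # local gradients of the * node
    dx = da * 1.0                                  # + passes the gradient through
    dy = da * 1.0 + db * (1.0 if y > z else 0.0)   # gradients sum at outward branches
    dz = db * (1.0 if z > y else 0.0)              # max routes the gradient to the larger input
    return dx, dy, dz

print(f(1.0, 2.0, 0.0))       # 6.0
print(grad_f(1.0, 2.0, 0.0))  # (2.0, 5.0, 0.0)
```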
Gradients sum at outward branches
• $y$ feeds into both the + node and the max node, so the gradients from the two branches add: $\frac{\partial f}{\partial y} = 2 + 3 = 5$
• In general, if a variable is used in several places, its gradient is the sum of the gradients arriving from each use
Node Intuitions
• + “distributes” the upstream gradient to each of its inputs (here both inputs of + receive 2)
• max “routes” the upstream gradient: the larger input gets it, the other gets 0 (here $y$ gets 3, $z$ gets 0)
• * “switches” the upstream gradient: each input’s gradient is the upstream gradient scaled by the other input’s value (here 2 and 3)
Efficiency: compute all gradients at once
• Correct way:
  • Compute all the gradients at once
  • Analogous to using 𝜹 when we computed gradients by hand
Back-Prop in General Computation Graph
Single scalar output $z$
1. Fprop: visit nodes in topological sort order
   - Compute value of node given predecessors
2. Bprop:
   - Initialize output gradient = 1
   - Visit nodes in reverse order: compute gradient wrt each node using gradient wrt its successors
   - For $\{y_1, y_2, \dots, y_n\}$ = successors of $x$:
     $$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \, \frac{\partial y_i}{\partial x}$$
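A compact Python sketch of this algorithm (class and function names are illustrative, not course code); for brevity, forward values are computed when nodes are built rather than in a separate fprop pass:

```python
class Node:
    """A value in the computation graph; grad accumulates d(output)/d(this node)."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents            # predecessor nodes
        self.local_grads = local_grads    # d(self)/d(parent), aligned with parents
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def backprop(output):
    # Topological order: parents always appear before their successors.
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(output)
    output.grad = 1.0                     # initialize output gradient = 1
    for n in reversed(order):             # visit nodes in reverse order
        for p, local in zip(n.parents, n.local_grads):
            p.grad += n.grad * local      # downstream += upstream * local; sums over successors

# Usage: f(x, y) = (x + y) * y
x, y = Node(1.0), Node(2.0)
f = mul(add(x, y), y)
backprop(f)
print(f.value, x.grad, y.grad)            # 6.0, 2.0, 5.0 (gradients sum at the reused y)
```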
Backprop Implementations
Implementation: forward/backward API
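One possible shape of such an API, sketched in Python (the class name and method signatures are illustrative, not the course’s starter code): each node implements `forward`, which computes its output and caches what it will need, and `backward`, which takes the upstream gradient and returns the downstream gradients.

```python
class MultiplyGate:
    """One node z = x * y exposing a forward/backward API (illustrative sketch)."""

    def forward(self, x, y):
        # Compute the output and cache the inputs for the backward pass.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # dz is the upstream gradient; return the downstream gradients.
        dx = dz * self.y     # local gradient dz/dx = y
        dy = dz * self.x     # local gradient dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)       # 6.0
dx, dy = gate.backward(1.0)      # (2.0, 3.0)
print(z, dx, dy)
```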
Manual Gradient checking: Numeric Gradient
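A common manual check is the two-sided estimate $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$ for small $h$ (around $10^{-4}$). A minimal sketch, with the function name chosen here for illustration:

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Two-sided numeric estimate of df/dx at x, one coordinate at a time (slow but simple)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)
        x.flat[i] = old - h
        fm = f(x)
        x.flat[i] = old           # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Compare against an analytic gradient, e.g. f(x) = sum(x**3) has gradient 3 * x**2
x = np.array([1.0, -2.0, 0.5])
print(numeric_gradient(lambda v: np.sum(v**3), x))  # approximately 3 * x**2
```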