Chap3slides
Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY
∗ Difficult to differentiate!
Recursive Nesting is Ugly!
[Figure: computational graph with weight input w, a node f(w) = w², two branches g(y) = cos(y) and h(z) = sin(z), and an output node K(p, q) = p + q, so that o = cos(w²) + sin(w²).]
$$
\frac{\partial o}{\partial w}
= \frac{\partial K(p,q)}{\partial p} \cdot g'(y) \cdot f'(w)
+ \frac{\partial K(p,q)}{\partial q} \cdot h'(z) \cdot f'(w)
= \underbrace{1}_{\partial K/\partial p} \cdot \underbrace{(-\sin y)}_{g'(y)} \cdot \underbrace{2w}_{f'(w)}
+ \underbrace{1}_{\partial K/\partial q} \cdot \underbrace{\cos z}_{h'(z)} \cdot \underbrace{2w}_{f'(w)}
$$
$$
= -2w \cdot \sin y + 2w \cdot \cos z
= -2w \cdot \sin(w^2) + 2w \cdot \cos(w^2)
$$
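• A quick check of this derivation, as a minimal Python sketch (hypothetical function names; a finite-difference probe, not part of the slides):

```python
import math

def forward(w):
    y = z = w ** 2                   # f(w) = w^2 feeds both branches
    p, q = math.cos(y), math.sin(z)  # p = g(y), q = h(z)
    return p + q                     # o = K(p, q) = p + q

def hand_gradient(w):
    # The chain-rule result from the slide.
    return -2 * w * math.sin(w ** 2) + 2 * w * math.cos(w ** 2)

w, eps = 0.7, 1e-7
numeric = (forward(w + eps) - forward(w)) / eps
print(hand_gradient(w), numeric)     # should agree to ~6 decimal places
```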
– Identify the set P of all paths from the node with variable
w to the output o.
$$
\frac{\partial o}{\partial w}
= \sum_{(j_1, j_2, j_3, j_4, j_5) \in \{1,2\}^5}
\underbrace{h(1, j_1)}_{w} \cdot \underbrace{h(2, j_2)}_{w^2} \cdot \underbrace{h(3, j_3)}_{w^4} \cdot \underbrace{h(4, j_4)}_{w^8} \cdot \underbrace{h(5, j_5)}_{w^{16}}
= \underbrace{\sum w^{31}}_{\text{all 32 paths}}
= 32\,w^{31}
$$
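• A brute-force sketch of this path-aggregation view (hypothetical variable names): sum the product of hidden values along each of the 2⁵ = 32 paths and compare with 32w³¹. Backpropagation reaches the same answer without enumerating paths.

```python
import itertools

w = 1.1
# Hidden values in the example: h(i, j) = w^(2^(i-1)) for both j = 1, 2.
layer_vals = [w ** (2 ** i) for i in range(5)]    # w, w^2, w^4, w^8, w^16

total = 0.0
for path in itertools.product([1, 2], repeat=5):  # all 32 choices of (j1..j5)
    prod = 1.0
    for i in range(5):                            # both units of a layer have
        prod *= layer_vals[i]                     # the same value, so every
    total += prod                                 # path contributes w^31

print(total, 32 * w ** 31)                        # both equal d/dw of o = w^32
```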
[Figure: breaking up a neuron. A node computing h = Φ(W · X) is split into a linear pre-activation value a_h = W · X followed by the post-activation value h = Φ(a_h).]
• Backpropagation updates:
$$g_i = J^T g_{i+1} \qquad (11)$$
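• A minimal NumPy sketch of this update for the two layer types (shapes and names are illustrative): for a linear layer a = Wx the Jacobian is W, so the update is g = Wᵀ g_next; for an elementwise activation the Jacobian is diagonal, so the update is an elementwise product.

```python
import numpy as np

def linear_backward(W, g_next):
    # Forward: a = W @ x; the Jacobian of a w.r.t. x is W, so g = W^T g_next.
    return W.T @ g_next

def activation_backward(a, g_next, phi_prime):
    # Forward: h = phi(a) elementwise; the Jacobian is diagonal, so
    # J^T g_next reduces to an elementwise product.
    return phi_prime(a) * g_next

W = np.random.randn(3, 4)
x = np.random.randn(4)
a = W @ x
g_h = np.ones(3)                      # gradient arriving from the next layer
g_a = activation_backward(a, g_h, lambda t: 1.0 - np.tanh(t) ** 2)  # tanh'
g_x = linear_backward(W, g_a)
print(g_x.shape)                      # (4,)
```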
Effect on Linear Layer and Activation Functions
• Estimate of derivative:
$$\frac{\partial L(w)}{\partial w} \approx \frac{L(w + \epsilon) - L(w)}{\epsilon} \qquad (12)$$
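• Equation (12) gives a cheap gradient check. A sketch (hypothetical loss function; one-sided differences as in the slide):

```python
import numpy as np

def numeric_gradient(loss, w, eps=1e-6):
    # Perturb one coordinate at a time and apply Equation (12).
    g = np.zeros_like(w)
    for i in range(w.size):
        w_pert = w.copy()
        w_pert.flat[i] += eps
        g.flat[i] = (loss(w_pert) - loss(w)) / eps
    return g

# Example: L(w) = ||w||^2 has exact gradient 2w.
w = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v ** 2), w))   # ~ [2, -4, 6]
```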
• New runs may also be started after killing threads (if needed).
[Figure: gradient-descent paths on two quadratic bowls, plotted over x and y in [−40, 40]. (a) Loss function is a circular bowl, L = x² + y²; (b) loss function is an elliptical bowl, L = x² + 4y².]
[Figure: a chain network x → h₁ → h₂ → ⋯ → h_{m−1} → o with weights w₁, w₂, …, w_m.]
• Unless each multiplicative factor along the chain is exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).
[Figure: loss surface plotted against parameters x and y.]
• For negative inputs, the leaky ReLU can still propagate some
gradient backwards.
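• A sketch of the leaky ReLU and its derivative (NumPy; the slope α = 0.01 is a common choice, not prescribed by the slides):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): negative inputs keep a small slope alpha.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Unlike plain ReLU (gradient 0 for x < 0), the slope alpha lets
    # some gradient propagate backwards for negative inputs.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.3, 4.0])
print(leaky_relu(x))        # [-0.02 -0.005 0.3 4.]
print(leaky_relu_grad(x))   # [0.01 0.01 1. 1.]
```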
[Figure: effect of momentum on a loss curve with a flat region and a local optimum. (a) With momentum, the path from the starting point keeps moving where GD slows down in the flat region and escapes where GD gets trapped in the local optimum; (b) without momentum, GD stalls before reaching the optimum.]
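• A minimal sketch of the momentum update behind the figure (standard momentum; the friction parameter β and learning rate are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # The velocity v accumulates exponentially decayed past gradients,
    # which keeps the step moving through flat regions and can carry it
    # past shallow local optima.
    v = beta * v - lr * grad(w)
    return w + v, v

# Example on the elliptical bowl L = x^2 + 4y^2 with gradient (2x, 8y).
w, v = np.array([30.0, 15.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, lambda p: np.array([2 * p[0], 8 * p[1]]))
print(w)   # approaches the optimum at (0, 0)
```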
• Use $\sqrt{A_i} + \epsilon$ in the denominator to avoid ill-conditioning.
AdaGrad Intuition
• Scaling the derivative inversely with $\sqrt{A_i}$ encourages faster relative movements along gently sloping directions.
• Use $\sqrt{A_i} + \epsilon$ to avoid ill-conditioning.
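• A sketch of the resulting AdaGrad update (NumPy; learning rate and starting point are illustrative):

```python
import numpy as np

def adagrad_step(w, A, grad, lr=0.5, eps=1e-8):
    g = grad(w)
    A = A + g ** 2                               # A_i aggregates squared partials
    return w - lr * g / (np.sqrt(A) + eps), A    # sqrt(A_i) + eps in denominator

# On L = x^2 + 4y^2 the steep y-direction is damped relative to the
# gently sloping x-direction, encouraging faster relative movement along x.
w, A = np.array([3.0, 1.5]), np.zeros(2)
for _ in range(500):
    w, A = adagrad_step(w, A, lambda p: np.array([2 * p[0], 8 * p[1]]))
print(w)   # approaches (0, 0)
```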
RMSProp with Nesterov Momentum
[Figure: loss surface plotted against parameters x and y.]
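• A sketch combining the two ideas named in the title: the gradient is evaluated at the Nesterov look-ahead point w + βv, and an RMSProp running average of squared gradients rescales the step (parameter values are illustrative):

```python
import numpy as np

def rmsprop_nesterov_step(w, v, A, grad, lr=0.01, beta=0.9, rho=0.9, eps=1e-8):
    g = grad(w + beta * v)             # Nesterov: gradient at the look-ahead point
    A = rho * A + (1 - rho) * g ** 2   # RMSProp: decayed average of squared gradients
    v = beta * v - lr * g / (np.sqrt(A) + eps)
    return w + v, v, A

# Example on L = x^2 + 4y^2 with gradient (2x, 8y).
w, v, A = np.array([3.0, 1.5]), np.zeros(2), np.zeros(2)
for _ in range(300):
    w, v, A = rmsprop_nesterov_step(w, v, A, lambda p: np.array([2 * p[0], 8 * p[1]]))
print(w)   # settles near the optimum at (0, 0)
```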
[Figure: gradient-descent paths on two quadratic bowls, plotted over x and y in [−40, 40]. (a) Loss function is a circular bowl, L = x² + y²; (b) loss function is an elliptical bowl, L = x² + 4y².]
[Figure: 3D surface f(x, y) with the direction of least curvature marked.]
• Solutions:
[Figure: saddle points. (Left) a 1D function f(x) on [−1, 1]; (right) a 3D surface g(x, y) with a saddle point.]
Batch Normalization
[Figure: a chain network x → h₁ → h₂ → ⋯ → h_{m−1} → o with weights w₁, w₂, …, w_m.]
• Unless each multiplicative factor along the chain is exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).
• One can view the input to each layer as a data set of hidden activations that keeps shifting during training (internal covariate shift).
[Figure: two ways to add a batch-normalization (BN) node. (a) Post-activation normalization: BN follows the activation Φ; (b) pre-activation normalization: the neuron is broken up and BN is applied to the pre-activation values a_i before Φ.]
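• A minimal forward-pass sketch of the BN node for pre-activation normalization (NumPy; rows are the examples of one mini-batch; γ and β are the learnable rescaling parameters):

```python
import numpy as np

def batch_norm_forward(a, gamma, beta, eps=1e-5):
    # Standardize each unit's pre-activation over the mini-batch,
    # then rescale and shift with the learnable gamma and beta.
    mu = a.mean(axis=0)
    var = a.var(axis=0)
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta

# Batch of 4 examples, 3 units; gamma = 1, beta = 0 gives plain standardization.
a = np.random.randn(4, 3) * 5.0 + 2.0
out = batch_norm_forward(a, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))   # ~0 and ~1 per unit
```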