Lec-4-Opt and BP
Backpropagation
[Figures: "On CIFAR-10" and "On ImageNet". Credit: Li/Karpathy/Johnson]
Neural Network
• Linear score function 𝒇 = 𝑾𝒙
[Figure: a simple neural network with an input layer, a hidden layer, and an output layer. Credit: Li/Karpathy/Johnson]
[Figure: a deeper network with an input layer, hidden layers 1–3, and an output layer. The number of neurons per layer is the width, the number of layers is the depth.]
Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: tanh(x)
• Leaky ReLU: max(0.1x, x)
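These three activations can be written down directly; a minimal NumPy sketch (the function names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x), with the 0.1 slope used on the slide
    return np.maximum(slope * x, x)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x), tanh(x), leaky_relu(x), sep="\n")
```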
• Regression loss
  – L1 loss:  L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖_1
  – MSE loss: L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖_2²
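A quick sketch of the two regression losses, assuming y and ŷ are arrays of n scalar targets and predictions (the names l1_loss/mse_loss are ours):

```python
import numpy as np

def l1_loss(y, y_hat):
    # (1/n) * sum_i |y_i - y_hat_i|
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    # (1/n) * sum_i (y_i - y_hat_i)^2
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.0])
print(l1_loss(y, y_hat), mse_loss(y, y_hat))   # 0.333..., 0.1666...
```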
• A neural network can be seen as a compute graph:
  – It is directional
  – It is organized in 'layers'
Backprop: Backward Pass

f(x, y, z) = (x + y) ⋅ z,   with x = 1, y = −3, z = 4

Forward pass:
  – sum node:  d = x + y = −2
  – mult node: f = d ⋅ z = −8

Local derivatives:
  – d = x + y  ⟹  ∂d/∂x = 1,  ∂d/∂y = 1
  – f = d ⋅ z  ⟹  ∂f/∂d = z,  ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?

Backward pass (from the output back to the inputs):
  – ∂f/∂f = 1
  – ∂f/∂z = d = −2
  – ∂f/∂d = z = 4
  – Chain rule: ∂f/∂y = (∂f/∂d) ⋅ (∂d/∂y) = 4 ⋅ 1 = 4
  – Chain rule: ∂f/∂x = (∂f/∂d) ⋅ (∂d/∂x) = 4 ⋅ 1 = 4
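The same forward and backward pass in a few lines of Python; a minimal sketch that just mirrors the node-by-node computation above (the variable names are ours):

```python
# f(x, y, z) = (x + y) * z, with the lecture values x = 1, y = -3, z = 4
x, y, z = 1.0, -3.0, 4.0

# Forward pass
d = x + y            # sum node:  d = -2
f = d * z            # mult node: f = -8

# Backward pass (chain rule, starting from df/df = 1)
df_df = 1.0
df_dd = z * df_df    # mult node, local gradient w.r.t. d is z  ->  4
df_dz = d * df_df    # mult node, local gradient w.r.t. z is d  -> -2
df_dx = 1.0 * df_dd  # sum node passes the gradient through     ->  4
df_dy = 1.0 * df_dd  #                                              4

print(f, df_dx, df_dy, df_dz)   # -8.0 4.0 4.0 -2.0
```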
Compute Graphs -> Neural Networks
• 𝑥𝑘 input variables
• 𝑤𝑙,𝑚,𝑛 network weights (note 3 indices)
– 𝑙 which layer
– 𝑚 which neuron in layer
– 𝑛 which weight in neuron
• ŷ_i computed output (i-th output dimension, i = 1, …, n_out)
• 𝑦𝑖 ground truth targets
• 𝐿 loss function
[Compute graph: the inputs x0, x1 are multiplied by the weights w0, w1 (the unknowns!) and summed to give ŷ0; subtracting the target y0 and squaring gives the L2 loss/cost.]

[The same graph with a ReLU activation max(0, ∙) inserted before the output (not arguing this is the right choice here).]

[With several outputs ŷ0, ŷ1, …, each output has its own weights (w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}, …) and its own L2 loss term.]

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights W.

L = Σ_i L_i : sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: L_i = (ŷ_i − y_i)²

[Figure: input layer x0, …, b; output layer ŷ0, ŷ1, … with targets y0, y1, ….]
NNs as Computational Graphs
1
• 𝑓 𝒘, 𝒙 =
1+𝑒 − 𝑏+𝑤0 𝑥0 +𝑤1 𝑥1
Forward pass, with w0 = −3, x0 = −2, w1 = 2, x1 = −1, b = −3:
  – w0 ⋅ x0 = 6,  w1 ⋅ x1 = −2
  – sum: 6 + (−2) = 4
  – add the bias: 4 + (−3) = 1
  – ⋅(−1): −1
  – exp(∙): 0.37
  – +1: 1.37
  – 1/(∙): 0.73

Local derivatives of the node functions:
  – g(x) = 1/x    ⟹  ∂g/∂x = −1/x²
  – g_α(x) = α + x  ⟹  ∂g/∂x = 1
  – g(x) = e^x    ⟹  ∂g/∂x = e^x
  – g_α(x) = α ⋅ x  ⟹  ∂g/∂x = α

Backward pass, multiplying the incoming gradient with the local derivative at each node:
  – 1/(∙) node:  1 ⋅ (−1/1.37²) = −0.53
  – +1 node:     −0.53 ⋅ 1 = −0.53
  – exp(∙) node: −0.53 ⋅ e^(−1) = −0.2
  – ⋅(−1) node:  −0.2 ⋅ (−1) = 0.2
  – the sum nodes pass the gradient 0.2 on unchanged, so ∂f/∂b = 0.2
  – mult nodes:  ∂f/∂w0 = x0 ⋅ 0.2 = −0.4,  ∂f/∂x0 = w0 ⋅ 0.2 = −0.6,
                 ∂f/∂w1 = x1 ⋅ 0.2 = −0.2,  ∂f/∂x1 = w1 ⋅ 0.2 = 0.4
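Re-computing the example numerically; a minimal sketch that walks through the same nodes (the intermediate names p0, p1, s, … are ours):

```python
import math

# f(w, x) = 1 / (1 + exp(-(b + w0*x0 + w1*x1))), values from the slide
w0, x0, w1, x1, b = -3.0, -2.0, 2.0, -1.0, -3.0

# Forward pass (one value per node)
p0 = w0 * x0          #  6
p1 = w1 * x1          # -2
s  = p0 + p1 + b      #  1
n  = -s               # -1
e  = math.exp(n)      #  0.37
q  = 1.0 + e          #  1.37
f  = 1.0 / q          #  0.73

# Backward pass (multiply by the local derivative, right to left)
df_dq = -1.0 / q**2           # -0.53
df_de = df_dq * 1.0           # -0.53
df_dn = df_de * math.exp(n)   # -0.2
df_ds = df_dn * (-1.0)        #  0.2
df_db = df_ds                 #  0.2
df_dw0 = df_ds * x0           # -0.4
df_dx0 = df_ds * w0           # -0.6
df_dw1 = df_ds * x1           # -0.2
df_dx1 = df_ds * w1           #  0.4
print(f, df_dw0, df_dx0, df_dw1, df_dx1, df_db)
```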
[Figure: gradient descent on a loss curve — initialization, optimum, learning rate.]
Gradient Descent for Neural Networks
For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

∇_W f_{x,y}(W) = [ ∂f/∂w_{0,0,0} , … , ∂f/∂w_{l,m,n} ]ᵀ

[Figure: network with input layer, hidden layers 1–3 (l layers, m neurons per layer), and output layer.]

Gradient step:
W′ = W − α ∇_W f_{x,y}(W)
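A minimal sketch of this gradient step, assuming the gradient for each weight matrix has already been produced by a backward pass (here filled with random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = {"W1": rng.normal(size=(4, 3)), "W2": rng.normal(size=(2, 4))}
grads   = {k: rng.normal(size=v.shape) for k, v in weights.items()}  # placeholder gradients

alpha = 0.01   # learning rate
for name in weights:
    # W' = W - alpha * grad, applied to every weight matrix
    weights[name] -= alpha * grads[name]
```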
Neurons

[Figure: a single compute node f with inputs x, y and output activation z = f(x, y). During backprop, the upstream gradient ∂L/∂z is multiplied with the local gradients ∂z/∂x and ∂z/∂y.]
Per-sample L2 loss: L_i = (ŷ_i − y_i)²  ⟹  ∂L_i/∂ŷ_i = 2(ŷ_i − y_i)

Hidden activations: h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k})

Output-layer weights (ReLU-style output):
∂ŷ_i/∂w_{1,i,j} = h_j if the pre-activation is > 0, else 0

Hidden-layer pre-activations s^1_j (sigmoid activation h_j):
∂L/∂s^1_j = Σ_{i=1}^{n_out} (∂L/∂s_i)(∂s_i/∂h_j)(∂h_j/∂s^1_j) = Σ_{i=1}^{n_out} (ŷ_i − y_i) w_{ji} h_j(1 − h_j)

Hidden-layer weights:
∂L/∂w^1_{kj} = (∂L/∂s^1_j)(∂s^1_j/∂w^1_{kj}) = Σ_{i=1}^{n_out} (ŷ_i − y_i) w_{ji} h_j(1 − h_j) x_k
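To make these formulas concrete, here is a minimal NumPy sketch of one backward pass for a network with a sigmoid hidden layer, a linear output layer, and the loss L = Σ_i (ŷ_i − y_i)²; the layer sizes, the names W1/W2, and the numerical gradient check are our own additions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
x  = rng.normal(size=n_in)
y  = rng.normal(size=n_out)
W1 = rng.normal(size=(n_hidden, n_in))    # w^1_{kj}: input k -> hidden j
W2 = rng.normal(size=(n_out, n_hidden))   # w_{ji}:   hidden j -> output i

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Forward pass
s1 = W1 @ x          # hidden pre-activations s^1_j
h  = sigmoid(s1)     # hidden activations h_j
y_hat = W2 @ h       # linear output layer
L = np.sum((y_hat - y) ** 2)

# Backward pass, following the chain rule above
dL_dyhat = 2 * (y_hat - y)                    # dL/dyhat_i = 2(yhat_i - y_i)
dL_ds1 = (W2.T @ dL_dyhat) * h * (1 - h)      # sum_i dL/dyhat_i * w_ji * h_j(1-h_j); slide omits the constant 2
dL_dW1 = np.outer(dL_ds1, x)                  # dL/dw^1_{kj} = dL/ds^1_j * x_k
dL_dW2 = np.outer(dL_dyhat, h)                # dL/dw_{ji}   = dL/dyhat_i * h_j

# Numerical check of one hidden-layer weight gradient
eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps
L_pert = np.sum((W2 @ sigmoid(W1_pert @ x) - y) ** 2)
print(dL_dW1[0, 0], (L_pert - L) / eps)       # should agree to ~1e-5
```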
• Function we want to optimize:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2²
• Backprop accumulates the derivative step by step as a product of local factors:
  2 ⋅ 23,  then 2 ⋅ 23 ⋅ 2,  then 2 ⋅ 23 ⋅ 2 ⋅ 1,  then 2 ⋅ 23 ⋅ 2 ⋅ 1 ⋅ 1
Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
• Regularization techniques (add a regularization term to the loss function):
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout
  – Early stopping
  – ...
  More details later.
• L2 regularization: R(θ) = Σ_i θ_i²
  – θ1 = (0, 0.75, 0):       0 + 0.75² + 0 = 0.5625
  – θ2 = (0.25, 0.5, 0.25):  0.25² + 0.5² + 0.25² = 0.375   ⟵ minimization prefers θ2
• L1 regularization: R(θ) = Σ_i |θ_i|
  – θ1 = (0, 0.75, 0):       0 + 0.75 + 0 = 0.75            ⟵ minimization prefers θ1
  – θ2 = (0.25, 0.5, 0.25):  0.25 + 0.5 + 0.25 = 1
• θ1 = (0, 0.75, 0): L1 regularization enforces sparsity
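A quick numerical check of the two comparisons above (a throwaway sketch, nothing lecture-specific):

```python
import numpy as np

theta_1 = np.array([0.0, 0.75, 0.0])
theta_2 = np.array([0.25, 0.5, 0.25])

l2 = lambda t: np.sum(t ** 2)      # R(theta) = sum_i theta_i^2
l1 = lambda t: np.sum(np.abs(t))   # R(theta) = sum_i |theta_i|

print(l2(theta_1), l2(theta_2))    # 0.5625  0.375 -> L2 prefers theta_2
print(l1(theta_1), l1(theta_2))    # 0.75    1.0   -> L1 prefers the sparse theta_1
```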
Intuition, for input features such as "furry", "has two eyes", "has a tail", "has paws":
• L1 regularization will focus all the attention on a few key features.
• L2 regularization will take all information into account to make decisions.
Regularization for Neural Networks

[Compute graph: x is multiplied by w1, passed through max(0, ∙), multiplied by w2, and compared to y with an L2 loss; in parallel, the regularizer R(w1, w2) is scaled by λ and added, giving the total loss L.]

Combining nodes: network output + L2 loss + regularization:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2² + λ R(w1, w2)

With L2 regularization, R(w1, w2) = w1² + w2²:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2² + λ (w1² + w2²)
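A minimal sketch of this regularized objective for scalar w1, w2, assuming x and y are 1-D arrays of training samples (the function name and the data are ours):

```python
import numpy as np

def regularized_loss(w1, w2, x, y, lam):
    # network output for every sample: w2 * max(0, w1 * x_i)
    y_hat = w2 * np.maximum(0.0, w1 * x)
    # L2 data term plus lambda * (w1^2 + w2^2) regularization
    return np.sum((y_hat - y) ** 2) + lam * (w1**2 + w2**2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(regularized_loss(1.0, 2.0, x, y, lam=0.1))   # 0.5: data term is zero, only the penalty remains
```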
Regularization

[Figure: decision boundaries for λ = 0, 0.00001, 0.001, 1, 10 — increasing λ increases the training error but can give a lower validation error.]
• Next lecture
– Optimization of Neural Networks
– In particular, introduction to SGD (our main method!)