Lec-4-Opt and BP
Backpropagation
[Figures: "On CIFAR-10" and "On ImageNet". Credit: Li/Karpathy/Johnson]
Neural Network
• Linear score function 𝒇 = 𝑾𝒙
[Figure: a simple neural network with an input layer, a hidden layer, and an output layer. Credit: Li/Karpathy/Johnson]
[Figure: a deeper network with an input layer, hidden layers 1–3, and an output layer. The number of neurons per layer is the width, the number of layers is the depth.]
Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: tanh(x)
• Leaky ReLU: max(0.1x, x)
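These three activations can be written down directly; a minimal NumPy sketch (the function names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x), with the 0.1 slope used on the slide
    return np.maximum(slope * x, x)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x), tanh(x), leaky_relu(x), sep="\n")
```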
• Regression loss
  – L1 loss:  L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖_1
  – MSE loss: L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖_2²
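A quick sketch of the two regression losses, assuming y and ŷ are arrays of n scalar targets and predictions (the names l1_loss/mse_loss are ours):

```python
import numpy as np

def l1_loss(y, y_hat):
    # (1/n) * sum_i |y_i - y_hat_i|
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    # (1/n) * sum_i (y_i - y_hat_i)^2
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.0])
print(l1_loss(y, y_hat), mse_loss(y, y_hat))   # 0.333..., 0.1666...
```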
• A neural network can be seen as a compute graph:
  – It is directional
  – It is organized in 'layers'
Backprop: Backward Pass

f(x, y, z) = (x + y) ⋅ z,   with x = 1, y = −3, z = 4

Forward pass:
  – sum node:  d = x + y = −2
  – mult node: f = d ⋅ z = −8

Local derivatives:
  – d = x + y  ⟹  ∂d/∂x = 1,  ∂d/∂y = 1
  – f = d ⋅ z  ⟹  ∂f/∂d = z,  ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?

Backward pass (from the output back to the inputs):
  – ∂f/∂f = 1
  – ∂f/∂z = d = −2
  – ∂f/∂d = z = 4
  – Chain rule: ∂f/∂y = (∂f/∂d) ⋅ (∂d/∂y) = 4 ⋅ 1 = 4
  – Chain rule: ∂f/∂x = (∂f/∂d) ⋅ (∂d/∂x) = 4 ⋅ 1 = 4
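The same forward and backward pass in a few lines of Python; a minimal sketch that just mirrors the node-by-node computation above (the variable names are ours):

```python
# f(x, y, z) = (x + y) * z, with the lecture values x = 1, y = -3, z = 4
x, y, z = 1.0, -3.0, 4.0

# Forward pass
d = x + y            # sum node:  d = -2
f = d * z            # mult node: f = -8

# Backward pass (chain rule, starting from df/df = 1)
df_df = 1.0
df_dd = z * df_df    # mult node, local gradient w.r.t. d is z  ->  4
df_dz = d * df_df    # mult node, local gradient w.r.t. z is d  -> -2
df_dx = 1.0 * df_dd  # sum node passes the gradient through     ->  4
df_dy = 1.0 * df_dd  #                                              4

print(f, df_dx, df_dy, df_dz)   # -8.0 4.0 4.0 -2.0
```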
Compute Graphs -> Neural Networks
• 𝑥𝑘 input variables
• 𝑤𝑙,𝑚,𝑛 network weights (note 3 indices)
– 𝑙 which layer
– 𝑚 which neuron in layer
– 𝑛 which weight in neuron
• ŷ_i computed output (i-th output dimension, i = 1, …, n_out)
• 𝑦𝑖 ground truth targets
• 𝐿 loss function
[Compute graph: the inputs x0, x1 are multiplied by the weights w0, w1 (the unknowns!) and summed to give ŷ0; subtracting the target y0 and squaring gives the L2 loss/cost.]

[The same graph with a ReLU activation max(0, ∙) inserted before the output (not arguing this is the right choice here).]

[With several outputs ŷ0, ŷ1, …, each output has its own weights (w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}, …) and its own L2 loss term.]

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights W.

L = Σ_i L_i : sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: L_i = (ŷ_i − y_i)²

[Figure: input layer x0, …, b; output layer ŷ0, ŷ1, … with targets y0, y1, ….]
NNs as Computational Graphs
1
• 𝑓 𝒘, 𝒙 =
1+𝑒 − 𝑏+𝑤0 𝑥0 +𝑤1 𝑥1
Forward pass, with w0 = −3, x0 = −2, w1 = 2, x1 = −1, b = −3:
  – w0 ⋅ x0 = 6,  w1 ⋅ x1 = −2
  – sum: 6 + (−2) = 4
  – add the bias: 4 + (−3) = 1
  – ⋅(−1): −1
  – exp(∙): 0.37
  – +1: 1.37
  – 1/(∙): 0.73

Local derivatives of the node functions:
  – g(x) = 1/x    ⟹  ∂g/∂x = −1/x²
  – g_α(x) = α + x  ⟹  ∂g/∂x = 1
  – g(x) = e^x    ⟹  ∂g/∂x = e^x
  – g_α(x) = α ⋅ x  ⟹  ∂g/∂x = α

Backward pass, multiplying the incoming gradient with the local derivative at each node:
  – 1/(∙) node:  1 ⋅ (−1/1.37²) = −0.53
  – +1 node:     −0.53 ⋅ 1 = −0.53
  – exp(∙) node: −0.53 ⋅ e^(−1) = −0.2
  – ⋅(−1) node:  −0.2 ⋅ (−1) = 0.2
  – the sum nodes pass the gradient 0.2 on unchanged, so ∂f/∂b = 0.2
  – mult nodes:  ∂f/∂w0 = x0 ⋅ 0.2 = −0.4,  ∂f/∂x0 = w0 ⋅ 0.2 = −0.6,
                 ∂f/∂w1 = x1 ⋅ 0.2 = −0.2,  ∂f/∂x1 = w1 ⋅ 0.2 = 0.4
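Re-computing the example numerically; a minimal sketch that walks through the same nodes (the intermediate names p0, p1, s, … are ours):

```python
import math

# f(w, x) = 1 / (1 + exp(-(b + w0*x0 + w1*x1))), values from the slide
w0, x0, w1, x1, b = -3.0, -2.0, 2.0, -1.0, -3.0

# Forward pass (one value per node)
p0 = w0 * x0          #  6
p1 = w1 * x1          # -2
s  = p0 + p1 + b      #  1
n  = -s               # -1
e  = math.exp(n)      #  0.37
q  = 1.0 + e          #  1.37
f  = 1.0 / q          #  0.73

# Backward pass (multiply by the local derivative, right to left)
df_dq = -1.0 / q**2           # -0.53
df_de = df_dq * 1.0           # -0.53
df_dn = df_de * math.exp(n)   # -0.2
df_ds = df_dn * (-1.0)        #  0.2
df_db = df_ds                 #  0.2
df_dw0 = df_ds * x0           # -0.4
df_dx0 = df_ds * w0           # -0.6
df_dw1 = df_ds * x1           # -0.2
df_dx1 = df_ds * w1           #  0.4
print(f, df_dw0, df_dx0, df_dw1, df_dx1, df_db)
```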
[Figure: gradient descent on a loss curve — initialization, optimum, learning rate.]
Gradient Descent for Neural Networks
For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

∇_W f_{x,y}(W) = [ ∂f/∂w_{0,0,0} , … , ∂f/∂w_{l,m,n} ]ᵀ

[Figure: network with input layer, hidden layers 1–3 (l layers, m neurons per layer), and output layer.]

Gradient step:
W′ = W − α ∇_W f_{x,y}(W)
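A minimal sketch of this gradient step, assuming the gradient for each weight matrix has already been produced by a backward pass (here filled with random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = {"W1": rng.normal(size=(4, 3)), "W2": rng.normal(size=(2, 4))}
grads   = {k: rng.normal(size=v.shape) for k, v in weights.items()}  # placeholder gradients

alpha = 0.01   # learning rate
for name in weights:
    # W' = W - alpha * grad, applied to every weight matrix
    weights[name] -= alpha * grads[name]
```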
Neurons

[Figure: a single compute node f with inputs x, y and output activation z = f(x, y). During backprop, the upstream gradient ∂L/∂z is multiplied with the local gradients ∂z/∂x and ∂z/∂y.]
Per-sample L2 loss: L_i = (ŷ_i − y_i)²  ⟹  ∂L_i/∂ŷ_i = 2(ŷ_i − y_i)

Hidden activations: h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k})

Output-layer weights (ReLU-style output):
∂ŷ_i/∂w_{1,i,j} = h_j if the pre-activation is > 0, else 0

Hidden-layer pre-activations s^1_j (sigmoid activation h_j):
∂L/∂s^1_j = Σ_{i=1}^{n_out} (∂L/∂s_i)(∂s_i/∂h_j)(∂h_j/∂s^1_j) = Σ_{i=1}^{n_out} (ŷ_i − y_i) w_{ji} h_j(1 − h_j)

Hidden-layer weights:
∂L/∂w^1_{kj} = (∂L/∂s^1_j)(∂s^1_j/∂w^1_{kj}) = Σ_{i=1}^{n_out} (ŷ_i − y_i) w_{ji} h_j(1 − h_j) x_k
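To make these formulas concrete, here is a minimal NumPy sketch of one backward pass for a network with a sigmoid hidden layer, a linear output layer, and the loss L = Σ_i (ŷ_i − y_i)²; the layer sizes, the names W1/W2, and the numerical gradient check are our own additions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
x  = rng.normal(size=n_in)
y  = rng.normal(size=n_out)
W1 = rng.normal(size=(n_hidden, n_in))    # w^1_{kj}: input k -> hidden j
W2 = rng.normal(size=(n_out, n_hidden))   # w_{ji}:   hidden j -> output i

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Forward pass
s1 = W1 @ x          # hidden pre-activations s^1_j
h  = sigmoid(s1)     # hidden activations h_j
y_hat = W2 @ h       # linear output layer
L = np.sum((y_hat - y) ** 2)

# Backward pass, following the chain rule above
dL_dyhat = 2 * (y_hat - y)                    # dL/dyhat_i = 2(yhat_i - y_i)
dL_ds1 = (W2.T @ dL_dyhat) * h * (1 - h)      # sum_i dL/dyhat_i * w_ji * h_j(1-h_j); slide omits the constant 2
dL_dW1 = np.outer(dL_ds1, x)                  # dL/dw^1_{kj} = dL/ds^1_j * x_k
dL_dW2 = np.outer(dL_dyhat, h)                # dL/dw_{ji}   = dL/dyhat_i * h_j

# Numerical check of one hidden-layer weight gradient
eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps
L_pert = np.sum((W2 @ sigmoid(W1_pert @ x) - y) ** 2)
print(dL_dW1[0, 0], (L_pert - L) / eps)       # should agree to ~1e-5
```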
• Function we want to optimize:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2²
• Backprop accumulates the derivative step by step as a product of local factors:
  2 ⋅ 23,  then 2 ⋅ 23 ⋅ 2,  then 2 ⋅ 23 ⋅ 2 ⋅ 1,  then 2 ⋅ 23 ⋅ 2 ⋅ 1 ⋅ 1
Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
• Regularization techniques (add a regularization term to the loss function):
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout
  – Early stopping
  – ...
  More details later.
• L2 regularization: R(θ) = Σ_i θ_i²
  – θ1 = (0, 0.75, 0):       0 + 0.75² + 0 = 0.5625
  – θ2 = (0.25, 0.5, 0.25):  0.25² + 0.5² + 0.25² = 0.375   ⟵ minimization prefers θ2
• L1 regularization: R(θ) = Σ_i |θ_i|
  – θ1 = (0, 0.75, 0):       0 + 0.75 + 0 = 0.75            ⟵ minimization prefers θ1
  – θ2 = (0.25, 0.5, 0.25):  0.25 + 0.5 + 0.25 = 1
• θ1 = (0, 0.75, 0): L1 regularization enforces sparsity
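A quick numerical check of the two comparisons above (a throwaway sketch, nothing lecture-specific):

```python
import numpy as np

theta_1 = np.array([0.0, 0.75, 0.0])
theta_2 = np.array([0.25, 0.5, 0.25])

l2 = lambda t: np.sum(t ** 2)      # R(theta) = sum_i theta_i^2
l1 = lambda t: np.sum(np.abs(t))   # R(theta) = sum_i |theta_i|

print(l2(theta_1), l2(theta_2))    # 0.5625  0.375 -> L2 prefers theta_2
print(l1(theta_1), l1(theta_2))    # 0.75    1.0   -> L1 prefers the sparse theta_1
```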
Intuition, for input features such as "furry", "has two eyes", "has a tail", "has paws":
• L1 regularization will focus all the attention on a few key features.
• L2 regularization will take all information into account to make decisions.
Regularization for Neural Networks

[Compute graph: x is multiplied by w1, passed through max(0, ∙), multiplied by w2, and compared to y with an L2 loss; in parallel, the regularizer R(w1, w2) is scaled by λ and added, giving the total loss L.]

Combining nodes: network output + L2 loss + regularization:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2² + λ R(w1, w2)

With L2 regularization, R(w1, w2) = w1² + w2²:
  Σ_{i=1}^{n} ‖w2 max(0, w1 x_i) − y_i‖_2² + λ (w1² + w2²)
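A minimal sketch of this regularized objective for scalar w1, w2, assuming x and y are 1-D arrays of training samples (the function name and the data are ours):

```python
import numpy as np

def regularized_loss(w1, w2, x, y, lam):
    # network output for every sample: w2 * max(0, w1 * x_i)
    y_hat = w2 * np.maximum(0.0, w1 * x)
    # L2 data term plus lambda * (w1^2 + w2^2) regularization
    return np.sum((y_hat - y) ** 2) + lam * (w1**2 + w2**2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(regularized_loss(1.0, 2.0, x, y, lam=0.1))   # 0.5: data term is zero, only the penalty remains
```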
Regularization

[Figure: decision boundaries for λ = 0, 0.00001, 0.001, 1, 10 — increasing λ increases the training error but can give a lower validation error.]
• Next lecture
– Optimization of Neural Networks
– In particular, introduction to SGD (our main method!)