18 DL Regularization
Sergey Nikolenko
gradient descent
computational graph, fprop and bprop
• This way we can take the gradient with the chain rule:
$$\nabla_{\mathbf{x}} (f \circ g) = \begin{pmatrix} \frac{\partial f \circ g}{\partial x_1} \\ \vdots \\ \frac{\partial f \circ g}{\partial x_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial f}{\partial g} \frac{\partial g}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial g} \frac{\partial g}{\partial x_n} \end{pmatrix} = \frac{\partial f}{\partial g} \nabla_{\mathbf{x}} g.$$
• If $f$ depends on $x$ through several intermediate functions $g_1, \ldots, g_k$, the contributions add up:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \ldots + \frac{\partial f}{\partial g_k} \frac{\partial g_k}{\partial x} = \sum_{i=1}^{k} \frac{\partial f}{\partial g_i} \frac{\partial g_i}{\partial x},$$
$$\nabla_{\mathbf{x}} f = \frac{\partial f}{\partial g_1} \nabla_{\mathbf{x}} g_1 + \ldots + \frac{\partial f}{\partial g_k} \nabla_{\mathbf{x}} g_k = \sum_{i=1}^{k} \frac{\partial f}{\partial g_i} \nabla_{\mathbf{x}} g_i.$$
• In matrix form:
$$\nabla_{\mathbf{x}} f = \nabla_{\mathbf{x}} \mathbf{g} \, \nabla_{\mathbf{g}} f, \quad \text{where} \quad \nabla_{\mathbf{x}} \mathbf{g} = \begin{pmatrix} \frac{\partial g_1}{\partial x_1} & \ldots & \frac{\partial g_k}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial g_1}{\partial x_n} & \ldots & \frac{\partial g_k}{\partial x_n} \end{pmatrix}.$$
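For a concrete check of the matrix form, take $n = k = 2$ with $g_1(\mathbf{x}) = x_1 + x_2$, $g_2(\mathbf{x}) = x_1 x_2$, and $f(\mathbf{g}) = g_1 g_2$ (an illustrative choice, not from the slides):
$$\nabla_{\mathbf{x}} \mathbf{g} = \begin{pmatrix} 1 & x_2 \\ 1 & x_1 \end{pmatrix}, \qquad \nabla_{\mathbf{g}} f = \begin{pmatrix} g_2 \\ g_1 \end{pmatrix} = \begin{pmatrix} x_1 x_2 \\ x_1 + x_2 \end{pmatrix},$$
$$\nabla_{\mathbf{x}} f = \nabla_{\mathbf{x}} \mathbf{g} \, \nabla_{\mathbf{g}} f = \begin{pmatrix} x_1 x_2 + x_2 (x_1 + x_2) \\ x_1 x_2 + x_1 (x_1 + x_2) \end{pmatrix} = \begin{pmatrix} 2 x_1 x_2 + x_2^2 \\ 2 x_1 x_2 + x_1^2 \end{pmatrix},$$
which agrees with differentiating $f = x_1^2 x_2 + x_1 x_2^2$ directly.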
• Forward propagation: we compute $\frac{\partial f}{\partial x}$ by the chain rule.
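A minimal sketch of a forward and a backward pass (fprop and bprop) on a tiny computational graph; the two-layer network, its sizes, and the squared-error loss are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny graph: x -> h = tanh(W1 x + b1) -> y = W2 h + b2, loss = 0.5 * ||y - t||^2
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, t = rng.normal(size=3), rng.normal(size=2)

# Forward pass (fprop): compute and cache every node of the graph.
z1 = W1 @ x + b1
h = np.tanh(z1)
y = W2 @ h + b2
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass (bprop): apply the chain rule from the loss back to the inputs.
dy = y - t                            # dL/dy
dW2, db2 = np.outer(dy, h), dy        # dL/dW2, dL/db2
dh = W2.T @ dy                        # dL/dh
dz1 = dh * (1 - np.tanh(z1) ** 2)     # tanh'(z) = 1 - tanh(z)^2
dW1, db1 = np.outer(dz1, x), dz1      # dL/dW1, dL/db1
dx = W1.T @ dz1                       # dL/dx

print(loss, dW1.shape, dW2.shape, dx.shape)
```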
regularization in neural networks
weight initialization
• A single linear unit computes
$$y = \mathbf{w}^\top \mathbf{x} + b = \sum_i w_i x_i + b.$$
• If we initialize
$$w_i \sim U \left[ -\frac{1}{\sqrt{n_{\text{out}}}}, \frac{1}{\sqrt{n_{\text{out}}}} \right],$$
the variance is
$$n_{\text{out}} \mathrm{Var}\left[w_i\right] = \frac{1}{3},$$
and after a few layers the signal dies down; the same happens in backprop.
• Xavier initialization helps with this, but it only works for symmetric activations, i.e., not for ReLU...
• In general,
$$\mathrm{Var}\left[w_i x_i\right] = \mathbb{E}\left[x_i\right]^2 \mathrm{Var}\left[w_i\right] + \mathbb{E}\left[w_i\right]^2 \mathrm{Var}\left[x_i\right] + \mathrm{Var}\left[w_i\right] \mathrm{Var}\left[x_i\right];$$
for zero-mean weights $\mathbb{E}\left[w_i\right] = 0$, so
$$\mathrm{Var}\left[w_i x_i\right] = \mathbb{E}\left[x_i\right]^2 \mathrm{Var}\left[w_i\right] + \mathrm{Var}\left[w_i\right] \mathrm{Var}\left[x_i\right] = \mathrm{Var}\left[w_i\right] \mathbb{E}\left[x_i^2\right],$$
and therefore
$$\mathrm{Var}\left[y^{(l)}\right] = n^{(l)}_{\text{in}} \mathrm{Var}\left[w^{(l)}\right] \mathbb{E}\left[\left(x^{(l)}\right)^2\right].$$
• And this leads to the variance for ReLU init; there is no $n_{\text{out}}$ now:
$$\mathrm{Var}\left[w^{(l)}_i\right] = 2 / n^{(l)}_{\text{in}}, \qquad w^{(l)}_i \sim \mathcal{N}\left(0, \sqrt{2 / n^{(l)}_{\text{in}}}\right).$$
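A minimal NumPy sketch (layer width, depth, and batch size are hypothetical) that pushes a random signal through a stack of ReLU layers and tracks $\mathbb{E}[(x^{(l)})^2]$: with the uniform initialization above the signal dies down, while the ReLU initialization keeps it roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 20                     # hypothetical layer width and depth
x0 = rng.normal(size=(n, 1000))        # a batch of unit-variance inputs

def mean_square_after(init):
    x = x0
    for _ in range(depth):
        W = init(n_in=n, n_out=n)
        x = np.maximum(0.0, W @ x)     # ReLU layers, biases omitted for simplicity
    return (x ** 2).mean()

# Uniform init from the slides: U[-1/sqrt(n_out), 1/sqrt(n_out)], so Var[w] = 1/(3 n_out)
uniform = lambda n_in, n_out: rng.uniform(-1 / np.sqrt(n_out), 1 / np.sqrt(n_out), size=(n_out, n_in))
# ReLU init: N(0, sqrt(2/n_in)), so Var[w] = 2/n_in
relu = lambda n_in, n_out: rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

print("uniform init:", mean_square_after(uniform))  # collapses toward 0
print("ReLU init:   ", mean_square_after(relu))     # stays on the order of 1
```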
batch normalization
• If we simply center the output $u + b$ of a layer, a gradient update of the bias by $\Delta b$ does not change the result:
$$u + b + \Delta b - \mathbb{E}\left[u + b + \Delta b\right] = u + b - \mathbb{E}\left[u + b\right].$$
• So in batch normalization we normalize with respect to the whole dataset $\mathcal{X}$:
$$\hat{\mathbf{x}} = \mathrm{Norm}(\mathbf{x}, \mathcal{X}).$$
• In practice, each component is normalized separately,
$$\hat{x}_k = \frac{x_k - \mathbb{E}\left[x_k\right]}{\sqrt{\mathrm{Var}\left[x_k\right]}},$$
and then scaled and shifted:
$$y_k = \gamma_k \hat{x}_k + \beta_k = \gamma_k \frac{x_k - \mathbb{E}\left[x_k\right]}{\sqrt{\mathrm{Var}\left[x_k\right]}} + \beta_k.$$
• 𝛾𝑘 and 𝛽𝑘 are new variables and will be trained just like the
weights.
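A minimal sketch of this transform at training time, computing the statistics over a mini-batch; the feature dimension, batch size, and the small epsilon added for numerical stability are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized x_k (x-hat)
    return gamma * x_hat + beta               # y_k = gamma_k * x_hat_k + beta_k

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 10))  # a mini-batch with shifted, scaled features
gamma, beta = np.ones(10), np.zeros(10)            # trained jointly with the weights
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```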
variations of gradient descent
momentum
• Gradient descent: $\theta_{t+1} = \theta_t - \eta \nabla_\theta E(\theta_t)$.
• But this does not take the shape of $E$ into account; it's better to be adaptive.
• Newton's method uses second-order information: $\theta_{t+1} = \theta_t - H^{-1}(E(\theta_t)) \nabla_\theta E(\theta_t)$.
• This is usually much faster, and there's nothing to tune (no $\eta$).
• But we need to compute the Hessian $H(E(\theta))$, and this is infeasible.
• Interesting problem: can we make Newton's method work for deep learning?
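A toy sketch (the quadratic objective, step size, and step counts are illustrative, not from the slides) comparing plain gradient descent with a single Newton step: the Newton step needs no $\eta$, but it requires solving a system with the Hessian, which does not scale to millions of parameters.

```python
import numpy as np

# E(theta) = 0.5 * theta^T H theta, an ill-conditioned quadratic with Hessian H
H = np.array([[1.0, 0.0],
              [0.0, 100.0]])
grad_E = lambda theta: H @ theta

# Plain gradient descent: theta <- theta - eta * grad E(theta)
eta = 0.009                        # has to be tuned: larger values diverge along the steep direction
theta_gd = np.array([1.0, 1.0])
for _ in range(200):
    theta_gd = theta_gd - eta * grad_E(theta_gd)

# One Newton step: theta <- theta - H^{-1} grad E(theta); no learning rate at all
theta0 = np.array([1.0, 1.0])
theta_newton = theta0 - np.linalg.solve(H, grad_E(theta0))

print("GD, 200 steps:   ", theta_gd)       # still far from 0 along the flat direction
print("Newton, one step:", theta_newton)   # exactly the minimum (0, 0)

# For d parameters the Hessian has d^2 entries (and a solve costs about d^3 operations),
# which is why this is infeasible for deep networks.
```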
adaptive methods
• Adaptive update:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i},$$
where $G_t$ is a diagonal matrix with $G_{t,ii} = G_{t-1,ii} + g_{t,i}^2$ that accumulates the squared gradients over the learning history.
• So the learning rate always goes down, but at different rates for different $\theta_i$.
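A minimal sketch of this update (this is the Adagrad rule; the $\eta$, $\epsilon$, and the toy objective are illustrative choices):

```python
import numpy as np

def adagrad_step(theta, g, G, eta=0.1, eps=1e-8):
    """One adaptive step: accumulate squared gradients (the diagonal of G_t), scale each coordinate."""
    G = G + g ** 2                                  # G_{t,ii} = G_{t-1,ii} + g_{t,i}^2
    theta = theta - eta / np.sqrt(G + eps) * g      # per-coordinate effective learning rate
    return theta, G

# Toy quadratic with very different curvatures in the two coordinates
grad_E = lambda theta: np.array([1.0, 100.0]) * theta

theta, G = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, G = adagrad_step(theta, grad_E(theta), G)
print(theta)   # both coordinates shrink, each with its own decaying step size
```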
thank you!