18 DL Regularization

deep learning

Sergey Nikolenko

Harbour Space University, Barcelona, Spain


April 13, 2017
gradient descent and computational graphs
gradient descent

• Gradient descent: take the gradient of the error w.r.t. the weights and move in the opposite direction (downhill).
• Formally: for an error function 𝐸, targets 𝑦, and a model 𝑓 with parameters 𝜃,

  𝐸(𝜃) = ∑_{(x,𝑦)∈𝐷} 𝐸(𝑓(x, 𝜃), 𝑦),

  𝜃_𝑡 = 𝜃_{𝑡−1} − 𝜂 ∇𝐸(𝜃_{𝑡−1}) = 𝜃_{𝑡−1} − 𝜂 ∑_{(x,𝑦)∈𝐷} ∇𝐸(𝑓(x, 𝜃_{𝑡−1}), 𝑦).

• So we need to sum over the entire dataset for every step?!..

gradient descent

• Hence, stochastic gradient descent: update after every training sample,

  𝜃_𝑡 = 𝜃_{𝑡−1} − 𝜂 ∇𝐸(𝑓(x_𝑡, 𝜃_{𝑡−1}), 𝑦_𝑡).

• In practice people usually use mini-batches: they are easy to parallelize and smooth out excessive “stochasticity”.
• So far the only parameter is the learning rate 𝜂.
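For illustration, a minimal mini-batch SGD loop might look like the following sketch (numpy; grad_E is an assumed helper that returns the gradient of the error summed over a batch):

import numpy as np

def sgd(theta, X, Y, grad_E, lr=0.1, batch_size=32, epochs=10):
    # grad_E(theta, X_batch, Y_batch) -> gradient of the error on that batch
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad_E(theta, X[idx], Y[idx])
    return theta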

gradient descent

• There are lots of problems with 𝜂.
• We will get to them later; for now let's concentrate on the step that is required in any case: computing the derivatives.

computational graph, fprop and bprop

• Let us represent a function as a composition of simple functions (“simple” means that we can take derivatives).
• Example – 𝑓(𝑥, 𝑦) = 𝑥² + 𝑥𝑦 + (𝑥 + 𝑦)²:

computational graph, fprop and bprop

• This way we can take the gradient with the chain rule:

  (𝑓 ∘ 𝑔)′(𝑥) = (𝑓(𝑔(𝑥)))′ = 𝑓′(𝑔(𝑥)) 𝑔′(𝑥).

• This simply means that an increment 𝛿𝑥 results in

  𝛿𝑓 = 𝑓′(𝑔(𝑥)) 𝛿𝑔 = 𝑓′(𝑔(𝑥)) 𝑔′(𝑥) 𝛿𝑥.

• We only need to be able to take gradients, i.e., derivatives w.r.t. vectors:

  ∇_x 𝑓 = (∂𝑓/∂𝑥_1, …, ∂𝑓/∂𝑥_𝑛)⊤,

  ∇_x (𝑓 ∘ 𝑔) = (∂(𝑓∘𝑔)/∂𝑥_1, …, ∂(𝑓∘𝑔)/∂𝑥_𝑛)⊤ = ((∂𝑓/∂𝑔)(∂𝑔/∂𝑥_1), …, (∂𝑓/∂𝑔)(∂𝑔/∂𝑥_𝑛))⊤ = (∂𝑓/∂𝑔) ∇_x 𝑔.

computational graph, fprop and bprop

• Or, if 𝑓 depends on 𝑥 in several different ways, 𝑓 = 𝑓(𝑔_1(𝑥), 𝑔_2(𝑥), …, 𝑔_𝑘(𝑥)), the increment 𝛿𝑥 now comes into play several times:

  ∂𝑓/∂𝑥 = (∂𝑓/∂𝑔_1)(∂𝑔_1/∂𝑥) + … + (∂𝑓/∂𝑔_𝑘)(∂𝑔_𝑘/∂𝑥) = ∑_{𝑖=1}^{𝑘} (∂𝑓/∂𝑔_𝑖)(∂𝑔_𝑖/∂𝑥),

  ∇_x 𝑓 = (∂𝑓/∂𝑔_1) ∇_x 𝑔_1 + … + (∂𝑓/∂𝑔_𝑘) ∇_x 𝑔_𝑘 = ∑_{𝑖=1}^{𝑘} (∂𝑓/∂𝑔_𝑖) ∇_x 𝑔_𝑖.

• Note that we got matrix multiplication with the Jacobi matrix:

  ∇_x 𝑓 = ∇_x g ∇_g 𝑓,  where  ∇_x g is the 𝑛 × 𝑘 matrix with entries (∇_x g)_{𝑖𝑗} = ∂𝑔_𝑗/∂𝑥_𝑖.

computational graph, fprop and bprop

• Let’s now go back to the example:

computational graph, fprop and bprop

• Forward propagation: we compute ∂𝑓/∂𝑥 by the chain rule.

computational graph, fprop and bprop

• Backpropagation: starting from the end node, go back as

  ∂𝑓/∂𝑔 = ∑_{𝑔′ ∈ Children(𝑔)} (∂𝑓/∂𝑔′)(∂𝑔′/∂𝑔).

computational graph, fprop and bprop

• Backprop is much better: we get all the derivatives in a single pass through the graph.
• Aaaand... that's it! We can now take the gradients of any complicated composition of simple functions.
• Which is all we need to apply gradient descent!
• The libraries – theano, TensorFlow – are actually automatic differentiation libraries. This is their main function.
• So you can implement lots of “classical” models in TensorFlow and train them by gradient descent.
• And real (biological) neurons can't do that, because you need two different “algorithms” to compute the value and the derivative.
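To make this concrete, here is a small sketch (plain Python, illustrative names only) of fprop and bprop written out by hand for the example above, 𝑓(𝑥, 𝑦) = 𝑥² + 𝑥𝑦 + (𝑥 + 𝑦)²; an automatic differentiation library does exactly this, but for arbitrary graphs:

def f_and_grads(x, y):
    # forward pass through the graph, node by node
    a = x * x          # a = x^2
    b = x * y          # b = xy
    s = x + y          # s = x + y
    c = s * s          # c = (x + y)^2
    f = a + b + c
    # backward pass: start from df/df = 1 and apply the chain rule
    da, db, dc = 1.0, 1.0, 1.0
    ds = 2 * s * dc                 # through c = s^2
    dx = 2 * x * da + y * db + ds   # x enters through a, b, and s
    dy = x * db + ds                # y enters through b and s
    return f, dx, dy

print(f_and_grads(2.0, 3.0))  # f = 35, df/dx = 2x + y + 2(x+y) = 17, df/dy = x + 2(x+y) = 12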

regularization
in neural networks
regularization in neural networks

• NNs have lots of parameters.
• Regularization is necessary.
• 𝐿2 or 𝐿1 regularization (𝜆 ∑_𝑤 𝑤² or 𝜆 ∑_𝑤 |𝑤|) is called weight decay.
• Very easy to add: just another term in the objective function (see the sketch below).
• Sometimes still useful.
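A sketch of how 𝐿2 weight decay enters the objective and its gradient (assuming the unregularized loss and gradient are already computed; the names here are illustrative):

import numpy as np

def l2_regularized(loss, grad, w, lam=1e-4):
    # add the weight decay term lambda * sum(w^2) and its gradient 2 * lambda * w
    return loss + lam * np.sum(w ** 2), grad + 2 * lam * w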

regularization in neural networks

• But there are better ways.
• Dropout: drop units at random, keeping each unit with probability 𝑝!

regularization in neural networks

• To apply, simply multiply the result by 1/𝑝 (preserving the average output); you can usually take 𝑝 = 1/2.
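A sketch of (inverted) dropout for one layer's activations h, with 𝑝 the keep probability; dividing by 𝑝 during training preserves the expected output, so nothing has to change at test time:

import numpy as np

def dropout(h, p=0.5, train=True):
    if not train:
        return h                              # no rescaling needed at test time
    mask = (np.random.rand(*h.shape) < p)     # keep each unit with probability p
    return h * mask / p                       # rescale so the expected activation is unchanged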

regularization in neural networks

• Dropout improved everything drastically. What the... why does it work?
• Idea 1: we are making the units learn features by themselves, without relying on the others.
• Idea 2: we are kind of averaging a huge number of networks with shared weights, training each for one step. Like bootstrapping taken to the extreme.
• Idea 3: this is just like sex in evolution: features have to remain useful in many random combinations with other features, not only in one carefully co-adapted configuration.
• Idea 4: dropout is a special kind of prior (this has led to proper dropout in recurrent NNs).

weight initialization
weight initialization

• The deep learning revolution began with unsupervised pretraining.
• Main idea: get to a good region of the search space, then fine-tune with gradient descent.
• It turns out that by now we don't need unsupervised pretraining with complex models like RBMs to get to a good region.
• Weight initialization is an important part of why.

weight initialization

• Xavier initialization (Glorot, Bengio, 2010).
• Let's consider a single linear unit:

  𝑦 = w⊤x + 𝑏 = ∑_𝑖 𝑤_𝑖 𝑥_𝑖 + 𝑏.

• The variance is

  Var[𝑦_𝑖] = Var[𝑤_𝑖 𝑥_𝑖] = 𝔼[𝑤_𝑖² 𝑥_𝑖²] − (𝔼[𝑤_𝑖 𝑥_𝑖])² = 𝔼[𝑥_𝑖]² Var[𝑤_𝑖] + 𝔼[𝑤_𝑖]² Var[𝑥_𝑖] + Var[𝑤_𝑖] Var[𝑥_𝑖].

weight initialization

• The variance is

  Var[𝑦_𝑖] = Var[𝑤_𝑖 𝑥_𝑖] = 𝔼[𝑥_𝑖]² Var[𝑤_𝑖] + 𝔼[𝑤_𝑖]² Var[𝑥_𝑖] + Var[𝑤_𝑖] Var[𝑥_𝑖].

• For symmetric activation functions and zero mean of the weights,

  Var[𝑦_𝑖] = Var[𝑤_𝑖] Var[𝑥_𝑖].

• And if the 𝑤_𝑖 and 𝑥_𝑖 are initialized independently from the same distribution,

  Var[𝑦] = Var[∑_{𝑖=1}^{𝑛out} 𝑦_𝑖] = ∑_{𝑖=1}^{𝑛out} Var[𝑤_𝑖 𝑥_𝑖] = 𝑛out Var[𝑤_𝑖] Var[𝑥_𝑖].

• In other words, the output variance is proportional to the input variance with coefficient 𝑛out Var[𝑤_𝑖].
weight initialization

• Before (Glorot, Bengio, 2010), the standard way to initialize was (it's all over older literature)

  𝑤_𝑖 ∼ 𝑈[−1/√𝑛out, 1/√𝑛out].

• So in this case we get

  Var[𝑤_𝑖] = (1/12)(1/√𝑛out + 1/√𝑛out)² = 1/(3𝑛out),  so  𝑛out Var[𝑤_𝑖] = 1/3,

  and after a few layers the signal dies down; the same happens in backprop.

weight initialization

• Xavier initialization tries to reduce the change in variance, so we take

  Var[𝑤_𝑖] = 2/(𝑛in + 𝑛out),

  which for a uniform distribution means

  𝑤_𝑖 ∼ 𝑈[−√6/√(𝑛in + 𝑛out), √6/√(𝑛in + 𝑛out)].

• But it only works for symmetric activations, i.e., not for ReLU...
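A sketch of Xavier initialization for a fully connected layer with fan-in n_in and fan-out n_out:

import numpy as np

def xavier_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))    # gives Var[w] = 2 / (n_in + n_out)
    return np.random.uniform(-limit, limit, size=(n_in, n_out))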

weight initialization

• ...until (He et al., 2015)! Let's go back to

  Var[𝑤_𝑖 𝑥_𝑖] = 𝔼[𝑥_𝑖]² Var[𝑤_𝑖] + 𝔼[𝑤_𝑖]² Var[𝑥_𝑖] + Var[𝑤_𝑖] Var[𝑥_𝑖].

• We now can only make the second term zero:

  Var[𝑤_𝑖 𝑥_𝑖] = 𝔼[𝑥_𝑖]² Var[𝑤_𝑖] + Var[𝑤_𝑖] Var[𝑥_𝑖] = Var[𝑤_𝑖] 𝔼[𝑥_𝑖²],  so

  Var[𝑦^(𝑙)] = 𝑛in^(𝑙) Var[𝑤^(𝑙)] 𝔼[(𝑥^(𝑙))²].
weight initialization

• We now can only make the second term zero:

  Var[𝑦^(𝑙)] = 𝑛in^(𝑙) Var[𝑤^(𝑙)] 𝔼[(𝑥^(𝑙))²].

• Suppose now that 𝑥^(𝑙) = max(0, 𝑦^(𝑙−1)), and 𝑦^(𝑙−1) has a symmetric distribution around zero. Then

  𝔼[(𝑥^(𝑙))²] = (1/2) Var[𝑦^(𝑙−1)],  so  Var[𝑦^(𝑙)] = (𝑛in^(𝑙)/2) Var[𝑤^(𝑙)] Var[𝑦^(𝑙−1)].

• And this leads to the variance for ReLU init; there is no 𝑛out now:

  Var[𝑤_𝑖^(𝑙)] = 2/𝑛in^(𝑙).

• You don't have to make it uniform, btw; e.g., a normal distribution is fine:

  𝑤_𝑖^(𝑙) ∼ 𝒩(0, √(2/𝑛in^(𝑙))).
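And a corresponding sketch of the ReLU (He et al., 2015) initialization:

import numpy as np

def he_normal(n_in, n_out):
    std = np.sqrt(2.0 / n_in)                # Var[w] = 2 / n_in
    return np.random.normal(0.0, std, size=(n_in, n_out))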
batch normalization
batch normalization

• An important problem in deep neural networks: internal covariate shift.
• When we change the weights of a layer, the distribution of its outputs changes.
• This means that the next layer has to re-train almost from scratch: it did not expect these outputs.
• Moreover, these neurons might have already reached saturation, so they can't re-train quickly.
• This seriously impedes learning.

batch normalization

• A characteristic example; note how different the distributions are.
• What can we do?


batch normalization

• We could try to normalize (whiten) after every layer.
• Does not work: consider a layer that simply adds a bias 𝑏 to its inputs 𝑢:

  x̂ = x − 𝔼[x],  where  x = 𝑢 + 𝑏.

• On the next gradient descent step, we'll have 𝑏 ∶= 𝑏 + Δ𝑏...
• ...but x̂ will not change:

  𝑢 + 𝑏 + Δ𝑏 − 𝔼[𝑢 + 𝑏 + Δ𝑏] = 𝑢 + 𝑏 − 𝔼[𝑢 + 𝑏].

• So the biases will simply increase unboundedly, and that's all the training we'll get; not a good thing.

batch normalization

• We can try to add normalization as a layer:

  x̂ = Norm(x, 𝒳).

• But note that the entire dataset 𝒳 is required here.
• So on the gradient descent step we'll need to compute ∂Norm/∂x and ∂Norm/∂𝒳, and also the covariance matrix

  Cov[x] = 𝔼_{x∈𝒳}[xx⊤] − 𝔼[x] 𝔼[x]⊤.

• Definitely won't work.

batch normalization

• The solution is to normalize each component separately, and not over the whole dataset but over the current mini-batch; hence batch normalization.
• After batch normalization we get

  𝑥̂_𝑘 = (𝑥_𝑘 − 𝔼[𝑥_𝑘]) / √Var[𝑥_𝑘],

  where the statistics are computed over the current mini-batch.
• However, one more problem: now nonlinearities disappear!
• E.g., we will almost always get into the region where 𝜎 is very close to linear.

batch normalization

• To fix this, we have to allow the batchnorm layer enough flexibility to sometimes do nothing with the inputs.
• So we introduce additional shift and scale parameters:

  𝑦_𝑘 = 𝛾_𝑘 𝑥̂_𝑘 + 𝛽_𝑘 = 𝛾_𝑘 (𝑥_𝑘 − 𝔼[𝑥_𝑘]) / √Var[𝑥_𝑘] + 𝛽_𝑘.

• 𝛾_𝑘 and 𝛽_𝑘 are new variables and will be trained just like the weights.
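A sketch of the batchnorm transformation at training time for a mini-batch X of shape (batch, features); gamma and beta are the learned scale and shift, and at test time the batch statistics would be replaced by running averages collected during training:

import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                    # per-feature mean over the mini-batch
    var = X.var(axis=0)                    # per-feature variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalize each component separately
    return gamma * X_hat + beta            # learned scale and shift restore flexibility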

batch normalization

• Last remark: it matters where to put the batchnorm.


• You can put it either before or after the nonlinearity.

variations of gradient descent
momentum

• Gradient descent:

  𝜃_𝑡 = 𝜃_{𝑡−1} − 𝜂 ∇𝐸(x_𝑡, 𝜃_{𝑡−1}, 𝑦_𝑡).

• It all depends on the learning rate 𝜂.
• First idea – let's make it decrease over time (see the sketch after this list):
  • linear decay: 𝜂 = 𝜂_0 (1 − 𝑡/𝑇);
  • exponential decay: 𝜂 = 𝜂_0 𝑒^{−𝑡/𝑇}.
• But this does not take 𝐸 into account; it's better to be adaptive.
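A sketch of these two schedules:

import math

def linear_decay(eta0, t, T):
    return eta0 * (1.0 - t / T)        # reaches 0 at step T

def exponential_decay(eta0, t, T):
    return eta0 * math.exp(-t / T)     # smooth decrease with time constant T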

momentum

• Momentum methods: let's keep part of the speed, like a real material point would.
• With the inertia we now have

  𝑢_𝑡 = 𝛾𝑢_{𝑡−1} + 𝜂 ∇_𝜃 𝐸(𝜃),
  𝜃 = 𝜃 − 𝑢_𝑡.

• So we now preserve 𝛾𝑢_{𝑡−1}.

momentum

• But we already know that we will move by 𝛾𝑢_{𝑡−1} anyway!
• Why don't we compute the gradients right there, halfway?
• Nesterov's momentum:

  𝑢_𝑡 = 𝛾𝑢_{𝑡−1} + 𝜂 ∇_𝜃 𝐸(𝜃 − 𝛾𝑢_{𝑡−1}).

• Can we do even better?..
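A sketch of both updates (grad is an assumed function that returns ∇_𝜃 𝐸 at a given point):

def momentum_step(theta, u, grad, eta=0.01, gamma=0.9):
    # classical momentum: keep a fraction gamma of the previous velocity
    u = gamma * u + eta * grad(theta)
    return theta - u, u

def nesterov_step(theta, u, grad, eta=0.01, gamma=0.9):
    # Nesterov: evaluate the gradient at the look-ahead point theta - gamma * u
    u = gamma * u + eta * grad(theta - gamma * u)
    return theta - u, u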

momentum

• ...well, yeah, we can try second-order methods.
• Newton's method:

  𝐸(𝜃) ≈ 𝐸(𝜃_0) + ∇_𝜃 𝐸(𝜃_0)(𝜃 − 𝜃_0) + (1/2)(𝜃 − 𝜃_0)⊤ 𝐻(𝐸(𝜃_0))(𝜃 − 𝜃_0).

• This is usually much faster, and there's nothing to tune (no 𝜂).
• But we need to compute the Hessian 𝐻(𝐸(𝜃)), and this is infeasible.
• Interesting problem: can we make Newton's method work for deep learning?

adaptive methods

• But we can still do better!
• Note that so far the learning rate was the same in all directions.
• Idea: the rate of change should be higher for parameters that do not change much over the input samples, and lower for highly variable parameters.
• Denoting 𝑔_{𝑡,𝑖} = ∇_{𝜃_𝑖} 𝐿(𝜃), we get

  𝜃_{𝑡+1,𝑖} = 𝜃_{𝑡,𝑖} − (𝜂/√(𝐺_{𝑡,𝑖𝑖} + 𝜖)) ⋅ 𝑔_{𝑡,𝑖},

  where 𝐺_𝑡 is a diagonal matrix with 𝐺_{𝑡,𝑖𝑖} = 𝐺_{𝑡−1,𝑖𝑖} + 𝑔²_{𝑡,𝑖} that accumulates the total gradient value over the learning history.
• So the learning rate always goes down, but at different rates for different 𝜃_𝑖 (this is the Adagrad update; see the sketch below).
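A sketch of this per-parameter update, with G the elementwise accumulator of squared gradients:

import numpy as np

def adagrad_step(theta, G, g, eta=0.01, eps=1e-8):
    G = G + g ** 2                           # accumulate squared gradients over the whole history
    return theta - eta * g / np.sqrt(G + eps), G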

adaptive methods

• One problem: 𝐺 keeps increasing, and the learning rate sometimes decreases too rapidly.
• Adadelta – same idea, but the gradient history is computed with decay:

  𝐺_{𝑡,𝑖𝑖} = 𝜌𝐺_{𝑡−1,𝑖𝑖} + (1 − 𝜌)𝑔²_{𝑡,𝑖}.

• The rest is the same:

  𝑢_𝑡 = −(𝜂/√(𝐺_{𝑡−1} + 𝜖)) g_{𝑡−1}.
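A sketch of the decayed-accumulator update as written above; with the learning rate 𝜂 kept explicit like this it is essentially the RMSprop rule (full Adadelta additionally replaces 𝜂 with a running estimate of past update magnitudes):

import numpy as np

def decayed_adagrad_step(theta, G, g, eta=0.001, rho=0.9, eps=1e-8):
    G = rho * G + (1.0 - rho) * g ** 2       # decayed history of squared gradients instead of the full sum
    return theta - eta * g / np.sqrt(G + eps), G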

thank you!

Thank you for your attention!

