Lecture 03
Deep Learning 1
Recap Lecture 2
▶ Backpropagation and gradient descent
Characterizing the error function
▶ The problem of local minima
▶ The importance of initialization
▶ The problem of poor conditioning
▶ Characterizing conditioning with the Hessian
Improving the conditioning
▶ Data normalization & choice of non-linearities
▶ Scaling initial weights, batch normalization, skip connections
Part 1 Recap Lecture 2
Recap: How to Learn in a Neural Network
Observation:
A neural network is a function of both its inputs and parameters.
[Figure: graph view and function view of the network y = f(x; θ), with inputs x = (x1, x2, x3), outputs y = (a8, a9), and parameters θ = ((wij)ij, (bj)j).]
Recap: How to Learn in a Neural Network
function view
Define an error function:

E(θ) = ∑n (f(xn; θ) − tn)²
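As a concrete toy illustration, the error function above can be evaluated directly; the one-parameter model f below is a hypothetical stand-in for the network:

```python
import numpy as np

# Hypothetical one-parameter model f(x; theta) = theta * x, for illustration only.
def f(x, theta):
    return theta * x

# Sum-of-squares error over the training set: E(theta) = sum_n (f(x_n; theta) - t_n)^2
def E(theta, xs, ts):
    return float(np.sum((f(xs, theta) - ts) ** 2))

xs = np.array([1.0, 2.0, 3.0])
ts = np.array([2.0, 4.0, 6.0])   # targets generated by theta = 2

assert E(2.0, xs, ts) == 0.0     # perfect fit gives zero error
assert E(0.0, xs, ts) == 56.0    # 4 + 16 + 36
```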
Part 2 Characterizing the Error Function
Characterizing the Error Function: One Layer
Characterizing the Error Function: Two Layers
▶ One can show that this error function is non-convex, e.g. the simple case N = 1, x1 = 1, t1 = 1, λ = 0.1 gives an error surface with a saddle point and two local minima.

[Figure: error surface with a saddle point and two local minima.]
Characterizing the Error Function: Two Layers
E(W, v) = (1/N) ∑n ∥v⊤ tanh(W xn) − tn∥² + λ(∥v∥² + ∥W∥²)
▶ In addition to having several local minima, the error function now has
plateaus (non-minima regions with near-zero gradients), which are hard
to escape using gradient descent.
[Figure: error surface with a plateau (zero gradient) away from the true minimum.]
Practical Recommendations
Basic recommendations:
▶ Do not initialize your parameters to zero (otherwise the network sits exactly at a saddle point, and gradient descent is stuck there).
▶ The scale of the initial parameters should not be too large (in order to avoid the saturated regime of the nonlinearities).
These basic heuristics help to land in some local minimum, but not necessarily a good one.
More recommendations:
▶ If affordable, retrain the neural network with multiple random initializations, and keep the training run that achieves the lowest error.
▶ A learning rate set large enough can help to escape local minima.
▶ Do not increase the depth of the neural network beyond necessity (a deeper
network is harder to optimize).
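The multiple-restarts recommendation can be sketched as follows, with a toy one-dimensional non-convex objective (θ² − 2)² standing in for a full network training run:

```python
import numpy as np

def train(seed):
    # Stand-in for a full training run: gradient descent on the non-convex
    # objective E(theta) = (theta^2 - 2)^2, which has two minima at +-sqrt(2)
    # and a saddle point at theta = 0. Returns (final_error, final_theta).
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-3, 3)           # random (non-zero) initialization
    for _ in range(200):
        grad = 4 * theta**3 - 8 * theta  # d/dtheta of (theta^2 - 2)^2
        theta -= 0.01 * grad
    return (theta**2 - 2) ** 2, theta

# Retrain with multiple random initializations and keep the best run.
runs = [train(seed) for seed in range(5)]
best_error, best_theta = min(runs)
assert best_error < 1e-6
```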
Learning Rate Schedules
Idea:
▶ During training, apply a broad range of learning rates, specifically (1) large learning rates to jump out of local minima, and (2) small learning rates to finely adjust the parameters of the model.
Practical Examples:
▶ Step decay (every k iterations, decay the learning rate by a certain
factor). For example:
γ(t) = 0.1   for 0 ≤ t < 1000
       0.01  for 1000 ≤ t < 2000
       ...
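A step-decay schedule of this kind can be written as a small function of the iteration t (decay factor and interval taken from the example above):

```python
def step_decay(t, gamma0=0.1, k=1000, factor=0.1):
    # Every k iterations, decay the learning rate by `factor`:
    # gamma(t) = 0.1 for 0 <= t < 1000, 0.01 for 1000 <= t < 2000, ...
    return gamma0 * factor ** (t // k)

assert step_decay(0) == 0.1
assert abs(step_decay(1500) - 0.01) < 1e-12
assert abs(step_decay(2500) - 0.001) < 1e-12
```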
Is it All About Escaping Local Minima?
[Figure: error surfaces with the minima of the function marked.]
Well-conditioned functions
are easier to optimize.
Analyzing Gradient Descent (Special Case)
Special case: Suppose the error function takes the simple form:
E(θ) = ∑_{i=1}^d αi (θi − θi⋆)²
Analyzing Gradient Descent (Special Case)
▶ The gradient descent update θi(new) = θi − γ ∂E/∂θi here reads:
θi(new) = θi − 2γαi (θi − θi⋆)
▶ Subtracting θi⋆ on both sides and squaring gives:
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²
Analyzing Gradient Descent (Special Case)
▶ Recall that:
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²
▶ Applying T steps of gradient descent from an initial solution θ(0), we get:
(θi(1) − θi⋆)² = (1 − 2γαi)² · (θi(0) − θi⋆)²
(θi(2) − θi⋆)² = (1 − 2γαi)² · (θi(1) − θi⋆)² = (1 − 2γαi)⁴ · (θi(0) − θi⋆)²
...
(θi(T) − θi⋆)² = (1 − 2γαi)^{2T} · (θi(0) − θi⋆)²
▶ Convergence along dimension i requires |1 − 2γαi| < 1, or equivalently:
0 < γ < 1/αi
▶ Let us choose the maximum learning rate that avoids diverging along
any of the dimensions:
γ(best) = 0.99 · minᵢ (1/αi) = 0.99 · (1/αmax),
where αmax is the coefficient of the dimension with highest curvature.
▶ Using this learning rate, the convergence rate along the direction of
lowest curvature (with coefficient αmin) can be expressed as:
|1 − 2γ(best) αmin| = 1 − 2 · 0.99 · (αmin/αmax)

i.e. the higher the ratio αmin/αmax, the faster it converges.
▶ The difficulty of optimization can therefore be quantified by the inverse ratio αmax/αmin, known as the condition number.
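As a numerical sketch (with a hypothetical 2×2 Hessian), the condition number can be read off the Hessian's eigenvalues:

```python
import numpy as np

# Hypothetical Hessian of a quadratic error function (symmetric positive definite).
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])

eigvals = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
alpha_min, alpha_max = eigvals[0], eigvals[-1]
condition_number = alpha_max / alpha_min

# Convergence rate along the flattest direction with gamma = 0.99 / alpha_max.
rate = 1 - 2 * 0.99 * alpha_min / alpha_max

assert condition_number > 1
assert 0 < rate < 1                      # closer to 1 = slower convergence
```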
Analyzing Gradient Descent (General Case)
▶ The analysis in the previous slides assumes a very specific form of E(θ), where the parameters do not interact.
▶ However, using the framework of Taylor expansions, any error function
can be rewritten near some local minimum θ⋆ as:
E(θ) = E(θ⋆) + 0 + ½ (θ − θ⋆)⊤ [∂²E/∂θ∂θ⊤]_{θ=θ⋆} (θ − θ⋆) + higher-order terms

where the first-order term is zero because the gradient vanishes at the local minimum θ⋆, the first three terms form the quadratic approximation Ẽ(θ), and H = [∂²E/∂θ∂θ⊤]_{θ=θ⋆} is the Hessian, a matrix of size |θ| × |θ| where |θ| denotes the number of parameters in the network.
Analyzing Gradient Descent (General Case)
▶ Writing the Hessian in terms of its eigenvectors ui and eigenvalues λi, the quadratic term becomes:
½ (θ − θ⋆)⊤ H (θ − θ⋆) = ∑_{i=1}^d ½ λi ((θ − θ⋆)⊤ ui)²
▶ The error function is thus again a sum of independent quadratic terms, now along the eigendirections of H.
Exercise: Deriving the Hessian of an Error Function
Consider the error function E(θ) = E[(w⊤x − t)²] + λ∥w∥², where E[·] denotes the expectation over the training data. Derive its Hessian.
Elements of the Hessian can be obtained by differentiating the function twice:
∂E(θ)/∂wi = 2E[(w⊤x − t) xi] + 2λwi

Hij = ∂²E(θ)/(∂wj ∂wi) = 2E[xi xj] + 2λ1_{i=j}
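The derived Hessian can be checked numerically, replacing the expectation E[·] by an empirical mean over a random toy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # toy inputs; empirical mean stands in for E[.]
lam = 0.1

# Closed-form Hessian from the derivation: H = 2 E[x x^T] + 2 lambda I
H = 2 * (X.T @ X) / len(X) + 2 * lam * np.eye(3)

# Sanity checks: the Hessian is symmetric and does not depend on w or t,
# and the regularization term lower-bounds its curvature by 2*lambda.
assert H.shape == (3, 3)
assert np.allclose(H, H.T)
assert np.linalg.eigvalsh(H)[0] >= 2 * lam - 1e-9
```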
Computing the Hessian in Practice?
Problem:
▶ The Hessian H (from which one can extract the condition number) is
hard to compute and very large for neural networks with many
parameters (e.g. fully connected networks).
Example:
|θ| = 784 · 300 + 300 · 100 + 100 · 10 = 266200
|H| = 266200 · 266200 ≈ 7.086 · 10¹⁰ entries, i.e. ∼283 gigabytes (at 4 bytes per entry)
Idea:
▶ For most practical tasks, we don't need to evaluate the Hessian and
the condition number. We only need to apply a set of
recommendations and tricks that keep the condition number low.
Part 3 Improving the Conditioning
Improving Conditioning of the Error Function
Data Normalization to Improve Conditioning
▶ Given the training inputs (x1, ..., xN), shift and rescale each input dimension so that it has mean zero and variance one over the dataset.
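A minimal sketch of this normalization (zero mean and unit variance per input dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
# Raw inputs with very different means and scales per feature.
X = rng.normal(loc=[100.0, -5.0], scale=[50.0, 0.1], size=(1000, 2))

# Normalize each input dimension to zero mean and unit variance.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(X_norm.mean(axis=0), 0.0, atol=1e-10)
assert np.allclose(X_norm.std(axis=0), 1.0)
```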
Decomposition of the Hessian
General formula for the Hessian of a neural network (size: |θ| × |θ|)
H = ∂²E/∂θ² = (∂F/∂θ)⊤ (∂²E/∂F²) (∂F/∂θ) + (∂E/∂F) (∂²F/∂θ²)

For the weights wjk into a given neuron k:

[Hk]jj′ = ∂²E/(∂wjk ∂wj′k) = E[aj aj′ δk²] + E[aj · (∂δk/∂wj′k) · (y − t)]

(first term: similar to the simple linear model; second term: complicated)
where δk denotes the derivative of the neural network output w.r.t. the pre-
activation of neuron k.
Improving Conditioning of Higher-Layers
To improve conditioning, not only the input data should be normalized, but
also the representations built from this data at each layer. This can be done
by carefully choosing the activation function.
[Figure: plots of candidate activation functions.]
Limitation of tanh
The tanh non-linearity works well initially, but after some training steps it may no longer work as expected, as the input distribution drifts towards negative or positive values.
[Figure: tanh applied to a drifted input distribution with E[x] ≈ 0.]
Remark: If the input of tanh is centered but skewed, the output of tanh
will not be centered. This happens a lot in practice, e.g. when the problem
representation needs to be sparse.
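This remark can be checked numerically; the shifted exponential below is an assumed example of a centered but skewed input distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Centered but skewed input: exponential distribution shifted to zero mean.
x = rng.exponential(scale=1.0, size=100_000) - 1.0
assert abs(x.mean()) < 0.02          # input is (approximately) centered

y = np.tanh(x)
# The tanh output is noticeably off-center despite the centered input.
assert abs(y.mean()) > 0.05
```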
Comparing Non-Linearities
Further Improving the Hessian
Recommendation
▶ Scale parameters such that neuron outputs have variance ≈ 1 initially (LeCun '98/'12, Efficient Backprop):

θ ∼ N(0, σ²),  σ² = 1 / #input neurons   (1)
A Hessian-based justification:
▶ Build an approximation of the Hessian where interactions between parameters of different neurons are neglected. Such an approximation takes the form of a block-diagonal matrix.
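A sketch of this initialization (the fan-in of 784 matches the earlier example network):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, n_out = 784, 300

# LeCun initialization: weights drawn with variance 1 / #input neurons.
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, n_out))

# For zero-mean, unit-variance inputs, pre-activations then have variance ~ 1.
x = rng.normal(size=(10_000, fan_in))
z = x @ W
assert abs(z.var() - 1.0) < 0.1
```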
Further Improving Optimization / the Hessian
Batch Normalization
(Ioffe et al., arXiv:1502.03167, 2015)
Advantages:
▶ Ensures activations in multiple layers are centered.
▶ Reduces interactions between parameters at multiple layers.
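A minimal sketch of the normalization step (the learnable scale and shift parameters of the full method are omitted):

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    # Normalize pre-activations over the batch dimension so each unit's
    # output is centered with unit variance (training-mode statistics only).
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(128, 10))  # off-center pre-activations
z_bn = batch_norm(z)

assert np.allclose(z_bn.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(z_bn.std(axis=0), 1.0, atol=1e-3)
```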
Further Improving Optimization / the Hessian
Skip connections:
Advantages:
▶ Better propagate the relevant signal to the output of the network.
▶ Reduce interactions between parameters at different layers.
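A minimal residual block, assuming a single tanh layer for illustration:

```python
import numpy as np

def layer(x, W):
    return np.tanh(x @ W)

def residual_block(x, W):
    # Skip connection: the input is added back to the layer's output,
    # giving the signal (and gradient) a direct path around the non-linearity.
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(0.0, np.sqrt(1.0 / 8), size=(8, 8))

out = residual_block(x, W)
assert out.shape == x.shape
# With W = 0 the block reduces to the identity map.
assert np.allclose(residual_block(x, np.zeros((8, 8))), x)
```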
Summary