
WiSe 2023/24

Deep Learning 1

Lecture 3 Optimization (Part 1)


Outline

Recap Lecture 2
▶ Backpropagation and gradient descent
Characterizing the error function
▶ The problem of local minima
▶ The importance of initialization
▶ The problem of poor conditioning
▶ Characterizing conditioning with the Hessian
Improving the conditioning
▶ Data normalization & choice of non-linearities
▶ Scaling initial weights, batch normalization, skip connections

1/31
Part 1 Recap Lecture 2

2/31
Recap: How to Learn in a Neural Network

Observation:
A neural network is a function of both its inputs and parameters.
(figure: the same network shown in graph view and in function view, i.e. as a function y = f(x; θ), with inputs x = (x1, x2, x3), outputs y = (a8, a9), and parameters θ = ((wij)ij, (bj)j))
3/31
Recap: How to Learn in a Neural Network

Define an error function

E(θ) = Σn (f(xn; θ) − tn)²

and minimize it by gradient descent:

θ ← θ − γ · ∇θ E(θ)

(figure: function view of the network y = f(x; θ), with parameters θ = ((wij)ij, (bj)j))
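As a concrete illustration, here is a minimal sketch of this recipe in numpy for a one-layer linear model f(x; θ) = θ⊤x (the data, model size, and learning rate are invented for illustration):

```python
import numpy as np

# Toy dataset: 100 inputs with 3 features and scalar targets (invented)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = rng.normal(size=3)    # random initialization
gamma = 0.05                  # learning rate

for step in range(500):
    y = X @ theta                        # forward pass: f(x_n; theta) for all n
    grad = 2 * X.T @ (y - t) / len(t)    # gradient of the mean squared error
    theta = theta - gamma * grad         # gradient descent update
```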

4/31
Part 2 Characterizing the Error Function

5/31
Characterizing the Error Function: One Layer

▶ Consider a simple linear neural network made of one layer of parameters:

y = w⊤x

(figure: linear model, input x mapped to output y through weights w)

with prediction error averaged over a dataset D = {(x1, t1), . . . , (xN, tN)} of inputs and their associated targets, given by:

E(w) = (1/N) Σn=1..N (w⊤xn − tn)² + λ∥w∥²

▶ One can show that this objective function is convex (as for the perceptron), i.e. one can always reach the minimum of the function by performing gradient descent.
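This convexity claim can be checked numerically; a minimal sketch (data and λ invented), comparing gradient descent against the closed-form minimizer of the regularized objective:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))      # invented inputs x_n
t = X @ rng.normal(size=5)         # invented targets t_n
N, lam = len(t), 0.1

# Closed-form minimizer of E(w) = (1/N) sum_n (w.x_n - t_n)^2 + lam ||w||^2
w_star = np.linalg.solve(X.T @ X / N + lam * np.eye(5), X.T @ t / N)

w = rng.normal(size=5)             # arbitrary starting point
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - t) / N + 2 * lam * w
    w = w - 0.05 * grad

print(np.allclose(w, w_star))      # True: gradient descent reaches the minimum
```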

6/31
Characterizing the Error Function: Two Layers

▶ Consider now a slightly extended version of the neural network above, where we add an extra layer (x → W → a → v → y, both layers linear). This gives the error function:

E(W, v) = (1/N) Σn=1..N (v⊤W xn − tn)² + λ(∥v∥² + ∥W∥²F)

▶ One can show that this error function is non-convex, e.g. the simple case N = 1, x1 = 1, t1 = 1, λ = 0.1 gives:

(figure: error surface over (W, v) with a saddle point and two local minima)

7/31
Characterizing the Error Function: Two Layers

▶ Let us now use a tanh activation function on the intermediate layer (x → W → a → v → y, with a = tanh(W x) now non-linear), which leads to the following error function to minimize:

E(W, v) = (1/N) Σn ∥v⊤ tanh(W xn) − tn∥² + λ(∥v∥² + ∥W∥²)

▶ In addition to having several local minima, the error function now has
plateaus (non-minima regions with near-zero gradients), which are hard
to escape using gradient descent.
(figure: error surface with a plateau of near-zero gradient far from the true minimum)

8/31
Practical Recommendations

Basic recommendations:
▶ Do not initialize the parameters to zero (otherwise the network sits exactly at a saddle point, and gradient descent is stuck there).

▶ The most common alternative is to initialize the parameters at random (e.g. drawn from a Gaussian distribution of fixed scale).

▶ The scale should not be too large (in order to avoid the saturated regime of the nonlinearities).

These basic heuristics help to land in some local minimum, but not necessarily a good one.

More recommendations:
▶ If affordable, retrain the neural network with multiple random initializations, and keep the training run that achieves the lowest error.

▶ A sufficiently large learning rate can help to escape local minima.

▶ Use a sufficient number of neurons at each layer (more parameters make it easier for the algorithm to escape local minima).

▶ Do not increase the depth of the neural network beyond necessity (a deeper network is harder to optimize).

9/31
Learning Rate Schedules

Idea:
▶ During training, apply a broad range of learning rates, specifically (1) large learning rates to jump out of local minima, and (2) small learning rates to finely adjust the parameters of the model.
Practical Examples:
▶ Step decay (every k iterations, decay the learning rate by a certain factor). For example:

γ(t) = 0.1 for 0 ≤ t < 1000, γ(t) = 0.01 for 1000 ≤ t < 2000, . . .

▶ Exponential decay (learning rate decays smoothly over time):

γ(t) = γ0 exp(−αt)
▶ Cyclical learning rates (reduce and grow the learning rate repeatedly).
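As a sketch, the three schedules above can be written as plain functions of the iteration counter t (the constants are the example values from this slide, not tuned recommendations):

```python
import math

def step_decay(t, gamma0=0.1, factor=0.1, k=1000):
    """Multiply the learning rate by `factor` every k iterations."""
    return gamma0 * factor ** (t // k)

def exponential_decay(t, gamma0=0.1, alpha=1e-3):
    """Learning rate decays smoothly over time: gamma0 * exp(-alpha * t)."""
    return gamma0 * math.exp(-alpha * t)

def cyclical(t, gamma_min=0.001, gamma_max=0.1, period=2000):
    """Triangular schedule: the rate repeatedly shrinks and grows again."""
    phase = abs((t % period) / (period / 2) - 1)   # goes 1 -> 0 -> 1 each period
    return gamma_min + (gamma_max - gamma_min) * (1 - phase)
```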

10/31
Is it All About Escaping Local Minima?

Answer: No. We must also verify that the function is well-conditioned.


Examples:
(figure: contour plots of a well-conditioned and a poorly conditioned error function, with the minima of the function marked)

Well-conditioned functions are easier to optimize.

11/31
Analyzing Gradient Descent (Special Case)

Special case: Suppose the error function takes the simple form:

E(θ) = Σi=1..d αi (θi − θi⋆)²

with the αi fixed coefficients that are strictly positive, the θi parameters that we would like to optimize, and θ⋆ the (unique) minimum of the error function.
Observations:
▶ The error is easiest to optimize when all dimensions have the same curvature, i.e. ∀ i, j : αi = αj.
▶ The error is hard to optimize when there is a strong divergence of curvature between the different dimensions (e.g. ∃ i, j : αi ≫ αj).

Idea:
▶ Quantify the difficulty of optimization by analyzing the process of gradient descent.

12/31
Analyzing Gradient Descent (Special Case)

▶ Recall that we have defined the error function

E(θ) = Σi=1..d αi (θi − θi⋆)²

▶ A step along the gradient direction ∂E/∂θi gives:


θi(new) = θi − γ · 2αi (θi − θi⋆ )
▶ From it, one can characterize the convergence of gradient descent:
θi(new) = θi − γ · 2αi (θi − θi⋆)
θi(new) − θi⋆ = θi − γ · 2αi θi + γ · 2αi θi⋆ − θi⋆
θi(new) − θi⋆ = (1 − 2γαi) · (θi − θi⋆)
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²

13/31
Analyzing Gradient Descent (Special Case)

▶ Recall that:
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²

▶ Applying T steps of gradient descent from an initial solution θ(0), we get:

(θi(1) − θi⋆)² = (1 − 2γαi)² · (θi(0) − θi⋆)²
(θi(2) − θi⋆)² = (1 − 2γαi)² · (θi(1) − θi⋆)² = (1 − 2γαi)⁴ · (θi(0) − θi⋆)²
...
(θi(T) − θi⋆)² = (1 − 2γαi)^(2T) · (θi(0) − θi⋆)²

▶ If the squared distance to the optimum decreases along all dimensions, i.e. if |1 − 2γαi| < 1 for all αi, then the overall distance to the optimum also decreases exponentially fast with the number of iterations.
▶ Likewise, E(θ) being a linear combination of these square distances, it
also decreases exponentially fast with the number of iterations.
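This geometric decay is easy to verify numerically; a minimal sketch with invented coefficients αi:

```python
import numpy as np

alpha = np.array([0.5, 1.0, 4.0])        # invented curvatures alpha_i
theta_star = np.array([1.0, -2.0, 3.0])  # the minimum theta*
theta = np.zeros(3)                      # initial solution theta^(0)
gamma = 0.2                              # |1 - 2*gamma*alpha_i| < 1 for all i

T = 50
predicted = (1 - 2 * gamma * alpha) ** (2 * T) * (theta - theta_star) ** 2
for _ in range(T):
    theta = theta - gamma * 2 * alpha * (theta - theta_star)   # gradient step

print(np.allclose((theta - theta_star) ** 2, predicted))       # True
```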
14/31
Analyzing Gradient Descent (Special Case)

▶ Recall that gradient descent converges if, for all dimensions i = 1, . . . , d,

|1 − 2γαi| < 1

or equivalently

0 < γ < 1/αi
▶ Let us choose the maximum learning rate that avoids diverging along
any of the dimensions:
γ(best) = 0.99 · mini (1/αi) = 0.99 · (1/αmax),

where αmax is the coefficient of the dimension with highest curvature.
▶ Using this learning rate, the convergence rate along the direction of
lowest curvature (with coefficient αmin) can be expressed as:

|1 − 2γ(best) αmin| = 1 − 2 · 0.99 · (αmin/αmax)

i.e. the higher the ratio αmin/αmax, the faster it converges.
▶ The difficulty of optimization can therefore be quantified by the inverse ratio αmax/αmin, known as the condition number.
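A small numerical sketch (curvatures invented) of how the condition number αmax/αmin slows gradient descent when using γ(best):

```python
import numpy as np

def steps_to_converge(alpha, tol=1e-6):
    """Gradient descent on E(theta) = sum_i alpha_i * theta_i^2 (minimum at 0)."""
    gamma = 0.99 / alpha.max()           # the near-maximal stable learning rate
    theta = np.ones_like(alpha)          # start at distance 1 in every dimension
    for t in range(1, 10_000_000):
        theta = theta - gamma * 2 * alpha * theta
        if (theta ** 2).sum() < tol:
            return t

print(steps_to_converge(np.array([1.0, 1.0])))    # condition number 1: a few hundred steps
print(steps_to_converge(np.array([1e-4, 1.0])))   # condition number 10^4: tens of thousands
```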
15/31
Analyzing Gradient Descent (General Case)

▶ The analysis in the previous slides assumes a very specific form of E(θ), where the parameters do not interact.
▶ However, using the framework of Taylor expansions, any error function can be rewritten near some local minimum θ⋆ as:

E(θ) = E(θ⋆) + 0 + (1/2) (θ − θ⋆)⊤ H (θ − θ⋆) + higher-order terms

where the first-order term vanishes because the gradient is zero at the minimum θ⋆, the quadratic term defines the local approximation Ẽ(θ), and H = ∂²E/∂θ∂θ⊤ evaluated at θ = θ⋆ is the Hessian, a matrix of size |θ| × |θ| where |θ| denotes the number of parameters in the network.

16/31
Analyzing Gradient Descent (General Case)

▶ Let us start from the Hessian-based local approximation of the error function:

Ẽ(θ) = (1/2) (θ − θ⋆)⊤ H (θ − θ⋆)
▶ Diagonalizing the Hessian matrix, i.e. H = Σi=1..d λi ui ui⊤ with λ1, . . . , λd the eigenvalues and u1, . . . , ud the eigenvectors, we can rewrite the error as:

Ẽ(θ) = (1/2) (θ − θ⋆)⊤ (Σi=1..d λi ui ui⊤) (θ − θ⋆)
      = Σi=1..d (1/2) λi ((θ − θ⋆)⊤ ui)²

▶ Repeating the analysis from before, but replacing the individual dimensions by the projections on the eigenvectors, we get the condition number:

Condition number = λmax / λmin
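As a sketch, the condition number can be read off an eigendecomposition of the Hessian (here a random symmetric positive definite matrix stands in for H):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T + 0.1 * np.eye(50)     # toy symmetric positive definite "Hessian"

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix, ascending
print(eigvals[-1] / eigvals[0])    # condition number = lambda_max / lambda_min
```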

17/31
Exercise: Deriving the Hessian of an Error Function

Consider the simple linear model with mean square error


E(θ) = E[(w⊤ x − t)2 ] + λ∥w∥2

where E[·] denotes the expectation over the training data. Derive its Hessian.
Elements of the Hessian can be obtained by differentiating the function twice:

∂E(θ)/∂wi = 2E[(w⊤x − t) xi] + 2λwi

Hij = ∂/∂wj (∂E(θ)/∂wi) = 2E[xi xj] + 2λ 1i=j

The matrix can then also be stated in terms of vector operations:


H = 2E[xx⊤ ] + 2λI
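This result can be sanity-checked numerically; a minimal sketch with invented data, comparing the analytic Hessian 2E[xx⊤] + 2λI against finite differences of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))     # invented training inputs
t = rng.normal(size=1000)          # invented targets
lam = 0.1
w = rng.normal(size=4)

def grad(w):
    """Gradient of E(w) = E[(w.x - t)^2] + lam * ||w||^2."""
    return 2 * X.T @ (X @ w - t) / len(t) + 2 * lam * w

H_analytic = 2 * X.T @ X / len(t) + 2 * lam * np.eye(4)   # 2 E[x x^T] + 2 lam I

# Finite differences: column j of H is (grad(w + eps*e_j) - grad(w)) / eps
eps = 1e-6
H_numeric = np.stack([(grad(w + eps * np.eye(4)[j]) - grad(w)) / eps
                      for j in range(4)], axis=1)

print(np.allclose(H_analytic, H_numeric, atol=1e-4))      # True
```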

18/31
Computing the Hessian in Practice?

Problem:
▶ The Hessian H (from which one can extract the condition number) is
hard to compute and very large for neural networks with many
parameters (e.g. fully connected networks).

Example: For a fully connected network with layers of sizes 784, 300, 100 and 10:

|θ| = 784 · 300 + 300 · 100 + 100 · 10 = 266,200
|H| = 266,200 · 266,200 ≈ 7.086 · 10^10 entries ∼ 283 gigabytes (at 4 bytes per entry)

Idea:
▶ For most practical tasks, we don't need to evaluate the Hessian and
the condition number. We only need to apply a set of
recommendations and tricks that keep the condition number low.

19/31
Part 3 Improving the Conditioning

20/31
Improving Conditioning of the Error Function

Example: The linear model

y = w⊤x

E(w) = E[(w⊤x − t)²] + λ∥w∥²
     = w⊤ (E[xx⊤] + λI) w + linear + constant
     = w⊤ (E[(x − µ)(x − µ)⊤] + µµ⊤ + λI) w + linear + constant

where µ = E[x] is the data mean, and the bracketed matrix is proportional to the Hessian.
Observation:
▶ The Hessian (and the condition number) are influenced by the mean and covariance of the data.
▶ The closer the mean is to zero, and the closer the covariance is to the
identity, the lower the condition number.

Trick: Normalize the data

21/31
Data Normalization to Improve Conditioning

Data pre-processing before training:

(figure: normalization of the dataset (x1, . . . , xN), centering each input component and scaling it to unit variance; image from LeCun'98/12)
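A minimal sketch of this pre-processing step, assuming the dataset is stored as an array with one row per example:

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Center each input dimension and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)    # eps guards against constant features

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
Xn = normalize(X)
# Xn now has per-feature mean ~ 0 and variance ~ 1, bringing the covariance
# closer to the identity and hence lowering the condition number.
```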

22/31
Decomposition of the Hessian

(figure: input x enters a neural network F with parameters θ; the prediction error E is computed on the output)

General formula for the Hessian of a neural network (size: |θ| × |θ|):

H = ∂²E/∂θ² = (∂F/∂θ)⊤ (∂²E/∂F²) (∂F/∂θ) + (∂E/∂F) (∂²F/∂θ²)

Hessian between weights of a single neuron k (mean square error case):

[Hk]jj′ = ∂²E / (∂wjk ∂wj′k) = E[aj aj′ δk²] + E[aj · (∂δk/∂wj′k) · (y − t)]

where the first term is similar to the simple linear model, the second term is more complicated, and δk denotes the derivative of the neural network output w.r.t. the pre-activation of neuron k.
23/31
Improving Conditioning of Higher-Layers

To improve conditioning, not only the input data should be normalized, but
also the representations built from this data at each layer. This can be done
by carefully choosing the activation function.

(figure: the logistic sigmoid, with outputs in (0, 1), next to the hyperbolic tangent, with outputs in (−1, 1))

▶ logistic sigmoid: activations are not centered ⇒ high condition number
▶ hyperbolic tangent: activations approximately centered at zero ⇒ low condition number

24/31
Limitation of tanh

The tanh non-linearity works well initially, but after some training steps it might no longer work as expected, as the input distribution may drift to negative or positive values.

(figure: a skewed input distribution with E[x] ≈ 0 passed through tanh produces outputs with E[tanh(x)] ≈ 0.5)

Remark: If the input of tanh is centered but skewed, the output of tanh
will not be centered. This happens a lot in practice, e.g. when the problem
representation needs to be sparse.
25/31
Comparing Non-Linearities

26/31
Further Improving the Hessian

Recommendations:
▶ Scale the parameters such that neuron outputs have variance ≈ 1 initially (LeCun'98/12, Efficient Backprop):

θ ∼ N(0, σ²) with σ² = 1 / #input neurons

▶ Use a similar number of neurons in each layer.
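A minimal sketch of this initialization scheme for a fully connected layer (the lecun_init helper and its fan_in/fan_out arguments are our own naming):

```python
import numpy as np

def lecun_init(fan_in, fan_out, rng=np.random.default_rng()):
    """Draw weights from N(0, 1/fan_in): for unit-variance inputs, each
    neuron's pre-activation then also has variance ~ 1 initially."""
    sigma = np.sqrt(1.0 / fan_in)
    W = rng.normal(0.0, sigma, size=(fan_in, fan_out))
    b = np.zeros(fan_out)          # biases are commonly initialized to zero
    return W, b

W1, b1 = lecun_init(784, 300)      # layer sizes from the earlier example
W2, b2 = lecun_init(300, 100)
```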

A Hessian-based justification:
▶ Build an approximation of the Hessian where interactions between parameters of different neurons are neglected. Such an approximation takes the form of a block-diagonal matrix:

H = diag{Hj, Hj′, Hj′′, . . . , Hk, Hk′, Hk′′, . . . , Hout}


▶ The eigenvalues of H are given by the eigenvalues of the different blocks. Reducing the condition number requires ensuring each block has eigenvalues on a similar scale.

▶ Recall that the Hessian associated with a given neuron is of the form [Hk]jj′ = 2E[aj aj′ δk²]. This implies that activations and sensitivities to the output need to be on the same scale at each layer.

27/31
Further Improving Optimization / the Hessian

Batch Normalization (Ioffe et al., arXiv:1502.03167, 2015)

Advantages:
▶ Ensures activations in multiple layers are centered.
▶ Reduces interactions between parameters at multiple layers.
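A minimal sketch of the batch-normalization forward pass at training time (here γ and β denote the layer's learnable scale and shift, not the learning rate):

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mu = a.mean(axis=0)                      # per-feature batch mean
    var = a.var(axis=0)                      # per-feature batch variance
    a_hat = (a - mu) / np.sqrt(var + eps)    # centered, unit-variance activations
    return gamma * a_hat + beta              # learnable scale and shift

a = np.random.default_rng(0).normal(2.0, 3.0, size=(32, 100))   # a mini-batch
out = batch_norm(a, gamma=np.ones(100), beta=np.zeros(100))
# With gamma = 1, beta = 0, `out` has per-feature mean ~ 0 and variance ~ 1,
# keeping higher-layer activations centered as training progresses.
```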

28/31
Further Improving Optimization / the Hessian

Skip connections:

Advantages:
▶ Better propagate the relevant signal to the output of the network.
▶ Reduce interactions between parameters at different layers.
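A minimal sketch of a skip connection around one hidden layer, assuming matching input and output dimensions so the two signals can be added directly:

```python
import numpy as np

def residual_block(x, W, b):
    """A hidden layer whose input is added back to its output."""
    h = np.tanh(x @ W + b)       # ordinary hidden-layer transformation
    return x + h                 # skip connection: the input bypasses the layer

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))   # LeCun-style init
x = rng.normal(size=(32, d))
out = residual_block(x, W, np.zeros(d))
# Because x passes through unchanged, the relevant signal (and its gradient)
# propagates directly across the block, reducing cross-layer interactions.
```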

29/31
Summary

30/31
Summary

▶ Neural networks are powerful but also difficult to optimize (e.g. non-convex, poorly conditioned, etc.).
▶ Non-convexity cannot be avoided; however, its adverse effects can be mitigated by selecting an appropriate neural network architecture and initialization of the parameters.
▶ Poor conditioning, characterized by analyzing the Hessian, can be tackled by applying different tricks such as centering data and representations, homogenizing the scales of activations across layers, and reducing interactions between parameters of different layers. Many of these tricks can be justified as improving the condition number.
▶ There are many more aspects of optimization that have not been covered yet. These include the optimization procedure itself, avoiding redundant computations, implementation aspects, and distributed ML schemes. They will be the focus of Lecture 4.

31/31
