
WiSe 2023/24

Deep Learning 1

Lecture 3 Optimization (Part 1)


Outline

Recap Lecture 2
▶ Backpropagation and gradient descent
Characterizing the error function
▶ The problem of local minima
▶ The importance of initialization
▶ The problem of poor conditioning
▶ Characterizing conditioning with the Hessian
Improving the conditioning
▶ Data normalization & choice of non-linearities
▶ Scaling initial weights, batch normalization, skip connections

1/31
Part 1 Recap Lecture 2

2/31
Recap: How to Learn in a Neural Network

Observation:
A neural network is a function of both its inputs and parameters.
(figure: the same network shown in graph view and in function view, i.e. as a function y = f(x; θ), with inputs x = (x1, x2, x3), outputs y = (a8, a9), and parameters θ = ((wij)ij, (bj)j))
3/31
Recap: How to Learn in a Neural Network

Define an error function

E(θ) = Σn (f(xn; θ) − tn)²

and minimize it by gradient descent:

θ ← θ − γ · ∇θ E(θ)

(figure: function view of the network y = f(x; θ), with parameters θ = ((wij)ij, (bj)j))
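As a concrete illustration, here is a minimal sketch of this recipe in numpy for a one-layer linear model f(x; θ) = θ⊤x (the data, model size, and learning rate are invented for illustration):

```python
import numpy as np

# Toy dataset: 100 inputs with 3 features and scalar targets (invented)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = rng.normal(size=3)    # random initialization
gamma = 0.05                  # learning rate

for step in range(500):
    y = X @ theta                        # forward pass: f(x_n; theta) for all n
    grad = 2 * X.T @ (y - t) / len(t)    # gradient of the mean squared error
    theta = theta - gamma * grad         # gradient descent update
```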

4/31
Part 2 Characterizing the Error Function

5/31
Characterizing the Error Function: One Layer

▶ Consider a simple linear neural network made of one layer of parameters:

y = w⊤x

(figure: linear model, input x mapped to output y through weights w)

with prediction error averaged over a dataset D = {(x1, t1), . . . , (xN, tN)} of inputs and their associated targets, given by:

E(w) = (1/N) Σn=1..N (w⊤xn − tn)² + λ∥w∥²

▶ One can show that this objective function is convex (as for the perceptron), i.e. one can always reach the minimum of the function by performing gradient descent.
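This convexity claim can be checked numerically; a minimal sketch (data and λ invented), comparing gradient descent against the closed-form minimizer of the regularized objective:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))      # invented inputs x_n
t = X @ rng.normal(size=5)         # invented targets t_n
N, lam = len(t), 0.1

# Closed-form minimizer of E(w) = (1/N) sum_n (w.x_n - t_n)^2 + lam ||w||^2
w_star = np.linalg.solve(X.T @ X / N + lam * np.eye(5), X.T @ t / N)

w = rng.normal(size=5)             # arbitrary starting point
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - t) / N + 2 * lam * w
    w = w - 0.05 * grad

print(np.allclose(w, w_star))      # True: gradient descent reaches the minimum
```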

6/31
Characterizing the Error Function: Two Layers

▶ Consider now a slightly extended version of the neural network above, where we add an extra layer (x → W → a → v → y, both layers linear). This gives the error function:

E(W, v) = (1/N) Σn=1..N (v⊤W xn − tn)² + λ(∥v∥² + ∥W∥²F)

▶ One can show that this error function is non-convex, e.g. the simple case N = 1, x1 = 1, t1 = 1, λ = 0.1 gives:

(figure: error surface over (W, v) with a saddle point and two local minima)

7/31
Characterizing the Error Function: Two Layers

▶ Let us now use a tanh activation function on the intermediate layer (x → W → a → v → y, with a = tanh(W x) now non-linear), which leads to the following error function to minimize:

E(W, v) = (1/N) Σn ∥v⊤ tanh(W xn) − tn∥² + λ(∥v∥² + ∥W∥²)

▶ In addition to having several local minima, the error function now has
plateaus (non-minima regions with near-zero gradients), which are hard
to escape using gradient descent.
(figure: error surface with a plateau of near-zero gradient far from the true minimum)

8/31
Practical Recommendations

Basic recommendations:
▶ Do not initialize the parameters to zero (otherwise the network sits exactly at a saddle point, and gradient descent is stuck there).

▶ The most common alternative is to initialize the parameters at random (e.g. drawn from a Gaussian distribution of fixed scale).

▶ The scale should not be too large (in order to avoid the saturated regime of the nonlinearities).

These basic heuristics help to land in some local minimum, but not necessarily a good one.

More recommendations:
▶ If affordable, retrain the neural network with multiple random initializations, and keep the training run that achieves the lowest error.

▶ A sufficiently large learning rate can help to escape local minima.

▶ Use a sufficient number of neurons at each layer (more parameters make it easier for the algorithm to escape local minima).

▶ Do not increase the depth of the neural network beyond necessity (a deeper network is harder to optimize).

9/31
Learning Rate Schedules

Idea:
▶ During training, apply a broad range of learning rates, specifically (1) large learning rates to jump out of local minima, and (2) small learning rates to finely adjust the parameters of the model.
Practical Examples:
▶ Step decay (every k iterations, decay the learning rate by a certain factor). For example:

γ(t) = 0.1 for 0 ≤ t < 1000, γ(t) = 0.01 for 1000 ≤ t < 2000, . . .

▶ Exponential decay (learning rate decays smoothly over time):

γ(t) = γ0 exp(−αt)
▶ Cyclical learning rates (reduce and grow the learning rate repeatedly).
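As a sketch, the three schedules above can be written as plain functions of the iteration counter t (the constants are the example values from this slide, not tuned recommendations):

```python
import math

def step_decay(t, gamma0=0.1, factor=0.1, k=1000):
    """Multiply the learning rate by `factor` every k iterations."""
    return gamma0 * factor ** (t // k)

def exponential_decay(t, gamma0=0.1, alpha=1e-3):
    """Learning rate decays smoothly over time: gamma0 * exp(-alpha * t)."""
    return gamma0 * math.exp(-alpha * t)

def cyclical(t, gamma_min=0.001, gamma_max=0.1, period=2000):
    """Triangular schedule: the rate repeatedly shrinks and grows again."""
    phase = abs((t % period) / (period / 2) - 1)   # goes 1 -> 0 -> 1 each period
    return gamma_min + (gamma_max - gamma_min) * (1 - phase)
```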

10/31
Is it All About Escaping Local Minima?

Answer: No. We must also verify that the function is well-conditioned.


Examples:
(figure: contour plots of a well-conditioned and a poorly conditioned error function, with the minima of the function marked)

Well-conditioned functions are easier to optimize.

11/31
Analyzing Gradient Descent (Special Case)

Special case: Suppose the error function takes the simple form:

E(θ) = Σi=1..d αi (θi − θi⋆)²

with the αi fixed coefficients that are strictly positive, the θi parameters that we would like to optimize, and θ⋆ the (unique) minimum of the error function.
Observations:
▶ The error is easiest to optimize when all dimensions have the same curvature, i.e. ∀ i, j : αi = αj.
▶ The error is hard to optimize when there is a strong divergence of curvature between the different dimensions (e.g. ∃ i, j : αi ≫ αj).

Idea:
▶ Quantify the difficulty of optimization by analyzing the process of gradient descent.

12/31
Analyzing Gradient Descent (Special Case)

▶ Recall that we have defined the error function

E(θ) = Σi=1..d αi (θi − θi⋆)²

▶ A step along the gradient direction ∂E/∂θi gives:


θi(new) = θi − γ · 2αi (θi − θi⋆ )
▶ From it, one can characterize the convergence of gradient descent:
θi(new) = θi − γ · 2αi (θi − θi⋆)
θi(new) − θi⋆ = θi − γ · 2αi θi + γ · 2αi θi⋆ − θi⋆
θi(new) − θi⋆ = (1 − 2γαi) · (θi − θi⋆)
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²

13/31
Analyzing Gradient Descent (Special Case)

▶ Recall that:
(θi(new) − θi⋆)² = (1 − 2γαi)² · (θi − θi⋆)²

▶ Applying T steps of gradient descent from an initial solution θ(0), we get:

(θi(1) − θi⋆)² = (1 − 2γαi)² · (θi(0) − θi⋆)²
(θi(2) − θi⋆)² = (1 − 2γαi)² · (θi(1) − θi⋆)² = (1 − 2γαi)⁴ · (θi(0) − θi⋆)²
...
(θi(T) − θi⋆)² = (1 − 2γαi)^(2T) · (θi(0) − θi⋆)²

▶ If the squared distance to the optimum decreases along all dimensions, i.e. if |1 − 2γαi| < 1 for all αi, then the overall distance to the optimum also decreases exponentially fast with the number of iterations.
▶ Likewise, E(θ) being a linear combination of these square distances, it
also decreases exponentially fast with the number of iterations.
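This geometric decay is easy to verify numerically; a minimal sketch with invented coefficients αi:

```python
import numpy as np

alpha = np.array([0.5, 1.0, 4.0])        # invented curvatures alpha_i
theta_star = np.array([1.0, -2.0, 3.0])  # the minimum theta*
theta = np.zeros(3)                      # initial solution theta^(0)
gamma = 0.2                              # |1 - 2*gamma*alpha_i| < 1 for all i

T = 50
predicted = (1 - 2 * gamma * alpha) ** (2 * T) * (theta - theta_star) ** 2
for _ in range(T):
    theta = theta - gamma * 2 * alpha * (theta - theta_star)   # gradient step

print(np.allclose((theta - theta_star) ** 2, predicted))       # True
```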
14/31
Analyzing Gradient Descent (Special Case)

▶ Recall that gradient descent converges if, for all dimensions i = 1, . . . , d,

|1 − 2γαi| < 1

or equivalently

0 < γ < 1/αi
▶ Let us choose the maximum learning rate that avoids diverging along
any of the dimensions:
γ(best) = 0.99 · mini (1/αi) = 0.99 · (1/αmax),

where αmax is the coefficient of the dimension with highest curvature.
▶ Using this learning rate, the convergence rate along the direction of
lowest curvature (with coefficient αmin) can be expressed as:

|1 − 2γ(best) αmin| = 1 − 2 · 0.99 · (αmin/αmax)

i.e. the higher the ratio αmin/αmax, the faster it converges.
▶ The difficulty of optimization can therefore be quantified by the inverse ratio αmax/αmin, known as the condition number.
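A small numerical sketch (curvatures invented) of how the condition number αmax/αmin slows gradient descent when using γ(best):

```python
import numpy as np

def steps_to_converge(alpha, tol=1e-6):
    """Gradient descent on E(theta) = sum_i alpha_i * theta_i^2 (minimum at 0)."""
    gamma = 0.99 / alpha.max()           # the near-maximal stable learning rate
    theta = np.ones_like(alpha)          # start at distance 1 in every dimension
    for t in range(1, 10_000_000):
        theta = theta - gamma * 2 * alpha * theta
        if (theta ** 2).sum() < tol:
            return t

print(steps_to_converge(np.array([1.0, 1.0])))    # condition number 1: a few hundred steps
print(steps_to_converge(np.array([1e-4, 1.0])))   # condition number 10^4: tens of thousands
```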
15/31
Analyzing Gradient Descent (General Case)

▶ The analysis in the previous slides assumes a very specific form of E(θ), where the parameters do not interact.
▶ However, using the framework of Taylor expansions, any error function can be rewritten near some local minimum θ⋆ as:

E(θ) = E(θ⋆) + 0 + (1/2) (θ − θ⋆)⊤ H (θ − θ⋆) + higher-order terms

where the first-order term vanishes because the gradient is zero at the minimum θ⋆, the quadratic term defines the local approximation Ẽ(θ), and H = ∂²E/∂θ∂θ⊤ evaluated at θ = θ⋆ is the Hessian, a matrix of size |θ| × |θ| where |θ| denotes the number of parameters in the network.

16/31
Analyzing Gradient Descent (General Case)

▶ Let us start from the Hessian-based local approximation of the error function:

Ẽ(θ) = (1/2) (θ − θ⋆)⊤ H (θ − θ⋆)
▶ Diagonalizing the Hessian matrix, i.e. H = Σi=1..d λi ui ui⊤ with λ1, . . . , λd the eigenvalues and u1, . . . , ud the eigenvectors, we can rewrite the error as:

Ẽ(θ) = (1/2) (θ − θ⋆)⊤ (Σi=1..d λi ui ui⊤) (θ − θ⋆)
      = Σi=1..d (1/2) λi ((θ − θ⋆)⊤ ui)²

▶ Repeating the analysis from before, but replacing the individual dimensions by the projections on the eigenvectors, we get the condition number:

Condition number = λmax / λmin
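As a sketch, the condition number can be read off an eigendecomposition of the Hessian (here a random symmetric positive definite matrix stands in for H):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T + 0.1 * np.eye(50)     # toy symmetric positive definite "Hessian"

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix, ascending
print(eigvals[-1] / eigvals[0])    # condition number = lambda_max / lambda_min
```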

17/31
Exercise: Deriving the Hessian of an Error Function

Consider the simple linear model with mean square error


E(θ) = E[(w⊤ x − t)2 ] + λ∥w∥2

where E[·] denotes the expectation over the training data. Derive its Hessian.
Elements of the Hessian can be obtained by differentiating the function twice:

∂E(θ)/∂wi = 2E[(w⊤x − t) xi] + 2λwi

Hij = ∂/∂wj (∂E(θ)/∂wi) = 2E[xi xj] + 2λ 1i=j

The matrix can then also be stated in terms of vector operations:


H = 2E[xx⊤ ] + 2λI
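This result can be sanity-checked numerically; a minimal sketch with invented data, comparing the analytic Hessian 2E[xx⊤] + 2λI against finite differences of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))     # invented training inputs
t = rng.normal(size=1000)          # invented targets
lam = 0.1
w = rng.normal(size=4)

def grad(w):
    """Gradient of E(w) = E[(w.x - t)^2] + lam * ||w||^2."""
    return 2 * X.T @ (X @ w - t) / len(t) + 2 * lam * w

H_analytic = 2 * X.T @ X / len(t) + 2 * lam * np.eye(4)   # 2 E[x x^T] + 2 lam I

# Finite differences: column j of H is (grad(w + eps*e_j) - grad(w)) / eps
eps = 1e-6
H_numeric = np.stack([(grad(w + eps * np.eye(4)[j]) - grad(w)) / eps
                      for j in range(4)], axis=1)

print(np.allclose(H_analytic, H_numeric, atol=1e-4))      # True
```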

18/31
Computing the Hessian in Practice?

Problem:
▶ The Hessian H (from which one can extract the condition number) is
hard to compute and very large for neural networks with many
parameters (e.g. fully connected networks).

Example: For a fully connected network with layers of sizes 784, 300, 100 and 10:

|θ| = 784 · 300 + 300 · 100 + 100 · 10 = 266,200
|H| = 266,200 · 266,200 ≈ 7.086 · 10^10 entries ∼ 283 gigabytes (at 4 bytes per entry)

Idea:
▶ For most practical tasks, we don't need to evaluate the Hessian and
the condition number. We only need to apply a set of
recommendations and tricks that keep the condition number low.

19/31
Part 3 Improving the Conditioning

20/31
Improving Conditioning of the Error Function

Example: The linear model

y = w⊤x

E(w) = E[(w⊤x − t)²] + λ∥w∥²
     = w⊤ (E[xx⊤] + λI) w + linear + constant
     = w⊤ (E[(x − µ)(x − µ)⊤] + µµ⊤ + λI) w + linear + constant

where µ = E[x] is the data mean, and the bracketed matrix is proportional to the Hessian.
Observation:
▶ The Hessian (and the condition number) are influenced by the mean and covariance of the data.
▶ The closer the mean is to zero, and the closer the covariance is to the
identity, the lower the condition number.

Trick: Normalize the data

21/31
Data Normalization to Improve Conditioning

Data pre-processing before training:

(figure: normalization of the dataset (x1, . . . , xN), centering each input component and scaling it to unit variance; image from LeCun'98/12)
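A minimal sketch of this pre-processing step, assuming the dataset is stored as an array with one row per example:

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Center each input dimension and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)    # eps guards against constant features

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
Xn = normalize(X)
# Xn now has per-feature mean ~ 0 and variance ~ 1, bringing the covariance
# closer to the identity and hence lowering the condition number.
```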

22/31
Decomposition of the Hessian

(figure: input x enters a neural network F with parameters θ; the prediction error E is computed on the output)

General formula for the Hessian of a neural network (size: |θ| × |θ|):

H = ∂²E/∂θ² = (∂F/∂θ)⊤ (∂²E/∂F²) (∂F/∂θ) + (∂E/∂F) (∂²F/∂θ²)

Hessian between weights of a single neuron k (mean square error case):

[Hk]jj′ = ∂²E / (∂wjk ∂wj′k) = E[aj aj′ δk²] + E[aj · (∂δk/∂wj′k) · (y − t)]

where the first term is similar to the simple linear model, the second term is more complicated, and δk denotes the derivative of the neural network output w.r.t. the pre-activation of neuron k.
23/31
Improving Conditioning of Higher-Layers

To improve conditioning, not only the input data should be normalized, but
also the representations built from this data at each layer. This can be done
by carefully choosing the activation function.

(figure: the logistic sigmoid, with outputs in (0, 1), next to the hyperbolic tangent, with outputs in (−1, 1))

▶ logistic sigmoid: activations are not centered ⇒ high condition number
▶ hyperbolic tangent: activations approximately centered at zero ⇒ low condition number

24/31
Limitation of tanh

The tanh non-linearity works well initially, but after some training steps it might no longer work as expected, as the input distribution may drift to negative or positive values.

(figure: a skewed input distribution with E[x] ≈ 0 passed through tanh produces outputs with E[tanh(x)] ≈ 0.5)

Remark: If the input of tanh is centered but skewed, the output of tanh
will not be centered. This happens a lot in practice, e.g. when the problem
representation needs to be sparse.
25/31
Comparing Non-Linearities

26/31
Further Improving the Hessian

Recommendations:
▶ Scale the parameters such that neuron outputs have variance ≈ 1 initially (LeCun'98/12, Efficient Backprop):

θ ∼ N(0, σ²) with σ² = 1 / #input neurons

▶ Use a similar number of neurons in each layer.
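A minimal sketch of this initialization scheme for a fully connected layer (the lecun_init helper and its fan_in/fan_out arguments are our own naming):

```python
import numpy as np

def lecun_init(fan_in, fan_out, rng=np.random.default_rng()):
    """Draw weights from N(0, 1/fan_in): for unit-variance inputs, each
    neuron's pre-activation then also has variance ~ 1 initially."""
    sigma = np.sqrt(1.0 / fan_in)
    W = rng.normal(0.0, sigma, size=(fan_in, fan_out))
    b = np.zeros(fan_out)          # biases are commonly initialized to zero
    return W, b

W1, b1 = lecun_init(784, 300)      # layer sizes from the earlier example
W2, b2 = lecun_init(300, 100)
```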

A Hessian-based justification:
▶ Build an approximation of the Hessian where interactions between parameters of different neurons are neglected. Such an approximation takes the form of a block-diagonal matrix:

H = diag{Hj, Hj′, Hj′′, . . . , Hk, Hk′, Hk′′, . . . , Hout}


▶ The eigenvalues of H are given by the eigenvalues of the different blocks. Reducing the condition number requires ensuring each block has eigenvalues on a similar scale.

▶ Recall that the Hessian associated with a given neuron is of the form [Hk]jj′ = 2E[aj aj′ δk²]. This implies that activations and sensitivities to the output need to be on the same scale at each layer.

27/31
Further Improving Optimization / the Hessian

Batch Normalization (Ioffe et al., arXiv:1502.03167, 2015)

Advantages:
▶ Ensures activations in multiple layers are centered.
▶ Reduces interactions between parameters at multiple layers.
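A minimal sketch of the batch-normalization forward pass at training time (here γ and β denote the layer's learnable scale and shift, not the learning rate):

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mu = a.mean(axis=0)                      # per-feature batch mean
    var = a.var(axis=0)                      # per-feature batch variance
    a_hat = (a - mu) / np.sqrt(var + eps)    # centered, unit-variance activations
    return gamma * a_hat + beta              # learnable scale and shift

a = np.random.default_rng(0).normal(2.0, 3.0, size=(32, 100))   # a mini-batch
out = batch_norm(a, gamma=np.ones(100), beta=np.zeros(100))
# With gamma = 1, beta = 0, `out` has per-feature mean ~ 0 and variance ~ 1,
# keeping higher-layer activations centered as training progresses.
```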

28/31
Further Improving Optimization / the Hessian

Skip connections:

Advantages:
▶ Better propagate the relevant signal to the output of the network.
▶ Reduce interactions between parameters at different layers.
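A minimal sketch of a skip connection around one hidden layer, assuming matching input and output dimensions so the two signals can be added directly:

```python
import numpy as np

def residual_block(x, W, b):
    """A hidden layer whose input is added back to its output."""
    h = np.tanh(x @ W + b)       # ordinary hidden-layer transformation
    return x + h                 # skip connection: the input bypasses the layer

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))   # LeCun-style init
x = rng.normal(size=(32, d))
out = residual_block(x, W, np.zeros(d))
# Because x passes through unchanged, the relevant signal (and its gradient)
# propagates directly across the block, reducing cross-layer interactions.
```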

29/31
Summary

30/31
Summary

▶ Neural networks are powerful but also difficult to optimize (e.g. non-convex, poorly conditioned, etc.).
▶ Non-convexity cannot be avoided; however, its adverse effects can be mitigated by selecting an appropriate neural network architecture and initialization of the parameters.
▶ Poor conditioning, characterized by analyzing the Hessian, can be tackled by applying different tricks such as centering data and representations, homogenizing the scales of activations across layers, and reducing interactions between parameters of different layers. Many of these tricks can be justified as improving the condition number.
▶ There are many more aspects of optimization that have not been covered yet. These include the optimization procedure itself, avoiding redundant computations, implementation aspects, and distributed ML schemes. They will be the focus of Lecture 4.

31/31
