
Neural networks and Backpropagation

Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters θ:

f(⋅; θ): ℝ^N → (0, 1)^K

Sample s in dataset S:

input: x^s ∈ ℝ^N
expected output: y^s ∈ [0, K − 1]

Output is a conditional probability distribution:

f(x^s; θ)_c = P(Y = c | X = x^s)
Artificial Neuron

z(x) = w^T x + b
f(x) = g(w^T x + b)

x, f(x): input and output
z(x): pre-activation
w, b: weights and bias
g: activation function
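As a minimal illustration, here is a NumPy sketch of a single artificial neuron with tanh as the activation g (the input dimension and the numeric values are made up for this example):

import numpy as np

def neuron(x, w, b, g=np.tanh):
    """Single artificial neuron: pre-activation z(x) = w^T x + b, output f(x) = g(z(x))."""
    z = np.dot(w, x) + b   # pre-activation
    return g(z)            # activation

# toy example with a 3-dimensional input
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05
print(neuron(x, w, b))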
Layer of Neurons

f(x) = g(z(x)) = g(Wx + b)

W, b are now a matrix and a vector
One Hidden Layer Network

z^h(x) = W^h x + b^h
h(x) = g(z^h(x)) = g(W^h x + b^h)
z^o(x) = W^o h(x) + b^o
f(x) = softmax(z^o(x)) = softmax(W^o h(x) + b^o)
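For concreteness, a minimal NumPy sketch of the forward pass of this one-hidden-layer network, assuming tanh as the hidden activation g and toy dimensions N=4, H=3, K=2 (the softmax helper is defined inline):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

def forward(x, W_h, b_h, W_o, b_o):
    z_h = W_h @ x + b_h          # hidden pre-activation
    h = np.tanh(z_h)             # hidden activation g
    z_o = W_o @ h + b_o          # output pre-activation ("logits")
    return softmax(z_o)          # conditional probabilities f(x)

# toy dimensions: N=4 inputs, H=3 hidden units, K=2 classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(3, 4)), np.zeros(3)
W_o, b_o = rng.normal(size=(2, 3)), np.zeros(2)
print(forward(rng.normal(size=4), W_h, b_h, W_o, b_o))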
One Hidden Layer Network

Alternate representation

One Hidden Layer Network

Keras implementation

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(H, input_dim=N))   # hidden weight matrix of shape [N x H]
model.add(Activation("tanh"))
model.add(Dense(K))                # output weight matrix of shape [H x K]
model.add(Activation("softmax"))
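A hedged usage sketch for training the model defined above (the optimizer, loss name, and the X_train / y_train arrays are assumptions, not part of the original slide):

# assuming X_train has shape (n_samples, N) and y_train contains integer labels in [0, K-1]
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=15, validation_split=0.1)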
Element-wise activation functions

Figure: activation functions (blue) and their derivatives (green)
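As an illustration, a small NumPy sketch of a few common element-wise activations and their derivatives (this particular selection is an assumption; the slide's figure may show a different set):

import numpy as np

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2          # derivative of tanh

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # derivative of the logistic sigmoid

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    return (z > 0).astype(float)          # derivative of ReLU (0 at z = 0 by convention)

z = np.linspace(-3.0, 3.0, 7)
print(np.tanh(z), dtanh(z))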
Softmax function

softmax(x) = (1 / Σ_{i=1}^{n} e^{x_i}) ⋅ [e^{x_1}, e^{x_2}, …, e^{x_n}]^T

∂softmax(x)_i / ∂x_j = softmax(x)_i ⋅ (1 − softmax(x)_i)   if i = j
∂softmax(x)_i / ∂x_j = −softmax(x)_i ⋅ softmax(x)_j        if i ≠ j

vector of values in (0, 1) that add up to 1

p(Y = c | X = x) = softmax(z(x))_c

the pre-activation vector z(x) is often called "the logits"
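A minimal NumPy sketch of the softmax and of its Jacobian, matching the two cases above (the shift by max(x) is a standard numerical-stability detail, not part of the slide):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    # J[i, j] = s_i * (1 - s_i) if i == j, else -s_i * s_j
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 0.5])
print(softmax(x))               # values in (0, 1) that add up to 1
print(softmax_jacobian(x))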
Training the network

Find parameters θ = (W^h; b^h; W^o; b^o) that minimize the negative log likelihood (or cross entropy)

The loss function for a given sample s ∈ S:

l(f(x^s; θ), y^s) = nll(x^s, y^s; θ) = − log f(x^s; θ)_{y^s}

The cost function is the negative likelihood of the model computed on the full training set (for i.i.d. samples):

L_S(θ) = − (1 / |S|) Σ_{s∈S} log f(x^s; θ)_{y^s} + λΩ(θ)

λΩ(θ) = λ(||W^h||² + ||W^o||²) is an optional regularization term.
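A minimal NumPy sketch of this cost function evaluated on a batch of samples (the probas array is assumed to hold the network outputs f(x^s; θ); the λ value is an arbitrary example):

import numpy as np

def nll_cost(probas, y, W_h, W_o, lam=1e-4):
    """Mean negative log likelihood plus the optional L2 regularization term.

    probas: array of shape (n_samples, K) with f(x^s; theta) for each sample
    y:      integer labels of shape (n_samples,)
    """
    n = probas.shape[0]
    log_likelihood = np.log(probas[np.arange(n), y])      # log f(x^s; theta)_{y^s}
    reg = lam * (np.sum(W_h ** 2) + np.sum(W_o ** 2))     # lambda * Omega(theta)
    return -np.mean(log_likelihood) + reg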
Stochastic Gradient Descent

Initialize θ randomly

For E epochs perform:

Randomly select a small batch of samples (B ⊂ S)
Compute gradients: Δ = ∇_θ L_B(θ)
Update parameters: θ ← θ − ηΔ
η > 0 is called the learning rate
Repeat until the epoch is completed (all of S is covered)

Stop when reaching criterion:

nll stops decreasing when computed on validation set
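A sketch of this loop in NumPy-style Python (the grad_L function, the dataset arrays, and the batch size are assumptions; the gradient computation itself is the subject of the backpropagation slides that follow):

import numpy as np

def sgd(theta, X, Y, grad_L, lr=0.1, epochs=10, batch_size=32):
    """theta: dict of parameter arrays; grad_L(theta, X_b, Y_b) returns a dict of gradients."""
    n = X.shape[0]
    for epoch in range(epochs):                 # for E epochs
        order = np.random.permutation(n)        # visit the samples in random order
        for start in range(0, n, batch_size):   # small batches B ⊂ S until S is covered
            idx = order[start:start + batch_size]
            grads = grad_L(theta, X[idx], Y[idx])       # Δ = ∇_θ L_B(θ)
            for name in theta:
                theta[name] -= lr * grads[name]         # θ ← θ − ηΔ
    return theta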
Computing Gradients

Output Weights: ∂l(f(x), y) / ∂W^o_{i,j}
Output bias: ∂l(f(x), y) / ∂b^o_i
Hidden Weights: ∂l(f(x), y) / ∂W^h_{i,j}
Hidden bias: ∂l(f(x), y) / ∂b^h_i

The network is a composition of differentiable modules

We can apply the "chain rule"
Chain rule

Figure: chain-rule
Backpropagation

Compute partial derivatives of the loss:

∂l(f(x), y) / ∂f(x)_i = ∂(−log f(x)_y) / ∂f(x)_i = −1_{y=i} / f(x)_y

∂l / ∂z^o(x)_i = ?
Chain rule!

e(y): one-hot encoding of y
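The intermediate steps of this derivation appear only as figures in the original slides; as a textual reconstruction (using the softmax derivative from the earlier slide, and noting that only the j = y term of the sum is nonzero):

∂l / ∂z^o(x)_i = Σ_j (∂l / ∂f(x)_j) ⋅ (∂softmax(z^o(x))_j / ∂z^o(x)_i)
              = (−1 / f(x)_y) ⋅ f(x)_y ⋅ (1_{y=i} − f(x)_i)
              = f(x)_i − 1_{y=i}

Stacking the components over i gives ∇_{z^o(x)} l = f(x) − e(y), the result used on the next slide.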
Backpropagation

Gradients

∇_{z^o(x)} l = f(x) − e(y)
∇_{b^o} l = f(x) − e(y)

because z^o(x) = W^o h(x) + b^o and then ∂z^o(x)_i / ∂b^o_j = 1_{i=j}
Backpropagation

Partial derivatives related to W^o:

∂l / ∂W^o_{i,j} = Σ_k (∂l / ∂z^o(x)_k) ⋅ (∂z^o(x)_k / ∂W^o_{i,j})

∇_{W^o} l = (f(x) − e(y)) ⋅ h(x)^⊤
Backprop gradients

Compute activation gradients:

∇_{z^o(x)} l = f(x) − e(y)

Compute layer parameter gradients:

∇_{W^o} l = ∇_{z^o(x)} l ⋅ h(x)^⊤
∇_{b^o} l = ∇_{z^o(x)} l

Compute previous layer activation gradients:

∇_{h(x)} l = W^{o⊤} ∇_{z^o(x)} l
∇_{z^h(x)} l = ∇_{h(x)} l ⊙ g′(z^h(x))
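Putting these formulas together, a minimal NumPy sketch of one forward/backward pass for a single sample, with tanh as g so that g′(z) = 1 − tanh(z)² (the one-hot helper and the shapes are assumptions):

import numpy as np

def one_hot(y, K):
    e = np.zeros(K)
    e[y] = 1.0
    return e

def forward_backward(x, y, W_h, b_h, W_o, b_o):
    # forward pass
    z_h = W_h @ x + b_h
    h = np.tanh(z_h)
    z_o = W_o @ h + b_o
    e_z = np.exp(z_o - z_o.max())
    f = e_z / e_z.sum()                      # f(x) = softmax(z_o)

    # backward pass (backpropagation)
    d_z_o = f - one_hot(y, len(f))           # ∇_{z^o} l = f(x) − e(y)
    grad_W_o = np.outer(d_z_o, h)            # ∇_{W^o} l = ∇_{z^o} l · h(x)^T
    grad_b_o = d_z_o                         # ∇_{b^o} l = ∇_{z^o} l
    d_h = W_o.T @ d_z_o                      # ∇_{h} l = W^{o T} ∇_{z^o} l
    d_z_h = d_h * (1.0 - np.tanh(z_h) ** 2)  # ∇_{z^h} l = ∇_{h} l ⊙ g'(z^h)
    grad_W_h = np.outer(d_z_h, x)            # same formulas applied to the hidden layer
    grad_b_h = d_z_h
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o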
Loss, Initialization and Learning Tricks
Discrete output (classification)

Binary classification: y ∈ [0, 1]

Y | X = x ∼ Bernoulli(b = f(x; θ))
output function: logistic(x) = 1 / (1 + e^{−x})
loss function: binary cross-entropy

Multiclass classification: y ∈ [0, K − 1]

Y | X = x ∼ Multinoulli(p = f(x; θ))
output function: softmax
loss function: categorical cross-entropy
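As a sketch, the two cross-entropy losses written in NumPy for a single sample (the clipping to avoid log(0) is an implementation detail, not part of the slide):

import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """p = logistic output in (0, 1), y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(p, y, eps=1e-12):
    """p = softmax output vector over K classes, y = integer class in [0, K-1]."""
    return -np.log(np.clip(p[y], eps, 1.0))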
Continuous output (regression)

Continuous output: y ∈ ℝ^n

Y | X = x ∼ N(μ = f(x; θ), σ² I)
output function: identity
loss function: square loss

Heteroscedastic if f(x; θ) predicts both μ and σ²

Mixture Density Network (multimodal output)

Y | X = x ∼ GMM_x
f(x; θ) predicts all the parameters: the means, covariance matrices and mixture weights
Initialization and normalization

Input data should be normalized to have approximately the same range:
standardization or quantile normalization

Initializing W^h and W^o:
Zero is a saddle point: no gradient, no learning
Constant init: hidden units collapse by symmetry
Solution: random init, e.g. w ∼ N(0, 0.01)
Better inits: Xavier Glorot and Kaiming He & orthogonal
Biases can (should) be initialized to zero
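For illustration, a NumPy sketch of the naive random init next to a Glorot-style uniform init (the scaling limit = sqrt(6 / (fan_in + fan_out)) is the commonly cited Glorot formula and is an assumption relative to the slide):

import numpy as np

rng = np.random.default_rng(42)

def naive_init(fan_in, fan_out, std=0.01):
    # w ~ N(0, 0.01): small random values break the symmetry between hidden units
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

def glorot_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform init: keeps activation variance roughly constant across layers
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W_h = glorot_uniform(fan_in=4, fan_out=3)   # hidden weights
b_h = np.zeros(3)                           # biases initialized to zero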
SGD learning rate

Very sensitive:
Too high → early plateau or even divergence
Too low → slow convergence

Try a large value first: η = 0.1 or even η = 1
Divide by 10 and retry in case of divergence

A large constant learning rate prevents final convergence:
multiply η_t by β < 1 after each update
or monitor the validation loss and divide η_t by 2 or 10 when no progress is made
See ReduceLROnPlateau in Keras
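A hedged Keras usage sketch of the last strategy, reusing the model from the earlier Keras slide (the monitored quantity, factor, and patience values are assumptions):

from keras.callbacks import ReduceLROnPlateau

# divide the learning rate by 10 when the validation loss stops improving
lr_schedule = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)

model.fit(X_train, y_train, validation_split=0.1, epochs=50, callbacks=[lr_schedule])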
Momentum

Accumulate gradients across successive updates:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1})
θ_t = θ_{t−1} − m_t

γ is typically set to 0.9

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas.

Nesterov accelerated gradient:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1} − γ m_{t−1})
θ_t = θ_{t−1} − m_t

Better at handling changes in gradient direction.
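A small NumPy sketch of both update rules, using the same parameter-dict convention as the SGD sketch earlier (grad_L and the batch are assumptions):

def momentum_step(theta, m, grads, lr=0.1, gamma=0.9):
    """Classical momentum: m_t = γ m_{t-1} + η ∇_θ L, then θ_t = θ_{t-1} - m_t."""
    for name in theta:
        m[name] = gamma * m[name] + lr * grads[name]
        theta[name] -= m[name]
    return theta, m

def nesterov_step(theta, m, grad_L, batch, lr=0.1, gamma=0.9):
    """Nesterov: the gradient is evaluated at the look-ahead point θ_{t-1} - γ m_{t-1}."""
    lookahead = {name: theta[name] - gamma * m[name] for name in theta}
    grads = grad_L(lookahead, *batch)
    for name in theta:
        m[name] = gamma * m[name] + lr * grads[name]
        theta[name] -= m[name]
    return theta, m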
Why Momentum Really Works
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to the initial value of η
Needs learning rate scheduling

Adam: adaptive learning rate scale for each parameter:
Global η set to 3e-4 often works well enough
Good default choice of optimizer (often)
But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 regularization)...

Promising stochastic second-order methods: K-FAC and Shampoo can be used to accelerate the training of very large models.
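A hedged Keras sketch of the two default choices discussed here (the learning rates echo the slides; the import path and the learning_rate argument name assume a recent standalone Keras version):

from keras.optimizers import SGD, Adam

# SGD with Nesterov momentum: sensitive to the learning rate, usually needs a schedule
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Adam with the "Karpathy constant" 3e-4: a reasonable default in many cases
adam = Adam(learning_rate=3e-4)

model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])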
The Karpathy Constant for Adam

Optimizers around a saddle point

Credits: Alec Radford

