Deep Learning Lectures - 2
Backpropagation
Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters θ:

$f(\cdot; \theta) : \mathbb{R}^N \rightarrow (0, 1)^K$

Sample $s$ in dataset $S$:

input: $x^s \in \mathbb{R}^N$
Artificial Neuron

$z(x) = w^T x + b$

$f(x) = g(w^T x + b)$
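As a minimal sketch of these two equations in NumPy (the choice of tanh for $g$ and all numeric values below are illustrative, not from the slides):

import numpy as np

def neuron(x, w, b, g=np.tanh):
    # single artificial neuron: pre-activation z = w^T x + b, output f(x) = g(z)
    z = w @ x + b
    return g(z)

# toy usage with arbitrary values
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
print(neuron(x, w, b=0.05))   # tanh(-0.7) ≈ -0.60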
Layer of Neurons
One Hidden Layer Network

$z^h(x) = W^h x + b^h$

$h(x) = g(z^h(x)) = g(W^h x + b^h)$

$z^o(x) = W^o h(x) + b^o$

$f(x) = \mathrm{softmax}(z^o) = \mathrm{softmax}(W^o h(x) + b^o)$
One Hidden Layer Network
Alternate representation
One Hidden Layer Network
Keras implementation
from keras.models import Sequential
from keras.layers import Dense, Activation

# N: input dimension, H: hidden units, K: number of classes (defined elsewhere)
model = Sequential()
model.add(Dense(H, input_dim=N))  # weight matrix of shape [N x H]
model.add(Activation("tanh"))
model.add(Dense(K))               # weight matrix of shape [H x K]
model.add(Activation("softmax"))
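To actually train this model one still needs a loss and an optimizer; a typical configuration would be the one sketched below (the choice of plain SGD is illustrative, and X_train / Y_train are hypothetical arrays):

model.compile(optimizer="sgd",
              loss="categorical_crossentropy",   # the negative log likelihood discussed below
              metrics=["accuracy"])
model.fit(X_train, Y_train, epochs=15, batch_size=32)  # Y_train: one-hot encoded labels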
Element-wise activation functions
Softmax function

$$\mathrm{softmax}(x) = \frac{1}{\sum_{i=1}^{n} e^{x_i}} \cdot \begin{bmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{bmatrix}$$
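A direct NumPy implementation of this formula; subtracting max(x) before exponentiating is a standard numerical-stability trick (an implementation detail, not on the slide) that leaves the result unchanged:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by max(x) so exp never overflows
    return e / e.sum()          # normalize: outputs are positive and sum to 1

print(softmax(np.array([1.0, 2.0, 3.0])))   # [0.09 0.24 0.67]
print(softmax(np.array([1000., 1001.])))    # stable even for huge logits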
Training the network

Find parameters $\theta = (W^h; b^h; W^o; b^o)$ that minimize the negative log likelihood (or cross entropy):

$$L_S(\theta) = - \frac{1}{|S|} \sum_{s \in S} \log f(x^s; \theta)_{y^s}$$
With regularization:

$$L_S(\theta) = - \frac{1}{|S|} \sum_{s \in S} \log f(x^s; \theta)_{y^s} + \lambda \Omega(\theta)$$

$\lambda \Omega(\theta) = \lambda (||W^h||^2 + ||W^o||^2)$ is an optional regularization term.
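In code, the per-sample loss just indexes the predicted probability of the true class; a NumPy sketch (the function name and data are illustrative):

import numpy as np

def nll(probs, labels):
    # probs:  (n_samples, K) predicted probabilities f(x^s; θ)
    # labels: (n_samples,) integer class labels y^s
    # the optional L2 term would add lambda * (np.sum(W_h**2) + np.sum(W_o**2))
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(nll(probs, np.array([0, 1])))   # -(log 0.7 + log 0.8) / 2 ≈ 0.29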
Stochastic Gradient Descent

Initialize θ randomly

Repeatedly update the parameters: θ ← θ − ηΔ, where Δ is the gradient of the loss estimated on a mini-batch
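A schematic SGD loop; grad_loss is a hypothetical stand-in for the model's gradient computation, and the toy usage at the end is purely illustrative:

import numpy as np

def sgd(theta, X, y, grad_loss, lr=0.1, epochs=10, batch_size=32, seed=0):
    # minimal SGD: shuffle, split into mini-batches, estimate gradient, step
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(len(X)), max(1, len(X) // batch_size)):
            delta = grad_loss(theta, X[idx], y[idx])  # Δ: mini-batch gradient estimate
            theta = theta - lr * delta                # θ ← θ − ηΔ
    return theta

# toy usage: learn the mean of y (gradient of 0.5 * (θ − y)² averaged over the batch)
X = np.zeros((100, 1))
y = np.random.default_rng(1).normal(3.0, 0.1, size=100)
print(sgd(0.0, X, y, lambda th, Xb, yb: np.mean(th - yb), epochs=20))  # ≈ 3.0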
Computing Gradients
Chain rule

Derivatives compose: if $l$ depends on $\theta$ through an intermediate quantity $z(\theta)$, then $\frac{\partial l}{\partial \theta} = \frac{\partial l}{\partial z} \cdot \frac{\partial z}{\partial \theta}$
Backpropagation

$$\frac{\partial l}{\partial z^o(x)_i} = \; ?$$
Chain rule!

$e(y)$: one-hot encoding of $y$
Backpropagation

Gradients

$\nabla_{b^o} l = f(x) - e(y)$

because $z^o(x) = W^o h(x) + b^o$ and then $\frac{\partial z^o(x)_i}{\partial b^o_j} = 1_{i=j}$
$$\frac{\partial l}{\partial W^o_{i,j}} = \sum_k \frac{\partial l}{\partial z^o(x)_k} \cdot \frac{\partial z^o(x)_k}{\partial W^o_{i,j}}$$

$\nabla_{W^o} l = (f(x) - e(y)) \cdot h(x)^\top$
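These closed forms are easy to check numerically; a per-sample NumPy sketch (function name and toy values are illustrative, assuming $f(x)$ and $h(x)$ have already been computed by the forward pass):

import numpy as np

def output_grads(f_x, h_x, y, K):
    # gradients of the per-sample NLL w.r.t. the output layer parameters
    e_y = np.zeros(K); e_y[y] = 1.0       # e(y): one-hot encoding of y
    grad_z = f_x - e_y                    # ∇_{z^o(x)} l = f(x) − e(y)
    return np.outer(grad_z, h_x), grad_z  # ∇_{W^o} l = (f(x) − e(y)) h(x)^T and ∇_{b^o} l

# toy usage: K=3 classes, H=2 hidden units, true class y=1
W_grad, b_grad = output_grads(np.array([0.2, 0.5, 0.3]), np.array([1.0, -1.0]), y=1, K=3)
print(W_grad.shape, b_grad)   # (3, 2) [ 0.2 -0.5  0.3]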
Backprop gradients

Compute activation gradients

$\nabla_{b^o} l = \nabla_{z^o(x)} l$
Loss, Initialization and Learning Tricks
Discrete output (classification)

Binary classification: $y \in \{0, 1\}$
Continuous output (regression)

Continuous output: $y \in \mathbb{R}^n$

$Y | X{=}x \sim \mathcal{N}(\mu = f(x; \theta), \sigma^2 I)$: minimizing the negative log likelihood then reduces to minimizing the squared error $||y - f(x; \theta)||^2$

$Y | X{=}x \sim GMM_x$: a Gaussian mixture whose parameters are predicted from $x$
Initialization and normalization

Input data should be normalized to have approx. the same range: standardization or quantile normalization

Initializing $W^h$ and $W^o$:

Zero is a saddle point: no gradient, no learning

Constant init: hidden units collapse by symmetry

Solution: random init, e.g. $w \sim \mathcal{N}(0, 0.01)$
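A sketch of this initialization in NumPy (the layer sizes are hypothetical, and 0.01 is read here as the standard deviation of the Gaussian):

import numpy as np

rng = np.random.default_rng(42)
N, H, K = 784, 100, 10                    # hypothetical layer sizes
W_h = rng.normal(0.0, 0.01, size=(H, N))  # w ~ N(0, 0.01): random values break the symmetry
W_o = rng.normal(0.0, 0.01, size=(K, H))
b_h, b_o = np.zeros(H), np.zeros(K)       # zero biases are fine once the weights differ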
SGD learning rate

Very sensitive:

Too high → early plateau or even divergence

Too low → slow convergence

Try a large value first: η = 0.1 or even η = 1
Momentum

Accumulate gradients across successive updates:

$m_t = \gamma m_{t-1} + \eta \nabla_\theta L(\theta_{t-1})$

$\theta_t = \theta_{t-1} - m_t$
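The same update on a toy quadratic, as a sketch; γ = 0.9 is a conventional choice (an assumption, not from the slide):

import numpy as np

def sgd_momentum(grad, theta, lr=0.1, gamma=0.9, steps=100):
    m = np.zeros_like(theta)
    for _ in range(steps):
        m = gamma * m + lr * grad(theta)   # m_t = γ m_{t−1} + η ∇_θ L(θ_{t−1})
        theta = theta - m                  # θ_t = θ_{t−1} − m_t
    return theta

# toy usage: minimize 0.5 * ||θ||², whose gradient is θ itself
print(sgd_momentum(lambda t: t, np.array([5.0, -3.0])))   # → close to [0, 0]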
Why Momentum Really Works
Alternative optimizers

SGD (with Nesterov momentum):

Simple to implement

Very sensitive to the initial value of η

Needs learning rate scheduling

Adam: adaptive learning rate scale for each param:

Global η set to 3e-4 often works well enough

Good default choice of optimizer (often)

But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 reg)...

Promising stochastic second-order methods: K-FAC and Shampoo can be used to accelerate training of very large models.
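In Keras these choices are one-line swaps; a sketch, assuming a recent Keras version (older releases spell the argument lr instead of learning_rate, and the momentum value is a common default rather than a slide recommendation):

from keras.optimizers import SGD, Adam

sgd_opt = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)  # still needs LR scheduling
adam_opt = Adam(learning_rate=3e-4)   # the global η that "often works well enough"

model.compile(optimizer=adam_opt, loss="categorical_crossentropy")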
The Karpathy Constant for Adam
Optimizers around a saddle point