Neural Networks and Backpropagation

Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters $\mathbf{\theta}$:

$$\mathbf{f}(\mathbf{x}^s; \mathbf{\theta})_c = P(Y=c|X=\mathbf{x}^s)$$

Sample $s$ in dataset $S$:

input: $\mathbf{x}^s \in \mathbb{R}^N$
expected output: $y^s \in [0, K-1]$
Artificial Neuron

$$z(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$

$$f(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + b)$$

Layer of Neurons
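The figures for these two slides did not survive extraction. As a minimal numpy sketch of the formulas above, generalized to a whole layer (all names and sizes are illustrative):

```python
import numpy as np

def layer_forward(W, b, x, g=np.tanh):
    """Forward pass of one fully-connected layer: f(x) = g(W x + b)."""
    z = W @ x + b   # pre-activation z(x); a single neuron is the case where W is one row
    return g(z)     # element-wise activation g

# Example: 3 inputs, 4 neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
x = rng.normal(size=3)
h = layer_forward(W, b, x)   # shape (4,)
```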
One Hidden Layer Network

$$\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$$

Alternate representation:

$$\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$$

$$\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$$

$$\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$$

$$\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o(\mathbf{x})) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$$
Keras implementation
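The code on this slide was lost in extraction; a minimal sketch of such a one-hidden-layer model in Keras (dimensions N, H, K are illustrative) could be:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

N, H, K = 784, 100, 10  # illustrative input dim, hidden size, number of classes

model = Sequential()
model.add(Dense(H, input_dim=N))   # z^h(x) = W^h x + b^h
model.add(Activation("tanh"))      # h(x) = g(z^h(x))
model.add(Dense(K))                # z^o(x) = W^o h(x) + b^o
model.add(Activation("softmax"))   # f(x) = softmax(z^o(x))
```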
Element-wise activation functions

(figures: blue = activation function, green = its derivative)

Softmax function

$$softmax(\mathbf{x}) = \frac{1}{\sum_{i=1}^{n}{e^{x_i}}} \cdot \begin{bmatrix} e^{x_1}\\ e^{x_2}\\ \vdots\\ e^{x_n} \end{bmatrix}$$

$$\frac{\partial softmax(\mathbf{x})_i}{\partial x_j} = \begin{cases} softmax(\mathbf{x})_i \cdot (1 - softmax(\mathbf{x})_i) & \text{if } i = j\\ -softmax(\mathbf{x})_i \cdot softmax(\mathbf{x})_j & \text{otherwise} \end{cases}$$
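A direct numpy translation of the formula (the max subtraction is a standard numerical-stability trick, not shown on the slide):

```python
import numpy as np

def softmax(x):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x))  # subtracting max(x) avoids overflow; result unchanged
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(p.sum(), 1.0)
```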
Training the network

Find parameters $\mathbf{\theta} = (\mathbf{W}^h; \mathbf{b}^h; \mathbf{W}^o; \mathbf{b}^o)$ that minimize the negative log likelihood (or cross entropy)

The loss function for a given sample $s \in S$:

$$l(\mathbf{f}(\mathbf{x}^s; \theta), y^s) = nll(\theta; \mathbf{x}^s, y^s) = -\log \mathbf{f}(\mathbf{x}^s; \theta)_{y^s}$$

$$L_S(\theta) = -\frac{1}{|S|} \sum_{s \in S} \log \mathbf{f}(\mathbf{x}^s; \theta)_{y^s} + \lambda \Omega(\theta)$$
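A quick numpy check of the per-sample loss (an illustrative, unregularized sketch):

```python
import numpy as np

def nll(f_x, y):
    """Negative log likelihood of the true class y under softmax output f(x)."""
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])  # f(x; theta), sums to 1
print(nll(f_x, y=1))             # -log 0.7, approx. 0.357
```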
Stochastic Gradient Descent

Initialize $\mathbf{\theta}$ randomly

For $E$ epochs perform:

Randomly select a small batch of samples $(B \subset S)$
Compute gradients: $\Delta = \nabla_\theta L_B(\theta)$
Update parameters: $\mathbf{\theta} \leftarrow \mathbf{\theta} - \eta \Delta$

$\eta > 0$ is called the learning rate
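The loop above, as a minimal numpy sketch (`grad_loss` is a hypothetical function returning one gradient array per parameter):

```python
import numpy as np

def sgd(params, grad_loss, data, n_epochs, batch_size, lr):
    """Minimal SGD: params is a list of numpy arrays,
    grad_loss(params, batch) returns the gradient of L_B for each parameter."""
    rng = np.random.default_rng(0)
    n = len(data)
    for epoch in range(n_epochs):                                # for E epochs
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)  # batch B, a subset of S
            grads = grad_loss(params, [data[i] for i in idx])    # Delta = grad of L_B
            for p, g in zip(params, grads):
                p -= lr * g                                      # theta <- theta - eta * Delta
    return params
```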
Computing Gradients

Output Weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^o_{i,j}}$
Output bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^o_{i}}$
Hidden Weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^h_{i,j}}$
Hidden bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^h_{i}}$
Chain rule

(chain-rule figures)
Backpropagation

Gradients (where $\mathbf{e}(y)$ is the one-hot encoding of $y$):

$$\nabla_{\mathbf{z}^o(\mathbf{x})} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$$

$$\nabla_{\mathbf{b}^o} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$$

because $\frac{\partial z^o_i(\mathbf{x})}{\partial b^o_j} = 1_{i=j}$

Partial derivatives related to $\mathbf{W}^o$:

$$\nabla_{\mathbf{W}^o} l(\mathbf{f(x)}, y) = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$$
Backprop gradients

Compute activation gradients:

$$\nabla_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$$

Compute layer params gradients:

$$\nabla_{\mathbf{W}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l \cdot \mathbf{h(x)}^\top$$

$$\nabla_{\mathbf{b}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l$$

Compute previous layer activation gradients:

$$\nabla_{\mathbf{h(x)}} l = \mathbf{W}^{o\top} \nabla_{\mathbf{z}^o(\mathbf{x})} l$$
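Stitching these formulas together in numpy (a sketch; `f_x` is the softmax output, `e_y` the one-hot target, `h_x` the hidden activations):

```python
import numpy as np

def backprop_output_layer(f_x, e_y, h_x, W_o):
    """Output-layer gradients, exactly as in the formulas above."""
    grad_z_o = f_x - e_y                 # grad wrt z^o: f(x) - e(y)
    grad_W_o = np.outer(grad_z_o, h_x)   # grad wrt W^o: grad_z_o . h(x)^T
    grad_b_o = grad_z_o                  # grad wrt b^o equals grad wrt z^o
    grad_h   = W_o.T @ grad_z_o          # grad wrt h(x): W^oT . grad_z_o
    return grad_W_o, grad_b_o, grad_h
```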
Loss, Initialization and Learning Tricks

$Y|X=\mathbf{x} \sim Bernoulli(p = \mathbf{f}(\mathbf{x}; \theta))$: output function: $sigmoid$; loss function: binary cross-entropy

$Y|X=\mathbf{x} \sim Multinoulli(\mathbf{p} = \mathbf{f}(\mathbf{x}; \theta))$: output function: $softmax$; loss function: categorical cross-entropy

$Y|X=\mathbf{x} \sim \mathcal{N}(\mu = \mathbf{f}(\mathbf{x}; \theta), \sigma^2 I)$: output function: Identity; loss function: square loss
Heteroscedastic if $\mathbf{f}(\mathbf{x}; \theta)$ predicts both $\mu$ and $\sigma^2$

$Y|X=\mathbf{x} \sim GMM_{\mathbf{x}}$: $\mathbf{f}(\mathbf{x}; \theta)$ predicts all the parameters: the means, ...

Input data should be normalized to have approximately the same range: standardization or quantile normalization
Initializing $W^h$ and $W^o$:

Zero is a saddle point: no gradient, no learning
Constant init: hidden units collapse by symmetry
Solution: random init, ex: $w \sim \mathcal{N}(0, 0.01)$
Better inits: Xavier Glorot and Kaiming He, orthogonal
Biases can (should) be initialized to zero
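In Keras these choices map onto standard initializers (as it happens, `glorot_uniform` weights and `zeros` biases are already the `Dense` defaults):

```python
from tensorflow.keras.layers import Dense

# Glorot (Xavier) init for weights, zeros for biases (the Dense defaults)
layer = Dense(100,
              kernel_initializer="glorot_uniform",
              bias_initializer="zeros")

# He init, often preferred with ReLU activations
relu_layer = Dense(100, activation="relu", kernel_initializer="he_normal")
```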
Learning rate scheduling

Divide by 10 and retry in case of divergence
Large constant LR prevents final convergence
multiply $\eta_t$ by $\beta < 1$ after each update
or monitor validation loss and divide $\eta_t$ by 2 or 10 when no progress
See ReduceLROnPlateau in Keras
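The Keras callback named above can be wired in like this (a sketch; `model` and the training data are assumed to be defined elsewhere):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Divide the learning rate by 10 when validation loss stops improving
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,
                              patience=5, min_lr=1e-6)

# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=50, callbacks=[reduce_lr])
```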
Momentum

Accumulate gradients across successive updates:

$$m_t = \gamma m_{t-1} + \eta \nabla_{\theta} L_{B_t}(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} - m_t$$

$\gamma$ is typically set to 0.9

Nesterov accelerated gradient
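A numpy sketch of the plain momentum update above (`grad_fn` is a hypothetical gradient function); in Keras, the Nesterov variant named on this slide is one flag away:

```python
import numpy as np
from tensorflow.keras.optimizers import SGD

def momentum_step(theta, m, grad_fn, lr=0.01, gamma=0.9):
    """One update: m_t = gamma*m_{t-1} + eta*grad(theta);  theta_t = theta_{t-1} - m_t."""
    m = gamma * m + lr * grad_fn(theta)   # accumulate gradients across updates
    return theta - m, m

# Nesterov momentum in Keras
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```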
Why Momentum Really Works
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to initial value of $\eta$
Need learning rate scheduling
The Karpathy Constant for Adam

Adam: adaptive learning rate scale for each param
Global $\eta$ set to 3e-4 often works well enough
Good default choice of optimizer (often)
But well-tuned SGD with LR scheduling can generalize better than Adam (with naive l2 reg)...
Active area of research: K-FAC, a stochastic second-order method based on an invertible approximation of the Fisher information matrix of the network.
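In Keras, using 3e-4 as the global learning rate (a sketch; `model` is assumed to be defined for classification):

```python
from tensorflow.keras.optimizers import Adam

# Adam with the oft-quoted 3e-4 default; per-parameter scales are adapted internally
optimizer = Adam(learning_rate=3e-4)

# model.compile(optimizer=optimizer,
#               loss="categorical_crossentropy", metrics=["accuracy"])
```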
Lab 2: back in 15 min!