Neural Networks and Backpropagation
Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters $\theta$:

$$ \mathbf{f}(\cdot; \theta): \mathbb{R}^N \rightarrow (0, 1)^K $$

Sample $s$ in dataset $S$:

input: $\mathbf{x}^s \in \mathbb{R}^N$
expected output: $y^s \in [0, K-1]$

Output is a conditional probability distribution:

$$ \mathbf{f}(\mathbf{x}^s; \theta)_c = P(Y = c | X = \mathbf{x}^s) $$
Artificial Neuron

$z(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

$f(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + b)$

$\mathbf{x}, f(\mathbf{x})$: input and output
$z(\mathbf{x})$: pre-activation
$\mathbf{w}, b$: weights and bias
$g$: activation function
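As a rough sketch, a single neuron is just a dot product, a bias, and a nonlinearity; the input values and the tanh activation below are illustrative assumptions, not from the slides:

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    z = np.dot(w, x) + b   # pre-activation z(x) = w^T x + b
    return g(z)            # output f(x) = g(z(x))

x = np.array([0.5, -1.2, 0.3])   # input
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
print(neuron(x, w, b))
```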
Layer of Neurons

$\mathbf{f}(\mathbf{x}) = g(\mathbf{z}(\mathbf{x})) = g(\mathbf{W} \mathbf{x} + \mathbf{b})$

$\mathbf{W}, \mathbf{b}$: now a matrix and a vector
One Hidden Layer Network

$\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$

$\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$

$\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$

$\mathbf{f}(\mathbf{x}) = \mathrm{softmax}(\mathbf{z}^o(\mathbf{x})) = \mathrm{softmax}(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
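A minimal NumPy sketch of this forward pass; the layer sizes (N=4, H=8, K=3) and the tanh hidden activation are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max(z) for numerical stability
    return e / e.sum()

def forward(x, Wh, bh, Wo, bo, g=np.tanh):
    zh = Wh @ x + bh          # z^h(x) = W^h x + b^h
    h = g(zh)                 # h(x) = g(z^h(x))
    zo = Wo @ h + bo          # z^o(x) = W^o h(x) + b^o
    return softmax(zo)        # f(x) = softmax(z^o(x))

rng = np.random.default_rng(0)
N, H, K = 4, 8, 3             # input, hidden and output sizes
x = rng.normal(size=N)
Wh, bh = 0.01 * rng.normal(size=(H, N)), np.zeros(H)
Wo, bo = 0.01 * rng.normal(size=(K, H)), np.zeros(K)
print(forward(x, Wh, bh, Wo, bo))   # K probabilities summing to 1
```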
Alternate representation

(figure: alternate diagram of the same one-hidden-layer network)
Keras implementation

(figure: code screenshot)
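The screenshot itself is not recoverable from this export; a plausible reconstruction in modern Keras, where the hidden size (100), the tanh activation, and the input/output sizes are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

N, K = 784, 10   # e.g. flattened 28x28 images, 10 classes (assumption)

model = keras.Sequential([
    keras.Input(shape=(N,)),
    layers.Dense(100, activation="tanh"),    # hidden layer h(x)
    layers.Dense(K, activation="softmax"),   # output layer f(x)
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```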
Element-wise activation functions

(figure: activation functions and their derivatives; blue: activation function, green: derivative)
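The plotted curves are not recoverable here; as a sketch, these are the usual candidates (sigmoid, tanh, ReLU) with their derivatives in NumPy, though which ones the slide actually shows is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):                 # derivative: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def dtanh(x):                    # derivative of np.tanh
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def drelu(x):                    # subgradient, taken as 0 at x = 0
    return (x > 0).astype(float)
```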
Softmax function

$$ \mathrm{softmax}(\mathbf{x}) = \frac{1}{\sum_{i=1}^{n} e^{x_i}} \cdot \begin{bmatrix} e^{x_1}\\ e^{x_2}\\ \vdots\\ e^{x_n} \end{bmatrix} $$

$$ \frac{\partial\, \mathrm{softmax}(\mathbf{x})_i}{\partial x_j} = \begin{cases} \mathrm{softmax}(\mathbf{x})_i \cdot (1 - \mathrm{softmax}(\mathbf{x})_i) & i = j\\ -\mathrm{softmax}(\mathbf{x})_i \cdot \mathrm{softmax}(\mathbf{x})_j & i \neq j \end{cases} $$
vector of values in $(0, 1)$ that add up to $1$

$p(Y = c|X = \mathbf{x}) = \text{softmax}(\mathbf{z}(\mathbf{x}))_c$

the pre-activation vector $\mathbf{z}(\mathbf{x})$ is often called "the logits"
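A NumPy sketch following the two formulas above; the max-shift is a standard numerical-stability trick (it leaves the result unchanged), and the Jacobian expression packs both cases into one matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # shift by max(x): same output, no overflow
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    # diagonal entries: s_i (1 - s_i); off-diagonal: -s_i s_j
    return np.diag(s) - np.outer(s, s)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))             # values in (0, 1), summing to 1
print(softmax_jacobian(logits))
```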
Training the network

Find parameters $\theta = (\mathbf{W}^h; \mathbf{b}^h; \mathbf{W}^o; \mathbf{b}^o)$ that minimize the negative log likelihood (or cross entropy)
The loss function for a given sample $s \in S$:

$$ l(\mathbf{f}(\mathbf{x}^s;\theta), y^s) = nll(\theta; \mathbf{x}^s, y^s) = -\log \mathbf{f}(\mathbf{x}^s;\theta)_{y^s} $$
(figure: worked example of the loss on one sample)
The cost function is the negative log likelihood of the model computed on the full training set (for i.i.d. samples):

$$ L_S(\theta) = -\frac{1}{|S|} \sum_{s \in S} \log \mathbf{f}(\mathbf{x}^s;\theta)_{y^s} + \lambda \Omega(\theta) $$
$\lambda \Omega(\theta) = \lambda (||W^h||^2 + ||W^o||^2)$ is an optional regularization term.
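A small NumPy sketch of the per-sample nll and the full cost $L_S$; the function and argument names here are hypothetical stand-ins for $\mathbf{f}(\cdot;\theta)$ and $\lambda$:

```python
import numpy as np

def nll(probs, y):
    """Per-sample loss: -log f(x; theta)_y, with probs = f(x; theta)."""
    return -np.log(probs[y])

def cost(probas, ys, lam=0.0, Wh=None, Wo=None):
    """L_S: mean nll over the set, plus the optional L2 penalty."""
    loss = np.mean([nll(p, y) for p, y in zip(probas, ys)])
    if lam > 0.0:
        loss += lam * (np.sum(Wh ** 2) + np.sum(Wo ** 2))
    return loss

print(nll(np.array([0.1, 0.7, 0.2]), y=1))   # -log 0.7 ~= 0.357
```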
Stochastic Gradient Descent

Initialize $\theta$ randomly
For $E$ epochs perform:

Randomly select a small batch of samples ($B \subset S$)
Compute gradients: $\Delta = \nabla_\theta L_B(\theta)$
Update parameters: $\theta \leftarrow \theta - \eta \Delta$

$\eta > 0$ is called the learning rate
Stop when a stopping criterion is reached:

nll stops decreasing when computed on the validation set
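A schematic but runnable SGD loop following the steps above, on a hypothetical quadratic toy loss; in the real network, $\Delta$ would come from backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))          # toy dataset S (assumption)
theta = rng.normal(size=4)              # initialize theta randomly
eta, E, batch_size = 0.1, 20, 32        # learning rate, epochs, |B|

def grad_LB(theta, batch):
    # toy gradient: pulls theta towards the batch mean (stand-in for grad of L_B)
    return theta - batch.mean(axis=0)

for epoch in range(E):                  # for E epochs perform:
    for _ in range(len(X) // batch_size):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # B subset of S
        delta = grad_LB(theta, X[idx])  # Delta = grad_theta L_B(theta)
        theta = theta - eta * delta     # theta <- theta - eta * Delta
# in practice: stop when nll on a validation set stops decreasing
```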
Computing Gradients

Output weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^o_{i,j}}$
Output bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^o_{i}}$
Hidden weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^h_{i,j}}$
Hidden bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^h_{i}}$
The network is a composition of differentiable modules

We can apply the "chain rule"
Chain rule

(figure: the chain rule unrolled through the layers of the network)
Backpropagation

Compute partial derivatives of the loss:

$$ \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}_i} = \frac{\partial (-\log \mathbf{f(x)}_y)}{\partial \mathbf{f(x)}_i} = \frac{-1_{y=i}}{\mathbf{f(x)}_y} $$

$$ \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{z}^o(\mathbf{x})_i} = \sum_j \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}_j} \frac{\partial \mathbf{f(x)}_j}{\partial \mathbf{z}^o(\mathbf{x})_i} $$
...
Backpropagation

Gradients:

$\nabla_{\mathbf{z}^o(\mathbf{x})} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$

$\nabla_{\mathbf{b}^o} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$

because $\frac{\partial \mathbf{z}^o(\mathbf{x})_i}{\partial \mathbf{b}^o_j} = 1_{i=j}$ (here $\mathbf{e}(y)$ denotes the one-hot encoding of $y$)
Partial derivatives related to $\mathbf{W}^o$:

$\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^o_{i,j}} = \sum_{k} \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{z}^o(\mathbf{x})_k} \frac{\partial \mathbf{z}^o(\mathbf{x})_k}{\partial W^o_{i,j}}$

$\nabla_{\mathbf{W}^o} l(\mathbf{f(x)}, y) = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$
Backprop gradients

Compute activation gradients:

$\nabla_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$
Compute layer parameter gradients:

$\nabla_{\mathbf{W}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l \cdot \mathbf{h(x)}^\top$

$\nabla_{\mathbf{b}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l$
Compute previous layer activation gradients:

$\nabla_{\mathbf{h(x)}} l = \mathbf{W}^{o\top} \nabla_{\mathbf{z}^o(\mathbf{x})} l$
$\nabla_{\mathbf{z}^h(\mathbf{x})} l = \nabla_{\mathbf{h(x)}} l \odot \sigma'(\mathbf{z}^h(\mathbf{x}))$
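A NumPy sketch of this backward pass for one sample $(x, y)$, reusing the shapes of the earlier forward-pass sketch; tanh is an assumed choice for the hidden activation $g$ (so $\sigma'(z) = 1 - \tanh^2(z)$):

```python
import numpy as np

def backprop(x, y, Wh, bh, Wo, bo):
    g, dg = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # g and sigma'
    # forward pass
    zh = Wh @ x + bh
    h = g(zh)
    zo = Wo @ h + bo
    e = np.exp(zo - zo.max())
    f = e / e.sum()                 # f(x) = softmax(z^o(x))
    # backward pass
    dzo = f.copy()
    dzo[y] -= 1.0                   # grad_{z^o} l = f(x) - e(y)
    dWo = np.outer(dzo, h)          # grad_{W^o} l = grad_{z^o} l . h(x)^T
    dbo = dzo                       # grad_{b^o} l = grad_{z^o} l
    dh = Wo.T @ dzo                 # grad_h l = W^{o T} grad_{z^o} l
    dzh = dh * dg(zh)               # grad_{z^h} l = grad_h l ⊙ sigma'(z^h)
    dWh = np.outer(dzh, x)          # same two formulas for the hidden layer
    dbh = dzh
    return dWh, dbh, dWo, dbo
```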
Loss, Initialization and Learning Tricks

Discrete output (classification)

Binary classification: $y \in [0, 1]$

$Y|X=\mathbf{x} \sim Bernoulli(b=f(\mathbf{x}; \theta))$
output function: $logistic(x) = \frac{1}{1 + e^{-x}}$
loss function: binary cross-entropy

Multiclass classification: $y \in [0, K-1]$

$Y|X=\mathbf{x} \sim Multinoulli(\mathbf{p}=\mathbf{f}(\mathbf{x}; \theta))$
output function: $softmax$
loss function: categorical cross-entropy
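A sketch of how these output/loss pairings look in Keras; the single-layer models and the class count are assumptions for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

K = 10   # number of classes (assumption)

binary = keras.Sequential([layers.Dense(1, activation="sigmoid")])
binary.compile(optimizer="sgd", loss="binary_crossentropy")

multiclass = keras.Sequential([layers.Dense(K, activation="softmax")])
multiclass.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```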

Continuous output (regression)

Continuous output: $\mathbf{y} \in \mathbb{R}^n$

$Y|X=\mathbf{x} \sim \mathcal{N}(\mathbf{\mu}=\mathbf{f}(\mathbf{x}; \theta), \sigma^2 \mathbf{I})$
output function: Identity
loss function: square loss

Heteroscedastic if $\mathbf{f}(\mathbf{x}; \theta)$ predicts both $\mathbf{\mu}$ and $\sigma^2$

Mixture Density Network (multimodal output)

$Y|X=\mathbf{x} \sim GMM_{\mathbf{x}}$
$\mathbf{f}(\mathbf{x}; \theta)$ predicts all the parameters: the means, covariance matrices and mixture weights

Initialization and normalization

Input data should be normalized to have approximately the same range: standardization or quantile normalization

Initializing $W^h$ and $W^o$:

Zero is a saddle point: no gradient, no learning

Constant init: hidden units collapse by symmetry

Solution: random init, e.g. $w \sim \mathcal{N}(0, 0.01)$

Better inits: the Xavier Glorot and Kaiming He schemes, and orthogonal init

Biases can (should) be initialized to zero
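In Keras the init schemes mentioned above are built in; a sketch (glorot_uniform is in fact the Dense default, and the layer sizes are assumptions):

```python
from tensorflow.keras import layers

hidden = layers.Dense(100, activation="tanh",
                      kernel_initializer="glorot_uniform",  # Xavier Glorot
                      bias_initializer="zeros")             # biases at zero
output = layers.Dense(10, kernel_initializer="he_normal")   # Kaiming He
# an "orthogonal" kernel_initializer is also built in
```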

SGD learning rate

Very sensitive:

Too high $\rightarrow$ early plateau or even divergence
Too low $\rightarrow$ slow convergence

Try a large value first: $\eta = 0.1$ or even $\eta = 1$
Divide by 10 and retry in case of divergence

A large constant learning rate prevents final convergence:
multiply $\eta_{t}$ by $\beta < 1$ after each update

or monitor the validation loss and divide $\eta_{t}$ by 2 or 10 when no progress is made
See ReduceLROnPlateau in Keras
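A sketch of that callback; the factor/patience values are illustrative, and the model and training data are hypothetical, assumed defined as in the earlier sketches:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="val_loss",  # watch validation loss
                              factor=0.1,          # divide eta_t by 10
                              patience=5,          # after 5 epochs without progress
                              min_lr=1e-6)
# hypothetical model / data, defined as in the earlier sketches:
model.fit(X_train, y_train, validation_split=0.1,
          epochs=50, callbacks=[reduce_lr])
```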

Momentum

Accumulate gradients across successive updates:

$$\begin{eqnarray} m_t &=& \gamma m_{t-1} + \eta \nabla_{\theta} L_{B_t}(\theta_{t-1}) \nonumber \\ \theta_t &=& \theta_{t-1} - m_t \nonumber \end{eqnarray}$$

$\gamma$ is typically set to $0.9$

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas

Nesterov accelerated gradient

$$\begin{eqnarray} m_t &=& \gamma m_{t-1} + \eta \nabla_{\theta} L_{B_t}(\theta_{t-1} - \gamma m_{t-1}) \nonumber \\ \theta_t &=& \theta_{t-1} - m_t \nonumber \end{eqnarray}$$

Better at handling changes in gradient direction.
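In Keras, momentum and its Nesterov variant are options of the SGD optimizer; $\eta = 0.1$ and $\gamma = 0.9$ follow the values discussed above, and the model is assumed defined as earlier:

```python
from tensorflow.keras.optimizers import SGD

opt = SGD(learning_rate=0.1,   # eta
          momentum=0.9,        # gamma
          nesterov=True)       # use the Nesterov accelerated variant
model.compile(optimizer=opt,   # model assumed defined as earlier
              loss="sparse_categorical_crossentropy")
```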

Why Momentum Really Works

(figure)
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to initial value of $\eta$
Need learning rate scheduling

Adam: adaptive learning rate scale for each param
Global $\eta$ set to 3e-4 often works well enough
Good default choice of optimizer (often)

But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 reg)...

Active area of research: K-FAC, a stochastic second-order method based on an invertible approximation of the Fisher information matrix of the network.
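A sketch of the Adam default mentioned above, with the model assumed defined as in the earlier sketches:

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=3e-4),  # the 3e-4 default from above
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])                # model assumed defined earlier
```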

The Karpathy Constant for Adam

(figure)

Optimizers around a saddle point

(animation; credits: Alec Radford)

Lab 2: back in 15min!