Neural Networks and Backpropagation
Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters $\theta$:

$$ \mathbf{f}(\cdot; \theta): \mathbb{R}^N \rightarrow (0, 1)^K $$

Sample $s$ in dataset $S$:

input: $\mathbf{x}^s \in \mathbb{R}^N$
expected output: $y^s \in [0, K-1]$

Output is a conditional probability distribution:

$$ \mathbf{f}(\mathbf{x}^s; \theta)_c = P(Y = c | X = \mathbf{x}^s) $$
Artificial Neuron

$z(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

$f(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + b)$

$\mathbf{x}, f(\mathbf{x})$: input and output
$z(\mathbf{x})$: pre-activation
$\mathbf{w}, b$: weights and bias
$g$: activation function
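As a rough sketch, a single neuron is just a dot product, a bias, and a nonlinearity; the input values and the tanh activation below are illustrative assumptions, not from the slides:

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    z = np.dot(w, x) + b   # pre-activation z(x) = w^T x + b
    return g(z)            # output f(x) = g(z(x))

x = np.array([0.5, -1.2, 0.3])   # input
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
print(neuron(x, w, b))
```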
Layer of Neurons

$\mathbf{f}(\mathbf{x}) = g(\mathbf{z}(\mathbf{x})) = g(\mathbf{W} \mathbf{x} + \mathbf{b})$

$\mathbf{W}, \mathbf{b}$: now a matrix and a vector
One Hidden Layer Network

$\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$

$\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$

$\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$

$\mathbf{f}(\mathbf{x}) = \mathrm{softmax}(\mathbf{z}^o(\mathbf{x})) = \mathrm{softmax}(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
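A minimal NumPy sketch of this forward pass; the layer sizes (N=4, H=8, K=3) and the tanh hidden activation are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max(z) for numerical stability
    return e / e.sum()

def forward(x, Wh, bh, Wo, bo, g=np.tanh):
    zh = Wh @ x + bh          # z^h(x) = W^h x + b^h
    h = g(zh)                 # h(x) = g(z^h(x))
    zo = Wo @ h + bo          # z^o(x) = W^o h(x) + b^o
    return softmax(zo)        # f(x) = softmax(z^o(x))

rng = np.random.default_rng(0)
N, H, K = 4, 8, 3             # input, hidden and output sizes
x = rng.normal(size=N)
Wh, bh = 0.01 * rng.normal(size=(H, N)), np.zeros(H)
Wo, bo = 0.01 * rng.normal(size=(K, H)), np.zeros(K)
print(forward(x, Wh, bh, Wo, bo))   # K probabilities summing to 1
```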
Alternate representation

(figure: alternate diagram of the same one-hidden-layer network)
Keras implementation

(figure: code screenshot)
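The screenshot itself is not recoverable from this export; a plausible reconstruction in modern Keras, where the hidden size (100), the tanh activation, and the input/output sizes are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

N, K = 784, 10   # e.g. flattened 28x28 images, 10 classes (assumption)

model = keras.Sequential([
    keras.Input(shape=(N,)),
    layers.Dense(100, activation="tanh"),    # hidden layer h(x)
    layers.Dense(K, activation="softmax"),   # output layer f(x)
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```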
Element-wise activation functions

(figure: activation functions and their derivatives; blue: activation function, green: derivative)
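The plotted curves are not recoverable here; as a sketch, these are the usual candidates (sigmoid, tanh, ReLU) with their derivatives in NumPy, though which ones the slide actually shows is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):                 # derivative: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def dtanh(x):                    # derivative of np.tanh
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def drelu(x):                    # subgradient, taken as 0 at x = 0
    return (x > 0).astype(float)
```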
Softmax function

$$ \mathrm{softmax}(\mathbf{x}) = \frac{1}{\sum_{i=1}^{n} e^{x_i}} \cdot \begin{bmatrix} e^{x_1}\\ e^{x_2}\\ \vdots\\ e^{x_n} \end{bmatrix} $$

$$ \frac{\partial\, \mathrm{softmax}(\mathbf{x})_i}{\partial x_j} = \begin{cases} \mathrm{softmax}(\mathbf{x})_i \cdot (1 - \mathrm{softmax}(\mathbf{x})_i) & i = j\\ -\mathrm{softmax}(\mathbf{x})_i \cdot \mathrm{softmax}(\mathbf{x})_j & i \neq j \end{cases} $$
vector of values in $(0, 1)$ that add up to $1$

$p(Y = c|X = \mathbf{x}) = \text{softmax}(\mathbf{z}(\mathbf{x}))_c$

the pre-activation vector $\mathbf{z}(\mathbf{x})$ is often called "the logits"
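A NumPy sketch following the two formulas above; the max-shift is a standard numerical-stability trick (it leaves the result unchanged), and the Jacobian expression packs both cases into one matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # shift by max(x): same output, no overflow
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    # diagonal entries: s_i (1 - s_i); off-diagonal: -s_i s_j
    return np.diag(s) - np.outer(s, s)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))             # values in (0, 1), summing to 1
print(softmax_jacobian(logits))
```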
Training the network

Find parameters $\theta = (\mathbf{W}^h; \mathbf{b}^h; \mathbf{W}^o; \mathbf{b}^o)$ that minimize the negative log likelihood (or cross entropy)
The loss function for a given sample $s \in S$:

$$ l(\mathbf{f}(\mathbf{x}^s;\theta), y^s) = nll(\theta; \mathbf{x}^s, y^s) = -\log \mathbf{f}(\mathbf{x}^s;\theta)_{y^s} $$
(figure: worked example of the loss on one sample)
The cost function is the negative log likelihood of the model computed on the full training set (for i.i.d. samples):

$$ L_S(\theta) = -\frac{1}{|S|} \sum_{s \in S} \log \mathbf{f}(\mathbf{x}^s;\theta)_{y^s} + \lambda \Omega(\theta) $$
$\lambda \Omega(\theta) = \lambda (||W^h||^2 + ||W^o||^2)$ is an optional regularization term.
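A small NumPy sketch of the per-sample nll and the full cost $L_S$; the function and argument names here are hypothetical stand-ins for $\mathbf{f}(\cdot;\theta)$ and $\lambda$:

```python
import numpy as np

def nll(probs, y):
    """Per-sample loss: -log f(x; theta)_y, with probs = f(x; theta)."""
    return -np.log(probs[y])

def cost(probas, ys, lam=0.0, Wh=None, Wo=None):
    """L_S: mean nll over the set, plus the optional L2 penalty."""
    loss = np.mean([nll(p, y) for p, y in zip(probas, ys)])
    if lam > 0.0:
        loss += lam * (np.sum(Wh ** 2) + np.sum(Wo ** 2))
    return loss

print(nll(np.array([0.1, 0.7, 0.2]), y=1))   # -log 0.7 ~= 0.357
```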
Stochastic Gradient Descent

Initialize $\theta$ randomly
For $E$ epochs perform:

Randomly select a small batch of samples ($B \subset S$)
Compute gradients: $\Delta = \nabla_\theta L_B(\theta)$
Update parameters: $\theta \leftarrow \theta - \eta \Delta$

$\eta > 0$ is called the learning rate
Stop when a stopping criterion is reached:

nll stops decreasing when computed on the validation set
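A schematic but runnable SGD loop following the steps above, on a hypothetical quadratic toy loss; in the real network, $\Delta$ would come from backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))          # toy dataset S (assumption)
theta = rng.normal(size=4)              # initialize theta randomly
eta, E, batch_size = 0.1, 20, 32        # learning rate, epochs, |B|

def grad_LB(theta, batch):
    # toy gradient: pulls theta towards the batch mean (stand-in for grad of L_B)
    return theta - batch.mean(axis=0)

for epoch in range(E):                  # for E epochs perform:
    for _ in range(len(X) // batch_size):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # B subset of S
        delta = grad_LB(theta, X[idx])  # Delta = grad_theta L_B(theta)
        theta = theta - eta * delta     # theta <- theta - eta * Delta
# in practice: stop when nll on a validation set stops decreasing
```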
Computing Gradients

Output weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^o_{i,j}}$
Output bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^o_{i}}$
Hidden weights: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^h_{i,j}}$
Hidden bias: $\frac{\partial l(\mathbf{f(x)}, y)}{\partial b^h_{i}}$
The network is a composition of differentiable modules

We can apply the "chain rule"
Chain rule

(figure: the chain rule unrolled through the layers of the network)
Backpropagation

Compute partial derivatives of the loss:

$$ \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}_i} = \frac{\partial (-\log \mathbf{f(x)}_y)}{\partial \mathbf{f(x)}_i} = \frac{-1_{y=i}}{\mathbf{f(x)}_y} $$

$$ \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{z}^o(\mathbf{x})_i} = \sum_j \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}_j} \frac{\partial \mathbf{f(x)}_j}{\partial \mathbf{z}^o(\mathbf{x})_i} $$
...
Backpropagation

Gradients:

$\nabla_{\mathbf{z}^o(\mathbf{x})} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$

$\nabla_{\mathbf{b}^o} l(\mathbf{f(x)}, y) = \mathbf{f(x)} - \mathbf{e}(y)$

because $\frac{\partial \mathbf{z}^o(\mathbf{x})_i}{\partial \mathbf{b}^o_j} = 1_{i=j}$ (here $\mathbf{e}(y)$ denotes the one-hot encoding of $y$)
Partial derivatives related to $\mathbf{W}^o$:

$\frac{\partial l(\mathbf{f(x)}, y)}{\partial W^o_{i,j}} = \sum_{k} \frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{z}^o(\mathbf{x})_k} \frac{\partial \mathbf{z}^o(\mathbf{x})_k}{\partial W^o_{i,j}}$

$\nabla_{\mathbf{W}^o} l(\mathbf{f(x)}, y) = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$
Backprop gradients

Compute activation gradients:

$\nabla_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$
Compute layer parameter gradients:

$\nabla_{\mathbf{W}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l \cdot \mathbf{h(x)}^\top$

$\nabla_{\mathbf{b}^o} l = \nabla_{\mathbf{z}^o(\mathbf{x})} l$
Compute previous layer activation gradients:

$\nabla_{\mathbf{h(x)}} l = \mathbf{W}^{o\top} \nabla_{\mathbf{z}^o(\mathbf{x})} l$
$\nabla_{\mathbf{z}^h(\mathbf{x})} l = \nabla_{\mathbf{h(x)}} l \odot \sigma'(\mathbf{z}^h(\mathbf{x}))$
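A NumPy sketch of this backward pass for one sample $(x, y)$, reusing the shapes of the earlier forward-pass sketch; tanh is an assumed choice for the hidden activation $g$ (so $\sigma'(z) = 1 - \tanh^2(z)$):

```python
import numpy as np

def backprop(x, y, Wh, bh, Wo, bo):
    g, dg = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # g and sigma'
    # forward pass
    zh = Wh @ x + bh
    h = g(zh)
    zo = Wo @ h + bo
    e = np.exp(zo - zo.max())
    f = e / e.sum()                 # f(x) = softmax(z^o(x))
    # backward pass
    dzo = f.copy()
    dzo[y] -= 1.0                   # grad_{z^o} l = f(x) - e(y)
    dWo = np.outer(dzo, h)          # grad_{W^o} l = grad_{z^o} l . h(x)^T
    dbo = dzo                       # grad_{b^o} l = grad_{z^o} l
    dh = Wo.T @ dzo                 # grad_h l = W^{o T} grad_{z^o} l
    dzh = dh * dg(zh)               # grad_{z^h} l = grad_h l ⊙ sigma'(z^h)
    dWh = np.outer(dzh, x)          # same two formulas for the hidden layer
    dbh = dzh
    return dWh, dbh, dWo, dbo
```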
Loss, Initialization and Learning Tricks

Discrete output (classification)

Binary classification: $y \in [0, 1]$

$Y|X=\mathbf{x} \sim Bernoulli(b=f(\mathbf{x}; \theta))$
output function: $logistic(x) = \frac{1}{1 + e^{-x}}$
loss function: binary cross-entropy

Multiclass classification: $y \in [0, K-1]$

$Y|X=\mathbf{x} \sim Multinoulli(\mathbf{p}=\mathbf{f}(\mathbf{x}; \theta))$
output function: $softmax$
loss function: categorical cross-entropy
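A sketch of how these output/loss pairings look in Keras; the single-layer models and the class count are assumptions for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

K = 10   # number of classes (assumption)

binary = keras.Sequential([layers.Dense(1, activation="sigmoid")])
binary.compile(optimizer="sgd", loss="binary_crossentropy")

multiclass = keras.Sequential([layers.Dense(K, activation="softmax")])
multiclass.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```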

Continuous output (regression)

Continuous output: $\mathbf{y} \in \mathbb{R}^n$

$Y|X=\mathbf{x} \sim \mathcal{N}(\mathbf{\mu}=\mathbf{f}(\mathbf{x}; \theta), \sigma^2 \mathbf{I})$
output function: Identity
loss function: square loss

Heteroscedastic if $\mathbf{f}(\mathbf{x}; \theta)$ predicts both $\mathbf{\mu}$ and $\sigma^2$

Mixture Density Network (multimodal output)

$Y|X=\mathbf{x} \sim GMM_{\mathbf{x}}$
$\mathbf{f}(\mathbf{x}; \theta)$ predicts all the parameters: the means, covariance matrices and mixture weights

Initialization and normalization

Input data should be normalized to have approximately the same range: standardization or quantile normalization

Initializing $W^h$ and $W^o$:

Zero is a saddle point: no gradient, no learning

Constant init: hidden units collapse by symmetry

Solution: random init, e.g. $w \sim \mathcal{N}(0, 0.01)$

Better inits: the Xavier Glorot and Kaiming He schemes, and orthogonal init

Biases can (should) be initialized to zero
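In Keras the init schemes mentioned above are built in; a sketch (glorot_uniform is in fact the Dense default, and the layer sizes are assumptions):

```python
from tensorflow.keras import layers

hidden = layers.Dense(100, activation="tanh",
                      kernel_initializer="glorot_uniform",  # Xavier Glorot
                      bias_initializer="zeros")             # biases at zero
output = layers.Dense(10, kernel_initializer="he_normal")   # Kaiming He
# an "orthogonal" kernel_initializer is also built in
```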

SGD learning rate

Very sensitive:

Too high $\rightarrow$ early plateau or even divergence
Too low $\rightarrow$ slow convergence

Try a large value first: $\eta = 0.1$ or even $\eta = 1$
Divide by 10 and retry in case of divergence

A large constant learning rate prevents final convergence:
multiply $\eta_{t}$ by $\beta < 1$ after each update

or monitor the validation loss and divide $\eta_{t}$ by 2 or 10 when no progress is made
See ReduceLROnPlateau in Keras
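A sketch of that callback; the factor/patience values are illustrative, and the model and training data are hypothetical, assumed defined as in the earlier sketches:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="val_loss",  # watch validation loss
                              factor=0.1,          # divide eta_t by 10
                              patience=5,          # after 5 epochs without progress
                              min_lr=1e-6)
# hypothetical model / data, defined as in the earlier sketches:
model.fit(X_train, y_train, validation_split=0.1,
          epochs=50, callbacks=[reduce_lr])
```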

Momentum

Accumulate gradients across successive updates:

$$\begin{eqnarray} m_t &=& \gamma m_{t-1} + \eta \nabla_{\theta} L_{B_t}(\theta_{t-1}) \nonumber \\ \theta_t &=& \theta_{t-1} - m_t \nonumber \end{eqnarray}$$

$\gamma$ is typically set to $0.9$

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas

Nesterov accelerated gradient

$$\begin{eqnarray} m_t &=& \gamma m_{t-1} + \eta \nabla_{\theta} L_{B_t}(\theta_{t-1} - \gamma m_{t-1}) \nonumber \\ \theta_t &=& \theta_{t-1} - m_t \nonumber \end{eqnarray}$$

Better at handling changes in gradient direction.
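In Keras, momentum and its Nesterov variant are options of the SGD optimizer; $\eta = 0.1$ and $\gamma = 0.9$ follow the values discussed above, and the model is assumed defined as earlier:

```python
from tensorflow.keras.optimizers import SGD

opt = SGD(learning_rate=0.1,   # eta
          momentum=0.9,        # gamma
          nesterov=True)       # use the Nesterov accelerated variant
model.compile(optimizer=opt,   # model assumed defined as earlier
              loss="sparse_categorical_crossentropy")
```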

Why Momentum Really Works

(figure)
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to initial value of $\eta$
Need learning rate scheduling

Adam: adaptive learning rate scale for each param
Global $\eta$ set to 3e-4 often works well enough
Good default choice of optimizer (often)

But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 reg)...

Active area of research: K-FAC, a stochastic second-order method based on an invertible approximation of the Fisher information matrix of the network.
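A sketch of the Adam default mentioned above, with the model assumed defined as in the earlier sketches:

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=3e-4),  # the 3e-4 default from above
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])                # model assumed defined earlier
```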

The Karpathy Constant for Adam

(figure)

Optimizers around a saddle point

(animation; credits: Alec Radford)

Lab 2: back in 15min!