
Neural networks and Backpropagation

Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters θ:

f(⋅; θ) : ℝ^N → (0, 1)^K

Sample s in dataset S:

input: x^s ∈ ℝ^N
expected output: y^s ∈ [0, K − 1]

Output is a conditional probability distribution:

f(x^s; θ)_c = P(Y = c | X = x^s)
Artificial Neuron

z(x) = w^T x + b
f(x) = g(w^T x + b)

x, f(x): input and output
z(x): pre-activation
w, b: weights and bias
g: activation function
Layer of Neurons

f(x) = g(z(x)) = g(Wx + b)

W, b are now a matrix and a vector
One Hidden Layer Network

z^h(x) = W^h x + b^h
h(x) = g(z^h(x)) = g(W^h x + b^h)
z^o(x) = W^o h(x) + b^o
f(x) = softmax(z^o(x)) = softmax(W^o h(x) + b^o)

Alternate representation (figure not reproduced)
One Hidden Layer Network

Keras implementation

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# N: input dimension, H: hidden layer size, K: number of classes
model = Sequential()
model.add(Dense(H, input_dim=N))    # weight matrix of shape [N x H]
model.add(Activation("tanh"))
model.add(Dense(K))                 # weight matrix of shape [H x K]
model.add(Activation("softmax"))
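The same forward pass can also be written directly in NumPy. This is a minimal sketch under illustrative assumptions (the toy dimensions and variable names are mine, not from the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by the max for numerical stability
    return e / e.sum()

def forward(x, W_h, b_h, W_o, b_o):
    z_h = W_h @ x + b_h              # hidden pre-activation
    h = np.tanh(z_h)                 # hidden activation, g = tanh
    z_o = W_o @ h + b_o              # output pre-activation (the "logits")
    return softmax(z_o)              # conditional probabilities over K classes

# toy dimensions: N = 4 inputs, H = 3 hidden units, K = 2 classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(0, 0.01, (3, 4)), np.zeros(3)
W_o, b_o = rng.normal(0, 0.01, (2, 3)), np.zeros(2)
print(forward(rng.normal(size=4), W_h, b_h, W_o, b_o))   # probabilities summing to 1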
Element-wise activation functions

(figure not reproduced: activation functions in blue, their derivatives in green)
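As a rough NumPy sketch (my own illustration, not taken from the slides), the usual element-wise activations and their derivatives can be written as:

import numpy as np

def tanh(z):      return np.tanh(z)
def dtanh(z):     return 1.0 - np.tanh(z) ** 2

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z):  return sigmoid(z) * (1.0 - sigmoid(z))

def relu(z):      return np.maximum(0.0, z)
def drelu(z):     return (z > 0).astype(z.dtype)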
Softmax function

softmax(x) = 1 / (∑_{i=1}^{n} e^{x_i}) ⋅ [e^{x_1}, e^{x_2}, …, e^{x_n}]^T

∂softmax(x)_i / ∂x_j = softmax(x)_i ⋅ (1 − softmax(x)_i)   if i = j
∂softmax(x)_i / ∂x_j = −softmax(x)_i ⋅ softmax(x)_j        if i ≠ j

vector of values in (0, 1) that add up to 1

p(Y = c|X = x) = softmax(z(x))_c

the pre-activation vector z(x) is often called "the logits"
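A minimal NumPy sketch (illustrative, not from the slides); subtracting the max before exponentiating does not change the result but avoids overflow:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())                    # values in (0, 1) that add up to 1

# Jacobian matching the derivative above: diag(p) - p p^T
J = np.diag(p) - np.outer(p, p)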
Training the network

Find parameters θ = (W^h; b^h; W^o; b^o) that minimize the negative log likelihood (or cross entropy)

The loss function for a given sample s ∈ S:

l(f(x^s; θ), y^s) = nll(x^s, y^s; θ) = − log f(x^s; θ)_{y^s}

The cost function is the negative log likelihood of the model computed on the full training set (for i.i.d. samples):

L_S(θ) = − (1 / |S|) ∑_{s∈S} log f(x^s; θ)_{y^s} + λΩ(θ)

λΩ(θ) = λ(||W^h||² + ||W^o||²) is an optional regularization term.
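A minimal NumPy sketch of this cost on a batch of predictions (variable names are illustrative, not from the slides):

import numpy as np

def nll(probs, y):
    # probs: shape (n_samples, K), predicted probabilities f(x^s; θ)
    # y: shape (n_samples,), integer class labels
    # an optional L2 penalty λ(||W^h||² + ||W^o||²) can be added to this value
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
print(nll(probs, np.array([0, 1])))  # both samples well classified: small loss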
Stochastic Gradient Descent

Initialize θ randomly

For E epochs perform:

Randomly select a small batch of samples (B ⊂ S)
Compute gradients: Δ = ∇_θ L_B(θ)
Update parameters: θ ← θ − ηΔ
η > 0 is called the learning rate
Repeat until the epoch is completed (all of S is covered)

Stop when reaching criterion:

nll stops decreasing when computed on validation set
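A minimal sketch of this loop in NumPy, assuming a hypothetical grad_fn(θ, X_batch, Y_batch) that returns the mini-batch gradient ∇_θ L_B(θ) (names and default values are illustrative):

import numpy as np

def sgd(theta, grad_fn, X, Y, lr=0.1, epochs=10, batch_size=32):
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)             # new random order each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]  # B ⊂ S
            delta = grad_fn(theta, X[batch], Y[batch])
            theta = theta - lr * delta               # θ ← θ − ηΔ
        # monitor the validation nll here and stop when it no longer decreases
    return theta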
Computing Gradients

Output Weights: ∂l(f(x), y) / ∂W^o_{i,j}
Output bias: ∂l(f(x), y) / ∂b^o_i
Hidden Weights: ∂l(f(x), y) / ∂W^h_{i,j}
Hidden bias: ∂l(f(x), y) / ∂b^h_i

The network is a composition of differentiable modules

We can apply the "chain rule"
Chain rule

(chain-rule figures not reproduced)

For composed functions, derivatives multiply: if l depends on x only through u, then ∂l/∂x = (∂l/∂u) ⋅ (∂u/∂x), applied layer by layer from the output back to the input.
Backpropagation

Compute partial derivatives of the loss:

∂l/∂f(x)_i = ∂l(f(x), y)/∂f(x)_i = ∂(−log f(x)_y)/∂f(x)_i = −1_{y=i} / f(x)_y

∂l/∂z^o(x)_i = ?
Chain rule!

(derivation figures not reproduced)

e(y): one-hot encoding of y
Backpropagation

Gradients

∇_{z^o(x)} l = f(x) − e(y)
∇_{b^o} l = f(x) − e(y)

because z^o(x) = W^o h(x) + b^o and then ∂z^o(x)_i / ∂b^o_j = 1_{i=j}
Backpropagation

Partial derivatives related to W^o:

∂l / ∂W^o_{i,j} = ∑_k (∂l / ∂z^o(x)_k) ⋅ (∂z^o(x)_k / ∂W^o_{i,j})

∇_{W^o} l = (f(x) − e(y)) ⋅ h(x)^⊤
Backprop gradients

Compute activation gradients:

∇_{z^o(x)} l = f(x) − e(y)

Compute layer params gradients:

∇_{W^o} l = ∇_{z^o(x)} l ⋅ h(x)^⊤
∇_{b^o} l = ∇_{z^o(x)} l

Compute prev layer activation gradients:

∇_{h(x)} l = W^{o⊤} ∇_{z^o(x)} l
∇_{z^h(x)} l = ∇_{h(x)} l ⊙ g′(z^h(x))
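Putting the formulas above together, here is a minimal NumPy sketch for the one-hidden-layer network with a tanh hidden layer (my own illustration; variable names are not from the slides):

import numpy as np

def backprop(x, y, W_h, b_h, W_o, b_o):
    # forward pass
    z_h = W_h @ x + b_h
    h = np.tanh(z_h)
    z_o = W_o @ h + b_o
    e_z = np.exp(z_o - z_o.max())
    f = e_z / e_z.sum()                        # softmax output

    e_y = np.zeros_like(f)
    e_y[y] = 1.0                               # e(y): one-hot encoding of y

    # backward pass
    grad_z_o = f - e_y                         # ∇_{z^o(x)} l
    grad_W_o = np.outer(grad_z_o, h)           # ∇_{W^o} l = ∇_{z^o(x)} l ⋅ h(x)^⊤
    grad_b_o = grad_z_o                        # ∇_{b^o} l
    grad_h   = W_o.T @ grad_z_o                # ∇_{h(x)} l
    grad_z_h = grad_h * (1 - np.tanh(z_h)**2)  # ⊙ g′(z^h(x)) for g = tanh
    grad_W_h = np.outer(grad_z_h, x)
    grad_b_h = grad_z_h
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o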
Loss, Initialization and Learning Tricks
Discrete output (classification)

Binary classification: y ∈ [0, 1]

Y|X = x ∼ Bernoulli(b = f(x; θ))
output function: logistic(x) = 1 / (1 + e^(−x))
loss function: binary cross-entropy

Multiclass classification: y ∈ [0, K − 1]

Y|X = x ∼ Multinoulli(p = f(x; θ))
output function: softmax
loss function: categorical cross-entropy
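In Keras terms, a plausible sketch of the two setups (N, H, K and the optimizer/loss strings are illustrative choices, not prescribed by the slides):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

N, H, K = 10, 32, 3   # example dimensions

# binary classification: a single logistic output unit + binary cross-entropy
binary_model = Sequential([Dense(H, activation="tanh", input_dim=N),
                           Dense(1, activation="sigmoid")])
binary_model.compile(optimizer="sgd", loss="binary_crossentropy")

# multiclass classification: K softmax outputs + categorical cross-entropy
# (the sparse_* variant expects integer labels y in [0, K − 1] rather than one-hot vectors)
multi_model = Sequential([Dense(H, activation="tanh", input_dim=N),
                          Dense(K, activation="softmax")])
multi_model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")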
Continuous output (regression)

Continuous output: y ∈ ℝ^n

Y|X = x ∼ N(μ = f(x; θ), σ²I)
output function: Identity
loss function: square loss

Heteroscedastic if f(x; θ) predicts both μ and σ²

Mixture Density Network (multimodal output)

Y|X = x ∼ GMM_x
f(x; θ) predicts all the parameters: the means, covariance matrices and mixture weights
Initialization and normalization

Input data should be normalized to have approx. the same range: standardization or quantile normalization

Initializing W^h and W^o:

Zero is a saddle point: no gradient, no learning
Constant init: hidden units collapse by symmetry
Solution: random init, e.g. w ∼ N(0, 0.01)

Better inits: Xavier Glorot and Kaiming He, and orthogonal

Biases can (should) be initialized to zero
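In Keras these schemes are available as built-in initializers; a small sketch (H is an illustrative layer width):

from tensorflow.keras.layers import Dense

H = 128   # example hidden layer width

# Glorot (Xavier) uniform is the Keras default for Dense kernels; biases default to zero
layer = Dense(H, kernel_initializer="glorot_uniform", bias_initializer="zeros")

# He initialization is a common choice for ReLU-like units
relu_layer = Dense(H, activation="relu", kernel_initializer="he_normal")

# orthogonal initialization
ortho_layer = Dense(H, kernel_initializer="orthogonal")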
SGD learning rate

Very sensitive:
Too high → early plateau or even divergence
Too low → slow convergence

Try a large value first: η = 0.1 or even η = 1
Divide by 10 and retry in case of divergence

A large constant learning rate prevents final convergence:
multiply η_t by β < 1 after each update
or monitor validation loss and divide η_t by 2 or 10 when no progress
See ReduceLROnPlateau in Keras
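A sketch of that callback (the monitor, factor and patience values are illustrative, and model / X_train / X_val stand for the earlier Keras model and some training/validation data):

from tensorflow.keras.callbacks import ReduceLROnPlateau

# divide the learning rate by 10 when the validation loss stops improving for 5 epochs
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100,
          callbacks=[reduce_lr])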
Momentum

Accumulate gradients across successive updates:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1})
θ_t = θ_{t−1} − m_t

γ is typically set to 0.9

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas

Nesterov accelerated gradient:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1} − γ m_{t−1})
θ_t = θ_{t−1} − m_t

Better at handling changes in gradient direction.
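Both update rules fit in a few lines of NumPy; a sketch assuming a hypothetical grad_fn(θ) that returns the mini-batch gradient ∇_θ L_B(θ):

import numpy as np

def momentum_step(theta, m, grad_fn, lr=0.1, gamma=0.9, nesterov=False):
    lookahead = theta - gamma * m if nesterov else theta
    m = gamma * m + lr * grad_fn(lookahead)   # m_t = γ m_{t−1} + η ∇_θ L_B(...)
    theta = theta - m                         # θ_t = θ_{t−1} − m_t
    return theta, m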
Why Momentum Really Works

(figures not reproduced)
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to initial value of η
Needs learning rate scheduling

Adam: adaptive learning rate scale for each param
Global η set to 3e-4 often works well enough
Good default choice of optimizer (often)
But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 reg)...

Promising stochastic second order methods: K-FAC and Shampoo can be used to accelerate training of very large models.
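A sketch of how these optimizers are typically configured in Keras (the learning rates shown just echo the values quoted above, and model refers to the earlier network):

from tensorflow.keras.optimizers import SGD, Adam

# SGD with Nesterov momentum: sensitive to the initial learning rate,
# usually combined with a learning rate schedule
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Adam with the 3e-4 default often works well out of the box
adam = Adam(learning_rate=3e-4)

model.compile(optimizer=adam,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])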
The Karpathy Constant for Adam

(figure not reproduced; the "constant" is the 3e-4 learning rate mentioned above)
Optimizers around a saddle point

(figure not reproduced; credits: Alec Radford)
