
Neural networks and Backpropagation

Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters θ:

f(⋅; θ): ℝ^N → (0, 1)^K

Sample s in dataset S:

input: x^s ∈ ℝ^N
expected output: y^s ∈ [0, K − 1]

Output is a conditional probability distribution:

f(x^s; θ)_c = P(Y = c | X = x^s)
Artificial Neuron

z(x) = w^T x + b
f(x) = g(w^T x + b)

x, f(x): input and output
z(x): pre-activation
w, b: weights and bias
g: activation function
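As a minimal illustration, here is a NumPy sketch of a single artificial neuron with tanh as the activation g (the input dimension and the numeric values are made up for this example):

import numpy as np

def neuron(x, w, b, g=np.tanh):
    """Single artificial neuron: pre-activation z(x) = w^T x + b, output f(x) = g(z(x))."""
    z = np.dot(w, x) + b   # pre-activation
    return g(z)            # activation

# toy example with a 3-dimensional input
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05
print(neuron(x, w, b))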
Layer of Neurons

f(x) = g(z(x)) = g(Wx + b)

W, b are now a matrix and a vector
One Hidden Layer Network

z^h(x) = W^h x + b^h
h(x) = g(z^h(x)) = g(W^h x + b^h)
z^o(x) = W^o h(x) + b^o
f(x) = softmax(z^o(x)) = softmax(W^o h(x) + b^o)
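For concreteness, a minimal NumPy sketch of the forward pass of this one-hidden-layer network, assuming tanh as the hidden activation g and toy dimensions N=4, H=3, K=2 (the softmax helper is defined inline):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

def forward(x, W_h, b_h, W_o, b_o):
    z_h = W_h @ x + b_h          # hidden pre-activation
    h = np.tanh(z_h)             # hidden activation g
    z_o = W_o @ h + b_o          # output pre-activation ("logits")
    return softmax(z_o)          # conditional probabilities f(x)

# toy dimensions: N=4 inputs, H=3 hidden units, K=2 classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(3, 4)), np.zeros(3)
W_o, b_o = rng.normal(size=(2, 3)), np.zeros(2)
print(forward(rng.normal(size=4), W_h, b_h, W_o, b_o))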
One Hidden Layer Network

Alternate representation

One Hidden Layer Network

Keras implementation

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(H, input_dim=N))   # hidden weight matrix of shape [N x H]
model.add(Activation("tanh"))
model.add(Dense(K))                # output weight matrix of shape [H x K]
model.add(Activation("softmax"))
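A hedged usage sketch for training the model defined above (the optimizer, loss name, and the X_train / y_train arrays are assumptions, not part of the original slide):

# assuming X_train has shape (n_samples, N) and y_train contains integer labels in [0, K-1]
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=15, validation_split=0.1)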
Element-wise activation functions

Figure: activation functions (blue) and their derivatives (green)
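As an illustration, a small NumPy sketch of a few common element-wise activations and their derivatives (this particular selection is an assumption; the slide's figure may show a different set):

import numpy as np

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2          # derivative of tanh

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # derivative of the logistic sigmoid

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    return (z > 0).astype(float)          # derivative of ReLU (0 at z = 0 by convention)

z = np.linspace(-3.0, 3.0, 7)
print(np.tanh(z), dtanh(z))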
Softmax function

softmax(x) = (1 / Σ_{i=1}^{n} e^{x_i}) ⋅ [e^{x_1}, e^{x_2}, …, e^{x_n}]^T

∂softmax(x)_i / ∂x_j = softmax(x)_i ⋅ (1 − softmax(x)_i)   if i = j
∂softmax(x)_i / ∂x_j = −softmax(x)_i ⋅ softmax(x)_j        if i ≠ j

vector of values in (0, 1) that add up to 1

p(Y = c | X = x) = softmax(z(x))_c

the pre-activation vector z(x) is often called "the logits"
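A minimal NumPy sketch of the softmax and of its Jacobian, matching the two cases above (the shift by max(x) is a standard numerical-stability detail, not part of the slide):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    # J[i, j] = s_i * (1 - s_i) if i == j, else -s_i * s_j
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 0.5])
print(softmax(x))               # values in (0, 1) that add up to 1
print(softmax_jacobian(x))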
Training the network

Find parameters θ = (W^h; b^h; W^o; b^o) that minimize the negative log likelihood (or cross entropy)

The loss function for a given sample s ∈ S:

l(f(x^s; θ), y^s) = nll(x^s, y^s; θ) = − log f(x^s; θ)_{y^s}

The cost function is the negative likelihood of the model computed on the full training set (for i.i.d. samples):

L_S(θ) = − (1 / |S|) Σ_{s∈S} log f(x^s; θ)_{y^s} + λΩ(θ)

λΩ(θ) = λ(||W^h||² + ||W^o||²) is an optional regularization term.
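A minimal NumPy sketch of this cost function evaluated on a batch of samples (the probas array is assumed to hold the network outputs f(x^s; θ); the λ value is an arbitrary example):

import numpy as np

def nll_cost(probas, y, W_h, W_o, lam=1e-4):
    """Mean negative log likelihood plus the optional L2 regularization term.

    probas: array of shape (n_samples, K) with f(x^s; theta) for each sample
    y:      integer labels of shape (n_samples,)
    """
    n = probas.shape[0]
    log_likelihood = np.log(probas[np.arange(n), y])      # log f(x^s; theta)_{y^s}
    reg = lam * (np.sum(W_h ** 2) + np.sum(W_o ** 2))     # lambda * Omega(theta)
    return -np.mean(log_likelihood) + reg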
Stochastic Gradient Descent

Initialize θ randomly

For E epochs perform:

Randomly select a small batch of samples (B ⊂ S)
Compute gradients: Δ = ∇_θ L_B(θ)
Update parameters: θ ← θ − ηΔ
η > 0 is called the learning rate
Repeat until the epoch is completed (all of S is covered)

Stop when reaching criterion:

nll stops decreasing when computed on validation set
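A sketch of this loop in NumPy-style Python (the grad_L function, the dataset arrays, and the batch size are assumptions; the gradient computation itself is the subject of the backpropagation slides that follow):

import numpy as np

def sgd(theta, X, Y, grad_L, lr=0.1, epochs=10, batch_size=32):
    """theta: dict of parameter arrays; grad_L(theta, X_b, Y_b) returns a dict of gradients."""
    n = X.shape[0]
    for epoch in range(epochs):                 # for E epochs
        order = np.random.permutation(n)        # visit the samples in random order
        for start in range(0, n, batch_size):   # small batches B ⊂ S until S is covered
            idx = order[start:start + batch_size]
            grads = grad_L(theta, X[idx], Y[idx])       # Δ = ∇_θ L_B(θ)
            for name in theta:
                theta[name] -= lr * grads[name]         # θ ← θ − ηΔ
    return theta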
Computing Gradients

Output Weights: ∂l(f(x), y) / ∂W^o_{i,j}
Output bias: ∂l(f(x), y) / ∂b^o_i
Hidden Weights: ∂l(f(x), y) / ∂W^h_{i,j}
Hidden bias: ∂l(f(x), y) / ∂b^h_i

The network is a composition of differentiable modules

We can apply the "chain rule"
Chain rule

Figure: chain-rule
Backpropagation

Compute partial derivatives of the loss:

∂l(f(x), y) / ∂f(x)_i = ∂(−log f(x)_y) / ∂f(x)_i = −1_{y=i} / f(x)_y

∂l / ∂z^o(x)_i = ?
Chain rule!

e(y): one-hot encoding of y
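The intermediate steps of this derivation appear only as figures in the original slides; as a textual reconstruction (using the softmax derivative from the earlier slide, and noting that only the j = y term of the sum is nonzero):

∂l / ∂z^o(x)_i = Σ_j (∂l / ∂f(x)_j) ⋅ (∂softmax(z^o(x))_j / ∂z^o(x)_i)
              = (−1 / f(x)_y) ⋅ f(x)_y ⋅ (1_{y=i} − f(x)_i)
              = f(x)_i − 1_{y=i}

Stacking the components over i gives ∇_{z^o(x)} l = f(x) − e(y), the result used on the next slide.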
Backpropagation

Gradients

∇_{z^o(x)} l = f(x) − e(y)
∇_{b^o} l = f(x) − e(y)

because z^o(x) = W^o h(x) + b^o and then ∂z^o(x)_i / ∂b^o_j = 1_{i=j}
Backpropagation

Partial derivatives related to W^o:

∂l / ∂W^o_{i,j} = Σ_k (∂l / ∂z^o(x)_k) ⋅ (∂z^o(x)_k / ∂W^o_{i,j})

∇_{W^o} l = (f(x) − e(y)) ⋅ h(x)^⊤
Backprop gradients

Compute activation gradients:

∇_{z^o(x)} l = f(x) − e(y)

Compute layer parameter gradients:

∇_{W^o} l = ∇_{z^o(x)} l ⋅ h(x)^⊤
∇_{b^o} l = ∇_{z^o(x)} l

Compute previous layer activation gradients:

∇_{h(x)} l = W^{o⊤} ∇_{z^o(x)} l
∇_{z^h(x)} l = ∇_{h(x)} l ⊙ g′(z^h(x))
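Putting these formulas together, a minimal NumPy sketch of one forward/backward pass for a single sample, with tanh as g so that g′(z) = 1 − tanh(z)² (the one-hot helper and the shapes are assumptions):

import numpy as np

def one_hot(y, K):
    e = np.zeros(K)
    e[y] = 1.0
    return e

def forward_backward(x, y, W_h, b_h, W_o, b_o):
    # forward pass
    z_h = W_h @ x + b_h
    h = np.tanh(z_h)
    z_o = W_o @ h + b_o
    e_z = np.exp(z_o - z_o.max())
    f = e_z / e_z.sum()                      # f(x) = softmax(z_o)

    # backward pass (backpropagation)
    d_z_o = f - one_hot(y, len(f))           # ∇_{z^o} l = f(x) − e(y)
    grad_W_o = np.outer(d_z_o, h)            # ∇_{W^o} l = ∇_{z^o} l · h(x)^T
    grad_b_o = d_z_o                         # ∇_{b^o} l = ∇_{z^o} l
    d_h = W_o.T @ d_z_o                      # ∇_{h} l = W^{o T} ∇_{z^o} l
    d_z_h = d_h * (1.0 - np.tanh(z_h) ** 2)  # ∇_{z^h} l = ∇_{h} l ⊙ g'(z^h)
    grad_W_h = np.outer(d_z_h, x)            # same formulas applied to the hidden layer
    grad_b_h = d_z_h
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o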
Loss, Initialization and Learning Tricks
Discrete output (classification)

Binary classification: y ∈ [0, 1]

Y | X = x ∼ Bernoulli(b = f(x; θ))
output function: logistic(x) = 1 / (1 + e^{−x})
loss function: binary cross-entropy

Multiclass classification: y ∈ [0, K − 1]

Y | X = x ∼ Multinoulli(p = f(x; θ))
output function: softmax
loss function: categorical cross-entropy
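As a sketch, the two cross-entropy losses written in NumPy for a single sample (the clipping to avoid log(0) is an implementation detail, not part of the slide):

import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """p = logistic output in (0, 1), y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(p, y, eps=1e-12):
    """p = softmax output vector over K classes, y = integer class in [0, K-1]."""
    return -np.log(np.clip(p[y], eps, 1.0))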
Continuous output (regression)

Continuous output: y ∈ ℝ^n

Y | X = x ∼ N(μ = f(x; θ), σ² I)
output function: identity
loss function: square loss

Heteroscedastic if f(x; θ) predicts both μ and σ²

Mixture Density Network (multimodal output)

Y | X = x ∼ GMM_x
f(x; θ) predicts all the parameters: the means, covariance matrices and mixture weights
Initialization and normalization

Input data should be normalized to have approximately the same range:
standardization or quantile normalization

Initializing W^h and W^o:
Zero is a saddle point: no gradient, no learning
Constant init: hidden units collapse by symmetry
Solution: random init, e.g. w ∼ N(0, 0.01)
Better inits: Xavier Glorot and Kaiming He & orthogonal
Biases can (should) be initialized to zero
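For illustration, a NumPy sketch of the naive random init next to a Glorot-style uniform init (the scaling limit = sqrt(6 / (fan_in + fan_out)) is the commonly cited Glorot formula and is an assumption relative to the slide):

import numpy as np

rng = np.random.default_rng(42)

def naive_init(fan_in, fan_out, std=0.01):
    # w ~ N(0, 0.01): small random values break the symmetry between hidden units
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

def glorot_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform init: keeps activation variance roughly constant across layers
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W_h = glorot_uniform(fan_in=4, fan_out=3)   # hidden weights
b_h = np.zeros(3)                           # biases initialized to zero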
SGD learning rate

Very sensitive:
Too high → early plateau or even divergence
Too low → slow convergence

Try a large value first: η = 0.1 or even η = 1
Divide by 10 and retry in case of divergence

A large constant learning rate prevents final convergence:
multiply η_t by β < 1 after each update
or monitor the validation loss and divide η_t by 2 or 10 when no progress is made
See ReduceLROnPlateau in Keras
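A hedged Keras usage sketch of the last strategy, reusing the model from the earlier Keras slide (the monitored quantity, factor, and patience values are assumptions):

from keras.callbacks import ReduceLROnPlateau

# divide the learning rate by 10 when the validation loss stops improving
lr_schedule = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)

model.fit(X_train, y_train, validation_split=0.1, epochs=50, callbacks=[lr_schedule])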
Momentum

Accumulate gradients across successive updates:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1})
θ_t = θ_{t−1} − m_t

γ is typically set to 0.9

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas.

Nesterov accelerated gradient:

m_t = γ m_{t−1} + η ∇_θ L_{B_t}(θ_{t−1} − γ m_{t−1})
θ_t = θ_{t−1} − m_t

Better at handling changes in gradient direction.
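A small NumPy sketch of both update rules, using the same parameter-dict convention as the SGD sketch earlier (grad_L and the batch are assumptions):

def momentum_step(theta, m, grads, lr=0.1, gamma=0.9):
    """Classical momentum: m_t = γ m_{t-1} + η ∇_θ L, then θ_t = θ_{t-1} - m_t."""
    for name in theta:
        m[name] = gamma * m[name] + lr * grads[name]
        theta[name] -= m[name]
    return theta, m

def nesterov_step(theta, m, grad_L, batch, lr=0.1, gamma=0.9):
    """Nesterov: the gradient is evaluated at the look-ahead point θ_{t-1} - γ m_{t-1}."""
    lookahead = {name: theta[name] - gamma * m[name] for name in theta}
    grads = grad_L(lookahead, *batch)
    for name in theta:
        m[name] = gamma * m[name] + lr * grads[name]
        theta[name] -= m[name]
    return theta, m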
Why Momentum Really Works
Alternative optimizers

SGD (with Nesterov momentum):
Simple to implement
Very sensitive to the initial value of η
Needs learning rate scheduling

Adam: adaptive learning rate scale for each parameter:
Global η set to 3e-4 often works well enough
Good default choice of optimizer (often)
But well-tuned SGD with LR scheduling can generalize better than Adam (with naive L2 regularization)...

Promising stochastic second-order methods: K-FAC and Shampoo can be used to accelerate the training of very large models.
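A hedged Keras sketch of the two default choices discussed here (the learning rates echo the slides; the import path and the learning_rate argument name assume a recent standalone Keras version):

from keras.optimizers import SGD, Adam

# SGD with Nesterov momentum: sensitive to the learning rate, usually needs a schedule
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Adam with the "Karpathy constant" 3e-4: a reasonable default in many cases
adam = Adam(learning_rate=3e-4)

model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])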
The Karpathy Constant for Adam

Optimizers around a saddle point

Credits: Alec Radford

