
Deep Learning

Lecture 4 – Deep Neural Networks II

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

4.1 Output and Loss Functions

4.2 Activation Functions

4.3 Preprocessing and Initialization

4.1
Output and Loss Functions
Output and Loss Functions

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- The output layer is the last layer in a neural network; it computes the network's output
- The loss function compares the result of the output layer to the target value(s)
- The choice of output layer and loss function depends on the task (discrete, continuous, ...)
Loss Function
What is the goal of optimizing the loss function?
- It tries to make the model output (= prediction) similar to the target (= data)
- Think of the loss function as a measure of the cost paid for a prediction

[Figure: prediction far from target ⇒ large cost; prediction close to target ⇒ small cost]
[Figure: p_data vs. p_model over y; left: KL divergence large, right: KL divergence small]
Loss Function
How to design a good loss function?
- A loss function can be any differentiable function that we wish to optimize
- Deriving the cost function from the maximum likelihood principle removes the burden of manually designing the cost function for each model
- Consider the output of the neural network as parameters of a distribution over y_i:

    ŵ_ML = argmax_w p_model(y | X, w)
         = argmax_w ∏_{i=1}^N p_model(y_i | x_i, w)        (i.i.d. assumption)
         = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)    (log-likelihood)
Loss Function

[Figure: Gaussian density p(y) over y, with the predicted mean and the target value marked]

Example:
- The neural network f_w(x) predicts the mean µ of a Gaussian distribution over y:

    p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²))

- We want to maximize the probability of the target y under this distribution
Recap: Regression

[Figure: Input → Model → Output, e.g., an image mapped to the value 143,52 €]

- Mapping: f_w : R^N → R

Recap: Binary Classification

[Figure: Input → Model → Output, e.g., an image classified as "Beach"]

- Mapping: f_w : R^{W×H} → {"Beach", "No Beach"}

Recap: Multi-Class Classification

[Figure: Input → Model → Output, e.g., an image classified as "Beach"]

- Mapping: f_w : R^{W×H} → {"Beach", "Mountain", "City", "Forest"}
Regression Problems

Gaussian Distribution

Gaussian distribution:

    p(y) = 1/√(2πσ²) · exp(−(y − µ)² / (2σ²))

- µ: mean
- σ: standard deviation
- The distribution has thin "tails": p(y) → 0 quickly as |y| → ∞
- It thus penalizes outliers strongly

[Figure: Gaussian density p(y) over y]
Gaussian Distribution / L2 Loss

Let p_model(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²)) be a Gaussian distribution. We obtain:

    ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
         = argmax_w −∑_{i=1}^N (1/2) log(2πσ²) − ∑_{i=1}^N (1/(2σ²)) (f_w(x_i) − y_i)²
         = argmax_w −∑_{i=1}^N (f_w(x_i) − y_i)²          (terms constant in w dropped)
         = argmin_w ∑_{i=1}^N (f_w(x_i) − y_i)²           (L2 loss)

In other words, we minimize the squared loss (= L2 loss), which is strongly affected by outliers.
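
A minimal NumPy sketch of this equivalence (our illustrative code; the function and variable names are not from the slides): with σ fixed, the Gaussian negative log-likelihood equals half the L2 loss plus a constant, so both have the same minimizer.

    import numpy as np

    def gaussian_nll(pred, target, sigma=1.0):
        # Negative log-likelihood of the targets under N(pred, sigma^2)
        return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                      + (target - pred)**2 / (2 * sigma**2))

    def l2_loss(pred, target):
        # Squared (L2) loss
        return np.sum((pred - target)**2)

    pred = np.array([0.5, 1.2, -0.3])
    target = np.array([0.0, 1.0, 0.0])
    # The difference is a constant independent of pred: N/2 * log(2*pi*sigma^2)
    print(gaussian_nll(pred, target) - 0.5 * l2_loss(pred, target))
    print(1.5 * np.log(2 * np.pi))  # same value for N = 3, sigma = 1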
Laplace Distribution

Laplace distribution:

    p(y) = 1/(2b) · exp(−|y − µ| / b)

- µ: location
- b: scale
- The distribution has heavy "tails": p(y) → 0 more slowly as |y| → ∞
- It penalizes outliers less strongly and is thus often preferred in practice

[Figure: Laplace density p(y) over y]
Laplace Distribution / L1 Loss

Let p_model(y|x, w) = 1/(2b) · exp(−|y − f_w(x)| / b) be a Laplace distribution. We obtain:

    ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
         = argmax_w −∑_{i=1}^N log(2b) − ∑_{i=1}^N (1/b) |f_w(x_i) − y_i|
         = argmax_w −∑_{i=1}^N |f_w(x_i) − y_i|           (terms constant in w dropped)
         = argmin_w ∑_{i=1}^N |f_w(x_i) − y_i|            (L1 loss)

We minimize the absolute loss (= L1 loss), which is more robust to outliers than the L2 loss.
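
A small illustrative sketch (ours, not from the slides) of why L1 is more robust: an outlier's contribution to the L2 gradient grows linearly with the residual, while its contribution to the L1 gradient is bounded.

    import numpy as np

    residuals = np.array([0.1, -0.2, 8.0])  # the last entry is an outlier

    # Gradient of the per-sample loss with respect to the prediction:
    l2_grad = 2 * residuals        # unbounded: the outlier dominates
    l1_grad = np.sign(residuals)   # bounded in [-1, 1]: outlier weighted like inliers

    print(l2_grad)  # [ 0.2 -0.4 16. ]
    print(l1_grad)  # [ 1. -1.  1.]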
Predicting all Parameters

Let p_model(y|x, w) = 1/(2 g_w(x)) · exp(−|y − f_w(x)| / g_w(x)) be a Laplace distribution. We obtain:

    ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
         = argmax_w −∑_{i=1}^N log(2 g_w(x_i)) − ∑_{i=1}^N (1/g_w(x_i)) |f_w(x_i) − y_i|

In this case, we predict both the location µ and the scale b with a neural network. f_w(x) and g_w(x) are typically the same network except for the last layer. This allows for estimating the aleatoric uncertainty (observation noise) with the neural network itself.

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
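
A hedged sketch of this heteroscedastic loss (our own minimal version, not the authors' code): the network has two heads, one for the location and one for the scale. Here we predict the log-scale and exponentiate, one common alternative to the squashing functions mentioned below.

    import numpy as np

    def laplace_nll(loc, log_scale, target):
        # Negative log-likelihood of the targets under Laplace(loc, scale);
        # exponentiating the predicted log-scale keeps the scale positive.
        scale = np.exp(log_scale)
        return np.sum(np.log(2 * scale) + np.abs(target - loc) / scale)

    # Assumed toy outputs of the two heads (f_w and g_w on the slide):
    loc = np.array([0.1, 1.9, 3.2])         # predicted locations
    log_scale = np.array([-1.0, 0.0, 1.0])  # predicted log-scales (uncertainty)
    target = np.array([0.0, 2.0, 3.0])

    print(laplace_nll(loc, log_scale, target))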
Predicting all Parameters

[Figure: example results from the paper, showing predictions together with estimated aleatoric uncertainty]

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
Mixture Density Networks
To represent multi-modal distributions, we can also model mixture densities:

    p_model(y|x, w) = ∑_{m=1}^M π_m · 1/(2 g_w^(m)(x)) · exp(−|y − f_w^(m)(x)| / g_w^(m)(x))

Example:
- Mixture of Laplace distributions
- π_m ∈ [0, 1]: weight of mode m
- Constraint: ∑_m π_m = 1
- Location µ_m and scale b_m of each mode m are modeled by the neural network

[Figure: multi-modal mixture density p(y) over y]

Bishop: Mixture Density Networks. 1994.
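
A minimal sketch of the corresponding negative log-likelihood for one sample (our illustrative code; softmax-normalizing unconstrained logits is one way to satisfy the constraint on the weights π):

    import numpy as np

    def mixture_laplace_nll(logits, locs, scales, target):
        # logits: (M,) unnormalized mixture weights; softmax enforces sum(pi) = 1
        # locs, scales: (M,) location and (positive) scale of each mode
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        densities = pi / (2 * scales) * np.exp(-np.abs(target - locs) / scales)
        return -np.log(densities.sum())

    # Assumed toy mixture with M = 2 modes:
    print(mixture_laplace_nll(np.array([0.0, 1.0]),
                              np.array([-2.0, 2.0]),
                              np.array([0.5, 0.5]), target=1.8))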
Output Layer for Regression Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- For most outputs (e.g., µ ∈ R), the output layer is linear (no non-linearity)
- For some outputs (e.g., b ∈ R+), we need a squashing function (ReLU, softplus)

Bishop: Mixture Density Networks. 1994.


Loss Function for Regression Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- Gaussian and Laplacian model distributions correspond to the L2 and L1 loss functions
- It is also possible to predict the uncertainty (variance/scale) or multiple modes (MDN)

Bishop: Mixture Density Networks. 1994.


Classification Problems

Image Classification

MNIST Handwritten Digits:

[Figure: sample MNIST digit images]

- One of the most popular datasets in ML (many variants, still in use today)
- Based on data from the National Institute of Standards and Technology
- Handwritten by Census Bureau employees and high-school students
- Resolution: 28 × 28 pixels; 60k labeled training samples, 10k test samples

LeCun, Bottou, Bengio and Haffner: Gradient-based learning applied to document recognition. IEEE, 1998.
Image Classification

Curse of Dimensionality:
- There exist 2^784 ≈ 10^236 possible binary images of resolution 28 × 28 pixels
- MNIST is gray-scale, thus 256^784 combinations ⇒ impossible to enumerate
- Why is image classification with just 60k labeled training images even possible?
- Answer: Images are concentrated on a low-dimensional manifold in {1, ..., 256}^784

LeCun, Bottou, Bengio and Haffner: Gradient-based learning applied to document recognition. IEEE, 1998.
Bernoulli Distribution

Bernoulli distribution:

    p(y) = µ^y (1 − µ)^(1−y)

- µ: probability for y = 1
- Handles only two classes, e.g., "cats" vs. "dogs"

[Figure: bar plot of p(y) for y ∈ {0, 1}]
Bernoulli Distribution / BCE Loss

Let p_model(y|x, w) = f_w(x)^y (1 − f_w(x))^(1−y) be a Bernoulli distribution. We obtain:

    ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
         = argmax_w ∑_{i=1}^N log [f_w(x_i)^{y_i} (1 − f_w(x_i))^{(1−y_i)}]
         = argmin_w ∑_{i=1}^N −y_i log f_w(x_i) − (1 − y_i) log(1 − f_w(x_i))    (BCE loss)

In other words, we minimize the binary cross-entropy (BCE) loss.

Remark: The last layer of f_w(x) can be a sigmoid function such that f_w(x) ∈ [0, 1].
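
A minimal NumPy sketch of the BCE loss (illustrative names; the clipping constant is our own numerical guard against log(0)):

    import numpy as np

    def bce_loss(prob, target):
        # prob = f_w(x) in (0, 1), target in {0, 1}
        eps = 1e-12
        prob = np.clip(prob, eps, 1 - eps)
        return np.sum(-target * np.log(prob) - (1 - target) * np.log(1 - prob))

    prob = np.array([0.9, 0.2, 0.6])
    target = np.array([1.0, 0.0, 1.0])
    print(bce_loss(prob, target))  # low cost for confident, correct predictions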
How can we scale this to multiple classes?
Categorical Distribution
Categorical distribution:

    p(y = c) = µ_c

- µ_c: probability for class c
- Multiple classes, multiple modes

Alternative notation:

    p(y) = ∏_{c=1}^C µ_c^{y_c}

- y: "one-hot" vector with y_c ∈ {0, 1}
- y = (0, ..., 0, 1, 0, ..., 0)^T with all zeros except for one (the true class)

[Figure: bar plot of a categorical distribution over four classes]
One-Hot Vector Representation

    class c | one-hot vector y
    --------|-----------------
    1       | (1, 0, 0, 0)^T
    2       | (0, 1, 0, 0)^T
    3       | (0, 0, 1, 0)^T
    4       | (0, 0, 0, 1)^T

- One-hot vector y with binary elements y_c ∈ {0, 1}
- The index c with y_c = 1 determines the correct class, and y_k = 0 for k ≠ c
- Interpretation as a discrete distribution with all probability mass at the true class
- Often used in ML as it can make the formalism more convenient
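
A one-line helper for this encoding (our illustrative sketch; note it uses 0-based indices while the slide counts classes from 1):

    import numpy as np

    def one_hot(c, num_classes):
        # Return the one-hot vector for class index c
        y = np.zeros(num_classes)
        y[c] = 1.0
        return y

    print(one_hot(2, 4))  # [0. 0. 1. 0.], i.e., class 3 in the slide's numbering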
Categorical Distribution / CE Loss

Let p_model(y|x, w) = ∏_{c=1}^C f_w^(c)(x)^{y_c} be a Categorical distribution. We obtain:

    ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
         = argmax_w ∑_{i=1}^N log ∏_{c=1}^C f_w^(c)(x_i)^{y_{i,c}}
         = argmin_w ∑_{i=1}^N ∑_{c=1}^C −y_{i,c} log f_w^(c)(x_i)    (CE loss)

In other words, we minimize the cross-entropy (CE) loss.

The target y = (0, ..., 0, 1, 0, ..., 0)^T is a "one-hot" vector with y_c its c'th element.
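
A minimal sketch of the CE loss for one sample, given predicted class probabilities (our illustrative code; the small constant guards against log(0)):

    import numpy as np

    def ce_loss(probs, target_one_hot):
        # Cross-entropy between a one-hot target and predicted class probabilities
        return -np.sum(target_one_hot * np.log(probs + 1e-12))

    probs = np.array([0.7, 0.2, 0.05, 0.05])
    target = np.array([1.0, 0.0, 0.0, 0.0])
    print(ce_loss(probs, target))  # -log(0.7) ≈ 0.357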
Softmax

How can we ensure that f_w^(c)(x) predicts a valid Categorical (discrete) distribution?
- We must guarantee (1) f_w^(c)(x) ∈ [0, 1] and (2) ∑_{c=1}^C f_w^(c)(x) = 1
- An element-wise sigmoid as output function would ensure (1) but not (2)
- Solution: The softmax function guarantees both (1) and (2):

    softmax(x) = (exp(x_1) / ∑_{k=1}^C exp(x_k), ..., exp(x_C) / ∑_{k=1}^C exp(x_k))

- Let s denote the network output after the last affine layer (= scores). Then:

    f_w^(c)(x) = exp(s_c) / ∑_{k=1}^C exp(s_k)   ⇒   log f_w^(c)(x) = s_c − log ∑_{k=1}^C exp(s_k)

- Remark: s_c is a direct contribution to the loss function, i.e., it does not saturate
Log Softmax
Intuition: Assume c is the correct class. Our goal is to maximize the log softmax:

    log f_w^(c)(x) = s_c − log ∑_{k=1}^C exp(s_k)

- The first term encourages the score s_c for the correct class c to increase
- The second term encourages all scores in s to decrease
- The second term can be approximated by log ∑_{k=1}^C exp(s_k) ≈ max_k s_k, as exp(s_k) is insignificant for all s_k < max_k s_k
- Thus, the loss always strongly penalizes the most active incorrect prediction
- If the correct class already has the largest score (i.e., s_c = max_k s_k), both terms roughly cancel and the example contributes little to the overall training cost
Log Softmax Example

[Figure: bar plots of the scores s_c and the exponentiated scores exp(s_c) for four classes; the largest score is s_3 = 4]

- The second term becomes: log ∑_{k=1}^C exp(s_k) = 4.06 ≈ s_3 = max_k s_k
- For c = 2 we obtain: log f_w^(c)(x) = s_c − log ∑_{k=1}^C exp(s_k) = 1 − 4.06 ≈ −3
- For c = 3 we obtain: log f_w^(c)(x) = s_c − log ∑_{k=1}^C exp(s_k) = 4 − 4.06 ≈ 0
Softmax
- Predicting C values/scores overparameterizes the Categorical distribution
- As the distribution sums to 1, only C − 1 parameters are necessary
- Example: Consider C = 2 and fix one degree of freedom (x_2 = 0):

    softmax(x) = (exp(x_1) / (exp(x_1) + exp(x_2)), exp(x_2) / (exp(x_1) + exp(x_2)))
               = (exp(x_1) / (exp(x_1) + 1), 1 / (exp(x_1) + 1))
               = (1 / (1 + exp(−x_1)), 1 − 1 / (1 + exp(−x_1)))
               = (σ(x_1), 1 − σ(x_1))

- The softmax is a multi-class generalization of the sigmoid function
- In practice, the overparameterized version is often used (simpler to implement)
Softmax

    softmax(s) = (exp(s_1) / ∑_{k=1}^C exp(s_k), ..., exp(s_C) / ∑_{k=1}^C exp(s_k))

- The name "softmax" is confusing; "soft argmax" would be more precise, as it is a continuous and differentiable version of argmax (in one-hot representation)
- Example with four classes:

[Figure: bar plots of scores s_c, exponentiated scores exp(s_c), softmax(s), and argmax(s); softmax places most mass on the largest score, argmax is exactly one-hot]
Softmax

- Softmax responds to differences between its inputs
- It is invariant to adding the same scalar c to all of its inputs:

    softmax(x) = softmax(x + c)

- We can therefore derive a numerically more stable variant:

    softmax(x) = softmax(x − max_{k=1..C} x_k)

- This allows accurate computation even when x is large
- It illustrates again that softmax depends only on differences between scores
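
A minimal NumPy sketch of the stable variant (our illustrative code): shifting by the maximum score leaves the result unchanged but avoids overflow in exp.

    import numpy as np

    def softmax(s):
        # Numerically stable softmax: shift by max(s) before exponentiating
        z = np.exp(s - np.max(s))
        return z / np.sum(z)

    s = np.array([1000.0, 1001.0, 1002.0])
    print(softmax(s))  # [0.09 0.24 0.67]; a naive exp(1002.0) would overflow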
Cross Entropy Loss with Softmax Example
Putting it together: cross-entropy loss for a single training sample (x, y) ∈ X:

    CE loss: −∑_{c=1}^C y_c log f_w^(c)(x)

Example: Suppose C = 4 and 4 training samples x with labels y (input images omitted):

    Label y        | Predicted scores s | softmax(s)                 | CE Loss
    (1, 0, 0, 0)^T | (+3, +1, −1, −1)^T | (0.85, 0.12, 0.02, 0.02)^T | 0.16
    (0, 1, 0, 0)^T | (+3, +3, +1, +0)^T | (0.46, 0.46, 0.06, 0.02)^T | 0.78
    (0, 0, 1, 0)^T | (+1, +1, +1, +1)^T | (0.25, 0.25, 0.25, 0.25)^T | 1.38
    (0, 0, 0, 1)^T | (+3, +2, +3, −1)^T | (0.42, 0.16, 0.42, 0.01)^T | 4.87

- Sample 4 contributes most strongly to the loss function! (the elephant in the room)
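
A quick NumPy check of the last table row (our illustrative code):

    import numpy as np

    def softmax(s):
        z = np.exp(s - np.max(s))
        return z / np.sum(z)

    scores = np.array([3.0, 2.0, 3.0, -1.0])  # sample 4 from the table
    label = np.array([0.0, 0.0, 0.0, 1.0])

    probs = softmax(scores)
    ce = -np.sum(label * np.log(probs))
    print(np.round(probs, 2), round(ce, 2))
    # [0.42 0.15 0.42 0.01] 4.87, matching the table up to rounding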
Output Layer for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- For 2 classes, we can predict 1 value and use a sigmoid, or 2 values with a softmax
- For C > 2 classes, we typically predict C scores and use a softmax non-linearity
Loss Function for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- For 2 classes, we use the binary cross-entropy (BCE) loss
- For C > 2 classes, we use the cross-entropy (CE) loss
4.2
Activation Functions
Activation Functions

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- Hidden layer h_i = g(A_i h_{i−1} + b_i) with activation function g(·) and weights A_i, b_i
- The activation function is frequently applied element-wise to its input
- Activation functions must be non-linear to learn non-linear mappings
- Some of them are not differentiable everywhere (but still ok for training)
Activation Functions
Sigmoid:

    g(x) = 1 / (1 + exp(−x))

- Maps the input to the range [0, 1]
- Neuroscience interpretation as a saturating "firing rate" of neurons

Problems:
- Saturation "kills" gradients
- Outputs are not zero-centered
- Introduces a bias after the first layer

[Figure: sigmoid curve]
Activation Functions

Sigmoid Problem #1:

[Figure: sigmoid curve, flat for large |x|]

- The downstream gradient becomes zero when the input x is saturated: g'(x) ≈ 0
- No learning if x is very small (< −10)
- No learning if x is very large (> 10)
Activation Functions
Sigmoid Problem #2:

    g(x) = 1 / (1 + exp(−x)),   x = ∑_i a_i x_i + b

- The sigmoid is always positive ⇒ the inputs x_i to the next layer are too
- The gradient of the sigmoid is always positive

The gradient wrt. parameter a_i is given by:

    ∂L/∂a_i = (∂L/∂g)(∂g/∂a_i) = (∂L/∂g)(∂g/∂x)(∂x/∂a_i) = (∂L/∂g)(∂g/∂x) x_i

- Since (∂g/∂x) and x_i are both positive: sgn(∂L/∂a_i) = sgn(∂L/∂g)
- All gradients have the same sign (+ or −)
Activation Functions
Sigmoid Problem #2:

[Figure: in 2D weight space, same-sign gradients allow updates only in two quadrants, so the update path zig-zags instead of following the optimal update direction]

- Restricts gradient updates and leads to inefficient optimization (minibatches help)
Activation Functions

Tanh:

    g(x) = 2 / (1 + exp(−2x)) − 1

- Maps the input to the range [−1, 1]
- Anti-symmetric
- Zero-centered

Problems:
- Again, saturation "kills" gradients

[Figure: tanh curve]

LeCun, Kanter and Solla: Second-order properties of error surfaces: learning time and generalization. NIPS, 1991.
Activation Functions

Rectified Linear Unit (ReLU):

    g(x) = max(0, x)

- Does not saturate (for x > 0)
- Leads to fast convergence
- Computationally efficient

Problems:
- Not zero-centered
- No learning for x < 0 ⇒ dead ReLUs

[Figure: ReLU curve]

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions
ReLU Problem:

[Figure: ReLU curve, zero for all x < 0]

- The downstream gradient becomes zero when the input x < 0
- Results in so-called "dead ReLUs" that never participate in learning
- Therefore, ReLUs are often initialized with a positive bias (b > 0)

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions

Leaky ReLU:

    g(x) = max(0.01x, x)

- Does not saturate (i.e., will not die)
- Closer to zero-centered outputs
- Leads to fast convergence
- Computationally efficient

[Figure: Leaky ReLU curve]

Maas: Rectifier nonlinearities improve neural network acoustic models. ICML, 2013.
Activation Functions

Parametric ReLU:

    g(x) = max(αx, x)

- Does not saturate (i.e., will not die)
- Leads to fast convergence
- Computationally efficient
- The parameter α is learned from data

[Figure: Parametric ReLU curve]

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
Activation Functions

Exponential Linear Unit (ELU):

    g(x) = x                if x > 0
    g(x) = α(exp(x) − 1)    if x ≤ 0

- All benefits of Leaky ReLU
- Adds some robustness to noise
- Default: α = 1

[Figure: ELU curve]

Clevert, Unterthiner and Hochreiter: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR, 2016.
Activation Functions

Maxout:

    g(x) = max(a_1^T x + b_1, a_2^T x + b_2)

- Generalizes ReLU and Leaky ReLU
- Increases the number of parameters per neuron

[Figure: maxout units realizing rectifier, absolute value, and quadratic-like shapes]

Goodfellow, Warde-Farley, Mirza, Courville and Bengio: Maxout Networks. ICML, 2013.
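
A compact NumPy sketch of the activation functions discussed above (our illustrative implementations):

    import numpy as np

    def sigmoid(x):    return 1 / (1 + np.exp(-x))
    def tanh(x):       return np.tanh(x)             # equals 2 / (1 + exp(-2x)) - 1
    def relu(x):       return np.maximum(0, x)
    def leaky_relu(x): return np.maximum(0.01 * x, x)
    def prelu(x, a):   return np.maximum(a * x, x)   # a is learned from data
    def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))

    x = np.linspace(-3, 3, 7)
    print(relu(x))  # [0. 0. 0. 0. 1. 2. 3.]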
Activation Functions

Summary:
- No one-size-fits-all: the choice of activation function depends on the problem
- We only showed the most common ones; many more exist
- The best activation function/model is often found by trial-and-error in practice
- It is important to ensure a good "gradient flow" during optimization

Rule of Thumb:
- Use ReLU by default (with a small enough learning rate)
- Try Leaky ReLU, Maxout, or ELU for a small additional gain
- Prefer Tanh over Sigmoid (Tanh is often used in recurrent models)
Implementation

Numerical Differentiation
- Murphy: "Anything that can go wrong will."
- When implementing the backward pass of activation, output or loss functions it is important to ensure that all gradients are correct!
- Verify via Newton's difference quotient:

    ∂f(x)/∂x = lim_{h→0} (f(x + h) − f(x)) / h

- Even better: the symmetric difference quotient:

    ∂f(x)/∂x = lim_{h→0} (f(x + h) − f(x − h)) / (2h)

[Figure: the secant through f(x) and f(x + h) approximates the tangent at x]
Numerical Differentiation

How to choose h?
- For h = 0 the expression is undefined
- Choose h to trade off:
  - rounding error (finite precision, h too small)
  - approximation error (inaccurate secant, h too large)
- Good choice: h = ∛ε, with ε the machine precision
- Examples:
  - ε = 6 × 10^−8 for single precision (32 bit)
  - ε = 1 × 10^−16 for double precision (64 bit)

See en.wikipedia.org/wiki/Numerical_differentiation
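
A minimal gradient-check sketch using the symmetric difference quotient with h = ∛ε (our illustrative code, applied to the sigmoid of the next slide):

    import numpy as np

    def numeric_grad(f, x, h=None):
        # Symmetric difference quotient; h defaults to the cube root of eps
        if h is None:
            h = np.finfo(x.dtype).eps ** (1 / 3)
        return (f(x + h) - f(x - h)) / (2 * h)

    sigmoid = lambda x: 1 / (1 + np.exp(-x))
    analytic = lambda x: sigmoid(x) * (1 - sigmoid(x))

    x = np.linspace(-10, 10, 101)
    err = np.max(np.abs(numeric_grad(sigmoid, x) - analytic(x)))
    print(err)  # very small for double precision (on the order of 1e-11)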
Numerical Differentiation
Example: Sigmoid derivative using symmetric differences with single precision:

    σ(x) = 1 / (1 + e^(−x)),   ∂σ(x)/∂x = σ(x)(1 − σ(x))

[Figure: analytic vs. numeric gradient of the sigmoid for three step sizes]
- h = 10^0:  max. error 0.0189 (approximation error dominates)
- h = 10^−4: max. error 0.0006 (good trade-off)
- h = 10^−6: max. error 0.0568 (rounding error dominates)
4.3
Preprocessing and Initialization
Data Preprocessing
Data Preprocessing
Remember what happens for positive inputs:

    g(x) = g(∑_i a_i x_i + b)

The gradient wrt. parameter a_i is given by:

    ∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

- Both (∂g/∂x) and x_i are positive
- All gradients have the same sign (+ or −)
- We should pre-process the input data such that it is "well distributed"

[Figure: allowed gradient update directions vs. optimal update, leading to a zig-zag update path]
Data Preprocessing

- Zero-center: x_{i,j} ← x_{i,j} − µ_j with µ_j = (1/N) ∑_{i=1}^N x_{i,j}
- Normalize: x_{i,j} ← x_{i,j} / σ_j with σ_j² = (1/N) ∑_{i=1}^N (x_{i,j} − µ_j)²

[Figure: scatter plots of original, zero-centered, and normalized data]
Data Preprocessing

- Decorrelate: multiply with the eigenvectors of the covariance matrix
- Whiten: divide by the square root of the eigenvalues of the covariance matrix

[Figure: scatter plots of decorrelated and whitened data]
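
A minimal NumPy sketch of these four steps (our illustrative code; X holds one sample per row):

    import numpy as np

    def preprocess(X):
        X = X - X.mean(axis=0)               # zero-center each feature
        X = X / X.std(axis=0)                # normalize each feature
        cov = X.T @ X / X.shape[0]
        eigvals, eigvecs = np.linalg.eigh(cov)
        X = X @ eigvecs                      # decorrelate (rotate onto eigenvectors)
        X = X / np.sqrt(eigvals + 1e-8)      # whiten (constant avoids division by 0)
        return X

    X = np.random.randn(1000, 2) @ np.array([[2.0, 1.0], [0.0, 1.0]])
    Xw = preprocess(X)
    print(np.round(Xw.T @ Xw / Xw.shape[0], 2))  # ≈ identity covariance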
Data Preprocessing

[Figure: classifier on original data vs. zero-centered data]

- The classification loss becomes less sensitive to changes in the weight matrix
Data Preprocessing

Common Practices for Images:
- AlexNet: subtract the mean image (mean image: W × H × 3 numbers)
- VGGNet: subtract the per-channel mean (mean along each channel: 3 numbers)
- ResNet: subtract the per-channel mean and divide by the per-channel std. dev. (mean and std. dev. along each channel: 3 numbers each)
- Whitening is less common
Weight Initialization
Recap: Stochastic Gradient Descent

Algorithm for training an MLP using (stochastic) gradient descent:

1. Initialize weights w, pick learning rate η and minibatch size |X_batch|
2. Draw a (random) minibatch X_batch ⊆ X
3. For all elements (x, y) ∈ X_batch of the minibatch (in parallel) do:
   3.1 Forward propagate x through the network to calculate h_1, h_2, ..., ŷ
   3.2 Backpropagate gradients through the network to obtain ∇_w L(ŷ, y)
4. Update weights: w_{t+1} = w_t − η (1/|X_batch|) ∑_{(x,y)∈X_batch} ∇_w L(ŷ, y)
5. If the validation error decreases, go to step 2, otherwise stop

Question:
- How do we best initialize the weights w?
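
A minimal sketch of this loop for a linear model (our illustrative code; a real MLP would compute the gradient in step 3.2 via backpropagation):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

    w = 0.01 * rng.normal(size=5)   # 1. initialize weights
    eta, batch_size = 0.1, 32       #    learning rate and minibatch size

    for step in range(500):
        idx = rng.integers(0, len(X), batch_size)   # 2. draw a random minibatch
        Xb, yb = X[idx], y[idx]
        pred = Xb @ w                               # 3.1 forward pass
        grad = 2 * Xb.T @ (pred - yb) / batch_size  # 3.2 gradient of the mean L2 loss
        w = w - eta * grad                          # 4. update weights

    print(np.round(w, 2))  # ≈ [ 1. -2. 0.5 0. 3. ]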
Constant Initialization

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

- How do we initialize the parameters w of all network layers?
- Simple solution: set all network parameters to a constant (e.g., w = 0)
- Learning would not be possible (all units of each layer learn the same function)
Weight Initialization

Consider a layer in a Multi-Layer Perceptron:

    g(x) = g(∑_i a_i x_i + b)

The gradient wrt. parameter a_i is given by:

    ∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

Remark:
- For g(·), we will use Tanh and ReLU in the following
Small Random Numbers

Tanh Activation Function:

[Figure: per-layer activation histograms; activations collapse towards zero in deeper layers]

- Draw weights independently from a Gaussian with small std. dev. (σ = 0.01)
- Activations (= activation function outputs) in deeper layers tend towards zero
- Gradients wrt. the weights thus also tend towards zero ⇒ no learning:

    ∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g)(∂g/∂x) · 0 = 0
Large Random Numbers

Tanh Activation Function:

[Figure: per-layer activation histograms; all activations saturate at ±1]

- Draw weights independently from a Gaussian with large std. dev. (σ = 0.2)
- All activation functions saturate
- Local gradients all become zero ⇒ no learning:

    ∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g) · 0 · x_i = 0
Xavier Initialization

Tanh Activation Function:

[Figure: per-layer activation histograms; well scaled across all layers]

- Glorot et al. draw weights independently from a Gaussian with σ² = 1/D_in
- D_in denotes the dimension of the input to the layer and may vary across layers
- The activation distribution is now well scaled across all layers

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTAT, 2010.
Xavier Initialization

Why σ² = 1/D_in? Let us consider y = g(w^T x) and assume that all x_i and w_i are independently and identically distributed (i.i.d.) with zero mean. Let further g'(0) = 1. Then:

    Var(y) ≈ Var(w^T x) = D_in Var(x_i w_i)
                        = D_in (E[x_i² w_i²] − E[x_i w_i]²)
                        = D_in (E[x_i²] E[w_i²] − E[x_i]² E[w_i]²)
                        = D_in E[x_i²] E[w_i²]
                        = D_in Var(x_i) Var(w_i)

Thus:

    Var(w_i) = 1/D_in  ⇒  Var(y) = Var(x_i)

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTAT, 2010.
Xavier Initialization

ReLU Activation Function:

[Figure: per-layer activation histograms; activations collapse towards zero in deeper layers]

- Xavier initialization assumes a zero-centered activation function
- For ReLU, activations again start collapsing to zero in deeper layers

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTAT, 2010.
He Initialization

ReLU Activation Function:

[Figure: per-layer activation histograms; well scaled across all layers]

- Since ReLU is restricted to positive outputs, the variance must be doubled
- He et al. draw weights independently from a Gaussian with σ² = 2/D_in
- The activation distribution is now well scaled across all layers

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
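
A minimal sketch of both schemes (our illustrative code), checking that He initialization preserves the activation magnitude through a deep ReLU stack:

    import numpy as np

    rng = np.random.default_rng(0)

    def init(d_in, d_out, mode="he"):
        # Xavier: Var(w) = 1/D_in (zero-centered activations); He: Var(w) = 2/D_in (ReLU)
        std = np.sqrt((2.0 if mode == "he" else 1.0) / d_in)
        return rng.normal(0.0, std, size=(d_in, d_out))

    h = x = rng.normal(size=(512, 256))
    for _ in range(10):                   # 10-layer ReLU stack
        h = np.maximum(0, h @ init(256, 256, mode="he"))

    rms = lambda a: np.sqrt((a**2).mean())
    print(rms(x), rms(h))  # both ≈ 1; with mode="xavier" the magnitude would shrink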
Summary

Data Preprocessing:
- Zero-centering the network inputs is important for efficient learning
- Decorrelation and whitening are used less frequently

Weight Initialization:
- Proper initialization is important for ensuring a good "gradient flow"
- For zero-centered activation functions, use Xavier initialization
- For ReLU activation functions, use He initialization
- Initialization remains a research topic; there is much more literature on it
