Lec 04 Deep Networks 2
4.1 Output and Loss Functions
[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ The output layer is the last layer in a neural network and computes the output
▶ The loss function compares the result of the output layer to the target value(s)
▶ The choice of output layer and loss function depends on the task (discrete, continuous, ...)
Loss Function

What is the goal of optimizing the loss function?
▶ It tries to make the model output (= prediction) similar to the target (= data)
▶ Think of the loss function as a measure of the cost being paid for a prediction

[Figure: two plots of p(y|x) comparing p_data and p_model; left: KL divergence large, right: KL divergence small]
Loss Function

How to design a good loss function?
▶ A loss function can be any differentiable function that we wish to optimize
▶ Deriving the cost function from the maximum likelihood principle removes the burden of manually designing the cost function for each model
▶ Consider the output of the neural network as parameters of a distribution over y_i

[Figure: Gaussian p(y) with the predicted mean and the target marked]
Example:
▶ A neural network f_w(x) predicts the mean µ of a Gaussian distribution over y:

p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²))

▶ Mapping: f_w : R^N → R
Recap: Binary Classification

[Figure: example image classified as "Beach"]

Recap: Multi-Class Classification

[Figure: example image classified as "Beach" among multiple classes]
Regression Problems

Gaussian Distribution

Gaussian distribution:

p(y) = 1/√(2πσ²) · exp(−(y − µ)² / (2σ²))

▶ µ: mean
▶ σ: standard deviation
▶ The distribution has thin "tails": p(y) → 0 quickly as |y| → ∞
▶ It thus penalizes outliers strongly

[Figure: Gaussian density p(y) over y]
Gaussian Distribution / L2 Loss

Let p_model(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²)) be a Gaussian distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N ½ log(2πσ²) − Σ_{i=1}^N (1/(2σ²)) (f_w(x_i) − y_i)²
     = argmax_w −Σ_{i=1}^N (f_w(x_i) − y_i)²
     = argmin_w Σ_{i=1}^N (f_w(x_i) − y_i)²    (L2 Loss)

In other words, we minimize the squared loss (= L2 loss), which is strongly affected by outliers.
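A minimal NumPy sketch (not from the slides) of the equivalence just derived: up to constants that do not depend on w, the Gaussian negative log-likelihood is the L2 loss. Function names like gaussian_nll are placeholders.

```python
import numpy as np

def gaussian_nll(pred, target, sigma=1.0):
    # per-sample -log p(y|x,w) for a Gaussian with fixed std sigma
    return 0.5 * np.log(2 * np.pi * sigma**2) + (pred - target)**2 / (2 * sigma**2)

def l2_loss(pred, target):
    return (pred - target)**2

pred = np.array([1.0, 2.5, -0.3])
target = np.array([0.8, 3.0, 0.0])

# The two objectives differ only by a constant offset (and scale),
# so they have the same minimizer w.
print(gaussian_nll(pred, target).sum())
print(0.5 * l2_loss(pred, target).sum() + pred.size * 0.5 * np.log(2 * np.pi))
```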
Laplace Distribution

Laplace distribution:

p(y) = 1/(2b) · exp(−|y − µ| / b)

▶ µ: location
▶ b: scale

[Figure: Laplace density p(y) over y]
Laplace Distribution / L1 Loss

Let p_model(y|x, w) = 1/(2b) · exp(−|y − f_w(x)| / b) be a Laplace distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N log(2b) − Σ_{i=1}^N (1/b) |f_w(x_i) − y_i|
     = argmax_w −Σ_{i=1}^N |f_w(x_i) − y_i|
     = argmin_w Σ_{i=1}^N |f_w(x_i) − y_i|    (L1 Loss)

We minimize the absolute loss (= L1 loss), which is more robust to outliers than the L2 loss.
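A quick numerical illustration (with made-up data) of this robustness claim: for a constant prediction, the L2 loss is minimized by the mean, which an outlier drags away, while the L1 loss is minimized by the median.

```python
import numpy as np

targets = np.array([1.0, 1.1, 0.9, 1.0, 8.0])  # the last value is an outlier
preds = np.linspace(0.0, 3.0, 301)             # candidate constant predictions

l2 = ((preds[:, None] - targets)**2).sum(axis=1)
l1 = np.abs(preds[:, None] - targets).sum(axis=1)

print("L2 minimizer:", preds[l2.argmin()])  # 2.4 = mean, dragged towards the outlier
print("L1 minimizer:", preds[l1.argmin()])  # 1.0 = median, robust
```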
Predicting all Parameters

Let p_model(y|x, w) = 1/(2 g_w(x)) · exp(−|y − f_w(x)| / g_w(x)) be a Laplace distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N log(2 g_w(x_i)) − Σ_{i=1}^N (1/g_w(x_i)) |f_w(x_i) − y_i|

In this case, we predict both the location µ and the scale b with a neural network. f_w and g_w are typically the same network except for the last layer. This allows estimating the aleatoric uncertainty (observation noise) with the neural network itself.

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
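A minimal sketch of the resulting per-sample loss, assuming mu = f_w(x) and b = g_w(x) > 0 are the outputs of the two network heads (the names are placeholders):

```python
import numpy as np

def laplace_nll(mu, b, y):
    # -log p(y|x,w) for a Laplace distribution with location mu and scale b > 0
    return np.log(2.0 * b) + np.abs(y - mu) / b

mu, y = 2.0, 3.0
for b in (0.5, 1.0, 2.0):
    print(f"b={b}: nll={laplace_nll(mu, b, y):.3f}")
# Predicting a large scale b down-weights the residual |y - mu|, but the
# log(2b) term penalizes claiming high uncertainty everywhere.
```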
Predicting all Parameters

[Figure: qualitative examples from Kendall and Gal (2017)]

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
Mixture Density Networks

To represent multi-modal distributions, we can also model mixture densities:

p_model(y|x, w) = Σ_{m=1}^M π_m · 1/(2 g_w^(m)(x)) · exp(−|y − f_w^(m)(x)| / g_w^(m)(x))

Example:
▶ π_m ∈ [0, 1]: weight of mode m
▶ Constraint: Σ_m π_m = 1
▶ Location µ_m and scale b_m of each mode m are modeled by the neural network

[Figure: multi-modal density p(y) over y]

Bishop: Mixture Density Networks. 1994.
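A sketch (with made-up parameters) of evaluating this Laplace mixture; in an actual mixture density network, pi, mu and b would be produced by network heads, with pi passed through a softmax so that it sums to one.

```python
import numpy as np

def laplace_mixture_pdf(y, pi, mu, b):
    # density of each of the M modes, then the pi-weighted sum
    comps = np.exp(-np.abs(y - mu) / b) / (2.0 * b)
    return np.sum(pi * comps)

pi = np.array([0.3, 0.7])    # mode weights (sum to 1)
mu = np.array([-2.0, 2.0])   # per-mode locations
b = np.array([0.5, 1.0])     # per-mode scales

print(laplace_mixture_pdf(0.0, pi, mu, b))  # bimodal density evaluated at y = 0
```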
Output Layer for Regression Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ For most outputs (e.g., µ ∈ R), the output layer is linear (no non-linearity)
▶ For some outputs (e.g., b ∈ R⁺), we need a squashing function (ReLU, softplus)
Image Classification

Curse of Dimensionality:
▶ There exist 2^784 ≈ 10^236 possible binary images of resolution 28 × 28 pixels
▶ MNIST is gray-scale, thus 256^784 combinations ⇒ impossible to enumerate
▶ Why is image classification with just 60k labeled training images even possible?
▶ Answer: Images are concentrated on a low-dimensional manifold in {0, . . . , 255}^784

LeCun, Bottou, Bengio and Haffner: Gradient-based learning applied to document recognition. IEEE, 1998.
Bernoulli Distribution

Bernoulli distribution:

p(y) = µ^y (1 − µ)^(1−y)

▶ µ: probability for y = 1
▶ Handles only two classes, e.g. "cats" vs. "dogs"

[Figure: Bernoulli probability mass function over y ∈ {0, 1}]
Bernoulli Distribution / BCE Loss

Let p_model(y|x, w) = f_w(x)^y (1 − f_w(x))^(1−y) be a Bernoulli distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w Σ_{i=1}^N log [f_w(x_i)^{y_i} (1 − f_w(x_i))^{1−y_i}]
     = argmin_w Σ_{i=1}^N −y_i log f_w(x_i) − (1 − y_i) log(1 − f_w(x_i))    (BCE Loss)
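The derived BCE loss as a small NumPy sketch; prob plays the role of f_w(x) and is assumed to lie strictly in (0, 1), e.g. the output of a sigmoid.

```python
import numpy as np

def bce_loss(prob, y):
    # per-sample binary cross entropy, y in {0, 1}
    return -(y * np.log(prob) + (1 - y) * np.log(1 - prob))

y = np.array([1, 0, 1, 1])
prob = np.array([0.9, 0.2, 0.6, 0.05])
print(bce_loss(prob, y))         # the confident mistake (0.05 for y=1) dominates
print(bce_loss(prob, y).mean())  # average over the mini-batch
```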
Categorical Distribution

Alternative notation using one-hot target vectors:

p(y) = Π_{c=1}^C µ_c^{y_c}

class   y
1       (1, 0, 0, 0)ᵀ
2       (0, 1, 0, 0)ᵀ
3       (0, 0, 1, 0)ᵀ
4       (0, 0, 0, 1)ᵀ

[Figure: Categorical probability mass function p(y) over the classes]
Categorical Distribution / CE Loss

Let p_model(y|x, w) = Π_{c=1}^C f_w^(c)(x)^{y_c} be a Categorical distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w Σ_{i=1}^N log Π_{c=1}^C f_w^(c)(x_i)^{y_{i,c}}
     = argmin_w Σ_{i=1}^N Σ_{c=1}^C −y_{i,c} log f_w^(c)(x_i)    (CE Loss)

▶ Let s denote the network output after the last affine layer (= scores). Then:

f_w^(c)(x) = exp(s_c) / Σ_{k=1}^C exp(s_k)   ⇒   log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k)

▶ Remark: s_c is a direct contribution to the loss function, i.e., it does not saturate
Log Softmax

Intuition: Assume c is the correct class. Our goal is to maximize the log softmax:

log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k)

▶ The first term encourages the score s_c for the correct class c to increase
▶ The second term encourages all scores in s to decrease
▶ The second term can be approximated by log Σ_{k=1}^C exp(s_k) ≈ max_k s_k, as exp(s_k) is insignificant for all s_k < max_k s_k
▶ Thus, the loss always strongly penalizes the most active incorrect prediction
▶ If the correct class already has the largest score (i.e., s_c = max_k s_k), both terms roughly cancel and the example contributes little to the overall training cost
Log Softmax Example

[Figure: bar plots of the scores s_c (left) and the exponentiated scores exp(s_c) (right) for classes c = 1, . . . , 4]

▶ The second term becomes: log Σ_{k=1}^C exp(s_k) = 4.06 ≈ s_3 = max_k s_k
▶ For c = 2 we obtain: log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k) = 1 − 4.06 ≈ −3
▶ For c = 3 we obtain: log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k) = 4 − 4.06 ≈ 0
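The same computation in NumPy. Only s_2 = 1 and s_3 = 4 can be read off the slide; the remaining two scores are made-up values, chosen small enough that the log-sum-exp stays at roughly 4.06.

```python
import numpy as np

s = np.array([-1.0, 1.0, 4.0, -2.0])  # hypothetical scores; s[1]=1, s[2]=4 as on the slide
lse = np.log(np.exp(s).sum())         # log sum_k exp(s_k)
log_softmax = s - lse

print(f"{lse:.2f}")    # 4.06, close to max_k s_k = 4
print(log_softmax[1])  # c = 2: 1 - 4.06 ≈ -3
print(log_softmax[2])  # c = 3: 4 - 4.06 ≈ 0
```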
Softmax

▶ Predicting C values/scores overparameterizes the Categorical distribution
▶ As the distribution sums to 1, only C − 1 parameters are necessary
▶ Example: Consider C = 2 and fix one degree of freedom (x_2 = 0):

softmax(x) = ( exp(x_1) / (exp(x_1) + exp(x_2)),  exp(x_2) / (exp(x_1) + exp(x_2)) )
           = ( exp(x_1) / (exp(x_1) + 1),  1 / (exp(x_1) + 1) )
           = ( 1 / (1 + exp(−x_1)),  1 − 1 / (1 + exp(−x_1)) )
           = ( σ(x_1),  1 − σ(x_1) )
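A quick numerical check of this identity (not part of the slides):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x1 = 1.7
print(softmax(np.array([x1, 0.0])))    # two-class softmax with x2 fixed to 0
print(sigmoid(x1), 1.0 - sigmoid(x1))  # identical: (sigma(x1), 1 - sigma(x1))
```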
[Figure: scores s_c, exponentiated scores exp(s_c), the resulting softmax distribution, and the argmax one-hot vector for classes c = 1, . . . , 4]
Softmax

softmax(x) = softmax(x + c)

▶ The softmax is invariant to shifting all scores by a constant c; this is exploited for numerically stable implementations, as in the sketch below
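A minimal sketch of why this shift invariance matters in practice: subtracting c = max(x) before exponentiating changes nothing mathematically but avoids overflow for large scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # softmax(x) = softmax(x + c) with c = -max(x)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
# naive version: np.exp(1000) overflows to inf, yielding nan
print(np.exp(x) / np.exp(x).sum())
print(softmax(x))  # stable version: [0.09  0.245 0.665]
```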
Cross Entropy Loss with Softmax Example

Putting it together: Cross Entropy Loss for a single training sample (x, y) ∈ X:

CE Loss: −Σ_{c=1}^C y_c log f_w^(c)(x)

[Figure: example images with predicted class distributions and their CE losses]

▶ Sample 4 contributes most strongly to the loss function! (the elephant in the room)
Output Layer for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ For 2 classes, we can predict 1 value and use a sigmoid, or 2 values with a softmax
▶ For C > 2 classes, we typically predict C scores and use a softmax non-linearity
Loss Function for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]
4.2 Activation Functions
Activation Functions

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ Hidden layer h_i = g(A_i h_{i−1} + b_i) with activation function g(·) and weights A_i, b_i (see the sketch below)
▶ The activation function is frequently applied element-wise to its input
▶ Activation functions must be non-linear to learn non-linear mappings
▶ Some of them are not differentiable everywhere (but still ok for training)
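A minimal sketch of one such hidden layer, with ReLU as the element-wise g(·); shapes and values are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # element-wise non-linearity g

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))    # weight matrix A_i
b = np.zeros(4)                # bias b_i
h_prev = rng.normal(size=3)    # output h_{i-1} of the previous layer

h = relu(A @ h_prev + b)       # h_i = g(A_i h_{i-1} + b_i)
print(h)
```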
Activation Functions

Sigmoid:

g(x) = 1 / (1 + exp(−x))

▶ Maps input to range [0, 1]
▶ Neuroscience interpretation as saturating "firing rate" of neurons

Problems:
▶ Saturation "kills" gradients
▶ Outputs are not zero-centered
▶ Introduces bias after first layer

[Figure: sigmoid g(x) over x]
Activation Functions

Sigmoid Problem #1:

▶ Downstream gradient becomes zero when input x is saturated: g′(x) ≈ 0
▶ No learning if x is very small (< −10)
▶ No learning if x is very large (> 10)

[Figure: sigmoid g(x) with its flat, saturating regions]
Activation Functions

Sigmoid Problem #2:

g(x) = 1 / (1 + exp(−x)),    x = Σ_i a_i x_i + b

▶ Sigmoid is always positive ⇒ the inputs x_i (outputs of the previous layer) are too
▶ Gradient of sigmoid is always positive

∂L/∂a_i = (∂L/∂g)(∂g/∂a_i) = (∂L/∂g)(∂g/∂x)(∂x/∂a_i) = (∂L/∂g)(∂g/∂x) x_i

▶ Hence all weight gradients of a unit share the same sign, given by the sign of ∂L/∂g

[Figure: allowed gradient update directions span only two quadrants, so the optimal update must be approximated by a zig-zag update path]
Activation Functions

Tanh:

g(x) = 2 / (1 + exp(−2x)) − 1

▶ Maps input to range [−1, 1]
▶ Anti-symmetric
▶ Zero-centered

Problems:
▶ Again, saturation "kills" gradients

[Figure: tanh g(x) over x]

LeCun, Kanter and Solla: Second-order properties of error surfaces: learning time and generalization. NIPS, 1991.
Activation Functions

ReLU (Rectified Linear Unit):

g(x) = max(0, x)

▶ Leads to fast convergence
▶ Computationally efficient

Problems:
▶ Not zero-centered
▶ No learning for x < 0 ⇒ dead ReLUs

[Figure: ReLU g(x) over x]

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions

ReLU Problem:

▶ Downstream gradient becomes zero when input x < 0
▶ Results in so-called "dead ReLUs" that never participate in learning
▶ Often initialize with a positive bias (b > 0)

[Figure: ReLU g(x) over x]

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions

Leaky ReLU:

g(x) = max(0.01x, x)

▶ Does not saturate (i.e., will not die)
▶ Closer to zero-centered outputs
▶ Leads to fast convergence
▶ Computationally efficient

[Figure: Leaky ReLU g(x) over x]

Maas: Rectifier nonlinearities improve neural network acoustic models. ICML, 2013.
Activation Functions

Parametric ReLU:

g(x) = max(αx, x)

▶ Does not saturate (i.e., will not die)
▶ Leads to fast convergence
▶ Computationally efficient
▶ Parameter α is learned from data

[Figure: Parametric ReLU g(x) over x]

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
Activation Functions

Exponential Linear Units (ELU):

g(x) = x               if x > 0
       α(exp(x) − 1)   if x ≤ 0

▶ All benefits of Leaky ReLU
▶ Adds some robustness to noise
▶ Default: α = 1

[Figure: ELU g(x) over x]

Clevert, Unterthiner and Hochreiter: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR, 2016.
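For reference, the activation functions discussed above as one compact NumPy sketch (not library code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # range (0, 1), saturates

def tanh(x):
    return np.tanh(x)                     # range (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)             # dead for x < 0

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)       # small negative slope

def prelu(x, alpha):
    return np.maximum(alpha * x, x)       # alpha is a learned parameter

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(elu(x))
```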
Activation Functions

Maxout: g(x) = max(w_1ᵀx + b_1, w_2ᵀx + b_2)

[Figure: maxout units can learn to approximate the rectifier, absolute value, and quadratic activation functions]

Goodfellow, Warde-Farley, Mirza, Courville and Bengio: Maxout Networks. ICML, 2013.
Activation Functions

Summary:
▶ No one-size-fits-all: the choice of activation function depends on the problem
▶ We only showed the most common ones; many more exist
▶ In practice, the best activation function/model is often found by trial and error
▶ It is important to ensure a good "gradient flow" during optimization

Rule of Thumb:
▶ Use ReLU by default (with a small enough learning rate)
▶ Try Leaky ReLU, Maxout, or ELU for a small additional gain
▶ Prefer Tanh over Sigmoid (Tanh is often used in recurrent models)
Implementation

Numerical Differentiation

▶ Murphy: "Anything that can go wrong will."
▶ When implementing the backward pass of activation, output or loss functions, it is important to ensure that all gradients are correct!
▶ Verify via Newton's difference quotient:

∂f(x)/∂x = lim_{h→0} (f(x + h) − f(x)) / h

[Figure: secant through f(x) and f(x + h) approximating the tangent at x]
Numerical Differentiation

How to choose h?
▶ For h = 0 the expression is undefined
▶ Choose h to trade off:
  ▶ Rounding error (finite precision) when h is too small
  ▶ Approximation error when h is too large
▶ Good choice: h = ∛ε with ε the machine precision
▶ Examples:
  ▶ ε = 6 × 10⁻⁸ for single precision (32 bit)
  ▶ ε = 1 × 10⁻¹⁶ for double precision (64 bit)

See en.wikipedia.org/wiki/Numerical_differentiation
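A gradient-check sketch following this recipe: symmetric differences with h = ∛ε in double precision, verified here on the sigmoid, whose analytic derivative is known.

```python
import numpy as np

def numerical_grad(f, x, h=None):
    if h is None:
        h = np.cbrt(np.finfo(x.dtype).eps)  # ~6e-6 for float64
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)  # symmetric difference
    return grad

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
analytic = sigmoid(x) * (1.0 - sigmoid(x))      # known derivative
numeric = numerical_grad(lambda z: sigmoid(z).sum(), x)
print(np.abs(numeric - analytic).max())         # should be tiny (~1e-10)
```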
Numerical Differentiation

Example: Sigmoid derivative using symmetric differences with single precision:

σ(x) = 1 / (1 + e^(−x)),    ∂σ(x)/∂x = σ(x)(1 − σ(x))

[Figure: σ(x) (left) and its derivative σ′(x) (right), comparing the analytic derivative to the symmetric-difference approximation]
4.3 Preprocessing and Initialization
Data Preprocessing

Remember what happens for positive inputs:

g(x) = g(Σ_i a_i x_i + b)

The gradient wrt. parameter a_i is given by:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

▶ Both factors ∂g/∂x and x_i are positive
▶ All gradients ∂L/∂a_i therefore have the same sign (+ or −), given by the sign of ∂L/∂g
▶ We should pre-process the input data such that it is "well distributed"

[Figure: allowed gradient update directions span only two quadrants, so the optimal update must be approximated by a zig-zag update path]
Data Preprocessing

[Figure: original data vs. zero-centered data]
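A typical zero-centering/standardization sketch, assuming a design matrix X with one sample per row; the statistics are computed on the training set and reused at test time.

```python
import numpy as np

X_train = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

mean = X_train.mean(axis=0)        # per-feature mean
std = X_train.std(axis=0) + 1e-8   # per-feature std (epsilon avoids division by zero)

X_centered = X_train - mean        # zero-centered data
X_standard = X_centered / std      # additionally scaled to unit variance

print(X_standard.mean(axis=0))     # ~0 per feature
print(X_standard.std(axis=0))      # ~1 per feature
```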
Weight Initialization

Recap: Stochastic Gradient Descent

[Figure: recap of the stochastic gradient descent update]

Question:
▶ How to best initialize the weights w?
Constant Initialization

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

▶ With constant weights, all units within a layer compute the same output and receive identical gradients, so they stay identical: symmetry is never broken

Remark:
▶ For g(·), we will use Tanh and ReLU in the following
Small Random Numbers

▶ With small random weights, activations shrink towards zero from layer to layer, so the inputs x_i to deeper layers vanish:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g)(∂g/∂x) · 0 = 0
Large Random Numbers

▶ With large random weights, saturating activations (e.g. Tanh) operate in their flat regions, so ∂g/∂x ≈ 0:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g) · 0 · x_i = 0
Xavier Initialization

[Figure: see Glorot and Bengio (2010)]

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
Xavier Initialization

Why σ = 1/√D_in? Let us consider y = g(wᵀx) and assume that all x_i and w_i are independent and identically distributed (i.i.d.) with zero mean. Let further g′(0) = 1. Then:

Var(y) ≈ Var(wᵀx) = D_in Var(x_i w_i)
       = D_in (E[x_i² w_i²] − E[x_i w_i]²)
       = D_in (E[x_i²] E[w_i²] − E[x_i]² E[w_i]²)
       = D_in E[x_i²] E[w_i²]
       = D_in Var(x_i) Var(w_i)

Thus: Var(w_i) = 1/D_in ⇒ Var(y) = Var(x_i)

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
Xavier Initialization

[Figure: further results from Glorot and Bengio (2010)]

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
He Initialization

[Figure: see He et al. (2015)]

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
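Minimal sketches of both schemes; only the standard deviation of the random weights differs (D_in is the fan-in of the layer).

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(d_in, d_out):
    # Var(w_i) = 1/D_in keeps Var(y) = Var(x_i); suited to zero-centered
    # activations such as Tanh
    return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))

def he_init(d_in, d_out):
    # ReLU zeroes roughly half the pre-activations, so the variance is doubled
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

W = he_init(512, 256)
print(W.std())  # ~sqrt(2/512) = 0.0625
```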
Summary

Data Preprocessing:
▶ Zero-centering the network inputs is important for efficient learning
▶ Decorrelation and whitening are used less frequently

Weight Initialization:
▶ Proper initialization is important for ensuring a good "gradient flow"
▶ For zero-centered activation functions, use Xavier initialization
▶ For ReLU activation functions, use He initialization
▶ Initialization is an active research topic; there is much more literature on it