Lec 04 Deep Networks 2
4.1 Output and Loss Functions
[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ The output layer is the last layer in a neural network and computes the output
▶ The loss function compares the result of the output layer to the target value(s)
▶ The choice of output layer and loss function depends on the task (discrete, continuous, ...)
Loss Function

What is the goal of optimizing the loss function?
▶ It tries to make the model output (= prediction) similar to the target (= data)
▶ Think of the loss function as a measure of the cost being paid for a prediction

[Figure: two plots of p(y|x) comparing p_data and p_model; left: KL divergence large, right: KL divergence small]
Loss Function

How to design a good loss function?
▶ A loss function can be any differentiable function that we wish to optimize
▶ Deriving the cost function from the maximum likelihood principle removes the burden of manually designing the cost function for each model
▶ Consider the output of the neural network as parameters of a distribution over y_i

[Figure: Gaussian p(y) with the predicted mean and the target marked]
Example:
▶ A neural network f_w(x) predicts the mean µ of a Gaussian distribution over y:

p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²))

▶ Mapping: f_w : R^N → R
Recap: Binary Classification

[Figure: example image classified as "Beach"]

Recap: Multi-Class Classification

[Figure: example image classified as "Beach" among multiple classes]
Regression Problems

Gaussian Distribution

Gaussian distribution:

p(y) = 1/√(2πσ²) · exp(−(y − µ)² / (2σ²))

▶ µ: mean
▶ σ: standard deviation
▶ The distribution has thin "tails": p(y) → 0 quickly as |y| → ∞
▶ It thus penalizes outliers strongly

[Figure: Gaussian density p(y) over y]
Gaussian Distribution / L2 Loss

Let p_model(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²)) be a Gaussian distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N ½ log(2πσ²) − Σ_{i=1}^N (1/(2σ²)) (f_w(x_i) − y_i)²
     = argmax_w −Σ_{i=1}^N (f_w(x_i) − y_i)²
     = argmin_w Σ_{i=1}^N (f_w(x_i) − y_i)²    (L2 Loss)

In other words, we minimize the squared loss (= L2 loss), which is strongly affected by outliers.
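A minimal NumPy sketch (not from the slides) of the equivalence just derived: up to constants that do not depend on w, the Gaussian negative log-likelihood is the L2 loss. Function names like gaussian_nll are placeholders.

```python
import numpy as np

def gaussian_nll(pred, target, sigma=1.0):
    # per-sample -log p(y|x,w) for a Gaussian with fixed std sigma
    return 0.5 * np.log(2 * np.pi * sigma**2) + (pred - target)**2 / (2 * sigma**2)

def l2_loss(pred, target):
    return (pred - target)**2

pred = np.array([1.0, 2.5, -0.3])
target = np.array([0.8, 3.0, 0.0])

# The two objectives differ only by a constant offset (and scale),
# so they have the same minimizer w.
print(gaussian_nll(pred, target).sum())
print(0.5 * l2_loss(pred, target).sum() + pred.size * 0.5 * np.log(2 * np.pi))
```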
Laplace Distribution

Laplace distribution:

p(y) = 1/(2b) · exp(−|y − µ| / b)

▶ µ: location
▶ b: scale

[Figure: Laplace density p(y) over y]
Laplace Distribution / L1 Loss

Let p_model(y|x, w) = 1/(2b) · exp(−|y − f_w(x)| / b) be a Laplace distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N log(2b) − Σ_{i=1}^N (1/b) |f_w(x_i) − y_i|
     = argmax_w −Σ_{i=1}^N |f_w(x_i) − y_i|
     = argmin_w Σ_{i=1}^N |f_w(x_i) − y_i|    (L1 Loss)

We minimize the absolute loss (= L1 loss), which is more robust to outliers than the L2 loss.
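A quick numerical illustration (with made-up data) of this robustness claim: for a constant prediction, the L2 loss is minimized by the mean, which an outlier drags away, while the L1 loss is minimized by the median.

```python
import numpy as np

targets = np.array([1.0, 1.1, 0.9, 1.0, 8.0])  # the last value is an outlier
preds = np.linspace(0.0, 3.0, 301)             # candidate constant predictions

l2 = ((preds[:, None] - targets)**2).sum(axis=1)
l1 = np.abs(preds[:, None] - targets).sum(axis=1)

print("L2 minimizer:", preds[l2.argmin()])  # 2.4 = mean, dragged towards the outlier
print("L1 minimizer:", preds[l1.argmin()])  # 1.0 = median, robust
```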
Predicting all Parameters

Let p_model(y|x, w) = 1/(2 g_w(x)) · exp(−|y − f_w(x)| / g_w(x)) be a Laplace distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w −Σ_{i=1}^N log(2 g_w(x_i)) − Σ_{i=1}^N (1/g_w(x_i)) |f_w(x_i) − y_i|

In this case, we predict both the location µ and the scale b with a neural network. f_w and g_w are typically the same network except for the last layer. This allows estimating the aleatoric uncertainty (observation noise) with the neural network itself.

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
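A minimal sketch of the resulting per-sample loss, assuming mu = f_w(x) and b = g_w(x) > 0 are the outputs of the two network heads (the names are placeholders):

```python
import numpy as np

def laplace_nll(mu, b, y):
    # -log p(y|x,w) for a Laplace distribution with location mu and scale b > 0
    return np.log(2.0 * b) + np.abs(y - mu) / b

mu, y = 2.0, 3.0
for b in (0.5, 1.0, 2.0):
    print(f"b={b}: nll={laplace_nll(mu, b, y):.3f}")
# Predicting a large scale b down-weights the residual |y - mu|, but the
# log(2b) term penalizes claiming high uncertainty everywhere.
```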
Predicting all Parameters

[Figure: qualitative examples from Kendall and Gal (2017)]

Kendall, Gal: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
Mixture Density Networks

To represent multi-modal distributions, we can also model mixture densities:

p_model(y|x, w) = Σ_{m=1}^M π_m · 1/(2 g_w^(m)(x)) · exp(−|y − f_w^(m)(x)| / g_w^(m)(x))

Example:
▶ π_m ∈ [0, 1]: weight of mode m
▶ Constraint: Σ_m π_m = 1
▶ Location µ_m and scale b_m of each mode m are modeled by the neural network

[Figure: multi-modal density p(y) over y]

Bishop: Mixture Density Networks. 1994.
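A sketch (with made-up parameters) of evaluating this Laplace mixture; in an actual mixture density network, pi, mu and b would be produced by network heads, with pi passed through a softmax so that it sums to one.

```python
import numpy as np

def laplace_mixture_pdf(y, pi, mu, b):
    # density of each of the M modes, then the pi-weighted sum
    comps = np.exp(-np.abs(y - mu) / b) / (2.0 * b)
    return np.sum(pi * comps)

pi = np.array([0.3, 0.7])    # mode weights (sum to 1)
mu = np.array([-2.0, 2.0])   # per-mode locations
b = np.array([0.5, 1.0])     # per-mode scales

print(laplace_mixture_pdf(0.0, pi, mu, b))  # bimodal density evaluated at y = 0
```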
Output Layer for Regression Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ For most outputs (e.g., µ ∈ R), the output layer is linear (no non-linearity)
▶ For some outputs (e.g., b ∈ R⁺), we need a squashing function (ReLU, softplus)
Image Classification

Curse of Dimensionality:
▶ There exist 2^784 ≈ 10^236 possible binary images of resolution 28 × 28 pixels
▶ MNIST is gray-scale, thus 256^784 combinations ⇒ impossible to enumerate
▶ Why is image classification with just 60k labeled training images even possible?
▶ Answer: Images are concentrated on a low-dimensional manifold in {0, . . . , 255}^784

LeCun, Bottou, Bengio and Haffner: Gradient-based learning applied to document recognition. IEEE, 1998.
Bernoulli Distribution

Bernoulli distribution:

p(y) = µ^y (1 − µ)^(1−y)

▶ µ: probability for y = 1
▶ Handles only two classes, e.g. "cats" vs. "dogs"

[Figure: Bernoulli probability mass function over y ∈ {0, 1}]
Bernoulli Distribution / BCE Loss

Let p_model(y|x, w) = f_w(x)^y (1 − f_w(x))^(1−y) be a Bernoulli distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w Σ_{i=1}^N log [f_w(x_i)^{y_i} (1 − f_w(x_i))^{1−y_i}]
     = argmin_w Σ_{i=1}^N −y_i log f_w(x_i) − (1 − y_i) log(1 − f_w(x_i))    (BCE Loss)
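The derived BCE loss as a small NumPy sketch; prob plays the role of f_w(x) and is assumed to lie strictly in (0, 1), e.g. the output of a sigmoid.

```python
import numpy as np

def bce_loss(prob, y):
    # per-sample binary cross entropy, y in {0, 1}
    return -(y * np.log(prob) + (1 - y) * np.log(1 - prob))

y = np.array([1, 0, 1, 1])
prob = np.array([0.9, 0.2, 0.6, 0.05])
print(bce_loss(prob, y))         # the confident mistake (0.05 for y=1) dominates
print(bce_loss(prob, y).mean())  # average over the mini-batch
```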
Categorical Distribution

Alternative notation using one-hot target vectors:

p(y) = Π_{c=1}^C µ_c^{y_c}

class   y
1       (1, 0, 0, 0)ᵀ
2       (0, 1, 0, 0)ᵀ
3       (0, 0, 1, 0)ᵀ
4       (0, 0, 0, 1)ᵀ

[Figure: Categorical probability mass function p(y) over the classes]
Categorical Distribution / CE Loss

Let p_model(y|x, w) = Π_{c=1}^C f_w^(c)(x)^{y_c} be a Categorical distribution. We obtain:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w Σ_{i=1}^N log Π_{c=1}^C f_w^(c)(x_i)^{y_{i,c}}
     = argmin_w Σ_{i=1}^N Σ_{c=1}^C −y_{i,c} log f_w^(c)(x_i)    (CE Loss)

▶ Let s denote the network output after the last affine layer (= scores). Then:

f_w^(c)(x) = exp(s_c) / Σ_{k=1}^C exp(s_k)   ⇒   log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k)

▶ Remark: s_c is a direct contribution to the loss function, i.e., it does not saturate
Log Softmax

Intuition: Assume c is the correct class. Our goal is to maximize the log softmax:

log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k)

▶ The first term encourages the score s_c for the correct class c to increase
▶ The second term encourages all scores in s to decrease
▶ The second term can be approximated by log Σ_{k=1}^C exp(s_k) ≈ max_k s_k, as exp(s_k) is insignificant for all s_k < max_k s_k
▶ Thus, the loss always strongly penalizes the most active incorrect prediction
▶ If the correct class already has the largest score (i.e., s_c = max_k s_k), both terms roughly cancel and the example contributes little to the overall training cost
Log Softmax Example

[Figure: bar plots of the scores s_c (left) and the exponentiated scores exp(s_c) (right) for classes c = 1, . . . , 4]

▶ The second term becomes: log Σ_{k=1}^C exp(s_k) = 4.06 ≈ s_3 = max_k s_k
▶ For c = 2 we obtain: log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k) = 1 − 4.06 ≈ −3
▶ For c = 3 we obtain: log f_w^(c)(x) = s_c − log Σ_{k=1}^C exp(s_k) = 4 − 4.06 ≈ 0
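The same computation in NumPy. Only s_2 = 1 and s_3 = 4 can be read off the slide; the remaining two scores are made-up values, chosen small enough that the log-sum-exp stays at roughly 4.06.

```python
import numpy as np

s = np.array([-1.0, 1.0, 4.0, -2.0])  # hypothetical scores; s[1]=1, s[2]=4 as on the slide
lse = np.log(np.exp(s).sum())         # log sum_k exp(s_k)
log_softmax = s - lse

print(f"{lse:.2f}")    # 4.06, close to max_k s_k = 4
print(log_softmax[1])  # c = 2: 1 - 4.06 ≈ -3
print(log_softmax[2])  # c = 3: 4 - 4.06 ≈ 0
```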
Softmax

▶ Predicting C values/scores overparameterizes the Categorical distribution
▶ As the distribution sums to 1, only C − 1 parameters are necessary
▶ Example: Consider C = 2 and fix one degree of freedom (x_2 = 0):

softmax(x) = ( exp(x_1) / (exp(x_1) + exp(x_2)),  exp(x_2) / (exp(x_1) + exp(x_2)) )
           = ( exp(x_1) / (exp(x_1) + 1),  1 / (exp(x_1) + 1) )
           = ( 1 / (1 + exp(−x_1)),  1 − 1 / (1 + exp(−x_1)) )
           = ( σ(x_1),  1 − σ(x_1) )
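A quick numerical check of this identity (not part of the slides):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x1 = 1.7
print(softmax(np.array([x1, 0.0])))    # two-class softmax with x2 fixed to 0
print(sigmoid(x1), 1.0 - sigmoid(x1))  # identical: (sigma(x1), 1 - sigma(x1))
```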
[Figure: scores s_c, exponentiated scores exp(s_c), the resulting softmax distribution, and the argmax one-hot vector for classes c = 1, . . . , 4]
Softmax

softmax(x) = softmax(x + c)

▶ The softmax is invariant to shifting all scores by a constant c; this is exploited for numerically stable implementations, as in the sketch below
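A minimal sketch of why this shift invariance matters in practice: subtracting c = max(x) before exponentiating changes nothing mathematically but avoids overflow for large scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # softmax(x) = softmax(x + c) with c = -max(x)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
# naive version: np.exp(1000) overflows to inf, yielding nan
print(np.exp(x) / np.exp(x).sum())
print(softmax(x))  # stable version: [0.09  0.245 0.665]
```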
Cross Entropy Loss with Softmax Example

Putting it together: Cross Entropy Loss for a single training sample (x, y) ∈ X:

CE Loss: −Σ_{c=1}^C y_c log f_w^(c)(x)

[Figure: example images with predicted class distributions and their CE losses]

▶ Sample 4 contributes most strongly to the loss function! (the elephant in the room)
Output Layer for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ For 2 classes, we can predict 1 value and use a sigmoid, or 2 values with a softmax
▶ For C > 2 classes, we typically predict C scores and use a softmax non-linearity
Loss Function for Classification Problems

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]
4.2 Activation Functions
Activation Functions

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

▶ Hidden layer h_i = g(A_i h_{i−1} + b_i) with activation function g(·) and weights A_i, b_i (see the sketch below)
▶ The activation function is frequently applied element-wise to its input
▶ Activation functions must be non-linear to learn non-linear mappings
▶ Some of them are not differentiable everywhere (but still ok for training)
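A minimal sketch of one such hidden layer, with ReLU as the element-wise g(·); shapes and values are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # element-wise non-linearity g

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))    # weight matrix A_i
b = np.zeros(4)                # bias b_i
h_prev = rng.normal(size=3)    # output h_{i-1} of the previous layer

h = relu(A @ h_prev + b)       # h_i = g(A_i h_{i-1} + b_i)
print(h)
```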
Activation Functions

Sigmoid:

g(x) = 1 / (1 + exp(−x))

▶ Maps input to range [0, 1]
▶ Neuroscience interpretation as saturating "firing rate" of neurons

Problems:
▶ Saturation "kills" gradients
▶ Outputs are not zero-centered
▶ Introduces bias after first layer

[Figure: sigmoid g(x) over x]
Activation Functions

Sigmoid Problem #1:

▶ Downstream gradient becomes zero when input x is saturated: g′(x) ≈ 0
▶ No learning if x is very small (< −10)
▶ No learning if x is very large (> 10)

[Figure: sigmoid g(x) with its flat, saturating regions]
Activation Functions

Sigmoid Problem #2:

g(x) = 1 / (1 + exp(−x)),    x = Σ_i a_i x_i + b

▶ Sigmoid is always positive ⇒ the inputs x_i (outputs of the previous layer) are too
▶ Gradient of sigmoid is always positive

∂L/∂a_i = (∂L/∂g)(∂g/∂a_i) = (∂L/∂g)(∂g/∂x)(∂x/∂a_i) = (∂L/∂g)(∂g/∂x) x_i

▶ Hence all weight gradients of a unit share the same sign, given by the sign of ∂L/∂g

[Figure: allowed gradient update directions span only two quadrants, so the optimal update must be approximated by a zig-zag update path]
Activation Functions

Tanh:

g(x) = 2 / (1 + exp(−2x)) − 1

▶ Maps input to range [−1, 1]
▶ Anti-symmetric
▶ Zero-centered

Problems:
▶ Again, saturation "kills" gradients

[Figure: tanh g(x) over x]

LeCun, Kanter and Solla: Second-order properties of error surfaces: learning time and generalization. NIPS, 1991.
Activation Functions

ReLU (Rectified Linear Unit):

g(x) = max(0, x)

▶ Leads to fast convergence
▶ Computationally efficient

Problems:
▶ Not zero-centered
▶ No learning for x < 0 ⇒ dead ReLUs

[Figure: ReLU g(x) over x]

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions

ReLU Problem:

▶ Downstream gradient becomes zero when input x < 0
▶ Results in so-called "dead ReLUs" that never participate in learning
▶ Often initialize with a positive bias (b > 0)

[Figure: ReLU g(x) over x]

Nair and Hinton: Rectified linear units improve restricted boltzmann machines. ICML, 2010.
Activation Functions

Leaky ReLU:

g(x) = max(0.01x, x)

▶ Does not saturate (i.e., will not die)
▶ Closer to zero-centered outputs
▶ Leads to fast convergence
▶ Computationally efficient

[Figure: Leaky ReLU g(x) over x]

Maas: Rectifier nonlinearities improve neural network acoustic models. ICML, 2013.
Activation Functions

Parametric ReLU:

g(x) = max(αx, x)

▶ Does not saturate (i.e., will not die)
▶ Leads to fast convergence
▶ Computationally efficient
▶ Parameter α is learned from data

[Figure: Parametric ReLU g(x) over x]

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
Activation Functions

Exponential Linear Units (ELU):

g(x) = x               if x > 0
       α(exp(x) − 1)   if x ≤ 0

▶ All benefits of Leaky ReLU
▶ Adds some robustness to noise
▶ Default: α = 1

[Figure: ELU g(x) over x]

Clevert, Unterthiner and Hochreiter: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR, 2016.
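For reference, the activation functions discussed above as one compact NumPy sketch (not library code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # range (0, 1), saturates

def tanh(x):
    return np.tanh(x)                     # range (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)             # dead for x < 0

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)       # small negative slope

def prelu(x, alpha):
    return np.maximum(alpha * x, x)       # alpha is a learned parameter

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(elu(x))
```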
Activation Functions

Maxout: g(x) = max(w_1ᵀx + b_1, w_2ᵀx + b_2)

[Figure: maxout units can learn to approximate the rectifier, absolute value, and quadratic activation functions]

Goodfellow, Warde-Farley, Mirza, Courville and Bengio: Maxout Networks. ICML, 2013.
Activation Functions

Summary:
▶ No one-size-fits-all: the choice of activation function depends on the problem
▶ We only showed the most common ones; many more exist
▶ In practice, the best activation function/model is often found by trial and error
▶ It is important to ensure a good "gradient flow" during optimization

Rule of Thumb:
▶ Use ReLU by default (with a small enough learning rate)
▶ Try Leaky ReLU, Maxout, or ELU for a small additional gain
▶ Prefer Tanh over Sigmoid (Tanh is often used in recurrent models)
Implementation

Numerical Differentiation

▶ Murphy: "Anything that can go wrong will."
▶ When implementing the backward pass of activation, output or loss functions, it is important to ensure that all gradients are correct!
▶ Verify via Newton's difference quotient:

∂f(x)/∂x = lim_{h→0} (f(x + h) − f(x)) / h

[Figure: secant through f(x) and f(x + h) approximating the tangent at x]
Numerical Differentiation

How to choose h?
▶ For h = 0 the expression is undefined
▶ Choose h to trade off:
  ▶ Rounding error (finite precision) when h is too small
  ▶ Approximation error when h is too large
▶ Good choice: h = ∛ε with ε the machine precision
▶ Examples:
  ▶ ε = 6 × 10⁻⁸ for single precision (32 bit)
  ▶ ε = 1 × 10⁻¹⁶ for double precision (64 bit)

See en.wikipedia.org/wiki/Numerical_differentiation
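A gradient-check sketch following this recipe: symmetric differences with h = ∛ε in double precision, verified here on the sigmoid, whose analytic derivative is known.

```python
import numpy as np

def numerical_grad(f, x, h=None):
    if h is None:
        h = np.cbrt(np.finfo(x.dtype).eps)  # ~6e-6 for float64
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)  # symmetric difference
    return grad

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
analytic = sigmoid(x) * (1.0 - sigmoid(x))      # known derivative
numeric = numerical_grad(lambda z: sigmoid(z).sum(), x)
print(np.abs(numeric - analytic).max())         # should be tiny (~1e-10)
```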
Numerical Differentiation

Example: Sigmoid derivative using symmetric differences with single precision:

σ(x) = 1 / (1 + e^(−x)),    ∂σ(x)/∂x = σ(x)(1 − σ(x))

[Figure: σ(x) (left) and its derivative σ′(x) (right), comparing the analytic derivative to the symmetric-difference approximation]
4.3 Preprocessing and Initialization
Data Preprocessing

Remember what happens for positive inputs:

g(x) = g(Σ_i a_i x_i + b)

The gradient wrt. parameter a_i is given by:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

▶ Both factors ∂g/∂x and x_i are positive
▶ All gradients ∂L/∂a_i therefore have the same sign (+ or −), given by the sign of ∂L/∂g
▶ We should pre-process the input data such that it is "well distributed"

[Figure: allowed gradient update directions span only two quadrants, so the optimal update must be approximated by a zig-zag update path]
Data Preprocessing

[Figure: original data vs. zero-centered data]
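A typical zero-centering/standardization sketch, assuming a design matrix X with one sample per row; the statistics are computed on the training set and reused at test time.

```python
import numpy as np

X_train = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

mean = X_train.mean(axis=0)        # per-feature mean
std = X_train.std(axis=0) + 1e-8   # per-feature std (epsilon avoids division by zero)

X_centered = X_train - mean        # zero-centered data
X_standard = X_centered / std      # additionally scaled to unit variance

print(X_standard.mean(axis=0))     # ~0 per feature
print(X_standard.std(axis=0))      # ~1 per feature
```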
Weight Initialization

Recap: Stochastic Gradient Descent

[Figure: recap of the stochastic gradient descent update]

Question:
▶ How to best initialize the weights w?
Constant Initialization

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer → Loss Function → Target]

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i

▶ With constant weights, all units within a layer compute the same output and receive identical gradients, so they stay identical: symmetry is never broken

Remark:
▶ For g(·), we will use Tanh and ReLU in the following
Small Random Numbers

▶ With small random weights, activations shrink towards zero from layer to layer, so the inputs x_i to deeper layers vanish:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g)(∂g/∂x) · 0 = 0
Large Random Numbers

▶ With large random weights, saturating activations (e.g. Tanh) operate in their flat regions, so ∂g/∂x ≈ 0:

∂L/∂a_i = (∂L/∂g)(∂g/∂x) x_i = (∂L/∂g) · 0 · x_i = 0
Xavier Initialization

[Figure: see Glorot and Bengio (2010)]

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
Xavier Initialization

Why σ = 1/√D_in? Let us consider y = g(wᵀx) and assume that all x_i and w_i are independent and identically distributed (i.i.d.) with zero mean. Let further g′(0) = 1. Then:

Var(y) ≈ Var(wᵀx) = D_in Var(x_i w_i)
       = D_in (E[x_i² w_i²] − E[x_i w_i]²)
       = D_in (E[x_i²] E[w_i²] − E[x_i]² E[w_i]²)
       = D_in E[x_i²] E[w_i²]
       = D_in Var(x_i) Var(w_i)

Thus: Var(w_i) = 1/D_in ⇒ Var(y) = Var(x_i)

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
Xavier Initialization

[Figure: further results from Glorot and Bengio (2010)]

Glorot and Bengio: Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
He Initialization

[Figure: see He et al. (2015)]

He, Zhang, Ren and Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
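Minimal sketches of both schemes; only the standard deviation of the random weights differs (D_in is the fan-in of the layer).

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(d_in, d_out):
    # Var(w_i) = 1/D_in keeps Var(y) = Var(x_i); suited to zero-centered
    # activations such as Tanh
    return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))

def he_init(d_in, d_out):
    # ReLU zeroes roughly half the pre-activations, so the variance is doubled
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

W = he_init(512, 256)
print(W.std())  # ~sqrt(2/512) = 0.0625
```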
Summary

Data Preprocessing:
▶ Zero-centering the network inputs is important for efficient learning
▶ Decorrelation and whitening are used less frequently

Weight Initialization:
▶ Proper initialization is important for ensuring a good "gradient flow"
▶ For zero-centered activation functions, use Xavier initialization
▶ For ReLU activation functions, use He initialization
▶ Initialization is an active research topic; there is much more literature on it