Deep Learning Final Sheet
Introduction
Learning goals
Relationship of DL and ML
Concept of representation or feature learning
Use-cases and data types for DL methods
WHAT IS DEEP LEARNING
Deep Learning – 1 / 12
DEEP LEARNING AND NEURAL NETWORKS
Deep Learning – 2 / 12
IMAGE CLASSIFICATION WITH NEURAL NETWORKS
“Machine learning algorithms, inspired by the brain, based on
learning multiple levels of representation/abstraction.”
Y. Bengio
Deep Learning – 3 / 12
POSSIBLE USE-CASES
Deep learning can be extremely valuable if the data has these
properties:
It is high dimensional.
Each single feature itself is not very informative but only a
combination of them might be.
There is a large amount of training data.
This implies that for tabular data, deep learning is rarely the
correct model choice.
Deep Learning – 4 / 12
POSSIBLE USE-CASE: IMAGES
High Dimensional: A color image with 255 × 255 pixels and 3 color
channels already has 195075 features.
Informative: A single pixel is not meaningful in itself.
Training Data: Depending on applications huge amounts of data
are available.
Deep Learning – 5 / 12
POSSIBLE USE-CASE: IMAGES
Deep Learning – 6 / 12
POSSIBLE USE-CASE: IMAGES
Deep Learning – 8 / 12
POSSIBLE USE-CASE: TEXT
High Dimensional: Each word can be a single feature (300000
words in the German language).
Informative: A single word does not provide much context.
Training Data: Huge amounts of text data available.
Deep Learning – 9 / 12
POSSIBLE USE-CASE: TEXT CLASSIFICATION
Deep Learning – 10 / 12
POSSIBLE USE-CASE: TEXT
Deep Learning – 11 / 12
APPLICATIONS OF DEEP LEARNING: SPEECH
Deep Learning – 12 / 12
Deep Learning
Learning goals
Graphical representation of a
single neuron
Affine transformations and
non-linear activation functions
Hypothesis spaces of a single
neuron
Typical loss functions
A SINGLE NEURON
Perceptron with input features x1 , x2 , ..., xp , weights w1 , w2 , ..., wp , bias term b, and
activation function τ .
Deep Learning – 1 / 11
A SINGLE NEURON
Activation function τ : a single neuron represents different functions
depending on the choice of activation function.
Identity activation: f(x) = τ(w^T x) = w^T x
Logistic sigmoid activation: f(x) = τ(w^T x) = 1 / (1 + exp(−w^T x))
Deep Learning – 2 / 11
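A minimal NumPy sketch of the two choices of τ above (the input, weight, and bias values are made up for illustration):

```python
import numpy as np

def neuron(x, w, b, activation="identity"):
    """Single neuron: affine transformation followed by activation tau."""
    z = w @ x + b                              # weighted sum of inputs plus bias
    if activation == "identity":
        return z                               # linear/identity activation
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))        # logistic sigmoid activation
    raise ValueError(activation)

x = np.array([1.0, 0.5, -2.0])                 # example 3-dimensional input
w = np.array([0.3, -0.1, 0.8])                 # weights
b = 0.2                                        # bias
print(neuron(x, w, b, "identity"), neuron(x, w, b, "sigmoid"))
```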
A SINGLE NEURON
We consider a perceptron with 3-dimensional input, i.e.
f (x) = τ (w1 x1 + w2 x2 + w3 x3 + b).
Input features x are represented by nodes in the “input layer”.
Deep Learning – 3 / 11
A SINGLE NEURON
Weights w are connected to edges from the input layer.
Deep Learning – 4 / 11
A SINGLE NEURON
For an explicit graphical representation, we do a simple trick:
Add a constant feature to the inputs x̃ = (1, x1 , ..., xp )T
and absorb the bias into the weight vector w̃ = (b, w1 , ..., wp ).
The graphical representation is then:
Deep Learning – 5 / 11
A SINGLE NEURON
The computation τ (w1 x1 + w2 x2 + w3 x3 + b) is represented by the
neuron in the “output layer”.
Deep Learning – 6 / 11
A SINGLE NEURON
You can picture the input vector being "fed" to neurons on the left
followed by a sequence of computations performed from left to
right. This is called a forward pass.
Deep Learning – 7 / 11
A SINGLE NEURON
A neuron performs a 2-step computation:
1 Affine Transformation: weighted sum of inputs plus bias.
Deep Learning – 8 / 11
A SINGLE NEURON: HYPOTHESIS SPACE
The hypothesis space that is formed by a single neuron is
H = { f : ℝ^p → ℝ | f(x) = τ( Σ_{j=1}^p w_j x_j + b ), w ∈ ℝ^p, b ∈ ℝ }.
Deep Learning – 9 / 11
A SINGLE NEURON: OPTIMIZATION
For regression, we typically use the L2 loss
L(y, f(x)) = (1/2) (y − f(x))^2.
For binary classification, we typically apply the cross-entropy loss
(also known as Bernoulli loss):
L(y, f(x)) = −( y log f(x) + (1 − y) log(1 − f(x)) )
Deep Learning – 10 / 11
A SINGLE NEURON: OPTIMIZATION
For a single neuron and both choices of τ the loss function is
convex.
The global optimum can be found with an iterative algorithm like
gradient descent.
A single neuron with logistic sigmoid function trained with the
Bernoulli loss yields the same result as logistic regression when
trained until convergence.
Note: In the case of regression and the L2-loss, the solution can
also be found analytically using the “normal equations”. However,
in other cases a closed-form solution is usually not available.
Deep Learning – 11 / 11
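A small NumPy sketch of the remark above (toy data made up for illustration): for a linear neuron with L2 loss, gradient descent converges to the same solution that the normal equations give in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy inputs (n x p)
X1 = np.column_stack([np.ones(100), X])        # constant feature absorbs the bias
theta_true = np.array([0.5, 2.0, -1.0, 0.3])
y = X1 @ theta_true + 0.1 * rng.normal(size=100)

# Closed-form solution via the normal equations (identity activation, L2 loss)
theta_ne = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Gradient descent on the same convex empirical risk
theta = np.zeros(4)
for _ in range(5000):
    grad = X1.T @ (X1 @ theta - y) / len(y)    # gradient of (1/n) sum of 1/2 (y - f(x))^2
    theta -= 0.05 * grad

print(np.allclose(theta, theta_ne, atol=1e-3))  # True: both reach the global optimum
```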
Deep Learning
XOR-Problem
Learning goals
Example problem a single
neuron can not solve but a single
hidden layer net can
EXAMPLE: XOR PROBLEM
The XOR gate (exclusive or) returns true when an odd number of
its inputs are true:
x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0
Can you learn the target function with a logistic regression model?
Deep Learning – 1 / 10
EXAMPLE: XOR PROBLEM
Logistic regression can not
solve this problem. In fact,
any model using simple
hyperplanes for separation
can not (including a single
neuron).
Deep Learning – 2 / 10
EXAMPLE: XOR PROBLEM
Consider the following model:
Figure: A neural network with two neurons in the hidden layer. The matrix W
describes the mapping from x to z. The vector u from z to y .
Deep Learning – 3 / 10
EXAMPLE: XOR PROBLEM
Let us use the ReLU σ(z) = max{0, z} as activation function and the
simple thresholding function τ(z) = [z > 0] (i.e. 1 if z > 0, 0 otherwise)
as output transformation function. We can represent the
architecture of the model by the following equation:
f(x | θ) = f(x | W, b, u, c) = τ( u^⊤ σ(W^⊤ x + b) + c )
                             = τ( u^⊤ max{0, W^⊤ x + b} + c )
The number of parameters is (2 × 2) + (2 × 1) + (2 × 1) + (1) = 9,
counting W, b, u, and c, respectively.
Deep Learning – 4 / 10
EXAMPLE: XOR PROBLEM
Let W = [[1, 1], [1, 1]], b = (0, −1)^⊤, u = (1, −2)^⊤, c = −0.5.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] (rows = data points),
XW = [[0, 0], [1, 1], [1, 1], [2, 2]],
XW + B = [[0, −1], [1, 0], [1, 0], [2, 1]]
Note: X is a (n × p) design matrix in which the rows correspond to the data points. W,
as usual, is a (p × m) matrix where each column corresponds to a single (hidden)
neuron. B is a (n × m) matrix with b duplicated along the rows.
Deep Learning – 5 / 10
EXAMPLE: XOR PROBLEM
Z = max{0, XW + B} = [[0, 0], [1, 0], [1, 0], [2, 1]]
Deep Learning – 6 / 10
EXAMPLE: XOR PROBLEM
Deep Learning – 7 / 10
EXAMPLE: XOR PROBLEM
Deep Learning – 8 / 10
EXAMPLE: XOR PROBLEM
In a final step we have to multiply the activated values of matrix Z
with the vector u and add the bias term c:
Z u + c = (0, 1, 1, 0)^⊤ + (−0.5) = (−0.5, 0.5, 0.5, −0.5)^⊤
Then we apply the step function τ(z) = [z > 0], which yields the
predictions (0, 1, 1, 0)^⊤. This solves the XOR problem perfectly!
x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0
Deep Learning – 9 / 10
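The hand computation above can be verified with a few lines of NumPy (a sketch using the weights given on the previous slides):

```python
import numpy as np

W = np.array([[1.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])
c = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # all four XOR inputs

Z = np.maximum(0.0, X @ W + b)        # hidden layer: ReLU(XW + B)
f = ((Z @ u + c) > 0).astype(int)     # output layer: threshold of u^T z + c
print(f)                              # [0 1 1 0], i.e. exactly XOR
```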
NEURAL NETWORKS : OPTIMIZATION
Deep Learning – 10 / 10
Deep Learning
Learning goals
Architecture of single hidden
layer neural networks
Representation learning/
understanding the advantage of
hidden layers
Typical (non-linear) activation
functions
MOTIVATION
However, the neuron can easily separate the classes if the original
features are transformed (e.g., from Cartesian to polar
coordinates):
Before deep learning took off, features for tasks like machine
vision and speech recognition were “hand-designed” by domain
experts. This step of the machine learning pipeline is called
feature engineering.
z_in^(1) = w_11 x^(1) + w_21 x^(2) + w_31 x^(3) + b_1 = 3 · (−3) + (−9) · 1 + 2 · 5 + 5 = −3
z_in^(2) = w_12 x^(1) + w_22 x^(2) + w_32 x^(3) + b_2 = 11 · (−3) + (−2) · 1 + 7 · 5 + 2 = 2
z_in^(3) = w_13 x^(1) + w_23 x^(2) + w_33 x^(3) + b_3 = (−6) · (−3) + 3 · 1 + (−4) · 5 − 1 = 0
z_in^(4) = w_14 x^(1) + w_24 x^(2) + w_34 x^(3) + b_4 = 6 · (−3) + (−1) · 1 + 5 · 5 + 1 = 7
z_out^(i) = σ(z_in^(i)) = 1 / (1 + exp(−z_in^(i)))
Sigmoid activation: σ(v) = 1 / (1 + exp(−v))
ReLU activation: σ(v) = max(0, v)
Learning goals
Neural network architectures for
multi-class classification
Softmax activation function
Softmax loss
MULTI-CLASS CLASSIFICATION
Deep Learning – 1 / 6
MULTI-CLASS CLASSIFICATION
The first step is to add additional neurons to the output layer.
Each neuron in the layer will represent a specific class (number of
neurons in the output layer = number of classes).
Deep Learning – 2 / 6
MULTI-CLASS CLASSIFICATION
Notation:
f = (f_1, . . . , f_g)
z_j = σ(W_j^⊤ x), j = 1, . . . , m.
Deep Learning – 3 / 6
MULTI-CLASS CLASSIFICATION
Derivative: ∂τ(f_in) / ∂f_in = diag(τ(f_in)) − τ(f_in) τ(f_in)^⊤
Deep Learning – 4 / 6
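The softmax output and the derivative above can be written down directly (a short NumPy sketch; the example scores are made up):

```python
import numpy as np

def softmax(f_in):
    """Numerically stable softmax."""
    e = np.exp(f_in - np.max(f_in))
    return e / e.sum()

def softmax_jacobian(f_in):
    """Jacobian of the softmax: diag(tau(f_in)) - tau(f_in) tau(f_in)^T."""
    s = softmax(f_in)
    return np.diag(s) - np.outer(s, s)

f_in = np.array([2.0, 1.0, 0.1])      # made-up scores for a 3-class problem
print(softmax(f_in))                   # class probabilities, sum to 1
print(softmax_jacobian(f_in))          # 3 x 3 derivative matrix as on the slide
```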
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).
Deep Learning – 5 / 6
OPTIMIZATION: SOFTMAX LOSS
Deep Learning – 6 / 6
Deep Learning
Learning goals
Architectures of deep neural
networks
Deep neural networks as
chained functions
FEEDFORWARD NEURAL NETWORKS
We will now extend the model class once again, such that we allow
an arbitrary number l of hidden layers.
Deep Learning – 1 / 7
FEEDFORWARD NEURAL NETWORKS
We can characterize those models by the following chain structure:
f(x) = τ( φ( σ^(l)( φ^(l)( · · · σ^(1)( φ^(1)(x) ) · · · ) ) ) )
where σ^(i) and φ^(i) are the activation function and the weighted
sum of hidden layer i, respectively. τ and φ are the corresponding
components of the output layer.
Deep Learning – 2 / 7
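The chained structure can be sketched in a few lines of code (layer sizes, weights, and the choice of sigmoid/identity are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, sigma=sigmoid, tau=lambda f: f):
    """Forward pass through l hidden layers, then the output transformation tau."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers: z = sigma(phi(z))
        z = sigma(W.T @ z + b)
    W_out, b_out = weights[-1], biases[-1]        # output layer: tau(phi(z))
    return tau(W_out.T @ z + b_out)

# Made-up architecture: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
rng = np.random.default_rng(1)
weights = [rng.normal(size=s) for s in [(3, 4), (4, 4), (4, 1)]]
biases = [np.zeros(4), np.zeros(4), np.zeros(1)]
print(forward(np.array([1.0, -0.5, 2.0]), weights, biases))
```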
FEEDFORWARD NEURAL NETWORKS
Figure: Structure of a deep neural network with l hidden layers (bias terms
omitted).
Deep Learning – 3 / 7
FEEDFORWARD NEURAL NETWORKS: EXAMPLE
Deep Learning – 4 / 7
WHY ADD MORE LAYERS?
Multiple layers allow for the extraction of more and more abstract
representations.
Each layer in a feed-forward neural network adds its own degree of
non-linearity to the model.
Deep Learning – 5 / 7
DEEP NEURAL NETWORKS
Neural networks today can have hundreds of hidden layers. The greater
the number of layers, the "deeper" the network. Historically DNNs were
very challenging to train and not popular until the late ’00s for several
reasons:
Deep Learning – 6 / 7
DEEP NEURAL NETWORKS
The availability of large datasets and novel architectures that are
capable of handling even complex tensor-shaped data (e.g. CNNs
for image data), faster hardware, and better optimization and
regularization methods made it feasible to successfully implement
deep neural networks.
Deep Learning – 7 / 7
Deep Learning
Learning goals
Compact representation of
neural network equations
Vector notation for neuron layers
Vector and matrix notation of
bias and weight parameters
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS
Deep Learning – 1 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS
Hidden layer:
For example, to obtain z1 , we pick the first column of W :
W_1 = (w_{1,1}, w_{2,1}, . . . , w_{p,1})^⊤
and compute
z1 = σ(WT1 x + b1 ) ,
where b1 is the bias of the first hidden neuron and σ : R → R is
an activation function.
Deep Learning – 2 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
z_j = σ(W_j^⊤ x + b_j)
z_{in,j} = W_j^⊤ x + b_j
Vectorized notation:
z_in = (z_{in,1}, . . . , z_{in,m})^⊤ = W^⊤ x + b
(Note: W^⊤ x = (x^⊤ W)^⊤)
z = z_out = σ(z_in) = σ(W^⊤ x + b), where the (hidden layer)
activation function σ is applied element-wise to z_in.
Deep Learning – 3 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Bias term:
We sometimes omit the bias term by adding a constant
feature to the input x̃ = (1, x1 , ..., xp ) and by adding the bias
term to the weight matrix
W̃ = (b, W1 , ..., Wp ).
Deep Learning – 4 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Output layer:
Deep Learning – 5 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Multiple inputs:
It is possible to feed multiple inputs to a neural network
simultaneously.
The inputs x(i ) , for i ∈ {1, . . . , n}, are arranged as rows in the
design matrix X.
X is a (n × p)-matrix.
Deep Learning – 6 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
The final output of the network, which contains a prediction for
each input, is τ (Z u + C ), where
Deep Learning – 7 / 9
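In code, feeding all n inputs at once amounts to matrix operations on the design matrix (a sketch with made-up sizes; a sigmoid output is just one possible choice of τ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))        # design matrix: n = 4 observations, p = 3 features
W = rng.normal(size=(3, 2))        # one column per hidden neuron (m = 2)
b = np.zeros(2)                    # hidden-layer biases
u = rng.normal(size=(2, 1))        # output weights
c = 0.0                            # output bias

Z = sigmoid(X @ W + b)             # (n x m) hidden activations, b broadcast along rows
f = sigmoid(Z @ u + c)             # (n x 1) prediction for each input: tau(Z u + C)
print(f.ravel())
```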
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Weights (and biases) of the network.
Deep Learning – 8 / 9
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Forward pass through the shallow neural network.
Deep Learning – 9 / 9
Deep Learning
Universal Approximation
Learning goals
Universal approximation property for one-hidden-layer neural networks
The pros and cons of a low approximation error
Figure: One-hidden-layer network (nnet: size=4; maxit=1e+03) fitted to noisy 1D data;
Train: mse=0.029; CV: mse.test.mean=0.055.
UNIVERSAL APPROXIMATION PROPERTY
Theorem. Let σ : ℝ → ℝ be a continuous, non-constant, bounded,
and monotonically increasing function. Let C ⊂ ℝ^p be compact, and let
C(C) denote the space of continuous functions C → ℝ. Then, given a
function g ∈ C(C) and an accuracy ε > 0, there exists a hidden layer
size m ∈ ℕ and a set of coefficients W_j ∈ ℝ^p, u_j, b_j ∈ ℝ (for
j ∈ {1, . . . , m}), such that
f : C → ℝ,  f(x) = Σ_{j=1}^m u_j · σ( W_j^⊤ x + b_j )
approximates g to accuracy ε, i.e. |f(x) − g(x)| < ε for all x ∈ C.
Deep Learning – 1 / 14
UNIVERSAL APPROXIMATION PROPERTY
Corollary. Neural networks with a single sigmoidal hidden layer and
linear output layer are universal approximators.
This means that for a given target function g there exists a
sequence of networks (f_k)_{k∈ℕ} that converges (pointwise) to g.
A network with fixed layer sizes can only model a subspace of all
continuous functions.
Deep Learning – 2 / 14
UNIVERSAL APPROXIMATION PROPERTY
Why is universal approximation a desirable property?
Ideally, we would like the neural network (or any other learner)
to approximate the Bayes optimal hypothesis; since that hypothesis can be
an essentially arbitrary (continuous) function of the features, a universal
approximator can in principle get arbitrarily close to it.
Deep Learning – 3 / 14
UNIVERSAL APPROXIMATION PROPERTY
Universal approximation ⇒ approximation error tends to zero as
hidden layer size tends to infinity.
Deep Learning – 4 / 14
UNIVERSAL APPROXIMATION PROPERTY
As we know, there are also good reasons for restricting the model
class.
Deep Learning – 5 / 14
EXAMPLE : REGRESSION/CLASSIFICATION
Deep Learning – 6 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=1; maxit=1e+03
Train: mse=0.391; CV: mse.test.mean=0.419
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 7 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=2; maxit=1e+03
Train: mse=0.088; CV: mse.test.mean=0.112
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 8 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=3; maxit=1e+03
Train: mse=0.032; CV: mse.test.mean=0.063
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 9 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=4; maxit=1e+03
Train: mse=0.029; CV: mse.test.mean=0.055
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 10 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=5; maxit=1e+03
Train: mse=0.028; CV: mse.test.mean=19.845
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 11 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=6; maxit=1e+03
Train: mse=0.031; CV: mse.test.mean=4.374
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 12 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=10; maxit=1e+03
Train: mse=0.023; CV: mse.test.mean=0.698
Figure: Scatter plot of the training data (x, y) with the fitted network prediction.
Deep Learning – 13 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=1; maxit=500
Train: mmce=0.336; CV: mmce.test.mean=0.346
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=2; maxit=500
Train: mmce=0.426; CV: mmce.test.mean=0.412
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=3; maxit=500
Train: mmce=0.290; CV: mmce.test.mean=0.374
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=5; maxit=500
Train: mmce=0.272; CV: mmce.test.mean=0.322
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=10; maxit=500
Train: mmce=0.184; CV: mmce.test.mean=0.106
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=30; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.034
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=50; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.026
Figure: Two-class training data in the (x1, x2) plane with the learned decision regions (classes 1 and 2).
Deep Learning – 14 / 14
Deep Learning
Brief History
Learning goals
Predecessors of modern (deep)
neural networks
History of DL as a field
A BRIEF HISTORY OF NEURAL NETWORKS
1943: The first artificial neuron, the "Threshold Logic Unit (TLU)",
was proposed by Warren McCulloch & Walter Pitts.
Deep Learning – 1 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1957: The perceptron was invented by Frank Rosenblatt.
Deep Learning – 2 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1960: Adaptive Linear Neuron (ADALINE) was invented by
Bernard Widrow & Ted Hoff; weights are now adjustable according
to the weighted sum of the inputs.
Deep Learning – 3 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1969: The first “AI Winter” kicked in.
Marvin Minsky & Seymour Papert proved that a perceptron
cannot solve the XOR-Problem (linear separability).
Less funding ⇒ Standstill in AI/DL research.
Credit: https://emerj.com/ai-executive-guides/will-there-be-another-artificial-intelligence-winter-probably-not/
Deep Learning – 5 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
2006: Age of deep neural networks began.
Geoffrey Hinton showed that a deep belief network could be efficiently
trained using greedy layer-wise pretraining.
This wave of research popularized the use of the term deep learning to
emphasize that researchers were now able to train deeper neural networks
than had been possible before.
At this time, deep neural networks outperformed competing AI systems
based on other ML technologies as well as hand-designed functionality.
Deep Learning – 6 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Credit: https://towardsdatascience.com/a-weird-introduction-to-deep-learning-7828803693b0
Deep Learning – 7 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Deep Learning – 8 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Deep Learning – 9 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Deep Learning – 10 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Credit: DeepMind
In 2018 and 2020, AlphaFold placed first in the overall rankings of the Critical
Assessment of Techniques for Protein Structure Prediction (CASP).
Deep Learning – 11 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
Credit: DeepMind
While there are several extensions to AlphaGo (e.g., Master AlphaGo, AlphaGo
Zero, AlphaZero, and MuZero), the main idea is the same: search for optimal
moves based on knowledge acquired by machine learning.
Deep Learning – 12 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
The model (GPT-3) has 175 billion parameters to be learned, yet the quality of
the generated text is so high that it is hardly possible to distinguish it from
human-written text.
Deep Learning – 13 / 13
Deep Learning
Basic Training
Learning goals
Empirical risk minimization
Gradient descent
Stochastic gradient descent
TRAINING NEURAL NETWORKS
In ML we use empirical risk minimization (ERM) to minimize
prediction losses over the training data
R_emp(θ) = (1/n) Σ_{i=1}^n L( y^(i), f(x^(i) | θ) )
with a loss function such as the L2 loss for regression,
L(y, f(x)) = (1/2) (y − f(x))^2,
or cross-entropy for binary classification.
Deep Learning – 1 / 11
GRADIENT DESCENT
Neg. risk gradient points in the direction of the steepest descent
−g = −∇R_emp(θ) = −( ∂R_emp/∂θ_1, . . . , ∂R_emp/∂θ_d )^⊤
We iteratively follow this direction with step size (learning rate) α:
θ^[t+1] = θ^[t] − α g
Figure: Risk surface and its contour plot over (θ_1, θ_2) with the path of gradient descent.
Deep Learning – 2 / 11
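A minimal sketch of the update rule above (the quadratic example risk anticipates the learning-rate discussion below; step size and start value are made up):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Plain gradient descent: theta[t+1] = theta[t] - alpha * g."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Example risk R_emp(theta) = 10 * theta1^2 + 0.5 * theta2^2
grad = lambda th: np.array([20.0 * th[0], 1.0 * th[1]])
print(gradient_descent(grad, theta0=[3.0, 3.0], alpha=0.04, steps=200))  # approaches (0, 0)
```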
GRADIENT DESCENT AND OPTIMALITY
GD is a greedy algorithm: In
every iteration, it makes
locally optimal moves.
Deep Learning – 3 / 11
GRADIENT DESCENT AND OPTIMALITY
Note: It might not be that bad if we do not find the global optimum:
We don’t optimize the (theoretical) risk, but only an approximate
version, i.e. the empirical risk.
For very flexible models, aggressive optimization might cause overfitting.
Early-stopping might even increase generalization performance.
Deep Learning – 4 / 11
LEARNING RATE (LR)
The step-size α plays a key role in the convergence of the algorithm.
If the step size is too small, the training process may converge very
slowly (see left image). If the step size is too large, the process may not
converge, because it jumps around the optimal point (see right image).
Deep Learning – 5 / 11
LEARNING RATE
So far we have assumed a fixed value of α in every iteration:
α^[t] = α for all t ∈ {1, . . . , T}
However, it makes sense to adapt α in every iteration:
Figure: Steps of gradient descent on R_emp(θ) = 10 θ_1^2 + 0.5 θ_2^2. Left: 100 steps with a fixed
learning rate. Right: 40 steps with an adaptive learning rate.
Deep Learning – 6 / 11
WEIGHT INITIALIZATION
Weights (and biases) of an NN must be initialized in GD.
We somehow must "break symmetry" – which would happen in
full-0-initialization. If two neurons (with the same activation) are
connected to the same inputs and have the same initial weights,
then both neurons will have the same gradient update and learn
the same features.
Weights are typically drawn from a uniform or a Gaussian distribution
(both centered at 0 with a small variance).
Two common initialization strategies are ’Glorot initialization’ and
’He initialization’ which tune the variance of these distributions
based on the topology of the network.
Deep Learning – 7 / 11
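A sketch of the two named initialization strategies (layer sizes are made up; the scaling formulas follow the common Glorot-uniform and He-normal conventions):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier initialization: variance tuned by fan_in + fan_out."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He initialization: variance 2 / fan_in, commonly used with ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = glorot_uniform(784, 128)   # e.g. input layer -> first hidden layer
W2 = he_normal(128, 64)         # e.g. first -> second hidden layer
print(W1.std(), W2.std())       # small, non-zero spread: symmetry is broken
```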
STOCHASTIC GRADIENT DESCENT (SGD)
GD for ERM was:
θ^[t+1] = θ^[t] − α · (1/n) Σ_{i=1}^n ∇_θ L( y^(i), f(x^(i) | θ^[t]) )
Deep Learning – 8 / 11
STOCHASTIC GRADIENT DESCENT (SGD)
SGD on function 1.25(x1 + 6)2 + (x2 − 8)2 .
Source : Shalev-Shwartz and Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University
Press, 2014.
Figure: Left = GD, right = SGD. Black line is an average of different SGD runs.
Deep Learning – 9 / 11
STOCHASTIC GRADIENT DESCENT
Deep Learning – 10 / 11
STOCHASTIC GRADIENT DESCENT
SGD is the most used optimizer in ML and especially in DL.
We usually have to add a considerable amount of tricks to SGD to
make it really efficient (e.g. momentum). More on this later.
SGD with (small) batches has a high variance, although it is
unbiased. Hence, the LR α is usually chosen smaller than in batch mode.
When the LR is slowly decreased, SGD converges to a local minimum.
Recent results indicate that SGD often leads to better
generalization than GD, and may result in indirect regularization.
Deep Learning – 11 / 11
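A minimal sketch of minibatch SGD on a toy regression problem (the data, batch size, and learning rate are made up for illustration):

```python
import numpy as np

def sgd(grad_batch, theta0, X, y, alpha=0.05, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: each update uses the gradient on a small random subset of the data."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                     # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - alpha * grad_batch(theta, X[batch], y[batch])
    return theta

# Toy linear model with L2 loss
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
grad = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
print(sgd(grad, np.zeros(3), X, y))   # roughly recovers (1, -2, 0.5)
```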
Deep Learning
Learning goals
Chain rule of calculus
Computational graphs
CHAIN RULE OF CALCULUS
The chain rule can be used to compute derivatives of the
composition of two or more functions.
Let x ∈ ℝ^m, y ∈ ℝ^n, g : ℝ^m → ℝ^n and f : ℝ^n → ℝ.
If y = g(x) and z = f(y), the chain rule yields:
∂z/∂x_i = Σ_j ( ∂z/∂y_j · ∂y_j/∂x_i )
Deep Learning – 2 / ??
COMPUTATIONAL GRAPHS
Deep Learning – 3 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 1
∂z/∂w = ∂z/∂y · ∂y/∂x · ∂x/∂w
      = f_3'(y) · f_2'(x) · f_1'(w)
      = f_3'(f_2(f_1(w))) · f_2'(f_1(w)) · f_1'(w)
source: Goodfellow et al. (2016)
Figure: A computational graph, such that x = f_1(w), y = f_2(x) and z = f_3(y).
Deep Learning – 4 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 2
Deep Learning – 5 / ??
COMPUTATIONAL GRAPH: NEURAL NET
Deep Learning – 6 / ??
Deep Learning
Basic Backpropagation 1
Learning goals
Forward and backward passes
Chain rule
Details of backprop
BACKPROPAGATION: BASIC IDEA
We would like to optimize ERM using gradient descent (GD) on:
n
1 X (i ) (i )
Remp (θ) = L y ,f x | θ .
n
i =1
We will see: This is simply (S)GD in disguise, cleverly using the chain
rule, so we can reuse a lot of intermediate results.
Forward pass:
z_{1,in} = W_1^⊤ x + b_1 = 1 · (−0.07) + 0 · 0.22 + 1 · (−0.46) = −0.53
z_{1,out} = σ(z_{1,in}) = 1 / (1 + exp(−(−0.53))) = 0.3705
z_{2,in} = W_2^⊤ x + b_2 = 1 · 0.94 + 0 · 0.46 + 1 · 0.1 = 1.04
z_{2,out} = σ(z_{2,in}) = 1 / (1 + exp(−1.04)) = 0.7389
f_in = u^⊤ z + c = 0.3705 · (−0.22) + 0.7389 · 0.58 + 1 · 0.78 = 1.1122
f_out = τ(f_in) = 1 / (1 + exp(−1.1122)) = 0.7525
L(y, f(x)) = (1/2) (y − f(x^(i) | θ))^2 = (1/2) (y − f_out)^2 = (1/2) (1 − 0.7525)^2 = 0.0306
The calculation of the gradient is performed backwards (starting
from the output layer), so that results can be reused.
∂L(y, f(x)) / ∂f_out = ∂/∂f_out [ (1/2) (y − f_out)^2 ] = −(y − f_out)   (the negative residual)
                     = −(1 − 0.7525) = −0.2475
∂f_out / ∂f_in = σ(f_in) · (1 − σ(f_in)) = 0.7525 · (1 − 0.7525) = 0.1862
∂f_in / ∂z_{1,out} = u_1 = −0.22
∂z_{1,out} / ∂z_{1,in} = σ(z_{1,in}) · (1 − σ(z_{1,in})) = 0.3705 · (1 − 0.3705) = 0.2332
∂z_{1,in} / ∂W_11 = x_1 = 1
With learning rate α = 0.5, the first output weight is updated as
u_1^[new] = u_1^[old] − α · ∂L(y, f(x)) / ∂u_1 = −0.22 − 0.5 · (−0.0171) = −0.2115
Updating all output-layer parameters this way gives u = (−0.2115, 0.5970)^⊤ and c = 0.8030.
Now rinse and repeat. This was one training iteration; we do thousands more.
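A NumPy check of the forward pass and the u_1 update above (reading the input x = (1, 0), target y = 1, and the weights off the numbers in the example; small rounding differences are expected):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 0.0]), 1.0
W = np.array([[-0.07, 0.94],
              [0.22, 0.46]])          # one column per hidden neuron
b = np.array([-0.46, 0.10])           # hidden biases
u, c = np.array([-0.22, 0.58]), 0.78  # output weights and bias

# Forward pass
z_out = sigmoid(W.T @ x + b)          # approx. (0.3705, 0.7389)
f_out = sigmoid(u @ z_out + c)        # approx. 0.75
loss = 0.5 * (y - f_out) ** 2         # approx. 0.03

# Backward pass for u_1 and one gradient step with alpha = 0.5
delta_out = -(y - f_out) * f_out * (1 - f_out)   # dL/df_out * df_out/df_in
grad_u1 = delta_out * z_out[0]                    # approx. -0.017
u1_new = u[0] - 0.5 * grad_u1                     # approx. -0.21
print(z_out, f_out, loss, u1_new)
```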
Basic Backpropagation 2
Learning goals
Backprop formalism and
recursion
BACKWARD COMPUTATION AND CACHING
In the XOR example, we computed:
Deep Learning – 1 / 9
BACKWARD COMPUTATION AND CACHING
Next, let us compute:
Deep Learning – 2 / 9
BACKWARD COMPUTATION AND CACHING
Examining the two expressions:
∂ L (y , f (x)) ∂ L (y , f (x)) ∂ fout ∂ fin ∂ z1,out ∂ z1,in
= · · · ·
∂ W11 ∂ fout ∂ fin ∂ z1,out ∂ z1,in ∂ W11
Deep Learning – 4 / 9
BACKPROP: RECURSION
Let δ_k̃^(i) (also: error signal) for a neuron k̃ in layer i represent how much
the loss L changes when the input z_{k̃,in}^(i) changes:
δ_k̃^(i) = ∂L / ∂z_{k̃,in}^(i)
        = ( ∂L / ∂z_{k̃,out}^(i) ) · ( ∂z_{k̃,out}^(i) / ∂z_{k̃,in}^(i) )
        = ( Σ_m ∂L / ∂z_{m,in}^(i+1) · ∂z_{m,in}^(i+1) / ∂z_{k̃,out}^(i) ) · ∂z_{k̃,out}^(i) / ∂z_{k̃,in}^(i)
Note: The sum in the expression above is over all the neurons in layer
i + 1. This is simply an application of the (multivariate) chain rule.
Deep Learning – 5 / 9
BACKPROP: RECURSION
Using
z_{k̃,out}^(i) = σ(z_{k̃,in}^(i))   and   z_{m,in}^(i+1) = Σ_k W_{k,m}^(i+1) z_{k,out}^(i) + b_m^(i+1),
we get:
δ_k̃^(i) = ( Σ_m ∂L / ∂z_{m,in}^(i+1) · ∂z_{m,in}^(i+1) / ∂z_{k̃,out}^(i) ) · ∂z_{k̃,out}^(i) / ∂z_{k̃,in}^(i)
        = ( Σ_m ∂L / ∂z_{m,in}^(i+1) · ∂( Σ_k W_{k,m}^(i+1) z_{k,out}^(i) + b_m^(i+1) ) / ∂z_{k̃,out}^(i) ) · ∂σ(z_{k̃,in}^(i)) / ∂z_{k̃,in}^(i)
        = ( Σ_m δ_m^(i+1) W_{k̃,m}^(i+1) ) · σ'(z_{k̃,in}^(i))
Deep Learning – 6 / 9
BACKPROP: RECURSION
Given the error signal δ_k̃^(i) of neuron k̃ in layer i, the derivative of the
loss L w.r.t. the weight W_{j̃,k̃}^(i) is simply
∂L / ∂W_{j̃,k̃}^(i) = ( ∂L / ∂z_{k̃,in}^(i) ) · ( ∂z_{k̃,in}^(i) / ∂W_{j̃,k̃}^(i) ) = δ_k̃^(i) · z_{j̃,out}^(i−1)
because z_{k̃,in}^(i) = Σ_j W_{j,k̃}^(i) z_{j,out}^(i−1) + b_k̃^(i).
Similarly, the derivative of the loss L w.r.t. the bias b_k̃^(i) is
∂L / ∂b_k̃^(i) = ( ∂L / ∂z_{k̃,in}^(i) ) · ( ∂z_{k̃,in}^(i) / ∂b_k̃^(i) ) = δ_k̃^(i)
Deep Learning – 7 / 9
BACKPROP: RECURSION
It is not hard to show that the error signal δ^(i) for an entire layer i is
(⊙ = element-wise product):
δ^(O) = ∇_{f_out} L ⊙ τ'(f_in)
δ^(i) = ( W^(i+1) δ^(i+1) ) ⊙ σ'(z_in^(i))
Therefore, backpropagation works by computing and storing the
error signals backwards. That is, starting at the output layer and
ending at the first hidden layer. This way, the error signals of later
layers propagate backwards to the earlier layers.
The derivative of the loss L w.r.t. a given weight is computed
efficiently by plugging in the cached error signals, thereby avoiding
expensive and redundant computations.
Deep Learning – 8 / 9
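A compact sketch of the recursion above for a small fully connected network with sigmoid activations, sigmoid output, and L2 loss (the example weights reuse the numbers from the previous chapter; everything else is illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, Ws, bs):
    """One forward/backward pass; returns the gradients for all weights and biases."""
    z_ins, z_outs = [], [x]                          # cache pre-activations and activations
    for W, b in zip(Ws, bs):                         # forward pass
        z_in = W.T @ z_outs[-1] + b
        z_ins.append(z_in)
        z_outs.append(sigmoid(z_in))
    delta = -(y - z_outs[-1]) * dsigmoid(z_ins[-1])  # delta^(O) = grad_fout L * tau'(f_in)
    grads_W, grads_b = [], []
    for i in reversed(range(len(Ws))):               # backward pass, layer by layer
        grads_W.insert(0, np.outer(z_outs[i], delta))  # dL/dW^(i) = z_out^(i-1) * delta^(i)
        grads_b.insert(0, delta)                       # dL/db^(i) = delta^(i)
        if i > 0:
            delta = (Ws[i] @ delta) * dsigmoid(z_ins[i - 1])  # delta^(i) recursion
    return grads_W, grads_b

# Tiny network: 2 inputs -> 2 hidden -> 1 output (weights from the worked example)
Ws = [np.array([[-0.07, 0.94], [0.22, 0.46]]), np.array([[-0.22], [0.58]])]
bs = [np.array([-0.46, 0.10]), np.array([0.78])]
gW, gb = backprop(np.array([1.0, 0.0]), np.array([1.0]), Ws, bs)
print(gW[1].ravel())   # gradient for u; first entry approx. -0.017 as before
```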
Deep Learning
Learning goals
GPU training for accelerated
learning
Software for hardware support
Deep learning software platforms
Hardware for Deep Learning
Deep Learning – 1 / 9
HARDWARE FOR DEEP LEARNING
Deep NNs require special hardware to be trained efficiently.
The training is done using Graphics Processing Units (GPUs) and
a special programming language called CUDA.
For most NNs, training on standard CPUs takes very long.
Figure: Left: Each CPU can do 2-8 parallel computations. Right: A single
GPU can do thousands of simple parallel computations.
Deep Learning – 2 / 9
GRAPHICS PROCESSING UNITS (GPUS)
Deep Learning – 3 / 9
TENSOR PROCESSING UNITS (TPUS)
Deep Learning – 4 / 9
AND EVERYTHING ELSE...
With such powerful devices, memory/disk access during training
becomes the bottleneck
Nvidia DGX-1: Specialized solution with eight Tesla V100
GPUs, dual Intel Xeon, 512 GB of RAM, 4 SSD disks of 2TB
each
Specialized hardware for on-device inference
Example: Neural Engine on the Apple A11 (used for FaceID)
Keywords/buzzwords: Edge computing and Federated
learning
Deep Learning – 5 / 9
Software for Deep Learning
Deep Learning – 6 / 9
SOFTWARE FOR DEEP LEARNING
CUDA is a very low level programming language and thus writing
code for deep learning requires a lot of work.
Deep learning (software) frameworks:
Abstract the hardware (same code for CPU/GPU/TPU)
Automatically differentiate all computations
Distribute training among several hosts
Provide facilities for visualizing and debugging models
Can be used from several programming languages
Based on the concept of computational graph
Deep Learning – 7 / 9
SOFTWARE FOR DEEP LEARNING
Tensorflow
Popular in the industry
Developed by Google and
open source community
Python, R, C++ and Javascript APIs
Distributed training on GPUs and TPUs
Tools for visualizing neural nets, running
them efficiently on phones and embedded devices.
Keras
Intuitive, high-level wrapper
of Tensorflow for rapid prototyping
Python and (unofficial) R APIs
Deep Learning – 8 / 9
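To illustrate the rapid-prototyping point, a minimal Keras sketch of a small classifier (the layer sizes, optimizer, and data are assumptions, not part of the slides):

```python
# Minimal Keras sketch (illustrative only)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dropout(0.5),                      # regularization, see later chapter
    tf.keras.layers.Dense(10, activation="softmax"),   # 10-class output layer
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, batch_size=32)  # given suitable training data
```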
SOFTWARE FOR DEEP LEARNING
Pytorch
(Most) Popular in academia
Supported by Facebook
Python and C++ APIs
Distributed training on GPUs
MXNet
Open-source deep learning framework
written in C++ and cuda (used by
Amazon for their Amazon Web Services)
Scalable, allowing fast model training
Supports flexible model programming and
multiple languages (C++, Python, Julia,
Matlab, JavaScript, Go, R, Scala, Perl)
Deep Learning – 9 / 9
Deep Learning
Basic Regularization
Learning goals
Regularized cost functions
Norm penalties
Weight decay
Equivalence with constrained
optimization
REGULARIZATION
Any technique that is designed to reduce the test error possibly at
the expense of increased training error can be considered a form
of regularization.
Regularization is important in DL because NNs can have
extremely high capacity (millions of parameters) and are thus
prone to overfitting.
Deep Learning – 1 / 9
REVISION: REGULARIZED RISK MINIMIZATION
The goal of regularized risk minimization is to penalize the
complexity of the model to minimize the chances of overfitting.
By adding a parameter norm penalty term J(θ) to the empirical
risk R_emp(θ) we obtain a regularized cost function:
R_reg(θ) = R_emp(θ) + λ · J(θ)
Deep Learning – 2 / 9
L2-REGULARIZATION / WEIGHT DECAY
Let us optimize the L2-regularized risk of a model f (x | θ)
min_θ R_reg(θ) = min_θ R_emp(θ) + (λ/2) ∥θ∥_2^2
by gradient descent. The gradient is
∇_θ R_reg(θ) = ∇_θ R_emp(θ) + λθ.
Deep Learning – 3 / 9
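A sketch of the resulting update rule on a made-up quadratic risk (λ, α, and the risk itself are purely illustrative):

```python
import numpy as np

def gd_weight_decay(grad_emp, theta0, alpha=0.1, lam=0.01, steps=200):
    """Gradient descent on the L2-regularized risk:
    theta <- theta * (1 - alpha * lam) - alpha * grad R_emp(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta * (1.0 - alpha * lam) - alpha * grad_emp(theta)
    return theta

# Toy quadratic risk: R_emp(theta) = 0.5 * ||theta - (2, -1)||^2
grad_emp = lambda th: th - np.array([2.0, -1.0])
print(gd_weight_decay(grad_emp, [0.0, 0.0], lam=0.0))   # approx. (2, -1): unregularized optimum
print(gd_weight_decay(grad_emp, [0.0, 0.0], lam=0.5))   # shrunk towards zero
```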
EQUIVALENCE TO CONSTRAINED OPTIMIZATION
Norm penalties can be interpreted as imposing a constraint on the
weights. One can show that the penalized problem min_θ R_emp(θ) + (λ/2) ∥θ∥_2^2
is equivalent to the constrained problem min_θ R_emp(θ) s.t. ∥θ∥_2^2 ≤ t
for a suitable value of t that depends on λ.
Deep Learning – 4 / 9
EXAMPLE: WEIGHT DECAY
Deep Learning – 5 / 9
EXAMPLE: WEIGHT DECAY
Figure: Training error over the training run for weight decay values 0, 10^(−5), 10^(−4), 10^(−3), 10^(−2).
A high weight decay of 10^(−2) leads to a high error on the training data.
Deep Learning – 6 / 9
EXAMPLE: WEIGHT DECAY
Figure: Test error over the training run for weight decay values 0, 10^(−5), 10^(−4), 10^(−3), 10^(−2).
The second-strongest weight decay leads to the best result on the test data.
Deep Learning – 7 / 9
TENSORFLOW PLAYGROUND
https://playground.tensorflow.org/
Deep Learning – 8 / 9
TENSORFLOW PLAYGROUND - EXERCISE
https://developers.google.com/machine-learning/crash-course/
regularization-for-simplicity/
playground-exercise-examining-l2-regularization
Deep Learning – 9 / 9
Introduction to Machine Learning
Learning goals
Understand that regularization
and parameter shrinkage can be
applied to non-linear models
Know structural risk minimization
Know how regularized risk
minimization is the same as MAP
estimation from a Bayesian perspective,
where the penalty corresponds to a
parameter prior.
SUMMARY: REGULARIZED RISK MINIMIZATION
If we should define ML in only one line, this might be it:
min_θ R_reg(θ) = min_θ [ Σ_{i=1}^n L( y^(i), f(x^(i) | θ) ) + λ · J(θ) ]
We see the typical U-shape with the sweet spot between overfitting
(LHS, low λ) and underfitting (RHS, high λ) in the middle.
Figure: U-shaped generalization error and monotonically decreasing training error as a function of model complexity.
Equivalently, regularized risk minimization can be written as a constrained problem:
min_θ Σ_{i=1}^n L( y^(i), f(x^(i) | θ) )   s.t.   ∥θ∥_2^2 ≤ t
Assume we have a parameterized distribution p(y |θ, x) for our data and
a prior q (θ) over our parameter space, all in the Bayesian framework.
Learning goals
Have a geometric understanding
of L2 regularization
Understand why L2
regularization in combination
with gradient descent is called
weight decay
WEIGHT DECAY VS. L2 REGULARIZATION
Let us optimize the L2-regularized risk of a model f (x | θ)
min_θ R_reg(θ) = min_θ R_emp(θ) + (λ/2) ∥θ∥_2^2
by gradient descent. The gradient is ∇_θ R_emp(θ) + λθ, so the update is
θ^[new] = θ^[old] − α ( ∇_θ R_emp(θ^[old]) + λ θ^[old] )
        = θ^[old] (1 − αλ) − α ∇_θ R_emp(θ^[old]).
The factor (1 − αλ) shrinks ("decays") the weights in every step, hence the name weight decay.
To see where the regularized minimizer ends up, approximate R_emp by its second-order expansion around the unregularized minimizer θ̂:
R̃_emp(θ) = R_emp(θ̂) + ∇_θ R_emp(θ̂) · (θ − θ̂) + (1/2) (θ − θ̂)^⊤ H (θ − θ̂),
R̃_reg(θ) = R̃_emp(θ) + (λ/2) ∥θ∥_2^2.
Setting the gradient of the regularized approximation to zero yields
∇_θ R̃_reg(θ) = 0,
λθ + H (θ − θ̂) = 0,
(H + λI) θ = H θ̂,
θ̂_Ridge = (H + λI)^{−1} H θ̂ = Q (Σ + λI)^{−1} Σ Q^⊤ θ̂,
where H = Q Σ Q^⊤ is the eigendecomposition of the Hessian.
Figure: The solid ellipses represent the contours of the unregularized objective and
the dashed circles represent the contours of the L2 penalty. At θ̂Ridge , the competing
objectives reach an equilibrium.
In the first dimension, the eigenvalue of the Hessian of Remp (θ) is small. The
objective function does not increase much when moving horizontally away
from θ̂ . Therefore, the regularizer has a strong effect on this axis and θ1 is
pulled close to zero.
Early Stopping
Learning goals
Know how early stopping works
Understand how early stopping
acts as a regularizer
EARLY STOPPING
When training with an iterative optimizer such as SGD, it is
commonly the case that, after a certain number of iterations,
generalization error begins to increase even though training error
continues to decrease.
Early stopping refers to stopping the algorithm early before the
generalization error increases.
Figure: An illustration of the effect of early stopping. Left: The solid contour lines
indicate the contours of the negative log-likelihood. The dashed line indicates the
trajectory taken by SGD beginning from the origin. Rather than stopping at the point θ̂
that minimizes the risk, early stopping results in the trajectory stopping at an earlier
point θ̂Ridge . Right: An illustration of the effect of L2 regularization for comparison. The
dashed circles indicate the contours of the L2 penalty which causes the minimum of the
total cost to lie closer to the origin than the minimum of the unregularized cost.
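A sketch of the early-stopping logic (the `step` and `validate` callbacks, the patience value, and the epoch budget are assumptions):

```python
import numpy as np

def train_with_early_stopping(step, validate, max_epochs=200, patience=10):
    """Stop when the validation error has not improved for `patience` epochs
    and return the best parameters seen so far."""
    best_err, best_theta, since_best = np.inf, None, 0
    for epoch in range(max_epochs):
        theta = step()                 # one epoch of (S)GD, returns current parameters
        val_err = validate(theta)      # error on a held-out validation set
        if val_err < best_err:
            best_err, best_theta, since_best = val_err, theta.copy(), 0
        else:
            since_best += 1
            if since_best >= patience: # generalization error keeps increasing
                break
    return best_theta, best_err
```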
Learning goals
Recap: Ensemble Methods
Dropout
Augmentation
Ensemble Methods
Deep Learning – 1 / 22
RECAP: ENSEMBLE METHODS
Idea: Train several models separately, and average their
prediction (i.e. perform model averaging).
Intuition: This improves performance on test set, since different
models will not make the same errors.
Ensembles can be constructed in different ways, e.g.:
by combining completely different kind of models (using
different learning algorithms and loss functions).
by bagging: train the same model on k datasets, constructed
by sampling n samples from original dataset.
Since training a neural network repeatedly on the same dataset
results in different solutions (why?) it can even make sense to
combine those.
Deep Learning – 2 / 22
RECAP: ENSEMBLE METHODS
Deep Learning – 3 / 22
Dropout
Deep Learning – 4 / 22
DROPOUT
Idea: reduce overfitting in neural networks by preventing complex
co-adaptations of neurons.
Method: during training, random subsets of the neurons are
removed from the network (they are "dropped out"). This is done by
artificially setting the activations of those neurons to zero.
Whether a given unit/neuron is dropped out or not is completely
independent of the other units.
If the network has N (input/hidden) units, applying dropout to these
units can result in 2^N possible 'subnetworks'.
Because these subnetworks are derived from the same ’parent’
network, many of the weights are shared.
Dropout can be seen as a form of "model averaging".
Deep Learning – 5 / 22
DROPOUT
Deep Learning – 6 / 22
DROPOUT
In each iteration, for each training example (in the forward pass), a
different (random) subset of neurons are dropped out.
Deep Learning – 7 / 22
DROPOUT
In each iteration, for each training example (in the forward pass), a
different (random) subset of neurons are dropped out.
Deep Learning – 8 / 22
DROPOUT
In each iteration, for each training example (in the forward pass), a
different (random) subset of neurons are dropped out.
Deep Learning – 9 / 22
DROPOUT: ALGORITHM
To train with dropout a minibatch-based learning algorithm such as
stochastic gradient descent is used.
For each training case in a minibatch, we randomly sample a
binary vector/mask µ with one entry for each input or hidden unit in
the network. The entries of µ are sampled independently from
each other.
The probability of sampling a mask value of 0 (dropout) for one
unit is a hyperparameter known as the ’dropout rate’.
A typical value for the dropout rate is 0.2 for input units and 0.5 for
hidden units.
Each unit in the network is multiplied by the corresponding mask
value resulting in a subnetµ .
Forward propagation, backpropagation, and the learning update
are run as usual.
Deep Learning – 10 / 22
DROPOUT: ALGORITHM
Algorithm 1 Training a (parent) neural network with dropout rate p
1: Define parent network and initialize weights
2: for each minibatch: do
3: for each training sample: do
4: Draw mask µ using p
5: Compute forward pass for subnetµ
6: end for
7: Update the weights of the (parent) network by performing a gradient descent step
with weight decay
8: end for
The derivatives wrt. each parameter are averaged over the training
cases in each mini-batch. Any training case which does not use a
parameter contributes a gradient of zero for that parameter.
Deep Learning – 11 / 22
DROPOUT: WEIGHT SCALING
The weights of the network will be larger than normal because of
dropout. Therefore, to obtain a prediction at test time the weights
must be first scaled by the chosen dropout rate.
This means that if a unit (neuron) is retained with probability p
during training, the weight of that unit is multiplied by p at test time.
Deep Learning – 12 / 22
DROPOUT: WEIGHT SCALING
Rescaling of the weights can also be performed at training time
instead, after each weight update at the end of the mini-batch.
This is sometimes called ’inverse dropout’. Keras and PyTorch
deep learning libraries implement dropout in this way.
Deep Learning – 13 / 22
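A sketch of the 'inverse dropout' variant described above (the activation values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(z, p_drop, training=True):
    """Drop units with probability p_drop during training and rescale the
    survivors by 1/(1 - p_drop), so no rescaling is needed at test time."""
    if not training or p_drop == 0.0:
        return z
    mask = rng.random(z.shape) >= p_drop            # binary mask mu, sampled independently
    return z * mask / (1.0 - p_drop)

z = np.array([0.5, -1.2, 0.3, 2.0, 0.9])            # made-up hidden activations
print(dropout_forward(z, p_drop=0.5))               # roughly half the entries set to zero
print(dropout_forward(z, p_drop=0.5, training=False))  # unchanged at test time
```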
DROPOUT: EXAMPLE
Deep Learning – 14 / 22
DROPOUT: EXAMPLE
Figure: Test error over the training run for dropout rates (input; hidden layers) of (0; 0), (0.2; 0.2), and (0.6; 0.5).
Dropout rate of 0 (no dropouts) leads to higher test error than dropping
some units out.
Deep Learning – 15 / 22
DROPOUT, WEIGHT DECAY OR BOTH?
Figure: Test error comparison of an unregularized network vs. dropout, weight decay, and dropout + weight decay.
Deep Learning – 16 / 22
Dataset Augmentation
Deep Learning – 17 / 22
DATASET AUGMENTATION
Problem: low generalization because of a high ratio of model complexity (number of parameters) to the amount of training data.
Deep Learning – 18 / 22
DATASET AUGMENTATION
Deep Learning – 19 / 22
DATASET AUGMENTATION
Deep Learning – 20 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2012)
Improving neural networks by preventing co-adaptation of feature detectors
http://arxiv.org/abs/1207.0580
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2014)
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
http://jmlr.org/papers/v15/srivastava14a.html
Wu Ren, Yan Shengen, Shan Yi, Dang Qingqing and Sun Gang (2015)
Deep Image: Scaling up Image Recognition
https://arxiv.org/abs/1501.02876
Deep Learning – 21 / 22
Deep Learning
Challenges in Optimization
Learning goals
Ill-Conditioning
Local Minima
Saddle Points
Cliffs and Exploding Gradients
CHALLENGES IN OPTIMIZATION
In this section, we summarize several of the most prominent
challenges regarding training of deep neural networks.
Traditionally, machine learning ensures that the optimization
problem is convex by carefully designing the objective function and
constraints. But for neural networks we are confronted with the
general nonconvex case.
Furthermore, we will see in this section that even convex
optimization is not without its complications.
Deep Learning – 1 / 34
Ill-Conditioning
Deep Learning – 2 / 34
EFFECTS OF CURVATURE
Intuitively, the curvature of a function determines the outcome of a GD
step. . .
Figure: Quadratic objective function f (x) with various curvatures. The dashed line
indicates the first order taylor approximation based on the gradient information alone.
Left: With negative curvature, the cost function decreases faster than the gradient
predicts; Middle: With no curvature, the gradient predicts the decrease correctly; Right:
With positive curvature, the function decreases more slowly than expected and begins
to increase.
Deep Learning – 3 / 34
SECOND DERIVATIVE AND CURVATURE
To understand better how the curvature of a function influences the
outcome of a gradient descent step, let us recall how curvature is
described mathematically:
The second derivative corresponds to the curvature of the graph of
a function.
The Hessian matrix of a function R(θ) : ℝ^m → ℝ is the matrix of
second-order partial derivatives
H_ij = ∂²R(θ) / (∂θ_i ∂θ_j).
Deep Learning – 4 / 34
SECOND DERIVATIVE AND CURVATURE
The second derivative in a direction d, with ∥d∥ = 1, is given by
d⊤H d.
What is the direction of the highest curvature (red direction), and
what is the direction of the lowest curvature (blue)?
Deep Learning – 5 / 34
SECOND DERIVATIVE AND CURVATURE
Since H is real and symmetric (why?), eigendecomposition yields
H = Vdiag(λ)V−1 with V and λ collecting eigenvectors and
eigenvalues, respectively.
It can be shown that the eigenvector v_max with the maximal
eigenvalue λ_max points into the direction of highest curvature
(v_max^⊤ H v_max = λ_max), while the eigenvector v_min with the minimal
eigenvalue λ_min points into the direction of least curvature.
Deep Learning – 6 / 34
SECOND DERIVATIVE AND CURVATURE
At a stationary point θ , where the gradient is 0, we can examine
the eigenvalues of the Hessian to determine whether θ is a
local maximum, minimum or saddle point:
∀i : λi > 0 (H positive definite at θ) ⇒ minimum at θ
∀i : λi < 0 (H negative definite at θ) ⇒ maximum at θ
∃ i : λi < 0 ∧ ∃j : λj > 0 (H indefinit at θ) ⇒ saddle point at θ
Deep Learning – 7 / 34
ILL-CONDITIONED HESSIAN MATRIX
The condition number of a symmetric matrix A is given by the ratio of its
largest and smallest eigenvalues (in absolute value), κ(A) = |λ_max| / |λ_min|.
A matrix is called ill-conditioned if the condition number κ(A) is very high.
Deep Learning – 8 / 34
CURVATURE AND STEP-SIZE IN GD
What does it mean for gradient descent if the Hessian is ill-conditioned?
Let us consider the second-order Taylor approximation as a local
approximation of R around a current point θ^0 (with gradient g):
R(θ) ≈ T_2 f(θ, θ^0) := R(θ^0) + (θ − θ^0)^⊤ g + (1/2) (θ − θ^0)^⊤ H (θ − θ^0)
Furthermore, Taylor's theorem states (proof in Koenigsberger
(1997), p. 68)
lim_{θ → θ^0}  ( R(θ) − T_2 f(θ, θ^0) ) / ∥θ − θ^0∥² = 0
Deep Learning – 9 / 34
CURVATURE AND STEP-SIZE IN GD
One GD step with a learning rate α yields new parameters
θ^0 − αg and a new approximated loss value
R(θ^0 − αg) ≈ R(θ^0) − α g^⊤ g + (1/2) α² g^⊤ H g.
Minimizing this approximation over α gives the optimal step size
α* = g^⊤ g / ( g^⊤ H g ).
Deep Learning – 10 / 34
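A small numerical illustration of the formulas above on a made-up quadratic risk:

```python
import numpy as np

# R(theta) = 0.5 * theta^T H theta with an ill-conditioned Hessian
H = np.array([[20.0, 0.0],
              [0.0, 1.0]])             # curvatures 20 and 1
theta0 = np.array([1.0, 1.0])
g = H @ theta0                          # gradient at theta0

eigvals = np.linalg.eigvalsh(H)
print("condition number:", eigvals.max() / eigvals.min())          # 20

alpha_star = (g @ g) / (g @ H @ g)      # optimal step size along the gradient
print("alpha*:", alpha_star)            # lies between 1/lambda_max and 1/lambda_min
print("1/lambda_max, 1/lambda_min:", 1 / eigvals.max(), 1 / eigvals.min())
```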
CURVATURE AND STEP-SIZE IN GD
Let us assume the gradient g points into the direction of v_max (i.e.
the direction of highest curvature). Then the optimal step size is
α* = g^⊤ g / ( g^⊤ H g ) = g^⊤ g / ( λ_max g^⊤ g ) = 1 / λ_max,
which is very small. Choosing a too large step-size is bad, as it
will make us "overshoot" the stationary point.
If, on the other hand, g points into the direction of the lowest
curvature, the optimal step size is
α* = 1 / λ_min,
which corresponds to the largest possible optimal step-size.
We summarize: We want to perform big steps in directions of low
curvature, but small steps in directions of high curvature.
Deep Learning – 11 / 34
CURVATURE AND STEP-SIZE IN GD
But what if the gradient does not point into the direction of one of
the eigenvectors?
Let us consider the 2-dimensional case: We can decompose the
direction of g (black) into the two eigenvectors vmax and vmin
It would be optimal to perform a big step into the direction of the
smallest curvature vmin , but a small step into the direction of vmax ,
but the gradient points into a completely different direction.
Deep Learning – 12 / 34
ILL-CONDITIONING
GD is unaware of large differences in curvature, and can only walk
into the direction of the gradient.
Choosing a too large step-size will then cause the descent
direction change frequently (“jumping around”).
α needs to be small enough, which results in a low progress.
Deep Learning – 13 / 34
ILL-CONDITIONING
This effect is more severe, if a Hessian has a poor condition
number, i.e. the ratio between lowest and highest curvature is
large; gradient descent will perform poorly.
Figure: The contour lines show a quadratic risk function with a poorly conditioned
Hessian matrix. The plot shows the progress of gradient descent with a small step-size
vs. larger step-size. In both cases, convergence to the global optimum is rather slow.
Deep Learning – 14 / 34
ILL-CONDITIONING
In the worst case, ill-conditioning of the Hessian matrix and a too
big step-size will cause the risk to increase
R(θ^0 − αg) ≈ R(θ^0) − α g^⊤ g + (1/2) α² g^⊤ H g,
which happens if
(1/2) α² g^⊤ H g > α g^⊤ g.
2
To determine whether ill-conditioning is detrimental to the training,
the squared gradient norm g⊤ g and the risk can be monitored.
Deep Learning – 15 / 34
ILL-CONDITIONING
Deep Learning – 16 / 34
Local Minima
Deep Learning – 17 / 34
UNIMODAL VS. MULTIMODAL LOSS SURFACES
Figure: Left: Multimodal loss surface with saddle points; Right: (Nearly)
unimodal loss surface (Hao Li et al. (2017))
Deep Learning – 18 / 34
MULTIMODAL FUNCTION
Potential snippet from a loss surface of a deep neural network with
many local minima:
Deep Learning – 19 / 34
ONLY LOCALLY OPTIMAL MOVES
If the training algorithm makes only locally optimal moves (as in gradient
descent), it may move away from regions of much lower cost.
In the figure above, initializing the parameter on the "wrong" side of the
hill will result in suboptimal performance.
In higher dimensions, however, it may be possible for gradient descent to
go around the hill but such a trajectory might be very long and result in
excessive training time.
Deep Learning – 20 / 34
LOCAL MINIMA
Weight space symmetry:
If we swap the incoming weight vectors of neurons i and j and do the
same for the outgoing weights, the modelled function stays
unchanged.
⇒ with n hidden units and one hidden layer there are n!
networks with the same empirical risk
If we multiply the incoming weights of a ReLU neuron with β and the
outgoing weights with 1/β, the modelled function stays unchanged.
⇒ The empirical risk of a NN can have very many minima with
equivalent empirical risk.
Deep Learning – 21 / 34
LOCAL MINIMA
In practice only local minima with a high value compared to the
global minimum are problematic.
Deep Learning – 22 / 34
Saddle Points
Deep Learning – 23 / 34
SADDLE POINTS
In optimization we look for areas with zero gradient.
A variant of zero gradient areas are saddle points.
For the empirical risk R : ℝ^m → ℝ of a neural network, the expected ratio of
the number of saddle points to local minima typically grows
exponentially with m.
In other words: Networks with more parameters (deeper networks
or larger layers) exhibit a lot more saddle points than local minima.
Why is that?
The Hessian at a local minimum has only positive eigenvalues. At
a saddle point it is a mixture of positive and negative eigenvalues.
Deep Learning – 24 / 34
SADDLE POINTS
Imagine the sign of each eigenvalue is generated by coin flipping:
In a single dimension, it is easy to obtain a local minimum
(e.g. “head” means positive eigenvalue).
In an m-dimensional space, it is exponentially unlikely that all
m coin tosses will be head.
A property of many random functions is that eigenvalues of the
Hessian become more likely to be positive in regions of lower cost.
For the coin flipping example, this means we are more likely to
have heads m times if we are at a critical point with low cost.
That means in particular that local minima are much more likely to
have low cost than high cost and critical points with high cost are
far more likely to be saddle points.
See Dauphin et al. (2014) for a more detailed investigation.
Deep Learning – 25 / 34
SADDLE POINTS
“Saddle points are surrounded by high error plateaus that can
dramatically slow down learning, and give the illusory impression
of the existence of a local minimum” (Dauphin et al. (2014)).
Deep Learning – 26 / 34
SADDLE POINTS: EXAMPLE
Deep Learning – 27 / 34
SADDLE POINTS
So how do saddle points impair optimization?
First-order algorithms that use only gradient information might get
stuck in saddle points.
Second-order algorithms experience even greater problems when
dealing with saddle points. Newton's method, for example, actively
searches for a region with zero gradient. That might be another
reason why second-order methods have not succeeded in
replacing gradient descent for neural network training.
Deep Learning – 28 / 34
EXAMPLE: SADDLE POINT WITH GD
Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD
First step...
Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD
...second step...
Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD
...by the tenth step, the algorithm is stuck and cannot escape the saddle point!
Deep Learning – 29 / 34
Cliffs and Exploding Gradients
Deep Learning – 30 / 34
CLIFFS AND EXPLODING GRADIENTS
As a result of the multiplication of several parameters, the
empirical risk of highly nonlinear deep neural networks often
contains sharp nonlinearities.
That may result in very high derivatives in some places.
As the parameters get close to such cliff regions, a gradient
descent update can catapult the parameters very far.
Such an occurrence can lead to losing most of the
optimization work that had been done.
However, serious consequences can be easily avoided using a
technique called gradient clipping.
The gradient does not specify the optimal step size, but only the
optimal direction within an infinitesimal region.
Deep Learning – 31 / 34
CLIFFS AND EXPLODING GRADIENTS
Gradient clipping simply caps the step size to be small enough that
it is less likely to go outside the region where the gradient indicates
the direction of steepest descent.
We simply “prune” the norm of the gradient at some threshold h:

if ‖∇θ‖ > h :   ∇θ ← (h / ‖∇θ‖) · ∇θ
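As a quick illustration of the clipping rule above, here is a minimal NumPy sketch; the threshold h and the toy gradient are made-up values:

```python
import numpy as np

def clip_gradient(grad, h):
    """Rescale the gradient if its L2 norm exceeds the threshold h (gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > h:
        grad = (h / norm) * grad
    return grad

# A "cliff" gradient with norm 50 keeps its direction but is shrunk to norm 1.0.
g = np.array([30.0, 40.0])
print(clip_gradient(g, h=1.0))   # [0.6 0.8]
```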
Deep Learning – 32 / 34
EXAMPLE: CLIFFS AND EXPLODING GRADIENTS
Figure: “The objective function for highly nonlinear deep neural networks or
for recurrent neural networks often contains sharp nonlinearities in parameter
space resulting from the multiplication of several parameters. These
nonlinearities give rise to very high derivatives in some places. When the
parameters get close to such a cliff region, a gradient descent update can
catapult the parameters very far, possibly losing most of the optimization work
that had been done” (Goodfellow et al. (2016)).
Deep Learning – 33 / 34
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya
Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Konrad Koenigsberger (1997)
Analysis 2, Springer
Rong Ge (2016)
Escaping from Saddle Points
http://www.offconvex.org/2016/03/22/saddlepoints/
Deep Learning – 34 / 34
Deep Learning
Advanced Optimization
Learning goals
SGD with Momentum
Learning Rate Schedules
Adaptive Learning Rates
Batch Normalization
Momentum
Deep Learning – 1 / 47
MOMENTUM
While SGD remains a popular optimization strategy, learning with it
can sometimes be slow.
Momentum is designed to accelerate learning, especially in the face of
high curvature, small but consistent gradients, or noisy gradients.
Momentum accumulates an exponentially decaying moving
average of past gradients:
ν ← φν − α [ (1/m) Σᵢ ∇θ L(y⁽ⁱ⁾, f(x⁽ⁱ⁾, θ)) ]
θ ← θ + ν

where the term in brackets is the (minibatch) gradient estimate g.
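A minimal sketch of this update rule, applied to a toy quadratic risk R(θ) = ½‖θ‖² (so the gradient is simply θ); step size and momentum values are illustrative:

```python
import numpy as np

def momentum_step(theta, nu, grad, alpha=0.1, phi=0.9):
    """One SGD-with-momentum update: accumulate the velocity, then move."""
    nu = phi * nu - alpha * grad     # exponentially decaying average of past gradients
    theta = theta + nu
    return theta, nu

theta = np.array([1.0, -2.0])
nu = np.zeros_like(theta)
for _ in range(100):
    theta, nu = momentum_step(theta, nu, grad=theta)   # gradient of the toy quadratic
print(theta)   # close to the minimum at (0, 0)
```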
Deep Learning – 3 / 47
MOMENTUM: EXAMPLE
ν1 ← φν0 − αg (θ[0] )
θ[1] ← θ[0] + φν0 − αg (θ[0] )
ν2 ← φν1 − αg (θ[1] )
= φ(φν0 − αg (θ[0] )) − αg (θ[1] )
θ[2] ← θ[1] + φ(φν0 − αg (θ[0] )) − αg (θ[1] )
ν3 ← φν2 − αg (θ[2] )
= φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
θ[3] ← θ[2] + φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= θ[2] + φ3 ν0 − φ2 αg (θ[0] ) − φαg (θ[1] ) − αg (θ[2] )
= θ[2] − α(φ2 g (θ[0] ) + φ1 g (θ[1] ) + φ0 g (θ[2] )) + φ3 ν0
θ[t+1] = θ[t] − α Σ_{j=0}^{t} φ^j g(θ[t−j]) + φ^{t+1} ν₀
Deep Learning – 4 / 47
MOMENTUM: EXAMPLE
Suppose momentum always observes the same gradient g (θ):
θ[t+1] = θ[t] − α Σ_{j=0}^{t} φ^j g(θ[t−j]) + φ^{t+1} ν₀
       = θ[t] − α g(θ) Σ_{j=0}^{t} φ^j + φ^{t+1} ν₀
       = θ[t] − α g(θ) (1 − φ^{t+1}) / (1 − φ) + φ^{t+1} ν₀
       → θ[t] − α g(θ) / (1 − φ)   for t → ∞.

Thus, momentum will accelerate in the direction of −g(θ) until reaching terminal
velocity with step size:

−α g(θ) (1 + φ + φ² + φ³ + ...) = −α g(θ) / (1 − φ)
Deep Learning – 5 / 47
MOMENTUM: ILLUSTRATION
The vector ν3 (for ν0 = 0):
ν3 = φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= −φ2 (αg (θ[0] )) − φ(αg (θ[1] )) − αg (θ[2] )
Figure: If consecutive (negative) gradients point mostly in the same direction, the
velocity "builds up". On the other hand, if consecutive (negative) gradients point in very
different directions, the velocity "dies down".
Deep Learning – 6 / 47
SGD WITH MOMENTUM
Deep Learning – 7 / 47
SGD WITH MOMENTUM
Figure: The contour lines show a quadratic loss function with a poorly conditioned
Hessian matrix. The two curves show how standard gradient descent (black) and
momentum (red) learn when dealing with ravines. Momentum reduces the oscillation
and accelerates the convergence.
Deep Learning – 8 / 47
SGD WITH AND WITHOUT MOMENTUM
The following plot was created by our Shiny App. On the upper left you can explore
different predefined examples. Click here
Deep Learning – 9 / 47
MOMENTUM IN PRACTICE
Deep Learning – 10 / 47
MOMENTUM IN PRACTICE
The higher the momentum, the faster SGD learns the weights on the training data,
but if the momentum is too large, the training and test error fluctuate.
Deep Learning – 11 / 47
MOMENTUM IN PRACTICE
The higher the momentum, the faster SGD learns the weights on the training data,
but if the momentum is too large, the training and test error fluctuate.
Deep Learning – 12 / 47
NESTEROV MOMENTUM
Momentum aims to counteract the poor conditioning of the Hessian as
well as the variance in the stochastic gradient.
Nesterov momentum modifies the algorithm such that the gradient
is evaluated after the current velocity is applied:
ν ← φν − α [ (1/m) Σᵢ ∇θ L(y⁽ⁱ⁾, f(x⁽ⁱ⁾, θ + φν)) ]
θ ← θ + ν
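The same toy setup as before, now with the gradient evaluated at the look-ahead point θ + φν (a minimal sketch; step size and momentum are illustrative):

```python
import numpy as np

def nesterov_step(theta, nu, grad_fn, alpha=0.1, phi=0.9):
    """Nesterov momentum: evaluate the gradient after the current velocity is applied."""
    g = grad_fn(theta + phi * nu)    # look-ahead gradient
    nu = phi * nu - alpha * g
    theta = theta + nu
    return theta, nu

theta = np.array([1.0, -2.0])
nu = np.zeros_like(theta)
for _ in range(100):
    theta, nu = nesterov_step(theta, nu, grad_fn=lambda t: t)   # toy quadratic risk
print(theta)   # close to (0, 0)
```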
Deep Learning – 13 / 47
SGD WITH NESTEROV MOMENTUM
Deep Learning – 14 / 47
MOMENTUM VS. NESTEROV MOMENTUM
Deep Learning – 15 / 47
Learning Rates
Deep Learning – 16 / 47
LEARNING RATE
The learning rate is a very important hyperparameter.
To systematically find a good learning rate, we can start at a very
low learning rate and gradually increase it (linearly or
exponentially) after each mini-batch.
We can then plot the learning rate and the training loss for each
batch.
A good learning rate is one that results in a steep decline in the
loss.
Credit: jeremyjordan
Deep Learning – 17 / 47
LEARNING RATE SCHEDULE
We would like to enforce convergence, i.e. actually reach a local minimum.
When applying SGD, we have to decrease the learning rate over time,
hence α[t] (the learning rate at training iteration t).
The estimator ĝ is computed based on small batches.
Randomly sampling m training samples introduces noise that
does not vanish even if we find a minimum.
In practice, a common strategy is to decay the learning rate
linearly over time until iteration τ :
α[t] = (1 − t/τ) α[0] + (t/τ) α[τ]   for t ≤ τ
α[t] = α[τ]                          for t > τ
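A minimal sketch of this linear decay schedule; the values of α[0], α[τ] and τ are made up for illustration and match the τ = 4 example that follows:

```python
def linear_decay_lr(t, tau, alpha_0, alpha_tau):
    """Decay the learning rate linearly from alpha_0 to alpha_tau until iteration tau."""
    if t <= tau:
        return (1 - t / tau) * alpha_0 + (t / tau) * alpha_tau
    return alpha_tau

print([round(linear_decay_lr(t, tau=4, alpha_0=0.1, alpha_tau=0.01), 4)
       for t in range(7)])   # [0.1, 0.0775, 0.055, 0.0325, 0.01, 0.01, 0.01]
```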
Deep Learning – 18 / 47
LEARNING RATE SCHEDULE
Example for τ = 4:
iteration t    t/τ      α[t]
1              0.25     (1 − 1/4) α[0] + (1/4) α[τ] = (3/4) α[0] + (1/4) α[τ]
2              0.5      (2/4) α[0] + (2/4) α[τ]
3              0.75     (1/4) α[0] + (3/4) α[τ]
4              1        0 · α[0] + α[τ]
...                     α[τ]
t + 1                   α[τ]
Deep Learning – 19 / 47
CYCLICAL LEARNING RATES
Another option is to have a learning rate that periodically varies
according to some cyclic function.
This way, if training does not improve the loss anymore (possibly
because the optimizer is stuck near saddle points), increasing the
learning rate makes it possible to rapidly traverse such regions.
Recall, saddle points are far more likely than local minima in deep
nets.
Each cycle has a fixed length in terms of the number of iterations.
Deep Learning – 20 / 47
CYCLICAL LEARNING RATES
One such cyclical function is the "triangular" function.
In the right image, the range is cut in half after each cycle.
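A minimal sketch of the "triangular" schedule; the base/max learning rates and cycle length are made-up values, and the halving of the range after each cycle (right image) is omitted here:

```python
def triangular_lr(t, base_lr, max_lr, cycle_len):
    """Cyclical 'triangular' schedule: ramp linearly up to max_lr and back down each cycle."""
    half = cycle_len / 2.0
    pos = t % cycle_len
    frac = pos / half if pos <= half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac

print([round(triangular_lr(t, base_lr=0.001, max_lr=0.01, cycle_len=8), 4)
       for t in range(9)])   # rises to 0.01 at t = 4, falls back to 0.001 at t = 8
```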
Deep Learning – 21 / 47
CYCLICAL LEARNING RATES
Yet another option is to abruptly "restart" the learning rate after a
fixed number of iterations.
Loshchilov et al. (2016) proposed "cosine annealing" (between
restarts).
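A minimal sketch of cosine annealing with abrupt restarts in the spirit of Loshchilov et al. (2016); the restart period and the learning rate range are illustrative, and refinements such as lengthening the period after each restart are omitted:

```python
import math

def cosine_restart_lr(t, period, lr_min, lr_max):
    """Cosine-anneal the learning rate within each period, then restart at lr_max."""
    t_cur = t % period                      # iterations since the last restart
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / period))

print([round(cosine_restart_lr(t, period=10, lr_min=0.001, lr_max=0.1), 4)
       for t in (0, 5, 9, 10)])   # 0.1, ~0.05, ~0.003, then back to 0.1 at the restart
```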
Deep Learning – 22 / 47
Algorithms with Adaptive Learning Rates
Deep Learning – 23 / 47
ADAPTIVE LEARNING RATES
The learning rate is one of the most difficult hyperparameters to set,
and it has a significant impact on the model's performance.
Naturally, it might make sense to use a different learning rate for
each parameter, and automatically adapt them throughout the
training process.
Deep Learning – 24 / 47
ADAGRAD
Adagrad adapts the learning rate to the parameters.
In fact, Adagrad scales learning rates inversely proportional to the
square root of the sum of the past squared derivatives.
Parameters with large partial derivatives of the loss obtain a
rapid decrease in their learning rate.
Parameters with small partial derivatives on the other hand
obtain a relatively small decrease in their learning rate.
For that reason, Adagrad might be well suited when dealing with
sparse data.
Goodfellow et al. (2016) note that the accumulation of squared
gradients can result in a premature and excessive decrease in the
learning rate.
Deep Learning – 25 / 47
ADAGRAD
Algorithm Adagrad
1: require Global learning rate α
2: require Initial parameter θ
3: require Small constant β, perhaps 10⁻⁷, for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6:    Sample a minibatch of m examples from the training set {x̃(1), ..., x̃(m)}
7:    Compute gradient estimate: ĝ ← (1/m) Σᵢ ∇θ L(y(i), f(x̃(i) | θ))
8:    Accumulate squared gradient: r ← r + ĝ ⊙ ĝ
9:    Compute update: ∆θ = −(α / (β + √r)) ⊙ ĝ   (division and square root applied element-wise)
10:   Apply update: θ ← θ + ∆θ
11: end while
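A minimal NumPy sketch of the Adagrad update above, run on a toy quadratic risk whose gradient is simply θ; the learning rate α is a made-up value, β follows the 10⁻⁷ suggestion:

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.1, beta=1e-7):
    """Adagrad: per-parameter step sizes shrink with the accumulated squared gradients."""
    r = r + grad * grad                             # accumulate squared gradient
    theta = theta - alpha / (beta + np.sqrt(r)) * grad
    return theta, r

theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)
for _ in range(500):
    theta, r = adagrad_step(theta, r, grad=theta)   # toy quadratic risk
print(theta)   # approaches (0, 0), but progress slows as r keeps growing
```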
Deep Learning – 26 / 47
RMSPROP
RMSprop is a modification of Adagrad.
Its intention is to resolve Adagrad's radically diminishing learning
rates.
The gradient accumulation is replaced by an exponentially
weighted moving average.
Theoretically, that leads to performance gains in non-convex
scenarios.
Empirically, RMSProp is a very effective optimization algorithm.
Particularly, it is employed routinely by deep learning practitioners.
Deep Learning – 27 / 47
RMSPROP
Algorithm RMSProp
1: require Global learning rate α and decay rate ρ ∈ [0, 1)
2: require Initial parameter θ
3: require Small constant β , perhaps 10−6 , for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
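The pseudocode above is truncated; the remaining steps mirror Adagrad, except that the accumulation becomes the exponentially weighted moving average described before. A minimal sketch under that assumption (the exact placement of the stability constant β varies between formulations):

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha=0.01, rho=0.9, beta=1e-6):
    """RMSProp: Adagrad with an exponentially weighted moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad * grad           # moving average instead of a plain sum
    theta = theta - alpha / (beta + np.sqrt(r)) * grad
    return theta, r

theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, grad=theta)   # toy quadratic risk
print(theta)   # the step size no longer shrinks irreversibly as in Adagrad
```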
Deep Learning – 28 / 47
ADAM
Adaptive Moment Estimation (Adam) is another method that
computes adaptive learning rates for each parameter.
Adam uses the first and the second moments of the gradients.
Adam keeps an exponentially decaying average of past
gradients (first moment).
Like RMSProp it stores an exponentially decaying average of
past squared gradients (second moment).
Thus, it can be seen as a combination of RMSProp and
momentum.
Basically, Adam uses the combined averages of previous gradients
(its first and second moments) to adaptively update the parameters.
Deep Learning – 29 / 47
ADAM
Algorithm Adam
1: require Step size α (suggested default: 0.001)
2: require Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested defaults: 0.9 and 0.999 respectively)
3: require Small constant β (suggested default 10⁻⁸)
4: require Initial parameters θ
5: Initialize time step t = 0
6: Initialize 1st and 2nd moment variables s[0] = 0, r[0] = 0
7: while stopping criterion not met do
8:    t ← t + 1
9:    Sample a minibatch of m examples from the training set {x̃(1), ..., x̃(m)}
10:   Compute gradient estimate: ĝ[t] ← (1/m) Σᵢ ∇θ L(y(i), f(x̃(i) | θ))
11:   Update biased first moment estimate: s[t] ← ρ1 s[t−1] + (1 − ρ1) ĝ[t]
12:   Update biased second moment estimate: r[t] ← ρ2 r[t−1] + (1 − ρ2) ĝ[t] ⊙ ĝ[t]
13:   Correct bias in first moment: ŝ ← s[t] / (1 − ρ1^t)
14:   Correct bias in second moment: r̂ ← r[t] / (1 − ρ2^t)
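The pseudocode above breaks off after the bias corrections; in the usual formulation the final step applies the update θ ← θ − α ŝ / (√r̂ + β). A minimal sketch under that assumption, using the suggested defaults and a toy quadratic risk:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.001, rho1=0.9, rho2=0.999, beta=1e-8):
    """One Adam step: update biased moments, correct the bias, then update the parameters."""
    s = rho1 * s + (1 - rho1) * grad                 # first moment (mean of gradients)
    r = rho2 * r + (1 - rho2) * grad * grad          # second moment (mean of squared gradients)
    s_hat = s / (1 - rho1 ** t)                      # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + beta)
    return theta, s, r

theta = np.array([1.0, -2.0])
s, r = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 3001):
    theta, s, r = adam_step(theta, s, r, t, grad=theta)   # toy quadratic risk
print(theta)   # slowly approaches (0, 0) with steps of roughly size alpha
```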
Deep Learning – 30 / 47
ADAM
Adam initializes the exponentially weighted moving averages s
and r as 0 (zero) vectors.
As a result, they are biased towards zero.
This means E[s[t ] ] ̸= E[ĝ[t ] ] and E[r[t ] ] ̸= E[ĝ[t ] ⊙ ĝ[t ] ] (where the
expectations are calculated over minibatches).
To see this, let us unroll the computation of s[t ] for a few
time-steps:
s[0] = 0
s[1] = ρ1 s[0] + (1 − ρ1) ĝ[1] = (1 − ρ1) ĝ[1]
s[2] = ρ1 s[1] + (1 − ρ1) ĝ[2] = ρ1 (1 − ρ1) ĝ[1] + (1 − ρ1) ĝ[2]
s[3] = ρ1 s[2] + (1 − ρ1) ĝ[3] = ρ1² (1 − ρ1) ĝ[1] + ρ1 (1 − ρ1) ĝ[2] + (1 − ρ1) ĝ[3]

Therefore, s[t] = (1 − ρ1) Σ_{i=1}^{t} ρ1^{t−i} ĝ[i].
Note that the contribution of the earlier ĝ[i ] to the moving average
shrinks rapidly.
Deep Learning – 31 / 47
ADAM
The expected value of s[t] is:

E[s[t]] = E[(1 − ρ1) Σ_{i=1}^{t} ρ1^{t−i} ĝ[i]]
        = E[ĝ[t]] (1 − ρ1) Σ_{i=1}^{t} ρ1^{t−i} + ζ
        = E[ĝ[t]] (1 − ρ1^t) + ζ,

where ζ is small because the weights of the earlier gradients shrink rapidly.
Dividing by (1 − ρ1^t) corrects this bias, i.e. we set ŝ[t] = s[t] / (1 − ρ1^t).
Similarly, we set r̂[t] = r[t] / (1 − ρ2^t).
Deep Learning – 32 / 47
COMPARISON OF OPTIMIZERS: ANIMATION
Figure: Excerpts from an animation to compare the behavior of momentum and other
methods compared to SGD for a saddle point. Left: After a few seconds; Right: A bit
later. The animation shows that all of the shown methods accelerate optimization compared
to standard SGD. The highest acceleration is obtained using RMSProp followed by
Adagrad as learning rate strategies. You can find the animation here or click on the
images above.
Deep Learning – 33 / 47
Batch Normalization
Deep Learning – 34 / 47
BATCH NORMALIZATION
Batch Normalization (BatchNorm) is an extremely popular
technique that improves the training speed and stability of deep
neural nets.
It is an extra component that can be placed between each layer of
the neural network.
It works by changing the "distribution" of activations at each hidden
layer of the network.
We know that it is sometimes beneficial to normalize the inputs to
a learning algorithm by shifting and scaling all the features so that
they have 0 mean and unit variance.
BatchNorm applies a similar transformation to the activations of
the hidden layers (with a couple of additional tricks).
Deep Learning – 35 / 47
BATCH NORMALIZATION
For a hidden layer with neurons zj , j = 1, . . . , J, BatchNorm is
applied to each zj by considering the activations of zj over a given
minibatch of inputs.
Let z_j^(i) denote the activation of z_j for input x^(i) in the minibatch (of
size m).
The mean and variance of the activations are

μ_j  = (1/m) Σᵢ z_j^(i)
σ_j² = (1/m) Σᵢ (z_j^(i) − μ_j)²

Each z_j^(i) is then normalized and scaled/shifted by the learnable
parameters γ_j and β_j:

z̃_j^(i) = (z_j^(i) − μ_j) / √(σ_j² + ε)
ẑ_j^(i) = γ_j z̃_j^(i) + β_j
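A minimal NumPy sketch of these formulas for one hidden layer during training; the batch size, layer width and inputs are made up, and the running averages used at prediction time (discussed below) are omitted:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each neuron over the minibatch, then scale by gamma and shift by beta.
    Z has shape (m, J): m examples in the minibatch, J neurons."""
    mu = Z.mean(axis=0)                       # per-neuron batch mean
    var = Z.var(axis=0)                       # per-neuron batch variance
    Z_tilde = (Z - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * Z_tilde + beta             # learnable scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(32, 4))      # minibatch of 32, 4 neurons
out = batchnorm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~unit std per neuron
```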
Deep Learning – 37 / 47
BATCH NORMALIZATION: ILLUSTRATION
Recall: zj = σ(WjT x + bj )
So far, we have applied batch-norm to the activation zj . It is
possible (and more common) to apply batch norm to WjT x + bj
before passing it to the nonlinear activation σ .
Deep Learning – 38 / 47
BATCH NORMALIZATION
The key impact of BatchNorm on the training process is this: It
reparametrizes the underlying optimization problem to make its
landscape significantly more smooth.
One aspect of this is that the loss changes at a smaller rate and
the magnitudes of the gradients are also smaller (see Santurkar et
al. 2018).
Deep Learning – 39 / 47
BATCH NORMALIZATION: PREDICTION
Once the network has been trained, how can we generate a
prediction for a single input (either at test time or in production)?
One option is to feed the entire training set to the (trained) network
and compute the means and standard deviations.
More commonly, during training, an exponentially weighted
running average of each of these statistics over the minibatches is
maintained.
The learned γ and β parameters are then used (in conjunction
with the running averages) to generate the output.
Deep Learning – 40 / 47
BATCH NORMALIZATION
For our final benchmark in this chapter we fit two models on the
MNIST data.
One extends our basic architecture by adding batch
normalization to all hidden layers.
We use SGD as optimizer with a momentum of 0.9, a learning rate
of 0.03 and weight decay of 0.001.
Deep Learning – 41 / 47
BATCH NORMALIZATION
Deep Learning – 42 / 47
BATCH NORMALIZATION
Deep Learning – 43 / 47
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya
Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Tim Dettmers (2015)
Deep Learning in a Nutshell: History and Training
https://devblogs.nvidia.com/deep-learning-nutshell-history-training/
Deep Learning – 44 / 47
REFERENCES
Hafidz Zulkifli (2018)
Understanding Learning Rates and How It Improves Performance in Deep
Learning
https://towardsdatascience.com
Ilya Loshchilov, Frank Hutter (2016)
SGDR: Stochastic Gradient Descent with Warm Restarts
https://arxiv.org/abs/1608.03983
Jeremy Jordan (2018)
Setting the learning rate of your neural network
https://www.jeremyjordan.me/nn-learning-rate/
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry (2018)
How Does Batch Normalization Help Optimization?
https://arxiv.org/abs/1805.11604
Deep Learning – 45 / 47
REFERENCES
Akshay Chandra (2015)
Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient
Descent
https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12
Deep Learning – 46 / 47
Deep Learning
Learning goals
Challenges in Optimization
related to Activation Functions
Activations for Hidden Units
Activations for Output Units
Hidden activations
Deep Learning – 1 / 17
HIDDEN ACTIVATIONS
Recall, hidden-layer activation functions make it possible for deep
neural nets to learn complex non-linear functions.
The design of hidden units is an extremely active area of research.
It is usually not possible to predict in advance which activation will
work best. Therefore, the design process often consists of trial and
error.
In the following, we will limit ourselves to the most popular
activations - Sigmoidal activation and ReLU.
It is possible for many other functions to perform as well as these
standard ones. An overview of further activations can be found
here .
Deep Learning – 2 / 17
SIGMOIDAL ACTIVATIONS
Sigmoidal functions such as tanh and the logistic sigmoid bound
the outputs to a certain range by "squashing" their inputs.
Deep Learning – 3 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
1 Saturating Neurons:
We know: σ′(z_in) → 0 for |z_in| → ∞.
→ Neurons with sigmoidal activations "saturate" easily, that is,
they stop being responsive when |z_in| ≫ 0.
Deep Learning – 4 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
2 Vanishing Gradients: Consider the vector of error signals δ (i ) in
layer i
δ^(i) = W^(i+1) δ^(i+1) σ′(z_in^(i)),   i ∈ {1, ..., O}.

Each k-th component of the vector expresses how much the loss L
changes when the input to the k-th neuron, z_{k,in}^(i), changes.
We know: σ′(z) < 1 for all z ∈ ℝ.
→ In each step of the recursive formula above, the value will be
multiplied by a value smaller than one:

δ^(1) = W^(2) δ^(2) σ′(z_in^(1))
      = W^(2) W^(3) δ^(3) σ′(z_in^(2)) σ′(z_in^(1))
      = ...
When this occurs, earlier layers train very slowly (or not at all).
Deep Learning – 5 / 17
RECTIFIED LINEAR UNITS (RELU)
Deep Learning – 6 / 17
RECTIFIED LINEAR UNITS (RELU)
ReLU units can significantly speed up training compared to units
with saturating activations.
Figure: A four-layer convolutional neural network with ReLUs (solid line) reaches a
25% training error rate on the CIFAR-10 dataset six times faster than an equivalent
network with tanh neurons (dashed line).
Deep Learning – 7 / 17
RECTIFIED LINEAR UNITS (RELU)
A downside of ReLU units is that when the input to the activation is
negative, the derivative is zero. This is known as the "dying ReLU
problem".
When a ReLU unit "dies", that is, when its activation is 0 for all
datapoints, it kills the gradient flowing through it during
backpropagation.
This means such units are never updated during training and the
problem can be irreversible.
Deep Learning – 8 / 17
GENERALIZATIONS OF RELU
There exist several generalizations of the ReLU activation that
have non-zero derivatives throughout their domains.
Leaky ReLU:
LReLU(v) = v     if v ≥ 0
LReLU(v) = αv    if v < 0
Unlike the ReLU, when the input to the Leaky ReLU activation is
negative, the derivative is α which is a small positive value (such
as 0.01).
Deep Learning – 9 / 17
GENERALIZATIONS OF RELU
A variant of the Leaky ReLU is the Parametric ReLU (PReLU)
which learns the α from the data through backpropagation.
Exponential Linear Unit (ELU):
ELU(v) = v              if v ≥ 0
ELU(v) = α(eᵛ − 1)      if v < 0

Scaled Exponential Linear Unit (SELU):
SELU(v) = λ v              if v ≥ 0
SELU(v) = λ α(eᵛ − 1)      if v < 0
Note: In ELU and SELU, α and λ are hyperparameters that are set before training.
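A minimal sketch of the three generalizations above; α = 0.01 for the Leaky ReLU follows the text, while the ELU/SELU hyperparameters used here (α = 1.0, and the commonly quoted SELU constants α ≈ 1.6733, λ ≈ 1.0507) are just illustrative choices:

```python
import numpy as np

def leaky_relu(v, alpha=0.01):
    return np.where(v >= 0, v, alpha * v)

def elu(v, alpha=1.0):
    return np.where(v >= 0, v, alpha * (np.exp(v) - 1))

def selu(v, alpha=1.6733, lam=1.0507):
    return lam * np.where(v >= 0, v, alpha * (np.exp(v) - 1))

v = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(v))   # negative inputs are scaled by alpha instead of being zeroed
print(elu(v))          # smooth saturation towards -alpha for very negative inputs
print(selu(v))         # ELU scaled by lambda
```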
Deep Learning – 10 / 17
GENERALIZATIONS OF RELU
Deep Learning – 11 / 17
Output activations
Deep Learning – 12 / 17
OUTPUT ACTIVATIONS
As we have seen previously, the role of the output activation is to
get the final score on the same scale as the target.
The output activations and the loss functions used to train neural
networks can be viewed through the lens of maximum likelihood
estimation (MLE).
In general, the function f (x | θ) represented by the neural network
defines the conditional p(y | x, θ) in a supervised learning task.
Maximizing the likelihood is then equivalent to minimizing
− log p(y | x, θ).
An output unit with the identity function as the activation can be
used to represent the mean of a Gaussian distribution.
For such a unit, training with mean-squared error is equivalent to
maximizing the log-likelihood (ignoring issues with non-convexity).
Deep Learning – 13 / 17
OUTPUT ACTIVATIONS
Similarly, sigmoid and softmax units can output the parameter(s) of
a Bernoulli distribution and Categorical distribution, respectively.
It is straightforward to show that when the label is one-hot
encoded, training with the cross-entropy loss is equivalent to
maximizing log-likelihood. Click here
Because these activations can saturate, an important advantage of
maximizing log-likelihood is that the log undoes some of the
exponentiation in the activation functions which is desirable when
optimizing with gradient-based methods.
For example, in the case of softmax, the loss is:
L(y, f(x)) = −f_in,k + log Σ_{k′=1}^{g} exp(f_in,k′)

where k is the correct class. The first term, −f_in,k, does not
saturate, which means training can progress steadily even if the
contribution of f_in,k to the second term is negligible.
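A small numeric sketch of this loss (the scores and class indices are made up); note how the log partially undoes the exponentiation:

```python
import numpy as np

def softmax_ce_loss(f_in, k):
    """Cross-entropy loss for correct class k: -f_in[k] + log(sum_k' exp(f_in[k']))."""
    return -f_in[k] + np.log(np.sum(np.exp(f_in)))

scores = np.array([2.0, -1.0, 0.5])
print(softmax_ce_loss(scores, k=0))   # ~0.24: small, the correct class already dominates
print(softmax_ce_loss(scores, k=1))   # ~3.24: large, pushing f_in[1] up would help
```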
Deep Learning – 14 / 17
OUTPUT ACTIVATIONS
A neural network can even be used to output the parameters of
more complex distributions.
A Mixture Density Network, for example, outputs the parameters of
a Gaussian Mixture Model:
p(y | x) = Σ_{c=1}^{m} φ⁽ᶜ⁾(x) N(y; μ⁽ᶜ⁾(x), Σ⁽ᶜ⁾(x))
Figure: Samples drawn from a Mixture Density Network. The input x is sampled from
a uniform distribution and y is sampled from p(y | x, θ).
Deep Learning – 16 / 17
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton (2012)
ImageNet Classification with Deep Convolutional Neural Networks
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Guoqiang Zhang and Haopeng Li (2018)
Effectiveness of Scaled Exponentially-Regularized Linear Units
https://arxiv.org/abs/1807.10117
Deep Learning – 17 / 17
Deep Learning
Network Initializations
Learning goals
Why Initialization matters
Weight Initializations
Bias Initialization
PRACTICAL INITIALIZATION
The weights (and biases) of a neural network must be assigned
some initial values before training can begin.
The choice of the initial weights (and biases) is crucial as it
determines whether an optimization algorithm converges, how fast
and whether to a point with high or low risk.
Initialization strategies that achieve "nice" properties are difficult to
find, because there is no good understanding of which properties are
preserved under which circumstances.
In the following, we distinguish between the initialization of weights
and biases.
Deep Learning – 1 / 10
WEIGHT INITIALIZATION
It is important to initialize the weights randomly in order to "break
symmetry". If two neurons (with the same activation function in a
fully connected network) are connected to the same inputs and
have the same initial weights, then both neurons will have the
same gradient update in a given iteration and they will end up
learning the same features.
Furthermore, the initial weights should not be too large, because
this might result in an explosion of weights or high sensitivity to
changes in the input.
Weights are typically drawn from a uniform distribution or a
Gaussian centered at 0 with a small variance.
Centering the initial weights around 0 can be seen as a form of
regularization: it imposes a prior that units are more likely not to
interact with each other than to interact.
Deep Learning – 2 / 10
WEIGHT INITIALIZATION
Two common initialization strategies for weights are the ’Glorot
initialization’ and ’He initialization’ which tune the variance of these
distributions based on the topology of the network.
Glorot initialization suggests to sample each weight of a fully
connected layer with m inputs and n outputs from a uniform
distribution

w_j,k ∼ U( −√(6 / (m + n)), +√(6 / (m + n)) )
Deep Learning – 3 / 10
WEIGHT INITIALIZATION
He initialization is especially useful for neural networks with
ReLU activations. Each weight of a fully connected layer with m
inputs is sampled from a Gaussian distribution
w_j,k ∼ N(0, 2/m)
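A minimal sketch of both sampling schemes for a fully connected layer with m inputs and n outputs (the layer sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(m, n):
    """Glorot: w ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

def he_normal(m, n):
    """He: w ~ N(0, 2/m), recommended for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m))

W_glorot = glorot_uniform(m=784, n=100)
W_he = he_normal(m=784, n=100)
print(W_glorot.std().round(4), W_he.std().round(4))   # small spreads, tuned to the layer size
```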
Deep Learning – 4 / 10
WEIGHT INITIALIZATION: EXAMPLE
We use a spiral planar data set to compare the following strategies:
Zero initialization,
random initialization (samples from N(0, 1 · 10⁻⁴)),
and He initialization.
For each strategy, a neural network with one hidden layer with 100
units, ReLU activation and Gradient Descent as optimizer was
used.
Deep Learning – 5 / 10
WEIGHT INITIALIZATION: EXAMPLE
Figure: Decision boundary with zero initialization on the training data set (left)
and the testing data set (right). The zero initialization does not break symmetry
and the complexity of the network reduces to that of a single neuron.
Deep Learning – 6 / 10
WEIGHT INITIALIZATION: EXAMPLE
training data set (left) and the testing data set (right).
Deep Learning – 7 / 10
WEIGHT INITIALIZATION: EXAMPLE
Figure: Decision boundary with He initialization on the training data set (left)
and the testing data set (right).
Deep Learning – 8 / 10
BIAS INITIALIZATION
Typically, we set the biases for each unit to heuristically chosen
constants.
Setting the biases to zero is compatible with most weight
initialization schemes as the schemes expect a small bias.
However, deviations from 0 can be made individually, for example,
in order to obtain the right marginal statistics of the output unit or
to avoid causing too much saturation at the initialization.
For details see Goodfellow et al. (2016).
Deep Learning – 9 / 10
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet
Classification. In Proceedings of the 2015 IEEE International Conference on
Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC,
USA, 1026-1034.
https://arxiv.org/abs/1502.01852
Xavier Glorot and Yoshua Bengio (2010)
Understanding the difficulty of training deep feedforward neural networks
AISTATS, Volume 9 of JMLR Proceedings, pages 249-256. JMLR.org
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi
Abhijit Ghatak (2019)
Deep Learning with R. Springer.
Deep Learning – 10 / 10
Deep Learning
CNN: Introduction
Learning goals
What are CNNs?
When to apply CNNs?
A glimpse into CNN architectures
CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNN, or ConvNet) are a powerful
family of neural networks that are inspired by biological processes
in which the connectivity pattern between neurons resembles the
organization of the mammalian visual cortex.
Figure: The ventral (recognition) pathway in the visual cortex has multiple
stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT etc., which consist of lots of
intermediate representations.
Deep Learning – 1 / 11
CONVOLUTIONAL NEURAL NETWORKS
Since 2012, given their success in the ILSVRC competition, CNNs
are popular in many fields.
Common applications of CNN-based architectures in computer
vision are:
Image classification.
Object detection / localization.
Semantic segmentation.
CNNs are widely applied in other domains such as natural
language processing (NLP), audio, and time-series data.
Basic idea: a CNN automatically extracts visual, or, more
generally, spatial features from the input data such that it is able to
make the optimal prediction based on the extracted features.
It contains different building blocks and components.
Deep Learning – 2 / 11
CNNS - WHAT FOR?
Figure: All Tesla cars being produced now have full self-driving hardware
(Source: Tesla website). A convolutional neural network is used to map raw
pixels from a single front-facing camera directly into steering commands. The
system learns to drive in traffic, on local roads, with or without lane markings
as well as on highways.
Deep Learning – 3 / 11
CNNS - WHAT FOR?
Figure: Given an input image, a CNN is first used to get the feature map of the
last convolutional layer, then a pyramid parsing module is applied to harvest
different sub-region representations, followed by upsampling and
concatenation layers to form the final feature representation, which carries
both local and global context information. Finally, the representation is fed into
a convolution layer to get the final per-pixel prediction. (Source: pyramid scene
parsing network, by Zhao et. al, CVPR 2017)
Deep Learning – 4 / 11
CNNS - WHAT FOR?
Deep Learning – 5 / 11
CNNS - WHAT FOR?
CNN for personalized medicine. Examples:
Tracking, diagnosis and localization of Covid-19 patients.
CNN-based method (RADLogists) for personalized Covid-19 detection:
three CT scans from a single Corona virus patient diagnosed by RADLogists.
Deep Learning – 6 / 11
CNNS - WHAT FOR?
Figure: Four COVID-19 lung CT scans at the top with corresponding colored
maps showing Corona virus abnormalities at the bottom (Source: Megan
Scudellari, IEEE Spectrum 2021).
Deep Learning – 7 / 11
CNNS - WHAT FOR?
Deep Learning – 8 / 11
CNNS - WHAT FOR?
Deep Learning – 9 / 11
CNNS - WHAT FOR?
Deep Learning – 10 / 11
CNNS - A FIRST GLIMPSE
Deep Learning – 11 / 11
CNNS - A FIRST GLIMPSE
Deep Learning – 11 / 11
Deep Learning
Convolutional Operation
Learning goals
What are filters?
Convolutional Operation
2D Convolution
FILTERS TO EXTRACT FEATURES
Filters have been widely applied in Computer Vision (CV) since the 70s.
One prominent example: the Sobel filter.
It detects edges in images.
Deep Learning – 1 / 9
FILTERS TO EXTRACT FEATURES
Edges occur where the intensity over neighboring pixels changes
fast.
Thus, we approximate the gradient of the intensity at each pixel.
Sobel showed that the gradient image Gx of original image A in
x-dimension can be approximated by:
−1 0 +1
Gx = −2 0 +2 ∗ A = Sx ∗ A
−1 0 +1
where ∗ indicates a mathematical operation known as a
convolution, not a traditional matrix multiplication.
The filter matrix Sx consists of the product of an averaging and a
differentiation kernel:
Sx = (1 2 1)ᵀ (−1 0 +1)
     (averaging) (differentiation)
Deep Learning – 2 / 9
FILTERS TO EXTRACT FEATURES
Similarly, the gradient image Gy in y-dimension can be
approximated by:
−1 −2 −1
Gy = 0 0 0 ∗ A = Sy ∗ A
+1 +2 +1
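A minimal sketch of applying S_x to a small image with a vertical edge; the sliding-window operation below is the cross-correlation that deep learning libraries commonly call "convolution", and the toy image is made up:

```python
import numpy as np

def conv2d_valid(A, K):
    """Slide the kernel K over image A ('valid' positions only) and take dot products."""
    h, w = A.shape
    kh, kw = K.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i + kh, j:j + kw] * K)
    return out

Sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
A = np.tile([0, 0, 0, 255, 255, 255], (6, 1))   # 6x6 image with a vertical edge in the middle
print(conv2d_valid(A, Sx))                      # strong response in the columns around the edge
```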
Deep Learning – 3 / 9
HORIZONTAL VS VERTICAL EDGES
Source: Wikipedia
Deep Learning – 4 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES
Deep Learning – 5 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?
What we just did was extracting pre-defined features from our
input (i.e. edges).
A convolutional neural network does almost exactly the same:
“extracting features from the input”.
⇒ The main difference is that we usually do not tell the CNN what
to look for (pre-define them), the CNN decides itself.
In a nutshell:
We initialize a lot of random filters (like the Sobel but just
random entries) and apply them to our input.
Then, a classifier (e.g. a feed-forward neural net) uses the
resulting feature maps as input data.
Filter entries will be adjusted by common gradient descent
methods.
Deep Learning – 6 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?
Deep Learning – 7 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?
Deep Learning – 7 / 9
WORKING WITH IMAGES
In order to understand the functionality of CNNs, we have to
familiarize ourselves with some properties of images.
Grey scale images:
Matrix with dimensions height × width × 1.
Pixel entries range from 0 (black) to 255 (white).
Color images:
Tensor with dimensions height × width × 3.
The depth 3 denotes the RGB values (red - green - blue).
Filters:
A filter’s depth is always equal to the input’s depth!
In practice, filters are usually square.
Thus we only need one integer to define its size.
For example, a filter of size 2 applied on a color image
actually has the dimensions 2 × 2 × 3.
Deep Learning – 8 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .
Deep Learning – 9 / 9
Deep Learning
Properties of Convolution
Learning goals
Sparse Interactions
Parameter Sharing
Equivariance to Translation
SPARSE INTERACTIONS
Similarly...
Deep Learning – 1 / 7
SPARSE INTERACTIONS
Similarly...
Deep Learning – 1 / 7
SPARSE INTERACTIONS
Deep Learning – 1 / 7
SPARSE INTERACTIONS
Deep Learning – 1 / 7
SPARSE INTERACTIONS
Deep Learning – 1 / 7
SPARSE INTERACTIONS
Deep Learning – 1 / 7
SPARSE INTERACTIONS
What does that mean?
Our CNN has a receptive field of 4 neurons.
That means, we apply a “local search” for features.
A dense net on the other hand conducts a “global search”.
The receptive field of the dense net is 9 neurons.
When processing images, it is more likely that features occur at
specific locations in the input space.
For example, it is more likely to find the eyes of a human in a
certain area, like the face.
A CNN only incorporates the surrounding area of the filter into
its feature extraction process.
The dense architecture on the other hand assumes that every
single pixel entry has an influence on the eye, even pixels far
away or in the background.
Deep Learning – 2 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Even three...
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
PARAMETER SHARING
Deep Learning – 3 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Why is that good?
Less parameters drastically reduce memory requirements.
Faster runtime:
For m inputs and n outputs, a fully connected layer requires
m · n parameters and has O(m · n) runtime.
A convolutional layer has limited connections k ≪ m, thus
only k · n parameters and O(k · n) runtime.
Less parameters mean less overfitting and better generalization!
Deep Learning – 4 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Example: consider a color image with size 100 × 100.
Suppose we would like to create one single feature map with a
“same padding” (i.e. the hidden layer is of the same size).
Choosing a filter with size 5 means that we have a total of
5 · 5 · 3 = 75 parameters (bias not considered).
A dense net with the same amount of “neurons” in the hidden
layer results in (100 · 100 · 3) · (100 · 100) = 3 · 10⁸
parameters.
Note that this was just a fictitious example. In practice we normally
do not try to replicate CNN architectures with dense networks.
Deep Learning – 5 / 7
EQUIVARIANCE TO TRANSLATION
Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION
Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION
The filter does not care at which location in the input the feature of
interest is located.
Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION
Deep Learning – 6 / 7
NONLINEARITY IN FEATURE MAPS
As in dense nets, we use activation functions on all feature map
entries to introduce nonlinearity in the net.
Typically rectified linear units (ReLU) are used in CNNs:
They reduce the danger of saturating gradients compared to
sigmoid activations.
They can lead to sparse activations, as negative pre-activations are
squashed to 0, which increases computational speed.
As seen in the last chapter, many variants of ReLU (Leaky ReLU,
ELU, PReLU, etc.) exist.
Deep Learning – 7 / 7
Deep Learning
CNN Components
Learning goals
Input Channel
Padding
Stride
Pooling
INPUT CHANNEL
Deep Learning – 1 / 14
Figure: Image source: Computer Vision Primer: How AI Sees An Image (Kishan Maladkar’s Blog)
Deep Learning – 2 / 14
VALID PADDING
Suppose we have an input of size 5 × 5 and a filter of size 2 × 2.
Deep Learning – 3 / 14
VALID PADDING
The filter is only allowed to move inside of the input space.
Deep Learning – 3 / 14
VALID PADDING
That will inevitably reduce the output dimensions.
In general, for an input of size i × i and a filter of size k × k, the size
o × o of the output feature map is calculated by:
o = i − k + 1
Deep Learning – 3 / 14
SAME PADDING
Suppose the following situation: an input with dimensions 5 × 5
and a filter of size 3 × 3.
Deep Learning – 4 / 14
SAME PADDING
We would like to obtain an output with the same dimensions as the
input.
Deep Learning – 4 / 14
SAME PADDING
Hence, we apply a technique called zero padding. That is to say
“pad” zeros around the input:
Deep Learning – 4 / 14
SAME PADDING
That always works! We just have to adjust the zero padding according to
the input dimensions and filter size (i.e. one, two or more rows).
Deep Learning – 4 / 14
PADDING AND NETWORK DEPTH
Figure: “Valid” versus “same” convolution. Top : Without padding, the width of
the feature map shrinks rapidly to 1 after just three convolutional layers (filter
width of 6 shown in each layer). This limits how deep the network can be
made. Bottom : With zero padding (shown as solid circles), the feature map
can remain the same size after each convolution which means the network
can be made arbitrarily deep. (Goodfellow, et al., 2016, ch. 9)
Deep Learning – 5 / 14
Strides
Deep Learning – 6 / 14
STRIDES
The step size (“stride”) of our filter (stride = 2 shown below).
Deep Learning – 7 / 14
STRIDES
The step size (“stride”) of our filter (stride = 2 shown below).
Deep Learning – 7 / 14
STRIDES
The step size (“stride”) of our filter (stride = 2 shown below).
Deep Learning – 7 / 14
STRIDES AND DOWNSAMPLING
Deep Learning – 8 / 14
MAX POOLING
Deep Learning – 9 / 14
MAX POOLING
Deep Learning – 9 / 14
MAX POOLING
Deep Learning – 9 / 14
AVERAGE POOLING
We have seen how max pooling works; there exist other pooling
operations such as average pooling, fractional pooling, LP
pooling, softmax pooling, stochastic pooling, blur pooling, global
average pooling, etc.
Similar to max pooling, we downsample the feature map while
ideally losing no information.
Deep Learning – 10 / 14
AVERAGE POOLING
Deep Learning – 10 / 14
AVERAGE POOLING
Deep Learning – 10 / 14
AVERAGE POOLING
The final pooled feature map has entries 3.75, 2.5, 4.25 and 1.75.
Deep Learning – 10 / 14
COMPARISON OF MAX AND AVERAGE POOLING
Average pooling uses all the information via a sum, whereas max
pooling uses only the highest value.
Max pooling removes details, so it is suitable for sparse
information (image classification), whereas average pooling is
suitable for dense information (NLP).
Figure: Shortcomings of max and average pooling using Toy Image (photo
source: https://iq.opengenus.org/maxpool-vs-avgpool/)
Deep Learning – 11 / 14
Figure: CNNs use colored images where each of the Red, Green and Blue (RGB) color spectrums serve as input. (source:
Chaitanya Belwal’s Blog)
In this CNN:
there are 3 input channels, each a 4×4 input matrix,
one 2×2 filter (also known as a kernel),
a single ReLu layer,
a single pooling layer (which applies the MaxPool function),
Deep Learning – 12 / 14
and a single fully connected (FC) layer.
The elements of the filter matrix are equivalent to the unit weights
in a standard NN and will be updated during the backpropagation
phase.
Assuming a stride of 2 with no padding, the output size of the
convolution layer is determined by the following equation:
O = (I − K + 2P) / S + 1,  where:
O: is the dimension (rows and columns) of the output square
matrix,
I: is the dimension (rows and columns) of the input square
matrix,
K: is the dimension (rows and columns) of the filter (kernel)
square matrix,
P: is the number of pixels(cells) of padding added to each
side of the input,
Deep Learning – 13 / 14
S: is the stride, or the number of cells skipped each time the
kernel is slid.
O = (I − K + 2P) / S + 1 = (4 − 2 + 2 · 0) / 2 + 1 = 2
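A minimal sketch of this output-size computation; the second call is just an illustrative "same padding" configuration:

```python
def conv_output_size(I, K, P, S):
    """O = (I - K + 2P) / S + 1 for a square input of size I, kernel K, padding P, stride S."""
    return (I - K + 2 * P) // S + 1

print(conv_output_size(I=4, K=2, P=0, S=2))   # 2, matching the worked example above
print(conv_output_size(I=5, K=3, P=1, S=1))   # 5, i.e. the output keeps the input size
```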
Deep Learning – 14 / 14
Introduction to Deep Learning
CNN Applications
Learning goals
Application of CNNs in Visual
Recognition
APPLICATION - IMAGE CLASSIFICATION
One use case of CNNs is image classification.
There exists a broad variety of network architectures for image
classification such as the LeNet, AlexNet, InceptionNet and
ResNet which will be discussed later in detail.
All these architectures rely on a set of sequences of convolutional
layers and aim to learn the mapping from an image to a probability
score over a set of classes.
Figure: One example of the Cifar-10 data: A highly pixelated, coloured image
of a frog with dimension [3, 32, 32].
Figure: The CNN learns the mapping from grayscale (L) to color (ab) for each
pixel in the image. The L and ab maps are then concatenated to yield the
colorized image. The authors use the LAB color space for the image
representation.
Figure: The colour space (ab) is quantized in a total of 313 bins. This allows
to treat the color prediction as a classification problem where each pixel is
assigned a probability distribution over the 313 bins and the one with the
highest softmax score is taken as predicted color value. The bin is then
mapped back to the corresponding numeric (a,b) values. The network is
optimized using a multinomial cross-entropy loss over the 313 quantized (a,b)
bins.
Figure: The architecture consists of stacked CNN layers which are upsampled
towards the end of the net. It makes use of dilated convolutions and
upsampling layers which will be explained later. The output is a tensor of
dimension [64, 64, 313] that stores the 313 probabilities for each element of
the final, downsampled 64x64 feature maps.
Figure: Label vector (c_cat, c_car, c_frog) and softmax outputs
P(y = cat | X; θ), P(y = car | X; θ), P(y = frog | X; θ).
Naive approach: use a CNN with two heads, one for the class
classification and one for the bounding box regression.
But: What happens, if there are two cats in the image?
Different approaches: "Region-based" CNNs (R-CNN, Fast R-CNN
and Faster R-CNN) and "single-shot" CNNs (SSD and YOLO).
1D / 2D / 3D Convolutions
Learning goals
1D Convolutions
2D Convolutions
3D Convolutions
1D Convolutions
Deep Learning – 1 / 22
1D CONVOLUTIONS
Data situation: Sequential, 1-dimensional tensor data.
Data consists of tensors with shape [depth, xdim]
Depth 1 (single-channel):
Univariate time series, e.g. development of a single stock
price over time
Functional / curve data
Depth > 1 (multi-channel):
Multivariate time series, e.g.
Movement data measured with multiple sensors for
human activity recognition
Temperature and humidity in weather forecasting
Text encoded as character-level one-hot-vectors
→ Convolve the data with a 1D-kernel
Deep Learning – 2 / 22
1D CONVOLUTIONS – OPERATION
Deep Learning – 3 / 22
1D CONVOLUTIONS – OPERATION
Deep Learning – 4 / 22
1D CONVOLUTIONS – SENSOR DATA
Deep Learning – 5 / 22
1D CONVOLUTIONS – SENSOR DATA
Figure: Time series classification with 1D CNNs and global average pooling
(explained later). An input time series is convolved with 3 CNN layers, pooled
and fed into a fully connected layer before the final softmax layer. This is one
of the classic time series classification architectures.
Deep Learning – 6 / 22
1D CONVOLUTIONS – TEXT MINING
1D convolutions also have an interesting application in text mining.
For example, they can be used to classify the sentiment of text
snippets such as yelp reviews.
Figure: Sentiment classification: can we teach the net that this a positive
review?
Deep Learning – 7 / 22
1D CONVOLUTIONS – TEXT MINING
Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING
Deep Learning – 9 / 22
2D Convolutions
Deep Learning – 10 / 22
2D CONVOLUTIONS
The basic idea behind a 2D convolution is sliding a small window
(called a "kernel/filter") over a larger 2D array, and performing a dot
product between the filter elements and the corresponding input array
elements at every position.
Figure: Here’s a diagram demonstrating the application of a 2×2 convolution filter to a 5×5 array, in 16 different positions.
Deep Learning – 11 / 22
2D CONVOLUTIONS – EXAMPLE
Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE
Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE
Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE
Each sliding position ends up with one number. The final output is
then a 2 × 2 matrix.
Deep Learning – 12 / 22
3D Convolutions
Deep Learning – 13 / 22
3D CONVOLUTIONS
Data situation: 3-dimensional tensor data.
Data consists of tensors with shape [depth, xdim, ydim, zdim].
Dimensions can be either temporal (e.g. video frames) or spatial
(e.g. MRI).
Examples:
Human activity recognition in video data
Disease classification or tumor segmentation on MRI scans
Solution: Move a 3D-kernel in x, y and z direction to capture all
important information.
Deep Learning – 14 / 22
3D CONVOLUTIONS – DATA
Figure: Illustration of depth 1 volumetric data: MRI scan. Each slice of the
stack has depth 1, as the frames are black-white.
Deep Learning – 15 / 22
3D CONVOLUTIONS – DATA
Deep Learning – 16 / 22
3D CONVOLUTIONS
Deep Learning – 17 / 22
3D CONVOLUTIONS
Deep Learning – 18 / 22
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https://arxiv.org/abs/1603.07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https://arxiv.org/abs/1609.03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https://www.researchgate.net/publication/220060811_The_Giga_View_Multiprocessor_Multidisk_Image_Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https://arxiv.org/pdf/1412.0767.pdf
Deep Learning – 19 / 22
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https://arxiv.org/pdf/1606.04797.pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http://arxiv.org/abs/1509.01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http://arxiv.org/abs/1509.01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https://arxiv.org/abs/1511.07122
Deep Learning – 20 / 22
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http://arxiv.org/abs/1509.01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https://distill.pub/2016/deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https://distill.pub/2019/computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https://arxiv.org/1611.06455
Deep Learning – 21 / 22
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network
Deep Learning – 22 / 22
Deep Learning
Learning goals
Dilated Convolutions
Transposed Convolutions
Dilated Convolutions
Deep Learning – 1 / 26
DILATED CONVOLUTIONS
Idea : artificially increase the receptive field of the net without
using more filter weights.
The receptive field of a single neuron comprises all inputs that
have an impact on this neuron.
Neurons in the first layers capture less information of the input,
while neurons in the last layers have huge receptive fields and can
capture a lot more global information from the input.
The size of the receptive fields depends on the filter size.
Deep Learning – 2 / 26
DILATED CONVOLUTIONS
Intuitively, neurons in the first layers capture less information of the
input (layer), while neurons in the last layers have huge receptive
fields and can capture a lot more global information from the input
(layer).
The size of the receptive fields depends on the filter size.
Deep Learning – 3 / 26
DILATED CONVOLUTIONS
By increasing the filter size, the size of the receptive fields
increases as well and more contextual information can be
captured.
However, increasing the filter size increases the number of
parameters, which leads to increased runtime.
Artificially increase the receptive field of the net without using more
filter weights by adding a new dilation parameter to the kernel that
skips pixels during convolution.
Benefits:
Capture more contextual information.
Enable the processing of inputs in higher dimensions to
detect fine details.
Improved run-time-performance due to less parameters.
Deep Learning – 4 / 26
DILATED CONVOLUTIONS
Useful in applications where the global context is of great
importance for the model decision.
This component finds application in:
Generation of audio-signals and songs within the famous
Wavenet developed by DeepMind.
Time series classification and forecasting.
Image segmentation.
Deep Learning – 5 / 26
DILATED CONVOLUTIONS
Deep Learning – 6 / 26
DILATED CONVOLUTIONS
Deep Learning – 7 / 26
DILATED CONVOLUTIONS
Deep Learning – 8 / 26
DILATED CONVOLUTIONS
Deep Learning – 9 / 26
Transposed Convolutions
Deep Learning – 10 / 26
TRANSPOSED CONVOLUTIONS
Problem setting:
For many applications and in many network architectures, we
often want to do transformations going in the opposite
direction of a normal convolution, i.e. we would like to perform
up-sampling.
Examples include generating high-resolution images and
mapping a low-dimensional feature map to a high-dimensional
space, as in auto-encoders or semantic segmentation.
Instead of decreasing dimensionality as with regular convolutions,
transposed convolutions are used to re-increase dimensionality
back to the initial dimensionality.
Note: Do not confuse this with deconvolutions (which are
mathematically defined as the inverse of a convolution).
Deep Learning – 11 / 26
TRANSPOSED CONVOLUTIONS
Example 1:
Input: yellow feature map with dim 4 × 4.
Output: blue feature map with dim 2 × 2.
Deep Learning – 12 / 26
TRANSPOSED CONVOLUTIONS
One way to upsample is to use a regular convolution with various
padding strategies.
Deep Learning – 13 / 26
TRANSPOSED CONVOLUTIONS
Convolution with parameters kernel size k , stride s and padding
factor p
Associated transposed convolution has parameters k ′ = k , s′ = s
and p′ = k − 1
Deep Learning – 14 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Convolution as a matrix multiplication :
credit:Stanford University
Deep Learning – 15 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Transposed Convolution as a matrix multiplication :
credit:Stanford University
Important : Even though the "structure" of the matrix here is the transpose of
the original matrix, the non-zero elements are, in general, different from the
corresponding elements in the original matrix. These (non-zero)
elements/weights are tuned by backpropagation.
Deep Learning – 16 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Convolution as matrix multiplication:
Deep Learning – 17 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:
Note:
Even though the transpose of the original matrix is shown in this
example, the actual values of the weights are different from the
original matrix (and optimized by backpropagation).
The goal of the transposed convolution here is simply to get back
the original dimensionality. It is not necessarily to get back the
original feature map itself.
Deep Learning – 18 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:
Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.
Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.
Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.
Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Deep Learning – 20 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Explanation: transposed convolution yields an overlap in some feature
map values.
This leads to higher magnitude for some feature map elements than for
others, resulting in the checkerboard pattern.
One solution is to ensure that the kernel size is divisible by the stride.
Figure: 1D example. In both images, top row = input and bottom row = output. Top:
Here, kernel weights overlap unevenly which results in a checkerboard pattern. Bottom:
There is no checkerboard pattern as the kernel size is divisible by the stride.
Deep Learning – 21 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Solutions:
Increase dimensionality via upsampling (bilinear, nearest
neighbor) and then convolve this output with regular
convolution.
Make sure that the kernel size k is divisible by the stride s.
Deep Learning – 22 / 26
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf
Deep Learning – 23 / 26
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122
Deep Learning – 24 / 26
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https://distill.pub/2016/deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455
Deep Learning – 25 / 26
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network
Deep Learning – 26 / 26
Deep Learning
Learning goals
Separable Convolutions
Flattening
Separable Convolutions
Deep Learning – 1 / 17
SEPARABLE CONVOLUTIONS
Separable Convolutions are used in some neural net architectures,
such as the MobileNet.
Motivation: make convolution computationally more efficient.
One can perform:
spatially separable convolution
depthwise separable convolution.
The spatially separable convolution operates on the 2D spatial
dimensions of images, i.e. height and width. Conceptually, spatially
separable convolution decomposes a convolution into two separate
operations.
Consider the Sobel kernel from the previous lecture:
$$G_x = \begin{pmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{pmatrix}$$
Deep Learning – 2 / 17
SEPARABLE CONVOLUTIONS
This 3 × 3 kernel can be replaced by the outer product of a 3 × 1 and a 1 × 3 kernel:
$$\begin{pmatrix} +1 \\ +2 \\ +1 \end{pmatrix} \ast \begin{pmatrix} +1 & 0 & -1 \end{pmatrix}$$
Deep Learning – 3 / 17
SEPARABLE CONVOLUTIONS
Figure: In a regular convolution, the 3 × 3 kernel directly convolves with the image. In a spatially separable convolution, the 3 × 1 kernel first convolves with the image; then the 1 × 3 kernel is applied. This requires 6 instead of 9 parameters while performing the same operation.
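A small NumPy sketch of both points, using scipy.signal.convolve2d: the Sobel kernel equals the outer product of the two vectors, and applying the two small kernels in sequence gives the same result as one pass with the full 3 × 3 kernel (the random 5 × 5 test image is just for illustration).

```python
import numpy as np
from scipy.signal import convolve2d

col = np.array([[1], [2], [1]])          # 3x1 kernel
row = np.array([[1, 0, -1]])             # 1x3 kernel
sobel = col @ row                        # outer product reproduces G_x

img = np.random.rand(5, 5)

full = convolve2d(img, sobel, mode="valid")                               # one 3x3 convolution
sep = convolve2d(convolve2d(img, col, mode="valid"), row, mode="valid")   # two separable passes

print(np.allclose(full, sep))            # True: same result, 6 instead of 9 weights
```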
Deep Learning – 4 / 17
SPATIALLY SEPARABLE CONVOLUTION
Example 1: A convolution on a 5 × 5 image with a 3 × 3 kernel (stride = 1, padding = 0) requires scanning the kernel at 3 positions horizontally and 3 vertically. That is 9 positions in total, indicated as the dots in the image below. At each position, 9 element-wise multiplications are applied. Overall, that is 9 × 9 = 81 multiplications.
Deep Learning – 5 / 17
SPATIALLY SEPARABLE CONVOLUTION
Deep Learning – 6 / 17
DEPTHWISE SEPARABLE CONVOLUTION
The depthwise separable convolution is much more commonly used in deep learning (e.g. in MobileNet and Xception).
It separates the convolutional process into two stages: a depthwise and a pointwise convolution.
Deep Learning – 7 / 17
DEPTHWISE SEPARABLE CONVOLUTION
Deep Learning – 8 / 17
DEPTHWISE SEPARABLE CONVOLUTION
Deep Learning – 9 / 17
DEPTHWISE CONVOLUTION
As the name suggests, we apply the kernels along the depth of the input volume (i.e. over the input channels). The steps followed in this convolution are:
Take a number of kernels equal to the number of input channels, each kernel having depth 1. For example, if we have kernels of size 3 × 3 and an input of size 6 × 6 with 16 channels, then there will be 16 kernels of size 3 × 3.
Every channel thus has one kernel associated with it. Each kernel is convolved over its associated channel separately, resulting in 16 feature maps.
Stack all these feature maps to get the output volume with spatial size 4 × 4 and 16 channels.
Deep Learning – 10 / 17
POINTWISE CONVOLUTION
As the name suggests, this type of convolution is applied to every single spatial position separately (remember 1 × 1 convolutions?). So how does this work?
Take a 1 × 1 convolution with the number of filters equal to the number of channels you want as output.
Apply this 1 × 1 convolution to the output of the depthwise convolution.
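A minimal PyTorch sketch of both stages, matching the numbers above (16 input channels, 3 × 3 depthwise kernels via `groups=16`, followed by a 1 × 1 pointwise convolution; the 32 output channels are an assumption):

```python
import torch
import torch.nn as nn

depthwise = nn.Conv2d(16, 16, kernel_size=3, groups=16)  # one 3x3 kernel per input channel
pointwise = nn.Conv2d(16, 32, kernel_size=1)             # 1x1 conv mixes the 16 channels into 32

x = torch.randn(1, 16, 6, 6)        # 6x6 input with 16 channels
z = depthwise(x)                    # -> (1, 16, 4, 4): 16 separate feature maps
out = pointwise(z)                  # -> (1, 32, 4, 4)
print(z.shape, out.shape)
```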
Deep Learning – 11 / 17
Flattening
Deep Learning – 12 / 17
FLATTENING
Flattening converts the data into a 1-dimensional array that can be fed to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification model, a fully-connected layer.
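For instance, a minimal sketch (the feature-map shape and the number of classes are assumptions):

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 64, 7, 7)   # output of the last convolutional layer
flat = nn.Flatten()(feature_maps)         # -> (1, 64 * 7 * 7) = (1, 3136)
logits = nn.Linear(3136, 10)(flat)        # fully-connected classification layer
print(flat.shape, logits.shape)
```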
Deep Learning – 13 / 17
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf
Deep Learning – 14 / 17
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122
Deep Learning – 15 / 17
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https://distill.pub/2016/deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455
Deep Learning – 16 / 17
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network
Deep Learning – 17 / 17
Deep Learning
Modern Architectures - I
Learning goals
LeNet
AlexNet
VGG
Network in Network
LeNet
Deep Learning – 1 / 18
LENET ARCHITECTURE
Pioneering work on CNNs by Yann LeCun in 1998.
Applied on the MNIST dataset for automated handwritten digit
recognition.
Consists of convolutional, "subsampling" and dense layers.
Complexity and depth of the net were mainly restricted by the limited computational power available back in the day.
Deep Learning – 2 / 18
LENET ARCHITECTURE
A neuron in a subsampling layer looks at a 2 × 2 region of a feature
map, sums the four values, multiplies it by a trainable coefficient,
adds a trainable bias and then applies a sigmoid activation.
A stride of 2 ensures that the size of the feature map reduces by
about a half.
The ’Gaussian connections’ layer has a neuron for each possible
class.
The output of each neuron in this layer is the (squared) Euclidean
distance between the activations from the previous layer and the
weights of the neuron.
Deep Learning – 3 / 18
AlexNet
Deep Learning – 4 / 18
ALEXNET
AlexNet, which employed an 8-layer CNN, won the ImageNet
Large Scale Visual Recognition (LSVR) Challenge 2012 by a
phenomenally large margin.
The network was trained in parallel on two small GPUs, using two streams of convolutions which are partly interconnected.
The architectures of AlexNet and LeNet are very similar, but there
are also significant differences:
First, AlexNet is deeper than the comparatively small LeNet5.
AlexNet consists of eight layers: five convolutional layers, two
fully-connected hidden layers, and one fully-connected output
layer.
Second, AlexNet used the ReLU instead of the sigmoid as its
activation function.
Deep Learning – 5 / 18
ALEXNET
Deep Learning – 6 / 18
VGG
Deep Learning – 7 / 18
VGG BLOCKS
The block is composed of convolutions with 3 × 3 kernels with padding of 1 (keeping height and width) and 2 × 2 max pooling with stride of 2 (halving the resolution after each block).
The use of blocks leads to very compact representations of the
network definition.
It allows for efficient design of complex networks.
credit : D2DL
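A sketch of such a block in PyTorch, following this description (the ReLU activation after each convolution is assumed, as in the original VGG; the helper name is ours):

```python
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """Stack of 3x3 convolutions (padding 1) followed by 2x2 max pooling (stride 2)."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves height and width
    return nn.Sequential(*layers)

block = vgg_block(num_convs=2, in_channels=3, out_channels=64)
```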
Deep Learning – 8 / 18
VGG NETWORK
Architecture introduced by Simonyan and Zisserman, 2014 as
“Very Deep Convolutional Network”.
A deeper variant of the AlexNet.
The basic idea is to use small filters and deeper networks.
Mainly uses many conv layers with a small kernel size of 3 × 3.
A stack of three 3 × 3 conv layers (stride 1) has the same effective receptive field as one 7 × 7 conv layer.
Performed very well in the ImageNet Challenge in 2014.
Exists in a smaller version (VGG16) with a total of 16 layers (13 conv layers and 3 fc layers) organized in 5 VGG blocks, and a larger version (VGG19) with 19 layers (16 conv layers and 3 fc layers), also in 5 VGG blocks.
Deep Learning – 9 / 18
VGG NETWORK
credit : D2DL
Deep Learning – 10 / 18
Network in Network (NiN)
Deep Learning – 11 / 18
NIN BLOCKS
The idea behind NiN is to apply a fully-connected layer at each
pixel location. If we tie the weights across each spatial location, we
could think of this as a 1 × 1 convolutional layer.
The NiN block consists of one convolutional layer followed by two
1 × 1 convolutional layers that act as per-pixel fully-connected
layers with ReLU activations.
The convolution window shape of the first layer is typically set by
the user. The subsequent window shapes are fixed to 1 × 1.
credit : D2DL
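A corresponding sketch of a NiN block (the first convolution window is set by the user, here 5 × 5 with padding 2 as an illustrative choice; the two subsequent 1 × 1 convolutions act as per-pixel fully-connected layers):

```python
import torch.nn as nn

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),  # per-pixel FC layer
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),  # per-pixel FC layer
    )

block = nin_block(in_channels=3, out_channels=96, kernel_size=5, stride=1, padding=2)
```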
Deep Learning – 12 / 18
GLOBAL AVERAGE POOLING
Goal: tackle overfitting in the final fully connected layer.
The elements of the final feature maps are connected to the
output layer via a dense layer. This could require a huge
number of weights, increasing the danger of overfitting.
Example: 1024 feature maps of dim 7 × 7 connected to 10 output neurons lead to 1024 · 7² · 10 weights for the final dense layer.
The larger the feature map, the more detrimental.
Classic pooling not ideal as it removes spatial information and
is mainly used for dimension and parameter reduction.
Deep Learning – 13 / 18
GLOBAL AVERAGE POOLING
Solution:
Average each final feature map to the element of one global
average pooling (GAP) vector.
Example: 1024 feature maps are now reduced to GAP-vector
of length 1024 yielding a final dense layer with 1024 · 10
weights.
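A minimal PyTorch sketch of this example (the batch size is arbitrary):

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 1024, 7, 7)             # 1024 final feature maps of size 7x7

gap = nn.AdaptiveAvgPool2d(output_size=1)             # average each 7x7 map to a single value
gap_vector = gap(feature_maps).flatten(start_dim=1)   # -> (1, 1024)

classifier = nn.Linear(1024, 10)                      # only 1024 * 10 weights (plus biases)
print(classifier(gap_vector).shape)                   # torch.Size([1, 10])
```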
Deep Learning – 14 / 18
GLOBAL AVERAGE POOLING
GAP preserves the overall information of the individual feature maps whilst decreasing the dimension.
Mitigates the possibly destructive effect of pooling.
Each element of the GAP output represents the activation of a
certain feature on the input data.
Acts as an additional regularizer on the final fully connected layer.
Deep Learning – 15 / 18
NETWORK IN NETWORK (NIN)
Deep Learning – 16 / 18
NETWORK IN NETWORK (NIN)
credit : D2DL
Deep Learning – 17 / 18
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Deep Learning – 18 / 18
Deep Learning
Modern Architectures - II
Learning goals
GoogleNet
ResNet
DenseNet
U-Net
GoogLeNet
Deep Learning – 1 / 24
INCEPTION MODULES
The Inception block is equivalent to a subnetwork with four paths.
It extracts information in parallel through convolutional layers of
different window shapes and max-pooling layers.
1 × 1 convolutions reduce channel dimensionality on a per-pixel
level. Max-pooling reduces the resolution.
Deep Learning – 2 / 24
GOOGLENET ARCHITECTURE
GoogLeNet connects multiple well-designed Inception blocks with
other layers in series.
The ratio of the number of channels assigned in the Inception
block is obtained through a large number of experiments on the
ImageNet dataset.
GoogLeNet, as well as its succeeding versions, was one of the
most efficient models on ImageNet, providing similar test accuracy
with lower computational complexity.
Deep Learning – 3 / 24
GOOGLENET ARCHITECTURE
credit : D2DL
Deep Learning – 4 / 24
Residual Networks (ResNet)
Deep Learning – 5 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Problem setting: theoretically, we could build infinitely deep architectures, as the net should learn to pick the beneficial layers and automatically skip those that do not improve the performance.
credit : D2DL
Figure: For non-nested function classes, a larger function class does not
guarantee to get closer to the “truth” function (F∗). This does not happen in
nested function classes.
Deep Learning – 6 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
But: this skipping would imply learning an identity mapping
x = F(x). It is very hard for a neural net to learn such a 1:1
mapping through the many non-linear activations in the
architecture.
Solution: offer the model explicitly the opportunity to skip certain
layers if they are not useful.
Introduced in He et al., 2015 and motivated by the observation that stacking ever more layers increases the test error as well as the training error (≠ overfitting).
Deep Learning – 7 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
credit : D2DL
Deep Learning – 8 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
credit : D2DL
Deep Learning – 9 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Let H(x) be the optimal underlying mapping that should be
learned by (parts of) the net.
x is the input in layer l (can be raw data input or the output of a
previous layer).
H(x) is the output from layer l.
Instead of fitting H(x), the net ought to learn the residual mapping F(x) := H(x) − x, whilst x is added via the identity mapping.
Thus, H(x) = F(x) + x, as formulated on the previous slide.
The model should only learn the residual mapping F(x).
Thus, the procedure is also referred to as Residual Learning.
Deep Learning – 10 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
The element-wise addition of the learned residuals F(x) and the
identity-mapped data x requires both to have the same
dimensions.
To allow for downsampling within F(x) (via pooling or valid-padded
convolutions), the authors introduce a linear projection layer Ws .
Ws ensures that x is brought to the same dimensionality as F(x)
such that:
y = F(x) + Ws x,
where y is the output of the skip module and Ws represents the weight matrix of the linear projection (# rows of Ws = dimensionality of F(x)).
This idea applies to fully connected layers as well as to
convolutional layers.
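A hedged sketch of a residual block with an optional 1 × 1 projection Ws, used whenever F(x) changes the number of channels or the resolution. The two-convolution form of F(x) follows the common ResNet pattern but is not taken verbatim from the slides.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x): two 3x3 convolutions
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        # W_s: 1x1 projection so that x matches the dimensionality of F(x)
        if stride != 1 or in_channels != out_channels:
            self.projection = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        else:
            self.projection = nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + self.projection(x))   # y = F(x) + W_s x

block = ResidualBlock(in_channels=64, out_channels=128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 128, 16, 16])
```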
Deep Learning – 11 / 24
RESNET ARCHITECTURE
The residual mapping can learn the identity function more easily,
such as pushing parameters in the weight layer to zero.
We can train an effective deep neural network by having residual
blocks.
Inputs can forward propagate faster through the residual
connections across layers.
ResNet had a major influence on the design of subsequent deep neural networks, both of convolutional and sequential nature.
Deep Learning – 12 / 24
RESNET ARCHITECTURE
credit : D2DL
Deep Learning – 13 / 24
Densely Connected Networks (DenseNet)
Deep Learning – 14 / 24
FROM RESNET TO DENSENET
ResNet significantly changed the view of how to parametrize the
functions in deep networks.
DenseNet (dense convolutional network) is to some extent the
logical extension of this [Huang et al., 2017].
Dense blocks, where each layer is connected to every other layer in a feedforward fashion.
Alleviates vanishing gradient, strengthens feature propagation,
encourages feature reuse.
To understand how to arrive at it, let us take a small detour to
mathematics:
Recall the Taylor expansion for functions. For the point x = 0
it can be written as:
$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \dots$$
Deep Learning – 15 / 24
FROM RESNET TO DENSENET
The key point is that it decomposes a function into
increasingly higher order terms. In a similar vein, ResNet
decomposes functions into : f (x) = x + g (x).
That is, ResNet decomposes f into a simple linear term and a
more complex nonlinear one. What if we want to capture (not
necessarily add) information beyond two terms? One solution
was DenseNet [Huang et al., 2017].
credit : D2DL
Deep Learning – 16 / 24
FROM RESNET TO DENSENET
As shown in previous Figure, the key difference between ResNet and
DenseNet is that in the latter case outputs are concatenated (denoted by [, ] )
rather than added. As a result, we perform a mapping from x to its values after
applying an increasingly complex sequence of functions:
x → [x, f1 (x), f2 ([x, f1 (x)]), f3 ([x, f1 (x), f2 ([x, f1 (x)])]), . . .] .
In the end, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation this is quite simple: rather than adding terms, we concatenate them.
The name DenseNet arises from the fact that the dependency graph between
variables becomes quite dense. The last layer of such a chain is densely
connected to all previous layers.
credit : D2DL
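A minimal sketch of this concatenation idea: a dense block in which each layer sees the concatenation of the input and all previous outputs. The convolution shape and growth rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # each layer receives the concatenation of the input and all previous outputs
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # concatenate instead of add
        return x

block = DenseBlock(num_layers=3, in_channels=16, growth_rate=8)
print(block(torch.randn(1, 16, 8, 8)).shape)      # torch.Size([1, 40, 8, 8]) = 16 + 3 * 8 channels
```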
Deep Learning – 17 / 24
U-Net
Deep Learning – 18 / 24
U-NET
U-Net is a fully convolutional net that makes use of upsampling (via
transposed convolutions, for example) as well as skip connections.
Input images are convolved and down-sampled in the first half of the architecture.
Then, they are up-sampled and convolved again in the second half to get back to the input dimension.
Skip connections throughout the net combine feature maps from
earlier layers with those from later layers by concatenating both
sets of maps along the depth/channel axis.
Only convolutional and no dense layers are used.
Deep Learning – 19 / 24
U-NET
Deep Learning – 20 / 24
U-NET
Example problem setting: train a neural net to pixelwise segment
roads in satellite imagery.
Answer the question: Where is the road map?
Deep Learning – 21 / 24
U-NET
The net takes an RGB image [512, 512, 3] and outputs a binary
(road / no road) probability mask [512, 512, 1] for each pixel.
The model is trained via a binary cross-entropy loss, computed per pixel and aggregated over the image.
Deep Learning – 22 / 24
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Deep Learning – 23 / 24
REFERENCES
Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015)
U-Net: Convolutional Networks for Biomedical Image Segmentation
http: // arxiv. org/ abs/ 1505. 04597
Deep Learning – 24 / 24
Deep Learning
Learning goals
Why do we need them?
How do they work?
Computational Graph of Recurrent Networks
Motivation
Deep Learning – 1 / 21
MOTIVATION FOR RECURRENT NETWORKS
The two types of neural network architectures that we’ve seen so
far are fully-connected networks and CNNs.
Their input layers have a fixed size and (typically) only handle
fixed-length inputs.
The primary reason: if we vary the size of the input layer, we would
also have to vary the number of learnable weights in the network.
This in particular relates to sequence data such as time-series,
audio and text.
Recurrent Neural Networks (RNNs) are a class of architectures that allows varying input lengths and properly accounts for the ordering in sequence data.
Deep Learning – 2 / 21
RNNS - INTRODUCTION
Suppose we have some text data and our task is to analyse the
sentiment in the text.
For example, given an input sentence, such as "This is good news.", the
network has to classify it as either ’positive’ or ’negative’.
We would like to train a simple neural network (such as the one below) to
perform the task.
Figure: Two equivalent visualizations of a dense net with a single hidden layer, where
the left is more abstract showing the network on a layer point-of-view.
Deep Learning – 3 / 21
RNNS - INTRODUCTION
Because sentences can be of varying lengths, we need to modify
the dense net architecture to handle such a scenario.
One approach is to draw inspiration from the way a human reads a
sentence; that is, one word at a time.
An important cognitive mechanism that makes this possible is
"short-term memory".
As we read a sentence from beginning to end, we retain some
information about the words that we have already read and use
this information to understand the meaning of the entire sentence.
Therefore, in order to feed the words in a sentence sequentially to
a neural network, we need to give it the ability to retain some
information about past inputs.
Deep Learning – 4 / 21
RNNS - INTRODUCTION
When words in a sentence are fed to the network one at a time,
the inputs are no longer independent. It is much more likely that
the word "good" is followed by "morning" rather than "plastic".
Hence, we also need to model this (long-term) dependency.
Each word must still be encoded as a fixed-length vector because
the size of the input layer will remain fixed.
Here, for the sake of the visualization, each word is represented as
a ’one-hot coded’ vector of length 5. (<eos> = ’end of sequence’)
Deep Learning – 5 / 21
RNNS - INTRODUCTION
Our goal is to feed the words to the network sequentially in
discrete time-steps.
A regular dense neural network with a single hidden layer only has
two sets of weights: ’input-to-hidden’ weights W and ’hidden-to-
output’ weights U.
Deep Learning – 6 / 21
RNNS - INTRODUCTION
In order to enable the network to retain information about past inputs, we
introduce an additional set of weights V, from the hidden neurons at
time-step t to the hidden neurons at time-step t + 1.
Having this additional set of weights makes the activations of the hidden
layer depend on both the current input and the activations for the
previous input.
Deep Learning – 7 / 21
RNNS - INTRODUCTION
With this additional set of hidden-to-hidden weights V, the network
is now a Recurrent Neural Network (RNN).
In a regular feed-forward network, the activations of the hidden
layer are only computed using the input-hidden weights W (and
bias b).
z = σ(W⊤ x + b)
In an RNN, the activations of the hidden layer (at time-step t) are
computed using both the input-to-hidden weights W and the
hidden-to-hidden weights V.
Deep Learning – 9 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 0, we feed the word "This" to the network and obtain z[0] .
z[0] = σ(W⊤ x[0] + b)
Because this is the very first input, there is no past state (or,
equivalently, the state is initialized to 0).
Deep Learning – 10 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 1, we feed the second word to the network to obtain z[1] .
z[1] = σ(V⊤ z[0] + W⊤ x[1] + b)
Deep Learning – 11 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 2, we feed the next word in the sentence.
z[2] = σ(V⊤ z[1] + W⊤ x[2] + b)
Deep Learning – 12 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 3, we feed the next word ("news") in the sentence.
z[3] = σ(V⊤ z[2] + W⊤ x[3] + b)
Deep Learning – 13 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
Once the entire input sequence has been processed, the
prediction of the network can be generated by feeding the
activations of the final time-step to the output neuron(s).
f = σ(U⊤ z[4] + c ), where c is the bias of the output neuron.
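A NumPy sketch of this forward pass, assuming one-hot inputs of length 5 and a hidden layer of size 3 (both sizes and the toy encoding of the five tokens are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

p, m = 5, 3                               # input and hidden dimensions (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(p, m))               # input-to-hidden weights
V = rng.normal(size=(m, m))               # hidden-to-hidden weights
U = rng.normal(size=(m, 1))               # hidden-to-output weights
b, c = np.zeros(m), 0.0

# toy one-hot encoding of the five tokens of "This is good news ."
xs = [np.eye(p)[i] for i in [0, 1, 2, 3, 4]]

z = np.zeros(m)                           # state initialized to 0
for x in xs:                              # the same W and V are reused at every time step
    z = sigmoid(V.T @ z + W.T @ x + b)

f = sigmoid(U.T @ z + c)                  # prediction from the final hidden state z[4]
print(f)
```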
Deep Learning – 14 / 21
PARAMETER SHARING
This way, the network can process the sentence one word at a
time and the length of the network can vary based on the length of
the sequence.
It is important to note that no matter how long the input sequence
is, the matrices W and V are the same in every time-step. This is
another example of parameter sharing.
Therefore, the number of weights in the network is independent of
the length of the input sequence.
Deep Learning – 15 / 21
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.
Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.
Examples:
Sequence-to-One: Sentiment analysis, document classification.
One-to-Sequence: Image captioning.
Sequence-to-Sequence: Language modelling, machine translation,
time-series prediction.
Deep Learning – 16 / 21
Computational Graph
Deep Learning – 17 / 21
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH
We went from
Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 21
RECURRENT OUTPUT-HIDDEN CONNECTIONS
Recurrent connections do not need to map from hidden to hidden
neurons!
RNN with feedback connection from the output to the hidden layer. The RNN is only
allowed to send f to future time points and, hence, z [t −1] is connected to z [t ] only
indirectly, via the predictions f [t −1] .
Deep Learning – 19 / 21
SEQ-TO-ONE MAPPINGS
RNNs do not need to produce an output at each time step. Often only
one output is produced after processing the whole sequence.
Time-unfolded recurrent neural network with a single output at the end of the
sequence. Such a network can be used to summarize a sequence and produce a fixed
size representation.
Deep Learning – 20 / 21
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http: // karpathy. github. io/ 2015/ 05/ 21/ rnn-effectiveness/
Deep Learning – 21 / 21
Deep Learning
Learning goals
How does Backpropagation work
for RNNs?
Exploding and Vanishing
Gradients
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Task: Learn character probability distribution from input text
Suppose we only had a vocabulary of four possible letters: “h”, “e”,
“l” and “o”
We want to train an RNN on the training sequence “hello”.
This training sequence is in fact a source of 4 separate training
examples:
The probability of “e” should be high given the context of “h”
“l” should be likely in the context of “he”
“l” should also be likely given the context of “hel”
and “o” should be likely given the context of “hell”
Deep Learning – 1 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Deep Learning – 2 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Deep Learning – 3 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Deep Learning – 4 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
The RNN has a 4-dimensional input and output. The exemplary hidden
layer consists of 3 neurons. This diagram shows the activations in the
forward pass when the RNN is fed the characters “hell” as input. The
output contains confidences the RNN assigns for the next character.
Deep Learning – 5 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
The RNN has a 4-dimensional input and output. The exemplary hidden
layer consists of 3 neurons. This diagram shows the activations in the
forward pass when the RNN is fed the characters “hell” as input. The
output contains confidences the RNN assigns for the next character.
For training the RNN, we need to compute $\frac{dL}{du_{i,j}}$, $\frac{dL}{dv_{i,j}}$, and $\frac{dL}{dw_{i,j}}$.
To do so, during backpropagation at time step t for an arbitrary RNN, we need to compute
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}}$$
Deep Learning – 6 / 13
LONG-TERM DEPENDENCIES
Here, $z^{[t]} = \sigma(V^\top z^{[t-1]} + W^\top x^{[t]} + b)$.
It follows that:
$$\frac{dz^{[t]}}{dz^{[t-1]}} = \operatorname{diag}(\sigma'(V^\top z^{[t-1]} + W^\top x^{[t]} + b))\, V^\top = D^{[t-1]} V^\top$$
$$\frac{dz^{[t-1]}}{dz^{[t-2]}} = \operatorname{diag}(\sigma'(V^\top z^{[t-2]} + W^\top x^{[t-1]} + b))\, V^\top = D^{[t-2]} V^\top$$
$$\vdots$$
$$\frac{dz^{[2]}}{dz^{[1]}} = \operatorname{diag}(\sigma'(V^\top z^{[1]} + W^\top x^{[2]} + b))\, V^\top = D^{[1]} V^\top$$
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}} = \frac{dL}{dz^{[t]}}\, D^{[t-1]} D^{[t-2]} \cdots D^{[1]} (V^\top)^{t-1}$$
Deep Learning – 7 / 13
LONG-TERM DEPENDENCIES
In general, for an arbitrary time-step $i < t$ in the past, $\frac{dz^{[t]}}{dz^{[i]}}$ will contain the term $(V^\top)^{t-i}$ (this follows from the chain rule).
Based on the largest eigenvalue of $V^\top$, the presence of the term $(V^\top)^{t-i}$ can either result in vanishing or exploding gradients.
This problem is quite severe for RNNs (as compared to feedforward networks) because the same matrix $V^\top$ is multiplied several times.
As the gap between t and i increases, the instability worsens.
It is thus quite challenging for RNNs to learn long-term
dependencies. The gradients either vanish (most of the time) or
explode (rarely, but with much damage to the optimization).
That happens simply because we propagate errors over very
many stages backwards.
Deep Learning – 8 / 13
LONG-TERM DEPENDENCIES
Deep Learning – 9 / 13
LONG-TERM DEPENDENCIES
Recall that we can counteract exploding gradients by implementing gradient clipping.
To avoid exploding gradients, we simply clip the norm of the gradient at some threshold h (see chapter 4):
$$\text{if } \|\nabla W\| > h: \quad \nabla W \leftarrow \frac{h}{\|\nabla W\|} \nabla W$$
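A minimal NumPy sketch of this clipping rule; in PyTorch, `torch.nn.utils.clip_grad_norm_` provides the same functionality over all model parameters.

```python
import numpy as np

def clip_gradient(grad, h):
    """Rescale the gradient if its norm exceeds the threshold h."""
    norm = np.linalg.norm(grad)
    if norm > h:
        grad = (h / norm) * grad
    return grad

g = np.array([3.0, 4.0])        # norm 5
print(clip_gradient(g, h=1.0))  # rescaled to norm 1: [0.6, 0.8]
```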
Deep Learning – 10 / 13
LONG-TERM DEPENDENCIES
Deep Learning – 11 / 13
LONG-TERM DEPENDENCIES
Even for a stable RNN (gradients not exploding), there will be
exponentially smaller weights for long-term interactions compared
to short-term ones and a more sophisticated solution is needed for
this vanishing gradient problem (discussed in the next chapters).
The vanishing gradient problem heavily depends on the choice of
the activation functions.
Sigmoid maps a real number into a “small” range (i.e. [0, 1])
and thus even huge changes in the input will only produce a
small change in the output. Hence, the gradient will be small.
This becomes even worse when we stack multiple layers.
We can avoid this problem by using activation functions which
do not “squash” the input.
The most popular choice is ReLU with gradients being either
0 or 1, i.e., they never saturate and thus don’t vanish.
The downside of this is that we can obtain a “dead” ReLU.
Deep Learning – 12 / 13
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Deep Learning – 13 / 13
Deep Learning
Learning goals
LSTM cell
GRU cell
Bidirectional RNNs
Long Short-Term Memory (LSTM)
Deep Learning – 1 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.
Deep Learning – 2 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Forget gate e[t ] : indicates which information of the old cell state
we should forget.
Intuition: Think of a model trying to predict the next word based on
all the previous ones. The cell state might include the gender of
the present subject, so that the correct pronouns can be used.
When we now see a new subject, we want to forget the gender of
the old one.
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Output gate $o^{[t]}$: indicates which information from the cell state is filtered.
It is given by $o^{[t]} = \sigma(b_o + V_o^\top z^{[t-1]} + W_o^\top x^{[t]})$, with specific weights $W_o$, $V_o$.
Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)
Finally, the new state z[t ] of the LSTM is a function of the cell state,
multiplied by the output gate:
Deep Learning – 3 / 15
Gated Recurrent Units (GRU)
Deep Learning – 4 / 15
GATED RECURRENT UNITS (GRU)
The key distinction between regular RNNs and GRUs is that the
latter support gating of the hidden state.
Here, we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset.
These mechanisms are learned to:
avoid the vanishing/exploding gradient problem which comes
with a standard recurrent neural network.
solve the vanishing gradient problem by using an update gate
and a reset gate.
control the information that flows into (update gate) and out of
(reset gate) memory.
Deep Learning – 5 / 15
GATED RECURRENT UNITS (GRU)
For a given time step t, the hidden state of the last time step is
z[t −1] . The update gate u[t ] is computed as follows:
$$u^{[t]} = \sigma(W_u^\top x^{[t]} + V_u^\top z^{[t-1]} + b_u)$$
Deep Learning – 6 / 15
GATED RECURRENT UNITS (GRU)
Deep Learning – 7 / 15
GATED RECURRENT UNITS (GRU)
Deep Learning – 8 / 15
GATED RECURRENT UNITS (GRU)
The update gate $u^{[t]}$ determines how much of the old state $z^{[t-1]}$ and the new candidate state $\tilde{z}^{[t]}$ is used:
$$z^{[t]} = u^{[t]} \odot z^{[t-1]} + (1 - u^{[t]}) \odot \tilde{z}^{[t]}$$
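A NumPy sketch of a single GRU step based on these equations. The candidate state $\tilde{z}^{[t]}$ uses the standard GRU form with the reset gate (this formula is not spelled out on the slides), and all dimensions and initializations are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

m, p = 4, 3                                   # hidden and input dimensions (assumed)
rng = np.random.default_rng(0)
Wu, Wr, W = [rng.normal(size=(p, m)) for _ in range(3)]
Vu, Vr, V = [rng.normal(size=(m, m)) for _ in range(3)]
bu, br, b = np.zeros(m), np.zeros(m), np.zeros(m)

def gru_step(x, z_prev):
    u = sigmoid(Wu.T @ x + Vu.T @ z_prev + bu)           # update gate
    r = sigmoid(Wr.T @ x + Vr.T @ z_prev + br)           # reset gate
    z_tilde = np.tanh(W.T @ x + V.T @ (r * z_prev) + b)  # candidate state (standard GRU form, assumed)
    return u * z_prev + (1 - u) * z_tilde                # gated combination of old and candidate state

z = np.zeros(m)
z = gru_step(rng.normal(size=p), z)
print(z)
```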
Deep Learning – 9 / 15
GATED RECURRENT UNITS (GRU)
Figure: GRU
Deep Learning – 10 / 15
GRU VS LSTM
Deep Learning – 11 / 15
Bidirectional RNNs
Deep Learning – 12 / 15
BIDIRECTIONAL RNNS
Another generalization of the simple RNN are bidirectional RNNs.
These allow us to process sequential data depending on both past
and future inputs, e.g. an application predicting missing words,
which probably depend on both preceding and following words.
One RNN processes the inputs in the forward direction from $x^{[1]}$ to $x^{[T]}$, computing a sequence of hidden states $(z^{[1]}, \dots, z^{[T]})$; another RNN processes them in the backward direction from $x^{[T]}$ to $x^{[1]}$, computing hidden states $(g^{[T]}, \dots, g^{[1]})$.
Predictions are then based on both hidden states, which could be
concatenated.
With connections going back in time, the whole input sequence
must be known in advance to train and infer from the model.
Bidirectional RNNs are often used for the encoding of a sequence
in machine translation.
Deep Learning – 13 / 15
BIDIRECTIONAL RNNS
Computational graph of a bidirectional RNN:
Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Deep Learning – 15 / 15
Deep Learning
Applications of RNNs
Learning goals
RNN Applications in NLP
RNN Applications in Computer
Vision
Get to know Encoder-Decoder
Architectures
RNN’s Applications
Deep Learning – 1 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.
Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.
Examples:
One-to-One : Image classification, video frame classification.
Deep Learning – 2 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
Examples:
Many-to-One : Here a sequence of multiple steps as input are mapped to
a class or quantity prediction.
Example applications are: Sentiment analysis, document classification,
video classification, visual question answering.
Deep Learning – 3 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
Figure: Agrawal et al., “Visual 7W: Grounded Question Answering in Images”, CVPR 2015. Figures from Agrawal et al., copyright IEEE 2015.
Deep Learning – 4 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
Examples:
One-to-Many: In this type of problem, an observation is mapped as input
to a sequence with multiple steps as an output.
Example applications are:
Image captioning: A combination of CNNs and RNNs is used to provide a description of what exactly is happening inside an image. The CNN extracts a representation of the image and the RNN then uses this representation to generate the description.
Video tagging: RNNs can be used for video search, where we generate image descriptions for a video divided into numerous frames.
Deep Learning – 5 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
Deep Learning – 7 / 27
Seq-to-Seq (Type I)
Deep Learning – 8 / 27
RNNS - LANGUAGE MODELLING
In an earlier example, we built a ’sequence-to-one’ RNN model to
perform ’sentiment analysis’.
Another common task in Natural Language Processing (NLP) is
’language modelling’.
Input: word/character, encoded as a one-hot vector.
Output: probability distribution over words/characters given
previous words
$$P(y^{[1]}, \dots, y^{[\tau]}) = \prod_{i=1}^{\tau} P(y^{[i]} \mid y^{[1]}, \dots, y^{[i-1]})$$
Deep Learning – 9 / 27
RNNS - LANGUAGE MODELLING
In this example, we will feed the characters in the word "hello" one
at a time to a ’seq-to-seq’ RNN.
For the sake of the visualization, the characters "h", "e", "l" and "o"
are one-hot coded as a vectors of length 4 and the output layer
only has 4 neurons, one for each character (we ignore the <eos>
token).
At each time step, the RNN has to output a probability distribution
(softmax) over the 4 possible characters that might follow the
current input.
Naturally, if the RNN has been trained on words in the English
language:
The probability of “e” should be high, given the context of “h”.
“l” should be likely in the context of “he”.
“l” should also be likely, given the context of “hel”.
and, finally, “o” should be likely, given the context of “hell”.
Deep Learning – 10 / 27
RNNS - LANGUAGE MODELLING
Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING
Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING
Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING
Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING
Source: Kaggle
Deep Learning – 12 / 27
WORD EMBEDDINGS
The dimensionality of these embeddings is typically much smaller
than the number of words in the dictionary.
Using them gives you a "warm start" for any NLP task. It is an
easy way to incorporate prior knowledge into your model and a
rudimentary form of transfer learning.
Two very popular approaches to learn word embeddings are word2vec (developed at Google) and GloVe (developed at Stanford). These embeddings are typically 100 to 1000 dimensional.
Even though these embeddings capture the meaning of each word
to an extent, they do not capture the semantics of the word in a
given context because each word has a static precomputed
representation. For example, depending on the context, the word
"bank" might refer to a financial institution or to a river bank.
Deep Learning – 13 / 27
Seq-to-Seq (Type II)
Deep Learning – 14 / 27
Encoder-Decoder Architectures
Deep Learning – 15 / 27
ENCODER-DECODER NETWORK
For many interesting applications such as question answering,
dialogue systems, or machine translation, the network needs to
map an input sequence to an output sequence of different length.
This is what an encoder-decoder (also called
sequence-to-sequence architecture) enables us to do!
Deep Learning – 16 / 27
ENCODER-DECODER NETWORK
Figure: In the first part of the network, information from the input is encoded in
the context vector, here the final hidden state, which is then passed on to
every hidden state of the decoder, which produces the target sequence.
Deep Learning – 17 / 27
ENCODER-DECODER NETWORK
An input/encoder RNN processes the input sequence of length $n_x$ and computes a fixed-length context vector C, usually the final hidden state or a simple function of the hidden states.
One time step after the other, information from the input sequence is processed, added to the hidden state and passed forward in time through the recurrent connections between hidden states in the encoder.
The context vector summarizes important information from the
input sequence, e.g. the intent of a question in a question
answering task or the meaning of a text in the case of machine
translation.
The decoder RNN uses this information to predict the output, a sequence of length $n_y$, which can differ from $n_x$.
Deep Learning – 18 / 27
ENCODER-DECODER NETWORK
In machine translation, the decoder is a language model with
recurrent connections between the output at one time step and the
hidden state at the next time step as well as recurrent connections
between the hidden states:
$$P(y^{[1]}, \dots, y^{[n_y]} \mid x^{[1]}, \dots, x^{[n_x]}) = \prod_{t=1}^{n_y} p(y^{[t]} \mid C; y^{[1]}, \dots, y^{[t-1]})$$
Deep Learning – 20 / 27
SOME MORE SOPHISTICATED APPLICATIONS
Deep Learning – 22 / 27
SOME MORE SOPHISTICATED APPLICATIONS
Figure: Convolutional and recurrent nets for detecting emotion from audio
data (Namrata Anand & Prateek Verma, 2016). We already had this example
in the CNN chapter!
Deep Learning – 23 / 27
SOME MORE SOPHISTICATED APPLICATIONS
Deep Learning – 24 / 27
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.floydhub.com/attention-mechanism/
Deep Learning – 25 / 27
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271
Deep Learning – 26 / 27
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Deep Learning – 27 / 27
Deep Learning
Learning goals
Familiarize with the most recent
sequence data modeling
technique:
Attention Mechanism
Transformers
Get to know the CNN alternative
to RNNs
Attention
Deep Learning – 1 / 22
WHAT IS ATTENTION
Humans process data by actively shifting their focus:
Different parts of an image carry different information
Words derive their specific meaning from context
Remember specific, related events in the past
Allows us to follow one thought at a time while suppressing information irrelevant to the task
Example: cocktail party problem
Deep Learning – 2 / 22
WHAT IS ATTENTION
Deep Learning – 3 / 22
WHAT IS ATTENTION
Deep Learning – 4 / 22
WHAT IS ATTENTION
Key idea: Allow the decoder to access all the hidden states of the
encoder (instead of just the final one) so that it can dynamically
decide which ones are relevant at each time-step in the decoding.
This means the decoder can choose to "focus" on different hidden
states (of the encoder) at different time-steps of the decoding
process similar to how the human eye can focus on different
regions of the visual field.
This is known as an attention mechanism.
Deep Learning – 5 / 22
WHAT IS ATTENTION
The attention mechanism is implemented by an additional
component in the decoder.
For example, this can be a simple single-hidden layer feed-forward
neural network which is trained along with the RNN.
At any given time-step i of the decoding process, the network
computes the relevance of encoder state z[j ] as:
Deep Learning – 6 / 22
WHAT IS ATTENTION
The attention mechanism allows the decoder network to focus on
different parts of the input sequence by adding connections from
all hidden states of the encoder to each hidden state of the
decoder.
Figure: Attention at i = t + 1
Deep Learning – 7 / 22
WHAT IS ATTENTION
At each time step i, a set of weights $(\alpha^{[j]})^{[i]}$ is computed which determines how to combine the hidden states of the encoder into a context vector $g^{[i]} = \sum_{j=1}^{n_x} (\alpha^{[j]})^{[i]} z^{[j]}$, which holds the necessary information to predict the correct output.
Each hidden state contains mostly information from recent inputs.
In the case of a bidirectional RNN to encode the input sequence, a
hidden state contains information from recent preceding and
following inputs.
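A NumPy sketch of this weighting step. The relevance scores are produced here by a simple dot product as a placeholder, since the slides leave the concrete scoring network open (they suggest a small feed-forward net); the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))      # n_x = 6 encoder hidden states z[j], each of dim 8
decoder_state = rng.normal(size=8)            # current decoder state at decoding step i

# placeholder relevance scores; in practice a small feed-forward net computes these
scores = encoder_states @ decoder_state       # one score per encoder state

alpha = np.exp(scores) / np.sum(np.exp(scores))   # softmax -> attention weights, sum to 1
g = alpha @ encoder_states                        # context vector g[i] = sum_j alpha_j * z[j]

print(alpha.round(2), g.shape)                # weights and an 8-dimensional context vector
```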
Deep Learning – 8 / 22
WHAT IS ATTENTION
Figure: Attention at i = t + 2
Deep Learning – 9 / 22
WHAT IS ATTENTION
Deep Learning – 10 / 22
ATTENTION
Figure: Attention for image captioning: the attention mechanism tells the
network roughly which pixels to pay attention to when writing the text (Kelvin
Xu al. 2015)
Deep Learning – 11 / 22
Transformers
Deep Learning – 12 / 22
TRANSFORMERS
Advanced RNNs have similar limitations as vanilla RNN networks:
RNNs process the input data sequentially.
Difficulties in learning long-term dependencies (although GRUs or LSTMs perform better than vanilla RNNs, they sometimes struggle to remember the context introduced earlier in long sequences).
These challenges are tackled by transformer networks.
Deep Learning – 13 / 22
TRANSFORMERS
Transformers are solely based on attention (no RNN or CNN).
In fact, the paper which coined the term transformer is called
Attention is all you need.
They are the state-of-the-art networks in natural language
processing (NLP) tasks since 2017.
Transformer architectures like BERT (Bidirectional Encoder
Representations from Transformers, 2018) and GPT-3 (Generative
Pre-trained Transformer-3, 2020) are pre-trained on a large corpus
and can be fine-tuned to specific language tasks.
Deep Learning – 14 / 22
TRANSFORMERS
Deep Learning – 15 / 22
CNNs or RNNs?
Deep Learning – 16 / 22
CNNS OR RNNS?
Historically, RNNs were the default for sequence processing tasks.
However, some families of CNNs (especially those based on Fully
Convolutional Networks (FCNs)) can be used to process
variable-length sequences such as text or time-series data.
If a CNN doesn’t contain any fully-connected layers, the total
number of weights in the network is independent of the spatial
dimensions of the input because of weight-sharing in the
convolutional layers.
Recent research [Bai et al. , 2018] indicates that such
convolutional architectures, so-called Temporal Convolutional
Networks (TCNs), can outperform RNNs on a wide range of tasks.
A major advantage of TCNs is that the entire input sequence can
be fed to the network at once (as opposed to sequentially).
Deep Learning – 17 / 22
CNNS OR RNNS?
Figure: A TCN (we have already seen this in the CNN lecture!) is simply a variant of
the one-dimensional FCN which uses a special type of dilated convolutions called
causal dilated convolutions.
Deep Learning – 18 / 22
SUMMARY
RNNs are specifically designed to process sequences of varying
lengths.
For that recurrent connections are introduced into the network
structure.
The gradient is calculated by backpropagation through time.
An LSTM replaces the simple hidden neuron by a complex system
consisting of cell state, and forget, input, and output gates.
An RNN can be used as a language model, which can be
improved by word-embeddings.
Different advanced types of RNNs exist, like Encoder-Decoder
architectures and bidirectional RNNs.1
1. A bidirectional RNN processes the input sequence in both directions (front-to-back and back-to-front).
Deep Learning – 19 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.floydhub.com/attention-mechanism/
Deep Learning – 20 / 22
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271
Deep Learning – 21 / 22
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Deep Learning – 22 / 22
Deep Learning
Unsupervised Learning
Learning goals
Unsupervised learning tasks
Unsupervised deep learning
UNSUPERVISED LEARNING
Deep Learning – 1 / 8
UNSUPERVISED LEARNING
In unsupervised learning scenarios, training data consists of unlabeled input points $x^{(1)}, \dots, x^{(n)}$.
Our goal is to learn some underlying hidden structure of the data.
Examples are: clustering, dimensionality reduction, feature
learning, density estimation, etc.
Deep Learning – 2 / 8
UNSUPERVISED LEARNING - EXAMPLES
1. Clustering.
Figure: Cluster analysis results for different algorithms. Different clusters are
indicated by different colors. (Source : Wikipedia)
Deep Learning – 3 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for visualisation in a low dimensional space.
Deep Learning – 4 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for image compression.
Deep Learning – 5 / 8
UNSUPERVISED LEARNING - EXAMPLES
Deep Learning – 6 / 8
UNSUPERVISED LEARNING - EXAMPLES
4. Density fitting/learning a generative model.
Deep Learning – 7 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,
Deep Learning – 8 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,
a generative model, i.e. a probabilistic model of the data
generating distribution pdata (data generation, outlier detection,
missing feature extraction, reconstruction, denoising or planning in
reinforcement learning, ...).
Deep Learning – 8 / 8
Deep Learning
Learning goals
Task and structure of an AE
Undercomplete AEs
Relation of AEs and PCA
AUTOENCODER-TASK AND STRUCTURE
Deep Learning – 1 / 15
AUTOENCODER (AE)- COMPUTATIONAL GRAPH
The general structure of an AE as a computational graph:
Deep Learning – 2 / 15
Undercomplete Autoencoders
Deep Learning – 3 / 15
UNDERCOMPLETE AUTOENCODERS
A naive implementation of an autoencoder would simply learn the identity $dec(enc(x)) = \hat{x}$.
This would not be useful.
Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
Therefore we have a “bottleneck” layer: We restrict the
architecture, such that
dim(z) < dim(x)
Such an AE is called undercomplete.
Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
In an undercomplete AE, the hidden layer has fewer neurons than
the input layer.
→ That will force the AE to
capture only the most salient features of the training data!
learn a “compressed” representation of the input.
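A minimal PyTorch sketch of such an undercomplete autoencoder for flattened MNIST digits (the bottleneck size of 32 and the single-layer encoder/decoder are assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())     # bottleneck: dim(z) < dim(x)
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())  # reconstruct x from z

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = Autoencoder()
x = torch.rand(16, 784)            # a batch of flattened 28x28 digits
loss = nn.MSELoss()(ae(x), x)      # L2 reconstruction error
```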
Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
Deep Learning – 5 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: Flow chart of our our autoencoder: reconstruct the input with fixed
dimensions dim(z) ≤ dim(x).
Deep Learning – 6 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Deep Learning – 7 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 784 = dim(x).
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 256.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 64.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 32.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 16.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 8.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 4.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 2.
Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST
Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 1.
Deep Learning – 8 / 15
INCREASING THE CAPACITY OF AES
Increasing the number of layers adds capacity to autoencoders:
Deep Learning – 9 / 15
Autoencoders as Principal Component
Analysis
Deep Learning – 10 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
Deep Learning – 11 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
It can be shown that the optimal solution is an orthogonal linear
transformation (i.e. a rotation of the coordinate system) given by
the dim(z) = k singular vectors with the largest singular values.
Deep Learning – 12 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
This is an equivalent formulation to Principal Component
Analysis (PCA), which uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal
components.
The transformation is defined in such a way that the first principal
component has the largest possible variance (i.e., accounts for as
much of the variability in the data as possible).
Deep Learning – 13 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
The formulations are equivalent: “Find a linear projection into a
k-dimensional space that ...”
“... minimizes the L2-reconstruction error” (AE-based
formulation).
“... maximizes the variance of the projected datapoints”
(statistical formulation).
An AE with a non-linear decoder/encoder can be seen as a
non-linear generalization of PCA.
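A small numerical sketch of this equivalence from the statistical side, assuming NumPy: project the centered data onto the top-k right singular vectors and measure the L2 reconstruction error, i.e. exactly the quantity a linear undercomplete AE would minimize.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
Xc = X - X.mean(axis=0)              # center the data

# the rows of Vt are the principal directions (right singular vectors)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
Vk = Vt[:k].T                        # top-k singular vectors, shape (10, k)
Z = Xc @ Vk                          # k-dimensional codes ("linear encoder")
X_hat = Z @ Vk.T                     # reconstruction ("linear decoder")

rec_error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(f"mean L2 reconstruction error with k={k}: {rec_error:.3f}")

Among all linear projections onto k dimensions, this PCA projection attains the smallest such error, which is why a linear AE trained with L2 loss recovers the same subspace.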
Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Deep Learning – 15 / 15
Deep Learning
Regularized Autoencoders
Learning goals
Overcomplete AEs
Sparse AEs
Denoising AEs
Contractive AEs
Overcomplete Autoencoders
Deep Learning – 1 / 20
OVERCOMPLETE AE – PROBLEM
Overcomplete AE (code dimension ≥ input dimension): even a linear
AE can copy the input to the output without learning anything useful.
How can an overcomplete AE be useful?
Figure: Overcomplete AE that learned to copy its inputs to the hidden layer
and then to the output layer (Credits to M. Ponti).
Deep Learning – 2 / 20
REGULARIZED AUTOENCODER
Deep Learning – 3 / 20
Sparse Autoencoder
Deep Learning – 4 / 20
SPARSE AUTOENCODER
Try to keep the number of active neurons per training input low.
Forces the model to respond to unique statistical features of the
input data.
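A common way to implement this, assuming PyTorch and an autoencoder ae with enc/dec modules as in the earlier sketch: add an L1 penalty on the code activations to the reconstruction loss (the weight lam is an illustrative hyperparameter; KL-based sparsity penalties are another option).

import torch

def sparse_ae_loss(ae, x, lam=1e-3):
    z = ae.enc(x)                                  # code activations
    x_hat = ae.dec(z)
    rec = torch.nn.functional.mse_loss(x_hat, x)   # reconstruction term
    sparsity = z.abs().mean()                      # L1 term pushes most units towards 0
    return rec + lam * sparsity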
Deep Learning – 5 / 20
Denoising Autoencoders
Deep Learning – 6 / 20
DENOISING AUTOENCODERS (DAE)
The denoising autoencoder (DAE) is an autoencoder that receives a
corrupted data point as input and is trained to predict the original,
uncorrupted data point as its output.
Idea: representation should be robust to introduction of noise.
Produce a corrupted version x̃ of the input x, e.g. by
randomly setting a subset of the inputs to 0, or
adding Gaussian noise.
Modified reconstruction loss: L(x, dec(enc(x̃)))
→ denoising AEs must learn to undo this corruption.
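A sketch of the two corruption variants and the modified loss, assuming PyTorch and an autoencoder ae with enc/dec modules as in the earlier sketch (noise level and masking probability are illustrative):

import torch

def corrupt(x, mode="gauss", sigma=0.3, p_drop=0.3):
    if mode == "gauss":
        return x + sigma * torch.randn_like(x)    # additive Gaussian noise
    mask = (torch.rand_like(x) > p_drop).float()  # drop each input with prob p_drop
    return x * mask                               # random subset of inputs set to 0

x = torch.rand(16, 784)
x_tilde = corrupt(x, mode="gauss")
# the reconstruction is compared to the *clean* input x, not to x_tilde
loss = torch.nn.functional.mse_loss(ae.dec(ae.enc(x_tilde)), x)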
Deep Learning – 7 / 20
DENOISING AUTOENCODERS (DAE)
With the corruption process, we induce stochasticity into the DAE.
Formally: let C(x̃|x) represent the conditional distribution of
corrupted samples x̃, given a data sample x.
Just as feedforward NNs can model a distribution over targets p(y|x),
the output units and loss function of an AE can be chosen such that
one gets a stochastic decoder pdecoder(x|z).
E.g. linear output units parametrize the mean of a Gaussian
distribution for real-valued x, and the negative log-likelihood loss
is then equivalent to MSE.
The DAE then learns a reconstruction distribution preconstruct(x|x̃)
from training pairs (x, x̃).
(Note that the encoder could also be made stochastic, modelling
pencoder(z|x̃).)
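To see why the negative log-likelihood of such a Gaussian decoder reduces to MSE (a standard derivation, shown here with unit variance for simplicity):

-\log p_{\text{decoder}}(x \mid z) = -\log \mathcal{N}\big(x \mid \mu(z), I\big) = \tfrac{1}{2}\,\lVert x - \mu(z)\rVert_2^2 + \text{const},

so minimizing the negative log-likelihood with respect to the decoder mean \mu(z) is, up to constants, the same as minimizing the squared reconstruction error.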
Deep Learning – 8 / 20
DENOISING AUTOENCODERS (DAE)
The general structure of a DAE as a computational graph:
Deep Learning – 9 / 20
DENOISING AUTOENCODERS (DAE)
Deep Learning – 10 / 20
DENOISING AUTOENCODERS (DAE)
Deep Learning – 11 / 20
DENOISING AUTOENCODERS (DAE)
Deep Learning – 12 / 20
DENOISING AUTOENCODERS (DAE)
An example of a vector field learned by a DAE.
Deep Learning – 13 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
We will now corrupt the MNIST data with Gaussian noise and then
try to denoise it as well as possible.
Deep Learning – 14 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: Top row: original data, bottom row: corrupted MNIST data.
Deep Learning – 15 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 1568 (overcomplete).
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 784 (= dim(x)).
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 256.
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 64.
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 32.
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 16.
Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE
Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 8.
Deep Learning – 16 / 20
Contractive Autoencoder
Deep Learning – 17 / 20
CONTRACTIVE AUTOENCODER
Goal: For very similar inputs, the learned encoding should also be
very similar.
We can train our model to achieve this by requiring that the
derivatives of the hidden layer activations with respect to the input
are small.
In other words: The encoded state enc (x) should not change
much for small changes in the input.
Add explicit regularization term to the reconstruction loss:
L(x, dec(enc(x))) + λ ‖∂ enc(x)/∂x‖²_F
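One way to implement this penalty, assuming PyTorch and an autoencoder ae with enc/dec modules as in the earlier sketches: compute the Jacobian of the encoder per sample with torch.autograd.functional.jacobian and add its squared Frobenius norm to the loss. This exact version is only practical for small encoders; cheaper approximations are common.

import torch
from torch.autograd.functional import jacobian

def contractive_loss(ae, x, lam=1e-3):
    rec = torch.nn.functional.mse_loss(ae.dec(ae.enc(x)), x)
    penalty = 0.0
    for xi in x:                                   # one sample at a time
        # J has shape (1, dim_z, 1, dim_x); create_graph=True keeps it differentiable
        J = jacobian(ae.enc, xi.unsqueeze(0), create_graph=True)
        penalty = penalty + (J ** 2).sum()         # squared Frobenius norm
    return rec + lam * penalty / x.shape[0]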
Deep Learning – 18 / 20
DAE VS. CAE
DAE: the decoder function is trained to resist infinitesimal
perturbations of the input.
CAE: the encoder function is trained to resist infinitesimal
perturbations of the input.
Deep Learning – 19 / 20
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Everything you wanted to know about Deep Learning for Computer Vision but
were afraid to ask (2017)
SIBGRAPI Tutorials 2017
Deep Learning – 20 / 20
Deep Learning
Learning goals
convolutional AEs
applications of AEs
CONVOLUTIONAL AUTOENCODER (CONVAE)
Deep Learning – 1 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)
Deep Learning – 2 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)
Figure: Top row: noisy data, second row: AE with dim(z) = 32 (roughly 50k
params), third row: ConvAE (roughly 25k params), fourth row: ground truth.
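A rough ConvAE sketch for 28 × 28 grayscale inputs, assuming PyTorch (channel counts and layer choices are illustrative, not the architecture behind the figure):

import torch
import torch.nn as nn

conv_ae = nn.Sequential(
    # encoder: 1x28x28 -> 8x14x14 -> 16x7x7
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # decoder: transposed convolutions mirror the encoder back to 1x28x28
    nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.Sigmoid(),
)

x = torch.rand(4, 1, 28, 28)
print(conv_ae(x).shape)   # torch.Size([4, 1, 28, 28])

Weight sharing in the convolutions is consistent with the ConvAE in the figure needing roughly half the parameters of the fully connected AE.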
Deep Learning – 3 / 6
REAL-WORLD APPLICATIONS
Today, autoencoders are still used for tasks such as:
data de-noising,
compression,
and dimensionality reduction for the purpose of visualization.
Deep Learning – 4 / 6
REAL-WORLD APPLICATIONS
Medical image denoising using convolutional denoising autoencoders
Figure: Top row: real image, second row: noisy version, third row: results of
a (convolutional) denoising autoencoder, and fourth row: results of a median
filter (Lovedeep Gondara, 2016).
Deep Learning – 5 / 6
REAL-WORLD APPLICATIONS
AE-based image compression.
Deep Learning – 6 / 6
Deep Learning
Manifold learning
Learning goals
manifold hypothesis
manifold learning with AEs
Manifold hypothesis: Data of interest lies on an embedded
non-linear manifold within the higher-dimensional space.
A manifold:
is a topological space that locally resembles Euclidean
space.
in ML, more loosely refers to a connected set of points that
can be approximated well by considering only a small number
of dimensions.
Deep Learning – 1 / 6
An important characterization of a manifold is the set of its tangent
planes.
Definition: At a point x on a d-dimensional manifold, the tangent
plane is given by d basis vectors that span the local directions of
variation allowed on the manifold.
Deep Learning – 2 / 6
The manifold hypothesis does not need to hold true.
In the context of AI tasks (e.g. processing images, sound, or text) it
seems to be at least approximately correct, since:
probability distributions over images, text strings, and sounds
that occur in real life are highly concentrated (randomly
sampled pixel values do not look like images, randomly
sampling letters is unlikely to result in a meaningful sentence).
samples are connected to each other by other samples, with
each sample surrounded by other highly similar samples that
can be reached by applying transformations (e.g. for images:
dim or brighten the lights, move or rotate objects, change the
colors of objects, etc.).
Deep Learning – 3 / 6
LEARNING MANIFOLDS WITH AES
Deep Learning – 4 / 6
LEARNING MANIFOLDS WITH AES
Only the variations tangent to the manifold around x need to
correspond to changes in z = enc (x). Hence the encoder learns a
mapping from the input space to a representation space that is
only sensitive to changes along the manifold directions, but that is
insensitive to changes orthogonal to the manifold.
Deep Learning – 5 / 6
LEARNING MANIFOLDS WITH AES
Common setting: a representation (embedding) for the points on
the manifold is learned.
Two different approaches:
1 Non-parametric methods: learn an embedding for each
training example.
2 Learning a more general mapping for any point in the input
space.
AI problems can have very complicated structures that can be
difficult to capture from only local interpolation.
⇒ Motivates use of distributed representations and deep
learning!
Deep Learning – 6 / 6
Deep Learning
Learning goals
learning a generative model
examples of generative models
WHICH FACE IS FAKE?
Deep Learning – 1 / 10
DEEP UNSUPERVISED LEARNING
There are two main goals of deep unsupervised learning:
Representation Learning
Examples are: manifold learning, feature learning, etc.
Can be done by an autoencoder
Examples of applications:
dimensionality reduction / data compression
transfer learning / semi-supervised learning
Generative Models
Given a training set D = (x(1) , . . . , x(n) ) where each
x(i ) ∼ Px , the goal is to estimate Px .
Goal: Take as input training samples from some distribution
and learn a model that represents that distribution!
Examples of applications:
generating music, videos, volumetric models for 3D
printing, synthetic data for learning algorithms, outlier
identification, image denoising, inpainting, etc.
Deep Learning – 2 / 10
DENSITY FITTING / LEARNING A GENERATIVE
MODEL
Given D = x(1) , x(2) , . . . , x(n) ∼ Px learn a model of Px (for
Deep Learning – 3 / 10
WHY GENERATIVE MODELS?
Generative models are capable of uncovering underlying latent variables
in a dataset and can be used for
sampling / data generation
outlier detection
missing feature extraction
image denoising / reconstruction
representation learning
planning in reinforcement learning
...
Deep Learning – 4 / 10
APPLICATION EXAMPLE: IMAGE GENERATION
Deep Learning – 5 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)
Deep Learning – 6 / 10
APPLICATION EXAMPLE: IMAGE INPAINTING
Figure: A generative model fills in the missing portion of the image based on
the surrounding context.
Deep Learning – 7 / 10
APPLICATION EXAMPLE: SEMANTIC LABELS → IMAGES
Deep Learning – 8 / 10
APPLICATION EXAMPLE: GENERATING IMAGES
FROM TEXT
Deep Learning – 9 / 10
REFERENCES
Ugur Demir, Gozde Unal (2018)
Patch-Based Image Inpainting with Generative Adversarial Networks
https://arxiv.org/abs/1803.07422
Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen (2018)
Progressive Growing of GANs for Improved Quality, Stability, and Variation
https://arxiv.org/abs/1710.10196
Leon A. Gatys et al. (2015)
Neural Algorithm of Artistic Style
https://arxiv.org/abs/1508.06576
Deep Learning – 10 / 10
Deep Learning
Learning goals
probabilistic graphical models
latent variables
directed graphical models
Probabilistic graphical models
Deep Learning – 1 / 10
GRAPHICAL MODELS
Deep Learning – 2 / 10
WHY AGAIN GRAPHICAL MODELS?
1 Graphical models visualize the
structure of a probabilistic model; they
help to develop, understand and
motivate probabilistic models.
Deep Learning – 3 / 10
GRAPHICAL MODELS: EXAMPLE
Deep Learning – 5 / 10
LATENT VARIABLES: MOTIVATION
Figure: A simple illustration of the relevance of latent variables. Here, six 200 × 200
pixel images are shown where each pixel is either black or white. Naively, the
probability distribution over the space of all such images would need 2^40000 − 1
parameters to fully specify. However, we see that the images have three main factors of
variation: object type (shape), position and size. This suggests that the actual number
of parameters required might be significantly fewer.
Deep Learning – 6 / 10
LATENT VARIABLES
Figure: ’Object’, ’position’ and ’size’ are the latent variables behind an image.
Deep Learning – 7 / 10
Directed generative models
Deep Learning – 8 / 10
DIRECTED GENERATIVE MODELS
Goal: Learn to generate x from some latent variables z
pθ(x) = ∫ pθ(x, z) dz = ∫ pθ(x|z) pθ(z) dz
Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape
Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007
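A naive Monte Carlo sketch of this marginalization, assuming NumPy, a standard normal prior over z and a unit-variance Gaussian decoder whose mean is given by a toy decode function (all of these choices are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def decode(z):
    # toy "decoder": a fixed linear map from z to the mean of p(x|z)
    W = np.array([[1.0, 0.5], [-0.5, 1.0]])
    return z @ W.T

def log_gauss(x, mu):
    d = x.shape[-1]
    return -0.5 * np.sum((x - mu) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def estimate_px(x, n_samples=10_000):
    z = rng.standard_normal(size=(n_samples, 2))       # z ~ p(z) = N(0, I)
    return np.mean(np.exp(log_gauss(x, decode(z))))    # p(x) ≈ mean of p(x|z_k)

x = np.array([0.3, -0.2])
print(estimate_px(x))

For realistic, high-dimensional x this estimator would need an enormous number of samples, which is one way to see why the integral is treated as intractable.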
Deep Learning – 9 / 10
DIRECTED GENERATIVE MODELS
The latent variables z must be learned from the data (which only
contains the observed variables x).
The posterior is given by pθ(z|x) = pθ(x|z) pθ(z) / pθ(x).
But pθ(x) = ∫ pθ(x|z) pθ(z) dz is intractable and common
algorithms (such as Expectation Maximization) cannot be used.
The classic DAG problem: How do we efficiently learn pθ(z|x)?
Deep Learning – 10 / 10
Learning goals
architecture of a GAN
minimax loss
training a GAN
WHAT IS A GAN?
Deep Learning – 1 / 39
WHAT IS A GAN?
Deep Learning – 2 / 39
FAKE CURRENCY ILLUSTRATION
The generative model can be thought of as analogous to a team of
counterfeiters, trying to produce fake currency and use it without
detection, while the discriminative model is analogous to the police,
trying to detect the counterfeit currency. Competition in this game
drives both teams to improve their methods until the counterfeits are
indistinguishable from the genuine articles.
-Ian Goodfellow
Deep Learning – 3 / 39
GAN Training
Deep Learning – 4 / 39
MINIMAX LOSS FOR GANS
Deep Learning – 5 / 39
MINIMAX LOSS FOR GANS
G(z) is the output of the generator for a given state z of the latent
variables.
Deep Learning – 6 / 39
MINIMAX LOSS FOR GANS
The generator only has control over D (G(z)) and tries to push that
toward 1 with each gradient update. This is the same as
minimizing V(D,G).
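For reference, the value function V(D, G) referred to above is the minimax objective from the cited Goodfellow et al. (2014) paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

D is trained to push D(x) towards 1 on real data and D(G(z)) towards 0 on generated data, while G works against the second term.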
Deep Learning – 7 / 39
GAN TRAINING: PSEUDOCODE
Deep Learning – 8 / 39
GAN TRAINING: ILLUSTRATION
GANs are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between
samples from the data generating distribution px (black, dotted line) and those of the generative distribution pg (G) (green, solid
line). Source: Goodfellow et al. (2017)
For k steps, G’s parameters are frozen and one performs gradient
ascent on D to increase its accuracy.
Finally, D’s parameters are frozen and one performs gradient
descent on G to increase its generation performance.
Note that G gets to peek at D’s internals (via the
back-propagated errors), but D does not get to peek at G.
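A compact sketch of this alternating procedure, assuming PyTorch and externally defined generator G, discriminator D (outputting probabilities, e.g. ending in a sigmoid) and their optimizers; k, the batch handling and the non-saturating generator loss are illustrative choices:

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, dim_z, k=1):
    batch = real.shape[0]

    # 1) k discriminator updates; G is effectively frozen via detach()
    for _ in range(k):
        z = torch.randn(batch, dim_z)
        d_real = D(real)
        d_fake = D(G(z).detach())
        loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) one generator update; gradients flow through D, but only G is updated
    z = torch.randn(batch, dim_z)
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # non-saturating heuristic
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

The backward pass for loss_G runs through D, which is the sense in which G "peeks" at D's internals.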
Deep Learning – 9 / 39
DIVERGENCE MEASURES
All such measures are non-negative and equal to 0 if and only if the
two distributions are identical.
Deep Learning – 10 / 39
DIVERGENCE MEASURES
If our generator has the capacity to model pdata (x) perfectly, the choice of
divergence does not matter much because they all achieve their
minimum (that is 0) when pg (x) = pdata (x).
Deep Learning – 11 / 39
IMPLICIT DIVERGENCE MEASURE OF GANS
JS(pdata || pg) = (1/2) KL(pdata || (pdata + pg)/2) + (1/2) KL(pg || (pdata + pg)/2)
KL(pdata || pg) = Ex∼pdata(x) [log(pdata(x)/pg(x))]
Deep Learning – 12 / 39
OPTIMAL DISCRIMINATOR
For G fixed, the optimal discriminator D∗G is:
Deep Learning – 13 / 39
OPTIMAL DISCRIMINATOR
For G fixed, the optimal discriminator D∗G is:
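The closed form behind the figure, as derived in the cited Goodfellow et al. (2014) paper:

D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}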
Deep Learning – 15 / 39
ADVERSARIAL TRAINING
Deep Learning – 16 / 39
ADVERSARIAL TRAINING -EXAMPLE
∂LA/∂x = y,   ∂LB/∂y = −x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.
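A tiny simulation of these simultaneous updates, assuming NumPy and the losses LA = x · y (player A controls x) and LB = −x · y (player B controls y), matching the gradients above:

import numpy as np

x, y, alpha = 1.0, 0.0, 0.1
trajectory = []
for _ in range(200):
    grad_x = y                 # dLA/dx for LA =  x*y
    grad_y = -x                # dLB/dy for LB = -x*y
    # simultaneous update: both players use the old values of (x, y)
    x, y = x - alpha * grad_x, y - alpha * grad_y
    trajectory.append((x, y))

radii = [np.hypot(px, py) for px, py in trajectory]
print(radii[0], radii[-1])     # the distance to (0, 0) keeps growing: no convergence

With an infinitesimal step size the trajectory is a circle around the equilibrium (0, 0); with a finite step size it slowly spirals outwards, which previews the convergence problems discussed on the following slides.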
Deep Learning – 18 / 39
POSSIBLE BEHAVIOUR #1: CONVERGENCE
Deep Learning – 19 / 39
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR
Deep Learning – 20 / 39
POSSIBLE BEHAVIOUR #3: CYCLES
Credit: Goodfellow
Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.
Deep Learning – 21 / 39
NON-STATIONARY LOSS SURFACE
Deep Learning – 22 / 39
ILLUSTRATION OF CONVERGENCE
Deep Learning – 23 / 39
ILLUSTRATION OF CONVERGENCE: FINAL STEP
Deep Learning – 24 / 39
CHALLENGES FOR GAN TRAINING
Deep Learning – 25 / 39
GAN variants
Deep Learning – 26 / 39
NON-SATURATING LOSS
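As a reminder of the idea behind this heuristic (following the cited NIPS 2016 tutorial): with the minimax loss the generator minimizes log(1 − D(G(z))), whose gradient nearly vanishes early in training when D confidently rejects the fakes; the non-saturating variant instead lets the generator minimize

L_G^{\text{non-sat}} = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big],

which has the same fixed point but much stronger gradients in that regime.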
Deep Learning – 29 / 39
ARCHITECTURE-VARIANT GANS
Credit: hindupuravinash
Deep Learning – 30 / 39
GAN APPLICATION
What kinds of problems can GANs address?
Generation
Conditional Generation
Clustering
Semi-supervised Learning
Representation Learning
Translation
Any traditional discriminative task can be approached with
generative models
Deep Learning – 31 / 39
CONDITIONAL GANS: MOTIVATION
In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.
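A minimal sketch of such a conditional generator, assuming PyTorch and a class label as the conditioning variable (the embedding size, layer sizes and output dimension are illustrative):

import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, dim_z=64, n_classes=10, dim_x=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 32)     # learned label embedding
        self.net = nn.Sequential(
            nn.Linear(dim_z + 32, 256), nn.ReLU(),
            nn.Linear(256, dim_x), nn.Tanh(),
        )

    def forward(self, z, y):
        # condition on y by concatenating z with the label embedding
        return self.net(torch.cat([z, self.embed(y)], dim=1))

G = CondGenerator()
z = torch.randn(8, 64)
y = torch.randint(0, 10, (8,))
print(G(z, y).shape)   # torch.Size([8, 784])

The discriminator is conditioned in the same way, so that it judges whether an image is a realistic example of the given label.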
Deep Learning – 32 / 39
CONDITIONAL GANS: ARCHITECTURE
Deep Learning – 33 / 39
CONDITIONAL GANS: EXAMPLE
Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note: z is implicit. It is not shown above.)
Deep Learning – 34 / 39
CONDITIONAL GANS: MORE EXAMPLES
Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 35 / 39
MORE GENERATIVE MODELS
Deep Learning – 36 / 39
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Deep Learning – 37 / 39
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784
Deep Learning – 38 / 39
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them
Deep Learning – 39 / 39
Deep Learning
GAN variants
Learning goals
non-saturating loss
conditional GANs
NON-SATURATING LOSS
Deep Learning – 3 / 12
ARCHITECTURE-VARIANT GANS
Motivated by the different challenges in the GAN training procedure
described above, several types of architecture variants have been proposed.
Understanding and improving GAN training is a very active area of
research.
Credit: hindupuravinash
Deep Learning – 4 / 12
CONDITIONAL GANS: MOTIVATION
In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.
Deep Learning – 5 / 12
CONDITIONAL GANS: ARCHITECTURE
Deep Learning – 6 / 12
CONDITIONAL GANS: EXAMPLE
Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note: z is implicit. It is not shown above.)
Deep Learning – 7 / 12
CONDITIONAL GANS: MORE EXAMPLES
Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 8 / 12
MORE GENERATIVE MODELS
Deep Learning – 9 / 12
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Deep Learning – 10 / 12
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784
Deep Learning – 11 / 12
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them
Deep Learning – 12 / 12
Deep Learning
Learning goals
(no) convergence to a fixed point
problems of adversarial setting
ADVERSARIAL TRAINING
Deep Learning – 1 / 11
ADVERSARIAL TRAINING -EXAMPLE
∂LA/∂x = y,   ∂LB/∂y = −x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.
Deep Learning – 3 / 11
POSSIBLE BEHAVIOUR #1: CONVERGENCE
Deep Learning – 4 / 11
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR
Deep Learning – 5 / 11
POSSIBLE BEHAVIOUR #3: CYCLES
Credit: Goodfellow
Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.
Deep Learning – 6 / 11
NON-STATIONARY LOSS SURFACE
Deep Learning – 7 / 11
ILLUSTRATION OF CONVERGENCE
Deep Learning – 8 / 11
ILLUSTRATION OF CONVERGENCE: FINAL STEP
Deep Learning – 9 / 11
CHALLENGES FOR GAN TRAINING
Deep Learning – 10 / 11
REFERENCES
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Deep Learning – 11 / 11